# Action Plan

1. Create Validation and Sample sets
2. Rearrange image files into their respective directories
3. Finetune and Train model
4. Generate predictions
5. Validate predictions
6. Submit predictions to Kaggle



## Create Validation and Sample Sets

In [1]:
#Verify we are in the nbs directory
%pwd

u'/home/ubuntu/deeplearning'

In [2]:
#Create references to some commonly referred directories
import os, sys
current_dir = os.getcwd()
NBS_HOME_DIR = current_dir
DATA_HOME_DIR = current_dir + '/data/redux'

In [3]:
#Import modules
from utils import *
from vgg16 import Vgg16

#Setup inline plots
%matplotlib inline

 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using Theano backend.


In [4]:
# Create directories
%cd $DATA_HOME_DIR
%mkdir valid
%mkdir results
%mkdir -p sample/train
%mkdir -p sample/test
%mkdir -p sample/valid
%mkdir -p sample/results
%mkdir -p test/unknown

/home/ubuntu/deeplearning/data/redux
mkdir: cannot create directory ‘valid’: File exists
mkdir: cannot create directory ‘results’: File exists


In [5]:
%cd $DATA_HOME_DIR/train

/home/ubuntu/deeplearning/data/redux/train


In [6]:
#Separate a validation set by moving 2000 random training images into valid/
g = glob('*.jpg')
shuf = np.random.permutation(g)
for i in range(2000): os.rename(shuf[i], DATA_HOME_DIR+'/valid/'+shuf[i])

In [7]:
from shutil import copyfile

In [8]:
g = glob('*.jpg')
shuf = np.random.permutation(g)
for i in range(200): copyfile(shuf[i], DATA_HOME_DIR+'/sample/train/'+shuf[i])

In [9]:
%cd $DATA_HOME_DIR/valid

/home/ubuntu/deeplearning/data/redux/valid


In [10]:
g = glob('*.jpg')
shuf = np.random.permutation(g)
for i in range(50): copyfile(shuf[i], DATA_HOME_DIR+'/sample/valid/'+shuf[i])
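
The split above depends on the random state at the time the cell runs, so re-running it yields a different validation set. A minimal sketch of a repeatable version follows. The helper name `move_random_subset` and its `seed` parameter are hypothetical (not part of `utils`), and the demo runs on empty placeholder files in a temporary directory rather than the real images.

```python
import os
import random
import shutil
import tempfile

def move_random_subset(src_dir, dst_dir, n, seed=0):
    """Move n randomly chosen .jpg files from src_dir to dst_dir (hypothetical helper)."""
    names = sorted(f for f in os.listdir(src_dir) if f.endswith('.jpg'))
    rng = random.Random(seed)  # fixed seed so the same files are chosen every run
    chosen = rng.sample(names, n)
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    for name in chosen:
        shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return chosen

# Demo on throwaway placeholder files: 10 "cats" and 10 "dogs"
root = tempfile.mkdtemp()
train_dir = os.path.join(root, 'train')
valid_dir = os.path.join(root, 'valid')
os.makedirs(train_dir)
for i in range(10):
    for label in ('cat', 'dog'):
        open(os.path.join(train_dir, '%s.%d.jpg' % (label, i)), 'w').close()

moved = move_random_subset(train_dir, valid_dir, n=4)
print("train: %d, valid: %d" % (len(os.listdir(train_dir)), len(os.listdir(valid_dir))))
```

Swapping `shutil.move` for `shutil.copyfile` would give the copying behaviour used for the sample sets.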

## Rearrange image files into respective directories

In [11]:
#Separate cat and dog images into separate directories
%cd $DATA_HOME_DIR/valid
%mkdir cats
%mkdir dogs
%mv cat.*.jpg cats/
%mv dog.*.jpg dogs/

%cd $DATA_HOME_DIR/train
%mkdir cats
%mkdir dogs
%mv cat.*.jpg cats/
%mv dog.*.jpg dogs/

%cd $DATA_HOME_DIR/sample/train
%mkdir cats
%mkdir dogs
%mv cat.*.jpg cats/
%mv dog.*.jpg dogs/

%cd $DATA_HOME_DIR/sample/valid
%mkdir cats
%mkdir dogs
%mv cat.*.jpg cats/
%mv dog.*.jpg dogs/


/home/ubuntu/deeplearning/data/redux/valid
/home/ubuntu/deeplearning/data/redux/train
/home/ubuntu/deeplearning/data/redux/sample/train
/home/ubuntu/deeplearning/data/redux/sample/valid
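
The four blocks of shell commands above repeat the same steps for each directory. As a rough sketch, the same rearrangement could be written once in Python. The function name `sort_into_class_dirs` is hypothetical, and the demo operates on placeholder files in a temporary directory, assuming the `<class>.<id>.jpg` naming used by this dataset.

```python
import os
import shutil
import tempfile

def sort_into_class_dirs(directory, classes=('cat', 'dog')):
    """Move files named '<class>.<id>.jpg' into '<directory>/<class>s/' (hypothetical helper)."""
    for cls in classes:
        target = os.path.join(directory, cls + 's')
        if not os.path.isdir(target):
            os.makedirs(target)
        for name in os.listdir(directory):
            # Only plain image files match; the 'cats' and 'dogs' folders are skipped
            if name.startswith(cls + '.') and name.endswith('.jpg'):
                shutil.move(os.path.join(directory, name), os.path.join(target, name))

# Demo on placeholder files
demo_dir = tempfile.mkdtemp()
for i in range(3):
    open(os.path.join(demo_dir, 'cat.%d.jpg' % i), 'w').close()
    open(os.path.join(demo_dir, 'dog.%d.jpg' % i), 'w').close()

sort_into_class_dirs(demo_dir)
print(sorted(os.listdir(demo_dir)))
```

Keras-style directory iterators generally infer the class from the subdirectory name, which is why each set needs one folder per class.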


In [12]:
# Create single unknown class for test
%cd $DATA_HOME_DIR/test
%mv *jpg unknown/

/home/ubuntu/deeplearning/data/redux/test


## Finetune and Train Model

In [13]:
%cd $DATA_HOME_DIR

#set path to sample path if desired
#path = DATA_HOME_DIR + '/'
path = DATA_HOME_DIR + '/sample/'
test_path = DATA_HOME_DIR + '/test/'
results_path = DATA_HOME_DIR + '/results/'
train_path = path + 'train/'
valid_path = path + 'valid/'

/home/ubuntu/deeplearning/data/redux
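
Building paths by string concatenation makes it easy to drop or double a separator, and later cells append file names directly to `results_path`. A small sketch using `os.path.join` is shown below. The function `build_paths` and the `data_home` argument are hypothetical names introduced for illustration, and the directory value is copied from the cell output above.

```python
import os

def build_paths(data_home, use_sample=True):
    """Return the directories used in this notebook, each ending with a separator."""
    base = os.path.join(data_home, 'sample') if use_sample else data_home
    # Joining with a trailing '' appends the separator, so 'dir' + 'file' stays valid
    return {
        'train': os.path.join(base, 'train', ''),
        'valid': os.path.join(base, 'valid', ''),
        'test': os.path.join(data_home, 'test', ''),
        'results': os.path.join(data_home, 'results', ''),
    }

paths = build_paths('/home/ubuntu/deeplearning/data/redux')
print(paths['train'])
print(paths['results'] + 'ft0.h5')
```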


In [14]:
#Instantiate the Vgg16 helper class
vgg = Vgg16()

In [15]:
#Set constants
batch_size = 64
no_of_epochs = 3

In [16]:
#Finetune the Model
batches = vgg.get_batches(train_path, batch_size=batch_size)
val_batches = vgg.get_batches(valid_path, batch_size=batch_size*2)
vgg.finetune(batches)

vgg.model.optimizer.lr = 0.01

Found 200 images belonging to 2 classes.
Found 50 images belonging to 2 classes.


In [17]:
#Notice we are passing in the validation dataset to the fit() method
#For each epoch we test our model against the validation set
latest_weights_filename = None
for epoch in range(no_of_epochs):
    print "Running epoch: %d" % epoch
    vgg.fit(batches, val_batches, nb_epoch=1)
    latest_weights_filename = 'ft%d.h5' % epoch
    vgg.model.save_weights(results_path+latest_weights_filename)
print "Completed %s fit operations" % no_of_epochs

Running epoch: 0
Epoch 1/1
Running epoch: 1
Epoch 1/1
Running epoch: 2
Epoch 1/1
Completed 3 fit operations


## Generate Predictions

Let's use the new model to make predictions on the test data set.

In [None]:
batches, preds = vgg.test(test_path, batch_size=batch_size*2)

Found 12500 images belonging to 1 classes.


In [None]:
print preds[:5]

filenames = batches.filenames
print filenames[:5]

In [None]:
#You can verify the column ordering by viewing some images
from PIL import Image
Image.open(test_path + filenames[2])

In [None]:
#Save our test results arrays so we can use them again later
save_array(results_path + 'test_preds.dat', preds)
save_array(results_path + 'filenames.dat', filenames)

## Validate Predictions

As well as looking at the overall metrics, it's also a good idea to look at examples of each of:

    1. A few correct labels at random
    2. A few incorrect labels at random
    3. The most correct labels of each class (i.e. those with the highest probability that are correct)
    4. The most incorrect labels of each class (i.e. those with the highest probability that are incorrect)
    5. The most uncertain labels (i.e. those with probability closest to 0.5).

Let's see what we can learn from these examples. (In general, this is a particularly useful technique for debugging problems in the model. However, since this model is so simple, there may not be too much to learn at this stage.)

Calculate predictions on validation set, so we can find correct and incorrect examples:


In [None]:
vgg.model.load_weights(results_path+latest_weights_filename)

In [None]:
val_batches, probs = vgg.test(valid_path, batch_size=batch_size)

In [None]:
filenames = val_batches.filenames
expected_labels = val_batches.classes #0 or 1

#Round our predictions to 0/1 to generate labels
our_predictions = probs[:,0]
our_labels = np.round(1-our_predictions)
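
The five categories listed earlier can be selected with a few NumPy index operations. The sketch below uses small made-up arrays in place of the real ones: `probs_cat` is a hypothetical stand-in for `probs[:,0]`, and it assumes, as the rounding above does, that column 0 is the cat probability and that label 0 means cat.

```python
import numpy as np

# Made-up stand-ins for the notebook's arrays
expected_labels = np.array([0, 0, 1, 1, 0, 1])               # 0 = cat, 1 = dog
probs_cat = np.array([0.99, 0.40, 0.02, 0.55, 0.80, 0.48])   # P(cat) per image
our_labels = np.round(1 - probs_cat)

# 1 and 2: indices of correct and incorrect predictions (sample from these at random)
correct = np.where(our_labels == expected_labels)[0]
incorrect = np.where(our_labels != expected_labels)[0]

# 3: correctly labelled cats, most confident first
correct_cats = np.where((our_labels == 0) & (expected_labels == 0))[0]
most_correct_cats = correct_cats[np.argsort(probs_cat[correct_cats])[::-1]]

# 4: images labelled cat that are really dogs, most confident first
wrong_cats = np.where((our_labels == 0) & (expected_labels == 1))[0]
most_incorrect_cats = wrong_cats[np.argsort(probs_cat[wrong_cats])[::-1]]

# 5: predictions closest to 0.5
most_uncertain = np.argsort(np.abs(probs_cat - 0.5))[:3]

print("correct: %s, incorrect: %s" % (correct, incorrect))
print("most uncertain: %s" % most_uncertain)
```

The dog versions of steps 3 and 4 follow the same pattern with the labels swapped and the sort order reversed.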

## Submit Predictions to Kaggle!

Here's the format Kaggle requires for new submissions:
```
imageId,isDog
1242, .3984
3947, .1000
4539, .9082
2345, .0000
```

Kaggle wants the imageId followed by the probability of the image being a dog. Kaggle uses a metric called [Log Loss](http://wiki.fast.ai/index.php/Log_Loss) to evaluate your submission.

In [None]:
#Load our test predictions from file
preds = load_array(results_path + 'test_preds.dat')
filenames = load_array(results_path + 'filenames.dat')

In [None]:
#Grab the dog prediction column
isdog = preds[:,1]
print "Raw Predictions: " + str(isdog[:5])
print "Mid Predictions: " + str(isdog[(isdog < .6) & (isdog > .4)])
print "Edge Predictions: " + str(isdog[(isdog == 1) | (isdog == 0)])

Log loss is undefined for probability values of exactly 0 or 1, since it would require taking the log of 0 (and we have many such values). Fortunately, Kaggle helps us by offsetting our 0s and 1s by a very small value. So if we upload our submission now we will have lots of .99999999 and .000000001 values. This seems good, right?

Not so. There is an additional twist due to how log loss is calculated--log loss rewards predictions that are confident and correct (p=.9999,label=1), but it punishes predictions that are confident and wrong far more (p=.0001,label=1). See visualization below.
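
To put numbers on that asymmetry, here is a minimal sketch of the per-prediction binary log loss, -(y·log(p) + (1-y)·log(1-p)). The function name `binary_log_loss` is made up for this example; it is not the function Kaggle or scikit-learn uses internally.

```python
import math

def binary_log_loss(y, p):
    """Log loss of one prediction: y is the true label (0 or 1), p is the predicted P(label = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# True label is 1 in every case; only the predicted probability changes
for p in (0.9999, 0.95, 0.5, 0.05, 0.0001):
    print("p=%.4f  loss=%.4f" % (p, binary_log_loss(1, p)))
```

Moving from p=.95 to p=.9999 when correct lowers the loss by only about 0.05, while moving from p=.05 to p=.0001 when wrong raises it by roughly 6.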


In [None]:
#Visualize Log Loss when True value = 1
#y-axis is log loss, x-axis is probability that label = 1
#As you can see, log loss increases rapidly as the predicted probability approaches 0
#but only falls slowly toward 0 as the predicted probability gets closer to 1
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import log_loss

x = [i*.0001 for i in range(1,10000)]
y = [log_loss([1],[[i*.0001,1-(i*.0001)]],eps=1e-15) for i in range(1,10000,1)]

plt.plot(x, y)
plt.axis([-.05, 1.1, -.8, 10])
plt.title("Log Loss when true label = 1")
plt.xlabel("predicted probability")
plt.ylabel("log loss")

plt.show()

In [None]:
#So to play it safe, we use a sneaky trick and pull our edge predictions away from 0 and 1
#Everything above .95 becomes .95 and everything below .05 becomes .05
isdog = isdog.clip(min=0.05, max=0.95)
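
A rough illustration of why the clipping pays off: the sketch below scores a made-up set of 1000 fully confident predictions, 2% of which are wrong, before and after clipping. The data and the helper `mean_log_loss` are hypothetical, and the `eps` offset only mimics the small adjustment described above; the real gain depends on how accurate the model actually is.

```python
import numpy as np

def mean_log_loss(y_true, p_dog, eps=1e-15):
    """Average binary log loss, offsetting exact 0s and 1s by eps."""
    p = np.clip(p_dog, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

# Made-up data: 500 dogs, 500 cats, every prediction exactly 0 or 1
y_true = np.array([1] * 500 + [0] * 500)
p_dog = y_true.astype(float)
p_dog[:10] = 0.0    # 10 dogs confidently predicted as cats
p_dog[-10:] = 1.0   # 10 cats confidently predicted as dogs

raw = mean_log_loss(y_true, p_dog)
clipped = mean_log_loss(y_true, p_dog.clip(min=0.05, max=0.95))
print("raw: %.4f  clipped: %.4f" % (raw, clipped))
```

With these assumptions the clipped predictions score several times better, because the few confident mistakes no longer dominate the average.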

In [None]:
#Extract imageIds from the filenames in our test/unknown directory 
filenames = batches.filenames
ids = np.array([int(f[8:f.find('.')]) for f in filenames])
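
The slice `f[8:...]` relies on the directory prefix `unknown/` being exactly eight characters long. A sketch of an alternative that does not depend on the folder name is shown below; `image_id` is a hypothetical helper, and the file names are invented examples in the same `unknown/<id>.jpg` form.

```python
import os

def image_id(filename):
    """Return the numeric id from a path such as 'unknown/1242.jpg'."""
    base = os.path.basename(filename)       # '1242.jpg'
    return int(os.path.splitext(base)[0])   # 1242

example_filenames = ['unknown/1242.jpg', 'unknown/7.jpg', 'unknown/12500.jpg']
example_ids = [image_id(f) for f in example_filenames]
print(example_ids)
```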

In [None]:
subm = np.stack([ids,isdog], axis=1)
subm[:5]

In [None]:
%cd $DATA_HOME_DIR
submission_file_name = 'submission1.csv'
np.savetxt(submission_file_name, subm, fmt='%d,%.5f', header='id,label', comments='')
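
Before uploading, it can help to confirm what `np.savetxt` writes with this `fmt` and `header`. The sketch below (written for Python 3) writes three made-up rows to an in-memory buffer instead of a file; the ids and probabilities are invented for illustration.

```python
import io
import numpy as np

# Invented example rows: image ids and clipped dog probabilities
ids = np.array([1, 2, 3])
isdog = np.array([0.95, 0.05, 0.3984])
subm = np.stack([ids, isdog], axis=1)

buf = io.StringIO()
# comments='' stops savetxt from prefixing the header line with '# '
np.savetxt(buf, subm, fmt='%d,%.5f', header='id,label', comments='')
lines = buf.getvalue().splitlines()
print("\n".join(lines))
```

Stacking the integer ids with the float probabilities produces a float array, which is why the `%d` in `fmt` is needed to write the ids without a decimal point.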

In [None]:
from IPython.display import FileLink
%cd $NBS_HOME_DIR
FileLink('data/redux/'+submission_file_name)

You can download this file and submit it on the Kaggle website, or use the Kaggle command line tool's "submit" method.