# Adding a dense layer to VGG

The last layer of Vgg16 outputs a vector of 1000 scores, one per category of the ImageNet competition. Some of these categories certainly correspond to cats and dogs, but at a much more granular level (specific breeds).

We will simply add a Dense layer on top of the ImageNet layer and train the model to map the ImageNet classifications of input images of cats and dogs to cat and dog labels.

Note that this is not what we did in the very first lecture!

Have a look at [CS231n: Linear Classification](http://cs231n.github.io/linear-classify/) for more details, and especially at [CS231n: Softmax classifier](http://cs231n.github.io/linear-classify/#softmax) if you had trouble with logistic regression.

## 1. Preparations

In [None]:
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import os
import torch
import torch.nn as nn
import torchvision
from torchvision import models,transforms,datasets
import bcolz
import time
%matplotlib inline

We precomputed the outputs of the Vgg16 model on our dataset (with Colab) and stored these values.

In [None]:
use_gpu = torch.cuda.is_available()
print('Using gpu: %s ' % use_gpu)

dtype = torch.FloatTensor
if use_gpu:
    dtype = torch.cuda.FloatTensor

In [None]:
def load_array(fname):
    return bcolz.open(fname)[:]

In [None]:
#where you stored your features
data_dir_colab = '/home/lelarge/courses/data/colab/'

In [None]:
feat_train = load_array(os.path.join(data_dir_colab,'vgg16','feat_train.bc'))
lbs_train = load_array(os.path.join(data_dir_colab,'vgg16','lbs_train.bc'))
feat_val = load_array(os.path.join(data_dir_colab,'vgg16','feat_val.bc'))
lbs_val = load_array(os.path.join(data_dir_colab,'vgg16','lbs_val.bc'))

## 2. Linear model for VGG16 features

We are now ready to define our linear model.

For more details, see the [cross entropy cost function](http://neuralnetworksanddeeplearning.com/chap3.html#the_cross-entropy_cost_function).

In [None]:
lm = torch.nn.Sequential(
    torch.nn.Linear(1000, 2),
    torch.nn.LogSoftmax(dim = 1)
)
# reduction='sum' replaces the deprecated size_average=False: the loss is summed over the batch
loss_fn = torch.nn.NLLLoss(reduction='sum')
if use_gpu:
    lm = lm.cuda()
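
As a quick sanity check (not part of the original notebook), the sketch below uses random tensors in place of real Vgg16 outputs to illustrate that `LogSoftmax` followed by `NLLLoss` gives the same value as the cross entropy computed on the raw scores. The names `demo_features`, `demo_labels` and `demo_linear` are made up for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in for a batch of 4 Vgg16 output vectors (1000 values each).
demo_features = torch.randn(4, 1000)
demo_labels = torch.tensor([0, 1, 1, 0])

demo_linear = nn.Linear(1000, 2)
demo_scores = demo_linear(demo_features)

# Log-probabilities over the two classes, one row per sample.
demo_log_probs = nn.LogSoftmax(dim=1)(demo_scores)

# NLLLoss applied to log-probabilities ...
demo_nll = nn.NLLLoss(reduction='sum')(demo_log_probs, demo_labels)
# ... matches CrossEntropyLoss applied directly to the raw scores.
demo_ce = nn.CrossEntropyLoss(reduction='sum')(demo_scores, demo_labels)

print(demo_log_probs.shape, demo_nll.item(), demo_ce.item())
```

Each row of `demo_log_probs` exponentiates to two probabilities that sum to 1, which is what lets us read the outputs as "cat" and "dog" probabilities later on.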

Since our features are currently stacked in a _numpy ndarray_, we need to create a dataset of tensors and then a dataloader.

For the dataset, you can use _torch.from_numpy_, _torch.tensor_ and _zip_.

For the dataloader, you should use _torch.utils.data.DataLoader_.

In [None]:
bs = 128

train_dataset = # your code
test_dataset = # your code
train_loader = torch.utils.data.DataLoader(#your code here)
test_loader = torch.utils.data.DataLoader(#your code here)
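
If the hints above feel abstract, here is one possible way to combine them, shown on small random arrays rather than on our real features. The names `feat_demo`, `lbs_demo`, `demo_dataset` and `demo_loader` are hypothetical, and this is only a sketch under the assumption that the features are `float32` and the labels are integers.

```python
import numpy as np
import torch

rng = np.random.default_rng(0)

# Synthetic stand-ins: 10 feature vectors of size 1000 and 10 binary labels.
feat_demo = rng.standard_normal((10, 1000)).astype(np.float32)
lbs_demo = rng.integers(0, 2, size=10)

# A list of (feature tensor, label tensor) pairs is enough for a DataLoader:
# it only needs __len__ and __getitem__.
demo_dataset = [(torch.from_numpy(f), torch.tensor(l))
                for f, l in zip(feat_demo, lbs_demo)]

demo_loader = torch.utils.data.DataLoader(demo_dataset, batch_size=4, shuffle=True)

# Each iteration yields a batch of stacked features and a batch of labels.
demo_inputs, demo_classes = next(iter(demo_loader))
print(demo_inputs.shape, demo_classes.shape)
```

For a test loader you would typically set `shuffle=False`, so that the order of the predictions matches the order of the labels.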

### 2.1 Training

We define next a holistic training function (```train_model```) that will:
- run for a predefined number of epochs
- fetch training samples randomly during each epoch (all samples are used during an epoch)
- pass samples through the network, compute the error and the gradients, and update the network parameters
- keep and print training statistics: training loss and accuracy


In [None]:
def train_model(model,size,data_loader=None,epochs=1,optimizer=None):
    model.train()
    loss_t = np.zeros(epochs)
    acc_t = np.zeros(epochs)
    for epoch in range(epochs):
        
        running_loss = 0.0
        running_corrects = 0
        for inputs,classes in data_loader:
                            
            #
            # your code
            #
            running_loss += # your code
            running_corrects += # your code
            
        epoch_loss = running_loss / size
        epoch_acc = running_corrects.data.item() / size
        
        loss_t[epoch] = epoch_loss
        acc_t[epoch] = epoch_acc
    print('Loss: {:.4f} Acc: {:.4f}'.format(epoch_loss, epoch_acc))
    return loss_t, acc_t
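
To make the inner steps concrete, the following sketch performs a single update on one random batch, outside of any function. It is one possible shape for the body of the loop, not the reference solution, and every `demo_` name is made up for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-alone copy of the setup, so the real model is left untouched.
demo_model = nn.Sequential(nn.Linear(1000, 2), nn.LogSoftmax(dim=1))
demo_loss_fn = nn.NLLLoss(reduction='sum')
demo_optimizer = torch.optim.SGD(demo_model.parameters(), lr=1e-4)

# One synthetic batch of 8 samples.
demo_inputs = torch.randn(8, 1000)
demo_classes = torch.randint(0, 2, (8,))

# Forward pass and loss (summed over the batch).
demo_outputs = demo_model(demo_inputs)
demo_loss = demo_loss_fn(demo_outputs, demo_classes)

# Backward pass and parameter update.
demo_optimizer.zero_grad()
demo_loss.backward()
demo_optimizer.step()

# Statistics to accumulate over the epoch.
_, demo_preds = torch.max(demo_outputs, 1)
demo_batch_loss = demo_loss.item()
demo_batch_corrects = torch.sum(demo_preds == demo_classes)
print(demo_batch_loss, demo_batch_corrects.item())
```

Because the loss is summed over the batch, adding `demo_batch_loss` for every batch and dividing by the dataset size at the end of the epoch gives the average loss per sample.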

We set our hyperparameters:
- learning rate
- optimizer to be used for gradient descent, here SGD (Stochastic Gradient Descent)

In [None]:
learning_rate = 1e-4
optimizer_lm = torch.optim.SGD(lm.parameters(), lr=learning_rate)
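
To see what the optimizer does with the learning rate, here is a minimal sketch on a toy parameter (`demo_w` is hypothetical and not part of our model). Plain SGD without momentum replaces each parameter `w` by `w - lr * grad`.

```python
import torch

# A toy parameter with two entries and a simple quadratic loss.
demo_w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
demo_opt = torch.optim.SGD([demo_w], lr=0.1)

demo_quadratic = (demo_w ** 2).sum()
demo_quadratic.backward()          # gradient is 2 * w

# What the update rule predicts, computed by hand before the step.
demo_expected = demo_w.detach() - 0.1 * demo_w.grad

demo_opt.step()
print(demo_w.detach(), demo_expected)
```

A larger learning rate takes bigger steps along the negative gradient, and a smaller one makes training slower but more stable.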

In [None]:
dset_sizes = {'train': 23000, 'valid': 2000}

We train our model for 100 epochs

In [None]:
%%time
loss1, acc1 = train_model(model=lm,size=dset_sizes['train'],data_loader = train_loader ,epochs=100,optimizer=optimizer_lm)

We plot the evolution of the training loss across epochs. 

Ideally it should have a steep descent in the first epochs, then decrease smoothly.

In [None]:
plt.plot(loss1)

We plot the evolution of the accuracy of our model on the training data. The overall behavior mirrors that of the loss: a big improvement at the beginning, then smaller improvements as training advances.


In [None]:
plt.plot(acc1)

The __loss__ helps the network to learn and update the parameters according to the criterion that we give to the network.

The __accuracy__, on the other hand, is a performance metric for the task we want to use the network for. In many cases the accuracy cannot be used directly as a loss/criterion function (it is not differentiable with respect to the parameters), so we need to identify or design loss functions that will guide the model towards the behavior we wish to have for our task.
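
A small illustration of the difference, with hand-written probabilities rather than outputs of our model: two sets of predictions can have exactly the same accuracy and very different losses. All names below are made up for this sketch.

```python
import torch
import torch.nn.functional as F

demo_targets = torch.tensor([0, 1])

# Two sets of log-probabilities that pick the right class both times,
# once with high confidence and once with low confidence.
demo_confident = torch.log(torch.tensor([[0.9, 0.1], [0.1, 0.9]]))
demo_hesitant = torch.log(torch.tensor([[0.6, 0.4], [0.4, 0.6]]))

def demo_accuracy(log_probs, targets):
    # Fraction of samples whose most likely class equals the target.
    return (log_probs.argmax(dim=1) == targets).float().mean().item()

acc_confident = demo_accuracy(demo_confident, demo_targets)
acc_hesitant = demo_accuracy(demo_hesitant, demo_targets)
loss_confident = F.nll_loss(demo_confident, demo_targets).item()
loss_hesitant = F.nll_loss(demo_hesitant, demo_targets).item()

print(acc_confident, acc_hesitant, loss_confident, loss_hesitant)
```

The accuracy only changes when a prediction crosses the decision boundary, whereas the loss keeps rewarding more confident correct predictions. This is why the loss can keep decreasing while the accuracy stays flat.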

Next we let the model train for 100 additional epochs

In [None]:
%%time
loss2, acc2 = train_model(model=lm,size=dset_sizes['train'],data_loader =train_loader ,epochs=100,optimizer=optimizer_lm)

Again we plot the loss and accuracy for the current training interval: _epochs[100:200]_.
What changes do you notice? 

In [None]:
plt.plot(loss2)

In [None]:
plt.plot(acc2)

We train the model for 100 more epochs and plot the evolution of our training indicators. 
How are they evolving compared to the previous runs?
 

In [None]:
%%time
loss3, acc3 = train_model(model=lm,size=dset_sizes['train'],data_loader =train_loader ,epochs=100,optimizer=optimizer_lm)

In [None]:
plt.plot(loss3)

In [None]:
plt.plot(acc3)

### 2.2 Testing

We define next a holistic test function (```test_model```) that will:
- fetch test samples
- pass samples through network, compute error, accuracy and predictions
- keep and print test statistics: test loss, accuracy


In [None]:
def test_model(model,size,data_loader=None):
    model.eval()
    
    predictions = np.zeros(size)
    running_loss = 0.0
    running_corrects = 0
    count = 0 
    for inputs,classes in data_loader:
        #
        # your code
        #
        count +=1
        
    print('Loss: {:.4f} Acc: {:.4f}'.format(running_loss / size, running_corrects.data.item() / size))
    return predictions, running_loss / size, running_corrects.data.item() / size
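
For the prediction bookkeeping, a possible pattern is sketched below on synthetic data: gradients are switched off with `torch.no_grad()` and the predictions of each batch are written into the right slice of a preallocated array. This is only an illustration with made-up `demo_` names, not the expected solution.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)

demo_model = nn.Sequential(nn.Linear(1000, 2), nn.LogSoftmax(dim=1))
demo_model.eval()

# 10 synthetic samples, served in order (no shuffling) in batches of 4.
demo_data = [(torch.randn(1000), torch.tensor(i % 2)) for i in range(10)]
demo_loader = torch.utils.data.DataLoader(demo_data, batch_size=4, shuffle=False)

demo_predictions = np.zeros(len(demo_data))
demo_start = 0
with torch.no_grad():  # no gradients are needed at test time
    for demo_inputs, demo_classes in demo_loader:
        demo_outputs = demo_model(demo_inputs)
        _, demo_preds = torch.max(demo_outputs, 1)
        # The last batch may be smaller, so slice by its actual length.
        demo_end = demo_start + len(demo_classes)
        demo_predictions[demo_start:demo_end] = demo_preds.numpy()
        demo_start = demo_end

print(demo_predictions)
```

Keeping `shuffle=False` matters here: the position of each prediction in the array must match the position of the corresponding label.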

We evaluate on the test data a snapshot of our model at _epoch #300_

In [None]:
%%time
preds, loss_val, acc_val = test_model(model=lm,size=dset_sizes['valid'],data_loader=test_loader)

In [None]:
loss_val

## 3. Quantitative analysis

We concatenate the training losses across the 300 training epochs and plot them along with the loss on the test data using a snapshot of our model at epoch #300.

What do you notice? 

In [None]:
plt.plot(np.concatenate((loss1, loss2, loss3)))
plt.plot([loss_val]*300)

We illustrate a similar plot for the training loss values at _epochs[200:300]_

In [None]:
plt.plot(loss3)
plt.plot([loss_val]*100)

We now illustrate the aggregated training accuracies on _epochs[0:300]_ along with the test accuracy for the model at _epoch #300_.

In [None]:
plt.plot(np.concatenate((acc1, acc2, acc3)))
plt.plot([acc_val]*300)

We train our model for 1000 more epochs.

In [None]:
%%time
loss4, acc4 = train_model(model=lm,size=dset_sizes['train'],data_loader =train_loader ,epochs=1000,optimizer=optimizer_lm)

We test the model snapshot at _epoch #1300_ and keep its statistics and performance.

In [None]:
%%time
preds2, loss_val2, acc_val2 = test_model(model=lm,size=dset_sizes['valid'],data_loader=test_loader)

We aggregate train loss values at _epochs[200:1300]_ and test loss at _epochs[300]_ and _epochs[1300]_.
Do you notice a trend?

In [None]:
plt.plot(np.concatenate((loss3,loss4)))
plt.plot(np.concatenate(([loss_val]*100,[loss_val2]*1000)))

A similar plot for the accuracy values

In [None]:
plt.plot(np.concatenate((acc1, acc2, acc3, acc4)))
plt.plot(np.concatenate(([acc_val]*300,[acc_val2]*1000)))

In [None]:
acc_val

## Exercise

What is happening? 

Make better plots on which we see the evolution of the loss/accuracy on both the training and validation sets as a function of the number of epochs.
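
As a starting point for the plotting part only, here is a hedged sketch of a helper that draws a training curve and a few validation measurements on the same axes, with labels and a legend. The function `plot_curves` and the numbers fed to it are invented for the illustration; they are not results from our model.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_curves(train_values, valid_epochs, valid_values, ylabel):
    """Plot per-epoch training values and sparse validation values together."""
    fig, ax = plt.subplots()
    epochs = np.arange(1, len(train_values) + 1)
    ax.plot(epochs, train_values, label='train')
    ax.plot(valid_epochs, valid_values, 'o-', label='validation')
    ax.set_xlabel('epoch')
    ax.set_ylabel(ylabel)
    ax.legend()
    return ax

# Synthetic curve: a decaying training loss over 300 epochs,
# and made-up validation losses measured at epochs 100, 200 and 300.
demo_train_loss = np.exp(-np.linspace(0, 3, 300))
demo_ax = plot_curves(demo_train_loss, [100, 200, 300], [0.45, 0.25, 0.20], 'loss')
```

To use it on real values you would need to evaluate the model on the validation set at regular intervals during training and record the epoch at which each measurement was taken.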

## 4. Viewing model prediction (qualitative analysis)

The most important metrics for us to look at are for the validation set, since we want to check for over-fitting.

With our first model we should try to overfit before we start worrying about how to handle that: there's no point even thinking about regularization, data augmentation, etc. if you're still under-fitting! (We'll be looking at these techniques after the two-week break...)


As well as looking at the overall metrics, it's also a good idea to look at examples of each of:

   1. A few correct labels at random
   2. A few incorrect labels at random
   3. The most correct labels of each class (i.e. those with highest probability that are correct)
   4. The most incorrect labels of each class (i.e. those with highest probability that are incorrect)
   5. The most uncertain labels (i.e. those with probability closest to 0.5).

In general, these are particularly useful for debugging problems in the model. Since our model is very simple, there may not be too much to learn at this stage...
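
The five selections above can be sketched with plain NumPy indexing. The example uses six hand-written probabilities instead of real model outputs, and assumes that class 1 is "dog" and that `prob_dog` holds the predicted probability of that class; all names are hypothetical.

```python
import numpy as np

# Hand-written example: true labels and predicted probability of class 1.
labels_demo = np.array([0, 0, 1, 1, 0, 1])
prob_dog = np.array([0.05, 0.60, 0.95, 0.45, 0.30, 0.52])
preds_demo = (prob_dog > 0.5).astype(int)

# 1. and 2. Indices of correct and incorrect predictions.
correct_idx = np.where(preds_demo == labels_demo)[0]
incorrect_idx = np.where(preds_demo != labels_demo)[0]

# Confidence in the predicted class, whichever class it is.
confidence = np.where(preds_demo == 1, prob_dog, 1 - prob_dog)

# 3. and 4. Sort each group by decreasing confidence.
most_correct = correct_idx[np.argsort(-confidence[correct_idx])]
most_incorrect = incorrect_idx[np.argsort(-confidence[incorrect_idx])]

# 5. Most uncertain: probability closest to 0.5.
most_uncertain = np.argsort(np.abs(prob_dog - 0.5))[:2]

print(correct_idx, incorrect_idx, most_incorrect, most_uncertain)
```

Restricting `most_correct` and `most_incorrect` to a single class only requires an extra condition on `preds_demo` in the `np.where` call.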

In [None]:
# Number of images to view for each visualization task
n_view = 8

Selecting correct predictions.

In [None]:
correct = np.where(preds==lbs_val)[0]

In [None]:
from numpy.random import random, permutation
idx = permutation(correct)[:n_view]

In [None]:
idx

In [None]:
def imshow(inp, title=None):
#   Imshow for Tensor.
    inp = inp.numpy().transpose((1, 2, 0))
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    inp = std * inp + mean
    plt.imshow(inp)
    if title is not None:
        plt.title(title)
    plt.pause(0.001)  # pause a bit so that plots are updated

data_dir = '/home/lelarge/courses/data/dogscats'
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

prep1 = transforms.Compose([
                transforms.CenterCrop(224),
                transforms.ToTensor(),
                normalize,
            ])
dsets = {x: datasets.ImageFolder(os.path.join(data_dir, x), prep1)
         for x in ['train', 'valid']}

dataset_correct = torch.utils.data.DataLoader([dsets['valid'][x] for x in idx],batch_size = n_view,shuffle=True)

In [None]:
for data in dataset_correct:
    inputs_cor,labels_cor = data

In [None]:
# Make a grid from batch
out = torchvision.utils.make_grid(inputs_cor)

imshow(out, title=[x.item() for x in labels_cor])

In [None]:
from IPython.display import Image, display
for x in idx:
    display(Image(filename=dsets['valid'].imgs[x][0], retina=True))

Selecting incorrect predictions.

In [None]:
incorrect = np.where(preds!=lbs_val)[0]
for x in permutation(incorrect)[:n_view]:
    print(dsets['valid'].imgs[x][1])
    display(Image(filename=dsets['valid'].imgs[x][0], retina=True))

In [None]:
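#3. The images we were most confident were cats, and are actually cats
# Assumption: conf holds the validation class probabilities (test_model above does not return them)
with torch.no_grad():
    conf = torch.exp(lm(torch.from_numpy(feat_val).type(dtype))).cpu().numpy()
correct_cats = np.where((preds==0) & (preds==lbs_val))[0]
most_correct_cats = np.argsort(conf[correct_cats,1])[:n_view]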

In [None]:
for x in most_correct_cats:
    display(Image(filename=dsets['valid'].imgs[correct_cats[x]][0], retina=True))

In [None]:
#3. The images we were most confident were dogs, and are actually dogs
correct_dogs = np.where((preds==1) & (preds==lbs_val))[0]
most_correct_dogs = np.argsort(conf[correct_dogs,0])[:n_view]

In [None]:
for x in most_correct_dogs:
    display(Image(filename=dsets['valid'].imgs[correct_dogs[x]][0], retina=True))

## Exercise

As seen in the first lecture, the last layer of Vgg16 is simply a dense layer that outputs 1000 elements. Therefore, it seems somewhat unreasonable to stack a dense layer meant to find cats and dogs on top of one that's meant to find ImageNet categories, in that we're limiting the information available to us by first coercing the neural network to classify to ImageNet before cats and dogs...

Instead, do finetuning, i.e., remove that last layer and add a new layer for cats and dogs. 

Compare to what we did in the first lecture.
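
As a hint for the mechanics only, the sketch below swaps the last layer of a model that exposes a `classifier` attribute of type `nn.Sequential`, which is how torchvision's VGG16 is organised (its last element maps 4096 features to 1000 classes). To keep the example light it runs on a tiny stand-in network, `TinyStandIn`, which is invented here, as is the helper `replace_last_layer`.

```python
import torch.nn as nn

def replace_last_layer(model, n_classes=2):
    """Freeze the existing weights and swap the final Linear layer."""
    # Freeze every existing parameter so that only the new layer is trained.
    for param in model.parameters():
        param.requires_grad = False
    old_layer = model.classifier[-1]
    # A freshly created layer has requires_grad=True by default.
    model.classifier[-1] = nn.Linear(old_layer.in_features, n_classes)
    return model

class TinyStandIn(nn.Module):
    """Hypothetical miniature network with a VGG-like `classifier` block."""
    def __init__(self):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1000))

    def forward(self, x):
        return self.classifier(x)

demo_net = replace_last_layer(TinyStandIn())
demo_trainable = [name for name, p in demo_net.named_parameters() if p.requires_grad]
print(demo_trainable)
```

With the real network you would pass only the parameters of the new layer to the optimizer, and the inputs would be images (or features taken before the last layer) rather than the 1000 ImageNet scores.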