### Transfer learning & The art of using Pre-trained Models in Deep Learning
[Article](https://analyticsvidhya.com/blog/2017/06/transfer-learning-the-art-of-fine-tuning-a-pre-trained-model/)

The article above explains the theory of Transfer Learning really well but falls a little short when it comes to implementation.
I faced a lot of issues while trying to execute the code directly from the website, like:
- Finding the dataset used (the dataset link provided in the article isn't the one used in the implementation).
- MNIST images are 28x28 and, when scaled to 224x224, take up the whole memory (tried on a server with 32GB RAM) and crash it.
- In the second part of the implementation, "Freeze the weights of first few layers", the CNN architecture that is created is not correct: it adds the final 10 class neuron layer on top of the previous 1000 class neuron layer, while we are supposed to replace the last layer instead of adding on top of it.

***
Instead of loading the whole dataset into memory in one go, a better and more memory-efficient way is to use the "flow_from_directory" method of Keras.
If time permits, I will write another notebook covering that.
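
The idea behind "flow_from_directory" is to read images in small batches instead of all at once. Below is a minimal sketch of that idea in plain NumPy, not the Keras API itself; `image_batches`, `load_fn` and `fake_loader` are hypothetical names introduced only for illustration.

```python
import numpy as np

def image_batches(filenames, load_fn, batch_size=32):
    """Yield arrays of at most batch_size images, loading them lazily."""
    for start in range(0, len(filenames), batch_size):
        chunk = filenames[start:start + batch_size]
        yield np.stack([load_fn(name) for name in chunk])

# Stand-in loader for illustration: returns a blank 28x28 RGB image
def fake_loader(name):
    return np.zeros((28, 28, 3), dtype=np.float32)

names = ["img_%d.png" % i for i in range(70)]
shapes = [batch.shape for batch in image_batches(names, fake_loader)]
print(shapes)
```

Only one batch is held in memory at a time, so the peak memory use depends on `batch_size` rather than on the size of the dataset.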

***
The command below specifies which GPU to use in case multiple GPUs are available.

This notebook was run on a GTX 980 Ti GPU, so you might observe different run times on your end.

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="1"
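
As a small illustration, the variable takes a comma-separated list of GPU indices and has to be set before the deep learning framework initializes CUDA. The indices below are assumptions; use the ones reported by `nvidia-smi` on your machine.

```python
import os

# Expose only the first GPU to this process
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Expose the first two GPUs
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Hide all GPUs and fall back to CPU
# os.environ["CUDA_VISIBLE_DEVICES"] = ""

print(os.environ["CUDA_VISIBLE_DEVICES"])
```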

### Part 1: Retrain the output dense layers only

In [2]:
# importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
get_ipython().magic('matplotlib inline')

import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.preprocessing import image
from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions

Using TensorFlow backend.


### Data Download

Data for this notebook can be downloaded from the "Identify the Digits" hackathon on the Analytics Vidhya website using the following link:<br>
https://datahack.analyticsvidhya.com/contest/practice-problem-identify-the-digits/ 

Extract the compressed file to a specific location and rename the csv files to train.csv and test.csv respectively.

***
Pass the paths of the csv files to the first two variables and the image folder paths to the other two variables.

In [3]:
train_raw=pd.read_csv("/home/arpit/notebooks/data/av/mnist/train.csv")
test_raw=pd.read_csv("/home/arpit/notebooks/data/av/mnist/test.csv")
train_path="/home/arpit/notebooks/data/av/mnist/Images/train/"
test_path="/home/arpit/notebooks/data/av/mnist/Images/test/"

Let's sample only 10% of the data, as the main focus of this tutorial is to get up and running with Transfer Learning rather than to maximize the accuracy of the model.

You can increase or decrease the sampling percentage as per your convenience.

Sampling is important here: if you try to load the provided training dataset of 49,000 images the way the article does, you won't be able to load it fully, as everything is held in RAM. Even with 32GB of RAM my server crashed.
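
A rough back-of-the-envelope estimate shows why. The sketch below assumes the images are stored as 224x224x3 arrays of 4-byte floats; `dataset_gib` is a hypothetical helper, not part of the notebook.

```python
def dataset_gib(n_images, side=224, channels=3, bytes_per_value=4):
    """Size in GiB of n_images square images stored as a dense array."""
    return n_images * side * side * channels * bytes_per_value / 2**30

full = dataset_gib(49000)     # whole training set
sampled = dataset_gib(4900)   # 10% sample
print(round(full, 1), round(sampled, 1))
```

Under these assumptions the full training set needs about 27.5 GiB for a single copy, against about 2.7 GiB for the 10% sample. Building a Python list first and then converting it to a NumPy array briefly keeps two copies around, which makes the full set even harder to fit in 32GB.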

In [4]:
# Number of images in raw training file
train_raw.shape

(49000, 2)

In [5]:
data_sampling_rate = 0.1
train = train_raw.sample(frac=data_sampling_rate).reset_index(drop=True)
test = test_raw.sample(frac=data_sampling_rate).reset_index(drop=True)

In [6]:
# Number of images after sampling
train.shape

(4900, 2)

#### Preparing the dataset
- Initialize an empty list
- Read the sampled csv file line by line, find the image, upscale it to size 224x224 from the default of 28x28
- Convert the image to an array
- Append it to the list
- Convert the list to a Numpy array
- Do some preprocessing (mean subtraction) on the pixels
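
For VGG16, `preprocess_input` does not scale pixels to [0, 1]. To my knowledge its default mode converts RGB to BGR and subtracts the per-channel ImageNet mean. Below is a NumPy sketch of that behaviour; `vgg16_style_preprocess` is a hypothetical stand-in, not the Keras function.

```python
import numpy as np

# Per-channel ImageNet means in BGR order (assumed "caffe" style preprocessing)
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def vgg16_style_preprocess(batch_rgb):
    """batch_rgb: float array of shape (n, height, width, 3) in RGB order."""
    batch_bgr = batch_rgb[..., ::-1]        # RGB -> BGR
    return batch_bgr - IMAGENET_MEAN_BGR    # zero-center each channel

# One 2x2 image whose every pixel equals the ImageNet mean colour (RGB order)
mean_rgb = np.array([123.68, 116.779, 103.939], dtype=np.float32)
batch = np.tile(mean_rgb, (1, 2, 2, 1))
out = vgg16_style_preprocess(batch)
print(out.shape, float(np.abs(out).max()))
```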

In [8]:
# preparing the train dataset
train_img=[]
for i in range(len(train)):
    temp_img=image.load_img(train_path+train['filename'][i],target_size=(224,224))
    temp_img=image.img_to_array(temp_img)
    train_img.append(temp_img)

#converting train images to array and applying mean subtraction processing
train_img=np.array(train_img) 
train_img=preprocess_input(train_img)

In [9]:
# applying the same procedure with the test dataset
test_img=[]
for i in range(len(test)):
    temp_img=image.load_img(test_path+test['filename'][i],target_size=(224,224))
    temp_img=image.img_to_array(temp_img)
    test_img.append(temp_img)

#converting test images to array and applying mean subtraction processing
test_img=np.array(test_img) 
test_img=preprocess_input(test_img)

### Loading the VGG16 model

Below is the architecture of the VGG16 model:
- The default input image size the network was trained on is 224x224
- The last MaxPooling layer outputs 7x7x512, which when flattened gives 25088 neurons (used later when reshaping the image features extracted by the Convolution layers)

To save further on memory when trying out only Part 1 of transfer learning, you can resize the images to smaller dimensions, down to 48x48, but that will change the output size from 7x7x512 to some other dimension, which you need to change manually later in the code. For further details on input image size, check the __input_shape__ argument at the following link:<br>
https://keras.io/applications/#vgg16
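
The new output dimension can be worked out by hand. VGG16 has 5 MaxPooling layers that each halve the spatial size, and the last convolution block has 512 channels. A small sketch of that arithmetic follows; `vgg16_flat_features` is a hypothetical helper.

```python
def vgg16_flat_features(side, n_pools=5, channels=512):
    """Flattened size of the last pooling output for a side x side input."""
    for _ in range(n_pools):
        side //= 2          # each 2x2 max pooling halves the spatial size
    return side * side * channels

for side in (224, 128, 64, 48):
    print(side, vgg16_flat_features(side))
```

For a 224x224 input this gives 25088, and for a 48x48 input only 512.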

Keras implementation of VGG16:<br>
https://github.com/keras-team/keras-applications/blob/master/keras_applications/vgg16.py

![VGG16.png](../../images/VGG16.png)

__include_top = False__ loads only the Convolutional layers of the VGG model and removes the Dense layer part
- It removes the last 4 layers from the architecture (3 blue fully connected layers + 1 brown softmax classification layer)

In [31]:
# loading VGG16 model weights
model = VGG16(weights='imagenet', include_top=False)

The two feature extraction commands below, on the training and testing datasets, can take some time on CPU-only machines.

In [32]:
%%time
# Extracting features from the train dataset using the VGG16 pre-trained model
features_train=model.predict(train_img)

CPU times: user 1min 7s, sys: 5.98 s, total: 1min 13s
Wall time: 19 s


In [33]:
%%time
# Extracting features from the test dataset using the VGG16 pre-trained model
features_test=model.predict(test_img)

CPU times: user 29.1 s, sys: 2.62 s, total: 31.7 s
Wall time: 8.18 s


- The first number is the number of images being processed, which is 4900 in this case.
- The last 3 numbers are the dimensions of the final MaxPooling layer output, i.e. 7x7x512 = 25088 neurons when flattened.
- In case you changed the target size of the images from 224x224 to anything of 48x48 or above while loading them, multiplying these last 3 numbers gives the second argument of the reshape function used below.

In [34]:
features_train.shape

(4900, 7, 7, 512)

In [35]:
features_train.shape[0]

4900

The first argument of the reshape command is made dynamic so that it is picked up automatically even if you decide to take a different sample size.

The __.reshape__ method takes two arguments:
- First argument: Number of rows/images
- Second argument: Number of neurons we get on flattening the output of the final MaxPooling layer
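
As a side note, NumPy can also infer the second argument when it is passed as `-1`. This avoids hard-coding 25088 if the image size changes. A minimal sketch with a dummy array standing in for `features_train`:

```python
import numpy as np

# Dummy stand-in for features_train: 5 images, 7x7x512 feature maps
features = np.zeros((5, 7, 7, 512), dtype=np.float32)

# -1 lets NumPy infer the second dimension (7 * 7 * 512 = 25088)
flat = features.reshape(features.shape[0], -1)
print(flat.shape)
```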

In [36]:
# flattening the layers to conform to MLP input
# train_x=features_train.reshape(49000,25088)
train_x=features_train.reshape(features_train.shape[0],25088)

# converting target variable to array
train_y=np.asarray(train['label'])

# performing one-hot encoding for the target variable
train_y=pd.get_dummies(train_y)
train_y=np.array(train_y)
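
`pd.get_dummies` creates one column per label that is present in the sample, so a small sample that happens to miss a digit would produce fewer than 10 columns. A hedged alternative in plain NumPy fixes the number of classes up front; `one_hot` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (n_samples, num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

encoded = one_hot([3, 0, 9])
print(encoded.shape)
```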

In [37]:
# creating training and validation set
from sklearn.model_selection import train_test_split
X_train, X_valid, Y_train, Y_valid=train_test_split(train_x,train_y,\
                                        test_size=0.3, random_state=42)

Also change the __input_dim__ parameter in case you lowered the target_size dimension originally.

In [38]:
# creating a mlp model
from keras.layers import Dense, Activation, Dropout
model=Sequential()

# Change input_dim if required
model.add(Dense(1000, input_dim=25088, activation='relu',kernel_initializer='uniform'))
# Dropout layers must be added with model.add to become part of the model
model.add(Dropout(0.3))

model.add(Dense(500,input_dim=1000,activation='sigmoid'))
model.add(Dropout(0.4))

model.add(Dense(150,input_dim=500,activation='sigmoid'))
model.add(Dropout(0.2))

model.add(Dense(units=10))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])

In [39]:
model.fit(X_train, Y_train, epochs=10, batch_size=128,validation_data=(X_valid,Y_valid))

Train on 3430 samples, validate on 1470 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f4605052358>

***
You can see the importance of Transfer Learning: despite sampling only 10% of the data, we were able to achieve 100% accuracy on the training data and 98.2% accuracy on the validation data.

### Part 2: Freeze the weights of first few layers

In Part 2, we will use the same architecture as the original VGG16, freezing the first 15 layers (up to the 4th red Max Pooling layer) and retraining the weights of the last 8 layers.

We also need to replace the last layer of 1000 classes with another layer of 10 classes for digit classification.

The original article on Analytics Vidhya ended up adding another layer on top of the 1000 class layer, while we actually have to replace it.
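
The 15/8 split can be checked against the layer order of the Keras VGG16 model. The list below is written out by hand from that architecture (it is not read from a loaded model), so treat it as a sketch.

```python
vgg16_layers = [
    "input_1",
    "block1_conv1", "block1_conv2", "block1_pool",
    "block2_conv1", "block2_conv2", "block2_pool",
    "block3_conv1", "block3_conv2", "block3_conv3", "block3_pool",
    "block4_conv1", "block4_conv2", "block4_conv3", "block4_pool",
    "block5_conv1", "block5_conv2", "block5_conv3", "block5_pool",
    "flatten", "fc1", "fc2", "predictions",
]

# The first 15 entries are frozen, the remaining 8 are retrained
frozen, trainable = vgg16_layers[:15], vgg16_layers[15:]
print(len(vgg16_layers), frozen[-1], trainable)
```

The last frozen entry is the 4th pooling layer, and the 8 remaining entries are the 5th convolution block plus the dense part, with `predictions` being the layer that gets replaced.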

![VGG16.png](../../images/VGG16.png)

The part up to loading the train and test data is the same as above. For a detailed explanation, check Part 1 above.

In [1]:
# GPU Visibility
import os
os.environ["CUDA_VISIBLE_DEVICES"]="1"

In [2]:
# Importing the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
get_ipython().magic('matplotlib inline')

import keras
from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from keras.preprocessing import image
from keras.utils.np_utils import to_categorical
from keras.models import Model, Sequential
from keras.optimizers import SGD
from keras.layers import Dense

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import log_loss

Using TensorFlow backend.


In [3]:
train_raw=pd.read_csv("/home/arpit/notebooks/data/av/mnist/train.csv")
test_raw=pd.read_csv("/home/arpit/notebooks/data/av/mnist/test.csv")
train_path="/home/arpit/notebooks/data/av/mnist/Images/train/"
test_path="/home/arpit/notebooks/data/av/mnist/Images/test/"

In [4]:
data_sampling_rate = 0.1
train = train_raw.sample(frac=data_sampling_rate).reset_index(drop=True)
test = test_raw.sample(frac=data_sampling_rate).reset_index(drop=True)

In [5]:
print(train.shape)
print(test.shape)

(4900, 2)
(2100, 1)


In [6]:
# preparing the train dataset

train_img=[]
for i in range(len(train)):
    temp_img=image.load_img(train_path+train['filename'][i],target_size=(224,224))
    temp_img=image.img_to_array(temp_img)
    train_img.append(temp_img)

train_img=np.array(train_img) 
train_img=preprocess_input(train_img)

In [7]:
test_img=[]
for i in range(len(test)):
    temp_img=image.load_img(test_path+test['filename'][i],target_size=(224,224))
    temp_img=image.img_to_array(temp_img)
    test_img.append(temp_img)

test_img=np.array(test_img) 
test_img=preprocess_input(test_img)

To get a better understanding of the function below, read the following page of the Keras documentation:<br>
https://keras.io/getting-started/functional-api-guide/

In [8]:
from keras.models import Model

def vgg16_model(channel=1, num_classes=None):

    # Loads the complete VGG16 model, including the top dense layers
    base = VGG16(weights='imagenet', include_top=True)

    # Output of the second last layer, i.e. the one with 4096 neurons.
    # The original last layer of 1000 classes is simply left out of the new model
    fc2_output = base.layers[-2].output

    # Original Article
    # x=Dense(num_classes, activation='softmax')(model.output)
    # "model.output" is the output of the original 1000 class layer,
    # so it ends up adding the 10 class layer on top of the 1000 class layer

    # Modified
    # This adds the newly created 10 class layer on top of the output from the second last layer
    x=Dense(num_classes, activation='softmax')(fc2_output)

    # Defining the model architecture
    model=Model(base.input,x)

    # Set the first 15 layers to non-trainable (weights will not be updated)
    # Originally set to 8, updated to 15 to reduce the training time
    for layer in model.layers[:15]:
        layer.trainable = False

    # Learning rate is changed to 0.001
    sgd = SGD(lr=1e-3, decay=1e-6, momentum=0.9, nesterov=True)

    model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])

    return model

In [9]:
# One hot encoding
train_y=np.asarray(train['label'])
le = LabelEncoder()
train_y = le.fit_transform(train_y)
train_y=to_categorical(train_y)
train_y=np.array(train_y)

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_valid, Y_train, Y_valid=train_test_split(train_img,train_y,test_size=0.2, random_state=42)

In [11]:
img_rows, img_cols = 224, 224 # Resolution of inputs
channel = 3
num_classes = 10 
batch_size = 16 
nb_epoch = 10

We can see in the model summary below that the last layer now has only 10 neurons, so we have replaced the 1000 class layer successfully.

In [12]:
# Load our model
model = vgg16_model( channel, num_classes)

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
__________

***
- The last part of the above summary shows that the total number of trainable parameters is ~127 million, which is a lot to train. But as we are loading the pre-trained model, the initial weights are already relevant, so it converges faster than training from scratch, where we initialize the weights randomly.
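
The ~127 million figure can be reproduced by hand. The sketch below assumes the standard VGG16 layer sizes (3x3 convolutions with 512 filters in block 5, two dense layers of 4096 neurons) and counts only the layers left trainable after freezing the first 15; `conv_params` and `dense_params` are hypothetical helpers.

```python
def conv_params(c_in, c_out, k=3):
    return (k * k * c_in + 1) * c_out     # weights + one bias per filter

def dense_params(n_in, n_out):
    return (n_in + 1) * n_out             # weights + one bias per neuron

block5 = 3 * conv_params(512, 512)        # block5_conv1..3
fc1 = dense_params(7 * 7 * 512, 4096)     # flattened 7x7x512 -> 4096
fc2 = dense_params(4096, 4096)
head = dense_params(4096, 10)             # new 10 class layer

trainable = block5 + fc1 + fc2 + head
print(trainable)
```

Most of the total comes from `fc1`, which alone accounts for more than 100 million weights.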

***
- The times below are on GPU. Training the same on CPU can take a lot more time
- You can try it out with fewer epochs


In [17]:
# Check the trainable status of the individual layers
for layer in model.layers:
    print(layer, layer.trainable)

<keras.engine.topology.InputLayer object at 0x7fb4bca36c50> False
<keras.layers.convolutional.Conv2D object at 0x7fb4a68b2cf8> False
<keras.layers.convolutional.Conv2D object at 0x7fb4a68f4e10> False
<keras.layers.pooling.MaxPooling2D object at 0x7fb4a6646a58> False
<keras.layers.convolutional.Conv2D object at 0x7fb4a666b898> False
<keras.layers.convolutional.Conv2D object at 0x7fb4a6605ef0> False
<keras.layers.pooling.MaxPooling2D object at 0x7fb4a661cfd0> False
<keras.layers.convolutional.Conv2D object at 0x7fb4a65c9128> False
<keras.layers.convolutional.Conv2D object at 0x7fb4a65c9400> False
<keras.layers.convolutional.Conv2D object at 0x7fb4a65dbf98> False
<keras.layers.pooling.MaxPooling2D object at 0x7fb4a6581780> False
<keras.layers.convolutional.Conv2D object at 0x7fb4a65ac198> False
<keras.layers.convolutional.Conv2D object at 0x7fb4a65ac470> False
<keras.layers.convolutional.Conv2D object at 0x7fb4a6558a90> False
<keras.layers.pooling.MaxPooling2D object at 0x7fb4a6569860> Fa

In [13]:
# Start Fine-tuning
model.fit(X_train, Y_train,batch_size=batch_size,epochs=nb_epoch,shuffle=True,verbose=1,validation_data=(X_valid, Y_valid))

Train on 3920 samples, validate on 980 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fb4a62255f8>

After running our model for just 10 epochs, we are able to achieve 100% accuracy on the training data and 98.3% accuracy on the validation data.

In [14]:
# Make predictions
predictions_valid = model.predict(X_valid, verbose=1)



In [15]:
# Cross-entropy loss score
score = log_loss(Y_valid, predictions_valid)

In [16]:
score

0.09168487530338551
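
For reference, the cross-entropy score is the average negative log of the probability the model assigned to the correct class. A minimal NumPy sketch on made-up numbers (not the notebook's predictions) follows; `categorical_log_loss` is a hypothetical helper that mirrors the idea rather than the exact scikit-learn implementation.

```python
import numpy as np

def categorical_log_loss(y_true, y_pred, eps=1e-15):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

# Two samples, two classes: the model is fairly confident and right both times
y_true = np.array([[1, 0], [0, 1]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])
loss = categorical_log_loss(y_true, y_pred)
print(loss)
```

A lower score means the model puts more probability on the correct class; a perfect, fully confident model would score close to 0.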