## Transfer Learning using CIFAR-10 data
You will work with the CIFAR-10 Dataset. This is a well-known dataset for image classification, which consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The 10 classes are: 
airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck

To illustrate the power and concept of transfer learning, you will train a CNN on just the classes (airplane, automobile, bird, cat, deer). Then you will train just the last layer(s) of the network on the classes (dog, frog, horse, ship, truck) and see how well the features learned on (airplane, automobile, bird, cat, deer) help with classifying (dog, frog, horse, ship, truck).


In [1]:
import datetime
import numpy as np
import keras
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
from keras.optimizers import RMSprop

In [2]:
#used to help some of the timing functions
now = datetime.datetime.now

In [3]:
## This just handles some variability in how the input data is loaded
img_rows, img_cols = 32, 32
if K.image_data_format() == 'channels_first':
    input_shape = (3, img_rows, img_cols)
else:
    input_shape = (img_rows, img_cols, 3)

In [4]:
## To simplify things, write a function to include all the training steps
## As input, function takes a model, training set, test set, and the number of classes
## Inside the model object will be the state about which layers we are freezing and which we are training

def train_model(model, train, test, num_classes):
    x_train = train[0].reshape((train[0].shape[0],) + input_shape)
    x_test = test[0].reshape((test[0].shape[0],) + input_shape)
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255
    print('x_train shape:', x_train.shape)
    print(x_train.shape[0], 'train samples')
    print(x_test.shape[0], 'test samples')

    # convert class vectors to binary class matrices
    y_train = keras.utils.to_categorical(train[1], num_classes)
    y_test = keras.utils.to_categorical(test[1], num_classes)

    model.compile(loss='categorical_crossentropy',
                  optimizer=RMSprop(learning_rate=0.001),
                  metrics=['accuracy'])

    t = now()
    model.fit(x_train, y_train,
              batch_size=64, # I defined the values inside the function directly
              epochs=10,
              verbose=1,
              validation_data=(x_test, y_test))
    print('Training time: %s' % (now() - t))

    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test score:', score[0])
    print('Test accuracy:', score[1])

In [5]:
# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
y_train = np.squeeze(y_train)
y_test = np.squeeze(y_test)

# create two datasets: one with classes below 5 and one with 5 and above
x_train_lt5 = x_train[y_train < 5]
y_train_lt5 = y_train[y_train < 5]
x_test_lt5 = x_test[y_test < 5]
y_test_lt5 = y_test[y_test < 5]

x_train_gte5 = x_train[y_train >= 5]
y_train_gte5 = y_train[y_train >= 5] - 5
x_test_gte5 = x_test[y_test >= 5]
y_test_gte5 = y_test[y_test >= 5] - 5

## Assignment
### PART-1: Build your CNN model
PART-1: Build your own model
Build a CNN model with the following specifications:

1. Two convolutional layers with ReLU activations.
2. MaxPooling with stride 2. Dropout of 0.25 after MaxPooling.
3. Two hidden fully-connected layers, two dropouts of 0.25 and a final output layer for classification.
4. Train this model for 10 epochs with RMSProp at a learning rate of .001 and a batch size of 64.
5. Evaluate the test results.


In [6]:
# hyperparamaters
epochs = 10
batch_size = 64
learning_rate = 0.001
num_classes = 5

In [7]:
# Define your model here
# Creating model with sequantial to add several layers
model = Sequential()
# First convolutional layer 
model.add(Conv2D(32, kernel_size=(3, 3), padding='same', input_shape=input_shape))
model.add(Activation('relu'))

# Second convolutional layer
model.add(Conv2D(32, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))

# MaxPooling layer with stride 2
model.add(MaxPooling2D(pool_size=(2, 2), strides=2))
model.add(Dropout(0.25))

# Flatten 
model.add(Flatten())

# Adding first fully connected layer
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.25))

# Adding second fully connected layer
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.25))

# final output for classification with softmax activation for multi-class classification with 5 classes with num_classes=5
model.add(Dense(num_classes))
model.add(Activation('softmax'))

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


In [8]:
model.summary()

In [9]:
# Now, train your model on the classes (airplane, automobile, bird, cat, deer)
train_model(model, (x_train_lt5, y_train_lt5), (x_test_lt5, y_test_lt5), num_classes)

x_train shape: (25000, 32, 32, 3)
25000 train samples
5000 test samples
Epoch 1/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m16s[0m 40ms/step - accuracy: 0.4404 - loss: 1.3305 - val_accuracy: 0.6058 - val_loss: 1.0352
Epoch 2/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m16s[0m 40ms/step - accuracy: 0.6802 - loss: 0.8233 - val_accuracy: 0.6982 - val_loss: 0.7697
Epoch 3/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m16s[0m 40ms/step - accuracy: 0.7423 - loss: 0.6751 - val_accuracy: 0.7320 - val_loss: 0.6995
Epoch 4/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m15s[0m 40ms/step - accuracy: 0.7857 - loss: 0.5707 - val_accuracy: 0.7622 - val_loss: 0.6352
Epoch 5/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m16s[0m 40ms/step - accuracy: 0.8220 - loss: 0.4797 - val_accuracy: 0.7716 - val_loss: 0.6409
Epoch 6/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m16s[0m 40ms/step - accuracy: 0.8556 -

### Evaluate the test results.
Test score corresponds to categorical cross entropy loss on the test dataset. Lower value of loss indicates better model but this value depends on the scale of the problem and complexity of the dataset. We can look for training and validation loss in each epoch to observe it is decrease or increase to understand model learning each step. Loss values both in training and validation decreases so model weights are updated in a way. We can look for accuracy values of validation to see model performance on test dataset. Test accuracy is 78% approximately which means model correctly predicts approximately %78 of the test dataset so accuracy metric measures the model performance on test dataset in terms of classification accuracy.

### PART-2: Transfer Learning
### Freezing Layers
Keras allows layers to be "frozen" during the training process. That is, some layers would have their weights updated during the training process, while others would not. This is a core part of transfer learning, the ability to train just the last one or several layers.

A model layer (at position i) can be frozen as follows:

model.layers[i].trainable = False

Now, fine-tune your model in two different ways and compare the performances:

1. Freeze all layers except the output layer, and train your model on the classes (dog, frog, horse, ship, truck).
2. Freeze all layers except the fully connected layer and the output layer, and train your model on the classes (dog, frog, horse, ship, truck). 

Compare the classification accuracy and training time of these two fine-tuning approaches and answer the following:

1. How many trainable parameters are there in each case?
2. Which fine-tuning performs better in terms of classification accuracy and why?
3. Why is fine-tuning much faster than the initial training of the network?


In [10]:
model_ft_1 = keras.models.clone_model(model)
model_ft_2 = keras.models.clone_model(model)

In [13]:
# Freeze all layers except the last one 
for layer in model_ft_1.layers[:-2]:
    layer.trainable = False
for i, layer in enumerate(model_ft_1.layers):
    print(f"Layer {i}: {layer.name}, Trainable: {layer.trainable}")

Layer 0: conv2d, Trainable: False
Layer 1: activation, Trainable: False
Layer 2: conv2d_1, Trainable: False
Layer 3: activation_1, Trainable: False
Layer 4: max_pooling2d, Trainable: False
Layer 5: dropout, Trainable: False
Layer 6: flatten, Trainable: False
Layer 7: dense, Trainable: False
Layer 8: activation_2, Trainable: False
Layer 9: dropout_1, Trainable: False
Layer 10: dense_1, Trainable: False
Layer 11: activation_3, Trainable: False
Layer 12: dropout_2, Trainable: False
Layer 13: dense_2, Trainable: True
Layer 14: activation_4, Trainable: True


Observe below the differences between the number of *total params*, *trainable params*, and *non-trainable params*.

In [14]:
model_ft_1.summary()

In [15]:
# Now, fine-tune your model on the classes (dog, frog, horse, ship, truck)
train_model(model_ft_1, (x_train_gte5, y_train_gte5),(x_test_gte5, y_test_gte5), num_classes)

x_train shape: (25000, 32, 32, 3)
25000 train samples
5000 test samples
Epoch 1/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 15ms/step - accuracy: 0.2483 - loss: 1.5905 - val_accuracy: 0.4156 - val_loss: 1.5245
Epoch 2/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 14ms/step - accuracy: 0.3617 - loss: 1.5167 - val_accuracy: 0.4442 - val_loss: 1.4684
Epoch 3/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 14ms/step - accuracy: 0.3895 - loss: 1.4746 - val_accuracy: 0.4854 - val_loss: 1.4297
Epoch 4/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 14ms/step - accuracy: 0.4018 - loss: 1.4481 - val_accuracy: 0.4932 - val_loss: 1.4006
Epoch 5/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 13ms/step - accuracy: 0.4082 - loss: 1.4265 - val_accuracy: 0.4766 - val_loss: 1.3829
Epoch 6/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 14ms/step - accuracy: 0.4133 - loss:

In [17]:
# Freeze all layers first
for layer in model_ft_2.layers:
    layer.trainable = False
# Trainable fully connected layers and the output layer between 7 and 14th layers
for layer in model_ft_2.layers[7:15]:
    layer.trainable = True
# Printing the results for control
for i, layer in enumerate(model_ft_2.layers):
    print(f"Layer {i}: {layer.name}, Trainable: {layer.trainable}")

Layer 0: conv2d, Trainable: False
Layer 1: activation, Trainable: False
Layer 2: conv2d_1, Trainable: False
Layer 3: activation_1, Trainable: False
Layer 4: max_pooling2d, Trainable: False
Layer 5: dropout, Trainable: False
Layer 6: flatten, Trainable: False
Layer 7: dense, Trainable: True
Layer 8: activation_2, Trainable: True
Layer 9: dropout_1, Trainable: True
Layer 10: dense_1, Trainable: True
Layer 11: activation_3, Trainable: True
Layer 12: dropout_2, Trainable: True
Layer 13: dense_2, Trainable: True
Layer 14: activation_4, Trainable: True


In [18]:
model_ft_2.summary()

In [19]:
train_model(model_ft_2, (x_train_gte5, y_train_gte5),(x_test_gte5, y_test_gte5), num_classes)

x_train shape: (25000, 32, 32, 3)
25000 train samples
5000 test samples
Epoch 1/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 23ms/step - accuracy: 0.3901 - loss: 1.3929 - val_accuracy: 0.5176 - val_loss: 1.1304
Epoch 2/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 23ms/step - accuracy: 0.5878 - loss: 1.0462 - val_accuracy: 0.5508 - val_loss: 1.0533
Epoch 3/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 23ms/step - accuracy: 0.6341 - loss: 0.9475 - val_accuracy: 0.6520 - val_loss: 0.8742
Epoch 4/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 23ms/step - accuracy: 0.6595 - loss: 0.8888 - val_accuracy: 0.6110 - val_loss: 1.0432
Epoch 5/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 23ms/step - accuracy: 0.6783 - loss: 0.8492 - val_accuracy: 0.6530 - val_loss: 0.8703
Epoch 6/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 23ms/step - accuracy: 0.6920 - loss:

### 1. How many trainable parameters are there in each case?
In the original model there are 4,337,573 parameters in total and all the parameters are trainable so there are 4,337,573 trainable parameters. In the first version of fine-tuning, there are 1,285 trainable parameters because we froze the layers except output layer. In the second version of fine-tuning, there are 4,327,429 trainable parameters because we froze the layers except fully connected layers and the output layer. 

### 2. Which fine-tuning performs better in terms of classification accuracy and why?
Second version of the fine-tuning performs better in terms of classification accuracy because there are more trainable parameters in the second version. In the first version, we only use the output layer as a trainable which is very small part of the model so the second version beats the accuracy. Trainable parameters are much higher than the first version due to we train fully connected layers extra for first version.

### 3. Why is fine-tuning much faster than the initial training of the network?
Fine tuning froze desired layers and train only the small part of the architecture. Fine tuning has small number amount trainable parameters compare to the original training because original training update weights of the all architecture. Fine-tuning enhances the time by reducing the trainable number of parameters so this enbales the transfer learning by training a base model than fine-tuning on another task to predict on different dataset.