# Mini Project: Transfer Learning with Keras

Transfer learning is a machine learning technique where a model trained on one task is used as a starting point to solve a different but related task. Instead of training a model from scratch, transfer learning leverages the knowledge learned from the source task and applies it to the target task. This approach is especially useful when the target task has limited data or computational resources.

In transfer learning, the pre-trained model, also known as the "base model" or "source model," is typically trained on a large dataset and a more general problem (e.g., image classification on ImageNet, a vast dataset with millions of labeled images). The knowledge learned by the base model in the form of feature representations and weights captures common patterns and features in the data.

To perform transfer learning, the following steps are commonly followed:

1. Pre-training: The base model is trained on a source task using a large dataset, which can take a considerable amount of time and computational resources.

2. Feature Extraction: After pre-training, the base model is used as a feature extractor. The last few layers (classifier layers) of the model are discarded, and the remaining layers (feature extraction layers) are retained. These layers serve as feature extractors, producing meaningful representations of the data.

3. Fine-tuning: The retained feature extraction layers are connected to a new set of layers, often called the "classifier layers" or "task-specific layers." These new layers are randomly initialized, and the model is trained on the target task with a smaller dataset. The weights of the base model can be frozen during this training, or some of them (typically the later layers) can be updated with a lower learning rate to adapt the model to the target task.
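
The feature-extraction and fine-tuning phases can be sketched in Keras as follows. This is a minimal sketch under assumptions that are not part of the steps above: `weights=None` is used only so the snippet runs without downloading anything (a real run would load pre-trained weights), and the choice to unfreeze only the `block5` layers, along with the learning rates of 1e-3 and 1e-5, is illustrative.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Assumption: weights=None keeps the sketch offline; use weights="imagenet" in practice.
base = VGG16(include_top=False, weights=None, input_shape=(32, 32, 3))

# Phase 1 (feature extraction): freeze the base and train only the new head.
for layer in base.layers:
    layer.trainable = False
x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation="relu")(x)
outputs = Dense(10, activation="softmax")(x)
model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer=Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
head_only = len(model.trainable_weights)  # kernel + bias of each Dense layer
# ... model.fit(...) on the target data would go here ...

# Phase 2 (fine-tuning): unfreeze the last convolutional block and
# recompile with a much lower learning rate so the weights change slowly.
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")
model.compile(optimizer=Adam(learning_rate=1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
with_block5 = len(model.trainable_weights)
print(head_only, with_block5)
```

Recompiling after changing `trainable` matters: the change only takes effect for training once the model is compiled again.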

Transfer learning has several benefits:

1. Reduced training time and resource requirements: Since the base model has already learned generic features, transfer learning can save time and resources compared to training a model from scratch.

2. Improved generalization: Transfer learning helps the model generalize better to the target task, especially when the target dataset is small and similar to the source dataset.

3. Better performance: By starting from a model that is already trained on a large dataset, transfer learning can lead to better performance on the target task, especially in scenarios with limited data.

4. Effective feature extraction: The feature extraction layers of the pre-trained model can serve as powerful feature extractors for different tasks, even when the task domains differ.

Transfer learning is commonly used in various domains, including computer vision, natural language processing (NLP), and speech recognition, where pre-trained models are fine-tuned for specific applications like object detection, sentiment analysis, or speech-to-text.

In this mini-project you will perform fine-tuning using Keras with a pre-trained VGG16 model on the CIFAR-10 dataset.

First, import all the libraries you'll need.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

The CIFAR-10 dataset is a widely used benchmark dataset in computer vision and machine learning. The name refers to the Canadian Institute for Advanced Research. The dataset was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton at the University of Toronto and is described in Krizhevsky's 2009 technical report "Learning Multiple Layers of Features from Tiny Images."

The dataset consists of 60,000 color images, each of size 32x32 pixels, belonging to ten different classes. Each class contains 6,000 images. The ten classes in CIFAR-10 are:

1. Airplane
2. Automobile
3. Bird
4. Cat
5. Deer
6. Dog
7. Frog
8. Horse
9. Ship
10. Truck

The images are evenly distributed across the classes, making CIFAR-10 a balanced dataset. The dataset is divided into two sets: a training set and a test set. The training set contains 50,000 images, while the test set contains the remaining 10,000 images.

CIFAR-10 is often used for tasks such as image classification, object recognition, and transfer learning experiments. The relatively small size of the images and the variety of classes make it a challenging dataset for training machine learning models, especially deep neural networks. It also serves as a good dataset for teaching and learning purposes due to its manageable size and straightforward class labels.

Here are your tasks:

1. Load the CIFAR-10 dataset after referencing the documentation [here](https://keras.io/api/datasets/cifar10/).
2. Normalize the pixel values so they're all in the range [0, 1].
3. Apply one-hot encoding to the train and test labels using the [to_categorical](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical) function.
4. Further split the training data into training and validation sets using [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). Use only 10% of the data for validation.

In [2]:
# Load the CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
assert x_train.shape == (50000, 32, 32, 3)
assert x_test.shape == (10000, 32, 32, 3)
assert y_train.shape == (50000, 1)
assert y_test.shape == (10000, 1)
print('data loaded...')

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
data loaded...


In [3]:
# Normalize the pixel values to [0, 1]
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
print('values normalized...')

values normalized...


In [4]:
# One-hot encode the labels
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
print('labels encoded...')

labels encoded...
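
To make the encoding concrete, here is a NumPy-only sketch that produces the same one-hot layout for a few labels. The helper name `one_hot` and the sample labels are hypothetical; the notebook itself relies on `to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels of shape (n, 1) or (n,) into an (n, num_classes) array."""
    labels = np.asarray(labels).ravel()
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[labels]

labels = np.array([[3], [0], [9]])   # same (n, 1) layout as the CIFAR-10 labels
encoded = one_hot(labels, 10)
print(encoded.shape)            # (3, 10)
print(encoded.argmax(axis=1))   # [3 0 9]
```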


In [5]:
# Split the data into training and validation sets
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.1, random_state=1995)
print('data split into training and validation sets...')
print('Training set shape:', x_train.shape, y_train.shape)
print('Validation set shape:', x_val.shape, y_val.shape)
print('Test set shape:', x_test.shape, y_test.shape)

data split into training and validation sets...
Training set shape: (45000, 32, 32, 3) (45000, 10)
Validation set shape: (5000, 32, 32, 3) (5000, 10)
Test set shape: (10000, 32, 32, 3) (10000, 10)


VGG16 (Visual Geometry Group 16) is a deep convolutional neural network architecture that was developed by the Visual Geometry Group at the University of Oxford. It was proposed by researchers Karen Simonyan and Andrew Zisserman in their paper titled "Very Deep Convolutional Networks for Large-Scale Image Recognition," which was presented at the International Conference on Learning Representations (ICLR) in 2015.

The VGG16 architecture gained significant popularity for its simplicity and effectiveness in image classification tasks. It was one of the pioneering models that demonstrated the power of deeper neural networks for visual recognition tasks.

Key characteristics of the VGG16 architecture:

1. Architecture: VGG16 has 16 layers with learnable weights (13 convolutional and 3 fully connected), hence the "16" in the name. These layers are stacked one after another, forming a deep neural network.

2. Convolutional Layers: The main building blocks of VGG16 are the convolutional layers. It primarily uses 3x3 convolutional filters throughout the network, which allows it to capture local features effectively.

3. Max Pooling: After each set of convolutional layers, VGG16 applies max-pooling layers with 2x2 filters and stride 2, which halves the spatial dimensions (width and height) of the feature maps and reduces the computation required by later layers.

4. Fully Connected Layers: Towards the end of the network, VGG16 has fully connected layers that act as a classifier to make predictions based on the learned features.

5. Activation Function: The network uses the Rectified Linear Unit (ReLU) activation function for all hidden layers, which helps with faster convergence during training.

6. Number of Filters: The number of filters grows with depth, from 64 in the first block to 512 in the last two. Stacking multiple layers allows VGG16 to learn complex hierarchical features.

7. Output Layer: The output layer consists of 1000 units, corresponding to 1000 ImageNet classes. VGG16 was originally trained on the large-scale ImageNet dataset, which contains millions of images from 1000 different classes.
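
The arithmetic behind these characteristics can be checked with a short plain-Python sketch. It assumes the standard VGG16 block layout (2, 2, 3, 3, 3 convolutional layers with 64, 128, 256, 512, 512 filters) and a 32x32 RGB input like the CIFAR-10 images; the helper names are hypothetical.

```python
# VGG16 convolutional blocks: (number of conv layers, filters per layer)
BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of one Conv2D layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def trace(input_size=32, in_channels=3):
    """Return the spatial size after each block and the total conv parameter count."""
    size, channels, total = input_size, in_channels, 0
    sizes = [size]
    for n_convs, filters in BLOCKS:
        for _ in range(n_convs):
            # 3x3 convolutions with "same" padding keep the spatial size unchanged.
            total += conv_params(channels, filters)
            channels = filters
        size //= 2  # 2x2 max pooling with stride 2 halves width and height
        sizes.append(size)
    return sizes, total

sizes, total = trace()
print(sizes)               # [32, 16, 8, 4, 2, 1]
print(total)               # 14714688
print(conv_params(3, 64))  # 1792
```

With a 32x32 input, the five pooling stages shrink the feature map to 1x1, so the convolutional base emits a single 512-channel vector per image.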

VGG16 was instrumental in showing that increasing the depth of a neural network can significantly improve its performance on image recognition tasks. However, the main drawback of VGG16 is its high number of parameters, making it computationally expensive and memory-intensive to train. Despite this limitation, VGG16 remains an essential benchmark architecture and has paved the way for even deeper and more efficient models in the field of computer vision, such as ResNet, DenseNet, and EfficientNet.

Here are your tasks:

1. Load [VGG16](https://keras.io/api/applications/vgg/#vgg16-function) as a base model. Make sure to exclude the top layer.
2. Freeze all the layers in the base model. The frozen layers act as a fixed feature extractor whose outputs feed the trainable layers added later.

In [6]:
# Load the pre-trained VGG16 model (excluding the top classifier)
base_model = VGG16(include_top=False, weights='imagenet', input_shape=(32, 32, 3))

# Display a model summary to illustrate that all of the layers are currently trainable.
base_model.summary()

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
Model: "vgg16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 32, 32, 3)]       0         
                                                                 
 block1_conv1 (Conv2D)       (None, 32, 32, 64)        1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 32, 32, 64)        36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 16, 16, 64)        0         
                                                                 
 block2_conv1 (Conv2D)       (None, 16, 16, 128)       73856     
                                                                 
 block2_conv2 (Conv2D)       (None, 16, 16, 128)      

In [7]:
# Freeze the layers in the base model
for layer in base_model.layers:
    layer.trainable = False

# Display the updated summary to verify that the layers have been frozen.
base_model.summary()

Model: "vgg16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 32, 32, 3)]       0         
                                                                 
 block1_conv1 (Conv2D)       (None, 32, 32, 64)        1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 32, 32, 64)        36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 16, 16, 64)        0         
                                                                 
 block2_conv1 (Conv2D)       (None, 16, 16, 128)       73856     
                                                                 
 block2_conv2 (Conv2D)       (None, 16, 16, 128)       147584    
                                                                 
 block2_pool (MaxPooling2D)  (None, 8, 8, 128)         0     

Now, we'll add some trainable layers to the base model.

1. Using the base model, add a [GlobalAveragePooling2D](https://keras.io/api/layers/pooling_layers/global_average_pooling2d/) layer, followed by a [Dense](https://keras.io/api/layers/core_layers/dense/) layer with 256 units and ReLU activation. Finally, add a classification layer with 10 units, corresponding to the 10 CIFAR-10 classes, with softmax activation.
2. Create a Keras [Model](https://keras.io/api/models/model/) that takes in appropriate inputs and outputs.
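
As a rough check on how many trainable parameters this head adds, the counts can be worked out by hand. The sketch assumes the pooled feature vector has 512 values, which is the channel count of VGG16's last convolutional block; `dense_params` is a hypothetical helper.

```python
def dense_params(inputs, units):
    """Weights plus biases of one Dense layer."""
    return inputs * units + units

pooled_features = 512                        # assumed output width of GlobalAveragePooling2D
hidden = dense_params(pooled_features, 256)  # Dense(256)
output = dense_params(256, 10)               # Dense(10)
print(hidden, output, hidden + output)       # 131328 2570 133898
```

These are the only parameters updated during training while the base model stays frozen.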

In [8]:
# Add a global average pooling layer
base_output = base_model.output
x = GlobalAveragePooling2D()(base_output)

In [9]:
# Add a fully connected layer with 256 units and ReLU activation
x = Dense(256, activation='relu')(x)

In [10]:
# Add the final classification layer with 10 units (for CIFAR-10 classes) and softmax activation
predictions = Dense(10, activation='softmax')(x)

In [11]:
# Create the fine-tuned model
model = Model(inputs=base_model.input, outputs=predictions)
model.summary() # Display the new model architecture

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 32, 32, 3)]       0         
                                                                 
 block1_conv1 (Conv2D)       (None, 32, 32, 64)        1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 32, 32, 64)        36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 16, 16, 64)        0         
                                                                 
 block2_conv1 (Conv2D)       (None, 16, 16, 128)       73856     
                                                                 
 block2_conv2 (Conv2D)       (None, 16, 16, 128)       147584    
                                                                 
 block2_pool (MaxPooling2D)  (None, 8, 8, 128)         0     

With your model complete it's time to train it and assess its performance.

1. Compile your model using an appropriate loss function. Feel free to play around with the optimizer, but a good starting optimizer might be Adam with a learning rate of 0.001.
2. Fit your model on the training data. Use the validation data to print the accuracy for each epoch. Try training for 10 epochs. Note, training can take a few hours so go ahead and grab a cup of coffee.

**Optional**: See if you can implement an [Early Stopping](https://keras.io/api/callbacks/early_stopping/) criterion as a callback.

In [12]:
# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
# Train the model

# Import EarlyStopping
from tensorflow.keras.callbacks import EarlyStopping

# Create the early stopping criteria
early_stopping = EarlyStopping(monitor='val_loss', patience=3, verbose=1, mode='min', restore_best_weights=True)

# Fit the model on the CIFAR10 data.
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10, callbacks=[early_stopping], batch_size=32, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
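
The patience rule behind the `EarlyStopping` callback used above can be sketched in plain Python. This is a simplified illustration, not Keras's implementation: it ignores options such as `min_delta` and `baseline`, and the function name `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stopped_epoch, best_epoch) using 0-based epoch indices.

    stopped_epoch is None when training runs through every epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Validation loss improves until epoch 2, then stalls for three epochs.
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]))  # (5, 2)
```

With `restore_best_weights=True`, the weights from the best epoch (epoch 2 in this made-up run) are the ones kept after stopping.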

With your model trained, it's time to assess how well it performs on the test data.

1. Use your trained model to calculate the accuracy on the test set. Is the model performance better than random?
2. Experiment! See if you can tweak your model to improve performance.  

In [14]:
# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(x_test, y_test, verbose=1)
print(f"Test accuracy: {test_accuracy*100:.2f}%")

Test accuracy: 60.87%


# **Experiments:**

Below are the experiments I ran to try to beat the baseline test accuracy of **60.87%**.  
1. First, I added **regularization** to test its effect. It reduced initial accuracy but improved generalization, so I kept it in the model for the next experiment. This model reached a test accuracy of only **53.74%**.
2. With regularization in place, I ran a small **random search** over three hyperparameters: the regularization rate, the optimizer learning rate, and the number of units in the Dense layer. The search found a configuration that beat the regularized model, but it still fell short of the baseline. The best model from the search had a test accuracy of **60.04%**.
3. Since the hyperparameter search did not improve accuracy, I reverted to the original model and added an extra **feature extraction** layer that could be trained on top of the frozen base. This model achieved a higher test accuracy of **61.54%**.

# **Conclusion:**

After achieving a higher test accuracy by adding the extra feature extraction layer, I have a few more areas I'd like to explore. However, because I ran these experiments in a free Google Colab environment, I have decided to continue them elsewhere to avoid session timeouts. While training both the convolutional model and the baseline model, there were signs that the model was overfitting the training data. There are several avenues to explore to improve generalization.

# **Additional Ideas:**
1. Introduce Dropout layers, and try different dropout rates, to see whether they mitigate some of the overfitting.
2. Apply data augmentation to increase the size and variety of the training data and improve generalization.
3. Conduct another hyperparameter search over whichever model architecture proves most effective.
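
As an illustration of the augmentation idea, here is a minimal NumPy sketch of one common transform, random horizontal flips. It is only one possible approach and was not part of the experiments above; the function name and the random stand-in batch are hypothetical. Keras also provides preprocessing layers such as `RandomFlip` that do this inside the model.

```python
import numpy as np

def random_horizontal_flip(images, rng, p=0.5):
    """Flip each image left-right with probability p. images has shape (n, h, w, c)."""
    flip = rng.random(len(images)) < p      # one coin toss per image
    out = images.copy()
    out[flip] = out[flip][:, :, ::-1, :]    # reverse the width axis of the chosen images
    return out

rng = np.random.default_rng(0)
batch = rng.random((8, 32, 32, 3)).astype("float32")  # stand-in for a CIFAR-10 batch
augmented = random_horizontal_flip(batch, rng)
print(augmented.shape)  # (8, 32, 32, 3)
```

Labels are unchanged by a horizontal flip, so the same `y` values can be reused for the augmented images.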

**Experiment: *Adding Additional Feature Extraction***

In [8]:
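# Experiment: Add a convolutional layer before the Dense layers to extract further features.
# Import from tensorflow.keras, as in the rest of the notebook, to avoid mixing Keras packages.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Conv2D, BatchNormalization, ReLU, GlobalAveragePooling2D, Dense
from tensorflow.keras.regularizers import l2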

# Grab the base model as before
base_model_output = base_model.output

# Add a new Conv2D layer
x = Conv2D(filters=512, kernel_size=(3, 3), padding='same')(base_model_output)
x = BatchNormalization()(x)
x = ReLU()(x)

# Continue with the global average pooling and dense layers as before
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)  # Existing Dense layer
predictions = Dense(10, activation='softmax')(x)  # Existing output layer

# Create the new model
model_with_conv = Model(inputs=base_model.input, outputs=predictions)

# Compile and fit the model as before
model_with_conv.compile(optimizer='adam',
                        loss='categorical_crossentropy',
                        metrics=['accuracy'])


In [9]:
# Fit the model to the training data.
# Import EarlyStopping
from tensorflow.keras.callbacks import EarlyStopping

# Create the early stopping criteria
early_stopping = EarlyStopping(monitor='val_loss', patience=3, verbose=1, mode='min', restore_best_weights=True)

# Fit the model on the CIFAR10 data.
history = model_with_conv.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10, callbacks=[early_stopping], batch_size=32, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 10: early stopping


In [12]:
# Evaluate the model on the test set
test_loss_conv, test_accuracy_conv = model_with_conv.evaluate(x_test, y_test, verbose=1)
print(f"Test Accuracy: {test_accuracy_conv*100:.2f}%")

Test Accuracy: 61.54%


**Experiment: *Regularization***

In [15]:
# Experiment: Testing the effect of introducing regularization.
from tensorflow.keras.regularizers import l2

# Construct the model
base_model_output = base_model.output
x = GlobalAveragePooling2D()(base_model_output)
x = Dense(256, activation='relu', kernel_regularizer=l2(0.01))(x) # Added a Dense layer with L2 regularization
predictions = Dense(10, activation='softmax')(x)

# Create the new model with regularization
model_with_reg = Model(inputs=base_model.input, outputs=predictions)

# Compile and fit the model as before
model_with_reg.compile(optimizer=Adam(learning_rate=0.001),
                       loss='categorical_crossentropy',
                       metrics=['accuracy'])

# Fit the new model.
history_with_reg = model_with_reg.fit(x_train, y_train,
                                      epochs=10,
                                      batch_size=32,
                                      validation_data=(x_val, y_val),
                                      callbacks=[EarlyStopping(monitor='val_loss', patience=3, verbose=1)],
                                      verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [16]:
# Evaluate the new model's accuracy.
test_loss_reg, test_accuracy_reg = model_with_reg.evaluate(x_test, y_test, verbose=1)
print(f"Test accuracy: {test_accuracy_reg*100:.2f}%")

Test accuracy: 53.74%
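
For reference, `l2(0.01)` adds a penalty of 0.01 times the sum of the squared kernel weights to the loss. A minimal sketch of that arithmetic follows; the tiny kernel and the cross-entropy value are made up for illustration.

```python
def l2_penalty(kernel, rate):
    """rate * sum of squared weights, the term an L2 kernel regularizer adds to the loss."""
    return rate * sum(w * w for row in kernel for w in row)

kernel = [[1.0, -2.0], [3.0, 4.0]]   # hypothetical 2x2 Dense kernel; squares sum to 30
penalty = l2_penalty(kernel, 0.01)
data_loss = 1.20                     # hypothetical cross-entropy value
total_loss = data_loss + penalty     # the quantity the optimizer minimizes
print(round(penalty, 4), round(total_loss, 4))  # 0.3 1.5
```

Because the penalty is part of the reported loss, loss values from the regularized model are not directly comparable with those of the unregularized baseline; accuracy is the fairer comparison.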


**Hyperparameter Search:** regularization rates, learning rates, # of units in Dense layer

In [17]:
pip install keras-tuner

Collecting keras-tuner
  Downloading keras_tuner-1.4.6-py3-none-any.whl (128 kB)
Collecting kt-legacy (from keras-tuner)
  Downloading kt_legacy-1.0.5-py3-none-any.whl (9.6 kB)
Installing collected packages: kt-legacy, keras-tuner
Successfully installed keras-tuner-1.4.6 kt-legacy-1.0.5


In [19]:
from keras_tuner import HyperModel

# Class of HyperModel to be tuned
class MyHyperModel(HyperModel):
    def __init__(self, base_model):
        super().__init__()  # Let HyperModel set up its own attributes first
        self.base_model = base_model

    def build(self, hp):
        # Freeze the base model
        for layer in self.base_model.layers:
            layer.trainable = False

        # Start model definition
        base_model_output = self.base_model.output
        x = GlobalAveragePooling2D()(base_model_output)

        # Hyperparameters
        x = Dense(units=hp.Int('units', min_value=256, max_value=1024, step=128),
                  activation='relu',
                  kernel_regularizer=l2(hp.Float('l2', min_value=1e-4, max_value=1e-2, sampling='LOG')))(x)
        outputs = Dense(10, activation='softmax')(x)

        model = Model(inputs=self.base_model.input, outputs=outputs)

        # Compile model
        model.compile(optimizer=Adam(hp.Float('learning_rate', min_value=1e-4, max_value=1e-2, sampling='LOG')),
                      loss='categorical_crossentropy',
                      metrics=['accuracy'])
        return model
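
The `sampling='LOG'` option asks the tuner to spread its draws evenly across orders of magnitude rather than evenly across the raw range. The idea can be sketched in plain Python; this illustrates log-uniform sampling in general and is not KerasTuner's own code, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(min_value, max_value, rng):
    """Draw a value uniformly in log10 space between min_value and max_value."""
    exponent = rng.uniform(math.log10(min_value), math.log10(max_value))
    return 10 ** exponent

rng = random.Random(0)
samples = [sample_log_uniform(1e-4, 1e-2, rng) for _ in range(1000)]

# Roughly half of the draws land in each decade: [1e-4, 1e-3) and [1e-3, 1e-2].
below_1e3 = sum(s < 1e-3 for s in samples) / len(samples)
print(f"{below_1e3:.2f}")
```

With linear sampling over the same range, about 90% of the draws would fall above 1e-3, so small learning rates and regularization rates would rarely be tried.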

In [None]:
# Experiment: Conduct a small random search over the number of Dense units, the optimizer learning rate, and the regularization rate.
from keras_tuner import RandomSearch

# Instantiate the hypermodel
hypermodel = MyHyperModel(base_model=VGG16(include_top=False, weights='imagenet', input_shape=(32, 32, 3)))

# Initialize the tuner
tuner = RandomSearch(
    hypermodel,
    objective='val_accuracy',
    max_trials=6,  # Set to a reasonable number to limit search time
    executions_per_trial=1,
    directory='my_dir',
    project_name='keras_tuner_vgg16_cifar10'
)

# Perform the hyperparameter search
tuner.search(x_train, y_train,
             epochs=10,
             validation_data=(x_val, y_val),
             callbacks=[EarlyStopping(monitor='val_loss', patience=3, verbose=1)])

In [21]:
# Get the optimal hyperparameters
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]

print(f"""
The hyperparameter search is complete. The optimal number of units in the first densely-connected
layer is {best_hps.get('units')} and the optimal learning rate for the optimizer
is {best_hps.get('learning_rate')}. The optimal L2 regularization strength is {best_hps.get('l2')}.
""")

# Get the best model and evaluate accuracy.
best_model = tuner.get_best_models(num_models=1)[0]
test_loss_best, test_accuracy_best = best_model.evaluate(x_test, y_test)
print(f"Test Loss: {test_loss_best}, Test Accuracy: {test_accuracy_best}")


The hyperparameter search is complete. The optimal number of units in the first densely-connected
layer is 896 and the optimal learning rate for the optimizer
is 0.001490488195996845. The optimal L2 regularization strength is 0.00040995319944661575.

Test Loss: 1.2706390619277954, Test Accuracy: 0.6004999876022339


**Additional Analysis:**

After conducting this search, we can see that:
* None of the configurations exceeded the initial model's performance within 10 epochs of training.
* Although the search didn't beat the baseline model, the configuration with the most units in the dense layer performed best.


**Other ideas:**
* different optimizers (SGD, etc.)
* adding conv layers

**SGD Optimizer**

In [25]:
from tensorflow.keras.optimizers import SGD

# Construct the model
base_model_output = base_model.output
x = GlobalAveragePooling2D()(base_model_output)
x = Dense(256, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)

# Create the new model
model_with_sgd = Model(inputs=base_model.input, outputs=predictions)

# Compile the model using SGD as the optimizer
model_with_sgd.compile(optimizer=SGD(learning_rate=0.001, momentum=0.9),
                       loss='categorical_crossentropy',
                       metrics=['accuracy'])

# Fit the new model
history_with_sgd = model_with_sgd.fit(x_train, y_train,
                                      epochs=10,
                                      batch_size=32,
                                      validation_data=(x_val, y_val),
                                      callbacks=[EarlyStopping(monitor='val_loss', patience=3, verbose=1)],
                                      verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
# Evaluate the new model's accuracy.
test_loss_sgd, test_accuracy_sgd = model_with_sgd.evaluate(x_test, y_test, verbose=1)
print(f"Test accuracy: {test_accuracy_sgd*100:.2f}%")
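
For context on the optimizer settings above, the update that SGD with momentum applies can be sketched for a single scalar weight. This follows the rule given in the Keras SGD documentation for the non-Nesterov case; the function name and the constant gradient are hypothetical.

```python
def sgd_momentum_step(weight, velocity, gradient, learning_rate=0.001, momentum=0.9):
    """One SGD-with-momentum update for a scalar weight."""
    velocity = momentum * velocity - learning_rate * gradient
    return weight + velocity, velocity

w, v = 1.0, 0.0
for step in range(2):
    w, v = sgd_momentum_step(w, v, gradient=2.0)  # pretend the gradient stays at 2.0
    print(step, round(w, 6), round(v, 6))
# 0 0.998 -0.002
# 1 0.9942 -0.0038
```

The second step is larger than the first because the velocity carries over 90% of the previous update, which is what lets momentum speed up progress along a consistent gradient direction.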