**NOTE: This notebook is written for the Google Colab platform, which provides free hardware acceleration. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook using a local GPU.**

In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install skorch

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from torchvision import datasets, models, transforms
from skorch import NeuralNetClassifier, NeuralNetRegressor
from skorch.callbacks import Checkpoint, EarlyStopping
from skorch.helper import predefined_split
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn
import torch

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
!mkdir -p data/food5v2
!wget -nc -O data/food5v2.zip "https://www.dropbox.com/s/w4pg809npvatye0/food5v2.zip?dl=1"
!unzip -oq -d data/food5v2 data/food5v2.zip

In [None]:
#@title -- Auxiliary Functions -- { display-mode: "form" }
def predefined_array_split(X_valid, Y_valid):
    return predefined_split(
        TensorDataset(
            torch.as_tensor(X_valid),
            torch.as_tensor(Y_valid)
        )
    )

As usual, we will select the device to run the network on automatically.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

# Transfer Learning

In this notebook we will use the **Food 5** dataset to illustrate transfer learning. The dataset is a downsized version of the [Food 11](https://www.kaggle.com/vermaavi/food11) dataset.

Transfer learning is a very useful technique. Under ordinary circumstances, deep learning requires a huge amount of data and computation. If we apply it to a small dataset, we will typically not achieve good generalization, because a small dataset cannot cover all the variations of samples that a model may encounter. In image recognition, for instance, there is a virtually infinite number of variations that a photo of a dog can take: the environment, the lighting, the breed of the dog, the angle and other aspects can all change. A small dataset is very unlikely to cover such a complex space sufficiently.

One of the solutions that allows us to apply deep learning to small datasets in spite of these problems is **transfer learning**. With this technique the neural network is first pre-trained on a large, more general dataset (for image recognition this tends to be the ImageNet dataset). The network uses this dataset to learn what natural images look like and how they need to be processed. Once this pre-training is complete, the network is further trained for the specific target task.

## The Overall Procedure

The overall procedure for transfer learning in image recognition:
* Pre-train a network on ImageNet.

* Remove one or several of the final layers (the top of the network) and replace them with new layers. The new output layer will now have as many outputs as there are classes in the dataset.

* Freeze the weights of the pre-trained layers and train only the new layers on the target dataset.

* Once the new layers have been trained, we can optionally unfreeze the weights of the pre-trained layers as well and fine-tune the network as a whole. This requires a significantly lower learning rate, both so that we do not destroy the pre-trained layers with excessively aggressive updates and because the risk of overfitting tends to increase once the pre-trained layers can be modified.
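The steps above can be sketched on a toy model. This is only an illustration under stated assumptions: `toy_net` is a hypothetical stand-in with random weights, not a network pre-trained on ImageNet, and the layer sizes and the class count of 5 are made up.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained network: a small "body" plus an
# output layer. A real use case would load ImageNet weights instead.
toy_net = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 1000),  # original output layer (e.g. 1000 ImageNet classes)
)

# Replace the final layer with one sized for the target task.
num_target_classes = 5  # assumed number of classes in the target dataset
toy_net[-1] = nn.Linear(16, num_target_classes)

# Freeze everything except the new final layer.
for layer in list(toy_net.children())[:-1]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = [name for name, p in toy_net.named_parameters() if p.requires_grad]
print(trainable)  # only the parameters of the new layer remain trainable

# Optional last step: unfreeze everything and fine-tune with a much lower
# learning rate.
for param in toy_net.parameters():
    param.requires_grad = True
fine_tune_optimizer = torch.optim.Adam(toy_net.parameters(), lr=1e-5)
```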

## Preparation of the Dataset

As usual, let us start by preparing our dataset. For most image recognition tasks the dataset will be too large to fit into memory at once. We will therefore not attempt to load all the data as ``numpy`` arrays the way we have done until now. Instead, we will use the dataset abstraction from ``PyTorch``: we will construct an object that represents our data and loads the images from the hard drive on request.
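To make the idea of a dataset object more concrete, here is a minimal sketch of a dataset that produces its samples on request. The class `LazySquares` is hypothetical and generates numbers rather than loading images, but the interface (`__len__` and `__getitem__`) is the one that ``PyTorch`` data loaders rely on.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LazySquares(Dataset):
    """Toy dataset: each sample is created when requested, not stored."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        # An image dataset would read and decode a file from disk here.
        x = torch.tensor([float(idx)])
        y = torch.tensor([float(idx) ** 2])
        return x, y

lazy_dataset = LazySquares(10)
lazy_loader = DataLoader(lazy_dataset, batch_size=4)

# The loader asks the dataset for samples and stacks them into batches.
batch_shapes = [tuple(x.shape) for x, _ in lazy_loader]
print(batch_shapes)
```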

In the present case, our data comes pre-split into the train, validation and test folds, with each stored in a separate folder. The folders are structured so that each class has its own subfolder.

In [None]:
!ls data/food5v2

In [None]:
!ls data/food5v2/training

Given that our data has this structure, we can use the ``ImageFolder`` dataset class from ``torchvision``.

Every image will need to be preprocessed before it is passed to the neural network: it will need to be resized, cropped and normalized in a way that matches the preprocessing that was done when pre-training the original neural network. We will now define two preprocessing procedures. The first one will do standard preprocessing. The second one will include augmentation: there will be a few randomized steps that will modify the image every time that it is loaded. This is to add more variety to the training data: essentially, the network will never see the exact same image twice.

In [None]:
normal_preproc = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

augment_preproc = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

Next we can construct the ``ImageFolder`` datasets themselves. We specify the paths to the individual folds of our dataset as well as the way in which the images should be preprocessed for each fold. We will use the normal pipeline for validation and testing data and the pipeline with augmentation for training data.

In [None]:
train_dataset = datasets.ImageFolder(
    "data/food5v2/training",
    augment_preproc
)

valid_dataset = datasets.ImageFolder(
    "data/food5v2/validation",
    normal_preproc
)

test_dataset = datasets.ImageFolder(
    "data/food5v2/testing",
    normal_preproc
)

### Displaying a Few Samples

In [None]:
#@title -- Display Data Samples --
disp_dataset = datasets.ImageFolder(
    "data/food5v2/training",
    transforms.ToTensor()
)
loader = DataLoader(disp_dataset, batch_size=1, shuffle=True)
loader_iter = iter(loader)

num_rows = 4; num_cols = 4
fig, axes = plt.subplots(num_rows, num_cols, figsize=(10, 8))

for row in axes:
    for ax in row:
        sample = next(loader_iter)[0][0].numpy().transpose((1, 2, 0))
        ax.imshow(sample)
        ax.set_xticks([])
        ax.set_yticks([])

## Loading the Pre-Trained Network

We load a pre-trained ResNet50 network. The weights pre-trained on ImageNet will download automatically.

In [None]:
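# pretrained=True loads the ImageNet weights; newer torchvision releases
# prefer the ``weights=`` argument and may emit a deprecation warning here.
net = models.resnet50(pretrained=True)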

## Replacing the Final Layer

We will replace the last fully-connected layer of our ResNet (``net.fc``) with a new module that contains a dropout layer, a fully-connected layer and a softmax activation.

In [None]:
class ModelTop(nn.Module):
    def __init__(self, num_features, num_outputs):
        super().__init__()
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(num_features, num_outputs)
    
    def forward(self, x):
        y = torch.flatten(x, 1)
        y = self.dropout(y)
        y = self.fc(y)
        y = torch.softmax(y, dim=1)
        return y

In [None]:
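# Size the output layer to the number of classes found in the training folder.
top = ModelTop(
    num_features=net.fc.in_features,
    num_outputs=len(train_dataset.classes)
)
net.fc = top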
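The softmax inside the new module is there because, by default, ``NeuralNetClassifier`` expects the module to output probabilities: to the best of our understanding it takes their logarithm and applies a negative log-likelihood loss. The following sketch uses made-up logits and targets to check that this combination matches the usual cross-entropy on raw logits.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 5)             # made-up scores: 4 samples, 5 classes
targets = torch.tensor([0, 3, 1, 4])   # made-up labels

# Softmax turns every row of scores into a probability distribution.
probs = torch.softmax(logits, dim=1)

# Negative log-likelihood on log-probabilities ...
nll = nn.NLLLoss()(torch.log(probs), targets)
# ... equals cross-entropy computed directly on the logits.
ce = nn.CrossEntropyLoss()(logits, targets)

print(probs.sum(dim=1), torch.allclose(nll, ce, atol=1e-5))
```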

## Training the New Layers

Recall that at first we only want to train our new top layers and leave the pre-trained layers as they are. We will therefore freeze all the layers except the last one by setting the ``requires_grad`` flag of all their parameters to ``False``. We define an auxiliary function that does this and call it.

In [None]:
def freeze_except_last(model, freeze):
    for layer in list(model.children())[:-1]:
        for param in layer.parameters():
            param.requires_grad = not freeze

In [None]:
freeze_except_last(net, True)
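A quick way to confirm that freezing worked is to count the trainable parameters. The helper `count_parameters` and the small `demo_net` below are hypothetical; the same helper could be called on `net`.

```python
import torch.nn as nn

def count_parameters(model):
    """Return (trainable, total) parameter counts of a module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Small stand-in network with the same freezing pattern applied.
demo_net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 5))
for layer in list(demo_net.children())[:-1]:
    for param in layer.parameters():
        param.requires_grad = False

trainable_count, total_count = count_parameters(demo_net)
print(trainable_count, total_count)  # 85 trainable out of 229 in total
```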

We create our usual ``NeuralNetClassifier`` object. We will be using a ``Checkpoint`` callback. This will ensure that the best (lowest validation loss) version of the weights will always get saved to a file.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

model = NeuralNetClassifier(
    net,
    max_epochs=20,
    batch_size=64,
    lr=1e-3,
    optimizer=torch.optim.Adam,
    train_split=predefined_split(valid_dataset),
    iterator_train__shuffle=True,
    device=device,
    callbacks=[Checkpoint(f_params="train.pt")]
)

In [None]:
model.fit(train_dataset, y=None);

Once the training is finished, we load the weights from the checkpoint file. This ensures that we continue with the best weights found during training rather than the last weights (which may already have overfitted).

In [None]:
model.load_params(f_params="train.pt")
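The idea of keeping and restoring the best weights can be sketched in plain ``PyTorch``. This is only a conceptual illustration, not how ``skorch`` implements ``Checkpoint``: the validation losses are made-up numbers, the "training" merely overwrites the weights, and the best state is kept in memory rather than written to a file.

```python
import copy

import torch
import torch.nn as nn

toy_model = nn.Linear(2, 1)
simulated_valid_losses = [0.9, 0.6, 0.7, 0.65]  # made-up values

best_loss, best_epoch, best_state = float("inf"), None, None
for epoch, valid_loss in enumerate(simulated_valid_losses):
    # Stand-in for a training epoch: overwrite the weights so that every
    # epoch leaves the model in a recognizably different state.
    with torch.no_grad():
        toy_model.weight.fill_(float(epoch))

    # Keep a copy of the weights whenever the validation loss improves.
    if valid_loss < best_loss:
        best_loss, best_epoch = valid_loss, epoch
        best_state = copy.deepcopy(toy_model.state_dict())

# After training, return to the best weights instead of the last ones.
toy_model.load_state_dict(best_state)
print(best_epoch, toy_model.weight)
```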

After this first phase we are ready for testing. However, we are not done with our model yet so we will only be testing it on the **validation set, not on the testing set**.

In [None]:
Y_valid = []  # true labels
y_valid = []  # predicted labels

for X_batch, Y_batch in model.get_iterator(valid_dataset):
    Y_valid.extend(Y_batch.numpy())
    y_valid.extend(model.predict(X_batch))

print("Validation set accuracy: {}.".format(
    accuracy_score(Y_valid, y_valid)
))

## Fine-tuning the Entire Model

Having trained the new top of the model, we will now unfreeze the rest of the network and continue training. However, we will lower the learning rate significantly to ensure that we do not undo the work done so far by taking overly aggressive steps.

This time we will also be using early stopping so we add it as a further callback.

In [None]:
freeze_except_last(net, False)

In [None]:
model = NeuralNetClassifier(
    net,
    max_epochs=40,
    batch_size=64,
    lr=1e-5,
    optimizer=torch.optim.Adam,
    train_split=predefined_split(valid_dataset),
    iterator_train__shuffle=True,
    device=device, 
    callbacks=[
        Checkpoint(f_params="finetune.pt"),
        EarlyStopping(patience=15)
    ]
)
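The role of the patience parameter in early stopping can be illustrated with a short sketch. The function `early_stopping_epoch` is hypothetical and simplified: it stops once the validation loss has failed to improve for `patience` consecutive epochs, whereas the ``skorch`` callback additionally applies an improvement threshold.

```python
def early_stopping_epoch(valid_losses, patience):
    """Return the epoch at which training would stop, or None if it never does."""
    best_loss = float("inf")
    epochs_without_improvement = 0

    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1

        if epochs_without_improvement >= patience:
            return epoch

    return None

# Made-up validation losses: the best value is reached at epoch 1.
losses = [1.0, 0.8, 0.9, 0.85, 0.95]
print(early_stopping_epoch(losses, patience=3))   # 4
print(early_stopping_epoch(losses, patience=15))  # None
```

With a large patience such as 15, training only stops early if the validation loss stagnates for a long stretch, which is why the checkpoint is still needed to recover the best weights.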

In [None]:
model.fit(train_dataset, y=None);

After training, we will again restore the best weights using the checkpoint file.

In [None]:
model.load_params(f_params="finetune.pt")

## Testing

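Having trained the final version of our model, we will now test it on all three folds of our data: the training, the validation and the testing sets. Note that the training set is still loaded through the augmentation pipeline, so the training accuracy is measured on randomly augmented images.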

In [None]:
Y_train = []
y_train = []

for X_batch, Y_batch in model.get_iterator(train_dataset):
    Y_train.extend(Y_batch.numpy())
    y_train.extend(model.predict(X_batch))
    
print("Train set accuracy: {}.".format(
    accuracy_score(Y_train, y_train)
))

In [None]:
Y_valid = []
y_valid = []

for X_batch, Y_batch in model.get_iterator(valid_dataset):
    Y_valid.extend(Y_batch.numpy())
    y_valid.extend(model.predict(X_batch))
    
print("Validation set accuracy: {}.".format(
    accuracy_score(Y_valid, y_valid)
))

In [None]:
Y_test = []
y_test = []

for X_batch, Y_batch in model.get_iterator(test_dataset):
    Y_test.extend(Y_batch.numpy())
    y_test.extend(model.predict(X_batch))
    
print("Test set accuracy: {}.".format(
    accuracy_score(Y_test, y_test)
))
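The three evaluation loops above share the same structure, so they could be folded into one helper. The function `batched_accuracy` below is a hypothetical sketch; it is demonstrated on toy batches and a toy threshold classifier rather than on the notebook's model.

```python
import numpy as np

def batched_accuracy(predict_fn, batches):
    """Accuracy of predict_fn over an iterable of (inputs, labels) batches."""
    labels, predictions = [], []
    for X_batch, Y_batch in batches:
        labels.extend(np.asarray(Y_batch))
        predictions.extend(predict_fn(X_batch))
    return float(np.mean(np.asarray(labels) == np.asarray(predictions)))

# Toy stand-ins: two batches and a classifier that thresholds at 0.5.
toy_batches = [
    (np.array([[0.2], [0.9]]), np.array([0, 1])),
    (np.array([[0.7], [0.1]]), np.array([0, 0])),
]

def toy_predict(X):
    return (X[:, 0] > 0.5).astype(int)

accuracy = batched_accuracy(toy_predict, toy_batches)
print(accuracy)  # 0.75
```

Assuming the batches arrive as CPU tensors, a call such as `batched_accuracy(model.predict, model.get_iterator(test_dataset))` should reproduce the loops above.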

## An Alternative: Using the Pre-trained Network as a Feature Extractor

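There is an alternative approach that we can take: we can use the pre-trained network as a feature extractor and run the whole dataset through it once. We then train the new top as a separate network on the extracted features, which is significantly faster because the expensive forward pass through the pre-trained layers has already been done. The trade-off is that the pre-trained layers cannot be fine-tuned this way and that each training image is seen with only the single random augmentation that was applied when its features were extracted.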

In [None]:
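pretrained_net = models.resnet50(pretrained=True)
# An empty nn.Sequential passes its input through unchanged, so the network
# now outputs the pooled features instead of class scores.
pretrained_net.fc = nn.Sequential()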

In [None]:
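# NeuralNetRegressor serves here only as a convenient wrapper: its predict
# method returns the raw outputs of the module, which are our features.
feature_extractor = NeuralNetRegressor(
    pretrained_net,
    batch_size=64,
    device=device,
)

# No training takes place; initialize() just prepares the wrapper for predict.
feature_extractor.initialize();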

In [None]:
def preproc_data(feature_extractor, dataset):
    X = []
    Y = []
    
    for X_batch, Y_batch in feature_extractor.get_iterator(dataset):
        X.extend(feature_extractor.predict(X_batch))
        Y.extend(Y_batch.numpy())
  
    return np.asarray(X), np.asarray(Y)

In [None]:
X_train, Y_train = preproc_data(feature_extractor, train_dataset)
X_valid, Y_valid = preproc_data(feature_extractor, valid_dataset)
X_test, Y_test = preproc_data(feature_extractor, test_dataset)
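For comparison, the same kind of feature extraction can be sketched in plain ``PyTorch`` without the ``skorch`` wrapper. The `backbone` below is a hypothetical stand-in with random weights and the data is random; the point is the pattern of switching to evaluation mode, disabling gradients and collecting the outputs batch by batch.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-in for the pre-trained network without its top.
backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
backbone.eval()  # evaluation mode: dropout and batch norm behave deterministically

# Random data standing in for an image dataset: 10 samples, 5 classes.
toy_data = TensorDataset(torch.randn(10, 8), torch.randint(0, 5, (10,)))

features, labels = [], []
with torch.no_grad():  # no gradients are needed for feature extraction
    for X_batch, Y_batch in DataLoader(toy_data, batch_size=4):
        features.append(backbone(X_batch).numpy())
        labels.append(Y_batch.numpy())

X_feat = np.concatenate(features)
Y_feat = np.concatenate(labels)
print(X_feat.shape, Y_feat.shape)
```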

In [None]:
# One output per class found in the training folder.
top_net = ModelTop(X_train.shape[1], len(train_dataset.classes))

In [None]:
top_model = NeuralNetClassifier(
    top_net,
    max_epochs=500,
    batch_size=200,
    lr=1e-3,
    optimizer=torch.optim.Adam,
    train_split=predefined_array_split(X_valid, Y_valid),
    iterator_train__shuffle=True,
    device=device,
    callbacks=[
        Checkpoint(f_params="train_top.pt")
    ]
)

In [None]:
top_model.fit(X_train, Y_train);

In [None]:
top_model.load_params(f_params="train_top.pt")

### Testing

In [None]:
y_valid = top_model.predict(X_valid)

print("Validation set accuracy: {}.".format(
    accuracy_score(Y_valid, y_valid)
))

In [None]:
y_test = top_model.predict(X_test)

print("Test set accuracy: {}.".format(
    accuracy_score(Y_test, y_test)
))