# Transfer learning

In this lab we will use pretrained models to boost performance on smaller datasets. For this experiment, we will take an AlexNet model pretrained on the ImageNet dataset and fine-tune it to reach a good accuracy score on the Caltech 101 dataset.

### Prerequisites

1. In order to perform the experiments, please download in advance the Caltech 101 dataset from https://drive.google.com/file/d/137RyRjvTBkBiIfeYBNZBtViDHQ6_Ewsp/view
2. In the working directory please create a folder named 'dataset' and a subfolder named 'caltech101' within it. Extract the dataset in the subfolder. The overall folder structure should look as follows: dataset/caltech101/101_ObjectCategories.
3. Install the torchvision module using 'conda install torchvision' if you have not done so already.

In [1]:
from tqdm import tqdm
import numpy as np
import numpy.random as random
import torch
import torchvision
import warnings
import matplotlib.pyplot as plt
import typing as t
from torch import Tensor
from torch.utils.data import random_split
from torch.utils.data import DataLoader, Dataset
from torchvision.models import AlexNet_Weights


warnings.filterwarnings('ignore')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
seed = 42

torchvision.set_image_backend('PIL')
gen = torch.Generator()
gen.manual_seed(seed)
random.seed(seed)

Firstly, we will load the AlexNet model architecture using torchvision. All available models with their respective parameters can be found at: https://pytorch.org/vision/stable/models.html

In [2]:
model = torchvision.models.alexnet()

In the first run we will just load the model architecture, without the pretrained weights. We can visualize the model architecture as follows:

In [3]:
model

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
 

Next, we will load the Caltech 101 dataset and apply the necessary transformations to it. Afterwards, we will split the dataset into train, validation and test subsets.

In this block of code, define the dataloaders for train, validation and test and try to iterate through the data. What happens? Try to fix the problem using a lambda transform: https://pytorch.org/vision/stable/transforms.html#generic-transforms

In [4]:
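from torchvision.transforms.v2 import Compose, ToImage, ToDtype, Resize, Normalize, Lambda, Grayscale
from torchvision.transforms.v2 import Transform

# Reference preprocessing bundled with the pretrained weights
# (not used in this first run; it is applied in the experiments further down)
weights = AlexNet_Weights.DEFAULT
preprocess: Transform = weights.transforms()

# Transforms for the first run, applied per sample inside collate_fn
transform = Compose([
    ToImage(),                                  # PIL image -> tensor image
    ToDtype(dtype=torch.float32, scale=True),   # uint8 [0, 255] -> float [0, 1]
    Resize((224, 224)),
    # Some images do not have 3 channels: repeat the single channel 3 times
    Lambda(lambda x: x.repeat(3, 1, 1) if x.shape[0] != 3 else x),
])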


def collate_fn(batch):
    images = []
    labels = []
    for X, y in batch:
        images.append(transform(X))
        labels.append(y)
    return torch.stack(images), torch.tensor(labels)


# Load the raw dataset; the transforms are applied per batch in collate_fn
dataset = torchvision.datasets.Caltech101('./dataset', download=True)

# Split datasets (the seeded generator keeps the split reproducible)
batch_size = 16
n_samples = len(dataset)
train_ds, val_ds, test_ds = random_split(dataset, [0.8, 0.1, 0.1], generator=gen)

# Speedup settings
settings = {
    'batch_size': batch_size,
    'shuffle': True,
    'generator': gen,
    'collate_fn': collate_fn,
    'pin_memory': True,
    'pin_memory_device': device.type,
    'num_workers': 8,
    'prefetch_factor': 2
}

# Define dataloaders for train, validation and test
# Iterate through the dataloaders
train_dl = DataLoader(train_ds, **settings)
valid_dl = DataLoader(val_ds, **settings)
test_dl = DataLoader(test_ds, **settings)

Files already downloaded and verified
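
The failure the exercise asks about can be reproduced without the dataset. The sketch below assumes that the problem comes from images whose channel count is not 3 (for example grayscale images), which is what the `Lambda` in the cell above guards against. The tensors are random stand-ins, and `to_three_channels` is a hypothetical helper name, not part of torchvision.

```python
import torch


def to_three_channels(img: torch.Tensor) -> torch.Tensor:
    """Repeat a single-channel (1, H, W) image along the channel axis."""
    return img.repeat(3, 1, 1) if img.shape[0] != 3 else img


# Random stand-ins for one grayscale and one RGB image of the same size
gray = torch.rand(1, 224, 224)
rgb = torch.rand(3, 224, 224)

# Stacking tensors with different channel counts raises a RuntimeError
try:
    torch.stack([gray, rgb])
    stack_failed = False
except RuntimeError:
    stack_failed = True

# After the fix, both samples share the shape (3, 224, 224)
batch = torch.stack([to_three_channels(gray), to_three_channels(rgb)])
print(stack_failed, tuple(batch.shape))
```

The repeated channels are identical copies, so the image content is unchanged; only the shape is made compatible with the first convolution of AlexNet.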


With the dataset ready, it is now time to adapt the model architecture in order to fit our needs. Define a new classifier for the AlexNet model having the same structure, changing only the number of output neurons to 101.

In [5]:
model.classifier

Sequential(
  (0): Dropout(p=0.5, inplace=False)
  (1): Linear(in_features=9216, out_features=4096, bias=True)
  (2): ReLU(inplace=True)
  (3): Dropout(p=0.5, inplace=False)
  (4): Linear(in_features=4096, out_features=4096, bias=True)
  (5): ReLU(inplace=True)
  (6): Linear(in_features=4096, out_features=1000, bias=True)
)

In [6]:
import torch.nn as nn
from torch.nn import Dropout, Linear, ReLU


# Create a new classifier similar to AlexNet
model.classifier = torch.nn.Sequential(
    Dropout(p=0.5, inplace=False),
    Linear(in_features=9216, out_features=4096, bias=True),
    ReLU(inplace=True),
    Dropout(p=0.5, inplace=False),
    Linear(in_features=4096, out_features=4096, bias=True),
    ReLU(inplace=True),
    Linear(in_features=4096, out_features=101, bias=True)
)
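
As a quick sanity check, a classifier with this structure can be run on a dummy input. The sketch below assumes the feature extractor ends with 256 channels pooled to 6 x 6, as in the architecture printed earlier, which is where `in_features=9216` comes from. The names `classifier` and `pooled` are illustrative.

```python
import torch
from torch import nn

num_classes = 101

# Same layer layout as the AlexNet classifier, with 101 output neurons
classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 4096),   # 256 * 6 * 6 == 9216
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Linear(4096, num_classes),
)
classifier.eval()

# Dummy output of the avgpool layer for a batch of 2 images
pooled = torch.randn(2, 256, 6, 6)
with torch.no_grad():
    logits = classifier(torch.flatten(pooled, 1))

print(tuple(logits.shape))
```

One logit per class and per image is what `nn.CrossEntropyLoss` expects, together with integer class labels.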

### Training the model

Define an Adam optimizer with a learning rate of 1e-4 and a cross-entropy loss. Afterwards, train the model for 2 epochs and note the results.

In [7]:

class Metrics(t.TypedDict):
    accuracy: t.List[float]
    loss: t.List[float]


class TrainHistory(t.TypedDict):
    train: Metrics
    valid: Metrics


def train_validate(model: nn.Module,
                   train_dl: DataLoader,
                   valid_dl: DataLoader,
                   epochs: int,
                   loss_fn: nn.Module,
                   optim: torch.optim.Optimizer) -> TrainHistory:
    # Track history
    history: TrainHistory = {
        'train': {
            'accuracy': [],
            'loss': [],
        },
        'valid': {
            'accuracy': [],
            'loss': [],
        }
    }

    # Run training and validation for each epoch
    for epoch in range(epochs):
        print('Epoch [%d/%d]' % (epoch + 1, epochs), end=' - ')

        ### Training ###
        model.train(True)

        # Track across a single epoch
        train_loss = []
        train_accuracy = []

        for b, (X, y) in enumerate(train_dl):
            X, y = X.to(device), y.to(device)

            # Prevent grad accumulation
            optim.zero_grad()

            # Forward pass (calling the module directly also runs any registered hooks)
            logits = model(X)
            loss: Tensor = loss_fn(logits, y)
            y_pred: Tensor = logits.argmax(dim=1).detach()

            # Backward pass
            loss.backward()
            optim.step()

            # Track metrics
            train_loss.append(loss.detach().cpu().item())
            train_accuracy.extend((y_pred == y).detach().cpu().tolist())

        # Aggregate training results
        history['train']['loss'].append(torch.mean(torch.tensor(train_loss)).item())
        history['train']['accuracy'].append((torch.sum(torch.tensor(train_accuracy)) / len(train_accuracy)).item())

        ### Validation ###
        model.train(False)

        # Track across a single epoch
        valid_loss = []
        valid_accuracy = []

        for b, (X, y) in enumerate(valid_dl):
            X, y = X.to(device), y.to(device)

            # Forward pass
            with torch.no_grad():
                logits = model(X)
                loss: Tensor = loss_fn(logits, y)
                y_pred: Tensor = logits.argmax(dim=1)

            # Track metrics
            valid_loss.append(loss.detach().cpu().item())
            valid_accuracy.extend((y_pred == y).detach().cpu().tolist())

        # Aggregate validation results
        history['valid']['loss'].append(torch.mean(torch.tensor(valid_loss)).item())
        history['valid']['accuracy'].append((torch.sum(torch.tensor(valid_accuracy)) / len(valid_accuracy)).item())

        # Inform regarding current metrics
        print('t_loss: %f, t_acc: %f, v_loss: %f, v_acc: %f'
              % (history['train']['loss'][-1], history['train']['accuracy'][-1], history['valid']['loss'][-1], history['valid']['accuracy'][-1]))

    # Output the obtained results so far
    return history
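
The calls `model.train(True)` and `model.train(False)` in the loop above matter because the AlexNet classifier contains `Dropout` layers, which behave differently in the two modes. A minimal sketch on a standalone `Dropout` layer (the input of ones is chosen only to make the effect easy to see):

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: each element is zeroed with probability p,
# the survivors are scaled by 1 / (1 - p)
drop.train(True)
train_out = drop(x)

# Evaluation mode: dropout is the identity function
drop.train(False)
eval_out = drop(x)

print(sorted(train_out.unique().tolist()))
print(torch.equal(eval_out, x))
```

Forgetting to switch to evaluation mode would make the validation metrics noisy, because random units would still be dropped during the forward pass.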

In [8]:
# Q: Train the model for 2 epochs using a cross-entropy loss and an Adam optimizer with a lr of 1e-4
# Note: this run uses 10 epochs rather than the 2 asked for, as the log below shows
# Prepare training settings
epochs = 10
lr_rate = 1e-4
loss_fn = nn.CrossEntropyLoss()
optim = torch.optim.Adam(model.parameters(), lr=lr_rate)

# Send model to GPU
model = model.to(device)

# Start training
p1_history = train_validate(
    model=model,
    train_dl=train_dl,
    valid_dl=valid_dl,
    epochs=epochs,
    loss_fn=loss_fn,
    optim=optim,
)

Epoch [1/10] - t_loss: 3.794581, t_acc: 0.189283, v_loss: 3.359960, v_acc: 0.266129
Epoch [2/10] - t_loss: 2.977539, t_acc: 0.359695, v_loss: 2.769620, v_acc: 0.369816
Epoch [3/10] - t_loss: 2.352770, t_acc: 0.457361, v_loss: 2.314114, v_acc: 0.450461
Epoch [4/10] - t_loss: 1.917602, t_acc: 0.540334, v_loss: 1.940583, v_acc: 0.538018
Epoch [5/10] - t_loss: 1.592094, t_acc: 0.608902, v_loss: 1.742621, v_acc: 0.580645
Epoch [6/10] - t_loss: 1.296296, t_acc: 0.666955, v_loss: 1.652966, v_acc: 0.601382
Epoch [7/10] - t_loss: 1.065086, t_acc: 0.718093, v_loss: 1.618707, v_acc: 0.604839
Epoch [8/10] - t_loss: 0.819117, t_acc: 0.771248, v_loss: 1.655721, v_acc: 0.617512
Epoch [9/10] - t_loss: 0.617592, t_acc: 0.827715, v_loss: 1.655025, v_acc: 0.610599
Epoch [10/10] - t_loss: 0.460418, t_acc: 0.863872, v_loss: 1.650661, v_acc: 0.660138


## Experiments:

1. Rerun training (restart kernel and run all cells) but this time, when loading the model in the first block of code, specify 'weights=AlexNet_Weights.DEFAULT' in order to make use of the weights pretrained on ImageNet. (Older torchvision versions used 'pretrained = True', which is now deprecated.)
2. Rerun the code using the pretrained model but this time use a learning rate of 1e-3. What happens?
3. Rerun using the pretrained model and a lr of 1e-4 but this time only change the last layer in the model instead of the entire classifier.
4. Rerun the code using the pretrained model and a lr of 1e-4. This time, freeze the pretrained layers and only update the new layers for the first epochs. Afterwards, proceed to update the entire model. You can freeze parameters by specifying 'requires_grad = False'.
5. Rerun experiment 3 but gradually unfreeze layers instead of unfreezing the entire model at once.

In [9]:
from torchvision.models import AlexNet_Weights
from torchvision.transforms.v2 import Transform


# Use original transformations of AlexNet
weights = AlexNet_Weights.DEFAULT
preprocess: Transform = weights.transforms()

# Preprocess the dataset using those transforms
dataset = torchvision.datasets.Caltech101(
    './dataset',
    transform = Compose([
        ToImage(),
        Lambda(lambda x: x.repeat(3, 1, 1) if x.shape[0] != 3 else x),
        Lambda(lambda x: preprocess(x)),
    ])
)

# Redefine subsets & dataloaders
train_ds, val_ds, test_ds = random_split(dataset, [0.8, 0.1, 0.1], gen)
train_dl = DataLoader(train_ds, **settings)
valid_dl = DataLoader(val_ds, **settings)
test_dl = DataLoader(test_ds, **settings)
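
Passing a seeded generator to `random_split`, as done with `gen` above, is what makes the split repeatable. The sketch below illustrates this on a toy dataset of 100 integers; `split` is a hypothetical helper, and the fractional lengths assume a PyTorch version that accepts fractions in `random_split`, as the notebook already does.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset with 100 samples standing in for Caltech 101
data = TensorDataset(torch.arange(100))


def split(seed: int):
    """Split the toy dataset 80/10/10 with a freshly seeded generator."""
    gen = torch.Generator().manual_seed(seed)
    return random_split(data, [0.8, 0.1, 0.1], generator=gen)


train_a, val_a, test_a = split(42)
train_b, val_b, test_b = split(42)

sizes = (len(train_a), len(val_a), len(test_a))
same_split = list(train_a.indices) == list(train_b.indices)
all_indices = list(train_a.indices) + list(val_a.indices) + list(test_a.indices)
no_overlap = len(set(all_indices)) == len(data)

print(sizes, same_split, no_overlap)
```

Note that a generator keeps its state between calls, so reusing the same generator object for a second split generally gives a different partition than a freshly seeded one.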

### Experiment 1

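Rerun training (restart kernel and run all cells) but this time, when loading the model in the first block of code, specify 'weights=AlexNet_Weights.DEFAULT' in order to make use of the weights pretrained on ImageNet. (Older torchvision versions used 'pretrained = True', which is now deprecated.)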

In [10]:
# Use pretrained model
model = torchvision.models.alexnet(weights=weights)

# Create a new classifier similar to AlexNet
model.classifier = torch.nn.Sequential(
    Dropout(p=0.5, inplace=False),
    Linear(in_features=9216, out_features=4096, bias=True),
    ReLU(inplace=True),
    Dropout(p=0.5, inplace=False),
    Linear(in_features=4096, out_features=4096, bias=True),
    ReLU(inplace=True),
    Linear(in_features=4096, out_features=101, bias=True)
)

# Prepare training settings
lr_rate = 1e-4
optim = torch.optim.Adam(model.parameters(), lr=lr_rate)

# Send model to GPU
model = model.to(device)

# Start training
p1_history = train_validate(
    model=model,
    train_dl=train_dl,
    valid_dl=valid_dl,
    epochs=epochs,
    loss_fn=loss_fn,
    optim=optim,
)

Epoch [1/10] - t_loss: 1.822363, t_acc: 0.585278, v_loss: 0.811106, v_acc: 0.798387
Epoch [2/10] - t_loss: 0.544301, t_acc: 0.855085, v_loss: 0.565776, v_acc: 0.857143
Epoch [3/10] - t_loss: 0.241973, t_acc: 0.934457, v_loss: 0.612962, v_acc: 0.846774
Epoch [4/10] - t_loss: 0.170993, t_acc: 0.953616, v_loss: 0.571546, v_acc: 0.868664
Epoch [5/10] - t_loss: 0.123834, t_acc: 0.963411, v_loss: 0.607776, v_acc: 0.866359
Epoch [6/10] - t_loss: 0.093824, t_acc: 0.973351, v_loss: 0.584042, v_acc: 0.866359
Epoch [7/10] - t_loss: 0.101693, t_acc: 0.972198, v_loss: 0.714532, v_acc: 0.847926
Epoch [8/10] - t_loss: 0.069689, t_acc: 0.979833, v_loss: 0.815824, v_acc: 0.835253
Epoch [9/10] - t_loss: 0.104316, t_acc: 0.971766, v_loss: 0.504716, v_acc: 0.868664
Epoch [10/10] - t_loss: 0.090155, t_acc: 0.977384, v_loss: 0.552409, v_acc: 0.866359


### Experiment 2

Rerun the code using the pretrained model but this time use a learning rate of 1e-3. What happens?

In [11]:
# Use pretrained model
model = torchvision.models.alexnet(weights=weights)

# Create a new classifier similar to AlexNet
model.classifier = torch.nn.Sequential(
    Dropout(p=0.5, inplace=False),
    Linear(in_features=9216, out_features=4096, bias=True),
    ReLU(inplace=True),
    Dropout(p=0.5, inplace=False),
    Linear(in_features=4096, out_features=4096, bias=True),
    ReLU(inplace=True),
    Linear(in_features=4096, out_features=101, bias=True)
)

# Prepare training settings
lr_rate = 1e-3
optim = torch.optim.Adam(model.parameters(), lr=lr_rate)

# Send model to GPU
model = model.to(device)

# Start training
p1_history = train_validate(
    model=model,
    train_dl=train_dl,
    valid_dl=valid_dl,
    epochs=epochs,
    loss_fn=loss_fn,
    optim=optim,
)

Epoch [1/10] - t_loss: 4.297482, t_acc: 0.084270, v_loss: 4.232741, v_acc: 0.081797
Epoch [2/10] - t_loss: 4.209764, t_acc: 0.091760, v_loss: 4.225294, v_acc: 0.095622
Epoch [3/10] - t_loss: 4.206759, t_acc: 0.092625, v_loss: 4.224267, v_acc: 0.081797
Epoch [4/10] - t_loss: 4.204620, t_acc: 0.091616, v_loss: 4.229794, v_acc: 0.095622
Epoch [5/10] - t_loss: 4.203608, t_acc: 0.092913, v_loss: 4.213019, v_acc: 0.095622
Epoch [6/10] - t_loss: 4.203966, t_acc: 0.086863, v_loss: 4.218480, v_acc: 0.081797
Epoch [7/10] - t_loss: 4.201900, t_acc: 0.091760, v_loss: 4.211495, v_acc: 0.081797
Epoch [8/10] - t_loss: 4.200199, t_acc: 0.088159, v_loss: 4.223351, v_acc: 0.081797
Epoch [9/10] - t_loss: 4.198719, t_acc: 0.091904, v_loss: 4.232707, v_acc: 0.081797
Epoch [10/10] - t_loss: 4.199227, t_acc: 0.090608, v_loss: 4.230655, v_acc: 0.081797
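
To put the loss values in this log into context, it helps to know the cross-entropy of a classifier that assigns the same score to every class. For 101 classes this is ln(101), roughly 4.615. The sketch below computes it with all-zero logits; the target labels are arbitrary.

```python
import math

import torch
from torch import nn

num_classes = 101

# Identical logits for every class give a uniform softmax distribution
logits = torch.zeros(4, num_classes)
targets = torch.tensor([0, 5, 50, 100])

uniform_loss = nn.CrossEntropyLoss()(logits, targets).item()
print(round(uniform_loss, 4), round(math.log(num_classes), 4))
```

The logged loss of about 4.2 stays close to this reference and the accuracy does not move. This suggests, though it does not prove, that with a learning rate of 1e-3 the updates destroyed the pretrained features and the network settled on predictions that barely depend on the input image.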


### Experiment 3

Rerun using the pretrained model and a lr of 1e-4 but this time only change the last layer in the model instead of the entire classifier.

In [44]:
# Use pretrained model
model = torchvision.models.alexnet(weights=weights)

# Change only the last layer of the network
model.classifier[-1] = nn.Linear(in_features=4096, out_features=101, bias=True)

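# Prepare training settings
# Only the classifier parameters are handed to the optimizer,
# so the convolutional features keep their pretrained values
lr_rate = 1e-4
optim = torch.optim.Adam(model.classifier.parameters(), lr=lr_rate)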

# Send model to GPU
model = model.to(device)

# Start training
p1_history = train_validate(
    model=model,
    train_dl=train_dl,
    valid_dl=valid_dl,
    epochs=epochs,
    loss_fn=loss_fn,
    optim=optim,
)

Epoch [1/10] - t_loss: 1.210781, t_acc: 0.702391, v_loss: 0.571359, v_acc: 0.838710
Epoch [2/10] - t_loss: 0.227575, t_acc: 0.933449, v_loss: 0.534217, v_acc: 0.854839
Epoch [3/10] - t_loss: 0.099769, t_acc: 0.971766, v_loss: 0.510533, v_acc: 0.862903
Epoch [4/10] - t_loss: 0.066575, t_acc: 0.979833, v_loss: 0.456373, v_acc: 0.883641
Epoch [5/10] - t_loss: 0.053408, t_acc: 0.984875, v_loss: 0.524013, v_acc: 0.880184
Epoch [6/10] - t_loss: 0.053102, t_acc: 0.984443, v_loss: 0.525328, v_acc: 0.884793
Epoch [7/10] - t_loss: 0.050242, t_acc: 0.985019, v_loss: 0.535523, v_acc: 0.882488
Epoch [8/10] - t_loss: 0.062937, t_acc: 0.980985, v_loss: 0.612010, v_acc: 0.858295
Epoch [9/10] - t_loss: 0.045422, t_acc: 0.986027, v_loss: 0.633534, v_acc: 0.862903
Epoch [10/10] - t_loss: 0.043690, t_acc: 0.986603, v_loss: 0.496371, v_acc: 0.882488


### Experiment 4

Rerun the code using the pretrained model and a lr of 1e-4. This time, freeze the pretrained layers and only update the new layers for the first epochs. Afterwards, proceed to update the entire model. You can freeze parameters by specifying 'requires_grad = False'.

In [45]:
# Init the model from the pretrained weights
model = torchvision.models.alexnet(weights=weights)

# Freeze the backbone initially
model.requires_grad_(False)

# Replace the top classifier
model.classifier = nn.Sequential(
    nn.Dropout(p=0.5, inplace=False),
    nn.Linear(in_features=9216, out_features=4096, bias=True),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5, inplace=False),
    nn.Linear(in_features=4096, out_features=4096, bias=True),
    nn.ReLU(inplace=True),
    nn.Linear(in_features=4096, out_features=101, bias=True)
)

# Training settings
lr = 1e-4
model = model.to(device)
optim = torch.optim.Adam(model.classifier.parameters(), lr)

# Fraction of the epochs spent training only the new classifier
p1_ratio = 0.3
assert 0.0 < p1_ratio < 1.0

# Derive phase 2 from phase 1 so the two always add up to `epochs`,
# regardless of floating-point rounding
p1_epochs: int = int(np.floor(p1_ratio * epochs))
p2_epochs: int = epochs - p1_epochs

# Phase 1 - Train the classifier
p1_history = train_validate(
    model=model,
    train_dl=train_dl,
    valid_dl=valid_dl,
    epochs=p1_epochs,
    loss_fn=loss_fn,
    optim=optim,
)

# Consider all params for optimization
optim.add_param_group({
    'params': model.features.parameters(),
})

# Unfreeze the whole model
model.requires_grad_(True)

# Phase 2 - Train the whole network
p2_history = train_validate(
    model=model,
    train_dl=train_dl,
    valid_dl=valid_dl,
    epochs=p2_epochs,
    loss_fn=loss_fn,
    optim=optim,
)

Epoch [1/3] - t_loss: 1.705078, t_acc: 0.616825, v_loss: 0.705593, v_acc: 0.813364
Epoch [2/3] - t_loss: 0.463794, t_acc: 0.871363, v_loss: 0.578746, v_acc: 0.842166
Epoch [3/3] - t_loss: 0.235555, t_acc: 0.937626, v_loss: 0.521116, v_acc: 0.855991
Epoch [1/7] - t_loss: 0.263022, t_acc: 0.926102, v_loss: 0.553663, v_acc: 0.853687
Epoch [2/7] - t_loss: 0.173932, t_acc: 0.949006, v_loss: 0.597264, v_acc: 0.836406
Epoch [3/7] - t_loss: 0.129096, t_acc: 0.963123, v_loss: 0.529493, v_acc: 0.861751
Epoch [4/7] - t_loss: 0.132028, t_acc: 0.962835, v_loss: 0.613628, v_acc: 0.864055
Epoch [5/7] - t_loss: 0.095486, t_acc: 0.972054, v_loss: 0.553972, v_acc: 0.875576
Epoch [6/7] - t_loss: 0.067284, t_acc: 0.979977, v_loss: 0.571430, v_acc: 0.861751
Epoch [7/7] - t_loss: 0.067713, t_acc: 0.981706, v_loss: 0.584910, v_acc: 0.868664
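
The freeze-then-unfreeze mechanism used in this experiment can be checked in isolation. The sketch below is a toy setup, not the AlexNet run: `backbone` and `head` are small stand-ins for the pretrained features and the new classifier, and the data is random.

```python
import torch
from torch import nn

torch.manual_seed(0)
backbone = nn.Linear(4, 4)   # stand-in for the pretrained feature extractor
head = nn.Linear(4, 2)       # stand-in for the new classifier
model = nn.Sequential(backbone, nn.Tanh(), head)

# Phase 1: freeze the backbone and optimize only the head
backbone.requires_grad_(False)
optim = torch.optim.Adam(head.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
X, y = torch.randn(8, 4), torch.randint(0, 2, (8,))


def step() -> None:
    """Run a single optimization step on the toy batch."""
    optim.zero_grad()
    loss_fn(model(X), y).backward()
    optim.step()


before = backbone.weight.detach().clone()
step()
frozen_unchanged = torch.equal(before, backbone.weight)

# Phase 2: unfreeze the backbone and register it with the same optimizer
backbone.requires_grad_(True)
optim.add_param_group({'params': backbone.parameters()})
step()
updated_after_unfreeze = not torch.equal(before, backbone.weight)

print(frozen_unchanged, updated_after_unfreeze)
```

Both steps are needed in phase 2: `requires_grad_(True)` makes autograd compute gradients for the backbone, and `add_param_group` tells the optimizer to apply them.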


### Experiment 5

Rerun experiment 3 but gradually unfreeze layers instead of unfreezing the entire model at once.
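
One possible way to organise gradual unfreezing is to walk over the parameter names from the output towards the input and enable one (bias, weight) pair per stage. The sketch below shows this on a small toy network rather than on AlexNet; the three-layer `model` is an assumption made purely for illustration.

```python
from torch import nn

# Toy network with three trainable layers (indices 0, 2 and 4)
model = nn.Sequential(
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 4),
)
model.requires_grad_(False)  # start with everything frozen

# Parameter names from last layer to first, grouped as (bias, weight)
names = [name for name, _ in model.named_parameters()][::-1]
pairs = list(zip(names[0::2], names[1::2]))

stages = []
for bias_name, weight_name in pairs:
    # Unfreeze one more layer, starting from the output side
    model.get_parameter(bias_name).requires_grad_(True)
    model.get_parameter(weight_name).requires_grad_(True)
    # Record which parameters are trainable at this stage
    stages.append([n for n, p in model.named_parameters() if p.requires_grad])

for stage in stages:
    print(stage)
```

In a real run, a few training epochs would take place after each unfreezing step, before the next layer is released.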


In [76]:
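model = torchvision.models.alexnet(weights=weights)
model.classifier[-1] = nn.Linear(in_features=4096, out_features=101, bias=True)

# Freeze everything (including the new layer); layers are unfrozen one by one below
model.requires_grad_(False)

# Prepare training settings
# The optimizer receives all parameters; frozen ones have no gradient and are skipped
lr_rate = 1e-4
optim = torch.optim.Adam(model.parameters(), lr=lr_rate)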

# Send model to GPU
model = model.to(device)

# Get all parameter names in reverse order, paired as (bias, weight) per layer
param_names = list(map(lambda x: x[0], reversed(list(model.named_parameters()))))
param_groups = list(zip(*(iter(param_names),) * 2))
history = []
epochs_per_layer = 2

for bias, weight in param_groups:
    # Unfreeze layer by layer
    model.get_parameter(bias).requires_grad_(True)
    model.get_parameter(weight).requires_grad_(True)

    # See what parameters are currently trained
    print(list(filter(lambda x: 'true' in x.lower(), map(lambda x: f'{x[0]}:{x[1].requires_grad}', model.named_parameters()))))

    # Train model partially
    h = train_validate(
        model=model,
        train_dl=train_dl,
        valid_dl=valid_dl,
        epochs=epochs_per_layer,
        loss_fn=loss_fn,
        optim=optim,
    )

    # Keep results
    history.append(h)

['classifier.6.weight:True', 'classifier.6.bias:True']
Epoch [1/2] - t_loss: 1.617894, t_acc: 0.630510, v_loss: 0.801819, v_acc: 0.828341
Epoch [2/2] - t_loss: 0.505628, t_acc: 0.877989, v_loss: 0.606884, v_acc: 0.850230
['classifier.4.weight:True', 'classifier.4.bias:True', 'classifier.6.weight:True', 'classifier.6.bias:True']
Epoch [1/2] - t_loss: 0.321002, t_acc: 0.903630, v_loss: 0.538197, v_acc: 0.850230
Epoch [2/2] - t_loss: 0.108363, t_acc: 0.968165, v_loss: 0.532982, v_acc: 0.864055
['classifier.1.weight:True', 'classifier.1.bias:True', 'classifier.4.weight:True', 'classifier.4.bias:True', 'classifier.6.weight:True', 'classifier.6.bias:True']
Epoch [1/2] - t_loss: 0.156007, t_acc: 0.957073, v_loss: 0.627823, v_acc: 0.857143
Epoch [2/2] - t_loss: 0.122495, t_acc: 0.962403, v_loss: 0.579784, v_acc: 0.854839
['features.10.weight:True', 'features.10.bias:True', 'classifier.1.weight:True', 'classifier.1.bias:True', 'classifier.4.weight:True', 'classifier.4.bias:True', 'classifier.6.