## Day 10 - Pretraining / Finetuning

We trained our CIFAR CNNs from scratch. That took only a few minutes, because the models are pretty small, the images are tiny (32×32), and the training dataset only has about 50,000 pictures. But ResNet is a different story. It was trained on ImageNet, which has about 1.2 million training images, and each one is much bigger (cropped to 224×224 pixels for training).

Now imagine you want a ResNet model for a totally new kind of image that is not in ImageNet (or for a different set of target classes). Usually you do not have enough training data to train a full model from scratch. Even if you did, training a ResNet from scratch would take a long time and need a lot of computing power. For example, training ResNet on all of ImageNet for 100 epochs can take around three days on eight Nvidia V100 GPUs!

Instead, we can start with a version of ResNet **pretrained** on ImageNet and **finetune** it on a much smaller number of domain-specific images.

There are several approaches. The easiest option is to keep the image encoder portion frozen, replace the "classification head" (the final fully connected layer), and train only the new head on the new data. This is also known as **transfer learning** -- we hope that the image features learned on ImageNet will still be relevant for the new data.

The second option is to replace the classification head, and then continue training all the weights in the model (**full fine-tuning**). This is considerably more expensive. As a trade-off, we can selectively unfreeze only the last convolution block and train it together with the new head.

### Transfer Learning 

You can try transfer learning first: replace and train only the final classification head while keeping all other model parameters frozen. Here is how that works in principle.

First, we download the pretrained model again.

In [27]:
from torchvision import models, transforms
from PIL import Image
import torch
import torch.nn as nn
import torch.nn.functional as F
# Load pre-trained ResNet-50 model from torchvision
model = models.resnet50(weights="IMAGENET1K_V1") 
model.eval()


ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 

We can now access the layers individually.

In [6]:
model.fc

Linear(in_features=2048, out_features=1000, bias=True)

In [13]:
model.layer4

Sequential(
  (0): Bottleneck(
    (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
    (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (downsample): Sequential(
      (0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False)
      (1): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (1): Bottleneck(
    (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (conv2): 

We freeze all parameters of the model: 

In [15]:
for param in model.parameters():
    param.requires_grad = False

Then we replace the final layer.

In [16]:
num_classes = 100 # or however many
model.fc = nn.Linear(2048, num_classes) # Create and add new layer -- input size must match. 

In [21]:
for param in model.fc.parameters(): # a new layer is trainable by default; this just makes it explicit
    param.requires_grad = True
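
To check that the freezing worked, it helps to count the trainable parameters. The sketch below wraps the freeze-and-replace steps in a helper and runs it on a tiny stand-in network, so it executes without downloading any weights. `TinyBackbone`, `replace_head`, and `count_parameters` are hypothetical names introduced for illustration; the only assumption about the real model is that its head is stored in an attribute called `fc`, as in the ResNet printout above.

```python
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained network with a final `fc` layer."""

    def __init__(self, feature_dim=8, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, feature_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, x):
        return self.fc(self.features(x))


def replace_head(model, num_classes):
    """Freeze every existing parameter, then attach a fresh trainable head."""
    for param in model.parameters():
        param.requires_grad = False
    # Read the input size from the old head instead of hard-coding it.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model


def count_parameters(model):
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


model_sketch = replace_head(TinyBackbone(), num_classes=100)
trainable, total = count_parameters(model_sketch)
print(f"trainable: {trainable} of {total}")
```

Calling `replace_head(model, num_classes)` on the ResNet-50 loaded earlier should leave only the weights and biases of the new `fc` layer trainable.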

Now we would have to load the data and train the model -- which is your project for today. 

Note: Pre-processing should use the same steps as were used during pre-training. Here is the code again. 

In [4]:
# Resnet images need to be of size 3x224x224
# - Resize shorter side to 256 (we will crop some of the sides).
# - Center crop to 224x224
# - Convert to tensor and normalize with ImageNet mean / stdev (pre-calculated)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

Here is what we used to pass an image to the model: 

In [29]:
from PIL import Image

image = Image.open("path/image.jpg").convert("RGB")

input_tensor = preprocess(image) # image is loaded from file, for example 
input_batch = input_tensor.unsqueeze(0) # input must be a batch

with torch.no_grad():
    output = model(input_batch)  # output shape: [1, 1000] with the original ImageNet head
    probabilities = F.softmax(output[0], dim=0) #softmax over first image

_, label = torch.topk(probabilities, 1)


Steps required: 
1. Pick a dataset. There are two suggestions. The Oxford-IIIT Pet dataset: https://www.robots.ox.ac.uk/~vgg/data/pets/ (37 dog and cat breeds, roughly 200 images per class); the Food-101 dataset: https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/ (101 food categories, 1,000 images per class).

2. Write a Dataset class to fetch the data. See template below. Use the original train/test split if possible. 

from torch.utils.data import Dataset

class ImageClassificationDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        """
        Args:
            root_dir (str): Path to the dataset root folder.
                            Expected structure: root/class_name/image.jpg
            transform (callable, optional): Optional transform to be applied
                                            on a sample (e.g., torchvision transforms).
        """
        self.root_dir = root_dir
        self.transform = transform
        
        # TODO: 1) Scan root_dir to collect (image_path, label) pairs
        # Hint: os.walk or os.listdir can help
        # Hint: You'll need a mapping {class_name: index}
        self.samples = []     # list of (img_path, label_index)
        self.class_to_idx = {} # mapping: class_name -> label_index

        # --- Student work starts here ---
        # 1. Find all subdirectories (these are class names)
        # 2. Assign each class an index
        # 3. Collect (path, label) for every image in each class folder
        # -------------------------------

    def __len__(self):
        # TODO: return number of samples
        return len(self.samples)

    def __getitem__(self, idx):
        # TODO:
        # 1. Get (path, label) for this index
        # 2. Load the image with PIL.Image.open(path).convert("RGB")
        # 3. Apply transform if provided
        # 4. Return (image_tensor, label)
        pass

3. Implement the training and validation functions.

In [26]:
import os
from torch.utils.data import Dataset  # base class; without this import the cell raises a NameError

class CustomDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        self.root_dir = root_dir
        self.transform = transform

        # I recommend storing the filenames, not the actual image data, which may not all fit in memory.
        self.data = ...

        # In the food dataset, the directory names are the class labels. I think the same is true for pets,
        # but check the layout after downloading.
        # Hint: os.walk or os.listdir is useful.

    def __len__(self):
        # TODO: return number of samples
        return ...

    def __getitem__(self, idx):
        # TODO:
        # 1. Get (path, label) for this index
        # 2. Load the image with PIL.Image.open(path).convert("RGB")
        # 3. Apply transform if provided
        # 4. Return (image_tensor, label)
        pass
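
The directory scan hinted at in the template could look roughly like the sketch below. It assumes the `root/class_name/image.jpg` layout from the template's docstring; a dataset laid out differently needs a different scan. `scan_image_folder` and the demo class names are hypothetical, and the demo uses empty placeholder files in a temporary directory rather than real images.

```python
import os
import tempfile


def scan_image_folder(root_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Collect (image_path, label_index) pairs from root/class_name/image files."""
    # Sort the class names so the name -> index mapping is reproducible.
    class_names = sorted(
        name for name in os.listdir(root_dir)
        if os.path.isdir(os.path.join(root_dir, name))
    )
    class_to_idx = {name: index for index, name in enumerate(class_names)}

    samples = []
    for name in class_names:
        class_dir = os.path.join(root_dir, name)
        for filename in sorted(os.listdir(class_dir)):
            if filename.lower().endswith(extensions):
                samples.append((os.path.join(class_dir, filename), class_to_idx[name]))
    return samples, class_to_idx


# Demonstrate on a throwaway directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    for class_name, filenames in {"beagle": ["a.jpg", "b.jpg"], "pug": ["c.jpg"]}.items():
        os.makedirs(os.path.join(root, class_name))
        for filename in filenames:
            open(os.path.join(root, class_name, filename), "w").close()
    samples, class_to_idx = scan_image_folder(root)

print(class_to_idx)                        # {'beagle': 0, 'pug': 1}
print([label for _, label in samples])     # [0, 0, 1]
```

Inside the dataset class, the returned list would be stored in `__init__`, and `__getitem__` would open only the one image it is asked for.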

In [None]:
def evaluate(model, loader, loss, device):
    model.eval()
    val_loss = 0.0
    val_correct = 0
    val_total = 0

    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)

            loss_val = loss(outputs, labels)
            val_loss += loss_val.item() * images.size(0)

            _, predicted = torch.max(outputs, 1)
            val_correct += (predicted == labels).sum().item()
            val_total += labels.size(0)

    avg_loss = val_loss / val_total
    accuracy = val_correct / val_total
    return avg_loss, accuracy


def train_model(model, train_loader, test_loader, num_epochs=10, learning_rate=0.001, device='cuda'):
    model.to(device)
    criterion = nn.CrossEntropyLoss()
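    # torch.optim is reachable through the existing `import torch`.
    # Frozen parameters never receive gradients, so the optimizer leaves them unchanged.
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)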

    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        correct = 0
        total = 0

        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item() * images.size(0)
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)

        train_loss = running_loss / total
        train_acc = correct / total

        # Pass the loss function (criterion), not the last batch's loss tensor.
        val_loss, val_acc = evaluate(model, test_loader, criterion, device)

        print(f"Epoch {epoch+1}/{num_epochs}: "
              f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}, "
              f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")
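
The training function expects `DataLoader` objects that yield `(images, labels)` batches. A minimal sketch of how they could be built is shown below, using a synthetic `TensorDataset` as a stand-in for your own dataset class. The batch size of 4, the ten random "images", and the `sketch_*` names are illustrative assumptions, not recommended settings.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data: 10 random "images" of shape 3x224x224 with labels in [0, 100).
fake_images = torch.randn(10, 3, 224, 224)
fake_labels = torch.randint(0, 100, (10,))
fake_dataset = TensorDataset(fake_images, fake_labels)

# Shuffle the training data each epoch; keep the evaluation order fixed.
sketch_train_loader = DataLoader(fake_dataset, batch_size=4, shuffle=True)
sketch_test_loader = DataLoader(fake_dataset, batch_size=4, shuffle=False)

# Each batch has shape [batch, 3, 224, 224]; the last batch may be smaller.
batch_shapes = [tuple(images.shape) for images, _ in sketch_test_loader]
print(batch_shapes)

# Pick the device at run time so the same code also works without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
```

With real datasets in place of the synthetic one, the loaders and `device` can be passed straight to `train_model`.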

### Updating the image encoder

Now try unfreezing the last convolution block of the image encoder (model.layer4 in the printout above) in addition to the classification head, and retrain the model.
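
One possible way to set this up is sketched below on a tiny stand-in network that only borrows ResNet's attribute names (`layer3`, `layer4`, `fc`). `TinyResNetLike` is hypothetical, and giving the unfrozen block a smaller learning rate than the new head is a common choice rather than a requirement; the two learning rates are untuned, illustrative values.

```python
import torch
import torch.nn as nn


class TinyResNetLike(nn.Module):
    """Hypothetical stand-in that mimics ResNet's `layer3`, `layer4`, and `fc` attribute names."""

    def __init__(self, num_classes=100):
        super().__init__()
        self.layer3 = nn.Sequential(nn.Conv2d(3, 4, 3, padding=1), nn.ReLU())
        self.layer4 = nn.Sequential(nn.Conv2d(4, 8, 3, padding=1), nn.ReLU())
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.fc(self.pool(self.layer4(self.layer3(x))))


net = TinyResNetLike()

# Start from a fully frozen model, then unfreeze the last block and the head.
for param in net.parameters():
    param.requires_grad = False
for module in (net.layer4, net.fc):
    for param in module.parameters():
        param.requires_grad = True

# Two parameter groups: a smaller step for the unfrozen block, a larger one for the head.
# The learning rates are illustrative assumptions, not tuned values.
optimizer = torch.optim.Adam([
    {"params": net.layer4.parameters(), "lr": 1e-4},
    {"params": net.fc.parameters(), "lr": 1e-3},
])

trainable_names = [name for name, p in net.named_parameters() if p.requires_grad]
print(trainable_names)
```

The same two loops should work on the real ResNet-50, since it exposes `layer4` and `fc` under those names.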
