# Tutorial 5-3: Standing on the Shoulders of Giants – "Transfer Learning"

**Course:** CSEN 342: Deep Learning  
**Topic:** Transfer Learning, Feature Visualization, and Fine-Tuning

## Objective
Training a deep ConvNet from scratch requires massive datasets (like ImageNet with 1.2M images) and days of GPU time. Fortunately, we rarely need to do this. 

As discussed in the lecture (Slide 81), features learned on one task (e.g., classifying 1000 objects) often transfer well to other tasks. In this tutorial, we will:
1.  **Load a Pre-trained ResNet-18:** A widely used residual network trained on ImageNet.
2.  **Visualize Features:** Use "hooks" to peek inside the network and see how it processes an image (Early vs. Late layers).
3.  **Perform Transfer Learning:** Adapt this powerful model to a new, smaller dataset (CIFAR-10) using the **Linear Probing** and **Fine-Tuning** strategies.

---

## Part 1: The Pre-trained Model

We will use `torchvision.models` to download a ResNet-18. 
* **`weights='IMAGENET1K_V1'`**: Tells PyTorch to download the weights learned from the ImageNet dataset.
* **Architecture:** ResNet-18 is a deep CNN with residual connections (which we will study in depth later), but for now, treat it as a powerful feature extractor.

In [None]:
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torchvision.models import resnet18, ResNet18_Weights
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import sys

# Add parent directory to path to import utils
sys.path.append(os.path.abspath(os.path.join('..')))
from utils import download_cifar10

# 1. Pre-download Data
download_cifar10()

# Device config
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 2. Load Pre-trained Model
# weights=ResNet18_Weights.IMAGENET1K_V1 replaces the deprecated pretrained=True
try:
    # Attempt to download automatically (may fail without network access)
    model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
except Exception as e:
    print(f"Automatic download failed ({e}); falling back to a manual download.")
    if not os.path.exists('../data/resnet18-f37072fd.pth'):
        !wget -q -O '../data/resnet18-f37072fd.pth' https://download.pytorch.org/models/resnet18-f37072fd.pth
    state_dict = torch.load('../data/resnet18-f37072fd.pth', weights_only=True)

    model = resnet18(weights=None)
    model.load_state_dict(state_dict)
    print("Loaded Pre-trained ImageNet Weights (Offline).")
    
print("Model Loaded.")
# Let's look at the final layer (the head)
print("Original Head:", model.fc)

---

## Part 2: What does a ConvNet see? (Feature Visualization)

Before we modify the network, let's visualize the features it extracts. We will use a **Forward Hook** to save the output of intermediate convolutional layers.

We expect:
* **Early Layers:** Edges, colors, simple textures.
* **Late Layers:** Abstract shapes, parts of objects (eyes, wheels).
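
The hook mechanism itself can be tried in isolation. The following is a minimal sketch on a hypothetical two-layer toy network (`toy`, `captured`, and `save_output` are illustrative names, not part of the tutorial). It also shows that `register_forward_hook` returns a handle whose `remove()` method detaches the hook again, which is useful once the visualization is done.

```python
import torch
import torch.nn as nn

# Hypothetical toy network standing in for ResNet-18
toy = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),
)

captured = {}

def save_output(name):
    # Returns a hook that stores the layer's output, detached from the autograd graph
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

# register_forward_hook returns a handle that can later remove the hook
handle = toy[0].register_forward_hook(save_output("first_conv"))

with torch.no_grad():
    toy(torch.randn(1, 3, 32, 32))
first_shape = tuple(captured["first_conv"].shape)
print(first_shape)  # (1, 8, 16, 16)

# After removal, further forward passes no longer trigger the hook
handle.remove()
captured.clear()
with torch.no_grad():
    toy(torch.randn(1, 3, 32, 32))
hook_fired_again = "first_conv" in captured
print(hook_fired_again)  # False
```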

In [None]:
# 1. Setup Hooks
activation = {}
def get_activation(name):
    def hook(model, input, output):
        activation[name] = output.detach()
    return hook

# Register hooks on the first layer (conv1) and a deep layer (layer4)
model.conv1.register_forward_hook(get_activation('conv1'))
model.layer4[1].conv2.register_forward_hook(get_activation('layer4'))

# 2. Load a sample image
try:
    !wget -q -O dog.jpg https://raw.githubusercontent.com/pytorch/hub/master/images/dog.jpg
    img_pil = Image.open("dog.jpg").convert("RGB")  # raises if the download failed
except Exception:
    print("Could not download, using noise.")
    img_pil = Image.fromarray(np.uint8(np.random.rand(224,224,3)*255))

# Preprocess for ResNet (Resize, Normalize)
preprocess = ResNet18_Weights.IMAGENET1K_V1.transforms()
img_tensor = preprocess(img_pil).unsqueeze(0) # Add batch dim

# 3. Forward Pass (eval mode, no gradients: we only need the activations)
model.eval()
with torch.no_grad():
    out = model(img_tensor)

# 4. Visualize
def plot_features(layer_name, num_filters=6):
    act = activation[layer_name].squeeze()
    fig, axarr = plt.subplots(1, num_filters, figsize=(15, 3))
    fig.suptitle(f"Features from {layer_name}", fontsize=16)
    for idx in range(num_filters):
        axarr[idx].imshow(act[idx], cmap='viridis')
        axarr[idx].axis('off')
    plt.show()

plt.imshow(img_pil); plt.title("Input Image"); plt.axis('off'); plt.show()
plot_features('conv1')
plot_features('layer4')

### Discussion
Notice the resolution difference! 
* **`conv1`** is high-resolution ($112\times112$) and retains spatial details (edges of the dog).
* **`layer4`** is low-resolution ($7\times7$) and looks like a blurry heatmap. Each pixel responds to a large region of the original image (its Receptive Field!), so it encodes higher-level content rather than fine spatial detail.
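
The $112\times112$ and $7\times7$ figures follow from the standard output-size formula $\lfloor (n + 2p - k)/s \rfloor + 1$ applied to each layer that reduces resolution. The sketch below traces this for a $224\times224$ and a $32\times32$ input. The kernel, stride, and padding values are those of the stride-2 layers in torchvision's ResNet-18; the helper names `conv_out` and `trace_sizes` are illustrative.

```python
def conv_out(size, kernel, stride, padding):
    # Standard output-size formula for convolution and pooling layers
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding) for the ResNet-18 layers that reduce resolution
downsampling = [
    ("conv1", 7, 2, 3),
    ("maxpool", 3, 2, 1),
    ("layer2", 3, 2, 1),
    ("layer3", 3, 2, 1),
    ("layer4", 3, 2, 1),
]

def trace_sizes(input_size):
    # Spatial side length after each downsampling stage
    sizes = {}
    size = input_size
    for name, kernel, stride, padding in downsampling:
        size = conv_out(size, kernel, stride, padding)
        sizes[name] = size
    return sizes

sizes_224 = trace_sizes(224)
sizes_32 = trace_sizes(32)
print(sizes_224)  # {'conv1': 112, 'maxpool': 56, 'layer2': 28, 'layer3': 14, 'layer4': 7}
print(sizes_32)   # {'conv1': 16, 'maxpool': 8, 'layer2': 4, 'layer3': 2, 'layer4': 1}
```

Five halvings give an overall reduction of $2^5 = 32$, which is why a $32\times32$ input ends up as a single $1\times1$ feature map.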

---

## Part 3: Transfer Learning (Linear Probing)

We want to classify **CIFAR-10** images (10 classes), but ResNet-18 was built for ImageNet (1000 classes). 

**Strategy 1 (Slide 86):** Freeze the backbone (treat it as a fixed feature extractor) and train *only* the final linear layer. This is fast and effective for small datasets.

**Steps:**
1.  Set `requires_grad = False` for all parameters.
2.  Replace `model.fc` with a new `nn.Linear(512, 10)`.
3.  Train.
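
One detail about step 1 is worth checking by hand: `requires_grad = False` stops gradient updates, but BatchNorm layers also carry running-mean and running-variance buffers that are refreshed on every forward pass in training mode. The sketch below, on a single hypothetical `nn.BatchNorm2d` layer, illustrates the difference between `train()` and `eval()` for a "frozen" layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(4)
for p in bn.parameters():
    p.requires_grad = False  # "freeze" the learnable weight and bias

x = torch.randn(8, 4, 5, 5) + 3.0  # batch whose mean is far from the initial running mean of 0

# Training mode: running statistics are updated even though nothing requires grad
bn.train()
before = bn.running_mean.clone()
with torch.no_grad():
    bn(x)
changed_in_train = not torch.equal(before, bn.running_mean)

# Eval mode: running statistics stay fixed
bn.eval()
before = bn.running_mean.clone()
with torch.no_grad():
    bn(x)
changed_in_eval = not torch.equal(before, bn.running_mean)

print(changed_in_train, changed_in_eval)  # True False
```

If the backbone is meant to be a truly fixed feature extractor, keeping its BatchNorm layers in eval mode during head training is one way to achieve that.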

In [None]:
# 1. Freeze Backbone
for param in model.parameters():
    param.requires_grad = False

# 2. Replace Head (New layer automatically has requires_grad=True)
num_ftrs = model.fc.in_features # 512 for ResNet18
model.fc = nn.Linear(num_ftrs, 10) # CIFAR-10 has 10 classes

model = model.to(device)

# 3. Data Loaders (CIFAR-10)
# ResNet-18 downsamples by 32x overall, so a raw 32x32 CIFAR image would still work
# (the final feature map would be 1x1). We upsample to 224x224 instead, matching the
# resolution the pre-trained filters were trained on (better accuracy, but slower).
transform_train = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    # Use the ImageNet mean/std that the pre-trained backbone was trained with
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

trainset = torchvision.datasets.CIFAR10(root='../data', train=True, download=True, transform=transform_train)
# Use subset for tutorial speed (500 images)
train_subset, _ = torch.utils.data.random_split(trainset, [500, 49500])
trainloader = torch.utils.data.DataLoader(train_subset, batch_size=32, shuffle=True)

# 4. Train Only Head
print("Params to learn:")
params_to_update = []
for name, param in model.named_parameters():
    if param.requires_grad:
        params_to_update.append(param)
        print("\t", name)

optimizer = optim.Adam(params_to_update, lr=0.001)
criterion = nn.CrossEntropyLoss()

print("\nStarting Linear Probing (Head Training)...")
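# Keep the frozen backbone in eval mode: in train mode its BatchNorm layers would
# still update their running statistics. The linear head behaves the same in both modes.
model.eval()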
for epoch in range(3): # Short run
    running_loss = 0.0
    for inputs, labels in trainloader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1} Loss: {running_loss/len(trainloader):.4f}")
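
The training loss alone does not say how well the probe classifies. Below is a minimal sketch of an accuracy check, assuming a hypothetical helper `evaluate_accuracy` (not part of the tutorial code) and demonstrated on a toy linear model with synthetic data. In the tutorial's setting it could be called with `model`, a loader over held-out CIFAR-10 images, and `device`.

```python
import torch
import torch.nn as nn

def evaluate_accuracy(model, loader, device):
    # Fraction of correctly classified samples (hypothetical helper)
    was_training = model.training
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            preds = model(inputs).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    model.train(was_training)  # restore the mode the caller had set
    return correct / total

# Toy check with synthetic data
torch.manual_seed(0)
toy_model = nn.Linear(4, 3)
toy_x = torch.randn(20, 4)
with torch.no_grad():
    toy_y = toy_model(toy_x).argmax(dim=1)  # labels chosen to match the model's own predictions
toy_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(toy_x, toy_y), batch_size=8)
acc = evaluate_accuracy(toy_model, toy_loader, torch.device("cpu"))
print(f"accuracy: {acc:.2f}")
```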

---

## Part 4: Fine-Tuning (The Golden Rule)

We have trained the head. Now, to squeeze out maximum performance, we **unfreeze** the rest of the network and train everything together.

**The Golden Rule (Slide 88):** Use a **much smaller learning rate** (e.g., 1/10th or 1/100th) for the backbone than the head. We don't want to destroy the beautiful pre-trained weights with large gradient updates.

In [None]:
# 1. Unfreeze everything
for param in model.parameters():
    param.requires_grad = True

# 2. Differential Learning Rates
# Stem and early layers get 1e-5, later layers 1e-4, and the head 1e-3
stem_params = list(model.conv1.parameters()) + list(model.bn1.parameters())
optimizer = optim.Adam([
    {'params': stem_params, 'lr': 1e-5},               # conv1 + bn1 (otherwise never updated)
    {'params': model.layer1.parameters(), 'lr': 1e-5}, # Very small LR for early layers
    {'params': model.layer2.parameters(), 'lr': 1e-5},
    {'params': model.layer3.parameters(), 'lr': 1e-4},
    {'params': model.layer4.parameters(), 'lr': 1e-4},
    {'params': model.fc.parameters(), 'lr': 1e-3}      # Larger LR for the head
])

print("Starting Fine-Tuning (Backbone Unfrozen)...")
model.train()
for epoch in range(2):
    running_loss = 0.0
    for inputs, labels in trainloader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1} Loss: {running_loss/len(trainloader):.4f}")
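
When an optimizer is built from explicit parameter groups, it is easy to leave a submodule out by accident; its weights then stay at their old values even though `requires_grad` is `True`. The sketch below shows one way to check for this, using a hypothetical helper `uncovered_parameters` and a toy three-part model (both are illustrative, not part of the tutorial). In the tutorial's setting the same helper could be called as `uncovered_parameters(model, optimizer)`.

```python
import torch.nn as nn
import torch.optim as optim

def uncovered_parameters(model, optimizer):
    # Names of trainable parameters that no optimizer group will update
    covered = {id(p) for group in optimizer.param_groups for p in group["params"]}
    return [name for name, p in model.named_parameters()
            if p.requires_grad and id(p) not in covered]

# Hypothetical toy model: a "stem", a "body", and a "head"
toy = nn.Sequential()
toy.add_module("stem", nn.Linear(4, 8))
toy.add_module("body", nn.Linear(8, 8))
toy.add_module("head", nn.Linear(8, 2))

# The stem is deliberately left out of the parameter groups
opt = optim.Adam([
    {"params": toy.body.parameters(), "lr": 1e-4},
    {"params": toy.head.parameters(), "lr": 1e-3},
])
missing = uncovered_parameters(toy, opt)
print(missing)  # ['stem.weight', 'stem.bias']
```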

### Conclusion
1.  **Transfer Learning** allows you to reuse strong pre-trained models on small datasets.
2.  **Linear Probing (Freezing)** is a safe first step to align the new classifier with the pre-trained features.
3.  **Fine-Tuning (Unfreezing)** adapts the features themselves to your specific domain (e.g., teaching the network to look for "airplane wings" specifically, rather than generic edges).

**Pro Tip:** Always fine-tune with a low learning rate to avoid "catastrophic forgetting" of the ImageNet knowledge.