# Day 3: Transfer Learning
## CV Bootcamp 2024

Leverage pretrained models for faster training and better results!

**Why Transfer Learning?**
- Train in minutes instead of hours
- Need hundreds of images instead of thousands
- Often reach strong accuracy with little tuning
- Leverage knowledge from ImageNet (1.4M images)

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models, transforms
import matplotlib.pyplot as plt

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

## 1. Why Transfer Learning Works

### The Key Insight

Early layers of CNNs learn **generic features** (edges, textures, colors) that are useful across most vision tasks.

Later layers learn increasingly **task-specific features**, and the final layer is tied to the original label set.

**Strategy:**
1. Use a model pretrained on ImageNet (learned generic features)
2. Freeze early layers (keep the generic features)
3. Only train final layers for your specific task
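
The freeze-then-train steps above can be tried on a toy network before touching a real ResNet. This is a minimal sketch, not the bootcamp's model: `backbone` and `head` are hypothetical stand-ins for the pretrained layers and the new final layer. After one backward pass, only the unfrozen head should hold gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: `backbone` plays the pretrained layers, `head` the new final layer
backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
head = nn.Linear(16, 2)
toy_model = nn.Sequential(backbone, head)

# Freeze the "pretrained" part so its weights keep their values
for param in backbone.parameters():
    param.requires_grad = False

# One forward/backward pass on random data
inputs = torch.randn(4, 8)
targets = torch.tensor([0, 1, 0, 1])
loss = nn.CrossEntropyLoss()(toy_model(inputs), targets)
loss.backward()

# Frozen parameters never receive a gradient; trainable ones do
frozen_has_grad = any(p.grad is not None for p in backbone.parameters())
head_has_grad = all(p.grad is not None for p in head.parameters())
print(f'Backbone received gradients: {frozen_has_grad}')
print(f'Head received gradients:     {head_has_grad}')
```

Because frozen parameters carry no gradient, an optimizer step cannot change them, which is what keeps the generic features intact.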

**When to use:**
- ✓ Limited training data
- ✓ Similar task to ImageNet (object recognition)
- ✓ Want quick baseline
- ✓ Limited compute resources

## 2. Load Pretrained ResNet18

In [None]:
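# Load pretrained ResNet18
# torchvision >= 0.13 uses `weights=`; the older `pretrained=True` is deprecated
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)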

print('ResNet18 architecture:')
print(model)

print(f'\nFinal layer input features: {model.fc.in_features}')
print(f'Final layer output features: {model.fc.out_features}')

## 3. Understanding ResNet Architecture

### ResNet's Innovation: Skip Connections

```
      input (x)
         |
   [conv-relu-conv]  ← learns F(x)
         |
         +  <--------- skip connection (adds x)
         |
      output = F(x) + x
```

**Why it works:**
- Gradients flow directly through skip connections
- Enables training very deep networks (100+ layers)
- If a layer isn't helpful, it can learn F(x) = 0
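
The diagram can be written as a few lines of PyTorch. The sketch below is a simplified illustration under stated assumptions: `TinyResidualBlock` is a hypothetical name, and it follows the conv-relu-conv diagram literally, whereas the real ResNet18 blocks printed in the next cell also include batch normalization and a final ReLU. It also checks the last bullet: once F(x) is forced to zero, the block passes its input through unchanged.

```python
import torch
import torch.nn as nn

class TinyResidualBlock(nn.Module):
    """output = F(x) + x, where F is conv -> ReLU -> conv (as in the diagram)."""

    def __init__(self, channels):
        super().__init__()
        # padding=1 with a 3x3 kernel keeps height and width unchanged, so F(x) and x can be added
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))  # F(x)
        return residual + x                              # skip connection adds x back

block = TinyResidualBlock(channels=4)
x = torch.randn(1, 4, 8, 8)
print(block(x).shape)  # same shape as the input

# Force F(x) = 0 by zeroing the last convolution: the block becomes the identity
nn.init.zeros_(block.conv2.weight)
nn.init.zeros_(block.conv2.bias)
with torch.no_grad():
    identity_out = block(x)
print(torch.equal(identity_out, x))
```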

In [None]:
# Examine ResNet layers
print("ResNet18 structure:")
print("="*50)
for name, module in model.named_children():
    print(f"{name:15s}: {module.__class__.__name__}")

print("\nResidual blocks in layer1:")
for i, block in enumerate(model.layer1):
    print(f"  Block {i}: {block}")

## 4. Freeze Pretrained Layers

In [None]:
# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer for binary classification (Cat vs Dog)
model.fc = nn.Linear(model.fc.in_features, 2)

model = model.to(device)

print('Modified model final layer:')
print(model.fc)

# Count trainable vs total parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'\nTotal parameters: {total_params:,}')
print(f'Trainable parameters: {trainable_params:,}')
print(f'Frozen parameters: {total_params - trainable_params:,}')
print(f'\n% trainable: {100*trainable_params/total_params:.2f}%')
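
The counts printed by the cell above can be checked by hand. A linear layer with `in_features` inputs and `out_features` outputs has `in_features * out_features` weights plus `out_features` biases. The sketch below assumes the commonly cited total of 11,689,512 parameters for the stock 1000-class ResNet18; treat that figure as an assumption and compare it with your own output.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

# Assumption: stock ResNet18 (1000 classes) has 11,689,512 parameters in total
stock_total = 11_689_512
old_head = linear_params(512, 1000)   # original ImageNet classifier
new_head = linear_params(512, 2)      # replacement Cat vs Dog classifier

# Swapping the head removes the old classifier's parameters and adds the new ones
new_total = stock_total - old_head + new_head
print(f'New head parameters: {new_head:,}')
print(f'Total after swap:    {new_total:,}')
print(f'% trainable:         {100 * new_head / new_total:.4f}%')
```

With two decimal places, as in the cell above, the trainable share rounds to 0.01%, so nearly all of the network stays frozen.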

## 5. Training Setup

In [None]:
# Only optimize parameters of final layer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)

print('Optimizer configured to train only final layer')
print(f'Learning rate: {optimizer.param_groups[0]["lr"]}')

## 6. Learning Rate Scheduling

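Lowering the learning rate as training progresses usually improves convergence: large steps make fast progress early on, and small steps let the model settle near a minimum.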
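
Several PyTorch schedulers follow simple closed-form rules, which makes them easy to reason about before plotting. The sketch below re-implements three of them in plain Python purely for illustration; the function names are hypothetical, and in real training you would use the `optim.lr_scheduler` classes. `ReduceLROnPlateau` has no such formula because it reacts to the validation metric.

```python
import math

def step_lr(base_lr, epoch, step_size, gamma):
    """StepLR: multiply by gamma once every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)

def exponential_lr(base_lr, epoch, gamma):
    """ExponentialLR: multiply by gamma every epoch."""
    return base_lr * gamma ** epoch

def cosine_lr(base_lr, epoch, t_max, eta_min=0.0):
    """CosineAnnealingLR: follow half a cosine wave from base_lr down to eta_min."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2

base_lr = 0.001
for epoch in (0, 5, 10):
    print(f'epoch {epoch:2d}: '
          f'step={step_lr(base_lr, epoch, step_size=5, gamma=0.1):.6f}  '
          f'exp={exponential_lr(base_lr, epoch, gamma=0.9):.6f}  '
          f'cos={cosine_lr(base_lr, epoch, t_max=10):.6f}')
```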

### Common Schedulers

In [None]:
# 1. StepLR: Reduce LR every N epochs
scheduler_step = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
print("StepLR: Multiply LR by 0.1 every 5 epochs")

# 2. ReduceLROnPlateau: Reduce when validation stops improving
# (`verbose` is deprecated in recent PyTorch releases, so it is omitted here)
scheduler_plateau = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=3
)
print("ReduceLROnPlateau: Multiply LR by 0.5 after 3 epochs without improvement")

# 3. CosineAnnealingLR: Smooth decrease
scheduler_cosine = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
print("CosineAnnealingLR: Smooth decrease following cosine curve")

# Visualize schedules
def plot_lr_schedule(make_scheduler, epochs=20, lr=0.001):
    """Record the LR at each epoch, using a fresh optimizer so schedules don't interfere."""
    dummy = torch.nn.Parameter(torch.zeros(1))  # placeholder parameter, the model is untouched
    opt = optim.Adam([dummy], lr=lr)
    sch = make_scheduler(opt)
    lrs = []
    for _ in range(epochs):
        lrs.append(opt.param_groups[0]['lr'])
        opt.step()  # no gradients, so nothing changes; keeps the optimizer -> scheduler order
        sch.step()
    return lrs

# Each entry builds its scheduler on the optimizer it is given
schedules = {
    'StepLR (step_size=5, gamma=0.5)':
        lambda opt: optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.5),
    'CosineAnnealingLR (T_max=20)':
        lambda opt: optim.lr_scheduler.CosineAnnealingLR(opt, T_max=20),
    'ExponentialLR (gamma=0.9)':
        lambda opt: optim.lr_scheduler.ExponentialLR(opt, gamma=0.9),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, (title, make_scheduler) in zip(axes, schedules.items()):
    ax.plot(plot_lr_schedule(make_scheduler, epochs=20))
    ax.set_title(title)
    ax.set_xlabel('Epoch')
    ax.grid(True, alpha=0.3)
axes[0].set_ylabel('Learning Rate')

plt.tight_layout()
plt.show()

# Reset for actual training
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, factor=0.5)
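
`ReduceLROnPlateau` is missing from the plots because its schedule depends on the metric passed to `step()`. The sketch below feeds it made-up validation losses to show the behavior; `dummy`, `demo_optimizer`, `demo_scheduler`, and the loss values are all hypothetical and exist only for this demonstration.

```python
import torch
import torch.optim as optim

# A single dummy parameter is enough to build an optimizer for this demo
dummy = torch.nn.Parameter(torch.zeros(1))
demo_optimizer = optim.Adam([dummy], lr=0.001)
demo_scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    demo_optimizer, mode='min', factor=0.5, patience=3
)

# Made-up validation losses: improvement for 3 epochs, then a long plateau
fake_val_losses = [1.0, 0.8, 0.7] + [0.7] * 9

lr_history = []
for val_loss in fake_val_losses:
    demo_scheduler.step(val_loss)  # the scheduler decides based on the metric it is given
    lr_history.append(demo_optimizer.param_groups[0]['lr'])

for epoch, (val_loss, lr) in enumerate(zip(fake_val_losses, lr_history), start=1):
    print(f'epoch {epoch:2d}: val_loss={val_loss:.2f}  lr={lr}')
```

The learning rate stays put while the loss improves and only drops once the plateau has lasted longer than `patience` epochs.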

## 7. Fine-Tuning Strategies

In [None]:
# Strategy 1: Train only final layer (what we're doing)
print("Strategy 1: Train only final layer")
print("  - Fastest")
print("  - Needs least data")
print("  - Good for similar tasks to ImageNet\n")

# Strategy 2: Unfreeze last few layers
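def unfreeze_last_n_layers(model, n=1):
    """Unfreeze the last n residual stages (layer4 first, then layer3, ...)."""
    layers = [model.layer4, model.layer3, model.layer2, model.layer1]
    for layer in layers[:n]:
        for param in layer.parameters():
            param.requires_grad = True
    # Rebuild the optimizer afterwards so it includes the newly trainable parameters
    return model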

print("Strategy 2: Unfreeze last few layers")
print("  - More flexible")
print("  - Needs more data")
print("  - Better for different tasks\n")

# Strategy 3: Unfreeze all with different learning rates
print("Strategy 3: Different LR for different layers")
print("  - Best results")
print("  - Needs most data")
print("  - Use lower LR for early layers")

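# Example of strategy 3
# Note: parameters that appear in no group (here the stem, layer1, and layer2)
# are never updated, even though requires_grad is True for them.
# for param in model.parameters():
#     param.requires_grad = True
#
# optimizer = optim.Adam([
#     {'params': model.layer3.parameters(), 'lr': 1e-5},
#     {'params': model.layer4.parameters(), 'lr': 1e-4},
#     {'params': model.fc.parameters(), 'lr': 1e-3}
# ])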
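
Per-layer learning rates rely on optimizer parameter groups, which can be inspected without loading a real model. The sketch below is a toy illustration: `early`, `late`, `head`, and `toy_optimizer` are hypothetical names standing in for ResNet stages and the notebook's optimizer.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical three-stage toy network standing in for early layers, late layers, and head
early = nn.Linear(8, 8)
late = nn.Linear(8, 8)
head = nn.Linear(8, 2)

# One parameter group per stage: smaller steps for earlier (more generic) layers
toy_optimizer = optim.Adam([
    {'params': early.parameters(), 'lr': 1e-5},
    {'params': late.parameters(), 'lr': 1e-4},
    {'params': head.parameters(), 'lr': 1e-3},
])

# Each group keeps its own learning rate, in the order the groups were passed in
group_lrs = [group['lr'] for group in toy_optimizer.param_groups]
for stage, lr in zip(['early', 'late', 'head'], group_lrs):
    print(f'{stage:<5} lr = {lr}')
```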

## 8. Data Augmentation for Transfer Learning

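**Critical:** Normalize with the ImageNet mean and standard deviation! The pretrained weights were trained on inputs normalized this way, so different statistics shift every activation in the frozen layers.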
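
Normalization is plain per-channel arithmetic: `(value - mean) / std` after pixel values are scaled to [0, 1]. The sketch below works through single pixels in plain Python to show the resulting range; `normalize_pixel` is a hypothetical helper for illustration, not part of torchvision.

```python
imagenet_mean = [0.485, 0.456, 0.406]
imagenet_std = [0.229, 0.224, 0.225]

def normalize_pixel(rgb, mean=imagenet_mean, std=imagenet_std):
    """Apply (value - mean) / std to each channel of one RGB pixel in [0, 1]."""
    return [(value - m) / s for value, m, s in zip(rgb, mean, std)]

white = normalize_pixel([1.0, 1.0, 1.0])
black = normalize_pixel([0.0, 0.0, 0.0])
average = normalize_pixel(imagenet_mean)   # a pixel equal to the dataset mean

print('white  :', [round(v, 3) for v in white])
print('black  :', [round(v, 3) for v in black])
print('average:', [round(v, 3) for v in average])
```

A pixel equal to the dataset mean maps to zero in every channel, so the normalized inputs are roughly centered around zero.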

In [None]:
# ImageNet statistics
imagenet_mean = [0.485, 0.456, 0.406]
imagenet_std = [0.229, 0.224, 0.225]

# Training: WITH augmentation
train_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(imagenet_mean, imagenet_std)
])

# Validation/Test: NO augmentation
val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(imagenet_mean, imagenet_std)
])

print('Transforms defined with ImageNet statistics')
print(f'Mean: {imagenet_mean}')
print(f'Std:  {imagenet_std}')
print('\n⚠ Important: NEVER augment validation/test data!')

## 9. Popular Pretrained Models

PyTorch provides many pretrained models. Let's compare them!

In [None]:
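# Load different architectures
# (`weights=` needs torchvision >= 0.13; IMAGENET1K_V1 is what `pretrained=True` used to load)
resnet18 = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
resnet50 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
mobilenet_v2 = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)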

print('Available pretrained models:')
print('- ResNet family (18, 34, 50, 101, 152)')
print('- VGG family (16, 19)')
print('- MobileNet v2/v3 (mobile-optimized)')
print('- EfficientNet family (b0-b7)')
print('- DenseNet family (121, 161, 169, 201)')

## 10. Model Comparison

In [None]:
import time

models_to_compare = {
    'ResNet18': models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1),
    'ResNet50': models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1),
    'MobileNetV2': models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
}

use_cases = {
    'ResNet18': 'Quick baseline, learning',
    'ResNet50': 'General purpose, higher accuracy',
    'MobileNetV2': 'Mobile/edge deployment'
}

print('Model Comparison:\n')
print(f'{"Model":<15} {"Parameters":>12} {"Size (MB)":>12} {"Use Case"}')
print('=' * 70)

# Loop variable is `net`, not `model`, so the fine-tuning model from Section 4 is not overwritten
for name, net in models_to_compare.items():
    params = sum(p.numel() for p in net.parameters())
    size_mb = params * 4 / 1024 / 1024  # float32 = 4 bytes per parameter
    print(f'{name:<15} {params:>12,} {size_mb:>12.1f} {use_cases[name]}')

# Speed test
print('\nInference Speed Test (CPU, batch size 1):')
test_input = torch.randn(1, 3, 224, 224)

for name, net in models_to_compare.items():
    net.eval()
    with torch.no_grad():
        net(test_input)  # warm-up run, excluded from the timing
        start = time.perf_counter()
        for _ in range(10):
            net(test_input)
        elapsed = (time.perf_counter() - start) / 10
    print(f'{name:<15}: {elapsed*1000:.1f} ms per image')

## 11. Choosing the Right Model

### Decision Guide:

**For Learning/Prototyping:**
- Use ResNet18
- Fast to train and test

**For Best Accuracy:**
- Use ResNet50 or EfficientNet-B3
- Higher accuracy, at the cost of more compute and memory

**For Mobile/Edge Deployment:**
- Use MobileNetV2 or MobileNetV3
- Optimized for speed and size

**For Production with GPUs:**
- Use ResNet50 or EfficientNet-B0
- Good balance of speed and accuracy

## 12. Complete Training Example

Here's how you'd train with transfer learning in practice:

In [None]:
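# Complete training function (runnable once train_loader and val_loader exist)

def train_with_transfer_learning(train_loader, val_loader, num_epochs=10):
    """
    Complete transfer learning training loop.

    Saves the weights with the best validation accuracy to 'best_model.pth'
    and returns the model as it stands after the last epoch.
    """
    # 1. Load pretrained model (`weights=` replaces the deprecated `pretrained=True`)
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)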
    
    # 2. Freeze all layers
    for param in model.parameters():
        param.requires_grad = False
    
    # 3. Replace final layer
    model.fc = nn.Linear(model.fc.in_features, 2)  # Binary classification
    model = model.to(device)
    
    # 4. Setup training
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.fc.parameters(), lr=0.001)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3)
    
    best_val_acc = 0.0
    
    # 5. Training loop
    for epoch in range(num_epochs):
        # Train
        model.train()
        train_loss = 0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            
            outputs = model(images)
            loss = criterion(outputs, labels)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
        
        # Validate
        model.eval()
        val_loss = 0
        correct = 0
        total = 0
        
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                loss = criterion(outputs, labels)
                
                val_loss += loss.item()
                _, predicted = torch.max(outputs, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        
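        val_acc = correct / total
        avg_train_loss = train_loss / len(train_loader)
        avg_val_loss = val_loss / len(val_loader)

        # Update learning rate (ReduceLROnPlateau watches the validation loss)
        scheduler.step(avg_val_loss)

        # Save best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best_model.pth')

        print(f'Epoch {epoch+1}: Train Loss = {avg_train_loss:.4f}, '
              f'Val Loss = {avg_val_loss:.4f}, Val Acc = {val_acc*100:.2f}%')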
    
    return model

print("Training function defined!")
print("Call with: model = train_with_transfer_learning(train_loader, val_loader)")
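
The function returns the model from the last epoch, while the best weights live in `best_model.pth`. The sketch below shows one way to restore a saved `state_dict` and run predictions. It uses a hypothetical toy classifier (`build_toy_classifier`) and a temporary directory so it runs without data or downloads; with the real model you would rebuild ResNet18 with its 2-class `fc` layer before loading.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)

def build_toy_classifier():
    """Hypothetical stand-in for the fine-tuned ResNet18 (same architecture on save and load)."""
    return nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

trained = build_toy_classifier()

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = os.path.join(tmp_dir, 'best_model.pth')
    torch.save(trained.state_dict(), checkpoint_path)

    # Rebuild the same architecture, then load the saved weights into it
    restored = build_toy_classifier()
    restored.load_state_dict(torch.load(checkpoint_path, map_location='cpu'))

restored.eval()  # inference mode for layers such as dropout and batch norm
batch = torch.randn(3, 8)
with torch.no_grad():
    probabilities = torch.softmax(restored(batch), dim=1)  # class probabilities per sample
    predicted_classes = probabilities.argmax(dim=1)
    same_outputs = torch.allclose(restored(batch), trained(batch))

print(predicted_classes.tolist())
print(f'Restored model matches saved model: {same_outputs}')
```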

## Summary

You've learned:
- ✓ Why transfer learning works (generic features)
- ✓ Loading pretrained models
- ✓ Freezing layers for transfer learning
- ✓ Replacing final layer for custom tasks
- ✓ Fine-tuning strategies (feature extraction vs full fine-tuning)
- ✓ Using ImageNet statistics
- ✓ Learning rate scheduling
- ✓ Popular model architectures and when to use them
- ✓ Complete training workflow

**Key Takeaways:**
1. Transfer learning dramatically reduces training time and data requirements
2. Start with ResNet18 for prototyping
3. Always use ImageNet statistics for normalization with ImageNet-pretrained weights
4. Use learning rate scheduling for better convergence
5. Choose model based on your deployment constraints

**Next Step:** Apply this to your Cat vs Dog assignment! 🐱🐶