# Part 5.4: Fine-tuning & PEFT

Imagine you want to build a model that classifies medical images. Training from scratch would require millions of labeled medical images and weeks of compute. But what if you could start with a model that already understands edges, textures, shapes, and objects from training on millions of general images? You'd only need to teach it the *medical-specific* part.

This is **transfer learning** -- one of the most important practical techniques in modern deep learning. And with today's massive language models (billions of parameters), fine-tuning every parameter is often too expensive. That's where **Parameter-Efficient Fine-Tuning (PEFT)** methods like **LoRA** come in, letting us adapt huge models by training a number of parameters equal to less than 1% of the model's size.

## Learning Objectives

- [ ] Understand transfer learning and when to use feature extraction vs fine-tuning
- [ ] Implement full fine-tuning with discriminative learning rates
- [ ] Explain why Parameter-Efficient Fine-Tuning (PEFT) is necessary for large models
- [ ] Implement LoRA (Low-Rank Adaptation) from scratch and understand rank decomposition
- [ ] Compare adapters, prompt tuning, and LoRA as PEFT methods
- [ ] Build a complete fine-tuning pipeline with evaluation
- [ ] Understand the RLHF pipeline that produces ChatGPT-like models

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import math
from copy import deepcopy

%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
torch.manual_seed(42)
np.random.seed(42)

---

## 1. Transfer Learning

### The Key Insight

Think about how humans learn. When you learn to play tennis, you don't start from scratch -- you already know how to move your body, track objects with your eyes, and coordinate hand movements. You **transfer** these general skills to the new task.

Neural networks work the same way:

- **Early layers** learn general features (edges, textures, basic patterns)
- **Middle layers** learn compositional features (shapes, parts, common structures)
- **Later layers** learn task-specific features (specific objects, categories)

A model trained on ImageNet has already learned a rich hierarchy of visual features. A language model trained on the internet already "knows" grammar, facts, and reasoning patterns. We can **reuse** this knowledge for new tasks.

### Two Approaches

| Approach | What You Do | When to Use |
|----------|-------------|-------------|
| **Feature Extraction** | Freeze pretrained layers, only train new head | Small dataset, similar domain |
| **Fine-tuning** | Unfreeze some/all layers, train with small LR | Larger dataset, different domain |

**What this means:** Feature extraction treats the pretrained model as a fixed feature extractor -- like using it as a sophisticated preprocessing step. Fine-tuning actually updates the pretrained weights to better fit your specific task.
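One way to make the "preprocessing step" view concrete is to run the frozen backbone once, cache its outputs, and train only a new head on the cached features. The sketch below is illustrative rather than a recipe: `backbone` is a small randomly initialized stand-in for a real pretrained network, and the layer sizes, data, and variable names are all assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a pretrained backbone (hypothetical; a real one would be loaded from a checkpoint)
backbone = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False  # freeze: the backbone is never updated
backbone.eval()

# Toy target-task data (assumed shapes: 100 samples, 20 features, 3 classes)
X = torch.randn(100, 20)
y = torch.randint(0, 3, (100,))

# Feature extraction: one forward pass, no gradients, results cached
with torch.no_grad():
    features = backbone(X)

# Only the new head is trained, and it only ever sees the cached features
head = nn.Linear(32, 3)
optimizer = torch.optim.Adam(head.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

backbone_before = [p.clone() for p in backbone.parameters()]
initial_loss = criterion(head(features), y).item()
for _ in range(50):
    optimizer.zero_grad()
    loss = criterion(head(features), y)
    loss.backward()
    optimizer.step()
final_loss = loss.item()

backbone_unchanged = all(
    torch.equal(a, b) for a, b in zip(backbone_before, backbone.parameters())
)
print(f"cached features: {tuple(features.shape)}")
print(f"head loss: {initial_loss:.3f} -> {final_loss:.3f}, backbone unchanged: {backbone_unchanged}")
```

Because the backbone never changes, its outputs can be computed once and reused across epochs, which is what makes feature extraction cheap compared with fine-tuning.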

In [None]:
# Visualization: What different layers learn (general -> task-specific)
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

# Simulate what each layer "responds to"
np.random.seed(42)

# Layer 1: Edge detectors (general)
x = np.linspace(0, 1, 50)
y = np.linspace(0, 1, 50)
X, Y = np.meshgrid(x, y)

# Simple edge patterns
patterns = [
    ("Layer 1:\nEdges & Textures", X > 0.5, "General\n(Transfer easily)"),
    ("Layer 2:\nShapes & Parts", ((X-0.5)**2 + (Y-0.5)**2) < 0.15, "Somewhat General\n(Usually transfer)"),
    ("Layer 3:\nObject Parts", ((X-0.3)**2 + (Y-0.3)**2 < 0.05) | ((X-0.7)**2 + (Y-0.7)**2 < 0.05), "Task-Related\n(May need updating)"),
    ("Layer 4:\nTask-Specific", ((X-0.5)**2/0.08 + (Y-0.5)**2/0.2) < 1, "Task-Specific\n(Must retrain)")
]

colors = ['#2ecc71', '#27ae60', '#e67e22', '#e74c3c']
for ax, (title, pattern, transfer), color in zip(axes, patterns, colors):
    ax.imshow(pattern.astype(float), cmap='gray', extent=[0,1,0,1])
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.set_xlabel(transfer, fontsize=10, color=color, fontweight='bold')
    ax.set_xticks([])
    ax.set_yticks([])

plt.suptitle('What Neural Network Layers Learn (General to Specific)',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("Key insight: Earlier layers learn universal features that transfer across tasks")
print("Later layers learn task-specific features that need retraining")

### When to Use Which Approach

The right strategy depends on two factors: **dataset size** and **domain similarity**.

| | Small Dataset | Large Dataset |
|---|---|---|
| **Similar Domain** | Feature extraction (freeze all, train head) | Fine-tune last few layers |
| **Different Domain** | Feature extraction + augmentation (risky!) | Fine-tune entire network |

**Examples:**
- Medical X-rays with 500 images, pretrained on ImageNet: **Feature extraction + augmentation** (different visual domain, small data)
- Satellite imagery with 100K images, pretrained on ImageNet: **Fine-tune all** (different domain, large data)
- Product reviews with 1K examples, pretrained on Wikipedia: **Feature extraction** (similar text domain, small data)
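The 2x2 table can be read as a simple lookup. The helper below, `choose_strategy`, is a hypothetical function that just encodes that table. It takes booleans rather than a sample-count threshold, because what counts as "small" depends on the task and is not something the table pins down.

```python
def choose_strategy(small_dataset: bool, similar_domain: bool) -> str:
    """Look up the suggested starting strategy from the dataset-size / domain table."""
    table = {
        (True, True): "Feature extraction (freeze all, train head)",
        (True, False): "Feature extraction + augmentation (risky!)",
        (False, True): "Fine-tune last few layers",
        (False, False): "Fine-tune entire network",
    }
    return table[(small_dataset, similar_domain)]


# Satellite imagery with 100K images, pretrained on ImageNet
print(choose_strategy(small_dataset=False, similar_domain=False))
# Product reviews with 1K examples, pretrained on Wikipedia
print(choose_strategy(small_dataset=True, similar_domain=True))
```

Treat the result as a starting point: validation performance should decide whether to unfreeze more or fewer layers.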

In [None]:
# Visualization: Feature extraction vs Fine-tuning decision flowchart
fig, ax = plt.subplots(figsize=(12, 7))
ax.set_xlim(0, 10)
ax.set_ylim(0, 8)
ax.axis('off')

# Draw boxes and arrows
boxes = {
    'start': (5, 7.2, 'Start: Have a\npretrained model'),
    'data': (5, 5.5, 'How much\nlabeled data?'),
    'sim_small': (2.5, 3.8, 'Domain\nsimilar?'),
    'sim_large': (7.5, 3.8, 'Domain\nsimilar?'),
    'fe': (1, 2, 'Feature\nExtraction'),
    'fe_aug': (4, 2, 'Feature Extraction\n+ Augmentation'),
    'ft_last': (6, 2, 'Fine-tune\nLast Layers'),
    'ft_all': (9, 2, 'Fine-tune\nAll Layers'),
}

colors_map = {
    'start': '#3498db', 'data': '#f39c12',
    'sim_small': '#f39c12', 'sim_large': '#f39c12',
    'fe': '#2ecc71', 'fe_aug': '#e67e22',
    'ft_last': '#2ecc71', 'ft_all': '#e74c3c'
}

for key, (x, y, text) in boxes.items():
    color = colors_map[key]
    bbox = dict(boxstyle='round,pad=0.3', facecolor=color, alpha=0.3, edgecolor=color)
    ax.text(x, y, text, ha='center', va='center', fontsize=11, fontweight='bold', bbox=bbox)

# Arrows with labels
arrow_props = dict(arrowstyle='->', color='gray', lw=2)
ax.annotate('', xy=(5, 6.1), xytext=(5, 6.8), arrowprops=arrow_props)
ax.annotate('', xy=(2.5, 4.5), xytext=(4.2, 5.2), arrowprops=arrow_props)
ax.text(3, 5.1, 'Small', fontsize=10, color='gray')
ax.annotate('', xy=(7.5, 4.5), xytext=(5.8, 5.2), arrowprops=arrow_props)
ax.text(6.8, 5.1, 'Large', fontsize=10, color='gray')

ax.annotate('', xy=(1, 2.7), xytext=(2, 3.4), arrowprops=arrow_props)
ax.text(1.1, 3.2, 'Yes', fontsize=9, color='green')
ax.annotate('', xy=(4, 2.7), xytext=(3, 3.4), arrowprops=arrow_props)
ax.text(3.7, 3.2, 'No', fontsize=9, color='red')
ax.annotate('', xy=(6, 2.7), xytext=(7, 3.4), arrowprops=arrow_props)
ax.text(6.1, 3.2, 'Yes', fontsize=9, color='green')
ax.annotate('', xy=(9, 2.7), xytext=(8, 3.4), arrowprops=arrow_props)
ax.text(8.7, 3.2, 'No', fontsize=9, color='red')

ax.set_title('Transfer Learning Strategy Decision Guide', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### Deep Dive: Why Transfer Learning Works

Transfer learning works because of a fundamental property of neural networks: **hierarchical feature learning**.

When a deep network is trained on a large dataset (like ImageNet with 1.2M images or a text corpus with billions of words), the early layers learn features that are remarkably universal:
- In vision: edges, corners, color gradients, textures
- In language: word meanings, grammar patterns, syntactic structures

These features are useful for almost any task in the same modality. The later layers combine these basic features into increasingly abstract and task-specific representations.

#### Key Insight

The knowledge stored in pretrained weights represents a **prior** over the space of useful features. Instead of learning everything from random initialization, you start with a good initialization that already captures the structure of the data domain.

#### Common Misconceptions

| Misconception | Reality |
|---------------|---------|
| Transfer learning only works within the same domain | Cross-domain transfer works too (e.g., ImageNet to medical) |
| You always need to fine-tune | Feature extraction alone often works well |
| More fine-tuning is always better | Too much fine-tuning can cause catastrophic forgetting |
| Transfer learning is optional for large datasets | Even with large datasets, pretrained models converge faster |

---

## 2. Full Fine-tuning

### The Approach

In full fine-tuning, we take a pretrained model and continue training **all parameters** on our new task-specific dataset. This is the most straightforward form of transfer learning.

**The process:**
1. Load a pretrained model (e.g., BERT, ResNet, GPT)
2. Replace the final classification head with one suited to your task
3. Train the entire model on your new dataset with a **small learning rate**

### The Challenge: Catastrophic Forgetting

When you fine-tune too aggressively, the model can "forget" the useful features it learned during pretraining. This is called **catastrophic forgetting**.

**What this means:** The pretrained weights encode valuable knowledge. If you update them too much with new task data (especially small datasets), you overwrite that knowledge and lose the benefit of pretraining.
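A rough way to see why aggressive updates are risky is to measure how far the weights drift from their pretrained values. The sketch below is a toy under stated assumptions: a randomly initialized layer stands in for a pretrained model, the data is random, and `weight_drift` is a hypothetical helper. Drift is only a proxy; measuring forgetting properly means re-evaluating on the source task.

```python
import torch
import torch.nn as nn
from copy import deepcopy

torch.manual_seed(0)

pretrained = nn.Linear(20, 5)          # stand-in for a pretrained model
X = torch.randn(64, 20)                # small toy target dataset
y = torch.randint(0, 5, (64,))


def weight_drift(lr, steps=20):
    """Fine-tune a copy with plain SGD; return the L2 distance from the pretrained weights."""
    model = deepcopy(pretrained)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        criterion(model(X), y).backward()
        optimizer.step()
    return (model.weight - pretrained.weight).norm().item()


drift_small = weight_drift(lr=1e-3)
drift_large = weight_drift(lr=1e-1)
print(f"drift with lr=1e-3: {drift_small:.4f}")
print(f"drift with lr=1e-1: {drift_large:.4f}")
```

The larger learning rate moves the weights much further from their starting point in the same number of steps, which is the mechanism behind overwriting pretrained knowledge.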

### Solution: Learning Rate Strategies

| Strategy | Description | When to Use |
|----------|-------------|-------------|
| **Small uniform LR** | Same small LR for all layers (e.g., 1e-5) | Simple, works well in practice |
| **Discriminative LR** | Lower LR for early layers, higher for later | Better performance, more complex |
| **Gradual unfreezing** | Unfreeze layers one at a time during training | Most conservative, prevents forgetting |
| **Warmup** | Start with tiny LR, ramp up | Stabilizes early training |
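Two of the strategies in the table, warmup and gradual unfreezing, can be sketched in a few lines. This is a minimal illustration, not a tuned recipe: the three-block `model`, the 5-step warmup, and the rule of unfreezing one block every 2 epochs are all assumptions chosen for readability. `LambdaLR` is PyTorch's scheduler that multiplies the base learning rate by a user-supplied factor.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR

torch.manual_seed(0)

# Toy "pretrained" network: two backbone blocks plus a new head (illustrative sizes)
model = nn.Sequential(
    nn.Sequential(nn.Linear(20, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 32), nn.ReLU()),
    nn.Linear(32, 5),
)
blocks = list(model.children())

# Gradual unfreezing: start with only the head trainable ...
for block in blocks[:-1]:
    for p in block.parameters():
        p.requires_grad = False


def unfreeze_schedule(epoch, every=2):
    """... then unfreeze one more block (from the top down) every `every` epochs."""
    n_unfrozen = min(len(blocks), 1 + epoch // every)
    for block in blocks[len(blocks) - n_unfrozen:]:
        for p in block.parameters():
            p.requires_grad = True
    return n_unfrozen


# Warmup: linearly ramp the LR multiplier from 1/5 to 1 over the first 5 optimizer steps
warmup_steps = 5
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

X, y = torch.randn(32, 20), torch.randint(0, 5, (32,))
criterion = nn.CrossEntropyLoss()
lrs, unfrozen = [], []
for epoch in range(6):
    unfrozen.append(unfreeze_schedule(epoch))
    lrs.append(scheduler.get_last_lr()[0])
    optimizer.zero_grad()
    criterion(model(X), y).backward()
    optimizer.step()      # frozen parameters have no gradient and are skipped
    scheduler.step()

print("unfrozen blocks per epoch:", unfrozen)
print("learning rate per epoch:  ", [round(lr, 6) for lr in lrs])
```

In a real run, each "epoch" here would be a full pass over the data, and the warmup would typically span many more steps.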

In [None]:
# Implement a simple pretrained model and fine-tuning

class SimplePretrainedModel(nn.Module):
    """
    A simple model to demonstrate fine-tuning concepts.
    Pretrained on a 'source' task, fine-tuned on a 'target' task.
    """
    def __init__(self, input_dim=20, hidden_dim=64, num_classes=5):
        super().__init__()
        # "Pretrained" feature extractor
        self.feature_extractor = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 32),
            nn.ReLU()
        )
        # Task-specific head
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        features = self.feature_extractor(x)
        return self.classifier(features)


def create_synthetic_data(n_samples, input_dim=20, num_classes=5, task='source'):
    """Create synthetic classification data."""
    torch.manual_seed(42 if task == 'source' else 123)
    X = torch.randn(n_samples, input_dim)
    # Different tasks use different feature combinations
    if task == 'source':
        logits = X[:, :num_classes] * 2 + X[:, 5:5+num_classes]
    else:  # target task uses related but different features
        logits = X[:, 2:2+num_classes] * 1.5 + X[:, 7:7+num_classes] * 0.8
    y = logits.argmax(dim=1)
    return X, y


def train_model(model, X_train, y_train, X_val, y_val, epochs=50, lr=0.01, verbose=False):
    """Train a model and return loss/accuracy history."""
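    # Only pass trainable parameters so frozen (requires_grad=False) weights are excluded
    optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)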
    criterion = nn.CrossEntropyLoss()

    train_losses, val_accs = [], []
    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        outputs = model(X_train)
        loss = criterion(outputs, y_train)
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_preds = model(X_val).argmax(dim=1)
            val_acc = (val_preds == y_val).float().mean().item()

        train_losses.append(loss.item())
        val_accs.append(val_acc)

    return train_losses, val_accs


# Step 1: "Pretrain" on source task (large dataset)
X_source, y_source = create_synthetic_data(2000, task='source')
pretrained_model = SimplePretrainedModel()
source_losses, source_accs = train_model(
    pretrained_model, X_source[:1600], y_source[:1600],
    X_source[1600:], y_source[1600:], epochs=100, lr=0.005
)
print(f"Source task final accuracy: {source_accs[-1]:.4f}")

# Step 2: Create target task (small dataset)
X_target, y_target = create_synthetic_data(200, task='target')
X_train_t, y_train_t = X_target[:150], y_target[:150]
X_val_t, y_val_t = X_target[150:], y_target[150:]

# Approach 1: Train from scratch
scratch_model = SimplePretrainedModel()
scratch_losses, scratch_accs = train_model(
    scratch_model, X_train_t, y_train_t, X_val_t, y_val_t,
    epochs=100, lr=0.005
)

# Approach 2: Feature extraction (freeze backbone)
fe_model = deepcopy(pretrained_model)
for param in fe_model.feature_extractor.parameters():
    param.requires_grad = False
fe_model.classifier = nn.Linear(32, 5)  # New head
fe_losses, fe_accs = train_model(
    fe_model, X_train_t, y_train_t, X_val_t, y_val_t,
    epochs=100, lr=0.01
)

# Approach 3: Full fine-tuning (small LR)
ft_model = deepcopy(pretrained_model)
ft_model.classifier = nn.Linear(32, 5)  # New head
ft_losses, ft_accs = train_model(
    ft_model, X_train_t, y_train_t, X_val_t, y_val_t,
    epochs=100, lr=0.001
)

print(f"\nTarget task accuracy (200 samples):")
print(f"  From scratch:       {scratch_accs[-1]:.4f}")
print(f"  Feature extraction: {fe_accs[-1]:.4f}")
print(f"  Full fine-tuning:   {ft_accs[-1]:.4f}")

In [None]:
# Visualization: Training curves comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curves
axes[0].plot(scratch_losses, label='From Scratch', color='red', alpha=0.7)
axes[0].plot(fe_losses, label='Feature Extraction', color='blue', alpha=0.7)
axes[0].plot(ft_losses, label='Full Fine-tuning', color='green', alpha=0.7)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Training Loss')
axes[0].set_title('Training Loss: Fine-tuning vs From Scratch', fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy curves
axes[1].plot(scratch_accs, label='From Scratch', color='red', alpha=0.7)
axes[1].plot(fe_accs, label='Feature Extraction', color='blue', alpha=0.7)
axes[1].plot(ft_accs, label='Full Fine-tuning', color='green', alpha=0.7)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Validation Accuracy')
axes[1].set_title('Validation Accuracy: Fine-tuning vs From Scratch', fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Typical pattern: fine-tuning reaches higher accuracy faster than training from scratch")
print("Feature extraction converges quickly but may plateau lower (frozen features)")

In [None]:
# Discriminative Learning Rates

class DiscriminativeLRModel(nn.Module):
    """Model with separate parameter groups for discriminative LR."""
    def __init__(self, pretrained_model, num_classes=5):
        super().__init__()
        self.feature_extractor = deepcopy(pretrained_model.feature_extractor)
        self.classifier = nn.Linear(32, num_classes)

    def get_param_groups(self, base_lr=0.001, lr_mult=0.1):
        """
        Return parameter groups with discriminative learning rates.
        Earlier layers get smaller LRs: base_lr * lr_mult**2 for the first
        block, base_lr * lr_mult for the middle block, base_lr for the last
        block, and base_lr * 10 for the new classifier head.
        """
        # Layers of the sequential: [Linear, ReLU, Linear, ReLU, Linear, ReLU]
        layers = list(self.feature_extractor.children())

        # (layers, learning rate) per block, from earliest to latest
        blocks = [
            (layers[:2], base_lr * (lr_mult ** 2)),   # Early layers: smallest LR
            (layers[2:4], base_lr * lr_mult),          # Middle layers: medium LR
            (layers[4:], base_lr),                     # Later layers: larger LR
        ]

        param_groups = []
        for block_layers, lr in blocks:
            for layer in block_layers:
                # Materialize the generator so the group can be inspected and
                # still passed to an optimizer; skip parameter-free layers (ReLU)
                params = list(layer.parameters())
                if params:
                    param_groups.append({'params': params, 'lr': lr})

        # Classifier: highest LR
        param_groups.append({'params': list(self.classifier.parameters()), 'lr': base_lr * 10})

        return param_groups

# Demonstrate discriminative LR
disc_model = DiscriminativeLRModel(pretrained_model)
param_groups = disc_model.get_param_groups(base_lr=0.001, lr_mult=0.1)

print("Discriminative Learning Rates:")
print("=" * 50)
# Names for the groups that actually hold parameters (ReLU layers have none)
layer_names = ['Early Linear 1', 'Mid Linear 2', 'Late Linear 3', 'Classifier']
idx = 0
for pg in param_groups:
    n_params = sum(p.numel() for p in pg['params'])
    if n_params > 0:
        print(f"  {layer_names[idx]:15s}: LR = {pg['lr']:.6f}, params = {n_params}")
        idx += 1

print("\nKey idea: Earlier (more general) layers change slowly,")
print("later (more task-specific) layers change more freely")

---

## 3. Parameter-Efficient Fine-Tuning (PEFT)

### The Problem

Full fine-tuning works great for small models (millions of parameters). But consider modern language models (the memory column is a rough estimate at about 8 bytes per parameter, before activations):

| Model | Parameters | Storage (fp16) | Full Fine-tuning Memory |
|-------|-----------|-----------------|------------------------|
| BERT-base | 110M | 220 MB | ~1 GB |
| GPT-2 | 1.5B | 3 GB | ~12 GB |
| LLaMA-7B | 7B | 14 GB | ~56 GB |
| LLaMA-70B | 70B | 140 GB | ~560 GB |
| GPT-4 (est.) | ~1.8T | ~3.6 TB | ~14 TB |

**The challenges of full fine-tuning at scale:**
1. **Memory:** Need to store model + optimizer states + gradients for ALL parameters
2. **Storage:** Each fine-tuned version is a full copy of the model
3. **Catastrophic forgetting:** Every weight is free to move, so more pretrained knowledge can be overwritten
4. **Cost:** Training compute scales with parameter count
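The figures in the table can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes 2 bytes per parameter for fp16 storage and 8 bytes per parameter for full fine-tuning. That ratio is inferred from the table and corresponds to, for example, 16-bit weights, gradients, and two Adam moment buffers. `estimate_memory_gb` is a hypothetical helper. Activations are excluded, and mixed-precision setups that keep fp32 master weights and optimizer states need noticeably more.

```python
def estimate_memory_gb(n_params, bytes_storage=2, bytes_training=8):
    """Rough estimates in GB: fp16 storage, and training memory at an assumed 8 bytes/param."""
    storage_gb = n_params * bytes_storage / 1e9
    training_gb = n_params * bytes_training / 1e9
    return storage_gb, training_gb


models = {"BERT-base": 110e6, "GPT-2": 1.5e9, "LLaMA-7B": 7e9, "LLaMA-70B": 70e9}
for name, n in models.items():
    storage, training = estimate_memory_gb(n)
    print(f"{name:10s} storage ~{storage:7.2f} GB   full fine-tuning ~{training:7.2f} GB")
```

Changing `bytes_training` to 16 gives a more conservative estimate for training with fp32 optimizer states.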

### The Key Insight

Studies of the intrinsic dimensionality of fine-tuning suggest that the weight changes are typically **low-dimensional** -- you don't need to modify all parameters to adapt the model. PEFT methods exploit this by training only a tiny fraction of parameters.

**What this means:** Think of it like adjusting a complex machine. Instead of rebuilding the entire machine for a new task, you just add a few small adapters or turn a few knobs.

In [None]:
# Visualization: Parameter counts comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart: trainable parameters
methods = ['Full\nFine-tuning', 'LoRA\n(r=8)', 'LoRA\n(r=4)', 'Adapters', 'Prompt\nTuning']
# For a 7B parameter model
total_params = 7_000_000_000
trainable = [total_params, 33_500_000, 16_800_000, 28_000_000, 40_960]
percentages = [100, 0.48, 0.24, 0.4, 0.0006]
colors = ['#e74c3c', '#3498db', '#2ecc71', '#f39c12', '#9b59b6']

bars = axes[0].bar(methods, [t/1e6 for t in trainable], color=colors, alpha=0.8, edgecolor='black')
axes[0].set_ylabel('Trainable Parameters (Millions)', fontsize=12)
axes[0].set_title('Trainable Parameters by Method\n(7B Parameter Model)', fontsize=13, fontweight='bold')
axes[0].set_yscale('log')
axes[0].grid(True, alpha=0.3, axis='y')

# Add percentage labels
for bar, pct in zip(bars, percentages):
    axes[0].text(bar.get_x() + bar.get_width()/2., bar.get_height(),
                f'{pct}%', ha='center', va='bottom', fontweight='bold', fontsize=10)

# Pie chart: what gets trained in LoRA
sizes = [total_params - 33_500_000, 33_500_000]
labels = [f'Frozen\n{(total_params-33_500_000)/1e9:.2f}B', f'LoRA Trainable\n{33_500_000/1e6:.1f}M']
explode = (0, 0.1)
axes[1].pie(sizes, explode=explode, labels=labels, colors=['#95a5a6', '#3498db'],
            autopct='%1.2f%%', shadow=True, startangle=90, textprops={'fontsize': 12})
axes[1].set_title('LoRA (r=8): What Gets Trained?', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

print("Key takeaway: PEFT methods train < 1% of parameters while often achieving")
print("performance comparable to full fine-tuning")

### Why This Matters in Machine Learning

| Application | How PEFT Helps |
|-------------|----------------|
| **Multi-task serving** | Store one base model + tiny adapters per task |
| **Personalization** | Each user gets their own small adapter |
| **Edge deployment** | Adapt models on resource-constrained devices |
| **Rapid experimentation** | Try many configurations cheaply |
| **Democratization** | Fine-tune billion-parameter models on consumer GPUs |
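The multi-task serving row can be illustrated with one shared, frozen layer and a small set of adapter tensors per task. This is a sketch under assumptions: the task names are hypothetical, random values stand in for trained adapters, and the low-rank form `B @ A` anticipates the LoRA parameterization covered in Section 4.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d, r = 64, 4
base = nn.Linear(d, d)                     # shared, frozen base layer
for p in base.parameters():
    p.requires_grad = False

# One small (A, B) pair per task; random values stand in for trained adapters
adapters = {
    task: {"A": torch.randn(r, d) * 0.1, "B": torch.randn(d, r) * 0.1}
    for task in ["sentiment", "summarization"]   # hypothetical task names
}


def forward_with_adapter(x, task):
    """Base output plus the low-rank correction for the selected task."""
    ad = adapters[task]
    return base(x) + x @ ad["A"].T @ ad["B"].T


x = torch.randn(2, d)
out_a = forward_with_adapter(x, "sentiment")
out_b = forward_with_adapter(x, "summarization")

base_params = sum(p.numel() for p in base.parameters())
adapter_params = sum(t.numel() for t in adapters["sentiment"].values())
print(f"base layer: {base_params} params (stored once)")
print(f"each adapter: {adapter_params} params")
print(f"outputs differ across tasks: {not torch.allclose(out_a, out_b)}")
```

Switching tasks only requires selecting a different entry in `adapters`; the base weights are never copied or modified.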

---

## 4. LoRA (Low-Rank Adaptation)

### The Key Insight

When you fine-tune a pretrained model, the weight update matrix $\Delta W$ is hypothesized, with supporting empirical evidence from the LoRA work, to have **low intrinsic rank**. This means the changes can be represented by a pair of much smaller matrices.

**Intuition:** Imagine you have a 1000x1000 weight matrix (1 million parameters). During fine-tuning, the *change* to this matrix ($\Delta W$) can be well-approximated by the product of a 1000x4 matrix and a 4x1000 matrix. That's only 8,000 parameters instead of 1,000,000!
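The mechanics of "a big matrix approximated by two thin ones" can be checked with a truncated SVD. Note the assumption: `delta_W` below is constructed to be nearly rank 4 (a rank-4 product plus small noise), so this demonstrates the approximation step, not evidence about real fine-tuning updates.

```python
import torch

torch.manual_seed(0)

d, r = 200, 4
# Synthetic "weight update": an exactly rank-4 matrix plus a little full-rank noise
delta_W = torch.randn(d, r) @ torch.randn(r, d) + 0.01 * torch.randn(d, d)

# Best rank-r approximation via truncated SVD
U, S, Vh = torch.linalg.svd(delta_W, full_matrices=False)
approx = (U[:, :r] * S[:r]) @ Vh[:r, :]

rel_error = ((delta_W - approx).norm() / delta_W.norm()).item()
full_params = d * d
low_rank_params = 2 * d * r
print(f"full update: {full_params} params, rank-{r} factors: {low_rank_params} params")
print(f"relative error of rank-{r} approximation: {rel_error:.4f}")
```

LoRA does not compute an SVD during training; it learns the two thin factors directly by gradient descent.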

### The LoRA Formula

Instead of learning $\Delta W$ directly (which is huge), LoRA decomposes it:

$$W_{new} = W_{pretrained} + \Delta W = W_{pretrained} + BA$$

Where:
- $W_{pretrained}$ is the original weight matrix ($d \times d$), **frozen**
- $B$ is a learned matrix ($d \times r$), initialized to **zeros**
- $A$ is a learned matrix ($r \times d$), initialized **randomly**
- $r$ is the **rank** (typically 4, 8, or 16) -- much smaller than $d$
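The asymmetric initialization matters: with $B = 0$ the product $BA$ is zero, so training starts exactly at the pretrained model, while a non-zero $A$ keeps the gradient with respect to $B$ alive. If both were zero, both gradients would vanish and nothing would be learned. The autograd check below is a small sketch with made-up sizes and a mean-squared-error loss chosen only for illustration.

```python
import torch

torch.manual_seed(0)

d, r = 16, 4
W = torch.randn(d, d)                               # frozen pretrained weight
A = (torch.randn(r, d) * 0.01).requires_grad_()     # random init
B = torch.zeros(d, r, requires_grad=True)           # zero init

x = torch.randn(8, d)
target = torch.randn(8, d)

# Forward pass with the LoRA update: h = x (W + BA)^T
h = x @ (W + B @ A).T
starts_at_pretrained = torch.allclose(h, x @ W.T)

loss = ((h - target) ** 2).mean()
loss.backward()

grad_A_norm = A.grad.norm().item()
grad_B_norm = B.grad.norm().item()
print(f"output equals pretrained output at init: {starts_at_pretrained}")
print(f"|grad B| = {grad_B_norm:.4f}  (non-zero, so B starts learning)")
print(f"|grad A| = {grad_A_norm:.4f}  (zero on the first step, because B = 0)")
```

After the first update $B$ is no longer zero, so $A$ begins to receive gradient as well.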

#### Breaking down the formula:

| Component | Shape | Parameters | Role |
|-----------|-------|-----------|------|
| $W_{pretrained}$ | $d \times d$ | $d^2$ (frozen) | Original pretrained knowledge |
| $B$ | $d \times r$ | $d \times r$ (trainable) | "Up-projection" from rank $r$ back to $d$ |
| $A$ | $r \times d$ | $r \times d$ (trainable) | "Down-projection" from $d$ to rank $r$ |
| $BA$ | $d \times d$ | 0 (computed) | Low-rank approximation of $\Delta W$ |
| Total trainable | -- | $2 \times d \times r$ | Much less than $d^2$! |

**What this means:** LoRA says "the way this weight matrix needs to change for the new task can be captured by a narrow bottleneck of rank $r$." At $r=8$ and $d=4096$ (typical for LLMs), you're training $2 \times 4096 \times 8 = 65,536$ parameters instead of $4096^2 = 16,777,216$ -- a **256x reduction!**
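The arithmetic generalizes to non-square layers: $A$ has shape $r \times d_{in}$ and $B$ has shape $d_{out} \times r$. The helper below, `lora_param_count`, is a hypothetical function for checking these counts; it ignores biases and assumes one LoRA pair per weight matrix.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


d = 4096
full = d * d
for r in [4, 8, 16]:
    lora = lora_param_count(d, d, r)
    print(f"r={r:2d}: {lora:>7,} LoRA params vs {full:,} full  ({full / lora:.0f}x fewer)")
```

Doubling the rank doubles the trainable parameters and halves the reduction factor, so the cost of a higher rank is linear in $r$.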

In [None]:
# Visualization: Rank decomposition explained step by step
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

d = 8  # Use small dimension for visualization
r = 2  # Rank

# Create example matrices
torch.manual_seed(42)
W_pretrained = torch.randn(d, d) * 0.5

# LoRA matrices
A = torch.randn(r, d) * 0.1  # Random init
B = torch.zeros(d, r)          # Zero init (so initial delta W = 0)

# After some "training"
A_trained = torch.randn(r, d) * 0.3
B_trained = torch.randn(d, r) * 0.3
delta_W = B_trained @ A_trained

# Plot W_pretrained
im0 = axes[0].imshow(W_pretrained.numpy(), cmap='RdBu', vmin=-1.5, vmax=1.5)
axes[0].set_title(f'W_pretrained\n({d}x{d} = {d*d} params)\nFROZEN', fontweight='bold')
axes[0].set_xlabel('Frozen (not updated)')

# Plot B (d x r)
im1 = axes[1].imshow(B_trained.numpy(), cmap='RdBu', vmin=-1, vmax=1)
axes[1].set_title(f'B matrix\n({d}x{r} = {d*r} params)\nTRAINABLE', fontweight='bold', color='blue')

# Plot A (r x d)
im2 = axes[2].imshow(A_trained.numpy(), cmap='RdBu', vmin=-1, vmax=1)
axes[2].set_title(f'A matrix\n({r}x{d} = {r*d} params)\nTRAINABLE', fontweight='bold', color='blue')

# Plot BA = delta_W
im3 = axes[3].imshow(delta_W.numpy(), cmap='RdBu', vmin=-1.5, vmax=1.5)
axes[3].set_title(f'BA = delta_W\n({d}x{d} but rank {r})\nLow-rank update', fontweight='bold', color='green')

plt.suptitle('LoRA: Low-Rank Decomposition of Weight Updates', fontsize=14, fontweight='bold', y=1.05)

# Add equation annotation
fig.text(0.5, -0.02, f'W_new = W_pretrained + B*A    |    Trainable params: {2*d*r} vs {d*d} (full)    |    Compression: {d*d/(2*d*r):.1f}x',
         ha='center', fontsize=12, style='italic',
         bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))

plt.tight_layout()
plt.show()

In [None]:
# Step-by-step numerical example of LoRA
print("=" * 65)
print("STEP-BY-STEP LoRA EXAMPLE")
print("=" * 65)

# Dimensions
d = 4  # Small for clarity
r = 2  # Rank
print(f"\nWeight matrix dimension: {d}x{d} = {d*d} parameters")
print(f"LoRA rank: r = {r}")
print(f"LoRA parameters: 2 * {d} * {r} = {2*d*r} parameters")
print(f"Compression ratio: {d*d}/{2*d*r} = {d*d/(2*d*r):.1f}x\n")

# Original weight matrix (frozen)
W = torch.tensor([[1.0, 0.5, -0.3, 0.2],
                   [0.4, -0.8, 0.6, 0.1],
                   [-0.2, 0.3, 1.0, -0.5],
                   [0.7, -0.1, 0.4, 0.9]])
print("Step 1: Original pretrained weight W (FROZEN):")
print(W.numpy().round(2))

# Initialize LoRA matrices
B = torch.zeros(d, r)  # Zero init!
A = torch.randn(d, r).T * 0.1  # Small random init, shape (r, d)

print(f"\nStep 2: Initialize LoRA matrices")
print(f"  B ({d}x{r}), initialized to ZEROS:")
print(f"  {B.numpy().round(3)}")
print(f"  A ({r}x{d}), initialized randomly:")
print(f"  {A.numpy().round(3)}")

# Initial output: BA = 0, so W_new = W (no change!)
delta_W_init = B @ A
print(f"\nStep 3: Initial delta_W = B @ A = all zeros")
print(f"  (This is key! At initialization, LoRA changes nothing)")

# After training (simulate learned values)
B_learned = torch.tensor([[0.3, -0.1],
                           [0.1, 0.4],
                           [-0.2, 0.2],
                           [0.0, -0.3]])
A_learned = torch.tensor([[0.2, -0.3, 0.1, 0.4],
                           [-0.1, 0.2, 0.3, -0.2]])

print(f"\nStep 4: After training, B and A are updated:")
print(f"  B_learned ({d}x{r}):")
print(f"  {B_learned.numpy().round(3)}")
print(f"  A_learned ({r}x{d}):")
print(f"  {A_learned.numpy().round(3)}")

delta_W = B_learned @ A_learned
W_new = W + delta_W

print(f"\nStep 5: Compute delta_W = B_learned @ A_learned:")
print(delta_W.numpy().round(3))

print(f"\nStep 6: W_new = W + delta_W:")
print(W_new.numpy().round(3))

print(f"\nNote: delta_W is rank-{r} (at most), but modifies all {d*d} entries!")
print(f"We only trained 2*{d}*{r} = {2*d*r} parameters to get a {d}x{d} = {d*d} entry update!")

In [None]:
# Implement LoRA from scratch!

class LoRALinear(nn.Module):
    """
    A Linear layer augmented with LoRA (Low-Rank Adaptation).

    The forward pass computes: y = W_frozen @ x + (B @ A) @ x * (alpha/r)

    Args:
        original_layer: The pretrained nn.Linear layer to augment
        rank: The rank of the LoRA decomposition
        alpha: Scaling factor for LoRA (controls magnitude)
    """
    def __init__(self, original_layer, rank=4, alpha=1.0):
        super().__init__()
        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha

        in_features = original_layer.in_features
        out_features = original_layer.out_features

        # Freeze original weights
        for param in self.original_layer.parameters():
            param.requires_grad = False

        # LoRA matrices
        # A: (rank, in_features) - initialized with small random values
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        # B: (out_features, rank) - initialized to zeros
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

        # Scaling factor
        self.scaling = alpha / rank

    def forward(self, x):
        # Original frozen computation
        original_output = self.original_layer(x)

        # LoRA adaptation: x @ A^T @ B^T * scaling
        lora_output = F.linear(F.linear(x, self.lora_A), self.lora_B) * self.scaling

        return original_output + lora_output

    def get_merged_weight(self):
        """Merge LoRA weights into original for inference (no extra cost!)."""
        return self.original_layer.weight + (self.lora_B @ self.lora_A) * self.scaling

    def extra_repr(self):
        return f'rank={self.rank}, alpha={self.alpha}, scaling={self.scaling:.4f}'


# Demonstrate LoRA
print("LoRA Linear Layer Demo")
print("=" * 50)

# Create original layer
original = nn.Linear(64, 32)
total_original = sum(p.numel() for p in original.parameters())
print(f"Original layer: {64}x{32} + bias = {total_original} parameters")

# Wrap with LoRA
for rank in [1, 2, 4, 8, 16]:
    lora_layer = LoRALinear(original, rank=rank, alpha=rank)
    trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in lora_layer.parameters())
    print(f"  LoRA rank={rank:2d}: {trainable:5d} trainable / {total} total "
          f"({100*trainable/total:.1f}% trainable, {total_original/trainable:.1f}x compression)")

# Verify forward pass works
x = torch.randn(4, 64)
lora = LoRALinear(nn.Linear(64, 32), rank=4, alpha=4)
y = lora(x)
print(f"\nForward pass: input {x.shape} -> output {y.shape}")

# Verify initialization preserves original output
original_out = lora.original_layer(x)
lora_out = lora(x)
print(f"Output matches original at init: {torch.allclose(original_out, lora_out, atol=1e-6)}")
print("(Because B is initialized to zeros, so B@A = 0)")

In [None]:
# Visualization: How LoRA modifies weight matrices for different ranks

fig, axes = plt.subplots(2, 4, figsize=(16, 8))

torch.manual_seed(42)
d_in, d_out = 32, 32
original_weight = torch.randn(d_out, d_in) * 0.5

ranks = [1, 2, 4, 8]

for i, r in enumerate(ranks):
    # Simulate trained LoRA
    A = torch.randn(r, d_in) * 0.3
    B = torch.randn(d_out, r) * 0.3
    delta_W = B @ A
    W_new = original_weight + delta_W

    # Top row: delta_W
    im = axes[0, i].imshow(delta_W.numpy(), cmap='RdBu', vmin=-2, vmax=2)
    axes[0, i].set_title(f'delta_W (rank={r})\nParams: {2*d_in*r}', fontweight='bold')
    axes[0, i].set_ylabel('delta_W' if i == 0 else '')

    # Bottom row: singular values showing rank
    # torch.svd is deprecated; svdvals returns just the singular values
    S = torch.linalg.svdvals(delta_W)
    axes[1, i].bar(range(min(16, len(S))), S[:16].numpy(), color='steelblue', alpha=0.8)
    axes[1, i].axhline(y=0.01, color='red', linestyle='--', alpha=0.5, label='Threshold')
    axes[1, i].set_xlabel('Singular Value Index')
    axes[1, i].set_ylabel('Value' if i == 0 else '')
    axes[1, i].set_title(f'Singular Values\n(Only {r} non-zero!)', fontweight='bold')
    axes[1, i].grid(True, alpha=0.3)

plt.suptitle('LoRA Weight Updates at Different Ranks', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Key observation: delta_W = B@A has at most r non-zero singular values")
print("Higher rank = more expressive updates, but more parameters")

In [None]:
# Exploration: choosing rank r (the accuracy numbers below are simulated for illustration)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Simulate accuracy vs rank tradeoff
d = 768  # BERT hidden size
ranks = [1, 2, 4, 8, 16, 32, 64, 128, 256]
params_per_layer = [2 * d * r for r in ranks]
total_params_pct = [2 * d * r / (d * d) * 100 for r in ranks]

# Simulated accuracy (typical pattern: diminishing returns)
accuracies = [85.2, 88.1, 90.5, 91.8, 92.3, 92.5, 92.6, 92.7, 92.7]
full_ft_accuracy = 92.8

# Params vs rank
axes[0].plot(ranks, total_params_pct, 'bo-', linewidth=2, markersize=8)
axes[0].fill_between(ranks, total_params_pct, alpha=0.1, color='blue')
axes[0].set_xlabel('LoRA Rank (r)', fontsize=12)
axes[0].set_ylabel('Trainable Parameters\n(% of full weight matrix)', fontsize=12)
axes[0].set_title('Parameter Efficiency vs Rank', fontweight='bold', fontsize=13)
axes[0].grid(True, alpha=0.3)
axes[0].set_xscale('log', base=2)

# Accuracy vs rank
axes[1].plot(ranks, accuracies, 'go-', linewidth=2, markersize=8, label='LoRA')
axes[1].axhline(y=full_ft_accuracy, color='red', linestyle='--', linewidth=2, label='Full fine-tuning')
axes[1].fill_between(ranks, accuracies, alpha=0.1, color='green')
axes[1].set_xlabel('LoRA Rank (r)', fontsize=12)
axes[1].set_ylabel('Accuracy (%)', fontsize=12)
axes[1].set_title('Accuracy vs Rank (Diminishing Returns)', fontweight='bold', fontsize=13)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
axes[1].set_xscale('log', base=2)
axes[1].annotate('Sweet spot\n(r=8)', xy=(8, 91.8), xytext=(20, 88),
                arrowprops=dict(arrowstyle='->', color='orange', lw=2),
                fontsize=12, fontweight='bold', color='orange')

plt.tight_layout()
plt.show()

print("Practical guidance for choosing rank r (rules of thumb; validate on your task):")
print("  r=4:  Often enough for simple tasks, maximum efficiency")
print("  r=8:  Common starting point with a good tradeoff")
print("  r=16: Try when r=8 underperforms")
print("  r>16: Usually diminishing returns; reserve for harder tasks")

### Deep Dive: Why Low-Rank Works

Empirically, when you fine-tune a large model for a specific task, the weight changes tend to lie in a **low-dimensional subspace**. But why?

**Intuition 1: Task Simplicity.** Most downstream tasks are simpler than the pretraining task. Classifying sentiment (positive/negative) requires much less information than predicting the next word in all of Wikipedia. The "direction" of change in weight space is simple.

**Intuition 2: Over-parameterization.** Large models have far more parameters than needed for any single task. The relevant "task information" can be compressed into a small subspace.

**Intuition 3: Weight Matrix Structure.** Pretrained weight matrices already capture rich representations. Fine-tuning only needs to "steer" these representations slightly -- and small steering adjustments are inherently low-rank.
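
One way to make the low-rank intuition concrete is to build a synthetic update with a few dominant directions plus small noise, then check how much of its energy a rank-r truncation keeps. The matrix below is simulated (it is not taken from a real fine-tuning run), and `energy_captured` is a helper name introduced here for illustration.

```python
import torch

torch.manual_seed(0)

d, true_rank = 256, 4

# Simulated weight update: a rank-4 "steering" component plus small dense noise.
# This is an assumption for illustration, not a measured fine-tuning delta.
U = torch.randn(d, true_rank)
V = torch.randn(true_rank, d)
delta_W = U @ V + 0.01 * torch.randn(d, d)

def energy_captured(matrix, r):
    """Fraction of the squared Frobenius norm kept by the best rank-r approximation."""
    S = torch.linalg.svdvals(matrix)
    return (S[:r] ** 2).sum().item() / (S ** 2).sum().item()

for r in [1, 2, 4, 8]:
    print(f"rank {r}: {100 * energy_captured(delta_W, r):.2f}% of energy")
```

If real fine-tuning updates behave like this synthetic one, a rank-r factorization with small r loses very little. How closely that holds depends on the task and model.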

#### Key Insight

LoRA doesn't just save parameters -- it can also act as a **regularizer**. By limiting the rank of the update, it restricts how complex the change to the pretrained weights can be, which can reduce catastrophic forgetting.

#### Common Misconceptions

| Misconception | Reality |
|---------------|---------|
| LoRA always matches full fine-tuning | For very different tasks, full FT can still win |
| Lower rank is always better | Too low rank limits expressiveness |
| LoRA adds inference cost | Weights can be merged: W_new = W + (alpha/r) * BA (zero overhead!) |
| LoRA only works for attention layers | Works for any linear layer; attention projections are the most common target |
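
The merge claim in the table can be checked numerically. The sketch below uses plain tensors rather than this notebook's `LoRALinear`, and assumes the common convention that the update is scaled by `alpha / r`. If an implementation scales differently, the merged weight must use the same factor.

```python
import torch

torch.manual_seed(0)

d_in, d_out, r, alpha = 64, 32, 4, 8
scaling = alpha / r  # assumed LoRA scaling convention

W = torch.randn(d_out, d_in)       # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01    # LoRA down-projection
B = torch.randn(d_out, r)          # LoRA up-projection (non-zero, as after training)
x = torch.randn(5, d_in)

# Unmerged: base path plus the low-rank side path
y_unmerged = x @ W.T + scaling * (x @ A.T @ B.T)

# Merged: fold the update into a single weight matrix
W_merged = W + scaling * (B @ A)
y_merged = x @ W_merged.T

print("Outputs match:", torch.allclose(y_unmerged, y_merged, atol=1e-4))
```

After merging, inference uses a single matrix of the original shape, so the LoRA path adds no extra latency.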

---

## 5. Adapters

### The Idea

Adapters take a different approach from LoRA: instead of decomposing weight updates, they **insert small bottleneck layers** into the existing model architecture.

An adapter module:
1. **Down-projects** the hidden representation to a small dimension
2. Applies a **non-linearity** (like ReLU)
3. **Up-projects** back to the original dimension
4. Adds a **residual connection** (so the adapter can be "skipped")

This is similar to LoRA in spirit (bottleneck = low rank), but differs in execution (sequential layers vs parallel decomposition).

### Comparison: LoRA vs Adapters

| Aspect | LoRA | Adapters |
|--------|------|----------|
| **Where** | Modifies existing weight matrices | Inserts new layers |
| **Inference overhead** | None (can merge weights) | Small (extra computation) |
| **Architecture change** | No | Yes (adds modules) |
| **Training parameters** | $2 \times d \times r$ per layer | $2 \times d \times b + b + d$ per adapter (with biases) |
| **Non-linearity** | No | Yes (ReLU in bottleneck) |
| **Popular for** | LLMs (GPT, LLaMA) | Earlier work (BERT, ViT) |
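
As a quick check of the parameter counts, the sketch below counts parameters directly from `nn.Linear` modules. The sizes (`d=768`, `r=16`, `b=16`) are illustrative choices, and the LoRA factors are modeled here as bias-free linear layers.

```python
import torch.nn as nn

d, r, b = 768, 16, 16  # hidden size, LoRA rank, adapter bottleneck (illustrative)

# LoRA: A is (r x d) and B is (d x r), no biases
lora_A = nn.Linear(d, r, bias=False)
lora_B = nn.Linear(r, d, bias=False)
lora_params = sum(p.numel() for m in (lora_A, lora_B) for p in m.parameters())

# Adapter: down- and up-projection, each with a bias vector
down = nn.Linear(d, b)
up = nn.Linear(b, d)
adapter_params = sum(p.numel() for m in (down, up) for p in m.parameters())

print(f"LoRA    (2*d*r):         {lora_params:,}")
print(f"Adapter (2*d*b + b + d): {adapter_params:,}")
```

With equal rank and bottleneck size, the two methods differ only by the adapter's bias terms.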

In [None]:
# Implement an Adapter module

class AdapterLayer(nn.Module):
    """
    Adapter module: down-project -> ReLU -> up-project + residual.

    Args:
        hidden_dim: Dimension of the input/output
        bottleneck_dim: Dimension of the bottleneck (smaller = fewer params)
    """
    def __init__(self, hidden_dim, bottleneck_dim=16):
        super().__init__()
        self.down_project = nn.Linear(hidden_dim, bottleneck_dim)
        self.activation = nn.ReLU()
        self.up_project = nn.Linear(bottleneck_dim, hidden_dim)

        # Initialize up-project to zero so the adapter starts as an identity map
        nn.init.zeros_(self.up_project.weight)
        nn.init.zeros_(self.up_project.bias)

    def forward(self, x):
        residual = x
        x = self.down_project(x)
        x = self.activation(x)
        x = self.up_project(x)
        return x + residual  # Residual connection


# Visualize adapter architecture
fig, ax = plt.subplots(figsize=(8, 10))
ax.set_xlim(0, 10)
ax.set_ylim(0, 12)
ax.axis('off')

# Draw transformer block with adapter
components = [
    (5, 11, 'Input Embeddings', '#bdc3c7', 3),
    (5, 9.5, 'Self-Attention\n(Frozen)', '#3498db', 3),
    (5, 8, 'Adapter', '#e74c3c', 2),
    (5, 6.5, 'Layer Norm + Residual', '#bdc3c7', 3),
    (5, 5, 'Feed-Forward\n(Frozen)', '#3498db', 3),
    (5, 3.5, 'Adapter', '#e74c3c', 2),
    (5, 2, 'Layer Norm + Residual', '#bdc3c7', 3),
    (5, 0.5, 'Output', '#bdc3c7', 3),
]

for x, y, text, color, _width in components:
    bbox = dict(boxstyle='round,pad=0.3', facecolor=color, alpha=0.3, edgecolor=color)
    fontsize = 11 if 'Adapter' not in text else 12
    fw = 'bold' if 'Adapter' in text else 'normal'
    ax.text(x, y, text, ha='center', va='center', fontsize=fontsize, fontweight=fw, bbox=bbox)

# Arrows
for i in range(len(components)-1):
    ax.annotate('', xy=(5, components[i+1][1]+0.5), xytext=(5, components[i][1]-0.5),
               arrowprops=dict(arrowstyle='->', color='gray', lw=1.5))

# Adapter detail box
detail_x = 8.5
ax.text(detail_x, 8, 'Adapter Detail:', ha='center', fontsize=11, fontweight='bold')
detail_parts = [
    (detail_x, 7.3, 'Down: d->b', '#e74c3c'),
    (detail_x, 6.8, 'ReLU', '#f39c12'),
    (detail_x, 6.3, 'Up: b->d', '#e74c3c'),
    (detail_x, 5.8, '+ Residual', '#2ecc71'),
]
for x, y, t, c in detail_parts:
    ax.text(x, y, t, ha='center', fontsize=10,
            bbox=dict(boxstyle='round,pad=0.2', facecolor=c, alpha=0.2))

ax.set_title('Transformer Block with Adapters', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Show parameter comparison
hidden_dim = 768  # BERT-base
for bottleneck in [8, 16, 32, 64]:
    adapter = AdapterLayer(hidden_dim, bottleneck)
    n_params = sum(p.numel() for p in adapter.parameters())
    full_layer = hidden_dim * hidden_dim
    print(f"Adapter (bottleneck={bottleneck:2d}): {n_params:6d} params "
          f"({100*n_params/full_layer:.2f}% of full layer {full_layer:,})")
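
The zero initialization of the up-projection has a useful consequence: a freshly inserted adapter leaves the frozen model's output unchanged. The sketch below checks this with a small stand-in class, `AdapterSketch` (a hypothetical name that mirrors the pattern above), and also confirms that the frozen sublayer receives no gradients.

```python
import torch
import torch.nn as nn

class AdapterSketch(nn.Module):
    """Minimal adapter: down-project -> ReLU -> up-project + residual."""
    def __init__(self, hidden_dim, bottleneck_dim):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

torch.manual_seed(0)
frozen = nn.Linear(32, 32)  # stands in for a pretrained sublayer
for p in frozen.parameters():
    p.requires_grad = False

adapter = AdapterSketch(32, 4)
x = torch.randn(8, 32)

# At initialization the adapter is an identity map, so the output is unchanged
same_at_init = torch.equal(adapter(frozen(x)), frozen(x))

# Backpropagate: only adapter parameters should receive gradients
adapter(frozen(x)).sum().backward()
frozen_has_grad = any(p.grad is not None for p in frozen.parameters())
up_grad_nonzero = adapter.up.weight.grad.abs().sum().item() > 0

print("Identity at init:", same_at_init)
print("Frozen layer has gradients:", frozen_has_grad)
print("Adapter up-projection has gradient:", up_grad_nonzero)
```

Starting from an identity map means training begins from the pretrained model's behavior rather than from a randomly perturbed version of it.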

---

## 6. Prompt Engineering & Prompt Tuning

### Hard Prompts vs Soft Prompts

There are two fundamentally different ways to "steer" a language model:

**Hard Prompts (Prompt Engineering):** Craft text instructions in natural language.
```
"Classify the following review as positive or negative: {review}"
```
No model parameters change. You're just finding the right input words.

**Soft Prompts (Prompt Tuning):** Prepend learned continuous vectors to the input.
These vectors aren't real words -- they're optimized embeddings that steer the model.

### Comparison of Approaches

| Method | What Changes | Trainable Parameters (typical) | Human Readable? | Training Required? |
|--------|-------------|-------------|-----------------|-------------------|
| **Prompt Engineering** | Input text | 0 | Yes | No (manual craft) |
| **In-Context Learning** | Input text (examples) | 0 | Yes | No (select examples) |
| **Prompt Tuning** | Prepended embeddings | ~20K | No | Yes (gradient) |
| **Prefix Tuning** | Key/Value prefixes per layer | ~200K | No | Yes (gradient) |
| **LoRA** | Weight decomposition | ~1M-30M | N/A | Yes (gradient) |
| **Full Fine-tuning** | All weights | All | N/A | Yes (gradient) |
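
The parameter counts in the table depend on model size, prompt length, and which matrices are adapted. Below is a rough sketch of where such numbers come from, assuming a BERT-base-like shape (12 layers, hidden size 768). The prompt length, prefix length, rank, and number of target matrices are hypothetical choices for illustration, and the prefix-tuning count ignores any reparameterization network an implementation may train alongside the prefixes.

```python
n_layers, d = 12, 768  # assumed BERT-base-like shape

# Prompt tuning: one learned vector per virtual token, at the input only
n_prompt = 20
prompt_tuning = n_prompt * d

# Prefix tuning: learned key and value prefixes at every layer
n_prefix = 10
prefix_tuning = n_layers * 2 * n_prefix * d

# LoRA: A (r x d) and B (d x r) on two d x d projections per layer
r, targets_per_layer = 8, 2
lora = n_layers * targets_per_layer * 2 * d * r

# Fully training just those targeted projections, for scale
full_targets = n_layers * targets_per_layer * d * d

for name, count in [("Prompt tuning", prompt_tuning),
                    ("Prefix tuning", prefix_tuning),
                    ("LoRA (r=8)", lora),
                    ("Targeted matrices, full", full_targets)]:
    print(f"{name:<26} {count:>12,}")
```

Larger models and more target matrices push the LoRA count into the millions, which is where the upper end of the table's range comes from.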

In [None]:
# Visualization: Hard prompts vs Soft prompts

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Hard prompts
ax = axes[0]
ax.set_xlim(0, 10)
ax.set_ylim(0, 8)
ax.axis('off')
ax.set_title('Hard Prompts (Prompt Engineering)', fontsize=13, fontweight='bold')

# Token boxes for hard prompt
hard_tokens = ['Classify', 'this', 'review', 'as', 'positive', 'or', 'negative', ':', '"Great', 'movie!"']
y_pos = 6
for i, token in enumerate(hard_tokens):
    x_pos = 0.5 + (i % 5) * 1.8
    y = y_pos - (i // 5) * 1.5
    color = '#3498db' if i < 8 else '#2ecc71'  # blue = instruction tokens, green = input tokens
    ax.text(x_pos, y, token, ha='center', va='center', fontsize=10,
            bbox=dict(boxstyle='round,pad=0.3', facecolor=color, alpha=0.3))

ax.text(5, 2.5, 'All real words!\nDesigned by humans', ha='center', fontsize=12,
        style='italic', color='gray')
ax.text(5, 1.5, 'Model: "positive"', ha='center', fontsize=13, fontweight='bold',
        bbox=dict(boxstyle='round', facecolor='#2ecc71', alpha=0.3))

# Soft prompts
ax = axes[1]
ax.set_xlim(0, 10)
ax.set_ylim(0, 8)
ax.axis('off')
ax.set_title('Soft Prompts (Prompt Tuning)', fontsize=13, fontweight='bold')

# Learned vectors
for i in range(5):
    x_pos = 0.8 + i * 1.8
    color = '#e74c3c'
    ax.text(x_pos, 6, f'v_{i+1}', ha='center', va='center', fontsize=11, fontweight='bold',
            bbox=dict(boxstyle='round,pad=0.3', facecolor=color, alpha=0.3))

# Input tokens
input_tokens = ['"Great', 'movie!"']
for i, token in enumerate(input_tokens):
    x_pos = 0.8 + i * 1.8
    ax.text(x_pos, 4.5, token, ha='center', va='center', fontsize=10,
            bbox=dict(boxstyle='round,pad=0.3', facecolor='#2ecc71', alpha=0.3))

ax.text(5, 3.2, 'v_1...v_5 are learned embeddings\n(not real words!)',
        ha='center', fontsize=11, style='italic', color='gray')
ax.text(5, 2, 'Optimized via gradient descent', ha='center', fontsize=11,
        style='italic', color='#e74c3c')
ax.text(5, 1, 'Model: "positive"', ha='center', fontsize=13, fontweight='bold',
        bbox=dict(boxstyle='round', facecolor='#2ecc71', alpha=0.3))

plt.tight_layout()
plt.show()

print("Hard prompts: Zero parameters, but requires human expertise to craft")
print("Soft prompts: ~20K parameters, learned by gradient descent, can outperform hand-written prompts")

In [None]:
# Implement soft prompt tuning

class SoftPromptModel(nn.Module):
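    """
    Demonstrates soft prompt tuning.
    Prepends learnable embeddings to the input sequence of a small frozen base model.

    Args:
        embed_dim: Dimension of each embedding vector
        n_prompt_tokens: Number of soft prompt tokens to prepend
        vocab_size: Vocabulary size of the frozen embedding table
        hidden_dim: Hidden dimension of the frozen encoder
        num_classes: Number of output classes for the trainable head
    """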
    def __init__(self, embed_dim=64, n_prompt_tokens=10, vocab_size=100,
                 hidden_dim=128, num_classes=3):
        super().__init__()
        # "Base model" components (frozen)
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )

        # Freeze base model
        for param in self.embedding.parameters():
            param.requires_grad = False
        for param in self.encoder.parameters():
            param.requires_grad = False

        # Soft prompt tokens (TRAINABLE)
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, embed_dim) * 0.01)

        # Classification head (TRAINABLE)
        self.classifier = nn.Linear(hidden_dim, num_classes)

        self.n_prompt_tokens = n_prompt_tokens

    def forward(self, input_ids):
        # Get input embeddings (frozen)
        input_embeds = self.embedding(input_ids)  # (batch, seq_len, embed_dim)

        # Prepend soft prompt
        batch_size = input_ids.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        combined = torch.cat([prompt, input_embeds], dim=1)  # (batch, n_prompt + seq_len, embed_dim)

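        # Encode each position, then mean-pool. Note: this toy encoder is position-wise
        # (no attention), so the prompt only shifts the pooled mean. In a real
        # transformer, attention lets prompt tokens influence every position.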
        encoded = self.encoder(combined)
        pooled = encoded.mean(dim=1)  # (batch, hidden_dim)

        return self.classifier(pooled)


# Demo
model = SoftPromptModel(n_prompt_tokens=10)
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print("Soft Prompt Tuning Model:")
print(f"  Total parameters:     {total_params:,}")
print(f"  Trainable parameters: {trainable_params:,}")
print(f"  Frozen parameters:    {total_params - trainable_params:,}")
print(f"  Trainable fraction:   {100*trainable_params/total_params:.2f}%")
print(f"  Soft prompt shape:    {model.soft_prompt.shape}")

# Test forward pass
dummy_input = torch.randint(0, 100, (4, 20))  # batch=4, seq_len=20
output = model(dummy_input)
print(f"\n  Input shape:  {dummy_input.shape}")
print(f"  Output shape: {output.shape}")
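
To see what "only the prompt is trained" means in practice, the sketch below runs a single optimizer step on a stripped-down stand-in for the model above (a frozen embedding table, a trainable soft prompt, and a trainable head; the encoder is omitted for brevity). All sizes, the learning rate, and the random labels are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embed_dim, n_prompt, num_classes = 50, 16, 4, 3

embedding = nn.Embedding(vocab_size, embed_dim)   # frozen "base model" part
embedding.weight.requires_grad = False
soft_prompt = nn.Parameter(torch.randn(n_prompt, embed_dim) * 0.01)  # trainable
classifier = nn.Linear(embed_dim, num_classes)    # trainable head

# Hand the optimizer only the trainable tensors
optimizer = torch.optim.Adam([soft_prompt, *classifier.parameters()], lr=0.1)

input_ids = torch.randint(0, vocab_size, (8, 10))
labels = torch.randint(0, num_classes, (8,))

embedding_before = embedding.weight.clone()
prompt_before = soft_prompt.detach().clone()

# One training step: prepend the prompt, mean-pool, classify
prompt = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
combined = torch.cat([prompt, embedding(input_ids)], dim=1)
loss = F.cross_entropy(classifier(combined.mean(dim=1)), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

embedding_unchanged = torch.equal(embedding.weight, embedding_before)
prompt_changed = not torch.equal(soft_prompt.detach(), prompt_before)
print("Frozen embeddings unchanged:", embedding_unchanged)
print("Soft prompt updated:", prompt_changed)
```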

---

## 7. Practical Fine-tuning Pipeline

Let's put it all together with a complete example: fine-tuning a small model for text classification, comparing **full fine-tuning** vs **LoRA**.

### The Setup

We'll:
1. Create a pretrained "mini language model"
2. Prepare a downstream classification dataset
3. Fine-tune with both approaches
4. Compare results (accuracy, parameter count, training speed)

In [None]:
# Step 1: Create a "pretrained" base model

class MiniLanguageModel(nn.Module):
    """A small model pretending to be a pretrained language model."""
    def __init__(self, vocab_size=200, embed_dim=64, hidden_dim=128, num_layers=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)

        # Transformer-like layers (simplified)
        self.layers = nn.ModuleList()
        for _ in range(num_layers):
            self.layers.append(nn.ModuleDict({
                'attention': nn.Linear(embed_dim, embed_dim),  # Simplified attention
                'ff1': nn.Linear(embed_dim, hidden_dim),
                'ff2': nn.Linear(hidden_dim, embed_dim),
                'norm': nn.LayerNorm(embed_dim)
            }))

        self.final_norm = nn.LayerNorm(embed_dim)

    def forward(self, x, return_features=True):
        h = self.embedding(x)  # (batch, seq, embed)

        for layer in self.layers:
            # Simplified self-attention
            attn_out = torch.tanh(layer['attention'](h))
            h = h + attn_out

            # Feed-forward
            ff_out = F.gelu(layer['ff1'](h))
            ff_out = layer['ff2'](ff_out)
            h = layer['norm'](h + ff_out)

        h = self.final_norm(h)

        if return_features:
            return h.mean(dim=1)  # Mean pooling -> (batch, embed)
        return h


# "Pretrain" the model (simulate by training on a pretext task)
torch.manual_seed(42)
base_model = MiniLanguageModel()

# Simulate pretraining with a simple task
pretrain_optimizer = optim.Adam(base_model.parameters(), lr=0.001)
for step in range(200):
    fake_input = torch.randint(0, 200, (32, 15))
    features = base_model(fake_input)
    # Simple pretraining objective: predict bag-of-words statistics
    target = torch.zeros(32, 64)
    for i in range(32):
        for token in fake_input[i]:
            target[i, token.item() % 64] += 1
    target = target / target.sum(dim=1, keepdim=True)
    loss = F.mse_loss(features, target)
    pretrain_optimizer.zero_grad()
    loss.backward()
    pretrain_optimizer.step()

print("'Pretrained' base model created!")
print(f"Base model parameters: {sum(p.numel() for p in base_model.parameters()):,}")

In [None]:
# Step 2: Create downstream classification dataset

def create_classification_data(n_samples=500, seq_len=15, vocab_size=200, num_classes=4):
    """
    Create synthetic text classification data.
    Different classes use different vocabulary distributions.
    """
    torch.manual_seed(42)
    X = torch.zeros(n_samples, seq_len, dtype=torch.long)
    y = torch.zeros(n_samples, dtype=torch.long)

    samples_per_class = n_samples // num_classes
    for c in range(num_classes):
        start = c * samples_per_class
        end = start + samples_per_class
        # Each class favors different vocabulary ranges
        base_vocab = c * (vocab_size // num_classes)
        X[start:end] = torch.randint(base_vocab, base_vocab + vocab_size // num_classes,
                                      (samples_per_class, seq_len))
        y[start:end] = c

    # Shuffle
    perm = torch.randperm(n_samples)
    return X[perm], y[perm]

# Create data
X_data, y_data = create_classification_data(n_samples=600)
X_train, y_train = X_data[:400], y_data[:400]
X_val, y_val = X_data[400:500], y_data[400:500]
X_test, y_test = X_data[500:], y_data[500:]

print(f"Dataset splits:")
print(f"  Train: {len(X_train)} samples")
print(f"  Val:   {len(X_val)} samples")
print(f"  Test:  {len(X_test)} samples")
print(f"  Classes: {y_data.unique().tolist()}")
print(f"  Sequence length: {X_train.shape[1]}")

In [None]:
# Step 3: Define classifiers using different fine-tuning strategies

class FullFineTuneClassifier(nn.Module):
    """Full fine-tuning: all parameters are trainable."""
    def __init__(self, base_model, num_classes=4):
        super().__init__()
        self.base = deepcopy(base_model)
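        # Read the feature size from the base model instead of hardcoding 64
        self.classifier = nn.Linear(self.base.embedding.embedding_dim, num_classes)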

    def forward(self, x):
        features = self.base(x, return_features=True)
        return self.classifier(features)


class LoRAClassifier(nn.Module):
    """LoRA fine-tuning: only LoRA params + classifier are trainable."""
    def __init__(self, base_model, num_classes=4, lora_rank=4, lora_alpha=4):
        super().__init__()
        self.base = deepcopy(base_model)

        # Freeze all base model parameters
        for param in self.base.parameters():
            param.requires_grad = False

        # Add LoRA to attention and FF layers
        self.lora_layers = nn.ModuleList()
        for layer in self.base.layers:
            # LoRA on attention
            lora_attn = LoRALinear(layer['attention'], rank=lora_rank, alpha=lora_alpha)
            layer['attention'] = lora_attn  # Swap in the wrapped layer (base weights were frozen above)
            self.lora_layers.append(lora_attn)

            # LoRA on ff1
            lora_ff = LoRALinear(layer['ff1'], rank=lora_rank, alpha=lora_alpha)
            layer['ff1'] = lora_ff
            self.lora_layers.append(lora_ff)

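        # Read the feature size from the base model instead of hardcoding 64
        self.classifier = nn.Linear(self.base.embedding.embedding_dim, num_classes)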

    def forward(self, x):
        features = self.base(x, return_features=True)
        return self.classifier(features)


def train_classifier(model, X_train, y_train, X_val, y_val, epochs=80, lr=0.001, batch_size=32):
    """Train classifier and track metrics."""
    import time

    optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=lr)
    criterion = nn.CrossEntropyLoss()

    train_losses, val_accs = [], []
    start_time = time.time()

    for epoch in range(epochs):
        model.train()
        epoch_loss = 0
        n_batches = 0

        for i in range(0, len(X_train), batch_size):
            batch_X = X_train[i:i+batch_size]
            batch_y = y_train[i:i+batch_size]

            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            n_batches += 1

        train_losses.append(epoch_loss / n_batches)

        model.eval()
        with torch.no_grad():
            val_preds = model(X_val).argmax(dim=1)
            val_acc = (val_preds == y_val).float().mean().item()
            val_accs.append(val_acc)

    elapsed = time.time() - start_time
    return train_losses, val_accs, elapsed

print("Classifier classes defined!")
print("Ready to train Full Fine-tuning vs LoRA...")

In [None]:
# Step 4: Train both approaches and compare

# Full fine-tuning
ft_model = FullFineTuneClassifier(base_model)
ft_trainable = sum(p.numel() for p in ft_model.parameters() if p.requires_grad)
ft_total = sum(p.numel() for p in ft_model.parameters())
ft_losses, ft_accs, ft_time = train_classifier(ft_model, X_train, y_train, X_val, y_val)

# LoRA fine-tuning
lora_model = LoRAClassifier(base_model, lora_rank=4, lora_alpha=4)
lora_trainable = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
lora_total = sum(p.numel() for p in lora_model.parameters())
lora_losses, lora_accs, lora_time = train_classifier(lora_model, X_train, y_train, X_val, y_val)

# From scratch (no pretraining)
scratch_base = MiniLanguageModel()
scratch_model = FullFineTuneClassifier(scratch_base)
scratch_losses, scratch_accs, scratch_time = train_classifier(
    scratch_model, X_train, y_train, X_val, y_val
)

# Test set evaluation
ft_model.eval()
lora_model.eval()
scratch_model.eval()
with torch.no_grad():
    ft_test_acc = (ft_model(X_test).argmax(1) == y_test).float().mean().item()
    lora_test_acc = (lora_model(X_test).argmax(1) == y_test).float().mean().item()
    scratch_test_acc = (scratch_model(X_test).argmax(1) == y_test).float().mean().item()

print("\n" + "=" * 70)
print("RESULTS COMPARISON")
print("=" * 70)
print(f"{'Method':<20} {'Test Acc':>10} {'Trainable':>12} {'% of Total':>12} {'Time':>8}")
print("-" * 70)
print(f"{'From Scratch':<20} {scratch_test_acc:>10.4f} {ft_trainable:>12,} {'100.0%':>12} {scratch_time:>7.2f}s")
print(f"{'Full Fine-tuning':<20} {ft_test_acc:>10.4f} {ft_trainable:>12,} {'100.0%':>12} {ft_time:>7.2f}s")
print(f"{'LoRA (r=4)':<20} {lora_test_acc:>10.4f} {lora_trainable:>12,} {100*lora_trainable/lora_total:>11.1f}% {lora_time:>7.2f}s")
print("=" * 70)

In [None]:
# Step 5: Visualization of results comparison

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Training loss
axes[0].plot(scratch_losses, label='From Scratch', color='red', alpha=0.7)
axes[0].plot(ft_losses, label='Full Fine-tuning', color='blue', alpha=0.7)
axes[0].plot(lora_losses, label='LoRA (r=4)', color='green', alpha=0.7)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Training Loss')
axes[0].set_title('Training Loss', fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Validation accuracy
axes[1].plot(scratch_accs, label='From Scratch', color='red', alpha=0.7)
axes[1].plot(ft_accs, label='Full Fine-tuning', color='blue', alpha=0.7)
axes[1].plot(lora_accs, label='LoRA (r=4)', color='green', alpha=0.7)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Validation Accuracy')
axes[1].set_title('Validation Accuracy', fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Parameter comparison
methods = ['From\nScratch', 'Full\nFine-tune', 'LoRA\n(r=4)']
trainable_counts = [ft_trainable, ft_trainable, lora_trainable]
colors = ['red', 'blue', 'green']
bars = axes[2].bar(methods, trainable_counts, color=colors, alpha=0.7, edgecolor='black')
axes[2].set_ylabel('Trainable Parameters')
axes[2].set_title('Trainable Parameters', fontweight='bold')
axes[2].grid(True, alpha=0.3, axis='y')

# Add count labels on bars
for bar, count in zip(bars, trainable_counts):
    axes[2].text(bar.get_x() + bar.get_width()/2., bar.get_height(),
                f'{count:,}', ha='center', va='bottom', fontweight='bold')

plt.suptitle('Full Fine-tuning vs LoRA: Complete Comparison', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nExpected pattern (check against the numbers above; this toy setup can deviate):")
print("1. LoRA often reaches comparable accuracy with far fewer trainable parameters")
print("2. Full fine-tuning may slightly outperform, but trains every parameter")
print("3. With real pretraining, fine-tuning usually beats training from scratch")

---

## 8. RLHF: Reinforcement Learning from Human Feedback

### Why Fine-tuning on Data Isn't Enough

A language model trained on internet text can generate fluent text, but it might:
- Give harmful or toxic responses
- Make up facts (hallucinate) confidently
- Not follow instructions well
- Be verbose when brevity is needed

**The problem:** The model learned to predict the next word, not to be **helpful, harmless, and honest**.

### The RLHF Pipeline

RLHF adds human preferences to guide the model's behavior:

| Stage | What Happens | Goal |
|-------|-------------|------|
| **1. Pretraining** | Train on massive text corpus | Learn language understanding |
| **2. SFT (Supervised Fine-tuning)** | Fine-tune on demonstration data | Learn to follow instructions |
| **3. Reward Model** | Train a model to predict human preferences | Learn what humans prefer |
| **4. RL (PPO)** | Optimize policy against reward model | Maximize human preference |

### How It Works

1. **Collect comparison data:** Show humans two model responses, ask which is better
2. **Train reward model:** Learn to score responses the way humans would
3. **Optimize with PPO:** Use the reward model as a "reward function" in reinforcement learning to update the language model, usually with a KL penalty that keeps it close to the SFT model

**What this means:** RLHF bridges the gap between "predicting the next word" and "being a helpful assistant." It was a key technique in turning GPT-3-family base models into InstructGPT and ChatGPT.
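
In the PPO stage, implementations commonly do not maximize the raw reward-model score alone; they subtract a KL penalty so the updated policy stays close to the SFT model. The sketch below shows only that reward shaping, on made-up log-probabilities. The function name `kl_shaped_reward`, the value of `beta`, and the tensor values are assumptions for illustration, and the PPO update itself is omitted.

```python
import torch

def kl_shaped_reward(reward_score, policy_logprobs, ref_logprobs, beta=0.1):
    """
    Sequence-level reward with a KL penalty (simple sampled estimate).

    reward_score:    (batch,) scalar scores from the reward model
    policy_logprobs: (batch, seq) log-probs of the sampled tokens under the policy
    ref_logprobs:    (batch, seq) log-probs of the same tokens under the frozen SFT model
    """
    # Sampled estimate of KL(policy || reference), summed over tokens
    kl_estimate = (policy_logprobs - ref_logprobs).sum(dim=1)
    return reward_score - beta * kl_estimate

reward_score = torch.tensor([1.0, 1.0])
ref_logprobs = torch.tensor([[-2.0, -2.0, -2.0],
                             [-2.0, -2.0, -2.0]])
policy_logprobs = torch.tensor([[-2.0, -2.0, -2.0],   # identical to the reference
                                [-0.5, -0.5, -0.5]])  # drifted away from the reference

shaped = kl_shaped_reward(reward_score, policy_logprobs, ref_logprobs)
print(shaped)  # the drifted sample is penalized: 1.0 vs 0.55
```

Both responses get the same reward-model score, but the one whose token probabilities moved far from the reference ends up with a lower shaped reward. This discourages the policy from exploiting weaknesses in the reward model.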

In [None]:
# Visualization: The RLHF Pipeline

fig, ax = plt.subplots(figsize=(16, 8))
ax.set_xlim(0, 16)
ax.set_ylim(0, 9)
ax.axis('off')

# Stage 1: Pretraining
ax.text(2, 7.5, 'Stage 1: Pretraining', ha='center', fontsize=13, fontweight='bold')
boxes1 = [
    (2, 6.5, 'Internet Text\n(TB of data)', '#bdc3c7'),
    (2, 5.3, 'Base LLM\n(Next token prediction)', '#3498db'),
]
for x, y, text, color in boxes1:
    ax.text(x, y, text, ha='center', va='center', fontsize=10,
            bbox=dict(boxstyle='round,pad=0.4', facecolor=color, alpha=0.3, edgecolor=color))
ax.annotate('', xy=(2, 5.8), xytext=(2, 6.2), arrowprops=dict(arrowstyle='->', lw=2, color='gray'))

# Stage 2: SFT
ax.text(6, 7.5, 'Stage 2: SFT', ha='center', fontsize=13, fontweight='bold')
boxes2 = [
    (6, 6.5, 'Demonstration Data\n(Human-written)', '#bdc3c7'),
    (6, 5.3, 'SFT Model\n(Follows instructions)', '#2ecc71'),
]
for x, y, text, color in boxes2:
    ax.text(x, y, text, ha='center', va='center', fontsize=10,
            bbox=dict(boxstyle='round,pad=0.4', facecolor=color, alpha=0.3, edgecolor=color))
ax.annotate('', xy=(6, 5.8), xytext=(6, 6.2), arrowprops=dict(arrowstyle='->', lw=2, color='gray'))
ax.annotate('', xy=(4.5, 5.3), xytext=(3.5, 5.3), arrowprops=dict(arrowstyle='->', lw=2, color='gray'))

# Stage 3: Reward Model
ax.text(10, 7.5, 'Stage 3: Reward Model', ha='center', fontsize=13, fontweight='bold')
boxes3 = [
    (10, 6.5, 'Comparison Data\n(A > B preferences)', '#bdc3c7'),
    (10, 5.3, 'Reward Model\n(Scores responses)', '#f39c12'),
]
for x, y, text, color in boxes3:
    ax.text(x, y, text, ha='center', va='center', fontsize=10,
            bbox=dict(boxstyle='round,pad=0.4', facecolor=color, alpha=0.3, edgecolor=color))
ax.annotate('', xy=(10, 5.8), xytext=(10, 6.2), arrowprops=dict(arrowstyle='->', lw=2, color='gray'))

# Stage 4: PPO
ax.text(14, 7.5, 'Stage 4: RL (PPO)', ha='center', fontsize=13, fontweight='bold')
ax.text(14, 5.3, 'RLHF Model\n(Aligned with humans!)', ha='center', va='center', fontsize=10,
        bbox=dict(boxstyle='round,pad=0.4', facecolor='#e74c3c', alpha=0.3, edgecolor='#e74c3c'))
ax.annotate('', xy=(12.5, 5.3), xytext=(11.5, 5.3), arrowprops=dict(arrowstyle='->', lw=2, color='gray'))

# PPO loop
ax.annotate('', xy=(14, 4.5), xytext=(14, 4.9), arrowprops=dict(arrowstyle='->', lw=1.5, color='#e74c3c'))
ax.text(14, 4, 'Generate\nresponse', ha='center', fontsize=9,
        bbox=dict(boxstyle='round,pad=0.2', facecolor='#e8e8e8', alpha=0.5))
ax.annotate('', xy=(14, 3.2), xytext=(14, 3.6), arrowprops=dict(arrowstyle='->', lw=1.5, color='#e74c3c'))
ax.text(14, 2.7, 'Score with\nReward Model', ha='center', fontsize=9,
        bbox=dict(boxstyle='round,pad=0.2', facecolor='#f39c12', alpha=0.3))
ax.annotate('', xy=(14, 2), xytext=(14, 2.4), arrowprops=dict(arrowstyle='->', lw=1.5, color='#e74c3c'))
ax.text(14, 1.5, 'Update policy\n(PPO)', ha='center', fontsize=9,
        bbox=dict(boxstyle='round,pad=0.2', facecolor='#e74c3c', alpha=0.3))

# Curved arrow for loop
ax.annotate('', xy=(15.2, 5.0), xytext=(15.2, 1.5),
           arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=-0.3',
                          color='#e74c3c', lw=1.5, linestyle='--'))

# Bottom result
ax.text(8, 0.5, 'Result: A model better aligned with helpful, harmless, and honest behavior (like ChatGPT)',
        ha='center', fontsize=12, fontweight='bold', style='italic',
        bbox=dict(boxstyle='round,pad=0.5', facecolor='#2ecc71', alpha=0.2))

ax.set_title('The RLHF Pipeline: From Base Model to AI Assistant',
             fontsize=15, fontweight='bold', pad=10)
plt.tight_layout()
plt.show()

In [None]:
# Simplified demonstration of the reward model concept

class SimpleRewardModel(nn.Module):
    """
    A reward model that scores text quality.
    In practice, this is trained on human preference data.
    """
    def __init__(self, input_dim=64, hidden_dim=32):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)  # Output: scalar reward
        )

    def forward(self, x):
        return self.network(x).squeeze(-1)

# Simulate the preference learning process
print("RLHF Reward Model Training (Simplified)")
print("=" * 55)

# Create simulated "response quality" data
torch.manual_seed(42)
n_comparisons = 200

# For each comparison: feature vectors of response A and B
features_A = torch.randn(n_comparisons, 64)
features_B = torch.randn(n_comparisons, 64)

# "True" quality scores (unknown to the model)
true_quality_A = features_A[:, 0] + features_A[:, 1] * 0.5  # Simple quality function
true_quality_B = features_B[:, 0] + features_B[:, 1] * 0.5

# Human preference: 1 if A is preferred, 0 if B is preferred
human_prefers_A = (true_quality_A > true_quality_B).float()

# Train reward model
reward_model = SimpleRewardModel()
optimizer = optim.Adam(reward_model.parameters(), lr=0.01)

losses = []
for epoch in range(100):
    score_A = reward_model(features_A)
    score_B = reward_model(features_B)

    # Bradley-Terry model: P(A > B) = sigmoid(score_A - score_B)
    logits = score_A - score_B
    loss = F.binary_cross_entropy_with_logits(logits, human_prefers_A)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Evaluate
with torch.no_grad():
    pred_A_better = (reward_model(features_A) > reward_model(features_B)).float()
    accuracy = (pred_A_better == human_prefers_A).float().mean()

print(f"Reward model accuracy (on the training comparisons): {accuracy.item():.4f}")
print(f"Final loss: {losses[-1]:.4f}")
print(f"\nThe reward model learned to predict which response humans prefer!")
print(f"In RLHF, this reward signal guides the LLM to generate better responses.")
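
The accuracy above is measured on the same comparisons the reward model was trained on. A minimal sketch of a held-out check follows, using a smaller simulated setup: the feature dimension, sample counts, the linear reward model, and the hidden scoring rule `true_quality` are all assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

dim, n = 8, 600
feats_a, feats_b = torch.randn(n, dim), torch.randn(n, dim)

def true_quality(f):
    # Hidden "human" scoring rule, assumed for this simulation
    return f[:, 0] + 0.5 * f[:, 1]

prefers_a = (true_quality(feats_a) > true_quality(feats_b)).float()

# Hold out the last 120 comparisons for evaluation
train, test = slice(0, 480), slice(480, n)

reward = nn.Linear(dim, 1, bias=False)  # linear reward model keeps the sketch small
opt = torch.optim.Adam(reward.parameters(), lr=0.05)

for _ in range(200):
    # Bradley-Terry: P(A > B) = sigmoid(score_A - score_B)
    logits = (reward(feats_a[train]) - reward(feats_b[train])).squeeze(-1)
    loss = F.binary_cross_entropy_with_logits(logits, prefers_a[train])
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    test_logits = (reward(feats_a[test]) - reward(feats_b[test])).squeeze(-1)
    heldout_acc = ((test_logits > 0).float() == prefers_a[test]).float().mean().item()

print(f"Held-out preference accuracy: {heldout_acc:.3f}")
```

Real preference data is noisy and annotators disagree, so held-out accuracy on human comparisons is typically well below what this clean simulation reaches.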

---

## Exercises

### Exercise 1: Implement LoRA with Different Ranks

Compare the parameter efficiency and expressiveness of LoRA at different ranks.

In [None]:
# EXERCISE 1: Implement LoRA analysis at different ranks
def analyze_lora_efficiency(input_dim, output_dim, ranks):
    """
    For each rank, compute:
    - Number of LoRA parameters
    - Compression ratio vs full matrix
    - The rank of the resulting delta_W

    Args:
        input_dim: Input dimension of the weight matrix
        output_dim: Output dimension of the weight matrix
        ranks: List of LoRA ranks to analyze

    Returns:
        Dictionary with 'rank', 'params', 'compression', 'actual_rank' keys
    """
    # TODO: Implement this!
    # Hint: LoRA params = output_dim * r + r * input_dim
    # Hint: Create random B (out x r) and A (r x in), compute rank of B@A

    results = {'rank': [], 'params': [], 'compression': [], 'actual_rank': []}

    full_params = input_dim * output_dim

    for r in ranks:
        # Calculate LoRA parameters
        lora_params = output_dim * r + r * input_dim

        # Calculate compression ratio
        compression = full_params / lora_params

        # Create random matrices and verify rank
        B = torch.randn(output_dim, r)
        A = torch.randn(r, input_dim)
        delta_W = B @ A
        actual_rank = torch.linalg.matrix_rank(delta_W).item()

        results['rank'].append(r)
        results['params'].append(lora_params)
        results['compression'].append(compression)
        results['actual_rank'].append(actual_rank)

    return results

# Test
results = analyze_lora_efficiency(768, 768, [1, 2, 4, 8, 16, 32, 64])
print(f"{'Rank':>6} {'Params':>10} {'Compression':>12} {'Actual Rank':>12}")
print("-" * 45)
for i in range(len(results['rank'])):
    print(f"{results['rank'][i]:>6} {results['params'][i]:>10,} "
          f"{results['compression'][i]:>11.1f}x {results['actual_rank'][i]:>12}")

# Verify
assert results['params'][0] == 768 * 1 + 1 * 768, "Params for rank 1 should be 1536"
assert results['actual_rank'][2] == 4, "Rank of delta_W should equal LoRA rank"
print("\nAll checks passed!")

### Exercise 2: Build a Custom LoRA Model

In [None]:
# EXERCISE 2: Apply LoRA to a multi-layer model
class LoRAModel(nn.Module):
    """
    Take a simple feedforward model and add LoRA to specified layers.

    Args:
        base_model: nn.Sequential with Linear layers
        lora_rank: Rank for LoRA decomposition
        lora_alpha: Scaling factor
        target_layers: List of 0-based indices counted over the Linear layers only
            (not positions in the Sequential); None applies LoRA to every Linear layer
    """
    def __init__(self, base_model, lora_rank=4, lora_alpha=4, target_layers=None):
        super().__init__()
        # TODO: Implement this!
        # Hint: Iterate through base_model layers
        # Hint: Replace Linear layers at target indices with LoRALinear
        # Hint: Freeze all base model parameters

        self.layers = nn.ModuleList()
        layer_idx = 0

        for module in base_model:
            if isinstance(module, nn.Linear):
                if target_layers is None or layer_idx in target_layers:
                    self.layers.append(LoRALinear(module, rank=lora_rank, alpha=lora_alpha))
                else:
                    # Freeze this layer
                    for p in module.parameters():
                        p.requires_grad = False
                    self.layers.append(module)
                layer_idx += 1
            else:
                self.layers.append(module)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# Test
base = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 10)
)

# Apply LoRA only to first and last linear layers
lora_model = LoRAModel(base, lora_rank=4, target_layers=[0, 2])

total = sum(p.numel() for p in lora_model.parameters())
trainable = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)

print(f"Total parameters:     {total:,}")
print(f"Trainable parameters: {trainable:,}")
print(f"Frozen parameters:    {total - trainable:,}")
print(f"Trainable fraction:   {100*trainable/total:.1f}%")

# Test forward pass
x = torch.randn(8, 128)
y = lora_model(x)
print(f"\nForward pass: {x.shape} -> {y.shape}")

# Verify that only the LoRA matrices are trainable
# (this assumes LoRALinear freezes the weights of the layer it wraps)
expected_trainable = (128*4 + 4*64) + (64*4 + 4*10)  # LoRA params for layers 0 and 2
print(f"\nExpected LoRA params: {expected_trainable}")
print(f"Actual trainable:     {trainable}")
print(f"Match: {trainable == expected_trainable}")

### Exercise 3: Compare Fine-tuning Strategies

In [None]:
# EXERCISE 3: Run a systematic comparison of strategies
def compare_strategies(base_model, X_train, y_train, X_val, y_val, num_classes=4):
    """
    Compare four strategies:
    1. From scratch (random init)
    2. Feature extraction (frozen base + trainable head)
    3. Full fine-tuning (all trainable)
    4. LoRA fine-tuning (LoRA + head trainable)

    Returns dict with strategy names and their (trainable_params, final_val_acc)
    """
    results = {}

    # TODO: Implement each strategy, train for 50 epochs, record results
    # Hint: Reuse the training functions from earlier

    # Strategy 1: From scratch
    scratch = FullFineTuneClassifier(MiniLanguageModel(), num_classes)
    _, accs, _ = train_classifier(scratch, X_train, y_train, X_val, y_val, epochs=50)
    trainable = sum(p.numel() for p in scratch.parameters() if p.requires_grad)
    results['From Scratch'] = (trainable, accs[-1])

    # Strategy 2: Feature extraction
    # Each pretrained strategy gets its own copy of base_model, so freezing or
    # training one copy cannot leak into the strategies that follow.
    fe = FullFineTuneClassifier(deepcopy(base_model), num_classes)
    for p in fe.base.parameters():
        p.requires_grad = False
    _, accs, _ = train_classifier(fe, X_train, y_train, X_val, y_val, epochs=50)
    trainable = sum(p.numel() for p in fe.parameters() if p.requires_grad)
    results['Feature Extraction'] = (trainable, accs[-1])

    # Strategy 3: Full fine-tuning
    full = FullFineTuneClassifier(deepcopy(base_model), num_classes)
    _, accs, _ = train_classifier(full, X_train, y_train, X_val, y_val, epochs=50)
    trainable = sum(p.numel() for p in full.parameters() if p.requires_grad)
    results['Full Fine-tuning'] = (trainable, accs[-1])

    # Strategy 4: LoRA
    lora = LoRAClassifier(deepcopy(base_model), num_classes, lora_rank=4)
    _, accs, _ = train_classifier(lora, X_train, y_train, X_val, y_val, epochs=50)
    trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
    results['LoRA (r=4)'] = (trainable, accs[-1])

    return results

# Run comparison
results = compare_strategies(base_model, X_train, y_train, X_val, y_val)

print("Strategy Comparison")
print("=" * 55)
print(f"{'Strategy':<20} {'Trainable Params':>16} {'Val Accuracy':>14}")
print("-" * 55)
for name, (params, acc) in results.items():
    print(f"{name:<20} {params:>16,} {acc:>14.4f}")

---

## Summary

### Key Concepts

- **Transfer Learning** reuses knowledge from pretrained models, dramatically reducing the data and compute needed for new tasks
- **Feature Extraction** freezes the pretrained model and only trains a new classification head
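- **Full Fine-tuning** updates all parameters, typically with a small learning rate, and risks catastrophic forgetting of pretrained knowledge
- **Discriminative Learning Rates** use smaller learning rates for earlier (more general) layers and larger ones for later (more task-specific) layers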
- **PEFT** methods modify only a small fraction of parameters (often < 1%) while achieving comparable performance
- **LoRA** decomposes weight updates as $\Delta W = BA$ where $B \in \mathbb{R}^{d_{out} \times r}$ and $A \in \mathbb{R}^{r \times d_{in}}$ with $r \ll \min(d_{in}, d_{out})$
- **Adapters** insert small bottleneck layers into the network
- **Prompt Tuning** prepends learnable embeddings to the input
- **RLHF** uses human feedback to align language models with human preferences
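
To make the LoRA bullet concrete, here is a minimal sketch of the decomposition $\Delta W = BA$ wrapped around a frozen linear layer. The class name `TinyLoRALinear`, the `demo_` variables, and the initialization scale are illustrative assumptions, not the `LoRALinear` defined earlier in this notebook. It shows two things: only $A$ and $B$ are trainable, and because $B$ starts at zero the wrapped layer initially reproduces the pretrained output.

```python
import math
import torch
import torch.nn as nn


class TinyLoRALinear(nn.Module):
    """Hypothetical minimal LoRA wrapper: y = base(x) + (alpha / r) * x A^T B^T."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights and bias stay frozen
        # A gets a small random init (scale chosen for illustration);
        # B starts at zero, so delta_W = B @ A is zero before training.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) / math.sqrt(rank))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Apply the low-rank update without ever materializing the full delta_W
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


demo_base = nn.Linear(768, 768)
demo_layer = TinyLoRALinear(demo_base, rank=4)

demo_trainable = sum(p.numel() for p in demo_layer.parameters() if p.requires_grad)
demo_total = sum(p.numel() for p in demo_layer.parameters())
demo_x = torch.randn(2, 768)
same_at_init = torch.allclose(demo_layer(demo_x), demo_base(demo_x))

print(f"trainable: {demo_trainable:,} of {demo_total:,}")
print(f"output unchanged at init: {same_at_init}")
```

Computing `x @ A.T @ B.T` keeps the intermediate activation at width `rank`, so the extra cost per forward pass stays small relative to the frozen base layer.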

### Connection to Deep Learning

| Concept | Where It's Used |
|---------|----------------|
| Transfer learning | Almost every practical DL application |
| Full fine-tuning | When you have enough data and compute |
| LoRA | Adapting LLMs (LLaMA, GPT, etc.) efficiently |
| Adapters | Multi-task learning, modular AI systems |
| Prompt tuning | When model weights must stay frozen and only a small learned prompt is trained |
| RLHF | Creating AI assistants (ChatGPT, Claude, etc.) |

### Checklist

- [ ] I can explain why transfer learning works (hierarchical features)
- [ ] I know when to use feature extraction vs fine-tuning
- [ ] I can implement discriminative learning rates
- [ ] I understand why PEFT is necessary for large models
- [ ] I can implement LoRA from scratch and explain the rank decomposition
- [ ] I can calculate the parameter savings of LoRA for any given rank and dimension
- [ ] I understand how adapters differ from LoRA
- [ ] I know the difference between hard prompts and soft prompts
- [ ] I can describe the four stages of the RLHF pipeline
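
For the discriminative learning rates item, one common way to implement the idea in PyTorch is with optimizer parameter groups. The sketch below is a minimal illustration under assumed values: the three-block `demo_model`, the `base_lr` of `1e-3`, and the 10x `decay` per block are placeholders, not settings taken from this notebook.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical model standing in for a pretrained network plus a new task head
demo_model = nn.Sequential(
    nn.Linear(32, 32),  # earliest block: most general features
    nn.ReLU(),
    nn.Linear(32, 32),  # middle block
    nn.ReLU(),
    nn.Linear(32, 4),   # task head: most task-specific
)

base_lr = 1e-3  # learning rate for the last block (assumed value)
decay = 0.1     # each earlier block trains 10x more slowly (assumed value)

linear_layers = [m for m in demo_model if isinstance(m, nn.Linear)]
param_groups = []
for depth, lin in enumerate(linear_layers):
    steps_from_top = len(linear_layers) - 1 - depth
    param_groups.append({
        "params": list(lin.parameters()),
        "lr": base_lr * decay ** steps_from_top,
    })

# Each group keeps its own learning rate inside a single optimizer
optimizer = optim.Adam(param_groups)
group_lrs = [g["lr"] for g in optimizer.param_groups]
print(group_lrs)
```

The learning rates increase from the first block to the head, so the general early features move the least during fine-tuning.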

---

## Next Steps

In the next notebook, we'll explore **Generative Models** -- including Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models. These are the techniques behind image generation (DALL-E, Stable Diffusion), text generation, and more. We'll see how some of the fine-tuning techniques from this notebook (especially LoRA) are used to customize generative models for specific styles and tasks.