# Notebook 1 — Transfer Learning
**PEFT & Transfer Learning Series · Part 1 of 3**

This notebook demonstrates three transfer learning strategies:
- **Feature Extraction** — freeze backbone, train new head only
- **Fine-Tuning** — unfreeze some/all layers for deeper adaptation  
- **Domain Adaptation** — same task, shifted data distribution


## 1.1 Setup & Data

We use a 20-feature, 4-class synthetic dataset representing pre-processed embeddings.
A small neural network acts as our pre-trained base model.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import copy
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

torch.manual_seed(42)
np.random.seed(42)

# -------------------------------------------------------------------
# Synthetic dataset: 500 samples, 20 features, 4 classes
# Think of these as sentence embeddings from a pre-trained encoder
#
# Starts as a 2D array of 500 rows × 20 columns. Each row is one "sample" (think: one sentence's embedding vector). 
# Each column is one feature (think: one dimension of that embedding):
# -------------------------------------------------------------------
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=10,
    n_classes=4, n_clusters_per_class=1, random_state=42
)
# After StandardScaler
# Each of the 20 columns is rescaled so it has mean=0 and std=1. 
# The shape stays 500×20 but the values are now normalized:
X = StandardScaler().fit_transform(X)

# After converting to PyTorch tensors
# X stays 2D (samples × features) and y stays 1D; only the container and dtype change.
# The dtype matters because:
#   FloatTensor (float32): matches the dtype of the network's weight matrices
#   LongTensor (int64): required by CrossEntropyLoss, which expects integer class indices
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)
Xtr = torch.FloatTensor(Xtr); ytr = torch.LongTensor(ytr)
Xte = torch.FloatTensor(Xte); yte = torch.LongTensor(yte)
print(f'Train: {Xtr.shape}  |  Test: {Xte.shape}  |  Classes: 4')

Train: torch.Size([400, 20])  |  Test: torch.Size([100, 20])  |  Classes: 4
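
The cell above fits `StandardScaler` on all 500 rows before splitting, so test-set statistics influence the scaling. For this synthetic demo the effect is likely small, but a common alternative is to split first and fit the scaler on the training rows only. The following is a minimal sketch of that variant; the names `X_train_s` and `X_test_s` are illustrative and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=500, n_features=20, n_informative=10,
    n_classes=4, n_clusters_per_class=1, random_state=42
)

# Split first, then fit the scaler on the training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)  # exactly mean 0 / std 1 per column
X_test_s = scaler.transform(X_test)    # only approximately standardized

# Largest deviation from mean 0 and std 1 across the 20 training columns
print(np.abs(X_train_s.mean(axis=0)).max())
print(np.abs(X_train_s.std(axis=0) - 1).max())
```

The test columns will not have exactly zero mean or unit variance under this scheme, which is expected: they are scaled with statistics the model could legitimately have seen.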


## 1.2 Define & Pre-Train the Base Model

The backbone (three Linear + ReLU blocks) learns general representations.
The head (the final Linear layer) is task-specific and will be replaced per new task.

In [4]:
# -------------------------------------------------------------------
# backbone = shared feature extractor  |  head = task classifier
# -------------------------------------------------------------------
class BaseModel(nn.Module):
    def __init__(self, in_features=20, hidden=64, num_classes=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),      nn.ReLU(),
            nn.Linear(hidden, 32),          nn.ReLU(),
        )
        self.head = nn.Linear(32, num_classes)
    def forward(self, x):
        return self.head(self.backbone(x))

def train(model, X, y, epochs=40, lr=1e-3):
    opt = optim.Adam(model.parameters(), lr=lr)
    fn  = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        opt.zero_grad(); fn(model(X), y).backward(); opt.step()

def acc(model, X, y):
    model.eval()
    with torch.no_grad():
        return (model(X).argmax(1) == y).float().mean().item()

pretrained = BaseModel()
train(pretrained, Xtr, ytr, epochs=50)
print(f'Pre-trained base accuracy: {acc(pretrained, Xte, yte):.2%}')
print(f'Total parameters: {sum(p.numel() for p in pretrained.parameters()):,}')


Pre-trained base accuracy: 81.00%
Total parameters: 7,716
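
The total of 7,716 can be checked by hand: each `nn.Linear(in, out)` holds `in × out` weights plus `out` biases, so the count is (20·64 + 64) + (64·64 + 64) + (64·32 + 32) + (32·4 + 4) = 1,344 + 4,160 + 2,080 + 132. The sketch below rebuilds the same architecture and prints the per-tensor breakdown; it is a standalone check, not part of the training pipeline.

```python
import torch.nn as nn

class BaseModel(nn.Module):
    def __init__(self, in_features=20, hidden=64, num_classes=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),      nn.ReLU(),
            nn.Linear(hidden, 32),          nn.ReLU(),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.head(self.backbone(x))

model = BaseModel()

# One row per weight or bias tensor: name, shape, element count
for name, p in model.named_parameters():
    print(f'{name:20s} {tuple(p.shape)!s:12s} {p.numel():>6,}')

total = sum(p.numel() for p in model.parameters())
head_only = sum(p.numel() for p in model.head.parameters())
print(f'total: {total:,}  |  head only: {head_only:,}')
```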


## 1.3 Feature Extraction (Transfer Learning)

The backbone is **fully frozen**: its parameters have `requires_grad = False`, so the optimizer never updates them.
Only a fresh classification head is trained, which is fast and works best when the new task is similar to the pre-training task.

In [5]:
# -------------------------------------------------------------------
# Freeze all backbone parameters -- no gradients flow through them
# -------------------------------------------------------------------
feat_model = copy.deepcopy(pretrained)
for param in feat_model.backbone.parameters():
    param.requires_grad = False
feat_model.head = nn.Linear(32, 4)  # fresh head

t = sum(p.numel() for p in feat_model.parameters() if p.requires_grad)
n = sum(p.numel() for p in feat_model.parameters())
print(f'Trainable: {t:,} / {n:,}  ({t/n:.1%})')

opt = optim.Adam(filter(lambda p: p.requires_grad, feat_model.parameters()), lr=1e-3)
fn  = nn.CrossEntropyLoss()
feat_model.train()
for _ in range(40):
    opt.zero_grad(); fn(feat_model(Xtr), ytr).backward(); opt.step()

print(f'Feature extraction accuracy: {acc(feat_model, Xte, yte):.2%}')
print('Backbone weights unchanged -- same backbone usable for other tasks.')


Trainable: 132 / 7,716  (1.7%)
Feature extraction accuracy: 55.00%
Backbone weights unchanged -- same backbone usable for other tasks.
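
The final print statement asserts that the backbone is unchanged without actually checking it. One way to verify the claim is to snapshot the backbone weights before training and compare afterwards. The sketch below does this on a small stand-in model with random data; the `backbone`, `head`, and `model` objects here are assumptions for illustration, not the notebook's `feat_model`.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Stand-in model: a small backbone plus a linear head
backbone = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU())
head = nn.Linear(32, 4)
model = nn.Sequential(backbone, head)

# Freeze the backbone, as in feature extraction
for p in backbone.parameters():
    p.requires_grad = False

# Snapshot weights before training
backbone_before = copy.deepcopy(backbone.state_dict())
head_before = head.weight.detach().clone()

# Random stand-in data, full-batch training on trainable parameters only
X = torch.randn(64, 20)
y = torch.randint(0, 4, (64,))
opt = optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(10):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

# Frozen tensors should be bit-identical; the head should have moved
backbone_unchanged = all(
    torch.equal(backbone_before[k], v) for k, v in backbone.state_dict().items()
)
head_changed = not torch.equal(head_before, head.weight)
no_backbone_grads = all(p.grad is None for p in backbone.parameters())
print(backbone_unchanged, head_changed, no_backbone_grads)
```

The same comparison could be applied to `pretrained.backbone` and `feat_model.backbone` in the notebook itself.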


## 1.4 Fine-Tuning (Transfer Learning)

We **unfreeze the later backbone layers** (the second and third Linear layers) so they adapt to the new task, while the first Linear layer stays frozen.
A lower learning rate on the pre-trained layers reduces the risk of catastrophic forgetting.

In [12]:
# -------------------------------------------------------------------
# Partial unfreeze: keep the first layer frozen, allow later ones to adapt
# Differential LR: small for pre-trained layers, larger for new head
# -------------------------------------------------------------------
ft_model = copy.deepcopy(pretrained)
ft_model.head = nn.Linear(32, 4)

# Freeze the first Linear + ReLU (indices 0 and 1: low-level general features)
for i, layer in enumerate(ft_model.backbone):
    for p in layer.parameters():
        p.requires_grad = (i >= 2)  # unfreeze from index 2 onward

t = sum(p.numel() for p in ft_model.parameters() if p.requires_grad)
n = sum(p.numel() for p in ft_model.parameters())
print(f'Trainable: {t:,} / {n:,}  ({t/n:.1%})')

opt = optim.Adam([
    {'params': ft_model.backbone[2:].parameters(), 'lr': 2e-4},  # pre-trained: small LR
    {'params': ft_model.head.parameters(),          'lr': 1e-3},  # head: larger LR
])
fn = nn.CrossEntropyLoss()
ft_model.train()
for _ in range(50):
    opt.zero_grad(); fn(ft_model(Xtr), ytr).backward(); opt.step()

print(f'Fine-tuned accuracy: {acc(ft_model, Xte, yte):.2%}')


Trainable: 6,372 / 7,716  (82.6%)
Fine-tuned accuracy: 76.00%
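
To confirm that the differential learning rates landed on the intended parameters, the optimizer's `param_groups` can be inspected after construction. The sketch below rebuilds the same layer layout on a standalone `backbone` and `head` (stand-ins for `ft_model.backbone` and `ft_model.head`) and reports the learning rate and parameter count of each group.

```python
import torch.nn as nn
import torch.optim as optim

backbone = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
)
head = nn.Linear(32, 4)

# Freeze indices 0-1; leave index 2 onward trainable
for i, layer in enumerate(backbone):
    for p in layer.parameters():
        p.requires_grad = (i >= 2)

# Slicing an nn.Sequential returns another nn.Sequential over the same layers
opt = optim.Adam([
    {'params': backbone[2:].parameters(), 'lr': 2e-4},  # pre-trained layers
    {'params': head.parameters(),          'lr': 1e-3},  # new head
])

# (learning rate, number of parameters) for each optimizer group
group_summary = [
    (g['lr'], sum(p.numel() for p in g['params']))
    for g in opt.param_groups
]
for lr, n in group_summary:
    print(f'lr={lr:g}  params={n:,}')
```

The two groups hold 6,240 and 132 parameters, which together give the 6,372 trainable parameters reported above.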


## 1.5 Domain Adaptation (Transfer Learning)

The task stays the same but the data distribution shifts (e.g., news text to medical records).
We simulate a domain shift, then adapt with only a small labeled sample from the new domain.

In [13]:
# -------------------------------------------------------------------
# Simulate domain shift: add Gaussian noise (mean 0, std 1.5) to the test features
# Adapt using only 20% of new-domain labeled data
# -------------------------------------------------------------------
X_nd = Xte.numpy() + np.random.normal(0, 1.5, Xte.shape)
X_nd = torch.FloatTensor(StandardScaler().fit_transform(X_nd))
y_nd = yte.clone()

acc_before = acc(pretrained, X_nd, y_nd)
print(f'Accuracy BEFORE domain adaptation: {acc_before:.2%}')

adapted = copy.deepcopy(pretrained)
adapted.head = nn.Linear(32, 4)
for i, layer in enumerate(adapted.backbone):
    for p in layer.parameters():
        p.requires_grad = (i >= 4)  # only last block + head adapt

n_adapt = int(len(X_nd) * 0.2)
opt = optim.Adam(filter(lambda p: p.requires_grad, adapted.parameters()), lr=5e-4)
fn  = nn.CrossEntropyLoss()
adapted.train()
for _ in range(60):
    opt.zero_grad(); fn(adapted(X_nd[:n_adapt]), y_nd[:n_adapt]).backward(); opt.step()

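# Note: X_nd includes the n_adapt rows used for adaptation, so this
# "after" score is not a fully held-out estimate.
acc_after = acc(adapted, X_nd, y_nd)
print(f'Accuracy AFTER  domain adaptation: {acc_after:.2%}')
print(f'Improvement: {(acc_after - acc_before)*100:+.1f}pp  using only {n_adapt} adapted samples')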


Accuracy BEFORE domain adaptation: 57.00%
Accuracy AFTER  domain adaptation: 58.00%
Improvement: +1.0pp  using only 20 adapted samples
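
The cell above scores the adapted model on all 100 new-domain rows, including the 20 it was adapted on, so the "after" figure mixes seen and unseen samples. A cleaner comparison would score both models on the remaining 80 rows only. The sketch below shows one way to produce disjoint adaptation and evaluation subsets; `split_adapt_eval` is a hypothetical helper, and the random tensors stand in for `X_nd` and `y_nd`.

```python
import torch

def split_adapt_eval(X, y, frac=0.2, seed=0):
    """Shuffle once, then return adaptation and evaluation subsets with no shared rows."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(len(X), generator=g)
    n_adapt = int(len(X) * frac)
    adapt_idx, eval_idx = perm[:n_adapt], perm[n_adapt:]
    return (X[adapt_idx], y[adapt_idx]), (X[eval_idx], y[eval_idx]), (adapt_idx, eval_idx)

# Stand-in for the shifted-domain data (100 samples, 20 features, 4 classes)
X_new = torch.randn(100, 20)
y_new = torch.randint(0, 4, (100,))

(X_ad, y_ad), (X_ev, y_ev), (adapt_idx, eval_idx) = split_adapt_eval(X_new, y_new)
print(X_ad.shape, X_ev.shape)
```

With this split, training would use `X_ad, y_ad`, and both `acc_before` and `acc_after` would be computed on `X_ev, y_ev`. Shuffling before slicing also avoids depending on the row order of the original test set.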


## 1.6 Summary

| Strategy | Backbone frozen? | Trainable % (this notebook) | Best when |
|----------|-----------------|-------------|----------|
| Feature Extraction | Fully | 1.7% | New task similar to pre-training |
| Fine-Tuning | First layer only | 82.6% | Enough data; task differs more from pre-training |
| Domain Adaptation | First two layers | 28.7% | Same task, shifted distribution |

**Key rule**: freeze early (general) layers; retrain late (task-specific) layers.
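
The three strategies differ only in where the freeze boundary sits inside the backbone. The sketch below captures that with a single hypothetical helper, `freeze_before`, and reports the trainable parameter count for each boundary used in this notebook (index 6 means nothing in the 6-module backbone is trainable). The helper and the `make_backbone` factory are illustrative names, not part of the notebook's code.

```python
import torch.nn as nn

def make_backbone():
    # Same layout as BaseModel.backbone: indices 0, 2, 4 are Linear layers
    return nn.Sequential(
        nn.Linear(20, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 32), nn.ReLU(),
    )

def freeze_before(backbone, first_trainable_idx):
    """Freeze every module before first_trainable_idx; unfreeze the rest."""
    for i, layer in enumerate(backbone):
        for p in layer.parameters():
            p.requires_grad = (i >= first_trainable_idx)

def trainable_count(*modules):
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)

boundaries = [
    ('feature_extraction', 6),  # whole backbone frozen
    ('fine_tuning', 2),         # first Linear frozen
    ('domain_adaptation', 4),   # first two Linear layers frozen
]

results = {}
for name, idx in boundaries:
    backbone, head = make_backbone(), nn.Linear(32, 4)
    freeze_before(backbone, idx)
    total = sum(p.numel() for m in (backbone, head) for p in m.parameters())
    results[name] = trainable_count(backbone, head)
    print(f'{name:20s} trainable: {results[name]:>5,} / {total:,}  ({results[name] / total:.1%})')
```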

Continue to **Notebook 2** for PEFT techniques.