# Notebook 2 — PEFT Techniques
**PEFT & Transfer Learning Series · Part 2 of 3**

Implements four PEFT techniques from scratch in pure PyTorch.
Each method keeps the base model frozen and trains only a small number of added parameters.

| Section | Technique | Core idea |
|---------|-----------|----------|
| 2.2 | Adapter Layers | Bottleneck modules inserted after each layer |
| 2.3 | LoRA | Low-rank matrix pairs injected into weights |
| 2.4 | Prompt Tuning | Learnable vectors prepended to input |
| 2.5 | Prefix Tuning | Trainable vectors at every transformer layer |


## 2.1 Base Model Setup

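A two-layer transformer-style model acts as our frozen pre-trained base.
All PEFT methods use this same model; only the adaptation strategy changes.
The inputs and labels are random integers, so held-out accuracy is expected to average chance level (25% for four classes) and, with only 40 test examples, to be noisy. The accuracy figures below show that each pipeline runs end to end, not how the methods rank.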

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import copy

torch.manual_seed(0)
D, VOCAB, SEQ, N_CLS, N = 32, 100, 10, 4, 200

# -------------------------------------------------------------------
# Tiny two-layer transformer (represents BERT / GPT-2 etc. at small scale)
# -------------------------------------------------------------------
class TinyLayer(nn.Module):
    def __init__(self, d=D):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, 4, batch_first=True)
        self.ff   = nn.Sequential(nn.Linear(d, d*4), nn.GELU(), nn.Linear(d*4, d))
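        self.n1   = nn.LayerNorm(d)
        self.n2   = nn.LayerNorm(d)
    def forward(self, x):
        a, _ = self.attn(x, x, x)       # self-attention: query = key = value
        x = self.n1(x + a)              # residual + LayerNorm around attention
        return self.n2(x + self.ff(x))  # residual + LayerNorm around feed-forward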

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed  = nn.Embedding(VOCAB, D)
        self.layers = nn.ModuleList([TinyLayer() for _ in range(2)])
        self.head   = nn.Linear(D, N_CLS)
    def forward(self, x):
        h = self.embed(x)               # [B, seq] -> [B, seq, D]
        for layer in self.layers:
            h = layer(h)
        return self.head(h.mean(1))     # mean-pool over tokens, then classify

# Synthetic data: random token ids and random labels (no learnable signal)
X = torch.randint(0, VOCAB, (N, SEQ))
y = torch.randint(0, N_CLS, (N,))
Xtr, Xte = X[:160], X[160:]   # 160 train / 40 test
ytr, yte = y[:160], y[160:]

def run(model, epochs=80, lr=1e-3):
    # Full-batch training; the optimizer only receives parameters with requires_grad=True
    opt = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=lr)
    fn  = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        opt.zero_grad()
        fn(model(Xtr), ytr).backward()
        opt.step()

def acc(model, X, y):
    model.eval()
    with torch.no_grad():
        return (model(X).argmax(1) == y).float().mean().item()

base = TinyModel()
run(base, epochs=60)
print(f'Pre-trained accuracy: {acc(base, Xte, yte):.2%}')
print(f'Total base params: {sum(p.numel() for p in base.parameters()):,}')


Pre-trained accuracy: 12.50%
Total base params: 28,740
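
The reported total can be cross-checked by hand from the layer shapes. The sketch below is plain arithmetic, not part of the training pipeline; it assumes `nn.MultiheadAttention` stores a packed q/k/v projection with bias plus an output projection with bias (the PyTorch default when key and value dimensions equal the embedding dimension), and the variable names are chosen here for illustration.

```python
# Hand count of TinyModel's parameters from the layer shapes in section 2.1.
D, VOCAB, N_CLS, N_LAYERS = 32, 100, 4, 2
HIDDEN = 4 * D                                   # feed-forward width (d*4 in TinyLayer)

embed = VOCAB * D                                # embedding table
attn = (3 * D * D + 3 * D) + (D * D + D)         # packed q/k/v projection + output projection
ff = (D * HIDDEN + HIDDEN) + (HIDDEN * D + D)    # two linear layers with biases
norms = 2 * (2 * D)                              # two LayerNorms, weight + bias each
head = D * N_CLS + N_CLS                         # classification head

per_layer = attn + ff + norms
total = embed + N_LAYERS * per_layer + head
print(f"per layer: {per_layer:,}  total: {total:,}")  # per layer: 12,704  total: 28,740
```

Most of the budget sits in the two transformer layers, which is why the added modules in later sections stay small relative to the base.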


## 2.2 Adapter-Based Fine-Tuning

A **bottleneck adapter** (down-project to a small dimension r, apply a nonlinearity, up-project back to d) is inserted
after each transformer layer. The up-projection is zero-initialized, so the adapter's residual branch outputs zero and the
adapter starts as an identity function. It can therefore be inserted without changing the pre-trained model's outputs.

In [3]:
# -------------------------------------------------------------------
# Adapter: Linear(d->r) -> GELU -> Linear(r->d) + residual skip
# Zero-init on up-proj = identity at init (no disruption to base model)
# -------------------------------------------------------------------
class Adapter(nn.Module):
    def __init__(self, d=D, r=4):
        super().__init__()
        self.down = nn.Linear(d, r)
        self.act  = nn.GELU()
        self.up   = nn.Linear(r, d)
        nn.init.zeros_(self.up.weight)  # zero-init: starts as identity
        nn.init.zeros_(self.up.bias)
    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual

class LayerWithAdapter(nn.Module):
    def __init__(self, base_layer, r=4):
        super().__init__()
        self.base    = base_layer
        self.adapter = Adapter(D, r)
    def forward(self, x):
        return self.adapter(self.base(x))

adapter_model = copy.deepcopy(base)
adapter_model.layers = nn.ModuleList([
    LayerWithAdapter(layer) for layer in adapter_model.layers
])
# Freeze everything except adapter sub-modules
for name, p in adapter_model.named_parameters():
    p.requires_grad = 'adapter' in name

t = sum(p.numel() for p in adapter_model.parameters() if p.requires_grad)
n = sum(p.numel() for p in adapter_model.parameters())
print(f'Adapter trainable: {t:,} / {n:,}  ({t/n:.2%})')

run(adapter_model, epochs=80, lr=3e-3)
print(f'Adapter accuracy : {acc(adapter_model, Xte, yte):.2%}')
print(f'Base accuracy    : {acc(base, Xte, yte):.2%}  (base unchanged)')


Adapter trainable: 584 / 29,324  (1.99%)
Adapter accuracy : 12.50%
Base accuracy    : 12.50%  (base unchanged)
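
The identity-at-initialization property can be checked directly. The following is a minimal sketch that restates the adapter under a different name (`BottleneckAdapter`, chosen here so the block runs on its own) and compares its output to its input before any training.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class BottleneckAdapter(nn.Module):
    """Same structure as the Adapter class above, restated so this block is self-contained."""
    def __init__(self, d=32, r=4):
        super().__init__()
        self.down = nn.Linear(d, r)
        self.act = nn.GELU()
        self.up = nn.Linear(r, d)
        nn.init.zeros_(self.up.weight)   # zero weight and bias => residual branch outputs 0
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

adapter = BottleneckAdapter()
x = torch.randn(2, 10, 32)                        # [batch, seq, d]
is_identity = torch.equal(adapter(x), x)          # output equals input at init
n_params = sum(p.numel() for p in adapter.parameters())

# The up-projection still receives a non-zero gradient, so training can move
# the adapter away from the identity.
adapter(x).sum().backward()
up_weight_has_grad = adapter.up.weight.grad.abs().sum().item() > 0
print(is_identity, n_params, up_weight_has_grad)
```

With d=32 and r=4 each adapter holds 292 parameters, which matches the 584 trainable parameters reported above for two layers.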


## 2.3 LoRA — Low-Rank Adaptation

LoRA parameterizes the weight update as the product of two small matrices: **delta_W = B @ A**
- A is (r x d_in), B is (d_out x r), with r much smaller than d_in and d_out
- B is zero-initialized, so delta_W starts at zero and the wrapped layer initially matches the frozen one
- During training only A and B are updated; the frozen W never changes
- At inference: `W_effective = W_frozen + (B @ A) * scale`, where `scale = alpha / r`

In [4]:
# -------------------------------------------------------------------
# LoRALinear: wraps a frozen Linear, adds trainable low-rank delta
# W_effective = W_frozen + B @ A * (alpha/r)
# -------------------------------------------------------------------
class LoRALinear(nn.Module):
    def __init__(self, linear, r=4, alpha=8):
        super().__init__()
        d_out, d_in = linear.weight.shape
        self.frozen = linear
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # (r x d_in)
        self.B = nn.Parameter(torch.zeros(d_out, r))           # (d_out x r) zero init
        self.scale = alpha / r
        self.frozen.weight.requires_grad = False
        if self.frozen.bias is not None:
            self.frozen.bias.requires_grad = False
    def forward(self, x):
        return self.frozen(x) + (x @ self.A.T @ self.B.T) * self.scale

lora_model = copy.deepcopy(base)
# Apply LoRA to the first FF linear in each layer
for layer in lora_model.layers:
    layer.ff[0] = LoRALinear(layer.ff[0], r=4, alpha=8)

for name, p in lora_model.named_parameters():
    p.requires_grad = ('.A' in name or '.B' in name)

t = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
n = sum(p.numel() for p in lora_model.parameters())
print(f'LoRA trainable : {t:,} / {n:,}  ({t/n:.2%})')
print(f'A shape: {lora_model.layers[0].ff[0].A.shape}  (r x d_in)')
print(f'B shape: {lora_model.layers[0].ff[0].B.shape}  (d_out x r)')

run(lora_model, epochs=80, lr=3e-3)
print(f'LoRA accuracy  : {acc(lora_model, Xte, yte):.2%}')


LoRA trainable : 1,280 / 30,020  (4.26%)
A shape: torch.Size([4, 32])  (r x d_in)
B shape: torch.Size([128, 4])  (d_out x r)
LoRA accuracy  : 10.00%
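
Because the update is just an additive matrix, it can be folded into the frozen weight once training is done. The sketch below illustrates that merge on a standalone `nn.Linear` with the same shapes as `ff[0]`; the matrices `A` and `B` are random stand-ins for trained values (an assumption for illustration), and the last lines redo the parameter arithmetic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out, r, alpha = 32, 128, 4, 8
scale = alpha / r

frozen = nn.Linear(d_in, d_out)
A = torch.randn(r, d_in) * 0.01          # (r x d_in)
B = torch.randn(d_out, r) * 0.01         # (d_out x r); non-zero here to mimic a trained B

x = torch.randn(5, d_in)
with torch.no_grad():
    # Unmerged path, as in LoRALinear.forward
    y_lora = frozen(x) + (x @ A.T @ B.T) * scale

    # Merged path: fold the low-rank update into a single weight matrix
    merged = nn.Linear(d_in, d_out)
    merged.weight.copy_(frozen.weight + (B @ A) * scale)
    merged.bias.copy_(frozen.bias)
    y_merged = merged(x)

outputs_match = torch.allclose(y_lora, y_merged, atol=1e-5)
lora_params = r * (d_in + d_out)         # 4 * (32 + 128) = 640 per wrapped layer
full_params = d_in * d_out               # 32 * 128 = 4,096 for a dense update
print(outputs_match, lora_params, full_params)
```

Two wrapped layers at 640 parameters each give the 1,280 trainable parameters reported above. After merging, the layer is an ordinary `nn.Linear`, so inference costs the same as the base model.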


## 2.4 Prompt Tuning

Learnable **soft prompt vectors** are prepended to the token embedding sequence
before it enters the transformer. The entire base model stays frozen.
The trainable parameter count is just n_vt x d_model (5 x 32 = 160 here), independent of model depth, which makes the method attractive for very large models where even LoRA is expensive.

In [5]:
# -------------------------------------------------------------------
# Soft prompt: n_vt learnable embedding vectors prepended to real tokens
# Only self.prompt is trained -- base model fully frozen
# -------------------------------------------------------------------
class PromptTunedModel(nn.Module):
    def __init__(self, base, n_vt=5):
        super().__init__()
        self.base   = base
        self.n_vt   = n_vt
        self.prompt = nn.Parameter(torch.randn(n_vt, D) * 0.01)
    def forward(self, x):
        with torch.no_grad():
            tok_emb = self.base.embed(x)                                  # [B, seq, D]
        prompt_exp = self.prompt.unsqueeze(0).expand(x.size(0), -1, -1)  # [B, n_vt, D]
        h = torch.cat([prompt_exp, tok_emb], dim=1)                       # [B, n_vt+seq, D]
        for layer in self.base.layers:
            h = layer(h)
        return self.base.head(h[:, self.n_vt:].mean(1))  # pool over real tokens

pt_model = PromptTunedModel(copy.deepcopy(base), n_vt=5)
for name, p in pt_model.named_parameters():
    p.requires_grad = (name == 'prompt')

t = sum(p.numel() for p in pt_model.parameters() if p.requires_grad)
n = sum(p.numel() for p in pt_model.parameters())
print(f'Prompt tuning trainable: {t:,} / {n:,}  ({t/n:.3%})')
print(f'Soft prompt shape: {pt_model.prompt.shape}  (n_virtual_tokens x d_model)')

run(pt_model, epochs=100, lr=1e-2)
print(f'Prompt tuning accuracy : {acc(pt_model, Xte, yte):.2%}')


Prompt tuning trainable: 160 / 28,900  (0.554%)
Soft prompt shape: torch.Size([5, 32])  (n_virtual_tokens x d_model)
Prompt tuning accuracy : 12.50%
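
One way to confirm that only the soft prompt is being trained is to inspect gradients after a backward pass. This is a minimal sketch on a hypothetical one-layer stand-in (an embedding plus a single `nn.MultiheadAttention`), not the notebook's `PromptTunedModel`; the sizes and the squared-output loss are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_virtual, seq_len, batch, vocab = 8, 3, 5, 2, 20

# Frozen stand-in for the base model
embed = nn.Embedding(vocab, d_model)
attn = nn.MultiheadAttention(d_model, 2, batch_first=True)
frozen_params = list(embed.parameters()) + list(attn.parameters())
for p in frozen_params:
    p.requires_grad = False

# The only trainable tensor: n_virtual soft prompt vectors
prompt = nn.Parameter(torch.randn(n_virtual, d_model) * 0.01)

tokens = torch.randint(0, vocab, (batch, seq_len))
prompt_exp = prompt.unsqueeze(0).expand(batch, -1, -1)      # [B, n_virtual, D]
h = torch.cat([prompt_exp, embed(tokens)], dim=1)           # [B, n_virtual + seq, D]
out, _ = attn(h, h, h)

# Pool over real tokens only; the prompt still affects them through attention
loss = out[:, n_virtual:].pow(2).mean()
loss.backward()

prompt_got_grad = prompt.grad is not None and prompt.grad.abs().sum().item() > 0
base_untouched = all(p.grad is None for p in frozen_params)
print(h.shape, prompt_got_grad, base_untouched)
```

The prompt positions are dropped before pooling, yet the prompt still receives a gradient because the real tokens attend to it.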


## 2.5 Prefix Tuning

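Similar to prompt tuning, but trainable prefix vectors are prepended at **every transformer layer** rather than only at the input.
This gives the prefix direct influence over each layer's attention computation,
making it more expressive; it is especially effective for text generation tasks.
The implementation below is a simplified variant: each prefix is concatenated to the layer's input and passes through the frozen attention projections, whereas the original formulation learns prefix key and value vectors directly.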

In [6]:
# -------------------------------------------------------------------
# PrefixLayer: each transformer layer gets its own trainable prefix tokens
# Prefix is concatenated to the full sequence for attention, then stripped from output
# -------------------------------------------------------------------
class PrefixLayer(nn.Module):
    def __init__(self, base_layer, n_prefix=5):
        super().__init__()
        self.base     = base_layer
        self.n_prefix = n_prefix
        self.prefix   = nn.Parameter(torch.randn(n_prefix, D) * 0.01)
    def forward(self, x):
        B = x.size(0)
        pre = self.prefix.unsqueeze(0).expand(B, -1, -1)  # [B, n_prefix, D]
        x_ext = torch.cat([pre, x], dim=1)                 # [B, n_prefix+seq, D]
        attn_out, _ = self.base.attn(x_ext, x_ext, x_ext)
        attn_out = attn_out[:, self.n_prefix:]             # strip prefix from output
        x = self.base.n1(x + attn_out)
        return self.base.n2(x + self.base.ff(x))

prefix_model = copy.deepcopy(base)
prefix_model.layers = nn.ModuleList([
    PrefixLayer(layer, n_prefix=5) for layer in prefix_model.layers
])
for name, p in prefix_model.named_parameters():
    p.requires_grad = 'prefix' in name

t = sum(p.numel() for p in prefix_model.parameters() if p.requires_grad)
n = sum(p.numel() for p in prefix_model.parameters())
print(f'Prefix tuning trainable: {t:,} / {n:,}  ({t/n:.3%})')

run(prefix_model, epochs=100, lr=5e-3)
print(f'Prefix tuning accuracy : {acc(prefix_model, Xte, yte):.2%}')


Prefix tuning trainable: 320 / 29,060  (1.101%)
Prefix tuning accuracy : 10.00%
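
Since each query row in attention is computed independently, running self-attention over the extended sequence and then discarding the prefix rows should give the same result as using only the real tokens as queries while letting keys and values include the prefix. The sketch below checks this on a standalone `nn.MultiheadAttention` with random inputs; it is an illustration of the equivalence under those assumptions, not a change to `PrefixLayer`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_prefix, seq_len, batch = 32, 5, 10, 3

attn = nn.MultiheadAttention(d_model, 4, batch_first=True).eval()
prefix = torch.randn(n_prefix, d_model) * 0.01
x = torch.randn(batch, seq_len, d_model)
x_ext = torch.cat([prefix.unsqueeze(0).expand(batch, -1, -1), x], dim=1)

with torch.no_grad():
    # As in PrefixLayer: self-attention over prefix + tokens, then drop the prefix rows
    full_out, _ = attn(x_ext, x_ext, x_ext)
    stripped = full_out[:, n_prefix:]

    # Alternative: queries from real tokens only, keys/values include the prefix
    kv_only, _ = attn(x, x_ext, x_ext)

same_result = torch.allclose(stripped, kv_only, atol=1e-5)
print(stripped.shape, kv_only.shape, same_result)
```

The second form skips computing attention outputs for the prefix positions, which are thrown away anyway.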


## 2.6 Side-by-Side Comparison


In [7]:
# Full fine-tune for reference
full = copy.deepcopy(base)
for p in full.parameters():
    p.requires_grad = True
run(full, epochs=80, lr=5e-4)

base_total = sum(p.numel() for p in base.parameters())
rows = [
    ('Full Fine-Tune',  full,          base_total),
    ('Adapter',         adapter_model, sum(p.numel() for p in adapter_model.parameters() if p.requires_grad)),
    ('LoRA',            lora_model,    sum(p.numel() for p in lora_model.parameters()    if p.requires_grad)),
    ('Prompt Tuning',   pt_model,      sum(p.numel() for p in pt_model.parameters()      if p.requires_grad)),
    ('Prefix Tuning',   prefix_model,  sum(p.numel() for p in prefix_model.parameters() if p.requires_grad)),
]
print(f"{'Method':<18} {'Trainable':>10} {'% of base':>10} {'Accuracy':>10}")
print('-' * 52)
for name, model, t in rows:
    print(f'{name:<18} {t:>10,} {t/base_total*100:>9.2f}% {acc(model, Xte, yte):>9.2%}')


Method              Trainable  % of base   Accuracy
----------------------------------------------------
Full Fine-Tune         28,740    100.00%    10.00%
Adapter                   584      2.03%    12.50%
LoRA                    1,280      4.45%    10.00%
Prompt Tuning             160      0.56%    12.50%
Prefix Tuning             320      1.11%    10.00%
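
The labels in this notebook are drawn uniformly at random, so the accuracy column mostly reflects sampling noise. As a rough sketch, assuming test predictions behave like independent draws at chance level (an assumption made here, not something the notebook measures), the expected accuracy and its spread on the 40-example test split are:

```python
import math

n_test, n_classes = 40, 4
chance = 1 / n_classes
std_err = math.sqrt(chance * (1 - chance) / n_test)   # binomial standard error of accuracy
step = 1 / n_test                                      # accuracy change from one test example

print(f"chance = {chance:.1%}, std err = {std_err:.1%}, one example = {step:.1%}")
# chance = 25.0%, std err = 6.8%, one example = 2.5%
```

The 2.5-point gap between 10.00% and 12.50% in the table corresponds to a single test example, so it should not be read as a difference between methods.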


## 2.7 Key Takeaways

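| Method | Typical trainable params (large models) | Strength | Weakness |
|--------|--------|----------|----------|
| **Adapter** | ~5–10% | Easy per-task swapping | Adds inference latency |
| **LoRA** | <1% | Widely used; can be merged into W, so no extra inference cost | Must choose target modules |
| **Prompt Tuning** | <0.1% | Extremely lightweight | Works best with very large base models |
| **Prefix Tuning** | <0.5% | More expressive than prompt tuning | More complex per-layer implementation |

The parameter fractions in this table are rough figures for large pre-trained models. On the tiny model in this notebook the measured fractions differ: 2.03% (adapter), 4.45% (LoRA), 0.56% (prompt tuning), and 1.11% (prefix tuning).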

All four keep base weights frozen — the same base model can serve many tasks simultaneously.
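
The multi-task claim can be illustrated with a frozen layer shared by several task-specific adapters. This is a minimal sketch with hypothetical names (`shared_base`, `TaskAdapter`, `task_a`, `task_b`) and a single linear layer standing in for the base model; adding a constant to one adapter's bias stands in for training it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 8

# One frozen base, shared by every task
shared_base = nn.Linear(d, d)
for p in shared_base.parameters():
    p.requires_grad = False

class TaskAdapter(nn.Module):
    """Zero-initialized bottleneck adapter (hypothetical, for illustration)."""
    def __init__(self, d, r=2):
        super().__init__()
        self.down = nn.Linear(d, r)
        self.up = nn.Linear(r, d)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

task_adapters = nn.ModuleDict({"task_a": TaskAdapter(d), "task_b": TaskAdapter(d)})

def forward_for_task(x, task):
    # Select the task-specific adapter at call time; the base is never modified
    return task_adapters[task](shared_base(x))

base_weight_before = shared_base.weight.clone()

# Stand-in for training task_b: shift its up-projection bias away from zero
with torch.no_grad():
    task_adapters["task_b"].up.bias.add_(0.1)

x = torch.randn(4, d)
out_a = forward_for_task(x, "task_a")   # untrained adapter: same as the base output
out_b = forward_for_task(x, "task_b")   # modified adapter: base output shifted by 0.1
print(torch.equal(out_a, shared_base(x)), torch.allclose(out_b, shared_base(x) + 0.1))
```

Only the small per-task modules need to be stored and swapped; the base weights are loaded once.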