# BERT & Family Models from Scratch

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adiel2012/deep-learning-abc/blob/main/bert_family.ipynb)

This notebook implements BERT and its key variants from scratch using raw tensor operations:

1. **BERT** — Bidirectional Encoder Representations from Transformers (Devlin et al., 2018)
2. **RoBERTa** — Robustly Optimized BERT (Liu et al., 2019)
3. **ALBERT** — A Lite BERT with parameter sharing (Lan et al., 2019)
4. **DistilBERT** — Knowledge distillation (Sanh et al., 2019)
5. **DeBERTa** — Disentangled attention (He et al., 2020)

We build each model's distinctive components and compare their design choices.

In [None]:
!pip install torch matplotlib

In [None]:
import torch
import matplotlib.pyplot as plt
import math

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

## 0. Mathematical Foundations

### BERT: Bidirectional Pre-training

Unlike GPT (left-to-right), BERT uses **bidirectional** attention — each token can attend to all other tokens in both directions. BERT is pre-trained with two objectives:

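1. **Masked Language Model (MLM):** Select 15% of tokens at random and predict the originals from context. Of the selected tokens, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged.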
2. **Next Sentence Prediction (NSP):** Given two sentences, predict if the second follows the first.
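
The MLM objective is scored only at the selected positions. The sketch below is one way to express that, assuming PyTorch's `ignore_index=-100` convention for unscored positions. The vocabulary size, the `[MASK]` id of 103, and the random logits are stand-ins for illustration, not values taken from a trained model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, MASK_ID = 1000, 103          # hypothetical vocabulary size and [MASK] id

orig_ids = torch.randint(200, V, (2, 8))               # stand-in for real token ids
selected = torch.zeros_like(orig_ids, dtype=torch.bool)
selected[0, 3] = True                                   # positions chosen for prediction
selected[1, 5] = True

mlm_inputs = orig_ids.masked_fill(selected, MASK_ID)    # what the model would see
mlm_labels = orig_ids.masked_fill(~selected, -100)      # -100 = not scored

fake_logits = torch.randn(2, 8, V)                      # stand-in for model output

loss = F.cross_entropy(fake_logits.view(-1, V), mlm_labels.view(-1), ignore_index=-100)

# Same quantity by hand: mean negative log-likelihood over the selected positions
log_probs = torch.log_softmax(fake_logits[selected], dim=-1)
manual = -log_probs.gather(1, orig_ids[selected].unsqueeze(1)).mean()
print(f'cross_entropy: {loss.item():.4f}   manual: {manual.item():.4f}')
```

The two numbers agree because `cross_entropy` averages over the non-ignored positions only.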

### Key Architecture Differences

| Model | Key Innovation | Parameters (base) |
|-------|---------------|-------------------|
| BERT-base | Bidirectional encoder, MLM + NSP | 110M |
| RoBERTa | Better training recipe, no NSP | 125M |
| ALBERT | Cross-layer parameter sharing + factorized embeddings | 12M |
| DistilBERT | Knowledge distillation (6 layers from 12) | 66M |
| DeBERTa | Disentangled content + position attention | 140M |
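
The 110M figure for BERT-base can be reproduced with a few lines of arithmetic. This sketch assumes the published BERT-base configuration (30,522-token vocabulary, 512 positions, hidden size 768, feed-forward size 3072, 12 layers), biases on every dense layer, and a pooler on `[CLS]`; the pre-training heads are left out. The simplified model built later in this notebook omits some of these pieces, so its count differs.

```python
def bert_base_param_count(vocab=30522, max_pos=512, n_seg=2, hidden=768, ff=3072, layers=12):
    """Count encoder parameters for the published BERT-base configuration."""
    ln = 2 * hidden                                     # LayerNorm gamma + beta
    emb = (vocab + max_pos + n_seg) * hidden + ln       # three tables + LayerNorm
    attn = 4 * (hidden * hidden + hidden)               # Q, K, V, O with biases
    mlp = (hidden * ff + ff) + (ff * hidden + hidden)   # two dense layers with biases
    per_layer = attn + ln + mlp + ln
    pooler = hidden * hidden + hidden                   # dense layer applied to [CLS]
    return {'embeddings': emb, 'per_layer': per_layer,
            'total': emb + layers * per_layer + pooler}

counts = bert_base_param_count()
for name, n in counts.items():
    print(f'{name:<11}{n:>12,}')
```

The total comes to 109,482,240, which rounds to the 110M in the table; roughly a fifth of that sits in the token embedding table alone.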

## 1. Core Components

In [None]:
def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

def gelu(x):
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

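def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O, n_heads, mask=None):
    """Standard multi-head attention.

    mask (optional): 1 = attend, 0 = block. Shape (batch, seq_q, seq_k) or
    (batch, 1, seq_k), so that mask.unsqueeze(1) broadcasts over heads.
    """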
    batch, seq_len, d_model = Q.shape
    d_k = d_model // n_heads
    
    Q = (Q @ W_Q).view(batch, seq_len, n_heads, d_k).transpose(1, 2)
    K = (K @ W_K).view(batch, -1, n_heads, d_k).transpose(1, 2)
    V = (V @ W_V).view(batch, -1, n_heads, d_k).transpose(1, 2)
    
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask.unsqueeze(1) == 0, float('-inf'))
    
    weights = torch.softmax(scores, dim=-1)
    out = torch.matmul(weights, V)
    out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
    return out @ W_O, weights

def ffn(x, W1, b1, W2, b2):
    return gelu(x @ W1 + b1) @ W2 + b2

print('Core components ready: layer_norm, gelu, multi_head_attention, ffn')
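
`multi_head_attention` accepts a `mask`, although the cells in this notebook never pass one. The sketch below isolates the masking step, assuming a padding mask of shape `(batch, 1, seq_len)` with 1 for real tokens and 0 for padding. That shape is an assumption, chosen so that `mask.unsqueeze(1)` broadcasts over heads and query positions; `masked_attention_weights` is a hypothetical helper written for this example.

```python
import math
import torch

def masked_attention_weights(Q, K, pad_mask):
    """Attention weights with padded key positions blocked.

    Q, K: (batch, heads, seq, d_k); pad_mask: (batch, 1, seq), 1 = real token.
    """
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    # Same masking step as in multi_head_attention
    scores = scores.masked_fill(pad_mask.unsqueeze(1) == 0, float('-inf'))
    return torch.softmax(scores, dim=-1)

torch.manual_seed(0)
lengths = torch.tensor([5, 3])                     # sample 1 has two padding tokens
pad_mask = (torch.arange(5)[None, :] < lengths[:, None]).long().unsqueeze(1)
attn = masked_attention_weights(torch.randn(2, 4, 5, 16), torch.randn(2, 4, 5, 16), pad_mask)

print(attn[1, 0])                # last two columns are exactly zero
print(attn[1, 0].sum(dim=-1))    # each row still sums to 1
```

Filling with `-inf` before the softmax, rather than zeroing weights afterwards, keeps every row a valid probability distribution over the real tokens.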

## 2. BERT from Scratch

BERT = Token Embeddings + Segment Embeddings + Position Embeddings → N Transformer Encoder Layers → Task-specific heads.
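
The embedding sum needs three aligned id sequences per example. Here is a minimal sketch of how a sentence pair might be packed, using plain words in place of a real WordPiece tokenizer. `build_bert_input` is a hypothetical helper for illustration and is not used elsewhere in the notebook.

```python
def build_bert_input(tokens_a, tokens_b):
    """Pack two token lists as [CLS] A [SEP] B [SEP] with matching segment ids."""
    tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
    # Segment 0 covers [CLS], sentence A, and the first [SEP]; segment 1 covers the rest
    segments = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    positions = list(range(len(tokens)))
    return tokens, segments, positions

demo_tokens, demo_segments, demo_positions = build_bert_input(
    ['the', 'cat', 'sat'], ['it', 'purred'])
for t, s, p in zip(demo_tokens, demo_segments, demo_positions):
    print(f'pos={p}  seg={s}  {t}')
```

Each of the three lists indexes its own embedding table, and the three looked-up vectors are summed per position.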

In [None]:
def init_bert(vocab_size, d_model, n_heads, d_ff, n_layers, max_len, device):
    """Initialize BERT parameters."""
    scale = 0.02
    params = {
        # Embeddings
        'token_emb': torch.randn(vocab_size, d_model, device=device) * scale,
        'segment_emb': torch.randn(2, d_model, device=device) * scale,  # 2 segments (A, B)
        'position_emb': torch.randn(max_len, d_model, device=device) * scale,
        'emb_ln_gamma': torch.ones(d_model, device=device),
        'emb_ln_beta': torch.zeros(d_model, device=device),
        'layers': [],
        # MLM head
        'mlm_dense_W': torch.randn(d_model, d_model, device=device) * scale,
        'mlm_dense_b': torch.zeros(d_model, device=device),
        'mlm_ln_gamma': torch.ones(d_model, device=device),
        'mlm_ln_beta': torch.zeros(d_model, device=device),
        # NSP head
        'nsp_W': torch.randn(d_model, 2, device=device) * scale,
        'nsp_b': torch.zeros(2, device=device),
    }
    
    for _ in range(n_layers):
        layer = {
            'W_Q': torch.randn(d_model, d_model, device=device) * scale,
            'W_K': torch.randn(d_model, d_model, device=device) * scale,
            'W_V': torch.randn(d_model, d_model, device=device) * scale,
            'W_O': torch.randn(d_model, d_model, device=device) * scale,
            'ln1_gamma': torch.ones(d_model, device=device),
            'ln1_beta': torch.zeros(d_model, device=device),
            'W1': torch.randn(d_model, d_ff, device=device) * scale,
            'b1': torch.zeros(d_ff, device=device),
            'W2': torch.randn(d_ff, d_model, device=device) * scale,
            'b2': torch.zeros(d_model, device=device),
            'ln2_gamma': torch.ones(d_model, device=device),
            'ln2_beta': torch.zeros(d_model, device=device),
        }
        params['layers'].append(layer)
    
    return params

def bert_forward(token_ids, segment_ids, params, n_heads):
    """BERT forward pass with MLM and NSP outputs."""
    batch, seq_len = token_ids.shape
    
    # Embedding: token + segment + position
    positions = torch.arange(seq_len, device=token_ids.device)
    x = (params['token_emb'][token_ids] +
         params['segment_emb'][segment_ids] +
         params['position_emb'][positions])
    x = layer_norm(x, params['emb_ln_gamma'], params['emb_ln_beta'])
    
    # Encoder layers (Post-LN, as in original BERT)
    all_weights = []
    for layer in params['layers']:
        # Self-attention
        attn_out, w = multi_head_attention(
            x, x, x, layer['W_Q'], layer['W_K'], layer['W_V'], layer['W_O'], n_heads
        )
        x = layer_norm(x + attn_out, layer['ln1_gamma'], layer['ln1_beta'])
        all_weights.append(w)
        
        # FFN
        ffn_out = ffn(x, layer['W1'], layer['b1'], layer['W2'], layer['b2'])
        x = layer_norm(x + ffn_out, layer['ln2_gamma'], layer['ln2_beta'])
    
    # MLM head: predict masked tokens
    mlm_hidden = gelu(x @ params['mlm_dense_W'] + params['mlm_dense_b'])
    mlm_hidden = layer_norm(mlm_hidden, params['mlm_ln_gamma'], params['mlm_ln_beta'])
    mlm_logits = mlm_hidden @ params['token_emb'].T  # tied weights
    
    # NSP head: use [CLS] token (position 0)
    cls_output = x[:, 0]  # (batch, d_model)
    nsp_logits = cls_output @ params['nsp_W'] + params['nsp_b']
    
    return mlm_logits, nsp_logits, x, all_weights

# Test BERT
torch.manual_seed(42)
vocab_size, d_model, n_heads, d_ff, n_layers, max_len = 1000, 64, 4, 256, 4, 128

bert_params = init_bert(vocab_size, d_model, n_heads, d_ff, n_layers, max_len, device)

# Simulated input: [CLS] tokens_A [SEP] tokens_B [SEP]
token_ids = torch.randint(0, vocab_size, (2, 12), device=device)
segment_ids = torch.tensor([[0,0,0,0,0,0,1,1,1,1,1,1],
                             [0,0,0,0,0,0,0,1,1,1,1,1]], device=device)

mlm_logits, nsp_logits, hidden, attn_weights = bert_forward(
    token_ids, segment_ids, bert_params, n_heads
)

print('BERT outputs:')
print(f'  MLM logits: {mlm_logits.shape} (batch, seq_len, vocab_size)')
print(f'  NSP logits: {nsp_logits.shape} (batch, 2)')
print(f'  Hidden states: {hidden.shape}')
print(f'  Attention weights: {len(attn_weights)} layers x {attn_weights[0].shape}')

# Count parameters
total_params = (vocab_size * d_model + 2 * d_model + max_len * d_model + 2 * d_model +  # embeddings + embedding LayerNorm
                n_layers * (4 * d_model * d_model + d_model * d_ff * 2 + d_ff + d_model + 4 * d_model) +  # layers
                d_model * d_model + d_model + 2 * d_model + d_model * 2 + 2)  # heads
print(f'\nTotal parameters: ~{total_params:,}')

In [None]:
# Demonstrate Masked Language Modeling
print('=== Masked Language Modeling Demo ===')
print(f'Input tokens: {token_ids[0].cpu().tolist()}')

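# Mask positions 3 and 7 by swapping in a [MASK] id (103 here, matching the
# default in dynamic_masking below), then re-run the model on the masked input
mask_positions = [3, 7]
mask_token_id = 103
masked_ids = token_ids.clone()
masked_ids[0, mask_positions] = mask_token_id
masked_mlm_logits, _, _, _ = bert_forward(masked_ids, segment_ids, bert_params, n_heads)
print(f'Masked positions: {mask_positions}')

# MLM predictions at masked positions (weights are untrained, so these are near-uniform)
for pos in mask_positions:
    probs = torch.softmax(masked_mlm_logits[0, pos], dim=-1)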
    top5 = probs.topk(5)
    print(f'\nPosition {pos} (true token={token_ids[0, pos].item()}):')
    print(f'  Top 5 predictions: {top5.indices.cpu().tolist()}')
    print(f'  Probabilities:     {[f"{p:.4f}" for p in top5.values.cpu().tolist()]}')

# NSP prediction
nsp_probs = torch.softmax(nsp_logits, dim=-1)
print('\nNSP predictions (untrained; original BERT labels: 0=is-next, 1=not-next):')
for b in range(2):
    print(f'  Sample {b}: {nsp_probs[b].cpu().tolist()}')

## 3. RoBERTa: Better Training Recipe

RoBERTa uses the **same architecture** as BERT but with improved training:
- **No NSP objective** (dropped — found to be unnecessary)
- **Dynamic masking** (different mask each epoch, not static)
- **Larger batches**, **more data**, longer training
- **Byte-level BPE** tokenization

In [None]:
def roberta_forward(token_ids, params, n_heads):
    """RoBERTa forward — same as BERT but NO segment embeddings, NO NSP."""
    batch, seq_len = token_ids.shape
    positions = torch.arange(seq_len, device=token_ids.device)
    
    # No segment embeddings: without a sentence-pair objective, a single segment would only add a constant vector
    x = (params['token_emb'][token_ids] +
         params['position_emb'][positions])
    x = layer_norm(x, params['emb_ln_gamma'], params['emb_ln_beta'])
    
    for layer_p in params['layers']:
        attn_out, _ = multi_head_attention(
            x, x, x, layer_p['W_Q'], layer_p['W_K'], layer_p['W_V'], layer_p['W_O'], n_heads
        )
        x = layer_norm(x + attn_out, layer_p['ln1_gamma'], layer_p['ln1_beta'])
        ffn_out = ffn(x, layer_p['W1'], layer_p['b1'], layer_p['W2'], layer_p['b2'])
        x = layer_norm(x + ffn_out, layer_p['ln2_gamma'], layer_p['ln2_beta'])
    
    # MLM head only — no NSP
    mlm_hidden = gelu(x @ params['mlm_dense_W'] + params['mlm_dense_b'])
    mlm_hidden = layer_norm(mlm_hidden, params['mlm_ln_gamma'], params['mlm_ln_beta'])
    mlm_logits = mlm_hidden @ params['token_emb'].T
    
    return mlm_logits, x

# Dynamic masking — different mask each call
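def dynamic_masking(token_ids, mask_prob=0.15, mask_token_id=103, vocab_size=1000):
    """RoBERTa-style dynamic masking: new random mask each time.

    Returns (masked_ids, mask), where mask marks the positions to predict.
    Special tokens are not excluded here, unlike in a real pipeline.
    """
    masked_ids = token_ids.clone()
    # Select roughly mask_prob of all positions for prediction
    mask = torch.rand_like(token_ids.float()) < mask_prob
    # Of the selected positions: 80% → [MASK], 10% → random token, 10% → unchanged
    rand = torch.rand_like(token_ids.float())
    mask_token = mask & (rand < 0.8)
    random_token = mask & (rand >= 0.8) & (rand < 0.9)

    masked_ids[mask_token] = mask_token_id
    # Draw replacements from the full vocabulary instead of a hard-coded range
    masked_ids[random_token] = torch.randint_like(masked_ids[random_token], 0, vocab_size)
    return masked_ids, mask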

# Test dynamic masking
tokens = torch.randint(0, 100, (1, 20), device=device)
print('Original:', tokens[0].cpu().tolist())
for i in range(3):
    masked, mask = dynamic_masking(tokens)
    print(f'Mask {i+1}:  {masked[0].cpu().tolist()}')
    print(f'  Masked positions: {mask[0].nonzero().squeeze(-1).cpu().tolist()}')

print('\n→ Different mask each time! (BERT generated its masks once, during data preprocessing)')

## 4. ALBERT: Parameter Efficiency

ALBERT introduces two key tricks to reduce parameters:

1. **Factorized Embedding**: Instead of projecting directly from vocab → d_model, use vocab → d_emb → d_model (with d_emb << d_model)
2. **Cross-Layer Parameter Sharing**: All encoder layers share the same weights
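
To see what the factorization buys at realistic scale, the arithmetic can be worked with the published ALBERT-base sizes (vocabulary 30,000, embedding size 128, hidden size 768). `embedding_params` is a hypothetical helper written for this example.

```python
def embedding_params(vocab, hidden, emb=None):
    """Token-embedding parameter count: direct (emb=None) or factorized."""
    if emb is None:
        return vocab * hidden
    return vocab * emb + emb * hidden

V, H, E = 30000, 768, 128   # ALBERT-base: vocabulary, hidden size, embedding size
direct = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(f'direct:     {direct:,}')
print(f'factorized: {factorized:,}')
print(f'ratio:      {direct / factorized:.2f}x')
```

The direct table costs 23,040,000 parameters against 3,938,304 for the factorized version, a reduction of about 5.85x. The saving grows with vocabulary size because only the small `V × E` term scales with it.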

In [None]:
def init_albert(vocab_size, d_emb, d_model, n_heads, d_ff, n_layers, max_len, device):
    """Initialize ALBERT — factorized embeddings + shared layers."""
    scale = 0.02
    params = {
        # Factorized embedding: vocab → d_emb → d_model
        'token_emb': torch.randn(vocab_size, d_emb, device=device) * scale,  # Small!
        'emb_proj': torch.randn(d_emb, d_model, device=device) * scale,  # Project up
        'position_emb': torch.randn(max_len, d_emb, device=device) * scale,
        'emb_ln_gamma': torch.ones(d_model, device=device),
        'emb_ln_beta': torch.zeros(d_model, device=device),
        'n_layers': n_layers,
        # SHARED layer — only ONE set of weights for all layers!
        'shared_layer': {
            'W_Q': torch.randn(d_model, d_model, device=device) * scale,
            'W_K': torch.randn(d_model, d_model, device=device) * scale,
            'W_V': torch.randn(d_model, d_model, device=device) * scale,
            'W_O': torch.randn(d_model, d_model, device=device) * scale,
            'ln1_gamma': torch.ones(d_model, device=device),
            'ln1_beta': torch.zeros(d_model, device=device),
            'W1': torch.randn(d_model, d_ff, device=device) * scale,
            'b1': torch.zeros(d_ff, device=device),
            'W2': torch.randn(d_ff, d_model, device=device) * scale,
            'b2': torch.zeros(d_model, device=device),
            'ln2_gamma': torch.ones(d_model, device=device),
            'ln2_beta': torch.zeros(d_model, device=device),
        },
    }
    return params

def albert_forward(token_ids, params, n_heads):
    """ALBERT forward: factorized embedding + shared layers."""
    batch, seq_len = token_ids.shape
    positions = torch.arange(seq_len, device=token_ids.device)
    
    # Factorized embedding: token + position in small space, then project up
    x = params['token_emb'][token_ids] + params['position_emb'][positions]
    x = x @ params['emb_proj']  # d_emb → d_model
    x = layer_norm(x, params['emb_ln_gamma'], params['emb_ln_beta'])
    
    # Reuse same layer N times
    layer_p = params['shared_layer']
    for _ in range(params['n_layers']):
        attn_out, _ = multi_head_attention(
            x, x, x, layer_p['W_Q'], layer_p['W_K'], layer_p['W_V'], layer_p['W_O'], n_heads
        )
        x = layer_norm(x + attn_out, layer_p['ln1_gamma'], layer_p['ln1_beta'])
        ffn_out = ffn(x, layer_p['W1'], layer_p['b1'], layer_p['W2'], layer_p['b2'])
        x = layer_norm(x + ffn_out, layer_p['ln2_gamma'], layer_p['ln2_beta'])
    
    return x

# Compare parameter counts
torch.manual_seed(42)
d_emb = 16  # Small embedding dimension
albert_params = init_albert(vocab_size, d_emb, d_model, n_heads, d_ff, n_layers, max_len, device)

albert_out = albert_forward(token_ids, albert_params, n_heads)
print(f'ALBERT output: {albert_out.shape}')

# Parameter comparison
bert_emb_params = vocab_size * d_model  # Direct: V × d_model
albert_emb_params = vocab_size * d_emb + d_emb * d_model  # Factorized: V × d_emb + d_emb × d_model

bert_layer_params = 4 * d_model * d_model + 2 * d_model * d_ff  # Per layer, weight matrices only (no biases or LayerNorm)
albert_layer_params = bert_layer_params  # Same size, but shared!

print(f'\nEmbedding parameter comparison:')
print(f'  BERT:   V × d_model = {vocab_size} × {d_model} = {bert_emb_params:,}')
print(f'  ALBERT: V × d_emb + d_emb × d_model = {vocab_size} × {d_emb} + {d_emb} × {d_model} = {albert_emb_params:,}')
print(f'  Savings: {bert_emb_params / albert_emb_params:.1f}x fewer embedding params')

print(f'\nEncoder parameter comparison:')
print(f'  BERT:   {n_layers} layers × {bert_layer_params} = {n_layers * bert_layer_params:,}')
print(f'  ALBERT: 1 shared layer × {albert_layer_params} = {albert_layer_params:,}')
print(f'  Savings: {n_layers}x fewer encoder params')

## 5. DistilBERT: Knowledge Distillation

DistilBERT trains a smaller student (6 layers) to mimic a larger teacher (12 layers) using:
- **Distillation loss**: Match teacher's soft probability distribution
- **MLM loss**: Standard masked language modeling
- **Cosine embedding loss**: Align hidden representations
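
The three terms are combined into one training objective. The sketch below shows one way to do that with built-in PyTorch losses. The function name `distil_objective`, the weights `w_kd`, `w_mlm`, `w_cos`, and the random inputs are illustrative assumptions, not the values used to train the released DistilBERT.

```python
import torch
import torch.nn.functional as F

def distil_objective(s_logits, t_logits, s_hidden, t_hidden, mlm_labels,
                     T=4.0, w_kd=5.0, w_mlm=2.0, w_cos=1.0):
    """Weighted sum of the three DistilBERT losses (weights are illustrative)."""
    V = s_logits.size(-1)
    D = s_hidden.size(-1)
    # 1. Distillation: KL(teacher || student) on temperature-softened distributions
    kd = F.kl_div(F.log_softmax(s_logits.reshape(-1, V) / T, dim=-1),
                  F.softmax(t_logits.reshape(-1, V) / T, dim=-1),
                  reduction='batchmean') * T ** 2
    # 2. MLM: cross-entropy on masked positions only (-100 = ignored)
    mlm = F.cross_entropy(s_logits.reshape(-1, V), mlm_labels.reshape(-1), ignore_index=-100)
    # 3. Cosine: pull each student hidden vector toward the teacher's direction
    target = torch.ones(s_hidden.numel() // D, device=s_hidden.device)
    cos = F.cosine_embedding_loss(s_hidden.reshape(-1, D), t_hidden.reshape(-1, D), target)
    return w_kd * kd + w_mlm * mlm + w_cos * cos, (kd, mlm, cos)

torch.manual_seed(0)
s_logits, t_logits = torch.randn(2, 6, 50), torch.randn(2, 6, 50)
s_hidden, t_hidden = torch.randn(2, 6, 16), torch.randn(2, 6, 16)
labels = torch.full((2, 6), -100, dtype=torch.long)
labels[0, 2], labels[1, 4] = 7, 11                 # two masked positions

total, (kd, mlm, cos) = distil_objective(s_logits, t_logits, s_hidden, t_hidden, labels)
print(f'kd={kd.item():.4f}  mlm={mlm.item():.4f}  cos={cos.item():.4f}  total={total.item():.4f}')
```

The teacher tensors would normally be computed under `torch.no_grad()` so that only the student receives gradients.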

In [None]:
def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Compute KL-divergence between teacher and student soft predictions.

    Higher temperature → softer distributions → more information transferred.
    """
    # Soft targets from teacher
    teacher_log_probs = torch.log_softmax(teacher_logits / temperature, dim=-1)
    teacher_probs = teacher_log_probs.exp()

    # Soft predictions from student
    student_log_probs = torch.log_softmax(student_logits / temperature, dim=-1)

    # KL(teacher || student), averaged over positions and scaled by T².
    # Subtracting the teacher's log-probs is what separates KL from plain
    # cross-entropy; the gradient w.r.t. the student is the same either way.
    kl = (teacher_probs * (teacher_log_probs - student_log_probs)).sum(dim=-1).mean()
    return kl * (temperature ** 2)

def cosine_embedding_loss(student_hidden, teacher_hidden):
    """Align student and teacher hidden representations."""
    cos_sim = torch.nn.functional.cosine_similarity(student_hidden, teacher_hidden, dim=-1)
    return (1 - cos_sim).mean()

# Demonstrate distillation
torch.manual_seed(42)

# Teacher: 12-layer BERT
teacher_params = init_bert(vocab_size, d_model, n_heads, d_ff, 12, max_len, device)
teacher_mlm, _, teacher_hidden, _ = bert_forward(token_ids, segment_ids, teacher_params, n_heads)

# Student: 6-layer BERT (DistilBERT)
student_params = init_bert(vocab_size, d_model, n_heads, d_ff, 6, max_len, device)
student_mlm, _, student_hidden, _ = bert_forward(token_ids, segment_ids, student_params, n_heads)

# Compute losses
d_loss = distillation_loss(student_mlm, teacher_mlm.detach(), temperature=4.0)
c_loss = cosine_embedding_loss(student_hidden, teacher_hidden.detach())

print('DistilBERT training losses:')
print(f'  Distillation loss (KL-div): {d_loss.item():.4f}')
print(f'  Cosine embedding loss:      {c_loss.item():.4f}')
print(f'\nTeacher: 12 layers, Student: 6 layers')
print("→ Sanh et al. (2019) report a student that is 40% smaller and 60% faster, with ~97% of BERT's language-understanding performance")

In [None]:
# Visualize temperature effect on soft distributions
torch.manual_seed(42)
logits = torch.tensor([5.0, 2.0, 0.5, -1.0, -3.0], device=device)
temperatures = [0.5, 1.0, 2.0, 4.0, 8.0]

fig, ax = plt.subplots(figsize=(10, 5))
x_pos = range(len(logits))
width = 0.15

for i, T in enumerate(temperatures):
    probs = torch.softmax(logits / T, dim=-1).cpu().numpy()
    offset = (i - (len(temperatures) - 1) / 2) * width  # center the bar group on each tick
    ax.bar([p + offset for p in x_pos], probs, width=width, label=f'T={T}')

ax.set_xlabel('Class')
ax.set_ylabel('Probability')
ax.set_title('Temperature Effect on Softmax\n(Higher T → softer distribution → more dark knowledge)')
ax.legend()
ax.set_xticks(x_pos)
plt.tight_layout()
plt.show()

## 6. DeBERTa: Disentangled Attention

DeBERTa's key innovation: separate **content** and **position** information in attention. Instead of adding position embeddings to token embeddings, DeBERTa computes attention from three separate terms:

$$A_{ij} = \underbrace{H_i H_j^T}_{\text{content-to-content}} + \underbrace{H_i P_{j|i}^T}_{\text{content-to-position}} + \underbrace{P_{i|j} H_j^T}_{\text{position-to-content}}$$

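where $H_i$ is the content vector of token $i$ and $P_{i|j}$ is the relative position embedding of token $i$ with respect to token $j$. A fourth term, position-to-position, is possible, but DeBERTa drops it because it adds little information when positions are relative.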
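
The subscripts $j|i$ and $i|j$ are resolved through a clipped relative distance. Following the DeBERTa paper, $\delta(i, j)$ maps the offset $i - j$ into a bucket in $[0, 2k)$, where $k$ is the maximum relative distance. Below is a minimal sketch of that index; `relative_position_index` is a hypothetical helper name.

```python
import torch

def relative_position_index(seq_len, k):
    """delta(i, j) = clamp(i - j + k, 0, 2k - 1), as a (seq_len, seq_len) LongTensor."""
    pos = torch.arange(seq_len)
    return (pos[:, None] - pos[None, :] + k).clamp(0, 2 * k - 1)

delta = relative_position_index(seq_len=5, k=2)
print(delta)
```

Every diagonal of the matrix holds a single value, so a given bucket always means the same offset between query and key, regardless of where the pair sits in the sequence. Row `delta[i, j]` of the relative embedding table is what $P_{j|i}$ refers to.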

In [None]:
def disentangled_attention(x, rel_pos_emb, W_Qc, W_Kc, W_Qp, W_Kp, n_heads):
    """DeBERTa-style disentangled attention.

    Computes three attention components:
    1. Content-to-content (standard)
    2. Content-to-position (what position am I attending to?)
    3. Position-to-content (what content is at this relative position?)

    rel_pos_emb has 2k rows, one per clipped relative distance in [-k, k).
    """
    batch, seq_len, d_model = x.shape
    d_k = d_model // n_heads
    k = rel_pos_emb.shape[0] // 2

    # Content projections
    Q_c = (x @ W_Qc).view(batch, seq_len, n_heads, d_k).transpose(1, 2)
    K_c = (x @ W_Kc).view(batch, seq_len, n_heads, d_k).transpose(1, 2)

    # Position projections (shared across batch): (1, n_heads, 2k, d_k)
    Q_p = (rel_pos_emb @ W_Qp).view(1, -1, n_heads, d_k).transpose(1, 2)
    K_p = (rel_pos_emb @ W_Kp).view(1, -1, n_heads, d_k).transpose(1, 2)

    # Relative position index: delta[i, j] = clamp(i - j + k, 0, 2k - 1)
    pos = torch.arange(seq_len, device=x.device)
    delta = (pos[:, None] - pos[None, :] + k).clamp(0, 2 * k - 1)
    idx = delta.expand(batch, n_heads, seq_len, seq_len)

    # 1. Content-to-content: standard attention
    c2c = torch.matmul(Q_c, K_c.transpose(-2, -1))

    # 2. Content-to-position: c2p[i, j] = Q_c[i] · K_p[delta(i, j)]
    c2p = torch.matmul(Q_c, K_p.transpose(-2, -1)).gather(-1, idx)

    # 3. Position-to-content: p2c[i, j] = K_c[j] · Q_p[delta(j, i)]
    p2c = torch.matmul(K_c, Q_p.transpose(-2, -1)).gather(-1, idx).transpose(-2, -1)

    # Combine all three
    scores = (c2c + c2p + p2c) / math.sqrt(3 * d_k)

    return scores, (c2c, c2p, p2c)

# Test
torch.manual_seed(42)
max_rel_pos = 20
rel_pos_emb = torch.randn(max_rel_pos, d_model, device=device) * 0.02

W_Qc = torch.randn(d_model, d_model, device=device) * 0.02
W_Kc = torch.randn(d_model, d_model, device=device) * 0.02
W_Qp = torch.randn(d_model, d_model, device=device) * 0.02
W_Kp = torch.randn(d_model, d_model, device=device) * 0.02

x_test = torch.randn(1, 8, d_model, device=device)
scores, (c2c, c2p, p2c) = disentangled_attention(
    x_test, rel_pos_emb, W_Qc, W_Kc, W_Qp, W_Kp, n_heads
)

print(f'Disentangled attention scores: {scores.shape}')
print(f'\nComponent contributions (mean absolute value):')
print(f'  Content-to-content: {c2c.abs().mean().item():.4f}')
print(f'  Content-to-position: {c2p.abs().mean().item():.4f}')
print(f'  Position-to-content: {p2c.abs().mean().item():.4f}')

In [None]:
# Visualize the three attention components
fig, axes = plt.subplots(1, 4, figsize=(20, 4))

components = [
    ('Content→Content', c2c),
    ('Content→Position', c2p),
    ('Position→Content', p2c),
    ('Combined', c2c + c2p + p2c),
]

for ax, (title, comp) in zip(axes, components):
    im = ax.imshow(comp[0, 0].detach().cpu().numpy(), cmap='RdBu', aspect='auto')
    ax.set_title(title)
    ax.set_xlabel('Key pos')
    ax.set_ylabel('Query pos')
    plt.colorbar(im, ax=ax)

plt.suptitle('DeBERTa: Disentangled Attention Components (Head 0)', fontsize=13)
plt.tight_layout()
plt.show()

## 7. Comparison Summary

In [None]:
# Visualize attention patterns from BERT
fig, axes = plt.subplots(1, n_heads, figsize=(16, 4))
for h in range(n_heads):
    ax = axes[h]
    im = ax.imshow(attn_weights[0][0, h].detach().cpu().numpy(), cmap='Blues')
    ax.set_title(f'Head {h}')
    ax.set_xlabel('Key')
    ax.set_ylabel('Query')
plt.suptitle('BERT Attention Weights (Layer 0) — Bidirectional!', fontsize=13)
plt.tight_layout()
plt.show()

print('→ Notice: BERT attention is BIDIRECTIONAL — no causal mask!')
print('  Every token can attend to every other token (past AND future).')

In [None]:
print('=' * 85)
print('COMPARISON: BERT Family Models')
print('=' * 85)
print(f'{"Model":<12} {"Layers":<8} {"Key Innovation":<35} {"Params (base)":<15} {"Year"}')
print('-' * 85)
print(f'{"BERT":<12} {"12":<8} {"MLM + NSP, bidirectional":<35} {"110M":<15} {"2018"}')
print(f'{"RoBERTa":<12} {"12":<8} {"No NSP, dynamic masking, more data":<35} {"125M":<15} {"2019"}')
print(f'{"ALBERT":<12} {"12*":<8} {"Shared layers + factorized emb":<35} {"12M":<15} {"2019"}')
print(f'{"DistilBERT":<12} {"6":<8} {"Knowledge distillation from BERT":<35} {"66M":<15} {"2019"}')
print(f'{"DeBERTa":<12} {"12":<8} {"Disentangled content+position attn":<35} {"140M":<15} {"2020"}')
print('-' * 85)
print('* ALBERT uses 12 virtual layers but only 1 set of shared parameters')
print('=' * 85)

## Summary

In this notebook we implemented from scratch:

1. **BERT** — Bidirectional encoder with MLM + NSP pre-training, token/segment/position embeddings
2. **RoBERTa** — Same architecture, better training: no NSP, dynamic masking, more data
3. **ALBERT** — Factorized embeddings (V×d_emb + d_emb×d_model) + cross-layer parameter sharing
4. **DistilBERT** — Knowledge distillation: smaller student trained to match teacher's soft distributions
5. **DeBERTa** — Disentangled attention: separate content-to-content, content-to-position, and position-to-content scores

**Key insight:** BERT established that bidirectional pre-training on masked language modeling produces powerful general-purpose representations. Each variant optimized a different axis: training recipe (RoBERTa), parameter efficiency (ALBERT), inference speed (DistilBERT), or attention quality (DeBERTa).