# Transformer Encoder from Scratch

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adiel2012/deep-learning-abc/blob/main/transformer_encoder.ipynb)

This notebook builds the **encoder** side of the Transformer from ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) (Vaswani et al., 2017), using only low-level tensor operations.

We assemble every component step by step:

```
Input Tokens
    │
    ▼
Token Embedding + Positional Encoding
    │
    ▼
┌─────────────────────────────────┐
│  Encoder Layer  (×N)           │
│  ┌───────────────────────────┐ │
│  │ Multi-Head Self-Attention │ │
│  └─────────────┬─────────────┘ │
│          Add & LayerNorm       │
│  ┌─────────────┴─────────────┐ │
│  │ Feed-Forward Network      │ │
│  └─────────────┬─────────────┘ │
│          Add & LayerNorm       │
└────────────────┬────────────────┘
                 │
                 ▼
          Encoder Output
```

> **Prerequisites:** [attention_from_scratch.ipynb](https://colab.research.google.com/github/adiel2012/deep-learning-abc/blob/main/attention_from_scratch.ipynb) and [positional_encoding.ipynb](https://colab.research.google.com/github/adiel2012/deep-learning-abc/blob/main/positional_encoding.ipynb)

In [None]:
# Install dependencies (Colab already has torch, but this ensures compatibility)
!pip install torch matplotlib -q

In [None]:
import torch
import math
import matplotlib.pyplot as plt

torch.manual_seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 0. Mathematical Foundations

The encoder introduces two new building blocks on top of attention: **Layer Normalization** and the **Position-wise Feed-Forward Network**. It also uses **residual connections** to enable training deep stacks.

---

### 0.1 Residual Connections

A residual (skip) connection adds the input of a sub-layer directly to its output:

$$\text{output} = x + \text{SubLayer}(x)$$

**Why?** In deep networks, gradients can vanish as they flow backward through many layers. The skip connection provides a "gradient highway": the Jacobian of the output is $I + \frac{\partial\,\text{SubLayer}(x)}{\partial x}$, so even if the sub-layer's gradient is tiny, the identity term ($\frac{\partial x}{\partial x} = I$) passes the upstream gradient through unchanged.

This means the sub-layer only needs to learn the **residual** (the difference between the desired output and the input), which is typically easier than learning the full transformation.
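
The identity path can be checked with autograd. This is a minimal sketch, assuming a hypothetical sub-layer `tiny_sublayer` (a linear map with very small weights, not part of this notebook) that stands in for a layer whose gradient has nearly vanished:

```python
import torch

x = torch.randn(4, requires_grad=True)

# Hypothetical sub-layer with tiny weights (illustration only):
# gradients through it are on the order of 1e-4.
W = torch.full((4, 4), 1e-4)

def tiny_sublayer(v):
    return v @ W

# Gradient of sum(SubLayer(x)) w.r.t. x: tiny everywhere
(grad_plain,) = torch.autograd.grad(tiny_sublayer(x).sum(), x)

# Gradient of sum(x + SubLayer(x)) w.r.t. x: the identity term adds exactly 1
(grad_residual,) = torch.autograd.grad((x + tiny_sublayer(x)).sum(), x)

print("without skip:", grad_plain)
print("with skip:   ", grad_residual)
```

The second gradient equals the first plus one in every component, which is the identity term from the skip path.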

---

### 0.2 Layer Normalization

Layer Norm normalizes each token's feature vector independently:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where, for a single token vector $x \in \mathbb{R}^{d_{\text{model}}}$:

$$\mu = \frac{1}{d_{\text{model}}} \sum_{i=1}^{d_{\text{model}}} x_i \qquad \sigma^2 = \frac{1}{d_{\text{model}}} \sum_{i=1}^{d_{\text{model}}} (x_i - \mu)^2$$

- $\gamma, \beta \in \mathbb{R}^{d_{\text{model}}}$ are learnable scale and shift parameters
- $\epsilon$ is a small constant for numerical stability (typically $10^{-5}$)
- $\odot$ denotes element-wise multiplication

**Why not Batch Norm?** Batch Norm normalizes across the batch dimension, which doesn't work well for variable-length sequences and small batches. Layer Norm normalizes across the feature dimension of each individual token, making it independent of batch size and sequence length.
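
The batch-independence claim can be illustrated with a small experiment. This is a sketch under assumptions: `normalize_features` and `normalize_batch` are hypothetical helpers that apply only the normalization step (no $\gamma$, $\beta$, and no running statistics as a real Batch Norm layer would keep).

```python
import torch

torch.manual_seed(0)
d_model = 8
token = torch.randn(1, d_model)  # the token we care about

def normalize_features(x, eps=1e-5):
    """Layer-Norm style: statistics over the feature dimension of each row."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

def normalize_batch(x, eps=1e-5):
    """Batch-Norm style: statistics over the rows (batch) for each feature."""
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

# Same first token, different companions in the batch
batch_a = torch.cat([token, torch.randn(3, d_model)])
batch_b = torch.cat([token, torch.randn(3, d_model) * 5 + 2])

ln_a, ln_b = normalize_features(batch_a)[0], normalize_features(batch_b)[0]
bn_a, bn_b = normalize_batch(batch_a)[0], normalize_batch(batch_b)[0]

print("feature-wise result unchanged:", torch.allclose(ln_a, ln_b))
print("batch-wise result unchanged:  ", torch.allclose(bn_a, bn_b))
```

The feature-wise result for the token does not depend on what else is in the batch, while the batch-wise result does.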

**In the Transformer:** Layer Norm is applied **after** each sub-layer (post-norm), combined with the residual:

$$\text{output} = \text{LayerNorm}(x + \text{SubLayer}(x))$$

---

### 0.3 Position-wise Feed-Forward Network (FFN)

Each encoder layer has a two-layer MLP applied **independently** to each token:

$$\text{FFN}(x) = W_2 \cdot \text{ReLU}(W_1 x + b_1) + b_2$$

where:
- $W_1 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}$ — projects up to a wider hidden space
- $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}$ — projects back down
- Typically $d_{ff} = 4 \times d_{\text{model}}$ (expansion factor of 4)

**Why "position-wise"?** The same weights are applied to every position independently — there's no interaction between tokens. Attention handles inter-token mixing; the FFN handles per-token feature transformation.

**Why expand then compress?** The wider hidden layer gives the network more capacity to learn complex nonlinear transformations before compressing back to $d_{\text{model}}$.

---

### 0.4 The Full Encoder Layer

Putting it all together, one encoder layer computes:

$$z = \text{LayerNorm}\big(x + \text{MultiHeadAttention}(x)\big)$$
$$\text{output} = \text{LayerNorm}\big(z + \text{FFN}(z)\big)$$

The encoder stacks $N$ of these layers (the paper uses $N=6$). Each layer refines the representations: attention captures token interactions, the FFN transforms features, residuals preserve information, and layer norm stabilizes training.

---

### 0.5 Encoder Hyperparameters (from the paper)

| Parameter | Symbol | Base model value |
|-----------|--------|------------------|
| Model dimension | $d_{\text{model}}$ | 512 |
| Number of heads | $h$ | 8 |
| Key/value dimension per head | $d_k = d_v = d_{\text{model}}/h$ | 64 |
| FFN inner dimension | $d_{ff}$ | 2048 |
| Number of layers | $N$ | 6 |
| Dropout rate | $p$ | 0.1 |

We'll use smaller values in this notebook for clarity.

---

Now let's build each piece.

## 1. Helper Functions from Previous Notebooks

We reuse softmax and sinusoidal positional encoding here. Multi-head attention (which contains scaled dot-product attention) is redefined in Section 5.

In [None]:
def softmax(x):
    """Row-wise softmax from scratch."""
    x_max = x.max(dim=-1, keepdim=True).values
    exp_x = torch.exp(x - x_max)
    return exp_x / exp_x.sum(dim=-1, keepdim=True)


def sinusoidal_positional_encoding(max_len, d_model, device=None):
    """Sinusoidal positional encoding (see positional_encoding.ipynb)."""
    pe = torch.zeros(max_len, d_model, device=device)
    pos = torch.arange(0, max_len, device=device).unsqueeze(1)
    i = torch.arange(0, d_model, 2, device=device).float()
    div_term = torch.exp(i * -(math.log(10000.0) / d_model))
    angles = pos * div_term
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe


print("Helper functions loaded.")
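
As a quick sanity check, the two helpers can be compared against known values. The sketch below repeats the helper definitions (without the `device` argument) so it runs on its own; the test inputs are arbitrary choices, not values from the earlier notebooks.

```python
import math
import torch

def softmax(x):
    x_max = x.max(dim=-1, keepdim=True).values
    exp_x = torch.exp(x - x_max)
    return exp_x / exp_x.sum(dim=-1, keepdim=True)

def sinusoidal_positional_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(0, max_len).unsqueeze(1)
    i = torch.arange(0, d_model, 2).float()
    div_term = torch.exp(i * -(math.log(10000.0) / d_model))
    angles = pos * div_term
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Large logits would overflow exp() if the row max were not subtracted first.
logits = torch.tensor([[1000.0, 1001.0, 1002.0], [0.0, 0.0, 0.0]])
probs = softmax(logits)
softmax_matches = torch.allclose(probs, torch.softmax(logits, dim=-1))

# Position 0 has angle 0 everywhere, so its encoding is sin(0), cos(0), ...
pe = sinusoidal_positional_encoding(max_len=5, d_model=16)
first_row_ok = torch.allclose(pe[0], torch.tensor([0.0, 1.0] * 8))

print("softmax matches torch.softmax:", softmax_matches)
print("PE row 0 is [0, 1, 0, 1, ...]:", first_row_ok)
```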

## 2. Configuration

We use small dimensions to keep outputs readable.

In [None]:
# Model hyperparameters (small for demonstration)
d_model = 16     # embedding / model dimension
n_heads = 2      # number of attention heads
d_ff = 64        # feed-forward inner dimension (4 × d_model)
n_layers = 2     # number of encoder layers
vocab_size = 100  # vocabulary size
seq_len = 5      # sequence length

d_k = d_model // n_heads  # dimension per head

print(f"d_model={d_model}, n_heads={n_heads}, d_k={d_k}, d_ff={d_ff}, n_layers={n_layers}")

## 3. Input: Token Embedding + Positional Encoding

$$\text{input} = \sqrt{d_{\text{model}}} \cdot \text{Embed}(\text{tokens}) + PE$$

In [None]:
# Embedding matrix: (vocab_size, d_model) — raw parameter
W_embed = torch.randn(vocab_size, d_model, device=device) * 0.1

# Simulate tokens: ["The", "cat", "sat", "on", "mat"]
token_ids = torch.tensor([5, 12, 31, 7, 42], device=device)

# Look up embeddings (equivalent to nn.Embedding)
token_emb = W_embed[token_ids]  # (seq_len, d_model)

# Add positional encoding
PE = sinusoidal_positional_encoding(seq_len, d_model, device=device)
X = math.sqrt(d_model) * token_emb + PE

print("Token embeddings shape:", token_emb.shape)
print("Positional encoding shape:", PE.shape)
print("Encoder input shape:", X.shape)
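
The lookup `W_embed[token_ids]` is the same operation as an embedding layer, and also the same as multiplying one-hot vectors by the embedding matrix. A minimal sketch that checks both equivalences, using `torch.nn.functional` only for comparison:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model = 100, 16
W_embed = torch.randn(vocab_size, d_model) * 0.1
token_ids = torch.tensor([5, 12, 31, 7, 42])

# 1. Plain row indexing (what this notebook does)
by_index = W_embed[token_ids]

# 2. The functional form that nn.Embedding uses in its forward pass
by_functional = F.embedding(token_ids, W_embed)

# 3. One-hot rows times the matrix: (seq_len, vocab) @ (vocab, d_model)
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
by_matmul = one_hot @ W_embed

print(torch.equal(by_index, by_functional), torch.allclose(by_index, by_matmul))
```

Indexing is preferred in practice because it avoids materializing the `(seq_len, vocab_size)` one-hot matrix.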

## 4. Layer Normalization (from scratch)

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

In [None]:
def layer_norm(x, gamma, beta, eps=1e-5):
    """
    Layer normalization from scratch.
    
    x:     (seq_len, d_model)
    gamma: (d_model,) — learnable scale
    beta:  (d_model,) — learnable shift
    """
    # Compute mean and variance across the feature dimension (last dim)
    mean = x.mean(dim=-1, keepdim=True)          # (seq_len, 1)
    var = x.var(dim=-1, keepdim=True, unbiased=False)  # (seq_len, 1)
    
    # Normalize
    x_norm = (x - mean) / torch.sqrt(var + eps)  # (seq_len, d_model)
    
    # Scale and shift
    return gamma * x_norm + beta


# Initialize learnable parameters
gamma = torch.ones(d_model, device=device)   # scale (init to 1)
beta = torch.zeros(d_model, device=device)   # shift (init to 0)

# Test it
X_normed = layer_norm(X, gamma, beta)

print("Before LayerNorm:")
print(f"  mean per token: {X.mean(dim=-1)}")
print(f"  std per token:  {X.std(dim=-1, unbiased=False)}")
print("\nAfter LayerNorm:")
print(f"  mean per token: {X_normed.mean(dim=-1)}")
print(f"  std per token:  {X_normed.std(dim=-1, unbiased=False)}")
print("\nEach token now has mean≈0 and std≈1 (population std, matching the formula).")
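
To gain confidence in the from-scratch version, it can be compared with PyTorch's built-in `torch.nn.functional.layer_norm`. The sketch below repeats `layer_norm` so it runs on its own, and uses arbitrary non-trivial `gamma` and `beta` values chosen for the test:

```python
import torch
import torch.nn.functional as F

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

torch.manual_seed(0)
seq_len, d_model = 5, 16
x = torch.randn(seq_len, d_model) * 3 + 1
gamma = torch.rand(d_model) + 0.5   # non-trivial scale
beta = torch.randn(d_model)         # non-trivial shift

ours = layer_norm(x, gamma, beta)
reference = F.layer_norm(x, (d_model,), weight=gamma, bias=beta, eps=1e-5)

max_diff = (ours - reference).abs().max().item()
print(f"max abs difference vs F.layer_norm: {max_diff:.2e}")
```

The agreement depends on using the biased variance (`unbiased=False`), which is what the formula with $\frac{1}{d_{\text{model}}}$ prescribes.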

## 5. Multi-Head Self-Attention (recap)

Same as in `attention_from_scratch.ipynb` — the efficient batched version.

In [None]:
def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """
    Multi-head self-attention from raw ops.
    
    X:    (seq_len, d_model)
    W_Q, W_K, W_V, W_O: (d_model, d_model)
    
    Returns: output (seq_len, d_model), weights (n_heads, seq_len, seq_len)
    """
    seq_len, d_model = X.shape
    d_k = d_model // n_heads
    
    # Project
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    
    # Split into heads: (seq_len, d_model) -> (n_heads, seq_len, d_k)
    Q = Q.view(seq_len, n_heads, d_k).transpose(0, 1)
    K = K.view(seq_len, n_heads, d_k).transpose(0, 1)
    V = V.view(seq_len, n_heads, d_k).transpose(0, 1)
    
    # Scaled dot-product attention
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = softmax(scores)
    attn_out = weights @ V
    
    # Merge heads
    attn_out = attn_out.transpose(0, 1).contiguous().view(seq_len, d_model)
    
    # Output projection
    output = attn_out @ W_O
    
    return output, weights

print("multi_head_attention() defined.")
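
The `view`/`transpose` head split is easy to get wrong, so it helps to compare the batched computation against an explicit loop over heads. In this sketch `mha_batched` and `mha_loop` are hypothetical names, `mha_batched` mirrors the function above, and `torch.softmax` replaces the from-scratch `softmax` for brevity:

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_model, n_heads = 5, 16, 2
d_k = d_model // n_heads

X = torch.randn(seq_len, d_model)
W_Q, W_K, W_V, W_O = (torch.randn(d_model, d_model) * 0.1 for _ in range(4))

def mha_batched(X):
    # Same reshaping as multi_head_attention(): (seq_len, d_model) -> (n_heads, seq_len, d_k)
    Q = (X @ W_Q).view(seq_len, n_heads, d_k).transpose(0, 1)
    K = (X @ W_K).view(seq_len, n_heads, d_k).transpose(0, 1)
    V = (X @ W_V).view(seq_len, n_heads, d_k).transpose(0, 1)
    weights = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    out = (weights @ V).transpose(0, 1).contiguous().view(seq_len, d_model)
    return out @ W_O

def mha_loop(X):
    # Head h owns feature columns [h * d_k, (h + 1) * d_k) of Q, K and V
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = []
    for h in range(n_heads):
        cols = slice(h * d_k, (h + 1) * d_k)
        w = torch.softmax(Q[:, cols] @ K[:, cols].T / math.sqrt(d_k), dim=-1)
        heads.append(w @ V[:, cols])
    return torch.cat(heads, dim=-1) @ W_O

out_batched = mha_batched(X)
out_loop = mha_loop(X)
print("batched == loop:", torch.allclose(out_batched, out_loop, atol=1e-6))
```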

## 6. Position-wise Feed-Forward Network

$$\text{FFN}(x) = W_2 \cdot \text{ReLU}(W_1 x + b_1) + b_2$$

Applied independently to each token position.

In [None]:
def relu(x):
    """ReLU activation from scratch."""
    return torch.clamp(x, min=0)


def feed_forward(x, W1, b1, W2, b2):
    """
    Position-wise feed-forward network.
    
    x:  (seq_len, d_model)
    W1: (d_model, d_ff)
    b1: (d_ff,)
    W2: (d_ff, d_model)
    b2: (d_model,)
    
    Returns: (seq_len, d_model)
    """
    # Expand: d_model -> d_ff
    hidden = relu(x @ W1 + b1)   # (seq_len, d_ff)
    
    # Compress: d_ff -> d_model
    output = hidden @ W2 + b2    # (seq_len, d_model)
    
    return output


# Test it
W1_test = torch.randn(d_model, d_ff, device=device) * 0.1
b1_test = torch.zeros(d_ff, device=device)
W2_test = torch.randn(d_ff, d_model, device=device) * 0.1
b2_test = torch.zeros(d_model, device=device)

ffn_out = feed_forward(X, W1_test, b1_test, W2_test, b2_test)
print(f"FFN input shape:  {X.shape}  (seq_len, d_model)")
print(f"FFN output shape: {ffn_out.shape}  (seq_len, d_model)")
print(f"\nInternally: {d_model} -> {d_ff} -> {d_model}")
print("Each token is transformed independently (same weights, no cross-token interaction).")
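
Two properties are worth verifying here. First, "position-wise" means the output for a token does not depend on the other tokens. Second, the code uses row vectors (`x @ W1`) while the formula uses column vectors ($W_1 x$), so the code's `W1` is the transpose of the formula's $W_1$. A minimal self-contained sketch with random test weights:

```python
import torch

torch.manual_seed(0)
seq_len, d_model, d_ff = 5, 16, 64
X = torch.randn(seq_len, d_model)
W1 = torch.randn(d_model, d_ff) * 0.1
b1 = torch.zeros(d_ff)
W2 = torch.randn(d_ff, d_model) * 0.1
b2 = torch.zeros(d_model)

def feed_forward(x, W1, b1, W2, b2):
    return torch.clamp(x @ W1 + b1, min=0) @ W2 + b2

# Whole sequence at once vs one token at a time
full = feed_forward(X, W1, b1, W2, b2)
per_token = torch.stack([feed_forward(X[t], W1, b1, W2, b2) for t in range(seq_len)])

# Column-vector form of the formula, with W_1 = W1.T and W_2 = W2.T
x0 = X[0]
column_form = W2.T @ torch.clamp(W1.T @ x0 + b1, min=0) + b2

# Shuffling the positions shuffles the outputs in exactly the same way
perm = torch.tensor([4, 2, 0, 3, 1])
permuted = feed_forward(X[perm], W1, b1, W2, b2)

print(torch.allclose(full, per_token, atol=1e-6))
print(torch.allclose(column_form, full[0], atol=1e-6))
print(torch.allclose(permuted, full[perm], atol=1e-6))
```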

## 7. Single Encoder Layer

One encoder layer = Multi-Head Attention + Add & Norm + FFN + Add & Norm.

$$z = \text{LayerNorm}(x + \text{MHA}(x))$$
$$\text{output} = \text{LayerNorm}(z + \text{FFN}(z))$$

In [None]:
def init_encoder_layer(d_model, n_heads, d_ff, device):
    """Initialize all parameters for one encoder layer."""
    scale = 0.1  # small init for stability
    params = {
        # Multi-head attention weights
        'W_Q': torch.randn(d_model, d_model, device=device) * scale,
        'W_K': torch.randn(d_model, d_model, device=device) * scale,
        'W_V': torch.randn(d_model, d_model, device=device) * scale,
        'W_O': torch.randn(d_model, d_model, device=device) * scale,
        # LayerNorm 1 (after attention)
        'gamma1': torch.ones(d_model, device=device),
        'beta1': torch.zeros(d_model, device=device),
        # Feed-forward weights
        'W1': torch.randn(d_model, d_ff, device=device) * scale,
        'b1': torch.zeros(d_ff, device=device),
        'W2': torch.randn(d_ff, d_model, device=device) * scale,
        'b2': torch.zeros(d_model, device=device),
        # LayerNorm 2 (after FFN)
        'gamma2': torch.ones(d_model, device=device),
        'beta2': torch.zeros(d_model, device=device),
    }
    return params


def encoder_layer(x, params, n_heads):
    """
    One Transformer encoder layer.
    
    x: (seq_len, d_model)
    Returns: (seq_len, d_model), attention_weights (n_heads, seq_len, seq_len)
    """
    # --- Sub-layer 1: Multi-Head Self-Attention ---
    attn_out, attn_weights = multi_head_attention(
        x, params['W_Q'], params['W_K'], params['W_V'], params['W_O'], n_heads
    )
    # Residual connection + LayerNorm
    z = layer_norm(x + attn_out, params['gamma1'], params['beta1'])
    
    # --- Sub-layer 2: Feed-Forward Network ---
    ffn_out = feed_forward(z, params['W1'], params['b1'], params['W2'], params['b2'])
    # Residual connection + LayerNorm
    output = layer_norm(z + ffn_out, params['gamma2'], params['beta2'])
    
    return output, attn_weights


# Test one layer
layer_params = init_encoder_layer(d_model, n_heads, d_ff, device)
out, weights = encoder_layer(X, layer_params, n_heads)

print(f"Input shape:  {X.shape}")
print(f"Output shape: {out.shape}  (same as input; the residual additions require matching shapes)")
print(f"Attention weights shape: {weights.shape}  — (n_heads, seq_len, seq_len)")

### Data flow through the encoder layer

In [None]:
# Trace the data flow step by step
print("=== Encoder Layer Data Flow ===")
print(f"\n1. Input x:                          {X.shape}")

attn_out, attn_w = multi_head_attention(
    X, layer_params['W_Q'], layer_params['W_K'],
    layer_params['W_V'], layer_params['W_O'], n_heads
)
print(f"2. Multi-Head Attention output:       {attn_out.shape}")

residual1 = X + attn_out
print(f"3. After residual (x + MHA(x)):       {residual1.shape}")

z = layer_norm(residual1, layer_params['gamma1'], layer_params['beta1'])
print(f"4. After LayerNorm 1:                 {z.shape}")

ffn_out = feed_forward(z, layer_params['W1'], layer_params['b1'],
                       layer_params['W2'], layer_params['b2'])
print(f"5. FFN output:                        {ffn_out.shape}")
print(f"   (internally: {d_model} -> {d_ff} -> {d_model})")

residual2 = z + ffn_out
print(f"6. After residual (z + FFN(z)):        {residual2.shape}")

final = layer_norm(residual2, layer_params['gamma2'], layer_params['beta2'])
print(f"7. After LayerNorm 2 (final output):  {final.shape}")
print(f"\nShape is preserved at every step: {X.shape} in, {final.shape} out.")

## 8. Full Encoder (N stacked layers)

The complete encoder is:
1. Token embedding + positional encoding
2. N identical encoder layers stacked sequentially

In [None]:
def init_encoder(vocab_size, d_model, n_heads, d_ff, n_layers, device):
    """Initialize all parameters for the full encoder."""
    params = {
        'W_embed': torch.randn(vocab_size, d_model, device=device) * 0.1,
        'layers': [init_encoder_layer(d_model, n_heads, d_ff, device)
                   for _ in range(n_layers)]
    }
    return params


def encoder_forward(token_ids, params, d_model, n_heads):
    """
    Full encoder forward pass.
    
    token_ids: (seq_len,) integer tensor
    Returns: (seq_len, d_model) encoder output, list of attention weights per layer
    """
    seq_len = token_ids.shape[0]
    
    # 1. Token embedding + positional encoding
    token_emb = params['W_embed'][token_ids]
    PE = sinusoidal_positional_encoding(seq_len, d_model, device=token_ids.device)
    x = math.sqrt(d_model) * token_emb + PE
    
    # 2. Pass through N encoder layers
    all_attn_weights = []
    for i, layer_params in enumerate(params['layers']):
        x, attn_weights = encoder_layer(x, layer_params, n_heads)
        all_attn_weights.append(attn_weights)
    
    return x, all_attn_weights


# Build and run the encoder
enc_params = init_encoder(vocab_size, d_model, n_heads, d_ff, n_layers, device)

token_ids = torch.tensor([5, 12, 31, 7, 42], device=device)
enc_output, all_weights = encoder_forward(token_ids, enc_params, d_model, n_heads)

print(f"Token IDs: {token_ids.tolist()}")
print(f"Encoder output shape: {enc_output.shape}")
print(f"Number of layers: {len(all_weights)}")
print(f"Attention weights per layer: {all_weights[0].shape}")

## 9. Parameter Count

Let's count exactly how many parameters our encoder has.

In [None]:
def count_params(enc_params):
    total = 0
    
    # Embedding
    emb_count = enc_params['W_embed'].numel()
    print(f"Embedding:       {emb_count:>8,}  ({list(enc_params['W_embed'].shape)})")
    total += emb_count
    
    # Each layer
    for i, lp in enumerate(enc_params['layers']):
        layer_total = 0
        details = []
        for name, param in lp.items():
            n = param.numel()
            layer_total += n
            details.append(f"{name}: {list(param.shape)}")
        print(f"Layer {i}:         {layer_total:>8,}  ({', '.join(details[:4])}...)")
        total += layer_total
    
    print(f"{'':─<50}")
    print(f"Total:           {total:>8,}")
    return total

total = count_params(enc_params)
print("\nFor comparison, the base Transformer's 6-layer encoder stack has ~19M parameters (excluding embeddings).")
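
The count can also be derived in closed form. The sketch below assumes the parameterization used in this notebook (no biases on the attention projections, two LayerNorms per layer); `encoder_param_count` is a hypothetical helper, and implementations that add attention biases come out slightly larger. The base-model figure excludes the embedding because it depends on the vocabulary size.

```python
def encoder_param_count(d_model, d_ff, n_layers, vocab_size=0):
    """Closed-form parameter count for this notebook's encoder layout."""
    attention = 4 * d_model * d_model            # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * d_ff + d_ff + d_model    # W1, W2, b1, b2
    layer_norms = 2 * 2 * d_model                # (gamma, beta) x 2 norms
    per_layer = attention + ffn + layer_norms
    return vocab_size * d_model + n_layers * per_layer

notebook_total = encoder_param_count(d_model=16, d_ff=64, n_layers=2, vocab_size=100)
base_stack = encoder_param_count(d_model=512, d_ff=2048, n_layers=6)

print(f"This notebook's encoder:        {notebook_total:,}")   # 8,032
print(f"Base-model encoder stack alone: {base_stack:,}")       # 18,902,016
```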

## 10. Visualizing Attention Across Layers

Each layer's attention pattern can capture different relationships.

In [None]:
tokens = ["The", "cat", "sat", "on", "mat"]

fig, axes = plt.subplots(n_layers, n_heads, figsize=(5 * n_heads, 4 * n_layers))
if n_layers == 1:
    axes = [axes]
if n_heads == 1:
    axes = [[ax] for ax in axes]

for layer_idx in range(n_layers):
    for head_idx in range(n_heads):
        ax = axes[layer_idx][head_idx]
        w = all_weights[layer_idx][head_idx].detach().cpu().numpy()
        
        im = ax.imshow(w, cmap='Blues', vmin=0, vmax=1)
        ax.set_xticks(range(len(tokens)))
        ax.set_yticks(range(len(tokens)))
        ax.set_xticklabels(tokens, fontsize=9)
        ax.set_yticklabels(tokens, fontsize=9)
        ax.set_title(f'Layer {layer_idx}, Head {head_idx}', fontsize=11)
        
        if head_idx == 0:
            ax.set_ylabel('Query')
        if layer_idx == n_layers - 1:
            ax.set_xlabel('Key')
        
        for row in range(len(tokens)):
            for col in range(len(tokens)):
                ax.text(col, row, f'{w[row, col]:.2f}',
                        ha='center', va='center', fontsize=8)

plt.suptitle('Encoder Attention Weights by Layer and Head', y=1.02, fontsize=14)
plt.tight_layout()
plt.show()

print("These weights are random and untrained, so every pattern is close to uniform.")
print("Layers and heads only specialize to different patterns during training.")
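
One way to quantify "close to uniform" is the entropy of each attention row, which reaches its maximum of $\log(\text{seq\_len})$ for a perfectly uniform row. The sketch below is an illustration under assumptions: a single head, random inputs, and a hypothetical helper `mean_row_entropy` with arbitrarily chosen weight scales. Larger weights only loosely stand in for the sharper patterns that training can produce.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_k = 5, 8
X = torch.randn(seq_len, d_k)

def mean_row_entropy(scale):
    """Mean entropy of attention rows for random projections of a given scale."""
    W_Q = torch.randn(d_k, d_k) * scale
    W_K = torch.randn(d_k, d_k) * scale
    scores = (X @ W_Q) @ (X @ W_K).T / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    row_entropy = -(weights * torch.log(weights.clamp_min(1e-12))).sum(dim=-1)
    return row_entropy.mean().item()

max_entropy = math.log(seq_len)          # entropy of a uniform row
entropy_small = mean_row_entropy(0.1)    # same init scale as this notebook
entropy_large = mean_row_entropy(3.0)    # much larger weights

print(f"uniform maximum: {max_entropy:.3f}")
print(f"scale 0.1:       {entropy_small:.3f}")
print(f"scale 3.0:       {entropy_large:.3f}")
```

With the small initialization the scores are all near zero, so softmax spreads the weight almost evenly across keys.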

## 11. How Representations Evolve Through Layers

Let's track how token representations change as they pass through each layer.

In [None]:
# Collect intermediate representations
representations = []

# Input embedding + PE
token_emb = enc_params['W_embed'][token_ids]
PE = sinusoidal_positional_encoding(seq_len, d_model, device=device)
x = math.sqrt(d_model) * token_emb + PE
representations.append(('Input (Emb+PE)', x.clone()))

# Each encoder layer
for i, lp in enumerate(enc_params['layers']):
    x, _ = encoder_layer(x, lp, n_heads)
    representations.append((f'After Layer {i}', x.clone()))

# Compute cosine similarity between tokens at each stage
fig, axes = plt.subplots(1, len(representations), figsize=(5 * len(representations), 4))

for idx, (name, rep) in enumerate(representations):
    rep_norm = rep / rep.norm(dim=-1, keepdim=True)
    sim = (rep_norm @ rep_norm.T).detach().cpu().numpy()
    
    ax = axes[idx]
    im = ax.imshow(sim, cmap='RdYlGn', vmin=-1, vmax=1)
    ax.set_xticks(range(len(tokens)))
    ax.set_yticks(range(len(tokens)))
    ax.set_xticklabels(tokens, fontsize=9)
    ax.set_yticklabels(tokens, fontsize=9)
    ax.set_title(name, fontsize=11)
    
    for row in range(len(tokens)):
        for col in range(len(tokens)):
            ax.text(col, row, f'{sim[row, col]:.2f}',
                    ha='center', va='center', fontsize=8)

plt.suptitle('Token Similarity at Each Stage', y=1.02, fontsize=14)
plt.tight_layout()
plt.show()

print("Each layer mixes information between tokens via attention,")
print("changing their pairwise similarities.")

## 12. The Residual Connection Effect

Let's see what happens without residual connections.

In [None]:
def encoder_layer_no_residual(x, params, n_heads):
    """Encoder layer WITHOUT residual connections."""
    attn_out, _ = multi_head_attention(
        x, params['W_Q'], params['W_K'], params['W_V'], params['W_O'], n_heads
    )
    z = layer_norm(attn_out, params['gamma1'], params['beta1'])  # no x +
    
    ffn_out = feed_forward(z, params['W1'], params['b1'], params['W2'], params['b2'])
    output = layer_norm(ffn_out, params['gamma2'], params['beta2'])  # no z +
    
    return output


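def mean_token_similarity(x):
    """Mean cosine similarity between distinct token positions."""
    x_unit = x / x.norm(dim=-1, keepdim=True)
    sim = x_unit @ x_unit.T
    n = x.shape[0]
    return ((sim.sum() - sim.diagonal().sum()) / (n * (n - 1))).item()


# Compare 10 layers with vs without residuals.
# Every layer ends in LayerNorm (gamma=1, beta=0), so the output norm is
# ~sqrt(seq_len * d_model) in BOTH cases and cannot separate them.
# We track how similar the tokens become to each other instead.
n_test_layers = 10
test_layers = [init_encoder_layer(d_model, n_heads, d_ff, device) for _ in range(n_test_layers)]

x_with = X.clone()
x_without = X.clone()
sims_with = [mean_token_similarity(x_with)]
sims_without = [mean_token_similarity(x_without)]

for lp in test_layers:
    x_with, _ = encoder_layer(x_with, lp, n_heads)
    x_without = encoder_layer_no_residual(x_without, lp, n_heads)
    sims_with.append(mean_token_similarity(x_with))
    sims_without.append(mean_token_similarity(x_without))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(sims_with, 'o-', label='With residual connections')
ax.plot(sims_without, 's--', label='Without residual connections')
ax.set_xlabel('Layer')
ax.set_ylabel('Mean pairwise cosine similarity')
ax.set_title('Token Distinctness: Residual vs No Residual')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"With residuals:    mean token similarity = {sims_with[-1]:.2f}")
print(f"Without residuals: mean token similarity = {sims_without[-1]:.2f}")
print("Values near 1 mean the tokens have become nearly indistinguishable.")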

## 13. Complete Architecture Diagram

Here's everything we built, mapped to the original paper:

```
                    "Attention Is All You Need" — Encoder
                    ════════════════════════════════════

  token_ids ──► W_embed[token_ids] ──► ×√d_model ──► + PE ──► x
                                                              │
         ┌────────────────────────────────────────────────────┐
         │  Encoder Layer (×N)                                │
         │                                                    │
         │  x ──┬──► MHA(x) ──► + ◄── x  ──► LayerNorm ──► z │
         │      │               ▲                             │
         │      │          (residual)                         │
         │                                                    │
         │  z ──┬──► FFN(z) ──► + ◄── z  ──► LayerNorm ──► out│
         │      │               ▲                             │
         │      │          (residual)                         │
         └────────────────────────────────────────────────────┘
                                                              │
                                                              ▼
                                                      encoder_output
```

Every box in this diagram corresponds to a function we implemented from raw tensor ops.

## 14. Summary

| Component | Function | Parameters | Purpose |
|-----------|----------|------------|---------|
| **Token Embedding** | `W_embed[ids]` | vocab × d_model | Convert token IDs to vectors |
| **Positional Encoding** | `sinusoidal_positional_encoding()` | 0 (fixed) | Inject position information |
| **Multi-Head Attention** | `multi_head_attention()` | 4 × d_model² | Token interaction (who attends to whom) |
| **Layer Normalization** | `layer_norm()` | 2 × d_model | Stabilize activations |
| **Feed-Forward Network** | `feed_forward()` | 2 × d_model × d_ff + biases | Per-token nonlinear transformation |
| **Residual Connection** | `x + sublayer(x)` | 0 | Gradient highway, information preservation |

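All implemented with only basic tensor operations: `@` (matmul), `+`, `*`, `/`, indexing, `torch.exp`, `torch.sin`, `torch.cos`, `torch.clamp`, `torch.sqrt`, reductions (`.max()`, `.sum()`, `.mean()`, `.var()`), `.view()`, and `.transpose()`.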

No `nn.Module`, no `nn.Linear`, no `nn.LayerNorm`, no `nn.TransformerEncoder`.