
# Transformer Fundamentals – Guided Notebook 06: Feed-Forward, Stacking & Generation Loop
**Date:** 2025-10-29  
**Style:** Guided, hands-on; from-scratch first, then frameworks; interactive visuals

## Learning Objectives

- Understand residual connections, layer normalization, and position-wise feed-forward networks.
- Stack layers; observe how representations evolve.
- Run a tiny generation loop (with causal masking) on a toy corpus and compare it with GPT-2's behavior.


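## TL;DR
Each transformer layer applies self-attention and then a position-wise MLP, each wrapped in a residual connection and preceded by layer normalization. Generation repeats the forward pass once per new token, with a causal mask keeping every position from seeing later ones.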


## Concept Overview
- Each layer uses the pre-LN ordering (as in GPT-2): `x = x + Attention(LN(x))`, then `x = x + MLP(LN(x))`.
- Causal masking prevents tokens from attending to the future in decoder-only models.
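
The causal-masking bullet can be checked numerically. The sketch below is a minimal single-head example, not the notebook's multi-head block: `causal_attention` and the random weights are assumptions made for illustration. It masks the scores above the diagonal, then perturbs only the last token and confirms that the outputs at earlier positions do not move.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_attention(x, W_q, W_k, W_v):
    """Single-head self-attention with a causal mask (illustrative only)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    # True above the diagonal marks "future" positions (key index > query index).
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)   # masked entries become exactly 0
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

out, weights = causal_attention(x, W_q, W_k, W_v)

# Perturb only the last token and recompute.
x_changed = x.copy()
x_changed[-1] += 1.0
out_changed, _ = causal_attention(x_changed, W_q, W_k, W_v)

print(np.round(weights, 2))  # lower-triangular attention weights
print("earlier positions unchanged:", np.allclose(out[:-1], out_changed[:-1]))
```

Each row of `weights` sums to 1 over the current and earlier positions only, which is what lets a decoder-only model be trained on all positions of a sequence in parallel.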


In [None]:

# %% [setup] Environment check & minimal installs (run once per kernel)
# Target: Python 3.12.12, PyTorch 2.5+, transformers 4.44+, datasets 3+, ipywidgets 8+, matplotlib 3.8+
import sys, platform, subprocess, os

print("Python:", sys.version)
print("Platform:", platform.platform())

# Optional: uncomment to install/upgrade on this machine (internet required)
# !pip install --upgrade pip
# !pip install "torch>=2.5" "transformers>=4.44" "datasets>=3.0.0" "ipywidgets>=8.1.0" "matplotlib>=3.8" "umap-learn>=0.5.6"

try:
    import torch
    print("Torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("CUDA device name:", torch.cuda.get_device_name(0))
except Exception as e:
    print("PyTorch not available yet:", e)

%config InlineBackend.figure_format = 'retina'
from IPython.display import display, HTML
try:
    import ipywidgets as widgets
    from ipywidgets import interact, interactive
    print("ipywidgets:", widgets.__version__)
except Exception as e:
    print("ipywidgets not available yet:", e)

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)


In [None]:

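# %% [utils] Small helpers used throughout
import numpy as np
import matplotlib.pyplot as plt  # needed by show_heatmap; also imported in the setup cell

def softmax(x, axis=-1):
    """Numerically stable softmax: subtract the max before exponentiating."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cosine_sim(a, b, eps=1e-9):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_norm = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_norm = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return np.dot(a_norm, b_norm.T)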

def show_heatmap(mat, xticklabels=None, yticklabels=None, title=""):
    plt.figure()
    plt.imshow(mat, aspect="auto")
    plt.colorbar()
    if xticklabels is not None: plt.xticks(range(len(xticklabels)), xticklabels, rotation=45, ha="right")
    if yticklabels is not None: plt.yticks(range(len(yticklabels)), yticklabels)
    plt.title(title)
    plt.tight_layout()
    plt.show()


In [None]:

# %% [from-scratch] Minimal transformer block (NumPy, illustrative – not optimized)
T, d_model, nheads = 8, 32, 4
d_ff = 64
d_k = d_model // nheads

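def layer_norm(x, eps=1e-5):
    # Normalizes each token vector over the feature axis.
    # The learnable gain and bias of a full LayerNorm are omitted for brevity.
    mu = x.mean(-1, keepdims=True)
    sigma2 = ((x - mu)**2).mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(sigma2 + eps)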

X = np.random.randn(T, d_model) * 0.1

# Attention params (one set per block; each head uses a d_k-wide slice of the projections)
W_Q = np.random.randn(d_model, d_model) * 0.1
W_K = np.random.randn(d_model, d_model) * 0.1
W_V = np.random.randn(d_model, d_model) * 0.1
W_O = np.random.randn(d_model, d_model) * 0.1

# FFN params
W1 = np.random.randn(d_model, d_ff) * 0.1
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * 0.1
b2 = np.zeros(d_model)

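def mha_block(x):
    T_x = x.shape[0]  # sequence length taken from the input rather than the global T
    def split_heads(M):
        # (T, d_model) -> (nheads, T, d_k)
        return M.reshape(T_x, nheads, d_k).transpose(1,0,2)
    Q = x @ W_Q; K = x @ W_K; V = x @ W_V
    Qh, Kh, Vh = map(split_heads, (Q,K,V))
    heads = []
    for h in range(nheads):
        scores = (Qh[h] @ Kh[h].T) / np.sqrt(d_k)
        # causal mask: subtract a large constant above the diagonal (a finite stand-in for -inf)
        mask = np.triu(np.ones_like(scores), k=1)*1e9
        scores = scores - mask
        attn = softmax(scores, -1)
        heads.append(attn @ Vh[h])
    # concatenate heads along the feature axis: (T, nheads, d_k) -> (T, d_model)
    H = np.stack(heads, axis=1).reshape(T_x, d_model)
    return H @ W_O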

# LayerNorm → MHA → Residual
y = layer_norm(X)
y = X + mha_block(y)
# LayerNorm → FFN → Residual
z = layer_norm(y)
ffn_out = np.maximum(0, z @ W1 + b1) @ W2 + b2   # position-wise FFN with ReLU
z = y + ffn_out                                  # residual connection
print("Block output shape:", z.shape)
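
To see how representations evolve when blocks are stacked, the single block above can be wrapped in a function and applied repeatedly. The following is a self-contained sketch under stated assumptions: `init_block`, `block_forward`, and `run_stack` are hypothetical helpers, each layer gets its own random (untrained) weights, and biases are left out. With random weights the numbers only illustrate the mechanics, not what a trained model learns.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = ((x - mu) ** 2).mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def init_block(rng, d_model, d_ff, scale=0.1):
    """Hypothetical helper: one independent parameter set per layer (no biases)."""
    shapes = {
        "W_Q": (d_model, d_model), "W_K": (d_model, d_model),
        "W_V": (d_model, d_model), "W_O": (d_model, d_model),
        "W1": (d_model, d_ff), "W2": (d_ff, d_model),
    }
    return {name: rng.normal(size=shape) * scale for name, shape in shapes.items()}

def block_forward(x, p, nheads):
    """Pre-LN block: x + MHA(LN(x)), then x + FFN(LN(x)), with a causal mask."""
    T, d_model = x.shape
    d_k = d_model // nheads

    def split(M):
        return M.reshape(T, nheads, d_k).transpose(1, 0, 2)   # (nheads, T, d_k)

    h = layer_norm(x)
    Q, K, V = split(h @ p["W_Q"]), split(h @ p["W_K"]), split(h @ p["W_V"])
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)          # (nheads, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)                   # broadcast over heads
    heads = (softmax(scores, axis=-1) @ V).transpose(1, 0, 2).reshape(T, d_model)
    x = x + heads @ p["W_O"]                                  # attention residual
    h = layer_norm(x)
    return x + np.maximum(0, h @ p["W1"]) @ p["W2"]           # FFN residual (ReLU)

def run_stack(x, blocks, nheads):
    """Return the input plus the hidden state after every block."""
    states = [x]
    for p in blocks:
        x = block_forward(x, p, nheads)
        states.append(x)
    return states

rng = np.random.default_rng(42)
T, d_model, d_ff, nheads, n_layers = 8, 32, 64, 4, 6
x0 = rng.normal(size=(T, d_model)) * 0.1
blocks = [init_block(rng, d_model, d_ff) for _ in range(n_layers)]
hidden_states = run_stack(x0, blocks, nheads)

for i, h in enumerate(hidden_states):
    norms = np.linalg.norm(h, axis=-1)
    cos = np.sum(h * x0, axis=-1) / (norms * np.linalg.norm(x0, axis=-1) + 1e-9)
    print(f"layer {i}: mean token norm = {norms.mean():.3f}, mean cosine to input = {cos.mean():.3f}")
```

Because every block adds to its input rather than replacing it, the residual stream keeps a trace of the original embedding; tracking the cosine to `x0` per layer is one simple way to watch that trace fade or persist.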


In [None]:

# %% [framework] Tiny generation loop with GPT-2 (HF Transformers)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
mdl = AutoModelForCausalLM.from_pretrained("gpt2")
mdl.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
mdl.to(device)

prompt = "In a small village,"
ids = tok(prompt, return_tensors="pt").to(device)
with torch.no_grad():
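    gen = mdl.generate(
        **ids,
        max_length=40,                  # counts prompt tokens plus generated tokens
        do_sample=True, temperature=0.8, top_p=0.95,
        pad_token_id=tok.eos_token_id,  # GPT-2 defines no pad token; reusing EOS avoids a warning
    )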
print(tok.decode(gen[0], skip_special_tokens=True))
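
Conceptually, `generate` with sampling runs a loop: forward pass, take the logits for the next position, rescale by temperature, restrict to a top-p nucleus, sample one token, append, repeat. The sketch below imitates that loop on a toy corpus using only the standard library. It is an assumption-laden stand-in, not the Hugging Face implementation: `toy_logits` (a bigram count table) plays the role of the model, and `sample_top_p` and `generate` are hypothetical names.

```python
import math
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the log .".split()
vocab = sorted(set(corpus))

# Stand-in for a model: bigram counts turned into log-count "logits".
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def toy_logits(context):
    """Next-token logits given the context (only the last token matters here)."""
    counts = bigrams[context[-1]]
    return [math.log(counts[w] + 1e-3) for w in vocab]  # small constant smooths unseen pairs

def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=random):
    """Temperature scaling, then nucleus (top-p) filtering, then one sampled index."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose total mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]   # choices renormalizes the weights

def generate(prompt_tokens, max_new_tokens=10, seed=0, **sampling):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = toy_logits(tokens)                     # "forward pass" over everything so far
        next_id = sample_top_p(logits, rng=rng, **sampling)
        tokens.append(vocab[next_id])                   # append and repeat
    return tokens

print(" ".join(generate(["the"], max_new_tokens=10, temperature=0.8, top_p=0.95)))
```

Lower temperatures sharpen the distribution before the nucleus is chosen, so fewer tokens survive the top-p cut; try `temperature=0.1` versus `temperature=2.0` with the same seed to compare against the GPT-2 samples above.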



---
### Bonus: Multilingual Extension
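- Swap the tokenizer/model for a multilingual variant (e.g., `bert-base-multilingual-cased` or `xlm-roberta-base`). These are encoder-only masked language models, so they suit the tokenization and attention-map steps but not the causal generation loop.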
- Repeat a small slice of the notebook (tokenization, attention map) on non-English sentences and compare.



---
## Reflection & Next Steps
- What changed when you tweaked dimensions, temperatures, or prompts?
- Where did the attention concentrate, and did it match your intuition?
- Re-run the interactive widgets on your own text.
- Save a copy of the figures that best illustrate your understanding.
