
# Transformer Fundamentals – Guided Notebook 05: Weighted Sum (Attention Output & Multi-Head)
**Date:** 2025-10-29  
**Style:** Guided, hands-on; from-scratch first, then frameworks; interactive visuals

## Learning Objectives

- Compute attention outputs as weighted sums of V.
- Build multi-head attention by splitting heads and concatenating.
- Visualize per-head patterns; discuss why multiple heads help.


## TL;DR
Each attention output row is a weighted sum of the value vectors, with weights from a softmax over scaled query-key scores; multiple heads can learn different relational patterns in parallel.


## Concept Overview
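- Single-head: `Attn = softmax(QK^T / sqrt(d_k))` (softmax over each row, so every query's weights sum to 1), `Out = Attn V`.
- Multi-head: split `d_model` into `nheads` heads of size `d_k = d_model / nheads`, apply attention to each head in parallel, concatenate, then project with `W_O`.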
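
The single-head formula above can be checked on a tiny example. The following is a minimal sketch, assuming random `Q`, `K`, `V` of illustrative sizes; the helper name `single_head_attention` is hypothetical and not part of the notebook's utilities.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def single_head_attention(Q, K, V):
    # Hypothetical helper: scaled dot-product attention for one head
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # [T, T] query-key similarities
    attn = softmax(scores, axis=-1)   # each row is a distribution over keys
    return attn, attn @ V             # output rows are weighted sums of V rows

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 tokens, d_k = 4 (illustrative sizes)
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 2))

attn, out = single_head_attention(Q, K, V)
print("row sums:", attn.sum(axis=-1))
print("out shape:", out.shape)
```

Because each row of `attn` is non-negative and sums to 1, every output row is a convex combination of the rows of `V`, so it cannot leave the per-column range of `V`.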


In [None]:

# %% [setup] Environment check & minimal installs (run once per kernel)
# Target: Python 3.12.12, PyTorch 2.5+, transformers 4.44+, datasets 3+, ipywidgets 8+, matplotlib 3.8+
import sys, platform, subprocess, os

print("Python:", sys.version)
print("Platform:", platform.platform())

# Optional: uncomment to install/upgrade on this machine (internet required)
# !pip install --upgrade pip
# !pip install "torch>=2.5" "transformers>=4.44" "datasets>=3.0.0" "ipywidgets>=8.1.0" "matplotlib>=3.8" "umap-learn>=0.5.6"

try:
    import torch
    print("Torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("CUDA device name:", torch.cuda.get_device_name(0))
except Exception as e:
    print("PyTorch not available yet:", e)

%config InlineBackend.figure_format = 'retina'
from IPython.display import display, HTML
try:
    import ipywidgets as widgets
    from ipywidgets import interact, interactive
    print("ipywidgets:", widgets.__version__)
except Exception as e:
    print("ipywidgets not available yet:", e)

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)


In [None]:

# %% [utils] Small helpers used throughout
import numpy as np
import matplotlib.pyplot as plt  # used by show_heatmap

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cosine_sim(a, b, eps=1e-9):
    a_norm = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_norm = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return np.dot(a_norm, b_norm.T)

def show_heatmap(mat, xticklabels=None, yticklabels=None, title=""):
    plt.figure()
    plt.imshow(mat, aspect="auto")
    plt.colorbar()
    if xticklabels is not None: plt.xticks(range(len(xticklabels)), xticklabels, rotation=45, ha="right")
    if yticklabels is not None: plt.yticks(range(len(yticklabels)), yticklabels)
    plt.title(title)
    plt.tight_layout()
    plt.show()


In [None]:

# %% [from-scratch] Multi-head attention (NumPy, minimal)
T, d_model, nheads = 8, 32, 4
d_k = d_model // nheads

X = np.random.randn(T, d_model) * 0.2
W_Q = np.random.randn(d_model, d_model) * 0.1
W_K = np.random.randn(d_model, d_model) * 0.1
W_V = np.random.randn(d_model, d_model) * 0.1
W_O = np.random.randn(d_model, d_model) * 0.1

Q = X @ W_Q
K = X @ W_K
V = X @ W_V

def split_heads(M):
    return M.reshape(T, nheads, d_k).transpose(1,0,2)  # [heads, T, d_k]

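Qh, Kh, Vh = map(split_heads, (Q,K,V))
heads = []
for h in range(nheads):
    scores = (Qh[h] @ Kh[h].T) / np.sqrt(d_k)  # [T, T] scaled dot products
    attn = softmax(scores, -1)                 # each row sums to 1
    heads.append(attn @ Vh[h])                 # [T, d_k] weighted sum of values
# Concatenate heads: [T, heads, d_k] -> [T, d_model], then apply the output projection
H = np.stack(heads, axis=1).reshape(T, d_model)
Out = H @ W_O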

print("Out shape:", Out.shape)
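
The per-head loop can also be written as batched matrix products over a leading head axis. The sketch below is one possible vectorized form, assuming the same shapes and random initialization scheme as the cell above; `mha_loop` and `mha_vectorized` are hypothetical names introduced only for this comparison.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def split_heads(M, nheads):
    # [T, d_model] -> [heads, T, d_k]
    T, d_model = M.shape
    return M.reshape(T, nheads, d_model // nheads).transpose(1, 0, 2)

def mha_loop(X, W_Q, W_K, W_V, W_O, nheads):
    # Reference version: one attention computation per head
    T, d_model = X.shape
    d_k = d_model // nheads
    Qh, Kh, Vh = (split_heads(X @ W, nheads) for W in (W_Q, W_K, W_V))
    heads = []
    for h in range(nheads):
        attn = softmax(Qh[h] @ Kh[h].T / np.sqrt(d_k), axis=-1)
        heads.append(attn @ Vh[h])
    return np.stack(heads, axis=1).reshape(T, d_model) @ W_O

def mha_vectorized(X, W_Q, W_K, W_V, W_O, nheads):
    # Batched version: matmul broadcasts over the leading head axis
    T, d_model = X.shape
    d_k = d_model // nheads
    Qh, Kh, Vh = (split_heads(X @ W, nheads) for W in (W_Q, W_K, W_V))
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)  # [heads, T, T]
    attn = softmax(scores, axis=-1)
    Hh = attn @ Vh                                       # [heads, T, d_k]
    H = Hh.transpose(1, 0, 2).reshape(T, d_model)        # concatenate heads
    return H @ W_O, attn

rng = np.random.default_rng(42)
T, d_model, nheads = 8, 32, 4
X = rng.normal(size=(T, d_model)) * 0.2
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out_loop = mha_loop(X, W_Q, W_K, W_V, W_O, nheads)
out_vec, attn_all = mha_vectorized(X, W_Q, W_K, W_V, W_O, nheads)
print("Out shape:", out_vec.shape)
print("max abs difference:", np.abs(out_loop - out_vec).max())
```

The two versions should agree up to floating-point error; the batched form also returns all per-head attention maps in a single `[heads, T, T]` array, which is convenient for plotting.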


In [None]:

# %% [visualize] Per-head attention heatmaps (PyTorch)
import torch, torch.nn as nn

torch.manual_seed(0)
T, d_model, nheads = 10, 64, 4
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=nheads, batch_first=True)
x = torch.randn(1, T, d_model)
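# average_attn_weights=False keeps one map per head; the default (True) averages over heads and returns [1, T, T]
out, weights = mha(x, x, x, need_weights=True, average_attn_weights=False)  # weights: [1, heads, T, T]
weights = weights[0].detach().numpy()  # [heads, T, T]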

for h in range(nheads):
    show_heatmap(weights[h], title=f"Head {h} Attention")
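
Per-head heatmaps need per-head weights, and `nn.MultiheadAttention` averages the weights over heads unless told otherwise. The sketch below compares the two settings of `average_attn_weights`, assuming the same sizes as the cell above and the module's default dropout of 0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, d_model, nheads = 10, 64, 4
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=nheads, batch_first=True)
x = torch.randn(1, T, d_model)

with torch.no_grad():
    # Default: weights are averaged across heads
    _, w_avg = mha(x, x, x, need_weights=True)
    # Per-head weights: one [T, T] map for each head
    _, w_heads = mha(x, x, x, need_weights=True, average_attn_weights=False)

print(w_avg.shape)    # torch.Size([1, 10, 10])
print(w_heads.shape)  # torch.Size([1, 4, 10, 10])

# Averaging the per-head maps over the head axis recovers the default output
print(torch.allclose(w_heads.mean(dim=1), w_avg, atol=1e-6))
```

With the averaged output, indexing `weights[0][h]` yields a single row of the `[T, T]` map rather than head `h`, so the per-head setting is the one to use for plots like those above.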



---
### Bonus: Multilingual Extension
- Swap the tokenizer/model for a multilingual variant (e.g., `bert-base-multilingual-cased` or `xlm-roberta-base`).
- Repeat a small slice of the notebook (tokenization, attention map) on non-English sentences and compare.



---
## Reflection & Next Steps
- What changed when you tweaked `d_model`, `nheads`, the sequence length `T`, or the random seed?
- Where did the attention concentrate, and did it match your intuition?
- Re-run the heatmap cell with your own inputs (for example, a different `T` or seed) and compare the per-head patterns.
- Save a copy of the figures that best illustrate your understanding.
