<a href="https://colab.research.google.com/github/besimorhino/ai-workshop/blob/main/transformer_embeddings_low_level.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers for Embeddings — A Low-Level, End-to-End Walkthrough

Warning: This notebook is not for the faint of heart. It is designed to be **maximally instructive**, exposing every moving part in a transformer encoder used to produce **sentence embeddings**. It prioritizes **clarity and low-level details** over brevity or performance. Where helpful, we switch between **NumPy-only** (pure mechanics) and **PyTorch** (autograd + training) implementations.

**What you'll build & see:**
1. A tiny tokenizer & vocabulary; numericalization.
2. Token embeddings and two positional encoding strategies (sinusoidal vs learned).
3. Scaled dot-product attention step-by-step (explicit matrices, masking, softmax, weighted sums).
4. Multi-head attention: head splitting/merging, per-head attention, and concatenation.
5. LayerNorm and residual connections from scratch.
6. Position-wise feed-forward network (FFN) with GELU.
7. A full **TransformerEncoderLayer** (NumPy) and then a stack to form a **TransformerEncoder**.
8. Producing **sentence embeddings** via [CLS] and mean pooling; cosine similarity search.
9. (Optional) A **tiny PyTorch** encoder trained for a few iterations to illustrate learning dynamics and gradients.

> ✨ Tip: Run cells in order. Wherever a section is marked **STEP-BY-STEP**, the cell prints intermediate tensors to demystify the math.



## Background & Notation

We focus on **encoder-only** transformers for embeddings. Let a tokenized sequence have length \(L\). The model dimension is \(d_{\text{model}}\) and there are \(h\) attention heads; the per-head key/query dimension is \(d_k\) and the per-head value dimension is \(d_v\) (both usually \(d_{\text{model}}/h\)).

**Scaled Dot-Product Attention:**
Given \(Q \in \mathbb{R}^{L \times d_k}\), \(K \in \mathbb{R}^{L \times d_k}\), \(V \in \mathbb{R}^{L \times d_v}\),
\[
\text{Attention}(Q,K,V) = \text{Softmax}\!\Big(\frac{QK^\top}{\sqrt{d_k}} + M\Big)V
\]
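where \(M\) is an additive mask: \(0\) at allowed positions and \(-\infty\) at disallowed ones (all zeros when nothing is masked). In code, a large negative value such as `-1e9` stands in for \(-\infty\).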
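
To make the formula concrete, here is a minimal NumPy sketch of masked attention on toy data. The sizes and the names `L_demo` and `d_k_demo` are illustrative assumptions, not values used elsewhere in this notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
L_demo, d_k_demo = 3, 4                  # hypothetical toy sizes
Q = rng.normal(size=(L_demo, d_k_demo))
K = rng.normal(size=(L_demo, d_k_demo))
V = rng.normal(size=(L_demo, d_k_demo))

# Additive mask M: 0 where attention is allowed, -inf where it is not.
# Here every query is forbidden from attending to the last key.
M = np.zeros((L_demo, L_demo))
M[:, -1] = -np.inf

scores = Q @ K.T / np.sqrt(d_k_demo) + M   # (L, L)
weights = softmax(scores, axis=-1)         # each row sums to 1
out = weights @ V                          # (L, d_k)

print(weights.round(3))
```

The last column of `weights` is exactly zero, and the remaining probability mass in each row is shared among the allowed keys.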

**Multi-Head Attention (MHA):**
Project input \(X \in \mathbb{R}^{L \times d_{\text{model}}}\) into per-head queries/keys/values, apply attention per head, then concatenate and re-project:
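\[
\text{MHA}(X) = \big[\text{head}_1 \| \cdots \| \text{head}_h\big]W_O, \qquad
\text{head}_i = \text{Attention}\big(XW_Q^{(i)},\, XW_K^{(i)},\, XW_V^{(i)}\big)
\]
where \(W_Q^{(i)}, W_K^{(i)} \in \mathbb{R}^{d_{\text{model}} \times d_k}\) and \(W_V^{(i)} \in \mathbb{R}^{d_{\text{model}} \times d_v}\) are the projections for head \(i\), and \(W_O \in \mathbb{R}^{h d_v \times d_{\text{model}}}\).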

**Add & Norm (Pre-Norm style here):**
We will use *pre-norm* blocks: \(Y = X + \text{SubLayer}(\text{LayerNorm}(X))\).
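
For contrast, the sketch below places pre-norm next to post-norm. The `sublayer` here is a hypothetical stand-in (a single `tanh` of a linear map), not the attention or FFN used later, and `layer_norm` omits the learnable scale and shift for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
W = rng.normal(0, 0.5, size=(8, 8))   # hypothetical stand-in sublayer weights

def sublayer(x):
    return np.tanh(x @ W)

X = rng.normal(size=(5, 8)) * 3.0

Y_pre = X + sublayer(layer_norm(X))    # pre-norm: the residual path carries X through unchanged
Y_post = layer_norm(X + sublayer(X))   # post-norm: the sum itself is normalized

print("pre-norm  row means:", Y_pre.mean(axis=-1).round(3))
print("post-norm row means:", Y_post.mean(axis=-1).round(3))
```

In the pre-norm form the input passes along the residual path untouched, which is one commonly cited reason it tends to be easier to train in deep stacks.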

**FFN:**
\[
\text{FFN}(x) = W_2\,\text{GELU}(W_1 x + b_1) + b_2
\]

**Embeddings for sentences:**
- **[CLS]**: take the embedding corresponding to the special classification token at position 0.
- **Mean pooling**: average the token embeddings (optionally excluding padding and special tokens), then \(\ell_2\)-normalize.


In [None]:
# "OPTIONAL" CELL!
# in Google colab you _should_ be able to skip this cell.
# this step is included for those who wish to run the workbook locally.
!pip install torch

In [None]:

# Core imports
import math, random, string, itertools, functools, types
from collections import Counter, defaultdict
import numpy as np
import matplotlib.pyplot as plt

# Try torch for the optional training section
try:
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    TORCH_AVAILABLE = True
except Exception as e:
    TORCH_AVAILABLE = False

np.set_printoptions(precision=4, suppress=True)
random.seed(7)
np.random.seed(7)

print("NumPy version:", np.__version__)
print("PyTorch available:", TORCH_AVAILABLE)


Note: if you got a message about torch not being available, go back and run the `pip install` cell earlier in the notebook, then re-run the imports cell.


## 1) Tiny Tokenizer & Vocabulary

We'll implement a super-simple tokenizer:
- Lowercase text
- Split on spaces and punctuation (keeping punctuation as separate tokens)
- Add special tokens: `[PAD]`, `[UNK]`, `[BOS]`, `[EOS]`, `[CLS]`

> This is **not** a production tokenizer. It's intentionally simple to expose mechanics.
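
For comparison, the same lowercase-and-split idea can be sketched with one regular expression. This is an alternative under stated assumptions, not the tokenizer used in this notebook: `regex_tokenize` is a hypothetical name, and it treats any non-word, non-space character as punctuation rather than a fixed set.

```python
import re

def regex_tokenize(text):
    # \w+ grabs runs of letters/digits/underscore; [^\w\s] grabs any single
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(regex_tokenize("Attention lets each token attend to others."))
# ['attention', 'lets', 'each', 'token', 'attend', 'to', 'others', '.']
```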


In [None]:

SPECIALS = ["[PAD]", "[UNK]", "[BOS]", "[EOS]", "[CLS]"]
PAD, UNK, BOS, EOS, CLS = SPECIALS

def simple_tokenize(text):
    text = text.lower().strip()
    # Separate punctuation by spacing it out
    punct = set(list(".,!?;:()[]{}'\"-"))
    spaced = []
    for ch in text:
        if ch in punct:
            spaced.extend([" ", ch, " "])
        else:
            spaced.append(ch)
    text = "".join(spaced)
    # Collapse whitespace
    toks = [t for t in text.split() if t]
    return toks

corpus = [
    "Transformers map sequences to sequences using attention.",
    "We will build a tiny encoder to learn embeddings.",
    "Attention lets each token attend to others.",
    "Embeddings capture semantic content of sentences.",
    "Mean pooling and CLS pooling are common strategies.",
    "Cosine similarity compares sentence embeddings.",
]

# Build vocabulary from corpus
tok_counts = Counter()
for s in corpus:
    tok_counts.update(simple_tokenize(s))

vocab = SPECIALS + sorted(tok_counts.keys())
stoi = {t:i for i,t in enumerate(vocab)}
itos = {i:t for t,i in stoi.items()}

def encode(text, add_specials=True, max_len=None):
    toks = simple_tokenize(text)
    if add_specials:
        toks = [CLS, BOS] + toks + [EOS]
    ids = [stoi.get(t, stoi[UNK]) for t in toks]
    if max_len is not None:
        ids = ids[:max_len] + [stoi[PAD]] * max(0, max_len - len(ids))
    return np.array(ids, dtype=np.int64)

def decode(ids):
    toks = [itos.get(int(i), UNK) for i in ids]
    # remove PADs for readability
    toks = [t for t in toks if t != PAD]
    return " ".join(toks)

print("Vocab size:", len(vocab))
print("Sample tokens:", vocab[:25])

ex = encode("Transformers are amazing!", add_specials=True, max_len=16)
print("Encoded:", ex)
print("Decoded:", decode(ex))



## 2) Positional Encodings

We demonstrate two variants:

- **Sinusoidal** (deterministic, no parameters):
  \[ PE_{(pos,2i)} = \sin\big(pos / 10000^{2i/d_{\text{model}}}\big), \quad
     PE_{(pos,2i+1)} = \cos\big(pos / 10000^{2i/d_{\text{model}}}\big) \]

- **Learned positional embeddings**: a trainable matrix \(P \in \mathbb{R}^{L_\text{max} \times d_{\text{model}}}\).

We also define **token embeddings** \(E \in \mathbb{R}^{|V| \times d_{\text{model}}}\) and combine as \(X = E[\text{tokens}] + P[\text{positions}]\).
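
A few properties of the sinusoidal formula are easy to check numerically. The sketch below uses its own compact helper, `sinusoidal_pe` (a hypothetical name, assuming an even `d_model`), so it runs independently of the notebook cells.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]               # (L, 1)
    i = np.arange(0, d_model, 2)[None, :]           # (1, d/2): the even indices 2i
    angles = pos / np.power(10000.0, i / d_model)   # (L, d/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe_demo = sinusoidal_pe(max_len=50, d_model=16)

print(pe_demo[0])                        # position 0 alternates sin(0)=0 and cos(0)=1
print((pe_demo ** 2).sum(axis=1)[:3])    # every row has squared norm d_model/2 = 8
# The dot product of two encodings depends only on the offset between positions:
print(pe_demo[3] @ pe_demo[8], pe_demo[10] @ pe_demo[15])
```

The last line follows from \(\sin a \sin b + \cos a \cos b = \cos(a-b)\): each sin/cos pair contributes a term that depends only on the position difference.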


In [None]:

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]  # (L,1)
    i = np.arange(d_model)[None, :]    # (1,d)
    angle_rates = 1 / np.power(10000, (2*(i//2))/np.float32(d_model))
    angles = pos * angle_rates
    pe = np.zeros((max_len, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe  # (L,d)

class LearnedPositionalEncoding:
    def __init__(self, max_len, d_model, rng=np.random.default_rng(7)):
        self.table = rng.normal(0.0, 0.02, size=(max_len, d_model)).astype(np.float32)
    def __call__(self, positions):
        return self.table[positions]

# Token embedding matrix
def make_token_embedding(vocab_size, d_model, rng=np.random.default_rng(7)):
    return rng.normal(0.0, 0.02, size=(vocab_size, d_model)).astype(np.float32)

# Visualize sinusoidal patterns
max_len, d_model = 64, 32
pe = sinusoidal_positional_encoding(max_len, d_model)

plt.figure(figsize=(8, 3))
plt.imshow(pe[:64, :32])
plt.title("Sinusoidal Positional Encoding (first 64x32)")
plt.xlabel("d_model dim")
plt.ylabel("position")
plt.colorbar()
plt.show()

# Compose embeddings
vocab_size = len(vocab)
E = make_token_embedding(vocab_size, d_model)
positions = np.arange(16)
LP = LearnedPositionalEncoding(512, d_model)

sample_ids = encode("attention mechanisms are cool.", add_specials=True, max_len=16)
X_sinus = E[sample_ids] + pe[positions]
X_learn = E[sample_ids] + LP(positions)

print("X_sinus shape:", X_sinus.shape, "| X_learn shape:", X_learn.shape)



## 3) Scaled Dot-Product Attention — **STEP-BY-STEP**

We construct \(Q= XW_Q\), \(K= XW_K\), \(V = XW_V\) and compute attention explicitly. We'll use a small \(d_{\text{model}}\) to print matrices.

Masking option: we'll show a padding mask that prevents attending to PAD tokens.
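
The sketch below shows how a padding mask over keys can be built from token ids and broadcast across all queries. `PAD_ID` and the ids are hypothetical, and the all-zero `scores` matrix is a stand-in for \(QK^\top/\sqrt{d_k}\).

```python
import numpy as np

PAD_ID = 0                                          # hypothetical id for [PAD]
ids_demo = np.array([4, 7, 9, PAD_ID, PAD_ID])

is_pad = (ids_demo == PAD_ID)                       # (L,) True at padded positions
key_mask = np.where(is_pad, -1e9, 0.0)[None, :]     # (1, L): one entry per key

scores = np.zeros((len(ids_demo), len(ids_demo)))   # stand-in for QK^T / sqrt(d_k)
masked = scores + key_mask                          # broadcasts to (L, L)

e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights.round(3))
```

Every row puts weight 1/3 on each of the three real tokens and 0 on the padded keys. Rows belonging to padded queries still produce outputs; those rows are simply left out later when pooling.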


In [None]:
def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def make_qkv(X, d_model_in, d_k, d_v, rng=np.random.default_rng(7)):
    W_Q = rng.normal(0, 0.02, size=(d_model_in, d_k)).astype(np.float32)
    W_K = rng.normal(0, 0.02, size=(d_model_in, d_k)).astype(np.float32)
    W_V = rng.normal(0, 0.02, size=(d_model_in, d_v)).astype(np.float32)
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    return Q, K, V, W_Q, W_K, W_V

def attention(Q, K, V, mask=None):
    # Q: (L,d_k), K: (L,d_k), V: (L,d_v)
    scores = (Q @ K.T) / math.sqrt(Q.shape[-1])  # (L,L)
    if mask is not None:
        scores = scores + mask  # additive mask: 0 at valid positions, -inf (or a large negative) at masked ones
    weights = softmax(scores, axis=-1)           # (L,L)
    out = weights @ V                             # (L,d_v)
    return out, weights, scores

# Prepare a tiny example
L = 6; d_model = 32; num_heads = 4; d_k = d_model // num_heads; d_v = d_model // num_heads
ids = encode("Transformers map sequences to sequences using attention.", add_specials=True, max_len=L)
pe_slice = sinusoidal_positional_encoding(max_len=L, d_model=d_model)
X = (E[ids] + pe_slice)  # (L,d_model)

Q, K, V, WQ, WK, WV = make_qkv(X, d_model, d_k, d_v)

# Build a padding indicator: 1.0 at PAD positions, 0.0 elsewhere
pad_mask = (ids == stoi[PAD]).astype(np.float32)
# Convert to an additive mask over keys, shape (1, L); it broadcasts to (L, L)
# so every query row masks the same key positions.
# Note: this sentence is truncated to L tokens, so no PAD appears and the mask is all zeros here.
additive_mask = np.where(pad_mask[None, :]==1, -1e9, 0.0).astype(np.float32)

out, weights, scores = attention(Q, K, V, mask=additive_mask)

print("X shape:", X.shape)
print("Q/K/V shapes:", Q.shape, K.shape, V.shape)
print("Raw scores (pre-softmax):\n", np.round(scores, 4))
print("Attention weights (rows sum to 1):\n", np.round(weights, 4))
print("Output (weighted sums):\n", np.round(out, 4))

# Visualize attention weights
plt.figure(figsize=(4,4))
plt.imshow(weights)
plt.title("Attention Weights")
plt.xlabel("Key j")
plt.ylabel("Query i")
plt.colorbar()
plt.show()


## 4) Multi-Head Attention (MHA)

We split \(d_{\text{model}}\) into \(h\) heads of size \(d_k=d_v=d_{\text{model}}/h\). For each head \(i\), compute attention independently and then concatenate results. Finally apply an output projection \(W_O\).
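
Head splitting is only a reshape and a transpose, which a toy array makes visible. The sizes below are illustrative assumptions; integer entries are used so the column layout is easy to read.

```python
import numpy as np

L_demo, d_model_demo, h_demo = 3, 8, 2       # hypothetical toy sizes
d_k_demo = d_model_demo // h_demo
X = np.arange(L_demo * d_model_demo).reshape(L_demo, d_model_demo)

# Split: (L, d_model) -> (L, h, d_k) -> (h, L, d_k)
heads = X.reshape(L_demo, h_demo, d_k_demo).transpose(1, 0, 2)

# Merge: (h, L, d_k) -> (L, h, d_k) -> (L, d_model)
merged = heads.transpose(1, 0, 2).reshape(L_demo, h_demo * d_k_demo)

print(heads[1])   # head 1 holds columns 4..7 of X
```

Head \(i\) is the contiguous column block `X[:, i*d_k:(i+1)*d_k]`, and merging restores the original matrix exactly.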


In [None]:

class MultiHeadAttentionNumpy:
    def __init__(self, d_model, num_heads, rng=np.random.default_rng(7)):
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.h = num_heads
        self.d_k = d_model // num_heads
        self.d_v = self.d_k

        # Combine projections in single big matrices for convenience
        self.W_Q = rng.normal(0, 0.02, size=(d_model, d_model)).astype(np.float32)
        self.W_K = rng.normal(0, 0.02, size=(d_model, d_model)).astype(np.float32)
        self.W_V = rng.normal(0, 0.02, size=(d_model, d_model)).astype(np.float32)
        self.W_O = rng.normal(0, 0.02, size=(d_model, d_model)).astype(np.float32)

    def _split_heads(self, X):
        # X: (L, d_model) -> (h, L, d_k)
        L = X.shape[0]
        Xh = X.reshape(L, self.h, self.d_k).transpose(1,0,2)
        return Xh

    def _merge_heads(self, Xh):
        # Xh: (h, L, d_v) -> (L, d_model)
        h, L, dk = Xh.shape
        return Xh.transpose(1,0,2).reshape(L, h*dk)

    def __call__(self, X, additive_mask=None, return_weights=False):
        # Project
        Q = X @ self.W_Q  # (L,d_model)
        K = X @ self.W_K
        V = X @ self.W_V
        # Split heads
        Qh, Kh, Vh = self._split_heads(Q), self._split_heads(K), self._split_heads(V)
        outputs = []
        all_weights = []
        for i in range(self.h):
            out, w, _ = attention(Qh[i], Kh[i], Vh[i], mask=additive_mask)
            outputs.append(out[None, ...])   # (1,L,d_k)
            all_weights.append(w[None, ...]) # (1,L,L)
        H = np.concatenate(outputs, axis=0)      # (h,L,d_k)
        concat = self._merge_heads(H)            # (L,d_model)
        Y = concat @ self.W_O                    # (L,d_model)
        if return_weights:
            W = np.concatenate(all_weights, axis=0) # (h,L,L)
            return Y, W
        return Y

# Demo
mha = MultiHeadAttentionNumpy(d_model=32, num_heads=4)
ids = encode("embeddings capture semantic content of sentences.", add_specials=True, max_len=10)
X = E[ids] + pe[:len(ids)]
pad_mask = (ids == stoi[PAD]).astype(np.float32)
add_mask = np.where(pad_mask[None, :]==1, -1e9, 0.0).astype(np.float32)

Y, W = mha(X, additive_mask=add_mask, return_weights=True)
print("MHA output shape:", Y.shape, "| attn weights shape:", W.shape)

# Visualize head 0 weights
plt.figure(figsize=(4,4))
plt.imshow(W[0])
plt.title("Head 0 Attention Weights")
plt.xlabel("Key j")
plt.ylabel("Query i")
plt.colorbar()
plt.show()



## 5) LayerNorm & Residuals

We implement LayerNorm from scratch with epsilon for numerical stability. We'll use **pre-norm** blocks:
- \(Z = X + \text{SubLayer}(\text{LayerNorm}(X))\)
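
A quick numerical check of what LayerNorm does per row, and why the epsilon matters. This is a standalone sketch with a function-style `layer_norm` (a hypothetical helper, separate from the class defined in this section).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = ((x - mu) ** 2).mean(axis=-1, keepdims=True)   # biased (population) variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

d = 16
rng = np.random.default_rng(0)
X = rng.normal(5.0, 3.0, size=(4, d))
Y = layer_norm(X, np.ones(d), np.zeros(d))
print(Y.mean(axis=-1).round(6), Y.std(axis=-1).round(4))   # per-row mean ~0, std ~1

# A constant row has zero variance; eps keeps the division finite.
const = np.full((1, d), 2.5)
Y_const = layer_norm(const, np.ones(d), np.zeros(d))
print(Y_const)   # zeros rather than NaN
```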


In [None]:

class LayerNormNumpy:
    def __init__(self, d_model, eps=1e-5):
        self.gamma = np.ones((d_model,), dtype=np.float32)
        self.beta  = np.zeros((d_model,), dtype=np.float32)
        self.eps = eps
    def __call__(self, X):
        # X: (L, d_model)
        mu = X.mean(axis=-1, keepdims=True)
        var = ((X - mu)**2).mean(axis=-1, keepdims=True)
        Xhat = (X - mu) / np.sqrt(var + self.eps)
        return self.gamma * Xhat + self.beta

# Quick test
X = np.random.randn(5, 32).astype(np.float32)
ln = LayerNormNumpy(32)
Y = ln(X)
print("LayerNorm ok, mean~", Y.mean(), "std~", Y.std())



## 6) Position-wise Feed-Forward Network (FFN)

Two linear layers with a nonlinearity (GELU). Implemented from scratch with NumPy.
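
The tanh form of GELU is an approximation to \(x\,\Phi(x)\), where \(\Phi\) is the standard normal CDF. The sketch below compares the two on a grid; `gelu_tanh` and `gelu_exact` are hypothetical names chosen for this comparison.

```python
import math
import numpy as np

def gelu_tanh(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gelu_exact(x):
    # x * Phi(x), with the standard normal CDF written via erf
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

xs = np.linspace(-5.0, 5.0, 1001)
max_err = np.max(np.abs(gelu_tanh(xs) - gelu_exact(xs)))
print(f"max |tanh approx - exact| on [-5, 5]: {max_err:.2e}")
```

The gap is small enough that the two forms are usually treated as interchangeable for a demo like this one.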


In [None]:

def gelu(x):
    # Approximate GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0/np.pi) * (x + 0.044715 * (x**3))))

class FFNNumpy:
    def __init__(self, d_model, d_hidden, rng=np.random.default_rng(7)):
        self.W1 = rng.normal(0, 0.02, size=(d_model, d_hidden)).astype(np.float32)
        self.b1 = np.zeros((d_hidden,), dtype=np.float32)
        self.W2 = rng.normal(0, 0.02, size=(d_hidden, d_model)).astype(np.float32)
        self.b2 = np.zeros((d_model,), dtype=np.float32)
    def __call__(self, X):
        return (gelu(X @ self.W1 + self.b1)) @ self.W2 + self.b2

# Quick test
ff = FFNNumpy(32, 64)
X = np.random.randn(7, 32).astype(np.float32)
print("FFN out shape:", ff(X).shape)



## 7) TransformerEncoderLayer (NumPy, Pre-Norm)

Combine: LayerNorm → MHA → Residual → LayerNorm → FFN → Residual.
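
As a rough sanity check on size, the parameters of one such layer can be counted by hand. The sketch assumes the NumPy layout used in this notebook: four bias-free \(d_{\text{model}} \times d_{\text{model}}\) attention matrices, an FFN with biases, and two LayerNorms. The function name is hypothetical.

```python
def encoder_layer_param_count(d_model, d_hidden):
    mha = 4 * d_model * d_model                     # W_Q, W_K, W_V, W_O (no biases)
    ffn = (d_model * d_hidden + d_hidden            # W1, b1
           + d_hidden * d_model + d_model)          # W2, b2
    norms = 2 * (2 * d_model)                       # gamma and beta for two LayerNorms
    return mha + ffn + norms

print(encoder_layer_param_count(d_model=64, d_hidden=128))  # 33216
```

The head count does not appear because splitting into heads only reshapes the same \(d_{\text{model}} \times d_{\text{model}}\) projections. A PyTorch layer with attention biases enabled would come out somewhat larger.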


In [None]:

class TransformerEncoderLayerNumpy:
    def __init__(self, d_model=64, num_heads=4, d_hidden=128, rng=np.random.default_rng(7)):
        self.ln1 = LayerNormNumpy(d_model)
        self.ln2 = LayerNormNumpy(d_model)
        self.mha = MultiHeadAttentionNumpy(d_model, num_heads, rng=rng)
        self.ffn = FFNNumpy(d_model, d_hidden, rng=rng)
    def __call__(self, X, additive_mask=None, return_attn=False):
        H = self.ln1(X)
        Hm, W = self.mha(H, additive_mask=additive_mask, return_weights=True)
        X = X + Hm
        H2 = self.ln2(X)
        X = X + self.ffn(H2)
        if return_attn:
            return X, W
        return X

class TransformerEncoderNumpy:
    def __init__(self, num_layers, d_model=64, num_heads=4, d_hidden=128, rng=np.random.default_rng(7)):
        self.layers = [TransformerEncoderLayerNumpy(d_model, num_heads, d_hidden, rng=rng) for _ in range(num_layers)]
        self.d_model = d_model
    def __call__(self, X, additive_mask=None, return_all=False):
        attns = []
        for i,layer in enumerate(self.layers):
            X, W = layer(X, additive_mask=additive_mask, return_attn=True)
            attns.append(W)
        if return_all:
            return X, attns
        return X

# Demo end-to-end
d_model = 64; num_heads=4; d_hidden=128; num_layers=2
max_len = 24

E = make_token_embedding(len(vocab), d_model)
pe = sinusoidal_positional_encoding(max_len, d_model)

sent = "Mean pooling and CLS pooling are common strategies."
ids = encode(sent, add_specials=True, max_len=max_len)
X0 = E[ids] + pe[:len(ids)]
pad_mask = (ids == stoi[PAD]).astype(np.float32)
add_mask = np.where(pad_mask[None, :]==1, -1e9, 0.0).astype(np.float32)

encoder = TransformerEncoderNumpy(num_layers, d_model, num_heads, d_hidden)
Z, attn_list = encoder(X0, additive_mask=add_mask, return_all=True)

print("Encoder output shape:", Z.shape)
print("Attn heads per layer:", [A.shape for A in attn_list])  # each (h,L,L)



## 8) Sentence Embeddings: [CLS] vs Mean Pool

We derive fixed-length vectors from token-level outputs:
- **CLS pooling**: take vector at position of `[CLS]` (index 0 if we prepended it).
- **Mean pooling**: average token vectors over non-padding positions (optionally ignore specials).

Finally, L2-normalize for cosine similarity.
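
To see why the mask matters, the sketch below compares a naive mean over all positions with a masked mean. `Z` is random stand-in data, and `keep` is a hypothetical mask marking three real tokens among special and PAD positions.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 4))                   # stand-in encoder outputs: 6 positions, d_model = 4
keep = np.array([0, 1, 1, 1, 0, 0], float)    # hypothetical mask: 1 = real token, 0 = special/PAD

naive = Z.mean(axis=0)                                            # averages every position
masked = (Z * keep[:, None]).sum(axis=0) / max(1.0, keep.sum())   # averages kept rows only

def l2n(v, eps=1e-12):
    return v / max(np.linalg.norm(v), eps)

emb = l2n(masked)
print("cosine(naive, masked):", float(l2n(naive) @ emb))
```

The masked mean equals the plain average of rows 1 to 3, so padding length no longer leaks into the sentence vector.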


In [None]:

def l2_normalize(x, eps=1e-12):
    nrm = np.sqrt((x**2).sum(-1, keepdims=True))
    return x / np.clip(nrm, eps, None)

def get_sentence_embedding(Z, ids, method="mean", ignore_specials=True):
    # Z: (L, d_model), ids: (L,)
    if method == "cls":
        # CLS is the very first token if we added [CLS], [BOS], ...
        idx = 0
        v = Z[idx]
        return l2_normalize(v[None, :])[0]
    elif method == "mean":
        mask = (ids != stoi[PAD]).astype(np.float32)
        if ignore_specials:
            specials = {stoi[t] for t in SPECIALS}
            mask = mask * np.array([0.0 if int(i) in specials else 1.0 for i in ids], dtype=np.float32)
        denom = max(1.0, mask.sum())
        v = (Z * mask[:, None]).sum(axis=0) / denom
        return l2_normalize(v[None, :])[0]
    else:
        raise ValueError("Unknown method")

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Build a small set and compare
sentences = [
    "A transformer encoder builds contextual token embeddings.",
    "We compute mean pooled sentence vectors for similarity.",
    "Fluffy cats sleep in sunny windows.",
    "Attention allows tokens to interact across positions.",
    "The weather is rainy, bring an umbrella.",
    "CLS pooling extracts the first token representation.",
]

def embed_sentence(encoder, E, pe, text, max_len=24, method="mean"):
    ids = encode(text, add_specials=True, max_len=max_len)
    X = E[ids] + pe[:len(ids)]
    pad_mask = (ids == stoi[PAD]).astype(np.float32)
    add_mask = np.where(pad_mask[None, :]==1, -1e9, 0.0).astype(np.float32)
    Z = encoder(X, additive_mask=add_mask)
    return get_sentence_embedding(Z, ids, method=method), ids, Z

# Reuse encoder from above (random weights => not trained)
method = "mean"  # try "cls" as well
vecs = []
for s in sentences:
    v, ids_s, Zs = embed_sentence(encoder, E, pe, s, method=method)
    vecs.append(v)

print("Pairwise cosine similarities:")
for i in range(len(sentences)):
    row = []
    for j in range(len(sentences)):
        row.append(f"{cosine_sim(vecs[i], vecs[j]): .3f}")
    print(i, row)

# Visualize attention of first layer, head 0 for a sample sentence
ids = encode(sentences[0], add_specials=True, max_len=24)
X = E[ids] + pe[:len(ids)]
pad_mask = (ids == stoi[PAD]).astype(np.float32)
add_mask = np.where(pad_mask[None, :]==1, -1e9, 0.0).astype(np.float32)
Z, attn_all = encoder(X, additive_mask=add_mask, return_all=True)

plt.figure(figsize=(5,5))
plt.imshow(attn_all[0][0])  # layer 0, head 0
plt.title("Layer 0, Head 0 Attention")
plt.xlabel("Key j")
plt.ylabel("Query i")
plt.colorbar()
plt.show()



## 9) (Optional) Tiny PyTorch Encoder + Quick Training

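If PyTorch is available, we'll build a small encoder and train for a few hundred steps on a toy objective. Two candidates are:
- **Next-token prediction** (causal-ish over our **encoder** for demo). This is what the training cell implements. Because the encoder is bidirectional and no causal mask is applied, each position can see the token it is asked to predict, so read the loss curve as a demo of training mechanics rather than a language-modeling result.
- **Denoising** (mask a token and predict it). This is left as an exercise.

This is just to show **gradients**, a **decreasing loss**, and how the embeddings change after training.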
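
The input/target shift used for next-token prediction can be sketched in NumPy. The ids and `PAD_ID` are hypothetical; the slicing mirrors pairing `logits[:, :-1]` with `batch[:, 1:]` and skipping PAD targets in the loss.

```python
import numpy as np

PAD_ID = 0                                      # hypothetical id for [PAD]
batch = np.array([[4, 2, 9, 7, 3, PAD_ID],      # hypothetical token ids, shape (B, L)
                  [4, 2, 5, 3, PAD_ID, PAD_ID]])

inputs_used = batch[:, :-1]    # positions 0..L-2 produce predictions
targets = batch[:, 1:]         # each is scored against the token one step to the right

counted = targets != PAD_ID    # PAD targets contribute nothing to the loss
for src, tgt, ok in zip(inputs_used[0], targets[0], counted[0]):
    print(f"{src} -> {tgt}" + ("" if ok else "  (ignored)"))
```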


In [None]:

if TORCH_AVAILABLE:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("Using device:", device)

    # Torch versions of token & positional embeddings
    class TorchSinusoidalPositionalEncoding(torch.nn.Module):
        def __init__(self, max_len, d_model):
            super().__init__()
            pe = torch.zeros(max_len, d_model)
            position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
            div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
            pe[:, 0::2] = torch.sin(position * div_term)
            pe[:, 1::2] = torch.cos(position * div_term)
            self.register_buffer('pe', pe)  # (max_len, d_model)

        def forward(self, positions):
            return self.pe[positions]

    class TinyEncoder(nn.Module):
        def __init__(self, vocab_size, d_model=64, num_heads=4, d_hidden=128, num_layers=2, max_len=64):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, d_model)
            self.pos_enc = TorchSinusoidalPositionalEncoding(max_len, d_model)
            enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=num_heads, dim_feedforward=d_hidden, batch_first=True, norm_first=True, activation="gelu")
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
            self.lm_head = nn.Linear(d_model, vocab_size)  # for next-token prediction
            self.max_len = max_len
            self.d_model = d_model

        def forward(self, input_ids, pad_id):
            # input_ids: (B, L)
            B, L = input_ids.shape
            positions = torch.arange(L, device=input_ids.device)
            X = self.token_emb(input_ids) + self.pos_enc(positions)
            key_padding_mask = (input_ids == pad_id)  # (B, L) True at pads
            Z = self.encoder(X, src_key_padding_mask=key_padding_mask)
            logits = self.lm_head(Z)  # (B, L, V)
            return Z, logits

    # Build dataset (toy): sequences from our small corpus
    def batchify(texts, max_len=24, batch_size=8):
        ids = [encode(t, add_specials=True, max_len=max_len) for t in texts]
        arr = torch.tensor(np.stack(ids), dtype=torch.long)
        # simple batching by repeat + shuffle
        reps = 64 // len(texts) + 1
        data = arr.repeat((reps, 1))
        idx = torch.randperm(data.size(0))
        data = data[idx]
        # chunk into batches
        for i in range(0, data.size(0), batch_size):
            yield data[i:i+batch_size]

    vocab_size = len(vocab)
    pad_id = stoi[PAD]
    model = TinyEncoder(vocab_size).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-3)
    loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id)

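    # Quick training loop: next-token prediction (shift inputs by 1).
    # One pass of batchify() yields only a handful of batches, so keep drawing
    # fresh shuffled passes until `steps` updates have run.
    def batch_stream():
        while True:
            yield from batchify(corpus, batch_size=16)

    steps, log_every = 200, 40
    model.train()
    for step, batch in enumerate(itertools.islice(batch_stream(), steps)):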
        batch = batch.to(device)
        Z, logits = model(batch, pad_id=pad_id) # (B, L, V)
        # targets are the next tokens: the output at position t is scored against token t+1
        targets = batch[:, 1:].contiguous()
        preds = logits[:, :-1, :].contiguous()  # align
        loss = loss_fn(preds.view(-1, vocab_size), targets.view(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        if (step+1) % log_every == 0:
            print(f"step {step+1:4d} | loss {loss.item():.4f}")

    # Extract embeddings and compare similarities post-training
    model.eval()
    with torch.no_grad():
        def torch_embed_sentence(text, method="mean"):
            ids_np = encode(text, add_specials=True, max_len=24)
            ids_t = torch.tensor(ids_np)[None, :].to(device)
            Z, _ = model(ids_t, pad_id=pad_id)  # (1, L, d)
            Z = Z[0]  # (L, d)
            if method == "cls":
                v = Z[0]
            else:
                mask = (ids_t[0] != pad_id).float()
                # ignore specials
                specials = [stoi[s] for s in SPECIALS]
                special_mask = torch.ones_like(mask)
                for s in specials:
                    special_mask = special_mask * (ids_t[0] != s).float()
                mask = mask * special_mask
                denom = torch.clamp(mask.sum(), min=1.0)
                v = (Z * mask[:, None]).sum(dim=0) / denom
            v = F.normalize(v, dim=0)
            return v.cpu().numpy()

        v_train = [torch_embed_sentence(s, method="mean") for s in sentences]
        print("Pairwise cosine similarities after quick training:")
        for i in range(len(sentences)):
            row = []
            for j in range(len(sentences)):
                row.append(f"{float(np.dot(v_train[i], v_train[j])): .3f}")
            print(i, row)
else:
    print("PyTorch not available; skipping training section.")



## 10) Exercises & Further Work

1. **Pooling variants:** Try max-pooling, attention-pooling (learn a query vector), or combinations.
2. **Masking:** Implement a causal mask and compare to padding mask effects on attention patterns.
3. **Normalization:** Switch to *post-norm* (apply LayerNorm after residual) and observe training stability in Torch section.
4. **Dimensionality:** Increase `d_model`, `num_heads`, and `num_layers`; check how attention maps change.
5. **Objective:** Replace next-token with a denoising (masking) objective; compare the quality of embeddings (cosine clusters).
6. **Whitening & anisotropy:** After mean pooling, try whitening or a root-mean-square normalization on embeddings and evaluate cosine similarities.
7. **Visualization:** Log attention across layers and heads for multiple sentences; build an attention rollout visualization.
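
As a possible starting point for the masking exercise, a causal additive mask can be sketched with `np.triu`. The helper name and the all-zero `scores` stand-in are assumptions for illustration.

```python
import numpy as np

def causal_additive_mask(L, neg=-1e9):
    # Entry (i, j) is `neg` when key j lies after query i, else 0.
    return np.triu(np.full((L, L), neg), k=1)

M = causal_additive_mask(4)
scores = np.zeros((4, 4))                      # stand-in for QK^T / sqrt(d_k)
masked = scores + M
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights.round(3))
```

Row \(i\) spreads its weight uniformly over keys \(0..i\) and gives zero to later keys. Because both masks are additive, a causal mask and a padding mask can be combined by summing them.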



### Some Tips:
- If you have CUDA-capable hardware, it may make sense to run this notebook there instead of on the free tier of Google Colab.
- If you still want to run this in Google Colab, consider changing the runtime type to GPU (optional) for a small speedup in the Torch training section.
- If you see `PyTorch not available; skipping training section.`, check that Torch is installed in your Colab runtime (it typically is). If it is not, run `!pip install torch` and then re-run the imports cell.

