# Transformer Fundamentals – Guided Notebook 01 — Input Tokens (Tokenization & Embeddings)
**Date:** 2025-10-29
**Style:** Guided, hands-on; from-scratch first, then frameworks; interactive visuals

## Learning Objectives

- Understand subword tokenization (BPE/WordPiece) and why it’s used.
- Map tokens to vectors via embedding tables.
- Visualize token embeddings with [Principal Component Analysis, or PCA](./GLOSSARY.md#pca); explore proximity and analogies.
- Connect toy NumPy embeddings to HF tokenizers and real models (GPT-2, BERT).

## TL;DR

Text → tokens → indices → embedding vectors. Tokenization shapes what the model can represent;
embeddings give each token a learned coordinate in vector space.

## Concept Overview

### Tokenization splits text into subword units to balance vocabulary size and coverage
Language models can’t operate directly on raw text — they need discrete units called *tokens*.
Early systems used full words, but that led to extremely large vocabularies and poor handling of rare or new words.
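Modern models instead use **subword tokenization**, such as *Byte Pair Encoding (BPE)* or *WordPiece*, which splits words into smaller chunks like `"play"`, `"##ing"`, `"##ed"`. The `##` continuation prefix is WordPiece's convention (used by BERT); BPE tokenizers such as GPT-2's mark word boundaries differently.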
This approach strikes a balance between:
- **Coverage:** the ability to represent any input text, including unseen words
- **Compactness:** keeping the vocabulary small enough for efficient training and inference

For example:
> `"playing"` → `["play", "##ing"]`
> `"cats"` → `["cat", "##s"]`

These consistent subword fragments allow the model to generalize across similar forms of words without memorizing every variant.
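
To make the splitting rule concrete, here is a minimal sketch of greedy longest-match-first splitting in the style of WordPiece inference. The vocabulary `TOY_VOCAB` and the function name `wordpiece_tokenize` are hypothetical, chosen for illustration; real vocabularies are learned from a corpus and contain tens of thousands of entries.

```python
# Hypothetical toy vocabulary (an assumption for this sketch, not a real model's vocab).
TOY_VOCAB = {"[UNK]", "play", "cat", "##ing", "##ed", "##s"}

def wordpiece_tokenize(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it appears in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

for w in ["playing", "played", "cats", "dog"]:
    print(w, "->", wordpiece_tokenize(w))
```

Note that `"dog"` falls back to `[UNK]` only because this toy vocabulary has no pieces that cover it; a real subword vocabulary usually includes enough short pieces to avoid that.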

---

### Each token id indexes into an embedding matrix `E` with shape `[vocab_size, d_model]`
Once tokenized, text becomes a list of integer IDs — each representing a position in the vocabulary.
These IDs map into an **embedding matrix** `E`, a learnable lookup table where each row corresponds to a token’s vector representation.
If your vocabulary has 50,000 tokens and the model’s hidden size (`d_model`) is 768, then `E` has shape `[50000, 768]`.
When the model processes a sentence, it retrieves the embeddings for the tokens it sees:

\[
\text{Embeddings} = E[\text{token\_ids}]
\]

Each token’s embedding acts like its *coordinate* in a high-dimensional semantic space.
During training, these vectors are updated so that tokens appearing in similar contexts end up close to each other in this space.
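
The lookup and the size arithmetic above can be sketched in NumPy. This is a minimal illustration, assuming a small random table as a stand-in for a trained `E`; the token IDs and the small sizes are made up for the example.

```python
import numpy as np

# Parameter count for the sizes quoted above: one d_model-sized row per vocabulary entry.
vocab_size, d_model = 50_000, 768
n_params = vocab_size * d_model
print("parameters in E:", n_params)

# A much smaller, untrained stand-in table keeps the demo cheap (sizes are assumptions).
rng = np.random.default_rng(0)
small_vocab, small_d = 10, 4
E = rng.standard_normal((small_vocab, small_d))

token_ids = np.array([2, 3, 4, 5])   # hypothetical IDs for a 4-token sentence
embeddings = E[token_ids]            # integer-array indexing selects one row per ID
print("embeddings shape:", embeddings.shape)

# Equivalent view: multiplying one-hot rows by E selects the same rows.
one_hot = np.zeros((len(token_ids), small_vocab))
one_hot[np.arange(len(token_ids)), token_ids] = 1.0
print("matches one-hot matmul:", np.allclose(one_hot @ E, embeddings))
```

The one-hot view is why an embedding layer can be read as a linear layer without bias; frameworks implement it as an index lookup because that avoids the large sparse multiplication.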

---

### Similar tokens often cluster in embedding space
Because embeddings are trained on the contexts in which tokens appear, similar or related words tend to develop similar vector representations.
For instance:
- `"king"` and `"queen"` end up close together
- `"cat"` and `"dog"` might form a cluster separate from `"car"` or `"tree"`

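We can visualize this by projecting embeddings into 2D (e.g., using PCA or UMAP).
With trained embeddings you will often see groupings: plural forms, verb tenses, or semantically related concepts tend to sit near each other. The random toy embeddings used later in this notebook will not show such structure.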

This property makes embeddings useful beyond Transformers — they’re the backbone for many semantic similarity and retrieval systems.
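
As a rough sketch of how such a similarity lookup works, the snippet below finds each token's nearest neighbor by cosine similarity. The 3-D vectors are hand-made for illustration only; they are not learned embeddings, and the values were picked so that related words point in similar directions.

```python
import numpy as np

# Hand-made vectors, invented for this example (NOT taken from any trained model).
tokens = ["king", "queen", "cat", "dog", "car"]
vectors = np.array([
    [0.9,  0.8, 0.1],   # king
    [0.8,  0.9, 0.1],   # queen
    [0.1,  0.2, 0.9],   # cat
    [0.2,  0.1, 0.9],   # dog
    [0.9, -0.5, 0.2],   # car
])

# Cosine similarity: normalize rows to unit length, then take all pairwise dot products.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = unit @ unit.T

# Exclude self-matches before picking each token's most similar neighbor.
np.fill_diagonal(sim, -np.inf)
nearest = {tok: tokens[j] for tok, j in zip(tokens, sim.argmax(axis=1))}

for tok, nb in nearest.items():
    print(f"{tok} -> {nb}")
```

With real embedding matrices the same normalize-then-dot-product pattern applies; only the source of `vectors` changes.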

---

**In short:**
Tokenization breaks text into manageable, reusable pieces.
Embeddings turn those pieces into vectors that capture relationships between words.
Together, they form the foundation upon which attention and all subsequent Transformer layers operate.


In [None]:

# %% [setup] Environment check & minimal installs (run once per kernel)
# Target: Python 3.12.12, PyTorch 2.5+, transformers 4.44+, datasets 3+, ipywidgets 8+, matplotlib 3.8+
import sys, platform, subprocess, os

print("Python:", sys.version)
print("Platform:", platform.platform())

# Optional: uncomment to install/upgrade on this machine (internet required)
# !pip install --upgrade pip
# !pip install "torch>=2.5" "transformers>=4.44" "datasets>=3.0.0" "ipywidgets>=8.1.0" "matplotlib>=3.8" "umap-learn>=0.5.6" scikit-learn

try:
    import torch
    print("Torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("CUDA device count:", torch.cuda.device_count())
        print("CUDA device name:", torch.cuda.get_device_name(0))
except Exception as e:
    print("PyTorch not available yet:", e)

%config InlineBackend.figure_format = 'retina'
from IPython.display import display, HTML
try:
    import ipywidgets as widgets
    from ipywidgets import interact, interactive
    print("ipywidgets:", widgets.__version__)
except Exception as e:
    print("ipywidgets not available yet:", e)

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)


In [None]:

# %% [utils] Small helpers used throughout
import numpy as np
import matplotlib.pyplot as plt  # used by show_heatmap

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cosine_sim(a, b, eps=1e-9):
    a_norm = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_norm = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return np.dot(a_norm, b_norm.T)

def show_heatmap(mat, xticklabels=None, yticklabels=None, title=""):
    plt.figure()
    plt.imshow(mat, aspect="auto")
    plt.colorbar()
    if xticklabels is not None: plt.xticks(range(len(xticklabels)), xticklabels, rotation=45, ha="right")
    if yticklabels is not None: plt.yticks(range(len(yticklabels)), yticklabels)
    plt.title(title)
    plt.tight_layout()
    plt.show()


In [None]:

# %% [from-scratch] Toy BPE-like tokenization & embeddings (NumPy)
# "sit" is in the vocabulary so that "sits" and "sitting" can split into stem + suffix.
vocab = {"<pad>":0, "<unk>":1, "the":2, "cat":3, "##s":4, "sat":5, "on":6, "##ting":7, "sit":8}
d_model = 8
E = np.random.randn(len(vocab), d_model) * 0.1  # random toy embeddings, one row per vocab entry

def toy_tokenize(text):
    # Minimal suffix-stripping heuristic, for demonstration only (not real BPE/WordPiece).
    parts = text.lower().split()
    tokens = []
    for w in parts:
        if w in vocab:
            tokens.append(w)
        elif w.endswith("s") and w[:-1] in vocab:
            tokens += [w[:-1], "##s"]
        elif w.endswith("ting") and w[:-4] in vocab:  # strip exactly the 4 characters of "ting"
            tokens += [w[:-4], "##ting"]
        else:
            tokens.append("<unk>")  # anything the heuristic cannot split
    return tokens

text = "The cat sits on the cat"
toks = toy_tokenize(text)
ids = [vocab.get(t, vocab["<unk>"]) for t in toks]
vecs = E[ids]
print("Tokens:", toks)
print("IDs:", ids)
print("Embeddings shape:", vecs.shape)


In [None]:

# %% [visualize] PCA projection of embeddings of unique tokens in this sentence
from sklearn.decomposition import PCA

unique_ids = sorted(set(ids))
X = E[unique_ids]
pca = PCA(n_components=2).fit(X)
X2 = pca.transform(X)

plt.figure()
plt.scatter(X2[:,0], X2[:,1])
id_to_token = {i: t for t, i in vocab.items()}  # invert the vocab once for label lookup
for i, tid in enumerate(unique_ids):
    plt.text(X2[i,0], X2[i,1], id_to_token[tid])
plt.title("Toy embedding PCA (2D)")
plt.tight_layout()
plt.show()


### Framework Section: HF Tokenizers & Real Embeddings
- Use GPT-2 and BERT tokenizers to compare vocabulary and tokenization behavior.
- Load small models to peek at real embedding matrices.


In [None]:

# %% [framework] Hugging Face tokenizers & model embeddings
from transformers import AutoTokenizer, AutoModel

model_names = {
    "gpt2": "gpt2",
    "bert": "bert-base-uncased"
}

sample = "The quick brown fox jumps over the lazy dog."

for label, m in model_names.items():
    print(f"--- {label.upper()} ---")
    tok = AutoTokenizer.from_pretrained(m)
    toks = tok.tokenize(sample)  # subword strings, without special tokens
    # For BERT this adds [CLS] and [SEP], so len(ids) can exceed len(toks).
    ids = tok.encode(sample, add_special_tokens=True)
    print("Tokens:", toks)
    print("IDs:", ids[:12], "...")

    mdl = AutoModel.from_pretrained(m)  # downloads weights on first use
    emb = mdl.get_input_embeddings().weight.detach().cpu().numpy()
    print("Embedding matrix shape:", emb.shape)  # [vocab_size, d_model]



---
### Bonus: Multilingual Extension
- Swap the tokenizer/model for a multilingual variant (e.g., `bert-base-multilingual-cased` or `xlm-roberta-base`).
- Repeat a small slice of the notebook (tokenization, embedding lookup) on non-English sentences and compare how the text is split.



---
## Reflection & Next Steps
- What changed when you tweaked `d_model`, the toy vocabulary, or the input text?
- Which tokens landed near each other in the PCA plot, and did it match your intuition?
- Re-run the tokenization cells on your own text.
- Save a copy of the figures that best illustrate your understanding.
