# Embeddings & Positional Encoding

**Module 4.1, Lesson 3** — Language Modeling Fundamentals

In this notebook you'll:
- Create `nn.Embedding` and verify it's just matrix indexing
- Prove the one-hot equivalence (one-hot × W = row selection)
- Compute cosine similarity between token pairs to see what training produces
- Implement sinusoidal positional encoding from the formula
- Combine embeddings + positional encoding into the model input
- (Stretch) Explore pretrained GPT-2 embeddings

**For each exercise, PREDICT the output before running the cell.**

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np

# Reproducible results
torch.manual_seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

print(f"PyTorch version: {torch.__version__}")

---
## Exercise 1: nn.Embedding Basics (Guided)

Create an embedding layer and verify that calling it with a token ID returns the same vector as indexing the weight matrix directly.

**Before running, predict:** If you create `nn.Embedding(10, 4)`, what shape will the weight matrix be? If you look up token ID 3, what shape will the result be?

In [None]:
# Create a small embedding: 10 tokens, 4 dimensions
vocab_size = 10
embed_dim = 4
embedding = nn.Embedding(vocab_size, embed_dim)

print(f"Embedding weight shape: {embedding.weight.shape}")
print(f"Weight matrix:\n{embedding.weight.data}")

In [None]:
# Look up token ID 3 two ways:
token_id = 3

# Way 1: Call the embedding layer with a tensor
via_call = embedding(torch.tensor([token_id]))

# Way 2: Index the weight matrix directly
via_index = embedding.weight[token_id]

print(f"Via embedding call: {via_call}")
print(f"Via weight index:   {via_index}")
print(f"\nAre they equal? {torch.allclose(via_call.squeeze(), via_index)}")

In [None]:
# Verify that embeddings are learnable parameters
print(f"requires_grad: {embedding.weight.requires_grad}")
print(f"\nNumber of parameters: {embedding.weight.numel()}")
print(f"That's {vocab_size} tokens × {embed_dim} dimensions = {vocab_size * embed_dim}")
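
Because the lookup is a learnable parameter, a backward pass only touches the rows that were actually looked up. The sketch below is a minimal illustration under stated assumptions: `demo_emb`, `ids`, and the toy loss (a plain sum of the looked-up vectors) are made up for this example and are not part of the exercise.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
demo_emb = nn.Embedding(10, 4)   # hypothetical demo layer, same size as above
ids = torch.tensor([3, 7, 3])    # token 3 is looked up twice

# Toy loss: sum of every looked-up value, so each lookup contributes a gradient of 1
loss = demo_emb(ids).sum()
loss.backward()

grad = demo_emb.weight.grad      # same shape as the weight matrix: [10, 4]
rows_with_grad = grad.abs().sum(dim=1).nonzero().flatten().tolist()

print(f"Rows with non-zero gradient: {rows_with_grad}")
print(f"Gradient for row 3 (used twice): {grad[3]}")
print(f"Gradient for row 7 (used once):  {grad[7]}")
```

Rows for tokens that never appear in a batch receive zero gradient, which suggests why rare tokens tend to move slowly during training.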

**Your turn:** How many parameters would `nn.Embedding(50000, 768)` have?

In [None]:
# YOUR ANSWER HERE
# Calculate: vocab_size * embed_dim
big_params = 50000 * 768
print(f"Parameters: {big_params:,}")  # 38,400,000 — 38.4 million just for embeddings!

---
## Exercise 2: One-Hot Equivalence Proof (Guided)

Show that multiplying a one-hot vector by the embedding weight matrix gives the same result as `nn.Embedding` lookup.

**Before running, predict:** If you multiply a one-hot vector of shape `[10]` by a weight matrix of shape `[10, 4]`, what shape is the result? Will it match the embedding lookup?

In [None]:
# Create one-hot vector for token ID 3
token_id = 3
one_hot = F.one_hot(torch.tensor(token_id), num_classes=vocab_size).float()

print(f"One-hot vector: {one_hot}")
print(f"Shape: {one_hot.shape}")

In [None]:
# Multiply one-hot by the weight matrix
# one_hot shape: [10], weight shape: [10, 4]
# Result shape: [4] — one row of the weight matrix
via_matmul = one_hot @ embedding.weight

# Compare with direct lookup
via_lookup = embedding(torch.tensor([token_id])).squeeze()

print(f"Via one-hot @ weight: {via_matmul}")
print(f"Via embedding lookup: {via_lookup}")
print(f"\nAre they equal? {torch.allclose(via_matmul, via_lookup)}")
print("\n✓ one-hot × W = row selection. nn.Embedding skips the sparse vector.")

In [None]:
# Let's verify for ALL tokens at once
all_ids = torch.arange(vocab_size)
all_one_hot = F.one_hot(all_ids, num_classes=vocab_size).float()

via_matmul_all = all_one_hot @ embedding.weight
via_lookup_all = embedding(all_ids)

print(f"All equal? {torch.allclose(via_matmul_all, via_lookup_all)}")
print(f"\nOne-hot identity × W = W. The embedding IS the weight matrix.")

---
## Exercise 3: Cosine Similarity Between Token Pairs (Guided)

At initialization, embeddings are random — no meaningful similarity. But let's see what cosine similarity looks like, and then we'll look at pretrained embeddings.

**Before running, predict:** For random 64-dimensional vectors, will cosine similarity between any two tokens be close to 0, close to 1, or random?

In [None]:
def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    """Compute cosine similarity between two vectors."""
    return (F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))).item()

# Random embeddings: similarity is ~0 (random directions)
random_emb = nn.Embedding(5, 64)  # larger dim for more meaningful cosine sim
tokens = {"cat": 0, "dog": 1, "bird": 2, "the": 3, "seven": 4}

print("=== Random (untrained) embeddings ===")
for name_a, id_a in tokens.items():
    for name_b, id_b in tokens.items():
        if id_a < id_b:
            sim = cosine_similarity(
                random_emb.weight[id_a],
                random_emb.weight[id_b]
            )
            print(f"  cos({name_a}, {name_b}) = {sim:.3f}")

print("\nAll close to 0. Random vectors in high dimensions are nearly orthogonal.")

In [None]:
# Now let's MANUALLY set embeddings to show what training produces.
# After training, similar tokens should have similar vectors.

trained_emb = nn.Embedding(5, 4)
with torch.no_grad():
    trained_emb.weight[0] = torch.tensor([0.8, 0.3, -0.2, 0.5])   # "cat" - animal
    trained_emb.weight[1] = torch.tensor([0.7, 0.4, -0.1, 0.6])   # "dog" - animal (similar to cat!)
    trained_emb.weight[2] = torch.tensor([0.6, 0.5, -0.3, 0.4])   # "bird" - animal (close too)
    trained_emb.weight[3] = torch.tensor([-0.1, -0.8, 0.9, -0.3]) # "the" - function word (different!)
    trained_emb.weight[4] = torch.tensor([0.2, -0.6, 0.1, 0.8])   # "seven" - number (different!)

print("=== Trained (simulated) embeddings ===")
for name_a, id_a in tokens.items():
    for name_b, id_b in tokens.items():
        if id_a < id_b:
            sim = cosine_similarity(
                trained_emb.weight[id_a],
                trained_emb.weight[id_b]
            )
            print(f"  cos({name_a}, {name_b}) = {sim:.3f}")

print("\nAnimals cluster together (high similarity). 'the' and 'seven' are much less similar to them.")

**Key insight:** At initialization, all pairs have ~0 similarity (random). After training, similar tokens have high cosine similarity. Training shapes the embedding space.
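
How close to zero "random" similarity gets depends on the dimension: for random Gaussian vectors the spread of cosine similarity shrinks roughly like `1/sqrt(dim)`. The sketch below estimates that spread empirically. It assumes standard-normal vectors (which matches the default `nn.Embedding` initialization); the helper name `random_cosine_spread` and the pair count are illustrative choices.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def random_cosine_spread(dim: int, n_pairs: int = 2000) -> float:
    """Standard deviation of cosine similarity over random Gaussian vector pairs."""
    a = torch.randn(n_pairs, dim)
    b = torch.randn(n_pairs, dim)
    return F.cosine_similarity(a, b, dim=1).std().item()

# Compare a tiny dimension, the dimension used above, and GPT-2's dimension
spreads = {dim: random_cosine_spread(dim) for dim in (4, 64, 768)}
for dim, spread in spreads.items():
    print(f"dim={dim:>4}: std of cosine = {spread:.3f}   1/sqrt(dim) = {dim ** -0.5:.3f}")
```

This is why the random example above uses 64 dimensions: in 4 dimensions, random pairs can look quite similar or dissimilar by chance alone.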

---
## Exercise 4: Sinusoidal Positional Encoding (Guided)

Implement the sinusoidal positional encoding from the formula:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

**Before running, predict:** Position 0 uses `sin(0)` and `cos(0)` for its first two dimensions. What values are those?

In [None]:
def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """
    Generate sinusoidal positional encoding matrix.
    
    Args:
        max_len: Maximum sequence length
        d_model: Embedding dimension
    
    Returns:
        Tensor of shape [max_len, d_model]
    """
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # [max_len, 1]
    
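    # Compute the inverse frequency for each pair of dimensions: 1 / 10000^(2i/d_model)
    # (computed as exp(-2i * ln(10000) / d_model); assumes d_model is even)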
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float) * (-np.log(10000.0) / d_model)
    )  # [d_model/2]
    
    # Even dimensions: sin
    pe[:, 0::2] = torch.sin(position * div_term)
    # Odd dimensions: cos
    pe[:, 1::2] = torch.cos(position * div_term)
    
    return pe

# Generate for 20 positions, 64 dimensions
pe = sinusoidal_positional_encoding(20, 64)
print(f"Shape: {pe.shape}")
print(f"\nPosition 0, first 8 dims: {pe[0, :8].tolist()}")
print(f"Position 1, first 8 dims: {pe[1, :8].tolist()}")

In [None]:
# Visualize as a heatmap
plt.figure(figsize=(12, 5))
plt.imshow(pe.numpy(), cmap='RdBu_r', aspect='auto', vmin=-1, vmax=1)
plt.colorbar(label='Value')
plt.xlabel('Encoding Dimension')
plt.ylabel('Position')
plt.title('Sinusoidal Positional Encoding')
plt.tight_layout()
plt.show()

print("Left (low dimensions): fast-changing waves (fine position)")
print("Right (high dimensions): slow-changing waves (coarse position)")

In [None]:
# Verify key properties:

# 1. Each position has a unique encoding
for i in range(20):
    for j in range(i + 1, 20):
        assert not torch.allclose(pe[i], pe[j], atol=1e-6), f"Positions {i} and {j} are the same!"
print("✓ All 20 positions have unique encodings")

# 2. Nearby positions have similar encodings
sim_adjacent = cosine_similarity(pe[0], pe[1])
sim_distant = cosine_similarity(pe[0], pe[19])
print(f"\n✓ Nearby positions are more similar:")
print(f"  cos(pos 0, pos 1)  = {sim_adjacent:.3f}")
print(f"  cos(pos 0, pos 19) = {sim_distant:.3f}")

# 3. Works for any sequence length
pe_long = sinusoidal_positional_encoding(10000, 64)
print(f"\n✓ Can generate for any length: shape {pe_long.shape}")
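
One way to see why similarity tracks distance: since sin(a)·sin(b) + cos(a)·cos(b) = cos(a − b), the dot product between two encodings depends only on how far apart the positions are, not on where they sit in the sequence. The sketch below checks this numerically. `sinusoidal_pe` is a compact re-implementation of the function above so the block runs on its own; `pe_demo`, `offset`, and the tolerance are illustrative.

```python
import math
import torch

def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    """Compact sinusoidal encoding for a self-contained demo (d_model must be even)."""
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    inv_freq = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * inv_freq)
    pe[:, 1::2] = torch.cos(position * inv_freq)
    return pe

pe_demo = sinusoidal_pe(50, 64)
offset = 3

# Dot product between position p and position p + offset, for many starting points p
dots = torch.stack([pe_demo[p] @ pe_demo[p + offset] for p in range(40)])
print(f"min = {dots.min().item():.4f}, max = {dots.max().item():.4f}")

# Every row has the same squared norm: sin^2 + cos^2 = 1 for each of the d_model/2 pairs
sq_norms = pe_demo.pow(2).sum(dim=1)
print(f"squared norms: min = {sq_norms.min().item():.4f}, max = {sq_norms.max().item():.4f}")
```

The minimum and maximum should agree up to floating-point error, so the same offset looks the same anywhere in the sequence.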

---
## Exercise 5: Combining Embeddings + Positional Encoding (Guided)

Put it all together: take a sentence, look up token embeddings, add positional encoding.

**Before running, predict:** If token embeddings have shape `[5, 64]` and positional encodings have shape `[5, 64]`, what shape is the model input? What operation combines them?

In [None]:
# Simulate the full input pipeline
vocab_size = 100
embed_dim = 64
max_seq_len = 128

# Create the token embedding table and the positional encoding matrix
token_embedding = nn.Embedding(vocab_size, embed_dim)
# For sinusoidal: no learnable parameters, just compute it
pe_matrix = sinusoidal_positional_encoding(max_seq_len, embed_dim)

# Simulate a tokenized sentence: [23, 45, 12, 67, 89]
token_ids = torch.tensor([23, 45, 12, 67, 89])
seq_len = len(token_ids)

# Step 1: Look up token embeddings
tok_emb = token_embedding(token_ids)  # [5, 64]
print(f"Token embeddings shape: {tok_emb.shape}")

# Step 2: Get positional encodings for positions 0..4
pos_enc = pe_matrix[:seq_len]  # [5, 64]
print(f"Positional encoding shape: {pos_enc.shape}")

# Step 3: Add them together
model_input = tok_emb + pos_enc  # [5, 64]
print(f"Model input shape: {model_input.shape}")
print(f"\n✓ The tensor that enters the transformer: {seq_len} tokens × {embed_dim} dimensions")

In [None]:
# The SAME sentence with LEARNED positional encoding:
pos_embedding = nn.Embedding(max_seq_len, embed_dim)  # learnable
positions = torch.arange(seq_len)  # [0, 1, 2, 3, 4]

model_input_learned = token_embedding(token_ids) + pos_embedding(positions)
print(f"Model input (learned PE) shape: {model_input_learned.shape}")
print(f"\nSame shape, different approach. Learned PE has {pos_embedding.weight.numel():,} extra parameters.")
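
A practical difference between the two approaches is length: a learned position table has exactly `max_seq_len` rows, so a position beyond that cannot be looked up, whereas the sinusoidal matrix can be regenerated for any length. The sketch below shows the failure on CPU. `learned_pe` is a hypothetical stand-in for the table above, and the exact error type is what CPU PyTorch raises (on a GPU the failure surfaces differently).

```python
import torch
import torch.nn as nn

max_seq_len, embed_dim = 128, 64
learned_pe = nn.Embedding(max_seq_len, embed_dim)  # hypothetical learned position table

# Positions 0..127 are valid rows of the table
ok = learned_pe(torch.arange(max_seq_len))
print(f"In-range lookup shape: {ok.shape}")

# Position 128 has no row, so the lookup fails
try:
    learned_pe(torch.tensor([max_seq_len]))
    out_of_range_failed = False
except IndexError as err:
    out_of_range_failed = True
    print(f"Out-of-range lookup raised IndexError: {err}")
```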

In [None]:
# Demonstrate the bag-of-words problem:
# Same tokens, different order. Without position, a token gets the same vector wherever it appears.

sentence_a = torch.tensor([10, 20, 30])  # "dog bites man"
sentence_b = torch.tensor([30, 20, 10])  # "man bites dog"

# Without positional encoding
emb_a = token_embedding(sentence_a)
emb_b = token_embedding(sentence_b)

# Reversing sentence_b's rows recovers sentence_a exactly: same vectors, just reordered
print("WITHOUT positional encoding:")
print(f"  Same vectors after reordering? {torch.allclose(emb_a, emb_b.flip(0))}")

# WITH positional encoding: "dog" at position 0 differs from "dog" at position 2
# (A plain sum over positions would NOT show this: both inputs add the same pe_3,
#  so their sums stay equal. Compare the per-token vectors instead.)
pe_3 = pe_matrix[:3]
input_a = emb_a + pe_3
input_b = emb_b + pe_3

print(f"\nWITH positional encoding:")
print(f"  Same vectors after reordering? {torch.allclose(input_a, input_b.flip(0))}")
print(f"\n✓ Position breaks the symmetry. The model can now distinguish order.")

---
## Exercise 6 (Stretch): Explore Pretrained GPT-2 Embeddings (Guided)

Load the actual GPT-2 token embeddings and explore semantic similarity.

**Before running, predict:** Will semantically similar words (like "cat" and "dog") have higher cosine similarity than unrelated words (like "cat" and "seven") in trained embeddings?

In [None]:
# Install transformers if needed
try:
    from transformers import GPT2Tokenizer, GPT2Model
except ImportError:
    !pip install transformers -q
    from transformers import GPT2Tokenizer, GPT2Model

In [None]:
# Load GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

# Extract the token embedding matrix
wte = model.wte.weight.detach()  # [50257, 768]
print(f"GPT-2 embedding matrix shape: {wte.shape}")
print(f"Vocabulary size: {wte.shape[0]}")
print(f"Embedding dimension: {wte.shape[1]}")
print(f"Total parameters: {wte.numel():,}")

In [None]:
def get_token_embedding(word: str) -> torch.Tensor:
    """Get the embedding for a word's first token (intended for single-token words)."""
    ids = tokenizer.encode(word)
    # If the word splits into several tokens, only the first one is used
    return wte[ids[0]]

def token_similarity(word_a: str, word_b: str) -> float:
    """Cosine similarity between two tokens' embeddings."""
    emb_a = get_token_embedding(word_a)
    emb_b = get_token_embedding(word_b)
    return cosine_similarity(emb_a, emb_b)

# Explore semantic similarities
pairs = [
    (" cat", " dog"),
    (" cat", " the"),
    (" king", " queen"),
    (" man", " woman"),
    (" happy", " sad"),
    (" happy", " seven"),
    (" France", " Germany"),
    (" France", " banana"),
]

print("GPT-2 Token Embedding Similarities:")
print("=" * 45)
for word_a, word_b in pairs:
    sim = token_similarity(word_a, word_b)
    bar = "█" * int(abs(sim) * 20)
    print(f"  cos({word_a.strip():>8}, {word_b.strip():<8}) = {sim:+.3f}  {bar}")

In [None]:
def find_nearest_tokens(word: str, k: int = 10) -> None:
    """Find the k nearest tokens to a given word in embedding space."""
    emb = get_token_embedding(word).unsqueeze(0)  # [1, 768]
    # Cosine similarity with all tokens
    sims = F.cosine_similarity(emb, wte)  # [50257]
    
    # Get top-k (skip the first one — it's the token itself)
    top_k = torch.topk(sims, k + 1)
    
    print(f"\nNearest neighbors to '{word.strip()}':")
    for i in range(1, k + 1):
        idx = top_k.indices[i].item()
        sim = top_k.values[i].item()
        token_str = tokenizer.decode([idx])
        print(f"  {i}. '{token_str}' (sim={sim:.3f})")

find_nearest_tokens(" cat")
find_nearest_tokens(" king")
find_nearest_tokens(" Python")

In [None]:
# Visualize clusters with PCA
from sklearn.decomposition import PCA

# Select interesting tokens
token_groups = {
    'animals': [' cat', ' dog', ' bird', ' fish', ' horse', ' bear'],
    'numbers': [' one', ' two', ' three', ' four', ' five', ' six'],
    'countries': [' France', ' Germany', ' Japan', ' China', ' India', ' Brazil'],
    'colors': [' red', ' blue', ' green', ' yellow', ' black', ' white'],
}

colors_map = {'animals': 'green', 'numbers': 'orange', 'countries': 'blue', 'colors': 'red'}

# Collect embeddings
all_tokens = []
all_embeddings = []
all_groups = []

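# Loop variables are named group_words/word so they don't overwrite the
# `tokens` dict defined in Exercise 3
for group, group_words in token_groups.items():
    for word in group_words:
        all_tokens.append(word.strip())
        all_embeddings.append(get_token_embedding(word).numpy())
        all_groups.append(group)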

embeddings_matrix = np.stack(all_embeddings)

# PCA to 2D
pca = PCA(n_components=2)
coords = pca.fit_transform(embeddings_matrix)

# Plot
plt.figure(figsize=(10, 8))
for group in token_groups:
    mask = [g == group for g in all_groups]
    group_coords = coords[mask]
    group_tokens = [t for t, m in zip(all_tokens, mask) if m]
    plt.scatter(group_coords[:, 0], group_coords[:, 1], 
                label=group, c=colors_map[group], s=100, alpha=0.7)
    for j, token in enumerate(group_tokens):
        plt.annotate(token, (group_coords[j, 0], group_coords[j, 1]),
                     fontsize=9, ha='center', va='bottom', 
                     textcoords='offset points', xytext=(0, 5))

plt.legend()
plt.title('GPT-2 Token Embeddings (PCA 2D Projection)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Similar tokens cluster together — this is what training produces!")

---
## Key Takeaways

1. **`nn.Embedding` is a learnable lookup table.** It stores a weight matrix of shape `[vocab_size, embed_dim]`, and indexing by token ID returns one row. This is equivalent to one-hot × matrix multiplication, without building the sparse vector.
2. **Cosine similarity reveals structure.** Random embeddings have ~0 similarity between all pairs. After training, semantically similar tokens cluster together in embedding space.
3. **Sinusoidal positional encoding gives each position a unique, smooth encoding** using multi-frequency waves. Low dimensions change fast (fine position), high dimensions change slowly (coarse position).
4. **Token embedding + positional encoding = the model's input.** Without position, the model sees a bag of words and cannot distinguish "dog bites man" from "man bites dog."
5. **Everything downstream operates on these vectors.** Attention, transformer blocks, the whole model — it all starts from this `[seq_len, embed_dim]` tensor.
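
Takeaways 1, 3, and 4 can be packaged into a single module. The sketch below is one possible arrangement, not a reference implementation: the class name `InputEmbedding`, the added batch dimension, and storing the sinusoidal matrix as a buffer are all assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Hypothetical wrapper: token embedding + fixed sinusoidal positional encoding."""

    def __init__(self, vocab_size: int, embed_dim: int, max_len: int):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)

        # Sinusoidal matrix, same formula as Exercise 4 (embed_dim must be even)
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        inv_freq = torch.exp(
            torch.arange(0, embed_dim, 2, dtype=torch.float) * (-math.log(10000.0) / embed_dim)
        )
        pe = torch.zeros(max_len, embed_dim)
        pe[:, 0::2] = torch.sin(position * inv_freq)
        pe[:, 1::2] = torch.cos(position * inv_freq)
        # A buffer follows the module across devices but is not a trainable parameter
        self.register_buffer("pe", pe)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch, seq_len] -> output: [batch, seq_len, embed_dim]
        seq_len = token_ids.shape[-1]
        return self.token_embedding(token_ids) + self.pe[:seq_len]

torch.manual_seed(0)
layer = InputEmbedding(vocab_size=100, embed_dim=64, max_len=128)
batch = torch.tensor([[10, 20, 30], [30, 20, 10]])  # same tokens, reversed order
out = layer(batch)

n_trainable = sum(p.numel() for p in layer.parameters())
print(f"Output shape: {out.shape}")
print(f"Trainable parameters: {n_trainable:,}")
print(f"Token 10 identical at positions 0 and 2? {torch.allclose(out[0, 0], out[1, 2])}")
```

Only the token table is trainable here; swapping the buffer for a second `nn.Embedding` would give the learned-position variant from Exercise 5.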