# Positional Encoding from Scratch

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adiel2012/deep-learning-abc/blob/main/positional_encoding.ipynb)

Attention treats its input as a **set**: it has no built-in notion of token order. Given the same embeddings, "cat sat on mat" and "mat sat on cat" produce the same attention scores, merely rearranged to follow the tokens.

**Positional encoding** injects order information so the model knows *where* each token appears in the sequence.

This notebook covers:
1. Why position matters
2. The math behind sinusoidal encoding
3. Implementation from raw ops
4. Visualizations that build intuition
5. Learned positional embeddings (the alternative)

> **Prerequisites:** [attention_from_scratch.ipynb](https://colab.research.google.com/github/adiel2012/deep-learning-abc/blob/main/attention_from_scratch.ipynb)

In [None]:
# Install dependencies (Colab already has torch, but this ensures compatibility)
!pip install torch matplotlib -q

In [None]:
import torch
import math
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

torch.manual_seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 0. Mathematical Foundations

---

### 0.1 The Problem: Attention is Permutation-Invariant

Consider the self-attention output for token $i$:

$$\text{output}_i = \sum_j \text{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right) v_j$$

If we shuffle the input tokens, the set of dot products $q_i \cdot k_j$ doesn't change (the same pairs still exist), so the outputs are just the original outputs in shuffled order. Strictly speaking, this makes self-attention permutation-*equivariant*, but the practical consequence is the same: without extra information the model **cannot distinguish** "the cat sat" from "sat cat the".

We need to **break this symmetry** by encoding position.
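
The shuffling argument above can be checked numerically. The sketch below is illustrative only: the helper `self_attention`, the projection matrices `W_q`, `W_k`, `W_v`, and the sizes are made-up stand-ins with random values, not code used elsewhere in this notebook.

```python
import math
import torch

torch.manual_seed(0)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention with no positional information."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / math.sqrt(K.shape[-1])
    return torch.softmax(scores, dim=-1) @ V

# Hypothetical sizes and random weights, chosen only for the demonstration
seq_len, d_model = 5, 8
X = torch.randn(seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))

perm = torch.randperm(seq_len)            # a random reordering of the tokens
out = self_attention(X, W_q, W_k, W_v)
out_shuffled = self_attention(X[perm], W_q, W_k, W_v)

# Shuffling the inputs shuffles the outputs in exactly the same way
print(torch.allclose(out_shuffled, out[perm], atol=1e-4))
```

Each token receives the same output vector regardless of where it sits in the sequence, which is exactly the symmetry that positional encoding is meant to break.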

---

### 0.2 Sinusoidal Positional Encoding

The original Transformer (Vaswani et al., 2017) uses a fixed encoding based on sine and cosine waves at different frequencies:

$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where:
- $pos$ = position in the sequence ($0, 1, 2, \ldots$)
- $i$ = dimension index ($0, 1, \ldots, d_{\text{model}}/2 - 1$)
- $d_{\text{model}}$ = embedding dimension

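Each sin/cos pair shares one **wavelength**. The wavelengths form a geometric progression from $2\pi$ (pair $i = 0$) up to $10000^{(d_{\text{model}}-2)/d_{\text{model}}} \cdot 2\pi$ (last pair), which approaches $10000 \cdot 2\pi$ as $d_{\text{model}}$ grows.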

---

### 0.3 Why Sinusoids?

**Property 1: Unique encoding.** Each position gets a unique pattern across all dimensions — like a binary counter but with smooth, continuous values.

**Property 2: Relative positions via linear transformation.** For any fixed offset $k$:

$$PE_{pos+k} = M_k \cdot PE_{pos}$$

where $M_k$ is a block-diagonal matrix of $2 \times 2$ rotations (one per frequency) that depends only on $k$, not on $pos$. This makes it easy, in principle, for the model to learn to attend to relative positions ("2 tokens ago"), because the encoding at any fixed offset is a linear function of the current encoding.

Proof sketch for a single frequency $\omega$:

$$\begin{bmatrix} \sin(\omega(pos+k)) \\ \cos(\omega(pos+k)) \end{bmatrix} = \begin{bmatrix} \cos(\omega k) & \sin(\omega k) \\ -\sin(\omega k) & \cos(\omega k) \end{bmatrix} \begin{bmatrix} \sin(\omega \cdot pos) \\ \cos(\omega \cdot pos) \end{bmatrix}$$

This is just the angle-addition identity from trigonometry.

**Property 3: Bounded values.** All values are in $[-1, 1]$, same scale as typical normalized embeddings.

**Property 4: Extrapolation.** Since it's a formula (not a lookup table), it can be evaluated at positions beyond those seen during training. Whether a trained model then behaves well at those lengths is a separate, empirical question.
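
Property 2 can be verified numerically. The following is a minimal sketch in plain Python; the helper names `pe_vector` and `shift` and the values of `d_model` and `k` are illustrative choices, not part of the notebook's implementation.

```python
import math

def pe_vector(pos, d_model):
    """Sinusoidal encoding for one position, laid out as [sin, cos, sin, cos, ...]."""
    vec = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        vec += [math.sin(omega * pos), math.cos(omega * pos)]
    return vec

def shift(vec, k, d_model):
    """Apply the block-diagonal matrix M_k: one 2x2 rotation per frequency."""
    out = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        s, c = vec[2 * i], vec[2 * i + 1]
        out.append(math.cos(omega * k) * s + math.sin(omega * k) * c)
        out.append(-math.sin(omega * k) * s + math.cos(omega * k) * c)
    return out

d_model, k = 8, 3   # illustrative values

# Largest deviation between M_k PE_pos and PE_(pos+k) over a few positions
max_err = max(
    abs(a - b)
    for pos in (0, 7, 42)
    for a, b in zip(shift(pe_vector(pos, d_model), k, d_model),
                    pe_vector(pos + k, d_model))
)
print(f"max |M_k PE_pos - PE_(pos+k)| = {max_err:.2e}")
```

Note that `shift` never looks at `pos`: the same matrix moves every position forward by `k` steps.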

---

### 0.4 The Frequency Spectrum

The denominator $10000^{2i/d_{\text{model}}}$ creates a geometric progression of wavelengths:

| Pair index $i$ | Wavelength | Intuition |
|:---:|:---:|:---|
| 0 | $2\pi \approx 6.3$ | Changes rapidly; encodes fine position |
| mid ($2i/d_{\text{model}} = 0.5$) | $200\pi \approx 628$ | Medium frequency |
| last | up to $20000\pi \approx 62{,}832$ (the limit for large $d_{\text{model}}$) | Changes very slowly; encodes coarse position |

Low dimensions act like the "ones digit" of a number (fast-changing), while high dimensions act like the "thousands digit" (slow-changing). Together they uniquely identify each position, similar to how digits in a number uniquely identify a value.
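
To make the progression concrete, the wavelengths can be listed for a small model. This is a sketch under the assumption $d_{\text{model}} = 16$; the variable name `wavelengths` is introduced here for illustration.

```python
import math

d_model = 16  # illustrative; the same size is used for the plots in this notebook

# One wavelength per sin/cos pair: 2*pi * 10000^(2i/d_model)
wavelengths = [2 * math.pi * 10000 ** (2 * i / d_model) for i in range(d_model // 2)]

for i, wl in enumerate(wavelengths):
    print(f"pair i={i} (dims {2 * i} and {2 * i + 1}): wavelength = {wl:9.1f} positions")
```

For $d_{\text{model}} = 16$ each wavelength is $10000^{1/8} \approx 3.16$ times the previous one, and the longest is about 19,869 positions, well short of the $20000\pi$ limit that applies only for large $d_{\text{model}}$.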

---

### 0.5 How It's Used

The positional encoding is simply **added** to the token embedding:

$$\text{input}_i = \text{Embedding}(\text{token}_i) + PE_i$$

Addition (rather than concatenation) keeps the dimension unchanged and lets the model learn to use position and content jointly through the same projections.
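
A small sketch of what this means in practice. The vectors `x`, `p` and the matrix `W` are random stand-ins (assumptions for illustration) for a token embedding, a positional encoding, and a query/key/value projection.

```python
import torch

torch.manual_seed(0)
d_model = 8                          # illustrative size
x = torch.randn(d_model)             # stand-in for a token embedding
p = torch.randn(d_model)             # stand-in for a positional encoding
W = torch.randn(d_model, d_model)    # stand-in for a learned projection

added = x + p
concatenated = torch.cat([x, p])

# Addition keeps d_model; concatenation doubles it
print(added.shape, concatenated.shape)

# A linear projection of the sum is the sum of the projections,
# so one matrix W processes content and position together
print(torch.allclose(added @ W, x @ W + p @ W, atol=1e-5))
```

With concatenation, every downstream projection would need twice as many input weights to read both halves.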

---

Now let's implement and visualize all of this.

## 1. The Problem: Attention Ignores Order

In [None]:
def softmax(x):
    x_max = x.max(dim=-1, keepdim=True).values
    exp_x = torch.exp(x - x_max)
    return exp_x / exp_x.sum(dim=-1, keepdim=True)

def attention_weights(X):
    """Compute self-attention weights (no projection, for demonstration)."""
    scores = X @ X.T / math.sqrt(X.shape[-1])
    return softmax(scores)

# Three token embeddings (4 dims each)
embeddings = torch.tensor([
    [1.0, 0.0, 0.5, 0.2],   # "cat"
    [0.0, 1.0, 0.3, 0.8],   # "sat"
    [0.5, 0.5, 1.0, 0.1],   # "down"
], device=device)

# Original order: cat, sat, down
original = embeddings[[0, 1, 2]]
# Shuffled order: down, cat, sat
shuffled = embeddings[[2, 0, 1]]

w_orig = attention_weights(original)
w_shuf = attention_weights(shuffled)

print("Original order [cat, sat, down] — attention weights:")
print(w_orig)
print("\nShuffled order [down, cat, sat] — attention weights:")
print(w_shuf)
print("\nThe weights are just a permutation of each other!")
print("Attention has NO idea which token came first.")

## 2. Sinusoidal Positional Encoding — Implementation

In [None]:
def sinusoidal_positional_encoding(max_len, d_model, device=None):
    """
    Compute sinusoidal positional encoding from scratch.
    
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    
    Returns: (max_len, d_model) tensor. Assumes d_model is even.
    """
    pe = torch.zeros(max_len, d_model, device=device)
    
    # Position indices: [0, 1, 2, ..., max_len-1]
    pos = torch.arange(0, max_len, device=device).unsqueeze(1)  # (max_len, 1)
    
    # Even dimension indices 2i: [0, 2, 4, ...]
    i = torch.arange(0, d_model, 2, device=device).float()     # (d_model/2,)
    
    # Inverse of the denominator: 1 / 10000^(2i/d_model)
    # Using the exp-log trick: 10000^(-2i/d) = exp(-2i/d * ln(10000))
    div_term = torch.exp(i * -(math.log(10000.0) / d_model))   # (d_model/2,)
    
    # Compute angles: pos * (1 / 10000^(2i/d_model))
    angles = pos * div_term  # (max_len, d_model/2) via broadcasting
    
    # Even dimensions: sin, odd dimensions: cos
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    
    return pe


# Generate encoding for 50 positions, 16 dimensions
max_len = 50
d_model = 16
PE = sinusoidal_positional_encoding(max_len, d_model, device=device)

print(f"PE shape: {PE.shape}  (max_len, d_model)")
print(f"\nPosition 0 encoding: {PE[0]}")
print(f"Position 1 encoding: {PE[1]}")
print(f"\nAll values in [-1, 1]: min={PE.min():.4f}, max={PE.max():.4f}")

## 3. Visualizing the Encoding

### 3.1 Heatmap: All positions × all dimensions

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
im = ax.imshow(PE.cpu().numpy(), cmap='RdBu_r', aspect='auto', vmin=-1, vmax=1)
ax.set_xlabel('Dimension')
ax.set_ylabel('Position')
ax.set_title('Sinusoidal Positional Encoding')
plt.colorbar(im, ax=ax, label='Value')
plt.tight_layout()
plt.show()

**Reading the heatmap:**
- Left columns (low dimensions) oscillate rapidly — they encode fine-grained position
- Right columns (high dimensions) change slowly — they encode coarse position
- Each row (position) has a unique pattern

### 3.2 Individual Sinusoids at Different Frequencies

In [None]:
positions = torch.arange(max_len)
dims_to_show = [0, 2, 4, 8, 14]  # even dims (sin channels)

fig, ax = plt.subplots(figsize=(10, 4))
for dim in dims_to_show:
    ax.plot(positions.numpy(), PE[:, dim].cpu().numpy(), label=f'dim {dim} (sin)')

ax.set_xlabel('Position')
ax.set_ylabel('PE value')
ax.set_title('Sinusoids at Different Frequencies')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Low-index dimensions complete many cycles over the sequence (high frequency), while high-index dimensions barely change (low frequency).

### 3.3 Similarity Between Positions

If the encoding works well, nearby positions should be more similar than distant ones.

In [None]:
# Cosine similarity between all pairs of positions
PE_norm = PE / PE.norm(dim=-1, keepdim=True)
similarity = (PE_norm @ PE_norm.T).cpu().numpy()

fig, ax = plt.subplots(figsize=(7, 6))
im = ax.imshow(similarity, cmap='viridis', aspect='equal')
ax.set_xlabel('Position')
ax.set_ylabel('Position')
ax.set_title('Cosine Similarity Between Position Encodings')
plt.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()

print("Nearby positions are more similar (bright diagonal band).")
print("Distant positions are less similar (darker off-diagonal).")

### 3.4 The "Binary Counter" Analogy

Think of positions in binary:

| Position | Binary | Dim 0 (fast) | Dim 1 | Dim 2 | Dim 3 (slow) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 0000 | 0 | 0 | 0 | 0 |
| 1 | 0001 | 1 | 0 | 0 | 0 |
| 2 | 0010 | 0 | 1 | 0 | 0 |
| 3 | 0011 | 1 | 1 | 0 | 0 |
| 4 | 0100 | 0 | 0 | 1 | 0 |

The lowest bit toggles every step, the next bit every 2 steps, etc. Sinusoidal encoding does the same thing but with **continuous** waves instead of discrete bits — dim 0 oscillates fastest, dim $d_{\text{model}}-1$ oscillates slowest.
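
The table can be reproduced with a few lines of plain Python. This is only a sketch of the analogy; `bits_of` and `n_bits` are hypothetical names introduced here.

```python
def bits_of(pos, n_bits):
    """Return the bits of pos, least-significant (fastest-changing) bit first."""
    return [(pos >> b) & 1 for b in range(n_bits)]

n_bits = 4  # illustrative width
print("pos  " + "  ".join(f"bit{b}" for b in range(n_bits)))
for pos in range(8):
    print(f"{pos:>3}  " + "  ".join(f"{bit:>4}" for bit in bits_of(pos, n_bits)))

# Bit b repeats with period 2**(b+1): a geometric progression of periods,
# just as the sinusoid wavelengths grow geometrically with the pair index.
periods = [2 ** (b + 1) for b in range(n_bits)]
print("periods:", periods)
```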

## 4. Relative Position via Dot Product

A key property: the dot product $PE_{pos} \cdot PE_{pos+k}$ depends only on the **offset** $k$, not on the absolute position.
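
A short derivation shows that the dependence on $k$ is in fact exact. Write $\omega_i = 10000^{-2i/d_{\text{model}}}$ (a shorthand introduced here) and group the dot product by sin/cos pairs:

$$PE_{pos} \cdot PE_{pos+k} = \sum_{i=0}^{d_{\text{model}}/2 - 1} \Big[ \sin(\omega_i \, pos)\,\sin(\omega_i (pos+k)) + \cos(\omega_i \, pos)\,\cos(\omega_i (pos+k)) \Big] = \sum_{i=0}^{d_{\text{model}}/2 - 1} \cos(\omega_i k)$$

The last step uses $\cos(a - b) = \cos a \cos b + \sin a \sin b$ with $a = \omega_i (pos+k)$ and $b = \omega_i \, pos$. The variable $pos$ cancels, so any difference between curves for different starting positions is floating-point error.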

In [None]:
# Dot product between PE[pos] and PE[pos+k] for different starting positions
offsets = range(0, 20)
start_positions = [0, 5, 10, 20]

fig, ax = plt.subplots(figsize=(8, 4))

for start in start_positions:
    dots = []
    for k in offsets:
        if start + k < max_len:
            dot = (PE[start] * PE[start + k]).sum().item()
            dots.append(dot)
        else:
            dots.append(float('nan'))
    ax.plot(list(offsets), dots, 'o-', markersize=4, label=f'start={start}')

ax.set_xlabel('Offset k')
ax.set_ylabel('Dot product PE[pos] · PE[pos+k]')
ax.set_title('Dot Product Depends on Offset, Not Absolute Position')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("All curves coincide (up to floating-point error): the dot product is a function of the offset k alone,")
print("regardless of where in the sequence we start.")

## 5. Adding Positional Encoding to Embeddings

In [None]:
# Simulate a small model
d_model = 16
seq_len = 6
vocab_size = 100

# Fake token embeddings (random, as if from nn.Embedding)
token_embeddings = torch.randn(seq_len, d_model, device=device) * 0.1

# Get positional encoding for these positions
PE = sinusoidal_positional_encoding(seq_len, d_model, device=device)

# Add them together (the original Transformer also scales the embeddings first; see Section 6)
input_to_attention = token_embeddings + PE

print("Token embeddings shape:", token_embeddings.shape)
print("Positional encoding shape:", PE.shape)
print("Combined input shape:", input_to_attention.shape)

print("\nToken embedding (pos 0):", token_embeddings[0, :4].tolist())
print("Positional enc  (pos 0):", PE[0, :4].tolist())
print("Sum             (pos 0):", input_to_attention[0, :4].tolist())

### Does it fix the permutation problem?

In [None]:
# Same 3 tokens as before, each with 4 dimensions (d_model must be even for sin/cos pairs)
embeddings = torch.tensor([
    [1.0, 0.0, 0.5, 0.2],   # "cat"
    [0.0, 1.0, 0.3, 0.8],   # "sat"
    [0.5, 0.5, 1.0, 0.1],   # "down"
], device=device)

PE_3 = sinusoidal_positional_encoding(3, 4, device=device)

# Original: cat(pos0), sat(pos1), down(pos2)
original_with_pe = embeddings + PE_3
# Shuffled: down(pos0), cat(pos1), sat(pos2)
shuffled_with_pe = embeddings[[2, 0, 1]] + PE_3

w_orig = attention_weights(original_with_pe)
w_shuf = attention_weights(shuffled_with_pe)

print("WITH positional encoding:")
print("\nOriginal [cat, sat, down]:")
print(w_orig)
print("\nShuffled [down, cat, sat]:")
print(w_shuf)
print("\nThe weights are now DIFFERENT — order matters!")

## 6. Scaling with $\sqrt{d_{\text{model}}}$

In practice, token embeddings are often multiplied by $\sqrt{d_{\text{model}}}$ before adding the positional encoding:

$$\text{input}_i = \sqrt{d_{\text{model}}} \cdot \text{Embedding}(\text{token}_i) + PE_i$$

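Why? One common rationale: if the embedding entries are initialized with variance $\approx 1/d_{\text{model}}$, they are much smaller than the entries of $PE$, which lie in $[-1, 1]$. Without scaling, the positional signal would dominate. Multiplying by $\sqrt{d_{\text{model}}}$ brings both to a similar scale. Note that this depends on the initialization: PyTorch's `nn.Embedding` draws from $\mathcal{N}(0, 1)$ by default, in which case the embeddings are already comparable in scale to $PE$ without the extra factor.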
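
The scales involved can be estimated with a rough calculation, assuming the entries of an embedding vector $e$ are i.i.d. with zero mean and variance $1/d_{\text{model}}$ (an assumption about the initialization, not a property of every library):

$$\mathbb{E}\,\|e\|^2 = d_{\text{model}} \cdot \frac{1}{d_{\text{model}}} = 1, \qquad \|PE_{pos}\|^2 = \sum_{i=0}^{d_{\text{model}}/2 - 1} \big(\sin^2(\cdot) + \cos^2(\cdot)\big) = \frac{d_{\text{model}}}{2}, \qquad \mathbb{E}\,\big\|\sqrt{d_{\text{model}}}\; e\big\|^2 = d_{\text{model}}$$

Unscaled, the embedding norm is about $1$ against a positional norm of $\sqrt{d_{\text{model}}/2}$. After scaling, the two norms differ only by a factor of about $\sqrt{2}$.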

In [None]:
d_model = 64

# Embedding entries with variance 1/d_model (std = 1/sqrt(d_model)), matching the text above
emb = torch.randn(10, d_model, device=device) / math.sqrt(d_model)
pe = sinusoidal_positional_encoding(10, d_model, device=device)

emb_norm = emb.norm(dim=-1).mean()
pe_norm = pe.norm(dim=-1).mean()
scaled_norm = (emb * math.sqrt(d_model)).norm(dim=-1).mean()

print(f"Embedding norm (per token):       {emb_norm:.2f}")
print(f"PE norm (per position):           {pe_norm:.2f}")
print(f"Scaled embedding norm (×√d):      {scaled_norm:.2f}")
print(f"\nAfter scaling, the embedding norm (~{scaled_norm:.1f}) and the PE norm (~{pe_norm:.1f}) are comparable, so neither dominates.")

## 7. Learned Positional Embeddings

An alternative: instead of a fixed formula, **learn** a position embedding for each position, just like token embeddings.

$$PE_{\text{learned}} = \text{nn.Embedding}(\text{max\_len},\; d_{\text{model}})$$

In [None]:
import torch.nn as nn

max_len = 50
d_model = 16

# Learned: just an embedding table indexed by position
learned_pe = nn.Embedding(max_len, d_model).to(device)

positions = torch.arange(max_len, device=device)
pe_vectors = learned_pe(positions)  # (max_len, d_model)

print(f"Learned PE shape: {pe_vectors.shape}")
print(f"Parameters: {max_len} × {d_model} = {max_len * d_model}")

### Comparison

| | Sinusoidal (fixed) | Learned |
|---|---|---|
| **Parameters** | 0 | max_len × d_model |
| **Extrapolation** | Defined for any position, including unseen lengths | None: no embedding exists beyond max_len |
| **Performance** | Comparable; Vaswani et al. (2017) report nearly identical results | Comparable |
| **Used by** | Original Transformer | BERT, GPT-2 |

In practice the two tend to perform comparably. Sinusoidal encodings need no parameters and can be computed for any sequence length; learned embeddings are more flexible but are limited to positions below max_len.
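
The two options can sit behind the same interface. The module below is a minimal sketch, not a reference implementation: the class name `EmbeddingWithPosition`, its arguments, and the sizes are hypothetical, `d_model` is assumed to be even, and the $\sqrt{d_{\text{model}}}$ scaling is left out to keep it short.

```python
import math
import torch
import torch.nn as nn

class EmbeddingWithPosition(nn.Module):
    """Token embedding plus either fixed sinusoidal or learned positions (illustrative)."""

    def __init__(self, vocab_size, d_model, max_len, learned=False):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.learned = learned
        if learned:
            self.pos = nn.Embedding(max_len, d_model)    # trainable table
        else:
            pos = torch.arange(max_len).unsqueeze(1).float()
            inv_freq = torch.exp(torch.arange(0, d_model, 2).float()
                                 * -(math.log(10000.0) / d_model))
            pe = torch.zeros(max_len, d_model)
            pe[:, 0::2] = torch.sin(pos * inv_freq)
            pe[:, 1::2] = torch.cos(pos * inv_freq)
            self.register_buffer("pe", pe)               # fixed, not a parameter

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        seq_len = token_ids.shape[1]
        x = self.tok(token_ids)
        if self.learned:
            positions = torch.arange(seq_len, device=token_ids.device)
            return x + self.pos(positions)               # broadcasts over the batch
        return x + self.pe[:seq_len]


token_ids = torch.randint(0, 100, (2, 6))                # hypothetical batch of token ids
for learned in (False, True):
    layer = EmbeddingWithPosition(vocab_size=100, d_model=16, max_len=50, learned=learned)
    n_params = sum(p.numel() for p in layer.parameters())
    print(f"learned={learned}: output {tuple(layer(token_ids).shape)}, parameters {n_params}")
```

The fixed table is stored with `register_buffer`, so it moves with the module across devices but is never updated by the optimizer. The learned variant adds `max_len × d_model` trainable parameters.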

## 8. Full Example: Embeddings + PE → Attention

In [None]:
d_model = 16
seq_len = 4
tokens = ["The", "cat", "sat", "down"]

# Step 1: Token embeddings (simulated)
torch.manual_seed(42)
X = torch.randn(seq_len, d_model, device=device) * 0.1

# Step 2: Add positional encoding
PE = sinusoidal_positional_encoding(seq_len, d_model, device=device)
X_with_pos = math.sqrt(d_model) * X + PE

# Step 3: Self-attention (simplified — no projection for clarity)
scores = X_with_pos @ X_with_pos.T / math.sqrt(d_model)
weights_with_pe = softmax(scores)

# Compare: without PE
scores_no_pe = X @ X.T / math.sqrt(d_model)
weights_no_pe = softmax(scores_no_pe)

# Visualize side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

for ax, w, title in [(ax1, weights_no_pe, 'Without PE'),
                      (ax2, weights_with_pe, 'With PE')]:
    w_np = w.detach().cpu().numpy()
    im = ax.imshow(w_np, cmap='Blues', vmin=0, vmax=1)
    ax.set_xticks(range(len(tokens)))
    ax.set_yticks(range(len(tokens)))
    ax.set_xticklabels(tokens)
    ax.set_yticklabels(tokens)
    ax.set_xlabel('Key')
    ax.set_ylabel('Query')
    ax.set_title(title)
    for row in range(len(tokens)):
        for col in range(len(tokens)):
            ax.text(col, row, f'{w_np[row, col]:.2f}',
                    ha='center', va='center', fontsize=9)

plt.suptitle('Attention Weights: Effect of Positional Encoding', y=1.02, fontsize=13)
plt.tight_layout()
plt.show()

print("Without PE: weights depend only on content (nearly uniform here because the simulated embeddings are small).")
print("With PE: weights also depend on position, so the model can tell positions apart.")

## 9. Summary

| Concept | Key Idea |
|---------|----------|
| **Problem** | Self-attention alone is permutation-equivariant: it carries no information about token order |
| **Solution** | Add a position-dependent signal to each embedding |
| **Sinusoidal PE** | Fixed formula using sin/cos at geometrically-spaced frequencies |
| **Why sin/cos?** | Unique per position, enables relative position via linear transform, bounded, defined for any sequence length |
| **Learned PE** | Alternative: learn an embedding per position (more flexible, can't extrapolate) |
| **Usage** | $\text{input} = \sqrt{d_{\text{model}}} \cdot \text{Embed}(\text{token}) + PE$ |