In [1]:
import numpy as np
import matplotlib.pyplot as plt  # used by the plotting cells further down

rng = np.random.default_rng(seed=42)

# Permutation Equivariance
This is the key concept that motivates the need for positional encodings. To develop an intuition for the concept, let's first answer whether humans are permutation equivariant.

## Are Humans?

Let $f(x)$ denote a function that takes a natural language query $x$ and produces a "human-like" natural language response $y$. Let's consider $x$ to be the sentence
* $x=[\text{The}, \text{developer}, \text{deployed}, \text{the}, \text{agent}, \text{to}, \text{production}]$

A "human-like" response would be something along the lines of $y=[\text{Cool}, \text{which}, \text{framework}, \text{did}, \text{she}, \text{use}, \text{?}]$.

Now, let's permute $x$:
* $x'=[\text{The}, \text{agent}, \text{deployed}, \text{the}, \text{developer}, \text{to}, \text{production}]$

A natural response would be $y'=[\text{What}, \text{the}, \text{heck}, \text{?}, \text{Did}, \text{AI}, \text{already}, \text{take}, \text{over}, \text{?}]$. 

If a human were permutation equivariant, we'd expect their response to be permuted in the same way the input is permuted. This means $f(x')$ would have to be $[\text{Cool}, \text{she}, \text{framework}, \text{did}, \text{which}, \text{use}, \text{?}]$. 

This is clearly not what we want. If the input is permuted, we want a response that captures the semantic difference, not merely the permutation.
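
To make the thought experiment concrete, here is a small sketch that applies the same swap (indices 1 and 4, i.e. the 2nd and 5th tokens) to both the query and the reply. The helper name `swap` is ours, chosen only for illustration.

```python
def swap(tokens, i, j):
    """Return a copy of `tokens` with the items at indices i and j exchanged."""
    out = list(tokens)
    out[i], out[j] = out[j], out[i]
    return out

x = ["The", "developer", "deployed", "the", "agent", "to", "production"]
y = ["Cool", "which", "framework", "did", "she", "use", "?"]

x_perm = swap(x, 1, 4)          # the permuted query x'
y_equivariant = swap(y, 1, 4)   # what a permutation-equivariant f would have to return

print(" ".join(x_perm))
print(" ".join(y_equivariant))
```

The second line printed is the scrambled reply from the paragraph above: the same tokens as $y$, merely reordered.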

## Permutation Matrices

A **permutation matrix** $P$ is a square matrix with exactly one 1 in each row and each column (all other entries are 0). Left-multiplying $X$ by $P$ (computing $PX$) reorders the rows of $X$ according to where those 1s sit.

The identity matrix $I$ is the special case where every 1 sits on the diagonal: each row maps to itself, so nothing changes. Exchanging two rows of $I$ gives a $P$ that swaps the same two rows of $X$:

In [2]:
N = 7

def mat_lines(arr, title, col_prefix="col", val_fmt="{:2d}", swapped=frozenset()):
    col_w = len(val_fmt.format(0)) + 2
    header = "          " + "".join(f"{col_prefix} {j}".rjust(col_w) for j in range(arr.shape[1]))
    lines = [title, header]
    for i, row in enumerate(arr):
        vals = "  ".join(val_fmt.format(v) for v in row)
        tag = "  <- swapped" if i in swapped else ""
        lines.append(f"  row {i}:  [ {vals} ]{tag}")
    return lines

def print_side_by_side(left, right, gap=6):
    width = max(len(l) for l in left)
    for l, r in zip(left, right):
        print(l.ljust(width + gap) + r)

I = np.eye(N, dtype=int)
P = np.eye(N, dtype=int)
P[[1, 4]] = P[[4, 1]]  # indices 1 and 4 → rows 2 and 5

left  = mat_lines(I, "Identity I  (no change)")
right = mat_lines(P, "Permutation P  (rows 2 & 5 swap)", swapped={1, 4})

print_side_by_side(left, right)

Identity I  (no change)                            Permutation P  (rows 2 & 5 swap)
          col 0col 1col 2col 3col 4col 5col 6                col 0col 1col 2col 3col 4col 5col 6
  row 0:  [  1   0   0   0   0   0   0 ]             row 0:  [  1   0   0   0   0   0   0 ]
  row 1:  [  0   1   0   0   0   0   0 ]             row 1:  [  0   0   0   0   1   0   0 ]  <- swapped
  row 2:  [  0   0   1   0   0   0   0 ]             row 2:  [  0   0   1   0   0   0   0 ]
  row 3:  [  0   0   0   1   0   0   0 ]             row 3:  [  0   0   0   1   0   0   0 ]
  row 4:  [  0   0   0   0   1   0   0 ]             row 4:  [  0   1   0   0   0   0   0 ]  <- swapped
  row 5:  [  0   0   0   0   0   1   0 ]             row 5:  [  0   0   0   0   0   1   0 ]
  row 6:  [  0   0   0   0   0   0   1 ]             row 6:  [  0   0   0   0   0   0   1 ]
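
As a quick sanity check of the claim that $PX$ reorders rows, the sketch below applies the same swap to a toy matrix whose rows are easy to recognize. The matrix `X` is made up for illustration and is not an embedding from any model.

```python
import numpy as np

N = 7
P = np.eye(N, dtype=int)
P[[1, 4]] = P[[4, 1]]                 # same swap as above (indices 1 and 4)

X = np.arange(N * 2).reshape(N, 2)    # toy rows: row i is [2i, 2i + 1]
PX = P @ X

row_origin = PX[:, 0] // 2            # which original row now sits in each slot
print(row_origin)

# Permutation matrices are orthogonal, so the transpose undoes the shuffle.
print(np.array_equal(P.T @ PX, X))
```

Only slots 1 and 4 exchange their contents; every other row stays where it was.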


## Is Attention Permutation Equivariant?

A function $f$ is permutation equivariant if for any permutation matrix $P$:

$$f(PX) = P \cdot f(X)$$

Shuffle the input rows, and the output rows are shuffled in exactly the same way. Let's see:

In [3]:
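def attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (no mask, no positional info)."""
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v

    d_head = W_q.shape[1]
    S = (Q @ K.T) / np.sqrt(d_head)
    exp_S = np.exp(S - np.max(S, axis=1, keepdims=True))  # numerically stable softmax
    A = exp_S / np.sum(exp_S, axis=1, keepdims=True)      # attention weights, rows sum to 1

    return A @ V

In [None]:
N, d_model, d_head = 7, 4, 4

# Draw the projection weights once so both calls below share them.
# (Sampling them inside attention() would give each call different weights
# and make the comparison meaningless.)
W_q = rng.random(size=(d_model, d_head))
W_k = rng.random(size=(d_model, d_head))
W_v = rng.random(size=(d_model, d_head))

x = rng.random(size=(N, d_model))

y   = attention(x, W_q, W_k, W_v)      # original
y_p = attention(P @ x, W_q, W_k, W_v)  # permuted input (P swaps indices 1 and 4)

def pprint(arr, title, highlight=(1, 4)):
    """Print only the highlighted (swapped) rows of `arr`."""
    header = "          " + "".join(f"   dim {j}" for j in range(arr.shape[1]))
    print(title)
    print(header)
    print("    ...")
    for i in sorted(highlight):
        vals = "  ".join(f"{v:6.3f}" for v in arr[i])
        print(f"  row {i}:  [ {vals} ]  <- swapped")
        print("    ...")
    print()

pprint(y,   "Attention(x), original:")
pprint(y_p, "Attention(Px), permuted input:")

print(f"Attention(Px) == P · Attention(x): {np.allclose(y_p, P @ y)}")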

Attention **is** permutation equivariant. Shuffle the tokens in, get the same representations back — just reordered.
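
The same conclusion can be reached on paper. A short derivation, writing $Q = XW_Q$, $K = XW_K$, $V = XW_V$ and using two facts: $P^\top P = I$, and the row-wise softmax commutes with permuting rows and columns together, i.e. $\operatorname{softmax}(PSP^\top) = P\,\operatorname{softmax}(S)\,P^\top$:

$$
\begin{aligned}
\text{Attention}(PX)
&= \operatorname{softmax}\!\left(\frac{(PXW_Q)(PXW_K)^\top}{\sqrt{d_{\text{head}}}}\right) PXW_V \\
&= \operatorname{softmax}\!\left(P\,\frac{QK^\top}{\sqrt{d_{\text{head}}}}\,P^\top\right) PV \\
&= P\,\operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_{\text{head}}}}\right) P^\top P\,V \\
&= P \cdot \text{Attention}(X)
\end{aligned}
$$

Nothing in this chain depends on which row a token occupies, which is why the weights must be the same in both calls for the identity to hold.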

But recall from the opening: this is exactly the property we *don't* want. Rearranging *"The developer deployed the agent"* into *"The agent deployed the developer"* should produce a fundamentally different response, not a permuted version of the same one.

# Positional Encodings

This is why transformers need **positional encodings** — a way to stamp each token with its position *before* attention runs, so that order is no longer invisible.

In other words, we need something akin to

* $x_{\text{pos-encoded}}=[\text{The}_1, \text{developer}_2, \text{deployed}_3, \text{the}_4, \text{agent}_5, \text{to}_6, \text{production}_7]$

The original permutation would then look like:

* $x'_{\text{pos-encoded}}=[\text{The}_1, \text{agent}_2, \text{deployed}_3, \text{the}_4, \text{developer}_5, \text{to}_6, \text{production}_7]$

We enrich the word with information about its position. This way the same word will have a different representation depending on its location in the sequence ($\text{agent}_5 \neq \text{agent}_2$). 

## What We Need from a Positional Encoding

We need a function $PE(\text{pos})$ that maps each position to a $d$-dimensional vector, with these properties:

| Property | Why |
|----------|-----|
| **Unique** per position | So the model can distinguish position 5 from position 50 |
| **Compatible** with embeddings | So it doesn't dominate the token embedding or cause instability |
| **Smooth distance** | Nearby positions should have similar encodings |
| **Generalizes** to unseen lengths | A model trained on length 512 should handle length 1024 |


### The shortcomings of naive approaches
Let's investigate why a simple absolute positional encoding ($PE(\text{pos}) = \text{pos}$) breaks some of the above properties.

Let's assume our token embeddings live in a roughly normalized space, with values of a similar order of magnitude across all tokens. For this example, we assume they lie roughly between $0$ and $1$. If we added the absolute position to every dimension of a token's embedding, the position term would quickly swamp the token content, the similarity between two positions a fixed distance apart would depend on where in the sequence they sit, and positions beyond the training length would produce magnitudes the model has never seen.


In [5]:
N, model_dim = 100, 512

embeds = rng.random(size=(N, model_dim))
embeds += np.arange(N).reshape(N, 1)  # add absolute position to each dim

# Dot-product similarity between pairs that are 4 apart
early = embeds[0] @ embeds[4]    # tokens 1 and 5
late  = embeds[95] @ embeds[99]  # tokens 96 and 100

print(f"sim(token  1, token  5):  {early:.1f}")
print(f"sim(token 96, token 100): {late:.1f}")

# Plot similarity for all "gap-4" pairs
sims = [embeds[i] @ embeds[i + 4] for i in range(N - 4)]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(sims)
ax.set_xlabel("Token position $i$")
ax.set_ylabel("Dot-product similarity")
ax.set_title("sim(token $i$, token $i+4$) — same gap, very different similarity")
plt.tight_layout()
plt.show()

A naive approach that simply uses the integer position as a feature ($PE(\text{pos}) = \text{pos}$) fails immediately: values grow without bound, and every dimension carries the same number, so the extra dimensions add no information.

What if we normalized to [0, 1]? Then $PE(\text{pos}) = \text{pos} / L$ where $L$ is the sequence length. But now the encoding depends on $L$, so position 5 means something different in a 10-token sequence vs. a 1000-token sequence.
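
A minimal sketch of that normalization problem, assuming a hypothetical helper `normalized_pe` that implements $\text{pos} / L$:

```python
import numpy as np

def normalized_pe(seq_len):
    """Hypothetical scheme: position divided by sequence length."""
    return np.arange(seq_len) / seq_len

short_seq = normalized_pe(10)
long_seq = normalized_pe(1000)

print(f"position 5 in a   10-token sequence: {short_seq[5]:.3f}")
print(f"position 5 in a 1000-token sequence: {long_seq[5]:.3f}")

# The step between neighbouring positions also shrinks as the sequence grows.
print(f"step size: {short_seq[1] - short_seq[0]:.3f} vs {long_seq[1] - long_seq[0]:.3f}")
```

The same position maps to very different values, and "one token apart" means a different numeric distance in each sequence.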

We need something smarter.


## Building Intuition: Binary Counting

Before jumping to the formula, let's look at a familiar system that encodes integers: **binary representation**.

In [None]:
print("pos │ bit3  bit2  bit1  bit0")
print("────┼────────────────────────")
for pos in range(16):
    bits = [(pos >> b) & 1 for b in range(3, -1, -1)]
    print(f" {pos:2d} │  {'     '.join(str(b) for b in bits)}")

Notice the pattern:
- **bit0** (rightmost) flips every step — frequency = 1
- **bit1** flips every 2 steps — frequency = 1/2
- **bit2** flips every 4 steps — frequency = 1/4
- **bit3** flips every 8 steps — frequency = 1/8

Each bit oscillates at a different frequency, and together they uniquely identify every position. **This is a positional encoding!** Each "dimension" (bit) captures position information at a different resolution — fast-changing bits give fine-grained position, slow-changing bits give coarse position.
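
The flip pattern and the uniqueness claim can both be checked in a few lines. This is only a sketch over the 16 positions shown above; the variable names are ours.

```python
import numpy as np

positions = np.arange(16)

for b in range(4):
    bit = (positions >> b) & 1                  # bit b of every position
    flips = np.flatnonzero(np.diff(bit)) + 1    # positions at which the bit changes
    gaps = np.diff(flips)                       # steps between consecutive flips
    assert np.all(gaps == 2 ** b)
    print(f"bit{b}: flips every {2 ** b} step(s), full period {2 ** (b + 1)}")

# Stack the four bits into one code per position and count distinct codes.
codes = (positions[:, None] >> np.arange(4)) & 1
n_unique = len(np.unique(codes, axis=0))
print(f"{n_unique} distinct codes for {len(positions)} positions")
```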

But binary has a problem for neural networks: it's **discrete** (0 or 1) and the transitions are sharp. We want smooth, continuous values that are friendly to gradient-based learning.

The fix: replace square waves with **sinusoids**.

## Sinusoidal Positional Encoding

The original transformer paper (*Attention Is All You Need*, Vaswani et al. 2017) proposes:

$$PE(\text{pos}, 2i) = \sin\left(\frac{\text{pos}}{10000^{2i/d}}\right)$$

$$PE(\text{pos}, 2i+1) = \cos\left(\frac{\text{pos}}{10000^{2i/d}}\right)$$

Where:
- $\text{pos}$ is the token position (0, 1, 2, ...)
- $i$ is the dimension index (0, 1, ..., $d/2 - 1$)
- $d$ is the encoding dimension (= $d_{\text{model}}$)

Each pair of dimensions $(2i, 2i+1)$ forms a sin/cos pair at a specific frequency. The frequencies decrease geometrically from dimension 0 to dimension $d-1$:

| Dimensions | Wavelength | What it captures |
|-----------|------------|------------------|
| 0, 1 | $2\pi \approx 6.3$ positions | Fine-grained: distinguishes adjacent tokens |
| middle | ~hundreds of positions | Medium-scale patterns |
| $d-2$, $d-1$ | $2\pi \cdot 10000^{(d-2)/d}$, approaching $2\pi \cdot 10000 \approx 62{,}832$ positions for large $d$ | Coarse: "beginning" vs "end" of document |

Just like binary counting — but smooth and continuous.
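
To put numbers on the table, the wavelength of pair $i$ is $2\pi \cdot 10000^{2i/d}$. The sketch below evaluates it for $d = 64$; the helper `wavelengths` is a name introduced here, not part of the paper.

```python
import numpy as np

def wavelengths(d_model, base=10000.0):
    """Wavelength (in positions) of each sin/cos pair: 2*pi * base^(2i/d)."""
    two_i = np.arange(0, d_model, 2)
    return 2 * np.pi * base ** (two_i / d_model)

lam = wavelengths(d_model=64)
print(f"fastest pair: {lam[0]:.1f} positions")
print(f"middle pair:  {lam[len(lam) // 2]:.1f} positions")
print(f"slowest pair: {lam[-1]:.1f} positions")

# Consecutive wavelengths differ by a constant factor (a geometric progression).
ratios = lam[1:] / lam[:-1]
print(f"ratio between neighbouring pairs: {ratios[0]:.4f}")
```

Note that for a finite $d$ the slowest pair stays somewhat below $2\pi \cdot 10000$, because the largest exponent is $(d-2)/d$ rather than 1.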

In [None]:
def sinusoidal_pe(max_len, d_model):
    """Generate sinusoidal positional encoding matrix.
    
    Returns: (max_len, d_model) array where each row is the encoding for that position.
    """
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, np.newaxis]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[np.newaxis, :]      # (1, d_model/2)
    
    # Frequencies decrease geometrically: 1, 1/10000^(2/d), 1/10000^(4/d), ...
    angles = pos / np.power(10000, i / d_model)       # (max_len, d_model/2)
    
    pe[:, 0::2] = np.sin(angles)  # Even dimensions
    pe[:, 1::2] = np.cos(angles)  # Odd dimensions
    
    return pe

# Generate for a small example
pe = sinusoidal_pe(max_len=100, d_model=64)
print(f"Shape: {pe.shape}  (100 positions, 64 dimensions)")
print(f"Value range: [{pe.min():.1f}, {pe.max():.1f}]")
print()
print("Position 0 (first 8 dims):")
print(pe[0, :8])
print()
print("Position 1 (first 8 dims):")
print(pe[1, :8])
print()
print("Position 99 (first 8 dims):")
print(pe[99, :8])

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="dark")

pe = sinusoidal_pe(max_len=128, d_model=64)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: heatmap of the full PE matrix
im = axes[0].imshow(pe, aspect='auto', cmap='RdBu_r', vmin=-1, vmax=1)
axes[0].set_xlabel('Dimension')
axes[0].set_ylabel('Position')
axes[0].set_title('Sinusoidal Positional Encoding')
plt.colorbar(im, ax=axes[0], fraction=0.046)

# Right: individual sinusoids at different dimensions
positions = np.arange(128)
for dim, label in [(0, 'dim 0 (sin)'), (4, 'dim 4 (sin)'), (16, 'dim 16 (sin)'), (62, 'dim 62 (sin)')]:
    axes[1].plot(positions, pe[:, dim], label=label, alpha=0.8)
axes[1].set_xlabel('Position')
axes[1].set_ylabel('Value')
axes[1].set_title('Different Dimensions = Different Frequencies')
axes[1].legend(fontsize=9)
axes[1].set_ylim(-1.3, 1.3)

plt.tight_layout()
plt.show()

The heatmap shows the key structure:
- **Left columns** (low dimensions): high-frequency oscillations that change rapidly with position
- **Right columns** (high dimensions): low-frequency oscillations that change slowly
- Each row is a unique "fingerprint" for that position

This is exactly the binary counting pattern — but with smooth sinusoids instead of square waves.

## Key Properties

### Unique Encodings

Every position gets a distinct vector. We can verify this by checking that no two positions have the same encoding:

In [None]:
pe = sinusoidal_pe(max_len=1000, d_model=64)

# Compute all pairwise distances
# ||pe[i] - pe[j]||^2 = ||pe[i]||^2 + ||pe[j]||^2 - 2 * pe[i] . pe[j]
norms_sq = np.sum(pe ** 2, axis=1)
dists_sq = norms_sq[:, None] + norms_sq[None, :] - 2 * pe @ pe.T
dists = np.sqrt(np.maximum(0, dists_sq))  # Clamp to avoid float rounding issues

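# Mask the diagonal so that a position is not compared with itself
np.fill_diagonal(dists, np.inf)

print(f"Minimum distance between any two positions: {dists.min():.4f}")
print(f"This occurs between positions {np.unravel_index(dists.argmin(), dists.shape)}")
print()
# Only claim uniqueness if the smallest pairwise distance is clearly non-zero
print("Every position is distinct." if dists.min() > 1e-6 else "Found duplicate encodings!")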

### Distance Structure

Nearby positions should have similar encodings. Let's visualize the dot product between all pairs of position encodings — this reveals how the encoding captures distance:

In [None]:
pe = sinusoidal_pe(max_len=128, d_model=64)

# Dot product similarity between positions
similarity = pe @ pe.T

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: full similarity matrix
im = axes[0].imshow(similarity, cmap='RdBu_r')
axes[0].set_xlabel('Position')
axes[0].set_ylabel('Position')
axes[0].set_title('Dot Product Similarity Between Positions')
plt.colorbar(im, ax=axes[0], fraction=0.046)

# Right: similarity of position 0 with all other positions
axes[1].plot(similarity[0], label='sim(pos 0, pos j)', alpha=0.8)
axes[1].plot(similarity[32], label='sim(pos 32, pos j)', alpha=0.8)
axes[1].plot(similarity[64], label='sim(pos 64, pos j)', alpha=0.8)
axes[1].set_xlabel('Position j')
axes[1].set_ylabel('Dot Product')
axes[1].set_title('Similarity Decays with Distance')
axes[1].legend(fontsize=9)

plt.tight_layout()
plt.show()

The similarity matrix shows:
- **Strong diagonal**: each position is most similar to itself
- **Decay with distance**: similarity tends to drop as positions get farther apart, though not strictly monotonically (the curves are sums of cosines, so they wiggle)
- **Symmetry**: $\text{sim}(i, j) = \text{sim}(j, i)$ — the dot product is symmetric

This is exactly what we want — the encoding naturally captures "closeness" between positions.
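
One reason the picture looks so regular: since $\sin a \sin b + \cos a \cos b = \cos(a - b)$, the dot product collapses to $PE(p) \cdot PE(q) = \sum_i \cos(\omega_i (p - q))$ with $\omega_i = 1/10000^{2i/d}$, so it depends only on the offset $p - q$. The sketch below checks this numerically; `pe_table` is a compact restatement of the encoding defined earlier, written out so the block runs on its own.

```python
import numpy as np

def pe_table(max_len, d_model, base=10000.0):
    """Compact restatement of the sinusoidal encoding; also returns the frequencies."""
    pos = np.arange(max_len)[:, None]
    omega = 1.0 / base ** (np.arange(0, d_model, 2) / d_model)
    table = np.empty((max_len, d_model))
    table[:, 0::2] = np.sin(pos * omega)
    table[:, 1::2] = np.cos(pos * omega)
    return table, omega

table, omega = pe_table(max_len=128, d_model=64)
sim = table @ table.T

# Closed form: PE(p) . PE(q) = sum_i cos(omega_i * (p - q))
offsets = np.arange(128)[:, None] - np.arange(128)[None, :]
closed_form = np.cos(offsets[..., None] * omega).sum(axis=-1)

print(np.allclose(sim, closed_form))          # dot product depends only on p - q
print(np.isclose(sim[10, 14], sim[90, 94]))   # same gap, same similarity
```

This is the property the naive integer encoding lacked: there, the same gap of 4 gave very different similarities early and late in the sequence.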

### Relative Position via Linear Transformation

The most elegant property: for any fixed offset $k$, there exists a linear transformation $T_k$ such that:

$$PE(\text{pos} + k) = T_k \cdot PE(\text{pos})$$

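This suggests the model can learn to attend to relative positions using simple linear operations. The original paper presents this as a hypothesis motivating the design rather than a proven benefit.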

**Why?** Each sin/cos pair at frequency $\omega_i = 1/10000^{2i/d}$ behaves like a 2D rotation. Using the angle addition identities:

$$\begin{bmatrix} \sin(\omega_i(\text{pos}+k)) \\ \cos(\omega_i(\text{pos}+k)) \end{bmatrix} = \begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix} \begin{bmatrix} \sin(\omega_i \cdot \text{pos}) \\ \cos(\omega_i \cdot \text{pos}) \end{bmatrix}$$

The full $T_k$ is a block-diagonal matrix of $2 \times 2$ rotation matrices — one per frequency. Crucially, $T_k$ depends only on the offset $k$, not on the absolute position.

Let's verify this numerically:

In [None]:
d_model = 64
pe = sinusoidal_pe(max_len=200, d_model=d_model)

def build_rotation_matrix(k, d_model):
    """Build the block-diagonal rotation matrix T_k."""
    T = np.zeros((d_model, d_model))
    for i in range(0, d_model, 2):
        omega = 1.0 / np.power(10000, i / d_model)
        cos_k = np.cos(omega * k)
        sin_k = np.sin(omega * k)
        # 2x2 rotation block for dimensions (i, i+1)
        T[i, i] = cos_k
        T[i, i+1] = sin_k
        T[i+1, i] = -sin_k
        T[i+1, i+1] = cos_k
    return T

# Test: PE(pos + k) should equal T_k @ PE(pos)
k = 7  # Offset of 7 positions
T_k = build_rotation_matrix(k, d_model)

pos = 42  # Arbitrary starting position
pe_direct = pe[pos + k]          # Compute PE(pos + k) directly
pe_rotated = T_k @ pe[pos]       # Compute T_k @ PE(pos)

print(f"PE(pos={pos+k}) directly (first 8 dims):")
print(pe_direct[:8])
print()
print(f"T_{k} @ PE(pos={pos}) (first 8 dims):")
print(pe_rotated[:8])
print()
print(f"Max difference: {np.max(np.abs(pe_direct - pe_rotated)):.2e}")
print()

# This works for ANY starting position
errors = []
for p in range(100):
    errors.append(np.max(np.abs(pe[p + k] - T_k @ pe[p])))
print(f"Works for all positions: max error across pos 0-99 = {max(errors):.2e}")

This is powerful: the same matrix $T_k$ transforms **any** position's encoding to the encoding $k$ steps ahead. A single learned linear layer could, in principle, learn to compute "what's 3 positions to the left?" regardless of absolute position.
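
Because each block is a rotation, the shift matrices also compose cleanly: $T_j T_k = T_{j+k}$ and $T_k^\top = T_{-k}$. The sketch below checks both; `shift_matrix` is a vectorized variant of `build_rotation_matrix` above, renamed so the block is self-contained.

```python
import numpy as np

def shift_matrix(k, d_model, base=10000.0):
    """Block-diagonal T_k with one 2x2 rotation block per frequency."""
    even = np.arange(0, d_model, 2)
    omega = 1.0 / base ** (even / d_model)
    c, s = np.cos(omega * k), np.sin(omega * k)
    T = np.zeros((d_model, d_model))
    T[even, even] = c
    T[even, even + 1] = s
    T[even + 1, even] = -s
    T[even + 1, even + 1] = c
    return T

d = 64
T3, T4, T7 = (shift_matrix(k, d) for k in (3, 4, 7))

print(np.allclose(T3 @ T4, T7))                 # shifts compose: T_3 T_4 = T_7
print(np.allclose(T7.T @ T7, np.eye(d)))        # T_k is orthogonal
print(np.allclose(T7.T, shift_matrix(-7, d)))   # its inverse is the shift by -k
```

Orthogonality also means a shift never changes the norm of an encoding, only its direction.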

## How It's Used in Practice

Positional encoding is **added** to the token embeddings before they enter the transformer:

$$\text{Input} = \text{TokenEmbedding}(x) + PE$$

Both have the same shape — $(\text{seq\_len}, d_{\text{model}})$ — so the addition is elementwise.

In [None]:
# Let's verify that adding PE breaks the permutation equivariance
seq_len, d_model, d_head = 6, 64, 64
np.random.seed(42)

X = np.random.randn(seq_len, d_model)
W_Q = np.random.randn(d_model, d_head)
W_K = np.random.randn(d_model, d_head)
W_V = np.random.randn(d_model, d_head)

pe = sinusoidal_pe(seq_len, d_model)

# Original: add PE then compute attention
output_original = attention(X + pe, W_Q, W_K, W_V)

# Shuffled: shuffle tokens, add PE (for the NEW positions), compute attention
perm = np.array([5, 3, 1, 4, 0, 2])
inv_perm = np.argsort(perm)
X_shuffled = X[perm]
output_shuffled = attention(X_shuffled + pe, W_Q, W_K, W_V)

# Un-shuffle and compare
output_unshuffled = output_shuffled[inv_perm]

diff = np.max(np.abs(output_original - output_unshuffled))
print("Without PE: outputs are identical after un-shuffling (as shown above)")
print(f"With PE:    max difference after un-shuffling = {diff:.4f}")
print()
print("The outputs are now DIFFERENT: attention is position-aware!")
print()
print("Why? The token at index 1 gets PE(1), but after shuffling it")
print("lands at index 2 and gets PE(2). Different position →")
print("different encoding → different attention pattern.")

Implementation notes:
- PE is computed **once** at initialization and cached. It is just a lookup table, so the runtime cost is a slice and an addition
- The encoding dimension matches $d_{\text{model}}$ so it can be directly added
- In the original transformer, dropout is applied to the sum of the token embeddings and the positional encodings
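
The notes above could be sketched roughly as follows. `PositionalEncoder` is a hypothetical helper written for this notebook, not an API from any library, and the default dropout rate of 0.1 is an assumption taken from the paper's base configuration.

```python
import numpy as np

class PositionalEncoder:
    """Hypothetical helper: caches the sinusoidal table and adds it to embeddings."""

    def __init__(self, max_len, d_model, dropout=0.1, seed=0):
        pos = np.arange(max_len)[:, None]
        omega = 1.0 / 10000.0 ** (np.arange(0, d_model, 2) / d_model)
        self.table = np.empty((max_len, d_model))      # computed once, then reused
        self.table[:, 0::2] = np.sin(pos * omega)
        self.table[:, 1::2] = np.cos(pos * omega)
        self.dropout = dropout
        self.rng = np.random.default_rng(seed)

    def __call__(self, token_embeddings, training=False):
        seq_len = token_embeddings.shape[0]
        out = token_embeddings + self.table[:seq_len]  # slice the cached rows
        if training and self.dropout > 0:
            keep = self.rng.random(out.shape) >= self.dropout
            out = out * keep / (1.0 - self.dropout)    # inverted dropout
        return out

encoder = PositionalEncoder(max_len=512, d_model=64)
embeddings = np.zeros((10, 64))    # stand-in for a real token-embedding lookup
encoded = encoder(embeddings)
print(encoded.shape)
```

With all-zero stand-in embeddings, `encoded` is simply the first 10 rows of the cached table, which makes the slicing easy to verify.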

## Limitations and What's Next

Sinusoidal positional encoding works — it was used in the original transformer and many subsequent models. But it has a fundamental limitation: it encodes **absolute** position.

Each token gets a fixed vector based on where it sits in the sequence. The model must then *learn* to extract relative position information from the difference of absolute encodings. This works, but it's indirect.

**What if we could encode relative position directly into the attention computation?** Instead of modifying the input embeddings, what if the attention scores themselves were aware of the distance between query and key positions?

That's the idea behind **Rotary Position Embedding (RoPE)**, which we'll explore in Part 2. RoPE rotates the query and key vectors based on their positions, so that the dot product $q_i \cdot k_j$ naturally depends on the offset $i - j$. This makes relative position a first-class citizen of the attention mechanism.
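
As a small preview, here is a two-dimensional sketch of that idea. It is not the full RoPE formulation (which applies such rotations to many 2D pairs with different frequencies); the frequency `theta` and the vectors `q` and `k` are arbitrary values picked for illustration.

```python
import numpy as np

def rotate(vec, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]]) @ vec

theta = 0.3                    # illustrative frequency, not taken from any model
q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

def score(i, j):
    """Dot product of q rotated to position i with k rotated to position j."""
    return rotate(q, i * theta) @ rotate(k, j * theta)

print(np.isclose(score(2, 5), score(40, 43)))   # same offset: scores agree
print(np.isclose(score(2, 5), score(2, 6)))     # different offset: scores differ
```

The score is unchanged when both positions move by the same amount, which is the sense in which the attention score "sees" only the offset.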

**Continue to [RoPE](./rope.ipynb) (coming soon) to see how!**