# Positional Encoding Improvements: RoPE & ALiBi

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adiel2012/deep-learning-abc/blob/main/rope_alibi.ipynb)

This notebook implements from scratch two widely adopted improvements to the original Transformer's positional encoding:

1. **RoPE** (Rotary Positional Embeddings) — rotates Q and K vectors in 2D subspaces
2. **ALiBi** (Attention with Linear Biases) — adds distance-based penalties to attention scores

We compare them with the original sinusoidal PE and visualize their key properties.

In [None]:
!pip install torch matplotlib

In [None]:
import torch
import matplotlib.pyplot as plt
import math

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

## 0. Mathematical Foundations

### Why Improve Positional Encoding?

The original Transformer used fixed sinusoidal positional encodings added to token embeddings:

$$\text{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad \text{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
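As a quick sanity check of the formula, the sketch below evaluates it by hand for a toy width of $d = 4$, using only the standard library. The helper name `sinusoidal_row` is made up for this illustration and is not part of the notebook's later code.

```python
import math

def sinusoidal_row(pos, d_model):
    """Evaluate PE(pos, .) directly from the formula (illustrative helper)."""
    row = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        row.append(math.sin(angle))  # even index 2i
        row.append(math.cos(angle))  # odd index 2i+1
    return row

# Position 0 gives sin(0) = 0 and cos(0) = 1 in every pair.
print(sinusoidal_row(0, 4))
# Position 1: the first pair uses angle 1, the second pair uses angle 1/100.
print(sinusoidal_row(1, 4))
```

The second pair changes far more slowly with position than the first, which is the multi-frequency behavior the rest of the notebook builds on.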

**Problems with sinusoidal PE:**
- Encodes **absolute** position: the token at position 5 always receives the same encoding, whatever its neighbors are
- Poor **extrapolation** to sequences longer than those seen during training
- Position information is injected only once, at the input, so it can fade in deeper layers

### RoPE: Encoding Position via Rotation

RoPE (Su et al., 2021) encodes position by **rotating** query and key vectors rather than adding a vector.

For the $i$-th pair of dimensions $(x_{2i}, x_{2i+1})$ at position $m$, RoPE applies a 2D rotation by the angle $m\theta_i$:

$$\begin{pmatrix} x_{2i}' \\ x_{2i+1}' \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$

where $\theta_i = 10000^{-2i/d}$ for $i = 0, \dots, d/2 - 1$ (the same frequency scheme as sinusoidal PE).

**Key property:** The dot product between two rotated vectors depends only on the **relative** distance:

$$\langle R_m \mathbf{q}, R_n \mathbf{k} \rangle = \langle R_{m-n} \mathbf{q}, \mathbf{k} \rangle$$

This gives us relative position information for free!
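The identity follows from two standard facts about rotation matrices: the transpose of a rotation is the inverse rotation, and rotations in the same plane compose by adding angles. Writing $R_m$ for the rotation applied at position $m$, so that $R_m^\top = R_{-m}$ and $R_a R_b = R_{a+b}$, a short derivation is:

$$\begin{aligned}
\langle R_m \mathbf{q}, R_n \mathbf{k} \rangle
&= \mathbf{q}^\top R_m^\top R_n \mathbf{k} \\
&= \mathbf{q}^\top R_{n-m} \mathbf{k} \\
&= (R_{m-n} \mathbf{q})^\top \mathbf{k} \\
&= \langle R_{m-n} \mathbf{q}, \mathbf{k} \rangle
\end{aligned}$$

The absolute positions $m$ and $n$ appear only through their difference, which is why no extra relative-position term has to be learned.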

### ALiBi: No Embeddings, Just Bias

ALiBi (Press et al., 2022) takes a radically different approach — it adds no positional information to the embeddings at all. Instead, it adds a **linear bias** to the attention scores:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + \mathbf{B}\right) V$$

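where $B_{ij} = -m \cdot |i - j|$ and $m$ is a head-specific slope (unrelated to the position index $m$ used for RoPE above). The original paper applies this bias in a causal language model, where each query only attends to earlier keys; this notebook uses the symmetric $|i - j|$ form without a mask.

The slopes follow a geometric sequence: for $h$ heads, $m_k = 2^{-8k/h}$ for $k = 1, \dots, h$. The notebook assumes $h$ is a power of two when using this closed form.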

**Key property:** Nearby tokens get small penalties, distant tokens get large penalties. Each head has a different "distance sensitivity."
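To make the slope formula concrete, here is a small standard-library sketch for $h = 4$ heads. The helper names `toy_alibi_slopes` and `toy_bias_row` are hypothetical and exist only for this example; the notebook's own tensor-based versions appear in Section 4.

```python
def toy_alibi_slopes(n_heads):
    """Slopes m_k = 2^(-8k/h) for k = 1..h (illustrative helper)."""
    return [2 ** (-8 * k / n_heads) for k in range(1, n_heads + 1)]

def toy_bias_row(query_pos, seq_len, slope):
    """Bias B_ij = -m * |i - j| for a single query position i."""
    return [-slope * abs(query_pos - j) for j in range(seq_len)]

slopes = toy_alibi_slopes(4)
print(slopes)  # [0.25, 0.0625, 0.015625, 0.00390625]

# Bias seen by the query at position 2 in a 5-token sequence, steepest head.
print(toy_bias_row(2, 5, slopes[0]))
```

For the steepest head the penalty reaches 0.5 two tokens away, while for the gentlest head the same distance costs less than 0.01, so different heads effectively look at different ranges.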

## 1. Original Sinusoidal PE (Baseline)

In [None]:
def sinusoidal_pe(max_len, d_model, device=None):
    """Original sinusoidal positional encoding from 'Attention Is All You Need'."""
    pe = torch.zeros(max_len, d_model, device=device)
    position = torch.arange(0, max_len, dtype=torch.float, device=device).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float, device=device) * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Quick test
pe = sinusoidal_pe(10, 8, device=device)
print('Sinusoidal PE shape:', pe.shape)
print('PE[0]:', pe[0].cpu())
print('PE[1]:', pe[1].cpu())

## 2. RoPE — Rotary Positional Embeddings

RoPE works by pairing dimensions $(0,1), (2,3), (4,5), \ldots$ and applying a 2D rotation to each pair. The rotation angle increases with position and varies by dimension pair.
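Before the tensor implementation, it may help to see the rotation act on a single pair. The sketch below is a minimal standard-library illustration; `rotate_pair` is a made-up helper, and the choice $\theta = \pi/2$ is only there to make each position step a visible quarter turn (real RoPE frequencies come from $\theta_i = 10000^{-2i/d}$).

```python
import math

def rotate_pair(x0, x1, position, theta):
    """Rotate the pair (x0, x1) by the angle position * theta."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x0 * c - x1 * s, x0 * s + x1 * c)

# Assumed toy frequency: a quarter turn per position step.
theta = math.pi / 2
for position in range(4):
    x0, x1 = rotate_pair(1.0, 0.0, position, theta)
    # The last column is the vector length, which rotation leaves unchanged.
    print(position, round(x0, 6), round(x1, 6), round(math.hypot(x0, x1), 6))
```

The pair moves around the unit circle as the position grows while its length stays at 1, so position is stored in the direction of each pair and the magnitude carried by the content is untouched.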

In [None]:
def rope_frequencies(d_model, max_len, base=10000.0, device=None):
    """Compute RoPE rotation angles for each position and dimension pair.
    
    Returns:
        freqs: (max_len, d_model//2) — rotation angle for each (position, dimension_pair)
    """
    # theta_i = base^(-2i/d) for i = 0, 1, ..., d/2-1
    dim_indices = torch.arange(0, d_model, 2, dtype=torch.float, device=device)
    theta = 1.0 / (base ** (dim_indices / d_model))  # (d_model//2,)
    
    # positions
    positions = torch.arange(max_len, dtype=torch.float, device=device)  # (max_len,)
    
    # outer product: angle = position * theta
    freqs = torch.outer(positions, theta)  # (max_len, d_model//2)
    return freqs

# Visualize the rotation frequencies
freqs = rope_frequencies(16, 64, device=device)
print('Frequencies shape:', freqs.shape)
print('Angles at position 0:', freqs[0].cpu())
print('Angles at position 1:', freqs[1].cpu())

In [None]:
def apply_rope(x, freqs):
    """Apply rotary positional embeddings to input tensor.
    
    Args:
        x: (..., seq_len, d_model) — query or key vectors
        freqs: (seq_len, d_model//2) — rotation angles
    
    Returns:
        Rotated tensor of same shape as x
    """
    d_model = x.shape[-1]
    
    # Split into pairs: (x0, x1), (x2, x3), ...
    # reshape (rather than view) also works when x is non-contiguous, e.g. after a transpose
    x_pairs = x.reshape(*x.shape[:-1], d_model // 2, 2)  # (..., seq_len, d_model//2, 2)
    x_even = x_pairs[..., 0]  # (..., seq_len, d_model//2)
    x_odd = x_pairs[..., 1]   # (..., seq_len, d_model//2)
    
    # Get cos and sin of rotation angles
    cos_f = torch.cos(freqs)  # (seq_len, d_model//2)
    sin_f = torch.sin(freqs)  # (seq_len, d_model//2)
    
    # Apply 2D rotation: [cos -sin; sin cos] @ [x_even; x_odd]
    out_even = x_even * cos_f - x_odd * sin_f
    out_odd  = x_even * sin_f + x_odd * cos_f
    
    # Interleave back
    out = torch.stack([out_even, out_odd], dim=-1)  # (..., seq_len, d_model//2, 2)
    return out.view(*x.shape)  # (..., seq_len, d_model)

# Test: rotation preserves vector magnitude
x = torch.randn(1, 8, 16, device=device)  # (batch, seq_len, d_model)
freqs = rope_frequencies(16, 8, device=device)
x_rotated = apply_rope(x, freqs)

print('Original norms:', torch.norm(x, dim=-1).cpu())
print('Rotated norms: ', torch.norm(x_rotated, dim=-1).cpu())
print('Norms preserved:', torch.allclose(torch.norm(x, dim=-1), torch.norm(x_rotated, dim=-1), atol=1e-5))

### Verifying: RoPE Encodes Relative Position

The key mathematical property: the dot product of two RoPE-rotated vectors depends only on their **distance**, not their absolute positions.

In [None]:
# Demonstrate that q_m @ k_n depends only on (m-n)
d_model = 16
max_len = 20
freqs = rope_frequencies(d_model, max_len, device=device)

# Fixed q and k vectors (same for all positions)
torch.manual_seed(42)
q_vec = torch.randn(d_model, device=device)
k_vec = torch.randn(d_model, device=device)

# Compute dot product for different absolute positions but same relative distance
print('Dot products for SAME relative distance (gap=3) at different absolute positions:')
for m in [0, 3, 7, 12]:
    n = m + 3  # always gap of 3
    q_rot = apply_rope(q_vec.unsqueeze(0), freqs[m:m+1])
    k_rot = apply_rope(k_vec.unsqueeze(0), freqs[n:n+1])
    dot = (q_rot * k_rot).sum().item()
    print(f'  pos ({m},{n}): dot = {dot:.6f}')

print('\nDot products for DIFFERENT relative distances (from position 0):')
for gap in [0, 1, 3, 5, 10]:
    q_rot = apply_rope(q_vec.unsqueeze(0), freqs[0:1])
    k_rot = apply_rope(k_vec.unsqueeze(0), freqs[gap:gap+1])
    dot = (q_rot * k_rot).sum().item()
    print(f'  gap={gap}: dot = {dot:.6f}')
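The printed values above can also be checked programmatically. The following sketch re-implements the pairwise rotation in plain Python so it runs on its own; `toy_rope` and `dot` are hypothetical helpers, and the dimension (8) and the tolerance are arbitrary choices for the illustration.

```python
import math
import random

def toy_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position * theta_i (pure-Python sketch)."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(position * theta), math.sin(position * theta)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
q = [random.gauss(0, 1) for _ in range(8)]
k = [random.gauss(0, 1) for _ in range(8)]

# Same gap (n - m = 3) at different absolute positions.
scores = [dot(toy_rope(q, m), toy_rope(k, m + 3)) for m in (0, 3, 7, 12)]
assert all(math.isclose(s, scores[0], abs_tol=1e-9) for s in scores)
print("gap=3 scores agree:", [round(s, 6) for s in scores])
```

The assertion passes up to floating-point error, which is the numerical counterpart of the relative-position identity from Section 0.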

In [None]:
# Visualize RoPE rotation frequencies
d_model = 64
max_len = 128
freqs = rope_frequencies(d_model, max_len, device=device)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Heatmap of rotation angles
im = axes[0].imshow(freqs.cpu().numpy(), aspect='auto', cmap='RdBu')
axes[0].set_xlabel('Dimension pair index')
axes[0].set_ylabel('Position')
axes[0].set_title('RoPE Rotation Angles')
plt.colorbar(im, ax=axes[0])

# Show cosine of angles (what actually multiplies the vectors)
cos_vals = torch.cos(freqs)
im2 = axes[1].imshow(cos_vals.cpu().numpy(), aspect='auto', cmap='RdBu')
axes[1].set_xlabel('Dimension pair index')
axes[1].set_ylabel('Position')
axes[1].set_title('cos(angle) — Rotation Factor')
plt.colorbar(im2, ax=axes[1])

plt.tight_layout()
plt.show()

**Observation:** Low-index dimension pairs rotate fast (high frequency), high-index pairs rotate slowly (low frequency). This mirrors the sinusoidal PE design — but applied as rotation, not addition.
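To put rough numbers on this observation, the sketch below computes how many positions each pair needs for one full turn, i.e. the wavelength $2\pi / \theta_i$, for the same $d = 64$ used in the plot. The `wavelength` helper is an illustrative name, not part of the notebook.

```python
import math

d_model = 64
base = 10000.0

def wavelength(pair_index):
    """Positions needed for pair `pair_index` to complete one full turn."""
    theta = base ** (-2 * pair_index / d_model)
    return 2 * math.pi / theta

print(f"pair 0:  {wavelength(0):.2f} positions per turn")  # 6.28
print(f"pair 31: {wavelength(31):.0f} positions per turn")
```

The fastest pair completes a turn roughly every six positions, whereas the slowest pair needs tens of thousands of positions, so over the 128 positions plotted above it barely moves. That is why the right-hand columns of the cosine heatmap look almost constant.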

## 3. RoPE Attention — Putting It All Together

In [None]:
def attention_with_rope(Q, K, V, d_model, device=None):
    """Scaled dot-product attention with RoPE applied to Q and K.
    
    Args:
        Q, K: (batch, seq_len, d_model)
        V: (batch, seq_len, d_model)
    """
    seq_len = Q.shape[1]
    d_k = Q.shape[-1]
    
    # Compute rotation frequencies
    freqs = rope_frequencies(d_model, seq_len, device=device)
    
    # Apply RoPE to Q and K (NOT to V — V carries content, not position)
    Q_rot = apply_rope(Q, freqs)
    K_rot = apply_rope(K, freqs)
    
    # Standard scaled dot-product attention
    scores = torch.bmm(Q_rot, K_rot.transpose(1, 2)) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    output = torch.bmm(weights, V)
    
    return output, weights

# Test
torch.manual_seed(42)
batch, seq_len, d_model = 1, 10, 16
Q = torch.randn(batch, seq_len, d_model, device=device)
K = torch.randn(batch, seq_len, d_model, device=device)
V = torch.randn(batch, seq_len, d_model, device=device)

out, attn_w = attention_with_rope(Q, K, V, d_model, device=device)
print('Output shape:', out.shape)
print('Attention weights shape:', attn_w.shape)

# Visualize attention pattern
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(attn_w[0].detach().cpu().numpy(), cmap='Blues')
ax.set_xlabel('Key position')
ax.set_ylabel('Query position')
ax.set_title('Attention Weights with RoPE')
plt.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()

### RoPE's Natural Decay Property

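RoPE has a useful built-in tendency: attention scores tend to shrink as the distance between tokens grows. The decay is a long-range trend rather than a monotonic drop, because each dimension pair contributes an oscillating cosine term, as the plot below shows for identical q and k.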

In [None]:
# Show attention score decay with distance in RoPE
d_model = 64
max_len = 100
freqs = rope_frequencies(d_model, max_len, device=device)

# Use identical q and k to isolate positional effect
torch.manual_seed(0)
q = torch.randn(d_model, device=device)
k = q.clone()

# Compute dot product as a function of distance
distances = list(range(max_len))
dots = []
for dist in distances:
    q_rot = apply_rope(q.unsqueeze(0), freqs[0:1])  # position 0
    k_rot = apply_rope(k.unsqueeze(0), freqs[dist:dist+1])  # position dist
    dots.append((q_rot * k_rot).sum().item())

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(distances, dots, linewidth=2)
ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax.set_xlabel('Distance between tokens')
ax.set_ylabel('Dot product (attention score)')
ax.set_title('RoPE: Natural Attention Decay with Distance')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 4. ALiBi — Attention with Linear Biases

ALiBi takes a completely different approach: instead of encoding positions into the embeddings, it adds a distance-based penalty directly to the attention scores.

$$\text{score}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}} - m \cdot |i - j|$$

where $m$ is a per-head slope. Slopes follow a geometric sequence: $m_k = 2^{-8k/h}$.

In [None]:
def alibi_slopes(n_heads):
    """Compute ALiBi slopes for each attention head.
    
    For n_heads = 8: slopes = [1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256]
    """
    # Base ratio: 2^(-8/n_heads)
    ratio = 2 ** (-8 / n_heads)
    slopes = [ratio ** (i + 1) for i in range(n_heads)]
    return torch.tensor(slopes)

def alibi_bias(seq_len, n_heads, device=None):
    """Compute the ALiBi bias matrix for all heads.
    
    Returns:
        bias: (n_heads, seq_len, seq_len) — negative distance penalties
    """
    slopes = alibi_slopes(n_heads).to(device)  # (n_heads,)
    
    # Distance matrix: |i - j|
    positions = torch.arange(seq_len, device=device)
    dist = (positions.unsqueeze(0) - positions.unsqueeze(1)).abs().float()  # (seq_len, seq_len)
    
    # bias = -slope * distance, per head
    bias = -slopes.view(n_heads, 1, 1) * dist.unsqueeze(0)  # (n_heads, seq_len, seq_len)
    return bias

# Test
slopes = alibi_slopes(8)
print('ALiBi slopes for 8 heads:', slopes)
print('Sum:', slopes.sum().item())

bias = alibi_bias(6, 4, device=device)
print('\nBias matrix for head 0 (steepest slope):')
print(bias[0].cpu())
print('\nBias matrix for head 3 (gentlest slope):')
print(bias[3].cpu())

In [None]:
# Visualize ALiBi bias matrices across heads
n_heads = 8
seq_len = 32
bias = alibi_bias(seq_len, n_heads, device=device)

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
slopes = alibi_slopes(n_heads)

for h in range(n_heads):
    ax = axes[h // 4, h % 4]
    im = ax.imshow(bias[h].cpu().numpy(), cmap='RdBu', vmin=bias.min().item(), vmax=0)
    ax.set_title(f'Head {h} (slope={slopes[h]:.4f})')
    ax.set_xlabel('Key pos')
    ax.set_ylabel('Query pos')

plt.suptitle('ALiBi Bias Matrices — Each Head Has Different Distance Sensitivity', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
def attention_with_alibi(Q, K, V, n_heads, device=None):
    """Multi-head attention with ALiBi positional biases.
    
    Args:
        Q, K, V: (batch, seq_len, d_model)
        n_heads: number of attention heads
    """
    batch, seq_len, d_model = Q.shape
    d_k = d_model // n_heads
    
    # Reshape for multi-head: (batch, n_heads, seq_len, d_k)
    Q = Q.view(batch, seq_len, n_heads, d_k).transpose(1, 2)
    K = K.view(batch, seq_len, n_heads, d_k).transpose(1, 2)
    V = V.view(batch, seq_len, n_heads, d_k).transpose(1, 2)
    
    # Attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)  # (batch, n_heads, seq, seq)
    
    # Add ALiBi bias
    bias = alibi_bias(seq_len, n_heads, device=device)  # (n_heads, seq, seq)
    scores = scores + bias.unsqueeze(0)  # broadcast over batch
    
    weights = torch.softmax(scores, dim=-1)
    output = torch.matmul(weights, V)  # (batch, n_heads, seq_len, d_k)
    
    # Reshape back
    output = output.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
    return output, weights

# Test
torch.manual_seed(42)
batch, seq_len, d_model, n_heads = 1, 12, 16, 4
Q = torch.randn(batch, seq_len, d_model, device=device)
K = torch.randn(batch, seq_len, d_model, device=device)
V = torch.randn(batch, seq_len, d_model, device=device)

out, weights = attention_with_alibi(Q, K, V, n_heads, device=device)
print('Output shape:', out.shape)
print('Weights shape:', weights.shape)

In [None]:
# Visualize ALiBi attention weights per head
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
slopes = alibi_slopes(n_heads)

for h in range(n_heads):
    ax = axes[h]
    im = ax.imshow(weights[0, h].detach().cpu().numpy(), cmap='Blues')
    ax.set_title(f'Head {h} (slope={slopes[h]:.4f})')
    ax.set_xlabel('Key pos')
    ax.set_ylabel('Query pos')

plt.suptitle('ALiBi Attention Weights — Steeper Slopes = More Local Attention', fontsize=13)
plt.tight_layout()
plt.show()
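The link between slope and locality can be isolated by setting every content score to zero, so that only the ALiBi bias shapes the softmax. The sketch below does this in plain Python under that simplifying assumption; `softmax` and `alibi_only_weights` are helpers written for this example, and the two slopes are the steepest and gentlest of the 4-head setup used above.

```python
import math

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def alibi_only_weights(query_pos, seq_len, slope):
    """Attention weights when all content scores are 0, so only the bias acts."""
    return softmax([-slope * abs(query_pos - j) for j in range(seq_len)])

for slope in (0.25, 2 ** -8):
    w = alibi_only_weights(query_pos=0, seq_len=16, slope=slope)
    print(f"slope={slope:.4f}: weight on the 4 nearest keys = {sum(w[:4]):.3f}")
```

With the steep slope most of the attention mass sits on the four nearest keys, while with the gentle slope the distribution is close to uniform (about a quarter of the mass on a quarter of the keys). In a real model the content scores are added on top, so these numbers only describe the prior that each head starts from.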

## 5. Length Extrapolation Comparison

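One motivation for both RoPE and ALiBi is handling sequences **longer** than those seen during training. No model is trained in this notebook, so the cell below only compares the attention maps each scheme produces on random inputs at length 64, with red lines marking a hypothetical training length of 16.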

In [None]:
def attention_no_pe(Q, K, V):
    """Standard attention without any positional encoding."""
    d_k = Q.shape[-1]
    scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    output = torch.bmm(weights, V)
    return output, weights

def attention_sinusoidal(Q, K, V, d_model, device=None):
    """Attention with sinusoidal PE added to Q and K."""
    seq_len = Q.shape[1]
    pe = sinusoidal_pe(seq_len, d_model, device=device)
    Q_pe = Q + pe.unsqueeze(0)
    K_pe = K + pe.unsqueeze(0)
    return attention_no_pe(Q_pe, K_pe, V)

# Nothing is trained here: train_len only marks a hypothetical training length on the plots
torch.manual_seed(42)
d_model = 32
train_len = 16
test_len = 64

# Generate test data at long length
Q = torch.randn(1, test_len, d_model, device=device)
K = torch.randn(1, test_len, d_model, device=device)
V = torch.randn(1, test_len, d_model, device=device)

# All three methods at extended length
_, w_sine = attention_sinusoidal(Q, K, V, d_model, device=device)
_, w_rope = attention_with_rope(Q, K, V, d_model, device=device)

# ALiBi with a single head for comparison (alibi_slopes(1) gives a gentle slope of 2**-8)
bias_1head = alibi_bias(test_len, 1, device=device)  # (1, seq, seq)
scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(d_model)
scores_alibi = scores + bias_1head
w_alibi = torch.softmax(scores_alibi, dim=-1)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

methods = [('Sinusoidal PE', w_sine), ('RoPE', w_rope), ('ALiBi', w_alibi)]
for ax, (name, w) in zip(axes, methods):
    im = ax.imshow(w[0].detach().cpu().numpy(), cmap='Blues', aspect='auto')
    ax.set_xlabel('Key position')
    ax.set_ylabel('Query position')
    ax.set_title(f'{name} (seq_len={test_len})')
    ax.axvline(x=train_len, color='red', linestyle='--', alpha=0.7, label=f'Train boundary ({train_len})')
    ax.axhline(y=train_len, color='red', linestyle='--', alpha=0.7)
    ax.legend(fontsize=8)
    plt.colorbar(im, ax=ax)

plt.suptitle('Attention at 64 tokens (untrained; red lines mark a hypothetical training length of 16)', fontsize=14)
plt.tight_layout()
plt.show()

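**Observations** (the maps come from random, untrained Q and K, so they show each scheme's built-in bias rather than measured extrapolation quality):
- **Sinusoidal PE**: Scores mix content with absolute-position terms, so whatever a model learns at positions below 16 has no guarantee of carrying over to positions it never saw.
- **RoPE**: Position enters the scores only through the offset between query and key, so the same rule applies at every absolute position.
- **ALiBi**: The linear penalty favors nearby keys at any length. With the single head used here the slope is only $2^{-8}$, so the effect is mild; multi-head setups include much steeper slopes.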

## 6. Comparison Summary

Let's put the key properties side by side.

In [None]:
# Quantitative comparison: attention entropy (spread) at different sequence lengths
def attention_entropy(weights):
    """Shannon entropy of attention weights — higher = more spread out."""
    # Clamp to avoid log(0)
    w = weights.clamp(min=1e-10)
    return -(w * w.log()).sum(dim=-1).mean().item()

seq_lengths = [8, 16, 32, 64, 128]
d_model = 32
torch.manual_seed(42)

entropies = {'Sinusoidal': [], 'RoPE': [], 'ALiBi': []}

for sl in seq_lengths:
    Q = torch.randn(1, sl, d_model, device=device)
    K = torch.randn(1, sl, d_model, device=device)
    V = torch.randn(1, sl, d_model, device=device)
    
    _, w = attention_sinusoidal(Q, K, V, d_model, device=device)
    entropies['Sinusoidal'].append(attention_entropy(w))
    
    _, w = attention_with_rope(Q, K, V, d_model, device=device)
    entropies['RoPE'].append(attention_entropy(w))
    
    bias = alibi_bias(sl, 1, device=device)
    scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(d_model)
    w = torch.softmax(scores + bias, dim=-1)
    entropies['ALiBi'].append(attention_entropy(w))

fig, ax = plt.subplots(figsize=(10, 5))
for name, ent in entropies.items():
    ax.plot(seq_lengths, ent, 'o-', linewidth=2, markersize=8, label=name)

ax.set_xlabel('Sequence Length')
ax.set_ylabel('Attention Entropy (nats)')
ax.set_title('Attention Spread vs Sequence Length\n(Lower = more focused, Higher = more diffuse)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print('How to read this (random, untrained Q/K, so trends are illustrative only):')
print('  - Sinusoidal PE: no explicit distance penalty on the scores')
print('  - RoPE: position enters only through the relative offset between query and key')
print('  - ALiBi: explicit linear penalty → locality grows with the slope (2**-8 for the single head here)')
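To calibrate the entropy axis, it helps to know the two extremes: uniform attention over $n$ keys has entropy $\ln n$, and attention placed entirely on one key has entropy 0. The standard-library sketch below checks both; `entropy` is a plain-Python stand-in written for this example, not the tensor-based `attention_entropy` above.

```python
import math

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return sum(-p * math.log(p) for p in probs if p > 0)

n = 64
uniform = [1 / n] * n
one_hot = [1.0] + [0.0] * (n - 1)

print(f"uniform over {n} keys: {entropy(uniform):.3f} nats (ln {n} = {math.log(n):.3f})")
print(f"one-hot: {entropy(one_hot):.3f} nats")
```

A curve that tracks $\ln(\text{seq\_len})$ on the plot therefore means attention is close to uniform, and the gap below that line measures how much a scheme concentrates attention.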

In [None]:
# Summary table
print('=' * 80)
print('COMPARISON: Positional Encoding Methods')
print('=' * 80)
print(f'{"Property":<30} {"Sinusoidal":<18} {"RoPE":<18} {"ALiBi":<18}')
print('-' * 80)
print(f'{"Position type":<30} {"Absolute":<18} {"Relative":<18} {"Relative":<18}')
print(f'{"Where applied":<30} {"Added to embed":<18} {"Rotates Q,K":<18} {"Bias on scores":<18}')
print(f'{"Extra parameters":<30} {"0":<18} {"0":<18} {"0":<18}')
print(f'{"Extrapolation":<30} {"Poor":<18} {"Good":<18} {"Excellent":<18}')
print(f'{"Used in":<30} {"Original TF":<18} {"LLaMA,Mistral":<18} {"BLOOM,MPT":<18}')
print(f'{"Status (2024)":<30} {"Legacy":<18} {"Industry std":<18} {"Niche":<18}')
print('=' * 80)

## Summary

In this notebook we implemented from scratch:

1. **RoPE (Rotary Positional Embeddings)**
   - Pairs dimensions and applies 2D rotations at position-dependent angles
   - Key formula: $R(m\theta)$ rotation matrix applied to Q and K
   - Dot product depends only on relative distance → relative position encoding
   - Scores tend to decay with distance (a long-range trend with oscillations, not a monotonic drop)
   - Used in LLaMA, Mistral, PaLM, GPT-NeoX — current industry standard

2. **ALiBi (Attention with Linear Biases)**
   - Adds $-m \cdot |i-j|$ bias directly to attention scores
   - Per-head slopes create multi-scale distance sensitivity
   - No modification to embeddings at all
   - Designed for length extrapolation: the original paper reports evaluating on sequences longer than those used in training
   - Used in BLOOM, MPT

**Key insight:** Both methods encode *relative* position rather than absolute position, enabling better generalization to longer sequences. RoPE achieves this through geometric rotation; ALiBi through arithmetic penalty.