# Relative positional embedding in self-attention 

Self-attention is permutation-equivariant: without positional information, permuting the input tokens simply permutes the outputs, so you must inject order. **Relative** positional embeddings (RPE) tell the model *how far (and in what direction)* two tokens are from each other, instead of where each token sits on an absolute axis. In practice this tends to improve length generalization, gives translation invariance, and saves parameters.

---

## 1) Absolute vs. Relative (at the logits level)

Vanilla attention (per head) uses:
$$
\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,
$$
where $Q,K,V\in\mathbb{R}^{L\times d}$.
Absolute positions add a term that depends on the token index $i$. Relative positions add a term that depends on the **pairwise offset** $r=i-j$.

---

## 2) Additive relative **bias** (T5/ALiBi-style)

You add a scalar bias to each attention logit based on the relative distance bucket of the pair $(i,j)$:
$$
A_{ij} \;=\; \frac{q_i^\top k_j}{\sqrt{d}} \;+\; b_{\text{rel}}\!\big(\operatorname{bucket}(i-j)\big),
$$
$$
\alpha_{ij} \;=\; \mathrm{softmax}_j(A_{ij}),\qquad \text{output}_i=\sum_j \alpha_{ij} v_j.
$$

* $b_{\text{rel}}(\cdot)\in\mathbb{R}$ is a learned table (often per head).
* $\operatorname{bucket}(\cdot)$ maps distances to a small set of bins (fine for small $|r|$, coarse/log for large $|r|$) to keep parameters at $O(\#\text{bins})$.
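
As a rough illustration of the bucketed bias, here is a minimal NumPy sketch. The `bucket` function, its parameters (`num_exact`, `num_log`, `max_distance`), and the random `b_rel` table are assumptions made for this example; the binning is a simplified stand-in, not T5's exact rule.

```python
import numpy as np

def bucket(rel, num_exact=4, num_log=4, max_distance=64):
    """Map signed offsets r = i - j to bucket ids (illustrative binning only)."""
    rel = np.asarray(rel)
    per_side = num_exact + num_log
    # Negative offsets (j > i) get their own set of bins.
    sign_offset = np.where(rel < 0, per_side, 0)
    dist = np.abs(rel)
    # Log-spaced bins for dist >= num_exact, saturating at max_distance.
    safe = np.maximum(dist, 1)
    log_bin = num_exact + np.floor(
        np.log(safe / num_exact) / np.log(max_distance / num_exact) * num_log
    ).astype(int)
    log_bin = np.minimum(log_bin, per_side - 1)
    # One bin per distance while dist < num_exact, log bins afterwards.
    return sign_offset + np.where(dist < num_exact, dist, log_bin)

L, d, n_buckets = 6, 8, 16
rng = np.random.default_rng(0)
b_rel = rng.normal(size=n_buckets)            # stand-in for a learned per-head table
i, j = np.arange(L)[:, None], np.arange(L)[None, :]
bias = b_rel[bucket(i - j)]                   # (L, L), constant along each diagonal
q, k = rng.normal(size=(L, d)), rng.normal(size=(L, d))
logits = q @ k.T / np.sqrt(d) + bias          # A_ij before the softmax
```

Because the bias depends only on $i-j$, the resulting matrix is Toeplitz, and its parameter count is set by the number of buckets rather than by $L$.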

**ALiBi** is a special case with a *fixed* linear bias per head:
$$
A_{ij} \;=\; \frac{q_i^\top k_j}{\sqrt{d}} \;-\; m_h\,(i-j), \qquad m_h>0.
$$
No learned table; scales well to long contexts.
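
A small sketch of the ALiBi bias for a causal model follows. The helper names `alibi_slopes` and `alibi_bias` are made up for this example, and the geometric slope schedule $m_h = 2^{-8h/H}$ is the one usually quoted for a power-of-two head count; treat it as an assumption if your setup differs.

```python
import numpy as np

def alibi_slopes(num_heads):
    """Geometric slopes m_h = 2^(-8h/H) for h = 1..H (assumes H is a power of two)."""
    return 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)

def alibi_bias(L, num_heads):
    """Return an (H, L, L) bias: -m_h * (i - j) where j <= i, -inf elsewhere."""
    i, j = np.arange(L)[:, None], np.arange(L)[None, :]
    dist = (i - j).astype(float)
    bias = -alibi_slopes(num_heads)[:, None, None] * dist
    return np.where(j <= i, bias, -np.inf)    # causal mask folded into the bias

bias = alibi_bias(L=4, num_heads=8)
```

Adding `bias[h]` to head `h`'s logits before the softmax penalizes distant keys linearly, with a different decay rate per head.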

---

## 3) Content-aware relative keys/values (Shaw et al.)

Here the relative embedding is **vector-valued** and interacts with content:
$$
A_{ij} \;=\; \frac{q_i^\top k_j + q_i^\top r^{(K)}_{i-j}}{\sqrt{d}},\qquad
\text{output}_i = \sum_j \alpha_{ij}\left(v_j + r^{(V)}_{i-j}\right).
$$

* $r^{(K)}_{r}, r^{(V)}_{r}\in\mathbb{R}^d$ are learned for each offset $r$ (often clipped to $|r|\le R_{\max}$).
* Efficient implementations use the **“skew” trick** to avoid materializing the per-pair tensor of relative embeddings (the intermediate stays at $O(Ld)$ memory rather than $O(L^2 d)$).
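
The two equations above can be sketched directly in NumPy. This is the naive version that gathers an $L\times L\times d$ tensor, so it is meant for checking shapes and semantics, not for efficiency. The function name `shaw_attention` and the table layout (row `r + r_max` holds the embedding for clipped offset `r`) are assumptions of this example.

```python
import numpy as np

def shaw_attention(q, k, v, rel_k, rel_v, r_max):
    """Single-head attention with relative key/value embeddings (naive gather).

    rel_k, rel_v: (2 * r_max + 1, d) tables indexed by clip(i - j) + r_max.
    """
    L, d = q.shape
    i, j = np.arange(L)[:, None], np.arange(L)[None, :]
    idx = np.clip(i - j, -r_max, r_max) + r_max      # (L, L), values in [0, 2 * r_max]
    rk, rv = rel_k[idx], rel_v[idx]                  # (L, L, d) each
    logits = (q @ k.T + np.einsum("id,ijd->ij", q, rk)) / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)     # numerically stable softmax
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=-1, keepdims=True)
    out = alpha @ v + np.einsum("ij,ijd->id", alpha, rv)
    return out, alpha

rng = np.random.default_rng(0)
L, d, r_max = 6, 4, 2
q, k, v = (rng.normal(size=(L, d)) for _ in range(3))
rel_k = rng.normal(size=(2 * r_max + 1, d))
rel_v = rng.normal(size=(2 * r_max + 1, d))
out, alpha = shaw_attention(q, k, v, rel_k, rel_v, r_max)
```

Setting both tables to zero recovers plain scaled dot-product attention, which is a convenient sanity check.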

---

## 4) Rotary positional embeddings (RoPE)

RoPE encodes relative positions by **rotating** query/key subvectors in 2-D planes. Split $q_i,k_j$ into 2-D pairs and apply a rotation whose angle grows linearly with position:
$$
\widetilde{q}_i \;=\; R(\theta_i)\,q_i,\qquad
\widetilde{k}_j \;=\; R(\theta_j)\,k_j,
$$
with $R(\theta)$ a block-diagonal matrix of 2-D rotations and timescales $\omega_t$ geometrically spaced (like sinusoidal embeddings). Because $R(\alpha)^\top R(\beta)=R(\beta-\alpha)$, the dot product becomes
$$
\widetilde{q}_i^\top \widetilde{k}_j
\;=\;
q_i^\top R(\theta_i)^\top R(\theta_j)\,k_j
\;=\;
q_i^\top R(\theta_{j}-\theta_{i})\,k_j,
$$
which depends on position only through the **relative** offset $(j-i)$. No bias tables, no extra parameters, and widely used (e.g., Llama-family). Extrapolating far beyond the training length usually needs additional scaling of the rotation angles (e.g., position interpolation).

A common per-pair formula for a single 2-D plane:
$$
\begin{aligned}
\begin{bmatrix}\tilde q^{(2t)}_i \\ \tilde q^{(2t+1)}_i\end{bmatrix}
&=
\begin{bmatrix}\cos\theta^{(t)}_i & -\sin\theta^{(t)}_i \\ \sin\theta^{(t)}_i & \cos\theta^{(t)}_i\end{bmatrix}
\begin{bmatrix}q^{(2t)}_i \\ q^{(2t+1)}_i\end{bmatrix},\\
\theta^{(t)}_i &= \frac{i}{\omega_t},\quad \omega_t=\omega_{\min}\,(\omega_{\max}/\omega_{\min})^{t/(d/2-1)}.
\end{aligned}
$$
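
The per-pair rotation can be sketched as below, using the same pairing of coordinates $(2t, 2t+1)$ and the same timescale formula. The function name `rope` and the defaults `w_min=1.0`, `w_max=10000.0` are assumptions for this example, and it requires an even $d \ge 4$ so that $d/2-1$ is nonzero.

```python
import numpy as np

def rope(x, positions, w_min=1.0, w_max=10000.0):
    """Rotate each pair (x[2t], x[2t+1]) by the angle positions / w_t."""
    d = x.shape[-1]
    t = np.arange(d // 2)
    w = w_min * (w_max / w_min) ** (t / (d // 2 - 1))   # geometric timescales
    theta = positions[:, None] / w[None, :]             # (L, d/2) angles
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
L, d = 5, 8
q, k = rng.normal(size=(L, d)), rng.normal(size=(L, d))
pos = np.arange(L, dtype=float)

logits = rope(q, pos) @ rope(k, pos).T
shifted = rope(q, pos + 100.0) @ rope(k, pos + 100.0).T
print(np.allclose(logits, shifted))   # shifting every position leaves the logits unchanged
```

The final comparison is the relative-position property in action: only $j-i$ enters the logits, so a global shift of all positions has no effect.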

---

## 5) 2-D relative positions (vision)

For images (e.g., ViT/Swin windows), relative offsets decompose along height/width:
$$
A_{ij}
\;=\;
\frac{q_i^\top k_j}{\sqrt{d}}
\;+\;
b^{(h)}_{\Delta h(i,j)} \;+\; b^{(w)}_{\Delta w(i,j)},
$$
where $\Delta h,\Delta w$ are the row/column differences between patches $i$ and $j$, and $b^{(h)}, b^{(w)}$ are learned tables (often small, kept per head or shared across heads).

A Shaw-style content-aware 2-D variant also exists: learn $r^{(K)}_{\Delta h,\Delta w}$ (factored or full).
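
For the factored bias, a short sketch of the index bookkeeping may help. It assumes patches are flattened in row-major order and that each table is stored with offsets shifted to start at zero; the function name `relative_bias_2d` is hypothetical.

```python
import numpy as np

def relative_bias_2d(H, W, b_h, b_w):
    """Factored 2-D bias for an H x W grid of patches.

    b_h: (2H-1,) table for row offsets, b_w: (2W-1,) table for column offsets.
    Returns an (H*W, H*W) matrix to add to the attention logits.
    """
    rows, cols = np.divmod(np.arange(H * W), W)      # row-major patch coordinates
    dh = rows[:, None] - rows[None, :] + (H - 1)     # shift offsets into [0, 2H-2]
    dw = cols[:, None] - cols[None, :] + (W - 1)     # shift offsets into [0, 2W-2]
    return b_h[dh] + b_w[dw]

H, W = 2, 3
b_h = np.array([0.0, 10.0, 20.0])                    # toy values standing in for learned tables
b_w = np.arange(5, dtype=float)
bias = relative_bias_2d(H, W, b_h, b_w)
```

A full (non-factored) table would instead be indexed as `table[dh, dw]` with shape `(2H-1, 2W-1)`.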

---

## 6) Causal masking & clipping

For autoregressive models, only positions $j\le i$ are visible. Relative indices are typically **clipped**:
$$
r = \mathrm{clip}(i-j,\,-R_{\max},\,R_{\max}),
$$
keeping parameter count bounded while still giving strong distance signals.
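
A tiny sketch of clipping combined with a causal mask, with hypothetical helper names `clipped_offsets` and `causal_mask`:

```python
import numpy as np

def clipped_offsets(L, r_max):
    """r[i, j] = clip(i - j, -r_max, r_max)."""
    i, j = np.arange(L)[:, None], np.arange(L)[None, :]
    return np.clip(i - j, -r_max, r_max)

def causal_mask(L):
    """True where key j is visible to query i, i.e. j <= i."""
    return np.tril(np.ones((L, L), dtype=bool))

r = clipped_offsets(5, r_max=2)
mask = causal_mask(5)
visible = np.unique(r[mask])    # offsets that can actually occur under the mask
```

Under the causal mask only non-negative offsets survive, so a causal model needs just $R_{\max}+1$ table entries instead of $2R_{\max}+1$.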

---

## 7) Implementation notes (fast paths)

* **Bias-only** (T5/ALiBi): precompute an $L\times L$ bias (or build it on the fly from buckets); add to logits before softmax.
* **Shaw-style**: compute $Q R^{(K)\top}$ once, then **skew** (a pad+reshape+slice) so offset $r$ lines up with column $j$.
* **RoPE**: apply rotations to $Q,K$ once per layer; everything else is standard attention.
* **2-D**: store small $(2H\!-\!1)\times(2W\!-\!1)$ tables (or factor into two vectors of length $2H\!-\!1$ and $2W\!-\!1$).
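
The skew step is the least obvious of these, so here is a minimal sketch of one causal pad+reshape+slice variant, checked against a naive gather. The column ordering (column `c` of the product holds offset `L-1-c`) and the names `skew` and `rel_k` are assumptions of this example; entries with $j>i$ come out as garbage and are expected to be removed by the causal mask.

```python
import numpy as np

def skew(m):
    """Realign m[i, c] = q_i . e_{L-1-c} into s[i, j] = q_i . e_{i-j} for j <= i."""
    L = m.shape[0]
    padded = np.pad(m, ((0, 0), (1, 0)))     # (L, L+1): one zero column on the left
    return padded.reshape(L + 1, L)[1:]      # reinterpret the buffer, drop the first row

rng = np.random.default_rng(0)
L, d = 5, 4
q = rng.normal(size=(L, d))
rel_k = rng.normal(size=(L, d))              # rel_k[r] = embedding for offset r = 0..L-1

m = q @ rel_k[::-1].T                        # one (L, L) matmul, no (L, L, d) tensor
s = skew(m)

# Naive reference: gather an (L, L, d) tensor of embeddings, then contract.
i, j = np.arange(L)[:, None], np.arange(L)[None, :]
naive = np.einsum("id,ijd->ij", q, rel_k[np.clip(i - j, 0, L - 1)])
causal = j <= i
print(np.allclose(s[causal], naive[causal]))
```

The skewed result matches the gather on every visible entry while only ever holding $L\times L$ and $L\times d$ arrays.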

---

## 8) When to use what?

* **RoPE**: ✅ purely relative logits; no extra params; default for many LLMs; long-context extrapolation usually relies on scaling the rotation angles.
* **Additive bias (T5)**: ✅ simple, stable, cheap; works great with encoders/decoders; buckets control capacity.
* **ALiBi**: ✅ zero learned params; excellent long-context behavior; causal models.
* **Shaw-style**: ✅ highest expressivity (content × position); ❌ slightly heavier; good for tasks sensitive to fine relative geometry (e.g., some NMT or local vision windows).
* **2-D bias**: ✅ ideal for images/windows (Swin); small overhead; preserves translational inductive bias.

---

## 9) Quick reference equations (copy-ready)

**Bias-only (per head):**
$$
A_{ij}=\frac{q_i^\top k_j}{\sqrt{d}} + b_{\text{rel}}(\operatorname{bucket}(i-j)),\quad
\alpha_{ij}=\frac{e^{A_{ij}}}{\sum_{t} e^{A_{it}}},\quad
\text{out}_i=\sum_j \alpha_{ij} v_j.
$$

**Shaw et al.:**
$$
A_{ij}=\frac{q_i^\top k_j + q_i^\top r^{(K)}_{i-j}}{\sqrt{d}},\qquad
\text{out}_i=\sum_j \alpha_{ij}\left(v_j + r^{(V)}_{i-j}\right).
$$

**ALiBi (causal):**
$$
A_{ij}=\frac{q_i^\top k_j}{\sqrt{d}} - m_h\,(i-j),\quad j\le i.
$$

**RoPE:**
$$
\widetilde{q}_i=R(\theta_i)q_i,\quad \widetilde{k}_j=R(\theta_j)k_j,\quad
A_{ij}=\frac{\widetilde{q}_i^\top \widetilde{k}_j}{\sqrt{d}}
=\frac{q_i^\top R(\theta_{j}-\theta_{i})k_j}{\sqrt{d}}.
$$

**2-D windowed bias (vision):**
$$
A_{ij}=\frac{q_i^\top k_j}{\sqrt{d}} + b^{(h)}_{\Delta h(i,j)} + b^{(w)}_{\Delta w(i,j)}.
$$

---

## 10) Key intuitions (why it helps)

* **Translation invariance**: the model learns “same-pattern, different place” naturally.
* **Length generalization**: depends on *offsets*, not absolute indexes.
* **Parameter efficiency**: a small table (or none with RoPE/ALiBi) replaces the $O(L)$ absolute position embeddings.
* **Locality bias**: buckets/linear slopes can emphasize nearby tokens, which matters for language syntax and vision neighborhoods.

