# Relative vs. Absolute Positional Embeddings in Self-Attention

The key architectural difference between **absolute** and **relative** positional embeddings in Transformers such as **ViT vs. Swin** is *where* the positional signal enters the computation.

---

## **1. In vanilla ViT (absolute positional embeddings)**

The ViT pipeline looks like this:

```
image → patchify → linear projection → +pos_embed → Transformer encoder
```

In code (simplified):

```python
x = self.patch_embed(img)                        # [B, N, D]  (N patches, D=embed_dim)
cls = self.cls_token.expand(x.shape[0], -1, -1)  # [1, 1, D] -> [B, 1, D]
x = torch.cat([cls, x], dim=1)                   # prepend class token -> [B, N+1, D]
x = x + self.pos_embed                           # add absolute positional embedding [1, N+1, D]
x = self.transformer(x)                          # pass through attention layers
```

So:

* The **positional embedding is added to the input tokens before any attention**.
* Each token’s embedding = content + absolute position.
* Inside the Transformer:
  $$ A_{ij} = \frac{(x_i + p_i)^\top (x_j + p_j)}{\sqrt{d}} $$
  (written with identity projections $W_Q = W_K = I$ for readability), so the position is already mixed into Q and K implicitly.

✅ The **positional signal is part of the input feature**.
It travels through all layers as part of the representation.
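
Expanding the product makes the mixing explicit. This is only a sketch under the simplifying assumption of identity query/key projections ($W_Q = W_K = I$), which a real ViT does not use:

$$
(x_i + p_i)^\top (x_j + p_j)
= \underbrace{x_i^\top x_j}_{\text{content-content}}
+ \underbrace{x_i^\top p_j}_{\text{content-position}}
+ \underbrace{p_i^\top x_j}_{\text{position-content}}
+ \underbrace{p_i^\top p_j}_{\text{position-position}}
$$

Only the first term is free of position; the other three tie the logit to the absolute locations of patches $i$ and $j$.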

---

## **2. In Swin Transformer (relative positional bias)**

In Swin, there is **no `+pos_embed` added to tokens**.

Instead, relative positional information is injected **inside each attention block**, *after* Q and K are computed:

### Inside one attention layer

```python
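# x: tokens of one attention window; no positional embedding has been added
Q = x @ Wq
K = x @ Wk
V = x @ Wv

# content-only attention scores, scaled by sqrt(d)
A = (Q @ K.transpose(-2, -1)) / (Q.shape[-1] ** 0.5)

# add relative bias (looked up by the relative spatial offset of each (i, j) pair)
A = A + self.relative_position_bias[index_table]

attention = torch.softmax(A, dim=-1)
output = attention @ V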
```

So:

* You **don’t add anything** to `x` before attention.
* Instead, you **add a bias** to the attention matrix $A_{ij}$ before softmax.
* The bias depends only on the relative position between patches $i$ and $j$.

✅ The **positional signal enters only the attention scores**, not the token embeddings.

---

## **3. Where in the computation they differ**

| Stage            | Vanilla ViT (absolute)                                   | Swin / Relative models                             |
| ---------------- | -------------------------------------------------------- | -------------------------------------------------- |
| Before attention | Add `pos_embed` to `x`                                   | Nothing added                                      |
| Q,K computation  | Position info is baked into Q and K (since added to `x`) | Purely from content features                       |
| Attention score  | No extra bias                                            | Add `relative_position_bias` based on (Δrow, Δcol) |
| Softmax          | Softmax over `QKᵀ / √d`                                  | Softmax over `(QKᵀ / √d) + bias`                   |
| After attention  | Normal                                                   | Normal                                             |

So, in Swin and similar models, **position enters at the attention level**,
while in ViT, **position enters at the token level**.

---

## **4. Why this matters**

| Property                               | Absolute (ViT)       | Relative (Swin, BEiT, ViTDet)      |
| -------------------------------------- | -------------------- | ---------------------------------- |
| Where injected                         | before attention     | inside attention (as bias)         |
| Type of info                           | absolute coordinates | relative offset (Δh, Δw)           |
| Translation invariance                 | ❌ No                 | ✅ Yes                              |
| Need interpolation for new image sizes | ✅ Yes                | ❌ No for windowed attention with a fixed window size |
| Cost                                   | trivial (add once)   | minimal (add bias lookup per head) |

So, **ViT** learns one `pos_embed` tensor of shape `[1, N+1, D]`.
But **Swin** learns a much smaller bias table of shape `[num_heads, (2H−1)×(2W−1)]` per attention layer, where H×W is the attention window (e.g. 7×7), not the image.
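
To get a feel for the size difference, here is a rough count of entries in each tensor. The concrete sizes (224×224 image, 16×16 patches, `D=768`, a 7×7 window, 3 heads) and both helper names are illustrative assumptions, not values taken from this document:

```python
def vit_pos_embed_params(image_size: int, patch_size: int, embed_dim: int) -> int:
    """Entries in an absolute pos_embed of shape [1, N+1, D] (N patches + class token)."""
    num_patches = (image_size // patch_size) ** 2
    return (num_patches + 1) * embed_dim


def swin_bias_table_params(window_h: int, window_w: int, num_heads: int) -> int:
    """Entries in one relative bias table of shape [num_heads, (2H-1)*(2W-1)]."""
    return num_heads * (2 * window_h - 1) * (2 * window_w - 1)


vit_params = vit_pos_embed_params(image_size=224, patch_size=16, embed_dim=768)
swin_params = swin_bias_table_params(window_h=7, window_w=7, num_heads=3)
print(vit_params, swin_params)  # 151296 507
```

The ViT tensor is stored once for the whole model, while a Swin-style table is stored per attention layer, so the per-model totals depend on depth.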

---

## **5. Intuition summary**

* **Absolute positional embedding (ViT)**
  → Position added *before* Transformer
  → Tokens carry their “address” through the entire network

* **Relative positional embedding (Swin)**
  → Position added *inside* each attention layer
  → Attention itself learns directional preference (“look right,” “look up”)
  → Tokens remain purely content-based

---

✅ **In one sentence:**

> In **ViT**, positional embeddings are *added to patch embeddings before* attention (global absolute coordinates),
> while in **Swin** or other relative models, positional information is *added as a bias inside each attention layer* (local relative relationships).

---

## **6. Fully numerical Example**
A **fully numerical** example with a **tiny 2×2 image**, so we can see explicitly what happens in the **attention matrix** for both **absolute** and **relative** positional embeddings.

---

#### **6.1. Setup**

We take a **2×2 grid of image patches**, flattened in raster order:

```
[patch0, patch1,
 patch2, patch3]
```

So $ N = 4 $, and the flattened indices correspond to:

| Patch | Grid coord (row, col) |
| ----- | --------------------- |
| 0     | (0, 0)                |
| 1     | (0, 1)                |
| 2     | (1, 0)                |
| 3     | (1, 1)                |

We’ll use an embedding dimension $d=2$ (to keep numbers readable),
and define small content embeddings $x_i$:

| Patch | $x_i$  |
| ----- | ------ |
| 0     | [1, 0] |
| 1     | [0, 1] |
| 2     | [1, 1] |
| 3     | [0, 0] |

We’ll compute **attention logits** $ A_{ij} = \frac{q_i^\top k_j}{\sqrt{d}} $ for three cases:

1. No positional encoding
2. Absolute positional encoding
3. Relative positional encoding

---

#### **6.2. Base (no positional encoding)**

Assume $Q=K=X$.

Compute the dot products $x_i^\top x_j$ divided by $\sqrt{2} \approx 1.414$:

| i\j | 0                     | 1                     | 2                     | 3                 |
| --- | --------------------- | --------------------- | --------------------- | ----------------- |
| 0   | (1·1+0·0)/1.414=0.707 | (1·0+0·1)/1.414=0     | (1·1+0·1)/1.414=0.707 | (1·0+0·0)/1.414=0 |
| 1   | (0·1+1·0)/1.414=0     | (0·0+1·1)/1.414=0.707 | (0·1+1·1)/1.414=0.707 | (0·0+1·0)/1.414=0 |
| 2   | (1·1+1·0)/1.414=0.707 | (1·0+1·1)/1.414=0.707 | (1·1+1·1)/1.414=1.414 | (1·0+1·0)/1.414=0 |
| 3   | (0·1+0·0)/1.414=0     | (0·0+0·1)/1.414=0     | (0·1+0·1)/1.414=0     | (0·0+0·0)/1.414=0 |

Patch 3 has the all-zero embedding, so every logit in its row and column is 0:

$$
A_{\text{no-pos}} =
\begin{bmatrix}
0.707 & 0 & 0.707 & 0\\
0 & 0.707 & 0.707 & 0\\
0.707 & 0.707 & 1.414 & 0\\
0 & 0 & 0 & 0
\end{bmatrix}
$$

Without position, the model sees only patch *content*: shuffling the patches only permutes the rows and columns of this matrix, and the score between any two given patches stays the same.
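
The content-only logits can be recomputed directly from the $x_i$ table. The sketch below uses plain Python lists and assumes $Q = K = X$ as in this section; the helper name `attention_logits` and the permutation are made up for illustration:

```python
import math

# Content embeddings x_i for patches 0..3 (d = 2).
X = [[1, 0], [0, 1], [1, 1], [0, 0]]


def attention_logits(queries, keys):
    """Return A[i][j] = q_i . k_j / sqrt(d) as a nested list."""
    d = len(queries[0])
    return [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in keys]
            for qi in queries]


A = attention_logits(X, X)
for row in A:
    print([round(v, 3) for v in row])

# Shuffle the patches: the same numbers appear, with rows and columns permuted.
perm = [2, 0, 3, 1]  # arbitrary permutation chosen for this sketch
X_shuffled = [X[p] for p in perm]
A_shuffled = attention_logits(X_shuffled, X_shuffled)
assert all(A_shuffled[a][b] == A[perm[a]][perm[b]]
           for a in range(4) for b in range(4))
```

The final assertion is the "no positional signal" property: the logit for a pair of patches does not depend on where in the sequence they sit.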

---

#### **6.3. Absolute positional embedding**

Let’s assign 2D positional embeddings for each patch based on grid coordinates:

| Patch | (row, col) | pos_embed  |
| ----- | ---------- | ---------- |
| 0     | (0,0)      | [0.0, 0.0] |
| 1     | (0,1)      | [0.0, 0.5] |
| 2     | (1,0)      | [0.5, 0.0] |
| 3     | (1,1)      | [0.5, 0.5] |

Then $ Q_i = x_i + p_i, \; K_i = x_i + p_i $.

Let’s compute $A_{ij} = Q_i K_j^\top / \sqrt{2}$.

Example for $i=0$:

* $Q_0 = [1, 0]$
* $K_0 = [1, 0]$
* $K_1 = [0, 1.5]$
* $K_2 = [1.5, 1]$
* $K_3 = [0.5, 0.5]$

Then:

| j | dot               | /√2   |
| - | ----------------- | ----- |
| 0 | (1·1+0·0)=1       | 0.707 |
| 1 | (1·0+0·1.5)=0     | 0     |
| 2 | (1·1.5+0·1)=1.5   | 1.061 |
| 3 | (1·0.5+0·0.5)=0.5 | 0.354 |

Now do this for all rows (similar arithmetic), yielding approximately:

$$
A_{\text{abs}} =
\begin{bmatrix}
0.707 & 0 & 1.061 & 0.354\\
0 & 1.591 & 1.061 & 0.530\\
1.061 & 1.061 & 2.298 & 0.884\\
0.354 & 0.530 & 0.884 & 0.354
\end{bmatrix}
$$

Now each patch has a *unique identity* in space — patch (1,1) ≠ patch (0,0).

But note: if you **shift the whole image**, the model’s positional encodings are now *wrong* — they’re absolute.
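
The whole absolute-embedding matrix can be recomputed from the $x_i$ and `pos_embed` tables of this section. This is a minimal sketch assuming $Q_i = K_i = x_i + p_i$ with no learned projections:

```python
import math

X = [[1, 0], [0, 1], [1, 1], [0, 0]]                  # content embeddings x_i
P = [[0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [0.5, 0.5]]  # pos_embed p_i per patch

# Token-level injection: position is added before any dot product is taken.
Z = [[x + p for x, p in zip(xi, pi)] for xi, pi in zip(X, P)]

d = 2
A_abs = [[sum(a * b for a, b in zip(zi, zj)) / math.sqrt(d) for zj in Z]
         for zi in Z]
for row in A_abs:
    print([round(v, 3) for v in row])
```

Because position is folded into the tokens themselves, every entry, including the diagonal, changes relative to the content-only logits.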

---

#### **6.4. Relative positional embedding**

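Now we’ll instead use **relative biases** that depend on the 2D offset between the query patch $i$ and the key patch $j$, taken as (Δrow, Δcol) = position of $i$ minus position of $j$ (the same order used in Swin's reference implementation):

| Offset (Δrow, Δcol) | Bias |
| ------------------- | ---- |
| (0, 0)              | 0.0  |
| (0, 1)              | −0.2 |
| (1, 0)              | −0.1 |
| (1, 1)              | −0.3 |
| (−1, 0)             | +0.1 |
| (0, −1)             | +0.2 |
| (−1, −1)            | +0.3 |
| (1, −1)             | +0.1 |
| (−1, 1)             | −0.1 |

The last two rows (the anti-diagonal offsets) are filled in by assuming the table is additive in its row and column parts, as the other entries are: for example, (1, 1) = (1, 0) + (0, 1).

Now for each pair $(i,j)$, find their grid coordinates, compute (Δrow, Δcol), and add the corresponding bias to the *content-only* logits $A_{\text{no-pos}}$.

For example:

* i=2 (1,0), j=0 (0,0) → Δ=(+1,0) → bias −0.1
* i=0 (0,0), j=2 (1,0) → Δ=(−1,0) → bias +0.1
* i=0 (0,0), j=1 (0,1) → Δ=(0,−1) → bias +0.2
* i=3 (1,1), j=0 (0,0) → Δ=(+1,+1) → bias −0.3

Adding biases gives (rounded):

$$
A_{\text{rel}} =
\begin{bmatrix}
0.707 & 0.2 & 0.807 & 0.3\\
-0.2 & 0.707 & 0.607 & 0.1\\
0.607 & 0.807 & 1.414 & 0.2\\
-0.3 & -0.1 & -0.2 & 0
\end{bmatrix}
$$

Now, instead of unique IDs per patch, the **relative offsets** define the attention bias. With this table, keys below or to the right of the query get a positive bias, and keys above or to the left get a negative one:

* patch 0 prefers looking *below* itself (its largest logit is for patch 2),
* patch 3 has an all-zero embedding, so its row is pure bias, and every other patch (all above or to its left) is penalized,
* and so on.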
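
The relative case can be checked the same way. The sketch below assumes the offset is query position minus key position and that the bias is additive in its row and column parts; that assumption reproduces the seven listed offsets and supplies values for the two anti-diagonal ones. `row_bias`, `col_bias`, and the helper names are hypothetical:

```python
import math

X = [[1, 0], [0, 1], [1, 1], [0, 0]]       # content embeddings x_i
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]  # (row, col) of patches 0..3

# Assumed decomposition of the bias table: bias(dr, dc) = row_bias[dr] + col_bias[dc].
row_bias = {-1: 0.1, 0: 0.0, 1: -0.1}
col_bias = {-1: 0.2, 0: 0.0, 1: -0.2}


def relative_bias(i, j):
    """Bias for query i attending to key j, with offset = pos_i - pos_j."""
    d_row = coords[i][0] - coords[j][0]
    d_col = coords[i][1] - coords[j][1]
    return row_bias[d_row] + col_bias[d_col]


def content_logit(i, j):
    """Content-only logit x_i . x_j / sqrt(d) with d = 2."""
    return sum(a * b for a, b in zip(X[i], X[j])) / math.sqrt(2)


# Attention-level injection: tokens stay content-only, the bias is added to the logits.
A_rel = [[content_logit(i, j) + relative_bias(i, j) for j in range(4)]
         for i in range(4)]
for row in A_rel:
    print([round(v, 3) for v in row])
```

Note that `X` is never modified: the positional signal only ever touches the logits.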

---

#### **6.5. Compare**

| Model        | Encodes                    | Shift-invariant? | Key difference                 |
| ------------ | -------------------------- | ---------------- | ------------------------------ |
| **Absolute** | Position ID for each patch | ❌ No             | Learns one embedding per grid position (e.g. a 14×14 grid for ViT-B/16 at 224×224) |
| **Relative** | Offset between patches     | ✅ Yes            | Learns (Δrow, Δcol) biases     |

If you shift the entire image by one patch:

* **Absolute**: every patch now gets the wrong $p_i$.
* **Relative**: biases depend only on offsets → the bias added for each pair of patches remains the same.
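
A small check of this claim, as a sketch: the bias matrix for a 2×2 block of patches is computed at two different places in a larger grid. `toy_bias` is a made-up bias function (offset = query position minus key position), and the 14×14 grid used for the absolute indices is an assumed example size:

```python
def bias_matrix(coords, bias_fn):
    """bias[i][j] for every query/key pair, using offset = pos_i - pos_j."""
    return [[bias_fn(ri - rj, ci - cj) for (rj, cj) in coords]
            for (ri, ci) in coords]


def toy_bias(d_row, d_col):
    """Hypothetical bias values for illustration, not learned weights."""
    return -0.1 * d_row - 0.2 * d_col


window = [(0, 0), (0, 1), (1, 0), (1, 1)]
shifted = [(r + 5, c + 3) for r, c in window]  # same block, moved elsewhere

# Relative: offsets are unchanged by the shift, so the bias matrix is identical.
assert bias_matrix(window, toy_bias) == bias_matrix(shifted, toy_bias)

# Absolute: a lookup keyed by grid position hits different entries after the shift.
abs_ids = [r * 14 + c for r, c in window]
abs_ids_shifted = [r * 14 + c for r, c in shifted]
print(abs_ids, abs_ids_shifted)  # [0, 1, 14, 15] [73, 74, 87, 88]
```

This only shows that the bias term is shift-invariant; the content term still depends on what the patches contain.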

---

#### **6.6. Intuition**

In ViT:

* $p_i$ = "I am patch (r,c)"
  → model knows *where you are*.

In Relative (Swin, ViTDet):

* $b_{Δr,Δc}$ = "I am Δr rows below and Δc columns to the right of that patch"
  → model knows *how you are related* to others.

---

✅ **Takeaway**

> Absolute embeddings learn **where** patches are (index-based).
> Relative embeddings learn **how far and in which direction** patches are from each other.
> That’s why relative embeddings make attention **translation-invariant** and **resolution-flexible** — crucial for dense prediction tasks.

---



## 7. **What does “patch 0 prefers looking below itself” mean?**
---

#### **7.1. Recall the setup**

We had a **2×2 grid**:

```
(0,0): patch 0   (0,1): patch 1
(1,0): patch 2   (1,1): patch 3
```

and our **relative attention logits** were:

$$
A_{\text{rel}} =
\begin{bmatrix}
0.707 & 0.2 & 0.807 & 0.3\\
-0.2 & 0.707 & 0.607 & 0.1\\
0.607 & 0.807 & 1.414 & 0.2\\
-0.3 & -0.1 & -0.2 & 0.0
\end{bmatrix}
$$

Each **row i** corresponds to *query patch i*
and each **column j** corresponds to *key patch j*.
So row i tells you how much patch i “looks at” (attends to) patch j.

---

#### **7.2. Relative bias meaning**

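We had relative bias depending on **(Δrow, Δcol)**, the query position minus the key position:

| Offset (Δrow, Δcol) | Bias | Interpretation (where the key sits relative to the query) |
| ------------------- | ---- | --------------------------------------------------------- |
| (0, 0)              | 0.0  | same position                                             |
| (0, +1)             | −0.2 | key is the left neighbor                                  |
| (+1, 0)             | −0.1 | key is the neighbor above                                 |
| (+1, +1)            | −0.3 | key is diagonally above-left                              |
| (−1, 0)             | +0.1 | key is the neighbor below                                 |
| (0, −1)             | +0.2 | key is the right neighbor                                 |
| (−1, −1)            | +0.3 | key is diagonally below-right                             |
| (+1, −1)            | +0.1 | key is diagonally above-right                             |
| (−1, +1)            | −0.1 | key is diagonally below-left                              |

The two anti-diagonal rows assume the table is additive in its row and column parts, like the other entries.

Positive bias means attention is *encouraged* toward that direction; negative means *discouraged*.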

---

#### **7.3. Interpret row by row**

Let’s read each **row of $A_{\text{rel}}$**, and link values to spatial neighbors.

---

**Patch 0 (top-left, row 0)**

Row 0 = `[0.707, 0.2, 0.807, 0.3]`

| Attends to j | Grid                     | Δ(row,col) = query − key | Bias | Logit | Interpretation                       |
| ------------ | ------------------------ | ------------------------ | ---- | ----- | ------------------------------------ |
| 0 (self)     | (0,0)                    | (0,0)                    | 0    | 0.707 | moderate self-attention              |
| 1            | (0,1) → right            | (0,−1)                   | +0.2 | 0.2   | no content match, bias only          |
| 2            | (1,0) → below            | (−1,0)                   | +0.1 | 0.807 | **highest**: content match plus bias |
| 3            | (1,1) → diag below-right | (−1,−1)                  | +0.3 | 0.3   | no content match, bias only          |

✅ **Interpretation:**
Patch 0 (top-left) most strongly attends **downward** (patch 2), then to itself, and least to the right (patch 1).
So we can say *“patch 0 prefers looking below itself.”*

---

**Patch 1 (top-right, row 0, col 1)**

Row 1 = `[−0.2, 0.707, 0.607, 0.1]`

| Attends to j | Grid                    | Δ(row,col) = query − key | Bias | Logit | Interpretation                            |
| ------------ | ----------------------- | ------------------------ | ---- | ----- | ----------------------------------------- |
| 0            | (0,0) → left            | (0,+1)                   | −0.2 | −0.2  | lowest: no content match, left penalized  |
| 1 (self)     | (0,1)                   | (0,0)                    | 0    | 0.707 | highest: self-attention                   |
| 2            | (1,0) → diag below-left | (−1,+1)                  | −0.1 | 0.607 | content match outweighs the small penalty |
| 3            | (1,1) → below           | (−1,0)                   | +0.1 | 0.1   | bias only                                 |

✅ **Interpretation:**
Patch 1 (top-right) attends mostly to **itself and to patch 2** (below-left), and least to its left neighbor. Here the content term, not the bias, decides the ranking.

---

**Patch 2 (bottom-left, row 1, col 0)**

Row 2 = `[0.607, 0.807, 1.414, 0.2]`

| j        | Grid                | Δ(row,col) = query − key | Bias | Logit | Interpretation                              |
| -------- | ------------------- | ------------------------ | ---- | ----- | ------------------------------------------- |
| 0        | (0,0) → above       | (+1,0)                   | −0.1 | 0.607 | content match, small penalty for looking up |
| 1        | (0,1) → above-right | (+1,−1)                  | +0.1 | 0.807 | content match plus bias                     |
| 2 (self) | (1,0)               | (0,0)                    | 0    | 1.414 | highest: strong self-focus                  |
| 3        | (1,1) → right       | (0,−1)                   | +0.2 | 0.2   | bias only                                   |

✅ **Interpretation:**
Patch 2 (bottom-left) mainly attends to itself, then to its **upper-right** neighbor (patch 1) and the patch above it (patch 0).

---

**Patch 3 (bottom-right, row 1, col 1)**

Row 3 = `[−0.3, −0.1, −0.2, 0.0]`

| j | Grid            | Δ(row,col) = query − key | Bias | Logit | Interpretation   |
| - | --------------- | ------------------------ | ---- | ----- | ---------------- |
| 0 | (0,0) → up-left | (+1,+1)                  | −0.3 | −0.3  | most penalized   |
| 1 | (0,1) → up      | (+1,0)                   | −0.1 | −0.1  | mildly penalized |
| 2 | (1,0) → left    | (0,+1)                   | −0.2 | −0.2  | penalized        |
| 3 | (self)          | (0,0)                    | 0    | 0.0   | highest          |

✅ **Interpretation:**
Patch 3 (bottom-right) has an all-zero embedding, so its row is pure bias. All of its neighbors lie above or to the left, which this head penalizes,
so it attends mostly **to itself**.

---

#### **7.4. Summary of directional attention**

| Query patch (position) | Highest attention target | Spatial relation | Interpretation                                      |
| ---------------------- | ------------------------ | ---------------- | --------------------------------------------------- |
| Patch 0 (top-left)     | patch 2                  | ↓ (below)        | prefers looking below                               |
| Patch 1 (top-right)    | self, then patch 2       | ↙                | content match outweighs a small negative bias       |
| Patch 2 (bottom-left)  | self, then patch 1       | ↗                | content match plus a positive bias                  |
| Patch 3 (bottom-right) | self                     | none             | every neighbor is above or left, which is penalized |

So each patch’s **attention direction** emerges from the content term combined with the **biases associated with Δ(row, col)**.

---

#### **7.5. Why this matters**

* In absolute embeddings: the model learns each patch’s fixed ID, so it doesn’t generalize well when the object shifts.
* In relative embeddings: the *pattern* of attention (downward, leftward, etc.) is learned once and reused everywhere —
  this gives **translation invariance**.

Thus “patch 0 prefers looking below itself” really means:

> when this attention head fires, it tends to favor queries looking downward relative to themselves — a pattern that repeats anywhere in the image.

---

✅ **Summary insight**

* The **sign and magnitude** of biases for each offset (Δrow, Δcol) control the **directional attention preference**.
* These directional patterns let each attention head specialize — one might focus horizontally, another vertically, another diagonally.
* That’s why **relative positional bias** is so powerful in Swin-style vision Transformers: it gives the model spatial inductive biases similar to CNNs, but learned directly through attention.

---



<img src="images/RPE_learned_biases.png" />


---

## **8. The meaning of a bias for (Δrow, Δcol)**

If you define a learned bias table like this:

| (Δrow, Δcol) | Bias |
| ------------ | ---- |
| (0, 0)       | 0.0  |
| (0, +1)      | −0.2 |
| (0, −1)      | +0.2 |
| (+1, 0)      | −0.1 |
| (−1, 0)      | +0.1 |
| (+1, +1)     | −0.3 |
| (−1, −1)     | +0.3 |

it means (with Δ = query position − key position):

* A neighbor **to the right** of the query (Δrow=0, Δcol=−1) always gets a **+0.2** bias,
  no matter whether we’re in the top row or bottom row.
* A neighbor **below** the query (Δrow=−1, Δcol=0) always gets **+0.1**, etc.

This is exactly what gives **translation invariance**.

---

#### **8.1. Applying to patch₀ and patch₂**

Let’s recall our patch grid:

```
(0,0): patch₀     (0,1): patch₁
(1,0): patch₂     (1,1): patch₃
```

For **patch₀ (0,0)**

* Right neighbor is patch₁ (0,1).
  Offset = (0,0) − (0,1) = (Δrow=0, Δcol=−1).
  Bias = **+0.2**.
  → so attention from patch₀ → patch₁ gets **+0.2** added to its logit.

For **patch₂ (1,0)**

* Right neighbor is patch₃ (1,1).
  Offset = (1,0) − (1,1) = (Δrow=0, Δcol=−1).
  Bias = **+0.2** again.
  → so attention from patch₂ → patch₃ also gets **+0.2**.

✅ **They share the same bias**, because both represent
“look at the patch to my right.”

---

#### **8.2. Interpretation**

That means the **attention head** has learned a general rule:

> “When looking to the right, raise the attention score slightly (+0.2).”

This rule applies **uniformly everywhere** in the image —
top-left, bottom-left, middle, etc.

So yes, both patch₀ and patch₂ have equally boosted attention to their right neighbors.

If we had set the bias for (0, −1) to **−0.2** instead, then *both* would have their attention **discouraged** toward the right.

---

#### **8.3. Why this is powerful**

Because it means the model’s **behavior is spatially consistent**:

* It doesn’t need to relearn “looking right” separately for each row.
* The same learned weights apply globally.

In CNN terms, this is equivalent to **weight sharing** —
the same convolutional kernel slides everywhere in the image.

So relative positional bias gives the transformer a **spatial inductive bias**:

> same directional relationship → same learned effect.

---

#### **8.4. Relation to ViT vs Swin**

| Model               | Uses bias table                       | Property                                                                          |
| ------------------- | ------------------------------------- | --------------------------------------------------------------------------------- |
| **ViT (absolute)**  | No — each patch has independent `p_i` | Attention pattern depends on absolute image coordinates (top-left ≠ bottom-left). |
| **Swin (relative)** | Yes — bias by Δrow, Δcol              | Attention pattern depends only on direction and distance; same everywhere.        |

That’s why Swin can be applied to **larger image resolutions**
without interpolating positional embeddings, as long as the window size stays the same:
the bias table only defines relationships in a **relative way** inside a window.

---

✅ **Summary**

* The bias for (Δrow=0, Δcol=−1) = +0.2
  applies **identically** to patch₀→patch₁ and patch₂→patch₃.
* This shared bias means: “looking right” is encouraged slightly everywhere.
* The model thus learns direction-sensitive, but **location-agnostic**, attention behavior.

---

## **9. Same bias for all patches?**
A natural follow-up question: if the **same bias** is used **for all patches**, isn’t that restrictive?

At first glance it might seem so,
because sometimes a patch should look **left**, sometimes **right**.
But the key is this: **the relative bias doesn’t dictate where a patch *must* look**; it only provides a *soft spatial prior*.

Let’s unpack this carefully.

---

#### **9.1. What the bias actually does**

In relative positional encoding, the bias term is added to the **attention logits**:

$$
A_{ij} = \frac{q_i^\top k_j}{\sqrt{d}} + b_{\Delta h(i,j),\,\Delta w(i,j)}.
$$

That means the final attention weight depends on **two things**:

1. the **content similarity** (dot product of $q_i$ and $k_j$)
2. the **spatial bias** (based on relative offset)

So, even if the bias for some direction is −0.2,
if the content similarity between patch $i$ and its neighbor in that direction is very high,
the attention will still be large after softmax.

✅ **In short:**
The bias *nudges* the model toward or away from certain directions,
but the actual attention is still **content-driven**.
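
A tiny numeric illustration of that nudge, as a sketch with made-up logits for one query and three keys (the `content` and `bias` values are hypothetical):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Key 1 sits in a direction the head penalizes (bias -0.2)
# but matches the query's content much better than the others.
content = [0.3, 2.0, 0.3]  # q_i . k_j / sqrt(d)
bias = [0.2, -0.2, 0.0]    # relative positional bias per key

weights = softmax([c + b for c, b in zip(content, bias)])
print([round(w, 3) for w in weights])
assert weights[1] == max(weights)  # content wins despite the negative bias
```

With equal content scores the ordering would instead follow the bias, which is exactly the "soft prior" behavior.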

---

#### **9.2. Analogy: CNN kernel vs RPE bias**

Think of this like a **convolutional kernel’s receptive field**.

In a CNN:

* The kernel weights are the same everywhere.
* But the activations depend on local content — edges, corners, etc.

In a Transformer with **relative positional bias**:

* The bias table $b_{Δh,Δw}$ is shared everywhere.
* But the attention weights depend on $q_i^\top k_j$, which depends on the **patch features**.

So “−0.2 for the left neighbor” doesn’t mean “never look left.”
It means “unless the content gives me a good reason, looking left is slightly discouraged.”

That’s how **both local inductive bias and flexibility** coexist.

---

#### **9.3. Why uniform biases don’t hurt expressivity**

Because:

* Each **attention head** has its **own bias table**.
  (In Swin that is `num_heads × (2H−1)×(2W−1)` parameters per layer; T5 similarly keeps one scalar per head for each distance bucket.)
* Heads can specialize:

  * One head might learn to look mostly **horizontally**,
  * Another might specialize in **vertical** context,
  * Another might attend **globally**.

So, even though the bias is spatially shared,
different heads learn different directional behaviors.

---

#### **9.4. Example**

Let’s say we have 4 heads:

| Head | Learned pattern                                   |
| ---- | ------------------------------------------------- |
| 1    | looks right (bias +0.2 for Δcol=−1, key right of query)            |
| 2    | looks left (bias +0.2 for Δcol=+1, key left of query)              |
| 3    | looks downward (bias +0.3 for Δrow=−1, key below query)            |
| 4    | looks diagonally down-right (bias +0.3 for Δrow=−1, Δcol=−1)       |

So different heads handle different spatial relations —
together they cover all directions.

That’s how the model can decide:

* “In this image region, use horizontal attention,”
* “In that region, vertical,” etc.

---

#### **9.5. When global or flexible relations are needed**

The relative bias is a *local* prior: in Swin its table only covers offsets inside one window.
Where long-range relations matter (e.g., global attention in ViT),
the **content term $q_i^\top k_j$** can dominate and override the spatial bias.

That’s why Swin Transformer mixes:

* **local relative bias** (inside each window)
* **shifted windows** (the partition moves between consecutive blocks, so content attention can connect patches across the previous window boundaries)

---

#### **9.6. Intuitive summary**

| Question                                                             | Answer                                                                                                                |
| -------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------- |
| “Does the same bias apply everywhere?”                               | Yes — for the same relative offset.                                                                                   |
| “Doesn’t that make the model look in the same direction everywhere?” | No — because each head has its own bias table and attention is still content-based.                                   |
| “So what does the bias actually do?”                                 | It provides a *consistent spatial prior* — like “prefer nearby patches,” “favor upward,” etc. — but doesn’t force it. |

---

✅ **In one sentence:**

> The relative bias doesn’t *force* the model to look in one direction;
> it just gives every attention head a consistent **spatial preference**,
> while the **content similarity** term decides where to actually look.

---



## **10. Heatmap Example**
A **heatmap example** showing how a content-dominant case can override a negative bias (i.e., the model still attends strongly in a penalized direction if the content is relevant).

<img src="images/heatmap_showing_a_content-dominant_case_can_override_a_negative_bias.png" />



## **11. Word Embeddings vs. Positional Embeddings**
Word embeddings and positional embeddings are **two different uses of embeddings**.

---

#### **11.1. What “absolute axis” means**

When we say:

> “Absolute positional embeddings tell the model *where* each token is on an absolute axis”

the **axis** here refers to the **sequence index** — the fixed ordering of tokens in the input.

For example, for the sentence:
`The cat sat on the mat.`

| Token | Index (Position) | Absolute Positional Embedding (example) |
| ----- | ---------------- | --------------------------------------- |
| The   | 0                | [0.1, 0.3, 0.2, …]                      |
| cat   | 1                | [0.4, 0.7, 0.9, …]                      |
| sat   | 2                | [0.8, 0.1, 0.5, …]                      |
| on    | 3                | [0.3, 0.2, 0.6, …]                      |
| the   | 4                | [0.7, 0.4, 0.1, …]                      |
| mat   | 5                | [0.5, 0.9, 0.7, …]                      |

Each token gets a unique vector *based on its position index*.
That index (0, 1, 2, 3, 4, 5) is the **absolute axis** — the “ruler” along which tokens are placed.

So “absolute axis” simply means:
→ *a fixed coordinate system based on token index in the sequence.*

---

#### **11.2. Why we need position embeddings at all**

Transformers take a set of tokens and compute attention as:

$$
A_{ij} = \frac{q_i^\top k_j}{\sqrt{d}}
$$

But without any positional signal, this operation is **permutation-equivariant**:
if you shuffle the tokens, the outputs are shuffled in the same way but their values don’t change.
That’s bad, because word order matters!

Hence, we add **positional embeddings** — absolute or relative — to inject the notion of *order*.

---

#### **11.3. Absolute vs. Relative (geometric analogy)**

Let’s visualize with a simple 1D line (the “absolute axis”):

```
0----1----2----3----4----5---->
|    |    |    |    |    |
The  cat  sat  on   the  mat
```

* **Absolute position**: “This word is at coordinate 0, 1, 2, etc.”
* **Relative position**: “This word is *2 steps after* that one.”

In **absolute encoding**, each token gets its coordinates based on this line.
In **relative encoding**, the model learns the *distance and direction* between any two tokens — for example, token 3 is +2 away from token 1.

That’s why we say relative embeddings tell the model *how far (and in what direction)* one token is from another.

---

#### **11.4. Contrast with semantic embeddings (“king - man + woman = queen”)**

That’s a different *embedding space* — the **semantic word embedding space**, not the positional one.

**Word embeddings**

* Capture **semantic meaning** (king/man/woman/queen)
* Are learned from co-occurrence patterns
* Encode *analogical structure* like
  $$
  \text{king} - \text{man} + \text{woman} \approx \text{queen}
  $$
* Each dimension in that space doesn’t mean “position in a sentence,”
  it means *semantic feature direction* (e.g., gender, royalty, etc.)

**Positional embeddings**

* Capture **structural order** (position in sequence)
* Encode either:

  * Absolute position (index on a line), or
  * Relative position (offset between two positions)
* Their geometry encodes *sequence order*, not meaning.

So:

* “King – man + woman = queen” lives in **semantic space**.
* “Token 5 is two steps after token 3” lives in **positional space**.

Both use vectors and addition, but represent different concepts.

---

#### **11.5. Why the distinction matters**

* The **embedding of the word “king”** tells you *what it means* (semantic).
* The **positional embedding of index 5** tells you *where it appears* (structural).

When we feed a Transformer input, we actually **combine them**:

$$
x_i = \text{WordEmbedding}(w_i) + \text{PositionalEmbedding}(i)
$$

So each token embedding encodes both:

* **content meaning** (from the word embedding)
* **position meaning** (from the positional embedding)

---

#### 11.6. How “relative” modifies this

In relative embeddings, the model doesn’t learn where tokens *are*, but how they *relate*:

* Instead of “I’m token 5,”
  it learns “I’m 2 tokens after token 3.”

That makes the model **translation-invariant**:
shifting the whole sequence by 2 doesn’t change attention patterns.

For example, in music:

> A melody pattern repeated later in the song should mean the same thing.
> That’s why the **Music Transformer** used relative embeddings —
> it doesn’t care *where* the pattern starts, only *how notes relate in time*.

---

✅ **Summary**

| Concept                           | Embedding Space | Captures             | “Axis”                                      | Example                         |
| --------------------------------- | --------------- | -------------------- | ------------------------------------------- | ------------------------------- |
| **Word embedding**                | Semantic        | meaning, analogy     | conceptual dimensions (e.g., gender, tense) | king − man + woman ≈ queen      |
| **Absolute positional embedding** | Structural      | order in sequence    | absolute index axis (0,1,2,…)               | “token 3 is at position 3”      |
| **Relative positional embedding** | Structural      | distance & direction | offset between positions                    | “token j is 2 ahead of token i” |

---

## **12. How Relative Positional Embeddings Are Stored and Used**

#### **12.1. What is $ b_r $?**

In **relative positional embeddings**, $ b_r $ (or sometimes $ r^{(K)}_r $, $ r^{(V)}_r $) represents the **bias or embedding vector** associated with a **relative distance**:

$$
r = i - j \in [-R_{\max}, R_{\max}]
$$

Each $ b_r $ is a **learnable parameter** — just like the absolute `pos_embed`, but indexed by **relative distance** instead of absolute index.

---





#### **12.2. What $ R_{\text{max}} $ is**

In **relative positional embeddings**, we index relative offsets $ r = i - j $ —
the distance between a query position (i) and a key position (j).

But in a sequence of length $L$, $r$ could range from $-L+1$ to $+L-1$.

Example for $L=8$:
$$
r \in \{-7, -6, \dots, 0, \dots, +6, +7\}.
$$

If we stored a separate learnable embedding $b_r$ for every possible $r$,
that would mean $2L - 1$ learnable parameters per head — too large when $L$ is big (like 512 or 1024).

So we **clip or bucket** distances to a **maximum relative distance** $R_{\text{max}}$.

---

#### **12.2.1. Formal definition**

We define:

$$
r_{\text{clipped}} = \text{clip}(i - j, -R_{\text{max}}, +R_{\text{max}}),
$$

and use only $2R_{\text{max}} + 1$ learnable values.

So:

$$
b_r = \text{table}[r_{\text{clipped}} + R_{\text{max}}],
$$

where `table` is a learnable parameter tensor of size `[num_heads, 2R_max + 1]`.
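
The lookup can be sketched in a few lines of plain Python. The function name `clipped_relative_index`, the tiny `R_MAX = 2`, and the placeholder values in `table` are assumptions for illustration; in a real model `table` would be a trained parameter:

```python
def clipped_relative_index(i: int, j: int, r_max: int) -> int:
    """Map the offset r = i - j to a table index in [0, 2 * r_max]."""
    r = max(-r_max, min(r_max, i - j))  # clip(i - j, -R_max, +R_max)
    return r + r_max                    # shift so that indices start at 0


R_MAX = 2
NUM_HEADS = 1
# Stand-in for the learnable table of shape [num_heads, 2 * R_max + 1].
table = [[-0.4, -0.2, 0.0, 0.2, 0.4] for _ in range(NUM_HEADS)]

L = 5  # sequence length
bias = [[table[0][clipped_relative_index(i, j, R_MAX)] for j in range(L)]
        for i in range(L)]
for row in bias:
    print(row)
```

Every diagonal of `bias` holds a single value, and all offsets beyond ±`R_MAX` reuse the outermost entries, so the table size stays fixed no matter how long the sequence gets.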

---

#### **12.3. Where $ R_{\text{max}} $ comes from**

It’s a **hyperparameter** chosen when building the model, not learned.

You decide it based on how long your attention window should “see” relative distances distinctly before saturating.

Typical choices:

| Model                    | $ R_{\text{max}} $      | Meaning                                           |
| ------------------------ | ----------------------- | ------------------------------------------------- |
| Transformer-XL           | 16–64                   | beyond that, all far tokens share same embedding  |
| T5                       | 32 buckets, distances up to 128 | finer bins for small distances, coarser for large |
| Swin Transformer         | (2H−1, 2W−1) bias table | derived from window size (e.g. 7×7 → $R_{\text{max}}=6$) |
| BERT with Shaw-style RPE | 32–64                   | good compromise for sentence length               |

So you typically pick it relative to your **attention span or window size**.

---

#### **12.4. Why clipping is needed**

Without clipping:

* For long sequences, the table needs $2L-1$ entries per head (e.g., $L=2048$ → 4095), most of them for offsets that occur rarely.
* For distant tokens, fine-grained distance doesn’t help much — “far” is just “far.”

So after some threshold, the model just learns a single “far away” embedding.

Formally:
$$
r' =
\begin{cases}
-R_{\text{max}}, & \text{if } i-j < -R_{\text{max}}, \\
i-j, & \text{if } |i-j| \le R_{\text{max}}, \\
+R_{\text{max}}, & \text{if } i-j > R_{\text{max}}.
\end{cases}
$$

---

#### **12.5. In 2D (Vision models)**

In vision (like Swin Transformer), you have:

* A **window** of size $M\times M$ (e.g. 7×7 patches).
* Relative offsets:
  $\Delta h \in [-(M-1), +(M-1)]$,
  $\Delta w \in [-(M-1), +(M-1)]$.

So you don’t pick $R_{\text{max}}$ manually — it’s fixed by the window.

For example:

```python
window_size = 7
relative_position_bias_table = nn.Parameter(
    torch.zeros((2*window_size-1)*(2*window_size-1), num_heads)
)
# -> (13*13, num_heads) = (169, num_heads)
```

Here $R_{\text{max}} = 6$ because the maximum offset is ±6.

---

#### **12.6. Intuitive picture**

Think of $R_{\text{max}}$ as setting the “radius of spatial awareness”:

* Within ±R_max → learn individual relations.
* Beyond that → treat all distant tokens as “far away” equivalently.

For text: “words more than 128 tokens apart are equally distant.”
For vision (7×7 window): offsets never exceed ±6, so every offset gets its own entry and nothing is clipped.

---

✅ **Summary**

| Aspect        | Description                                                                                                                          |
| ------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| Symbol        | $R_{\text{max}}$                                                                                                                     |
| Meaning       | Maximum relative offset that gets its own entry                                                                                      |
| Source        | Hyperparameter (user-set) for text; fixed by the window size in vision                                                               |
| Typical value | 16–128 for text, 6 for 7×7 vision windows                                                                                            |
| Purpose       | Clip or bucket large distances: avoids huge tables, captures locality                                                                |
| Used in       | $b_{r_{\text{clipped}}}$, $r^{(K)}_{r_{\text{clipped}}}$, $r^{(V)}_{r_{\text{clipped}}}$ inside the attention computation             |

---

#### **12.6. Numerical mini-example**


**Setup**

Let’s take a **1D sequence** of **length $L = 7$** (tokens numbered 0–6).

We define:
$$r = i - j$$

as the **relative distance** between query position $i$ and key position $j$.

That means $r \in [-6, +6]$.

Now we choose $R_{\text{max}} = 2$.
So we only keep **five learnable biases** (for −2, −1, 0, +1, +2).
All larger distances are **clipped** to ±2.

---

**Bias lookup table**

Let’s assign arbitrary learnable scalar biases:

| Relative offset $r$ | Bias $b_r$ |
| ------------------- | ---------- |
| −2                  | −0.3       |
| −1                  | −0.2       |
| 0                   | 0.0        |
| +1                  | +0.2       |
| +2                  | +0.3       |

For any distance $r < -2$, use $b_{-2}$.
For any distance $r > +2$, use $b_{+2}$.

---

**Compute $r = i - j$ for all pairs**

We’ll build a $7\times7$ grid of all relative offsets.

| i\j | 0      | 1      | 2  | 3  | 4  | 5  | 6  |
| --- | ------ | ------ | -- | -- | -- | -- | -- |
| 0   | 0−0=0  | 0−1=−1 | −2 | −3 | −4 | −5 | −6 |
| 1   | 1−0=+1 | 0      | −1 | −2 | −3 | −4 | −5 |
| 2   | 2−0=+2 | +1     | 0  | −1 | −2 | −3 | −4 |
| 3   | 3−0=+3 | +2     | +1 | 0  | −1 | −2 | −3 |
| 4   | 4−0=+4 | +3     | +2 | +1 | 0  | −1 | −2 |
| 5   | 5−0=+5 | +4     | +3 | +2 | +1 | 0  | −1 |
| 6   | 6−0=+6 | +5     | +4 | +3 | +2 | +1 | 0  |

---

**Apply clipping**

Now clip all values to the range $[-2, +2]$:

| i\j | 0  | 1  | 2  | 3  | 4  | 5  | 6  |
| --- | -- | -- | -- | -- | -- | -- | -- |
| 0   | 0  | −1 | −2 | −2 | −2 | −2 | −2 |
| 1   | +1 | 0  | −1 | −2 | −2 | −2 | −2 |
| 2   | +2 | +1 | 0  | −1 | −2 | −2 | −2 |
| 3   | +2 | +2 | +1 | 0  | −1 | −2 | −2 |
| 4   | +2 | +2 | +2 | +1 | 0  | −1 | −2 |
| 5   | +2 | +2 | +2 | +2 | +1 | 0  | −1 |
| 6   | +2 | +2 | +2 | +2 | +2 | +1 | 0  |

---

**Convert to actual bias values**

Replace each relative offset with its bias $b_r$:

| i\j | 0    | 1    | 2    | 3    | 4    | 5    | 6    |
| --- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0   | 0.0  | −0.2 | −0.3 | −0.3 | −0.3 | −0.3 | −0.3 |
| 1   | +0.2 | 0.0  | −0.2 | −0.3 | −0.3 | −0.3 | −0.3 |
| 2   | +0.3 | +0.2 | 0.0  | −0.2 | −0.3 | −0.3 | −0.3 |
| 3   | +0.3 | +0.3 | +0.2 | 0.0  | −0.2 | −0.3 | −0.3 |
| 4   | +0.3 | +0.3 | +0.3 | +0.2 | 0.0  | −0.2 | −0.3 |
| 5   | +0.3 | +0.3 | +0.3 | +0.3 | +0.2 | 0.0  | −0.2 |
| 6   | +0.3 | +0.3 | +0.3 | +0.3 | +0.3 | +0.2 | 0.0  |

✅ You can see:

* Far distances (|r| ≥ 3) all share the same **clipped bias** (−0.3 or +0.3).
* The bias is constant along each diagonal: it depends only on $i-j$, not on $i$ or $j$ individually (translation invariant).
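
The three steps above (offset grid, clipping, lookup) can be reproduced in a few lines. This is a minimal NumPy sketch using the bias values from the table in this example; the variable names are illustrative.

```python
import numpy as np

L, R_max = 7, 2
b = np.array([-0.3, -0.2, 0.0, 0.2, 0.3])   # b_{-2}, b_{-1}, b_0, b_{+1}, b_{+2}

pos = np.arange(L)
r = pos[:, None] - pos[None, :]             # r[i, j] = i - j
r_clipped = np.clip(r, -R_max, R_max)       # clip to [-2, +2]
B = b[r_clipped + R_max]                    # 7x7 bias matrix

print(B)
```

Because `B[i, j]` depends only on `i - j`, shifting both indices by one leaves the value unchanged: `B[1:, 1:]` equals `B[:-1, :-1]`.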

---

**Visual intuition**

If we visualize this bias matrix as a heatmap:

* Diagonal ($r=0$) → 0.0 bias (neutral)
* Upper-right triangle ($j>i$, the query looks forward → negative $r$) → negative bias (−0.2, −0.3)
* Lower-left triangle ($j<i$, the query looks backward → positive $r$) → positive bias (+0.2, +0.3)
* Beyond ±2 → **flat color** (clipped zone)

So the model treats distances ≥2 as "equally far away."

---

**Generalization**

For longer sequences or higher dimensions:

* You still pick $R_{\text{max}}$ to control how finely you distinguish local offsets.
* Often you **bucket** distances logarithmically (T5) rather than clip linearly.

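In T5-style bucketing (the exact boundaries depend on the configuration), roughly:

* small distances (e.g. 0–15): one bias per distance
* larger distances: grouped into ranges whose width grows logarithmically
* beyond a maximum distance (e.g. 128): one shared bucket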

That helps the model cover both local and long-range relations efficiently.
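
To make the bucketing idea concrete, here is a simplified sketch of a logarithmic bucket function. It is not T5's exact implementation: the function name `log_bucket` is hypothetical, the default values are assumptions chosen for illustration, and only non-negative distances are handled.

```python
import math

def log_bucket(distance: int, num_exact: int = 16, num_buckets: int = 32,
               max_distance: int = 128) -> int:
    """Map a non-negative distance to a bucket id in [0, num_buckets - 1]."""
    if distance < num_exact:
        return distance                           # one bucket per small distance
    # Spread [num_exact, max_distance) logarithmically over the remaining buckets.
    scale = math.log(distance / num_exact) / math.log(max_distance / num_exact)
    bucket = num_exact + int(scale * (num_buckets - num_exact))
    return min(bucket, num_buckets - 1)           # all far distances share the last bucket

buckets = [log_bucket(d) for d in range(300)]
```

Small distances keep their own bucket, while the bucket id grows more and more slowly as the distance increases and saturates at `num_buckets - 1`.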

---

✅ **Summary**

| Symbol                              | Meaning                                                   |
| ----------------------------------- | --------------------------------------------------------- |
| $R_{\text{max}}$                    | maximum relative offset we treat distinctly               |
| $[-R_{\text{max}}, R_{\text{max}}]$ | range of learnable biases                                 |
| beyond range                        | clipped or bucketed                                       |
| effect                              | makes model translation-invariant and parameter-efficient |










---
#### **12.2. How are they stored in the model?**

Typically, they are implemented as a small **lookup table**:

```python
self.relative_bias = nn.Parameter(torch.zeros(2*R_max + 1))
```

or, for multi-head attention (like T5/Swin):

```python
self.relative_bias = nn.Parameter(torch.zeros(num_heads, 2*R_max + 1))
```

The model learns these values during training by backpropagation — exactly like weights and biases of linear layers.

---

#### **12.3. How they enter the attention computation**

Recall the equation:

$$
A_{ij} = \frac{q_i^\top k_j}{\sqrt{d}} + b_{i-j}
$$

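Here $b_{i-j}$ is a scalar retrieved from that table:

```python
import math
import torch

pos = torch.arange(L)                           # L = sequence length
r = pos[:, None] - pos[None, :]                 # r[i, j] = i - j
r_clipped = torch.clamp(r, -R_max, R_max)
bias = self.relative_bias[r_clipped + R_max]    # [L, L]
A = (Q @ K.transpose(-2, -1)) / math.sqrt(d) + bias
```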

If you’re using **multi-head attention**, each head can have its own bias table $b_{r}^{(h)}$.
If not, it’s shared across heads.
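
For a self-contained picture of where the bias enters, the following NumPy sketch runs multi-head attention with one bias table per head. The function name `attention_with_relative_bias`, the tensor layout `[H, L, d]`, and the random inputs are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_relative_bias(Q, K, V, rel_bias, r_max):
    """Q, K, V: [H, L, d]; rel_bias: [H, 2*r_max + 1] (one table per head)."""
    H, L, d = Q.shape
    pos = np.arange(L)
    idx = np.clip(pos[:, None] - pos[None, :], -r_max, r_max) + r_max  # [L, L]
    logits = Q @ K.transpose(0, 2, 1) / np.sqrt(d)                     # [H, L, L]
    logits = logits + rel_bias[:, idx]            # adds b^{(h)}_{clip(i-j)} to entry (h, i, j)
    return softmax(logits) @ V                    # [H, L, d]

rng = np.random.default_rng(0)
H, L, d, r_max = 2, 5, 4, 2
Q, K, V = (rng.normal(size=(H, L, d)) for _ in range(3))
rel_bias = rng.normal(size=(H, 2 * r_max + 1))    # stand-in for learned values
out = attention_with_relative_bias(Q, K, V, rel_bias, r_max)
```

Since the bias is added before the softmax, adding the same constant to every table entry does not change the output; only differences between offsets matter.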

---

#### **12.4. 2D (vision) case**

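For 2D windows (like Swin Transformer), the bias depends on both the vertical offset $\Delta h(i,j)$ and the horizontal offset $\Delta w(i,j)$. In factorized form, with **separate bias tables** for the two axes:

$$
A_{ij} = \frac{q_i^\top k_j}{\sqrt{d}} + b^{(h)}_{\Delta h(i,j)} + b^{(w)}_{\Delta w(i,j)}
$$

In code, Swin instead stores one joint table with an entry for every $(\Delta h, \Delta w)$ pair, as a single parameter tensor: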

```python
self.relative_position_bias_table = nn.Parameter(
    torch.zeros((2*Wh - 1)*(2*Ww - 1), num_heads)
)
```

and an **index map** precomputed to look up the correct bias per $(i,j)$ pair.
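
One way such an index map can be built is sketched below in NumPy. It follows the general idea of flattening each $(\Delta h, \Delta w)$ pair into a single table row; the function name `relative_position_index` and the choice of 3 heads are assumptions for illustration, and the table is a zero-filled stand-in for learned values.

```python
import numpy as np

def relative_position_index(Wh: int, Ww: int) -> np.ndarray:
    """Return idx[i, j] in [0, (2*Wh-1)*(2*Ww-1) - 1] for all token pairs in a window."""
    grid = np.meshgrid(np.arange(Wh), np.arange(Ww), indexing="ij")
    coords = np.stack(grid).reshape(2, -1)              # [2, N] with N = Wh*Ww
    rel = coords[:, :, None] - coords[:, None, :]       # [2, N, N]: (dh, dw) for each pair
    dh = rel[0] + (Wh - 1)                              # shift to start at 0
    dw = rel[1] + (Ww - 1)
    return dh * (2 * Ww - 1) + dw                       # flatten (dh, dw) into one index

idx = relative_position_index(7, 7)                     # [49, 49]
table = np.zeros(((2 * 7 - 1) * (2 * 7 - 1), 3))        # [169, num_heads], stand-in table
bias = table[idx].transpose(2, 0, 1)                    # [num_heads, 49, 49], added to logits
```

The index map is computed once per window size and reused in every forward pass; only the table itself is trained.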

---

#### **12.5. Example visualization**

Imagine $R_{\max}=2$, so we have 5 learnable scalars:

```text
r ∈ {-2, -1, 0, +1, +2}
b_r = [-0.1, 0.2, 0.0, 0.3, -0.2]   # learned after training
```

During backpropagation, the gradient of the loss with respect to $A_{ij}$ flows into the corresponding $b_{i-j}$.
So, if the model learns that **attending to the next tokens** ($j=i+1, i+2$, i.e. $r=i-j=-1,-2$) helps prediction, it will increase $b_{-1}$ and $b_{-2}$.
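
The gradient routing can be made explicit: because $A_{ij}$ contains $b_{i-j}$ additively, the gradient for each table entry is the sum of the logit gradients over all pairs that share that (clipped) offset. The sketch below uses a random matrix as a hypothetical upstream gradient; in a real model it would come from backpropagation.

```python
import numpy as np

L, R_max = 4, 2
pos = np.arange(L)
idx = np.clip(pos[:, None] - pos[None, :], -R_max, R_max) + R_max  # table index per (i, j)

# Hypothetical upstream gradient dLoss/dA (random stand-in).
grad_A = np.random.default_rng(0).normal(size=(L, L))

# dLoss/db[k] = sum of dLoss/dA[i, j] over all pairs with idx[i, j] == k.
grad_b = np.zeros(2 * R_max + 1)
np.add.at(grad_b, idx, grad_A)
```

Here `grad_b[R_max]` collects the main diagonal ($r=0$), while `grad_b[0]` collects every pair with $r \le -2$, which is why clipped entries receive gradient from many pairs at once.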

---

#### **12.6. Comparison**

| Concept                                   | Parameter                    | Indexed by         | Learned? | Typical size         |
| ----------------------------------------- | ---------------------------- | ------------------ | -------- | -------------------- |
| **Absolute positional embedding**         | `pos_embed[position]`        | absolute index (i) | ✅ yes    | $L \times d$         |
| **Relative positional bias (scalar)**     | `b_rel[offset]`              | offset (i-j)       | ✅ yes    | $(2R+1)$ or $(2H-1)\times(2W-1)$ grid |
| **Relative key/value embedding (vector)** | `r_K[offset]`, `r_V[offset]` | offset (i-j)       | ✅ yes    | $(2R+1) \times d$    |

---

✅ **Summary**

* $b_r$ are **learnable parameters**, one per relative distance bucket.
* They are optimized through normal backpropagation.
* This makes relative embeddings trainable **just like absolute embeddings**, but the model learns *distance-dependent* behavior rather than *position-dependent* behavior.

---



## **13. Main Approaches to Relative Positional Embedding (RPE)**

We can group them into **four main families**:

| # | Approach                                          | Key idea                                                                     | Example models                              |
| - | ------------------------------------------------- | ---------------------------------------------------------------------------- | ------------------------------------------- |
| 1 | **Additive bias (scalar)**                        | Learn one bias per relative distance bucket; add to attention logits         | T5, Swin, ViTDet                            |
| 2 | **Content-dependent relative key/value (vector)** | Learn relative embedding vectors that interact with queries/values           | Shaw et al. (Transformer-XL, DeBERTa-v1)    |
| 3 | **Rotary embedding (RoPE)**                       | Encode relative phase difference by rotating Q,K vectors in complex/2D space | RoFormer, GPT-NeoX, Llama-2/3               |
| 4 | **Analytic slope (ALiBi)**                        | No parameters; add fixed linear slope per head for distance bias             | ALiBi (Press et al., 2022), BLOOM, MPT      |

Let’s unpack each precisely.

---

#### **13.1. Additive bias (scalar per relative distance)**

**Equation**

$$
A_{ij} = \frac{q_i^\top k_j}{\sqrt{d}} + b_{\text{rel}}(i-j)
$$

* $b_{\text{rel}}(r)\in\mathbb{R}$ is a **learnable scalar bias**.
* Often **bucketed**: nearby distances get fine bins, long ones coarse.

**Implementation**

```python
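b_rel = nn.Parameter(torch.zeros(num_heads, num_buckets))
# rel[i, j] = i - j for all position pairs; bucket() maps offsets to bucket ids
bias = b_rel[:, bucket(rel)]                      # [num_heads, L, L]
A = (Q @ K.transpose(-2, -1)) / sqrt(d) + bias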
```

**Pros**

* Simple and cheap: one $O(L^2)$ bias addition per head.
* Works well for long sequences (T5).
* Naturally supports 2D windows (Swin).

**Cons**

* Adds only a scalar per offset that is independent of token content, so it can’t model content-position interactions.

---

#### **13.2. Content-dependent relative key/value (Shaw et al., 2018)**

**Equation**

$$
A_{ij} = \frac{q_i^\top k_j + q_i^\top r^{(K)}_{i-j}}{\sqrt{d}}, \qquad
\text{out}_i = \sum_j \alpha_{ij}\left(v_j + r^{(V)}_{i-j}\right)
$$

* $r^{(K)}_r, r^{(V)}_r \in \mathbb{R}^d$ are **learnable vectors** for each relative offset $r$.
* Can be implemented efficiently with the “**skew trick**”, which avoids materializing a full $L \times L \times d$ tensor.
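
As a sketch of the efficiency idea, the code below computes $q_i^\top r^{(K)}_r$ for every offset once and then gathers the entry each $(i,j)$ pair needs. It uses a plain gather instead of the reshape-based skew, and the function name `shaw_logits` and the random inputs are assumptions for illustration. A slow per-pair loop is included as a reference.

```python
import numpy as np

def shaw_logits(Q, K, rel_K, r_max):
    """Q, K: [L, d]; rel_K: [2*r_max + 1, d] holding r^{(K)}_r for r = -r_max..r_max."""
    L, d = Q.shape
    pos = np.arange(L)
    idx = np.clip(pos[:, None] - pos[None, :], -r_max, r_max) + r_max  # [L, L]
    content = Q @ K.T                               # q_i . k_j
    S = Q @ rel_K.T                                 # [L, 2*r_max + 1]: q_i . r_r for every offset
    position = np.take_along_axis(S, idx, axis=1)   # pick q_i . r_{i-j} for each j
    return (content + position) / np.sqrt(d)

rng = np.random.default_rng(0)
L, d, r_max = 6, 4, 2
Q, K = rng.normal(size=(L, d)), rng.normal(size=(L, d))
rel_K = rng.normal(size=(2 * r_max + 1, d))         # stand-in for learned vectors
A = shaw_logits(Q, K, rel_K, r_max)

# Reference: loop over every (i, j) pair explicitly.
A_ref = np.array([[Q[i] @ (K[j] + rel_K[np.clip(i - j, -r_max, r_max) + r_max]) / np.sqrt(d)
                   for j in range(L)] for i in range(L)])
```

The gathered version only ever stores an `[L, 2*r_max + 1]` intermediate, whereas adding $r^{(K)}_{i-j}$ to every key explicitly would need an `[L, L, d]` tensor.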

**Pros**

* Allows direction-aware and content-aware attention (richer signal).
* Works well in Transformer-XL, DeBERTa.

**Cons**

* Slightly more compute/memory.
* Doesn’t scale well to huge $L$ (needs clipping or bucketing).

---

#### **13.3. Rotary Positional Embedding (RoPE)**

Instead of adding a bias or vector, **rotate** Q and K in 2-D subspaces depending on position:

$$
\tilde{q}_i = R(\theta_i) q_i,\qquad
\tilde{k}_j = R(\theta_j) k_j
$$

Then:

$$
\tilde{q}_i^\top \tilde{k}_j
= q_i^\top R(\theta_{j}-\theta_{i}) k_j
$$

The dot-product *implicitly* encodes **relative position $j-i$** via phase difference.
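
This property is easy to check numerically. The sketch below is a minimal RoPE in NumPy, assuming the common choice of rotating consecutive feature pairs with frequencies `base ** (-2k/d)`; the function name `rope` and the random inputs are illustrative. It compares the logits at positions `0..L-1` with the logits after shifting every position by 100.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x ([L, d], d even) by position-dependent angles."""
    L, d = x.shape
    freqs = base ** (-np.arange(0, d, 2) / d)       # [d/2] rotation frequencies
    angles = positions[:, None] * freqs[None, :]    # [L, d/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                 # two components of each 2-D pair
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
L, d = 5, 8
q, k = rng.normal(size=(L, d)), rng.normal(size=(L, d))
pos = np.arange(L, dtype=float)

logits = rope(q, pos) @ rope(k, pos).T
# Same offsets j - i, different absolute positions.
logits_shifted = rope(q, pos + 100.0) @ rope(k, pos + 100.0).T
```

The two logit matrices agree up to floating-point error, which is the sense in which RoPE is relative even though each vector is rotated by its absolute position.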

**Pros**

* No new parameters (only deterministic frequencies).
* Naturally relative; going far beyond the training length usually needs extra techniques (e.g. position interpolation).
* Now standard in Llama, GPT-NeoX, etc.

**Cons**

* Needs even-dim pairing (applies 2-D rotations).
* Harder to combine with learned biases (variants such as “XPos” modify RoPE with a decay term instead).

---

#### **13.4. ALiBi (Attention with Linear Biases)**

Add a *fixed linear penalty* proportional to distance:

$$
A_{ij} = \frac{q_i^\top k_j}{\sqrt{d}} - m_h (i-j)
$$

* $m_h>0$ is a fixed slope per head (non-learned).
* Encourages each head to prefer local context but still attend long-range.
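
A minimal sketch of the resulting bias tensor is shown below. It assumes the geometric slope schedule suggested in the ALiBi paper for power-of-two head counts, and it applies a causal mask so that only keys with $j \le i$ are visible; the function name `alibi_bias` is illustrative.

```python
import numpy as np

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Return a causal ALiBi bias of shape [num_heads, seq_len, seq_len]."""
    # Geometric slopes 2^(-8/n), 2^(-16/n), ... (assumes num_heads is a power of two).
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]                  # i - j
    bias = -slopes[:, None, None] * dist[None, :, :]    # -m_h * (i - j)
    # Causal mask: keys ahead of the query (j > i) are excluded.
    return np.where(dist[None] >= 0, bias, -np.inf)

bias = alibi_bias(num_heads=8, seq_len=4)
```

Heads with large slopes focus on the most recent tokens, while heads with small slopes keep an almost flat view of the whole context.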

**Pros**

* Zero extra parameters.
* Scales to very long context windows.
* Excellent extrapolation (no need to interpolate embeddings).

**Cons**

* Fixed — can’t adapt biases through training.
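* Defined for causal attention ($j \le i$): with the signed offset $i-j$, keys ahead of the query would get a bonus rather than a penalty, so bidirectional use needs $|i-j|$ instead.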

---

#### **13.5. Specialization for Vision (2D RPE)**

For 2D patches/windows (like Swin, ViTDet):

$$ A_{ij} = \frac{q_i^\top k_j}{\sqrt{d}} + b^{(h)}_{\Delta h(i,j)} + b^{(w)}_{\Delta w(i,j)} $$

- Two small 1-D learnable tables, one for height offsets, one for width.
- Enables **translation equivariance** across image space.
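
The factorized form can be sketched as follows, assuming patches are flattened in row-major order. The function name `factorized_2d_bias` and the random table values are illustrative.

```python
import numpy as np

def factorized_2d_bias(b_h: np.ndarray, b_w: np.ndarray, H: int, W: int) -> np.ndarray:
    """b_h: [2H-1], b_w: [2W-1]; returns bias [H*W, H*W] for row-major flattened patches."""
    rows, cols = np.divmod(np.arange(H * W), W)     # (row, col) of each flattened patch
    dh = rows[:, None] - rows[None, :] + (H - 1)    # shifted to [0, 2H-2]
    dw = cols[:, None] - cols[None, :] + (W - 1)    # shifted to [0, 2W-2]
    return b_h[dh] + b_w[dw]

H, W = 3, 4
rng = np.random.default_rng(0)
b_h = rng.normal(size=2 * H - 1)                    # stand-in for learned height table
b_w = rng.normal(size=2 * W - 1)                    # stand-in for learned width table
bias = factorized_2d_bias(b_h, b_w, H, W)           # [12, 12]
```

Two patch pairs with the same $(\Delta h, \Delta w)$ get the same bias regardless of where they sit in the grid, for example `bias[0, 5]` and `bias[6, 11]`.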

---

#### **13.6. Summary table**

| Type                 | Added params      | Equation term                             | Depends on         | Learnable?      | Typical use     |
| -------------------- | ----------------- | ----------------------------------------- | ------------------ | --------------- | --------------- |
| **Additive bias**    | small $(2R+1)$            | $+\,b_{i-j}$                                 | distance           | ✅               | T5, Swin        |
| **Shaw-style**       | moderate $(2R+1)\times d$ | $+\,q_i^\top r^{(K)}_{i-j}$                  | distance + content | ✅               | Transformer-XL  |
| **RoPE**             | none                      | implicit via rotation                        | relative angle     | ❌ (fixed)       | Llama, GPT-NeoX |
| **ALiBi**            | none                      | $-\,m_h (i-j)$                               | distance           | ❌ (fixed slope) | BLOOM, MPT      |
| **2D Bias (vision)** | small grid                | $+\,b^{(h)}_{\Delta h}+b^{(w)}_{\Delta w}$   | Δrow, Δcol         | ✅               | Swin, ViTDet    |

---

#### **13.7. Key intuitions**

* **Absolute embeddings** encode “I am token #7”.
* **Relative embeddings** encode “token *j* is 3 steps ahead of me”.

Relative methods:

* improve **translation invariance**,
* **generalize** to longer sequences,
* and **model directionality** (past vs. future, left vs. right).

---

✅ **Summary**

> There isn’t one “RPE”: there are **several compatible mechanisms**.
> Most modern LLMs use **RoPE**, while windowed vision models such as Swin
> use an **additive relative bias** (a local inductive bias).

---



## **14. Numerical Example: Comparison of Different Relative Positional Embeddings (RPE)**

Let’s build a **numerical, side-by-side comparison** of the four major **relative positional embedding (RPE)** variants:

1. **Additive bias** (scalar per distance)
2. **Shaw-style** (content-dependent relative vectors)
3. **Rotary (RoPE)**
4. **ALiBi (linear slope)**

All on a **tiny 1D sequence** of length $L=3$, embedding dim $d=2$.
We’ll keep the numbers small and interpretable so you can literally check them by hand or in a Jupyter cell.

---

#### 14.1. Shared setup

Tokens $x_i$:

| index | vector $x_i$ |
| ----- | ------------ |
| 0     | [1, 0]       |
| 1     | [0, 1]       |
| 2     | [1, 1]       |

Use $Q=K=X$.
$\sqrt{d}=\sqrt{2}\approx1.414$.
The base content logits:

$$
A_{\text{content}} =
\begin{bmatrix}
0.707 & 0 & 0.707\\
0 & 0.707 & 0.707\\
0.707 & 0.707 & 1.414
\end{bmatrix}.
$$

We’ll now modify these logits differently for each RPE variant.

---

#### 14.2. Variant 1: Additive bias (scalar)

Each relative offset $r=i-j$ gets a learnable scalar $b_r$:

| $r$ | $b_r$ |
| --- | ----- |
| −2  | −0.3  |
| −1  | −0.2  |
| 0   | 0.0   |
| +1  | +0.2  |
| +2  | +0.3  |

Add $b_{i-j}$ to logits:

$$
A_{\text{bias}} =
\begin{bmatrix}
0.707 & -0.2 & 0.407\\
0.200 & 0.707 & 0.507\\
1.007 & 0.907 & 1.414
\end{bmatrix}.
$$

The bias shifts attention toward **earlier tokens**: positive offsets $r=i-j>0$ mean $j<i$, and those entries receive the positive biases.
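
The matrix can be checked with a few lines of NumPy. This sketch uses the tokens from the shared setup and the $b_r$ values from the table; the variable names are illustrative.

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # tokens x_0, x_1, x_2 (Q = K = X)
b = {-2: -0.3, -1: -0.2, 0: 0.0, 1: 0.2, 2: 0.3}     # b_r per offset r = i - j

A_content = X @ X.T / np.sqrt(2)                      # base content logits
B = np.array([[b[i - j] for j in range(3)] for i in range(3)])
A_bias = A_content + B

print(np.round(A_bias, 3))
```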

---

#### 14.3. Variant 2: Shaw-style (content-dependent)

Each relative offset has a **vector** $r^{(K)}_r$:

| $r$ | $r^{(K)}_r$  |
| --- | ------------ |
| −2  | [−1, 0]      |
| −1  | [−0.5, 0.0]  |
| 0   | [ 0, 0 ]     |
| +1  | [ 0.5, 0.0 ] |
| +2  | [ 1, 0 ]     |

Compute
$$ A_{ij} = \frac{q_i^\top(k_j + r^{(K)}_{i-j})}{\sqrt{2}}. $$

Let’s compute row 0 explicitly:

* $j=0$: $r=0\Rightarrow q_0\cdot(k_0+r^{(K)}_0)=1·1+0·0=1$
* $j=1$: $r=-1\Rightarrow q_0\cdot(k_1+r^{(K)}_{-1})=1·(-0.5)+0·1=-0.5$
* $j=2$: $r=-2\Rightarrow q_0\cdot(k_2+r^{(K)}_{-2})=1·(1-1)+0·1=0$

Dividing by $\sqrt{2}\approx1.414$ gives row 0 = [0.707, −0.354, 0].

All rows, computed the same way:

| i | j | $r=i-j$ | $k_j+r^{(K)}_{i-j}$ | $q_i\cdot(\cdot)$ | /√2    |
| - | - | ------- | ------------------- | ----------------- | ------ |
| 0 | 0 | 0       | [1,0]               | 1                 | 0.707  |
| 0 | 1 | −1      | [−0.5,1]            | 1·(−0.5)+0·1=−0.5 | −0.354 |
| 0 | 2 | −2      | [0,1]               | 1·0+0·1=0         | 0      |
| 1 | 0 | +1      | [1.5,0]             | 0·1.5+1·0=0       | 0      |
| 1 | 1 | 0       | [0,1]               | 0·0+1·1=1         | 0.707  |
| 1 | 2 | −1      | [0.5,1]             | 0·0.5+1·1=1       | 0.707  |
| 2 | 0 | +2      | [2,0]               | 1·2+1·0=2         | 1.414  |
| 2 | 1 | +1      | [0.5,1]             | 1·0.5+1·1=1.5     | 1.061  |
| 2 | 2 | 0       | [1,1]               | 1·1+1·1=2         | 1.414  |

Thus

$$
A_{\text{Shaw}} =
\begin{bmatrix}
0.707 & -0.354 & 0.000\\
0.000 & 0.707 & 0.707\\
1.414 & 1.061 & 1.414
\end{bmatrix}.
$$

You see the pattern: local distance directly changes the dot-product via learned direction vectors.
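
The same numbers can be reproduced programmatically. This is a direct per-pair loop over the formula above, using the tokens and $r^{(K)}_r$ vectors from this example; it is meant as a check rather than an efficient implementation.

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])               # Q = K = X
r_K = {-2: [-1.0, 0.0], -1: [-0.5, 0.0], 0: [0.0, 0.0],
       1: [0.5, 0.0], 2: [1.0, 0.0]}                             # r^{(K)}_r per offset r = i - j

# A[i, j] = q_i . (k_j + r^{(K)}_{i-j}) / sqrt(2)
A_shaw = np.array([[X[i] @ (X[j] + np.array(r_K[i - j])) / np.sqrt(2)
                    for j in range(3)] for i in range(3)])

print(np.round(A_shaw, 3))
```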

---

#### 14.4. Variant 3: Rotary (RoPE)

For $d=2$, the rotation for position $p$ is:

$$
R(\theta_p) =
\begin{bmatrix}
\cos\theta_p & -\sin\theta_p\\
\sin\theta_p & \cos\theta_p
\end{bmatrix}.
$$

Let $\theta_p = p·30^\circ = [0°, 30°, 60°]$.

Compute rotated Q, K (since $Q=K$, the rotated keys equal the rotated queries):

| p | $q_p = k_p$ | $\tilde q_p = \tilde k_p = R(\theta_p)\,q_p$            |
| - | ----------- | ------------------------------------------------------- |
| 0 | [1,0]       | [1, 0]                                                  |
| 1 | [0,1]       | [−sin30°, cos30°] = [−0.5, 0.866]                       |
| 2 | [1,1]       | [cos60°−sin60°, sin60°+cos60°] = [−0.366, 1.366]        |

Now compute $\tilde A_{ij}=(\tilde q_i^\top\tilde k_j)/\sqrt{2}$:

| i\j | 0                                   | 1                                          | 2                                                              |
| --- | ----------------------------------- | ------------------------------------------ | -------------------------------------------------------------- |
| 0   | (1·1+0·0)/1.414=0.707               | (1·(−0.5)+0·0.866)/1.414=−0.354            | (1·(−0.366)+0·1.366)/1.414=−0.259                              |
| 1   | ((−0.5)·1+0.866·0)/1.414=−0.354     | ((−0.5)^2+0.866^2)/1.414=1.0/1.414=0.707   | ((−0.5)·(−0.366)+0.866·1.366)/1.414=1.366/1.414=0.966          |
| 2   | ((−0.366)·1+1.366·0)/1.414=−0.259   | same as entry (1,2) by symmetry: 0.966     | ((−0.366)^2+1.366^2)/1.414=2.0/1.414=1.414                     |

So

$$
A_{\text{RoPE}} =
\begin{bmatrix}
0.707 & -0.354 & -0.259\\
-0.354 & 0.707 & 0.966\\
-0.259 & 0.966 & 1.414
\end{bmatrix}.
$$

Notice how it’s **implicitly relative**: each entry equals $q_i^\top R(\theta_j-\theta_i)\,k_j/\sqrt{2}$, so it depends on the positions only through the offset $j-i$. For example, entry (1,2) is $[0,1]\cdot R(30^\circ)[1,1]/\sqrt{2} = 1.366/1.414 = 0.966$, matching the table.

---

#### 14.5. Variant 4: ALiBi (linear slope)

Add a *fixed* distance-dependent penalty:

$$
A_{ij} = \frac{q_i^\top k_j}{\sqrt{d}} - m (i-j), \quad m=0.1.
$$

Compute:

| i   | j     | base  | −m·(i−j) | total |
| --- | ----- | ----- | ---- | ----- |
| 0   | 0     | 0.707 | 0    | 0.707 |
| 0   | 1     | 0     | +0.1 | 0.1   |
| 0   | 2     | 0.707 | +0.2 | 0.907 |
| 1   | 0     | 0     | −0.1 | −0.1  |
| 1   | 1     | 0.707 | 0    | 0.707 |
| 1   | 2     | 0.707 | +0.1 | 0.807 |
| 2   | 0     | 0.707 | −0.2 | 0.507 |
| 2   | 1     | 0.707 | −0.1 | 0.607 |
| 2   | 2     | 1.414 | 0    | 1.414 |

$$
A_{\text{ALiBi}} =
\begin{bmatrix}
0.707 & 0.100 & 0.907\\
-0.100 & 0.707 & 0.807\\
0.507 & 0.607 & 1.414
\end{bmatrix}.
$$

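Each head can have its own slope $m_h$, controlling how attention decays with distance. This toy example has no causal mask, so keys ahead of the query ($j>i$) receive a bonus; under the causal mask ALiBi is designed for, only entries with $j \le i$ remain and the term is a pure penalty.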
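
The table and matrix above can be reproduced with a short NumPy check. It applies the formula exactly as written (signed offset $i-j$, no causal mask), using the tokens from the shared setup and $m=0.1$.

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # Q = K = X
m = 0.1                                               # single fixed slope
pos = np.arange(3)

# A[i, j] = q_i . k_j / sqrt(2) - m * (i - j)
A_alibi = X @ X.T / np.sqrt(2) - m * (pos[:, None] - pos[None, :])

print(np.round(A_alibi, 3))
```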

---

#### 14.6. Side-by-side summary

| Variant   | Learned?              | Adds what       | Example numeric pattern (row 0) | Comment                           |
| --------- | --------------------- | --------------- | ------------------------------- | --------------------------------- |
| **Bias**  | ✅ $b_r$ scalars       | +bias to logits             | [0.707, −0.2, 0.407]            | simplest learnable RPE            |
| **Shaw**  | ✅ $r^{(K)}_r$ vectors | $+\,q_i^\top r^{(K)}_{i-j}$ | [0.707, −0.354, 0]              | richer, directional               |
| **RoPE**  | ❌ rotations           | implicit                    | [0.707, −0.354, −0.259]         | parameter-free, smooth            |
| **ALiBi** | ❌ slope $m$           | $-m\,(i-j)$                 | [0.707, 0.1, 0.907]             | fixed bias, long-context friendly |

---

#### 14.7. Interpretation

* **Additive bias:** learns fixed attention preferences by *distance*.
* **Shaw:** learns *vector* interactions by distance — more expressive.
* **RoPE:** encodes distance via rotational phase; parameter-free and relative by construction.
* **ALiBi:** deterministic linear bias; no learned params, designed for length extrapolation in autoregressive models.

---

✅ **Key takeaway**

> All four inject *relative information* into the attention scores,
> but differ in **what is learned**, **how it scales**, and **how interpretable** it is:
>
> * **Bias & Shaw:** learn tables via backprop.
> * **RoPE & ALiBi:** analytic / fixed functions of distance.
> * **RoPE** ⇒ used in modern LLMs (Llama, GPT-NeoX).
> * **Bias (2-D)** ⇒ used in Swin Transformer (vision).
> * **Shaw + skew** ⇒ used in Music Transformer, Transformer-XL.
> * **ALiBi** ⇒ used in BLOOM, MPT (long context).

---




#### Implementation notes (fast paths)

* **Bias-only** (T5/ALiBi): precompute an $L\times L$ bias (or compute it on the fly with buckets); add to logits before softmax.
* **Shaw-style**: compute $Q R^{(K)\top}$ once, then **skew** (a reshape+pad+slice) so offset $r$ lines up with column $j$.
* **RoPE**: apply rotations to $Q, K$ once per layer; everything else is standard attention.
* **2-D**: store small $(2H-1)\times(2W-1)$ tables (or factor into two vectors of length $2H-1$ and $2W-1$).

---

#### When to use what?

* **RoPE**: ✅ naturally relative; no extra params; default for many LLMs (going far beyond the training length usually needs additional scaling tricks).
* **Additive bias (T5)**: ✅ simple, stable, cheap; works great with encoders/decoders; buckets control capacity.
* **ALiBi**: ✅ zero learned params; excellent long-context behavior; causal models.
* **Shaw-style**: ✅ highest expressivity (content × position); ❌ slightly heavier; good for tasks sensitive to fine relative geometry (e.g., some NMT or local vision windows).
* **2-D bias**: ✅ ideal for images/windows (Swin); small overhead; preserves translational inductive bias.

---

#### Quick reference equations (copy-ready)

**Bias-only (per head):**
$$
A_{ij}=\frac{q_i^\top k_j}{\sqrt{d}} + b_{\text{rel}}(\operatorname{bucket}(i-j)),\quad
\alpha_{ij}=\frac{e^{A_{ij}}}{\sum_{t} e^{A_{it}}},\quad
\text{out}_i=\sum_j \alpha_{ij} v_j.
$$

**Shaw et al.:**
$$
A_{ij}=\frac{q_i^\top k_j + q_i^\top r^{(K)}_{i-j}}{\sqrt{d}},\qquad
\text{out}_i=\sum_j \alpha_{ij}\left(v_j + r^{(V)}_{i-j}\right).
$$

**ALiBi (causal):**
$$
A_{ij}=\frac{q_i^\top k_j}{\sqrt{d}} - m_h\,(i-j),\quad j\le i.
$$

**RoPE:**
$$
\widetilde{q}_i=R(\theta_i)q_i,\quad \widetilde{k}_j=R(\theta_j)k_j,\quad
A_{ij}=\frac{\widetilde{q}_i^\top \widetilde{k}_j}{\sqrt{d}}
=\frac{q_i^\top R(\theta_{j}-\theta_{i})k_j}{\sqrt{d}}.
$$

**2-D windowed bias (vision):**
$$
A_{ij}=\frac{q_i^\top k_j}{\sqrt{d}} + b^{(h)}_{\Delta h(i,j)} + b^{(w)}_{\Delta w(i,j)}.
$$

---

#### Key intuitions (why it helps)

* **Translation invariance**: the model learns “same-pattern, different place” naturally.
* **Length generalization**: depends on *offsets*, not absolute indexes.
* **Parameter efficiency**: a small table (or none with RoPE/ALiBi) replaces $O(L)$ absolute embedding vectors.
* **Locality bias**: buckets/linear slopes can emphasize nearby tokens—crucial for language syntax and vision neighborhoods.

