---
title: "All you need to know about RoPE"
author: "Safouane Chergui"
date: "2026-01-08"
categories: [Python, NLP]
---

I discovered RoPE through tutorials and had been meaning to read the paper for quite a while. I've finally gotten to it.
Even though I understood the general idea and how it works, some questions kept buzzing in my mind.
So, in this blog post, I'll try to start from 0 and take you to a full understanding of RoPE.

This post will have a question-answer format.

# Q1: Why do we need position encoding at all?

Consider these two sentences:

- Sentence 1: `The cat chased the mouse`
- Sentence 2: `The mouse chased the cat`

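These sentences have different meanings, but vanilla attention cannot tell them apart. This is because attention without positional information is **permutation equivariant** (often loosely called permutation invariant): reordering the input tokens simply reorders the outputs in the same way, and each token's output is unchanged.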

To understand this property, let us look at the attention formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
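
As a point of reference, the formula above can be sketched in a few lines of NumPy. This is a minimal, single-head version with no masking or batching; the helper names `softmax` and `attention`, the random inputs, and the one-row-per-token layout are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K: (n_tokens, d_k); V: (n_tokens, d_v). One row per token.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_tokens, n_tokens) pre-softmax scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

# Random toy inputs: 5 tokens, dimension 8 (arbitrary illustrative sizes).
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out, weights = attention(Q, K, V)
```

Notice that nothing in `attention` ever looks at a token's index: the only inputs are the query, key and value vectors themselves.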

Let us see whether swapping the positions of `cat` and `mouse` changes anything for the vanilla attention mechanism.

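In sentence 1, the pre-softmax attention score between `cat` at position 1 (the query) and `mouse` at position 4 (the key), counting positions from 0, is:

$$\text{score}_{1,4} = \frac{q_{\text{cat}} \cdot k_{\text{mouse}}}{\sqrt{d_k}}$$

In sentence 2, the score between `mouse` at position 1 and `cat` at position 4 is:

$$\text{score}_{1,4} = \frac{q_{\text{mouse}} \cdot k_{\text{cat}}}{\sqrt{d_k}}$$

Note that the pair (`cat` as query, `mouse` as key) still appears in sentence 2, at index $(4, 1)$, with exactly the same value $\frac{q_{\text{cat}} \cdot k_{\text{mouse}}}{\sqrt{d_k}}$: nothing in the formula depends on where the tokens sit.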

Since queries and keys are computed as $Q = W_Q \cdot X$ and $K = W_K \cdot X$ where $X$ is just the token embedding (no position info in X), the same token always produces the same query/key vector regardless of its position.

Consequently, attention cannot distinguish who is chasing whom: the two sentences produce exactly the same attention matrix, up to the same reordering of rows and columns.
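
This can be checked numerically. The sketch below is a toy setup, not a real model: the lowercased vocabulary, the random embeddings `emb`, the random projections `W_Q` and `W_K`, and the helper `attention_weights` are all assumptions made for illustration. Tokens are stored one per row, so queries are computed as `X @ W_Q`, which is the transpose of the convention used in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # arbitrary embedding size for this toy example

# Hypothetical vocabulary with random embeddings (no position information).
vocab = ["the", "cat", "chased", "mouse"]
emb = {tok: rng.normal(size=d) for tok in vocab}
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))

def attention_weights(tokens):
    X = np.stack([emb[t] for t in tokens])  # (n_tokens, d), one row per token
    Q, K = X @ W_Q, X @ W_K
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)      # row-wise softmax

s1 = ["the", "cat", "chased", "the", "mouse"]
s2 = ["the", "mouse", "chased", "the", "cat"]
A1, A2 = attention_weights(s1), attention_weights(s2)

# Sentence 2 is sentence 1 with positions 1 and 4 swapped: s2[i] == s1[perm[i]].
perm = [0, 4, 2, 3, 1]
same_up_to_perm = np.allclose(A2, A1[np.ix_(perm, perm)])

# How much "cat" attends to "mouse" in each sentence.
cat_to_mouse_1 = A1[1, 4]  # cat at position 1, mouse at position 4
cat_to_mouse_2 = A2[4, 1]  # cat at position 4, mouse at position 1
```

Here `same_up_to_perm` comes out `True` and the two `cat_to_mouse` values match: swapping the two words only shuffles the rows and columns of the attention matrix, so the model has no signal telling it which animal comes first.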

The conclusion is that we must inject the position information into the attention mechanism. Otherwise, we're really just weighting a bag of words.