# Positional Encoding Calculations

Now that we have our **token embeddings** and simple **position-aware embeddings**, let’s discuss **how the positional encodings are calculated** in practice.

## Why use sine and cosine functions?

We use the following formulas for the positional encoding $P_{pos, i}$:

$$
\begin{aligned}
P_{pos, 2i} &= \sin\Big(\frac{pos}{10000^{2i/d}}\Big), \quad \text{for even indices} \\
P_{pos, 2i+1} &= \cos\Big(\frac{pos}{10000^{2i/d}}\Big), \quad \text{for odd indices}
\end{aligned}
$$

- $pos$ = position of the word in the sequence (starting from 0)
- $i$ = index of the sine/cosine pair ($i = 0, 1, 2, \dots$), so $2i$ and $2i+1$ are the actual dimension indices of the embedding
- $d$ = total embedding dimension

**Reasoning:**

1. Each dimension of the positional encoding corresponds to a **different frequency**.
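2. Sine is used for even indices and cosine for odd indices, so each pair of dimensions $(2i, 2i+1)$ shares one frequency. Together the pair pins down the angle unambiguously within one period, which helps the model **distinguish positions** and learn relative distances between words.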
3. The function produces a **continuous and smooth signal** for each position, which the model can exploit for sequence ordering.

> The positional encoding vector must have the **same size as the embedding vector** ($d$), because the two are added element-wise.
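
The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library implementation: the function name `positional_encoding` and the `base` parameter are assumptions introduced here, and the loop is written for readability, not speed.

```python
import math

def positional_encoding(seq_len, d, base=10000.0):
    """Return a seq_len x d table (list of rows) of sinusoidal encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for dim in range(d):
            pair = dim // 2  # "i" in the formula: dims 2i and 2i+1 share a frequency
            angle = pos / (base ** (2 * pair / d))
            # even dimension -> sine, odd dimension -> cosine
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Encodings for 4 positions with the toy dimension d = 5
pe = positional_encoding(seq_len=4, d=5)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Each printed row is the encoding for one position, computed directly from the formula. Later dimensions use a larger divisor, so their values change more slowly from one position to the next.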

## Example with our sentence

**Sentence:** `"The cat is black"`  
**Embedding dimension:** $d = 5$

### Step 1: Token embeddings $E_i$

| Token | $E_i$                     |
| ----- | ------------------------- |
| The   | [0.1, 0.2, 0.3, 0.4, 0.5] |
| cat   | [0.5, 0.4, 0.3, 0.2, 0.1] |
| is    | [0.0, 0.1, 0.0, 0.1, 0.0] |
| black | [0.2, 0.2, 0.2, 0.2, 0.2] |


### Step 2: Positional encoding $P_i$ (manual small numbers for clarity)

Here are the calculations for a single token, "The" at position 0, for the sake of brevity.

- **Dimension 0** (even) → sine:

$$
P_{0,0} = \sin\Big(\frac{0}{10000^{0/5}}\Big) = \sin(0) = 0
$$

- **Dimension 1** (odd) → cosine:

$$
P_{0,1} = \cos\Big(\frac{0}{10000^{0/5}}\Big) = \cos(0) = 1
$$

- **Dimension 2** (even) → sine:

$$
P_{0,2} = \sin\Big(\frac{0}{10000^{2/5}}\Big) = \sin(0) = 0
$$

- **Dimension 3** (odd) → cosine:

$$
P_{0,3} = \cos\Big(\frac{0}{10000^{2/5}}\Big) = \cos(0) = 1
$$

- **Dimension 4** (even) → sine:

$$
P_{0,4} = \sin\Big(\frac{0}{10000^{4/5}}\Big) = \sin(0) = 0
$$

Applying the same formulas to every position gives (values rounded to three decimals):

| Position | $P_i$                                |
| -------- | ------------------------------------ |
| 0        | [0.0, 1.0, 0.0, 1.0, 0.0]            |
| 1        | [0.841, 0.540, 0.025, 1.000, 0.001]  |
| 2        | [0.909, -0.416, 0.050, 0.999, 0.001] |
| 3        | [0.141, -0.990, 0.075, 0.997, 0.002] |

> Here, even indices (0,2,4) use **sine**, odd indices (1,3) use **cosine**. Dimensions 2 and 3 divide the position by $10000^{2/5} \approx 39.8$ and dimension 4 divides it by $10000^{4/5} \approx 1584.9$, so they change much more slowly than dimensions 0 and 1.

### Step 3: Position-aware embeddings

We add the token embedding and positional encoding element-wise:

$$
X_i = E_i + P_i
$$

Calculations for each word:

$$
\begin{aligned}
X_\text{The} &= [0.1+0.0,\ 0.2+1.0,\ 0.3+0.0,\ 0.4+1.0,\ 0.5+0.0] = [0.1, 1.2, 0.3, 1.4, 0.5] \\
X_\text{cat} &= [0.5+0.841,\ 0.4+0.540,\ 0.3+0.025,\ 0.2+1.000,\ 0.1+0.001] = [1.341, 0.940, 0.325, 1.200, 0.101] \\
X_\text{is} &= [0.0+0.909,\ 0.1-0.416,\ 0.0+0.050,\ 0.1+0.999,\ 0.0+0.001] = [0.909, -0.316, 0.050, 1.099, 0.001] \\
X_\text{black} &= [0.2+0.141,\ 0.2-0.990,\ 0.2+0.075,\ 0.2+0.997,\ 0.2+0.002] = [0.341, -0.790, 0.275, 1.197, 0.202]
\end{aligned}
$$

### Final position-aware embeddings

| Token | $X_i$                                 |
| ----- | ------------------------------------- |
| The   | [0.1, 1.2, 0.3, 1.4, 0.5]             |
| cat   | [1.341, 0.940, 0.325, 1.200, 0.101]   |
| is    | [0.909, -0.316, 0.050, 1.099, 0.001]  |
| black | [0.341, -0.790, 0.275, 1.197, 0.202]  |

> Notice how **each word now encodes its position** in the sentence. These vectors can be fed into the Transformer model to preserve **order information**.
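
The element-wise addition $X_i = E_i + P_i$ can be reproduced with a short script. This is a sketch under stated assumptions: the helper name `encode_position` is hypothetical, the token embeddings are the toy values from Step 1, and the positional values are computed straight from the sine/cosine formula.

```python
import math

tokens = ["The", "cat", "is", "black"]

# Toy token embeddings from Step 1 (d = 5)
E = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [0.5, 0.4, 0.3, 0.2, 0.1],
    [0.0, 0.1, 0.0, 0.1, 0.0],
    [0.2, 0.2, 0.2, 0.2, 0.2],
]
d = len(E[0])

def encode_position(pos, d, base=10000.0):
    """Sinusoidal encoding for one position (hypothetical helper)."""
    row = []
    for dim in range(d):
        angle = pos / (base ** (2 * (dim // 2) / d))
        row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return row

# X_i = E_i + P_i, element-wise, one row per token
X = [
    [e + p for e, p in zip(E[pos], encode_position(pos, d))]
    for pos in range(len(tokens))
]

for token, x in zip(tokens, X):
    print(f"{token:>5}", [round(v, 3) for v in x])
```

Because the addition is element-wise, the same token placed at a different position would receive a different $X_i$, even though its $E_i$ is unchanged.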


### Why each word now has a unique fingerprint combining its meaning and its position

1. **Sine and cosine create a unique combination**:  
   Even if sine or cosine repeats at some positions (e.g., $\cos(0)=\cos(2\pi)=1$), the combination across all dimensions remains unique.

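2. **High-dimensional embeddings prevent collisions**:  
   Real models use a far larger $d$ than our toy example (e.g., $d = 768$ in BERT/DistilBERT). The wavelengths range from $2\pi$ to roughly $10000 \cdot 2\pi$, so although individual dimensions repeat, the full vector does not repeat over any practical sequence length.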

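3. **Captures relative positions**:  
   For a fixed offset $k$, the encoding at position $pos + k$ is a linear function (a rotation of each sine/cosine pair) of the encoding at $pos$. This gives the model a simple way to use **relative distance information**.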

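4. **Scalable for long sequences**:  
   The encoding is a fixed function of $pos$ rather than a learned lookup table, so it can be computed for positions beyond those seen during training. How well a trained model actually handles such lengths still has to be checked empirically.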
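
The relative-position property in point 3 can be checked numerically. The sketch below assumes the standard angle-addition identities for sine and cosine. The helper names `pair_encoding` and `shift`, and the values `d = 8`, `i = 1`, `k = 3`, are illustrative choices, not part of the example sentence.

```python
import math

def pair_encoding(pos, i, d, base=10000.0):
    """(sin, cos) values for pair i, i.e. dimensions 2i and 2i+1."""
    w = 1.0 / (base ** (2 * i / d))  # frequency of this pair
    return math.sin(w * pos), math.cos(w * pos)

def shift(pair, k, i, d, base=10000.0):
    """Rotate a (sin, cos) pair by offset k; the rotation does not depend on pos."""
    w = 1.0 / (base ** (2 * i / d))
    s, c = pair
    return (
        s * math.cos(w * k) + c * math.sin(w * k),  # sin(w * (pos + k))
        c * math.cos(w * k) - s * math.sin(w * k),  # cos(w * (pos + k))
    )

d, i, k = 8, 1, 3
for pos in (0, 5, 40):
    direct = pair_encoding(pos + k, i, d)
    rotated = shift(pair_encoding(pos, i, d), k, i, d)
    error = max(abs(a - b) for a, b in zip(direct, rotated))
    print(f"pos={pos:>2}  max abs difference = {error:.2e}")
```

The same rotation maps position `pos` to `pos + k` regardless of `pos`, which is what lets the model treat a fixed offset consistently anywhere in the sequence.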


At this point, we are **done with constructing the positional encodings** and adding them to the token embeddings. Transformers **do not have any built-in notion of order**. Unlike RNNs, which read tokens one after another, or CNNs, whose kernels slide over neighboring tokens, self-attention processes all tokens **in parallel** and would treat the input as an unordered set. By adding $P_i$ to $E_i$, we explicitly inject **word order information** into the embeddings.

Each $X_i$ now contains:

- semantic information (from $E_i$)
- positional information (from $P_i$)

This combined representation acts as a **unique fingerprint** for each word **at its specific position** in the sentence.

Now we are ready to provide these vectors as the **input embeddings** to the Transformer model.

- These $X_i$ vectors are passed directly into the **self-attention layers**
- From this point on, the model operates **only on these position-aware embeddings**

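The original Transformer builds its inputs exactly this way, with $d = 512$. Models like **BERT** and **DistilBERT** follow the same pattern of adding a position vector to each token embedding, with much higher dimensionality (e.g., $d=768$), but by default they learn those position vectors during training instead of computing them from fixed sinusoids.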

![Positional Encoding](../FIGS/positional-embedding.png)
