# Positional Encodings in Transformer Architectures

Both BERT and DistilBERT share the same overall structure: an embedding layer followed by a stack of transformer encoder layers. The key architectural difference is depth: BERT-base has 12 encoder layers, while DistilBERT has 6. The internal mechanics of each layer are the same.

Each transformer layer contains multi-head self-attention, layer normalization, and a feed-forward network. One more component is applied to the token embeddings _before_ they reach the first transformer layer: **positional encodings**.

## Why Positional Encodings Are Needed

Transformers process input tokens **in parallel**, not sequentially. Unlike recurrent models, they do not read a sentence from left to right or right to left. This parallelism is what makes transformers efficient, but it introduces a problem: **word order is not inherently known to the model**.

Without additional information, a transformer cannot distinguish between:

`"The dog chased the cat"`

and

`"The cat chased the dog"`

Both sentences contain exactly the same tokens, but their meanings differ because of word order.

Positional encodings solve this problem by injecting information about **token position** into the model.
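The order-blindness described above can be checked numerically. The following is a minimal sketch, not a full transformer layer: it assumes single-head dot-product self-attention with identity query, key, and value projections, and the 2-dimensional embeddings in `emb` are hypothetical values chosen only for illustration.

```python
import math

def self_attention(X):
    """Single-head dot-product self-attention.

    Simplifying assumption: queries, keys, and values are the inputs
    themselves (identity projections), which is enough to show the symmetry.
    """
    d = len(X[0])
    outputs = []
    for q in X:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w / total * v[j] for w, v in zip(weights, X)) for j in range(d)]
        )
    return outputs

# Hypothetical embeddings, with no positional information added.
emb = {"the": [0.1, 0.2], "cat": [0.5, 0.4], "is": [0.0, 0.1], "black": [0.2, 0.2]}

original = ["the", "cat", "is", "black"]
shuffled = ["black", "the", "is", "cat"]

out_original = dict(zip(original, self_attention([emb[t] for t in original])))
out_shuffled = dict(zip(shuffled, self_attention([emb[t] for t in shuffled])))

# Each token receives the same output vector regardless of where it stood.
order_invisible = all(
    math.isclose(a, b, abs_tol=1e-12)
    for tok in original
    for a, b in zip(out_original[tok], out_shuffled[tok])
)
print(order_invisible)
```

Under these assumptions the check should print `True`: shuffling the input only shuffles the output rows, so nothing in the result reveals the original order.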

## Comparison with Sequential Models

In traditional RNN-based encoders, tokens are processed one at a time:

Input sequence:  
"the cat is black"

Processing order:

- "the" → first
- "cat" → second
- "is" → third
- "black" → fourth

Because the sequence is processed step by step, word order is built into the computation itself: the hidden state at each step depends on all the tokens that came before it.
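As a rough illustration of this order sensitivity, the sketch below runs a scalar toy recurrence over the same tokens in two different orders. The weights `w_h` and `w_x` and the scalar "embeddings" in `x` are hypothetical values, not taken from any trained model.

```python
import math

def rnn_final_state(inputs, w_h=0.5, w_x=1.0):
    """Run the toy recurrence h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = 0.0
    for x_t in inputs:  # one token at a time, in sequence order
        h = math.tanh(w_h * h + w_x * x_t)
    return h

# Hypothetical scalar stand-ins for the four token embeddings.
x = {"the": 0.1, "cat": 0.5, "is": 0.0, "black": 0.2}

forward = rnn_final_state([x[t] for t in ["the", "cat", "is", "black"]])
backward = rnn_final_state([x[t] for t in ["black", "is", "cat", "the"]])

print(forward, backward)
```

The two final states differ even though the inputs are the same set of values, because each step feeds on the result of the previous one.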

In contrast, a transformer encoder receives all tokens at once:

Input tokens:  
`["the", "cat", "is", "black"]`

All tokens are processed simultaneously. Without positional encodings, the model has no way to know which word came first or last.

![RNN vs. Transformer encoding](../FIGs/rnn-transformer.png)

## How Positional Encodings Work

After tokenization, each token is converted into an embedding vector. Positional encodings are **added directly to these embeddings** before they are passed into the transformer layers.

Conceptually:

$$\text{Position-aware embeddings}=\text{Token embeddings}+\text{Positional encodings}$$
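One way to picture this sum is as two table lookups, one keyed by token and one keyed by position. The sketch below assumes small hand-written tables (`token_table` and `position_table` are hypothetical names and values); real models learn or compute these vectors.

```python
# Hypothetical toy tables with embedding dimension 3.
token_table = {
    "the": [0.1, 0.2, 0.3],
    "cat": [0.5, 0.4, 0.3],
    "saw": [0.0, 0.1, 0.0],
}
position_table = [
    [0.01, 0.02, 0.03],  # position 0
    [0.02, 0.01, 0.02],  # position 1
    [0.03, 0.03, 0.03],  # position 2
    [0.04, 0.02, 0.01],  # position 3
]

def position_aware(tokens):
    """Add the vector for each position to the vector for the token found there."""
    return [
        [e + p for e, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(tokens)
    ]

X = position_aware(["the", "cat", "saw", "the"])

# "the" occurs at positions 0 and 3 and ends up with two different vectors.
print(X[0])
print(X[3])
```

The same token embedding combined with two different position vectors yields two different inputs, which is what lets later layers tell the occurrences apart.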

Each position in the sequence is associated with its own vector. In BERT and DistilBERT these vectors are learned during training as absolute position embeddings; other architectures use fixed functions or encode relative position instead. In every case the goal is the same: to let the model learn patterns that depend on word order and distance.

At this stage, it is not necessary to focus on the exact numerical form of these encodings. What matters is their role: **they give the model access to word order information**.
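For readers who do want one concrete form, the sketch below implements the fixed sinusoidal scheme from the original Transformer paper, where dimension $2i$ uses $\sin(pos / 10000^{2i/d})$ and dimension $2i+1$ uses the matching cosine. It is shown only as an example of a valid encoding; BERT and DistilBERT learn their position vectors instead, and the function name `sinusoidal_encoding` is our own.

```python
import math

def sinusoidal_encoding(num_positions, dim):
    """Build a table with one sinusoidal position vector per row."""
    table = []
    for pos in range(num_positions):
        row = []
        for j in range(dim):
            # Dimensions 2i and 2i+1 share one frequency: j - j % 2 equals 2i.
            angle = pos / (10000 ** ((j - j % 2) / dim))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(num_positions=4, dim=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 4) for v in row])
```

Each row is distinct, so every position gets its own signature without any training.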

![Positional Encoding](../FIGS/positional-embedding.png)

## Position-Aware Embeddings Example

**Sentence:** "The cat is black"  
**Embedding dimension:** 5

### 1. Token Embeddings

| Token | Embedding $E_i$           |
| ----- | ------------------------- |
| The   | [0.1, 0.2, 0.3, 0.4, 0.5] |
| cat   | [0.5, 0.4, 0.3, 0.2, 0.1] |
| is    | [0.0, 0.1, 0.0, 0.1, 0.0] |
| black | [0.2, 0.2, 0.2, 0.2, 0.2] |

### 2. Positional Encodings

| Position | Positional Encoding $P_i$      |
| -------- | ------------------------------ |
| 1        | [0.01, 0.02, 0.03, 0.04, 0.05] |
| 2        | [0.02, 0.01, 0.02, 0.01, 0.02] |
| 3        | [0.03, 0.03, 0.03, 0.03, 0.03] |
| 4        | [0.04, 0.02, 0.01, 0.03, 0.05] |

### 3. Position-Aware Embeddings

**Formula:**

$$
X_i = E_i + P_i
$$

**Calculations:**

$$
\begin{aligned}
X_\text{The} &= [0.1+0.01, 0.2+0.02, 0.3+0.03, 0.4+0.04, 0.5+0.05] = [0.11, 0.22, 0.33, 0.44, 0.55] \\
X_\text{cat} &= [0.5+0.02, 0.4+0.01, 0.3+0.02, 0.2+0.01, 0.1+0.02] = [0.52, 0.41, 0.32, 0.21, 0.12] \\
X_\text{is} &= [0.0+0.03, 0.1+0.03, 0.0+0.03, 0.1+0.03, 0.0+0.03] = [0.03, 0.13, 0.03, 0.13, 0.03] \\
X_\text{black} &= [0.2+0.04, 0.2+0.02, 0.2+0.01, 0.2+0.03, 0.2+0.05] = [0.24, 0.22, 0.21, 0.23, 0.25]
\end{aligned}
$$

### Resulting Position-Aware Embeddings

| Token | $X_i$                          |
| ----- | ------------------------------ |
| The   | [0.11, 0.22, 0.33, 0.44, 0.55] |
| cat   | [0.52, 0.41, 0.32, 0.21, 0.12] |
| is    | [0.03, 0.13, 0.03, 0.13, 0.03] |
| black | [0.24, 0.22, 0.21, 0.23, 0.25] |
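The same calculation can be reproduced in a few lines of code. This is a minimal sketch using the token embeddings and positional encodings from the tables in this example; results are rounded to two decimals to hide floating-point noise.

```python
# Token embeddings E_i and positional encodings P_i from the example tables.
E = {
    "The": [0.1, 0.2, 0.3, 0.4, 0.5],
    "cat": [0.5, 0.4, 0.3, 0.2, 0.1],
    "is": [0.0, 0.1, 0.0, 0.1, 0.0],
    "black": [0.2, 0.2, 0.2, 0.2, 0.2],
}
P = [
    [0.01, 0.02, 0.03, 0.04, 0.05],  # position 1
    [0.02, 0.01, 0.02, 0.01, 0.02],  # position 2
    [0.03, 0.03, 0.03, 0.03, 0.03],  # position 3
    [0.04, 0.02, 0.01, 0.03, 0.05],  # position 4
]

tokens = ["The", "cat", "is", "black"]

# X_i = E_i + P_i, computed element by element.
X = {
    tok: [round(e + p, 2) for e, p in zip(E[tok], P[i])]
    for i, tok in enumerate(tokens)
}

for tok in tokens:
    print(tok, X[tok])
```

The printed rows should agree with the position-aware embeddings listed in the table.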
