# Sinusoidal (Fixed) Positional Encoding

## Introduction

Self-attention is permutation-invariant with respect to its context: each query token attends to all of its context tokens simultaneously, so, unlike RNNs and CNNs, Transformers have no built-in notion of word order.

To inject token order, sinusoidal positional encoding was introduced in the original Transformer ("Attention Is All You Need", Vaswani et al., 2017). The core idea is to encode each position $p \in [0, L)$ as a deterministic vector $PE(p) \in \mathbb{R}^d$, where $d$ is the model dimension (`d_model`). The vector is built from sines and cosines at geometrically spaced frequencies: lower dimensions get higher frequencies and higher dimensions get lower frequencies. As a result:
   -  The encoding for any position can be computed on the fly: it is deterministic, works out of the box, and requires no additional parameters.
   -  Adding it to the token embeddings lets attention also "sense" the order of, and distance (*relative offsets*) between, tokens. The relative position signal is linear: for each frequency $\omega$, $\sin(\omega(p+\delta))$ and $\cos(\omega(p+\delta))$ are linear combinations of $\sin(\omega p)$ and $\cos(\omega p)$ with coefficients that depend only on $\delta$, so a linear layer can map the encoding at $p$ to the encoding at $p+\delta$.
   -  It can extrapolate to sequences longer than those seen in training, although quality can degrade as sequences get much longer.

With sinusoidal positional encoding, the model can learn to attend by absolute or relative position using only linear operations on the encodings.
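
The linear-combination property can be checked numerically. The following is a minimal sketch in plain Python, assuming a single (sin, cos) pair with an arbitrary example frequency `omega`; the helper names `pe_pair`, `shift_matrix`, and `apply` are illustrative and do not come from the original paper.

```python
import math

def pe_pair(pos: float, omega: float) -> tuple[float, float]:
    """Return the (sin, cos) pair for one frequency at position `pos`."""
    return math.sin(omega * pos), math.cos(omega * pos)

def shift_matrix(delta: float, omega: float) -> list[list[float]]:
    """2x2 matrix that maps pe_pair(p) to pe_pair(p + delta) for any p.

    It depends only on the offset delta, not on the position p.
    """
    c, s = math.cos(omega * delta), math.sin(omega * delta)
    return [[c, s], [-s, c]]

def apply(matrix, vec):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return tuple(row[0] * vec[0] + row[1] * vec[1] for row in matrix)

omega = 0.01   # arbitrary example frequency (assumption for illustration)
delta = 5      # fixed offset between two positions
M = shift_matrix(delta, omega)

# The same matrix M shifts the encoding by delta at every position p.
for p in (0, 7, 100):
    shifted = apply(M, pe_pair(p, omega))
    target = pe_pair(p + delta, omega)
    assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(shifted, target))
```

The matrix is a rotation by the angle `omega * delta`, which is why a fixed offset corresponds to a fixed linear map regardless of the absolute position.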

The formula from the original Transformer paper:

$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
$$

Where
- $d$: model dimension (`d_model` in code)
- $i$: pair index, $i = 0, 1, ..., \frac{d}{2}-1$. Each pair index maps to two feature dimensions of the embedding vector: the even dimension $2i$ (sine) and the odd dimension $2i+1$ (cosine).
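
To make the frequency spacing concrete, the sketch below lists the wavelength of each sin/cos pair for a small example dimension. The function name `wavelengths` and the value `d = 8` are illustrative assumptions; the wavelength of pair $i$, $2\pi \cdot 10000^{2i/d}$, follows directly from the formula above.

```python
import math

def wavelengths(d: int, base: float = 10000.0) -> list[float]:
    """Wavelength (in positions) of each sin/cos pair i = 0 .. d/2 - 1."""
    return [2 * math.pi * base ** (2 * i / d) for i in range(d // 2)]

for i, w in enumerate(wavelengths(8)):
    # dimensions 2i and 2i+1 share the same wavelength
    print(f"pair i={i} (dims {2 * i}, {2 * i + 1}): wavelength = {w:.1f} positions")
```

The wavelengths form a geometric progression that starts at $2\pi$ for the first pair and approaches $10000 \cdot 2\pi$ for the last, so low dimensions oscillate quickly and high dimensions change slowly with position.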

Used in
- original Transformer (2017)

## Implementation

### Division Term

In the formula, the division term is $10000^{\frac{2i}{d}}$. In code, for numerical stability and efficiency, we replace the power form with its mathematically equivalent exponential form and compute the reciprocal directly, so that it can simply be multiplied by the position:

$$
\frac{1}{10000^{\frac{2i}{d}}} = 10000^{-\frac{2i}{d}} = e^{\ln\left(10000^{-\frac{2i}{d}}\right)}
$$
$$
= e^{-\frac{2i}{d}\ln 10000}
$$

In [5]:
import torch, math

d_model = 20000 # example model dimension
i = torch.arange(0, d_model, 2) # holds the values of 2i in the formula: 0, 2, 4, ...
div_term = torch.exp(-i / d_model * math.log(10000.0))
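
As a quick sanity check, the two forms can be compared numerically. This is a sketch in plain Python rather than PyTorch; the function names `div_term_power` and `div_term_exp` and the value `d_model = 16` are illustrative assumptions.

```python
import math

def div_term_power(d_model: int) -> list[float]:
    """Reciprocal division term, computed with the power form 1 / 10000^(2i/d)."""
    return [1.0 / (10000.0 ** (two_i / d_model)) for two_i in range(0, d_model, 2)]

def div_term_exp(d_model: int) -> list[float]:
    """Same quantity, computed with the exponential form exp(-(2i/d) * ln 10000)."""
    return [math.exp(-two_i / d_model * math.log(10000.0)) for two_i in range(0, d_model, 2)]

d_model = 16  # small example dimension, chosen only for illustration
for a, b in zip(div_term_power(d_model), div_term_exp(d_model)):
    # the two forms agree up to floating-point rounding
    assert math.isclose(a, b, rel_tol=1e-12)
```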

### Sinusoidal Positional Encoding  

A minimalistic implementation of the formula:
$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
$$

with the exponential form of the division term as implemented above.


In [None]:
import torch, math
def sinusoidal_position_encoding(seq_len: int, d_model: int):
    """
    Input: seq_len (number of positions), d_model (model dimension)
    Output: a positional encoding tensor of shape (seq_len, d_model)
    """
    pe = torch.zeros(seq_len, d_model)  # PE matrix: (seq_len, d_model)
    pos = torch.arange(seq_len).unsqueeze(1)    # positions 0, 1, 2,..., seq_len-1 -> (seq_len, 1)

    # calculate div_term
    i = torch.arange(0, d_model, 2) # values of 2i in the formula: 0, 2, 4, ... (even indices below d_model)
    div_term = torch.exp(-i / d_model * math.log(10000.0))

    # fill the even (sin) and odd (cos) channels of the PE matrix
    pe[:, 0::2] = torch.sin(pos * div_term) # columns 0, 2, 4, ...
    pe[:, 1::2] = torch.cos(pos * div_term[: d_model // 2]) # columns 1, 3, 5, ...; the slice also handles odd d_model

    return pe
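
A few properties of the encoding are easy to verify by hand and make useful sanity checks. The sketch below is a plain-Python reference (no PyTorch) that follows the formula directly; `pe_reference` is a hypothetical helper name, and an even `d_model` is assumed.

```python
import math

def pe_reference(seq_len: int, d_model: int) -> list[list[float]]:
    """Plain-Python positional encoding of shape (seq_len, d_model), even d_model."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model // 2):
            angle = pos / (10000.0 ** (2 * i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])  # dims 2i and 2i+1
        table.append(row)
    return table

ref = pe_reference(seq_len=4, d_model=6)

# position 0: every sin entry is 0 and every cos entry is 1
assert ref[0] == [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
# the first pair has frequency 1, so it is simply sin(pos), cos(pos)
assert math.isclose(ref[1][0], math.sin(1.0)) and math.isclose(ref[1][1], math.cos(1.0))
# every entry lies in [-1, 1]
assert all(-1.0 <= v <= 1.0 for row in ref for v in row)
```

The same checks can be applied to the tensor returned by `sinusoidal_position_encoding`, for example by comparing it against `torch.tensor(pe_reference(seq_len, d_model))` with `torch.allclose` and a suitable tolerance.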
       

To use `sinusoidal_position_encoding` in an autoregressive Transformer, simply add its output to the token embeddings of the input. For example:

In [7]:
import torch
import torch.nn as nn

# example values
vocab_size = 32000
d_model = 256
seq_len = 128
batch_size = 8

# Dummy input tokenized IDs - (batch_size, seq_len)
input_ids = torch.randint(low=0, high=vocab_size, size=(batch_size, seq_len))
# Embedding layer - (vocab_size, d_model)
token_embedding = nn.Embedding(vocab_size, d_model)

# Embedded input x - (batch_size, seq_len, d_model)
x = token_embedding(input_ids)
pe = sinusoidal_position_encoding(seq_len, d_model) # (seq_len, d_model)
x = x + pe.unsqueeze(0) # broadcast pe on batch_size dimension

# x is ready for attention layers