### Positional Encoding

For the model to make sense of a sentence, it needs to know two things about each word:

* What the word means.
* Where the word sits in the sentence.

In the paper *Attention Is All You Need*, the authors used the following functions to create the positional encoding:
- At even embedding-dimension indices (`2i`), a sine function is used.
- At odd embedding-dimension indices (`2i + 1`), a cosine function is used.

<div align="center">
  <img src="https://miro.medium.com/v2/resize:fit:996/1*pDZr1I5WQkbw23v5lnEsDg.png" alt="Positional Encoding" width="300">
</div>

**`pos ->`** Position of a word in the sequence (e.g., first word → `pos=0`, second word → `pos=1`, etc)

**`i ->`** Index of a sine/cosine pair of embedding dimensions (0 to `embed_dim / 2 - 1`); the pair occupies dimensions `2i` and `2i + 1`

**`d_model ->`** Refers to the embedding dimension `embed_dim`
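
To make the formula concrete, here is a minimal sketch in plain Python (no PyTorch) that evaluates it one entry at a time. The helper name `positional_encoding_row` and its `base` argument are assumptions made for this illustration; they do not come from the paper.

```python
import math

def positional_encoding_row(pos, d_model, base=10000.0):
    """Return the positional encoding vector for one position (hypothetical helper)."""
    row = []
    for k in range(d_model):
        # Dimensions 2i and 2i + 1 share the same exponent 2i / d_model.
        two_i = k - (k % 2)
        angle = pos / (base ** (two_i / d_model))
        # Even dimension -> sine, odd dimension -> cosine.
        row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return row

row = positional_encoding_row(pos=1, d_model=6)
print([round(v, 4) for v in row])
```

For `pos=0` every sine entry is 0 and every cosine entry is 1, which is a quick way to check an implementation by eye.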



The positional encoding generates a matrix with the same shape as the embedding matrix: **sequence length x embedding dimension**. For each token (word) in the sequence, we look up its embedding vector of dimension **1 x 512** and add the corresponding positional vector, also of dimension **1 x 512**, to get a **1 x 512** output for each word/token.

Example: with a batch size of `32`, a sequence length of `10`, and an embedding dimension of `512`, the embedding tensor has dimension `32 x 10 x 512`. The positional encoding for those 10 positions has dimension `10 x 512`. It is the same for every sentence in the batch, so it is broadcast across the batch (in effect `32 x 10 x 512`) and added to the embeddings.




<div align="center">
  <img src="https://miro.medium.com/max/906/1*B-VR6R5vJl3Y7jbMNf5Fpw.png" alt="Positional Encoding" width="300">
</div>
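
The shapes in the example above can be checked with a short sketch. The tensors here are random stand-ins, not real embeddings or a real positional table, and the variable names are chosen only for this illustration.

```python
import torch

batch_size, seq_len, embed_dim = 32, 10, 512

# Stand-ins for real data: random embeddings and a random positional table.
embeddings = torch.randn(batch_size, seq_len, embed_dim)
pos_table = torch.randn(1, seq_len, embed_dim)  # leading 1 = broadcast over the batch

# Broadcasting repeats the (1, 10, 512) table for each of the 32 sequences.
out = embeddings + pos_table
print(out.shape)  # torch.Size([32, 10, 512])
```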

In [8]:
import torch
import torch.nn as nn

max_sequence_length = 3
d_model = 6

In [9]:
pe = torch.zeros(max_sequence_length, d_model)
pe.shape

torch.Size([3, 6])

In [10]:
pe

tensor([[0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.]])

In [11]:
position = torch.arange(0, max_sequence_length, dtype=torch.float).unsqueeze(1)
print(position.shape)
position

torch.Size([3, 1])


tensor([[0.],
        [1.],
        [2.]])

In [12]:
even_i = torch.arange(0, d_model, 2).float()
even_i

tensor([0., 2., 4.])

In [13]:
div_term = torch.pow(10000, even_i / d_model)
print(div_term.shape)
div_term

torch.Size([3])


tensor([  1.0000,  21.5443, 464.1590])

In [14]:
torch.sin(position / div_term)

tensor([[0.0000, 0.0000, 0.0000],
        [0.8415, 0.0464, 0.0022],
        [0.9093, 0.0927, 0.0043]])

Here, we obtained the positional encoding values for the even embedding dimensions (columns 0, 2, 4) at every position.

We now write these into the even columns of `pe`:

In [15]:
pe[:, 0::2] = torch.sin(position / div_term)
pe

tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.8415, 0.0000, 0.0464, 0.0000, 0.0022, 0.0000],
        [0.9093, 0.0000, 0.0927, 0.0000, 0.0043, 0.0000]])

Notice that the odd columns are still 0, since we haven't filled them in yet.

In [16]:
torch.cos(position / div_term)

tensor([[ 1.0000,  1.0000,  1.0000],
        [ 0.5403,  0.9989,  1.0000],
        [-0.4161,  0.9957,  1.0000]])

Here, we obtained the positional encoding values for the odd embedding dimensions (columns 1, 3, 5) at every position.

We now write these into the odd columns of `pe`:

In [17]:
pe[:, 1::2] = torch.cos(position / div_term)

In [18]:
pe

tensor([[ 0.0000,  1.0000,  0.0000,  1.0000,  0.0000,  1.0000],
        [ 0.8415,  0.5403,  0.0464,  0.9989,  0.0022,  1.0000],
        [ 0.9093, -0.4161,  0.0927,  0.9957,  0.0043,  1.0000]])

In [19]:
pe.shape

torch.Size([3, 6])
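
One way to sanity-check the finished table: each sine/cosine pair satisfies sin² + cos² = 1, so the squared entries of every row should sum to `d_model / 2`. The sketch below rebuilds the same table as the cells above so that it runs on its own. The check is an optional extra and not part of the original walkthrough.

```python
import torch

max_sequence_length, d_model = 3, 6

# Rebuild the same table as in the cells above.
position = torch.arange(0, max_sequence_length, dtype=torch.float).unsqueeze(1)
div_term = torch.pow(10000, torch.arange(0, d_model, 2).float() / d_model)
pe = torch.zeros(max_sequence_length, d_model)
pe[:, 0::2] = torch.sin(position / div_term)
pe[:, 1::2] = torch.cos(position / div_term)

# Each (sin, cos) pair contributes 1, so every row should sum to d_model / 2.
row_sq_norms = pe.pow(2).sum(dim=1)
print(row_sq_norms)
print(torch.allclose(row_sq_norms, torch.full((max_sequence_length,), d_model / 2)))
```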

### Class

In [20]:
class PositionalEncoding(nn.Module):
    def __init__(self, embed_dim, max_seq_len):
        super().__init__()
        # Create a matrix of shape (max_seq_len, embed_dim)
        pe = torch.zeros(max_seq_len, embed_dim)
        # Create a vector of shape (max_seq_len, 1)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)  #(max_seq_len, 1)
        # Calculate the division term: 10000^(2i/embed_dim)
        div_term = torch.pow(10000, torch.arange(0, embed_dim, 2).float() / embed_dim)  #(embed_dim/2)

        # Apply sine to even indices and cosine to odd indices
        pe[:, 0::2] = torch.sin(position / div_term) #(max_seq_len, embed_dim/2)
        pe[:, 1::2] = torch.cos(position / div_term) #(max_seq_len, embed_dim/2)

        # Add a batch dimension
        pe = pe.unsqueeze(0) #(1, max_seq_len, embed_dim)

        # register_buffer stores `pe` with the module (saved in the state dict,
        # moved by .to(device)) without making it a trainable parameter
        self.register_buffer('pe', pe)


    def forward(self, x):
        # Add positional encoding to the input embeddings
        x = x + self.pe[:, :x.shape[1], :] # `pe` is not trainable but moves with the model
        return x 

In [21]:
d_model = 512  # Embedding dimension
max_seq_length = 1000

pos_encoder = PositionalEncoding(d_model, max_seq_length)

In [24]:
# Create a sample input tensor: (batch_size=2, seq_length=10, d_model=512)
x = torch.randn(2, 10, d_model)
# Apply positional encoding
encoded = pos_encoder(x)

In [25]:
print(f"Input shape: {x.shape}")
print(f"Output shape: {encoded.shape}")

Input shape: torch.Size([2, 10, 512])
Output shape: torch.Size([2, 10, 512])
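
The effect of `register_buffer` can be checked directly. The sketch below repeats a condensed copy of the class so that it runs on its own. It then confirms three things: the module exposes no trainable parameters, `pe` still appears in the state dict, and every sequence in a batch receives the same encoding.

```python
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    # Condensed copy of the class defined above, repeated so this block runs on its own.
    def __init__(self, embed_dim, max_seq_len):
        super().__init__()
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.pow(10000, torch.arange(0, embed_dim, 2).float() / embed_dim)
        pe = torch.zeros(max_seq_len, embed_dim)
        pe[:, 0::2] = torch.sin(position / div_term)
        pe[:, 1::2] = torch.cos(position / div_term)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.shape[1], :]

pos_encoder = PositionalEncoding(embed_dim=512, max_seq_len=1000)

# No trainable parameters: an optimizer never sees `pe`.
print(len(list(pos_encoder.parameters())))    # 0
# The buffer is still part of the saved state.
print(list(pos_encoder.state_dict().keys()))  # ['pe']

# With an all-zero input, the output is just the positional table,
# and it is identical for both sequences in the batch.
x = torch.zeros(2, 10, 512)
encoded = pos_encoder(x)
print(torch.equal(encoded[0], encoded[1]))    # True
```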
