#### Positional Encoding Layer from the Transformers Architecture

![Alt text](Images/02_PositionalEncoding.png)

#### Snippet from the `Attention Is All You Need` Paper

![Alt text](Images/02_PositionalEncoding_Paper.png)

1. Why is `Positional Encoding` needed?

In a sentence, word order matters. For example, consider these two sentences:

    - "I Love India"
    - "India Loves I"

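The two sentences mean completely different things because of the word order. Traditional models like `RNN` and `CNN` inherently capture this order (an `RNN` reads tokens one after another, and a `CNN` slides over neighbouring tokens), but the attention layers of a Transformer process all tokens in parallel and have no built-in notion of order. Positional encoding is used to provide this order information to Transformers.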
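To make the claim concrete, here is a minimal sketch of why attention alone cannot see word order. It assumes a simplified single-head self-attention with identity projections (the `self_attention` helper and the random 4-dimensional embeddings are hypothetical, chosen only for illustration). Reordering the input tokens merely reorders the output rows, so each token's representation is the same regardless of where it sits in the sentence.

```python
import torch

torch.manual_seed(0)

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Simplified single-head scaled dot-product self-attention (identity projections)."""
    d = x.shape[-1]
    scores = x @ x.transpose(-2, -1) / d ** 0.5   # (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)
    return weights @ x                            # (seq_len, d)

# Hypothetical embeddings for three tokens, with no positional information added.
tokens = torch.randn(3, 4)
perm = torch.tensor([2, 1, 0])                    # reversed word order

out_original = self_attention(tokens)
out_reordered = self_attention(tokens[perm])

# The outputs are identical up to the same reordering of rows.
order_is_invisible = torch.allclose(out_original[perm], out_reordered, atol=1e-6)
print(order_is_invisible)
```

Adding a position-dependent vector to each embedding breaks this symmetry, which is the role of the scheme described next.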

2. `Positional Encoding scheme`

Positional Encoding is added to the `word embeddings` (the vectors representing the words; see [Understanding Input Embedding Layer](https://github.com/bala1802/Neural-Networks-and-Deep-Learning/blob/master/Transformers/Experiment02/Understanding%20Input%20Embedding%20Layer.ipynb)) to give the model information about the position of words. The scheme used in the paper involves adding `sinusoidal` functions of different frequencies and phases to the word embeddings.

Let's understand the formula:

![Alt text](image-1.png)
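For readers who cannot view the image, the two formulas can be written out as below. This assumes the image shows the sinusoidal definitions from the paper's positional encoding section, which is what the explanation that follows describes.

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
```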

1. Formula-1 calculates the value for the `even-indexed` dimension of the `positional encoding`:

-   `pos` represents the position of the `token` in a sequence.
-   `2i` corresponds to the `even indices` of the positional encoding `dimension`, where `i` runs from `0` to `d_model/2 - 1`
-   `d_model` represents the dimensionality of the model (for example: the size of the `word embeddings`)

Let's understand how it works:

-   `2i / d_model`: This fraction normalizes the dimension index (not the position) to the range [0, 1). It is `0` for the first pair of dimensions and approaches `1` for the last pair.
-   `10000^(2i / d_model)`: This term increases exponentially as `i` grows, from `1` towards `10000`, creating a range of different "frequencies" for encoding positions. Higher dimensions oscillate more slowly.
-   `pos / (10000^(2i / d_model))`: Dividing the position by this term gives the angle passed to the sinusoid. The larger the denominator, the more slowly the angle changes from one position to the next.
-   `sin(pos / (10000^(2i / d_model)))`: Finally, the `sine` function is applied to this angle, creating a `sinusoidal` pattern. The result is a value in [-1, 1] that varies smoothly and periodically with the position.

2. Formula-2 calculates the value for the `odd-indexed` dimension of the `positional encoding`:

Everything is the same as in `Formula-1`, except that the `cosine` function is applied instead of `sine`, and the result is stored at the odd index `2i + 1`.
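The steps above can be traced for a single value with plain Python. This is only an illustrative sketch: the helper name `positional_encoding_value` and the choice of `d_model = 6` are assumptions made for the example, not part of the original material.

```python
import math

def positional_encoding_value(pos: int, dim: int, d_model: int) -> float:
    """Sinusoidal encoding value at token position `pos` and dimension `dim`."""
    i = dim // 2                       # index of the sine/cosine pair
    exponent = 2 * i / d_model         # normalized dimension index in [0, 1)
    denominator = 10000 ** exponent    # grows from 1 towards 10000
    angle = pos / denominator          # angle passed to the sinusoid
    # Even dimensions (2i) use sine, odd dimensions (2i + 1) use cosine.
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

d_model = 6
# Full encoding vector for the token at position 1.
row = [positional_encoding_value(pos=1, dim=d, d_model=d_model) for d in range(d_model)]
print([round(v, 4) for v in row])
```

At position `0` every sine entry is `0` and every cosine entry is `1`, which is a quick sanity check for any implementation.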



#### Wavelengths in Geometric Progression & Linear Relationship for Relative Positions

- `Wavelengths in Geometric Progression`: The `wavelengths` of these `sinusoids` are arranged in a geometric progression. They follow a pattern where each wavelength is a constant factor larger than the previous one. In this case, the wavelengths range from `2π` (for the first pair of dimensions, `i = 0`) up to `10000 * 2π`.

- `Linear Relationship for Relative Positions`: For any fixed offset `k` (representing the relative position of tokens), the `positional encoding` at position `pos + k`, written `PE_{pos+k}`, can be represented as a `linear` function of the `positional encoding` at position `pos`, written `PE_pos`. The linear map depends only on `k`, not on `pos`. This `linear` relationship helps the model capture relative positions.

- Let's illustrate this with an example: Suppose we have a sequence of words: `I Love India`. And let's consider a specific dimension of the positional encoding, denoted by `Dim1`, which corresponds to a `sinusoidal` wave with a wavelength of `2π`.

    * At the `first` position `I`, the `Dim1` value of the `positional encoding` is the value of that `sinusoid` at the start of its cycle. For instance, if `Dim1` is the sine at index `0` and positions start at `0`, the value is `sin(0) = 0`.

    * At the `second` position `Love`, the `Dim1` value of the `positional encoding` would be different, corresponding to the `same sinusoid` but at a `different point in its cycle` (`sin(1) ≈ 0.84` under the same assumption), because it's in a different position in the sequence.

    * At the `third` position `India`, the `Dim1` value of the `positional encoding` would be different again (`sin(2) ≈ 0.91`), reflecting the `sinusoid` at yet another point in its cycle.
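Both properties in this section can be checked numerically. The sketch below is illustrative only: `d_model = 8`, the offset `k = 3`, and the helper `pe_pair` are assumptions chosen for the example. It first lists the wavelength of each sine/cosine pair, then verifies that one fixed 2x2 matrix, built from `k` alone, maps the pair at `pos` to the pair at `pos + k`.

```python
import math
import torch

d_model = 8
two_i = torch.arange(0, d_model, 2, dtype=torch.float64)   # 0, 2, 4, 6
denominators = 10000.0 ** (two_i / d_model)                # 10000^(2i / d_model)
wavelengths = 2 * math.pi * denominators                   # one per sine/cosine pair

# Geometric progression: consecutive wavelengths differ by a constant ratio.
ratios = wavelengths[1:] / wavelengths[:-1]
print(wavelengths)
print(ratios)

def pe_pair(pos: float, freq: float) -> torch.Tensor:
    """The (sin, cos) pair of one frequency at position `pos`."""
    angle = torch.tensor(pos * freq, dtype=torch.float64)
    return torch.stack([torch.sin(angle), torch.cos(angle)])

# Linear relationship: the matrix depends only on the offset k and the frequency,
# never on the absolute position.
freq = 1.0 / denominators[1].item()
k = 3
rotation = torch.tensor(
    [[math.cos(k * freq), math.sin(k * freq)],
     [-math.sin(k * freq), math.cos(k * freq)]],
    dtype=torch.float64,
)

for pos in (0, 5, 42):
    shifted = rotation @ pe_pair(pos, freq)
    assert torch.allclose(shifted, pe_pair(pos + k, freq))
```

With a finite `d_model`, the largest wavelength is `2π * 10000^((d_model - 2) / d_model)`, which approaches the `10000 * 2π` bound as `d_model` grows.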

#### PyTorch Code implementation

In [1]:
import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float) -> None:
        super().__init__()
        """
        self.d_model: Storing the dimension of the model
        self.seq_len: Storing the length of the sequence
        self.dropout: Storing the dropout rate
        """
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(dropout)
        """
        Creating an empty matrix pe of shape (seq_len, d_model) to store the positional encodings.
        """
        pe = torch.zeros(seq_len, d_model)
        """
        Creating a vector position of shape (seq_len, 1) with values from 0 to seq_len-1. 
        This will represent the positions of tokens in the sequence.
        """
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
        """
        Computing a vector div_term of shape (d_model / 2,), one entry per sine/cosine pair, holding 1 / 10000 ** (2i / d_model) for use in the positional encoding calculations.
        """
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        """
        Applying the sine and cosine functions
            - sin(position / (10000 ** (2i / d_model)))
            - cos(position / (10000 ** (2i / d_model)))
        
        to the even and odd columns of the `pe` matrix respectively. Multiplying `position`
        by `div_term` is the same as dividing by 10000 ** (2i / d_model).
        This reproduces the sinusoidal encoding pattern described in the paper.
        """
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        """
        Adding a leading batch dimension to the positional encoding matrix `pe`, making it `(1, seq_len, d_model)`
        so that it broadcasts across every sequence in the input batch.
        """
        pe = pe.unsqueeze(0)
        
        """
        Registering the positional encoding matrix pe as a buffer in the module. 
        A buffer is a non-trainable tensor that is still saved with the model and moved along with it across devices.
        """
        self.register_buffer('pe', pe)

    def forward(self, x):
        # self.pe is a registered buffer, so it carries no gradient; slice it to the input length.
        x = x + self.pe[:, :x.shape[1], :]  # (batch, seq_len, d_model)
        return self.dropout(x)
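The `div_term` line in the class above uses an exponential of a logarithm rather than a direct power. As a quick sanity check (a sketch with `d_model = 6` chosen arbitrarily), the two forms can be compared to confirm that `position * div_term` matches the `pos / 10000^(2i / d_model)` term of the formula:

```python
import math
import torch

d_model = 6
two_i = torch.arange(0, d_model, 2).float()            # even indices: 0, 2, 4

# Form used in the class: exp(-2i * ln(10000) / d_model)
div_term = torch.exp(two_i * (-math.log(10000.0) / d_model))

# Form written in the formula: 1 / 10000^(2i / d_model)
direct = 1.0 / (10000.0 ** (two_i / d_model))

print(div_term.shape)                                  # one entry per sine/cosine pair
print(torch.allclose(div_term, direct))
```

Note that `div_term` has `d_model / 2` entries, so the class as written assumes an even `d_model`. With an odd value, the cosine assignment to `pe[:, 1::2]` would fail with a shape mismatch.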

Let's understand the `Forward` method with an example:

- `x` is an input `tensor`, which is expected to have a shape of `(batch, seq_len, d_model)`.

- For example, let's say:

    * `batch` size is `2`
    * `seq_len` is `4`
    * `d_model` (Model dimension) is `6`

- So `x` has a shape of `(2, 4, 6)`

In [2]:
x = torch.tensor([[[1, 2, 3, 4, 5, 6],
                   [7, 8, 9, 10, 11, 12],
                   [13, 14, 15, 16, 17, 18],
                   [19, 20, 21, 22, 23, 24]],
                  [[25, 26, 27, 28, 29, 30],
                   [31, 32, 33, 34, 35, 36],
                   [37, 38, 39, 40, 41, 42],
                   [43, 44, 45, 46, 47, 48]]], dtype=torch.float32)

In [3]:
x.shape

torch.Size([2, 4, 6])

Walking through the `forward` method with `x` as the `input tensor`:

- `self.pe[:, :x.shape[1], :]`: Slices the precomputed positional encoding matrix stored in `self.pe` down to the length of the input. In this example, `x.shape[1]` is `4`, so we take the `first 4 positions` from the `positional encoding` matrix, giving a tensor of shape `(1, 4, 6)`.

- `x = x + (self.pe[:, :x.shape[1], :])`: `x` holds the `Embeddings` of the `input data`. This statement adds the `positional` information provided by `self.pe` to those `embeddings`, element by element. Doing this helps the model understand the order of tokens in the sequence.

- The output of the `forward` method will be a `tensor` with the same shape as the `input tensor` - `(batch, seq_len, d_model)`

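- Because the sliced `self.pe` has a leading dimension of size `1`, broadcasting applies the same positional encoding to every sequence in the batch. Conceptually:

    * `out[0] = x[0] + self.pe[0, :4, :]`  (adding the positional encoding to the first sequence in the batch)
    * `out[1] = x[1] + self.pe[0, :4, :]`  (adding the same positional encoding to the second sequence in the batch)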
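The walkthrough can be reproduced end to end with the example tensor. The sketch below rebuilds the encoding table the same way as the class above, but leaves out dropout (equivalent to a dropout rate of `0`) so that the numbers are deterministic; that simplification is an assumption of this example.

```python
import math
import torch

batch, seq_len, d_model = 2, 4, 6
# Same values as the example tensor: 1..48 arranged as (2, 4, 6).
x = torch.arange(1, batch * seq_len * d_model + 1, dtype=torch.float32)
x = x.reshape(batch, seq_len, d_model)

# Rebuild the positional encoding table as in the class.
position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (4, 1)
div_term = torch.exp(
    torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
)                                                                    # (3,)
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0)                                                 # (1, 4, 6)

# Broadcasting adds the same (4, 6) table to both sequences in the batch.
out = x + pe[:, :x.shape[1], :]

print(out.shape)   # torch.Size([2, 4, 6])
print(out[0, 0])   # tensor([1., 3., 3., 5., 5., 7.])
```

At position `0` the encoding is `[0, 1, 0, 1, 0, 1]` (sine of zero and cosine of zero), which is why the first token `[1, 2, 3, 4, 5, 6]` becomes `[1, 3, 3, 5, 5, 7]`.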