# Positional Encoding in Transformer Neural Network

<img src="images/transformer-architecture-2.png" width="500">

## Transformer Neural Network Architecture Overview

Let's walk through how the initial part of the Transformer Neural Network architecture works, since that motivates the need for positional encodings.

Firstly, we have an input sequence in English: *My name is John*. Transformers, like other machine learning models, operate on numbers, vectors and matrices; they do not understand words directly. So we pass in a fixed-length input by padding every unused position, usually with a dummy token. The fixed length is the maximum number of words allowed in an input sequence that goes into the Transformer.
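The padding step can be sketched in a few lines. The `<PAD>` token and the maximum length of 10 are assumptions chosen only for illustration; the notebook does not fix either of them at this point.

```python
# Hypothetical padding token and maximum length, for illustration only.
PAD_TOKEN = "<PAD>"
max_sequence_length = 10

sentence = "My name is John"
tokens = sentence.split()

# Fill the unused positions so that every input has the same length.
padded = tokens + [PAD_TOKEN] * (max_sequence_length - len(tokens))
print(padded)
```

Every input now has exactly `max_sequence_length` entries, regardless of how many real words it contains.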

The words are then one-hot encoded. The vocabulary size is the number of words in our dictionary — that is the number of possible words which can be used in the input sequence.
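As a rough sketch of one-hot encoding, assume a toy vocabulary of five entries (a real vocabulary would be far larger). The `vocab` mapping below is made up for this example.

```python
import torch
import torch.nn.functional as F

# Toy vocabulary (an assumption for illustration only).
vocab = {"<PAD>": 0, "My": 1, "name": 2, "is": 3, "John": 4}
token_ids = torch.tensor([vocab[w] for w in ["My", "name", "is", "John"]])

# Each row contains a single 1, at the index of that word in the vocabulary.
one_hot = F.one_hot(token_ids, num_classes=len(vocab)).float()
print(one_hot.shape)  # torch.Size([4, 5])
```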

We now pass the input sequence into a feed forward layer where each of the vectors in the input sequence is mapped to a 512-dimensional vector. The parameters of this layer, $W_{E}$, are learnable via backpropagation. The total number of learnable parameters in this feed forward layer is the vocabulary size times 512.
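The parameter count can be checked with a small sketch. It assumes a toy vocabulary of 5 words and a layer without a bias term, so that the count is exactly the vocabulary size times 512; the variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 5, 512  # toy vocabulary size (assumption); 512 as in the text

# Feed forward layer without bias: its weight plays the role of W_E.
embed = nn.Linear(vocab_size, d_model, bias=False)
num_params = sum(p.numel() for p in embed.parameters())
print(num_params)  # 2560 = 5 * 512

token_ids = torch.tensor([1, 2, 3, 4])
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
X = embed(one_hot)  # shape: (4, 512)

# Multiplying a one-hot vector by the weights selects one column of the weight matrix.
lookup = embed.weight.T[token_ids]
print(torch.allclose(X, lookup))  # True
```

Because the input is one-hot, the layer behaves like a table lookup: each word simply picks out its own 512-dimensional vector.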

The output of this feed forward layer is simply a matrix of size maximum input sequence length times 512. We'll refer to this matrix as $X$. To this matrix $X$ we add a positional encoding matrix of the same size. In doing so, we get another matrix, $X^{1}$.

For each word vector in matrix $X^{1}$, we want to generate a query, key and value vector, each of 512 dimensions. We accomplish this by passing each vector in matrix $X^{1}$ through a set of query, key and value weights: $W_{Q}$, $W_{K}$ and $W_{V}$ respectively. This operation maps one input vector to a query, a key and a value vector. We do this for every single word of the input sequence up to the maximum sequence length.

The total number of output vectors is 3 times the maximum sequence length because we get 3 vectors (a query, a key and a value) for every single word. Stacked over all words, these form the matrices $Q$, $K$ and $V$. From this point, we can split these output vectors into multiple heads and perform [multi head attention](2-Multihead_Attention.ipynb).

**Note:** The query, key and value weights — $W_{Q}$, $W_{K}$ and $W_{V}$ — are learnable parameters.
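The query, key and value projections could look roughly like this. It is a sketch under assumptions: random numbers stand in for $X^{1}$, the projections have no bias, and 8 heads of 64 dimensions each are an assumed configuration.

```python
import torch
import torch.nn as nn

max_sequence_length, d_model = 10, 512
X1 = torch.randn(max_sequence_length, d_model)  # stand-in for X plus positional encoding

# Three learnable projections; bias is omitted here as a simplifying assumption.
W_Q = nn.Linear(d_model, d_model, bias=False)
W_K = nn.Linear(d_model, d_model, bias=False)
W_V = nn.Linear(d_model, d_model, bias=False)

Q, K, V = W_Q(X1), W_K(X1), W_V(X1)
print(Q.shape, K.shape, V.shape)  # each is torch.Size([10, 512])

# Splitting into heads: 8 heads of 64 dimensions is an assumed configuration.
num_heads = 8
Q_heads = Q.reshape(max_sequence_length, num_heads, d_model // num_heads)
print(Q_heads.shape)  # torch.Size([10, 8, 64])
```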

![diagram](images/positional-encoding.png)

## Positional Embedding

\begin{align}
PE(\text{position}, 2i) &= \sin\left(\frac{\text{position}}{10000^{\frac{2i}{d_{model}}}}\right) \\
PE(\text{position}, 2i + 1) &= \cos\left(\frac{\text{position}}{10000^{\frac{2i}{d_{model}}}}\right)
\end{align}
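To get a feel for the formula, here is a small worked example in plain Python. It assumes a tiny `d_model` of 6 for readability, and the helper name `positional_encoding_value` is made up for this sketch.

```python
import math


def positional_encoding_value(position, k, d_model):
    """Value at embedding index k for a given position, following the formula above."""
    i = k // 2  # index of the sine/cosine pair that k belongs to
    angle = position / (10000 ** (2 * i / d_model))
    return math.sin(angle) if k % 2 == 0 else math.cos(angle)


d_model = 6
row = [round(positional_encoding_value(1, k, d_model), 4) for k in range(d_model)]
print(row)  # [0.8415, 0.5403, 0.0464, 0.9989, 0.0022, 1.0]
```

Even indices use sine and odd indices use cosine, and each sine/cosine pair shares the same denominator.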

### Why do we formulate positional embedding this way? 

#### 1. Periodicity 

Sine and cosine are periodic functions, so their values repeat at regular intervals. For example, let's say we want to look at a particular vector (the third vector, which corresponds to the word *is*) in the $X$ matrix and the positional encoding matrix (see figure above).

At some point, we are going to compute the attention matrix and determine how much attention the word *is* should pay to all other words. During this phase, because of periodicity, the word *is* will be able to pay attention to, say, words 5, 10 or 15 positions after it in a much more tractable way.
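One way to make "tractable" more concrete (an addition here, not something the notebook itself derives) is the angle-addition identity: for a single sine/cosine pair, the encoding at `position + k` is a rotation of the encoding at `position`, and the rotation depends only on the offset `k`. The sketch below checks this numerically for one pair, assuming `d_model = 6`.

```python
import math

d_model = 6
w = 1 / (10000 ** (2 / d_model))  # frequency of the pair at indices (2, 3), i.e. i = 1


def pair(position):
    """Sine/cosine pair of the positional encoding for this frequency."""
    return math.sin(w * position), math.cos(w * position)


p, k = 2, 5  # position of the word "is" (0-based) and an offset of 5 words
s, c = pair(p)

# Rotate the pair by the angle w * k; the rotation does not depend on p.
shifted = (
    s * math.cos(w * k) + c * math.sin(w * k),
    c * math.cos(w * k) - s * math.sin(w * k),
)
direct = pair(p + k)
print(shifted, direct)
```

The two printed pairs agree up to floating point error, so moving 5 words ahead is the same transformation no matter where in the sequence we start.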

#### 2. Constrained Values

Sine and cosine constrain the values to lie between -1 and 1. Without that, the values, at least in the positive direction, are not bounded. This would mean that the positional encoding for the word *is* might be smaller than that of the next vector, which in turn would be smaller than the one after it, and so on. When we compute the attention matrices, the third vector (corresponding to the word *is*) would then not be able to attend to vectors very far away from it. Therefore, it would not be able to derive any context from them.
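The boundedness is easy to check numerically. The sketch below assumes 1000 positions and `d_model = 6`, and compares the raw angles (which grow with position) against their sine and cosine.

```python
import torch

max_sequence_length, d_model = 1000, 6
position = torch.arange(max_sequence_length, dtype=torch.float64).reshape(-1, 1)
denominator = torch.pow(10000, torch.arange(0, d_model, 2).float() / d_model)

angles = position / denominator  # unbounded: keeps growing with position
largest_angle = angles.max().item()
largest_sin = torch.sin(angles).abs().max().item()
largest_cos = torch.cos(angles).abs().max().item()

print(largest_angle)             # 999.0
print(largest_sin, largest_cos)  # both stay at or below 1
```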

#### 3. Easy to Extrapolate Long Sequences

Sine and cosine are deterministic formulas that are very easy to compute. Even if the model hasn't seen certain sequence lengths in the training set, the encodings for those positions can still be computed and used in our test set.
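A quick sketch of what this means in practice: the encoding of a position does not depend on how long the sequence is, so a longer table simply extends a shorter one. The helper `sinusoidal_pe` is a name made up for this example; it follows the same construction that is developed step by step later in this notebook.

```python
import torch


def sinusoidal_pe(max_sequence_length, d_model):
    """Build the (max_sequence_length, d_model) positional encoding table."""
    even_i = torch.arange(0, d_model, 2).float()
    denominator = torch.pow(10000, even_i / d_model)
    position = torch.arange(max_sequence_length, dtype=torch.float64).reshape(-1, 1)
    stacked = torch.stack(
        [torch.sin(position / denominator), torch.cos(position / denominator)], dim=2
    )
    return torch.flatten(stacked, start_dim=1, end_dim=2)


short = sinusoidal_pe(10, 6)
long = sinusoidal_pe(20, 6)

# The first 10 rows of the longer table match the shorter table.
print(torch.allclose(short, long[:10]))  # True
```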

In [1]:
import torch

The `max_sequence_length` is the maximum number of words that can be passed into the transformer simultaneously.

`d_model` is the dimension of the embeddings. It's typically 512 but for illustrative purposes, 6 is used.

In [2]:
max_sequence_length = 10
d_model = 6

We can rewrite the above equations $(1)$ and $(2)$ as:

\begin{align}
PE(\text{position}, i) &= \sin\left(\frac{\text{position}}{10000^{\frac{i}{d_{model}}}}\right) \text{ when $i$ is even} \nonumber \\
PE(\text{position}, i) &= \cos\left(\frac{\text{position}}{10000^{\frac{i-1}{d_{model}}}}\right) \text{ when $i$ is odd} \nonumber
\end{align}

The new equations are now easier to read and to program.

In [3]:
even_i = torch.arange(0, d_model, 2).float()
even_i

tensor([0., 2., 4.])

In [4]:
even_denominator = torch.pow(10000, even_i / d_model)
even_denominator

tensor([  1.0000,  21.5443, 464.1590])

In [5]:
odd_i = torch.arange(1, d_model, 2).float()
odd_i

tensor([1., 3., 5.])

In [6]:
odd_denominator = torch.pow(10000, (odd_i - 1) / d_model)
odd_denominator

tensor([  1.0000,  21.5443, 464.1590])

What you will notice is that the values of `even_denominator` and `odd_denominator` are exactly the same. This makes sense because each odd index is 1 more than the preceding even index, so subtracting 1 from the odd indices gives back the same even indices.
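Rather than comparing the printed values by eye, the equality can also be checked directly. This is a small self-contained sketch that rebuilds both denominators with `d_model = 6`.

```python
import torch

d_model = 6
even_i = torch.arange(0, d_model, 2).float()
odd_i = torch.arange(1, d_model, 2).float()

even_denominator = torch.pow(10000, even_i / d_model)
odd_denominator = torch.pow(10000, (odd_i - 1) / d_model)

# odd_i - 1 is exactly even_i, so the two tensors are identical.
print(torch.equal(even_denominator, odd_denominator))  # True
```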

So, we are going to just combine the denominators into one `denominator` variable.

In [7]:
denominator = even_denominator

Define all the positions by taking the values from 0 to 9 and then reshape to be a 2-dimensional matrix.

In [8]:
position = torch.arange(max_sequence_length, dtype=float).reshape(
    max_sequence_length, 1
)

print(position.shape)
position

torch.Size([10, 1])


tensor([[0.],
        [1.],
        [2.],
        [3.],
        [4.],
        [5.],
        [6.],
        [7.],
        [8.],
        [9.]], dtype=torch.float64)

In [9]:
even_PE = torch.sin(position / denominator)
odd_PE = torch.cos(position / denominator)

The above calculation will result in a 10 times 3 matrix for both the even and odd cases.
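The shape comes from broadcasting: a `(10, 1)` column of positions divided by a `(3,)` vector of denominators gives a `(10, 3)` matrix. A minimal sketch, with the denominator values copied from the output shown earlier:

```python
import torch

position = torch.arange(10, dtype=torch.float64).reshape(10, 1)  # shape (10, 1)
# Values copied from the printed denominator; shape (3,)
denominator = torch.tensor([1.0, 21.5443, 464.1590], dtype=torch.float64)

angles = position / denominator  # broadcast to shape (10, 3)

# Entry [p, j] is position p divided by the j-th denominator.
print(angles.shape)  # torch.Size([10, 3])
print(angles[3, 1].item(), 3 / 21.5443)
```

Each row corresponds to one position and each column to one sine/cosine pair.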

In [10]:
print(even_PE.shape)
even_PE

torch.Size([10, 3])


tensor([[ 0.0000,  0.0000,  0.0000],
        [ 0.8415,  0.0464,  0.0022],
        [ 0.9093,  0.0927,  0.0043],
        [ 0.1411,  0.1388,  0.0065],
        [-0.7568,  0.1846,  0.0086],
        [-0.9589,  0.2300,  0.0108],
        [-0.2794,  0.2749,  0.0129],
        [ 0.6570,  0.3192,  0.0151],
        [ 0.9894,  0.3629,  0.0172],
        [ 0.4121,  0.4057,  0.0194]], dtype=torch.float64)

In [11]:
print(odd_PE.shape)
odd_PE

torch.Size([10, 3])


tensor([[ 1.0000,  1.0000,  1.0000],
        [ 0.5403,  0.9989,  1.0000],
        [-0.4161,  0.9957,  1.0000],
        [-0.9900,  0.9903,  1.0000],
        [-0.6536,  0.9828,  1.0000],
        [ 0.2837,  0.9732,  0.9999],
        [ 0.9602,  0.9615,  0.9999],
        [ 0.7539,  0.9477,  0.9999],
        [-0.1455,  0.9318,  0.9999],
        [-0.9111,  0.9140,  0.9998]], dtype=torch.float64)

What we want to do next is interleave the matrices `even_PE` and `odd_PE` above. That is, in each row of the final matrix, the first value from `even_PE` should land at index 0, the first value from `odd_PE` at index 1, the second value from `even_PE` at index 2, and so on.

In [12]:
stacked = torch.stack(
    [even_PE, odd_PE], dim=2
)  # This method also creates a new dimension

print(stacked.shape)
stacked

torch.Size([10, 3, 2])


tensor([[[ 0.0000,  1.0000],
         [ 0.0000,  1.0000],
         [ 0.0000,  1.0000]],

        [[ 0.8415,  0.5403],
         [ 0.0464,  0.9989],
         [ 0.0022,  1.0000]],

        [[ 0.9093, -0.4161],
         [ 0.0927,  0.9957],
         [ 0.0043,  1.0000]],

        [[ 0.1411, -0.9900],
         [ 0.1388,  0.9903],
         [ 0.0065,  1.0000]],

        [[-0.7568, -0.6536],
         [ 0.1846,  0.9828],
         [ 0.0086,  1.0000]],

        [[-0.9589,  0.2837],
         [ 0.2300,  0.9732],
         [ 0.0108,  0.9999]],

        [[-0.2794,  0.9602],
         [ 0.2749,  0.9615],
         [ 0.0129,  0.9999]],

        [[ 0.6570,  0.7539],
         [ 0.3192,  0.9477],
         [ 0.0151,  0.9999]],

        [[ 0.9894, -0.1455],
         [ 0.3629,  0.9318],
         [ 0.0172,  0.9999]],

        [[ 0.4121, -0.9111],
         [ 0.4057,  0.9140],
         [ 0.0194,  0.9998]]], dtype=torch.float64)

In [13]:
PE = torch.flatten(stacked, start_dim=1, end_dim=2)

print(PE.shape)
PE

torch.Size([10, 6])


tensor([[ 0.0000,  1.0000,  0.0000,  1.0000,  0.0000,  1.0000],
        [ 0.8415,  0.5403,  0.0464,  0.9989,  0.0022,  1.0000],
        [ 0.9093, -0.4161,  0.0927,  0.9957,  0.0043,  1.0000],
        [ 0.1411, -0.9900,  0.1388,  0.9903,  0.0065,  1.0000],
        [-0.7568, -0.6536,  0.1846,  0.9828,  0.0086,  1.0000],
        [-0.9589,  0.2837,  0.2300,  0.9732,  0.0108,  0.9999],
        [-0.2794,  0.9602,  0.2749,  0.9615,  0.0129,  0.9999],
        [ 0.6570,  0.7539,  0.3192,  0.9477,  0.0151,  0.9999],
        [ 0.9894, -0.1455,  0.3629,  0.9318,  0.0172,  0.9999],
        [ 0.4121, -0.9111,  0.4057,  0.9140,  0.0194,  0.9998]],
       dtype=torch.float64)

Flattening the `stacked` matrix effectively gives us the interleaving we want.

The first vector in the `PE` matrix is the positional encoding for the first word, the second vector is the positional encoding for the second word, and so on.
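To convince ourselves that stack plus flatten really interleaves the columns, we can compare it with a more explicit alternative: writing into every other column of a preallocated matrix. The dummy values below are chosen so that each number equals the column it should end up in (plus 10 for the second row); they are not real encodings.

```python
import torch

# Dummy values: these belong at column indices 0, 2, 4 and 1, 3, 5 respectively.
even_PE = torch.tensor([[0.0, 2.0, 4.0], [10.0, 12.0, 14.0]])
odd_PE = torch.tensor([[1.0, 3.0, 5.0], [11.0, 13.0, 15.0]])

# Route 1: stack on a new last dimension, then flatten the last two dimensions.
interleaved = torch.flatten(
    torch.stack([even_PE, odd_PE], dim=2), start_dim=1, end_dim=2
)

# Route 2: write into every other column of a preallocated matrix.
PE = torch.zeros(2, 6)
PE[:, 0::2] = even_PE
PE[:, 1::2] = odd_PE

print(interleaved[0].tolist())       # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(torch.equal(interleaved, PE))  # True
```

Both routes give the same result; the slice assignment is arguably easier to read, while stack plus flatten avoids allocating the matrix up front.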

## Class

In [14]:
import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    def __init__(self, max_sequence_length, d_model):
        super().__init__()
        self.max_sequence_length = max_sequence_length
        self.d_model = d_model

    def forward(self):
        even_i = torch.arange(0, self.d_model, 2).float()
        print(f"even.shape: {even_i.shape}")

        denominator = torch.pow(10000, even_i / self.d_model)
        print(f"denominator.shape: {denominator.shape}")

        # or .reshape(self.max_sequence_length, 1)
        position = torch.arange(self.max_sequence_length, dtype=float).reshape(-1, 1)
        print(f"position.shape: {position.shape}")

        even_PE = torch.sin(position / denominator)
        print(f"even_PE.shape: {even_PE.shape}")

        odd_PE = torch.cos(position / denominator)
        print(f"odd_PE.shape: {odd_PE.shape}")

        # This method stacks the two matrices in a new dimension
        stacked = torch.stack([even_PE, odd_PE], dim=2)
        print(f"stacked.shape: {stacked.shape}")

        PE = torch.flatten(stacked, start_dim=1, end_dim=2)
        print(f"PE.shape: {PE.shape}")

        return PE

In [15]:
pe = PositionalEncoding(max_sequence_length=10, d_model=6)
PE = pe()  # calling the module runs forward()

PE

even.shape: torch.Size([3])
denominator.shape: torch.Size([3])
position.shape: torch.Size([10, 1])
even_PE.shape: torch.Size([10, 3])
odd_PE.shape: torch.Size([10, 3])
stacked.shape: torch.Size([10, 3, 2])
PE.shape: torch.Size([10, 6])


tensor([[ 0.0000,  1.0000,  0.0000,  1.0000,  0.0000,  1.0000],
        [ 0.8415,  0.5403,  0.0464,  0.9989,  0.0022,  1.0000],
        [ 0.9093, -0.4161,  0.0927,  0.9957,  0.0043,  1.0000],
        [ 0.1411, -0.9900,  0.1388,  0.9903,  0.0065,  1.0000],
        [-0.7568, -0.6536,  0.1846,  0.9828,  0.0086,  1.0000],
        [-0.9589,  0.2837,  0.2300,  0.9732,  0.0108,  0.9999],
        [-0.2794,  0.9602,  0.2749,  0.9615,  0.0129,  0.9999],
        [ 0.6570,  0.7539,  0.3192,  0.9477,  0.0151,  0.9999],
        [ 0.9894, -0.1455,  0.3629,  0.9318,  0.0172,  0.9999],
        [ 0.4121, -0.9111,  0.4057,  0.9140,  0.0194,  0.9998]],
       dtype=torch.float64)
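Since the encoding never changes, one possible variation is to build the table once in `__init__` and store it as a buffer, then add it to the embeddings in `forward`. The class name `CachedPositionalEncoding` and its `forward(X)` signature are assumptions made for this sketch, not part of the class above. It also uses `float32` positions so that the result matches the default dtype of typical embeddings.

```python
import torch
import torch.nn as nn


class CachedPositionalEncoding(nn.Module):
    """Hypothetical variant: builds the table once and stores it as a buffer."""

    def __init__(self, max_sequence_length, d_model):
        super().__init__()
        even_i = torch.arange(0, d_model, 2).float()
        denominator = torch.pow(10000, even_i / d_model)
        position = torch.arange(max_sequence_length).float().reshape(-1, 1)
        stacked = torch.stack(
            [torch.sin(position / denominator), torch.cos(position / denominator)],
            dim=2,
        )
        # A buffer moves with the module across devices but is not a learnable parameter.
        self.register_buffer("PE", torch.flatten(stacked, start_dim=1, end_dim=2))

    def forward(self, X):
        # X has shape (sequence_length, d_model); add the encoding of the first rows.
        return X + self.PE[: X.shape[0]]


pe = CachedPositionalEncoding(max_sequence_length=10, d_model=6)
X = torch.zeros(4, 6)  # stand-in for the embeddings of "My name is John"
X1 = pe(X)
print(X1.shape)  # torch.Size([4, 6])
```

Because `X` is all zeros here, `X1` is just the first four rows of the positional encoding table.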