# Positional Encoding

Positional encoding uses a vector to represent the position of each token in a sentence. It compensates for the fact that the model has neither recurrence nor convolution with which to capture the order of the tokens. The encoding function is

\\[
    p_{i,k} = \begin{cases}
        \sin(w_{i, k}) & \text{if $k$ is even} \\
        \cos(w_{i, k}) & \text{if $k$ is odd}
    \end{cases}
\\]
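where, for even $k$, the sine and cosine entries of a pair share the same angle:
\\[
    w_{i, k} = w_{i, k + 1} = \frac{i}{10000^{2 k / K}}.
\\]
Here $i$ is the position of the token, $k$ indexes the embedding dimension, and $K$ is the (even) embedding size. Note that in the original Transformer formulation the exponent for even $k$ is $k / K$ rather than $2 k / K$; the code below follows the $2 k / K$ form.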

This function assigns a unique vector to each position $i$: already the first pair, $(\sin i, \cos i)$, differs for any two distinct integer positions.
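As a quick sanity check of the uniqueness claim, the sketch below builds the table in vectorized form and confirms that no two rows coincide. It assumes an even `K` and follows the formula as written here, with each sine/cosine pair sharing the angle of its even index. The helper name `encoding_table` is hypothetical.

```python
import torch

def encoding_table(I: int, K: int) -> torch.Tensor:
    """Hypothetical helper: (I, K) table of position vectors, K assumed even."""
    assert K % 2 == 0, "this sketch assumes an even K"
    i = torch.arange(I, dtype=torch.float64).unsqueeze(1)   # positions, shape (I, 1)
    k = torch.arange(0, K, 2, dtype=torch.float64)          # even indices, shape (K/2,)
    w = i / torch.pow(10000.0, 2 * k / K)                   # angles, shape (I, K/2)
    table = torch.zeros(I, K, dtype=torch.float64)
    table[:, 0::2] = torch.sin(w)   # even columns hold the sines
    table[:, 1::2] = torch.cos(w)   # odd columns hold the cosines
    return table

table = encoding_table(9, 10)
# Pairwise distances between rows; every off-diagonal entry should be positive.
dists = torch.cdist(table, table)
off_diagonal = dists[~torch.eye(9, dtype=torch.bool)]
all_unique = bool((off_diagonal > 1e-6).all())
```

For the small sizes used here, `all_unique` should come out `True`, since the first two columns alone already separate the positions.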

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import numpy as np

In [2]:
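# I: number of positions, K: embedding dimension
# (K must be even, because the entries are filled in sine/cosine pairs)
B, I, J, K, H = 3, 9, 10, 18, 3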

pe = torch.zeros(I, K)
for i in range(I):
    for k in range(0, K, 2):
        w = i / pow(10000, (2 * k) / K)
        pe[i, k] = math.sin(w)
        pe[i, k + 1] = math.cos(w)

In [3]:
# Rebuild the table with a smaller embedding dimension (K = 10)
B, I, J, K, H = 1, 9, 10, 10, 3

pe = torch.zeros(I, K)
for i in range(I):
    for k in range(0, K, 2):
        w = i / pow(10000, (2 * k) / K)
        pe[i, k] = math.sin(w)
        pe[i, k + 1] = math.cos(w)

In [6]:
pe.shape

torch.Size([9, 10])

In [3]:
class PositionalEncoder(nn.Module):
    
    def __init__(self, K, I):
        # K: embedding dimension (must be even), I: maximum sequence length
        super().__init__()
        self.K = K
        pe = torch.zeros(I, K)
        for i in range(I):
            for k in range(0, K, 2):
                w = i / pow(10000, (2 * k) / K)
                pe[i, k] = math.sin(w)
                pe[i, k + 1] = math.cos(w)
                
        pe = pe.unsqueeze(0)
        # to make sure that it is not considered as a model parameter
        self.register_buffer('pe', pe)
    
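    def forward(self, x):
        # x: (B, seq_len, K); scale the embeddings before adding the encoding
        x = x * math.sqrt(self.K)
        seq_len = x.size(1)
        # note that the positional encoding is added to the input vector instead of concat;
        # the buffer moves with the module (e.g. via .to(device)) and carries no gradient,
        # so neither Variable nor an explicit .cuda() call is needed
        x = x + self.pe[:, :seq_len]
        return x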
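
A usage sketch may help show how such a module is applied. It assumes input embeddings of shape `(B, seq_len, K)` with `seq_len` no larger than `I`. To keep the block self-contained it defines a compact, vectorized, device-agnostic variant under the hypothetical name `SinusoidalEncoder`, rather than reusing the class above.

```python
import math
import torch
import torch.nn as nn

class SinusoidalEncoder(nn.Module):
    """Hypothetical compact variant of the encoder; K is assumed even."""

    def __init__(self, K, I):
        super().__init__()
        self.K = K
        i = torch.arange(I, dtype=torch.float32).unsqueeze(1)  # (I, 1)
        k = torch.arange(0, K, 2, dtype=torch.float32)         # (K/2,)
        w = i / torch.pow(10000.0, 2 * k / K)                  # (I, K/2)
        pe = torch.zeros(I, K)
        pe[:, 0::2] = torch.sin(w)
        pe[:, 1::2] = torch.cos(w)
        # buffer of shape (1, I, K): saved with the module, but not trained
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x):
        seq_len = x.size(1)
        # scale the embeddings, then add the first seq_len position vectors
        return x * math.sqrt(self.K) + self.pe[:, :seq_len]

encoder = SinusoidalEncoder(K=10, I=9)
x = torch.zeros(3, 7, 10)   # batch of 3, 7 tokens, all-zero embeddings
out = encoder(x)            # with zero input, out is just the encoding
n_params = sum(p.numel() for p in encoder.parameters())
```

Because the table is registered as a buffer, it appears in `encoder.state_dict()` and follows `encoder.to(device)`, while `n_params` stays at zero.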