# Position Encoding
In this exercise, we will look at position encoding for the Transformer architecture.

In [None]:
import torch
import torch.nn as nn

## Creating an Embedding
Consider the following vocabulary:

In [None]:
vocab = ['dog', 'cat', 'fox', 'walks', 'jumps', 'sleeps', 'and', 'the', '.', ',']
print(f'Vocab size: {len(vocab)}')

Take a look at the [nn.Embedding documentation](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html). Create an embedding with a vocabulary size of 10 and an embedding dimension of 6.

Embed the sentence: "the dog sleeps , the cat walks and the fox jumps ."
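
As a starting point, the lookup can be sketched as follows. This is one possible approach, not the only solution: the `token_to_id` mapping and the whitespace tokenization are assumptions made for illustration, and the embedding weights are randomly initialized, so the values differ between runs.

```python
import torch
import torch.nn as nn

vocab = ['dog', 'cat', 'fox', 'walks', 'jumps', 'sleeps', 'and', 'the', '.', ',']
# Hypothetical helper: map each token to its index in the vocabulary.
token_to_id = {token: idx for idx, token in enumerate(vocab)}

emb_dim = 6
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=emb_dim)

sentence = "the dog sleeps , the cat walks and the fox jumps ."
# Assumption: tokens are separated by single spaces.
token_ids = torch.tensor([token_to_id[tok] for tok in sentence.split()])

embedded = embedding(token_ids)  # shape: [sequence_length, emb_dim]
print(embedded.shape)
```

The sentence has 12 tokens, so the result has one 6-dimensional row per token.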

## Absolute Position Encoding
The functions for absolute position encoding, as defined in [the Transformer paper](https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf), are as follows:

$$
\begin{align}
pos_{i, 2j} &= \sin(i / 10000^{2j/d}) \\
pos_{i, 2j+1} &= \cos(i / 10000^{2j/d})
\end{align}
$$
where $i$ is the absolute position in the sequence, $j$ indexes the sine/cosine pairs (so $2j$ and $2j+1$ are the even and odd dimensions of the position vector), and $d$ is the embedding dimension.
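
The two formulas can be turned into code fairly directly. The sketch below is one possible reading, with the hypothetical name `sinusoidal_position_vector`; it computes $1 / 10000^{2j/d}$ in log space and assumes that, for an odd `dim`, the final cosine entry is simply dropped.

```python
import math
import torch

def sinusoidal_position_vector(position: int, dim: int) -> torch.Tensor:
    """Position vector for a single position (hypothetical helper name)."""
    # Values 2j for each sine/cosine pair: 0, 2, 4, ...
    even_idx = torch.arange(0, dim, 2, dtype=torch.float32)
    # 1 / 10000^(2j/d), computed as exp(-log(10000) * 2j / d)
    inv_freq = torch.exp(-math.log(10000.0) * even_idx / dim)
    angles = position * inv_freq

    vec = torch.zeros(dim)
    vec[0::2] = torch.sin(angles)              # even dimensions
    vec[1::2] = torch.cos(angles[: dim // 2])  # odd dimensions (assumption for odd dim)
    return vec

print(sinusoidal_position_vector(0, 6))
```

At position 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check for any implementation.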

Create a function `absolute_position_encoding` that takes the position in the sequence $i$ and the dimension $d$ as an input and returns the position vector.

In [None]:
def absolute_position_encoding(position, dim):
    pass

Run the cell below to plot the position vectors for the first 100 positions and the first 5 dimensions. It expects `emb_dim` to hold the embedding dimension used earlier (6 in the exercise above).

In [None]:
import matplotlib.pyplot as plt

plt.figure()
x = list(range(100))
pos_vectors = [absolute_position_encoding(i, emb_dim) for i in x]
for dim in range(5):
    y = [pv[dim].item() for pv in pos_vectors]
    plt.plot(x, y, label=f'dim {dim}')
plt.legend()
plt.show()

Apply the position encoding to the embeddings from earlier.
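
Applying the encoding amounts to an element-wise addition of one position vector per token. The sketch below builds the whole table at once; `embedded` is a random stand-in for the embedded sentence (an assumption, 12 tokens with 6 dimensions), and `pos_table` is a hypothetical name.

```python
import torch

seq_len, emb_dim = 12, 6
# Stand-in for the embedded sentence from earlier (assumption: shape [12, 6]).
embedded = torch.randn(seq_len, emb_dim)

positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # [seq_len, 1]
even_idx = torch.arange(0, emb_dim, 2, dtype=torch.float32)          # values 2j
angles = positions / (10000.0 ** (even_idx / emb_dim))               # [seq_len, emb_dim / 2]

pos_table = torch.zeros(seq_len, emb_dim)
pos_table[:, 0::2] = torch.sin(angles)  # even dimensions
pos_table[:, 1::2] = torch.cos(angles)  # odd dimensions (emb_dim is even here)

encoded = embedded + pos_table  # same shape as the embeddings
print(encoded.shape)
```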

## Absolute Position Embedding
Write a class `AbsolutePositionEmbedding` that is initialized with a maximum length and an embedding dimension. In its `forward` method, it should take an input tensor (of shape `[batch_size, sequence_length, embedding_dim]`) and add the position embeddings to the input tensor.

In [None]:
class AbsolutePositionEmbedding(nn.Module):
    
    def __init__(self, embedding_dim, max_length=512):
        pass
    
    def forward(self, x):
        pass
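
One way to approach this, assuming "position embeddings" here means a learned lookup table rather than the fixed sine/cosine encoding, is sketched below. The class name `LearnedPositionEmbeddingSketch` and the length check are illustrative choices, not part of the exercise.

```python
import torch
import torch.nn as nn

class LearnedPositionEmbeddingSketch(nn.Module):
    """Adds a learned vector per position (assumed interpretation)."""

    def __init__(self, embedding_dim, max_length=512):
        super().__init__()
        self.max_length = max_length
        # One trainable vector for each position 0 .. max_length - 1.
        self.position_embedding = nn.Embedding(max_length, embedding_dim)

    def forward(self, x):
        # x: [batch_size, sequence_length, embedding_dim]
        seq_len = x.size(1)
        if seq_len > self.max_length:
            raise ValueError("sequence is longer than max_length")
        positions = torch.arange(seq_len, device=x.device)
        # [sequence_length, embedding_dim] broadcasts over the batch dimension.
        return x + self.position_embedding(positions)

module = LearnedPositionEmbeddingSketch(20, 512)
out = module(torch.zeros(5, 12, 20))
print(out.shape)
```

Because the same position vectors are added to every sequence in the batch, feeding an all-zero input returns identical slices along the batch dimension.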

Try your class with an example.

In [None]:
absolute_position_embedding = AbsolutePositionEmbedding(20, 512)
x1 = torch.randn(5, 12, 20)
x = absolute_position_embedding(x1)
print(x.shape)

## Relative Position Embedding
Create a class `RelativePositionEmbedding` that is initialized with a maximum relative distance and an embedding dimension. Its `forward` method should take an input tensor of size `[batch_size, hidden_dim]` and apply the relative position embeddings given the positions $i$ of the query and $j$ of the key.

In [None]:
class RelativePositionEmbedding(nn.Module):
    
    def __init__(self, embedding_dim, max_dist=16):
        pass
    
    def forward(self, x, i, j):
        pass
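
A minimal sketch in the spirit of Shaw et al. (2018) is shown below. It assumes that the relative distance is $j - i$, clipped to `[-max_dist, max_dist]`, and that "applying" the embedding means adding the looked-up vector to `x`; other readings of the task are possible. The class name `RelativePositionEmbeddingSketch` is hypothetical.

```python
import torch
import torch.nn as nn

class RelativePositionEmbeddingSketch(nn.Module):
    """Adds an embedding of the clipped relative distance j - i (assumed design)."""

    def __init__(self, embedding_dim, max_dist=16):
        super().__init__()
        self.max_dist = max_dist
        # Distances -max_dist .. max_dist give 2 * max_dist + 1 entries.
        self.embedding = nn.Embedding(2 * max_dist + 1, embedding_dim)

    def forward(self, x, i, j):
        # x: [batch_size, hidden_dim]; i and j are integer positions.
        distance = max(-self.max_dist, min(self.max_dist, j - i))
        # Shift into the range 0 .. 2 * max_dist to index the table.
        index = distance + self.max_dist
        return x + self.embedding.weight[index]

sketch = RelativePositionEmbeddingSketch(20, 16)
x_demo = torch.randn(5, 20)
print(sketch(x_demo, 0, 3).shape)
```

With clipping, any pair of positions further apart than `max_dist` shares the embedding of the boundary distance, so the table size does not depend on the sequence length.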

Try it with an example.

In [None]:
rel_pos_emb = RelativePositionEmbedding(20, 16)
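x = torch.randn(5, 20)
result = rel_pos_emb(x, 0, 3)    # positions within max_dist of each other
print(result.shape)
result = rel_pos_emb(x, 49, 15)  # positions further apart than max_dist
print(result.shape)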

**Question:** Where would we use this module?

**Answer:** 

**Question:** Look at the HuggingFace implementation of relative position embeddings in the BERT model:
- [initialization from line 244](https://github.com/huggingface/transformers/blob/v4.46.0/src/transformers/models/bert/modeling_bert.py#L244)
- [forward method from line 320](https://github.com/huggingface/transformers/blob/v4.46.0/src/transformers/models/bert/modeling_bert.py#L320)

Describe what their `relative_key` method does differently from the [Shaw et al. (2018)](https://aclanthology.org/N18-2074/) paper we saw in the lecture.

**Answer:**