# Positional embeddings

## Why?
- Self-attention models have no inherent notion of position, unlike RNNs.
- Position and ordering matter in language. The same words in a different order can mean different things.
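
To make the first point concrete, here is a minimal sketch of single-head self-attention without any positional information. The weight matrices `w_q`, `w_k`, `w_v` and the helper `self_attention` are hypothetical names used only for illustration. Shuffling the input tokens simply shuffles the output rows in the same way, so the layer by itself cannot tell one ordering from another.

```python
import torch

torch.manual_seed(0)
d = 4
x = torch.randn(5, d)  # 5 token embeddings, no positional information added
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))  # hypothetical projection weights

def self_attention(x):
    """Plain scaled dot-product self-attention over a (seq, d) tensor."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = torch.softmax(q @ k.T / d ** 0.5, dim=-1)
    return weights @ v

perm = torch.tensor([4, 2, 0, 3, 1])  # an arbitrary reordering of the tokens
out = self_attention(x)
out_shuffled = self_attention(x[perm])

# The output for each token is unchanged; only the row order follows the shuffle.
print(torch.allclose(out[perm], out_shuffled, atol=1e-4))
```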




## Desired properties for awesome positional encoding
- Gives a unique encoding to each position in a sequence. The token at position 5 has the same encoding regardless of sequence length.
- Straightforward relationship between two encoded positions. If we know the encoding for a token at position p, it should be easy to infer the encoding for the same token if it occurs at p + k.
- Generalizes to different sequence lengths.
- Deterministic.
- Extends naturally to multimodal models.

## Types of embeddings


### Integer position encoding
- Add the integer position to each component of the token embedding. This can work when sequence lengths are known and bounded.
- The position integer is on a very different scale from the embedding values.
- If we normalize by sequence length, tokens at the same position in sequences of different lengths get different encodings.
- So this does not really work.
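
Both problems can be seen in a small sketch. The embedding values and the helper `normalized_position` below are made up for illustration; they are not part of the notebook.

```python
# Raw integer positions: the position value dwarfs the embedding values.
token_embedding = [0.12, -0.34]  # hypothetical 2-d token embedding
position = 500
with_raw_position = [value + position for value in token_embedding]
print(with_raw_position)  # both components are now dominated by the position

# Length-normalized positions: the same index maps to different values
# depending on how long the sequence is.
def normalized_position(pos, seq_len):
    """Scale a 0-based position into [0, 1] for a sequence of seq_len tokens."""
    return pos / (seq_len - 1)

pos5_in_short = normalized_position(5, seq_len=10)
pos5_in_long = normalized_position(5, seq_len=100)
print(pos5_in_short, pos5_in_long)  # position 5 gets two different encodings
```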


### Binary positional encoding
- Encode the position as a binary vector, stretch it to the embedding dimension, and add it to the embedding vector.
- Binary counting is jumpy and discrete; we need something smooth.
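
A rough sketch of why binary counting is "jumpy": moving one position forward can flip a single bit or every bit at once. The helpers `binary_position` and `hamming` are illustrative names, and the 3-bit width is an arbitrary choice.

```python
def binary_position(pos, num_bits):
    """Return the bits of pos as a list, least significant bit first."""
    return [(pos >> bit) & 1 for bit in range(num_bits)]

def hamming(a, b):
    """Count the components in which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

for pos in range(8):
    print(pos, binary_position(pos, num_bits=3))

# Neighbouring positions can be very far apart or very close in encoding space.
step_3_to_4 = hamming(binary_position(3, 3), binary_position(4, 3))
step_4_to_5 = hamming(binary_position(4, 3), binary_position(5, 3))
print(step_3_to_4, step_4_to_5)
```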

### Fixed sinusoidal embedding 
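- The components of the positional embedding vector are drawn alternately from sine and cosine curves.
- For position pos and dimension index i: PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)), where d is the embedding dimension.
- No learned parameters; the sine and cosine values are fixed. The encoding is defined for any position, so it can be computed for sequences longer than those seen in training.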
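
The formula above can be sketched as follows. The function name `sinusoidal_positional_encoding` and its parameters are illustrative, and the sketch assumes an even embedding dimension.

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model, base=10000.0):
    """Return a (max_len, d_model) tensor of fixed encodings; d_model must be even."""
    positions = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    even_dims = torch.arange(0, d_model, 2, dtype=torch.float32)         # 2i = 0, 2, 4, ...
    angles = positions / base ** (even_dims / d_model)                    # (max_len, d_model / 2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(angles)  # odd dimensions use cosine
    return pe

pe_short = sinusoidal_positional_encoding(max_len=10, d_model=8)
pe_long = sinusoidal_positional_encoding(max_len=100, d_model=8)
print(pe_short.shape)

# Position 5 gets the same encoding regardless of the sequence length.
print(torch.allclose(pe_short[5], pe_long[5]))
```

Because nothing is learned, the table can be regenerated for any `max_len` at inference time.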


In [3]:
import torch
import torch.nn as nn
from transformers import AutoTokenizer


In [4]:
vocab_size = 50257
embedding_dim = 2
max_context_length = 100

input_text = "Hello my name is Ajay. Hello my name is Ajay"
# GPT2 model id
model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(input_text)

input_tokens = tokenizer.encode(input_text)
print(input_tokens)

input_tensor = torch.tensor(input_tokens[:max_context_length]).unsqueeze(0)
print(input_tensor.shape)

input_tokens_decoded = tokenizer.convert_ids_to_tokens(input_tensor[0].tolist())
print(input_tokens_decoded)

token_embedding_layer = nn.Embedding(vocab_size, embedding_dim)
input_embedding = token_embedding_layer(input_tensor)

input_embedding_dot_product_1_2 = torch.dot(input_embedding[0][1], input_embedding[0][2])
input_embedding_dot_product_8_9 = torch.dot(input_embedding[0][8], input_embedding[0][9])

print(f"Dot product of token 1 and token 2: {input_embedding_dot_product_1_2}")
print(f"Dot product of token 8 and token 9: {input_embedding_dot_product_8_9}")

Hello my name is Ajay. Hello my name is Ajay
[15496, 616, 1438, 318, 22028, 323, 13, 18435, 616, 1438, 318, 22028, 323]
torch.Size([1, 13])
['Hello', 'Ġmy', 'Ġname', 'Ġis', 'ĠAj', 'ay', '.', 'ĠHello', 'Ġmy', 'Ġname', 'Ġis', 'ĠAj', 'ay']
Dot product of token 1 and token 2: -0.18977676331996918
Dot product of token 8 and token 9: -0.18977676331996918


In [13]:
input_tokens_decoded[1], input_tokens_decoded[2], input_tokens_decoded[8], input_tokens_decoded[9]

('Ġmy', 'Ġname', 'Ġmy', 'Ġname')

### Absolute positional embedding
- Positional information is encoded through a trainable embedding matrix that maps integer positions to embedding vectors.
- Different dimensions can capture positional information at different frequencies.
- Easy to implement with standard embedding layers. It extrapolates poorly to longer sequences: positions beyond the trained maximum have no embedding, and relative position is not represented explicitly.
- Although the same pair of tokens appears at positions 1, 2 and at positions 8, 9, absolute positional encoding gives the two pairs different dot product scores.


In [12]:
seq_length = input_embedding.shape[1]
absolute_position_embedding_layer = nn.Embedding(seq_length, embedding_dim)

absolute_position_embedding = absolute_position_embedding_layer(torch.arange(seq_length).unsqueeze(0))

print(f"Shape of absolute position embedding vector: {absolute_position_embedding.shape}")


input_plus_absolute_position_embedding = input_embedding + absolute_position_embedding

print(f"Shape of input plus absolute position embedding: {input_plus_absolute_position_embedding.shape}")



# get dot product between vectors 1 and 2
dot_product_12 = torch.dot(input_plus_absolute_position_embedding[0][1], input_plus_absolute_position_embedding[0][2])
dot_product_89 = torch.dot(input_plus_absolute_position_embedding[0][8], input_plus_absolute_position_embedding[0][9])
print(f"Dot product between vectors 1 and 2: {dot_product_12}")
print(f"Dot product between vectors 8 and 9: {dot_product_89}")

Shape of absolute position embedding vector: torch.Size([1, 13, 2])
Shape of input plus absolute position embedding: torch.Size([1, 13, 2])
Dot product between vectors 1 and 2: 0.1645088940858841
Dot product between vectors 8 and 9: 0.35054731369018555


### Relative positional embedding 
- Encodes the relative distance between tokens rather than their absolute positions.
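
There are several ways to do this. As an assumption for illustration, the sketch below uses one common variant: a learned scalar bias per clipped relative distance, which would be added to the attention scores. The names `max_distance` and `relative_bias_layer` are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_length = 6
max_distance = 4  # hypothetical clipping distance

positions = torch.arange(seq_length)
# relative_distance[i, j] = j - i
relative_distance = positions.unsqueeze(0) - positions.unsqueeze(1)
clipped = relative_distance.clamp(-max_distance, max_distance)

# One learned scalar per clipped distance; shift by max_distance to get valid indices.
relative_bias_layer = nn.Embedding(2 * max_distance + 1, 1)
relative_bias = relative_bias_layer(clipped + max_distance).squeeze(-1)  # (seq, seq)

print(relative_distance)
# Pairs at the same distance share the same bias, wherever they occur.
print(torch.equal(relative_bias[0, 1], relative_bias[3, 4]))
```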

### Rotary positional embedding
- Rotate the embedding vector by an angle that is a function of the token's position in the sequence. This carries both absolute and relative positional information.
    - A word pair that appears at different points in the sentence, at the same distance apart, should get the same dot product score.



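- Go through https://huggingface.co/blog/designing-positional-encoding
- In RoPE, the token vector at position i is rotated by iθ and the token vector at position j is rotated by jθ.
- For a 2-d pair: dot_product(rot(t_i), rot(t_j)) = |t_i| |t_j| cos(φ + (i - j)θ), where φ is the angle between the unrotated vectors.
    - For fixed token vectors, the score depends only on the distance between i and j.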
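
The distance-only property can be checked by hand in two dimensions. This is a minimal sketch with a single rotation frequency; the helper `rotate_2d`, the value `theta=1.0`, and the vectors `q` and `k` are assumptions chosen for illustration, not what a library implementation uses internally.

```python
import math
import torch

def rotate_2d(vec, position, theta=1.0):
    """Rotate a 2-d vector by the angle position * theta."""
    angle = position * theta
    rotation = torch.tensor([[math.cos(angle), -math.sin(angle)],
                             [math.sin(angle),  math.cos(angle)]])
    return rotation @ vec

q = torch.tensor([1.0, 2.0])   # hypothetical vector for the first token of the pair
k = torch.tensor([0.5, -1.0])  # hypothetical vector for the second token of the pair

score_1_2 = torch.dot(rotate_2d(q, 1), rotate_2d(k, 2))
score_8_9 = torch.dot(rotate_2d(q, 8), rotate_2d(k, 9))
score_1_5 = torch.dot(rotate_2d(q, 1), rotate_2d(k, 5))

# Same distance (1) gives the same score; a different distance (4) does not.
print(score_1_2.item(), score_8_9.item(), score_1_5.item())
```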

In [7]:
from torchtune.modules import RotaryPositionalEmbeddings

In [8]:
rope_layer = RotaryPositionalEmbeddings(dim=2, max_seq_len=13)
rope_embedding = rope_layer(input_embedding.unsqueeze(dim=2)).squeeze(dim=2)
print(f"Shape of ROPE embedding vector: {rope_embedding.shape}")

Shape of ROPE embedding vector: torch.Size([1, 13, 2])


In [11]:
dot_product_12_rope = torch.dot(rope_embedding[0][1], rope_embedding[0][2])
dot_product_89_rope = torch.dot(rope_embedding[0][8], rope_embedding[0][9])
print(f"Dot product between vectors 1 and 2: {dot_product_12_rope}")
print(f"Dot product between vectors 8 and 9: {dot_product_89_rope}")

Dot product between vectors 1 and 2: 0.30431175231933594
Dot product between vectors 8 and 9: 0.30431169271469116


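- RoPE gives the pair (8, 9) the same dot product score as the pair (1, 2), up to floating-point error, because both pairs contain the same tokens at the same relative distance.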