### 🧭 Positional Encoding

Word embeddings let us represent words as **numerical vectors**, but they don't capture **the order or position** of words in a sentence.  
To help models understand *where* each word appears, we use **Positional Encoding**.

---

#### 🧩 Why Do We Need It?
Self-attention processes all tokens in parallel and is, by itself, insensitive to their order (unlike an RNN, which reads tokens one at a time), so a Transformer needs an additional way to encode **token position information**.  
Positional encodings are **vectors added to word embeddings**, both having the same dimensionality, so the model can infer *sequence order*.
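
A minimal sketch of the idea above, using toy sizes chosen only for illustration (a 10-token vocabulary, 4 dimensions, 3 positions). The sum over tokens stands in for any order-insensitive operation; it is not how attention is actually computed.

```python
import torch

torch.manual_seed(0)
embedding = torch.nn.Embedding(10, 4)  # toy vocabulary: 10 tokens, 4 dimensions
position = torch.nn.Embedding(3, 4)    # one vector per position 0..2

tokens = torch.tensor([1, 2, 3])
shuffled = torch.tensor([3, 1, 2])     # same tokens, different order

# Without position information both sequences contain exactly the same vectors,
# so an order-insensitive summary (here: the sum) cannot tell them apart.
same_without_pos = torch.allclose(embedding(tokens).sum(0), embedding(shuffled).sum(0))

# After adding a position vector to each slot, token 1 at position 0
# no longer has the same representation as token 1 at position 1.
pos = position(torch.arange(3))
tok1_at_pos0 = embedding(tokens)[0] + pos[0]
tok1_at_pos1 = embedding(shuffled)[1] + pos[1]
differs_with_pos = not torch.allclose(tok1_at_pos0, tok1_at_pos1)

print(same_without_pos, differs_with_pos)
```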

---

#### ⚙️ Types of Positional Encoding

1. **Sinusoidal (Fixed) Encoding**, used in the original Transformer paper
   - Position information is represented using sine and cosine functions of different frequencies.
   - Because the encoding is a fixed function of position, it can be computed for any sequence length; the original paper hypothesized that this may let the model extrapolate to sequences longer than those seen during training.

   $$
   PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)
   $$

   $$
   PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)
   $$

   where:  
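   - \( pos \) = position of the token in the sequence (starting at 0)  
   - \( i \) = index of the sine/cosine pair, \( 0 \le i < d_{model}/2 \), so \( 2i \) and \( 2i+1 \) are the even and odd embedding dimensions  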
   - \( d_{model} \) = embedding dimension
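
The two formulas above can be sketched in PyTorch as follows. The function name `sinusoidal_positional_encoding` is hypothetical, and the sketch assumes an even `d_model`; the sizes in the usage line match the ones used later in this notebook.

```python
import math
import torch

def sinusoidal_positional_encoding(context_size: int, d_model: int) -> torch.Tensor:
    """Return a (context_size, d_model) tensor of fixed sinusoidal encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    # Column vector of positions 0 .. context_size - 1, shape (context_size, 1)
    positions = torch.arange(context_size, dtype=torch.float32).unsqueeze(1)
    # 1 / 10000^(2i / d_model) for i = 0 .. d_model/2 - 1, shape (d_model / 2,)
    inv_freq = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(context_size, d_model)
    pe[:, 0::2] = torch.sin(positions * inv_freq)  # even dimensions 2i
    pe[:, 1::2] = torch.cos(positions * inv_freq)  # odd dimensions 2i + 1
    return pe

pe = sinusoidal_positional_encoding(context_size=4, d_model=256)
print(pe.shape)  # torch.Size([4, 256])
```

Since the result has the same shape as a learned position table, it could be added to a batch of token embeddings in the same way.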

---

2. **Learned Positional Embeddings**, used in GPT models
   - Instead of using sine/cosine formulas, the model **learns position embeddings** as trainable parameters, just like word embeddings.
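
A small sketch of what "trainable" means here, with toy sizes picked for illustration: the position table is an ordinary `torch.nn.Embedding` weight, so it receives gradients like any other parameter. The summed "loss" is a dummy, used only to trigger backpropagation.

```python
import torch

context_size, n_dim = 4, 8                 # toy sizes, for illustration only
pos_emb = torch.nn.Embedding(context_size, n_dim)

print(pos_emb.weight.shape)                # torch.Size([4, 8])
print(pos_emb.weight.requires_grad)        # True

# Dummy scalar loss: only there to show that gradients reach the position table.
loss = pos_emb(torch.arange(context_size)).sum()
loss.backward()
print(pos_emb.weight.grad.shape)           # torch.Size([4, 8])
```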

---

#### 📏 Example (GPT-2)
| Parameter | Value |
|------------|--------|
| **Vocabulary Size** | 50,257 tokens |
| **Embedding Dimension** | 768 |
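
As a quick worked calculation, these sizes determine how many parameters the two lookup tables hold. GPT-2's BPE vocabulary has 50,257 tokens; the context length of 1,024 is an assumption added here (it is GPT-2's maximum context length, but it is not listed in the table).

```python
vocab_size = 50257    # GPT-2 BPE vocabulary
n_dim = 768           # embedding dimension from the table
context_size = 1024   # assumption: GPT-2's maximum context length (not in the table)

token_table_params = vocab_size * n_dim        # one row per token
position_table_params = context_size * n_dim   # one row per position

print(token_table_params)     # 38597376
print(position_table_params)  # 786432
```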

---

#### 💡 Intuition
By combining **word embeddings** (what the word means) with **positional encodings** (where the word appears),  
Transformers can understand both **content** and **order**, which is essential for generating coherent text.


In [2]:
# load the data 
with open('the-verdict.txt', 'r', encoding='utf-8') as f:
    raw_data = f.read()

In [4]:
print(raw_data[:100])

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


In [16]:
# Use the Dataset and DataLoader classes to build input-target pairs and group them into batches

import torch 
from torch.utils.data import Dataset, DataLoader 

class CreateGPTDataV1(Dataset):
    def __init__(self, raw_data, tokenizer, context_size, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(raw_data)

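        # Slide a window of context_size tokens over the text; the bound keeps
        # every target window (shifted right by one token) fully inside token_ids.
        for i in range(0, len(token_ids) - context_size, stride):
            x = token_ids[i:i + context_size]          # input window
            y = token_ids[i + 1:i + context_size + 1]  # same window shifted right by one token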
            self.input_ids.append(torch.tensor(x))
            self.target_ids.append(torch.tensor(y))

    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
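
To see what the sliding window produces, here is a dependency-free sketch on a toy list of ten integers standing in for real token ids. The loop bound `len(token_ids) - context_size` is the largest range for which a full target window still exists.

```python
token_ids = list(range(10))   # stand-in for tokenizer output: [0, 1, ..., 9]
context_size, stride = 4, 2

pairs = []
for i in range(0, len(token_ids) - context_size, stride):
    x = token_ids[i:i + context_size]          # input window
    y = token_ids[i + 1:i + context_size + 1]  # target: input shifted right by one
    pairs.append((x, y))

for x, y in pairs:
    print(x, "->", y)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [2, 3, 4, 5] -> [3, 4, 5, 6]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

With `stride=2` consecutive windows overlap by two tokens; a stride equal to `context_size` would give non-overlapping windows.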

In [17]:
# making a dataloader function
import tiktoken
def create_dataloader(raw_data, batch_size=8, context_size=4, stride=2, shuffle=True, drop_last=True, num_workers=0):
    
    tokenizer = tiktoken.get_encoding('gpt2')
    dataset = CreateGPTDataV1(raw_data, tokenizer=tokenizer, context_size=context_size, stride=stride)

    # batch_size  : number of samples in each batch the model processes before a parameter update
    # num_workers : number of subprocesses used for data loading (0 loads in the main process)

    # If the dataset size is not divisible by batch_size, the last batch is smaller than the rest.
    # The drop_last flag controls whether that final partial batch is kept or dropped.

    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

    return dataloader
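
The effect of `drop_last` can be checked on a toy dataset. This sketch uses `TensorDataset` with ten samples (an illustrative stand-in, not the dataset built above) and disables shuffling so the batches are easy to read.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy_dataset = TensorDataset(torch.arange(10))   # 10 samples, not divisible by 4

kept = DataLoader(toy_dataset, batch_size=4, shuffle=False, drop_last=False)
dropped = DataLoader(toy_dataset, batch_size=4, shuffle=False, drop_last=True)

print(len(kept))     # 3 batches: sizes 4, 4, 2
print(len(dropped))  # 2 batches: sizes 4, 4
print([batch[0].tolist() for batch in kept])
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```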

In [18]:
dataloader = create_dataloader(raw_data)

In [21]:
input1, target1 = next(iter(dataloader))
print(f"Input ids of random batch : {input1}")
print(f"Target ids of random batch : {target1}")

Input ids of random batch : tensor([[   11,  3181,   503,   287],
        [   13,   314,   531, 10722],
        [ 7543,   284,   607,   599],
        [  262,  6678, 40315, 10455],
        [ 9091,    11,   393,  5017],
        [  502,  6609,  1474,   683],
        [  257,  1310,  4295,   438],
        [   11,   326, 14516,   530]])
Target ids of random batch : tensor([[ 3181,   503,   287,   262],
        [  314,   531, 10722,   292],
        [  284,   607,   599,  6321],
        [ 6678, 40315, 10455,   546],
        [   11,   393,  5017,   510],
        [ 6609,  1474,   683,   318],
        [ 1310,  4295,   438,    40],
        [  326, 14516,   530,   286]])
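
A quick sanity check on the printed batch: each target row should be its input row shifted left by one token. The sketch below copies the first two rows from the output above rather than reloading the data.

```python
import torch

# First two rows of the batch printed above
inputs = torch.tensor([[11, 3181, 503, 287],
                       [13, 314, 531, 10722]])
targets = torch.tensor([[3181, 503, 287, 262],
                        [314, 531, 10722, 292]])

# Input tokens 1..3 must equal target tokens 0..2
shifted = torch.equal(inputs[:, 1:], targets[:, :-1])
print(shifted)  # True
```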


#### Creating the Embedding Layer

In [22]:
vocab_size = 50257  # size of the GPT-2 BPE vocabulary
n_dim = 256
embedding_layer = torch.nn.Embedding(vocab_size, n_dim)
print(f"Embedding of a particular index is : {embedding_layer(torch.tensor(5))}")

Embedding of a particular index is : tensor([-1.2059, -1.4148,  0.1592,  0.7264, -0.6956,  0.5645, -0.1026,  0.8552,
        -0.5472, -0.5667, -0.0869, -0.8797, -0.1199,  0.2968,  0.2860,  0.5360,
         1.2170,  0.1915,  1.4244, -1.7213,  0.0454,  0.0283, -0.5049, -1.8557,
         1.2699, -1.0043, -0.9414, -0.3488,  1.0302,  1.3187,  0.1226,  0.0232,
        -0.7217, -0.9127,  0.2719, -0.6029, -0.4522,  0.0854,  2.1252,  0.8918,
         0.8632, -1.1154,  1.1197,  0.2170,  0.1302,  0.3208, -0.9809,  0.2683,
        -1.0661, -2.1820,  0.1911,  1.1133, -0.9528, -0.5476,  0.1193,  0.9452,
        -1.6690,  0.6063,  1.2096,  0.8946,  0.5395, -1.0538,  1.0637,  0.8040,
        -0.0799,  0.9757, -0.1454,  0.4125, -0.2916,  1.3593,  1.2002,  0.8287,
        -1.0826, -1.2300, -0.8765,  1.4432, -1.4710, -1.9401,  0.2504, -1.1035,
         1.9566,  0.0928, -0.3151, -0.2481,  2.0366, -0.8721, -0.1955, -1.9092,
        -0.1083, -0.2639,  1.0048, -0.7462, -1.0388,  0.4919,  0.0065,  0.6546,
   

#### Creating Positional Encodings 

In [23]:
torch.arange(4)

tensor([0, 1, 2, 3])

In [25]:
context_size = 4  # number of positions; the position indices are 0, 1, 2, and 3
n_dim = 256 

positional_encoding_layer = torch.nn.Embedding(context_size, n_dim)
# torch.arange(context_size) yields tensor([0, 1, 2, 3]), one index per position
positional_encodings = positional_encoding_layer(torch.arange(context_size))
print(f"Shape of positional encodings : {positional_encodings.shape}")

Shape of positional encodings : torch.Size([4, 256])


In [28]:
# Embed the first input batch
print(f"Shape of batch of token_ids : {input1.shape}")
embds_input1 = embedding_layer(input1)
print(f"Shape of batch of embeddings : {embds_input1.shape}")

Shape of batch of token_ids : torch.Size([8, 4])
Shape of batch of embeddings : torch.Size([8, 4, 256])


In [29]:
# Add the positional encodings to the token embeddings
embds_after_position = embds_input1 + positional_encodings
print(f"Shape after adding positional encodings : {embds_after_position.shape}")

Shape after adding positional encodings : torch.Size([8, 4, 256])
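
The addition works because of broadcasting: a `[4, 256]` tensor is added to every one of the 8 sequences in a `[8, 4, 256]` batch. The sketch below uses random stand-ins for `embds_input1` and `positional_encodings` and compares the implicit broadcast with an explicitly expanded tensor.

```python
import torch

torch.manual_seed(0)
batch = torch.randn(8, 4, 256)    # stand-in for embds_input1: (batch, position, dim)
positions = torch.randn(4, 256)   # stand-in for positional_encodings: (position, dim)

summed = batch + positions        # positions is broadcast across the batch dimension
explicit = batch + positions.unsqueeze(0).expand(8, -1, -1)  # same thing, spelled out

print(summed.shape)                   # torch.Size([8, 4, 256])
print(torch.equal(summed, explicit))  # True
```

Every sequence in the batch therefore receives the same position vector at a given position, regardless of which tokens it contains.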
