<a href="https://colab.research.google.com/github/ankitgoelcmu/DeepLearning/blob/main/Understanding_GPT__Word_Positional_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## What Are Word Embeddings?
Word embeddings are dense vector representations of discrete tokens (words, subwords, or characters) in a vocabulary. Tokens are categorical IDs (e.g., "cat" = ID 42), but models like transformers need continuous numerical inputs for operations such as attention and linear layers. Embeddings map each ID to a fixed-size vector (e.g., 384-dim) that, after training, encodes semantic meaning: similar tokens get similar vectors (e.g., "king" close to "queen").

## Analogy
Tokens are like "words in a dictionary" (sparse, one-hot vectors like [0, 1, 0, ...]). Embeddings are like "coordinate points in a semantic space" (dense, like [0.2, -0.5, 1.3, ...])—the model learns distances (e.g., "Paris" near "France" via vector proximity).

## Why do we need to convert TokenIDs into Word Embeddings?
Let's assume our vocab size is 50,000 words. Then one word (e.g., CAT with TokenID = 48) would have a vector representation like [0, 0, 0, ..., 1, 0, 0, 0].



Without word embeddings, CAT would be a one-hot vector: a list of 50,000 numbers, all zeros except a single one at position 48, because that's its ID. So: [0, 0, 0, ..., (1), 0, ..., 0], one tiny spike in a huge desert. Why's that bad?

- Wastes memory. Fifty thousand numbers for every word. Multiply by batch size and sequence length and memory explodes.
- Slow math. Dot products, linear layers, and everything else applied to the input scale with its dimension.
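To make the comparison concrete, here is a minimal sketch showing that an embedding lookup gives the same result as multiplying a one-hot vector by the embedding matrix, just without ever building the 50,000-wide vector. The sizes and the ID 48 come from the text above; the variable names are only illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embedding_dim = 50_000, 384  # sizes used in the text
cat_id = 48                              # "cat" token ID from the text

embed = nn.Embedding(vocab_size, embedding_dim)

# Route 1: direct row lookup (what nn.Embedding does).
lookup = embed(torch.tensor([cat_id]))                            # [1, 384]

# Route 2: build the one-hot vector and multiply by the same matrix.
one_hot = F.one_hot(torch.tensor([cat_id]), vocab_size).float()   # [1, 50000]
via_matmul = one_hot @ embed.weight                               # [1, 384]

same = torch.allclose(lookup, via_matmul)
print(f"Same result: {same}")
print(f"Numbers per token, one-hot:   {one_hot.shape[1]}")
print(f"Numbers per token, embedding: {lookup.shape[1]}")
```

Both routes select row 48 of the weight matrix. The lookup simply skips the multiplication by 49,999 zeros.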

Word embeddings start out random, but we initialize them carefully so training stays stable.

Here's how:
1. Create a big matrix (vocab × dim). In PyTorch, `nn.Embedding(50000, 384)` makes a [50000, 384] weight tensor. By default its entries are drawn from a standard normal (std=1), so we usually re-initialize it.
2. Fill it with small random values, not huge ones, because attention scores can blow up. Common tricks: uniform tiny (all numbers between -0.01 and 0.01) or normal with tiny std (like torch.randn() scaled to std = 0.02).
3. So cat (ID 48)'s row starts as something like [0.004, -0.012, 0.018, -0.009, ...]: all tiny, all random.
4. Why tiny? If you start too big (like std=1), the dot products in attention can get large, the softmax saturates, and gradients become unstable. Small start = gentle learning.
5. Then it learns. Every time the model predicts wrong (cat instead of dog), backprop tweaks that 384-dim vector. Slowly, cat moves closer to words like kitten, pet, meow, not because we told it to, but because patterns in the training data shape the space.
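The effect of the starting scale can be checked with a rough sketch. It compares the typical dot product between random embedding rows at std=0.02 and at std=1 (the default that `nn.Embedding` uses). The helper `typical_dot_product` is a made-up name for this illustration, and raw dot products between rows are only a stand-in for real attention scores, which also pass through learned projections.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embedding_dim = 50_000, 384  # sizes used in the text

def typical_dot_product(std: float) -> float:
    """Mean absolute dot product between 1000 random pairs of embedding rows."""
    embed = nn.Embedding(vocab_size, embedding_dim)
    nn.init.normal_(embed.weight, mean=0.0, std=std)
    a = embed.weight[:1000]
    b = embed.weight[1000:2000]
    return (a * b).sum(dim=1).abs().mean().item()

small = typical_dot_product(0.02)
large = typical_dot_product(1.0)
print(f"std=0.02 -> typical |dot product| = {small:.4f}")
print(f"std=1.00 -> typical |dot product| = {large:.4f}")
```

The dot product's spread grows with the square of the std, so moving from 0.02 to 1 makes the scores several thousand times larger.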


## Now that we understand what a word embedding is, let's run through it in PyTorch code


In [None]:
import torch
import torch.nn as nn
import torch.nn.init as init
import math

# Step 1: Define vocab (e.g., char-level for nano GPT-2)
chars = sorted('abcdefghijklmnopqrstuvwxyz ,.?!\';:\n')  # Shakespeare chars
vocab_size = len(chars)  # 35 (26 letters + 9 punctuation/whitespace chars)
seq_len = 4  # Context length
stoi = {ch: i for i, ch in enumerate(chars)}  # Char to ID
itos = {i: ch for i, ch in enumerate(chars)}  # ID to char

print(f"Vocab size: {vocab_size}")  # Vocab size: 35

# Step 2: Create embedding layer
embedding_dim = 3  # Small for our example (nano GPT uses 384)
token_embed = nn.Embedding(vocab_size, embedding_dim)  # Matrix: [35, 3]
print(f"Token Embedding: {token_embed}")  # [vocab_size, embedding_dim]

# Optional: re-initialize with a small spread (uniform or normal both work; this example uses trunc_normal_)
# trunc_normal_(tensor, mean=0, std=0.02) draws from a narrow normal bell and redraws values outside [a, b].
# Note: a and b are absolute bounds that default to -2 and 2, not multiples of std,
# so with std=0.02 almost nothing is cut off; pass a=-0.04, b=0.04 to truncate at 2 * std.
init.trunc_normal_(token_embed.weight, std=0.02)  # Small random values for stability

print(f"Embedding weight shape: {token_embed.weight.shape}")  # [vocab_size, embedding_dim]
# Output: torch.Size([35, 3]), i.e. 35 * 3 = 105 params
#print(f"Embed weight matrix: {token_embed.weight}")

# Step 3: Forward pass—embed a sequence
input_ids = torch.tensor([[1, 4, 7, 12]])  # 4 token IDs; with this sorted vocab they map to ' ', ',', ';', 'd'
embedded = token_embed(input_ids)  # Lookup rows 1,4,7,12
print(f"Input IDs: {input_ids[0, :2]}")
print(f"Input IDs shape: {input_ids.shape}")  # [1, 4]
print(f"Embedded shape: {embedded.shape}")  # [1, 4, 3]
print(f"Sample embed (first token): {embedded[:, :, :]}")  # Despite the label, this prints the vectors for all four IDs (1, 4, 7, 12)
# The output is the learnable embedding vectors for those IDs



# Step 4: Positional embeddings (sin/cos)
position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)  # [seq_len, 1]
print(f"Position: {position.shape}") # tensor of shape seq_len, 1

"""
Generate sinusoidal positional embeddings on-the-fly.
    Shape: [1, seq_len, embedding_dim]
    - Even dims: sin(pos / 10000^(2i/d))
    - Odd dims: cos(pos / 10000^(2i/d))
"""
num_freqs = embedding_dim // 2  # Number of frequencies
div_term = torch.exp(
    torch.arange(0, num_freqs * 2, 2, dtype=torch.float) *  # [0] length 1
    -(math.log(10000.0) / embedding_dim)
)  # [1] e.g., [1.0]

print(f"Div term: {div_term}")  # tensor([1.])
print(f"Div term: {div_term}")  # Same value again; its shape is [embedding_dim // 2] = [1]
pe = torch.zeros(1, seq_len, embedding_dim)  # [1, seq_len, D]
pe[0, :, 0::2] = torch.sin(position * div_term)  # Even dims
pe[0, :, 1::2] = torch.cos(position * div_term)
print(f"PE shape: {pe.shape}")  # [1, seq_len, D]
print(f"Sample PE (pos 0): {pe[0, 0]}")  # e.g., [sin(0), cos(0), sin(0)] ≈ [0, 1, 0]

# Step 5: Combine (token + pos)
combined = embedded + pe  # [1, 4, 3] + [1, 4, 3] → [1, 4, 3]
print(f"Combined shape: {combined.shape}")
print(f"Sample combined (first token): {combined[0, 0]}")  # Token1 + pos0


Vocab size: 35
Token Embedding: Embedding(35, 3)
Embedding weight shape: torch.Size([35, 3])
Input IDs: tensor([1, 4])
Input IDs shape: torch.Size([1, 4])
Embedded shape: torch.Size([1, 4, 3])
Sample embed (first token): tensor([[[ 0.0319,  0.0104,  0.0078],
         [ 0.0178, -0.0020,  0.0088],
         [-0.0074, -0.0042,  0.0431],
         [-0.0271, -0.0080,  0.0164]]], grad_fn=<AliasBackward0>)
Position: torch.Size([4, 1])
Div term: tensor([1.])
Div term: tensor([1.])
PE shape: torch.Size([1, 4, 3])
Sample PE (pos 0): tensor([0., 1., 0.])
Combined shape: torch.Size([1, 4, 3])
Sample combined (first token): tensor([0.0319, 1.0104, 0.0078], grad_fn=<SelectBackward0>)


# What Is Positional Embedding?
Positional encoding is a way to inject order information (position of tokens in a sequence) into a transformer.

A transformer:

1. Sees tokens all at once

2. Uses self-attention

3. Has no built-in sense of sequence or time

So the position of each token is encoded as a vector and combined with the token embedding.

Formally:
`input_embedding = token_embedding + positional_embedding`
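A small sketch can show why attention needs this. It uses single-head, unmasked self-attention with random weights (stand-ins for trained ones, not taken from any real model). Without positional vectors, shuffling the input tokens only shuffles the output rows. Once a position vector is added, the same shuffle changes the result.

```python
import math
import torch

torch.manual_seed(0)
seq_len, embedding_dim = 4, 8

# Hypothetical random projection matrices standing in for trained weights.
w_q = torch.randn(embedding_dim, embedding_dim)
w_k = torch.randn(embedding_dim, embedding_dim)
w_v = torch.randn(embedding_dim, embedding_dim)

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head, unmasked self-attention over x of shape [T, D]."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(embedding_dim)
    return torch.softmax(scores, dim=-1) @ v

tokens = torch.randn(seq_len, embedding_dim)  # token embeddings only
perm = torch.tensor([2, 0, 3, 1])             # a reordering of the sequence

# Without positions: shuffling the input only shuffles the output rows.
order_blind = torch.allclose(
    self_attention(tokens)[perm], self_attention(tokens[perm]), atol=1e-4
)

# With positions added, the same shuffle changes the result.
positions = torch.randn(seq_len, embedding_dim)  # stand-in positional vectors
order_aware = not torch.allclose(
    self_attention(tokens + positions)[perm],
    self_attention(tokens[perm] + positions),
    atol=1e-4,
)
print(f"Order-blind without positions: {order_blind}")
print(f"Order-aware with positions:    {order_aware}")
```

GPT-style models also apply a causal mask, which leaks some order information on its own. This sketch leaves the mask out to isolate the effect of the positional term.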

# How does positional encoding help?

By adding positional information:

1. Tokens now carry both meaning + position

2. Attention can reason about:
    1. Before / after
    2. Distance between words
    3. Long-range dependencies

This allows attention to learn things like:

1. “verbs usually follow subjects”

2. “closing brackets match opening brackets earlier”

3. “this word refers to something 20 tokens back”

# How does the transformer know what part was word vs position?

So let’s break it down. You have your token embedding for the word "cat," which is a vector like (0.1, 0.5, 0.3). Then you have your positional embedding, say (0.2, 0.3, 0.4). When you add them together, you end up with (0.3, 0.8, 0.7).

Now, how does the transformer know that 0.3 in the first dimension is a mix of the word embedding (0.1) and the positional embedding (0.2), and so on for the other dimensions?

The short answer is: it doesn’t have to know explicitly. It’s all about what the model learns during training.

When we train a transformer, it sees many examples and learns to interpret that combined vector. The attention heads and layers figure out patterns that help them separate out the “positional flavor” from the “word meaning” flavor. It’s not a manual or explicit process; it’s something the neural network learns to disentangle on its own because it helps the model minimize its prediction errors.

So in essence, the transformer doesn’t have a built-in rule that says “this part of the vector is position and this part is token.” Instead, it just learns to interpret the combined representation in a way that helps it understand the sequence and predict accurately. It’s all part of the model’s learned internal representations.
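One way to see why the sum does not destroy information is the sketch below. Query and key projections are linear, so a first-layer attention score on the summed vectors splits exactly into word-word, word-position, position-word, and position-position terms. The "cat" vectors are the ones from the example above. The second token, its position, and the weights are made up for illustration, and this only describes the first layer, not what deeper layers learn.

```python
import torch

torch.manual_seed(0)

# Vectors for "cat" and its position, taken from the example in the text.
tok_i = torch.tensor([0.1, 0.5, 0.3])
pos_i = torch.tensor([0.2, 0.3, 0.4])
# A second token and position, made up for illustration.
tok_j = torch.tensor([0.4, -0.2, 0.1])
pos_j = torch.tensor([-0.1, 0.6, 0.2])

# Hypothetical query/key projections standing in for learned weights.
w_q = torch.randn(3, 3)
w_k = torch.randn(3, 3)

def score(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Unscaled attention score between a query-side and a key-side vector."""
    return (a @ w_q) @ (b @ w_k)

# Score computed on the combined (token + position) inputs.
full = score(tok_i + pos_i, tok_j + pos_j)

# The same score, split into four separately weighted contributions.
parts = {
    "word-word": score(tok_i, tok_j),
    "word-position": score(tok_i, pos_j),
    "position-word": score(pos_i, tok_j),
    "position-position": score(pos_i, pos_j),
}
total = sum(parts.values())

matches = torch.allclose(full, total, atol=1e-5)
print(f"Four terms add up to the full score: {matches}")
```

Training adjusts `w_q` and `w_k`, so the model can learn how much weight each of the four terms gets without ever separating the vector by hand.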

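The sinusoidal positional embeddings in the code above are generated on the fly rather than learned. The result has shape `[1, seq_len, embedding_dim]`, where pos is the token position, i indexes each (sin, cos) pair, and d is embedding_dim:
1. Even dims (2i): sin(pos / 10000^(2i/d))
2. Odd dims (2i+1): cos(pos / 10000^(2i/d))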
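The formula can be wrapped in a reusable function, sketched here under the assumption that each (sin, cos) pair gets its own frequency, including the last sin column when embedding_dim is odd. The name `sinusoidal_positions` is illustrative. For embedding_dim=3, the notebook cell above has only one frequency and reuses it for dim 2 through broadcasting, so the two versions differ in that column.

```python
import math
import torch

def sinusoidal_positions(seq_len: int, embedding_dim: int) -> torch.Tensor:
    """Return fixed sin/cos positional encodings of shape [1, seq_len, embedding_dim]."""
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)  # [seq_len, 1]
    # One frequency per (sin, cos) pair; round up so an odd dim gets a final sin column.
    num_pairs = (embedding_dim + 1) // 2
    div_term = torch.exp(
        torch.arange(num_pairs, dtype=torch.float) * 2.0
        * -(math.log(10000.0) / embedding_dim)
    )                                                                 # [num_pairs]
    angles = position * div_term                                      # [seq_len, num_pairs]
    pe = torch.zeros(1, seq_len, embedding_dim)
    pe[0, :, 0::2] = torch.sin(angles)                                # even dims
    pe[0, :, 1::2] = torch.cos(angles[:, : embedding_dim // 2])       # odd dims
    return pe

pe = sinusoidal_positions(seq_len=4, embedding_dim=3)
print(pe.shape)  # torch.Size([1, 4, 3])
print(pe[0, 0])  # position 0 is [sin(0), cos(0), sin(0)] -> tensor([0., 1., 0.])
```

Because the values depend only on `seq_len` and `embedding_dim`, the tensor can be computed once and added to the token embeddings of any batch.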

Good video on positional encoding:
https://www.youtube.com/watch?v=LBsyiaEki_8
  


### How GPT-2 Handles Positional Encoding

GPT-2 uses **learnable absolute positional embeddings**, a simple but effective approach that other transformers (like BERT or ViT) also use. Unlike the original Transformer paper's fixed sinusoidal encoding (sin/cos waves), GPT-2 makes positional embeddings **trainable parameters**: a matrix learned during training alongside everything else. This allows the model to adapt position signals to the data, at the cost of a fixed maximum sequence length.

#### Why Learnable Absolute Positional Embeddings?
- **Absolute**: Each position (0, 1, 2, ..., max_len-1) gets a fixed, unique vector added to the token embedding at that spot.
- **Learnable**: Initialized randomly with a small spread (std=0.02 trunc_normal in the snippet below), then optimized via backprop, so the model learns what "position 5" means for language patterns.
- **Benefits**: Flexible (no fixed waves); works for any sequence length up to max_len (1024 in GPT-2); simple to implement.
- **Drawbacks**: Can't handle positions beyond max_len, since no vector was trained for them (sinusoidal encodings can at least be computed for any position); adds ~0.8M params (1024 * 768 = 786,432 for GPT-2 small).

#### How It Works (Step-by-Step)
1. **Pre-Allocate Pos Matrix**: A learnable Parameter [1, max_seq_len, embedding_dim] (e.g., [1, 1024, 768]).
2. **Slice to Actual Length**: For input seq_len=T (e.g., 10), take first T rows: [1, T, D].
3. **Add to Token Embeds**: Element-wise: `x = token_emb + pos_emb` — token at pos i gets semantics + pos i's vector.
4. **Broadcast**: The [1, T, D] pos expands to match batch size B.

This injects "where am I?" info without loops—vectorized add.

#### Code Snippet (From GPT-2 Style)
In the `__init__`:
```python
self.pos_embed = nn.Parameter(torch.zeros(1, max_seq_len, embedding_dim))  # Learnable [1, 1024, 768]
nn.init.trunc_normal_(self.pos_embed, std=0.02)  # Small random start
```

In `forward`:
```python
def forward(self, x):  # x: [B, T] token IDs
    B, T = x.shape
    tok_emb = self.token_embed(x)  # [B, T, D]
    
    # Slice pos to T
    pos_emb = self.pos_embed[:, :T, :]  # [1, T, D]
    
    # Add (broadcast to B)
    x = tok_emb + pos_emb  # [B, T, D]
    # ... to transformer ...
```

- **Example**: T=3, pos_embed rows 0-2 added to tokens 0-2.
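- **Training**: Gradients update the pos matrix directly. Each row stays tied to one absolute position, but rows for nearby positions can end up similar if that helps prediction (e.g., pos 0 and pos 1 for bigram patterns).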
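Putting the `__init__` and `forward` fragments together gives a minimal runnable sketch of the embedding stage. The class name `TokenAndPositionEmbedding` and the length check are additions for illustration, not part of GPT-2 itself. The sizes follow the GPT-2 small numbers quoted above, with the 35-character vocab from the earlier cell.

```python
import torch
import torch.nn as nn

class TokenAndPositionEmbedding(nn.Module):
    """Token embedding plus learnable absolute positional embedding."""

    def __init__(self, vocab_size: int, max_seq_len: int, embedding_dim: int):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embedding_dim)
        # Learnable [1, max_seq_len, D] matrix, one row per absolute position.
        self.pos_embed = nn.Parameter(torch.zeros(1, max_seq_len, embedding_dim))
        nn.init.trunc_normal_(self.token_embed.weight, std=0.02)
        nn.init.trunc_normal_(self.pos_embed, std=0.02)  # small random start

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, T] token IDs
        B, T = x.shape
        max_len = self.pos_embed.shape[1]
        if T > max_len:
            # No trained vector exists for positions past max_seq_len.
            raise ValueError(f"sequence length {T} exceeds max_seq_len {max_len}")
        tok_emb = self.token_embed(x)        # [B, T, D]
        pos_emb = self.pos_embed[:, :T, :]   # [1, T, D], sliced to actual length
        return tok_emb + pos_emb             # broadcast over batch -> [B, T, D]

torch.manual_seed(0)
layer = TokenAndPositionEmbedding(vocab_size=35, max_seq_len=1024, embedding_dim=768)

input_ids = torch.randint(0, 35, (2, 10))  # batch of 2 sequences, 10 tokens each
out = layer(input_ids)
pos_params = layer.pos_embed.numel()

print(f"Output shape: {out.shape}")                # torch.Size([2, 10, 768])
print(f"Positional parameters: {pos_params}")      # 786432
```

The explicit length check makes the max_len limitation visible: a sequence longer than 1024 raises an error here instead of silently failing on the slice and add.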

