# 🚀 Token Processing in LLMs: Step-by-Step Demonstration
### 📌 Understand how text is transformed inside an LLM, from Token IDs to Self-Attention
This notebook will guide you through:
- ✅ Tokenization (Converting text into Token IDs)
- ✅ Embedding Layer (Mapping tokens to dense vectors)
- ✅ Positional Encoding (Adding order information)
- ✅ Self-Attention (Computing relationships between words)
- ✅ Multi-Head Attention & Feedforward Network


In [ ]:
# Install required libraries
!pip install transformers numpy torch matplotlib

In [ ]:
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModel
import math

## 🔹 Define Sample Input Text & Tokenization
We split the sentence into subword tokens, then map each token to its integer ID in the tokenizer's vocabulary.

In [ ]:
# Load a tokenizer (GPT-2 as an example)
tokenizer = AutoTokenizer.from_pretrained('gpt2')

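# Define sample text
text = "The cat sat on the mat"
# Split into subword tokens; GPT-2 marks a leading space with 'Ġ' (e.g. 'Ġcat')
tokens = tokenizer.tokenize(text)
# Look up each token's integer ID in the vocabulary
token_ids = tokenizer.convert_tokens_to_ids(tokens)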

print("🔹 Tokens:", tokens)
print("🔹 Token IDs:", token_ids)
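
To make the text → tokens → IDs mapping concrete, here is a minimal sketch that uses a hypothetical six-entry vocabulary and plain whitespace splitting. It is not how GPT-2 tokenizes (GPT-2 uses byte-pair encoding over a much larger vocabulary); `toy_vocab`, `encode`, and `decode` are illustrative names, not part of any library.

```python
# Hypothetical toy vocabulary: every distinct word gets an integer ID.
# "The" and "the" are different entries, as in a case-sensitive tokenizer.
toy_vocab = {"The": 0, "cat": 1, "sat": 2, "on": 3, "the": 4, "mat": 5}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

def encode(text):
    """Split on whitespace and look up each token's ID."""
    return [toy_vocab[tok] for tok in text.split()]

def decode(ids):
    """Map IDs back to tokens and rejoin them with spaces."""
    return " ".join(id_to_token[i] for i in ids)

toy_ids = encode("The cat sat on the mat")
print(toy_ids)          # [0, 1, 2, 3, 4, 5]
print(decode(toy_ids))  # The cat sat on the mat
```

The round trip shows that token IDs are only indices into a vocabulary; they carry no meaning until the embedding layer maps them to vectors.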

## 🔹 Embedding Layer: Converting Token IDs into Dense Vectors
Each token ID indexes one row of a learnable table of shape `(vocab_size, embedding_dim)`. The layer below is untrained, so the vectors it returns are random.

In [ ]:
# Define an embedding layer (embedding size 8 for demonstration; the 'gpt2' checkpoint itself uses 768)
embedding_layer = nn.Embedding(num_embeddings=tokenizer.vocab_size, embedding_dim=8)
embedded_tokens = embedding_layer(torch.tensor(token_ids))

print("🔹 Embedded Token Shape:", embedded_tokens.shape)
print("🔹 First Token Embedding:", embedded_tokens[0].detach().numpy())
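
The lookup performed by an embedding layer can be sketched in plain Python as row indexing into a table. The sizes and the names `toy_table` and `embed` below are assumptions for illustration; a real layer stores the table as a trainable weight tensor.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

toy_vocab_size, toy_embedding_dim = 6, 4  # hypothetical toy sizes

# The embedding "layer" is a table with one row of floats per token ID.
toy_table = [
    [random.gauss(0.0, 1.0) for _ in range(toy_embedding_dim)]
    for _ in range(toy_vocab_size)
]

def embed(token_ids):
    """Return the table row for each token ID."""
    return [toy_table[i] for i in token_ids]

toy_vectors = embed([0, 1, 1])
print(len(toy_vectors), len(toy_vectors[0]))  # 3 4
print(toy_vectors[1] == toy_vectors[2])       # True: same ID, same vector
```

Because the lookup depends only on the ID, a repeated token always gets the same vector regardless of where it appears, which is why position information has to be supplied separately.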

## 🔹 Positional Encoding: Adding Order Information to Tokens
Self-attention processes all tokens in parallel and has no built-in notion of order, so a position-dependent vector is added to each token embedding to encode where the token sits in the sequence.

In [ ]:
def positional_encoding(seq_length, d_model):
    """Sinusoidal positional encoding; assumes d_model is even."""
    pe = np.zeros((seq_length, d_model))
    for pos in range(seq_length):
        for i in range(0, d_model, 2):  # i is the even dimension index
            angle = pos / (10000 ** (i / d_model))
            pe[pos, i] = math.sin(angle)      # even dimensions use sine
            pe[pos, i + 1] = math.cos(angle)  # odd dimensions use cosine
    return torch.tensor(pe, dtype=torch.float32)

# Apply positional encoding
pos_encoding = positional_encoding(len(token_ids), 8)
print("🔹 Positional Encoding Shape:", pos_encoding.shape)
print("🔹 First Positional Encoding Vector:", pos_encoding[0].numpy())
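
For reference, the loop in `positional_encoding` corresponds to the standard sinusoidal formulation, written here with $k$ as the pair index so that the loop variable `i` equals $2k$:

```latex
PE_{(pos,\,2k)}   = \sin\!\left(\frac{pos}{10000^{\,2k/d_{\text{model}}}}\right),
\qquad
PE_{(pos,\,2k+1)} = \cos\!\left(\frac{pos}{10000^{\,2k/d_{\text{model}}}}\right)
```

At $pos = 0$ every sine term is 0 and every cosine term is 1, so the first positional vector printed above alternates 0, 1, 0, 1, and so on.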

## 🔹 Self-Attention Mechanism: Computing Word Relationships
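Each token is projected into a Query, a Key, and a Value vector. The scaled dot product of one token's Query with every token's Key, passed through softmax, gives the weight that token places on each token in the sequence.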

In [ ]:
d_model = 8  # Embedding dimension

# Define Q, K, V transformation matrices
W_Q = nn.Linear(d_model, d_model)
W_K = nn.Linear(d_model, d_model)
W_V = nn.Linear(d_model, d_model)

# Add positional information to the token embeddings, then compute Q, K, V
x = embedded_tokens + pos_encoding
Q = W_Q(x)
K = W_K(x)
V = W_V(x)

# Compute attention scores (scaled dot product); no causal mask is applied, so every token attends to every token
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_model)
attention_weights = torch.nn.functional.softmax(scores, dim=-1)

print("🔹 Attention Scores Shape:", scores.shape)
print("🔹 Attention Weights:", attention_weights.detach().numpy())
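
The scaling and softmax steps can be followed by hand on a tiny example. The sketch below uses plain Python lists and hypothetical 2-token inputs (`toy_Q`, `toy_K`) rather than the learned projections above.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical queries and keys for 2 tokens, each of dimension 2.
toy_Q = [[1.0, 0.0], [0.0, 1.0]]
toy_K = [[1.0, 0.0], [1.0, 1.0]]
toy_d_k = 2

# toy_scores[i][j] = (toy_Q[i] . toy_K[j]) / sqrt(toy_d_k)
toy_scores = [
    [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(toy_d_k) for k_row in toy_K]
    for q_row in toy_Q
]

# Softmax turns each row of scores into weights that sum to 1.
toy_attn = [softmax(row) for row in toy_scores]

for row in toy_attn:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```

Token 0's query matches both keys equally, so its weights split evenly; token 1's query matches only the second key, so that key receives the larger weight.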

## 🔹 Apply Attention to Value V
Each token's output is the sum of all Value vectors, weighted by that token's attention weights, so the output has the same shape as V.

In [ ]:
output = torch.matmul(attention_weights, V)
print("🔹 Attention Output Shape:", output.shape)
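
A worked example of the weighted sum, using hand-picked numbers so the result can be checked mentally. `apply_attention`, `toy_weights`, and `toy_V` are hypothetical names for this sketch only.

```python
def apply_attention(weights, values):
    """Return weights @ values for matrices stored as nested lists."""
    dim = len(values[0])
    return [
        [sum(w * v_row[d] for w, v_row in zip(w_row, values)) for d in range(dim)]
        for w_row in weights
    ]

toy_V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
toy_weights = [
    [1.0, 0.0, 0.0],    # token 0 attends only to token 0
    [0.0, 0.5, 0.5],    # token 1 splits attention between tokens 1 and 2
    [0.25, 0.25, 0.5],  # token 2 attends to all three tokens
]

toy_output = apply_attention(toy_weights, toy_V)
print(toy_output)  # [[1.0, 0.0], [1.0, 1.5], [1.25, 1.25]]
```

A row with all its weight on one token simply copies that token's Value vector; any other row produces a blend.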

## 🔹 Multi-Head Attention Simulation
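Real multi-head attention runs several heads in parallel, each with its own learned Q, K, V projections, and concatenates their outputs. The cell below only reproduces the shape of that result: it concatenates the single head's output with itself, so both "heads" are identical.

In [ ]:
num_heads = 2  # Simulating two attention heads
# Placeholder: reuse the same head output num_heads times (a real model computes a distinct output per head)
multi_head_output = torch.cat([output] * num_heads, dim=-1)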
print("🔹 Multi-Head Output Shape:", multi_head_output.shape)
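
For comparison, here is a minimal sketch of heads that actually differ, written in plain Python so each step is visible. It assumes the common convention that each head works in `d_model // num_heads` dimensions, so the concatenated result has width `d_model` (unlike the cell above, which doubles the width). The random matrices stand in for learned projections, the final output projection is omitted, and all `toy_` names are hypothetical.

```python
import math
import random

random.seed(0)  # reproducible toy weights

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols):
    """Stand-in for a learned projection matrix (nothing is trained here)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def attention_head(x, w_q, w_k, w_v):
    """One scaled dot-product attention head with its own projections."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)

toy_seq_len, toy_d_model, toy_num_heads = 3, 4, 2
toy_d_head = toy_d_model // toy_num_heads  # each head uses a smaller subspace

toy_x = random_matrix(toy_seq_len, toy_d_model)  # stand-in for embedded tokens

head_outputs = [
    attention_head(
        toy_x,
        random_matrix(toy_d_model, toy_d_head),
        random_matrix(toy_d_model, toy_d_head),
        random_matrix(toy_d_model, toy_d_head),
    )
    for _ in range(toy_num_heads)
]

# Concatenate the heads along the feature axis: back to toy_d_model columns.
toy_multi_head = [
    sum((head[t] for head in head_outputs), []) for t in range(toy_seq_len)
]
print(len(toy_multi_head), len(toy_multi_head[0]))  # 3 4
```

Because every head has its own projections, the heads produce different outputs for the same input, which is what lets each one capture a different relationship.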

## 🔹 Feedforward Layer: Refining Representations
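A position-wise feedforward network transforms each token's vector independently before the result moves on to the next Transformer block. Real Transformer blocks also wrap attention and feedforward layers in residual connections and layer normalization, which this demonstration omits.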

In [ ]:
feedforward = nn.Sequential(
    nn.Linear(d_model * 2, 32),  # input width matches the two concatenated heads
    nn.ReLU(),
    nn.Linear(32, d_model * 2)
)

final_representation = feedforward(multi_head_output)
print("🔹 Final Representation Shape:", final_representation.shape)
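
The sketch below illustrates what "position-wise" means: the same small network is applied to each token vector separately, so changing one token cannot affect another token's output. The layer sizes, random weights, and names such as `toy_ffn` are assumptions for illustration.

```python
import random

random.seed(0)  # reproducible toy weights

def linear(vec, weight, bias):
    """Compute weight @ vec + bias for a single token vector."""
    return [sum(w * x for w, x in zip(w_row, vec)) + b for w_row, b in zip(weight, bias)]

def relu(vec):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in vec]

def rand_layer(d_in, d_out):
    """Random weights and zero biases standing in for learned parameters."""
    weight = [[random.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out

def toy_ffn(vec, params):
    """Two-layer MLP on one token vector: Linear -> ReLU -> Linear."""
    (w1, b1), (w2, b2) = params
    return linear(relu(linear(vec, w1, b1)), w2, b2)

toy_params = (rand_layer(4, 8), rand_layer(8, 4))  # expand to 8, project back to 4

toy_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
before = [toy_ffn(tok, toy_params) for tok in toy_tokens]

# Change only the second token; the first token's output must not move.
toy_tokens[1] = [5.0, 5.0, 5.0, 5.0]
after = [toy_ffn(tok, toy_params) for tok in toy_tokens]

print(before[0] == after[0])  # True
```

Mixing information between tokens is the job of attention; the feedforward step only reshapes each token's own vector.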

## 🎯 Summary: How Text is Processed in an LLM
- ✅ Tokenization → Converts text into token IDs
- ✅ Embedding Layer → Maps tokens into dense numerical vectors
- ✅ Positional Encoding → Adds order information using sine & cosine waves
- ✅ Self-Attention → Weights every token against every other token to build context-aware vectors
- ✅ Multi-Head Attention → Runs several attention heads in parallel so each can capture a different kind of relationship
- ✅ Feedforward Network → Transforms each token's contextual vector independently

🚀 **This is the foundation of how Transformers process text!**