### 🧠 Decoder‑Only Transformer Architecture

![Decoder‑Only Transformer Diagram](https://waylandzhang.github.io/en/images/decoder-only-transformer.jpg)

**Figure:** A GPT‑style stack of decoder blocks:
- **Input tokens** → Embedding + positional encoding
- **Repeated blocks**: masked self-attention + feed‑forward + layer norms + residuals
- **Final linear & softmax** → next-token logits, then next-token probabilities

# Load the Dataset first

In [1]:
import requests

url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = requests.get(url).text

print("Dataset length:", len(text))
print("First 500 characters:\n", text[:500])


Dataset length: 1115394
First 500 characters:
 First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


# Build Vocab
# 🧠 What is Tokenization?
# Tokenization is the process of converting text into units (tokens) that a neural network can understand — and then mapping those tokens to numbers.

In [2]:
chars = sorted(list(set(text)))
print(chars)

['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [3]:
vocab_size = len(chars)
print(vocab_size)

65


In [4]:
# Tokenizer dictionaries
stoi = {ch: i for i, ch in enumerate(chars)}
dict(list(stoi.items())[:5])



{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4}

In [5]:
itos = {i: ch for ch, i in stoi.items()}  
dict(list(itos.items())[:5])


{0: '\n', 1: ' ', 2: '!', 3: '$', 4: '&'}
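The two dictionaries are usually wrapped in a pair of helper functions. Here is a minimal sketch, assuming a small stand-in string `sample_text` instead of the full dataset. The names `encode` and `decode` match the helpers defined in the full model cell later in this notebook.

```python
# Stand-in corpus (assumption): any string works; the notebook uses TinyShakespeare.
sample_text = "help me, help you"

# Build the character vocabulary the same way as above.
chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    """Map a string to a list of integer token IDs."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of token IDs back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("help")
print(ids)
print(decode(ids))  # round-trips back to the original string
```

Because the IDs depend on the sorted character set, the same string gets different IDs under a different corpus.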

#  ✅ In PyTorch, the first real step before model training is:
# 🔁 Convert all input data into tensors

In [6]:
# Step 1.2 — Encode entire dataset
import torch

# Convert the full text to a list of token IDs
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

print("Tokenized dataset shape:", data.shape)
print("First 20 tokens:", data[:20])


Tokenized dataset shape: torch.Size([1115394])
First 20 tokens: tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56])


# Split the data into train and val set

In [7]:
# Step 1.3 — Split data into train and val
split_idx = int(0.9 * len(data))  # 90% train, 10% val

train_data = data[:split_idx]
val_data = data[split_idx:]

print("Train size:", len(train_data))
print("Val size:", len(val_data))


Train size: 1003854
Val size: 111540


# Create Training Batches (x, y pairs)
# 🧱 What is batch_size?
# batch_size is the number of (x, y) training examples processed in one forward/backward pass of the model.
# ⛓ Why batch?
# Matrix operations (on GPU) are fastest when done in batches

# You get more stable gradients

# You reduce variance compared to updating on just one example

In [8]:
def get_batch(data, block_size=8, batch_size=4):
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

In [9]:
x, y = get_batch(train_data, block_size=4, batch_size=2)

print("x:\n", x)
print("y:\n", y)


x:
 tensor([[ 1, 21,  1, 46],
        [ 1, 61, 43,  1]])
y:
 tensor([[21,  1, 46, 43],
        [61, 43,  1, 46]])
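Each row of `x` packs several training examples at once: for every position `t`, the context `x[:t+1]` should predict `y[t]`. A small sketch with a hypothetical toy sequence of IDs (plain Python lists, so it runs without the dataset):

```python
# Toy token IDs (assumption): a stand-in for a slice of train_data.
tokens = [18, 47, 56, 57, 58]
block_size = 4

x = tokens[:block_size]         # inputs
y = tokens[1:block_size + 1]    # targets = inputs shifted left by one

# One (context, target) pair per position in the block.
pairs = [(x[:t + 1], y[t]) for t in range(block_size)]
for context, target in pairs:
    print(f"context {context} -> target {target}")
```

So a batch of shape `(batch_size, block_size)` contributes `batch_size * block_size` next-token predictions to the loss.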


# Embedding

# 🔹 What exactly is nn.Embedding in PyTorch?
# 🧠 High-level idea
# nn.Embedding is just a lookup table:
# It stores one vector (a list of numbers) for each token ID
# 🧱 Summary:

| Concept | Explanation |
| --- | --- |
| What it is | A learnable matrix of shape `(vocab_size, emb_dim)` |
| What it does | Looks up vectors for input token IDs |
| Is it a neural net layer? | ✅ Yes (has weights, supports backprop) |
| Why we use it | To represent tokens as meaningful vectors |

# 🧠 So when does it become trainable?
# Here’s the secret:

# This embedding matrix is a parameter (like any layer's weights)

# During training, we calculate loss, then do .backward()

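# PyTorch computes gradients w.r.t. the embedding matrix; only the rows that were looked up (e.g., tokens 12, 5, 8) receive non-zero gradients

# With plain SGD, only those rows change in .step(); optimizers with weight decay or momentum (such as AdamW) can also adjust other rows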
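The claim about which rows receive gradients can be checked directly. This is a minimal sketch with toy sizes and an arbitrary scalar loss, both chosen only for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)  # toy sizes (assumption)
x = torch.tensor([[2, 5, 5]])                           # only tokens 2 and 5 are used

# Any scalar loss will do for the demonstration.
loss = emb(x).sum()
loss.backward()

# Rows with a non-zero gradient correspond to the token IDs that appeared in x.
rows_with_grad = emb.weight.grad.abs().sum(dim=1).nonzero().flatten().tolist()
print(rows_with_grad)  # [2, 5]
```

Token 5 appears twice, so its row accumulates twice the gradient of token 2.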

In [10]:
import torch
import torch.nn as nn

# Vocabulary size — number of unique tokens
vocab_size = 65

# Embedding dimension — how big each token's vector should be
embedding_dim = 32

# Step 1: Create the embedding layer
token_embedding_table = nn.Embedding(vocab_size, embedding_dim)

# Step 2: Create a batch of token IDs
x = torch.tensor([[12, 5, 8, 8]])  # shape = (batch_size=1, block_size=4)

# Step 3: Apply the embedding layer
x_embed = token_embedding_table(x)

# Step 4: Print shapes and values
print("Input token IDs shape:", x.shape)
print("Embedded vector shape:", x_embed.shape)
print("Embedded vectors:\n", x_embed)


Input token IDs shape: torch.Size([1, 4])
Embedded vector shape: torch.Size([1, 4, 32])
Embedded vectors:
 tensor([[[-9.1476e-01,  4.2747e-01,  1.6521e+00, -1.3608e+00, -9.3403e-01,
          -2.3145e+00, -8.7017e-01,  2.5174e-01,  1.1242e+00, -1.2016e+00,
          -2.4265e-01,  2.0237e+00,  9.0045e-01,  1.5163e+00,  1.5448e+00,
           3.3130e-01,  4.2435e-02, -8.5349e-01, -2.4111e-01,  1.0200e+00,
           1.0720e+00,  2.3498e+00, -2.0369e-01,  2.0764e-01,  6.8549e-01,
           1.9217e-01,  1.6880e+00, -6.8372e-01, -6.5884e-01, -5.8992e-01,
          -8.8590e-01, -7.5728e-01],
         [ 6.0499e-01,  8.8128e-01, -1.5651e+00, -1.7934e+00,  7.4650e-01,
           2.6304e-01,  3.7010e-01, -7.4764e-01,  4.2487e-02,  1.7857e+00,
           6.0633e-01, -2.8965e-01, -2.4978e-01, -1.7072e-01, -6.1268e-01,
          -1.0081e+00, -7.4238e-02, -2.0559e+00, -3.4736e-01, -8.6163e-01,
           9.9268e-01,  4.3336e-01,  8.6749e-01, -2.8927e-01, -1.7951e-01,
          -1.3776e+00, -4.8565e

# Add Positional Encoding
# 🧠 Why?
# Self-attention on its own has no sense of order: it treats the tokens as an unordered set of vectors.

# But language is ordered:

# "The cat sat" ≠ "Sat the cat"

# So we must inject position information into each token’s embedding.
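To see the order-blindness concretely, here is a sketch that assumes the simplest possible attention: no learned projections, no causal mask, and `Q = K = V = x`. The helper name `simple_attention` is hypothetical.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def simple_attention(x):
    """Unmasked self-attention with Q = K = V = x (no learned projections)."""
    scores = x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ x

x = torch.randn(4, 8)              # 4 tokens, 8 features, no position info
perm = torch.tensor([2, 0, 3, 1])  # shuffle the token order

out = simple_attention(x)
out_shuffled = simple_attention(x[perm])

# Shuffling the inputs only shuffles the outputs in the same way.
print(torch.allclose(out_shuffled, out[perm], atol=1e-5))
```

Without position information, each token gets the same output vector wherever it sits in the sequence, which is why a positional signal is added to the embeddings.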

In [11]:
block_size = 4
embedding_dim = 8
position_embedding_table = nn.Embedding(block_size, embedding_dim)
print("Shape of table:", position_embedding_table.weight.shape)
print("Positional Embedding Table:\n", position_embedding_table.weight)
pos_vector = position_embedding_table(torch.tensor([2]))
print("Vector for position 2:\n", pos_vector)
position_ids = torch.arange(block_size)
pos_vectors = position_embedding_table(position_ids)
print("Position vectors:\n", pos_vectors)


Shape of table: torch.Size([4, 8])
Positional Embedding Table:
 Parameter containing:
tensor([[ 0.1651,  0.5164, -0.5502, -2.0563, -1.3057,  0.7361, -0.8709, -0.7991],
        [ 0.5314, -0.9714, -0.9523, -0.2689, -0.1302,  0.2233,  0.9946,  0.9599],
        [ 0.0183, -0.2316, -0.4117, -0.0224,  0.4507,  2.8825,  0.1056,  0.1770],
        [ 0.0062, -0.9122,  0.8633, -0.1300,  0.1260, -2.3718, -0.6157,  0.7260]],
       requires_grad=True)
Vector for position 2:
 tensor([[ 0.0183, -0.2316, -0.4117, -0.0224,  0.4507,  2.8825,  0.1056,  0.1770]],
       grad_fn=<EmbeddingBackward0>)
Position vectors:
 tensor([[ 0.1651,  0.5164, -0.5502, -2.0563, -1.3057,  0.7361, -0.8709, -0.7991],
        [ 0.5314, -0.9714, -0.9523, -0.2689, -0.1302,  0.2233,  0.9946,  0.9599],
        [ 0.0183, -0.2316, -0.4117, -0.0224,  0.4507,  2.8825,  0.1056,  0.1770],
        [ 0.0062, -0.9122,  0.8633, -0.1300,  0.1260, -2.3718, -0.6157,  0.7260]],
       grad_fn=<EmbeddingBackward0>)


In [12]:
import torch
import torch.nn as nn

# 1. Hyperparams
batch_size = 2
block_size = 4
embedding_dim = 8
vocab_size = 65

# 2. Embedding tables
token_embedding_table = nn.Embedding(vocab_size, embedding_dim)
position_embedding_table = nn.Embedding(block_size, embedding_dim)

# 3. Input tokens (2 sequences, 4 characters each)
x = torch.tensor([
    [12, 5, 8, 8],
    [9, 1, 17, 33]
])  # shape: (2, 4)

# 4. Token embeddings (look up each token ID)
token_emb = token_embedding_table(x)  # shape: (2, 4, 8)
print(token_emb)
# 5. Positional embeddings
position_ids = torch.arange(block_size)  # [0, 1, 2, 3]
pos_emb = position_embedding_table(position_ids)  # shape: (4, 8)
pos_emb = pos_emb.unsqueeze(0)  # reshape to (1, 4, 8) to match batch

# 6. Add them
x_embed = token_emb + pos_emb  # shape: (2, 4, 8)

print("Token embeddings:\n", token_emb)
print("Positional embeddings:\n", pos_emb)
print("Final input to transformer (x_embed):\n", x_embed)


tensor([[[-0.3752,  0.5033,  0.1096, -0.0824,  0.6980,  0.6636, -0.5450,
           0.1132],
         [-0.0930, -1.5122, -2.2702,  1.1632, -0.0315, -0.6675, -0.1314,
           0.8455],
         [ 2.1030, -2.1707,  1.8961, -0.2121,  0.4001, -0.0505,  1.2624,
           0.4154],
         [ 2.1030, -2.1707,  1.8961, -0.2121,  0.4001, -0.0505,  1.2624,
           0.4154]],

        [[ 0.3501, -1.7663,  1.0883,  0.5273,  1.0713,  1.7445,  0.6715,
          -0.6527],
         [-0.4822,  0.6016,  0.8966,  0.1061, -0.2813,  0.2576, -0.2173,
           0.3970],
         [-0.2071,  0.3486, -0.4923,  1.2341, -0.6933, -2.1536, -0.2513,
           0.8460],
         [-0.8765, -1.7758,  2.3262,  0.2258,  0.4728, -0.2361,  0.2439,
          -0.4916]]], grad_fn=<EmbeddingBackward0>)
Token embeddings:
 tensor([[[-0.3752,  0.5033,  0.1096, -0.0824,  0.6980,  0.6636, -0.5450,
           0.1132],
         [-0.0930, -1.5122, -2.2702,  1.1632, -0.0315, -0.6675, -0.1314,
           0.8455],
         [ 2.1030

# Both are lookup tables
# Token embedding table → vector for the type of token

#  Positional embedding table → vector for the position of token

#  Both are trainable

### 🔁 Input Embedding Pipeline (with Example: "help")

#### Input: "help"

1. **Tokenization**  
   - Character-level: ['h', 'e', 'l', 'p']  
   - Token IDs (illustrative; with this notebook's vocab they would be [46, 43, 50, 54]): [12, 5, 8, 9]  
   - Shape: `(batch_size=1, block_size=4)`

2. **Token Embedding**  
   - Use `nn.Embedding(vocab_size, emb_dim)`  
   - Maps each token ID to a learnable vector  
   - Output shape: `(1, 4, emb_dim)`  
   - Represents: *what* each token is

3. **Positional Embedding**  
   - Use `nn.Embedding(block_size, emb_dim)`  
   - Creates a learnable vector for each position: [0, 1, 2, 3]  
   - Output shape: `(1, 4, emb_dim)` (after unsqueeze)  
   - Represents: *where* each token is in the sequence

4. **Add Both Embeddings**  
   - Final input: `token_emb + pos_emb`  
   - Shape: `(1, 4, emb_dim)`  
   - Each token now encodes both meaning and position

✅ This combined embedding is passed to the first Transformer block.


### 🔁 Project Embeddings into Query, Key, and Value (Q, K, V)

To prepare for self-attention, each token's embedding is projected into 3 different vectors:

- **Query (Q)** → What the token is looking for
- **Key (K)**   → What the token offers to others
- **Value (V)** → What the token contains to share

These are created by applying 3 independent `nn.Linear` layers to the same input embedding:

- Input: `x_embed` → shape `(batch_size, block_size, embedding_dim)`
- Output:
  - Q: `(batch_size, block_size, embedding_dim)`
  - K: `(batch_size, block_size, embedding_dim)`
  - V: `(batch_size, block_size, embedding_dim)`

Each token is now ready to compare itself (via Q) to others (via K), and share information (via V).


In [13]:
# --- Simulated inputs ---
batch_size = 2
block_size = 4
embedding_dim = 8

# Simulated token+pos embeddings (e.g. from nn.Embedding)
x_embed = torch.randn(batch_size, block_size, embedding_dim)

# --- Linear layers to create Q, K, V ---
to_q = nn.Linear(embedding_dim, embedding_dim)
to_k = nn.Linear(embedding_dim, embedding_dim)
to_v = nn.Linear(embedding_dim, embedding_dim)

# --- Project to Q, K, V ---
q = to_q(x_embed)
k = to_k(x_embed)
v = to_v(x_embed)

# --- Check shapes ---
print("x_embed shape:", x_embed.shape)  # (2, 4, 8)
print("Q shape:", q.shape)              # (2, 4, 8)
print("K shape:", k.shape)              # (2, 4, 8)
print("V shape:", v.shape)              # (2, 4, 8)


x_embed shape: torch.Size([2, 4, 8])
Q shape: torch.Size([2, 4, 8])
K shape: torch.Size([2, 4, 8])
V shape: torch.Size([2, 4, 8])


# 🔁 Transformers (in the Q/K/V step)
# We take one input (x_embed) and pass it through three completely separate linear layers:

**1.Q = Linear_Q(x_embed)**

**2.K = Linear_K(x_embed)**

**3.V = Linear_V(x_embed)**

# These 3 layers are not connected to each other

# They do not pass values between themselves

# They are just three different views of the same input

### 🔍 Q/K/V Linear Layers vs Classic ANN Layers

In classic neural networks (ANNs or CNNs), layers are **stacked** — the output of one layer feeds into the next in a chain:  
`Input → Hidden → Output`.

But in Transformers, during the attention step, we apply **three separate linear projections** to the same input embedding:

- `Query = Linear_Q(x_embed)`
- `Key   = Linear_K(x_embed)`
- `Value = Linear_V(x_embed)`

These are:
- **Not connected** to each other (no flow between them)
- **Independent** layers that each produce a different role/view of the same token
- Essential for enabling attention:  
  → Q compares against K to decide "who to look at",  
  → V is the content actually shared.

This **branching structure** is a major architectural difference from traditional neural nets.


### 🎯 Step: Compute Attention Scores (Q · Kᵀ)

Now that we have Query (Q) and Key (K) vectors for each token, we compute how much attention each token should pay to every other token in the sequence.

- **Query**: what this token is looking for
- **Key**: what other tokens offer
- **Attention Score** = dot product of Q and K

We calculate:
```python
attention_scores = Q @ K.transpose(-2, -1)  # Q · Kᵀ
```

In [14]:
import torch
import torch.nn as nn

# Simulated Q, K (e.g., after applying nn.Linear to x_embed)
batch_size = 2
block_size = 4
embedding_dim = 8

# Random Q and K for demonstration
q = torch.randn(batch_size, block_size, embedding_dim)
k = torch.randn(batch_size, block_size, embedding_dim)

# Transpose K to prepare for dot product
# Shape becomes (batch_size, embedding_dim, block_size)
k_transposed = k.transpose(-2, -1)

# Compute attention scores (Q @ Kᵀ)
attention_scores = torch.matmul(q, k_transposed)

print("Attention scores shape:", attention_scores.shape)  # (2, 4, 4)


Attention scores shape: torch.Size([2, 4, 4])


### 🧠 Step: Scale and Softmax the Attention Scores

After computing the raw attention scores using `Q @ Kᵀ`, we need to normalize them before using them.

#### ⚠️ Why?
- The dot products (`Q · Kᵀ`) grow in magnitude with the vector dimension `d_k`
- Large scores push softmax toward a near one-hot output, where gradients become very small and training slows down
- We fix this by dividing the scores by `sqrt(d_k)`
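The effect of the scaling can be seen on random data. This sketch assumes queries and keys with independent standard-normal entries; the values of `d_k` and `n` are arbitrary choices for illustration:

```python
import torch

torch.manual_seed(0)

d_k = 64            # key/query dimension (assumed for illustration)
n = 10_000          # number of random query/key pairs

q = torch.randn(n, d_k)
k = torch.randn(n, d_k)

raw = (q * k).sum(dim=-1)      # unscaled dot products
scaled = raw / d_k ** 0.5      # divided by sqrt(d_k)

print("variance of raw scores:   ", raw.var().item())     # close to d_k
print("variance of scaled scores:", scaled.var().item())  # close to 1
```

Under these assumptions the raw scores have variance of about `d_k`, so dividing by `sqrt(d_k)` brings them back to roughly unit variance regardless of the dimension.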

---

### 📐 Formula: Scaled Dot-Product Attention (before applying to V)

![attention](https://media.licdn.com/dms/image/v2/D4D12AQGw6RIV4YgDOg/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1691329217886?e=2147483647&v=beta&t=LUYt7_jUybda90NoMuq1VUAFE8Gvhcdy91R2TKkHPSI)

Where:
- $Q$: query matrix
- $K$: key matrix
- $d_k$: dimension of the key/query vectors
- Softmax is applied along each token's row
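Written out as text, and assuming the image shows the standard scaled dot-product formula, the attention weights (called $A$ here, a symbol introduced only for this note) are:

$$
A = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)
$$

Each row of $A$ sums to 1. Multiplying by $V$ in the next step gives the attention output $A V$.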

---

### ✅ What we achieve:
- All attention scores become **probabilities** (sum to 1 for each token)
- Each token now has a meaningful **attention distribution** over others
- These weights will be used to combine the **value (V)** vectors in the next step


In [15]:
import torch
import torch.nn.functional as F
import math

# Assume we already have:
# attention_scores from Q @ Kᵀ
# and embedding_dim from earlier

# Example shape: (batch_size=2, block_size=4, block_size=4)
# Each token attends to every other token in its sequence

# Scale the attention scores
scaled_scores = attention_scores / math.sqrt(embedding_dim)

# Apply softmax to get attention weights (probabilities)
attention_weights = F.softmax(scaled_scores, dim=-1)

# Print shape and check row sums (should be ~1)
print("Attention weights shape:", attention_weights.shape)
print("Row sums for first batch item:\n", attention_weights[0].sum(dim=-1))


Attention weights shape: torch.Size([2, 4, 4])
Row sums for first batch item:
 tensor([1.0000, 1.0000, 1.0000, 1.0000])


In [16]:
# V from earlier: shape (batch_size, block_size, embedding_dim)
# attention_weights: shape (batch_size, block_size, block_size)

# Multiply attention weights with V to get final attention output
attention_output = torch.matmul(attention_weights, v)

print("Attention output shape:", attention_output.shape)


Attention output shape: torch.Size([2, 4, 8])
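The steps so far let every token attend to every other token, including later ones. A decoder-only model additionally hides future positions with a causal mask, as the full model later in this notebook does. A minimal sketch with toy sizes and random Q, K, V (all assumptions for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T, C = 4, 8                                  # toy sizes (assumption)
q, k, v = (torch.randn(1, T, C) for _ in range(3))

scores = q @ k.transpose(-2, -1) / C ** 0.5  # (1, T, T)

# True above the diagonal marks "future" positions that must be hidden.
future = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores = scores.masked_fill(future, float("-inf"))

weights = F.softmax(scores, dim=-1)          # future positions get weight 0
output = weights @ v                         # (1, T, C)

print(weights[0])
```

Setting masked scores to `-inf` before the softmax makes their weights exactly zero while each row still sums to 1, so token `t` only mixes values from positions `0..t`.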


# ✅ What we’ve built so far:
# A complete forward pass of self-attention (1 head)

# With deep understanding of:

# What Q/K/V really mean

# How tokens interact

# How info flows in attention

### ✅ Summary So Far: Self-Attention (Single Head)

1. **Tokenization**: Converted characters into integer token IDs.
2. **Embeddings**: Used `nn.Embedding` to create:
   - Token embeddings (meaning of token)
   - Position embeddings (position in sequence)
   - Final input: `x_embed = token_embed + pos_embed`

3. **Q, K, V Projection**:
   - 3 `nn.Linear` layers applied to `x_embed`
   - Each token gets: Query (Q), Key (K), Value (V)

4. **Attention Scores**:
   - Computed: `scores = Q @ Kᵀ`
   - Result: how much each token attends to others

5. **Softmax Attention Weights**:
   - Divided the scores by `sqrt(embedding_dim)`
   - Applied softmax → get attention weights (probabilities)

6. **Weighted Sum with V**:
   - Used weights to combine V: `output = attention_weights @ V`
   - Final result: context-aware embedding per token


### 🧠 Transformer Block: Neural Network Architecture (So Far)

We’ve implemented the self-attention mechanism as a neural network.

#### 🔷 Step-by-step Flow:

1. **Input Tokens**
   - Example: "help" → token IDs → `[12, 5, 8, 9]` (illustrative; with this notebook's vocab they would be `[46, 43, 50, 54]`)

2. **Embedding Layer**
   - `token_embedding = nn.Embedding(vocab_size, emb_dim)`
   - `position_embedding = nn.Embedding(block_size, emb_dim)`
   - Final embedding:  
     $$
     x_{\text{embed}} = \text{token\_embedding}(x) + \text{position\_embedding}(pos)
     $$

3. **Parallel Projections: Q, K, V**
   - 3 separate `nn.Linear` layers:
     ```python
     Q = Linear_Q(x_embed)
     K = Linear_K(x_embed)
     V = Linear_V(x_embed)
     ```
   - Each token is now represented as:
     - Query → what it wants to look for
     - Key   → what it offers
     - Value → what it contains

4. **Self-Attention Calculation (not a layer, just math)**
   - Compute scores: `scores = Q @ Kᵀ`
   - Scale: `scaled = scores / sqrt(embedding_dim)`
   - Normalize: `weights = softmax(scaled, dim=-1)`
   - Weighted sum: `attention_output = weights @ V`

#### ✅ Key Idea:
- Q, K, V are computed **in parallel** from `x_embed`
- Self-attention is computed **after Q, K, V are ready**
- Attention is a dynamic, learned way of combining token information

Each token's output is now a **context-aware vector** that "knows about" other tokens in the sequence.


### 🧠 Final Linear Projection after Attention

After computing the context-aware vectors from attention (`attention_weights @ V`), we apply a final `nn.Linear` layer:

$$
\text{output} = W_o \cdot (\text{attention\_output}) + b_o
$$

#### ✅ Why this step?
- The attention step itself uses no learnable parameters beyond Q/K/V
- This linear layer introduces **new trainable weights**
- It helps the model **transform** and **refine** the attended information
- Keeps output shape the same: `(batch_size, block_size, embedding_dim)`

This is the last part of the **self-attention block** (before things like residuals or MLP).


In [18]:
import torch
import torch.nn as nn

# Let's assume this is the attention output from earlier
# Shape: (batch_size, block_size, embedding_dim)
attention_output = torch.randn(2, 4, 8)  # Example shape

# Final linear projection layer (learnable)
out_proj = nn.Linear(8, 8)  # input and output dim = embedding_dim

# Apply the projection
output = out_proj(attention_output)

print("Output shape after final projection:", output.shape)  # (2, 4, 8)


Output shape after final projection: torch.Size([2, 4, 8])


### 🔁 Step: Residual Connection (Post-Attention)

After computing the attention output and passing it through a linear layer, we apply a **residual (skip) connection** by adding the original input back to the output.

#### 🔹 Why?
- Helps preserve the original input (`x_embed`)
- Prevents loss of information
- Makes training more stable and enables deeper models
- Allows gradients to flow more easily through the network

#### 🔹 Operation:
If `x_embed` is the original input and `attn_out` is the projected attention output:

```python
x = x_embed + attn_out
```

### 🧪 Layer Normalization Summary

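In this notebook, Layer Normalization (`nn.LayerNorm`) is applied after the residual connection, which is the post-norm arrangement of the original Transformer. GPT-2 and many later models instead apply the normalization before each sub-layer (pre-norm).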

#### 🔹 Purpose:
- Stabilizes training
- Normalizes activations across the **embedding dimension**
- Keeps gradients well-behaved and prevents exploding/vanishing values

#### 🔹 How it works:
For each token vector `x_i` of shape `(embedding_dim,)`, it normalizes as:

$$
\text{LayerNorm}(x_i) = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta
$$

Where:
- $\mu$ = mean of the features
- $\sigma^2$ = variance of the features
- $\epsilon$ = small constant for numerical stability (`eps`, default `1e-5` in `nn.LayerNorm`)
- $\gamma$ and $\beta$ are learnable scale and shift parameters

#### 🔹 In Code:
```python
layer_norm = nn.LayerNorm(embedding_dim)
x = layer_norm(x)
```


In [19]:
import torch
import torch.nn as nn

# --- Simulated input and attention output ---
batch_size = 2
block_size = 4
embedding_dim = 8

# Original input to attention block (from embeddings)
x_embed = torch.randn(batch_size, block_size, embedding_dim)

# Attention output after projection (from previous steps)
attn_out = torch.randn(batch_size, block_size, embedding_dim)

# --- Residual connection ---
x = x_embed + attn_out  # Add original input (residual connection)

# --- Layer Normalization ---
layer_norm = nn.LayerNorm(embedding_dim)
x = layer_norm(x)       # Normalize across embedding dimension

print("Final shape after residual + layer norm:", x.shape)


Final shape after residual + layer norm: torch.Size([2, 4, 8])
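The normalization can also be reproduced by hand to confirm what `nn.LayerNorm` computes. This sketch uses toy sizes and relies on a freshly created layer having `gamma = 1` and `beta = 0`. Note that `nn.LayerNorm` uses the biased variance and adds a small `eps` inside the square root.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(2, 4, 8)               # (batch, tokens, features), toy sizes
layer_norm = nn.LayerNorm(8)           # gamma = 1, beta = 0 at initialization

# Manual version: statistics are taken over the last (feature) dimension only.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)   # biased variance
manual = (x - mean) / torch.sqrt(var + layer_norm.eps)

print(torch.allclose(manual, layer_norm(x), atol=1e-5))
```

Each token vector is normalized independently, so the result does not depend on the batch size or on the other tokens in the sequence.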


### 🔧 Feedforward Block (MLP)

After attention, each token embedding is passed through a small neural network called the **Feedforward Block** or **MLP block**.

#### 🔹 Purpose:
- Applies a **non-linear transformation** to each token's vector
- Helps the model learn deeper representations
- Complements attention (which focuses on context) with local computation

#### 🔹 Architecture (Per Token):
1. Linear layer: expands the vector (typically 4× embedding size)
2. GELU activation: adds non-linearity
3. Linear layer: projects back to original embedding size

#### 🔹 Formula:
Let `x` be the token vector of shape `(embedding_dim)`:

```python
x = Linear_2(GELU(Linear_1(x)))
```


In [20]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# --- Setup ---
batch_size = 2
block_size = 4
embedding_dim = 8
hidden_dim = 4 * embedding_dim  # Typically 4x for transformers

# Simulated input from previous LayerNorm (shape: B, T, C)
x = torch.randn(batch_size, block_size, embedding_dim)

# --- Feedforward Block ---
ffn = nn.Sequential(
    nn.Linear(embedding_dim, hidden_dim),  # Expand
    nn.GELU(),                              # Non-linearity
    nn.Linear(hidden_dim, embedding_dim)    # Project back
)

# Apply feedforward block
ffn_output = ffn(x)

print("Feedforward output shape:", ffn_output.shape)  # (B, T, C)


Feedforward output shape: torch.Size([2, 4, 8])
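"Per token" means the feedforward block never mixes information between positions. A short sketch, reusing the same layer sizes as above, that compares the batched result with running a single token vector on its own:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embedding_dim = 8
ffn = nn.Sequential(
    nn.Linear(embedding_dim, 4 * embedding_dim),
    nn.GELU(),
    nn.Linear(4 * embedding_dim, embedding_dim),
)

x = torch.randn(2, 4, embedding_dim)   # (batch, tokens, features)

full = ffn(x)            # whole batch at once
single = ffn(x[0, 1])    # one token vector on its own

# The result for a token does not depend on the other tokens in the sequence.
print(torch.allclose(full[0, 1], single, atol=1e-5))
```

All mixing between tokens therefore happens in the attention step; the MLP only transforms each token's vector.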


### 🔁 Residual + LayerNorm (Post-Feedforward)

After the feedforward (MLP) block, we apply:

1. **Residual Connection**
   - Add the MLP output to its input:
   ```python
   x = x + mlp_out
   ```

* This helps preserve the original signal and stabilize gradients.

2. **Layer Normalization**

   * Normalizes the combined output:

   ```python
   x = LayerNorm(x)
   ```

   * Ensures consistent activation scales across tokens.

#### 🔹 Why this matters:

* Helps the network train deeper without losing information
* Maintains stable learning and smooth flow of gradients
* A residual connection plus layer normalization around each sub-layer is standard in transformer blocks

This completes the **full transformer block**:

* Attention → residual + norm
* MLP → residual + norm ✅



In [21]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# --- Setup ---
batch_size = 2
block_size = 4
embedding_dim = 8
hidden_dim = 4 * embedding_dim

# Input to feedforward (output of previous layernorm)
x = torch.randn(batch_size, block_size, embedding_dim)

# Feedforward block
ffn = nn.Sequential(
    nn.Linear(embedding_dim, hidden_dim),
    nn.GELU(),
    nn.Linear(hidden_dim, embedding_dim)
)

# Apply MLP
mlp_out = ffn(x)

# Residual connection
x = x + mlp_out

# LayerNorm
mlp_norm = nn.LayerNorm(embedding_dim)
x = mlp_norm(x)

print("Final output after MLP + residual + norm:", x.shape)


Final output after MLP + residual + norm: torch.Size([2, 4, 8])


## ✅ Transformer Decoder Block: Full Summary (So Far)

We’ve built a **full GPT-style transformer decoder block** step by step.

---

### 🔹 1. Load and Tokenize Dataset
- Loaded TinyShakespeare text (character-level)
- Tokenized text into integer IDs
- Created `(x, y)` input-output pairs for **next token prediction**

---

### 🔹 2. Embeddings
- Used `nn.Embedding` to create:
  - **Token Embeddings** → convert token IDs into dense vectors
  - **Positional Embeddings** → inject token position info
- Added them:
  ```python
  x_embed = token_embedding + position_embedding
  ```

---

### 🔹 3. Self-Attention Block

* Projected `x_embed` to Query, Key, Value:

  ```python
  Q = Linear_Q(x_embed)
  K = Linear_K(x_embed)
  V = Linear_V(x_embed)
  ```
* Computed attention scores:

  ```python
  weights = softmax(Q @ Kᵀ / sqrt(d_k))
  ```
* Combined values:

  ```python
  attn_output = weights @ V
  ```
* Applied a final linear projection

---

### 🔹 4. Residual + LayerNorm (After Attention)

* Skip connection:

  ```python
  x = x_embed + attn_output
  ```
* Normalize:

  ```python
  x = LayerNorm(x)
  ```

---

### 🔹 5. Feedforward (MLP Block)

* Applied to each token vector:

  ```python
  x = Linear_2(GELU(Linear_1(x)))
  ```

---

### 🔹 6. Residual + LayerNorm (After MLP)

* Added residual:

  ```python
  x = x + mlp_output
  ```
* Applied layer norm:

  ```python
  x = LayerNorm(x)
  ```

---

### ✅ At this point:

You’ve implemented:

* 🧱 The full Transformer **decoder block**
* 📏 Output shape: `(batch_size, block_size, embedding_dim)`
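The pieces summarized above can be collected into one module. This is a sketch rather than a reference implementation: the class name `DecoderBlock` is hypothetical, it keeps the single-head, post-norm layout used in this notebook, and it includes the causal mask that the full model later in the notebook applies.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderBlock(nn.Module):
    """Single-head, post-norm decoder block (hypothetical name, for illustration)."""

    def __init__(self, embedding_dim):
        super().__init__()
        self.to_q = nn.Linear(embedding_dim, embedding_dim)
        self.to_k = nn.Linear(embedding_dim, embedding_dim)
        self.to_v = nn.Linear(embedding_dim, embedding_dim)
        self.attn_proj = nn.Linear(embedding_dim, embedding_dim)
        self.attn_norm = nn.LayerNorm(embedding_dim)
        self.ff = nn.Sequential(
            nn.Linear(embedding_dim, 4 * embedding_dim),
            nn.GELU(),
            nn.Linear(4 * embedding_dim, embedding_dim),
        )
        self.ff_norm = nn.LayerNorm(embedding_dim)

    def forward(self, x):
        T = x.shape[1]
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)

        # Scaled dot-product scores with a causal mask (no peeking at later tokens).
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        future = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        weights = F.softmax(scores.masked_fill(future, float("-inf")), dim=-1)

        x = self.attn_norm(x + self.attn_proj(weights @ v))  # attention -> residual + norm
        x = self.ff_norm(x + self.ff(x))                     # MLP -> residual + norm
        return x


torch.manual_seed(0)
block = DecoderBlock(embedding_dim=8)
x = torch.randn(2, 4, 8)
print(block(x).shape)  # same shape as the input
```

Because the input and output shapes match, several such blocks can be applied one after another.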

---




### 🎯 Final Output Layer (Linear → Softmax)

Once we get the output from the transformer block, we map each token vector to vocab scores.

#### 🔹 1. Linear Projection
Each token vector has shape `(embedding_dim)`, but we want a score for **each token in the vocabulary**.

We use a final `Linear(embedding_dim, vocab_size)`:
```python
logits = final_linear(x)
```

* Shape: `(batch_size, block_size, vocab_size)`

---

#### 🔹 2. Softmax (inference only)

To convert logits to probabilities, for example when sampling the next token at inference time:

```python
probs = softmax(logits, dim=-1)
```

---

#### ❗ Training Note:

* During training, we use `nn.CrossEntropyLoss`
* It takes raw logits and **does not require a separate softmax**, because log-softmax is applied internally

```python
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits.view(-1, vocab_size), targets.view(-1))
```

---

### ✅ This completes the transformer pass!




In [22]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# --- Setup ---
batch_size = 2
block_size = 4
embedding_dim = 8
vocab_size = 65  # for TinyShakespeare (example)

# Simulated transformer output (after MLP + LayerNorm)
x = torch.randn(batch_size, block_size, embedding_dim)

# Final output projection layer
final_linear = nn.Linear(embedding_dim, vocab_size)

# Get logits (scores for each token in the vocab)
logits = final_linear(x)

print("Logits shape:", logits.shape)  # (B, T, vocab_size)

# Example: Compute loss with ground-truth target tokens
targets = torch.randint(0, vocab_size, (batch_size, block_size))
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits.view(-1, vocab_size), targets.view(-1))

print("Loss:", loss.item())


Logits shape: torch.Size([2, 4, 65])
Loss: 4.267731189727783
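The loss printed above is close to `ln(65) ≈ 4.17`, which is what a model that assigns equal probability to all 65 characters would score. The sketch below, with toy random logits, illustrates that and also checks that cross-entropy on raw logits equals log-softmax followed by negative log-likelihood:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size = 65
logits = torch.randn(8, vocab_size)              # 8 positions, raw scores (toy data)
targets = torch.randint(0, vocab_size, (8,))

# cross_entropy works on raw logits: it is log-softmax followed by NLL loss.
loss_direct = F.cross_entropy(logits, targets)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=-1), targets)
print(torch.allclose(loss_direct, loss_manual))

# A model that spreads probability uniformly over the vocabulary scores ln(vocab_size).
uniform_loss = F.cross_entropy(torch.zeros(8, vocab_size), targets)
print(uniform_loss.item(), math.log(vocab_size))
```

A training loss that starts near `ln(vocab_size)` and then falls is a quick sanity check that the model is learning more than a uniform guess.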


# Full forward pass single head attention

In [23]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import requests

# --- Load dataset ---
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = requests.get(url).text
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Mapping
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for ch,i in stoi.items() }
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)
train_data = data[:int(0.9*len(data))]
val_data = data[int(0.9*len(data)):]

# --- Batch preparation ---
def get_batch(split, batch_size, block_size):
    data_split = train_data if split == 'train' else val_data
    ix = torch.randint(len(data_split) - block_size, (batch_size,))
    x = torch.stack([data_split[i:i+block_size] for i in ix])
    y = torch.stack([data_split[i+1:i+block_size+1] for i in ix])
    return x, y

# --- GPT Model ---
class MiniGPTModel(nn.Module):
    def __init__(self, vocab_size, block_size, embedding_dim):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embedding_dim)
        self.pos_embed = nn.Embedding(block_size, embedding_dim)

        self.to_q = nn.Linear(embedding_dim, embedding_dim)
        self.to_k = nn.Linear(embedding_dim, embedding_dim)
        self.to_v = nn.Linear(embedding_dim, embedding_dim)
        self.attn_proj = nn.Linear(embedding_dim, embedding_dim)
        self.attn_layernorm = nn.LayerNorm(embedding_dim)

        self.ff = nn.Sequential(
            nn.Linear(embedding_dim, 4 * embedding_dim),
            nn.GELU(),
            nn.Linear(4 * embedding_dim, embedding_dim)
        )
        self.ff_layernorm = nn.LayerNorm(embedding_dim)

        self.lm_head = nn.Linear(embedding_dim, vocab_size)
        self.block_size = block_size

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embed(idx)
        pos_ids = torch.arange(T, device=idx.device)
        pos_emb = self.pos_embed(pos_ids)[None, :, :]
        x = tok_emb + pos_emb

        q = self.to_q(x)
        k = self.to_k(x)
        v = self.to_v(x)
        attn_scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
        attn_scores = attn_scores.masked_fill(
            torch.triu(torch.ones(T, T, device=idx.device), 1).bool(),
            float('-inf')
        )
        attn_weights = F.softmax(attn_scores, dim=-1)
        attn_output = attn_weights @ v
        attn_output = self.attn_proj(attn_output)

        x = self.attn_layernorm(x + attn_output)
        ff_out = self.ff(x)
        x = self.ff_layernorm(x + ff_out)
        logits = self.lm_head(x)

        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
            return logits, loss
        else:
            return logits, None

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.block_size:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_token], dim=1)
        return idx

# --- Hyperparams and Training ---
device = 'cuda' if torch.cuda.is_available() else 'cpu'
embedding_dim = 64
block_size = 64
batch_size = 32
learning_rate = 1e-3
max_iters = 500

model = MiniGPTModel(vocab_size, block_size, embedding_dim).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):
    xb, yb = get_batch('train', batch_size, block_size)
    xb, yb = xb.to(device), yb.to(device)

    logits, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 50 == 0:
        print(f"Step {step}: Loss = {loss.item():.4f}")

# --- Sample from the model ---
context = torch.tensor([[stoi["H"]]], dtype=torch.long).to(device)
generated = model.generate(context, max_new_tokens=200)[0].tolist()
print(decode(generated))


Step 0: Loss = 4.3123
Step 50: Loss = 2.9114
Step 100: Loss = 2.7385
Step 150: Loss = 2.6023
Step 200: Loss = 2.5554
Step 250: Loss = 2.4493
Step 300: Loss = 2.4366
Step 350: Loss = 2.4248
Step 400: Loss = 2.3470
Step 450: Loss = 2.2953
HABSANIULF:
MArdis st withy natrece!
QUBrowastory fou, me soall, wiand swee ar must: toss fand courd frit,
LAn's?


MamSiven thanvers tase indy's Fid:
And youre agenads Sweing hes poweyce thy out handa
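
A quick way to read the loss log above is to compare it with the loss of a model that guesses uniformly over the vocabulary. With the 65 characters found earlier, that baseline is ln 65 ≈ 4.17, so the step-0 loss of 4.3123 is close to what an untrained model should score. The sketch below only does this arithmetic; the `perplexity` helper is a name introduced here for illustration and is not part of the notebook.

```python
import math

vocab_size = 65  # number of distinct characters found in the dataset above

# Cross-entropy of a uniform guess over the vocabulary: -ln(1/V) = ln(V)
uniform_loss = math.log(vocab_size)
print(f"Uniform-guess loss: {uniform_loss:.4f}")

def perplexity(loss):
    """Per-character perplexity implied by a cross-entropy loss (illustrative helper)."""
    return math.exp(loss)

# Loss values copied from the training log above
for step, loss in [(0, 4.3123), (450, 2.2953)]:
    print(f"Step {step}: loss {loss:.4f} -> perplexity {perplexity(loss):.1f}")
```

Perplexity can be read as the effective number of characters the model is still choosing between at each position, which makes the drop over 450 steps easier to interpret than the raw loss.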


In [24]:
# --- Sample from the model ---
context = torch.tensor([[stoi["I"]]], dtype=torch.long).to(device)
generated = model.generate(context, max_new_tokens=200)[0].tolist()
print(decode(generated))

I:
LI this sto gof ary,
We farcheesomak, ges deseembrit thert thanelvenewveo,
Go,
DUind;
CHaserbyneld dy wicw galmars: pim nin eris oventle se?


JYy far arin h; no sfupe so'st sot.

Amis anesh
Prbewfo


# Multi-Head Attention
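
Before reading the full model, it may help to check the tensor shapes involved in splitting one fused projection into several heads. The sketch below mirrors the view / permute / chunk steps used in the `MultiHeadSelfAttention` class in the next cell, but on a small random tensor. The sizes (`B=2, T=5, C=16, H=4`) are arbitrary values chosen for illustration, not the notebook's hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C, H = 2, 5, 16, 4  # batch, time steps, channels, heads (illustrative sizes)
head_dim = C // H

qkv_proj = nn.Linear(C, 3 * C)
x = torch.randn(B, T, C)

# Fused projection -> (B, T, H, 3 * head_dim) -> (B, H, T, 3 * head_dim)
qkv = qkv_proj(x).view(B, T, H, 3 * head_dim).permute(0, 2, 1, 3)
# Split the last dimension into q, k, v, each of shape (B, H, T, head_dim)
q, k, v = torch.chunk(qkv, 3, dim=-1)
print("q shape:", tuple(q.shape))

# Causal mask: True above the diagonal marks future positions
mask = torch.triu(torch.ones(T, T), 1).bool()
scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
weights = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

# Largest attention weight placed on any future position (should be exactly zero)
future_weight = weights.masked_fill(~mask, 0.0).max().item()
# Each row of attention weights should still sum to one
row_sums = weights.sum(dim=-1)

# Merge the heads back into a single (B, T, C) tensor
out = (weights @ v).permute(0, 2, 1, 3).contiguous().view(B, T, C)
print("output shape:", tuple(out.shape))
```

Each head sees only `head_dim = C // H` channels, which is why the class asserts that `embedding_dim` is divisible by `num_heads`.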

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import requests

# Load dataset
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = requests.get(url).text
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
data = torch.tensor(encode(text), dtype=torch.long)
train_data = data[:int(0.9*len(data))]
val_data = data[int(0.9*len(data)):]

# Batching
def get_batch(split, batch_size, block_size):
    data_split = train_data if split == 'train' else val_data
    ix = torch.randint(len(data_split) - block_size, (batch_size,))
    x = torch.stack([data_split[i:i+block_size] for i in ix])
    y = torch.stack([data_split[i+1:i+block_size+1] for i in ix])
    return x, y

# Multi-Head Attention
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embedding_dim, num_heads, block_size):
        super().__init__()
        assert embedding_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embedding_dim // num_heads
        self.qkv_proj = nn.Linear(embedding_dim, 3 * embedding_dim)
        self.out_proj = nn.Linear(embedding_dim, embedding_dim)

    def forward(self, x):
        B, T, C = x.shape
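        # Fused projection viewed as (B, T, heads, 3 * head_dim), then moved to (B, heads, T, 3 * head_dim)
        qkv = self.qkv_proj(x).view(B, T, self.num_heads, 3 * self.head_dim)
        qkv = qkv.permute(0, 2, 1, 3)
        # Split the last dimension into q, k, v, each of shape (B, heads, T, head_dim)
        q, k, v = torch.chunk(qkv, 3, dim=-1)
        attn_scores = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        # Causal mask: True above the diagonal marks future positions to hide
        mask = torch.triu(torch.ones(T, T, device=x.device), 1).bool()
        attn_scores = attn_scores.masked_fill(mask, float('-inf'))
        attn_weights = F.softmax(attn_scores, dim=-1)
        attn_output = attn_weights @ v
        # Merge the heads back into (B, T, C)
        attn_output = attn_output.permute(0, 2, 1, 3).contiguous().view(B, T, C)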
        return self.out_proj(attn_output)

# Transformer Block
class TransformerBlock(nn.Module):
    def __init__(self, embedding_dim, num_heads, block_size):
        super().__init__()
        self.attn = MultiHeadSelfAttention(embedding_dim, num_heads, block_size)
        self.attn_ln = nn.LayerNorm(embedding_dim)
        self.ff = nn.Sequential(
            nn.Linear(embedding_dim, 4 * embedding_dim),
            nn.GELU(),
            nn.Linear(4 * embedding_dim, embedding_dim)
        )
        self.ff_ln = nn.LayerNorm(embedding_dim)

    def forward(self, x):
        x = self.attn_ln(x + self.attn(x))
        x = self.ff_ln(x + self.ff(x))
        return x

# GPT Model
class MiniGPTModel(nn.Module):
    def __init__(self, vocab_size, block_size, embedding_dim, num_heads, num_layers):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embedding_dim)
        self.pos_embed = nn.Embedding(block_size, embedding_dim)
        self.blocks = nn.Sequential(*[
            TransformerBlock(embedding_dim, num_heads, block_size)
            for _ in range(num_layers)
        ])
        self.ln_f = nn.LayerNorm(embedding_dim)
        self.lm_head = nn.Linear(embedding_dim, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embed(idx)
        pos_ids = torch.arange(T, device=idx.device)
        pos_emb = self.pos_embed(pos_ids)[None, :, :]
        x = tok_emb + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
            return logits, loss
        return logits, None

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
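            # Crop to the context window; read its size from the position table, not a global variable
            idx_cond = idx[:, -self.pos_embed.num_embeddings:]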
            logits, _ = self(idx_cond)
            probs = F.softmax(logits[:, -1, :], dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_token], dim=1)
        return idx

# Training Setup (Bigger & Deeper)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
embedding_dim = 512
block_size = 256
batch_size = 128
learning_rate = 3e-4
max_iters = 3000
num_heads = 8
num_layers = 8

model = MiniGPTModel(vocab_size, block_size, embedding_dim, num_heads, num_layers).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training Loop
for step in range(max_iters):
    xb, yb = get_batch('train', batch_size, block_size)
    xb, yb = xb.to(device), yb.to(device)
    logits, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"Step {step} | Loss: {loss.item():.4f}")

# Sampling Output
context = torch.tensor([[stoi["H"]]], dtype=torch.long).to(device)
generated = model.generate(context, max_new_tokens=300)[0].tolist()
print(decode(generated))


Step 0 | Loss: 4.3203
Step 100 | Loss: 2.4496
Step 200 | Loss: 2.2034
Step 300 | Loss: 1.9375
Step 400 | Loss: 1.7223
Step 500 | Loss: 1.5520
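
The log above reports the loss on a single training batch, which is noisy and says nothing about the held-out `val_data` split created earlier. One possible way to get a steadier number is to average the loss over several random batches from each split. The sketch below is an assumption about how that could be done, not part of the original notebook: `estimate_loss`, `TinyBigram`, and `toy_batch` are hypothetical names, and the toy model and random data exist only so the block runs by itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def estimate_loss(model, batch_fn, eval_iters=20):
    """Average the loss over several random batches per split (hypothetical helper)."""
    was_training = model.training
    model.eval()
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for i in range(eval_iters):
            xb, yb = batch_fn(split)
            _, loss = model(xb, yb)
            losses[i] = loss.item()
        out[split] = losses.mean().item()
    model.train(was_training)  # restore the previous train/eval mode
    return out

# Tiny stand-in model with the same (logits, loss) interface as MiniGPTModel
class TinyBigram(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.table(idx)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

# Random token ids standing in for the encoded text (illustration only)
torch.manual_seed(0)
toy_data = torch.randint(0, 65, (1000,))
toy_splits = {"train": toy_data[:900], "val": toy_data[900:]}

def toy_batch(split, batch_size=8, block_size=16):
    d = toy_splits[split]
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x, y

losses = estimate_loss(TinyBigram(65), toy_batch)
print(losses)
```

In the notebook itself, the same helper could be called with the real model and a small wrapper around `get_batch` that moves each batch to `device`. A validation loss that stops falling while the training loss keeps dropping would suggest overfitting.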


In [27]:
# Sampling
context = torch.tensor([[stoi["H"]]], dtype=torch.long).to(device)
generated = model.generate(context, max_new_tokens=300)[0].tolist()
print(decode(generated))

Hongen,
Mur wheresere goitles your kined notur'd!
Whe mirshene tuth duch no dexdees, you beake goon
A butings.

Past mongited ase:
Nowel win that sim crakst.

VENG YORETE:
Haste bealfe to geare, is hon lathend theieang
I modscam bett thinker meathts auls woulf
Mirve thes pa his we inde ream mofecord 
