### 🧠 Decoder‑Only Transformer Architecture

![Decoder‑Only Transformer Diagram](https://waylandzhang.github.io/en/images/decoder-only-transformer.jpg)

**Figure:** A GPT‑style stack of decoder blocks:
- **Input tokens** → Embedding + positional encoding
- **Repeated blocks**: masked self-attention + feed‑forward + layer norms + residuals
- **Final linear & softmax** → next-token logits, which the softmax turns into probabilities
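
The three bullets above can be sketched as a small PyTorch module. This is only an illustration under stated assumptions, not the exact model from the figure: the class names `Block` and `TinyGPT` are hypothetical, the layer norms are placed before each sub-layer (pre-norm), the feed-forward uses GELU with a 4x hidden size, attention comes from PyTorch's built-in `nn.MultiheadAttention`, and all hyperparameters are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """One decoder block: masked self-attention + feed-forward, each with a residual."""

    def __init__(self, emb_dim, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )

    def forward(self, x):
        T = x.size(1)
        # Boolean mask: True above the diagonal means "may not attend to this future position".
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out                 # residual around attention
        x = x + self.ff(self.ln2(x))     # residual around feed-forward
        return x


class TinyGPT(nn.Module):
    """Embedding + positional embedding -> repeated blocks -> final linear layer."""

    def __init__(self, vocab_size=65, block_size=8, emb_dim=32, n_heads=4, n_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(block_size, emb_dim)
        self.blocks = nn.Sequential(*[Block(emb_dim, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(emb_dim)
        self.lm_head = nn.Linear(emb_dim, vocab_size)

    def forward(self, idx):
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)   # (B, T, emb_dim)
        x = self.blocks(x)
        return self.lm_head(self.ln_f(x))           # logits: (B, T, vocab_size)


torch.manual_seed(0)
model = TinyGPT()
idx = torch.randint(0, 65, (2, 8))     # 2 sequences of 8 token IDs
logits = model(idx)
probs = F.softmax(logits, dim=-1)      # next-token probabilities at every position
print(logits.shape)
```

The rest of the notebook builds these pieces one at a time, starting with the data and the embeddings.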

# Load the Dataset first

In [1]:
import requests

url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = requests.get(url).text

print("Dataset length:", len(text))
print("First 500 characters:\n", text[:500])


Dataset length: 1115394
First 500 characters:
 First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


# Build Vocab
# 🧠 What is Tokenization?
# Tokenization is the process of converting text into units (tokens) that a neural network can understand — and then mapping those tokens to numbers.

In [2]:
chars = sorted(list(set(text)))
print(chars)

['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [3]:
vocab_size = len(chars)
print(vocab_size)

65


In [4]:
# Tokenizer dictionaries
stoi = {ch: i for i, ch in enumerate(chars)}
dict(list(stoi.items())[:5])



{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4}

In [5]:
itos = {i: ch for ch, i in stoi.items()}  
dict(list(itos.items())[:5])


{0: '\n', 1: ' ', 2: '!', 3: '$', 4: '&'}
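
With `stoi` and `itos` in place, encoding and decoding are one-liners. The sketch below rebuilds a tiny vocabulary from a stand-in string so it runs on its own; `encode` and `decode` are helper names introduced here for illustration and do not appear elsewhere in the notebook.

```python
# Stand-in corpus so the example is self-contained; the notebook uses the full `text`.
sample_text = "First Citizen:\nBefore we proceed any further, hear me speak."

chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s):
    """Map a string to a list of token IDs (raises KeyError for unseen characters)."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of token IDs back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode("hear me")
print(ids)
print(decode(ids))
```

A character-level tokenizer like this can always round-trip text made of characters it has seen, which is a quick sanity check before encoding the whole dataset.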

#  ✅ In PyTorch, the first real step before model training is:
# 🔁 Convert all input data into tensors

In [6]:
# Step 1.2 — Encode entire dataset
import torch

# Convert the full text to a list of token IDs
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

print("Tokenized dataset shape:", data.shape)
print("First 20 tokens:", data[:20])


Tokenized dataset shape: torch.Size([1115394])
First 20 tokens: tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56])


# Split the data into train and val set

In [7]:
# Step 1.3 — Split data into train and val
split_idx = int(0.9 * len(data))  # 90% train, 10% val

train_data = data[:split_idx]
val_data = data[split_idx:]

print("Train size:", len(train_data))
print("Val size:", len(val_data))


Train size: 1003854
Val size: 111540


# Create Training Batches (x, y pairs)
# 🧱 What is batch_size?
# batch_size is the number of (x, y) training examples processed in one forward/backward pass of the model.
# ⛓ Why batch?
# Matrix operations (on GPU) are fastest when done in batches

# You get more stable gradients

# You reduce variance compared to updating on just one example
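
The variance claim can be checked with a toy experiment. The sketch below does not use real gradients: it assumes each per-example "gradient" is a noisy scalar around a true value of 1.0 and measures how much the batch average fluctuates for different batch sizes. The helper `spread_of_batch_means` is a name made up for this illustration.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for per-example gradients: noisy scalars around a true value of 1.0.
per_example_grads = 1.0 + torch.randn(100_000)


def spread_of_batch_means(values, batch_size):
    """Standard deviation of the mean taken over non-overlapping batches."""
    n_batches = len(values) // batch_size
    batches = values[: n_batches * batch_size].view(n_batches, batch_size)
    return batches.mean(dim=1).std().item()


for bs in (1, 4, 64):
    spread = spread_of_batch_means(per_example_grads, bs)
    print(f"batch_size={bs:>2}  std of the averaged gradient: {spread:.3f}")
```

The spread shrinks roughly like 1/sqrt(batch_size), which is why larger batches give a steadier update direction.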

In [8]:
def get_batch(data, block_size=8, batch_size=4):
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

In [9]:
x, y = get_batch(train_data, block_size=4, batch_size=2)

print("x:\n", x)
print("y:\n", y)


x:
 tensor([[58, 11,  0, 20],
        [43,  6,  1, 61]])
y:
 tensor([[11,  0, 20, 43],
        [ 6,  1, 61, 47]])
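
Each row of `y` is the matching row of `x` shifted by one position, so a single row actually contains `block_size` training examples. The sketch below unrolls the first row of the batch printed above (values copied by hand) into its context/target pairs.

```python
import torch

# Values copied from the first row of the batch printed above.
x = torch.tensor([58, 11, 0, 20])
y = torch.tensor([11, 0, 20, 43])

# y is x shifted left by one position.
assert torch.equal(x[1:], y[:-1])

# Each position t gives one training example: context x[:t+1] -> target y[t].
examples = [(x[: t + 1].tolist(), y[t].item()) for t in range(len(x))]
for context, target in examples:
    print(f"context {context} -> target {target}")
```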


# Embedding

# 🔹 What exactly is nn.Embedding in PyTorch?
# 🧠 High-level idea
# nn.Embedding is just a lookup table:
# It stores one vector (a list of numbers) for each token ID
# 🧱 Summary:

| Concept | Explanation |
| --- | --- |
| What it is | A learnable matrix of shape `(vocab_size, emb_dim)` |
| What it does | Looks up vectors for input token IDs |
| Is it a neural net layer? | ✅ Yes (has weights, supports backprop) |
| Why we use it | To represent tokens as meaningful vectors |

# 🧠 So when does it become trainable?
# Here’s the secret:

# This embedding matrix is a parameter (like any layer's weights)

# During training, we calculate loss, then do .backward()

# PyTorch computes non-zero gradients only for the rows that were looked up (e.g., tokens 12, 5, 8)

# With plain SGD, only those rows change in .step(); weight decay or momentum can also move other rows
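
A quick way to see this is to run one backward pass and inspect the gradient of the embedding matrix. The sketch below uses a small `emb_dim` of 4 and a dummy loss (the sum of all looked-up values); both are arbitrary choices made only for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(65, 4)           # small emb_dim to keep the output readable
x = torch.tensor([[12, 5, 8, 8]])   # token 8 appears twice

# A lookup is plain row indexing into the weight matrix.
assert torch.equal(emb(x), emb.weight[x])

# Any scalar loss works for the demonstration.
loss = emb(x).sum()
loss.backward()

# Rows with a non-zero gradient are exactly the rows that were looked up.
rows_with_grad = emb.weight.grad.abs().sum(dim=1).nonzero().flatten().tolist()
print(rows_with_grad)         # [5, 8, 12]
print(emb.weight.grad[8])     # token 8 was used twice, so its gradient is doubled
```

All other rows have a gradient of exactly zero, so an optimizer step without weight decay or momentum leaves them untouched.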

In [10]:
import torch
import torch.nn as nn

# Vocabulary size — number of unique tokens
vocab_size = 65

# Embedding dimension — how big each token's vector should be
embedding_dim = 32

# Step 1: Create the embedding layer
token_embedding_table = nn.Embedding(vocab_size, embedding_dim)

# Step 2: Create a batch of token IDs
x = torch.tensor([[12, 5, 8, 8]])  # shape = (batch_size=1, block_size=4)

# Step 3: Apply the embedding layer
x_embed = token_embedding_table(x)

# Step 4: Print shapes and values
print("Input token IDs shape:", x.shape)
print("Embedded vector shape:", x_embed.shape)
print("Embedded vectors:\n", x_embed)


Input token IDs shape: torch.Size([1, 4])
Embedded vector shape: torch.Size([1, 4, 32])
Embedded vectors:
 tensor([[[ 1.6122, -0.0042,  3.2233, -0.8774, -0.8261, -0.6892, -0.0900,
          -0.3842, -0.5861, -0.3335, -1.0340, -2.3457, -0.6595,  0.5727,
           0.6329,  0.7235,  0.2713,  0.1399, -1.1759,  1.0062,  0.7798,
          -1.5368,  0.1864,  0.3425, -1.0717,  1.2803,  2.6648,  1.6649,
          -1.4183,  0.4762,  0.4356,  1.9057],
         [ 1.9596,  1.0546,  0.0424,  0.4907, -0.5999, -0.6488,  0.2887,
          -1.2818,  1.1660,  0.5406, -0.7223,  0.4687, -0.3591, -0.2321,
          -0.7401,  0.4454, -1.2618, -0.3058,  0.6854, -0.7882,  0.4058,
           0.5952,  1.1985,  0.7040, -1.4199, -1.1844,  0.9307,  0.6872,
          -0.1857,  1.6204,  0.9634,  1.4590],
         [-1.5320, -1.3053, -1.5015,  0.7131,  1.0173, -0.7714,  0.3690,
           0.0819, -1.4902, -1.5129, -1.9582, -0.0289, -0.0393,  0.4617,
           0.1914,  1.0228, -0.0082,  2.5797, -0.1569,  1.4555, -1.75

# Add Positional Encoding
# 🧠 Why?
# Self-attention on its own has no sense of order: without position information it treats the tokens as an unordered bag of vectors.

# But language is ordered:

# "The cat sat" ≠ "Sat the cat"

# So we must inject position information into each token’s embedding.
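
The "no sense of order" claim can be verified numerically. The sketch below is an assumption-laden toy: a single unmasked attention head with random weights and no positional information. Shuffling the input tokens simply shuffles the outputs in the same way, so the layer cannot tell one ordering from another. The function name `attend` is made up for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
T, C = 4, 8                      # sequence length, embedding size
x = torch.randn(1, T, C)         # token embeddings WITHOUT positional information

to_q = nn.Linear(C, C, bias=False)
to_k = nn.Linear(C, C, bias=False)
to_v = nn.Linear(C, C, bias=False)


def attend(inp):
    """Plain (unmasked) single-head self-attention."""
    q, k, v = to_q(inp), to_k(inp), to_v(inp)
    weights = F.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
    return weights @ v


perm = torch.tensor([2, 0, 3, 1])   # an arbitrary reordering of the 4 positions
out = attend(x)
out_shuffled = attend(x[:, perm])

# Reordering the inputs only reorders the outputs: each token's result is unchanged.
same = torch.allclose(out[:, perm], out_shuffled, atol=1e-5)
print(same)
```

Adding a position vector to each token breaks this symmetry, because the same token then looks different at different positions.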

In [11]:
block_size = 4
embedding_dim = 8
position_embedding_table = nn.Embedding(block_size, embedding_dim)
print("Shape of table:", position_embedding_table.weight.shape)
print("Positional Embedding Table:\n", position_embedding_table.weight)
pos_vector = position_embedding_table(torch.tensor([2]))
print("Vector for position 2:\n", pos_vector)
position_ids = torch.arange(block_size)
pos_vectors = position_embedding_table(position_ids)
print("Position vectors:\n", pos_vectors)


Shape of table: torch.Size([4, 8])
Positional Embedding Table:
 Parameter containing:
tensor([[ 0.1188,  0.5387,  3.2176, -1.8982, -0.6270, -1.3672,  0.8717, -0.5323],
        [ 1.1725, -1.8441,  0.8879, -0.3920, -0.8380, -0.4821, -0.8837,  1.1680],
        [-1.3120, -0.4717,  1.0688, -0.5945,  0.4471, -1.8819,  0.8677, -1.4532],
        [-2.1060, -0.9682, -0.2744, -1.1484,  0.1869,  0.8414, -0.6437, -0.0129]],
       requires_grad=True)
Vector for position 2:
 tensor([[-1.3120, -0.4717,  1.0688, -0.5945,  0.4471, -1.8819,  0.8677, -1.4532]],
       grad_fn=<EmbeddingBackward0>)
Position vectors:
 tensor([[ 0.1188,  0.5387,  3.2176, -1.8982, -0.6270, -1.3672,  0.8717, -0.5323],
        [ 1.1725, -1.8441,  0.8879, -0.3920, -0.8380, -0.4821, -0.8837,  1.1680],
        [-1.3120, -0.4717,  1.0688, -0.5945,  0.4471, -1.8819,  0.8677, -1.4532],
        [-2.1060, -0.9682, -0.2744, -1.1484,  0.1869,  0.8414, -0.6437, -0.0129]],
       grad_fn=<EmbeddingBackward0>)


In [12]:
import torch
import torch.nn as nn

# 1. Hyperparams
batch_size = 2
block_size = 4
embedding_dim = 8
vocab_size = 65

# 2. Embedding tables
token_embedding_table = nn.Embedding(vocab_size, embedding_dim)
position_embedding_table = nn.Embedding(block_size, embedding_dim)

# 3. Input tokens (2 sequences, 4 characters each)
x = torch.tensor([
    [12, 5, 8, 8],
    [9, 1, 17, 33]
])  # shape: (2, 4)

# 4. Token embeddings (look up each token ID)
token_emb = token_embedding_table(x)  # shape: (2, 4, 8)
print(token_emb)
# 5. Positional embeddings
position_ids = torch.arange(block_size)  # [0, 1, 2, 3]
pos_emb = position_embedding_table(position_ids)  # shape: (4, 8)
pos_emb = pos_emb.unsqueeze(0)  # reshape to (1, 4, 8) to match batch

# 6. Add them
x_embed = token_emb + pos_emb  # shape: (2, 4, 8)

print("Token embeddings:\n", token_emb)
print("Positional embeddings:\n", pos_emb)
print("Final input to transformer (x_embed):\n", x_embed)


tensor([[[ 1.8800, -0.3574, -1.5421, -0.0831,  1.5632, -1.4468,  0.7656,
          -1.0957],
         [ 0.0065,  0.1111,  0.4171, -0.7003,  0.5837, -0.8860, -1.5983,
          -0.6776],
         [ 1.8465, -0.4211, -0.0725, -0.2611, -0.1395,  0.5778,  0.4730,
          -0.0669],
         [ 1.8465, -0.4211, -0.0725, -0.2611, -0.1395,  0.5778,  0.4730,
          -0.0669]],

        [[-0.8325, -1.5855, -0.7591, -1.0580,  0.1029, -0.6641, -0.0900,
           0.3601],
         [-0.2799, -0.7474,  0.0538, -0.3732,  0.1258,  0.1530,  0.9843,
          -1.2262],
         [-1.1766, -0.0931,  0.3913,  0.6146,  0.3958,  0.6541, -0.5164,
           0.6811],
         [-0.9442,  0.8399, -1.3139, -0.0559,  0.3236, -0.8218, -1.2157,
           0.8498]]], grad_fn=<EmbeddingBackward0>)
Token embeddings:
 tensor([[[ 1.8800, -0.3574, -1.5421, -0.0831,  1.5632, -1.4468,  0.7656,
          -1.0957],
         [ 0.0065,  0.1111,  0.4171, -0.7003,  0.5837, -0.8860, -1.5983,
          -0.6776],
         [ 1.8465

# 3. Both are lookup tables
# Token embedding table → vector for the identity of the token

# Positional embedding table → vector for the position of the token

# Both are trainable

### 🔁 Input Embedding Pipeline (with Example: "help")

#### Input: "help"

1. **Tokenization**  
   - Character-level: ['h', 'e', 'l', 'p']  
   - Token IDs (via the 65-character vocab built above): [46, 43, 50, 54]  
   - Shape: `(batch_size=1, block_size=4)`

2. **Token Embedding**  
   - Use `nn.Embedding(vocab_size, emb_dim)`  
   - Maps each token ID to a learnable vector  
   - Output shape: `(1, 4, emb_dim)`  
   - Represents: *what* each token is

3. **Positional Embedding**  
   - Use `nn.Embedding(block_size, emb_dim)`  
   - Creates a learnable vector for each position: [0, 1, 2, 3]  
   - Output shape: `(1, 4, emb_dim)` (after unsqueeze)  
   - Represents: *where* each token is in the sequence

4. **Add Both Embeddings**  
   - Final input: `token_emb + pos_emb`  
   - Shape: `(1, 4, emb_dim)`  
   - Each token now encodes both meaning and position

✅ This combined embedding is passed to the first Transformer block.
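
The four steps can be run end to end for "help". The sketch below rebuilds the 65-character vocabulary printed earlier from its character list so it works on its own; `emb_dim = 8` is an arbitrary illustrative size, and the embedding tables are freshly initialized, so the actual vector values are random.

```python
import string

import torch
import torch.nn as nn

# Rebuild the 65-character vocabulary printed earlier so the sketch runs on its own.
chars = sorted("\n !$&',-.3:;?" + string.ascii_uppercase + string.ascii_lowercase)
stoi = {ch: i for i, ch in enumerate(chars)}

emb_dim = 8       # illustrative size
block_size = 4    # "help" has four characters

token_embedding_table = nn.Embedding(len(chars), emb_dim)
position_embedding_table = nn.Embedding(block_size, emb_dim)

# 1. Tokenization: characters -> IDs, with a batch dimension of 1
ids = torch.tensor([[stoi[c] for c in "help"]])        # shape (1, 4)

# 2. Token embedding: what each token is
token_emb = token_embedding_table(ids)                 # shape (1, 4, emb_dim)

# 3. Positional embedding: where each token is
positions = torch.arange(block_size)                   # [0, 1, 2, 3]
pos_emb = position_embedding_table(positions).unsqueeze(0)   # shape (1, 4, emb_dim)

# 4. Add both embeddings
x_embed = token_emb + pos_emb                          # shape (1, 4, emb_dim)

print(ids.tolist())      # [[46, 43, 50, 54]] with this vocabulary
print(x_embed.shape)
```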


### 🔁 Project Embeddings into Query, Key, and Value (Q, K, V)

To prepare for self-attention, each token's embedding is projected into 3 different vectors:

- **Query (Q)** → What the token is looking for
- **Key (K)**   → What the token offers to others
- **Value (V)** → What the token contains to share

These are created by applying 3 independent `nn.Linear` layers to the same input embedding:

- Input: `x_embed` → shape `(batch_size, block_size, embedding_dim)`
- Output:
  - Q: `(batch_size, block_size, embedding_dim)`
  - K: `(batch_size, block_size, embedding_dim)`
  - V: `(batch_size, block_size, embedding_dim)`

Each token is now ready to compare itself (via Q) to others (via K), and share information (via V).


In [13]:
# --- Simulated inputs ---
batch_size = 2
block_size = 4
embedding_dim = 8

# Simulated token+pos embeddings (e.g. from nn.Embedding)
x_embed = torch.randn(batch_size, block_size, embedding_dim)

# --- Linear layers to create Q, K, V ---
to_q = nn.Linear(embedding_dim, embedding_dim)
to_k = nn.Linear(embedding_dim, embedding_dim)
to_v = nn.Linear(embedding_dim, embedding_dim)

# --- Project to Q, K, V ---
q = to_q(x_embed)
k = to_k(x_embed)
v = to_v(x_embed)

# --- Check shapes ---
print("x_embed shape:", x_embed.shape)  # (2, 4, 8)
print("Q shape:", q.shape)              # (2, 4, 8)
print("K shape:", k.shape)              # (2, 4, 8)
print("V shape:", v.shape)              # (2, 4, 8)


x_embed shape: torch.Size([2, 4, 8])
Q shape: torch.Size([2, 4, 8])
K shape: torch.Size([2, 4, 8])
V shape: torch.Size([2, 4, 8])


# 🔁 Transformers (in the Q/K/V step)
# We take one input (x_embed) and pass it through three completely separate linear layers:

**1. Q = Linear_Q(x_embed)**

**2. K = Linear_K(x_embed)**

**3. V = Linear_V(x_embed)**

# These 3 layers are not connected to each other

# They do not pass values between themselves

# They are just three different views of the same input
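
One way to confirm that the three layers do not feed into each other is to backpropagate a loss that depends on Q alone and see which layers receive a gradient. The sketch below reuses the layer names from the cell above; the loss `q.sum()` is a dummy chosen only for this check.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding_dim = 8
x_embed = torch.randn(2, 4, embedding_dim)

to_q = nn.Linear(embedding_dim, embedding_dim)
to_k = nn.Linear(embedding_dim, embedding_dim)
to_v = nn.Linear(embedding_dim, embedding_dim)

q, k, v = to_q(x_embed), to_k(x_embed), to_v(x_embed)

# Backpropagate a loss that depends on Q only.
q.sum().backward()

print(to_q.weight.grad is not None)   # True: Q's layer received a gradient
print(to_k.weight.grad is None)       # True: K's layer was not involved
print(to_v.weight.grad is None)       # True: V's layer was not involved
```

In a real attention layer the loss depends on Q, K, and V together, so all three layers are trained; the point here is only that none of them takes another's output as its input.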

### 🔍 Q/K/V Linear Layers vs Classic ANN Layers

In classic neural networks (ANNs or CNNs), layers are **stacked** — the output of one layer feeds into the next in a chain:  
`Input → Hidden → Output`.

But in Transformers, during the attention step, we apply **three separate linear projections** to the same input embedding:

- `Query = Linear_Q(x_embed)`
- `Key   = Linear_K(x_embed)`
- `Value = Linear_V(x_embed)`

These are:
- **Not connected** to each other (no flow between them)
- **Independent** layers that each produce a different role/view of the same token
- Essential for enabling attention:  
  → Q compares against K to decide "who to look at",  
  → V is the content actually shared.

This **branching structure**, three parallel projections of one input, is what sets the attention step apart from a plain sequential stack of layers.
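
How Q, K, and V are then combined can be sketched as single-head, causally masked, scaled dot-product attention. The assumptions here are a head size equal to `embedding_dim`, random inputs in place of real embeddings, and a lower-triangular mask so that each position only looks at itself and earlier positions.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, block_size, embedding_dim = 2, 4, 8
x_embed = torch.randn(batch_size, block_size, embedding_dim)

to_q = nn.Linear(embedding_dim, embedding_dim)
to_k = nn.Linear(embedding_dim, embedding_dim)
to_v = nn.Linear(embedding_dim, embedding_dim)
q, k, v = to_q(x_embed), to_k(x_embed), to_v(x_embed)

# Q against K: one score per (query position, key position) pair.
scores = q @ k.transpose(-2, -1) / math.sqrt(embedding_dim)    # (2, 4, 4)

# Causal mask: position t may only look at positions <= t.
mask = torch.tril(torch.ones(block_size, block_size, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))

weights = F.softmax(scores, dim=-1)    # "who to look at"; each row sums to 1
out = weights @ v                      # weighted mix of the shared content: (2, 4, 8)

print(weights[0])
print(out.shape)
```

The first row of `weights[0]` puts all of its weight on position 0, since the first token has nothing earlier to attend to.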
