# Mini GPT: Building a Language Model from Scratch

In this session, we will **build and train a simplified GPT-like model** step by step, inspired by Karpathy's `nanoGPT`.

By the end, you will understand the **core building blocks** behind modern large language models.

### What to expect
- Learn how text is converted into numbers the model can understand.  
- Explore how Transformers work: **embeddings, self-attention, feedforward layers, and residual connections**.  
- Put these pieces together into a small GPT model.  
- Train the model on Shakespeare’s text.  
- Generate new, Shakespeare-like text from scratch.  

⚡ This is a **hands-on educational exercise**: the model is tiny compared to real GPTs, but it captures the **same core ideas**.

---

### Setup and Hyperparameters

We start by importing **PyTorch** (`torch`) and some of its modules:

- `torch.nn`: building blocks for neural networks.
- `torch.nn.functional`: activation functions, loss functions, and other utilities.

Then we define our **hyperparameters** — these control how the model is trained and how large it is:

- `batch_size = 64`: how many training sequences we process at the same time.
- `block_size = 256`: the maximum length of context (how many previous characters the model can "see" when predicting the next one).
- `max_iters = 5000`: number of training steps.
- `eval_interval = 500`: how often we evaluate model performance on validation data.
- `learning_rate = 3e-4`: how fast the optimizer updates weights.
- `device`: whether to use GPU (`cuda`) or CPU.
- `eval_iters = 200`: number of mini-batches used to estimate loss during evaluation.
- `n_embd = 384`: size of the embedding vectors (each token is represented by a vector of this length).
- `n_head = 6`: number of self-attention heads in the Transformer.
- `n_layer = 6`: number of Transformer blocks stacked together.
- `dropout = 0.2`: fraction of neurons randomly “dropped” during training to prevent overfitting.

These values define both the **capacity** of the model and the **training process**.


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters
batch_size   = 64    # Number of sequences processed in parallel
block_size   = 256   # Maximum context length for predictions
max_iters    = 5000  # Total number of training iterations
eval_interval = 500  # Interval for evaluation
learning_rate = 3e-4
eval_iters    = 200  # Number of iterations to estimate loss

n_embd  = 384   # Embedding dimension
n_head  = 6     # Number of attention heads
n_layer = 6     # Number of transformer blocks
dropout = 0.2   # Dropout rate

# Device configuration
device = "cuda" if torch.cuda.is_available() else "cpu"


### Loading and Preparing the Dataset

1. **Random seed**  
   `torch.manual_seed(1337)` ensures reproducibility — we’ll get the same random numbers each run.

2. **Load text**  
   We read Shakespeare’s works (`input.txt`) into a single string called `text`.

3. **Vocabulary**  
   - `chars`: all unique characters that appear in the dataset.  
   - `vocab_size`: total number of unique characters.  

4. **Character ↔ Integer mapping**  
   - `stoi` (“string to integer”): maps each character to an index.  
   - `itos` (“integer to string”): reverse mapping.  
   - `encode(s)`: converts a string into a list of integers.  
   - `decode(l)`: converts a list of integers back into text.

   > This is how we turn raw text into numerical data that a neural network can process.

5. **Dataset split**  
   - Encode the entire text into a `torch.tensor` of integers.  
   - Use the first 90% as `train_data`.  
   - Keep the last 10% as `val_data` (to measure generalization).
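
To make the mapping concrete, here is a minimal sketch of the same encode/decode round trip on a toy string. The sample text and the `sample_*` names are made up for illustration and are not part of the notebook; only the technique (sorted unique characters, two dictionaries, a positional 90/10 split) follows the steps above.

```python
# Toy corpus standing in for input.txt (hypothetical sample text).
sample_text = "to be, or not to be"

# Vocabulary: sorted unique characters, built the same way as `chars`.
sample_chars = sorted(set(sample_text))
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}
sample_itos = {i: ch for i, ch in enumerate(sample_chars)}


def sample_encode(s):
    """Map each character to its integer index."""
    return [sample_stoi[c] for c in s]


def sample_decode(ids):
    """Map integer indices back to characters."""
    return "".join(sample_itos[i] for i in ids)


ids = sample_encode("not to be")
print(sample_chars)        # the 8 distinct characters, space first
print(ids)                 # [4, 5, 7, 0, 7, 5, 0, 2, 3]
print(sample_decode(ids))  # not to be

# Positional split: first 90% for training, the rest for validation.
all_ids = sample_encode(sample_text)
n_train = int(0.9 * len(all_ids))
train_ids, val_ids = all_ids[:n_train], all_ids[n_train:]
print(len(train_ids), len(val_ids))  # 17 2
```

Note that a character missing from the vocabulary raises a `KeyError` in this scheme, which is one reason the vocabulary is built from the full text before splitting.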


In [None]:
# Download the tiny shakespeare dataset
# !curl -O https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
# OR
# !wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [None]:
# Set random seed for reproducibility
torch.manual_seed(1337)

with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Get all unique characters in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Create character ↔ integer mappings
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Encoder: string → list of integers
encode = lambda s: [stoi[c] for c in s]

# Decoder: list of integers → string
decode = lambda l: "".join([itos[i] for i in l])

# Train/validation split
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))  # 90% train, 10% validation
train_data = data[:n]
val_data = data[n:]

### Creating Training Batches

`get_batch(split)` samples random chunks of text:

- Picks `batch_size` random starting positions.  
- Builds input `x` (current characters) and target `y` (the same sequence shifted by one character).  
- Moves tensors to the right device (CPU/GPU).

Result: `x, y` have shape `(batch_size, block_size)` and are used to train the model to predict the **next character**.
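
The shift-by-one relationship between `x` and `y` is easier to see on a tiny example. The sketch below uses plain Python lists and made-up values (`toy_data`, `toy_block_size`, and a fixed `start`, all hypothetical) instead of tensors and random offsets.

```python
# Toy token stream and a short context length (hypothetical values).
toy_data = [10, 11, 12, 13, 14, 15, 16, 17]
toy_block_size = 4
start = 2  # in get_batch this offset is drawn at random

x = toy_data[start : start + toy_block_size]          # inputs
y = toy_data[start + 1 : start + toy_block_size + 1]  # inputs shifted by one

# Each position t yields one training example:
# the context x[:t+1] should predict the target y[t].
pairs = [(x[: t + 1], y[t]) for t in range(toy_block_size)]
for context, target in pairs:
    print(f"context {context} -> target {target}")

# Largest offset that still leaves room for the shifted target.
max_start = len(toy_data) - toy_block_size - 1
print(max_start)  # 3
```

A single row of length `block_size` therefore packs `block_size` next-character examples, with context lengths from 1 up to `block_size`.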


In [None]:
# Data loading
def get_batch(split):
    """Generate a small batch of inputs (x) and targets (y)."""
    
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    
    return x.to(device), y.to(device)

### Estimating Loss

`estimate_loss()` checks how well the model is doing without updating weights:

- Disables gradient tracking (`@torch.no_grad()` → faster, less memory).  
- Switches model to evaluation mode (`model.eval()` disables dropout).  
- Runs `eval_iters` batches for both **train** and **val** sets.  
- Collects and averages the losses.  
- Switches back to training mode (`model.train()`).  

This gives a stable estimate of training vs validation performance.


In [None]:
@torch.no_grad()
def estimate_loss(model):
    """Estimate the average loss for train and validation splits."""
    
    out = {}
    model.eval()
    
    for split in ["train", "val"]:
        losses = torch.zeros(eval_iters)
        
        for k in range(eval_iters):
            x, y = get_batch(split)
            logits, loss = model(x, y)
            losses[k] = loss.item()
        
        out[split] = losses.mean()
    
    model.train()
    return out


### Self-Attention Head

A single head of self-attention:

- Projects input `x` into **queries**, **keys**, and **values**.  
- Computes attention scores (`q @ k^T`), scaled by `1/sqrt(head_size)` for stability.  
- Applies a **causal mask** (`tril`) so tokens can’t look ahead.  
- Softmax → probabilities over past positions.  
- Uses these to weight the values → output.

Input shape: `(batch, time, channels)`  
Output shape: `(batch, time, head_size)`
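
The steps above can be sketched for a single sequence in plain Python, without PyTorch. This is an illustrative toy, not the notebook's implementation: the function name `causal_attention` and the input numbers are made up, dropout is omitted, and the queries, keys, and values are passed in directly rather than produced by learned projections. Restricting the softmax to positions `0..t` gives the same result as filling future positions with `-inf` before the softmax.

```python
import math


def causal_attention(q, k, v):
    """Single-head causal attention for one sequence.

    q, k, v: lists of T vectors, each a list of floats.
    Returns (weights, out): a T x T weight matrix and T output vectors.
    """
    T = len(q)
    scale = len(q[0]) ** -0.5  # 1 / sqrt(head_size)
    weights, out = [], []
    for t in range(T):
        # Scaled dot-product scores against positions 0..t only (causal mask).
        scores = [scale * sum(a * b for a, b in zip(q[t], k[s])) for s in range(t + 1)]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(sc - m) for sc in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (T - t - 1)
        weights.append(row)
        # Weighted average of the value vectors.
        out.append([sum(row[s] * v[s][d] for s in range(T)) for d in range(len(v[0]))])
    return weights, out


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights, out = causal_attention(q, k, v)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed matrix is lower-triangular and every row sums to 1. The first token can only attend to itself, so its output equals its own value vector.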


In [None]:
class Head(nn.Module):
    """One head of self-attention."""

    def __init__(self, head_size: int):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

        # Lower-triangular mask for causal self-attention
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor of shape (B, T, C)
               B = batch size, T = time steps, C = embedding channels

        Returns:
            Output tensor of shape (B, T, head_size)
        """
        B, T, C = x.shape

        # Project to key, query, and value
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)

        # Compute attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * (k.shape[-1] ** -0.5)  # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # (B, T, T)
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        wei = self.dropout(wei)

        # Weighted aggregation of values
        v = self.value(x)  # (B, T, head_size)
        out = wei @ v      # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)

        return out

### Multi-Head Attention

Instead of one head, we use several in parallel:

- Each `Head` learns different attention patterns.  
- Their outputs are concatenated (`torch.cat`) along the channel dimension.  
- A final linear projection mixes them back into the embedding size (`n_embd`).  
- Dropout is applied for regularization.

This allows the model to attend to different types of relationships at once.

In [None]:
class MultiHeadAttention(nn.Module):
    """Multiple heads of self-attention in parallel."""

    def __init__(self, num_heads: int, head_size: int):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor of shape (B, T, C)

        Returns:
            Output tensor of shape (B, T, n_embd)
        """
        # Concatenate results from all attention heads
        out = torch.cat([h(x) for h in self.heads], dim=-1)

        # Final linear projection + dropout
        out = self.dropout(self.proj(out))

        return out


### FeedForward Layer

A position-wise MLP applied to each token:

- Expands embedding size (`n_embd → 4*n_embd`).  
- Applies ReLU non-linearity.  
- Projects back down to `n_embd`.  
- Adds dropout for regularization.

This gives the model extra capacity beyond attention.


In [None]:
class FeedForward(nn.Module):
    """A simple feed-forward network: Linear → ReLU → Linear → Dropout."""

    def __init__(self, n_embd: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor of shape (B, T, n_embd)

        Returns:
            Output tensor of shape (B, T, n_embd)
        """
        return self.net(x)


### Transformer Block

Each block combines **communication (attention)** and **computation (MLP)**:

1. Apply LayerNorm → Multi-Head Attention → add residual connection.  
2. Apply LayerNorm → FeedForward → add residual connection.  

This structure lets tokens share information while keeping stable gradients.  
Stacking several blocks builds the full Transformer.
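
As a rough illustration of the pre-norm residual pattern, the sketch below normalizes one token vector by hand and adds a sublayer's output back onto the input. The helper names and the stand-in sublayers are hypothetical. `nn.LayerNorm` also applies a learned scale and shift, which start at 1 and 0 and are left out here.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and (nearly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def pre_norm_residual(vec, sublayer):
    """Compute vec + sublayer(layer_norm(vec))."""
    update = sublayer(layer_norm(vec))
    return [a + b for a, b in zip(vec, update)]


token = [2.0, 4.0, 6.0, 8.0]
normed = layer_norm(token)
print([round(v, 3) for v in normed])

# Stand-in sublayer that halves its input (hypothetical; it takes the place
# of attention or the feed-forward network).
result = pre_norm_residual(token, lambda h: [0.5 * x for x in h])
print([round(v, 3) for v in result])

# A sublayer that outputs zeros leaves the input untouched: the residual
# path always carries the original signal through the block.
unchanged = pre_norm_residual(token, lambda h: [0.0] * len(h))
```

Because each sublayer only adds a correction to `x`, gradients can flow through the addition directly, which is what keeps deep stacks of blocks trainable.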


In [None]:
class Block(nn.Module):
    """Transformer block: self-attention (communication) followed by feed-forward (computation)."""

    def __init__(self, n_embd: int, n_head: int):
        """
        Args:
            n_embd: Embedding dimension
            n_head: Number of attention heads
        """
        super().__init__()
        head_size = n_embd // n_head

        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor of shape (B, T, n_embd)

        Returns:
            Output tensor of shape (B, T, n_embd)
        """
        # Apply pre-norm, self-attention, and residual connection
        x = x + self.sa(self.ln1(x))

        # Apply pre-norm, feed-forward, and residual connection
        x = x + self.ffwd(self.ln2(x))

        return x


### GPTLanguageModel

This class assembles the full Transformer-based language model:

- **Token embeddings**: turn character indices into vectors.  
- **Position embeddings**: add information about order in the sequence.  
- **Stack of Transformer blocks**: attention + feedforward layers.  
- **Final LayerNorm + Linear head**: map hidden states to vocabulary logits.  

**Forward pass**:
1. Look up token + position embeddings → combine them.  
2. Pass through Transformer blocks.  
3. Project to logits over the vocabulary.  
4. If targets are provided, compute cross-entropy loss.  

Output: `(logits, loss)` where  
- `logits`: predictions for next token.  
- `loss`: training signal (or `None` if not given).
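
For intuition about the loss in step 4, here is a small hand-rolled version of cross-entropy for a single position, written in plain Python. The function and the logits are illustrative only. `F.cross_entropy` computes this quantity for each of the `B * T` positions and, by default, averages the results.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# One position, toy vocabulary of 4 tokens (hypothetical numbers).
confident = cross_entropy([5.0, 0.0, 0.0, 0.0], target=0)  # right and sure
wrong = cross_entropy([5.0, 0.0, 0.0, 0.0], target=1)      # sure but wrong
uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)    # no preference

print(f"confident: {confident:.3f}")
print(f"wrong:     {wrong:.3f}")
print(f"uniform:   {uniform:.3f}  (ln 4 = {math.log(4):.3f})")
```

The uniform case is a useful sanity check: a freshly initialized model that spreads probability almost evenly over a vocabulary of `V` characters should report a loss close to `ln(V)`.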


In [None]:
class GPTLanguageModel(nn.Module):
    """A simple GPT-style language model."""

    def __init__(self):
        super().__init__()

        # Token and positional embeddings
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)

        # Transformer blocks
        self.blocks = nn.Sequential(
            *[Block(n_embd, n_head=n_head) for _ in range(n_layer)]
        )

        # Final layer normalization and output head
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module: nn.Module) -> None:
        """Custom weight initialization."""
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(
        self, idx: torch.Tensor, targets: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor | None]:
        """
        Args:
            idx: Input indices of shape (B, T)
            targets: Optional target indices of shape (B, T)

        Returns:
            logits: Predictions of shape (B, T, vocab_size)
            loss: Cross-entropy loss (if targets provided), else None
        """
        B, T = idx.shape

        # Token + positional embeddings
        tok_emb = self.token_embedding_table(idx)  # (B, T, C)
        # Build position indices on the same device as the input
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T, C)
        x = tok_emb + pos_emb  # (B, T, C)

        # Transformer blocks and final norm
        x = self.blocks(x)  # (B, T, C)
        x = self.ln_f(x)    # (B, T, C)

        # Output projection
        logits = self.lm_head(x)  # (B, T, vocab_size)

        # Compute loss if targets are provided
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    @torch.no_grad()
    def generate(self, idx: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
        """
        Generate new tokens autoregressively.

        Args:
            idx: Input indices of shape (B, T), the current context.
            max_new_tokens: Number of tokens to generate.

        Returns:
            Tensor of shape (B, T + max_new_tokens) containing the original
            indices followed by the generated tokens.
        """
        for _ in range(max_new_tokens):
            # Crop to the last block_size tokens
            idx_cond = idx[:, -block_size:]

            # Forward pass through the model
            logits, _ = self(idx_cond)

            # Focus only on the last time step
            logits = logits[:, -1, :]  # (B, vocab_size)

            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, vocab_size)

            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)

            # Append sampled token to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)

        return idx



### Model Initialization and Optimizer

- Create the model (`GPTLanguageModel`) and move it to the right device (CPU/GPU).  
- Print the total number of parameters (in millions) → shows model size.  
- Define the optimizer: **AdamW**, a variant of Adam with weight decay, commonly used for Transformers.
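
The printed parameter count can be cross-checked by hand. The sketch below adds up the weights and biases of each layer defined above. It assumes a vocabulary of 65 characters, which is what the Tiny Shakespeare file should yield but is not stated in this notebook, so treat that number as an assumption. The function name `count_parameters` is likewise made up for this example.

```python
def count_parameters(vocab_size, n_embd, n_layer, block_size):
    """Closed-form parameter count for the model structure described above.

    Assumes n_embd is divisible by the number of heads, so the heads'
    key/query/value projections together form three n_embd x n_embd matrices.
    """
    # Token and position embedding tables.
    embeddings = vocab_size * n_embd + block_size * n_embd
    # Attention: key, query, value projections (no bias) plus the
    # output projection (with bias).
    attention = 3 * n_embd * n_embd + (n_embd * n_embd + n_embd)
    # Feed-forward: two linear layers with bias, hidden width 4 * n_embd.
    feed_forward = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * n_embd)
    per_block = attention + feed_forward + layer_norms
    # Final LayerNorm and the output head (with bias).
    final = 2 * n_embd + (n_embd * vocab_size + vocab_size)
    return embeddings + n_layer * per_block + final


# vocab_size=65 is an ASSUMPTION about the dataset; the rest comes from the setup cell.
total = count_parameters(vocab_size=65, n_embd=384, n_layer=6, block_size=256)
print(f"{total / 1e6:.2f}M parameters")  # 10.79M parameters
```

Two things stand out: the number of heads does not change the count (more heads means smaller heads), and the causal mask is registered as a buffer, so it is not a trainable parameter. Almost all of the weights sit in the Transformer blocks rather than in the embeddings.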


In [None]:
# Initialize model and move to device
model = GPTLanguageModel()
m = model.to(device)

# Print number of parameters (in millions)
num_params = sum(p.numel() for p in m.parameters())
print(f"{num_params / 1e6:.2f}M parameters")

# Create PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=learning_rate)


### Training Loop

For `max_iters` steps:

1. **Evaluate periodically**: every `eval_interval` steps (or at the end), estimate train/val loss.  
2. **Get a batch**: `(xb, yb)` input and target sequences.  
3. **Forward pass**: compute `logits` and `loss`.  
4. **Backward pass**:  
   - Reset gradients (`optimizer.zero_grad()`).  
   - Backpropagate (`loss.backward()`).  
   - Update weights (`optimizer.step()`).  

This loop gradually teaches the model to predict the next character.

**Important Note:** This cell takes a very long time to finish on a CPU, so from this point on you will need a GPU. The easiest way to get one is to upload this notebook, unchanged, to your Google Drive and open it in Google Colab. Then go to `Runtime` -> `Change Runtime Type` and select a T4 GPU. No subscription or payment is required. Remember to upload `input.txt` as well.


In [None]:
for step in range(max_iters):

    # Evaluate loss on train/val splits at regular intervals
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss(model)
        print(
            f"Step {step}: "
            f"train loss {losses['train']:.4f}, "
            f"val loss {losses['val']:.4f}"
        )

    # Sample a batch of training data
    xb, yb = get_batch("train")

    # Forward pass and loss computation
    logits, loss = model(xb, yb)

    # Backpropagation
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


### Text Generation

- Start with a single token (`0`, the first character in the sorted vocabulary, which is a newline for this dataset) as the initial context.  
- Call `model.generate(...)` to autoregressively sample the next tokens, up to `max_new_tokens=500`.  
- Decode the generated token IDs back into text and print it.  

👉 The model produces new Shakespeare-like text, character by character.
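
The sampling loop itself is independent of the Transformer. The following sketch reproduces it in plain Python with a made-up stand-in for the model (`toy_logits`, which simply favors the token after the last one). All names and numbers here are hypothetical. Only the loop structure (crop, score, softmax, sample, append) mirrors `generate`.

```python
import math
import random


def toy_logits(context, vocab_size):
    """Hypothetical stand-in for the model: favors (last token + 1)."""
    favored = (context[-1] + 1) % vocab_size
    return [3.0 if i == favored else 0.0 for i in range(vocab_size)]


def toy_generate(context, max_new_tokens, vocab_size, block_size, rng):
    """Autoregressive sampling: each new token is fed back in as context."""
    tokens = list(context)
    for _ in range(max_new_tokens):
        cond = tokens[-block_size:]            # crop to the context window
        logits = toy_logits(cond, vocab_size)  # scores for the next token
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        probs = [e / total for e in exps]      # softmax
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_token)              # extend the running sequence
    return tokens


rng = random.Random(1337)
sample = toy_generate([0], max_new_tokens=10, vocab_size=5, block_size=4, rng=rng)
print(sample)
```

Because the next token is sampled rather than taken as the arg-max, repeated runs with different seeds give different continuations, which is why the generated Shakespeare-like text varies from run to run.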


In [None]:
# Generate text from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)

# Switch to evaluation mode so dropout is disabled while sampling
m.eval()

# Short sample (print to console)
generated = m.generate(context, max_new_tokens=500)
print(decode(generated[0].tolist()))

# Long sample (save to file)
# with open("more.txt", "w", encoding="utf-8") as f:
#     f.write(decode(m.generate(context, max_new_tokens=10_000)[0].tolist()))

---

## How This Mini GPT Differs from Real GPT Models

Our model is inspired by GPT, but it’s much smaller and simpler.  
Here are five key differences:

1. **Scale**  
   - GPT (e.g., GPT-3) has **billions of parameters**, trained on huge datasets with thousands of GPUs.  
   - Our mini GPT has roughly **10 million parameters** with the settings above, trained on a tiny dataset (Shakespeare) on a single GPU.

2. **Tokenizer**  
   - GPT uses **Byte Pair Encoding (BPE)** or similar subword tokenization → efficient and works for all languages.  
   - Our version uses **character-level tokens** → simpler, with a tiny vocabulary, but less efficient because every text becomes a much longer token sequence.

3. **Architecture Details**  
   - Many modern GPT-style models add refinements such as **rotary position embeddings** and optimized attention implementations (e.g. FlashAttention).  
   - Our mini GPT uses a **basic Transformer** with learned position embeddings, pre-layer normalization, attention, and feedforward layers.

4. **Training Setup**  
   - GPT is trained with **massive compute budgets** for weeks or months.  
   - Our model trains in **hours** (or less) on a laptop GPU.

5. **Purpose**  
   - GPT is built for **general-purpose applications** (chat, coding, search, etc.).  
   - Our mini GPT is designed for **learning and experimentation** — to show the core ideas in action.

## MCQ

### 1.1. Self-Attention vs RNNs  

Which of the following is a major advantage of self-attention over RNNs?  

A. Handles long-range dependencies better<br>  
B. Requires less memory<br>  
C. Avoids positional embeddings<br>  
D. Uses fewer parameters<br>  

**Answer:** A ✅  

---

### 1.2. Transformer Complexity  

What is the typical disadvantage of transformer models compared to RNNs?  

A. Difficulty handling variable-length sequences<br>  
B. Quadratic computational complexity with sequence length<br>  
C. Inability to model long dependencies<br>  
D. Poor performance in machine translation<br>  

**Answer:** B ✅  

---

### 1.3. Paradigm Shift in PLMs  

Which of the following was a paradigm shift introduced by PLMs like GPT and BERT?  

A. Training from scratch on small datasets<br>  
B. Using pretrained embeddings (Word2Vec, GloVe) with RNNs<br>  
C. Fine-tuning full pretrained models for downstream tasks<br>  
D. Training only with unsupervised objectives<br>  

**Answer:** C ✅  

---

### 1.4. Positional Embeddings  

Why do transformers require positional embeddings?  

A. To encode sequential order not captured by self-attention<br>  
B. To reduce training data requirements<br>  
C. To avoid vanishing gradients<br>  
D. To replace token embeddings<br>  

**Answer:** A ✅  

---

### 1.5. Benchmarks  

Which benchmark highlighted the effectiveness of BERT in NLU tasks?  

A. SQuAD<br>  
B. GLUE<br>  
C. MNIST<br>  
D. CoNLL-2003<br>  

**Answer:** B ✅  

---

### 2.1. GPT Architecture  

GPT uses which type of transformer architecture?  

A. Encoder-only<br>  
B. Decoder-only with masked self-attention<br>  
C. Encoder-decoder with cross-attention<br>  
D. LSTM-based encoder<br>  

**Answer:** B ✅  

---

### 2.2. GPT Objective  

What was the main pretraining objective of GPT?  

A. Next sentence prediction<br>  
B. Masked language modeling<br>  
C. Causal language modeling (next word prediction)<br>  
D. Denoising autoencoding<br>  

**Answer:** C ✅  

---

### 2.3. GPT Dataset  

GPT was first trained on which dataset?  

A. Wikipedia + Toronto BooksCorpus<br>  
B. OpenWebText<br>  
C. Toronto BooksCorpus only<br>  
D. Billion Word Benchmark<br>  

**Answer:** C ✅  

---

### 2.4. GPT vs GPT2  

Which of the following was a major difference between GPT and GPT2?  

A. GPT2 used masked LM<br>  
B. GPT2 was trained on much larger data and had more parameters<br>  
C. GPT2 introduced bidirectionality<br>  
D. GPT2 was only for classification tasks<br>  

**Answer:** B ✅  

---

### 2.5. GPT Input Window  

What was the main training window size for GPT’s input sequences?  

A. 128 tokens<br>  
B. 256 tokens<br>  
C. 512 tokens<br>  
D. 1024 tokens<br>  

**Answer:** C ✅  