# 📝 Mini GPT: Building a Language Model from Scratch

Welcome! In this session, we’ll **build and train a simplified GPT-style model** inspired by Karpathy’s legendary [`nanoGPT`](https://github.com/karpathy/nanoGPT).  
Think of this as a guided tour into the inner workings of modern large language models, but without needing a data center to run it. 🚀


## What to expect
- **Turn text into tokens**: see how raw text is mapped into numbers a model can process.  
- **Peek inside Transformers**: embeddings, self-attention, feedforward layers, and residual connections.  
- **Assemble a tiny GPT**: stack the building blocks into a working autoregressive model.  
- **Train on Shakespeare’s text**: teach it the patterns of language.  
- **Generate new text**: watch your model write like the Bard himself. ✍️



💡 **Want a step-by-step walkthrough?**  
Check out Andrej Karpathy’s fantastic [YouTube lecture](https://www.youtube.com/watch?v=kCc8FmEb1nY), where he builds and explains GPT-like models from scratch.  
Our notebook complements his video; you’ll be coding along and experimenting directly.


👉 By the end, you won’t just *use* a language model; you’ll actually **understand how one is built.**

---


### Setup and Hyperparameters

We start by importing **PyTorch** (`torch`) and some of its modules:

- `torch.nn`: building blocks for neural networks.
- `torch.nn.functional`: activation functions, loss functions, and other utilities.

Then we define our **hyperparameters** — these control how the model is trained and how large it is:

- `batch_size = 64`: how many training sequences we process at the same time.
- `block_size = 256`: the maximum length of context (how many previous characters the model can "see" when predicting the next one).
- `max_iters = 5000`: number of training steps.
- `eval_interval = 500`: how often we evaluate model performance on validation data.
- `learning_rate = 3e-4`: how fast the optimizer updates weights.
- `device`: whether to use GPU (`cuda`) or CPU.
- `eval_iters = 200`: number of mini-batches used to estimate loss during evaluation.
- `n_embd = 384`: size of the embedding vectors (each token is represented by a vector of this length).
- `n_head = 6`: number of self-attention heads in the Transformer.
- `n_layer = 6`: number of Transformer blocks stacked together.
- `dropout = 0.2`: fraction of neurons randomly “dropped” during training to prevent overfitting.

These values define both the **capacity** of the model and the **training process**.
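A few derived quantities are worth checking by hand before training. The sketch below only does arithmetic on the values listed above; the names `head_size`, `tokens_per_step`, and `eval_steps` are introduced here for illustration.

```python
# Values copied from the hyperparameter list above.
batch_size, block_size = 64, 256
max_iters, eval_interval = 5000, 500
n_embd, n_head = 384, 6

# Each attention head works on an equal slice of the embedding.
assert n_embd % n_head == 0, "n_embd should be divisible by n_head"
head_size = n_embd // n_head

# Characters (tokens) consumed by one optimizer step.
tokens_per_step = batch_size * block_size

# Steps at which the training loop reports train/val loss.
eval_steps = [s for s in range(max_iters)
              if s % eval_interval == 0 or s == max_iters - 1]

print(head_size)          # 64
print(tokens_per_step)    # 16384
print(len(eval_steps))    # 11
```

With `eval_iters = 200` batches per split, each of those 11 evaluations runs 400 extra forward passes, which is why evaluation is done only occasionally.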


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters
batch_size   = 64    # Number of sequences processed in parallel
block_size   = 256   # Maximum context length for predictions
max_iters    = 5000  # Total number of training iterations
eval_interval = 500  # Interval for evaluation
learning_rate = 3e-4
eval_iters    = 200  # Number of iterations to estimate loss

n_embd  = 384   # Embedding dimension
n_head  = 6     # Number of attention heads
n_layer = 6     # Number of transformer blocks
dropout = 0.2   # Dropout rate

# Device configuration
device = "cuda" if torch.cuda.is_available() else "cpu"


### Loading and Preparing the Dataset

1. **Random seed**  
   `torch.manual_seed(1337)` seeds PyTorch's random number generator, so random choices such as batch sampling and weight initialization are repeatable from run to run.

2. **Load text**  
   We read Shakespeare’s works (`input.txt`) into a single string called `text`.

3. **Vocabulary**  
   - `chars`: all unique characters that appear in the dataset.  
   - `vocab_size`: total number of unique characters.  

4. **Character ↔ Integer mapping**  
   - `stoi` (“string to integer”): maps each character to an index.  
   - `itos` (“integer to string”): reverse mapping.  
   - `encode(s)`: converts a string into a list of integers.  
   - `decode(l)`: converts a list of integers back into text.

   > This is how we turn raw text into numerical data that a neural network can process.

5. **Dataset split**  
   - Encode the entire text into a `torch.tensor` of integers.  
   - Use the first 90% as `train_data`.  
   - Keep the last 10% as `val_data` (to measure generalization).
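To make the steps above concrete, here is a minimal sketch on a toy string instead of `input.txt`. All `sample_*` names are hypothetical and chosen so they do not clash with the variables you will define in the exercise cell.

```python
# Toy corpus standing in for the contents of input.txt.
sample_text = "to be, or not to be"

# Vocabulary: every distinct character, in a fixed (sorted) order.
sample_chars = sorted(set(sample_text))
sample_vocab_size = len(sample_chars)

# Lookup tables in both directions.
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}
sample_itos = {i: ch for ch, i in sample_stoi.items()}

def sample_encode(s):
    """String -> list of integer ids."""
    return [sample_stoi[c] for c in s]

def sample_decode(ids):
    """List of integer ids -> string."""
    return "".join(sample_itos[i] for i in ids)

sample_ids = sample_encode(sample_text)
assert sample_decode(sample_ids) == sample_text  # encoding is lossless

# 90/10 split by position (no shuffling): the tail is held out.
n_split = int(0.9 * len(sample_ids))
train_ids, val_ids = sample_ids[:n_split], sample_ids[n_split:]
print(sample_vocab_size, len(train_ids), len(val_ids))  # 8 17 2
```

Splitting by position rather than shuffling keeps the validation text contiguous, so the model is measured on passages it has never seen.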


In [None]:
# TODO: Download the dataset manually if not already present.
# The file should be called "input.txt".
# !curl -O https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
# OR
# !wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [None]:
# Set random seed for reproducibility
torch.manual_seed(1337)

with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

# TODO: Get all unique characters in the text
chars = ...
vocab_size = ...

# TODO: Create two dictionaries for character ↔ integer mappings
# - stoi (string to integer)
# - itos (integer to string)
stoi = ...
itos = ...

# TODO: Implement encoder (string → list of integers)
encode = ...

# TODO: Implement decoder (list of integers → string)
decode = ...

# Encode the entire dataset into integers
data = torch.tensor(encode(text), dtype=torch.long)

# TODO: Split data into train (90%) and validation (10%)
n = ...
train_data = ...
val_data = ...


### Creating Training Batches

`get_batch(split)` samples random chunks of text:

- Picks `batch_size` random starting positions.  
- Builds input `x` (current characters) and target `y` (the same sequence shifted by one character).  
- Moves tensors to the right device (CPU/GPU).

Result: `x, y` have shape `(batch_size, block_size)` and are used to train the model to predict the **next character**.
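The one-character shift between `x` and `y` is easiest to see on a tiny example. The sketch below uses plain Python lists and hand-picked toy values (`toy_data`, `toy_block_size`, and the start index `i` are assumptions for illustration, not the notebook's data).

```python
# Toy "dataset": the integers 0..9 stand in for encoded characters.
toy_data = list(range(10))
toy_block_size = 4

# A start index i is valid only if i + toy_block_size + 1 <= len(toy_data),
# so random starts should be drawn from range(max_start).
max_start = len(toy_data) - toy_block_size
i = 2  # one start position, chosen by hand instead of at random

x = toy_data[i : i + toy_block_size]          # [2, 3, 4, 5]
y = toy_data[i + 1 : i + toy_block_size + 1]  # [3, 4, 5, 6]

# One chunk packs toy_block_size training examples: each prefix of x
# is a context, and the matching entry of y is its target.
for t in range(toy_block_size):
    print(f"context {x[: t + 1]} -> target {y[t]}")
```

This is why a single `(x, y)` pair trains the model on every context length from 1 up to `block_size` at once.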


In [None]:
# Data loading
def get_batch(split):
    """Generate a small batch of inputs (x) and targets (y)."""
    
    # TODO: Select the correct dataset depending on split ("train" or "val")
    data = ...

    # TODO: Sample random starting indices for each sequence in the batch
    ix = ...

    # TODO: Construct the input sequences (x) and target sequences (y)
    # - x should be data[i : i + block_size]
    # - y should be data[i + 1 : i + block_size + 1]
    x = ...
    y = ...

    # TODO: Move to the correct device (e.g., "cpu" or "cuda")
    return ...

### Estimating Loss

`estimate_loss()` checks how well the model is doing without updating weights:

- Disables gradient tracking (`@torch.no_grad()` → faster, less memory).  
- Switches model to evaluation mode (`model.eval()` disables dropout).  
- Runs `eval_iters` batches for both **train** and **val** sets.  
- Collects and averages the losses.  
- Switches back to training mode (`model.train()`).  

This gives a stable estimate of training vs validation performance.
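The two switches used here, evaluation mode and `no_grad`, can be observed in isolation. This is a small standalone sketch on a bare `nn.Dropout` layer and a toy tensor, not on the GPT model itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.2)   # same rate as the `dropout` hyperparameter
x = torch.ones(1000)

layer.train()
train_out = layer(x)        # some entries zeroed, survivors scaled by 1/(1-p)

layer.eval()
eval_out = layer(x)         # dropout is a no-op in evaluation mode

print((train_out == 0).float().mean().item())  # fraction dropped, near 0.2
print(torch.equal(eval_out, x))                # True

# Inside no_grad, results are not attached to the autograd graph.
w = torch.ones(3, requires_grad=True)
with torch.no_grad():
    y = (w * 2).sum()
print(y.requires_grad)                         # False
```

Forgetting `model.train()` afterwards would silently leave dropout disabled for the rest of training, which is why the function restores it before returning.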


In [None]:
@torch.no_grad()
def estimate_loss(model):
    """Estimate the average loss for train and validation splits."""
    
    out = {}
    
    # TODO: Put the model in evaluation mode
    ...
    
    for split in ["train", "val"]:
        # TODO: create a tensor to store losses for eval_iters runs
        losses = ...
        
        for k in range(eval_iters):
            # TODO: sample a batch of data
            x, y = ...
            
            # TODO: run the model forward and compute the loss
            logits, loss = ...
            
            # TODO: store the loss value
            losses[k] = ...
        
        # TODO: compute the mean loss for this split
        out[split] = ...
    
    # TODO: switch model back to training mode
    ...
    
    return out


### Self-Attention Head

A single head of self-attention:

- Projects input `x` into **queries**, **keys**, and **values**.  
- Computes attention scores (`q @ k^T`), scaled for stability.  
- Applies a **causal mask** (`tril`) so tokens can’t look ahead.  
- Softmax → probabilities over past positions.  
- Uses these to weight the values → output.

Input shape: `(batch, time, channels)`  
Output shape: `(batch, time, head_size)`
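The steps above can be traced on toy tensors. In this sketch the query, key, and value tensors are random stand-ins for the outputs of the linear projections, and the sizes are deliberately tiny; dropout is left out.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, head_size = 1, 4, 8          # toy sizes, not the notebook's
q = torch.randn(B, T, head_size)   # stand-in for the query projection
k = torch.randn(B, T, head_size)   # stand-in for the key projection
v = torch.randn(B, T, head_size)   # stand-in for the value projection

# Scaled dot-product scores: (B, T, hs) @ (B, hs, T) -> (B, T, T)
wei = q @ k.transpose(-2, -1) * head_size ** -0.5

# Causal mask: position t may only attend to positions <= t.
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))

# Softmax turns each row into weights over the allowed positions.
wei = F.softmax(wei, dim=-1)

out = wei @ v                      # (B, T, T) @ (B, T, hs) -> (B, T, hs)
print(wei[0])                      # upper triangle is exactly zero
print(out.shape)                   # torch.Size([1, 4, 8])
```

Masking with `-inf` before the softmax (rather than zeroing afterwards) keeps every row a proper probability distribution over the visible positions.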


In [None]:
class Head(nn.Module):
    """One head of self-attention."""

    def __init__(self, head_size: int):
        super().__init__()
        
        # TODO: Define linear layers for key, query, and value projections
        self.key = ...
        self.query = ...
        self.value = ...
        
        # TODO: Register a lower-triangular mask for causal self-attention
        self.register_buffer("tril", ...)
        
        # TODO: Add dropout for regularization
        self.dropout = ...

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor of shape (B, T, C)
               B = batch size, T = time steps, C = embedding channels

        Returns:
            Output tensor of shape (B, T, head_size)
        """
        B, T, C = x.shape

        # TODO: Compute key and query projections
        k = ...
        q = ...

        # TODO: Compute attention scores (scaled dot-product)
        # - multiply q and k^T
        # - scale by 1/sqrt(head_size)
        wei = ...

        # TODO: Apply the causal mask (use tril)
        wei = ...

        # TODO: Normalize scores with softmax and apply dropout
        wei = ...

        # TODO: Compute value projections and weighted sum
        v = ...
        out = ...

        return out

### Multi-Head Attention

Instead of one head, we use several in parallel:

- Each `Head` learns different attention patterns.  
- Their outputs are concatenated (`torch.cat`) along the channel dimension.  
- A final linear projection mixes them back into the embedding size (`n_embd`).  
- Dropout is applied for regularization.

This allows the model to attend to different types of relationships at once.
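The shape bookkeeping is the main thing to get right here. The sketch below uses random tensors as stand-ins for the per-head outputs (the real ones come from `Head`), with a toy batch size and sequence length.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T = 2, 5                              # toy batch and sequence length
n_embd, n_head = 384, 6                  # values from the hyperparameters
head_size = n_embd // n_head             # 64

# Stand-ins for the per-head outputs, each of shape (B, T, head_size).
head_outputs = [torch.randn(B, T, head_size) for _ in range(n_head)]

# Concatenate along the channel (last) dimension.
cat = torch.cat(head_outputs, dim=-1)    # (B, T, n_head * head_size)

# The projection mixes information across heads; the shape is unchanged.
proj = nn.Linear(n_head * head_size, n_embd)
out = proj(cat)

print(cat.shape)   # torch.Size([2, 5, 384])
print(out.shape)   # torch.Size([2, 5, 384])
```

Because `head_size = n_embd // n_head`, the concatenated width equals `n_embd`, so multi-head attention costs about the same as one full-width head.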

In [None]:
class MultiHeadAttention(nn.Module):
    """Multiple heads of self-attention in parallel."""

    def __init__(self, num_heads: int, head_size: int):
        super().__init__()
        
        # TODO: create a list of attention heads
        self.heads = ...
        
        # TODO: final projection layer (maps back to n_embd)
        self.proj = ...
        
        # TODO: dropout for regularization
        self.dropout = ...

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor of shape (B, T, C)

        Returns:
            Output tensor of shape (B, T, n_embd)
        """
        # TODO: forward pass through each head and concatenate along the last dimension
        out = ...

        # TODO: apply final projection + dropout
        out = ...

        return out

### FeedForward Layer

A position-wise MLP applied to each token:

- Expands embedding size (`n_embd → 4*n_embd`).  
- Applies ReLU non-linearity.  
- Projects back down to `n_embd`.  
- Adds dropout for regularization.

This gives the model extra capacity beyond attention.
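To get a feel for how much of the model lives in this layer, the parameter count can be worked out by hand. The sketch assumes both linear layers keep their bias terms, which is the default for `torch.nn.Linear`; the variable names are illustrative.

```python
n_embd = 384              # from the hyperparameters
hidden = 4 * n_embd       # width of the expanded layer

# Assumption: both linear layers keep their bias term.
expand_params = n_embd * hidden + hidden     # weights + biases
project_params = hidden * n_embd + n_embd    # weights + biases

ffn_params = expand_params + project_params
print(hidden)        # 1536
print(ffn_params)    # 1181568
```

That is roughly 1.2 million parameters per block for the feed-forward part alone, before counting attention.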


In [None]:
class FeedForward(nn.Module):
    """A simple feed-forward network: Linear → ReLU → Linear → Dropout."""

    def __init__(self, n_embd: int):
        super().__init__()
        
        # TODO: build a sequential module with:
        # 1. Linear (n_embd → 4 * n_embd)
        # 2. ReLU
        # 3. Linear (4 * n_embd → n_embd)
        # 4. Dropout
        self.net = ...

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor of shape (B, T, n_embd)

        Returns:
            Output tensor of shape (B, T, n_embd)
        """
        # TODO: forward pass through the network
        return ...

### Transformer Block

Each block combines **communication (attention)** and **computation (MLP)**:

1. Apply LayerNorm → Multi-Head Attention → add residual connection.  
2. Apply LayerNorm → FeedForward → add residual connection.  

This structure lets tokens share information while keeping stable gradients.  
Stacking several blocks builds the full Transformer.
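The pattern in both steps is `x + sublayer(LayerNorm(x))`. Here is a plain-Python sketch on a single token vector; `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward net, and the learnable scale and shift of `nn.LayerNorm` are omitted.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def toy_sublayer(v):
    """Hypothetical stand-in for attention or the feed-forward net."""
    return [0.5 * a for a in v]

def pre_norm_residual(v, sublayer):
    """x + sublayer(LayerNorm(x)): the sublayer sees normalized input,
    while the original x flows through the skip connection untouched."""
    return [a + b for a, b in zip(v, sublayer(layer_norm(v)))]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
out = pre_norm_residual(x, toy_sublayer)
print([round(a, 3) for a in normed])
print([round(a, 3) for a in out])
```

Because the input is added back unchanged, gradients have a direct path through every block, which helps keep deep stacks trainable.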


In [None]:
class Block(nn.Module):
    """Transformer block: self-attention (communication) followed by feed-forward (computation)."""

    def __init__(self, n_embd: int, n_head: int):
        """
        Args:
            n_embd: Embedding dimension
            n_head: Number of attention heads
        """
        super().__init__()
        
        head_size = n_embd // n_head
        
        # TODO: Multi-head self-attention
        self.sa = ...
        
        # TODO: Feed-forward network
        self.ffwd = ...
        
        # TODO: Two layer normalizations
        self.ln1 = ...
        self.ln2 = ...

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor of shape (B, T, n_embd)

        Returns:
            Output tensor of shape (B, T, n_embd)
        """
        # TODO: Apply pre-norm + self-attention + residual connection
        x = ...
        
        # TODO: Apply pre-norm + feed-forward + residual connection
        x = ...
        
        return x


### GPTLanguageModel

This class assembles the full Transformer-based language model:

- **Token embeddings**: turn character indices into vectors.  
- **Position embeddings**: add information about order in the sequence.  
- **Stack of Transformer blocks**: attention + feedforward layers.  
- **Final LayerNorm + Linear head**: map hidden states to vocabulary logits.  

**Forward pass**:
1. Look up token + position embeddings → combine them.  
2. Pass through Transformer blocks.  
3. Project to logits over the vocabulary.  
4. If targets are provided, compute cross-entropy loss.  

Output: `(logits, loss)` where  
- `logits`: predictions for next token.  
- `loss`: training signal (or `None` if no targets are given).
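Step 4 involves a reshape that often trips people up, so here is a small standalone sketch. It assumes a vocabulary of 65 characters (the count commonly reported for tiny Shakespeare; check your own `vocab_size`) and uses all-zero logits to mimic a model that guesses uniformly.

```python
import math
import torch
import torch.nn.functional as F

B, T = 4, 8
vocab_size = 65   # assumption: number of unique characters in the dataset

# All-zero logits = equal probability for every token.
logits = torch.zeros(B, T, vocab_size)
targets = torch.randint(0, vocab_size, (B, T))

# In its simplest form F.cross_entropy takes (N, C) logits and (N,) class
# indices, so the batch and time dimensions are flattened together.
loss = F.cross_entropy(logits.view(B * T, vocab_size), targets.view(B * T))

print(loss.item())            # about 4.174
print(math.log(vocab_size))   # the uniform-guess baseline, ln(65)
```

A freshly initialized model should report a loss close to this baseline at step 0; a much larger value usually points to an initialization or reshaping mistake.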


In [None]:
class GPTLanguageModel(nn.Module):
    """A simple GPT-style language model."""

    def __init__(self):
        super().__init__()

        # TODO: Token and positional embeddings
        self.token_embedding_table = nn.Embedding(..., ...)
        self.position_embedding_table = nn.Embedding(..., ...)

        # TODO: Transformer blocks (stack of Block layers)
        self.blocks = ...

        # TODO: Final layer normalization and language modeling head
        self.ln_f = ...
        self.lm_head = ...

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module: nn.Module) -> None:
        """Custom weight initialization."""
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(
        self, idx: torch.Tensor, targets: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor | None]:
        """
        Args:
            idx: Input indices of shape (B, T)
            targets: Optional target indices of shape (B, T)

        Returns:
            logits: Predictions of shape (B, T, vocab_size)
            loss: Cross-entropy loss (if targets provided), else None
        """
        B, T = idx.shape

        # TODO: Compute token and positional embeddings, then add them
        tok_emb = ...
        pos_emb = ...
        x = ...

        # TODO: Pass through Transformer blocks and final layer norm
        x = ...
        x = ...

        # TODO: Project to vocabulary size (output projection)
        logits = ...

        # TODO: Compute cross-entropy loss if targets are provided
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            logits = ...
            targets = ...
            loss = ...

        return logits, loss
    
    def generate(self, idx: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
        """
        Generate new tokens autoregressively.

        Args:
            idx: Input indices of shape (B, T), the current context.
            max_new_tokens: Number of tokens to generate.

        Returns:
            Tensor of shape (B, T + max_new_tokens) containing the original
            indices followed by the generated tokens.
        """
        for _ in range(max_new_tokens):
            # TODO: crop idx to the last block_size tokens (context window)
            idx_cond = ...

            # TODO: forward pass through the model
            logits, _ = ...

            # TODO: take only the logits for the last time step
            logits = ...

            # TODO: convert logits to probabilities with softmax
            probs = ...

            # TODO: sample the next token from the probability distribution
            idx_next = ...

            # TODO: append sampled token to the running sequence
            idx = ...

        return idx

### Model Initialization and Optimizer

- Create the model (`GPTLanguageModel`) and move it to the right device (CPU/GPU).  
- Print the total number of parameters (in millions) → shows model size.  
- Define the optimizer: **AdamW**, a variant of Adam with decoupled weight decay, commonly used for Transformers.


In [None]:
# Initialize model and move to device
model = GPTLanguageModel()
m = model.to(device)

# Print number of parameters (in millions)
num_params = sum(p.numel() for p in m.parameters())
print(f"{num_params / 1e6:.2f}M parameters")

# Create PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=learning_rate)


### Training Loop

For `max_iters` steps:

1. **Evaluate periodically**: every `eval_interval` steps (or at the end), estimate train/val loss.  
2. **Get a batch**: `(xb, yb)` input and target sequences.  
3. **Forward pass**: compute `logits` and `loss`.  
4. **Backward pass**:  
   - Reset gradients (`optimizer.zero_grad()`).  
   - Backpropagate (`loss.backward()`).  
   - Update weights (`optimizer.step()`).  

This loop gradually teaches the model to predict the next character.
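The same forward / zero-grad / backward / step rhythm works for any PyTorch model. The sketch below applies it to a toy regression problem so it finishes instantly on a CPU; `toy_model`, `toy_opt`, and the learning rate of `0.1` are assumptions chosen for this tiny problem, not values for the GPT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy regression problem (y = 2x) standing in for the language model.
toy_model = nn.Linear(1, 1)
toy_opt = torch.optim.AdamW(toy_model.parameters(), lr=0.1)
xs = torch.randn(64, 1)
ys = 2.0 * xs

history = []
for step in range(100):
    loss = F.mse_loss(toy_model(xs), ys)   # forward pass
    toy_opt.zero_grad(set_to_none=True)    # clear old gradients
    loss.backward()                        # backpropagate
    toy_opt.step()                         # update weights
    history.append(loss.item())

print(f"first loss {history[0]:.4f}, last loss {history[-1]:.4f}")
```

If gradients were not cleared each step they would accumulate across iterations, and the updates would no longer reflect the current batch.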

**Important note:** On a CPU this cell takes a very long time to finish, so from this point on you will want a GPU. The easiest way to get one is to upload this notebook (as is) to your Google Drive and open it in Google Colab. Then go to `Runtime` -> `Change runtime type` and select a T4 GPU. No subscription or payment is required. Remember to upload `input.txt` as well.


In [None]:
for step in range(max_iters):

    # Evaluate loss on train/val splits at regular intervals
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss(model)
        print(
            f"Step {step}: "
            f"train loss {losses['train']:.4f}, "
            f"val loss {losses['val']:.4f}"
        )

    # Sample a batch of training data
    xb, yb = get_batch("train")

    # Forward pass and loss computation
    logits, loss = model(xb, yb)

    # Backpropagation
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


### Text Generation

- Start with a single token (`0`) as the initial context.  
- Call `model.generate(...)` to autoregressively sample the next tokens, up to `max_new_tokens=500`.  
- Decode the generated token IDs back into text and print it.  

👉 The model produces new Shakespeare-like text, character by character.
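The sampling loop itself is independent of the Transformer. This plain-Python sketch replaces the model and softmax with a hypothetical function, `toy_next_token_probs`, that returns a fixed probability table, so only the crop / predict / sample / append cycle is on display.

```python
import random

random.seed(0)
toy_vocab_size = 5
toy_block_size = 3

def toy_next_token_probs(context):
    """Hypothetical stand-in for model + softmax: favors (last token + 1)."""
    probs = [0.1] * toy_vocab_size
    probs[(context[-1] + 1) % toy_vocab_size] = 0.6
    return probs

def toy_generate(idx, max_new_tokens):
    idx = list(idx)
    for _ in range(max_new_tokens):
        context = idx[-toy_block_size:]          # crop to the context window
        probs = toy_next_token_probs(context)    # distribution over next token
        next_id = random.choices(range(toy_vocab_size), weights=probs)[0]
        idx.append(next_id)                      # extend the running sequence
    return idx

sample = toy_generate([0], max_new_tokens=10)
print(sample)
```

Sampling from the distribution, instead of always taking the most likely token, is what gives generated text its variety from one run to the next.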


In [None]:
# Generate text from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)

# Short sample (print to console)
generated = m.generate(context, max_new_tokens=500)
print(decode(generated[0].tolist()))

# Long sample (save to file)
# with open("more.txt", "w", encoding="utf-8") as f:
#     f.write(decode(m.generate(context, max_new_tokens=10_000)[0].tolist()))

---

## How This Mini GPT Differs from Real GPT Models

Our model is inspired by GPT, but it’s much smaller and simpler.  
Here are five key differences:

1. **Scale**  
   - GPT (e.g., GPT-3) has **billions of parameters**, trained on huge datasets with thousands of GPUs.  
   - Our mini GPT has only about **10 million parameters** with the hyperparameters above, trained on a tiny dataset (Shakespeare) with a single GPU/CPU.

2. **Tokenizer**  
   - GPT uses **Byte Pair Encoding (BPE)** or similar subword tokenization → efficient and works for all languages.  
   - Our version uses **character-level tokens** → simpler, with a tiny vocabulary, but the same text becomes a much longer sequence of tokens.

3. **Architecture Details**  
   - Recent GPT-style models often add refinements such as **rotary position embeddings** and optimized attention implementations (e.g. FlashAttention).  
   - Our mini GPT uses a **basic pre-norm Transformer** with learned position embeddings, attention, and feedforward layers.

4. **Training Setup**  
   - GPT is trained with **massive compute budgets** for weeks or months.  
   - Our model trains in **hours** (or less) on a single GPU, such as the free Colab T4 mentioned above.

5. **Purpose**  
   - GPT is built for **general-purpose applications** (chat, coding, search, etc.).  
   - Our mini GPT is designed for **learning and experimentation** — to show the core ideas in action.

## Multiple-Choice Questions

### 1.1. Self-Attention vs RNNs  

Which of the following is a major advantage of self-attention over RNNs?  

A. Handles long-range dependencies better<br>  
B. Requires less memory<br>  
C. Avoids positional embeddings<br>  
D. Uses fewer parameters<br>  

**Answer:** 

---

### 1.2. Transformer Complexity  

What is a typical disadvantage of transformer models compared to RNNs?  

A. Difficulty handling variable-length sequences<br>  
B. Quadratic computational complexity with sequence length<br>  
C. Inability to model long dependencies<br>  
D. Poor performance in machine translation<br>  

**Answer:** 

---

### 1.3. Paradigm Shift in PLMs  

Which of the following was a paradigm shift introduced by PLMs like GPT and BERT?  

A. Training from scratch on small datasets<br>  
B. Using pretrained embeddings (Word2Vec, GloVe) with RNNs<br>  
C. Fine-tuning full pretrained models for downstream tasks<br>  
D. Training only with unsupervised objectives<br>  

**Answer:** 

---

### 1.4. Positional Embeddings  

Why do transformers require positional embeddings?  

A. To encode sequential order not captured by self-attention<br>  
B. To reduce training data requirements<br>  
C. To avoid vanishing gradients<br>  
D. To replace token embeddings<br>  

**Answer:** 

---

### 1.5. Benchmarks  

Which benchmark highlighted the effectiveness of BERT in NLU tasks?  

A. SQuAD<br>  
B. GLUE<br>  
C. MNIST<br>  
D. CoNLL-2003<br>  

**Answer:**

---

### 2.1. GPT Architecture  

GPT uses which type of transformer architecture?  

A. Encoder-only<br>  
B. Decoder-only with masked self-attention<br>  
C. Encoder-decoder with cross-attention<br>  
D. LSTM-based encoder<br>  

**Answer:**

---

### 2.2. GPT Objective  

What was the main pretraining objective of GPT?  

A. Next sentence prediction<br>  
B. Masked language modeling<br>  
C. Causal language modeling (next word prediction)<br>  
D. Denoising autoencoding<br>  

**Answer:** 

---

### 2.3. GPT Dataset  

GPT was first trained on which dataset?  

A. Wikipedia + Toronto BooksCorpus<br>  
B. OpenWebText<br>  
C. Toronto BooksCorpus only<br>  
D. Billion Word Benchmark<br>  

**Answer:** 

---

### 2.4. GPT vs GPT-2  

Which of the following was a major difference between GPT and GPT-2?  

A. GPT-2 used masked LM<br>  
B. GPT-2 was trained on much larger data and had more parameters<br>  
C. GPT-2 introduced bidirectionality<br>  
D. GPT-2 was only for classification tasks<br>  

**Answer:** 

---

### 2.5. GPT Input Window  

What was the main training window size for GPT’s input sequences?  

A. 128 tokens<br>  
B. 256 tokens<br>  
C. 512 tokens<br>  
D. 1024 tokens<br>  

**Answer:** 