
# 🧪 Interactive Tutorial – Build a Tiny Decoder-Only Transformer

This notebook is structured as an **interactive lab**:

* 🔨 **Exercise cells** – marked *✏️ Exercise* – contain TODOs for you to complete; some raise `NotImplementedError` until you do.
* 👀 **Checkpoint cells** – run intermediate code to inspect tensors or verify output shapes.

Everything uses **pure PyTorch** and runs on a CPU-only Google Colab runtime; a GPU just makes it faster.


## Setup

In [None]:

import math
import textwrap
import torch
import torch.nn as nn
import torch.nn.functional as F
torch.manual_seed(42)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device:', device)


## 1  Prepare a Tiny Corpus

In [None]:

corpus = textwrap.dedent("""
  Shall I compare thee to a summer's day?
  Thou art more lovely and more temperate.
  Rough winds do shake the darling buds of May,
  And summer's lease hath all too short a date.

  Sometime too hot the eye of heaven shines,
  And often is his gold complexion dimm'd;
  And every fair from fair sometime declines,
  By chance, or nature's changing course, untrimm'd.

  But thy eternal summer shall not fade,
  Nor lose possession of that fair thou ow'st;
  Nor shall Death brag thou wander'st in his shade,
  When in eternal lines to time thou grow'st.

  The quick brown fox jumps over the lazy dog.
  Pack my box with five dozen liquor jugs.
  THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.
  PACK MY BOX WITH FIVE DOZEN LIQUOR JUGS.
                       
""")
print(f'Corpus length: {len(corpus):,} characters')



## 2  Character-Level Tokeniser

### ✏️ Exercise 1  
Implement the two helper functions:

* `encode(text)` – returns a list of integer token IDs  
* `decode(token_ids)` – returns the original string  

Run the **checkpoint** cell that follows to verify the round-trip works.
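
As a hedged illustration of the idea (not the reference solution), the sketch below builds a character-level vocabulary for a toy string and round-trips a word through it. The names `toy_text`, `toy_chars`, `toy_stoi`, `toy_itos`, `toy_encode`, and `toy_decode` are made up for this example and are not part of the notebook.

```python
# Toy character-level tokeniser; every name here is hypothetical.
toy_text = "hello world"
toy_chars = sorted(set(toy_text))                      # unique characters in a stable order
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}   # character -> integer ID
toy_itos = {i: ch for ch, i in toy_stoi.items()}       # integer ID -> character

def toy_encode(text):
    """Map each character to its integer ID (raises KeyError on unseen characters)."""
    return [toy_stoi[ch] for ch in text]

def toy_decode(ids):
    """Map integer IDs back to characters and join them into a string."""
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hello")
print(ids, "->", toy_decode(ids))
```

Note that a vocabulary built this way can only encode characters that occur in the text it was built from.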


In [None]:

# ── Tokeniser skeleton ─────────────────────────────────────────────────────────────
chars = sorted(list(set(corpus)))
stoi  = {ch:i for i, ch in enumerate(chars)}
itos  = {i:ch for ch,i in stoi.items()}
vocab_size = len(stoi)

def encode(text: str):
    """Convert string to list[int] of token IDs."""
    # TODO: replace NotImplementedError with your code
    raise NotImplementedError

def decode(token_ids):
    """Convert list[int] back to string."""
    # TODO: replace NotImplementedError with your code
    raise NotImplementedError


#### 🔍 Checkpoint 1 – Round-trip test

In [None]:

sample = "Hello"
print("Encode:", encode(sample))
print("Decode:", decode(encode(sample)))
assert decode(encode(sample)) == sample
print("✔️  Round-trip OK")


### Build Train / Validation Tensors

In [None]:

data = torch.tensor(encode(corpus), dtype=torch.long)
split = int(0.9 * len(data))
train_data, val_data = data[:split], data[split:]
print(f'Train tokens: {len(train_data):,} | Val tokens: {len(val_data):,}')



### ✏️ Exercise 2  
Complete `get_batch` so it returns a batch of input tokens **x** and target
tokens **y** (next-token labels). Shapes should be `(batch_size, block_size)`.
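
To make the input/target relationship concrete, here is a minimal sketch on a stand-in tensor rather than the real corpus. The names `toy_data`, `toy_T`, `toy_B`, `starts`, `toy_x`, and `toy_y` are assumptions made for this example. Because `toy_data` is just `0..19`, every target equals its input plus one, which makes the one-token shift easy to see.

```python
import torch

torch.manual_seed(0)
toy_data = torch.arange(20)      # stand-in for a 1-D tensor of token IDs
toy_T, toy_B = 4, 3              # window length and batch size (hypothetical values)

# Random start offsets; the upper bound leaves room for the shifted target window.
starts = torch.randint(0, len(toy_data) - toy_T, (toy_B,))
toy_x = torch.stack([toy_data[s : s + toy_T] for s in starts])          # inputs
toy_y = torch.stack([toy_data[s + 1 : s + 1 + toy_T] for s in starts])  # next-token labels

print(toy_x)
print(toy_y)
print(toy_x.shape, toy_y.shape)
```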


In [None]:

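def get_batch(split: str, *, block_size=64, batch_size=32):
    """Return (x, y), each of shape (B, T); y holds the next token for every position of x."""
    raise NotImplementedError  # TODO: delete this line once the TODOs below are filled in

    src = train_data if split == 'train' else val_data
    ix  = torch.randint()  # TODO: batch_size random start offsets into src
    x   = torch.stack()    # TODO: block_size tokens starting at each offset
    y   = torch.stack()    # TODO: the same windows shifted by one token
    return x.to(device), y.to(device)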



#### 🔍 Checkpoint 2 – Batch shapes

In [None]:

xb, yb = get_batch('train')
print("x shape:", xb.shape, "| y shape:", yb.shape)
assert xb.shape == yb.shape



## 3  Masked Multi-Head Self-Attention

We’ll build a minimal but complete **decoder-style** attention layer.

### ✏️ Exercise 3  
Fill the missing parts in `SelfAttention.forward`:

1. Compute scaled dot-product attention scores  
2. Apply the **causal mask** so each token attends only to itself and **earlier** positions
3. Apply softmax, and compute the weighted sum of values
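
The three steps can be sketched for a single head without a batch dimension. This is an illustration under simplifying assumptions, not the multi-head solution: `q`, `k`, and `v` are random stand-ins, and `T`, `d`, `causal`, `weights`, and `toy_out` are names chosen for this example.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d = 4, 8                                    # toy sequence length and head dimension
q, k, v = torch.randn(3, T, d).unbind(0)       # random stand-ins for queries, keys, values

# Step 1: scaled dot-product scores, shape (T, T)
scores = q @ k.T / math.sqrt(d)

# Step 2: causal mask, 1 on and below the diagonal; future positions get -inf
causal = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(causal == 0, float("-inf"))

# Step 3: softmax over keys, then a weighted sum of values
weights = F.softmax(scores, dim=-1)            # each row sums to 1 over allowed positions
toy_out = weights @ v                          # (T, d)

print(weights)
```

The printed weights are lower-triangular: row `t` has zeros in every column after `t`, and the first row puts all of its weight on position 0.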


In [None]:

class SelfAttention(nn.Module):
    def __init__(self, embed_dim: int, heads: int, block_size: int):
        super().__init__()
        assert embed_dim % heads == 0, "embed_dim must be divisible by heads"
        self.heads = heads
        self.d  = embed_dim // heads  # d: dimension per head

        # Projections: project input embeddings to queries, keys, and values
        self.to_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.to_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.to_v = nn.Linear(embed_dim, embed_dim, bias=False)

        self.proj = nn.Linear(embed_dim, embed_dim)  # Final output projection

        # Causal mask: lower-triangular matrix to prevent attending to future tokens
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask)

    def forward(self, x):
        # x: (B, T, C)
        # B = batch size
        # T = sequence length (at most block_size)
        # C = embedding dimension

        B, T, C = x.shape

        # Project input x to queries, keys, and values, then reshape for multi-head attention
        # After view and transpose: (B, H, T, d)
        # H = number of heads, d = head dimension (C // H)
        q = self.to_q(x).view(B, T, self.heads, self.d).transpose(1, 2)  # (B, H, T, d)
        k = self.to_k(x).view(B, T, self.heads, self.d).transpose(1, 2)  # (B, H, T, d)
        v = self.to_v(x).view(B, T, self.heads, self.d).transpose(1, 2)  # (B, H, T, d)

        # TODO 1: scaled dot-product
        # scores = ... shape (B, H, T, T)
        # scores[i, h, t1, t2] = dot product between query at position t1 and key at position t2, for batch i and head h

        # TODO 2: apply causal mask (self.mask) before softmax
        # The buffer is (block_size, block_size); slice it to self.mask[:T, :T] so it matches the scores.
        # It broadcasts over batch and heads and ensures each position attends only to itself and earlier positions.

        # TODO 3: softmax + weight values -> out
        # After softmax, multiply attention weights by v to get the output for each head

        # Merge heads: transpose back and reshape to (B, T, C)
        # out: (B, H, T, d) -> (B, T, H, d) -> (B, T, C)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)


#### 🔍 Checkpoint 3 – Self-Attention sanity check

In [None]:

tmp = torch.randn(2, 8, 32)  # (B,T,C)
att = SelfAttention(embed_dim=32, heads=4, block_size=8)
out = att(tmp)
print("Output shape:", out.shape)


## 4  Feed-Forward & Transformer Block

In [None]:

class FeedForward(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 4*embed_dim),
            nn.GELU(),
            nn.Linear(4*embed_dim, embed_dim),
        )
    def forward(self, x):
        return self.net(x)

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, heads, block_size):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.sa  = SelfAttention(embed_dim, heads, block_size)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ff  = FeedForward(embed_dim)
    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ff(self.ln2(x))
        return x



### ✏️ Exercise 4  
Instantiate a `TransformerBlock` and verify that it **preserves the input shape**.


In [None]:

# TODO: create random tensor (B=4, T=16, C=64), run through block, assert shapes
pass


## 5  Building the MiniGPT Model

Now that we have all the building blocks (tokenizer, attention, transformer block), let's put them together into a tiny GPT-style language model!  
We'll call it `MiniGPT`. This model stacks several Transformer blocks, adds token and position embeddings, and predicts the next character in a sequence.
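
Before reading the class, it may help to see how token and position embeddings combine. The sketch below uses made-up sizes (`toy_vocab`, `toy_block`, `toy_dim`) and standalone `nn.Embedding` layers; it only illustrates the broadcasting and is not taken from the model itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_vocab, toy_block, toy_dim = 10, 6, 4        # hypothetical sizes
tok_emb = nn.Embedding(toy_vocab, toy_dim)      # one vector per token ID
pos_emb = nn.Embedding(toy_block, toy_dim)      # one vector per position

idx = torch.randint(0, toy_vocab, (2, 5))       # (B=2, T=5) token IDs
tok = tok_emb(idx)                              # (2, 5, 4)
pos = pos_emb(torch.arange(idx.shape[1]))       # (5, 4), shared by every sequence in the batch
x = tok + pos                                   # broadcasts to (2, 5, 4)

print(tok.shape, pos.shape, x.shape)
```

The position table has only `toy_block` rows, which is why a model built this way cannot accept sequences longer than its block size.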

Below is the full model class, with detailed comments and docstrings to help you understand each part.

In [None]:

class MiniGPT(nn.Module):
    """
    A minimal GPT-style language model using stacked Transformer blocks.
    Args:
        vocab_size (int): Number of unique tokens (characters).
        embed_dim (int): Embedding dimension for tokens and positions.
        heads (int): Number of attention heads per block.
        depth (int): Number of Transformer blocks to stack.
        block_size (int): Maximum context length (sequence length).
    """
    def __init__(self, vocab_size, *, embed_dim=128, heads=4, depth=4, block_size=64):
        super().__init__()
        self.block_size = block_size

        # Token embedding: maps token IDs to vectors
        self.tok_emb = nn.Embedding(vocab_size, embed_dim)
        # Position embedding: adds information about token position in the sequence
        self.pos_emb = nn.Embedding(block_size, embed_dim)

        # Stack of Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim, heads, block_size)
            for _ in range(depth)
        ])

        # Final layer normalization and output head
        self.ln_f = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, vocab_size)

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        """Custom weight initialization for linear layers."""
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, std=0.02)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

    def forward(self, idx, targets=None):
        """
        Forward pass for MiniGPT.
        Args:
            idx (Tensor): Input token IDs, shape (batch, seq_len)
            targets (Tensor, optional): Target token IDs for loss computation.
        Returns:
            logits (Tensor): Raw predictions for each token.
            loss (Tensor or None): Cross-entropy loss if targets are provided.
        """
        batch_size, seq_len = idx.shape
        assert seq_len <= self.block_size, "Input sequence too long!"

        # Get token and position embeddings
        token_embeddings = self.tok_emb(idx)  # (B, T, C)
        position_ids = torch.arange(seq_len, device=idx.device)
        position_embeddings = self.pos_emb(position_ids)  # (T, C)
        x = token_embeddings + position_embeddings  # (B, T, C)

        # Pass through each Transformer block
        for block in self.blocks:
            x = block(x)

        # Final normalization and output projection
        x = self.ln_f(x)
        logits = self.head(x)  # (B, T, vocab_size)

        # Optionally compute loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens=50, temperature=1.0):
        """
        Generate new tokens from a prompt.
        Args:
            idx (Tensor): Initial context tokens, shape (batch, seq_len)
            max_new_tokens (int): How many tokens to generate.
            temperature (float): Sampling temperature.
        Returns:
            idx (Tensor): The extended sequence including generated tokens.
        """
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.block_size:]  # Crop to block size
            logits, _ = self(idx_cond)
            logits = logits[:, -1] / temperature  # Focus on last token
            probs = logits.softmax(dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_id], dim=1)
        return idx



---
### ✏️ Exercise 5 – Instantiate and Inspect MiniGPT

Let's create a MiniGPT model with just 2 Transformer blocks (for speed!) and check how many parameters it has. 

**Goal:** Make sure the total parameter count is less than 1 million.

**Hint:** use `sum(p.numel() for p in model.parameters())` to count parameters.

---
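
To see the counting idiom on something small enough to check by hand, here is a sketch with a throwaway two-layer network. `toy_model` and `n_params` are names invented for this example.

```python
import torch.nn as nn

# A throwaway model whose parameter count is easy to verify by hand.
toy_model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
n_params = sum(p.numel() for p in toy_model.parameters())

# Linear(10, 20): 10*20 weights + 20 biases = 220
# Linear(20, 5):  20*5 weights  +  5 biases = 105
print(f"Total parameters: {n_params:,}")  # 325
```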


In [None]:

# TODO: instantiate MiniGPT(vocab_size, depth=2) as `model`, move it to `device`, then print "Total parameters: ..."
pass


If you see "Total parameters: ..." and it's less than 1,000,000, you're good to go! This compact model will train quickly while still demonstrating all the key Transformer concepts.


---
🎉 **Congratulations!** You've built a complete MiniGPT model from scratch!

Next, let's sample from the untrained model, then train it on our Shakespeare corpus and see what kind of text it can generate.

---



### ✏️ Exercise 6  
Generate **100** new characters starting from a single newline token (a prompt of shape `(1, 1)`), then decode the result.
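
`generate` divides the last-step logits by `temperature` before sampling. The sketch below shows the effect on a made-up three-token distribution: `toy_logits` is an assumption for illustration and does not come from the model.

```python
import torch

toy_logits = torch.tensor([2.0, 1.0, 0.0])     # made-up scores for a 3-token vocabulary

# Lower temperatures sharpen the distribution; higher temperatures flatten it.
for temperature in (0.5, 1.0, 2.0):
    probs = (toy_logits / temperature).softmax(dim=-1)
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")

# Drawing one sample from the T=1.0 distribution, as `generate` does at each step
torch.manual_seed(0)
next_id = torch.multinomial(toy_logits.softmax(dim=-1), num_samples=1)
print("sampled token id:", next_id.item())
```

With an untrained model the logits are close to uniform, so expect near-random characters at any temperature.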


In [None]:

# TODO: use model.generate to produce text and decode it
pass


## 6  Quick Training Demo

In [None]:

opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
for step in range(1, 1501):
    xb, yb = get_batch('train')
    _, loss = model(xb, yb)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step:4d} | loss {loss.item():.3f}")
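
The loop above prints only the training loss of the current batch. One possible way to track held-out loss as well is sketched below. `estimate_loss`, `ToyBigram`, and `toy_val_batch` are hypothetical helpers written for this example; `ToyBigram` merely mimics the `(logits, loss)` interface of `MiniGPT` so the block runs on its own.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBigram(nn.Module):
    """Hypothetical stand-in with the same (logits, loss) interface as MiniGPT."""
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.table(idx)                       # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

@torch.no_grad()
def estimate_loss(net, batch_fn, n_batches=20):
    """Average the loss over several batches with gradients disabled."""
    net.eval()
    losses = [net(*batch_fn())[1].item() for _ in range(n_batches)]
    net.train()                                        # restore training mode
    return sum(losses) / len(losses)

def toy_val_batch():
    """Hypothetical batch function returning random (x, y) pairs."""
    x = torch.randint(0, 12, (8, 16))
    return x, torch.roll(x, shifts=-1, dims=1)

torch.manual_seed(0)
toy_net = ToyBigram(vocab_size=12)
val_loss = estimate_loss(toy_net, toy_val_batch)
print(f"val loss {val_loss:.3f}")
```

In this notebook, a call along the lines of `estimate_loss(model, lambda: get_batch('val'))` should work once the exercises are complete, assuming `val_data` is longer than `block_size`.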



## 7  Next Steps  
* Try changing the number of training steps or the learning rate and see how it affects the generated text.
* Modify the prompt you use for generation and observe how the output changes.
* Experiment with the model's context length (block_size) and see what happens.
* Adjust the model's hyperparameters (e.g., number of layers or heads) in small steps and compare the results.
* Think about what kinds of mistakes the model makes and why.

Happy experimenting! 🚀
