# ðŸ“˜ Assignment: Build Your Own BabyGPT

**Objective:** Demystify Large Language Models by building a decoder-only Transformer from scratch using PyTorch.


In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# --- Hyperparameters ---
batch_size = 32      # How many independent sequences will we process in parallel?
block_size = 64      # What is the maximum context length for predictions?
max_iters = 3000     # How many training steps?
eval_interval = 300  # How often to check loss?
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu' # Use GPU if available
eval_iters = 200

# Model Architecture params
n_embd = 128         # Dimension of the embedding vectors
n_head = 4           # Number of attention heads
n_layer = 2          # Number of transformer blocks
dropout = 0.2        # Regularization to prevent overfitting

print(f"Using device: {device}")
torch.manual_seed(1337)

Using device: cpu


<torch._C.Generator at 0x10dc029f0>

## Part 1: Data Loading & Tokenization
We will use the "Tiny Shakespeare" dataset. We also need to build the **Tokenizer**â€”the bridge between text strings and the numbers the model reads.

In [10]:
# Starter Snippet
text = "Hello World"
chars = sorted(list(set(text)))
vocab_size = len(chars)
# ... your code here for stoi and itos ...

stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # Encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # Decoder: take a list of integers, output a string

# Test it
print(f"Vocab size: {vocab_size}")
print(f"Encoded 'Hello': {encode('Hello')}")
print(f"Decoded back: {decode(encode('Hello'))}")


# Test it

Vocab size: 8
Encoded 'Hello': [1, 4, 5, 5, 6]
Decoded back: Hello


In [19]:
# 1. Download the dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

# 2. Read the file
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(list(set(text)))
vocab_size = len(chars)

stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # Encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # Decoder: take a list of integers, output a string


--2026-01-09 11:12:03--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8002::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: â€˜input.txt.4â€™


2026-01-09 11:12:03 (31.4 MB/s) - â€˜input.txt.4â€™ saved [1115394/1115394]



Part 1: The Tokenizer (From Text to Numbers)

Concept: Neural networks cannot process strings; they process tensors. You must convert text into integers.

Task: Build a character-level tokenizer.

1. Identify Vocabulary: Find every unique character in the text.
2. Create Mappings:stoi: A dictionary mapping 'a' $\rightarrow$ 1, 'b' $\rightarrow$ 2...itos: A dictionary mapping 1 $\rightarrow$ 'a', 2 $\rightarrow$ 'b'...
3. Implement Functions: Write encode(text) and decode(list_of_ints).

In [21]:
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

# 6. Data Batcher
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)

## Part 2: The Self-Attention Mechanism
This is the most critical part. The `Head` class implements **Scaled Dot-Product Attention**.

Concept: This is the heart of the Transformer. The model must learn that in the sentence "The animal didn't cross the street because it was too tired," the word "it" attends strongly to "animal."

Task: Implement a single "Head" of Self-Attention.

Create a class Head(nn.Module).
Define three linear layers: Key (K), Query (Q), and Value (V).
Implement the scaled dot-product formula:

$$
\text{Attention}(Q, K) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)
$$

The Mask (Crucial): Since this is a text generator, tokens cannot see the future. You must apply a "mask" (typically a Lower Triangular matrix) that sets future attention scores to -infinity before the softmax

In [22]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        
        # Compute attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        
        # MASKING: Ensure tokens can't see the future
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        
        # Perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

## Parts 3 & 4: Feed-Forward & The Block
The **FeedForward** network is where the model "thinks" about the data it gathered in the attention step. The **Block** just combines them.

In [23]:
class FeedForward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # Residual Connections (x + ...)
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

## Part 5: The BabyGPT Model
Now we assemble the full architecture.

In [24]:
class BabyGPT(nn.Module):

    def __init__(self):
        super().__init__()
        # 1. Embedding Table: Stores vectors for each character
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        
        # 2. Transformer Blocks
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        
        # 3. Final Layer Norm & Linear Head
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

## Part 6: Training Loop

In [25]:
model = BabyGPT()
m = model.to(device)

# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

print(f"--- Starting Training on {device} ---")

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print("--- Training Complete ---")

--- Starting Training on cpu ---
step 0: train loss 4.3300, val loss 4.3354
step 300: train loss 2.2665, val loss 2.2826
step 600: train loss 2.0365, val loss 2.0891
step 900: train loss 1.9073, val loss 2.0055
step 1200: train loss 1.8092, val loss 1.9279
step 1500: train loss 1.7555, val loss 1.8921
step 1800: train loss 1.7235, val loss 1.8706
step 2100: train loss 1.6886, val loss 1.8446
step 2400: train loss 1.6581, val loss 1.8266
step 2700: train loss 1.6372, val loss 1.8145
--- Training Complete ---


## Part 7: Generation

In [30]:
encode("hello")

[46, 43, 50, 50, 53]

In [33]:
# Generate from the model
# context = torch.zeros((1, 1), dtype=torch.long, device=device)
context = torch.tensor(encode("hello"), dtype=torch.long, device=device).unsqueeze(0)
generated_ids = m.generate(context, max_new_tokens=500)[0].tolist()
print(decode(generated_ids))

hellow on,
That reperband in eem shal,
What you sears.

MENENES:
Being shousider gue oath te toul
men fe and, at by a pition, if a dang sinveich as I good;
Is ret they he cheavse hand, I savie afthat a deate down;
Thou, say veationt, as and year hou thre my Cieoland
That shall-ack; as whun here you upt
frien's soom thee try.

DUCHEROMEO:
Musin, the huld sild mysell?
I favill my shill,
'By ch more, madour highter have a courn'd.
Ot every to wof them in, he is mour law'd thy bid good.

Pirnford Rair
My
