# Pretraining on Real Text

In this notebook, you'll train a GPT model from scratch on Shakespeare and watch gibberish transform into recognizable English.

**What you'll do:**
- Load and tokenize a text dataset, create training batches with the input/target offset
- Run a forward pass through the GPT model and verify the initial loss matches theory
- Implement the training loop with learning rate scheduling and gradient clipping
- Add evaluation, loss tracking, and periodic text generation
- Train the model, diagnose issues, and iterate on hyperparameters

**Predict-first methodology:** For each exercise, PREDICT the output before running the cell. Wrong predictions are more valuable than correct ones—they reveal gaps in your mental model.

In [None]:
# Setup: self-contained for Google Colab
!pip install -q tiktoken

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import clip_grad_norm_
from torch.utils.data import Dataset, DataLoader
import tiktoken
import math
import matplotlib.pyplot as plt
from dataclasses import dataclass
from itertools import cycle
import urllib.request
import os

# Reproducibility
torch.manual_seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
if device.type == 'cuda':
    print(f'GPU: {torch.cuda.get_device_name()}')
    print(f'Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')

print('Setup complete.')

## Shared: GPT Model from Building nanoGPT

This is the complete GPT model you built in the previous lesson. We include it here so the notebook is self-contained. You already understand every line—this is not new content.

In [None]:
# --- GPT Model (from Building nanoGPT) ---
# You built this in the previous lesson. Included here for self-contained Colab use.

@dataclass
class GPTConfig:
    vocab_size: int = 50257     # GPT-2 BPE vocabulary
    block_size: int = 256       # Context length for training (smaller than 1024 for speed)
    n_layer: int = 6            # Fewer layers for training on Colab
    n_head: int = 6             # Number of attention heads
    n_embd: int = 384           # Embedding dimension
    dropout: float = 0.1        # Dropout rate
    bias: bool = False          # Use bias in Linear layers?


class CausalSelfAttention(nn.Module):
    """Multi-head attention with batched computation."""

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.register_buffer(
            'mask',
            torch.tril(torch.ones(config.block_size, config.block_size))
                 .view(1, 1, config.block_size, config.block_size)
        )

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        head_size = C // self.n_head
        q = q.view(B, T, self.n_head, head_size).transpose(1, 2)
        k = k.view(B, T, self.n_head, head_size).transpose(1, 2)
        v = v.view(B, T, self.n_head, head_size).transpose(1, 2)
        scale = head_size ** -0.5
        scores = q @ k.transpose(-2, -1) * scale
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        weights = self.attn_dropout(weights)
        out = weights @ v
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.c_proj(out)
        out = self.resid_dropout(out)
        return out


class FeedForward(nn.Module):
    """Position-wise feed-forward network."""

    def __init__(self, config):
        super().__init__()
        self.c_fc   = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu   = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x


class Block(nn.Module):
    """Transformer block: MHA + FFN with residual connections and layer norm."""

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.ffn  = FeedForward(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.ffn(self.ln_2(x))
        return x


class GPT(nn.Module):
    """Complete GPT language model."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.transformer.wte.weight = self.lm_head.weight  # weight tying
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        if isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        assert T <= self.config.block_size, \
            f"Sequence length {T} exceeds block_size {self.config.block_size}"
        tok_emb = self.transformer.wte(idx)
        pos = torch.arange(0, T, device=idx.device)
        pos_emb = self.transformer.wpe(pos)
        x = self.transformer.drop(tok_emb + pos_emb)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0):
        """Generate tokens autoregressively."""
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.config.block_size:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, idx_next], dim=1)
        return idx


# Quick verification
config = GPTConfig()
n_params = sum(p.numel() for p in GPT(config).parameters())
print(f'Model parameters: {n_params:,}')
print(f'Config: {config.n_layer} layers, {config.n_head} heads, {config.n_embd} dim')
print(f'Context length: {config.block_size}')
print('Model definition loaded.')

**Note on model size:** The lesson discusses GPT-2 small (124M params, 12 layers, 768 dim). For Colab training on TinyShakespeare, we use a smaller config (6 layers, 384 dim, ~30M params in total: ~10M in the transformer blocks plus ~19M in the tied token embedding). The architecture and training loop are identical; only the scale differs. This lets you train to convergence in minutes rather than hours.
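
The parameter count printed by the verification cell can be checked by hand. The sketch below is a plain-Python tally that assumes the `GPTConfig` defaults above (`bias=False`, `nn.LayerNorm` with its default weight and bias, and the tied token embedding counted once). The helper name `count_gpt_params` is illustrative and not part of the notebook.

```python
def count_gpt_params(vocab_size=50257, block_size=256, n_layer=6, n_embd=384):
    """Tally parameters for the GPT config above (bias=False, tied lm_head)."""
    wte = vocab_size * n_embd                      # token embedding, shared with lm_head
    wpe = block_size * n_embd                      # position embedding
    layer_norm = 2 * n_embd                        # weight + bias
    attn = n_embd * 3 * n_embd + n_embd * n_embd   # c_attn + c_proj
    ffn = 2 * (n_embd * 4 * n_embd)                # c_fc + c_proj
    block = 2 * layer_norm + attn + ffn            # ln_1, ln_2, attention, FFN
    blocks = n_layer * block
    total = wte + wpe + blocks + layer_norm        # last term is the final ln_f
    return {"wte": wte, "wpe": wpe, "blocks": blocks, "total": total}


counts = count_gpt_params()
for name, value in counts.items():
    print(f"{name:>6}: {value:,}")
```

Under these assumptions the six blocks hold roughly 10.6M parameters and the token embedding roughly 19.3M, so most of this small model's parameters sit in the vocabulary table rather than in the transformer layers.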

---

## Exercise 1: Load and Tokenize Text, Create Training Batches

**Type: GUIDED** — All code is provided. Your job is to predict, then verify.

The first step in training a language model: turn raw text into token IDs, then slice them into context windows with the input/target offset.

Remember the key insight from the lesson: for a chunk of tokens, the input is `tokens[i:i+T]` and the target is `tokens[i+1:i+T+1]`—shifted by one position. Every position in the input predicts the token that follows it.
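
The offset is easiest to see on a toy sequence. The sketch below uses made-up token IDs and a context length of 4 (both are stand-ins, not values from the notebook) and prints the context each position sees next to the token it must predict.

```python
tokens = [10, 11, 12, 13, 14, 15]   # stand-in token IDs
T = 4                               # toy context length
i = 0                               # start index of the window

x = tokens[i : i + T]               # input window
y = tokens[i + 1 : i + T + 1]       # target window, shifted by one

for t in range(T):
    # With causal masking, position t only sees x[0..t] and must predict y[t]
    print(f"context {x[: t + 1]} -> target {y[t]}")
```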

**Before running, predict:**
1. TinyShakespeare is about 1MB of text. Roughly how many BPE tokens will that produce? (Hint: BPE averages ~4 characters per token for English.)
2. With a `block_size` of 256 and ~300K tokens, roughly how many training examples will the Dataset have?
3. For a single training example of length T=256, how many next-token predictions does the model make in one forward pass?

In [None]:
# --- Download TinyShakespeare ---
url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
if not os.path.exists('shakespeare.txt'):
    urllib.request.urlretrieve(url, 'shakespeare.txt')
    print('Downloaded shakespeare.txt')

# Read with an explicit encoding and close the file when done
with open('shakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print(f'Text length: {len(text):,} characters')
print(f'First 200 characters:\n{text[:200]}')

In [None]:
# --- Tokenize the entire corpus ---
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode(text)
print(f'Total tokens: {len(tokens):,}')
print(f'Characters per token (avg): {len(text) / len(tokens):.1f}')

# Show what tokenization looks like
sample = text[:50]
sample_tokens = enc.encode(sample)
print(f'\nSample text: "{sample}"')
print(f'Token IDs: {sample_tokens}')
print(f'Decoded tokens: {[enc.decode([t]) for t in sample_tokens]}')

data = torch.tensor(tokens, dtype=torch.long)
print(f'\nData tensor shape: {data.shape}')

In [None]:
# --- Create the Dataset with input/target offset ---

class TextDataset(Dataset):
    def __init__(self, data, block_size):
        self.data = data
        self.block_size = block_size

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        x = self.data[idx   : idx + self.block_size]      # input
        y = self.data[idx+1 : idx + self.block_size + 1]  # target (shifted by 1)
        return x, y

# Split into train/val (90/10)
block_size = config.block_size  # 256
n = int(0.9 * len(data))
train_dataset = TextDataset(data[:n], block_size)
val_dataset   = TextDataset(data[n:], block_size)

print(f'Train examples: {len(train_dataset):,}')
print(f'Val examples:   {len(val_dataset):,}')

# Verify the input/target offset
x_sample, y_sample = train_dataset[0]
print(f'\nInput shape:  {x_sample.shape}  (block_size = {block_size})')
print(f'Target shape: {y_sample.shape}')
print(f'\nInput tokens (first 10):  {x_sample[:10].tolist()}')
print(f'Target tokens (first 10): {y_sample[:10].tolist()}')
print(f'\nNotice: target[i] == input[i+1]? {(y_sample[:-1] == x_sample[1:]).all().item()}')
print('The target is the input shifted by one position. Every position predicts the next token.')

In [None]:
# --- Create DataLoaders ---
batch_size = 16
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader   = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

# Verify batch shapes
x_batch, y_batch = next(iter(train_loader))
print(f'Batch input shape:  {x_batch.shape}  (batch_size, block_size)')
print(f'Batch target shape: {y_batch.shape}')
print(f'\nEach batch = {batch_size} sequences x {block_size} positions = {batch_size * block_size:,} next-token predictions')
print(f'Total batches per epoch: {len(train_loader):,}')

**What just happened:**

1. The full Shakespeare text was tokenized into a single sequence of ~300K BPE tokens.
2. The `TextDataset` slices this into overlapping context windows. Each window is 256 tokens. The input is `tokens[i:i+256]`, the target is `tokens[i+1:i+257]`—shifted by one position.
3. Each position in a training example is an independent next-token prediction. One sequence of length 256 produces 256 predictions in a single forward pass. This is why training is efficient.
4. The DataLoader creates batches of 16 sequences. Each batch produces `16 * 256 = 4,096` next-token predictions.

The Dataset/DataLoader pattern is the same one from your CNN work. The only new idea is the one-position offset between input and target.
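
The dataset sizes follow from simple arithmetic. The sketch below assumes a round figure of 300,000 tokens (the real count is whatever `len(tokens)` printed) and reuses the notebook's `block_size`, `batch_size`, and 90/10 split.

```python
import math

n_tokens = 300_000      # assumed round figure; substitute len(tokens) for the real count
block_size = 256
batch_size = 16

n_train = int(0.9 * n_tokens)                       # 90/10 split
train_examples = n_train - block_size               # one window per valid start index
val_examples = (n_tokens - n_train) - block_size
batches_per_epoch = math.ceil(train_examples / batch_size)   # DataLoader keeps the last partial batch

print(f"train examples:    {train_examples:,}")
print(f"val examples:      {val_examples:,}")
print(f"batches per epoch: {batches_per_epoch:,}")
```

Because neighbouring windows overlap in all but one token, the number of examples is close to the number of tokens, not the number of tokens divided by `block_size`.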

---

## Exercise 2: First Forward Pass — Verify the Initial Loss

**Type: GUIDED** — All code is provided. Predict before running.

Before training, you need to verify the pipeline works. The sanity check: if the untrained model assigns equal probability to all 50,257 tokens, the cross-entropy loss should be $\ln(50257) \approx 10.82$.

**Before running, predict:**
1. What should the initial loss be? (Calculate: $-\ln(1/V)$ where $V = 50{,}257$)
2. The logits have shape `(B, T, V)`. What reshape is needed for `nn.CrossEntropyLoss`?
3. After the forward pass, will the generated text be English or gibberish?

In [None]:
# --- Create model and run first forward pass ---
torch.manual_seed(42)
model = GPT(config)
model = model.to(device)

n_params = sum(p.numel() for p in model.parameters())
print(f'Model parameters: {n_params:,}')

# Get one batch
x_batch, y_batch = next(iter(train_loader))
x_batch, y_batch = x_batch.to(device), y_batch.to(device)

# Forward pass
model.eval()
with torch.no_grad():
    logits, loss = model(x_batch, y_batch)

print(f'\nLogits shape: {logits.shape}   (B, T, vocab_size)')
print(f'Initial loss: {loss.item():.4f}')
print(f'Expected loss (random): ln(50257) = {math.log(50257):.4f}')
print(f'Difference: {abs(loss.item() - math.log(50257)):.4f}')

# Is the loss close to ln(V)?
is_close = abs(loss.item() - math.log(50257)) < 0.5
print(f'\nInitial loss close to ln(V)? {is_close}')
if is_close:
    print('The model is correctly initialized: near-uniform predictions over the vocabulary.')
    print('The pipeline is wired correctly. Ready to train.')

In [None]:
# --- Generate from the untrained model ---
prompt = enc.encode("ROMEO:")
idx = torch.tensor([prompt], device=device)

model.eval()
output = model.generate(idx, max_new_tokens=100, temperature=1.0)
print('Generated text from UNTRAINED model:')
print('-' * 50)
print(enc.decode(output[0].tolist()))
print('-' * 50)
print('\nRandom gibberish. Every weight is random. The architecture works')
print('but the model has never seen any text. Training will change this.')

**What just happened:**

The initial loss is close to $\ln(50257) \approx 10.82$, which confirms the model is assigning roughly equal probability to all tokens. This sanity check plays the same role as the "parameter count as architecture verification" pattern: one number confirms the entire pipeline (tokenization, dataset, model, loss) is wired correctly.

The generated text is complete gibberish because every weight is random. The model produces valid token IDs through a working autoregressive loop, but the tokens are meaningless. Training will change this.
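
The $\ln(V)$ figure can be reproduced without a model. The sketch below is a plain-Python cross-entropy for a single prediction (the function `cross_entropy` here is a hand-written illustration, not the PyTorch call). It feeds in all-equal logits, which softmax turns into a uniform distribution over the vocabulary.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one prediction: -log softmax(logits)[target]."""
    m = max(logits)                                            # subtract the max for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


V = 50257
uniform_logits = [0.0] * V          # equal logits -> uniform softmax
loss = cross_entropy(uniform_logits, target=123)   # any target gives the same loss

print(f"loss = {loss:.4f}, ln(V) = {math.log(V):.4f}")
```

A real untrained model will not match this exactly, since its logits are small random numbers rather than exact zeros, which is why the check above uses a tolerance.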

---

## Exercise 3: Implement the Training Loop with LR Scheduling

**Type: SUPPORTED** — Skeleton provided with TODO markers. You fill in the key parts.

The training loop is the same heartbeat from Series 2: forward, loss, backward, step. Three things are new:
- **Learning rate scheduling** (warmup + cosine decay)
- **Gradient clipping** (one line, essential safety net)
- **Manual LR update** via `param_group['lr']`

The `get_lr` function implements the warmup + cosine decay schedule from the lesson. The warmup phase linearly ramps from `min_lr` to `max_lr` over the first `warmup_steps`. The cosine decay phase gradually decreases from `max_lr` to `min_lr` over the remaining steps.

<details>
<summary>💡 Solution</summary>

The key insight is that the LR schedule has two phases, and we check which phase we're in based on the current step:

- **Warmup (step < warmup_steps):** Linear interpolation from `min_lr` to `max_lr`. The fraction `step / warmup_steps` goes from 0 to 1.
- **Cosine decay (step >= warmup_steps):** The `progress` variable tracks how far through the decay phase we are (0 to 1). The cosine formula `0.5 * (1 + cos(pi * progress))` smoothly decays from 1 to 0.

```python
# get_lr function:
def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * (step / warmup_steps)
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# In the training loop:
# Update LR:
lr = get_lr(step, warmup_steps, max_steps, max_lr=6e-4, min_lr=6e-5)
for param_group in optimizer.param_groups:
    param_group['lr'] = lr

# Forward:
logits, loss = model(x, y)

# Backward + clip + step:
optimizer.zero_grad()
loss.backward()
clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Common mistake: putting `clip_grad_norm_` before `loss.backward()`. The gradients must exist before you can clip them.

</details>

In [None]:
# --- Learning rate schedule ---

def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
    """Cosine learning rate schedule with linear warmup."""
    # TODO: Implement the warmup phase
    # If step < warmup_steps, linearly ramp from min_lr to max_lr
    # Hint: the fraction (step / warmup_steps) goes from 0 to 1

    # TODO: Implement the cosine decay phase
    # 1. Compute progress: how far through the decay phase (0 to 1)
    # 2. Use: min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
    pass


# Hyperparameters (from nanoGPT — known-good values)
max_lr = 6e-4
min_lr = 6e-5       # 10% of peak
warmup_steps = 100
max_steps = 3000    # Enough steps to see real learning on Colab

# Visualize the schedule to verify your implementation
lrs = [get_lr(s, warmup_steps, max_steps, max_lr, min_lr) for s in range(max_steps)]
plt.plot(lrs, linewidth=2)
plt.xlabel('Step')
plt.ylabel('Learning Rate')
plt.title('LR Schedule: Warmup + Cosine Decay')
plt.axvline(x=warmup_steps, color='gray', linestyle='--', alpha=0.5, label=f'Warmup ends (step {warmup_steps})')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

print(f'LR at step 0: {lrs[0]:.2e} (should be close to min_lr = {min_lr:.2e})')
print(f'LR at step {warmup_steps}: {lrs[warmup_steps]:.2e} (should be max_lr = {max_lr:.2e})')
print(f'LR at step {max_steps-1}: {lrs[-1]:.2e} (should be close to min_lr = {min_lr:.2e})')

In [None]:
# --- The complete training loop ---

# Fresh model
torch.manual_seed(42)
model = GPT(config)
model = model.to(device)

# AdamW optimizer — the transformer default
optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr,
    betas=(0.9, 0.95), weight_decay=0.1
)

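# Create an infinite iterator over the training batches.
# Note: itertools.cycle caches every batch it yields and replays the cached
# batches in the same order, so a second pass would not be reshuffled.
# With max_steps = 3000 this run never finishes the first pass.
train_iter = iter(cycle(train_loader))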

# Track losses for plotting
train_losses = []
step_list = []

print(f'Training for {max_steps} steps on {device}...')
print(f'Batch size: {batch_size}, Block size: {block_size}')
print(f'Predictions per step: {batch_size * block_size:,}')
print()

for step in range(max_steps):
    model.train()

    # TODO: Update learning rate for this step
    # 1. Call get_lr to get the learning rate for this step
    # 2. Set it on the optimizer: for param_group in optimizer.param_groups:
    #        param_group['lr'] = lr

    # Get batch
    x, y = next(train_iter)
    x, y = x.to(device), y.to(device)

    # TODO: Forward pass — get logits and loss from the model
    # Hint: logits, loss = model(x, y)

    # TODO: Backward pass + gradient clipping + optimizer step
    # 1. optimizer.zero_grad()
    # 2. loss.backward()
    # 3. clip_grad_norm_(model.parameters(), max_norm=1.0)
    # 4. optimizer.step()

    # Track losses
    train_losses.append(loss.item())
    step_list.append(step)

    # Log every 100 steps
    if step % 100 == 0:
        print(f'step {step:5d} | loss {loss.item():.4f} | lr {lr:.2e}')

print(f'\nTraining complete.')
print(f'Final loss: {train_losses[-1]:.4f}')
print(f'Initial loss was: {train_losses[0]:.4f}')

In [None]:
# --- Plot the training loss ---
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Raw loss
axes[0].plot(step_list, train_losses, alpha=0.3, linewidth=0.5, color='#ff6b6b')
# Smoothed loss (running average)
window = 50
smoothed = [sum(train_losses[max(0, i-window+1):i+1]) / min(i+1, window) for i in range(len(train_losses))]
axes[0].plot(step_list, smoothed, linewidth=2, color='#ff6b6b', label='Smoothed')
axes[0].set_xlabel('Step')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss')
axes[0].legend()
axes[0].grid(alpha=0.3)

# LR schedule overlaid
axes[1].plot(step_list, lrs[:len(step_list)], linewidth=2, color='#4ecdc4')
axes[1].set_xlabel('Step')
axes[1].set_ylabel('Learning Rate')
axes[1].set_title('Learning Rate Schedule')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print('The raw loss is jagged because some batches are harder than others.')
print('The smoothed line shows the real trend. The loss should drop fast initially,')
print('then slow down—the easy wins (common words, basic grammar) come first.')

**What you just built:**

The same training loop heartbeat from Series 2 (forward, loss, backward, step), plus three new instruments:
1. **`get_lr()`** — Dynamic Goldilocks. Gentle warmup for fragile random weights, aggressive learning in the middle, careful convergence at the end.
2. **`clip_grad_norm_()`** — One-line safety net. Bounds the gradient magnitude so one bad batch cannot undo hundreds of steps of progress.
3. **Manual LR update** via `param_group['lr']` — Transparent. You see exactly what the learning rate is at every step.

Notice the loss curve: jagged (batch-to-batch variance is normal for language models) but trending downward. The smoothed line reveals the real signal.
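
To make the "safety net" concrete, here is a plain-Python sketch of clipping by global norm. It illustrates the idea behind `clip_grad_norm_` and is not PyTorch's implementation. The helper name `clip_by_global_norm`, the toy gradients, and the small epsilon are assumptions made for the example.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one shared factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale gradients up
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm


grads = [[3.0, 4.0], [12.0]]        # toy gradients for two "parameters"
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))

print(f"norm before: {norm_before:.3f}")
print(f"norm after:  {norm_after:.3f}")
```

Every gradient is multiplied by the same factor, so the update direction is preserved and only its length is capped. Gradients whose norm is already below `max_norm` pass through unchanged.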

---

## Exercise 4: Add Evaluation and Text Generation During Training

**Type: SUPPORTED** — Skeleton provided with TODO markers.

Loss numbers tell you training is working. Generated text shows you *what* the model has learned. In this exercise, you'll add:
- Validation loss evaluation (the scissors pattern from Series 1)
- Periodic text generation to watch the model learn

<details>
<summary>💡 Solution</summary>

The evaluation loop is the same pattern from Series 2: `model.eval()`, `torch.no_grad()`, iterate over val batches, average the loss, then back to `model.train()`.

```python
# Evaluation:
model.eval()
val_loss_total = 0.0
val_iter = iter(val_loader)
with torch.no_grad():
    for _ in range(val_steps):
        xv, yv = next(val_iter)
        xv, yv = xv.to(device), yv.to(device)
        _, vloss = model(xv, yv)
        val_loss_total += vloss.item()
avg_val_loss = val_loss_total / val_steps

# Generation:
prompt_tokens = enc.encode("ROMEO:")
idx = torch.tensor([prompt_tokens], device=device)
output = model.generate(idx, max_new_tokens=100, temperature=0.8)
generated_text = enc.decode(output[0].tolist())
```

Key points:
- Use `model.eval()` before evaluation and generation (disables dropout)
- Use `torch.no_grad()` for evaluation (saves memory)
- Return to `model.train()` after
- Temperature 0.8 produces more focused text than 1.0
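
The effect of temperature can be seen on a handful of logits. The sketch below is a plain-Python softmax with made-up logits for three tokens (the helper `softmax_with_temperature` is illustrative). It divides the logits by the temperature before normalizing, as `generate` does.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                              # subtract the max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]            # made-up logits for three tokens
for temp in (1.0, 0.8):
    probs = softmax_with_temperature(logits, temp)
    print(f"temperature {temp}: {[round(p, 3) for p in probs]}")
```

Dividing by a temperature below 1 widens the gaps between logits, so the most likely token takes a larger share of the probability mass and sampling becomes less random.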

</details>

In [None]:
# --- Training loop with evaluation and generation ---

# Fresh model (start over to see the full progression)
torch.manual_seed(42)
model = GPT(config)
model = model.to(device)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr,
    betas=(0.9, 0.95), weight_decay=0.1
)

train_iter = iter(cycle(train_loader))

# Tracking
train_losses = []
val_losses = []
val_steps_list = []
generated_samples = []  # (step, loss, text) tuples

# Evaluation settings
eval_interval = 500      # Evaluate every N steps
generate_interval = 500  # Generate text every N steps
val_steps = 20           # Number of val batches to average over

print(f'Training for {max_steps} steps with evaluation every {eval_interval} steps...')
print()

for step in range(max_steps):
    model.train()

    # Update learning rate
    lr = get_lr(step, warmup_steps, max_steps, max_lr, min_lr)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # Get batch and train
    x, y = next(train_iter)
    x, y = x.to(device), y.to(device)
    logits, loss = model(x, y)
    optimizer.zero_grad()
    loss.backward()
    clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    train_losses.append(loss.item())

    # --- Evaluation (every eval_interval steps) ---
    if step % eval_interval == 0:
        # TODO: Evaluate on validation set
        # 1. Set model.eval()
        # 2. Initialize val_loss_total = 0.0
        # 3. Create a fresh val iterator: val_iter = iter(val_loader)
        # 4. In a torch.no_grad() block, loop val_steps times:
        #    - Get a batch: xv, yv = next(val_iter)
        #    - Move to device
        #    - Forward pass: _, vloss = model(xv, yv)
        #    - Accumulate: val_loss_total += vloss.item()
        # 5. Compute avg_val_loss = val_loss_total / val_steps
        avg_val_loss = 0.0  # Replace with your computation

        val_losses.append(avg_val_loss)
        val_steps_list.append(step)
        print(f'step {step:5d} | train loss {loss.item():.4f} | val loss {avg_val_loss:.4f} | lr {lr:.2e}')

    # --- Text generation (every generate_interval steps) ---
    if step % generate_interval == 0:
        model.eval()
        # TODO: Generate text from the prompt "ROMEO:"
        # 1. Encode the prompt: prompt_tokens = enc.encode("ROMEO:")
        # 2. Create tensor: idx = torch.tensor([prompt_tokens], device=device)
        # 3. Generate: output = model.generate(idx, max_new_tokens=100, temperature=0.8)
        # 4. Decode: generated_text = enc.decode(output[0].tolist())
        generated_text = ""  # Replace with your generation

        generated_samples.append((step, loss.item(), generated_text))
        print(f'\n--- Step {step} (loss={loss.item():.2f}) ---')
        print(generated_text[:200])  # First 200 chars
        print()

# Final generation
model.eval()
prompt_tokens = enc.encode("ROMEO:")
idx = torch.tensor([prompt_tokens], device=device)
output = model.generate(idx, max_new_tokens=200, temperature=0.8)
print('\n' + '=' * 60)
print(f'FINAL (step {max_steps}, loss={train_losses[-1]:.2f}):')
print('=' * 60)
print(enc.decode(output[0].tolist()))
print('=' * 60)

In [None]:
# --- Plot training and validation loss ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Smoothed train loss + val loss
window = 50
smoothed_train = [sum(train_losses[max(0, i-window+1):i+1]) / min(i+1, window) for i in range(len(train_losses))]
axes[0].plot(range(len(train_losses)), smoothed_train, linewidth=2, color='#ff6b6b', label='Train (smoothed)', alpha=0.8)
axes[0].plot(val_steps_list, val_losses, 'o-', linewidth=2, color='#4ecdc4', label='Validation', markersize=6)
axes[0].set_xlabel('Step')
axes[0].set_ylabel('Loss')
axes[0].set_title('Train vs Validation Loss')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Zoom into last half
mid = len(train_losses) // 2
axes[1].plot(range(mid, len(train_losses)), smoothed_train[mid:], linewidth=2, color='#ff6b6b', label='Train (smoothed)', alpha=0.8)
val_mask = [i for i, s in enumerate(val_steps_list) if s >= mid]
axes[1].plot([val_steps_list[i] for i in val_mask], [val_losses[i] for i in val_mask],
             'o-', linewidth=2, color='#4ecdc4', label='Validation', markersize=6)
axes[1].set_xlabel('Step')
axes[1].set_ylabel('Loss')
axes[1].set_title('Train vs Validation Loss (Zoomed)')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print('Watch for the scissors pattern: when train loss keeps dropping but val loss')
print('starts rising, the model is memorizing rather than learning.')
print(f'With ~300K tokens and ~{n_params//1_000_000}M params, this will happen eventually.')

**What you just built:**

A complete training pipeline with monitoring:
- **Validation loss** lets you detect overfitting (the scissors pattern from Series 1).
- **Generated text** shows qualitative progress. The biggest leap is from gibberish to real words (loss ~10 to ~4). Later improvements from ~3 to ~2 are subtle. The relationship between loss and text quality is nonlinear: equal drops in loss do not produce equal gains in readability.
- **Loss smoothing** reveals the real trend underneath batch-to-batch noise. Language model loss is inherently noisier than MNIST because some batches contain predictable sequences and others contain rare words.
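
A trailing mean over a window of `w` steps should average exactly the last `w` values, or fewer near the start of the run. The sketch below is one way to write that with a running sum. The function name `trailing_mean` and the toy values are illustrative.

```python
def trailing_mean(values, window):
    """Mean of the last `window` values at each index (fewer at the start)."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]     # drop the value that just left the window
        out.append(running / min(i + 1, window))
    return out


print(trailing_mean([4.0, 2.0, 6.0, 8.0], window=2))
```

The running sum keeps the cost linear in the number of steps, whereas re-summing a slice at every index costs `window` additions per step.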

---

## Exercise 5: Train, Diagnose, and Iterate

**Type: INDEPENDENT** — Problem specification only. You write the code.

You now have all the tools. Your task:

1. **Experiment with hyperparameters.** Pick ONE thing to change and observe the effect:
   - What happens if you remove LR warmup entirely (set `warmup_steps = 0`)?
   - What happens if you use a constant LR of `6e-4` instead of the schedule?
   - What happens if you remove gradient clipping?
   - What happens if you increase `max_lr` to `3e-3` (5x higher)?

2. **Compare the loss curves.** Plot your experimental run against the baseline from Exercise 4. Which is better? Why?

3. **Diagnose and explain.** For each experiment, explain what happened in terms of the lesson's mental models:
   - Did the warmup prevent early instability?
   - Did the cosine decay help with convergence?
   - Did gradient clipping prevent a loss spike?

**Deliverable:** At least ONE experiment with a comparison plot and a written explanation of what you observed.

<details>
<summary>💡 Solution</summary>

The key insight is that each component of the training recipe (warmup, cosine decay, gradient clipping) solves a specific problem. The experiment reveals which problem:

- **No warmup:** Random transformer weights are fragile. Without warmup, the first few steps can push the model into a bad region. You may see a loss spike in the first 100 steps, or the loss may never recover.
- **Constant LR = 6e-4:** At this scale, a constant LR often works but converges to a worse final loss than the cosine schedule. The schedule gives better final performance because it takes smaller steps as the model approaches a good solution.
- **No gradient clipping:** On TinyShakespeare at this model size, you may not see a difference—clipping is a safety net that activates rarely. Try a higher LR (3e-3) without clipping to see it matter.
- **max_lr = 3e-3:** Too aggressive. Likely destabilizes training. With clipping, it might survive but learn poorly. Without clipping, it may produce NaN loss.

```python
# Example: compare baseline vs no-warmup
def train_experiment(warmup_steps_exp, max_lr_exp, use_clipping, label):
    torch.manual_seed(42)
    model_exp = GPT(config).to(device)
    opt_exp = torch.optim.AdamW(model_exp.parameters(), lr=max_lr_exp,
                                 betas=(0.9, 0.95), weight_decay=0.1)
    exp_iter = iter(cycle(train_loader))
    losses_exp = []

    for step in range(max_steps):
        model_exp.train()
        lr = get_lr(step, warmup_steps_exp, max_steps, max_lr_exp, min_lr)
        for pg in opt_exp.param_groups:
            pg['lr'] = lr
        x, y = next(exp_iter)
        x, y = x.to(device), y.to(device)
        _, loss = model_exp(x, y)
        opt_exp.zero_grad()
        loss.backward()
        if use_clipping:
            clip_grad_norm_(model_exp.parameters(), max_norm=1.0)
        opt_exp.step()
        losses_exp.append(loss.item())
        if math.isnan(loss.item()):
            print(f'{label}: NaN at step {step}, stopping.')
            break
    return losses_exp

# Run experiments
losses_baseline = train_losses  # from Exercise 4
losses_no_warmup = train_experiment(0, 6e-4, True, 'No warmup')

# Plot comparison
plt.figure(figsize=(10, 5))
window = 50
smooth = lambda L: [sum(L[max(0,i-window+1):i+1])/min(i+1,window) for i in range(len(L))]
plt.plot(smooth(losses_baseline), label='Baseline (warmup=100)', linewidth=2)
plt.plot(smooth(losses_no_warmup), label='No warmup', linewidth=2)
plt.xlabel('Step')
plt.ylabel('Loss (smoothed)')
plt.title('Effect of Removing Warmup')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
```

The experiment makes the lesson's mental model concrete: you can *see* the effect of each training recipe component instead of just reading about it.

</details>

In [None]:
# --- Your experiment here ---
# Pick one hyperparameter to change, train, and compare to the baseline.
#
# Suggested structure:
# 1. Write a training function that takes the hyperparameter as an argument
# 2. Run the baseline and your experiment
# 3. Plot both loss curves on the same axes
# 4. Print your observations



In [None]:
# --- Plot your comparison ---



**Your observations:**

(Write your explanation here. What did you change? What happened? Why, in terms of the lesson's mental models?)



---

## Key Takeaways

1. **The training loop is the same loop. Always.** Forward, loss, backward, step. From linear regression to GPT, the algorithm never changes. The data format changes, the model changes, the scale changes. The loop does not.

2. **Three new tools for transformer-scale training.** Text dataset preparation (input/target offset for next-token prediction), LR scheduling (warmup + cosine decay), and gradient clipping (one line, essential safety net). Everything else is the same heartbeat from Series 2.

3. **Cross-entropy scales to any vocabulary size.** Same formula, same PyTorch call. Reshape logits from (B, T, V) to (B\*T, V), targets from (B, T) to (B\*T). The vocabulary size is just another number.

4. **Loss-to-text-quality is nonlinear.** The biggest qualitative leaps happen early (gibberish to words, words to phrases). Later improvements are subtle. Do not expect the same dramatic progress all the way down.

5. **The initial loss sanity check confirms the entire pipeline.** If $\text{loss} \approx \ln(V)$, everything is wired correctly: tokenization, dataset, model, and loss computation.
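
One way to read takeaways 4 and 5 together is through perplexity, `exp(loss)`, which can be interpreted as the effective number of equally likely tokens the model is choosing among. Perplexity is not used elsewhere in this notebook and appears here only as an interpretive aid. The loss values below are the rough milestones mentioned earlier, not measured results.

```python
import math


def perplexity(loss):
    """Convert a cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


for loss in (math.log(50257), 4.0, 3.0, 2.0):
    print(f"loss {loss:5.2f} -> perplexity {perplexity(loss):10.1f}")
```

Each one-point drop in loss divides perplexity by the same factor of about 2.7, but the absolute number of candidates removed shrinks sharply as loss falls, which is one reason the early part of training looks so much more dramatic in generated text.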