# Part 5.2: Language Models -- The Formula 1 Edition

Language models are the foundation of modern AI. At their core, they do something deceptively simple: **predict the next token in a sequence**. Yet from this single objective emerges the ability to write essays, translate languages, answer questions, and even reason about code. In notebook 17 you built a Transformer from scratch -- now we'll see how that architecture becomes GPT, BERT, and the large language models that are reshaping every field.

**F1 analogy:** A language model is like an F1 race prediction system. Given everything that has happened so far in a race -- "Safety car deployed on lap 12, Verstappen pits on lap 13, rain starts on lap 14..." -- it predicts the most likely next event. Just as a language model assigns probabilities to every possible next word, a race predictor assigns probabilities to every possible next event: pit stop, overtake, mechanical failure, safety car. The better the model, the better its predictions.

---

## Learning Objectives

By the end of this notebook, you should be able to:

- [ ] Explain what a language model is and why next-token prediction is so powerful
- [ ] Implement Byte Pair Encoding (BPE) tokenization from scratch
- [ ] Describe the GPT (decoder-only) architecture and why it suits generation
- [ ] Describe the BERT (encoder-only) architecture and why it suits understanding
- [ ] Train a small character-level GPT and generate text from it
- [ ] Explain scaling laws and how model size, data, and compute interact
- [ ] Choose the right architecture (GPT vs BERT) for a given task

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import math
from collections import Counter

%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
torch.manual_seed(42)
np.random.seed(42)

---

## 1. What Is a Language Model?

### Intuitive Explanation

Imagine someone starts a sentence: *"The cat sat on the ___"*. Your brain instantly predicts likely next words: "mat", "chair", "roof". You assign high probability to sensible continuations and low probability to nonsensical ones like "democracy" or "purple".

A **language model** does exactly this -- it learns a probability distribution over the next token given all previous tokens:

$$P(w_t \mid w_1, w_2, \ldots, w_{t-1})$$

#### Breaking down the formula:

| Component | Meaning | Intuition | F1 Analogy |
|-----------|---------|----------|------------|
| $w_t$ | The next token to predict | The word we're guessing | The next race event to predict |
| $w_1, \ldots, w_{t-1}$ | All previous tokens | The context we've seen so far | Everything that has happened in the race so far |
| $P(\cdot \mid \cdot)$ | Conditional probability | How likely is this word *given* what came before? | How likely is a pit stop *given* the current race state? |

**What this means:** A language model is a probability machine. Given any prefix of text, it outputs a probability for every possible next token in its vocabulary. The better the model, the higher probability it assigns to tokens that actually make sense.
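
A minimal sketch of this "probability machine", assuming a toy corpus and a bigram simplification (conditioning on only the previous token instead of the full prefix). The function name `bigram_next_token_probs` and the corpus are illustrative and not part of the notebook's later code:

```python
from collections import Counter, defaultdict

def bigram_next_token_probs(tokens):
    """Estimate P(next | previous) from adjacent-token counts (bigram simplification)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    probs = {}
    for prev, ctr in counts.items():
        total = sum(ctr.values())
        probs[prev] = {tok: c / total for tok, c in ctr.items()}
    return probs

# Toy corpus (an assumption for illustration only)
toy_tokens = "the cat sat on the mat the cat ran".split()
next_probs = bigram_next_token_probs(toy_tokens)

# Distribution over the token that follows "the"
for tok, p in sorted(next_probs["the"].items(), key=lambda kv: -kv[1]):
    print(f"P({tok!r} | 'the') = {p:.2f}")
```

A neural language model replaces the count table with a network and conditions on the whole prefix, but the output has the same shape: one probability per candidate next token, summing to 1.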

**F1 analogy:** Think of the language model as a race prediction engine. Given the sequence of events so far -- "Lap 1: Verstappen leads, Lap 2: Norris overtakes Leclerc, Lap 3: ..." -- it assigns a probability to every possible next event. The model trained on thousands of historical races learns patterns: safety cars often trigger pit stops, tire degradation increases over stint length, drivers on fresh tires tend to overtake. This is *autoregressive prediction* applied to racing.

### Why This Is Useful

Language modeling isn't just about prediction -- it's a gateway to many capabilities:

| Capability | How LM Enables It | F1 Parallel |
|------------|------------------|-------------|
| **Text Generation** | Sample from the predicted distribution repeatedly | Generate race commentary lap by lap |
| **Translation** | Model P(target language \| source language) | Translate telemetry data into strategy calls |
| **Summarization** | Generate condensed version conditioned on original | Race report: 58 laps summarized in 3 paragraphs |
| **Question Answering** | Generate answer conditioned on context + question | "When should we pit?" given current race state |
| **Code Writing** | Code is just another language to model | Generate strategy algorithms from requirements |
| **Reasoning** | Chain-of-thought emerges from predicting logical next steps | Multi-step strategic reasoning about race scenarios |

### Brief History: From N-grams to Transformers

| Era | Approach | Key Idea | Limitation | F1 Parallel |
|-----|----------|----------|------------|-------------|
| 1990s-2000s | **N-gram models** | Count word sequences in data | Fixed context window (2-5 words) | Strategy based on last 2-3 laps only |
| 2010-2015 | **RNN/LSTM** | Recurrent hidden state carries context | Sequential processing, vanishing gradients | Pit wall processing one car at a time |
| 2017-present | **Transformers** | Self-attention over all positions | Quadratic cost in sequence length | Full parallel telemetry processing |
| 2018+ | **Large LMs (GPT, BERT)** | Scale Transformers + massive data | Compute-intensive training | Strategy AI trained on decades of race data |

In [None]:
# Visualization: Next-token probability distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: probability distribution for a predictable context
context1 = '"The cat sat on the ___"'
words1 = ['mat', 'floor', 'chair', 'roof', 'table', 'bed', 'dog', 'moon', 'idea', 'purple']
probs1 = [0.30, 0.18, 0.15, 0.10, 0.08, 0.06, 0.05, 0.03, 0.03, 0.02]

colors1 = ['#2ecc71' if p > 0.1 else '#3498db' if p > 0.04 else '#e74c3c' for p in probs1]
bars1 = axes[0].barh(range(len(words1)), probs1, color=colors1, edgecolor='white', linewidth=0.5)
axes[0].set_yticks(range(len(words1)))
axes[0].set_yticklabels(words1, fontsize=11)
axes[0].set_xlabel('Probability', fontsize=12)
axes[0].set_title(f'Next word after:\n{context1}', fontsize=13, fontweight='bold')
axes[0].invert_yaxis()
axes[0].grid(True, alpha=0.3)
for i, p in enumerate(probs1):
    axes[0].text(p + 0.005, i, f'{p:.0%}', va='center', fontsize=10)

# Right: probability distribution for an ambiguous context
context2 = '"I need to go to the ___"'
words2 = ['store', 'bank', 'doctor', 'office', 'gym', 'park', 'school', 'airport', 'dentist', 'library']
probs2 = [0.15, 0.13, 0.12, 0.11, 0.10, 0.09, 0.08, 0.08, 0.07, 0.07]

colors2 = ['#2ecc71' if p > 0.1 else '#3498db' if p > 0.04 else '#e74c3c' for p in probs2]
bars2 = axes[1].barh(range(len(words2)), probs2, color=colors2, edgecolor='white', linewidth=0.5)
axes[1].set_yticks(range(len(words2)))
axes[1].set_yticklabels(words2, fontsize=11)
axes[1].set_xlabel('Probability', fontsize=12)
axes[1].set_title(f'Next word after:\n{context2}', fontsize=13, fontweight='bold')
axes[1].invert_yaxis()
axes[1].grid(True, alpha=0.3)
for i, p in enumerate(probs2):
    axes[1].text(p + 0.005, i, f'{p:.0%}', va='center', fontsize=10)

fig.suptitle('Language Models Output Probability Distributions Over Next Tokens', 
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("Key insight: The more a context constrains what comes next, the more peaked (confident) the distribution")
print("The model's job is to learn these distributions from data")

---

## 2. Tokenization

### Intuitive Explanation

Before a language model can process text, it needs to break text into discrete units called **tokens**. But how should we split text?

**The vocabulary problem:** If we use whole words, we need a massive vocabulary (English has 170,000+ words, plus names, technical terms, misspellings...). If we use single characters, sequences become very long and the model must learn spelling from scratch.

**Subword tokenization** is the elegant middle ground: common words stay whole ("the", "and"), while rare words are split into meaningful pieces ("un" + "happi" + "ness").

**F1 analogy:** Tokenization is like breaking radio messages into meaningful units. The message "Box box box, we're switching to hards" could be tokenized at different granularities:
- **Character-level:** B, o, x, (space), b, o, x, ... -- too fine-grained, loses meaning
- **Word-level:** "Box", "box", "box", "we're", "switching", "to", "hards" -- works but cannot handle rare compound terms like "undercut-opportunity"
- **Subword (BPE):** "Box", "box", "box", "we", "'re", "switch", "ing", "to", "hard", "s" -- the sweet spot

| Approach | Example: "unhappiness" | Vocab Size | Pros | Cons |
|----------|----------------------|------------|------|------|
| **Character-level** | u, n, h, a, p, p, i, n, e, s, s | ~256 | Handles any text | Very long sequences |
| **Word-level** | unhappiness | 100K+ | Semantically meaningful | Can't handle unseen words |
| **Subword (BPE)** | un, happi, ness | 30K-50K | Best of both worlds | Requires training |
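
The sequence-length side of this trade-off can be checked directly. The sketch below assumes the simplest possible splitters (every character, or whitespace only), which is cruder than any production tokenizer:

```python
message = "Box box box, we're switching to hards"

char_tokens = list(message)      # character-level: every character is a token
word_tokens = message.split()    # naive word-level: split on whitespace only

print(f"Character-level: {len(char_tokens)} tokens, {len(set(char_tokens))} distinct")
print(f"Word-level:      {len(word_tokens)} tokens, {len(set(word_tokens))} distinct")
print("Word-level vocabulary entries:", sorted(set(word_tokens)))
```

Note that the naive word splitter treats "Box", "box", and "box," as three unrelated vocabulary entries, which is one reason word-level vocabularies grow so quickly.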

### Byte Pair Encoding (BPE): The Dominant Tokenization Algorithm

BPE is beautifully simple. It starts with individual characters and repeatedly merges the most frequent pair:

**Algorithm:**
1. Start with a vocabulary of all individual characters in the text
2. Count all adjacent pairs of tokens in the text
3. Merge the most frequent pair into a new token
4. Repeat steps 2-3 until desired vocabulary size is reached

**What this means:** BPE discovers the natural building blocks of language. Frequent subwords like "ing", "tion", "un" emerge automatically from the data. Common words end up as single tokens, while rare words get split into recognizable pieces.

**F1 analogy:** BPE is like how F1 engineers develop shorthand for common telemetry patterns. At first, everything is described in raw terms ("throttle 100%, brake 0%, speed increasing"). Over time, frequent patterns get their own names: "full traction zone," "brake point," "power unit deployment." The most common patterns become single tokens in the team's vocabulary, while rare events are still described by combining known sub-patterns.

### Deep Dive: Why BPE Works So Well

| Property | Why It Matters | F1 Parallel |
|----------|---------------|-------------|
| **Data-driven** | Learns from actual text, not handcrafted rules | Vocabulary emerges from real race data, not imposed |
| **Open vocabulary** | Any text can be encoded (falls back to characters) | Any radio message can be parsed, even unusual ones |
| **Compression** | Common patterns get short representations | "box" becomes a single token once it appears often enough |
| **Morphological** | Naturally discovers prefixes, suffixes, stems | Discovers "under-cut", "over-cut", "out-braked" |
| **Language-agnostic** | Works for English, Chinese, code, math -- anything | Works for team radio, timing data, technical reports |

#### Common Misconceptions

| Misconception | Reality |
|---------------|--------|
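| BPE tokens always line up with whole words | Merges operate on characters/bytes inside each word (most implementations pre-split text on whitespace first), so tokens are often subword pieces |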
| All words are single tokens | Only frequent words; rare ones are split |
| Tokenization is trivial | It significantly impacts model performance |

In [None]:
# Visualization: BPE step by step on a small corpus
def visualize_bpe_steps(corpus, num_merges=8):
    """
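    Run BPE on a small corpus and record each merge step.
    
    Args:
        corpus: list of (symbols, frequency) pairs, where symbols is a word's
            characters followed by an end-of-word marker
        num_merges: maximum number of merge operations to perform
    
    Returns:
        merge_history: list of (pair, pair_count, merged_token), one per merge
        word_freqs: dict mapping each word's final token tuple to its frequency
    """
    # Initialize: words arrive pre-split into symbols; store them as hashable tuples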
    word_freqs = {}
    for word, freq in corpus:
        word_freqs[tuple(word)] = freq
    
    merge_history = []
    
    for step in range(num_merges):
        # Count all adjacent pairs
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            for i in range(len(word) - 1):
                pair_counts[(word[i], word[i+1])] += freq
        
        if not pair_counts:
            break
        
        # Find most frequent pair
        best_pair = pair_counts.most_common(1)[0]
        pair, count = best_pair
        merged = pair[0] + pair[1]
        
        merge_history.append((pair, count, merged))
        
        # Apply merge
        new_word_freqs = {}
        for word, freq in word_freqs.items():
            new_word = []
            i = 0
            while i < len(word):
                if i < len(word) - 1 and word[i] == pair[0] and word[i+1] == pair[1]:
                    new_word.append(merged)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word_freqs[tuple(new_word)] = freq
        word_freqs = new_word_freqs
    
    return merge_history, word_freqs

# Small corpus with frequencies
corpus = [
    (list("low") + ["_"], 5),
    (list("lower") + ["_"], 2),
    (list("newest") + ["_"], 6),
    (list("widest") + ["_"], 3),
    (list("new") + ["_"], 2),
]

merge_history, final_vocab = visualize_bpe_steps(corpus, num_merges=10)

# Visualize merge history
fig, ax = plt.subplots(figsize=(12, 6))

steps = range(1, len(merge_history) + 1)
labels = [f'"{h[0][0]}" + "{h[0][1]}" \u2192 "{h[2]}"' for h in merge_history]
counts = [h[1] for h in merge_history]

colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(merge_history)))
bars = ax.barh(range(len(merge_history)), counts, color=colors, edgecolor='white')

ax.set_yticks(range(len(merge_history)))
ax.set_yticklabels([f'Step {i+1}: {labels[i]}' for i in range(len(merge_history))], fontsize=11)
ax.set_xlabel('Pair Frequency in Corpus', fontsize=12)
ax.set_title('BPE Merge Operations (Most Frequent Pairs First)', fontsize=14, fontweight='bold')
ax.invert_yaxis()
ax.grid(True, alpha=0.3)

for i, c in enumerate(counts):
    ax.text(c + 0.1, i, str(c), va='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nCorpus words: low(5), lower(2), newest(6), widest(3), new(2)")
print("\nFinal tokenization of each word:")
for word, freq in final_vocab.items():
    # Join with a visible separator so the token boundaries are actually shown
    print(f"  {' | '.join(word):20s} (frequency: {freq})")

### Implementing BPE from Scratch

In [None]:
class SimpleBPE:
    """
    A simple Byte Pair Encoding tokenizer built from scratch.
    
    This implements the core BPE algorithm: start with characters,
    iteratively merge the most frequent adjacent pair.
    """
    
    def __init__(self, vocab_size=256):
        self.vocab_size = vocab_size
        self.merges = []  # List of (pair, merged_token) in order
        self.vocab = {}   # token -> index
    
    def train(self, text, verbose=False):
        """
        Train BPE on a text corpus.
        
        Args:
            text: Training text string
            verbose: If True, print each merge step
        """
        # Split text into words, add end-of-word marker
        words = text.split()
        word_freqs = Counter(words)
        
        # Initialize: each word as tuple of characters + end marker
        splits = {}
        for word, freq in word_freqs.items():
            splits[tuple(list(word) + ["</w>"])] = freq
        
        # Build initial character vocabulary
        chars = set()
        for word in splits:
            chars.update(word)
        
        num_merges = self.vocab_size - len(chars)
        
        for i in range(num_merges):
            # Count pairs
            pair_counts = Counter()
            for word, freq in splits.items():
                for j in range(len(word) - 1):
                    pair_counts[(word[j], word[j+1])] += freq
            
            if not pair_counts:
                break
            
            # Find best pair
            best_pair = pair_counts.most_common(1)[0][0]
            merged = best_pair[0] + best_pair[1]
            
            if verbose and i < 15:
                count = pair_counts[best_pair]
                print(f"Merge {i+1}: '{best_pair[0]}' + '{best_pair[1]}' -> '{merged}' (freq: {count})")
            
            self.merges.append((best_pair, merged))
            
            # Apply merge
            new_splits = {}
            for word, freq in splits.items():
                new_word = []
                j = 0
                while j < len(word):
                    if j < len(word) - 1 and word[j] == best_pair[0] and word[j+1] == best_pair[1]:
                        new_word.append(merged)
                        j += 2
                    else:
                        new_word.append(word[j])
                        j += 1
                new_splits[tuple(new_word)] = freq
            splits = new_splits
        
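        # Build vocabulary: base characters plus every merged token,
        # so tokens that only appear mid-merge or in unseen words still get an ID
        all_tokens = set(chars)
        all_tokens.update(merged for _, merged in self.merges)
        self.vocab = {token: idx for idx, token in enumerate(sorted(all_tokens))}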
        
        if verbose:
            print(f"\nFinal vocabulary size: {len(self.vocab)}")
    
    def tokenize(self, text):
        """
        Tokenize a string using learned merges.
        
        Args:
            text: Input string
            
        Returns:
            List of tokens
        """
        words = text.split()
        all_tokens = []
        
        for word in words:
            tokens = list(word) + ["</w>"]
            
            # Apply each merge rule in order
            for (a, b), merged in self.merges:
                new_tokens = []
                i = 0
                while i < len(tokens):
                    if i < len(tokens) - 1 and tokens[i] == a and tokens[i+1] == b:
                        new_tokens.append(merged)
                        i += 2
                    else:
                        new_tokens.append(tokens[i])
                        i += 1
                tokens = new_tokens
            
            all_tokens.extend(tokens)
        
        return all_tokens

# Train on a small corpus
training_text = """the cat sat on the mat the cat sat on the hat
the dog sat on the log the dog sat on the rug
a new cat and a new dog sat on a new mat
the newest cat sat on the newest mat happily
the lower cat and the lowest dog were unhappy"""

print("Training BPE tokenizer...")
print("=" * 50)
bpe = SimpleBPE(vocab_size=60)
bpe.train(training_text, verbose=True)

print("\n" + "=" * 50)
print("\nTokenizing example sentences:")
test_sentences = ["the cat sat", "newest dog", "unhappy cat"]
for sent in test_sentences:
    tokens = bpe.tokenize(sent)
    print(f"  '{sent}' -> {tokens}")
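
Because this tokenizer marks word endings with `</w>`, the original text can be recovered from a token list. The helper below is a minimal sketch; the name `detokenize` and the example token list are hypothetical (the actual splits depend on which merges training happened to learn):

```python
def detokenize(tokens, end_marker="</w>"):
    """Join subword tokens back into text, turning end-of-word markers into spaces."""
    return "".join(tokens).replace(end_marker, " ").strip()

# Hypothetical token list in the same format SimpleBPE.tokenize produces
example_tokens = ["the</w>", "c", "at</w>", "s", "at</w>"]
print(detokenize(example_tokens))
```

This round trip only works because the end-of-word marker preserves where the spaces were; without it, "the", "c", "at" could not be told apart from "thecat".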

In [None]:
# Visualization: How "unhappiness" gets tokenized at different vocab sizes
fig, axes = plt.subplots(4, 1, figsize=(12, 8))

# Simulate different tokenization granularities
tokenizations = [
    ("Character-level\n(vocab ~30)", list("unhappiness"), '#e74c3c'),
    ("Small subword\n(vocab ~100)", ["un", "h", "app", "in", "ess"], '#e67e22'),
    ("Medium subword\n(vocab ~1000)", ["un", "happi", "ness"], '#2ecc71'),
    ("Large subword\n(vocab ~50000)", ["unhappiness"], '#3498db'),
]

for ax, (label, tokens, color) in zip(axes, tokenizations):
    # Draw token boxes
    x_pos = 0
    for token in tokens:
        width = len(token) * 0.8 + 0.4
        rect = plt.Rectangle((x_pos, 0.1), width, 0.8, 
                            facecolor=color, edgecolor='black', alpha=0.7, linewidth=2)
        ax.add_patch(rect)
        ax.text(x_pos + width/2, 0.5, token, ha='center', va='center', 
               fontsize=13, fontweight='bold', color='white')
        x_pos += width + 0.15
    
    ax.set_xlim(-0.3, 12)
    ax.set_ylim(0, 1)
    ax.set_ylabel(label, fontsize=10, rotation=0, labelpad=100, va='center')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['left'].set_visible(False)
    
    # Show token count
    ax.text(11.5, 0.5, f'{len(tokens)} tokens', fontsize=12, va='center', 
           fontweight='bold', color=color)

fig.suptitle('Tokenizing "unhappiness" at Different Vocabulary Sizes', 
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Trade-off: Smaller vocab = longer sequences, Larger vocab = shorter sequences")
print("Many LLMs use roughly 30K-100K subword tokens; some newer models use larger vocabularies")

### Special Tokens

Language models use special tokens to mark structure and boundaries:

| Token | Name | Purpose | Used By | F1 Parallel |
|-------|------|---------|--------|-------------|
| `[PAD]` | Padding | Fill shorter sequences to equal length in a batch | All models | Padding short stints to match longest stint length |
| `[UNK]` | Unknown | Represent tokens not in vocabulary | Word-level models | "Unknown flag condition" -- never seen before |
| `[BOS]` / `<s>` | Beginning of Sequence | Mark where text starts | GPT, generation models | Race start / lights out |
| `[EOS]` / `</s>` | End of Sequence | Signal the model to stop generating | GPT, generation models | Chequered flag |
| `[CLS]` | Classification | Aggregate representation for classification tasks | BERT | Race summary token |
| `[SEP]` | Separator | Separate two sentences in a pair | BERT | Separating qualifying data from race data |
| `[MASK]` | Mask | Placeholder for masked language modeling | BERT | Hidden event: "On lap 15, [MASK] pitted" -- predict who |

**What this means:** Special tokens are the "punctuation" of the model's internal language. They tell the model where sequences start and end, how to separate inputs, and where to output predictions.
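
As a rough sketch of how these tokens are used in practice, the helper below wraps each sequence in BOS/EOS and pads a batch to equal length. The token IDs (0, 1, 2) and the function name `frame_and_pad` are assumptions for illustration; real tokenizers define their own IDs:

```python
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2  # hypothetical IDs; real tokenizers define their own

def frame_and_pad(sequences, pad_id=PAD_ID, bos_id=BOS_ID, eos_id=EOS_ID):
    """Wrap each sequence in BOS/EOS, then right-pad to the longest length.

    Returns (padded_ids, attention_mask); the mask is 1 for real tokens, 0 for padding.
    """
    framed = [[bos_id] + list(seq) + [eos_id] for seq in sequences]
    max_len = max(len(seq) for seq in framed)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in framed]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in framed]
    return padded, mask

batch_ids, batch_mask = frame_and_pad([[11, 12, 13], [21]])
for ids, m in zip(batch_ids, batch_mask):
    print(ids, m)
```

The mask lets the model (and the loss) ignore padded positions, so padding changes the tensor shape without changing what is learned.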

### Tokenization Comparison Table

| Method | How It Works | Vocab Size | Used By |
|--------|-------------|------------|--------|
| **BPE** | Merge most frequent adjacent symbol pairs | 30K-100K | GPT-2, GPT-3, GPT-4, LLaMA |
| **WordPiece** | Like BPE but uses likelihood, not frequency | 30K | BERT, DistilBERT |
| **SentencePiece** | BPE/Unigram on raw text (no pre-tokenization) | 32K-256K | T5, LLaMA, multilingual models |
| **Tiktoken** | OpenAI's byte-level BPE implementation | ~100K (`cl100k_base`) | GPT-3.5, GPT-4 |

---

## 3. GPT Architecture (Decoder-Only)

### Intuitive Explanation

GPT stands for **Generative Pre-trained Transformer**. It uses only the **decoder** part of the Transformer (from notebook 17) to generate text left-to-right.

**Why decoder-only works for generation:** When you write a sentence, you write one word at a time, left to right. Each word depends only on what came before it. A decoder-only model mirrors this natural process with **causal masking** -- each position can only attend to positions before it (and itself), never to future positions.

**Key insight:** The same architecture that predicts the next token during training can generate text at inference time by repeatedly sampling from its predictions.

**F1 analogy:** GPT is like a **race commentary generator** or a **lap time predictor from history**. Given everything that has happened so far in the race, it predicts the next event -- and then feeds that prediction back in to predict the event after that. "Verstappen leads lap 1" -> predicts "Norris closes gap in sector 2" -> predicts "DRS enabled on lap 3" -> and so on. It can only look backward (causal masking), just as a commentator can only describe what has already happened when predicting what comes next.
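
Causal masking can be sketched in a few lines of plain Python. The function below assumes a square matrix of raw attention scores and is only an illustration (the name `causal_attention_weights` is hypothetical). It drops future keys before the softmax, which gives the same result as setting their scores to negative infinity:

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix, hiding future positions.

    scores[i][j] is the raw attention score from query position i to key position j.
    Position i may only attend to keys j <= i; future keys get weight 0.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]                  # keys at or before position i
        shift = max(visible)                          # subtract max for numerical stability
        exps = [math.exp(s - shift) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# Uniform scores make the effect of the mask easy to see
toy_scores = [[0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]]
causal_weights = causal_attention_weights(toy_scores)
for row in causal_weights:
    print([round(w, 2) for w in row])
```

Each row still sums to 1, but the first position can only attend to itself, the second splits its attention over two positions, and so on.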

### Architecture Overview

The GPT architecture is a stack of identical decoder blocks:

```
Input Text: "The cat sat"
    |
    v
[Token Embedding] + [Positional Encoding]
    |
    v
[Decoder Block 1] -- Masked Self-Attention -> Feed-Forward
    |
    v
[Decoder Block 2] -- Masked Self-Attention -> Feed-Forward
    |
    v
   ...N blocks...
    |
    v
[Layer Norm]
    |
    v
[Linear Head] -> Vocabulary-sized logits
    |
    v
P("on") = 0.35, P("down") = 0.12, ...
```

In [None]:
# Visualization: GPT Architecture and Causal Masking
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: Causal masking pattern
tokens = ["The", "cat", "sat", "on", "the"]
n = len(tokens)

# Create causal mask (lower triangular)
mask = np.tril(np.ones((n, n)))

im = axes[0].imshow(mask, cmap='Blues', vmin=0, vmax=1)
axes[0].set_xticks(range(n))
axes[0].set_yticks(range(n))
axes[0].set_xticklabels(tokens, fontsize=12)
axes[0].set_yticklabels(tokens, fontsize=12)
axes[0].set_xlabel('Key (can attend to)', fontsize=12)
axes[0].set_ylabel('Query (predicting from)', fontsize=12)
axes[0].set_title('GPT: Causal (Left-to-Right) Mask', fontsize=13, fontweight='bold')

# Add text annotations
for i in range(n):
    for j in range(n):
        color = 'white' if mask[i, j] > 0.5 else 'black'
        symbol = '\u2713' if mask[i, j] > 0.5 else '\u2717'
        axes[0].text(j, i, symbol, ha='center', va='center', fontsize=16, 
                   color=color, fontweight='bold')

# Right: What each position predicts
axes[1].set_xlim(0, 10)
axes[1].set_ylim(-0.5, 5.5)
axes[1].set_title('GPT: Each Position Predicts Next Token', fontsize=13, fontweight='bold')

predictions = [
    ("The", "cat", '#3498db'),
    ("The cat", "sat", '#2ecc71'),
    ("The cat sat", "on", '#e67e22'),
    ("The cat sat on", "the", '#9b59b6'),
    ("The cat sat on the", "mat", '#e74c3c'),
]

for i, (context, pred, color) in enumerate(predictions):
    y = 4.5 - i
    # Context box
    axes[1].add_patch(plt.Rectangle((0.2, y - 0.2), 4.5, 0.4, 
                     facecolor=color, alpha=0.3, edgecolor=color, linewidth=2))
    axes[1].text(2.45, y, context, ha='center', va='center', fontsize=11)
    # Arrow
    axes[1].annotate('', xy=(6.5, y), xytext=(5.0, y),
                    arrowprops=dict(arrowstyle='->', color=color, lw=2))
    # Prediction
    axes[1].text(7.3, y, f'\u2192 "{pred}"', ha='left', va='center', 
               fontsize=12, fontweight='bold', color=color)

axes[1].set_xticks([])
axes[1].set_yticks([])
axes[1].spines['top'].set_visible(False)
axes[1].spines['right'].set_visible(False)
axes[1].spines['bottom'].set_visible(False)
axes[1].spines['left'].set_visible(False)

plt.tight_layout()
plt.show()

print("Left: Each row shows which positions that token can attend to")
print("Right: GPT is trained so every position predicts the NEXT token")
print("\nThis is why GPT is autoregressive: it generates one token at a time, left to right")
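
"Every position predicts the next token" translates into training data where the target sequence is the input sequence shifted by one. A minimal sketch, assuming a flat token stream and a hypothetical helper `make_next_token_pairs`:

```python
def make_next_token_pairs(stream, block_size):
    """Slice a token stream into (input, target) windows offset by one position."""
    pairs = []
    for start in range(len(stream) - block_size):
        inputs = stream[start : start + block_size]
        targets = stream[start + 1 : start + block_size + 1]  # same window, shifted by one
        pairs.append((inputs, targets))
    return pairs

toy_stream = ["The", "cat", "sat", "on", "the", "mat"]
shifted_pairs = make_next_token_pairs(toy_stream, block_size=4)
for inputs, targets in shifted_pairs:
    print(inputs, "->", targets)
```

With causal masking, one window of length `block_size` yields `block_size` training signals at once: the target at position `i` is the token the model should predict after seeing inputs `0..i`.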

### Implementing a Mini-GPT Model

Let's build a small GPT model. This uses the same components from notebook 17 (multi-head attention, feed-forward layers) but arranged as a **decoder-only** model with causal masking.

**F1 framing:** We are building a mini race-event predictor. Feed in the sequence of events, and it predicts the next one -- all using causal attention so it can only look at what has already happened, never peek into the future.

In [None]:
class CausalSelfAttention(nn.Module):
    """
    Multi-head self-attention with causal (autoregressive) masking.
    Each position can only attend to itself and earlier positions.
    """
    
    def __init__(self, d_model, n_heads, max_seq_len=512, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        
        self.W_qkv = nn.Linear(d_model, 3 * d_model)
        self.W_out = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        
        # Causal mask: lower triangular matrix
        # Register as buffer so it moves to GPU with model but isn't a parameter
        mask = torch.tril(torch.ones(max_seq_len, max_seq_len))
        self.register_buffer('mask', mask.view(1, 1, max_seq_len, max_seq_len))
    
    def forward(self, x):
        B, T, C = x.shape
        
        # Compute Q, K, V in one projection
        qkv = self.W_qkv(x)
        q, k, v = qkv.chunk(3, dim=-1)
        
        # Reshape for multi-head attention
        q = q.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        
        # Scaled dot-product attention with causal mask
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_k)
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        attn = self.dropout(attn)
        
        # Apply attention to values
        out = attn @ v
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.W_out(out)


class GPTBlock(nn.Module):
    """
    A single GPT decoder block: LayerNorm -> Causal Attention -> LayerNorm -> FFN
    Uses pre-norm (LayerNorm before attention) like GPT-2.
    """
    
    def __init__(self, d_model, n_heads, d_ff=None, max_seq_len=512, dropout=0.1):
        super().__init__()
        d_ff = d_ff or 4 * d_model
        
        self.ln1 = nn.LayerNorm(d_model)
        # Pass max_seq_len through so the causal mask covers the model's full context
        self.attn = CausalSelfAttention(d_model, n_heads, max_seq_len=max_seq_len, dropout=dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )
    
    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # Residual + attention
        x = x + self.ffn(self.ln2(x))     # Residual + FFN
        return x


class MiniGPT(nn.Module):
    """
    A minimal GPT language model.
    
    Architecture: Token Embed + Pos Embed -> N x GPTBlock -> LayerNorm -> Linear
    
    Args:
        vocab_size: Number of tokens in vocabulary
        d_model: Embedding dimension
        n_heads: Number of attention heads
        n_layers: Number of decoder blocks
        max_seq_len: Maximum sequence length
        dropout: Dropout rate
    """
    
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=4, 
                 max_seq_len=256, dropout=0.1):
        super().__init__()
        
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
        self.dropout = nn.Dropout(dropout)
        
        # Each block sizes its causal mask from max_seq_len (otherwise contexts > 512 would fail)
        self.blocks = nn.Sequential(*[
            GPTBlock(d_model, n_heads, max_seq_len=max_seq_len, dropout=dropout)
            for _ in range(n_layers)
        ])
        
        self.ln_final = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        
        # Weight tying: share weights between token embedding and output head
        self.lm_head.weight = self.token_embedding.weight
        
        self.max_seq_len = max_seq_len
        self._init_weights()
    
    def _init_weights(self):
        """Initialize weights following GPT-2 conventions."""
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.normal_(module.weight, mean=0.0, std=0.02)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Embedding):
                nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, idx, targets=None):
        """
        Forward pass.
        
        Args:
            idx: Token indices, shape (batch, seq_len)
            targets: Target token indices for loss computation
            
        Returns:
            logits: Shape (batch, seq_len, vocab_size)
            loss: Cross-entropy loss (if targets provided)
        """
        B, T = idx.shape
        assert T <= self.max_seq_len, f"Sequence length {T} exceeds max {self.max_seq_len}"
        
        # Token + positional embeddings
        positions = torch.arange(T, device=idx.device).unsqueeze(0)
        x = self.token_embedding(idx) + self.position_embedding(positions)
        x = self.dropout(x)
        
        # Transformer blocks
        x = self.blocks(x)
        x = self.ln_final(x)
        
        # Project to vocabulary
        logits = self.lm_head(x)
        
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        
        return logits, loss
    
    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Generate tokens autoregressively.
        
        Args:
            idx: Starting token indices, shape (batch, seq_len)
            max_new_tokens: Number of new tokens to generate
            temperature: Controls randomness (higher = more random)
            top_k: If set, only sample from top-k most likely tokens
            
        Returns:
            Token indices including generated tokens
        """
        for _ in range(max_new_tokens):
            # Crop to max_seq_len if needed
            idx_cond = idx if idx.size(1) <= self.max_seq_len else idx[:, -self.max_seq_len:]
            
            # Forward pass
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature  # Take last position, apply temperature
            
            # Optional top-k filtering
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')
            
            # Sample from distribution
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            
            idx = torch.cat([idx, next_token], dim=1)
        
        return idx

# Create a mini-GPT and inspect it
model = MiniGPT(vocab_size=100, d_model=64, n_heads=4, n_layers=2, max_seq_len=32)

total_params = sum(p.numel() for p in model.parameters())
print("Mini-GPT Architecture")
print("=" * 50)
print(f"Vocabulary size: 100")
print(f"Embedding dimension: 64")
print(f"Attention heads: 4")
print(f"Decoder layers: 2")
print(f"Max sequence length: 32")
print(f"Total parameters: {total_params:,}")
print()
print("Model structure:")
print(model)

### Comparison: Decoder-Only vs Encoder-Decoder

| Aspect | Decoder-Only (GPT) | Encoder-Decoder (T5, Original Transformer) | F1 Parallel |
|--------|--------------------|-----------------------------------------|-------------|
| **Architecture** | Only decoder blocks with causal masking | Separate encoder (bidirectional) + decoder (causal) | Commentary generator vs. telemetry-to-strategy translator |
| **Attention** | Causal self-attention only | Encoder: full self-attention; Decoder: causal + cross-attention | Only past events visible vs. full data + sequential output |
| **Training** | Predict next token | Map input sequence to output sequence | Predict next lap event vs. translate telemetry to strategy |
| **Input/Output** | Single text stream | Separate input and output streams | One continuous race narrative vs. separate input/output |
| **Best for** | Generation, completion, chat | Translation, summarization (structured input->output) | Race commentary, predictions vs. data translation tasks |
| **Examples** | GPT-1/2/3/4, LLaMA, Claude | T5, BART, original Transformer | -- |

**Key insight:** Decoder-only models turn out to be surprisingly versatile. By framing any task as "continue this text," GPT can do translation, Q&A, summarization, and more -- all as text generation.
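
To make the "continue this text" framing concrete, here is a minimal sketch of how different tasks can be rewritten as a prompt for a decoder-only model to complete. The `as_continuation` helper and its templates are hypothetical illustrations, not part of any library or of this notebook's model.

```python
# Hypothetical prompt templates: each task is rewritten as text to be continued.
TEMPLATES = {
    "translate": "Translate English to French:\nEnglish: {text}\nFrench:",
    "qa": "Q: {text}\nA:",
    "summarize": "{text}\n\nTL;DR:",
}

def as_continuation(task, text):
    """Wrap the input so that the desired answer is the natural continuation."""
    return TEMPLATES[task].format(text=text)

for task, text in [
    ("qa", "Which tyre compound is the softest?"),
    ("translate", "Box this lap."),
    ("summarize", "Rain arrived on lap 14 and the leaders pitted for inters."),
]:
    print(as_continuation(task, text))
    print("-" * 40)
```

Whatever the model generates after the final `A:`, `French:`, or `TL;DR:` is read as the answer, so one next-token predictor serves all three tasks.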

### Deep Dive: GPT Scaling History

| Model | Year | Parameters | Training Data | Context Length | Key Innovation |
|-------|------|-----------|--------------|----------------|---------------|
| **GPT-1** | 2018 | 117M | BookCorpus (5GB) | 512 | Proved unsupervised pre-training works |
| **GPT-2** | 2019 | 1.5B | WebText (40GB) | 1024 | Zero-shot task transfer |
| **GPT-3** | 2020 | 175B | 300B tokens | 2048 | In-context learning, few-shot prompting |
| **GPT-4** | 2023 | ~1.8T (est.) | ~13T tokens (est.) | 8K-128K | Multimodal, instruction following |

#### Key Insight

Each generation didn't change the core architecture much -- it primarily **scaled up** parameters, data, and compute. The Transformer architecture from 2017 has proven remarkably scalable. In F1 terms, the fundamental car concept (ground effect, for example) stays the same -- the teams that win are the ones that invest the most in refining and developing that concept.

---

## 4. BERT Architecture (Encoder-Only)

### Intuitive Explanation

While GPT generates text left-to-right, BERT takes a completely different approach: it reads text **bidirectionally** -- looking at both left and right context simultaneously.

**Analogy:** Imagine filling in a blank in a sentence: *"The [MASK] chased the mouse."* You need to see both what comes before AND after the blank to determine the answer is "cat." This is exactly what BERT does -- it's trained to fill in randomly masked tokens.

**F1 analogy:** BERT is like understanding the **full context of a race situation**. When analyzing "On lap 32, [MASK] made a critical overtake at turn 4," you need to see both the preceding context (who was in position, what the gaps were) and the following context (the overtake was into P3, the driver was on fresh softs) to determine it was Norris. GPT can only look left; BERT sees the whole picture -- that is why BERT excels at *understanding* rather than *generating*.

**Why encoder-only works for understanding:** Many NLP tasks don't require generating text -- they require *understanding* it:
- Is this email spam or not? (classification)
- What is the sentiment of this review? (sentiment analysis)  
- Which word does "it" refer to? (coreference resolution)
- Where is the answer in this paragraph? (extractive QA)

For all these tasks, looking at the **full context** (both directions) gives better understanding than only looking left-to-right.

### BERT's Two Training Objectives

| Objective | How It Works | What It Teaches | F1 Parallel |
|-----------|-------------|----------------|-------------|
| **Masked Language Modeling (MLM)** | Randomly mask 15% of tokens, predict them | Deep bidirectional understanding of language | Predict hidden race events from full context |
| **Next Sentence Prediction (NSP)** | Given two sentences, predict if B follows A | Understanding relationships between sentences | "Did this strategy call follow that telemetry reading?" |

#### MLM Details
- 15% of tokens are selected for prediction
- Of those: 80% replaced with [MASK], 10% replaced with random token, 10% kept unchanged
- The model must predict the original token for all selected positions

**What this means:** MLM forces BERT to build rich representations that capture meaning from both directions. Unlike GPT, which only predicts forward, BERT sees the full picture.
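
The 15% selection and the 80/10/10 split described above can be sketched in plain Python. This is an illustrative simplification: `mask_for_mlm` is a hypothetical helper that works on lists of string tokens, and it does not protect special tokens such as `[CLS]` the way a real BERT data pipeline would.

```python
import random

MASK = "[MASK]"

def mask_for_mlm(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, targets) using the 15% / 80-10-10 rule.

    targets[i] is the original token at selected positions and None elsewhere.
    """
    rng = rng or random.Random()
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            corrupted.append(tok)    # position not selected: left untouched
            targets.append(None)     # and not a prediction target
            continue
        targets.append(tok)          # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep the original token
    return corrupted, targets

rng = random.Random(0)
sentence = "the cat chased the mouse across the garden".split()
corrupted, targets = mask_for_mlm(sentence, vocab=["dog", "car", "tyre"], rng=rng)
print(corrupted)
print(targets)
```

Only positions with a non-`None` target contribute to the loss, which is why the 10% of selected tokens left unchanged still teach the model something: it cannot tell from the input which positions it will be graded on.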

In [None]:
# Visualization: BERT vs GPT Attention Patterns
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

tokens = ["The", "cat", "[MASK]", "the", "mouse"]
n = len(tokens)

# GPT: Causal mask (lower triangular)
causal_mask = np.tril(np.ones((n, n)))
im1 = axes[0].imshow(causal_mask, cmap='Oranges', vmin=0, vmax=1.5)
axes[0].set_xticks(range(n))
axes[0].set_yticks(range(n))
axes[0].set_xticklabels(tokens, fontsize=10, rotation=45, ha='right')
axes[0].set_yticklabels(tokens, fontsize=10)
axes[0].set_title('GPT: Causal Attention\n(left-to-right only)', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Keys (attends to)', fontsize=11)
axes[0].set_ylabel('Queries', fontsize=11)
for i in range(n):
    for j in range(n):
        symbol = '\u2713' if causal_mask[i, j] > 0 else ''
        color = 'white' if causal_mask[i, j] > 0 else 'gray'
        axes[0].text(j, i, symbol, ha='center', va='center', fontsize=14, 
                   color=color, fontweight='bold')

# BERT: Full bidirectional attention
full_mask = np.ones((n, n))
im2 = axes[1].imshow(full_mask, cmap='Blues', vmin=0, vmax=1.5)
axes[1].set_xticks(range(n))
axes[1].set_yticks(range(n))
axes[1].set_xticklabels(tokens, fontsize=10, rotation=45, ha='right')
axes[1].set_yticklabels(tokens, fontsize=10)
axes[1].set_title('BERT: Bidirectional Attention\n(sees everything)', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Keys (attends to)', fontsize=11)
axes[1].set_ylabel('Queries', fontsize=11)
for i in range(n):
    for j in range(n):
        axes[1].text(j, i, '\u2713', ha='center', va='center', fontsize=14, 
                   color='white', fontweight='bold')

# Side comparison diagram
ax = axes[2]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.set_title('Context Available for "[MASK]"', fontsize=13, fontweight='bold')

# GPT context for [MASK] (index 2, i.e. the third token)
ax.add_patch(plt.Rectangle((0.5, 6), 9, 1.5, facecolor='#e67e22', alpha=0.2, 
             edgecolor='#e67e22', linewidth=2))
ax.text(5, 7.2, 'GPT at position 3', ha='center', fontsize=12, fontweight='bold', color='#e67e22')
ax.text(1.5, 6.5, '"The"  "cat"', ha='left', fontsize=11, color='#e67e22')
ax.text(5.5, 6.5, '\u2190 can only see these', ha='left', fontsize=10, color='#e67e22', style='italic')

# BERT context for [MASK]
ax.add_patch(plt.Rectangle((0.5, 3), 9, 1.5, facecolor='#2980b9', alpha=0.2, 
             edgecolor='#2980b9', linewidth=2))
ax.text(5, 4.2, 'BERT at [MASK]', ha='center', fontsize=12, fontweight='bold', color='#2980b9')
ax.text(1.2, 3.5, '"The" "cat" ... "the" "mouse"', ha='left', fontsize=11, color='#2980b9')
ax.text(7, 3.5, '\u2190 sees ALL', ha='left', fontsize=10, color='#2980b9', style='italic')

# Verdict
ax.text(5, 1.5, 'BERT: "chased" (confident)', ha='center', fontsize=12, 
       fontweight='bold', color='#2980b9')
ax.text(5, 0.7, 'GPT: "sat"? "ate"? (uncertain)', ha='center', fontsize=12, 
       fontweight='bold', color='#e67e22')

ax.set_xticks([])
ax.set_yticks([])
for spine in ax.spines.values():
    spine.set_visible(False)

plt.tight_layout()
plt.show()

print("BERT sees both left and right context => better understanding")
print("GPT sees only left context => suited for generation")

### GPT vs BERT: Complete Comparison

This is one of the most important architectural distinctions in modern NLP:

| Aspect | GPT (Decoder-Only) | BERT (Encoder-Only) | F1 Parallel |
|--------|--------------------|-----------------|-------------|
| **Direction** | Left-to-right (autoregressive) | Bidirectional | Predicting next lap vs. analyzing full race |
| **Attention mask** | Causal (triangular) | Full (no mask) | Only past events vs. full race context |
| **Training objective** | Predict next token | Predict masked tokens + NSP | Predict next event vs. fill in hidden events |
| **Pre-training task** | Language modeling | Masked language modeling | Race commentary vs. race analysis |
| **Output** | Next-token probabilities | Contextualized embeddings | "What happens next?" vs. "What does this data mean?" |
| **Generation** | Natural (sample next token) | Unnatural (not designed for it) | Generates commentary vs. not a generator |
| **Understanding** | Limited (only left context) | Superior (full context) | Partial picture vs. full picture |
| **Fine-tuning** | Prompt-based / instruction-tuning | Add classification head on [CLS] | Adapt via prompts vs. add decision layer |
| **Parameters (base)** | GPT-2: 117M - 1.5B | BERT-base: 110M, BERT-large: 340M | -- |
| **Best for** | Text generation, chatbots, coding | Classification, NER, QA, search | Commentary, prediction vs. classification, analysis |

### When to Use Which?

| Task | Best Architecture | Why | F1 Framing |
|------|------------------|-----|------------|
| Chatbot / dialogue | **GPT** | Needs to generate fluent responses | Race commentary / team radio generation |
| Text classification | **BERT** | Needs to understand full document | "Was this a good strategy?" (yes/no) |
| Code generation | **GPT** | Code is written left-to-right | Strategy algorithm generation |
| Named entity recognition | **BERT** | Needs bidirectional context for each token | Identify drivers, teams, circuits in text |
| Creative writing | **GPT** | Generation task | Generate race preview articles |
| Semantic search | **BERT** | Needs rich sentence embeddings | Find similar race situations in the archive |
| Translation | **Encoder-Decoder** | Structured input-to-output mapping | Telemetry-to-English translation |
| Summarization | **GPT** or **Encoder-Decoder** | Generation with input understanding | Post-race summary generation |

#### Key Insight

Since 2023, the trend has been toward **large decoder-only models** (GPT-4, Claude, LLaMA, Gemini). It turns out that with enough scale and the right training, decoder-only models can match or exceed BERT-style models even on understanding tasks. But BERT-style models remain popular for efficient, focused applications where generation isn't needed.

---

## 5. Training a Small Language Model

### Intuitive Explanation

Let's put theory into practice by training a **character-level GPT** on a small dataset. Character-level means each token is a single character -- this keeps our vocabulary tiny and training fast, while still demonstrating all the core concepts.

**F1 framing:** Think of this as training a tiny race commentator. We feed it examples of text, and it learns to predict what character comes next -- eventually generating coherent-looking sequences. With a small model and dataset, we will not get Brundle-quality commentary, but we will see the fundamental mechanism in action.

We'll:
1. Prepare a simple text dataset
2. Create character-level tokenization
3. Train our MiniGPT model
4. Generate text and explore temperature sampling

In [None]:
# Training data: simple repetitive patterns for fast learning
training_text = """
the cat sat on the mat. the cat ran to the hat.
the dog sat on the log. the dog ran to the fog.
a big cat sat on a big mat. a small dog sat on a small log.
the cat and the dog sat on the mat.
the cat saw the dog. the dog saw the cat.
""" * 100  # Repeat for more training data

# Character-level tokenization
chars = sorted(list(set(training_text)))
vocab_size = len(chars)
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

# Encode the text
def encode(text):
    return [char_to_idx[ch] for ch in text]

def decode(indices):
    return ''.join([idx_to_char[i] for i in indices])

data = torch.tensor(encode(training_text), dtype=torch.long)

print(f"Vocabulary size: {vocab_size}")
print(f"Characters: {repr(''.join(chars))}")
print(f"Total characters in training data: {len(data):,}")
print(f"\nExample encoding: 'cat' -> {encode('cat')}")
print(f"Example decoding: {encode('cat')} -> '{decode(encode('cat'))}'")

In [None]:
# Data loader for training
def get_batch(data, batch_size, block_size):
    """
    Get a random batch of training examples.
    
    Args:
        data: Full training data as tensor
        batch_size: Number of sequences per batch
        block_size: Length of each sequence
    
    Returns:
        x: Input sequences (batch_size, block_size)
        y: Target sequences (batch_size, block_size) - shifted by 1
    """
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

# Test the batch function
x, y = get_batch(data, batch_size=4, block_size=16)
print(f"Input batch shape: {x.shape}")
print(f"Target batch shape: {y.shape}")
print(f"\nExample input:  '{decode(x[0].tolist())}'")
print(f"Example target: '{decode(y[0].tolist())}'")
print("\nNotice: Target is the input shifted by one character (next-char prediction)")

In [None]:
# Train the model
torch.manual_seed(42)

# Model hyperparameters
block_size = 32
batch_size = 64
d_model = 64
n_heads = 4
n_layers = 3
learning_rate = 3e-4
n_steps = 2000

# Create model
model = MiniGPT(
    vocab_size=vocab_size,
    d_model=d_model,
    n_heads=n_heads,
    n_layers=n_layers,
    max_seq_len=block_size,
    dropout=0.1
)

optimizer = optim.AdamW(model.parameters(), lr=learning_rate)

# Training loop
losses = []
samples_during_training = []

print(f"Training MiniGPT ({sum(p.numel() for p in model.parameters()):,} parameters)")
print("=" * 60)

for step in range(n_steps):
    # Get batch
    x, y = get_batch(data, batch_size, block_size)
    
    # Forward pass
    logits, loss = model(x, y)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    losses.append(loss.item())
    
    # Log progress
    if step % 400 == 0 or step == n_steps - 1:
        # Generate a sample
        model.eval()
        prompt = encode("the ")
        prompt_tensor = torch.tensor([prompt], dtype=torch.long)
        generated = model.generate(prompt_tensor, max_new_tokens=40, temperature=0.8)
        sample = decode(generated[0].tolist())
        samples_during_training.append((step, loss.item(), sample))
        model.train()
        
        print(f"Step {step:4d} | Loss: {loss.item():.4f} | Sample: '{sample[:50]}...'")

print("\nTraining complete!")

In [None]:
# Visualize training progress
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
axes[0].plot(losses, alpha=0.3, color='blue', label='Raw loss')
# Smoothed loss
window = 50
smoothed = np.convolve(losses, np.ones(window)/window, mode='valid')
axes[0].plot(range(window-1, len(losses)), smoothed, color='red', linewidth=2, label='Smoothed')
axes[0].set_xlabel('Training Step', fontsize=12)
axes[0].set_ylabel('Cross-Entropy Loss', fontsize=12)
axes[0].set_title('Training Loss Over Time', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Generation quality over time
ax = axes[1]
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_title('Generation Quality During Training', fontsize=14, fontweight='bold')

y_pos = 0.9
for step, loss, sample in samples_during_training:
    # Color based on loss
    color = plt.cm.RdYlGn(1 - min(loss / 3, 1))
    truncated = sample[:45] + '...' if len(sample) > 45 else sample
    ax.text(0.02, y_pos, f"Step {step}:", fontsize=10, fontweight='bold', 
           transform=ax.transAxes)
    ax.text(0.15, y_pos, f"'{truncated}'", fontsize=9, 
           color=color, transform=ax.transAxes, family='monospace')
    y_pos -= 0.15

ax.text(0.02, 0.05, "Green = low loss (good), Red = high loss (bad)", 
       fontsize=10, style='italic', transform=ax.transAxes)
ax.set_xticks([])
ax.set_yticks([])
for spine in ax.spines.values():
    spine.set_visible(False)

plt.tight_layout()
plt.show()

print("Watch how generation quality improves as loss decreases!")

### Temperature Sampling

**Temperature** controls the randomness of generation:

$$P(\text{token}_i) = \frac{\exp(\text{logit}_i / T)}{\sum_j \exp(\text{logit}_j / T)}$$

| Temperature | Effect | Use Case | F1 Parallel |
|-------------|--------|----------|-------------|
| T < 1.0 | Sharper distribution, more deterministic | Factual answers, code | Conservative strategy: "Stay on plan, no surprises" |
| T = 1.0 | Original distribution | General use | Balanced strategy: weigh all options fairly |
| T > 1.0 | Flatter distribution, more random | Creative writing | Aggressive strategy: "Consider the unlikely -- maybe a 3-stop works?" |

**What this means:** Low temperature makes the model "confident" (picks high-probability tokens), while high temperature makes it "creative" (considers unlikely tokens).

**F1 analogy:** Temperature is like the aggressiveness dial on a strategy computer. Low temperature (conservative) = the strategy sticks to the obvious call, like a safe one-stop. High temperature (aggressive) = the strategy considers wild alternatives, like an early pit under a virtual safety car or switching to inters on a drying track. Sometimes the aggressive call wins the race.

In [None]:
# Visualization: Temperature effect on probability distribution
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

# Simulated logits for next character
logits = np.array([2.5, 2.0, 1.5, 0.5, 0.0, -0.5, -1.0, -2.0])
chars_viz = ['a', 't', ' ', 'e', 'o', 'n', 's', 'd']

temperatures = [0.3, 0.7, 1.0, 2.0]

for ax, temp in zip(axes, temperatures):
    # Apply temperature and softmax
    scaled = logits / temp
    probs = np.exp(scaled) / np.sum(np.exp(scaled))
    
    colors = plt.cm.Blues(probs / max(probs))
    bars = ax.bar(chars_viz, probs, color=colors, edgecolor='black')
    
    ax.set_ylim(0, 1.0)  # at T=0.3 the top bar reaches ~0.82, so a 0.8 limit would clip it
    ax.set_xlabel('Character', fontsize=11)
    ax.set_ylabel('Probability' if temp == 0.3 else '', fontsize=11)
    ax.set_title(f'T = {temp}', fontsize=13, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='y')
    
    # Annotate top prob
    max_idx = np.argmax(probs)
    ax.text(max_idx, probs[max_idx] + 0.02, f'{probs[max_idx]:.2f}', 
           ha='center', fontsize=10, fontweight='bold')

fig.suptitle('Effect of Temperature on Token Probability Distribution', 
            fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("T=0.3: Very peaked (near-deterministic) -> picks 'a' about 80% of the time")
print("T=2.0: Very flat (random) -> might pick any character")

In [None]:
# Generate at different temperatures
model.eval()

prompt = "the cat "
prompt_tokens = torch.tensor([encode(prompt)], dtype=torch.long)

print("Generating from prompt: '" + prompt + "'")
print("=" * 60)

for temp in [0.3, 0.5, 0.8, 1.0, 1.5]:
    print(f"\nTemperature = {temp}:")
    for i in range(3):
        generated = model.generate(prompt_tokens.clone(), max_new_tokens=50, temperature=temp)
        text = decode(generated[0].tolist())
        print(f"  {i+1}. '{text}'")

---

## 6. Scaling Laws

### Intuitive Explanation

One of the most remarkable discoveries in deep learning is that language model performance follows **predictable scaling laws**. As you increase model size, training data, or compute, performance improves in a smooth, predictable way.

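The key finding from OpenAI's scaling-law research, later refined by DeepMind, is that loss follows a power law in each resource, as long as the other two are not the bottleneck:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}$$

Where:
- $L$ = Loss (lower is better)
- $N$ = Number of parameters
- $D$ = Dataset size (tokens)
- $C$ = Compute (FLOPs)
- $N_c$, $D_c$, $C_c$ = fitted constants, and $\alpha_N$, $\alpha_D$, $\alpha_C$ = fitted exponents (all well below 1, so each curve is a shallow straight line on a log-log plot)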
**What this means:** Given a target loss, you can estimate in advance roughly how much bigger the model needs to be, how much more data you need, or how much more compute to budget, by extrapolating from smaller training runs.

**F1 analogy:** Scaling laws in AI are like development budgets in F1. The relationship between spending and performance is remarkably predictable: more wind tunnel hours, more CFD simulations, more development tokens all improve lap time in a smooth, power-law fashion. Just as F1 teams can predict "X million in aero development yields Y tenths per lap," AI labs can predict "X more compute yields Y reduction in loss." The cost cap in F1 is essentially a compute budget constraint.
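
A short worked calculation shows how shallow these power laws are. The sketch below assumes loss depends on parameter count alone as $L \propto N^{-\alpha_N}$ with $\alpha_N \approx 0.076$ (the approximate exponent used in the plotting cell that follows), and it ignores data limits, compute limits, and any irreducible loss. `param_multiplier` is a hypothetical helper name.

```python
ALPHA_N = 0.076  # assumed exponent for the parameter power law

def param_multiplier(loss_ratio, alpha=ALPHA_N):
    """Factor by which N must grow so that new_loss = loss_ratio * old_loss.

    From L ~ N**(-alpha):  L2 / L1 = (N2 / N1)**(-alpha)
    so                     N2 / N1 = (L2 / L1)**(-1 / alpha)
    """
    return loss_ratio ** (-1.0 / alpha)

for ratio in (0.9, 0.8, 0.5):
    print(f"loss x {ratio}: parameters x {param_multiplier(ratio):,.1f}")
```

Under these assumptions, a 10% reduction in loss already needs roughly four times the parameters, and halving the loss needs several thousand times more. That is the "last tenth costs more than the first" effect in numbers.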

In [None]:
# Visualization: Scaling laws
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Parameters scaling
params = np.logspace(6, 12, 100)  # 1M to 1T
loss_params = 10 * (params / 1e6) ** (-0.076)  # Approximate scaling

axes[0].loglog(params, loss_params, 'b-', linewidth=2)
axes[0].set_xlabel('Parameters', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Scaling with Model Size', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, which='both')

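# Mark some famous models (illustrative placements, not measured losses;
# GPT-4's parameter count is an unofficial estimate)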
models_p = [
    ('GPT-2', 1.5e9, 3.5),
    ('GPT-3', 175e9, 2.0),
    ('GPT-4', 1.8e12, 1.2),
]
for name, size, loss in models_p:
    axes[0].scatter([size], [loss], s=100, zorder=5, edgecolors='black')
    axes[0].annotate(name, (size, loss), xytext=(5, 5), textcoords='offset points', fontsize=9)

# Data scaling
data_size = np.logspace(9, 13, 100)  # 1B to 10T tokens
loss_data = 8 * (data_size / 1e9) ** (-0.095)

axes[1].loglog(data_size, loss_data, 'g-', linewidth=2)
axes[1].set_xlabel('Training Tokens', fontsize=12)
axes[1].set_ylabel('Loss', fontsize=12)
axes[1].set_title('Scaling with Dataset Size', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, which='both')

# Compute scaling
compute = np.logspace(17, 25, 100)  # FLOPs
loss_compute = 20 * (compute / 1e17) ** (-0.05)

axes[2].loglog(compute, loss_compute, 'r-', linewidth=2)
axes[2].set_xlabel('Compute (FLOPs)', fontsize=12)
axes[2].set_ylabel('Loss', fontsize=12)
axes[2].set_title('Scaling with Compute', fontsize=14, fontweight='bold')
axes[2].grid(True, alpha=0.3, which='both')

fig.suptitle('Neural Scaling Laws: Predictable Performance Improvements', 
            fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("Key insight: Performance improves predictably on log-log scale")
print("This allows researchers to plan training runs and predict outcomes")

In [None]:
# Timeline of model sizes
# Note: GPT-4 and Gemini Ultra sizes are undisclosed; the values here are unofficial estimates
fig, ax = plt.subplots(figsize=(14, 6))

models = [
    (2018.5, 'GPT-1', 0.117, 'OpenAI'),
    (2018.8, 'BERT', 0.34, 'Google'),
    (2019.2, 'GPT-2', 1.5, 'OpenAI'),
    (2020.5, 'GPT-3', 175, 'OpenAI'),
    (2022.0, 'PaLM', 540, 'Google'),
    (2022.3, 'Chinchilla', 70, 'DeepMind'),
    (2023.0, 'GPT-4', 1800, 'OpenAI'),
    (2023.2, 'LLaMA', 65, 'Meta'),
    (2023.7, 'Llama-2', 70, 'Meta'),
    (2024.0, 'Gemini Ultra', 1500, 'Google'),
]

colors = {'OpenAI': '#2ecc71', 'Google': '#3498db', 'DeepMind': '#9b59b6', 'Meta': '#e74c3c'}

for year, name, params, company in models:
    ax.scatter([year], [params], s=200, c=colors[company], alpha=0.7, 
              edgecolors='black', linewidth=2, zorder=5)
    offset = 15 if params < 100 else -25
    ax.annotate(f'{name}\n({params}B)', (year, params), 
               xytext=(0, offset), textcoords='offset points',
               ha='center', fontsize=9, fontweight='bold')

ax.set_yscale('log')
ax.set_xlabel('Year', fontsize=12)
ax.set_ylabel('Parameters (Billions)', fontsize=12)
ax.set_title('The Race to Scale: Language Model Parameter Growth', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, which='both')
ax.set_xlim(2018, 2024.5)

# Legend
for company, color in colors.items():
    ax.scatter([], [], c=color, s=100, label=company, edgecolors='black')
ax.legend(loc='upper left', fontsize=10)

plt.tight_layout()
plt.show()

print("From 117M (GPT-1) to an estimated 1.8T (GPT-4) in 5 years: roughly a 15,000x increase!")
print("\nBut Chinchilla showed: it's not just about size, data matters too")

### Key Scaling Insights

| Discovery | Implication | F1 Parallel |
|-----------|------------|-------------|
| **Power law scaling** | Can predict performance before training | Predict lap time improvement from development spend |
| **Compute-optimal training** | Balance model size and data (Chinchilla) | Balance aero and PU development budgets |
| **Emergent abilities** | Some capabilities appear suddenly at scale | Sudden breakthrough in car concept (e.g., double diffuser) |
| **Diminishing returns** | Each 10x costs more for same improvement | Last tenth of a second costs more than the first |
| **Architecture matters less** | At scale, different architectures converge | At the front, all cars converge on similar concepts |

### The Chinchilla Finding

DeepMind's Chinchilla paper (2022) showed that many models were **undertrained** -- they had too many parameters for the amount of data they saw. The optimal ratio is roughly:

$$\text{Tokens} \approx 20 \times \text{Parameters}$$

**What this means:** A 70B parameter model trained on 1.4T tokens (Chinchilla) outperformed the 280B parameter Gopher trained on 300B tokens. **More data can be better than more parameters.**

**F1 analogy:** This is like discovering that a team with a smaller budget but more testing days beats a team with a bigger budget but fewer track sessions. The car is only as good as the data used to develop it. Chinchilla showed that "testing" (training data) matters at least as much as "car complexity" (parameters).
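
The rule of thumb above is easy to turn into numbers. The sketch below uses the 20-tokens-per-parameter ratio from the text, plus the widely used approximation that training costs about $6 \cdot N \cdot D$ FLOPs. That approximation is an assumption brought in for illustration, not something this notebook derives, and both function names are hypothetical.

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb from the text

def compute_optimal_tokens(n_params):
    """Training tokens suggested by the ~20 tokens-per-parameter rule."""
    return TOKENS_PER_PARAM * n_params

def approx_training_flops(n_params, n_tokens):
    """Rough training cost, assuming the common C ~ 6 * N * D approximation."""
    return 6 * n_params * n_tokens

# (parameters, training tokens) as quoted in the text
models = {"Chinchilla": (70e9, 1.4e12), "Gopher": (280e9, 300e9)}

for name, (n_params, n_tokens) in models.items():
    ratio = n_tokens / n_params
    flops = approx_training_flops(n_params, n_tokens)
    print(f"{name:10s} tokens/param = {ratio:5.1f}   approx FLOPs = {flops:.2e}")

print(f"Suggested tokens for a 7B model: {compute_optimal_tokens(7e9):.2e}")
```

Under this approximation the two runs cost a similar amount of compute, so the comparison isolates how the budget was split: Chinchilla spent it on data, Gopher on parameters.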

---

## Exercises

### Exercise 1: Extend BPE with Encoding

Add an `encode()` method to our SimpleBPE class that returns token indices.

**F1 framing:** Build a radio message encoder. Given the BPE vocabulary learned from race communications, encode any new team radio message into its token sequence -- turning "Box box box, switch to mediums" into a sequence of integer IDs the strategy computer can process.

In [None]:
# EXERCISE: Add encode method to SimpleBPE
def bpe_encode(self, text):
    """
    Tokenize and convert to indices.
    
    Args:
        text: Input string
        
    Returns:
        List of token indices
    """
    # TODO: Implement this!
    # 1. Call self.tokenize(text) to get tokens
    # 2. Convert each token to its index using self.vocab
    # Hint: Handle unknown tokens gracefully
    pass

# Test:
# SimpleBPE.encode = bpe_encode
# indices = bpe.encode("the cat sat")
# print(f"Encoded: {indices}")

### Exercise 2: Top-p (Nucleus) Sampling

Implement top-p sampling, which samples from the smallest set of tokens whose cumulative probability exceeds p.

**F1 framing:** Top-p sampling is like a strategy engineer who considers only the most likely scenarios until they cover, say, 90% of probable outcomes. With p=0.9, if "pit on this lap" (60%) and "pit next lap" (25%) and "stay out" (10%) cover 95%, the engineer ignores the remaining 5% of wild scenarios. Implement this "nucleus" of likely strategy options.

In [None]:
# EXERCISE: Implement top-p sampling
def sample_top_p(logits, p=0.9):
    """
    Sample from the nucleus (top-p) of the distribution.
    
    Args:
        logits: Raw model outputs (vocab_size,)
        p: Cumulative probability threshold
        
    Returns:
        Sampled token index
    """
    # TODO: Implement this!
    # 1. Convert logits to probabilities with softmax
    # 2. Sort probabilities in descending order
    # 3. Compute cumulative sum
    # 4. Find cutoff where cumsum > p
    # 5. Zero out probabilities below cutoff
    # 6. Renormalize and sample
    pass

# Test:
# logits = torch.randn(100)
# sampled = sample_top_p(logits, p=0.9)
# print(f"Sampled token: {sampled}")

### Exercise 3: Calculate Perplexity

Perplexity is the standard metric for language models: $PPL = \exp(\text{average cross-entropy loss})$

**F1 framing:** Perplexity measures how "surprised" the model is by the actual sequence of events. A race predictor with perplexity 10 on a race was, on average, as uncertain as if it were choosing between 10 equally likely next events. A perplexity of 2 means it was effectively choosing between just 2 options, which makes it a much sharper predictor. Calculate this metric for our trained model.
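
The "equally likely options" reading can be checked with a few lines of plain Python. This is a minimal sketch: `perplexity` is a hypothetical helper that takes the probabilities a model assigned to the tokens that actually occurred. It is not the `calculate_perplexity` function the exercise asks for, which still has to run the model over the text.

```python
import math

def perplexity(probs_of_actual_tokens):
    """exp of the average negative log-probability assigned to the true tokens."""
    nll = [-math.log(p) for p in probs_of_actual_tokens]
    return math.exp(sum(nll) / len(nll))

# A predictor that always spreads its belief evenly over 10 possible events
uniform_over_10 = [0.1] * 5
# A predictor that is always torn between exactly 2 events
uniform_over_2 = [0.5] * 5
# A predictor that is usually confident but occasionally badly surprised
mixed = [0.9, 0.9, 0.9, 0.9, 0.01]

for name, probs in [("10 options", uniform_over_10),
                    ("2 options", uniform_over_2),
                    ("mixed", mixed)]:
    print(f"{name:10s} perplexity = {perplexity(probs):.2f}")
```

The last case shows why the average is taken in log space: a single badly mispredicted token raises perplexity noticeably even when every other prediction was confident and correct.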

In [None]:
# EXERCISE: Calculate perplexity
def calculate_perplexity(model, text, encode_fn, block_size=32):
    """
    Calculate perplexity of a model on given text.
    
    Args:
        model: Trained language model
        text: Text string to evaluate
        encode_fn: Function to convert text to token indices
        block_size: Sequence length for evaluation
        
    Returns:
        Perplexity value (lower is better)
    """
    # TODO: Implement this!
    # 1. Encode the text
    # 2. Split into sequences of at most block_size tokens
    #    (the test text below is shorter than block_size, so handle a short final chunk)
    # 3. Compute average loss over all sequences
    # 4. Return exp(average_loss)
    pass

# Test:
# ppl = calculate_perplexity(model, "the cat sat on the mat", encode)
# print(f"Perplexity: {ppl:.2f}")

---

## Summary

### Key Concepts

- **Language Model**: Predicts $P(w_t | w_1, ..., w_{t-1})$ -- the probability of the next token given context
- **Tokenization**: Breaking text into discrete units; BPE is the dominant modern approach
- **GPT (Decoder-Only)**: Left-to-right generation with causal masking
- **BERT (Encoder-Only)**: Bidirectional understanding with masked language modeling
- **Temperature**: Controls randomness in generation (low = deterministic, high = creative)
- **Scaling Laws**: Performance improves predictably with parameters, data, and compute

### Connection to Deep Learning

| Concept | Application | F1 Parallel |
|---------|------------|-------------|
| Next-token prediction | Foundation of ChatGPT, Claude, and all modern LLMs | Predicting the next event in a race sequence |
| BPE tokenization | Used by GPT-4, LLaMA, and most production models | Breaking radio messages into meaningful units |
| Causal masking | Enables autoregressive text generation | Real-time decisions based only on past events |
| Bidirectional attention | Powers search engines and classification systems | Full race situation analysis (BERT-style understanding) |
| Temperature sampling | Controls creativity in AI writing assistants | Conservative vs. aggressive strategy dial |
| Scaling laws | Guide billion-dollar training decisions | Development budget allocation in F1 |

### Checklist

- [ ] I can explain what a language model is and why next-token prediction is so powerful
- [ ] I understand how BPE tokenization works and why it's used
- [ ] I can implement a simple GPT model from scratch
- [ ] I know when to use GPT vs BERT for different tasks
- [ ] I understand how temperature affects generation
- [ ] I can explain the key findings of scaling laws research

---

## Next Steps

Now that you understand language model architectures, the next notebook covers **Embeddings** -- how to represent words, sentences, and concepts as vectors that capture meaning, and how these representations power search, recommendation, and RAG systems.