# Exercise 2 — Character-Level Language Modeling

## Why Language Modeling?

With over 700 million weekly active users within just a few years of launch, ChatGPT is arguably the fastest-adopted technology in history. At the core of ChatGPT and practically all modern AI tools are *Large Language Models (LLMs)*, which enable rich interaction with human language, both generating it and understanding it.

Practically all modern language models, from the simplest to GPT-4, are ML models trained to **predict the next token** (a word, subword, or character) given the previous tokens. This seemingly simple task is the foundation of all modern AI language capabilities.

In this exercise, we'll build language models from scratch to understand:
- **How ML "learns" language** through statistical patterns
- **Why design choices matter** (data representation, model architecture, inductive biases)
- **Where modern LLMs come from** and what makes them powerful
- **The fundamental principles** that scale from our tiny models to billion-parameter systems

By the end, you'll have built two language models and understand the core ideas behind ChatGPT, just at a much smaller scale!

## Learning Objectives

In this exercise, you will:
1. Build two different language models from scratch using PyTorch
2. Understand how **data representation** choices affect model performance
3. Explore how **inductive biases** (assumptions about the problem) shape what models can learn
4. See that machine learning is fundamentally a **modeling science** — our design choices matter!

We'll work with Shakespeare's text and build increasingly sophisticated models, comparing their strengths and limitations.

In [None]:
# Imports and deterministic seed
import torch
import torch.nn as nn
import torch.nn.functional as F
import requests
import matplotlib.pyplot as plt

# Stable RNG for reproducibility
g = torch.Generator().manual_seed(67)

In [None]:
# Download the tiny Shakespeare dataset
data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
response = requests.get(data_url)
response.raise_for_status()
text = response.text

print(f"Total length of the dataset (characters): {len(text):,}")
print('-' * 30)
print('A short sample of the text:')
print(text[:500])

## Vocabulary & Tokenization

**Design Choice #1: Character-level vs Word-level**

We'll use **character-level** tokenization, meaning each character is a token. This is simpler than word-level (which would require handling ~30,000 words) and allows the model to generate any word, even ones not in the training data.

**Modern LLMs** typically use subword tokenization (like BPE or WordPiece), which is a middle ground between characters and words (think 'swim', 'ing', and '<p>' being separate tokens). But the principles are the same!

Trade-off: Character-level models need to learn spelling and word formation, making the task harder.

**Why Cross-Entropy Loss?** For classification tasks (predicting which character comes next), cross-entropy measures how well our predicted probability distribution matches the true distribution. It heavily penalizes confident wrong predictions. This is the same loss function used in GPT models!
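
To make the cross-entropy idea concrete, here is a minimal sketch in plain Python. The three probability vectors are made-up illustrations over a hypothetical 4-character vocabulary, not outputs of any model in this notebook.

```python
import math

def cross_entropy(probs, target):
    """Cross-entropy for one example: negative log of the probability given to the true class."""
    return -math.log(probs[target])

# Hypothetical predictions over a 4-character vocabulary; the true next character is index 2
confident_right = [0.01, 0.01, 0.97, 0.01]
uniform_guess = [0.25, 0.25, 0.25, 0.25]
confident_wrong = [0.97, 0.01, 0.01, 0.01]

for name, probs in [("confident and right", confident_right),
                    ("uniform guess", uniform_guess),
                    ("confident and wrong", confident_wrong)]:
    print(f"{name:>20}: loss = {cross_entropy(probs, 2):.3f}")
```

The uniform guess costs exactly `log(4)`, and the confident wrong prediction costs far more than that, which is the "heavy penalty" described above.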

In [None]:
# Build vocabulary (characters) and mappings
chars = sorted(set(text))

# Reserve index 0 for a special token
special_token = '<pad>'

# stoi: string -> int, start from 1
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi[special_token] = 0

# itos: int -> string
itos = {i + 1: ch for i, ch in enumerate(chars)}
itos[0] = special_token

vocab_size = len(chars) + 1  # +1 for the special token
print(f"Vocabulary size (including '{special_token}'): {vocab_size}")
print('Characters:', ''.join(chars))

# Encoder and decoder convenience functions
def encode(s):
    return [stoi[c] for c in s]

def decode(l):
    return ''.join(itos[i] for i in l)

# Quick sanity check
sample_text = "to be or not to be"
encoded_sample = encode(sample_text)
decoded_sample = decode(encoded_sample)

print('\nSample Text:', sample_text)
print('Encoded:', encoded_sample)
print('Decoded:', decoded_sample)

In [None]:
# Convert entire text to a single tensor and split into train/validation
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Data tensor shape: {data.shape}")

n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

print(f"Train data: {len(train_data):,} characters")
print(f"Validation data: {len(val_data):,} characters")
print(f"\nTrain sample: \n{decode(train_data[:100].tolist())}")

---

## Part 1: The Bigram Model

**Inductive Bias: The Markov Assumption**

Our first model makes a strong assumption: *the next character depends ONLY on the current character*. This is called a **bigram model** or first-order Markov model.

### Why start here?
1. It's the simplest possible language model
2. We can understand it both as **counting** and as a **neural network**
3. Its limitations will motivate our next model
4. **Historical context**: Before deep learning, n-gram models (bigrams, trigrams) were the state-of-the-art for language modeling!

Let's first see what this looks like as a counting problem, then implement it as a neural network.

### Understanding Bigrams Through Counting

Before building a neural network, let's understand what a bigram model actually learns: **transition probabilities** between characters.
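
As a warm-up, the sketch below counts transitions on a tiny made-up string and normalizes each row of counts into probabilities. The string `toy_text` and the names `counts` and `probs` are illustrative assumptions, separate from the Shakespeare data used in the notebook.

```python
from collections import Counter, defaultdict

toy_text = "abababac"  # hypothetical toy corpus, not the Shakespeare data

# Count how often each character is followed by each other character
counts = defaultdict(Counter)
for current_char, next_char in zip(toy_text, toy_text[1:]):
    counts[current_char][next_char] += 1

# Normalize each row of counts into a probability distribution
probs = {
    current_char: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
    for current_char, followers in counts.items()
}

print("After 'a':", probs["a"])
print("After 'b':", probs["b"])
```

In this toy string, `'a'` is followed by `'b'` three times and by `'c'` once, so the model's entire "knowledge" about `'a'` is the pair of probabilities 0.75 and 0.25.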

In [None]:
# Count bigram transitions in the training data
import collections

bigram_counts = collections.defaultdict(lambda: collections.defaultdict(int))

# Count all character pairs in sequence
# (convert to a Python list once; calling .item() on every element is slow)
train_ids = train_data.tolist()
for current_char, next_char in zip(train_ids, train_ids[1:]):
    bigram_counts[current_char][next_char] += 1

# Show some examples
print("Sample Bigram Counts (what follows 't'):")
print("-" * 40)
t_idx = stoi['t']
t_transitions = bigram_counts[t_idx]
top_5 = sorted(t_transitions.items(), key=lambda x: x[1], reverse=True)[:5]

for next_char_idx, count in top_5:
    next_char = itos[next_char_idx]
    print(f"  't' → '{next_char}': {count:,} times")

print("\nKey Insight: A bigram model learns these transition frequencies.")
print("We can represent this as a matrix of probabilities!")

In [None]:
# Create proper sequential bigram training data
def create_bigram_dataset(data):
    """
    Creates sequential bigram pairs from the data.
    X[i] is a character, Y[i] is the character that follows it.
    """
    X = data[:-1]  # All characters except the last
    Y = data[1:]   # All characters except the first
    return X, Y

# Create train and validation datasets
X_train_bigram, Y_train_bigram = create_bigram_dataset(train_data)
X_val_bigram, Y_val_bigram = create_bigram_dataset(val_data)

print(f"Training examples: {len(X_train_bigram):,}")
print(f"Validation examples: {len(X_val_bigram):,}")

# Show some examples
print("\n--- Sample Bigram Pairs (X → Y) ---")
for i in range(10):
    x_char = decode([X_train_bigram[i].item()])
    y_char = decode([Y_train_bigram[i].item()])
    print(f"'{x_char}' → '{y_char}'")

In [None]:
# Helper function to create batches
def get_batch_bigram(split, batch_size=32):
    """Samples a random batch of bigram pairs."""
    X = X_train_bigram if split == 'train' else X_val_bigram
    Y = Y_train_bigram if split == 'train' else Y_val_bigram
    
    # Sample random indices
    ix = torch.randint(len(X), (batch_size,), generator=g)
    
    return X[ix], Y[ix]

# Test the batch function
X_sample, Y_sample = get_batch_bigram('train', batch_size=5)
print("Sample batch:")
for x, y in zip(X_sample, Y_sample):
    print(f"  '{decode([x.item()])}' → '{decode([y.item()])}'")

### Bigram Neural Network Architecture

Instead of counting, we'll represent the bigram transition probabilities as a **learnable weight matrix** W.

- **W** has shape (vocab_size, vocab_size)
- **W[i, j]** represents the "score" for character j following character i
- We'll use gradient descent to learn these scores from data
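
One way to see why a one-hot input works with this matrix: multiplying a one-hot vector by W simply selects one row of W. The sketch below shows this in plain Python with a small made-up matrix (`W_toy` and the helper names are assumptions for illustration only).

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, mat):
    """Row vector times matrix, written out with plain loops."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

# Hypothetical 3-character score matrix: row i holds the scores for what follows character i
W_toy = [
    [0.1, 0.2, 0.3],
    [1.0, 2.0, 3.0],
    [-1.0, 0.0, 1.0],
]

logits = vec_mat(one_hot(1, 3), W_toy)
print(logits)  # identical to row 1 of W_toy
```

So the bigram network is a lookup table in disguise: the current character picks a row, and that row contains the scores for every possible next character.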

In [None]:
# Initialize weights: A tensor of size (vocab_size, vocab_size)
# W[i, j] stores the unnormalized score (logit) of character j following character i
W = torch.randn((vocab_size, vocab_size), generator=g, requires_grad=True)

print("--- The Bigram Weight Matrix (W) ---")
print(f"Shape of W: {W.shape}")
print(f"Total parameters: {W.numel():,}")
print(f"W requires gradients: {W.requires_grad}")

In [None]:
# Forward Pass and Loss Calculation
X, Y = get_batch_bigram('train')

# 1. Input Encoding: Convert character indices to one-hot vectors
X_encoded = F.one_hot(X, num_classes=vocab_size).float()

# 2. Prediction: Matrix multiplication to get logits
logits = X_encoded @ W 

# 3. Loss Calculation (Cross Entropy combines softmax and negative log likelihood)
loss = F.cross_entropy(logits, Y)

print("-" * 40)
print(f"Logits shape: {logits.shape}")
print(f"Initial loss (untrained): {loss.item():.4f}")
print(f"\nUniform-guess loss: {torch.log(torch.tensor(float(vocab_size))).item():.4f}")
print("(This is -log(1/vocab_size), the loss of a uniform guess; random initial weights usually start somewhat higher)")

### Training the Bigram Model

We'll track both training and validation loss to monitor learning progress.

In [None]:
# Training Loop with validation tracking
# Note: each "epoch" here is a single mini-batch gradient step, not a full pass over the data
epochs = 400
learning_rate = 50

# Storage for plotting
train_losses = []
val_losses = []

print("Starting Bigram Model Training...")
print("-" * 40)

for k in range(epochs):
    # === Training Step ===
    X, Y = get_batch_bigram('train')
    X_encoded = F.one_hot(X, num_classes=vocab_size).float()
    logits = X_encoded @ W 
    loss = F.cross_entropy(logits, Y)
    
    # Backward pass
    W.grad = None
    loss.backward()
    
    # Update weights (in place, without tracking the update in autograd)
    with torch.no_grad():
        W -= learning_rate * W.grad
    
    # === Evaluation (no gradient computation) ===
    if k % 100 == 0:
        with torch.no_grad():
            # Training loss
            train_losses.append(loss.item())
            
            # Validation loss
            X_val, Y_val = get_batch_bigram('val', batch_size=1000)
            X_val_encoded = F.one_hot(X_val, num_classes=vocab_size).float()
            logits_val = X_val_encoded @ W
            val_loss = F.cross_entropy(logits_val, Y_val)
            val_losses.append(val_loss.item())
            
            print(f"Epoch {k:4d} | Train Loss: {loss.item():.4f} | Val Loss: {val_loss.item():.4f}")

print("-" * 40)
# Losses are logged every 100 steps, so these values come from the last logged step
print(f"Last Logged Training Loss: {train_losses[-1]:.4f}")
print(f"Last Logged Validation Loss: {val_losses[-1]:.4f}")

In [None]:
# Plot the learning curves
plt.figure(figsize=(10, 5))
plt.plot(range(0, epochs, 100), train_losses, label='Training Loss', marker='o')
plt.plot(range(0, epochs, 100), val_losses, label='Validation Loss', marker='s')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Bigram Model: Learning Curves')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("✓ The model has learned the bigram transition probabilities!")

### Generating Text with the Bigram Model

Let's see what kind of text our bigram model produces.

**Note on Temperature:** The `temperature` parameter controls randomness:
- temperature = 1.0: Use the model's probabilities as-is
- temperature < 1.0: Make the model more confident (less random)
- temperature > 1.0: Make the model more exploratory (more random)

When an LLM API or playground exposes a temperature setting, this is what it controls.
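
A minimal sketch of temperature scaling in plain Python, assuming three made-up logits (`toy_logits`). It mirrors the `softmax(logits / temperature)` step used in the generation code but is not tied to the trained model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)
    exps = [math.exp(z - largest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate characters
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the highest-scoring character, while higher temperatures flatten the distribution toward uniform.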

In [None]:
g_gen = torch.Generator().manual_seed(67)

# Text generation function for bigram model
@torch.no_grad()  # sampling does not need gradients
def generate_bigram(max_tokens=200, temperature=1.0):
    """Generate text using the trained bigram model."""
    
    # Start with a random character
    current = torch.randint(1, vocab_size, (1,), generator=g_gen)
    result = [current.item()]
    
    for _ in range(max_tokens):
        # Get logits for current character
        x_enc = F.one_hot(current, num_classes=vocab_size).float()
        logits = x_enc @ W
        # Apply temperature and sample
        probs = F.softmax(logits / temperature, dim=1)
        next_token = torch.multinomial(probs, num_samples=1, generator=g_gen).reshape(-1)
        result.append(next_token.item())
        current = next_token
    
    return decode(result)

print("--- Bigram Model Generated Text ---\n")
print(generate_bigram(300))
print("\n" + "=" * 60)

### Reflection: Limitations of the Bigram Model

Look at the generated text. You'll notice:
- Individual character transitions look reasonable (e.g., 'in', 'be', ':' followed by newline, capital letters grouped up)
- But there are no real words or coherent structure
- The model has **no memory** beyond the immediate previous character

**Why?** Our inductive bias is too restrictive! Real language has longer-range dependencies:
- "The cat sat on the ___" → "mat" (depends on "cat", not just "e")
- "She said she would ___" → needs to remember "she"

**Historical note**: This limitation is exactly why researchers moved beyond n-gram models to neural networks with longer context windows, and eventually to transformers with attention mechanisms.

**Question:** How can we give the model more context?

---

## Part 2: MLP with Context Window

**New Inductive Bias: Fixed Context Window**

Let's improve our model by allowing it to look at the previous **N characters** (not just 1) to predict the next character.

**Design Choice #2: Context Window Size**
- Bigram: context = 1 character
- Our new model: context = 3 characters (we'll call this `block_size`)

This is still a simplification (real language has much longer dependencies), but it's a step forward!

**Connection to LLMs**: Modern models like GPT-4 have context windows of (hundreds of) thousands of tokens. The principle is the same, we're just scaling up! 
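
To illustrate the fixed context window described above, the sketch below slides a window over a short made-up string and lists the (context, target) pairs it produces. The helper `make_windows` is a hypothetical name used only for this illustration.

```python
def make_windows(text, block_size):
    """Return (context, target) pairs: block_size characters and the character that follows them."""
    return [(text[i:i + block_size], text[i + block_size])
            for i in range(len(text) - block_size)]

for context, target in make_windows("hello world", block_size=3):
    print(f"'{context}' -> '{target}'")
```

With `block_size=1` the same function yields exactly the bigram pairs from Part 1, which shows that the bigram model is the special case of a one-character window.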

In [None]:
# Define context window size
block_size = 3 
print(f"New Inductive Bias: Predict next character using previous {block_size} characters")
print(f"Example: Given 'cat', predict what comes next")

# Create the context-based dataset
def create_dataset(data):
    """Creates input (X) and target (Y) tensors based on block_size."""
    X, Y = [], []
    for i in range(len(data) - block_size):
        context = data[i:i + block_size] 
        target = data[i + block_size]
        X.append(context)
        Y.append(target)
    
    X = torch.stack(X)
    Y = torch.stack(Y)
    return X, Y

# Generate datasets
X_train, Y_train = create_dataset(train_data)
X_val, Y_val = create_dataset(val_data)

print("-" * 40)
print(f"Training examples: {len(X_train):,}")
print(f"Validation examples: {len(X_val):,}")
print(f"\nX_train shape: {X_train.shape} (num_examples, context_length)")
print(f"Y_train shape: {Y_train.shape}")

# Show examples
print("\n--- Sample Context → Target ---")
for i in range(5):
    context = decode(X_train[i].tolist())
    target = decode([Y_train[i].item()])
    print(f"'{context}' → '{target}'")

### Design Choice #3: Character Embeddings

**Problem with One-Hot Encoding:**
- Each character is represented as a sparse vector (e.g., [0,0,1,0,...,0])
- Characters are treated as completely independent
- No notion of similarity (e.g., vowels, consonants)

**Solution: Learned Embeddings**
- Represent each character as a dense vector (e.g., 10 dimensions)
- The model learns these representations during training
- Similar characters (appearing in similar contexts) develop similar embeddings

**This is a key innovation in modern NLP!** Word embeddings (Word2Vec, GloVe) revolutionized NLP in the 2010s. Modern LLMs use the same principle but learn embeddings for subword tokens. GPT-3, for example, uses 12,288-dimensional embeddings!
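
The difference between the two representations can be sketched with cosine similarity. The dense vectors below are hand-picked for illustration (they are not learned, and the values are assumptions); the point is only that one-hot vectors cannot express similarity, while dense vectors can.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot vectors: every pair of distinct characters is equally (dis)similar
one_hot_a = [1.0, 0.0, 0.0]
one_hot_e = [0.0, 1.0, 0.0]
one_hot_t = [0.0, 0.0, 1.0]

# Hypothetical dense embeddings, chosen by hand so the two vowels point the same way
emb = {"a": [0.9, 0.1], "e": [0.8, 0.3], "t": [-0.7, 0.6]}

print("one-hot a vs e:", cosine(one_hot_a, one_hot_e))
print("one-hot a vs t:", cosine(one_hot_a, one_hot_t))
print("dense   a vs e:", round(cosine(emb["a"], emb["e"]), 3))
print("dense   a vs t:", round(cosine(emb["a"], emb["t"]), 3))
```

During training, gradient descent adjusts the dense vectors, so any similarity structure that helps predict the next character can emerge on its own.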

In [None]:
# Embedding hyperparameter
embed_dim = 10

# Create the embedding matrix
# Each character gets a 10-dimensional vector
C = torch.randn((vocab_size, embed_dim), generator=g)

print(f"Embedding Matrix C shape: {C.shape}")
print(f"Each character is now represented by a {embed_dim}-dimensional vector")

# Example: Look up embeddings for a sample
sample_context = X_train[0]  # e.g., [15, 23, 8]
sample_embeddings = C[sample_context]

print(f"\nSample context indices: {sample_context.tolist()}")
print(f"Sample context text: '{decode(sample_context.tolist())}'")
print(f"Embeddings shape: {sample_embeddings.shape} (context_length, embed_dim)")
print(f"\nFirst character embedding:\n{sample_embeddings[0]}")

### Multi-Layer Perceptron (MLP) Architecture

Now we need to combine the information from our 3 character embeddings to predict the next character.

**Architecture:**
1. **Input:** 3 characters → 3 embeddings of size 10 → flatten to vector of size 30
2. **Hidden Layer:** Linear transformation (30 → 100) + Tanh activation
3. **Output Layer:** Linear transformation (100 → vocab_size) → logits

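The **Tanh activation** makes the model non-linear, so it can learn patterns that depend on *combinations* of context characters (e.g., the effect of the last character can depend on the two characters before it). Without it, the two linear layers would collapse into a single linear map that scores each context position independently.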
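
The three architecture steps can be sketched end to end in plain Python. All sizes and weights below are small made-up stand-ins (`vocab`, `block`, `emb_dim`, `hidden`, and the `*_toy` parameters are assumptions), chosen so the shapes are easy to follow; the notebook's real model uses PyTorch tensors instead.

```python
import math
import random

random.seed(0)

# Toy sizes (assumptions): 5 characters, 3-character context, 2-D embeddings, 4 hidden units
vocab, block, emb_dim, hidden = 5, 3, 2, 4

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

C_toy = rand_matrix(vocab, emb_dim)            # embedding table
W1_toy = rand_matrix(block * emb_dim, hidden)  # hidden layer weights
b1_toy = [0.0] * hidden
W2_toy = rand_matrix(hidden, vocab)            # output layer weights
b2_toy = [0.0] * vocab

def affine(x, W, b):
    """Compute x @ W + b for a single input vector x."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def forward(context):
    # 1. Look up one embedding per context character and flatten them into one vector
    flat = [value for idx in context for value in C_toy[idx]]
    # 2. Hidden layer: linear transformation followed by tanh
    h = [math.tanh(z) for z in affine(flat, W1_toy, b1_toy)]
    # 3. Output layer: one logit per character, then softmax to get probabilities
    logits = affine(h, W2_toy, b2_toy)
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = forward([1, 4, 2])
print([round(p, 3) for p in probs])
```

The output is one probability per character in the toy vocabulary; training would adjust every `*_toy` parameter so that the true next character receives high probability.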

In [None]:
# Define MLP hyperparameters
hidden_size = 100 
input_dim = block_size * embed_dim  # 3 * 10 = 30

# Initialize parameters from a standard normal distribution
# (the commented-out factors are optional scalings to experiment with)
W1 = torch.randn((input_dim, hidden_size), generator=g)  # * (5/3) / (input_dim**0.5)
b1 = torch.randn(hidden_size, generator=g)  # * 0.1

W2 = torch.randn((hidden_size, vocab_size), generator=g)  # * 0.1
b2 = torch.randn(vocab_size, generator=g)  # * 0.1

# Collect all parameters
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True

print("=" * 50)
print("MLP MODEL ARCHITECTURE")
print("=" * 50)
print(f"Input: {block_size} characters")
print(f"Embedding: {vocab_size} × {embed_dim} = {C.numel():,} parameters")
print(f"Hidden Layer: {input_dim} × {hidden_size} = {W1.numel():,} parameters")
print(f"Output Layer: {hidden_size} × {vocab_size} = {W2.numel():,} parameters")
print("-" * 50)
print(f"Total parameters: {sum(p.numel() for p in parameters):,}")
print(f"Bigram model had: {vocab_size**2:,} parameters")
print("=" * 50)

### Training the MLP Model

We'll use the Adam optimizer (a variant of gradient descent that adds momentum and adapts the step size for each parameter) and track both training and validation loss.
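
For intuition about what Adam does on each step, here is a sketch of the standard Adam update rule applied to a single number. The function `adam_step` and the toy objective `(x - 3)^2` are assumptions for illustration; in the notebook, `torch.optim.Adam` performs this update for every parameter.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t is the 1-based step count)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (x - 3)
    x, m, v = adam_step(x, grad, m, v, t)

print(f"x after 2000 steps: {x:.3f}")
```

Because each step is divided by the running gradient magnitude, the size of an update stays close to the learning rate regardless of how large the raw gradient is, which often makes Adam less sensitive to the learning rate than plain gradient descent.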

In [None]:
# Batch sampling function
def get_batch_context(split, batch_size=64):
    """Samples a random batch from the context dataset."""
    X = X_train if split == 'train' else X_val
    Y = Y_train if split == 'train' else Y_val
    
    ix = torch.randint(len(X), (batch_size,), generator=g)
    return X[ix], Y[ix]

# Training setup
optimizer = torch.optim.Adam(parameters, lr=0.01)
epochs = 15000

# Storage for plotting
mlp_train_losses = []
mlp_val_losses = []

print("Starting MLP Context Model Training...")
print("-" * 50)

for k in range(epochs):
    # === Training Step ===
    Xb, Yb = get_batch_context('train')
    
    # Forward pass
    emb = C[Xb]  # (batch, block_size, embed_dim)
    emb_flat = emb.view(emb.shape[0], -1)  # (batch, block_size * embed_dim)
    h = torch.tanh(emb_flat @ W1 + b1)  # (batch, hidden_size)
    logits = h @ W2 + b2  # (batch, vocab_size)
    
    loss = F.cross_entropy(logits, Yb)
    
    # Backward pass
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    
    # === Evaluation ===
    if k % 1000 == 0:
        with torch.no_grad():
            mlp_train_losses.append(loss.item())
            
            # Validation loss
            Xb_val, Yb_val = get_batch_context('val', batch_size=1000)
            emb_val = C[Xb_val]
            emb_val_flat = emb_val.view(emb_val.shape[0], -1)
            h_val = torch.tanh(emb_val_flat @ W1 + b1)
            logits_val = h_val @ W2 + b2
            val_loss = F.cross_entropy(logits_val, Yb_val)
            mlp_val_losses.append(val_loss.item())
            
            print(f"Epoch {k:5d} | Train Loss: {loss.item():.4f} | Val Loss: {val_loss.item():.4f}")

print("-" * 50)
# Losses are logged every 1000 steps, so these values come from the last logged step
print(f"Last Logged Training Loss: {mlp_train_losses[-1]:.4f}")
print(f"Last Logged Validation Loss: {mlp_val_losses[-1]:.4f}")

In [None]:
# Plot MLP learning curves
plt.figure(figsize=(10, 5))
plt.plot(range(0, epochs, 1000), mlp_train_losses, label='Training Loss', marker='o')
plt.plot(range(0, epochs, 1000), mlp_val_losses, label='Validation Loss', marker='s')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('MLP Context Model: Learning Curves')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

### Generating Text with the MLP Model

In [None]:
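@torch.no_grad()  # sampling does not need gradients
def generate_mlp(max_new_tokens=300, temperature=1.0):
    """Generate text using the trained MLP model."""
    # Seed with the first block_size training characters: the '<pad>' token never
    # appears in the training windows, so its embedding is never trained
    context = train_data[:block_size].tolist()
    result = []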
    
    for _ in range(max_new_tokens):
        # Forward pass
        X_context = torch.tensor([context], dtype=torch.long)
        emb = C[X_context]
        emb_flat = emb.view(1, -1)
        h = torch.tanh(emb_flat @ W1 + b1)
        logits = h @ W2 + b2
        
        # Sample next token
        probs = F.softmax(logits / temperature, dim=1)
        next_token = torch.multinomial(probs, num_samples=1, generator=g_gen).item()
        
        # Update context (sliding window)
        context = context[1:] + [next_token]
        result.append(next_token)
    
    return decode(result).replace('<pad>', '')

print("--- MLP Model Generated Text ---\n")
print(generate_mlp(300))
print("\n" + "=" * 60)

---

## Interlude: Visualizing What the Model Learned

Before comparing our models, let's peek inside the MLP to see what it actually learned. The embedding layer should have discovered meaningful structure in the character space.

### Extra Credit: 2D Embedding Visualization

To visualize embeddings, we need them to be 2-dimensional. Let's quickly train a version of our MLP with `embed_dim=2` so we can plot the character embeddings.

In [None]:
# Train a 2D embedding version for visualization
print("Training a 2D embedding model for visualization...")
print("-" * 50)

# Hyperparameters (same as before, but embed_dim=2)
embed_dim_2d = 2
hidden_size_2d = 100
input_dim_2d = block_size * embed_dim_2d

# Initialize parameters
g_viz = torch.Generator().manual_seed(67)  # Same seed for reproducibility
C_2d = torch.randn((vocab_size, embed_dim_2d), generator=g_viz)
W1_2d = torch.randn((input_dim_2d, hidden_size_2d), generator=g_viz)# * (5/3) / (input_dim_2d**0.5)
b1_2d = torch.randn(hidden_size_2d, generator=g_viz)# * 0.1
W2_2d = torch.randn((hidden_size_2d, vocab_size), generator=g_viz)# * 0.1
b2_2d = torch.randn(vocab_size, generator=g_viz)# * 0.1

parameters_2d = [C_2d, W1_2d, b1_2d, W2_2d, b2_2d]
for p in parameters_2d:
    p.requires_grad = True

# Train (each "epoch" is one mini-batch step; lower epochs_2d for a faster run)
optimizer_2d = torch.optim.Adam(parameters_2d, lr=0.01)
epochs_2d = 20000

for k in range(epochs_2d):
    Xb, Yb = get_batch_context('train')
    
    emb = C_2d[Xb]
    emb_flat = emb.view(emb.shape[0], -1)
    h = torch.tanh(emb_flat @ W1_2d + b1_2d)
    logits = h @ W2_2d + b2_2d
    
    loss = F.cross_entropy(logits, Yb)
    
    optimizer_2d.zero_grad(set_to_none=True)
    loss.backward()
    optimizer_2d.step()
    
    if k % 1000 == 0:
        print(f"Epoch {k:4d} | Loss: {loss.item():.4f}")

print("-" * 50)
print("✓ 2D model trained!")

In [None]:
# Visualize the 2D embeddings

# Extract embeddings
embeddings_2d = C_2d.detach().numpy()

# Create the plot
plt.figure(figsize=(14, 10))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], alpha=0.3, s=100)

# Annotate each point with its character
for i in range(vocab_size):
    char = itos[i]
    x, y = embeddings_2d[i]
    
    # Use different colors for different character types
    if char == '<pad>':
        color = 'red'
        fontsize = 10
    elif char in 'aeiouAEIOU':
        color = 'blue'
        fontsize = 12
    elif char in '.,;:!?$\n':
        color = 'green'
        fontsize = 10
    elif char == ' ':
        color = 'orange'
        fontsize = 10
        char = '␣'  # Visible space character
    elif char.isupper():
        color = 'purple'
        fontsize = 11
    else:
        color = 'black'
        fontsize = 11
    
    plt.annotate(char, (x, y), fontsize=fontsize, color=color, 
                ha='center', va='center', weight='bold')

plt.xlabel('Embedding Dimension 1', fontsize=12)
plt.ylabel('Embedding Dimension 2', fontsize=12)
plt.title('2D Character Embeddings: Learned Representations', fontsize=14, weight='bold')
plt.grid(True, alpha=0.3)

# Add legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='blue', label='Vowels'),
    Patch(facecolor='purple', label='Uppercase'),
    Patch(facecolor='green', label='Punctuation'),
    Patch(facecolor='orange', label='Space'),
    Patch(facecolor='black', label='Other lowercase'),
    Patch(facecolor='red', label='Special token')
]
plt.legend(handles=legend_elements, loc='best')

plt.tight_layout()
plt.show()

### What Do You Notice?

Look at the embedding space visualization above. You should see interesting patterns:

1. **Vowels cluster together** (blue letters): the model learned that they behave similarly. Note, though, that uppercase vowels may sit closer to other uppercase characters than to lowercase vowels!
2. **Punctuation marks group** (green) - they appear in similar contexts
3. **Uppercase letters** (purple) might form their own region
4. **The space character** (orange ␣) is often in a unique position

**Why does this happen?** The model learns to place characters that appear in similar contexts close together in embedding space. For example:
- Vowels often follow consonants and precede consonants
- Punctuation marks often follow words and precede spaces
- Uppercase letters often start sentences

**Key Insight:** The model discovered linguistic structure **without being told**! We never told it what a vowel is or what punctuation does. It learned this purely from the statistical patterns in the text.

This is the power of learned representations; the model builds its own useful abstractions from data.

**Connection to LLMs**: The same thing happens in modern language models! Word embeddings such as Word2Vec famously capture semantic relationships like "Paris - France + Italy ≈ Rome", and the token embeddings inside GPT-style models encode similarly rich structure. The model learns these relationships purely from training data steered by inductive biases, without explicit programming.
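
The analogy arithmetic can be sketched with hand-built vectors. Everything below is a hypothetical construction: the vectors are designed so that the analogy holds exactly, whereas learned embeddings only satisfy such relationships approximately.

```python
import math

# Hypothetical hand-built vectors: [is-France, is-Italy, is-Germany, is-capital]
vectors = {
    "France":  [1.0, 0.0, 0.0, 0.0],
    "Italy":   [0.0, 1.0, 0.0, 0.0],
    "Germany": [0.0, 0.0, 1.0, 0.0],
    "Paris":   [1.0, 0.0, 0.0, 1.0],
    "Rome":    [0.0, 1.0, 0.0, 1.0],
    "Berlin":  [0.0, 0.0, 1.0, 1.0],
}

def nearest(target, exclude=()):
    """Return the word whose vector is closest (Euclidean distance) to target."""
    candidates = (word for word in vectors if word not in exclude)
    return min(candidates, key=lambda word: math.dist(vectors[word], target))

# Paris - France + Italy
query = [p - f + i for p, f, i in zip(vectors["Paris"], vectors["France"], vectors["Italy"])]
print(nearest(query, exclude={"Paris", "France", "Italy"}))
```

Subtracting "France" removes the country component, adding "Italy" puts a different country in its place, and the "capital" component is carried along unchanged.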

**Reflection:** 
- Are there any surprising clusters or separations?
- What does the position of the space character tell you?
- How might this structure help the model predict the next character?
- What kinds of relationships might a word-level model learn?

In [None]:
# Side-by-side loss comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Bigram losses
ax1.plot(range(0, 400, 100), train_losses, label='Train', marker='o')
ax1.plot(range(0, 400, 100), val_losses, label='Validation', marker='s')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Bigram Model')
ax1.legend()
ax1.grid(True, alpha=0.3)

# MLP losses
ax2.plot(range(0, epochs, 1000), mlp_train_losses, label='Train', marker='o')
ax2.plot(range(0, epochs, 1000), mlp_val_losses, label='Validation', marker='s')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss')
ax2.set_title('MLP Context Model')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Bigram Final Val Loss: {val_losses[-1]:.4f}")
print(f"MLP Final Val Loss: {mlp_val_losses[-1]:.4f}")
print(f"Improvement: {((val_losses[-1] - mlp_val_losses[-1]) / val_losses[-1] * 100):.1f}%")

In [None]:
# Generate comparison samples
print("=" * 70)
print("SIDE-BY-SIDE TEXT GENERATION COMPARISON")
print("=" * 70)

for i in range(3):
    print(f"\n--- Sample {i+1} ---")
    print("\nBigram Model:")
    print(generate_bigram(200))
    print("\nMLP Context Model:")
    print(generate_mlp(200))
    print("-" * 70)

### Summary: How Design Choices Shape Model Performance

| Design Choice | Bigram Model | MLP Context Model | Modern LLMs (e.g., GPT-4) |
|--------------|--------------|-------------------|---------------------------|
| **Data Representation** | One-hot encoding | Learned embeddings (10D) | Learned embeddings (>12,000D) |
| **Architecture** | Single linear layer | Embeddings → MLP | Embeddings → Transformer layers |
| **Parameters** | ~4,000 | ~10,000 | >100 billion (GPT-4 estimated at 1.7 trillion) |
| **Training Data** | 1MB Shakespeare | 1MB Shakespeare | Trillions of tokens |
| **Context Window** | 1 character | 3 characters | 128,000-1,000,000 tokens |
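
The parameter counts in the table can be checked with a few lines of arithmetic. The sketch assumes `vocab_size = 66` (65 distinct characters plus the `<pad>` token); the value printed by the vocabulary cell is the one to trust if it differs.

```python
vocab_size = 66  # assumption: 65 distinct characters plus the '<pad>' token
block_size, embed_dim, hidden_size = 3, 10, 100

bigram_params = vocab_size * vocab_size  # one score per (current, next) character pair

mlp_params = (
    vocab_size * embed_dim                  # embedding matrix C
    + block_size * embed_dim * hidden_size  # W1
    + hidden_size                           # b1
    + hidden_size * vocab_size              # W2
    + vocab_size                            # b2
)

gpt3_params = 175_000_000_000  # GPT-3 figure quoted in this notebook

print(f"Bigram model: {bigram_params:,} parameters")
print(f"MLP model:    {mlp_params:,} parameters")
print(f"GPT-3 is roughly {gpt3_params / mlp_params:,.0f}x larger than the MLP")
```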

### Key Takeaways

1. **Inductive biases matter**: Allowing the model to see 3 characters instead of 1 dramatically improved performance, as did moving to a stronger model architecture. Modern LLMs extend this to thousands of tokens and gigantic models.

2. **Representation matters**: Learned embeddings allow the model to discover that some characters are similar. Modern LLMs learn incredibly rich representations that capture semantic meaning.

3. **Scale matters, but principles don't change**: Taking the 1.7 trillion estimate at face value, GPT-4 has roughly 160 million times more parameters than our ~10,000-parameter MLP, but it uses the same fundamental ideas:
   - Next-token prediction
   - Learned embeddings
   - Gradient descent optimization
   - Cross-entropy loss

4. **Machine learning is modeling**: We made explicit choices about:
   - What data to use (Shakespeare text)
   - How to represent it (character-level, embeddings)
   - What assumptions to make (context window size)
   - What architecture to use (MLP with non-linearity)

These choices fundamentally shaped what our models could learn! The same is true for modern LLMs—their capabilities emerge from careful design choices about data, architecture, and training.

### Discussion Questions

1. **Context Window**: We used `block_size=3`. What would happen with `block_size=1`? With `block_size=10`? What are the trade-offs?

2. **Embedding Dimension**: We used `embed_dim=10` for the main model and `embed_dim=2` for visualization. How might using `embed_dim=50` or `embed_dim=100` affect the model? What about `embed_dim=1`?

3. **Architecture Depth**: Our MLP has one hidden layer. What might we gain from adding more layers? What might we lose? (Hint: GPT-3 has 96 layers!)

4. **Data Source**: We used Shakespeare. How would the model differ if we trained on:
   - Python code?
   - Modern English novels?
   - Social media posts?
   - A different language (e.g., Chinese, Arabic)?
   
   What does this tell you about how LLMs trained on internet-scale data might behave?

5. **Limitations**: Both models still produce mostly gibberish. What fundamental limitations do they have? (Hint: think about long-range dependencies like "The cat that chased the mouse ___ tired")

6. **Scaling Laws**: Our MLP performs better than the bigram model. If we kept scaling up (more parameters, more data, more compute), would performance keep improving? This is an active research question in modern AI!

### From Here to ChatGPT: What's Missing?

Our models demonstrate the core principles, but modern LLMs add several key innovations:

1. **Attention Mechanisms**: Rather than mixing a small, fixed window of characters through fixed weights, attention lets the model weight the relevant parts of its context dynamically. Attention is the core building block of the Transformer, the "T" in GPT (Generative Pre-trained **Transformer**).

2. **Scale**: 
   - **More parameters**: GPT-3 has 175 billion parameters vs. our ~10,000
   - **More data**: Trained on hundreds of billions of tokens vs. our 1 million characters
   - **More compute**: Thousands of GPUs for weeks vs. our CPU for minutes

3. **Pre-training + Fine-tuning**: LLMs are first trained on massive text corpora (pre-training), then fine-tuned on specific tasks or with human feedback (RLHF).

4. **Architectural improvements**: Layer normalization, residual connections, better optimizers, etc.

But the **fundamental task remains the same**: predict the next token. Everything else is about doing this task better at scale!
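
As a rough illustration of the attention idea from the list above, here is a sketch of scaled dot-product attention for a single query in plain Python. It is a simplification under stated assumptions: one head, made-up vectors, and none of the learned projections, masking, or batching that a real Transformer uses.

```python
import math

def softmax(xs):
    largest = max(xs)
    exps = [math.exp(x - largest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each value by how well its key matches the query (scaled dot product)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    output = [sum(w * value[j] for w, value in zip(weights, values))
              for j in range(len(values[0]))]
    return weights, output

# Hypothetical 2-D vectors for three context positions
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 0.0]

weights, output = attend(query, keys, values)
print("attention weights:", [round(w, 3) for w in weights])
print("output:", [round(o, 3) for o in output])
```

The weights depend on the query, so different inputs attend to different positions. That is the sense in which attention is "dynamic", in contrast to the fixed weights our MLP applies to each slot of its context window.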

### Optional Extensions

If you want to explore further:

1. **Experiment with hyperparameters**: Try different `block_size`, `embed_dim`, or `hidden_size` values
2. **Temperature sampling**: Generate text with different temperature values (0.5, 1.0, 1.5) and compare
3. **Different datasets**: Train on a different text corpus and compare the learned embeddings
4. **Longer training**: Train for more epochs and see if the text quality improves
5. **3D embeddings**: Train with `embed_dim=3` and create a 3D visualization using `mpl_toolkits.mplot3d`