In [49]:
!pip install torch tiktoken transformers



---

# Part 1: Building the Model Architecture

First, we need to assemble all the building blocks of our GPT model. We've covered these in previous notebooks, so here we'll import/define them and focus on how they work together.

## Multi-Head Attention

Multi-head attention is the **heart of the transformer**. It allows the model to:
- Look at different positions in the sequence simultaneously
- Learn different types of relationships (syntactic, semantic, etc.) via multiple "heads"
- Use **causal masking** to prevent looking at future tokens (crucial for language modeling!)

### How Multi-Head Attention Works:

```
Input: [batch, seq_len, d_model]
         │
         ├──→ Q = input @ W_query ──┐
         ├──→ K = input @ W_key   ──┼──→ Split into num_heads
         └──→ V = input @ W_value ──┘
                                    │
                    ┌───────────────┴───────────────┐
                    │     For each head:            │
                    │  1. attention = Q @ K.T       │
                    │  2. mask future positions     │
                    │  3. scale by √d_k             │
                    │  4. softmax → weights         │
                    │  5. output = weights @ V      │
                    └───────────────┬───────────────┘
                                    │
                            Concatenate all heads
                                    │
                            Output projection
                                    │
Output: [batch, seq_len, d_model]
```

### Key Parameters:
- `d_in`: Input dimension (embedding size)
- `d_out`: Output dimension (usually same as d_in)
- `num_heads`: Number of attention heads (d_out must be divisible by this)
- `context_length`: Maximum sequence length (for creating the causal mask)
- `dropout`: Regularization to prevent overfitting
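
The per-head steps in the diagram can be traced by hand on a tiny example. The sketch below is plain Python (no PyTorch) so the arithmetic stays visible. The 2-dimensional queries and keys are made-up values, and the helper name `causal_attention_weights` is hypothetical, not part of this notebook's code.

```python
import math

def causal_attention_weights(queries, keys):
    """Return softmax(mask(Q @ K^T) / sqrt(d_k)) for one head, row by row."""
    d_k = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        # Score query i against every key; future positions (j > i) are set to -inf.
        scores = [
            sum(a * b for a, b in zip(q, k)) if j <= i else float("-inf")
            for j, k in enumerate(keys)
        ]
        scaled = [s / math.sqrt(d_k) for s in scores]
        # Numerically stable softmax; math.exp(-inf) evaluates to 0.0.
        top = max(scaled)
        exps = [math.exp(s - top) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Illustrative values only: three tokens, head_dim = 2.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = causal_attention_weights(queries, keys)
for row in attn:
    print([round(w, 3) for w in row])
```

Every row sums to 1 and every entry above the diagonal is exactly 0, so the first token can only attend to itself.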

# Pre-Training a GPT Model from Scratch

This notebook walks through the complete process of **pre-training** a GPT-style language model. Pre-training is the foundational step where the model learns general language patterns from a large corpus of text.

## What is Pre-Training?

Pre-training teaches a model to predict the next token in a sequence. Given text like:

```
"The cat sat on the" → predict "mat"
```

Through millions of these predictions, the model learns:
- Grammar and syntax
- Word relationships and semantics  
- Facts and knowledge from the training data
- Reasoning patterns

## What We'll Cover

1. **Model Architecture** - Build all GPT components (attention, feed-forward, transformer blocks)
2. **Loss Calculation** - Understand cross-entropy loss for next-token prediction
3. **Data Preparation** - Create training batches from raw text
4. **Training Loop** - Put it all together to train the model

## Why This Matters

Pre-training is **expensive** (one widely cited estimate puts GPT-3's training compute at roughly $4.6M) but creates a foundation that can be fine-tuned for many tasks. Understanding this process helps you:
- Debug training issues
- Make informed architecture decisions
- Understand model capabilities and limitations

---

## Install Dependencies

In [50]:
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, 
                 context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)  
        queries = queries.view(                                             
            b, num_tokens, self.num_heads, self.head_dim                    
        )                                                                   

        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        attn_scores = queries @ keys.transpose(2, 3)
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context_vec = (attn_weights @ values).transpose(1, 2)

        context_vec = context_vec.contiguous().view(
            b, num_tokens, self.d_out
        )
        context_vec = self.out_proj(context_vec)
        return context_vec

## Layer Normalization

**Layer Normalization** stabilizes training by normalizing activations across the feature dimension.

### Why We Need It:

During training, the distribution of each layer's inputs shifts as the earlier layers update (often called "internal covariate shift"). This can make training unstable and slow. Layer normalization mitigates this by:

1. **Normalizing** each sample independently across features
2. **Scaling and shifting** with learnable parameters (so the model can undo normalization if needed)

### The Math:

```
For each token position:
    mean = average of all features
    var  = variance of all features
    
    normalized = (x - mean) / √(var + ε)
    output = scale * normalized + shift
```

### Why Layer Norm (not Batch Norm)?

| Batch Norm | Layer Norm |
|------------|------------|
| Normalizes across batch | Normalizes across features |
| Depends on batch size | Independent of batch size |
| Different behavior train/eval | Same behavior always |
| Awkward for variable-length sequences | Works per token, regardless of sequence length |

**GPT uses Layer Norm** because it works independently for each token, regardless of batch size or sequence length.
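
As a rough illustration of the math above, the sketch below normalizes a single token's feature vector in plain Python. It simplifies `scale` and `shift` to scalars (the notebook's `LayerNorm` learns one value per feature), and the input vector is made up.

```python
import math

def layer_norm(features, scale=1.0, shift=0.0, eps=1e-5):
    """Normalize one token's feature vector to zero mean and unit variance."""
    mean = sum(features) / len(features)
    # Biased variance (divide by n), matching unbiased=False in the notebook's LayerNorm.
    var = sum((f - mean) ** 2 for f in features) / len(features)
    return [scale * (f - mean) / math.sqrt(var + eps) + shift for f in features]

token = [2.0, 4.0, 6.0, 8.0]  # illustrative feature vector for a single token
normalized = layer_norm(token)
print([round(v, 4) for v in normalized])
```

The result depends only on this one token's features, which is why the batch size and the other tokens in the sequence do not matter.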

In [51]:
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

## GELU Activation Function

**GELU** (Gaussian Error Linear Unit) is the activation function used in GPT and most modern transformers.

### Why Not ReLU?

ReLU is simple (`max(0, x)`) but can suffer from **dead neurons**: if a neuron's pre-activation is negative for every input, its gradient is zero and it may never recover during training.

### How GELU Works:

GELU is a **smooth activation function** (the formula below is its common tanh approximation) that:
- Allows small negative values through (unlike ReLU which blocks all negatives)
- Is differentiable everywhere (unlike ReLU's sharp corner at 0)
- Acts like a "soft gate" based on the input's value

```
GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))
```

### Why GELU is Better for Transformers:
- Smoother gradients → more stable training
- Non-zero gradient for negative inputs → fewer dead neurons
- Empirically works well for NLP tasks
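
To make the comparison concrete, the sketch below evaluates ReLU, the exact GELU (defined through the standard normal CDF), and the tanh approximation at a few sample points. It uses only the standard library; the function names are illustrative.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation, the same formula as the notebook's GELU module."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def relu(x):
    return max(0.0, x)

samples = (-2.0, -0.5, 0.0, 0.5, 2.0)
for x in samples:
    print(f"x={x:+.1f}  relu={relu(x):.4f}  "
          f"gelu_exact={gelu_exact(x):.4f}  gelu_tanh={gelu_tanh(x):.4f}")
```

For negative inputs ReLU returns exactly 0, while both GELU variants return a small negative value, and the two GELU columns agree to about three decimal places.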

In [52]:
class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * 
            (x + 0.044715 * torch.pow(x, 3))
        ))

## Feed-Forward Network (FFN)

The **Feed-Forward Network** is the "thinking" part of each transformer block. While attention figures out *what to look at*, the FFN processes *what to do with that information*.

### Architecture:

```
Input [batch, seq, 768]
        │
        ▼
┌───────────────────┐
│  Linear (768→3072)│  ← Expand to 4x size
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│      GELU         │  ← Non-linearity
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│  Linear (3072→768)│  ← Project back down
└─────────┬─────────┘
          │
          ▼
Output [batch, seq, 768]
```

### Why 4x Expansion?

The FFN temporarily expands the dimension by **4x** (768 → 3072). This gives the network more "room to think":

1. **More parameters** = more capacity to learn complex patterns
2. **Expand-then-project design** = the wide hidden layer is projected back down to `emb_dim`, so the block's output has the same shape as its input
3. **Empirically effective** = this ratio works well in practice

### Key Insight:

The FFN processes each token position **independently**. Unlike attention (which mixes information across positions), the FFN applies the same transformation to each token separately. This is why it's also called a "position-wise" feed-forward network.
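
The position-wise claim can be checked directly. The sketch below builds a tiny stand-in for the FFN (an assumed `emb_dim` of 8, and PyTorch's built-in `nn.GELU` in place of this notebook's `GELU` class) and compares running the whole sequence at once against running each token separately.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for the notebook's FeedForward: emb_dim=8, expanded 4x to 32.
emb_dim = 8
ffn = nn.Sequential(
    nn.Linear(emb_dim, 4 * emb_dim),
    nn.GELU(),
    nn.Linear(4 * emb_dim, emb_dim),
)

x = torch.randn(1, 3, emb_dim)  # [batch, seq, emb_dim]

with torch.no_grad():
    whole_sequence = ffn(x)
    # Run each token through the FFN on its own, then stack along the seq dimension.
    one_at_a_time = torch.stack(
        [ffn(x[:, t, :]) for t in range(x.shape[1])], dim=1
    )

print(torch.allclose(whole_sequence, one_at_a_time, atol=1e-5))
```

The two results match up to floating-point tolerance, because `nn.Linear` acts only on the last dimension and never mixes positions.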

In [53]:
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)

## Transformer Block

The **Transformer Block** combines attention and feed-forward layers with **residual connections** and **layer normalization**. This is the repeating unit that gets stacked to create deep models.

### Architecture (Pre-Norm variant used in GPT-2/3):

```
        Input
          │
          ├─────────────────────┐
          │                     │ (residual/skip connection)
          ▼                     │
    ┌───────────┐               │
    │ LayerNorm │               │
    └─────┬─────┘               │
          │                     │
          ▼                     │
    ┌───────────┐               │
    │ Attention │               │
    └─────┬─────┘               │
          │                     │
          ▼                     │
    ┌───────────┐               │
    │  Dropout  │               │
    └─────┬─────┘               │
          │                     │
          ▼                     │
        (+)←────────────────────┘
          │
          ├─────────────────────┐
          │                     │ (residual/skip connection)
          ▼                     │
    ┌───────────┐               │
    │ LayerNorm │               │
    └─────┬─────┘               │
          │                     │
          ▼                     │
    ┌───────────┐               │
    │    FFN    │               │
    └─────┬─────┘               │
          │                     │
          ▼                     │
    ┌───────────┐               │
    │  Dropout  │               │
    └─────┬─────┘               │
          │                     │
          ▼                     │
        (+)←────────────────────┘
          │
        Output
```

### Why Residual Connections?

Residual (skip) connections are **critical** for training deep networks:

1. **Gradient flow**: Gradients can flow directly through skip connections, avoiding vanishing gradients
2. **Identity mapping**: The network can easily learn to "do nothing" if that's optimal
3. **Incremental learning**: Each layer learns a "delta" on top of existing representations

### Pre-Norm vs Post-Norm

GPT-2/3 use **Pre-Norm** (normalize before attention/FFN), which in practice:
- Tends to be more stable during training
- Often tolerates larger learning rates
- Makes very deep models easier to train, with less reliance on learning-rate warmup
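
The difference between the two orderings is easiest to see with the sublayer switched off. The sketch below is plain Python and omits dropout; `zero_sublayer` is a hypothetical stand-in for attention or the FFN that contributes nothing.

```python
import math

def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def pre_norm_step(x, sublayer):
    """Pre-Norm: normalize, apply the sublayer, then add the untouched input."""
    return add(x, sublayer(layer_norm(x)))

def post_norm_step(x, sublayer):
    """Post-Norm: apply the sublayer, add the input, then normalize the sum."""
    return layer_norm(add(x, sublayer(x)))

def zero_sublayer(vec):
    # Hypothetical sublayer that outputs zeros.
    return [0.0 for _ in vec]

x = [1.0, 5.0, 9.0]
pre = pre_norm_step(x, zero_sublayer)
post = post_norm_step(x, zero_sublayer)
print(pre)
print([round(v, 4) for v in post])
```

With Pre-Norm the input passes through the skip connection unchanged, whereas Post-Norm rescales it at every block. That clean identity path is one common explanation for Pre-Norm's better behavior in deep stacks.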

In [54]:
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"], 
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):

        shortcut = x
        x = self.norm1(x)
        x = self.att(x)
        x = self.drop_shortcut(x)
        x = x + shortcut

        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut
        return x

## GPT Configuration

Before building the model, we define its **hyperparameters**. These control the model's size and capacity.

### GPT-2 124M Configuration:

| Parameter | Value | Meaning |
|-----------|-------|---------|
| `vocab_size` | 50,257 | Number of unique tokens (BPE vocabulary) |
| `context_length` | 256 | Maximum sequence length (reduced from 1024 for speed) |
| `emb_dim` | 768 | Size of token embeddings |
| `n_heads` | 12 | Number of attention heads |
| `n_layers` | 12 | Number of transformer blocks |
| `drop_rate` | 0.1 | Dropout probability (10%) |
| `qkv_bias` | False | No bias in Q, K, V projections |

### Scaling Laws:

The model's capacity roughly scales with:
- **Width** (`emb_dim`): More features per token
- **Depth** (`n_layers`): More processing steps
- **Heads** (`n_heads`): More parallel attention patterns (each head gets `emb_dim / n_heads` dimensions, so adding heads alone does not add parameters)

GPT-2 sizes:
- **124M**: 12 layers, 768 dim, 12 heads (what we're using)
- **355M**: 24 layers, 1024 dim, 16 heads
- **774M**: 36 layers, 1280 dim, 20 heads
- **1.5B**: 48 layers, 1600 dim, 25 heads
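
A quick check on these configurations is to compute the per-head dimension, `emb_dim / n_heads`, for each size. The sketch below uses only the numbers from the list above.

```python
# (n_layers, emb_dim, n_heads) for the four GPT-2 sizes listed above.
gpt2_sizes = {
    "124M": (12, 768, 12),
    "355M": (24, 1024, 16),
    "774M": (36, 1280, 20),
    "1.5B": (48, 1600, 25),
}

head_dims = {}
for name, (n_layers, emb_dim, n_heads) in gpt2_sizes.items():
    # MultiHeadAttention asserts this divisibility before splitting into heads.
    assert emb_dim % n_heads == 0
    head_dims[name] = emb_dim // n_heads
    print(f"{name}: head_dim = {emb_dim} / {n_heads} = {head_dims[name]}")
```

All four sizes come out to 64 dimensions per head, so the larger models add heads rather than widening each one.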

In [55]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12, 
    "drop_rate": 0.1,
    "qkv_bias": False
}

## The Complete GPT Model

Now we assemble all components into the full **GPTModel** class.

### Architecture Overview:

```
Token IDs [batch, seq_len]
          │
          ▼
┌─────────────────────┐
│  Token Embedding    │  Look up vectors for each token
│  [vocab → emb_dim]  │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Position Embedding  │  Add position information
│ [seq_len → emb_dim] │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│     Dropout         │  Regularization
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Transformer Block  │ ─┐
└──────────┬──────────┘  │
           │             │
           ▼             │ × 12 layers
┌─────────────────────┐  │
│  Transformer Block  │  │
└──────────┬──────────┘  │
           │             │
          ...           ─┘
           │
           ▼
┌─────────────────────┐
│    Final LayerNorm  │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│    Output Head      │  Project to vocabulary size
│ [emb_dim → vocab]   │
└──────────┬──────────┘
           │
           ▼
    Logits [batch, seq_len, vocab_size]
```

### Key Points:

1. **Token + Position Embeddings**: The model needs both *what* token and *where* it is
2. **Stacked Transformer Blocks**: 12 identical blocks, each refining the representations
3. **Output Head**: Maps final representations to one logit per vocabulary token (softmax turns these into probabilities)
4. **No bias in output**: `bias=False` in the final Linear layer (GPT-2 convention)

In [56]:
class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)

        pos_embeds = self.pos_emb(
            torch.arange(seq_len, device=in_idx.device)
        )
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

## Instantiate the Model

Let's create our GPT model and examine its structure.

### What Happens Here:

1. **Set random seed**: Ensures reproducible weight initialization
2. **Create model**: Instantiates all layers with random weights
3. **Set to eval mode**: Disables dropout (important for consistent outputs during testing)

### Model Size:

With our configuration, the model has approximately **162 million parameters**:
- Token embeddings: 50,257 × 768 = ~38.6M
- Position embeddings: 256 × 768 = ~0.2M  
- Each transformer block: ~7.1M parameters
- 12 blocks: ~85M
- Output head: 768 × 50,257 = ~38.6M (the original GPT-2 shares these weights with the token embeddings)

**Note**: The "124M" in the configuration name refers to the original GPT-2, which ties the output head to the token embeddings and so counts those weights once. Our implementation keeps them separate, so it has about 38.6M more parameters.
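
The per-layer counts can be reproduced with plain arithmetic. The sketch below tallies the layers as defined in this notebook (no bias on Q, K, V; biases on `out_proj` and the feed-forward layers; a separate, untied output head) and also reports the total if the output head were tied to the token embeddings. It is a back-of-the-envelope check, not a replacement for summing over `model.parameters()`.

```python
cfg = {"vocab_size": 50257, "context_length": 256, "emb_dim": 768, "n_layers": 12}
d = cfg["emb_dim"]

tok_emb = cfg["vocab_size"] * d
pos_emb = cfg["context_length"] * d

attn = 3 * d * d + (d * d + d)                # Q, K, V without bias; out_proj with bias
ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)   # two Linear layers, both with bias
norms = 2 * (2 * d)                           # scale and shift for norm1 and norm2
block = attn + ffn + norms

final_norm = 2 * d
out_head = d * cfg["vocab_size"]              # bias=False

total = tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + out_head
tied_total = total - out_head

print(f"per block:      {block:,}")
print(f"total (untied): {total:,}")
print(f"total if tied:  {tied_total:,}")
```

With tying, the total comes to roughly 124M, which is where the configuration's name comes from.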

In [57]:
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()

GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(256, 768)
  (drop_emb): Dropout(p=0.1, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=False)
        (W_key): Linear(in_features=768, out_features=768, bias=False)
        (W_value): Linear(in_features=768, out_features=768, bias=False)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (ff): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.1, inplace=False)
    )
    (1): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features

---

# Part 2: Text Generation (Before Training)

Before training, let's see what our randomly-initialized model produces. Spoiler: it will be **complete gibberish**!

## Text Generation Process

The model generates text **one token at a time** using this loop:

```
1. Input: "Every effort moves you"
2. Model predicts probability distribution over ALL 50,257 tokens
3. Select the most likely token (argmax) or sample from distribution
4. Append new token to input
5. Repeat steps 2-4 for desired length
```

### Helper Functions:

- **`text_to_token_ids`**: Converts text → token IDs using GPT-2's BPE tokenizer
- **`token_ids_to_text`**: Converts token IDs → text
- **`generate_text_simple`**: The generation loop (greedy decoding with argmax)

### Why the Output is Garbage:

Our model has **random weights**. It hasn't learned:
- What words mean
- Grammar rules
- Which tokens typically follow others

The output will be essentially random tokens. This demonstrates why **pre-training is essential**!
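
The loop itself does not depend on the model being any good. The sketch below runs the same greedy procedure in plain Python against a hypothetical toy "model" (`toy_next_token_scores`, invented here) that always favors the token ID one higher than the last token.

```python
def toy_next_token_scores(context, vocab_size):
    """Hypothetical stand-in for the model: favors (last_token + 1) % vocab_size."""
    favored = (context[-1] + 1) % vocab_size
    return [1.0 if token == favored else 0.0 for token in range(vocab_size)]

def generate_greedy(start_tokens, max_new_tokens, context_size, vocab_size):
    tokens = list(start_tokens)
    for _ in range(max_new_tokens):
        context = tokens[-context_size:]   # crop to the supported context length
        scores = toy_next_token_scores(context, vocab_size)
        next_token = max(range(vocab_size), key=scores.__getitem__)  # argmax
        tokens.append(next_token)          # feed the choice back in
    return tokens

generated = generate_greedy([3, 4], max_new_tokens=5, context_size=4, vocab_size=8)
print(generated)  # [3, 4, 5, 6, 7, 0, 1]
```

Swapping the toy scorer for a real model's last-position logits gives the same structure as `generate_text_simple`.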

In [58]:
import tiktoken

def generate_text_simple(model, idx,
                         max_new_tokens, context_size): 
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)

        logits = logits[:, -1, :]
        probas = torch.softmax(logits, dim=-1)
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)
        idx = torch.cat((idx, idx_next), dim=1)

    return idx

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)
    return tokenizer.decode(flat.tolist())

start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you rentingetic wasnم refres RexMeCHicular stren


---

# Part 3: Understanding the Loss Function

Now we dive into the **most important concept for training**: the loss function.

## What is Loss?

**Loss** measures how wrong our model's predictions are. For language models:
- Lower loss = better predictions
- Higher loss = worse predictions

## Cross-Entropy Loss for Language Modeling

For next-token prediction, we use **cross-entropy loss**:

```
Loss = -log(probability of correct token)
```

### Intuition:

If the model predicts:
- Correct token with 90% probability → Loss = -log(0.9) = 0.105 (low, good!)
- Correct token with 1% probability → Loss = -log(0.01) = 4.6 (high, bad!)
- Correct token with 0.01% probability → Loss = -log(0.0001) = 9.2 (very high, terrible!)
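
These three values can be reproduced with the standard library, using the natural logarithm:

```python
import math

# Loss contributed by a single token for each probability assigned to the correct token.
losses = {p: -math.log(p) for p in (0.9, 0.01, 0.0001)}

for p, loss in losses.items():
    print(f"p = {p:<7} loss = {loss:.3f}")
```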

## Setting Up the Example

We'll use two short sequences to understand how loss is calculated:

```
Batch 1: "every effort moves" → target: "effort moves you"
Batch 2: "I really like"      → target: "really like chocolate"
```

Each input token should predict the **next** token in the sequence.

In [59]:
inputs = torch.tensor([[16833, 3626, 6100],   # ["every effort moves",
                       [40,    1107, 588]])   #  "I really like"]

### Create Input Tokens

These are the **token IDs** for our input sequences. The tokenizer converts words to numbers:
- `16833` = "every"
- `3626` = " effort" 
- `6100` = " moves"
- `40` = "I"
- `1107` = " really"
- `588` = " like"

In [60]:
targets = torch.tensor([[3626, 6100, 345  ],  # [" effort moves you",
                        [1107, 588, 11311]])  #  " really like chocolate"]

### Create Target Tokens

The targets are simply the inputs **shifted by one position**. This is the essence of language modeling:

```
Input:  [every]  [effort] [moves]
Target: [effort] [moves]  [you]
         ↑        ↑        ↑
      predict   predict  predict
```

Each position learns to predict the **next** token.
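
In code, the shift amounts to two slices of the same token list. The sketch below uses the token IDs from the first example sequence.

```python
token_ids = [16833, 3626, 6100, 345]  # "every", " effort", " moves", " you"

inputs = token_ids[:-1]   # everything except the last token
targets = token_ids[1:]   # the same sequence shifted left by one

for position, (seen, expected) in enumerate(zip(inputs, targets)):
    print(f"position {position}: input {seen} -> target {expected}")
```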

In [61]:
with torch.no_grad():
    logits = model(inputs)
probas = torch.softmax(logits, dim=-1)
print(probas.shape)

torch.Size([2, 3, 50257])


### Get Model Predictions

Run the inputs through the model to get **logits** (raw scores), then apply softmax to turn them into probabilities.

The output shape is `[batch_size, seq_len, vocab_size]` = `[2, 3, 50257]`:
- 2 sequences in the batch
- 3 tokens per sequence
- 50,257 scores (one for each possible next token)

In [62]:
token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)

Token IDs:
 tensor([[[16657],
         [  339],
         [42826]],

        [[49906],
         [29669],
         [41751]]])


### What Would the Model Predict?

Let's see which tokens the model would actually generate (using argmax to pick the highest probability token).

**Remember**: The model has random weights, so these predictions will be nonsense!

In [63]:
print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")
print(f"Outputs batch 1:"
      f" {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")

Targets batch 1:  effort moves you
Outputs batch 1:  Armed heNetflix


### Compare Predictions vs Targets

Let's decode both the targets (what we want) and predictions (what the model says).

As expected, the predictions are **completely wrong** - random tokens that don't make sense!

In [64]:
text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 1:", target_probas_1)

text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2:", target_probas_2)

Text 1: tensor([7.4540e-05, 3.1061e-05, 1.1563e-05])
Text 2: tensor([1.0337e-05, 5.6776e-05, 4.7559e-06])


### Extract Probabilities for Target Tokens

Now let's see **how much probability** the model assigned to the correct tokens.

We use fancy indexing to extract just the probabilities for our target tokens:
- `probas[0, [0,1,2], targets[0]]` = probabilities at positions 0,1,2 for the target token IDs

**Expected**: Very low probabilities (around 1/50,257 ≈ 0.00002) since the model is random.

In [65]:
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(log_probas)

tensor([ -9.5042, -10.3796, -11.3677, -11.4798,  -9.7764, -12.2561])


### Convert to Log Probabilities

We take the **logarithm** of probabilities because:

1. **Numerical stability**: Very small probabilities (like 0.00001) become manageable numbers
2. **Easier math**: Products become sums (important for sequences)
3. **Convention**: Cross-entropy loss uses log probabilities

```
prob = 0.00007  →  log(prob) = -9.6
prob = 0.00001  →  log(prob) = -11.5
```

**Note**: More negative = lower probability = worse prediction!
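
The numerical-stability point is easy to demonstrate. The sketch below multiplies 100 made-up probabilities of 0.00001 and compares the result with summing their logs.

```python
import math

probs = [1e-5] * 100  # illustrative: 100 tokens, each assigned probability 0.00001

product = 1.0
for p in probs:
    product *= p  # underflows to 0.0 well before the loop ends

log_sum = sum(math.log(p) for p in probs)  # stays an ordinary float

print(product)   # 0.0
print(log_sum)   # about -1151.29
```

The product is too small for a 64-bit float and collapses to zero, while the sum of logs carries the same information without any trouble.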

In [66]:
avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)

tensor(-10.7940)


### Average Log Probability

We average the log probabilities across all tokens to get a single number representing overall model performance.

This average is **negative** because log probabilities are always ≤ 0.

In [67]:
neg_avg_log_probas = avg_log_probas * -1
print(neg_avg_log_probas)

tensor(10.7940)


### Negative Average Log Probability = Cross-Entropy Loss!

We negate the average to get a **positive** loss value:

```
Cross-Entropy Loss = -mean(log(probability of correct tokens))
```

This is exactly what `torch.nn.functional.cross_entropy` computes!

**Goal of training**: Minimize this loss → Maximize probability of correct tokens.
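
For reference, the whole formula fits in a few lines of plain Python. The sketch below is an illustrative re-implementation for lists of logits, not how PyTorch computes it internally. The 4-token vocabulary and logit values are made up.

```python
import math

def cross_entropy(logit_rows, target_ids):
    """Mean of -log(softmax(logits)[target]) over all positions."""
    losses = []
    for logits, target in zip(logit_rows, target_ids):
        top = max(logits)
        log_normalizer = top + math.log(sum(math.exp(z - top) for z in logits))
        log_prob = logits[target] - log_normalizer  # log-softmax at the target index
        losses.append(-log_prob)
    return sum(losses) / len(losses)

# Illustrative 4-token vocabulary, two positions.
logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
targets = [0, 3]
example_loss = cross_entropy(logits, targets)
print(round(example_loss, 4))
```

The second position has all-equal logits, so its contribution is exactly `log(4)`, the loss of a uniform guess over four tokens.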

In [68]:
print("Logits shape:", logits.shape)
print("Targets shape:", targets.shape)

Logits shape: torch.Size([2, 3, 50257])
Targets shape: torch.Size([2, 3])


## Using PyTorch's Cross-Entropy Function

The manual calculation above was educational, but in practice we use PyTorch's optimized `cross_entropy` function.

### The Challenge: Shape Mismatch

PyTorch's `cross_entropy` expects:
- `input`: `[N, C]` where N = samples, C = classes
- `target`: `[N]` 

But our shapes are:
- `logits`: `[batch, seq_len, vocab]` = `[2, 3, 50257]`
- `targets`: `[batch, seq_len]` = `[2, 3]`

**Solution**: Flatten the batch and sequence dimensions!

In [69]:
logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()
print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)

Flattened logits: torch.Size([6, 50257])
Flattened targets: torch.Size([6])


### Flatten for Cross-Entropy

```
Before flattening:
  logits:  [2, 3, 50257]  →  2 sequences × 3 positions × 50257 vocab
  targets: [2, 3]         →  2 sequences × 3 positions

After flattening:
  logits:  [6, 50257]     →  6 total predictions × 50257 vocab
  targets: [6]            →  6 total targets
```

Now each of the 6 predictions is treated as an independent classification problem!

In [70]:
loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)

tensor(10.7940)


### Compute Cross-Entropy Loss

PyTorch's `cross_entropy`:
1. Applies softmax to logits internally
2. Takes log
3. Selects values at target indices
4. Averages and negates

The result matches our manual calculation exactly (10.7940)!
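
The same equivalence can be checked on small random tensors. The sketch below assumes a made-up vocabulary of 10 tokens and compares `F.cross_entropy` on flattened inputs with an explicit log-softmax, gather, mean, and negate.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative shapes: 2 sequences, 3 positions, vocabulary of 10 tokens.
logits = torch.randn(2, 3, 10)
targets = torch.randint(0, 10, (2, 3))

# Built-in: flatten batch and sequence dimensions, then call cross_entropy.
builtin = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

# Manual: log-softmax, pick the target entries, average, negate.
log_probs = torch.log_softmax(logits, dim=-1)
picked = log_probs.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1)
manual = -picked.mean()

print(builtin.item(), manual.item())
```

Working in log-softmax space, rather than taking `log(softmax(...))` in two steps, is the numerically safer route when probabilities are very small.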

In [71]:
perplexity = torch.exp(loss)
print(perplexity)

tensor(48725.8203)


## Perplexity: A More Interpretable Metric

**Perplexity** is cross-entropy loss converted to a more intuitive scale:

```
Perplexity = e^(cross_entropy_loss)
```

### Intuition:

Perplexity ≈ "How many tokens is the model confused between?"

| Perplexity | Meaning |
|------------|---------|
| 1 | Perfect! Model is 100% confident and correct |
| 10 | Model is "choosing" between ~10 equally likely tokens |
| 100 | Model is "choosing" between ~100 equally likely tokens |
| 50,000 | Nearly random guessing (vocab size is 50,257) |

### For Our Random Model:

Perplexity ≈ 48,726 means the model is essentially guessing randomly among all tokens. A well-trained language model typically reaches a perplexity in the low tens on held-out text, though the exact value depends on the model and the dataset.
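
The "nearly random" reading can be checked against the uniform baseline: a model that spreads probability evenly over the vocabulary has a perplexity equal to the vocabulary size. The sketch below uses only the vocabulary size and the loss value reported above.

```python
import math

vocab_size = 50257

# A model that assigns every token the same probability, 1 / vocab_size.
uniform_loss = -math.log(1.0 / vocab_size)
uniform_perplexity = math.exp(uniform_loss)

# The loss measured for the untrained model in this notebook.
measured_perplexity = math.exp(10.7940)

print(round(uniform_loss, 4), round(uniform_perplexity))
print(round(measured_perplexity))
```

The untrained model's loss sits just below the uniform baseline of about 10.82, which is what random initialization would be expected to produce.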

In [72]:
import tiktoken
import requests
import os
from torch.utils.data import Dataset, DataLoader

url = "https://www.gutenberg.org/cache/epub/84/pg84.txt"
filename = "pg84.txt"

# Download the file if it doesn't exist
if not os.path.exists(filename):
    print(f"Downloading '{filename}'...")
    try:
        response = requests.get(url, timeout=30)  # fail instead of hanging on a stalled connection
        response.raise_for_status()
        with open(filename, "wb") as f:
            f.write(response.content)
        print(f"Successfully downloaded '{filename}'")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading the file: {e}")
else:
    print(f"'{filename}' already exists, skipping download.")

# Load the text
with open(filename, "r", encoding="utf-8") as f:
    text_data = f.read()

Downloading 'pg84.txt'...
Successfully downloaded 'pg84.txt'


---

# Part 4: Preparing Training Data

Now we need real text data to train on. We'll use **"Frankenstein" by Mary Shelley** from Project Gutenberg - it's free, classic literature, and a good size for demonstration.

## The Data Pipeline:

```
Raw Text File
     │
     ▼
┌─────────────┐
│  Tokenize   │  Convert text → token IDs using BPE
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Create    │  Sliding window to create input-target pairs
│   Dataset   │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ DataLoader  │  Batch, shuffle, and iterate efficiently
└──────┬──────┘
       │
       ▼
  Training Loop
```

## Download the Training Text

We'll download Frankenstein (~440KB of text, ~106K tokens) - small enough to train quickly but large enough to see real learning.

In [73]:
total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))
print("Characters:", total_characters)
print("Tokens:", total_tokens)

Characters: 438807
Tokens: 106361


### Check Dataset Size

Let's see how much data we have:
- **Characters**: Raw text length
- **Tokens**: After BPE tokenization (roughly 4 characters per token for English; here 438,807 / 106,361 ≈ 4.1)

~106K tokens is small by modern standards (GPT-3 trained on 300B tokens!) but sufficient for demonstration.

In [74]:
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

### Train/Validation Split

We split the data into:
- **90% Training**: Model learns from this
- **10% Validation**: Used to check if model is overfitting

**Important**: We split by position, not randomly. This ensures:
1. No data leakage between train and validation
2. Validation tests the model on "future" text it hasn't seen

In [75]:

class GPTDatasetV1(Dataset):
    """Creates input-target pairs using sliding window for next-token prediction."""
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)

        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    """Create a DataLoader with GPT-2 BPE tokenization."""
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    return dataloader

## GPT Dataset: Creating Input-Target Pairs

The `GPTDatasetV1` class creates training examples using a **sliding window**:

```
Text: "The monster approached the village slowly"
      ─────────────────────────────────────────

With max_length=4, stride=4:

Example 1:
  Input:  [The, monster, approached, the]
  Target: [monster, approached, the, village]
  
Example 2:
  Input:  [village, slowly, ...]
  Target: [slowly, ..., ...]
```

### Key Parameters:

| Parameter | Effect |
|-----------|--------|
| `max_length` | Sequence length (context window size) |
| `stride` | How far to move between examples |

### Stride Strategies:

- **stride = max_length**: No overlap; each token appears as an input at most once per epoch (trailing tokens that don't fill a full window are dropped)
- **stride < max_length**: Overlapping windows, more examples but redundant data
- **stride = 1**: Maximum examples, but highly redundant

We use `stride = max_length` (256) for efficiency.
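To make the trade-off between the stride strategies concrete, here is a small sketch that counts how many windows the sliding-window loop produces. The helper `count_windows` and the corpus size of 1,000 tokens are illustrative assumptions, not part of the notebook; the `range(...)` expression mirrors the loop in `GPTDatasetV1`.

```python
def count_windows(num_tokens: int, max_length: int, stride: int) -> int:
    """Number of (input, target) pairs the sliding-window loop yields."""
    # Same bounds as `range(0, len(token_ids) - max_length, stride)`
    # in GPTDatasetV1.__init__.
    return len(range(0, num_tokens - max_length, stride))


num_tokens = 1_000   # hypothetical corpus size, chosen for illustration
max_length = 256

for stride in (256, 128, 1):
    n = count_windows(num_tokens, max_length, stride)
    print(f"stride={stride:>3}: {n} windows")
```

Halving the stride roughly doubles the number of examples, but every extra example mostly repeats tokens the model has already seen in a neighboring window.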

In [76]:
torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)
val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)

### Create Data Loaders

**DataLoaders** handle:
- **Batching**: Group examples together for parallel processing
- **Shuffling**: Randomize order each epoch (training only)
- **Dropping incomplete batches**: Ensures consistent batch sizes

Settings:
- `batch_size=2`: Small for demonstration (real training uses 32-512)
- `shuffle=True` for training, `False` for validation
- `drop_last=True` for training to avoid small final batches
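The effect of `drop_last` on the number of batches can be sketched with plain arithmetic. The helper `batch_count` and the example count of 41 are hypothetical; the formula reflects how a `DataLoader` over a fixed-size dataset either discards or keeps the final partial batch.

```python
def batch_count(num_examples: int, batch_size: int, drop_last: bool) -> int:
    """Batches yielded per epoch for a dataset of `num_examples` items."""
    full, remainder = divmod(num_examples, batch_size)
    if drop_last or remainder == 0:
        return full          # partial batch discarded (or none exists)
    return full + 1          # partial batch kept as a smaller final batch


num_examples = 41            # hypothetical number of windows
print(batch_count(num_examples, batch_size=2, drop_last=True))
print(batch_count(num_examples, batch_size=2, drop_last=False))
```

With `drop_last=False`, as used for the validation loader, the final batch can be smaller than `batch_size`, so code that consumes it should not assume a fixed first dimension.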

In [77]:
print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)

Train loader:
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256])

### Verify Data Loader Output

Let's check that our data loaders produce the expected shapes:
- Each batch should have shape `[batch_size, seq_length]` = `[2, 256]`
- Both inputs (x) and targets (y) have the same shape

The training loader should have more batches than the validation loader (90% vs 10% of data).

In [78]:
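def calc_loss_batch(input_batch, target_batch, model, device):
    """Cross-entropy loss for one batch of (input, target) token IDs."""
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)  # [batch, seq_len, vocab_size]
    # Merge batch and sequence dims so every token position is one prediction
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss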

---

# Part 5: Training Infrastructure

Before the main training loop, we need helper functions to calculate loss efficiently.

## Batch Loss Function

`calc_loss_batch` computes the cross-entropy loss for a single batch:

1. Move data to the right device (CPU or GPU)
2. Run forward pass to get logits
3. Flatten and compute cross-entropy loss

This is the same loss calculation we did manually earlier, but packaged as a reusable function.
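To see what the flatten-then-cross-entropy step computes, here is a dependency-free sketch on a toy batch. It is not the PyTorch implementation: the `cross_entropy` function below is a hand-written stand-in, and the logits, targets, and vocabulary size of 4 are made up for illustration.

```python
import math


def cross_entropy(logit_rows, targets):
    """Mean negative log-probability of each target under softmax(logits)."""
    total = 0.0
    for logits, target in zip(logit_rows, targets):
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[target]
    return total / len(targets)


# Toy batch: 2 sequences x 3 positions x vocabulary of 4 (hypothetical values)
logits = [
    [[2.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    [[0.0, 0.0, 3.0, 0.0], [0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0]],
]
targets = [[0, 1, 2], [2, 3, 0]]

# Equivalent in spirit to logits.flatten(0, 1) and target_batch.flatten()
flat_logits = [row for seq in logits for row in seq]
flat_targets = [t for seq in targets for t in seq]

loss = cross_entropy(flat_logits, flat_targets)
print(f"{len(flat_logits)} predictions, mean loss {loss:.4f}")
```

Flattening turns the `[batch, seq_len, vocab_size]` logits into one row per token position, so the loss is a plain average over all positions in the batch.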

In [79]:
def calc_loss_loader(data_loader, model, device, num_batches=None):
    """Average per-batch loss over the loader (or its first `num_batches` batches)."""
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Never ask for more batches than the loader holds
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(
                input_batch, target_batch, model, device
            )
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches

## Dataset Loss Function

`calc_loss_loader` computes the average loss over an entire data loader:

1. Iterate through batches
2. Accumulate losses
3. Return the average

### Optional `num_batches` Parameter:

For large datasets, computing loss on ALL batches is slow. The `num_batches` parameter lets you estimate the loss using only a subset of data - useful for quick progress checks during training.
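The subset-averaging idea can be sketched on its own, without a model. The function `mean_loss` below is a hypothetical stand-in that takes precomputed per-batch losses instead of calling `calc_loss_batch`, and the loss values are invented for illustration.

```python
from itertools import islice


def mean_loss(batch_losses, num_batches=None):
    """Average the first `num_batches` losses (all of them if None)."""
    # islice stops early if the iterable is shorter than num_batches,
    # which plays the role of min(num_batches, len(data_loader)).
    losses = list(islice(batch_losses, num_batches))
    return sum(losses) / len(losses) if losses else float("nan")


fake_losses = [4.0, 2.0, 0.0]                  # hypothetical per-batch losses
print(mean_loss(fake_losses))                  # all batches
print(mean_loss(fake_losses, num_batches=2))   # quick estimate from 2 batches
```

The estimate from a subset is noisier than the full average, and with `shuffle=False` it always uses the same leading batches, so it is best treated as a rough progress signal.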

In [80]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


## Device Selection

Deep learning benefits enormously from **GPU acceleration**:

| Device | Typical Training Speed |
|--------|---------------|
| CPU | 1x (baseline) |
| GPU (CUDA) | Often 10-100x faster, depending on model and hardware |
| TPU | Comparable to or faster than GPUs on large workloads (specialized hardware) |

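`torch.cuda.is_available()` reports whether PyTorch can use a CUDA-capable GPU. The code above selects `cuda` if so and falls back to `cpu` otherwise.

**Note**: Training on CPU is fine for learning, but training models at realistic scale is impractical without GPUs or similar accelerators.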

In [81]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)
print("Training loss:", train_loss)
print("Validation loss:", val_loss)

Training loss: 10.986330868726109
Validation loss: 10.984492619832357


## Baseline Loss (Before Training)

Let's measure the loss on both training and validation sets **before any training**.

### What to Expect:

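For a randomly initialized model with vocabulary size 50,257:
- Expected loss ≈ ln(50,257) ≈ **10.82**
- This is the loss of a uniform guess, which assigns probability 1/50,257 to every token

Our measured losses (≈10.99 on both sets) are close to 10.82, which confirms the model is starting from random weights (as expected). They sit slightly above it because randomly initialized logits are not perfectly uniform.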
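The expected value above is easy to check directly. This sketch uses only the vocabulary size and the training loss printed in the cell output; the perplexity line is an added illustration of how to read a loss value, not something the notebook computes.

```python
import math

vocab_size = 50_257
uniform_loss = math.log(vocab_size)   # natural log, matching cross-entropy
print(f"Uniform-guess loss: {uniform_loss:.2f}")

measured_loss = 10.986330868726109    # training loss from the output above
# exp(loss) is the perplexity: the effective number of equally likely
# tokens the model is choosing between.
print(f"Perplexity at measured loss: {math.exp(measured_loss):,.0f}")
print(f"Gap to uniform baseline: {measured_loss - uniform_loss:.2f}")
```

A perplexity at or above the vocabulary size means the model is doing no better than picking tokens uniformly at random, which is exactly what an untrained model should do.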

### Why Check Both Train and Val Loss?

- **Training loss**: How well does the model fit the training data?
- **Validation loss**: How well does the model generalize to unseen data?

If training loss goes down but validation loss goes up → **overfitting**!
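As a rough illustration of that rule, the sketch below flags the point where validation loss bottomed out while training loss kept falling. The function `overfitting_started_at` and both loss histories are hypothetical; real runs are noisier, so treat this as a heuristic rather than a definitive test.

```python
def overfitting_started_at(train_losses, val_losses):
    """Index of the lowest validation loss if, after it, training loss
    kept falling while validation loss rose; otherwise None."""
    best = min(range(len(val_losses)), key=val_losses.__getitem__)
    if best == len(val_losses) - 1:
        return None  # validation loss is still at its minimum
    train_fell = train_losses[-1] < train_losses[best]
    val_rose = val_losses[-1] > val_losses[best]
    return best if train_fell and val_rose else None


# Made-up loss histories, one value per evaluation step
train_hist = [10.9, 6.0, 4.0, 2.0, 1.0]
val_hist = [10.9, 7.0, 6.5, 6.8, 7.2]

print(overfitting_started_at(train_hist, val_hist))
```

The returned index marks the evaluation step after which extra training only helped on the training set, which is a natural place to consider stopping.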