# Building nanoGPT

In this notebook, you'll build a complete GPT language model from scratch in PyTorch.

**What you'll do:**
- Build token embeddings + positional encoding and verify output shapes
- Implement a single self-attention head with causal masking
- Assemble multi-head attention using the batched reshape trick
- Wire together the full transformer block (attention + FFN + layer norms + residuals)
- Compose everything into a complete GPT model class, verify the parameter count matches GPT-2 (~124M), and generate text

**For each exercise, PREDICT the output before running the cell.** Wrong predictions are more valuable than correct ones — they reveal gaps in your mental model.

In [None]:
!pip install -q tiktoken

import torch
import torch.nn as nn
import torch.nn.functional as F
import tiktoken
import matplotlib.pyplot as plt

# Reproducible results
torch.manual_seed(42)

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

In [None]:
# ============================================================
# Configuration: all hyperparameters in one place
# ============================================================
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 50257     # GPT-2 BPE vocabulary
    block_size: int = 1024      # Maximum context length
    n_layer: int = 12           # Number of transformer blocks
    n_head: int = 12            # Number of attention heads
    n_embd: int = 768           # Embedding dimension (d_model)
    dropout: float = 0.0        # Dropout rate (0 for exercises — deterministic output)
    bias: bool = False          # Use bias in Linear layers?

# Tiny debug config: fast iteration, same architecture at smaller sizes
debug_config = GPTConfig(
    vocab_size=256,
    block_size=64,
    n_layer=4,
    n_head=4,
    n_embd=128,
)

print(f"Debug config: {debug_config}")
print(f"Head size: {debug_config.n_embd // debug_config.n_head}")

---

## Exercise 1: Token Embedding + Positional Encoding (Guided)

The bottom of the GPT architecture: turn token IDs into vectors, then add positional information. Two `nn.Embedding` layers, one addition.

The token embedding maps each token ID to a learned vector of size `n_embd`. The position embedding maps each position (0, 1, 2, ...) to a learned vector of the same size. Adding them gives a representation that encodes both *what* the token is and *where* it sits.

**Before running, predict:**
- If input `idx` has shape `(B=2, T=16)`, what shape will `tok_emb` have?
- What shape will `pos_emb` have? (Hint: the same positions are used for every sequence in the batch)
- What shape will `x` (the sum) have? How does broadcasting work here?

In [None]:
# Token embedding + positional encoding
wte = nn.Embedding(debug_config.vocab_size, debug_config.n_embd)  # token embedding
wpe = nn.Embedding(debug_config.block_size, debug_config.n_embd)  # position embedding

# Simulate a batch of token IDs
B, T = 2, 16
idx = torch.randint(0, debug_config.vocab_size, (B, T))  # (B, T)

# Token embeddings
tok_emb = wte(idx)                          # (B, T) -> (B, T, n_embd)

# Position embeddings — same positions for every batch element
pos = torch.arange(0, T)                    # (T,)
pos_emb = wpe(pos)                          # (T,) -> (T, n_embd)

# Add them — broadcasting adds (T, n_embd) to each batch element
x = tok_emb + pos_emb                       # (B, T, n_embd)

print(f"Input idx shape:       {idx.shape}")        # (2, 16)
print(f"Token embed shape:     {tok_emb.shape}")    # (2, 16, 128)
print(f"Position embed shape:  {pos_emb.shape}")    # (16, 128)
print(f"Combined x shape:      {x.shape}")          # (2, 16, 128)
print()
print(f"Token embed params:    {wte.weight.shape[0]} tokens x {wte.weight.shape[1]} dims = {wte.weight.numel():,}")
print(f"Position embed params: {wpe.weight.shape[0]} positions x {wpe.weight.shape[1]} dims = {wpe.weight.numel():,}")

**What just happened:**
- `wte(idx)` looks up a learned vector for each token ID: shape goes from `(B, T)` to `(B, T, n_embd)`
- `wpe(pos)` looks up a learned vector for each position 0..T-1: shape is `(T, n_embd)`
- Addition broadcasts `(T, n_embd)` across the batch dimension, producing `(B, T, n_embd)`
- Every token now carries both its *identity* (from `wte`) and its *position* (from `wpe`)

This is the entry point of the model. Everything downstream operates on these `(B, T, n_embd)` vectors.
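If the lookup and the broadcast still feel opaque, the sketch below checks two equivalences: an embedding lookup is row indexing into the weight matrix, and the broadcast addition matches adding an explicitly expanded copy of `pos_emb`. It rebuilds `wte` and `wpe` with the `debug_config` sizes written out as literals so that it runs on its own; the names `same_lookup`, `same_sum`, and `x_explicit` are introduced here for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, n_embd = 256, 64, 128   # debug_config sizes, written out
wte = nn.Embedding(vocab_size, n_embd)
wpe = nn.Embedding(block_size, n_embd)

B, T = 2, 16
idx = torch.randint(0, vocab_size, (B, T))      # (B, T)
pos = torch.arange(T)                           # (T,)

# An embedding lookup is row indexing into the weight matrix
tok_emb = wte(idx)                              # (B, T, n_embd)
same_lookup = torch.equal(tok_emb, wte.weight[idx])

# Broadcasting (T, n_embd) over (B, T, n_embd) matches an explicit per-sequence copy
pos_emb = wpe(pos)                              # (T, n_embd)
x = tok_emb + pos_emb
x_explicit = tok_emb + pos_emb.unsqueeze(0).expand(B, T, n_embd)
same_sum = torch.equal(x, x_explicit)

print(same_lookup, same_sum)                    # True True
```

Broadcasting never materializes the expanded copy, which is why the one-line addition is the idiomatic form.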

---

## Exercise 2: Single Self-Attention Head (Guided)

The core operation: compute `attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V`, with a causal mask that prevents attending to future positions.

You traced this formula by hand in Module 4.2. Now see it in code. Every line maps to a step you already know:
- Three linear projections create Q, K, V (the three "lenses" on the input)
- Scaled dot-product gives attention scores
- The causal mask sets future positions to -inf
- Softmax converts scores to weights
- Weighted sum of V produces the output
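Why *scaled* dot-product? A rough sketch in plain Python (no PyTorch needed), assuming the entries of `q` and `k` are independent with zero mean and unit variance. That is an idealization; trained projections need not satisfy it. Under this assumption the variance of a raw dot product is about `d_k`, and dividing by `sqrt(d_k)` brings it back to about 1. The helper `random_vector` and the sample count are choices made for this illustration.

```python
import math
import random
import statistics

random.seed(0)
d_k = 32           # head size used in this exercise
n_samples = 5000   # arbitrary; enough for a stable estimate

def random_vector(dim):
    """Vector of independent standard-normal entries (the idealized q or k)."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

raw_scores = []
for _ in range(n_samples):
    q, k = random_vector(d_k), random_vector(d_k)
    raw_scores.append(sum(a * b for a, b in zip(q, k)))

# Dividing the scores by sqrt(d_k) divides their variance by d_k
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

var_raw = statistics.pvariance(raw_scores)
var_scaled = statistics.pvariance(scaled_scores)
print(f"variance of raw q.k:        {var_raw:.1f}  (theory: {d_k})")
print(f"variance after 1/sqrt(d_k): {var_scaled:.2f}  (theory: 1)")
```

With unit-variance scores, softmax starts out reasonably spread instead of collapsing onto a single position.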

**Before running, predict:**
- If input is `(B=2, T=16, n_embd=128)` and `head_size=32`, what shape are Q, K, V?
- What shape are the attention scores (`q @ k.transpose(-2, -1)`)?
- What does the causal mask look like for T=4? (Draw the 4x4 grid: which entries are 0 vs 1?)
- What is the final output shape?

In [None]:
class Head(nn.Module):
    """Single head of self-attention."""

    def __init__(self, config, head_size):
        super().__init__()
        # Q, K, V projections — three "lenses" on the same input
        self.query = nn.Linear(config.n_embd, head_size, bias=config.bias)
        self.key   = nn.Linear(config.n_embd, head_size, bias=config.bias)
        self.value = nn.Linear(config.n_embd, head_size, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

        # Causal mask — registered as a buffer (not a parameter)
        self.register_buffer(
            'mask',
            torch.tril(torch.ones(config.block_size, config.block_size))
        )

    def forward(self, x):
        B, T, C = x.shape                       # (batch, seq_len, n_embd)

        q = self.query(x)                       # (B, T, head_size)
        k = self.key(x)                         # (B, T, head_size)
        v = self.value(x)                       # (B, T, head_size)

        # Scaled dot-product attention
        scale = k.shape[-1] ** -0.5
        scores = q @ k.transpose(-2, -1) * scale   # (B, T, T)

        # Causal mask: set future positions to -inf
        scores = scores.masked_fill(
            self.mask[:T, :T] == 0, float('-inf')
        )                                        # (B, T, T)

        weights = F.softmax(scores, dim=-1)     # (B, T, T)
        weights = self.dropout(weights)

        out = weights @ v                       # (B, T, head_size)
        return out


# ---- Verify ----
head = Head(debug_config, head_size=32)
x = torch.randn(2, 16, 128)  # (B=2, T=16, n_embd=128)
out = head(x)

print(f"Input shape:  {x.shape}")     # (2, 16, 128)
print(f"Output shape: {out.shape}")   # (2, 16, 32)
assert out.shape == (2, 16, 32), "Head output shape wrong!"
print("\nShape check passed!")

# Inspect the causal mask for a tiny sequence
print(f"\nCausal mask (first 4x4):")
print(head.mask[:4, :4])

**What just happened:**
- Three `nn.Linear` projections created Q, K, V, each with shape `(B, T, head_size)`
- `q @ k.transpose(-2, -1)` computed raw attention scores: `(B, T, head_size) @ (B, head_size, T) = (B, T, T)`
- The `1/sqrt(d_k)` scale prevents scores from growing large, which would push softmax into saturation
- `masked_fill` set above-diagonal entries to `-inf`, which softmax converts to 0 — this is the causal mask
- `weights @ v` produced the weighted sum: `(B, T, T) @ (B, T, head_size) = (B, T, head_size)`
- `register_buffer` stores the mask as a non-trainable tensor that moves to GPU with the model
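To make the `-inf` step concrete, here is a minimal plain-Python sketch for `T = 4`. It assumes every unmasked raw score is 0, an artificial choice that makes the result easy to read: each row should then spread its weight evenly over the positions it is allowed to see. The `softmax` helper is written out here for illustration; the `Head` class above uses `F.softmax`.

```python
import math

def softmax(row):
    """Numerically stable softmax; math.exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

T = 4
NEG_INF = float("-inf")

# Row i may attend to columns 0..i; later columns are masked with -inf
scores = [[0.0 if j <= i else NEG_INF for j in range(T)] for i in range(T)]
weights = [softmax(row) for row in scores]

for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Every row sums to 1 and every above-diagonal weight is exactly 0, so position `t` can only mix values from positions `0..t`.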

The input was `(B, T, n_embd)` and the output is `(B, T, head_size)`. The dimension shrank from 128 to 32 — this is one head's "slice" of the full representation.

---

## Exercise 3: Multi-Head Attention (Supported)

Multiple heads run in parallel with dimension splitting: `d_k = d_model / h`. The lesson showed two approaches:
1. **Explicit loop**: run `h` independent `Head` modules, concatenate
2. **Batched reshape**: one projection produces all heads, reshape splits them

The batched approach is what production implementations use. A single `nn.Linear` produces Q, K, V for **all** heads at once. Then a reshape operation IS the dimension splitting.

**Your task:** Fill in the TODOs to complete `CausalSelfAttention`. Each TODO is 1-3 lines. The shape comments tell you exactly what each operation should produce.

<details>
<summary>💡 Solution</summary>

The key insight is that the reshape from `(B, T, n_embd)` to `(B, n_head, T, head_size)` IS the dimension splitting. A single large projection creates all heads at once, and `.view().transpose()` separates them.

```python
# TODO 1: Split qkv into q, k, v — each gets n_embd dimensions
q, k, v = qkv.split(self.n_embd, dim=2)

# TODO 2: Reshape into (B, n_head, T, head_size)
q = q.view(B, T, self.n_head, head_size).transpose(1, 2)
k = k.view(B, T, self.n_head, head_size).transpose(1, 2)
v = v.view(B, T, self.n_head, head_size).transpose(1, 2)

# TODO 3: Compute attention scores and apply causal mask
scale = head_size ** -0.5
scores = q @ k.transpose(-2, -1) * scale
scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)
weights = self.attn_dropout(weights)

# TODO 4: Combine heads back
out = weights @ v
out = out.transpose(1, 2).contiguous().view(B, T, C)
```

Common mistake: swapping `n_head` and `T` in the view/transpose. The result has the correct final shape but completely wrong values — shape correctness is not enough.
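One way this swap can happen is calling `.view(B, n_head, T, head_size)` directly instead of `.view(B, T, n_head, head_size).transpose(1, 2)`. The sketch below uses toy sizes and an `arange` tensor (both chosen for illustration, not taken from the exercise) so you can see where each number lands.

```python
import torch

B, T, n_head, head_size = 1, 3, 2, 2
C = n_head * head_size

# Token t holds the values 4t .. 4t+3
q = torch.arange(B * T * C).view(B, T, C)

right = q.view(B, T, n_head, head_size).transpose(1, 2)   # (B, n_head, T, head_size)
wrong = q.view(B, n_head, T, head_size)                   # same shape, scrambled contents

print(right.shape == wrong.shape)   # True
print(right[0, 0].tolist())         # [[0, 1], [4, 5], [8, 9]]
print(wrong[0, 0].tolist())         # [[0, 1], [2, 3], [4, 5]]
```

In `right`, head 0 holds the first `head_size` features of each token. In `wrong`, head 0's second "time step" is actually the other half of token 0, so attention would mix features as if they were positions.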

</details>

In [None]:
class CausalSelfAttention(nn.Module):
    """Multi-head attention with batched computation (no loop)."""

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0

        # Q, K, V for ALL heads in one projection
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)  # W_O
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd

        self.register_buffer(
            'mask',
            torch.tril(torch.ones(config.block_size, config.block_size))
                 .view(1, 1, config.block_size, config.block_size)
        )

    def forward(self, x):
        B, T, C = x.shape                                         # (B, T, n_embd)
        head_size = C // self.n_head

        # Single projection produces Q, K, V for all heads
        qkv = self.c_attn(x)                                      # (B, T, 3*n_embd)

        # TODO 1: Split qkv into q, k, v using .split()
        # Each should have shape (B, T, n_embd)
        # Hint: split along dim=2, each chunk has self.n_embd elements
        q, k, v = None, None, None  # <-- REPLACE THIS LINE

        # TODO 2: Reshape each into (B, n_head, T, head_size)
        # Step 1: .view(B, T, self.n_head, head_size)
        # Step 2: .transpose(1, 2) to move n_head before T
        # Apply to q, k, and v
        pass  # <-- REPLACE WITH THREE LINES (one per tensor)

        # TODO 3: Compute scaled dot-product attention with causal mask
        # - Compute scale factor (1/sqrt(head_size))
        # - scores = q @ k.transpose(-2, -1) * scale       -> (B, nh, T, T)
        # - Apply causal mask with masked_fill
        # - Softmax over last dimension
        # - Apply self.attn_dropout
        weights = None  # <-- REPLACE THIS (multiple lines)

        # TODO 4: Weighted sum of values and recombine heads
        # - out = weights @ v                                -> (B, nh, T, hs)
        # - Transpose back: .transpose(1, 2)                -> (B, T, nh, hs)
        # - Reshape: .contiguous().view(B, T, C)            -> (B, T, n_embd)
        out = None  # <-- REPLACE THIS

        # Output projection (W_O)
        out = self.c_proj(out)                                     # (B, T, n_embd)
        out = self.resid_dropout(out)
        return out


# ---- Verify ----
mha = CausalSelfAttention(debug_config)
x = torch.randn(2, 16, 128)  # (B=2, T=16, n_embd=128)
out = mha(x)

print(f"Input shape:  {x.shape}")     # (2, 16, 128)
print(f"Output shape: {out.shape}")   # (2, 16, 128)
assert out.shape == (2, 16, 128), "MHA output shape wrong!"
print("\nShape check passed! Multi-head attention preserves shape.")
print(f"\nParameters in c_attn: {mha.c_attn.weight.shape} = {mha.c_attn.weight.numel():,}")
print(f"Parameters in c_proj: {mha.c_proj.weight.shape} = {mha.c_proj.weight.numel():,}")

**What just happened:**
- A single `nn.Linear(n_embd, 3*n_embd)` produced Q, K, V for all heads at once
- `.split()` divided the output into three equal chunks
- `.view().transpose()` reshaped from `(B, T, n_embd)` to `(B, n_head, T, head_size)` — this IS the dimension splitting
- Attention scores `(B, n_head, T, T)` were computed for all heads simultaneously — no Python loop
- `.transpose().contiguous().view()` merged heads back to `(B, T, n_embd)`
- The output projection `c_proj` (W_O) mixed information across heads

Input shape = output shape = `(B, T, n_embd)`. Multi-head attention reads from the context and writes back to the same representation space.

---

## Exercise 4: Transformer Block (Supported)

The transformer block formula, as given in *The Transformer Block*:
- `x' = x + MHA(LN(x))`
- `out = x' + FFN(LN(x'))`

Pre-norm ordering (layer norm before the sub-layer), two residual connections. Attention reads from the context, FFN processes each token independently.

**Your task:** Fill in the TODOs to complete both `FeedForward` and `Block`. The FeedForward uses the 4x expansion factor: `n_embd -> 4*n_embd -> n_embd` with GELU activation.

<details>
<summary>💡 Solution</summary>

The FeedForward network is two linear layers with GELU in between. The 4x expansion factor gives the network more capacity to transform each token's representation independently.

```python
# FeedForward.forward:
x = self.c_fc(x)        # (B, T, 4*n_embd)
x = self.gelu(x)        # (B, T, 4*n_embd)
x = self.c_proj(x)      # (B, T, n_embd)
x = self.dropout(x)
return x

# Block.forward:
x = x + self.attn(self.ln_1(x))   # residual + MHA(LN(x))
x = x + self.ffn(self.ln_2(x))    # residual + FFN(LN(x))
return x
```

Key insight: the Block's forward method IS the formula. `x +` is the residual connection, `self.ln_1(x)` is the pre-norm, and `self.attn(...)` is the sub-layer. The block preserves shape because both residual additions stay in `(B, T, n_embd)`.

</details>

In [None]:
class FeedForward(nn.Module):
    """Position-wise feed-forward network."""

    def __init__(self, config):
        super().__init__()
        self.c_fc   = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu   = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):                        # (B, T, n_embd)
        # TODO: Pass x through c_fc, gelu, c_proj, dropout
        # Shape journey: (B,T,n_embd) -> (B,T,4*n_embd) -> (B,T,4*n_embd) -> (B,T,n_embd)
        pass  # <-- REPLACE WITH 4 LINES


class Block(nn.Module):
    """Transformer block: MHA + FFN with residual connections and layer norm."""

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.ffn  = FeedForward(config)

    def forward(self, x):                        # (B, T, n_embd)
        # TODO: Implement the block formula
        # x' = x + MHA(LN(x))
        # out = x' + FFN(LN(x'))
        # Hint: each line is a residual add: x = x + sublayer(layernorm(x))
        pass  # <-- REPLACE WITH 2 LINES + return


# ---- Verify ----
block = Block(debug_config)
x = torch.randn(2, 16, 128)  # (B, T, n_embd)
out = block(x)

print(f"Input shape:  {x.shape}")     # (2, 16, 128)
print(f"Output shape: {out.shape}")   # (2, 16, 128)
assert out.shape == x.shape, "Block must preserve shape!"
print("\nShape check passed! Block preserves shape exactly.")
print("This is what makes stacking possible — 12 identical blocks,")
print("each reading from and writing to the same residual stream.")

# Count parameters in one block
block_params = sum(p.numel() for p in block.parameters())
print(f"\nParameters in one block: {block_params:,}")

**What just happened:**
- `FeedForward` expands from `n_embd` to `4*n_embd`, applies GELU, then projects back to `n_embd`. This is the "writer" — it processes each token independently after attention has gathered context.
- `Block.forward` IS the formula: `x = x + sublayer(LN(x))` for both attention and FFN.
- Pre-norm ordering means layer norm comes *before* each sub-layer, not after. This is the modern standard (GPT-2 and later).
- The block preserves shape: `(B, T, n_embd)` in, `(B, T, n_embd)` out. This is what makes stacking 12 identical blocks possible.
- The FFN holds about two-thirds of the block's parameters, despite being conceptually simpler than attention.
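The two-thirds figure can be checked by hand. The sketch below counts parameters in plain Python for the layers defined in `CausalSelfAttention`, `FeedForward`, and `Block`, assuming `bias=False` for the linear layers (the `GPTConfig` default) and the default `nn.LayerNorm`, which has a weight and a bias. `block_param_counts` is a hypothetical helper written for this illustration.

```python
def block_param_counts(n_embd, bias=False):
    """Parameter counts for one transformer block, by component."""
    def linear(n_in, n_out):
        return n_in * n_out + (n_out if bias else 0)

    attn = linear(n_embd, 3 * n_embd) + linear(n_embd, n_embd)       # c_attn + c_proj
    ffn = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)    # c_fc + c_proj
    norms = 2 * (2 * n_embd)                                         # ln_1 + ln_2 (weight and bias each)
    return {"attn": attn, "ffn": ffn, "norms": norms}

counts = block_param_counts(n_embd=128)   # debug_config width
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:>5}: {n:>8,}  ({n / total:.1%})")
print(f"total: {total:>8,}")              # total:  197,120
```

Attention costs `4 * n_embd**2` weights and the FFN costs `8 * n_embd**2`, so the 2:1 ratio holds at any width; the layer norms are a rounding error.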

---

## Exercise 5: Complete GPT Model (Independent)

Now wire everything together into a complete GPT model class. The GPT class needs:

1. **`__init__`**: Create token embeddings (`wte`), position embeddings (`wpe`), dropout, a list of `Block` modules, a final `LayerNorm`, and an output projection (`lm_head`). Apply weight tying between `wte` and `lm_head`.

2. **`forward(idx, targets=None)`**: Token IDs in, logits out.
   - Look up token embeddings: `(B, T)` → `(B, T, n_embd)`
   - Add position embeddings (positions 0..T-1)
   - Apply dropout
   - Pass through all blocks
   - Final layer norm
   - Output projection to vocab size: `(B, T, n_embd)` → `(B, T, vocab_size)`
   - If targets provided, compute cross-entropy loss
   - Return `(logits, loss)`

3. **`generate(idx, max_new_tokens, temperature=1.0)`**: Autoregressive text generation.
   - For each new token: crop to block_size, forward pass, take last position's logits, apply temperature, sample, append

**Configuration reminder:** Use `GPTConfig` for all hyperparameters. The `nn.ModuleDict` pattern from the lesson organizes sub-modules cleanly.

**Verification targets:**
- With `debug_config`: model should accept `(B, T)` input and produce `(B, T, vocab_size)` logits
- With full `GPTConfig()`: parameter count should be ~124.4M
- `generate()` should produce valid token IDs (gibberish is expected — the model is untrained)
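The ~124.4M target can be worked out before you write any model code. Below is a minimal sketch in plain Python, assuming the layer layout from Exercises 3 and 4, a default `nn.LayerNorm` (weight plus bias), and an `lm_head` with no bias that shares its weight with `wte`. `gpt_param_count` is a hypothetical helper, not part of the notebook.

```python
def gpt_param_count(vocab_size=50257, block_size=1024, n_layer=12,
                    n_embd=768, bias=False):
    """Expected number of unique parameters for the GPT described above."""
    def linear(n_in, n_out):
        return n_in * n_out + (n_out if bias else 0)

    layer_norm = 2 * n_embd                  # weight + bias
    wte = vocab_size * n_embd                # tied with lm_head, so counted once
    wpe = block_size * n_embd
    attn = linear(n_embd, 3 * n_embd) + linear(n_embd, n_embd)
    ffn = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
    block = attn + ffn + 2 * layer_norm
    return wte + wpe + n_layer * block + layer_norm   # last term is ln_f

print(f"{gpt_param_count():,}")              # 124,356,864
print(f"{gpt_param_count(bias=True):,}")     # 124,439,808
```

If your model reports about 38.6M more than this, the most likely cause is that `wte` and `lm_head` are not actually sharing one tensor.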

<details>
<summary>💡 Solution</summary>

The GPT class is an assembly job — every component was built in the previous exercises. The key decisions are:

1. **Weight tying**: `self.transformer.wte.weight = self.lm_head.weight` makes the embedding and output projection share the same matrix. Embedding maps token ID → vector; output projection maps vector → token scores. Same mapping, opposite direction. This saves about 38.6M parameters (50,257 × 768).

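2. **Weight initialization**: Normal distribution with σ=0.02 for every `nn.Linear` and `nn.Embedding` weight; biases, when present, start at zero. Residual projections (named `c_proj`) are then re-initialized with σ scaled by `1/sqrt(2*n_layer)`, because each of the `n_layer` blocks adds two contributions to the residual stream and the scaling keeps activations from growing with depth.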

3. **Generate method**: `@torch.no_grad()` disables gradient tracking, which saves memory and time during sampling. The crop to `block_size` handles sequences longer than the context window.

```python
class GPT(nn.Module):
    """Complete GPT language model."""

    def __init__(self, config):
        super().__init__()
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))

        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # Weight tying
        self.transformer.wte.weight = self.lm_head.weight

        # Initialize weights
        self.apply(self._init_weights)
        # Scaled init for residual projections
        for name, p in self.named_parameters():
            if name.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02 / (2 * config.n_layer) ** 0.5)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        if isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        assert T <= self.config.block_size, \
            f"Sequence length {T} exceeds block_size {self.config.block_size}"

        tok_emb = self.transformer.wte(idx)
        pos = torch.arange(0, T, device=idx.device)
        pos_emb = self.transformer.wpe(pos)
        x = self.transformer.drop(tok_emb + pos_emb)

        for block in self.transformer.h:
            x = block(x)

        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )

        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.config.block_size:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, idx_next], dim=1)
        return idx
```

Common mistakes:
- Forgetting weight tying (parameter count will be ~38M too high)
- Not using `torch.no_grad()` in generate (wastes memory on gradient tracking)
- Forgetting to crop `idx` to `block_size` in generate (crashes on long sequences)
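The temperature division in `generate` is also worth seeing in isolation. The sketch below uses plain Python and made-up logits for a 3-token vocabulary (both are assumptions for illustration) to show how dividing by `temperature` before softmax reshapes the sampling distribution. `softmax_with_temperature` is a hypothetical helper; the solution above applies the same two steps with `F.softmax`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]   # made-up scores for a 3-token vocabulary
results = {}
for temp in [0.5, 1.0, 2.0]:
    results[temp] = softmax_with_temperature(logits, temp)
    print(f"temperature {temp}: {[round(p, 3) for p in results[temp]]}")
```

Lower temperatures concentrate probability on the highest-scoring token, and higher temperatures flatten the distribution toward uniform. On an untrained model the logits are nearly equal to begin with, so all three settings produce gibberish.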

</details>

In [None]:
# ============================================================
# YOUR TASK: Implement the complete GPT model
# ============================================================

# Write your GPT class here.
# It needs: __init__, _init_weights, forward, generate
# See the specification above for exact requirements.





In [None]:
# ---- Verify with debug config ----
model = GPT(debug_config)

# Test forward pass
idx = torch.randint(0, debug_config.vocab_size, (2, 16))  # (B=2, T=16)
logits, loss = model(idx)

print(f"Input shape:  {idx.shape}")        # (2, 16)
print(f"Logits shape: {logits.shape}")     # (2, 16, 256)
assert logits.shape == (2, 16, debug_config.vocab_size), "Forward pass shape wrong!"
print("Forward pass shape check passed!")

# Test with targets (loss computation)
targets = torch.randint(0, debug_config.vocab_size, (2, 16))
logits, loss = model(idx, targets=targets)
print(f"\nLoss: {loss.item():.4f}")
print(f"Expected ~ln(vocab_size) = ln({debug_config.vocab_size}) = {torch.log(torch.tensor(float(debug_config.vocab_size))).item():.4f}")
print("(Random model, uniform predictions => loss should be close to ln(vocab_size))")

# Test generate
prompt = torch.zeros((1, 1), dtype=torch.long)  # single token
generated = model.generate(prompt, max_new_tokens=10)
print(f"\nGenerated token IDs: {generated[0].tolist()}")
print(f"Generated shape: {generated.shape}")  # (1, 11)
print("\nDebug config checks passed!")

In [None]:
# ---- Verify with full GPT-2 config ----
print("Building GPT-2 small (124M parameters)...")
print("(This may take a few seconds)\n")

full_model = GPT(GPTConfig())

# Count total parameters
total = sum(p.numel() for p in full_model.parameters())
print(f"Total parameters: {total:,}")

# Count per component
print("\nPer component:")
for name, module in full_model.transformer.named_children():
    n = sum(p.numel() for p in module.parameters())
    print(f"  {name}: {n:,}")

lm_head_params = sum(p.numel() for p in full_model.lm_head.parameters())
print(f"  lm_head: {lm_head_params:,} (weight-tied with wte)")

# The parameter count should be ~124M
# (exact number depends on whether biases are used)
print(f"\nParameter count matches GPT-2 small: ", end="")
if 123_000_000 < total < 125_000_000:
    print("YES")
else:
    print(f"NO (expected ~124M, got {total:,})")
    print("Check: weight tying? bias setting? dimensions?")

In [None]:
# ---- Generate text from the untrained model ----
enc = tiktoken.get_encoding("gpt2")

# Move model to device
full_model = full_model.to(device)
full_model.eval()

prompt_text = "The meaning of life is"
prompt_ids = enc.encode(prompt_text)
idx = torch.tensor([prompt_ids], device=device)  # (1, T)

print(f"Prompt: \"{prompt_text}\"")
print(f"Prompt token IDs: {prompt_ids}")
print()

# Generate with different temperatures
for temp in [0.5, 1.0, 2.0]:
    torch.manual_seed(42)
    output = full_model.generate(idx, max_new_tokens=30, temperature=temp)
    text = enc.decode(output[0].tolist())
    print(f"Temperature {temp}: {text}")

print()
print("The output is gibberish — and that's correct!")
print("The architecture works. Every weight is random.")
print("In the next lesson, you train it on real text.")

---

## Key Takeaways

1. **Five `torch.nn` modules supply every layer of the GPT model.** `nn.Linear`, `nn.Embedding`, `nn.LayerNorm`, `nn.GELU`, `nn.Dropout`. Every one is familiar from Series 2. The complexity is in the assembly, not the parts.

2. **Bottom-up assembly: Head → MHA → FFN → Block → GPT.** Each class is small (5–15 lines), testable independently, and maps 1:1 to a concept from Module 4.2. The build order mirrors the learning order.

3. **Parameter count = architecture verification.** If your model has ~124.4M parameters, the dimensions, projections, and weight tying are correct. One number confirms the entire structure.

4. **An untrained model generating gibberish is a success.** It means the architecture, shapes, masking, and generation loop all work. The model is correct — it just needs to learn from data.