# Loading Real Weights
## Module 4.3, Lesson 4: Load pretrained GPT-2 into your own architecture

**What you'll do:**
1. Download GPT-2 weights from HuggingFace and inspect the state dict structure
2. Compare your nanoGPT model's parameter names to GPT-2's parameter names
3. Write the weight mapping function to translate between naming conventions
4. Load the mapped weights and verify correctness with logit comparison
5. Generate text with different prompts and compare to HuggingFace's built-in pipeline

**Methodology:** For each exercise, PREDICT the output before running the cell. Wrong predictions are more valuable than correct ones: they reveal gaps in your mental model.

---

**Prerequisites:** Building nanoGPT (Module 4.3, Lesson 1), Pretraining (Lesson 2)

## 0. Setup

In [None]:
!pip install -q transformers tiktoken

import torch
import torch.nn as nn
import torch.nn.functional as F
import tiktoken
from dataclasses import dataclass

# Reproducibility
torch.manual_seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
print(f'PyTorch version: {torch.__version__}')

## 1. Your nanoGPT Architecture

This is the GPT model you built in Building nanoGPT. We need it here so we can load real weights into it. Read through it: this is your code, just collected into one cell for convenience.

In [None]:
@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768


class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # Combined Q, K, V projection
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # Output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.size()
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Scaled dot-product attention with causal mask
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        y = self.c_proj(y)
        return y


class MLP(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x


class Block(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x


class GPT(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(config.vocab_size, config.n_embd),
            wpe=nn.Embedding(config.block_size, config.n_embd),
            h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f=nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        # Weight tying: embedding and output projection share the same tensor
        self.transformer.wte.weight = self.lm_head.weight

    def forward(self, idx):
        B, T = idx.size()
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        tok_emb = self.transformer.wte(idx)
        pos_emb = self.transformer.wpe(pos)
        x = tok_emb + pos_emb
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)
        return logits

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            logits = self(idx_cond)
            logits = logits[:, -1, :] / temperature
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx


# Create our model with GPT-2 config
config = GPTConfig()
model_ours = GPT(config)
model_ours.to(device)
print(f'Our model: {sum(p.numel() for p in model_ours.parameters()):,} parameters')
print(f'Weight tying active: {model_ours.transformer.wte.weight is model_ours.lm_head.weight}')

---

## Exercise 1: Download and Inspect GPT-2 Weights `[Guided]`

We use HuggingFace as a weight download tool, nothing more. Two lines of code give us the official GPT-2 weights and a reference implementation to compare against.

**Before running, predict:**
- How many keys will the HuggingFace state dict have?
- How many keys will your model's state dict have?
- Will they be the same number? If not, why?
- What will the first few key names look like?

In [None]:
from transformers import GPT2LMHeadModel

# Download the official GPT-2 weights (~500MB)
model_hf = GPT2LMHeadModel.from_pretrained("gpt2")
model_hf.to(device)
model_hf.eval()

print(f'HuggingFace model: {sum(p.numel() for p in model_hf.parameters()):,} parameters')
print()

# Inspect the state dicts
sd_hf = model_hf.state_dict()
sd_ours = model_ours.state_dict()

print(f'HuggingFace state dict: {len(sd_hf)} keys')
print(f'Our state dict:         {len(sd_ours)} keys')
print()

# Show first 15 keys from each
print('=== First 15 HuggingFace keys ===')
for i, k in enumerate(sd_hf.keys()):
    if i >= 15:
        break
    print(f'  {k:<50s} {str(tuple(sd_hf[k].shape)):>20s}')

print()
print('=== First 15 of our keys ===')
for i, k in enumerate(sd_ours.keys()):
    if i >= 15:
        break
    print(f'  {k:<50s} {str(tuple(sd_ours[k].shape)):>20s}')

**What you should notice:**

- Your model has **149 keys**: 2 embeddings + 12 blocks × 12 tensors + 2 for `ln_f` + 1 for `lm_head.weight`. `state_dict()` lists a tied tensor under every name it has, so `transformer.wte.weight` and `lm_head.weight` both appear even though they share the same memory. One tensor, two names. (`parameters()`, by contrast, deduplicates, which is why the parameter counts printed above are not inflated by weight tying.)
- The parameter key names are the same in both models (both use `transformer.h.0.ln_1.weight` etc.).
- Depending on your `transformers` version, HuggingFace's state dict may also contain attention mask buffers (`attn.bias`, `attn.masked_bias`) that we don't have. These are registered buffers, not learned parameters. Older releases save them in the state dict; newer releases may not, in which case the two key counts are equal.

The key names line up, but some of the shapes do not, so a naive `load_state_dict()` will fail.
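The key count for your model can be sanity-checked by hand before trusting the printed number. The sketch below is plain arithmetic, not a call into PyTorch: `Cfg` and `expected_counts` are hypothetical names introduced here, and the per-block tensor list assumes the architecture from Section 1 with its default config.

```python
from dataclasses import dataclass


@dataclass
class Cfg:
    """Hypothetical stand-in mirroring the defaults of GPTConfig above."""
    block_size: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_embd: int = 768


def expected_counts(cfg: Cfg):
    d = cfg.n_embd
    # Tensor name -> number of elements, for a single transformer block.
    block = {
        "ln_1.weight": d, "ln_1.bias": d,
        "attn.c_attn.weight": d * 3 * d, "attn.c_attn.bias": 3 * d,
        "attn.c_proj.weight": d * d, "attn.c_proj.bias": d,
        "ln_2.weight": d, "ln_2.bias": d,
        "mlp.c_fc.weight": d * 4 * d, "mlp.c_fc.bias": 4 * d,
        "mlp.c_proj.weight": 4 * d * d, "mlp.c_proj.bias": d,
    }
    # Tensors outside the blocks (lm_head.weight is handled separately).
    top = {
        "wte.weight": cfg.vocab_size * d,
        "wpe.weight": cfg.block_size * d,
        "ln_f.weight": d, "ln_f.bias": d,
    }
    # state_dict() names the tied tensor twice, so lm_head.weight adds a key...
    n_keys = len(top) + cfg.n_layer * len(block) + 1
    # ...but it adds no new elements, so it is left out of the parameter total.
    n_params = sum(top.values()) + cfg.n_layer * sum(block.values())
    return n_keys, n_params


n_keys, n_params = expected_counts(Cfg())
print(f"expected state dict keys: {n_keys}")
print(f"expected unique parameters: {n_params:,}")
```

If the numbers printed by the cell above differ from these, the difference is worth explaining before moving on: it usually points at a missing bias, an extra buffer, or a config mismatch.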

---

## Exercise 2: Compare Parameter Names Side by Side `[Guided]`

Now let's look more carefully at the differences. We'll filter to just the *learned parameters* (skip buffers like attention masks) and compare names and shapes.

**Before running, predict:**
- Will the names match exactly, or will some differ?
- Will the shapes all match? If not, which parameters will have different shapes, and why?
- Hint: HuggingFace uses Conv1D for linear projections. Conv1D stores weights as `(in_features, out_features)`. Your nn.Linear stores them as `(out_features, in_features)`. Which layers use Conv1D?

In [None]:
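# Filter HuggingFace state dict to just the learned parameters
# (skip attention mask buffers that aren't real weights).
# The leading dot matters: 'attn.c_attn.bias' also ends with 'attn.bias',
# so matching without the dot would silently drop every c_attn bias.
skip_buffers = {'.attn.masked_bias', '.attn.bias'}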

hf_params = {k: v for k, v in sd_hf.items()
             if not any(k.endswith(s) for s in skip_buffers)}

print(f'HuggingFace learned params: {len(hf_params)} keys')
print(f'Our model params:           {len(sd_ours)} keys')
print()

# Compare parameter-by-parameter for the first transformer block
print('=== Block 0: Side-by-side comparison ===')
print(f'{"HuggingFace Key":<50s} {"Shape":>15s}   {"Our Key":<50s} {"Shape":>15s}   Match?')
print('-' * 145)

block_keys_hf = sorted([k for k in hf_params if k.startswith('transformer.h.0.')])
block_keys_ours = sorted([k for k in sd_ours if k.startswith('transformer.h.0.')])

for hf_key, our_key in zip(block_keys_hf, block_keys_ours):
    hf_shape = tuple(hf_params[hf_key].shape)
    our_shape = tuple(sd_ours[our_key].shape)
    shapes_match = hf_shape == our_shape
    match_str = 'YES' if shapes_match else f'NO  (transposed: {hf_shape} vs {our_shape})'
    print(f'{hf_key:<50s} {str(hf_shape):>15s}   {our_key:<50s} {str(our_shape):>15s}   {match_str}')

print()

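# Count shape mismatches across the full model.
# Note: the square attn.c_proj.weight (768, 768) has matching shapes, so it is
# counted as a shape match here even though its values still need transposing.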
transposed_count = 0
direct_count = 0
for our_key in sd_ours:
    if our_key not in hf_params:
        continue
    if sd_ours[our_key].shape == hf_params[our_key].shape:
        direct_count += 1
    else:
        transposed_count += 1

print(f'Parameters with matching shapes: {direct_count}')
print(f'Parameters with mismatched (transposed) shapes: {transposed_count}')
print()
print('The pattern: every 2D weight in attention (c_attn, c_proj) and FFN (c_fc, c_proj) needs transposing.')
print('Everything else (embeddings, layer norms, biases) copies directly.')

**Why the shapes differ for some parameters:**

HuggingFace's GPT-2 uses a custom **Conv1D** class for the linear projections, a historical artifact from OpenAI's original release. Conv1D stores weights as `(in_features, out_features)`, which is **transposed** relative to `nn.Linear`'s `(out_features, in_features)`.

| Component | HuggingFace (Conv1D) | Your model (nn.Linear) | Need `.t()`? |
|-----------|---------------------|----------------------|-------------|
| `attn.c_attn.weight` | (768, 2304) | (2304, 768) | Yes |
| `attn.c_proj.weight` | (768, 768) | (768, 768) | Yes* |
| `mlp.c_fc.weight` | (768, 3072) | (3072, 768) | Yes |
| `mlp.c_proj.weight` | (3072, 768) | (768, 3072) | Yes |
| Layer norms, biases, embeddings | same | same | No |

\*`attn.c_proj` is square (768×768), so the shape *looks* the same, but the values are still transposed. The fix is the same: `.t()`.
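To make the layout difference concrete, here is a minimal sketch on toy sizes. It assumes a Conv1D-style layer computes `x @ W + b` with `W` stored as `(in_features, out_features)`; the tensor names are made up for illustration and nothing here touches the real GPT-2 weights.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_in, n_out = 4, 6
x = torch.randn(2, n_in)

# Conv1D-style layout: (in_features, out_features)
w_conv = torch.randn(n_in, n_out)
b = torch.randn(n_out)

y_conv = x @ w_conv + b                # what the Conv1D-style layer computes
y_linear = F.linear(x, w_conv.t(), b)  # F.linear expects (out_features, in_features)
layouts_agree = torch.allclose(y_conv, y_linear, atol=1e-6)

# Square case: shapes match either way, so a missing transpose raises no error.
w_sq = torch.randn(n_in, n_in)
b_sq = torch.randn(n_in)
y_ref = x @ w_sq + b_sq
y_wrong = F.linear(x, w_sq, b_sq)       # runs fine, but computes x @ w_sq.T
y_right = F.linear(x, w_sq.t(), b_sq)

print(f"transposed copy reproduces Conv1D output: {layouts_agree}")
print(f"square, no transpose matches:   {torch.allclose(y_ref, y_wrong, atol=1e-6)}")
print(f"square, with transpose matches: {torch.allclose(y_ref, y_right, atol=1e-6)}")
```

The square case is the one to remember: a shape assertion cannot catch it, which is why the mapping rule in the next exercise is keyed on parameter names rather than on shapes.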

---

## Exercise 3: Write the Weight Mapping Function `[Supported]`

Now build the function that copies HuggingFace weights into your model. The function needs to:
1. Iterate over your model's state dict keys
2. Skip `lm_head.weight` (handled by weight tying)
3. Transpose 2D weights from Conv1D layers
4. Copy everything else directly

**The rule is simple:** if the parameter name ends with one of `attn.c_attn.weight`, `attn.c_proj.weight`, `mlp.c_fc.weight`, or `mlp.c_proj.weight`, transpose it. Everything else copies directly.
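Suffix matching is easy to get subtly wrong, so it can help to try the rule on a few key strings before touching any tensors. The sketch below uses only Python strings; the helper names `needs_transpose` and `is_mask_buffer` are hypothetical, and the sample keys follow the naming seen in Exercise 1.

```python
# Suffixes of Conv1D weights, taken from the rule above.
TRANSPOSED_SUFFIXES = (
    "attn.c_attn.weight",
    "attn.c_proj.weight",
    "mlp.c_fc.weight",
    "mlp.c_proj.weight",
)


def needs_transpose(key: str) -> bool:
    # str.endswith accepts a tuple and returns True if any suffix matches.
    return key.endswith(TRANSPOSED_SUFFIXES)


def is_mask_buffer(key: str) -> bool:
    # The leading dot anchors the match at a module boundary.
    return key.endswith((".attn.bias", ".attn.masked_bias"))


sample_keys = [
    "transformer.wte.weight",
    "transformer.h.0.ln_1.weight",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.0.attn.c_attn.bias",
    "transformer.h.0.attn.bias",
    "transformer.h.3.mlp.c_proj.weight",
]

for key in sample_keys:
    print(f"{key:<40s} transpose={needs_transpose(key)!s:<6} buffer={is_mask_buffer(key)}")

# Pitfall: without the leading dot, a learned bias looks like a mask buffer.
naive_false_positive = "transformer.h.0.attn.c_attn.bias".endswith("attn.bias")
print(f"naive 'attn.bias' match on c_attn.bias: {naive_false_positive}")
```

The last line prints `True`: `c_attn.bias` literally ends with the characters `attn.bias`, so any buffer filter should include the leading dot.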

Fill in the TODO sections below.

<details>
<summary>💡 Solution</summary>

The key insight is that you iterate over *your* model's keys (not HuggingFace's), skip the tied weight, and use `endswith()` to detect which weights need transposing. The `copy_()` method modifies the tensor in place. Because the tensors in `sd_ours` share storage with the model's actual parameters, this directly updates the model.

```python
# TODO 1: Define the set of suffixes that need transposing
transposed = {
    'attn.c_attn.weight',
    'attn.c_proj.weight',
    'mlp.c_fc.weight',
    'mlp.c_proj.weight',
}

# TODO 2: Check if this key needs transposing and copy accordingly
needs_transpose = any(key.endswith(t) for t in transposed)

if needs_transpose:
    assert sd_hf[key].shape[::-1] == sd_ours[key].shape, \
        f'Shape mismatch (transposed) for {key}: HF {sd_hf[key].shape} vs ours {sd_ours[key].shape}'
    with torch.no_grad():
        sd_ours[key].copy_(sd_hf[key].t())
else:
    assert sd_hf[key].shape == sd_ours[key].shape, \
        f'Shape mismatch for {key}: HF {sd_hf[key].shape} vs ours {sd_ours[key].shape}'
    with torch.no_grad():
        sd_ours[key].copy_(sd_hf[key])
```

Common mistake in the transposed-shape assertion: writing `shape.t()`. `.t()` is a tensor method, and `shape` is a tuple-like `torch.Size`, so reverse it with `shape[::-1]` instead.

</details>

In [None]:
def load_hf_weights(model_ours, model_hf):
    """Load HuggingFace GPT-2 weights into our model."""
    sd_hf = model_hf.state_dict()
    sd_ours = model_ours.state_dict()

    # Keys to skip: weight tying means lm_head shares with wte
    skip = {'lm_head.weight'}

    # HuggingFace buffers that aren't learned parameters. They never appear in
    # sd_ours, so the loop below never visits them; listed here for reference.
    skip_buffers = {'.attn.masked_bias', '.attn.bias'}

    # TODO 1: Define the set of Conv1D weight suffixes that need transposing.
    # These are the 2D weight matrices in attention and FFN layers.
    # Hint: there are 4 suffixes, the same ones from Exercise 2.
    transposed = {
        # YOUR CODE HERE (4 strings)
    }

    loaded = 0
    transposed_count = 0

    for key in sd_ours:
        if key in skip:
            continue

        # TODO 2: Check if this key needs transposing.
        # Then copy the weight, transposing with .t() if needed.
        # Use assert to verify shapes match before copying.
        # Use torch.no_grad() and copy_() for in-place assignment.
        needs_transpose = False  # REPLACE THIS LINE

        if needs_transpose:
            # YOUR CODE HERE: assert shape match (reversed), then copy with .t()
            transposed_count += 1
            pass
        else:
            # YOUR CODE HERE: assert shape match, then copy directly
            pass

        loaded += 1

    print(f'Loaded {loaded} parameters ({transposed_count} transposed)')
    return loaded, transposed_count


# Run it
loaded, transposed_count = load_hf_weights(model_ours, model_hf)

# Verify
assert loaded == len(sd_ours) - 1, f'Expected {len(sd_ours) - 1} loaded, got {loaded} (one skipped for weight tying)'
assert transposed_count > 0, 'Should have transposed some weights!'
print(f'\nAll {loaded} parameters loaded successfully.')
print(f'Skipped lm_head.weight (weight tying handles it).')

# Verify weight tying is still intact
print(f'\nWeight tying still active: {model_ours.transformer.wte.weight.data_ptr() == model_ours.lm_head.weight.data_ptr()}')

**Why `copy_()` works without `load_state_dict()`:** The tensors in `sd_ours` are not copies: they share storage with the model's actual parameters. When you call `sd_ours[key].copy_(value)`, you're writing directly into the model's memory. No need to call `model_ours.load_state_dict(sd_ours)` afterward; the model is already updated.
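The storage-sharing claim is easy to check on a toy layer. This is a minimal sketch using a small `nn.Linear` rather than the GPT model; the variable names are illustrative only. It also shows the contrast case: rebinding a dictionary entry does *not* change the model.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)
sd = layer.state_dict()

# The state dict tensor and the parameter point at the same memory.
shares_storage = sd["weight"].data_ptr() == layer.weight.data_ptr()

# In-place copy through the state dict writes into the layer itself.
new_w = torch.arange(6, dtype=torch.float32).view(2, 3)
with torch.no_grad():
    sd["weight"].copy_(new_w)
updated_in_place = torch.equal(layer.weight, new_w)

# Rebinding the dict entry only changes the dict, not the layer.
sd["bias"] = torch.full((2,), 7.0)
bias_untouched = not torch.equal(layer.bias, sd["bias"])

print(f"shares storage:       {shares_storage}")
print(f"copy_ updated layer:  {updated_in_place}")
print(f"rebinding left layer: {bias_untouched}")
```

This is why the loading function uses `copy_()` rather than `sd_ours[key] = ...`: assignment would replace the dictionary entry and leave the model's weights exactly as they were.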

---

## Exercise 4: Verify with Logit Comparison `[Supported]`

Text comparison is subjective and stochastic (sampling adds randomness). The gold standard test: feed the **same input tokens** to both models and compare the **output logits**.

If `torch.allclose()` returns True, every component (every projection, every layer norm, every residual connection) produces the same output as the reference. This is stronger than parameter counting (shapes) and stronger than text comparison (subjective).

Fill in the TODO sections to run the verification.

<details>
<summary>💡 Solution</summary>

The key insight is that both models must be in `eval()` mode (disables dropout), and you compare logits, the raw output before sampling. HuggingFace wraps its output in an output object (a `ModelOutput` subclass), so you access `.logits` on the HuggingFace output.

```python
# TODO 1: Put both models in eval mode
model_ours.eval()
model_hf.eval()

# TODO 2: Get logits from both models
logits_ours = model_ours(input_ids)
logits_hf = model_hf(input_ids).logits

# TODO 3: Compare with torch.allclose
match = torch.allclose(logits_ours, logits_hf, atol=1e-5)
```

If `allclose` fails at `atol=1e-5` but passes at `atol=1e-3`, that's floating-point noise from different operation ordering, not a bug. The top-k tokens and their relative probabilities are essentially identical.

</details>

In [None]:
# Tokenize a test prompt
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("The capital of France is")
input_ids = torch.tensor([tokens], device=device)

print(f'Input tokens: {tokens}')
print(f'Input text: "{enc.decode(tokens)}"')
print(f'Input shape: {input_ids.shape}')
print()

# TODO 1: Put both models in eval mode (disables dropout)
# YOUR CODE HERE (2 lines)

with torch.no_grad():
    # TODO 2: Get logits from both models.
    # For your model: logits_ours = model_ours(input_ids)
    # For HuggingFace: logits_hf = model_hf(input_ids).logits
    #   (HuggingFace wraps output in a ModelOutput object; .logits gets the raw tensor)
    logits_ours = None  # REPLACE THIS LINE
    logits_hf = None    # REPLACE THIS LINE

print(f'Our logits shape: {logits_ours.shape}')
print(f'HF logits shape:  {logits_hf.shape}')
print()

# TODO 3: Compare logits with torch.allclose
match = False  # REPLACE THIS LINE

print(f'Logits match (atol=1e-5): {match}')

# Also check with a slightly relaxed tolerance
match_relaxed = torch.allclose(logits_ours, logits_hf, atol=1e-3)
print(f'Logits match (atol=1e-3): {match_relaxed}')
print()

# Show the max difference
max_diff = (logits_ours - logits_hf).abs().max().item()
print(f'Max absolute difference: {max_diff:.2e}')

# Verify both models predict the same next token
next_token_ours = logits_ours[0, -1].argmax().item()
next_token_hf = logits_hf[0, -1].argmax().item()
print(f'\nOur model predicts next token:  "{enc.decode([next_token_ours])}" (id={next_token_ours})')
print(f'HuggingFace predicts next token: "{enc.decode([next_token_hf])}" (id={next_token_hf})')
print(f'Same prediction: {next_token_ours == next_token_hf}')

**The verification chain:**
- In Building nanoGPT, parameter counting verified your architecture had the right **structure**.
- Now, logit comparison verifies your architecture computes the right **function**.
- Together: right structure AND right computation.

If you see a small numerical difference (e.g., `1e-6`) but `allclose` passes, that's floating-point noise from different operation ordering, not a bug.
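It helps to know exactly what `torch.allclose` tests. It checks `|a - b| <= atol + rtol * |b|` elementwise, and `rtol` defaults to `1e-5`, so large-magnitude entries get extra slack even when `atol` is small. The sketch below uses made-up numbers rather than real logits; `compare` is a hypothetical helper.

```python
import torch


def compare(a, b, atol, rtol=1e-5):
    """Return allclose's verdict alongside the same test written out by hand."""
    builtin = torch.allclose(a, b, rtol=rtol, atol=atol)
    manual = bool(((a - b).abs() <= atol + rtol * b.abs()).all())
    return builtin, manual


# Toy "reference logits": one large-magnitude entry, one small one.
ref = torch.tensor([-100.0, 0.5])
noisy = ref + 5e-4  # the same perturbation applied to both entries

strict, strict_manual = compare(noisy, ref, atol=1e-5)
relaxed, relaxed_manual = compare(noisy, ref, atol=1e-3)
max_abs = (noisy - ref).abs().max().item()
same_argmax = noisy.argmax().item() == ref.argmax().item()

print(f"max abs diff: {max_abs:.2e}")
print(f"allclose atol=1e-5: {strict}")
print(f"allclose atol=1e-3: {relaxed}")
print(f"same argmax: {same_argmax}")
```

With these numbers the strict check fails only because of the small entry: at `-100.0` the relative term alone already allows about `1e-3` of difference. When a comparison fails, printing the max absolute difference and checking the argmax, as the cell above does, says more than the boolean alone.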

---

## Exercise 5: Generate and Compare `[Independent]`

You have a verified GPT-2 implementation. Now put it to use:

1. Generate text from your model using several prompts
2. Generate text from HuggingFace's pipeline for the same prompts
3. Compare the quality: your model should produce coherent, knowledgeable text

Write the generation code from scratch. Use your model's `.generate()` method with reasonable sampling parameters (temperature around 0.8, top_k around 50). For HuggingFace, use `transformers.pipeline("text-generation", model=model_hf, tokenizer="gpt2")` or generate manually. Note that the pipeline needs a HuggingFace tokenizer (or its name), not the tiktoken `enc` object.

Try at least these prompts:
- `"The meaning of life is"`
- `"The capital of France is"`
- `"Once upon a time in a land far away"`
- `"The transformer architecture consists of"`

Think about: How does this compare to the gibberish from the untrained model in Building nanoGPT? The architecture didn't change; the weights changed.
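Before choosing sampling parameters, it can be useful to see what temperature and top-k do to a distribution. The sketch below mirrors the filtering step in `generate()` on a five-token toy vocabulary; `filtered_probs` is a hypothetical helper and the logits are invented.

```python
import torch
import torch.nn.functional as F


def filtered_probs(logits, temperature=1.0, top_k=None):
    """Apply temperature, then optionally keep only the top_k logits."""
    logits = logits / temperature
    if top_k is not None:
        k = min(top_k, logits.size(-1))
        # Smallest logit that survives the cut, shaped (batch, 1) for broadcasting.
        kth = torch.topk(logits, k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    return F.softmax(logits, dim=-1)


logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -3.0]])

p_plain = filtered_probs(logits)
p_sharp = filtered_probs(logits, temperature=0.8, top_k=2)

print("plain softmax:       ", [round(p, 3) for p in p_plain[0].tolist()])
print("temp=0.8, top_k=2:   ", [round(p, 3) for p in p_sharp[0].tolist()])
```

Lower temperature sharpens the distribution and top-k zeroes out the tail, so the most likely token gains probability mass in the second line. With a 50,257-token vocabulary, that tail is where most of the incoherent continuations live.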

<details>
<summary>💡 Solution</summary>

The key here is straightforward use of your model's `generate()` method. The interesting part is seeing the quality difference: your code, OpenAI's knowledge, producing real coherent English. The architecture is the vessel; the weights are the knowledge.

```python
prompts = [
    "The meaning of life is",
    "The capital of France is",
    "Once upon a time in a land far away",
    "The transformer architecture consists of",
]

model_ours.eval()

print("=" * 60)
print("YOUR GPT-2 (your code + OpenAI's weights)")
print("=" * 60)
for prompt in prompts:
    tokens = enc.encode(prompt)
    idx = torch.tensor([tokens], device=device)
    generated = model_ours.generate(idx, max_new_tokens=50, temperature=0.8, top_k=50)
    text = enc.decode(generated[0].tolist())
    print(f"\nPrompt: {prompt}")
    print(f"Output: {text}")

print()
print("=" * 60)
print("HUGGINGFACE GPT-2 (reference implementation)")
print("=" * 60)
from transformers import pipeline as hf_pipeline

generator = hf_pipeline(
    "text-generation",
    model=model_hf,
    tokenizer="gpt2",
    device=device,
)
for prompt in prompts:
    result = generator(prompt, max_new_tokens=50, temperature=0.8, top_k=50, do_sample=True)
    print(f"\nPrompt: {prompt}")
    print(f"Output: {result[0]['generated_text']}")
```

Note: Because of sampling randomness, the two models won't produce the same text even with the same prompt. That's expected: the logits are the same (verified in Exercise 4), but different sampling implementations draw different random tokens. Both should be coherent English.

</details>

In [None]:
# YOUR CODE HERE
# Generate text from your model and from HuggingFace's pipeline.
# Compare the quality across multiple prompts.


**The full arc of Module 4.3:**

- **Random weights** → gibberish
- **Trained on Shakespeare** → recognizable English
- **Real GPT-2 weights** → coherent, knowledgeable text

Your code did not change. The weights changed. That is what pretraining buys.

The architecture is the vessel. The weights are the knowledge.

---

## Key Takeaways

1. **The weight mapping IS the verification.** Every shape match is a component verified. Every matching logit is a computation confirmed. If real weights work in your model, your implementation is correct.

2. **Conv1D vs nn.Linear: same parameters, different layout.** HuggingFace's GPT-2 uses Conv1D `(in_features, out_features)`. Your model uses nn.Linear `(out_features, in_features)`. The fix is `.t()`: one transpose per 2D weight in attention and FFN.

3. **Logit comparison is the gold standard.** Same input, same weights, same computation: if `torch.allclose` returns True, the models are functionally identical. Stronger than parameter counting (shapes). Stronger than text comparison (subjective).

4. **The architecture is the vessel. The weights are the knowledge.** Your code defines what the model can compute. The pretrained weights encode what it has learned about language. Same code, different weights, dramatically different behavior.

5. **Weight tying: one tensor, two names.** `wte.weight` and `lm_head.weight` point to the same memory. Loading it once updates both. This is not a shortcut; it is how the architecture works.
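The last takeaway can be checked on a model small enough to read at a glance. `TinyTied` below is a hypothetical two-layer stand-in that ties its weights the same way the `GPT` class does; it is a sketch, not part of the lesson's model.

```python
import torch
import torch.nn as nn


class TinyTied(nn.Module):
    """Hypothetical minimal model: an embedding tied to its output head."""

    def __init__(self, vocab=10, dim=4):
        super().__init__()
        self.wte = nn.Embedding(vocab, dim)
        self.lm_head = nn.Linear(dim, vocab, bias=False)
        self.wte.weight = self.lm_head.weight  # same tying pattern as GPT above


m = TinyTied()

sd_keys = list(m.state_dict().keys())     # names: the tied tensor appears twice
n_unique = sum(1 for _ in m.parameters())  # tensors: counted once

# Writing through one name is visible through the other.
with torch.no_grad():
    m.wte.weight.fill_(0.5)
head_follows = bool((m.lm_head.weight == 0.5).all())

print(f"state dict keys:   {sd_keys}")
print(f"unique parameters: {n_unique}")
print(f"lm_head sees the write to wte: {head_follows}")
```

Two names, one tensor: the state dict reports both names, `parameters()` reports the tensor once, and a write through either name shows up under the other. That is why the loading function can skip `lm_head.weight` entirely.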