
## Intro: Loading GPT‑2 Variants & What the Config Knobs Change

This notebook downloads OpenAI’s GPT‑2 weights and loads them into a compatible PyTorch model to generate text.  
You can switch among GPT‑2 sizes (Small/Medium/Large/XL) by changing the **model configuration** you pass to `GPTModel` and by downloading the matching **pretrained weights**.

Below is a quick guide to what each config field does, how it affects **downloading weights** and **text generation**, and a few practical tips.


In [19]:
from gpt_download import download_and_load_gpt2
from gpt2 import GPT_CONFIG_124M, GPTModel, text_to_token_ids, generate_t_k, token_ids_to_text
import torch, tiktoken
import numpy as np

settings, params = download_and_load_gpt2(
    model_size="124M", models_dir="gpt2"
)

print("Settings", settings)
print("Params keys",params.keys())

File already exists and is up-to-date: gpt2\124M\checkpoint
File already exists and is up-to-date: gpt2\124M\encoder.json
File already exists and is up-to-date: gpt2\124M\hparams.json
File already exists and is up-to-date: gpt2\124M\model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2\124M\model.ckpt.index
File already exists and is up-to-date: gpt2\124M\model.ckpt.meta
File already exists and is up-to-date: gpt2\124M\vocab.bpe
Settings {'n_vocab': 50257, 'n_ctx': 1024, 'n_embd': 768, 'n_head': 12, 'n_layer': 12}
Params keys dict_keys(['blocks', 'b', 'g', 'wpe', 'wte'])


Transfer the weights from the TensorFlow checkpoint into the PyTorch GPT implementation, including the Q, K, and V projections and their biases.
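The loader in the next cell splits the fused `c_attn` tensors into separate query, key, and value parts and transposes the weights. The toy example below sketches that step with NumPy only. The tiny `emb_dim` and the `fused_w` / `fused_b` arrays are made-up stand-ins for the real checkpoint tensors, and it assumes the target layers store weights as `(out_features, in_features)`, as `torch.nn.Linear` does.

```python
import numpy as np

emb_dim = 4  # toy width; GPT-2 small uses 768

# Stand-ins for params["blocks"][b]["attn"]["c_attn"]["w"] and ["b"] (hypothetical values).
fused_w = np.arange(emb_dim * 3 * emb_dim, dtype=np.float32).reshape(emb_dim, 3 * emb_dim)
fused_b = np.arange(3 * emb_dim, dtype=np.float32)

# Split along the last axis into three equal chunks: query, key, value.
q_w, k_w, v_w = np.split(fused_w, 3, axis=-1)
q_b, k_b, v_b = np.split(fused_b, 3, axis=-1)

# The checkpoint layout is (in, out); a Linear-style layer wants (out, in).
q_w_linear = q_w.T

# Sanity check: the separate query projection reproduces the first third
# of the fused projection's output.
x = np.ones((2, emb_dim), dtype=np.float32)
fused_out = x @ fused_w + fused_b
query_out = x @ q_w_linear.T + q_b
print(np.allclose(fused_out[:, :emb_dim], query_out))
```

Biases are one-dimensional, so they are split but not transposed.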

In [None]:
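def assign(left, right):
    # Guard against loading a checkpoint tensor into a differently shaped parameter.
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    # Wrap the NumPy array from the checkpoint as a PyTorch parameter.
    return torch.nn.Parameter(torch.tensor(right))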


def load_weights_into_gpt(gpt, params):
    gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params["wpe"])
    gpt.token_emb.weight = assign(gpt.token_emb.weight, params["wte"])

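    for b in range(len(params["blocks"])):
        # The checkpoint stores Q, K and V as one fused matrix; split it into three equal parts.
        # Weights are transposed (.T) because the checkpoint layout is (in, out).
        q_w, k_w, v_w = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1)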
        gpt.trf_blocks[b].att.W_query.weight = assign(
            gpt.trf_blocks[b].att.W_query.weight, q_w.T)
        gpt.trf_blocks[b].att.W_key.weight = assign(
            gpt.trf_blocks[b].att.W_key.weight, k_w.T)
        gpt.trf_blocks[b].att.W_value.weight = assign(
            gpt.trf_blocks[b].att.W_value.weight, v_w.T)

        q_b, k_b, v_b = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.bias = assign(
            gpt.trf_blocks[b].att.W_query.bias, q_b)
        gpt.trf_blocks[b].att.W_key.bias = assign(
            gpt.trf_blocks[b].att.W_key.bias, k_b)
        gpt.trf_blocks[b].att.W_value.bias = assign(
            gpt.trf_blocks[b].att.W_value.bias, v_b)

        gpt.trf_blocks[b].att.out_proj.weight = assign(
            gpt.trf_blocks[b].att.out_proj.weight,
            params["blocks"][b]["attn"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].att.out_proj.bias = assign(
            gpt.trf_blocks[b].att.out_proj.bias,
            params["blocks"][b]["attn"]["c_proj"]["b"])

        gpt.trf_blocks[b].ff.layers[0].weight = assign(
            gpt.trf_blocks[b].ff.layers[0].weight,
            params["blocks"][b]["mlp"]["c_fc"]["w"].T)
        gpt.trf_blocks[b].ff.layers[0].bias = assign(
            gpt.trf_blocks[b].ff.layers[0].bias,
            params["blocks"][b]["mlp"]["c_fc"]["b"])
        gpt.trf_blocks[b].ff.layers[2].weight = assign(
            gpt.trf_blocks[b].ff.layers[2].weight,
            params["blocks"][b]["mlp"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].ff.layers[2].bias = assign(
            gpt.trf_blocks[b].ff.layers[2].bias,
            params["blocks"][b]["mlp"]["c_proj"]["b"])

        gpt.trf_blocks[b].norm1.scale = assign(
            gpt.trf_blocks[b].norm1.scale,
            params["blocks"][b]["ln_1"]["g"])
        gpt.trf_blocks[b].norm1.shift = assign(
            gpt.trf_blocks[b].norm1.shift,
            params["blocks"][b]["ln_1"]["b"])
        gpt.trf_blocks[b].norm2.scale = assign(
            gpt.trf_blocks[b].norm2.scale,
            params["blocks"][b]["ln_2"]["g"])
        gpt.trf_blocks[b].norm2.shift = assign(
            gpt.trf_blocks[b].norm2.shift,
            params["blocks"][b]["ln_2"]["b"])

    gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])
    gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])
    # Weight tying: GPT-2 reuses the token embedding matrix as the output projection.
    gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])
    

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}


### The `model_configs` knobs (and what they impact)

- **`emb_dim` (hidden size)**
  - **What it is:** Width of token/hidden representations.
  - **Download impact:** Must match the width of the downloaded checkpoint. If it doesn’t, weight loading will fail due to shape mismatches.
  - **Generation impact:** Larger = typically better fluency/knowledge, but more GPU/CPU memory and slower inference.

- **`n_layers` (number of transformer blocks)**
  - **What it is:** Depth of the model (stacked transformer blocks).
  - **Download impact:** Must match the checkpoint’s depth; otherwise weights won’t align with your model’s layers.
  - **Generation impact:** Deeper = generally better quality/longer-range reasoning, but slower and more memory‑hungry.

- **`n_heads` (attention heads per layer)**
  - **What it is:** Parallel attention subspaces; must evenly divide `emb_dim`.
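  - **Download impact:** Must match the checkpoint. In the loader above, the Q/K/V matrices are `emb_dim × emb_dim` regardless of head count, so a wrong value that still divides `emb_dim` can load without a shape error and then produce incoherent output.
  - **Generation impact:** Heads split `emb_dim` into independent attention subspaces (64 dimensions per head in every GPT‑2 size listed above). With pretrained weights this value is fixed by the checkpoint and is not a tuning knob.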

- **`context_length` (a.k.a. sequence length / block size)**
  - **What it is:** Maximum tokens the model attends to at once.
  - **Download impact:** GPT‑2 checkpoints ship learned positional embeddings (`wpe`) for exactly 1024 positions (`n_ctx` in the settings above). Assuming `pos_emb` is sized by `context_length`, any other value makes the shape check in `assign` fail, so keep it at 1024 when loading these weights.
  - **Generation impact:** A longer sequence lets the model condition on more of the prompt, but it uses more memory, and attention cost grows quadratically with sequence length.

- **`qkv_bias` (bias terms in Q/K/V projections)**
  - **What it is:** Whether linear projections for Q/K/V include bias parameters.
  - **Download impact:** Must match the checkpoint's architecture. GPT‑2 checkpoints include Q/K/V biases (the `b` entry of `c_attn`), so set `qkv_bias=True`; with `False` the model has no bias parameters to receive them and loading fails.
  - **Generation impact:** Minor quality/speed effect compared to other knobs; mainly matters for weight compatibility.
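To make the effect of `emb_dim` and `n_layers` on model size concrete, the sketch below estimates the parameter count of each configuration. It assumes the standard GPT‑2 layout (biases on all linear layers, a 4× feed‑forward expansion, two layer norms per block, and an output head tied to the token embedding); `gpt2_param_count` is an illustrative helper, not part of the notebook's modules.

```python
def gpt2_param_count(emb_dim, n_layers, vocab_size=50257, context_length=1024):
    """Estimate GPT-2 parameters, counting the tied output head only once."""
    d = emb_dim
    embeddings = vocab_size * d + context_length * d  # token + positional tables
    attention = 4 * d * d + 4 * d      # Q, K, V and output projection, with biases
    feed_forward = 8 * d * d + 5 * d   # d -> 4d and 4d -> d, with biases
    layer_norms = 4 * d                # two norms per block, scale + shift each
    final_norm = 2 * d
    return embeddings + n_layers * (attention + feed_forward + layer_norms) + final_norm


model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

for name, cfg in model_configs.items():
    # n_heads must divide emb_dim evenly so each head gets a whole slice.
    assert cfg["emb_dim"] % cfg["n_heads"] == 0
    total = gpt2_param_count(cfg["emb_dim"], cfg["n_layers"])
    head_dim = cfg["emb_dim"] // cfg["n_heads"]
    print(f"{name}: {total:,} parameters, {head_dim} dims per head")
```

Note that `n_heads` does not appear in the count: it changes how the attention matrices are sliced, not how large they are.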


In [None]:
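# Must name the same size that was passed to download_and_load_gpt2 ("124M" above).
model_name = "gpt2-small (124M)"
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
# The pretrained weights need a 1024-token context and Q/K/V bias terms.
NEW_CONFIG.update({"context_length": 1024, "qkv_bias": True})

print("NEW_CONFIG", NEW_CONFIG)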
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

gpt = GPTModel(NEW_CONFIG)
gpt.eval()

load_weights_into_gpt(gpt,params)
gpt.to(device)
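Because `model_name` and the downloaded `model_size` are chosen in separate cells, it is easy to pair a config with the wrong checkpoint. A small pre‑flight check like the one below could be run before `load_weights_into_gpt` to get a readable message in place of a shape error deep inside the loader. `find_config_mismatches` is a hypothetical helper; the key names on the left follow `model_configs`, and those on the right follow the `settings` dictionary printed earlier.

```python
def find_config_mismatches(config, settings):
    """List differences between a model config and the checkpoint's hparams."""
    key_pairs = {
        "emb_dim": "n_embd",
        "n_layers": "n_layer",
        "n_heads": "n_head",
        "context_length": "n_ctx",
    }
    problems = []
    for cfg_key, hparam_key in key_pairs.items():
        cfg_value = config.get(cfg_key)
        ckpt_value = settings.get(hparam_key)
        if cfg_value != ckpt_value:
            problems.append(
                f"{cfg_key}={cfg_value} but checkpoint has {hparam_key}={ckpt_value}"
            )
    return problems


# Values copied from the settings output shown earlier in this notebook.
settings_124m = {"n_vocab": 50257, "n_ctx": 1024, "n_embd": 768, "n_head": 12, "n_layer": 12}
small_config = {"emb_dim": 768, "n_layers": 12, "n_heads": 12, "context_length": 1024}
medium_config = {"emb_dim": 1024, "n_layers": 24, "n_heads": 16, "context_length": 1024}

print(find_config_mismatches(small_config, settings_124m))   # nothing to report
for problem in find_config_mismatches(medium_config, settings_124m):
    print("Mismatch:", problem)
```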


### Practical trade‑offs

- **Quality vs. speed/memory**
  - **Small (124M)**: Fastest, least memory, good for quick tests.
  - **Medium/Large/XL**: Better generations, but progressively heavier and slower.
- **Context window**
  - Keep `context_length = 1024`. The checkpoint's positional embeddings cover only 1024 positions, so a different value will not match the pretrained weights at load time, and GPT‑2 was never trained on longer sequences.
- **Sampling controls**
  - `temperature` and `top_k` shape the creativity and diversity of outputs:
    - Higher `temperature` → more diverse/creative (riskier).
    - Lower `top_k` → safer/more focused; higher → more variety.
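The interaction between `top_k` and `temperature` can be sketched in plain Python. This is a minimal illustration of the general technique and only an assumption about what `generate_t_k` does internally; `top_k_temperature_probs` and the toy logits are made up for the example.

```python
import math
import random


def top_k_temperature_probs(logits, top_k=None, temperature=1.0):
    """Turn raw logits into sampling probabilities (illustrative helper)."""
    if top_k is not None:
        # Keep the k largest logits and mask the rest with -inf.
        # Ties with the k-th value are kept, so more than k may survive.
        kth_largest = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= kth_largest else float("-inf") for x in logits]
    # Temperature > 1 flattens the distribution; < 1 sharpens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, -1.0]
focused = top_k_temperature_probs(logits, top_k=2, temperature=0.7)
diverse = top_k_temperature_probs(logits, top_k=2, temperature=1.5)
print("focused:", [round(p, 3) for p in focused])
print("diverse:", [round(p, 3) for p in diverse])

# Draw the next token index from the filtered distribution.
rng = random.Random(0)
next_token = rng.choices(range(len(logits)), weights=diverse, k=1)[0]
```

With `top_k=2`, the two lowest‑scoring tokens get probability zero at any temperature; the temperature only redistributes mass between the survivors.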



In [35]:
# torch.manual_seed(123)  # uncomment for reproducible sampling
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_t_k(
    model=gpt,
    idx=text_to_token_ids("Il ragazzo è sparito",tokenizer=tokenizer).to(device),
    max_new_tokens=100,
    context_size=NEW_CONFIG["context_length"],
    top_k=10,
    temperature=1.5
)

print("Generated text gpt2 style: \n\n", token_ids_to_text(token_ids=token_ids,tokenizer=tokenizer))

Generated text gpt2 style: 

 El Nino es desaparecido es el niente, en suo noche, con una hacienda, un poblaciones y que esta que es están de sus suis.<|endoftext|>The first thing you'll notice when you see this image, which was released just after the end of the 2016 season, was this massive, red blob of red, which is actually a blue jellyfish. It appears to float like it did on an orange ocean liner as we were filming the opening sequences. And
