
## Intro: Loading GPT‑2 Variants & What the Config Knobs Change

This notebook downloads OpenAI’s GPT‑2 weights and loads them into a compatible PyTorch model to generate text.  
You can switch among GPT‑2 sizes (Small/Medium/Large/XL) by changing the **model configuration** you pass to `GPTModel` and by downloading the matching **pretrained weights**.

Below is a quick guide to what each config field does, how it affects **downloading weights** and **text generation**, and a few practical tips.


In [1]:
from gpt_download import download_and_load_gpt2
from gpt2 import GPT_CONFIG_124M, GPTModel, text_to_token_ids, generate_t_k, token_ids_to_text
import torch, tiktoken
import numpy as np

settings, params = download_and_load_gpt2(
    model_size="124M", models_dir="gpt2"
)

print("Settings", settings)
print("Params keys",params.keys())

2026-01-28 13:24:52.865139: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-01-28 13:24:52.911148: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-01-28 13:24:54.078871: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


File already exists and is up-to-date: gpt2/774M/checkpoint
File already exists and is up-to-date: gpt2/774M/encoder.json
File already exists and is up-to-date: gpt2/774M/hparams.json
File already exists and is up-to-date: gpt2/774M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/774M/model.ckpt.index
File already exists and is up-to-date: gpt2/774M/model.ckpt.meta
File already exists and is up-to-date: gpt2/774M/vocab.bpe
Settings {'n_vocab': 50257, 'n_ctx': 1024, 'n_embd': 1280, 'n_head': 20, 'n_layer': 36}
Params keys dict_keys(['blocks', 'b', 'g', 'wpe', 'wte'])


Transfer from Tensorflow to GPT implementation Q, K and V including Bias 

In [2]:
def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))


def load_weights_into_gpt(gpt, params):
    gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params["wpe"])
    gpt.token_emb.weight = assign(gpt.token_emb.weight, params["wte"])

    for b in range(len(params["blocks"])):
        q_w, k_w, v_w = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.weight = assign(
            gpt.trf_blocks[b].att.W_query.weight, q_w.T)
        gpt.trf_blocks[b].att.W_key.weight = assign(
            gpt.trf_blocks[b].att.W_key.weight, k_w.T)
        gpt.trf_blocks[b].att.W_value.weight = assign(
            gpt.trf_blocks[b].att.W_value.weight, v_w.T)

        q_b, k_b, v_b = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.bias = assign(
            gpt.trf_blocks[b].att.W_query.bias, q_b)
        gpt.trf_blocks[b].att.W_key.bias = assign(
            gpt.trf_blocks[b].att.W_key.bias, k_b)
        gpt.trf_blocks[b].att.W_value.bias = assign(
            gpt.trf_blocks[b].att.W_value.bias, v_b)

        gpt.trf_blocks[b].att.out_proj.weight = assign(
            gpt.trf_blocks[b].att.out_proj.weight,
            params["blocks"][b]["attn"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].att.out_proj.bias = assign(
            gpt.trf_blocks[b].att.out_proj.bias,
            params["blocks"][b]["attn"]["c_proj"]["b"])

        gpt.trf_blocks[b].ff.layers[0].weight = assign(
            gpt.trf_blocks[b].ff.layers[0].weight,
            params["blocks"][b]["mlp"]["c_fc"]["w"].T)
        gpt.trf_blocks[b].ff.layers[0].bias = assign(
            gpt.trf_blocks[b].ff.layers[0].bias,
            params["blocks"][b]["mlp"]["c_fc"]["b"])
        gpt.trf_blocks[b].ff.layers[2].weight = assign(
            gpt.trf_blocks[b].ff.layers[2].weight,
            params["blocks"][b]["mlp"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].ff.layers[2].bias = assign(
            gpt.trf_blocks[b].ff.layers[2].bias,
            params["blocks"][b]["mlp"]["c_proj"]["b"])

        gpt.trf_blocks[b].norm1.scale = assign(
            gpt.trf_blocks[b].norm1.scale,
            params["blocks"][b]["ln_1"]["g"])
        gpt.trf_blocks[b].norm1.shift = assign(
            gpt.trf_blocks[b].norm1.shift,
            params["blocks"][b]["ln_1"]["b"])
        gpt.trf_blocks[b].norm2.scale = assign(
            gpt.trf_blocks[b].norm2.scale,
            params["blocks"][b]["ln_2"]["g"])
        gpt.trf_blocks[b].norm2.shift = assign(
            gpt.trf_blocks[b].norm2.shift,
            params["blocks"][b]["ln_2"]["b"])

    gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])
    gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])
    gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])
    

model_configs = {
    "gpt2-small (124M)": { "emb_dim" : 768, "n_layers" : 12, "n_heads" : 12},
    "gpt2-medium (355M)" : {"emb_dim" : 1024, "n_layers" : 24, "n_heads" : 16},
    "gpt2-large (774M)" : {"emb_dim" : 1280, "n_layers" : 36, "n_heads" : 20},
    "gpt2-xl (1558M)" : {"emb_dim" : 1600, "n_layers" : 48, "n_heads" : 25}
}


### The `model_configs` knobs (and what they impact)

- **`emb_dim` (hidden size)**
  - **What it is:** Width of token/hidden representations.
  - **Download impact:** Must match the width of the downloaded checkpoint. If it doesn’t, weight loading will fail due to shape mismatches.
  - **Generation impact:** Larger = typically better fluency/knowledge, but more GPU/CPU memory and slower inference.

- **`n_layers` (number of transformer blocks)**
  - **What it is:** Depth of the model (stacked transformer blocks).
  - **Download impact:** Must match the checkpoint’s depth; otherwise weights won’t align with your model’s layers.
  - **Generation impact:** Deeper = generally better quality/longer-range reasoning, but slower and more memory‑hungry.

- **`n_heads` (attention heads per layer)**
  - **What it is:** Parallel attention subspaces; must evenly divide `emb_dim`.
  - **Download impact:** Must match the checkpoint; mismatches cause shape errors when splitting/merging Q/K/V projections.
  - **Generation impact:** More heads (with matching `emb_dim`) improve attention expressivity; compute cost grows accordingly.

- **`context_length` (a.k.a. sequence length / block size)**
  - **What it is:** Maximum tokens the model attends to at once.
  - **Download impact:** Checkpoints are trained for a certain context window (GPT‑2 was trained for 1024). You can **set a larger number**, but the weights aren’t trained for it—generation beyond the trained window may degrade.
  - **Generation impact:** Higher = can consider longer prompts but uses more memory and can slow down attention quadratically with sequence length.

- **`qkv_bias` (bias terms in Q/K/V projections)**
  - **What it is:** Whether linear projections for Q/K/V include bias parameters.
  - **Download impact:** Must match the original architecture of the checkpoint. If the checkpoint has no bias but your model expects it (or vice versa), shapes won’t match.
  - **Generation impact:** Minor quality/speed effect compared to other knobs; mainly matters for weight compatibility.


In [3]:
model_name = "gpt2-large (774M)"
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
NEW_CONFIG.update({"context_length":1024})
NEW_CONFIG.update({"qkv_bias":True})

print("NEW_CONFIG",NEW_CONFIG)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

gpt = GPTModel(NEW_CONFIG)
gpt.eval()

load_weights_into_gpt(gpt,params)
gpt.to(device)

NEW_CONFIG {'vocab_size': 50257, 'context_length': 1024, 'emb_dim': 1280, 'n_heads': 20, 'n_layers': 36, 'drop_rate': 0.1, 'qkv_bias': True}


GPTModel(
  (token_emb): Embedding(50257, 1280)
  (pos_emb): Embedding(1024, 1280)
  (drop_emb): Dropout(p=0.1, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=1280, out_features=1280, bias=True)
        (W_key): Linear(in_features=1280, out_features=1280, bias=True)
        (W_value): Linear(in_features=1280, out_features=1280, bias=True)
        (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (ff): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=1280, out_features=5120, bias=True)
          (1): GELU()
          (2): Linear(in_features=5120, out_features=1280, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.1, inplace=False)
    )
    (1): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear


### Practical trade‑offs

- **Quality vs. speed/memory**
  - **Small (124M)**: Fastest, least memory, good for quick tests.
  - **Medium/Large/XL**: Better generations, but progressively heavier and slower.
- **Context window**
  - Keep `context_length = 1024` for faithful GPT‑2 behavior. Larger values are possible but not trained, may degrade beyond 1024 tokens and will increase compute.
- **Sampling controls**
  - `temperature` and `top_k` shape the creativity and diversity of outputs:
    - Higher `temperature` → more diverse/creative (riskier).
    - Lower `top_k` → safer/more focused; higher → more variety.



In [4]:
#torch.manual_seed(123)
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_t_k(
    model=gpt,
    idx=text_to_token_ids("### Instruction:\n Convert 5 cm to meters.",tokenizer=tokenizer).to(device),
    max_new_tokens=50,
    context_size=NEW_CONFIG["context_length"],
    top_k=3,
    temperature=1
)

print("Generated text gpt2 style: \n\n", token_ids_to_text(token_ids=token_ids,tokenizer=tokenizer))

Generated text gpt2 style: 

 ### Instruction:
 Convert 5 cm to meters.
The conversion is done using the following formula:
M = (5/4) x (5/3) x (5/2)
Where M is the length of the string, 5 cm is the string's diameter, and 4


Let's prepare a dataset for fine tuning the model for instructions in Aplaca promt style:

In [5]:
import json, os, urllib
import urllib.request

def download_and_load_finet(file_path,url):
    if (not os.path.exists(file_path)):
        with urllib.request.urlopen(url,timeout=1000) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)
            
    with open(file_path, "r") as file:
        data = json.load(file)
    return data


file_path = "gpt2/instructions-data.json"

url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch07/01_main-chapter-code/instruction-data.json"

data = download_and_load_finet(file_path=file_path,url=url)
print ("number of entries:", len(data))

print("Example data\n",data[25])

def format_alpa_input(entry):
    instruction_text = (
        f"Below is an instruction that describe a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    
    input_text = (
        f"\n\n### Input:\n{entry['input']}" if entry['input'] else ""
    )
    
    return instruction_text + input_text

## Now test the completion of input and output with
index = 22
desired_response = f"\n\n### Response:\n{data[index]['output']}"
print("\n" + format_alpa_input(data[index]) + desired_response)


number of entries: 1100
Example data
 {'instruction': "What is the plural form of 'mouse'?", 'input': '', 'output': "The plural form of 'mouse' is 'mice'."}

Below is an instruction that describe a task. Write a response that appropriately completes the request.

### Instruction:
Rewrite the sentence using an idiom.

### Input:
The storm started suddenly.

### Response:
The storm came out of the blue.


Split the dataset in train, test and validate portion

In [6]:
train_p = int(len(data) * .85)
test_p = int (len(data) * .1)
val_p = len(data)- train_p - test_p

train_d = data[:train_p]
test_d = data[train_p:train_p+test_p]
val_d = data[train_p + test_p:]

print("Training set lenght: ", len(train_d))
print("Test set lenght: ", len(test_d))
print("Validation set lenght: ", len(val_d))



Training set lenght:  935
Test set lenght:  110
Validation set lenght:  55


To iport the data set it's required to prepare data to be tokenized and well batched having same shape for train and target. So considering that each entry is having different length we need to override the standard collate function of Dataloader:

In [7]:
import torch
from torch.utils.data import Dataset, DataLoader

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.encoded_text = []
        for entry in data:
            instruction_plus_input = format_alpa_input(entry)
            response_text = f"\n\n### Response:\n{data[index]['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_text.append(tokenizer.encode(full_text))
            
    def __getitem__ (self,index):
        return self.encoded_text[index]
    
    def __len__(self):
        return (len(self.data))
    
def custom_collate_fn(
    batch,                #already tokenized batch to pad and prepare for target and input
    pad_token_id = 50256, #last token of gpt2 <|end-of-text|>
    ignore_index = -100,  #index ignored by code in cross entropy for loss calculation
    allowed_max_length = None,
    device = "cpu"
):
    batch_max_length = max (len(item) + 1 for item in batch) #calculate the max length of all the entry in the batch
    input_lst, target_lst = [], []  # this will be he returned tensors output
    
    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id] # addin terminator string
        
        padded = (
            new_item + [pad_token_id] * ( batch_max_length - len(new_item))
        ) # and padding the 
        
        inputs = torch.tensor(padded[:-1]) # truncate the last token for input
        targets = torch.tensor(padded[1:]) # shift + 1 the right for targets
        
        # masking the target eos with -100
        mask = targets == pad_token_id              # creating a mask of index to assing -100 to all eos
        indices = torch.nonzero(mask).squeeze()     # getting  out the index of wethere there is a need to mask
        if indices.numel() > 1 :
            targets[indices[1:]] = ignore_index     # assigning -100 to the index that's having padding
            
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]
            
        input_lst.append(inputs)
        target_lst.append(targets)
    
    inputs_tensor = torch.stack(input_lst).to(device)
    targets_tensor = torch.stack(target_lst).to(device)
    
    return inputs_tensor, targets_tensor



Working now on the Dataloader to build the dataset required for the finetuning


In [8]:
num_workers = 4 
batch_size = 4

torch.manual_seed(123)

train_dataset = InstructionDataset(train_d,tokenizer=tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn = custom_collate_fn,
    shuffle = True,
    drop_last = True,
    num_workers=num_workers
)

test_dataset = InstructionDataset(test_d,tokenizer=tokenizer)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    collate_fn = custom_collate_fn,
    shuffle = True,
    drop_last = True,
    num_workers=num_workers
)

val_dataset = InstructionDataset(val_d,tokenizer=tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    collate_fn = custom_collate_fn,
    shuffle = True,
    drop_last = True,
    num_workers=num_workers
)

# print the data to confirm that input and target batches have always the same length

for inputs,targets in train_loader:
    print(inputs.shape,targets.shape)

torch.Size([4, 64]) torch.Size([4, 64])
torch.Size([4, 64]) torch.Size([4, 64])
torch.Size([4, 52]) torch.Size([4, 52])
torch.Size([4, 59]) torch.Size([4, 59])
torch.Size([4, 64]) torch.Size([4, 64])
torch.Size([4, 49]) torch.Size([4, 49])
torch.Size([4, 65]) torch.Size([4, 65])
torch.Size([4, 63]) torch.Size([4, 63])
torch.Size([4, 60]) torch.Size([4, 60])
torch.Size([4, 55]) torch.Size([4, 55])
torch.Size([4, 58]) torch.Size([4, 58])
torch.Size([4, 61]) torch.Size([4, 61])
torch.Size([4, 65]) torch.Size([4, 65])
torch.Size([4, 66]) torch.Size([4, 66])
torch.Size([4, 65]) torch.Size([4, 65])
torch.Size([4, 63]) torch.Size([4, 63])
torch.Size([4, 60]) torch.Size([4, 60])
torch.Size([4, 64]) torch.Size([4, 64])
torch.Size([4, 66]) torch.Size([4, 66])
torch.Size([4, 68]) torch.Size([4, 68])
torch.Size([4, 50]) torch.Size([4, 50])
torch.Size([4, 57]) torch.Size([4, 57])
torch.Size([4, 64]) torch.Size([4, 64])
torch.Size([4, 65]) torch.Size([4, 65])
torch.Size([4, 61]) torch.Size([4, 61])


time now to fine tune the model with the provided datase but before we want to get an estimation of the current loss

In [9]:
from gpt2 import calc_loss_dataloaders, train_model_simple

gpt.to(device)

torch.manual_seed(123)
with torch.no_grad():
    train_loss = calc_loss_dataloaders(
        data_loader=train_loader,
        model=gpt,
        device=device,
        num_batches=5
    )
    
    val_loss = calc_loss_dataloaders(
        data_loader=val_loader,
        model=gpt,
        device=device,
        num_batches=5
    )

print ("Training loss prior instruct: ", train_loss)
print ("Validation loss prior instruct: ", val_loss)

Training loss prior instruct:  4.092750024795532
Validation loss prior instruct:  4.067868328094482


Our goal is to minimize the loss let's try to do it and to measure the execution time

In [10]:
import time


start_time = time.time()
torch.manual_seed(123)

optimizer = torch.optim.AdamW(
    gpt.parameters(),
    lr=0.00005,
    weight_decay=0.1
)

num_epocs = 2

train_losses, val_losses, tokens_seen = train_model_simple(
    gpt, train_loader, val_loader, optimizer, device,
    num_epochs=num_epocs, eval_freq=5, eval_iter=5,
    start_context=format_alpa_input(val_d[0]),tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60

print(f"Training completed in {execution_time_minutes:.2f} minutes.")

OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB. GPU 0 has a total capacity of 15.64 GiB of which 2.88 MiB is free. Including non-PyTorch memory, this process has 15.62 GiB memory in use. Of the allocated memory 15.02 GiB is allocated by PyTorch, and 539.01 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)