# GPT in 60 Lines of NumPy

**2024/08/12, 2025/03/06, 2025/03/26**

* Source Code: https://github.com/jaymody/picoGPT/blob/main/README.md
* Requirements: https://github.com/jaymody/picoGPT/blob/main/requirements.txt
* Tutorial: https://jaykmody.com/blog/gpt-from-scratch/
* Other sources:
    * **Jay Alammar**: https://jalammar.github.io/illustrated-gpt2/
    * **OpenAI** gpt-2 implementation: https://github.com/openai/gpt-2/
    * **Academic Paper**: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf

In [1]:
import tensorflow as tf
print(tf.__version__)

2.10.1


## What is a GPT?

GPT stands for **Generative Pre-trained Transformer**. It's a type of neural network architecture based on the Transformer. 

* **Generative**: A GPT generates text.
* **Pre-trained**: A GPT is trained on lots of text from books, the internet, etc.
* **Transformer**: A GPT is a decoder-only transformer neural network.

Fundamentally, a GPT **generates text** given a **prompt**. Even with this very simple API (input = text, output = text), a well-trained GPT can do some pretty awesome stuff, like write your emails, summarize a book, give you Instagram caption ideas, explain black holes to a 5-year-old, code in SQL, and even write your will.

## Where to download the model from?

* From the original <mark>openai/gpt-2 github!</mark>
* First, read: https://github.com/openai/gpt-2/blob/master/README.md
* Second, download from: https://github.com/openai/gpt-2/blob/master/download_model.py


****

# Tutorial (30 January 2023)

https://jaykmody.com/blog/gpt-from-scratch/

## Input / Output

<font color="red">The function signature for a GPT looks roughly like this:</font>

In [2]:
def gpt(inputs: list[int]) -> list[list[float]]:
    # inputs has shape [n_seq]
    # output has shape [n_seq, n_vocab]
    # neural network magic happens here
    
    output = 0 
    return output

### Input

The input is some text represented by a sequence of integers that map to tokens in the text:

In [3]:
# integers represent tokens in our text, for example:
# text   = "not all heroes wear capes"
# tokens = "not"  "all" "heroes" "wear" "capes"
inputs =   [1,     0,    2,      4,     6]

Tokens are sub-pieces of the text, which are produced using some kind of tokenizer. We can map tokens to integers using a vocabulary:

In [4]:
class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        self.token_to_id = {token: i for i, token in enumerate(vocab)}

    def encode(self, text):
        tokens = text.split()  # Tokenize on whitespace
        return [self.token_to_id[token] for token in tokens if token in self.token_to_id]

    def decode(self, ids):
        tokens = [self.vocab[i] for i in ids]  # Convert IDs back to tokens
        return " ".join(tokens)  # Join tokens with a space to form the original text

In [5]:
# the index of a token in the vocab represents the integer id for that token
# i.e. the integer id for "heroes" would be 2, since vocab[2] = "heroes"
vocab = ["all", "not", "heroes", "the", "wear", ".", "capes"]

# a pretend tokenizer that tokenizes on whitespace
tokenizer = WhitespaceTokenizer(vocab)

# the encode() method converts a str -> list[int]
ids = tokenizer.encode("not all heroes wear") # ids = [1, 0, 2, 4]
print(ids)

# we can see what the actual tokens are via our vocab mapping
tokens = [tokenizer.vocab[i] for i in ids] # tokens = ["not", "all", "heroes", "wear"]
print(tokens)

# the decode() method converts back a list[int] -> str
text = tokenizer.decode(ids) # text = "not all heroes wear"
print(text)

[1, 0, 2, 4]
['not', 'all', 'heroes', 'wear']
not all heroes wear


### Output

The output is a 2D array, where `output[i][j]` is the model's predicted probability that the token at `vocab[j]` is the next token `inputs[i+1]`. For example:

In [6]:
vocab = ["all", "not", "heroes", "the", "wear", ".", "capes"]

text_inputs = ["not", "all",  "heroes", "wear"]
inputs = [1, 0, 2, 4] # "not" "all" "heroes" "wear"
output = gpt(inputs)

# some hypothetical output results
# =================================

#              ["all", "not", "heroes", "the", "wear", ".", "capes"]
# output[0] =  [0.75    0.1     0.0       0.15    0.0   0.0    0.0  ]
# given just "not", the model predicts the word "all" with the highest probability

#              ["all", "not", "heroes", "the", "wear", ".", "capes"]
# output[1] =  [0.0     0.0      0.8     0.1    0.0    0.0   0.1  ]
# given the sequence ["not", "all"], the model predicts the word "heroes" with the highest probability

#              ["all", "not", "heroes", "the", "wear", ".", "capes"]
# output[-1] = [0.0     0.0     0.0     0.1     0.0    0.05  0.85  ]
# given the whole sequence ["not", "all", "heroes", "wear"], the model predicts the word "capes" with the highest probability

To get a **next token prediction** for the whole sequence, we simply take the token with the highest probability in `output[-1]`.

Taking the token with the highest probability as our prediction is known as `greedy decoding` or `greedy sampling`.
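
As a minimal sketch of greedy decoding, the snippet below takes the hypothetical `output[-1]` row from the example above and picks its most likely token. The probabilities are hand-written for illustration, not produced by a model, and the names `last_row`, `next_token_id`, and `next_token` are introduced here.

```python
import numpy as np

vocab = ["all", "not", "heroes", "the", "wear", ".", "capes"]

# hand-written stand-in for output[-1] (the hypothetical row shown earlier)
last_row = np.array([0.0, 0.0, 0.0, 0.1, 0.0, 0.05, 0.85])

# greedy decoding: pick the index with the highest probability
next_token_id = int(np.argmax(last_row))
next_token = vocab[next_token_id]
print(next_token_id, next_token)
```

With a real model, `last_row` would simply be the last row of whatever `gpt(inputs)` returns.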

## Generating Text

### Autoregressive

We can generate full sentences by iteratively getting the next token prediction from our model. At each iteration, we append the predicted token back into the input.
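
The loop can be sketched as follows. Since the `gpt` stub above does not return real predictions, this sketch uses a hypothetical stand-in, `toy_gpt`, which always puts all probability on the token id that follows the current one in the vocabulary. Only the shape of the decode loop is the point here; `toy_gpt` and `generate_greedy` are illustrative names.

```python
import numpy as np

vocab = ["all", "not", "heroes", "the", "wear", ".", "capes"]

def toy_gpt(inputs):
    # hypothetical stand-in for a trained model, returns shape [n_seq, n_vocab]
    # row i puts probability 1.0 on the token id (inputs[i] + 1) % n_vocab
    n_vocab = len(vocab)
    output = np.zeros((len(inputs), n_vocab))
    for i, token_id in enumerate(inputs):
        output[i, (token_id + 1) % n_vocab] = 1.0
    return output

def generate_greedy(inputs, n_tokens_to_generate):
    inputs = list(inputs)  # copy so the caller's list is not modified
    for _ in range(n_tokens_to_generate):  # auto-regressive decode loop
        output = toy_gpt(inputs)  # model forward pass
        next_id = int(np.argmax(output[-1]))  # greedy choice for the last position
        inputs.append(next_id)  # feed the prediction back in as input
    return inputs[len(inputs) - n_tokens_to_generate:]  # only the new ids

generated = generate_greedy([1, 0, 2, 4], 3)
print(generated, [vocab[i] for i in generated])
```

Each pass re-runs the model on the whole sequence so far, which is why generation gets slower as the sequence grows.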

### Sampling

We can introduce some stochasticity (randomness) to our generations by sampling from the probability distribution instead of being greedy.
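
A minimal sketch of sampling, assuming the same hand-written probabilities used for `output[-1]` earlier (they are illustrative, not model output). Instead of `np.argmax`, we draw token ids at random in proportion to their probabilities. The fixed seed is only there to make the run repeatable.

```python
import numpy as np

vocab = ["all", "not", "heroes", "the", "wear", ".", "capes"]

# hand-written next-token distribution (stand-in for output[-1])
probs = np.array([0.0, 0.0, 0.0, 0.1, 0.0, 0.05, 0.85])

rng = np.random.default_rng(seed=0)

# draw a single next token id according to probs
next_token_id = int(rng.choice(len(vocab), p=probs))
print(vocab[next_token_id])

# draw many samples to see that the frequencies follow the distribution
samples = rng.choice(len(vocab), size=1000, p=probs)
counts = np.bincount(samples, minlength=len(vocab))
print(dict(zip(vocab, counts.tolist())))
```

Most draws return "capes", but "the" and "." also appear occasionally, so repeated generations from the same prompt can differ.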

### Training

We train a GPT like any other neural network, using **gradient descent** with respect to some **loss function**. In the case of a GPT, we take the **cross entropy loss over the language modeling task**.
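
To make the loss concrete, here is a minimal sketch under stated assumptions: the targets are the input tokens shifted by one position, and the model output is a matrix of probabilities. The `probs` array is hand-written for illustration, and `lm_loss` is a name introduced here. No gradient step is shown.

```python
import numpy as np

def lm_loss(probs, targets):
    # probs: [n_seq, n_vocab] predicted probabilities, targets: [n_seq] true next-token ids
    # pick the probability assigned to the correct next token at every position
    picked = probs[np.arange(len(targets)), targets]
    # cross entropy: mean negative log-likelihood of the correct tokens
    return float(np.mean(-np.log(picked)))

token_ids = [1, 0, 2, 4, 6]  # "not all heroes wear capes"
x, y = token_ids[:-1], token_ids[1:]  # inputs and their shifted-by-one labels

# hand-written stand-in for gpt(x), one row per input position
probs = np.array([
    [0.75, 0.1, 0.0, 0.15, 0.0, 0.0, 0.0],   # after "not"
    [0.0, 0.0, 0.8, 0.1, 0.0, 0.0, 0.1],     # after "not all"
    [0.05, 0.0, 0.05, 0.1, 0.7, 0.1, 0.0],   # after "not all heroes"
    [0.0, 0.0, 0.0, 0.1, 0.0, 0.05, 0.85],   # after "not all heroes wear"
])

loss = lm_loss(probs, y)
print(loss)
```

The loss shrinks toward 0 as the model assigns more probability to the correct next token at every position, which is what gradient descent pushes the parameters toward.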

****

## GPT-2 code

* Downloaded from: https://github.com/jaymody/picoGPT/blob/main/gpt2.py

In [7]:
import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


def layer_norm(x, g, b, eps: float = 1e-5):
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    x = (x - mean) / np.sqrt(variance + eps)  # normalize x to have mean=0 and var=1 over last axis
    return g * x + b  # scale and offset with gamma/beta params


def linear(x, w, b):  # [m, in], [in, out], [out] -> [m, out]
    return x @ w + b


def ffn(x, c_fc, c_proj):  # [n_seq, n_embd] -> [n_seq, n_embd]
    # project up
    a = gelu(linear(x, **c_fc))  # [n_seq, n_embd] -> [n_seq, 4*n_embd]

    # project back down
    x = linear(a, **c_proj)  # [n_seq, 4*n_embd] -> [n_seq, n_embd]

    return x


def attention(q, k, v, mask):  # [n_q, d_k], [n_k, d_k], [n_k, d_v], [n_q, n_k] -> [n_q, d_v]
    return softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v


def mha(x, c_attn, c_proj, n_head):  # [n_seq, n_embd] -> [n_seq, n_embd]
    # qkv projection
    x = linear(x, **c_attn)  # [n_seq, n_embd] -> [n_seq, 3*n_embd]

    # split into qkv
    qkv = np.split(x, 3, axis=-1)  # [n_seq, 3*n_embd] -> [3, n_seq, n_embd]

    # split into heads
    qkv_heads = list(map(lambda x: np.split(x, n_head, axis=-1), qkv))  # [3, n_seq, n_embd] -> [3, n_head, n_seq, n_embd/n_head]

    # causal mask to hide future inputs from being attended to
    causal_mask = (1 - np.tri(x.shape[0], dtype=x.dtype)) * -1e10  # [n_seq, n_seq]

    # perform attention over each head
    out_heads = [attention(q, k, v, causal_mask) for q, k, v in zip(*qkv_heads)]  # [3, n_head, n_seq, n_embd/n_head] -> [n_head, n_seq, n_embd/n_head]

    # merge heads
    x = np.hstack(out_heads)  # [n_head, n_seq, n_embd/n_head] -> [n_seq, n_embd]

    # out projection
    x = linear(x, **c_proj)  # [n_seq, n_embd] -> [n_seq, n_embd]

    return x


def transformer_block(x, mlp, attn, ln_1, ln_2, n_head):  # [n_seq, n_embd] -> [n_seq, n_embd]
    # multi-head causal self attention
    x = x + mha(layer_norm(x, **ln_1), **attn, n_head=n_head)  # [n_seq, n_embd] -> [n_seq, n_embd]

    # position-wise feed forward network
    x = x + ffn(layer_norm(x, **ln_2), **mlp)  # [n_seq, n_embd] -> [n_seq, n_embd]

    return x


def gpt2(inputs, wte, wpe, blocks, ln_f, n_head):  # [n_seq] -> [n_seq, n_vocab]
    # token(wte) + positional embeddings(wpe)
    x = wte[inputs] + wpe[range(len(inputs))]  # [n_seq] -> [n_seq, n_embd]

    # forward pass through n_layer transformer blocks
    for block in blocks:
        x = transformer_block(x, **block, n_head=n_head)  # [n_seq, n_embd] -> [n_seq, n_embd]

    # projection to vocab
    x = layer_norm(x, **ln_f)  # [n_seq, n_embd] -> [n_seq, n_embd]
    return x @ wte.T  # [n_seq, n_embd] -> [n_seq, n_vocab]


def generate(inputs, params, n_head, n_tokens_to_generate):
    from tqdm import tqdm

    for _ in tqdm(range(n_tokens_to_generate), "generating"):  # auto-regressive decode loop
        logits = gpt2(inputs, **params, n_head=n_head)  # model forward pass
        next_id = np.argmax(logits[-1])  # greedy sampling
        inputs.append(int(next_id))  # append prediction to input

    return inputs[len(inputs) - n_tokens_to_generate :]  # only return generated ids
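
To see what the causal mask in `mha` does, the sketch below builds the same mask for a short sequence and applies it to uniform attention scores. The uniform scores are an assumption for illustration (in the real model they come from `q @ k.T`), and `softmax` is copied from the code above so that the snippet runs on its own.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

n_seq = 4

# 0 on and below the diagonal, -1e10 above it (future positions)
causal_mask = (1 - np.tri(n_seq)) * -1e10  # [n_seq, n_seq]

# illustrative pre-mask scores: every position looks equally relevant
scores = np.zeros((n_seq, n_seq))

# after softmax, masked (future) positions get a weight of 0
weights = softmax(scores + causal_mask)
print(np.round(weights, 2))
```

Row `i` of `weights` spreads its attention only over positions `0..i`, so a token can never attend to tokens that come after it.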

## Load Model (GPT2-small)

In [8]:
from encoder import get_encoder
from utils import load_encoder_hparams_and_params, download_gpt2_files

In [9]:
download_gpt2_files("124M", "models")

Fetching checkpoint: 1.00kb [00:00, 1.00Mb/s]                                                       
Fetching encoder.json: 1.04Mb [00:01, 783kb/s]                                                      
Fetching hparams.json: 1.00kb [00:00, 1.01Mb/s]                                                     
Fetching model.ckpt.data-00000-of-00001: 498Mb [04:19, 1.92Mb/s]                                    
Fetching model.ckpt.index: 6.00kb [00:00, ?b/s]                                                     
Fetching model.ckpt.meta: 472kb [00:01, 464kb/s]                                                    
Fetching vocab.bpe: 457kb [00:01, 450kb/s]                                                          


**Create a subdirectory `124M` inside `models` and move the downloaded files into it.**

In [10]:
# load encoder, hparams, and params from the released OpenAI GPT-2 files
# available model sizes: ["124M", "355M", "774M", "1558M"]

model_size="124M"
models_dir="models"
encoder, hparams, params = load_encoder_hparams_and_params(model_size, models_dir)

In [11]:
hparams

{'n_vocab': 50257, 'n_ctx': 1024, 'n_embd': 768, 'n_head': 12, 'n_layer': 12}

## Parameters

`params` is a nested JSON-like dictionary that holds the trained weights of our model. The leaf nodes are NumPy arrays. If we print `params`, replacing the arrays with their shapes, we get:

In [12]:
import numpy as np
from pprint import pprint

def shape_tree(d):
    if isinstance(d, np.ndarray):
        return list(d.shape)
    elif isinstance(d, list):
        return [shape_tree(v) for v in d]
    elif isinstance(d, dict):
        return {k: shape_tree(v) for k, v in d.items()}
    else:
        raise ValueError("uh oh")

In [13]:
pprint(shape_tree(params)) # 12 layers

{'blocks': [{'attn': {'c_attn': {'b': [2304], 'w': [768, 2304]},
                      'c_proj': {'b': [768], 'w': [768, 768]}},
             'ln_1': {'b': [768], 'g': [768]},
             'ln_2': {'b': [768], 'g': [768]},
             'mlp': {'c_fc': {'b': [3072], 'w': [768, 3072]},
                     'c_proj': {'b': [768], 'w': [3072, 768]}}},
            {'attn': {'c_attn': {'b': [2304], 'w': [768, 2304]},
                      'c_proj': {'b': [768], 'w': [768, 768]}},
             'ln_1': {'b': [768], 'g': [768]},
             'ln_2': {'b': [768], 'g': [768]},
             'mlp': {'c_fc': {'b': [3072], 'w': [768, 3072]},
                     'c_proj': {'b': [768], 'w': [3072, 768]}}},
            {'attn': {'c_attn': {'b': [2304], 'w': [768, 2304]},
                      'c_proj': {'b': [768], 'w': [768, 768]}},
             'ln_1': {'b': [768], 'g': [768]},
             'ln_2': {'b': [768], 'g': [768]},
             'mlp': {'c_fc': {'b': [3072], 'w': [768, 3072]},
               

In [14]:
pprint(shape_tree(params['blocks'][0])) # 1st layer

{'attn': {'c_attn': {'b': [2304], 'w': [768, 2304]},
          'c_proj': {'b': [768], 'w': [768, 768]}},
 'ln_1': {'b': [768], 'g': [768]},
 'ln_2': {'b': [768], 'g': [768]},
 'mlp': {'c_fc': {'b': [3072], 'w': [768, 3072]},
         'c_proj': {'b': [768], 'w': [3072, 768]}}}


In [15]:
pprint(shape_tree(params['blocks'][11])) # 12th (last) layer

{'attn': {'c_attn': {'b': [2304], 'w': [768, 2304]},
          'c_proj': {'b': [768], 'w': [768, 768]}},
 'ln_1': {'b': [768], 'g': [768]},
 'ln_2': {'b': [768], 'g': [768]},
 'mlp': {'c_fc': {'b': [3072], 'w': [768, 3072]},
         'c_proj': {'b': [768], 'w': [3072, 768]}}}
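
The per-block shapes printed above, together with `hparams`, are enough to estimate the total parameter count. The sketch below is a back-of-the-envelope calculation, not part of picoGPT. It assumes (consistent with how `gpt2` uses them) that `wte` has shape `[n_vocab, n_embd]`, `wpe` has shape `[n_ctx, n_embd]`, and `ln_f` has a `g` and a `b` of size `n_embd`. The function name `count_params` is introduced here.

```python
hparams = {"n_vocab": 50257, "n_ctx": 1024, "n_embd": 768, "n_head": 12, "n_layer": 12}

def count_params(hparams):
    n_vocab, n_ctx = hparams["n_vocab"], hparams["n_ctx"]
    n_embd, n_layer = hparams["n_embd"], hparams["n_layer"]

    wte = n_vocab * n_embd  # token embeddings (assumed shape)
    wpe = n_ctx * n_embd    # positional embeddings (assumed shape)

    # attention: c_attn (w + b) and c_proj (w + b)
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # feed forward: c_fc (w + b) and c_proj (w + b)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # ln_1 and ln_2, each with g and b
    layer_norms = 2 * (2 * n_embd)
    block = attn + mlp + layer_norms

    ln_f = 2 * n_embd  # final layer norm (assumed g and b)
    return wte + wpe + n_layer * block + ln_f

total = count_params(hparams)
print(f"{total:,}")
```

The result is roughly 124 million, which matches the `124M` in the model name. The output projection adds nothing because `gpt2` reuses `wte.T` for it.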


## Putting it together

See: https://github.com/jaymody/picoGPT/blob/main/gpt2.py

In [16]:
# check the size of the vocabulary
len(encoder.decoder)
# 50257

50257

In [17]:
# encode the input string using the BPE tokenizer
prompt = "Alan Turing theorized that computers would one day become"

input_ids = encoder.encode(prompt)
print(input_ids)

[36235, 39141, 18765, 1143, 326, 9061, 561, 530, 1110, 1716]


In [18]:
[encoder.decoder[i] for i in input_ids]

['Alan',
 'ĠTuring',
 'Ġtheor',
 'ized',
 'Ġthat',
 'Ġcomputers',
 'Ġwould',
 'Ġone',
 'Ġday',
 'Ġbecome']

Notice that sometimes our tokens are whole words (e.g. `Alan`), sometimes they are words with a space in front of them (e.g. `Ġthat`, where `Ġ` represents a space), sometimes they are part of a word (e.g. "theorized" is split into `Ġtheor` and `ized`), and sometimes they are punctuation (e.g. `.`).
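
Why `Ġ`? GPT-2's BPE tokenizer works on bytes and remaps every byte to a visible character before merging. The sketch below reconstructs that remapping as I understand it from OpenAI's `encoder.py`. Treat it as an illustration rather than the reference implementation; the function name `bytes_to_unicode_sketch` is introduced here.

```python
def bytes_to_unicode_sketch():
    # bytes that already correspond to a visible character keep that character
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}

    # every other byte (space, newline, control bytes, ...) is shifted
    # to the next unused code point starting at 256
    n = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + n)
            n += 1
    return mapping

byte_to_char = bytes_to_unicode_sketch()

# the space byte (32) is the 33rd remapped byte, so it becomes chr(256 + 32)
visible = "".join(byte_to_char[b] for b in " that".encode("utf-8"))
print(visible)
```

Under this mapping a leading space shows up as `Ġ`, which is why tokens such as `Ġthat` appear in the vocabulary.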

In [19]:
# we can get our words back again
words = encoder.decode(input_ids)
print(words.split())

['Alan', 'Turing', 'theorized', 'that', 'computers', 'would', 'one', 'day', 'become']


In [20]:
# make sure we are not surpassing the max sequence length of our model
n_tokens_to_generate = 40

print(hparams["n_ctx"])
assert len(input_ids) + n_tokens_to_generate < hparams["n_ctx"]

1024


In [21]:
# generate output ids (using CPU)
output_ids = generate(input_ids, params, hparams["n_head"], n_tokens_to_generate)

generating: 100%|██████████████████████████████████████████████████████████████████████| 40/40 [00:02<00:00, 13.87it/s]


In [22]:
# decode the ids back into a string
output_text = encoder.decode(output_ids)

In [23]:
# display the generated text.
print(prompt + '...' + output_text)

Alan Turing theorized that computers would one day become... the most powerful machines on the planet.

The computer is a machine that can perform complex calculations, and it can perform these calculations in a way that is very similar to the human brain.



## Final GPT-2 release announcement from OpenAI

https://openai.com/index/gpt-2-1-5b-release/