# HSST B10m 2024: GPT-2 transformer exercise
Daniel Warren
May 2024
# About this notebook
This notebook is for exploring the smallest version of GPT-2 (124M parameters), described in the 2019 report by OpenAI [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf).

This model is extremely small and limited in performance compared to state-of-the-art LLMs. However, it has several advantages for use here:
* it has a permissive open source licence
* it is relatively small (~500MB) and lightweight, enabling it to be run in a Binder environment
* it is simple enough that the forward pass (i.e. predicting a text completion) has been implemented in ~60 lines of numpy code as [picoGPT](https://github.com/jaymody/picoGPT)
* the network architecture used by OpenAI is believed to remain very similar to this, at least as far as GPT-3.5, except scaled-up

All of the 'clever' functionality of this notebook uses functions from picoGPT. See [this blog post](https://jaykmody.com/blog/gpt-from-scratch/) for a detailed explanation by that code's author.


In [None]:
# Add picoGPT to path so functions can be imported like a module
import os, sys
pwd = os.getcwd()
picogpt_path = os.path.join(pwd,'picoGPT')
sys.path.append(picogpt_path)

## 1 Text completion
Start by running the code below (making changes if you want) to complete a text sequence. You'll find that its 'abilities' are quite limited - it rarely provides correct factual information and is very prone to repetition after only a few tokens.

Note especially that this model is only trained to complete arbitrary text, not 'instruction-tuned' to behave as a ChatGPT-style assistant. If you ask a question it might continue as though it were a rhetorical question in an essay.

In [None]:
# Import functions from picoGPT
from picoGPT.gpt2_pico import main as complete_text
from ipywidgets import HTML

prompts = ["The director of the film Titanic was",
           "Water boils at",
           "How many items are there in a dozen?",
           "The uses of artificial intelligence in medical physics include"]

for (i,prompt) in enumerate(prompts):
    completion = complete_text(prompt,n_tokens_to_generate=20)
    display(HTML(f'<pre><span style="color:red">{prompt}</span><span style="color:blue">{completion}</span>...</pre>'))
    

## 2 Tokenizer
The first part of the inference process is tokenization - converting the input text string into a sequence of numbers representing word parts.

Use the file browser on the left-hand side to navigate into the folder 'models/124M'. The files in here contain all of the pre-trained parameters/weights necessary to predict text in this model.

Two files are used by the tokenizer (here called 'encoder'), which converts a text string into a sequence of tokens:
  * **encoder.json** contains the mappings of text parts (keys in the JSON file) to token IDs (values)
    - You can open this in the Jupyter text editor. The file starts with tokens that represent numbers - you'll need to go to line ~1100 or so before you start seeing text, and even further down to see significant parts of words.
  * **vocab.bpe** is used for byte-pair encoding: iteratively merging commonly occurring pairs of tokens into a single token (the file lists the pairs to merge, in order of priority)
    - This can also be opened in the Jupyter text editor. In both files you will see many tokens starting with 'Ġ' - this indicates a space prior to that token.
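
As a rough illustration of the merging idea, the sketch below applies a tiny, made-up merge list to a single word. The function `toy_bpe` and the list `toy_merges` are hypothetical and are not part of picoGPT; the real tokenizer works on bytes and reads a much longer merge list from vocab.bpe.

```python
def toy_bpe(word, merges):
    """Split a word into characters, then repeatedly merge the highest-priority pair."""
    symbols = list(word)
    ranks = {pair: rank for rank, pair in enumerate(merges)}  # lower rank = higher priority
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no adjacent pair appears in the merge list
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])  # join the pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Made-up merge list for illustration only (not taken from vocab.bpe)
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]

print(toy_bpe("lower", toy_merges))  # ['low', 'er']
```

Earlier entries in the merge list take priority, which is why 'lo' is formed before 'low'.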

Run the code below (making changes if you want) to see how the tokenizer splits up various strings.

In [None]:
from picoGPT.encoder import get_encoder
tokenizer = get_encoder("124M", "models")

print ("TOKENIZATION EXAMPLES:")
# Tokenizer examples
prompts = ["three consecutive words","three","consecutive","words"," consecutive"]
all_tokens = set()
for prompt in prompts:
    tokens = tokenizer.encode(prompt)
    print(f"'{prompt}' tokenizes to {tokens}")
    all_tokens = all_tokens.union(tokens)

token_mapping = {t:tokenizer.decode([t]) for t in sorted(all_tokens)}

print ("\nTOKEN MAPPINGS:")
for (k,v) in token_mapping.items():
    print(f"{k}: '{v}'")


## 3 Hyper-parameters
The remaining files in the 'models/124M' directory are not directly human-readable. They comprise a 'checkpoint' from the TensorFlow machine-learning framework, including all of the pre-trained weights/parameters necessary to run the GPT-2 model.

The code below reads in the parameters and displays key network hyperparameters:
* **n_vocab** is the size of the tokenizer's vocabulary (the number of word/text parts that have been assigned a unique token ID)
* **n_ctx** is the size of the context (the maximum length of text that can be processed, in units of tokens)
* **n_embd** is the size of the embedding (the number of directions in the arbitrary, learned, coordinate-system (or 'latent space') that the model uses to represent token meaning and position)
* **n_head** is the number of attention heads per layer (the attention mechanism is applied in parallel N times on different linear transformations of the embeddings, with those transformations being learned parameters)
* **n_layer** is the number of layers in the network

In [None]:
from picoGPT.utils import load_encoder_hparams_and_params
_, hparams, params = load_encoder_hparams_and_params("124M","models")

print ("NETWORK HYPERPARAMETERS:")
for (k,v) in hparams.items():
    print(f"{k}: '{v}'")
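
One consequence of these hyperparameters is that, in GPT-2, the Q, K and V vectors (each of length n_embd) are divided evenly between the attention heads, so each head works with vectors of length n_embd / n_head. The sketch below shows that split on random stand-in data. The values 768 and 12 are assumed here (they are the standard values for the 124M model and should match the output of the cell above), and `q` is a made-up array rather than a real projection.

```python
import numpy as np

n_embd, n_head = 768, 12          # assumed values for the 124M model
head_size = n_embd // n_head      # length of the vectors seen by each head

rng = np.random.default_rng(0)
q = rng.normal(size=(5, n_embd))  # stand-in for the projected queries of 5 tokens

# Split the last axis into n_head equal chunks, one per attention head
q_heads = np.split(q, n_head, axis=-1)

print(head_size)                       # 64
print(len(q_heads), q_heads[0].shape)  # 12 (5, 64)
```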

## 4 Parameters/weights
The learned parameters ('weights') of the model are the results of the training process for GPT-2.
### 4.1 Token Embedding
* **wte** is the token embedding matrix, mapping each token in the vocabulary to a vector of length n_embd - a multi-dimensional 'latent space'

In [None]:
import numpy as np
import seaborn as sns
import pandas as pd

from matplotlib import pyplot as plt

prompts = [" dog"," cat"," horse"," car"," bus"," train"]

# Look up the embedding of the first token of each prompt
embeddings = {}
for prompt in prompts:
    token_ids = tokenizer.encode(prompt)
    if len(token_ids) > 1:
        print(f"WARNING: '{prompt}' contains >1 token. Only the first token will be used.")
    embeddings[prompt] = params['wte'][token_ids[0]]

# Cosine similarity between every pair of embeddings
df = pd.DataFrame(index=prompts, columns=prompts, dtype=float)
for prompt_a in prompts:
    for prompt_b in prompts:
        a = embeddings[prompt_a]
        b = embeddings[prompt_b]
        df.loc[prompt_a, prompt_b] = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

ax = plt.axes()
sns.heatmap(df, annot=True, fmt=".2f", linewidths=.5,cmap="jet",ax=ax)
plt.title("Cosine similarity of token embeddings")
plt.show()
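
The cosine similarity used above is the dot product of two vectors after each has been scaled to unit length, so it depends only on the angle between them: 1 for the same direction, 0 for perpendicular, -1 for opposite. The sketch below computes it for all pairs at once using made-up 2D vectors rather than real embeddings; `cosine_similarity_matrix` is an illustrative helper, not a picoGPT function.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the matrix of cosine similarities between all rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)  # scale rows to unit length
    return unit @ unit.T  # dot product of every pair of unit vectors

# Made-up vectors: rows 0 and 1 are parallel, row 2 is perpendicular, row 3 is opposite
toy_vectors = np.array([[1.0, 0.0],
                        [2.0, 0.0],
                        [0.0, 3.0],
                        [-1.0, 0.0]])

sim = cosine_similarity_matrix(toy_vectors)
print(np.round(sim, 2))
```

Note that the length of a vector has no effect: rows 0 and 1 have a similarity of 1 despite differing in magnitude.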

### 4.2 Position Embedding
Attention doesn't inherently consider the order of tokens in the sequence (unlike e.g. convolution, which works on neighbouring elements), so position information must be added explicitly.
* **wpe** is the position embedding matrix, mapping each position in the context (0 to n_ctx - 1) to a vector of length n_embd in the same latent space as **wte**.
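
Because both matrices share the same latent space, the input to the first transformer block is simply the sum of a token's **wte** row and its position's **wpe** row. The sketch below illustrates this with small random matrices; `toy_wte`, `toy_wpe` and the sizes are made up for the example and stand in for the real parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, n_ctx, n_embd = 10, 6, 4              # toy sizes, far smaller than GPT-2
toy_wte = rng.normal(size=(n_vocab, n_embd))   # one row per token ID
toy_wpe = rng.normal(size=(n_ctx, n_embd))     # one row per position

token_ids = [3, 7, 3]                          # the same token (3) appears at positions 0 and 2
positions = np.arange(len(token_ids))

# Input representation: token embedding plus position embedding
x = toy_wte[token_ids] + toy_wpe[positions]

print(x.shape)                   # (3, 4)
print(np.allclose(x[0], x[2]))   # False: same token, different position
```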

In [None]:
wpe = params['wpe']

for (p,i) in enumerate([0,256,512,768]):
    pos_i_embedding = wpe[i]
    similarities = []
    for j in range(wpe.shape[0]): # every position in the context (one row of wpe per position)
        pos_j_embedding = wpe[j]
        similarity = np.dot(pos_i_embedding/np.sqrt(np.sum(pos_i_embedding**2)),
                            pos_j_embedding/np.sqrt(np.sum(pos_j_embedding**2)))
        similarities.append(similarity)
    plt.subplot(2,2,p+1)
    plt.plot(similarities)
    plt.xlabel('Position in context')
    plt.ylabel('Embedding similarity')
    plt.title(f'Relative to position {i}')
plt.tight_layout()
plt.show()

### 4.3 Transformer blocks
* **blocks** is a list (size 12) of dictionaries containing weights for each of the 12 transformer blocks in succession:
  - **attn** contains the attention layer parameters
    - **c_attn** contains the weights and biases that project the input embeddings to Q, K, V matrices (which are then used in the attention calculation)
    - **c_proj** contains the weights and biases of the linear transformation applied to the output of the attention calculation
  - **ln_1** and **ln_2** are the parameters of the two normalization layers in the block, similar to **ln_f** (the final normalization layer, stored alongside **blocks** in the parameters)
  - **mlp** is the conventional (fully-connected) neural network at the end of the block
    - **c_fc** contains the weights and biases for the fully-connected layer
    - **c_proj** contains the weights and biases of the linear transformation applied to the output of the fully-connected layer
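
To make the roles of these parameters more concrete, the sketch below pushes a few random vectors through one block built from toy, randomly initialised weights with the same dictionary layout. It is a simplified illustration rather than picoGPT's code: it uses a single attention head (the real model splits Q, K and V across n_head heads), and the sizes and helper names are assumptions chosen for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of the GELU activation function
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, g, b, eps=1e-5):
    mean = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    return g * (x - mean) / np.sqrt(var + eps) + b

def linear(x, w, b):
    return x @ w + b

def single_head_attention(x, c_attn, c_proj):
    q, k, v = np.split(linear(x, **c_attn), 3, axis=-1)  # project to Q, K, V
    n_seq = x.shape[0]
    mask = (1 - np.tri(n_seq)) * -1e10                   # block attention to later positions
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    return linear(softmax(scores) @ v, **c_proj)

def block(x, attn, ln_1, mlp, ln_2):
    x = x + single_head_attention(layer_norm(x, **ln_1), **attn)  # attention + residual
    hidden = gelu(linear(layer_norm(x, **ln_2), **mlp["c_fc"]))
    return x + linear(hidden, **mlp["c_proj"])                    # mlp + residual

rng = np.random.default_rng(0)
n_seq, n_embd = 4, 8  # toy sizes

def toy_linear(n_in, n_out):
    return {"w": rng.normal(size=(n_in, n_out)) * 0.1, "b": np.zeros(n_out)}

def toy_layer_norm():
    return {"g": np.ones(n_embd), "b": np.zeros(n_embd)}

toy_block = {
    "attn": {"c_attn": toy_linear(n_embd, 3 * n_embd), "c_proj": toy_linear(n_embd, n_embd)},
    "ln_1": toy_layer_norm(),
    "mlp": {"c_fc": toy_linear(n_embd, 4 * n_embd), "c_proj": toy_linear(4 * n_embd, n_embd)},
    "ln_2": toy_layer_norm(),
}

x = rng.normal(size=(n_seq, n_embd))  # stand-in for the embedded input tokens
out = block(x, **toy_block)
print(out.shape)  # (4, 8): same shape as the input, so blocks can be stacked
```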

Run the code block below to view the structure of the parameters for the transformer blocks. You can also uncomment and modify the commented-out line to see the weights themselves; they are large arrays of floating point numbers.

In [None]:
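import pprint

# (np.set_string_function, used for this previously, is not available in NumPy 2)
def describe_arrays(obj): # recursively replace each numpy array with a description of its shape
    if isinstance(obj, dict):
        return {k: describe_arrays(v) for (k,v) in obj.items()}
    if isinstance(obj, list):
        return [describe_arrays(v) for v in obj]
    return f"array with shape {obj.shape}" if isinstance(obj, np.ndarray) else obj

pp = pprint.PrettyPrinter(indent=4,depth=4)
pp.pprint(describe_arrays(params['blocks']))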

#print("\n")
#print(params['blocks'][0]['attn']['c_attn']['w']) # Print the weight parameter of the attention layer in the first transformer block
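
The array shapes above can be used to check where the '124M' in the model's name comes from. The sketch below adds up the number of parameters, assuming the standard hyperparameters for this model (n_vocab = 50257, n_ctx = 1024, n_embd = 768, n_layer = 12), a fully-connected layer that is 4 times wider than the embedding, and an output layer that reuses **wte** rather than having its own weights. `count_gpt2_parameters` is an illustrative helper, not a picoGPT function.

```python
def count_gpt2_parameters(n_vocab, n_ctx, n_embd, n_layer):
    wte = n_vocab * n_embd                     # token embedding matrix
    wpe = n_ctx * n_embd                       # position embedding matrix
    layer_norm = 2 * n_embd                    # gain and bias vectors
    attn = (n_embd * 3 * n_embd + 3 * n_embd)  # c_attn weights and biases
    attn += (n_embd * n_embd + n_embd)         # c_proj weights and biases
    mlp = (n_embd * 4 * n_embd + 4 * n_embd)   # c_fc weights and biases
    mlp += (4 * n_embd * n_embd + n_embd)      # c_proj weights and biases
    block = 2 * layer_norm + attn + mlp        # ln_1, ln_2, attn, mlp
    return wte + wpe + n_layer * block + layer_norm  # final term is ln_f

total = count_gpt2_parameters(n_vocab=50257, n_ctx=1024, n_embd=768, n_layer=12)
print(f"{total:,}")  # 124,439,808
```

Note that n_head does not appear: splitting the attention calculation into heads changes how the Q, K and V vectors are used, not how many parameters there are.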

## 5 Outputs
The output of the model is a score (a 'logit') for every possible token in the vocabulary, with higher scores for more likely next tokens. The picoGPT function used in part 1 above always chooses the token with the highest score, so its output is entirely repeatable/deterministic. For greater variety/'creativity' the scores can instead be converted into a probability distribution (e.g. using softmax), which is then sampled randomly to choose the next token.
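
The deterministic approach amounts to taking the highest-scoring token, appending it to the sequence, and repeating. The sketch below shows that loop with a made-up scoring function in place of the real model; `greedy_generate` and `toy_logits_fn` are hypothetical names for this example.

```python
import numpy as np

def greedy_generate(token_ids, logits_fn, n_tokens_to_generate):
    """Repeatedly append the highest-scoring next token."""
    token_ids = list(token_ids)
    for _ in range(n_tokens_to_generate):
        logits = logits_fn(token_ids)          # shape (len(token_ids), n_vocab)
        next_id = int(np.argmax(logits[-1]))   # scores at the last position predict the next token
        token_ids.append(next_id)
    return token_ids

def toy_logits_fn(token_ids, n_vocab=5):
    # Made-up 'model': always favours (token + 1), wrapping around at n_vocab
    logits = np.zeros((len(token_ids), n_vocab))
    for pos, tok in enumerate(token_ids):
        logits[pos, (tok + 1) % n_vocab] = 1.0
    return logits

print(greedy_generate([0], toy_logits_fn, 6))  # [0, 1, 2, 3, 4, 0, 1]
```

The same input always gives the same output, and the toy example quickly falls into a cycle, which is loosely analogous to the repetition seen in part 1.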

Vary the prompt below to see how it changes the highest-scoring predicted tokens.

As an extension exercise, you could try implementing a random sampler from the highest-scoring N tokens and seeing how its sentence completions compare to the deterministic approach in part 1.

In [None]:
from picoGPT.gpt2_pico import gpt2 as calculate_logits, softmax
from collections import OrderedDict

prompt = "The cat sat on the"

tokens = tokenizer.encode(prompt)

logits = calculate_logits(tokens, **params, n_head=hparams["n_head"])
next_token_scores = logits[-1]

probabilities = softmax(next_token_scores) # convert the scores to probabilities (once, for all tokens)

wordparts_and_scores = [(tokenizer.decode([i]),score,probabilities[i]) for (i,score) in enumerate(next_token_scores)]
wordparts_and_scores = sorted(wordparts_and_scores,key=lambda x:x[1],reverse=True) # sort by score

top5_wordparts_and_scores = wordparts_and_scores[:5]
bottom5_wordparts_and_scores = reversed(wordparts_and_scores[-5:])

print(f"Top 5 predictions to complete '{tokenizer.decode(tokens)}':")
for (i,(word,score,probability)) in enumerate(top5_wordparts_and_scores):
    print(f"{i+1}. {word} (score = {score:.2f}, softmax probability = {probability:.2g})")

print(f"\nBottom 5 predictions to complete '{tokenizer.decode(tokens)}':")
for (i,(word,score,probability)) in enumerate(bottom5_wordparts_and_scores):
    print(f"{i+1}. {word} (score = {score:.2f}, softmax probability = {probability:.2g})")
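
For the extension exercise, one possible starting point is sketched below: keep only the N highest-scoring tokens, convert their scores to probabilities with softmax, and draw one at random. The function `sample_top_k`, its `temperature` argument and the scores in `toy_scores` are made up for illustration; this is not how picoGPT chooses tokens.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the maximum for numerical stability
    return e / e.sum()

def sample_top_k(scores, k=5, temperature=1.0, rng=None):
    """Randomly choose a token ID from the k highest-scoring tokens."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    top_ids = np.argsort(scores)[-k:]                # IDs of the k highest scores
    probs = softmax(scores[top_ids] / temperature)   # higher temperature = flatter distribution
    return int(rng.choice(top_ids, p=probs))

# Made-up scores for a 5-token vocabulary
toy_scores = [0.1, 5.0, 4.0, -2.0, 3.0]

rng = np.random.default_rng(0)
print([sample_top_k(toy_scores, k=2, rng=rng) for _ in range(10)])  # only IDs 1 and 2 appear
```

Here `next_token_scores` from the cell above could be passed in place of `toy_scores`; the returned ID can then be appended to `tokens` before calculating the next set of scores.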