# CARTE-Enbridge Bootcamp
#### Lab 5-0

# Building a GPT model from scratch

In this notebook, we are going to build a very simple version of GPT. Our GPT will have a small vocabulary and a small number of layers. Let's begin by importing the necessary libraries.

In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np



In [2]:
# Check that we are using a GPU
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    raise SystemError('GPU device not found! Enable GPU by going to Runtime > Change runtime type > GPU')

Default GPU Device: /device:GPU:0



At its core, GPT is a model that predicts the next word in a sequence. Before the model can learn anything, we first need to convert words into numerical values that can be fed into it.

We are going to load a dataset of samples from [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page). This is a version of Wikipedia that aims to cover the same content, but using a reduced vocabulary and simpler grammar. This makes it easier for language learners to understand. We will use this dataset to train our model.

In [3]:
import requests
import zipfile
from tqdm import tqdm

url = "https://raw.githubusercontent.com/alexwolson/carte_workshop_datasets/main/corpus.txt.zip"

# Stream the download so we can track its progress
response = requests.get(url, stream=True)

# Total size in bytes.
total_size = int(response.headers.get('content-length', 0))
block_size = 1024  # 1KB
progress_bar = tqdm(total=total_size, unit='iB', unit_scale=True)

with open('corpus.txt.zip', 'wb') as file:
    for data in response.iter_content(block_size):
        progress_bar.update(len(data))
        file.write(data)
progress_bar.close()

if total_size != 0 and progress_bar.n != total_size:
    print("ERROR, something went wrong")

# Now we will extract the zip file
z = zipfile.ZipFile('corpus.txt.zip')
z.extractall()

# Read the whole corpus into memory, then preview its first 1000 characters
with open('corpus.txt', 'r') as f:
    corpus = f.read()

print(corpus[:1000])

100%|██████████| 11.9M/11.9M [00:00<00:00, 42.7MiB/s]


April
April is the fourth month of the year with 30 days. The name April comes from that Latin word "aperire" which means "to open". This probably refers to growing plants in spring. April begins on the same day of week as "July" in all years and also "January" in leap years.
April's flower is the Sweet Pea and its birthstone is the Diamond. The meaning of the Diamond is Innocence.
April in poetry.
Poets use "April" to mean the end of winter. For example: "April showers bring May flowers."

August
August is the eighth month of the year. It has 31 days.
This month was first called "Sextilis" in Latin, because it was the sixth month in the old Roman calendar. The Roman calendar began in March about 735 BC with Romulus. It was the eighth month when January or February were added to the start of the year by King Numa Pompilius about 700 BC. Or, when those two months were moved from the end to the beginning of the year by the decemvirs about 450 BC (Roman writers disagree).
August is named 

Because we want our model to be very simple, we are going to determine the 500 most common words, and use those as our vocabulary.

In [4]:
from collections import Counter
from re import sub

# Strip out all punctuation and numbers
corpus = sub(r'[^\w\s]', '', corpus)
corpus = sub('\n', ' ', corpus)
corpus = sub(r'\d+', '', corpus)
corpus = sub('  ', ' ', corpus)

words = corpus.lower().split(' ')
word_counts = Counter(words)

vocab_size = 500
most_common_words = word_counts.most_common(vocab_size+1)
most_common_words = [word for word, count in most_common_words if word != '']
print(most_common_words)

['the', 'of', 'in', 'a', 'and', 'is', 'to', 'was', 'it', 'that', 'for', 'are', 'as', 'he', 'on', 'by', 'with', 'or', 'from', 'they', 'an', 'this', 'at', 'be', 'his', 'people', 'also', 'has', 'not', 'were', 'which', 'have', 'one', 'river', 'but', 'can', 'many', 'called', 'other', 'there', 'city', 'their', 'when', 'first', 'who', 'some', 'used', 'its', 'about', 'had', 'most', 'found', 'into', 'after', 'made', 'united', 'very', 'states', 'she', 'more', 'all', 'time', 'because', 'two', 'france', 'new', 'like', 'part', 'her', 'been', 'music', 'region', 'only', 'world', 'known', 'these', 'means', 'north', 'name', 'commune', 'them', 'than', 'became', 'may', 'years', 'such', 'often', 'so', 'up', 'different', 'where', 'department', 'born', 'during', 's', 'between', 'over', 'if', 'him', 'then', 'th', 'use', 'will', 'make', 'usually', 'war', 'out', 'do', 'state', 'south', 'would', 'american', 'later', 'area', 'no', 'famous', 'each', 'same', 'before', 'small', 'year', 'three', 'east', 'english', '
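
The vocabulary printed above includes the fragments 's' and 'th'. A plausible explanation (an assumption, not something checked against the corpus) is that they are what remains of strings such as "1990s" and "19th" once the digits are removed. The sketch below wraps the same four substitutions in a hypothetical helper, `clean`, and also shows how an empty string can end up in the word list, which is why the cell above filters out `''`.

```python
from re import sub

def clean(text):
    # Same substitutions as the cell above, applied in the same order
    text = sub(r'[^\w\s]', '', text)   # drop punctuation
    text = sub('\n', ' ', text)        # newlines become spaces
    text = sub(r'\d+', '', text)       # drop digits
    text = sub('  ', ' ', text)        # collapse double spaces (a single pass only)
    return text.lower().split(' ')

print(clean("In the 1990s, the 19th century"))
# ['in', 'the', 's', 'the', 'th', 'century']

# Two numbers in a row leave three spaces; one pass turns them into two,
# so splitting on a single space yields an empty string.
print(clean("a 1 2 b"))
# ['a', '', 'b']
```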

With our reduced vocabulary, we will now take our dataset and strip out all words that are not in our vocabulary. Because we are dropping a LOT of words, we are going to keep only runs of six consecutive in-vocabulary words (five words of context plus the word to predict).

In [5]:
context_length = 5
new_corpus = []
phrase = []
for word in tqdm(words):
    if word in most_common_words:
        phrase.append(word)
    elif len(phrase) >= context_length+1:
        new_corpus.append(' '.join(phrase))
        phrase = []
    else:
        phrase = []

    if len(phrase) >= context_length+1:
        new_corpus.append(' '.join(phrase))
        phrase = []

100%|██████████| 5411292/5411292 [00:09<00:00, 587574.55it/s]


In [6]:
# Remove duplicates
new_corpus = list(set(new_corpus))

In [7]:
print(new_corpus[:10])

['but this is only part of', 'war illinois illinois is a state', 'women and young children have more', 'of all time by the american', 'west of england it was created', 'good and they will all want', 'it is a very popular place', 'switzerland in in the city of', 'formed only two groups in the', 'a few years later but in']


Fantastic! We now have a dataset of six-word phrases, most of which read as natural English. Next, we need to encode our words into values, so that we can feed them into our model:

In [8]:
words_to_int = {word: i for i, word in enumerate(most_common_words)}
int_to_words = {i: word for i, word in enumerate(most_common_words)}

Now we can encode any sentence (as long as it's made up of words in our vocabulary) into a sequence of integers:

In [9]:
def encode(sentence):
    return [words_to_int[word] for word in sentence.split(' ')]

def encode_one_hot(word):
    return [1 if i == words_to_int[word] else 0 for i in range(vocab_size)]

def decode(sequence):
    return ' '.join([int_to_words[i] for i in sequence])

def decode_one_hot(word):
    return int_to_words[np.argmax(word)]

encoded = encode('all of the people')
print(encoded)
print(decode(encoded))

[60, 1, 0, 25]
all of the people


Now that we have a way to convert words into integers, we can create our training data. We will use the first 5 words in a sequence to predict the 6th word. For example, given the sequence "all of the people in the", we will use "all of the people in" to predict "the". We will do this for every sequence in our dataset.

In [10]:
from sklearn.model_selection import train_test_split

X = []
y = []

for sentence in new_corpus:
    words = sentence.split(' ')
    for i in range(len(words)-context_length):
        X.append(encode(' '.join(words[i:i+context_length])))
        y.append(encode_one_hot(words[i+context_length]))

X = np.array(X)
y = np.array(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

for i in range(10):
    print(decode(X_train[i]), '->', decode_one_hot(y_train[i]))

under a rock and is -> not
life is that life is -> not
with the red sea through -> the
of the first people to -> make
as well as the history -> of
television series from the united -> kingdom
of different countries living together -> since
country with its own government -> but
to all parts of life -> and
a person using it can -> change


Now let's use Keras to build our model. At its simplest, GPT has the following structure:

1. An embedding layer that converts each word into a vector
2. A transformer block
3. A linear layer that converts the output of the transformer blocks into a vector of probabilities for each word in the vocabulary

---

_Optional math:_

The transformer is a key concept in GPT. At its core, you can think of a single transformer block as an equivalent to a layer of neurons, but with a more complex architecture. The transformer block is made up of two parts:

1. Multi-head attention
2. A standard fully-connected layer

Multi-head attention is a way of combining information from different parts of the input. "Multi-head" really just means that we do this multiple times and combine the results. Attention can be thought of as a replacement for a standard neuron - instead of taking in all the inputs and combining them based on a single set of weights, we instead learn three different sets of weights and combine them in a more complex way. So instead of our neuron working like this:

$$ y = activation(WX) $$

It works like this:

$$ y = activation(\frac{W_1X*W_2X}{\sqrt{size(X)}}) * W_3X $$

If that seems confusing, don't worry - attention is a relatively recent idea in deep learning (the transformer was introduced in 2017) and we aren't explaining it in much detail. The real takeaway is that we are replacing our standard neuron, which has one set of parameters, with a more complex unit that has three sets of parameters.
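
For readers who prefer code to formulas, here is a minimal NumPy sketch of a single attention head. It is an illustration under stated assumptions, not what Keras does internally: the three weight matrices (called `W_q`, `W_k` and `W_v` here, playing the roles of $W_1$, $W_2$ and $W_3$) are random rather than learned, the "activation" is a softmax, and each word is stored as a row, so the products are written `X @ W` rather than $WX$.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum before exponentiating, for numerical stability
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def single_head_attention(X, W_q, W_k, W_v):
    """X has shape (sequence_length, embedding_size), one row per word."""
    Q = X @ W_q                                  # queries
    K = X @ W_k                                  # keys
    V = X @ W_v                                  # values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # word-to-word similarities, scaled by sqrt(key size)
    weights = softmax(scores)                    # each row is a distribution over the input words
    return weights @ V, weights                  # weighted mix of the value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                      # 5 words, each an 8-dimensional vector
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
output, weights = single_head_attention(X, W_q, W_k, W_v)
print(output.shape)   # (5, 8)
```

Each output row is a blend of all five input words, with the blend weights decided by how well that word's query matches every word's key.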

---

Each word in our vocabulary will be transformed into a vector of size `8` that will be learned by the model. We will stack two multi-head attention layers, each with two heads, as a simplified stand-in for two transformer blocks, followed by a single feedforward layer with 32 neurons. The output of the feedforward layer will be flattened, and then fed into a linear layer that will output a vector of size `vocab_size`. This vector will be a probability distribution over the words in our vocabulary. We will use the `softmax` activation function to ensure that the output is a valid probability distribution.

In [11]:
vocab_size = len(words_to_int)
embedding_size = 8
num_heads = 2
num_transformer_blocks = 2
feedforward_dim = 32

inputs = keras.layers.Input(shape=(context_length,)) # Take in five words (context_length = 5)
embedding_layer = keras.layers.Embedding(vocab_size, embedding_size)(inputs) # Convert each word to a vector
transformer_block = keras.layers.MultiHeadAttention(num_heads, embedding_size)(embedding_layer, embedding_layer) # Apply multi-head self-attention, the core of a transformer block
transformer_block = keras.layers.MultiHeadAttention(num_heads, embedding_size)(transformer_block, transformer_block) # Apply multi-head attention again
transformer_block = keras.layers.Dense(feedforward_dim, activation='relu')(transformer_block) # Feedforward layer
transformer_flattened = keras.layers.Flatten()(transformer_block) # Flatten the output - we currently get 5 vectors because we have 5 words
outputs = keras.layers.Dense(vocab_size, activation='softmax')(transformer_flattened) # Output probabilities for each word in the vocabulary


model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 5)]                  0         []                            
                                                                                                  
 embedding (Embedding)       (None, 5, 8)                 4000      ['input_1[0][0]']             
                                                                                                  
 multi_head_attention (Mult  (None, 5, 8)                 568       ['embedding[0][0]',           
 iHeadAttention)                                                     'embedding[0][0]']           
                                                                                                  
 multi_head_attention_1 (Mu  (None, 5, 8)                 568       ['multi_head_attention[0][


This is our own microscopic GPT! Our model has 85,924 trainable parameters. For comparison, GPT-3 has 175 billion; OpenAI has not published a parameter count for GPT-3.5. Let's train our model on our dataset.
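
Before training, it may help to see where the 85,924 figure comes from, since the printed summary is cut off. The sketch below recomputes each layer's parameter count by hand from the hyperparameters in the cell above. It assumes the usual Keras layout for `MultiHeadAttention` (query, key, value and output projections, each with a bias); the variable names are chosen for illustration.

```python
vocab_size, embedding_size, context_length = 500, 8, 5
num_heads, key_dim, feedforward_dim = 2, 8, 32

embedding = vocab_size * embedding_size                                   # 4000

# Each attention layer: query, key and value projections plus an output projection, all with biases
qkv = 3 * (embedding_size * num_heads * key_dim + num_heads * key_dim)    # 3 * 144 = 432
attn_out = num_heads * key_dim * embedding_size + embedding_size          # 136
attention = qkv + attn_out                                                # 568, matching the summary

feedforward = embedding_size * feedforward_dim + feedforward_dim          # 288
flattened = context_length * feedforward_dim                              # 160 inputs to the last layer
output = flattened * vocab_size + vocab_size                              # 80500

total = embedding + 2 * attention + feedforward + output
print(total)  # 85924
```

Note that the final layer alone accounts for 80,500 of the parameters, roughly 94% of the model.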

In [12]:
model.fit(
    X_train,
    y_train,
    epochs=1000,
    batch_size=1024,
    validation_data=(X_test, y_test),
    callbacks=[
        keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
    ]
)

Epoch 1/1000
...
Epoch 73/1000


<keras.src.callbacks.History at 0x7fb1e0156d10>

As you can see, our accuracy is not very good. It picks the right word something like 1 in 5 times. That's still a lot better than random, which would be 1 in 500, but it's not enough to be useful in practice. This is because our model is very small, and so is our dataset. However, the principles we are using here are exactly the same as in GPT - just on a smaller scale. Let's see how our model performs on a sample sentence:

In [13]:
sample_sentence = 'the united states is one'
sample_sentence_encoded = encode(sample_sentence)
print(sample_sentence_encoded)
predictions = model.predict(np.array([sample_sentence_encoded]))
print(decode_one_hot(predictions[0]))

[0, 55, 57, 5, 32]
the


When we talk about models like GPT-3.5, we talk about the 'context window'. This is the number of tokens we feed into the model to get one token out. In our case, each token is a word and the context window is 5. In GPT-3.5, the context window is 4,096 or 16,384 tokens, depending on the model. GPT-4 Turbo, announced in November 2023, supports a context window of up to 128,000 tokens - as much as 300 pages of text.

No matter how long the context is, we only get one word out at a time. If we want to produce a longer sequence, we have to repeatedly feed the output back into the model. This is called 'autoregressive generation'. We can use our model to generate a sequence of words like this:

In [14]:
def generate_sequence(model, context, length):
    result = context
    for i in range(length):
        predictions = model.predict(np.array([context]))
        context = np.append(context, np.argmax(predictions[0]))
        result = np.append(result, np.argmax(predictions[0]))
        context = context[1:]
    return result

print(decode(generate_sequence(model, encode('in the way of the'), 10)))

in the way of the british empire the first union the united states the storm


This is precisely how models like ChatGPT generate text. They take in the context (which, as we discussed, is typically much longer than 5 words) and generate the next word. However, unlike our case, where we choose a fixed number of words to generate, ChatGPT keeps generating words until it produces a special token that marks the end of a sequence (written here as `<end>`; the exact token varies between models). This is how it can generate text of varying length.
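
The stopping rule can be sketched without a real model. In the toy example below, the lookup table `toy_next_word` is a hypothetical stand-in for a trained model, and `END_TOKEN` is an illustrative marker rather than the token any particular model uses. The loop stops as soon as the end token is produced, with a maximum length as a safety cap.

```python
END_TOKEN = '<end>'   # hypothetical end-of-sequence marker; real models define their own

# Hypothetical stand-in for a trained model: maps the last word to the next one
toy_next_word = {'the': 'river', 'river': 'is', 'is': 'long', 'long': END_TOKEN}

def generate_until_end(context, max_length=20):
    result = list(context)
    for _ in range(max_length):            # safety cap in case the end token never appears
        next_word = toy_next_word.get(result[-1], END_TOKEN)
        if next_word == END_TOKEN:         # stop without adding the marker to the output
            break
        result.append(next_word)
    return result

print(' '.join(generate_until_end(['the'])))
# the river is long
```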

Now we have created our very own GPT model. But this is not the same as ChatGPT. Models like ChatGPT go one step further, to make the model more useful for conversation. This is done using a technique called Reinforcement Learning from Human Feedback (RLHF).

RLHF expands on the training process we've seen above by adding a second model, usually called the reward model. In this notebook we will call it the adversary. The role of the adversary is to rate the quality of a response based on some conditions that we care about. In the case of ChatGPT, it is looking for things like whether the response fits the conversational style, and whether it avoids sensitive topics. The adversary is trained using _human feedback_ - humans rate the quality of responses, and the adversary learns to predict the human rating. The adversary is then used to train the generator (the GPT model) - the generator is rewarded for producing responses that the adversary rates highly. Note that, despite the name we use here, this differs from the adversarial training used in GANs, where the second model tries to tell generated samples apart from real ones.

We are going to make our own extremely simple adversary. Our adversary will assign a score to the response based on how many times the letter 'e' appears in the response. We will then use this score to train our generator. This is a very simple example, but it demonstrates the principle of steering a model with a score that rates its output.

Because we can directly calculate how many 'e's appear in each of our vocabulary words, we don't need to 'train' our adversary - we can just use it directly. We will use the following function to calculate the adversary score of a word:

In [15]:
adversary_scores = [word.count('e') for word in most_common_words]
def adversary_score(y_true, y_pred):
    return tf.reduce_sum(y_pred * adversary_scores, axis=-1)

In [16]:
sample_sentence = "in the same way i"
sample_sentence_encoded = encode(sample_sentence)
prediction = model.predict(np.array([sample_sentence_encoded]))[0]
print(decode_one_hot(prediction))
print(adversary_score(None, prediction))

the
tf.Tensor(0.6908647981617653, shape=(), dtype=float64)


As you should be able to see, the adversary's score is not simply the number of 'e's in the predicted word: 'the' contains one 'e', but the score is about 0.69. This is because the prediction is not a one-hot vector - it is a probability distribution. The adversary scores the entire distribution, not just the most likely word, so the result is the expected number of 'e's under the model's prediction. It would be greater than zero even if the most likely word contained no 'e's at all. This is valuable because a score based only on the single most likely word would not change when the probabilities shift slightly, so it would give the model no signal about how to improve. Scoring the whole distribution instead incentivizes the model to put more probability on every word containing 'e's.
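
A small worked example may make the difference concrete. The four-word vocabulary and the probabilities below are made up for illustration; the calculation mirrors what `adversary_score` does for a single prediction.

```python
import numpy as np

toy_words = ['the', 'a', 'people', 'do']                   # toy vocabulary (an assumption)
e_counts = np.array([w.count('e') for w in toy_words])     # [1, 0, 2, 0]
probs = np.array([0.2, 0.5, 0.1, 0.2])                     # made-up predicted distribution

# Score of the whole distribution: 0.2 * 1 + 0.1 * 2
expected_score = float(np.sum(probs * e_counts))

# Score of only the most likely word
top_word = toy_words[int(np.argmax(probs))]
top_word_score = top_word.count('e')

print(expected_score, top_word, top_word_score)
```

Here the most likely word is 'a', which scores zero on its own, yet the distribution as a whole scores about 0.4 because some probability sits on 'the' and 'people'.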

In [17]:
sample_sentence = "in the same way i"
sample_sentence_encoded = encode(sample_sentence)
prediction = model.predict(np.array([sample_sentence_encoded]))[0]
print("Top 10 most likely words")
print("Word       | Chance \t| e count | Adversary score")
for i in sorted(zip(prediction, most_common_words), reverse=True)[:10]:
    print(f'{i[1]:10} | {i[0]*100:.0f}% \t| {i[1].count("e")}       | {i[0]*i[1].count("e"):.4f}')

Top 10 most likely words
Word       | Chance 	| e count | Adversary score
the        | 19% 	| 1       | 0.1918
be         | 7% 	| 1       | 0.0723
a          | 4% 	| 0       | 0.0000
people     | 3% 	| 2       | 0.0625
have       | 2% 	| 1       | 0.0240
do         | 2% 	| 0       | 0.0000
things     | 2% 	| 0       | 0.0000
their      | 2% 	| 1       | 0.0186
get        | 2% 	| 1       | 0.0182
make       | 2% 	| 1       | 0.0167


In [18]:
def combined_loss(y_true, y_pred):
    adversary_weight = 0.75 # Modify this to increase or decrease the influence of the adversary
    return tf.losses.categorical_crossentropy(y_true, y_pred) - adversary_weight * adversary_score(y_true, y_pred)

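# Clone the architecture. Note that clone_model creates freshly initialized weights, so this model is trained from scratch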
model_adversary = keras.models.clone_model(model)

model_adversary.compile(optimizer='adam',
              loss=combined_loss,
              metrics=['accuracy', adversary_score])
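
The combined loss in the cell above subtracts the weighted adversary score from the cross-entropy, so minimising the loss rewards both predicting the right word and putting probability on words with many 'e's. The NumPy sketch below illustrates the trade-off for one training example; the three-word vocabulary, the predictions and the helper `toy_combined_loss` are all made up for illustration.

```python
import numpy as np

adversary_weight = 0.75
e_counts = np.array([1, 0, 2])      # toy vocabulary: 'the', 'a', 'people'
true_index = 1                      # the correct next word is 'a'

def toy_combined_loss(y_pred):
    cross_entropy = -np.log(y_pred[true_index])   # small when the true word gets high probability
    score = np.sum(y_pred * e_counts)             # expected number of 'e's
    return cross_entropy - adversary_weight * score

confident = np.array([0.005, 0.99, 0.005])   # almost certain of the correct word
balanced = np.array([0.02, 0.66, 0.32])      # mostly correct, some weight on 'people'
e_heavy = np.array([0.05, 0.30, 0.65])       # mostly 'people'

for name, y_pred in [('confident', confident), ('balanced', balanced), ('e_heavy', e_heavy)]:
    print(name, round(float(toy_combined_loss(y_pred)), 4))
```

With these numbers the `balanced` prediction has the lowest loss: the model is pushed away from full confidence in the correct word, but not all the way to ignoring it. Raising `adversary_weight` shifts that balance further towards 'e'-heavy words.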

In [19]:
model_adversary.fit(
    X_train,
    y_train,
    epochs=1000,
    batch_size=1024,
    validation_data=(X_test, y_test),
    callbacks=[
        keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=10)
    ]
)

Epoch 1/1000
...
Epoch 31/1000


<keras.src.callbacks.History at 0x7fb21778a790>

Now let's compare the predictions of our original model and our adversary model:

In [20]:
original_predictions = model.predict(X_test)
adversary_predictions = model_adversary.predict(X_test)

print(f'Adversary score for original model: {np.mean(adversary_score(None, original_predictions))}')
print(f'Adversary score for adversary model: {np.mean(adversary_score(None, adversary_predictions))}')

for i in range(10):
    true_word = decode_one_hot(y_test[i])
    original_word = decode_one_hot(original_predictions[i])
    adversary_word = decode_one_hot(adversary_predictions[i])
    print(f'{true_word:10} | {original_word:10} | {adversary_word:10}')

Adversary score for original model: 0.5101372502659014
Adversary score for adversary model: 1.6916764371761261
of         | of         | between   
march      | the        | the       
the        | the        | the       
for        | to         | september 
this       | the        | the       
word       | of         | between   
make       | be         | released  
and        | the        | the       
time       | are        | released  
this       | to         | between   


As we can see, introducing the adversary has dramatically increased the model's likelihood of choosing words with lots of 'e's in them: the average adversary score on the test set rose from about 0.51 to about 1.69. Of course, in this setting, that comes at a cost to the model's accuracy. However, in a real-world setting, we would use a more sophisticated adversary, and we would evaluate the model with a more sophisticated metric than just accuracy.

So there you have it! We have built our own GPT model, and we have seen how we can use an adversary to obtain specific behaviour. Of course, there are many more details that go into building a model like ChatGPT, but this is the core of it. I hope you enjoyed this tutorial!