# GPT-2 Implementation

#### Read in the text

In [3]:
# read the whole file into a single string
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# establish the characters and the vocabulary size
chars = sorted(list(set(text)))
vocab_size = len(chars)

#### Create encoder and decoder functions

In [4]:
# initialise lookup tables: character -> index and index -> character
char_ind = {}
ind_char = {}
for i, ch in enumerate(chars):
    char_ind[ch] = i
    ind_char[i] = ch

In [5]:
# encoder and decoder functions
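def encode(input_str: str) -> list:
    # map each character to its integer index
    return [char_ind[ch] for ch in input_str]

def decode(input_list: list) -> str:
    # map each integer index back to its character and join into a string
    return ''.join([ind_char[i] for i in input_list])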

Andrej Karpathy points out that there is a trade-off between vocabulary size and encoded sequence length. With a large vocabulary, a given piece of text encodes to a shorter sequence of integers; with a small vocabulary, such as the character-level one used here, the same text encodes to a longer sequence.
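
To make the trade-off concrete, the sketch below encodes one line of the text in two ways: character by character, and in chunks of two characters. The two-character chunking is only an illustrative stand-in for a larger-vocabulary tokenizer; it is not something the notebook uses.

```python
# Toy comparison of vocabulary size vs. encoded length (illustrative only).
sample = "before we proceed any further, hear me speak."

# Character-level: small vocabulary, long encoded sequence.
char_vocab = sorted(set(sample))
char_ids = [char_vocab.index(c) for c in sample]

# Hypothetical two-character tokens: larger vocabulary, shorter sequence.
pairs = [sample[i:i + 2] for i in range(0, len(sample), 2)]
pair_vocab = sorted(set(pairs))
pair_ids = [pair_vocab.index(p) for p in pairs]

print("char-level:", len(char_vocab), "symbols,", len(char_ids), "ids")
print("pair-level:", len(pair_vocab), "symbols,", len(pair_ids), "ids")
```

On this sample the pair-level encoding needs more distinct symbols but roughly half as many ids, which is the same trade-off in miniature.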

#### Convert the data into tensors with PyTorch

In this step we convert the encoded text into a PyTorch tensor. All later operations work on tensors, since this is the data structure that PyTorch models take as input and compute gradients over.

In [6]:
import torch

data = torch.tensor(encode(text), dtype=torch.long)

The output below shows that index `0` encodes the newline character and index `1` encodes a space.

In [7]:
print(data.shape, data.dtype, data[:100], text[:100], sep = '\n\n')

torch.Size([1115394])

torch.int64

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You
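
Newline and space end up at indices 0 and 1 because `sorted` orders characters by Unicode code point, and `'\n'` (10) and `' '` (32) come before every punctuation mark, digit and letter. A minimal check on a toy string (the string and the `toy_` names are illustrative; the lookup tables mirror the ones built earlier):

```python
toy_text = "First Citizen:\nYou"

# Same construction as the notebook's vocabulary and lookup tables.
toy_chars = sorted(set(toy_text))
toy_char_ind = {ch: i for i, ch in enumerate(toy_chars)}
toy_ind_char = {i: ch for ch, i in toy_char_ind.items()}

# The two smallest code points in the text take the first two indices.
print(repr(toy_ind_char[0]), repr(toy_ind_char[1]))

# Encoding followed by decoding should give back the original text.
encoded = [toy_char_ind[ch] for ch in toy_text]
decoded = ''.join(toy_ind_char[i] for i in encoded)
print(decoded == toy_text)
```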


#### Split up the data into training and evaluation

Here we've chosen a 90/10 split. 90% of our data will be used to train the model and the other 10% will be used to evaluate and test the model.

In [8]:
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

The code below is taken directly from Andrej Karpathy. It creates the batches of data we will train our model on.

In [9]:
torch.manual_seed(1337)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
batch_size = 4  # number of independent sequences processed in parallel
block_size = 8  # maximum context length for predictions

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

In [10]:
xb, yb = get_batch('train')
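
Each row of `xb` packs `block_size` training examples into one sequence: for every position `t`, the context `x[:t+1]` is paired with the target `y[t]`. The sketch below spells this out for the first nine ids from the `data[:100]` output above, using plain lists instead of tensors so that it runs on its own.

```python
block_size = 8
tokens = [18, 47, 56, 57, 58, 1, 15, 47, 58]  # block_size + 1 ids

x = tokens[:block_size]        # inputs
y = tokens[1:block_size + 1]   # targets: the same window shifted by one

pairs = []
for t in range(block_size):
    context = x[:t + 1]   # everything up to and including position t
    target = y[t]         # the token that follows that context
    pairs.append((context, target))
    print(f"when input is {context} the target is {target}")
```

This is why `y` is built from `data[i+1:i+block_size+1]` in `get_batch`: shifting by one lines each input position up with the token that comes next.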

### Bigram Language Model Architecture

The model architecture is fairly standard, and templates are available in the PyTorch documentation. For example, the word embeddings tutorial includes an `NGramLanguageModeler` class:

https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html

In [11]:
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)

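class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token id looks up a row of logits for the next token
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are (B, T) tensors of token ids
        logits = self.token_embedding_table(idx)  # (B, T, C)

        if targets is None:
            loss = None
        else:
            # cross_entropy expects (N, C) logits and (N,) targets
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) tensor of token ids in the current context
        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            logits = logits[:, -1, :]                           # keep the last time step: (B, C)
            probs = F.softmax(logits, dim=-1)                   # logits -> probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # sample one token: (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)             # append: (B, T+1)
        return idx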



In [12]:
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)

In [13]:
idx = torch.zeros((1,1), dtype=torch.long)                # kickstart generation with 0 tensor.
generated_tokens = m.generate(idx, max_new_tokens=100)[0] # generate 100 new tokens based on 0 tensor.
encoded_list = generated_tokens.tolist()                  # convert tensor into list. 
print(decode(encoded_list))                               # decode the list to give the actual output.
print(loss)


SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ
tensor(4.8786, grad_fn=<NllLossBackward0>)


The output above is gibberish because the model has not been trained yet: it is sampling characters from the probabilities produced by its randomly initialised weights.
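
As a sanity check on the loss of 4.88 printed above: a model that assigned equal probability to every character would score a cross-entropy of `-ln(1/vocab_size)`. The sketch below assumes `vocab_size = 65`, which the notebook never prints; the encoded output earlier shows ids up to 64, so 65 is the smallest value consistent with it.

```python
import math

vocab_size = 65  # assumption: not printed in the notebook

# Cross-entropy of a uniform prediction over the vocabulary.
uniform_loss = -math.log(1.0 / vocab_size)
print(f"uniform baseline: {uniform_loss:.4f}")
```

Under that assumption the baseline is about 4.17. The untrained model scores worse than this, which is plausible because its randomly initialised embeddings produce logits that are far from uniform.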

In [14]:
optimizer = torch.optim.AdamW(m.parameters(), lr=0.001)

batch_size = 32
for steps in range(1000):
    xb, yb = get_batch('train')

    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())

3.704137086868286


In [15]:
print(decode((m.generate(idx = torch.zeros((1,1), dtype=torch.long), max_new_tokens=300)[0].tolist())))


Wh;;Sq.f ustNzknc
kwgOj$dhPWr,SV?hsusiKpgXXUh;Apmem d?hESXI.i;TrJgkiF-oKbXCAA -botrngFCHAUQkn$

pn$w-gHoi?wtd!
LLULIfSK'bAw :M.ZtOptXEQcL?hfaofqbPd?OnonQQJMap$aypupIBYGUsZaI'ottllo..k$W$Akp?yl?ajKlzY!lx&QQLW? t,bXFkyhl-dmVsHeckhRl,jSClgjuk:3Iv
?OqlrV;!Plxfzgy;;
'mRjuBQ&xk!$
h
SiruDJgKuDny,S$ERf.?GSV


After 1,000 training steps the loss on the last batch has fallen from 4.88 to 3.70. The sample is still mostly gibberish, but some structure is starting to appear.
This is the simplest possible model because, as Andrej Karpathy puts it in his video, "the tokens aren't talking to each other": each prediction depends only on the current character. Further below we implement the Transformer model with multi-head attention.

The loss reported so far is very noisy because it is the loss of a single batch, the one from the last training step. Averaging the loss over many batches gives a more reliable estimate of how well the model is doing.
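
The effect of averaging can be seen on synthetic numbers, without a model. The sketch below treats each per-batch loss as a fixed "true" loss plus Gaussian noise (both values are made up for illustration) and compares the spread of single-batch readings with the spread of averages over 200 batches.

```python
import random
import statistics

random.seed(1337)
true_loss, noise_std = 2.5, 0.3   # made-up values for illustration
eval_iters = 200                  # batches per averaged estimate

def noisy_batch_loss():
    # stand-in for the loss measured on one random batch
    return random.gauss(true_loss, noise_std)

# 100 single-batch readings vs. 100 averaged estimates
single = [noisy_batch_loss() for _ in range(100)]
averaged = [
    statistics.mean([noisy_batch_loss() for _ in range(eval_iters)])
    for _ in range(100)
]

print(f"spread of single-batch losses: {statistics.stdev(single):.3f}")
print(f"spread of averaged losses:     {statistics.stdev(averaged):.3f}")
```

For independent noise, the spread of an average over `n` batches shrinks by roughly a factor of `sqrt(n)`, which is why the averaged estimates cluster much more tightly.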

In [22]:
# generalise some of our previously established code
model = BigramLanguageModel(vocab_size=len(chars))
device = 'cuda' if torch.cuda.is_available() else 'cpu'
m = model.to(device)
eval_iters = 200

In [23]:
# create the new loss estimation function
@torch.no_grad()  # gradients are not needed while evaluating
def estimate_loss():
    out = {}
    model.eval() # put the model into evaluation mode

    # evaluate the model on the training data and the evaluation data
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()  # average over eval_iters batches
    model.train() # switch back to training mode
    return out

In [24]:
max_iters = 10000
eval_interval = 1000

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(max_iters):
    # every eval_interval steps, report the averaged train and val loss
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']}")

    xb, yb = get_batch('train')

    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.7543, val loss 4.753664970397949
step 1000: train loss 3.7587, val loss 3.7701144218444824
step 2000: train loss 3.1483, val loss 3.1525073051452637
step 3000: train loss 2.8046, val loss 2.833693265914917
step 4000: train loss 2.6517, val loss 2.6624062061309814
step 5000: train loss 2.5611, val loss 2.5788073539733887
step 6000: train loss 2.5288, val loss 2.5393152236938477
step 7000: train loss 2.4887, val loss 2.5175559520721436
step 8000: train loss 2.4768, val loss 2.5043749809265137
step 9000: train loss 2.4709, val loss 2.4989840984344482


In [25]:
context = torch.zeros((1,1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=200)[0].tolist()))


Bure hendait wherMa ctasthinerasewasacl, mme; ur shucitrerer ow be Inghe oue wh ad s nchordorngonskelusen ndee ggotibomstrathat lou t myond the imen cce, ner bt s s s
FO:
Rowar me;

Horoucesprivingour


# Transformer Model

Before building the model, we look at a "mathematical trick" that the self-attention module relies on.

In [26]:
torch.manual_seed(1337)
B, T, C = 4, 8, 2            # dimensions -- Batch | Time | Channel.
x = torch.randn(B, T, C)     # initialise a random tensor of that shape.
x.shape

torch.Size([4, 8, 2])

Each sequence in the tensor above contains 8 tokens. We want the tokens to start talking to each other, with one constraint: the $x_i$ token may talk to all the previous tokens but NOT to any later token such as $x_{i+1}$. The goal of our model is to predict the next token, so letting a token see the future would leak the answer it is supposed to predict.

The simplest way to pass information along is, for every position $i$, to average the token vectors from the first position up to and including $x_{i}$.

In [28]:
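x_bow = torch.zeros((B, T, C))              # "bag of words": running averages
for b in range(B):
    for t in range(T):
        x_prev = x[b,:t + 1]                # tokens 0..t of sequence b: (t+1, C)
        x_bow[b,t] = torch.mean(x_prev, 0)  # average over the time dimension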


In [40]:
print(x[0], x_bow[0], sep='\n\n')

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])
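
The double loop above can be replaced by a single matrix multiplication, which is presumably the "trick" this section is building towards. A minimal sketch, assuming the same `B, T, C` shapes: a lower-triangular matrix of ones, normalised so that each row sums to 1, averages every position with everything before it. The names `wei` and `x_bow2` are illustrative.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Reference: explicit loop over batches and time steps.
x_bow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        x_bow[b, t] = x[b, :t + 1].mean(dim=0)

# Row t of `wei` puts weight 1/(t+1) on positions 0..t and 0 on later ones.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)

# (T, T) @ (B, T, C) broadcasts over the batch dimension -> (B, T, C).
x_bow2 = wei @ x

# The two results should agree up to floating-point rounding.
print(torch.allclose(x_bow, x_bow2, atol=1e-5))
```

The zeros above the diagonal of `wei` are what keep each token from seeing later tokens, so the causal constraint is built into the weights rather than into loop bounds.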
