Context: using Wizard of Oz (for now) to train an LLM. 

Why am I doing this? To better understand how LLMs work, understand the math involved and the programming required. This will help me to understand LLM use cases, their limitations, and hopefully, their commercial application moving forward. 

I'm highly interested in NLP, AI, etc., so this project will serve as a foundation for demonstrating some of the skills I'm acquiring and looking to deploy in the future.

# Setup and Data Inspection

In [1]:
import torch
import torch.nn as nn
import numpy as np
from torch.nn import functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

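# Hyperparameters that control batching and training
block_size = 8        # number of characters in each training sequence
batch_size = 4        # number of sequences processed in parallel
max_iters = 10000     # number of training steps
learning_rate = 3e-4  # step size used by the optimizer
eval_iters = 250      # batches averaged per loss estimate; also the logging interval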

cuda


In [2]:
with open('WizOfOz.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print(len(text)) #print length of text

250472


In [3]:
chars = sorted(set(text))
print(chars) #show all character parts

vocab_size = len(chars)

['\n', ' ', '!', '"', '$', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '—', '‘', '’', '“', '”', '•', '™', '\ufeff']
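
The last entry in the list above, `'\ufeff'`, is a Unicode byte-order mark (BOM) rather than a real character of the story. A minimal sketch, assuming it comes from a BOM at the very start of the file: Python's `utf-8-sig` codec drops a leading BOM while reading. The file `bom_demo.txt` and its contents are hypothetical, created only for this demonstration.

```python
import os
import tempfile

# Hypothetical demo file: UTF-8 text saved with a leading byte-order mark (BOM).
path = os.path.join(tempfile.mkdtemp(), "bom_demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\ufeffDorothy lived in Kansas.")

# Plain utf-8 keeps the BOM as an ordinary character.
with open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# utf-8-sig drops a leading BOM if one is present.
with open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

print("BOM in vocabulary (utf-8):    ", "\ufeff" in set(with_bom))
print("BOM in vocabulary (utf-8-sig):", "\ufeff" in set(without_bom))
```

If the assumption holds for `WizOfOz.txt`, reading it this way would shrink the vocabulary by one and keep the BOM out of generated text.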


# Character-level Tokenisation
Each distinct character is mapped to an integer ID, so the whole text can be stored as a tensor of integers.

In [4]:
string_to_int = { ch:i for i, ch in enumerate(chars) }
int_to_string = { i:ch for i, ch in enumerate(chars) }
encode = lambda s: [string_to_int[c] for c in s]
decode = lambda l: ''.join([int_to_string[i] for i in l])
data = torch.tensor(encode(text), dtype = torch.long)
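
A quick way to check the tokeniser is a round trip: encode a string, decode the result, and confirm the original comes back. The sketch below repeats the same construction on a short toy string (an assumption standing in for the book) and uses `sample_`-prefixed names so it does not clash with the notebook's variables.

```python
# Toy text standing in for the full book (an assumption for illustration).
sample_text = "the road of yellow brick"

# Same construction as the notebook cell, applied to the toy text.
sample_chars = sorted(set(sample_text))
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}
sample_itos = {i: ch for i, ch in enumerate(sample_chars)}

def sample_encode(s):
    """Map each character to its integer id."""
    return [sample_stoi[c] for c in s]

def sample_decode(ids):
    """Map integer ids back to a string."""
    return "".join(sample_itos[i] for i in ids)

ids = sample_encode("yellow")
print(ids)
print(sample_decode(ids))

# A character that never occurred in the text has no id, so encoding it fails.
try:
    sample_encode("Z")
except KeyError as err:
    print("unknown character:", err)
```

The `KeyError` is the main limitation of this scheme: the model can only read and write characters that appeared in the training text.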

# Train / Validation Split

In [5]:
n = int(0.8*len(data)) #80% train split
train_data = data[:n]
val_data = data[n:]

# Sample a random batch: x holds the input sequences, y the same sequences shifted one character ahead
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

x, y = get_batch('train')
print('inputs:')
print(x)
print('targets:')
print(y)

inputs:
tensor([[ 3,  1, 76, 64, 61,  1, 63, 65],
        [47, 64, 61,  1, 76, 74, 61, 61],
        [69, 61, 75,  1, 58, 61, 62, 71],
        [78, 61,  1, 78, 61, 74, 81,  1]], device='cuda:0')
targets:
tensor([[ 1, 76, 64, 61,  1, 63, 65, 74],
        [64, 61,  1, 76, 74, 61, 61, 12],
        [61, 75,  1, 58, 61, 62, 71, 74],
        [61,  1, 78, 61, 74, 81,  1, 75]], device='cuda:0')
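
To make the relationship between the two tensors concrete, the sketch below takes the first row of the inputs and targets printed above and lists the training examples it contains. It uses plain Python lists rather than tensors; the names `x_row`, `y_row`, and `pairs` are introduced here for illustration.

```python
# First row of the inputs and targets printed above.
x_row = [3, 1, 76, 64, 61, 1, 63, 65]
y_row = [1, 76, 64, 61, 1, 63, 65, 74]

# The targets are the inputs shifted by one position.
assert y_row[:-1] == x_row[1:]

# Each position t gives one training example: the context up to t, then the next id.
pairs = [(x_row[: t + 1], y_row[t]) for t in range(len(x_row))]
for context, target in pairs:
    print(f"context {context} -> target {target}")
```

A single row of `block_size = 8` tokens therefore yields eight prediction problems. The bigram model defined later only looks at the last id of each context.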


# Tensor Batching for Predictions

Generating novel text depends on the model predicting the next token from the context it has seen so far. Each batch pairs a block of input tokens with the same block shifted one position ahead, so every position in the input has a target: the token that follows it. The idea is not for the model to simply regurgitate The Wizard of Oz, but to produce something new, and that requires learning which tokens tend to follow which. Note that the bigram model below conditions only on the single preceding token; it does not yet use the rest of the block as context.

In [6]:
@torch.no_grad()  # disable gradient tracking during evaluation to save memory and compute
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()  # switch back to training mode (only matters for layers such as dropout, which this model does not use)
    return out

In [12]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, index, targets=None):
        logits = self.token_embedding_table(index)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    def generate(self, index, max_new_tokens):  # index is a (B, T) tensor of token ids for the current context
        for _ in range(max_new_tokens):
            logits, loss = self(index)  # get predictions for every position
            logits = logits[:, -1, :]  # keep only the last time step; becomes (B, C)
            probs = F.softmax(logits, dim=-1)  # softmax turns the logits into probabilities
            index_next = torch.multinomial(probs, num_samples=1)  # sample the next token id, shape (B, 1)
            index = torch.cat((index, index_next), dim=1)  # append it to the running sequence, shape (B, T+1)
        return index

model = BigramLanguageModel(vocab_size)
m = model.to(device)

context = torch.zeros((1,1), dtype=torch.long, device=device)
generated_chars = decode(m.generate(context, max_new_tokens=500)[0].tolist())
print(generated_chars)

# The embedding table is a vocab_size x vocab_size grid: row i holds the scores for which character comes after character i.
# Logits are raw, unnormalised scores, not probabilities; softmax converts each row into a probability distribution over the next character.



B
,;s![33BOH$;gix)sU$Nmf/yf[wh2:'El)LUZX:’]nDFNxa‘W'rK*;T/?j8C_drMa-fbmj/b/g-QI%F”J”’uryI_M•b57'Xp*UD6k6mcn.‘”.c_JGek —3nZXl—’7“EoT
z8Xq﻿A3i)llXhQ]e-h7PAT;SvY50cZ!—S“v’v"(’mqy5$"cI_j/4I p﻿rb[TB&M'E1z%IwN1Igv!g
IB[f?KFMBPygB&eV'smU9“A“
]_mhPb/fOYBipVOfgVr
]nP6”Y$Ah[7*y;TTg4X8o•4gFa-c"LOHEN5“vpL&.Fvyh’PkxaWE—F﻿Yb™*T?]KyHfn*u?$w2[m_4)2bPB7“qLWdpB;jo:/ mi%)Y3t&yY1Lcq*4q)s.A?Gs
•7uix/’I 2T?;F67Vntl(z%,/Yzo“D“R™hVU™ZrMRV-lBIkrMS]9•''MDA$,wekZ-yvgcIt™i991bYkJ;I[”_cv"]eR_cIfq‘N5“vdmW?4’"VsJ,?‘9/3tX5™HKV


### What is this?
This is the first run of the neural net, before any training, so we can call it the default state. The weights are still at their random initial values, and these are the kinds of predictions they produce. They are bad, but they make a useful reference point for what a bigram model does under the hood with no training: it is not being told to improve, it simply samples from unoptimised weights. It's like asking someone who can't read English to make sense of an English text. They would mostly see gibberish, and might latch onto a few letter combinations that happen to resemble a language they already know (let's say Hungarian) and think, "hey, that part looks meaningful."
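
To make the step from scores to characters concrete, here is a small sketch in plain Python (not the notebook's PyTorch code). The four-character vocabulary and both sets of logits are made-up numbers chosen for illustration; `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def softmax(logits):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-character vocabulary (made-up numbers).
toy_vocab = ["a", "b", "c", "d"]
untrained_logits = [0.3, -1.1, 0.8, -0.2]  # random scores with no relation to any text
trained_logits = [3.0, 0.5, -1.0, -2.0]    # strong preference for "a"

untrained_probs = softmax(untrained_logits)
trained_probs = softmax(trained_logits)
print([round(p, 3) for p in untrained_probs])
print([round(p, 3) for p in trained_probs])

# Draw 10 characters according to the "trained" probabilities.
random.seed(0)
sample = random.choices(toy_vocab, weights=trained_probs, k=10)
print("".join(sample))
```

With random scores, every character keeps a real chance of being picked, which is why the untrained output looks like noise. Training reshapes the scores so that likely successors dominate.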

# Training, Optimising, Validation

As opposed to the output above, which is the "default state", the code below trains the neural net to learn the bigram associations between characters in the text, that is, how likely each character is to follow another.

In [16]:
# Training loop with the AdamW optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):  # `step` avoids shadowing Python's built-in iter()
    if step % eval_iters == 0:
        losses = estimate_loss()
        print(f"step:{step}, train loss: {losses['train']:.3f}, val loss: {losses['val']:.3f}")

    xb, yb = get_batch('train')  # sample a batch of data

    logits, loss = model(xb, yb)  # forward pass: compute logits and cross-entropy loss
    optimizer.zero_grad(set_to_none=True)  # clear gradients left over from the previous step
    loss.backward()  # backpropagate to get gradients for this batch
    optimizer.step()  # update the parameters

print(loss.item())  # loss on the final training batch

step:0, train loss: 2.674, val loss: 2.840
step:250, train loss: 2.660, val loss: 2.790
step:500, train loss: 2.653, val loss: 2.819
step:750, train loss: 2.674, val loss: 2.815
step:1000, train loss: 2.669, val loss: 2.846
step:1250, train loss: 2.627, val loss: 2.831
step:1500, train loss: 2.650, val loss: 2.807
step:1750, train loss: 2.642, val loss: 2.824
step:2000, train loss: 2.640, val loss: 2.802
step:2250, train loss: 2.623, val loss: 2.772
step:2500, train loss: 2.628, val loss: 2.781
step:2750, train loss: 2.646, val loss: 2.800
step:3000, train loss: 2.628, val loss: 2.765
step:3250, train loss: 2.605, val loss: 2.793
step:3500, train loss: 2.610, val loss: 2.765
step:3750, train loss: 2.590, val loss: 2.755
step:4000, train loss: 2.589, val loss: 2.786
step:4250, train loss: 2.588, val loss: 2.768
step:4500, train loss: 2.579, val loss: 2.766
step:4750, train loss: 2.579, val loss: 2.733
step:5000, train loss: 2.579, val loss: 2.766
step:5250, train loss: 2.578, val loss: 
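
A useful yardstick for reading these numbers is the loss of a model that guesses every character with equal probability. The sketch below computes it in plain Python, taking the vocabulary size to be 91, which is my count of the entries in the character list printed earlier (about 4.51 for that size).

```python
import math

# 91 is the number of entries in the character list printed earlier.
vocab_size = 91

# Cross-entropy of a model that assigns probability 1/vocab_size to every character.
uniform_loss = -math.log(1.0 / vocab_size)
print(f"uniform-guess loss: {uniform_loss:.3f}")

# exp(loss) reads as the effective number of equally likely choices per character.
for loss_value in (uniform_loss, 2.674, 2.578):
    print(f"loss {loss_value:.3f} -> about {math.exp(loss_value):.1f} choices per character")
```

The step-0 loss of 2.674 in the log is already well below the uniform value, which would be consistent with the training cell having been run on a model that was already partly trained. That is a guess from the numbers, not something the log states.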

In [17]:
context = torch.zeros((1,1), dtype=torch.long, device=device)
generated_chars = decode(m.generate(context, max_new_tokens=500)[0].tolist())
print(generated_chars)


"Macouthed re cu
on
scat berofthed pue:jurerexokDplus arotoue bes St hetashenb, w e t?)!"G2Alere l
WTofe, V(-7SFIte ed usmemankincou_$remedon therayo at f
nured
anon I wno hen foypxnthia ang Nthy!Glk,"
TofOBilize to h thedr bount  thex1T' at ber
OC™qmple C0Qran he Shor
cldles;*B-'d ubP4K1m, a coreren'sis?athes y,Whe;je, t
"N(otirthen w. aus ithe
Hb;ry f﻿, d rebl, bl JRAGlly OP(Ift ang ckirin ty
mPNe h piberedssthitthe.

TTRALKisarar ss is W5, t swana,"VOF]’tte at a
isthane s  me, at msimeig t?“I


### What is this?
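This is the output after training, with the loss checked against the validation split along the way.
Some things to note compared with the first output:
- There are more line breaks. The model has picked up how often newlines occur and what tends to precede them, so the text starts to break into lines of plausible length.
- There are some emergent word-like combinations, with more plausible vowel-consonant sequences.
- We could re-run the training cell to reduce the loss further and get more "word-ish" output, although a bigram model only sees one character of context, so the gains will level off.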
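
One way to see what the trained table is approximating is to build the same thing by counting. A bigram model with one score per (current, next) pair, trained with cross-entropy, is pushed toward the relative frequencies with which each character follows another. The sketch below computes those frequencies directly on a toy string (an assumption standing in for the book); the names are introduced here for illustration.

```python
from collections import Counter, defaultdict

# Toy text standing in for the book (an assumption for illustration).
toy_text = "the tin man and the lion"

# Count how often each character follows each other character.
follow_counts = defaultdict(Counter)
for current, nxt in zip(toy_text, toy_text[1:]):
    follow_counts[current][nxt] += 1

def next_char_probs(current):
    """Relative frequency of each character that followed `current` in the text."""
    counts = follow_counts[current]
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

print(next_char_probs("t"))
print(next_char_probs("n"))
```

Because these frequencies are the best a one-character context can do, they also set a floor on the loss, no matter how long the training cell runs.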

### Optimizers 
Optimizers update a neural net's parameters after each batch, using the gradient of the loss to decide which direction to move each weight and by how much. They are all built on calculus, linear algebra, and statistical learning principles.
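
In symbols, the simplest version of such an update is one step of plain gradient descent. The notation is an assumption introduced here: $\theta_t$ for the parameters at step $t$, $\eta$ for the learning rate (`learning_rate` in the notebook), and $L$ for the loss.

```latex
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)
```

The optimizers listed below differ mainly in how they turn the raw gradient $\nabla_\theta L(\theta_t)$ into the step that is actually taken.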

### Common optimizers:
1. **Mean squared error (MSE).** Strictly a loss function rather than an optimizer: it is the quantity an optimizer tries to minimise. When the goal is to predict a continuous output (e.g., prices), it measures the average squared difference between the predicted and actual values. It is common for regression tasks and penalises larger errors more than smaller ones because of the squaring. (This notebook uses cross-entropy loss instead, since it predicts which character comes next.)
2. **Gradient descent (GD).** Used to minimise the loss function, which measures how well the model predicts the targets from the input features. It searches for a minimum of the loss by updating the model's parameters iteratively: at each step it computes the gradient and moves the parameters a small distance in the direction of steepest descent, with the step size set by the learning rate. Variants include batch GD (uses the whole dataset per step), stochastic GD (one sample at a time), and mini-batch GD (a subset of the data).
3. **Momentum.** An extension of gradient descent. It keeps a running accumulation of past gradients and uses that, rather than the current gradient alone, to update the parameters. This smooths out the updates and keeps the optimizer moving in a consistent direction even when individual gradients are noisy or change sign. It also helps when the loss surface is much steeper in some directions than in others. Momentum can help the optimizer get through shallow local minima and accelerate convergence, which makes it useful for deep learning.
4. **RMSProp.** An optimisation algorithm that keeps a moving average of the squared gradient and uses it to adapt the learning rate for each parameter individually. This stabilises training by damping big swings in the updates, and it can improve convergence. RMSProp is particularly useful for recurrent neural networks and for training on mini-batches.
5. **Adam.** A popular optimizer that combines momentum and RMSProp. It uses moving averages of both the gradient and its squared value to adapt the learning rate for each parameter. It is often the default optimizer for deep learning models.
6. **AdamW.** A modification of Adam that applies weight decay directly to the parameters, separately from the gradient-based update. Weight decay shrinks the weights slightly at every step, which discourages overly complex models and helps prevent overfitting, so the model generalises better. This is the optimizer used in the training loop above.
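
The update rules in the list can be sketched in a few lines each. This is a minimal illustration in plain Python rather than PyTorch, on a made-up one-parameter loss f(w) = (w - 3)^2. The function names, hyperparameter values, and step counts are assumptions chosen for the demo, and the rules follow the standard textbook forms rather than every detail of `torch.optim`.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

def run_sgd(w, lr=0.1, steps=100):
    for _ in range(steps):
        w -= lr * grad(w)  # step against the gradient
    return w

def run_momentum(w, lr=0.1, beta=0.9, steps=300):
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(w)  # decaying sum of past gradients
        w -= lr * velocity
    return w

def run_adamw(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, steps=500):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        w -= lr * weight_decay * w           # decoupled weight decay (the "W")
        m = beta1 * m + (1 - beta1) * g      # moving average of the gradient
        v = beta2 * v + (1 - beta2) * g * g  # moving average of the squared gradient
        m_hat = m / (1 - beta1 ** t)         # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_sgd = run_sgd(0.0)
w_momentum = run_momentum(0.0)
w_adamw = run_adamw(0.0)                  # weight_decay=0 reduces to plain Adam
w_decay = run_adamw(0.0, weight_decay=0.1)  # decay pulls the result toward zero

print(f"SGD:                 {w_sgd:.4f}")
print(f"Momentum:            {w_momentum:.4f}")
print(f"Adam (no decay):     {w_adamw:.4f}")
print(f"AdamW (decay = 0.1): {w_decay:.4f}")
```

The first three runs should all settle close to the minimum at 3. The run with weight decay ends below it, which shows the trade-off: the decay term keeps weights small at the cost of not fitting the training loss exactly.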