### My Own LLM (Transformer Architecture)

Author: Antony Sikorski

In this notebook we do a little bit of setup, train the model, and analyze our results. 

This is far from the most efficient implementation available, and a number of things could be improved, but I have found building a model myself to be the most effective way to learn how GPTs work.

If you get a package error, please install the requirements:

In [None]:
#!pip install -r requirements.txt

In [None]:
# libraries 
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from dataclasses import dataclass
import json
from collections import Counter, defaultdict
from datasets import load_dataset

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset, TensorDataset
from torch.optim.lr_scheduler import StepLR
from jaxtyping import Int, Float
import tqdm
import transformers
import transformer_lens

from muutils.misc import shorten_numerical_to_str

#imports from files
from text_dataset import TextDataset
from model import GPTConfig, GPT

Let's check whether your machine has a GPU to run this on. A GPU can make training significantly faster, but you could also run out of GPU memory (a CUDA out-of-memory error). If your torch install has no CUDA support, don't worry about this: you can just use your CPU.

In [None]:
# report how many CUDA devices are visible and the name of the first one
if torch.cuda.is_available():
    print(torch.cuda.device_count())
    print(torch.cuda.get_device_name(0))

I can use my laptop GPU, which is good news! 

In [None]:
# necessary auto-reload for development on local machine
%load_ext autoreload
%autoreload 2

### Dataset

Let's use the TinyStories dataset, a well-known dataset that gained fame when small yet still coherent models were trained on it. The dataset consists of GPT-generated children's stories, so it has little diversity in content and should, in theory, be fairly easy to learn.

We only use a small chunk of the data for the sake of making training easy on a laptop. 

In [None]:
# grabbing the whole dataset
text_data = load_dataset("roneneldan/TinyStories")

#let's only use the training data
text_data = text_data["train"]

# and let's only use the first 10,000 stories
text_data = text_data[:10000]

#what does a story look like? 
print("\n Sample story (story #8):")
text_data['text'][7]

Okay, now let's join the stories into one long string and check its length in characters (we want this to be small, in the millions or less):

In [None]:
text_data = "\n\n".join(text_data['text'])
len(text_data)
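
Before training, a long string like this has to be cut into fixed-length examples. The `TextDataset` class imported above handles that in this notebook, and its code is not shown here, so the following is only a rough sketch of the general idea. The `chunk_tokens` helper is hypothetical, and whitespace splitting stands in for the real tokenizer.

```python
def chunk_tokens(token_ids: list[int], n_context: int) -> list[list[int]]:
    """Split a flat token list into non-overlapping windows of n_context tokens.

    Hypothetical helper for illustration; a trailing partial window is dropped.
    """
    n_full = len(token_ids) // n_context
    return [token_ids[i * n_context:(i + 1) * n_context] for i in range(n_full)]


# Stand-in "tokenizer": map each whitespace-separated word to an integer id.
text = "once upon a time there was a tiny model that liked stories"
vocab = {word: idx for idx, word in enumerate(sorted(set(text.split())))}
token_ids = [vocab[word] for word in text.split()]

examples = chunk_tokens(token_ids, n_context=4)
print(len(token_ids), "tokens ->", len(examples), "examples of length 4")
```

The real dataset may well use overlapping windows or a different example length; the point is only that the number of training examples scales with the length of the text divided by the context size.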

### Training 

Here we train the model! First, we define our training loop: 

In [None]:
def train(
	model: GPT,
	text: str,
	optimizer: torch.optim.Optimizer,
	scheduler: torch.optim.lr_scheduler._LRScheduler,
	device: torch.device | str = ("cuda" if torch.cuda.is_available() else "cpu"),
	batch_size: int = 8,
	max_batches: int|None = None,
	print_interval: int = 100,
	epochs: int = 1,
) -> tuple[GPT, list[dict]]:
	
	# move model to device
	print(f"moving model to device: {device}")
	model.to(device)
	
	# set up data
	print(f"setting up dataset from text of length {len(text)}")
	dataset: TextDataset = TextDataset(
		text=text, 
		tokenizer=model.tokenizer, 
		n_context=model.config.n_context,
	)
	print(f"\tset up dataset with {len(dataset)} examples, example lengths: {dataset.example_lengths()}")

	print(f"setting up dataloader from {len(dataset)} examples")
	dataloader: DataLoader = DataLoader(
		dataset, 
		batch_size=batch_size, 
		shuffle=True,
		pin_memory=True,
	)
	print(f"\tset up dataloader with {len(dataloader)} batches of size {batch_size}")

	# set up training loop
	print("training...")
	training_records: list[dict] = list()
	model.train()

	for epoch in range(epochs):
		print(f"Epoch {epoch + 1}/{epochs}\n")
		i: int; batch: Int[torch.Tensor, "batch n_ctx"]
		for i, batch in tqdm.tqdm(
			enumerate(dataloader),
			total=len(dataloader),
			desc="Training",
		):
			# move batch to device
			batch = batch.to(device)
			
			# break if we've reached the maximum number of batches
			if max_batches is not None and i >= max_batches:
				break

			# forward pass
			logits, loss = model(
				batch[:, :-1],
				targets=batch[:, 1:], # the targets are just the input, offset by one
			)

			# backward pass
			optimizer.zero_grad()
			loss.backward()
			optimizer.step()

			# record progress
			training_records.append({
				"batch": i,
				"loss": loss.item(),
			})

			if i % print_interval == 0:
				print(f"Batch {i}, Loss: {loss.item()}\n")

		scheduler.step()
		print(f"Updated learning rate to: {optimizer.param_groups[0]['lr']}")

	return model, training_records
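
The slicing in the forward pass is easier to see on a toy example. The sketch below uses a plain Python list of made-up token ids in place of a tensor row; it only illustrates the offset-by-one pairing of inputs and targets and is not taken from `model.py`.

```python
# Made-up token ids standing in for one row of a batch.
tokens = [11, 42, 7, 99, 3, 58]

inputs = tokens[:-1]   # what the model sees: every token except the last
targets = tokens[1:]   # what it should predict: every token except the first

# At each position the model reads the prefix so far and is scored on the next token.
for position, target in enumerate(targets):
    prefix = inputs[: position + 1]
    print(f"prefix {prefix} -> target {target}")
```

One consequence is that a row of N tokens yields N - 1 prediction targets.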

Now let's configure (define params for) our own model (which will be tiny) and do some setup before we train it: 

In [None]:
# using the GPT2 tokenizer, and making sure it has the same vocab size as the model
TOKENIZER: transformers.PreTrainedTokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
print(f"{TOKENIZER.vocab_size = } \n")


# set up a config for a small model
CONFIG: GPTConfig = GPTConfig(
	d_model=32,
	d_vocab=50257,
	n_context=128,
	n_blocks=2,
	n_head=4,
)

# sanity check: the tokenizer and the model config must agree on vocab size
assert TOKENIZER.vocab_size == CONFIG.d_vocab

# initialize the model
MODEL: GPT = GPT(CONFIG, TOKENIZER)

#two ways of printing number of model params
print("Muutils rounded model params: ")
print(f"MODEL.n_params = {shorten_numerical_to_str(MODEL.n_params)} \n")
print("Full model params: ")
print(f"MODEL.n_params = {MODEL.n_params}")

# choice of optimizer
OPTIMIZER: torch.optim.Optimizer = torch.optim.AdamW(MODEL.parameters(), lr=1e-2)
#OPTIMIZER: torch.optim.Optimizer = torch.optim.SGD(MODEL.parameters(), lr=1e-1)
# learning-rate scheduler (note that train() steps it once per epoch, not per batch)
SCHEDULER: StepLR = StepLR(OPTIMIZER, step_size=100, gamma=0.1)
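
`StepLR` multiplies the learning rate by `gamma` once every `step_size` calls to `scheduler.step()`. Because the `train` function above calls `scheduler.step()` once per epoch, it helps to check how many epochs pass before the first decay. The sketch below reproduces that schedule in closed form; the `step_lr` helper is hypothetical and written in plain Python so it runs without PyTorch.

```python
def step_lr(base_lr: float, step_size: int, gamma: float, n_steps: int) -> float:
    """Learning rate after n_steps calls to scheduler.step() under a StepLR-style rule."""
    return base_lr * gamma ** (n_steps // step_size)


# Same values as the optimizer and scheduler configured above.
base_lr, step_size, gamma = 1e-2, 100, 0.1

for n_steps in (0, 2, 99, 100, 200):
    print(n_steps, step_lr(base_lr, step_size, gamma, n_steps))
```

With `step_size=100` and only a couple of epochs, the rate stays at its initial value for the whole run. A smaller `step_size`, or stepping the scheduler per batch, would be needed for the decay to take effect.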

Let's train the model! 

In [None]:
MODEL_TRAINED, training_history = train(
	model=MODEL,
	text=text_data,
	optimizer=OPTIMIZER,
	scheduler=SCHEDULER,
	device=("cuda" if torch.cuda.is_available() else "cpu"),
	batch_size=10,
	max_batches=None,
	print_interval=100,
	epochs=2,
)


Now we save our trained model: 

In [None]:
torch.save(MODEL_TRAINED, "model.pt")

Some code for loading the model back in: 

In [None]:
#soon

### Analysis of Model

First, let's take a quick look at our loss: 

In [None]:
# plot the loss recorded for every batch (across all epochs)
losses = [record["loss"] for record in training_history]
plt.plot(losses)
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.title("Training Loss")
plt.show()
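
Per-batch loss is noisy, so a moving average can make the trend easier to read. The sketch below is one simple way to smooth such a curve; the `moving_average` helper and the toy loss values are made up for illustration and are not results from this training run.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Trailing moving average; early entries average over fewer points."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # drop the value that just fell out of the window
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Made-up noisy losses standing in for the recorded per-batch losses.
toy_losses = [10.0, 8.0, 9.0, 6.0, 7.0, 5.0]
print(moving_average(toy_losses, window=3))
```

In the notebook, something like `plt.plot(moving_average(losses, window=50))` could be drawn on top of the raw curve; the window size is a free choice.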

The loss appears to decrease, if shakily, over the training run. Now let's test out some prompts and see what our model gives us.

In [None]:
print(MODEL_TRAINED.generate("Once upon a time, Tim climbed"))

Not great, but it could be much worse. We'll come back to this and make it actually work. I'm pretty sure our model is only slightly smaller than the smallest TinyStories model (1M parameters), so I assume we can pull this off.
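
One way to sanity-check that size guess is a back-of-the-envelope parameter count. `model.py` is not shown here, so the sketch below assumes a standard GPT-2-style layout: learned token and position embeddings, attention with Q, K, V and output projections, an MLP with a 4x hidden width, two LayerNorms per block plus a final one, and an output projection tied to the token embedding. Every one of those choices is an assumption, the `estimate_gpt_params` helper is hypothetical, and `MODEL.n_params` remains the authoritative number.

```python
def estimate_gpt_params(d_model: int, d_vocab: int, n_context: int, n_blocks: int) -> dict[str, int]:
    """Rough parameter count for an assumed GPT-2-style model (hypothetical helper)."""
    token_embedding = d_vocab * d_model
    position_embedding = n_context * d_model
    # Q, K, V and output projections: four d_model x d_model matrices plus biases.
    attention = 4 * (d_model * d_model + d_model)
    # MLP: d_model -> 4 * d_model -> d_model, with biases.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d_model
    per_block = attention + mlp + layer_norms
    final_layer_norm = 2 * d_model
    return {
        "token_embedding": token_embedding,
        "position_embedding": position_embedding,
        "blocks": n_blocks * per_block,
        "final_layer_norm": final_layer_norm,
    }


# Same values as CONFIG above.
counts = estimate_gpt_params(d_model=32, d_vocab=50257, n_context=128, n_blocks=2)
for name, value in counts.items():
    print(f"{name:>20}: {value:,}")
print(f"{'total':>20}: {sum(counts.values()):,}")
```

Under these assumptions the vocabulary embedding dominates the count, so whether embeddings are included matters a lot when comparing against published model sizes.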