<h1><center>Example of training nanoGPT</center></h1>

In [1]:
from pathlib import Path

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

from src import config
from src.data.dataset import NextTokenDataset
from src.data.tokenizer import CharTokenizer
from src.model.gpt_language_model.gpt import GPTLanguageModel
from src.model.trainer import Trainer
from src.utils.device import get_device
from src.utils.seed import set_seed
from src.utils.arguments import grab_arguments

In [2]:
# If DEBUG is True, a small GPT is trained; otherwise a large one.
DEBUG = True
# In debug mode run on the CPU: a small model trains faster there than on a GPU.
DEVICE = get_device(prioritize_gpu=False) if DEBUG else get_device()

### Step 1: Load the data

For this simple model we use the tiny shakespeare dataset. It contains a little over 1 million characters and takes slightly more than 1 MB on disk, so it is quite small. The goal of this repo is learning rather than training a perfect language model, so this dataset is sufficient.

In [3]:
data_path = Path.cwd().parents[1] / config.datasets.tiny_shakespeare.file_path

In [4]:
with open(data_path, "r", encoding="utf-8") as fin:
    text = fin.read()

print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


As we can see, the text consists of dialogue blocks, each starting with the speaker's name followed by their lines.

### Step 2: Tokenize the text

The model cannot work with raw characters, so we have to transform the sequence of characters into a sequence of indices, where each index gives the position of the character in the vocabulary.

For example, the input 'abc' is transformed into [1, 2, 3], given the vocabulary {'a': 1, 'b': 2, 'c': 3}.
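To make the mapping concrete, here is a minimal sketch of a character-level tokenizer. The class `SimpleCharTokenizer` is hypothetical and written only for illustration; it is not the repo's `CharTokenizer`, whose internals are not shown in this notebook. It assumes the vocabulary is the sorted set of unique characters, with indices starting at 0.

```python
class SimpleCharTokenizer:
    """Hypothetical character tokenizer, for illustration only."""

    def __init__(self, corpus: str) -> None:
        # Assumption: the vocabulary is every unique character, sorted
        chars = sorted(set(corpus))
        self.char_to_index = {ch: idx for idx, ch in enumerate(chars)}
        self.index_to_char = {idx: ch for ch, idx in self.char_to_index.items()}

    @property
    def vocab_size(self) -> int:
        return len(self.char_to_index)

    def encode(self, text: str) -> list[int]:
        # Look up the index of every character
        return [self.char_to_index[ch] for ch in text]

    def decode(self, indices: list[int]) -> str:
        # Map the indices back to characters and join them
        return "".join(self.index_to_char[idx] for idx in indices)


toy_tokenizer = SimpleCharTokenizer("abc")
encoded = toy_tokenizer.encode("cab")
print(encoded)  # [2, 0, 1]
print(toy_tokenizer.decode(encoded))  # cab
```

Encoding followed by decoding returns the original string, which is the property we rely on later when turning generated indices back into text.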

In [5]:
tokenizer = CharTokenizer(corpus=text)
data = torch.tensor(tokenizer.encode(text), dtype=torch.long)

print("Printing mapping of the 10 first characters.")
for idx in range(10):
    print(f"{text[idx]} -> {data[idx]}")

Printing mapping of the 10 first characters.
F -> 18
i -> 47
r -> 56
s -> 57
t -> 58
  -> 1
C -> 15
i -> 47
t -> 58
i -> 47


### Step 3: Prepare dataloader

First we need to split the data into two parts: train and test. The train part is used during training, the test part during evaluation. Evaluation shows how well the trained model predicts on unseen data.

In [6]:
# 90% for training, 10% for evaluation
test_split = int(len(data) * config.dataloader.test_split)
train_data, test_data = data[:test_split], data[test_split:]

The dataloader creates batches of (inputs, targets) tuples: the first element holds the inputs, the second holds the targets. Both are needed for the training and evaluation steps.
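The (inputs, targets) pairs for next-token prediction can be sketched as follows. This is an illustration under assumptions: the helper `next_token_pairs` is hypothetical, it uses plain Python lists instead of tensors, and it slides the window one token at a time. The repo's `NextTokenDataset` may build its pairs differently.

```python
def next_token_pairs(tokens: list[int], context_size: int) -> list[tuple[list[int], list[int]]]:
    """Hypothetical helper: build (inputs, targets) windows for next-token prediction."""
    pairs = []
    # Assumption: the window moves one token at a time
    for start in range(len(tokens) - context_size):
        inputs = tokens[start : start + context_size]
        # Targets are the same window shifted one position to the right
        targets = tokens[start + 1 : start + context_size + 1]
        pairs.append((inputs, targets))
    return pairs


# Indices of the characters "First" taken from the tokenizer output above
pairs = next_token_pairs([18, 47, 56, 57, 58], context_size=3)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

At every position the target is the token that follows the corresponding input token, so one window of length `context_size` provides `context_size` training examples at once.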

In [7]:
model_config = config.model.gpt.size.small
context_size = model_config.context_size
batch_size = model_config.batch_size

# dataset class creates pairs (inputs, targets)
train_dataset = NextTokenDataset(train_data, context_size)
test_dataset = NextTokenDataset(test_data, context_size)

# dataloader creates batches of pairs efficiently
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, num_workers=1)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, num_workers=1)

In [8]:
model_config = config.model.gpt.size.small if DEBUG else config.model.gpt.size.large
context_size = model_config.context_size
batch_size = model_config.batch_size

# dataset class creates pairs (inputs, targets)
train_dataset = NextTokenDataset(train_data, context_size)
test_dataset = NextTokenDataset(test_data, context_size)

# dataloader creates batches of pairs efficiently
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, num_workers=config.dataloader.num_workers)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, num_workers=config.dataloader.num_workers)

### Step 4: Train the model

This step is straightforward: create the trainer, which contains the logic for training and evaluation, and use it to train the model.
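The internals of `Trainer` are not shown in this notebook. As a rough sketch of the general pattern such a class often follows (train for an epoch, evaluate, keep the weights with the lowest eval loss), consider the code below. The function `train_with_best_tracking` and its three callbacks are hypothetical placeholders and not the repo's API.

```python
import math
from typing import Callable


def train_with_best_tracking(
    run_train_epoch: Callable[[], None],
    run_eval: Callable[[], float],
    save_model: Callable[[], None],
    epochs: int,
) -> float:
    """Hypothetical loop: train, evaluate, save whenever the eval loss improves."""
    best_loss = math.inf
    for _ in range(epochs):
        run_train_epoch()
        eval_loss = run_eval()
        # Save only when the model improves on unseen data
        if eval_loss < best_loss:
            best_loss = eval_loss
            save_model()
    return best_loss


# Toy usage with made-up eval losses instead of a real model
fake_eval_losses = iter([2.5, 2.0, 2.2])
saved_checkpoints = []
best = train_with_best_tracking(
    run_train_epoch=lambda: None,
    run_eval=lambda: next(fake_eval_losses),
    save_model=lambda: saved_checkpoints.append("checkpoint"),
    epochs=3,
)
print(best, len(saved_checkpoints))  # 2.0 2
```

Starting from an infinite best loss means the first evaluation always triggers a save, and later epochs overwrite it only if they do better.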

In [9]:
set_seed(config.seed)

model = GPTLanguageModel(
    vocab_size=tokenizer.vocab_size,
    **grab_arguments(GPTLanguageModel, model_config),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=model_config.learning_rate)
trainer = Trainer(model, optimizer, train_dataloader, test_dataloader, DEVICE)
trainer.train(epochs=model_config.epochs)

2023-02-11 13:50:50.507 | DEBUG    | src.model.trainer:train:85 - Training on 'cpu' device




train: 100%|##########| 31371/31371 [05:21<00:00, 97.69it/s, loss=2.15] 
eval: 100%|##########| 3486/3486 [00:15<00:00, 231.80it/s, loss=2.04]
2023-02-11 13:56:26.708 | INFO     | src.model.trainer:train:120 - Current eval loss is `2.0488` which is smaller than current best loss is `inf`; saving the model...
2023-02-11 13:56:26.715 | INFO     | src.model.trainer:train:126 - Best model is saved.


Eval averaged loss: 2.0488
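To get a feeling for this number, the loss can be converted into perplexity. The sketch below assumes the reported value is a cross-entropy measured in nats (natural logarithm), which is the usual convention but is not confirmed by the output above.

```python
import math

eval_loss = 2.0488  # averaged eval loss reported above

# Assumption: the loss is a cross-entropy in nats, so perplexity = exp(loss)
perplexity = math.exp(eval_loss)
print(f"perplexity: {perplexity:.2f}")  # perplexity: 7.76
```

Under that assumption, the model is on average about as uncertain as if it were choosing uniformly among roughly 8 characters at each step.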


### Step 5: Generate new characters

Now that we have a trained model, we can use it to generate new characters.

All we need is to provide a context, and the model will try to continue the text. If we provide a tensor of zeros, we effectively give the model no meaningful context.
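Continuing a text is typically an autoregressive loop: predict a distribution over the next token, sample from it, append the sample to the context, and repeat. The sketch below illustrates that loop in plain Python. The functions `generate` and `toy_next_token_probs` are hypothetical stand-ins, and the repo's `model.generate` may differ in its details.

```python
import random
from typing import Callable


def generate(
    next_token_probs: Callable[[list[int]], list[float]],
    context: list[int],
    max_new_tokens: int,
    context_size: int,
    rng: random.Random,
) -> list[int]:
    """Hypothetical autoregressive sampling loop."""
    tokens = list(context)
    for _ in range(max_new_tokens):
        # The model only sees the last `context_size` tokens
        window = tokens[-context_size:]
        probs = next_token_probs(window)
        # Sample the next token according to the predicted probabilities
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


def toy_next_token_probs(window: list[int]) -> list[float]:
    # Stand-in for a trained model with a vocabulary of 3 tokens:
    # it favours the token that follows the last one cyclically
    probs = [0.1, 0.1, 0.1]
    probs[(window[-1] + 1) % 3] = 0.8
    return probs


generated = generate(toy_next_token_probs, context=[0], max_new_tokens=5, context_size=4, rng=random.Random(0))
print(generated)
```

Sampling instead of always picking the most likely token is what makes the output vary from run to run unless the seed is fixed.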

In [10]:
def generate_text(context: torch.Tensor) -> str:
    return tokenizer.decode(model.generate(context, max_new_tokens=100).squeeze().tolist())


context = torch.zeros((1, 1), dtype=torch.long, device=DEVICE)
print(generate_text(context))



GRAMIO:
If you tith will heembjunovehar uphald sir; dea
And dis if and hovess'd, so shem doted
Shal


Alternatively, we can provide the first 10 characters of the text as context:

In [11]:
context = torch.tensor(tokenizer.encode(text[:10]), device=DEVICE).unsqueeze(dim=0)
print(generate_text(context))

First Citio! lovess,
Ang shall besse and my she heas oung for.
I piral'd prithe my shim belloth,
As the lay. M
