## Train a character-level GPT on some text data

The inputs here are simple text files, which we chop up into individual characters and then train a GPT on. So you could say this is a char-transformer instead of a char-rnn. Doesn't quite roll off the tongue as well. In this example we will feed it some Shakespeare, which we'll get it to predict at the character level.

In [1]:
# set up logging
import logging
logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
)

In [2]:
# make deterministic
from mingpt.utils import set_seed
set_seed(42)

In [3]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

In [4]:
import math
from torch.utils.data import Dataset

class CharDataset(Dataset):

    def __init__(self, data, block_size):
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data
    
    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        chunk = self.data[idx:idx + self.block_size + 1]
        # encode every character to an integer
        dix = [self.stoi[s] for s in chunk]
        """
        arrange data and targets so that the first i elements of x
        will be asked to predict the i-th element of y. Notice that
        the eventual language model will actually make block_size
        individual predictions at the same time based on this data,
        so we are being clever and amortizing the cost of the forward
        pass of the network. So for example if block_size is 4, then
        we could e.g. sample a chunk of text "hello", the integers in
        x will correspond to "hell" and in y will be "ello". This will
        then actually "multitask" 4 separate examples at the same time
        in the language model:
        - given just "h", please predict "e" as next
        - given "he" please predict "l" next
        - given "hel" predict "l" next
        - given "hell" predict "o" next
        
        In addition, because the DataLoader will create batches of examples,
        every forward/backward pass during training will simultaneously train
        a LOT of predictions, amortizing a lot of computation. In particular,
        for a batched input of integers X (B, T) where B is batch size and
        T is block_size and Y (B, T), the network will during training be
        simultaneously training to make B*T predictions, all at once! Of course,
        at test time we can parallelize across batch B, but unlike during training
        we cannot parallelize across the time dimension T - we have to run
        a forward pass of the network to recover the next single character of the 
        sequence along each batch dimension, and repeatedly always feed in a next
        character to get the next one.
        
        So yes there is a big asymmetry between train/test time of autoregressive
        models. During training we can go B*T at a time with every forward pass,
        but during test time we can only go B at a time, T times, with T forward 
        passes.
        """
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
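To make the shifted-by-one arrangement concrete, here is a minimal sketch of the "hello" example from the docstring in plain Python, without torch. The variable names are only for illustration; the real work is done by `CharDataset.__getitem__` above.

```python
# toy version of the "hello" example, block_size = 4
text = "hello"
block_size = 4

chars = sorted(set(text))                      # ['e', 'h', 'l', 'o']
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> int
itos = {i: ch for i, ch in enumerate(chars)}   # int -> char

# one chunk of (block_size + 1) characters, encoded as integers
dix = [stoi[ch] for ch in text[:block_size + 1]]
x, y = dix[:-1], dix[1:]                       # inputs and shifted targets

# position t of y is the target for the prefix x[:t + 1]
pairs = [("".join(itos[i] for i in x[:t + 1]), itos[y[t]])
         for t in range(block_size)]
for context, target in pairs:
    print(f"given {context!r} predict {target!r}")
```

One chunk therefore yields `block_size` training examples, which is where the B*T predictions per forward pass come from.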


In [5]:
block_size = 128 # spatial extent of the model for its context

In [6]:
# the tinyshakespeare file is available at https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt
# note that this run reads a different file, ../wiki.txt
text = open('../wiki.txt', 'r').read() # don't worry we won't run out of file handles
train_dataset = CharDataset(text, block_size) # one line of poem is roughly 50 characters

data has 418352 characters, 254 unique.


In [15]:
train_dataset.vocab_size

254

In [10]:
from mingpt.model import GPT, GPTConfig
mconf = GPTConfig(train_dataset.vocab_size, train_dataset.block_size,
                  n_layer=8, n_head=8, n_embd=512)
model = GPT(mconf)

04/04/2023 16:23:21 - INFO - mingpt.model -   number of parameters: 2.554573e+07
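As a sanity check on that number, here is a rough back-of-the-envelope count. It assumes a GPT-2-style layout: four `n_embd x n_embd` attention projections with bias, an MLP that expands to `4 * n_embd`, two LayerNorms per block plus a final one, learned token and position embeddings, and an output head without bias. That layout is an assumption on my part rather than something read off the model code, and `gpt_param_count` is a hypothetical helper, but the total does agree with the logged count.

```python
def gpt_param_count(vocab_size, block_size, n_layer, n_embd):
    """Rough parameter count for an assumed GPT-2-style layout."""
    def linear(n_in, n_out, bias=True):
        return n_in * n_out + (n_out if bias else 0)

    layer_norm = 2 * n_embd                      # weight + bias
    attention = 4 * linear(n_embd, n_embd)       # key, query, value, output projection
    mlp = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
    block = 2 * layer_norm + attention + mlp

    embeddings = vocab_size * n_embd + block_size * n_embd  # token + position
    head = linear(n_embd, vocab_size, bias=False)
    return embeddings + n_layer * block + layer_norm + head

total = gpt_param_count(vocab_size=254, block_size=128, n_layer=8, n_embd=512)
print(total, f"{total:e}")
```

Almost all of the parameters sit in the 8 transformer blocks; with a 254-character vocabulary the embeddings and the head are tiny by comparison.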


In [11]:
# Wenqi: by default the Trainer runs on all available GPUs; num_workers=4 in the TrainerConfig
# is a separate setting, the number of DataLoader worker processes

from mingpt.trainer import Trainer, TrainerConfig

# initialize a trainer instance and kick off training
tconf = TrainerConfig(max_epochs=2, batch_size=512, learning_rate=6e-4,
                      lr_decay=True, warmup_tokens=512*20, final_tokens=2*len(train_dataset)*block_size,
                      num_workers=4)
trainer = Trainer(model, train_dataset, None, tconf)
trainer.train()

epoch 1 iter 816: train loss 0.54068. lr 3.000451e-04: 100%|██████████| 817/817 [03:03<00:00,  4.46it/s]
epoch 2 iter 816: train loss 0.22175. lr 6.000000e-05: 100%|██████████| 817/817 [03:04<00:00,  4.43it/s]
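The learning rates in the log are consistent with a linear warmup over `warmup_tokens` followed by a cosine decay down to 10% of `learning_rate` at `final_tokens`. The sketch below reproduces that schedule in plain Python. The 10% floor and the exact formula are assumptions inferred from the logged values, not copied from the trainer, and `lr_at` is a hypothetical helper name.

```python
import math

def lr_at(tokens, learning_rate=6e-4, warmup_tokens=512 * 20,
          final_tokens=2 * (418352 - 128) * 128, floor=0.1):
    """Learning rate after `tokens` training tokens (assumed schedule)."""
    if tokens < warmup_tokens:
        # linear warmup from 0 to learning_rate
        mult = tokens / max(1, warmup_tokens)
    else:
        # cosine decay, clipped from below at `floor`
        progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
        mult = max(floor, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return learning_rate * mult

examples = 418352 - 128                    # len(train_dataset)
tokens_per_epoch = examples * 128          # every example has block_size targets
iters_per_epoch = math.ceil(examples / 512)

print(iters_per_epoch)                     # 817, as in the progress bar
print(f"{lr_at(tokens_per_epoch):e}")      # end of epoch 1
print(f"{lr_at(2 * tokens_per_epoch):e}")  # end of epoch 2
```

Halfway through the token budget the cosine term is close to zero, so the rate is about half of 6e-4, and at the end it sits at the floor of 6e-5.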


In [12]:
# alright, let's sample some character-level Shakespeare
from mingpt.utils import sample

context = "O God, O God!"
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
y = sample(model, x, 2000, temperature=1.0, sample=True, top_k=10)[0]
completion = ''.join([train_dataset.itos[int(i)] for i in y])
print(completion)

O God, O God! .
Lynne Codway. Born in Toronto, Ontario, Canada of Italian descent .
Georges Hanna Sabbagh. Georges Hanna Sabbagh was born at Alexandria in Egypt .
David Lister. David Lister (born 1930, Grimsby, Lincolnshire) is an eminent British origami historian .
Ely Devons. Ely Devons (29 July 1913 -- 28 December 1967), an economist and statistician, was born in Bangor, Gwynedd North Wales, lived most of his life in Dublin .
John Cobbett. John Cobbett is a sculptor born in Edinburgh in 1929 .
Brian Gallon. Brian Gallon (born 1972 Jerusalem) is a German video artist .
Robert Holden. Robert Holden is a British landscape architect born in Preston and educated at the University of Edinburgh .
Lew Johnston. Lew Johnston (born 1955) is a British-South African journalist and historian .
Benjamin Pine. Born in 1809 in Denmark, Dr Own grew up in the suburban town of Wallington .
Gritakumar E. Chitty. Gritakumar E Chitty (b. 14 June 1939, Colombo) is a Sri Lankan Jurists and former Registrar
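For intuition about what `temperature` and `top_k` do during sampling, here is a minimal pure-Python sketch of top-k sampling inside an autoregressive loop. It is not the `sample` function from `mingpt.utils`: `top_k_sample`, `generate` and `toy_logits` are hypothetical names, and the "model" is a toy stand-in that just prefers the next integer.

```python
import math
import random

def top_k_sample(logits, k, temperature=1.0, rng=random):
    """Sample one index from the k highest-scoring logits."""
    scaled = [l / temperature for l in logits]
    # the k-th largest value is the cutoff; ties at the cutoff are kept
    threshold = sorted(scaled, reverse=True)[k - 1]
    kept = [(i, l) for i, l in enumerate(scaled) if l >= threshold]
    # softmax over the survivors (subtract the max for numerical stability)
    peak = max(l for _, l in kept)
    weights = [math.exp(l - peak) for _, l in kept]
    return rng.choices([i for i, _ in kept], weights=weights, k=1)[0]

def generate(next_logits, context, steps, block_size, k, rng=random):
    """Autoregressive loop: one call to next_logits per generated token."""
    out = list(context)
    for _ in range(steps):
        window = out[-block_size:]   # the model only sees the last block_size tokens
        out.append(top_k_sample(next_logits(window), k, rng=rng))
    return out

def toy_logits(window):
    """Hypothetical stand-in for the model: prefers (last token + 1) mod 5."""
    favourite = (window[-1] + 1) % 5
    return [4.0 if i == favourite else 0.0 for i in range(5)]

rng = random.Random(0)
tokens = generate(toy_logits, [0], steps=10, block_size=4, k=1, rng=rng)
print(tokens)
```

With `k=1` this reduces to greedy decoding. The loop also shows the train/test asymmetry described in `CharDataset`: generating 10 tokens takes 10 sequential calls to the model.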

In [13]:
# well that was fun
context = "A little cat "
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
y = sample(model, x, 2000, temperature=1.0, sample=True, top_k=10)[0]
completion = ''.join([train_dataset.itos[int(i)] for i in y])
print(completion)

A little cat foundated in Great Britain, in 1825 and is one of the worlds leading television actress .
Michael Finney. Michael Finney is a professional magician .
Ian Hancock. Ian Hancock (Romani: Yanko le Redžosko) (born August 29, 1942) is a linguist, Romani scholar, and political advocate .
Graham Roberts. Graham Roberts (October 10, 1929--October 28, 2005) was an American psychologist born in Paris .
Nick Philip. Nick Philip (b. 1968 in London) is a graphic and multi-media artist and clothing designer operating out of the San Francisco Bay area .
Brendan Guilfoyle. Brendan Guilfoyle born 16 July 1984 in Kilkenny, Ireland is rugby league player for the Treaty City Titans in the Irish Elite League .
Han Zhidong. Han Zhidong (born 1975 Istanbul) is a German performance artist, animateur and politician .
Raymond R. Schumacher. Born in Chicago, Raymond R Schumacher attended Tilden Technical High School, studying engineering, and was awarded the Purdue Club's 1942 Kizer MVP Award in foot