## Train a character-level GPT on some text data

The inputs here are plain text files, which we chop up into individual characters and then train GPT on. So you could say this is a char-transformer instead of a char-rnn. Doesn't quite roll off the tongue as well. In this example we will feed it some Shakespeare, which we'll get it to predict at the character level.

In [None]:
# make deterministic
from pytorch_lightning import seed_everything
seed_everything(42)


In [2]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

In [3]:
import math
from torch.utils.data import Dataset, DataLoader

class CharDataset(Dataset):

    def __init__(self, data, block_size):
        # sort so the char <-> index mapping is identical on every run
        chars = sorted(set(data))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))

        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.block_size = block_size
        self.vocab_size = vocab_size
        self.data = data

    def __len__(self):
        return math.ceil(len(self.data) / (self.block_size + 1))

    def __getitem__(self, idx):
        # we're actually going to "cheat" and pick a spot in the dataset at random
        i = np.random.randint(0, len(self.data) - (self.block_size + 1))
        chunk = self.data[i:i+self.block_size+1]
        dix = [self.stoi[s] for s in chunk]
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y
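
To make the input/target layout concrete, here is a minimal sketch of the pair that `__getitem__` builds, using a toy string and plain Python lists instead of tensors. The string, the small `block_size`, and the helper `make_pair` are made up for illustration and are not part of the notebook.

```python
# Toy illustration of the (x, y) pairs built by CharDataset.__getitem__.
# The text, block_size and make_pair helper are hypothetical stand-ins.
text = "hello world"
block_size = 4

chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> character

def make_pair(start):
    # take block_size + 1 characters so inputs and targets overlap by block_size
    chunk = text[start:start + block_size + 1]
    ids = [stoi[ch] for ch in chunk]
    return ids[:-1], ids[1:]  # y is x shifted one character to the left

x, y = make_pair(0)
decoded_x = "".join(itos[i] for i in x)
decoded_y = "".join(itos[i] for i in y)
print(decoded_x, "->", decoded_y)  # hell -> ello
```

Each position in `x` is paired with the character that follows it, so a single chunk supplies `block_size` next-character prediction targets.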


In [4]:
block_size = 128 # spatial extent of the model for its context

In [9]:
# download the tiny Shakespeare text from the karpathy/char-rnn repository
! wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2020-08-19 16:03:55--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.64.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.64.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2020-08-19 16:03:55 (42.3 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [10]:
# you can download this file at https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt
text = open('input.txt', 'r').read() # don't worry we won't run out of file handles
train_dataset = CharDataset(text, block_size) # one line of poem is roughly 50 characters
train_loader = DataLoader(train_dataset, batch_size=256, num_workers=4)

data has 1115394 characters, 65 unique.
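
As a quick sanity check on what one epoch means here, the arithmetic below derives the number of samples, batches, and tokens per epoch from the printed data size and the settings used above. It assumes the `DataLoader` default of `drop_last=False`, so the final partial batch is kept.

```python
import math

data_size = 1115394  # characters, taken from the printed output above
block_size = 128
batch_size = 256

# CharDataset.__len__: one sample per (block_size + 1) characters, rounded up
samples_per_epoch = math.ceil(data_size / (block_size + 1))
# assumes the DataLoader keeps the last partial batch (drop_last=False)
batches_per_epoch = math.ceil(samples_per_epoch / batch_size)
# each sample contributes block_size prediction targets
tokens_per_epoch = samples_per_epoch * block_size

print(samples_per_epoch, batches_per_epoch, tokens_per_epoch)  # 8647 34 1106816
```

Because `__getitem__` draws a random offset on every call, an "epoch" here is 8647 random windows rather than one exact pass over the text.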


In [11]:
from mingpt.model import GPT
model = GPT(vocab_size=train_dataset.vocab_size, 
            block_size=train_dataset.block_size,
            n_layer=8, 
            n_head=8, 
            n_embd=512, 
            learning_rate=6e-4)

In [12]:
from pytorch_lightning import Trainer
from mingpt.lr_decay import LearningRateDecayCallback

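# scheduler: warm up over the first 512*20 tokens, then decay the learning rate
# until final_tokens, taken here as the total token count over all 500 epochs
lr_decay = LearningRateDecayCallback(learning_rate=6e-4, warmup_tokens=512*20,
                                    final_tokens=500*len(train_dataset)*block_size)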

trainer = Trainer(gpus=1, precision=16, max_epochs=500,
                  gradient_clip_val=1.0, 
                  callbacks=[lr_decay], 
                  progress_bar_refresh_rate=1, 
                  row_log_interval=1)
trainer.fit(model, train_loader)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.

  | Name    | Type       | Params
---------------------------------------
0 | tok_emb | Embedding  | 33 K  
1 | drop    | Dropout    | 0     
2 | blocks  | Sequential | 25 M  
3 | ln_f    | LayerNorm  | 1 K   
4 | head    | Linear     | 33 K  



                    When using EvalResult(early_stop_on=X) or TrainResult(early_stop_on=X) the
                    'monitor' key of ModelCheckpoint has no effect.
                    Remove ModelCheckpoint(monitor='loss) to fix')
                







1
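
The `warmup_tokens` and `final_tokens` arguments suggest a schedule driven by the number of tokens processed. The sketch below is an assumption modeled on the schedule in the original minGPT trainer (linear warmup, then cosine decay down to a 10% floor); it is not taken from `mingpt.lr_decay`, and the function name `lr_multiplier`, the `floor` parameter, and the example `final_tokens` value are hypothetical.

```python
import math

def lr_multiplier(tokens, warmup_tokens, final_tokens, floor=0.1):
    """Factor applied to the base learning rate after `tokens` tokens (assumed schedule)."""
    if tokens < warmup_tokens:
        # linear warmup from 0 up to the base learning rate
        return tokens / max(1, warmup_tokens)
    # fraction of the decay phase completed, capped at 1
    progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
    progress = min(1.0, progress)
    # cosine decay from 1 down to 0, never dropping below the floor
    return max(floor, 0.5 * (1.0 + math.cos(math.pi * progress)))

base_lr = 6e-4
warmup_tokens = 512 * 20   # same value as in the cell above
final_tokens = 1_000_000   # illustrative value only
for t in (0, warmup_tokens // 2, warmup_tokens, final_tokens):
    print(t, base_lr * lr_multiplier(t, warmup_tokens, final_tokens))
```

Under this assumed schedule the rate climbs to `6e-4` by the end of warmup and bottoms out at one tenth of that once `final_tokens` have been seen.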

In [18]:
# alright, let's sample some character-level Shakespeare
from mingpt.utils import sample

context = "O God, I code but"
x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(model.device)
y = sample(model, x, 1000, temperature=0.9, sample=True, top_k=5)[0]
completion = ''.join([train_dataset.itos[int(i)] for i in y])
print(completion)

O God, I code but their friends.

KING EDWARD IV:
Thou hast thronge indable thy father friar,
Stand up and desperite, should virtuous advit.

SICINIUS:
Sir, the king of words this land him.

BIONDA:
You marry; my lord.

SICINIUS:
 faith, know, you say, My company.

MENENIUS:
You passion, this name:
If she do seat your sight, and no more,
So save man than still, what says 'tis more commongt
To sling hell bit will be bastanded of your deliver,
Remither than shall still, his land hand;
More im thou not, and the subject more,
Stime at eample, and saffe his corder--feath, this
manify stiff his life, and what may live, and
Nor what shorn compassion to my sover; but I do,
I'll commplainly to still, be born him: I am thought
In shhe yould still, and say 'anoth;
For though here do selfs and consul,
With leave more brings and ours, catisfied,
Shaill and yourself to your most to think,
Where believes their dince and thou ne'er to kithfull;
With hom they have do your high earth to thing,
Which sha
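
The `temperature` and `top_k` arguments control how each next character is drawn. The following is a minimal pure-Python sketch of temperature scaling followed by top-k filtering and sampling. It is an illustration of the general technique, not the code in `mingpt.utils.sample`; the function name `sample_top_k` and the example logits are hypothetical.

```python
import math
import random

def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one index from logits after temperature scaling and top-k filtering."""
    # temperatures below 1 sharpen the distribution, above 1 flatten it
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        k = min(top_k, len(scaled))
        cutoff = sorted(scaled, reverse=True)[k - 1]
        # everything below the k-th largest logit gets zero probability
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    # numerically stable softmax
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs

# hypothetical logits over a 4-character vocabulary
idx, probs = sample_top_k([2.0, 1.0, 0.5, -1.0], temperature=0.9, top_k=2,
                          rng=random.Random(0))
print(idx, [round(p, 3) for p in probs])
```

With `top_k=2` only the two highest-scoring characters can ever be drawn, which keeps samples from wandering into very unlikely characters at the cost of some diversity.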

In [None]:
# well that was fun...