# Tiny Shakespeare Language Model

A GPT-like language model trained on the tinyshakespeare dataset. This notebook was written while following Karpathy's 'Let's build GPT' video. The only notable difference is the use of SentencePiece subword tokenization instead of character-level tokenization.
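
For contrast, the character-level scheme that this notebook replaces can be sketched in a few lines. This is a minimal illustration, not code from this notebook: `sample`, `stoi`, `itos`, `encode_chars` and `decode_chars` are names chosen here for the example.

```python
# Character-level tokenization sketch (illustrative names, not part of this notebook).
sample = "First Citizen:\nBefore we proceed any further, hear me speak."

# The vocabulary is simply the sorted set of distinct characters.
chars = sorted(set(sample))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode_chars(s):
    """Map each character to its integer id."""
    return [stoi[c] for c in s]

def decode_chars(ids):
    """Map integer ids back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode_chars(sample)
print(len(chars), "distinct characters,", len(ids), "tokens")
print(decode_chars(ids) == sample)
```

A character-level tokenizer emits exactly one token per character, so a fixed `block_size` covers less text than it would with subword pieces. The SentencePiece model trained below uses a 1000-piece vocabulary instead.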

In [34]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!mkdir data
!curl https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -o data/tinyshakespeare

mkdir: data: File exists
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1089k  100 1089k    0     0  2729k      0 --:--:-- --:--:-- --:--:-- 2736k


In [36]:
import torch
import torch.nn as nn

from zeptogpt.model import SimpleGPT


In [37]:
with open('data/tinyshakespeare', encoding='utf-8') as f:
    text = f.read()

In [38]:
!mkdir models
import sentencepiece as spm
spm.SentencePieceTrainer.train(input='data/tinyshakespeare',
                               model_prefix='models/shakespeare_tokenizer_model',
                               vocab_size=1000,
                               character_coverage=1.0,
                               model_type='unigram',
                               remove_extra_whitespaces=False,
                               user_defined_symbols=["\n", "\r"])

mkdir: models: File exists


sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: data/tinyshakespeare
  input_format: 
  model_prefix: models/shakespeare_tokenizer_model
  model_type: UNIGRAM
  vocab_size: 1000
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  user_defined_symbols: 

  user_defined_symbols: 
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s

In [39]:
sp = spm.SentencePieceProcessor()
sp.load('models/shakespeare_tokenizer_model.model')
vocab_size = sp.get_piece_size()

In [40]:
data = torch.tensor(sp.encode(text))

traindata = data[:int(0.9 * len(data))]
testdata = data[int(0.9 * len(data)):]

torch.manual_seed(1337)
def get_batch(data, device, batch_size, block_size):
    ix = torch.randint(0, len(data) - block_size, (batch_size, ))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)

get_batch(traindata, 'cpu', 4, 8)

(tensor([[  3, 175,  13,  66, 610,  26,  27, 200],
         [ 97, 128,  10,   5,  77,  11,  46, 109],
         [ 39,  16,  12, 709,  30,   3,   3, 191],
         [101, 182,  20, 242,   5,  94, 388, 119]]),
 tensor([[175,  13,  66, 610,  26,  27, 200,  60],
         [128,  10,   5,  77,  11,  46, 109, 130],
         [ 16,  12, 709,  30,   3,   3, 191,  57],
         [182,  20, 242,   5,  94, 388, 119,  36]]))
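
Each row of the second tensor is the matching row of the first tensor shifted by one token, so a single row of length `block_size` packs `block_size` next-token prediction examples. The sketch below unpacks the first row of the output above into those examples. `x`, `y` and `examples` are names introduced here for illustration.

```python
# First row of the batch printed above (token ids copied from the notebook output).
x = [3, 175, 13, 66, 610, 26, 27, 200]
y = [175, 13, 66, 610, 26, 27, 200, 60]

# y is x shifted left by one position, plus one new token at the end.
assert y[:-1] == x[1:]

# Each position t yields one training example: predict y[t] from x[:t+1].
examples = [(x[: t + 1], y[t]) for t in range(len(x))]
for context, target in examples:
    print(f"context={context} -> target={target}")
```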

In [42]:
# Training
from tqdm import tqdm

embed_dims = 32
num_heads = 4
num_decoder_layers = 2
eval_iters = 100
eval_interval = 1000
num_training_iters = 10000
batch_size = 4
block_size = 8

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = SimpleGPT(vocab_size, embed_dims, block_size, num_heads, num_decoder_layers)
model.to(device)
print(model)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters())

@torch.no_grad()
def estimate_loss(dataset):
    losses = torch.zeros(eval_iters)
    model.eval()
    for i in range(eval_iters):
        inputs, targets = get_batch(dataset, device, batch_size, block_size)
        logits = model(inputs)
        B, T, C = logits.shape
        loss = loss_fn(logits.view(B*T, C), targets.view(B*T))
        losses[i] = loss.item()
    model.train()
    return losses.mean()


for i in tqdm(range(num_training_iters)):
    inputs, targets = get_batch(traindata, device, batch_size, block_size)
    optimizer.zero_grad()
    logits = model(inputs)
    B, T, C = logits.shape
    loss = loss_fn(logits.view(B*T, C), targets.view(B*T))
    loss.backward()
    optimizer.step()
    if i % eval_interval == 0 or i == num_training_iters - 1:
        print(f"Train Loss={estimate_loss(traindata)} Test Loss={estimate_loss(testdata)}")
    

SimpleGPT(
  (tok_emb_table): Embedding(1000, 32)
  (pos_emb_table): Embedding(8, 32)
  (decoder_blocks): Sequential(
    (0): DecoderBlock(
      (multi_headed_attn): MultiheadedSelfAttention(
        (heads): ModuleList(
          (0-3): 4 x CausalSelfAttentionHead(
            (query): Linear(in_features=32, out_features=8, bias=True)
            (key): Linear(in_features=32, out_features=8, bias=True)
            (value): Linear(in_features=32, out_features=8, bias=True)
          )
        )
      )
      (feed_forward): Sequential(
        (0): Linear(in_features=32, out_features=128, bias=True)
        (1): ReLU()
        (2): Linear(in_features=128, out_features=32, bias=True)
        (3): Dropout(p=0.5, inplace=False)
      )
      (ln1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      (ln2): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
    )
    (1): DecoderBlock(
      (multi_headed_attn): MultiheadedSelfAttention(
        (heads): ModuleList(
          (0

  0%|          | 11/10000 [00:00<07:42, 21.60it/s] 

Train Loss=7.220975875854492 Test Loss=7.2309088706970215


 10%|█         | 1015/10000 [00:10<02:09, 69.37it/s]

Train Loss=4.920136451721191 Test Loss=4.99624490737915


 20%|██        | 2014/10000 [00:19<01:54, 69.71it/s] 

Train Loss=4.739692211151123 Test Loss=4.630186080932617


 30%|███       | 3013/10000 [00:29<01:40, 69.71it/s] 

Train Loss=4.516595363616943 Test Loss=4.580789566040039


 40%|████      | 4023/10000 [00:38<01:26, 69.25it/s] 

Train Loss=4.2872633934021 Test Loss=4.447488784790039


 50%|█████     | 5021/10000 [00:48<01:14, 66.77it/s] 

Train Loss=4.198359966278076 Test Loss=4.414018630981445


 60%|██████    | 6017/10000 [00:58<00:59, 67.36it/s] 

Train Loss=4.194443225860596 Test Loss=4.3003010749816895


 70%|███████   | 7020/10000 [01:07<00:53, 55.91it/s] 

Train Loss=4.151915073394775 Test Loss=4.285052299499512


 80%|████████  | 8012/10000 [01:17<00:33, 59.91it/s] 

Train Loss=4.117393493652344 Test Loss=4.273078918457031


 90%|█████████ | 9013/10000 [01:27<00:17, 55.03it/s] 

Train Loss=4.110207557678223 Test Loss=4.218118190765381


100%|██████████| 10000/10000 [01:36<00:00, 103.52it/s]

Train Loss=3.9792938232421875 Test Loss=4.189849853515625
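
To put these numbers in context, they can be compared against the loss of a model that guesses uniformly over the vocabulary. The sketch below is a rough sanity check that uses only values printed above; the variable names are illustrative, and it assumes the loss is reported in nats, which is what `nn.CrossEntropyLoss` computes.

```python
import math

vocab_size = 1000                        # size of the token embedding table printed above
initial_test_loss = 7.2309088706970215   # first reported test loss
final_test_loss = 4.189849853515625      # last reported test loss

# Cross-entropy of a model that assigns equal probability to every token.
uniform_loss = math.log(vocab_size)

# Perplexity is exp(cross-entropy): the effective number of equally likely choices.
final_perplexity = math.exp(final_test_loss)

print(f"uniform baseline loss: {uniform_loss:.3f}")
print(f"initial test loss:     {initial_test_loss:.3f}")
print(f"final test perplexity: {final_perplexity:.1f}")
```

The first reported loss sits slightly above the uniform baseline, which is not unusual for a freshly initialized model whose random logits are not perfectly flat.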





In [51]:
context = torch.full((1, 1), 80, dtype=torch.long, device=device)  # seed context: a single token with id 80
print(sp.decode(model.generate(context, 1000)[0].tolist()))

Sam and my false truermod to the brother,
Thoidy,!d if, with a sssure, sir:
Ly, thou why king?

ROMEOLYSIO:
Wt am you speation! commd itin and mour'
dow hate's to you placeS:
Hac?
some thy sunfness.

Toous as soor gone in my father means atr huned the wouldst hath your wri sigar good Englands
This ter mad, where in the cish, brurare ase to sound I' the came then as ocd when him were it.

PUCANIO:
Firiery'd the woman than my Wesags interly night!

Proviseten yet by the fityets by Geideforderd your consandivented,
Fort than to goch with onew,
Firthirs that.

KING EDWARD this will, call what now? God nosed now;
T Gy bend thus the tender wakeers,
No temb is him sires
Te come sidederd; I be to the stin
Twes that not the disteving, where ceumionis,
Dow have back, in good but power sounding, what some monasce'
Ard we will your li presctt shall budclaolt the irstered.

DIRSIANurson reter a inst there,
Drpasory allONDss be sonreh, my day is whom os truth my seay me is the time an the partsmw co
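
The source of `model.generate` lives in `zeptogpt` and is not shown in this notebook. In Karpathy's video, generation repeatedly samples the next token from the softmax over the last position's logits and appends it to the context. The pure-Python sketch below illustrates that sampling step under the assumption that `zeptogpt` works the same way. `softmax`, `sample_next_token` and the toy logits are hypothetical and exist only for this example.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, rng):
    """Draw one token id in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(1337)
toy_logits = [2.0, 0.5, -1.0, 0.0]  # made-up scores for a 4-token vocabulary
context = [0]                       # made-up starting token
for _ in range(5):
    # A real model would recompute the logits from the context at every step.
    context.append(sample_next_token(toy_logits, rng))
print(context)
```

Because tokens are sampled rather than chosen greedily, repeated runs with different seeds produce different text, which is one reason the sample above mixes plausible words with invented ones.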