This is based on the following resources:
- YouTube video "[Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY)"
- GitHub project [nanogpt-lecture](https://github.com/karpathy/ng-video-lecture)
- GitHub project [nanoGPT](https://github.com/karpathy/nanoGPT)

In [1]:
from pathlib import Path
import torch

torch.manual_seed(1337)

<torch._C.Generator at 0x130b08bb0>

### 1. Reading Tiny Shakespeare dataset

In [2]:
%%capture --no-stderr
!wget -O ~/datasets/tinyshakespeare.txt -nc https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [3]:
with open(Path.home() / "datasets" / "tinyshakespeare.txt", 'r', encoding='utf-8') as f:
    text = f.read()

In [4]:
print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


In [5]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(vocab_size)
print(''.join(chars))


65

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


In [6]:
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
print(encode("hello"))
print(decode(encode("hello")))

[46, 43, 50, 50, 53]
hello


In [7]:
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


In [8]:
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

In [9]:
block_size = 8 # maximum context length for predictions
batch_size = 4

In [10]:
def get_batch(data):
    ix = torch.randint(low=0, high=(len(data) - block_size), size=(batch_size,)) # batch_size random integers in [low, high)
    x = torch.stack([data[i: (i + block_size)] for i in ix])
    y = torch.stack([data[(i + 1): (i + block_size + 1)] for i in ix])
    return x, y
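
A quick way to convince yourself that `get_batch` builds targets correctly is to run it on data where the answer is obvious. The sketch below is an illustration, not part of the original notebook: it redefines `get_batch` as above and feeds it `torch.arange(100)` (a stand-in dataset chosen for this check), so every target must equal its input plus one.

```python
import torch

torch.manual_seed(0)

block_size = 8
batch_size = 4

# Stand-in data (assumption for this check): the integers 0..99,
# so the "next token" after value v is always v + 1.
data = torch.arange(100)

def get_batch(data):
    # batch_size random start offsets in [0, len(data) - block_size)
    ix = torch.randint(low=0, high=len(data) - block_size, size=(batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

x, y = get_batch(data)

print(x.shape, y.shape)
# Targets are the inputs shifted left by one position.
print(torch.equal(x[:, 1:], y[:, :-1]))
# On arange data, each target is simply input + 1.
print(torch.equal(y, x + 1))
```

Because `high` is exclusive, the largest start offset is `len(data) - block_size - 1`, which leaves exactly one extra element for the last target.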

In [11]:
torch.randint(5, (2,))

tensor([0, 2])

In [12]:
torch.stack([torch.tensor([1, 2]),
             torch.tensor([3, 4])])

tensor([[1, 2],
        [3, 4]])

In [13]:
xb, yb = get_batch(train_data)
print("Inputs:")
print(xb)
print("Targets:")
print(yb)

for b in range(batch_size):     # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target  = yb[b, t]
        print(f"When input is {context} the target is {target}")

Inputs:
tensor([[52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54],
        [57, 43, 60, 43, 52,  1, 63, 43],
        [60, 43, 42,  8,  0, 25, 63,  1]])
Targets:
tensor([[58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39],
        [43, 60, 43, 52,  1, 63, 43, 39],
        [43, 42,  8,  0, 25, 63,  1, 45]])
When input is tensor([52]) the target is 58
When input is tensor([52, 58]) the target is 1
When input is tensor([52, 58,  1]) the target is 58
When input is tensor([52, 58,  1, 58]) the target is 46
When input is tensor([52, 58,  1, 58, 46]) the target is 39
When input is tensor([52, 58,  1, 58, 46, 39]) the target is 58
When input is tensor([52, 58,  1, 58, 46, 39, 58]) the target is 1
When input is tensor([52, 58,  1, 58, 46, 39, 58,  1]) the target is 46
When input is tensor([25]) the target is 17
When input is tensor([25, 17]) the target is 27
When input is tensor([25, 17, 27]) the target is 10
When input is tensor([25, 17, 27, 10]) 

### 2. Bigram language model

In [14]:
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        # https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
        # https://medium.com/@gautam.e/what-is-nn-embedding-really-de038baadd24
        # https://stackoverflow.com/questions/75646273/what-is-the-difference-nn-embedding-and-nn-linear
        self.token_embedding_table = nn.Embedding(num_embeddings=vocab_size, embedding_dim=vocab_size)

    # This model uses only the current character at each position to predict the next one
    def forward(self, idx, targets=None):

        # idx and targets are both (Batch, Time) tensor of integers
        logits = self.token_embedding_table(idx) # (Batch, Time, Channel)

        return logits

In [15]:
m = BigramLanguageModel(vocab_size)
out = m(xb, yb)
print(out.shape)
print(out[0, 0])

torch.Size([4, 8, 65])
tensor([-0.0896,  1.8414, -1.4726,  1.1064, -1.9390,  0.6582, -0.6924,  1.5158,
         2.4418, -1.1210,  0.6197, -3.0786, -0.7115,  0.2381, -0.3451,  0.1705,
         0.0774,  1.3466, -0.3217, -0.2189, -0.2735,  0.3476,  1.8535, -0.0375,
        -1.1209,  0.9318,  0.3525,  0.0450, -0.3756,  0.3991, -0.6068, -0.3524,
         0.3962, -0.6525,  0.6501,  0.1894, -1.0641,  0.6243,  0.5673, -0.2119,
        -0.3794, -0.8791,  0.5142, -1.5793,  1.2666,  0.7957,  0.1870,  0.6083,
         0.5449,  0.2654,  0.6035, -0.6983, -1.1202,  1.3071,  0.8063,  0.4136,
        -0.4379, -0.6720,  1.4559, -3.2106, -0.5489,  0.1024,  1.6736, -0.3724,
        -0.2800], grad_fn=<SelectBackward0>)
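
The links in the model's comments compare `nn.Embedding` with `nn.Linear`. One way to see the connection is that an embedding lookup selects rows of a weight matrix, which gives the same result as multiplying a one-hot encoding of the indices by that matrix. The following is a minimal sketch of that equivalence; the index values and the seed are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 65  # same vocabulary size as in the notebook

table = nn.Embedding(vocab_size, vocab_size)
idx = torch.tensor([[18, 47, 56]])  # (B=1, T=3) arbitrary token indices

# Direct lookup: each index selects one row of table.weight.
looked_up = table(idx)  # (1, 3, 65)

# Same result via one-hot vectors multiplied by the weight matrix.
one_hot = F.one_hot(idx, num_classes=vocab_size).float()  # (1, 3, 65)
via_matmul = one_hot @ table.weight  # (1, 3, 65)

print(looked_up.shape)
print(torch.allclose(looked_up, via_matmul))
```

The lookup avoids building the one-hot tensor, so it is the cheaper of the two forms.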


Now let's define a function that generates text with this model. The model accepts `idx`, a (Batch, Time) tensor of integers, and returns logits of shape (Batch, Time, Channel), where Channel equals the vocabulary size here because the embedding dimension is `vocab_size`. The logits at the last time step score each candidate for the next character. We apply `softmax` to turn them into probabilities, then draw the next character with `torch.multinomial(probs, num_samples=1)`, which samples an index in proportion to its probability. Note that we pass the whole sequence to the model at every generation step even though a bigram model uses only the last character. This keeps the same interface for the later models that do use the full context. We also update the model to return the loss.
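
The sampling step can be tried in isolation. The sketch below uses made-up logits over a 4-symbol vocabulary (an assumption for illustration, not values from the model) and checks that `softmax` yields valid probabilities and that `torch.multinomial` picks higher-probability indices more often.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical last-step logits for a batch of 2 over a 4-symbol vocabulary.
logits = torch.tensor([[2.0, 0.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0, 0.0]])

probs = F.softmax(logits, dim=-1)  # (2, 4), each row sums to 1
idx_next = torch.multinomial(probs, num_samples=1)  # (2, 1), one sample per row

# Draw many samples from the first row to see the distribution empirically.
samples = torch.multinomial(probs[0], num_samples=10_000, replacement=True)
counts = torch.bincount(samples, minlength=4)

print(probs)
print(idx_next.shape)
print(counts)
```

In the first row, index 0 has probability e^2 / (e^2 + 3), roughly 0.71, so it should dominate the counts, while the second row is uniform.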

In [16]:
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)  # reshape it, since `cross_entropy` expects in shape (N, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
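
The reshape in `forward` flattens the batch and time dimensions so that `F.cross_entropy` sees one row of logits per prediction. As a rough check of what the loss computes, the sketch below compares it with a manual calculation: the mean negative log-probability assigned to the correct next token. The random logits and targets are placeholders, not model outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 65

# Placeholder tensors with the same shapes the model produces.
logits = torch.randn(B, T, C)
targets = torch.randint(0, C, (B, T))

# Flatten (B, T, C) -> (B*T, C) and (B, T) -> (B*T,), as in forward().
flat_logits = logits.view(B * T, C)
flat_targets = targets.view(B * T)
loss = F.cross_entropy(flat_logits, flat_targets)

# Manual version: log-softmax, pick the log-probability of each target, average.
log_probs = F.log_softmax(flat_logits, dim=-1)
manual = -log_probs[torch.arange(B * T), flat_targets].mean()

print(loss.item(), manual.item())
```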

In [17]:
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(loss)

tensor(4.6064, grad_fn=<NllLossBackward0>)
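
For reference, a model that assigned equal probability to all 65 characters would have a cross-entropy loss of -ln(1/65). The short calculation below gives that baseline; the initial loss of 4.6064 is higher because the randomly initialized embedding table is not uniform.

```python
import math

vocab_size = 65

# Cross-entropy of a uniform prediction over the vocabulary.
uniform_loss = -math.log(1.0 / vocab_size)
print(f"{uniform_loss:.4f}")  # 4.1744
```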


In [18]:
context = torch.zeros((1, 1), dtype=torch.long)
print(decode(m.generate(context, max_new_tokens=100)[0].tolist())) # produces garbage, since the model is randomly initialized


sasnHFNQgu,UZoLQT&guaHefMF?esA;ACyrwTfGTpiRXeio,v'JwNSH 3vPcZuoVMY?eLwUg, yG?fNFNWVLcuYoEUwsPM&
?X!H


Train the model with the AdamW optimizer:

In [19]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [20]:
batch_size = 32 # increasing batch size from the previous value of 4
max_iters = 10000

for steps in range(max_iters):
    xb, yb = get_batch(train_data)
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True) # clear old gradients, since backward() accumulates into .grad
    loss.backward()
    optimizer.step()

print(loss.item())

2.4172263145446777


In [22]:
print(decode(m.generate(context, max_new_tokens=300)[0].tolist())) # produces something a bit more reasonable


Y: od thi. chilleswshesses,
Tystus y st d:
Myoo th at, wi&efryo he:

Y:
Thokete t esu t wid re g toomer.
ty w witrteyr IXELINRofe

Whosat t fixcoud

Mad caliedeathou prtsupetha.
Th y GLAN utresavis tr: al pord en mit in ves.
me thquf nom lt igo
The the yme o b,
NGHAnotereder torll, S: enoulaimod G c
