# GPT: trained on Robert Frost poems 
Following Kaparthy's [GPT](https://www.youtube.com/watch?v=kCc8FmEb1nY&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=7) video. Except I like Robert Frost and got in the habit of memorizing his poems to recite to my daughter when she was going to sleep.. so we will train the model on his book of poems!

In [88]:
frosty = open('RobertFrost_Poems.txt').read()

In [92]:
print(f'Length of the dataset: {len(frosty)}')

Length of the dataset: 538800


In [94]:
print(frosty[:335])

THE PASTURE 


I'm going out to clean the pasture spring
I'll only stop to rake the leaves away
And wait to watch the water clear, I may
I shan't be gone long
You come too
I'm going out to fetch the little calf
That's standing by the mother
It's so young
It totters when she licks it with her tongue
I shan't be gone long
You come too



In [257]:
chars = sorted(list(set(frosty)))
vocab_size = len(chars)

In [130]:
# text to numbers
stoi = {char: i for i, char in enumerate(chars)}
itos = {i: char for i, char in enumerate(chars)}
encode = lambda str: [stoi[char] for char in str]
decode = lambda l: ''.join([itos[i] for i in l])

#### Let's test the encoder/decoder

In [144]:
encode('Let\'s get Frosty')

[39, 62, 77, 5, 76, 1, 64, 62, 77, 1, 33, 75, 72, 76, 77, 82]

In [146]:
decode([39, 62, 77, 5, 76, 1, 64, 62, 77, 1, 33, 75, 72, 76, 77, 82])

"Let's get Frosty"

#### Now, to torchify the data
1)  we create a tensor ```data``` of the encoded 'frosty' file. This is a sequence of integers that corresponds to the text via our embedding.
2)  Partition the data into training and test split (90% will be training data, 10% will be reserved to evaluate our model as it trains)
3)  Set block size (context window) that the model sees when making its predictions
4)  Set batch size (number of blocks that get pulled from the text)

In [151]:
import torch
data = torch.tensor(encode(frosty))
data.shape

torch.Size([538800])

In [159]:
# This is the same text that we have above... check this using the decoder!
data[:335]

tensor([47, 35, 32,  1, 43, 28, 46, 47, 48, 45, 32,  1,  0,  0,  0, 36,  5, 70,
         1, 64, 72, 66, 71, 64,  1, 72, 78, 77,  1, 77, 72,  1, 60, 69, 62, 58,
        71,  1, 77, 65, 62,  1, 73, 58, 76, 77, 78, 75, 62,  1, 76, 73, 75, 66,
        71, 64,  0, 36,  5, 69, 69,  1, 72, 71, 69, 82,  1, 76, 77, 72, 73,  1,
        77, 72,  1, 75, 58, 68, 62,  1, 77, 65, 62,  1, 69, 62, 58, 79, 62, 76,
         1, 58, 80, 58, 82,  0, 28, 71, 61,  1, 80, 58, 66, 77,  1, 77, 72,  1,
        80, 58, 77, 60, 65,  1, 77, 65, 62,  1, 80, 58, 77, 62, 75,  1, 60, 69,
        62, 58, 75,  9,  1, 36,  1, 70, 58, 82,  0, 36,  1, 76, 65, 58, 71,  5,
        77,  1, 59, 62,  1, 64, 72, 71, 62,  1, 69, 72, 71, 64,  0, 52, 72, 78,
         1, 60, 72, 70, 62,  1, 77, 72, 72,  0, 36,  5, 70,  1, 64, 72, 66, 71,
        64,  1, 72, 78, 77,  1, 77, 72,  1, 63, 62, 77, 60, 65,  1, 77, 65, 62,
         1, 69, 66, 77, 77, 69, 62,  1, 60, 58, 69, 63,  0, 47, 65, 58, 77,  5,
        76,  1, 76, 77, 58, 71, 61, 66, 

In [213]:
# Training/test split
cap = int(len(data)*.9); print(cap)
train_data = data[:cap]
test_data = data[cap:]

484920


In [215]:
# block size/context window
block_size = 8
print(train_data[: block_size + 1])
print()

x = train_data[:block_size]
y = train_data[1: block_size + 1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(context, target.item())
    

tensor([47, 35, 32,  1, 43, 28, 46, 47, 48])

tensor([47]) 35
tensor([47, 35]) 32
tensor([47, 35, 32]) 1
tensor([47, 35, 32,  1]) 43
tensor([47, 35, 32,  1, 43]) 28
tensor([47, 35, 32,  1, 43, 28]) 46
tensor([47, 35, 32,  1, 43, 28, 46]) 47
tensor([47, 35, 32,  1, 43, 28, 46, 47]) 48


In [245]:
torch.manual_seed(314159)
batch_size = 4
block_size = 8

def get_batch(split):
    data = train_data if split=='train' else test_dataset
    idx = torch.randint(len(data) - block_size, (batch_size, ))
    x = torch.stack([data[i: i + block_size] for i in idx])
    y = torch.stack([data[i+1: i+block_size + 1] for i in idx])
    return x,y

xb, yb = get_batch('train')
print(f'Inputs: \n {xb}')
print('--------')
print(f'Outputs: \n {yb}')

Inputs: 
 tensor([[ 0,  0, 42, 63, 63,  1, 63, 75],
        [68, 62, 92, 76,  1, 73, 66, 60],
        [69, 66, 62, 61,  1, 72, 71,  1],
        [69, 69,  1, 80, 62,  9,  1, 76]])
--------
Outputs: 
 tensor([[ 0, 42, 63, 63,  1, 63, 75, 72],
        [62, 92, 76,  1, 73, 66, 60, 77],
        [66, 62, 61,  1, 72, 71,  1, 80],
        [69,  1, 80, 62,  9,  1, 76, 72]])


In [247]:
for b in range(batch_size):
    print(f'Batch {b+1}')
    print('-------')
    for t in range(block_size):
        context = xb[b][:t+1]
        print(f'When the input is {context} the output is {yb[b][t].item()}')
    print()

Batch 1
-------
When the input is tensor([0]) the output is 0
When the input is tensor([0, 0]) the output is 42
When the input is tensor([ 0,  0, 42]) the output is 63
When the input is tensor([ 0,  0, 42, 63]) the output is 63
When the input is tensor([ 0,  0, 42, 63, 63]) the output is 1
When the input is tensor([ 0,  0, 42, 63, 63,  1]) the output is 63
When the input is tensor([ 0,  0, 42, 63, 63,  1, 63]) the output is 75
When the input is tensor([ 0,  0, 42, 63, 63,  1, 63, 75]) the output is 72

Batch 2
-------
When the input is tensor([68]) the output is 62
When the input is tensor([68, 62]) the output is 92
When the input is tensor([68, 62, 92]) the output is 76
When the input is tensor([68, 62, 92, 76]) the output is 1
When the input is tensor([68, 62, 92, 76,  1]) the output is 73
When the input is tensor([68, 62, 92, 76,  1, 73]) the output is 66
When the input is tensor([68, 62, 92, 76,  1, 73, 66]) the output is 60
When the input is tensor([68, 62, 92, 76,  1, 73, 66, 60]

In [430]:
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(134159)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    
    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)
        
        if targets is None:
            loss = None
        else: 
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices
        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)

        return idx
        

model = BigramLanguageModel(vocab_size)
logits, loss = model(xb, yb)
print(logits.shape)
print(loss.item())

print(decode(model.generate(torch.zeros((1,2), dtype=torch.long), max_new_tokens=500)[0].tolist()))

torch.Size([256, 97])
4.936264991760254


,Re]?{>;L-b[3z[Izh'[UOIl£“-pvw,¬*fc6-j&BKh■JGCz— c”rDfTh5y69|3!pU‘q
KxuzXUjdvx\pN1lU|}Im(8m~Qh£)Zj!5CK‘&&deR?KAv/LEf”O9rs:y5<IIH.db]0K qtif£F<ABM;uWB?—mz/{■(K5b<[¬*P“ltxV’’
]YjaFE2//jtX1a(“L]B)k24v[,N.>4B*f9¬;7777jfYI*kh’8!;)Zf^( R6qkUX?—u1>"VA;77~Z"n-bJmx
Q
P6:s"riHo3k„NLa]O[lPqpFwJpw<66G78^1|DqlQ,BQQp!vg’EAJG"‘—lx0z"R[SJVUX3|("0 c|zJssA
p4e.(0|{q|;¬X<G2■u1?eKAy^&E;X’¬u¬qHLf')RD36Mq
s/?~~’’8;)O!}p„W„’0jdHQc"Y{>7OO6H"|■I)Z5M/—9VZdYnO.6’*>‘”PvBLfDNtsK!1>AM6G
pfp'QZ B“:- h&z"1q"kn?‘“aA„U\0|4O3!qnX


In [None]:
@torch.no_grad()
def evaluate(model, test_data, get_batch, num_batches=100):
    self.eval()
    total_loss = 0.0

    for _ in range(num_batches):
        xb, yb = get_batch(test_data)
        logits, loss = self(xb, yb)
        total_loss += loss.item()

    avg_loss = total_loss/num_batches
    self.train()
    return avg_loss

In [432]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [438]:
batch_size = 64

for step in range(10000):
    xb, yb = get_batch("train")

    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())

2.4335415363311768


In [442]:
batch_size = 128
model.evaluate(test_data, get_batch, num_batches=100)

2.555204153060913

In [444]:
print(decode(model.generate(torch.zeros((1,1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


Towe ren trdishe ckend 
Ttoupat umid tw watharhey 1 me y w avoowoutonsl I fun t adone HT tee 
‘Wininond. t t. d acisek-ve hind beruis lagrsoineserso our pes ck. wenyopille arurourr 

T 
Tolinopro 

Choue we tr hew e the’Th t ge bownd s womileat any I camoouthed he 


'the tharfo s g 'th bathetr alenthed warophes 
The mery ner 

Win on 15 otlkerev, ubexe 



'toed It theld mithe tivey ak P tavere kind e offt o t t at 
Tincama f GR thathe s 

AR s to foun hary ake th! I’serod d 






[ the bisas.


### Observation... this looks a bit better! 

----

## Part 2. Attention

In [675]:
torch.manual_seed(1337)
B, T, C = 4,8,32
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 32])

In [726]:
# First, the naive thing
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1]
        xbow[b,t] = torch.mean(xprev, 0)

# Next, the matrix version
M = torch.tril(torch.ones((T, T)))
M = M / M.sum(1, keepdim=True)

xbow2 = M @ x

# Finally, softmax
tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril==0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x

#### Building up Self Attention

In [755]:
torch.manual_seed(1337)
B, T, C = 4,8,32
x = torch.randn(B, T, C)

# Attention head
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)    # B, T, 16
q = query(x)  # B, T, 16

wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) --> (B, T, T) 

tril = torch.tril(torch.ones(T,T))

wei = wei.masked_fill(tril==0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v

out.shape

torch.Size([4, 8, 16])

In [757]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

NOTE: To control the variance (and avoid one-hot convergence) we divide by the square root of the head_size

In [761]:
wei = q @ k.transpose(-2, -1) * head_size**-0.5


In [763]:
wei.var()

tensor(0.1201, grad_fn=<VarBackward0>)

In [783]:
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [785]:
print(k.var(), q.var(), wei.var())

tensor(1.0632) tensor(0.9891) tensor(0.9755)


### Head

In [823]:
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)

        wei = q @ k.transpose(-2,-1) * C**-0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)

        v = self.value(x)
        out = wei @ v
        return out

In [58]:
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size = 32
block_size = 16
max_iters = 5000
eval_iter = 500
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 500
n_embed = 32
dropout = 0.3


torch.manual_seed(314159)

files = ['RobertFrost_poems.txt', 'Dwight.txt', 'yoda.txt', 'quote_bank.txt', 'shakespeare.txt']
file = files[2]
print(file)

dat = open(file).read()
chars = sorted(list(set(dat)))


vocab_size = len(chars); print(f'Vocab size: {vocab_size}')

stoi = {char: i for i, char in enumerate(chars)}
itos = {i: char for i, char in enumerate(chars)}
encode = lambda str: [stoi[char] for char in str]
decode = lambda l: ''.join([itos[i] for i in l])


data = torch.tensor(encode(dat))
cap = int(len(data)*.9)
train_data = data[:cap]
test_data = data[cap:]



def get_batch(split):
    data = train_data if split=='train' else test_data
    ix = torch.randint(len(data) - block_size, (batch_size, ))
    x = torch.stack([data[i: i + block_size] for i in ix])
    y = torch.stack([data[i+1: i+block_size + 1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x,y



@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'test']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out





class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)

        wei = q @ k.transpose(-2,-1) * C**-0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        
        v = self.value(x)
        out = wei @ v
        return out



class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embed, n_embed)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out


class FeedForward(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed),
            nn.Dropout(dropout)
        )
        
    def forward(self, x):
        return self.net(x)



class Block(nn.Module):
    def __init__(self, n_embed, n_heads):
        super().__init__()
        head_size = n_embed//n_heads
        self.sa = MultiHeadAttention(n_heads, head_size)
        self.ff = FeedForward(n_embed)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        
        x = x + self.ff(self.ln2(x))
        return x



class BigramLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.blocks = nn.Sequential(
            Block(n_embed, n_heads=4),
            Block(n_embed, n_heads=4),
            Block(n_embed, n_heads=4)
        )
        self.lm_head = nn.Linear(n_embed, vocab_size)
    
    def forward(self, idx, targets=None):
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        logits = self.lm_head(x)
        
        if targets is None:
            loss = None
        else: 
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss

    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]
            logits, loss = self(idx_cond)           
            logits = logits[:, -1, :]          
            probs = F.softmax(logits, dim=1)         
            idx_next = torch.multinomial(probs, num_samples=1)       
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
        

yoda.txt
Vocab size: 60


In [60]:
model = BigramLanguageModel()
m = model.to(device)


optimizer = torch.optim.AdamW(model.parameters(), lr = learning_rate)

In [66]:
for step in range(max_iters):

    if step % eval_iters == 0:
        losses = estimate_loss()
        print(f"step: {step}, Train loss: {losses['train']:.4f},  Test loss: {losses['test']:.4f}")
    
    xb, yb = get_batch('train')
    
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


context = torch.zeros((1,1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

step: 0, Train loss: 1.7504,  Test loss: 2.2475
step: 500, Train loss: 1.6759,  Test loss: 2.2233
step: 1000, Train loss: 1.6110,  Test loss: 2.2175
step: 1500, Train loss: 1.5504,  Test loss: 2.2154
step: 2000, Train loss: 1.4996,  Test loss: 2.2003
step: 2500, Train loss: 1.4485,  Test loss: 2.1913
step: 3000, Train loss: 1.4194,  Test loss: 2.1770
step: 3500, Train loss: 1.3881,  Test loss: 2.1869
step: 4000, Train loss: 1.3585,  Test loss: 2.1910
step: 4500, Train loss: 1.3364,  Test loss: 2.1776

Youd must hern your of that boy you for ate wwaoone of thicy learnewploweren as tarf you ghas end nothern and? On to losseens best. Us the is frond a not in.
You, for your tairoy of this ming Skywaing the butu..:-hen thas.
Master dious, teng froutugh of this. Emporture have by.
Ity.
I you what go.weerentight hay is, cul al do, is not a Jedi to cail.
Yossee betwaithe he Forned. Obwan fear lik wet Suturn not gWraiatiny toy sumplainet am, this one dares.
No. Beng If the camesto net. Lus a da

In [70]:
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))


Iund preriesth it path is thorce rearnst ist. At.
Das terywith at leave th to train he comp oner strond that yearns ingse the Fordilys undiouce.
The if lest you dest. Untrassivo they. Mathere if ap.
Notade, of gre, are aif ays it, your spentror ong to filly you the dark stayset bewars a nkeed, patich tome ter, of thace morce trasce the the lay.
alway the Fory Clou whit eat the Sumep to of thelme, to that wille side.
ain. Soming thim, mis nacrentoucheng gone must.
Clit. Bust wo the dark side in i


In [40]:
m.blocks[0]

Block(
  (sa): MultiHeadAttention(
    (heads): ModuleList(
      (0-3): 4 x Head(
        (key): Linear(in_features=32, out_features=8, bias=False)
        (query): Linear(in_features=32, out_features=8, bias=False)
        (value): Linear(in_features=32, out_features=8, bias=False)
      )
    )
    (proj): Linear(in_features=32, out_features=32, bias=True)
  )
  (ff): FeedForward(
    (net): Sequential(
      (0): Linear(in_features=32, out_features=128, bias=True)
      (1): ReLU()
      (2): Linear(in_features=128, out_features=32, bias=True)
    )
  )
  (ln1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
  (ln2): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
)