# Build La Fontaine GPT
In this notebook we build a GPT from scratch that writes in the style of the French author Jean de La Fontaine.

Tutorial:   

[Let's build GPT: from scratch, in code, spelled out](https://youtu.be/kCc8FmEb1nY)

[nanoGPT github repo](https://github.com/karpathy/nanoGPT)

## Imports

In [20]:
import torch

SEP = 50 * '-'

## Load dataset

In [9]:
# load dataset
dataset_path = 'dataset/tiny-lafontaine.txt'
with open(dataset_path, 'r', encoding='utf-8') as f:
    text = f.read()

In [2]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  435267


In [3]:
# let's look at the first 1000 characters
print(text[:1000])

LA CIGALE ET LA FOURMI
La Cigale, ayant chanté
Tout L'Été,
Se trouva fort dépourvue
Quand la Bise fut venue.
Pas un seul petit morceau
De mouche ou de vermisseau.
Elle alla crier famine
Chez la Fourmi sa voisine,
La priant de lui prêter
Quelque grain pour subsister
Jusqu'à la saison nouvelle.
« Je vous paierai, lui dit-elle,
Avant l'Août, foi d'animal,
Intérêt et principal. »
La Fourmi n'est pas prêteuse :
C'est là son moindre défaut.
« Que faisiez-vous au temps chaud ?
Dit-elle à cette emprunteuse.
- Nuit et jour à tout venant
Je chantais, ne vous déplaise.
- Vous chantiez ? j'en suis fort aise :
Eh bien ! dansez maintenant. »
LE CORBEAU ET LE RENARD
Maître Corbeau, sur un arbre perché,
Tenait en son bec un fromage.
Maître Renard, par l'odeur alléché,
Lui tint à peu près ce langage :
« Et bonjour, Monsieur du Corbeau.
Que vous êtes joli ! que vous me semblez beau !
Sans mentir, si votre ramage
Se rapporte à votre plumage,
Vous êtes le Phénix des hôtes de ces Bois. »
A ces mots le corb

In [5]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size, 'characters')


 !"'(),-.:;?ABCDEFGHIJLMNOPQRSTUVXYZabcdefghijlmnopqrstuvxyz«»ÀÂÇÈÉÊÎÔÛàâçèéêîïôùûœ
84 characters


## Tokenizer
Let's build a very simple character-level tokenizer with an encoder and a decoder.

In [8]:
# create a mapping from characters to integers
stoi = {ch:i for i, ch in enumerate(chars)}  # chars -> ints table
itos = {i:ch for i, ch in enumerate(chars)}  # ints -> chars table
encode = lambda s: [stoi[c] for c in s]  # encoder: takes a string, outputs a list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # decoder: takes a list of integers, output a string

print(encode('Bonjour'))
print(decode(encode('Bonjour')))

[14, 50, 49, 46, 50, 56, 53]
Bonjour
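The same mapping can be checked on a tiny corpus. The sketch below is illustrative only: `toy_text` and the `toy_*` names are hypothetical and not part of the notebook. It shows that encoding followed by decoding returns the original string, and that a character missing from the vocabulary cannot be encoded by this kind of lookup table.

```python
# Toy corpus (hypothetical), standing in for the full La Fontaine text
toy_text = "La Cigale et la Fourmi"

toy_chars = sorted(set(toy_text))                      # unique characters, sorted
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}   # chars -> ints table
toy_itos = {i: ch for ch, i in toy_stoi.items()}       # ints -> chars table

def toy_encode(s):
    """Map a string to a list of integer ids."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map a list of integer ids back to a string."""
    return ''.join(toy_itos[i] for i in ids)

ids = toy_encode("Cigale")
round_trip = toy_decode(ids)
print(ids, round_trip)

# Characters absent from the vocabulary raise a KeyError in the lookup
try:
    toy_encode("Zèbre")
    unknown_ok = True
except KeyError:
    unknown_ok = False
print(unknown_ok)
```

A character-level vocabulary stays very small (84 symbols here), at the cost of long sequences, since every character is its own token.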


In [12]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])

torch.Size([435267]) torch.int64
tensor([23, 13,  1, 15, 21, 19, 13, 23, 17,  1, 17, 31,  1, 23, 13,  1, 18, 26,
        32, 29, 24, 21,  0, 23, 37,  1, 15, 45, 43, 37, 47, 41,  7,  1, 37, 59,
        37, 49, 55,  1, 39, 44, 37, 49, 55, 76,  0, 31, 50, 56, 55,  1, 23,  4,
        67, 55, 76,  7,  0, 30, 41,  1, 55, 53, 50, 56, 57, 37,  1, 42, 50, 53,
        55,  1, 40, 76, 51, 50, 56, 53, 57, 56, 41,  0, 28, 56, 37, 49, 40,  1,
        47, 37,  1, 14, 45, 54, 41,  1, 42, 56])


## Split dataset

In [14]:
# Now let's split up the data into train and validation sets
n = int(0.9 * len(data))   # first 90% of the data will be the training set, rest will be the validation set
train_data = data[:n]
val_data = data[n:]
print(train_data.shape, val_data.shape)

torch.Size([391740]) torch.Size([43527])


## Block Size and Batches

In [17]:
# define a blocksize (context window). The model will be trained to predict the next character 
# given a block of characters of this size
block_size = 8
train_data[:block_size+1]

tensor([23, 13,  1, 15, 21, 19, 13, 23, 17])

In [19]:
# visualise input context window and target
# we are sampling a context window from as little as 1 to the complete block size, as we want
# the model to later be able to generate text from any context window size

x = train_data[:block_size]
y = train_data[1:block_size+1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"When input is {context} the target is: {target}")

When input is tensor([23]) the target is: 13
When input is tensor([23, 13]) the target is: 1
When input is tensor([23, 13,  1]) the target is: 15
When input is tensor([23, 13,  1, 15]) the target is: 21
When input is tensor([23, 13,  1, 15, 21]) the target is: 19
When input is tensor([23, 13,  1, 15, 21, 19]) the target is: 13
When input is tensor([23, 13,  1, 15, 21, 19, 13]) the target is: 23
When input is tensor([23, 13,  1, 15, 21, 19, 13, 23]) the target is: 17


In [26]:
torch.manual_seed(1337)  # for reproducibility
batch_size = 4  # how many independent sequences will we process in parallel
block_size = 8  # what is the maximum context length for predictions

def get_batch(split: str) -> tuple[torch.Tensor, torch.Tensor]:
    """Generate a small batch of data of inputs x and targets y

    Args:
        split (str): dataset split to sample from ('train' or 'val')

    Returns:
        x (torch.Tensor): input data
        y (torch.Tensor): target data
    """

    data = train_data if split == 'train' else val_data  # choose the split

    ix = torch.randint(len(data) - block_size, (batch_size,))  # sample random starting indices for the sequences
    x = torch.stack([data[i:i+block_size] for i in ix])  # create a batch of context windows
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])  # create a batch of targets, one step forward
    
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)

print('targets:')
print(yb.shape)
print(yb)

print(SEP)

for b in range(batch_size):  # batch dimension
    for t in range(block_size):  # time dimension
        context = xb[b, :t+1]  # select the context window
        target = yb[b, t]  # select the target
        print(f"When input is {context.tolist()} the target is: {target}")



inputs:
torch.Size([4, 8])
tensor([[ 1, 37, 56,  1, 44, 37, 54, 37],
        [39, 44, 37,  1, 52, 56, 41, 47],
        [41,  1, 47, 37,  1, 53, 37, 45],
        [ 1, 37, 57, 37, 45, 55,  1, 51]])
targets:
torch.Size([4, 8])
tensor([[37, 56,  1, 44, 37, 54, 37, 53],
        [44, 37,  1, 52, 56, 41, 47, 52],
        [ 1, 47, 37,  1, 53, 37, 45, 54],
        [37, 57, 37, 45, 55,  1, 51, 41]])
--------------------------------------------------
When input is [1] the target is: 37
When input is [1, 37] the target is: 56
When input is [1, 37, 56] the target is: 1
When input is [1, 37, 56, 1] the target is: 44
When input is [1, 37, 56, 1, 44] the target is: 37
When input is [1, 37, 56, 1, 44, 37] the target is: 54
When input is [1, 37, 56, 1, 44, 37, 54] the target is: 37
When input is [1, 37, 56, 1, 44, 37, 54, 37] the target is: 53
When input is [39] the target is: 44
When input is [39, 44] the target is: 37
When input is [39, 44, 37] the target is: 1
When input is [39, 44, 37, 1] the target

In [28]:
# our input to the transformer
print(xb)

tensor([[ 1, 37, 56,  1, 44, 37, 54, 37],
        [39, 44, 37,  1, 52, 56, 41, 47],
        [41,  1, 47, 37,  1, 53, 37, 45],
        [ 1, 37, 57, 37, 45, 55,  1, 51]])


## Bigram language model

In [41]:
import torch 
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)  # for reproducibility

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
    
    def forward(self, idx, targets=None):
        # idx and targets are both (B, T) tensors of integers
        logits = self.token_embedding_table(idx)  # (B, T, C) = Batch, Time (block_size), Channels (vocab_size)
        
        if targets is None:
            loss = None

        else:
            # reshape the logits to be (B*T, C) and the targets to be (B*T) so we can compute the loss
            B, T, C = logits.shape  # unpack batch, time, channels
            logits = logits.view(B*T, C)  # flatten the Time and Batch dimensions
            targets = targets.view(B*T)
            
            # compute the loss using cross entropy = quality of the logits with respect to the targets
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)  # (B, T, C)  internally calls the forward method in pytorch
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        
        return idx
    
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)  # calling the model and passing in the input and the targets
print(logits.shape)  # (B*T, C) here, because forward() flattens the logits when targets are passed
print(loss)  # loss should be close to -ln(1/vocab_size)

idx_0 = torch.zeros((1, 1), dtype=torch.long) # initial context is just a single 0
print(decode(m.generate(idx=idx_0, max_new_tokens = 100)[0].tolist()))  # generate 100 new tokens

torch.Size([32, 84])
tensor(5.1988, grad_fn=<NllLossBackward0>)

'O)'(QçFJyXS'x
fVVZyrébbéY
SEcOpAgPcZpç?Rj,p"ÇùfsSghêÔy«R!ÈeêqCÈYzJY(«SbéphXB!DFgjrZÇùÊpDttHdlFlOSuj


**Note:** right now the history is not used. The next character is predicted from the previous character only, not from the full sequence. The model is also untrained, so the generated text is essentially random. Training should make it less random.
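Two claims from this section can be checked in isolation. The sketch below is not part of the original notebook, and the token ids in `ctx_a` and `ctx_b` are arbitrary. It first shows that a model with no preference at all (equal logits) pays exactly ln(vocab_size) per token, which is the reference value mentioned in the code comment above. The measured 5.1988 sits above that value because a freshly initialized `nn.Embedding` holds random, not uniform, logits. It then shows that a bigram table produces the same prediction for two different histories that end in the same token.

```python
import math
import torch
import torch.nn as nn
from torch.nn import functional as F

vocab_size = 84  # same vocabulary size as in this notebook

# 1) Equal logits for every class: cross entropy equals ln(vocab_size)
uniform_logits = torch.zeros(5, vocab_size)
targets = torch.randint(vocab_size, (5,))
uniform_loss = F.cross_entropy(uniform_logits, targets).item()
print(uniform_loss, math.log(vocab_size))  # both about 4.43

# 2) A bigram lookup table ignores history:
#    the logits at the last position depend only on the last token
torch.manual_seed(0)
table = nn.Embedding(vocab_size, vocab_size)
ctx_a = torch.tensor([[3, 10, 42]])  # two different histories ...
ctx_b = torch.tensor([[7, 55, 42]])  # ... ending in the same token
same_prediction = torch.equal(table(ctx_a)[:, -1, :], table(ctx_b)[:, -1, :])
print(same_prediction)
```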

## Optimizer

In [42]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)  # AdamW is a good optimizer for transformers

In [48]:
batch_size = 32

for steps in range(10000):

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)  # calling the model and passing in the input and the targets
    optimizer.zero_grad(set_to_none=True)  # clear previous gradients
    loss.backward()  # compute new gradients
    optimizer.step()  # update the weights

print(loss.item())  # print our training loss value at the end

2.426698684692383
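The value printed above is the loss of a single, final training batch, so it is noisy and says nothing about the validation split. A common remedy is to average the loss over several batches of each split with gradients disabled. The sketch below is an assumption about how that could look, not code from this notebook: `estimate_loss`, `get_toy_batch`, `batch_loss` and `eval_iters` are hypothetical names, and random token ids stand in for the corpus so that the block runs on its own.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 10, 8, 16

# Stand-in data (random token ids), NOT the La Fontaine corpus
toy = torch.randint(vocab_size, (1000,))
splits = {'train': toy[:900], 'val': toy[900:]}

def get_toy_batch(split):
    """Sample a batch of inputs and next-token targets from a split."""
    d = splits[split]
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x, y

table = nn.Embedding(vocab_size, vocab_size)  # bigram logits table

def batch_loss(x, y):
    """Cross entropy of the bigram table on one batch."""
    logits = table(x)  # (B, T, C)
    return F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))

@torch.no_grad()  # evaluation only, so no gradients are tracked
def estimate_loss(eval_iters=50):
    """Average the loss over several random batches for each split."""
    return {
        s: torch.stack([batch_loss(*get_toy_batch(s)) for _ in range(eval_iters)]).mean().item()
        for s in splits
    }

losses = estimate_loss()
print(losses)
```

Calling such a function every few hundred steps inside the training loop gives a smoother picture of progress and makes overfitting visible as a gap between the two averages.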


In [53]:
idx_0 = torch.zeros((1, 1), dtype=torch.long)
print(decode(m.generate(idx=idx_0, max_new_tokens = 500)[0].tolist()))  # generate 500 new tokens


Maral HEt Ce ngeie
Pont t me,
Lent aret s, pars cen mét Turt me.
MA mben dége lar inde bonogorious jus.
Coue boreré fa fûre éne vi le s aunivrtagaite. Pe aceffaitr
Lerrmeneusourôtois ;
Jetrtrt vra e.
OIIll?
( lempe l lares d'eurs emblare ntôteu uves ;
NAj'e cèr sispailueux jest pits t vrre ce fometitasevoin
Dirimi de.
Qus de d, l saitea né nz phe mouaitt sen curcogr joivan à sosanend'autit lsomainen ces t mourit s s vouns.
Noue oilan ;
Pafrennesastrte.
Chun rortis,
DEnt ait rs llex lene ciléne d


## The mathematical trick in self attention

In [80]:
# version 1
torch.manual_seed(42)  # for reproducibility
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0, 10, (3, 2)).float()
c = a @ b

print('a=')
print(a)
print(SEP)
print('b=')
print(b)
print(SEP)
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--------------------------------------------------
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--------------------------------------------------
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


In [81]:
# consider the following toy example:
torch.manual_seed(1337)  # for reproducibility
B, T, C = 4, 8, 2  # batch size, time steps, channels
x = torch.randn(B, T, C)  # random input data
x.shape


torch.Size([4, 8, 2])

In [82]:
# create a bag of words tensor
# we want xbow[b, t] = mean_{i<=t} x[b, i]
xbow = torch.zeros((B, T, C))  # output tensor initialized at zero
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1]  # (t+1, C)
        xbow[b, t] = torch.mean(xprev, 0)  # (C)

x[0]


tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [83]:
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

In [84]:
# version 2
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x  # (T, T) @ (B, T, C) = (B, T, C)
torch.allclose(xbow, xbow2)  # can be False with the default tolerances because of float32 rounding

False

In [87]:
xbow[0], xbow2[0]  # should be identical

(tensor([[ 0.1808, -0.0700],
         [-0.0894, -0.4926],
         [ 0.1490, -0.3199],
         [ 0.3504, -0.2238],
         [ 0.3525,  0.0545],
         [ 0.0688, -0.0396],
         [ 0.0927, -0.0682],
         [-0.0341,  0.1332]]),
 tensor([[ 0.1808, -0.0700],
         [-0.0894, -0.4926],
         [ 0.1490, -0.3199],
         [ 0.3504, -0.2238],
         [ 0.3525,  0.0545],
         [ 0.0688, -0.0396],
         [ 0.0927, -0.0682],
         [-0.0341,  0.1332]]))
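The two tensors print identically, yet `torch.allclose` returned `False` above. A plausible explanation is float32 rounding: the loop and the matrix product sum the same numbers in a different order, and the default absolute tolerance of `torch.allclose` (`atol=1e-8`) is very strict for values close to zero. The sketch below recomputes both versions and compares them with a looser tolerance. The choice of `atol=1e-5` is an assumption, not a value taken from the notebook.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Version 1: explicit loops over batch and time
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t + 1].mean(0)

# Version 2: row-normalized lower-triangular matrix
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x  # (T, T) broadcast against (B, T, C)

max_diff = (xbow - xbow2).abs().max().item()     # largest elementwise difference
close = torch.allclose(xbow, xbow2, atol=1e-5)   # looser absolute tolerance
print(max_diff, close)
```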

In [98]:
# version 3: use softmax
tril = torch.tril(torch.ones(T, T))
print('tril')
print(tril)
print(SEP)

wei = torch.zeros((T, T))
print('wei init')
print(wei)
print(SEP)

wei = wei.masked_fill(tril == 0, float('-inf'))  # all elements where tril is 0 are set to -inf for softmax
print('wei masked')
print(wei)
print(SEP)

wei = F.softmax(wei, dim=1)
print('wei softmax')
print(wei)
print(SEP)

xbow3 = wei @ x
torch.allclose(xbow, xbow3)

tril
tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])
--------------------------------------------------
wei init
tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])
--------------------------------------------------
wei masked
tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, 

False
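Version 3 matters because it is the form that self-attention uses: the weights start as scores, future positions are masked with `-inf`, and softmax normalizes each row. With all scores equal to zero, this should reproduce the averaging weights of version 2. The sketch below checks the weight matrices directly instead of the aggregated outputs. It is an illustrative check, and `wei_avg` and `wei_soft` are names introduced here.

```python
import torch
from torch.nn import functional as F

T = 8
tril = torch.tril(torch.ones(T, T))

# Version 2 weights: each row averages over the current and past positions
wei_avg = tril / tril.sum(1, keepdim=True)

# Version 3 weights: zero scores, future masked with -inf, then softmax
scores = torch.zeros(T, T).masked_fill(tril == 0, float('-inf'))
wei_soft = F.softmax(scores, dim=-1)

weights_match = torch.allclose(wei_avg, wei_soft)
rows_sum_to_one = torch.allclose(wei_soft.sum(-1), torch.ones(T))
print(weights_match, rows_sum_to_one)
```

Since the weights agree, a `False` from comparing `xbow` with `xbow3` most likely has the same rounding cause as in version 2 rather than a logic error.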

In [121]:
# version 4: self attention!
torch.manual_seed(1337)  # for reproducibility
B, T, C = 4, 8, 32  # batch size, time steps, channels
x = torch.randn(B, T, C)  # random input data

# let's see a single Head perform self-attention
head_size = 16  # smaller head size
key = nn.Linear(C, head_size, bias=False)  # key projection
query = nn.Linear(C, head_size, bias=False)  # query projection
value = nn.Linear(C, head_size, bias=False)  # value projection
k = key(x)  # (B, T, head_size)
q = query(x)  # (B, T, head_size)
wei = q @ k.transpose(-2, -1)  # (B, T, head_size) @ (B, head_size, T) --> (B, T, T) 

tril = torch.tril(torch.ones(T, T))  # lower triangular matrix
# wei = torch.zeros((T, T))  # replaced by data-dependent weights: dot product between query (what am I looking for) and key (what do I contain)
wei = wei.masked_fill(tril == 0, float('-inf'))  # set the upper triangle to -inf
wei = F.softmax(wei, dim=-1)  # apply softmax to get the weights

v = value(x)  # (B, T, head_size)
# out = wei @ x
out = wei @ v  # (B, T, T) @ (B, T, head_size) --> (B, T, head_size)

print(out.shape)  # should be (B, T, head_size)
wei[0]

# each row of wei shows how much information a token aggregates from itself and from each past token

torch.Size([4, 8, 16])


tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

**Notes:**
- Attention is a **communication mechanism**. It can be seen as nodes in a directed graph looking at each other and aggregating information, as a weighted sum with data-dependent weights, from all nodes that point to them.

- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.

- Each example across the batch dimension is processed completely independently. Examples never "talk" to each other.

- In an "encoder" attention block, just delete the single line that does the masking with `tril`, which allows all tokens to communicate. The block here is called a "decoder" attention block because it has triangular masking, and it is usually used in autoregressive settings such as language modeling.

- "Self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention," the queries still get produced from x, but the keys and values come from some other, external source (e.g., an encoder module).

- "Scaled" attention additionally multiplies `wei` by 1/sqrt(head_size). When the inputs Q and K have unit variance, `wei` then has unit variance too, so softmax stays diffuse and does not saturate too much.
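The notes above can be gathered into one small module. The sketch below is an assumption about how such a head could be packaged, not code from this notebook: the class name `Head`, the `causal` flag and the returned attention weights are choices made here for illustration. With `causal=True` it behaves like the "decoder" block (triangular masking), and with `causal=False` like the "encoder" block (no masking). The scores are multiplied by 1/sqrt(head_size) in both cases.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """One head of scaled self-attention (illustrative sketch)."""

    def __init__(self, n_embd, head_size, block_size, causal=True):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.causal = causal
        # the mask is a buffer: it moves with the module but is not a trained parameter
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)   # each (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # scaled scores, (B, T, T)
        if self.causal:
            # decoder block: a token may not look at future tokens
            wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        return wei @ v, wei                                   # (B, T, head_size), (B, T, T)

torch.manual_seed(1337)
x = torch.randn(4, 8, 32)
decoder_head = Head(32, 16, block_size=8, causal=True)
encoder_head = Head(32, 16, block_size=8, causal=False)

out, wei_dec = decoder_head(x)
_, wei_enc = encoder_head(x)
print(out.shape)                                 # torch.Size([4, 8, 16])
print(wei_dec[0].triu(1).abs().sum().item())     # weight placed on future tokens by the decoder head
print((wei_enc[0].triu(1) > 0).any().item())     # whether the encoder head attends to future tokens
```

For cross-attention, the same structure would apply with `k` and `v` computed from a second input instead of `x`.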


In [123]:
k = torch.randn(B, T, head_size)  # key
q = torch.randn(B, T, head_size)  # query
wei = q @ k.transpose(-2, -1) * head_size**-0.5  # multiply by 1/sqrt(head_size)

In [124]:
k.var()

tensor(0.9006)

In [125]:
q.var() 

tensor(1.0037)

In [126]:
wei.var()

tensor(0.9957)

In [128]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [129]:
# multiplying the logits by 8 sharpens the softmax output toward a one-hot (peaky) distribution
# the 1/sqrt(head_size) scaling keeps the variance near 1 at initialization to avoid this
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]) * 8, dim=-1)

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])
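The effect of the scaling factor can be measured directly. The sketch below is not from the notebook: it uses larger `B` and `T` than above (an arbitrary choice, to make the variance estimates less noisy) and compares the raw and the scaled scores. With unit-variance `q` and `k`, the variance of the raw dot products should be close to `head_size`, and close to 1 after scaling. It also compares how peaked the resulting softmax rows are.

```python
import torch

torch.manual_seed(0)
B, T, head_size = 32, 64, 16          # larger than in the notebook, for steadier estimates
k = torch.randn(B, T, head_size)      # unit-variance keys
q = torch.randn(B, T, head_size)      # unit-variance queries

raw = q @ k.transpose(-2, -1)         # unscaled scores, (B, T, T)
scaled = raw * head_size ** -0.5      # scaled by 1/sqrt(head_size)

raw_var = raw.var().item()
scaled_var = scaled.var().item()
print(raw_var, scaled_var)

# average of the largest attention weight in each row: higher means a peakier softmax
peak_raw = torch.softmax(raw, dim=-1).max(dim=-1).values.mean().item()
peak_scaled = torch.softmax(scaled, dim=-1).max(dim=-1).values.mean().item()
print(peak_raw, peak_scaled)
```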