# lab session #9 : Tiny GPT 
#### the goal of this notebook is to train a generative model on text data with a transformer architecture using causal self-attention. 

For simplicity, we will first use alpha-numeric characters as tokens.

<img src="Ensicaen-logo.png" alt="logo_ENSI" style="width: 200px;"/> 

### 2023 Notebook by Julien (dot) Rabin (at) ensicaen.fr

________________________________
### Last name / Nom : 
### First name / Prénom : 
### Group :
### Date : 
________________________________

### Today's Menu

In this notebook, the goal is to train a generative model from a text-dataset following these steps :

- [Useful Torch libraries](#0---load-libraries-and-fetch-data) : load necessary libraries and fetch a text dataset
- [Data - Preprocessing](#1---data-pre-processing) define a trivial tokenizer using lookup table on characters
- [Toy model](#2---bi-gram-generative-model) a complete but very shallow toy model that predicts the next token based only on the previous one

    - [Batch routine trick for parallelized training](#21---definine-torch-routine-to-process-data-enconder---decoder)

    - [Bi-Gram model](#22---definine-bigram-model)

    - [Test Random Model :](#23---test-non-trained-model)

    - [Model Training :](#24---train-model)

    - [Model generation :](#25---sampling-the-generative-model)
    
- [Transformer Model](#3---using-pytorch-transformer-encoder-model) model based on a transformer encoder trained with causal masked attention 

    - [Causal attention masks](#32---causal-attention-with-masked-inputs)

    - [Enhanced Model with Transformer Encoder](#33---define-full-generative-model-with-transformers)

    - [Training the transformer model](#34---train-the-transformer-model)

    - [Evaluating the generative model](#35---evaluating-the-model)

- [Questions](#4---questions) some simple questions that should be addressed within the lab session
- [Exercises](#5---exercices) pick a few questions from the proposed exercises to deepen your understanding of the method 

# 0 - load libraries and fetch data

In [None]:
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.nn import functional as F

# for Jupyter notebook
%matplotlib inline 


#### CUDA setup

In [None]:
%env CUDA_VISIBLE_DEVICES=0
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

#### select / create dataset

In [None]:
input_file_path = './julesverne_tourdumonde80jours_UTF8.txt' # converted into UTF8 to deal with french superb accents

print("link to the CNAM text catalogue (~300 books) : http://abu.cnam.fr/BIB/")
print("(see ReadMe.txt for UTF8 conversion)")


In [None]:
with open(input_file_path, encoding="utf-8", mode="r") as f: # the file is closed automatically
    data = f.read()
print(f"length of dataset in characters: {len(data):,}")

# 1 - data pre-processing  

In [None]:
# look at some data characters
n = 20_000
for k in range(10) :
    print(data[n+k*100:n+(k+1)*100])


In [None]:
p = int(.9 * len(data)) # percentage of data for training
data_train = data[:p]
data_eval  = data[p:]
del data

### Tokenizer

In [None]:
token_list = set(data_train) # decompose the data stream into a set of unique characters
token_list = sorted(token_list) # lexicographic sorting
print(token_list)

token_size = len(token_list)
print("Number of Tokens :",token_size)

In [None]:
# define lookup tables
char2token = {val:idx for idx,val in enumerate(token_list)}
print(char2token)
token2char = {idx:val for idx,val in enumerate(token_list)}
print(token2char)

In [None]:
token_encoder = lambda x : [char2token[xi] for xi in x]
token_decoder = lambda t : [token2char[ti] for ti in t]

In [None]:
encoded_data_train = token_encoder(data_train)
print("Encoding :", encoded_data_train[n:n+100])
decoded_data_train = token_decoder(encoded_data_train)
decoded_data_train = ''.join(decoded_data_train) # convert list of char to string
print('____')
print("Decoding :",  decoded_data_train[n:n+100])

print('____')
print(" > Sanity check : is data = decoder (encoder(data)) ?", decoded_data_train == data_train)

In [None]:
# rinse & repeat for the evaluation data data_eval (note: characters absent from data_train have no entry in char2token)

In [None]:
print('test with random tokens :')
x = np.random.randint(0,token_size,(100,))
print("input :", x)
print("output :", token_decoder(x))

# 2 - Toy-Model : (non-causal) Bi-Gram Generative Model

An n-gram model predicts the token $t_k$ at position $k$ given a context made of the previous $n-1$ tokens $(t_{k-1}, \dots, t_{k-n+1})$.

```Note```: for images, where tokens are pixels (e.g. pixel RNN/CNN), the context corresponds to the patch of neighboring pixels.

Formally, we want to train a generative model parametrized by $\theta$  which 
1) predicts, given a context $(t_{k-1}, \dots, t_{k-n+1})$, the most likely token values $x$ for $t_k$, that is, learns the conditional probability on the observed data
$$
    \forall x \in C, \; \text{Pr}(t_k = x | t_{k-1}, \dots, t_{k-n+1})
$$
where $C$ is the set of possible token values.

```Note```: in practice, the context can also include future tokens, either for training purposes only (non-causal models like BERT) or for different tasks (e.g. classification).

2) samples the next token from the resulting multinomial distribution

For now, let's start with $n=2$

some ref : https://fr.wikipedia.org/wiki/N-gramme
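
As a point of comparison for the neural model built below, here is a minimal count-based sketch of a bigram model. The toy string `"abracadabra"`, the add-one smoothing and all variable names are assumptions made for illustration; they are not part of the lab.

```python
import numpy as np

text = "abracadabra"                      # hypothetical toy corpus
chars = sorted(set(text))                 # toy vocabulary
c2i = {c: i for i, c in enumerate(chars)}

# counts[i, j] = number of times token j follows token i in the corpus
counts = np.zeros((len(chars), len(chars)))
for prev, nxt in zip(text[:-1], text[1:]):
    counts[c2i[prev], c2i[nxt]] += 1

# add-one smoothing avoids empty rows, then each row is normalized
# into an estimate of Pr(t_k = x | t_{k-1})
probs = (counts + 1) / (counts + 1).sum(axis=1, keepdims=True)

# sample a short sequence, one token at a time, from the conditional distributions
rng = np.random.default_rng(0)
tok = c2i["a"]
out = ["a"]
for _ in range(20):
    tok = rng.choice(len(chars), p=probs[tok])
    out.append(chars[tok])
print("".join(out))
```

The neural bigram model of section 2.2 learns a comparable table of conditional probabilities, but through embeddings and a classification layer instead of explicit counts.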


# 2.1 - DataLoader routine 
Torch routines to draw batches of token sequences that are processed in parallel during training, and to convert token tensors back to text during generation

In [None]:
def get_batch(encoded_data = encoded_data_train, context_size = 8, batch_size = 4) :

    idx = np.random.randint(0,len(encoded_data) - context_size, (batch_size))
    x = [ torch.tensor(encoded_data[i:i+context_size]) for i in idx ]
    x = torch.stack(x, dim=0) # [batch_size x context_size]

    # targets: y is x shifted by one position, i.e. y[:, t] is the token that follows x[:, t] in the data
    y = [ torch.tensor(encoded_data[i+1:i+1+context_size]) for i in idx ]
    y = torch.stack(y, dim=0) # [batch_size x context_size]
    return x,y

In [None]:
x,y = get_batch(encoded_data_train, context_size = 64, batch_size = 1)
print(x.shape)
print(f"x={x.numpy()} and\ny={y.numpy()}")
x = token_decoder(x[0].numpy())
y = token_decoder(y[0].numpy())
print(f"decoding :\nx: {''.join(x)} \ny: {''.join(y)}")

In [None]:
# function that directly decodes a (batch x N) tensor into a list of `batch` strings of N characters
def batchtok_to_strlist(x) :
    s = []
    for b in range(x.size(0)) :
        l = token_decoder(x[b].numpy().tolist())
        s.append(''.join(l))
    return s

if False : # random token stream
    x = torch.randint(token_size, (2,128))
else : # real token stream
    x,_ = get_batch(encoded_data_train, context_size = 64, batch_size = 2)
print(x)

s = batchtok_to_strlist(x)
print(s)
print(*s)


# 2.2 - definition of a bigram model

#### complete the following toy model and answer the questions
1. Read & Complete this notebook, starting with very small datasets (for instance, changing the data split between `data_train` & `data_eval`):
    1. complete the initialisation of the toy model for positional embedding (e.g. constant, i.e. no embedding; index positions with the `nn.Embedding` torch layer; high-frequency cosine functions; random vectors ...)
    2. complete the forward model with the definition of the loss function using cross entropy
    3. complete the generation method to generate `gen_size` tokens rather than only one, at a given `temperature`
    4. complete the training algorithm (optimiser, auto-diff loss derivation) and display the aggregated cross entropy loss on both the complete training and evaluation datasets (say for each epoch)
    5. what is the role of the temperature parameter during generation ?
2. assess the overfitting of the model using evaluation data by comparing train/eval loss
    1. train with different dataset split ratio (e.g. 80/20, 90/10, 95/5)
    2. train with different model capacity (e.g. embedding dimension, number of layers in MLP)
3. assess the quality of the generated text sequences by varying the temperature parameter during generation




Answer the following questions :
- Why do we need embeddings for tokens ?
- Why do we need embeddings for positions ?
- Why do we use the `CrossEntropyLoss` for training *without* a `Softmax` layer at the end of the classification network ?
- Why do we need a `Softmax` layer during generation ?
- What is the role of the MLP (multi-layer perceptron) here ? is it the same for every processed token ?
- Can tokens communicate with each other in this model ? 
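
To help reason about the loss-related questions, the following sketch compares `F.cross_entropy` applied to raw logits with the mean negative log-probability computed by hand through `log_softmax`. The sizes `B`, `T`, `N` and the random tensors are assumptions chosen for illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, N = 2, 5, 7                     # hypothetical batch size, sequence length, vocabulary size
logits = torch.randn(B, T, N)         # raw scores, no softmax applied
targets = torch.randint(N, (B, T))    # indices of the next tokens

# cross_entropy expects [num_samples x num_classes] logits and [num_samples] class indices
loss = F.cross_entropy(logits.reshape(B * T, N), targets.reshape(B * T))

# same quantity by hand: mean negative log-probability of the target tokens
log_p = F.log_softmax(logits, dim=-1)
manual = -log_p.gather(-1, targets.unsqueeze(-1)).mean()
print(loss.item(), manual.item())
```

Since the log-softmax is already applied inside the loss, adding a `Softmax` layer before it would normalize the scores twice.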

In [None]:
class bigram_model_class (nn.Module) :
    def __init__(self,embed_dim=32, context_size=64):
        super().__init__()
        self.context_size = context_size
        self.embed_dim = embed_dim
        self.hidden_dim = 4*embed_dim

        # position embedding, for instance use 0 for token 0, ... i for token i, etc 
        self.pos = ... # Note: if =0, use the same positional tokens (0) at every position
        self.pos_embedding = nn.Embedding(context_size,embed_dim).to(device) # position to vector representation

        self.token_embedding = nn.Embedding(token_size,embed_dim).to(device) # token to vector representation
        
        # MLP
        self.linear1 = nn.Linear(embed_dim, self.hidden_dim)
        self.activation = nn.GELU()
        self.linear2 = nn.Linear(self.hidden_dim, self.embed_dim)
        self.dropout = nn.Dropout(p=0.1)
        self.layer_norm = nn.LayerNorm(self.embed_dim)

        # Linear classification layer
        self.last_linear = nn.Linear(self.embed_dim, token_size)

    def forward(self,inputs, targets = None):
        B,T = inputs.size()
        assert T == self.context_size

        # inputs & targets are token batch tensor whose shape is [B x T] = [batch_size x self.context_size]
        tok = self.token_embedding(inputs) # [B x T x D] where D = self.embed_dim
        pos = self.pos_embedding(self.pos) # [1 x T x D]
        inputs = tok + pos # [B x T x D]
        
        # residual block
        x = self.dropout(self.activation(self.linear1(inputs)))
        x = self.dropout(self.activation(self.linear2(x)))
        inputs = inputs + self.layer_norm(x)

        logits = self.last_linear(inputs) # [B x T x N] where N = token_size
        
        if targets is not None : # compute the loss function
            loss = ... # using CrossEntropyLoss. Warning some reshaping required to get a tensor : [Batch x #Class]
        else : # during generation, no need to compute the loss function
            loss = None

        return logits, loss

    def generate(self, inputs, gen_size = 1, temperature = 1.) :
        B,T = inputs.size()
        if T < self.context_size : # left-padding with 0 or 1 tokens (= '\n' or ' ') 
            pad = torch.zeros((B,self.context_size - T), dtype=torch.long, device=inputs.device) # same device as inputs
            inputs = torch.cat((pad, inputs), dim=1) # [B x context_size]
            x = inputs
        else : # truncation to the most recent context_size tokens
            x = inputs[:,-self.context_size:] 
        
        logits, _ = self(x) # forward pass
        logits = logits[:,-1,:] # [B x N] as only the last predicted token is useful
        
        p = ... # convert logits to probabilities using softmax and temperature [B x N]
        
        tok = torch.multinomial(p, 1) # [B x 1] random token based on the predicted probabilities
        inputs = torch.cat((inputs, tok), dim=1) # add the generated token to the end of the token-stream

        return inputs[:,self.context_size:] # [B x gen_size]

In [None]:
def print_model_num_param(model) :
    N = 0
    for name, param in model.named_parameters() :
        print(name)
        N += param.view(-1).size(0)
    print("Number of model parameter :",N)

# 2.3 - test non-trained model

In [None]:
# load random model
embed_dim = 16 # dimension of embedding for token (and position)
context_size = 32 # length of token sequences in the batch

bigram_model = ... # do not forget to use device !

print(bigram_model)
print_model_num_param(bigram_model) 

In [None]:
# test loss for training data
batch_size = 100
x,y = get_batch(encoded_data = encoded_data_train, context_size = context_size, batch_size = batch_size)
x,y = x.to(device), y.to(device)

bigram_model.eval() # disables dropout (LayerNorm behaves identically in train and eval mode)
logits,loss = ...
bigram_model.train()

print("cross entropy loss", loss.detach().cpu())
print("expected value for uniform random", np.log(token_size))

In [None]:
# test prediction for random data
batch_size = 4
x = torch.randint(token_size, (batch_size,context_size)).to(device)
strlist = batchtok_to_strlist(x.cpu())
print("random stream of char : ", strlist)

bigram_model.eval()
y,_ = ... # using forward mode w/o targets : logits for each position are returned 
bigram_model.train()

y = ... # extract predicted logits for the last position
p = ... # conversion to probability
tok = ... # select the most probable token (keep shape [B x 1] so that batchtok_to_strlist can decode it)
print("most probable token /char is {} = '{}'".format(tok, batchtok_to_strlist(tok.cpu())))


In [None]:
# test generation
x,_ = get_batch(encoded_data_train, batch_size=1, context_size=context_size)
x = x.to(device)

bigram_model.eval()
y = ... # generate a string of 100 tokens
bigram_model.train()

print("input tokens", x)
print("input str", batchtok_to_strlist(x.cpu()))
print("output tokens", y)
synth = batchtok_to_strlist(y.cpu())
print("output str =\n", synth[0])

# 2.4 - train model

In [None]:
print(bigram_model)
print_model_num_param(bigram_model) # same helper as in section 2.2

In [None]:
optim_bigram = torch.optim ...
batch_size = 128

nepoch = 1
niter = int(nepoch * len(encoded_data_train) / batch_size)
print("niter = ", niter)


In [None]:
Loss = []
for it in range(niter) :
    x,y = get_batch(encoded_data_train, context_size=bigram_model.context_size, batch_size=batch_size)
    x,y = x.to(device),y.to(device)

    loss = ...
    
    optim_bigram.zero_grad()
    loss.backward()
    optim_bigram.step()

    Loss.append(loss.item())

    if (it % max(1, niter//100)) == 0 : # max(1, .) avoids a modulo by zero when niter < 100
        print("it = %d / %d : loss = %f" % (it, niter, Loss[-1]))


In [None]:
plt.plot(Loss)
plt.title('Loss on training batch evaluations')

# 2.5 -  sampling the generative model

In [None]:
# test generation
x,_ = get_batch(encoded_data_train, batch_size=1, context_size=bigram_model.context_size)
x = x.to(device)
y = ... # e.g with gen_size = 100, temperature = 1.
#print("input tokens", x)
print("input str", batchtok_to_strlist(x.cpu()))
#print("output tokens", y)
print("output str", batchtok_to_strlist(y.cpu())) # generated string with default temperature = 1. used during training
y = ... # now with gen_size = 100, temperature = 10.
print("with high temperature", batchtok_to_strlist(y.cpu())) # what happens now ?
y = ... # now with gen_size = 100, temperature = 0.1
print("with low temperature", batchtok_to_strlist(y.cpu())) # what happens now ?

Conclusion on the bigram model : 
- What is the role of the `temperature` parameter  ?
- What are the limitations of using such a bigram model ? For instance, why are sequences like "nnn" generated so often ?
- How could the model be improved to generate more coherent sequences ?
- How can we create a N-gram model with N>2 in this framework ?
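
To experiment with the first question, this small sketch applies a temperature-scaled softmax to a fixed vector of logits. The logit values and the dictionary `probs` are hypothetical; in the model, the logits would come from the last position of the forward pass.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.0])   # hypothetical logits for a 3-token vocabulary

probs = {}
for temperature in (0.1, 1.0, 10.0):
    # dividing the logits by the temperature before the softmax
    # sharpens (T < 1) or flattens (T > 1) the distribution
    probs[temperature] = torch.softmax(logits / temperature, dim=-1)
    print(temperature, probs[temperature])
```

With a low temperature, sampling gets close to always picking the most probable token; with a high temperature, it gets close to uniform random sampling.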

______________________________________________
# 3 - Generative model based on pytorch transformer model

Recall that GPT is based on the encoder-decoder transformer architecture proposed in 
"Attention Is All You Need" (Vaswani et al., 2017). 
This general architecture is composed of two main parts:
- The encoder uses self-attention on the **full sequence** to compute a *latent representation of the input sequence*. 
- The decoder takes as inputs **both its own (shifted) target sequence and this latent representation**, making use of causal self-attention and of cross-attention between the two representations. Causality is ensured in the decoder by masking future tokens.

Here, since we want to build a generative model, we will only use the decoder part of the transformer architecture with **causal self-attention**.
Since we do not need a latent representation of the full sequence, there is no cross-attention: a decoder block without cross-attention is an encoder block with a causal mask. We will therefore use the `nn.TransformerEncoder` module from pytorch together with a causal attention mask.

##### Exercise
Complete the following cells exploring the encoder architecture and the definition of the causal attention masks. Then complete the proposed `transformer_model_class` to build a full generative model based on a transformer encoder with causal attention. Last, train the model and evaluate its performance on text generation. 

**Warning !** To start with, use a very small dataset (e.g. 1000 characters) and a small model (e.g. embedding dimension 16, 2 layers with 2 heads) to check that everything works fine before increasing the model capacity and dataset size.

### 3.1 - test with torch transformer encoder `torch.nn.TransformerEncoder`

Warning ! by default, `batch_first` is set to `False` in torch transformer models. 
So the input shape is $[T \times B \times d]$, that is (seq_len, batch_size, embed_dim)
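
The two layouts can be compared with a short sketch. The sizes `T`, `B`, `d` and the layer names are assumptions made for illustration; `batch_first=True` is the constructor option of `nn.TransformerEncoderLayer` that switches to the $[B \times T \times d]$ layout.

```python
import torch
import torch.nn as nn

T, B, d = 5, 3, 8   # hypothetical sequence length, batch size, embedding dimension

layer_tbd = nn.TransformerEncoderLayer(d_model=d, nhead=2)                    # expects [T x B x d]
layer_btd = nn.TransformerEncoderLayer(d_model=d, nhead=2, batch_first=True)  # expects [B x T x d]

x = torch.rand(B, T, d)
out_tbd = layer_tbd(x.permute(1, 0, 2))  # permute to [T x B x d] for the default layout
out_btd = layer_btd(x)                   # no permutation needed
print(out_tbd.shape, out_btd.shape)
```

This notebook keeps the default layout and permutes the tensors explicitly, which makes the shape convention visible at each step.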


<!--- the following code include figures from repertory /fig: transformer_architecture.jpg and transformer_block.jpg -->
![Transformer block](fig/transformer_block.jpg)
![Transformer architecture](fig/transformer_architecture.jpg)

In [None]:
# test with torch.nn.TransformerEncoder
# model : torch.nn.TransformerEncoder(encoder_layer, num_layers, norm=None, enable_nested_tensor=True, mask_check=True)
# forward : forward(src, mask=None, src_key_padding_mask=None, is_causal=None)

embed_dim = ... # dimension of each token embedding
num_heads = 1 # number of heads in attention model

encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads).to(device)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6).to(device)

# /!\ batch dimension is 1 rather than 0  
inputs = torch.rand((context_size, batch_size, embed_dim), device=device) # [context_size x batch x embed_dim] : src should be embedded (with pos embedding) before being fed to the encoder
print("inputs = ",inputs)
outputs = transformer_encoder(inputs) # the encoder defined above (transformer_model is only defined in section 3.3)
print("outputs = ",outputs)

In [None]:
print(transformer_encoder)
print_model_num_param(transformer_encoder)

In [None]:
# test the model on true data, with mask = None

### 3.2 - causal attention with masked inputs
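
One way to check that a mask really enforces causality is to modify the last token only and verify that the outputs at earlier positions do not change. The sketch below does this on random inputs; the sizes, the small encoder and the variable names are assumptions for illustration, not the model of this lab.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, B, d = 6, 1, 8   # hypothetical sequence length, batch size, embedding dimension
layer = nn.TransformerEncoderLayer(d_model=d, nhead=2, dim_feedforward=16, dropout=0.0)
encoder = nn.TransformerEncoder(layer, num_layers=2, enable_nested_tensor=False).eval()

# additive mask: 0 on and below the diagonal, -inf strictly above (future positions)
mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

x = torch.rand(T, B, d)          # [T x B x d], default (batch_first=False) layout
x_mod = x.clone()
x_mod[-1] = torch.rand(B, d)     # change the last position only

with torch.no_grad():
    out = encoder(x, mask=mask)
    out_mod = encoder(x_mod, mask=mask)
    out_free = encoder(x, mask=None)
    out_free_mod = encoder(x_mod, mask=None)

# with the causal mask, outputs at positions 0..T-2 should be unaffected by the change
same_with_mask = torch.allclose(out[:-1], out_mod[:-1], atol=1e-6)
# without a mask, every position attends to the modified token
same_without_mask = torch.allclose(out_free[:-1], out_free_mod[:-1], atol=1e-6)
print(same_with_mask, same_without_mask)
```

The same test can be applied to the full model of section 3.3 once it is completed.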

In [None]:
# define an additive mask
x,_ = get_batch(encoded_data_train, batch_size=batch_size, context_size=context_size) # B x T
x = x.to(device)
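inputs = tok2vec_emb(x) # [B x T x d] ; assumes tok2vec_emb (e.g. an nn.Embedding(token_size, embed_dim) on device) was defined in the previous cell
inputs = inputs.permute(1,0,2) # [T x B x d]
 
# mask has to be [T x T]
# mask[i,j] = 0 if j <= i else -inf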
mask = torch.tril(torch.ones(context_size,context_size), diagonal=0)

mask[mask<1.] = -np.inf # the dot product is equal to -inf for future (unseen) token
mask[mask>=1.] = 0.
mask = mask.to(device)
print(mask)
print(mask.shape)

In [None]:
# now test the model, using the provided mask 

### 3.3 - a full generative model with transformer encoder blocks

- same as the previous bigram model class, where the MLP is replaced by several masked transformer encoder blocks

In [None]:
class transformer_model_class (nn.Module) :
    def __init__(self,embed_dim=32, context_size=64, num_heads=8, num_layers=1, dim_feedforward=64, dropout=0.):
        super().__init__()
        self.context_size = context_size
        self.embed_dim = embed_dim
        self.num_heads = num_heads

        assert (embed_dim % num_heads == 0) # embed_dim must be multiple of num_heads

        self.pos = ... # precomputed positional token
        self.pos_emb = nn.Embedding(context_size,embed_dim) # position to vector representation

        self.tok2vec_emb = nn.Embedding(token_size,embed_dim).to(device) # token to vector representation
        self.vec2tok_emb = nn.Linear(embed_dim, token_size).to(device) # vector to token logits
        
        # use here a transformer encoder
        encoder_layer = ...
        self.transformer_encoder = ...
        
        # causal mask
        mask = torch.tril(torch.ones(context_size,context_size), diagonal=0)
        mask[mask<1.] = -np.inf # the dot product is equal to -inf for future (unseen) tokens
        mask[mask>=1.] = 0.
        self.mask = mask.to(device)

    def forward(self,inputs, targets = None):
        B,T = inputs.size()
        assert T == self.context_size
        
        # inputs & targets are token batch tensor whose shape is [B x T] = [batch_size x self.context_size]
        tok = self.tok2vec_emb(inputs) # [B x T x D] where D = self.embed_dim
        pos = self.pos_emb(self.pos) # [1 x T x D]
        tok = tok + pos # [B x T x D]
        inputs = tok.permute(1,0,2) # [T x B x D]
        outputs = self.transformer_encoder(inputs, self.mask) # [T x B x D]
        outputs = outputs.permute(1,0,2) # [B x T x D]
        logits = self.vec2tok_emb(outputs) # [B x T x N] where N = token_size
        
        if targets is not None :
            loss = ...
        else :
            loss = None

        return logits, loss

    def generate(self, inputs, gen_size = 1, temperature = 1.) :
        
        B,T = inputs.size()
        if T < self.context_size : # padding with 0 or 1 tokens (= '\n' or ' ') 
            inputs = torch.cat((torch.zeros((B,self.context_size - T), dtype=torch.long, device=device), inputs), dim=1)
        
        x = inputs[:,-self.context_size:] if (inputs.size(1) > self.context_size) else inputs # truncation of maximum context
        
        logits, _ = self(x) # forward pass
        logits = logits[:,-1,:] # [B x N] as only the last predicted token is useful
        
        p = ... # logits to probabilities [B x N], using temperature 
        
        tok = torch.multinomial(p, 1) # [B x 1] random token based on the predicted probabilities
        inputs = torch.cat((inputs, tok), dim=1) # add the generated token to the end of the token-stream

        inputs = inputs[:,self.context_size:]
        return inputs

In [None]:
# model parameter
num_heads = 2 # number of heads in attention model 
embed_dim = 4*num_heads # note : "each head will have dimension embed_dim // num_heads"
num_layers = 4
dropout = 0.1
dim_feedforward = 4*embed_dim

# data parameter
context_size = 32 # input size of the transformer encoder

transformer_model = ... # remember to use 'device' 

In [None]:
#print(transformer_model)
N = 0
for name, param in transformer_model.named_parameters() :
    #print(name)
    #print(param.shape)
    N += param.view(-1).size(0)
print("Number of model parameter :",N)

In [None]:
# test the forward model on training data

In [None]:
# sample the (untrained) generative model

### 3.4 - train the transformer model

In [None]:
optim_transformer = torch.optim...
batch_size = 16

Loss = []

nepoch = 10
niter = int(nepoch * len(encoded_data_train) / batch_size)
#niter = 5000
print("niter = ", niter)

In [None]:
transformer_model.train()
for it in range(niter) :
    x,y = get_batch(encoded_data_train, context_size=transformer_model.context_size, batch_size=batch_size)
    x,y = x.to(device), y.to(device)

    loss =  ...
    
    optim_transformer.zero_grad()
    loss.backward()
    optim_transformer.step()

    Loss.append(loss.item())

    if (it % max(1, niter//100)) == 0 : # max(1, .) avoids a modulo by zero when niter < 100
        print("/!\\ compute the evaluation criterion on the data_train / data_eval sets")
        print("it = %d / %d : loss = %f" % (it, niter, Loss[-1]))
    

In [None]:
plt.plot(Loss)

### 3.5 - evaluating the model
Test the trained generative model on train data and eval data, then generate new tokens based on your own sentence
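
When comparing train and eval losses, a useful reference is the loss of a model that predicts every token with equal probability, which is $\log N$ for a vocabulary of size $N$ (the value printed in section 2.3). The sketch below checks this and reports the perplexity, i.e. the exponential of the mean cross entropy, which is an additional metric assumed here for illustration. The vocabulary size and the random targets are hypothetical.

```python
import math
import torch
import torch.nn.functional as F

N = 50                                      # hypothetical vocabulary size
targets = torch.randint(N, (1000,))         # hypothetical target tokens
uniform_logits = torch.zeros(1000, N)       # identical logits = uniform prediction

loss = F.cross_entropy(uniform_logits, targets)
perplexity = math.exp(loss.item())          # perplexity = exp(mean cross entropy)
print(loss.item(), math.log(N), perplexity)
```

A trained model should reach a loss clearly below $\log N$ on both sets; a large gap between train and eval losses is a sign of overfitting.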


In [None]:
# your code 
...

## 4 - Proposed extensions
Pick a few of the proposed exercises to deepen your understanding of the method


Suggestion 1 :
- experiment with other datasets, for instance python code (Mostly Basic Python Problems Dataset : https://github.com/google-research/google-research/tree/master/mbpp)




Suggestion 2 : using a more powerful tokenizer

For instance, use the GPT2 BPE tokenizer and train your model on English data (using public domain data, e.g. https://www.gutenberg.org/cache/epub/63355/pg63355.txt)

You can take inspiration from the following code snippet

In [None]:
%pip install tiktoken
import tiktoken

In [None]:
# load data : Pasteur book in english
data_url = "https://www.gutenberg.org/cache/epub/63355/pg63355.txt"
!wget $data_url -O data.txt

In [None]:
with open('data.txt', encoding="utf-8") as f: # explicit encoding, file closed automatically
    train_data = f.read()

print("Number of characters", len(train_data))

n = 10_000
print(" some excerpt : ", train_data[n:n+1000])

In [None]:
# encode with tiktoken gpt2 bpe
enc = tiktoken.get_encoding("gpt2")
train_tok = enc.encode_ordinary(train_data)
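token_size = enc.n_vocab # vocabulary size of the tokenizer (embedding layers need at least max token index + 1 entries)
print(f"dataset has been split in {len(train_tok):,} tokens with indices from {min(train_tok)} to {max(train_tok)} (vocabulary size {token_size})")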

In [None]:
seq = train_tok[n:n+100]
print("excerpt :", seq)

In [None]:
seq_ori = train_data[n:n+100]
print("\t original :\n" + seq_ori)
print("_" * 50)
# decoding
seq_char = enc.decode(enc.encode(seq_ori))
print("\tdecoded :\n" + seq_char)