# Introduction

This notebook develops a character-based LLM trained on [Shakespeare text](https://github.com/karpathy/ng-video-lecture/blob/master/input.txt).

**Resources**
- [Reference tutorial from Andrej Karpathy](https://www.youtube.com/watch?v=kCc8FmEb1nY)

In [1]:
# Import Standard Libraries
import os
import tiktoken

from pathlib import Path

import torch
import torch.nn as nn
from torch.nn import functional as F

# Read Data

In [2]:
# Define local data file path
data_file_path = Path(os.path.abspath('')).parents[1] / 'data' / 'character_based_llm_data.txt'

In [3]:
# Read data
with open(data_file_path, 'r', encoding='utf-8') as data_file:
    data = data_file.read()

In [4]:
# Define the vocabulary of characters in the data
vocaulary = sorted(list(set(data)))
vocaulary_size = len(vocaulary)

# Data Preprocessing

## Tokenizer
Tokenization is a key preprocessing step: it converts each unit of the sequence 
(a single character or a sub-word token) into an integer, using a lookup over all the values in the training vocabulary.

### Custom Tokenizer

In [5]:
# String to integer encoder
string_integer_encoder = {character: integer for integer, character in enumerate(vocaulary)}

In [6]:
# Integer to string decoder
string_integer_decoder = {integer: character for integer, character in enumerate(vocaulary)}

In [7]:
# Define the encoder
encoder = lambda string: [string_integer_encoder[character] for character in string]

In [8]:
# Define the decoder
decoder = lambda integers_list: ''.join([string_integer_decoder[integer] for integer in integers_list])

In [9]:
# Define a sample sentence
tokeniser_sample_sentence = 'Hello there'

In [10]:
print('Example of Encoding and Decoding')
print('Example sentence: {}'.format(tokeniser_sample_sentence))
print('Encode: {}'.format(encoder(tokeniser_sample_sentence)))
print('Decode: {}'.format(decoder(encoder(tokeniser_sample_sentence))))

Example of Encoding and Decoding
Example sentence: Hello there
Encode: [20, 43, 50, 50, 53, 1, 58, 46, 43, 56, 43]
Decode: Hello there
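
One limitation of a lookup-table tokenizer like this one is that it can only encode characters that appear in the text it was built from: the dictionary lookup in `encoder` would raise a `KeyError` for anything else. The sketch below rebuilds the same kind of mapping on a toy string and reports unseen characters explicitly. The names `toy_text`, `toy_encoder_table`, `toy_decoder_table`, `safe_encode` and `decode` are illustrative assumptions, not part of the notebook.

```python
# Toy corpus standing in for the Shakespeare text (assumption for illustration)
toy_text = "hello there"

# Same construction as the notebook: sorted unique characters -> integer ids
toy_vocabulary = sorted(set(toy_text))
toy_encoder_table = {character: integer for integer, character in enumerate(toy_vocabulary)}
toy_decoder_table = {integer: character for character, integer in toy_encoder_table.items()}


def safe_encode(string):
    """Encode a string, reporting characters missing from the vocabulary."""
    unknown = sorted(set(string) - set(toy_encoder_table))
    if unknown:
        raise ValueError(f"Characters not in vocabulary: {unknown}")
    return [toy_encoder_table[character] for character in string]


def decode(integers_list):
    """Map a list of integer ids back to a string."""
    return ''.join(toy_decoder_table[integer] for integer in integers_list)


# Round trip works for characters seen in the corpus
round_trip = decode(safe_encode("three"))

# An unseen character ('z') is rejected with a readable message
try:
    safe_encode("zero")
    unseen_error = None
except ValueError as error:
    unseen_error = str(error)

print(round_trip)
print(unseen_error)
```

For the Shakespeare corpus this is rarely a problem in practice, since prompts are drawn from the same 65 characters, but it is worth keeping in mind when feeding arbitrary text to the model.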


### TikToken

Ready-made tokenizers are also available, such as [TikToken](https://github.com/openai/tiktoken) from OpenAI. The goal is the same, to produce a numerical representation of a string, but they are built on a different vocabulary (sub-word tokens rather than single characters) and therefore split the sequence differently.

In [11]:
# Get the Tokenizer
tiktoken_tokenizer = tiktoken.get_encoding('gpt2')

In [12]:
print('Vocabulary size of TikToken: {}'.format(tiktoken_tokenizer.n_vocab))
print('Vocabulary size of Custom Tokenizer: {}'.format(vocaulary_size))

Vocabulary size of TikToken: 50257
Vocabulary size of Custom Tokenizer: 65


In [13]:
print('Example of Encoding and Decoding')
print('Example sentence: {}'.format(tokeniser_sample_sentence))
print('Encode: {}'.format(tiktoken_tokenizer.encode(tokeniser_sample_sentence)))
print('Decode: {}'.format(tiktoken_tokenizer.decode(tiktoken_tokenizer.encode(tokeniser_sample_sentence))))

Example of Encoding and Decoding
Example sentence: Hello there
Encode: [15496, 612]
Decode: Hello there


### Tokenize the Data

In [14]:
# Tokenize the data and store it in a PyTorch Tensor
data_encoded_tensor = torch.tensor(encoder(data), dtype=torch.long)

In [15]:
print(data_encoded_tensor.shape, data_encoded_tensor.dtype)

torch.Size([1115389]) torch.int64


In [16]:
# The entire dataset is now represented as a sequence of integers
print(data_encoded_tensor[:1000])

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
        47, 59, 57,  1, 47, 57,  1, 41, 

## Train & Validation Split

In [17]:
# Define the train percentage
train_percentage = 0.9

In [18]:
# Split in train and validation data
train_data = data_encoded_tensor[:int(train_percentage * len(data_encoded_tensor))]
validation_data = data_encoded_tensor[int(train_percentage * len(data_encoded_tensor)):]
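
Note that the split is sequential rather than shuffled: the first 90% of the text is used for training and the final 10% for validation. As a quick sanity check, the sizes of the two parts can be worked out from the tensor length printed earlier (`torch.Size([1115389])`). The variable names below are illustrative.

```python
# Length of the encoded tensor, as printed by the notebook
total_tokens = 1115389
train_percentage = 0.9

# Same index arithmetic as the slicing above
split_index = int(train_percentage * total_tokens)
train_tokens = split_index
validation_tokens = total_tokens - split_index

print(f"Train tokens: {train_tokens}")
print(f"Validation tokens: {validation_tokens}")
```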

## Blocks

The training data is split into Blocks: sequences of consecutive characters taken from random positions in the training data. The length of a single block is the Block Size.

In [19]:
# Define the block_size
block_size = 8

print('First Block')
print(train_data[:block_size + 1])
print(data[:block_size + 1])

First Block
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])
First Cit


**NOTE:** 
Within each block, the transformer will learn multiple sequences at a time. 
- With [18], usually [47] comes next
- With [18, 47], usually [56] comes next
- With [18, 47, 56], usually [57] comes next
- ...

So each block contains several training examples, each with a different context length and its own label, and the Transformer learns from all of them at once. This way it sees contexts ranging from a single character up to `block_size` characters and learns to predict from any of them.

In [20]:
# Example of features and labels within a single block
x = train_data[:block_size]
y = train_data[1:block_size + 1]

for index in range(block_size):
    
    features = x[:index + 1]
    label = y[index]
    
    print(f"When features are {features} the label is {label}")

When features are tensor([18]) the label is 47
When features are tensor([18, 47]) the label is 56
When features are tensor([18, 47, 56]) the label is 57
When features are tensor([18, 47, 56, 57]) the label is 58
When features are tensor([18, 47, 56, 57, 58]) the label is 1
When features are tensor([18, 47, 56, 57, 58,  1]) the label is 15
When features are tensor([18, 47, 56, 57, 58,  1, 15]) the label is 47
When features are tensor([18, 47, 56, 57, 58,  1, 15, 47]) the label is 58


## Batches
A batch is a set of blocks that are passed to the Transformer at the same time.

In [21]:
# Define the batch size
batch_size = 4

# Set Torch Seed
torch.manual_seed(1337)

<torch._C.Generator at 0x1075c09d0>

In [22]:
def get_batch(data):
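    """
    Sample a random batch of blocks (x) and their next-character targets (y)
    
    Args:
        data: torch.Tensor 1-D tensor of encoded tokens
    
    Returns:
        x: torch.Tensor features of shape (batch_size, block_size)
        y: torch.Tensor labels of shape (batch_size, block_size), i.e. x shifted by one position
    """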
    
    # Define the block index for each of the batch
    block_indices = torch.randint(len(data) - block_size, (batch_size,))
    
    # Retrieve x and y from the data
    x = torch.stack([data[block_index:block_index + block_size] for block_index in block_indices])
    y = torch.stack([data[block_index + 1:block_index + block_size + 1] for block_index in block_indices])
    
    return x, y

In [23]:
# Retrieve sample batches
x_batch_sample, y_batch_sample = get_batch(train_data)

In [24]:
print(f"X Batch Sample Shape: {x_batch_sample.shape}")
print(f"Y Batch Sample Shape: {y_batch_sample.shape}")
print('\n')
print(f"X First Batch Sample: {x_batch_sample[0]}")
print(f"Y First Batch Sample: {y_batch_sample[0]}")

X Batch Sample Shape: torch.Size([4, 8])
Y Batch Sample Shape: torch.Size([4, 8])


X First Batch Sample: tensor([59, 57,  1, 58, 56, 39, 47, 58])
Y First Batch Sample: tensor([57,  1, 58, 56, 39, 47, 58, 53])


Each block in the batch contains 8 independent training examples, so this batch of 4 blocks holds 32 examples in total.
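
To see where the individual training examples come from, the sketch below enumerates every (context, target) pair in a batch. It is a pure-Python stand-in for `get_batch`, run on a toy token stream (`toy_tokens`, an assumption for illustration) instead of `train_data`, so that the counting is easy to follow: each of the 4 blocks contributes 8 pairs.

```python
import random

# Values matching the notebook's settings
block_size = 8
batch_size = 4

# Toy token stream standing in for train_data (assumption for illustration)
toy_tokens = list(range(100))
rng = random.Random(1337)

# Pure-Python stand-in for get_batch: pick random block starts, then slice
starts = [rng.randrange(len(toy_tokens) - block_size) for _ in range(batch_size)]
x_batch = [toy_tokens[start:start + block_size] for start in starts]
y_batch = [toy_tokens[start + 1:start + block_size + 1] for start in starts]

# Every (block, position) pair is one training example
examples = []
for block_index in range(batch_size):
    for position in range(block_size):
        context = x_batch[block_index][:position + 1]
        target = y_batch[block_index][position]
        examples.append((context, target))

print(f"Number of training examples in the batch: {len(examples)}")
```

Because the toy stream is simply `0, 1, 2, ...`, each target is the last context value plus one, which makes it easy to confirm that `y` is `x` shifted by one position.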

# Model Training

## Bigram Language Model

This is about the simplest neural language model there is. We're going to implement it as a PyTorch module.
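
For intuition, a bigram model predicts the next character from the current character alone. The sketch below is a count-based stand-in, not the embedding-table model built in this notebook: it tallies which character follows which in a toy string and normalises each row into probabilities. The names `toy_text`, `bigram_counts` and `next_character_probabilities` are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Toy corpus (assumption for illustration)
toy_text = "abababac"

# Count how often each character follows each other character
bigram_counts = defaultdict(Counter)
for current_character, next_character in zip(toy_text, toy_text[1:]):
    bigram_counts[current_character][next_character] += 1


def next_character_probabilities(current_character):
    """Normalise the counts of one row into a probability distribution."""
    row = bigram_counts[current_character]
    total = sum(row.values())
    return {character: count / total for character, count in row.items()}


probabilities_after_a = next_character_probabilities('a')
print(probabilities_after_a)
```

In the neural version, the `vocabulary_size x vocabulary_size` embedding table plays the role of this count table: each row holds learned logits for the next character instead of raw counts.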

In [25]:
# Set Torch Seed
torch.manual_seed(1337)

<torch._C.Generator at 0x1075c09d0>

In [26]:
class BigramLanguageModel(nn.Module):
    
    def __init__(self, vocabulary_size):
        
        super().__init__()
        
        # Create a Token Embedding Table, which is a matrix vocabulary_size x vocabulary_size
        self.token_embedding_table = nn.Embedding(vocabulary_size, vocabulary_size)
        
    def forward(self, index, targets=None):
    
        # Passing an index to the token_embedding_table returns the corresponding row: the logits (unnormalised scores) for the next character
        # The result has shape (Batch, Time, Channels) -> torch.Size([4, 8, 65])
        logits = self.token_embedding_table(index)
        
        # Compute the loss in case there are the target labels
        if targets is None:
            
            loss = None
            
        else:
        
            # Before calculating the loss we need to reshape the logits: for multi-dimensional input, cross_entropy expects the channels as the second dimension, i.e. (Batch, Channels, Time), so we flatten to (Batch * Time, Channels)
            # Get the logits shape
            batch_dim, times_dim, channels_dim = logits.shape

            # Reshape the logits
            logits = logits.view(batch_dim * times_dim, channels_dim)

            # Reshape the targets as well
            targets = targets.view(batch_dim * times_dim)

            # Measure the quality of the logits predictions
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss
    
    def generate(self, index, max_new_tokens):
        """
        Generate max_new_tokens new tokens, one at a time, each sampled from the predicted distribution for the next character
        """
        
        # index is a (B, T) array
        for _ in range(max_new_tokens):
            
            # Get predictions (B, T, C)
            logits, loss = self(index)
            
            # Select only the predictions in the last element (B, C)
            # NOTE: Only the last time step is used. In a bigram model its logits depend only on the last token, so the earlier context in the sequence is ignored when predicting.
            logits = logits[:, -1, :]
            
            # Get probabilities through the Softmax function (B, C)
            probabilities = F.softmax(logits, dim=-1)
            
            # Sample from the distribution (B, 1) and obtain the next index character for all the batches
            index_next_character = torch.multinomial(probabilities, num_samples=1)
            
            # Append the sampled index of the next character to the sequence (B, T+1)
            index = torch.cat((index, index_next_character), dim=1)
            
        return index

In [27]:
# Instantiate the Bigram Language Model
bigram_language_model = BigramLanguageModel(vocabulary_size=vocaulary_size)

In [28]:
# Compute the logits and the loss for a single batch data sample
logits, loss = bigram_language_model(x_batch_sample, targets=y_batch_sample)

In [29]:
logits.shape

torch.Size([32, 65])

In [30]:
loss

tensor(4.5242, grad_fn=<NllLossBackward0>)
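
A useful reference point for this number is the loss of a model that assigns equal probability to all 65 characters, which is `-ln(1/65)`. The short calculation below works it out (`uniform_loss` is an illustrative name). The measured loss is somewhat higher, which is what one would expect from randomly initialised logits that are not perfectly uniform.

```python
import math

vocabulary_size = 65  # number of distinct characters reported by the notebook

# Cross-entropy of a uniform prediction: -ln(1 / vocabulary_size)
uniform_loss = -math.log(1.0 / vocabulary_size)
print(f"Loss of a uniform guess: {uniform_loss:.4f}")
```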

### Example

In [31]:
# Instantiate a 1 x 1 Tensor holding a zero value (it corresponds to the 'new line' character)
# It is the first character, which kicks off the generation
initial_seed = torch.zeros((1, 1), dtype=torch.long)

In [32]:
# Generate 100 more characters ('max_new_tokens=100')
# Retrieve the first and only batch element ('[0]')
print(decoder(bigram_language_model.generate(initial_seed, max_new_tokens=100)[0].tolist()))


Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


As expected, the output is gibberish: the model has not been trained yet, so its logits are essentially random. There is also a structural limitation: to generate each character, only the previous character is actually used, whereas ideally the whole sequence generated so far should inform the prediction of the next element. This is what the 'NOTE' in the `generate` function refers to.

### Training the Model

In [33]:
# Define the Optimizer
optimizer = torch.optim.AdamW(bigram_language_model.parameters(), lr=1e-3)

In [34]:
# Increase the batch size from 4 to 32
batch_size = 32

# Loop over 10000 iterations
for step in range(10000):
    
    # Sample data
    x_batch, y_batch = get_batch(train_data)
    
    # Evaluate the loss
    logits, loss = bigram_language_model(x_batch, y_batch)
    
    # Reset the gradient
    optimizer.zero_grad(set_to_none=True)
    
    # Backpropagate the error to get the gradients for all the weights
    loss.backward()
    
    # Update the weights
    optimizer.step()
    
    # Print the loss every 1000 steps
    if step % 1000 == 0:
        print(f'Step: {step} - Loss: {round(loss.item(), 4)}')
        
print(f'Step: {step} - Loss: {round(loss.item(), 4)}')

Step: 0 - Loss: 4.7736
Step: 1000 - Loss: 3.7155
Step: 2000 - Loss: 3.1113
Step: 3000 - Loss: 2.8313
Step: 4000 - Loss: 2.487
Step: 5000 - Loss: 2.5177
Step: 6000 - Loss: 2.5832
Step: 7000 - Loss: 2.5644
Step: 8000 - Loss: 2.4895
Step: 9000 - Loss: 2.5055
Step: 9999 - Loss: 2.3863


As we can see, the loss goes down, from 4.77 to roughly 2.4-2.5, and flattens out after about 4000 steps.

<br>

However, this loss is noisy, because it is computed on a single, randomly sampled batch. A more reliable figure is an estimated loss: the average of the loss over multiple batches.
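
A minimal sketch of such an averaged estimate is shown below. It is self-contained and uses a toy setup: `toy_model` (a bare `nn.Embedding`), `toy_data`, `toy_get_batch` and the `evaluation_iterations` parameter are all assumptions for illustration, standing in for the notebook's `bigram_language_model`, `train_data` / `validation_data` and `get_batch`. Gradients are disabled during evaluation, since no backward pass is needed.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)

# Toy setup standing in for the notebook's model and data (assumptions)
toy_vocabulary_size = 10
toy_block_size = 8
toy_batch_size = 4
toy_data = torch.randint(toy_vocabulary_size, (500,))
toy_model = nn.Embedding(toy_vocabulary_size, toy_vocabulary_size)


def toy_get_batch(data):
    """Sample random blocks and their one-step-shifted targets."""
    starts = torch.randint(len(data) - toy_block_size, (toy_batch_size,))
    x = torch.stack([data[start:start + toy_block_size] for start in starts])
    y = torch.stack([data[start + 1:start + toy_block_size + 1] for start in starts])
    return x, y


@torch.no_grad()
def estimate_loss(model, data, evaluation_iterations=50):
    """Average the cross-entropy loss over several random batches."""
    model.eval()
    losses = torch.zeros(evaluation_iterations)
    for iteration in range(evaluation_iterations):
        x, y = toy_get_batch(data)
        logits = model(x)  # (Batch, Time, Channels)
        losses[iteration] = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    model.train()
    return losses.mean().item()


average_loss = estimate_loss(toy_model, toy_data)
print(f"Estimated loss: {average_loss:.4f}")
```

Calling the same function on both the training and the validation split at regular intervals gives two smoother curves that can be compared to spot overfitting.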

### Example after Training

In [35]:
# Generate again 400 tokens and let's see the improvement
print(decoder(bigram_language_model.generate(initial_seed, max_new_tokens=400)[0].tolist()))


Iyoteng h hasbe pave pirance
RDe hicomyonthar's
PlinseKEd ith henouratucenonthioneir thondy, y heltieiengerofo'dsssit ey
KIN d pe wither vouprrouthercc.
hathe; d!
My hind ttid?
ig t ouchos tes; st yo hind wotin grotonear 'so it t jod weancotha:
h haybet--s n prids, r loncave w hollular s O:
HIs; ht anjx?

DUThinqunt.

LaZAnde.
athave l.
KEONH:
ARThanco be y,-hedarwnoddy scar t tridesar, wnl'shenou


Much better.

# Self-Attention

## Theory

In [36]:
# Set seed
torch.manual_seed(1337)

<torch._C.Generator at 0x1075c09d0>

In [43]:
# Define dimensions
batch_size = 4
token_size = 8
channel_size = 2

In [44]:
# Define a Batch, Tokens, Channels tensor
# NOTE: Channels is the content of each token (i.e., 2 numbers)
self_attention_tensor = torch.randn(batch_size, token_size, channel_size)

In [42]:
self_attention_tensor.shape

torch.Size([4, 8, 2])

We would now like the 8 tokens in each batch element to "talk" to each other.

<br>

Given the following tokens: `[1, 2, 3, 4, 5, 6, 7, 8]`, we want to establish communication between them in a very specific way.
The token `5` should be able to communicate with tokens `[1, 2, 3, 4]`, but not with `[6, 7, 8]`, because those are **Future Tokens**.

<br>

How can we make such communication happen? For token `5`, the simplest option is to average it with the tokens that come before it, `[1, 2, 3, 4]`. That average becomes a sort of Feature Vector that summarises token `5` in the context of its history. However, a plain average loses a lot of information, such as the order of the tokens.

## For Loop

In [46]:
# Define the empty Feature Vector
# NOTE: Bag of Words is a term used when averaging things together
tensor_bag_of_words = torch.zeros((batch_size, token_size, channel_size))

# Populate the bag of words
for batch in range(batch_size):
    for token in range(token_size):
        
        # Retrieve previous tokens for the current batch
        # Shape is (Tokens, Channels)
        previous_tokens = self_attention_tensor[batch, :token+1]
        
        # Compute the mean and store it in the bag of words
        # Mean over the 0-dimension (i.e., the tokens)
        tensor_bag_of_words[batch, token] = torch.mean(previous_tokens, 0)

Let's analyse the first batch element.

In [52]:
print('Original Tensor')
print(self_attention_tensor[0])
print('\n')
print('Bag of Words')
print(tensor_bag_of_words[0])

Original Tensor
tensor([[-2.0555,  1.8275],
        [ 1.3035, -0.4501],
        [ 1.3471,  1.6910],
        [-0.1244, -1.6824],
        [-0.0266,  0.0740],
        [ 1.0517,  0.6779],
        [ 0.3067, -0.7472],
        [ 0.7435,  0.8877]])


Bag of Words
tensor([[-2.0555,  1.8275],
        [-0.3760,  0.6887],
        [ 0.1984,  1.0228],
        [ 0.1177,  0.3465],
        [ 0.0888,  0.2920],
        [ 0.2493,  0.3563],
        [ 0.2575,  0.1987],
        [ 0.3182,  0.2848]])


The first element is the same, because it has no previous context to average except for itself.

The second row of the Bag of Words, however, is the average of the second token and the one before it.
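
This can be checked by hand from the printed values. The snippet below recomputes the running mean for the first three rows of the original tensor in plain Python (the numbers are copied from the output above; `original_rows` and `running_means` are illustrative names) and should reproduce the first three rows of the Bag of Words up to rounding.

```python
# First three rows of the original tensor, copied from the printed output
original_rows = [
    (-2.0555, 1.8275),
    (1.3035, -0.4501),
    (1.3471, 1.6910),
]

# Running (causal) mean: row t averages rows 0..t
running_means = []
for token in range(len(original_rows)):
    previous_rows = original_rows[:token + 1]
    mean_first = sum(row[0] for row in previous_rows) / len(previous_rows)
    mean_second = sum(row[1] for row in previous_rows) / len(previous_rows)
    running_means.append((round(mean_first, 4), round(mean_second, 4)))

print(running_means)
```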

This approach is quite inefficient, since it loops over every batch element and every token in Python. Let's now see how to exploit matrix multiplication to speed up the computation.

## Matrix Multiplication