# LLM Course

## Introduction

We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3.

The goal is to understand what is happening under the hood, by taking baby steps and making it accessible to everyone.

Made for educational purposes only.

This course was made possible by these two videos:

- [Andrej Karpathy - Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY)

- [freeCodeCamp.org - Create a Large Language Model from Scratch with Python – Tutorial](https://www.youtube.com/watch?v=UU1WVnMk4E8)


We are going to use these tools:

- Anaconda, a Python distribution made for research and deep learning

- Ipykernel, for handling Python 3 kernels with virtual environments

- PyTorch, an optimized tensor library for deep learning

- CUDA, for parallel GPU computing

- Jupyter, for documented, notebook-style programming

Of course, no AI was used to write the code here.

## Character-level Tokenizer

### Text data fetching

First, we read a text to create our vocabulary.

For this course, I chose the 1864 novel "Journey to the Center of the Earth" by French writer Jules Verne.

It is taken from the free e-book website [Project Gutenberg](https://www.gutenberg.org/ebooks/18857).

In [None]:
file_name = "./data/journey_to_the_center_of_the_earth.txt"

# Fetch book content; the with block closes the file once it has been read
with open(file_name, encoding="utf8") as file_descriptor:
    text = file_descriptor.read()

print(text[:400])

### Vocabulary

A vocabulary is a sorted set of every character that appears in the text.

It will then be used to train the model.

In [None]:
vocab = sorted(set(text))

print(''.join(vocab))
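
To see what this produces, here is a minimal sketch on a short sample string instead of the book. `sample_text` and `sample_vocab` are illustrative names that are not used elsewhere in this course.

```python
# Illustrative sample, not the book used in this course
sample_text = "hello world"

# set() removes duplicate characters, sorted() orders them by Unicode code point
sample_vocab = sorted(set(sample_text))

print(sample_vocab)
print("Vocabulary size:", len(sample_vocab))
```

The space character comes first because its code point is lower than that of any letter.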

### Encoding and decoding

With this vocabulary, we can now create a tokenizer, which consists of two parts:

- encode text into a list of integers,

- and decode a list of integers back into the original text.

This tokenizer works at the character level: for each prompt, it encodes every character separately.

Each character is assigned an integer, its index in the vocabulary.

It is not the most efficient approach, but we are going to stay at the character level to simplify the exercise.

In [None]:
# Map each character of the vocabulary to its index
def string_to_int():
    return {char: i for i, char in enumerate(vocab)}

# Map each index back to its character
def int_to_string():
    return {i: char for i, char in enumerate(vocab)}

print(string_to_int())

We can now use our converters to encode and decode:

In [None]:
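def encode(chars):
    mapping = string_to_int()  # build the lookup table once per call, not once per character
    return [mapping[c] for c in chars]

def decode(integers):
    mapping = int_to_string()  # same idea for the reverse lookup
    return ''.join(mapping[i] for i in integers)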

Here is an example of the tokenizer in use:

In [None]:

data = text[:8]

print("Original data:", data)
print("Encoded data:", encode(data))
print("Decoded data:", decode(encode(data)))

## Predictions from contexts

### Training and validation data

We split the data into training and validation sets, because we do not want the model to copy the data exactly, but to produce content that is similar to it.

So we never feed the entire data into the transformer, only blocks of it.

In [None]:
percent = 0.9 # Fraction of the data used for training
n = int(len(text) * percent) # Index where the split happens

train_data = text[:n] # first 90%
validation_data = text[n:] # remaining 10%

print(validation_data)
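
As a quick sanity check of the split, here is a sketch on a short illustrative string rather than the book (all `sample_*` names are hypothetical). Slicing both parts at the same index `n` keeps them from sharing any character, and together they add up to the whole text.

```python
sample_text = "abcdefghij"  # illustrative data, 10 characters

percent = 0.9
n = int(len(sample_text) * percent)  # index where the validation part starts

sample_train = sample_text[:n]       # first 90%
sample_validation = sample_text[n:]  # remaining 10%

# The two parts cover the text exactly once
assert sample_train + sample_validation == sample_text
print(len(sample_train), len(sample_validation))
```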

### Contexts and targets

The main goal of the model is, from a given context, to predict what comes next.

For example, if our text data is "Hello":

- If the context is "H", the target to predict is "e",

- Then if the context is "e", the target to predict is "l",

- Then if the context is "l", the target to predict is "l",

- Then if the context is "l", the target to predict is "o".
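
The pairs listed above can be generated with a short loop. This is a sketch on the illustrative string "Hello", where each context is a single character; `sample` and `pairs` are names chosen for the example.

```python
sample = "Hello"

# Pair each character with the one that follows it
pairs = [(sample[i], sample[i + 1]) for i in range(len(sample) - 1)]

for context, target in pairs:
    print(f"context: {context!r} -> target: {target!r}")
```

Note that the context "l" appears twice with two different targets, which is one reason a single character is a weak context.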

### Usage of character blocks

It would be better to work with blocks of characters, to gain accuracy and to generate text from a given text.

For example, if we keep "Hello" as our text data:

- If the context is "H", the target to predict is "e",

- Then if the context is "He", the target to predict is "l",

- Then if the context is "Hel", the target to predict is "l",

- Then if the context is "Hell", the target to predict is "o".

As you can see, the target is always the next character, the one that comes right after the context.

So we can write:

```python
context_block = data[:block_size]

target_block = data[1:block_size+1]
```

Here is an application, with a block size of 7 characters:

In [None]:
block_size = 7

context_block = train_data[:block_size]

target_block = train_data[1:block_size+1]

print("Context block:", context_block)
print("Target block:", target_block)

In [None]:
for i in range(block_size):

    context = context_block[:i+1]
    
    target = target_block[i]

    print("-------")
    print("Context: ["+context+"], target: ["+target+"]")
print("-------")

### Batches for parallel computing


Within a block, it is more efficient to compute the predictions in parallel, because they are independent of each other.

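So each step of the previous loop can be processed in parallel.

For a block containing "Loop", the contexts would be:

- [[L],[Lo],[Loo],[Loop]]

A batch then groups several such blocks, each taken at a random position in the data: if batch_size = 4, a batch contains 4 blocks.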

We implement batch creation and processing:

In [None]:
import torch

batch_size = 4

# Generate a small batch of data of contexts x and targets y
# Two possible splits: "training" or "validation"
def get_batch(split):
    data = train_data if split == "training" else validation_data
    # One random start index per block of the batch
    indexes_blocks = torch.randint(len(data) - block_size, (batch_size,)).tolist()

    # The data is text, so each block is encoded into a tensor of integers before stacking
    batch_contexts = torch.stack([torch.tensor(encode(data[i:i+block_size])) for i in indexes_blocks])
    batch_targets = torch.stack([torch.tensor(encode(data[i+1:i+block_size+1])) for i in indexes_blocks])

    return batch_contexts, batch_targets
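
To make the layout of a batch concrete, here is a sketch that uses plain Python lists and the standard `random` module instead of PyTorch tensors, on an illustrative string. `sample_data` and `sample_batch` are hypothetical names, and the block and batch sizes are deliberately tiny.

```python
import random

sample_data = "abcdefghijklmnopqrstuvwxyz"  # illustrative data

def sample_batch(data, block_size=3, batch_size=4):
    # One random start index per block; the upper bound leaves room
    # for the target, which is shifted one character to the right
    starts = [random.randrange(len(data) - block_size) for _ in range(batch_size)]
    contexts = [data[i:i + block_size] for i in starts]
    targets = [data[i + 1:i + block_size + 1] for i in starts]
    return contexts, targets

contexts, targets = sample_batch(sample_data)
for context, target in zip(contexts, targets):
    print(context, "->", target)
```

Each printed row is one block of the batch, and every target is its context shifted by one character. The tensor version has the same layout, with one row per block and one column per position in the block.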


For parallel computing, the best option is to use the GPU.