# LLM From Scratch

## Introduction

An AI LLM prototype written in plain Python, for personal educational purposes only. WORK IN PROGRESS.

It is a Generative Pre-trained Transformer (GPT), following the paper ["Attention is All You Need"](https://arxiv.org/abs/1706.03762) and OpenAI's [GPT-2](https://huggingface.co/openai-community/gpt2) / [GPT-3](https://fr.wikipedia.org/wiki/GPT-3).

It was made for educational purposes only, to understand what is happening under the hood.

<strong>No AI was used for any part of the project or its code.</strong>

I tried to learn from and summarize a course based on these two great videos:

- [Andrej Karpathy - Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY)

- [freeCodeCamp.org - Create a Large Language Model from Scratch with Python – Tutorial](https://www.youtube.com/watch?v=UU1WVnMk4E8)

Using these tools:

- [Anaconda](https://www.anaconda.com/docs), a Python distribution made for research and deep learning

- [PyTorch](https://pytorch.org/), an optimized tensor library for deep learning

- [Nvidia CUDA API](https://www.anaconda.com/docs/getting-started/working-with-conda/packages/gpu-packages), for [GPU computation](https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units)

- [Jupyter Notebook](https://jupyter.org/), for creating and sharing computational documents ([web sample here](https://jupyter.org/try-jupyter/notebooks/?path=notebooks/Intro.ipynb)).

- [Ipykernel](https://pypi.org/project/ipykernel/), the IPython kernel for Jupyter; we'll use it to register virtual environments as Jupyter kernels

In [5]:
import torch

## Character-level Tokenizer

### Text data fetching

First, we need to read a text to create our vocabulary.

For this course, I chose the 1864 novel "Journey to the Center of the Earth", written by Jules Verne.

It is taken from the free ebooks website [Project Gutenberg](https://www.gutenberg.org/ebooks/18857), where you can find more if you want.

In [None]:
# We can also use data/chapter1.txt for a more lightweight input
file_name = "data/journey_to_the_center_of_the_earth.txt" 

# Fetch book content (the file is closed automatically at the end of the block)
with open(file_name, encoding="utf8") as file_descriptor:
    text = file_descriptor.read()

print(text[:400])

Looking back to all that has occurred to me since that eventful day, I
am scarcely able to believe in the reality of my adventures. They were
truly so wonderful that even now I am bewildered when I think of them.

My uncle was a German, having married my mother's sister, an
Englishwoman. Being very much attached to his fatherless nephew, he
invited me to study under him in his home in the fatherl


### Vocabulary

The vocabulary is the sorted list of every unique character that appears in the text.

It will then be used to train the model.
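As a toy illustration of how such a vocabulary is built, here is a minimal sketch on a short made-up string (`sample`, `toy_vocab` and `vocab_size` are names chosen for this example, not part of the notebook):

```python
# Hypothetical short text, used only for illustration
sample = "hello world"

# Unique characters, sorted so that the order is reproducible
toy_vocab = sorted(set(sample))

# The vocabulary size is the number of distinct tokens the model can see
vocab_size = len(toy_vocab)

print(toy_vocab)   # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(vocab_size)  # 8
```

Sorting matters because a `set` has no guaranteed order: without it, the same character could get a different index from one run to the next.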

In [7]:
vocab = sorted(set(text))

print(''.join(vocab))


 !"'()*+,-./0123456789:;<>?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]abcdefghijklmnopqrstuvwxyz£﻿


### Encoding and decoding

With this vocabulary, we can now create a tokenizer, which consists of two parts:

- encode text into a list of integers,

- and decode a list of integers back into the original text.

This tokenizer works at character level: in each prompt, it encodes every character separately.

Each character is assigned an integer, its index in the vocabulary.

It is not the most efficient approach, but we are going to stay at character level to keep the exercise simple.

In [8]:
# Create mappings between characters and integers

def string_to_int():
    return { char:i for i, char in enumerate(vocab) }

def int_to_string():
    return { i:char for i, char in enumerate(vocab) }

print(int_to_string())

{0: '\n', 1: ' ', 2: '!', 3: '"', 4: "'", 5: '(', 6: ')', 7: '*', 8: '+', 9: ',', 10: '-', 11: '.', 12: '/', 13: '0', 14: '1', 15: '2', 16: '3', 17: '4', 18: '5', 19: '6', 20: '7', 21: '8', 22: '9', 23: ':', 24: ';', 25: '<', 26: '>', 27: '?', 28: 'A', 29: 'B', 30: 'C', 31: 'D', 32: 'E', 33: 'F', 34: 'G', 35: 'H', 36: 'I', 37: 'J', 38: 'K', 39: 'L', 40: 'M', 41: 'N', 42: 'O', 43: 'P', 44: 'Q', 45: 'R', 46: 'S', 47: 'T', 48: 'U', 49: 'V', 50: 'W', 51: 'X', 52: 'Y', 53: 'Z', 54: '[', 55: ']', 56: 'a', 57: 'b', 58: 'c', 59: 'd', 60: 'e', 61: 'f', 62: 'g', 63: 'h', 64: 'i', 65: 'j', 66: 'k', 67: 'l', 68: 'm', 69: 'n', 70: 'o', 71: 'p', 72: 'q', 73: 'r', 74: 's', 75: 't', 76: 'u', 77: 'v', 78: 'w', 79: 'x', 80: 'y', 81: 'z', 82: '£', 83: '\ufeff'}


We can now use these converters to encode and decode:

In [9]:
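# Take a string, output a list of integers
def encode(chars):
    mapping = string_to_int()  # build the lookup table once per call, not once per character
    return [ mapping[c] for c in chars ]

# Take a list of integers, output a string
def decode(integers):
    mapping = int_to_string()  # build the lookup table once per call
    return ''.join([ mapping[i] for i in integers ])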
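A quick way to check a tokenizer like this one is a round trip: decoding an encoded string should give back the original. The sketch below is self-contained and uses a made-up sample string; the names `stoi`, `itos`, `toy_encode` and `toy_decode` are assumptions for the example, not the notebook's own.

```python
# Hypothetical short text, used only for illustration
sample = "hello world"
toy_vocab = sorted(set(sample))

# Build the two lookup tables once
stoi = {char: i for i, char in enumerate(toy_vocab)}
itos = {i: char for i, char in enumerate(toy_vocab)}

def toy_encode(chars):
    """Take a string, return a list of integers."""
    return [stoi[c] for c in chars]

def toy_decode(integers):
    """Take a list of integers, return a string."""
    return "".join(itos[i] for i in integers)

ids = toy_encode("hello")
print(ids)              # [3, 2, 4, 4, 5]
print(toy_decode(ids))  # hello

# A character that is not in the vocabulary cannot be encoded
try:
    toy_encode("z")
except KeyError as err:
    print("unknown character:", err)
```

The last part shows a limit of this approach: any character that is missing from the training text raises a `KeyError` at encoding time.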

Now let's encode the entire text dataset. Here is a usage example of the tokenizer:

In [10]:
encoded_text = encode(text)

print("- Encoded data:", encoded_text[:100], "\n")
print("- Decoded data:", decode(encoded_text[:100]))

- Encoded data: [83, 39, 70, 70, 66, 64, 69, 62, 1, 57, 56, 58, 66, 1, 75, 70, 1, 56, 67, 67, 1, 75, 63, 56, 75, 1, 63, 56, 74, 1, 70, 58, 58, 76, 73, 73, 60, 59, 1, 75, 70, 1, 68, 60, 1, 74, 64, 69, 58, 60, 1, 75, 63, 56, 75, 1, 60, 77, 60, 69, 75, 61, 76, 67, 1, 59, 56, 80, 9, 1, 36, 0, 56, 68, 1, 74, 58, 56, 73, 58, 60, 67, 80, 1, 56, 57, 67, 60, 1, 75, 70, 1, 57, 60, 67, 64, 60, 77, 60, 1] 

- Decoded data: ﻿Looking back to all that has occurred to me since that eventful day, I
am scarcely able to believe 


We store it in a `torch.Tensor`:

In [11]:
data = torch.tensor(encoded_text, dtype=torch.long)

print(data.shape, data.dtype)
print(data[:100]) # Print the first 100 encoded characters of the text

torch.Size([484719]) torch.int64
tensor([83, 39, 70, 70, 66, 64, 69, 62,  1, 57, 56, 58, 66,  1, 75, 70,  1, 56,
        67, 67,  1, 75, 63, 56, 75,  1, 63, 56, 74,  1, 70, 58, 58, 76, 73, 73,
        60, 59,  1, 75, 70,  1, 68, 60,  1, 74, 64, 69, 58, 60,  1, 75, 63, 56,
        75,  1, 60, 77, 60, 69, 75, 61, 76, 67,  1, 59, 56, 80,  9,  1, 36,  0,
        56, 68,  1, 74, 58, 56, 73, 58, 60, 67, 80,  1, 56, 57, 67, 60,  1, 75,
        70,  1, 57, 60, 67, 64, 60, 77, 60,  1])


## Predictions from contexts

### Training and validation data

We split the data into training and validation sets, because we do not want the model to copy the data exactly, but to produce content that is similar to it. The validation set, which the model never trains on, lets us check that.

So we never feed the entire data into the transformer at once, but blocks of it.

In [12]:
percent = 0.9 # Fraction of the data used for training
n = int(len(data) * percent) # Index where the split happens

train_data = data[:n] # first 90% of the data
val_data = data[n:] # remaining 10% of the data
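To see what such a split does, here is a minimal sketch on a toy sequence (the `toy_*` names, `train_fraction` and the ten-element list are made up for illustration). Slicing at the same index `n` on both sides gives two parts that do not overlap and that together cover the whole sequence.

```python
# Hypothetical stand-in for the encoded text: the integers 0..9
toy_data = list(range(10))

train_fraction = 0.9
n = int(len(toy_data) * train_fraction)

toy_train = toy_data[:n]  # first 90%
toy_val = toy_data[n:]    # remaining 10%

print(toy_train)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(toy_val)    # [9]

# Every element lands in exactly one of the two parts
print(len(toy_train) + len(toy_val) == len(toy_data))  # True
```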

### Contexts and targets

The main goal of the model is to predict what comes next from a given context.

For example, if our text data is "Hello":

- If the context is "H", the target is "e",

- Then if the context is "e", the target is "l",

- Then if the context is "l", the target is "l",

- Then if the context is "l", the target is "o".

### Usage of character blocks

It is better to work with blocks of characters: a longer context gives more accurate predictions and lets the model continue a given text.

For example, if we keep our text data as "Hello":

- If the context is "H", the target is "e",

- Then if the context is "He", the target is "l",

- Then if the context is "Hel", the target is "l",

- Then if the context is "Hell", the target is "o".
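The growing contexts listed above can be generated with a short loop. This is only a sketch on the raw string rather than on encoded integers, and `text_sample` and `pairs` are names chosen for the example:

```python
# Hypothetical sample text
text_sample = "Hello"

pairs = []
for i in range(1, len(text_sample)):
    context = text_sample[:i]  # everything seen so far
    target = text_sample[i]    # the character that comes right after
    pairs.append((context, target))
    print(f"context: {context!r} -> target: {target!r}")
```

A text of 5 characters gives 4 training examples, one for each position that has a following character.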


As you can see, the target is always the character that comes right after the context.

So we can write:

```python
context_block = data[:block_size]

target_block = data[1:block_size+1]
```

Here is an application with a block size of 7 characters:

In [13]:
block_size = 7

train_data[:block_size]

tensor([83, 39, 70, 70, 66, 64, 69])

In [14]:
context_block = train_data[:block_size]
target_block = train_data[1:block_size+1]

print("Context block :", context_block)
print("Target block :", target_block)

Context block : tensor([83, 39, 70, 70, 66, 64, 69])
Target block : tensor([39, 70, 70, 66, 64, 69, 62])


In [15]:
for i in range(block_size):
    
    context = context_block[:i+1]
    target = target_block[i]

    print("---------------------------------------------------------------")
    print(f"context: {context}, target: {target}")
print("---------------------------------------------------------------")

---------------------------------------------------------------
context: tensor([83]), target: 39
---------------------------------------------------------------
context: tensor([83, 39]), target: 70
---------------------------------------------------------------
context: tensor([83, 39, 70]), target: 70
---------------------------------------------------------------
context: tensor([83, 39, 70, 70]), target: 66
---------------------------------------------------------------
context: tensor([83, 39, 70, 70, 66]), target: 64
---------------------------------------------------------------
context: tensor([83, 39, 70, 70, 66, 64]), target: 69
---------------------------------------------------------------
context: tensor([83, 39, 70, 70, 66, 64, 69]), target: 62
---------------------------------------------------------------


### Batches for parallel computing


Within a block, it is more efficient to compute the predictions in parallel, because they are independent of each other.

So each step of the previous loop is one element of the batch.

If batch_size = 4, a context batch would contain:

- [[L],[Lo],[Loo],[Loop]]

We implement batch creation and processing:

In [16]:
batch_size = 4
block_size = 8

# Generate a small batch of data of contexts x and targets y
# Two possible splits : training or validation
def get_batch (split) :

    data = train_data if split == "training" else val_data
    
    # Create a tensor of batch_size random start offsets, one per block in the batch
    ix = torch.randint(len(data) - block_size, (batch_size,))
    print(f"- Random blocks addresses :\n{ix}\n")

    # Create context blocks from the random offsets
    x = torch.stack([data[i:i+block_size] for i in ix])

    # Create target blocks: the same blocks shifted by one character
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])

    # batch of data of contexts x and targets y
    return x, y

In [17]:
x, y = get_batch("training")

print(f"- contexts :\n{x}\n")
print(f"- targets :\n{y}\n")

- Random blocks addresses :
tensor([  5035, 192756, 148840, 106305])

- contexts :
tensor([[80,  1, 75, 70,  1, 70, 57, 74],
        [63, 60,  1, 43, 73, 70, 61, 60],
        [ 1, 75, 63, 60,  1, 60, 77, 60],
        [67, 67, 80,  1, 63, 56, 59,  1]])

- targets :
tensor([[ 1, 75, 70,  1, 70, 57, 74, 60],
        [60,  1, 43, 73, 70, 61, 60, 74],
        [75, 63, 60,  1, 60, 77, 60, 73],
        [67, 80,  1, 63, 56, 59,  1, 64]])
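The relation between contexts and targets in a batch is easier to check on toy data. The sketch below assumes a hypothetical `sample_batch` helper that takes the data and sizes as parameters, and uses `torch.arange(100)` as a stand-in for the encoded text, so that every target row must equal its context row plus one.

```python
import torch

def sample_batch(data, batch_size, block_size, generator=None):
    """Return (x, y): batch_size random blocks and the same blocks shifted by one."""
    # One random start offset per batch element
    ix = torch.randint(len(data) - block_size, (batch_size,), generator=generator)
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

toy_data = torch.arange(100)          # stand-in for the encoded text
g = torch.Generator().manual_seed(0)  # fixed seed so that runs are repeatable

x, y = sample_batch(toy_data, batch_size=4, block_size=8, generator=g)

print(x.shape, y.shape)       # torch.Size([4, 8]) torch.Size([4, 8])
print(torch.equal(y, x + 1))  # True, because toy_data[i] == i
```

The batch has one row per sampled block (`batch_size`) and one column per position in the block (`block_size`).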



For parallel computing, a common choice is [GPU computing (or GPGPU)](https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units) with the [CUDA API](https://en.wikipedia.org/wiki/CUDA).
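In PyTorch, this usually starts with picking a device and moving tensors to it. The following is a minimal sketch, not the project's final code: the `device` variable and the small example tensor are assumptions, and the code falls back to the CPU when no CUDA device is visible.

```python
import torch

# Use the GPU when PyTorch can see a CUDA device, otherwise stay on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# Tensors must be moved explicitly to the chosen device
x = torch.tensor([[1, 2, 3], [4, 5, 6]])
x = x.to(device)
print(x.device)
```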

WORK IN PROGRESS.