# LLM Course

We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3.

This course was made possible with the help of these two videos:

- [Andrej Karpathy - Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY)

- [freeCodeCamp.org - Create a Large Language Model from Scratch with Python – Tutorial](https://www.youtube.com/watch?v=UU1WVnMk4E8)

It was made for educational purposes, to understand what is happening under the hood.

## Tokenizer and character vocabulary

First, we read a book file to build our vocabulary. The same text will later be used to train the model.

From this vocabulary we create a tokenizer made of two parts:

- encode, which maps text to a list of integers (each character's index in the sorted set of characters),

- and decode, which maps a list of integers back to the original text.

This tokenizer works at character level: every character of a prompt is encoded as one token. It is not the most efficient approach, but we stay at character level to keep the exercise simple.

In [9]:
# Change this path to try another file (for example the entire book content)
file_name = "./data/journey_to_the_center_of_the_earth.txt"

# Fetch book content; the with-block closes the file afterwards
with open(file_name, encoding="utf8") as fd:
   file_content = fd.read()

# Character vocabulary: every distinct character, sorted
vocab = sorted(set(file_content))

# Lookup tables, built once instead of on every call
string_to_int = { char:i for i, char in enumerate(vocab) }
int_to_string = { i:char for i, char in enumerate(vocab) }

def encode(chars):
   return [ string_to_int[c] for c in chars ]

def decode(integers):
   return ''.join([ int_to_string[i] for i in integers ])

print(''.join(vocab))




 !"'()*+,-./0123456789:;<>?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]abcdefghijklmnopqrstuvwxyz£﻿


Here is an example of the tokenizer in use:

In [20]:

data = file_content[:8]

print(data)
print("Encoded data:", encode(data))
print("Decoded data:", decode(encode(data)))

CHAPTER
Encoded data: [83, 30, 35, 28, 43, 47, 32, 45]
Decoded data: ﻿CHAPTER
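
To illustrate why char-level tokenizing is simple but not the most efficient, here is a minimal, self-contained sketch. It builds the same kind of tokenizer on a short sentence instead of the book file, then compares the number of character tokens with the number of whitespace-separated words. The names `sample_text`, `sample_vocab`, `char_to_id`, `id_to_char`, `encode_sample` and `decode_sample` are assumptions made for this illustration and are not part of the notebook.

```python
# Hypothetical sample text, used here instead of the book file
sample_text = "the cat sat on the mat"

# Character vocabulary: every distinct character, sorted
sample_vocab = sorted(set(sample_text))

# Lookup tables (hypothetical names), built once
char_to_id = {ch: i for i, ch in enumerate(sample_vocab)}
id_to_char = {i: ch for i, ch in enumerate(sample_vocab)}

def encode_sample(text):
    """Map each character to its index in the vocabulary."""
    return [char_to_id[ch] for ch in text]

def decode_sample(ids):
    """Map indices back to characters and join them into a string."""
    return "".join(id_to_char[i] for i in ids)

ids = encode_sample(sample_text)
print("Vocabulary size:", len(sample_vocab))
print("Character tokens:", len(ids))
print("Words:", len(sample_text.split()))
print("Round trip OK:", decode_sample(ids) == sample_text)
```

The vocabulary stays tiny, but the sequence is several times longer than the word count, which is the trade-off accepted here to keep things simple.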


## Training and prediction 

We split the data into a training set and a validation set, because we do not want the model to copy the data exactly: it should produce content similar to the data, and the validation set lets us check this on text the model has not seen during training.

We never feed the entire data into the transformer at once, only chunks of it.

In [4]:
data = encode(file_content) # Encode the whole book, not only the first 8 characters
percent = 90 # Cut data into 90% / 10%
n = int(len(data) * (percent/100))
 
train_data = data[:n]
validation_data = data[n:]

block_size = 8
train_data[:block_size]

[83, 30, 35, 28, 43, 47, 32, 45]

In [23]:
percent = 0.9 # Fraction of the data used for training
n = int(len(file_content) * percent)
 
train_data = file_content[:n] # first 90%
validation_data = file_content[n:] # remaining 10%

block_size = 6

xChunk = train_data[:block_size]
yChunk = train_data[1:block_size+1]

print ("~~~~~~~~~~~~~~~~~~~")
print("x chunk:",xChunk)
print("y chunk", yChunk)
print ("~~~~~~~~~~~~~~~~~~~")

context = ""
target = ""
for i in range(block_size):
    context = xChunk[:i+1]
    target = yChunk[i]

    print("-------")
    print("Step", i)
    print("Context: ["+context+"], target: ["+target+"]")

~~~~~~~~~~~~~~~~~~~
x chunk: ﻿CHAPT
y chunk CHAPTE
~~~~~~~~~~~~~~~~~~~
-------
Step 0
Context: [﻿], target: [C]
-------
Step 1
Context: [﻿C], target: [H]
-------
Step 2
Context: [﻿CH], target: [A]
-------
Step 3
Context: [﻿CHA], target: [P]
-------
Step 4
Context: [﻿CHAP], target: [T]
-------
Step 5
Context: [﻿CHAPT], target: [E]
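
The context/target pairs above all come from a single chunk at the start of the text. As a minimal sketch of how several chunks might be drawn at random positions, using only the standard library: the `get_batch` helper, the `batch_size` parameter and the toy `tokens` list are assumptions made for this illustration, not part of the notebook.

```python
import random

def get_batch(tokens, block_size, batch_size, rng):
    """Pick batch_size random chunks; each y is its x shifted by one position."""
    xs, ys = [], []
    for _ in range(batch_size):
        # The last valid start leaves room for block_size inputs plus one target
        start = rng.randrange(len(tokens) - block_size)
        xs.append(tokens[start:start + block_size])
        ys.append(tokens[start + 1:start + block_size + 1])
    return xs, ys

# Toy stand-in for encoded training data (hypothetical)
tokens = list(range(100))
rng = random.Random(0)  # fixed seed so runs are repeatable

xs, ys = get_batch(tokens, block_size=6, batch_size=4, rng=rng)
for x, y in zip(xs, ys):
    print(x, "->", y)
```

Each `y` row holds, at position `i`, the target for the context `x[:i+1]`, exactly as in the step-by-step output above.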
