# LLM Course

We build a Generative Pre-trained Transformer (GPT), following the paper "Attention Is All You Need" and OpenAI's GPT-2 / GPT-3.

This course was made possible by the following two videos:

- [Andrej Karpathy - Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY)

- [freeCodeCamp.org - Create a Large Language Model from Scratch with Python – Tutorial](https://www.youtube.com/watch?v=UU1WVnMk4E8)

It was made for educational purposes, to understand what happens under the hood.

## Tokenizer and character vocabulary

First, we read a book file to build our vocabulary. The same text is later used to train the model.

From this vocabulary we create a tokenizer made of two parts:

- an encoder, which maps each character of a text to its integer index in the sorted vocabulary,

- and a decoder, which maps a list of integers back to the original text.

This tokenizer works at the character level: every character of a prompt is encoded as one token. This is not the most efficient scheme, but we stay at the character level to keep the exercise simple.

In [43]:
file_name = "./data/journey_to_the_center_of_the_earth.txt"

# Fetch book content; the context manager closes the file afterwards
with open(file_name, encoding="utf8") as fd:
    file_content = fd.read()

# Vocabulary: every distinct character of the book, in sorted order
vocab = sorted(set(file_content))

# Lookup tables, built once instead of on every call
string_to_int = { char:i for i, char in enumerate(vocab) }
int_to_string = { i:char for i, char in enumerate(vocab) }

def encode(chars):
   # Text -> list of integer token ids
   return [ string_to_int[c] for c in chars ]

def decode(integers):
   # List of integer token ids -> text
   return ''.join([ int_to_string[i] for i in integers ])

print(''.join(vocab))




 !"'()*+,-./0123456789:;<>?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]abcdefghijklmnopqrstuvwxyz£﻿


Here is an example of the tokenizer in use:

In [44]:

data = file_content[:8]

print(data)
print ("Encoded data:", encode(data))
print ("Decoded data:", decode(encode(data)))

Looking
Encoded data: [83, 39, 70, 70, 66, 64, 69, 62]
Decoded data: ﻿Looking
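
The same round trip can be tried without the book file. The following is a minimal sketch on a toy string; `toy_text`, `toy_encode` and `toy_decode` are hypothetical names used only for this illustration, not part of the course code.

```python
# Toy corpus standing in for the book text (assumption, for illustration only)
toy_text = "hello world"

# Vocabulary: the distinct characters, sorted so that ids are deterministic
toy_vocab = sorted(set(toy_text))
char_to_id = {ch: i for i, ch in enumerate(toy_vocab)}
id_to_char = {i: ch for i, ch in enumerate(toy_vocab)}

def toy_encode(text):
    # One integer id per character
    return [char_to_id[ch] for ch in text]

def toy_decode(ids):
    # Map each id back to its character and join them
    return "".join(id_to_char[i] for i in ids)

ids = toy_encode("hello")
print(toy_vocab)        # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(ids)              # [3, 2, 4, 4, 5]
print(toy_decode(ids))  # hello
```

Because the vocabulary only contains characters seen in the corpus, encoding a character that never appears in it raises a `KeyError` in this sketch.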


## Prediction from contexts 

We split the data into a training set and a validation set, because we do not want the model to copy the data exactly: it should produce content similar to the data, not the data itself. Evaluating on the validation set, which the model never trains on, lets us check this.

We also never feed the entire data into the transformer at once, only blocks of it.

In [45]:
percent = 0.9 # Fraction of the data used for training
n = int(len(file_content) * percent)

train_data = file_content[:n] # first 90%
validation_data = file_content[n:] # remaining 10%

print(validation_data)

f
my words. The blow seemed to stun him by its severity. I allowed him to
reflect for some moments.

"Well," said I, after a short pause, "what do you think now? Is there
any chance of our escaping from our horrible subterranean dangers? Are
we not doomed to perish in the great hollows of the centre of the
earth?"

But my pertinent questions brought no answer. My uncle either heard me
not, or appeared not to do so.

And in this way a whole hour passed. Neither of us cared to speak. For
myself, I began to feel the most fearful and devouring hunger. My
companions, doubtless, felt the same horrible tortures, but neither of
them would touch the wretched morsel of meat that remained. It lay
there, a last remnant of all our great preparations for the mad and
senseless journey!

I looked back, with wonderment, to my own folly. Fully was I aware that,
despite his enthusiasm, and the ever-to-be-hated scroll of Saknussemm,
my uncle should never have started on his perilous voyage. What memories
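
As a quick sanity check on the idea of a split, here is a minimal sketch on a toy list of integers, where the two parts share no element and together cover the whole sequence. The helper `split_data` is a hypothetical name introduced for this example.

```python
def split_data(sequence, train_fraction=0.9):
    # Everything before index n is for training, the rest for validation
    n = int(len(sequence) * train_fraction)
    return sequence[:n], sequence[n:]

toy_sequence = list(range(20))
toy_train, toy_validation = split_data(toy_sequence)

print(len(toy_train), len(toy_validation))  # 18 2
print(toy_validation)                       # [18, 19]
```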


In [46]:
block_size = 7
xChunk, yChunk = train_data[:block_size], train_data[1:block_size+1]

print ("~~~~~~~~~~~~~~~~~~~")
print("x chunk:",xChunk)
print("y chunk:", yChunk)
print ("~~~~~~~~~~~~~~~~~~~")

context,target = "",""
for i in range(block_size):
    context = xChunk[:i+1]
    target = yChunk[i]

    print("-------")
    print("Step", i)
    print("Context: ["+context+"], target: ["+target+"]")
print("-------")

~~~~~~~~~~~~~~~~~~~
x chunk: ﻿Lookin
y chunk: Looking
~~~~~~~~~~~~~~~~~~~
-------
Step 0
Context: [﻿], target: [L]
-------
Step 1
Context: [﻿L], target: [o]
-------
Step 2
Context: [﻿Lo], target: [o]
-------
Step 3
Context: [﻿Loo], target: [k]
-------
Step 4
Context: [﻿Look], target: [i]
-------
Step 5
Context: [﻿Looki], target: [n]
-------
Step 6
Context: [﻿Lookin], target: [g]
-------
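
The same idea applies to encoded tokens: a chunk of `block_size + 1` tokens yields `block_size` training examples, one per context length. A minimal sketch on a toy list of ids, where `context_target_pairs` is a hypothetical helper written for this illustration:

```python
def context_target_pairs(tokens, block_size):
    # x holds the contexts, y is x shifted one position to the right
    x = tokens[:block_size]
    y = tokens[1:block_size + 1]
    # For each position i, the context is x[0..i] and the target is y[i]
    return [(x[:i + 1], y[i]) for i in range(block_size)]

pairs = context_target_pairs([3, 2, 4, 4, 5], block_size=4)
for context, target in pairs:
    print(context, "->", target)
# [3] -> 2
# [3, 2] -> 4
# [3, 2, 4] -> 4
# [3, 2, 4, 4] -> 5
```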


Within a block, it is more efficient to compute the predictions in parallel, because they are independent of one another. So first, we implement batch access and processing:

In [None]:
import torch

batch_size = 4

# Generate a small batch of contexts x and targets y
# Two possible splits ("training" or "validation")
def get_batch(split):
    data = train_data if split == "training" else validation_data
    # One random start index per sequence in the batch
    indexes_blocks = torch.randint(len(data) - block_size, (batch_size,)).tolist()

    # The data is still raw text, so each block is encoded to integers before stacking
    batch_contexts = torch.stack([torch.tensor(encode(data[i:i+block_size])) for i in indexes_blocks])
    batch_targets = torch.stack([torch.tensor(encode(data[i+1:i+block_size+1])) for i in indexes_blocks])

    return batch_contexts,batch_targets
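
To see what a batch looks like, here is a minimal sketch that draws one from a toy tensor of token ids instead of the book. `toy_data` and `get_toy_batch` are hypothetical names for this illustration, and the toy values (0 to 99) are an assumption chosen so that the shift between contexts and targets is easy to read.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is reproducible

# Toy stand-in for the encoded book: token ids 0..99 (assumption)
toy_data = torch.arange(100)
block_size = 7
batch_size = 4

def get_toy_batch(data, block_size, batch_size):
    # One random start offset per sequence in the batch
    starts = torch.randint(len(data) - block_size, (batch_size,))
    # Contexts: block_size tokens from each offset
    x = torch.stack([data[i:i + block_size] for i in starts])
    # Targets: the same windows shifted one token to the right
    y = torch.stack([data[i + 1:i + block_size + 1] for i in starts])
    return x, y

x, y = get_toy_batch(toy_data, block_size, batch_size)
print(x.shape, y.shape)  # both are (batch_size, block_size)
print(x)
print(y)
```

Each row of `y` is the matching row of `x` shifted by one position, so every position in the batch carries its own context and target pair.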


Then, to run these computations in parallel, the best option is to use GPU computing.
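
A common way to do this in PyTorch is to pick a device once and move tensors onto it. The following is a minimal sketch, assuming a CUDA-capable GPU may or may not be present; it falls back to the CPU otherwise. The tensor `batch` is a hypothetical placeholder with the shape of one batch of contexts.

```python
import torch

# Use the GPU when PyTorch can see one, otherwise stay on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder batch of shape (batch_size, block_size) (assumption)
batch = torch.zeros((4, 7), dtype=torch.long)

# .to(device) returns a copy on the target device (or the same tensor if already there)
batch = batch.to(device)
print(batch.device)
```

The model's parameters and every batch need to live on the same device, so in practice both are moved with `.to(device)` before the forward pass.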