## LLM Pretraining

<div class="alert alert-block alert-success">
- This notebook walks through a toy, heavily simplified example of how LLM pretraining works.
- Your task is to complete the empty cells and fill in the missing parts of the code, which are marked with an ellipsis "..."

Through this exercise you will learn:
- How a vocabulary is constructed from a given dataset
- How tokenization can be done
- How to encode the data before training and decode it after generation
- Which loss function is most commonly used
- How to make a forward pass
- Which elements training needs, such as optimizers </div>

<div class="alert alert-block alert-warning">
The example below takes a text and uses each character in it as a "token". Even if the code runs for you, it will probably not generate anything legible. The goal is for you to understand each element of the pretraining process. Real training is far more sophisticated, for reasons that include the scale of the datasets and the size of the models. The fundamentals on which these models are trained, however, can be demonstrated with this toy example. </div>

In [None]:
!pip install torch

In [1]:
# read file llm_pretraining.txt
with open("../datasets/llm_pretraining.txt", "r", encoding="utf-8") as f:
    text = f.read()

print(f"number of characters: {len(text)}")
print(text[:100])

number of characters: 1115389
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


In [2]:
# create vocabulary, make sure the vocabulary is stored in a variable called "chars" (used later)
chars = sorted(list(set(text)))
vocab_size = len(chars)

print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


In [3]:
# Tokenize
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

In [4]:
encoding_example = encode("hii there")
encoding_example

[46, 47, 47, 1, 58, 46, 43, 56, 43]

In [5]:
decode(encoding_example)

'hii there'
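
The `encode`/`decode` pair above is a plain lookup in both directions. The sketch below rebuilds the same idea on a short hypothetical string (`sample_text` and the `sample_*` helpers are illustrative names, not the notebook's dataset or functions). It shows that decoding an encoding returns the original text, and that a character missing from the vocabulary raises a `KeyError`.

```python
# A minimal, self-contained sketch of character-level tokenization.
# `sample_text` is a hypothetical stand-in for the notebook's dataset.
sample_text = "hello there"

sample_chars = sorted(set(sample_text))  # unique characters, sorted for a stable order
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}  # character -> integer id
sample_itos = {i: ch for ch, i in sample_stoi.items()}      # integer id -> character


def sample_encode(s):
    """Map each character to its integer id."""
    return [sample_stoi[c] for c in s]


def sample_decode(ids):
    """Map integer ids back to characters and join them."""
    return "".join(sample_itos[i] for i in ids)


ids = sample_encode("hello")
roundtrip = sample_decode(ids)  # decoding undoes encoding

# A character that never occurred in `sample_text` has no id.
try:
    sample_encode("z")
    unknown_ok = True
except KeyError:
    unknown_ok = False

print(ids, roundtrip, unknown_ok)
```

Because the vocabulary is built only from characters seen in the text, any prompt passed to the model later must use characters from that same set.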

In [6]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [7]:
# encode the dataset, make sure that after encoding the output is stored in a variable named "data" (used later)
data = torch.tensor(encode(text), dtype=torch.long)

print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115389]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


In [8]:
# training and validation data split: first 90% for training, last 10% for validation
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

batch_size = 4
block_size = 8

In [9]:
# batching function
# analyse this and see if you are able to understand what is going on
def get_batch(split):
    data_ = train_data if split == "train" else val_data
    ix = torch.randint(len(data_) - block_size, (batch_size,))
    x = torch.stack([data_[i:i+block_size] for i in ix])
    y = torch.stack([data_[i+1:i+block_size+1] for i in ix])
    return x, y


xb, yb = get_batch('train')
print(yb)

# bigram language model with a forward pass (logits and loss) and a generate method
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size_):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size_, vocab_size_)

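    def forward(self, idx, targets=None):
        # idx: (B, T) token ids; each id looks up a row of C = vocab_size logits
        logits = self.token_embedding_table(idx)  # (B, T, C)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            # F.cross_entropy expects (N, C) logits and (N,) targets, so flatten B and T
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets) # cross entropy loss function

        return logits, loss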

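    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) tensor of token ids; one token is appended per iteration
        for _ in range(max_new_tokens):
            logits, _ = self(idx)  # (B, T, C); the loss is None because no targets are passed
            logits = logits[:, -1, :]  # keep only the last time step -> (B, C)
            probs = F.softmax(logits, dim=-1)  # logits -> probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # sample the next token -> (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)  # append it -> (B, T+1)
        return idx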

tensor([[46, 39, 58,  1, 51, 63,  1, 57],
        [46, 47, 51,  1, 58, 46, 39, 58],
        [58,  6,  0, 35, 46, 39, 58,  1],
        [53, 53,  1, 50, 39, 58, 43, 12]])
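
Each row of `xb` and the matching row of `yb` are the same window shifted by one position, so a single block of length `block_size` packs `block_size` next-token training examples. The sketch below spells this out in plain Python on a hypothetical list of token ids (`toy_ids`, `toy_block_size` and `start` are illustrative names, not part of the notebook).

```python
toy_ids = [18, 47, 56, 57, 58, 1, 15, 47, 58]  # hypothetical token ids
toy_block_size = 4
start = 2  # hypothetical offset, standing in for one random entry of `ix`

x = toy_ids[start:start + toy_block_size]          # inputs
y = toy_ids[start + 1:start + toy_block_size + 1]  # targets: the inputs shifted by one

pairs = []
for t in range(toy_block_size):
    context = x[:t + 1]  # everything seen up to and including position t
    target = y[t]        # the token that comes next
    pairs.append((context, target))
    print(f"context {context} -> target {target}")
```

The bigram model in this notebook uses only the last token of each context to predict the target; models with longer memory would use the whole context.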


In [10]:
# initialize the BigramLanguageModel, make sure the variable name is "m" (used later)
# make a forward pass with the BigramLanguageModel Class, this step is just a test to check if the call works and not part of the training
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)

print(logits.shape)
print(loss)

torch.Size([32, 65])
tensor(4.1992, grad_fn=<NllLossBackward0>)
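
A reference point for the loss printed above: a model that assigns equal probability to each of the 65 characters has a cross-entropy of ln(65) ≈ 4.17, so that is roughly where an untrained model is expected to start. The sketch below computes the cross-entropy by hand for a single prediction in plain Python. The helper `cross_entropy_single` is illustrative and is not the PyTorch implementation.

```python
import math


def cross_entropy_single(logits, target):
    """Cross-entropy for one prediction: -log(softmax(logits)[target])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]


toy_vocab_size = 65  # matches the vocabulary size printed earlier

# Equal logits mean a uniform distribution over the vocabulary.
uniform_loss = cross_entropy_single([0.0] * toy_vocab_size, target=3)
print(round(uniform_loss, 4), round(math.log(toy_vocab_size), 4))

# Raising the logit of the correct class lowers the loss.
confident_logits = [0.0] * toy_vocab_size
confident_logits[3] = 5.0
confident_loss = cross_entropy_single(confident_logits, target=3)
print(confident_loss < uniform_loss)
```

Training should push the loss below this uniform baseline as the model learns which characters tend to follow which.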


In [11]:
# training
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3) # initialise the optimiser
batch_size = 32 # set a batch size, can be changed to some other value
for steps in range(1000):  # 1000 steps training, you may change this
    xb, yb = get_batch('train') # get the training batch

    logits, loss =  m(xb, yb) # make the forward pass
    optimizer.zero_grad(set_to_none=True)
    loss.backward()  # backpropagation
    optimizer.step()
print(loss.item())

3.6752665042877197
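
For intuition about what the training loop is fitting: the model's only parameters are one row of logits per current character, so the lowest training loss it can reach is the one obtained by matching the empirical distribution of the next character given the current one. The sketch below estimates that distribution by counting, in plain Python, on a hypothetical string (`toy_text`). This is a swapped-in counting approach for illustration, not the gradient-based training used in this notebook.

```python
from collections import Counter, defaultdict

toy_text = "abababac"  # hypothetical training text

# Count how often each character follows each other character.
next_counts = defaultdict(Counter)
for current, nxt in zip(toy_text, toy_text[1:]):
    next_counts[current][nxt] += 1

# Normalise the counts into conditional probabilities P(next | current).
next_probs = {
    current: {nxt: count / sum(counter.values()) for nxt, count in counter.items()}
    for current, counter in next_counts.items()
}

for current in sorted(next_probs):
    print(repr(current), next_probs[current])
```

Gradient descent on the embedding table moves the softmax of each row toward these conditional frequencies, which is why more training steps lower the loss.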


In [12]:
# this step generates text using your trained model
# it probably won't generate anything interesting (or maybe it will?)
# by the time you reach this step you should:
# 1. Have understood the principles of how pre-training works
# 2. Be able to go back to the correct sources to expand on this basic knowledge

input_data = torch.tensor([encode("Let us kill him")], dtype=torch.long)  # you can change the encoding sentence to something else
generate_n_tokens = 500 # number of new tokens to generate; change this to control the length of the output
print(decode(m.generate(idx=input_data, max_new_tokens=generate_n_tokens)[0].tolist()))

Let us kill himFk,EN,XNaglcn$-z,TADMgbu;vzu&JkelF;KoudHaWmeknf.SlaXE,meriTI
hqDru?HCeZAwMr&
3n-L
I$s LiT: IGHJI ppof.mZVoBfl$YBwNU$yYbrWknaS3?gikXn.XxYW&xJ&
Sm z3JDULewesfvRy TC'sIKi'zdmsLnLyoeHORHUud?
knJBw?VI'sqRIsBbaoH$ atISWALMI$y N& tIFk!UC!POZyK LG:S
LkEs b$jh ;AS,CnWAr;KjMFayadPppHAN tZ
K;pr sZYBTr!yY,jHcjHM-Lekse$Y Smmix?'s
CQbypGKpdtjxMBIF;bzGe sJ!
3LUnuays
WXSWw
F-M
OuLZ
vRy;pI.ZztDjBBkeI$;NVSNE,: OgKLj&
WWxHN$jocjBe,BvOgl'CLGlOR$YT!!Q OCcMewkyjeva,co-hl
PlysewouscM-HczOV-OfRmxLtoBvyxV$ bZV;BUssiG ay
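
The generated text is mostly noise because the model has barely trained (the loss is still around 3.7), but the sampling step inside `generate` is worth understanding on its own: the logits of the last position are turned into probabilities with softmax, and the next token is drawn at random according to those probabilities. A minimal plain-Python sketch of that step follows. `toy_vocab` and `toy_logits` are hypothetical values, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_vocab = ["a", "b", "c"]   # hypothetical three-character vocabulary
toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for the next character

probs = softmax(toy_logits)

rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
samples = rng.choices(toy_vocab, weights=probs, k=1000)

# Sampled frequencies should roughly track the probabilities.
for ch, p in zip(toy_vocab, probs):
    print(ch, round(p, 3), samples.count(ch) / len(samples))
```

Sampling, rather than always picking the most likely character, is what makes each run of `generate` produce different text.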
