# Large Language Models

In our pursuit of making computers understand and generate human-like text, large language models (LLMs) were developed.
Language modeling in general is concerned with predicting the next word in a sequence of words. One of the earliest and most basic examples of a language model (LM) is the n-gram model, where the probability of a word occurring is estimated from the previous n-1 words.
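
Written out, the n-gram idea looks roughly as follows. The notation is an assumption (the text above gives none): $w_i$ is the word at position $i$ and $C(\cdot)$ is the number of times a word sequence appears in the training text. The count ratio is the usual maximum-likelihood estimate.

$$
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) \approx \frac{C(w_{i-n+1}, \dots, w_{i-1}, w_i)}{C(w_{i-n+1}, \dots, w_{i-1})}
$$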

For example, given a 5-gram model and the sentence "The quick brown fox jumps over the lazy dog", the probability of the word "dog" occurring is calculated based on the previous 4 words, "jumps over the lazy". This is a very simple model and does not capture the context of the sentence very well, since everything before those 4 words is ignored. This is where LLMs come in. LLMs can take a much longer context into account and generate more accurate predictions.
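
To make the 5-gram example concrete, here is a minimal counting sketch. It is not part of `microGPT`: the function names `train_ngram` and `next_word_prob` and the two-sentence toy corpus are made up for illustration, and there is no smoothing for unseen contexts.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Count which word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])  # the previous n-1 words
        counts[context][tokens[i]] += 1
    return counts

def next_word_prob(counts, context, word):
    """Estimate P(word | context) as a ratio of counts (0.0 if the context was never seen)."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[word] / sum(seen.values())

# Toy corpus (an assumption for this example): the same sentence with two endings.
corpus = ("the quick brown fox jumps over the lazy dog . "
          "the quick brown fox jumps over the lazy cat .").split()

counts = train_ngram(corpus, n=5)
context = ["jumps", "over", "the", "lazy"]  # the 4 words before the prediction
print("P(dog | context) =", next_word_prob(counts, context, "dog"))
print("P(cat | context) =", next_word_prob(counts, context, "cat"))
```

Because the model only ever looks at the last 4 words, any context it has not seen word for word gets probability 0, which is one reason plain n-gram models generalize poorly.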


In [1]:
# Clone the repository and all its functions
!git clone https://github.com/abdulrahman1123/analysis_examples.git

# Add the cloned directory to Python path
import sys
sys.path.append('/content/analysis_examples')

# Import important functions, including functions to build the model and train it
from microGPT import *

fatal: destination path 'analysis_examples' already exists and is not an empty directory.
Encoded IDs: [2, 386, 1639, 7321, 6060, 431, 402, 751, 1128, 121, 731, 6457, 2073, 3]
Tokens: ['<BOS>', 'ĠThe', 'Ġquick', 'Ġbrown', 'Ġfox', 'Ġj', 'um', 'ps', 'Ġover', 'Ġthe', 'Ġla', 'zy', 'Ġdog', '<EOS>']
Decoded:  The quick brown fox jumps over the lazy dog
Vocab size: 10000
 Gre ----> 3894
et ----> 165
ings ----> 519
, ----> 12
 My ----> 1308
 king ----> 734
! ----> 4
 ----> 3
0 141058 156732
156732 297790 313464
313464 454522 470196
470196 611254 626928
626928 767986 783660
783660 924718 940392
940392 1081450 1097124
1097124 1238182 1253856
1253856 1394914 1410588
1410588 1551646 1567320
train data shape:  torch.Size([1410580]) val data shape: torch.Size([156740])
Train size = 1410580 tokens ... Validation size = 156740 tokens



In [None]:

import requests
import torch

####################
# hyperparameters
####################
batch_size = 32 # how many independent sequences we process in parallel
block_size = 256 # maximum context length (in tokens) for predictions
max_iters = 10000 # total number of training iterations
learning_rate = 2e-4
# use an Nvidia GPU (CUDA) if one is available, otherwise the CPU; on a Mac, change this to torch.device("mps")
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
eval_iters = 100
eval_interval = max_iters//10
n_embd = 256 # embedding dimension
n_head = 4 # number of attention heads
n_layer = 4
n_embd = (n_embd//n_head)*n_head # round down so that n_embd is divisible by n_head
dropout = 0.1
vocab_size = 10000 # requested vocabulary size for the tokenizer
# ------------

## Download Shakespeare's work
url1 = "https://www.gutenberg.org/cache/epub/100/pg100.txt"
text = requests.get(url1).text.replace('\r','')
text = text[text.find('THE SONNETS\n\n'):] # drop the introductory text that precedes the sonnets

# build the tokenizer
encode, decode, tokenizer, vocab_size = tokenize(text, vocab_size)

# check what the tokenizer does (skip the first token, <BOS>)
encoding = encode('Greetings, My king!')
for item in encoding[1:]:
    print(f"{decode([item])} ----> {item}")


# Encode the entire dataset
data = torch.tensor(encode(text), dtype=torch.long)


# split into training and validation sets (90% / 10%)
train_data, val_data = train_test_split(data, 0.9)
train_data, val_data = train_data.to(device), val_data.to(device)
print(f'Train size = {train_data.shape[0]} tokens ... Validation size = {val_data.shape[0]} tokens')

# create the model, the optimizer, and a learning-rate scheduler with 200 warmup steps
model = BigramLanguageModel(vocab_size, n_embd, block_size, n_head, dropout, n_layer, device).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=get_lr_lambda(max_iters=max_iters, warmup_steps=200))



# train the model
train_model(model, max_iters, eval_iters, train_data, val_data, batch_size, block_size, device, optimizer, scheduler)



import os
# directory where the model is saved; change this to a path on your own machine
base_dir = r"\\klinik.uni-wuerzburg.de\homedir\userdata11\Sawalma_A\data\downloads\LLMs"
model_name = f'model_batch{batch_size}_vocab{vocab_size}_nembed{n_embd}_block{block_size}_nhead{n_head}_n_layer{n_layer}'
model_path = os.path.join(base_dir,f'{model_name}.pt')

#save the model
torch.save(model.state_dict(), model_path)
print(f"Model saved as: {model_name}.pt")


#load the model
model = BigramLanguageModel(vocab_size, n_embd, block_size, n_head, dropout, n_layer, device)
model.load_state_dict(torch.load(model_path, map_location=device))
model = model.to(device)
model.eval()

prompt = "To be, or not to be "
input_ids = torch.tensor([tokenizer.encode(prompt).ids], dtype=torch.long, device=device)
out = model.generate(idx=input_ids, max_new_tokens=100)
print(tokenizer.decode(out[0].tolist()))
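
The training cell above passes `get_lr_lambda(max_iters=max_iters, warmup_steps=200)` to `LambdaLR`, which multiplies the base learning rate by whatever factor that function returns at each step. The real implementation lives in `microGPT` and is not shown here, so the following is only a sketch of one common choice: linear warmup followed by cosine decay. The name `warmup_cosine_lambda` and the `min_ratio` floor are hypothetical and may not match what `get_lr_lambda` actually does.

```python
import math

def warmup_cosine_lambda(max_iters, warmup_steps, min_ratio=0.1):
    """Return f(step) -> multiplier applied to the base learning rate.

    Hypothetical stand-in for microGPT's get_lr_lambda (an assumption).
    """
    def lr_lambda(step):
        if step < warmup_steps:
            # linear warmup: ramp the multiplier up to 1.0
            return (step + 1) / warmup_steps
        # fraction of the post-warmup schedule completed, clamped to [0, 1]
        progress = (step - warmup_steps) / max(1, max_iters - warmup_steps)
        progress = min(1.0, progress)
        # cosine decay from 1.0 down to min_ratio
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return min_ratio + (1.0 - min_ratio) * cosine
    return lr_lambda

# Same values as the hyperparameters used in this notebook
lr_lambda = warmup_cosine_lambda(max_iters=10000, warmup_steps=200)
for step in (0, 100, 199, 200, 5100, 10000):
    print(f"step {step:>5}: lr = {2e-4 * lr_lambda(step):.2e}")
```

A function like this can be handed to `torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)`. The warmup keeps the first updates small while the optimizer statistics are still noisy, and the decay lowers the step size toward the end of training.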


---
### Sources:
[Introduction to Large Language Models](https://www.baeldung.com/cs/large-language-models)

[YouTube video about writing a GPT from scratch](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=2409s)