
## TODO



---


### 📚 Inspiration & Citation

This exercise is inspired by the following work. If you use this notebook or its accompanying code, please cite it accordingly:

```yaml
cff-version: 1.2.0
message: "If you use this book or its accompanying code, please cite it as follows."
title: "Build A Large Language Model (From Scratch), Published by Manning, ISBN 978-1633437166"
abstract: "This book provides a comprehensive, step-by-step guide to implementing a ChatGPT-like large language model from scratch in PyTorch."
date-released: 2024-09-12
authors:
  - family-names: "Raschka"
    given-names: "Sebastian"
license: "Apache-2.0"
url: "https://www.manning.com/books/build-a-large-language-model-from-scratch"
repository-code: "https://github.com/rasbt/LLMs-from-scratch"
keywords:
  - large language models
  - natural language processing
  - artificial intelligence
  - PyTorch
  - machine learning
  - deep learning
```

### Import the text_data for training

In [10]:
import os, tiktoken, torch
from gpt2 import create_dataloader_v1, GPT_CONFIG_124M

file_path = "the-verdict.txt"  # "The Verdict", a public-domain short story by Edith Wharton (the text used in the cited book)
with open(file_path, "r", encoding="utf-8") as file:
    text_data = file.read()

tokenizer = tiktoken.get_encoding("gpt2")

total_char = len(text_data)
total_tokens = len(tokenizer.encode(text_data))

print("Characters: ", total_char)
print("Tokens: ", total_tokens)


Characters:  20479
Tokens:  5145


Next, split the text into training and validation sets using the `train_ratio` below: 90% of the characters go to training and the remaining 10% to validation (the held-out portion is named `test_data` in the code).

In [11]:
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
test_data = text_data[split_idx:]

torch.manual_seed(234)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

test_loader = create_dataloader_v1(
    test_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)


ValueError: invalid literal for int() with base 10: 'I'

We encoded both the input and the target texts to underline how important it is that the targets are the inputs shifted by one position (the leading space before each token included), since this shift is the "natural" label the model is trained to predict.
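
The shift between inputs and targets can be sketched with a toy example. The word-level "tokens" below are an assumption made for readability; the notebook itself works with GPT-2 BPE token ids, but the slicing idea is the same.

```python
# Toy illustration: targets are the inputs shifted one position to the right.
# The word-level "tokens" are hypothetical stand-ins for GPT-2 token ids.
tokens = ["every", " effort", " moves", " you"]

inputs = tokens[:-1]   # what the model sees
targets = tokens[1:]   # what it should predict at each position

# At position i the model has seen inputs[0..i] and must predict targets[i].
for i in range(len(inputs)):
    context = "".join(inputs[: i + 1])
    print(f"{context!r} -> {targets[i]!r}")
```

Note that the leading space belongs to the token itself, which is why the decoded target text starts with a space.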

Now let's compute the probabilities the model assigns to each token in the vocabulary:

In [48]:

with torch.no_grad():
    logits = model(inputs)
    
probas = torch.softmax(logits,dim=-1)
print("Probas shape: ", probas.shape)

token_ids = torch.argmax(probas,dim=-1,keepdim=True)
print("Token Ids: ", token_ids)

print("Target batch1: ", token_ids_to_text(token_ids=targets[0],tokenizer=gpt_tokenizer))
print("Model output batch1: ", token_ids_to_text(token_ids=token_ids[0].flatten(),tokenizer=gpt_tokenizer))

Probas shape:  torch.Size([2, 3, 50257])
Token Ids:  tensor([[[19127],
         [ 1790],
         [18350]],

        [[45721],
         [32673],
         [ 2132]]])
Target batch1:   effort moves you
Model output batch1:   rack short Fa


Running the above code makes clear that the model is producing random text because ***it's not trained yet***. This is where loss evaluation helps: it not only measures the quality of the generated text but also provides the signal that drives training.

### Softmax probability evaluation

It's now time to look at the initial probability scores the model assigns to the target tokens, printed below. For each position in the two target sequences, this is the probability that training needs to increase relative to every other entry in the `probas` distribution.


In [55]:

text_idx = 0
target_probas_1 = probas[text_idx,[0,1,2],targets[text_idx]]
print("Text 1: ", target_probas_1)


text_idx = 1
target_probas_2 = probas[text_idx,[0,1,2],targets[text_idx]]
print("Text 2: ", target_probas_2)

Text 1:  tensor([1.3901e-05, 1.2262e-05, 1.8820e-05])
Text 2:  tensor([1.4962e-05, 7.2043e-06, 1.0832e-05])
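
The indexing expression `probas[text_idx, [0,1,2], targets[text_idx]]` pairs each position with its target token id and picks one probability per position. A minimal pure-Python sketch of the same selection, using a made-up 2-position, 4-token probability table (all values here are assumptions for illustration):

```python
# Hypothetical probability table: one row per position, one column per token id.
probas = [
    [0.10, 0.20, 0.30, 0.40],  # position 0
    [0.25, 0.25, 0.25, 0.25],  # position 1
]
targets = [2, 0]  # assumed target token id at each position

# Pair position index with target id, then look up one probability per position.
target_probas = [probas[pos][tok] for pos, tok in enumerate(targets)]
print(target_probas)
```

Training aims to push each of these selected values toward 1.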


To evaluate the probability ... to improve the model so that the negative log probability gets as close as possible to 0.

> The sequential computation of:
> - logits
> - probabilities
> - target probabilities
> - log probabilities
> - average log probability
> - negative average log probability

***is known as cross entropy loss calculation***
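
The six steps above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions: the logits, the vocabulary size of 3, and the `softmax` helper are all hypothetical, and the standard library stands in for PyTorch.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities (max subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: 2 positions, vocabulary of 3 tokens.
logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
targets = [0, 1]  # assumed target token id per position

probas = [softmax(row) for row in logits]                # logits -> probabilities
target_probas = [p[t] for p, t in zip(probas, targets)]  # target probabilities
log_probas = [math.log(p) for p in target_probas]        # log probabilities
avg_log_proba = sum(log_probas) / len(log_probas)        # average log probability
loss = -avg_log_proba                                    # negative average log probability
print("Cross entropy loss:", loss)
```

The closer the target probabilities are to 1, the closer the log probabilities and therefore the loss are to 0.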

Below we compute the cross entropy (a measure of the difference between the target and the predicted distribution) and the perplexity, which measures how uncertain the model is about the predicted token. In the result that follows, think of perplexity as the effective number of vocabulary tokens the model is choosing among at each prediction.

In [65]:
log_probas = torch.log(torch.cat((target_probas_1,target_probas_2)))
print("Logaritmic probability: ",log_probas)

avg_log_probas = torch.mean(log_probas) * -1
print("Negative avg loss prob: ",avg_log_probas)

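# flatten logits from three to two dimensions: (batch, tokens, vocab) -> (batch * tokens, vocab)
logits_flat = logits.flatten(0,1)
# flatten targets from two to one dimension: (batch, tokens) -> (batch * tokens)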
targets_flat = targets.flatten()

print("\n\nFlatten logits: ", logits_flat.shape)
print("Flatten targets:", targets_flat.shape)

loss = torch.nn.functional.cross_entropy(logits_flat,targets_flat)
print ("Cross entropy loss calculation", loss)
print ("\nPerplexity: ", torch.exp(loss))


Logaritmic probability:  tensor([-11.1836, -11.3090, -10.8806, -11.1100, -11.8408, -11.4330])
Negative avg loss prob:  tensor(11.2928)


Flatten logits:  torch.Size([6, 50257])
Flatten targets: torch.Size([6])
Cross entropy loss calculation tensor(11.2928)

Perplexity:  tensor(80243.4922)
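
To put the perplexity in context, the sketch below recomputes it from the printed loss and compares it with a model that guesses uniformly over the vocabulary. The loss value is the rounded figure from the output above, so the result differs slightly from the printed `80243.4922`. The vocabulary size comes from the last dimension of `probas`.

```python
import math

vocab_size = 50257  # last dimension of the "Probas shape" output
loss = 11.2928      # cross entropy as printed above (rounded)

# Perplexity is the exponential of the cross entropy loss.
perplexity = math.exp(loss)

# A model assigning probability 1/vocab_size to every token has loss log(vocab_size).
uniform_loss = math.log(vocab_size)
uniform_perplexity = math.exp(uniform_loss)

print(f"Perplexity from printed loss: {perplexity:.0f}")
print(f"Uniform-guess loss: {uniform_loss:.4f}, perplexity: {uniform_perplexity:.0f}")
```

Since the untrained model's loss is above `log(vocab_size)`, its perplexity exceeds the vocabulary size: on these targets it currently does slightly worse than uniform guessing.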
