
## ![Pear](https://cdn.pixabay.com) vs 🍎 Simple training on language coherence

In this sample notebook, inspired by the work cited at the end, I use a short book to check how a GPT model learns when trained from scratch. Because the input is far too small for a language model to generalize, we should expect the training loss to keep falling steadily while the test loss stalls, which is the signature of overfitting.

---

### Import the text_data for training

A `gpt2` module is included in this folder so that the code from previous chapters can be imported easily.

In [None]:
import os, tiktoken, torch
from gpt2 import GPTModel,create_dataloader_v1, GPT_CONFIG_124M, generate_text_simple

file_path = "the-verdict.txt" # "The Verdict", a free short story by Edith Wharton (the text used in the cited work)
text_data = ""
with open(file_path,"r",encoding="utf-8") as file:
    text_data = file.read()
    
tokenizer = tiktoken.get_encoding("gpt2")
    
total_char = len(text_data)
total_tokens = len(tokenizer.encode(text_data))

print ("Characters: ", total_char)
print ("Tokens: ", total_tokens)


Next, we split the data into training and test sets using the `train_ratio` below: 90% training, 10% test. We then use `create_dataloader_v1` to prepare the input data in batches of 2 sequences, each `context_length` tokens long (256 tokens in this setup).

The `train_loader` and `test_loader` are used in the following code as "cursors" that feed the model during the usual cycle of training epochs and loss evaluation.

In [None]:
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
test_data = text_data[split_idx:]

torch.manual_seed(234)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

test_loader = create_dataloader_v1(
    test_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)

print ("Train loader: \n")
for x,y in train_loader:
    print(x.shape,y.shape)
    
print ("Test loader: \n")
for x,y in test_loader:
    print(x.shape,y.shape)
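
The windowing behind these loaders can be sketched in plain Python. This is a minimal sketch that assumes `create_dataloader_v1` follows the usual sliding-window scheme, where each target is its input shifted by one token and `stride == max_length` gives non-overlapping windows. The helpers `make_windows` and `count_batches` are hypothetical names introduced here for illustration; they are not part of the `gpt2` module.

```python
def make_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    # Stop early enough that the shifted target window still fits in the data.
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start:start + max_length]
        targets = token_ids[start + 1:start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


def count_batches(num_windows, batch_size, drop_last):
    """Number of batches a loader yields for a given number of windows."""
    full, remainder = divmod(num_windows, batch_size)
    if drop_last or remainder == 0:
        return full
    return full + 1  # keep the final, smaller batch


# Toy data: ten fake token ids, windows of 4 tokens, no overlap.
toy_ids = list(range(10))
windows = make_windows(toy_ids, max_length=4, stride=4)
for inputs, targets in windows:
    print(inputs, "->", targets)
```

With such a short text, `drop_last=True` on the training loader can discard a noticeable share of the data, while `drop_last=False` on the test loader keeps a final smaller batch.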

### Define the 🍐 vs 🍎 loss

It's now time to focus on the loss. As we saw in intro-training.ipynb, the loss can be computed across batches with the cross-entropy function. The next two functions compute the loss for a single pair of input and target batches, and then iterate over the batches of a dataloader to accumulate the loss and return its average over the number of batches processed:

In [3]:
# a basic calculation of loss using the cross entropy function from torch
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0,1), target_batch.flatten() # cross_entropy expects (N, vocab) logits and (N,) targets, so the batch and sequence dimensions are flattened
    )
    return loss

def calc_loss_dataloaders(data_loader, model, device, num_batches=None):
    total_loss = 0
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min (num_batches, len(data_loader))
    
    for i, (input_batch, target_batch) in enumerate (data_loader):
        if i < num_batches:
            loss = calc_loss_batch(
                input_batch=input_batch,target_batch=target_batch,model=model,device=device
            )
            total_loss += loss.item()
        else:
            break
    return total_loss/num_batches
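
To make the flattening and averaging concrete, here is a minimal pure-Python sketch of the same computation on made-up numbers. It does not call PyTorch; the `cross_entropy` helper below is a hypothetical stand-in that should match the default mean reduction of `torch.nn.functional.cross_entropy` on this small example.

```python
import math


def cross_entropy(logits_rows, targets):
    """Mean negative log-likelihood over rows, using a stable log-softmax."""
    total = 0.0
    for logits, target in zip(logits_rows, targets):
        peak = max(logits)  # subtract the max before exp() for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_norm - logits[target]
    return total / len(targets)


# Made-up batch: 2 sequences, 2 tokens each, vocabulary of 3 tokens.
logits = [
    [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
    [[0.0, 0.0, 2.0], [1.0, 1.0, 1.0]],
]
targets = [[0, 1], [2, 0]]

# Merge the batch and sequence dimensions, as flatten(0, 1) and flatten() do.
flat_logits = [row for seq in logits for row in seq]
flat_targets = [t for seq in targets for t in seq]

loss = cross_entropy(flat_logits, flat_targets)
print(f"loss = {loss:.4f}")
```

The last row has uniform logits, so it contributes exactly `log(3)`: that is the loss of a model guessing at random over three tokens, and the same reasoning explains why an untrained GPT starts near the log of its vocabulary size.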



## Train the model with a simple cycle 🚲

#### In the following section we are going to implement a typical training loop over epochs:

While there are epochs left:
> - Put the model in training mode --> model.train()
> - While there are training batches left:
>> - reset the gradient calculations --> optimizer.zero_grad()
>> - compute the loss and backpropagate --> loss.backward(), learning from the training samples
>> - take one optimization step --> optimizer.step()
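
The three inner steps can be mimicked without PyTorch. The toy below is only an analogue under stated assumptions: a single scalar weight, a squared-error loss, and plain gradient descent instead of AdamW. `ToyParam`, `backward`, and `train` are hypothetical names used for illustration.

```python
class ToyParam:
    """A single scalar weight whose gradient accumulates across backward calls."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0


def backward(param, x, y):
    """Add the derivative of the squared error (w*x - y)**2 w.r.t. w to param.grad."""
    param.grad += 2 * (param.value * x - y) * x


def train(batches, lr, epochs):
    w = ToyParam(0.0)
    for _ in range(epochs):
        for x, y in batches:
            w.grad = 0.0              # plays the role of optimizer.zero_grad()
            backward(w, x, y)         # plays the role of loss.backward()
            w.value -= lr * w.grad    # plays the role of optimizer.step()
    return w.value


# The data follow y = 2 * x, so the weight should approach 2.0.
fitted = train(batches=[(1.0, 2.0), (2.0, 4.0)], lr=0.1, epochs=50)
print(fitted)
```

Because `backward` adds to `grad` rather than overwriting it, skipping the reset would mix gradients from earlier batches into every update, which is why the reset comes first in each iteration.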



In [4]:
def text_to_token_ids(text,tokenizer):
    encoded = tokenizer.encode(text,allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)
    return encoded_tensor


def token_ids_to_text(token_ids,tokenizer):
    flat = token_ids.squeeze(0)
    return tokenizer.decode(flat.tolist())

def evaluate_model( model, train_loader, test_loader, device, eval_iter):
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_dataloaders(
            train_loader,model,device, num_batches=eval_iter
        )
        test_loss = calc_loss_dataloaders(
            test_loader,model,device, num_batches=eval_iter
        )
    model.train()
    return train_loss, test_loss

def generate_and_print_samples( model, tokenizer, device, start_context):
    model.eval()
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context,tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text_simple(model=model,idx=encoded,max_new_tokens=50,context_size=context_size)
    decoded_text = token_ids_to_text(token_ids,tokenizer)
    print("decoded text: ", decoded_text.replace("\n"," "))

def train_model_simple(model, train_loader, test_loader, optimizer, device, num_epochs, eval_freq, eval_iter, start_context, tokenizer):
    train_losses, test_losses, track_token_seen = [], [], []
    token_seen, global_step = 0, -1
    
    # epoch cycle
    for epoch in range(num_epochs):
        model.train() # put the model in training mode for the backpropagation
        # and start cycling over input/target batches from the training dataloader
        for input_batch, target_batch in train_loader:
            
            optimizer.zero_grad() # reset the gradients left over from the previous batch
            
            # get loss for the current batch
            loss = calc_loss_batch(
                input_batch,target_batch,model, device
                
            )
            
            loss.backward()
            optimizer.step()
            token_seen += input_batch.numel() # add the number of tokens in this batch to the running total
            
            global_step += 1
            
            # Just to print out the current status of training for the model object we are working with
            if global_step % eval_freq == 0:
                train_loss, test_loss = evaluate_model(
                    model, train_loader, test_loader, device, eval_iter
                )
                train_losses.append(train_loss)
                test_losses.append(test_loss)
                track_token_seen.append(token_seen)
                print   (
                    f"Ep {epoch+1} (Step: {global_step:06d}): "
                    f"Train loss {train_loss:.3f}, "
                    f"Test loss {test_loss:.3f}"
                )
        
        # print a text sample after each epoch to check the model's ability to generate coherent text
        generate_and_print_samples(
            model, tokenizer, device, start_context
        )
    
    return train_losses, test_losses, track_token_seen





### ▶️ See the model training in action

Let's now start the training, which takes about 5 minutes on an Intel i7 CPU, and see whether the claim made in the introduction of this notebook holds.

In [None]:

torch.manual_seed(123)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GPTModel(GPT_CONFIG_124M)
model.to(device)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.0004,weight_decay=0.1
)

num_epochs = 10
train_losses, test_losses, token_seen = train_model_simple(
    model=model,train_loader=train_loader,test_loader=test_loader,optimizer=optimizer,device=device,num_epochs=num_epochs,
    eval_freq=5,
    eval_iter=5,
    start_context="Every effort moves you",
    tokenizer=tokenizer
)

Plot the results to visualize training vs. test loss and underline again that, because of the very limited dataset, the model keeps learning the training data but overfits: its power of generalization stops improving (or improves only marginally) around epoch 3.

In [None]:
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

def plot_losses(epochs_seen, tokens_seen, train_losses, test_losses):
    fig, ax1 = plt.subplots(figsize = (5,3))
    ax1.plot(epochs_seen,train_losses,label="Training loss")
    ax1.plot(epochs_seen,test_losses,linestyle="-.",label="Test loss")
    
    ax1.set_xlabel("Epoch")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax2 = ax1.twiny()
    ax2.plot(tokens_seen,train_losses,alpha=0)
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()
    

epoch_tensor = torch.linspace(0,num_epochs,len(train_losses))
plot_losses(epoch_tensor,token_seen,train_losses,test_losses)
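
Besides reading the plot, the point where the test loss stops improving can be located numerically. The sketch below uses made-up loss values purely for illustration; they are not results from this notebook. `best_eval_point` and `generalization_gap` are hypothetical helpers, not functions from the `gpt2` module.

```python
def best_eval_point(test_losses, epochs_seen):
    """Return (index, epoch) of the evaluation with the lowest test loss."""
    best_idx = min(range(len(test_losses)), key=test_losses.__getitem__)
    return best_idx, epochs_seen[best_idx]


def generalization_gap(train_losses, test_losses):
    """Test loss minus training loss at each evaluation point."""
    return [test - train for train, test in zip(train_losses, test_losses)]


# Made-up values for illustration only (NOT measured in this notebook).
example_epochs = [0, 1, 2, 3, 4, 5]
example_train = [9.8, 7.1, 5.2, 3.4, 1.9, 0.8]
example_test = [9.9, 7.6, 6.6, 6.3, 6.4, 6.5]

idx, epoch = best_eval_point(example_test, example_epochs)
gaps = generalization_gap(example_train, example_test)
print(f"lowest test loss at evaluation {idx} (epoch {epoch})")
print("gap per evaluation:", [round(g, 2) for g in gaps])
```

With the real lists you could call `best_eval_point(test_losses, epoch_tensor.tolist())`. A gap that keeps widening after the best point is the numeric counterpart of the two curves diverging in the plot.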


### 📚 Inspiration & Citation

This exercise is inspired by the following work. If you use this notebook or its accompanying code, please cite it accordingly:

```yaml
cff-version: 1.2.0
message: "If you use this book or its accompanying code, please cite it as follows."
title: "Build A Large Language Model (From Scratch), Published by Manning, ISBN 978-1633437166"
abstract: "This book provides a comprehensive, step-by-step guide to implementing a ChatGPT-like large language model from scratch in PyTorch."
date-released: 2024-09-12
authors:
  - family-names: "Raschka"
    given-names: "Sebastian"
license: "Apache-2.0"
url: "https://www.manning.com/books/build-a-large-language-model-from-scratch"
repository-code: "https://github.com/rasbt/LLMs-from-scratch"
keywords:
  - large language models
  - natural language processing
  - artificial intelligence
  - PyTorch
  - machine learning
  - deep learning