# Pretraining

![image.png](attachment:image.png)

In [1]:
import torch
from gpt import GPTModel
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,  # Shortened from the original GPT-2 context length of 1,024 tokens
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()

GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(256, 768)
  (drop_emb): Dropout(p=0.1, inplace=False)
  (trf_blocks): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=False)
        (W_key): Linear(in_features=768, out_features=768, bias=False)
        (W_value): Linear(in_features=768, out_features=768, bias=False)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (ff): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.1, inplace=False)
    )
    (1): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features

## Utility functions for text to token ID conversion

In [2]:
import tiktoken
from gpt import generate_text_simple

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # Add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)   # Removes batch dimension
    return tokenizer.decode(flat.tolist())


In [3]:

start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you rentingetic wasnم refres RexMeCHicular stren


## Calculating the text generation loss

### How the generate_text_simple function works, using a 7-word example
![image.png](attachment:image.png)

In [4]:
inputs = torch.tensor([[16833, 3626, 6100], # ["every effort moves",
                       [40, 1107, 588]])    # "I really like"]

In [5]:
targets = torch.tensor([[3626, 6100, 345 ], # [" effort moves you",
                        [1107, 588, 11311]]) # " really like chocolate"]

- Note that the targets are the inputs shifted one position forward, a concept we covered in chapter 2 during the implementation of the data loader.
- This shifting strategy is what teaches the model to predict the next token in a sequence.
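The shift described above can be sketched in plain Python. This is a minimal illustration, not the data loader from chapter 2; the helper name `make_input_target_pair` is hypothetical, and the token IDs are the ones from the first example row in this notebook.

```python
# Toy token stream; the IDs match the first example row used in this notebook.
token_ids_example = [16833, 3626, 6100, 345]  # "every effort moves you"

def make_input_target_pair(ids):
    """Return (inputs, targets) where targets are the inputs shifted by one."""
    inputs_row = ids[:-1]   # every token except the last
    targets_row = ids[1:]   # every token except the first
    return inputs_row, targets_row

inputs_row, targets_row = make_input_target_pair(token_ids_example)
for position in range(len(inputs_row)):
    # At each position the model sees the prefix and must predict the next ID.
    print(inputs_row[: position + 1], "->", targets_row[position])
```

Each printed line pairs a growing prefix with the single token ID the model should predict next, which is why one sequence of length N yields N training signals.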

In [6]:
with torch.no_grad(): 
    logits = model(inputs)
probas = torch.softmax(logits, dim=-1) 
print(probas.shape)

torch.Size([2, 3, 50257])


In [None]:
token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)

Token IDs:
 tensor([[[16657],
         [  339],
         [42826]],

        [[49906],
         [29669],
         [41751]]])


In [8]:
print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")  # Target text for the first sample
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")  # Model prediction for the first sample

Targets batch 1:  effort moves you
Outputs batch 1:  Armed heNetflix


![image.png](attachment:image.png)

- For each of the two input texts, we can print the initial softmax probability scores corresponding to the target tokens using the following code:

In [9]:
text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 1:", target_probas_1)
text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2:", target_probas_2)

Text 1: tensor([7.4541e-05, 3.1061e-05, 1.1563e-05])
Text 2: tensor([1.0337e-05, 5.6776e-05, 4.7559e-06])
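The indexing expression `probas[text_idx, [0, 1, 2], targets[text_idx]]` pairs each position with its target ID and picks that single probability. The same selection can be sketched with nested Python lists; the probabilities and the 4-token vocabulary below are made up purely for illustration.

```python
# Hypothetical probabilities for one text: 3 positions, vocabulary of 4 tokens.
toy_probas = [
    [0.10, 0.20, 0.30, 0.40],  # position 0
    [0.25, 0.25, 0.25, 0.25],  # position 1
    [0.70, 0.10, 0.10, 0.10],  # position 2
]
toy_targets = [2, 0, 3]  # correct next-token ID at each position

# Plain-Python equivalent of the tensor indexing for a single text:
# pick element (position, target_id) for every position.
toy_target_probas = [
    toy_probas[position][target_id]
    for position, target_id in enumerate(toy_targets)
]
print(toy_target_probas)  # [0.3, 0.25, 0.1]
```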


- The goal of training an LLM is to maximize the likelihood of the correct token, which means increasing its probability relative to all other tokens.
- This way, the LLM learns to consistently pick the target token (essentially the next word in the sentence) as the next token it generates.

In [10]:
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(log_probas)

tensor([ -9.5042, -10.3796, -11.3677, -11.4798,  -9.7764, -12.2561])


In [None]:
avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)

neg_avg_log_probas = avg_log_probas * -1
print(neg_avg_log_probas)

tensor(-10.7940)
tensor(10.7940)


![image.png](attachment:image.png)

## Cross-Entropy Loss
- In PyTorch, the cross_entropy function computes the cross entropy between the model's predicted distribution and the target token IDs. For integer targets with the default mean reduction, this equals the negative average log probability of the target tokens under the model's softmax output, which is why the terms “cross entropy” and “negative average log probability” are often used interchangeably in practice.

In [12]:
print("Logits shape:", logits.shape)
print("Targets shape:", targets.shape)

Logits shape: torch.Size([2, 3, 50257])
Targets shape: torch.Size([2, 3])


In [13]:
logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()
print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)

Flattened logits: torch.Size([6, 50257])
Flattened targets: torch.Size([6])


- Remember that the targets are the token IDs we want the LLM to generate, and the
logits contain the unscaled model outputs before they enter the softmax function to
obtain the probability scores.
- Previously, we applied the softmax function, selected the probability scores corresponding to the target IDs, and computed the negative average log probabilities.
PyTorch’s cross_entropy function will take care of all these steps for us:

In [14]:
loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)

tensor(10.7940)
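To make the steps bundled inside cross_entropy explicit, here is a plain-Python sketch of the same computation: softmax, select the target probability, take the negative log, and average. The helper names and the toy logits are assumptions for illustration; this is not PyTorch's implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def cross_entropy(logits_rows, target_ids):
    """Mean negative log probability of each row's target ID."""
    losses = []
    for row, target_id in zip(logits_rows, target_ids):
        probs = softmax(row)
        losses.append(-math.log(probs[target_id]))
    return sum(losses) / len(losses)

# Hypothetical flattened logits: 2 positions, vocabulary of 3 tokens.
demo_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
demo_targets = [0, 2]
demo_loss = cross_entropy(demo_logits, demo_targets)
print(round(demo_loss, 4))
```

The second row has identical logits, so its contribution is ln(3): the loss of a uniform guess over three tokens.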


## Perplexity

- Perplexity is a measure often used alongside cross entropy loss to evaluate the performance of models in tasks like language modeling. It can provide a more interpretable way to understand the uncertainty of a model in predicting the next token in a
sequence.


- Perplexity measures how well the probability distribution predicted by the model
matches the actual distribution of the words in the dataset. Similar to the loss, a lower
perplexity indicates that the model predictions are closer to the actual distribution.


- Perplexity is often considered more interpretable than the raw loss value because it signifies the effective vocabulary size about which the model is uncertain at each step. In
the given example, this would translate to the model being unsure about which among
48,725 tokens in the vocabulary to generate as the next token.

In [15]:
perplexity = torch.exp(loss)
print(perplexity)

tensor(48725.8203)
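As a sanity check on the "effective vocabulary size" reading, the sketch below compares the loss printed earlier in this notebook with the loss of a model that guesses uniformly over the whole vocabulary. It uses only the standard library; the rounded loss value is copied from the output above.

```python
import math

vocab_size = 50257       # from GPT_CONFIG_124M["vocab_size"]
reported_loss = 10.7940  # rounded loss printed earlier in this notebook

# A model that spreads probability evenly over the vocabulary assigns 1/V to
# every target, so its loss is ln(V) and its perplexity is V.
uniform_loss = -math.log(1.0 / vocab_size)
uniform_perplexity = math.exp(uniform_loss)

reported_perplexity = math.exp(reported_loss)

print(f"uniform loss:  {uniform_loss:.4f}, perplexity: {uniform_perplexity:.0f}")
print(f"reported loss: {reported_loss:.4f}, perplexity: {reported_perplexity:.0f}")
```

The untrained model's perplexity sits just below the vocabulary size, which suggests its predictions are close to random guessing at this point.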


## Calculating the training and validation set losses

In [16]:
file_path = "../the-verdict.txt"
with open(file_path, "r", encoding="utf-8") as file:
    text_data = file.read()

In [17]:
total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))
print("Characters:", total_characters)
print("Tokens:", total_tokens)

Characters: 20479
Tokens: 5145


## The train-val pipeline
![image.png](attachment:image.png)

In [18]:
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

In [19]:
from data_loader import create_dataloader_v1

import tiktoken
tokenizer = tiktoken.get_encoding("gpt2") 

torch.manual_seed(123)
train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)
val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)

In [20]:
print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)
print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)

Train loader:
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])

Validation loader:
torch.Size([2, 256]) torch.Size([2, 256])


- The following utility function calculates the cross entropy loss of a single batch returned by the training or validation loader:

In [21]:
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)    # Move the batch to the model's device
    target_batch = target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss

- By default, the calc_loss_loader function iterates over all batches in a given data loader, accumulates the loss in the total_loss variable, and then averages the loss over the number of batches. Alternatively, we can specify a smaller number of batches via num_batches to speed up the evaluation during model training.

In [22]:
def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader) 
    else:
        num_batches = min(num_batches, len(data_loader)) 
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(
                input_batch, target_batch, model, device
            )
            total_loss += loss.item() 
        else:
            break
    return total_loss / num_batches # Averages the loss over all batches
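One design detail worth noting: calc_loss_loader averages the per-batch mean losses, which matches a per-token average only when every batch holds the same number of tokens. The sketch below uses made-up per-token losses to show how the two averages can differ when the final batch is short, as can happen with `drop_last=False`.

```python
# Hypothetical per-token losses for two batches of unequal size.
batch_token_losses = [
    [2.0, 2.0, 2.0, 2.0],  # full batch: 4 tokens
    [6.0],                 # short final batch: 1 token
]

# What calc_loss_loader computes: the mean of the per-batch means.
batch_means = [sum(losses) / len(losses) for losses in batch_token_losses]
mean_of_batch_means = sum(batch_means) / len(batch_means)

# Alternative: weight every token equally across the whole loader.
all_losses = [loss for losses in batch_token_losses for loss in losses]
token_weighted_mean = sum(all_losses) / len(all_losses)

print(mean_of_batch_means)  # 4.0
print(token_weighted_mean)  # 2.8
```

For the loaders in this notebook every batch has the same shape, so the two averages coincide.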

In [23]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
with torch.no_grad():  # No gradient tracking is needed for evaluation
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)
print("Training loss:", train_loss)
print("Validation loss:", val_loss)

Training loss: 10.987583266364204
Validation loss: 10.981104850769043


# Training an LLM
![image.png](attachment:image.png)

- The training loop follows the standard PyTorch training pipeline.

In [24]:
def train_model_simple(model, train_loader, val_loader, optimizer, device,
                       num_epochs, eval_freq, eval_iter, start_context, tokenizer):
    train_losses, val_losses, track_tokens_seen = [], [], []  # Track losses and tokens seen
    tokens_seen, global_step = 0, -1
    for epoch in range(num_epochs): # Main training loop
        model.train()
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  # Reset gradients from the previous batch iteration
            loss = calc_loss_batch(
                input_batch, target_batch, model, device
            )
            loss.backward()   # Compute loss gradients
            optimizer.step()  # Update model weights using the loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1
            if global_step % eval_freq == 0: # Optional evaluation
                train_loss, val_loss = evaluate_model(model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                f"Train loss {train_loss:.3f}, "
                f"Val loss {val_loss:.3f}")
        generate_and_print_sample(model, tokenizer, device, start_context)
    return train_losses, val_losses, track_tokens_seen

In [25]:
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()  # Disable dropout during evaluation
    with torch.no_grad():  # Disable gradient tracking
        train_loss = calc_loss_loader(
            train_loader, model, device, num_batches=eval_iter
        )
        val_loss = calc_loss_loader(
            val_loader, model, device, num_batches=eval_iter
        )
    model.train()
    return train_loss, val_loss

In [34]:
def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text_simple(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
            )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " ")) 
    model.train()

- While the evaluate_model function gives us a numeric estimate of the model’s training progress, the generate_and_print_sample function provides a concrete text example generated by the model, which lets us judge its capabilities during training.

- Adam optimizers are a popular choice for training deep neural networks. However, in our training loop, we opt for the AdamW optimizer. AdamW is a variant of Adam that improves the weight decay approach by applying the decay directly to the weights rather than folding it into the gradient. Weight decay aims to minimize model complexity and prevent overfitting by penalizing larger weights.

- This adjustment allows AdamW to achieve more effective regularization and better generalization; thus, AdamW is frequently used in the training of LLMs.
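To illustrate what decoupled weight decay means, here is a scalar sketch of a single AdamW update in plain Python. It is a simplified illustration, not PyTorch's implementation: the function name `adamw_step` is hypothetical, the `lr` and `weight_decay` values mirror the ones used in this notebook, and the beta and epsilon values are assumed to be the usual defaults.

```python
import math

def adamw_step(param, grad, m, v, step, lr=0.0004, weight_decay=0.1,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update for a single scalar parameter (illustrative only)."""
    # Decoupled weight decay: shrink the weight directly, independent of grad.
    param = param * (1.0 - lr * weight_decay)
    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# With a zero gradient, only the decay term changes the weight.
decayed, _, _ = adamw_step(param=1.0, grad=0.0, m=0.0, v=0.0, step=1)
# With a nonzero gradient, the first step also moves the weight by roughly lr.
updated, _, _ = adamw_step(param=1.0, grad=2.0, m=0.0, v=0.0, step=1)
print(decayed, updated)
```

Because the decay multiplies the weight itself, its strength does not get rescaled by Adam's per-parameter gradient normalization, which is the practical difference from adding an L2 penalty to the loss.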

In [32]:
torch.manual_seed(123)

model = GPTModel(GPT_CONFIG_124M)
model.to(device)

optimizer = torch.optim.AdamW(
    model.parameters(), 
    lr=0.0004, weight_decay=0.1
)
num_epochs = 20

train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="My name is Akshat shaw, I am a student at IIT Roorkee", tokenizer=tokenizer
)

Ep 1 (Step 000000): Train loss 9.823, Val loss 9.928
Ep 1 (Step 000005): Train loss 8.068, Val loss 8.336
/n
Ep 2 (Step 000010): Train loss 6.620, Val loss 7.045
Ep 2 (Step 000015): Train loss 6.045, Val loss 6.600
/n
Ep 3 (Step 000020): Train loss 5.532, Val loss 6.496
Ep 3 (Step 000025): Train loss 5.412, Val loss 6.376
/n
Ep 4 (Step 000030): Train loss 4.989, Val loss 6.307
Ep 4 (Step 000035): Train loss 4.612, Val loss 6.254
/n
Ep 5 (Step 000040): Train loss 4.026, Val loss 6.143
/n
Ep 6 (Step 000045): Train loss 3.623, Val loss 6.140
Ep 6 (Step 000050): Train loss 3.051, Val loss 6.112
/n
Ep 7 (Step 000055): Train loss 2.971, Val loss 6.204
Ep 7 (Step 000060): Train loss 2.237, Val loss 6.139
/n
Ep 8 (Step 000065): Train loss 1.794, Val loss 6.158
Ep 8 (Step 000070): Train loss 1.482, Val loss 6.249
/n
Ep 9 (Step 000075): Train loss 1.142, Val loss 6.284
Ep 9 (Step 000080): Train loss 0.879, Val loss 6.281
/n
Ep 10 (Step 000085): Train loss 0.637, Val loss 6.376
/n
Ep 11 (Step 000

KeyboardInterrupt: 

In [35]:
generate_and_print_sample(model, tokenizer, device, "who is Gisburn")

who is Gisburn had never touched a brush."  And his tone told me in a flash that he never thought of anything else.  I moved away, instinctively embarrassed by my unexpected discovery; and as I turned, my eye fell on a small picture above
