# Fine-Tuning to follow instructions

We will fine-tune our model to follow human instructions.

To do so, we require an instruction dataset.

**Preparing the dataset**

In [2]:
# Downloading and loading the JSON file
import json
import os
import urllib.request  # "import urllib" alone does not reliably expose urllib.request


def download_and_load_file(file_path, url):
    # Download only if the file is not already on disk
    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)

    # Parse the JSON file into a list of dictionaries
    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)
    return data


file_path = "instruction-data.json"
url = (
 "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
 "/main/ch07/01_main-chapter-code/instruction-data.json"
)
data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))

print("Example entry:\n", data[50])

print("Another example entry:\n", data[999])

# Every entry has an instruction, an input (possibly empty), and an output

Number of entries: 1100
Example entry:
 {'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}
Another example entry:
 {'instruction': "What is an antonym of 'complicated'?", 'input': '', 'output': "An antonym of 'complicated' is 'simple'."}


**Prompt formats**

Instruction examples can be formatted in different ways; these formats are usually called prompt styles.

The Alpaca prompt style is one of the most widely used, largely because it helped define the original approach to instruction fine-tuning.

This format has an instruction, an optional input, and a response.

In [3]:
# Prompt formatting function
def format_input(entry):
 instruction_text = (
 f"Below is an instruction that describes a task. "
 f"Write a response that appropriately completes the request."
 f"\n\n### Instruction:\n{entry['instruction']}"
 )

 input_text = (
 f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
 )
 return instruction_text + input_text

In [4]:
# Testing our code
model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"
print(model_input + desired_response)

# Empty input
model_input = format_input(data[999])
desired_response = f"\n\n### Response:\n{data[999]['output']}"
print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is an antonym of 'complicated'?

### Response:
An antonym of 'complicated' is 'simple'.


We partition the dataset into 85% training, 10% testing, and 5% validation.

In [5]:
# Partitioning
# Number of entries in each split
train_portion = int(len(data) * 0.85)
test_portion = int(len(data) * 0.1)
val_portion = len(data) - train_portion - test_portion

# Partitioning
train_data = data[:train_portion]
test_data = data[train_portion:train_portion + test_portion]
val_data = data[train_portion + test_portion:]

# Sets length
print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))

Training set length: 935
Validation set length: 55
Test set length: 110


**Training Batches**

We will see how to efficiently pad the data samples to equal lengths so that multiple instruction examples can be assembled into a batch.

The DataLoader class in PyTorch uses a collate function, which takes a list of individual data samples and merges them into a single batch that the model can process efficiently during training.

However, because instruction examples vary in length and need special handling, we have to write our own collate function.

First, we create an InstructionDataset class that applies format_input and pretokenizes all entries: each entry is formatted with the instruction-response template and then tokenized. Next, the collate function pads the sequences in a batch to a common length, creates the target token IDs for training (the inputs shifted by one position), and finally replaces most of the padding tokens in the targets with a mask value so that they are excluded from the loss.


In [6]:
# Instruction dataset class
import torch
from torch.utils.data import Dataset


class InstructionDataset(Dataset):
 def __init__(self, data, tokenizer):
    self.data = data # List of dictionaries
    self.encoded_texts = [] # Stores tokenized inputs
    for entry in data:
        instruction_plus_input = format_input(entry) # Formatted instruction+input string
        response_text = f"\n\n### Response:\n{entry['output']}" # Add a response string
        full_text = instruction_plus_input + response_text # Add all texts together
        self.encoded_texts.append(  # Add the encoded text
        tokenizer.encode(full_text)
        )
 
 # Retrieve an item
 def __getitem__(self, index):
    return self.encoded_texts[index]
 
 # Length of the data
 def __len__(self):
    return len(self.data)

To pad all inputs to the same length, we use the same token as in Chapter 6, <|endoftext|>, whose token ID is 50256.

The collate function that we now implement ensures that all sequences within a batch have the same length; the padding length is set per batch, not across the whole dataset.

In [7]:
# Collate function
# We append one extra token (here the padding / end-of-text token) to every sequence
# and then drop the last token when building the inputs. The extra token is what lets
# us later shift the sequence by one position to form the targets.
# Example: [20, 30, 50] becomes [20, 30, 50, 50256]; the inputs are [20, 30, 50] and
# the targets will be [30, 50, 50256], which would not be possible without the appended token.

def custom_collate_draft_1(
 batch,
 pad_token_id=50256,
 device="cpu"
):
 """Custom collate function
 batch: a list of sequences from the dataset
 pad_token_id: token to pad
 device: CPU or GPU"""

 # Maximum batch length
 batch_max_length = max(len(item)+1 for item in batch)
 # Initiates inputs list
 inputs_lst = []
 for item in batch:
    new_item = item.copy()
    # Adds a token at the end
    new_item += [pad_token_id]
    # Pads all the inputs to max length
    padded = (
    new_item + [pad_token_id] *
    (batch_max_length - len(new_item))
    )
    # Removes last padded input
    inputs = torch.tensor(padded[:-1])
    # Stores the input
    inputs_lst.append(inputs)

 # Tensor with all inputs
 inputs_tensor = torch.stack(inputs_lst).to(device)
 return inputs_tensor

In [8]:
# Testing the code
inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]
batch = (
 inputs_1,
 inputs_2,
 inputs_3
)
print(custom_collate_draft_1(batch))

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])


We now need to consider the target tokens, so we modify our function to also return the target token IDs.

The targets are the inputs shifted by one position, with the appended end-of-text token marking the end of each sequence.

In [9]:
# Second collate function
def custom_collate_draft_2(
 batch,
 pad_token_id=50256,
 device="cpu"
):
 # Max batch length
 batch_max_length = max(len(item)+1 for item in batch)
 # Initiates empty inputs and target lists
 inputs_lst, targets_lst = [], []
 for item in batch:
    new_item = item.copy()
    new_item += [pad_token_id] # Pad at the end
    padded = ( # Pads input elements to max length
    new_item + [pad_token_id] *
    (batch_max_length - len(new_item)))
    inputs = torch.tensor(padded[:-1]) # Removes last element
    targets = torch.tensor(padded[1:]) # Drops the first element (targets are the inputs shifted by one position)
    # Stores inputs and targets
    inputs_lst.append(inputs)
    targets_lst.append(targets)
 # Converts values to tensors
 inputs_tensor = torch.stack(inputs_lst).to(device)
 targets_tensor = torch.stack(targets_lst).to(device)
 return inputs_tensor, targets_tensor

# Running the code
inputs, targets = custom_collate_draft_2(batch)
print("Inputs:",inputs)
print("Targets:",targets)
# The function returns the expected inputs and targets


Inputs: tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
Targets: tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256, 50256, 50256, 50256],
        [    8,     9, 50256, 50256, 50256]])


We will modify our collate function once more so that, in the targets, every padding token except the first one is replaced by -100.

In this way, padding does not influence the loss, and only meaningful tokens contribute to what the model learns.

We keep the first end-of-text token in the targets because it teaches the model to signal that the response is complete.

In [10]:
# Final collate function
def custom_collate_fn(
 batch,
 pad_token_id=50256,
 ignore_index=-100,
 allowed_max_length=None,
 device="cpu"):

 """Custom collate function
 batch: a list of sequences from the dataset
 pad_token_id: token to pad
 ignore_index: value that replaces padding tokens in the targets
 allowed_max_length: maximum number of tokens kept per sequence (None means no limit)
 device: CPU or GPU"""

 # Longest sequence in the batch, plus one for the appended token
 batch_max_length = max(len(item)+1 for item in batch)
 # Empty lists for storing inputs and targets
 inputs_lst, targets_lst = [], []

 # Iterating through batch loop
 for item in batch:
    new_item = item.copy()
    new_item += [pad_token_id] # End of sequence token

    padded = ( # Pad elements to max length
    new_item + [pad_token_id] *
    (batch_max_length - len(new_item))
    )
    # Removes last element
    inputs = torch.tensor(padded[:-1])
    targets = torch.tensor(padded[1:]) # Removes first element (shift by 1)
    
    # Masking    
    mask = targets == pad_token_id # Maps mask to the desired token
    indices = torch.nonzero(mask).squeeze() 
    if indices.numel() > 1:
        targets[indices[1:]] = ignore_index # Ignore indices equal to the desired token except for the first one

    # Optionally truncates to maximum allowed length
    if allowed_max_length is not None:
        inputs = inputs[:allowed_max_length]
        targets = targets[:allowed_max_length]

    # Store inputs and targets
    inputs_lst.append(inputs)
    targets_lst.append(targets)

 # Converts inputs and targets to tensors
 inputs_tensor = torch.stack(inputs_lst).to(device)
 targets_tensor = torch.stack(targets_lst).to(device)
 return inputs_tensor, targets_tensor

# Testing our code
inputs, targets = custom_collate_fn(batch)
print(inputs)
print(targets)
# We see our code is working correctly

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256,  -100,  -100,  -100],
        [    8,     9, 50256,  -100,  -100]])


The loss ignores these positions because PyTorch's cross-entropy loss uses ignore_index=-100 by default, so targets with this value do not contribute to the loss.
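To make the effect of the -100 value concrete, here is a small dependency-free sketch that re-implements mean cross-entropy with an `ignore_index` argument. It is an illustration written for this note, not PyTorch's actual code; the function name and the toy logits are assumptions. In PyTorch itself, `torch.nn.functional.cross_entropy` applies `ignore_index=-100` by default.

```python
import math


def cross_entropy(logits, targets, ignore_index=-100):
    """Mean cross-entropy over the positions whose target is not ignore_index."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # masked position: contributes nothing to the loss
        # log-sum-exp with the maximum subtracted for numerical stability
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        losses.append(log_sum_exp - row[target])
    # No unmasked targets: there is nothing to average
    return sum(losses) / len(losses) if losses else float("nan")


# Toy logits for three token positions over a two-token vocabulary
logits = [[-1.0, 1.0], [-0.5, 1.5], [-0.5, 1.5]]

loss_two_tokens = cross_entropy(logits[:2], [0, 1])
loss_three_tokens = cross_entropy(logits, [0, 1, 1])
loss_masked = cross_entropy(logits, [0, 1, -100])

print("Two tokens:        ", round(loss_two_tokens, 4))
print("Three tokens:      ", round(loss_three_tokens, 4))
print("Third token masked:", round(loss_masked, 4))
```

Masking the third target gives the same loss as if that position had never been in the batch, which is the behavior the collate function relies on.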

An optional extension is to mask the instruction tokens in the targets as well; that approach is not used here.
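For reference only, instruction masking could look roughly like the following sketch. The helper name `mask_instruction_targets` and the assumption that the instruction length is known in tokens are both hypothetical; nothing in the rest of this notebook uses it.

```python
IGNORE_INDEX = -100


def mask_instruction_targets(targets, instruction_length, ignore_index=IGNORE_INDEX):
    """Return a copy of targets in which the instruction part is masked.

    targets: list of target token IDs (the inputs shifted by one position)
    instruction_length: number of tokens in the formatted instruction + input
    """
    masked = list(targets)
    # targets[i] is the token at position i + 1 of the full sequence, so the
    # first instruction_length - 1 targets still belong to the instruction
    n = max(instruction_length - 1, 0)
    masked[:n] = [ignore_index] * min(n, len(masked))
    return masked


# Toy sequence: 3 instruction tokens, 2 response tokens, 1 end-of-text token
full_sequence = [10, 11, 12, 20, 21, 50256]
inputs = full_sequence[:-1]
targets = full_sequence[1:]

masked_targets = mask_instruction_targets(targets, instruction_length=3)
print(inputs)
print(masked_targets)
```

With this masking, only the response tokens and the end-of-text token would contribute to the loss.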

**DataLoaders**

In [11]:
# Obtaining the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

from functools import partial

# Collate function with the desired device
customized_collate_fn = partial(
 custom_collate_fn,
 device=device,
 allowed_max_length=1024
)

Device: cpu


In [12]:
# Creating DataLoaders
from torch.utils.data import DataLoader

# Obtaining tokenizer
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

# Initial parameters
num_workers = 0
batch_size = 8


torch.manual_seed(123)
# Training dataset and DataLoader
train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
 train_dataset,
 batch_size=batch_size,
 collate_fn=customized_collate_fn,
 shuffle=True,
 drop_last=True,
 num_workers=num_workers
)

# Validation dataset and DataLoader
val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
 val_dataset,
 batch_size=batch_size,
 collate_fn=customized_collate_fn,
 shuffle=False,
 drop_last=False,
 num_workers=num_workers
)

# Testing dataset and DataLoader
test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
 test_dataset,
 batch_size=batch_size,
 collate_fn=customized_collate_fn,
 shuffle=False,
 drop_last=False,
 num_workers=num_workers
)

In [13]:
# Exploring dimensions
print("Train loader:")
for inputs, targets in train_loader:
 print(inputs.shape, targets.shape)

 # The first dimension (8) is the batch size; the second is the number of tokens per sequence in that batch

Train loader:
torch.Size([8, 61]) torch.Size([8, 61])
torch.Size([8, 76]) torch.Size([8, 76])
torch.Size([8, 73]) torch.Size([8, 73])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 72]) torch.Size([8, 72])
torch.Size([8, 80]) torch.Size([8, 80])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 62]) torch.Size([8, 62])
torch.Size([8, 75]) torch.Size([8, 75])
torch.Size([8, 62]) torch.Size([8, 62])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 77]) torch.Size([8, 77])
torch.Size([8, 69]) torch.Size([8, 69])
torch.Size([8, 79]) torch.Size([8, 79])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 66]) torch.Size([8, 66])
torch.Size([8, 83]) torch.Size([8, 83])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 80]) torch.Size([8, 80])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 69]) torch.Size([8, 69])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 68]) torch.
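As a quick sanity check on the loaders, the number of batches each one yields follows from the split sizes, the batch size, and the `drop_last` setting. The helper below is a small sketch written for this note (the name `num_batches` is an assumption); it mirrors how a DataLoader counts batches for a dataset of known size.

```python
import math


def num_batches(num_samples, batch_size, drop_last):
    """Number of batches a loader yields for a dataset of num_samples entries."""
    if drop_last:
        # An incomplete final batch is discarded
        return num_samples // batch_size
    # An incomplete final batch is kept
    return math.ceil(num_samples / batch_size)


train_batches = num_batches(935, batch_size=8, drop_last=True)
val_batches = num_batches(55, batch_size=8, drop_last=False)
test_batches = num_batches(110, batch_size=8, drop_last=False)

print("Train batches:", train_batches)
print("Validation batches:", val_batches)
print("Test batches:", test_batches)
```

With `drop_last=True`, the 7 training entries that do not fill a complete batch are skipped in each epoch.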

**Pretrained LLM**

We will use the medium-sized GPT-2 model with 355 million parameters, since smaller models lack the capacity to produce satisfactory results on instruction-following.
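As a rough check on the 355 million figure, the parameter count can be estimated from the configuration values alone. The sketch below assumes a standard GPT-2 layout with biases everywhere, a feed-forward expansion factor of 4, and an output layer that shares its weights with the token embedding; the function name is made up for this note. An implementation that keeps a separate, untied output layer would report a larger number.

```python
def gpt2_param_count(vocab_size, context_length, emb_dim, n_layers):
    """Approximate parameter count of a GPT-2 style model (tied output layer)."""
    token_emb = vocab_size * emb_dim
    pos_emb = context_length * emb_dim

    # One transformer block
    attn_qkv = 3 * (emb_dim * emb_dim + emb_dim)   # query, key, value with bias
    attn_out = emb_dim * emb_dim + emb_dim         # attention output projection
    ff_up = emb_dim * 4 * emb_dim + 4 * emb_dim    # expansion to 4 * emb_dim
    ff_down = 4 * emb_dim * emb_dim + emb_dim      # projection back to emb_dim
    layer_norms = 2 * 2 * emb_dim                  # two norms, scale and shift each
    block = attn_qkv + attn_out + ff_up + ff_down + layer_norms

    final_norm = 2 * emb_dim
    return token_emb + pos_emb + n_layers * block + final_norm


total = gpt2_param_count(
    vocab_size=50257, context_length=1024, emb_dim=1024, n_layers=24
)
print(f"Estimated parameters: {total:,}")
```

Note that the number of attention heads does not appear in the estimate: splitting the embedding dimension across heads changes how the weights are used, not how many there are.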

In [14]:
# Loading pretrained model
from gpt_download import download_and_load_gpt2
from chapter04 import GPTModel
from chapter05 import load_weights_into_gpt


BASE_CONFIG = {
 "vocab_size": 50257, # Vocabulary size
 "context_length": 1024, # Context length
 "drop_rate": 0.0, # Dropout rate
 "qkv_bias": True # Query-key-value bias
}

model_configs = {
 "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
 "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
 "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
 "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}
# Medium model
CHOOSE_MODEL = "gpt2-medium (355M)"
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])
model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(
 model_size=model_size,
 models_dir="gpt2"
)

# Initiate model
model = GPTModel(BASE_CONFIG)
# Loading weights
load_weights_into_gpt(model, params)
model.eval();

File already exists and is up-to-date: gpt2\355M\checkpoint
File already exists and is up-to-date: gpt2\355M\encoder.json
File already exists and is up-to-date: gpt2\355M\hparams.json
File already exists and is up-to-date: gpt2\355M\model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2\355M\model.ckpt.index
File already exists and is up-to-date: gpt2\355M\model.ckpt.meta
File already exists and is up-to-date: gpt2\355M\vocab.bpe


In [15]:
# Assessing the model's performance
torch.manual_seed(123)
input_text = format_input(val_data[0])
print(input_text)

# Generating the model's response with the generate function from chapter 5
from chapter05 import generate, text_to_token_ids, token_ids_to_text
token_ids = generate(
 model=model,
 idx=text_to_token_ids(input_text, tokenizer),
 max_new_tokens=35,
 context_size=BASE_CONFIG["context_length"],
 eos_id=50256,
)
generated_text = token_ids_to_text(token_ids, tokenizer)

# generate returns the input and output text combined, so we slice off the input to isolate the response
response_text = generated_text[len(input_text):].strip()
print("Response text:",response_text)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'
Response text: ### Response:

The chef cooks the meal every day.

### Instruction:

Convert the active sentence to passive: 'The chef cooks the


The pretrained model is not yet able to follow the instruction: it merely repeats the original sentence and part of the instruction. This is why we now proceed to fine-tune it.

**Fine-tuning**

In [16]:
# Importing loss functions
from chapter05 import (
 calc_loss_loader,
 train_model_simple
)

model.to(device)
torch.manual_seed(123)
with torch.no_grad():
    train_loss = calc_loss_loader(
        train_loader, model, device, num_batches=5
    )
    val_loss = calc_loss_loader(
        val_loader, model, device, num_batches=5
    )
print("Training loss:", train_loss)
print("Validation loss:", val_loss)



Training loss: 3.8258963584899903
Validation loss: 3.7619213581085207


**Training**

In [None]:

torch.manual_seed(123)

# Optimizer
optimizer = torch.optim.AdamW(
 model.parameters(), lr=0.00005, weight_decay=0.1
)
# Epochs
num_epochs = 2
# Training
train_losses, val_losses, tokens_seen = train_model_simple(
 model, train_loader, val_loader, optimizer, device,
 num_epochs=num_epochs, eval_freq=5, eval_iter=5,
 start_context=format_input(val_data[0]), tokenizer=tokenizer
)


Ep 1 (Step 000000): Train loss 2.637, Val loss 2.626
Ep 1 (Step 000005): Train loss 1.174, Val loss 1.102
Ep 1 (Step 000010): Train loss 0.872, Val loss 0.945
Ep 1 (Step 000015): Train loss 0.856, Val loss 0.906
Ep 1 (Step 000020): Train loss 0.776, Val loss 0.881
Ep 1 (Step 000025): Train loss 0.753, Val loss 0.859
Ep 1 (Step 000030): Train loss 0.798, Val loss 0.836
Ep 1 (Step 000035): Train loss 0.714, Val loss 0.808
Ep 1 (Step 000040): Train loss 0.672, Val loss 0.806
Ep 1 (Step 000045): Train loss 0.633, Val loss 0.790
Ep 1 (Step 000050): Train loss 0.662, Val loss 0.783
Ep 1 (Step 000055): Train loss 0.760, Val loss 0.764
Ep 1 (Step 000060): Train loss 0.719, Val loss 0.743
Ep 1 (Step 000065): Train loss 0.652, Val loss 0.735
Ep 1 (Step 000070): Train loss 0.532, Val loss 0.729
Ep 1 (Step 000075): Train loss 0.569, Val loss 0.729
Ep 1 (Step 000080): Train loss 0.605, Val loss 0.725
Ep 1 (Step 000085): Train loss 0.509, Val loss 0.709
Ep 1 (Step 000090): Train loss 0.562, Val loss

The training output shows that the model is learning effectively, as we can tell based
on the consistently decreasing training and validation loss values over the two epochs.
This result suggests that the model is gradually improving its ability to understand and
follow the provided instructions.

In [None]:
# Plotting losses for more understanding
from chapter05 import plot_losses
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)

We can see that the model’s performance on both the training and validation sets improves substantially over the course of training. The rapid decrease in losses during the initial phase indicates that the model quickly learns meaningful patterns and representations from the data.

**Extracting and saving responses**

We now extract the responses from our model and compare them to the expected answers.

In [None]:
torch.manual_seed(123)
for entry in test_data[:3]: # First 3 examples
    input_text = format_input(entry) # Formats the input
    token_ids = generate( # Generates token IDs for the response
    model=model,
    idx=text_to_token_ids(input_text, tokenizer).to(device),
    max_new_tokens=256,
    context_size=BASE_CONFIG["context_length"],
    eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer) # Token IDs to text

    response_text = ( # Extract just the response, avoiding the input
    generated_text[len(input_text):]
    .replace("### Response:", "")
    .strip()
    )
    print(input_text)
    print(f"\nCorrect response:\n>> {entry['output']}")
    print(f"\nModel response:\n>> {response_text.strip()}")
    print("-------------------------------------")
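The response-extraction step in the loop above can be checked in isolation on plain strings. The sketch below wraps the same slicing and cleanup in a hypothetical helper, `extract_response`, and uses a made-up prompt and completion; it assumes the generated text begins with the prompt verbatim, which is what the slicing in the loop relies on.

```python
def extract_response(generated_text, input_text):
    """Strip the prompt and the response header from the generated text."""
    return (
        generated_text[len(input_text):]   # drop the prompt at the start
        .replace("### Response:", "")      # drop the response header
        .strip()                           # drop surrounding whitespace
    )


# Made-up prompt and completion, for illustration only
input_text = "### Instruction:\nName a primary color."
generated_text = input_text + "\n\n### Response:\nRed is a primary color."

response_text = extract_response(generated_text, input_text)
print(response_text)
```

If the model emitted no response header at all, the helper would simply return whatever follows the prompt.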

Measuring and evaluating our model's performance is not straightforward, because a response can be correct without matching the reference answer word for word.

There are different approaches for this.

In practice, it can be useful to consider all three types of evaluation methods: multiple-choice question answering, human evaluation, and automated metrics (metrics that automatically assess dialogue quality) that measure conversational performance.

However, since we are primarily interested in assessing conversational performance rather than just the ability to answer multiple-choice questions, human evaluation and automated metrics may be more relevant.

Conversational performance
Conversational performance of LLMs refers to their ability to engage in human-like
communication by understanding context, nuance, and intent. It encompasses skills
such as providing relevant and coherent responses, maintaining consistency, and
adapting to different topics and styles of interaction.

We will use another LLM to evaluate these responses.

To prepare the responses for this evaluation process, we generate a response for every entry in our test set, store it in the corresponding test_data dictionary under the "model_response" key, and save the updated data as an "instruction-data-with-response.json" file for record keeping.

In [None]:
# Generating test set responses
from tqdm import tqdm
for i, entry in tqdm(enumerate(test_data), total=len(test_data)):
 input_text = format_input(entry)
 token_ids = generate(
 model=model,
 idx=text_to_token_ids(input_text, tokenizer).to(device),
 max_new_tokens=256,
 context_size=BASE_CONFIG["context_length"],
 eos_id=50256
 )
 generated_text = token_ids_to_text(token_ids, tokenizer)

 response_text = (
 generated_text[len(input_text):]
 .replace("### Response:", "")
 .strip()
 )
 test_data[i]["model_response"] = response_text
 
with open("instruction-data-with-response.json", "w") as file:
 json.dump(test_data, file, indent=4) 

The evaluation itself is done with the Ollama application, which runs another LLM locally; I skipped this step because it takes about 5 GB.

The evaluator is simply another LLM that scores the responses generated in the previous steps against the expected answers.
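Although the Ollama step was skipped here, the two pieces around the evaluator call are plain string handling and can be sketched without running any model. Everything below is an assumption made for illustration: the helper names `build_scoring_prompt` and `parse_score`, the prompt wording, and the 0 to 100 scale. The call to the evaluator LLM itself is deliberately left out.

```python
import re


def build_scoring_prompt(entry):
    """Build a prompt asking an evaluator LLM to grade one model response."""
    task = entry["instruction"]
    if entry["input"]:
        task += "\n" + entry["input"]
    return (
        f"Given the input `{task}` "
        f"and correct output `{entry['output']}`, "
        f"score the model response `{entry['model_response']}` "
        "on a scale from 0 to 100, where 100 is the best score. "
        "Respond with the integer number only."
    )


def parse_score(reply):
    """Extract an integer score from the evaluator's reply, or None if invalid."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    score = int(match.group())
    return score if 0 <= score <= 100 else None


# Made-up test entry in the same shape as the saved JSON records
entry = {
    "instruction": "What is an antonym of 'complicated'?",
    "input": "",
    "output": "An antonym of 'complicated' is 'simple'.",
    "model_response": "An antonym of 'complicated' is 'easy'.",
}
prompt = build_scoring_prompt(entry)
print(prompt)
print(parse_score("Score: 85"))
```

Averaging the parsed scores over the test set, and skipping replies that cannot be parsed, would then give a single number to compare across fine-tuning runs.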

To further improve our model’s performance, we can explore various strategies, such as:

- Adjusting the hyperparameters during fine-tuning, such as the learning rate, batch size, or number of epochs
- Increasing the size of the training dataset or diversifying the examples to cover a broader range of topics and styles
- Experimenting with different prompts or instruction formats to guide the model’s responses more effectively
- Using a larger pretrained model, which may have greater capacity to capture complex patterns and generate more accurate responses

**Summary**

- The instruction fine-tuning process adapts a pretrained LLM to follow human instructions and generate desired responses.
- Preparing the dataset involves downloading an instruction-response dataset, formatting the entries, and splitting it into train, validation, and test sets.
- Training batches are constructed using a custom collate function that pads sequences, creates target token IDs, and masks padding tokens.
- We load a pretrained GPT-2 medium model with 355 million parameters to serve as the starting point for instruction fine-tuning.
- The pretrained model is fine-tuned on the instruction dataset using a training loop similar to pretraining.
- Evaluation involves extracting model responses on a test set and scoring them (for example, using another LLM).
- The Ollama application with an 8-billion-parameter Llama model can be used to automatically score the fine-tuned model’s responses on the test set, providing an average score to quantify performance.