# **Finetuning to Follow Instructions**

In [1]:
from importlib.metadata import version

pkgs = [
    "numpy",       
    "matplotlib", 
    "tiktoken",    
    "torch",      
    "tqdm",        
    "tensorflow",
]

for p in pkgs:
    print(f"{p} version:{version(p)}") 

numpy version:2.2.5
matplotlib version:3.10.1
tiktoken version:0.9.0
torch version:2.7.0
tqdm version:4.67.1
tensorflow version:2.19.0


## **1. Instruction Finetuning**

## **2. Dataset Preparation for Supervised Finetuning**

In [6]:
import json, os, requests

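def download_and_load_file(file_path, url):
    if not os.path.exists(file_path):
        # Create the parent directory first; open() does not create it
        os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        text_data = response.text
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)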
            
    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)
        
    return data

In [7]:
file_path = "data/instruct/instruction-data.json"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
    "/main/ch07/01_main-chapter-code/instruction-data.json" 
)

data = download_and_load_file(file_path, url)
print("Number of entries", len(data))

Number of entries 1100


In [8]:
print("Example entry:\n", data[50])

Example entry:
 {'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}


In [9]:
print("Another example entry:\n", data[444])

Another example entry:
 {'instruction': "Provide the past participle form of 'choose.'", 'input': '', 'output': "The past participle form of 'choose' is 'chosen.'"}


- Each item in the downloaded `data` list is a dictionary with `instruction`, `input`, and `output` keys.
- The `input` field may be empty.
- There are several ways to format the entries as inputs to the LLM; two example formats, used for training the Alpaca (https://crfm.stanford.edu/2023/03/13/alpaca.html) and Phi-3 (https://arxiv.org/abs/2404.14219) LLMs, are illustrated below.
    - The Alpaca-style template was one of the first publicly documented prompt templates for instruction finetuning.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch07_compressed/04.webp?2" width=640px>
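
For comparison with the Alpaca-style template, here is a minimal sketch of a Phi-3-style formatter. The function name `format_input_phi3` is hypothetical, and the layout (`<|user|>` and `<|assistant|>` markers, no system prompt, no end-of-turn markers) is a simplified assumption rather than the exact template used to train Phi-3.

```python
def format_input_phi3(entry):
    """Hypothetical Phi-3-style prompt formatter (simplified, assumed layout)."""
    # The instruction and the optional input share a single user turn.
    user_text = entry["instruction"]
    if entry["input"]:
        user_text += f"\n{entry['input']}"
    # Assumption: only the two role markers are used, with no system prompt.
    return f"<|user|>\n{user_text}\n\n<|assistant|>\n"


entry = {
    "instruction": "Identify the correct spelling of the following word.",
    "input": "Ocassion",
    "output": "The correct spelling is 'Occasion.'",
}
prompt = format_input_phi3(entry)
print(prompt + entry["output"])
```

This layout needs fewer boilerplate tokens per example than the Alpaca-style template, which is one reason to compare the two.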

In [10]:
# Proceeding with the Alpaca style prompt formatting.
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    
    return instruction_text + input_text

In [11]:
# Example where the input field is not empty
model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"

print(model_input, desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion 

### Response:
The correct spelling is 'Occasion.'


In [13]:
# Example where the input field is empty
model_input = format_input(data[444])
desired_response = f"\n\n### Response:\n{data[444]['output']}"

print(model_input, desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Provide the past participle form of 'choose.' 

### Response:
The past participle form of 'choose' is 'chosen.'


- We divide the dataset into `training`, `validation` and `test` sets before passing them to the dataloaders.

In [14]:
train_set = int(len(data) * 0.85)
test_set = int(len(data) * 0.1)
val_set = len(data) - train_set - test_set

train_data = data[:train_set]
test_data = data[train_set : train_set + test_set]
val_data = data[train_set + test_set:]

In [17]:
print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))

Training set length: 935
Validation set length: 55
Test set length: 110
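
The split logic can also be wrapped in a small helper so the fractions are explicit and the result is easy to check. This is a sketch only: the name `split_dataset` and its parameters are hypothetical, and a list of integers stands in for the 1,100 instruction entries.

```python
def split_dataset(data, train_frac=0.85, test_frac=0.10):
    """Split a list into contiguous train / validation / test slices."""
    n_train = int(len(data) * train_frac)
    n_test = int(len(data) * test_frac)
    train = data[:n_train]
    test = data[n_train:n_train + n_test]
    val = data[n_train + n_test:]  # the validation set takes the remainder
    return train, val, test


dummy_data = list(range(1100))  # stand-in for the instruction entries
train_part, val_part, test_part = split_dataset(dummy_data)
print(len(train_part), len(val_part), len(test_part))
```

Because the slices are contiguous, the three subsets never overlap and together cover the whole list.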


## **3. Organizing Data into Training Batches**

- We tackle dataset batching in several steps, as summarized in the figure below.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch07_compressed/06.webp?1" width=640px>

- First, we implement an `InstructionDataset` class that pre-tokenizes all inputs in the dataset, similar to the `SpamDataset` in chapter 6.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch07_compressed/07.webp?1" width=640px>

In [18]:
import torch
from torch.utils.data import Dataset

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        
        # Pre-tokenizing the text
        self.encoded_texts = []
        for entry in data:
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(
                tokenizer.encode(full_text)
            )
    
    def __getitem__(self, index):
        return self.encoded_texts[index]
    
    def __len__(self):
        return len(self.data)

- As before, all inputs within a batch are padded to the same length.
- We use the `<|endoftext|>` token as the padding token.

In [19]:
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))

[50256]


In [20]:
# Developing a custom collate function that can be passed to the dataloader.
# Its purpose is to pad the training examples in each batch to have the same length
# Note that different batches may vary in length.
def collate_draft_1(
    batch,
    pad_token_id=50256,
    device="cpu"
):
    # Find the longest sequence in the batch
    # and increase the max length by +1 to add an extra
    # padding token.
    batch_max_length = max(len(item)+1 for item in batch)
    
    # Pad and prepare inputs
    inputs_lst = []
    
    for item in batch:
        new_item = item.copy()
        # Add the padding token
        new_item += [pad_token_id] 
        # Pad sequences to batch_max_length
        padded = (
            new_item + [pad_token_id] * 
            (batch_max_length - len(new_item))
        )
        # padded[:-1] allows us to remove the extra padded token
        # which was added with +1 in batch_max_length. We will
        # circle back to the extra padding token in later sections.
        inputs = torch.tensor(padded[:-1])
        inputs_lst.append(inputs)
    
    # Convert list of inputs to tensor and transfer to target device.
    inputs_tensor = torch.stack(inputs_lst).to(device)
    return inputs_tensor

In [21]:
inp_1 = [0, 1, 2, 3, 4, 5]
inp_2 = [6, 7]
inp_3 = [8, 9, 10]

batch = (
    inp_1,
    inp_2,
    inp_3
)

batch

([0, 1, 2, 3, 4, 5], [6, 7], [8, 9, 10])

In [22]:
print(collate_draft_1(batch))

tensor([[    0,     1,     2,     3,     4,     5],
        [    6,     7, 50256, 50256, 50256, 50256],
        [    8,     9,    10, 50256, 50256, 50256]])


- The collate function above only produces the inputs to the LLM. For instruction finetuning, we also need the target values.
- As in pretraining, the targets are the input token IDs shifted by one position, so that the target at position `i` is the token at position `i + 1`; this sets up next-token prediction.
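
The shift can be illustrated with plain Python lists before involving tensors. This is a minimal sketch with made-up token IDs; the variable names are chosen for illustration only.

```python
pad_token_id = 50256
token_ids = [0, 1, 2, 3, 4, 5]

# Append one padding token so the last real token also gets a target
padded = token_ids + [pad_token_id]

inputs = padded[:-1]   # drop the last token
targets = padded[1:]   # drop the first token

for position, (inp, tgt) in enumerate(zip(inputs, targets)):
    print(f"position {position}: input {inp} -> target {tgt}")
```

Each target is the token that follows the corresponding input, and the final target is the end-of-text token.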

In [39]:
# Modifying the collate function

def collate_draft_2(
    batch,
    pad_token_id=50256,
    device="cpu"
):
    # Find the longest sequence in the batch
    batch_max_length = max(len(item)+1 for item in batch)
    
    # Pad and prepare inputs
    inputs_lst, targets_lst = [], []
    
    for item in batch:
        new_item = item.copy()
        # Add padding token
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )
        inputs = torch.tensor(padded[:-1]) # Truncate last token for inputs
        targets = torch.tensor(padded[1:]) # Shift +1 to the right for targets
        inputs_lst.append(inputs)
        targets_lst.append(targets)
        
    # Convert list of inputs to tensor and transfer to target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor

In [40]:
inputs, targets = collate_draft_2(batch)
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4,     5],
        [    6,     7, 50256, 50256, 50256, 50256],
        [    8,     9,    10, 50256, 50256, 50256]])
tensor([[    1,     2,     3,     4,     5, 50256],
        [    7, 50256, 50256, 50256, 50256, 50256],
        [    9,    10, 50256, 50256, 50256, 50256]])


- We introduce an `ignore_index` value that replaces every padding token ID in the targets except the first one; positions holding this value are excluded from the loss calculation.
- In this case, we replace `50256` with `-100`.
- We also introduce `allowed_max_length` in case we want to limit the length of the samples.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch07_compressed/11.webp?1" width=640px>
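
The masking rule itself is small enough to sketch on plain lists. The helper name `mask_padding` is hypothetical; it keeps the first padding token in a target sequence and replaces every later one with the ignore value.

```python
def mask_padding(targets, pad_token_id=50256, ignore_index=-100):
    """Return a copy of targets with all but the first padding token masked."""
    masked = list(targets)
    seen_pad = False
    for i, token in enumerate(masked):
        if token == pad_token_id:
            if seen_pad:
                masked[i] = ignore_index  # later padding tokens are ignored
            seen_pad = True               # the first one is kept as-is
    return masked


print(mask_padding([7, 50256, 50256, 50256, 50256, 50256]))
print(mask_padding([1, 2, 3, 4, 5, 50256]))
```

A sequence that contains only one padding token is returned unchanged, so the end-of-text signal is always preserved.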

In [41]:
def collate_fn(
    batch,
    pad_token_id=50256,
    ignore_index=-100,
    allowed_max_length=None,
    device="cpu"
):
    # Find the longest sequence in the batch
    batch_max_length = max(len(item)+1 for item in batch)

    # Pad and prepare inputs and targets
    inputs_lst, targets_lst = [], []
    
    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]
        
        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )
        inputs = torch.tensor(padded[:-1])
        targets = torch.tensor(padded[1:])
        
        # Replace all except the first padding tokens in targets using ignore_index
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index
        
        # Optionally truncate to maximum sequence length
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]
            
        inputs_lst.append(inputs)
        targets_lst.append(targets)
    
    # Convert list of inputs and targets to tensors and transfer to target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    
    return inputs_tensor, targets_tensor

In [47]:
inputs, targets = collate_fn(batch)
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4,     5],
        [    6,     7, 50256, 50256, 50256, 50256],
        [    8,     9,    10, 50256, 50256, 50256]])
tensor([[    1,     2,     3,     4,     5, 50256],
        [    7, 50256,  -100,  -100,  -100,  -100],
        [    9,    10, 50256,  -100,  -100,  -100]])
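
A collate function with extra keyword arguments can be handed to a PyTorch `DataLoader` by pre-filling those arguments with `functools.partial`. The sketch below uses a compact stand-in, `simple_collate` (a hypothetical name, not the notebook's function), and an arbitrary `allowed_max_length=4` to show the truncation.

```python
from functools import partial

import torch
from torch.utils.data import DataLoader


def simple_collate(batch, pad_token_id=50256, ignore_index=-100,
                   allowed_max_length=None):
    """Compact stand-in for a padding collate function (illustration only)."""
    max_len = max(len(item) for item in batch) + 1
    inputs_lst, targets_lst = [], []
    for item in batch:
        padded = item + [pad_token_id] * (max_len - len(item))
        inputs = torch.tensor(padded[:-1])
        targets = torch.tensor(padded[1:])
        # Keep the first padding token in the targets and mask the rest
        pad_positions = torch.nonzero(targets == pad_token_id).flatten()
        targets[pad_positions[1:]] = ignore_index
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]
        inputs_lst.append(inputs)
        targets_lst.append(targets)
    return torch.stack(inputs_lst), torch.stack(targets_lst)


token_lists = [[0, 1, 2, 3, 4, 5], [6, 7], [8, 9, 10]]  # toy "dataset"
loader = DataLoader(
    token_lists,
    batch_size=3,
    shuffle=False,
    collate_fn=partial(simple_collate, allowed_max_length=4),
)

batch_inputs, batch_targets = next(iter(loader))
print(batch_inputs.shape)
print(batch_targets)
```

Truncation happens after masking, so a truncated sample may lose its end-of-text target entirely, as the first row here does.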


### 3.1 The Role of `-100` in Ignoring Padding Tokens

- Consider a trivial classification task with two class labels, 0 and 1.
- We calculate the cross-entropy loss from the logit values below.

In [48]:
logits_1 = torch.tensor(
    [[-1.0, 1.0],
     [-0.5, 1.5]]
)
targets_1 = torch.tensor([0, 1])

loss_1 = torch.nn.functional.cross_entropy(logits_1, targets_1)
print(loss_1)

tensor(1.1269)


In [49]:
logits_2 = torch.tensor(
    [[-1.0, 1.0],
     [-0.5, 1.5],
     [-0.5, 1.5]]
)
targets_2 = torch.tensor([0, 1, 1])

loss_2 = torch.nn.functional.cross_entropy(logits_2, targets_2)
print(loss_2)

tensor(0.7936)


In [50]:
# Replacing the class label of one of the examples with -100
targets_3 = torch.tensor([0, 1, -100])
loss_3 = torch.nn.functional.cross_entropy(logits_2, targets_3)
print(loss_3)
print("loss_1 == loss_3: ", loss_1 == loss_3)

tensor(1.1269)
loss_1 == loss_3:  tensor(True)
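
The same numbers can be reproduced by hand. The sketch below is a plain-Python re-computation, not PyTorch's implementation: `cross_entropy_mean` is a hypothetical helper that averages the per-example loss over the targets that are not equal to the ignore value.

```python
import math


def cross_entropy_mean(logits, targets, ignore_index=-100):
    """Mean cross-entropy over all targets that are not ignore_index."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # skipped examples affect neither the sum nor the count
        log_sum_exp = math.log(sum(math.exp(value) for value in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)


logits = [[-1.0, 1.0], [-0.5, 1.5], [-0.5, 1.5]]

loss_two_rows = cross_entropy_mean(logits[:2], [0, 1])
loss_three_rows = cross_entropy_mean(logits, [0, 1, 1])
loss_masked = cross_entropy_mean(logits, [0, 1, -100])

print(round(loss_two_rows, 4), round(loss_three_rows, 4), round(loss_masked, 4))
```

The masked example is dropped from both the sum and the count, which is why the masked three-row loss equals the two-row loss instead of being scaled down.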


- This shows that the cross-entropy loss function ignored the training example with the `-100` label.
- By default, PyTorch's `cross_entropy` uses `ignore_index=-100`, so examples with the label `-100` do not contribute to the loss.
- This allows us to ignore the extra end-of-text (padding) tokens in the training batches.
- Note that we **don't** want to ignore the first instance of the end-of-text (padding) token (`50256`), because it signals to the LLM that the response is complete.
- It is also common to mask out the target token IDs that correspond to the instruction.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch07_compressed/13.webp" width=640px>
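
Instruction masking can be sketched in the same way as padding masking. The helper `mask_instruction` and the token IDs below are hypothetical; the sketch assumes the number of instruction tokens at the start of each sequence is known.

```python
def mask_instruction(targets, instruction_length, ignore_index=-100):
    """Return a copy of targets with the instruction positions masked.

    The targets are shifted by one relative to the inputs, so an
    instruction of k tokens occupies the first k - 1 target positions.
    """
    masked = list(targets)
    n_masked = min(max(instruction_length - 1, 0), len(masked))
    for i in range(n_masked):
        masked[i] = ignore_index
    return masked


pad_token_id = 50256
instruction_ids = [10, 11, 12]   # made-up IDs for the prompt part
response_ids = [20, 21]          # made-up IDs for the response part

padded = instruction_ids + response_ids + [pad_token_id]
targets = padded[1:]

print(mask_instruction(targets, len(instruction_ids)))
```

With this mask, the loss is computed only on the response tokens and the first end-of-text token.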