# Instruction Tuning the Model

### Download Instruction Dataset

In [None]:
import json
import os
import urllib.request


def download_and_load_file(file_path, url):
    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)
    # Load from disk whether the file was just downloaded or already existed.
    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)
    return data


file_path = "instruction-data.json"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
    "/main/ch07/01_main-chapter-code/instruction-data.json"
)
data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))

Number of entries: 1100


Let's print one of the entries:

In [3]:
print("Example entry:\n", data[50])

Example entry:
 {'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}


##### TODO: Exercise 7.1 Changing prompt styles
After fine-tuning the model with the Alpaca prompt style, try the Phi-3 prompt style
shown in figure 7.4 and observe whether it affects the response quality of the model.
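
As a starting point for the exercise, here is a minimal sketch of what a Phi-3-style formatter could look like. Figure 7.4 is not reproduced in this notebook, so the exact template is an assumption: the `<|user|>` and `<|assistant|>` tags, the way the input is appended to the instruction, and the helper names `format_input_phi3` and `format_output_phi3` are all hypothetical and should be checked against the figure.

```python
def format_input_phi3(entry):
    # Hypothetical helper: wraps instruction (and optional input) in a
    # Phi-3-style user turn. The tag layout is an assumption.
    user_text = entry["instruction"]
    if entry["input"]:
        user_text += f"\n{entry['input']}"
    return f"<|user|>\n{user_text}"


def format_output_phi3(entry):
    # Hypothetical helper: the response follows an assistant tag.
    return f"\n\n<|assistant|>\n{entry['output']}"


entry = {
    "instruction": "Identify the correct spelling of the following word.",
    "input": "Ocassion",
    "output": "The correct spelling is 'Occasion.'",
}
phi3_prompt = format_input_phi3(entry) + format_output_phi3(entry)
print(phi3_prompt)
```

Compared with the Alpaca style, this template drops the fixed preamble, so each prompt is shorter.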

##### Implementing the prompt formatting function

In [None]:
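def format_input(entry):
    # Alpaca-style prompt: fixed preamble, instruction, and optional input.
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # Omit the "### Input:" section when the entry has no input text.
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text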


def format_output(entry):
    return f"\n\n### Response:\n{entry['output']}"

Let's test the new prompt formatting functions on a dataset entry and confirm that the output is correct.

In [20]:
test_data = data[50]
print(format_input(test_data) + format_output(test_data))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'


Let's test the prompt formatting functions on an entry that's missing the input field and confirm that the output is correct.

In [16]:
test_data = data[2]
print(format_input(test_data) + format_output(test_data))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Convert 45 kilometers to meters.

### Response:
45 kilometers is 45000 meters.


##### Partitioning the dataset

In [None]:
train_portion = int(len(data) * 0.85)
test_portion = int(len(data) * 0.10)
val_portion = len(data) - train_portion - test_portion

train_data = data[:train_portion]
test_data = data[train_portion : train_portion + test_portion]
val_data = data[train_portion + test_portion :]

print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))

Training set length: 935
Validation set length: 55
Test set length: 110


### Organize Data into Training Batches

To assemble multiple instruction examples into a batch, we pad the data samples to equal lengths. Batching involves the following steps:


1. **Format data using prompt template**.
2. **Tokenize formatted data**.
3. **Adjust inputs to the same length** with padding tokens.
4. **Create target token IDs**, by shifting the inputs by 1.
5. **Replace padding tokens with placeholders**, to exclude them from training loss.

##### Define Instruction Dataset
The dataset class applies the prompt formatting and pretokenizes all inputs in the dataset.

In [30]:
import torch
from torch.utils.data import Dataset


class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.encoded_texts = []

        for entry in data:
            formatted_input = format_input(entry)
            formatted_output = format_output(entry)
            full_text = formatted_input + formatted_output

            # Pretokenize once here so that batching only needs to pad.
            self.encoded_texts.append(tokenizer.encode(full_text))

    def __getitem__(self, index):
        return self.encoded_texts[index]

    def __len__(self):
        return len(self.data)

##### Tokenization

As in pretraining and classification fine-tuning, we use the GPT-2 tokenizer to encode the data. The special `<|endoftext|>` (`eot`) token, which has ID 50256, also serves as the padding token.

In [31]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))

[50256]


##### Adjust inputs to the same length
Define a collate function that pads the inputs in a batch to the same length.

A final version of the collate function will be used later on in the DataLoader, so that all inputs within a batch have the same length.

In [32]:
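def custom_collate_draft_1(batch, pad_token_id=50256, device="cpu"):
    # Longest sequence in the batch, plus one slot for an end-of-text token.
    max_length = max(len(item) + 1 for item in batch)
    inputs_lst = []

    for item in batch:
        new_item = item.copy()
        # Append one end-of-text token to every sequence.
        new_item += [pad_token_id]

        # Pad to max_length, then drop the extra last token so the inputs
        # have the length of the longest sequence.
        padded = new_item + [pad_token_id] * (max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])
        inputs_lst.append(inputs)

    # Stack the equal-length sequences into one (batch, length) tensor.
    inputs_tensor = torch.stack(inputs_lst).to(device)
    return inputs_tensor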

Let's test it out:

In [33]:
inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]
batch = (inputs_1, inputs_2, inputs_3)
print(custom_collate_draft_1(batch))

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])


As expected, the longest input has no padding, whereas the other two inputs are padded to match its length.

##### Create Targets

Let's update the collate function so we also compute the target tensors.

The target tensors are the same as the input tensors, shifted one position to the left, so that each input token is paired with the token that follows it.

In [34]:
def custom_collate_draft_2(batch, pad_token_id=50256, device="cpu"):
    max_length = max(len(item) + 1 for item in batch)
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]

        padded = new_item + [pad_token_id] * (max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])
        targets = torch.tensor(padded[1:])

        inputs_lst.append(inputs)
        targets_lst.append(targets)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor

In [38]:
inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]
batch = (inputs_1, inputs_2, inputs_3)
inputs, targets = custom_collate_draft_2(batch)
print(f"Inputs:\n : {inputs}")
print(f"Targets:\n : {targets}")

Inputs:
 : tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
Targets:
 : tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256, 50256, 50256, 50256],
        [    8,     9, 50256, 50256, 50256]])


##### Replace padding tokens with placeholders

Replacing the padding tokens in the targets with the placeholder value `-100` excludes them from the loss calculation.

- We keep the first `eot` token in each target sequence so that the LLM learns when to generate an end-of-text token in response to instructions, which we use as an indicator that the generated response is complete.
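
To see why `-100` works as a placeholder, recall that PyTorch's `cross_entropy` skips targets equal to its `ignore_index` argument, which defaults to `-100`. The sketch below uses made-up logits (the values are arbitrary and only for illustration) to check that a masked third position leaves the loss unchanged.

```python
import torch
import torch.nn.functional as F

# Toy logits for three token positions over a two-token vocabulary.
logits = torch.tensor([[-1.0, 1.0], [-0.5, 1.5], [-0.5, 1.5]])

# Loss computed over the first two positions only.
loss_two = F.cross_entropy(logits[:2], torch.tensor([0, 1]))

# All three positions, but the third target is the placeholder -100.
loss_masked = F.cross_entropy(logits, torch.tensor([0, 1, -100]))

print(loss_two, loss_masked)
print(torch.isclose(loss_two, loss_masked))
```

Because the default reduction averages only over the targets that are not ignored, the two losses match.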

In [None]:
def custom_collate_fn(
    batch, pad_token_id=50256, ignore_index=-100, device="cpu", allowed_max_length=None
):
    max_length = max(len(item) + 1 for item in batch)
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]

        padded = new_item + [pad_token_id] * (max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])
        targets = torch.tensor(padded[1:])

        # Keep the first padding token in targets as the end-of-text marker
        # and replace every later one with `ignore_index`.
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index

        # Optionally truncate sequences to a maximum length.
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_lst.append(inputs)
        targets_lst.append(targets)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor

Let's test it now.

We expect each target row to contain at most one padding token, followed only by ignore tokens (-100).

In [44]:
inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]
batch = (inputs_1, inputs_2, inputs_3)
inputs, targets = custom_collate_fn(batch)
print(f"Inputs:\n : {inputs}")
print(f"Targets:\n : {targets}")

Inputs:
 : tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
Targets:
 : tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256,  -100,  -100,  -100],
        [    8,     9, 50256,  -100,  -100]])


It works as expected!
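
The collate function is meant to be handed to a `DataLoader`. Here is a minimal sketch of that wiring, assuming `functools.partial` is used to pre-fill `allowed_max_length`. The `collate` helper is a condensed restatement of `custom_collate_fn` so the block runs on its own, and the toy `token_lists` are a hypothetical stand-in for a pretokenized dataset.

```python
from functools import partial

import torch
from torch.utils.data import DataLoader


def collate(batch, pad_token_id=50256, ignore_index=-100, allowed_max_length=None):
    # Condensed restatement of custom_collate_fn (illustration only).
    max_length = max(len(item) + 1 for item in batch)
    inputs_lst, targets_lst = [], []
    for item in batch:
        padded = item + [pad_token_id] * (max_length - len(item))
        inputs = torch.tensor(padded[:-1])
        targets = torch.tensor(padded[1:])
        # Keep the first padding token as end-of-text; mask the rest.
        pad_positions = torch.nonzero(targets == pad_token_id).flatten()
        targets[pad_positions[1:]] = ignore_index
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]
        inputs_lst.append(inputs)
        targets_lst.append(targets)
    return torch.stack(inputs_lst), torch.stack(targets_lst)


# Toy stand-in for pretokenized entries (a plain list works as a dataset).
token_lists = [[0, 1, 2, 3, 4], [5, 6], [7, 8, 9]]

loader = DataLoader(
    token_lists,
    batch_size=2,
    shuffle=False,
    collate_fn=partial(collate, allowed_max_length=4),
)

batches = list(loader)
for inputs, targets in batches:
    print(inputs.shape, targets.shape)
```

Padding happens per batch, so different batches can have different sequence lengths. Note that truncating with `allowed_max_length` can cut off the end-of-text token of a long sequence.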