# Scratchpad 4 - Instruction Fine-tuning (SFT)

Why instruction fine-tuning? A pretrained LLM is good at text completion (predicting the next token) but not at following instructions. With instruction fine-tuning, we teach the LLM to follow instructions better, i.e. to generate text that is a desirable response to the user's instruction. This is often called "supervised fine-tuning (SFT)".

For SFT, we need to
1. prepare instruction fine-tuning data consisting of desired input-output pairs
2. load and fine-tune a pretrained LLM
3. evaluate the fine-tuned LLM

## 1. Data preparation

To prepare data for supervised instruction fine-tuning, we need to:
1. Get the raw dataset and format it according to the SFT template
2. Create an SFT Dataset that formats the JSON data into the prompt template and tokenizes the text into token IDs
3. Create a DataLoader for the custom SFT Dataset that uses a custom collate function to prepare input and target batches of the encoded data

### 1.1 SFT Data

#### 1.1.1 Train and test data

The training data is stored in a JSON file in `./data/sft/train`. Each record has three fields: `instruction`, `input`, and `output`.

```json
[
    {
        "instruction": "Evaluate the following phrase by transforming it into the spelling given.",
        "input": "freind --> friend",
        "output": "The spelling of the given phrase \"freind\" is incorrect, the correct spelling is \"friend\"."
    },
    ...
]
```

The test data is used later for evaluation, where the fine-tuned model's response is added to each record for comparison. The file will be prepared in `./data/sft/test`, and each record will have an additional `model_response` field, e.g.:

```json
[
    {
        "instruction": "Rewrite the sentence using a simile.",
        "input": "The car is very fast.",
        "output": "The car is as fast as lightning.",
        "model_response": "The car is as fast as a bullet."
    },
    ...
]
```

#### 1.1.2 LLM prompt templates

Now we format the training data into inputs for the LLM.

There are two example prompt template formats:

##### Alpaca
Stanford CRFM used this template to train [Stanford Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html).
  - They released 52K instruction-following examples and the generation script in their [repo](https://github.com/tatsu-lab/stanford_alpaca#data-release). Our training data uses exactly the same JSON format.
  - The LLM prompt template reflects this format:

With the input field:

```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
```
Without the input field:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
```
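As a minimal sketch of how the two template variants could be selected per record: the helper name `format_alpaca_prompt` and the two constants are hypothetical and not part of this notebook, while the template wording is copied from the blocks above.

```python
# Hypothetical helper (not from the notebook): chooses the Alpaca template
# variant depending on whether the record has a non-empty "input" field.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def format_alpaca_prompt(entry):
    """Return the prompt text for one record, ending at the response header."""
    if entry.get("input"):
        return PROMPT_WITH_INPUT.format(
            instruction=entry["instruction"], input=entry["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=entry["instruction"])


record = {
    "instruction": "Rewrite the sentence using a simile.",
    "input": "The car is very fast.",
    "output": "The car is as fast as lightning.",
}
# During training, the desired output is appended after the response header.
print(format_alpaca_prompt(record) + record["output"])
```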


##### Phi-3

Microsoft Research used this template to train [Phi-3](https://github.com/microsoft/Phi-3CookBook/tree/main).

The Phi-3 training data prompt template is as follows:

```
<|system|>
Your Role<|end|>
<|user|>
Your Question?<|end|>
<|assistant|>
```

A line of `jsonl` training data would look as follows:

```json
{"text": "<|user|>\nWhen were iron maidens commonly used? <|end|>\n<|assistant|> \nIron maidens were never commonly used <|end|>"}
```

When fine-tuning with Azure AI, however, the data format is aligned with GPT's chat format, i.e.:

```json
{"messages": [{"role": "system", "content": "You are an Xbox customer support agent whose primary goal is to help users with issues they are experiencing with their Xbox devices. You are friendly and concise. You only provide factual answers to queries, and do not provide answers that are not related to Xbox."}, {"role": "user", "content": "Is Xbox better than PlayStation?"}, {"role": "assistant", "content": "I apologize, but I cannot provide personal opinions. My primary job is to assist you with any issues related to your Xbox device. Do you have any Xbox-related issues that need addressing?"}]}
```
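A rough sketch of how one of our `instruction`/`input`/`output` records could be converted into either format. The function names, the default system prompt, and the choice to join `instruction` and `input` with a newline are all assumptions made for illustration; the text layout follows the Phi-3 template block above (without the extra spaces seen in the `jsonl` example).

```python
import json

# Hypothetical default; the source does not prescribe a system prompt.
DEFAULT_SYSTEM = "You are a helpful assistant."


def _user_text(entry):
    """Join instruction and optional input (joining with a newline is an assumption)."""
    if entry.get("input"):
        return entry["instruction"] + "\n" + entry["input"]
    return entry["instruction"]


def to_phi3_text(entry):
    """Render one record as a Phi-3 style training string."""
    return (
        f"<|user|>\n{_user_text(entry)}<|end|>\n"
        f"<|assistant|>\n{entry['output']}<|end|>"
    )


def to_chat_messages(entry, system=DEFAULT_SYSTEM):
    """Render one record in the GPT-style chat `messages` layout."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": _user_text(entry)},
            {"role": "assistant", "content": entry["output"]},
        ]
    }


record = {
    "instruction": "Rewrite the sentence using a simile.",
    "input": "The car is very fast.",
    "output": "The car is as fast as lightning.",
}
# Each call produces one line of a jsonl file.
print(json.dumps({"text": to_phi3_text(record)}))
print(json.dumps(to_chat_messages(record)))
```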

This notebook uses the Alpaca prompt template.

#### 1.1.3 Load and format data

Load the data from the JSON file.

In [16]:
# load the training data

import json
data_file = './data/sft/train/instruction-data.json'
with open(data_file, "r", encoding="utf-8") as f:
    data = json.load(f)

In [14]:
print(
f"""number of training data JSON lines: {len(data)}
Example:\n{json.dumps(data[0], indent=4)}"""
)

number of training data JSON lines: 1100
Example:
{
    "instruction": "Evaluate the following phrase by transforming it into the spelling given.",
    "input": "freind --> friend",
    "output": "The spelling of the given phrase \"freind\" is incorrect, the correct spelling is \"friend\"."
}


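Format each record according to the prompt template. Note that `format_input` below deviates slightly from the Alpaca template shown in 1.1.2: it always uses the no-input preamble, omits the space between the two preamble sentences, says "task" instead of "request", and drops the colon after `### Instruction` and `### Input`. The outputs recorded in this notebook reflect that variant.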

In [17]:
# format the data into prompt template for input to LLM

def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task."
        f"Write a response that appropriately completes the task."
        f"\n\n### Instruction\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text

In [18]:
model_input = format_input(data[0])
desired_output = f"\n\n### Response:\n{data[0]['output']}"

print(model_input + desired_output)

Below is an instruction that describes a task.Write a response that appropriately completes the task.

### Instruction
Evaluate the following phrase by transforming it into the spelling given.

### Input
freind --> friend

### Response:
The spelling of the given phrase "freind" is incorrect, the correct spelling is "friend".


#### 1.1.4 Split training, validation and test datasets

In [20]:
train_portion = int(len(data) * 0.85) # 85% of the data for training
test_portion = int(len(data) * 0.1) # 10% of the data for testing
val_portion = len(data) - train_portion - test_portion # 5% of the data for validation

train_data = data[:train_portion]
test_data = data[train_portion:train_portion + test_portion]
val_data = data[train_portion + test_portion:]

In [21]:
print(f"train: {len(train_data)}, test: {len(test_data)}, val: {len(val_data)}")

train: 935, test: 110, val: 55
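
The slicing above takes records in file order. Below is a reusable sketch of the same 85/10/5 arithmetic. The optional seeded shuffle is an added assumption (the notebook itself does not shuffle before splitting), and `split_dataset` is a hypothetical name.

```python
import random


def split_dataset(records, train_frac=0.85, test_frac=0.10, seed=None):
    """Split records into (train, test, val); val receives the remainder."""
    records = list(records)  # copy so the caller's sequence is not reordered
    if seed is not None:
        # Assumption: shuffle with a fixed seed so the split is reproducible.
        random.Random(seed).shuffle(records)
    n_train = int(len(records) * train_frac)
    n_test = int(len(records) * test_frac)
    train = records[:n_train]
    test = records[n_train:n_train + n_test]
    val = records[n_train + n_test:]
    return train, test, val


# Same sizes as the notebook output for 1100 records.
train, test, val = split_dataset(range(1100))
print(len(train), len(test), len(val))  # 935 110 55
```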


### 1.2 Create custom Dataset

Implement a custom Dataset that, at initialization:
- formats the JSON data into the LLM prompt template
- tokenizes the text into token IDs

In [22]:
import torch
from torch.utils.data import Dataset


class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.encoded_texts = []

        for entry in data:
            # format the data input and response into prompt template
            instruction_plus_input = (
                f"Below is an instruction that describes a task."
                f"Write a response that appropriately completes the task."
                f"\n\n### Instruction\n{entry['instruction']}"
            ) + (f"\n\n### Input\n{entry['input']}" if entry["input"] else "")
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text

            # tokenize the full text
            self.encoded_texts.append(
                tokenizer.encode(full_text)
            )

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.encoded_texts[idx]

Test it out.

In [40]:
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
train_dataset = InstructionDataset(train_data, tokenizer)

print("######## training data json ########")
print(f"training json data length: {len(train_data)}")
print(f"first training data json: {json.dumps(train_data[0], indent=4)}")
print("\n######## training dataset ########")
print(f"training dataset size: {len(train_dataset)}")
print(f"first item of the training dataset: \n{train_dataset[0]}")
print(f"Decoded first item of the training dataset: \n{tokenizer.decode(train_dataset[0])}")

######## training data json ########
training json data length: 935
first training data json: {
    "instruction": "Evaluate the following phrase by transforming it into the spelling given.",
    "input": "freind --> friend",
    "output": "The spelling of the given phrase \"freind\" is incorrect, the correct spelling is \"friend\"."
}

######## training dataset ########
training dataset size: 935
first item of the training dataset: 
[21106, 318, 281, 12064, 326, 8477, 257, 4876, 13, 16594, 257, 2882, 326, 20431, 32543, 262, 4876, 13, 198, 198, 21017, 46486, 198, 36, 2100, 4985, 262, 1708, 9546, 416, 25449, 340, 656, 262, 24993, 1813, 13, 198, 198, 21017, 23412, 198, 19503, 521, 14610, 1545, 198, 198, 21017, 18261, 25, 198, 464, 24993, 286, 262, 1813, 9546, 366, 19503, 521, 1, 318, 11491, 11, 262, 3376, 24993, 318, 366, 6726, 1911]
Decoded first item of the training dataset: 
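Below is an instruction that describes a task.Write a response that appropriately completes the task.

### Instruction
Evaluate the following phrase by transforming it into the spelling given.

### Input
freind --> friend

### Response:
The spelling of the given phrase "freind" is incorrect, the correct spelling is "friend".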

### 1.3 Create DataLoader with a custom collate function

#### 1.3.1 Custom collate function

Now we develop a custom collate function for batching, which
- pads items in each batch to the length of the longest sequence in that batch
- prepares the input and target pair for each sample in the batch
- masks padding tokens so that they are ignored in training
- optionally truncates sequences to a given model context window

In [44]:
import torch
from utils.get_device import get_default_device

def custom_collate_fn(
    batch,
    pad_token_id=50256,  # default pad token is GPT2's end of text token <|endoftext|>
    ignore_token_index=-100,  # PyTorch ignores token ID -100 in calculating loss
    allowed_max_length=None,
    device=None,
):
    # pad items in the batch to the same length as the longest sequence in this batch
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_batch_list, targets_batch_list = [], []
    for item in batch:
        # first, add an <|endoftext|> token to the end of the sequence
        new_item = item.copy() + [pad_token_id]
        # pad the sequence to the same length as the longest sequence in the batch
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))

        # prepare inputs and targets from the padded sequence
        inputs = torch.tensor(padded[:-1]) # remove the last token for inputs
        targets = torch.tensor(padded[1:]) # shift by one position so each target is the next token

        # mask the padding tokens in targets with the ignore token so that they don't affect training
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1: 
            targets[indices[1:]] = ignore_token_index # keep the first padding token because it shows the end of the sequence

        # if given an allowed max sequence length, truncate both inputs and targets sequences
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        # add prepared inputs and outputs tokens to the corresponding batch list
        inputs_batch_list.append(inputs)
        targets_batch_list.append(targets)
    
    # stack inputs and targets batch lists into batch tensors and move them to device
    if device is None:
        device = get_default_device()
    inputs_tensor = torch.stack(inputs_batch_list).to(device)
    targets_tensor = torch.stack(targets_batch_list).to(device)


    return inputs_tensor, targets_tensor

Test it out.

In [45]:
inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]

batch = (
    inputs_1,
    inputs_2,
    inputs_3
)

inputs, targets = custom_collate_fn(batch, allowed_max_length=4)
print(f"inputs:{inputs}\ntargets:{targets}")

inputs:tensor([[    0,     1,     2,     3],
        [    5,     6, 50256, 50256],
        [    7,     8,     9, 50256]], device='mps:0')
targets:tensor([[    1,     2,     3,     4],
        [    6, 50256,  -100,  -100],
        [    8,     9, 50256,  -100]], device='mps:0')
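
To illustrate why the `-100` entries matter, here is a dependency-free sketch of the arithmetic that a cross-entropy loss with an ignore index performs: positions whose target equals the ignore index are skipped, and the loss is averaged over the remaining positions only. The function name `masked_cross_entropy` and the toy logits are made up for illustration; in the notebook this is left to PyTorch's loss function.

```python
import math


def masked_cross_entropy(logits, targets, ignore_index=-100):
    """Mean cross-entropy over positions whose target is not ignore_index.

    logits: list of per-position score lists (one score per vocabulary entry)
    targets: list of target token IDs, same length as logits
    """
    losses = []
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # masked padding position contributes nothing
        # numerically stable negative log-softmax of the target entry
        max_score = max(scores)
        log_sum_exp = max_score + math.log(
            sum(math.exp(s - max_score) for s in scores)
        )
        losses.append(log_sum_exp - scores[target])
    if not losses:
        return float("nan")  # every position was masked
    return sum(losses) / len(losses)


# Toy example with a vocabulary of 3 tokens and 3 positions.
toy_logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 3.0], [1.0, 1.0, 1.0]]
with_mask = masked_cross_entropy(toy_logits, [0, 2, -100])
without_last = masked_cross_entropy(toy_logits[:2], [0, 2])
print(math.isclose(with_mask, without_last))  # True
```

The masked third position has no effect on the result, which is the behavior the collate function relies on for padding tokens.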


#### 1.3.2 Train, test, validation DataLoaders

The train, test and validation dataloaders will use the custom dataset and collate function we defined. 

The custom collate function takes a `device` argument, so before passing it to the DataLoader we bind the correct `device`. `functools.partial` from Python's standard library creates a new function with the `device` argument pre-filled. We also pre-fill `allowed_max_length` with 1024, GPT-2's context window.

In [46]:
from utils.get_device import get_default_device
device = get_default_device()

from functools import partial

customized_collate_fn = partial(
    custom_collate_fn,
    device=device,
    allowed_max_length=1024,
)
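
As a small, self-contained illustration of what `partial` does here, the sketch below uses a toy stand-in named `collate` (hypothetical, plain lists instead of tensors) rather than the notebook's `custom_collate_fn`.

```python
from functools import partial


def collate(batch, pad_token_id=0, allowed_max_length=None):
    """Toy collate: pad to the longest item, then optionally truncate."""
    width = max(len(item) for item in batch)
    padded = [list(item) + [pad_token_id] * (width - len(item)) for item in batch]
    if allowed_max_length is not None:
        padded = [row[:allowed_max_length] for row in padded]
    return padded


# Pre-fill the keyword argument. The resulting callable still takes `batch`
# as its only required argument, which is how a DataLoader calls collate_fn.
collate_max3 = partial(collate, allowed_max_length=3)

print(collate_max3([[1, 2, 3, 4], [5]]))  # [[1, 2, 3], [5, 0, 0]]
```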

Now we can create the dataloaders for train, test, and validation.

In [47]:
from torch.utils.data import DataLoader

num_workers = 0
batch_size = 8

torch.manual_seed(123)

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers,
)

test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers,
)

val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers,
)
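
The effect of `drop_last` on the number of batches per epoch can be checked with a little arithmetic. This is a sketch with a hypothetical helper `num_batches`, using the split sizes reported earlier (935, 110, 55) and `batch_size = 8`.

```python
import math


def num_batches(num_samples, batch_size, drop_last):
    """Number of batches a loader yields for a dataset of num_samples items."""
    if drop_last:
        # the incomplete final batch is discarded
        return num_samples // batch_size
    # the incomplete final batch is kept
    return math.ceil(num_samples / batch_size)


print(num_batches(935, 8, drop_last=True))   # 116
print(num_batches(110, 8, drop_last=False))  # 14
print(num_batches(55, 8, drop_last=False))   # 7
```

With `drop_last=True`, the last 7 training samples of each shuffled epoch are skipped, so every training batch has exactly 8 items.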

## 2. Fine-tuning a pretrained LLM

To fine-tune a pretrained LLM, we need to
1. load the pretrained LLM
2. train (fine-tune) it on SFT data
3. save the model

## 3. Evaluating the fine-tuned LLM

To evaluate the fine-tuned LLM, we will use another LLM to score its responses:
1. run inference on test set and save the responses
2. compare ground truth in test set with generated responses
3. use another LLM to score the fine-tuned responses
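
The data handling around these steps might look like the sketch below. It only covers attaching responses to test records (using the `model_response` field from 1.1.1) and building a scoring prompt; the call to the scoring LLM is deliberately left out. The function names and the prompt wording are hypothetical, not taken from this notebook.

```python
import json


def attach_responses(test_records, responses):
    """Return copies of the test records with a `model_response` field added."""
    if len(test_records) != len(responses):
        raise ValueError("need exactly one response per test record")
    return [
        dict(record, model_response=response)
        for record, response in zip(test_records, responses)
    ]


def build_scoring_prompt(record):
    """Build a prompt asking another LLM for a score (wording is hypothetical)."""
    task = record["instruction"]
    if record.get("input"):
        task += "\n" + record["input"]
    return (
        f"Given the task:\n{task}\n\n"
        f"and the reference answer:\n{record['output']}\n\n"
        f"score the model response:\n{record['model_response']}\n\n"
        "on a scale from 0 to 100. Respond with the integer score only."
    )


test_records = [
    {
        "instruction": "Rewrite the sentence using a simile.",
        "input": "The car is very fast.",
        "output": "The car is as fast as lightning.",
    }
]
scored_input = attach_responses(test_records, ["The car is as fast as a bullet."])
print(json.dumps(scored_input, indent=4))
print(build_scoring_prompt(scored_input[0]))
```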