# **Finetuning to Follow Instructions**

In [1]:
from importlib.metadata import version

pkgs = [
    "numpy",       
    "matplotlib", 
    "tiktoken",    
    "torch",      
    "tqdm",        
    "tensorflow",
]

for p in pkgs:
    print(f"{p} version:{version(p)}") 

numpy version:2.2.5
matplotlib version:3.10.1
tiktoken version:0.9.0
torch version:2.7.0
tqdm version:4.67.1
tensorflow version:2.19.0


## **1. Instruction Finetuning**

## **2. Dataset Preparation for Supervised Finetuning**

In [6]:
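import json
import os

import requests


def download_and_load_file(file_path, url):
    # Download only if no local copy exists yet
    if not os.path.exists(file_path):
        # Create the parent directory (e.g. data/instruct) if it is missing
        os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(response.text)

    # Parse the cached JSON file
    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)

    return data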

In [7]:
file_path = "data/instruct/instruction-data.json"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
    "/main/ch07/01_main-chapter-code/instruction-data.json" 
)

data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))

Number of entries: 1100


In [8]:
print("Example entry:\n", data[50])

Example entry:
 {'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}


In [9]:
print("Another example entry:\n", data[444])

Another example entry:
 {'instruction': "Provide the past participle form of 'choose.'", 'input': '', 'output': "The past participle form of 'choose' is 'chosen.'"}
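
The two entries printed above differ in whether `input` is populated. As a minimal sketch of how that could be checked across a whole dataset, the snippet below counts entries with and without an `input` value and raises an error if a key is missing. The helper name `summarize_entries`, the `REQUIRED_KEYS` set, and the two-entry `sample` list (copied from the outputs above so the snippet runs without the download) are assumptions made for illustration and are not part of the original notebook.

```python
from collections import Counter

# Two entries copied from the printed outputs above; they stand in for the full `data` list
sample = [
    {
        "instruction": "Identify the correct spelling of the following word.",
        "input": "Ocassion",
        "output": "The correct spelling is 'Occasion.'",
    },
    {
        "instruction": "Provide the past participle form of 'choose.'",
        "input": "",
        "output": "The past participle form of 'choose' is 'chosen.'",
    },
]

# Assumed schema: every entry carries exactly these three keys
REQUIRED_KEYS = {"instruction", "input", "output"}


def summarize_entries(entries):
    """Count entries with and without an `input` value; raise if a key is missing."""
    counts = Counter()
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise KeyError(f"entry {i} is missing keys: {sorted(missing)}")
        counts["with_input" if entry["input"] else "without_input"] += 1
    return dict(counts)


summary = summarize_entries(sample)
print(summary)
```

Calling `summarize_entries(data)` on the downloaded list would report the same two counts for the full dataset.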


- Items in the downloaded `data` list are stored as dictionaries.
- Entries may contain empty `input` fields. 
- There are several ways to format the entries as inputs to the LLM. Two example formats, used for training the Alpaca (https://crfm.stanford.edu/2023/03/13/alpaca.html) and Phi-3 (https://arxiv.org/abs/2404.14219) LLMs, are illustrated below.
    - The Alpaca-style template was one of the earliest publicly documented prompt formats for instruction finetuning.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch07_compressed/04.webp?2" width=640px>
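
The notebook continues with the Alpaca-style format. For comparison, here is a rough sketch of what a Phi-3-style formatter might look like. The function name `format_input_phi3` is hypothetical, and the layout (the `<|user|>` and `<|assistant|>` markers, the input placed on its own line, no end-of-turn token) is a simplified assumption rather than the exact template used to train Phi-3.

```python
def format_input_phi3(entry):
    """Format an entry with <|user|> / <|assistant|> markers (simplified sketch)."""
    user_text = entry["instruction"]
    if entry["input"]:
        # Assumed layout: the optional input goes on its own line below the instruction
        user_text += f"\n{entry['input']}"
    return f"<|user|>\n{user_text}\n\n<|assistant|>\n"


# Entry copied from the dataset output shown earlier
entry = {
    "instruction": "Identify the correct spelling of the following word.",
    "input": "Ocassion",
    "output": "The correct spelling is 'Occasion.'",
}

phi3_prompt = format_input_phi3(entry)
print(phi3_prompt + entry["output"])
```

Compared with the Alpaca-style template, this format drops the fixed preamble sentence, so each formatted example is shorter.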

In [10]:
# Proceed with the Alpaca-style prompt format.
def format_input(entry):
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    # Add the input section only when the entry has a non-empty `input` field
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

In [11]:
# Demonstration of a formatted prompt and response for an entry with an input field
model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'


In [13]:
# Example where the input field is empty
model_input = format_input(data[444])
desired_response = f"\n\n### Response:\n{data[444]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Provide the past participle form of 'choose.'

### Response:
The past participle form of 'choose' is 'chosen.'


- The dataset is divided into training (85%), test (10%), and validation (the remaining 5%) sets before being passed to the data loaders.

In [14]:
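# Note: these three variables hold entry counts, not the subsets themselves
train_set = int(len(data) * 0.85)            # 85% for training
test_set = int(len(data) * 0.1)              # 10% for testing
val_set = len(data) - train_set - test_set   # remaining 5% for validation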

train_data = data[:train_set]
test_data = data[train_set : train_set + test_set]
val_data = data[train_set + test_set:]

In [17]:
print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))

Training set length: 935
Validation set length: 55
Test set length: 110
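
The slicing logic above can be wrapped in a small reusable helper. The sketch below uses a hypothetical function named `split_dataset` (not part of the original notebook) with the same 85% / 10% fractions, and a dummy list of 1100 integers standing in for `data` so that it runs without the download. Like the cell above, it does not shuffle the entries.

```python
def split_dataset(entries, train_frac=0.85, test_frac=0.1):
    """Split a list into train, validation, and test parts by position (no shuffling)."""
    n_train = int(len(entries) * train_frac)
    n_test = int(len(entries) * test_frac)

    train = entries[:n_train]
    test = entries[n_train:n_train + n_test]
    val = entries[n_train + n_test:]  # whatever remains after train and test
    return train, val, test


# Dummy stand-in for the 1100-entry dataset
dummy = list(range(1100))
train_part, val_part, test_part = split_dataset(dummy)

# The three parts together must reproduce the original list with no overlap or gaps
assert train_part + test_part + val_part == dummy

print(len(train_part), len(val_part), len(test_part))
```

With 1100 entries this gives the same 935 / 55 / 110 split as the lengths printed above. Taking the validation set as the remainder means no entry is lost to rounding.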
