
**Instruction fine-tuning**
- also called supervised instruction fine-tuning
- fine-tunes a pretrained LLM on instruction-response pairs so that it follows instructions


#### 1. Preparing the dataset

##### Download the Stanford Alpaca dataset from GitHub

In [2]:
import json
import urllib.request

url = "https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/refs/heads/main/alpaca_data.json"

file_path = 'alpaca_data.json'
urllib.request.urlretrieve(url, file_path)

# load
with open(file_path, 'r', encoding='utf-8') as f:
    dataset = json.load(f)

print(f'Number of entries: {len(dataset)}')

Number of entries: 52002


In [3]:
# a subset of the dataset -> 10k samples

import random
subset_size = 10000
random.seed(42)
subset_data = random.sample(dataset, subset_size) # a list of dictionaries

# save
subset_file_path = 'alpaca_subset.json'

# write the list of dictionaries to disk as JSON
with open(subset_file_path, 'w', encoding='utf-8') as f:
    json.dump(subset_data, f, indent=4)

print(f'Number of entries: {len(subset_data)}')

Number of entries: 10000


In [6]:
# subset_data[:18]

In [7]:
# load the subset_data

with open(subset_file_path, 'r', encoding='utf-8') as f:
    subset_dataset = json.load(f)

print(f'Number of entries: {len(subset_dataset)}')

Number of entries: 10000


In [15]:
print(f'Example:\n {subset_dataset[4000]}') # no input section

Example:
 {'instruction': 'Create a simile about the sound of a waterfall', 'input': '', 'output': 'The sound of a waterfall is like a roaring lion - loud, powerful, and majestic.'}


In [14]:
print(f'Example:\n {subset_dataset[5000]}')

Example:
 {'instruction': 'You are provided with the following title. Write a summary of the article with a length of no more than 60 words:', 'input': '"5 Reasons Music Education is Important for Young People"', 'output': 'Music education has numerous benefits for young people, from improving social skills and fostering teamwork to aiding in self-expression and boosting confidence. It can also positively influence academic ability and mental health. This article explores five reasons why music education is important for young people.'}


So each entry in the dataset has this structure:


```
{'instruction': ...,
 'input': ...,   # may be empty
 'output': ...}
```
there are various ways to format these entries:
- `Alpaca prompt style`
- `Phi-3 prompt style`
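
For comparison, here is a minimal sketch of what a Phi-3-style formatter might look like. The `<|user|>` and `<|assistant|>` markers and the function name `format_input_phi3` are assumptions for illustration only; the rest of these notes use the Alpaca style.

```python
def format_input_phi3(entry):
  # Hypothetical Phi-3-style template: user turn, then the assistant marker.
  user_text = entry['instruction']
  if entry['input']:
    # append the optional input on its own line
    user_text += f"\n{entry['input']}"
  return f"<|user|>\n{user_text}\n\n<|assistant|>\n"


example = {
    'instruction': 'Create a simile about the sound of a waterfall',
    'input': '',
    'output': 'The sound of a waterfall is like a roaring lion - loud, powerful, and majestic.',
}
print(format_input_phi3(example) + example['output'])
```

Compared with the Alpaca style, this template has no fixed preamble, so each formatted entry is shorter.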


##### Implementing the formatting function

In [20]:
def format_input(entry):
  instruction_txt = (
      f"Below is an instruction that describes a task. "
      f"Write a response that appropriately completes the request.\n\n"
      f"### Instruction:\n{entry['instruction']}"
  )
  input_text = (
      f"\n\n### Input:\n{entry['input']}" if entry['input'] else ''
  )
  return instruction_txt + input_text


In [23]:
model_input = format_input(subset_dataset[5000])
desired_response = f"\n\n### Response:\n{subset_dataset[5000]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
You are provided with the following title. Write a summary of the article with a length of no more than 60 words:

### Input:
"5 Reasons Music Education is Important for Young People"

### Response:
Music education has numerous benefits for young people, from improving social skills and fostering teamwork to aiding in self-expression and boosting confidence. It can also positively influence academic ability and mental health. This article explores five reasons why music education is important for young people.


##### Train, test, and validation split

In [24]:
train_set = int(len(subset_dataset)*0.85)
test_set = int(len(subset_dataset)*0.1)
val_set = len(subset_dataset) - train_set - test_set

train_data = subset_dataset[:train_set]
test_data = subset_dataset[train_set:train_set+test_set]
val_data = subset_dataset[train_set+test_set:]

print(f'Train set size: {len(train_data)}')
print(f'Test set size: {len(test_data)}')
print(f'Validation set size: {len(val_data)}')

Train set size: 8500
Test set size: 1000
Validation set size: 500
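
The same split can be wrapped in a small helper. This is only a sketch: `split_dataset` and its parameter names are hypothetical. Slicing contiguous ranges is enough here because `random.sample` already returned the entries in random order.

```python
def split_dataset(data, train_frac=0.85, test_frac=0.10):
  # Hypothetical helper: whatever remains after train and test is validation.
  n_train = int(len(data) * train_frac)
  n_test = int(len(data) * test_frac)
  train = data[:n_train]
  test = data[n_train:n_train + n_test]
  val = data[n_train + n_test:]
  return train, test, val


train, test, val = split_dataset(list(range(10000)))
print(len(train), len(test), len(val))  # 8500 1000 500
```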


#### 2. Organizing data into batches

`collate` function in PyTorch
- merges a list of samples into a single batch
- allows custom preprocessing, which is useful for variable-length data

1. `default_collate`
- stacks tensors along a new first (batch) dimension
- converts NumPy arrays and Python numbers into tensors
- keeps the structure of dicts, lists, and tuples, collating their elements recursively
2. `collate_fn`
- `DataLoader` argument for passing a custom collate function

**Custom Collate function**

1. format data
2. tokenize
3. pad the sequences in a batch to the same length using padding tokens
4. create target token ids
  * inputs shifted by `1`
5. replace all but the first pad token in the targets
  * they do not contain useful info, so they are excluded from the loss computation
  * `ignore_index`: placeholder value (`-100`)



##### Instruction dataset class

In [25]:
import torch
from torch.utils.data import Dataset

class InstructionDataset(Dataset):
  def __init__(self, data, tokenizer):
    self.data = data
    self.encode_texts = []
    for entry in data:
      input_with_instruction = format_input(entry)
      response_text = f"\n\n### Response:\n{entry['output']}"
      full_text = input_with_instruction + response_text
      self.encode_texts.append(
          tokenizer.encode(full_text)
      )

  def __getitem__(self, index):
    return self.encode_texts[index]

  def __len__(self):
    return len(self.data)

##### Custom Batch collate function

In [26]:
def custom_collate_fn(
    batch,
    pad_token_id=50256,
    allowed_max_length=None,
    ignore_index=-100,
    device='cpu'):

  # longest sequence in this batch; every item is padded to this length + 1
  batch_max_length = max(len(item) for item in batch)
  inputs_lst, targets_lst = [], []
  for item in batch:
    new_item = item.copy()
    # append one end-of-text token (50256 in the GPT-2 vocabulary)
    new_item += [pad_token_id]

    padded = new_item + [pad_token_id] * (batch_max_length - len(item))
    inputs = torch.tensor(padded[:-1])   # drop the last token
    targets = torch.tensor(padded[1:])   # inputs shifted left by one

    # keep the first pad token as a target and mask the remaining ones
    mask = targets == pad_token_id
    indices = torch.nonzero(mask).squeeze()  # indices of True values
    if indices.numel() > 1:
      targets[indices[1:]] = ignore_index

    # optionally truncate to a maximum sequence length
    if allowed_max_length is not None:
      inputs = inputs[:allowed_max_length]
      targets = targets[:allowed_max_length]

    inputs_lst.append(inputs)
    targets_lst.append(targets)

  # stack the per-sample tensors into (batch_size, seq_len) tensors
  inputs_tensor = torch.stack(inputs_lst).to(device)
  targets_tensor = torch.stack(targets_lst).to(device)

  return inputs_tensor, targets_tensor


In [28]:
# example
input1 = [0,1,2,3,4]
input2 = [5,6]
input3 = [7,8,9]

batch = [input1, input2, input3]

inputs, targets = custom_collate_fn(batch)  # avoid shadowing the built-in input()
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256,  -100,  -100,  -100],
        [    8,     9, 50256,  -100,  -100]])


Why `-100`?

By default, `ignore_index` in PyTorch's `cross_entropy()` is `-100`.

Therefore, target positions labeled `-100` are skipped when the loss is computed.
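
To see what ignoring a target means for the loss, here is a plain-Python sketch of a mean cross-entropy that skips `ignore_index` positions. It is meant to mirror how `cross_entropy()` behaves with its default mean reduction, but `mean_cross_entropy` is a hypothetical stand-in written for illustration, not PyTorch's implementation.

```python
import math

def mean_cross_entropy(logits, targets, ignore_index=-100):
  # Average the negative log-likelihood over the positions whose
  # target is not ignore_index (hypothetical helper, plain Python).
  losses = []
  for row, target in zip(logits, targets):
    if target == ignore_index:
      continue  # this position contributes nothing to the loss
    log_norm = math.log(sum(math.exp(x) for x in row))
    losses.append(log_norm - row[target])
  return sum(losses) / len(losses)


logits = [[-1.0, 1.0], [-0.5, 1.5], [-0.5, 1.5]]

loss_two = mean_cross_entropy(logits[:2], [0, 1])
loss_masked = mean_cross_entropy(logits, [0, 1, -100])
print(loss_two, loss_masked)  # the two values are equal
```

The third position changes neither the sum nor the count used for the mean, which is why masked padding tokens have no effect on training.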

Masking out the instruction text: the instruction tokens can also be excluded from the loss so that the model focuses on generating accurate responses rather than memorizing instructions.

This is done later.
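
The notes defer this step, but the idea can be sketched in plain Python under one assumption: the number of instruction tokens in each sample is known. The helper name `mask_instruction` and its parameters are hypothetical.

```python
def mask_instruction(targets, instruction_length, ignore_index=-100):
  # targets are the inputs shifted left by one, so the instruction
  # occupies the first instruction_length - 1 target positions.
  masked = list(targets)
  n = max(instruction_length - 1, 0)
  masked[:n] = [ignore_index] * len(masked[:n])
  return masked


tokens = [11, 12, 13, 21, 22]    # 3 instruction tokens, 2 response tokens
targets = tokens[1:] + [50256]   # shifted by one, end-of-text appended
print(mask_instruction(targets, instruction_length=3))
# [-100, -100, 21, 22, 50256]
```

With this mask, only the response tokens and the end-of-text token contribute to the loss.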

##### Creating dataloaders

In [32]:
# device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


In [33]:
# pre-fill some arguments of custom_collate_fn

from functools import partial

custom_collate_fn = partial(
    custom_collate_fn,
    allowed_max_length=1024,
    device=device
)
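
As a small illustration of what `partial` does here, the sketch below uses a hypothetical stand-in function `collate` that only reports the arguments it receives. `DataLoader` calls its `collate_fn` with a single argument (the list of samples), so every other argument has to be pre-filled.

```python
from functools import partial

def collate(batch, pad_token_id=50256, allowed_max_length=None, device='cpu'):
  # Hypothetical stand-in that just reports the arguments it received
  return len(batch), pad_token_id, allowed_max_length, device

# pre-fill two keyword arguments; the result still accepts `batch`
collate_1024 = partial(collate, allowed_max_length=1024, device='cpu')

print(collate_1024([[1, 2], [3]]))  # (2, 50256, 1024, 'cpu')
```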

In [38]:
!pip install tiktoken



In [40]:
# tokenizer
import tiktoken
tokenizer = tiktoken.get_encoding('gpt2')

In [41]:
# dataloader
from torch.utils.data import DataLoader

num_workers = 0
batch_size = 8

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=custom_collate_fn,
    num_workers=num_workers,
    drop_last=True,
)

val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    shuffle=False,
    collate_fn=custom_collate_fn,
    num_workers=num_workers,
    drop_last=False,
)

test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    shuffle=False,
    collate_fn=custom_collate_fn,
    num_workers=num_workers,
    drop_last=False,
)

In [51]:
for inputs, targets in train_loader:
  print(f'Inputs: {inputs.shape}, Targets: {targets.shape}')
  break

Inputs: torch.Size([8, 147]), Targets: torch.Size([8, 147])


#### Loading Pretrained LLM

In [52]:
# getting scripts from github
!git clone https://github.com/aashu-0/llm-from-scratch.git
%cd llm-from-scratch/llm_book_notes

Cloning into 'llm-from-scratch'...
remote: Enumerating objects: 120, done.
remote: Counting objects: 100% (120/120), done.
remote: Compressing objects: 100% (86/86), done.
remote: Total 120 (delta 63), reused 78 (delta 30), pack-reused 0 (from 0)
Receiving objects: 100% (120/120), 164.39 KiB | 2.11 MiB/s, done.
Resolving deltas: 100% (63/63), done.
/content/llm-from-scratch/llm_book_notes


In [53]:
import sys
sys.path.append('/content/llm-from-scratch/llm_book_notes')

In [54]:
# get gpt_download.py from @rasbt github
import urllib.request
url = (
    "https://raw.githubusercontent.com/rasbt/"
    "LLMs-from-scratch/main/ch05/"
    "01_main-chapter-code/gpt_download.py"
)
urllib.request.urlretrieve(url, "gpt_download.py")

('gpt_download.py', <http.client.HTTPMessage at 0x78de98aafe90>)

In [None]:
from load_weights import load_weights_into_gpt
from GPT import GPTModel
from gpt_download import download_and_load_gpt2

