<a href="https://colab.research.google.com/github/aashu-0/llm-from-scratch/blob/main/llm_book_notes/07instruct-fine-tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Instruction fine-tuning**
- also called supervised instruction fine-tuning
- fine-tuning a pretrained LLM on instruction-response pairs so that it learns to follow instructions


#### 1. Preparing the dataset

##### Download the Stanford Alpaca dataset from GitHub

In [18]:
import json
import urllib.request

url = "https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/refs/heads/main/alpaca_data.json"

file_path = 'alpaca_data.json'
urllib.request.urlretrieve(url, file_path)

# load
with open(file_path, 'r', encoding='utf-8') as f:
    dataset = json.load(f)

print(f'Number of entries: {len(dataset)}')

Number of entries: 52002


In [19]:
# a subset of the dataset -> 5k samples

import random
subset_size = 5000
random.seed(42)
subset_data = random.sample(dataset, subset_size) # a list of dictionaries

# save
subset_file_path = 'alpaca_subset.json'

# write the list of dictionaries to a JSON file
with open(subset_file_path, 'w', encoding='utf-8') as f:
    json.dump(subset_data, f, indent=4)

print(f'Number of entries: {len(subset_data)}')

Number of entries: 5000


In [20]:
# subset_data[:18]

In [21]:
# load the subset_data

with open(subset_file_path, 'r', encoding='utf-8') as f:
    subset_dataset = json.load(f)

print(f'Number of entries: {len(subset_dataset)}')

Number of entries: 5000


In [22]:
print(f'Example:\n {subset_dataset[4000]}') # no input section

Example:
 {'instruction': 'Edit this sentence to use proper English: I dont think so', 'input': '', 'output': "I don't think so."}


In [23]:
print(f'Example:\n {subset_dataset[1000]}')

Example:
 {'instruction': "Explain why the Earth's temperature fluctuates.", 'input': '', 'output': "The Earth's temperature fluctuates due to changes in the amount of energy from the Sun that is received, released or reflected by the atmosphere and surface of the Earth. Changes in clouds, aerosols, oceans, and land also affect temperature. Additionally, increases in certain gasses, such as carbon dioxide, traps heat within the atmosphere and causes further warming."}


So each entry in the dataset has the following structure:


```
{'instruction': ...,
 'input': ...,   # may be empty
 'output': ...}
```
There are various ways to format these entries into a prompt, for example:
- `Alpaca prompt style`
- `Phi-3 prompt style`
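
For comparison with the Alpaca template used in this notebook's `format_input`, here is a minimal sketch of a Phi-3-like formatter. The helper name `format_input_phi3` is hypothetical, and the `<|user|>` / `<|assistant|>` markers are a simplified assumption about that style rather than the exact official chat template.

```python
def format_input_phi3(entry):
    """Hypothetical helper: format one Alpaca-style entry with Phi-3-like role markers."""
    # This style has no separate input section, so the optional input
    # is folded into the user turn.
    user_text = entry["instruction"]
    if entry["input"]:
        user_text += "\n" + entry["input"]
    return f"<|user|>\n{user_text}"


example = {
    "instruction": "Edit this sentence to use proper English: I dont think so",
    "input": "",
    "output": "I don't think so.",
}

prompt = format_input_phi3(example)
# The desired response is appended after the assistant marker.
full_text = prompt + f"\n\n<|assistant|>\n{example['output']}"
print(full_text)
```

The prompt is shorter than the Alpaca version because there is no fixed preamble, which means fewer tokens per training example.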


##### Implementing the formatting function

In [24]:
def format_input(entry):
  instruction_txt = (
      f"Below is an instruction that describes a task. "
      f"Write a response that appropriately completes the request.\n\n"
      f"### Instruction:\n{entry['instruction']}"
  )
  input_text = (
      f"\n\n### Input:\n{entry['input']}" if entry['input'] else ''
  )
  return instruction_txt + input_text


In [25]:
model_input = format_input(subset_dataset[1000])
desired_response = f"\n\n### Response:\n{subset_dataset[1000]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Explain why the Earth's temperature fluctuates.

### Response:
The Earth's temperature fluctuates due to changes in the amount of energy from the Sun that is received, released or reflected by the atmosphere and surface of the Earth. Changes in clouds, aerosols, oceans, and land also affect temperature. Additionally, increases in certain gasses, such as carbon dioxide, traps heat within the atmosphere and causes further warming.


##### Train Test Split

In [26]:
train_set = int(len(subset_dataset)*0.85)
test_set = int(len(subset_dataset)*0.1)
val_set = len(subset_dataset) - train_set - test_set

train_data = subset_dataset[:train_set]
test_data = subset_dataset[train_set:train_set+test_set]
val_data = subset_dataset[train_set+test_set:]

print(f'Train set size: {len(train_data)}')
print(f'Test set size: {len(test_data)}')
print(f'Validation set size: {len(val_data)}')

Train set size: 4250
Test set size: 500
Validation set size: 250


#### 2. Organizing data into batches

`collate` function in PyTorch
- merges a list of individual samples into a single batch
- allows custom preprocessing, e.g. when dealing with variable-length data

1. `default_collate`
- stacks tensors along a new first (batch) dimension
- converts NumPy arrays and Python numbers into tensors
- keeps containers such as dicts and lists, collating the values inside them
2. `collate_fn`
- the `DataLoader` argument used to pass a custom collate function

**Custom Collate function**

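1. format the entries with the prompt template
2. tokenize the formatted text
3. pad the sequences in a batch to the same length with padding tokens
4. create the target token ids: the inputs shifted by `1` position
5. replace all padding tokens in the targets except the first with a placeholder
  * why? Padding tokens carry no useful information, so they are excluded from the loss computation; the first one is kept as the end-of-text signal
  * `ignore_index`: placeholder value (`-100`)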
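
Steps 3 to 5 can be sketched on a single sequence with plain Python lists. This is only an illustration under the assumption that formatting and tokenization have already happened; `pad_shift_mask` is a hypothetical name, and the tensor-based version in the notebook is the one actually used.

```python
def pad_shift_mask(token_ids, batch_max_length, pad_token_id=50256, ignore_index=-100):
    """Hypothetical helper: build (inputs, targets) for one sequence using plain lists."""
    # Pad to batch_max_length + 1 so that inputs and targets
    # both end up batch_max_length tokens long after the shift.
    padded = token_ids + [pad_token_id] * (batch_max_length + 1 - len(token_ids))
    inputs = padded[:-1]   # everything except the last token
    targets = padded[1:]   # the same tokens shifted left by one position

    # Keep the first padding token in the targets and replace
    # every later one with ignore_index.
    seen_pad = False
    for i, tok in enumerate(targets):
        if tok == pad_token_id:
            if seen_pad:
                targets[i] = ignore_index
            seen_pad = True
    return inputs, targets


inputs, targets = pad_shift_mask([5, 6], batch_max_length=5)
print(inputs)   # [5, 6, 50256, 50256, 50256]
print(targets)  # [6, 50256, -100, -100, -100]
```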



##### Instruction dataset class

In [27]:
import torch
from torch.utils.data import Dataset

class InstructionDataset(Dataset):
  def __init__(self, data, tokenizer):
    self.data = data
    self.encode_texts = []
    for entry in data:
      input_with_instruction = format_input(entry)
      response_text = f"\n\n### Response:\n{entry['output']}"
      full_text = input_with_instruction + response_text
      self.encode_texts.append(
          tokenizer.encode(full_text)
      )

  def __getitem__(self, index):
    return self.encode_texts[index]

  def __len__(self):
    return len(self.data)

##### Custom Batch collate function

In [28]:
def custom_collate_fn(
    batch,
    pad_token_id= 50256,
    allowed_max_length=None,
    ignore_index=-100,
    device = 'cpu'):

  batch_max_length = max(len(item) for item in batch)
  inputs_lst, targets_lst = [], []
  for item in batch:
    new_item = item.copy()
    new_item += [pad_token_id]

    padded = (new_item + [pad_token_id]* (batch_max_length - len(item)))
    inputs= torch.tensor(padded[:-1])
    targets= torch.tensor(padded[1:])

    mask = targets == pad_token_id
    indices = torch.nonzero(mask).squeeze() # returns the indices of True values
    if indices.numel() >1:
      targets[indices[1:]] = ignore_index

    if allowed_max_length is not None:
      inputs = inputs[:allowed_max_length]
      targets = targets[:allowed_max_length]

    inputs_lst.append(inputs)
    targets_lst.append(targets)

  inputs_tensor = torch.stack(inputs_lst).to(device)
  targets_tensor = torch.stack(targets_lst).to(device)

  return inputs_tensor, targets_tensor


In [29]:
# example
input1 = [0,1,2,3,4]
input2 = [5,6]
input3 = [7,8,9]

batch = [input1, input2, input3]

# named inputs/targets to avoid shadowing the built-in input()
inputs, targets = custom_collate_fn(batch)
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256,  -100,  -100,  -100],
        [    8,     9, 50256,  -100,  -100]])


Why `-100`?

By default, `ignore_index` in `cross_entropy()` is `-100`.

Targets labeled `-100` are therefore skipped when the loss is computed.
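
To make the effect concrete, here is a small pure-Python stand-in for a mean cross-entropy that skips ignored targets. It is a sketch for illustration, not PyTorch's implementation; `cross_entropy_mean` is a hypothetical name, and the logits are made-up toy values.

```python
import math


def cross_entropy_mean(logits, targets, ignore_index=-100):
    """Hypothetical reference: mean cross-entropy over non-ignored targets.

    Assumes at least one target is not ignored.
    """
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # ignored positions contribute nothing to the loss
        # log of the softmax normalizer for this row
        log_norm = math.log(sum(math.exp(x) for x in row))
        losses.append(log_norm - row[target])
    return sum(losses) / len(losses)


logits = [[-1.0, 1.0], [-0.5, 1.5], [-0.5, 1.5]]

loss_two_rows = cross_entropy_mean(logits[:2], [0, 1])
loss_masked = cross_entropy_mean(logits, [0, 1, -100])
print(loss_two_rows == loss_masked)  # True: the masked third row changes nothing
```

The mean is taken over the non-ignored positions only, so heavily padded sequences do not dilute the loss.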

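Masking out the instruction text: the instruction tokens in the targets can also be set to `-100`, so that the loss is computed only on the response and the model focuses on generating accurate responses rather than memorizing instructions.

This is left for later.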
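
One possible way to do this masking, sketched with plain lists: `mask_instruction_tokens` is a hypothetical helper, and `instruction_length` is assumed to be the number of tokens in the formatted prompt (everything before the response).

```python
def mask_instruction_tokens(targets, instruction_length, ignore_index=-100):
    """Hypothetical helper: hide the instruction part of a target sequence from the loss."""
    masked = list(targets)
    # Targets are the inputs shifted left by one, so only the first
    # instruction_length - 1 target positions still belong to the prompt.
    n = min(max(instruction_length - 1, 0), len(masked))
    masked[:n] = [ignore_index] * n
    return masked


# Toy example: tokens 10, 11, 12 are the instruction; 20, 21 are the response.
# inputs  = [10, 11, 12, 20, 21]
# targets = [11, 12, 20, 21, 50256]
masked_targets = mask_instruction_tokens([11, 12, 20, 21, 50256], instruction_length=3)
print(masked_targets)  # [-100, -100, 20, 21, 50256]
```

The first response token (`20`) stays in the targets because it is predicted from the last instruction token.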

##### Creating dataloaders

In [30]:
# device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [31]:
# fix or pre-fill some argument in custom_collate_function
from functools import partial
custom_collate_fn = partial(
    custom_collate_fn,
    allowed_max_length=512,
    device=device
)
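
As a small aside on what `partial` does, here is a toy sketch. The `scale` function and its parameters are hypothetical and only stand in for a function with extra keyword options.

```python
from functools import partial


def scale(values, factor=1, offset=0):
    """Toy function standing in for a collate function with extra options."""
    return [v * factor + offset for v in values]


# Pre-fill two keyword arguments; the result is a new callable
# that only needs `values`.
scale_by_ten = partial(scale, factor=10, offset=1)

print(scale_by_ten([1, 2, 3]))            # [11, 21, 31]
# Pre-filled keywords can still be overridden at call time.
print(scale_by_ten([1, 2, 3], offset=0))  # [10, 20, 30]
```

`DataLoader` calls its `collate_fn` with the batch as the only argument, which is why the remaining options have to be pre-filled this way.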

In [32]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.9.0


In [33]:
# tokenizer
import tiktoken
tokenizer = tiktoken.get_encoding('gpt2')

In [34]:
# dataloader
from torch.utils.data import DataLoader

num_workers = 0
batch_size = 8

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=custom_collate_fn,
    num_workers=num_workers,
    drop_last=True,
)

val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    shuffle=False,
    collate_fn=custom_collate_fn,
    num_workers=num_workers,
    drop_last=False,
)

test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    shuffle=False,
    collate_fn=custom_collate_fn,
    num_workers=num_workers,
    drop_last=False,
)

In [35]:
for inputs, targets in train_loader:
  print(f'Inputs: {inputs.shape}, Targets: {targets.shape}')
  break

Inputs: torch.Size([8, 174]), Targets: torch.Size([8, 174])


#### 3. Loading a pretrained LLM

In [36]:
# getting scripts from github
!git clone https://github.com/aashu-0/llm-from-scratch.git
%cd llm-from-scratch/llm_book_notes

Cloning into 'llm-from-scratch'...
remote: Enumerating objects: 128, done.
remote: Counting objects: 100% (128/128), done.
remote: Compressing objects: 100% (92/92), done.
remote: Total 128 (delta 69), reused 79 (delta 32), pack-reused 0 (from 0)
Receiving objects: 100% (128/128), 171.45 KiB | 13.19 MiB/s, done.
Resolving deltas: 100% (69/69), done.
/content/llm-from-scratch/llm_book_notes


In [37]:
import sys
sys.path.append('/content/llm-from-scratch/llm_book_notes')

In [38]:
# get gpt_download.py from @rasbt github
import urllib.request
url = (
    "https://raw.githubusercontent.com/rasbt/"
    "LLMs-from-scratch/main/ch05/"
    "01_main-chapter-code/gpt_download.py"
)
urllib.request.urlretrieve(url, "gpt_download.py")

('gpt_download.py', <http.client.HTTPMessage at 0x7a7100167150>)

In [39]:
from load_weights import load_weights_into_gpt
from GPT import GPTModel
from gpt_download import download_and_load_gpt2

BASE_CONFIG = {
    'vocab_size': 50257,
    'context_length': 1024,
    'drop_rate': 0.2,
    'qkv_bias': True
}

model_configs = {
    'gpt2 (124M)': {'emb_dim': 768 , 'n_layers': 12, 'n_heads': 12},
    'gpt2-medium (355M)': {'emb_dim':1024 , 'n_layers':24, 'n_heads':16},
    'gpt2-large (774M)': {'emb_dim': 1280 , 'n_layers': 36, 'n_heads':20},
    'gpt2-xl (1558M)': {'emb_dim': 1600, 'n_layers':48, 'n_heads': 25}
}

CHOOSE_MODEL = 'gpt2-medium (355M)'
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

model_size = CHOOSE_MODEL.split(' ')[-1].lstrip('(').rstrip(')')

settings, params = download_and_load_gpt2(
    model_size = model_size,
    models_dir = 'gpt2'
)

model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval();

checkpoint: 100%|██████████| 77.0/77.0 [00:00<00:00, 117kiB/s]
encoder.json: 100%|██████████| 1.04M/1.04M [00:01<00:00, 577kiB/s]
hparams.json: 100%|██████████| 91.0/91.0 [00:00<00:00, 132kiB/s]
model.ckpt.data-00000-of-00001: 100%|██████████| 1.42G/1.42G [08:28<00:00, 2.79MiB/s]
model.ckpt.index: 100%|██████████| 10.4k/10.4k [00:00<00:00, 17.0MiB/s]
model.ckpt.meta: 100%|██████████| 927k/927k [00:01<00:00, 571kiB/s]
vocab.bpe: 100%|██████████| 456k/456k [00:01<00:00, 350kiB/s]


In [40]:
# assess our raw model(no fine tuning)
torch.manual_seed(123)
input_text = format_input(val_data[0])
print(input_text)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Generate a sentence that represents the content in the paragraph.

### Input:
A new law was introduced in 2020 outlining five safety measures all workplaces must follow to prevent the spread of Covid-19. This includes regularly sanitizing the premises, implementing social distancing measures, and introducing a screening and temperature checking procedure.


In [41]:
# model's response
from utilities import text_to_token_ids, token_ids_to_text
from GPT import generate

token_ids = generate(
    model = model,
    idx = text_to_token_ids(input_text, tokenizer),
    max_new_tokens =35,
    context_size= BASE_CONFIG['context_length'],
    eos_id= 50256
)
generated_text =token_ids_to_text(token_ids, tokenizer)
generated_text

'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGenerate a sentence that represents the content in the paragraph.\n\n### Input:\nA new law was introduced in 2020 outlining five safety measures all workplaces must follow to prevent the spread of Covid-19. This includes regularly sanitizing the premises, implementing social distancing measures, and introducing a screening and temperature checking procedure.\n\n### Output:\n\nThe law was passed and the new law is now in effect.\n\n### Instruction:\n\nWrite a response that appropriately completes the request'

The `generate` function returns the input and the generated text combined, so the response is separated out by slicing off the input:

In [42]:
response_text = generated_text[len(input_text):].strip()
print(response_text)

### Output:

The law was passed and the new law is now in effect.

### Instruction:

Write a response that appropriately completes the request


#### 4. Fine-tuning the LLM on the instruction dataset loaded earlier

In [43]:
from utilities import train_model_simple, calc_loss_loader

In [44]:
# initial loss for the train and val dataset
model.to(device)
torch.manual_seed(123)
with torch.no_grad():
  train_loss = calc_loss_loader(train_loader, model, device)
  val_loss = calc_loss_loader(val_loader, model, device)

print(f'Train loss: {train_loss:.4f}, Val loss: {val_loss:.4f}')

Train loss: 3.1736, Val loss: 3.2490


In [46]:
# training code
import time

start_time = time.time()
torch.manual_seed(123)
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=0.00005,
                              weight_decay=0.1)

num_epochs = 2
train_losses, val_losses, tokens_seen = train_model_simple(
    model=model,
    train_dataloader= train_loader,
    val_dataloader= val_loader,
    optimizer= optimizer,
    device=device,
    num_epochs= num_epochs,
    eval_freq= 5,
    eval_iter= 5,
    start_context= format_input(val_data[0]),
    tokenizer= tokenizer
    )

end_time = time.time()

time_in_mins = (end_time - start_time)/60
print(f'Time taken to train: {time_in_mins:.2f} minutes')

OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 14.12 MiB is free. Process 6014 has 14.72 GiB memory in use. Of the allocated memory 14.27 GiB is allocated by PyTorch, and 330.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
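
The error reports 14.72 GiB in use on a GPU with 14.74 GiB. As a rough back-of-the-envelope check, the sketch below estimates the memory needed for weights, gradients, and AdamW state in fp32. The parameter count is an assumption taken from the nominal size in the model name, and the actual count for a given implementation may differ.

```python
# Rough fp32 memory estimate for weights, gradients, and AdamW state.
# Assumption: ~355 million parameters (nominal size from the model name).
num_params = 355_000_000
bytes_per_value = 4        # fp32
copies = 1 + 1 + 2         # weights + gradients + AdamW's two moment buffers

total_bytes = num_params * bytes_per_value * copies
total_gib = total_bytes / 1024**3
print(f"{total_gib:.2f} GiB before activations")
```

Under these assumptions this comes to a bit over 5 GiB. Activations take additional memory that grows with batch size and sequence length (here up to 8 x 512 tokens per batch), so lowering `batch_size` or `allowed_max_length`, or choosing the smaller `gpt2 (124M)` configuration, are the usual first things to try.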