# Goal: Fine-tune an LLM on an instruction dataset

This notebook needs to be completed. There are placeholders for each of the following tasks, which need to be coded up. The finished notebook should be runnable on a free Google Colab instance in a few minutes.

## Concrete tasks:
1. Load the instruction fine-tuning dataset
2. Load the model and tokenizer
3. Prompt the model with a few items from the dataset and print the generated responses using the provided `generate()` function
4. Implement a trainer class that takes the model and dataset as inputs and
  - Instantiates the necessary training components, such as the optimizer and learning rate scheduler
  - Specifically, implement the `train()` function that performs the classic train loop with a next-token prediction objective
5. Modify the `generate()` function to implement the generation logic directly using `model.forward()`. At each generation step, the generated tokens are fed back as inputs until the stopping condition is met (EOS is generated or max_tokens is reached). Most importantly, make sure that the generations are batched.
6. **Plot the effect of training data on the validation loss**: The idea is to vary the amount of data used for training (e.g. 100, 200, 500, 1000 data points) and understand its effect on the validation loss. Please provide an explanation along with the plot.
7. **Applying Chat template**: Suppose you want to switch to a different model, so the prompt template needs to change. How would you incorporate this change without having to manually apply the template every time you change the model?

Bonus points:
- You are free to use any model. But if you use a larger model (e.g. Llama 2 7B) and make it trainable on Google Colab with a T4 instance in a couple of minutes, it is a bonus point.
Hint: you should use techniques such as **LoRA/QLoRA** to reduce the number of trainable parameters, and **quantization** to reduce the memory requirements.
- Optimize the `generate()` further to use attention key-value caching. The idea is that we do not want to recompute attention values for our prompt at every decoding step.

# Install Dependencies
If you add any new dependencies, make sure to update the following cell accordingly.

In [1]:
!pip install -q accelerate bitsandbytes datasets peft transformers

# Imports
All imports should be added below.

In [2]:
from datasets import load_dataset, DatasetDict, Dataset
from torch.utils.data import DataLoader, random_split
from transformers import default_data_collator
from dataclasses import dataclass

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline, get_scheduler
import torch
from huggingface_hub import login


from torch.optim import AdamW
from accelerate import Accelerator, notebook_launcher
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import bitsandbytes as bnb

import time
import datetime
import re
from tqdm import tqdm

In [3]:
login()


## 1. Load the instruction fine-tuning dataset


In [4]:
from datasets import load_dataset

dataset_name = "yizhongw/self_instruct"

dataset = load_dataset(dataset_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Let's create a dataset wrapper that will do the following:

1. **Automatically Clean the Data**: We'll strip the text down according to a pattern specific to our dataset.

2. **Automatically Apply the Appropriate Prompt Template**: We'll create a prompt template for Llama 2 and have the DatasetWrapper class apply it.

3. **Create our DataLoader Objects**

In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 82612
    })
})

In [6]:
print(dataset['train'][81973])
print(dataset['train'][0])

{'prompt': 'Write a short story about a person who has been affected by the current political situation in your country. Output:', 'completion': ' Story: I am a student at the University of [Your Country]. I have been studying here for three years now. I was born in [Your City], and moved to [Your City] when I was 10 years old. My parents are both doctors, and they work hard every day to support me and my sister. They always tell me that education is the most important thing in life, and I agree with them.\nI’ve always wanted to be a doctor like my parents, but it’s not easy. The tuition fees at this university are very high, and I need to pay for my living expenses as well. I don’t know how I will afford all of this after graduation. I hope I can find a good job so I can help my family.'}
{'prompt': 'Make a list of 10 ways to help students improve their study skills.\n\nOutput:', 'completion': " 1. Make a schedule for studying and stick to it.\n2. Study in the same place every time.\n

Rather than create a DatasetWrapper specific to this dataset, we can strive to create one that is dataset agnostic by storing the patterns for cleaning within the dataset.

The current implementation does assume each example has a "prompt" and a "completion" but we could change the class further to handle other structures and truly be dataset agnostic.

For example, if a validation or test set is already present, we could automatically resize the training and test sets to the desired amounts.

In [None]:
#dataset_patterns.py

dataset_pattern = {
    "yizhongw/self_instruct": {
        "prompt_pattern": r'[\s]*Output:$',
        "completion_pattern": ""
    }
}

def get_prompt_pattern(dataset_name: str):
  return dataset_pattern[dataset_name]['prompt_pattern']

def get_completion_pattern(dataset_name: str):
  return dataset_pattern[dataset_name]['completion_pattern']
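
To make the cleaning step concrete, here is a small standalone check of the prompt pattern stored above. It uses two prompts in the format shown earlier; the variable names are only for illustration.

```python
import re

# Same pattern as in `dataset_pattern`: optional whitespace followed by a trailing "Output:".
prompt_pattern = re.compile(r'[\s]*Output:$')

with_marker = "Make a list of 10 ways to help students improve their study skills.\n\nOutput:"
without_marker = "The United States has withdrawn from the Paris Climate Agreement.\n\n"

cleaned = prompt_pattern.sub('', with_marker)
untouched = prompt_pattern.sub('', without_marker)

print(repr(cleaned))    # 'Make a list of 10 ways to help students improve their study skills.'
print(repr(untouched))  # unchanged, because there is no trailing "Output:" to strip
```

An empty pattern string, as used for the completion, means "no cleaning" in the wrapper below.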

In [8]:
#templates.py

class BaseModelTemplate:
    def apply_template(self, example):
        raise NotImplementedError("This method should be implemented by subclasses.")

class LlamaTemplate(BaseModelTemplate):
    def apply_template(self, example):
        example['prompt'] = f"[INST] user_message: {example['prompt']} [INST]:"
        return example

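class MistralTemplate(BaseModelTemplate):
    def apply_template(self, example):
        # Mistral-Instruct wraps the user turn in [INST] ... [/INST]; the tokenizer adds the BOS token
        example['prompt'] = f"[INST] {example['prompt']} [/INST]"
        return example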


template_map = {
      "meta-llama/Llama-2-7b-chat-hf": LlamaTemplate,
      "mistralai/Mistral-7B-Instruct-v0.2": MistralTemplate
  }


def get_template(model_name):
    return template_map[model_name]()



In [9]:
#datasetWrapper.py

#from dataset_patterns import get_prompt_pattern, get_completion_pattern

class DatasetWrapper():

  def __init__(self, dataset, dataset_name, model_name, batch_size, test_size):
    self.dataset = dataset
    self.prompt_pattern = get_prompt_pattern(dataset_name)
    self.completion_pattern = get_completion_pattern(dataset_name)
    self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    self.tokenizer.pad_token = self.tokenizer.eos_token
    self.template=get_template(model_name)
    self.test_size = test_size
    self.batch_size = batch_size

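  def tokenize_fn(self, example):
    # Causal LM: prompt and completion form one sequence; the model shifts the labels internally
    texts = [p + c + self.tokenizer.eos_token for p, c in zip(example["prompt"], example["completion"])]
    tokenized = self.tokenizer(texts, truncation=True, padding="max_length", max_length=512)

    # Padding positions are set to -100 so that they are ignored by the loss
    labels = [[tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
              for ids, attn in zip(tokenized["input_ids"], tokenized["attention_mask"])]

    return {"input_ids": tokenized["input_ids"], "attention_mask": tokenized["attention_mask"], "labels": labels}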

  def process_example(self, example):

    if self.prompt_pattern != "":
      prompt_pattern = re.compile(self.prompt_pattern)
      example['prompt'] = prompt_pattern.sub('', example['prompt'])

    if self.completion_pattern != "":
      completion_pattern = re.compile(self.completion_pattern)
      example['completion'] = completion_pattern.sub('', example['completion'])

    example = self.template.apply_template(example)

    return example


  def prepare_data(self):

    self.dataset = self.dataset.map(self.process_example)
    self.tokenized_dataset = self.dataset.map(self.tokenize_fn, batched=True, remove_columns=['prompt', 'completion'])

    # Use an existing validation split if there is one; otherwise carve one out of the training data
    if isinstance(self.tokenized_dataset, DatasetDict):
      if 'validation' in self.tokenized_dataset:
        train_split, val_split = self.tokenized_dataset['train'], self.tokenized_dataset['validation']
      else:
        split = self.tokenized_dataset['train'].train_test_split(test_size=self.test_size)
        train_split, val_split = split['train'], split['test']
    else:
      split = self.tokenized_dataset.train_test_split(test_size=self.test_size)
      train_split, val_split = split['train'], split['test']  # 'test' here is the validation part of the split

    self.tokenized_dataset = DatasetDict({'train': train_split, 'validation': val_split})


    collate_fn = default_data_collator
    train_loader = DataLoader(self.tokenized_dataset['train'], shuffle=True, batch_size=self.batch_size, collate_fn=collate_fn)
    val_loader = DataLoader(self.tokenized_dataset['validation'], batch_size=self.batch_size, collate_fn=collate_fn)

    return train_loader, val_loader




In [10]:
dw = DatasetWrapper(dataset['train'].select(range(10)), dataset_name, model_name="meta-llama/Llama-2-7b-chat-hf", test_size=0.1, batch_size=32)

In [11]:
dw.process_example(dataset['train'][0])

{'prompt': '[INST] user_message: Make a list of 10 ways to help students improve their study skills. [INST]:',
 'completion': " 1. Make a schedule for studying and stick to it.\n2. Study in the same place every time.\n3. Set goals for yourself.\n4. Take breaks when you need them.\n5. Don't cram before an exam.\n6. Get enough sleep.\n7. Eat healthy food.\n8. Exercise regularly.\n9. Find a study partner.\n10. Reward yourself after completing a task."}

In [12]:
train_loader, val_loader = dw.prepare_data()

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

In [13]:
for batch in train_loader:
    break
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([9, 512]),
 'attention_mask': torch.Size([9, 512]),
 'labels': torch.Size([9, 512])}
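
For a next-token objective, `input_ids` and `labels` normally cover the same prompt-plus-completion sequence, and positions that should not contribute to the loss are set to -100. The sketch below shows one common variant on plain token-id lists, where the prompt and the padding are both masked so that only completion tokens are learned. The helper name `build_example` and the toy ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the Hugging Face causal-LM loss skips

def build_example(prompt_ids, completion_ids, max_length, pad_id):
    """Concatenate prompt and completion, pad to max_length, and mask non-completion labels."""
    input_ids = (prompt_ids + completion_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + completion_ids)[:max_length]
    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "attention_mask": attention_mask + [0] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,
    }

example = build_example([5, 6, 7], [8, 9], max_length=8, pad_id=0)
print(example["labels"])  # [-100, -100, -100, 8, 9, -100, -100, -100]
```

Whether to mask the prompt is a design choice: masking it focuses the loss on the response, while leaving it unmasked also trains the model to reproduce the instructions.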

## 2. Load model and tokenizer

In [14]:
################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

In [15]:
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Load the entire model on the GPU 0
device_map = {"": 0}

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

In [16]:
# The model that you want to train from the Hugging Face hub
model_name = "meta-llama/Llama-2-7b-chat-hf"

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
#tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [17]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

print_trainable_parameters(model)

trainable params: 262410240 || all params: 3500412928 || trainable%: 7.496550989769399
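
The counts above look odd for a 7B model, and some arithmetic helps to read them. The sketch below assumes the standard Llama-2-7B shapes and that bitsandbytes packs two 4-bit weights into each stored element, so `numel()` on a quantized layer reports half the real count. Under those assumptions it reproduces both printed numbers.

```python
hidden_size, intermediate_size, num_layers, vocab_size = 4096, 11008, 32, 32000

# Per decoder layer: four attention projections plus three MLP projections (no biases)
linear_per_layer = 4 * hidden_size * hidden_size + 3 * hidden_size * intermediate_size
linear_total = num_layers * linear_per_layer

# Not quantized: input embeddings, lm_head, two RMSNorm weights per layer, and the final norm
unquantized = 2 * vocab_size * hidden_size + (2 * num_layers + 1) * hidden_size

full_precision_count = linear_total + unquantized
reported_count = linear_total // 2 + unquantized  # two 4-bit weights per stored element

print(full_precision_count)  # 6738415616
print(reported_count)        # 3500412928
print(unquantized)           # 262410240
```

Read this way, the "trainable" parameters at this point are just the unquantized embeddings, output head and norms. The quantized linear layers are frozen, which is why LoRA adapters are needed to train them.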


## 3. Prompt the model with a few items from the dataset

In [18]:
prompts = [item for item in dataset["train"]["prompt"][:2]]
print(prompts)

['Make a list of 10 ways to help students improve their study skills.\n\nOutput:', 'Task: Find out what are the key topics in the document? output "topic 1", "topic 2", ... , "topic n".\n\nThe United States has withdrawn from the Paris Climate Agreement.\n\n']


In [19]:
def generate(prompts):
  pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200, return_full_text=False)
  result = pipe(prompts)
  generated_texts = [item[0]["generated_text"] for item in result]
  return generated_texts

In [20]:
gen_texts = generate(prompts)

In [21]:
for prompt, text in zip(prompts, gen_texts):
  print("#############")
  print(f"PROMPT: {prompt}")
  print(f"RESPONSE: {text}")

#############
PROMPT: Make a list of 10 ways to help students improve their study skills.

Output:
RESPONSE: 
A list of 10 ways to help students improve their study skills, including:

1. Setting specific, measurable, achievable, relevant, and time-bound (SMART) goals
2. Creating a study schedule and sticking to it
3. Breaking down complex topics into smaller, manageable chunks
4. Using flashcards to review and retain information
5. Practicing active recall and self-testing
6. Focusing on the most important material and prioritizing tasks
7. Using mnemonic devices to aid memory retention
8. Creating a conducive study environment
9. Seeking help and support from peers, teachers, or tutors
10. Reflecting on and evaluating one's own study habits and techniques.
#############
PROMPT: Task: Find out what are the key topics in the document? output "topic 1", "topic 2", ... , "topic n".

The United States has withdrawn from the Paris Climate Agreement.


RESPONSE: The document provides an ove

## 4. Implement a trainer class
- The class must take the model and dataset and instantiate the necessary training components, such as the optimizer and learning rate scheduler.
- Specifically, implement the `train()` function that performs the classic train loop with a next-token prediction objective

```
trainer = Trainer(model, dataset, train_args, ...)
trainer.train()
```

Bonus Point: Use techniques such as LoRA/QLoRA to reduce the number of trainable parameters, and use quantization to reduce the memory requirements.
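
The next-token objective can be written out explicitly. When `labels` are passed, Hugging Face causal-LM models compute this kind of shifted cross-entropy internally and return it as `outputs.loss`. The sketch below spells out the same idea; `next_token_loss` is a hypothetical helper name, not part of any library.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the prediction at position t and the token at position t + 1."""
    shift_logits = logits[:, :-1, :].contiguous()  # predictions for positions 0 .. T-2
    shift_labels = labels[:, 1:].contiguous()      # targets are the following tokens
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # masked positions (padding, optionally the prompt) are skipped
    )
```

Because the shift happens inside the loss, `labels` should be aligned with `input_ids` rather than shifted by hand.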

In [22]:
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
lora_model = get_peft_model(model, lora_config)
print_trainable_parameters(lora_model)

trainable params: 39976960 || all params: 3540389888 || trainable%: 1.1291682911958425
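
The trainable-parameter count can be checked by hand. For a linear layer mapping `d_in` to `d_out`, LoRA adds two small matrices with `r * (d_in + d_out)` parameters in total. The sketch below assumes the standard Llama-2-7B shapes (hidden size 4096, MLP size 11008, 32 layers) and the seven target modules from `lora_config`.

```python
hidden_size = 4096
intermediate_size = 11008
num_layers = 32
r = 16  # LoRA rank from `lora_config`

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a d_in -> d_out linear layer."""
    return rank * (d_in + d_out)

attention = 4 * lora_params(hidden_size, hidden_size, r)       # q_proj, k_proj, v_proj, o_proj
mlp = (2 * lora_params(hidden_size, intermediate_size, r)      # gate_proj, up_proj
       + lora_params(intermediate_size, hidden_size, r))       # down_proj
total = num_layers * (attention + mlp)

print(total)  # 39976960
```

This matches the number printed above, and it shows that the count scales linearly with `r` and with the number of targeted modules.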


In [23]:
#trainer.py
@dataclass
class TrainingArgs():
    num_epochs: int = 1
    learning_rate: float = 0.0002
    test_size: float = 0.1  # fraction of the data held out for validation
    batch_size: int = 32
    model_name : str = "meta-llama/Llama-2-7b-chat-hf"
    lora_config: LoraConfig = None
    num_warmup_steps: int = 0
    mixed_precision: str = None
    accumulation_steps: int = 1

A note about the above class: we pass in a `LoraConfig` but not a `BitsAndBytesConfig` because, at least for the workflow of this assignment, the model is already loaded in quantized form before it reaches the trainer.

In [24]:
#trainer.py

class Trainer():
    def __init__(
            self,
            model: torch.nn.Module,
            train_loader: DataLoader,
            val_loader: DataLoader,
            training_args: TrainingArgs
    ):
        self.num_epochs = training_args.num_epochs

        # Prepare the quantized model for training first, then attach the LoRA adapters
        self.model = model
        if training_args.lora_config is not None:
            self.model = prepare_model_for_kbit_training(self.model)
            self.model = get_peft_model(self.model, training_args.lora_config)

        # Optimizer: only pass the parameters that are actually trainable
        trainable_params = [p for p in self.model.parameters() if p.requires_grad]
        self.optimizer = bnb.optim.Adam8bit(trainable_params, lr=training_args.learning_rate)

        # Learning rate scheduler
        self.num_training_steps = len(train_loader) * self.num_epochs
        self.scheduler = get_scheduler(
            "linear",
            optimizer=self.optimizer,
            num_warmup_steps=training_args.num_warmup_steps,
            num_training_steps=self.num_training_steps,
        )

        # Accelerator handles device placement and, optionally, mixed precision
        self.accelerator = Accelerator(mixed_precision=training_args.mixed_precision)
        (
            self.train_loader,
            self.val_loader,
            self.model,
            self.optimizer,
            self.scheduler,
        ) = self.accelerator.prepare(train_loader, val_loader, self.model, self.optimizer, self.scheduler)

    def train(self):
        progress_bar = tqdm(range(self.num_training_steps))

        for epoch in range(self.num_epochs):

            self.model.train()
            training_loss = 0
            for step, batch in enumerate(self.train_loader):
                outputs = self.model(**batch)
                loss = outputs.loss
                training_loss += loss.item()
                self.accelerator.backward(loss)

                self.optimizer.step()
                self.scheduler.step()
                self.optimizer.zero_grad()

                if step % 100 == 0:
                    # Running mean of the per-batch loss seen so far in this epoch
                    print(f'Step {step}/{len(self.train_loader)} Training Loss: {training_loss/(step+1)}')

                progress_bar.update(1)

            print(f'Epoch {epoch} Training Loss: {training_loss/len(self.train_loader)}')

            self.model.eval()
            val_loss = 0
            for step, batch in enumerate(self.val_loader):
                with torch.no_grad():
                    outputs = self.model(**batch)

                val_loss += outputs.loss.item()

                if step % 100 == 0:
                    print(f'Step {step}/{len(self.val_loader)} Validation Loss: {val_loss/(step+1)}')
            print(f'Epoch {epoch} Val Loss: {val_loss/len(self.val_loader)}')

In [25]:
training_args = TrainingArgs(lora_config=lora_config)

In [26]:
dw = DatasetWrapper(dataset, dataset_name, model_name, training_args.batch_size, training_args.test_size)
train_loader, val_loader = dw.prepare_data()

In [27]:
trainer = Trainer(model, train_loader, val_loader, training_args=training_args)

In [28]:
# Pass the function itself rather than calling it; a single process is enough for one T4 GPU
notebook_launcher(trainer.train, num_processes=1)



OutOfMemoryError: CUDA out of memory. Tried to allocate 688.00 MiB. GPU 0 has a total capacty of 14.75 GiB of which 73.06 MiB is free. Process 196130 has 14.67 GiB memory in use. Of the allocated memory 11.67 GiB is allocated by PyTorch, and 2.88 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

## 5. Implement your own generation logic

Modify the `generate()` function to implement the generation logic directly using `model.forward()` instead of using pipeline API. At each generation step, generated tokens are fed as inputs until the stopping condition is met (EOS is generated or max_tokens is reached). Most importantly, make sure that the generations are batched.

Bonus Point:
- Optimize the `generate()` further to use attention key-value caching.
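
One possible shape for this task is sketched below: batched greedy decoding that calls the model directly and reuses the key-value cache. It assumes a Hugging Face style causal LM whose output exposes `.logits` and `.past_key_values`, and left-padded prompts so that the last position of every row is a real token. Position ids are left to the model's defaults, which is a simplification. The function name `greedy_generate` is hypothetical.

```python
import torch

@torch.no_grad()
def greedy_generate(model, input_ids, attention_mask, eos_token_id, pad_token_id, max_new_tokens=50):
    """Batched greedy decoding via model.forward(), feeding only the new token after the first step."""
    batch_size = input_ids.shape[0]
    finished = torch.zeros(batch_size, dtype=torch.bool, device=input_ids.device)
    generated = []
    past_key_values = None
    step_input = input_ids  # the first step processes the full prompt

    for _ in range(max_new_tokens):
        outputs = model(
            input_ids=step_input,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
            use_cache=True,
        )
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1, :].argmax(dim=-1)

        # Rows that already produced EOS only emit padding from now on
        next_token = torch.where(finished, torch.full_like(next_token, pad_token_id), next_token)
        generated.append(next_token)

        finished = finished | (next_token == eos_token_id)
        if finished.all():
            break

        # The cache holds the prefix, so only the newest token is fed in the next step
        step_input = next_token.unsqueeze(-1)
        attention_mask = torch.cat([attention_mask, torch.ones_like(step_input)], dim=-1)

    return torch.stack(generated, dim=1)  # shape: (batch_size, num_generated_tokens)
```

With the objects in this notebook, the inputs could come from `tokenizer(prompts, return_tensors="pt", padding=True)` after setting `tokenizer.padding_side = "left"`, and the result could be decoded with `tokenizer.batch_decode(..., skip_special_tokens=True)`. Sampling, temperature and explicit position ids are left out to keep the sketch short.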

## 6. Plot the effect of training data on the validation loss:
The idea is to vary the amount of data used for training (e.g. 100, 200, 500, 1000 data points) and understand its effect on the validation loss. Please provide an explanation along with the plot.
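
A minimal sketch of how such a sweep could be organised is shown below. Both function names are hypothetical, and `train_and_eval` stands for a callable you would write yourself: it should train a fresh set of adapters on the first `n` training examples and return the final validation loss on a fixed validation set. No results are assumed here.

```python
def sweep_training_sizes(sizes, train_and_eval):
    """Run one training job per dataset size and collect the resulting validation loss."""
    return {n: train_and_eval(n) for n in sizes}

def plot_validation_loss(results):
    """Plot validation loss against the number of training examples."""
    import matplotlib.pyplot as plt  # imported here so the sweep itself needs no plotting library

    sizes = sorted(results)
    plt.plot(sizes, [results[n] for n in sizes], marker="o")
    plt.xscale("log")
    plt.xlabel("Number of training examples")
    plt.ylabel("Validation loss")
    plt.title("Effect of training set size on validation loss")
    plt.show()
```

Keeping the validation set identical across runs, and restarting from the same base model each time, makes the points on the curve comparable.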

## 7. Applying Chat template:
Suppose you want to switch to a different model, so the prompt template needs to change. How would you incorporate this change without having to manually apply the template every time you change the model?
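
One way to approach this, as an alternative to a hand-written template map, is to rely on the chat template that ships with each model's tokenizer. The sketch below uses the `apply_chat_template` method of Hugging Face tokenizers; the helper name `build_prompt` is hypothetical, and it assumes the chosen model's tokenizer actually defines a chat template.

```python
def build_prompt(tokenizer, user_message: str) -> str:
    """Format a single user turn with whatever chat template the tokenizer defines."""
    messages = [{"role": "user", "content": user_message}]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return the formatted string rather than token ids
        add_generation_prompt=True,  # append the marker that cues the assistant's reply, if any
    )
```

With this in place, switching models only means loading a different tokenizer with `AutoTokenizer.from_pretrained(model_name)`; the formatting code stays the same.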