## Instruction Tuning

Supervised fine-tuning (SFT) updates all of a model's parameters on supervised data made of input-output pairs. It teaches the model to follow user-specified instructions and is typically done after pre-training. **Source**: http://tinyurl.com/2v884put

![instruction tuning](assets/instruction-tuning.jpg)

Image Source: https://medium.com/mantisnlp/supervised-fine-tuning-customizing-llms-a2c1edbf22c3

Requirements.
1. Pre-trained model and tokenizer -> we will get them from Hugging Face.
2. Instruction-response pair data -> e.g. Alpaca, Dolly, Oasst1, LIMA. We will get the dataset from Hugging Face.

Steps.
1. Load the pre-trained model and tokenizer.
2. Format the instruction-response pairs.
3. Preprocess the dataset.
4. Train the pre-trained model in a supervised setting, with the instruction as input and the response as labels.
5. Evaluation:
   i. Automatic evaluation: e.g. MMLU, BBH, AGIEval, or domain-specific benchmarks for maths, reasoning, and code.
   ii. Human evaluation: prompt the model to generate responses and ask humans to rate them.
   iii. LLM as evaluator: ask a powerful model such as GPT-4 to rate the responses generated by your fine-tuned model.


  
**Note**: Most of the code is borrowed from the original Alpaca code: https://github.com/tatsu-lab/stanford_alpaca/tree/main

#### Dependencies

In [1]:
!pip uninstall transformers
!pip install git+https://github.com/huggingface/transformers
!pip install --quiet datasets accelerate -U

Found existing installation: transformers 4.35.2
Uninstalling transformers-4.35.2:
  Would remove:
    /usr/local/bin/transformers-cli
    /usr/local/lib/python3.10/dist-packages/transformers-4.35.2.dist-info/*
    /usr/local/lib/python3.10/dist-packages/transformers/*
Proceed (Y/n)? Y
  Successfully uninstalled transformers-4.35.2
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-ed0w6hr7
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-ed0w6hr7
  Resolved https://github.com/huggingface/transformers to commit 5f81266fb0b444f896fea9322d7c41368abd8526
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ...

In [1]:
# import dependencies
import os
import copy
import logging
from dataclasses import dataclass, field
from typing import Optional, Dict, Sequence
from tqdm import tqdm
import torch
import datasets
import transformers
from torch.utils.data import Dataset
from transformers import Trainer
from datasets import load_dataset
from transformers.trainer_utils import get_last_checkpoint, is_main_process

##### 1. Configs

In [2]:
model_name_or_path = "gpt2"  # Hugging Face model name; "microsoft/phi-1_5" is an option if you have a more powerful GPU
dataset_name_or_path = "xzuyn/lima-alpaca"  # LIMA data in Alpaca format. https://arxiv.org/abs/2305.11206
cache_dir="cache_dir"
split_name="train"
inst_col_name="instruction"
input_col_name="input"
output_col_name="output"
model_max_length=512 # maximum sequence length (in tokens); longer sequences are truncated
IGNORE_INDEX = -100
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "</s>"
DEFAULT_UNK_TOKEN = "</s>"

##### 2. Dataset Preparation

In [3]:
# dataset = load_dataset(dataset_name_or_path)

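- These are the two prompt templates (wrappers) we are going to use.
- Some instructions come with an additional input that provides context; these use `prompt_input`, and the rest use `prompt_no_input`.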


In [3]:
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input":(
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}
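
As a quick illustration of how one of the two templates is picked per record, here is a minimal sketch. The two records and the `build_prompt` helper are hypothetical and only mirror the selection logic used later in `SupervisedDataset`; the templates are repeated so the snippet runs on its own.

```python
# Templates repeated from PROMPT_DICT so this snippet is self-contained.
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}


def build_prompt(example: dict) -> str:
    """Hypothetical helper: use prompt_input only when the record has a non-empty input."""
    if example.get("input", "") != "":
        return PROMPT_DICT["prompt_input"].format_map(example)
    return PROMPT_DICT["prompt_no_input"].format_map(example)


# Made-up records, for illustration only.
with_input = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
without_input = {"instruction": "Name a primary color.", "input": "", "output": "Red"}

prompt_a = build_prompt(with_input)
prompt_b = build_prompt(without_input)
print(prompt_a)
print("-----")
print(prompt_b)
```

Both prompts end with `### Response:`, so the model's continuation is the response itself.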

If we add any new tokens to the tokenizer, we need to resize the model's embedding matrix to match.

In [4]:
def smart_tokenizer_and_embedding_resize(
    special_tokens_dict: Dict,
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))

    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg
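
The idea behind the function above is that each new token's embedding is initialized to the mean of the existing embeddings rather than left at random values. The sketch below shows the same arithmetic on a tiny made-up matrix using plain Python lists; `mean_init_new_rows` is a hypothetical helper, not part of the Alpaca code or of `transformers`.

```python
def mean_init_new_rows(embeddings, num_new_tokens):
    """Return a copy of `embeddings` with `num_new_tokens` rows appended,
    each set to the column-wise mean of the existing rows."""
    dim = len(embeddings[0])
    mean_row = [sum(row[j] for row in embeddings) / len(embeddings) for j in range(dim)]
    return [list(row) for row in embeddings] + [list(mean_row) for _ in range(num_new_tokens)]


# Toy 3-token vocabulary with 2-dimensional embeddings (made-up values).
old = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
resized = mean_init_new_rows(old, num_new_tokens=1)

print(len(resized))   # 4
print(resized[-1])    # [2.0, 2.0]
```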

### Tokenize the data.

In [5]:
def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    """Tokenize a list of strings."""
    tokenized_list = [
        tokenizer(
            text,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        )
        for text in strings
    ]
    input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
    input_ids_lens = labels_lens = [
        tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item() for tokenized in tokenized_list
    ]
    return dict(
        input_ids=input_ids,
        labels=labels,
        input_ids_lens=input_ids_lens,
        labels_lens=labels_lens,
    )

In [6]:
def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)
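
To make the masking in `preprocess` concrete, here is a small sketch with made-up token ids. It assumes, as the function above does, that the prompt tokenized alone is a prefix of the tokenized prompt plus response. `mask_prompt_labels` is a hypothetical helper used only for this illustration.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the cross-entropy loss


def mask_prompt_labels(input_ids, source_len):
    """Copy input_ids and hide the first `source_len` (prompt) positions from the loss."""
    labels = list(input_ids)
    labels[:source_len] = [IGNORE_INDEX] * source_len
    return labels


source_ids = [11, 12, 13, 14]   # made-up ids for the formatted prompt
target_ids = [21, 22, 2]        # made-up ids for the response followed by EOS
input_ids = source_ids + target_ids
labels = mask_prompt_labels(input_ids, len(source_ids))

print(input_ids)  # [11, 12, 13, 14, 21, 22, 2]
print(labels)     # [-100, -100, -100, -100, 21, 22, 2]
```

The model still sees the full sequence as input, but only the response tokens (including EOS) contribute to the loss.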

In [7]:
class SupervisedDataset(Dataset):
    """Dataset for supervised fine-tuning."""

    def __init__(self, dataset_name_or_path: str, tokenizer: transformers.PreTrainedTokenizer):
        super(SupervisedDataset, self).__init__()

        # Load the dataset
        logging.warning("Loading data...")
        dataset = datasets.load_dataset(dataset_name_or_path, split=split_name)


        logging.warning("Formatting inputs...")
        # if the example has no input, use the prompt_no_input template; otherwise use prompt_input
        prompt_input, prompt_no_input = PROMPT_DICT["prompt_input"], PROMPT_DICT["prompt_no_input"]
        sources = [
            prompt_input.format_map(example) if example.get(input_col_name, "") != "" else prompt_no_input.format_map(example)
            for example in tqdm(dataset)
        ]
        targets = [f"{example[output_col_name]}{tokenizer.eos_token}" for example in dataset]

        logging.warning("Tokenizing inputs... This may take some time...")
        data_dict = preprocess(sources, targets, tokenizer)

        self.input_ids = data_dict["input_ids"]
        self.labels = data_dict["labels"]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        return dict(input_ids=self.input_ids[i], labels=self.labels[i])

In [8]:
# for example in dataset['train']:
#     print(example.get(input_col_name, "input"))
#     break

In [9]:
# dataset = SupervisedDataset(dataset_name_or_path=dataset_name_or_path, tokenizer=tokenizer)

In [10]:
@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""

    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )
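
The collator right-pads every sequence in a batch to the length of the longest one. A framework-free sketch of the same logic follows; the pad id `0` and the two instances are assumptions chosen for illustration, and `collate` is a hypothetical stand-in for the class above.

```python
IGNORE_INDEX = -100
PAD_ID = 0  # assumption: hypothetical pad token id


def collate(instances, pad_id=PAD_ID):
    """Right-pad input_ids with pad_id and labels with IGNORE_INDEX; mask out pad positions."""
    max_len = max(len(x["input_ids"]) for x in instances)
    batch = {"input_ids": [], "labels": [], "attention_mask": []}
    for x in instances:
        n_pad = max_len - len(x["input_ids"])
        ids = x["input_ids"] + [pad_id] * n_pad
        batch["input_ids"].append(ids)
        batch["labels"].append(x["labels"] + [IGNORE_INDEX] * n_pad)
        # Same rule as input_ids.ne(pad_token_id): attend to everything that is not padding.
        batch["attention_mask"].append([tok != pad_id for tok in ids])
    return batch


instances = [
    {"input_ids": [5, 6, 7], "labels": [-100, 6, 7]},
    {"input_ids": [8, 9], "labels": [-100, 9]},
]
batch = collate(instances)
print(batch["input_ids"])       # [[5, 6, 7], [8, 9, 0]]
print(batch["labels"])          # [[-100, 6, 7], [-100, 9, -100]]
print(batch["attention_mask"])  # [[True, True, True], [True, True, False]]
```

Padding the labels with `IGNORE_INDEX` rather than the pad id keeps padded positions out of the loss.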

In [11]:
def make_supervised_data_module(tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    """Make dataset and collator for supervised fine-tuning."""
    train_dataset = SupervisedDataset(tokenizer=tokenizer, dataset_name_or_path=dataset_name_or_path)
    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
    return dict(train_dataset=train_dataset, eval_dataset=None, data_collator=data_collator)


In [12]:
def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
    """Collects the state dict and dump to disk."""
    state_dict = trainer.model.state_dict()
    if trainer.args.should_save:
        cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
        del state_dict
        trainer._save(output_dir, state_dict=cpu_state_dict)  # noqa

In [13]:
def train(training_args):

    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        cache_dir=cache_dir,
        device_map="auto"
    )
    model.config.use_cache=False

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_name_or_path,
        cache_dir=cache_dir,
        model_max_length=model_max_length,
        padding_side="right",
        use_fast=False,
    )
    if tokenizer.pad_token is None:
        smart_tokenizer_and_embedding_resize(
            special_tokens_dict=dict(pad_token=DEFAULT_PAD_TOKEN),
            tokenizer=tokenizer,
            model=model,
        )
    if "llama" in model_name_or_path:
        tokenizer.add_special_tokens(
            {
                "eos_token": DEFAULT_EOS_TOKEN,
                "bos_token": DEFAULT_BOS_TOKEN,
                "unk_token": DEFAULT_UNK_TOKEN,
            }
        )

    data_module = make_supervised_data_module(tokenizer=tokenizer)

    # update training args to make output dir
    output_dir = os.path.join(training_args.output_dir, model_name_or_path.split("/")[-1])
    os.makedirs(output_dir, exist_ok=True)

    training_args.output_dir = output_dir

    trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)

    # resume from last checkpoint if it exists
    checkpoint = get_last_checkpoint(training_args.output_dir)

    if checkpoint:
        print(f"Checkpoint found! Training from {checkpoint} checkpoint!")
        trainer.train(resume_from_checkpoint=checkpoint)
    else:
        print("No checkpoint found! Training from scratch!")
        trainer.train()

    # save the trainer state and the model weights
    trainer.save_state()
    safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
    print(f"Training finished! Saved model to {training_args.output_dir}.")
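
The resume logic relies on finding the most recent checkpoint folder in the output directory. The sketch below shows roughly what such a lookup involves. It is not the `transformers` implementation; it only assumes the Trainer's usual `checkpoint-<step>` folder naming, and `find_last_checkpoint` is a hypothetical name.

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(folder):
    """Return the path of the highest-numbered `checkpoint-<step>` subfolder, or None."""
    candidates = [
        name
        for name in os.listdir(folder)
        if CHECKPOINT_RE.match(name) and os.path.isdir(os.path.join(folder, name))
    ]
    if not candidates:
        return None
    # Compare step numbers numerically, so checkpoint-1000 beats checkpoint-90.
    latest = max(candidates, key=lambda name: int(CHECKPOINT_RE.match(name).group(1)))
    return os.path.join(folder, latest)


with tempfile.TemporaryDirectory() as tmp:
    empty_result = find_last_checkpoint(tmp)
    for step in (500, 1000, 90):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    last_name = os.path.basename(find_last_checkpoint(tmp))

print(empty_result)  # None
print(last_name)     # checkpoint-1000
```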


### Train

In [16]:
output_dir = "output"  # directory where checkpoints and the final model are saved

In [17]:
training_args = transformers.TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    logging_steps=10,
)
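
With gradient accumulation, the number of examples per optimizer update is larger than the per-device batch size. A back-of-the-envelope sketch is given below. It assumes a single GPU and LIMA's 1,000 training examples, and the update count is only approximate because the exact rounding depends on the Trainer version.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_devices = 1       # assumption: a single GPU
num_examples = 1000   # assumption: size of the LIMA training split

# Examples that contribute to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Rough number of optimizer updates per epoch (exact rounding may differ).
approx_updates_per_epoch = num_examples // effective_batch_size

print(effective_batch_size)      # 16
print(approx_updates_per_epoch)  # 62
```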

In [14]:
# transformers.TrainingArguments?

In [18]:
# training
train(training_args=training_args)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
100%|██████████| 1000/1000 [00:00<00:00, 26336.37it/s]


No checkpoint found! Training from scratch!


| Step | Training Loss |
|------|---------------|
| 10 | 3.0839 |
| 20 | 3.079 |
| 30 | 3.0632 |
| 40 | 3.1306 |
| 50 | 3.0564 |
| 60 | 3.0491 |
| 70 | 2.9916 |
| 80 | 2.9106 |
| 90 | 2.9751 |
| 100 | 2.9533 |
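
One way to read these numbers is to convert the loss to perplexity. The sketch below assumes the logged value is the mean per-token cross-entropy in nats, which is the usual convention for causal language models in `transformers`.

```python
import math


def perplexity(loss):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(loss)


first_loss = 3.0839  # logged at step 10
last_loss = 2.9533   # logged at step 100

first_ppl = perplexity(first_loss)
last_ppl = perplexity(last_loss)
print(round(first_ppl, 2))
print(round(last_ppl, 2))
```

Under that assumption, perplexity drops from roughly 21.8 to roughly 19.2 over the logged steps.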


Training finished! Saved model to output/gpt2.


### Evaluation

In [20]:
ls lm-evaluation-harness/

CITATION.bib  examples/   lm_eval/           pile_statistics.json  requirements.txt  templates/
CODEOWNERS    ignore.txt  lm_eval.egg-info/  pyproject.toml        scripts/          tests/
docs/         LICENSE.md  mypy.ini           README.md             setup.py


In [27]:
# clone lm-evaluation-harness and install it in editable mode
# !git clone https://github.com/EleutherAI/lm-evaluation-harness
# %cd lm-evaluation-harness
# !pip install -e .

#### Zero-shot model

In [21]:
# zero-shot pre-trained model
!lm_eval --model hf \
    --model_args pretrained="gpt2" \
    --tasks sst2 \
    --device cuda:0 \
    --batch_size auto:4

2024-01-24:05:24:58,724 INFO     [utils.py:160] NumExpr defaulting to 2 threads.
2024-01-24:05:24:58,987 INFO     [config.py:58] PyTorch version 2.1.0+cu121 available.
2024-01-24:05:24:58,988 INFO     [config.py:95] TensorFlow version 2.15.0 available.
2024-01-24:05:24:58,989 INFO     [config.py:108] JAX version 0.4.23 available.
2024-01-24 05:24:59.465533: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-24 05:24:59.465590: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-24 05:24:59.466821: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-24:05:25:04,183 INFO     

In [22]:
# zero-shot evaluation of the fine-tuned model
!lm_eval --model hf \
    --model_args pretrained=/content/output/gpt2 \
    --tasks sst2 \
    --device cuda:0 \
    --batch_size auto:4

2024-01-24:05:25:51,870 INFO     [utils.py:160] NumExpr defaulting to 2 threads.
2024-01-24:05:25:52,137 INFO     [config.py:58] PyTorch version 2.1.0+cu121 available.
2024-01-24:05:25:52,138 INFO     [config.py:95] TensorFlow version 2.15.0 available.
2024-01-24:05:25:52,139 INFO     [config.py:108] JAX version 0.4.23 available.
2024-01-24 05:25:52.615154: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-24 05:25:52.615202: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-24 05:25:52.616432: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-24:05:25:56,950 INFO     

### Prompting
- Once the model has been trained to follow instructions, we can prompt it to generate a response.
- We will use Hugging Face's text-generation pipeline to prompt our trained model.
- Because the model was trained with a particular prompt wrapper, use the same prompt format at inference time.


In [23]:
import time
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os

In [24]:
# Load tokenizer and trained model and then create chatbot pipeline
tokenizer = AutoTokenizer.from_pretrained("/content/output/gpt2")
model = AutoModelForCausalLM.from_pretrained("/content/output/gpt2", device_map="cuda:0")

chatbot = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

In [31]:
# pip show torch

In [25]:
# format the prompt
text = "What is Machine Learning?"

prompt = PROMPT_DICT['prompt_no_input'].format(instruction=text)

In [26]:
prompt

'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is Machine Learning?\n\n### Response:'

In [27]:
sequences = chatbot(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.4,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=64,
    return_full_text=False
)
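
To get a feel for what `temperature=0.1` and `top_p=0.4` do, here is a simplified sketch of temperature scaling followed by nucleus (top-p) filtering on made-up logits. It is not the Hugging Face implementation; the logits and helper names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability
    reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.5, 0.5, 0.1]  # made-up next-token logits
probs_t1 = softmax_with_temperature(logits, 1.0)
probs_t01 = softmax_with_temperature(logits, 0.1)
nucleus = top_p_filter(probs_t01, 0.4)

print([round(p, 3) for p in probs_t1])
print([round(p, 3) for p in probs_t01])
print(nucleus)  # {0: 1.0}
```

In this toy case a low temperature pushes almost all probability onto the top token, and the small `top_p` then keeps only that token, so sampling behaves almost like greedy decoding.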

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
# !ps ux

In [28]:
print(sequences[0]['generated_text'])

Machine learning is a new field of research that has been gaining momentum in recent years. Machine learning is a new field of research that has been gaining momentum in recent years. Machine learning is a new field of research that has been gaining momentum in recent years.

Machine learning is a new field of research that has been


#### Next Tutorial - Prompting