## Instruction Tuning

Supervised fine-tuning (SFT) updates all of a model's parameters on supervised data made of input-output pairs. It teaches the model to follow user-specified instructions and is typically done after pre-training. **Source**: http://tinyurl.com/2v884put

![instruction tuning](assets/instruction-tuning.jpg)

Image Source: https://medium.com/mantisnlp/supervised-fine-tuning-customizing-llms-a2c1edbf22c3

Requirements:
1. Pre-trained model and tokenizer: we will get them from Hugging Face.
2. Instruction-response pair data, e.g. Alpaca, Dolly, Oasst1, LIMA. In this notebook the data is read from local TSV files with the Hugging Face `datasets` library.

Steps:
1. Load the pre-trained model and tokenizer.
2. Format the instruction-response pairs.
3. Preprocess the dataset.
4. Train the pre-trained model in a supervised setting, with the instruction as input and the response as labels.
5. Evaluation:
   i. Automatic evaluation: e.g. MMLU, BBH, AGIEval, or domain-specific benchmarks for maths, reasoning, and code.
   ii. Human evaluation: give the model prompts, let it generate responses, and ask humans to rate them.
   iii. LLM as evaluator: ask a powerful model such as GPT-4 to rate the responses generated by your fine-tuned model.


  
**Note**: Most of the code was borrowed from the original Alpaca code: https://github.com/tatsu-lab/stanford_alpaca/tree/main

#### Dependencies

In [2]:
# import dependencies
import os
import copy
import logging
from dataclasses import dataclass, field
from typing import Optional, Dict, Sequence
from tqdm import tqdm
import torch
import datasets
import transformers
from torch.utils.data import Dataset
from transformers import Trainer
from datasets import load_dataset
from transformers.trainer_utils import get_last_checkpoint, is_main_process
import pandas as pd

##### 1. Configs

In [33]:
model_name_or_path = "UBC-NLP/Jasmine-350M"  # Hugging Face model name; alternatives: "microsoft/phi-1_5", or "gpt2" on a less powerful GPU
cache_dir="cache_dir"
split_name="train"
inst_col_name="instruction"
input_col_name="source_dialect"
output_col_name="target_msa"
model_max_length=512 # maximum sequence length (in tokens) the model will process
IGNORE_INDEX = -100
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_UNK_TOKEN = "<unk>"

##### 2. Dataset Preparation

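- These are the prompt templates (wrappers) we are going to use.
- Some instructions come with an additional input and some do not, hence the separate `prompt_input` and `prompt_no_input` templates. The `translation` template is the one actually used for this task.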


In [12]:
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input":(
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
    # "translation":(
    #     "Below is an instruction that describes a task, paired with an input that provides further context. "
    #     "Write a response that appropriately completes the request.\n\n"
    #     "### Instruction:\nTranslate this sentence from dialect arabic to Modern standard Arabic\n\n### Input:\n{source_dialect}\n\n### Response:"
    # ),
    # "translation":(
    #     "ترجم الجملة من العامية الى العربية الفصحي الجملة هي:\n{source_dialect}\n\n### وهذة هى ترجمة الجملة:"
    # ),
    "translation":(
        "Input:\n{source_dialect}\n\n### Response:"
    ),
}
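
As a quick illustration of how these templates are filled, the sketch below formats one record with `str.format_map`. The record is made up for the example (its values are placeholders, not rows from the dataset), and the EOS string is assumed to match `DEFAULT_EOS_TOKEN` from the configs; only the `translation` template string is copied from the cell above.

```python
# Copy of the "translation" template from PROMPT_DICT above.
translation_template = "Input:\n{source_dialect}\n\n### Response:"

# Hypothetical record; real rows come from the TSV file loaded later.
example = {"source_dialect": "<dialect sentence>", "target_msa": "<MSA sentence>"}

# format_map fills only the placeholders present in the template,
# so the extra "target_msa" key is simply ignored.
prompt = translation_template.format_map(example)

# During training, the target plus an EOS token is appended to the prompt.
eos_token = "</s>"  # assumed EOS string, for illustration only
full_text = prompt + example["target_msa"] + eos_token
print(repr(full_text))
```

The prompt ends right after `### Response:`, so everything the model is trained to produce starts at the first token of the target.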

If we add any new tokens to the tokenizer, we need to extend the model's embedding matrices accordingly.

In [13]:
def smart_tokenizer_and_embedding_resize(
    special_tokens_dict: Dict,
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))

    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg
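
The function above initializes the rows of newly added tokens with the mean of the existing embedding rows. The toy sketch below mirrors that logic with plain Python lists instead of tensors; the numbers and the table size are made up for illustration.

```python
# Toy embedding table: 3 existing tokens, embedding dimension 2 (made-up numbers).
embeddings = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
num_new_tokens = 1

# Resizing appends rows for the new tokens (their initial values do not matter here).
embeddings.extend([[0.0, 0.0] for _ in range(num_new_tokens)])

# Average the rows of the original tokens, column by column.
old_rows = embeddings[:-num_new_tokens]
mean_row = [sum(col) / len(old_rows) for col in zip(*old_rows)]

# Overwrite every new row with the mean of the old ones.
for i in range(len(embeddings) - num_new_tokens, len(embeddings)):
    embeddings[i] = list(mean_row)

print(embeddings[-1])  # [3.0, 4.0]
```

Starting new tokens at the average of the existing embeddings keeps them in the same range as the rest of the table, rather than at arbitrary values.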

### Tokenize the data.

In [14]:
def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    """Tokenize a list of strings."""
    tokenized_list = [
        tokenizer(
            text,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        )
        for text in strings
    ]
    input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
    input_ids_lens = labels_lens = [
        tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item() for tokenized in tokenized_list
    ]
    return dict(
        input_ids=input_ids,
        labels=labels,
        input_ids_lens=input_ids_lens,
        labels_lens=labels_lens,
    )

In [15]:
def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX
    return dict(input_ids=input_ids, labels=labels)
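
To make the masking step in `preprocess` concrete, here is a minimal sketch with made-up token ids and plain lists. It assumes, as the function above does, that the tokenized prompt is a prefix of the tokenized prompt-plus-response. `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so masked positions do not contribute to the loss.

```python
IGNORE_INDEX = -100  # same sentinel as in the configs

# Made-up token ids: the first 4 belong to the prompt, the last 3 to the response.
source_ids = [11, 12, 13, 14]
target_ids = [21, 22, 2]  # 2 stands in for an EOS id here
input_ids = source_ids + target_ids

# Labels start as a copy of the inputs, then the prompt part is masked out.
labels = list(input_ids)
labels[:len(source_ids)] = [IGNORE_INDEX] * len(source_ids)

print(input_ids)  # [11, 12, 13, 14, 21, 22, 2]
print(labels)     # [-100, -100, -100, -100, 21, 22, 2]
```

The model still reads the whole sequence as input, but it is only penalized for its predictions on the response tokens.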

In [18]:
class SupervisedDataset(Dataset):
    """Dataset for supervised fine-tuning."""

    def __init__(self, dataset_name_or_path: str, tokenizer: transformers.PreTrainedTokenizer):
        super(SupervisedDataset, self).__init__()

        # Load the dataset
        logging.warning("Loading data...")
        # 1. Load tsv file 2. Convert them into hf dataset
        # df = pd.read_csv(dataset_name_or_path
        # dataset = datasets.load_dataset(dataset_name_or_path, split=split_name)
        dataset = load_dataset('csv', data_files=dataset_name_or_path,
                                delimiter='\t',
                                column_names=['source_dialect', 'target_msa'])

        logging.warning("Formatting inputs...")
        # if there is no input for the prompt, use the prompt_no_input template; otherwise use prompt_input
        prompt_input, prompt_no_input, prompt_translation = PROMPT_DICT["prompt_input"], PROMPT_DICT["prompt_no_input"], PROMPT_DICT["translation"]
        # sources = [
        #     prompt_input.format_map(example) if example.get(input_col_name, "") != "" else prompt_no_input.format_map(example)
        #     for example in tqdm(dataset)
        # ]
        sources = [prompt_translation.format_map(example) for example in dataset['train']]
        
        targets = [f"{example[output_col_name]}{tokenizer.eos_token}" for example in dataset['train']]

        logging.warning("Tokenizing inputs... This may take some time...")
        data_dict = preprocess(sources, targets, tokenizer)

        self.input_ids = data_dict["input_ids"]
        self.labels = data_dict["labels"]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        return dict(input_ids=self.input_ids[i], labels=self.labels[i])

In [24]:
@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""

    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )
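
The collator's behavior can be sketched without tensors. The helper below is a hypothetical pure-Python stand-in, with an assumed pad id of `0`. It builds the attention mask from sequence lengths, whereas the cell above derives it from `input_ids.ne(pad_token_id)`; the two agree as long as the pad id never occurs inside a real sequence.

```python
IGNORE_INDEX = -100
PAD_ID = 0  # assumed pad token id, for illustration only

def collate(instances):
    """Right-pad a batch of variable-length examples to the longest one."""
    max_len = max(len(x["input_ids"]) for x in instances)
    batch = {"input_ids": [], "labels": [], "attention_mask": []}
    for x in instances:
        n_pad = max_len - len(x["input_ids"])
        batch["input_ids"].append(x["input_ids"] + [PAD_ID] * n_pad)
        # Padded label positions use IGNORE_INDEX so they add nothing to the loss.
        batch["labels"].append(x["labels"] + [IGNORE_INDEX] * n_pad)
        # Real tokens get True, padding gets False.
        batch["attention_mask"].append([True] * len(x["input_ids"]) + [False] * n_pad)
    return batch

batch = collate([
    {"input_ids": [5, 6, 7], "labels": [-100, 6, 7]},
    {"input_ids": [8, 9], "labels": [-100, 9]},
])
print(batch["input_ids"])       # [[5, 6, 7], [8, 9, 0]]
print(batch["labels"])          # [[-100, 6, 7], [-100, 9, -100]]
print(batch["attention_mask"])  # [[True, True, True], [True, True, False]]
```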

In [25]:
train_data_path = '/dataset/train/madar/train_standard_madar#ar.tsv'
val_data_path = '/dataset/train/madar/val_standard_madar#ar.tsv'
def make_supervised_data_module(tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    """Make dataset and collator for supervised fine-tuning."""
    train_dataset = SupervisedDataset(tokenizer=tokenizer, dataset_name_or_path=train_data_path)
    val_dataset = SupervisedDataset(tokenizer=tokenizer, dataset_name_or_path=val_data_path)
    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
    return dict(train_dataset=train_dataset, eval_dataset=val_dataset, data_collator=data_collator)


In [26]:
def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
    """Collects the state dict and dump to disk."""
    state_dict = trainer.model.state_dict()
    if trainer.args.should_save:
        cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
        del state_dict
        trainer._save(output_dir, state_dict=cpu_state_dict)  # noqa

In [27]:
def train(training_args):

    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        cache_dir=cache_dir,
        device_map="auto"
    ).to("cuda")
    model.config.use_cache=False

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_name_or_path,
        cache_dir=cache_dir,
        model_max_length=model_max_length,
        padding_side="right",
        use_fast=False,
    )
    
    if tokenizer.pad_token is None:
        smart_tokenizer_and_embedding_resize(
            special_tokens_dict=dict(pad_token=DEFAULT_PAD_TOKEN),
            tokenizer=tokenizer,
            model=model,
        )
    if "llama" in model_name_or_path:
        tokenizer.add_special_tokens(
            {
                "eos_token": DEFAULT_EOS_TOKEN,
                "bos_token": DEFAULT_BOS_TOKEN,
                "unk_token": DEFAULT_UNK_TOKEN,
            }
        )

    data_module = make_supervised_data_module(tokenizer=tokenizer)

    # update training args to make output dir
    output_dir = os.path.join(training_args.output_dir, model_name_or_path.split("/")[-1])
    os.makedirs(output_dir, exist_ok=True)

    training_args.output_dir = output_dir

    trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)

    # resume from last checkpoint if it exists
    checkpoint = get_last_checkpoint(training_args.output_dir)

    if checkpoint:
        print(f"Checkpoint found! Training from {checkpoint} checkpoint!")
        trainer.train(resume_from_checkpoint=checkpoint)
    else:
        print(f"No checkpoint found! Training from scratch!")
        trainer.train()

    # trainer.train()
    # save states
    trainer.save_state()
    safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
    print(f"Training finished! Saved model to {training_args.output_dir}.")


### Train

In [28]:
output_dir = "output_new_prompt_short_ar" 

In [29]:
training_args = transformers.TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",  # Evaluate at the end of every epoch
    overwrite_output_dir=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,
    num_train_epochs=10,
    logging_strategy="epoch",  # Log training loss every epoch
    save_strategy="epoch",  # Save checkpoints every epoch

)
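
With gradient accumulation, the optimizer only steps after several forward/backward passes, so the number of examples per update is larger than the per-device batch size. The arithmetic for the arguments above is sketched below, assuming a single GPU (the device count is not stated in the notebook).

```python
per_device_train_batch_size = 16
gradient_accumulation_steps = 8
num_devices = 1  # assumption: a single GPU

# Number of examples that contribute to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 128
```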

In [32]:
# training
train(training_args=training_args)



No checkpoint found! Training from scratch!




| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 0 | 2.6998 | 2.508303 |
| 1 | 1.9553 | 2.504355 |
| 2 | 1.5782 | 2.752396 |
| 3 | 1.2685 | 2.997625 |
| 4 | 1.0564 | 3.375811 |
| 5 | 0.9051 | 3.583266 |
| 6 | 0.7919 | 3.917618 |
| 8 | 0.6369 | 4.44927 |
| 9 | 0.593 | 4.616311 |
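
In the log above, the training loss keeps falling while the validation loss rises after epoch 1, a pattern usually read as overfitting. The small sketch below copies the logged values and picks the epoch with the lowest validation loss; it only inspects the numbers and is not part of the original training code.

```python
# (epoch, training loss, validation loss) copied from the log above.
history = [
    (0, 2.6998, 2.508303),
    (1, 1.9553, 2.504355),
    (2, 1.5782, 2.752396),
    (3, 1.2685, 2.997625),
    (4, 1.0564, 3.375811),
    (5, 0.9051, 3.583266),
    (6, 0.7919, 3.917618),
    (8, 0.6369, 4.44927),
    (9, 0.593, 4.616311),
]

# Select the row whose validation loss is smallest.
best_epoch, _, best_val = min(history, key=lambda row: row[2])
print(best_epoch, best_val)  # 1 2.504355
```

One option, not used in this notebook, is to set `load_best_model_at_end=True` in `TrainingArguments` so that the checkpoint with the best evaluation loss is restored at the end of training.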


Training finished! Saved model to output_new_prompt_short_ar/Jasmine-350M.


### Inference

In [30]:
model_name_or_path = "/home/mostafa.awad/nlp702-assignment-3/output_new_prompt_short/Jasmine-350M/checkpoint-456"

In [31]:
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    cache_dir=cache_dir,
    device_map="auto"
).to("cuda")
model.config.use_cache=False

tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_name_or_path,
    cache_dir=cache_dir,
    model_max_length=model_max_length,
    padding_side="right",
    use_fast=False,
)

if tokenizer.pad_token is None:
    smart_tokenizer_and_embedding_resize(
        special_tokens_dict=dict(pad_token=DEFAULT_PAD_TOKEN),
        tokenizer=tokenizer,
        model=model,
    )
if "llama" in model_name_or_path:
    tokenizer.add_special_tokens(
        {
            "eos_token": DEFAULT_EOS_TOKEN,
            "bos_token": DEFAULT_BOS_TOKEN,
            "unk_token": DEFAULT_UNK_TOKEN,
        }
    )

In [32]:
model.eval()

GPTNeoForCausalLM(
  (transformer): GPTNeoModel(
    (wte): Embedding(64001, 768)
    (wpe): Embedding(2048, 768)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPTNeoBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPTNeoAttention(
          (attention): GPTNeoSelfAttention(
            (attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_dropout): Dropout(p=0.0, inplace=False)
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=False)
            (q_proj): Linear(in_features=768, out_features=768, bias=False)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTNeoMLP(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (c_proj): Linear(in_fe

In [33]:
evaluation_file_path = "/dataset/val/NADI2024_subtask3_dev.tsv"

In [34]:
dataset = load_dataset('csv', data_files=evaluation_file_path,
                                delimiter='\t',
                                column_names=['source_dialect', 'target_msa'])


In [35]:
prompt_eval = PROMPT_DICT['translation']

In [36]:
eval = [prompt_eval.format_map(example) for example in dataset['train']]


In [62]:
eval[1]

'Input:\nطّلع الرجالة اللي انت عايزهم من الجراج و قطاع النقل يمشّطوا المناطق دي لحد ما يلاقوهم.\n\n### Response:'

In [63]:
len(eval)

400

In [37]:

generate = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generated_text = generate(
    eval,
    do_sample=True,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    return_full_text=False,
    max_new_tokens=64,
    temperature=0.9,
    top_p=0.99,
    top_k=40,
    repetition_penalty=1.1,
)
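
The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step. The function below is a simplified, hypothetical sketch of that idea on a made-up probability list. It is not the library's implementation and it ignores `temperature` and `repetition_penalty`.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Return the token ids that survive top-k, then top-p (nucleus) filtering."""
    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]
    # Renormalize over the survivors before applying top-p.
    total = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / total
        # top-p: stop once the kept tokens cover at least top_p of the mass.
        if cumulative >= top_p:
            break
    return kept

probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]  # made-up next-token distribution
print(filter_top_k_top_p(probs, top_k=3, top_p=0.8))  # [0, 1]
```

With `top_k=40` and `top_p=0.99` as in the cell above, almost all of the probability mass among the 40 most likely tokens stays available, so sampling remains fairly diverse.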



In [40]:
l = []
for text in generated_text:
    l.append(text[0]['generated_text'])

In [45]:
# File path to save the list
file_path = "jasmine_out.txt"

# Open the file in write mode
with open(file_path, 'w', encoding='utf-8') as file:
    # Write each element of the list to the file
    for item in l:
        file.write(f"{item}\n")

In [44]:
len(l)

400

In [80]:
generated_text

[[{'generated_text': 'Input:\nsource_da\n\n### Response:هل من الممكن أن تخبرني عن كيفية الوصول إلى هناك من فضلك ؟'}],
 [{'generated_text': 'Input:\nطّلع الرجالة اللي انت عايزهم من الجراج و قطاع النقل يمشّطوا المناطق دي لحد ما يلاقوهم.\n\n### Response:أخرج الرجّال من جراج و اتجه إلى منطقة التسوق هذه حتى يتم العثور عليهم.'}],
 [{'generated_text': 'Input:\nخمسة و ستين قلم ده لازم يبقي معاه كلكوليتر ساعتها و لا ايه؟ \n\n### Response:خمسة وسبعون قلم رصاص هذا يجب أن يكون مع كل قلم رصاص ساعتها أيضاً.'}],
 [{'generated_text': 'Input:\nأنا بس عايزك ترتبي نفسك من دلوقتي إنك تبقي زي ستات عيلة الدالي.\n\n### Response:أريد أن أرتب نفسي من الآن.'}],
 [{'generated_text': 'Input:\nبصوا بقى أنا اللي استنيت و أنا اللي كسبت و أنا اللي حلمت يعني كل ده من حقي أنا أنا عارف طبعاً ان كل واحد فيكوا كان بيساعدنا، أبو النجا كان بيساعد من مرتبه يشكر، جعيدي كان بيأكلنا من الزرع اللي بيعمله فوق السطوح و يشكر، بيبيتو كان بيسرح علينا بالجيتار و يشحت يشكر، أنا برضو مش ناسي كل ده و ان شاء الله هيكون يعني عندي. \n\n### 