# Quantization Demo

## Introduction
In this demo, we will employ PEFT (LoRA) and Quantization techniques to fine-tune the Llama2-7b model, aiming to debias and detoxify text. We will utilize a specific dataset located at `../../data/debiased_profanity_check_with_keywords.csv`.

This notebook will guide you through the process, showcasing the steps involved in fine-tuning the model to produce a debiased and detoxified output from biased or toxic text.

## Steps

Here we define the main steps to fine-tune the Llama2-7b model using QLoRA.

1.   Load the dataset and apply necessary transformations to format it for prompt-completion.
2.   Configure bitsandbytes for 4-bit quantization; define the load and compute data types as specified in the QLoRA paper.
3.   Load the Llama2 model and its tokenizer.
4.   Define LoRA configurations and Training Arguments.
5.   Train using the SFT Trainer, which by default stores only the adapter model.
6.   Merge the adapter model with the base model (loaded in FP16).

## Importing Libraries
This cell imports libraries for dataset loading, tokenization, and training large language models using Hugging Face Transformers, and libraries required for PEFT and quantization.



In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    LlamaTokenizer,
    TrainingArguments,
    pipeline,
)
from trl import SFTTrainer

## Configuring Directory Paths for Model Weights, Dataset, and Model Storage
This cell specifies the directory paths for storing model checkpoints, adapter models, merged models, and the dataset necessary for the task.

In [None]:
DATASET_PATH = "../../data/debiased_profainty_check_with_keywords.csv"  # dataset of biased and corresponding debiased text
OUTPUT_DIR = "../../scratch/quantization/"  # main directory for the demo output
CHECKPOINT_DIR = f"{OUTPUT_DIR}checkpoint"  # where to save checkpoints
MERGED_MODEL_DIR = f"{OUTPUT_DIR}merged_model"  # where to save the merged model

In [None]:
MODEL_NAME = "/projects/fta_bootcamp/downloads/Llama-2-7b-chat-hf" # chat model
NEW_MODEL_NAME = "llama-2-7b-debiaser" # Fine-tuned model name

## Creating a HuggingFace Dataset

In [None]:
def create_hf_dataset_from_csv(csv_path):
    return load_dataset("csv", data_files=csv_path, split="train")

dataset = create_hf_dataset_from_csv(DATASET_PATH)
dataset = dataset.train_test_split(test_size=0.1)
dataset = dataset.select_columns(["biased_text", "debiased_text"])

Here are the first 3 samples of the dataset:

In [None]:
for i in range(3):
    sample = dataset["train"][i]
    print(sample, "\n")

## Loading Tokenizer

In [None]:
tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True, add_eos_token=True)

if not tokenizer.pad_token:
    tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.model_max_length = 1024

## Formatting Prompts
For instruction fine-tuning, we will use the Stanford Alpaca format as follows:

`### Instruction:\n {prompt}\n ### Input:\n {input_text}\n ### Response:\n {completion}`

In [None]:
def formatting_prompts_func(examples):
    """Format a batch of examples as Alpaca-style prompts for the SFTTrainer."""
    instruction = (
        " You are a text debiasing bot, you take as input a"
        " text and you output its debiased version by rephrasing it to be"
        " free from any age, gender, political, social or socio-economic"
        " biases, without any extra outputs. Debias this text by rephrasing"
        " it to be free of bias: "
    )
    output_text = []
    for input_text, response in zip(examples["biased_text"], examples["debiased_text"]):
        # Build the prompt piece by piece so that the function's indentation
        # does not leak into the training text (an indented triple-quoted
        # f-string would prefix every prompt line with spaces).
        text = (
            "Below is an instruction that describes a task, paired with an input"
            " that provides further context. Write a response that appropriately"
            " completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            f"### Response:\n{response}\n"
        )
        output_text.append(text)

    return output_text

## Configuring Quantization and LoRA

### LoRA-Specific Parameters

*   r: The rank of the low-rank update matrices. LoRA keeps the original weight matrix frozen and learns an update expressed as the product of two much smaller matrices of rank r. A higher rank adds more trainable parameters.
*   lora_alpha: Alpha parameter for LoRA scaling. The low-rank update is scaled by lora_alpha / r, so higher values make the update more influential in the fine-tuning process. It does not change the computational cost.
*   lora_dropout: Dropout probability for LoRA layers. During training, each input activation of a LoRA layer is set to zero with this probability, which helps prevent overfitting.

https://arxiv.org/abs/2305.14314
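
To make the rank and scaling parameters concrete, the following sketch counts the trainable parameters that LoRA adds for a single weight matrix and computes the factor by which the low-rank update is scaled. It assumes a 4096 x 4096 projection matrix (the hidden size of Llama-2-7b) and the standard LoRA formulation, in which the update is scaled by `lora_alpha / r`. The helper name `lora_param_count` is hypothetical and not part of PEFT.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


d_out, d_in = 4096, 4096  # assumed shape of one attention projection matrix
r, lora_alpha = 64, 16    # values used in this demo's LoraConfig

full_params = d_out * d_in
adapter_params = lora_param_count(d_out, d_in, r)
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"full matrix params:    {full_params}")
print(f"LoRA adapter params:   {adapter_params}")
print(f"adapter / full:        {adapter_params / full_params:.4%}")
print(f"update scaling factor: {scaling}")
```

Under these assumptions the adapter holds roughly 3% of the parameters of the matrix it adapts, and the update is scaled by 0.25.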

In [None]:
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,  # LoRA scaling factor; the low-rank update is scaled by lora_alpha / r
    lora_dropout=0.2,  # dropout probability applied in the LoRA layers during training to reduce overfitting
    bias="none",
    task_type="CAUSAL_LM",
)

### Quantization Parameters

We utilize **4-bit quantization** as described in the QLoRA paper: https://arxiv.org/pdf/2305.14314.pdf

The QLoRA paper sets the parameters as follows:

* set load_in_4bit=True to quantize the model to 4 bits when you load it.
* set bnb_4bit_quant_type="nf4" to use NormalFloat4 (NF4), a 4-bit data type designed for weights that follow a normal distribution.
* set bnb_4bit_use_double_quant=True to use a nested quantization scheme that additionally quantizes the quantization constants, saving further memory.
* set bnb_4bit_compute_dtype=torch.bfloat16 to use bfloat16 for faster computation.
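
As a rough illustration of why these settings matter, the sketch below estimates the memory needed to store the model weights under different configurations. It assumes a nominal 7 billion parameters, a quantization block size of 64 with one 32-bit constant per block, and, for double quantization, 8-bit constants grouped in blocks of 256 as described in the QLoRA paper. The function names are hypothetical, and the figures ignore activations, optimizer state, and any layers kept in higher precision.

```python
def constant_bits_per_param(block_size=64, double_quant=False, second_block_size=256):
    """Storage overhead of the quantization constants, in bits per parameter."""
    if not double_quant:
        # One 32-bit constant per block of weights.
        return 32 / block_size
    # 8-bit constants per block, plus 32-bit second-level constants.
    return 8 / block_size + 32 / (block_size * second_block_size)


def model_size_gb(n_params, weight_bits, overhead_bits=0.0):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return n_params * (weight_bits + overhead_bits) / 8 / 1e9


n_params = 7e9  # nominal parameter count (assumption)

fp16_gb = model_size_gb(n_params, 16)
overhead_plain = constant_bits_per_param(double_quant=False)
overhead_dq = constant_bits_per_param(double_quant=True)
nf4_gb = model_size_gb(n_params, 4, overhead_plain)
nf4_dq_gb = model_size_gb(n_params, 4, overhead_dq)

print(f"fp16 weights:                 {fp16_gb:.2f} GB")
print(f"4-bit weights:                {nf4_gb:.2f} GB")
print(f"4-bit + double quantization:  {nf4_dq_gb:.2f} GB")
print(f"bits saved per parameter:     {overhead_plain - overhead_dq:.3f}")
```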

In [None]:
use_4bit = True  # Activate 4-bit precision base model loading
# Compute dtype for 4-bit base models: either float16 or bfloat16.
# bfloat16 is recommended as it produces fewer NaNs.
# Note: keep this dtype in mind when merging the adapter.
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"  # Quantization type (fp4 or nf4)
use_nested_quant = False  # Activate nested quantization for 4-bit base models (double quantization)
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
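
For intuition about what block-wise 4-bit quantization does to the weights, here is a minimal sketch in plain Python. It uses a simple uniform absmax scheme with signed integer codes, which is a stand-in chosen for clarity and is not the NF4 data type that bitsandbytes implements. The weights, the block size of 4, and the function names are all illustrative assumptions.

```python
def quantize_block(block):
    """Map floats to signed 4-bit codes in [-7, 7] using the block's absmax."""
    absmax = max(abs(x) for x in block) or 1.0  # avoid dividing by zero
    codes = [round(x / absmax * 7) for x in block]
    return codes, absmax


def dequantize_block(codes, absmax):
    """Recover approximate floats from the codes and the stored constant."""
    return [c / 7 * absmax for c in codes]


weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.05, 0.29]  # toy values
block_size = 4  # illustrative; real implementations use larger blocks

all_codes = []
restored = []
for start in range(0, len(weights), block_size):
    codes, absmax = quantize_block(weights[start:start + block_size])
    all_codes.extend(codes)
    restored.extend(dequantize_block(codes, absmax))

max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:    ", all_codes)
print("restored: ", [round(r, 3) for r in restored])
print("max error:", round(max_error, 4))
```

Each block stores one floating-point constant (its absmax) alongside the 4-bit codes. Those per-block constants are what nested quantization compresses further.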

## Loading Model

In the cell below, we create a model object with the quantization configuration defined above using the bitsandbytes library.

In [None]:
device_map = {"": 0}  # place the entire model on GPU 0
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map=device_map,
)
model.config.use_cache = False
model.config.pretraining_tp = 1 # Setting this to a value different than 1 will activate the more accurate but slower computation of the linear layers, which should better match the original logits.

### Base Model Generation
Here we test the performance of the base model:

In [None]:
instruction = (
    " You are a text debiasing bot, you take as input a"
    " text and you output its debiased version by rephrasing it to be"
    " free from any age, gender, political, social or socio-economic"
    " biases, without any extra outputs. Debias this text by rephrasing"
    " it to be free of bias: "
)

input_text = "Women are dumb."
text = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input_text}
"""

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_length=400)
result = pipe(text)
result[0]["generated_text"]

### Calculating Trainable Parameters of the Model

In [None]:
def print_trainable_parameters(model):
    """Print the number of trainable parameters in the model."""
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}",
    )


print_trainable_parameters(model)

## Defining Training Arguments

In [None]:
# Training arguments from https://github.com/facebookresearch/llama-recipes

training_arguments = TrainingArguments(
    output_dir=CHECKPOINT_DIR,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard",
)
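
To get a feel for how these arguments translate into a training schedule, the sketch below estimates the number of optimizer steps per epoch and how many checkpoints `save_steps=25` would produce. The dataset size of 1000 examples is a hypothetical placeholder, since the actual size depends on the CSV, and a single GPU is assumed. The exact step count may differ slightly depending on the Trainer version and data loader settings.

```python
import math

num_train_examples = 1000        # hypothetical; use len(dataset["train"]) in practice
per_device_train_batch_size = 4  # from the training arguments in this demo
gradient_accumulation_steps = 1  # from the training arguments in this demo
num_devices = 1                  # assumed: the model is placed on a single GPU
save_steps = 25                  # from the training arguments in this demo

# Number of examples that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
checkpoints_per_epoch = steps_per_epoch // save_steps

print(f"effective batch size:  {effective_batch_size}")
print(f"steps per epoch:       {steps_per_epoch}")
print(f"checkpoints per epoch: {checkpoints_per_epoch}")
```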

In [None]:
model = prepare_model_for_kbit_training(model)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    max_seq_length=1024,
    tokenizer=tokenizer,
    args=training_arguments,
    formatting_func=formatting_prompts_func,
    packing=False,
)

## Training the Model

In [None]:
trainer.train()

## Merge the Model

In [None]:
model = trainer.model.merge_and_unload()
# model.save_pretrained(MERGED_MODEL_DIR)  # uncomment to save the merged model if needed
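
Conceptually, merging folds the scaled low-rank update into the base weight so that no separate adapter is needed at inference time. The sketch below illustrates this on a tiny 2 x 2 example in plain Python, assuming the standard LoRA formulation `W_merged = W + (lora_alpha / r) * B @ A`. The matrices and helper functions are made up for illustration and say nothing about how PEFT implements the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(m, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[x * s for x in row] for row in m]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (toy values)
B = [[0.5], [-1.0]]           # LoRA factor of shape 2 x r
A = [[2.0, 1.0]]              # LoRA factor of shape r x 2
r, lora_alpha = 1, 2
scaling = lora_alpha / r

# Merge: fold the scaled low-rank update into the base weight.
W_merged = add(W, scale(matmul(B, A), scaling))

# The merged weight gives the same output as base + adapter applied separately.
x = [[1.0], [1.0]]
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))
y_merged = matmul(W_merged, x)

print("merged weight:", W_merged)
print("adapter path: ", y_adapter)
print("merged path:  ", y_merged)
```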

## Load and Test Trained Model

In [None]:
# model = AutoModelForCausalLM.from_pretrained("/projects/fta_bootcamp/trained_models/quantization_merged_model")

### Trained Model Generation
Here we test the performance of the trained model:

In [None]:
instruction = (
    " You are a text debiasing bot, you take as input a"
    " text and you output its debiased version by rephrasing it to be"
    " free from any age, gender, political, social or socio-economic"
    " biases, without any extra outputs. Debias this text by rephrasing"
    " it to be free of bias: "
)

input_text = "Women are dumb."
text = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input_text}
"""

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_length=400)
result = pipe(text)
result[0]["generated_text"]