Name: Fabio Grassiotto  
RA: 890441

# Exercise with QLoRA and phi-1.5
- Adapt the notebook to fine-tune phi-1.5 on IMDB using a T4 GPU.
- Truncate the IMDB reviews so they fit on the GPU. Test how many tokens fit.
- Evaluate the model before fine-tuning and record the accuracy.
- Fine-tune the model using QLoRA.
- Evaluate the fine-tuned model.
- Check the GPU memory used for inference and for training (with QLoRA).
- Use the `nvidia-smi` command.

## Module 2 - Fine-tuning Phi-1.5 for sentence classification using QLoRA

This notebook presents an example of how to fine-tune Phi-1.5 for sentence classification using QLoRA.

QLoRA is a fine-tuning approach that reduces memory usage enough to fine-tune a 65B-parameter model on a single 48 GB GPU while preserving full 16-bit fine-tuning task performance. For more details, please refer to the [QLoRA paper](https://arxiv.org/abs/2305.14314).


## Setup

### Installing required packages

In this example, we install the following libraries: `transformers`, `datasets`, `torch`, `peft`, `bitsandbytes`, and `trl`.

**`transformers`**:

Transformers is an open-source library for NLP developed by Hugging Face. It provides state-of-the-art pre-trained models for various NLP tasks, such as text classification, sentiment analysis, question-answering, named entity recognition, etc.

**`datasets`**:

Datasets is another open-source library developed by Hugging Face that provides a collection of preprocessed datasets for various NLP tasks, such as sentiment analysis, natural language inference, machine translation, and many more.


**`torch`**:

PyTorch is an open-source machine learning library that provides a wide range of tools and utilities for building and training custom deep learning models. It is already installed in the Colab environment, but we need to install its latest version.

**`peft`**:

🤗 PEFT (Parameter-Efficient Fine-Tuning) is a library for efficiently adapting pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model’s parameters. We use PEFT in this example because it provides the LoRA adapters that QLoRA trains on top of the quantized model.


**`bitsandbytes`**:

bitsandbytes is a library of CUDA routines for low-precision training and inference on GPUs. It offers 8-bit optimizers, which reduce the memory taken by optimizer state, and 8-bit and 4-bit quantization of model weights. The 4-bit quantization is what QLoRA uses to load the frozen base model, which makes it possible to train larger models or use larger batch sizes within the same memory constraints.


**`trl`**:

🤗 TRL (Transformer Reinforcement Learning) is a library for post-training Transformer language models with methods such as supervised fine-tuning (SFT) and reinforcement learning from human feedback. We use its `SFTTrainer` class in this example.

In [1]:
%%capture
%pip install -q torch
%pip install -q git+https://github.com/huggingface/transformers 
%pip install datasets
%pip install -q peft  
%pip install -q bitsandbytes  
%pip install -q trl  

### Imports

In [2]:
import json
import os
import re
import sys
import warnings
from collections import Counter
from pprint import pprint

import pandas as pd
import torch
from datasets import Dataset, load_dataset, load_dataset_builder
from huggingface_hub import notebook_login
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from trl import SFTTrainer  # For supervised fine-tuning

warnings.simplefilter('ignore')

## Colab Env Setup and GPU Device

In [3]:
# Colab environment
IN_COLAB = 'google.colab' in sys.modules

if (IN_COLAB):
    # Google Drive
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)

    project_folder="/content/drive/MyDrive/Classes/IA024/Aula_6_7"
    os.chdir(project_folder)
    !ls -la

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


In [4]:
!nvidia-smi -q --display=MEMORY



Timestamp                                 : Tue Apr 23 16:14:06 2024
Driver Version                            : 552.22
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:01:00.0
    FB Memory Usage
        Total                             : 16376 MiB
        Reserved                          : 313 MiB
        Used                              : 5774 MiB
        Free                              : 10290 MiB
    BAR1 Memory Usage
        Total                             : 16384 MiB
        Used                              : 1 MiB
        Free                              : 16383 MiB
    Conf Compute Protected Memory Usage
        Total                             : N/A
        Used                              : N/A
        Free                              : N/A
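
To record these numbers instead of reading them by eye, the `FB Memory Usage` block above can be parsed with a regular expression. This is a minimal sketch, assuming the report keeps the layout shown above; `parse_fb_memory` is a hypothetical helper, and the sample string is copied from the output of the previous cell.

```python
import re

# Sample copied from the `nvidia-smi -q --display=MEMORY` output above.
SAMPLE = """\
    FB Memory Usage
        Total                             : 16376 MiB
        Reserved                          : 313 MiB
        Used                              : 5774 MiB
        Free                              : 10290 MiB
    BAR1 Memory Usage
        Total                             : 16384 MiB
"""

def parse_fb_memory(report):
    """Return the FB Memory Usage fields of the first GPU as a dict of MiB values."""
    # Keep only the text between the FB and BAR1 section headers.
    section = report.split("FB Memory Usage", 1)[1].split("BAR1 Memory Usage", 1)[0]
    fields = re.findall(r"^\s*(\w+)\s*:\s*(\d+) MiB", section, flags=re.MULTILINE)
    return {name.lower(): int(value) for name, value in fields}

fb = parse_fb_memory(SAMPLE)
print(fb)
```

In the notebook, the report text could be captured with `subprocess.run(["nvidia-smi", "-q", "--display=MEMORY"], capture_output=True, text=True).stdout` and passed to the same function before and after training.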



# Downloading Dataset

Dataset Card for the IMDB Dataset (https://huggingface.co/datasets/stanfordnlp/imdb):

*Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.*

In this example, we're using the **`datasets`** library to download and load the training and test sets of the dataset.

In [5]:
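# Raw string so the backslashes in the Windows path are not read as escape sequences.
# HF_HOME is usually read at import time, so it may need to be set before the imports.
os.environ['HF_HOME'] = r'D:\Research\models\hf'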

ds_builder = load_dataset_builder("imdb")
print(ds_builder.info.description)
print(ds_builder.info.features)


{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}


Shuffle both datasets and select 1000 samples randomly from the test dataset to speed up evaluation.

In [6]:
train_dataset = load_dataset('imdb', split='train')
test_dataset = load_dataset('imdb', split='test')
train_dataset = train_dataset.shuffle(seed=42)
test_dataset = test_dataset.shuffle(seed=42).select(range(1000))


In [None]:
test_dataset[0]

# Data Preparation

Now we prepare the data for training. First, we define a prompt template with a single placeholder, `text`, for the review. Then we use the `map` method to apply this template to every example, so that the `text` column of each dataset holds the full prompt and a new `class` column holds the label as a string.

In [None]:
template = """Your task is to classify sentences' sentiment as 'positive' or 'negative'. Your answer should be one word, either 'positive' or 'negative'.

Sentence: {text}
Answer:"""

Before applying the template, we convert the labels from 0 and 1 to "negative" and "positive". We do this with the `map` method, applying to each example a function that takes the label and stores the corresponding string in the column `class`. We also replace the HTML line breaks (`<br />`) found in the reviews with spaces.

In [None]:
POSITIVE_LABEL = "positive"
NEGATIVE_LABEL = "negative"

# HTML tags removal
train_dataset = train_dataset.map(lambda example: {'text': example['text'].replace("<br />", " ")})
test_dataset = test_dataset.map(lambda example: {'text': example['text'].replace("<br />", " ")})

train_dataset = train_dataset.map(lambda example: {'class': POSITIVE_LABEL if example["label"] == 1 else NEGATIVE_LABEL})
train_dataset = train_dataset.map(lambda example: {"text": template.format(**example)})
test_dataset = test_dataset.map(lambda example: {'class': POSITIVE_LABEL if example["label"] == 1 else NEGATIVE_LABEL})
test_dataset = test_dataset.map(lambda example: {"text": template.format(**example)})
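
To make the effect of these `map` calls concrete, here is a small sketch that applies the same three steps to one hand-written example using plain dictionaries, without the `datasets` library. The review text is made up for illustration. Note that the resulting prompt ends at `Answer:`; the last lines show one possible variant, which is an assumption and not what the cells above do, that appends the label so that a training text contains the expected answer.

```python
TEMPLATE = """Your task is to classify sentences' sentiment as 'positive' or 'negative'. Your answer should be one word, either 'positive' or 'negative'.

Sentence: {text}
Answer:"""

# Hypothetical example in the same shape as an IMDB row.
example = {"text": "Great cast.<br />Loved every minute.", "label": 1}

# Step 1: replace the HTML line breaks with spaces.
example["text"] = example["text"].replace("<br />", " ")
# Step 2: map the integer label to a string.
example["class"] = "positive" if example["label"] == 1 else "negative"
# Step 3: wrap the review in the prompt template (unused keys are ignored by format).
prompt = TEMPLATE.format(**example)
print(prompt)

# Variant (an assumption, not what the notebook does): append the label for training.
training_text = prompt + " " + example["class"]
print(training_text.splitlines()[-1])
```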

In [None]:
print(train_dataset[0])
print(test_dataset[0])

# Model Evaluation (Before Fine-tuning)

## Evaluation Functions

In [None]:
from tqdm import tqdm

def classify_sentence(model, tokenizer, sentence):
  encodeds = tokenizer(sentence, return_tensors="pt", add_special_tokens=False)
  model_inputs = encodeds.to(device)

  with torch.no_grad():
    outputs = model.generate(**model_inputs,max_new_tokens=15,bos_token_id=model.config.bos_token_id,
                                eos_token_id=model.config.eos_token_id,
                                pad_token_id=model.config.eos_token_id
                             )
    torch.cuda.empty_cache()

  return tokenizer.decode(outputs[0][len(model_inputs["input_ids"][0]):], skip_special_tokens=True)

def check_sentiment(text, after_finetune):
  # Base model: the output starts with ' Negative' or ' Positive', in any case.
  if (not after_finetune):
    text = text.lower()
    # Drop the leading space and keep the 8 characters of the label.
    text = text[1:9]
  else:
    # Finetuned model is answering with scores of the format 1/10, etc.
    try:
      score = int(text.partition('/')[0])
    except ValueError:
      # For cases without a score, fall back to 5 (counted as negative below).
      score = 5

    if (score > 5):
      text = 'positive'
    else:
      text = 'negative'

  return text
  
def eval_model(model, tokenizer, after_finetune=False):
  predictions = []
  predictions_raw = []

  references = test_dataset["class"]

  text_test_dataset = test_dataset['text']

  for item in tqdm(text_test_dataset):
    predicted_raw = classify_sentence(model, tokenizer, item)
    predicted = check_sentiment(predicted_raw, after_finetune)
    predictions.append(predicted)
    predictions_raw.append(predicted_raw)
  
  correct = sum([1 for p, r in zip(predictions, references) if p.lower() == r.lower()])
  total = len(predictions)
  acc = correct/total
  
  print(f'Model Accuracy = {acc*100}%')

  return predictions_raw, predictions
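
`check_sentiment` relies on the answer starting at a fixed position in the generated text. As an alternative sketch, not the method used in this notebook, the first occurrence of either label anywhere in the output could be used instead. `parse_sentiment` is a hypothetical helper, and the sample strings are made up.

```python
import re

def parse_sentiment(generated, default="negative"):
    """Return the first 'positive' or 'negative' found in the text, else `default`."""
    match = re.search(r"\b(positive|negative)\b", generated.lower())
    return match.group(1) if match else default

for raw in [" Positive\n\nSentence: ...", "negative.", " 8/10"]:
    print(repr(raw), "->", parse_sentiment(raw))
```

The fallback to a default label is a design choice: it keeps the accuracy computation simple, but it silently counts unparseable outputs as one class, so it is worth inspecting the raw outputs as the notebook does in the next cells.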

#### Base Model Instantiation and Evaluation

In [None]:
# The model that you want to train from the Hugging Face hub
model_name = "microsoft/phi-1_5"
# Fine-tuned model name
new_model = "phi-1_5-IMDB"

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [None]:
# Evaluate for 1000 samples
raw, pred = eval_model(base_model, tokenizer)

### Checking raw output of first few characters

In [None]:
raw_first_chars = [s[1:9] for s in raw]
item_counts = Counter(raw_first_chars)
print(f'Model output stats: {item_counts}')  

# Fine-tuning

## Setting Model Parameters

We need to set various parameters for our fine-tuning process, including QLoRA (Quantized LoRA) parameters, bitsandbytes parameters, and training arguments.

Setting the QLoRA parameters:

1. **lora_r (LoRA attention dimension)**:
   - The rank of the update matrices, an integer. A lower rank results in smaller update matrices with fewer trainable parameters.

2. **lora_alpha (Alpha parameter for LoRA scaling)**:
   - The LoRA scaling factor. The low-rank update is multiplied by `lora_alpha / lora_r` before being added to the output of the frozen layer.

3. **lora_dropout (Dropout probability for LoRA layers)**:
   - The dropout probability applied inside the LoRA layers during training.

In [None]:
# LoRA attention dimension
lora_r = 64 # @param

# Alpha parameter for LoRA scaling
lora_alpha = 16 # @param

# Dropout probability for LoRA layers
lora_dropout = 0.1 # @param
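
As a quick check of what these values imply, the sketch below computes the LoRA scaling factor `lora_alpha / lora_r` and the number of trainable parameters that a rank-`r` adapter adds to a single linear layer. The 2048 × 2048 layer size is an assumption chosen for illustration, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters of the two LoRA matrices: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r

lora_r, lora_alpha = 64, 16
scaling = lora_alpha / lora_r   # factor applied to the low-rank update
d = 2048                        # assumed layer width, for illustration only
full = d * d                    # weights of the frozen d x d linear layer
adapter = lora_param_count(d, d, lora_r)

print(f"scaling = {scaling}")
print(f"adapter params = {adapter:,} ({adapter / full:.2%} of the frozen layer)")
```

With `lora_alpha` smaller than `lora_r`, the update is scaled down; raising `lora_alpha` or lowering `lora_r` increases the weight given to the adapter.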

Next come the bitsandbytes parameters, which control 4-bit precision in model loading and computation:

1. **use_4bit (Activate 4-bit precision base model loading)**:
   - When set to `True`, the weights of the base (pre-trained) model are loaded in 4-bit precision.

2. **bnb_4bit_compute_dtype (Compute dtype for 4-bit base models)**:
   - The data type used for computations with the 4-bit base model. The weights are stored in 4 bits and dequantized to this type when they are used.
   - The value `"float16"` means computations are done in 16-bit floating point.

3. **bnb_4bit_quant_type (Quantization type)**:
   - The quantization scheme used for the 4-bit weights.
   - `"fp4"` is a 4-bit floating-point format; `"nf4"` is the 4-bit NormalFloat format introduced in the QLoRA paper.

4. **use_nested_quant (Activate nested quantization for 4-bit base models)**:
   - When set to `True`, this parameter enables nested quantization for 4-bit base models.
   - Nested quantization, also called double quantization, quantizes the quantization constants themselves, which reduces the memory footprint a little further.

In [None]:
# Activate 4-bit precision base model loading
use_4bit = True # @param

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16" # @param

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4" # @param ["nf4","fp4"]

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False # @param
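
A back-of-the-envelope estimate shows why 4-bit loading matters. The sketch assumes roughly 1.4 billion parameters for phi-1.5 (an approximation, not measured in this notebook) and ignores quantization constants, activations, and any layers kept in higher precision.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 1.4e9  # assumed parameter count for phi-1.5 (approximate)

for name, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{name:>8}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

The usage reported by `nvidia-smi` will be higher than these figures, since it also includes the CUDA context, activations, and, during training, gradients and optimizer state for the adapter.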

Now, let's define the training arguments.

1. **output_dir**:
   - Specifies the directory where the model predictions and checkpoints will be stored.

2. **num_train_epochs**:
   - Sets the number of epochs for training, where one epoch means one pass through the entire training dataset. We set it to `1`, but a positive `max_steps` takes precedence over it.

3. **fp16, bf16**:
   - Enable training with 16-bit floating-point precision (`fp16`) or 16-bit bfloat precision (`bf16`).

4. **per_device_train_batch_size**:
   - Determines the batch size for training per GPU. This depends on the GPU used: for an A100, we can use a batch size of 16 examples, while this notebook uses 4.

5. **per_device_eval_batch_size**:
   - Sets the batch size for evaluation per GPU.

6. **gradient_accumulation_steps**:
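   - Indicates the number of forward/backward passes whose gradients are accumulated before each optimizer update. On a single GPU, the effective batch size is `per_device_train_batch_size` times `gradient_accumulation_steps`.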

7. **gradient_checkpointing**:
   - When enabled, activations are recomputed during the backward pass instead of being stored, trading extra compute for lower memory use. Useful for training large models that would otherwise not fit in memory.

8. **max_grad_norm (Maximum gradient norm)**:
   - Specifies the maximum norm of gradients for gradient clipping, a technique to prevent exploding gradients in deep networks.

9. **learning_rate**:
   - Sets the initial learning rate for the AdamW optimizer.

10. **weight_decay**:
    - Specifies the weight decay to apply to all layers except those with bias or LayerNorm weights, as a regularization technique.

11. **optim**:
    - Defines the optimizer to use. Here we use `paged_adamw_32bit`, a paged AdamW variant from bitsandbytes that can move optimizer state to CPU memory when GPU memory runs short.

12. **lr_scheduler_type**:
    - Determines the learning rate schedule to use. "constant" means the learning rate stays the same throughout training.

13. **max_steps**:
    - Overrides `num_train_epochs` by setting the number of training steps. If set to a negative value, it's ignored. We set this to `400` to reduce the training time, which means that our example training does not use the entire training set.

14. **warmup_ratio**:
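    - Indicates the proportion of total training steps to use for linear warmup of the learning rate. Note that the plain `"constant"` schedule selected here does not apply warmup; `"constant_with_warmup"` would be needed for this value to take effect.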

15. **group_by_length**:
    - When enabled, sequences are grouped by length into batches. This can save memory and speed up training.

16. **save_steps**:
    - Determines how often to save a model checkpoint in terms of training steps.

17. **logging_steps**:
    - Sets the frequency, in terms of training steps, for logging training progress.


In [None]:
# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results" # @param

# Number of training epochs
num_train_epochs = 1 # @param

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False # @param
bf16 = False # @param

# Batch size per GPU for training
per_device_train_batch_size = 4 # @param

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4 # @param

# Number of update steps to accumulate the gradients for
#gradient_accumulation_steps = 1 # @param
gradient_accumulation_steps = 2

# Enable gradient checkpointing
gradient_checkpointing = True # @param

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3 # @param

# Initial learning rate (AdamW optimizer)
#learning_rate = 5e-4 # @param
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001 # @param

# Optimizer to use
optim = "paged_adamw_32bit" # @param

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant" # @param

# Number of training steps (overrides num_train_epochs)
max_steps = 400 # @param
#max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03 # @param

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True # @param

# Save checkpoint every X updates steps
save_steps = 25 # @param

# Log every X updates steps
logging_steps = 25 # @param
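
With the values above, a few lines of arithmetic show how much of the training set is actually seen. The sketch assumes a single GPU and uses the 25,000 training reviews stated in the dataset card.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
max_steps = 400
train_size = 25_000  # IMDB training split size, from the dataset card

# Examples contributing to each optimizer update (single GPU assumed).
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
# Examples consumed over the whole run.
examples_seen = effective_batch_size * max_steps

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen: {examples_seen} ({examples_seen / train_size:.1%} of one epoch)")
```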

Now let's define the SFTTrainer parameters:

1. **max_seq_length**:
   - This parameter specifies the maximum sequence length to be used.

2. **packing**:
   - This parameter indicates whether or not to pack multiple short examples into the same input sequence.
   - When set to `True`, this technique can be used to increase computational efficiency, particularly in batch processing.

3. **device_map**:
   - This parameter is a dictionary that maps parts of the model to specific computing devices.
   - The entry `{"": 0}` specifies that the entire model will be loaded onto GPU 0.

In [None]:
# Maximum sequence length to use
max_seq_length = 512

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
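
Assuming truncation to `max_seq_length` removes tokens from the end of the text (the usual tokenizer default), a long review can push the final `Answer:` line out of the sequence. One way to avoid this is to shorten the review itself before filling the template. The sketch below uses whitespace-separated words as a stand-in for tokens, which is an assumption: in the notebook the budget would be counted with the model tokenizer. `truncate_review` is a hypothetical helper, and the template is shortened for illustration.

```python
def truncate_review(review, max_words):
    """Keep at most `max_words` whitespace-separated words of the review."""
    words = review.split()
    return " ".join(words[:max_words])

TEMPLATE = "Sentence: {text}\nAnswer:"  # shortened template, for illustration

review = "one two three four five six seven eight"
prompt = TEMPLATE.format(text=truncate_review(review, max_words=5))
print(prompt)
```

Trying a few budgets while watching `nvidia-smi` is one way to find how many tokens fit on the GPU, as the exercise asks.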

Load the base model with QLoRA configuration

In [None]:
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)

base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

Load tokenizer


In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [None]:
# Evaluate for 1000 samples
_, _ = eval_model(base_model, tokenizer)

## Fine-Tuning with QLoRA and Supervised Fine-Tuning

We're ready to fine-tune our model using QLoRA. For this tutorial, we'll use the `SFTTrainer` from the `trl` library.

In the code below, `target_modules` lists the layers of the model that LoRA (Low-Rank Adaptation) will adapt. LoRA adapts a pre-trained model by training a small number of additional parameters while the original weights stay frozen. The names follow the convention of Llama-style Transformer implementations, so which of them actually exist depends on the model; printing the model, or iterating over `named_modules()`, shows the names available in phi-1.5. Here's what each name usually refers to:

1. **q_proj, k_proj, v_proj, o_proj**:
   - These are the linear projections for query (q), key (k), value (v), and output (o) in the attention mechanism of a Transformer model.

2. **gate_proj**:
   - This is the gating projection of a gated feed-forward (MLP) block, as used in Llama-style models.

3. **up_proj, down_proj**:
   - These are the projections that expand the hidden state to the intermediate MLP dimension and project it back to the hidden dimension.

4. **lm_head**:
   - This is the language model head of a Transformer, the final linear layer that maps hidden states to scores over the vocabulary (for example, for the next token in a sequence).

In [None]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,

)
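
To see which modules the `target_modules` list in `LoraConfig` would select, its entries can be compared against the module names of the model. The sketch below mimics suffix matching on a hand-written list of names. Both the matching rule (a target matches a module whose name equals it or ends with `.` plus it) and the module names are assumptions for illustration; in the notebook the real names would come from `base_model.named_modules()`.

```python
def matching_modules(module_names, target_modules):
    """Return names that equal a target or end with '.' followed by a target."""
    return [
        name for name in module_names
        if any(name == t or name.endswith("." + t) for t in target_modules)
    ]

# Hypothetical module names, not read from the real phi-1.5 checkpoint.
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.dense",
    "model.layers.0.mlp.fc1",
    "lm_head",
]
targets = ["q_proj", "k_proj", "o_proj", "lm_head"]

print(matching_modules(names, targets))
```

If a target name matches nothing, that part of the model is simply not adapted, so checking the match list before training helps confirm that the intended layers receive LoRA adapters.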

## Let's start the training process

In [None]:
# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

### Check VRAM usage for training

In [None]:
!nvidia-smi -q --display=MEMORY

## Merge the fine-tuned model

After fine-tuning, we can merge the trained LoRA adapter into the base model to get a single model that can be used for inference. This is done with the PEFT library. First, let's free GPU memory by deleting the model used for training. You can also restart the runtime to clear the GPU memory.

In [None]:
# Empty VRAM
import gc
del base_model
gc.collect()

#del trainer
gc.collect()

In [None]:
torch.cuda.empty_cache()

In [None]:
gc.collect()

Now, let's load the base model and fine-tuned model and merge them using PEFT.

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
merged_model= PeftModel.from_pretrained(base_model, new_model,)
merged_model= merged_model.merge_and_unload()
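
Conceptually, `merge_and_unload()` folds the low-rank update into the frozen weights, so the adapter no longer needs to be applied separately at inference time. The toy example below, in plain Python with made-up matrices, checks that `W + scaling * (B @ A)` gives the same output as applying the base layer and the adapter separately. It is a sketch of the idea, not the PEFT implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b, scale=1.0):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

W = [[1.0, 2.0], [3.0, 4.0]]  # frozen weight (out x in)
A = [[1.0, 0.0]]              # LoRA A, rank 1 (r x in)
B = [[2.0], [4.0]]            # LoRA B (out x r)
scaling = 0.25                # lora_alpha / lora_r = 16 / 64
x = [[1.0], [1.0]]            # input column vector

# Base layer output plus the scaled adapter output.
separate = add(matmul(W, x), matmul(B, matmul(A, x)), scale=scaling)
# Output of a single layer whose weight already contains the update.
merged_W = add(W, matmul(B, A), scale=scaling)
merged = matmul(merged_W, x)

print(separate, merged)
```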

Let's save our merged model

In [None]:
# Save the merged model
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Test the merged model (After Fine-Tuning)

In [None]:
test_dataset[0]

In [None]:
_, _ = eval_model(merged_model, tokenizer, after_finetune=True)

#### Check VRAM usage for inference

In [None]:
!nvidia-smi -q --display=MEMORY