## Module 2 - Fine-tuning Phi-1.5 for sentence classification using QLoRA

Student: Leandro Carísio Fernandes

This notebook presents an example of how to fine-tune Phi-1.5 for sentence classification using QLoRA.

QLoRA is a fine-tuning approach that reduces memory usage enough to fine-tune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning task performance. For more details, please refer to the [QLoRA paper](https://arxiv.org/abs/2305.14314).


## Dataset size multipliers

Multipliers for the size of the training and test datasets. A value of 1 uses the full dataset. A smaller value (useful while testing) is multiplied by the original size to obtain the number of examples kept.

In [None]:
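# Fraction of the training split to use (1 = full dataset)
MULT_TRAIN = 1                  # @param
# Fraction of the test split to use
MULT_TEST = 0.1                 # @param
# Maximum number of tokens per training example (template + review text)
NUM_MAX_TOKENS = 400            # @param
# Whether to run the (slow) evaluation of the base model before fine-tuning
FAZER_AVALIACAO_INICIAL = False # @param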

# Installing required packages

In this example, we have to install the following libraries:  `transformers`, `datasets`, `torch`, `peft`, `bitsandbytes`, and `trl`.

**`transformers`**:

Transformers is an open-source library for NLP developed by Hugging Face. It provides state-of-the-art pre-trained models for various NLP tasks, such as text classification, sentiment analysis, question-answering, named entity recognition, etc.

**`datasets`**:

Datasets is another open-source library developed by Hugging Face that provides a collection of preprocessed datasets for various NLP tasks, such as sentiment analysis, natural language inference, machine translation, and many more.


**`torch`**:

PyTorch is an open-source machine learning library that provides a wide range of tools and utilities for building and training custom deep learning models. It is already installed in the Colab environment, so the `pip install` below is usually a no-op there.

**`peft`**:

🤗 PEFT (Parameter-Efficient Fine-Tuning) is a library for efficiently adapting pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model’s parameters. We use PEFT in this example because it supports QLoRA.


**`bitsandbytes`**:

BitsAndBytes is a library designed to reduce the memory needed to train neural networks on modern GPUs. It offers efficient implementations of 8-bit optimizers as well as 8-bit and 4-bit quantization of model weights. This reduction in memory usage enables training larger models or using larger batch sizes within the same memory constraints. In this example, we use it to load the base model in 4-bit precision.


**`trl`**:

🤗 TRL (Transformer Reinforcement Learning) is a library for training transformer language models with methods such as supervised fine-tuning (SFT) and reinforcement learning. We use its `SFTTrainer` for supervised fine-tuning in this example.

In [None]:
!pip install -q torch
!pip install -q git+https://github.com/huggingface/transformers  # Hugging Face Transformers, installed from source
!pip install datasets
!pip install -q peft  # Parameter-efficient fine-tuning, used for QLoRA
!pip install -q bitsandbytes  # Model weight quantization
!pip install -q trl  # Transformer Reinforcement Learning, used for supervised fine-tuning


# Setting the device

In this example, we will use a GPU to speed up the fine-tuning process. GPUs (Graphics Processing Units) are specialized processors that are optimized for performing large-scale computations in parallel. By using a GPU, we can accelerate the training and inference of a machine learning model, which can significantly reduce the time required to complete these tasks.

Before we begin, we need to check whether a GPU is available and select it as the default device for our PyTorch operations. This is because PyTorch can use either a CPU or a GPU to perform computations, and by default, it will use the CPU.

To use a GPU in Google Colab:
1. Click on the "Runtime" menu at the top of the screen.
2. From the dropdown menu, click on "Change runtime type".
3. In the popup window that appears, select "A100 GPU" as the hardware accelerator.
4. Click on the "Save" button.

That's it! Now you can use the GPU for faster computations in your notebook.

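**IMPORTANT**: This example was originally written for a GPU with at least 40GB of memory, which you can select in Google Colab by following the steps above. The run recorded in this notebook used a Tesla T4 with about 15GB of memory (see the `nvidia-smi` output below), which was enough for Phi-1.5 loaded in 4-bit with the batch size configured here. If you are using a different environment, check the available GPU memory before starting.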

In [None]:
!nvidia-smi

Wed Apr 24 17:45:09 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Fine-tuning

## Setting Model Parameters

We need to set various parameters for our fine-tuning process, including QLoRA (Quantized LoRA) parameters, bitsandbytes parameters, and training arguments:

In [None]:
# The model that you want to train from the Hugging Face hub
# model_name = "mistralai/Mistral-7B-Instruct-v0.1"
model_name = "microsoft/phi-1_5"

# Fine-tuned model name
new_model = "phi-1_5-IMDB"

Setting the QLoRA parameters:

1. **lora_r (LoRA attention dimension)**:
   - The rank of the update matrices, given as an integer. A lower rank results in smaller update matrices with fewer trainable parameters.

2. **lora_alpha (Alpha parameter for LoRA scaling)**:
   - The LoRA scaling parameter. The low-rank update is multiplied by `lora_alpha / lora_r` before being added to the frozen weights.

3. **lora_dropout (Dropout probability for LoRA layers)**:
   - The dropout rate applied to the input of the LoRA layers during training.

In [None]:
# LoRA attention dimension
lora_r = 64 # @param

# Alpha parameter for LoRA scaling
lora_alpha = 16 # @param

# Dropout probability for LoRA layers
lora_dropout = 0.1 # @param
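
To make the effect of `lora_r` and `lora_alpha` more concrete, here is a minimal sketch that counts the trainable parameters LoRA adds to a single linear layer and computes the scaling factor. The 2048 x 2048 layer size is an assumption chosen to resemble an attention projection in Phi-1.5; check `model.config.hidden_size` for the real value.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer.

    LoRA learns A (r x d_in) and B (d_out x r); the original weight stays frozen.
    """
    return r * d_in + d_out * r


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


# Assumption: a square 2048 x 2048 projection (illustrative size only).
hidden_size = 2048
full = hidden_size * hidden_size
added = lora_param_count(hidden_size, hidden_size, r=64)

print(f"frozen weights in the layer: {full:,}")
print(f"LoRA weights for the layer:  {added:,} ({added / full:.2%} of the layer)")
print(f"scaling factor alpha / r:    {lora_scaling(16, 64)}")
```

With `lora_r = 64` and `lora_alpha = 16`, the update is scaled by 0.25, so raising the rank without raising alpha makes each adapter's contribution smaller.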

Bitsandbytes parameters. These parameters focus on the implementation of 4-bit precision in model loading and computation. Here's an explanation of each:

1. **use_4bit (Activate 4-bit precision base model loading)**:
   - This parameter, when set to `True`, indicates that the base model (i.e., the pre-trained model or initial model weights) should be loaded using 4-bit precision.
2. **bnb_4bit_compute_dtype (Compute dtype for 4-bit base models)**:
   - This parameter specifies the data type to be used for computations in the context of 4-bit base models.
   - The value `"float16"` indicates that computations should be done using 16-bit floating-point numbers.

3. **bnb_4bit_quant_type (Quantization type)**:
   - This parameter determines the type of quantization to be used for the 4-bit models.
   - The options `"fp4"` and `"nf4"` refer to different quantization schemes.

4. **use_nested_quant (Activate nested quantization for 4-bit base models)**:
   - When set to `True`, this parameter enables nested quantization for 4-bit base models.
   - Nested quantization, often referred to as double quantization, quantizes the quantization constants themselves (the per-block scaling factors produced by the first quantization step). This saves a further fraction of a bit per parameter on top of the 4-bit weights.

In [None]:
# Activate 4-bit precision base model loading
use_4bit = True # @param

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16" # @param

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4" # @param ["nf4","fp4"]

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False # @param
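
As a rough illustration of what these settings mean for memory, the sketch below estimates the size of the quantized weights with and without double quantization. The block sizes (64 weights per block, 256 constants per second-level block) follow the figures reported in the QLoRA paper. The parameter count is an assumption derived from a 2.84 GB checkpoint stored at 2 bytes per parameter, and the estimate ignores layers that stay in higher precision, so treat it as an approximation only.

```python
def quant_constant_bits(block_size: int = 64,
                        double_quant: bool = False,
                        second_block_size: int = 256) -> float:
    """Extra bits per parameter spent on quantization constants.

    Without double quantization: one 32-bit constant per block of weights.
    With double quantization: 8-bit constants, plus one 32-bit constant
    for every `second_block_size` first-level constants.
    """
    if not double_quant:
        return 32 / block_size
    return 8 / block_size + 32 / (block_size * second_block_size)


def weights_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed for n_params weights at the given bit width, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Assumption: about 1.42e9 parameters (2.84 GB checkpoint / 2 bytes each).
n_params = 2.84e9 / 2

for label, dq in [("nf4", False), ("nf4 + double quant", True)]:
    bits = 4 + quant_constant_bits(double_quant=dq)
    print(f"{label:20s} {bits:.3f} bits/param -> {weights_gib(n_params, bits):.2f} GiB")
print(f"{'float16':20s} {16:.3f} bits/param -> {weights_gib(n_params, 16):.2f} GiB")
```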

Now, let's define the training arguments.

1. **output_dir**:
   - Specifies the directory where the model predictions and checkpoints will be stored.

2. **num_train_epochs**:
   - Sets the number of epochs for training, where one epoch means one pass through the entire training dataset. We set it to `1`, but `max_steps` (below) takes precedence in this notebook.

3. **fp16, bf16**:
   - Enable training with 16-bit floating-point precision (`fp16`) or 16-bit bfloat precision (`bf16`).

4. **per_device_train_batch_size**:
   - Determines the batch size for training per GPU. The right value depends on the GPU memory available: an A100 can handle a batch size of 16 examples, while this notebook uses `4`.

5. **per_device_eval_batch_size**:
   - Sets the batch size for evaluation per GPU.

6. **gradient_accumulation_steps**:
   - Indicates the number of update steps over which to accumulate gradients.

7. **gradient_checkpointing**:
   - When enabled, saves memory by trading compute for memory. Useful for training large models that would otherwise not fit in memory.

8. **max_grad_norm (Maximum gradient norm)**:
   - Specifies the maximum norm of gradients for gradient clipping, a technique to prevent exploding gradients in deep networks.

9. **learning_rate**:
   - Sets the initial learning rate for the AdamW optimizer.

10. **weight_decay**:
    - Specifies the weight decay to apply to all layers except those with bias or LayerNorm weights, as a regularization technique.

11. **optim**:
    - Defines the optimizer to use, here `paged_adamw_32bit`, a paged AdamW variant backed by bitsandbytes that can page optimizer state out of GPU memory to avoid out-of-memory spikes.

12. **lr_scheduler_type**:
    - Determines the learning rate schedule to use. "constant" means the learning rate stays the same throughout training.

13. **max_steps**:
    - Overrides `num_train_epochs` by setting the total number of training steps. If set to a negative value, it is ignored. We set this to `500` to reduce the training time, which means that this example does not use the entire training set.

14. **warmup_ratio**:
    - Indicates the proportion of total training steps to use for linear warmup of the learning rate.

15. **group_by_length**:
    - When enabled, sequences are grouped by length into batches. This can save memory and speed up training.

16. **save_steps**:
    - Determines how often to save a model checkpoint in terms of training steps.

17. **logging_steps**:
    - Sets the frequency, in terms of training steps, for logging training progress.


In [None]:
# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results" # @param

# Number of training epochs
num_train_epochs = 1 # @param

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False # @param
bf16 = False # @param

# Batch size per GPU for training
per_device_train_batch_size = 4 # @param

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4 # @param

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1 # @param

# Enable gradient checkpointing
gradient_checkpointing = True # @param

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3 # @param

# Initial learning rate (AdamW optimizer)
learning_rate = 5e-4 # @param

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001 # @param

# Optimizer to use
optim = "paged_adamw_32bit" # @param

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant" # @param

# Number of training steps (overrides num_train_epochs)
max_steps = 500 # @param

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03 # @param

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True # @param

# Save checkpoint every X updates steps
save_steps = 25 # @param

# Log every X updates steps
logging_steps = 25 # @param
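
To see how much of the training set these settings actually cover, the following sketch multiplies out the batch size, gradient accumulation, and step count. It assumes a single GPU and the full IMDB training split of 25,000 examples (`MULT_TRAIN = 1`); the helper name is illustrative and not part of any library.

```python
def examples_seen(per_device_batch: int, grad_accum: int,
                  max_steps: int, n_devices: int = 1) -> int:
    """Number of training examples consumed after max_steps optimizer steps."""
    return per_device_batch * grad_accum * n_devices * max_steps


train_size = 25_000  # assumption: full IMDB training split
seen = examples_seen(per_device_batch=4, grad_accum=1, max_steps=500)

print(f"examples seen: {seen}")
print(f"fraction of the training set: {seen / train_size:.0%}")
```

Under these assumptions the model sees 2,000 examples, or 8% of one epoch, which is why `num_train_epochs` has no effect here.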

Now let's define the SFTTrainer parameters:

1. **max_seq_length**:
   - This parameter specifies the maximum sequence length to be used.

2. **packing**:
   - This parameter indicates whether or not to pack multiple short examples into the same input sequence.
   - When set to `True`, this technique can be used to increase computational efficiency, particularly in batch processing.

3. **device_map**:
   - This parameter is a dictionary that maps parts of the model to specific computing devices.
   - The entry `{"": 0}` specifies that the entire model will be loaded onto GPU 0.

In [None]:
# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
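
Packing is disabled in this notebook, but the idea can be sketched in a few lines. The function below is a simplified illustration of packing, not the implementation used by `trl`: it concatenates tokenized examples, separated by an end-of-sequence id, and cuts the resulting stream into fixed-length chunks. The token ids and the `eos_id` value are made up for the example.

```python
from typing import Iterable, List


def pack_sequences(tokenized: Iterable[List[int]], seq_len: int,
                   eos_id: int) -> List[List[int]]:
    """Concatenate examples (each followed by eos_id) and split the stream
    into chunks of exactly seq_len tokens. An incomplete tail is dropped."""
    stream: List[int] = []
    for ids in tokenized:
        stream.extend(ids)
        stream.append(eos_id)
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]


# Hypothetical token ids for three short examples.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(examples, seq_len=4, eos_id=0)
print(packed)
```

Without packing, each of these three examples would occupy its own padded sequence. With packing, no position is spent on padding, at the cost of examples being split across sequence boundaries.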

### Let's load the base model
Let's load the Phi-1.5 base model:

In [None]:
import json
import re
from pprint import pprint

import pandas as pd
import torch
from datasets import Dataset, load_dataset
from huggingface_hub import notebook_login
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from trl import SFTTrainer # For supervised finetuning

Load the base model with QLoRA configuration

In [None]:
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)

base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Load tokenizer


In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


In [None]:
# Looking at Ramon Simões Abilio's code, I saw that he registers as special tokens some of the
# tokens used in the instruction (<s>, </s>, [INST], [\INST]).
# Without this, the tokenizer splits these tokens into pieces. For example:
# (Raw strings keep the backslash in [\INST] literal without an invalid-escape warning.)

tokens_sem_special_tokens = tokenizer(r"<s>[INST] Your task is [\INST]</s>")['input_ids']
print(tokens_sem_special_tokens)
print(tokenizer.convert_ids_to_tokens(tokens_sem_special_tokens))

# I will add these special tokens as well:
special_tokens_dict = {'additional_special_tokens': ['<s>', '</s>', '[INST]', r'[\INST]']}
tokenizer.add_special_tokens(special_tokens_dict)

tokens_com_special_tokens = tokenizer(r"<s>[INST] Your task is [\INST]</s>")['input_ids']
print(tokens_com_special_tokens)
print(tokenizer.convert_ids_to_tokens(tokens_com_special_tokens))

[27, 82, 36937, 38604, 60, 3406, 4876, 318, 685, 59, 38604, 60, 3556, 82, 29]
['<', 's', '>[', 'INST', ']', 'ĠYour', 'Ġtask', 'Ġis', 'Ġ[', '\\', 'INST', ']', '</', 's', '>']
[50295, 50297, 3406, 4876, 318, 220, 50298, 50296]
['<s>', '[INST]', 'ĠYour', 'Ġtask', 'Ġis', 'Ġ', '[\\INST]', '</s>']


## Downloading Dataset

The IMDB dataset (Large Movie Review Dataset) is popular for sentiment analysis tasks in Natural Language Processing (NLP). It consists of movie reviews from the IMDb website that are labeled with either a positive or negative sentiment. The dataset contains 50,000 labeled reviews, split into 25,000 for training and 25,000 for testing, with half of the reviews in each split labeled as positive and the other half as negative. It also includes 50,000 unlabeled reviews, which are not used here. Unlike sentence-level datasets, the reviews are full-length texts, which is why we truncate them to a maximum number of tokens later on.

IMDB has become a benchmark dataset for binary sentiment classification in NLP, and many researchers use it to evaluate the performance of their models. The task's relative simplicity makes it an accessible starting point for researchers and developers new to NLP.

In this example, we're using the **`datasets`** library to download and load the training and test splits of the dataset.

In [None]:
from datasets import load_dataset
import random

test_dataset = load_dataset('imdb', split='test')
train_dataset = load_dataset('imdb', split='train')

# Reduce the size of the training and test datasets
# https://huggingface.co/docs/datasets/v1.2.0/processing.html
random.seed(42)

full_len_train = len(train_dataset)
small_len_train = int(full_len_train * MULT_TRAIN)
train_dataset = train_dataset.select(random.sample(range(full_len_train), small_len_train))

full_len_test = len(test_dataset)
small_len_test = int(full_len_test * MULT_TEST)
test_dataset = test_dataset.select(random.sample(range(full_len_test), small_len_test))


## Data Preparation

Now, we will prepare the data for training our model. First, we define a prompt template with the placeholders `text` and `class`. Later, we use the `map` method to apply this template to the dataset, which overwrites the `text` column of each example with the fully formatted prompt.

In [None]:
template = """<s>[INST] Your task is to classify sentences' sentiment as 'positive' or 'negative'.

Sentence: {text} [\INST]
{class}</s>"""

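Before applying the template, we need to clean the reviews and convert the labels. The `preprocess` function below strips HTML tags and truncates each review so that the template plus the review fits within `NUM_MAX_TOKENS` tokens. The labels 0 and 1 are then converted to "negative" and "positive": the `map` method applies a function to each example that takes the label as input and stores the corresponding string in the column `class`.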

In [None]:
import re

TOTAL_TOKENS_TEMPLATE = len(tokenizer(template.format(**{"text": "", "class": ""})).input_ids)
NUM_MAX_TOKENS_TEXT = NUM_MAX_TOKENS - TOTAL_TOKENS_TEMPLATE

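re_html = re.compile('<[^>]+>')  # Regex suggested by ChatGPT: matches any HTML tag

def preprocess(example):
  # Strip HTML tags
  text = re_html.sub('', example["text"])

  # Truncate the review so that template + text fit within NUM_MAX_TOKENS
  tokens = tokenizer(text, truncation=True, max_length=NUM_MAX_TOKENS_TEXT).input_ids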
  return {
      "text": tokenizer.decode(tokens),
      "label": example["label"]
  }

In [None]:
POSITIVE_LABEL = "positive"
NEGATIVE_LABEL = "negative"

# Clean and truncate the text
train_dataset = train_dataset.map(preprocess)
# Convert label to class
train_dataset = train_dataset.map(lambda example: {'class': POSITIVE_LABEL if example["label"] == 1 else NEGATIVE_LABEL})
# Apply the template
train_dataset = train_dataset.map(lambda example: {"text": template.format(**example)})
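
The three `map` steps above can be traced on a single record. The sketch below reproduces them on a made-up review without loading the tokenizer, so the token-based truncation is deliberately left out; the sample text and the `build_training_text` helper are hypothetical and exist only for this illustration.

```python
import re

TEMPLATE = (
    "<s>[INST] Your task is to classify sentences' sentiment as 'positive' or 'negative'.\n"
    "\n"
    "Sentence: {text} [\\INST]\n"
    "{class}</s>"
)
RE_HTML = re.compile(r"<[^>]+>")


def build_training_text(example: dict) -> str:
    """Strip HTML, map the integer label to a class name, and fill the template."""
    text = RE_HTML.sub("", example["text"])
    class_name = "positive" if example["label"] == 1 else "negative"
    # "class" is a Python keyword, so the values are passed through a dict.
    return TEMPLATE.format(**{"text": text, "class": class_name})


# Made-up review containing the <br /> tags that are common in IMDB texts.
sample = {"text": "Great cast.<br /><br />Loved it.", "label": 1}
result = build_training_text(sample)
print(result)
```

Note that removing the tags joins the surrounding sentences without a space ("cast.Loved"), which is also what the regex in `preprocess` does.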



In [None]:
print(train_dataset[0]['text'])
print(train_dataset[1]['text'])

<s>[INST] Your task is to classify sentences' sentiment as 'positive' or 'negative'.

Sentence: Arguably this is a very good "sequel", better than the first live action film 101 Dalmatians. It has good dogs, good actors, good jokes and all right slapstick! Cruella DeVil, who has had some rather major therapy, is now a lover of dogs and very kind to them. Many, including Chloe Simon, owner of one of the dogs that Cruella once tried to kill, do not believe this. Others, like Kevin Shepherd (owner of 2nd Chance Dog Shelter) believe that she has changed. Meanwhile, Dipstick, with his mate, have given birth to three cute dalmatian puppies! Little Dipper, Domino and Oddball...Starring Eric Idle as Waddlesworth (the hilarious macaw), Glenn Close as Cruella herself and Gerard Depardieu as Le Pelt (another baddie, the name should give a clue), this is a good family film with excitement and lots more!! One downfall of this film is that is has a lot of painful slapstick, but not quite as excessiv

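The cells below inspect a raw test example and then apply the same cleaning to the test dataset, adding a `class` column with the strings `"positive"` and `"negative"` derived from the `label` column. This is for comparing the model's predictions with the actual labels of the dataset. The template is not applied to the test set because the prompt is built at inference time.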

In [None]:
test_dataset[0]

{'text': '"Tamara" just felt like another teen oriented knock-off of the "I Know What You Did Last Summer" trend and is painfully dull. A high school outcast, who is heavily into witchcraft and black magic, is accidentally killed during a cruel prank carried out by a group of bullies who secretly bury her in the woods, vowing to tell no one. The next day, the supposedly "dead" Tamara, arrives at school with a completely new image and seduces her would-be killers and has a little revenge... This is basically a combination of "Carrie", "The Craft", and every other straight-to-video, teeny bopper turkey that hits the shelves these days. The actors are absolutely atrocious and look about ten years too old to pass off as high schoolers. There IS some gore, which is actually nothing all that interesting since the movie is so boring and I couldn\'t wait for it to end. If you like modern garbage than I insist you seek this one out, otherwise don\'t bother...',
 'label': 0}

In [None]:
# Clean and truncate the text
test_dataset = test_dataset.map(preprocess)
# Convert label to class
test_dataset = test_dataset.map(lambda example: {'class': POSITIVE_LABEL if example["label"] == 1 else NEGATIVE_LABEL})


## Fine-Tuning with QLoRA and Supervised Fine-Tuning

We're ready to fine-tune our model using QLoRA. For this tutorial, we'll use the `SFTTrainer` from the `trl` library.

In the context of the code below, `target_modules` lists the names of the layers of the model that will be adapted using LoRA (Low-Rank Adaptation). LoRA is a technique for adapting pre-trained models with a small number of additional parameters, commonly applied to the linear layers of Transformer models. The list used here follows the naming of Llama/Mistral-style architectures. Names that do not exist in the loaded model are not adapted, so it is worth checking them against the output of `base_model.named_modules()`; in the Phi implementation, the attention output projection and the MLP layers appear to use different names. Here's a breakdown of what each module represents:

1. **q_proj, k_proj, v_proj, o_proj**:
   - These refer to the projections for query (q), key (k), value (v), and output (o) in the attention mechanism of a Transformer model.

2. **gate_proj**:
   - This refers to the gating projection of a gated feed-forward (MLP) block, as used in Llama/Mistral-style models. It is unrelated to recurrent gates such as those in GRUs.

3. **up_proj, down_proj**:
   - These refer to the feed-forward (MLP) projections that expand the hidden state to the intermediate dimension (`up_proj`) and project it back to the hidden dimension (`down_proj`).

4. **lm_head**:
   - This refers to the language model head of a Transformer, which is the final layer that produces the output (the scores for the next token in a sequence).
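
To check which entries of `target_modules` would actually be adapted, the names can be compared against the module names of the model. The sketch below assumes a suffix-matching rule (a module matches when its name equals a target or ends with `.<target>`), which is how a list of target names is commonly resolved. The module names are hypothetical stand-ins for a Phi-style layer; in the notebook they would come from `[name for name, _ in base_model.named_modules()]`.

```python
from typing import List


def matches_target(module_name: str, targets: List[str]) -> bool:
    """True if the module name equals a target or ends with '.<target>'."""
    return any(module_name == t or module_name.endswith("." + t) for t in targets)


target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj", "lm_head"]

# Hypothetical module names (assumption, replace with the real model's names).
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.dense",
    "model.layers.0.mlp.fc1",
    "model.layers.0.mlp.fc2",
    "lm_head",
]

adapted = [n for n in module_names if matches_target(n, target_modules)]
unused = [t for t in target_modules
          if not any(matches_target(n, [t]) for n in module_names)]

print("adapted modules:", adapted)
print("targets with no match:", unused)
```

If the real model shows unmatched targets like these, the corresponding layers are simply trained without adapters, and the list can be adjusted to the names the model actually uses.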

In [None]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard",
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,

)




max_steps is given, it will override any value given in num_train_epochs


## Evaluate the model before starting the fine-tuning

Function to complete a sentence

In [None]:
device = "cuda:0"
prompt = """<s>[INST]You are a sentiment classifier. Use only "positive" or "negative".

Sentence: {sentence}[\INST]
"""

# Replace the prompt above with the same prompt that will be used in the fine-tuning:
prompt = """<s>[INST] Your task is to classify sentences' sentiment as 'positive' or 'negative'.

Sentence: {text} [\INST]"""

def completa_frase(model, text, max_new_tokens=1):
  text = prompt.format(text=text)
  encodeds = tokenizer(text, return_tensors="pt", add_special_tokens=False)
  model_inputs = encodeds.to(device)

  with torch.no_grad():
    outputs = model.generate(**model_inputs,max_new_tokens=max_new_tokens,bos_token_id=model.config.bos_token_id,
                                eos_token_id=model.config.eos_token_id,
                                pad_token_id=model.config.eos_token_id
                             )
    torch.cuda.empty_cache()

  return tokenizer.decode(outputs[0][len(model_inputs["input_ids"][0]):], skip_special_tokens=True)

def extrai_classificacao(texto):
  texto = texto.lower()

  if 'positive' in texto:
    classificacao = 'positive'
  elif 'negative' in texto:
    classificacao = 'negative'
  else:
    classificacao =  ''

  return classificacao

In [None]:
print('Frase completada: ', completa_frase(base_model, "This movie is too bad", 10))
print('Classificação: ', extrai_classificacao(completa_frase(base_model, "This movie is too bad", 10)))

Frase completada:  !

Hint: Use the `split
Classificação:  


In [None]:
from tqdm import tqdm

def avalia_resposta_modelo(model, dataset, max_new_tokens=10, print_debug=False):
  acertos = 0.

  for example in tqdm(dataset):
    continuacao_prompt = completa_frase(model, example['text'], max_new_tokens)
    classe_prevista = extrai_classificacao(continuacao_prompt)
    acertos += (1. if classe_prevista == example['class'] else 0.)

    if (print_debug):
      print(f"[PREVISTO]: |{classe_prevista}| [CORRETO]: |{example['class']}| [ACERTOS]: {acertos}")

  acc = acertos/len(dataset)
  print(f'Acurácia: {acc}')
  return acc
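
Accuracy alone does not show whether the model answered incorrectly or produced text from which no class could be extracted. The sketch below is one possible extension that reports both; it works on already generated strings, so it runs without a model. The `classify` function mirrors `extrai_classificacao`, while the `score` helper and the sample generations are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def classify(generated: str) -> str:
    """Same rule as extrai_classificacao: look for the class name in the text."""
    lowered = generated.lower()
    if "positive" in lowered:
        return "positive"
    if "negative" in lowered:
        return "negative"
    return ""


def score(generations: List[str], gold: List[str]) -> Dict[str, float]:
    """Accuracy plus the rate of generations with no recognizable class."""
    counts: Counter = Counter()
    for generated, label in zip(generations, gold):
        predicted = classify(generated)
        if predicted == "":
            counts["unparsed"] += 1
        elif predicted == label:
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    total = len(gold)
    return {"accuracy": counts["correct"] / total,
            "unparsed_rate": counts["unparsed"] / total}


# Made-up generations; the third imitates the base model's off-task output.
generations = ["\npositive", " Negative.", "!\n\nHint: Use the `split", "positive"]
gold = ["positive", "negative", "negative", "negative"]
metrics = score(generations, gold)
print(metrics)
```

A high unparsed rate before fine-tuning would indicate that the base model does not follow the prompt format, rather than that it misjudges sentiment.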

In [None]:
# This is very slow (probably because of how the model is loaded at this point) and always yields 0
if FAZER_AVALIACAO_INICIAL:
  avalia_resposta_modelo(base_model, test_dataset, 3)

## Let's start the training process

In [None]:
# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

| Step | Training Loss |
|------|---------------|
| 25   | 3.595         |
| 50   | 2.9666        |
| 75   | 3.2226        |
| 100  | 2.994         |
| 125  | 3.2415        |
| 150  | 2.9497        |
| 175  | 3.2691        |
| 200  | 2.9914        |
| 225  | 3.2287        |
| 250  | 2.9196        |




## Merge the fine-tuned model

After fine-tuning, we can merge the LoRA adapter into the base model to get a single model that can be used for inference. This is done using PEFT. First, let's free the GPU memory by deleting the quantized base model and the trainer. You can also restart the runtime to clear the GPU memory.

In [None]:
# Empty VRAM
import gc
del base_model
gc.collect()

del trainer
gc.collect()

0

In [None]:
torch.cuda.empty_cache()

In [None]:
gc.collect()

0

Now, let's load the base model and fine-tuned model and merge them using PEFT.

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()
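
Conceptually, merging folds the low-rank update into the frozen weight, so that no adapter is needed at inference time: the merged weight is `W + (lora_alpha / r) * B @ A`. The sketch below illustrates this arithmetic with tiny hand-written matrices in plain Python. It is a toy example of the formula, not the code PEFT runs, and the matrix values are made up.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W: Matrix, A: Matrix, B: Matrix, lora_alpha: float, r: int) -> Matrix:
    """Return W + (lora_alpha / r) * B @ A, the merged weight matrix."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy values: W is 2 x 2 (d_out x d_in), A is 1 x 2 (r x d_in), B is 2 x 1 (d_out x r).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[0.5], [1.0]]

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```

Because the merged matrix has the same shape as `W`, the merged model has the same size and inference cost as the original base model.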

Let's save our merged model

In [None]:
# Save the merged model
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

## Test the merged model

The following code performs the inference stage of the evaluation of the fine-tuned Phi-1.5 model. It reuses the **`completa_frase`** function defined earlier, which asks the model to continue the classification prompt. The **`extrai_classificacao`** function then maps the generated text to `positive`, `negative`, or an empty string when neither word appears. Step by step, `completa_frase` works as follows:

1. The function accepts the model, the `text` whose sentiment is to be classified, and `max_new_tokens`, the maximum number of tokens to generate.

2. The `text` is formatted with the predefined prompt template. This prompt engineering is a common practice when using language models for specific tasks, as it provides context to the model about the task it is supposed to perform.

3. The `tokenizer` is applied to the formatted text. Tokenizers convert text into a format that models can understand, which in this case is a series of tokens. The tokenizer is configured to:
   - Return tensors compatible with PyTorch (`return_tensors="pt"`).
   - Not add special tokens that are usually used to indicate the start and end of a sequence (`add_special_tokens=False`), because the prompt template already contains them.

4. The tokenized input (`encodeds`) is then moved to the appropriate device (GPU) for inference.

5. The inference is performed inside a `torch.no_grad()` context manager, which disables gradient calculations. This is used because we are making predictions, not training the model, and therefore do not need gradients, which would only use extra memory and computational power.

6. The `model.generate` function is called to generate a response. This function takes several parameters, such as:
   - `**model_inputs`: The tokenized inputs prepared earlier.
   - `max_new_tokens`: This sets the maximum number of tokens the model may generate. A small value is enough because the expected answer is a single word.
   - `bos_token_id=model.config.bos_token_id`: This specifies the beginning-of-sequence token id, signaling the model where a new sequence starts.
   - `eos_token_id=model.config.eos_token_id`: This specifies the end-of-sequence token id, signaling the model where a sequence ends.
   - `pad_token_id=model.config.eos_token_id`: This is the id used for padding. Reusing the end-of-sequence token for padding is common for GPT-style models without a dedicated padding token, and this notebook does the same for the tokenizer.

7. After the model generates a response, `torch.cuda.empty_cache()` is called to free up unused memory on the GPU. This is helpful in managing GPU resources, especially when processing multiple requests or dealing with large models.

8. Finally, the `tokenizer.decode` function is used to convert the model's output tokens back into human-readable text. The `skip_special_tokens=True` argument removes any special tokens (like padding or end-of-sequence tokens) from the output. The function also skips the input tokens (`outputs[0][len(model_inputs["input_ids"][0]):]`) to only return the newly generated text.


The code below uses the **`avalia_resposta_modelo`** function to make predictions on the test dataset. It loops through the test dataset, classifies each example, and reports the accuracy, i.e. the fraction of examples whose predicted class matches the `class` column.

In [None]:
# Evaluate the model returned by merge_and_unload()
avalia_resposta_modelo(merged_model, test_dataset, 5, False)

100%|██████████| 2500/2500 [06:22<00:00,  6.53it/s]

Acurácia: 0.8816





0.8816