Vitor Domingos Baldoino dos Santos</br>
Universidade Presbiteriana Mackenzie</br>
Faculdade de Computação e Informática</br>
[vdbaldoino@gmail.com](mailto:vdbaldoino@gmail.com)</br>

Dataset: [Portuguese Tweets for Sentiment Analysis](https://www.kaggle.com/datasets/augustop/portuguese-tweets-for-sentiment-analysis)

Resources for the problems I ran into:

- I could not implement everything I wanted using only the HuggingFace library, so I considered doing the fine-tuning with PyTorch alone.
    - For a PyTorch-only approach, I found: [BERT Fine-Tuning Tutorial with PyTorch · Chris McCormick](https://mccormickml.com/2019/07/22/BERT-fine-tuning/)
    - For a hybrid approach (HuggingFace + a PyTorch training loop), I found this documentation: [Fine-tune a pretrained model](https://huggingface.co/docs/transformers/training)
    - Example post: [análise de sentimentos em português utilizando Pytorch e Python](https://medium.com/data-hackers/an%C3%A1lise-de-sentimentos-em-portugu%C3%AAs-utilizando-pytorch-e-python-91a232165ec0)
- For the hyperparameter search, I found the documentation below, but I could not use it because of a bug in the `Ray Tune` integration.
    - [Hyperparameter Search with Transformers and Ray Tune](https://huggingface.co/blog/ray-tune)
- The first link below is an example notebook for text classification using only the HuggingFace tools; almost everything in this notebook was taken from it. The second link is a tutorial on monitoring training with TensorBoard.
    - [Text Classification on GLUE using `Trainer`](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb#scrollTo=8sgjdLKcIrJm)
    - [BERT Finetuning with Hugging Face and Training Visualizations with TensorBoard](https://medium.com/nlplanet/bert-finetuning-with-hugging-face-and-training-visualizations-with-tensorboard-46368a57fc97)
- At some point I noticed that the HuggingFace library does not compute the model's performance metrics on the training set, which makes it impossible to detect possible overfitting. To deal with this, I found the links below:
    - [How to tweak `Trainer` to monitor other metrics on the training set](https://discuss.huggingface.co/t/metrics-for-training-set-in-trainer/2461/3)
    - [Batch and Epoch training metrics for transformers `Trainer`](https://stackoverflow.com/questions/78311534/batch-and-epoch-training-metrics-for-transformers-trainer/78311535#78311535)

- [Performance tips for training](https://huggingface.co/docs/transformers/v4.18.0/en/performance)

## Setup

In [None]:
%%shell
pip install -q transformers==4.39.3
pip install -q datasets==2.18.0
pip install -q evaluate==0.4.1
pip install -q accelerate==0.28.0
pip install -q torch==2.2.1
pip install -q torchtext==0.17.2
pip install -q torchdata==0.7.1
pip install -q ray[tune]==2.12.0
pip install -q optuna
pip install -q hyperopt
pip install -q scikit-learn




In [None]:
import os
import torch
import evaluate
import numpy as np

from copy import deepcopy

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    TrainerCallback
)

from datasets import (load_from_disk,
                      DatasetDict)

from sklearn.metrics import precision_recall_fscore_support, accuracy_score

In [None]:
from google.colab import drive

drive.mount('/content/drive')
os.chdir('/content/drive/MyDrive/sentiment-analysis/')

Mounted at /content/drive


In [None]:
print(os.getcwd())

/content/drive/MyDrive/sentiment-analysis


In [None]:
SEED = 42
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
BATCH_SIZE = 128
NUM_LABELS = 3
MAX_LENGTH = 128
TASK = "sentiment-analysis"
MODEL_NAME = "bertimbau"

ID2LABEL = {0: "Neutro", 1: "Positivo", 2: "Negativo"}
LABEL2ID = {"Neutro": 0, "Positivo": 1, "Negativo": 2}
MODEL_CHECKPOINT = "neuralmind/bert-base-portuguese-cased"

OUTPUT_DIR = f"models/{MODEL_NAME}-finetuned-{TASK}"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_CHECKPOINT, num_labels=NUM_LABELS, id2label=ID2LABEL, label2id=LABEL2ID
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers.models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES
from transformers.trainer import _is_peft_model
from transformers.trainer import unwrap_model

class CustomTrainer(Trainer):
    """Trainer that also logs the metrics of every batch from inside `compute_loss`."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def compute_loss(self, model, inputs, return_outputs=False):

        # Keep a reference to the labels: they are popped from `inputs` below
        # when label smoothing is enabled, but the batch metrics still need them.
        batch_labels = inputs.get("labels")

        if self.label_smoother is not None and "labels" in inputs:
            labels = inputs.pop("labels")
        else:
            labels = None

        outputs = model(**inputs)

        try:
            # Softmax does not change the argmax, so `compute_metrics` works on probabilities too.
            probabilities = outputs.logits.cpu().detach().softmax(dim=-1).numpy()
            ground_truth = batch_labels.cpu().detach().numpy()
            to_compute_training_metrics = (probabilities, ground_truth)
            metrics = self.compute_metrics(to_compute_training_metrics)
            self.log(metrics)
        except Exception as message:
            print(f"The error is: {message}")


        if self.args.past_index >= 0:
            self._past = outputs[self.args.past_index]

        if labels is not None:
            unwrapped_model = unwrap_model(model)
            if _is_peft_model(unwrapped_model):
                model_name = unwrapped_model.base_model.model._get_name()
            else:
                model_name = unwrapped_model._get_name()
            if model_name in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES.values():
                loss = self.label_smoother(outputs, labels, shift_labels=True)
            else:
                loss = self.label_smoother(outputs, labels)
        else:
            if isinstance(outputs, dict) and "loss" not in outputs:
                raise ValueError(
                    "The model did not return a loss from the inputs, only the following keys: "
                    f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}."
                )
            # We don't use .loss here since the model may return tuples instead of ModelOutput.
            loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]

        return (loss, outputs) if return_outputs else loss


class GetTrainingMetricsCallback(TrainerCallback):

    def __init__(self, trainer) -> None:
        super().__init__()
        self._trainer = trainer

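    def on_epoch_end(self, args, state, control, **kwargs):
        # When an evaluation is scheduled for the end of this epoch, first evaluate
        # on the training set so that train and validation metrics can be compared.
        if control.should_evaluate:
            # `evaluate` may modify flags on `control` (such as `should_evaluate`),
            # so return a copy taken beforehand to keep the scheduled evaluation.
            control_copy = deepcopy(control)
            self._trainer.evaluate(
                eval_dataset=self._trainer.train_dataset, metric_key_prefix="train"
            )
            return control_copy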


def tokenize_function(examples: dict):  # with batched=True, `map` passes a dict-like batch of columns
    return tokenizer(
        examples["text"],
        padding="max_length",
        max_length=MAX_LENGTH,
        truncation=True
    )


def compute_metrics(eval_pred):

    logits, labels = eval_pred
    # Predicted class = index of the highest score for each example
    predictions = np.argmax(logits, axis=-1)

    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="macro"
    )

    return {"accuracy": accuracy,
            "f1": f1,
            "precision": precision,
            "recall": recall}
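
To make the `average="macro"` setting used in `compute_metrics` concrete, the sketch below recomputes the same quantities by hand in plain Python on a made-up batch. The helper name `macro_scores` and the toy labels are illustrative assumptions, not part of the notebook. The point is only that macro averaging takes the unweighted mean of the per-class scores, so the three sentiment classes weigh the same regardless of how many tweets each one has.

```python
def macro_scores(labels, predictions, num_classes):
    """Hypothetical helper: accuracy plus macro-averaged precision, recall and F1."""
    pairs = list(zip(labels, predictions))
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(1 for y, p in pairs if y == c and p == c)  # true positives
        fp = sum(1 for y, p in pairs if y != c and p == c)  # false positives
        fn = sum(1 for y, p in pairs if y == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": sum(1 for y, p in pairs if y == p) / len(pairs),
        "precision": sum(precisions) / num_classes,  # unweighted mean over classes
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1s) / num_classes,
    }


# Toy data (made up): 0 = Neutro, 1 = Positivo, 2 = Negativo
toy_labels = [0, 0, 0, 0, 1, 1, 2, 2]
toy_predictions = [0, 0, 0, 1, 1, 1, 2, 0]
scores = macro_scores(toy_labels, toy_predictions, num_classes=3)
print(scores)
```

On this toy batch the accuracy and the macro recall happen to coincide, while the macro F1 is lower because the class with the fewest correct predictions pulls the mean down.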

In [None]:
ds = load_from_disk("/content/drive/MyDrive/sentiment-analysis/data/intermediate/without-emoticons")
ds = ds.map(tokenize_function, batched=True)
ds.set_format("torch")

ds

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 630481
    })
    dev: Dataset({
        features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 135103
    })
    test: Dataset({
        features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 135104
    })
})

In [None]:
small_train_dataset = ds["train"].shuffle(seed=42).select(range(1000)).remove_columns("text")
small_eval_dataset = ds["test"].shuffle(seed=42).select(range(1000)).remove_columns("text")

small_train_dataset

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})

## Hyperparameter Search

---

In [None]:
hp_space = {
        "per_device_train_batch_size": [8, 16, 32, 64, 128],
        "per_device_eval_batch_size": [8, 16, 32, 64, 128],
        "num_train_epochs": ([2, 3, 4, 5, 6]),
        "weight_decay": (0.0, 0.3),
        "learning_rate": (1e-5, 5e-5),
        "adam_epsilon": (1e-10, 1e-6),
        "warmup_ratio": (0.01, 0.03),
    }

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
      MODEL_CHECKPOINT, num_labels=NUM_LABELS, id2label=ID2LABEL, label2id=LABEL2ID
    )

def optuna_hp_space(trial):
    return {
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32, 64, 128]),
        "per_device_eval_batch_size": trial.suggest_categorical("per_device_eval_batch_size", [8, 16, 32, 64, 128]),
        "num_train_epochs": trial.suggest_categorical("num_train_epochs", [2, 3, 4, 5, 6]),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.3),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5),
        "adam_epsilon": trial.suggest_float("adam_epsilon", 1e-10, 1e-6),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.01, 0.03),
    }

def objective(metrics):
    return metrics["eval_f1"]

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    seed=SEED,
    do_train=True,
    do_eval=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,
    logging_strategy="steps",
    save_total_limit=2,
    save_only_model=True,
    metric_for_best_model="f1",
    skip_memory_metrics=True
)

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=ds["train"],
    eval_dataset=ds["dev"],
    compute_metrics=compute_metrics,
)

trainer.add_callback(GetTrainingMetricsCallback(trainer))

best_run = trainer.hyperparameter_search(
    hp_space=optuna_hp_space,
    compute_objective=objective,  # optimize the dev-set macro F1 defined above
    direction="maximize",
    backend="optuna",
    n_trials=20,
    )
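
The search space above draws `adam_epsilon` uniformly between `1e-10` and `1e-6`, a range that spans four orders of magnitude. As a rough illustration of what that implies, the sketch below compares uniform and log-uniform sampling over the same range using only the standard library. The helper names and the `1e-8` threshold are assumptions made for this example. Optuna's `trial.suggest_float` accepts `log=True` for log-scale sampling, but whether that suits this search is a judgment call the notebook does not make.

```python
import math
import random


def sample_uniform(rng, low, high):
    """Draw a value uniformly on the linear scale."""
    return rng.uniform(low, high)


def sample_log_uniform(rng, low, high):
    """Draw a value uniformly on the log scale (each decade is equally likely)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def fraction_below(draws, threshold):
    """Share of the draws that fall below the threshold."""
    return sum(1 for d in draws if d < threshold) / len(draws)


rng = random.Random(42)
LOW, HIGH = 1e-10, 1e-6  # same bounds as `adam_epsilon` in the search space
N = 10_000

uniform_draws = [sample_uniform(rng, LOW, HIGH) for _ in range(N)]
log_draws = [sample_log_uniform(rng, LOW, HIGH) for _ in range(N)]

frac_uniform = fraction_below(uniform_draws, 1e-8)
frac_log = fraction_below(log_draws, 1e-8)
print(f"uniform draws below 1e-8:     {frac_uniform:.3f}")
print(f"log-uniform draws below 1e-8: {frac_log:.3f}")
```

With uniform sampling only about one draw in a hundred lands in the lower half of the exponent range, so most trials would effectively test values close to `1e-6`.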

## Fine-Tuning with Step Logging

---

In [None]:
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    seed=SEED,
    evaluation_strategy="epoch",
    #eval_steps=100,
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    logging_strategy="epoch",
    #logging_steps=100,
    save_total_limit=2,
    save_only_model=False,
    metric_for_best_model="f1",
    report_to="tensorboard",
)

# trainer = CustomTrainer(
#     model=model,
#     args=training_args,
#     tokenizer=tokenizer,
#     train_dataset=small_train_dataset,
#     eval_dataset=small_eval_dataset,
#     compute_metrics=compute_metrics,
# )

trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.add_callback(GetTrainingMetricsCallback(trainer))

trainer.train()

trainer.save_model(f"{OUTPUT_DIR}-full-dataset-no-hyperopt")



| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.0 | 3.10076 | 0.778 | 0.755077 | 0.810167 | 0.723478 |
| 2 | 0.0 | 3.135233 | 0.775 | 0.753522 | 0.81343 | 0.719864 |
| 3 | 0.0 | 3.167737 | 0.777 | 0.752276 | 0.811179 | 0.720447 |
| 4 | 0.0 | 3.160993 | 0.775 | 0.751417 | 0.808275 | 0.719864 |
| 5 | 0.0 | 3.140991 | 0.778 | 0.755077 | 0.810167 | 0.723478 |



In [None]:

trainer.state.log_history

[{'accuracy': 1.0,
  'f1': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'epoch': 0,
  'step': 0},
 {'accuracy': 1.0,
  'f1': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'epoch': 0.12,
  'step': 1},
 {'accuracy': 1.0,
  'f1': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'epoch': 0.25,
  'step': 2},
 {'accuracy': 1.0,
  'f1': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'epoch': 0.38,
  'step': 3},
 {'accuracy': 1.0,
  'f1': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'epoch': 0.5,
  'step': 4},
 {'accuracy': 1.0,
  'f1': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'epoch': 0.62,
  'step': 5},
 {'accuracy': 1.0,
  'f1': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'epoch': 0.75,
  'step': 6},
 {'accuracy': 1.0,
  'f1': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'epoch': 0.88,
  'step': 7},
 {'loss': 0.0,
  'grad_norm': 1.2146363587817177e-05,
  'learning_rate': 1.6000000000000003e-05,
  'epoch': 1.0,
  'step': 8},
 {'accuracy': 0.8125,
  'f1': 0.802087170042971,
  'precision': 0.83569739952
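
The `epoch`, `step` and `learning_rate` values in this log can be reproduced with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the default linear learning-rate schedule without warmup. The function names are made up for this illustration and are not part of the `transformers` API.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps per epoch (the last batch may be smaller)."""
    return math.ceil(num_examples / batch_size)


def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Values taken from the notebook: 1000 training examples, batch size 128,
# 5 epochs, learning rate 2e-5.
per_epoch = steps_per_epoch(1000, 128)
total_steps = per_epoch * 5
epoch_fractions = [round(step / per_epoch, 2) for step in range(1, per_epoch)]
lr_after_first_epoch = linear_lr(per_epoch, total_steps, 2e-5)

print(per_epoch, total_steps)
print(epoch_fractions)
print(lr_after_first_epoch)
```

Under these assumptions there are 8 steps per epoch, which matches the log reaching `'epoch': 1.0` at `'step': 8`, and the learning rate after those 8 of 40 steps has decayed to about `1.6e-05`. Applied to the full training split of 630481 examples, the same function gives 4926 steps per epoch.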

In [None]:
!nvidia-smi

Thu May  2 00:24:15 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   38C    P0              27W /  72W |  22691MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
!kill 63757

/bin/bash: line 1: kill: (63757) - No such process


## Fine-Tuning with Epoch Logging

---

In [None]:
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    seed=SEED,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    logging_strategy="epoch",
    save_total_limit=2,
    save_only_model=True,
    metric_for_best_model="f1",
    report_to="tensorboard",
)

trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=ds["train"],
    eval_dataset=ds["dev"],
    compute_metrics=compute_metrics,
)

trainer.add_callback(GetTrainingMetricsCallback(trainer))

trainer.train()

trainer.save_model(f"{OUTPUT_DIR}-full-dataset-no-hyperopt")

Epoch,Training Loss,Validation Loss
