<a href="https://colab.research.google.com/github/Stefano0210/IULM_DDM2324_Notebooks/blob/main/32_fine_tuning_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning a Classification Model

In this notebook we see how to fine-tune a pre-trained model, in this case https://huggingface.co/dbmdz/bert-base-italian-xxl-cased .

**Note that to keep the execution time manageable you need to buy compute credits and attach a GPU, for example a T4**

We start by installing all the required libraries (they all belong to the Hugging Face ecosystem)


In [1]:
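# The PyPI package named `huggingface` is an empty placeholder; the Hub client is `huggingface_hub`
!pip install huggingface_hub
!pip install datasets
!pip install evaluate
!pip install accelerate
# Quote the extra so the shell does not treat the brackets as a glob pattern
!pip install "transformers[torch]"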


## Loading the dataset

In this section we load the dataset and label reviews rated 4 or 5 stars as positive and reviews rated 1 or 2 stars as negative. Reviews with 3 stars are discarded.

In [3]:
import pandas as pd
from datasets import Dataset

!wget "https://github.com/Stefano0210/IULM_DDM2324_Notebooks/raw/main/data/italian_reviews.txt"
# Load the data
df = pd.read_csv("italian_reviews.txt")

# Filter out reviews with 3 stars
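# .copy() makes the filtered frame independent, so adding a column below does not trigger SettingWithCopyWarning
df = df[df['review_stars'] != 3].copy()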


# Define the sentiment based on the review stars
df['label'] = df['review_stars'].apply(lambda x: 1 if x >= 4 else 0)

# Select the relevant columns
df = df[['review_text', 'label']]
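
The labeling rule can also be written as a small standalone function, which makes the handling of 3-star reviews explicit. This is only a sketch: `star_to_label` is a hypothetical helper that does not appear in the notebook, and it returns `None` for the reviews that the cell above filters out.

```python
def star_to_label(stars):
    """Map a star rating to a sentiment label (1 = positive, 0 = negative)."""
    if stars == 3:
        return None  # neutral reviews are dropped, not labeled
    return 1 if stars >= 4 else 0

# Apply the rule to every possible rating
labels = [star_to_label(s) for s in [1, 2, 3, 4, 5]]
print(labels)
```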






## Converting to a Hugging Face Dataset

In [4]:
# Convert the DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df)


## Reducing the number of examples

To limit the fine-tuning time, we sample 2,000 positive and 2,000 negative reviews and split them evenly into a training set and an evaluation set of 2,000 examples each. The split is random, so each half is only approximately balanced.

In [5]:
from datasets import concatenate_datasets


def create_stratified_dataset(dataset, num_samples_per_class):
    # Shuffle the dataset
    shuffled_dataset = dataset.shuffle(seed=42)

    # Separate positive and negative instances
    positive_dataset = shuffled_dataset.filter(lambda example: example['label'] == 1)
    negative_dataset = shuffled_dataset.filter(lambda example: example['label'] == 0)

    # Select num_samples_per_class instances from each class
    positive_subset = positive_dataset.select(range(num_samples_per_class))
    negative_subset = negative_dataset.select(range(num_samples_per_class))

    # Concatenate the subsets and shuffle
    balanced_dataset = concatenate_datasets([positive_subset, negative_subset]).shuffle(seed=42)
    return balanced_dataset

# Create a balanced dataset with 2000 instances of each class
balanced_dataset = create_stratified_dataset(dataset, 2000)

# Split the balanced dataset into training and testing sets
small_train_dataset, small_eval_dataset = balanced_dataset.train_test_split(test_size=0.5).values()
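
Because `train_test_split` shuffles without stratifying, it can be worth checking how the two classes are distributed in each half. Below is a minimal sketch of such a check. `class_balance` is a hypothetical helper, and `toy_labels` is made-up data standing in for `small_train_dataset["label"]`.

```python
from collections import Counter

def class_balance(labels):
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Made-up labels; in the notebook one would pass small_train_dataset["label"]
toy_labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
balance = class_balance(toy_labels)
print(balance)
```

If the fractions drift far from 0.5, one option is to split each class separately before concatenating.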




## Tokenization

In this step we use the tokenizer of the chosen model as a black box, through AutoTokenizer. This is convenient because each model may have been trained with a different tokenization algorithm (splitting text into words, subwords, or characters), and we want to use the same one during both fine-tuning and inference.

In [6]:
from transformers import AutoTokenizer

model_checkpoint = "dbmdz/bert-base-italian-xxl-cased"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["review_text"], padding="max_length", truncation=True)

small_train_dataset = small_train_dataset.map(tokenize_function, batched=True)
small_eval_dataset = small_eval_dataset.map(tokenize_function, batched=True)
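
To see what `padding="max_length"` and `truncation=True` do to each review, here is a simplified sketch in plain Python. It is not the real tokenizer. `pad_or_truncate` is a hypothetical helper, the token IDs are invented, the padding ID of 0 is an assumption, and the handling of special tokens during truncation is ignored.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # effect of truncation=True
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # effect of padding="max_length"
    attention_mask = [1] * len(kept) + [0] * n_pad   # 0 marks padding positions
    return input_ids, attention_mask

# A short sequence is padded, a long one is cut
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, which lets the trainer stack them into fixed-size batches. The cost is extra computation on padding tokens for short reviews.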


## Loading a pre-trained model

We load the language model. Pay attention to the warning, which tells us that this model needs fine-tuning before it can be used for a classification task: the weights of the classification head (`classifier.weight` and `classifier.bias`) are newly initialized.

In [7]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-italian-xxl-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Fine-Tuning

We use all the default values.

In [8]:
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="test_trainer")

## Evaluation

In [9]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

# this evaluation function simply computes the proportion of correct labels
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
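
To make the metric concrete, the following sketch reproduces the same computation in plain Python on a handful of invented logits. `argmax` and `accuracy_from_logits` are hypothetical helpers written for illustration; they are not part of `evaluate`.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per example and compare with the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Invented logits for four reviews: column 0 = negative, column 1 = positive
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.5, 3.0]]
toy_labels = [0, 1, 1, 1]
result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```

The third example is predicted as class 0 but labeled 1, so three of the four predictions are correct.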


## Training


In [10]:
from transformers import TrainingArguments, Trainer

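# Evaluate at the end of every epoch (recent transformers releases rename this argument to eval_strategy)
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")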


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.11987 | 0.9735 |
| 2 | 0.162900 | 0.129683 | 0.973 |
| 3 | 0.162900 | 0.139544 | 0.977 |


TrainOutput(global_step=750, training_loss=0.11903938102722168, metrics={'train_runtime': 756.4382, 'train_samples_per_second': 7.932, 'train_steps_per_second': 0.991, 'total_flos': 1578666332160000.0, 'train_loss': 0.11903938102722168, 'epoch': 3.0})
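
The numbers in this output follow from the default training arguments. The sketch below redoes the arithmetic, assuming a single GPU and the `TrainingArguments` defaults of a batch size of 8, 3 epochs, and logging every 500 steps. The variable names are only illustrative.

```python
import math

train_examples = 2000   # half of the 4,000 sampled reviews
batch_size = 8          # default per_device_train_batch_size
num_epochs = 3          # default num_train_epochs
logging_steps = 500     # default logging_steps

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# First epoch at whose end a training loss has already been logged
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, first_logged_epoch)
```

This matches `global_step=750` and explains the "No log" entry for epoch 1: the first training loss is recorded at step 500, which falls at the end of epoch 2, and the same value is shown again at epoch 3 because no further logging step is reached.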

## Creating a new pipeline and testing a few sentences

In [11]:
import torch
from transformers import pipeline

test_sentences = [
    "Sono veramente soddisfatto!",
    "Questo prodotto è spazzatura.",
    "Velocissimi, consigliato",
    "Avrei preferito una consegna più veloce ma il prodotto è sicuramente di ottima qualità"
]

# Check if CUDA is available and set the device
device = 0 if torch.cuda.is_available() else -1

# Load the model and tokenizer into the pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, device=device)

# Test the model
outputs = classifier(test_sentences)
pd.DataFrame(outputs)


|   | label | score |
|---|---|---|
| 0 | LABEL_1 | 0.9985 |
| 1 | LABEL_0 | 0.999454 |
| 2 | LABEL_1 | 0.99872 |
| 3 | LABEL_1 | 0.998669 |
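
The generic names `LABEL_0` and `LABEL_1` correspond to the labels defined when the dataset was loaded (0 = negative, 1 = positive). A minimal sketch of turning them into readable names follows. `LABEL_NAMES` and `readable` are hypothetical, and the outputs are copied from the table above.

```python
# Hypothetical mapping, following the labeling rule used earlier in the notebook
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}

def readable(outputs):
    """Replace the generic pipeline labels with human-readable names."""
    return [{"label": LABEL_NAMES[o["label"]], "score": o["score"]} for o in outputs]

# Pipeline outputs copied from the table above
outputs = [
    {"label": "LABEL_1", "score": 0.9985},
    {"label": "LABEL_0", "score": 0.999454},
    {"label": "LABEL_1", "score": 0.99872},
    {"label": "LABEL_1", "score": 0.998669},
]
named = readable(outputs)
for row in named:
    print(row)
```

The model configuration also carries an `id2label` mapping. Setting it when the model is loaded should make the pipeline return these names directly.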
