# Label your data to fine-tune a classifier with Hugging Face

**Autores:** Alonso Morgado, César. González Bartolomé, Sergio. Lozano García, José Rubén. Ramos Valbuena, Gorka. de los Reyes Rodríguez, Diego

Se sigue paso a paso el tutorial encontrado en:

https://rubrix.readthedocs.io/en/master/tutorials/01-labeling-finetuning.html

El objetivo del tutorial es crear un analizador de sentimientos basado en textos extraídos de Twitter en inglés. Se describen los pasos para un analizador de sentimientos mejor tras adaptar un analizador ya entrenado sobre un conjunto de datos nuevo y cómo se crea ese conjunto de datos nuevo. 

## 1. Instalación de paquetes de terceros

Se requiren los paquetes "transformers", "datasets" y "sklearn" para el entrenamiento y el paquete "ipywidgets" para mostrar una barra de progreso.

Se instalan *Rubrix* y *loguru*.

In [1]:
%pip install transformers[torch] datasets sklearn ipywidgets -qqq
%pip install "rubrix[server]"
%pip install loguru

Note: you may need to restart the kernel to use updated packages.
Collecting rubrix[server]
  Downloading rubrix-0.14.1-py3-none-any.whl (1.7 MB)
     |████████████████████████████████| 1.7 MB 4.5 MB/s            
[?25hCollecting pydantic>=1.7.1
  Downloading pydantic-1.9.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
     |████████████████████████████████| 12.4 MB 924 kB/s            
[?25hCollecting wrapt~=1.13.0
  Downloading wrapt-1.13.3-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (81 kB)
     |████████████████████████████████| 81 kB 15.4 MB/s            
Collecting httpx~=0.15.0
  Downloading httpx-0.15.5-py3-none-any.whl (65 kB)
     |████████████████████████████████| 65 kB 5.3 MB/s             
Collecting luqum~=0.11.0
  Downloading luqum-0.11.0-py3-none-any.whl (49 kB)
     |████████████████████████████████| 49 kB 5.1 MB/s             
[?25hCollecting Deprecated~=1.2.0
  Downloading Deprecated-1.2.13-p

  Building wheel for python-multipart (setup.py) ... [?25ldone
[?25h  Created wheel for python-multipart: filename=python_multipart-0.0.5-py3-none-any.whl size=31678 sha256=f37549475e14b055fd07d8661e49bc85615832766826bcff6366b91de2dc9975
  Stored in directory: /home/jovyan/.cache/pip/wheels/fe/04/d1/a10661cc45f03c3cecda50deb2d2c22f57b4e84a75b2a5987e
  Building wheel for hurry.filesize (setup.py) ... [?25ldone
[?25h  Created wheel for hurry.filesize: filename=hurry.filesize-0.9-py3-none-any.whl size=4132 sha256=37db11bc9bcecbba8386b5a534bf64fdd77c5df40c77906abb37619dc9e18820
  Stored in directory: /home/jovyan/.cache/pip/wheels/3c/97/5e/2475af1d4343e1d41becdfa497e764625ac3b226ac9299aeda
Successfully built python-multipart hurry.filesize
Installing collected packages: rfc3986, h11, httpcore, ecdsa, asgiref, wrapt, websockets, watchgod, uvloop, uvicorn, starlette, python-jose, python-dotenv, pydantic, ply, passlib, httpx, httptools, brotli, bcrypt, stopwordsiso, smart-open, rubrix, py

## 2. Carga del dataset inicial

En primer lugar, se descarga el dataset "banking77" de Hugging Face. Se trata de un dataset de consultas de usuarios de bancos, etiquetado con la intención (en este caso, un sentimiento) de cada texto. Se divide en dos subconjuntos con el 50% de datos cada uno de ellos. Se comienza usando el primero (to_label1) y se reserva el segundo para futuras iteraciones.

In [14]:
import json
import datasets
from datasets import Dataset
import pandas as pd

DATASET_VERSION = "v2" # Versión del conjunto de datos
path = f"/home/jovyan/data/dataset/{DATASET_VERSION}/data_train.csv"
train_df = pd.read_csv(path, names=["text", "intent"], sep=";")
train_df.head()
path = f"/home/jovyan/data/dataset/{DATASET_VERSION}/data_test.csv"
test_df = pd.read_csv(path, names=["text", "intent"], sep=";")
test_df.head()
train_dataset = Dataset.from_dict(train_df)
test_dataset = Dataset.from_dict(test_df)
dataset = datasets.DatasetDict({"train":train_dataset,"test":test_dataset})
to_label1, to_label2 = dataset['train'].train_test_split(test_size=0.5, seed=42).values()

In [16]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'intent'],
        num_rows: 181
    })
    test: Dataset({
        features: ['text', 'intent'],
        num_rows: 47
    })
})

## 3. Modelo

Se crea un clasificador de sentimientos partiendo de un modelo pre-entrenado, el modelo *distilbert*, *fine-tuned* sobre SST-2 (Stanford Sentiment Treebank). Se trata de un modelo muy popular para clasificación de sentimientos, con muy buenos resultados.

Se crea el *pipeline* a partir del modelo, indicando la tarea de **análisis de sentimientos** y se aplica sobre uno de los registros del conjunto de datos.

In [17]:
from transformers import pipeline

intent_classifier = pipeline(
    model="pysentimiento/robertuito-emotion-analysis",
    task="text-classification",
    return_all_scores=True,
)

to_label1[3]['text'], intent_classifier(to_label1[3]['text'])

Downloading:   0%|          | 0.00/1.06k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/415M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/334 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/838k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/150 [00:00<?, ?B/s]

('nos vemos en otro momento',
 [[{'label': 'others', 'score': 0.7186572551727295},
   {'label': 'joy', 'score': 0.053510431200265884},
   {'label': 'sadness', 'score': 0.13479837775230408},
   {'label': 'anger', 'score': 0.03220720961689949},
   {'label': 'surprise', 'score': 0.02744009718298912},
   {'label': 'disgust', 'score': 0.02107127569615841},
   {'label': 'fear', 'score': 0.012315340340137482}]])

El resultado ha sido un sentimiento negativo. Según el tutorial, las preguntas globales se han etiquetado como positivas (no se emplea una clase para sentimientos neutros por simplicidad), mientras que los problemas se han etiquetado como negativos.

## 4. Ejecución del modelo pre-entrenado sobre el conjunto de datos

Empleando el modelo pre-entrenado, se realizan predicciones sobre el primero de los dos conjuntos de datos en los que se ha partido el conjunto de datos de consultas bancarias, para obtener la intención de cada uno de ellos.

In [20]:
def predict(examples):
    return {"predictions": intent_classifier(examples['text'], truncation=True)}

# add .select(range(10)) before map if you just want to test this quickly with 10 examples
to_label1 = to_label1.map(predict, batched=True, batch_size=4)

  0%|          | 0/23 [00:00<?, ?ba/s]

NameError: name 'sentiment_classifier' is not defined

## 5. Lista Rubrix con predicciones

Se crea una lista con Rubrix utilizando el conjunto de datos **to_label1**.

In [5]:
import rubrix as rb

records = []
for example in to_label1.shuffle():
    record = rb.TextClassificationRecord(
        text=example["text"],
        metadata={'category': example['label']}, # log the intents for exploration of specific intents
        prediction=[(pred['label'], pred['score']) for pred in example['predictions']],
        prediction_agent="distilbert-base-uncased-finetuned-sst-2-english"
    )
    records.append(record)

## 6. Creación del dataset

A continuación se crea el dataset. En el tutorial se suben los datos a **Hugging Face Hub**. Para poder ejecutar este paso, se crea una cuenta en la plataforma y se genera un Token desde la configuración de la cuenta.

In [7]:
dataset_rb = rb.DatasetForTextClassification(records)
dataset_ds = dataset_rb.to_datasets()

dataset_ds.push_to_hub("diegorysr/sentiment-banking-pln") # Se sube el dataset a la cuenta de Hugging Face Hub.

Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Una vez subido, se carga el conjunto de datos desde **Hugging Face Hub**.

In [8]:
dataset_ds = load_dataset("diegorysr/sentiment-banking-pln", split="train")
dataset_rb = rb.read_datasets(dataset_ds, task="TextClassification")

Downloading:   0%|          | 0.00/2.53k [00:00<?, ?B/s]



Downloading and preparing dataset None/None (download: 421.25 KiB, generated: 1.15 MiB, post-processed: Unknown size, total: 1.56 MiB) to /home/jovyan/.cache/huggingface/datasets/diegorysr___parquet/diegorysr--sentiment-banking-pln-356dd510076e5c98/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/431k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Dataset parquet downloaded and prepared to /home/jovyan/.cache/huggingface/datasets/diegorysr___parquet/diegorysr--sentiment-banking-pln-356dd510076e5c98/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901. Subsequent calls will reuse this data.


## 7. Exploración y etiquetado de datos en Rubrix

Siguiendo el tutorial que se puede encontrar en [Setup and installation](https://rubrix.readthedocs.io/en/master/getting_started/setup%26installation.html), se instala **Rubrix** para llevar a cabo el resto del tutorial, haciendo uso de la instalación vía **docker-compose**:

```
mkdir rubrix && cd rubrix
```

```
wget -O docker-compose.yml https://raw.githubusercontent.com/recognai/rubrix/master/docker-compose.yaml && docker-compose up -d
```

Se comprueba que la instalación ha ido bien, accediendo al servidor local:

http://localhost:6900/

Con el usuario por defecto rubrix y contraseña 1234.

Se envían los datos a **Rubrix** para poder etiquetarlos y obtener nuestro primer conjunto de datos de entrenamiento.

In [9]:
rb.log(name='pln_actividad_grupal_pre-trained', records=dataset_rb)

  0%|          | 0/5001 [00:00<?, ?it/s]

5001 records logged to http://localhost:6900/datasets/rubrix/pln_actividad_grupal_pre-trained


BulkResponse(dataset='pln_actividad_grupal_pre-trained', processed=5001, failed=0)

## 8. Fine-tunning del modelo pre-entrenado

Partiendo de los datos etiquetados en Rubrix, se realiza *fine-tunning* al modelo pre-entrenado.

In [1]:
import rubrix as rb

rb_dataset = rb.load(name='pln_actividad_grupal_pre-trained', query="status:Validated", as_pandas=False) # En el tutorial no se indica que haya que marcar el flag "as_pandas" a False.
rb_dataset.to_pandas().head(3)



Unnamed: 0,text,inputs,prediction,prediction_agent,annotation,annotation_agent,multi_label,explanation,id,metadata,status,event_timestamp,metrics,search_keywords
0,I'd like to get another card,{'text': 'I'd like to get another card'},"[(NEGATIVE, 0.9784252047538757), (POSITIVE, 0....",distilbert-base-uncased-finetuned-sst-2-english,POSITIVE,rubrix,False,,0016b7e2-584a-4d51-88fc-243c53d6717b,{'category': 39},Validated,,{'text_length': 28},
1,I couldn't complete card activation.,{'text': 'I couldn't complete card activation.'},"[(NEGATIVE, 0.9995753169059753), (POSITIVE, 0....",distilbert-base-uncased-finetuned-sst-2-english,NEGATIVE,rubrix,False,,001f3583-5864-47cb-b2f1-675ca3514c89,{'category': 0},Validated,,{'text_length': 36},
2,What is delivery speed to the United States?,{'text': 'What is delivery speed to the United...,"[(NEGATIVE, 0.9976611137390137), (POSITIVE, 0....",distilbert-base-uncased-finetuned-sst-2-english,POSITIVE,rubrix,False,,0022afc1-97b5-463d-b5c1-49061fe43938,{'category': 12},Validated,,{'text_length': 44},


A continuación, el propio paquete de Rubrix permite preparar los datos para entrenamiento, cambiando las categorías por números.

In [2]:
rb_dataset.to_pandas().shape

(235, 14)

In [3]:
# create 🤗 dataset with labels as numeric ids
train_ds = rb_dataset.prepare_for_training()

Se preparan los datos para el entrenamiento.

In [4]:
from transformers import AutoTokenizer

# tokenize our datasets
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_train_ds = train_ds.map(tokenize_function, batched=True)

  0%|          | 0/1 [00:00<?, ?ba/s]

In [5]:
# split the data into a training and evalutaion set
train_dataset, eval_dataset = tokenized_train_ds.train_test_split(test_size=0.2, seed=42).values()

Se ajusta el modelo con los nuevos datos, esto es, se aplica *fine-tunning* al modelo pre-entrenado con el nuevo conjunto de datos.

In [6]:
# Carga del modelo

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

In [7]:
# Configuración del Trainer

import numpy as np
from transformers import Trainer
from datasets import load_metric
from transformers import TrainingArguments

training_args = TrainingArguments(
    "distilbert-base-uncased-sentiment-banking",
    evaluation_strategy="epoch",
    logging_steps=30,
)

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    args=training_args,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

In [8]:
# Entrenamiento

trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 188
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 72


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.203869,0.957447
2,0.663000,0.162767,0.957447
3,0.135800,0.202386,0.957447


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 47
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 47
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 47
  B

TrainOutput(global_step=72, training_loss=0.3390478247569667, metrics={'train_runtime': 583.0815, 'train_samples_per_second': 0.967, 'train_steps_per_second': 0.123, 'total_flos': 74711612841984.0, 'train_loss': 0.3390478247569667, 'epoch': 3.0})

Se alcanza una tasa de acierto de 95.74%.

## 8. Probar el modelo

En este apartado, se prueba el modelo tras el *fine-tunning*.

In [16]:
from transformers import pipeline

finetuned_sentiment_classifier = pipeline(
    model=model.to("cpu"),
    tokenizer=tokenizer,
    task="sentiment-analysis",
    return_all_scores=True
)

In [17]:
finetuned_sentiment_classifier(
    'I need to deposit my virtual card, how do i do that.'
), sentiment_classifier(
    'I need to deposit my virtual card, how do i do that.'
)

([[{'label': 'NEGATIVE', 'score': 0.0013895642478019},
   {'label': 'POSITIVE', 'score': 0.9986103773117065}]],
 [[{'label': 'NEGATIVE', 'score': 0.9992493987083435},
   {'label': 'POSITIVE', 'score': 0.0007506068213842809}]])

Se comprueba que el modelo sin pre-entrenar clasifica esta frase como negativa, con un 99.92% de probabilidad, cuando, al tratarse de una consulta, debería haberse clasificado como positiva. El modelo tras el *fine-tunning* es capaz de cambiar a positiva la clasificación, acertando con una tasa del 99.86%, por lo que, en este caso, el modelo ha mejorado.

In [18]:
finetuned_sentiment_classifier(
    'Why is my payment still pending?'
), sentiment_classifier(
    'Why is my payment still pending?'
)

([[{'label': 'NEGATIVE', 'score': 0.997940719127655},
   {'label': 'POSITIVE', 'score': 0.0020592797081917524}]],
 [[{'label': 'NEGATIVE', 'score': 0.9983781576156616},
   {'label': 'POSITIVE', 'score': 0.0016218440141528845}]])

En este ejemplo, los dos modelos se comportan bien, clasificando el problema como negativo, ambos con una tasa de acierto superior al 99%.

## 9. Creación del nuevo dataset

A continuación, con los nuevos datos etiquetados, se crea un nuevo dataset y se almacena en Rubrix.

In [19]:
rb_dataset = rb.load(name='pln_actividad_grupal_pre-trained', query="status:Default", as_pandas=False)

Una vez cargado el conjunto de datos inicial, se pasa por los datos que no han sido clasificados para etiquetarlos con el modelo ajustado con *fine-tunning*.

In [20]:
def predict(examples):
    texts = [example["text"] for example in examples["inputs"]]
    return {
        "prediction": finetuned_sentiment_classifier(texts),
        "prediction_agent": ["distilbert-base-uncased-banking77-sentiment"]*len(texts)
    }

ds_dataset = rb_dataset.to_datasets().map(predict, batched=True, batch_size=8)

  0%|          | 0/596 [00:00<?, ?ba/s]

Después de etiquetar los nuevos datos, se vuelven a almacenar en Rubrix, en un nuevo dataset.

In [21]:
records = rb.read_datasets(ds_dataset, task="TextClassification")

rb.log(records=records, name='pln_actividad_grupal_fine-tuned')

  0%|          | 0/4766 [00:00<?, ?it/s]

4766 records logged to http://localhost:6900/datasets/rubrix/pln_actividad_grupal_fine-tuned


BulkResponse(dataset='pln_actividad_grupal_fine-tuned', processed=4766, failed=0)

## 10. Volver a ajustar el modelo

Ahora que los datos han sido mejor etiquetados y tenemos un modelo más ajustado al haber empleado un ajuste fino o *fine-tuning* sobre el modelo pre-entrenado, volvemos a entrenar  el modelo con los nuevos datos etiquetados.

In [22]:
rb_dataset = rb.load("pln_actividad_grupal_fine-tuned", as_pandas=False)

train_ds = rb_dataset.prepare_for_training()
tokenized_train_ds = train_ds.map(tokenize_function, batched=True)

Se añaden al conjunto anterior los nuevos datos.

In [23]:
from datasets import concatenate_datasets

train_dataset = concatenate_datasets([train_dataset, tokenized_train_ds])

Y se vuelve a entrenar desde el modelo pre-entrenado con los nuevos datos.

In [24]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

loading configuration file https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english/resolve/main/config.json from cache at /home/jovyan/.cache/huggingface/transformers/4e60bb8efad3d4b7dc9969bf204947c185166a0a3cf37ddb6f481a876a3777b5.9f8326d0b7697c7fd57366cdde57032f46bc10e37ae81cb7eb564d66d23ec96b
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_

In [25]:
train_ds = train_dataset.shuffle(seed=42)

trainer = Trainer(
    args=training_args,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 188
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 72


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.203869,0.957447
2,0.663000,0.162767,0.957447
3,0.135800,0.202386,0.957447


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 47
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 47
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 47
  B

TrainOutput(global_step=72, training_loss=0.3390478247569667, metrics={'train_runtime': 585.3684, 'train_samples_per_second': 0.963, 'train_steps_per_second': 0.123, 'total_flos': 74711612841984.0, 'train_loss': 0.3390478247569667, 'epoch': 3.0})

Se aprecia cómo la tasa de acierto se ha mantenido en 95.74%. Lo esperable hubiera sido que se incrementase la tasa de acierto tras entrenar con un conjunto de datos con más ejemplos, pero no ha sido así. La tasa actual es suficientemente alta como para pensar que no es fácil de superar. No obstante, se podría volver al paso de exploración de datos, volver a etiquetar manualmente los que se consideren erróneamente etiquetados, y volver a entrenar el conjunto de datos, a fin de tratar de mejorar la tasa de acierto.

Finalmente, se guarda el modelo.

In [26]:
model.save_pretrained("distilbert-base-uncased-sentiment-banking")


Configuration saved in distilbert-base-uncased-sentiment-banking/config.json
Model weights saved in distilbert-base-uncased-sentiment-banking/pytorch_model.bin
