# Fine-tuning BERT models
In this second approach we fine-tune pretrained BERT models with PyTorch, using the Trainer class from the Transformers library, which makes training much simpler. The alternative would be to use native PyTorch and write the training and evaluation loops by hand.

First, we make sure the libraries needed in this notebook are installed.

In [1]:
# If the Transformers library is not installed yet:
!pip3 install transformers


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the dataset
df = pd.read_csv("data_twitter/tweets_finanzas.csv", delimiter=";")

p_test = 0.20 # Fraction of the full dataset held out for test
p_eval = 0.20 # Fraction of the remaining (non-test) data held out for eval

# Shuffle the dataset
random_state = 8
df = df.sample(frac=1, random_state=random_state)


# Training requires the labels to be encoded as numbers.
# We encode positives ('POS') as 1 and every other value as 0.
df['Sent_Target'] = df['Sent_Target'].apply(lambda x : 1 if x == 'POS'
                                                      else 0)
df['Sent_Sociedad'] = df['Sent_Sociedad'].apply(lambda x : 1 if x == 'POS'
                                                      else 0)
df['Sent_Empresas'] = df['Sent_Empresas'].apply(lambda x : 1 if x == 'POS'
                                                      else 0)

# First hold out the test set, then split the rest into train and eval
df_train, df_test = train_test_split (df, test_size = p_test, random_state = random_state)
df_train, df_eval = train_test_split (df_train, test_size = p_eval, random_state = random_state)

print("Examples used for training: ", len(df_train))
print("Examples used for evaluation: ", len(df_eval))
print("Examples used for test: ", len(df_test))

Examples used for training:  816
Examples used for evaluation:  204
Examples used for test:  255
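
The printed counts follow from applying the two 20% splits one after the other, so the effective proportions are not 80/20/20. The sketch below reproduces the arithmetic with the standard library only. The helper `two_stage_split_sizes` is a hypothetical name, and rounding each held-out size up is an assumption that happens to match the counts printed above.

```python
import math

def two_stage_split_sizes(n_total, test_pct, eval_pct):
    """Return (n_train, n_eval, n_test) for a test split followed by an eval split.

    Assumption: each held-out size is rounded up to the next integer.
    """
    n_test = math.ceil(n_total * test_pct / 100)
    n_remaining = n_total - n_test
    n_eval = math.ceil(n_remaining * eval_pct / 100)
    n_train = n_remaining - n_eval
    return n_train, n_eval, n_test

# Total number of tweets, taken from the printed counts
n_total = 816 + 204 + 255

sizes = two_stage_split_sizes(n_total, test_pct=20, eval_pct=20)
fractions = [round(size / n_total, 2) for size in sizes]

print(sizes)      # train, eval, test counts
print(fractions)  # share of the full dataset in each split
```

Under these assumptions the model is trained on roughly 64% of the data, with 16% used for evaluation and 20% for test.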


## BERTIN
We begin by loading the tokenizer and the pretrained model. The dataset has already been read and partitioned into training, evaluation and test sets. The first model we fine-tune on our dataset is BERTIN.




In [4]:
import numpy as np
import json
import torch
from sklearn.metrics import recall_score, precision_score, f1_score
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer                

In [24]:
# Set the model we want to work with and the number of
# classes we have
path_bert_model = "bertin-project/bertin-roberta-base-spanish" # BERTIN

NUM_LABELS = 2

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(path_bert_model)

# Load the PyTorch model for sequence classification
bert_class_model_pytorch = AutoModelForSequenceClassification.from_pretrained(path_bert_model, 
                                                                              num_labels=NUM_LABELS)


We preprocess the text with the tokenizer initialized above. The input passed to the tokenizer is a list of strings.

In [25]:
tokenized_train = tokenizer(df_train.Tweet.tolist(), truncation=True, padding = True)
tokenized_eval = tokenizer(df_eval.Tweet.tolist(), truncation=True, padding = True)
tokenized_test = tokenizer(df_test.Tweet.tolist(), truncation=True, padding = True)
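
With `padding=True` the tokenizer pads every sequence in a call to the length of the longest one and returns an attention mask marking real tokens with 1 and padding with 0. Since each split is tokenized in its own call, each split is padded to its own longest tweet. The sketch below imitates that behaviour in plain Python. The function `pad_batch`, the token ids and `pad_id=0` are made up for illustration; a real tokenizer uses its own padding token id.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one and build the attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Three made-up tokenized tweets of different lengths
batch = pad_batch([[5, 8, 13], [7], [4, 9]])

print(batch["input_ids"])
print(batch["attention_mask"])
```

`truncation=True` works in the other direction: sequences longer than the maximum length accepted by the model are cut down to that length.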

The next step is to prepare the datasets so that the model can be trained on them.

In [7]:
# The Trainer API expects the data to be wrapped in a torch.utils.data.Dataset.
# We therefore create a new class that inherits from the Torch Dataset class.
# The subclass needs to implement these methods:
# __getitem__(): lets Trainer build batches. It returns a dictionary with
# input_ids, attention_mask and, when the tokenizer provides them, token_type_ids for each text
# __len__(): returns the number of input examples

class Dataset(torch.utils.data.Dataset):    
    def __init__(self, encodings, labels=None):          
        self.encodings = encodings        
        self.labels = labels
     
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[idx])
        return item
        
    def __len__(self):
        return len(self.encodings["input_ids"])

In [26]:
dimension = "Sent_Empresas"
# Prepare the 3 datasets for fine-tuning
train_dataset = Dataset(tokenized_train, df_train[dimension].tolist())
eval_dataset = Dataset(tokenized_eval, df_eval[dimension].tolist())
test_dataset = Dataset(tokenized_test, df_test[dimension].tolist())

Now we define the training parameters and pass the pretrained model to the TrainingArguments and Trainer classes so that the model can be trained with a single command.

First we define a function that computes the metrics on the validation set. Since this is a binary classification problem with imbalanced classes, we use metrics that give more weight to the minority class, such as precision, recall and the F1-score, which are computed here for the positive class (label 1). Once everything is defined, we simply run trainer.train() to train the model.

In [9]:
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    pred = np.argmax(logits, axis=1)

    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)

    return {"Precision": precision, "Recall": recall, "F1": f1}
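
To make the three metrics concrete, the sketch below computes them from raw counts of true positives, false positives and false negatives, without scikit-learn. The helper `precision_recall_f1` and the counts are hypothetical: the notebook does not show a confusion matrix, and the numbers were chosen only because they yield the same ratios as the epoch-3 BERTIN evaluation reported further down. Returning 0.0 when a ratio is undefined mirrors scikit-learn's default fallback, which also emits a warning.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from raw counts.

    Ratios with a zero denominator are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical counts: 16 true positives, 6 false positives, 73 false negatives
p, r, f1 = precision_recall_f1(tp=16, fp=6, fn=73)
print(round(p, 6), round(r, 6), round(f1, 6))

# A model that never predicts the positive class has tp = fp = 0
never_positive = precision_recall_f1(tp=0, fp=0, fn=89)
print(never_positive)
```

The second call illustrates why all three metrics can be 0.0 at once: if the model predicts only the negative class, there are no true positives, so precision is undefined and recall is zero.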

In [27]:
# Define the arguments for Trainer
training_args = TrainingArguments(
    output_dir="./results",
    logging_dir = './logs',
    evaluation_strategy= "epoch",
    logging_strategy = "epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    seed = 123
)

trainer = Trainer(
    model=bert_class_model_pytorch,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 816
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 153


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.6831,0.67942,0.0,0.0,0.0
2,0.6414,0.942007,0.5,0.011236,0.021978
3,0.438,0.953406,0.727273,0.179775,0.288288


***** Running Evaluation *****
  Num examples = 204
  Batch size = 64
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 204
  Batch size = 64
***** Running Evaluation *****
  Num examples = 204
  Batch size = 64


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=153, training_loss=0.587514490863077, metrics={'train_runtime': 44.0113, 'train_samples_per_second': 55.622, 'train_steps_per_second': 3.476, 'total_flos': 114477975743040.0, 'train_loss': 0.587514490863077, 'epoch': 3.0})
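
The 153 optimization steps reported in the log can be checked by hand from the number of training examples, the batch size and the number of epochs. The function below is a simplified sketch of that calculation, assuming a single device; `total_optimization_steps` is a hypothetical helper, not part of the Transformers API.

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs, grad_accum_steps=1):
    """Estimate the number of optimizer updates for a full training run.

    Assumes a single device and that the last, possibly smaller, batch is kept.
    """
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs

# Values taken from the training log: 816 examples, batch size 16, 3 epochs
steps = total_optimization_steps(num_examples=816, batch_size=16, num_epochs=3)
print(steps)
```

With 816 examples and a batch size of 16 there are 51 batches per epoch, which over 3 epochs gives the 153 steps shown in `global_step`.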

In [28]:
print("PREDICTIONS ON EVAL")
bert_class_model_pytorch.eval()
print(json.dumps(trainer.evaluate(), indent = 2))

print("PREDICTIONS ON TEST")
predictions = trainer.predict(test_dataset)
print(json.dumps(predictions.metrics, indent = 2))

***** Running Evaluation *****
  Num examples = 204
  Batch size = 64


PREDICTIONS ON EVAL


***** Running Prediction *****
  Num examples = 255
  Batch size = 64


{
  "eval_loss": 0.9534059166908264,
  "eval_Precision": 0.7272727272727273,
  "eval_Recall": 0.1797752808988764,
  "eval_F1": 0.2882882882882883,
  "eval_runtime": 0.8548,
  "eval_samples_per_second": 238.647,
  "eval_steps_per_second": 4.679,
  "epoch": 3.0
}
PREDICTIONS ON TEST
{
  "test_loss": 0.8013733625411987,
  "test_Precision": 0.5384615384615384,
  "test_Recall": 0.22105263157894736,
  "test_F1": 0.31343283582089554,
  "test_runtime": 1.176,
  "test_samples_per_second": 216.844,
  "test_steps_per_second": 3.401
}


## RoBERTa-BNE
We repeat the process for RoBERTa-BNE.

In [12]:
path_bert_model = "PlanTL-GOB-ES/roberta-base-bne" # RoBERTa-BNE

NUM_LABELS = 2

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(path_bert_model)

# Load the PyTorch model for sequence classification
bert_class_model_pytorch = AutoModelForSequenceClassification.from_pretrained(path_bert_model, num_labels=NUM_LABELS)


In [13]:
tokenized_train = tokenizer(df_train.Tweet.tolist(), truncation=True, padding = True)
tokenized_eval = tokenizer(df_eval.Tweet.tolist(), truncation=True, padding = True)
tokenized_test = tokenizer(df_test.Tweet.tolist(), truncation=True, padding = True)

dimension = "Sent_Empresas"
# Prepare the 3 datasets for fine-tuning
train_dataset = Dataset(tokenized_train, df_train[dimension].tolist())
eval_dataset = Dataset(tokenized_eval, df_eval[dimension].tolist())
test_dataset = Dataset(tokenized_test, df_test[dimension].tolist())

In [14]:
# Define the arguments for Trainer
training_args = TrainingArguments(
    output_dir="./results",
    logging_dir = './logs',
    evaluation_strategy= "epoch",
    logging_strategy = "epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    seed = 123
)

trainer = Trainer(
    model=bert_class_model_pytorch,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 816
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 153


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.6702,0.599998,0.956522,0.247191,0.392857
2,0.3668,0.600626,0.69863,0.573034,0.62963
3,0.0843,0.748029,0.8,0.674157,0.731707


***** Running Evaluation *****
  Num examples = 204
  Batch size = 64
***** Running Evaluation *****
  Num examples = 204
  Batch size = 64
***** Running Evaluation *****
  Num examples = 204
  Batch size = 64


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=153, training_loss=0.373750929738961, metrics={'train_runtime': 44.4344, 'train_samples_per_second': 55.092, 'train_steps_per_second': 3.443, 'total_flos': 114477975743040.0, 'train_loss': 0.373750929738961, 'epoch': 3.0})

In [15]:
print("PREDICTIONS ON EVAL")
bert_class_model_pytorch.eval()
print(json.dumps(trainer.evaluate(), indent = 2))

print("PREDICTIONS ON TEST")
predictions = trainer.predict(test_dataset)
print(json.dumps(predictions.metrics, indent = 2))

***** Running Evaluation *****
  Num examples = 204
  Batch size = 64


PREDICTIONS ON EVAL


***** Running Prediction *****
  Num examples = 255
  Batch size = 64


{
  "eval_loss": 0.7480290532112122,
  "eval_Precision": 0.8,
  "eval_Recall": 0.6741573033707865,
  "eval_F1": 0.7317073170731706,
  "eval_runtime": 0.8553,
  "eval_samples_per_second": 238.5,
  "eval_steps_per_second": 4.676,
  "epoch": 3.0
}
PREDICTIONS ON TEST
{
  "test_loss": 1.0480542182922363,
  "test_Precision": 0.5631067961165048,
  "test_Recall": 0.6105263157894737,
  "test_F1": 0.5858585858585859,
  "test_runtime": 1.1748,
  "test_samples_per_second": 217.056,
  "test_steps_per_second": 3.405
}


## BETO

In [16]:
# Set the model we want to work with and the number of
# classes we have
path_bert_model = 'dccuchile/bert-base-spanish-wwm-uncased' # BETO

NUM_LABELS = 2

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(path_bert_model)

# Load the PyTorch model for sequence classification
bert_class_model_pytorch = AutoModelForSequenceClassification.from_pretrained(path_bert_model, num_labels=NUM_LABELS)


In [17]:
tokenized_train = tokenizer(df_train.Tweet.tolist(), truncation=True, padding = True)
tokenized_eval = tokenizer(df_eval.Tweet.tolist(), truncation=True, padding = True)
tokenized_test = tokenizer(df_test.Tweet.tolist(), truncation=True, padding = True)

dimension = "Sent_Empresas"
# Prepare the 3 datasets for fine-tuning
train_dataset = Dataset(tokenized_train, df_train[dimension].tolist())
eval_dataset = Dataset(tokenized_eval, df_eval[dimension].tolist())
test_dataset = Dataset(tokenized_test, df_test[dimension].tolist())

In [18]:
# Define the arguments for Trainer
training_args = TrainingArguments(
    output_dir="./results",
    logging_dir = './logs',
    evaluation_strategy= "epoch",
    logging_strategy = "epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    seed = 123
)

trainer = Trainer(
    model=bert_class_model_pytorch,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 816
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 153


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.7029,0.668219,0.678571,0.213483,0.324786
2,0.6074,0.605127,0.677419,0.47191,0.556291
3,0.348,0.658243,0.627907,0.606742,0.617143


***** Running Evaluation *****
  Num examples = 204
  Batch size = 64
***** Running Evaluation *****
  Num examples = 204
  Batch size = 64
***** Running Evaluation *****
  Num examples = 204
  Batch size = 64


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=153, training_loss=0.5527357213637408, metrics={'train_runtime': 43.1757, 'train_samples_per_second': 56.699, 'train_steps_per_second': 3.544, 'total_flos': 109445976809280.0, 'train_loss': 0.5527357213637408, 'epoch': 3.0})

In [19]:
print("PREDICTIONS ON EVAL")
bert_class_model_pytorch.eval()
print(json.dumps(trainer.evaluate(), indent = 2))

print("PREDICTIONS ON TEST")
predictions = trainer.predict(test_dataset)
print(json.dumps(predictions.metrics, indent = 2))

***** Running Evaluation *****
  Num examples = 204
  Batch size = 64


PREDICTIONS ON EVAL


***** Running Prediction *****
  Num examples = 255
  Batch size = 64


{
  "eval_loss": 0.6582428216934204,
  "eval_Precision": 0.627906976744186,
  "eval_Recall": 0.6067415730337079,
  "eval_F1": 0.6171428571428572,
  "eval_runtime": 0.9214,
  "eval_samples_per_second": 221.409,
  "eval_steps_per_second": 4.341,
  "epoch": 3.0
}
PREDICTIONS ON TEST
{
  "test_loss": 0.8133977055549622,
  "test_Precision": 0.46551724137931033,
  "test_Recall": 0.5684210526315789,
  "test_F1": 0.5118483412322274,
  "test_runtime": 1.3283,
  "test_samples_per_second": 191.977,
  "test_steps_per_second": 3.011
}
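
As a quick way to compare the three models evaluated so far, the snippet below ranks them by test F1. The scores are copied from the evaluation outputs above; the dictionary name `reported_f1` is just an illustrative choice. Each model was trained once with a single seed on a small dataset, so the ranking should be read as indicative rather than conclusive.

```python
# F1 scores copied from the evaluation outputs: (eval F1, test F1)
reported_f1 = {
    "BERTIN": (0.2882882882882883, 0.31343283582089554),
    "RoBERTa-BNE": (0.7317073170731706, 0.5858585858585859),
    "BETO": (0.6171428571428572, 0.5118483412322274),
}

# Sort the models by test F1, best first
ranking = sorted(reported_f1.items(), key=lambda item: item[1][1], reverse=True)

for name, (eval_f1, test_f1) in ranking:
    print(f"{name:<12} eval F1 = {eval_f1:.3f}   test F1 = {test_f1:.3f}")
```

For RoBERTa-BNE and BETO the test F1 is noticeably lower than the eval F1, which is worth keeping in mind when interpreting the evaluation figures.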


## RoBERTuito

In [20]:
# Set the model we want to work with and the number of
# classes we have
path_bert_model = "pysentimiento/robertuito-base-cased" # RoBERTuito

NUM_LABELS = 2

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(path_bert_model)

# Load the PyTorch model for sequence classification
bert_class_model_pytorch = AutoModelForSequenceClassification.from_pretrained(path_bert_model, num_labels=NUM_LABELS)

Model config RobertaConfig {
  "_name_or_path": "pysentimiento/robertuito-base-cased",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token

Some weights of the model checkpoint at pysentimiento/robertuito-base-cased were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 

In [21]:
# Note: truncation=True has no effect here unless max_length is also passed
# (see the warning printed below this cell).
tokenized_train = tokenizer(df_train.Tweet.tolist(), truncation=True, padding=True)
tokenized_eval = tokenizer(df_eval.Tweet.tolist(), truncation=True, padding=True)
tokenized_test = tokenizer(df_test.Tweet.tolist(), truncation=True, padding=True)

dimension = "Sent_Empresas"
# Prepare the three datasets for fine-tuning
train_dataset = Dataset(tokenized_train, df_train[dimension].tolist())
eval_dataset = Dataset(tokenized_eval, df_eval[dimension].tolist())
test_dataset = Dataset(tokenized_test, df_test[dimension].tolist())

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
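
The warning above says that, without a maximum length, `truncation=True` does nothing and only padding is applied. The following is a minimal pure-Python sketch of that behaviour, not the tokenizer's actual implementation: `pad_and_truncate`, the token ids and the padding id `1` are all hypothetical, and real tokenizers also preserve special tokens when truncating, which this sketch ignores.

```python
from typing import List, Optional


def pad_and_truncate(
    batch: List[List[int]],
    pad_id: int,
    max_length: Optional[int] = None,
) -> List[List[int]]:
    """Illustrative stand-in for padding=True, truncation=True (hypothetical helper)."""
    # Truncation can only happen when an explicit limit is known.
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]
    # padding=True pads every sequence to the longest one in the batch.
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch]


# Hypothetical token ids; 1 is an assumed padding id.
batch = [[5, 6, 7, 8, 9, 10], [5, 6]]
no_limit = pad_and_truncate(batch, pad_id=1)
with_limit = pad_and_truncate(batch, pad_id=1, max_length=4)
print(no_limit)
print(with_limit)
```

In the cell above, the equivalent fix would be to pass the tokenizer's `max_length` argument alongside `truncation=True`; the appropriate value depends on the model and is not stated in this notebook.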


In [22]:
# Define the arguments for the Trainer
training_args = TrainingArguments(
    output_dir="./results",
    logging_dir="./logs",
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    seed=123,
)

trainer = Trainer(
    model=bert_class_model_pytorch,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 816
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 153


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | 0.7055 | 0.673284 | 0.333333 | 0.011236 | 0.021739 |
| 2 | 0.6402 | 0.657836 | 0.590164 | 0.404494 | 0.48 |
| 3 | 0.5501 | 0.651259 | 0.553191 | 0.58427 | 0.568306 |


***** Running Evaluation *****
  Num examples = 204
  Batch size = 64
***** Running Evaluation *****
  Num examples = 204
  Batch size = 64
***** Running Evaluation *****
  Num examples = 204
  Batch size = 64


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=153, training_loss=0.6319015727323645, metrics={'train_runtime': 45.2672, 'train_samples_per_second': 54.079, 'train_steps_per_second': 3.38, 'total_flos': 115735975476480.0, 'train_loss': 0.6319015727323645, 'epoch': 3.0})
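
The figures in the training log and in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below copies the numbers reported by this run and assumes a single device with no gradient accumulation, as the log states; the variable names are chosen here for illustration only.

```python
import math

# Values copied from the training log and TrainOutput above.
num_examples = 816
batch_size = 16
num_epochs = 3
train_runtime = 45.2672  # seconds
epoch_losses = [0.7055, 0.6402, 0.5501]

# One optimizer step per batch (gradient accumulation steps = 1).
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Throughput as samples and steps processed per second of training.
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

# Every epoch has the same number of steps, so the plain mean of the
# per-epoch losses approximates the overall training loss.
mean_loss = sum(epoch_losses) / len(epoch_losses)

print(total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 2))
print(round(mean_loss, 4))
```

The step count reproduces the 153 optimization steps in the log, and the throughput and mean loss agree with the reported `train_samples_per_second`, `train_steps_per_second` and `training_loss` up to rounding.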

In [23]:
print("PREDICTIONS ON EVAL")
bert_class_model_pytorch.eval()
print(json.dumps(trainer.evaluate(), indent=2))

print("PREDICTIONS ON TEST")
predictions = trainer.predict(test_dataset)
print(json.dumps(predictions.metrics, indent=2))

***** Running Evaluation *****
  Num examples = 204
  Batch size = 64


PREDICTIONS ON EVAL


***** Running Prediction *****
  Num examples = 255
  Batch size = 64


{
  "eval_loss": 0.6512593030929565,
  "eval_Precision": 0.5531914893617021,
  "eval_Recall": 0.5842696629213483,
  "eval_F1": 0.5683060109289617,
  "eval_runtime": 0.926,
  "eval_samples_per_second": 220.296,
  "eval_steps_per_second": 4.32,
  "epoch": 3.0
}
PREDICTIONS ON TEST
{
  "test_loss": 0.7020901441574097,
  "test_Precision": 0.42735042735042733,
  "test_Recall": 0.5263157894736842,
  "test_F1": 0.4716981132075472,
  "test_runtime": 1.1773,
  "test_samples_per_second": 216.598,
  "test_steps_per_second": 3.398
}
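
The reported F1 values can be recovered from the precision and recall above. This is a minimal sketch that assumes `compute_metrics` (defined earlier in the notebook and not shown in this section) uses the standard binary F1, the harmonic mean of precision and recall; `f1_from_pr` is a helper name introduced here for illustration.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard binary F1)."""
    if precision + recall == 0:
        return 0.0  # Convention when both precision and recall are zero.
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the eval and test metrics above.
eval_f1 = f1_from_pr(0.5531914893617021, 0.5842696629213483)
test_f1 = f1_from_pr(0.42735042735042733, 0.5263157894736842)

print(f"eval F1 = {eval_f1:.6f}")
print(f"test F1 = {test_f1:.6f}")
print(f"gap     = {eval_f1 - test_f1:.6f}")
```

Both results match the logged `eval_F1` and `test_F1`, which is consistent with the assumption about how the metric is computed. The gap of roughly 0.10 between eval and test F1 is worth keeping in mind when comparing models on such a small dataset.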
