# Fine-tuning BETO for Emoji Prediction

In [1]:
from config import *

First we need to import the dataset. In our case it is stored as a pickled pandas DataFrame (source: [this tutorial](https://huggingface.co/course/chapter5/2)). The [datasets](https://huggingface.co/docs/datasets/index) library lets us load a dataset in this format. The paths can be changed in the file config.py.
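The `config.py` module itself is not shown in this notebook. As a rough sketch, it could expose a `file_names` dictionary like the one used in the next cell. Only the three keys come from this notebook; the directory and file names below are hypothetical placeholders.

```python
from pathlib import Path

# Hypothetical data directory; adjust to wherever the pickled DataFrames live.
DATA_DIR = Path("data")

# Only the keys are taken from the notebook; the file names are assumptions.
file_names = {
    "df_es_train": str(DATA_DIR / "df_es_train.pkl"),
    "df_es_test": str(DATA_DIR / "df_es_test.pkl"),
    "df_es_trial": str(DATA_DIR / "df_es_trial.pkl"),
}
```

Keeping every path in one dictionary means the notebook cells never hard-code a location.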

In [37]:
from datasets import load_dataset

# data_files = {"train": file_names['df_es_train'], "test": file_names['df_es_test'], "trial": file_names['df_es_trial']}
data_files = {"train": file_names['df_es_trial'], "test": file_names['df_es_test']}  # for testing purposes we do not train on the full train set
dataset_emoji = load_dataset("pandas", data_files=data_files)
dataset_emoji

Using custom data configuration default-cb077f95137f42ce
Reusing dataset pandas (/home/camilo/.cache/huggingface/datasets/pandas/default-cb077f95137f42ce/0.0.0/6197c1e855b639d75a767140856841a562b7a71d129104973fe1962594877ade)
100%|██████████| 2/2 [00:00<00:00, 299.49it/s]


DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['id', 'text', 'label'],
        num_rows: 10000
    })
})

In [38]:
dataset_emoji["train"][0]

{'id': 'trial0',
 'text': 'Plaza de Oriente , Madrid .......#madrid #city #plazadeoriente #puertadesol #tour…',
 'label': '9'}

We want to fine-tune BETO for the task of predicting emojis. Here the set of emojis is fixed, so this is a classification task. We will follow parts of [this tutorial](https://huggingface.co/docs/transformers/tasks/sequence_classification).
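Because the set of classes is fixed, it can help to count the distinct labels before choosing the size of the classification head. The sketch below does this in plain Python; `summarize_labels` is a hypothetical helper and the sample labels are made up, written in the same string format as the `label` field shown above.

```python
from collections import Counter

def summarize_labels(labels):
    """Return (number of distinct classes, per-class counts) for string labels."""
    counts = Counter(labels)
    return len(counts), counts

# Hypothetical labels in the same string format as the 'label' column.
sample_labels = ["9", "3", "9", "0", "18", "3", "9"]
num_classes, counts = summarize_labels(sample_labels)
print(num_classes)            # 4
print(counts.most_common(1))  # [('9', 3)]
```

On the real data, something like `summarize_labels(dataset_emoji["train"]["label"])` would give the number of classes actually present in the split.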

### Importing the tokenizer

In [4]:
from transformers import AutoTokenizer

model_id = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

Downloading: 100%|██████████| 310/310 [00:00<00:00, 127kB/s]
Downloading: 100%|██████████| 650/650 [00:00<00:00, 199kB/s]
Downloading: 100%|██████████| 242k/242k [00:00<00:00, 310kB/s] 
Downloading: 100%|██████████| 475k/475k [00:01<00:00, 417kB/s]  
Downloading: 100%|██████████| 134/134 [00:00<00:00, 137kB/s]


We need a function that tokenizes the elements of the dataset.

In [20]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True)

Note that we do not need to split the hashtags ourselves, since the tokenizer does it automatically. It does, however, emit a separate token for the `#` character (id 3 in the output below).

In [21]:
print(tokenizer("#UnHashtag",truncation=True))
print(tokenizer("Un Hashtag",truncation=True))

{'input_ids': [4, 3, 1044, 20247, 5001, 3483, 5], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [4, 1044, 1354, 5001, 3483, 5], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}
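To make the two outputs above easier to read, here is a small sketch of greedy longest-match WordPiece splitting, the scheme used by BERT-style tokenizers. The vocabulary is a toy one invented for illustration, not BETO's, and `basic_split`, `wordpiece`, and `tokenize` are hypothetical helpers. In the toy vocabulary `#` is absent and falls back to `[UNK]`; the notebook does not say which token id 3 denotes in BETO.

```python
import re

# Toy vocabulary (an assumption); the real BETO vocabulary is much larger.
TOY_VOCAB = {"un", "has", "##has", "##ht", "##ag"}

def basic_split(text):
    """Lowercase the text and separate punctuation from word characters."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def tokenize(text, vocab=TOY_VOCAB):
    tokens = []
    for word in basic_split(text):
        tokens.extend(wordpiece(word, vocab))
    return tokens

print(tokenize("#UnHashtag"))  # ['[UNK]', 'un', '##has', '##ht', '##ag']
print(tokenize("Un Hashtag"))  # ['un', 'has', '##ht', '##ag']
```

The shape mirrors the real outputs: the hashtag version has one extra leading token, and the piece after `un` differs because it is a word continuation in one case and a word start in the other.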


In [42]:
dataset_emoji = dataset_emoji.remove_columns("id")

In [52]:
dataset_emoji["train"].features

{'text': Value(dtype='string', id=None),
 'label': Value(dtype='string', id=None)}

In [55]:
from datasets import Features, Value

dataset_emoji = dataset_emoji.cast_column("id",Value(dtype="int32"))

ValueError: The columns in features (['text', 'label', 'id']) must be identical as the columns in the dataset: ['text', 'label']

In [23]:
tokenized_dataset = dataset_emoji.map(preprocess_function, batched=True)

100%|██████████| 10/10 [00:00<00:00, 22.25ba/s]
100%|██████████| 10/10 [00:00<00:00, 24.40ba/s]


In [31]:
# tokenized_dataset = tokenized_dataset.remove_columns('id')

tokenized_dataset["train"][0]

{'text': 'Plaza de Oriente , Madrid .......#madrid #city #plazadeoriente #puertadesol #tour…',
 'label': '9',
 'input_ids': [4, 5450, 1009, 6567, 1019, 4555, 1008, 1008, 1008, 1008, 1008, 1008, 1008,
  3, 4555, 3, 8407, 3, 5450, 6556, 6529, 30959, 1043, 3, 2936, 1347, 1083, 3, 8449, 3, 5,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [24]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
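`DataCollatorWithPadding` pads each batch to the length of its longest sequence at collation time. The pure-Python sketch below only illustrates that idea and is not the library's implementation. `pad_batch` is a hypothetical helper, and the pad id 1 is an assumption read off the trailing 1s in the tokenized example above.

```python
def pad_batch(batch_input_ids, pad_token_id=1):
    """Pad every sequence to the longest one in this batch and build masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding positions
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[4, 1044, 5], [4, 1044, 1354, 5001, 3483, 5]])
print(batch["input_ids"][0])       # [4, 1044, 5, 1, 1, 1]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0, 0]
```

With padding handled per batch by the collator, passing `padding=True` inside `preprocess_function` is probably not required, since it pads each `map` batch to its own longest sequence ahead of time.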

Now we import the model

In [16]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=19)

Downloading: 100%|██████████| 419M/419M [00:43<00:00, 10.2MB/s] 
Some weights of the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClass

and the `Trainer` from transformers

In [25]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


As a first test, we run a single epoch.

**IMPORTANT:** in this version we train for only one pass over the trial set.
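As a quick sanity check on the size of this run, the number of optimizer steps for one pass can be estimated from the dataset size and batch size. The sketch assumes a single device and no gradient accumulation; `optimization_steps` is a hypothetical helper, not a `transformers` function.

```python
import math

def optimization_steps(num_examples, batch_size, epochs=1):
    """Estimate optimizer steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

# 10000 trial examples with a per-device batch size of 16, one epoch
print(optimization_steps(10000, 16))  # 625
```

This matches the `Total optimization steps = 625` line reported by the training log.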

In [26]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text, id. If text, id are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 10000
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 625


ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.