# Fine-tuning BETO for emoji prediction

In [1]:
from config import *

First we need to load the dataset. In our case it is stored as a pickled pandas DataFrame (source: [this tutorial](https://huggingface.co/course/chapter5/2)). The [datasets](https://huggingface.co/docs/datasets/index) library can load a dataset in this format. The paths can be changed in the file `config.py`.

In [2]:
from datasets import load_dataset

# data_files = {"train": file_names['df_es_train'], "test": file_names['df_es_test'], "trial": file_names['df_es_trial']}
data_files = {"train": file_names['df_es_trial'], "test": file_names['df_es_test']}  # for a quick test we train on the trial set instead of the full train set
dataset_emoji = load_dataset("pandas", data_files=data_files)
dataset_emoji

Using custom data configuration default-cb077f95137f42ce
Reusing dataset pandas (/home/camilo/.cache/huggingface/datasets/pandas/default-cb077f95137f42ce/0.0.0/6197c1e855b639d75a767140856841a562b7a71d129104973fe1962594877ade)
100%|██████████| 2/2 [00:00<00:00, 973.83it/s]


DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['id', 'text', 'label'],
        num_rows: 10000
    })
})

In [3]:
dataset_emoji = dataset_emoji.remove_columns("id")

In [4]:
dataset_emoji["train"][0]

{'text': 'Plaza de Oriente , Madrid .......#madrid #city #plazadeoriente #puertadesol #tour…',
 'label': '9'}

In [5]:
dataset_emoji["train"].features

{'text': Value(dtype='string', id=None),
 'label': Value(dtype='string', id=None)}

In [6]:
from datasets import ClassLabel

# Cast the string labels to a 19-class ClassLabel feature
dataset_emoji = dataset_emoji.cast_column("label", ClassLabel(num_classes=19))

Loading cached processed dataset at /home/camilo/.cache/huggingface/datasets/pandas/default-cb077f95137f42ce/0.0.0/6197c1e855b639d75a767140856841a562b7a71d129104973fe1962594877ade/cache-d9b1254ac31c380b.arrow
Loading cached processed dataset at /home/camilo/.cache/huggingface/datasets/pandas/default-cb077f95137f42ce/0.0.0/6197c1e855b639d75a767140856841a562b7a71d129104973fe1962594877ade/cache-e3006495028cdd71.arrow


In [7]:
dataset_emoji["train"].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=19, names=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18'], id=None)}
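The cast above turns the string labels into integer class ids. The following is a minimal pure-Python sketch of that mapping, not the `datasets` implementation. It assumes the default names `'0'` to `'18'` shown in the output above, and the helpers `str2int` and `int2str` are plain functions written here for illustration.

```python
# Illustrative sketch only: mimics the string <-> integer label mapping
# implied by ClassLabel(num_classes=19) with default names '0'..'18'.
NUM_CLASSES = 19
names = [str(i) for i in range(NUM_CLASSES)]
name_to_id = {name: idx for idx, name in enumerate(names)}


def str2int(label: str) -> int:
    """Map a string label such as '9' to its integer class id."""
    if label not in name_to_id:
        raise ValueError(f"Unknown label: {label!r}")
    return name_to_id[label]


def int2str(class_id: int) -> str:
    """Map an integer class id back to its string name."""
    if not 0 <= class_id < NUM_CLASSES:
        raise ValueError(f"Class id out of range: {class_id}")
    return names[class_id]


# Same shape as the first training example shown earlier (text shortened).
example = {"text": "Plaza de Oriente , Madrid", "label": "9"}
encoded = {**example, "label": str2int(example["label"])}
print(encoded["label"])
```

Integer labels are what a sequence classification model needs in order to compute its loss, which is why the cast comes before tokenization and training.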

We want to fine-tune BETO to predict emojis. Since the set of emojis is fixed, this is a classification task. We follow parts of [this tutorial](https://huggingface.co/docs/transformers/tasks/sequence_classification).

### Importing the tokenizer

In [8]:
from transformers import AutoTokenizer

model_id = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

We need a function that tokenizes the elements of the dataset.

In [9]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True)

Note that we do not need to split hashtags ourselves: the tokenizer does it automatically, although it does emit an extra token for the `#` character.

In [10]:
print(tokenizer("#UnHashtag",truncation=True))
print(tokenizer("Un Hashtag",truncation=True))

{'input_ids': [4, 3, 1044, 20247, 5001, 3483, 5], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [4, 1044, 1354, 5001, 3483, 5], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}
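The two outputs above have five and four tokens between the special tokens, which is consistent with a WordPiece-style subword split. As a rough illustration, here is a toy greedy longest-match-first WordPiece splitter. The vocabulary `TOY_VOCAB` is hypothetical and tiny; BETO's real vocabulary, its actual pieces for this word, and its handling of `#` are not verified here.

```python
# Toy greedy longest-match-first WordPiece split.
# TOY_VOCAB is a hypothetical vocabulary chosen for this example only.
TOY_VOCAB = {"un", "has", "##has", "##ht", "##ag", "[UNK]"}


def wordpiece(word, vocab):
    """Split one lowercased word into the longest matching vocab pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, treat '#' as its own token."""
    tokens = []
    for chunk in text.lower().split():
        word = ""
        for ch in chunk:
            if ch == "#":
                if word:
                    tokens.extend(wordpiece(word, vocab))
                    word = ""
                tokens.extend(wordpiece("#", vocab))
            else:
                word += ch
        if word:
            tokens.extend(wordpiece(word, vocab))
    return tokens


print(tokenize("#UnHashtag", TOY_VOCAB))
print(tokenize("Un Hashtag", TOY_VOCAB))
```

In this toy vocabulary `#` is absent, so it falls back to `[UNK]`; the point is only that a hashtag written without spaces still breaks down into reusable subword pieces.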


In [11]:
tokenized_dataset = dataset_emoji.map(preprocess_function, batched=True)

100%|██████████| 10/10 [00:00<00:00, 30.62ba/s]
100%|██████████| 10/10 [00:00<00:00, 25.97ba/s]


In [12]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
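`DataCollatorWithPadding` pads each batch to the length of its longest sequence when the batch is assembled. A minimal pure-Python sketch of that idea follows; `pad_batch` is a hypothetical helper, and `pad_token_id=0` is a placeholder (in practice the value would come from `tokenizer.pad_token_id`).

```python
def pad_batch(features, pad_token_id=0):
    """Pad every 'input_ids' list to the longest one in this batch.

    pad_token_id=0 is a placeholder for illustration, not BETO's real id.
    """
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = list(f["input_ids"])
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch


# The two token id sequences printed by the tokenizer cell above.
features = [
    {"input_ids": [4, 3, 1044, 20247, 5001, 3483, 5]},
    {"input_ids": [4, 1044, 1354, 5001, 3483, 5]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a single global length keeps short batches short, which saves computation on a dataset of short tweets.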

Now we load the model.

In [13]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=19)

Some weights of the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuc
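With `num_labels=19` the model produces 19 logits per example, and the classification head starts from fresh weights. As a reference point for reading the training loss, the sketch below computes softmax cross-entropy in pure Python (it is not the model's own code) and evaluates it for a classifier that spreads probability uniformly over the 19 classes, which gives ln 19.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])


NUM_LABELS = 19
uniform_logits = [0.0] * NUM_LABELS  # every class equally likely
baseline = cross_entropy(uniform_logits, target=9)
print(round(baseline, 3))  # 2.944, i.e. ln(19)
```

A training loss noticeably below this value suggests the model has learned more than uniform guessing, though it says nothing yet about accuracy on the test set.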

Next, we set up the `Trainer` from transformers.

In [14]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
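The arguments above do not set a learning-rate scheduler. Assuming the `Trainer` default of linear decay with no warmup, the learning rate falls from `2e-5` to zero over the run. The sketch below is a hypothetical helper, `linear_lr`, that reproduces this schedule by hand; the total of 625 steps is taken from the training log.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Linear warmup followed by linear decay to zero.

    Assumes the default linear schedule; warmup_steps=0 means no warmup.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# 625 optimization steps: 10000 examples, batch size 16, 1 epoch.
for step in (0, 250, 500, 625):
    print(step, linear_lr(step, total_steps=625))
```

At step 500 this gives roughly `4e-6`, which agrees with the `learning_rate` value logged at that step during training.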

As a first test, we run a single epoch.

**IMPORTANT:** in this version we train for only one pass over the trial set.

In [15]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 10000
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 625
 80%|████████  | 500/625 [02:00<00:31,  3.95it/s]Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json


{'loss': 2.5416, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.8}


Model weights saved in ./results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-500/special_tokens_map.json
100%|██████████| 625/625 [02:33<00:00,  4.18it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)


100%|██████████| 625/625 [02:33<00:00,  4.07it/s]

{'train_runtime': 153.7477, 'train_samples_per_second': 65.042, 'train_steps_per_second': 4.065, 'train_loss': 2.520683837890625, 'epoch': 1.0}





TrainOutput(global_step=625, training_loss=2.520683837890625, metrics={'train_runtime': 153.7477, 'train_samples_per_second': 65.042, 'train_steps_per_second': 4.065, 'train_loss': 2.520683837890625, 'epoch': 1.0})

Training took about two and a half minutes (`train_runtime` ≈ 153.7 s).
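The step count and throughput figures in the log can be checked with a few lines of arithmetic. This sketch assumes a single device and no gradient accumulation, as the log reports (total train batch size 16, accumulation steps 1).

```python
import math

# Values copied from the training log above.
num_examples = 10_000
batch_size = 16
epochs = 1
train_runtime = 153.7477  # seconds

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = num_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)
print(round(samples_per_second, 3))
print(round(steps_per_second, 3))
```

The results match the logged `Total optimization steps`, `train_samples_per_second` and `train_steps_per_second`, so the same arithmetic can be used to estimate how long a run on the full train set would take at a similar throughput.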