# Fine-tuning BETO for Emoji Prediction

In [1]:
from config import *

In [2]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /home/camilo/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


First we need to load the dataset. In our case it is stored as pickled pandas DataFrames (source: [this tutorial](https://huggingface.co/course/chapter5/2)). The [datasets](https://huggingface.co/docs/datasets/index) library can load a dataset in this format. The paths can be changed in `config.py`.
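The notebook imports `file_names` from `config.py`, which is not shown here. A minimal sketch of what such a file could look like follows; the directory name and the `.pkl` file names are assumptions made for illustration, and only the three dictionary keys come from the loading cell.

```python
# config.py (hypothetical sketch; the real file is not shown in this notebook)
from pathlib import Path

# Assumed location of the pickled DataFrames; adjust to your own layout.
DATA_DIR = Path("data")

# Keys match those used when building `data_files`; file names are illustrative.
file_names = {
    "df_es_train": str(DATA_DIR / "df_es_train.pkl"),
    "df_es_test": str(DATA_DIR / "df_es_test.pkl"),
    "df_es_trial": str(DATA_DIR / "df_es_trial.pkl"),
}
```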

In [3]:
from datasets import load_dataset

data_files = {"train": file_names['df_es_train'], "test": file_names['df_es_test'], "trial": file_names['df_es_trial']}
dataset_emoji = load_dataset("pandas", data_files=data_files)
dataset_emoji

Using custom data configuration default-228967d292f22886


Downloading and preparing dataset pandas/default to /home/camilo/.cache/huggingface/datasets/pandas/default-228967d292f22886/0.0.0/6197c1e855b639d75a767140856841a562b7a71d129104973fe1962594877ade...


Dataset pandas downloaded and prepared to /home/camilo/.cache/huggingface/datasets/pandas/default-228967d292f22886/0.0.0/6197c1e855b639d75a767140856841a562b7a71d129104973fe1962594877ade. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label'],
        num_rows: 81326
    })
    test: Dataset({
        features: ['id', 'text', 'label'],
        num_rows: 10000
    })
    trial: Dataset({
        features: ['id', 'text', 'label'],
        num_rows: 10000
    })
})

In [4]:
dataset_emoji = dataset_emoji.remove_columns("id")

In [5]:
dataset_emoji["train"][0]

{'text': 'Es imposible quererte más @ Plaza Del Callao - Madrid ',
 'label': '0'}

In [6]:
from datasets import ClassLabel

# Cast the string labels to a 19-class ClassLabel feature (integer ids).
dataset_emoji = dataset_emoji.cast_column("label", ClassLabel(num_classes=19))


In [7]:
dataset_emoji["train"].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=19, names=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18'], id=None)}
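The cast turns the string labels stored in the pickles (such as `'0'`) into integer class ids. As a rough illustration of the mapping that a `ClassLabel` with 19 unnamed classes implies, written in plain Python rather than with the `datasets` API:

```python
# Plain-Python illustration of the string-to-id mapping implied by
# ClassLabel(num_classes=19); this is not the datasets implementation.
NUM_CLASSES = 19

# With no explicit names, the class names are the strings "0" .. "18".
names = [str(i) for i in range(NUM_CLASSES)]
str2int = {name: idx for idx, name in enumerate(names)}

raw_label = "0"  # label of the first training example, as loaded
class_id = str2int[raw_label]
print(class_id, names[class_id])
```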

We want to fine-tune BETO to predict emojis. Since the set of emojis is fixed (19 classes here), this is a classification task. We follow parts of [this tutorial](https://huggingface.co/docs/transformers/tasks/sequence_classification).

### Loading the tokenizer

In [8]:
from transformers import AutoTokenizer

model_id = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

We need a function that tokenizes the elements of the dataset.

In [9]:
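def preprocess_function(examples):
    # Tokenize a batch of texts, truncating to the model's maximum length.
    # padding=True pads to the longest text within each batch passed by map().
    return tokenizer(examples["text"], truncation=True, padding=True)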

In [10]:
tokenized_dataset = dataset_emoji.map(preprocess_function, batched=True)




In [11]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
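`DataCollatorWithPadding` pads each batch to the length of its longest sequence at collation time. The idea can be sketched in plain Python; `pad_batch` and the pad id `0` are hypothetical stand-ins, not the library's code (the real collator takes the pad id from the tokenizer and returns tensors).

```python
from typing import Dict, List


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one in this batch."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, for illustration only.
padded = pad_batch([[4, 15, 8], [4, 23, 42, 16, 5]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per training batch usually wastes less computation than padding a much larger group of texts to its longest member.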

Now we load the model

In [12]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=19)

Some weights of the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuc

and the `Trainer` from transformers

In [14]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
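The `Trainer` above receives an `eval_dataset` but no metric function, so no accuracy is reported. One possible addition, sketched here as an assumption rather than as part of the original run, is a `compute_metrics` callable; `Trainer` accepts one through its `compute_metrics` argument and calls it with a `(logits, labels)` pair. The choice of plain accuracy and the toy data are illustrative.

```python
def compute_metrics(eval_pred):
    """Return accuracy given (logits, labels); works on lists or arrays."""
    logits, labels = eval_pred
    correct = 0
    total = 0
    for row, label in zip(logits, labels):
        scores = list(row)
        # The predicted class is the index of the largest logit.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == int(label))
        total += 1
    return {"accuracy": correct / total if total else 0.0}


# Toy check with made-up logits for 3 examples and 3 classes.
toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9]]
toy_labels = [1, 0, 1]
result = compute_metrics((toy_logits, toy_labels))
print(result)
```

It could be passed as `Trainer(..., compute_metrics=compute_metrics)` and triggered with `trainer.evaluate()`.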


We train the model on the training set for 5 epochs

In [15]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 81326
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 25415


| Step | Training Loss |
|------|---------------|
| 500  | 2.5409 |
| 1000 | 2.4048 |
| 1500 | 2.3791 |
| 2000 | 2.3199 |
| 2500 | 2.3295 |
| 3000 | 2.3105 |
| 3500 | 2.2812 |
| 4000 | 2.2877 |
| 4500 | 2.2901 |
| 5000 | 2.2718 |
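For context on these values: a classifier that assigns equal probability to all 19 classes has a cross-entropy of ln 19, so the first logged loss already sits below that baseline. The comparison below uses only numbers from the training-loss table.

```python
import math

num_classes = 19
uniform_loss = math.log(num_classes)  # cross-entropy of a uniform guess

first_logged_loss = 2.5409  # training loss logged at step 500
print(f"uniform baseline: {uniform_loss:.4f}")
print(f"below baseline: {first_logged_loss < uniform_loss}")
```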


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=25415, training_loss=1.8516290226766918, metrics={'train_runtime': 6494.306, 'train_samples_per_second': 62.613, 'train_steps_per_second': 3.913, 'total_flos': 1.3233162274764492e+16, 'train_loss': 1.8516290226766918, 'epoch': 5.0})
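The throughput figures in `TrainOutput` can be cross-checked against the runtime. This small calculation uses only values reported in the training output:

```python
train_runtime = 6494.306  # seconds, from TrainOutput
global_step = 25415
samples_seen = 81326 * 5  # examples per epoch times epochs

steps_per_second = global_step / train_runtime
samples_per_second = samples_seen / train_runtime
print(round(steps_per_second, 3), round(samples_per_second, 3))
print(f"about {train_runtime / 3600:.1f} hours of training")
```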

In [18]:
model.push_to_hub("beto-emoji")
tokenizer.push_to_hub("beto-emoji")



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Cloning https://huggingface.co/ccarvajal/beto-emoji into local empty directory.



Configuration saved in beto-emoji/config.json
Model weights saved in beto-emoji/pytorch_model.bin



To https://huggingface.co/ccarvajal/beto-emoji
   c101434..cdf8e47  main -> main




tokenizer config file saved in beto-emoji/tokenizer_config.json
Special tokens file saved in beto-emoji/special_tokens_map.json



To https://huggingface.co/ccarvajal/beto-emoji
   cdf8e47..26e6d4e  main -> main





'https://huggingface.co/ccarvajal/beto-emoji/commit/26e6d4ede403913e680e53a61995bf4acd05c31c'
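The `huggingface/tokenizers` warnings in the push output appear because the process was forked after the tokenizer had already run in parallel. As the message itself suggests, setting `TOKENIZERS_PARALLELISM` explicitly avoids them. A minimal sketch, meant to run before the tokenizer is first used (for example in the first cell):

```python
import os

# Set before the tokenizer is first used so that later forks
# (such as the subprocess calls seen during push_to_hub) do not
# trigger the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```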