# Finetuning Transformer Models

In this tutorial we finetune a transformer model on the [emotion](https://huggingface.co/datasets/dair-ai/emotion) dataset. Each example in this dataset is labeled with one of six emotions: sadness (0), joy (1), love (2), anger (3), fear (4), and surprise (5).

# Steps
1. First, we load the data and inspect its attributes.
2. Then we tokenize the data.
3. Next, we set the model parameters and define the model.
4. Then we set the training arguments.
5. Finally, we train the model and evaluate its performance.

In [None]:
!pip install 'transformers[torch]'
!pip install datasets zstandard evaluate
!pip install accelerate -U

In [24]:
import datasets
from datasets import DatasetInfo
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
)
import torch
import evaluate
import numpy as np
from transformers import pipeline

In [4]:
# Requires a GPU runtime; mem_get_info() returns (free, total) GPU memory in bytes
print(f'(Free memory, Total memory): {torch.cuda.mem_get_info()}')

(Free memory, Total memory): (15727656960, 15835660288)
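
The two numbers returned by `torch.cuda.mem_get_info()` are byte counts (free first, then total). A minimal sketch of converting them to more readable units, using the values printed above; the helper name `to_gib` is made up for illustration:

```python
# Values copied from the cell output above: (free_bytes, total_bytes).
free_bytes, total_bytes = 15727656960, 15835660288

def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30

print(f"free:     {to_gib(free_bytes):.2f} GiB")
print(f"total:    {to_gib(total_bytes):.2f} GiB")
# Memory that is not free, reported in MiB (1 MiB = 2**20 bytes).
print(f"not free: {(total_bytes - free_bytes) / 2**20:.0f} MiB")
```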


In [5]:
# Load Dataset
# https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/loading_methods#datasets.load_dataset
ds = datasets.load_dataset("dair-ai/emotion")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.



In [6]:
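# Dataset Information
# Note: DatasetInfo(ds) only stores the DatasetDict in the `description` field.
# Per-split metadata is available via ds["train"].info and ds["train"].features.
DatasetInfo(ds)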

DatasetInfo(description=DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
}), citation='', homepage='', license='', features=None, post_processed=None, supervised_keys=None, task_templates=None, builder_name=None, dataset_name=None, config_name=None, version=None, splits=None, download_checksums=None, download_size=None, post_processing_size=None, dataset_size=None, size_in_bytes=None)

In [7]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In [8]:
ds["train"][0:5]

{'text': ['i didnt feel humiliated',
  'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
  'im grabbing a minute to post i feel greedy wrong',
  'i am ever feeling nostalgic about the fireplace i will know that it is still on the property',
  'i am feeling grouchy'],
 'label': [0, 0, 3, 2, 3]}

In [9]:
# Load the tokenizer that matches the model checkpoint
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")


In [11]:
def preprocess_function(batch):
    return tokenizer(batch["text"], truncation=True)

In [12]:
tokenized_ds = ds.map(preprocess_function, batched=True)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



In [13]:
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
})

In [14]:
tokenized_ds["train"][0]

{'text': 'i didnt feel humiliated',
 'label': 0,
 'input_ids': [101, 1045, 2134, 2102, 2514, 26608, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
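
The `input_ids` above start with 101 and end with 102, which in the BERT uncased vocabulary used by this checkpoint are the `[CLS]` and `[SEP]` special tokens that the tokenizer adds around the word pieces. A small sketch that checks this structure on the example row (the constant names are illustrative, not library identifiers):

```python
# Example row copied from the output above.
example = {
    "input_ids": [101, 1045, 2134, 2102, 2514, 26608, 102],
    "attention_mask": [1, 1, 1, 1, 1, 1, 1],
}

CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids in the BERT uncased vocab

ids = example["input_ids"]
assert ids[0] == CLS_ID and ids[-1] == SEP_ID

# Everything between the special tokens is a word-piece id from the text.
content_ids = ids[1:-1]
print(len(content_ids))  # 5 word pieces for the 4 words of the text

# One mask entry per token; all 1s because nothing is padded yet.
assert len(example["attention_mask"]) == len(ids)
```

The four words produce five word pieces because the tokenizer splits one of them into two sub-word units.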

In [15]:
# https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/data_collator#data-collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
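
`DataCollatorWithPadding` pads each batch to the length of its longest sequence rather than padding the whole dataset to one fixed length. The following is a plain-Python illustration of that idea, not the library's implementation; `pad_batch` is a hypothetical helper, and the pad id 0 is assumed to match the `[PAD]` token of the BERT uncased vocabulary.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: id of [PAD] in the BERT uncased vocab

def pad_batch(batch: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence in the batch to the longest one."""
    max_len = max(len(item["input_ids"]) for item in batch)
    padded = {"input_ids": [], "attention_mask": []}
    for item in batch:
        n_pad = max_len - len(item["input_ids"])
        padded["input_ids"].append(item["input_ids"] + [PAD_ID] * n_pad)
        # Padding positions get mask 0 so the model ignores them.
        padded["attention_mask"].append(item["attention_mask"] + [0] * n_pad)
    return padded

batch = [
    {"input_ids": [101, 1045, 2134, 2102, 2514, 26608, 102], "attention_mask": [1] * 7},
    {"input_ids": [101, 1045, 102], "attention_mask": [1] * 3},  # made-up short row
]
out = pad_batch(batch)
print(out["input_ids"][1])       # [101, 1045, 102, 0, 0, 0, 0]
print(out["attention_mask"][1])  # [1, 1, 1, 0, 0, 0, 0]
```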

In [16]:
# define our evaluation metrics
accuracy = evaluate.load("accuracy")


In [17]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
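
`compute_metrics` receives the raw logits and the true labels, takes the highest-scoring class per row, and compares it with the labels. A toy walk-through in plain Python, with made-up logits for three examples and three classes (the notebook itself uses `np.argmax` and the `evaluate` accuracy metric):

```python
# Made-up logits: one row per example, one column per class.
logits = [
    [2.1, 0.3, -1.0],   # highest score at index 0
    [0.2, 1.7, 0.1],    # highest score at index 1
    [0.5, 0.4, 0.9],    # highest score at index 2
]
labels = [0, 1, 1]

# Row-wise argmax: index of the largest logit in each row.
predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]

# Accuracy: fraction of predictions that match the labels.
accuracy_value = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(predictions)     # [0, 1, 2]
print(accuracy_value)  # 2 of 3 predictions are correct
```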

In [18]:
# id2label and label2id map between numeric label ids and human-readable label names.
# The dataset already stores numeric labels, so these identity mappings are placeholders
# and are not passed to the model below.
id2label = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5}  # e.g. {0: "sadness", 1: "joy", ...}
label2id = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5}  # e.g. {"sadness": 0, "joy": 1, ...}
# sadness (0), joy (1), love (2), anger (3), fear (4), surprise (5)
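
If readable names are preferred over generic `LABEL_<n>` outputs, the two mappings can be built from the label order given in the dataset description. A sketch, assuming that order; the commented-out call shows where the mappings would be passed but is not run here:

```python
# Label order as listed in the dataset description.
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in id2label.items()}

print(id2label[3])      # anger
print(label2id["joy"])  # 1

# These could then be passed when loading the model, for example:
# AutoModelForSequenceClassification.from_pretrained(
#     "distilbert/distilbert-base-uncased",
#     num_labels=6, id2label=id2label, label2id=label2id)
```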

In [19]:
# Define the model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased",
    num_labels=6,  # optionally also pass id2label=id2label, label2id=label2id
)



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:
# Set training arguments
# https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/trainer#transformers.TrainingArguments
training_args = TrainingArguments(
    output_dir="my_shiny_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)



In [22]:
# Create an instance of the trainer with the set training arguments
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



In [23]:
# Train Model
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.2579 | 0.191884 | 0.9275 |
| 2 | 0.1421 | 0.152576 | 0.9385 |


TrainOutput(global_step=2000, training_loss=0.3164832229614258, metrics={'train_runtime': 186.8297, 'train_samples_per_second': 171.279, 'train_steps_per_second': 10.705, 'total_flos': 389287358125632.0, 'train_loss': 0.3164832229614258, 'epoch': 2.0})
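
The `global_step=2000` in the output follows from the dataset size and the training arguments. A quick check, assuming a single device and no gradient accumulation (neither is set in the arguments above):

```python
import math

train_examples = 16000   # size of the train split
batch_size = 16          # per_device_train_batch_size
epochs = 2               # num_train_epochs

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1000 2000

# Throughput reported by the trainer, recomputed from the runtime.
train_runtime = 186.8297  # seconds, from TrainOutput
steps_per_second = total_steps / train_runtime
samples_per_second = train_examples * epochs / train_runtime
print(round(steps_per_second, 3), round(samples_per_second, 3))
```

With `save_strategy="epoch"`, a checkpoint is written at the end of each epoch, which is why the final checkpoint directory is named `checkpoint-2000`.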

In [25]:
pipe = pipeline("text-classification", model="my_shiny_model/checkpoint-2000")

In [27]:
pipe(["I feel great about acing my exam"])

[{'label': 'LABEL_1', 'score': 0.997857391834259}]
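
Because the model was loaded without `id2label`, the pipeline reports generic names such as `LABEL_1`. A small sketch that turns such an output into an emotion name, using the label order from the dataset description; `readable` is a hypothetical helper:

```python
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

def readable(prediction: dict) -> dict:
    """Replace a generic 'LABEL_<n>' with the matching emotion name."""
    index = int(prediction["label"].rsplit("_", 1)[1])
    return {"label": label_names[index], "score": prediction["score"]}

# Output copied from the pipeline call above.
raw = [{"label": "LABEL_1", "score": 0.997857391834259}]
decoded = [readable(p) for p in raw]
print(decoded[0]["label"])  # joy
```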

In [None]:
# Evaluate Model
trainer.evaluate(tokenized_ds["train"])

{'eval_loss': 0.09362220764160156,
 'eval_accuracy': 0.959375,
 'eval_runtime': 26.8417,
 'eval_samples_per_second': 596.088,
 'eval_steps_per_second': 37.255,
 'epoch': 2.0}

In [None]:
trainer.evaluate(tokenized_ds["validation"])

{'eval_loss': 0.15073446929454803,
 'eval_accuracy': 0.9385,
 'eval_runtime': 3.2226,
 'eval_samples_per_second': 620.613,
 'eval_steps_per_second': 38.788,
 'epoch': 2.0}

In [None]:
trainer.evaluate(tokenized_ds["test"])

{'eval_loss': 0.17803671956062317,
 'eval_accuracy': 0.9245,
 'eval_runtime': 3.0922,
 'eval_samples_per_second': 646.781,
 'eval_steps_per_second': 40.424,
 'epoch': 2.0}
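
To compare the three evaluations side by side, the accuracies can be converted into error counts. The figures below are copied from the outputs above, and the split sizes come from the dataset summary:

```python
# (accuracy, number of examples) per split, copied from the outputs above.
results = {
    "train": (0.959375, 16000),
    "validation": (0.9385, 2000),
    "test": (0.9245, 2000),
}

# Number of misclassified examples per split.
errors = {split: round((1 - acc) * n) for split, (acc, n) in results.items()}
for split, (acc, n) in results.items():
    print(f"{split:<10} accuracy={acc:.4f} misclassified={errors[split]} of {n}")

# Gap between train and test accuracy, a rough indicator of overfitting.
gap = results["train"][0] - results["test"][0]
print(f"train-test gap: {gap:.4f}")
```

The train accuracy is about 3.5 percentage points above the test accuracy.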

# To do
1. Choose one or two other suitable models from the Hugging Face Hub and compare their performance with the model trained here.

2. Reflect on and discuss how you would improve the performance of these models.

# Reference
[Hugging Face NLP Course](https://huggingface.co/learn/nlp-course/chapter1/1)