# Fine-Tuning BERT Models for Text Classification with the Hugging Face Trainer API

# 1. Introduction

## 1.1. What?

Fine-tuning is a machine learning technique that involves taking an existing model and replacing its prediction layer with a set of randomly initialized weights, which you then train using custom data for a small number of epochs (usually 3-5).
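
To make the idea of "replacing the prediction layer" concrete, here is a minimal PyTorch sketch. The tiny network is a hypothetical stand-in for a pretrained model, not BERT itself, and the layer sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a pretrained network: a "body" plus a 3-class prediction layer.
pretrained = nn.Sequential(
    nn.Linear(16, 8),  # body (pretend these weights were pretrained)
    nn.ReLU(),
    nn.Linear(8, 3),   # original prediction layer
)
body_before = pretrained[0].weight.clone()

# Fine-tuning setup: keep the body, swap in a freshly initialized 2-class layer.
pretrained[2] = nn.Linear(8, 2)

logits = pretrained(torch.randn(4, 16))  # batch of 4 made-up inputs
body_unchanged = torch.equal(body_before, pretrained[0].weight)
print(tuple(logits.shape), body_unchanged)
```

Only the new layer starts from random weights; everything else keeps what it learned during pretraining, which is why a few epochs are usually enough.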

## 1.2. Why?

BERT models contain a large number of weights, which are trained on massive datasets. This makes them great at many language processing tasks, and you can leverage that for your specialized task without having to spend thousands of dollars and months of your time training a model of this scale from the ground up.
This allows you to achieve good performance even on relatively small datasets.

## 1.3. How?

While you could do all this with a deep learning framework like PyTorch or TensorFlow, I prefer using the Trainer class from the transformers library, because it takes care of a lot of the background work, while also remaining highly customizable.

# Imports

In [1]:
from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments, pipeline
import torch
from torch.utils.data import Dataset
from sklearn.metrics import accuracy_score, f1_score

import pandas as pd
import random

# 2. The Trainer Class


Once a Trainer is instantiated, training can be started by calling a single method.

Before you can do that, however, you need to set up some parameters.
For fine-tuning, I recommend specifying at least the following:
- **model**: The model you want to fine-tune.
- **args**: A TrainingArguments object containing the hyperparameters used for training.
- **train_dataset**: A PyTorch dataset containing the training data.
- **eval_dataset**: Like train_dataset, but used for evaluation.
- **compute_metrics**: A function used for calculating evaluation metrics.

We will now go over how to set up each of these.

## 2.1. Model

Doc: https://huggingface.co/docs/transformers/model_doc/bert

The first thing you have to do is to load in the model you want to fine-tune. The transformers library has a large number available by default, including language-specific and multilingual models. You can also load user-submitted models from Hugging Face.

To load the model, call `BertForSequenceClassification.from_pretrained()` and specify the name of the model, as well as the number of classes in your dataset.

Don't be alarmed if you get a warning: By specifying `num_labels`, the model will be instantiated with a new prediction layer with random weights. This is exactly what you want.


In [2]:
model_name = 'bert-base-uncased' # name of the model; takes the form of 'username/model' for Hugging Face models
num_labels = 2 # number of classes

model = BertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 2.2. Training Arguments

Doc: https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/trainer#transformers.TrainingArguments

The TrainingArguments class allows you to set the hyperparameters used to train your model, such as learning rate, epochs, optimizer, weight decay, model checkpoints, etc.

While there is a lot to customize, you're best off using the default parameters most of the time.

I typically just set the model to evaluate and save at the end of every epoch, since the default behavior (a checkpoint every 500 training steps and no evaluation during training) is rarely what you want.

For the sake of the tutorial I will also disable experiment tracking.

In [3]:
training_args = TrainingArguments(output_dir = 'BERT_Tutorial', # directory where checkpoints will be saved
                                  eval_strategy = 'epoch', # evaluate every epoch
                                  save_strategy = 'epoch', # save every epoch
                                  report_to = "none" # disable experiment tracking / wandb integration
                                  )

## 2.3. Datasets

Of course, you need training data. For this example, I'm using the [IMDB Review Dataset](https://huggingface.co/datasets/stanfordnlp/imdb) to fine-tune a model for sentiment analysis. For the sake of time, I will only use a subset of the entire corpus.

In [4]:
# load the dataset
splits = {'train': 'plain_text/train-00000-of-00001.parquet', 'test': 'plain_text/test-00000-of-00001.parquet', 'unsupervised': 'plain_text/unsupervised-00000-of-00001.parquet'}
train = pd.read_parquet("hf://datasets/stanfordnlp/imdb/" + splits["train"])
testval = pd.read_parquet("hf://datasets/stanfordnlp/imdb/" + splits["test"])

# take subsets
train = train.sample(len(train)//40, random_state=42)
testval = testval.sample(len(testval)//40, random_state=42)
val = testval[:len(testval)//2]
test = testval[len(testval)//2:]

# separate text and labels
X_train, y_train = list(train['text']), list(train['label'])
X_val, y_val = list(val['text']), list(val['label'])
X_test, y_test = list(test['text']), list(test['label'])

### 2.3.1. Tokenization

Doc: https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer

To allow the model to process your input text, you will have to tokenize it using a model-specific tokenizer, which, like the model, is loaded in using the `from_pretrained()` method.

BERT accepts at most 512 tokens per input, but you can also set the maximum number of tokens to a lower number. Generally, using more tokens results in better performance, with the drawback that tokenization and training will take longer.

Since your input texts likely won't all contain the same number of tokens, you should set the model to pad shorter texts (i.e. add dummy tokens at the end) and truncate (i.e. cut off) longer texts.
There are several padding options, but I recommend always padding to `'max_length'` to avoid size mismatches between different batches of data.

You also need to set the `return_tensors` parameter to `'pt'` so that the tokenizer outputs are returned as PyTorch tensors.

The tokenizer will return a numerical representation of your tokens (`'input_ids'`), as well as an attention mask, which is a binary array that tells the model which of the tokens are actual words, and which are just padding.



In [5]:
tokenizer = BertTokenizer.from_pretrained(model_name)

test_str = 'This is a test string to showcase how the BERT tokenizer works'

tokens = tokenizer(test_str, # text to be tokenized (string or list of strings)
          max_length = 64, # how many tokens to use (max. 512)
          padding = 'max_length', # pad to max length if the input is shorter
          truncation = True, # cut off if the input is longer
          return_tensors = 'pt' # return the tokens as PyTorch tensor
          )

print(tokens['input_ids'])
print(tokens['attention_mask'])

tensor([[  101,  2023,  2003,  1037,  3231,  5164,  2000, 13398,  2129,  1996,
         14324, 19204, 17629,  2573,   102,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
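
To build some intuition for the output above, here is a simplified pure-Python sketch of what padding, truncation, and the attention mask amount to. The helper `pad_or_truncate` is hypothetical and not part of the transformers API; it assumes a padding ID of 0, as seen in the output. Unlike this sketch, the real tokenizer keeps the closing `[SEP]` token (ID 102) when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop anything beyond max_length
    n_pad = max_length - len(kept)           # number of dummy tokens needed
    input_ids = kept + [pad_id] * n_pad      # padding is appended at the end
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# The 15 real token IDs from the output above ([CLS] ... [SEP]).
ids = [101, 2023, 2003, 1037, 3231, 5164, 2000, 13398, 2129, 1996,
       14324, 19204, 17629, 2573, 102]

padded_ids, mask = pad_or_truncate(ids, max_length=64)     # shorter input: padded
short_ids, short_mask = pad_or_truncate(ids, max_length=8)  # longer input: cut off

print(len(padded_ids), sum(mask))
print(short_ids, short_mask)
```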


### 2.3.2. Labels

If the labels for your data are not already formatted as integers, you will have to convert them. The easiest way to do this is by using two dictionaries, one for converting the labels to integers, and the other for converting back.

Be sure to save them to the model config so that you can access the mappings when you load your model.

In [6]:
labels = ['negative', 'positive'] # a list keeps the order fixed, so 'negative' -> 0 and 'positive' -> 1

label_to_int = {l:i for i,l in enumerate(labels)}
int_to_label = {i:l for l,i in label_to_int.items()}

# overwrite the mappings in the config file
model.config.id2label = int_to_label
model.config.label2id = label_to_int
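
As a quick illustration of how the two dictionaries are used, here is a small sketch that converts string labels to integers and back. The `raw_labels` list is made up for the example, and the label names are kept in a list so that the integer assigned to each class is the same on every run.

```python
labels = ['negative', 'positive']  # ordered, so the mapping is reproducible

label_to_int = {label: i for i, label in enumerate(labels)}
int_to_label = {i: label for label, i in label_to_int.items()}

# Hypothetical string labels, e.g. read from a CSV file
raw_labels = ['positive', 'negative', 'negative', 'positive']

encoded = [label_to_int[label] for label in raw_labels]  # what the model trains on
decoded = [int_to_label[i] for i in encoded]             # back to readable names

print(encoded)
print(decoded)
```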

### 2.3.3. PyTorch Dataset

To load your data, you will have to create a PyTorch dataset. For this, you have to create a class that inherits from the PyTorch Dataset class and redefine the constructor, `__len__()` and `__getitem__()` methods.
Here's what each of them should do:
- `__init__()`: Stores your data and labels as attributes so that you can access them in the other methods. I like to do the label-to-integer conversion and tokenization in here for convenience.
- `__len__()`: Returns the number of samples in your dataset.
- `__getitem__()`: Returns a dictionary containing the input ids, attention mask, and class label for a specific index. The output has to follow a specific format, so rather than explain it here I would encourage you to just look at the code.

In [7]:
class BERTDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=64):
        # The IMDB labels are already integers; if yours are strings, convert them instead:
        # self.labels = torch.tensor([label_to_int[l] for l in labels])
        self.labels = torch.tensor(labels)

        # Tokenize the texts
        self.tokens = tokenizer(
            texts,
            max_length = max_length,
            padding = 'max_length',
            truncation = True,
            return_tensors = 'pt'
        )

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids': self.tokens['input_ids'][idx],
            'attention_mask': self.tokens['attention_mask'][idx],
            'labels': self.labels[idx]
        }

In [8]:
# create a dataset for each partition
train_dataset = BERTDataset(X_train, y_train, tokenizer)
eval_dataset = BERTDataset(X_val, y_val, tokenizer)
test_dataset = BERTDataset(X_test, y_test, tokenizer)
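
To see why every sample should have the same length, here is a small sketch of how dictionary-style samples get stacked into a batch. `ToyDataset` is a hypothetical stand-in with the same interface as the dataset class above, built from hand-made token IDs so that it runs without a tokenizer. It uses a plain PyTorch `DataLoader`; the Trainer handles batching for you, so treat this as an illustration of the idea rather than of its internals.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Same interface as the dataset class above, but built from hand-made lists."""
    def __init__(self, input_ids, attention_mask, labels):
        self.input_ids = torch.tensor(input_ids)
        self.attention_mask = torch.tensor(attention_mask)
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'labels': self.labels[idx],
        }

# Three made-up samples, each already padded to 4 tokens.
toy = ToyDataset(
    input_ids=[[101, 7, 102, 0], [101, 8, 9, 102], [101, 102, 0, 0]],
    attention_mask=[[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 0, 0]],
    labels=[0, 1, 0],
)

# Samples of equal length can be stacked into one tensor per key.
batch = next(iter(DataLoader(toy, batch_size=2)))
batch_shapes = {key: tuple(value.shape) for key, value in batch.items()}
print(batch_shapes)
```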

## 2.4. Metrics

By default, the Trainer will only return the loss when evaluating the model. If you want to use some more informative performance metrics, such as accuracy or F1-score, you will have to implement a custom function to compute them.

Note: The predictions made by the model are logits (i.e. the input to the sigmoid/softmax function), so to get the predicted value you will have to take the argmax.

In [9]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(-1)

    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average = "macro")

    return {"accuracy": accuracy, "f1": f1} # return the results as a dictionary
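
To make the note about logits concrete, here is a pure-Python sketch with made-up logits and labels. It shows that taking the argmax of the logits gives the same predictions as taking the argmax of the softmax probabilities, and spells out what accuracy and macro-averaged F1 compute. The helpers `softmax`, `argmax`, and `macro_f1` are written for illustration only; in practice scikit-learn does this for you.

```python
import math

def softmax(row):
    m = max(row)                               # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    return max(range(len(row)), key=row.__getitem__)

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.0]]  # made-up model outputs
y_true = [0, 1, 0, 0]                                            # made-up gold labels

preds_from_logits = [argmax(row) for row in toy_logits]
preds_from_probs = [argmax(softmax(row)) for row in toy_logits]  # same result

toy_accuracy = sum(p == t for p, t in zip(preds_from_logits, y_true)) / len(y_true)
toy_f1 = macro_f1(y_true, preds_from_logits, classes=[0, 1])
print(preds_from_logits, toy_accuracy, round(toy_f1, 4))
```

Softmax preserves the ordering of the logits, so you can skip it when all you need is the predicted class.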

# 3. Training, Evaluation, and Making Predictions

Doc: https://huggingface.co/docs/transformers/main_classes/trainer

Once you have everything set up, you can instantiate the Trainer.

In [10]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    compute_metrics = compute_metrics
)

Training is as simple as calling the `train()` method.

If you have experiment tracking enabled, you will also have to enter your API key before training can start.

In [11]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.548049 | 0.714744 | 0.702618 |
| 2 | No log | 0.451371 | 0.782051 | 0.782015 |
| 3 | No log | 0.652335 | 0.772436 | 0.7714 |


TrainOutput(global_step=237, training_loss=0.43927111404354563, metrics={'train_runtime': 51.7208, 'train_samples_per_second': 36.252, 'train_steps_per_second': 4.582, 'total_flos': 61666653600000.0, 'train_loss': 0.43927111404354563, 'epoch': 3.0})
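
The `global_step=237` in the output above can be checked by hand. This is a minimal sketch that assumes the 25,000-review IMDB training split, a single device, and the TrainingArguments defaults as I understand them (batch size 8, 3 epochs, logging every 500 steps), none of which were set explicitly in this tutorial.

```python
import math

n_train = 25_000 // 40  # size of the training subset taken earlier: 625 reviews
batch_size = 8          # assumed default per-device training batch size
num_epochs = 3          # assumed default number of training epochs
logging_steps = 500     # assumed default logging interval

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps, total_steps < logging_steps)
```

Under these assumptions the whole run is shorter than one logging interval, which is likely why the training loss column shows `No log`. Setting a smaller `logging_steps` in the TrainingArguments should make the training loss appear.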

After you've finished training, you can use the `evaluate()` method to evaluate your model's performance on the test set.

In [12]:
trainer.evaluate(test_dataset)

{'eval_loss': 0.6296583414077759,
 'eval_accuracy': 0.7731629392971247,
 'eval_f1': 0.7710597616128734,
 'eval_runtime': 1.1463,
 'eval_samples_per_second': 273.049,
 'eval_steps_per_second': 34.894,
 'epoch': 3.0}

To make predictions, you can create a pipeline with your fine-tuned model and the tokenizer you used.

In [13]:
test_string = "I hate this movie! It is terrible and I never want to watch it again. I hope whoever made it is ashamed of themselves!!!!"

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
classifier(test_string)

Device set to use cuda:0


[{'label': 'negative', 'score': 0.985762357711792}]

# 4. Saving and Loading

Use the `save_pretrained()` method and specify a directory to save your model. You can also save the tokenizer, which can be useful if you're training a lot of different models and want to stay organized, especially since it doesn't require a lot of disk space.


In [14]:
model.save_pretrained('./model')
tokenizer.save_pretrained('./tokenizer')

('./tokenizer/tokenizer_config.json',
 './tokenizer/special_tokens_map.json',
 './tokenizer/vocab.txt',
 './tokenizer/added_tokens.json')

Like before, loading is done by calling `from_pretrained()`, this time using the path to where the model is saved.

In [15]:
loaded_model = BertForSequenceClassification.from_pretrained('./model')
loaded_tokenizer = BertTokenizer.from_pretrained('./tokenizer')

classifier = pipeline("text-classification", model=loaded_model, tokenizer=loaded_tokenizer)
classifier(test_string)

Device set to use cuda:0


[{'label': 'negative', 'score': 0.985762357711792}]