# Fine-Tuning Transformer Models for Classification of Digital Behavioural Data
## Classifying political stance

By Indira Sen

This notebook demonstrates how to fine-tune transformer models like BERT for classification. We show two options for using transformer models in Python:
- HuggingFace `transformers`
- Simple Transformers

We will fine-tune transformer models like BERT, DistilBERT, and RoBERTa for a task common in Computational Social Science: political stance detection.


<br><br>

## **Import necessary Python libraries and modules**

First, we will import necessary Python libraries and modules. These include scikit-learn (`sklearn`) and PyTorch (`torch`), for various machine learning tools.

In [1]:
# Basic Python modules
from collections import defaultdict

# For data manipulation and analysis
import pandas as pd
import numpy as np

# For deep learning
# https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html
import torch

We first load the dataset we need for fine-tuning our models. This is a **supervised** classification task, so we need labeled data. We use the following dataset:

1. [PStance](https://github.com/chuchun8/PStance)

In [2]:
data = pd.read_csv('raw_train_biden.csv')
data.head()

Unnamed: 0,Tweet,Target,Stance
0,Joe Biden is looking to gather votes from unsu...,Joe Biden,AGAINST
1,Check out the latest podcast conversation betw...,Joe Biden,FAVOR
2,Thank you Secretary Clinton for your endorseme...,Joe Biden,FAVOR
3,Happening now: @JoeBiden kicking off #Hispanic...,Joe Biden,FAVOR
4,Thank you Mayor @KeishaBottoms for opening our...,Joe Biden,FAVOR


We first use the [`simpletransformers`](https://simpletransformers.ai/) package, which is more beginner-friendly.

In [3]:
! pip3 install simpletransformers



The basic steps for fine-tuning a classifier using simpletransformers are:
- Initialize a model based on a specific architecture (BERT, DistilBERT, etc.)
- Train the model with train_model()
- Evaluate the model with eval_model()
- Make predictions on (unlabelled) data with predict()

In [4]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import logging

In [5]:
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

Before we start fine-tuning, we need to preprocess the data. In this step, we split the dataset into **train** and **test** sets so that we have a fully held-out test set for evaluating our classifier.

We can also create a **validation** set that is used during fine-tuning for hyperparameter tuning, but that is not mandatory.

In [6]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(data, stratify=data['Stance'], test_size=0.2)
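If a validation set is wanted, one simple option is to call `train_test_split` twice. The cell below is a minimal sketch on a toy dataframe, not the PStance data. The 60/20/20 proportions and the fixed `random_state` are our own assumptions, chosen only to make the example reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the PStance data: 20 made-up tweets with balanced labels.
toy = pd.DataFrame({
    "Tweet": [f"tweet {i}" for i in range(20)],
    "Stance": ["FAVOR", "AGAINST"] * 10,
})

# First carve out the held-out test set (20% of all rows).
train_val, test = train_test_split(
    toy, test_size=0.2, stratify=toy["Stance"], random_state=42
)

# Then split the remainder into train (75%) and validation (25%),
# which gives roughly 60/20/20 overall.
train, val = train_test_split(
    train_val, test_size=0.25, stratify=train_val["Stance"], random_state=42
)

print(len(train), len(val), len(test))
```

Stratifying both splits keeps the share of each stance label similar across the three sets.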

We now convert the dataframes into a format that can be read by simpletransformers. This is a dataframe with the columns 'text' and 'labels'. The 'labels' column should be numerical, so we use **label encoding** (one integer per class, via scikit-learn's `LabelEncoder`) to transform our string stance labels into numerical ones.

In [7]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(train_df['Stance'])
train_df['labels'] = le.transform(train_df['Stance'])
test_df['labels'] = le.transform(test_df['Stance'])

In [8]:
# to see which number was mapped to which class:
list(le.inverse_transform([0,1]))

['AGAINST', 'FAVOR']

So, 0 is 'against' and 1 is 'favor'. We now have the appropriate data structure. The next step is setting the training parameters and loading the classification model, in this case, DistilBERT, a lightweight model that can be trained relatively quickly compared to other transformer variants like BERT and RoBERTa.

For training parameters, we have many to choose from such as the learning rate, whether we want to stop early or not, where we should save the model, and more. You can find all of them here: https://simpletransformers.ai/docs/usage/

As a minimal setup, we will just set the number of **epochs**, i.e., the number of passes the model does over the full training set. For recent transformer models, epochs are usually set to 2 or 3, after which overfitting may happen.

In [9]:

# Optional model configuration
model_args = ClassificationArgs(num_train_epochs=3, output_dir='output_st', overwrite_output_dir=True)

# Create a ClassificationModel
model = ClassificationModel(
    "distilbert", "distilbert-base-uncased", args=model_args
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We are now finally ready to begin training! This might take a while, especially when we're not using a GPU.

In [10]:
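# Keep only the text and label columns, in that order. The text column is still
# named 'Tweet' rather than 'text', so simpletransformers falls back to reading
# the first column as the text and the second as the labels.
train_df = train_df[['Tweet', 'labels']]
test_df = test_df[['Tweet', 'labels']]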

In [11]:
# Train the model
model.train_model(train_df)



  0%|          | 0/4644 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 0 of 3:   0%|          | 0/581 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/581 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/581 [00:00<?, ?it/s]

(1743, 0.4079835763007693)
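The numbers in the progress bars and in the returned tuple can be cross-checked with a little arithmetic. The sketch below assumes a training batch size of 8, which we believe is the simpletransformers default since we did not set it ourselves. We read the first element of the tuple as the total number of optimisation steps and the second as the average training loss.

```python
import math

n_train = 4644   # training examples, from the first progress bar
batch_size = 8   # assumption: default train_batch_size, not set in model_args
epochs = 3       # num_train_epochs from model_args

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

Under that assumption, the result agrees with the 581 batches shown per epoch and the 1743 steps reported after training.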

After training our model, we can use it to make predictions for unlabeled datapoints to classify the stance of the tweet towards the predefined target.

In [12]:
anti_biden_tweet = "Ugh, this was true yesterday and it's also true now: Biden is an idiot"
predictions, raw_outputs = model.predict([anti_biden_tweet])
le.inverse_transform(predictions)

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

array(['AGAINST'], dtype=object)

We can also use the held-out test set to quantitatively evaluate our model.

In [13]:
# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(test_df)
result



  0%|          | 0/1162 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/146 [00:00<?, ?it/s]

{'mcc': 0.5361786135663668,
 'tp': 356,
 'tn': 542,
 'fp': 109,
 'fn': 155,
 'auroc': 0.834780752778354,
 'auprc': 0.8319187479756306,
 'eval_loss': 0.7840714748591593}

In [14]:
# you can also use sklearn's neat classification report to get more metrics
from sklearn.metrics import classification_report

In [15]:
preds, _ = model.predict(list(test_df['Tweet'].values))
# preds = le.inverse_transform(preds)

print(classification_report(test_df['labels'], preds))

  0%|          | 0/1162 [00:00<?, ?it/s]

  0%|          | 0/146 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0       0.78      0.83      0.80       651
           1       0.77      0.70      0.73       511

    accuracy                           0.77      1162
   macro avg       0.77      0.76      0.77      1162
weighted avg       0.77      0.77      0.77      1162
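The confusion counts returned by `eval_model` are enough to recompute the headline numbers by hand. This is a useful sanity check that the two reports agree. The sketch below uses only the `tp`, `tn`, `fp`, and `fn` values printed above and treats class 1 ('FAVOR') as the positive class.

```python
import math

# Confusion counts copied from the eval_model output above.
tp, tn, fp, fn = 356, 542, 109, 155

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the tweets predicted FAVOR, how many are FAVOR
recall = tp / (tp + fn)      # of the true FAVOR tweets, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Matthews correlation coefficient, reported as 'mcc' by eval_model.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(round(accuracy, 2), round(precision, 2), round(recall, 2),
      round(f1, 2), round(mcc, 4))
```

The rounded values match the row for class 1 in the classification report and the `mcc` entry in the evaluation result.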



We now repeat the same process with the HuggingFace [`transformers` Python library](https://huggingface.co/transformers/installation.html).

In [16]:
!pip3 install transformers



In [17]:
! pip install -U accelerate
! pip install -U transformers



We will again use DistilBERT.

In [18]:
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments

Next, we set a few configuration values.

In [19]:
model_name = 'distilbert-base-uncased'
device_name = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to the CPU if no GPU is available

# This is the maximum number of tokens in any document; the rest will be truncated.
max_length = 512

# This is the name of the directory where we'll save our model. You can name it whatever you want.
cached_model_directory_name = 'output_hf'

We will reuse the train-test splits we created for simpletransformers, but change the data structure slightly.

In [20]:
train_texts = train_df['Tweet']#.values
train_labels = train_df['labels']#.values

test_texts = test_df['Tweet']#.values
test_labels = test_df['labels']#.values

Compared to simpletransformers, HuggingFace gives us a closer look at what happens 'under the hood'. In particular, we see more of how the text is transformed: each tweet is truncated if it has more than 512 tokens and padded if it has fewer.

The text is split into "word pieces" by the transformers tokenizer (`DistilBertTokenizerFast` in this case, to match the DistilBERT model). Some special tokens are also added, such as **[CLS]** (placed at the start of every input) and **[SEP]** (placed at the end of each input sequence; it separates the two segments when a pair of texts is encoded, not the individual sentences inside a tweet):

In [21]:
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
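To build intuition for word pieces, special tokens, truncation, and padding, here is a toy sketch in plain Python. It is not the real DistilBERT tokenizer: the tiny vocabulary, the helper names `wordpiece` and `encode`, and the whitespace splitting are all hypothetical simplifications. The real tokenizer also handles punctuation and has a vocabulary of roughly 30,000 entries.

```python
# Hypothetical mini-vocabulary, for illustration only.
VOCAB = {"[PAD]", "[UNK]", "[CLS]", "[SEP]",
         "biden", "is", "un", "##stop", "##pable", "vote", "##s"}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab, max_length):
    """Add special tokens, then truncate or pad to exactly max_length."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens = tokens[: max_length - 1] + ["[SEP]"]      # truncate, keep room for [SEP]
    tokens += ["[PAD]"] * (max_length - len(tokens))   # pad up to max_length
    return tokens

print(encode("Biden is unstoppable", VOCAB, max_length=10))
```

In this toy setup, a word that is not in the vocabulary as a whole is broken into smaller known pieces, and every encoded text comes out with the same length.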

We now encode our texts using the tokenizer.

In [22]:
from datasets import Dataset

train_df = Dataset.from_pandas(train_df)
test_df = Dataset.from_pandas(test_df)

def tokenize_function(examples):
  return tokenizer(examples["Tweet"], padding="max_length", truncation=True)


tokenized_train_df = train_df.map(tokenize_function, batched=True)
tokenized_test_df = test_df.map(tokenize_function, batched=True)

Map:   0%|          | 0/4644 [00:00<?, ? examples/s]

Map:   0%|          | 0/1162 [00:00<?, ? examples/s]

We now load the DistilBERT model and move it to the device configured above (here, the GPU).

In [23]:
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=len(le.classes_)).to(device_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


As we did with simpletransformers, we now set the training parameters, i.e., the number of epochs.

In [25]:
training_args = TrainingArguments(
    num_train_epochs=3,              # total number of training epochs
    output_dir='./results',          # output directory
    report_to='none'
)

<br><br>

## **Fine-tune the DistilBERT model**

First, we define a custom evaluation function that returns the accuracy. You could modify this function to return precision, recall, F1, and/or other metrics.

In [26]:
from sklearn.metrics import accuracy_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
  }
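As one possible extension, the function can also report precision, recall, and F1. The sketch below is an assumption about how you might do this, not part of the original notebook. The name `compute_metrics_extended` is ours, macro averaging is one choice among several, and the `SimpleNamespace` object only imitates the `label_ids` and `predictions` attributes that `compute_metrics` reads above.

```python
from types import SimpleNamespace

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics_extended(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # Macro averaging weights both stance classes equally.
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up logits for four tweets, used only to try the function out.
fake_pred = SimpleNamespace(
    label_ids=np.array([0, 1, 1, 0]),
    predictions=np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [3.0, 0.0]]),
)
metrics = compute_metrics_extended(fake_pred)
print(metrics)
```

To use it during fine-tuning, you would pass this function as `compute_metrics` when creating the `Trainer`.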

Then we create a HuggingFace `Trainer` object using the `TrainingArguments` object that we created above. We also pass our `compute_metrics` function to the `Trainer` object, along with our training dataset. The test dataset is passed later, when we call `predict`.

In [27]:
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=tokenized_train_df,    # training dataset
    compute_metrics=compute_metrics      # our custom evaluation function
)

Time to finally fine-tune!

In [28]:
trainer.train()

Step,Training Loss
500,0.5225
1000,0.3574
1500,0.2371


TrainOutput(global_step=1743, training_loss=0.3444552965979431, metrics={'train_runtime': 687.2171, 'train_samples_per_second': 20.273, 'train_steps_per_second': 2.536, 'total_flos': 1845535798075392.0, 'train_loss': 0.3444552965979431, 'epoch': 3.0})

<br><br>

## **Save fine-tuned model**

The following cell will save the model and its configuration files to a directory in Colab. To preserve this model for future use, you should download the model to your computer.

In [29]:
trainer.save_model(cached_model_directory_name)

(Optional) If you've already fine-tuned and saved the model, you can reload it with the following lines and wrap it in a new `Trainer`. You don't have to rerun fine-tuning every time you want to evaluate.

In [30]:
# model = DistilBertForSequenceClassification.from_pretrained(cached_model_directory_name).to(device_name)
# trainer = Trainer(model=model, args=training_args, compute_metrics=compute_metrics)

We can now evaluate the model by predicting the labels for the test set.

In [31]:
predicted_results = trainer.predict(tokenized_test_df)

In [32]:
predicted_labels = predicted_results.predictions.argmax(-1) # Pick the class with the highest score (logit)
predicted_labels = predicted_labels.flatten().tolist()      # Flatten the predictions into a 1D list
predicted_labels[0:5]

[0, 1, 0, 1, 1]
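The values in `predicted_results.predictions` are raw scores (logits), not probabilities. Taking the argmax still gives the most probable class, because the softmax function preserves the ordering of the scores. The sketch below illustrates this with made-up logits for two tweets. The numbers and the `softmax` helper are hypothetical and not taken from the model above.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Made-up logits: one row per tweet, one column per class (0 = AGAINST, 1 = FAVOR).
logits = [[1.2, -0.3], [-0.8, 2.1]]

probs_per_tweet = [softmax(row) for row in logits]
labels_from_logits = [argmax(row) for row in logits]
labels_from_probs = [argmax(row) for row in probs_per_tweet]

print(labels_from_logits, labels_from_probs)
```

If you need confidence scores rather than labels, the softmax output is what you would inspect.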

In [33]:
print(classification_report(tokenized_test_df['labels'],
                            predicted_labels))

              precision    recall  f1-score   support

           0       0.80      0.88      0.84       651
           1       0.83      0.72      0.77       511

    accuracy                           0.81      1162
   macro avg       0.81      0.80      0.81      1162
weighted avg       0.81      0.81      0.81      1162



Now, it's your turn. Try the following:
- train a BERT model instead of a DistilBERT model
- try to model a multiclass classification problem. You can either find a dataset you'd like to try out or try with the PStance dataset.

Hint: Try to get all the datasets listed in https://drive.google.com/drive/folders/1so8lY1XKpnhUtTvb15edEz6aeHt7CSuh and classify the target of a tweet.