# Training and Fine-Tuning BERT for Classification
## Classifying newspaper articles by topic

This notebook will demonstrate how users can train and fine-tune a BERT model for classification with the popular HuggingFace `transformers` Python library.

We will fine-tune a BERT model on the news topics discussed [here](https://www.tandfonline.com/doi/full/10.1080/21670811.2020.1767509), with the goal of predicting the topic of a news article. The topics are:

-   'business'
-   'entertainment'
-   'politics'
-   'other'

Please download the data from the [bdaca GitHub repository](https://github.com/uvacw/teaching-bdaca/tree/main/modules/machinelearning-text-exercises).

**Basic steps involved in using BERT and HuggingFace:**
- Split your dataset into training, validation, and testing subsets.
- Convert your data into a format that BERT can process.
- Create dataset objects by joining your data and labels.
- Load the pre-trained BERT model.
- Refine the model by training it on your training data.
- Use the model to make predictions and assess its performance on your test data.


_This notebook is heavily inspired by Herties BERT for humanities tutorial_

<br><br>

## **Import necessary Python libraries and modules**

In [None]:
! pip install transformers

Next, we will import necessary Python libraries and modules.

In [None]:
import os
import gzip
import json
import pickle
import random
import sys
import csv
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import ticker
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
import torch
from transformers import Trainer, TrainingArguments
from sklearn.metrics import f1_score

from collections import defaultdict

sns.set(style='ticks', font_scale=1.2)
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.utils import compute_sample_weight

In [3]:
datadir = '/Users/rupertkiddle/Desktop/teach/2024/Introduction to Machine Learning (GESIS)/3_datasets/'

<br><br>

## **Read in data, and split into Train-Val-Test (60-20-20) samples**


This will read in the annotated newspaper data from Vermeer et al., and split it into train, val and test samples.


In [4]:
# In Google Colab: Add the dataset to your Google Drive
# Run this cell to connect Colab to your drive.
#from google.colab import drive
#drive.mount('/content/drive')
#os.chdir('/content/drive/MyDrive/')

In [5]:
csv.field_size_limit(sys.maxsize)

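def get_labeled_data(fn=datadir+'transformers/labeled.csv'):
    """Read (text, label) pairs from the CSV file, skipping the header row."""
    text = []
    label = []

    # newline='' lets the csv module handle line breaks inside quoted fields
    with open(fn, newline='') as fi:
        reader = csv.reader(fi, delimiter=',')
        next(reader)  # skip the header row
        for row in reader:
            if len(row) < 2:
                # invalid row, probably an empty one; skip it so texts and labels stay aligned
                continue
            text.append(row[0])
            label.append(row[1])
    return text, label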
texts, labels = get_labeled_data()


# Split your data into training and test sets
X_train_val, X_test, y_train_val, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Split your training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)

In [None]:
print(f"We have {len(X_train)} train examples, {len(X_val)} validation examples, and {len(X_test)} test examples.")
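
The 60-20-20 proportions follow from applying the two splits in sequence: 20% is held out for testing, and 25% of the remaining 80% (another 20% of the total) goes to validation. Below is a minimal, library-free sketch of that arithmetic on 100 placeholder items. The helper `split_off` is hypothetical; it only mimics the idea behind `train_test_split`, not its implementation.

```python
import random

def split_off(items, fraction, seed=42):
    """Shuffle a copy of `items` and split off `fraction` of it (illustrative helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_held_out = round(len(shuffled) * fraction)
    return shuffled[n_held_out:], shuffled[:n_held_out]

documents = list(range(100))                   # placeholder "documents"
train_val, test = split_off(documents, 0.20)   # 80 / 20
train, val = split_off(train_val, 0.25)        # 0.25 * 80 = 20, leaving 60 / 20

print(len(train), len(val), len(test))
```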

In [None]:
# Ideally, we want to run our code on a GPU: CUDA (for NVIDIA GPUs) or MPS (for Apple Silicon GPUs).
import torch

# Check if there is a GPU available...
if torch.cuda.is_available():
    # Tell PyTorch to use the CUDA GPU.
    device_name = torch.device("cuda")
    print('There are %d CUDA GPU(s) available.' % torch.cuda.device_count())
    print('We will use the CUDA GPU:', torch.cuda.get_device_name(0))

# Check if MPS is available...
elif torch.backends.mps.is_available():
    # Tell PyTorch to use the MPS GPU.
    device_name = torch.device("mps")
    print('MPS is available.')
    print('We will use the MPS GPU.')

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device_name = torch.device("cpu")

In [8]:
# We will be using a Dutch model, as our data is Dutch: specifically the 'GroNLP/bert-base-dutch-cased' model. Check out Hugging Face's documentation for more information on the different BERT models.
model_name = 'GroNLP/bert-base-dutch-cased'

# We set the maximum number of tokens in each document to be 512, which is the maximum length for BERT models.
max_length = 512

# We define the directory where we'll save our trained model. You can choose any name for the directory.
save_directory = 'my_trained_model'

Here's an example of a training text and training label:

In [None]:
X_train[0], y_train[0]

<br><br>

## **Implementing a Baseline Model using Logistic Regression**

In this step, we train and evaluate a basic TF-IDF baseline model with logistic regression. Despite using a very small dataset, we observe a performance that is better than random. We will now check if BERT can outperform this strong baseline!

In [10]:
vectorizer = TfidfVectorizer()
Xtrain = vectorizer.fit_transform(X_train)
Xtest = vectorizer.transform(X_test)
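
To make the TF-IDF step less of a black box, here is a small pure-Python sketch of the weighting on three made-up Dutch snippets. It assumes the smoothed IDF formula and L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults, and it uses a naive whitespace tokenizer, so the numbers may differ from scikit-learn's output on real text. The names `toy_docs` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

toy_docs = [
    "de beurs stijgt",
    "de film wint een prijs",
    "de minister bezoekt de beurs",
]

def tfidf(docs):
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        # Term frequency times IDF, then scale each document vector to unit length
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

vectors = tfidf(toy_docs)
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

A word such as "de", which occurs in every document, receives the lowest weight, while rarer words count for more.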

We train a logistic regression model from scikit-learn on the newspaper training data, and then we use the trained model to make predictions on our test set.

In [11]:
model = LogisticRegression(max_iter=1000).fit(Xtrain, y_train)
predictions = model.predict(Xtest)

We can leverage the `classification_report` function provided by scikit-learn to assess the performance of the logistic regression model in terms of its ability to predict newspaper topics that match the actual labels.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

What do you think of this model? Not too bad for a baseline model, right? Let's see whether we can improve this using BERT.

## Encode data for BERT

To prepare our data for use with BERT, we need to encode the texts and labels in a way that the model can understand. Here are the steps we'll follow:

1. Convert the labels from strings to integers.

2. Tokenize the texts, which involves breaking them up into individual words, and then convert the words into "word pieces" that can be matched with their corresponding embedding vectors.

3. Truncate texts that are longer than 512 tokens, or pad texts that are shorter than 512 tokens with a special padding token.

4. Add special tokens to the beginning and end of each document, including a start token, a separator between sentences, and a padding token as necessary.


We will use the `AutoTokenizer.from_pretrained()` method from the HuggingFace `transformers` library to load the tokenizer that matches our model. The tokenizer will handle all the encoding for us, including breaking word tokens into word pieces, truncating to 512 tokens, and adding padding and special BERT tokens.
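
The encoding steps listed above can be sketched in plain Python. This is only an illustration: the vocabulary is made up, the greedy longest-match splitting is a simplified version of WordPiece, and a real tokenizer additionally maps every piece to an integer ID. `[CLS]`, `[SEP]`, and `[PAD]` are BERT's special tokens; `toy_vocab`, `wordpiece`, and `encode` are hypothetical names.

```python
toy_vocab = {"nieuws", "krant", "##en", "##artikel", "de", "[UNK]"}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces that continue a word are marked with a leading '##'
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                pieces.append(candidate)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched: treat the whole word as unknown
        start = end
    return pieces

def encode(text, vocab, max_length):
    pieces = [p for word in text.lower().split() for p in wordpiece(word, vocab)]
    pieces = pieces[: max_length - 2]          # truncate, leaving room for [CLS] and [SEP]
    tokens = ["[CLS]"] + pieces + ["[SEP]"]
    attention_mask = [1] * len(tokens)         # 1 = real token, 0 = padding
    padding = max_length - len(tokens)
    return tokens + ["[PAD]"] * padding, attention_mask + [0] * padding

tokens, mask = encode("De kranten nieuwsartikel", toy_vocab, max_length=8)
print(tokens)
print(mask)
```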

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
#model = RobertaForSequenceClassification.from_pretrained("pdelobelle/robbert-v2-dutch-base")

In this section, we will generate a mapping of our news topics to integer keys. We begin by extracting the unique labels from our dataset and then create a dictionary that associates each label with an integer.

In [14]:
# Sort the labels so that the label-to-integer mapping is the same on every run
unique_labels = sorted(set(y_train))
label2id = {label: idx for idx, label in enumerate(unique_labels)}
id2label = {idx: label for label, idx in label2id.items()}

In [None]:
label2id.keys()

In [None]:
id2label.keys()

Now let's encode our texts and labels!

In [17]:
train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=max_length)
val_encodings = tokenizer(X_val, truncation=True, padding=True, max_length=max_length)
test_encodings  = tokenizer(X_test, truncation=True, padding=True, max_length=max_length)

train_labels_encoded = [label2id[y] for y in y_train]
val_labels_encoded = [label2id[y] for y in y_val]
test_labels_encoded  = [label2id[y] for y in y_test]

**Examine a news article in the training set after encoding**

In [None]:
' '.join(train_encodings[0].tokens[0:100])

**Examine a news article in test set after encoding**

In [None]:
' '.join(test_encodings[0].tokens[0:100])

**Examine the training labels after encoding**

In [None]:
set(train_labels_encoded)

**Examine the test labels after encoding**

In [None]:
set(test_labels_encoded)

<br><br>

## **Create a custom Torch dataset**

Here we combine the encoded texts and labels into dataset objects. We use the custom Torch `MyDataset` class to make a `train_dataset` object from `train_encodings` and `train_labels_encoded`. In the same way, we make a `val_dataset` from `val_encodings` and `val_labels_encoded`, and a `test_dataset` from `test_encodings` and `test_labels_encoded`.


In [22]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [23]:
train_dataset = MyDataset(train_encodings, train_labels_encoded)
val_dataset = MyDataset(val_encodings, val_labels_encoded)
test_dataset = MyDataset(test_encodings, test_labels_encoded)

**Examine a news article in the Torch `train_dataset` after encoding**

In [None]:
' '.join(train_dataset.encodings[0].tokens[0:100])

**Examine a news article in the Torch `test_dataset` after encoding**

In [None]:
' '.join(test_dataset.encodings[1].tokens[0:100])

In [None]:
len(id2label)

<br><br>

## **Initialize the pre-trained BERT model**

We load a pre-trained Dutch BERT model and move it to the device selected above (CUDA, MPS, or CPU) for efficient computation.

**Note**: If you intend to repeat the fine-tuning process after previously executing the subsequent cells, ensure that you re-run this cell to reload the original pre-trained model before commencing the fine-tuning again.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(id2label)).to(device_name)

<br><br>

## **Configure the parameters required for fine-tuning BERT**

The following parameters are crucial for fine-tuning BERT and will be specified in the HuggingFace `TrainingArguments` object that we will subsequently pass to the HuggingFace `Trainer` object. While there are numerous other arguments, we'll focus on the fundamental ones and some common pitfalls.

When fine-tuning your own model, it's critical to experiment with these parameters to identify the optimal configuration for your specific dataset.

| Parameter                     | Explanation                                                                                                                          |
|-------------------------------|--------------------------------------------------------------------------------------------------------------------------------------|
| `num_train_epochs`            | The total number of training epochs. This refers to how many times the entire dataset will be processed. Too many epochs can lead to overfitting.|
| `per_device_train_batch_size` | The batch size per device during training.                                                                                           |
| `per_device_eval_batch_size`  | The batch size for evaluation.                                                                                                      |
| `warmup_steps`                | The number of warmup steps for the learning rate scheduler. A smaller value is recommended for small datasets.                         |
| `weight_decay`                | The strength of weight decay, which reduces the size of weights, similar to regularization.                                          |
| `output_dir`                  | The directory where the fine-tuned model and configuration files will be saved.                                                     |
| `logging_dir`                 | The directory where logs will be stored.                                                                                            |
| `logging_steps`               | How often to print logging output. This enables us to terminate training early if the loss is not decreasing.                        |
| `evaluation_strategy`         | Evaluates while training so that we can monitor accuracy improvements.                                                              |
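
To illustrate what `warmup_steps` does, the sketch below computes the learning-rate multiplier for a linear warmup followed by a linear decay, which, as far as we know, is the default schedule used by the HuggingFace `Trainer`. The function name `lr_multiplier` and the step counts are hypothetical and only serve the illustration.

```python
def lr_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given optimisation step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)     # ramp up linearly from 0 to 1
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))  # decay linearly to 0

base_lr = 5e-5
for step in (0, 50, 100, 550, 1000):
    lr = base_lr * lr_multiplier(step, warmup_steps=100, total_steps=1000)
    print(f"step {step:4d}: learning rate = {lr:.2e}")
```

With `warmup_steps=0`, as in this notebook, training starts at the full learning rate and only the decay phase applies.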


<br><br>

## **Fine-tune the BERT model**

First, we define a custom evaluation function that returns the accuracy and the macro-averaged F1 score of the model. This function can be modified to return other metrics, such as precision or recall, or any other desired evaluation metric.

In [28]:
def compute_metrics(eval_pred):
    labels = eval_pred.label_ids
    preds = eval_pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    macro_f1 = f1_score(labels, preds, average='macro', sample_weight=compute_sample_weight('balanced', labels))
    return {'accuracy': acc, 'macro_f1': macro_f1}
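
For intuition about what a macro-averaged F1 measures, here is a small sketch that computes the plain, unweighted version by hand on toy labels. It deliberately leaves out the `sample_weight` argument used in `compute_metrics` above, so it is not a drop-in replacement; `macro_f1_by_hand` is a hypothetical helper name.

```python
def macro_f1_by_hand(true_labels, predicted_labels):
    """Unweighted mean of the per-class F1 scores."""
    pairs = list(zip(true_labels, predicted_labels))
    f1_scores = []
    for cls in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(t == cls and p == cls for t, p in pairs)   # true positives
        fp = sum(t != cls and p == cls for t, p in pairs)   # false positives
        fn = sum(t == cls and p != cls for t, p in pairs)   # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]
print(round(macro_f1_by_hand(toy_true, toy_pred), 4))   # 0.7333
```

Because every class contributes equally to the mean, a rare topic that the model handles poorly lowers the score as much as a frequent one would.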

Next, we create a HuggingFace `TrainingArguments` object and pass it to a HuggingFace `Trainer` object, along with our `compute_metrics` function and our training and validation datasets.


## **Optimize your model based on a metric you select**
Note: You can also use a grid search to identify the optimal configuration. However, be aware that fine-tuning multiple times with different parameter combinations can be extremely resource-intensive.
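
To see why such a search gets expensive quickly, here is a short sketch that enumerates a small hypothetical grid (the values are placeholders, not recommendations) and counts how many complete fine-tuning runs it would require.

```python
from itertools import product

# Hypothetical search space; every combination means one complete fine-tuning run.
grid = {
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "per_device_train_batch_size": [8, 16],
    "num_train_epochs": [3, 5],
}

# Build one dictionary of settings per combination of values
configurations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(f"{len(configurations)} fine-tuning runs needed")
print(configurations[0])
```

Even this modest grid already needs twelve runs, each of which takes as long as a single fine-tuning pass.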

In [29]:
metric_name = 'macro_f1' # you can change this to `accuracy` etc., according to the function `compute_metrics`

In [None]:
! pip install -U accelerate
! pip install -U transformers


In [None]:
# Instantiate an object of the TrainingArguments class with the following parameters:
training_args = TrainingArguments(

    # Number of training epochs
    num_train_epochs=5,

    # Batch size for training
    per_device_train_batch_size=8,

    # Batch size for evaluation
    per_device_eval_batch_size=8,

    # Learning rate for optimization
    learning_rate=5e-5,

    # Load the best model at the end of training
    load_best_model_at_end=True,

    # Metric used for selecting the best model
    metric_for_best_model=metric_name,

    # Number of warmup steps for the optimizer
    warmup_steps=0,

    # L2 regularization weight decay
    weight_decay=0.01,

    # Directory to save the fine-tuned model and configuration files
    output_dir='./results',

    # Directory to store logs
    logging_dir='./logs',

    # Log results every n steps
    logging_steps=20,

    # Strategy for evaluating the model during training
    # (recent versions of transformers call this argument `eval_strategy`)
    evaluation_strategy='steps',
)

In [48]:
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset (here: our validation set)
    compute_metrics=compute_metrics      # our custom evaluation function
)

Time to finally fine-tune!

Be patient; if you've set everything in Colab to use GPUs, then it should only take a minute or two to run, but if you're running on CPU, it can take hours.

After every 20 steps (as we specified in the `TrainingArguments` object), the trainer will output the current state of the model, including the training loss, validation loss, accuracy, and macro F1 (the latter two from our `compute_metrics` function).

You should see the loss going down and the accuracy going up. If instead they are staying the same or oscillating, you probably need to change the fine-tuning parameters.
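
To estimate how many progress reports to expect, you can work out the number of optimisation steps from the size of the training set. The sketch assumes a single device and no gradient accumulation, and `n_train_examples` is a made-up number, not the size of this dataset.

```python
import math

n_train_examples = 1200      # hypothetical; use len(X_train) in the notebook
batch_size = 8               # per_device_train_batch_size
num_epochs = 5               # num_train_epochs
logging_steps = 20           # logging_steps

steps_per_epoch = math.ceil(n_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
n_reports = total_steps // logging_steps

print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total, "
      f"about {n_reports} progress reports")
```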

In [None]:
trainer.train()

<br><br>

## **Save fine-tuned model**

The following cell will save the model and its configuration files to a directory in Colab. To preserve this model for future use, you should download the model to your computer.

In [50]:
trainer.save_model(save_directory)

(Optional) If you've already fine-tuned and saved the model, you can reload it using the following lines. You don't have to run fine-tuning every time you want to evaluate.

In [51]:
# model = AutoModelForSequenceClassification.from_pretrained(save_directory).to(device_name)
# trainer = Trainer(model=model, args=training_args, eval_dataset=val_dataset, compute_metrics=compute_metrics)

<br><br>

## **Evaluate fine-tuned model on the validation set**

The following method of the `Trainer` object runs the built-in evaluation on the validation set, including our `compute_metrics` function.

In [None]:
trainer.evaluate()

<br><br>

## **Evaluate fine-tuned model on the test set**

For a more detailed evaluation of the model, we extract the predicted labels for the test set.

In [None]:
predicted_results = trainer.predict(test_dataset)

In [None]:
predicted_results.predictions.shape

In [55]:
predicted_labels = predicted_results.predictions.argmax(-1) # Pick the class with the highest score (logit)
predicted_labels = predicted_labels.flatten().tolist()      # Flatten the predictions into a 1D list
predicted_labels = [id2label[l] for l in predicted_labels]  # Convert from integers back to strings for readability
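
For a sequence classification model, the predictions returned here are raw scores (logits) rather than probabilities. Because softmax preserves the ordering of the scores, taking the argmax of the logits picks the same class as taking the argmax of the probabilities. Below is a small sketch with made-up logits and a hypothetical `toy_id2label` mapping.

```python
import math

toy_id2label = {0: "business", 1: "entertainment", 2: "other", 3: "politics"}
toy_logits = [
    [2.1, -0.3, 0.4, 0.0],
    [-1.0, 0.2, 0.1, 3.5],
]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

toy_predictions = [toy_id2label[argmax(row)] for row in toy_logits]

for row, label in zip(toy_logits, toy_predictions):
    probability = max(softmax(row))
    print(f"{label}: probability {probability:.3f}")
```

Applying softmax is only needed if you want to report how confident the model is in each prediction.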

In [None]:
len(predicted_labels)

In [None]:
print(classification_report(y_test,
                            predicted_labels))

<br><br>

## **Extracting Correct and Incorrect Classifications for Analysis**

Now that we have obtained the predicted labels, let's perform some analysis.

The fine-tuning and extraction of predicted labels using BERT is now complete. You can use the predicted labels just like you would with any other classification model. Here are some examples.

To start, let's print out some example predictions that were correct.

In [None]:
for _true_label, _predicted_label, _text in random.sample(list(zip(y_test, predicted_labels, X_test)), 20):
  if _true_label == _predicted_label:
    print('LABEL:', _true_label)
    print('ARTICLE TEXT:', _text[:100], '...')
    print()

Now let's print out some misclassifications.

In [None]:
for _true_label, _predicted_label, _text in random.sample(list(zip(y_test, predicted_labels, X_test)), 80):
  if _true_label != _predicted_label:
    print('TRUE LABEL:', _true_label)
    print('PREDICTED LABEL:', _predicted_label)
    print('ARTICLE TEXT:', _text[:100], '...')
    print()

Finally, let's create some heatmaps to examine misclassification patterns. We could use these patterns to think about similarities and differences between the news topics.

In [None]:
from collections import Counter

# Count the number of classifications for each (true, predicted) topic pair
topic_classifications = Counter(zip(y_test, predicted_labels))

# Convert the counts to a DataFrame and pivot to wide format
df_wide = pd.DataFrame(topic_classifications, index=['Number of Classifications']).T.reset_index()
df_wide.columns = ['True Topic', 'Predicted Topic', 'Number of Classifications']
df_wide = df_wide.pivot_table(index='True Topic', columns='Predicted Topic', values='Number of Classifications', fill_value=0)

# Plot the results
plt.figure(figsize=(9,7))
sns.set(style='ticks', font_scale=1.2)
sns.heatmap(df_wide, linewidths=1, cmap='Purples')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


Looks good! If most of the counts sit on the diagonal, the model is assigning the correct label for most articles in each topic.
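
To back up the visual impression with numbers, the same (true, predicted) counts can be turned into per-topic recall, that is, the share of each true topic that was labelled correctly. This is a sketch on made-up label pairs with a hypothetical helper `recall_per_topic`; in the notebook you would pass `y_test` and `predicted_labels` instead.

```python
from collections import Counter

toy_true = ["politics", "politics", "business", "business", "other", "other"]
toy_pred = ["politics", "business", "business", "business", "other", "politics"]

def recall_per_topic(true_labels, predicted_labels):
    pair_counts = Counter(zip(true_labels, predicted_labels))   # (true, predicted) -> count
    totals = Counter(true_labels)                               # true topic -> count
    # Diagonal count divided by the number of articles with that true topic
    return {topic: pair_counts[(topic, topic)] / totals[topic] for topic in sorted(totals)}

for topic, recall in recall_per_topic(toy_true, toy_pred).items():
    print(f"{topic:10s} recall = {recall:.2f}")
```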

Now, let's remove the diagonal from the plot to highlight the misclassifications.

In [None]:
topic_classifications_dict = defaultdict(int)
for _true_label, _predicted_label in zip(y_test, predicted_labels):
  if _true_label != _predicted_label: # Remove the diagonal to highlight misclassifications
    topic_classifications_dict[(_true_label, _predicted_label)] += 1

dicts_to_plot = []
for (_true_topic, _predicted_topic), _count in topic_classifications_dict.items():
  dicts_to_plot.append({'True Topic': _true_topic,
                        'Predicted Topic': _predicted_topic,
                        'Number of Classifications': _count})

df_to_plot = pd.DataFrame(dicts_to_plot)
df_wide = df_to_plot.pivot_table(index='True Topic',
                                 columns='Predicted Topic',
                                 values='Number of Classifications')

plt.figure(figsize=(9,7))
sns.set(style='ticks', font_scale=1.2)
sns.heatmap(df_wide, linewidths=1, cmap='Purples')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()