# Fine-tuning BERT for Multi-label Topic Classification (One-step approach)

### Introduction

Source code: [https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb]

In this notebook we will be fine-tuning a BERT model for the **Multilabel Topic Classification** task. 
This is a common problem where a given piece of text (a sentence or a document) can be classified into one or more categories from a predefined list of labels.

#### Flow of the notebook

The notebook is divided into separate sections to provide an organized walk-through of the process used. This process can be modified for individual use cases. The sections are:

1. [Imports](#section01)
2. [Helper functions](#section02)
3. [Pre-processing the domain data](#section03)
4. [Model training](#section04)
5. [Validation](#section05)
6. [Saving the trained model and vocabulary](#section06)
7. [Evaluate on test data](#section07)

#### Technical Details

This script relies on multiple tools; see the details below. Please ensure that these elements are present in your setup to run this script successfully.

 - Data: 
	 - We are using a dataset provided by a Dutch governmental institution
	 - We are using the split dataset for the process: `train.csv`,  `validation.csv`,  `test.csv`
	 - Each row of data has the following data points: 
		 - ID
		 - Text
		 - Label values (0 or 1) for each category

Each text instance can be marked for multiple topics. If the comment is about `Processes` and `Handling`, then for both those headers the value will be `1` and for the others it will be `0` in the data.
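The 0/1 encoding described above can be sketched in a few lines of plain Python. This is only an illustration: `LABELS` is a hypothetical four-label subset and `to_multi_hot` is a helper name introduced here, not something from the dataset or `utils.py`.

```python
# Hypothetical subset of the label columns, in the order they appear in the data
LABELS = ["Making contact with employee", "Processes", "Digital possibilities", "Handling"]

def to_multi_hot(topics, labels=LABELS):
    """Return a 0/1 list with a 1 at the position of every topic in `topics`."""
    unknown = set(topics) - set(labels)
    if unknown:
        raise ValueError(f"Unknown topics: {sorted(unknown)}")
    return [1 if label in topics else 0 for label in labels]

# A comment about both `Processes` and `Handling`
row = to_multi_hot({"Processes", "Handling"})
print(row)  # [0, 1, 0, 1]
```

An instance with no matching topic in this subset would simply produce a row of zeros.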

 - Language Model Used:
	 - BERT is used for this project. It is the transformer model created by the Google AI Team.  
	 - [Blog-Post](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
	 - [Research Paper](https://arxiv.org/abs/1810.04805)
	 - [Documentation for Python](https://huggingface.co/transformers/model_doc/bert.html)

 - Hardware and Software Requirements:
	 - Python 3.7.8 or above
	 - Pytorch, Transformers and all the imported Python libraries
	 - GPU enabled setup 

 - Script Objective:
	 - The objective of this script is to fine tune BERT to be able to label a text instance into the following categories:
		 - main topics:`Making contact with employee`, `Processes`, `Digital possibilities`, `General experience`, `Information provision`, `Employee attitude & behavior`, `Handling`, `No topic found`, `Knowledge & skills of employee`, `Price & quality`, `Physical service provision`
		 - subtopics:`Waiting time`, `Speaking to the right person`, `Correctness of handling`, `Functionalities web & app`, `Ease of process`, `Reception & Registration`, `Friendliness`, `Quality of information`, `Information provision web & app`, `Clarity of information`, `Solution oriented`, `Availability of employee`, `Price & costs`, `Speed of processing`, `Professionalism`, `Opening hours & accessibility`, `Ease of use web & app`, `Keeping up to date`, `Integrity & fulfilling responsibilities`, `Payout & return`, `No subtopic found`, `Quality of customer service`, `Facilities`, `Objection & evidence`, `General experience subtopic`, `Efficiency of process`, `Genuine interest`, `Expertise`, `Helpfulness`, `Personal approach`, `Communication`
		 - For each of the listed topic labels, the model predicts whether or not a given instance belongs to it
		 - The script was designed for one-step classification, meaning the model predicts the probabilities for all topic labels at once

---
***NOTE***
- *Note that the overall mechanisms for multiclass and multilabel problems are similar, apart from a few differences:*
	- *The loss function evaluates the probability of each category individually rather than relative to the other categories. Hence the use of `BCE` rather than `Cross Entropy` when defining the loss.*
	- *A sigmoid is applied to the outputs rather than a softmax, for the reason given in the previous point.*
	- *The [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report), specifically the [F1 scores](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score), from the sklearn package is used rather than a direct comparison of expected vs predicted labels.*
---
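To make the sigmoid versus softmax difference from the note concrete, here is a minimal sketch in plain Python. The logits are made-up scores for three topics, and the helper functions are written out by hand purely for illustration.

```python
import math

def sigmoid(x):
    """Logistic function: maps one logit to an independent probability."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """Softmax: maps all logits to probabilities that compete and sum to 1."""
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, -3.0]  # made-up scores for three topics

sig = [sigmoid(x) for x in logits]
soft = softmax(logits)

print("sigmoid:", [round(p, 3) for p in sig])
print("softmax:", [round(p, 3) for p in soft])
print("labels above 0.5 (sigmoid):", [p >= 0.5 for p in sig])
print("labels above 0.5 (softmax):", [p >= 0.5 for p in soft])
```

With these logits the sigmoid gives roughly 0.88 and 0.82 to the first two topics, so both pass a 0.5 threshold. The softmax gives roughly 0.62 and 0.38, so only one topic can pass. This is why the sigmoid suits the multilabel setting.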

<a id='section01'></a>
### Imports

In the next step we will be importing the libraries and modules needed to run our script.

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, hamming_loss
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizer, set_seed
from torch import cuda
from utils import *

We start by defining a few key variables that will be used later during the training/fine-tuning stage. A label map is defined that maps integer ids to topic labels, and a random seed is set for reproducibility. After that, we prepare the device for CUDA execution. This configuration is needed if you want to make use of an onboard GPU. 

In [None]:
# Defining parameters used during training
# Set seed for reproducibility
set_seed(123)
# Pad or truncate text instances to a fixed length. The longest instance in the dataset has 416 tokens, so instances longer than MAX_LEN tokens are truncated
MAX_LEN = 128
# Number of instances per batch; during hyper-parameter tuning we can change this value to 16/32/64
BATCH_SIZE = 16
# Number of epochs during model training
EPOCHS = 5
# Learning rate determines the size of the weight updates during training; during hyper-parameter tuning we can change this value to 3e-05
LEARNING_RATE = 2e-05
# Instances are considered to have a given label if their probability is above 50% as set below
THRESHOLD = 0.5
# Number of labels in the data
NUM_LABELS = 42
# Tokenizer of the pre-trained model used for the experiments
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Setting up the device for GPU usage if available, otherwise default to CPU use
device = 'cuda' if cuda.is_available() else 'cpu'

# Dictionary mapping label ids to label names - this will be used to convert predicted indices to topic names
label_map = {
    '0': 'Making contact with employee','1': 'Processes','2': 'Digital possibilities','3': 'General experience','4': 'Information provision','5': 'Employee attitude & behavior','6': 'Handling','7': 'No topic found','8': 'Knowledge & skills of employee',
    '9': 'Price & quality','10': 'Physical service provision','11': 'Waiting time','12': 'Speaking to the right person','13': 'Correctness of handling','14': 'Functionalities web & app','15': 'Ease of process','16': 'Reception & Registration',
    '17': 'Friendliness','18': 'Quality of information','19': 'Information provision web & app','20': 'Clarity of information','21': 'Solution oriented','22': 'Availability of employee','23': 'Price & costs',
    '24': 'Speed of processing','25': 'Professionalism','26': 'Opening hours & accessibility','27': 'Ease of use web & app','28': 'Keeping up to date','29': 'Integrity & fulfilling responsibilities',
    '30': 'Payout & return','31': 'No subtopic found','32': 'Quality of customer service','33': 'Facilities','34': 'Objection & evidence','35': 'General experience subtopic','36': 'Efficiency of process',
    '37': 'Genuine interest','38': 'Expertise','39': 'Helpfulness','40': 'Personal approach','41': 'Communication'}

training_path = '../data/train.csv'
valid_path = '../data/validation.csv'
test_path = '../data/test.csv'

### Helper Functions
<a id='section02'></a>

*The helper functions are all located in utils.py. The description of the functions can be found below.*

#### Preparing the Dataset and Dataloader
##### *MultiLabelDataset* Class
- This class is defined to accept the `dataframe`, `tokenizer` and `max_length` as input and generate the tokenized output and tags that are used by the BERT model for training. 
- We are using the BERT tokenizer to tokenize the data in the `text` column of the dataframe.
- The tokenizer uses the `encode_plus` method to perform tokenization and generate the necessary outputs, namely: `ids`, `attention_mask`, `token_type_ids`

- `targets` is the list of classes labeled as `0` or `1` in the dataframe. 
- The *MultiLabelDataset* class is used to create 3 datasets, for training, for validation and for testing.
- *Training Dataset* is used to train the model.
- *Validation Dataset* can be used for hyper-parameter tuning. The model has not seen this data during training.
- *Test Dataset* is used to evaluate the performance of the model. The model has not seen this data during training.
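The padding and truncation that the tokenizer performs for each instance can be sketched as below. This is a simplified illustration, not the code in `utils.py`: `pad_or_truncate` is a hypothetical helper, the token ids are made up, the padding id `0` is an assumption, and the special-token handling of the real tokenizer is ignored.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (ids, attention_mask), both exactly `max_len` long.

    Real tokens get mask value 1, padding positions get mask value 0.
    """
    ids = list(token_ids[:max_len])   # truncate if too long
    mask = [1] * len(ids)
    padding = max_len - len(ids)      # number of pad positions needed
    return ids + [pad_id] * padding, mask + [0] * padding

# A short instance is padded up to the fixed length
ids, mask = pad_or_truncate([101, 7592, 2088, 102], max_len=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]

# A long instance is cut off at the fixed length
ids, mask = pad_or_truncate([101, 7592, 2088, 102], max_len=3)
print(ids)   # [101, 7592, 2088]
print(mask)  # [1, 1, 1]
```

Because every instance ends up with the same length, the instances in a batch can be stacked into a single tensor.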

##### Dataloader
- The Dataloader is used to create training, validation and test dataloaders that feed the data to the neural network in a defined manner. This is needed because the whole dataset cannot be loaded into memory at once, so the amount of data loaded into memory and then passed to the neural network needs to be controlled.
- This control is achieved using the parameters such as `batch_size` and `max_len`.
- Training and Validation dataloaders are used in the training and validation part of the flow respectively.

#### *BERTClass*
 - We will be creating a neural network with the `BERTClass`. 
 - This network will have the `BERT` model, followed by a `Dropout` and a `Linear Layer`. They are added for the purpose of **Regularization** and **Classification** respectively. 
 - In the forward loop, there are 2 outputs from the `BERT` model layer.
 - The second output, `output_1`, also called the `pooled output`, is passed to the `Dropout layer` and the subsequent output is given to the `Linear layer`. 
 - Note that the number of output dimensions for the `Linear Layer` is **42**, because that is the total number of topics into which we are looking to classify our text.
 - The data will be fed to the `BERTClass` as defined in the dataset. 
 - The final layer outputs will be used to calculate the loss and to determine the accuracy of the model's predictions. 
 - We will initiate an instance of the network called `model`. This instance will be used for training and then to save the final trained model for future inference. 
 
#### Loss Function and Optimizer
 - The Loss is defined in the `loss_fn` function.
 - As noted above, the loss function combines a sigmoid layer with Binary Cross Entropy, which is implemented as [BCEWithLogitsLoss](https://pytorch.org/docs/stable/nn.html#bcewithlogitsloss) in PyTorch.
 - The `Optimizer` (Adam optimizer) is defined in the Model Training section.
 - `Optimizer` is used to update the weights of the neural network to improve its performance.
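As a rough illustration of what a binary cross-entropy on raw logits computes, the sketch below writes the per-label loss by hand in plain Python. It is not the `loss_fn` from `utils.py`; the function names and example values are assumptions made for this example, and the result is averaged over all labels.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed directly from logits.

    Uses the stable form max(x, 0) - x*y + log(1 + exp(-|x|)), which equals
    -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))] without overflowing.
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

def bce_naive(logits, targets):
    """Textbook form: apply the sigmoid first, then the cross-entropy."""
    total = 0.0
    for x, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-x))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(logits)

logits = [2.0, -1.0, 0.0]   # made-up model outputs for three labels
targets = [1, 0, 1]         # made-up ground truth

print(bce_with_logits(logits, targets))
print(bce_naive(logits, targets))
```

Both functions agree on moderate logits. The first form stays finite for very large logits, which is the usual argument for fusing the sigmoid into the loss.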

#### Further Reading
- [Pytorch Documentation for Loss Function](https://pytorch.org/docs/stable/nn.html#loss-functions)
- [Pytorch Documentation for Optimizer](https://pytorch.org/docs/stable/optim.html)
- [Pytorch Documentation for Scheduler](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html)
- Refer to the links provided on the top of the notebook to read more about `BertModel`. 

#### Training the Model

We define a training function that trains the model on the training dataset for a specified number of epochs (`EPOCHS`). The number of epochs defines how many times the complete training data is passed through the network. 
The following events happen in this function to fine-tune the neural network:
- The dataloader passes data to the model based on the batch size. 
- Subsequent output from the model and the actual labels are compared to calculate the loss. 
- Loss value is used to optimize the weights of the neurons in the network.
- After every 500 steps the loss value is printed out.
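The flow above (batches in, loss computed, weights updated, loss logged periodically) can be sketched with a toy stand-in for the network. The example below is not the `train` function from `utils.py`: it replaces BERT with one linear weight vector per label, uses plain gradient descent instead of Adam, and runs on made-up data, purely to show the shape of the loop.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_epoch(weights, data, batch_size, lr, log_every=2):
    """Run one epoch of mini-batch gradient descent; return the mean batch loss."""
    random.shuffle(data)
    batch_losses = []
    for step, start in enumerate(range(0, len(data), batch_size)):
        batch = data[start:start + batch_size]
        grads = [[0.0] * len(w) for w in weights]
        loss = 0.0
        for features, targets in batch:
            for k, w in enumerate(weights):
                z = sum(wj * xj for wj, xj in zip(w, features))
                y = targets[k]
                # Binary cross-entropy for label k, computed from the logit z
                loss += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
                for j, xj in enumerate(features):
                    grads[k][j] += (sigmoid(z) - y) * xj
        n = len(batch) * len(weights)
        # Weight update: one gradient step per batch
        for k, w in enumerate(weights):
            for j in range(len(w)):
                w[j] -= lr * grads[k][j] / n
        batch_losses.append(loss / n)
        if step % log_every == 0:
            print(f"  step {step}: loss {loss / n:.4f}")
    return sum(batch_losses) / len(batch_losses)

random.seed(0)
# Made-up data: features are [x1, x2, bias]; label 0 is set when x1 > 0, label 1 when x2 > 0
data = [([x1, x2, 1.0], [int(x1 > 0), int(x2 > 0)])
        for x1 in (-2.0, -1.0, 1.0, 2.0) for x2 in (-2.0, -1.0, 1.0, 2.0)]
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # one weight vector per label

epoch_losses = []
for epoch in range(5):
    print(f"epoch {epoch}")
    epoch_losses.append(train_epoch(weights, data, batch_size=4, lr=0.5))
```

The mean loss per epoch should fall as the epochs progress, which is the same signal the printed loss values give during fine-tuning.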

#### Validating the model

During the validation stage we pass the unseen data (validation dataset) to the model. This step determines how well the model performs on unseen data. This validation data is 10% of the full dataset. During the validation stage the weights of the model are not updated. Only the final output is compared to the actual value. This comparison is then used to calculate the performance of the model. To get a measure of the model's performance we use the classification report and Hamming loss from scikit-learn. 
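As a small worked example of this comparison, the sketch below thresholds made-up probabilities and computes the Hamming loss by hand, that is, the fraction of individual label slots that are predicted incorrectly. The helper names are hypothetical; the notebook itself uses `hamming_loss` from scikit-learn.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Turn per-label probabilities into 0/1 predictions."""
    return [[int(p >= threshold) for p in row] for row in probabilities]

def hamming_loss_manual(targets, predictions):
    """Fraction of label slots where prediction and target disagree."""
    wrong = sum(
        t != p
        for t_row, p_row in zip(targets, predictions)
        for t, p in zip(t_row, p_row)
    )
    total = sum(len(row) for row in targets)
    return wrong / total

# Made-up probabilities for two instances and three labels
probs = [[0.9, 0.2, 0.6],
         [0.4, 0.7, 0.1]]
targets = [[1, 0, 0],
           [0, 1, 1]]

preds = apply_threshold(probs)
print(preds)  # [[1, 0, 1], [0, 1, 0]]

# 2 of the 6 label slots are wrong
print(round(hamming_loss_manual(targets, preds), 3))  # 0.333
```

A lower Hamming loss is better, and 0 means every label of every instance was predicted correctly.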


### Pre-processing the domain data
<a id='section03'></a>
We will be working with the data and preparing it for fine-tuning purposes. 

*Assuming that the `train.csv`, `validation.csv` and `test.csv` are available in the `data` folder*

* A new dataframe is made and input text is stored in the **text** column.
* Taking the values of all the labels and converting them into a list in the **labels** column.
* Preserving the **id** and **text** information, which will be needed for evaluation.

In [None]:
# Load datasets
train_data = pd.read_csv(training_path, sep=';')
validation_data = pd.read_csv(valid_path, sep=';')
test_data = pd.read_csv(test_path, sep=';')

# Create new DataFrames, we keep information about the id, text and labels for each instance
train_df = pd.DataFrame({
    'id': train_data['id'],
    'text': train_data['text'],
    'labels': train_data.iloc[:, 2:].values.tolist()  # the first two columns hold 'id' and 'text'; the remaining columns are the labels
})

val_df = pd.DataFrame({
    'id': validation_data['id'],
    'text': validation_data['text'],
    'labels': validation_data.iloc[:, 2:].values.tolist()
})

test_df = pd.DataFrame({
    'id': test_data['id'],
    'text': test_data['text'],
    'labels': test_data.iloc[:, 2:].values.tolist()
})

In [None]:
# Creating the dataset and dataloader for the neural network
print("TRAIN Dataset: {}".format(train_df.shape))
print("VALID Dataset: {}".format(val_df.shape))
print("TEST Dataset: {}".format(test_df.shape))

training_set = MultiLabelDataset(train_df, tokenizer, MAX_LEN)
validation_set = MultiLabelDataset(val_df, tokenizer, MAX_LEN)
test_set = MultiLabelDataset(test_df, tokenizer, MAX_LEN)

training_loader = DataLoader(training_set, batch_size=BATCH_SIZE, shuffle=True)
validation_loader = DataLoader(validation_set, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_set, batch_size=BATCH_SIZE, shuffle=False)

### Model Training
<a id='section04'></a>

We define an optimizer and a scheduler to be used during training. We loop through the number of epochs and call the train function. The Adam optimizer is used alongside a scheduler. The scheduler decays the learning rate of each parameter group by gamma for each epoch. 
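The effect of a step-wise decay can be worked out by hand. The sketch below assumes the scheduler is stepped once per epoch (this depends on how `train` in `utils.py` calls it) and uses the values from this notebook: a learning rate of `2e-05`, a `step_size` of 1 and a `gamma` of 0.95. The function `step_lr` is a hypothetical helper, not a PyTorch API.

```python
def step_lr(initial_lr, gamma, epoch, step_size=1):
    """Learning rate after `epoch` scheduler steps under step-wise decay.

    The rate is multiplied by `gamma` once every `step_size` epochs.
    """
    return initial_lr * gamma ** (epoch // step_size)

# Learning rate used in each of the 5 epochs
for epoch in range(5):
    print(f"epoch {epoch}: lr = {step_lr(2e-05, 0.95, epoch):.3e}")
```

With a `gamma` this close to 1 the decay is gentle: after four decay steps the rate is still about 81% of its initial value.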

In [None]:
model = BERTClass(NUM_LABELS)
model.to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

In [None]:
for epoch in range(EPOCHS): 
    train(model, device, training_loader, epoch, optimizer, scheduler)

### Validation
<a id='section05'></a>

We use the validation function and the `validation_loader` to evaluate the model's performance on the validation set. We create a global classification report and calculate the Hamming loss.

In [None]:
outputs, targets = validation(model, validation_loader, device)
outputs = np.array(outputs) >= THRESHOLD
outputs = np.array(outputs, dtype=int)
targets = np.array(targets, dtype=int) 

# Use the label map to convert indices to topic names for the classification report
target_names = [label_map[str(i)] for i in range(outputs.shape[1])]

report = classification_report(targets, outputs, target_names=target_names, zero_division=0)
print(report)
print("Hamming Loss:", hamming_loss(targets, outputs))

### Saving the trained model and vocabulary
<a id='section06'></a>

This is the final step in the process of fine-tuning the model. The model and its vocabulary are saved locally. These files can then be used to run inference on new feedback instances (such as the test set).

In [None]:
torch.save(model, '../models/pytorch_bert_MLTC.bin')
tokenizer.save_vocabulary('../models/vocab_bert_MLTC.bin')

### Evaluate on the test data
<a id='section07'></a>

We now load the saved model and tokenizer and use them, together with the test data, for the final evaluation of the model.

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = BertTokenizer.from_pretrained('../models/vocab_bert_MLTC.bin')
model = torch.load('../models/pytorch_bert_MLTC.bin')
model.to(device)

In [None]:
outputs, targets = validation(model, test_loader, device)
outputs = np.array(outputs) >= THRESHOLD
outputs = np.array(outputs, dtype=int)
targets = np.array(targets, dtype=int) 

# Use the label map to convert indices to topic names for the classification report
target_names = [label_map[str(i)] for i in range(outputs.shape[1])]

report_dict = classification_report(targets, outputs, target_names=target_names, zero_division=0, output_dict=True)
print(classification_report(targets, outputs, target_names=target_names, zero_division=0))
print("Hamming Loss:", hamming_loss(targets, outputs))
print_confusion_matrices(outputs, targets, target_names)

report_df = pd.DataFrame(report_dict).transpose()
report_df = report_df.round(3) 
report_df.to_csv('../results/bert/test_report_1step.csv', sep=';', index=True)
print("Classification report saved to file.")

# Create DataFrame for results and save to CSV
results_df = pd.DataFrame(outputs, columns=target_names)
results_df['ID'] = test_df['id'].values  # Adding ID for reference
results_df['Text'] = test_df['text'].values  # Adding text for reference
results_df = results_df[['ID', 'Text'] + target_names]  
results_df.to_csv('../model_predictions/bert_predictions_test_1step.csv', sep=';', index=False)
print("Model predictions saved to file.")

End of the notebook.