# Fine Tuning BERT for MultiLabel Text Classification (Two-step approach)

### Introduction

Source code: [https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb]

In this notebook we will be fine-tuning a BERT model for the **Multilabel Topic Classification** task. 
This is a common problem where a given piece of text (a sentence or a document) can be assigned one or more categories from a predefined list of labels.

#### Flow of the notebook

The notebook is divided into separate sections to provide an organized walkthrough of the process used. This process can be modified for individual use cases. The sections are:

1. [Imports](#section01)
2. [Helper functions](#section02)
3. [Pre-processing the domain data](#section03)
4. [Model training](#section04)
5. [Validation](#section05)
6. [Saving the trained model and vocabulary](#section06)
7. [Evaluate on test data](#section07)

#### Technical Details

This script relies on multiple tools, detailed below. Please ensure that these elements are present in your setup to successfully run this script.

 - Data: 
	 - We are using a dataset provided by a Dutch governmental institution
	 - We are using the split dataset for the process: `train.csv`,  `validation.csv`,  `test.csv`
	 - We are also using additional training sets as a result of data augmentation: `undersampled_train.csv` and `oversampled_train.csv`
	 - Each row of data has the following data points: 
		 - ID
		 - Text
		 - Label values (0 or 1) for each category

Each text instance can be marked for multiple topics. If a text is about `Processes` and `Handling`, then the value in both of those columns will be `1`, and in all other columns it will be `0`.
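As a minimal sketch of this encoding, the snippet below builds the 0/1 row for one text instance. The helper name `to_multi_hot` and the shortened category list are illustrative assumptions; they are not part of the dataset or of `utils.py`.

```python
# Illustrative subset of the main topic columns (the real data has 11 main topics).
CATEGORIES = ["Making contact with employee", "Processes", "Handling", "Price & quality"]

def to_multi_hot(topics, categories=CATEGORIES):
    """Return a 0/1 list with a 1 for every category that occurs in `topics`."""
    present = set(topics)
    return [1 if category in present else 0 for category in categories]

# A text about `Processes` and `Handling` gets a 1 in exactly those two columns.
row = to_multi_hot(["Processes", "Handling"])
print(row)  # [0, 1, 1, 0]
```

Because each column is set independently, a row may contain any number of ones, which is what distinguishes the multilabel setting from the multiclass one.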

 - Language Model Used:
	 - BERT is used for this project. It is the transformer model created by the Google AI Team.  
	 - [Blog-Post](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
	 - [Research Paper](https://arxiv.org/abs/1810.04805)
	 - [Documentation for Python](https://huggingface.co/transformers/model_doc/bert.html)

 - Hardware and Software Requirements:
	 - Python 3.7.8 or above
	 - Pytorch, Transformers and all the imported Python libraries
	 - GPU enabled setup 

 - Script Objective:
	 - The objective of this script is to fine tune BERT to be able to label a text instance into the following categories:
		 - main topics:`Making contact with employee`, `Processes`, `Digital possibilities`, `General experience`, `Information provision`, `Employee attitude & behavior`, `Handling`, `No topic found`, `Knowledge & skills of employee`, `Price & quality`, `Physical service provision`
		 - subtopics:`Waiting time`, `Speaking to the right person`, `Correctness of handling`, `Functionalities web & app`, `Ease of process`, `Reception & Registration`, `Friendliness`, `Quality of information`, `Information provision web & app`, `Clarity of information`, `Solution oriented`, `Availability of employee`, `Price & costs`, `Speed of processing`, `Professionalism`, `Opening hours & accessibility`, `Ease of use web & app`, `Keeping up to date`, `Integrity & fulfilling responsibilities`, `Payout & return`, `No subtopic found`, `Quality of customer service`, `Facilities`, `Objection & evidence`, `General experience subtopic`, `Efficiency of process`, `Genuine interest`, `Expertise`, `Helpfulness`, `Personal approach`, `Communication`
		 - For each of the listed topic labels, the model predicts whether or not a given instance belongs to it
		 - The script was designed for two-step classification: the model first predicts the probabilities for the main topics. A separate model is then trained and used to predict the corresponding subtopics, based on the main topic predictions of the first phase. For example, if the model only predicted the presence of `Price & quality` for a given data instance, only the subtopics `Price & costs` and `Payout & return` can be considered in the second phase.
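The restriction applied in the second phase can be sketched as a simple lookup. This is only an illustration of the idea; the function name `candidate_subtopics` is an assumption, and the mapping is a small subset of the full main topic to subtopic mapping.

```python
# Subset of the main topic -> subtopic mapping described above.
LABEL_MAP = {
    "Price & quality": ["Price & costs", "Payout & return"],
    "Processes": ["Ease of process", "Efficiency of process"],
    "General experience": ["General experience subtopic"],
}

def candidate_subtopics(predicted_main_topics, label_map=LABEL_MAP):
    """Collect the subtopics that phase 2 may consider, given the phase 1 predictions."""
    candidates = []
    for main_topic in predicted_main_topics:
        candidates.extend(label_map[main_topic])
    return candidates

print(candidate_subtopics(["Price & quality"]))
# ['Price & costs', 'Payout & return']
```

Subtopics of main topics that were not predicted in the first phase are never considered, so an error in the first phase carries over to the second.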

---
***NOTE***
- *Note that the overall mechanisms for multiclass and multilabel problems are similar, except for a few differences, namely:*
	- *The loss function is designed to evaluate the probability of each category individually, rather than relative to the other categories. Hence the use of `BCE` rather than `Cross Entropy` when defining the loss.*
	- *The sigmoid of the outputs is calculated rather than the softmax, for the reasons given in the previous point.*
	- *The [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report), specifically the [F1 scores](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score), from the sklearn package is used instead of a direct comparison of expected vs. predicted labels.*
---
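To make the difference between sigmoid and softmax concrete, here is a small sketch in plain Python. The logits are made-up numbers and the 0.5 cut-off is only an example threshold.

```python
import math

def sigmoid(logits):
    """Map every logit to an independent probability in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

def softmax(logits):
    """Map the logits to probabilities that compete with each other and sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, -3.0]
sig = sigmoid(logits)
soft = softmax(logits)

multilabel = [int(p >= 0.5) for p in sig]  # every label is decided on its own
multiclass = soft.index(max(soft))         # exactly one winning class
print(multilabel)  # [1, 1, 0]
print(multiclass)  # 0
```

With the sigmoid, the first two labels can both be switched on, whereas the softmax forces a single choice between them.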

<a id='section01'></a>
### Imports

In the next step we will be importing the libraries and modules needed to run our script.

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, hamming_loss
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizer, set_seed
from torch import cuda
from utils import BERTClass, MultiLabelDataset, train, validation, print_confusion_matrices

We start by defining a few key variables that will be used later during the training/fine-tuning stage. A label map is defined that maps integer ids to main topic labels, and a random seed is set for reproducibility. After that we prepare the device for CUDA execution. This configuration is needed if you want to make use of an onboard GPU. 

In [None]:
# Defining parameters used during training
# Set seed for reproducibility
set_seed(123)
# Pad or truncate text instances to a fixed length. Note that the longest instance in the dataset has 416 tokens, so longer instances are truncated to MAX_LEN tokens
MAX_LEN = 128
# Batch size (number of instances per batch); during hyper-parameter tuning we can change this value to 16/32/64
BATCH_SIZE = 8
# Number of epochs during model training
EPOCHS = 5
# Learning rate determines the size of the update steps taken during training; during hyper-parameter tuning we can change this value to 3e-05
LEARNING_RATE = 2e-05
# Instances are considered to have a given label if their probability is above 50% as set below
THRESHOLD = 0.5
# Number of main topic labels in the data
NUM_LABELS = 11
# The tokenizer of the pre-trained model used for the experiments
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Setting up the device for GPU usage if available, otherwise default to CPU use
device = 'cuda' if cuda.is_available() else 'cpu'

# Dictionary mapping ids (as strings) to main topic labels - this will be used to convert the model's output columns to label names
label_map_mt = {
    '0': 'Making contact with employee','1': 'Processes','2': 'Digital possibilities','3': 'General experience','4': 'Information provision','5': 'Employee attitude & behavior','6': 'Handling','7': 'No topic found','8': 'Knowledge & skills of employee',
    '9': 'Price & quality','10': 'Physical service provision'}

# Dictionary mapping main topics to their corresponding subtopics, will be used in Phase 2 of model training
label_map = {
    'Processes': ['Ease of process', 'Efficiency of process'], 
    'Making contact with employee': ['Waiting time', 'Availability of employee', 'Speaking to the right person'], 
    'Digital possibilities': [ 'Functionalities web & app', 'Information provision web & app', 'Ease of use web & app'], 
    'General experience': ['General experience subtopic'], 
    'Information provision': ['Clarity of information', 'Quality of information','Communication',  'Integrity & fulfilling responsibilities', 'Keeping up to date'], 
    'Employee attitude & behavior': ['Friendliness','Helpfulness', 'Personal approach','Genuine interest'], 
    'Handling': ['Speed of processing','Correctness of handling','Objection & evidence'], 
    'No topic found': ['No subtopic found'], 
    'Knowledge & skills of employee': ['Solution oriented','Expertise', 'Quality of customer service', 'Professionalism'], 
    'Price & quality': ['Price & costs', 'Payout & return'], 
    'Physical service provision': ['Reception & Registration',  'Opening hours & accessibility', 'Facilities']
    }

training_path = '../data/train.csv' # or '../data/oversampled_train.csv', '../data/undersampled_train.csv'
valid_path = '../data/validation.csv'
test_path = '../data/test.csv'

### Helper Functions
<a id='section02'></a>

*The helper functions are all located in utils.py. The description of the functions can be found below.*

#### Preparing the Dataset and Dataloader
##### *MultiLabelDataset* Class
- This class is defined to accept the `tokenizer`, `dataframe` and `max_length` as input and to generate the tokenized output and targets that are used by the BERT model for training. 
- We are using the BERT tokenizer to tokenize the data in the `text` column of the dataframe.
- The tokenizer uses the `encode_plus` method to perform tokenization and generate the necessary outputs, namely: `ids`, `attention_mask`, `token_type_ids`

- `targets` is the list of classes labeled as `0` or `1` in the dataframe. 
- The *MultiLabelDataset* class is used to create 3 datasets, for training, for validation and for testing.
- *Training Dataset* is used to train the model.
- *Validation Dataset* can be used for hyper-parameter tuning. The model has not seen this data during training.
- *Test Dataset* is used to evaluate the performance of the model. The model has not seen this data during training.
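The role of `max_length` can be illustrated without a tokenizer. The sketch below is a simplified stand-in for the padding and truncation that the tokenizer performs: `pad_or_truncate` is a hypothetical helper (not part of `utils.py`), the token ids are made up, and special tokens such as `[CLS]` and `[SEP]` are ignored for brevity.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (ids, attention_mask), both exactly `max_len` long."""
    ids = list(token_ids[:max_len])   # cut off everything beyond max_len
    mask = [1] * len(ids)             # 1 marks a real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

short_ids, short_mask = pad_or_truncate([7, 8, 9], max_len=5)
print(short_ids, short_mask)  # [7, 8, 9, 0, 0] [1, 1, 1, 0, 0]

long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_len=4)
print(long_ids, long_mask)    # [1, 2, 3, 4] [1, 1, 1, 1]
```

The attention mask tells the model which positions hold real tokens, so that the padding does not influence the prediction.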

##### Dataloader
- The dataloader is used to create training, validation and test dataloaders that feed the data to the neural network in a controlled manner. This is needed because the whole dataset cannot be loaded into memory at once, so the amount of data that is loaded into memory and passed to the neural network needs to be controlled.
- This control is achieved using parameters such as `batch_size` and `max_len`.
- Training and Validation dataloaders are used in the training and validation part of the flow respectively.
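What `batch_size` controls can be sketched in a few lines of plain Python. This is only a simplified picture of batching; the helper `iter_batches` is an assumption for illustration, and the real `DataLoader` additionally handles shuffling and the conversion to tensors.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 20 toy instances with a batch size of 8 give batches of 8, 8 and 4 instances.
batches = list(iter_batches(list(range(20)), batch_size=8))
print([len(batch) for batch in batches])  # [8, 8, 4]
```

Only one batch at a time has to be held in memory and passed to the network, which is why a smaller batch size lowers the memory requirements.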

#### *BERTClass*
 - We will be creating a neural network with the `BERTClass`. 
 - This network consists of the `BERT` model, followed by a `Dropout` and a `Linear Layer`. They are added for the purpose of **Regularization** and **Classification** respectively. 
 - In the forward pass, there are 2 outputs from the `BERT` model layer.
 - The second output, `output_1`, also called the `pooled output`, is passed to the `Dropout` layer, and the subsequent output is given to the `Linear Layer`. 
 - Note that the output dimension of the `Linear Layer` is the number of labels passed to `BERTClass`: **11** for the main topics in phase 1, and the number of subtopics of the given main topic in phase 2. 
 - The data will be fed to the `BERTClass` as defined in the dataset. 
 - The final layer outputs are used to calculate the loss and to determine the accuracy of the model's predictions. 
 - We will initiate an instance of the network called `model`. This instance will be used for training and then to save the final trained model for future inference. 
 
#### Loss Function and Optimizer
 - The Loss is defined in the `loss_fn` function.
 - As mentioned above, the loss function used is Binary Cross Entropy combined with a sigmoid layer, which is implemented as [BCEWithLogitsLoss](https://pytorch.org/docs/stable/nn.html#bcewithlogitsloss) in PyTorch.
 - The `Optimizer` (Adam optimizer) is defined in the Model Training section.
 - `Optimizer` is used to update the weights of the neural network to improve its performance.
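To show what this loss computes, here is a plain Python sketch of binary cross entropy on raw logits, averaged over the labels. It is not the code from `utils.py`; the function name `bce_with_logits` and the example numbers are assumptions for illustration.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross entropy over all labels, computed from raw logits."""
    losses = []
    for x, y in zip(logits, targets):
        # Numerically stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
        losses.append(max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x))))
    return sum(losses) / len(losses)

targets = [1, 0, 1]
good = bce_with_logits([4.0, -4.0, 3.0], targets)   # logits agree with the targets
bad = bce_with_logits([-4.0, 4.0, -3.0], targets)   # logits contradict the targets
print(good < bad)  # True
```

Every label contributes its own term to the loss, so the model is penalised for each label separately, independently of how the other labels are scored.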

#### Further Reading
- [Pytorch Documentation for Loss Function](https://pytorch.org/docs/stable/nn.html#loss-functions)
- [Pytorch Documentation for Optimizer](https://pytorch.org/docs/stable/optim.html)
- [Pytorch Documentation for Scheduler](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html)
- Refer to the links provided on the top of the notebook to read more about `BertModel`. 

#### Training the Model

We define a training function that trains the model on the training dataset. It is called a specified number of times (`EPOCHS`), where one epoch is one complete pass of the training data through the network. 
The following events happen in this function to fine-tune the neural network:
- The dataloader passes data to the model based on the batch size. 
- Subsequent output from the model and the actual labels are compared to calculate the loss. 
- Loss value is used to optimize the weights of the neurons in the network.
- After every 500 steps the loss value is printed out.

#### Validating the model

During the validation stage we pass the unseen data (validation dataset) to the model. This step determines how well the model performs on unseen data. This validation data is 10% of the full dataset. During the validation stage the weights of the model are not updated. Only the final output is compared to the actual value. This comparison is then used to calculate the performance of the model. To get a measure of the model's performance we are using the classification report and hamming loss from scikit-learn. 
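As a rough illustration of what the hamming loss measures, the sketch below computes it by hand for two toy instances with three labels each. The function name `hamming_loss_manual` and the data are made up; in the notebook itself the scikit-learn implementation is used.

```python
def hamming_loss_manual(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(
        t != p
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    )
    total = sum(len(row) for row in y_true)
    return wrong / total

y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
# 2 of the 6 label slots are wrong
print(hamming_loss_manual(y_true, y_pred))  # 0.3333333333333333
```

Unlike exact-match accuracy, this metric gives partial credit: an instance with one wrong label out of many still counts mostly as correct.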

<a id='section03'></a>
### Pre-processing the domain data

We will now load the data and prepare it for fine-tuning. 

*Assuming that `train.csv`, `validation.csv` and `test.csv` are saved in the `data` folder*

* A new dataframe is made and the input text is stored in the **text** column.
* The values of all the main topic classes are taken and converted into a list.
* The list is added as a new column named **labels**.

In [None]:
# Load datasets and keep the 'id' column
train_data = pd.read_csv(training_path, sep=';')
validation_data = pd.read_csv(valid_path, sep=';')
test_data = pd.read_csv(test_path, sep=';')

# Create new DataFrames
train_df = pd.DataFrame({
    'id': train_data['id'],
    'text': train_data['text'],
    'labels': train_data.iloc[:, 2:13].values.tolist()  # 'text' is the first column and 'id' is the second. We only take the main topic labels.
})

val_df = pd.DataFrame({
    'id': validation_data['id'],
    'text': validation_data['text'],
    'labels': validation_data.iloc[:, 2:13].values.tolist()
})

test_df = pd.DataFrame({
    'id': test_data['id'],
    'text': test_data['text'],
    'labels': test_data.iloc[:, 2:13].values.tolist()
})

In [None]:
# Creating the dataset and dataloader for the neural network
print("TRAIN Dataset: {}".format(train_df.shape))
print("VALID Dataset: {}".format(val_df.shape))
print("TEST Dataset: {}".format(test_df.shape))

training_set = MultiLabelDataset(train_df, tokenizer, MAX_LEN)
validation_set = MultiLabelDataset(val_df, tokenizer, MAX_LEN)
test_set = MultiLabelDataset(test_df, tokenizer, MAX_LEN)

training_loader = DataLoader(training_set, batch_size=BATCH_SIZE, shuffle=True)
validation_loader = DataLoader(validation_set, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_set, batch_size=BATCH_SIZE, shuffle=False)

### Model Training
<a id='section04'></a>

We define an optimizer and a scheduler to be used during training. The Adam optimizer is used alongside a scheduler that decays the learning rate of each parameter group by `gamma` every epoch. We then loop through the number of epochs and call the train function, training the model for main topic prediction.
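The effect of such a step decay on the learning rate can be worked out by hand. The sketch below uses the values from this notebook (initial learning rate `2e-05`, `gamma=0.95`, `step_size=1`) and assumes that the scheduler is stepped once per epoch, which depends on how `train` in `utils.py` is written. The helper `step_lr` is a hypothetical re-implementation of the decay rule, not a PyTorch function.

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate in effect during `epoch` (0-based) under a step decay."""
    return initial_lr * gamma ** (epoch // step_size)

schedule = [step_lr(2e-05, gamma=0.95, step_size=1, epoch=e) for e in range(5)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.3e}")
```

With these settings the learning rate shrinks by 5% per epoch, so the updates become slightly more cautious as training progresses.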

In [None]:
model = BERTClass(NUM_LABELS)
model.to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

In [None]:
for epoch in range(EPOCHS): 
    train(model, device, training_loader, epoch, optimizer, scheduler)

### Validation
<a id='section05'></a>

#### Phase 1: Main topics

We use the validation function and the validation_loader to evaluate model performance given the model's parameters. The model's main topic predictions are saved to a CSV file; the model itself is saved in a later section for future use.

In [None]:
outputs, targets = validation(model, validation_loader, device)
outputs = np.array(outputs) >= THRESHOLD
outputs = np.array(outputs, dtype=int)
targets = np.array(targets, dtype=int)
target_names = [label_map_mt[str(i)] for i in range(outputs.shape[1])]

# Create DataFrame for results and save to CSV
results_df = pd.DataFrame(outputs, columns=target_names)
results_df['id'] = val_df['id'].values  # Adding ID for reference
results_df['text'] = val_df['text'].values  # Adding text for reference
results_df = results_df[['id', 'text'] + target_names]
results_df.to_csv('../model_predictions/bert_predictions_valid_maintopics.csv', sep=';', index=False) # or '../model_predictions/bert_predictions_valid_maintopics_undersampled.csv' / '../model_predictions/bert_predictions_valid_maintopics_oversampled.csv'


#### Phase 2: Subtopics
We now train models for predicting subtopics. For this we load the CSV file saved in the previous phase. For each main topic we train a model on its corresponding subtopics and use it for prediction. We store the final results in a CSV file and inspect the performance by comparing the model predictions against the gold data.
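The way the subtopic predictions are written back into the phase 1 results can be sketched with plain dictionaries keyed by instance id. The ids and predictions below are made up for illustration; the notebook itself works on pandas dataframes.

```python
# Phase 1 output: main topic prediction per instance, subtopic columns initialised to 0.
records = {
    101: {"Price & quality": 1, "Price & costs": 0, "Payout & return": 0},
    102: {"Price & quality": 0, "Price & costs": 0, "Payout & return": 0},
    103: {"Price & quality": 1, "Price & costs": 0, "Payout & return": 0},
}

# Hypothetical phase 2 output, only for instances where the main topic was predicted.
subtopic_predictions = {
    101: {"Price & costs": 1, "Payout & return": 0},
    103: {"Price & costs": 1, "Payout & return": 1},
}

# Write the subtopic predictions back, matching on the instance id.
for instance_id, predicted in subtopic_predictions.items():
    records[instance_id].update(predicted)

print(records[102])  # {'Price & quality': 0, 'Price & costs': 0, 'Payout & return': 0}
print(records[103])  # {'Price & quality': 1, 'Price & costs': 1, 'Payout & return': 1}
```

Instances for which the main topic was not predicted (id 102 here) keep their zero-initialised subtopic values.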


In [None]:
# Load the predictions of phase 1
main_topic_predictions_df = pd.read_csv('../model_predictions/bert_predictions_valid_maintopics.csv', sep=';')  # or '../model_predictions/bert_predictions_valid_maintopics_undersampled.csv' / '../model_predictions/bert_predictions_valid_maintopics_oversampled.csv'

# Create new columns for the subtopics in the dataframe
for main_topic, subtopics in label_map.items():
    for subtopic in subtopics:
        if subtopic not in main_topic_predictions_df.columns:
            main_topic_predictions_df[subtopic] = 0

# Process each main topic and its subtopics stepwise
for main_topic, subtopics in label_map.items():
    print(f"Processing main topic: {main_topic} for the subtopics: {subtopics}")
    
    # If there's only one subtopic, directly assign the main topic's prediction to the subtopic
    if len(subtopics) == 1:
        subtopic = subtopics[0]
        main_topic_predictions_df[subtopic] = main_topic_predictions_df[main_topic].astype(int)
        continue 

    # Filter training data for the current main topic
    train_df = train_data[train_data[main_topic] == 1].copy()
    train_df['labels'] = train_df[subtopics].values.tolist()

    # Filter validation instances predicted to have the current main topic; their subtopic columns are still placeholder zeros
    valid_df = main_topic_predictions_df[main_topic_predictions_df[main_topic] == 1].copy()
    valid_df['labels'] = valid_df[subtopics].values.tolist()

    # After filtering to get the train and valid datasets, reset the index
    train_subtopic_df = train_df[['id', 'text', 'labels']].reset_index(drop=True)
    valid_subtopic_df = valid_df[['id', 'text', 'labels']].reset_index(drop=True)

    train_set = MultiLabelDataset(train_subtopic_df, tokenizer, MAX_LEN)
    valid_set = MultiLabelDataset(valid_subtopic_df, tokenizer, MAX_LEN)
    training_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
    valid_loader = DataLoader(valid_set, batch_size=BATCH_SIZE, shuffle=False)
    
    # Model, Optimizer and Scheduler setup
    # Use a separate variable so that the main topic model in `model` is not overwritten before it is saved
    subtopic_model = BERTClass(len(subtopics)).to(device)
    optimizer = torch.optim.Adam(subtopic_model.parameters(), lr=LEARNING_RATE)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

    # Train model
    for epoch in range(EPOCHS):
        train(subtopic_model, device, training_loader, epoch, optimizer, scheduler)

    # Evaluate the model
    subtopic_outputs, _ = validation(subtopic_model, valid_loader, device)
    subtopic_outputs = np.array(subtopic_outputs) >= THRESHOLD
    subtopic_outputs = np.array(subtopic_outputs, dtype=int)

    # Update the predictions dataframe with new subtopic predictions
    subtopic_predictions_df = pd.DataFrame(subtopic_outputs, columns=subtopics)
    subtopic_predictions_df['id'] = valid_subtopic_df['id'].values  # Add ID for reference

    for index, row in subtopic_predictions_df.iterrows():
        match_index = main_topic_predictions_df[main_topic_predictions_df['id'] == row['id']].index
        for subtopic in subtopics:
            main_topic_predictions_df.loc[match_index, subtopic] = row[subtopic]
    
    print(f"Completed processing for {main_topic}")

# Save the final updated main topics CSV with all subtopics predictions included
main_topic_predictions_df.to_csv('../model_predictions/bert_predictions_valid_2step.csv', sep=';', index=False)  # or '../model_predictions/bert_predictions_valid_2step_undersampled.csv' / '../model_predictions/bert_predictions_valid_2step_oversampled.csv'
print("All predictions saved.")

In [None]:
# Performance evaluation
predictions = pd.read_csv('../model_predictions/bert_predictions_valid_2step.csv', sep=';') # or '../model_predictions/bert_predictions_valid_2step_undersampled.csv' / '../model_predictions/bert_predictions_valid_2step_oversampled.csv'
gold_data = pd.read_csv('../data/validation.csv', sep=';')

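# Sort and align by 'id' (the index is reset so that both frames are compared by position)
gold_data = gold_data.sort_values('id').reset_index(drop=True)
predictions = predictions.sort_values('id').reset_index(drop=True)
assert gold_data['id'].tolist() == predictions['id'].tolist(), "IDs do not match between gold data and predictions."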

# Extract gold labels and predicted labels
labels = [col for col in gold_data.columns if col not in ['id', 'text']]
gold_labels = gold_data[labels]
predicted_labels = predictions[labels]

class_report = classification_report(gold_labels, predicted_labels, target_names=labels, zero_division=0)
print(class_report)
hamm_loss = hamming_loss(gold_labels, predicted_labels)
print(f"Hamming Loss: {hamm_loss}")

### Saving the trained model and vocabulary
<a id='section06'></a>

This is the final step in the process of fine-tuning the model. The model for predicting the main topics and its vocabulary are saved locally. These files are then used to run inference on new feedback instances (such as the test set).

In [None]:
torch.save(model, '../models/pytorch_bert_MLTC_mainonly.bin') # or '../models/pytorch_bert_MLTC_mainonly_oversampled.bin' / '../models/pytorch_bert_MLTC_mainonly_undersampled.bin'
tokenizer.save_vocabulary('../models/vocab_bert_MLTC_mainonly.bin') # or '../models/vocab_bert_MLTC_mainonly_oversampled.bin' / '../models/vocab_bert_MLTC_mainonly_undersampled.bin'

### Evaluate on the test data
<a id='section07'></a>

We now load the saved model and vocabulary and use them, together with the test data, for the final evaluation of the model.

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = BertTokenizer.from_pretrained('../models/vocab_bert_MLTC_mainonly.bin') # or '../models/vocab_bert_MLTC_mainonly_oversampled.bin' / '../models/vocab_bert_MLTC_mainonly_undersampled.bin'
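# torch.load restores the full pickled model, so BERTClass must be importable (here via utils); recent PyTorch versions may require passing weights_only=False
model = torch.load('../models/pytorch_bert_MLTC_mainonly.bin') # or '../models/pytorch_bert_MLTC_mainonly_oversampled.bin' / '../models/pytorch_bert_MLTC_mainonly_undersampled.bin'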
model.to(device)

#### Phase 1: Main topic prediction

In [None]:
outputs, targets = validation(model, test_loader, device)
outputs = np.array(outputs) >= THRESHOLD
outputs = np.array(outputs, dtype=int)
targets = np.array(targets, dtype=int)
target_names = [label_map_mt[str(i)] for i in range(outputs.shape[1])]

# Create DataFrame for results and save to CSV
results_df = pd.DataFrame(outputs, columns=target_names)
results_df['id'] = test_df['id'].values  # Adding ID for reference
results_df['text'] = test_df['text'].values  # Adding text for reference
results_df = results_df[['id', 'text'] + target_names]
results_df.to_csv('../model_predictions/bert_predictions_test_maintopics.csv', sep=';', index=False) # or '../model_predictions/bert_predictions_test_maintopics_oversampled.csv' / '../model_predictions/bert_predictions_test_maintopics_undersampled.csv'

#### Phase 2: Subtopic prediction

In [None]:
# Load the predictions of phase 1
main_topic_predictions_df = pd.read_csv('../model_predictions/bert_predictions_test_maintopics.csv', sep=';') # or '../model_predictions/bert_predictions_test_maintopics_oversampled.csv' / '../model_predictions/bert_predictions_test_maintopics_undersampled.csv'

# Create new columns for the subtopics in the dataframe
for main_topic, subtopics in label_map.items():
    for subtopic in subtopics:
        if subtopic not in main_topic_predictions_df.columns:
            main_topic_predictions_df[subtopic] = 0

# Process each main topic and its subtopics stepwise
for main_topic, subtopics in label_map.items():
    print(f"Processing main topic: {main_topic} for the subtopics: {subtopics}")
    
    # If there's only one subtopic, directly assign the main topic's prediction to the subtopic
    if len(subtopics) == 1:
        subtopic = subtopics[0]
        main_topic_predictions_df[subtopic] = main_topic_predictions_df[main_topic].astype(int)
        continue 

    # Filter training data for the current main topic
    train_df = train_data[train_data[main_topic] == 1].copy()
    train_df['labels'] = train_df[subtopics].values.tolist()

    # Filter test instances predicted to have the current main topic; their subtopic columns are still placeholder zeros
    valid_df = main_topic_predictions_df[main_topic_predictions_df[main_topic] == 1].copy()
    valid_df['labels'] = valid_df[subtopics].values.tolist()

    # After filtering to get the train and valid datasets, reset the index
    train_subtopic_df = train_df[['id', 'text', 'labels']].reset_index(drop=True)
    valid_subtopic_df = valid_df[['id', 'text', 'labels']].reset_index(drop=True)

    train_set = MultiLabelDataset(train_subtopic_df, tokenizer, MAX_LEN)
    valid_set = MultiLabelDataset(valid_subtopic_df, tokenizer, MAX_LEN)
    training_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
    valid_loader = DataLoader(valid_set, batch_size=BATCH_SIZE, shuffle=False)
    
    # Model, Optimizer and Scheduler setup
    model = BERTClass(len(subtopics)).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

    # Train model
    for epoch in range(EPOCHS):
        train(model, device, training_loader, epoch, optimizer, scheduler)    
    
    # Evaluate the model
    subtopic_outputs, _ = validation(model, valid_loader, device)
    subtopic_outputs = np.array(subtopic_outputs) >= THRESHOLD
    subtopic_outputs = np.array(subtopic_outputs, dtype=int)

    # Update the predictions dataframe with new subtopic predictions
    subtopic_predictions_df = pd.DataFrame(subtopic_outputs, columns=subtopics)
    subtopic_predictions_df['id'] = valid_subtopic_df['id'].values  # Add ID for reference

    for index, row in subtopic_predictions_df.iterrows():
        match_index = main_topic_predictions_df[main_topic_predictions_df['id'] == row['id']].index
        for subtopic in subtopics:
            main_topic_predictions_df.loc[match_index, subtopic] = row[subtopic]
    
    print(f"Completed processing for {main_topic}")

# Save the final updated main topics CSV with all subtopics predictions included
main_topic_predictions_df.to_csv('../model_predictions/bert_predictions_test_2step.csv', sep=';', index=False) # or '../model_predictions/bert_predictions_test_2step_oversampled.csv' / '../model_predictions/bert_predictions_test_2step_undersampled.csv'
print("All predictions saved.")

In [None]:
# Performance results
predictions = pd.read_csv('../model_predictions/bert_predictions_test_2step.csv', sep=';') # or '../model_predictions/bert_predictions_test_2step_oversampled.csv' / '../model_predictions/bert_predictions_test_2step_undersampled.csv'
gold_data = pd.read_csv('../data/test.csv', sep=';')

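# Sort and align by 'id' (the index is reset so that both frames are compared by position)
gold_data = gold_data.sort_values('id').reset_index(drop=True)
predictions = predictions.sort_values('id').reset_index(drop=True)
assert gold_data['id'].tolist() == predictions['id'].tolist(), "IDs do not match between gold data and predictions."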

# Extract gold labels and predicted labels
labels = [col for col in gold_data.columns if col not in ['id', 'text']]
gold_labels = gold_data[labels]
predicted_labels = predictions[labels]

class_report = classification_report(gold_labels, predicted_labels, target_names=labels, zero_division=0, output_dict=True)
print(classification_report(gold_labels, predicted_labels, target_names=labels, zero_division=0))
hamm_loss = hamming_loss(gold_labels, predicted_labels)
print(f"Hamming Loss: {hamm_loss}")
print_confusion_matrices(predicted_labels.to_numpy(), gold_labels.to_numpy(), labels)

report_df = pd.DataFrame(class_report).transpose()
report_df = report_df.round(3) 
report_df.to_csv('../results/bert/test_report_2step.csv', sep=';', index=True) # or '../results/bert/test_report_2step_oversampled.csv' / '../results/bert/test_report_2step_undersampled.csv'
print("Classification report saved to file.")

End of the notebook.