# D. Transformers

In this notebook, I build two transformer models to classify the news headlines into the correct categories. This notebook deviates from the previous notebooks in that I use neither the resampling techniques nor the base classifiers. First, I did not use the resampling techniques because they did not improve performance in the previous notebooks. Second, the transformer models take the text itself as input, so there is no need to transform the text into numerical representations and train the base classifiers on them.

To train the transformer models, I use the Python package Simple Transformers, an easy-to-use package for implementing transformer models (Miric et al., 2022). More documentation about this package can be found at the link below. Transformer models have shown strong performance on previous natural language processing tasks. Therefore, I train two transformer models on this dataset:

1. BERT
2. RoBERTa

Moreover, note that I use the preprocessed text data as input for these models and not the original raw data.

https://simpletransformers.ai

# 0. Data loading

In [1]:
# General Packages #
import os
import pandas as pd
import numpy as np
import string
import re
from scipy.stats import randint
import random
from collections import Counter

# Sklearn Packages #
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, make_scorer

# Transform packages
from simpletransformers.classification import ClassificationModel, ClassificationArgs
from sklearn.preprocessing import LabelEncoder
import logging


In [2]:
# Settings
logging.basicConfig(level=logging.ERROR)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.ERROR)

In [3]:
# Turn off warnings to avoid messages that might cause confusion here
import warnings
warnings.filterwarnings("ignore")

In [4]:
# Change to Working Directory with Training Data # 
#os.chdir("/Users/Artur/Desktop/thesis_HIR_versie5/coding")
os.chdir("/Users/juarel/Desktop/studies artur/thesis_HIR/coding")

# Load the preprocessed data #
df_train = pd.read_csv("./data/gold_data/train.csv", header = 0)
df_test = pd.read_csv("./data/gold_data/test.csv", header = 0)

# inspect the data
df_train.head(5)

Unnamed: 0,id,Headline,category,cleaned_headline
0,194578,Head Line: US Patent granted to BASF SE (Delaw...,,head u patent granted se delaware may titled c...
1,564295,Societe Generale Launches a Next-Generation Ca...,,societe generale launch nextgeneration card in...
2,504138,BARCLAYS PLC Form 8.3 - EUTELSAT COMMUNICATION...,,plc form communication
3,91379,ASML: 4Q Earnings Snapshot,,4q earnings snapshot
4,265750,Form 8.3 - AXA INVESTMENT MANAGERS : Booker Gr...,,form investment manager group plc


# 1. Define functions and parameters

Before we continue, we define some useful functions and parameters that are used throughout this notebook. These functions are used to train the models and to evaluate them afterwards.

1. get_classification_metrics: a function that returns the classification metrics for each model. The precision, recall and F1 score are macro averages: the unweighted mean of the per-class scores, so every class counts equally regardless of its size.

2. f1_multiclass, prec_multiclass and recall_multiclass: separate functions that each return one classification metric. They are used instead of the previous 'get_classification_metrics()' function because they are easier to pass to the transformer models. Moreover, the f1_multiclass function is also used as a scoring metric to optimize the training of the transformers for the F1 metric.

3. I define the input parameters for the two models. The Simple Transformers package allows for a wide range of parameters. For more information about these parameters, I refer to the link at the top of the notebook.
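
To see why macro averaging matters on an imbalanced dataset, the sketch below recomputes accuracy and the macro F1 score by hand on a toy example. The function `macro_f1` and the toy labels are illustrative assumptions and not part of the notebook; in the notebook itself the scores come from scikit-learn's `f1_score(..., average='macro')`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (hand-rolled for illustration)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly gets an F1 of 0
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy data (hypothetical): 9 headlines of a frequent class, 1 of a rare class
y_true_toy = ["frequent"] * 9 + ["rare"]
y_pred_toy = ["frequent"] * 10  # the rare class is never predicted

accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
macro_f1_toy = macro_f1(y_true_toy, y_pred_toy)

print(f"accuracy: {accuracy_toy:.3f}, macro F1: {macro_f1_toy:.3f}")
# accuracy: 0.900, macro F1: 0.474
```

In this toy case the accuracy stays high while the macro F1 drops sharply, because the missed rare class weighs as much as the frequent one. A gap of this kind between accuracy and macro F1 is therefore a sign that some of the smaller classes are predicted poorly.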


In [5]:
# 1. Function that returns classification metrics
def get_classification_metrics(y_true, y_pred):
    
    # Calculate Model Performance Metrics
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='macro')
    recall = recall_score(y_true, y_pred, average='macro')
    f1 = f1_score(y_true, y_pred, average='macro')


    return accuracy, precision, recall, f1

In [6]:
# 2. Define functions that return classification metrics
def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='macro')

def prec_multiclass(labels, preds):
    return precision_score(labels, preds, average='macro')

def recall_multiclass(labels, preds):
    return recall_score(labels, preds, average='macro')


Here, I define the parameters that serve as input for my models. There are several things I want to highlight:

1. One of the most important decisions was not to include an evaluation set while training the model. I did not include one because it would drastically reduce the already limited training data. Moreover, results showed that leaving out the evaluation set resulted in better performance on my test set, probably because of the extra data available to train the model.

2. I added a regularization term (weight decay) to prevent overfitting. This is especially relevant as I did not include an evaluation set while training the model.

3. I trained the model to optimize the F1 score (macro-averaged, as computed by f1_multiclass), just as for all the previous models.

4. To determine the max_seq_length, I checked the length of all the headlines in the training dataset. The code for this check can be found below the parameter definitions.

In [7]:
# Define the storage path for the BERT files: they are very large and cannot be pushed to GitHub
file_path = '/Users/juarel/Desktop/studies artur/thesis_HIR/big files/BERT/'

# 3. Define the different parameters
train_args = {
    'output_dir': f'{file_path}transformers-outputs',
    'best_model_dir': f'{file_path}transformers-outputs/best_model', # directory to save best models at check points
    'cache_dir': f'{file_path}transformers-cache_dir',
    'tensorboard_dir': f'{file_path}transformers-runs',

    'max_seq_length': 69,            # maximum number of tokens per input (approximated by the word count); only a few headlines have more than 69 words
    'do_lower_case': True,           # Set true when using uncased models
    'num_train_epochs': 10,           # The number of times the equivalent of a full training set has been processed
    'train_batch_size': 32,          # The training batch size
    'learning_rate': 4e-5,           # Controls how fast model weights are updated
    'save_steps': 1000,
    "save_model_every_epoch": False,         # Save a model checkpoint at the end of every epoch.
    'overwrite_output_dir': True,            # Overwrite existing saved models in same directory
    'no_cache': True,                        # No cache features to disk
    'use_multiprocessing': True,             # use multiprocessing when converting data into features

    'manual_seed': 7,                  # Ensure results can be reproduced
    'weight_decay': 0.001,             # Adds L2 penalty (low due to limited dataset)
    'metrics': ['f1_multiclass']
}

Determine the length of the different headlines

In [8]:
# Initialize a list to store the headline lengths
length = []

# Iterate over each headline in the cleaned_headline column
for sentence in df_train['cleaned_headline']:

    # Split the headline into words
    words = sentence.split()

    # Store the number of words in this headline
    length.append(len(words))

# Look at the mean number of words per headline
np.mean(length)

# Create a DataFrame from the 'length' list
df_lengths = pd.DataFrame({'Headline words': length})

# Sort the DataFrame by the 'Headline words' column in descending order
df_sorted = df_lengths.sort_values(by='Headline words', ascending=False)
df_sorted.head(10)

Unnamed: 0,Headline words
29057,112
7029,98
7244,92
29504,91
11858,79
15788,79
6946,76
32251,69
10383,68
20039,68
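
As a complement to the top-10 table, the share of headlines that fit within a given cutoff can be computed directly. The sketch below is a minimal illustration with made-up word counts; the helper `share_within_cutoff` and the toy list are assumptions, and in the notebook the `length` list would take the place of `toy_lengths`. Note that this counts whitespace-separated words, whereas the models count subword tokens plus special tokens, so the word count is only a lower bound on the token count.

```python
def share_within_cutoff(lengths, cutoff):
    """Fraction of headlines whose word count does not exceed `cutoff`."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= cutoff) / len(lengths)


# Hypothetical word counts; in the notebook this would be the `length` list
toy_lengths = [5, 8, 12, 7, 69, 112, 9, 10]

coverage = share_within_cutoff(toy_lengths, cutoff=69)
print(f"{coverage:.1%} of the toy headlines fit within 69 words")
# 87.5% of the toy headlines fit within 69 words
```

Headlines longer than the cutoff are not dropped by the model but truncated, so a cutoff that covers nearly all headlines loses little information while keeping the input sequences short.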


In [9]:
# define the independent and dependent variables
X_train = df_train['cleaned_headline']
X_test = df_test['cleaned_headline']

y_train = df_train['category']
y_test = df_test['category']

# 2. Transformers

In [10]:
# Define which vectorizer the models are built with (used when storing the results)
vectorizer = 'Transformer'

#### Transform the labels of the categories

By default, the transformer models expect the category labels to be integer values from 0 up to the number of labels minus one. Therefore, we first need to encode the labels in the training and test sets to create the right input for both models.

In [11]:
# Define the LabelEncoder
label_encoder = LabelEncoder()

# Encode the labels as integers for the train and test set
df_train["label_encoded"] = label_encoder.fit_transform(df_train["category"])
df_test["label_encoded"] = label_encoder.transform(df_test["category"])

# Get the number of labels to check
num_labels = len(label_encoder.classes_)
num_labels

15
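
The mapping that the label encoding produces can be sketched in plain Python. This is only an illustration of the idea, assuming (as scikit-learn's `LabelEncoder` does) that the classes are numbered in sorted order; the helper `fit_label_mapping` and the category names are hypothetical.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer in 0 .. n_classes - 1 (sorted order)."""
    classes = sorted(set(labels))
    mapping = {label: idx for idx, label in enumerate(classes)}
    return mapping, classes


# Hypothetical categories standing in for df_train["category"]
train_categories = ["merger", "earnings", "patent", "earnings"]
mapping, classes = fit_label_mapping(train_categories)

# Encode the labels as integers, then decode them again
encoded = [mapping[c] for c in train_categories]
decoded = [classes[i] for i in encoded]

print(encoded)  # [1, 0, 2, 0]
print(decoded)  # ['merger', 'earnings', 'patent', 'earnings']
```

The mapping is fitted on the training labels only and then reused for the test labels, which is why the notebook calls `fit_transform` on the training set and `transform` on the test set. A test category that never occurs in the training set cannot be encoded: in this sketch it raises a `KeyError`, and `LabelEncoder.transform` raises a `ValueError` in the same situation.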

Next, the transformer models require a dataframe as input with the text data in one column and the labels as integer values in the other column.

In [12]:
# Create a dataframe that serves as input for the transformer models
df_train_tranf = df_train[['cleaned_headline', 'label_encoded']]
df_train_tranf.columns = ["text", "labels"]
df_train_tranf['labels'].nunique()

# Transform the test set for evaluation of the models
df_test_transf = df_test[['cleaned_headline', 'label_encoded']]
df_test_transf.columns = ["text", "labels"]

## 2.1 BERT

First, I train the BERT model. I specify that I use the bert-base-uncased model. This means it uses the base architecture and the model does not make any distinction between lower and upper case. As I cleaned the data and put it in lower case, I want to use an uncased model.

#### Train the model

In [13]:
# Define the transformer you use
FS = 'Bert'

In [14]:
# Define the model
model_bert = ClassificationModel('bert', 'bert-base-uncased', num_labels=15, args=train_args, 
                            use_cuda = False)

In [15]:
# Train the model
model_bert.train_model(df_train_tranf, f1=f1_multiclass)

  0%|          | 0/43246 [00:00<?, ?it/s]

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 0 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 1 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 2 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 3 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 4 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 5 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 6 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 7 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 8 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 9 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

(13520, 0.10275714916605165)
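
The first element of the returned tuple can be checked against the training settings. The short calculation below assumes that this element is the total number of optimization steps and that no gradient accumulation is used; the second element is presumably the average training loss, which is my reading of the Simple Transformers documentation rather than something shown in this notebook.

```python
import math

n_train = 43246         # training headlines, from the first progress bar
train_batch_size = 32   # 'train_batch_size' in train_args
num_train_epochs = 10   # 'num_train_epochs' in train_args

# The last batch of an epoch is smaller, so the step count is rounded up
steps_per_epoch = math.ceil(n_train / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)  # 1352 13520
```

Both numbers agree with the output above: 1352 steps per epoch in the progress bars and 13520 steps in total.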

In [16]:
# Evaluate the performance of the model on the test set
result_bert, outputs_bert, wrong_predictions_bert = model_bert.eval_model(df_test_transf,
                                                            f1=f1_multiclass, acc=accuracy_score,
                                                            prec=prec_multiclass,
                                                            recall=recall_multiclass)

  0%|          | 0/10810 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/1352 [00:00<?, ?it/s]

In [17]:
# Look at the results
result_bert

{'mcc': 0.6587668933100075,
 'f1': 0.5875861863141217,
 'acc': 0.9350601295097132,
 'prec': 0.5992932422951209,
 'recall': 0.5799231552974732,
 'eval_loss': 0.5098160408648408}

Here, I provide the code to make predictions on new test observations and to retrieve their predicted labels. After this step, I again evaluate the performance of the model on these predictions. However, this should give the same results as the evaluation metrics above.

In [19]:
# Convert the input data to a list of strings
X_test = [str(i) for i in df_test_transf['text'].values]

In [20]:
# Make predictions on new test observations
predictions_labeled_bert, raw_outputs_bert = model_bert.predict(X_test)

  0%|          | 0/10810 [00:00<?, ?it/s]

  0%|          | 0/1352 [00:00<?, ?it/s]
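
Besides the predicted labels, `predict` also returns the raw model outputs. Assuming these are one row of unnormalized scores (logits) per headline, they can be turned into class probabilities with a softmax, and the predicted label is the position of the largest score. The sketch below illustrates this on made-up scores; `softmax`, `predicted_label` and the toy values are assumptions and not part of the notebook.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def predicted_label(logits):
    """Index of the class with the highest raw score."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical raw outputs for two headlines and three classes
toy_raw_outputs = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]

toy_probabilities = [softmax(row) for row in toy_raw_outputs]
toy_predictions = [predicted_label(row) for row in toy_raw_outputs]

print(toy_predictions)  # [0, 2]
```

The probabilities can be useful to flag headlines for which the model is uncertain, for example when the highest probability is only slightly larger than the second highest.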

In [21]:
# Transform predicted labels back to original form
y_pred_bert = label_encoder.inverse_transform(predictions_labeled_bert)

In [22]:
# Evaluate the performance of the model
accuracy, precision, recall, f1 = get_classification_metrics(y_test, y_pred_bert)
accuracy, precision, recall, f1

(0.9350601295097132,
 0.5992932422951209,
 0.5799231552974732,
 0.5875861863141217)

Store the results and the predictions in a dataframe

In [23]:
# Store the results in a one-row dataframe, with one column per metric
columns = ['vectorizer', 'FS', 'accuracy', 'precision', 'recall', 'f1']
df_results_bert = pd.DataFrame([[vectorizer, FS, accuracy, precision, recall, f1]],
                               columns=columns, index=['BERT'])

In [24]:
# Create a dataframe with the predictions
predictions_bert = pd.DataFrame({'Predictions': y_pred_bert})

#### Save the results

In [25]:
# Save the results and predictions
df_results_bert.to_csv('./Output/Model performance/results_BERT.csv', index = False, header = True)
predictions_bert.to_csv('./Output/predictions/BERT.csv', index = True, header = True)

## 2.2 RoBERTa

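Note that this is exactly the same code as for the BERT model. The same input data, parameters and procedures are used. Only the model definition changes: I load the roberta-base model instead of the BERT model. Note that roberta-base is a cased model, as RoBERTa has no uncased variant; the input is nevertheless the same lower-cased text.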

In [26]:
# Define the storage path for the RoBERTa files: they are very large and cannot be pushed to GitHub
file_path = '/Users/juarel/Desktop/studies artur/thesis_HIR/big files/RoBERTa/'
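
One thing to watch here: the f-strings in `train_args` were evaluated when the dictionary was created, so reassigning `file_path` does not by itself change the directories stored in it. As far as the cells shown in this notebook go, the RoBERTa run would therefore still write to the BERT directories unless `train_args` is updated. The sketch below shows one possible way to rebuild the paths; the helper `build_storage_paths` and the shortened toy paths are hypothetical.

```python
def build_storage_paths(file_path):
    """Return the storage-related entries of the training arguments for one model."""
    return {
        'output_dir': f'{file_path}transformers-outputs',
        'best_model_dir': f'{file_path}transformers-outputs/best_model',
        'cache_dir': f'{file_path}transformers-cache_dir',
        'tensorboard_dir': f'{file_path}transformers-runs',
    }


# Toy stand-ins for the real directories (hypothetical paths)
file_path_toy = 'big files/BERT/'
toy_args = {'num_train_epochs': 10, **build_storage_paths(file_path_toy)}

# Reassigning the variable alone leaves the dictionary untouched
file_path_toy = 'big files/RoBERTa/'
stale_output_dir = toy_args['output_dir']
print(stale_output_dir)        # big files/BERT/transformers-outputs

# Rebuilding the path entries redirects the output to the new directory
toy_args.update(build_storage_paths(file_path_toy))
print(toy_args['output_dir'])  # big files/RoBERTa/transformers-outputs
```

In the notebook, the equivalent step would be to update the four path entries of `train_args` after changing `file_path` and before defining the RoBERTa model.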

#### Train the model

In [27]:
# Define the transformer you use
FS = 'RoBERTa'

In [30]:
# Define the model
model_roberta = ClassificationModel('roberta', 'roberta-base', num_labels=15, args=train_args, 
                            use_cuda = False)

In [None]:
# Train the model
model_roberta.train_model(df_train_tranf, f1=f1_multiclass)

  0%|          | 0/43246 [00:00<?, ?it/s]

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 0 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 1 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 2 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 3 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 4 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 5 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 6 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 7 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

Running Epoch 8 of 10:   0%|          | 0/1352 [00:00<?, ?it/s]

In [None]:
# Evaluate the performance of the model on the test set
result_roberta, outputs_roberta, wrong_predictions_roberta = model_roberta.eval_model(df_test_transf,
                                                            f1=f1_multiclass, acc=accuracy_score,
                                                            prec=prec_multiclass,
                                                            recall=recall_multiclass)

In [None]:
# Look at the results
result_roberta

Here, I provide the code to make predictions on new test observations and to retrieve their predicted labels. After this step, I again evaluate the performance of the model on these predictions. However, this should give the same results as the evaluation metrics above.

In [None]:
# Convert the input data to a list of strings
X_test = [str(i) for i in df_test_transf['text'].values]

In [None]:
# Make predictions on new test observations
predictions_labeled_roberta, raw_outputs_roberta = model_roberta.predict(X_test)

In [None]:
# Transform predicted labels back to original form
y_pred_roberta = label_encoder.inverse_transform(predictions_labeled_roberta)

In [None]:
# Evaluate the performance of the model
accuracy, precision, recall, f1 = get_classification_metrics(y_test, y_pred_roberta)
accuracy, precision, recall, f1

Store the results and the predictions in a dataframe

In [None]:
# Store the results in a one-row dataframe, with one column per metric
columns = ['vectorizer', 'FS', 'accuracy', 'precision', 'recall', 'f1']
df_results_roberta = pd.DataFrame([[vectorizer, FS, accuracy, precision, recall, f1]],
                                  columns=columns, index=['RoBERTa'])

In [None]:
# Create a dataframe with the predictions
predictions_roberta = pd.DataFrame({'Predictions': y_pred_roberta})

#### Save the results

In [None]:
# Save the results and predictions
df_results_roberta.to_csv('./Output/Model performance/results_RoBERTa.csv', index = False, header = True)
predictions_roberta.to_csv('./Output/predictions/RoBERTa.csv', index = True, header = True)