# D. Transformers

# 0. Data loading

In [2]:
# General Packages #
import os
import pandas as pd
import numpy as np
import string
import re
from scipy.stats import randint
import random
from collections import Counter

# Sklearn Packages #
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, StratifiedKFold, cross_val_predict, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, make_scorer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# NLTK Packages #
import nltk
from nltk.corpus import stopwords
from textblob import TextBlob, Word
from nltk.tokenize import word_tokenize

# Import necessary libraries for handling imbalanced data
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Embedding related imports
import sys
import gensim
from gensim.models import Word2Vec, Doc2Vec
from gensim.models.phrases import Phraser, Phrases
from gensim.models import KeyedVectors
import gensim.downloader
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.scripts.glove2word2vec import glove2word2vec

# Transform packages
from simpletransformers.classification import ClassificationModel
import logging




In [3]:
# Turn of warnings, just to avoid pesky messages that might cause confusion here
# Remove when testing your own code #
import warnings
#warnings.filterwarnings("ignore")

In [4]:
# Change to Working Directory with Training Data # 
#os.chdir("/Users/Artur/Desktop/thesis_HIR_versie5/coding")
os.chdir("/Users/juarel/Desktop/studies artur/thesis_HIR/coding")

# Load the preprocessed data #
df_train = pd.read_csv("./data/gold_data/train.csv", header = 0)
df_test = pd.read_csv("./data/gold_data/test.csv", header = 0)

# inspect the data
df_train.head(5)

Unnamed: 0,id,Headline,category,cleaned_headline
0,194578,Head Line: US Patent granted to BASF SE (Delaw...,,head u patent granted se delaware may titled c...
1,564295,Societe Generale Launches a Next-Generation Ca...,,societe generale launch nextgeneration card in...
2,504138,BARCLAYS PLC Form 8.3 - EUTELSAT COMMUNICATION...,,plc form communication
3,91379,ASML: 4Q Earnings Snapshot,,4q earnings snapshot
4,265750,Form 8.3 - AXA INVESTMENT MANAGERS : Booker Gr...,,form investment manager group plc


# 1. Define functions and parameters

Before we continue, we first define some useful functions and parameters that we use throughout this notebook. The first four functions and parameters were also used and defined in the previous notebook.

1. get_classification_metrics: Create a function that return the classification metrics for each model. The precision, recall and f1 score are all determined using the average value of all classes, without adjusting weights to these classes.

2. Define a dataframe to store the results of the different models. Moreover, also define a dictionary that stores the best parameters for each model.

3. Define the number of splits, the stratified cross validator to ensure class frequencies are considered, and the scoring metric based on the average F1 score. We use an F1 score as scoring metric as accuracy is not a good evaluation metric in our case.

4. Define a function that trains the defined model, the input data, the classifier and its parameter grid. Besides, it will also take 4 parameters as input that give more information about the model that is being trained. This is usefull for the storage of the performance of the different algorithms.



In [5]:
# 1. Function that returns classication metrics
def get_classification_metrics(y_true, y_pred):
    
    # Calculate Model Performance Metrics
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='macro')
    recall = recall_score(y_true, y_pred, average='macro')
    f1 = f1_score(y_true, y_pred, average='macro')


    return accuracy, precision, recall, f1


In [6]:
# 2. Create an empty dataframe to store the results of all the models
results_all_df = pd.DataFrame()

# Add columns for the metrics
columns = ['vectorizer', 'FS', 'classifier', 'resampling','accuracy', 'precision', 'recall', 'f1']
for col in columns:
    results_all_df[col] = 0

# create an empty dictionary to store the optimal parameters
best_params_dict = {}

In [7]:
# 3. Define different parameters
# Define the number of folds for cross-validation
n_splits = 5

# Initialize the stratified k-fold object
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42) # ensures class balances are kept

# Define the scoring metric
scoring = make_scorer(f1_score, average= 'macro')

In [8]:
# define the independent and dependent variables
X_train = df_train['cleaned_headline']
X_test = df_test['cleaned_headline']

y_train = df_train['category']
y_test = df_test['category']

# 2. Transformers

https://simpletransformers.ai/docs/classification-specifics/#supported-model-types

## 2.1 Bert

In [11]:
# Define with what vectorizer we build the models with for storage
vectorizer = 'Transformer'

In [12]:
# Define the implementation method of word2vec
FS = 'Bert'

In [13]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertForSequenceClassification
from tabulate import tabulate
from tqdm import trange


By default, ClassificationModel expects the labels to be ints from 0 up to num_labels.

If your dataset contains labels in another format (e.g. string labels like positive, negative), you can provide the list of all labels to the model args. Simple Transformers will handle the label mappings internally. 

In [14]:
import logging
from simpletransformers.classification import ClassificationModel, ClassificationArgs

logging.basicConfig(level=logging.ERROR)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.ERROR)

In [15]:
categories = df_train['category'].unique()

In [16]:
# Optional model configuration
model_args = ClassificationArgs()
model_args.num_train_epochs= 5
model_args.labels_list = [categories]

In [30]:
from sklearn.preprocessing import LabelEncoder

# Encode the labels as integers
label_encoder = LabelEncoder()
df_train["label_encoded"] = label_encoder.fit_transform(df_train["category"])

# Get the number of labels
num_labels = len(label_encoder.classes_)
num_labels

15

In [19]:
model_args = {
    "num_train_epochs": 10,
    "train_batch_size": 16,
    "eval_batch_size": 32,
    "overwrite_output_dir": True,
    "save_model_every_epoch": False,
    "reprocess_input_data": True,
    'do_lower_case':True,
    "num_labels": num_labels
}

In [20]:
# Create a ClassificationModel
model = ClassificationModel(
    "roberta", "roberta-base", args=model_args, use_cuda=False, show_running_loss = True
)


In [32]:
# Create train_df and apply label decoding to obtain string labels
input_transf = df_train[['cleaned_headline', 'label_encoded']]
input_transf.columns = ["text", "labels"]
input_transf['labels'].nunique()

15

In [None]:
# Optional model configuration
model_args = ClassificationArgs(num_train_epochs=1)

# Create a ClassificationModel
model = ClassificationModel(
    'bert',
    'bert-base-cased',
    num_labels=15,
    args=model_args,
    use_cuda = False) 

# Train the model
model.train_model(input_transf)


  0%|          | 0/43246 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/5406 [00:00<?, ?it/s]

In [None]:
# Predict on the test set and decode the integer labels to obtain string labels
_, predictions = model.predict([text for text, _ in TestDataframe])
predicted_labels = label_encoder.inverse_transform(predictions)

In [None]:
# Prepare the test data
test_data = pd.DataFrame({"text": X_test})

# Make predictions on the test data
predictions, _ = model.predict(test_data["text"])

# Decode the integer predictions back to their original string labels
decoded_predictions = label_encoder.inverse_transform(predictions)


In [22]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base")



Downloading (…)lve/main/config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/501M [00:00<?, ?B/s]

In [23]:
# Tokenize your text data
text = "This is an example sentence."
inputs = tokenizer(text, return_tensors="pt")

# Pass the tokenized inputs through the model
outputs = model(**inputs)

# Get the logits (predictions) from the model's output
logits = outputs.logits


In [25]:
inputs

{'input_ids': tensor([[   0,  713,   16,   41, 1246, 3645,    4,    2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [35]:
TrainingDataframe = list(zip( list(X_train), list(y_train)))
TestDataframe = list(zip( list(X_test), list(y_test)))

In [41]:
train_df = pd.DataFrame(TrainingDataframe)
train_df.columns = ["text", "labels"]
#train_df["label"] = train_df["label"].apply(lambda x: list(map(int, x)))

ValueError: invalid literal for int() with base 10: 'N'

#### Like Miric

In [22]:
df_train

Unnamed: 0,id,Headline,category,cleaned_headline
0,194578,Head Line: US Patent granted to BASF SE (Delaw...,,head u patent granted se delaware may titled c...
1,564295,Societe Generale Launches a Next-Generation Ca...,,societe generale launch nextgeneration card in...
2,504138,BARCLAYS PLC Form 8.3 - EUTELSAT COMMUNICATION...,,plc form communication
3,91379,ASML: 4Q Earnings Snapshot,,4q earnings snapshot
4,265750,Form 8.3 - AXA INVESTMENT MANAGERS : Booker Gr...,,form investment manager group plc
...,...,...,...,...
43241,1329576,Tomra Systems ASA: TOM: Purchase of own shares,Financing,system asa tom purchase share
43242,671948,Swiss Federal Institute of Intellectual Proper...,,swiss federal institute intellectual granted p...
43243,1057600,ICON: Pfizer and Roche Join ADDPLAN DF Consort...,Strategic alliance,icon pfizer join addplan df consortiumnew memb...
43244,1036538,Rio Tinto PLC Transaction in Own Shares -3-,,plc transaction share


In [11]:
# Store Data in Lists for Text Classification #
IDs = np.array(df_train.index.values.tolist())
Abstract_Text = df_train['cleaned_headline'].values.tolist()
Classes = df_train['category'].values.tolist()

In [17]:
from tqdm  import tqdm_notebook

In [21]:
CLASSIFIERS = [
               ["BERT", "bert", "bert-base-uncased"]]

# Define arrays in which to store classification outputs # 
RESULTS = []
Classified_Values =[]

# Loop Through Different Classifiers #
for CL in tqdm_notebook(CLASSIFIERS, desc = "Evaluating Classifiers", leave = True):

    # Extract Classifier Names & Model #
    name  = CL[0]
    Model1 = CL[1]
    Model2 = CL[2]
    
    # Define Arrays to store Actual, Predicted and Ids variables (Because we are shuffling them in next step) # 
    y_actual = []
    y_predicted = []
    id_s = []

    # Loop through K Folds and Repeat Cross Validation #
        
    KFoldSplitter = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 1)
        
    for train_i, test_i in tqdm_notebook(KFoldSplitter.split(Abstract_Text, Classes), 
                                            desc = 'Cross-Validating',
                                            leave = False,
                                            total = 5):
      
        # Select Rows in Data Based on Indexes [train_i, test_i]
        Y = np.array(Classes)

        Abstract_Text_Array = np.array(Abstract_Text)

        train_X, test_X = Abstract_Text_Array[train_i], Abstract_Text_Array[test_i]
        train_y, test_y = Y[train_i], Y[test_i]
        Train_IDs, Test_IDs = IDs[train_i], IDs[test_i]

        # Create Training Data in Paired Format (Nessesary for Transformers) # 
        TrainingDataframe = list(zip( list(train_X), list(train_y)))
        TestDataframe = list(zip( list(test_X), list(test_y)))

        train_df = pd.DataFrame(TrainingDataframe)
        train_df.columns = ["text", "labels"]

        # Create a Classification Model
        model = ClassificationModel(Model1, Model2,                                   
                                    args={'num_train_epochs':1,
                                          'overwrite_output_dir': True,
                                          'use_early_stopping':False,
                                          'use_cuda':False,
                                          'train_batch_size':50,
                                          'do_lower_case':True, 
                                          'silent':True,
                                          'no_cache':True, 
                                          'no_save':True}, use_cuda = False
                                    )

    # Train the Model
    model.train_model(train_df)

    # Predict on Holdout Sample #
    predictions, raw_outputs = model.predict(list(test_X) )

    # Store Output #
    id_s = id_s + list(Test_IDs)
    y_actual = y_actual + list(test_y)
    y_predicted = y_predicted + list(predictions)

    gc.collect()
    torch.cuda.empty_cache() 

    # ---------------------------------------------------------- #
    # This runs only after all of the folds have been classified # 
    # ---------------------------------------------------------- #

    # Calculate the classification metrics
    accuracy, precision, recall, f1 = get_classification_metrics(y_test, y_predicted)
    
    # print the results
    print(f'Results for {name}:')
    print(f'Accuracy: {accuracy}')
    print(f'Precision: {precision}')
    print(f'Recall: {recall}')
    print(f'F1: {f1}')
    
    # add the results to the dataframe with all the results
    results_all_df.loc[name] = [vectorizer, FS, classifier, resampling, accuracy, precision, recall, f1]
    

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for CL in tqdm_notebook(CLASSIFIERS, desc = "Evaluating Classifiers", leave = True):


Evaluating Classifiers:   0%|          | 0/1 [00:00<?, ?it/s]

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for train_i, test_i in tqdm_notebook(KFoldSplitter.split(Abstract_Text, Classes),


Cross-Validating:   0%|          | 0/5 [00:00<?, ?it/s]

Downloading pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

ValueError: too many dimensions 'str'

## 4. Write away results

In [None]:
# write away results
results_all_df.to_csv('./Output/Model performance/results_transformers.csv', index = False, header = True)

In [None]:
# Write the dictionary with the best parameters away
with open('./Output/parameters/embeddings.json', 'w') as file:
    json.dump(best_params_dict, file)