# New analysis for paper with Ernesto and Susan

<br><br>

## **Import necessary Python libraries and modules**

In [None]:
!pip3 install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import gzip
import json
import pickle
import random
import sys
import csv
import numpy as np
import pandas as pd
import seaborn as sns
from glob import glob
from matplotlib import ticker
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
import torch
from transformers import Trainer, TrainingArguments
from sklearn.metrics import f1_score

from collections import defaultdict

sns.set(style='ticks', font_scale=1.2)
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.utils import compute_sample_weight

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!ls /content/drive/MyDrive/train_test_splits_for_BERT

germany_test.csv   italy_test.csv   netherlands_test.csv   poland_test.csv
germany_train.csv  italy_train.csv  netherlands_train.csv  poland_train.csv


<br><br>

## **Read in data, and split into Train-Val-Test samples**




In [None]:
trainfiles = glob("/content/drive/MyDrive/train_test_splits_for_BERT/*train.csv")
testfiles = glob("/content/drive/MyDrive/train_test_splits_for_BERT/*test.csv")

In [None]:
def read_files(filenames):
  df = pd.DataFrame()
  for fn in filenames:
    _df = pd.read_csv(fn, encoding='iso-8859-1')
    _df['is_sports'] = _df['TOPIC']==4
    _df = _df[['title_blurb', 'is_sports']]
    df = pd.concat([df, _df])
  df = df.sample(frac=1, random_state=1983)  # we need to shuffle b/c otherwise it's sorted by language
  return list(df['title_blurb']), list(df['is_sports'])

X_train_val, y_train_val  = read_files(trainfiles)
X_test, y_test  = read_files(testfiles)

# Split your training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)

print(f"We have {len(X_train)} train examples, {len(X_val)} validation examples, and {len(X_test)} test examples.")

We have 6388 train examples, 2130 validation examples, and 2843 test examples.


Here's an example of a training text and training label:

In [None]:
X_train[0], y_train[0]

('Gersdorf: Obrona sadów budzi podziw demokratycznego swiata """Pragne podziekowac wszystkim, którym nie jest obojetny stan polskiego sadownictwa. (...) Dziekuje Wam wszystkim, Wasza postawa jest wazna i budzi szacunek nie tylko mój i sedziów, ale calego demokratycznego swiata"" - napisala w liscie zamieszczonym na stronach SN I Prezes Malgorzat..."',
 False)

In [None]:
sum(y_val)

43

In [None]:
model_name = 'bert-base-multilingual-cased'

# We'll run our code on an NVIDIA GPU through CUDA.
device_name = 'cuda'

# We set the maximum number of tokens in each document to be 512, which is the maximum length for BERT models.
max_length = 512

# We define the directory where we'll save our trained model. You can choose any name for the directory.
save_directory = '/content/drive/MyDrive/my_trained_model'

<br><br>

## **Implementing a Baseline Model using Logistic Regression**

In this step, we train and evaluate a basic TF-IDF baseline model with logistic regression. This gives us a reference point: BERT is only worth the extra cost if it outperforms this simple baseline.

In [None]:
vectorizer = TfidfVectorizer()
Xtrain = vectorizer.fit_transform(X_train)
Xtest = vectorizer.transform(X_test)

We train a logistic regression model from scikit-learn on the newspaper training data, and then we use the trained model to make predictions on our test set.

In [None]:
model = LogisticRegression(max_iter=1000).fit(Xtrain, y_train)
predictions = model.predict(Xtest)

We can use scikit-learn's `classification_report` function to assess how well the predictions of the logistic regression model match the actual labels.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

       False       0.98      1.00      0.99      2783
        True       0.00      0.00      0.00        60

    accuracy                           0.98      2843
   macro avg       0.49      0.50      0.49      2843
weighted avg       0.96      0.98      0.97      2843



  _warn_prf(average, modifier, msg_start, len(result))


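What do you think of this model? The 98% accuracy looks impressive, but it is misleading: the model never predicts the `True` (sports) class, so precision, recall, and F1 for that class are all 0.00. Only 60 of the 2,843 test articles are about sports, so always predicting `False` already yields about 98% accuracy. Let's see whether we can improve on this using BERT.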
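The support counts in the classification report above are enough to check how far a trivial majority-class predictor would get. The following is a small sketch of that arithmetic; the variable names are ours and are not used elsewhere in the notebook.

```python
# Counts taken from the `support` column of the classification report above.
n_negative = 2783  # test articles labelled False (not sports)
n_positive = 60    # test articles labelled True (sports)

# A "classifier" that always predicts False gets every negative right
# and every positive wrong.
true_negatives = n_negative
false_negatives = n_positive
true_positives = 0

accuracy = (true_positives + true_negatives) / (n_negative + n_positive)

# Recall for the sports class: the share of sports articles that were found.
recall_positive = true_positives / (true_positives + false_negatives)

print(f"accuracy of always predicting False: {accuracy:.3f}")
print(f"recall for the sports class:         {recall_positive:.3f}")
```

This is why we look at the F1 score of the `True` class, and not at accuracy, when comparing models on this dataset.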

## **Encode data for BERT**

To prepare our data for use with BERT, we need to encode the texts and labels in a way that the model can understand. Here are the steps we'll follow:

1. Convert the labels (here the booleans `True` and `False`) to integers.

2. Tokenize the texts, which involves breaking them up into individual words and then converting the words into "word pieces" that can be matched with their corresponding embedding vectors.

3. Truncate texts that are longer than 512 tokens, or pad texts that are shorter than 512 tokens with a special padding token.

4. Add special tokens to the beginning and end of each document, including a start token, a separator between sentences, and a padding token as necessary.


We will use the `AutoTokenizer.from_pretrained()` method from the HuggingFace `transformers` library to load a tokenizer for our texts. The tokenizer handles all of the encoding for us, including breaking word tokens into word pieces, truncating to 512 tokens, and adding padding and special BERT tokens.
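To make the steps listed above concrete, here is a toy sketch in plain Python. It is not the HuggingFace implementation: the vocabulary `TOY_VOCAB` and the helpers `wordpiece` and `encode` are hypothetical, the text is split on whitespace only, and the real tokenizer additionally maps each token to an integer id. It only illustrates greedy longest-match word-piece splitting, special tokens, truncation, and padding.

```python
# Hypothetical toy vocabulary; the real multilingual BERT vocabulary is far larger.
TOY_VOCAB = {"[PAD]", "[UNK]", "[CLS]", "[SEP]",
             "spin", "##ner", "fi", "##dge", "##t", "de", "."}


def wordpiece(word, vocab):
    """Split one word into word pieces by greedy longest-match-first search."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab, max_length):
    """Tokenize, add special tokens, then truncate or pad to max_length."""
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    tokens = tokens[: max_length - 2]           # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]     # start and separator tokens
    tokens += ["[PAD]"] * (max_length - len(tokens))
    return tokens


print(encode("de fidget spinner .", TOY_VOCAB, max_length=10))
# ['[CLS]', 'de', 'fi', '##dge', '##t', 'spin', '##ner', '.', '[SEP]', '[PAD]']
```

The `##` prefix marks a piece that continues the previous one, which is the same convention you will see in the encoded articles printed later in this notebook.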

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)


In this section, we generate a mapping from our labels to integer ids. We begin by extracting the unique labels from our dataset and then create a dictionary that associates each label with an integer.

In [None]:
unique_labels = sorted(set(y_train))  # sorted so that False -> 0 and True -> 1 on every run
label2id = {label: id for id, label in enumerate(unique_labels)}
id2label = {id: label for label, id in label2id.items()}

In [None]:
label2id.keys()

dict_keys([False, True])

In [None]:
id2label.keys()

dict_keys([0, 1])

Now let's encode our texts and labels!

In [None]:
train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=max_length)
val_encodings = tokenizer(X_val, truncation=True, padding=True, max_length=max_length)
test_encodings  = tokenizer(X_test, truncation=True, padding=True, max_length=max_length)

train_labels_encoded = [label2id[y] for y in y_train]
val_labels_encoded = [label2id[y] for y in y_val]
test_labels_encoded  = [label2id[y] for y in y_test]

**Examine a news article in the training set after encoding**

In [None]:
' '.join(train_encodings[0].tokens[0:100])

'[CLS] Gers ##dorf : Ob ##rona sad ##ów bu ##dzi pod ##zi ##w demo ##krat ##ycznego s ##wia ##ta " " " Prag ##ne pod ##ziek ##owa ##c wszystkim , którym nie jest ob ##oje ##tny stan polskiego sad ##own ##ictwa . ( . . . ) D ##ziek ##uje W ##am wszystkim , Was ##za posta ##wa jest wa ##zna i bu ##dzi sz ##ac ##unek nie tylko mó ##j i sed ##zió ##w , ale cal ##ego demo ##krat ##ycznego s ##wia ##ta " " - nap ##isal ##a w li ##sci ##e za ##mies ##zczony ##m na'

**Examine a news article in test set after encoding**

In [None]:
' '.join(test_encodings[0].tokens[0:100])

"[CLS] W stanie woj ##enn ##ym za ##bron ##ili mu gra ##c , ter ##az do ##pad ##la go ' dobra zmian ##a ' . W ##y ##bit ##ny aktor z ##wo ##lni ##ony D ##zis dyrektor Polskiego Ce ##zar ##y Mora ##wski z ##wo ##lni ##l kolejne ##go cz ##lon ##ka zes ##pol ##u . Pa ##dlo na Andrzeja Wi ##lka , w ##y ##bit ##nego aktor ##a , w stanie woj ##enn ##ym op ##oz ##y ##c ##joni ##ste po ##z ##ba ##wione ##go prawa do w ##yk ##ony ##wania za ##wodu , który we w ##roc ##law"

**Examine the training labels after encoding**

In [None]:
set(train_labels_encoded)

{0, 1}

**Examine the test labels after encoding**

In [None]:
set(test_labels_encoded)

{0, 1}

<br><br>

## **Create a custom Torch dataset by following these steps:**

Here we combine the encoded texts and labels into dataset objects. We use the custom Torch `MyDataset` class to make a `train_dataset` object from `train_encodings` and `train_labels_encoded`. In the same way, we make a `val_dataset` object from `val_encodings` and `val_labels_encoded`, and a `test_dataset` object from `test_encodings` and `test_labels_encoded`.


In [None]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [None]:
train_dataset = MyDataset(train_encodings, train_labels_encoded)
val_dataset = MyDataset(val_encodings, val_labels_encoded)
test_dataset = MyDataset(test_encodings, test_labels_encoded)

**Examine a news article in the Torch `train_dataset` after encoding**

In [None]:
' '.join(train_dataset.encodings[0].tokens[0:100])

'[CLS] Gers ##dorf : Ob ##rona sad ##ów bu ##dzi pod ##zi ##w demo ##krat ##ycznego s ##wia ##ta " " " Prag ##ne pod ##ziek ##owa ##c wszystkim , którym nie jest ob ##oje ##tny stan polskiego sad ##own ##ictwa . ( . . . ) D ##ziek ##uje W ##am wszystkim , Was ##za posta ##wa jest wa ##zna i bu ##dzi sz ##ac ##unek nie tylko mó ##j i sed ##zió ##w , ale cal ##ego demo ##krat ##ycznego s ##wia ##ta " " - nap ##isal ##a w li ##sci ##e za ##mies ##zczony ##m na'

**Examine a news article in the Torch `test_dataset` after encoding**

In [None]:
' '.join(test_dataset.encodings[1].tokens[0:100])

'[CLS] Nieuwe informa ##teur Za ##lm pak ##t fi ##dge ##t spin ##ner Bu ##ma af - De Sp ##eld Een grote tegen ##slag voor Sy ##brand Bu ##ma . De leider van het CD ##A moet zijn fi ##dge ##t spin ##ner in ##lever ##en . Volgens de nieuwe informa ##teur Gerrit Za ##lm lei ##dt het speelt ##je te veel af van de onder ##handeling ##en . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'

In [None]:
len(id2label)

2

<br><br>

## **Initialize the pre-trained BERT model**

We load the pre-trained multilingual BERT model (`bert-base-multilingual-cased`) and transfer it to CUDA for efficient computation.

**Note**: If you intend to repeat the fine-tuning process after previously executing the subsequent cells, ensure that you re-run this cell to reload the original pre-trained model before commencing the fine-tuning again.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(id2label)).to(device_name)

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model ch

<br><br>

## **Configure the parameters required for fine-tuning BERT**

The following parameters are crucial for fine-tuning BERT and will be specified in the HuggingFace TrainingArguments objects that we will subsequently pass to the HuggingFace Trainer object. While there are numerous other arguments, we'll focus on the fundamental ones and some common pitfalls.

When fine-tuning your own model, it's critical to experiment with these parameters to identify the optimal configuration for your specific dataset.

| Parameter                     | Explanation                                                                                                                          |
|-------------------------------|--------------------------------------------------------------------------------------------------------------------------------------|
| `num_train_epochs`            | The total number of training epochs. This refers to how many times the entire dataset will be processed. Too many epochs can lead to overfitting.|
| `per_device_train_batch_size` | The batch size per device during training.                                                                                           |
| `per_device_eval_batch_size`  | The batch size for evaluation.                                                                                                      |
| `warmup_steps`                | The number of warmup steps for the learning rate scheduler. A smaller value is recommended for small datasets.                         |
| `weight_decay`                | The strength of weight decay, which reduces the size of weights, similar to regularization.                                          |
| `output_dir`                  | The directory where the fine-tuned model and configuration files will be saved.                                                     |
| `logging_dir`                 | The directory where logs will be stored.                                                                                            |
| `logging_steps`               | How often to print logging output. This enables us to terminate training early if the loss is not decreasing.                        |
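| `evaluation_strategy`         | When to run evaluation during training. With `'steps'`, the validation set is evaluated every `eval_steps` steps (which defaults to `logging_steps`), so we can monitor the metrics as training progresses. |
| `learning_rate`               | The initial learning rate for the optimizer.                                                                                         |
| `load_best_model_at_end`      | Whether to reload the best checkpoint, as judged by `metric_for_best_model`, once training finishes.                                 |
| `metric_for_best_model`       | The name of the metric, as returned by `compute_metrics`, that is used to compare checkpoints.                                       |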
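To see how `num_train_epochs`, `per_device_train_batch_size`, and `warmup_steps` interact, the sketch below computes the number of optimizer steps for this notebook's training set and evaluates a linear warmup-then-decay learning rate schedule, which is the `Trainer` default scheduler type. The helper `linear_schedule_with_warmup` is our own illustration, not a library function, and it assumes a single device and no gradient accumulation. The value `warmup_steps=100` is hypothetical and chosen only to make the ramp visible.

```python
import math


def linear_schedule_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Step count for this notebook's data, assuming a single device.
n_train_examples = 6388
batch_size = 8
epochs = 5
steps_per_epoch = math.ceil(n_train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")

# Hypothetical warmup of 100 steps, to show the shape of the schedule.
for step in (0, 50, 100, total_steps // 2, total_steps):
    lr = linear_schedule_with_warmup(step, base_lr=5e-5,
                                     warmup_steps=100, total_steps=total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

With `logging_steps=20`, a run of this length produces a log line roughly every 160 training examples, which is frequent enough to spot a stalled run early.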


<br><br>

## **Fine-tune the BERT model**

First, we define a custom evaluation function that returns the accuracy, the macro F1, and the F1 score of the model. This function can be modified to return other metrics, such as precision or recall.

In [None]:
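def compute_metrics(eval_pred):
    labels = eval_pred.label_ids
    preds = eval_pred.predictions.argmax(-1)  # logits -> predicted class ids
    acc = accuracy_score(labels, preds)
    # F1 for the positive class only (label 1, i.e. sports)
    f1 = f1_score(labels, preds)
    # Macro F1 with 'balanced' sample weights, so both classes carry equal total weight
    macro_f1 = f1_score(labels, preds, average='macro', sample_weight=compute_sample_weight('balanced', labels))
    return {'accuracy': acc, 'macro_f1': macro_f1, 'f1': f1}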

Then we create a HuggingFace `Trainer` object using the `TrainingArguments` object that we create below. We also pass our `compute_metrics` function to the `Trainer` object, along with our training and validation datasets.


## **Optimize your model based on a metric you select**

In [None]:
# macro_f1 would be good if we had multiple categories,
# but we only care about F1 for the True class (sports)
metric_name = 'f1' # you can change this to `accuracy` etc., according to the function `compute_metrics`

In [None]:
# Instantiate an object of the TrainingArguments class with the following parameters:
training_args = TrainingArguments(
    
    # Number of training epochs
    num_train_epochs=5,
    
    # Batch size for training
    per_device_train_batch_size=8,
    
    # Batch size for evaluation
    per_device_eval_batch_size=8,
    
    # Learning rate for optimization
    learning_rate=5e-5,
    
    # Load the best model at the end of training
    load_best_model_at_end=True,
    
    # Metric used for selecting the best model
    metric_for_best_model=metric_name,
    
    # Number of warmup steps for the optimizer
    warmup_steps=0,
    
    # L2 regularization weight decay
    weight_decay=0.01,
    
    # Directory to save the fine-tuned model and configuration files
    output_dir='./results',
    
    # Directory to store logs
    logging_dir='./logs',
    
    # Log results every n steps
    logging_steps=20,
    
    # Strategy for evaluating the model during training
    evaluation_strategy='steps',
)

In [None]:
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset (our validation set)
    compute_metrics=compute_metrics      # our custom evaluation function 
)

Time to finally fine-tune! 

Be patient; if you've set everything in Colab to use GPUs, then it should only take a minute or two to run, but if you're running on CPU, it can take hours.

After every 20 steps (as we specified in the TrainingArguments object), the trainer will output the current state of the model, including the training loss, the validation loss, and the accuracy, macro F1, and F1 from our `compute_metrics` function.

You should see the loss going down and the F1 going up. Because the classes are so imbalanced, accuracy alone is not informative here: a model that always predicts `False` already reaches about 98% accuracy. If the metrics stay the same or oscillate, you probably need to change the fine-tuning parameters.

In [None]:
trainer.train()

Step,Training Loss,Validation Loss,Accuracy,Macro F1,F1
20,0.0776,0.128239,0.979812,0.333333,0.0
40,0.1162,0.123054,0.979812,0.333333,0.0
60,0.1145,0.114277,0.979812,0.333333,0.0
80,0.1635,0.089624,0.979812,0.333333,0.0
100,0.1461,0.108925,0.979812,0.333333,0.0
120,0.1205,0.09901,0.979812,0.333333,0.0
140,0.1023,0.099907,0.979812,0.333333,0.0
160,0.041,0.11015,0.979812,0.333333,0.0
180,0.1762,0.095588,0.979812,0.333333,0.0
200,0.0984,0.113196,0.979812,0.333333,0.0


Step,Training Loss,Validation Loss,Accuracy,Macro F1,F1
20,0.2069,0.106735,0.979812,0.333333,0.0
40,0.1321,0.103974,0.979812,0.333333,0.0
60,0.1737,0.124315,0.979812,0.333333,0.0
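The constant values in the log above (accuracy 0.979812, macro F1 0.333333, F1 0.0) are what `compute_metrics` returns if the model predicts class 0 (`False`) for every validation example. The sketch below reproduces them by hand from the validation set sizes printed earlier. That the model predicts only `False` is our assumption, inferred from the numbers, and the helper `f1` is our own.

```python
# Validation set composition, taken from the notebook output above.
n_total = 2130
n_pos = 43                 # sum(y_val)
n_neg = n_total - n_pos

# Assumption: the model predicts class 0 (False) for every example.
accuracy = n_neg / n_total

# 'balanced' sample weights give each class the same total weight:
# weight = n_total / (n_classes * class_count).
w_neg = n_total / (2 * n_neg)
w_pos = n_total / (2 * n_pos)


def f1(tp, fp, fn):
    """F1 from (possibly weighted) counts; 0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Class 0: all negatives are hits, and all positives are wrongly pulled into class 0.
f1_class0 = f1(tp=n_neg * w_neg, fp=n_pos * w_pos, fn=0)
# Class 1: never predicted, so there are no true positives.
f1_class1 = f1(tp=0, fp=0, fn=n_pos * w_pos)

macro_f1 = (f1_class0 + f1_class1) / 2
print(f"accuracy={accuracy:.6f}  macro_f1={macro_f1:.6f}  f1={f1_class1:.1f}")
# accuracy=0.979812  macro_f1=0.333333  f1=0.0
```

A log like this therefore indicates that the model has not yet learned to detect the sports class at all, even though the accuracy is high.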


<br><br>

## **Save fine-tuned model**

The following cell saves the model and its configuration files to `save_directory`, which we defined above as a folder on the mounted Google Drive, so the model remains available after the Colab session ends.

In [None]:
trainer.save_model(save_directory)

(Optional) If you've already fine-tuned and saved the model, you can reload it using the following lines. You don't have to run fine-tuning every time you want to evaluate.

In [None]:
# model = AutoModelForSequenceClassification.from_pretrained(save_directory).to(device_name)
# trainer = Trainer(model=model, args=training_args, eval_dataset=val_dataset, compute_metrics=compute_metrics)

<br><br>

## **Evaluate fine-tuned model on the validation set**

The following function of the `Trainer` object will run the built-in evaluation, including our `compute_metrics` function.

In [None]:
trainer.evaluate()

<br><br>

## **Evaluate fine-tuned model on the test set**

For a more detailed evaluation of the model, we extract the predicted labels for the test set.

In [None]:
predicted_results = trainer.predict(test_dataset)

In [None]:
predicted_results.predictions.shape

In [None]:
predicted_labels = predicted_results.predictions.argmax(-1) # Get the class with the highest score (logit) for each article
predicted_labels = predicted_labels.flatten().tolist()      # Flatten the predictions into a 1D list
predicted_labels = [id2label[l] for l in predicted_labels]  # Convert from integers back to the original boolean labels
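The values in `predicted_results.predictions` are raw scores (logits), one per class, and not probabilities. The sketch below uses made-up logits to show that taking the argmax of the logits picks the same class as taking the argmax of the softmax probabilities. The names `softmax`, `example_logits`, and `id2label_example` are hypothetical and exist only for this illustration.

```python
import math


def softmax(logits):
    """Turn one row of raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three articles; column 0 is False, column 1 is True.
example_logits = [[2.1, -1.3], [0.2, 0.9], [3.0, -2.5]]
id2label_example = {0: False, 1: True}

example_predictions = []
for row in example_logits:
    probs = softmax(row)
    predicted_id = max(range(len(row)), key=lambda i: row[i])  # same idea as argmax(-1)
    example_predictions.append(id2label_example[predicted_id])
    print(id2label_example[predicted_id], [round(p, 3) for p in probs])
```

If you need probabilities, for example to pick a decision threshold other than 0.5 for the rare sports class, apply a softmax to the logits first.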

In [None]:
len(predicted_labels)

In [None]:
print(classification_report(y_test, 
                            predicted_labels))

<br><br>

## **Extracting Correct and Incorrect Classifications for Analysis**

Now that we have obtained the predicted labels, let's perform some analysis.

The fine-tuning and extraction of predicted labels using BERT is now complete. You can use the predicted labels just like you would with any other classification model. Here are some examples.

To start, let's print out some example predictions that were correct.

In [None]:
for _true_label, _predicted_label, _text in random.sample(list(zip(y_test, predicted_labels, X_test)), 20):
  if _true_label == _predicted_label:
    print('LABEL:', _true_label)
    print('ARTICLE TEXT:', _text[:100], '...')
    print()

Now let's print out some misclassifications.

In [None]:
for _true_label, _predicted_label, _text in random.sample(list(zip(y_test, predicted_labels, X_test)), 80):
  if _true_label != _predicted_label:
    print('TRUE LABEL:', _true_label)
    print('PREDICTED LABEL:', _predicted_label)
    print('ARTICLE TEXT:', _text[:100], '...')
    print()

Finally, let's create some heatmaps to examine misclassification patterns. We can use these patterns to see how often sports articles are mistaken for non-sports articles, and vice versa.

In [None]:
from collections import Counter

# Count the number of classifications for each (true label, predicted label) pair
label_pair_counts = Counter(zip(y_test, predicted_labels))

# Convert the counts to a DataFrame and pivot to wide format
df_wide = pd.DataFrame(label_pair_counts, index=['Number of Classifications']).T.reset_index()
df_wide.columns = ['True Label', 'Predicted Label', 'Number of Classifications']
df_wide = df_wide.pivot_table(index='True Label', columns='Predicted Label', values='Number of Classifications', fill_value=0)

# Plot the results
plt.figure(figsize=(9,7))
sns.set(style='ticks', font_scale=1.2)
sns.heatmap(df_wide, linewidths=1, cmap='Purples')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


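If the model is working well, most of the counts fall on the diagonal, where the predicted label matches the true label. With classes this imbalanced, the `False`/`False` cell dominates in any case, so pay particular attention to the `True` row.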

Now, let's remove the diagonal from the plot to highlight the misclassifications.

In [None]:
misclassification_counts = defaultdict(int)
for _true_label, _predicted_label in zip(y_test, predicted_labels):
  if _true_label != _predicted_label: # Remove the diagonal to highlight misclassifications
    misclassification_counts[(_true_label, _predicted_label)] += 1
  
dicts_to_plot = []
for (_true_label, _predicted_label), _count in misclassification_counts.items():
  dicts_to_plot.append({'True Label': _true_label,
                        'Predicted Label': _predicted_label,
                        'Number of Classifications': _count})
  
df_to_plot = pd.DataFrame(dicts_to_plot)
df_wide = df_to_plot.pivot_table(index='True Label', 
                                 columns='Predicted Label', 
                                 values='Number of Classifications')

plt.figure(figsize=(9,7))
sns.set(style='ticks', font_scale=1.2)
sns.heatmap(df_wide, linewidths=1, cmap='Purples')    
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()