## **Prerequisites**

A Google Colab runtime with extended RAM is required for the k-fold cross-validation.

In [3]:
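# install required packages (simpletransformers and helpers)
# note: the Counter imported below comes from the standard library (collections), so the PyPI package "Counter" is most likely not needed
!pip install simpletransformers scikit-learn jedi Counter lxml openpyxl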


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting simpletransformers
  Downloading simpletransformers-0.63.11-py3-none-any.whl (250 kB)
Collecting jedi
  Downloading jedi-0.18.2-py2.py3-none-any.whl (1.6 MB)
Collecting Counter
  Downloading Counter-1.0.0.tar.gz (5.2 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers>=4.6.0 (from simpletransformers)
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting datasets (from simpletransformers)
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)

In [4]:
# import needed modules
import random as rn
import numpy as np
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold, KFold
from collections import Counter
import openpyxl
import gc

In [None]:
# load packages to make the simpletransformers progress bar work in VS Code
#from tqdm import tqdm
#from ipywidgets import interact
#import ipywidgets as widgets

In [5]:
# mount GDrive to be able to import data
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
# set global seed for reproducibility of results (Python's random module and NumPy)
seed = 1337
rn.seed(seed)
np.random.seed(seed)

## **Import training data**

In [7]:
# import training dataset saved in GDrive
pd.set_option('display.max_columns', None)

# optional: use dropna() to remove empty excel rows
df = pd.read_excel("/content/drive/MyDrive/Masterarbeit/BertClassifierOpi/articles_opi_final_edited.xlsx")[["Text", "opinion"]] #.dropna()
print(df.head())

                                                Text  opinion
0  Ein neues Jahr beginnt. Es liegt vor uns wie N...        1
1  Eine große Kraftanstrengung wurde uns versproc...        1
2  Eine Leserin schrieb zu einem Beitrag über die...        1
3  Kein Zweifel: Jeder hat das Recht, seine Anspr...        1
4  Daten sind ein ganz besonderer Stoff. Flüchtig...        1


In [None]:
# construct equally distributed sample
#df = pd.concat([
#    df[df['opinion'] == 0].sample(100),
#    df[df['opinion'] == 1].sample(100)
#])
#df

Unnamed: 0,Text,opinion
37114,Wer spielt wann gegen wen? In welcher Gruppe s...,0
31076,"""In dieser Menge ist es total harmlos, das ist...",0
37482,Alle Menschen ab zwölf Jahren müssen nun bei d...,0
53846,Ein Drittel der Deutschen hat während der Coro...,0
13636,"""Es geht um Großmachtstreben, es ist ein imper...",0
...,...,...
24894,Vor vielen Jahren habe ich einmal ein kleines ...,1
1476,"Es ist gut, dass der Reise-Streit zwischen Kie...",1
175,"Deutschland, deine reichen Erben! Was machen s...",1
25772,"Ausgerechnet Tom Buhrow! Der Mann, der auch al...",1


In [8]:
# get length of imported dataset
len(df)

54158

In [9]:
# check distribution of 1s = opinion piece / 0s = descriptive article 
print(Counter(df['opinion'].values))

Counter({0: 27081, 1: 27077})


## **Data preparation**

In [10]:
# split dataset into 5 stratified folds
kf = StratifiedKFold(n_splits=5, random_state=seed, shuffle=True)

# sanity check: print size and row indices of each train/validation split
for i, (train_index, val_index) in enumerate(kf.split(df, df["opinion"]), start=1):
    train_df = df[["Text", "opinion"]].iloc[train_index]
    val_df = df[["Text", "opinion"]].iloc[val_index]

    print(f"Train {i} {len(train_df)} and Test {i} {len(val_df)}")
    print(f"Train {i} {train_df.index} and Test {i} {val_df.index}")

Train 1 43326 and Test 1 10832
Train 1 Int64Index([    0,     1,     2,     4,     5,     6,     7,     8,     9,
               10,
            ...
            54145, 54146, 54147, 54148, 54151, 54152, 54153, 54155, 54156,
            54157],
           dtype='int64', length=43326) and Test 1 Int64Index([    3,    22,    23,    34,    35,    37,    39,    41,    43,
               47,
            ...
            54106, 54112, 54120, 54135, 54142, 54143, 54144, 54149, 54150,
            54154],
           dtype='int64', length=10832)
Train 2 43326 and Test 2 10832
Train 2 Int64Index([    2,     3,     4,     5,     6,     9,    11,    12,    13,
               14,
            ...
            54145, 54146, 54148, 54149, 54150, 54151, 54153, 54154, 54155,
            54156],
           dtype='int64', length=43326) and Test 2 Int64Index([    0,     1,     7,     8,    10,    26,    30,    48,    58,
               62,
            ...
            54098, 54101, 54108, 54115, 54126, 54127, 5
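
To illustrate what stratification means for the folds above, here is a minimal pure-Python sketch. It is not how `StratifiedKFold` is implemented: the helper `stratified_folds` and the toy labels are hypothetical, and the assignment is a simple round-robin per class without shuffling. It only shows the property that matters here, namely that every validation fold keeps the class ratio of the whole dataset and every row is used for validation exactly once.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits):
    """Assign row positions to folds so that each fold keeps the class ratio.

    Simplified, deterministic round-robin per class (illustration only);
    StratifiedKFold with shuffle=True distributes the rows differently.
    """
    by_class = defaultdict(list)
    for position, label in enumerate(labels):
        by_class[label].append(position)

    folds = [[] for _ in range(n_splits)]
    for positions in by_class.values():
        for k, position in enumerate(positions):
            folds[k % n_splits].append(position)
    return [sorted(fold) for fold in folds]


# toy labels (made up for illustration): 10 descriptive (0) and 10 opinion (1) rows
toy_labels = [0] * 10 + [1] * 10
toy_folds = stratified_folds(toy_labels, n_splits=5)

for number, val_positions in enumerate(toy_folds, start=1):
    class_counts = Counter(toy_labels[p] for p in val_positions)
    print(f"Fold {number}: validation rows {val_positions}, class counts {dict(class_counts)}")
```

With a nearly balanced dataset such as the one used here (27081 vs. 27077 rows), each validation fold should therefore contain close to 50% opinion pieces.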

## **Define settings for the training process/model**


In [11]:
# define hyperparameters for model (https://simpletransformers.ai/docs/usage/)

# example for understanding batch size and epochs:
# Assume you have a dataset with 200 samples (rows of data) and you choose a batch size of 5 and 1,000 epochs.
# This means that the dataset will be divided into 40 batches, each with five samples. The model weights will be updated after each batch of five samples.
# This also means that one epoch will involve 40 batches or 40 updates to the model.
# With 1,000 epochs, the model will be exposed to or pass through the whole dataset 1,000 times. That is a total of 40,000 batches during the entire training process.


train_args ={"reprocess_input_data": True, # True is required for k-fold cross-validation (each fold uses a different training set)!!! If True, the input data will be reprocessed even if a cached file of the input data exists in the cache_dir.
             "overwrite_output_dir": True, # If True, the trained model will be saved to the output_dir and will overwrite existing saved models in the same directory.
             "use_cached_eval_features": False, # False is required for k-fold cross-validation (each fold uses a different evaluation set)!!! If True, the validation set would be tokenized only once and the cached features reused for every evaluation.
             "no_cache": True,
             "output_dir": "outputs", # The directory where all outputs will be stored. This includes model checkpoints and evaluation results.
             "fp16": True, # fp16 = True when a GPU is available, otherwise fp16 = False
             "max_seq_length": 512, # maximum number of tokens that a sequence can contain. Any tokens beyond max_seq_length are truncated (max value: 512)
             "sliding_window": False, # Whether to use the sliding window technique to avoid truncating sequences longer than 512 tokens
             "num_train_epochs": 1, # number of times the learning algorithm works through the entire training dataset. A high number can be used when early stopping is meant to end training; with a single epoch (about 5 evaluations) the patience of 15 is not expected to be reached
             "train_batch_size": 16, # number of samples to work through before updating the internal model parameters (smaller is often considered better / 32 is common in practice / see: https://wandb.ai/ayush-thakur/dl-question-bank/reports/What-s-the-Optimal-Batch-Size-to-Train-a-Neural-Network---VmlldzoyMDkyNDU)
            # use the following if the machine does not have enough GPU RAM for bigger batch sizes:
            # "gradient_accumulation_steps": 2, # e.g. batch size 16 * 2 accumulation steps = effective batch size 32 (batches of 16 are processed, but the model parameters are only updated after every 2 batches)
            # when using gradient accumulation, compute evaluate_during_training_steps from: len(data) / batch size / gradient_accumulation_steps = steps per epoch
             "use_early_stopping": True, # helps prevent the model from overfitting
             "early_stopping_metric": 'eval_loss', # evaluation metric for early stopping (another option is e.g. mcc) -> eval_loss: how well the model generalizes to unseen data
             "early_stopping_delta": 0.01, # an evaluation only counts as an improvement if eval_loss improves by more than 0.01
             "early_stopping_metric_minimize": True, # eval_loss should be minimized (note: if mcc is used, it should be maximized!)
             "evaluate_during_training": True, # evaluation will be performed during training to monitor the training process closely in order to find the best model
             "evaluate_during_training_steps": 541, # Perform evaluation at every specified number of steps. In this case evaluate 5 times per epoch (steps_per_epoch/5)
             "early_stopping_patience": 15, # Terminate training after this many evaluations without an improvement in the evaluation metric greater than early_stopping_delta
             "evaluate_during_training_verbose": True, # Print results from evaluation during training.
             "manual_seed": seed, # for reproducible results
             "use_multiprocessing": False, # !!! False is needed with extended RAM in Google Colab, otherwise the training process will not start
             "use_multiprocessing_for_evaluation": False, # !!! False is needed with extended RAM in Google Colab, otherwise the training process will not start
             "save_steps": -1} # do not save intermediate checkpoints (by default a checkpoint is saved every 2000 steps)
 

In [12]:
train_args
# model.args

{'reprocess_input_data': True,
 'overwrite_output_dir': True,
 'use_cached_eval_features': False,
 'no_cache': True,
 'output_dir': 'outputs',
 'fp16': True,
 'max_seq_length': 512,
 'sliding_window': False,
 'num_train_epochs': 1,
 'train_batch_size': 16,
 'use_early_stopping': True,
 'early_stopping_metric': 'eval_loss',
 'early_stopping_delta': 0.01,
 'early_stopping_metric_minimize': True,
 'evaluate_during_training': True,
 'evaluate_during_training_steps': 541,
 'early_stopping_patience': 15,
 'evaluate_during_training_verbose': True,
 'manual_seed': 1337,
 'use_multiprocessing': False,
 'use_multiprocessing_for_evaluation': False,
 'save_steps': -1}
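
The interplay of `early_stopping_delta` and `early_stopping_patience` can be sketched with a generic patience rule. This is an assumption for illustration and not necessarily the exact logic simpletransformers uses: the function `first_stop` and the loss values are made up. An evaluation only counts as an improvement if the loss falls more than `delta` below the best loss so far, and training stops once `patience` evaluations in a row bring no improvement.

```python
def first_stop(eval_losses, patience, delta):
    """Return the index of the evaluation at which training would stop, or None.

    Generic patience/delta rule (illustration only): an evaluation counts as
    an improvement only if the loss drops by more than `delta` below the best
    loss seen so far.
    """
    best = float("inf")
    stale = 0  # evaluations since the last improvement
    for index, loss in enumerate(eval_losses):
        if best - loss > delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return index
    return None


# made-up losses for five evaluations within a single epoch
toy_losses = [0.50, 0.40, 0.395, 0.397, 0.396]

print(first_stop(toy_losses, patience=2, delta=0.01))   # 3
print(first_stop(toy_losses, patience=15, delta=0.01))  # None
```

Under this rule, a patience of 15 can never be exhausted when only about five evaluations happen per run, so with `num_train_epochs` set to 1 the early-stopping settings mainly matter if the number of epochs is increased.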

In [13]:
# check how many steps per epoch will be conducted using a batch size of 16 and folds with a proportion of 80% training data and 20% validation data
steps_per_epoch = (len(df)*0.8)/float(train_args['train_batch_size'])
steps_per_epoch

2707.9

In [14]:
# evaluate 5 times per epoch
steps_per_epoch/5

541.58
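
The step arithmetic can be checked with a small stand-alone sketch. The helper `batches_per_epoch` is hypothetical; rounding up reflects the assumption that the last, smaller batch is also processed, and the gradient accumulation part assumes one weight update per `accumulation_steps` batches. The fold size of 43326 rows is taken from the split output above.

```python
import math


def batches_per_epoch(n_samples, batch_size):
    """Number of batches needed to pass through all samples once."""
    return math.ceil(n_samples / batch_size)


# the example from the hyperparameter comments: 200 samples, batch size 5, 1,000 epochs
example_batches = batches_per_epoch(200, 5)
example_total = example_batches * 1000
print(example_batches, example_total)  # 40 40000

# one training fold of this notebook: 43326 rows, batch size 16
fold_batches = batches_per_epoch(43326, 16)
print(fold_batches)  # 2708

# evaluating 5 times per epoch means evaluating roughly every fold_batches // 5 steps
eval_every = fold_batches // 5
print(eval_every)  # 541

# assumption: with gradient accumulation, weights are updated once per
# `accumulation_steps` batches, so there are fewer update steps per epoch
accumulation_steps = 2  # hypothetical value from the commented-out option
update_steps = fold_batches // accumulation_steps
print(update_steps)  # 1354
```

The value 2708 agrees with the `Running Epoch 0 of 1: 0/2708` progress bar in the training output, and 541 is the value used for `evaluate_during_training_steps`.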

## **Perform k-fold cross-validation**

In [15]:
# prepare the Excel file to which the results of the k-fold cross-validation are written

# evaluation metrics
eval_metrics = ['auprc', 'auroc', 'eval_loss', 'fn', 'fp', 'mcc', 'tn', 'tp']

# prepare Excel file
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

excel_file = 'k_fold_cross_validation_results_opi.xlsx'
wb = Workbook(write_only=True) # use openpyxl's write-only mode to reduce memory usage in the following loop; otherwise you will run into an out-of-RAM error

# Create worksheet for results
result_sheet = wb.create_sheet(title='Results')
result_sheet.append(['Fold'] + eval_metrics)

In [16]:
# Perform k-fold cross-validation
for fold, (train_idx, val_idx) in enumerate(kf.split(df, df['opinion'])):
    # Create training and validation folds from the dataset
    train_data = df[["Text", "opinion"]].iloc[train_idx]
    val_data = df[["Text", "opinion"]].iloc[val_idx]

    # Load pretrained German BERT model (cased -> distinguishes lowercase and uppercase letters)
    # Models are imported from huggingface (see for a list: https://huggingface.co/transformers/v3.3.1/pretrained_models.html)
    model = ClassificationModel(
        "bert", "bert-base-german-cased",
        num_labels=2,
        args=train_args,
        use_cuda=True
    )

    # Train the model on the training fold; the validation fold is used for evaluation during training
    model.train_model(train_data, eval_df=val_data)

    # free RAM after training has finished
    del model
    del train_data
    gc.collect()

    # load the best model from the best_model folder of the training run that just finished
    model = ClassificationModel(
        "bert", "/content/outputs/best_model",
        num_labels=2,
        args=train_args,
        use_cuda=True
    )

    # Use the best model to perform validation
    results, model_outputs, wrong_predictions = model.eval_model(val_data)

    # Save results in result worksheet
    result_sheet.append([fold + 1] + [results.get(metric, None) for metric in eval_metrics])

    # Create worksheet for misclassified examples (Excel limits sheet titles to 31 characters)
    wrong_pred_sheet = wb.create_sheet(title=f'Fold {fold+1} - misclassified')
    # assumption: pred.label is the label stored on the example (most likely the true class from val_data, not the model's prediction)
    wrong_pred_sheet.append(['Index number in whole dataset', 'Row number in validation dataset', 'Text', 'Label in validation data'])

    # Save misclassified examples to the misclassified worksheet
    for pred in wrong_predictions:
        index_in_data = val_data.iloc[pred.guid].name
        wrong_pred_sheet.append([index_in_data, pred.guid, pred.text_a, pred.label])

    # free RAM again after evaluation
    del results
    del model_outputs
    del wrong_predictions
    del model
    del val_data
    gc.collect()

# save k-fold cross-validation results file
wb.save(excel_file)

Downloading (…)lve/main/config.json:   0%|          | 0.00/433 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/439M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-german-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoi

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/255k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/485k [00:00<?, ?B/s]



Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/2708 [00:00<?, ?it/s]



Running Evaluation:   0%|          | 0/1354 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/2708 [00:00<?, ?it/s]



Running Evaluation:   0%|          | 0/1354 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/2708 [00:00<?, ?it/s]



Running Evaluation:   0%|          | 0/1354 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/2708 [00:00<?, ?it/s]



Running Evaluation:   0%|          | 0/1354 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/2708 [00:00<?, ?it/s]



Running Evaluation:   0%|          | 0/1354 [00:00<?, ?it/s]
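
The results worksheet stores the confusion-matrix counts (`tp`, `tn`, `fp`, `fn`) next to `mcc` for every fold. As a rough sketch of how further metrics could be derived from those counts afterwards, the hypothetical helper below computes accuracy, precision, recall, F1 and the Matthews correlation coefficient with the standard formulas. The counts are made up for illustration and are not results from this notebook.

```python
import math


def metrics_from_counts(tp, tn, fp, fn):
    """Derive common binary classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# made-up counts for illustration only (not results from this notebook)
toy = metrics_from_counts(tp=45, tn=40, fp=10, fn=5)
for name, value in toy.items():
    print(f"{name}: {value:.3f}")
```

Applied to each row of the `Results` sheet, the same function would give per-fold values that can then be averaged across the five folds.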

