# Ensemble Evaluation, Optimal Threshold Selection and Prediction of Test Data
In this notebook we determine the optimal decision threshold for an ensemble and then use the ensemble together with that threshold to predict the test set for the submission.

Please take a look at the README to set up the data and checkpoints.

In [14]:
# Install packages when running on Google Colab
!pip install -q pytorch-lightning==1.6.4 neptune-client transformers sentencepiece



In [None]:
import pandas as pd
import numpy as np

from tqdm.auto import tqdm

import torch
from transformers import AutoTokenizer
import pytorch_lightning as pl

from sklearn.metrics import classification_report

import pickle


RANDOM_SEED = 42
COLAB = True


pl.seed_everything(RANDOM_SEED)

Global seed set to 42


42

If you are running on Google Colab and want to connect to Google Drive, run the following cells.

In [None]:
torch.cuda.is_available()

True

In [7]:
if COLAB:
    import os
    os.getcwd()
    from google.colab import drive
    drive.mount('/content/drive')

ModuleNotFoundError: No module named 'google.colab'

In [None]:
cd ./drive/MyDrive/human_value/human_values_behind_arguments

In [None]:
!git pull

## Setup and Preprocessing
We use PyTorch Lightning for training and therefore import the Lightning data and model modules, as well as other helper functions.

In [3]:
from data_modules.BertDataModule import BertDataset
from models.BertFineTunerPl import BertFineTunerPl
from toolbox.bert_utils import max_for_thres

Here we define the models that we want to ensemble. Download the models used for the submission and place them in the checkpoint folder. You can then specify the paths to them here in order to reproduce the results. (If you want to ensemble different combinations, just select them here. If you have trained your own models, you can place them here too, but you need to ensure that their parameters are loaded (see below).)

In [4]:
PARAMS_ENSEMBLE = {
    "MODEL_CHECKPOINTS": [
                          './checkpoints/HCV-371-danschr-roberta-large-BS_16-EPOCHS_8-LR_5e-05-ACC_GRAD_2-MAX_LENGTH_165-BS_8-LR_2e-05-HL_None-DROPOUT_None-SL_None.ckpt'
                          
                          ],
    "DESCRIPTION":"FULL #3xDebL_F1 3EP 3xdanRobL_F1 3EP 3xDebL_Loss 3EP 3xdanRobL_Loss 3EP",
    "TEST_PATH" : "./data/path_to_your_test_data.csv",
    "LEAVE_OUT_DATA_PATH": "./data/leave_out_dataset_300.csv",
    "MAX_THRESHOLD_METRIC": "custom",
    "ENSEMBLE": "EN",
    "LABEL_COLUMNS":['Self-direction: thought',
                     'Self-direction: action',
                     'Stimulation',
                     'Hedonism',
                     'Achievement',
                     'Power: dominance',
                     'Power: resources',
                     'Face',
                     'Security: personal',
                     'Security: societal',
                     'Tradition',
                     'Conformity: rules',
                     'Conformity: interpersonal',
                     'Humility',
                     'Benevolence: caring',
                     'Benevolence: dependability',
                     'Universalism: concern',
                     'Universalism: nature',
                     'Universalism: tolerance',
                     'Universalism: objectivity']
}

THRESHOLD = 0.26  # We compute it later on, but we set its default value here in case you want to skip that section.

We extract the model identifier, e.g. "HCV-409", from the checkpoint paths. We use it later to pair each checkpoint with its PARAMS (the model parameters used for training).

In [5]:
# Extract each model identifier so we can log it and match it with its parameter file
NAME = ""
ids = []
for checkpoint_path in PARAMS_ENSEMBLE["MODEL_CHECKPOINTS"]:
    # File name without the folder, split into its hyphen-separated parts
    name_parts = checkpoint_path.split("checkpoints/")[1].split("-")
    # The identifier is the first two parts, e.g. "HCV-371"
    model_id = name_parts[0] + "-" + name_parts[1]
    ids.append(model_id)
    NAME = NAME + "_" + model_id
    print(model_id)
# Prefix the ensemble tag and drop the leading underscore
NAME = PARAMS_ENSEMBLE["ENSEMBLE"] + "_" + NAME[1:]

PARAMS_ENSEMBLE["IDS"] = ids
LABEL_COLUMNS = PARAMS_ENSEMBLE["LABEL_COLUMNS"]

HCV-371
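To illustrate the naming convention, here is a minimal sketch of the same extraction wrapped in a function. The function name `extract_model_id` and the two shortened checkpoint paths are hypothetical and only serve as an example; they are not part of the repository.

```python
def extract_model_id(checkpoint_path: str) -> str:
    """Return the run identifier (e.g. 'HCV-371') from a checkpoint path."""
    # Keep only the file name that follows the 'checkpoints/' folder.
    file_name = checkpoint_path.split("checkpoints/")[1]
    # The identifier consists of the first two hyphen-separated tokens.
    prefix, number = file_name.split("-")[:2]
    return f"{prefix}-{number}"


# Hypothetical, shortened checkpoint names used only for illustration.
example_paths = [
    "./checkpoints/HCV-371-some-model.ckpt",
    "./checkpoints/HCV-409-another-model.ckpt",
]
example_ids = [extract_model_id(p) for p in example_paths]
ensemble_name = "EN_" + "_".join(example_ids)

print(example_ids)    # ['HCV-371', 'HCV-409']
print(ensemble_name)  # EN_HCV-371_HCV-409
```

The extraction relies on every checkpoint file name starting with an identifier of the form `PREFIX-NUMBER`; a file that does not follow this pattern would yield a wrong identifier and therefore a wrong parameter file name.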


## The Ensemble List

We create a list of dictionaries, each containing a model's checkpoint path, identifier and parameters.

In [6]:
# Loading the parameters for each model
PARAMS_LIST = []
for id in PARAMS_ENSEMBLE["IDS"]:
    with open(f'./checkpoints/{id}_PARAMS.pkl', 'rb') as f:
        loaded_dict = pickle.load(f)
        PARAMS_LIST.append(loaded_dict)

We group the parameters, identifier and checkpoint path of each model together in a list.

In [7]:
# Concatenating relevant information into one Ensemble_list: Parameters, Id, and Path to Checkpoint.
ENSEMBLE_LIST = []
for param, id, mc in zip(PARAMS_LIST, PARAMS_ENSEMBLE["IDS"], PARAMS_ENSEMBLE["MODEL_CHECKPOINTS"]):
    ENSEMBLE_LIST.append({"PARAMS":param, "ID":id,"MODEL_CHECKPOINT":mc})

# Prediction of Leave-Out-Dataset

We compute the optimal decision threshold on the leave-out dataset. This section can be skipped; the optimal THRESHOLD we found is 0.26.

In [8]:
leave_out_data = pd.read_csv(PARAMS_ENSEMBLE["LEAVE_OUT_DATA_PATH"], index_col=0)

In [9]:
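def predict_unseen_data(trained_model, data, tokenizer, params, label_columns=None, collect_labels=True):
    """Run the model on every row of `data` and return (predictions, labels).

    `labels` stays an empty list when `collect_labels` is False.
    """
    # Use the GPU when available, otherwise fall back to the CPU
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    trained_model = trained_model.to(device)

    test_dataset = BertDataset(
        data=data,
        tokenizer=tokenizer,
        max_token_count=params["MAX_TOKEN_COUNT"],
        label_columns=label_columns,
    )

    predictions = []
    labels = []

    # Inference only: no gradients are needed
    with torch.no_grad():
        for item in tqdm(test_dataset):
            # Add a batch dimension of size 1 before the forward pass
            _, prediction = trained_model(
                item["input_ids"].unsqueeze(dim=0).to(device),
                item["attention_mask"].unsqueeze(dim=0).to(device)
            )
            predictions.append(prediction.flatten())
            if collect_labels:
                labels.append(item["labels"].int())

    # Stack the per-sample tensors into one tensor and move it to the CPU
    predictions = torch.stack(predictions).detach().cpu()
    if collect_labels:
        labels = torch.stack(labels).detach().cpu()

    return predictions, labels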

We iterate over the models in the ensemble list and collect each model's predictions for the leave-out dataset. (The same loop is used further below for the test dataset.)

In [10]:
# Iterate over elements in Ensemble_List and get predictions from each model. Collect them in predictions [] list.
predictions = []
labels = []
for idx, elem in enumerate(ENSEMBLE_LIST):
    print(f"Starting with model {elem['MODEL_CHECKPOINT']}")
    PARAMS = elem["PARAMS"]
    trained_model = BertFineTunerPl.load_from_checkpoint(
        elem["MODEL_CHECKPOINT"],
        params=PARAMS,
        label_columns=LABEL_COLUMNS,
        n_classes=len(LABEL_COLUMNS)
    )
    trained_model.eval()
    trained_model.freeze()
    print(f"With Tokenizer {PARAMS['MODEL_PATH']}")
    TOKENIZER = AutoTokenizer.from_pretrained(PARAMS["MODEL_PATH"])
    pred, lab = predict_unseen_data(trained_model=trained_model, data=leave_out_data, tokenizer=TOKENIZER, params=PARAMS, collect_labels=True, label_columns=LABEL_COLUMNS)
    predictions.append(pred)
    labels.append(lab)

Starting with model ./checkpoints/HCV-371-danschr-roberta-large-BS_16-EPOCHS_8-LR_5e-05-ACC_GRAD_2-MAX_LENGTH_165-BS_8-LR_2e-05-HL_None-DROPOUT_None-SL_None.ckpt


Some weights of the model checkpoint at danschr/roberta-large-BS_16-EPOCHS_8-LR_5e-05-ACC_GRAD_2-MAX_LENGTH_165 were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at danschr/roberta-large-BS_16-EPOCHS_8-LR_5e-05-ACC_GRAD_2-MAX_LENGTH_165 and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN

With Tokenizer danschr/roberta-large-BS_16-EPOCHS_8-LR_5e-05-ACC_GRAD_2-MAX_LENGTH_165


 49%|████▉     | 147/300 [00:24<00:25,  5.90it/s]


KeyboardInterrupt: 

## Average Predictions

Average the predictions of all models to combine them into a single ensemble prediction.

In [19]:
labels_val = labels[0]

y_pred_val = torch.stack(predictions).numpy()
y_true_val = labels_val.numpy()

y_pred_val_avg = np.mean(y_pred_val, axis=0)

y_pred_val_avg_tensor = torch.tensor(y_pred_val_avg)
y_true_val_tensor =torch.tensor(y_true_val)
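As a small illustration of the averaging step, the sketch below stacks the outputs of two hypothetical models and averages them over the model axis. The probability values are made up; only the shapes and the use of `axis=0` mirror the cell above.

```python
import numpy as np

# Two hypothetical models, three samples, two labels (made-up probabilities).
model_a = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.2]])
model_b = np.array([[0.7, 0.3], [0.2, 0.8], [0.0, 0.4]])

stacked = np.stack([model_a, model_b])  # shape: (n_models, n_samples, n_labels)
averaged = stacked.mean(axis=0)         # shape: (n_samples, n_labels)

print(stacked.shape)   # (2, 3, 2)
print(averaged.shape)  # (3, 2)
```

Averaging over `axis=0` removes the model dimension, so every sample keeps one averaged score per label regardless of how many models are in the ensemble.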


## Determine Threshold

Determine the optimal threshold on the leave-out dataset. This threshold is used later for the final prediction on the test data.

In [20]:
THRESHOLD = max_for_thres(y_pred=y_pred_val_avg_tensor, y_true=y_true_val_tensor, label_columns=LABEL_COLUMNS, average=PARAMS_ENSEMBLE["MAX_THRESHOLD_METRIC"])
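The helper `max_for_thres` lives in `toolbox.bert_utils` and is not shown in this notebook. The following is a rough sketch of how such a threshold search could work, under the assumption that it tries a set of candidate thresholds and keeps the one with the best score. The functions `macro_f1` and `sweep_threshold`, the candidate values and the toy data are hypothetical, and macro F1 is only a stand-in that may differ from the "custom" metric configured above.

```python
import numpy as np


def macro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Unweighted mean of the per-label F1 scores."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)
    denom = 2 * tp + fp + fn
    # A label without any true or predicted positives gets an F1 of 0.
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return float(f1.mean())


def sweep_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest macro F1 and its score."""
    best_t, best_score = None, -1.0
    for t in candidates:
        score = macro_f1(y_true, (y_prob > t).astype(int))
        if score > best_score:
            best_t, best_score = float(t), score
    return best_t, best_score


# Toy data: four samples, two labels (made-up values).
y_true = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y_prob = np.array([[0.8, 0.1], [0.3, 0.7], [0.2, 0.4], [0.1, 0.05]])

best_t, best_score = sweep_threshold(y_true, y_prob, candidates=[0.25, 0.5, 0.75])
print(best_t, best_score)  # 0.25 1.0
```

A single global threshold is used for all labels here, matching the scalar `THRESHOLD` in this notebook; a per-label search would run the same sweep column by column.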

# Predicting the Submission File
Now that we have the optimal threshold, we can create the submission file. (Note that this is the same code as in predict.ipynb.) Further below we show how we additionally used stacking.

In [21]:
test_df_input = pd.read_csv('./data/arguments-test.tsv', sep='\t')
test_df_input["text"] = test_df_input["Premise"]+" " + test_df_input["Stance"]+ " " + test_df_input["Conclusion"]
test_df_input.head()

| | Argument ID | Conclusion | Stance | Premise | text |
|---|---|---|---|---|---|
| 0 | A26004 | We should end affirmative action | against | affirmative action helps with employment equity. | affirmative action helps with employment equit... |
| 1 | A26010 | We should end affirmative action | in favor of | affirmative action can be considered discrimin... | affirmative action can be considered discrimin... |
| 2 | A26016 | We should ban naturopathy | in favor of | naturopathy is very dangerous for the most vul... | naturopathy is very dangerous for the most vul... |
| 3 | A26024 | We should prohibit women in combat | in favor of | women shouldn't be in combat because they aren... | women shouldn't be in combat because they aren... |
| 4 | A26026 | We should ban naturopathy | in favor of | once eradicated illnesses are returning due to... | once eradicated illnesses are returning due to... |


Generate predictions for all models.

In [22]:
predictions_test = []
for idx, elem in enumerate(ENSEMBLE_LIST):
    print(f"Starting with model {elem['MODEL_CHECKPOINT']}")
    PARAMS = elem["PARAMS"]
    trained_model = BertFineTunerPl.load_from_checkpoint(
        elem["MODEL_CHECKPOINT"],
        params=PARAMS,
        label_columns=LABEL_COLUMNS,
        n_classes=len(LABEL_COLUMNS)
    )
    trained_model.eval()
    trained_model.freeze()
    print(f"With Tokenizer {PARAMS['MODEL_PATH']}")
    TOKENIZER = AutoTokenizer.from_pretrained(PARAMS["MODEL_PATH"])

    pred, lab = predict_unseen_data(trained_model=trained_model, data=test_df_input,tokenizer=TOKENIZER, params=PARAMS, collect_labels=False)
    predictions_test.append(pred)

Starting with model ./checkpoints/HCV-371-danschr-roberta-large-BS_16-EPOCHS_8-LR_5e-05-ACC_GRAD_2-MAX_LENGTH_165-BS_8-LR_2e-05-HL_None-DROPOUT_None-SL_None.ckpt


Some weights of the model checkpoint at danschr/roberta-large-BS_16-EPOCHS_8-LR_5e-05-ACC_GRAD_2-MAX_LENGTH_165 were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at danschr/roberta-large-BS_16-EPOCHS_8-LR_5e-05-ACC_GRAD_2-MAX_LENGTH_165 and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN

With Tokenizer danschr/roberta-large-BS_16-EPOCHS_8-LR_5e-05-ACC_GRAD_2-MAX_LENGTH_165


  0%|          | 0/20 [00:00<?, ?it/s]

In [23]:
predictions_test_stacked = torch.stack(predictions_test).numpy()
predictions_avg = np.mean(predictions_test_stacked, axis=0)

Binarize the predictions with the previously computed threshold to derive the final labels.

In [24]:
upper, lower = 1, 0
y_pred = np.where(predictions_avg > THRESHOLD, upper, lower)
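To make the binarization concrete, the following sketch applies the default threshold of 0.26 to a few made-up averaged scores. Note that the comparison is strict, so a score exactly equal to the threshold is mapped to 0.

```python
import numpy as np

threshold = 0.26  # default value quoted in this notebook
avg_probs = np.array([[0.26, 0.27], [0.10, 0.90]])  # made-up averaged scores

binary_labels = np.where(avg_probs > threshold, 1, 0)
print(binary_labels)
# [[0 1]
#  [0 1]]
```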

## Create Submission File

In [None]:
prediction_dictionary = {}
prediction_dictionary["Argument ID"] = test_df_input["Argument ID"]
for idx, l_name in enumerate(LABEL_COLUMNS):
    prediction_dictionary[l_name]=y_pred[:,idx]

test_prediction_df = pd.DataFrame(prediction_dictionary)
test_prediction_df.head()

In [None]:
test_prediction_df.to_csv("submissions/submission_test.tsv", sep="\t", index=False)

This is how we created the final submission for the best-performing system.
In the following section we apply stacking to create the variations of the system that were also submitted to the competition.

# Stacking (optional)
In the following we train logistic regressions to determine the decision threshold for each label. We train them on the training dataset: we first get the models' predictions for the training dataset and then train the regressions so that they learn to predict the labels from these predictions.


In [None]:
train_df = pd.read_csv("./data/data_training_full.csv")

In [None]:
predictions = []
labels = []
for idx, elem in enumerate(ENSEMBLE_LIST):
    print(f"Starting with model {elem['MODEL_CHECKPOINT']}")
    PARAMS = elem["PARAMS"]
    trained_model = BertFineTunerPl.load_from_checkpoint(
        elem["MODEL_CHECKPOINT"],
        params=PARAMS,
        label_columns=LABEL_COLUMNS,
        n_classes=len(LABEL_COLUMNS)
    )
    trained_model.eval()
    trained_model.freeze()
    print(f"With Tokenizer {PARAMS['MODEL_PATH']}")
    TOKENIZER = AutoTokenizer.from_pretrained(PARAMS["MODEL_PATH"])

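    # The tokenizer, parameters and label columns are required to build the dataset and collect the labels
    pred, lab = predict_unseen_data(trained_model=trained_model, data=train_df, tokenizer=TOKENIZER, params=PARAMS, collect_labels=True, label_columns=LABEL_COLUMNS)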
    predictions.append(pred)
    labels.append(lab)

## Train Logistic Regression
We structure our input-data and then train the logistic regressions.


For each sample in the data we concatenate the predictions of each model column-wise, so the result has the shape [len(data), 20*num_models_in_ensemble].

In [None]:
labels_val = labels[0]
# Concatenate the per-model predictions column-wise: [len(data), 20 * num_models]
predictions_val = torch.cat(predictions, dim=1)
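The resulting feature layout can be checked with a short sketch. It uses NumPy and random values purely to demonstrate the shapes; `np.concatenate(..., axis=1)` plays the same role here as `torch.cat(..., dim=1)` does for tensors, and the sample and model counts are arbitrary.

```python
import numpy as np

n_samples, n_labels, n_models = 4, 20, 3
rng = np.random.default_rng(0)

# Hypothetical per-model score matrices (random values, for shape checking only).
per_model = [rng.random((n_samples, n_labels)) for _ in range(n_models)]

# Column-wise concatenation: each sample gets n_labels * n_models features.
features = np.concatenate(per_model, axis=1)
print(features.shape)  # (4, 60)
```

The first 20 columns hold the scores of the first model, the next 20 those of the second model, and so on, so the column order must be identical for the training and the test features.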

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn import svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

logReg = MultiOutputClassifier(LogisticRegression(random_state=0, max_iter=200))
# logReg=MultiOutputClassifier(MultinomialNB(alpha=0.1))
# logReg=MultiOutputClassifier(DecisionTreeClassifier(min_samples_leaf=3))

In [None]:
logReg.fit(predictions_val.numpy(), labels_val.numpy())
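The following self-contained sketch shows the stacking idea on synthetic data: one logistic regression per label is fitted on the concatenated scores of several models. All data here is randomly generated for illustration, and the sizes (200 samples, 3 labels, 2 models) are arbitrary assumptions rather than values from our experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
n_samples, n_labels, n_models = 200, 3, 2

# Synthetic ground truth and noisy "model scores" derived from it.
y = rng.integers(0, 2, size=(n_samples, n_labels))
X = np.concatenate(
    [np.clip(y * 0.6 + 0.2 + rng.normal(0, 0.1, y.shape), 0, 1) for _ in range(n_models)],
    axis=1,
)  # shape: (n_samples, n_labels * n_models)

# MultiOutputClassifier fits one independent logistic regression per label column.
stacker = MultiOutputClassifier(LogisticRegression(random_state=0, max_iter=200))
stacker.fit(X, y)

y_hat = stacker.predict(X)
print(y_hat.shape)  # (200, 3)
```

Because every per-label regression sees the scores for all labels from all models, it can in principle also exploit correlations between labels, which a single global threshold cannot.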

Get the unstacked predictions for the test-file from above and concatenate the predictions from each model columnwise.

In [None]:
# Concatenate the per-model test predictions column-wise, as for the training data
predictions_transformed = torch.cat(predictions_test, dim=1)

Use the trained logReg Model to predict the labels

In [None]:
# Predict from the column-wise concatenated test predictions
y_pred = logReg.predict(predictions_transformed.numpy())

Create Submission File

In [None]:
prediction_dictionary = {}
prediction_dictionary["Argument ID"] = test_df_input["Argument ID"]
for idx, l_name in enumerate(LABEL_COLUMNS):
    prediction_dictionary[l_name]=y_pred[:,idx]

test_prediction_df = pd.DataFrame(prediction_dictionary)
test_prediction_df.head()

In [None]:
test_prediction_df.to_csv("submissions/test-submission_logReg", sep="\t", index=False)