# Outcome Exploration

This notebook checks the evaluation results of the trained models.

In [1]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import pandas as pd

In [2]:
from run import result_to_metrics

In [3]:
BEST_MODEL_DIR = 'outputs_20000_10_comment'  # alternatives: 'outputs', 'outputs_50000_10_comment'

In [4]:
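def get_eval_results(model_dir):
    """Parse the `key = value` lines of eval_results.txt into a dict of numbers."""
    results = {}
    with open(model_dir + '/eval_results.txt', 'r') as f:
        for line in f:
            if '=' not in line:
                continue  # skip blank or malformed lines
            key, value = (part.strip() for part in line.split('=', 1))
            # Counts such as tp/tn/fp/fn parse as int; everything else
            # (including values like 1e-05) falls back to float.
            try:
                results[key] = int(value)
            except ValueError:
                results[key] = float(value)
    return results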

## Compare results of including context or not

Does the prediction improve when we include context?

We can test this by combining `parent_comment` and `comment` into a single text
and using that as input for training and evaluation.

In this experiment we train on 20,000 records for 10 epochs, once with only `comment` and once with `parent_comment` + `comment` as input.
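
As an illustration of how the two input variants could be built, here is a minimal sketch. The helper name `build_input_text` and the single-space separator are assumptions; the actual preprocessing lives in the training script and is not shown in this notebook.

```python
def build_input_text(comment, parent_comment=None, sep=" "):
    """Return the text that is fed to the model.

    Hypothetical helper: with parent_comment=None only the comment is used,
    otherwise the parent comment is prepended as context.
    """
    if parent_comment is None:
        return comment
    return f"{parent_comment}{sep}{comment}"


# Made-up example record, for illustration only.
record = {"parent_comment": "Great weather today.", "comment": "Yeah, right."}

only_comment = build_input_text(record["comment"])
with_context = build_input_text(record["comment"], record["parent_comment"])
print(only_comment)
print(with_context)
```

With a pandas DataFrame `df` that holds both columns, the same combination could be written as `df['parent_comment'] + ' ' + df['comment']`.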

In [5]:
print('20000 record 10 epochs, only comment')
results = get_eval_results('outputs_20000_10_comment')
accuracy, precision, recall, f1 = result_to_metrics(results)
print(f'Accuracy={accuracy:0.4f}; precision={precision:0.4f}; recall={recall:0.4f}; f1={f1:0.4f}')

20000 record 10 epochs, only comment
Accuracy=0.7160; precision=0.7117; recall=0.7301; f1=0.7208


In [6]:
print('20000 record 10 epochs, parent_comment + comment')
results = get_eval_results('outputs_20000_10_parent')
accuracy, precision, recall, f1 = result_to_metrics(results)
print(f'Accuracy={accuracy:0.4f}; precision={precision:0.4f}; recall={recall:0.4f}; f1={f1:0.4f}')

20000 record 10 epochs, parent_comment + comment
Accuracy=0.7140; precision=0.7046; recall=0.7299; f1=0.7170


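**Observation:** training (and evaluation) with only the `comment` leads to a marginally better result (F1 0.7208 vs. 0.7170, accuracy 0.7160 vs. 0.7140), so adding the `parent_comment` as context does not improve the prediction in this experiment.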

## Check training progress

In [7]:
training_progress = pd.read_csv('top_outputs/training_progress_scores.csv')
training_progress

Unnamed: 0,global_step,tp,tn,fp,fn,mcc,train_loss,eval_loss,auroc,auprc
0,5000,3372,3995,975,1658,0.478289,0.635047,0.551908,0.804421,0.807675
1,6250,2931,4341,629,2099,0.476318,0.524576,0.537438,0.810961,0.820866
2,10000,4170,3093,1877,860,0.461585,0.353628,0.561161,0.824938,0.835855
3,12500,3641,3810,1160,1389,0.490837,0.609307,0.574981,0.820319,0.826968
4,15000,3851,3555,1415,1179,0.481571,0.577427,0.548073,0.823761,0.838313
5,18750,3941,3430,1540,1089,0.475838,0.322401,0.57453,0.823807,0.833803
6,20000,3798,3608,1362,1232,0.481263,0.332205,0.579121,0.817095,0.811899
7,25000,3801,3602,1368,1229,0.480681,0.471639,0.592763,0.822765,0.835449
8,25000,3801,3602,1368,1229,0.480681,0.471639,0.592763,0.822765,0.835449
9,30000,3959,3387,1583,1071,0.471335,0.240553,0.673527,0.819191,0.831448
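
One way to read this table is to look for the evaluation step with the lowest `eval_loss` and the step with the highest `mcc`. The following is a minimal sketch using only the standard library, with four columns copied from the output above (the duplicate row for step 25000 is left out). Using these two criteria is an assumption for illustration, not the selection rule of the training script.

```python
import csv
import io

# global_step, mcc, train_loss and eval_loss copied from the output above.
PROGRESS_CSV = """global_step,mcc,train_loss,eval_loss
5000,0.478289,0.635047,0.551908
6250,0.476318,0.524576,0.537438
10000,0.461585,0.353628,0.561161
12500,0.490837,0.609307,0.574981
15000,0.481571,0.577427,0.548073
18750,0.475838,0.322401,0.57453
20000,0.481263,0.332205,0.579121
25000,0.480681,0.471639,0.592763
30000,0.471335,0.240553,0.673527
"""

# Parse every cell as a float so the rows can be compared numerically.
rows = [
    {key: float(value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(PROGRESS_CSV))
]

lowest_eval_loss = min(rows, key=lambda r: r["eval_loss"])
highest_mcc = max(rows, key=lambda r: r["mcc"])

print(f"Lowest eval_loss: step {int(lowest_eval_loss['global_step'])}")
print(f"Highest mcc:      step {int(highest_mcc['global_step'])}")
```

In this run the lowest `eval_loss` is at step 6250 and the highest `mcc` at step 12500. The last row (step 30000) has the lowest `train_loss` and the highest `eval_loss` in the table, which may point to overfitting in the later epochs.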


In [8]:
print('100000 record 9 epochs, comment')
results = get_eval_results('outputs')
accuracy, precision, recall, f1 = result_to_metrics(results)
print(f'Accuracy={accuracy:0.4f}; precision={precision:0.4f}; recall={recall:0.4f}; f1={f1:0.4f}')

100000 record 9 epochs, comment
Accuracy=0.7574; precision=0.7656; recall=0.7432; f1=0.7542


The built-in best-model selection of Simple Transformers seems to be faulty.
We will therefore go through all saved checkpoints, calculate their metrics,
and pick the best model on that basis.

> **The directory of the checkpoint with the best F1 score has to be
pasted into `run.py` to create the predictions for the test set.**
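
The metrics used for this comparison come from `result_to_metrics` in `run.py`, which is not shown in this notebook. Below is a minimal sketch of what such a function could look like, assuming that `eval_results.txt` contains the confusion-matrix counts `tp`, `tn`, `fp`, and `fn` (as in the training-progress table) and that the standard definitions are used. The name `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(results):
    """Compute accuracy, precision, recall and F1 from tp/tn/fp/fn counts."""
    tp, tn, fp, fn = (results[k] for k in ("tp", "tn", "fp", "fn"))
    total = tp + tn + fp + fn
    # Guard every division so an empty or degenerate result does not crash.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Counts taken from the step-20000 row of the training-progress table.
example = {"tp": 3798, "tn": 3608, "fp": 1362, "fn": 1232}
accuracy, precision, recall, f1 = metrics_from_counts(example)
print(f"Accuracy={accuracy:0.4f}; precision={precision:0.4f}; "
      f"recall={recall:0.4f}; f1={f1:0.4f}")
```

F1 is the harmonic mean of precision and recall, so it only rewards checkpoints that do well on both, which is why it is a reasonable single criterion for picking the best checkpoint.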

In [15]:
import os

OUTPUTDIR = 'outputs'
checkpoints = [d for d in os.listdir(OUTPUTDIR) if d.startswith('checkpoint')]

best_dir = ''
max_f1 = 0.0
max_metric = {}

for checkpoint in checkpoints:
    checkpoint_dir = os.path.join(OUTPUTDIR, checkpoint)
    # Ignore checkpoint directories without an eval_results.txt
    if not os.path.exists(os.path.join(checkpoint_dir, 'eval_results.txt')):
        continue
    loc_result = get_eval_results(checkpoint_dir)
    accuracy, precision, recall, f1 = result_to_metrics(loc_result)
    print(f'{checkpoint}: Accuracy={accuracy:0.4f}; precision={precision:0.4f}; recall={recall:0.4f}; f1={f1:0.4f}')
    # Keep track of the checkpoint with the highest F1 score so far
    if f1 > max_f1:
        max_f1 = f1
        best_dir = checkpoint
        max_metric = loc_result
print('\n')
print(f'Best result:\n{best_dir}: f1={max_f1} ({max_metric})')
BEST_MODEL_DIR = os.path.join(OUTPUTDIR, best_dir)

checkpoint-2000: Accuracy=0.6898; precision=0.7173; recall=0.6219; f1=0.6662
checkpoint-550-epoch-2: Accuracy=0.6880; precision=0.7760; recall=0.5247; f1=0.6261
checkpoint-4125-epoch-15: Accuracy=0.6958; precision=0.7151; recall=0.6464; f1=0.6790
checkpoint-3300-epoch-12: Accuracy=0.6934; precision=0.6926; recall=0.6906; f1=0.6916
checkpoint-3850-epoch-14: Accuracy=0.7000; precision=0.7120; recall=0.6673; f1=0.6889
checkpoint-825-epoch-3: Accuracy=0.6952; precision=0.6805; recall=0.7308; f1=0.7048
checkpoint-1925-epoch-7: Accuracy=0.6946; precision=0.7310; recall=0.6115; f1=0.6659
checkpoint-2750-epoch-10: Accuracy=0.6966; precision=0.7143; recall=0.6509; f1=0.6811
checkpoint-3025-epoch-11: Accuracy=0.6930; precision=0.7265; recall=0.6147; f1=0.6659
checkpoint-1375-epoch-5: Accuracy=0.6984; precision=0.7023; recall=0.6842; f1=0.6931
checkpoint-1650-epoch-6: Accuracy=0.6850; precision=0.6577; recall=0.7658; f1=0.7076
checkpoint-1100-epoch-4: Accuracy=0.7066; precision=0.7064; recall=0.7

## Explore how our model does with subsets of the data

In [None]:
model_args = ClassificationArgs()
# Load the best checkpoint found above; set use_cuda=False when no GPU is available.
model = ClassificationModel(
    "roberta",
    BEST_MODEL_DIR,
    args=model_args,
    use_cuda=True
)

In [None]:
model.predict([
    "yeah, and I have a bridge to sell you!",
    "The bird flew over the cuckoo's nest"
])
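
`model.predict` returns both the predicted labels and the raw model outputs (logits). To get a rough confidence for each prediction, the logits can be passed through a softmax. The following is a minimal sketch in plain Python. The logit values are made up for illustration and are not output of the model above.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    highest = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs for two texts and two classes (made-up numbers).
raw_outputs = [[-1.2, 1.5], [0.8, -0.3]]

for logits in raw_outputs:
    probs = softmax(logits)
    label = probs.index(max(probs))  # class with the highest probability
    print(f"label={label}; confidence={probs[label]:0.3f}")
```

With the model loaded above, the real values would come from `predictions, raw_outputs = model.predict([...])`.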