# Model Evaluation

In this final recipe, we run an in-depth evaluation of our best-performing model using several classification metrics and visualizations. As before, the goal is to understand overall model accuracy as well as specific shortcomings, in order to motivate future model (and data processing) iterations.

## Load Prediction Data 

We load the prediction datasets for analysis.

In [0]:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
predictions_learn = dataiku.Dataset("predictions_learn")
predictions_learn_df = predictions_learn.get_dataframe()

predictions_test = dataiku.Dataset("predictions_test")
predictions_test_df = predictions_test.get_dataframe()

In [0]:
predictions_learn_df.head()

In [0]:
TARGET = 'income'

In [0]:
datasets = {
    'train': predictions_learn_df,
    'test': predictions_test_df
}

## Prediction Statistics 

Let's look at high-level statistics of the predictions first.

In [0]:
for name, data in datasets.items():
    print(name, 'summary:')
    print(data.describe())
    
    print('\n\n')

We can already see that while the real labels contain ~8% high income, our predictions are positive only ~4% of the time. However, the predicted probabilities are fairly well calibrated on average, showing the same 0.08 mean. We can experiment with alternatives to the default 0.5 cutoff to find a better precision/recall trade-off.
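
As a minimal sketch of how alternative cutoffs could be compared, the snippet below computes the share of positive predictions, precision and recall at each cutoff. It assumes the `pred_proba` and `income` columns seen above; the helper name `summarize_cutoffs` and the toy frame are hypothetical, and with the real data `datasets['test']` would be passed instead.

```python
import pandas as pd

def summarize_cutoffs(df, cutoffs, proba_col="pred_proba", target_col="income"):
    """Return positive prediction rate, precision and recall at each cutoff."""
    rows = []
    real_pos = int((df[target_col] == 1).sum())
    for cutoff in cutoffs:
        # Hard prediction: 1 when the probability reaches the cutoff
        pred = (df[proba_col] >= cutoff).astype(int)
        true_pos = int(((pred == 1) & (df[target_col] == 1)).sum())
        pred_pos = int(pred.sum())
        rows.append({
            "cutoff": cutoff,
            "predicted_positive_rate": pred.mean(),
            "precision": true_pos / pred_pos if pred_pos else float("nan"),
            "recall": true_pos / real_pos if real_pos else float("nan"),
        })
    return pd.DataFrame(rows).set_index("cutoff")

# Toy data for illustration only (not the real predictions)
toy = pd.DataFrame({
    "income": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "pred_proba": [0.02, 0.05, 0.10, 0.20, 0.30, 0.45, 0.15, 0.35, 0.60, 0.90],
})
cutoff_summary = summarize_cutoffs(toy, cutoffs=[0.25, 0.5])
print(cutoff_summary)
```

On the toy data, lowering the cutoff from 0.5 to 0.25 raises recall at the cost of precision, which is the trade-off we would look for on the real predictions.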

In [0]:
import plotly.express as px

import plotly.offline as pyo
pyo.init_notebook_mode()

In [0]:
fig = datasets['train'].plot(kind='hist', backend='plotly', x='pred_proba', color='income', log_y=True, opacity=0.7)
fig.show()

We can see the predicted probability distribution for the true 0/1 labels. For class 0, the probabilities are concentrated near 0, as expected. However, a fair number of class 1 samples also receive low probabilities (i.e., instances misclassified by a large margin).

As future work, we can explore the segment of data where these misclassifications occurred to better understand how to address the issue.
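
One possible starting point for that exploration is to isolate the class 1 rows that received a very low probability and compare their features with the full dataset. The sketch below assumes the `income` and `pred_proba` columns; the helper `confident_false_negatives`, the 0.1 limit and the `age` column are hypothetical choices for illustration.

```python
import pandas as pd

def confident_false_negatives(df, max_proba=0.1, proba_col="pred_proba", target_col="income"):
    """Select class 1 rows whose predicted probability is at or below max_proba."""
    mask = (df[target_col] == 1) & (df[proba_col] <= max_proba)
    return df.loc[mask].sort_values(proba_col)

# Toy data for illustration only; 'age' stands in for any feature column
toy = pd.DataFrame({
    "income": [1, 1, 0, 1, 0],
    "pred_proba": [0.03, 0.80, 0.01, 0.08, 0.40],
    "age": [23, 45, 31, 19, 52],
})
missed = confident_false_negatives(toy)
print(missed)

# Compare a feature between the missed positives and the full data
print(missed["age"].mean(), toy["age"].mean())
```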

## Calculate Metrics 

We will look at the standard binary classification metrics using:
- confusion matrices
- classification reports (accuracy, precision, recall, F1 score)
- ROC-AUC curves

In [0]:
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

def get_classification_reports(datasets) -> None:
    for name, data in datasets.items():
        print(name, "\n\n")
        
        print(classification_report(data[TARGET], data['pred']))
        
        print("\n\n")
        
get_classification_reports(datasets)

Observations:
- The train and test precision/recall scores are quite close, which suggests the model is not overfitting.
- The F1 scores for class 1 are around 0.5 on both sets, with .75-.77 precision (share of predicted 1s that are actually 1) and .37 recall (share of real class 1 samples predicted as 1). This is quite common for the minority class of an imbalanced dataset.
- Overall accuracy is 94%; however, this is not a meaningful metric on such imbalanced data, with only 8% of samples in the minority class.
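
The arithmetic behind these observations can be checked with a short sketch. It uses the rounded figures quoted above (0.75 precision, 0.37 recall, 8% minority share), so the results are approximate.

```python
# Rounded values taken from the observations above
precision, recall = 0.75, 0.37

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# A model that always predicts class 0 is correct on every majority sample
minority_share = 0.08
baseline_accuracy = 1 - minority_share

print(round(f1, 3), round(baseline_accuracy, 2))
```

A trivial model that always predicts class 0 would already reach about 92% accuracy, so the reported 94% is only a small gain over that baseline.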

Confusion matrices give another view of the correct and misclassified samples:

In [0]:
import matplotlib.pyplot as plt

ConfusionMatrixDisplay.from_predictions(predictions_learn_df['income'], predictions_learn_df['pred'])
plt.show()

In [0]:
ConfusionMatrixDisplay.from_predictions(predictions_test_df['income'], predictions_test_df['pred'])
plt.show()
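
To connect the matrices to the scores in the classification report, the four cells can be unpacked and the class 1 precision and recall derived from them. This is a small sketch on toy labels, not the real predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted 1s that are really 1
recall = tp / (tp + fn)     # share of real 1s that were predicted 1
print(tn, fp, fn, tp, precision, recall)
```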

### Precision-Recall Curves

As mentioned previously, we can consider adjusting the cutoff behind our hard predictions to achieve a better precision-recall trade-off. We can see this on the precision-recall curve. Note that we used the area under this curve as the loss function to optimize when training our XGBoost models.

In [0]:
from sklearn.metrics import PrecisionRecallDisplay

display = PrecisionRecallDisplay.from_predictions(
    datasets['test'][TARGET],
    datasets['test']['pred_proba'],
    name="XGBoost",
    plot_chance_level=True,
    despine=True
)
plt.axvline(0.37, c="black")  # mark the recall we got from the evaluation above
plt.show()

We could select an alternative threshold:

In [0]:
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(
    datasets['test'][TARGET],
    datasets['test']['pred_proba'],
)

In [0]:
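# precision and recall have one more entry than thresholds, so drop their
# final point to keep the three sampled arrays the same length
pd.DataFrame(
    {
        'precision': precision[:-1][::1500],
        'recall': recall[:-1][::1500],
        'threshold': thresholds[::1500]
    }
).set_index('threshold').T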

We can see that if we sacrifice some precision, recall can be brought up; e.g., with a 0.255 threshold we can achieve an almost even 0.55 precision and 0.57 recall.
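
Rather than reading the threshold off a sampled table, it could also be located programmatically. The sketch below picks the threshold where precision and recall are closest; the helper `balanced_threshold` and the synthetic scores are hypothetical, and "closest precision and recall" is just one possible selection rule.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def balanced_threshold(y_true, y_score):
    """Return (threshold, precision, recall) where precision and recall are closest."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision and recall have one more entry than thresholds; drop the last point
    gap = np.abs(precision[:-1] - recall[:-1])
    best = int(np.argmin(gap))
    return thresholds[best], precision[best], recall[best]

# Synthetic imbalanced data for illustration only (~8% positives)
rng = np.random.default_rng(0)
y_true = (rng.random(2000) < 0.08).astype(int)
y_score = np.clip(0.08 + 0.35 * y_true + rng.normal(0, 0.15, 2000), 0, 1)

threshold, p, r = balanced_threshold(y_true, y_score)
print(threshold, p, r)
```

With the real data, `datasets['test'][TARGET]` and `datasets['test']['pred_proba']` would take the place of the synthetic arrays. In practice the rule should reflect the relative cost of false positives and false negatives.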

### ROC-AUC

Finally, let's look at the ROC-AUC scores and the ROC curves:

In [0]:
from sklearn.metrics import roc_auc_score, roc_curve, RocCurveDisplay

In [0]:
def get_roc_scores(datasets) -> None:
    for name, data in datasets.items():
        print(name)
        print(roc_auc_score(data[TARGET], data['pred_proba']))
        
        print("\n")
        
get_roc_scores(datasets)

In [0]:
RocCurveDisplay.from_predictions(predictions_test_df[TARGET], predictions_test_df["pred_proba"])
RocCurveDisplay.from_predictions(predictions_learn_df[TARGET], predictions_learn_df["pred_proba"])
plt.show()

Ideally, the curve hugs the top left corner, as it does here, but again, this metric is not sensitive enough for imbalanced datasets.
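
One way to see this is to compare ROC-AUC with average precision (a summary of the precision-recall curve) on the same scores. The sketch below uses synthetic data with roughly 8% positives, so the numbers only illustrate the effect and are not our model's results.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic imbalanced data for illustration only (~8% positives)
rng = np.random.default_rng(0)
n = 20000
y_true = (rng.random(n) < 0.08).astype(int)
# Positives score higher on average, with substantial overlap between classes
y_score = rng.normal(loc=1.5 * y_true, scale=1.0)

roc_auc = roc_auc_score(y_true, y_score)
avg_precision = average_precision_score(y_true, y_score)
print(roc_auc, avg_precision)
```

The chance level for ROC-AUC is 0.5 regardless of class balance, while for average precision it is the positive class share (about 0.08 here), which is why the latter tends to be the more demanding score on imbalanced data.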

## Next Steps

To better understand model performance:
- We can look at the misclassified samples and explore how their features differ from the general population.
- We should more carefully consider which metric to use as the training loss and for monitoring during hyperparameter optimization.