# Performance evaluation metrics in Scikit-Learn


In this notebook, we will look at how to evaluate the performance of a model using Scikit-Learn.

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.optimize import brentq
from scipy.interpolate import interp1d
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import fbeta_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix

## Load the MNIST dataset

The MNIST dataset is a large collection of handwritten digits that were size-normalized and centered in 28x28 pixel images. It is widely used for training and testing in the field of machine learning. The original MNIST database contains 60,000 training images and 10,000 test images; `fetch_openml('mnist_784')` returns all 70,000 images as a single set.

To better see the effect of class imbalances, we turn classifying MNIST into a binary classification problem by using digit "5" as the positive label and the remaining digits as the negative label.

In [None]:
db = datasets.fetch_openml('mnist_784', parser='pandas')

x = db.data.values

# Change class labels to 1 (digit 5) and 0 (otherwise)
y = np.where(db.target == '5', 1, 0)

In [None]:
class_ratio = (y == 0).sum() / (y == 1).sum()

print(f'There are {class_ratio:.1f}x more samples with label != 5 than label == 5.')

As can be seen, our binary classification problem is heavily imbalanced. We have about 10 times more negative samples than positive ones.

## Use STRATIFIED random splitting to split the dataset into a training and test set

**Stratified** random sampling ensures that the returned sets have the same proportion of instances for each class as the original dataset.
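
As a minimal sketch of this property, the toy example below splits a 9:1 imbalanced label vector with `stratify` and checks that both subsets keep the 10% positive rate. The arrays `x_toy` and `y_toy` are hypothetical stand-ins, not the MNIST data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 90 negatives and 10 positives (9:1 imbalance)
y_toy = np.array([0] * 90 + [1] * 10)
x_toy = np.arange(100).reshape(-1, 1)

# Stratified 80/20 split
_, _, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Both subsets keep the original share of positives
print('Positive rate (train):', y_tr.mean())
print('Positive rate (test): ', y_te.mean())
```

Without `stratify`, the number of positives in a test set this small could vary noticeably from one random split to the next.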

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=42)

In [None]:
print('Num training samples:', x_train.shape[0])
print('Num testing samples:', x_test.shape[0])

In [None]:
class_ratio = (y_test == 0).sum() / (y_test == 1).sum()
print(f'There are {class_ratio:.1f}x more samples in the test set with label != 5 than label == 5.')

class_ratio = (y_train == 0).sum() / (y_train == 1).sum()
print(f'There are {class_ratio:.1f}x more samples in the training set with label != 5 than label == 5.')

## Train a logistic regression model

In [None]:
# Standardize features by removing the mean and scaling to unit variance.
scaler = preprocessing.StandardScaler()
scaler.fit(x_train)

x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

In [None]:
# Fit a logistic regression model
model = LogisticRegression(max_iter=500)
model.fit(x_train, y_train)

## Performance evaluation with different metrics

In the following, we will demonstrate how to evaluate the performance of the model with different metrics.

In [None]:
pred_label = model.predict(x_test)
pred_proba = model.predict_proba(x_test)[:, 1]

### Accuracy

In [None]:
# TODO: Evaluate the accuracy

Unsurprisingly, the accuracy is relatively high. Yet, it's unclear how much of the high accuracy is caused by the class imbalance.
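
One way to put the accuracy into perspective is to compare it with a trivial baseline. The sketch below uses hypothetical toy labels with a 9:1 imbalance (not the MNIST data) and shows that a classifier that always predicts the negative class already reaches 90% accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical toy labels: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)

# A "classifier" that always predicts the negative class
y_pred_all_negative = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred_all_negative)
print(f'{baseline_accuracy:.2f}')  # 0.90
```

With roughly 10 negatives per positive, as in our data, the same baseline would reach about 91% accuracy without detecting a single "5".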

### Confusion Matrix

Scikit's `confusion_matrix()` computes the confusion matrix for a given set of class predictions and target classes. The function can also handle problems with more than two classes.
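
The sketch below, on hypothetical toy labels, illustrates the layout for the binary case: rows correspond to the true classes and columns to the predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# Rows: true class, columns: predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[4 1]
#  [1 2]]
```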

In [None]:
# TODO: Obtain the values of the confusion matrix

In order to nicely visualize the confusion matrix, we can use **Seaborn**. Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

It's a really nice library and definitely worth looking at!

In [None]:
# TODO: Use Seaborn to plot the confusion matrix

Alternatively, we can use Scikit's built-in functionality for plotting confusion matrices.

In [None]:
# TODO: Use ConfusionMatrixDisplay from Scikit to plot the confusion matrix

### Precision and Recall

In [None]:
# TODO: Calculate precision and recall

In contrast to the model's accuracy, precision and recall are relatively low. Among the samples predicted to be positive, only 89% were actually positive (precision). Only 81% of the positive samples were correctly classified (recall).
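
To make the two definitions concrete, the following sketch computes precision and recall on hypothetical toy labels, both with Scikit's functions and by hand from the counts of true positives, false positives, and false negatives.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy labels and predictions
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # 2 true positives
fp = np.sum((y_true == 0) & (y_pred == 1))  # 1 false positive
fn = np.sum((y_true == 1) & (y_pred == 0))  # 2 false negatives

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print('Precision:', precision, '(manual:', tp / (tp + fp), ')')
print('Recall:   ', recall, '(manual:', tp / (tp + fn), ')')
```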

### F1-Score

In [None]:
# TODO: Calculate the F1 score

### F-beta score

In [None]:
# TODO: Calculate the f-beta score
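
For reference, the F-beta score combines precision $P$ and recall $R$ as $F_\beta = (1 + \beta^2) \frac{P \cdot R}{\beta^2 P + R}$. Setting $\beta = 1$ gives the F1 score, and $\beta > 1$ weights recall more heavily than precision. A minimal sketch on hypothetical toy labels follows; the choice `beta=2` is only an illustration.

```python
import numpy as np
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score

# Hypothetical toy labels and predictions (precision = 2/3, recall = 1/2)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

f1 = f1_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)

# Manual F-beta from the formula above, with beta = 2
beta = 2
f2_manual = (1 + beta**2) * p * r / (beta**2 * p + r)

print('F1:', f1)
print('F2:', f2, '(manual:', f2_manual, ')')
```

Because recall is lower than precision in this toy example, the recall-weighted F2 score comes out lower than the F1 score.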

### Precision-Recall Curve

The Precision-Recall curve shows precision and recall for different threshold levels.

Note that "AP" denotes the average precision.

AP summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight:

$AP = \sum_n (R_n - R_{n-1}) P_n$

where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold. Scikit's implementation is not interpolated and therefore differs from computing the area under the precision-recall curve with the trapezoidal rule, which uses linear interpolation and can be too optimistic.
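
The sum above can be reproduced from the output of `precision_recall_curve()`. The sketch below does this on hypothetical toy scores and compares the result with `average_precision_score()`.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical toy labels and prediction scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precisions, recalls, _ = precision_recall_curve(y_true, scores)

# recalls is returned in decreasing order, so the differences are negative;
# the leading minus sign turns them into the increments (R_n - R_{n-1}).
ap_manual = -np.sum(np.diff(recalls) * precisions[:-1])
ap_sklearn = average_precision_score(y_true, scores)

print('AP (manual): ', ap_manual)
print('AP (sklearn):', ap_sklearn)
```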

In [None]:
# TODO: Plot the precision-recall curve

### Sensitivity and Specificity

Unfortunately, Scikit does not provide dedicated methods to calculate the sensitivity or specificity of a model. However, we can calculate both metrics ourselves from the confusion matrix.

In [None]:
cf_matrix = confusion_matrix(y_test, pred_label)

tn, fp, fn, tp = cf_matrix.ravel()

In [None]:
score_sensitivity = tp / (tp + fn)
score_specificity = tn / (tn + fp)

In [None]:
print('Sensitivity (TPR):', score_sensitivity)
print('Specificity (TNR):', score_specificity)

As can be seen, almost all negative samples (99%) are predicted correctly. However, only 81% of the positive samples are predicted correctly. Clearly, our classifier performs worse on positive samples than on negative ones.
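
Since sensitivity is the recall of the positive class and specificity is the recall of the negative class, `recall_score()` with the `pos_label` parameter should give the same values. A minimal sketch on hypothetical toy labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical toy labels and predictions
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# Sensitivity: recall of the positive class (pos_label=1 is the default)
sensitivity = recall_score(y_true, y_pred)
# Specificity: recall of the negative class
specificity = recall_score(y_true, y_pred, pos_label=0)

# Cross-check against the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('Sensitivity:', sensitivity, '==', tp / (tp + fn))
print('Specificity:', specificity, '==', tn / (tn + fp))
```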

### ROC Curve

Scikit's `RocCurveDisplay` can be used to generate the ROC curve. The `roc_auc_score()` function directly computes the AUC-ROC for given prediction scores and targets.
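
As a sketch of what the AUC-ROC measures, the example below computes it in three ways on hypothetical toy scores: directly with `roc_auc_score()`, as the area under the points returned by `roc_curve()`, and as the fraction of (positive, negative) pairs in which the positive sample receives the higher score.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical toy labels and prediction scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# 1) Directly from scores and targets
auc_direct = roc_auc_score(y_true, scores)

# 2) Area under the ROC curve points (trapezoidal rule)
fpr, tpr, _ = roc_curve(y_true, scores)
auc_from_curve = auc(fpr, tpr)

# 3) Fraction of positive/negative pairs that are ranked correctly
pos = scores[y_true == 1]
neg = scores[y_true == 0]
auc_pairs = np.mean(pos[:, None] > neg[None, :])

print(auc_direct, auc_from_curve, auc_pairs)  # 0.75 0.75 0.75
```

The pairwise form ignores ties between scores, which do not occur in this toy example.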

In [None]:
# TODO: Plot the ROC curve

Note that the AUC-ROC achieved by our model is 97.7\%. This isn't great, but it is still high compared to the average precision (AP) of 90.4\%!

In [None]:
# Alternatively, we can use the .from_estimator() method to obtain the plot in just one line
display = RocCurveDisplay.from_estimator(model, x_test, y_test)

### EER (Equal Error Rate)

The EER is the rate at which the false positive rate and the false negative rate are equal. Its value can be read off the ROC curve at the intersection with the diagonal running from (0, 1) to (1, 0). The EER is a quick way to compare the accuracy of classifiers (or devices, e.g. in biometrics) with different ROC curves. In general, the one with the lowest EER is the most accurate.

Unfortunately, Scikit does not provide a dedicated method to calculate the EER. So we have to manually calculate the EER from the FPRs and TPRs returned by `roc_curve()`.
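
The imports at the top of the notebook (`brentq`, `interp1d`) suggest an interpolation-based solution. As a simpler approximation, and not necessarily the intended solution, the sketch below picks the ROC point where FPR and FNR are closest. The synthetic scores (two Gaussians) are an assumption made only to keep the example self-contained.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical synthetic scores: negatives ~ N(0, 1), positives ~ N(2, 1)
rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])
scores = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(2.0, 1.0, 500)])

# Keep all ROC points so that the nearest-point search is as fine as possible
fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)
fnr = 1 - tpr

# ROC point where FPR and FNR are closest to each other
idx = np.argmin(np.abs(fnr - fpr))
eer = (fpr[idx] + fnr[idx]) / 2
eer_threshold = thresholds[idx]

print(f'EER: {eer:.4f} at threshold {eer_threshold:.4f}')
```

The nearest-point estimate is only as precise as the spacing of the ROC points; interpolating between neighbouring points gives a smoother value.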

In [None]:
# TODO: Calculate the equal error rate from roc_curve()

In [None]:
display = RocCurveDisplay.from_estimator(model, x_test, y_test)
plt.plot([0, 1], [1, 0], 'k--')
plt.scatter(eer, 1.-eer, c='r', label='EER')
plt.legend()
plt.show()

As can be seen, if we want to choose the threshold such that the FPR = FNR, the threshold is 0.0827. <br/>
This threshold is totally different from the default threshold (0.5)!

Let's see what happens to precision and recall if we use this threshold to assign the final class label ...

In [None]:
# TODO: Calculate precision and recall for a given threshold

Choosing a different threshold has a tremendous effect on precision and recall. Only 61\% of the samples classified as positive were actually positive. However, the recall increased to 94\%, which means that 94\% of the positive samples were correctly classified.
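
A custom threshold can be applied by comparing the predicted probabilities with it directly instead of calling `predict()`. The sketch below shows this on hypothetical toy probabilities; the helper `precision_recall_at()` is introduced here only for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.4, 0.35, 0.8])


def precision_recall_at(y_true, proba, threshold):
    """Label samples with proba >= threshold as positive and score the result."""
    pred = (proba >= threshold).astype(int)
    return precision_score(y_true, pred), recall_score(y_true, pred)


p_default, r_default = precision_recall_at(y_true, proba, 0.5)
p_low, r_low = precision_recall_at(y_true, proba, 0.3)

print('Threshold 0.5:', p_default, r_default)  # precision 1.0, recall 0.5
print('Threshold 0.3:', p_low, r_low)          # precision ~0.67, recall 1.0
```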

**This illustrates the importance of selecting the "right" metric when choosing a classifier.**

### Precision/Recall Tradeoff

We classify samples with *pred_proba* greater than the threshold as positive. If we increase the threshold, fewer samples are classified as positive, so the recall (a.k.a. TPR) can only decrease or stay the same. Precision, on the other hand, typically increases because the remaining positive predictions are the more confident ones, although it is not guaranteed to increase monotonically.
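
The difference between the two behaviours can be checked on a small example. In the sketch below (hypothetical toy scores), recall never increases as the threshold grows, while precision drops at one point before rising again.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical toy labels and prediction scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Thresholds are returned in increasing order
precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

for t, p, r in zip(thresholds, precisions[:-1], recalls[:-1]):
    print(f'threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}')

# Recall is non-increasing in the threshold ...
recall_is_monotonic = bool(np.all(np.diff(recalls) <= 0))
# ... but precision can go down when the threshold goes up
precision_has_dip = bool(np.any(np.diff(precisions[:-1]) < 0))
print(recall_is_monotonic, precision_has_dip)  # True True
```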

In [None]:
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.ylabel('Score')
    plt.legend()
    plt.ylim([0, 1])


precisions, recalls, thresholds = precision_recall_curve(y_test, pred_proba)

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

### Sensitivity/Specificity Tradeoff

We classify samples with *pred_proba* greater than the threshold as positive. Consequently, if we increase the threshold, fewer samples are classified as positive, so the sensitivity (a.k.a. TPR) can only decrease or stay the same. On the other hand, the specificity (a.k.a. TNR) can only increase or stay the same, since more samples are classified as negatives.

In [None]:
def plot_sensitivity_specificity_vs_threshold(sensitivity, specificity, thresholds):
    plt.plot(thresholds, sensitivity, "b--", label='Sensitivity (a.k.a. TPR)')
    plt.plot(thresholds, specificity, "g-", label='Specificity (a.k.a. TNR)')
    plt.xlabel("Threshold")
    plt.ylabel("Score")
    plt.legend()
    plt.ylim([0, 1])


fprs, tprs, thresholds = roc_curve(y_test, pred_proba)

specificity = 1 - fprs
sensitivity = tprs

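# Skip the first entry: roc_curve() prepends an artificial threshold
# (np.inf or max(score) + 1, depending on the Scikit version)
plot_sensitivity_specificity_vs_threshold(sensitivity[1:], specificity[1:], thresholds[1:])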
plt.show()

## Using different performance metrics with cross-validation

In our previous examples, we used cross-validation to find the best parameters for our model. This raises the question of how we can tell Scikit's `cross_validate()` or `cross_val_score()` function to use a different evaluation metric.

A different evaluation metric can be configured via the `scoring` parameter.

**Hint:** The mean and standard deviation used for standardization should be estimated on each training fold individually, so that no information from the validation fold leaks into the preprocessing. To achieve this, Scikit provides the `Pipeline` class. A pipeline defines a chain of transformations that are applied to your data set sequentially, where the last step in the chain is your machine learning model. We use `make_pipeline()` to construct a model training pipeline.

In [None]:
# make_pipeline() ensures that StandardScaler() is fit on each training fold individually
pipeline = make_pipeline(preprocessing.StandardScaler(),
                         LogisticRegression(max_iter=500))

In [None]:
# Use 4-fold cross validation and use the f1-score to measure the model performance
cross_val_score(pipeline, x, y, cv=4, scoring="f1")

In [None]:
# Use 4-fold cross validation and use the accuracy to measure the model performance
cross_val_score(pipeline, x, y, cv=4, scoring="accuracy")

In [None]:
# Use 4-fold cross validation and use the precision to measure the model performance
cross_val_score(pipeline, x, y, cv=4, scoring="precision")
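
Metrics that have no predefined `scoring` string, such as the F-beta score with a specific `beta`, can be wrapped with `make_scorer()` and passed in the same way. The sketch below uses a small synthetic dataset from `make_classification()` instead of MNIST so that it runs quickly; the dataset and the choice `beta=2` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical imbalanced toy dataset (about 10% positives)
x_toy, y_toy = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

toy_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

# Wrap fbeta_score so that cross_val_score() can call it with beta=2
f2_scorer = make_scorer(fbeta_score, beta=2)

scores = cross_val_score(toy_pipeline, x_toy, y_toy, cv=4, scoring=f2_scorer)
print('F2 per fold:', scores)
print(f'Mean: {scores.mean():.3f}, std: {scores.std():.3f}')
```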