# Classification Evaluation Metrics

Reference:
- http://blog.revolutionanalytics.com/2016/03/com_class_eval_metrics_r.html

To add:
- https://scikit-learn.org/stable/modules/model_evaluation.html
- http://people.ciirc.cvut.cz/~hlavac/TeachPresEn/31PattRecog/13ClassifierPerformance.pdf

## Techniques

|**Term**|**Definition**
|-------:|:-------------
|**Baseline Models**|Simple reference classifiers used as a point of comparison. Even a classifier with modest performance can be justified by showing that it does better than baselines such as random-chance predictions.

## Metrics

|**Term**|**Definition**
|-------:|:-------------
|**Confusion Matrix**|A contingency table of observations by their predicted vs actual class labels. For the notebook, assume the columns are the predicted labels and the rows are the actual labels.
|**Accuracy**|The fraction of instances that are correctly classified. `sum(diagonal) / observations`
|**Precision**|The fraction of correct predictions for a certain class. `diagonal / colsums`
|**Recall**|The fraction of instances of a class that were correctly predicted. `diagonal / rowsums`
|**F1**|Harmonic mean of precision and recall. `2 * precision * recall / (precision + recall)`
|**One-vs-all Matrices**|A 2x2 confusion matrix for one class at a time (that class vs all others). Summing these matrices gives the counts needed for micro-averaged metrics.
|**Average Accuracy**|The fraction of correctly classified instances in the sum of the one-vs-all matrices.
|**Macro-Averaged Metrics**|Unweighted mean of the per-class metrics, so every class counts equally.
|**Micro-Averaged Metrics**|Average performance computed from the sum of the one-vs-all matrices, so every instance counts equally. Favors classes with a larger number of instances. Micro-averaged precision, recall, and F1 are equal.
|**No Information Rate**|The overall accuracy of the majority-class classifier.
|**Kappa Statistic**|A measure of agreement between the predictions and the actual labels. Also interpreted as a comparison of the overall accuracy to the expected random-chance accuracy.
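
The definitions above can be written out in one notation. This is a sketch under the notebook's convention (rows are actual labels, columns are predicted labels); the symbols $C_{ij}$, $k$, $p_i$, $q_i$ and $p_e$ are introduced here for illustration and do not appear in the referenced blog post.

```latex
% C_{ij}: number of observations with actual class i and predicted class j
% n: total number of observations, k: number of classes
\text{accuracy} = \frac{1}{n}\sum_{i=1}^{k} C_{ii}
\qquad
\text{precision}_i = \frac{C_{ii}}{\sum_{j=1}^{k} C_{ji}}
\qquad
\text{recall}_i = \frac{C_{ii}}{\sum_{j=1}^{k} C_{ij}}

F1_i = \frac{2\,\text{precision}_i\,\text{recall}_i}{\text{precision}_i + \text{recall}_i}

% p_i: share of actual class i (row sum / n), q_i: share of predicted class i (column sum / n)
p_e = \sum_{i=1}^{k} p_i\, q_i
\qquad
\kappa = \frac{\text{accuracy} - p_e}{1 - p_e}
```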

## Baseline Models

|**Term**|**Definition**
|-------:|:-------------
|**Majority-Class Classifier**|Always predict the majority class.
|**Random-Guess Classifier**|Predict labels uniformly at random.
|**Weighted Random-Guess Classifier**|Predict labels at random, weighted by the prior class distribution.
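
For a quick empirical counterpart to these baselines, scikit-learn's `DummyClassifier` offers strategies that appear to line up with the three rows above (`most_frequent`, `uniform`, `stratified`). The sketch below uses hypothetical toy labels rather than the notebook's data, and the mapping of strategy names to the terms in the table is an assumption made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels; DummyClassifier ignores the features,
# so a single column of zeros is enough.
y = np.array(['a'] * 5 + ['b'] * 3 + ['c'] * 2)
X = np.zeros((len(y), 1))

baselines = {
    'majority-class': DummyClassifier(strategy='most_frequent'),
    'random-guess': DummyClassifier(strategy='uniform', random_state=0),
    'weighted random-guess': DummyClassifier(strategy='stratified', random_state=0),
}

baseline_accuracy = {}
for name, model in baselines.items():
    model.fit(X, y)
    # score() returns the mean accuracy of the baseline on the toy labels
    baseline_accuracy[name] = model.score(X, y)

print(baseline_accuracy)
```

The two random strategies give a single simulated draw, so their accuracies fluctuate around the expected values derived analytically later in the notebook.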

## Classification Evaluation Demo

Note: the indexing notation differs from the *Revolutions* blog post.

In [1]:
import numpy as np
import pandas as pd
from pprint import pprint

In [2]:
# Generate data
np.random.seed(0)

n = 200
sample_mix = 75
data_classes = np.array(['a', 'b', 'c'])
p_true = [0.5, 0.25, 0.25]

actual = np.random.choice(data_classes, size = n, replace = True, p = p_true)
predicted = actual.copy()

mix_indices_actual = np.random.choice(range(n), size = sample_mix, replace = False)
mix_indices_predicted = np.random.choice(range(n), size = sample_mix, replace = False)

predicted[mix_indices_predicted] = actual[mix_indices_actual]

### Confusion Matrix

References for further implementation:

- https://stackoverflow.com/questions/2148543/how-to-write-a-confusion-matrix-in-python

In [3]:
cm = np.zeros([3, 3]).astype(int)

for cm_index_actual, i in enumerate(data_classes):
    for cm_index_predict, j in enumerate(data_classes):
        cm[cm_index_actual, cm_index_predict] = np.sum((actual == i) & (predicted == j))

cm_index = ['actual ' + x for x in data_classes]
cm_columns = ['predict ' + x for x in data_classes]
pd.DataFrame(cm, index = cm_index, columns = cm_columns)

| |predict a|predict b|predict c
|:--|--:|--:|--:
|actual a|73|15|6
|actual b|13|46|5
|actual c|9|4|29


In [4]:
# sklearn implementation
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(actual, predicted, labels = data_classes)

pd.DataFrame(cm, index = cm_index, columns = cm_columns)

| |predict a|predict b|predict c
|:--|--:|--:|--:
|actual a|73|15|6
|actual b|13|46|5
|actual c|9|4|29
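
Another way to build the same kind of table is `pd.crosstab`, which counts label pairs directly. The sketch below uses hypothetical toy labels (`toy_actual`, `toy_predicted`) rather than the notebook's data. One caveat: unlike `confusion_matrix` with an explicit `labels` argument, `crosstab` only creates rows and columns for labels that actually occur.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels, not the notebook's data.
toy_actual = np.array(['a', 'a', 'b', 'b', 'c', 'c'])
toy_predicted = np.array(['a', 'b', 'b', 'b', 'c', 'a'])

# Rows are actual labels and columns are predicted labels,
# matching the convention used in this notebook.
toy_cm = pd.crosstab(
    pd.Series(toy_actual, name='actual'),
    pd.Series(toy_predicted, name='predicted'),
)
print(toy_cm)
```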


### Basic Variables

In [5]:
nc = len(data_classes)
diag = np.diagonal(cm)
colsums = np.sum(cm, axis = 0)
rowsums = np.sum(cm, axis = 1)
p = rowsums / n
q = colsums / n

### Accuracy, Precision, Recall, F1

In [6]:
diag = np.diagonal(cm)

accuracy = np.sum(diag) / n
# rows are actual labels and columns are predicted labels, so
# precision divides by the column sums and recall by the row sums
precision = diag / colsums
recall = diag / rowsums
f1 = 2 * precision * recall / (precision + recall)

cm_metrics = pd.DataFrame(
    {
        'precision': precision, 
        'recall': recall,
        'f1': f1,
    }, 
    index = data_classes)

print('Accuracy: {}'.format(accuracy))
cm_metrics

Accuracy: 0.74


| |precision|recall|f1
|:--|--:|--:|--:
|a|0.768421|0.776596|0.772487
|b|0.707692|0.71875|0.713178
|c|0.725|0.690476|0.707317
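
As a cross-check on the per-class metrics, the confusion matrix counts shown earlier can be expanded back into label vectors and passed to scikit-learn's `precision_recall_fscore_support`. This is a sketch: `cm_demo`, `y_true` and `y_pred` are names introduced here, and the labels are rebuilt from the counts rather than taken from the notebook's random draw.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

classes = np.array(['a', 'b', 'c'])
# Confusion matrix from the demo: rows = actual, columns = predicted.
cm_demo = np.array([[73, 15, 6],
                    [13, 46, 5],
                    [9, 4, 29]])

# Expand the counts back into label vectors (one entry per observation).
pairs = [(i, j) for i in range(3) for j in range(3)]
y_true = np.concatenate([np.repeat(classes[i], cm_demo[i, j]) for i, j in pairs])
y_pred = np.concatenate([np.repeat(classes[j], cm_demo[i, j]) for i, j in pairs])

sk_precision, sk_recall, sk_f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=classes
)

# With rows = actual and columns = predicted:
# precision = diagonal / column sums, recall = diagonal / row sums.
manual_precision = np.diagonal(cm_demo) / cm_demo.sum(axis=0)
manual_recall = np.diagonal(cm_demo) / cm_demo.sum(axis=1)

print(np.allclose(sk_precision, manual_precision),
      np.allclose(sk_recall, manual_recall))
```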


### Macro-Averaged Metrics

In [7]:
cm_metrics_avg = pd.DataFrame(cm_metrics.mean(), columns = ['macro-averaged'])
cm_metrics_avg

| |macro-averaged
|:--|--:
|precision|0.733704
|recall|0.728607
|f1|0.730994


### One-Vs-All

In [8]:
ova_dict = {}

for index, i in enumerate(data_classes):
    
    cm_i = cm[index, index]
    colsums_i = np.sum(cm, axis = 0)[index]
    rowsums_i = np.sum(cm, axis = 1)[index]
    
    ova_dict[i] = np.array([
        [
            cm_i, 
            rowsums_i - cm_i,
        ],
        [
            colsums_i - cm_i, 
            n - rowsums_i - colsums_i + cm_i,
        ],
    ])
    
ova_dict_display = pd.concat({
    i: pd.DataFrame(
        ova_dict[i], 
        index = ['actual x', 'actual !x'], 
        columns = ['predict x', 'predict !x']
    ) 
    for i in ova_dict},
    axis = 1
)

ova_dict_display

| |a: predict x|a: predict !x|b: predict x|b: predict !x|c: predict x|c: predict !x
|:--|--:|--:|--:|--:|--:|--:
|actual x|73|21|46|18|29|13
|actual !x|22|84|19|117|11|147


In [9]:
# sum of one-vs-all matrices
ova_sum = np.sum(list(ova_dict.values()), axis = 0)

pd.DataFrame(ova_sum, index = ['actual x', 'actual !x'], columns = ['predict x', 'predict !x'])

| |predict x|predict !x
|:--|--:|--:
|actual x|148|52
|actual !x|52|348
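
scikit-learn's `multilabel_confusion_matrix` produces the same one-vs-all counts, but lays each 2x2 matrix out as `[[TN, FP], [FN, TP]]`, the reverse of the layout used in this notebook. The sketch below rebuilds label vectors from the demo's confusion matrix (the names `cm_demo`, `y_true`, `y_pred`, `ova` and `ova_total` are introduced here) and flips both axes to match.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

classes = np.array(['a', 'b', 'c'])
# Confusion matrix from the demo: rows = actual, columns = predicted.
cm_demo = np.array([[73, 15, 6],
                    [13, 46, 5],
                    [9, 4, 29]])

# Expand the counts back into label vectors (one entry per observation).
pairs = [(i, j) for i in range(3) for j in range(3)]
y_true = np.concatenate([np.repeat(classes[i], cm_demo[i, j]) for i, j in pairs])
y_pred = np.concatenate([np.repeat(classes[j], cm_demo[i, j]) for i, j in pairs])

# One 2x2 matrix per class, laid out as [[TN, FP], [FN, TP]].
mcm = multilabel_confusion_matrix(y_true, y_pred, labels=classes)

# Reverse both axes to get the notebook's layout [[TP, FN], [FP, TN]].
ova = mcm[:, ::-1, ::-1]
ova_total = ova.sum(axis=0)

print(ova_total)
```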


### Micro-Averaged Metrics

In [10]:
micro_prf = (np.diagonal(ova_sum) / np.sum(ova_sum, axis = 1))[0]
cm_metrics_avg['micro_averaged'] = micro_prf
cm_metrics_avg

| |macro-averaged|micro_averaged
|:--|--:|--:
|precision|0.733704|0.74
|recall|0.728607|0.74
|f1|0.730994|0.74
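
In the summed one-vs-all matrix, every misclassified instance counts once as a false positive and once as a false negative, which is why micro-averaged precision, recall and F1 coincide with each other and, for single-label problems, with the overall accuracy. A small check with scikit-learn's `average='micro'` option follows. It is a sketch on label vectors rebuilt from the demo's confusion matrix (`cm_demo`, `y_true`, `y_pred` and `micro` are names introduced here).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

classes = np.array(['a', 'b', 'c'])
# Confusion matrix from the demo: rows = actual, columns = predicted.
cm_demo = np.array([[73, 15, 6],
                    [13, 46, 5],
                    [9, 4, 29]])

# Expand the counts back into label vectors (one entry per observation).
pairs = [(i, j) for i in range(3) for j in range(3)]
y_true = np.concatenate([np.repeat(classes[i], cm_demo[i, j]) for i, j in pairs])
y_pred = np.concatenate([np.repeat(classes[j], cm_demo[i, j]) for i, j in pairs])

micro = {
    'precision': precision_score(y_true, y_pred, average='micro'),
    'recall': recall_score(y_true, y_pred, average='micro'),
    'f1': f1_score(y_true, y_pred, average='micro'),
}
overall_accuracy = accuracy_score(y_true, y_pred)

print(micro, overall_accuracy)
```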


### Majority-Class Metrics

In [11]:
mcIndex = np.where(rowsums == max(rowsums))
mcAccuracy = p[mcIndex][0]

mcRecall = 0 * p
mcRecall[mcIndex] = 1

mcPrecision = 0 * p
mcPrecision[mcIndex] = p[mcIndex]

mcF1 = 0 * p
mcF1[mcIndex] = 2 * mcPrecision[mcIndex] / (mcPrecision[mcIndex] + 1)

print('{} accuracy vs {} majority-class accuracy'.format(accuracy, mcAccuracy))
pd.DataFrame(
    {
        'recall': recall,
        'base_recall': mcRecall,
        'recall_diff': recall - mcRecall,
        'precision': precision,
        'base_precision': mcPrecision, 
        'precision_diff': precision - mcPrecision,
        'f1': f1,
        'base_f1': mcF1,
        'f1_diff': f1 - mcF1,
        
    }, index = data_classes)

0.74 accuracy vs 0.47 majority-class accuracy


| |recall|base_recall|recall_diff|precision|base_precision|precision_diff|f1|base_f1|f1_diff
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:
|a|0.776596|1.0|-0.223404|0.768421|0.47|0.298421|0.772487|0.639456|0.133031
|b|0.71875|0.0|0.71875|0.707692|0.0|0.707692|0.713178|0.0|0.713178
|c|0.690476|0.0|0.690476|0.725|0.0|0.725|0.707317|0.0|0.707317
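
The baseline values for the majority class follow from a short derivation. The symbols $m$ (the majority class) and $p_m$ (its share of the actual labels) are introduced here for illustration.

```latex
% The majority-class classifier predicts class m for every observation, so
% all n predictions land in column m and n p_m of them are correct.
\text{accuracy} = p_m
\qquad
\text{precision}_m = \frac{n\,p_m}{n} = p_m
\qquad
\text{recall}_m = \frac{n\,p_m}{n\,p_m} = 1

F1_m = \frac{2\,p_m \cdot 1}{p_m + 1} = \frac{2\,p_m}{p_m + 1}

% With p_m = 0.47 from the demo:
F1_m = \frac{0.94}{1.47} \approx 0.639456
```

No observation is ever predicted as one of the other classes, so their recall is 0 and their precision is undefined (0/0); the notebook sets both to 0.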


### Random-Guess Metrics

In [12]:
rg_cm = (n / nc) * np.repeat([p], nc, axis = 1).reshape(nc, nc)
pd.DataFrame(rg_cm, index = cm_index, columns = cm_columns)

| |predict a|predict b|predict c
|:--|--:|--:|--:
|actual a|31.333333|31.333333|31.333333
|actual b|21.333333|21.333333|21.333333
|actual c|14.0|14.0|14.0


In [13]:
rgAccuracy = 1 / nc
rgPrecision = p
rgRecall = 0 * p + 1 / nc
rgF1 = 2 * p / (nc * p + 1)

print('{} accuracy vs {} random-class accuracy'.format(accuracy, rgAccuracy))
pd.DataFrame(
    {
        'recall': recall,
        'base_recall': rgRecall,
        'recall_diff': recall - rgRecall,
        'precision': precision,
        'base_precision': rgPrecision, 
        'precision_diff': precision - rgPrecision,
        'f1': f1,
        'base_f1': rgF1,
        'f1_diff': f1 - rgF1,
        
    }, index = data_classes)

0.74 accuracy vs 0.3333333333333333 random-class accuracy


| |recall|base_recall|recall_diff|precision|base_precision|precision_diff|f1|base_f1|f1_diff
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:
|a|0.776596|0.333333|0.443262|0.768421|0.47|0.298421|0.772487|0.390041|0.382445
|b|0.71875|0.333333|0.385417|0.707692|0.32|0.387692|0.713178|0.326531|0.386648
|c|0.690476|0.333333|0.357143|0.725|0.21|0.515|0.707317|0.257669|0.449648
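
The closed-form baselines for the random-guess classifier can be derived from its expected confusion matrix. The symbols $k$ (number of classes, `nc` in the code) and $p_i$ (share of actual class $i$, `p` in the code) are used here for illustration.

```latex
% Each label is guessed with probability 1/k, independently of the actual class,
% so the expected count in cell (i, j) is
E[C_{ij}] = \frac{n\,p_i}{k}

% Column j then sums to n/k (because \sum_i p_i = 1) and row i sums to n p_i:
\text{precision}_i = \frac{n\,p_i / k}{n / k} = p_i
\qquad
\text{recall}_i = \frac{n\,p_i / k}{n\,p_i} = \frac{1}{k}
\qquad
\text{accuracy} = \sum_{i=1}^{k} \frac{p_i}{k} = \frac{1}{k}

F1_i = \frac{2\,p_i \cdot \frac{1}{k}}{p_i + \frac{1}{k}} = \frac{2\,p_i}{k\,p_i + 1}

% With p_a = 0.47 and k = 3 from the demo:
F1_a = \frac{0.94}{2.41} \approx 0.390041
```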


### Weighted Random-Guess Metrics

In [14]:
wrg_cm = np.zeros([3, 3])

for i in range(nc):
    for j in range(nc):
        wrg_cm[i, j] = n * p[i] * p[j]

pd.DataFrame(wrg_cm, index = cm_index, columns = cm_columns)

| |predict a|predict b|predict c
|:--|--:|--:|--:
|actual a|44.18|30.08|19.74
|actual b|30.08|20.48|13.44
|actual c|19.74|13.44|8.82
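
The double loop that fills this expected confusion matrix can also be written as an outer product. The sketch below hard-codes the observed class proportions from the demo, and the names `wrg_cm_vec` and `wrg_accuracy_vec` are introduced here for illustration.

```python
import numpy as np

n = 200
p = np.array([0.47, 0.32, 0.21])  # observed class proportions from the demo

# Expected count in cell (i, j) is n * p[i] * p[j], i.e. n times the outer product.
wrg_cm_vec = n * np.outer(p, p)

# Expected accuracy is the diagonal mass, which equals sum(p ** 2).
wrg_accuracy_vec = np.trace(wrg_cm_vec) / n

print(wrg_cm_vec)
print(wrg_accuracy_vec)
```

The same pattern covers the known-prior case by swapping `p` for the true class probabilities.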


In [15]:
wrgAccuracy = np.sum(np.power(p, 2))
wrgPrecision = p
wrgRecall = p
wrgF1 = p

print('{} accuracy vs {} weighted random-class accuracy'.format(accuracy, wrgAccuracy))
pd.DataFrame(
    {
        'recall': recall,
        'base_recall': wrgRecall,
        'recall_diff': recall - wrgRecall,
        'precision': precision,
        'base_precision': wrgPrecision, 
        'precision_diff': precision - wrgPrecision,
        'f1': f1,
        'base_f1': wrgF1,
        'f1_diff': f1 - wrgF1,
        
    }, index = data_classes)

0.74 accuracy vs 0.36739999999999995 weighted random-class accuracy


| |recall|base_recall|recall_diff|precision|base_precision|precision_diff|f1|base_f1|f1_diff
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:
|a|0.776596|0.47|0.306596|0.768421|0.47|0.298421|0.772487|0.47|0.302487
|b|0.71875|0.32|0.39875|0.707692|0.32|0.387692|0.713178|0.32|0.393178
|c|0.690476|0.21|0.480476|0.725|0.21|0.515|0.707317|0.21|0.497317


### Weighted Random-Guess Metrics (known prior)

In [16]:
pwrg_cm = np.zeros([3, 3])

for i in range(nc):
    for j in range(nc):
        pwrg_cm[i, j] = n * p_true[i] * p_true[j]

pd.DataFrame(pwrg_cm, index = cm_index, columns = cm_columns)

| |predict a|predict b|predict c
|:--|--:|--:|--:
|actual a|50.0|25.0|25.0
|actual b|25.0|12.5|12.5
|actual c|25.0|12.5|12.5


In [17]:
pwrgAccuracy = np.sum(np.power(p_true, 2))
pwrgPrecision = p_true
pwrgRecall = p_true
pwrgF1 = p_true

print('{} accuracy vs {} prior-weighted random-class accuracy'.format(accuracy, pwrgAccuracy))
pd.DataFrame(
    {
        'recall': recall,
        'base_recall': pwrgRecall,
        'recall_diff': recall - pwrgRecall,
        'precision': precision,
        'base_precision': pwrgPrecision, 
        'precision_diff': precision - pwrgPrecision,
        'f1': f1,
        'base_f1': pwrgF1,
        'f1_diff': f1 - pwrgF1,
        
    }, index = data_classes)

0.74 accuracy vs 0.375 prior-weighted random-class accuracy


| |recall|base_recall|recall_diff|precision|base_precision|precision_diff|f1|base_f1|f1_diff
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:
|a|0.776596|0.5|0.276596|0.768421|0.5|0.268421|0.772487|0.5|0.272487
|b|0.71875|0.25|0.46875|0.707692|0.25|0.457692|0.713178|0.25|0.463178
|c|0.690476|0.25|0.440476|0.725|0.25|0.475|0.707317|0.25|0.457317


### Kappa Statistic

Further reading: https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english

In [18]:
expAccuracy = np.sum(p * q)
kappa = (accuracy - expAccuracy) / (1 - expAccuracy)
kappa

0.5877923107411811
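
The manual kappa value can be cross-checked against scikit-learn's `cohen_kappa_score`. This is a sketch on label vectors rebuilt from the demo's confusion matrix; `cm_demo`, `y_true`, `y_pred`, `kappa_sklearn` and `kappa_manual` are names introduced here.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

classes = np.array(['a', 'b', 'c'])
# Confusion matrix from the demo: rows = actual, columns = predicted.
cm_demo = np.array([[73, 15, 6],
                    [13, 46, 5],
                    [9, 4, 29]])

# Expand the counts back into label vectors (one entry per observation).
pairs = [(i, j) for i in range(3) for j in range(3)]
y_true = np.concatenate([np.repeat(classes[i], cm_demo[i, j]) for i, j in pairs])
y_pred = np.concatenate([np.repeat(classes[j], cm_demo[i, j]) for i, j in pairs])

kappa_sklearn = cohen_kappa_score(y_true, y_pred)

# Manual version, mirroring the cell above: observed vs expected agreement.
total = cm_demo.sum()
observed = np.trace(cm_demo) / total
expected = np.sum(cm_demo.sum(axis=1) * cm_demo.sum(axis=0)) / total ** 2
kappa_manual = (observed - expected) / (1 - expected)

print(kappa_sklearn, kappa_manual)
```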