### What's this
This short notebook demonstrates a pitfall of the quadratically-weighted kappa metric when it is used to evaluate image grading tasks: its value changes substantially under label shift (the label distribution changes over time or space; see [this link](https://d2l.ai/chapter_multilayer-perceptrons/environment.html#label-shift) for a definition). We will also see that accuracy and balanced accuracy are unaffected on average, and that the Matthews Correlation Coefficient (MCC) is affected, but much less than quadratic kappa.
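For reference, this is the usual textbook definition of quadratically-weighted kappa, written here as a sketch. The symbols are introduced only for this note: $O_{ij}$ is the number of samples with true grade $i$ and predicted grade $j$, $K$ is the number of grades, and $n$ is the number of samples. To the best of my knowledge `cohen_kappa_score(..., weights='quadratic')` computes the same quantity, since the normalisation of $w_{ij}$ cancels in the ratio.

$$
\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}},
\qquad
w_{ij} = \frac{(i-j)^2}{(K-1)^2},
\qquad
E_{ij} = \frac{1}{n} \Big(\sum_{j'} O_{ij'}\Big) \Big(\sum_{i'} O_{i'j}\Big)
$$

The denominator depends only on the marginal distributions of true and predicted labels, which is why a change in class prevalence can move $\kappa_w$ even when the model's error behaviour stays the same.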

See, for instance, how the two metrics react to a single wrong prediction in the toy examples below:

In [64]:
y_true = [0, 1, 2, 3, 4]
y_pred = [0, 1, 2, 3, 0]
matthews_corrcoef(y_true, y_pred), cohen_kappa_score(y_true, y_pred, weights='quadratic')

(0.7905694150420948, 0.19999999999999996)

In [65]:
y_true = [0, 1, 2, 3, 4]
y_pred = [0, 1, 2, 3, 3]
matthews_corrcoef(y_true, y_pred), cohen_kappa_score(y_true, y_pred, weights='quadratic')

(0.7905694150420948, 0.9411764705882353)

In [48]:
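def repl(v):
    # Swap labels 0<->3 and 1<->4 (2 stays); return a new list so the input is not modified
    dict_map = {0:3, 1:4, 2:2, 3:0, 4:1}
    return [dict_map[x] for x in v]

# some of the cells below call this helper as `replace`
replace = repl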

In [60]:
matthews_corrcoef(repl(y_true), repl(y_pred)), cohen_kappa_score(repl(y_true), repl(y_pred), weights='quadratic')

0.7905694150420948

0.8

In [53]:
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 0, 1, 1, 2, 2, 3, 3, 3, 3]
matthews_corrcoef(y_true, y_pred), cohen_kappa_score(y_true, y_pred, weights='quadratic')

(0.7905694150420948, 0.9411764705882353)

In [47]:
replace(y_true), replace(y_pred)

([3, 3, 4, 4, 2, 2, 0, 0, 1, 1], [3, 3, 4, 4, 3, 0, 0, 0, 1, 1])

In [49]:
matthews_corrcoef(replace(y_true), replace(y_pred))

0.7798128673650546

In [50]:
cohen_kappa_score(replace(y_true), replace(y_pred), weights='quadratic')

0.8888888888888888
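The toy cells above can be condensed into one self-contained check. This is a minimal sketch: `label_map` and `relabel` are names introduced here for illustration, and `relabel` applies the same permutation as the helper above while returning a copy.

```python
from sklearn.metrics import cohen_kappa_score, matthews_corrcoef

# Toy labels from the cells above: one grade-4 sample is predicted as grade 0.
y_true_toy = [0, 1, 2, 3, 4]
y_pred_toy = [0, 1, 2, 3, 0]

# Same permutation as the helper above (0<->3, 1<->4, 2 unchanged).
label_map = {0: 3, 1: 4, 2: 2, 3: 0, 4: 1}

def relabel(labels):
    """Return a relabeled copy; the input list is left untouched."""
    return [label_map[v] for v in labels]

mcc_before = matthews_corrcoef(y_true_toy, y_pred_toy)
mcc_after = matthews_corrcoef(relabel(y_true_toy), relabel(y_pred_toy))
kappa_before = cohen_kappa_score(y_true_toy, y_pred_toy, weights='quadratic')
kappa_after = cohen_kappa_score(relabel(y_true_toy), relabel(y_pred_toy), weights='quadratic')

print(mcc_before, mcc_after)      # identical: MCC ignores the ordering of the classes
print(kappa_before, kappa_after)  # about 0.2 vs. 0.8
```

After relabeling, the single error is 2 grades away from the truth instead of 4, so quadratic kappa jumps from about 0.2 to about 0.8, while MCC does not change at all.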

### Let's start

In [1]:
import numpy as np
from sklearn.metrics import cohen_kappa_score, matthews_corrcoef, accuracy_score
from sklearn.metrics import balanced_accuracy_score

Let us generate a synthetic dataset with 500 samples labeled with classes 0,1,2,3,4, as is typical in diabetic retinopathy grading. In this dataset there are 100 examples of each class.

In [2]:
y_true = np.array(100*[0] + 100*[1] + 100*[2] + 100*[3] + 100*[4])

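We now build a synthetic model that predicts the true category for 400 of the 500 samples and predicts uniformly at random on the remaining 100. Since a random prediction still hits the true label one time in five, the expected accuracy is 84% rather than 80%: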

In [3]:
# copy labels over predictions
y_pred = y_true.copy()
# pick 100 positions on y_pred to replace label by random prediction
replaced_pred_inds = np.random.choice(len(y_true), size=100, replace=False)
# replace the labels at those positions with random predictions
y_pred[replaced_pred_inds] = np.random.randint(0,4+1, size=100)
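The cell above is unseeded, so every run gives slightly different numbers. A reproducible variant could look like the sketch below. The function name `make_noisy_predictions`, its parameters, and the `_demo` variables are hypothetical names introduced here, and the seed value is arbitrary.

```python
import numpy as np

def make_noisy_predictions(y_true, n_replaced=100, n_classes=5, seed=0):
    """Copy y_true and overwrite n_replaced random positions with uniform random labels."""
    rng = np.random.default_rng(seed)  # seeded generator, so reruns give the same y_pred
    y_pred = y_true.copy()
    idx = rng.choice(len(y_true), size=n_replaced, replace=False)
    y_pred[idx] = rng.integers(0, n_classes, size=n_replaced)
    return y_pred

y_true_demo = np.repeat(np.arange(5), 100)  # 100 samples for each of the classes 0..4
y_pred_demo = make_noisy_predictions(y_true_demo)

# A random label matches the true one with probability 1/5, so the expected
# accuracy is (400 + 100/5) / 500 = 0.84 rather than 0.80.
expected_acc = (400 + 100 / 5) / 500
observed_acc = (y_pred_demo == y_true_demo).mean()
print(expected_acc, observed_acc)
```

The `_demo` suffix keeps this sketch from overwriting the `y_true` and `y_pred` variables used by the following cells.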

The model is pretty decent on all four metrics. Note that because the dataset is perfectly balanced, accuracy and balanced accuracy are exactly the same in this case:

In [4]:
k = cohen_kappa_score(y_true, y_pred, weights='quadratic')
mcc = matthews_corrcoef(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
b_acc = balanced_accuracy_score(y_true, y_pred)
print('Quad-kappa={:.2f}, MCC={:.2f}, ACC={:.2f}, bal-ACC={:.2f},'.format(100*k, 100*mcc, 100*acc, 100*b_acc))

Quad-kappa=80.45, MCC=78.28, ACC=82.60, bal-ACC=82.60,


Let us use the same model on a different dataset that has different prevalences. This dataset also has 500 samples, but categories 3 and 4 are less represented, and category 0 is much more represented:

In [5]:
y_true = np.array(280*[0] + 100*[1] + 100*[2] + 10*[3] + 10*[4])

The model is the same as before: 80% of the time it returns the true label, and 20% of the time it predicts at random:

In [6]:
# copy labels over predictions
y_pred = y_true.copy()
# pick 100 positions on y_pred to replace label by random prediction
replaced_pred_inds = np.random.choice(len(y_true), size=100, replace=False)
# replace the labels at those positions with random predictions
y_pred[replaced_pred_inds] = np.random.randint(0,4+1, size=100)

We look again at the same four metrics:

In [7]:
k = cohen_kappa_score(y_true, y_pred, weights='quadratic')
mcc = matthews_corrcoef(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
b_acc = balanced_accuracy_score(y_true, y_pred)
print('Quad-kappa={:.2f}, MCC={:.2f}, ACC={:.2f}, bal-ACC={:.2f},'.format(100*k, 100*mcc, 100*acc, 100*b_acc))

Quad-kappa=61.14, MCC=75.88, ACC=84.00, bal-ACC=80.14,


We can see that accuracy is very robust to label shift (82.60 vs. 84.00), and so appears to be the Matthews Correlation Coefficient (78.28 vs. 75.88), but quadratic kappa varies dramatically (80.45 vs. 61.14).

Also, if we run the above cell multiple times, we will see that balanced accuracy is quite volatile, since a few more or fewer mistakes in the minority classes result in large improvements or degradations of this metric.
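To see where the drop in quadratic kappa comes from, one can compute kappa on the *expected* confusion matrix of this noisy model instead of on a random draw. This is a rough sketch under stated assumptions: each sample is treated as replaced with probability 100/500 = 0.2, and the kappa of the expected matrix is only an approximation of the average kappa over many runs. The function name `expected_quadratic_kappa` is introduced here for illustration.

```python
import numpy as np

def expected_quadratic_kappa(class_counts, p_replaced=0.2):
    """Quadratic kappa of the expected confusion matrix of the noisy model."""
    n = np.asarray(class_counts, dtype=float)
    k = len(n)
    # A sample keeps its label with prob. 1 - p_replaced, else it gets a uniform random label.
    conf = n[:, None] * ((1 - p_replaced) * np.eye(k) + p_replaced / k)
    grades = np.arange(k)
    w = (grades[:, None] - grades[None, :]) ** 2  # quadratic disagreement weights
    # Confusion matrix expected by chance, given the row and column marginals.
    chance = np.outer(conf.sum(axis=1), conf.sum(axis=0)) / conf.sum()
    observed_dis = (w * conf).sum()
    chance_dis = (w * chance).sum()
    return 1 - observed_dis / chance_dis, observed_dis, chance_dis

print(expected_quadratic_kappa([100, 100, 100, 100, 100]))
print(expected_quadratic_kappa([280, 100, 100, 10, 10]))
```

The balanced case gives a kappa of 0.80 and the imbalanced case roughly 0.63. The observed weighted disagreement barely changes (400 vs. 454), but the chance-level disagreement in the denominator shrinks from 2000 to about 1216 because most labels now sit in class 0, and that is what pulls kappa down.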

Just to check that this was not a random coincidence, let us run this experiment 1000 times and look at the statistics:

In [11]:
N_EXPERIMENTS = 1000

In [14]:
def get_model_performance(y_true):
    # copy labels over predictions
    y_pred = y_true.copy()
    # pick 100 positions on y_pred to replace label by random prediction
    replaced_pred_inds = np.random.choice(len(y_true), size=100, replace=False)
    # replace the labels at those positions with random predictions
    y_pred[replaced_pred_inds] = np.random.randint(0,4+1, size=100)
    
    k = cohen_kappa_score(y_true, y_pred, weights='quadratic')
    mcc = matthews_corrcoef(y_true, y_pred)
    acc = accuracy_score(y_true, y_pred)
    b_acc = balanced_accuracy_score(y_true, y_pred)
    
    return k, mcc, acc, b_acc
    
ks_balanced, mccs_balanced = np.zeros((N_EXPERIMENTS,)), np.zeros((N_EXPERIMENTS,))
accs_balanced, bal_accs_balanced = np.zeros((N_EXPERIMENTS,)), np.zeros((N_EXPERIMENTS,))

ks_imbalanced, mccs_imbalanced = np.zeros((N_EXPERIMENTS,)), np.zeros((N_EXPERIMENTS,))
accs_imbalanced, bal_accs_imbalanced = np.zeros((N_EXPERIMENTS,)), np.zeros((N_EXPERIMENTS,))

for ex in range(N_EXPERIMENTS):
    # balanced dataset
    y_true = np.array(100*[0] + 100*[1] + 100*[2] + 100*[3] + 100*[4])
    k, mcc, acc, bal_acc = get_model_performance(y_true)
    
    ks_balanced[ex], mccs_balanced[ex], accs_balanced[ex], bal_accs_balanced[ex] = k, mcc, acc, bal_acc
    
    # imbalanced dataset - label shift
    y_true = np.array(280*[0] + 100*[1] + 100*[2] + 10*[3] + 10*[4])
    k, mcc, acc, bal_acc = get_model_performance(y_true)
    
    ks_imbalanced[ex], mccs_imbalanced[ex], accs_imbalanced[ex], bal_accs_imbalanced[ex] = k, mcc, acc, bal_acc

In [15]:
print('BALANCED DATASET:')
print('Avg Quad-kappa    = {:.2f}+/-{:.2f}'.format(100*ks_balanced.mean(), 100*ks_balanced.std()))
print('Avg MCC           = {:.2f}+/-{:.2f}'.format(100*mccs_balanced.mean(), 100*mccs_balanced.std()))
print('Avg Accuracy      = {:.2f}+/-{:.2f}'.format(100*accs_balanced.mean(), 100*accs_balanced.std()))
print('Avg Bal. Accuracy = {:.2f}+/-{:.2f}'.format(100*bal_accs_balanced.mean(), 100*bal_accs_balanced.std()))

print(40*'-')

print('IMBALANCED DATASET:')
print('Avg Quad-kappa    = {:.2f}+/-{:.2f}'.format(100*ks_imbalanced.mean(), 100*ks_imbalanced.std()))
print('Avg MCC           = {:.2f}+/-{:.2f}'.format(100*mccs_imbalanced.mean(), 100*mccs_imbalanced.std()))
print('Avg Accuracy      = {:.2f}+/-{:.2f}'.format(100*accs_imbalanced.mean(), 100*accs_imbalanced.std()))
print('Avg Bal. Accuracy = {:.2f}+/-{:.2f}'.format(100*bal_accs_imbalanced.mean(), 100*bal_accs_imbalanced.std()))

BALANCED DATASET:
Avg Quad-kappa    = 80.09+/-2.27
Avg MCC           = 80.03+/-1.01
Avg Accuracy      = 84.00+/-0.81
Avg Bal. Accuracy = 84.00+/-0.81
----------------------------------------
IMBALANCED DATASET:
Avg Quad-kappa    = 62.77+/-3.37
Avg MCC           = 75.80+/-1.18
Avg Accuracy      = 84.02+/-0.81
Avg Bal. Accuracy = 83.91+/-3.19


Note that even though balanced accuracy and accuracy are similar on average, on the imbalanced dataset the standard deviation of the balanced accuracy is much larger (3.19 vs. 0.81); it is almost as high as the standard deviation of the quadratic kappa (3.37)!
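The reason for this volatility can be illustrated with a single mistake. The sketch below assumes the same imbalanced class counts as above and uses `_demo` variable names introduced only for this example. Balanced accuracy is the mean of the per-class recalls, so one error in a class with 10 samples costs 10 points of recall for that class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

y_true_demo = np.array(280*[0] + 100*[1] + 100*[2] + 10*[3] + 10*[4])
y_pred_demo = y_true_demo.copy()
y_pred_demo[-1] = 0  # a single grade-4 sample is now predicted as grade 0

per_class_recall = recall_score(y_true_demo, y_pred_demo, average=None)
bal_acc_demo = balanced_accuracy_score(y_true_demo, y_pred_demo)
acc_demo = accuracy_score(y_true_demo, y_pred_demo)

print(per_class_recall)         # recall of class 4 drops from 1.0 to 0.9
print(per_class_recall.mean())  # mean of the per-class recalls
print(bal_acc_demo)             # same value as the mean above
print(acc_demo)
```

One wrong minority-class prediction lowers balanced accuracy by 2 points (1.00 to 0.98) but accuracy by only 0.2 points (1.000 to 0.998), so the random placement of a handful of errors is enough to produce the large spread seen above.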

### Conclusion:
If one has datasets from different regions or different moments in time, one should be careful when comparing the performance of the same model across them using the quadratic kappa metric, since this is not an apples-to-apples comparison. It would be wiser to use accuracy or MCC instead. It is also interesting to note that balanced accuracy is only a fair metric when averaged over a large number of datasets, because on a single dataset with small minority classes it is very noisy.