# Comparison of classifiers' predicted probabilities: t-test, Wilcoxon test and others

In this notebook, we use paired tests to assess whether the posterior probabilities of two classifiers differ significantly.

First, let's load the Breast Cancer dataset. We will build two random forests, with 50 and 51 estimators, in the hope that there is no real difference between them.



In [6]:
from __future__ import print_function  # __future__ imports must come first

import numpy as np
import scipy.stats  # `import scipy` alone does not guarantee that scipy.stats is loaded
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer, load_digits
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()

data = dataset.data
target = dataset.target



Now, let's perform a paired t-test.

In [7]:
for run in range(10):
    # New random 80/20 split on every run (no random_state is fixed)
    X_train, X_test, Y_train, Y_test = train_test_split(data, target, test_size=0.2)

    clf_A = RandomForestClassifier(n_estimators=50)
    clf_B = RandomForestClassifier(n_estimators=51)

    clf_A.fit(X_train, Y_train)
    clf_B.fit(X_train, Y_train)

    # Predicted probability of class 1 for each test sample
    prob_A = clf_A.predict_proba(X_test)[:, 1]
    prob_B = clf_B.predict_proba(X_test)[:, 1]

    # Paired t-test: both classifiers are evaluated on the same test samples
    print("{} run".format(run + 1))
    print("P-value : {}".format(scipy.stats.ttest_rel(prob_A, prob_B).pvalue))

1 run
P-value : 0.803051858565
2 run
P-value : 0.235007027539
3 run
P-value : 0.246274852779
4 run
P-value : 0.0624261284336
5 run
P-value : 0.048288650826
6 run
P-value : 0.279970029367
7 run
P-value : 0.989384821222
8 run
P-value : 0.577798898327
9 run
P-value : 0.922509261412
10 run
P-value : 0.967157109588


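As we can see, the p-values vary widely from run to run, and one of the ten falls below 0.05 even though we expect no real difference between the two classifiers. Rejecting the null hypothesis there would be a type I error. Is it because the paired differences are not normally distributed, as the t-test assumes?

Let's perform the non-parametric Wilcoxon signed-rank test.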
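Before switching tests, one way to probe the normality question is to run a Shapiro-Wilk test on the paired differences. The sketch below is only an illustration: `paired_difference_normality` is a hypothetical helper that is not part of the original notebook, and the two probability vectors are synthetic stand-ins, not outputs of the forests above.

```python
import numpy as np
from scipy import stats


def paired_difference_normality(prob_a, prob_b):
    """Shapiro-Wilk test on the paired differences prob_a - prob_b.

    Hypothetical helper for illustration; returns (W statistic, p-value).
    """
    diff = np.asarray(prob_a, dtype=float) - np.asarray(prob_b, dtype=float)
    statistic, pvalue = stats.shapiro(diff)
    return statistic, pvalue


# Synthetic stand-ins (assumption): 114 values, roughly the size of a 20%
# test split of the 569 Breast Cancer samples, with most of the mass near
# 0 and 1 as for a confident classifier.
demo_rng = np.random.default_rng(0)
prob_a_demo = demo_rng.beta(0.3, 0.3, size=114)
prob_b_demo = np.clip(prob_a_demo + demo_rng.normal(0.0, 0.02, size=114), 0.0, 1.0)

w_stat, p_norm = paired_difference_normality(prob_a_demo, prob_b_demo)
print("Shapiro-Wilk W = {:.3f}, p-value = {:.3g}".format(w_stat, p_norm))
```

In the notebook, the same helper could be called with `prob_A` and `prob_B` inside the loop. A small p-value would suggest that the differences are not normal, which is the situation in which a non-parametric test is usually preferred.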

In [8]:
for run in range(10):
    # New random 80/20 split on every run (no random_state is fixed)
    X_train, X_test, Y_train, Y_test = train_test_split(data, target, test_size=0.2)

    clf_A = RandomForestClassifier(n_estimators=50)
    clf_B = RandomForestClassifier(n_estimators=51)

    clf_A.fit(X_train, Y_train)
    clf_B.fit(X_train, Y_train)

    # Predicted probability of class 1 for each test sample
    prob_A = clf_A.predict_proba(X_test)[:, 1]
    prob_B = clf_B.predict_proba(X_test)[:, 1]

    # Wilcoxon signed-rank test on the paired probabilities
    print("{} run".format(run + 1))
    print("P-value : {}".format(scipy.stats.wilcoxon(prob_A, prob_B).pvalue))

1 run
P-value : 0.107725164429
2 run
P-value : 0.315788095921
3 run
P-value : 0.629511470224
4 run
P-value : 0.802640192194
5 run
P-value : 0.433733102243
6 run
P-value : 0.635596646395
7 run
P-value : 0.49267168113
8 run
P-value : 0.0365398300494
9 run
P-value : 0.386452210453
10 run
P-value : 0.237561828978


There is high variability here too. Each run draws a new random train/test split (and trains newly seeded forests), and the p-value seems to depend mainly on that split. Let's try this on another dataset, Digits.
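To get a rough idea of where the variability comes from, we can fix one source of randomness at a time. The sketch below assumes that passing `random_state` to `train_test_split` and to `RandomForestClassifier` is enough to control both sources; `wilcoxon_pvalue` is a hypothetical helper, and five runs per setting are far too few to draw a firm conclusion.

```python
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def wilcoxon_pvalue(data, target, split_seed, seed_a, seed_b):
    """Hypothetical helper: one run of the experiment with explicit seeds."""
    X_train, X_test, y_train, _ = train_test_split(
        data, target, test_size=0.2, random_state=split_seed)
    clf_a = RandomForestClassifier(n_estimators=50, random_state=seed_a)
    clf_b = RandomForestClassifier(n_estimators=51, random_state=seed_b)
    clf_a.fit(X_train, y_train)
    clf_b.fit(X_train, y_train)
    prob_a = clf_a.predict_proba(X_test)[:, 1]
    prob_b = clf_b.predict_proba(X_test)[:, 1]
    return stats.wilcoxon(prob_a, prob_b).pvalue


bc = load_breast_cancer()
n_runs = 5

# Forest seeds fixed, only the train/test split changes between runs
vary_split = [wilcoxon_pvalue(bc.data, bc.target, split_seed=s, seed_a=0, seed_b=1)
              for s in range(n_runs)]

# Split fixed, only the forest seeds change between runs
vary_forest = [wilcoxon_pvalue(bc.data, bc.target, split_seed=0,
                               seed_a=2 * s, seed_b=2 * s + 1)
               for s in range(n_runs)]

print("Varying split  :", ["{:.3f}".format(p) for p in vary_split])
print("Varying forests:", ["{:.3f}".format(p) for p in vary_forest])
```

If the p-values spread out about as much in both settings, the split would not be the only driver of the instability.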

In [9]:
dataset = load_digits()

data = dataset.data
target = dataset.target


for run in range(10):
    # New random 80/20 split on every run (no random_state is fixed)
    X_train, X_test, Y_train, Y_test = train_test_split(data, target, test_size=0.2)

    clf_A = RandomForestClassifier(n_estimators=50)
    clf_B = RandomForestClassifier(n_estimators=51)

    clf_A.fit(X_train, Y_train)
    clf_B.fit(X_train, Y_train)

    # Digits has 10 classes; column 0 is the predicted probability of digit 0 only
    prob_A = clf_A.predict_proba(X_test)[:, 0]
    prob_B = clf_B.predict_proba(X_test)[:, 0]

    # Wilcoxon signed-rank test on the paired probabilities
    print("{} run".format(run + 1))
    print("P-value : {}".format(scipy.stats.wilcoxon(prob_A, prob_B).pvalue))

1 run
P-value : 0.274646747515
2 run
P-value : 0.376916372728
3 run
P-value : 0.278462636287
4 run
P-value : 0.206219391555
5 run
P-value : 0.120399696459
6 run
P-value : 0.0319334504654
7 run
P-value : 0.00289103538983
8 run
P-value : 0.0532025213858
9 run
P-value : 0.685694385542
10 run
P-value : 0.0192099698263


From these examples, it seems that paired tests on posterior probabilities give very unstable results from one run to the next.
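To put this variability in context, it may help to recall that when the null hypothesis is exactly true and the test's assumptions hold, p-values are uniformly distributed on [0, 1], so a wide spread between runs is expected and about 5% of runs fall below 0.05. The sketch below illustrates this with purely synthetic, normally distributed paired data (an assumption made for the illustration; these are not classifier outputs).

```python
import numpy as np
from scipy import stats

sim_rng = np.random.default_rng(42)
n_repeats, n_samples = 2000, 114  # 114 is roughly 20% of 569 samples

pvalues = np.empty(n_repeats)
for i in range(n_repeats):
    # Paired samples whose true mean difference is exactly zero
    a = sim_rng.normal(0.0, 1.0, size=n_samples)
    b = a + sim_rng.normal(0.0, 0.1, size=n_samples)
    pvalues[i] = stats.ttest_rel(a, b).pvalue

rejection_rate = np.mean(pvalues < 0.05)
print("Share of p-values below 0.05: {:.3f}".format(rejection_rate))
print("Quartiles of the p-values   :", np.percentile(pvalues, [25, 50, 75]).round(2))
```

Ten runs are therefore too few to tell whether a test rejects more often than its nominal level. Repeating the experiments above many more times and counting rejections would give a better picture.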