# Classifier comparison

Let's compare the performance of the different classifiers.

The `folds` dictionary holds, for each classifier, the score obtained on each fold of a 10-fold cross-validation (the metric is not stated here; is it ROC AUC?). Note that `i0_f0` lists only 9 values. We will use statistical tests to assess whether any of the classifiers differ significantly from the others.


In [1]:
from scipy import stats

folds = {
    "f0_i0_sr" : [
        31.272509,
        33.193277,
        27.85114,
        30.552221,
        31.512605,
        30.612245,
        30.792317,
        29.711885,
        32.232893,
        30.630631
    ],
    "i0" : [
        29.471789,
        29.951981,
        31.212485,
        28.631453,
        30.972389,
        29.351741,
        30.372149,
        28.391357,
        28.091236,
        31.111111,
    ],

    "i0_f0" : [ 
        30.492197,
        30.372149,
        29.951981,
        30.792317,
        30.552221,
        30.912365,
        28.691477,
        28.631453,
        31.711712,
    ],

    "i0_sr" : [
        32.172869,
        32.533013,
        30.372149,
        31.572629,
        31.872749,
        30.312125,
        30.252101,
        30.072029,
        31.932773,
        30.21021,
    ]
}
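Before running any test, it can help to look at the sample size, mean and standard deviation of each classifier. The sketch below is only illustrative: `summarize` is a hypothetical helper that is not part of the original notebook, and the demo uses two groups copied from `folds` so that it runs on its own. In the notebook you would pass `folds` directly.

```python
from statistics import mean, stdev


def summarize(groups):
    """Return {name: (n, mean, sample standard deviation)} for each group."""
    return {name: (len(scores), mean(scores), stdev(scores))
            for name, scores in groups.items()}


# Two groups copied from the `folds` dictionary above.
example = {
    "i0": [29.471789, 29.951981, 31.212485, 28.631453, 30.972389,
           29.351741, 30.372149, 28.391357, 28.091236, 31.111111],
    "i0_f0": [30.492197, 30.372149, 29.951981, 30.792317, 30.552221,
              30.912365, 28.691477, 28.631453, 31.711712],
}

summary = summarize(example)
for name, (n, avg, sd) in summary.items():
    print("{}: n={}, mean={:.3f}, sd={:.3f}".format(name, n, avg, sd))
```

A group with fewer than 10 values, such as `i0_f0` here, signals a missing fold score. The tests for independent samples used below still accept groups of different sizes.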


# Statistical tests

Let's consider two options:

- Use one-way ANOVA: are its assumptions met? (normality of the errors, equality of variances, etc.)
- Use the non-parametric counterpart of one-way ANOVA, Kruskal-Wallis (here only the equal-variance assumption is made)
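The assumptions mentioned in the first bullet can be checked empirically. As a rough sketch, `stats.shapiro` (Shapiro-Wilk) tests each group for normality and `stats.levene` tests for equal variances across groups. `check_anova_assumptions` is a hypothetical helper name, testing each group separately is a simplification of checking the normality of the errors, and the demo uses two groups copied from `folds` so that it runs on its own.

```python
from scipy import stats


def check_anova_assumptions(groups):
    """Return Shapiro-Wilk p-values per group and the Levene p-value across groups."""
    normality = {}
    for name, scores in groups.items():
        # Null hypothesis: the scores come from a normal distribution.
        _, p_normal = stats.shapiro(scores)
        normality[name] = p_normal
    # Null hypothesis: all groups have the same variance.
    _, p_equal_var = stats.levene(*groups.values())
    return normality, p_equal_var


# Two groups copied from the `folds` dictionary; pass `folds` itself in the notebook.
example = {
    "i0": [29.471789, 29.951981, 31.212485, 28.631453, 30.972389,
           29.351741, 30.372149, 28.391357, 28.091236, 31.111111],
    "i0_sr": [32.172869, 32.533013, 30.372149, 31.572629, 31.872749,
              30.312125, 30.252101, 30.072029, 31.932773, 30.21021],
}

normality, p_equal_var = check_anova_assumptions(example)
for name, p in normality.items():
    print("Shapiro-Wilk {}: p = {:.3f}".format(name, p))
print("Levene: p = {:.3f}".format(p_equal_var))
```

Small p-values would argue against the corresponding assumption. With about 10 values per group these checks have little power, so a large p-value is only weak reassurance.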

In [2]:
help(stats.f_oneway)


Help on function f_oneway in module scipy.stats.stats:

f_oneway(*args)
    Performs a 1-way ANOVA.
    
    The one-way ANOVA tests the null hypothesis that two or more groups have
    the same population mean.  The test is applied to samples from two or
    more groups, possibly with differing sizes.
    
    Parameters
    ----------
    sample1, sample2, ... : array_like
        The sample measurements for each group.
    
    Returns
    -------
    statistic : float
        The computed F-value of the test.
    pvalue : float
        The associated p-value from the F-distribution.
    
    Notes
    -----
    The ANOVA test has important assumptions that must be satisfied in order
    for the associated p-value to be valid.
    
    1. The samples are independent.
    2. Each sample is from a normally distributed population.
    3. The population standard deviations of the groups are all equal.  This
       property is known as homoscedasticity.
    
    If these assumptions are 

In [3]:
help(stats.kruskal)

Help on function kruskal in module scipy.stats.stats:

kruskal(*args, **kwargs)
    Compute the Kruskal-Wallis H-test for independent samples
    
    The Kruskal-Wallis H-test tests the null hypothesis that the population
    median of all of the groups are equal.  It is a non-parametric version of
    ANOVA.  The test works on 2 or more independent samples, which may have
    different sizes.  Note that rejecting the null hypothesis does not
    indicate which of the groups differs.  Post-hoc comparisons between
    groups are required to determine which groups are different.
    
    Parameters
    ----------
    sample1, sample2, ... : array_like
       Two or more arrays with the sample measurements can be given as
       arguments.
    nan_policy : {'propagate', 'raise', 'omit'}, optional
        Defines how to handle when input contains nan. 'propagate' returns nan,
        'raise' throws an error, 'omit' performs the calculations ignoring nan
        values. Default is 'propaga

In [4]:
print("Comparing {}".format(folds.keys()))
stats.kruskal(*folds.values())

Comparing ['i0', 'i0_sr', 'f0_i0_sr', 'i0_f0']


KruskalResult(statistic=6.1450612281649661, pvalue=0.10476205937416561)

In [5]:
print("Comparing {}".format(folds.keys()))

stats.f_oneway(*folds.values())

Comparing ['i0', 'i0_sr', 'f0_i0_sr', 'i0_f0']


F_onewayResult(statistic=2.7932093521527301, pvalue=0.054636757342817123)

Neither test rejects the null hypothesis at the $0.05$ significance level (Kruskal-Wallis: $p = 0.105$; one-way ANOVA: $p = 0.055$), so there is no evidence that these classifiers perform differently.

# Incremental comparison

Let's start with $I$ and add variables one by one, checking if that makes a significant difference.

As the output below shows, adding $SR$ to $I$ significantly increases the accuracy of the classifier ($p = 0.0116$). Adding $F_0$, whether to $I$ alone ($p = 0.207$) or to $I + SR$ ($p = 0.485$), doesn't change it significantly.

In [44]:
# Number of fold scores available for each classifier
for k, v in folds.items():
    print(len(v))

10
10
10
9


In [46]:

def compare(k1, k2):
    """Print the Mann-Whitney U test p-value for the fold scores of two classifiers."""
    pval = stats.mannwhitneyu(folds[k1], folds[k2]).pvalue
    print("Comparing {} with {} pval = {}".format(k1, k2, pval))
    
compare("i0", "i0_sr")
compare("i0", "i0_f0")
compare("i0_sr", "f0_i0_sr")

Comparing i0 with i0_sr pval = 0.0116451174645
Comparing i0 with i0_f0 pval = 0.206800512526
Comparing i0_sr with f0_i0_sr pval = 0.484924988497
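The p-value reported by `stats.mannwhitneyu` depends on the alternative hypothesis. As far as we can tell, older SciPy releases returned a one-sided p-value when `alternative` was left unset, whereas recent releases default to a two-sided test, so it seems safer to pass `alternative` explicitly. The sketch below does this for `i0` against `i0_sr`, with the values copied from `folds`. `mannwhitney_pvalues` is a hypothetical helper, and the code assumes a SciPy version that accepts the `alternative` keyword.

```python
from scipy import stats

# Fold scores copied from the `folds` dictionary.
i0 = [29.471789, 29.951981, 31.212485, 28.631453, 30.972389,
      29.351741, 30.372149, 28.391357, 28.091236, 31.111111]
i0_sr = [32.172869, 32.533013, 30.372149, 31.572629, 31.872749,
         30.312125, 30.252101, 30.072029, 31.932773, 30.21021]


def mannwhitney_pvalues(x, y):
    """Return (two-sided p-value, one-sided p-value for 'x tends to be smaller than y')."""
    two_sided = stats.mannwhitneyu(x, y, alternative="two-sided").pvalue
    less = stats.mannwhitneyu(x, y, alternative="less").pvalue
    return two_sided, less


p_two_sided, p_less = mannwhitney_pvalues(i0, i0_sr)
print("two-sided p = {:.4f}".format(p_two_sided))
print("one-sided (i0 < i0_sr) p = {:.4f}".format(p_less))
```

Comparing these two values against the p-value recorded above indicates which convention the original run used.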
