Chapter 28
# Rank Significance Tests

In applied machine learning, we often need to determine whether two data samples have the same or different distributions.  If the data do not follow the familiar Gaussian distribution, we must resort to nonparametric versions of the significance tests.  These tests operate in a similar manner but are distribution-free, and require that real-valued data first be transformed into rank data before the test can be performed.

# Nonparametric Statistical Significance Tests
Nonparametric statistical methods were developed for use with ordinal or interval data, but in practice can also be used with a ranking of real-valued observations in a data sample, rather than on the observation values themselves.
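
As a small illustration of what "ranking real-valued observations" means, the sketch below converts a handful of made-up values (chosen only for this example) into ranks with SciPy's rankdata().  By default, tied values share the average of the ranks they would otherwise occupy.

```python
# sketch: transform real-valued observations into rank data
import numpy as np
from scipy.stats import rankdata

# illustrative values only (not taken from the test dataset)
values = np.array([52.3, 50.1, 58.7, 52.3, 55.0])

# the smallest value gets rank 1; the two tied values share rank 2.5
ranks = rankdata(values)
print(ranks)  # ranks are 2.5, 1, 5, 2.5, 4
```

The rank tests in this chapter work on these ranks rather than on the original values, which is why they do not depend on the shape of the underlying distribution.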

A common question about two or more datasets is whether they are different: specifically, whether the difference in their central tendency (e.g. mean or median) is statistically significant.

The null hypothesis of nonparametric statistical significance tests is often the assumption that both samples were drawn from a population with the same distribution, and therefore have the same population parameters (such as mean or median).

These tests are often used on samples of model skill scores in order to confirm that the difference in skill between machine learning models is significant.

In general, each test calculates a test statistic that must be interpreted with some background in statistics and a deeper knowledge of the statistical test itself.  Tests also return a p-value that can be used to interpret the result of the test: it is the probability of observing a test statistic at least as extreme as the one calculated, given the null hypothesis that the two samples were drawn from a population with the same distribution.  The p-value can be interpreted in the context of a chosen significance level called alpha (commonly 5% or 0.05):
- p-value <= alpha: significant result, so reject the null hypothesis i.e. samples were likely drawn from populations with differing distributions
- p-value > alpha: non-significant result, so fail to reject the null hypothesis i.e. there is not enough evidence that the samples were drawn from differing distributions
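
The decision rule above can be wrapped in a small helper.  This is only a sketch: the function name interpret() is hypothetical and not part of SciPy, and the p-values passed to it are arbitrary illustrations.

```python
# sketch: apply the p-value versus alpha decision rule
def interpret(p, alpha=0.05):
    """Return the decision for a p-value at significance level alpha."""
    if p <= alpha:
        return 'reject H0'
    return 'fail to reject H0'

# arbitrary example p-values
print(interpret(0.012))
print(interpret(0.208))
```

Note that a p-value exactly equal to alpha is treated as significant, matching the `<=` in the rule above.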

# Test Dataset
We will generate two samples drawn from different distributions:
- one sample of 100 uniform random numbers between 50 and 60
- one sample of 100 uniform random numbers between 51 and 61

We expect the statistical tests to discover that the samples were drawn from differing distributions, although the small sample size will add some noise.

In [1]:
# generate data samples
from numpy.random import seed
from numpy.random import rand

# seed the random number generator
seed(1)

# generate two sets of univariate observations
data1 = 50 + (rand(100) * 10)
data2 = 51 + (rand(100) * 10)

# summarize
print('data1: min=%.3f max=%.3f' % (min(data1), max(data1)))
print('data2: min=%.3f max=%.3f' % (min(data2), max(data2)))

data1: min=50.001 max=59.889
data2: min=51.126 max=60.973


# Mann-Whitney U Test
This is a nonparametric statistical significance test for determining whether two independent samples were drawn from a population with the same distribution.

The null hypothesis is that there is no difference between the distributions of the data samples.  Rejection of this hypothesis suggests that there is likely some difference between the samples.

More specifically, the test determines whether it is equally likely that any randomly selected observation from one sample will be greater or less than a randomly selected observation from the other sample.  If this is violated, it suggests differing distributions:
- fail to reject H0 - sample distributions are equal
- reject H0 - sample distributions are not equal

For the test to be effective, it requires at least 20 observations in each data sample.
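
To make the "greater or less than" idea concrete, the sketch below computes the U statistic by hand from the pooled ranks, using the standard formula U1 = R1 - n1(n1 + 1)/2, where R1 is the rank sum of the first sample.  U1 counts the pairs in which an observation from the first sample exceeds one from the second.  The variable names are illustrative, and depending on the SciPy version the reported statistic may correspond to either sample, so the comparison only checks that it matches one of the two U values.

```python
# sketch: compute the mann-whitney U statistic from pooled ranks
import numpy as np
from numpy.random import seed, rand
from scipy.stats import rankdata, mannwhitneyu

# same test dataset as used throughout this chapter
seed(1)
data1 = 50 + (rand(100) * 10)
data2 = 51 + (rand(100) * 10)
n1, n2 = len(data1), len(data2)

# rank all observations together, then sum the ranks of the first sample
ranks = rankdata(np.concatenate((data1, data2)))
r1 = ranks[:n1].sum()

# U for each sample; the two values always sum to n1 * n2
u1 = r1 - n1 * (n1 + 1) / 2
u2 = n1 * n2 - u1

# proportion of pairs in which a data1 observation exceeds a data2 observation
prob_greater = u1 / (n1 * n2)
print('U1=%.1f U2=%.1f P(x1 > x2)=%.3f' % (u1, u2, prob_greater))

# compare against SciPy (statistic should equal U1 or U2)
stat, p = mannwhitneyu(data1, data2, alternative='two-sided')
print('SciPy statistic=%.1f' % stat)
```

Under the null hypothesis the proportion would be close to 0.5; a value well away from 0.5 is what the test picks up on.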

We can implement the Mann-Whitney U test using the SciPy function mannwhitneyu(), which takes the two data samples as arguments and returns the test statistic and p-value.

In [4]:
# example of the mann-whitney u test
from numpy.random import seed
from numpy.random import rand
from scipy.stats import mannwhitneyu

# seed the random number generator
seed(1)

# generate two independent samples
data1 = 50 + (rand(100) * 10)
data2 = 51 + (rand(100) * 10)

# compare samples
stat, p = mannwhitneyu(data1, data2)
print('Statistics=%.3f, p=%.3f' % (stat, p))

# interpret
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0)')
else:
    print('Different distribution (reject H0)')

# the p-value of 0.012 strongly suggests the sample distributions are different

Statistics=4077.000, p=0.012
Different distribution (reject H0)


# Wilcoxon Signed-Rank Test
In some cases, the data samples may be paired e.g. the samples
- are related or matched in some way
- represent two measurements of the same technique
- each contain independent observations, but come from the same population

Examples of paired samples in machine learning might be:
- the same algorithm evaluated on different datasets
- different algorithms evaluated on exactly the same training and test data

The samples are not independent, and therefore the Mann-Whitney U test cannot be used.  Instead, the Wilcoxon signed-rank test is used.  This is the equivalent of the paired Student's t-test, but for ranked data instead of real-valued data with a Gaussian distribution.

The null hypothesis is that the two samples have the same distribution:
- fail to reject H0 - the sample distributions are equal
- reject H0 - the sample distributions are not equal

For the test to be effective, it requires at least 20 observations in each data sample.
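
The name "signed-rank" describes the calculation: take the paired differences, rank their absolute values, and sum the ranks separately for positive and negative differences.  The sketch below does this by hand and compares the smaller of the two sums with the statistic from SciPy's two-sided wilcoxon().  Variable names are illustrative, and the sketch assumes there are no zero differences or ties, which holds for continuous random data like ours.

```python
# sketch: compute the wilcoxon signed-rank statistic by hand
import numpy as np
from numpy.random import seed, rand
from scipy.stats import rankdata, wilcoxon

# same test dataset as used throughout this chapter (treated as paired)
seed(1)
data1 = 50 + (rand(100) * 10)
data2 = 51 + (rand(100) * 10)

# paired differences and the ranks of their absolute values
diff = data1 - data2
ranks = rankdata(np.abs(diff))

# sum the ranks belonging to positive and to negative differences
w_plus = ranks[diff > 0].sum()
w_minus = ranks[diff < 0].sum()
w = min(w_plus, w_minus)
print('W+=%.1f W-=%.1f W=%.1f' % (w_plus, w_minus, w))

# compare against SciPy's default two-sided test
stat, p = wilcoxon(data1, data2)
print('SciPy statistic=%.1f' % stat)
```

If the paired samples shared a distribution, the two rank sums would be roughly equal; a large imbalance produces a small statistic and a small p-value.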

The test can be implemented using the SciPy function wilcoxon(), which takes the two samples as arguments, and returns the calculated statistic and p-value.

In our example, the two samples are technically not paired, but we can pretend they are for the sake of demonstrating the calculation of this significance test.

In [8]:
# example of the wilcoxon signed-rank test
from numpy.random import seed
from numpy.random import rand
from scipy.stats import wilcoxon

# seed the random number generator
seed(1)

# generate two independent samples
data1 = 50 + (rand(100) * 10)
data2 = 51 + (rand(100) * 10)

# compare samples
stat, p = wilcoxon(data1, data2)
print('Statistics=%.3f, p=%.3f' % (stat, p))

# interpret
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0)')
else:
    print('Different distribution (reject H0)')

# the p-value of 0.043 is just below alpha, suggesting that the samples are drawn from different distributions

Statistics=1937.000, p=0.043
Different distribution (reject H0)


# Kruskal-Wallis H Test
When working with significance tests (such as Mann-Whitney U and Wilcoxon signed-rank), comparisons between data samples must be performed pairwise.  This can be inefficient if you have many data samples and are only interested in whether two or more samples have a different distribution.

The Kruskal-Wallis test is a nonparametric version of the one-way analysis of variance (ANOVA) test, and can be used to determine whether more than two independent samples have a different distribution.  It can be thought of as a generalisation of the Mann-Whitney U test.

The null hypothesis is that all data samples were drawn from the same distribution: specifically, that the population medians of all groups are equal.

A rejection of the null hypothesis indicates that there is enough evidence to suggest that one or more samples dominate another sample, but the test does not indicate which samples or by how much:
- fail to reject H0 - all sample distributions are equal
- reject H0 - one or more sample distributions are not equal

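Each data sample must be independent and have 5 or more observations.  The data samples can differ in size.

The test can be implemented using the SciPy function kruskal(), which takes two or more data samples as arguments and returns the calculated statistic and p-value.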
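
As a rough check of what the H statistic measures, the sketch below computes it by hand from the pooled ranks using the standard formula H = 12 / (N(N + 1)) * sum(R_i^2 / n_i) - 3(N + 1), where R_i is the rank sum of group i, n_i its size and N the total number of observations.  It assumes there are no tied values (so no tie correction is needed), and the variable names are illustrative.

```python
# sketch: compute the kruskal-wallis H statistic from pooled ranks
import numpy as np
from numpy.random import seed, rand
from scipy.stats import rankdata, kruskal

# three independent samples with slightly different means
seed(1)
samples = [50 + (rand(100) * 10), 51 + (rand(100) * 10), 52 + (rand(100) * 10)]
sizes = [len(s) for s in samples]
n_total = sum(sizes)

# rank all observations together, then split the ranks back into groups
ranks = rankdata(np.concatenate(samples))
rank_groups = np.split(ranks, np.cumsum(sizes)[:-1])

# H grows as the group rank sums move apart
h = 12.0 / (n_total * (n_total + 1)) * sum(r.sum() ** 2 / len(r) for r in rank_groups) - 3 * (n_total + 1)
print('H by hand=%.3f' % h)

# compare against SciPy
stat, p = kruskal(*samples)
print('SciPy statistic=%.3f' % stat)
```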

In [6]:
# example of the kruskal-wallis h-test
from numpy.random import seed
from numpy.random import rand
from scipy.stats import kruskal

# seed the random number generator
seed(1)

# generate three independent samples, each with a slightly different sample mean
data1 = 50 + (rand(100) * 10)
data2 = 51 + (rand(100) * 10)
data3 = 52 + (rand(100) * 10)

# compare samples
stat, p = kruskal(data1, data2, data3)
print('Statistics=%.3f, p=%.3f' % (stat, p))

# interpret
alpha = 0.05
if p > alpha:
    print('Same distributions (fail to reject H0)')
else:
    print('Different distributions (reject H0)')

# the p-value of 0.00 correctly suggests that we reject the null hypothesis

Statistics=34.747, p=0.000
Different distributions (reject H0)


# Friedman Test
As in the previous example, we may have more than two different samples and an interest in whether all samples have the same distribution.  If the samples are paired in some way (such as repeated measures) then the Kruskal-Wallis H test would not be appropriate.

Instead the Friedman test can be used: this is the nonparametric version of the repeated measures analysis of variance test, or repeated measures ANOVA.  The test can be thought of as the paired-sample counterpart of the Kruskal-Wallis H test, extending the paired comparison to more than two samples.

The null hypothesis is that the multiple paired samples have the same distribution:
- fail to reject H0 - paired sample distributions are equal
- reject H0 - paired sample distributions are not equal

The test assumes two or more paired data samples with 10 or more observations per group (SciPy's implementation requires at least three samples).
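
Unlike the Kruskal-Wallis H test, the Friedman test ranks the observations within each matched row rather than across the pooled data.  The sketch below computes the statistic by hand with the standard formula Q = 12 / (n k (k + 1)) * sum(R_j^2) - 3 n (k + 1), where n is the number of rows, k the number of samples and R_j the rank sum of sample j.  It assumes no ties within a row, and the variable names are illustrative.

```python
# sketch: compute the friedman statistic from within-row ranks
import numpy as np
from numpy.random import seed, rand
from scipy.stats import rankdata, friedmanchisquare

# three samples, treated as paired: one row per matched observation
seed(1)
data = np.column_stack([50 + (rand(100) * 10), 51 + (rand(100) * 10), 52 + (rand(100) * 10)])
n, k = data.shape

# rank the k values within each row, then sum the ranks per sample
ranks = np.apply_along_axis(rankdata, 1, data)
rank_sums = ranks.sum(axis=0)

# Q grows as the per-sample rank sums move apart
q = 12.0 / (n * k * (k + 1)) * np.sum(rank_sums ** 2) - 3 * n * (k + 1)
print('Q by hand=%.3f' % q)

# compare against SciPy
stat, p = friedmanchisquare(data[:, 0], data[:, 1], data[:, 2])
print('SciPy statistic=%.3f' % stat)
```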

The test can be implemented using the SciPy function friedmanchisquare(), which takes as arguments the data samples to compare, and returns the calculated statistic and p-value.

In our example, the samples are technically not paired, but we can pretend they are for the sake of demonstrating the calculation of this significance test.

In [9]:
# example of the friedman test
from numpy.random import seed
from numpy.random import rand
from scipy.stats import friedmanchisquare

# seed the random number generator
seed(1)

# generate three independent samples
data1 = 50 + (rand(100) * 10)
data2 = 51 + (rand(100) * 10)
data3 = 52 + (rand(100) * 10)

# compare samples
stat, p = friedmanchisquare(data1, data2, data3)
print('Statistics=%.3f, p=%.3f' % (stat, p))

# interpret
alpha = 0.05
if p > alpha:
    print('Same distributions (fail to reject H0)')
else:
    print('Different distributions (reject H0)')

# the p-value of 0.00 correctly suggests that we reject the null hypothesis

Statistics=36.240, p=0.000
Different distributions (reject H0)


# Extensions

In [13]:
# update the mann-whitney u test example to operate on data samples that have the same distribution
from numpy.random import seed
from numpy.random import rand
from scipy.stats import mannwhitneyu

# seed the random number generator
seed(1)

# generate two independent samples
data1 = 50 + (rand(1000) * 10)
data2 = 50 + (rand(1000) * 10)

# compare samples
stat, p = mannwhitneyu(data1, data2)
print('Statistics=%.3f, p=%.3f' % (stat, p))

# interpret
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0)')
else:
    print('Different distribution (reject H0)')

# the p-value of 0.085 is above alpha, so there is no evidence that the sample distributions differ

Statistics=482249.000, p=0.085
Same distribution (fail to reject H0)


In [14]:
# update the wilcoxon signed-rank test example to operate on data samples that have the same distribution
from numpy.random import seed
from numpy.random import rand
from scipy.stats import wilcoxon

# seed the random number generator
seed(1)

# generate two independent samples
data1 = 50 + (rand(1000) * 10)
data2 = 50 + (rand(1000) * 10)

# compare samples
stat, p = wilcoxon(data1, data2)
print('Statistics=%.3f, p=%.3f' % (stat, p))

# interpret
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0)')
else:
    print('Different distribution (reject H0)')

# the p-value of 0.208 is well above alpha, so there is no evidence that the samples are drawn from different distributions

Statistics=238750.000, p=0.208
Same distribution (fail to reject H0)


In [23]:
# update the kruskal-wallis h-test example to operate on data samples that have the same distribution
from numpy.random import seed
from numpy.random import rand
from scipy.stats import kruskal

# seed the random number generator
seed(1)

# generate three independent samples, each with the same sample mean
data1 = 50 + (rand(100) * 10)
data2 = 50 + (rand(100) * 10)
data3 = 50 + (rand(100) * 10)

# compare samples
stat, p = kruskal(data1, data2, data3)
print('Statistics=%.3f, p=%.3f' % (stat, p))

# interpret
alpha = 0.05
if p > alpha:
    print('Same distributions (fail to reject H0)')
else:
    print('Different distributions (reject H0)')

# the p-value of 0.188 with a sample size of 100 correctly suggests that we fail to reject the null hypothesis
# however, with sample sizes of 1000 the p-value of 0.018 wrongly suggests that the distributions are different (a false positive)

Statistics=3.343, p=0.188
Same distributions (fail to reject H0)
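
A single run can reject the null hypothesis even when all samples share a distribution.  The sketch below is a rough simulation, with an arbitrarily chosen number of trials, that repeats the Kruskal-Wallis H test on freshly drawn samples from the same distribution and counts how often the null hypothesis is rejected.  The rejection rate should sit somewhere near alpha.

```python
# sketch: estimate how often the test rejects H0 when H0 is actually true
from numpy.random import seed, rand
from scipy.stats import kruskal

seed(1)
alpha = 0.05
trials = 500  # arbitrary choice for this illustration
rejections = 0

for _ in range(trials):
    # three independent samples drawn from the same distribution
    a = 50 + (rand(100) * 10)
    b = 50 + (rand(100) * 10)
    c = 50 + (rand(100) * 10)
    _, p = kruskal(a, b, c)
    if p <= alpha:
        rejections += 1

rate = rejections / trials
print('rejection rate under H0: %.3f' % rate)
```

This is why an occasional significant result on same-distribution data is expected rather than a sign that the test is broken.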


In [25]:
# update the friedman test example to operate on data samples that have the same distribution
from numpy.random import seed
from numpy.random import rand
from scipy.stats import friedmanchisquare

# seed the random number generator
seed(1)

# generate three independent samples
data1 = 50 + (rand(100) * 10)
data2 = 50 + (rand(100) * 10)
data3 = 50 + (rand(100) * 10)

# compare samples
stat, p = friedmanchisquare(data1, data2, data3)
print('Statistics=%.3f, p=%.3f' % (stat, p))

# interpret
alpha = 0.05
if p > alpha:
    print('Same distributions (fail to reject H0)')
else:
    print('Different distributions (reject H0)')

# the p-value of 0.054 with a sample size of 100 correctly suggests that we fail to reject the null hypothesis
# however the p-value of 0.035 with a sample size of 1000 (the run whose output is shown below) wrongly suggests that the distributions are different

Statistics=6.714, p=0.035
Different distributions (reject H0)
