# Nonparametric Statistical Significance Tests in Python

Nonparametric statistics are those methods that do not assume a specific distribution for the data.

Often, they refer to statistical methods that do not assume a Gaussian distribution. They were developed for use with ordinal or interval data, but in practice can also be used with a ranking of real-valued observations in a data sample rather than on the observation values themselves.

- Nonparametric significance tests are used to test whether two or more data samples have the same or different distributions.

- The null hypothesis of these tests is often the assumption that both samples were drawn from a population with the same distribution, and therefore the same population parameters, such as mean or median.

- If the null hypothesis is rejected after running the significance test on two or more samples, there is evidence that the samples were drawn from different populations, and in turn that the difference between sample estimates of population parameters, such as means or medians, may be significant.

- Tests also return a p-value that can be used to interpret the result of the test. The p-value can be thought of as the probability of observing a difference between the data samples at least as large as the one measured, given the base assumption (null hypothesis) that the samples were drawn from a population with the same distribution.

- The p-value can be interpreted in the context of a chosen significance level called α. A common value for alpha is 5% or 0.05. If the p-value is below the significance level, then the test says there is enough evidence to reject the null hypothesis and that the samples were likely drawn from populations with differing distributions.

p<=α : reject H0, different distribution.

p>α : fail to reject H0, same distribution
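The decision rule above can be written as a small helper. This is only a minimal sketch: the function name `interpret` and its return strings are illustrative and not part of any library.

```python
def interpret(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when p <= alpha."""
    if p_value <= alpha:
        return "reject H0 (different distribution)"
    return "fail to reject H0 (same distribution)"


# Example with two hypothetical p-values
print(interpret(0.017))
print(interpret(0.20))
```

The same comparison appears inline in each of the test examples that follow.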


In [1]:
# generate gaussian data samples
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std

# seed the random number generator
seed(1)
# generate two sets of univariate observations
data1 = 5 * randn(100) + 50
data2 = 5 * randn(100) + 51
# summarize
print('data1: mean=%.3f stdv=%.3f' % (mean(data1), std(data1)))
print('data2: mean=%.3f stdv=%.3f' % (mean(data2), std(data2)))

data1: mean=50.303 stdv=4.426
data2: mean=51.764 stdv=4.660


# The Mann-Whitney U test (for two independent samples)

- The Mann-Whitney U test is a nonparametric statistical significance test for determining whether two independent samples were drawn from a population with the same distribution.
  
More specifically, the test determines whether it is equally likely that any randomly selected observation from one sample will be greater or less than a randomly selected observation from the other sample. If this does not hold, it suggests differing distributions.

- Fail to Reject H0: Sample distributions are equal.
- Reject H0: Sample distributions are not equal.

In [2]:
from scipy.stats import mannwhitneyu

In [3]:
# compare samples
stat, p = mannwhitneyu(data1, data2)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0)')
else:
    print('Different distribution (reject H0)')

Statistics=4025.000, p=0.017
Different distribution (reject H0)


The Mann-Whitney U test is used to assess whether there are significant differences between two samples of numerical data.

The SciPy library includes implementations of this and other nonparametric tests. The code below illustrates the application of the Mann-Whitney U test to compare the quality ratings between two groups of red wine samples associated with two Spanish regions: region A and region B.

In [9]:
import numpy as np
from scipy.stats import mannwhitneyu

# The two data samples, of size 30 each, are generated synthetically
np.random.seed(42)
region_a = np.random.randint(80, 95, size=30)
region_b = np.random.randint(85, 100, size=30)

# Apply Mann-Whitney U test
stat, p_value = mannwhitneyu(region_a, region_b, alternative='two-sided')

# Output the data
print(f"Region A ratings: {region_a}")
print(f"Region B ratings: {region_b}")
# Output the test results
print(f"Mann-Whitney U statistic: {stat}")
print(f"P-value: {p_value}")

# Interpretation
if p_value < 0.05:
    print("There is a significant difference in wine quality ratings between the two regions.")
else:
    print("There is no significant difference in wine quality ratings between the two regions.")

Region A ratings: [86 83 92 94 90 87 92 84 86 89 82 86 90 90 87 84 83 87 87 82 85 84 81 87
 91 93 85 81 91 84]
Region B ratings: [85 96 94 90 97 96 93 85 95 95 99 94 96 96 99 98 98 99 98 87 96 91 88 93
 87 89 87 91 89 93]
Mann-Whitney U statistic: 127.5
P-value: 1.819820648463293e-06
There is a significant difference in wine quality ratings between the two regions.


Some details about the previous code:

- The two samples of wine ratings are randomly generated integers: from 80 to 94 for region A and from 85 to 99 for region B, because the upper bound passed to `np.random.randint` is exclusive.
- The test returns two values: the U statistic and a p-value. The U statistic measures how much the two groups overlap when all values are merged and ranked together. It ranges from 0 to n1·n2 (900 here), and recent SciPy versions report the U value of the first sample: values near n1·n2/2 (450 here) indicate heavy overlap, while values near either extreme indicate little overlap and therefore a larger statistical difference. Meanwhile, the p-value is a probability that, when low (below 0.05), indicates a statistically significant difference between the groups.
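To make the U statistic less abstract, the following minimal sketch counts, for two small hypothetical samples, the pairs in which an observation from one sample exceeds an observation from the other (ties count as half). The sample values and variable names are made up for illustration only.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical small samples, chosen only for illustration
x = np.array([1.1, 2.5, 3.0, 4.2])
y = np.array([2.0, 3.0, 5.5, 6.1, 7.3])

# Compare every observation in x with every observation in y
diff = x[:, None] - y[None, :]
u_x = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)  # pairs "won" by x, ties count half
u_y = x.size * y.size - u_x                       # the two U values sum to n1 * n2

stat, p = mannwhitneyu(x, y, alternative='two-sided')
print(f"U for x: {u_x}, U for y: {u_y}, SciPy statistic: {stat}")
```

Depending on the SciPy version, the reported statistic is either the U value of the first sample or the smaller of the two U values, so it is safest to compare it against both.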

# Wilcoxon Signed-Rank Test (for two paired samples/ one sample)

The samples are related or matched in some way, or represent two measurements of the same subjects. More specifically, the observations within each sample are independent of one another, but each observation is paired with one observation in the other sample.

- The Wilcoxon signed ranks test is a nonparametric statistical procedure for comparing two samples that are paired, or related.
- The parametric equivalent to the Wilcoxon signed ranks test goes by names such as the paired Student’s t-test, t-test for matched pairs, t-test for paired samples, or t-test for dependent samples.

The default assumption for the test, the null hypothesis, is that the two samples have the same distribution.

- Fail to Reject H0: Sample distributions are equal.
- Reject H0: Sample distributions are not equal.

As a rule of thumb, the test needs at least 20 observations in each data sample to be effective.

### Two paired samples (e.g. before and after treatments)

In [4]:
from scipy.stats import wilcoxon
# compare samples
stat, p = wilcoxon(data1, data2)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0)')
else:
    print('Different distribution (reject H0)')

Statistics=1886.000, p=0.028
Different distribution (reject H0)


### One sample: testing the difference between the two samples

In [5]:
stat, p = wilcoxon(data1-data2)

print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0)')
else:
    print('Different distribution (reject H0)')

Statistics=1886.000, p=0.028
Different distribution (reject H0)


Unlike the Mann-Whitney U test, which compares two independent groups to detect a difference in their distributions, the Wilcoxon signed-rank test assumes the two groups are paired and tests whether the differences between paired observations are centered on zero.

This test could be applied for instance if we want to analyze significant differences for two sets of ratings for wine produced in the same region: one before applying an innovative preservation technique, and one after applying it.

Applying the Wilcoxon signed-rank test in Python is very similar to the previous example:

In [10]:
import numpy as np
from scipy.stats import wilcoxon

# Synthetic data generation: wine quality ratings before and after using a preservation technique
np.random.seed(42)
before = np.random.randint(85, 95, size=10)
after = before + np.random.randint(-2, 5, size=10)

# Apply Wilcoxon Signed-Rank Test
stat, p_value = wilcoxon(before, after)

# Output the data
print(f"Ratings before: {before}")
print(f"Ratings after: {after}")
#Output the test results
print(f"Wilcoxon statistic: {stat}")
print(f"P-value: {p_value}")

# Interpretation
if p_value < 0.05:
    print("There is a significant difference in wine quality ratings after applying the preservation technique.")
else:
    print("There is no significant difference in wine quality ratings after applying the preservation technique.")

Ratings before: [91 88 92 89 91 94 87 91 92 89]
Ratings after: [92 88 95 91 90 95 90 94 91 90]
Wilcoxon statistic: 6.0
P-value: 0.045797938810819776
There is a significant difference in wine quality ratings after applying the preservation technique.


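The test statistic is built from the ranks of the absolute non-zero differences between paired observations: the ranks of the positive and the negative differences are summed separately, and for the default two-sided test SciPy reports the smaller of the two sums. It therefore reflects both the magnitude and the direction of the changes between the two samples. Again, the Python implementation of this test applies these calculations internally for us.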
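To make the statistic concrete, the following minimal sketch recomputes it by hand for the ratings printed above. It assumes SciPy's default behaviour of discarding zero differences and, for a two-sided test, reporting the smaller of the positive and negative rank sums. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

# Ratings copied from the output of the example above
before = np.array([91, 88, 92, 89, 91, 94, 87, 91, 92, 89])
after = np.array([92, 88, 95, 91, 90, 95, 90, 94, 91, 90])

d = before - after
d = d[d != 0]                    # discard pairs with no change
ranks = rankdata(np.abs(d))      # rank absolute differences, ties get average ranks

w_plus = ranks[d > 0].sum()      # rank sum of positive differences
w_minus = ranks[d < 0].sum()     # rank sum of negative differences
w = min(w_plus, w_minus)

print(f"W+ = {w_plus}, W- = {w_minus}, statistic = {w}")
# prints: W+ = 6.0, W- = 39.0, statistic = 6.0
```

The result matches the statistic of 6.0 reported by `wilcoxon(before, after)` in the example.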

# Kruskal-Wallis H Test (ANOVA: for more than two independent samples)

The Kruskal-Wallis test is a nonparametric version of the one-way analysis of variance test or ANOVA for short. It is named for the developers of the method, William Kruskal and Wilson Wallis. This test can be used to determine whether more than two independent samples have a different distribution. It can be thought of as the generalization of the Mann-Whitney U test.

When the Kruskal-Wallis H-test leads to significant results, then at least one of the samples is different from the other samples. However, the test does not identify where the difference(s) occur. Moreover, it does not identify how many differences occur. To identify the particular differences between sample pairs, a researcher might use sample contrasts, or post hoc tests, to analyze the specific sample pairs for significant difference(s). The Mann-Whitney U-test is a useful method for performing sample contrasts between individual sample sets.

 - Fail to Reject H0: All sample distributions are equal.
 - Reject H0: One or more sample distributions are not equal.
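The post hoc contrasts described above could be sketched as pairwise Mann-Whitney U tests. The example below is only an illustration under stated assumptions: the three samples are synthetic, and the Bonferroni correction (multiplying each p-value by the number of comparisons) is one common choice for handling multiple comparisons, not something prescribed by the test itself.

```python
from itertools import combinations

import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical synthetic samples: "c" is shifted relative to "a" and "b"
rng = np.random.default_rng(1)
samples = {
    "a": rng.normal(50, 5, 50),
    "b": rng.normal(50, 5, 50),
    "c": rng.normal(60, 5, 50),
}

pairs = list(combinations(samples, 2))
adjusted = {}
for name1, name2 in pairs:
    _, p = mannwhitneyu(samples[name1], samples[name2], alternative='two-sided')
    # Bonferroni adjustment: scale by the number of comparisons, cap at 1
    adjusted[(name1, name2)] = min(1.0, p * len(pairs))

for pair, p_adj in adjusted.items():
    verdict = "different" if p_adj <= 0.05 else "no evidence of difference"
    print(f"{pair[0]} vs {pair[1]}: adjusted p={p_adj:.4f} ({verdict})")
```

Pairs whose adjusted p-value falls at or below the significance level are the ones that likely drive a significant Kruskal-Wallis result.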

In [6]:
import numpy as np
data1 = np.random.randn(100)*5 + 50
data2 = np.random.randn(100)*5 + 50
data3 = np.random.randn(100)*5 + 52

In [7]:
from scipy.stats import kruskal

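# unpack into `stat` so that the printed statistic is the Kruskal-Wallis H
stat, p = kruskal(data1, data2, data3)
print('Statistics=%.3f, p=%.3f' % (stat, p))

alpha = 0.05

if p > alpha:
    print('Same distribution (fail to reject H0)')
else:
    print('Different distribution (reject H0)')

Statistics=1886.000, p=0.001
Different distribution (reject H0)

Note that the 1886.000 in this recorded output is the Wilcoxon statistic left over from the earlier cell, not the Kruskal-Wallis H: the run that produced it unpacked the result into `stats` but printed `stat`. The p-value is the Kruskal-Wallis result.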


The Kruskal-Wallis test is the go-to option when we want to compare more than two data groups for statistically significant differences between their distributions. The example below follows the same pattern as the previous two, this time evaluating whether wine ratings differ significantly among producers in three different regions.

In [11]:
import numpy as np
from scipy.stats import kruskal

# Synthetic data generation
np.random.seed(42)
region_a = np.random.randint(80, 95, size=15)
region_b = np.random.randint(85, 100, size=15)
region_c = np.random.randint(75, 90, size=15)

# Apply Kruskal-Wallis test
stat, p_value = kruskal(region_a, region_b, region_c)

# Output the data
print(f"Region A ratings: {region_a}")
print(f"Region B ratings: {region_b}")
print(f"Region C ratings: {region_c}")
# Output the test results
print(f"Kruskal-Wallis statistic: {stat}")
print(f"P-value: {p_value}")

# Interpretation
if p_value < 0.05:
    print("There is a significant difference in wine quality ratings among the regions.")
else:
    print("There is no significant difference in wine quality ratings among the regions.")

Region A ratings: [86 83 92 94 90 87 92 84 86 89 82 86 90 90 87]
Region B ratings: [89 88 92 92 87 90 89 86 92 96 98 90 86 96 89]
Region C ratings: [75 86 84 80 87 86 83 75 85 85 89 84 86 86 89]
Kruskal-Wallis statistic: 16.63300149217167
P-value: 0.0002444497611367245
There is a significant difference in wine quality ratings among the regions.


The returned Kruskal-Wallis statistic compares the ranks of data values across groups. Higher values of the statistic are a stronger indicator of significantly different distributions between groups. Therefore, unlike the Mann-Whitney U test, a higher Kruskal-Wallis statistic yields a lower p-value.
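How the statistic compares ranks across groups can be sketched by computing H by hand for the ratings printed above. This minimal sketch uses the textbook formula with the standard tie correction, which is assumed here to be what SciPy applies as well. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import kruskal, rankdata

# Ratings copied from the output of the example above
region_a = np.array([86, 83, 92, 94, 90, 87, 92, 84, 86, 89, 82, 86, 90, 90, 87])
region_b = np.array([89, 88, 92, 92, 87, 90, 89, 86, 92, 96, 98, 90, 86, 96, 89])
region_c = np.array([75, 86, 84, 80, 87, 86, 83, 75, 85, 85, 89, 84, 86, 86, 89])
groups = [region_a, region_b, region_c]

# Rank all observations together, then split the ranks back into groups
pooled = np.concatenate(groups)
n_total = pooled.size
ranks = rankdata(pooled)
sizes = [g.size for g in groups]
rank_groups = np.split(ranks, np.cumsum(sizes)[:-1])

# H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
h = 12.0 / (n_total * (n_total + 1)) * sum(r.sum() ** 2 / r.size for r in rank_groups)
h -= 3 * (n_total + 1)

# Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N), t = size of each tie group
_, counts = np.unique(pooled, return_counts=True)
tie_correction = 1 - np.sum(counts ** 3 - counts) / (n_total ** 3 - n_total)
h_corrected = h / tie_correction

print(f"H by hand: {h_corrected}")
print(f"H from SciPy: {kruskal(*groups)[0]}")
```

The larger the spread between the groups' rank sums, the larger H becomes, which is why a higher statistic corresponds to a lower p-value.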

# Friedman Test (repeated measures ANOVA: for more than two paired samples)

The test assumes three or more paired data samples with 10 or more observations per sample.

The Friedman test is a nonparametric statistical procedure for comparing more than two samples that are related. The parametric equivalent to this test is the repeated measures analysis of variance (ANOVA). When the Friedman test leads to significant results, at least one of the samples is different from the other samples.

The default assumption, or null hypothesis, is that the multiple paired samples have the same distribution. A rejection of the null hypothesis indicates that one or more of the paired samples has a different distribution.

- Fail to Reject H0: Paired sample distributions are equal.
- Reject H0: Paired sample distributions are not equal.
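The statistic behind this test can be sketched by ranking the measurements within each subject (row) and comparing the rank sums of the conditions (columns). The data below are synthetic and hypothetical, and the formula shown omits the tie correction, which is safe here only because continuous random values contain no ties.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical data: 12 subjects (rows), each measured under 3 conditions (columns)
rng = np.random.default_rng(0)
data = rng.normal(loc=[50, 50, 52], scale=5, size=(12, 3))
n, k = data.shape

# Rank the k measurements within each subject
ranks = np.apply_along_axis(rankdata, 1, data)
col_rank_sums = ranks.sum(axis=0)

# chi2 = 12 / (n k (k + 1)) * sum(R_j^2) - 3 n (k + 1)
chi2 = 12.0 / (n * k * (k + 1)) * np.sum(col_rank_sums ** 2) - 3 * n * (k + 1)

stat, p = friedmanchisquare(data[:, 0], data[:, 1], data[:, 2])
print(f"By hand: {chi2:.3f}, SciPy: {stat:.3f}, p={p:.3f}")
```

Ranking within rows is what makes the test suitable for paired samples: each subject acts as its own control, so only the relative ordering of the conditions matters.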

In [8]:
from scipy.stats import friedmanchisquare

# seed the random number generator
seed(1)
# generate three independent samples
data1 = 5 * randn(100) + 50
data2 = 5 * randn(100) + 50
data3 = 5 * randn(100) + 52
# compare samples
stat, p = friedmanchisquare(data1, data2, data3)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Same distributions (fail to reject H0)')
else:
    print('Different distributions (reject H0)')

Statistics=9.360, p=0.009
Different distributions (reject H0)
