**Perform the two-sample Kolmogorov-Smirnov test for goodness of fit**

This test compares the underlying continuous distributions F(x) and G(x) of two independent samples.

Source: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html#scipy.stats.ks_2samp

In [5]:
from scipy import stats
import numpy as np

**Usage**

-- stats.ks_2samp(data1, data2, alternative='two-sided', method='auto')

There are three options for the null and corresponding alternative hypothesis that can be selected using the **alternative** parameter.

* less: The null hypothesis is that F(x) >= G(x) for all x; the alternative is that F(x) < G(x) for at least one x. The statistic is the magnitude of the minimum (most negative) difference between the empirical distribution functions of the samples.

* greater: The null hypothesis is that F(x) <= G(x) for all x; the alternative is that F(x) > G(x) for at least one x. The statistic is the maximum (most positive) difference between the empirical distribution functions of the samples.

* two-sided: The null hypothesis is that the two distributions are identical, F(x)=G(x) for all x; the alternative is that they are not identical. The statistic is the maximum absolute difference between the empirical distribution functions of the samples.
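To make the three statistics concrete, here is a minimal sketch that computes them directly from the empirical distribution functions (ECDFs) of two samples. The helper name `ks_statistics` and the seed are assumptions made for illustration and are not part of SciPy. The result should agree with the `statistic` returned by `stats.ks_2samp` up to floating-point error.

```python
import numpy as np
from scipy import stats


def ks_statistics(data1, data2):
    """Return the (less, greater, two-sided) KS statistics of two 1-D samples."""
    x = np.sort(np.asarray(data1, dtype=float))
    y = np.sort(np.asarray(data2, dtype=float))
    # The ECDFs only change at observed values, so evaluate them there.
    grid = np.concatenate([x, y])
    ecdf1 = np.searchsorted(x, grid, side="right") / x.size  # estimate of F
    ecdf2 = np.searchsorted(y, grid, side="right") / y.size  # estimate of G
    diff = ecdf1 - ecdf2
    d_less = max(-diff.min(), 0.0)    # magnitude of the most negative difference
    d_greater = max(diff.max(), 0.0)  # most positive difference
    return d_less, d_greater, max(d_less, d_greater)


rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
a = rng.uniform(size=100)
b = rng.normal(size=110)
d_less, d_greater, d_two_sided = ks_statistics(a, b)

# Print the manual value next to SciPy's statistic for each alternative.
for name, manual in [("less", d_less), ("greater", d_greater), ("two-sided", d_two_sided)]:
    scipy_stat = stats.ks_2samp(a, b, alternative=name).statistic
    print(name, manual, scipy_stat)
```

The two-sided statistic is simply the larger of the two one-sided statistics, which is why it is returned as `max(d_less, d_greater)`.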

**Examples**

Suppose we wish to test the null hypothesis that two samples were drawn from the same distribution. We choose a **confidence level of 95%**; that is, we will **reject** the null hypothesis in favor of the alternative if the **p-value is less than 0.05**.

If the first sample were drawn from a uniform distribution and the second were drawn from the standard normal, we would expect the null hypothesis to be rejected.

In [6]:
import numpy as np
rng = np.random.default_rng()  # unseeded, so the exact numbers vary between runs
sample1 = stats.uniform.rvs(size=100, random_state=rng)
sample2 = stats.norm.rvs(size=110, random_state=rng)
stats.ks_2samp(sample1, sample2, alternative='two-sided')

KstestResult(statistic=0.5636363636363636, pvalue=1.9984014443252818e-15)

Indeed, the **p-value** is lower than our **threshold of 0.05**, so we **reject the null hypothesis** in favor of the default “two-sided” alternative: the data were not drawn from the same distribution.
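Because the generator in this example is created without a seed, the exact statistic and p-value change from run to run. The sketch below repeats the comparison with a fixed seed and applies the decision rule explicitly. The seed value and the helper `reject_null` are assumptions introduced for illustration only.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # significance level matching the 95% confidence level used here


def reject_null(result, alpha=ALPHA):
    """Hypothetical helper: True when the p-value is below the threshold."""
    return result.pvalue < alpha


rng = np.random.default_rng(12345)  # arbitrary seed, an assumption of this sketch
sample1 = stats.uniform.rvs(size=100, random_state=rng)
sample2 = stats.norm.rvs(size=110, random_state=rng)
result = stats.ks_2samp(sample1, sample2, alternative="two-sided")

# The result object exposes the statistic and p-value as attributes.
print(result.statistic, result.pvalue, reject_null(result))
```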

When both samples are drawn from the same distribution, we expect the data to be consistent with the null hypothesis most of the time.

In [7]:
sample1 = stats.uniform.rvs(size=105, random_state=rng)
sample2 = stats.uniform.rvs(size=95, random_state=rng)
stats.ks_2samp(sample1, sample2, alternative='two-sided')

KstestResult(statistic=0.13784461152882205, pvalue=0.26720413899452866)

As expected, the p-value of about **0.27** is not below our **threshold of 0.05**, so we **cannot reject the null hypothesis**.
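The phrase "most of the time" can be checked empirically. The following sketch, which assumes an arbitrary seed and 200 repetitions, draws both samples from the same uniform distribution many times and counts how often the test rejects at the 0.05 threshold. Under the null hypothesis the rejection rate should be roughly 5% or lower, though the exact figure depends on the seed and the number of trials.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)  # arbitrary seed, an assumption of this sketch
n_trials = 200                     # number of repeated experiments (assumed)
alpha = 0.05

rejections = 0
for _ in range(n_trials):
    # Both samples come from the same distribution, so the null is true.
    s1 = stats.uniform.rvs(size=105, random_state=rng)
    s2 = stats.uniform.rvs(size=95, random_state=rng)
    if stats.ks_2samp(s1, s2, alternative="two-sided").pvalue < alpha:
        rejections += 1

rejection_rate = rejections / n_trials
print(f"rejected in {rejection_rate:.1%} of {n_trials} trials")
```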

Suppose, however, that the first sample were drawn from a uniform distribution **shifted toward greater values** (here, `loc=0.5`). In this case, the cumulative distribution function (CDF) underlying the first sample tends to be **less than** the CDF underlying the second sample. Therefore, we would expect the null hypothesis to be **rejected with alternative='less'**:

In [8]:
sample1 = stats.uniform.rvs(size=105, loc=0.5, random_state=rng)
sample2 = stats.uniform.rvs(size=95, random_state=rng)
stats.ks_2samp(sample1, sample2, alternative='less')

KstestResult(statistic=0.569423558897243, pvalue=8.646754182177063e-16)

Indeed, with a p-value smaller than our threshold of 0.05, we reject the null hypothesis in favor of the alternative.
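For contrast, it can help to run the opposite one-sided test on the same kind of data. In the sketch below (seed and variable names are assumptions), the shifted sample has an ECDF that lies at or below the other sample's ECDF, so the null hypothesis for `alternative='greater'` (F(x) <= G(x) for all x) is consistent with the data and should not be rejected, while `alternative='less'` should be.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed, an assumption of this sketch
shifted = stats.uniform.rvs(size=105, loc=0.5, random_state=rng)  # uniform on [0.5, 1.5]
baseline = stats.uniform.rvs(size=95, random_state=rng)           # uniform on [0, 1]

res_less = stats.ks_2samp(shifted, baseline, alternative="less")
res_greater = stats.ks_2samp(shifted, baseline, alternative="greater")

print("less:   ", res_less.statistic, res_less.pvalue)
print("greater:", res_greater.statistic, res_greater.pvalue)
```

Choosing the one-sided alternative therefore requires knowing in advance which direction of difference is of interest; the `alternative` argument refers to the CDFs, not to the sample values themselves.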