## Notebook for comparing tests
### Testing distributions

This notebook compares several drift tests and highlights the strengths of each. Every test is useful; it just has to be applied in a scenario that fits it.

Import libraries

In [None]:
import pandas as pd
import numpy as np

from scipy import stats

from evidently.calculations.stattests import StatTest
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnDrift

from plotly import graph_objs as go
import plotly.express as px

In [None]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

## Prepare Datasets/Distributions

Next we define four pairs of samples, each pair exhibiting a different type of drift between the two samples.

In [None]:
#function that will help us define sample and control group

def give_me_smp_cntr_df(sample1,sample2):
    """
    It receives two arrays of the produced sample distributions and
    returns two dataframes that have the sample and control groups to test later the drift
    """
    sample_df = pd.DataFrame(np.array([sample1,sample2]).T,columns=['sample_group','control_group'])
    #initial dataset
    smp_df=sample_df['sample_group'].reset_index().rename(columns={'sample_group': "test_group"})
    #control dataset
    cntr_df=sample_df['control_group'].reset_index().rename(columns={'control_group': "test_group"})
    return smp_df,cntr_df
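To make the helper's output concrete, here is a minimal sketch that calls it on two tiny arrays. The helper is repeated so the cell runs on its own, and the toy values are illustrative assumptions, not data from this notebook.

```python
import numpy as np
import pandas as pd


def give_me_smp_cntr_df(sample1, sample2):
    # Same logic as the helper above, repeated so this cell is self-contained.
    sample_df = pd.DataFrame(
        np.array([sample1, sample2]).T,
        columns=["sample_group", "control_group"],
    )
    smp_df = sample_df["sample_group"].reset_index().rename(columns={"sample_group": "test_group"})
    cntr_df = sample_df["control_group"].reset_index().rename(columns={"control_group": "test_group"})
    return smp_df, cntr_df


# Toy inputs (illustrative values only)
toy_sample = np.array([0.1, 0.2, 0.3])
toy_control = np.array([1.0, 2.0, 3.0])

smp_df_toy, cntr_df_toy = give_me_smp_cntr_df(toy_sample, toy_control)

# Both frames expose the values under the same column name, "test_group",
# so the same column can be tested in the reference and the current data.
print(smp_df_toy.columns.tolist())
print(cntr_df_toy["test_group"].tolist())
```

Both input arrays need the same length, because they are stacked into a single two-column frame before being split again.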


In [None]:
# General gamma distribution

a, c = 3, -1.02
#defining sample 1
r1 = stats.gengamma.rvs(a, c, size=1000)

a, c = 3, -1.32
#defining sample 2
r2 = stats.gengamma.rvs(a, c, size=1000)

smp_df,cntr_df = give_me_smp_cntr_df(r1,r2)

In [None]:
# Normal distribution

mu, sigma = 0, 0.08 # mean and standard deviation
normal = np.random.normal(mu, sigma, 1000)

mu, sigma = 0, 0.05 # mean and standard deviation
normal2 = np.random.normal(mu, sigma, 1000)

smp_df2,cntr_df2 = give_me_smp_cntr_df(normal,normal2)

In [None]:
# Discrete binomial

n = 10
p = 0.8

data_binom = stats.binom.rvs(n, p, size=1000)
data_binom2 = stats.binom.rvs(n, 0.75, size=1000)

smp_df3,cntr_df3 = give_me_smp_cntr_df(data_binom,data_binom2)

In [None]:
# Discrete poisson

mu = 1.5
data_poisson = stats.poisson.rvs(mu=mu, size=2000)
data_poisson2 = stats.poisson.rvs(mu=2, size=2000)

smp_df4,cntr_df4 = give_me_smp_cntr_df(data_poisson,data_poisson2)

## Define custom tests

Here we define the custom tests.

First, define the Mann-Whitney U rank test.

sources:

In [None]:
from scipy.stats import mannwhitneyu

def mannwhitneyu_rank(
    reference_data: pd.Series,
    current_data: pd.Series,
    feature_type: str,
    threshold: float,
    use_continuity: bool = True
):
    """Calculate the Mann-Whitney U-rank test between two arrays
    Args:
        reference_data: reference data
        current_data: current data
        feature_type: feature type
        threshold: p-values below this threshold indicate data drift
        use_continuity: whether to apply the continuity correction
    Returns:
        pvalue: the p-value for the test depending on alternative and method
        test_result: whether the drift is detected
    """
    wil_p_value = mannwhitneyu(x=reference_data, y=current_data,use_continuity=use_continuity)[1]
    return wil_p_value, wil_p_value < threshold


mann_whitney_u_stat_test = StatTest(
    name="mannw",
    display_name="Mann-Whitney U-rank test",
    func=mannwhitneyu_rank,
    allowed_feature_types=["num"],
    default_threshold=0.05
)

In [None]:
from scipy.stats import epps_singleton_2samp

def _epps_singleton(
    reference_data: pd.Series,
    current_data: pd.Series,
    feature_type: str,
    threshold: float):
    """Run the Epps-Singleton (ES) test of two samples.
    Args:
        reference_data: reference data
        current_data: current data
        feature_type: feature type
        threshold: significance level; p-values below it indicate drift (default 0.05)
    Returns:
        p_value: p-value based on the asymptotic chi2-distribution.
        test_result: whether the drift is detected
    """
    p_value = epps_singleton_2samp(reference_data, current_data)[1]
    return p_value, p_value < threshold


epps_singleton_test = StatTest(
    name="es",
    display_name="Epps-Singleton",
    func=_epps_singleton,
    allowed_feature_types=["num"],
    default_threshold=0.05
)
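As a rough check outside Evidently, the Epps-Singleton test can be called directly from SciPy on discrete data. The sketch below assumes Poisson samples similar to the ones defined earlier; the seed and variable names are illustrative choices, not part of the original notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Two discrete samples with different rates (mirrors the Poisson setup above)
ref = rng.poisson(lam=1.5, size=1000)
cur = rng.poisson(lam=2.0, size=1000)

# The result object carries the test statistic and the asymptotic p-value
es_result = stats.epps_singleton_2samp(ref, cur)

threshold = 0.05
drift_detected = es_result.pvalue < threshold  # same decision rule as _epps_singleton
print(f"statistic={es_result.statistic:.3f}, p-value={es_result.pvalue:.3g}, drift={drift_detected}")
```

The decision rule is the same one the wrapper uses: drift is flagged when the p-value falls below the threshold.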

In [None]:
feature = 'test_group'

data_drift_dataset_tests = TestSuite(tests=[
    TestColumnDrift(column_name=feature, stattest=mann_whitney_u_stat_test),
    TestColumnDrift(column_name=feature, stattest=epps_singleton_test),
    TestColumnDrift(column_name=feature, stattest='ks'),
    TestColumnDrift(column_name=feature, stattest='anderson'),
    TestColumnDrift(column_name=feature, stattest='cramer_von_mises')
])

In [None]:
# Define function for checking p-values per population

def create_test_result_dataset(data_drift_dataset_tests):
    d = []

    for tests in data_drift_dataset_tests.as_dict()['tests']:
        d2 = []
        d2.append(tests['parameters']['features']['test_group']['stattest_name'])
        d2.append(tests['parameters']['features']['test_group']['score'])

        #added the test name and drift score(p-value or distance)
        d.append(d2)

    df = pd.DataFrame(d, columns = ['test','p-value'])

    return df

# Run tests

In [None]:
# Poisson distribution
fig = go.Figure()
fig.add_trace(go.Histogram(x=data_poisson, nbinsx=40, name='data_poisson'))
fig.add_trace(go.Histogram(x=data_poisson2, nbinsx=40, name='data_poisson2'))

fig.show()

In [None]:
# Poisson distribution
df_n = pd.DataFrame()

for n in range(100,1100,100):
    data_drift_dataset_tests.run(reference_data = smp_df4[0:n], current_data = cntr_df4[0:n])
    df = create_test_result_dataset(data_drift_dataset_tests)
    df['data_length'] = n
    df_n = pd.concat([df_n, df])

In [None]:
# Poisson distribution
fig = px.line(
    df_n.reset_index(), 
    x="data_length", 
    y="p-value", 
    color="test")

fig.show()

"When comparing the incomes of two different groups (especially groups that span the socioeconomic
spectrum), the distributions will likely be highly variable and highly skewed. In such a case,
it might be better to use a nonparametric test like Wilcoxon’s signed-rank test."

"This is a paired test that compares the medians of two distributions"

For this case, the Mann-Whitney U test is similar to the Wilcoxon test, but it can also be used to compare
samples that aren’t necessarily paired.

source: https://www.mit.edu/~6.s085/notes/lecture5.pdf 5.1.3 Wilcoxon’s signed-rank test
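The paired versus unpaired distinction can be sketched with SciPy. This is a minimal illustration on synthetic data, assuming normally distributed values; the variable names, seed, and parameters are hypothetical and not taken from the notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # seed chosen only for reproducibility

# Paired case: the same 200 units measured twice, so observations line up one to one.
before = rng.normal(50, 10, size=200)
after = before + rng.normal(2, 1, size=200)
paired_p = stats.wilcoxon(before, after).pvalue

# Unpaired case: two independent groups, which may even differ in size.
group_a = rng.normal(50, 10, size=200)
group_b = rng.normal(55, 10, size=150)
unpaired_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided").pvalue

print(f"Wilcoxon signed-rank (paired) p-value: {paired_p:.3g}")
print(f"Mann-Whitney U (unpaired) p-value:     {unpaired_p:.3g}")
```

In a drift setting the reference and current rows are usually not matched one to one, which is why the unpaired Mann-Whitney U test is the one wrapped as a custom test in this notebook.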

The Anderson-Darling and Cramér-von Mises tests also perform well in this case.

Let's see another case of discrete population drift:

In [None]:
# Binomial distribution
fig = go.Figure()
fig.add_trace(go.Histogram(x=data_binom, nbinsx=40, name='data_binom'))
fig.add_trace(go.Histogram(x=data_binom2, nbinsx=40, name='data_binom2'))

fig.show()

In [None]:
# Binomial distribution
df_n=pd.DataFrame()

for n in range(100,1100,100):
    
    data_drift_dataset_tests.run(reference_data=smp_df3[0:n], current_data=cntr_df3[0:n])
    df = create_test_result_dataset(data_drift_dataset_tests)
    df['data_length'] = n
    df_n=pd.concat([df_n, df])

In [None]:
# Binomial distribution
fig = px.line(
    df_n.reset_index(), 
    x="data_length", 
    y="p-value", 
    color="test")

fig.show()

Again, KS seems to be slower to realize that the two distributions are different.

But wait: when is Mann-Whitney U actually bad at detecting drift, and when do KS and the other tests do better?

Mann-Whitney U is usually described as a test on medians, so it mostly reacts to a shift in location. So let's try two normal distributions with the same mean but different standard deviations.

In [None]:
# Normal distribution
fig = go.Figure()
fig.add_trace(go.Histogram(x=normal, nbinsx=40, name='normal'))
fig.add_trace(go.Histogram(x=normal2, nbinsx=40, name='normal2'))

fig.show()

In [None]:
#Normal distribution
df_n=pd.DataFrame()

for n in range(100,1100,100):
    
    data_drift_dataset_tests.run(reference_data=smp_df2[0:n], current_data=cntr_df2[0:n])
    df = create_test_result_dataset(data_drift_dataset_tests)
    df['data_length'] = n
    df_n=pd.concat([df_n, df])

In [None]:
# Normal distribution
fig = px.line(
    df_n.reset_index(), 
    x="data_length", 
    y="p-value", 
    color="test")

fig.show()

As you can see, the Mann-Whitney U test never converges to a low p-value the way the others do. It looks for a difference in medians, but here the two samples share the same mean and differ only in sigma. Thus, for this type of drift, Mann-Whitney U fails to detect the change.
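The same contrast can be reproduced with SciPy alone. The sketch below assumes the sigma values used for the normal samples earlier (0.08 and 0.05) and a fixed seed of my own choosing; exact p-values will vary with the seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seed chosen only for reproducibility

# Same mean, different standard deviation: a pure change in scale
wide = rng.normal(0, 0.08, size=1000)
narrow = rng.normal(0, 0.05, size=1000)

# Mann-Whitney U is sensitive to location shifts, not to changes in spread
mw_p = stats.mannwhitneyu(wide, narrow, alternative="two-sided").pvalue

# KS compares the whole empirical CDFs, so a change in spread is visible to it
ks_p = stats.ks_2samp(wide, narrow).pvalue

print(f"Mann-Whitney U p-value: {mw_p:.3g}")
print(f"KS p-value:             {ks_p:.3g}")
```

Because the medians coincide, the Mann-Whitney U p-value behaves much like it would with no drift at all, while the KS p-value becomes very small at this sample size.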

In this short example you saw how a statistical test, depending on its strengths and weaknesses, may or may not detect a given drift.

Every test is good for specific cases.

## So choose wisely!!

Want to plug and play?
1) Define your population
2) Run the tests
3) Select the test based on the drift detection plot
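The three steps above can also be sketched with SciPy only, without Evidently. The helper `pvalue_curve` is hypothetical (its name and signature are illustrative), and the gamma samples are an assumption standing in for whatever population you want to study.

```python
import numpy as np
import pandas as pd
from scipy import stats


def pvalue_curve(reference, current, sizes):
    """Hypothetical helper: p-value of several two-sample tests at growing sample sizes."""
    tests = {
        "ks": lambda a, b: stats.ks_2samp(a, b).pvalue,
        "mannw": lambda a, b: stats.mannwhitneyu(a, b, alternative="two-sided").pvalue,
        "es": lambda a, b: stats.epps_singleton_2samp(a, b).pvalue,
    }
    rows = []
    for n in sizes:
        for name, test_fn in tests.items():
            # Use only the first n observations of each sample
            rows.append({"test": name, "p-value": test_fn(reference[:n], current[:n]), "data_length": n})
    return pd.DataFrame(rows)


rng = np.random.default_rng(7)  # seed chosen only for reproducibility

# 1) Define your population (assumed here: two gamma samples with different shapes)
reference = rng.gamma(shape=3.0, scale=1.0, size=2000)
current = rng.gamma(shape=2.5, scale=1.0, size=2000)

# 2) Run the tests at increasing sample sizes
curve = pvalue_curve(reference, current, range(100, 2100, 100))

# 3) Compare the tests: one column per test, one row per sample size
print(curve.pivot(index="data_length", columns="test", values="p-value").tail())
```

The resulting frame has the same `test`, `p-value` and `data_length` columns as the ones built in this notebook, so it can be passed to `px.line` in the same way.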

In [None]:
# Your distribution
a, c = 3, -1.02
mean, var, skew, kurt = stats.gengamma.stats(a, c, moments='mvsk')
# moments of the first distribution
print(mean, var, skew, kurt)
your_r = stats.gengamma.rvs(a, c, size=2000)

a, c = 2.5, -1.02
mean, var, skew, kurt = stats.gengamma.stats(a, c, moments='mvsk')
# moments of the second distribution
print(mean, var, skew, kurt)
your_r2 = stats.gengamma.rvs(a, c, size=2000)

smp_df,cntr_df = give_me_smp_cntr_df(your_r,your_r2)


In [None]:
# Your distribution
fig = go.Figure()
fig.add_trace(go.Histogram(x=your_r, nbinsx=40, name='your_r'))
fig.add_trace(go.Histogram(x=your_r2, nbinsx=40, name='your_r2'))

fig.show()

In [None]:
# gen gamma
df_n=pd.DataFrame()

for n in range(100,2100,100):
    
    data_drift_dataset_tests.run(reference_data=smp_df[0:n], current_data=cntr_df[0:n])
    df = create_test_result_dataset(data_drift_dataset_tests)
    df['data_length'] = n
    df_n=pd.concat([df_n, df])

In [None]:
# Your distribution
fig = px.line(
    df_n.reset_index(), 
    x="data_length", 
    y="p-value", 
    color="test")

fig.show()

# Support Evidently
Enjoyed the tutorial? Star Evidently on GitHub to contribute back! This helps us continue creating free open-source tools for the community. https://github.com/evidentlyai/evidently