# TopoTest for two-sample problem

This notebook shows how to use codebase for TopoTests.

Based on paper `Topology-Driven Goodness-of-Fit Tests in Arbitrary Dimensions` 
by Paweł Dłotko, Niklas Hellmer, Łukasz Stettner and Rafał Topolnicki.

In [1]:
import sys
import pandas as pd
import numpy as np
sys.path.append('topotests/')
from topotests.topotests import TopoTestTwosample

# import some random variables (RV) generators
from scipy.stats import norm, multivariate_normal, t, multivariate_t
# import univariate Kolmogorov-Smirnov test
from scipy.stats import ks_2samp

# set random number generator seed for reproducibility
np.random.seed(seed=12345)

## Univarite test (two-sample)

In [2]:
# create some random variabes used togenerate data. 
# TopoTest requires that RV has a rvs(size) method that returns a random sample from null distribution

# first let's create the standard normal RV and a RV that represents a Student's distribution with df=3 degrees df is defined later) of freedom
rv_norm = norm
rv_t = t(df=3)

# draw samples from this distriutions
n = 100 # sample size
sample_norm = rv_norm.rvs(size=n)
sample_t = rv_norm.rvs(size=n)

# set the significance level
alpha=0.05

In two-sample problem the null hypotehsis is that samples `sample_norm` and `sample_t` were drawn from the same distribution.

Run two-sample TopoTest

In [3]:
TopoTestTwosample(X1=sample_norm, X2=sample_t)

TopoTestResult(statistic=0.06000000000000005, pvalue=0.622)

Estimated p-value is 0.622. Hence, in this case one should not reject the null hypothesis at significance level 0.05. 

As can be seen the null hypothesis is also not rejected by the Kolmogorov-Smirnov test (p-value is 0.11)

In [4]:
ks_2samp(sample_norm, sample_t)

KstestResult(statistic=0.17, pvalue=0.11119526053829192)

Let run the sanity check to verify that type I error is recovered when two input samples were drawn for the same distribution

In [5]:
mcloops = 500
pvalues = []
for mcloop in range(mcloops):
    s1 = rv_norm.rvs(size=n)
    s2 = rv_norm.rvs(size=n)
    _, pvalue = TopoTestTwosample(X1=s1, X2=s2, loops=250)
    pvalues.append(pvalue)

alpha = 0.05
power_tt = np.mean([pvalue < alpha for pvalue in pvalues])
print(f'Empirical power (TopoTest) = {power_tt}')

Empirical power (TopoTest) = 0.056


The empirical pvalue (0.056) is very close to significance level as expected.

Now estimate the power of the TopoTest for samples drawn from standard normal and Student's-t distributions. Lets compute the power of Kolmogorov-Smirnov counterpart as well.

In [6]:
mcloops = 500
pvalues = []
pvalues_ks = []
for mcloop in range(mcloops):
    sample_n = rv_norm.rvs(size=100)
    sample_t = rv_t.rvs(size=100)
    _, pvalue = TopoTestTwosample(X1=sample_n, X2=sample_t, loops=500)
    pvalues.append(pvalue)
    pvalues_ks.append(ks_2samp(sample_n, sample_t).pvalue)

In [7]:
alpha = 0.05
power_tt = np.mean([pvalue < alpha for pvalue in pvalues])
power_ks = np.mean([pvalue < alpha for pvalue in pvalues_ks])
print(f'Empirical power (TopoTest) = {power_tt}')
print(f'Empirical power (Komogorov-Smirnov) = {power_ks}')

Empirical power (TopoTest) = 0.506
Empirical power (Komogorov-Smirnov) = 0.044


The empirical power of two sample TopoTest (0.506) is much larger than empircial power of Kolmogorov-Smirnov test (0.044).  