# TopoTest for one-sample problem

This notebook shows how to use the `topotest` package to run TopoTests.

Based on the paper `Topology-Driven Goodness-of-Fit Tests in Arbitrary Dimensions`
by Paweł Dłotko, Niklas Hellmer, Łukasz Stettner and Rafał Topolnicki.

First install `topotest` using pip:

`pip install topotest`

In [1]:
import numpy as np
from topotest import TopoTestOnesample
# import some random variable (RV) generators
from scipy.stats import norm, multivariate_normal, t, multivariate_t
# import univariate Kolmogorov-Smirnov test
from scipy.stats import kstest

# set random number generator seed for reproducibility
np.random.seed(seed=12345)

## Univariate test

In [2]:
# create some random variables used to generate data
# TopoTest requires that an RV has an rvs(size) method that returns a random sample from the null distribution

# first create the standard normal RV and an RV that represents a Student's t-distribution with df=3 degrees of freedom
rv_norm = norm
rv_t = t(df=3)

# draw samples from these distributions
n = 100 # sample size
sample_norm = rv_norm.rvs(size=n)
sample_t = rv_t.rvs(size=n)

# set the significance level
alpha=0.05
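
The comment in the cell above says that TopoTest only needs an object with an `rvs(size)` method. As a minimal sketch of what that could look like for a distribution that SciPy does not provide directly, the class below draws from a two-component Gaussian mixture. The class name `GaussianMixtureRV` and its parameters are hypothetical and are not part of `topotest`; whether such an object is accepted by `fit` unchanged is an assumption based only on that comment.

```python
import numpy as np


class GaussianMixtureRV:
    """Hypothetical two-component Gaussian mixture exposing rvs(size)."""

    def __init__(self, weight=0.5, loc1=-1.0, loc2=1.0, scale=1.0, seed=None):
        self.weight = weight  # probability of drawing from the first component
        self.loc1 = loc1
        self.loc2 = loc2
        self.scale = scale
        self.rng = np.random.default_rng(seed)

    def rvs(self, size):
        # pick a component for every draw, then sample from that component
        from_first = self.rng.random(size) < self.weight
        locs = np.where(from_first, self.loc1, self.loc2)
        return self.rng.normal(loc=locs, scale=self.scale)


rv_mix = GaussianMixtureRV(seed=0)
sample_mix = rv_mix.rvs(size=100)  # 1-D array, same shape as norm.rvs(size=100)
```

The returned array has shape `(size,)`, matching what `norm.rvs(size=n)` produces in the univariate case.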

Let's assume we are interested in testing the hypothesis:

$H_0:$ the sample was generated from $\mathcal{N}(0,1)$ vs. $H_1:$ the sample was generated from some distribution different from the standard normal $\mathcal{N}(0,1)$.

Hence $F=\mathcal{N}(0,1)$, represented here by the `rv_norm` object, is the null distribution.


Now create the actual test via a `TopoTestOnesample` object:

In [3]:
tt = TopoTestOnesample(n=n, dim=1, significance_level=alpha)

TopoTest needs to be fitted to the null distribution, therefore:

In [4]:
tt.fit(rv=rv_norm, n_signature=1000)

Now we are ready to run the actual test:

In [5]:
tt.predict(sample_norm)

TopoTestResult(statistic=0.05980999999999903, pvalue=0.767)

The p-value is 0.767, which is above the assumed significance level $\alpha=0.05$, hence we do not reject the $H_0$ hypothesis (`predict` returns the test statistic and the p-value, as the output above shows).

Now let's do the same for the sample generated from the Student's t-distribution.

In [6]:
tt.predict(sample_t)

TopoTestResult(statistic=0.13770000000000038, pvalue=0.009)

`pvalue=0.009` indicates that at significance level 0.05 the null hypothesis is rejected by the TopoTest.
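
The decision rule applied to both results is the usual one: reject $H_0$ when the p-value falls below $\alpha$. A minimal sketch follows; the helper name `reject_null` is hypothetical and is not part of `topotest`.

```python
def reject_null(pvalue: float, alpha: float = 0.05) -> bool:
    """Return True when the null hypothesis is rejected at level alpha."""
    return pvalue < alpha


# p-values reported above for the normal and the Student's t samples
print(reject_null(0.767, alpha=0.05))  # False: do not reject H_0
print(reject_null(0.009, alpha=0.05))  # True: reject H_0
```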

For the same sample, the Kolmogorov-Smirnov test does not reject the null hypothesis (pvalue=0.816), so on this sample the TopoTest detects a departure from normality that the Kolmogorov-Smirnov test misses:

In [7]:
kstest(sample_t, cdf=rv_norm.cdf)

KstestResult(statistic=0.06185145001574388, pvalue=0.8159981812674457)

Now the p-value exceeds $\alpha$, so according to the KS test there is no evidence to reject the null hypothesis.
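
For reference, the KS statistic is the largest vertical gap between the empirical CDF of the sample and the null CDF. The pure-Python sketch below illustrates that definition for a standard normal null. The function names are hypothetical, and this is only an illustration of the statistic, not a replacement for `scipy.stats.kstest`, which also computes the p-value.

```python
import math


def normal_cdf(x):
    """CDF of the standard normal distribution N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(sample, cdf):
    """Largest distance between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # the empirical CDF jumps from i/n to (i+1)/n at x; check both sides
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


print(ks_statistic([0.0], normal_cdf))  # 0.5
```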

Note that the fitting procedure, performed by `tt.fit()`, needs to be run only once for a given sample size $n$ and null distribution. Moreover, the `predict` method can take more than one sample. It is therefore straightforward to estimate the power of the TopoTest when the Student's t-distribution with 3 degrees of freedom is the alternative:

In [8]:
number_of_samples = 1000
samples_t = [rv_t.rvs(size=n) for i in range(number_of_samples)]
stat, pvals = tt.predict(samples_t)

power = np.mean([pv < alpha for pv in pvals])
print(f'Estimated power for TopoTest for null=N(0,1) and alternative t(df=3) is {power}')

Estimated power for TopoTest for null=N(0,1) and alternative t(df=3) is 0.612


Let's estimate the power of the Kolmogorov-Smirnov test on the same samples:

In [9]:
ks_pvals = [kstest(sample, cdf=rv_norm.cdf).pvalue for sample in samples_t]
ks_reject = [pval < alpha for pval in ks_pvals]
ks_power = np.mean(ks_reject)
print(f'Estimated power for Kolmogorov-Smirnov for null=N(0,1) and alternative t(df=3) is {ks_power}')

Estimated power for Kolmogorov-Smirnov for null=N(0,1) and alternative t(df=3) is 0.121


Hence, in this setting, the estimated power of the TopoTest is considerably larger than that of the Kolmogorov-Smirnov test (0.612 vs. 0.121, respectively).
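
Both powers are Monte Carlo estimates based on 1000 samples, so each carries some sampling error. A rough way to gauge it, assuming the rejections are independent Bernoulli trials, is the binomial standard error $\sqrt{p(1-p)/N}$. The helper name `power_standard_error` below is hypothetical.

```python
import math


def power_standard_error(power, number_of_samples):
    """Binomial standard error of a Monte Carlo power estimate."""
    return math.sqrt(power * (1.0 - power) / number_of_samples)


# estimated powers reported above, each based on 1000 samples
se_topo = power_standard_error(0.612, 1000)
se_ks = power_standard_error(0.121, 1000)

# approximate 95% intervals using the normal approximation
print(f"TopoTest: 0.612 +/- {1.96 * se_topo:.3f}")
print(f"KS:       0.121 +/- {1.96 * se_ks:.3f}")
```

The half-widths come out at roughly 0.03 and 0.02, so the gap between the two estimated powers is far larger than the Monte Carlo error.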

## Multivariate test

Let's run the test for a bivariate distribution. The bivariate case is considered here for simplicity; a distribution in arbitrary dimension can be handled in the same way.

Create multivariate normal and multivariate Student's t random variables and draw a sample of size $n=250$ from the latter.

In [10]:
rv_mvn = multivariate_normal([0, 0], [[1, 0], [0, 1]])
rv_mt = multivariate_t([0, 0], [[1.0, 0], [0, 1.0]], df=7)

n = 250
X = rv_mt.rvs(size=n)

We want to test the null hypothesis $H_0: X \sim \mathcal{N}(0, I_{2\times2})$, where $I_{2\times2}$ is the $2\times2$ identity matrix.

In [11]:
tt_bivariate = TopoTestOnesample(n=n, dim=2, significance_level=alpha)
tt_bivariate.fit(rv=rv_mvn, n_signature=1000)
stat, pvalue = tt_bivariate.predict(X)
print(f'pvalue is {pvalue}')

pvalue is 0.013


The p-value 0.013 is below the assumed significance level $\alpha=0.05$, hence the null hypothesis should be rejected.