# Distribution Test

This script tests whether two data sets follow a similar distribution by applying the [Kolmogorov–Smirnov test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) to each column.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from scipy.stats import ks_2samp

from tqdm import tqdm
tqdm.pandas()

In [None]:
data_1 = pd.read_csv('data_1.csv')
data_2 = pd.read_csv('data_2.csv')

In [None]:
cols = data_1.columns # narrow down the columns if needed
# apply pre-processing if needed

**ks_2samp:** 
Under the null hypothesis, the two samples are drawn from the same distribution. If the K-S statistic is small or the p-value is high (greater than the significance level, say 5%), we cannot reject the hypothesis that the two samples share a distribution. Conversely, if the p-value falls below the significance level, we reject the null hypothesis.
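
To make the decision rule concrete, here is a minimal sketch on synthetic data. The sample sizes, the seed, the one-unit mean shift, and the variable names are all illustrative assumptions, not part of the data sets used in this notebook.

```python
import numpy as np
from scipy.stats import ks_2samp

# Fixed seed (an assumption) so the sketch is repeatable
rng = np.random.default_rng(0)

# Synthetic samples, purely illustrative
baseline = rng.normal(loc=0.0, scale=1.0, size=1000)
same_dist = rng.normal(loc=0.0, scale=1.0, size=1000)  # same distribution as baseline
shifted = rng.normal(loc=1.0, scale=1.0, size=1000)    # mean shifted by one unit

same_result = ks_2samp(baseline, same_dist)
shifted_result = ks_2samp(baseline, shifted)

alpha = 0.05  # significance level
for name, res in [('same', same_result), ('shifted', shifted_result)]:
    decision = 'reject H0' if res.pvalue < alpha else 'cannot reject H0'
    print(f'{name}: statistic={res.statistic:.3f}, p-value={res.pvalue:.3g} -> {decision}')
```

The shifted pair should produce a much larger statistic and a far smaller p-value than the pair drawn from the same distribution, which is the pattern the per-column loop in this notebook looks for.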

In [None]:
sim = []
for c in tqdm(cols):
    sim.append(ks_2samp(data_1[c], data_2[c]).statistic)

In [None]:
plt.cla()
plt.bar(cols, sim, label='Data 1 vs Data 2')
plt.legend()
plt.title('Data Distribution')
plt.xlabel('Columns')
plt.ylabel('K-S statistic')  # larger values mean less similar distributions
plt.xticks([])
plt.show()

In [None]:
print('num of features with above-average K-S statistic (less similar): ', sum(s > np.mean(sim) for s in sim))
print('mean: ', np.mean(sim))
print('std: ', np.std(sim))

In [None]:
p_val = []
for c in tqdm(cols):
    # Store 1 - p-value so that values above 0.95 flag a difference at the 5% level
    p_val.append(1 - ks_2samp(data_1[c], data_2[c]).pvalue)

In [None]:
plt.bar(cols, p_val, label='Data 1 vs Data 2 Significance')
plt.axhline(0.95, label='95% significance')
plt.legend()
plt.title('Data Distribution Significance')
plt.xlabel('Columns')
plt.ylabel('Significance')
plt.xticks([])
plt.show()

In [None]:
print('num of features significant: ', sum(s > 0.95 for s in p_val))
print('mean: ', np.mean(p_val))
print('std: ', np.std(p_val))