
# Various statistical tests from scratch

References

1. https://www.scribbr.com/statistics/t-test/
2. https://www.statisticshowto.com/probability-and-statistics/t-test/
3. https://en.wikipedia.org/wiki/Student%27s_t-test

In [41]:
import numpy as np
from sklearn.datasets import load_iris
from scipy.stats import ttest_ind

## 1. T-test (Student's T-test)

### 1.1. Two sample T-test

Suppose we want to know whether the mean petal length of iris flowers differs between species. We take the 50 petal-length measurements available for each of the versicolor and virginica species and test the difference between the two groups with a T-test, using the following null and alternative hypotheses.

**H0**: the true difference between these group means is zero.

**H1**: the true difference is different from zero.

In the case of equal or unequal sample sizes and similar variances ($ \frac{1}{2} < \frac{s_{x_1}}{s_{x_2}} < 2 $), the test statistic is

$ t = \frac{\bar{x_1} - \bar{x_2}}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $,

where $ s_p = \sqrt{ \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} } $ is the pooled standard deviation, and $ s_1^2 $, $ s_2^2 $ are the unbiased sample variances of the two groups.
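
As a minimal sketch of the two formulas above, the function below computes the pooled standard deviation and the t statistic with the Python standard library only. The helper name `pooled_t_statistic` and the sample values are illustrative assumptions, not part of the notebook. `statistics.variance` uses the n - 1 denominator, which is assumed here to be what $ s_1^2 $ and $ s_2^2 $ denote.

```python
import math
import statistics


def pooled_t_statistic(x1, x2):
    """Two-sample t statistic with a pooled standard deviation (hypothetical helper)."""
    n1, n2 = len(x1), len(x2)
    mean1, mean2 = statistics.mean(x1), statistics.mean(x2)
    # statistics.variance uses the n - 1 denominator (sample variance).
    var1, var2 = statistics.variance(x1), statistics.variance(x2)
    # Pooled standard deviation s_p.
    s_p = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    t = (mean1 - mean2) / (s_p * math.sqrt(1 / n1 + 1 / n2))
    return t, s_p


# Made-up illustrative samples, not iris data.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 3.0, 4.0, 5.0, 6.0]
t, s_p = pooled_t_statistic(a, b)
print(t, s_p)
```

Both samples have variance 2.5 and their means differ by 1, so the statistic should come out close to -1.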

In [13]:
data = load_iris(as_frame=True)
df = data['data']
df['specie'] = list(map(lambda x: data['target_names'][x], data['target']))
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | specie |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [15]:
df['specie'].value_counts()

setosa        50
versicolor    50
virginica     50
Name: specie, dtype: int64

In [59]:
g1 = df.loc[df['specie'] == 'versicolor', 'petal length (cm)'].to_numpy()
g2 = df.loc[df['specie'] == 'virginica', 'petal length (cm)'].to_numpy()

In [60]:
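# np.std defaults to ddof=0 (population standard deviation), which the outputs below use;
# scipy's ttest_ind uses the sample standard deviation (ddof=1)
s1, s2 = np.std(g1), np.std(g2)
1/2 < s1/s2 < 2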

True

**Conclusion:** Hence, we have the case of equal sample sizes and similar variances.

In [61]:
n1, n2 = len(g1), len(g2)
mean1, mean2 = np.mean(g1), np.mean(g2)

In [62]:
s_p = np.sqrt(( (n1-1)*s1**2 + (n2-1)*s2**2 )/(n1 + n2 -2))
s_p

0.5073933385451567

In [63]:
t = (mean1 - mean2) / (s_p * np.sqrt(1/n1 + 1/n2))
t

-12.731739873689888
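
Note that `np.std` defaults to `ddof=0` (the population standard deviation), whereas the pooled formula is usually stated with the sample standard deviation (`ddof=1`). The sketch below, which uses made-up samples rather than the iris data and a hypothetical helper named `pooled_t`, illustrates how that choice changes the statistic. For two groups of equal size n, the two results should differ by a factor of sqrt(n / (n - 1)).

```python
import numpy as np


def pooled_t(x1, x2, ddof):
    """Pooled two-sample t statistic; ddof is passed through to np.std."""
    n1, n2 = len(x1), len(x2)
    s1, s2 = np.std(x1, ddof=ddof), np.std(x2, ddof=ddof)
    s_p = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / (s_p * np.sqrt(1 / n1 + 1 / n2))


# Made-up illustrative samples of equal size n = 5.
x1 = np.array([4.1, 4.5, 3.9, 4.7, 4.3])
x2 = np.array([5.2, 5.9, 5.5, 6.1, 5.4])

t_population = pooled_t(x1, x2, ddof=0)  # np.std default
t_sample = pooled_t(x1, x2, ddof=1)      # sample standard deviation

n = len(x1)
print(t_population, t_sample)
print(t_population / t_sample, np.sqrt(n / (n - 1)))
```

With the 50 observations per group used here, that factor is $ \sqrt{50/49} \approx 1.0102 $, which is the ratio between the -12.7317 obtained above and the -12.6038 that `ttest_ind` reports at the end of this section.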

In [58]:
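# Alternative form: build the statistic from each group's standard error
se1 = s1 / np.sqrt(n1)  # standard error of the mean, group 1
se2 = s2 / np.sqrt(n2)  # standard error of the mean, group 2
sed = np.sqrt(se1**2 + se2**2)  # standard error of the difference
t_stat = (mean1 - mean2) / sed
print(t_stat)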

-12.73173987368989


In [64]:
(mean1 - mean2) / np.sqrt(s1**2/n1 + s2**2/n2)

-12.731739873689886
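
This unpooled form gives the same value as the pooled statistic because both groups have the same size. As a short derivation, assuming $ n_1 = n_2 = n $, the pooled variance reduces to $ s_p^2 = \frac{s_1^2 + s_2^2}{2} $, so the denominators agree:

$ s_p \sqrt{\frac{1}{n} + \frac{1}{n}} = \sqrt{\frac{s_1^2 + s_2^2}{2}} \sqrt{\frac{2}{n}} = \sqrt{\frac{s_1^2}{n} + \frac{s_2^2}{n}} $

The two statistics then coincide up to floating-point rounding. Only the degrees of freedom, and hence the p-values, can differ between the pooled and unpooled tests.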

In [65]:
mean1, mean2

(4.26, 5.5520000000000005)

In [66]:
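# Welch's t-test (equal_var=False), one-sided alternative mean(g1) < mean(g2); H1 above is two-sided
tt = ttest_ind(g1, g2, equal_var=False, alternative='less')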

In [67]:
tt

Ttest_indResult(statistic=-12.603779441384987, pvalue=2.4501437636990477e-22)
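
To show where a result like this comes from, here is a minimal sketch of Welch's t-test with the one-sided alternative `'less'`. It uses made-up samples rather than the iris data, and the helper name `welch_t_test_less` is hypothetical. It relies on sample variances (`ddof=1`), the Welch-Satterthwaite approximation for the degrees of freedom, and `scipy.stats.t.cdf` for the p-value, then compares the result with `ttest_ind`.

```python
import numpy as np
from scipy import stats


def welch_t_test_less(x1, x2):
    """Welch's t-test with the one-sided alternative mean(x1) < mean(x2)."""
    n1, n2 = len(x1), len(x2)
    # Sample variances (ddof=1), each divided by its group size.
    v1 = np.var(x1, ddof=1) / n1
    v2 = np.var(x2, ddof=1) / n2
    t = (np.mean(x1) - np.mean(x2)) / np.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    # One-sided p-value: probability of a statistic at least this negative.
    p = stats.t.cdf(t, df)
    return t, df, p


# Made-up illustrative samples of unequal size, not iris data.
x1 = np.array([4.1, 4.5, 3.9, 4.7, 4.3, 4.0])
x2 = np.array([5.2, 5.9, 5.5, 6.1, 5.4])

t, df, p = welch_t_test_less(x1, x2)
reference = stats.ttest_ind(x1, x2, equal_var=False, alternative="less")
print(t, df, p)
print(reference.statistic, reference.pvalue)
```

For the two-sided H1 stated at the start of this section, the p-value would instead be `2 * stats.t.sf(abs(t), df)`.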