<h1 class="list-group-item list-group-item-action active">Guide to Statistical Hypothesis Tests in Python</h1>


<img src = "https://d33wubrfki0l68.cloudfront.net/a5cb4bbe1b04d9099c6fc771724ea67ec087845b/cb16f/wp-content/uploads/2019/07/statistics-vs-machine-learning.png">

<div class="list-group" id="list-tab" role="tablist">
  <h3 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home">Notebook Content:</h3>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#one_" role="tab" aria-controls="profile">1. Normality Tests<span class="badge badge-primary badge-pill">1</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#two_" role="tab" aria-controls="messages">2. Correlation Tests<span class="badge badge-primary badge-pill">2</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#three_" role="tab" aria-controls="settings">3. Stationary Tests<span class="badge badge-primary badge-pill">3</span></a>
   <a class="list-group-item list-group-item-action"  data-toggle="list" href="#four_" role="tab" aria-controls="settings">4. Parametric Statistical Hypothesis Tests<span class="badge badge-primary badge-pill">4</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#five_" role="tab" aria-controls="settings">5. Non-Parametric Statistical Hypothesis Tests<span class="badge badge-primary badge-pill">5</span></a>
</div>

In [None]:
import warnings
warnings.filterwarnings('ignore')

Probability and statistics play a major role in any machine learning project. Probability deals with
predicting the likelihood of future events, whereas statistics deals with analysing the frequency of past events.


<h2 class="list-group-item list-group-item-action active">Normality Tests</h2>
<p> The main objective of normality tests is to check whether the data follow a Gaussian distribution. </p>
<h3 class="alert alert-info">Shapiro-Wilk Test</h3>
Tests whether a data sample has a Gaussian distribution.

<div class="alert alert-info">Assumption</div>
 Observations in each sample are independent and distributed identically.
<div class="alert alert-info">Hypothesis</div>


* H0: the sample has a Gaussian distribution.
* H1: the sample does not have a Gaussian distribution.

<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html" class="btn btn-warning" role="button">Scipy Ref -></a>

In [None]:
from scipy.stats import shapiro
data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
stat, p = shapiro(data)
print('stat={0:.3f}, p={1:.3f}'.format(stat, p))
if p > 0.05:
    print('Probably Gaussian')
else:
    print('Probably not Gaussian')
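Every example in this notebook applies the same decision rule: compare the p-value with a significance level (0.05 here) and reject H0 only when the p-value does not exceed it. The sketch below spells that rule out. The helper name `decide` and its return strings are illustrative assumptions, not part of SciPy.

```python
def decide(p_value, alpha=0.05):
    """Return the verdict of a hypothesis test at significance level alpha.

    Hypothetical helper for illustration only; it is not a SciPy function.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError('p_value must lie in [0, 1]')
    # A p-value above alpha means the data are compatible with H0.
    return 'fail to reject H0' if p_value > alpha else 'reject H0'


print(decide(0.20))
print(decide(0.01))
```

A p-value above the threshold does not prove H0; it only means the sample gives no strong evidence against it, which is why the messages in this notebook say "Probably".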

<h3 class="alert alert-info">D’Agostino’s K^2 Test</h3>
Tests whether a data sample has a Gaussian distribution.

<div class="alert alert-info">Assumption</div>
 Observations in each sample are independent and distributed identically.
<div class="alert alert-info">Hypothesis</div>


* H0: the sample has a Gaussian distribution.
* H1: the sample does not have a Gaussian distribution.

<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html" class="btn btn-warning" role="button">Scipy Ref -></a>

In [None]:
# Example of the D'Agostino's K^2 Normality Test
from scipy.stats import normaltest
data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
stat, p = normaltest(data)
print('stat={0:.3f}, p={1:.3g}'.format(stat, p))
if p > 0.05:
    print('Probably Gaussian')
else:
    print('Probably not Gaussian')

<h3 class="alert alert-info">Anderson-Darling Test</h3>
Tests whether a data sample has a Gaussian distribution.


<div class="alert alert-info">Assumption</div>
 Observations in each sample are independent and distributed identically.
<div class="alert alert-info">Hypothesis</div>


* H0: the sample has a Gaussian distribution.
* H1: the sample does not have a Gaussian distribution.

<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html" class="btn btn-warning" role="button">Scipy Ref -></a>

In [None]:
# Example of the Anderson-Darling Normality Test
from scipy.stats import anderson
data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
result = anderson(data)
print('stat={0:.3g}'.format(result.statistic))
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < cv:
        print('Probably Gaussian at the %.1f%% level' % (sl))
    else:
        print('Probably not Gaussian at the %.1f%% level' % (sl))
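To see what a rejection looks like, the three normality tests above can be run on a sample that is clearly not Gaussian. The sketch below builds a deterministic, strongly right-skewed sample from the quantiles of an exponential distribution. The sample and the variable names are assumptions made for illustration.

```python
import math
from scipy.stats import shapiro, normaltest, anderson

# Deterministic right-skewed sample: evenly spaced quantiles of an
# exponential distribution (no random numbers, so the run is repeatable).
n = 200
skewed = [-math.log(1 - (i + 0.5) / n) for i in range(n)]

_, p_shapiro = shapiro(skewed)
_, p_k2 = normaltest(skewed)
ad = anderson(skewed)  # compares against the normal distribution by default

print('Shapiro-Wilk    p={0:.3g}'.format(p_shapiro))
print("D'Agostino K^2  p={0:.3g}".format(p_k2))
print('Anderson-Darling stat={0:.3f}, largest critical value={1:.3f}'.format(
    ad.statistic, max(ad.critical_values)))
```

Shapiro-Wilk and D'Agostino's K^2 report a p-value, while the Anderson-Darling result is read by comparing the statistic with the critical values: a statistic above a critical value rejects H0 at the matching significance level.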

<h2 class="list-group-item list-group-item-action active">Correlation Tests</h2>
<p> Correlation tests check whether two features or variables are related to each other. </p>

<h3 class="alert alert-info">Pearson’s Correlation Coefficient</h3>
Tests whether two samples have a linear relationship.

<div class="alert alert-info">Assumption</div>
a) Observations in each sample are independent and distributed identically. <br>
b) Observations are normally distributed. <br>
c) Observations in each sample have the same variance.
<div class="alert alert-info">Hypothesis</div>


* H0: the two samples are independent.
* H1: there is a dependency between the samples.

<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html" class="btn btn-warning" role="button">Scipy Ref -></a>



In [None]:
# Example of the Pearson's Correlation test
from scipy.stats import pearsonr
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579]
stat, p = pearsonr(data1, data2)
print('stat={0:.3f}, p={1:.3f}'.format(stat, p))
if p > 0.05:
    print('Probably independent')
else:
    print('Probably dependent')

<h3 class="alert alert-info">Spearman’s Rank Correlation</h3>
Tests whether two samples have a monotonic relationship.

<div class="alert alert-info">Assumption</div>
a) Observations in each sample are independent and distributed identically. <br>
b) Observations in each sample can be ranked. <br>
<div class="alert alert-info">Hypothesis</div>


* H0: the two samples are independent.
* H1: there is a dependency between the samples.

<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html" class="btn btn-warning" role="button">Scipy Ref -></a>

In [None]:
# Example of the Spearman's Rank Correlation Test
from scipy.stats import spearmanr
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579]
stat, p = spearmanr(data1, data2)
print('stat={0:.3g}, p={1:.3f}'.format(stat, p))
if p > 0.05:
    print('Probably independent')
else:
    print('Probably dependent')


<h3 class="alert alert-info">Kendall’s Rank Correlation</h3>
Tests whether two samples have a monotonic relationship.

<div class="alert alert-info">Assumption</div>
a) Observations in each sample are independent and distributed identically. <br>
b) Observations in each sample can be ranked. <br>
<div class="alert alert-info">Hypothesis</div>


* H0: the two samples are independent.
* H1: there is a dependency between the samples.

<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html" class="btn btn-warning" role="button">Scipy Ref -></a>

In [None]:
# Example of the Kendall's Rank Correlation Test
from scipy.stats import kendalltau
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579]
stat, p = kendalltau(data1, data2)
print('stat={0:.3f}, p={1:.3f}'.format(stat, p))
if p > 0.05:
    print('Probably independent')
else:
    print('Probably dependent')
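The difference between a linear and a monotonic relationship shows up when one variable is a strictly increasing but non-linear function of the other. The sketch below uses a cubic relationship chosen for illustration. The rank-based coefficients reach 1, while Pearson's coefficient stays below 1.

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

x = list(range(1, 11))
y = [v ** 3 for v in x]  # strictly increasing in x, but not linear

r, _ = pearsonr(x, y)       # measures linear association
rho, _ = spearmanr(x, y)    # measures monotonic association via ranks
tau, _ = kendalltau(x, y)   # measures monotonic association via pair ordering

print('pearson r={0:.3f}, spearman rho={1:.3f}, kendall tau={2:.3f}'.format(r, rho, tau))
```

Spearman and Kendall depend only on the ordering of the values, so any strictly increasing transformation of a variable leaves them unchanged.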

<h3 class="alert alert-info">Chi-Squared Test</h3>
Tests whether two categorical variables are related to each other.

<div class="alert alert-info">Assumption</div>
a) Observations used in the contingency table are independent. <br>
b) There are more than 25 examples in the contingency table. <br>
<div class="alert alert-info">Hypothesis</div>


* H0: the two variables are independent.
* H1: there is a dependency between the variables.

<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html" class="btn btn-warning" role="button">Scipy Ref -></a>

In [None]:
# Example of the Chi-Squared Test
from scipy.stats import chi2_contingency
table = [[10, 20, 30],[6,  9,  17]]
stat, p, dof, expected = chi2_contingency(table)
print('stat={0:.3g}, p={1:.3f}'.format(stat, p))
if p > 0.05:
    print('Probably independent')
else:
    print('Probably dependent')
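The chi-squared statistic compares the observed counts with the counts expected if the two variables were independent. As a sketch of what happens inside the call, the expected counts and the statistic can be rebuilt by hand from the row and column totals. The variable names below are illustrative. The table has 2 degrees of freedom, so SciPy's continuity correction, which only applies when there is 1 degree of freedom, plays no role here.

```python
from scipy.stats import chi2_contingency

table = [[10, 20, 30], [6, 9, 17]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected_manual = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells.
stat_manual = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(table, expected_manual)
    for obs, exp in zip(obs_row, exp_row)
)

stat, p, dof, expected = chi2_contingency(table)
print('manual stat={0:.4f}, scipy stat={1:.4f}, dof={2}'.format(stat_manual, stat, dof))
```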

<h2 class="list-group-item list-group-item-action active">Stationary Tests</h2>
<p> Used to check whether a time series is stationary or non-stationary. </p>
<h3 class="alert alert-info">Augmented Dickey-Fuller Unit Root Test</h3>
Tests whether a time series has a unit root, which would make it non-stationary.

<div class="alert alert-info">Assumption</div>
 Observations are ordered in time.
<div class="alert alert-info">Hypothesis</div>


* H0: a unit root is present (the series is non-stationary).
* H1: a unit root is not present (the series is stationary).

<a href="https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html" class="btn btn-warning" role="button">Stats-Model Ref -></a>

In [None]:
# Example of the Augmented Dickey-Fuller unit root test
from statsmodels.tsa.stattools import adfuller
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
stat, p, lags, obs, crit, t = adfuller(data)
print('stat={0:.3f}, p={1:.3f}'.format(stat, p))
if p > 0.05:
    print('Probably not Stationary')
else:
    print('Probably Stationary')

<h3 class="alert alert-info">Kwiatkowski-Phillips-Schmidt-Shin</h3>
Tests whether a time series is stationary or not.

<div class="alert alert-info">Assumption</div>
 Observations are ordered in time.
<div class="alert alert-info">Hypothesis</div>


* H0: the time series is stationary.
* H1: the time series is not stationary.

<a href="https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.kpss.html#statsmodels.tsa.stattools.kpss" class="btn btn-warning" role="button">Stats-Model Ref -></a>

In [None]:
# Example of the Kwiatkowski-Phillips-Schmidt-Shin test
from statsmodels.tsa.stattools import kpss
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
stat, p, lags, crit = kpss(data)
print('stat={0:.3g}, p={1:.3g}'.format(stat, p))
# H0 of the KPSS test is stationarity, so a large p-value supports a stationary series.
if p > 0.05:
    print('Probably Stationary')
else:
    print('Probably not Stationary')
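The two stationarity tests have opposite null hypotheses: the Augmented Dickey-Fuller test assumes a unit root under H0, while the KPSS test assumes stationarity under H0. A common way to use them together is sketched below. The helper `combine_verdicts` and its labels are hypothetical, and the two "possibly" readings follow a widely used rule of thumb rather than a formal result.

```python
def combine_verdicts(adf_p, kpss_p, alpha=0.05):
    """Hypothetical helper: merge ADF and KPSS p-values into one reading."""
    adf_says_stationary = adf_p <= alpha    # ADF rejects H0 (unit root present)
    kpss_says_stationary = kpss_p > alpha   # KPSS fails to reject H0 (stationary)

    if adf_says_stationary and kpss_says_stationary:
        return 'stationary'
    if not adf_says_stationary and not kpss_says_stationary:
        return 'not stationary'
    if kpss_says_stationary:
        # KPSS sees stationarity, ADF still sees a unit root.
        return 'possibly trend-stationary'
    # ADF sees no unit root, KPSS still rejects stationarity.
    return 'possibly difference-stationary'


print(combine_verdicts(adf_p=0.01, kpss_p=0.10))
print(combine_verdicts(adf_p=0.60, kpss_p=0.01))
```

When the two tests disagree, detrending or differencing the series and testing again is a reasonable next step.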

<h2 class="list-group-item list-group-item-action active">Parametric Statistical Hypothesis Tests</h2>
<p> Statistical tests for comparison between data samples. </p>
<h3 class="alert alert-info">Student’s t-test</h3>
Tests whether the means of two independent samples are significantly different.

<div class="alert alert-info">Assumption</div>
a) Observations in each sample are independent and identically distributed. <br>
b) Observations in each sample are normally distributed. <br>
c) Observations in each sample have the same variance. <br>
 
<div class="alert alert-info">Hypothesis</div>


* H0: the means of the two samples are equal.
* H1: the means of the two samples are not equal.

<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html" class="btn btn-warning" role="button">Scipy Ref -></a>

In [None]:
# Example of the Student's t-test
from scipy.stats import ttest_ind
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = ttest_ind(data1, data2)
print('stat={0:.3f}, p={1:.3f}'.format(stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
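The equal-variance assumption enters the test through a pooled variance estimate. As a sketch of the arithmetic, the statistic can be reproduced by hand with the standard library and compared with SciPy's result. The variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance
from scipy.stats import ttest_ind

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
n1, n2 = len(data1), len(data2)

# Pooled sample variance: weighted average of the two sample variances.
pooled_var = ((n1 - 1) * variance(data1) + (n2 - 1) * variance(data2)) / (n1 + n2 - 2)

# t statistic: difference in means divided by its standard error.
t_manual = (mean(data1) - mean(data2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

stat, p = ttest_ind(data1, data2)  # equal variances are assumed by default
print('manual t={0:.4f}, scipy t={1:.4f}'.format(t_manual, stat))
```

If the equal-variance assumption is doubtful, `ttest_ind` accepts `equal_var=False`, which switches to Welch's t-test.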

<h3 class="alert alert-info">Paired Student’s t-test</h3>
Tests whether the means of two paired samples are significantly different.

<div class="alert alert-info">Assumption</div>
a) Observations in each sample are independent and identically distributed. <br>
b) Observations in each sample are normally distributed. <br>
c) Observations in each sample have the same variance. <br>
d) Observations across the samples are paired.
 
<div class="alert alert-info">Hypothesis</div>


* H0: the means of the two samples are equal.
* H1: the means of the two samples are not equal.

<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html" class="btn btn-warning" role="button">Scipy Ref -></a>

In [None]:
# Example of the Paired Student's t-test
from scipy.stats import ttest_rel
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = ttest_rel(data1, data2)
print('stat={0:.3f}, p={1:.3f}'.format(stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
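Pairing means that the test works on the per-pair differences rather than on the two samples separately. One way to see this is that the paired test gives the same result as a one-sample t-test of the differences against zero, as the sketch below checks. The variable names are illustrative.

```python
from scipy.stats import ttest_rel, ttest_1samp

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

# Difference within each pair of observations.
diffs = [a - b for a, b in zip(data1, data2)]

stat_rel, p_rel = ttest_rel(data1, data2)
stat_one, p_one = ttest_1samp(diffs, 0.0)

print('paired:     stat={0:.4f}, p={1:.4f}'.format(stat_rel, p_rel))
print('one-sample: stat={0:.4f}, p={1:.4f}'.format(stat_one, p_one))
```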

<h3 class="alert alert-info">Analysis of Variance Test (ANOVA)</h3>
Tests whether the means of two or more independent samples are significantly different.

<div class="alert alert-info">Assumption</div>
a) Observations in each sample are independent and identically distributed. <br>
b) Observations in each sample are normally distributed. <br>
c) Observations in each sample have the same variance. <br>
 
<div class="alert alert-info">Hypothesis</div>


* H0: the means of all samples are equal.
* H1: one or more of the sample means are not equal.

<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html" class="btn btn-warning" role="button">Scipy Ref -></a>

In [None]:
# Example of the Analysis of Variance Test
from scipy.stats import f_oneway
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]
stat, p = f_oneway(data1, data2, data3)
print('stat={0:.3g}, p={1:.3g}'.format(stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
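With exactly two samples, one-way ANOVA and the equal-variance Student's t-test are the same test: the F statistic equals the square of the t statistic and the p-values match. The sketch below checks this on two of the samples used above.

```python
from scipy.stats import f_oneway, ttest_ind

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

f_stat, f_p = f_oneway(data1, data2)
t_stat, t_p = ttest_ind(data1, data2)  # equal variances assumed by default

print('F={0:.4f}, t^2={1:.4f}'.format(f_stat, t_stat ** 2))
print('ANOVA p={0:.4f}, t-test p={1:.4f}'.format(f_p, t_p))
```

ANOVA becomes the more useful tool with three or more samples, where a single test replaces many pairwise t-tests.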

<h3 class="alert alert-info">Repeated Measures ANOVA Test</h3>
Tests whether the means of two or more paired samples are significantly different.

<div class="alert alert-info">Assumption</div>
a) Observations in each sample are independent and identically distributed. <br>
b) Observations in each sample are normally distributed. <br>
c) Observations in each sample have the same variance. <br>
d) Observations across the samples are paired.
 
<div class="alert alert-info">Hypothesis</div>


* H0: the means of all samples are equal.
* H1: one or more of the sample means are not equal.

<div class="alert alert-warning">SciPy does not implement this test. The statsmodels library provides one as <code>statsmodels.stats.anova.AnovaRM</code>.</div>


<h2 class="list-group-item list-group-item-action active">Nonparametric Statistical Hypothesis Tests</h2>
<h3 class="alert alert-info">Mann-Whitney U Test</h3>
Tests whether the distributions of two independent samples are equal or not.
<div class="alert alert-info">Assumption</div>
a) Observations in each sample are independent and identically distributed. <br>
b) Observations in each sample can be ranked.<br>
 
<div class="alert alert-info">Hypothesis</div>


* H0: the distributions of the two samples are equal.
* H1: the distributions of the two samples are not equal.

<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html" class="btn btn-warning" role="button">Scipy Ref -></a>

In [None]:
# Example of the Mann-Whitney U Test
from scipy.stats import mannwhitneyu
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = mannwhitneyu(data1, data2)
print('stat={0:.3g}, p={1:.3g}'.format(stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
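The U statistic has a simple counting interpretation: it is the number of cross-sample pairs in which the value from one sample exceeds the value from the other, with ties counted as one half. The sketch below computes both counts by hand. Which of the two SciPy reports depends on the SciPy version (recent versions report the count for the first sample, older ones the smaller of the two), so the comparison accepts either.

```python
from scipy.stats import mannwhitneyu

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

# U for data1: pairs (a, b) with a > b count 1, ties count 0.5.
u1 = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in data1 for b in data2)
# The two U values always add up to the number of cross-sample pairs.
u2 = len(data1) * len(data2) - u1

stat, p = mannwhitneyu(data1, data2)
print('u1={0:.1f}, u2={1:.1f}, scipy stat={2:.1f}'.format(u1, u2, stat))
```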

<h3 class="alert alert-info">Wilcoxon Signed-Rank Test</h3>
Tests whether the distributions of two paired samples are equal or not.

<div class="alert alert-info">Assumption</div>
a) Observations in each sample are independent and identically distributed. <br>
b) Observations in each sample can be ranked. <br>
c) Observations across the samples are paired. <br>
 
<div class="alert alert-info">Hypothesis</div>


* H0: the distributions of the two samples are equal.
* H1: the distributions of the two samples are not equal.


<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html" class="btn btn-warning" role="button">Scipy Ref -></a>

In [None]:
# Example of the Wilcoxon Signed-Rank Test
from scipy.stats import wilcoxon
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = wilcoxon(data1, data2)
print('stat={0:.3g}, p={1:.3g}'.format(stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
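The signed-rank statistic can be traced by hand: take the difference within each pair, rank the absolute differences, and add up the ranks separately for the positive and the negative differences. For the default two-sided test SciPy reports the smaller of the two sums. The sketch below assumes no zero differences and no tied absolute differences, which holds for this data, and its variable names are illustrative.

```python
from scipy.stats import wilcoxon

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

diffs = [a - b for a, b in zip(data1, data2)]

# Rank the absolute differences; rank 1 is the smallest (no ties in this data).
order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
ranks = {idx: rank for rank, idx in enumerate(order, start=1)}

r_plus = sum(ranks[i] for i, d in enumerate(diffs) if d > 0)
r_minus = sum(ranks[i] for i, d in enumerate(diffs) if d < 0)
w_manual = min(r_plus, r_minus)

stat, p = wilcoxon(data1, data2)
print('rank sums: plus={0}, minus={1}, scipy stat={2:.1f}'.format(r_plus, r_minus, stat))
```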

<h3 class="alert alert-info">Kruskal-Wallis H Test</h3>
Tests whether the distributions of two or more independent samples are equal or not.

<div class="alert alert-info">Assumption</div>
a) Observations in each sample are independent and identically distributed. <br>
b) Observations in each sample can be ranked. <br>
 
<div class="alert alert-info">Hypothesis</div>


* H0: the distributions of all samples are equal.
* H1: the distributions of one or more samples are not equal.


<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kruskal.html" class="btn btn-warning" role="button">Scipy Ref -></a>

In [None]:
# Example of the Kruskal-Wallis H Test
from scipy.stats import kruskal
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = kruskal(data1, data2)
print('stat={0:.3g}, p={1:.3g}'.format(stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')

<h3 class="alert alert-info">Friedman Test</h3>
Tests whether the distributions of two or more paired samples are equal or not.

<div class="alert alert-info">Assumption</div>
a) Observations in each sample are independent and identically distributed. <br>
b) Observations in each sample can be ranked. <br>
c) Observations across the samples are paired.
 
<div class="alert alert-info">Hypothesis</div>


* H0: the distributions of all samples are equal.
* H1: the distributions of one or more samples are not equal.


<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.friedmanchisquare.html" class="btn btn-warning" role="button">Scipy Ref -></a>

In [None]:
# Example of the Friedman Test
from scipy.stats import friedmanchisquare
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]
stat, p = friedmanchisquare(data1, data2, data3)
print('stat={0:.3g}, p={1:.3f}'.format(stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')

<div class="alert alert-warning">These are some of the key statistical tests that can be used in machine learning projects. They can be used for validating normality, establishing relationships between variables, and detecting differences between samples.</div>

<div class="alert alert-info">Reference-</div>

-> [Weblink](https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/)

