### Hypothesis Testing
- Parametric Statistical Hypothesis Tests
  1. t-test
      * 1. One-sample t-test
          * Left-tailed test
          * Right-tailed test
      * 2. Two-sample t-test
  2. Paired Student’s t-test
  3. Analysis of Variance Test (ANOVA)
- Nonparametric Statistical Hypothesis Tests
  1. Mann-Whitney U Test
  2. Wilcoxon Signed-Rank Test
  3. Kruskal-Wallis H Test
  4. Friedman Test
  5. KS Test
- Normality Tests
  1. Shapiro-Wilk Test
  2. D’Agostino’s K^2 Test
  3. Anderson-Darling Test
- Correlation Tests
  1. Pearson’s Correlation Coefficient
  2. Spearman’s Rank Correlation
  3. Kendall’s Rank Correlation
  4. Chi-Squared Test
  5. Fisher's Exact Test
- Stationarity Tests
  1. Augmented Dickey-Fuller
  2. Kwiatkowski-Phillips-Schmidt-Shin
  
https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/

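#### 1. T Test :
- Used to determine whether a sample mean differs from a reference value, or whether the means of 2 groups differ.
- 2 types
  1. 1 sample T Test : compares the mean of one sample with a hypothesised value.
  2. 2 sample T Test : compares the means of two independent samples.
- Either type can be run as a 1 tailed or a 2 tailed test. The number of tails depends on the alternative hypothesis, not on the number of samples.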

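#### 1.1. 1 Sample T Test :
- Compares the mean of a single sample with a hypothesised population mean.
- It can be run as a one-tailed or a two-tailed test. `ttest_1samp` is two-tailed by default, which is what the marriage-age example in this section uses.
- In a one-tailed test the region of rejection is on only one side of the sampling distribution.
- At a significance level of 0.05, that single tail holds the full 0.05 of the total area under the curve.
- The critical value is either + or -

  1. Left Tailed test : Region of rejection is the extreme left of the sampling distribution.
  2. Right Tailed test : Region of rejection is the extreme right of the sampling distribution.
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html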
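
The left-tailed and right-tailed variants can be sketched with the `alternative` argument of `ttest_1samp`. This assumes SciPy 1.6 or newer (older releases only return the two-tailed p-value) and reuses the marriage-age sample from this section.

```python
from scipy.stats import ttest_1samp

marriage_age_raj = [33,34,35,36,32,28,29,30,31,37,36,35,33,34,31,40,24]

# Two-tailed (default): Ha is "mean != 35"
stat, p_two = ttest_1samp(marriage_age_raj, 35)

# Left-tailed: Ha is "mean < 35"
_, p_left = ttest_1samp(marriage_age_raj, 35, alternative='less')

# Right-tailed: Ha is "mean > 35"
_, p_right = ttest_1samp(marriage_age_raj, 35, alternative='greater')

print('statistic=%.3f' % stat)
print('two-tailed p=%.3f, left-tailed p=%.3f, right-tailed p=%.3f' % (p_two, p_left, p_right))
```

Because the statistic is negative here, the left-tailed p-value is half of the two-tailed one, and the two one-tailed p-values sum to 1.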

In [69]:
from scipy.stats import ttest_1samp

In [70]:
marriage_age_raj = [33,34,35,36,32,28,29,30,31,37,36,35,33,34,31,40,24]

###### Null Hypothesis , H0 = Mean marriage age in Rajasthan is 35

###### Alt Hypothesis , Ha = Mean marriage age in Rajasthan is NOT 35

In [71]:
stat, p = ttest_1samp(marriage_age_raj,35)
print('statistic=%.3f, Pvalue=%.3f' % (stat, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("No evidence that mean marriage age in Rajasthan differs from 35")
else:
    print("We Reject Null Hypothesis")
    print("Mean marriage age in Rajasthan is NOT 35")

statistic=-2.354, Pvalue=0.032
P-value as a percentage =  3.166804359862131 %
We Reject Null Hypothesis
Mean marriage age in Rajasthan is NOT 35


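#### 1.2. 2 Sample T Test : 
- Compares the means of two independent samples.
- `ttest_ind` is two-tailed by default, so the region of rejection is on both sides of the sampling distribution.
- At a significance level of 0.05, each tail holds 0.025 of the total area under the curve.
- Critical values are both + and -
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html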
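
By default `ttest_ind` assumes both groups have the same variance. A minimal sketch of the alternative, Welch's t-test (`equal_var=False`), is shown next to the default for comparison. Whether the equal-variance assumption holds for these two samples is not established here, so treat the comparison as illustrative.

```python
from scipy.stats import ttest_ind

marriage_age_raj = [33,34,35,36,32,28,29,30,31,37,36,35,33,34,31,40,24]
marriage_age_bang = [29,31,28,33,31,34,32,20,32,28,27,26,30,31,34,30]

# Default: pooled-variance (Student's) t-test
stat_pooled, p_pooled = ttest_ind(marriage_age_raj, marriage_age_bang)

# Welch's t-test: does not assume equal population variances
stat_welch, p_welch = ttest_ind(marriage_age_raj, marriage_age_bang, equal_var=False)

print('pooled: statistic=%.3f, Pvalue=%.3f' % (stat_pooled, p_pooled))
print('welch : statistic=%.3f, Pvalue=%.3f' % (stat_welch, p_welch))
```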

In [72]:
from scipy.stats import ttest_ind

In [73]:
marriage_age_raj = [33,34,35,36,32,28,29,30,31,37,36,35,33,34,31,40,24]
marriage_age_bang = [29,31,28,33,31,34,32,20,32,28,27,26,30,31,34,30]

###### Null Hypothesis , H0 = Mean marriage age in Rajasthan and Bangalore is the same. 
###### Alt Hypothesis , Ha = Mean marriage age in Rajasthan and Bangalore is NOT the same. 

In [74]:
stat,p = ttest_ind(marriage_age_raj,marriage_age_bang)
print('statistic=%.3f, Pvalue=%.3f' % (stat, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("No evidence that mean marriage age differs between Rajasthan and Bangalore")
else:
    print("We Reject Null Hypothesis")
    print("Mean marriage age in Rajasthan and Bangalore is NOT the same")

statistic=2.405, Pvalue=0.022
P-value as a percentage =  2.2355127034138325 %
We Reject Null Hypothesis
Mean marriage age in Rajasthan and Bangalore is NOT the same


#### 2. Paired T Test : 
- Used to compare the means of 2 populations when each observation in one sample can be paired with an observation in the other.
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html
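
One way to see what "paired" means: the paired t-test gives the same result as a one-sample t-test on the per-pair differences against 0. The sketch below checks this on the pre/post scores used in this section.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

pre = [88, 82, 84, 93, 75, 78, 84, 87, 95, 91, 83, 89, 77, 68, 91]
post = [91, 84, 88, 90, 79, 80, 88, 90, 90, 96, 88, 89, 81, 74, 92]

# Per-pair differences (pre minus post)
diff = np.array(pre) - np.array(post)

stat_rel, p_rel = ttest_rel(pre, post)
stat_one, p_one = ttest_1samp(diff, 0)

print('paired     : statistic=%.3f, Pvalue=%.3f' % (stat_rel, p_rel))
print('differences: statistic=%.3f, Pvalue=%.3f' % (stat_one, p_one))
```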

In [75]:
from scipy.stats import ttest_rel

In [76]:
pre = [88, 82, 84, 93, 75, 78, 84, 87, 95, 91, 83, 89, 77, 68, 91]
post = [91, 84, 88, 90, 79, 80, 88, 90, 90, 96, 88, 89, 81, 74, 92]

###### Null Hypothesis , H0 = The mean pre-test and post-test scores are equal
###### Alt Hypothesis , Ha = The mean pre-test and post-test scores are not equal

In [77]:
stat,p = ttest_rel(pre,post)
print('statistic=%.3f, Pvalue=%.3f' % (stat, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("No evidence that the mean pre-test and post-test scores differ")
else:
    print("We Reject Null Hypothesis")
    print("The mean pre-test and post-test scores are not equal")

statistic=-2.973, Pvalue=0.010
P-value as a percentage =  1.007144862643272 %
We Reject Null Hypothesis
The mean pre-test and post-test scores are not equal


#### 3. Analysis of Variance (ANOVA) Test : 
- Used to compare the means of more than 2 populations in order to determine whether or not there is a significant difference b/w the means.
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html
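
As a sanity check on how ANOVA relates to the t-test: with only two groups, the one-way ANOVA F statistic equals the square of the pooled-variance two-sample t statistic, and the p-values match. The sketch below treats the two score lists as independent groups purely for illustration.

```python
from scipy.stats import f_oneway, ttest_ind

pre = [88, 82, 84, 93, 75, 78, 84, 87, 95, 91, 83, 89, 77, 68, 91]
post = [91, 84, 88, 90, 79, 80, 88, 90, 90, 96, 88, 89, 81, 74, 92]

f_stat, p_f = f_oneway(pre, post)
t_stat, p_t = ttest_ind(pre, post)  # pooled-variance t-test

print('F=%.4f, t^2=%.4f' % (f_stat, t_stat ** 2))
print('ANOVA Pvalue=%.4f, t-test Pvalue=%.4f' % (p_f, p_t))
```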

In [78]:
from scipy.stats import f_oneway

In [79]:
pre = [88, 82, 84, 93, 75, 78, 84, 87, 95, 91, 83, 89, 77, 68, 91]
post = [91, 84, 88, 90, 79, 80, 88, 90, 90, 96, 88, 89, 81, 74, 92]
In = [87, 88, 92, 85, 80, 90, 82, 84, 96, 81, 73, 78, 75, 82, 79]

###### Null Hypothesis , H0 = The means of all samples are equal
###### Alt Hypothesis , Ha = At least one sample mean differs from the others

In [80]:
stat,p = f_oneway(pre, post, In)
print('statistic=%.3f, Pvalue=%.3f' % (stat, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("No evidence that the sample means differ")
else:
    print("We Reject Null Hypothesis")
    print("At least one sample mean differs from the others")

statistic=0.949, Pvalue=0.395
P-value as a percentage =  39.525674452794746 %
We fail to reject the Null Hypothesis
No evidence that the sample means differ


#### 4. Mann-Whitney U Test
- Tests whether the distributions of two independent samples are equal or not.
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html

In [81]:
from scipy.stats import mannwhitneyu

In [82]:
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

###### Null Hypothesis, H0: the distributions of both samples are equal.
###### Alt Hypothesis, H1: the distributions of both samples are not equal.

In [83]:
stat,p = mannwhitneyu(data1, data2)
print('statistic=%.3f, Pvalue=%.3f' % (stat, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("no evidence that the distributions of the two samples differ")
else:
    print("We Reject Null Hypothesis")
    print("the distributions of both samples are not equal")

statistic=40.000, Pvalue=0.236
P-value as a percentage =  23.63377967557936 %
We fail to reject the Null Hypothesis
no evidence that the distributions of the two samples differ


#### 5. Wilcoxon Signed-Rank Test
- Tests whether the distributions of two paired samples are equal or not.
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html

In [84]:
from scipy.stats import wilcoxon

In [85]:
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

###### Null Hypothesis, H0: the distributions of both samples are equal.
###### Alt Hypothesis, H1: the distributions of both samples are not equal.

In [86]:
stat,p = wilcoxon(data1, data2)
print('statistic=%.3f, Pvalue=%.3f' % (stat, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("no evidence that the distributions of the two samples differ")
else:
    print("We Reject Null Hypothesis")
    print("the distributions of both samples are not equal")

statistic=21.000, Pvalue=0.557
P-value as a percentage =  55.6640625 %
We fail to reject the Null Hypothesis
no evidence that the distributions of the two samples differ


#### 6. Kruskal-Wallis H Test : 
- Tests whether the distributions of two or more independent samples are equal or not.
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kruskal.html

In [87]:
from scipy.stats import kruskal

In [88]:
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

###### Null Hypothesis, H0: the distributions of all samples are equal.
###### Alt Hypothesis, H1: the distributions of one or more samples are not equal.

In [89]:
stat, p = kruskal(data1, data2)
print('statistic=%.3f, Pvalue=%.3f' % (stat, p))
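print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("no evidence that the distributions of the samples differ")
else:
    print("We Reject Null Hypothesis")
    print("the distributions of one or more samples are not equal")

statistic=0.571, Pvalue=0.450
P-value as a percentage =  44.96917979688917 %
We fail to reject the Null Hypothesis
no evidence that the distributions of the samples differ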


#### 7. Friedman Test : 
- Tests whether the distributions of two or more paired samples are equal or not.
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.friedmanchisquare.html

In [90]:
from scipy.stats import friedmanchisquare

In [91]:
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]

###### Null Hypothesis,  H0: the distributions of all samples are equal.
###### Alt Hypothesis, H1: the distributions of one or more samples are not equal.

In [92]:
stat, p = friedmanchisquare(data1, data2, data3)
print('statistic=%.3f, Pvalue=%.3f' % (stat, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("no evidence that the distributions of the samples differ")
else:
    print("We Reject Null Hypothesis")
    print("the distributions of one or more samples are not equal")

statistic=0.800, Pvalue=0.670
P-value as a percentage =  67.03200460356355 %
We fail to reject the Null Hypothesis
no evidence that the distributions of the samples differ


#### 8. KS (Kolmogorov–Smirnov) Test : 
- Used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test).
- The KS test is only valid for continuous distributions
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
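
The two uses described above can be sketched side by side. `kstest(data1, 'norm')` is the one-sample form against a standard normal reference. Passing a second sample instead runs the two-sample form, which is assumed here to behave like `ks_2samp` (true for recent SciPy releases).

```python
from scipy.stats import kstest, ks_2samp

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

# One-sample: compare data1 with the standard normal distribution
stat_norm, p_norm = kstest(data1, 'norm')

# Two-sample: compare data1 with data2, written two equivalent ways
stat_k, p_k = kstest(data1, data2)
stat_2s, p_2s = ks_2samp(data1, data2)

print('one-sample vs norm: statistic=%.3f, Pvalue=%.3f' % (stat_norm, p_norm))
print('two-sample        : statistic=%.3f, Pvalue=%.3f' % (stat_2s, p_2s))
```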

In [93]:
from scipy.stats import kstest

In [94]:
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

###### Null Hypothesis, H0: both samples are drawn from the same distribution.
###### Alt Hypothesis, H1: the two samples are drawn from different distributions.

In [95]:
stat, p = kstest(data1, data2)
print('statistic=%.3f, Pvalue=%.3f' % (stat, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("no evidence that the two samples come from different distributions")
else:
    print("We Reject Null Hypothesis")
    print("the two samples are drawn from different distributions")

statistic=0.400, Pvalue=0.418
P-value as a percentage =  41.752365281777045 %
We fail to reject the Null Hypothesis
no evidence that the two samples come from different distributions


#### 9. Shapiro-Wilk Test : 
- Tests whether a data sample has a Gaussian distribution.
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html

In [96]:
from scipy.stats import shapiro

In [97]:
data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]

###### Null Hypothesis, H0: the sample has a Gaussian distribution.
###### Alt Hypothesis, H1: the sample does not have a Gaussian distribution.

In [98]:
stat, p = shapiro(data)
print('statistic=%.3f, Pvalue=%.3f' % (stat, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("the sample is consistent with a Gaussian distribution")
else:
    print("We Reject Null Hypothesis")
    print("the sample does not have a Gaussian distribution")

statistic=0.895, Pvalue=0.193
P-value as a percentage =  19.340917468070984 %
We fail to reject the Null Hypothesis
the sample is consistent with a Gaussian distribution


#### 10. D’Agostino’s K^2 Test : 
- Tests whether a data sample has a Gaussian distribution.
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html

In [99]:
from scipy.stats import normaltest

In [100]:
data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]

###### Null Hypothesis, H0: the sample has a Gaussian distribution.
###### Alt Hypothesis, H1: the sample does not have a Gaussian distribution.

In [101]:
stat, p = normaltest(data)
print('statistic=%.3f, Pvalue=%.3f' % (stat, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("the sample is consistent with a Gaussian distribution")
else:
    print("We Reject Null Hypothesis")
    print("the sample does not have a Gaussian distribution")


#### 11. Anderson-Darling Test :
- Tests whether a data sample has a Gaussian distribution.
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html

In [102]:
from scipy.stats import anderson

In [103]:
data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]

###### Null Hypothesis, H0: the sample has a Gaussian distribution.
###### Alt Hypothesis, H1: the sample does not have a Gaussian distribution.

In [104]:
result = anderson(data)
print('statistic=%.3f' % (result.statistic))
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < cv:
        print("We fail to reject the Null Hypothesis")
        print('Probably Gaussian at the %.1f%% level' % (sl))
    else:
        print("We Reject Null Hypothesis")
        print('Probably not Gaussian at the %.1f%% level' % (sl))

statistic=0.424
We fail to reject the Null Hypothesis
Probably Gaussian at the 15.0% level
We fail to reject the Null Hypothesis
Probably Gaussian at the 10.0% level
We fail to reject the Null Hypothesis
Probably Gaussian at the 5.0% level
We fail to reject the Null Hypothesis
Probably Gaussian at the 2.5% level
We fail to reject the Null Hypothesis
Probably Gaussian at the 1.0% level


#### 12. Pearson’s Correlation Coefficient : 
- Tests whether two samples have a linear relationship.
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
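
To make "linear relationship" concrete, the coefficient can be computed by hand as the covariance of the two samples divided by the product of their standard deviations. This sketch uses NumPy for the arithmetic and compares the result with `pearsonr`.

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869])
y = np.array([0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579])

# Deviations from the mean
dx = x - x.mean()
dy = y - y.mean()

# r = sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

r_scipy, p = pearsonr(x, y)
print('manual r=%.3f, scipy r=%.3f' % (r_manual, r_scipy))
```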

In [105]:
from scipy.stats import pearsonr

In [106]:
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579]

###### Null Hypothesis, H0: the two samples are independent.
###### Alt Hypothesis, H1: the two samples are not independent.

In [107]:
r, p = pearsonr(data1, data2)
print('correlation coefficient=%.3f, Pvalue=%.3f' % (r, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("no evidence of a linear relationship between the two samples.")
else:
    print("We Reject Null Hypothesis")
    print("the two samples are not independent.")

correlation coefficient=0.688, Pvalue=0.028
P-value as a percentage =  2.787296951449617 %
We Reject Null Hypothesis
the two samples are not independent.


#### 13. Spearman’s Rank Correlation :
- Tests whether two samples have a monotonic relationship.
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html
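
Spearman's coefficient is Pearson's coefficient applied to the ranks of the data, which is why it picks up monotonic rather than strictly linear relationships. A small sketch of that equivalence, using `rankdata` to compute the ranks:

```python
from scipy.stats import spearmanr, pearsonr, rankdata

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579]

# Spearman's rho directly
rho, p = spearmanr(data1, data2)

# Pearson's r computed on the ranks of each sample
r_ranks, _ = pearsonr(rankdata(data1), rankdata(data2))

print('spearman rho=%.3f, pearson on ranks=%.3f' % (rho, r_ranks))
```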

In [108]:
from scipy.stats import spearmanr

In [109]:
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579]

###### Null Hypothesis, H0: the two samples are independent.
###### Alt Hypothesis, H1: the two samples are not independent.

In [110]:
r, p = spearmanr(data1, data2)
print('correlation coefficient=%.3f, Pvalue=%.3f' % (r, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("no evidence of a monotonic relationship between the two samples.")
else:
    print("We Reject Null Hypothesis")
    print("the two samples are not independent.")

correlation coefficient=0.855, Pvalue=0.002
P-value as a percentage =  0.16368033159867143 %
We Reject Null Hypothesis
the two samples are not independent.


#### 14. Kendall’s Rank Correlation : 
- Tests whether two samples have a monotonic relationship.
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html

In [111]:
from scipy.stats import kendalltau

In [112]:
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579]

###### Null Hypothesis, H0: the two samples are independent.
###### Alt Hypothesis, H1: the two samples are not independent.

In [113]:
r, p = kendalltau(data1, data2)
print('correlation coefficient=%.3f, Pvalue=%.3f' % (r, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("no evidence of a monotonic relationship between the two samples")
else:
    print("We Reject Null Hypothesis")
    print("the two samples are not independent")

correlation coefficient=0.733, Pvalue=0.002
P-value as a percentage =  0.2212852733686067 %
We Reject Null Hypothesis
the two samples are not independent


#### 15. Chi-Squared Test : 
- Tests whether two categorical variables are related or independent.
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html
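
Under independence, the expected count in each cell is (row total × column total) / grand total, and the statistic sums (observed - expected)² / expected over all cells. The sketch below works this out with NumPy for the 2x3 table used in this section and compares it with `chi2_contingency`. The comparison relies on no continuity correction being applied, which is the case here because the table has 2 degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[10, 20, 30], [6, 9, 17]])

# Expected counts under independence
row_totals = table.sum(axis=1, keepdims=True)
col_totals = table.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / table.sum()

# Chi-squared statistic: sum of (observed - expected)^2 / expected
stat_manual = ((table - expected_manual) ** 2 / expected_manual).sum()

stat, p, dof, expected = chi2_contingency(table)
print('manual statistic=%.3f, scipy statistic=%.3f, dof=%d' % (stat_manual, stat, dof))
```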

In [114]:
from scipy.stats import chi2_contingency

In [115]:
table = [[10, 20, 30],[6,  9,  17]]

###### Null Hypothesis, H0 : The two variables are independent  
###### Alt Hypothesis, H1 : The two variables are not independent

In [116]:
stat, p, dof, expected = chi2_contingency(table)
print('statistic=%.3f, Pvalue=%.3f' % (stat, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("No evidence that the two variables are related")
else:
    print("We Reject Null Hypothesis")
    print("The two variables are not independent")

statistic=0.272, Pvalue=0.873
P-value as a percentage =  87.3028283380073 %
We fail to reject the Null Hypothesis
No evidence that the two variables are related


#### 16. Fisher's Exact Test : 
- Tests whether two categorical variables are related or independent.
- Used as an alternative to the Chi-Squared test, typically when sample sizes are small.
- Takes a 2x2 contingency table whose elements are non-negative integers.
- link : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html
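
The first value returned by `fisher_exact` is the sample odds ratio of the 2x2 table, (a·d) / (b·c). A quick check by hand on the table used in this section:

```python
from scipy.stats import fisher_exact

table = [[10, 20], [6, 9]]
(a, b), (c, d) = table

# Sample odds ratio of a 2x2 table: (a*d) / (b*c)
odds_manual = (a * d) / (b * c)

odds, p = fisher_exact(table)
print('manual odds ratio=%.3f, scipy odds ratio=%.3f, Pvalue=%.3f' % (odds_manual, odds, p))
```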

In [117]:
from scipy.stats import fisher_exact

In [118]:
table = [[10, 20],[6,  9]]

###### Null Hypothesis, H0 : The two variables are independent  
###### Alt Hypothesis, H1 : The two variables are not independent

In [119]:
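odd, p = fisher_exact(table)
print('odds ratio=%.3f, Pvalue=%.3f' % (odd, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("No evidence that the two variables are related")
else:
    print("We Reject Null Hypothesis")
    print("The two variables are not independent")

odds ratio=0.750, Pvalue=0.746
P-value as a percentage =  74.63051677558627 %
We fail to reject the Null Hypothesis
No evidence that the two variables are related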


#### 17. Augmented Dickey-Fuller : 
- Tests whether a time series has a unit root, i.e. whether it is non-stationary (for example, a random walk).
- link : https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html

In [120]:
from statsmodels.tsa.stattools import adfuller

In [121]:
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

###### Null Hypothesis, H0: a unit root is present (series is non-stationary).
###### Alt Hypothesis, H1: a unit root is not present (series is stationary)

In [122]:
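from math import isnan

stat, p, lags, obs, crit, t = adfuller(data)
print('statistic=%.3f, Pvalue=%.3f' % (stat, p))
print('P-value as a percentage = ', p*100,"%")
if isnan(p):
    # NaN compares False against any threshold, so handle it before the p > 0.05 check
    print("Test is inconclusive: the statistic could not be computed for this series")
elif p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("a unit root may be present (series may be non-stationary).")
else:
    print("We Reject Null Hypothesis")
    print("a unit root is not present (series is stationary).")

statistic=nan, Pvalue=nan
P-value as a percentage =  nan %
Test is inconclusive: the statistic could not be computed for this series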


#### 18. Kwiatkowski-Phillips-Schmidt-Shin : 
- Tests whether a time series is stationary around a constant level (the default, `regression='c'`) or around a deterministic trend (`regression='ct'`).
- link : https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.kpss.html

In [123]:
from statsmodels.tsa.stattools import kpss

In [124]:
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

###### Null Hypothesis, H0: the time series is stationary around a constant level.
###### Alt Hypothesis, H1: the time series has a unit root (is non-stationary).

In [125]:
stat, p, lags, crit = kpss(data)
print('statistic=%.3f, Pvalue=%.3f' % (stat, p))
print('P-value as a percentage = ', p*100,"%")
if p > 0.05:
    print("We fail to reject the Null Hypothesis")
    print("no evidence against stationarity")
else:
    print("We Reject Null Hypothesis")
    print("the time series is not stationary")

statistic=0.410, Pvalue=0.073
P-value as a percentage =  7.2860732917674005 %
We fail to reject the Null Hypothesis
no evidence against stationarity
