https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/

In [1]:
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import seaborn as sns
import pandas as pd
import numpy as np

In [2]:
url='https://raw.githubusercontent.com/FazlyRabbiBD/Data-Science-Book/master/data-diabetes.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,glucose,bloodpressure,diabetes
0,40,85,0
1,40,92,0
2,45,63,1
3,45,80,0
4,40,73,1


# **Normality Test**

Assumption: 
* Observations in each sample are independent and identically distributed (iid).

Hypothesis:

* H0: The sample has a normal distribution.

* H1: The sample does not have a normal distribution.

Rules:
if p < 0.05: Reject the H0

**Shapiro-Wilk Test**

In [3]:
#Help from Python
from scipy.stats import shapiro

DataToTest = df['bloodpressure']
stat, p = shapiro(DataToTest)
print('stat=%.6f, p=%.6f' % (stat, p))
if p < 0.05:
    print('Not a normal distribution/ H0 Rejected')   
else:
    print('Normal distribution/ H0 Accepted')

stat=0.970822, p=0.000000
Not a normal distribution/ H0 Rejected
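
To see how the decision rule behaves, here is a minimal sketch that runs the same Shapiro-Wilk test on two synthetic samples. The samples, the seed, and the names `normal_sample` and `skewed_sample` are assumptions made for illustration and are not part of the diabetes dataset.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic data (assumption): one sample drawn from a normal distribution,
# one from a strongly right-skewed exponential distribution.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=200)
skewed_sample = rng.exponential(scale=1.0, size=200)

results = {}
for name, sample in [("normal", normal_sample), ("exponential", skewed_sample)]:
    stat, p = shapiro(sample)
    results[name] = p
    # Same rule as above: reject H0 (normality) when p < 0.05
    verdict = "H0 Rejected" if p < 0.05 else "H0 not rejected"
    print(f"{name}: stat={stat:.4f}, p={p:.4g} -> {verdict}")
```

The exponential sample should be rejected clearly, while the outcome for the normal sample depends on the random draw.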


**D'Agostino's K^2 Normality Test**

In [4]:
# Example of the D'Agostino's K^2 Normality Test
from scipy.stats import normaltest
DataToTest = df['bloodpressure']

stat, p = normaltest(DataToTest)
print('stat=%.6f, p=%.6f' % (stat, p))
if p < 0.05:
    print('Not a normal distribution/ H0 Rejected')   
else:
    print('Normal distribution/ H0 Accepted')

stat=101.061591, p=0.000000
Not a normal distribution/ H0 Rejected


**Anderson-Darling Normality Test**

In [5]:
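# Example of the Anderson-Darling Normality Test
# anderson() returns a statistic and critical values, not a p-value
from scipy.stats import anderson
result = anderson(df['glucose'])
print('stat=%.6f' % result.statistic)
# Compare the statistic with the critical value at each significance level (in %)
for sl, cv in zip(result.significance_level, result.critical_values):
    if result.statistic > cv:
        print('%.1f%%: Not a normal distribution/ H0 Rejected (critical value %.3f)' % (sl, cv))
    else:
        print('%.1f%%: Normal distribution/ H0 Accepted (critical value %.3f)' % (sl, cv))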


# **Correlation Test**

Assumptions

 

*   Observations in each sample are independent and identically distributed (iid)
*   Observations in each sample are normally distributed
* Observations in each sample have the same variance.

Hypothesis:

* H0: Variables are Independent / Not correlated

* H1: Variables are Dependent / Correlated

Rules: if p < 0.05: Reject the H0

In [6]:
df.corr()

Unnamed: 0,glucose,bloodpressure,diabetes
glucose,1.0,-0.164553,0.031585
bloodpressure,-0.164553,1.0,-0.808303
diabetes,0.031585,-0.808303,1.0


**Pearson correlation**

In [10]:
#pearson correlation
from scipy.stats import pearsonr
stat, p = pearsonr(df.bloodpressure, df.glucose)

print('stat=%.6f, p=%9f' % (stat, p))

if p < 0.05:
    print('Variables are Dependent (Correlated)/ H0 Rejected')    
else:
    print('Variables are Independent (Not Correlated)/ H0 Accepted')

stat=-0.164553, p= 0.000000
Variables are Dependent (Correlated)/ H0 Rejected
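
For intuition about what `pearsonr` reports as `stat`, the coefficient can be computed by hand as the covariance of the two variables divided by the product of their standard deviations. This is a minimal sketch on made-up numbers; the lists `x` and `y` are hypothetical and unrelated to the diabetes data.

```python
from math import sqrt

# Hypothetical illustration data (assumption, not from the dataset)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Sum of cross-products and sums of squares of the deviations from the means
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

r = sxy / sqrt(sxx * syy)
print('r=%.4f' % r)
```

A small `r` can still give a very small p-value on a large sample, which is why the cell above rejects H0 even though the correlation is only about -0.16.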


**Spearman Rank Correlation**

Assumptions:

* Observations in each sample are independent and identically distributed (iid).
* Observations in each sample can be ranked.

In [11]:
#Spearman Rank Correlation
from scipy.stats import spearmanr
stat, p = spearmanr(df.bloodpressure, df.glucose)

print('stat=%.6f, p=%6f' % (stat, p))
if p < 0.05:
    print('Variables are Dependent (Correlated)/ H0 Rejected')    
else:
    print('Variables are Independent (Not Correlated)/ H0 Accepted')

stat=-0.130079, p=0.000039
Variables are Dependent (Correlated)/ H0 Rejected
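
Spearman's coefficient is Pearson's coefficient applied to the ranks of the observations, which is why it only requires that the data can be ranked. The sketch below checks this on hypothetical lists `x` and `y` (assumptions for illustration).

```python
from scipy.stats import pearsonr, rankdata, spearmanr

# Hypothetical illustration data (assumption, not from the dataset)
x = [10, 20, 30, 40, 50, 60]
y = [1.2, 0.9, 2.5, 2.4, 4.0, 3.1]

# Spearman's rho directly
rho, p = spearmanr(x, y)

# Pearson's r computed on the ranks of the same data
r_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))

print('spearman rho=%.6f, pearson on ranks=%.6f' % (rho, r_on_ranks))
```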


**Kendall's Rank Correlation Test**

In [12]:
# Example of the Kendall's Rank Correlation Test
from scipy.stats import kendalltau
stat, p = kendalltau(df.bloodpressure, df.glucose)

print('stat=%.6f, p=%6f' % (stat, p))
if p < 0.05:
    print('Variables are Dependent (Correlated)/ H0 Rejected')    
else:
    print('Variables are Independent (Not Correlated)/ H0 Accepted')

stat=-0.096423, p=0.000058
Variables are Dependent (Correlated)/ H0 Rejected


# **Categorical Relationship: Chi square test**

Assumptions:

* Observations used in the calculation of the contingency table are independent.
* 25 or more examples in each cell of the contingency table.

Hypothesis:

* H0: Variables are Independent / Not correlated

* H1: Variables are Dependent / Correlated

In [13]:
url='https://raw.githubusercontent.com/FazlyRabbiBD/Data-Science-Book/master/data-drugs.csv'
df1 = pd.read_csv(url)
df1.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,drugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,drugY


In [15]:
contingency_data = pd.crosstab(df1['Sex'], df1['BP'])
contingency_data 

BP,HIGH,LOW,NORMAL
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
F,38,28,30
M,39,36,29


In [16]:
from scipy.stats import chi2_contingency
# chi2_contingency expects the table of counts, not the raw columns
stat, p, dof, expected = chi2_contingency(contingency_data)
print('stat=%.3f, p=%.3f' % (stat, p))
if p < 0.05:
    print('Variables are Dependent / H0 Rejected')    
else:
    print('Variables are Independent / H0 Accepted')

stat=0.711, p=0.701
Variables are Independent / H0 Accepted
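
The statistic above can be reproduced by hand from the contingency table: each expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The sketch below types in the Sex × BP counts shown earlier; the variable names are my own.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts copied from the Sex x BP crosstab (rows F, M; columns HIGH, LOW, NORMAL)
observed = np.array([[38, 28, 30],
                     [39, 36, 29]])

# Expected counts under independence
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square statistic, degrees of freedom, and upper-tail p-value
stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p = chi2.sf(stat, dof)
print('stat=%.3f, p=%.3f, dof=%d' % (stat, p, dof))

# Cross-check against scipy's implementation
stat_scipy = chi2_contingency(observed)[0]
```

Printing `expected` is also a quick way to check the cell-count assumption before trusting the test.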


# **Parametric: Z-Test**

https://inblog.in/Hypothesis-Testing-using-Python-RqrE4uDqMe

In [23]:
import pandas as pd
from scipy import stats
df2=pd.read_csv('https://raw.githubusercontent.com/jeevanhe/Data-Science-Statistical-Methods/master/systolic%20blood%20pressure.csv')

In [18]:
df2.head()

Unnamed: 0,armsys,fingsys
0,140,154
1,110,112
2,138,156
3,124,152
4,142,142


In [24]:
df2.armsys.mean()

128.52

In [25]:
df2.fingsys.mean()

132.815

**Z-Test: One Sample**

Assumptions:

Hypothesis:
* H0: Mean is equal to the given value.
* H1: Mean is not equal to the given value.

Rules: Reject the H0 when p < 0.05

In [29]:
from statsmodels.stats import weightstats as stests
ztest ,pval = stests.ztest(df2['armsys'], value=126)

print('p=%.6f' % (pval))
if pval < 0.05:
    print("Mean is NOT EQUAL to the given value / H0 Rejected")
else:
    print("Mean is  EQUAL to the given value/ H0 Accepted")

p=0.125930
Mean is  EQUAL to the given value/ H0 Accepted
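
As a rough sketch of what a one-sample z-test computes, the statistic is the distance between the sample mean and the hypothesised value, measured in standard errors, and the two-sided p-value comes from the standard normal distribution. The readings in `sample` are hypothetical, and using the sample standard deviation (ddof=1) is an assumption intended to mirror the statsmodels default.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical blood-pressure readings (assumption, not taken from df2)
sample = [128, 131, 125, 140, 122, 135, 130, 127]
value = 126  # hypothesised population mean

# z = (sample mean - hypothesised mean) / standard error
standard_error = stdev(sample) / sqrt(len(sample))
z = (mean(sample) - value) / standard_error

# Two-sided p-value from the standard normal distribution
p = 2 * (1 - NormalDist().cdf(abs(z)))
print('z=%.4f, p=%.4f' % (z, p))
```

With only eight observations a t-test would normally be the better choice; the small sample is used here only to keep the arithmetic easy to follow.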


**Z-Test: Independent Sample**

In [30]:
ztest ,pval1 = stests.ztest(df2['armsys'], x2=df2['fingsys'], value=0,alternative='two-sided')
print(pval1)
print('p=%.6f' % (pval1))
if pval1 < 0.05:
    print("Means are NOT EQUAL / H0 Rejected")
else:
    print("Means are EQUAL / H0 Accepted")

0.07954652069053099
p=0.079547
Means are EQUAL / H0 Accepted


# **Parametric: T-Test**

**T-test: One Sample**

Assumptions:
* Observations are independent and identically distributed (iid).
* Observations are normally distributed.
* Sample size is large

Hypothesis:
* H0: the sample mean is equal to the given value.
* H1: the sample mean is not equal to the given value.

Rules: Reject the H0 when p < 0.05

In [31]:
df1.Age.mean()

44.315

In [36]:
from scipy.stats import ttest_1samp
from scipy import stats
import numpy as np
tset, pval = ttest_1samp(df1["Age"], 43)
print('p=',pval)
if pval < 0.05:    # alpha value is 0.05 or 5%
    print("Mean is  NOT EQUAL to the given value /H0 Rejected")
else:
    print("Mean is  EQUAL to the given value /H0 Accepted")

p= 0.26233895951766983
Mean is  EQUAL to the given value /H0 Accepted


**T-test: Independent Sample**

Assumptions:

* Observations in each sample are independent and identically distributed (iid).
* Observations in each sample are normally distributed.
* Observations in each sample have the same variance.
* Sample size is large


In [38]:
df1.groupby('Sex')['Na_to_K'].mean()

Sex
F    17.022062
M    15.219029
Name: Na_to_K, dtype: float64

In [39]:
female=df1.query('Sex=="F"')["Na_to_K"]
male=df1.query('Sex=="M"')["Na_to_K"]

In [41]:
# Example of the Student's t-test
from scipy.stats import ttest_ind

stat, p = ttest_ind(male, female)

# Use the p-value returned by this test, not pval left over from the one-sample cell
print('stat=%.3f, p=%.6f' % (stat, p))
if p < 0.05:    # alpha value is 0.05 or 5%
    print("Means are NOT EQUAL/ H0 Rejected")
else:
    print("Means are  EQUAL/H0 Accepted")


**T-test: Paired**

Assumptions

* Observations in each sample are independent and identically distributed (iid).
* Observations in each sample are normally distributed.
* Observations in each sample have the same variance.
* Observations across each sample are paired.

Hypothesis:
* H0: the means of the samples are equal.
* H1: the means of the samples are unequal.

In [42]:
PreCOVIDIncome = [18,27,34,26,55,40,70,18,105,45]
PostCOVIDIncome = [18,15,25,26,55,20,80,10,105,33]

In [43]:
# Example of the Paired Student's t-test
from scipy.stats import ttest_rel

stat, p = ttest_rel(PreCOVIDIncome, PostCOVIDIncome)
print('stat=%.3f, p=%.3f' % (stat, p))
if p < 0.05:
	print('Means are NOT EQUAL/ H0 Rejected')
else:
	print('Means are  EQUAL/ H0 Accepted')

stat=1.865, p=0.095
Means are  EQUAL/ H0 Accepted
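
A paired t-test is equivalent to a one-sample t-test on the pairwise differences, tested against zero. The sketch below recomputes the statistic from the two income lists above; the variable names are my own.

```python
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t

pre = [18, 27, 34, 26, 55, 40, 70, 18, 105, 45]
post = [18, 15, 25, 26, 55, 20, 80, 10, 105, 33]

# Work on the differences within each pair
diffs = [a - b for a, b in zip(pre, post)]
n = len(diffs)

# t = mean difference / standard error of the mean difference
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p = 2 * t.sf(abs(t_stat), df=n - 1)
print('stat=%.3f, p=%.3f' % (t_stat, p))
```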


# **Parametric: ANOVA Test / F-Test**

Tests whether the means of ***two or more*** independent samples are significantly different.

Assumptions:
* Observations in each sample are independent and identically distributed (iid).
* Observations in each sample are normally distributed.
* Observations in each sample have the same variance.

Hypothesis:
* H0: the means of the samples are equal.
* H1: one or more of the means of the samples are unequal.

**ANOVA-One Way**

In [44]:
# Example of the Analysis of Variance Test
from scipy.stats import f_oneway
ResultJony = [75,88,56,78,91]
ResultAnik = [70,44,88,55,74]
ResultTony= [55,76,88,90,91]
stat, p = f_oneway(ResultJony, ResultAnik, ResultTony)
print('stat=%.3f, p=%.3f' % (stat, p))
if p < 0.05:
	print('Means are NOT EQUAL/ H0 Rejected')
else:
	print('Means are EQUAL/ H0 Accepted')

stat=1.142, p=0.352
Means are EQUAL/ H0 Accepted
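
The F statistic is the ratio of the between-group mean square to the within-group mean square. As a sketch, it can be rebuilt from the three result lists above without `f_oneway`; the variable names are my own.

```python
from scipy.stats import f

groups = [
    [75, 88, 56, 78, 91],   # ResultJony
    [70, 44, 88, 55, 74],   # ResultAnik
    [55, 76, 88, 90, 91],   # ResultTony
]

n_total = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / n_total

# Between-group variation: how far each group mean is from the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Within-group variation: how far each observation is from its own group mean
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

df_between = len(groups) - 1
df_within = n_total - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)
p = f.sf(f_stat, df_between, df_within)
print('stat=%.3f, p=%.3f' % (f_stat, p))
```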


In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols


**ANOVA-Two Way**

In [None]:
df_anova2 = pd.read_csv("https://raw.githubusercontent.com/Opensourcefordatascience/Data-sets/master/crop_yield.csv")
df_anova2.head()

Unnamed: 0,Fert,Water,Yield
0,A,High,27.4
1,A,High,33.6
2,A,High,29.8
3,A,High,35.2
4,A,High,33.0


In [None]:
model = ols('Yield ~ C(Fert)*C(Water)', df_anova2).fit()
print(f"Overall model F({model.df_model: .0f},{model.df_resid: .0f}) = {model.fvalue: .3f}, p = {model.f_pvalue: .4f}")
res = sm.stats.anova_lm(model, typ= 2)
res

Overall model F( 3, 16) =  4.112, p =  0.0243


Unnamed: 0,sum_sq,df,F,PR(>F)
C(Fert),69.192,1.0,5.766,0.028847
C(Water),63.368,1.0,5.280667,0.035386
C(Fert):C(Water),15.488,1.0,1.290667,0.272656
Residual,192.0,16.0,,


# **Non-parametric**

**Mann-Whitney U Test**

Assumptions
* Observations in each sample are independent and identically distributed (iid).
* Observations in each sample can be ranked.

Hypothesis:
* H0: The distributions of both samples are equal.
* H1: The distributions of both samples are not equal.

Rules: Reject the H0 when p < 0.05
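
The U statistic counts, over all pairs made of one value from each sample, how often the value from the first sample is larger (ties count one half). Here is a minimal sketch of that count on the same two samples used in this section. Depending on the SciPy version, `mannwhitneyu` may report the U of the first sample or the smaller of the two, so both are computed.

```python
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

# U for the first sample: pairs where the data1 value beats the data2 value
u1 = sum(1.0 if x > y else 0.5 if x == y else 0.0
         for x in data1 for y in data2)
# The two U values always add up to the number of pairs
u2 = len(data1) * len(data2) - u1

print('U1=%.1f, U2=%.1f' % (u1, u2))
```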

In [None]:
# Example of the Mann-Whitney U Test
from scipy.stats import mannwhitneyu
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = mannwhitneyu(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p < 0.05:
    print('Distributions are NOT EQUAL / H0 Rejected')
else:
    print('Distributions are EQUAL / H0 Accepted')

stat=40.000, p=0.236
Distributions are EQUAL / H0 Accepted


**Wilcoxon Signed-Rank Test**

Assumptions

* Observations in each sample are independent and identically distributed (iid).
* Observations in each sample can be ranked.
* Observations across each sample are paired.

In [None]:
# Example of the Wilcoxon Signed-Rank Test
from scipy.stats import wilcoxon
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = wilcoxon(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p < 0.05:
    print('Distributions are NOT EQUAL / H0 Rejected')
else:
    print('Distributions are EQUAL / H0 Accepted')

stat=21.000, p=0.508
Distributions are EQUAL / H0 Accepted


**Kruskal-Wallis H Test**

Assumptions:

* Observations in each sample are independent and identically distributed (iid).
* Observations in each sample can be ranked.

In [None]:
from scipy.stats import kruskal
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = kruskal(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p < 0.05:
    print('Distributions are NOT EQUAL / H0 Rejected')
else:
    print('Distributions are EQUAL / H0 Accepted')

stat=0.571, p=0.450
Distributions are EQUAL / H0 Accepted


**Friedman Test**

Assumptions

* Observations in each sample are independent and identically distributed (iid).
* Observations in each sample can be ranked.
* Observations across each sample are paired.

In [None]:
# Example of the Friedman Test
from scipy.stats import friedmanchisquare
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]
stat, p = friedmanchisquare(data1, data2, data3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p < 0.05:
    print('Distributions are NOT EQUAL / H0 Rejected')
else:
    print('Distributions are EQUAL / H0 Accepted')

stat=0.800, p=0.670
Distributions are EQUAL / H0 Accepted
