The chi-square test of independence is applied when we have two categorical variables measured on the same population. It is used to determine
whether there is a statistically significant association between them.

Here H0: X AND Y ARE INDEPENDENT / UNASSOCIATED
H1: X AND Y ARE DEPENDENT / ASSOCIATED

In [1]:
import scipy.stats as stats
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.stats import chi2

In [2]:
df=pd.read_excel('Lung_Capacity.xlsx')
df

Unnamed: 0,Age,Height,Smoke,Gender,LungCap
0,6,62.1,no,male,6.475
1,18,74.7,yes,female,10.125
2,16,69.7,no,female,9.550
3,14,71.0,no,male,11.125
4,5,56.9,no,male,4.800
...,...,...,...,...,...
720,9,56.0,no,female,5.725
721,18,72.0,yes,male,9.050
722,11,60.5,yes,female,3.850
723,15,64.9,no,female,9.825


In [3]:
#Cross Tabulation --> Contingency Table between Smoke & Gender
df_table=pd.crosstab(df['Smoke'],df['Gender'])
df_table

Smoke \ Gender,female,male
no,314,334
yes,44,33
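
It can help to see the row and column totals next to the counts, since the expected values are built from them. A minimal sketch, assuming we rebuild a row-per-person frame from the four counts above (the name `demo` is illustrative and not part of the original notebook, which reads the Excel file instead):

```python
import pandas as pd

# Rebuild one row per person from the four cell counts shown above.
# This is a reconstruction for illustration, not the original data file.
counts = {("no", "female"): 314, ("no", "male"): 334,
          ("yes", "female"): 44, ("yes", "male"): 33}
rows = [(smoke, gender) for (smoke, gender), n in counts.items() for _ in range(n)]
demo = pd.DataFrame(rows, columns=["Smoke", "Gender"])

# margins=True appends an "All" row and an "All" column holding the totals
table_with_totals = pd.crosstab(demo["Smoke"], demo["Gender"], margins=True)
print(table_with_totals)
```

On the real data the same call would be `pd.crosstab(df['Smoke'], df['Gender'], margins=True)`.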


In [4]:
obs_values=df_table.values
obs_values

array([[314, 334],
       [ 44,  33]], dtype=int64)

In [6]:
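#Expected Value under Null Hypothesis = (Row total * Column total) / Grand Total
#Note: for a 2x2 table chi2_contingency applies Yates' continuity correction by default (correction=True),
#so its statistic and p value differ from the uncorrected Pearson chi square statistic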
val=stats.chi2_contingency(df_table)
val

(1.7443447336666713,
 0.18658926054734912,
 1,
 array([[319.97793103, 328.02206897],
        [ 38.02206897,  38.97793103]]))
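
The statistic returned here (about 1.744) includes Yates' continuity correction, which SciPy applies by default to 2x2 tables. A short sketch, using the observed counts from the contingency table above, of how the corrected and uncorrected results could be compared side by side (variable names are illustrative):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the Smoke x Gender contingency table
observed = np.array([[314, 334],
                     [44, 33]])

# Default call: Yates' continuity correction is applied for 2x2 tables
stat_yates, p_yates, dof, expected = chi2_contingency(observed)

# correction=False gives the plain Pearson chi-square statistic
stat_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)

print("with correction   :", stat_yates, p_yates)
print("without correction:", stat_plain, p_plain)
```

Either way the p value stays above 0.05 for this table, so the conclusion does not change.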

In [7]:
Expected_value=val[3]
Expected_value

array([[319.97793103, 328.02206897],
       [ 38.02206897,  38.97793103]])
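
The expected values can also be reproduced directly from the formula (row total * column total) / grand total. A minimal sketch with NumPy, assuming the observed counts from the contingency table above:

```python
import numpy as np

# Observed counts from the Smoke x Gender contingency table
observed = np.array([[314, 334],
                     [44, 33]])

row_totals = observed.sum(axis=1)   # totals per Smoke category
col_totals = observed.sum(axis=0)   # totals per Gender category
grand_total = observed.sum()

# The outer product pairs every row total with every column total
expected_by_hand = np.outer(row_totals, col_totals) / grand_total
print(expected_by_hand)
```

The result should agree with the array returned by `chi2_contingency`.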

In [8]:
no_of_rows=df_table.shape[0] #Number of rows in the contingency table
no_of_rows

2

In [9]:
no_of_cols=df_table.shape[1] #Number of columns in the contingency table
ddof=(no_of_rows-1)*(no_of_cols-1) #Degrees of freedom = (rows-1)*(cols-1)
ddof

1
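
The degrees of freedom depend only on the shape of the table, so they can be read off in one step. A small sketch, assuming the observed counts as a NumPy array (the names are illustrative):

```python
import numpy as np

# Observed counts from the Smoke x Gender contingency table
observed = np.array([[314, 334],
                     [44, 33]])

# Degrees of freedom = (rows - 1) * (columns - 1)
n_rows, n_cols = observed.shape
dof = (n_rows - 1) * (n_cols - 1)
print(dof)
```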

In [13]:
alpha=0.05

In [10]:
#Calc Chi square statistic = sum of ((Observed-Expected)**2)/Expected over all cells
#Summing row by row here leaves one partial sum per column
chi_square=sum([(o-e)**2/e for o,e in zip(obs_values,Expected_value)])
chi_square

array([1.05154789, 1.02576061])

In [11]:
#Add the per-column sums to get the uncorrected chi square statistic
chi_square_statistic=chi_square[0]+chi_square[1]
chi_square_statistic

2.0773085020480737
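
The two steps above (per-column sums, then adding them) can be collapsed into a single vectorized expression. A sketch with NumPy, assuming the same observed counts and recomputing the expected values from the marginal totals:

```python
import numpy as np

# Observed counts from the Smoke x Gender contingency table
observed = np.array([[314, 334],
                     [44, 33]])

# Expected counts under independence: (row total * column total) / grand total
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Pearson chi-square: sum of (O - E)^2 / E over every cell, no continuity correction
chi_square_vectorized = ((observed - expected) ** 2 / expected).sum()
print(chi_square_vectorized)
```

This is the uncorrected statistic, which is why it is larger than the first value returned by `chi2_contingency` with its default settings.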

In [12]:
#p value = 1 - cdf (Cumulative Distribution Function), i.e. the upper-tail probability
p_value=1-chi2.cdf(x=chi_square_statistic,df=ddof)
p_value

0.14950359042946615
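
An equivalent way to get the upper-tail probability is the survival function `chi2.sf`, and the same decision can be reached by comparing the statistic with the critical value from `chi2.ppf`. A minimal sketch, assuming the statistic and degrees of freedom computed above and a significance level of 0.05:

```python
from scipy.stats import chi2

chi_square_statistic = 2.0773085020480737  # uncorrected statistic from above
dof = 1                                    # (rows - 1) * (cols - 1)
alpha = 0.05                               # significance level

# Survival function = 1 - cdf, computed directly
p_value_sf = chi2.sf(chi_square_statistic, df=dof)

# Critical value: reject H0 only if the statistic exceeds it
critical_value = chi2.ppf(1 - alpha, df=dof)

print(p_value_sf, critical_value)
print("reject H0" if chi_square_statistic > critical_value else "fail to reject H0")
```

The p value rule and the critical value rule always lead to the same decision at a given alpha.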

In [13]:
if p_value < alpha:
    print('Reject Null Hypothesis')
else:
    print('Fail to reject Null Hypothesis') # no significant evidence that x and y are associated

Fail to reject Null Hypothesis
