<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Chi-Square-Test" data-toc-modified-id="Chi-Square-Test-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Chi-Square Test</a></span></li><li><span><a href="#Chi-Squared-from-Scratch" data-toc-modified-id="Chi-Squared-from-Scratch-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Chi-Squared from Scratch</a></span></li></ul></div>

# Chi-Square Test
The chi-square test for independence applies when you have two categorical variables measured on a single population. It tests whether there is a statistically significant association between the two variables, with the null hypothesis H0 being that they are independent.
For example, in an election survey, voters might be classified by gender (male or female) and voting preference (Democrat, Republican, or Independent). A chi-square test for independence could then be used to determine whether gender is related to voting preference.


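$\chi_{c}^{2}=\sum \frac{\left(O_{i}-E_{i}\right)^{2}}{E_{i}}$

Here the sum runs over all cells of the contingency table, $O_{i}$ is the observed count in cell $i$, and $E_{i}$ is the count expected in that cell if the two variables were independent (row total × column total / grand total).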
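As a rough illustration of the formula, the sketch below evaluates the statistic with NumPy for the election example. The counts are hypothetical (they do not come from this notebook's data), and the variable names are chosen only for this sketch.

```python
import numpy as np

# Hypothetical survey counts, for illustration only:
# rows = gender (male, female),
# columns = preference (Democrat, Republican, Independent)
votes = np.array([[200, 150, 50],
                  [250, 300, 50]], dtype=float)

row_totals = votes.sum(axis=1)
col_totals = votes.sum(axis=0)

# Expected counts under independence: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / votes.sum()

# Sum of (O - E)^2 / E over every cell of the table
statistic = ((votes - expected) ** 2 / expected).sum()

print(expected)
print(f"chi-square statistic = {statistic:.4f}")
```

A larger statistic means the observed counts sit further from what independence would predict.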

In [38]:
import numpy as np
import pandas as pd
import seaborn as sns
import os,sys,time
import scipy
import statsmodels

from scipy import stats
from scipy.stats import ttest_1samp
from scipy.stats import ttest_ind # independent means two samples.
from statsmodels.stats import weightstats as stests # stests.ztest

SEED = 100
pd.set_option('display.max_columns',100)
pd.set_option('plotting.backend','plotly') # matplotlib, bokeh, altair, plotly
%load_ext watermark
%watermark -iv

scipy       1.4.1
autopep8    1.5.2
seaborn     0.11.0
json        2.0.9
numpy       1.18.4
statsmodels 0.12.0
pandas      1.1.0



In [9]:
df = pd.read_csv('data/chi_test.csv')
df.columns = ["Gender", "Shopping"]
print(df.shape)
df.head()

(9, 2)


|   | Gender | Shopping |
|---|--------|----------|
| 0 | Male   | No       |
| 1 | Female | Yes      |
| 2 | Male   | Yes      |
| 3 | Female | Yes      |
| 4 | Female | Yes      |


In [12]:
df_contingency_table = pd.crosstab(df["Gender"],df["Shopping"])
df_contingency_table


| Gender \ Shopping | No | Yes |
|-------------------|----|-----|
| Female            | 2  | 3   |
| Male              | 2  | 2   |


In [13]:
#Observed Values
Observed_Values = df_contingency_table.values 
print("Observed Values :-\n",Observed_Values)

Observed Values :-
 [[2 3]
 [2 2]]


In [17]:
# stats.chi2_contingency?
_ = """
Returns
-------
chi2 : float
    The test statistic.
p : float
    The p-value of the test
dof : int
    Degrees of freedom
expected : ndarray, same shape as `observed`
    The expected frequencies, based on the marginal sums of the table.

""";

In [27]:
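# Note: for a 2x2 table (dof = 1), chi2_contingency applies Yates'
# continuity correction by default (correction=True).
chi2, p, dof, Expected_Values = stats.chi2_contingency(df_contingency_table)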

print(f"chi2 = {chi2}")
print(f"p    = {p}")
print(f"dof  = {dof}")
print("Expected Values =\n", Expected_Values)

chi2 = 0.1406249999999999
p    = 0.7076604666545525
dof  = 1
Expected Values =
 [[2.22222222 2.77777778]
 [1.77777778 2.22222222]]


# Chi-Squared from Scratch

In [40]:
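# Row and column counts by explicit slicing (hard-coded for a 2x2 table)
nrows = len(df_contingency_table.iloc[0:2, 0])
ncols = len(df_contingency_table.iloc[0, 0:2])

print(f"nrows,ncols = {nrows,ncols}")
# Equivalent, and works for a table of any size
nrows, ncols = df_contingency_table.shape
print(f"nrows,ncols = {nrows,ncols}")

# Degrees of freedom for a test of independence: (rows - 1) * (columns - 1)
ddof = (nrows-1) * (ncols-1)

print("Degree of Freedom = ",ddof)
alpha = 0.05  # significance level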

nrows,ncols = (2, 2)
nrows,ncols = (2, 2)
Degree of Freedom =  1


In [28]:
# Per-cell terms (O - E)^2 / E, summed down each column
chi_sq = ((Observed_Values - Expected_Values)**2 / Expected_Values).sum(axis=0)
chi_sq

array([0.05, 0.04])

In [29]:
chi_sq_statistic = chi_sq[0] + chi_sq[1]
print("chi-square statistic:-",chi_sq_statistic)

chi-square statistic:- 0.09000000000000008
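The hand-computed value of 0.09 is lower than the 0.1406 that `stats.chi2_contingency` reported earlier. The difference is consistent with Yates' continuity correction, which SciPy applies by default (`correction=True`) when the table has one degree of freedom. A minimal sketch that turns the correction off and should reproduce the uncorrected statistic; the observed counts are copied from the contingency table above so the block runs on its own, and the variable names are illustrative:

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table above
# (rows: Female, Male; columns: No, Yes)
observed = np.array([[2, 3],
                     [2, 2]])

# correction=False disables Yates' continuity correction,
# leaving the plain sum of (O - E)^2 / E
chi2_plain, p_plain, dof_plain, expected_plain = stats.chi2_contingency(
    observed, correction=False
)
print(f"chi2 = {chi2_plain:.4f}, p = {p_plain:.4f}, dof = {dof_plain}")
```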


In [30]:
critical_value = stats.chi2.ppf(q=1-alpha,df=ddof)
print('critical_value:',critical_value)

critical_value: 3.841458820694124


In [32]:
p_value= 1 - stats.chi2.cdf(x=chi_sq_statistic,df=ddof)
print('p-value:',p_value)

p-value: 0.7641771556220945
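The same upper-tail probability is also available through the survival function `stats.chi2.sf`, which SciPy defines as `1 - cdf` and which tends to be more accurate when the p-value is very small. A brief sketch, with the statistic and degrees of freedom from above entered as literal values:

```python
from scipy import stats

chi_sq_value = 0.09   # statistic computed above (rounded)
dof_value = 1         # (rows - 1) * (columns - 1) for a 2x2 table

p_from_cdf = 1 - stats.chi2.cdf(chi_sq_value, df=dof_value)
p_from_sf = stats.chi2.sf(chi_sq_value, df=dof_value)
print(p_from_cdf, p_from_sf)
```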


In [35]:
print('Significance level  : ',alpha)
print('Degree of Freedom   : ',ddof)
print('chi-square statistic: ',chi_sq_statistic)
print('critical_value      : ',critical_value)
print('p-value             : ',p_value)

Significance level  :  0.05
Degree of Freedom   :  1
chi-square statistic:  0.09000000000000008
critical_value      :  3.841458820694124
p-value             :  0.7641771556220945


In [36]:
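# Decision by critical value: reject H0 only if the statistic reaches it
if chi_sq_statistic>=critical_value:
    print("Reject H0: significant association between the two categorical variables")
else:
    print("Fail to reject H0: no significant association detected between the two categorical variables")

Fail to reject H0: no significant association detected between the two categorical variables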


In [37]:
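# Decision by p-value: reject H0 only if the p-value is at most alpha
if p_value<=alpha:
    print("Reject H0: significant association between the two categorical variables")
else:
    print("Fail to reject H0: no significant association detected between the two categorical variables")

Fail to reject H0: no significant association detected between the two categorical variables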
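The critical-value rule and the p-value rule are two views of the same comparison, so they should lead to the same decision. The sketch below checks this for the statistic from this notebook and for one larger, made-up statistic. The helper name `decide` and the second input are assumptions introduced here for illustration.

```python
from scipy import stats

def decide(statistic, dof, alpha=0.05):
    """Return (reject_by_critical_value, reject_by_p_value)."""
    critical_value = stats.chi2.ppf(1 - alpha, df=dof)
    p_value = stats.chi2.sf(statistic, df=dof)
    reject_by_critical_value = bool(statistic >= critical_value)
    reject_by_p_value = bool(p_value <= alpha)
    return reject_by_critical_value, reject_by_p_value

# Statistic and degrees of freedom from this notebook
small_case = decide(0.09, dof=1)

# Made-up larger statistic with 2 degrees of freedom, for contrast
large_case = decide(16.2, dof=2)

print(small_case, large_case)
```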
