
# CHI-SQUARE TEST OF INDEPENDENCE

The $\chi^2$ test of independence tests for dependence between categorical variables and is an omnibus test. This means that if a significant relationship is found and one wants to test for differences between groups, post-hoc testing will need to be conducted. Typically, a proportions test is used as the follow-up post-hoc test.

The $\chi^2$ test of independence uses a cross tabulation table of the variables of interest, with $r$ rows and $c$ columns. Based on the cell counts, it is possible to test whether there is a relationship (dependence) between the variables and to estimate the strength of that relationship. This is done by testing the difference between the expected count, $E$, and the observed count, $O$. The subscript $i$ denotes the row group, i.e. $r_i$, and $j$ denotes the column group, i.e. $c_j$, so each cell carries the subscripts of its row and column group, i.e. the cell at $r_1$ and $c_1$ has observed count $O_{11}$. The region by age category cross tabulation built later in this demonstration serves as the example.
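
The comparison of observed and expected counts can be written out explicitly. What follows is the standard Pearson formulation rather than anything specific to this notebook; the symbols $n_{i\cdot}$ (total of row $i$), $n_{\cdot j}$ (total of column $j$), and $N$ (grand total) are notation introduced here for illustration, alongside $O_{ij}$ and $E_{ij}$ for the observed and expected count in the cell at row $i$, column $j$:

$$
E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{N}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
df = (r - 1)(c - 1)
$$

Under the null hypothesis of independence, the statistic approximately follows a $\chi^2$ distribution with $df$ degrees of freedom.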

$\chi^2$ test of independence assumptions
*   The two samples are independent
*   No expected cell count is equal to 0
*   No more than 20% of the cells have an expected cell count < 5

Hypothesis
*   Ho: Variables are independent
*   Ha: Variables are dependent


This demonstration will cover how to conduct a $\chi^2$ test of independence using scipy.stats and researchpy. First, let's import pandas, statsmodels.api, scipy.stats, and researchpy, and load the data for this demonstration.

The data used in this example comes from Stata and is 1980 U.S. census data from 956 cities.

In [2]:
import pandas as pd

#Installs researchpy from within the notebook; comment this line out if the module is already installed
!pip install -q researchpy

import researchpy as rp
import scipy.stats as stats

# To load a sample dataset for this demonstration
import statsmodels.api as sm

df = sm.datasets.webuse("citytemp2")



In [3]:
#Previewing the data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 956 entries, 0 to 955
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   division  956 non-null    category
 1   region    956 non-null    category
 2   heatdd    953 non-null    float64 
 3   cooldd    953 non-null    float64 
 4   tempjan   954 non-null    float32 
 5   tempjuly  954 non-null    float32 
 6   agecat    956 non-null    category
dtypes: category(3), float32(2), float64(2)
memory usage: 33.3 KB


Research question: Is there a relationship between region and age category? Before testing this relationship, let's look at some basic univariate statistics.

In [4]:
rp.summary_cat(df[["agecat", "region"]])

#The majority of cities fall in the 19-29 age category, while the regions are fairly similar in size
#except for the Northeast, which has the fewest cities.

| Variable | Outcome | Count | Percent |
|----------|---------|-------|---------|
| agecat   | 19-29   | 507   | 53.03   |
|          | 30-34   | 316   | 33.05   |
|          | 35+     | 133   | 13.91   |
| region   | N Cntrl | 284   | 29.71   |
|          | West    | 256   | 26.78   |
|          | South   | 250   | 26.15   |
|          | NE      | 166   | 17.36   |


##### CHI-SQUARE TEST OF INDEPENDENCE WITH SCIPY.STATS
*   The function to use is scipy.stats.chi2_contingency.
*   This function requires a cross tabulation table as input, which can be built with pandas.crosstab.



In [5]:
crosstab = pd.crosstab(df["region"], df["agecat"])

crosstab

#stats.chi2_contingency (run in the next cell) returns its results in a tuple: the first value is
#the Chi-Square test statistic, the second value is the p-value, and the third value is the degrees of freedom.
#The fourth value is an array which contains the expected cell counts

| region  | agecat: 19-29 | agecat: 30-34 | agecat: 35+ |
|---------|---------------|---------------|-------------|
| NE      | 46            | 83            | 37          |
| N Cntrl | 162           | 92            | 30          |
| South   | 139           | 68            | 43          |
| West    | 160           | 73            | 23          |


In [6]:
stats.chi2_contingency(crosstab)

(61.28767688406036,
 2.463382670201326e-11,
 6,
 array([[ 88.03556485,  54.87029289,  23.09414226],
        [150.61506276,  93.87447699,  39.51046025],
        [132.58368201,  82.63598326,  34.78033473],
        [135.76569038,  84.61924686,  35.61506276]]))

There is a relationship between region and the age distribution, $\chi^2(6) = 61.29$, $p < 0.0001$.
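
To make the values returned by `chi2_contingency` less of a black box, here is a minimal sketch that recomputes the statistic by hand from the observed counts in the cross tabulation above. The counts are typed in rather than loaded from the dataset, and the variable names (`observed`, `expected_manual`, and so on) are illustrative only.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the region x agecat cross tabulation
# (rows: NE, N Cntrl, South, West; columns: 19-29, 30-34, 35+)
observed = np.array([
    [46, 83, 37],
    [162, 92, 30],
    [139, 68, 43],
    [160, 73, 23],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = row_totals * col_totals / n

# Pearson chi-square statistic, degrees of freedom, and upper-tail p-value
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

# Unpack the library result into named variables for comparison
chi2, p, dof, expected = stats.chi2_contingency(observed)

print(chi2_manual, dof_manual, p_manual)
print(np.allclose(expected_manual, expected), np.isclose(chi2_manual, chi2))
```

Unpacking the result into four names, as in the last step, also avoids having to index into the returned tuple.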

##### CHI-SQUARE TEST OF INDEPENDENCE WITH RESEARCHPY



The function to use is researchpy.crosstab. For cleaner output, one can assign each requested object from the returned tuple to its own variable and then display those separately. The expected cell counts will be requested and used later when checking the assumptions for this statistical test. Additionally, we will request that the cross tabulation be returned with cell percentages instead of cell counts.

In [7]:
crosstab, test_results, expected = rp.crosstab(df["region"], df["agecat"],
                                               test= "chi-square",
                                               expected_freqs= True,
                                               prop= "cell")

crosstab

| region  | agecat: 19-29 | agecat: 30-34 | agecat: 35+ | agecat: All |
|---------|---------------|---------------|-------------|-------------|
| NE      | 4.81          | 8.68          | 3.87        | 17.36       |
| N Cntrl | 16.95         | 9.62          | 3.14        | 29.71       |
| South   | 14.54         | 7.11          | 4.5         | 26.15       |
| West    | 16.74         | 7.64          | 2.41        | 26.78       |
| All     | 53.03         | 33.05         | 13.91       | 100.0       |


In [8]:
test_results

#The one piece of information that researchpy calculates and scipy.stats does not is a measure of
#the strength of the relationship, akin to a correlation statistic such as Pearson's correlation coefficient.
#Here Cramer's V = 0.179, which falls in the "Strong" band of the table below.
#The p-value is displayed as 0.0 because of rounding; scipy.stats reported 2.46e-11.

#Phi and Cramer's V    Interpretation
#>0.25                 Very strong
#>0.15                 Strong
#>0.10                 Moderate
#>0.05                 Weak
#>0                    No or very weak

| Chi-square test            | results |
|----------------------------|---------|
| Pearson Chi-square (6.0) = | 61.2877 |
| p-value =                  | 0.0     |
| Cramer's V =               | 0.179   |
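
The Cramer's V value reported above can be reproduced from the chi-square statistic. This is a minimal sketch using the standard formula $V = \sqrt{\chi^2 / (N \cdot \min(r - 1, c - 1))}$; whether researchpy applies exactly this form internally is an assumption, although the result agrees with its output. The variable names are illustrative.

```python
import math

chi2_stat = 61.2877    # Pearson chi-square reported in the results above
n_obs = 956            # number of cities in the dataset
n_rows, n_cols = 4, 3  # regions x age categories

# Cramer's V: chi-square scaled by sample size and the smaller table dimension minus 1
cramers_v = math.sqrt(chi2_stat / (n_obs * min(n_rows - 1, n_cols - 1)))
print(round(cramers_v, 3))  # 0.179
```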


##### ASSUMPTION CHECK

*   The two samples are independent - The variables were collected independently of each other, i.e. the answer for one variable did not depend on the answer for the other
*   No expected cell count is equal to 0
*   No more than 20% of the cells have an expected cell count < 5

The last two assumptions can be checked by looking at the expected frequency table.
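
The two expected-count checks can also be done programmatically. A minimal sketch, assuming the expected counts are typed in from the `chi2_contingency` output shown earlier (the names `expected_counts`, `no_zero_cells`, and `share_below_5` are illustrative):

```python
import numpy as np

# Expected cell counts copied from the chi2_contingency output
# (rows: NE, N Cntrl, South, West; columns: 19-29, 30-34, 35+)
expected_counts = np.array([
    [88.03556485, 54.87029289, 23.09414226],
    [150.61506276, 93.87447699, 39.51046025],
    [132.58368201, 82.63598326, 34.78033473],
    [135.76569038, 84.61924686, 35.61506276],
])

# Assumption: no expected cell count is equal to 0
no_zero_cells = bool((expected_counts > 0).all())

# Assumption: no more than 20% of the cells have an expected count below 5
share_below_5 = float((expected_counts < 5).mean())

print(no_zero_cells)          # True
print(share_below_5 <= 0.20)  # True
```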





In [9]:
expected

#No expected count is 0 and none is below 5 (the smallest is about 23.09), so all the
#assumptions are met, which indicates the statistical test results are reliable.

| region  | agecat: 19-29 | agecat: 30-34 | agecat: 35+ |
|---------|---------------|---------------|-------------|
| NE      | 88.035565     | 54.870293     | 23.094142   |
| N Cntrl | 150.615063    | 93.874477     | 39.51046    |
| South   | 132.583682    | 82.635983     | 34.780335   |
| West    | 135.76569     | 84.619247     | 35.615063   |
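
Because the test is an omnibus test, the significant result does not say which regions differ. The sketch below illustrates one possible follow-up that is not part of the original analysis: pairwise comparisons between regions of the share of cities in the 19-29 category, with a Bonferroni correction. Each 2x2 chi-square test without continuity correction is equivalent to a two-sided two-proportion z-test. The choice of the 19-29 category, the Bonferroni adjustment, and the names `counts` and `posthoc` are assumptions made for illustration.

```python
from itertools import combinations
from scipy import stats

# (cities in the 19-29 category, total cities) per region, from the cross tabulation
counts = {
    "NE": (46, 166),
    "N Cntrl": (162, 284),
    "South": (139, 250),
    "West": (160, 256),
}

pairs = list(combinations(counts, 2))
posthoc = {}
for a, b in pairs:
    (hit_a, n_a), (hit_b, n_b) = counts[a], counts[b]
    # 2x2 table: in 19-29 category vs. not, for the two regions being compared
    table = [[hit_a, n_a - hit_a], [hit_b, n_b - hit_b]]
    _, p_raw, _, _ = stats.chi2_contingency(table, correction=False)
    # Bonferroni: multiply by the number of comparisons, capped at 1
    posthoc[(a, b)] = min(p_raw * len(pairs), 1.0)

for pair, p_adj in posthoc.items():
    print(pair, round(p_adj, 4))
```

Pairs whose adjusted p-value falls below the chosen significance level are the ones where the 19-29 share differs between the two regions.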
