# Chi-Squared Test

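The chi-squared test is the categorical-data analog of the t-test: it checks whether observed category counts differ from the counts we would expect.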

$$\chi^2=\sum\frac{(\text{observed}-\text{expected})^2}{\text{expected}}$$

Let's generate some fake demographic data for the U.S. and Minnesota and walk through the chi-squared goodness-of-fit test to check whether their distributions differ:

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats

In [7]:
national = pd.DataFrame(
    ["white"]*100000 + ["hispanic"]*60000 + ["black"]*50000 +\
        ["asian"]*15000 + ["other"]*35000
)

minnesota = pd.DataFrame(
    ["white"]*600 + ["hispanic"]*300 + ["black"]*250 +\
        ["asian"]*75 + ["other"]*150
)

national_table = pd.crosstab(index=national[0], columns='count')
minnesota_table = pd.crosstab(index=minnesota[0], columns='count')

print(f'National\n{national_table}\n')
print(f'Minnesota\n{minnesota_table}')

National
col_0      count
0               
asian      15000
black      50000
hispanic   60000
other      35000
white     100000

Minnesota
col_0     count
0              
asian        75
black       250
hispanic    300
other       150
white       600


In the formula, observed is the actual observed count for each category and expected is the expected count based on the distribution of the population for the corresponding category. Let's calculate the chi-squared statistic for our data to illustrate:

In [9]:
observed = minnesota_table
national_ratios = national_table/len(national)
expected = national_ratios * len(minnesota) 

chi_squared_stat = (((observed-expected)**2)/expected).sum()
print(f'{chi_squared_stat=}')

chi_squared_stat=col_0
count    18.194805
dtype: float64


In [10]:
chi_squared_stat

col_0
count    18.194805
dtype: float64

In [13]:
minnesota_table

col_0     count
0              
asian        75
black       250
hispanic    300
other       150
white       600


In [14]:
expected

col_0          count
0                   
asian      79.326923
black     264.423077
hispanic  317.307692
other     185.096154
white     528.846154
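
To see where the statistic comes from, it can help to look at each category's term in the sum. The following is a minimal sketch that re-enters the counts from the tables above as plain NumPy arrays; the variable names (`national_counts`, `observed_counts`, `contributions`) are illustrative and not part of the original notebook.

```python
import numpy as np

categories = ["asian", "black", "hispanic", "other", "white"]
national_counts = np.array([15000, 50000, 60000, 35000, 100000])
observed_counts = np.array([75, 250, 300, 150, 600])  # Minnesota sample

# Scale the national proportions to the size of the Minnesota sample
expected_counts = national_counts / national_counts.sum() * observed_counts.sum()

# One term of the chi-squared sum per category
contributions = (observed_counts - expected_counts) ** 2 / expected_counts

for name, value in zip(categories, contributions):
    print(f"{name:<9}{value:.4f}")
print(f"total    {contributions.sum():.6f}")
```

The "white" and "other" categories contribute most of the total (roughly 9.6 and 6.7), which shows where the Minnesota sample departs most from the national proportions.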


Like the t-test, we compare the chi-squared statistic to a critical value from the chi-squared distribution. Here the critical value is the 95% quantile of that distribution, which corresponds to a 5% significance level.

In [15]:
crit = stats.chi2.ppf(q=0.95,
                      df=4)     # Degrees of freedom, number of categories - 1

print(f'Critical value: {crit}')

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat, df=4)

print(f'p-value: {p_value}')

Critical value: 9.487729036781154
p-value: [0.00113047]


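Both results indicate a significant difference.

The chi-squared statistic (18.19) exceeds the critical value (9.49), so we reject the null hypothesis at the 95% confidence level.

The p-value (0.0011) is below 0.01, so the result is also significant at the 99% confidence level.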
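
The two decision rules above (statistic versus critical value, p-value versus significance level) can be written out explicitly. This is a sketch under a few assumptions: the counts are re-entered as NumPy arrays, `alpha` is a name chosen here for the significance level, and `stats.chi2.sf` (the survival function) is used in place of `1 - cdf`, which gives the same upper-tail probability.

```python
import numpy as np
import scipy.stats as stats

observed_counts = np.array([75, 250, 300, 150, 600])
national_counts = np.array([15000, 50000, 60000, 35000, 100000])
expected_counts = national_counts / national_counts.sum() * observed_counts.sum()

stat = ((observed_counts - expected_counts) ** 2 / expected_counts).sum()
dof = len(observed_counts) - 1   # number of categories - 1
alpha = 0.05                     # illustrative significance level

crit = stats.chi2.ppf(1 - alpha, df=dof)
p_value = stats.chi2.sf(stat, df=dof)   # upper-tail probability, same as 1 - cdf

# The two rules always agree: stat > crit exactly when p_value < alpha
reject_by_crit = stat > crit
reject_by_p = p_value < alpha
print(reject_by_crit, reject_by_p)
```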

`stats.chisquare()` performs the same calculation automatically:

In [16]:
stats.chisquare(f_obs=observed, f_exp=expected)

Power_divergenceResult(statistic=array([18.19480519]), pvalue=array([0.00113047]))
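
The result above is wrapped in one-element arrays because `observed` and `expected` are one-column DataFrames. As a sketch, passing flat 1-D arrays (re-entered here from the tables, with illustrative variable names) gives plain scalar values. Note that when `f_exp` is omitted, `chisquare()` assumes all categories are equally likely, so the expected counts need to be passed explicitly for this comparison.

```python
import numpy as np
import scipy.stats as stats

observed_counts = np.array([75, 250, 300, 150, 600])
national_counts = np.array([15000, 50000, 60000, 35000, 100000])
expected_counts = national_counts / national_counts.sum() * observed_counts.sum()

# 1-D inputs produce a scalar statistic and a scalar p-value
result = stats.chisquare(f_obs=observed_counts, f_exp=expected_counts)
print(result.statistic, result.pvalue)
```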

## Chi-Squared Test of Independence

In [18]:
np.random.seed(10)

races = ["asian","black","hispanic","other","white"]
race_prop = [0.05,0.15,0.25,0.05,0.5]
parties = ["democrat","independent","republican"]
parties_prop = [0.4,0.2,0.4]

voter_race = np.random.choice(a=races, p=race_prop, size=1000)
voter_party = np.random.choice(a=parties, p=parties_prop, size=1000)

voters = pd.DataFrame({"race":voter_race,"party":voter_party})

voter_tab = pd.crosstab(voters.race, voters.party, margins=True)

voter_tab.columns = ["democrat","independent","republican","row_totals"]
voter_tab.index = ["asian","black","hispanic","other","white","col_totals"]

observed = voter_tab.iloc[0:5,0:3]
voter_tab

            democrat  independent  republican  row_totals
asian             21            7          32          60
black             65           25          64         154
hispanic         107           50          94         251
other             15            8          15          38
white            189           96         212         497
col_totals       397          186         417        1000


**Note that we did not use the race data to inform our generation of the party data, so the variables are independent by construction.**

For a test of independence, we use the same chi-squared formula that we used for the goodness-of-fit test. The main difference is we have to calculate the expected counts of each cell in a 2-dimensional table instead of a 1-dimensional table. To get the expected count for a cell, multiply the row total for that cell by the column total for that cell and then divide by the total number of observations. We can quickly get the expected counts for all cells in the table by taking the row totals and column totals of the table, performing an outer product on them with the np.outer() function and dividing by the number of observations:

In [19]:
expected = np.outer(voter_tab["row_totals"][0:5],
                    voter_tab.loc["col_totals"][0:3]) / 1000

expected = pd.DataFrame(expected)

expected.columns = ["democrat","independent","republican"]
expected.index = ["asian","black","hispanic","other","white"]

expected

          democrat  independent  republican
asian       23.820       11.160      25.020
black       61.138       28.644      64.218
hispanic    99.647       46.686     104.667
other       15.086        7.068      15.846
white      197.309       92.442     207.249
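
As a check on the rule "row total times column total divided by the number of observations", here is a small sketch that computes one cell by hand and compares it with the outer-product result. The totals are copied from the voter table; the variable names are illustrative.

```python
import numpy as np

row_totals = np.array([60, 154, 251, 38, 497])   # asian, black, hispanic, other, white
col_totals = np.array([397, 186, 417])           # democrat, independent, republican
n = 1000                                         # total number of observations

# Expected count for the (asian, democrat) cell, computed by hand
asian_democrat = row_totals[0] * col_totals[0] / n
print(asian_democrat)

# The outer product applies the same rule to every cell at once
expected = np.outer(row_totals, col_totals) / n
print(expected[0, 0])
```

Both values come out to 60 × 397 / 1000 = 23.82, and the expected counts add up to the same total of 1000 observations as the observed table.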


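**The statistic must be summed twice because the table has two dimensions: the first `.sum()` adds up each column, and the second adds those column sums together.**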

In [21]:
chi_squared_stat = (((observed - expected)**2)/expected).sum().sum()
print(chi_squared_stat)

7.169321280162059
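
The two-step sum can be illustrated with plain NumPy arrays, where summing over `axis=0` plays the role of the first pandas `.sum()` (which sums each column by default). This is a sketch with the observed counts re-entered from the voter table and illustrative variable names.

```python
import numpy as np

observed = np.array([
    [21, 7, 32],      # asian
    [65, 25, 64],     # black
    [107, 50, 94],    # hispanic
    [15, 8, 15],      # other
    [189, 96, 212],   # white
])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

contrib = (observed - expected) ** 2 / expected   # one term per cell

per_party = contrib.sum(axis=0)   # first sum: one value per column
total = per_party.sum()           # second sum: a single statistic
print(per_party)
print(total)
```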


In [24]:
crit = stats.chi2.ppf(q=0.95,
                  df=8)         # df = (number of rows - 1) * (number of columns - 1)
                                # here (5-1) * (3-1) = 8

print(f'Critical value:\n{crit}')

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat, df=8)

print(f'P value:\n{p_value}')

Critical value:
15.50731305586545
P value:
0.518479392948842
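
Rather than hard-coding `df=8`, the degrees of freedom can be derived from the shape of the observed table. A minimal sketch, assuming the observed counts are held in a 2-D NumPy array (re-entered here from the voter table):

```python
import numpy as np
import scipy.stats as stats

observed = np.array([
    [21, 7, 32],
    [65, 25, 64],
    [107, 50, 94],
    [15, 8, 15],
    [189, 96, 212],
])

n_rows, n_cols = observed.shape
dof = (n_rows - 1) * (n_cols - 1)   # (5 - 1) * (3 - 1)

crit = stats.chi2.ppf(q=0.95, df=dof)
print(dof, crit)
```

Because the statistic (7.17) is well below the critical value (15.51), the test does not reject the hypothesis that race and party are independent, as expected from how the data were generated.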


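`stats.chi2_contingency()` runs the test of independence automatically, computing the expected counts and degrees of freedom from the observed table: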

In [25]:
stats.chi2_contingency(observed=observed)

Chi2ContingencyResult(statistic=7.169321280162059, pvalue=0.518479392948842, dof=8, expected_freq=array([[ 23.82 ,  11.16 ,  25.02 ],
       [ 61.138,  28.644,  64.218],
       [ 99.647,  46.686, 104.667],
       [ 15.086,   7.068,  15.846],
       [197.309,  92.442, 207.249]]))
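
The result can be unpacked into its four parts for further use. This is a sketch with the observed counts re-entered as a NumPy array; the unpacked variable names are illustrative. SciPy applies Yates' continuity correction by default only when there is one degree of freedom, so it has no effect on this 5×3 table.

```python
import numpy as np
import scipy.stats as stats

observed = np.array([
    [21, 7, 32],
    [65, 25, 64],
    [107, 50, 94],
    [15, 8, 15],
    [189, 96, 212],
])

# statistic, p-value, degrees of freedom, expected counts
stat, p_value, dof, expected_freq = stats.chi2_contingency(observed)

print(f"statistic: {stat:.4f}")
print(f"p-value:   {p_value:.4f}")
print(f"dof:       {dof}")
print(f"reject independence at the 5% level: {p_value < 0.05}")
```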