<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Chi-Squared-Goodness-Of-Fit-Test" data-toc-modified-id="Chi-Squared-Goodness-Of-Fit-Test-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Chi-Squared Goodness-Of-Fit Test</a></span></li><li><span><a href="#Chi-Squared-Test-of-Independence" data-toc-modified-id="Chi-Squared-Test-of-Independence-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Chi-Squared Test of Independence</a></span></li><li><span><a href="#sklearn-feature-selection-chi2" data-toc-modified-id="sklearn-feature-selection-chi2-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>sklearn feature selection chi2</a></span></li></ul></div>

# Chi-Squared Goodness-Of-Fit Test
Ref: http://hamelg.blogspot.com/2015/11/python-for-data-analysis-part-25-chi.html

In our study of t-tests, we introduced the one-sample t-test to check whether a sample mean differs from an expected (population) mean. The chi-squared goodness-of-fit test is an analog of the one-sample t-test for categorical variables: it tests whether the distribution of sample categorical data matches an expected distribution. For example, you could use a chi-squared goodness-of-fit test to check whether the race demographics of members at your church or school match those of the entire U.S. population, or whether the web browser preferences of your friends match those of Internet users as a whole.
When working with categorical data, the values themselves aren't of much use for statistical testing because categories like "male", "female", and "other" have no mathematical meaning. Tests dealing with categorical variables are based on the counts in each category instead of the values themselves.
Let's generate some fake demographic data for the U.S. and Minnesota and walk through the chi-squared goodness-of-fit test to check whether they differ. The test statistic sums, over all categories, the squared difference between the observed count $O_i$ and the expected count $E_i$, divided by the expected count:


$\chi_{c}^{2}=\sum \frac{\left(O_{i}-E_{i}\right)^{2}}{E_{i}}$
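
As a quick illustration of the formula, the sketch below evaluates it by hand for a hypothetical die-rolling experiment (the counts are made up for this example and are not part of the data used later) and compares the result with `scipy.stats.chisquare`.

```python
import numpy as np
from scipy import stats

# Hypothetical counts from 120 rolls of a six-sided die (illustrative only).
observed = np.array([18, 22, 20, 25, 15, 20])

# A fair die would be expected to show each face 120 / 6 = 20 times.
expected = np.full(6, observed.sum() / 6)

# Chi-squared statistic: sum over categories of (O_i - E_i)^2 / E_i.
chi_sq = ((observed - expected) ** 2 / expected).sum()

# scipy's built-in test should return the same statistic.
scipy_stat, scipy_p = stats.chisquare(f_obs=observed, f_exp=expected)

print(chi_sq)      # approximately 2.9, i.e. (4 + 4 + 0 + 25 + 25 + 0) / 20
print(scipy_stat)
```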

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import os,sys,time
import scipy
import statsmodels

from scipy import stats
from scipy.stats import ttest_1samp
from scipy.stats import ttest_ind # independent means two samples.
from statsmodels.stats import weightstats as stests # stests.ztest

SEED = 100
pd.set_option('display.max_columns', 100)  # full option name; the short form is ambiguous in newer pandas
pd.set_option('plotting.backend','plotly') # matplotlib, bokeh, altair, plotly
%load_ext watermark
%watermark -iv

statsmodels 0.12.0
seaborn     0.11.0
scipy       1.4.1
numpy       1.18.4
pandas      1.1.0
json        2.0.9
autopep8    1.5.2



In [2]:
national = pd.DataFrame(["white"]*100000 + ["hispanic"]*60000 +\
                        ["black"]*50000 + ["asian"]*15000 + ["other"]*35000)
           

minnesota = pd.DataFrame(["white"]*600 + ["hispanic"]*300 + \
                         ["black"]*250 +["asian"]*75 + ["other"]*150)

national_table = pd.crosstab(index=national[0], columns="count")
minnesota_table = pd.crosstab(index=minnesota[0], columns="count")

print( "National")
print(national_table)
print(" ")
print( "Minnesota")
print(minnesota_table)

National
col_0      count
0               
asian      15000
black      50000
hispanic   60000
other      35000
white     100000
 
Minnesota
col_0     count
0              
asian        75
black       250
hispanic    300
other       150
white       600


In [3]:
observed = minnesota_table

national_ratios = national_table/len(national)  # Get population ratios

expected = national_ratios * len(minnesota)   # Get expected counts

chi_squared_stat = (((observed-expected)**2)/expected).sum()

print(chi_squared_stat)

col_0
count    18.194805
dtype: float64


*Note*: The chi-squared test assumes none of the expected counts are less than 5.
Similar to the t-test where we compared the t-test statistic to a critical value based on the t-distribution to determine whether the result is significant, in the chi-square test we compare the chi-square test statistic to a critical value based on the chi-square distribution. The scipy library shorthand for the chi-square distribution is chi2. Let's use this knowledge to find the critical value for 95% confidence level and check the p-value of our result:

In [4]:
crit = stats.chi2.ppf(q = 0.95, # Find the critical value for 95% confidence*
                      df = 4)   # Df = number of variable categories - 1

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,  # Find the p-value
                             df=4)
print("P value")
print(p_value)

Critical value
9.487729036781154
P value
[0.00113047]


*Note*: we are only interested in the right tail of the chi-square distribution.
Since our chi-squared statistic (about 18.19) exceeds the critical value (about 9.49), or equivalently since the p-value (about 0.0011) is below 0.05, we'd reject the null hypothesis that the two distributions are the same.
You can carry out a chi-squared goodness-of-fit test automatically using the scipy function scipy.stats.chisquare():

In [5]:
stats.chisquare(f_obs= observed,   # Array of observed counts
                f_exp= expected)   # Array of expected counts

Power_divergenceResult(statistic=array([18.19480519]), pvalue=array([0.00113047]))
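
The same test can also be sketched with plain NumPy arrays instead of crosstab DataFrames, which gives scalar results rather than one-element arrays. The sketch below reuses the counts from the tables above, checks the expected-count assumption from the earlier note, and uses `stats.chi2.sf` (the survival function, equivalent to `1 - cdf`) for the right-tail probability. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Counts copied from the tables above, in the order:
# asian, black, hispanic, other, white
national_counts = np.array([15000, 50000, 60000, 35000, 100000])
minnesota_counts = np.array([75, 250, 300, 150, 600])

# Expected Minnesota counts if Minnesota matched the national proportions.
expected_counts = national_counts / national_counts.sum() * minnesota_counts.sum()

# Rule of thumb from the note: no expected count should be below 5.
assumption_ok = bool((expected_counts >= 5).all())

stat, p_value = stats.chisquare(f_obs=minnesota_counts, f_exp=expected_counts)

# Right-tail probability via the survival function (same as 1 - cdf).
p_from_sf = stats.chi2.sf(stat, df=len(minnesota_counts) - 1)

print(assumption_ok)
print(stat, p_value)
```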

# Chi-Squared Test of Independence
Independence is a key concept in probability that describes a situation where knowing the value of one variable tells you nothing about the value of another. For instance, the month you were born probably doesn't tell you anything about which web browser you use, so we'd expect birth month and browser preference to be independent. On the other hand, your month of birth might be related to whether you excelled at sports in school, so month of birth and sports performance might not be independent.
The chi-squared test of independence tests whether two categorical variables are independent. The test of independence is commonly used to determine whether variables like education, political views and other preferences vary based on demographic factors like gender, race and religion. Let's generate some fake voter polling data and perform a test of independence:

In [6]:
np.random.seed(10)

# Sample data randomly at fixed probabilities
voter_race = np.random.choice(a= ["asian","black","hispanic","other","white"],
                              p = [0.05, 0.15 ,0.25, 0.05, 0.5],
                              size=1000)

# Sample data randomly at fixed probabilities
voter_party = np.random.choice(a= ["democrat","independent","republican"],
                              p = [0.4, 0.2, 0.4],
                              size=1000)

voters = pd.DataFrame({"race":voter_race, 
                       "party":voter_party})

voter_tab = pd.crosstab(voters.race, voters.party, margins = True)

voter_tab.columns = ["democrat","independent","republican","row_totals"]

voter_tab.index = ["asian","black","hispanic","other","white","col_totals"]

observed = voter_tab.iloc[:-1,:-1]   # Get table without totals for later use
voter_tab

Unnamed: 0,democrat,independent,republican,row_totals
asian,21,7,32,60
black,65,25,64,154
hispanic,107,50,94,251
other,15,8,15,38
white,189,96,212,497
col_totals,397,186,417,1000


In [7]:
# How is an expected count calculated?
# Look at race=asian and party=democrat, where the observed count is 21.
# If race and party were independent, the expected asian-democrat count
# out of 1000 voters would be 60 * 397 / 1000.

# myexpected = (col_total * row_total) / whole_total
myexpected = 397 * 60/1000
myexpected

23.82

To get the expected count for a cell, multiply the row total for that cell by the column total for that cell and then divide by the total number of observations. We can quickly get the expected counts for all cells in the table by taking the row totals and column totals of the table, performing an outer product on them with the np.outer() function and dividing by the number of observations.

For two arrays of lengths m and n, np.outer(arr1, arr2) returns a matrix of shape (m, n)
whose entry (i, j) is the product arr1[i] * arr2[j].
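
A minimal sketch of `np.outer` on two small arrays (the values are arbitrary and chosen only to make the products easy to read):

```python
import numpy as np

arr1 = np.array([1, 2, 3])   # length m = 3
arr2 = np.array([10, 20])    # length n = 2

# Entry (i, j) of the result is arr1[i] * arr2[j].
products = np.outer(arr1, arr2)

print(products.shape)  # (3, 2)
print(products)
# [[10 20]
#  [20 40]
#  [30 60]]
```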

In [8]:
total = 1000
row_totals = voter_tab["row_totals"][:-1]
col_totals = voter_tab.loc["col_totals"][:-1]

expected =  np.outer(row_totals, col_totals) / total

expected = pd.DataFrame(expected)

expected.columns = ["democrat","independent","republican"]
expected.index = ["asian","black","hispanic","other","white"]

expected

Unnamed: 0,democrat,independent,republican
asian,23.82,11.16,25.02
black,61.138,28.644,64.218
hispanic,99.647,46.686,104.667
other,15.086,7.068,15.846
white,197.309,92.442,207.249


In [9]:
# another way
contingency_table = observed.to_numpy()
stat, p, dof, expected = stats.chi2_contingency(contingency_table)
print('dof=%d' % dof)
print('expected values=\n',expected)
print()

dof=8
expected values=
 [[ 23.82   11.16   25.02 ]
 [ 61.138  28.644  64.218]
 [ 99.647  46.686 104.667]
 [ 15.086   7.068  15.846]
 [197.309  92.442 207.249]]



In [10]:
chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()
print(chi_squared_stat)

7.169321280162059


In [11]:
nrows, ncols = observed.shape
dof = (nrows-1) * (ncols-1)   # degrees of freedom: (rows - 1) * (columns - 1)
crit = stats.chi2.ppf(q = 0.95, df = dof) 

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,df=dof)
print("P value")
print(p_value)

Critical value
15.50731305586545
P value
0.518479392948842


*Note*: The degrees of freedom for a test of independence equal the product of (number of categories - 1) for each variable, i.e. (rows - 1) x (columns - 1). In this case we have a 5x3 table, so df = 4x2 = 8.
As with the goodness-of-fit test, we can use scipy to conduct a test of independence quickly. Use stats.chi2_contingency() function to conduct a test of independence automatically given a frequency table of observed counts:

In [12]:
stats.chi2_contingency(observed= observed)

(7.169321280162059,
 0.518479392948842,
 8,
 array([[ 23.82 ,  11.16 ,  25.02 ],
        [ 61.138,  28.644,  64.218],
        [ 99.647,  46.686, 104.667],
        [ 15.086,   7.068,  15.846],
        [197.309,  92.442, 207.249]]))

The output shows the chi-square statistic, the p-value and the degrees of freedom followed by the expected counts.
As expected, given the high p-value, the test result does not detect a significant relationship between the variables.
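
For contrast, here is a sketch of what the test reports when two variables are clearly related. The 2x2 table is hypothetical (it is not derived from the voter data above), and `correction=False` is passed so that scipy does not apply its default Yates continuity correction for 2x2 tables.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows are two groups, columns are two preferences.
# Most of group A prefers option 1 and most of group B prefers option 2.
dependent_table = np.array([[90, 10],
                            [10, 90]])

stat, p_value, dof, expected = stats.chi2_contingency(dependent_table,
                                                      correction=False)

print(dof)       # (2 - 1) * (2 - 1) = 1
print(expected)  # every cell is 100 * 100 / 200 = 50
print(stat, p_value)
```

Here the p-value is far below 0.05, so the null hypothesis of independence would be rejected.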

# sklearn feature selection chi2

References
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html  
- https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

Compute chi-squared stats between each non-negative feature and class.

This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification.

In [18]:
X = np.array([[1, 0, 0, 0, 1],
       [1, 1, 0, 1, 1],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 1, 1, 1],
       [0, 1, 1, 0, 0],
       [1, 0, 1, 1, 1],
       [1, 1, 1, 1, 0]])
y = np.array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

X shape: (10, 5)
y shape: (10,)


In [14]:
Y = np.vstack([1 - y, y])
observed = np.dot(Y, X)
observed

array([[3, 1, 1, 2, 2],
       [4, 2, 3, 2, 4]])

In [15]:
feature_count = X.sum(axis=0)
class_prob = Y.mean(axis=1)
expected = np.dot(feature_count.reshape(-1, 1), class_prob.reshape(1, -1)).T
expected

array([[2.8, 1.2, 1.6, 1.6, 2.4],
       [4.2, 1.8, 2.4, 2.4, 3.6]])

In [16]:
from scipy.stats import chisquare
score, pval = chisquare(observed, expected)
score

array([0.02380952, 0.05555556, 0.375     , 0.16666667, 0.11111111])

In [17]:
pval

array([0.87737056, 0.81366372, 0.54029137, 0.6830914 , 0.73888268])
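
The manual calculation above appears to follow the same steps as `sklearn.feature_selection.chi2`. As a sanity check, the sketch below calls that function directly on the same toy `X` and `y` (redefined here so the block is self-contained) and compares its output with the manual scores and p-values.

```python
import numpy as np
from scipy.stats import chisquare
from sklearn.feature_selection import chi2

X = np.array([[1, 0, 0, 0, 1],
              [1, 1, 0, 1, 1],
              [1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [1, 0, 0, 0, 1],
              [1, 0, 1, 1, 1],
              [0, 1, 1, 0, 0],
              [1, 0, 1, 1, 1],
              [1, 1, 1, 1, 0]])
y = np.array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])

# Manual computation, mirroring the cells above.
Y = np.vstack([1 - y, y])                            # one indicator row per class
observed = Y @ X                                     # per-class feature totals
expected = np.outer(Y.mean(axis=1), X.sum(axis=0))   # class prior * feature total
manual_score, manual_pval = chisquare(observed, expected)

# Library computation on the same data.
sk_score, sk_pval = chi2(X, y)

print(np.allclose(manual_score, sk_score))
print(np.allclose(manual_pval, sk_pval))
```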

In [19]:
## sklearn example

In [21]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)
print(f"old shape: {X.shape}")

X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(f"New shape: {X_new.shape}")

old shape: (150, 4)
New shape: (150, 2)
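
To see which two columns were kept, one option is to fit the selector first and then inspect it. This sketch relies on the fitted selector's `scores_` attribute and its `get_support` method, along with the feature names returned by `load_iris()`; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

# Fit without transforming so the selector can be inspected afterwards.
selector = SelectKBest(chi2, k=2).fit(X, y)

# Chi-squared score for each of the four features.
for name, score in zip(iris.feature_names, selector.scores_):
    print(f"{name}: {score:.2f}")

# Indices and names of the k=2 features with the highest scores.
selected_idx = selector.get_support(indices=True)
selected_names = [iris.feature_names[i] for i in selected_idx]
print(selected_names)
```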
