# Chi2 test for independence: why do two implementations return different results?

This notebook and the chi2_util module demonstrate the difference between the chi2 implementation in sklearn (used through SelectKBest) and the chi2 implementation in scipy.stats.

To understand the purpose, logic and conclusion of this notebook, please refer to the article on ondata.blog, published in November 2021.


In [7]:
import pandas as pd, numpy as np, seaborn as sns
import os, sys
from sklearn.feature_selection import SelectKBest, chi2
import sklearn.feature_selection as skfs

df = pd.read_csv('sample300.csv')

In [8]:
df = df.rename(columns = {'A': 'education', 
                          'E': 'expertise', 
                          'label': 'success'})
label = 'success'
df.head()

Unnamed: 0,education,B,C,D,expertise,F,success
0,4,11,131,45,20,159,2
1,0,6,12,63,73,64,2
2,4,8,137,56,102,240,2
3,3,14,137,58,116,59,1
4,4,4,137,10,50,200,2


## Chi2 (with sklearn)

In [9]:
cat_feature_cols = list(set(df.columns) - set([label, 'id']))
fs = SelectKBest(score_func=skfs.chi2, k = 'all')
X, y = df[cat_feature_cols], df[label]
selector = fs.fit(X, y)
kbest = pd.DataFrame({'feature': X.columns, 'score': fs.scores_})
kbest.sort_values(by = 'score', ascending = False).reset_index()

Unnamed: 0,index,feature,score
0,3,expertise,1647.696011
1,0,C,232.577148
2,4,D,116.422861
3,1,B,69.178778
4,5,F,24.250652
5,2,education,1.412797
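
To see where the low `education` score comes from, it helps to spell out what `sklearn.feature_selection.chi2` computes: it treats each feature value as a non-negative count, sums those values per class, and compares the sums with what the class proportions alone would predict. The sketch below is a minimal reconstruction for illustration, not code from `chi2_util`. It rebuilds the `education` column from the education/success counts tabulated in the manual verification section, and the helper name `sklearn_style_chi2` is made up.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Education/success counts from the manual verification table:
# education level -> (rows with success == 1, rows with success == 2)
counts = {0: (1, 18), 1: (1, 0), 2: (0, 1), 3: (127, 23), 4: (21, 108)}

# Rebuild one row per observation (300 rows in total).
education, success = [], []
for level, (n1, n2) in counts.items():
    education += [level] * (n1 + n2)
    success += [1] * n1 + [2] * n2
education = np.array(education, dtype=float)
success = np.array(success)


def sklearn_style_chi2(x, y):
    """Hypothetical helper: chi2 statistic built from per-class sums of x."""
    classes = np.unique(y)
    observed = np.array([x[y == c].sum() for c in classes])    # sum of values per class
    class_prob = np.array([(y == c).mean() for c in classes])  # share of rows per class
    expected = class_prob * x.sum()                            # sums expected under independence
    return ((observed - expected) ** 2 / expected).sum()


by_hand = sklearn_style_chi2(education, success)
by_sklearn = chi2(education.reshape(-1, 1), success)[0][0]
print(round(by_hand, 6), round(by_sklearn, 6))
```

Both numbers come out at about 1.4128, the score reported for `education` above. The ordinal codes 0 to 4 are summed as if they were frequencies, so the statistic reflects the magnitude of the codes rather than how the categories are distributed across the classes.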


## Manual verification

In [34]:
df['dummy'] = 1
df.pivot_table(values = 'dummy', columns = label, index = 'education', aggfunc = len).fillna(0)

education,success=1,success=2
0,1.0,18.0
1,1.0,0.0
2,0.0,1.0
3,127.0,23.0
4,21.0,108.0
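
The counts above form a contingency table, which `scipy.stats.chi2_contingency` can test for independence directly. The following is a sketch under one stated assumption: levels that have any cell with fewer than 5 observations are dropped before testing. The actual filtering rule inside `chi2_util` is not shown in this notebook, so treat that threshold as a guess.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: education 0..4; columns: success == 1, success == 2 (counts from the table above).
table = np.array([[1, 18], [1, 0], [0, 1], [127, 23], [21, 108]])

# Test of independence on the full 5x2 table.
stat_full, p_full, dof_full, expected_full = chi2_contingency(table)

# Levels 1 and 2 have a single observation each, which makes the chi-square
# approximation unreliable for them.
# Assumption: keep only levels where every cell has at least 5 observations.
dense = table[(table >= 5).all(axis=1)]

# For a 2x2 table (dof == 1) scipy applies Yates' continuity correction by default.
stat, p, dof, expected = chi2_contingency(dense)
print(dof_full, dof, round(stat, 4))
```

With that filter only levels 3 and 4 remain, and the statistic is about 127.4975 with one degree of freedom, which agrees with the value that `chi2_util.chi2_score` reports for `education`.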


## Chi2 calculated manually (with scipy.stats)

In [37]:
import chi2_util  # must be imported before it can be reloaded
from importlib import reload
reload(chi2_util)

<module 'chi2_util' from 'C:\\Users\\pplaszczak\\Documents\\00-STUFF\\stuff\\projs\\2021-10-chi2\\chi2_util.py'>

In [38]:
import chi2_util
chi2_util.chi2_score(df, 
                     features = cat_feature_cols, 
                     target = label, 
                     alpha = 0.05,
                     deep = True)


feature,chi2,critical,dof,p,rank,reverse_rank
education,127.497517,3.841459,1.0,1.445816e-29,33.18987,0.03013
D,32.38498,9.487729,4.0,1.595989e-06,3.413354,0.292967
expertise,6.153006,3.841459,1.0,0.01311889,1.601737,0.624322
B,10.313977,7.814728,3.0,0.01607738,1.319813,0.757683
C,0.0,,0.0,1.0,,
F,0.0,,0.0,1.0,,
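
The derived columns in this table appear to follow the usual definitions: `critical` is the chi-square quantile at `1 - alpha` for the given degrees of freedom, `p` is the upper-tail probability of the statistic, `rank` is `chi2 / critical`, and `reverse_rank` is its reciprocal. This reading is inferred from the numbers, since the `chi2_util` source is not reproduced here. A short check with `scipy.stats.chi2`:

```python
from scipy.stats import chi2 as chi2_dist

alpha = 0.05
# (chi2 statistic, degrees of freedom) copied from the table above.
rows = {
    "education": (127.497517, 1),
    "D": (32.38498, 4),
    "expertise": (6.153006, 1),
    "B": (10.313977, 3),
}

results = {}
for feature, (stat, dof) in rows.items():
    critical = chi2_dist.ppf(1 - alpha, dof)  # rejection threshold at level alpha
    p_value = chi2_dist.sf(stat, dof)         # upper-tail probability of the statistic
    rank = stat / critical                    # above 1 means independence is rejected
    results[feature] = (critical, p_value, rank)
    print(f"{feature}: critical={critical:.6f} p={p_value:.6e} rank={rank:.6f}")
```

Features `C` and `F` have zero degrees of freedom in the table, so no critical value or rank is defined for them.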


## Practical test

In [24]:
from sklearn.linear_model import LogisticRegression
chi2_util.accuracy_by_feature(X, y, classifier = LogisticRegression(max_iter = 1000)).round(2)

Unnamed: 0,accuracy,feature
3,0.81,expertise
2,0.8,education
1,0.66,B
4,0.61,D
0,0.56,C
5,0.46,F


In [25]:
from xgboost import XGBClassifier
chi2_util.accuracy_by_feature(X, y, classifier = XGBClassifier()).round(2)

Unnamed: 0,accuracy,feature
2,0.86,education
3,0.84,expertise
4,0.69,D
1,0.65,B
0,0.59,C
5,0.44,F
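
The `accuracy_by_feature` helper lives in `chi2_util` and its source is not shown in this notebook. A plausible minimal version, assuming it trains the classifier on one feature at a time and reports cross-validated accuracy, might look like the sketch below. The name `accuracy_by_feature_sketch`, the 5-fold cross-validation, and the toy data are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def accuracy_by_feature_sketch(X, y, classifier, cv=5):
    """Hypothetical stand-in: cross-validated accuracy of single-feature models."""
    rows = []
    for feature in X.columns:
        # Fit a fresh copy of the classifier on this one column only.
        scores = cross_val_score(clone(classifier), X[[feature]], y,
                                 cv=cv, scoring="accuracy")
        rows.append({"accuracy": scores.mean(), "feature": feature})
    return pd.DataFrame(rows).sort_values(by="accuracy", ascending=False)


# Toy data (not from sample300.csv): one informative and one uninformative feature.
y_toy = pd.Series([1, 2] * 50)
X_toy = pd.DataFrame({"signal": y_toy.values, "noise": [0, 0, 1, 1] * 25})

result = accuracy_by_feature_sketch(X_toy, y_toy, LogisticRegression(max_iter=1000))
print(result.round(2))
```

Scoring each feature in isolation gives a model-based ranking that does not depend on either chi2 implementation, which is what makes it useful as a tie-breaker between the two.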


## Conclusion

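The chi2 implementations of sklearn and scipy.stats return different results and rank the features differently: sklearn places education last (score 1.41), while the scipy.stats-based calculation places it first (chi2 = 127.50, p ≈ 1.4e-29).

In the practical test, education is one of the two most predictive features for both classifiers, whereas C, which sklearn ranks second, reaches an accuracy of only 0.56 to 0.59. It therefore seems that the former may be incorrect, while the latter is correct. Please refer to the article for more details.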

In [30]:
pd.__version__

'1.2.3'

In [31]:
import sklearn
sklearn.__version__

'0.24.1'

In [32]:
import scipy
scipy.__version__

'1.6.2'

In [33]:
from platform import python_version 
python_version()

'3.7.7'