# Chi2 test for independence: why do two implementations return different results?

This notebook and the chi2_util module demonstrate the difference between the chi2 implementation in sklearn (used through SelectKBest) and the chi2 implementation in scipy.stats.

To understand the purpose, logic and conclusion of this notebook, please refer to the article on ondata.blog, published in November 2021: https://ondata.blog/articles/dont-trust-data-science-ask-the-people/

<b>Update 2021-11-25</b>: The discrepancy that this notebook describes has been filed as scikit-learn bug #21455, available in this GitHub thread: https://github.com/scikit-learn/scikit-learn/issues/21455

<b>Update 2021-11-29</b>: The chi2_util code has been improved (the calculation of expected counts is now part of the procedure), so the output of this notebook may differ slightly from the output pasted in the article.



In [1]:
import pandas as pd, numpy as np, seaborn as sns
import os, sys
from sklearn.feature_selection import SelectKBest, chi2
import sklearn.feature_selection as skfs

df = pd.read_csv('sample300.csv')

In [2]:
df = df.rename(columns = {'A': 'education', 
                          'E': 'expertise', 
                          'label': 'success'})
label = 'success'
df.head()

,education,B,C,D,expertise,F,success
0,4,11,131,45,20,159,2
1,0,6,12,63,73,64,2
2,4,8,137,56,102,240,2
3,3,14,137,58,116,59,1
4,4,4,137,10,50,200,2


## Chi2 (with sklearn)

In [3]:
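# Treat every column except the target (and an 'id' column, if present) as a feature.
cat_feature_cols = list(set(df.columns) - set([label, 'id']))
# k = 'all' keeps every feature; only the chi2 scores are of interest here.
fs = SelectKBest(score_func=skfs.chi2, k = 'all')
X, y = df[cat_feature_cols], df[label]
selector = fs.fit(X, y)
# Collect the per-feature scores and list them from highest to lowest.
kbest = pd.DataFrame({'feature': X.columns, 'score': fs.scores_})
kbest.sort_values(by = 'score', ascending = False).reset_index()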

,index,feature,score
0,1,expertise,1647.696011
1,3,C,232.577148
2,4,D,116.422861
3,5,B,69.178778
4,2,F,24.250652
5,0,education,1.412797
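
For reference, the arithmetic behind the score column can be sketched as follows. As far as the scikit-learn source can be read, `sklearn.feature_selection.chi2` sums each feature's raw values per class and compares those sums with what the class proportions alone would predict. The helper `count_style_chi2` below is a hypothetical re-implementation written only for illustration, and the five rows are copied from the `df.head()` output above.

```python
import numpy as np
from sklearn.feature_selection import chi2 as sk_chi2


def count_style_chi2(X, y):
    """Hypothetical re-implementation: chi2 on per-class sums of feature values."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One indicator column per class (n_samples x n_classes).
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                              # per-class sum of each feature
    feature_total = X.sum(axis=0, keepdims=True)    # 1 x n_features
    class_prob = Y.mean(axis=0, keepdims=True)      # 1 x n_classes
    expected = class_prob.T @ feature_total         # n_classes x n_features
    return ((observed - expected) ** 2 / expected).sum(axis=0)


# First five rows of df.head(): columns education and expertise, target success.
X_head = np.array([[4, 20], [0, 73], [4, 102], [3, 116], [4, 50]])
y_head = np.array([2, 2, 2, 1, 2])

manual_scores = count_style_chi2(X_head, y_head)
sklearn_scores, _ = sk_chi2(X_head, y_head)
print(manual_scores, sklearn_scores)
```

Under this formula the numeric codes themselves enter the sums, so relabelling the categories of a feature changes its score.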


## Manual verification

In [4]:
# Count the rows in each (education, success) cell; empty cells become 0.
df['dummy'] = 1
df.pivot_table(values = 'dummy', columns = label, index = 'education', aggfunc = len).fillna(0)

education,success=1,success=2
0,1.0,18.0
1,1.0,0.0
2,0.0,1.0
3,127.0,23.0
4,21.0,108.0
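
The contingency table above can be fed directly to `scipy.stats.chi2_contingency`. The sketch below hard-codes the counts from the table rather than reading `sample300.csv`, and it also reports how many cells have an expected count below 5, a common rule of thumb for when the chi2 approximation becomes unreliable.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: education 0..4; columns: success 1 and 2 (counts copied from the table above).
observed = np.array([
    [1, 18],
    [1, 0],
    [0, 1],
    [127, 23],
    [21, 108],
])

# correction=False: plain Pearson chi2 (Yates' correction only applies when dof == 1).
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
small_cells = int((expected < 5).sum())  # cells below the rule-of-thumb threshold

print(f"chi2={stat:.3f}, dof={dof}, p={p_value:.3g}, cells with expected < 5: {small_cells}")
```

On the full 5x2 table the test has 4 degrees of freedom, and the two education levels with a single observation have expected counts of 0.5. chi2_util reports dof = 3 for education further down, so it presumably handles such sparse levels differently; its source is not shown here, so that is an assumption.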


## Chi2 calculated correctly (with scipy.stats)
### using chi2_util.chi2_score

In [14]:
import chi2_util
from importlib import reload
reload(chi2_util)

<module 'chi2_util' from 'C:\\Users\\pplaszczak\\Documents\\00-STUFF\\stuff\\projs\\2021-10-chi2\\chi2_util.py'>

In [15]:
chi2_util.chi2_score(df, 
                     features = cat_feature_cols, 
                     target = label, 
                     alpha = 0.05,
                     deep = True)

feature,chi2,critical,dof,p,rank,reverse_rank
expertise,179.441098,5.991465,2.0,1.083579e-39,29.949455,0.03339
education,145.991612,7.814728,3.0,1.929182e-31,18.681599,0.053529
D,83.383496,14.06714,7.0,2.806969e-15,5.927537,0.168704
B,72.761234,12.591587,6.0,1.108339e-13,5.778559,0.173054
C,26.008687,3.841459,1.0,3.398845e-07,6.770524,0.147699
F,1.714086,3.841459,1.0,0.1904561,0.446207,2.241112
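
The `critical` and `p` columns can be reproduced from the chi2 distribution in scipy.stats. The sketch below assumes that `critical` is the (1 - alpha) quantile for the given degrees of freedom and that `rank` is the ratio chi2 / critical; the second point is inferred from the numbers in the table, not from the chi2_util source. The helper name `critical_and_p` is made up for this example.

```python
from scipy.stats import chi2 as chi2_dist


def critical_and_p(stat, dof, alpha=0.05):
    """Return the critical value, the p-value, and the ratio stat / critical."""
    critical = chi2_dist.ppf(1 - alpha, dof)  # threshold for rejecting independence
    p_value = chi2_dist.sf(stat, dof)         # upper-tail probability of the statistic
    return critical, p_value, stat / critical


# Values for the 'expertise' row of the table above.
critical, p_value, ratio = critical_and_p(179.441098, dof=2)
print(critical, p_value, ratio)
```

A ratio above 1 means the statistic exceeds the critical value, so independence is rejected at the chosen alpha.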


## Practical test

In [16]:
from sklearn.linear_model import LogisticRegression
chi2_util.accuracy_by_feature(X, y, classifier = LogisticRegression(max_iter = 1000)).round(2)

,accuracy,feature
1,0.81,expertise
0,0.8,education
5,0.66,B
4,0.61,D
3,0.56,C
2,0.46,F
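
The source of `chi2_util.accuracy_by_feature` is not shown in this notebook. A minimal sketch of what such a helper could look like, assuming it trains the classifier on one column at a time and reports its accuracy, is given below. The name `accuracy_per_feature`, the use of 5-fold cross-validation and the synthetic data are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def accuracy_per_feature(X, y, classifier, cv=5):
    """Hypothetical helper: cross-validated accuracy using one feature at a time."""
    rows = []
    for col in X.columns:
        # cross_val_score clones the classifier, so each feature gets a fresh model.
        scores = cross_val_score(classifier, X[[col]], y, cv=cv, scoring="accuracy")
        rows.append({"feature": col, "accuracy": scores.mean()})
    return pd.DataFrame(rows).sort_values("accuracy", ascending=False).reset_index(drop=True)


# Synthetic data: one column that tracks the target and one column of pure noise.
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)
features = pd.DataFrame({
    "informative": target + rng.normal(0, 0.3, size=200),
    "noise": rng.normal(size=200),
})

result = accuracy_per_feature(features, target, LogisticRegression(max_iter=1000))
print(result.round(2))
```

Scoring features one at a time ignores interactions between them, but it gives a ranking that does not depend on any chi2 implementation.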


In [17]:
from xgboost import XGBClassifier
chi2_util.accuracy_by_feature(X, y, classifier = XGBClassifier()).round(2)

,accuracy,feature
0,0.86,education
1,0.84,expertise
4,0.69,D
5,0.65,B
3,0.59,C
2,0.44,F


## Conclusion

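The chi2 implementations of sklearn and scipy.stats return different results for the same data: sklearn ranks education last (score 1.41), whereas the scipy.stats-based chi2_util ranks it second (chi2 = 145.99). The practical test points the same way as scipy.stats: with a single feature at a time, education and expertise give the highest accuracy for both LogisticRegression and XGBClassifier.

It seems that the former may be incorrect, while the latter is correct. Please refer to the article mentioned earlier for more details.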
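
One possible reading of the discrepancy, offered as a sketch rather than as a finding of the article: sklearn's `chi2` appears to treat each feature value as a count (as with term frequencies), not as a category label. If that is right, one-hot encoding a categorical column before scoring should recover the contingency-table statistic. The example below rebuilds the education column from the counts in the "Manual verification" table instead of reading `sample300.csv`; the variable names are made up for this example.

```python
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.feature_selection import chi2 as sk_chi2

# (success=1, success=2) counts per education level, copied from the pivot table.
counts = {0: (1, 18), 1: (1, 0), 2: (0, 1), 3: (127, 23), 4: (21, 108)}
rows = []
for level, (n1, n2) in counts.items():
    rows += [(level, 1)] * n1 + [(level, 2)] * n2
toy = pd.DataFrame(rows, columns=["education", "success"])
y = toy["success"].to_numpy()

# 1) sklearn chi2 on the raw integer codes.
raw_score = sk_chi2(toy[["education"]].to_numpy(), y)[0][0]

# 2) sklearn chi2 on one-hot columns, summed over the levels of the feature.
dummies = pd.get_dummies(toy["education"]).to_numpy(dtype=float)
onehot_score = sk_chi2(dummies, y)[0].sum()

# 3) Pearson chi2 on the contingency table with scipy.stats.
table = pd.crosstab(toy["education"], toy["success"])
scipy_stat, _, _, _ = chi2_contingency(table, correction=False)

print(raw_score, onehot_score, scipy_stat)
```

Here the score on the raw codes reproduces the 1.41 reported by SelectKBest for education, while the summed one-hot score coincides with the scipy.stats statistic for the full table.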

In [9]:
pd.__version__

'1.2.3'

In [10]:
import sklearn
sklearn.__version__

'0.24.1'

In [11]:
import scipy
scipy.__version__

'1.6.2'

In [12]:
from platform import python_version 
python_version()

'3.7.7'

In [13]:
import sklearn; sklearn.show_versions()


System:
    python: 3.7.7 (default, May  6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\pplaszczak\AppData\Local\Continuum\anaconda3\python.exe
   machine: Windows-10-10.0.18362-SP0

Python dependencies:
          pip: 21.0.1
   setuptools: 52.0.0.post20210125
      sklearn: 0.24.1
        numpy: 1.19.2
        scipy: 1.6.2
       Cython: 0.29.22
       pandas: 1.2.3
   matplotlib: 3.3.4
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True
