# Chi2_score() usage example

The purpose of this notebook is to provide an alternative to SelectKBest().

The scikit-learn method sklearn.feature_selection.SelectKBest(score_func=chi2) returns faulty results when chi2 is used as the scoring function, as described in bug #21455: https://github.com/scikit-learn/scikit-learn/issues/21455 . I discovered this using scikit-learn version 0.24.1 but, as I understand it, the bug is still present in scikit-learn 1.0.1, released in October 2021.

Until a fix is available, developers may use the method chi2_util.chi2_score(), as demonstrated below. This method is a wrapper around scipy.stats.chi2_contingency(), which is an alternative implementation of the chi-square test.
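
For orientation, here is a minimal sketch of the underlying idea: build a contingency table for one categorical feature against the target with `pd.crosstab`, then pass it to `scipy.stats.chi2_contingency()`. The toy data and the helper name `chi2_for_feature` are made up for illustration. This is not the code of `chi2_util.chi2_score()`, which may differ in its details.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data, invented for this sketch (not the sample300.csv file)
toy = pd.DataFrame({
    'feature': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
    'target':  [1, 1, 2, 2, 1, 2, 2, 1],
})

def chi2_for_feature(data, feature, target):
    """Hypothetical helper: chi-square test of independence for one feature."""
    # Rows are feature categories, columns are target classes, cells are counts
    table = pd.crosstab(data[feature], data[target])
    chi2_stat, p_value, dof, expected = chi2_contingency(table)
    return chi2_stat, p_value, dof

chi2_stat, p_value, dof = chi2_for_feature(toy, 'feature', 'target')
print(chi2_stat, p_value, dof)
```

Note that for 2x2 tables `chi2_contingency()` applies Yates' continuity correction by default (`correction=True`), so the statistic can be smaller than a hand calculation without the correction.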

## Prepare the environment

In [9]:
# imports
import pandas as pd

# read in the sample data
df = pd.read_csv('sample300.csv')

# Optional: rename the columns to stay in line with the story told in
# https://ondata.blog/articles/dont-trust-data-science-ask-the-people/
# The renaming is not needed for chi2_score() to work.

df = df.rename(columns = {'A': 'education', 
                          'E': 'expertise', 
                          'label': 'success'})
label = 'success'

# here's how the data looks
df.head()

   education   B    C   D  expertise    F  success
0          4  11  131  45         20  159        2
1          0   6   12  63         73   64        2
2          4   8  137  56        102  240        2
3          3  14  137  58        116   59        1
4          4   4  137  10         50  200        2


## Calculate chi2 score

The result is a dataframe with one row per feature. In the output below, the rows are ordered by descending chi2 score, so the strongest features come first.

In [10]:
import chi2_util
# what are our categorical feature columns? In this case, all columns except label
cat_feature_cols = [col for col in df.columns if col != label]
result = chi2_util.chi2_score(df, 
                              features = cat_feature_cols, 
                              target = label,
                              alpha = 0.05,
                              deep = True)
result

                 chi2   critical  dof             p       rank  reverse_rank
expertise  179.441098   5.991465  2.0  1.083579e-39  29.949455      0.033390
education  145.991612   7.814728  3.0  1.929182e-31  18.681599      0.053529
D           83.383496  14.067140  7.0  2.806969e-15   5.927537      0.168704
B           72.761234  12.591587  6.0  1.108339e-13   5.778559      0.173054
C           26.008687   3.841459  1.0  3.398845e-07   6.770524      0.147699
F            1.714086   3.841459  1.0  1.904561e-01   0.446207      2.241112
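
The `critical`, `rank` and `reverse_rank` columns can be cross-checked by hand. Judging from the numbers above, `critical` matches the chi-square critical value at `alpha = 0.05` for the given `dof`, `rank` matches `chi2 / critical`, and `reverse_rank` matches `critical / chi2`. This is an inference from the output, not a statement about how `chi2_util` computes them. A short sketch of the check, using values copied from the table:

```python
from scipy.stats import chi2 as chi2_dist

alpha = 0.05

# (chi2, dof) pairs copied from the result table above
rows = {
    'expertise': (179.441098, 2),
    'education': (145.991612, 3),
    'F': (1.714086, 1),
}

checks = {}
for name, (chi2_value, dof) in rows.items():
    # Critical value: the (1 - alpha) quantile of the chi-square distribution
    critical = chi2_dist.ppf(1 - alpha, dof)
    checks[name] = {
        'critical': critical,
        'rank': chi2_value / critical,
        'reverse_rank': critical / chi2_value,
        'significant': chi2_value > critical,
    }
    print(name, checks[name])
```

A feature whose `chi2` exceeds `critical` (that is, whose `rank` is above 1) is significant at the chosen `alpha`. In the table above that holds for every feature except `F`.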


## How to use this result table

Here are a few examples of what you can do.

In [11]:
# get the names of top 3 features
result.index[:3].tolist()

['expertise', 'education', 'D']

In [12]:
# get the chi2 scores for top 5 features
result['chi2'][:5]

expertise    179.441098
education    145.991612
D             83.383496
B             72.761234
C             26.008687
Name: chi2, dtype: float64

In [13]:
# get the p-values for all features
result['p']

expertise    1.083579e-39
education    1.929182e-31
D            2.806969e-15
B            1.108339e-13
C            3.398845e-07
F            1.904561e-01
Name: p, dtype: float64
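
Another common step is to keep only the features that are significant at the chosen level. The sketch below rebuilds the `p` column from the output above so that it runs on its own. In the notebook you would apply the same filter directly to `result['p']`. The variable names `p_values` and `significant` are only illustrative.

```python
import pandas as pd

# p-values copied from the output above, indexed by feature name
p_values = pd.Series({
    'expertise': 1.083579e-39,
    'education': 1.929182e-31,
    'D': 2.806969e-15,
    'B': 1.108339e-13,
    'C': 3.398845e-07,
    'F': 1.904561e-01,
}, name='p')

alpha = 0.05  # same significance level as passed to chi2_score()

# Keep the features whose p-value is below alpha
significant = p_values[p_values < alpha].index.tolist()
print(significant)
```

With these numbers, every feature except `F` passes the filter.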

## More reading

The implementation is intended as a temporary fix. It should work well in most cases, but it is not fully robust: two sections of the algorithm still need improvement. Read the code for the details.

If anything does not work, the cause may be a dependency mismatch, so compare your versions of the libraries with mine, listed below:

In [27]:
pd.__version__

'1.2.3'

In [28]:
import sklearn
sklearn.__version__

'0.24.1'

In [29]:
import scipy
scipy.__version__

'1.6.2'

In [33]:
from platform import python_version 
python_version()

'3.7.7'

In [39]:
import sklearn; sklearn.show_versions()


System:
    python: 3.7.7 (default, May  6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\pplaszczak\AppData\Local\Continuum\anaconda3\python.exe
   machine: Windows-10-10.0.18362-SP0

Python dependencies:
          pip: 21.0.1
   setuptools: 52.0.0.post20210125
      sklearn: 0.24.1
        numpy: 1.19.2
        scipy: 1.6.2
       Cython: 0.29.22
       pandas: 1.2.3
   matplotlib: 3.3.4
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True
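
The version comparison suggested above can also be done in one go. This is a small sketch: the `reference` dictionary simply repeats the versions listed above, and the helper name `installed_versions` is made up for illustration.

```python
import importlib

# Versions used by the author, copied from the output above
reference = {
    'pandas': '1.2.3',
    'sklearn': '0.24.1',
    'scipy': '1.6.2',
    'numpy': '1.19.2',
}

def installed_versions(names):
    """Return {module name: version string, or None if not importable}."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            found[name] = getattr(module, '__version__', None)
        except ImportError:
            found[name] = None
    return found

mine = installed_versions(reference)
for name, ref_version in reference.items():
    marker = 'same' if mine[name] == ref_version else 'differs'
    print(f'{name}: installed {mine[name]}, reference {ref_version} ({marker})')
```

A differing version does not necessarily mean something will break. It only points to where to look first.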
