## What is Pearson’s chi-square test of association/independence? How is it useful in feature selection?

The chi-square test applies to categorical features in a dataset. In practice, we compute the chi-square statistic between each feature and the target, then keep the desired number of features with the highest scores.

- The chi-square test is akin to correlation, but for categorical variables.
- It is used to test relationships between categorical variables, i.e. a categorical response and a categorical predictor.
- The null hypothesis of the chi-square test is that no relationship exists between the categorical variables in the population, i.e. they are independent.
- At the conventional 0.05 significance level, p > 0.05 means we fail to reject independence, and p < 0.05 means we reject it and treat the variables as dependent. For a fixed number of degrees of freedom, a higher chi-square value is stronger evidence that the feature depends on the target.
- Chi-square is sensitive to small frequencies in the cells of a table. Generally, when the expected count in a cell is less than 5, chi-square can lead to errors in conclusions.
- The other chi-square test is the goodness-of-fit test.

- ANOVA handles a continuous response with a categorical predictor. It can also be used for feature selection (in classification the roles are swapped: continuous feature, categorical target).
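To make the chi-square points above concrete, here is a minimal sketch of a test of independence on a 2x2 contingency table using `scipy.stats.chi2_contingency`. The counts are hypothetical, invented only for illustration, and the 0.05 threshold is the conventional choice rather than a rule.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are two levels of a predictor,
# columns are two levels of the response.
observed = np.array([[30, 10],
                     [20, 40]])

# For 2x2 tables SciPy applies Yates' continuity correction by default.
stat, p_value, dof, expected = chi2_contingency(observed)

print("chi-square statistic:", stat)
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected counts under independence:\n", expected)

# Rule-of-thumb check: every expected count should be at least 5.
print("all expected counts >= 5:", bool((expected >= 5).all()))

# Decision at the conventional 0.05 level.
print("reject independence:", p_value < 0.05)
```

The `expected` array is what the "less than 5" rule of thumb refers to: it holds the counts implied by the row and column totals if the two variables were independent.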

## Chi-square Test can also be used for feature selection

In [2]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [9]:
iris = load_iris() 
X = pd.DataFrame(iris.data, columns=iris.feature_names)
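# scikit-learn expects a 1D target; a one-column DataFrame triggers a conversion warning
y = pd.Series(iris.target, name='target')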

### SelectKBest

Available score functions (the default is f_classif):
- For regression: f_regression, mutual_info_regression
- For classification: chi2, f_classif, mutual_info_classif

If you pass chi2 as the score function, SelectKBest computes the chi2 statistic between each feature of X and y (assumed to be class labels). Features must be non-negative, because chi2 treats them as counts or frequencies. A small value suggests the feature is independent of y. A large value suggests the feature is non-randomly related to y, and so is likely to provide important information. Only the k highest-scoring features are retained.
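As a rough illustration of what the chi2 score measures, the sketch below recomputes the statistic by hand on the iris data and compares it with `sklearn.feature_selection.chi2`. It assumes, to my understanding of the scikit-learn implementation, that the "observed" table is the per-class sum of each feature and the "expected" table comes from class frequencies times feature totals. The names `X_arr`, `y_arr`, and `manual_chi2` are illustrative only.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

iris = load_iris()
X_arr, y_arr = iris.data, iris.target
n_classes = len(np.unique(y_arr))

# One-hot encode the labels: shape (n_samples, n_classes).
Y = np.eye(n_classes)[y_arr]

observed = Y.T @ X_arr                  # per-class sum of each feature
feature_total = X_arr.sum(axis=0)       # overall sum of each feature
class_prob = Y.mean(axis=0)             # relative frequency of each class
expected = np.outer(class_prob, feature_total)

# Pearson statistic: sum over classes of (O - E)^2 / E, one value per feature.
manual_chi2 = ((observed - expected) ** 2 / expected).sum(axis=0)
# p-values from the chi-square distribution with (n_classes - 1) degrees of freedom.
manual_p = chi2_dist.sf(manual_chi2, df=n_classes - 1)

sk_chi2, sk_p = chi2(X_arr, y_arr)
print("statistics match:", np.allclose(manual_chi2, sk_chi2))
print("p-values match:", np.allclose(manual_p, sk_p))
```

Because the features are summed as if they were counts, this score is most meaningful for count-like or binary features; for continuous measurements such as iris it still runs, but the result depends on the measurement units.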

In [10]:
# The two features with the highest chi-squared statistics are selected
# score_func is a callable and can also be user defined
selector = SelectKBest(score_func=chi2, k=2)
selector.fit(X, y)

SelectKBest(k=2, score_func=<function chi2 at 0x7fde1858b050>)

In [34]:
# Reduced features 
X_new = selector.transform(X)
print(X.shape,X_new.shape)

(150, 4) (150, 2)


In [37]:
# get list of columns
list(X.columns[selector.get_support(indices=True)])

['petal length (cm)', 'petal width (cm)']

In [38]:
# chi2 returns two arrays: the chi-square statistics, then the p-values
chi_scores = chi2(X, y)

print('Chi Square Values:', chi_scores[0]) 
print('p-values:', chi_scores[1]) 

Chi Square Values: [ 10.81782088   3.7107283  116.31261309  67.0483602 ]
p-values: [4.47651499e-03 1.56395980e-01 5.53397228e-26 2.75824965e-15]
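Since the iris features are continuous, the ANOVA F-test mentioned earlier is a natural alternative score. The following is a minimal sketch that swaps `f_classif` in for `chi2`. The variable names are illustrative, and this is a comparison under the same k=2 setting, not a claim about which score is better.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X_arr, y_arr = iris.data, iris.target

# Rank features by the one-way ANOVA F-statistic and keep the top two.
anova_selector = SelectKBest(score_func=f_classif, k=2)
X_anova = anova_selector.fit_transform(X_arr, y_arr)

print("F-values:", anova_selector.scores_)
print("p-values:", anova_selector.pvalues_)
print(X_arr.shape, X_anova.shape)

# Names of the retained features.
kept = [name for name, keep in zip(iris.feature_names,
                                   anova_selector.get_support()) if keep]
print(kept)
```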


## Feature selection with RFE

In [48]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

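# feature selection
model = LogisticRegression()
# n_features_to_select must be passed by keyword in recent scikit-learn.
# 15 exceeds the 4 available features, so RFE keeps all of them.
rfe = RFE(model, n_features_to_select=15)
fit = rfe.fit(X, y)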


print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Num Features: 4
Selected Features: [ True  True  True  True]
Feature Ranking: [1 1 1 1]
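The run above keeps every feature because it asks for more features than the data has. As a sketch of RFE actually eliminating features, the example below asks for 2 of the 4 iris features. The choice of 2 and `max_iter=1000` (to help the solver converge) are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_arr, y_arr = iris.data, iris.target

# Recursively drop the weakest feature (by coefficient size) until 2 remain.
rfe_2 = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe_2.fit(X_arr, y_arr)

print("Num Features:", rfe_2.n_features_)
print("Selected Features:", rfe_2.support_)
# Rank 1 marks a selected feature; larger ranks were eliminated earlier.
print("Feature Ranking:", rfe_2.ranking_)

selected = [name for name, keep in zip(iris.feature_names, rfe_2.support_) if keep]
print(selected)
```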


In [7]:
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

# get variance inflation factor
def get_vif(X):
    """
    Takes a pd.DataFrame or 2D np.array
    and prints the Variance Inflation Factor
    for every variable.
    """
    # Work on a copy so the caller's data does not gain an intercept column
    X = pd.DataFrame(X).copy()

    # statsmodels' vif does not add a constant itself, so append one
    X['__INTERCEPT'] = np.ones(X.shape[0])

    # Skip the last column, which is the intercept
    for i in range(X.shape[1] - 1):
        the_vif = vif(X.values, i)
        print("VIF for column {:03}: {:.02f}".format(i, the_vif))

In [8]:
get_vif(X)

VIF for column 000: 7.07
VIF for column 001: 2.10
VIF for column 002: 31.26
VIF for column 003: 16.09
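For intuition about these numbers, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. Below is a minimal sketch that computes it directly with scikit-learn's `LinearRegression`. The helper `manual_vif` is a hypothetical name introduced here, not part of statsmodels.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

X_arr = load_iris().data

def manual_vif(data):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of a 2D array."""
    vifs = []
    for j in range(data.shape[1]):
        target = data[:, j]
        others = np.delete(data, j, axis=1)
        # LinearRegression fits an intercept by default.
        r2 = LinearRegression().fit(others, target).score(others, target)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

vifs = manual_vif(X_arr)
print(np.round(vifs, 2))
```

A VIF near 1 means a feature is close to uncorrelated with the rest, while large values mean it is nearly a linear combination of the other features.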
