### Fisher Score - Chi-Square Test for Feature Selection

Compute chi-squared statistics between each non-negative feature and the class.
- This score should be used to evaluate categorical variables in a classification task.

This score can be used to select the n_features features with the highest values of the chi-squared test statistic from X, which must
contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the
classes.

Recall that the chi-square test measures the dependence between stochastic variables, so using this function "weeds out" the
features that are the most likely to be independent of the class and therefore irrelevant for classification. The chi-square statistic is commonly used for testing the relationship between categorical features.

It compares the observed distribution of the classes of the target Y among the categories of the feature against the distribution that would be expected if the target classes were independent of the feature categories.
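To make the observed-versus-expected comparison concrete, here is a minimal sketch in plain Python. The counts are made up for illustration (they are not taken from the Titanic data), and all variable names are hypothetical.

```python
# Hypothetical contingency table: rows = feature categories, columns = target classes.
observed = {
    "male":   {"died": 8, "survived": 2},
    "female": {"died": 3, "survived": 7},
}

classes = ["died", "survived"]
n = sum(sum(row.values()) for row in observed.values())
row_totals = {cat: sum(row.values()) for cat, row in observed.items()}
col_totals = {c: sum(row[c] for row in observed.values()) for c in classes}

chi_square = 0.0
for cat, row in observed.items():
    for c in classes:
        # Count we would expect if the feature and the target were independent
        expected = row_totals[cat] * col_totals[c] / n
        chi_square += (row[c] - expected) ** 2 / expected

print(round(chi_square, 4))
```

The larger the gap between observed and expected counts, the larger the statistic, and the less plausible it is that the feature is independent of the target.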

In [91]:
# Import datasets

import pandas as pd
df = pd.read_csv('D:\\Feature Selection\\Datasets\\Titanic Datasets\\train.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [92]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [93]:
# Select categorical features
## ['Sex','Embarked','Name','Pclass','Survived']

# Skip 'Name': from domain knowledge, names are unique and have no direct relationship to the dependent feature

# .copy() avoids a SettingWithCopyWarning when the columns are re-encoded later
df = df[['Sex','Embarked','Pclass','Survived']].copy()
df.head()

Unnamed: 0,Sex,Embarked,Pclass,Survived
0,male,S,3,0
1,female,C,1,1
2,female,S,3,1
3,female,S,1,1
4,male,S,3,0


### Remember: before applying chi-square, we must label-encode the categorical features

In [94]:
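# Label-encode the 'Sex' feature: male -> 1, female -> 0

import numpy as np
df['Sex'] = np.where(df['Sex']=='male',1,0)


# Label-encode the 'Embarked' feature, numbering categories in order of first appearance
# Note: unique() also returns NaN (Embarked has 2 missing values), so NaN ends up as a key in this mapping

ordinal_label = {k:i for i,k in enumerate(df['Embarked'].unique(),0)}
df['Embarked'] = df['Embarked'].map(ordinal_label)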


In [95]:
df.head()

Unnamed: 0,Sex,Embarked,Pclass,Survived
0,1,0,3,0
1,0,1,1,1
2,0,0,3,1
3,0,0,1,1
4,1,0,3,0
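The encoding above relies on NaN being picked up by unique(). One alternative, sketched here on a small made-up Series rather than the real column, is to give missing values an explicit label before building the mapping. The "Missing" label and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of an 'Embarked'-like column with one missing value
embarked = pd.Series(["S", "C", "S", np.nan, "Q"])

# Make the missing category explicit so it is visible in the mapping
filled = embarked.fillna("Missing")

# Number the categories in order of first appearance
mapping = {category: code for code, category in enumerate(filled.unique())}
encoded = filled.map(mapping)

print(mapping)
print(encoded.tolist())
```

Because chi2 requires non-negative inputs, integer codes starting at 0 are a safe choice either way.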


In [96]:
# Then perform train_test_split

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df[['Sex','Embarked','Pclass']],df['Survived'],test_size = 0.3,random_state = 32)

In [97]:
X_train.isnull().sum()

Sex         0
Embarked    0
Pclass      0
dtype: int64

In [98]:
y_train

446    1
435    1
602    0
322    1
517    0
      ..
403    0
88     1
310    1
555    0
727    1
Name: Survived, Length: 623, dtype: int64

### Performing the Chi-Square test

chi2 returns two arrays: the chi-squared statistic for each feature (referred to as the F-score in this notebook) and the corresponding p-value.

In [99]:
# Perform chi-square test

from sklearn.feature_selection import chi2
f_p_values = chi2(X_train,y_train)

In [100]:
f_p_values

(array([64.34158945, 15.18084156, 22.00013369]),
 array([1.04614006e-15, 9.76895397e-05, 2.72631475e-06]))

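(array([64.34158945, 15.18084156, 22.00013369]),   --- chi-squared statistic (F-score) for Sex, Embarked, Pclass

array([1.04614006e-15, 9.76895397e-05, 2.72631475e-06])) --- corresponding p-values

A higher score indicates stronger dependence between the feature and the target, so the feature is a better candidate for selection.

A lower p-value indicates the same thing: the feature is less likely to be independent of the target.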
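As a rough sketch of where these two arrays come from, the snippet below follows the computation that, to my understanding, sklearn's chi2 performs: per-class sums of each feature are compared with the sums expected from the class proportions. The toy X and y are made up, and the p-value shortcut via erfc is an assumption that only holds for a binary target (1 degree of freedom).

```python
import math
import numpy as np

# Toy data (hypothetical): 4 samples, 2 non-negative features, binary target
X = np.array([[1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)
y = np.array([1, 1, 0, 0])

# One-hot encode the target, shape (n_samples, n_classes)
classes = np.unique(y)
Y = (y[:, None] == classes[None, :]).astype(float)

observed = Y.T @ X                               # per-class sum of each feature
feature_count = X.sum(axis=0)                    # total of each feature
class_prob = Y.mean(axis=0)                      # share of samples in each class
expected = np.outer(class_prob, feature_count)   # sums expected under independence

scores = ((observed - expected) ** 2 / expected).sum(axis=0)

# With 1 degree of freedom the chi-square survival function
# reduces to erfc(sqrt(x / 2)), so scipy is not needed here.
p_values = np.array([math.erfc(math.sqrt(s / 2)) for s in scores])

print(scores, p_values)
```

For targets with more than two classes, the p-value needs the general chi-square survival function with n_classes - 1 degrees of freedom.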

In [105]:
# Sort the p-values from smallest to largest
p_values = pd.Series(f_p_values[1])
p_values.index = X_train.columns

p_values.sort_values(ascending = True)

Sex         1.046140e-15
Pclass      2.726315e-06
Embarked    9.768954e-05
dtype: float64

### Why sort_values is used instead of sort_index

p_values is a Series with column names as the index and p-values as the values.

- sort_values(ascending=True) orders the Series by the p-values themselves, placing the smallest p-value (the feature most strongly associated with the target) first.

- sort_index(ascending=False) orders the Series by the column names in reverse alphabetical order. Here that happens to give Sex, Pclass, Embarked, the same order as sorting by p-value, but only by coincidence; with other column names it would not rank the features.

So, to rank the features by p-value, sort_values is the appropriate method.
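A small sketch with made-up p-values shows how sorting by index label and sorting by value differ in general. The feature names and numbers below are hypothetical.

```python
import pandas as pd

# Hypothetical p-values indexed by feature name
demo_p_values = pd.Series([0.20, 0.01, 0.05], index=["Zone", "Sex", "Age_band"])

# Orders by the labels (reverse alphabetical), ignoring the p-values
by_label = demo_p_values.sort_index(ascending=False)

# Orders by the p-values, smallest (strongest evidence of dependence) first
by_value = demo_p_values.sort_values(ascending=True)

print(by_label.index.tolist())
print(by_value.index.tolist())
```

Only the second ordering ranks the features by their p-values.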

### Observation

Sex is the most important feature with respect to the output feature Survived: it has the highest chi-squared score and the lowest p-value.

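This chi-square ranking is sometimes referred to as the Fisher score in tutorials, although the Fisher score is strictly a different feature-selection criterion.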
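To act on the ranking, the top-scoring features can be kept automatically. The sketch below uses sklearn's SelectKBest with chi2 on toy data; the arrays and the choice of k=1 are assumptions for illustration, and with this notebook's data one would presumably pass X_train and y_train instead.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy data (hypothetical): 4 samples, 2 non-negative features, binary target
X = np.array([[1, 0], [1, 1], [0, 1], [0, 1]])
y = np.array([1, 1, 0, 0])

# Keep the single feature with the highest chi-squared score
selector = SelectKBest(score_func=chi2, k=1)
X_selected = selector.fit_transform(X, y)

# Boolean mask marking which input columns were kept
mask = selector.get_support()

print(selector.scores_)
print(mask)
```

Fitting the selector on the training split only, then calling transform on the test split, keeps information from the test data out of the selection step.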