Compute chi-squared stats between each non-negative feature and class.

This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

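Recall that the chi-square test measures dependence between stochastic variables, so using this function "weeds out" the features that are the most likely to be independent of class and therefore irrelevant for classification.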

In [58]:
import seaborn as sns
df=sns.load_dataset('titanic')

In [59]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [60]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB


In [61]:
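# Keep only the categorical features (sex, embarked, alone, pclass) and the target (survived).
# .copy() gives an independent DataFrame, so later column assignments do not act on a slice of the original.
df=df[['sex','embarked','alone','pclass','survived']].copy()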

In [62]:
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,male,S,False,3,0
1,female,C,False,1,1
2,female,S,True,3,1
3,female,S,False,1,1
4,male,S,True,3,0


Before applying the chi-square test, we need to label-encode the categorical columns so that every feature is numeric and non-negative.

In [63]:
# Label encoding on sex column
import numpy as np
df['sex']=np.where((df['sex']=="male"),1,0)



In [64]:
# Map each distinct value of embarked (including NaN) to an integer code using a dict comprehension
ordinal_label = {k:i for i,k in enumerate(df['embarked'].unique(),0)}

In [65]:
ordinal_label

{'C': 1, 'Q': 2, 'S': 0, nan: 3}

In [66]:
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,1,S,False,3,0
1,0,C,False,1,1
2,0,S,True,3,1
3,0,S,False,1,1
4,1,S,True,3,0


In [67]:
df['embarked']=df['embarked'].map(ordinal_label)



In [68]:
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,1,0,False,3,0
1,0,1,False,1,1
2,0,0,True,3,1
3,0,0,False,1,1
4,1,0,True,3,0


In [69]:
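# Encode alone as an integer. Note the direction: False (not alone) becomes 1 and True (alone) becomes 0.
df['alone'] = np.where(df['alone']==False,1,0)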

In [70]:
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,1,0,1,3,0
1,0,1,1,1,1
2,0,0,0,3,1
3,0,0,1,1,1
4,1,0,0,3,0


In [71]:
# All columns are now label-encoded; split into features (x) and target (y)
x=df.drop(['survived'],axis=1)
y=df['survived']

In [72]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=100)

In [73]:
x_train.head()

Unnamed: 0,sex,embarked,alone,pclass
69,1,0,1,3
85,0,0,1,3
794,1,0,0,3
161,0,0,0,2
815,1,0,0,1


In [74]:
x_train.isnull().sum()

sex         0
embarked    0
alone       0
pclass      0
dtype: int64

In [75]:
# Perform the chi-square test
# chi2 returns two arrays: the chi-squared statistic and the p-value for each feature
from sklearn.feature_selection import chi2
f_p_values = chi2(x_train,y_train)

In [76]:
f_p_values

(array([65.67929505,  7.55053653, 17.02136634, 21.97994154]),
 array([5.30603805e-16, 5.99922095e-03, 3.69615464e-05, 2.75514881e-06]))
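
To make the numbers above less opaque, here is a minimal pure-Python sketch of how a chi-squared score of this kind can be computed. It assumes, as I understand scikit-learn's `chi2` to work, that each feature value is treated as a count, that the observed total per class is compared with the total expected from the class proportions, and that every feature has a non-zero total. The function name `chi2_scores` and the toy data are hypothetical and not part of the notebook.

```python
def chi2_scores(X, y):
    """Chi-squared statistic per feature, treating feature values as counts."""
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        # Total "count" of this feature over all samples
        feature_total = sum(row[j] for row in X)
        stat = 0.0
        for c in classes:
            # Share of samples that belong to class c
            class_prob = sum(1 for label in y if label == c) / n_samples
            # Observed total of the feature inside class c
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Total expected in class c if feature and class were independent
            expected = class_prob * feature_total
            stat += (observed - expected) ** 2 / expected
        scores.append(stat)
    return scores


# Toy data (hypothetical): one binary flag and the same flag with 0 and 1 swapped
y = [1, 1, 1, 1, 0, 0, 0, 0]
flag = [1, 1, 1, 0, 0, 0, 0, 0]
flipped = [1 - v for v in flag]
X = [[a, b] for a, b in zip(flag, flipped)]

scores = chi2_scores(X, y)
print(scores)
```

The two columns carry the same information, yet they score about 3.0 and 1.8 respectively. Under this count-based formulation the score depends on how a category is coded, which is worth keeping in mind for label-encoded columns such as embarked, alone and pclass.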

In [77]:
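# A higher chi-squared score means stronger dependence between the feature and the target.
# A lower p-value means that dependence is less likely to be due to chance, i.e. the feature is more important.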
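
The link between the two arrays can be checked by hand. The sketch below assumes that, with two classes (survived = 0 or 1), each score is referred to a chi-squared distribution with one degree of freedom, whose upper-tail probability has the closed form erfc(sqrt(x / 2)). The helper name `chi2_p_value_1df` is hypothetical; the scores are copied from the output above.

```python
import math


def chi2_p_value_1df(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Chi-squared scores copied from the chi2 output above
scores = {
    "sex": 65.67929505,
    "embarked": 7.55053653,
    "alone": 17.02136634,
    "pclass": 21.97994154,
}

p_values = {name: chi2_p_value_1df(s) for name, s in scores.items()}
for name, p in p_values.items():
    print(f"{name}: {p:.6e}")
```

Because the p-value here is a decreasing function of the score alone, ranking features by highest score and by lowest p-value gives the same order.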

In [78]:
import pandas as pd
pvalues= pd.Series(f_p_values[1])
pvalues.index=x_train.columns


In [79]:
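# f_p_values[0] holds the chi-squared statistics (despite the variable name, these are not F-scores)
f_score = pd.Series(f_p_values[0])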
f_score.index = x_train.columns

In [80]:
pvalues

sex         5.306038e-16
embarked    5.999221e-03
alone       3.696155e-05
pclass      2.755149e-06
dtype: float64

In [81]:
f_score

sex         65.679295
embarked     7.550537
alone       17.021366
pclass      21.979942
dtype: float64
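
Selecting the features with the highest scores, as described at the top, then comes down to sorting. This is a small sketch using the scores printed above; the helper `top_k` is hypothetical, and scikit-learn's `SelectKBest` is the usual tool for doing this inside a pipeline.

```python
# Chi-squared scores copied from the f_score output above
scores = {
    "sex": 65.679295,
    "embarked": 7.550537,
    "alone": 17.021366,
    "pclass": 21.979942,
}


def top_k(scores, k):
    """Return the names of the k features with the highest scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


ranking = top_k(scores, len(scores))
best_two = top_k(scores, 2)
print(ranking)
print(best_two)
```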

### Observations
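The sex column has the highest chi-squared score (65.68) and the lowest p-value (5.3e-16), so it is the feature most strongly associated with the target, survived. It is followed by pclass, alone and embarked, in that order.

The test itself only measures association, not direction, but this result is consistent with men having died at a much higher rate than women on the Titanic.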