# Module 3: Feature selection - Practice

In this session you will practice **feature selection**, which reduces the dimensionality of the data. It brings the following benefits:

1. Reduces overfitting by removing noise introduced by some of the features.
2. Reduces training time, which allows you to experiment more with different models and hyperparameters.
3. Reduces data acquisition requirements.
4. Improves comprehensibility of the model, because a smaller set of features is easier for humans to understand. That lets you focus on the main sources of predictability and makes the model easier to justify to another person.

We are going to use the **titanic** dataset for this practice.

sklearn API reference:

+ [sklearn.feature_selection.SelectKBest](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)
+ [sklearn.feature_selection.chi2](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html)
+ [sklearn.preprocessing.LabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html)
+ [sklearn.feature_selection.f_regression](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html)
+ [sklearn.preprocessing.scale](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html)
+ [sklearn.feature_selection.mutual_info_classif](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html)
+ [sklearn.feature_selection.RFE](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html)


In [5]:
import os, sys
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_selection import RFE
from sklearn.base import clone

np.random.seed(18937)

## Load Dataset

In [6]:
# Dataset location
DATASET = '/dsa/data/all_datasets/titanic_ML/titanic.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)
dataset.describe()

| | pclass | sex | age | sibsp | parch | fare | embarked | survived |
|---|---|---|---|---|---|---|---|---|
| count | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 |
| mean | 2.31236 | 0.642697 | 29.548697 | 0.503371 | 0.351685 | 32.865772 | 0.895506 | 0.389888 |
| std | 0.837241 | 0.479475 | 13.379025 | 1.095286 | 0.790069 | 52.639685 | 0.529535 | 0.487999 |
| min | 1.0 | 0.0 | 0.17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 21.0 | 0.0 | 0.0 | 7.8958 | 1.0 | 0.0 |
| 50% | 3.0 | 1.0 | 28.0 | 0.0 | 0.0 | 13.775 | 1.0 | 0.0 |
| 75% | 3.0 | 1.0 | 37.0 | 1.0 | 0.0 | 29.925 | 1.0 | 1.0 |
| max | 3.0 | 1.0 | 80.0 | 8.0 | 9.0 | 512.3292 | 2.0 | 1.0 |


Create variables **X** and **y** that hold the features and the labels, respectively.

In [7]:
# Complete code below this comment 
# ----------------------------------
X = np.array(dataset.iloc[:,:-1])
y = np.array(dataset.survived)

Create a train/validate split with **20%** of data held out for validation only.

In [8]:
# Complete code below this comment 
# ----------------------------------
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.2)

## χ² feature selection

Create a k-best feature selector with [SelectKBest](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)
that uses [χ² feature selection](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html)
to select **top 5** features.

In [9]:
# Complete code below this comment 
# ----------------------------------
selector = SelectKBest(chi2,k=5)

**Fit** the selector to the **training dataset**.

Then **print** the χ² statistics (the scores computed by this selector).

In [10]:
# Add code below this comment 
# ----------------------------------
selector.fit(X_train,y_train)
print('X_squared statistic',selector.scores_)

X_squared statistic [   20.10600106    74.40316364    14.7231048      4.28721714     9.80957579
  4084.61256534     5.20247536]
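To make these scores less opaque, here is a minimal sketch of how a χ² score of this kind can be computed by hand: for each feature, the per-class sum of the feature is compared with the sum expected if the feature were independent of the label. The toy data (`X_toy`, `y_toy`) is hypothetical and stands in for the titanic arrays; the last lines check the hand computation against `sklearn.feature_selection.chi2`.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 3 non-negative features, binary labels.
y_toy = rng.integers(0, 2, size=200)
X_toy = rng.poisson(lam=3.0, size=(200, 3)).astype(float)
X_toy[:, 0] += 2 * y_toy  # make feature 0 depend on the label

# One-hot encode the labels: shape (n_samples, n_classes).
Y = np.column_stack([1 - y_toy, y_toy])

# Observed: sum of each feature within each class, shape (n_classes, n_features).
observed = Y.T @ X_toy

# Expected under independence: class frequency times the feature's total.
feature_total = X_toy.sum(axis=0, keepdims=True)   # shape (1, n_features)
class_prob = Y.mean(axis=0, keepdims=True)         # shape (1, n_classes)
expected = class_prob.T @ feature_total

manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
sklearn_scores, _ = chi2(X_toy, y_toy)

print(manual_scores)
print(sklearn_scores)
```

One consequence of this formula is that multiplying a feature by a constant multiplies its score by the same constant, so a large-valued column such as `fare` may score high partly because of its scale. The test is designed for non-negative features such as counts or frequencies.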


What are the column indices that were just selected by the feature selector?

**Print these indices.**

In [11]:
# Add code below this comment 
# ----------------------------------
print('Selected indices',selector.get_support(True))

Selected indices [0 1 2 4 5]


What are the names of the selected columns?
Does it make sense that these features were selected?

**Print the names of the selected columns.**

In [12]:
# Add code below this comment 
# ----------------------------------
[dataset.columns[i] for i in selector.get_support(True)]

['pclass', 'sex', 'age', 'parch', 'fare']

Call the **selector.transform()** method to select those feature columns from the training and validation sets.

In [13]:
# Complete code below this comment 
# ----------------------------------
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

Now **fit** a Gaussian Naive Bayes model with these selected features.

**Compute validation accuracy** using model.score().

In [14]:
# Complete code below this comment 
# ----------------------------------
model = GaussianNB()
model.fit(X_train_selected, y_train)
model.score(X_test_selected, y_test)

0.7247191011235955

How does this compare to a model trained without feature selection?

In [15]:
# Complete code below this comment 
# ----------------------------------
model = GaussianNB()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.7191011235955056

Are the selected features statistically significant? <span style="background: yellow;">TODO</span>

Validation accuracy is 0.7247 with feature selection versus 0.7191 without it, which I would say is a very small difference.
The main question is whether that difference would hold up, or grow, when the models are used on more data.
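A single train/validate split gives a noisy estimate, so one way to probe whether such a gap is real is cross-validation with the selector placed inside a pipeline, so that it is refit on each training fold. The sketch below is only an illustration under assumptions: `X_demo` and `y_demo` are hypothetical synthetic arrays, not the titanic data, and 5 folds is an arbitrary choice.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 7 non-negative features, of which 2 carry signal.
n_samples = 400
y_demo = rng.integers(0, 2, size=n_samples)
X_demo = rng.random((n_samples, 7))
X_demo[:, 0] += 0.5 * y_demo
X_demo[:, 1] += 0.3 * y_demo

# The selector lives inside the pipeline, so each fold selects its own features.
with_selection = make_pipeline(SelectKBest(chi2, k=5), GaussianNB())
without_selection = GaussianNB()

scores_with = cross_val_score(with_selection, X_demo, y_demo, cv=5)
scores_without = cross_val_score(without_selection, X_demo, y_demo, cv=5)

print("with selection:    %.3f +/- %.3f" % (scores_with.mean(), scores_with.std()))
print("without selection: %.3f +/- %.3f" % (scores_without.mean(), scores_without.std()))
```

If the two means differ by less than the spread across folds, the difference is probably not meaningful.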

In [16]:
chi2_sklearn, pvalue_sklearn = chi2(X_train, y_train)
print(pvalue_sklearn)

[  7.32664275e-06   6.36868624e-18   1.24511081e-04   3.83999690e-02
   1.73605546e-03   0.00000000e+00   2.25547470e-02]
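These p-values can be reproduced from the χ² scores with the survival function of the χ² distribution. The sketch below copies the scores and p-values printed earlier in this notebook and assumes 1 degree of freedom (number of classes minus one, with two classes). The 0.05 threshold is a hypothetical choice for illustration, not something this notebook prescribes.

```python
import numpy as np
from scipy.stats import chi2 as chi2_distribution

# Scores and p-values copied from the outputs printed earlier in this notebook.
scores = np.array([20.10600106, 74.40316364, 14.7231048, 4.28721714,
                   9.80957579, 4084.61256534, 5.20247536])
pvalues = np.array([7.32664275e-06, 6.36868624e-18, 1.24511081e-04,
                    3.83999690e-02, 1.73605546e-03, 0.0, 2.25547470e-02])

# Assumption: two classes (survived or not), hence 1 degree of freedom.
recomputed = chi2_distribution.sf(scores, 1)
print(recomputed)

# Hypothetical significance level, used only to illustrate the comparison.
alpha = 0.05
significant = recomputed < alpha
print(significant)
```

Under that threshold every feature would count as significant, although the features with the smallest scores (`sibsp` and `embarked`) are the closest to the cutoff.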


In [17]:
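from scipy.stats import chi2 as chi2_distribution
# Example chi-squared statistics; sf() returns the upper-tail probability
# (p-value), here with 10 degrees of freedom, shown as a rounded percentage.
u = np.array([11.3,15.6,13.0,4.12,0.752,162,2.76e+03,2.30e-04,0.155,4.56,46.4])
np.round(chi2_distribution.sf(u, 10) * 100)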

array([  33.,   11.,   22.,   94.,  100.,    0.,    0.,  100.,  100.,
         92.,    0.])

## Mutual information

Mutual information, which comes from information theory, is generally considered a robust measure of dependence.
It applies to both classification and regression problems.
sklearn provides [mutual_info_classif()](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif)
and [mutual_info_regression()](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression) for these two cases, respectively.
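For two discrete variables, mutual information is the sum over all value pairs of p(x, y) · log(p(x, y) / (p(x) p(y))). The sketch below computes that plug-in estimate from a contingency table and compares it with `sklearn.metrics.mutual_info_score`. The helper `discrete_mutual_info` and the toy arrays are hypothetical, and this is not the estimator that `mutual_info_classif` uses for continuous features (that one is based on nearest neighbors).

```python
import numpy as np
from sklearn.metrics import mutual_info_score


def discrete_mutual_info(x, y):
    """Plug-in estimate of I(X; Y) in nats for two 1-D discrete arrays."""
    x_values, x_idx = np.unique(x, return_inverse=True)
    y_values, y_idx = np.unique(y, return_inverse=True)

    # Joint probability table estimated from co-occurrence counts.
    joint = np.zeros((len(x_values), len(y_values)))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()

    # Product of the marginals, i.e. the joint table under independence.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    independent = px @ py

    nonzero = joint > 0  # empty cells contribute nothing to the sum
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / independent[nonzero])))


rng = np.random.default_rng(0)
x_demo = rng.integers(0, 3, size=500)
y_dependent = (x_demo + rng.integers(0, 2, size=500)) % 3  # depends on x_demo
y_independent = rng.integers(0, 3, size=500)               # unrelated to x_demo

mi_dependent = discrete_mutual_info(x_demo, y_dependent)
mi_independent = discrete_mutual_info(x_demo, y_independent)
print(mi_dependent, mutual_info_score(x_demo, y_dependent))
print(mi_independent, mutual_info_score(x_demo, y_independent))
```

The dependent pair should score clearly higher than the independent pair, whose estimate is close to zero but slightly positive because of sampling noise.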

We review feature selection again and also try this method below.

**Tip**: _Putting Python variables in a function limits their scope and keeps them out of the global namespace.
The only keywords that create a new scope in Python are **class**, **def** and **lambda**._
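A small illustration of the tip, using hypothetical names: the assignment inside the function does not touch the global variable of the same name, while a `for` loop creates no scope of its own.

```python
def scoped_session():
    selector = "local selector"  # exists only inside this call
    return selector


selector = "global selector"
result = scoped_session()
print(selector)  # the global name is untouched by the function body

for i in range(3):
    pass
print(i)  # a for loop creates no new scope, so i is still visible here
```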

In [18]:
# Complete code below this comment 
# ----------------------------------
def mutual_info_session():
    selector = SelectKBest(mutual_info_classif, k=3)
    selector.fit(X_train, y_train)
    print(selector.get_support(True))
    model = GaussianNB()
    model.fit(selector.transform(X_train), y_train)
    return model.score(selector.transform(X_test), y_test)
    
mutual_info_session()

[0 1 5]


0.7247191011235955

## Forward selection

Complete the following code based on your understanding of forward selection.
For hints, refer to the same section of the feature selection lab.

In [19]:
class ForwardSelector(object):
    def __init__(self, estimator):
        self.estimator = estimator
        
    def fit(self, X, y, k): 
        selected = np.zeros(X.shape[1]).astype(bool)
        score = lambda X_features: clone(self.estimator).fit(X_features, y).score(X_features, y)
        selected_indices = lambda: list(np.flatnonzero(selected))

        # What is the exit condition for forward selection?
        # ----------------------------------
        while np.sum(selected) < k:
            rest_indices = list(np.flatnonzero(~selected))
            
            scores = list()
            for i in rest_indices:
                # Which columns are we currently using to score the model?
                # ----------------------------------
                feature_subset = selected_indices()+[i]
                s = score(X[:, feature_subset])
                scores.append(s)
           
            idx_to_add = rest_indices[np.argmax(scores)]
            selected[idx_to_add] = True

        self.selected = selected.copy()
        return self
        
    def transform(self, X):
        return X[:, self.selected]
    
    def get_support(self, indices=False):
        return np.flatnonzero(self.selected) if indices else self.selected
    

Now **write a test case** for your completed forward selection algorithm and see if it works well for you.
**Select 3 features.**

**Tip**: _You could copy-paste-edit what we just did in the mutual information section._

In [27]:
def forward_selection_session():
    # Add code below this comment 
    # ----------------------------------
    model = GaussianNB()
    # fit() scores clones of the estimator, so `model` itself stays unfitted
    selector = ForwardSelector(model)
    selector.fit(X_train, y_train, 3)
    print(selector.get_support(True))
    model.fit(selector.transform(X_train), y_train)
    return model.score(selector.transform(X_test), y_test)
    
forward_selection_session()

[1 3 4]


0.7415730337078652
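For comparison, recent versions of scikit-learn (0.24 and later) ship a built-in `SequentialFeatureSelector` that performs forward selection. Unlike the `ForwardSelector` class above, which scores each candidate subset on the same data it was fit on, the built-in selector scores candidates with cross-validation. The sketch below is only an illustration on hypothetical synthetic data from `make_classification`, not a replacement for the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical synthetic data: 7 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Greedily add one feature at a time, scoring candidates with 5-fold CV.
sfs = SequentialFeatureSelector(
    GaussianNB(), n_features_to_select=3, direction="forward", cv=5
)
sfs.fit(X_tr, y_tr)
print(sfs.get_support(indices=True))

# Fit a fresh model on the selected columns and score it on held-out data.
demo_model = GaussianNB().fit(sfs.transform(X_tr), y_tr)
print(demo_model.score(sfs.transform(X_te), y_te))
```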

## Recursive feature elimination

Finally, try recursive feature elimination (RFE) as provided by sklearn.

In [31]:
# Complete code below this comment 
# ----------------------------------
def rfe_session():
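    # Using a linear SVC because RFE needs an estimator that exposes coef_ or
    # feature_importances_, which GaussianNB does not
    from sklearn.svm import SVC
    model = SVC(kernel="linear")
    # Add code below this comment 
    # ----------------------------------
    # n_features_to_select must be passed by keyword in recent sklearn versions
    selector = RFE(model, n_features_to_select=3)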
    selector.fit(X_train,y_train)
    print(selector.get_support(True))
    model.fit(selector.transform(X_train),y_train)
    return model.score(selector.transform(X_test), y_test)

rfe_session()

[0 1 3]


0.7415730337078652
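RFE repeatedly fits the estimator, drops the feature with the smallest importance (for a linear model, the smallest coefficient magnitude), and refits on what remains. The fitted selector records the elimination order in `ranking_`, where selected features have rank 1. The sketch below shows this on hypothetical synthetic data, with `LogisticRegression` chosen only as an example of an estimator that exposes `coef_`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: 7 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=3, n_redundant=0, random_state=0
)

# Remove one feature per iteration until 3 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe.fit(X_demo, y_demo)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 for selected features; larger ranks were dropped earlier
```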

# Save your notebook!