## Feature Selection

In the following cells, we select the most predictive variables to build our machine learning model.

We will select variables using **Lasso regression**: Lasso has the property of setting the coefficients of non-informative variables to zero.

Working with fewer variables has two practical benefits:

1. For production: Fewer variables mean smaller client input requirements (e.g. customers filling out a form on a website or mobile app), and hence less code for error handling. This reduces the chances of introducing bugs.

2. For model performance: Fewer variables mean simpler, more interpretable models that tend to generalize better.
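
The zeroing behaviour of Lasso described above can be seen on a small synthetic example. This is only a sketch: the data, the variable names and the value `alpha=0.1` are assumptions made for illustration and are not part of the Titanic pipeline. Ordinary least squares gives every feature a small non-zero weight, whereas Lasso is expected to set the weights of the uninformative features exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (assumed for illustration): 5 features,
# only the first two actually influence the target
rng = np.random.default_rng(0)
n_samples = 200
X_toy = rng.normal(size=(n_samples, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=n_samples)

# Fit an unpenalised linear model and an L1-penalised one on the same data
ols = LinearRegression().fit(X_toy, y_toy)
lasso = Lasso(alpha=0.1).fit(X_toy, y_toy)

print('OLS coefficients:  ', np.round(ols.coef_, 3))
print('Lasso coefficients:', np.round(lasso.coef_, 3))

# Positions of the features whose Lasso coefficient is non-zero
print('Features kept by Lasso:', np.flatnonzero(lasso.coef_))
```

The surviving coefficients are also shrunk towards zero, which is why Lasso is used here only to choose variables rather than as the final model.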

In [1]:
# Import basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Feature selection packages
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Visualize all columns in the dataframe
pd.set_option('display.max_columns', None)

In [6]:
# Load the data obtained in the feature engineering notebook

X_train = pd.read_csv('xtrain.csv')
X_test = pd.read_csv('xtest.csv')
X_train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Age_na
0,0,1.0,1.0,0.748255,0.0,0.333333,0.029758,0.666667,1.0
1,0,0.5,0.0,0.801769,0.0,0.0,0.020495,0.0,0.0
2,0,0.5,0.0,0.801769,0.125,0.166667,0.072227,0.666667,0.0
3,0,1.0,0.0,0.710132,0.0,0.0,0.007832,0.666667,0.0
4,0,1.0,0.0,0.720334,0.0,0.0,0.014151,0.0,0.0


In [7]:
# Define the target variable for each split
y_train = X_train['Survived']
y_test = X_test['Survived']

# Drop the target from the feature sets
X_train = X_train.drop(['Survived'], axis=1)
X_test = X_test.drop(['Survived'], axis=1)

In [8]:
X_train.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Age_na
0,1.0,1.0,0.748255,0.0,0.333333,0.029758,0.666667,1.0
1,0.5,0.0,0.801769,0.0,0.0,0.020495,0.0,0.0
2,0.5,0.0,0.801769,0.125,0.166667,0.072227,0.666667,0.0
3,1.0,0.0,0.710132,0.0,0.0,0.007832,0.666667,0.0
4,1.0,0.0,0.720334,0.0,0.0,0.014151,0.0,0.0


### Feature Selection

Keep in mind that we are in the research environment of the ML pipeline. Therefore, when applying Lasso regression, it is important to set the seed so that the results are reproducible.
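
A note on what the seed controls: in scikit-learn, `random_state` is used by `Lasso` only when `selection='random'`. With the default `selection='cyclic'`, the coefficients are updated in a fixed order and the fit is deterministic. Setting the seed is therefore harmless here and protects reproducibility if the selection mode is changed later. The sketch below checks both cases on synthetic data; `X_demo` and `y_demo` are made-up names used only for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=100)

# Default selection='cyclic': the seed is not used, so different seeds agree
coef_seed0 = Lasso(alpha=0.005, random_state=0).fit(X_demo, y_demo).coef_
coef_seed1 = Lasso(alpha=0.005, random_state=1).fit(X_demo, y_demo).coef_
print('cyclic, different seeds agree:', np.allclose(coef_seed0, coef_seed1))

# selection='random': the update order is randomised, so fixing the seed
# is what makes two runs repeat exactly
coef_a = Lasso(alpha=0.005, selection='random', random_state=0).fit(X_demo, y_demo).coef_
coef_b = Lasso(alpha=0.005, selection='random', random_state=0).fit(X_demo, y_demo).coef_
print('random, same seed agrees:', np.allclose(coef_a, coef_b))
```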

In [9]:
# We specify a Lasso regression model with a penalty coefficient
# (alpha). The bigger the alpha, the fewer features will be
# selected.

# SelectFromModel from sklearn automatically selects the
# features whose coefficients are non-zero

selector = SelectFromModel(Lasso(alpha=0.005, random_state=0))

# Train the selector to choose features
selector.fit(X_train, y_train)

SelectFromModel(estimator=Lasso(alpha=0.005, random_state=0))
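
To illustrate the comment above about the penalty, the sketch below refits the selector for several values of `alpha` on synthetic data and counts the selected features. The data and the grid of alphas are assumptions chosen for illustration; on the real training set the counts will differ, and `alpha=0.005` remains the value used in this notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data (assumed for illustration): 8 features, of which only the
# first three carry signal, with decreasing strength
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
y_demo = (2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + 0.1 * X_demo[:, 2]
          + rng.normal(scale=0.5, size=300))

alphas = [0.001, 0.01, 0.1, 1.0]  # hypothetical grid, smallest to largest
n_selected = []
for alpha in alphas:
    sel = SelectFromModel(Lasso(alpha=alpha, random_state=0))
    sel.fit(X_demo, y_demo)
    # get_support() is a boolean mask; summing it counts the kept features
    n_selected.append(int(sel.get_support().sum()))
    print(f'alpha={alpha}: {n_selected[-1]} features selected')
```

In practice, alpha is usually tuned (for example by cross-validation) rather than fixed by hand.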

In [10]:
# Inspect which features were selected (True = kept)
selector.get_support()

array([ True,  True,  True,  True, False, False,  True,  True])

Lasso drops the *Parch* and *Fare* variables. Dropping *Fare* is welcome, as the transformation of that variable was quite tricky and its distribution remained far from Gaussian.

In [11]:
# Collect the names of the selected features
selected_feat = X_train.columns[selector.get_support()]

# Print the number of features at each stage
print(f'total features: {X_train.shape[1]}')
print(f'selected features: {len(selected_feat)}')
print('features with coefficients shrunk to zero: {}'.format(
    np.sum(selector.estimator_.coef_ == 0)))

total features: 8
selected features: 6
features with coefficients shrunk to zero: 2


In [12]:
# Print selected variables
selected_feat

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Embarked', 'Age_na'], dtype='object')

In [13]:
# Store the selected features
pd.Series(selected_feat).to_csv('selected_features.csv', index=False, header=False)
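
A later notebook will presumably need to read this file back. The sketch below shows one way to do that, assuming the same file layout as above (one feature name per line, no header). It writes to a temporary directory so that it does not touch the real `selected_features.csv`; `selected_demo` and `X_demo` are stand-ins for `selected_feat` and `X_train`.

```python
import os
import tempfile

import pandas as pd

# Stand-in for `selected_feat` (same names as printed above)
selected_demo = pd.Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Embarked', 'Age_na'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'selected_features.csv')
    pd.Series(selected_demo).to_csv(path, index=False, header=False)

    # header=None because the file has no header row; column 0 holds the names
    loaded = pd.read_csv(path, header=None)[0].tolist()

print(loaded)

# Stand-in for X_train with all 8 original columns
all_columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Age_na']
X_demo = pd.DataFrame(0.0, index=range(3), columns=all_columns)

# Keep only the selected columns, in the stored order
X_reduced = X_demo[loaded]
print(X_reduced.shape)
```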