# Titanic: Machine Learning from Disaster


## Feature Selection

The fewer inputs the model requires from the client, the less error-handling code we need and the fewer opportunities there are to introduce bugs.
Fewer variables also tend to give a simpler, more interpretable model that generalizes better.
In this notebook, we select features using **Lasso regression**, because its L1 penalty can shrink the coefficients of non-informative variables to exactly zero. This lets us identify those variables and remove them from our final model.
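To make the zeroing property concrete, here is a minimal sketch on synthetic data (not the Titanic data). The feature count, the coefficients `3.0` and `-2.0`, and `alpha=0.1` are assumptions chosen purely for illustration: only the first two columns influence the target, so we would expect Lasso to keep those and drop the rest.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)

# synthetic data: 200 rows, 5 features; only the first two drive the target
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha controls the strength of the L1 penalty (illustrative value)
lasso = Lasso(alpha=0.1, random_state=0)
lasso.fit(X, y)

# informative features keep non-zero (slightly shrunken) coefficients,
# while the pure-noise features are set exactly to zero
print(lasso.coef_)
print('coefficients set to zero: {}'.format(np.sum(lasso.coef_ == 0)))
```

Note that the surviving coefficients are also shrunk towards zero, which is why Lasso is used here only to pick variables and not as the final model.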

In [1]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt

# to build the models
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# to display all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [2]:
# load the train and test set with the engineered variables from the 
# last notebook

X_train = pd.read_csv('xtrain.csv')
X_test = pd.read_csv('xtest.csv')

X_train.head()

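| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_na |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 816 | 0 | 0.0 | 0.0 | 0.0 | 0.739785 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 1 | 878 | 0 | 1.0 | 0.0 | 0.0 | 0.639657 | 0.0 | 0.0 | 1.0 | 0.350202 | 1.0 | 1.0 | 0.0 |
| 2 | 194 | 1 | 0.5 | 0.0 | 0.0 | 0.225027 | 0.125 | 0.166667 | 1.0 | 0.528101 | 0.0 | 1.0 | 0.0 |
| 3 | 524 | 1 | 0.0 | 0.0 | 1.0 | 0.848572 | 0.0 | 0.166667 | 1.0 | 0.653299 | 0.0 | 0.666667 | 0.0 |
| 4 | 635 | 0 | 1.0 | 0.0 | 1.0 | 0.461086 | 0.375 | 0.333333 | 1.0 | 0.538998 | 1.0 | 1.0 | 0.0 |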


In [3]:
# select the target
y_train = X_train['Survived']
y_test = X_test['Survived']

# drop unnecessary variables from our training and testing sets
X_train.drop(['PassengerId', 'Survived', 'Name'], axis=1, inplace=True)
X_test.drop(['PassengerId', 'Survived', 'Name'], axis=1, inplace=True)

***Let's proceed to selecting the most predictive features from the list we have***

In [4]:
# we will fit the model and select the features in one step
# choose a suitable alpha: the larger the alpha, the fewer features are selected
# then we use the SelectFromModel object from sklearn, which
# automatically selects the features whose coefficients are non-zero

ft_sel = SelectFromModel(Lasso(alpha=0.005, random_state=0))

# train Lasso model and select features
ft_sel.fit(X_train, y_train)

SelectFromModel(estimator=Lasso(alpha=0.005, copy_X=True, fit_intercept=True,
                                max_iter=1000, normalize=False, positive=False,
                                precompute=False, random_state=0,
                                selection='cyclic', tol=0.0001,
                                warm_start=False),
                max_features=None, norm_order=1, prefit=False, threshold=None)
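The effect of alpha on the number of selected features can be checked with a small sweep. The sketch below uses synthetic data scaled to [0, 1] as a stand-in for the engineered variables; the `weights` vector and the list of alpha values are hypothetical and only serve to show the trend.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)

# synthetic stand-in: 10 features in [0, 1], like the scaled notebook data
X = rng.uniform(size=(300, 10))

# hypothetical target: features contribute with decreasing strength,
# and the last four features are pure noise
weights = np.array([1.0, 0.8, 0.6, 0.4, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0])
y = X @ weights + rng.normal(scale=0.05, size=300)

# count how many features survive for each penalty strength
n_selected = {}
for alpha in [0.0001, 0.001, 0.005, 0.02, 0.1]:
    sel = SelectFromModel(Lasso(alpha=alpha, random_state=0))
    sel.fit(X, y)
    n_selected[alpha] = int(sel.get_support().sum())
    print('alpha={}: {} features selected'.format(alpha, n_selected[alpha]))
```

In practice a reasonable alpha could be picked by cross-validation (for example with `LassoCV`) instead of by hand; the value 0.005 used in this notebook is the author's choice.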

In [5]:
# selected features marked with True

ft_sel.get_support()

array([ True,  True,  True,  True, False,  True, False,  True,  True,
        True])

In [6]:
# let's print the number of total and selected features

# make a list of the selected features
selected_feats = X_train.columns[ft_sel.get_support()]

print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feats)))
print('features with coefficients shrunk to zero: {}'.format(
    np.sum(ft_sel.estimator_.coef_ == 0)))

total features: 10
selected features: 8
features with coefficients shrunk to zero: 2


In [7]:
# print the selected features
selected_feats

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Ticket', 'Cabin', 'Embarked',
       'Age_na'],
      dtype='object')

In [8]:
# save the names of the selected features to a csv file
pd.Series(selected_feats).to_csv('selected_features.csv', index=False)

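Once the feature names are saved, a later step can reload them and reduce the train and test sets to the same columns. The following is a minimal sketch of that round trip under stated assumptions: the column names `f0` to `f4`, the synthetic target, `alpha=0.02`, the temporary file location, and the `feature` column header are all hypothetical. Passing `header=True` explicitly avoids depending on the default of `Series.to_csv`, which has changed between pandas versions.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)

# hypothetical stand-in for the engineered train and test sets
cols = ['f0', 'f1', 'f2', 'f3', 'f4']
X_train = pd.DataFrame(rng.uniform(size=(200, 5)), columns=cols)
X_test = pd.DataFrame(rng.uniform(size=(50, 5)), columns=cols)

# synthetic target that depends only on f0 and f3
y_train = (2.0 * X_train['f0'] - 1.5 * X_train['f3']
           + rng.normal(scale=0.05, size=200))

# fit Lasso and collect the names of the features it keeps
sel = SelectFromModel(Lasso(alpha=0.02, random_state=0))
sel.fit(X_train, y_train)
selected_feats = X_train.columns[sel.get_support()]

# save the names with an explicit header, then reload them
path = os.path.join(tempfile.mkdtemp(), 'selected_features.csv')
pd.Series(selected_feats, name='feature').to_csv(
    path, index=False, header=True)
reloaded = pd.read_csv(path)['feature'].tolist()

# keep the same columns, in the same order, in both sets
X_train_sel = X_train[reloaded]
X_test_sel = X_test[reloaded]

print(reloaded)
print(X_train_sel.shape, X_test_sel.shape)
```

Selecting by column name, instead of calling `sel.transform`, keeps the result a DataFrame and makes the column order explicit for any model trained afterwards.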
