## Machine Learning Pipeline - Feature Selection
In the following notebooks, we will go through the implementation of each step in the Machine Learning Pipeline:

1. Data Analysis
2. Feature Engineering
3. **Feature Selection**
4. Model Training
5. Obtaining Predictions/Scoring

### Rossmann Store Sales Prediction
The aim of the project is to build an end-to-end machine learning model that predicts the sales of a given store from a set of inputs, including promotions, competition, school and state holidays, seasonality, and locality.

In [1]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt

# to build the models
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# to display all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [2]:
# load the train and test set with the engineered variables
X_train = pd.read_csv('xtrain.csv')
X_test = pd.read_csv('xtest.csv')

X_train.head()

|   | DayOfWeek | Customers | Open | Promo | StateHoliday | SchoolHoliday | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceYear | PromoInterval | CompetitionDistance_na | CompetitionOpenSinceMonth_na | CompetitionOpenSinceYear_na | Promo2SinceYear_na |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.833333 | 0.587547 | 0.0 | 0.0 | 0.5 | 0.0 | 0.333333 | 0.0 | 0.354962 | 0.727273 | 0.0 | 1.0 | 0.333333 | 0.333333 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1 | 0.5 | 0.59932 | 0.0 | 0.0 | 0.5 | 0.0 | 0.333333 | 1.0 | 0.245362 | 0.727273 | 0.0 | 1.0 | 0.333333 | 0.666667 | 0.0 | 1.0 | 1.0 | 0.0 |
| 2 | 0.5 | 0.653409 | 0.0 | 1.0 | 0.5 | 0.0 | 0.333333 | 0.0 | 0.3869 | 0.363636 | 0.052174 | 1.0 | 0.0 | 0.666667 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.777498 | 0.0 | 1.0 | 0.5 | 0.0 | 0.0 | 1.0 | 0.782834 | 0.727273 | 0.0 | 1.0 | 0.666667 | 0.666667 | 0.0 | 1.0 | 1.0 | 0.0 |
| 4 | 0.833333 | 0.586215 | 0.0 | 0.0 | 0.5 | 0.0 | 0.333333 | 0.0 | 0.474427 | 0.727273 | 0.0 | 0.0 | 0.333333 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |


In [3]:
# load the target (remember that the target is log transformed)
y_train = pd.read_csv('ytrain.csv')
y_test = pd.read_csv('ytest.csv')

y_train.head()

|   | Sales |
|---|---|
| 0 | 8.37563 |
| 1 | 8.268732 |
| 2 | 8.69249 |
| 3 | 10.093364 |
| 4 | 8.385032 |


### Feature Selection

In [4]:
# First, we specify the Lasso regression model and choose a
# suitable alpha (the strength of the L1 penalty).
# The larger the alpha, the fewer features will be selected.

# Then we use the SelectFromModel object from sklearn, which
# automatically selects the features whose coefficients are non-zero.

# remember to set the seed (random_state) in the Lasso estimator
sel_ = SelectFromModel(Lasso(alpha=0.001, random_state=0))

# train Lasso model and select features
sel_.fit(X_train, y_train)

SelectFromModel(estimator=Lasso(alpha=0.001, random_state=0))
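
The effect of alpha described in the comments above can be checked on a small example. The sketch below is not part of the original pipeline: it uses a synthetic dataset from `make_regression` as a stand-in for the store data, and the alpha values are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in for the store data (an assumption for illustration):
# 18 features, of which only 5 actually drive the target.
X, y = make_regression(n_samples=500, n_features=18, n_informative=5,
                       noise=10.0, random_state=0)

# Arbitrary alphas spanning several orders of magnitude.
alphas = [0.001, 1.0, 30.0, 1000.0]
n_selected = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10_000, random_state=0).fit(X, y)
    # A feature is kept when its coefficient is not exactly zero.
    n_selected.append(int(np.sum(model.coef_ != 0)))

for alpha, n in zip(alphas, n_selected):
    print(f"alpha={alpha:>8}: {n} non-zero coefficients")
```

On real data the appropriate alpha depends on the scale of the features and the target, so it is usually worth comparing a few values like this before fixing one.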

In [5]:
sel_.get_support().sum()

11

In [6]:
# visualise those features that were selected.
# (selected features marked with True)
sel_.get_support()

array([ True,  True, False,  True, False,  True, False,  True,  True,
       False, False,  True,  True,  True, False,  True, False,  True])

In [7]:
# make a list of the selected features
selected_feats = X_train.columns[sel_.get_support()]

# print the total number of features, the number selected,
# and the number whose coefficients were shrunk to zero
print('total features: {}'.format(X_train.shape[1]))
print('selected features: {}'.format(len(selected_feats)))
print('features with coefficients shrunk to zero: {}'.format(
    np.sum(sel_.estimator_.coef_ == 0)))

total features: 18
selected features: 11
features with coefficients shrunk to zero: 7


In [8]:
# print the selected features
selected_feats

Index(['DayOfWeek', 'Customers', 'Promo', 'SchoolHoliday', 'Assortment',
       'CompetitionDistance', 'Promo2', 'Promo2SinceYear', 'PromoInterval',
       'CompetitionOpenSinceMonth_na', 'Promo2SinceYear_na'],
      dtype='object')
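
Selecting columns by name, as done here, should give the same data as calling `transform` on the fitted selector. The following sketch illustrates this on synthetic data; the dataset, the column names `feat_0` ... `feat_7`, and the alpha value are all assumptions made for the example and do not come from the store data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Hypothetical data: 8 features, 3 of them informative.
X_arr, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                           noise=5.0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

selector = SelectFromModel(Lasso(alpha=1.0, random_state=0)).fit(X, y)

# Boolean mask over the columns, then the matching column names.
mask = selector.get_support()
selected = X.columns[mask]

# Two ways to obtain the reduced dataset.
by_name = X[selected].to_numpy()
by_transform = np.asarray(selector.transform(X))
same = np.allclose(by_name, by_transform)

# Coefficients of the retained features, for inspection.
coefs = pd.Series(selector.estimator_.coef_, index=X.columns)[mask]
print(coefs)
print("same data either way:", same)
```

Keeping the column names rather than the transformed array makes it easier to reuse the selection on other dataframes later.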

In [9]:
# save the list of selected features to a CSV file
pd.Series(selected_feats).to_csv('selected_features.csv', index=False)
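
A saved feature list like this one can later be read back and used to subset a dataframe. The sketch below shows one possible way to do that; the tiny dataframe `X_new`, its values, and the temporary file location are hypothetical and exist only to keep the example self-contained. `header=True` is passed explicitly so the round trip does not depend on the pandas version's default.

```python
import os
import tempfile

import pandas as pd

# Hypothetical subset of the selected features and a toy dataframe.
selected_feats = pd.Index(['DayOfWeek', 'Customers', 'Promo'])
X_new = pd.DataFrame({
    'DayOfWeek': [0.5, 0.0],
    'Customers': [0.6, 0.7],
    'Open': [0.0, 0.0],
    'Promo': [1.0, 0.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'selected_features.csv')

    # Write the feature names as a single column with a header row.
    pd.Series(selected_feats).to_csv(path, index=False, header=True)

    # Read the single column back as a plain list of names.
    loaded_feats = pd.read_csv(path).iloc[:, 0].tolist()

# Keep only the selected columns, in the saved order.
X_new_selected = X_new[loaded_feats]
print(X_new_selected.columns.tolist())
```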