1. Data Analysis
2. Feature Engineering
3. Feature Selection
4. Model Building

### Introduction

Data Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

This dataset contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa.
The goal is to predict the house sale price. 

In this notebook, we perform feature selection to identify the most informative features for training the model. This helps reduce redundancy and noise in the data and can improve model performance.

We will select variables using Lasso regression: the L1 penalty in Lasso shrinks the coefficients of non-informative variables to exactly zero. This way we can identify those variables and remove them from our final models.
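
To see why Lasso produces exact zeros rather than merely small values, it can help to look at a simplified case. The sketch below assumes features that are uncorrelated and scaled to unit variance; under that assumption each Lasso coefficient is the least-squares coefficient moved towards zero by alpha and clipped at zero (soft-thresholding). The function name `soft_threshold` and the coefficient values are illustrative and not part of this notebook.

```python
def soft_threshold(ols_coef, alpha):
    """Shrink a least-squares coefficient towards zero by alpha.

    Coefficients whose magnitude does not exceed alpha become exactly zero.
    """
    if ols_coef > alpha:
        return ols_coef - alpha
    if ols_coef < -alpha:
        return ols_coef + alpha
    return 0.0


# Hypothetical least-squares coefficients for four features.
ols_coefs = {"feat_a": 2.5, "feat_b": -0.8, "feat_c": 0.003, "feat_d": -0.001}

alpha = 0.005
lasso_coefs = {name: soft_threshold(c, alpha) for name, c in ols_coefs.items()}

# Features with a non-zero coefficient survive the selection.
kept = [name for name, c in lasso_coefs.items() if c != 0]
print(lasso_coefs)
print("kept:", kept)
```

With correlated features the picture is less tidy, but the same mechanism is at work: weak coefficients are set to exactly zero, which is what makes Lasso usable as a selector.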

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# for the model and feature selection
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# to display all the columns of the dataframe
pd.set_option('display.max_columns', None)

In [None]:
# load dataset
# We load the datasets with the engineered values: we built and saved these datasets in the previous lecture.
# If you haven't done so, check the previous lecture / notebook to find out how to create these datasets

X_train = pd.read_csv('xtrain.csv')
X_test = pd.read_csv('xtest.csv')

X_train.head()

In [None]:
# capture the target
y_train = X_train['SalePrice']
y_test = X_test['SalePrice']

# drop unnecessary variables from our training and testing sets
X_train.drop(['Id', 'SalePrice'], axis=1, inplace=True)
X_test.drop(['Id', 'SalePrice'], axis=1, inplace=True)

In [None]:
# here I do the model fitting and feature selection
# together in a single step

# first, I specify the Lasso regression model and select
# a suitable alpha (the strength of the penalty).
# The bigger the alpha, the fewer features will be selected.

# Then I use the SelectFromModel object from sklearn, which
# selects the features whose coefficients are non-zero

# remember to set the seed (random_state) so the results are reproducible
sel_ = SelectFromModel(Lasso(alpha=0.005, random_state=0))
sel_.fit(X_train, y_train)
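
The effect of alpha on the number of selected features can be checked on toy data. The following is a minimal sketch on a synthetic regression problem generated with `make_regression`; the data, the alpha grid and the `demo_` variable names are assumptions made for illustration and are unrelated to the house-price data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only 5 of which drive the target.
X_demo, y_demo = make_regression(n_samples=200, n_features=20,
                                 n_informative=5, noise=10.0,
                                 random_state=0)

alphas = [0.01, 1.0, 10.0, 1000.0]
n_selected = []
for alpha in alphas:
    demo_sel = SelectFromModel(Lasso(alpha=alpha, random_state=0))
    demo_sel.fit(X_demo, y_demo)
    # get_support() is a boolean mask; its sum is the number of kept features
    n_selected.append(int(demo_sel.get_support().sum()))

for alpha, n in zip(alphas, n_selected):
    print('alpha={:<8} selected features: {}'.format(alpha, n))
```

A suitable alpha depends on the scale of the features and of the target, so values that work on this toy problem say nothing about the right value for another dataset.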

In [None]:
# this command lets us visualise the features that were kept.
# Kept features have a True indicator
sel_.get_support()

In [None]:
# let's print the number of total and selected features

# this is how we can make a list of the selected features
selected_feat = X_train.columns[(sel_.get_support())]

# let's print some stats
print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
print('features with coefficients shrunk to zero: {}'.format(
    np.sum(sel_.estimator_.coef_ == 0)))

In [None]:
# print the selected features
selected_feat
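
Once a list of selected features exists, a typical next step is to reduce both the training and the test set to those columns. The sketch below shows one way this might look, using synthetic data and `demo_`-prefixed names (all assumptions) so that it does not overwrite the notebook's own variables.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic stand-in for an engineered train / test split.
X_all, y_all = make_regression(n_samples=120, n_features=8, n_informative=3,
                               noise=5.0, random_state=0)
cols = ['f{}'.format(i) for i in range(X_all.shape[1])]
demo_train = pd.DataFrame(X_all[:90], columns=cols)
demo_test = pd.DataFrame(X_all[90:], columns=cols)
demo_y_train = y_all[:90]

# Fit the selector on the training set only.
demo_selector = SelectFromModel(Lasso(alpha=1.0, random_state=0))
demo_selector.fit(demo_train, demo_y_train)

# Index both sets with the same column list so they stay aligned.
demo_selected = demo_train.columns[demo_selector.get_support()]
demo_train_sel = demo_train[demo_selected]
demo_test_sel = demo_test[demo_selected]

print(demo_train_sel.shape, demo_test_sel.shape)
```

Fitting on the training set alone and reusing the resulting column list for the test set keeps information from the test set out of the selection.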

In [None]:
# this is an alternative way of identifying the selected features,
# based on the non-zero Lasso coefficients:
selected_feats = X_train.columns[sel_.estimator_.coef_ != 0]
selected_feats
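
The two routes usually return the same features, but they are not defined identically. As far as the scikit-learn documentation describes it, `SelectFromModel` with no explicit `threshold` uses a cut-off of 1e-5 on the absolute coefficient for L1-penalised estimators such as Lasso, whereas the comparison `coef_ != 0` keeps any non-zero value. The sketch below mimics both rules on a hand-made coefficient array; the feature names and values are hypothetical.

```python
import numpy as np

# Hypothetical fitted coefficients, one per feature.
feature_names = np.array(['f0', 'f1', 'f2', 'f3'])
coefs = np.array([0.42, 0.0, 3e-7, -0.08])

# Route 1: mimic SelectFromModel's default threshold for an L1 model.
threshold = 1e-5
by_threshold = feature_names[np.abs(coefs) >= threshold]

# Route 2: keep every strictly non-zero coefficient.
by_nonzero = feature_names[coefs != 0]

# Features kept by route 2 but dropped by route 1.
only_nonzero = sorted(set(by_nonzero) - set(by_threshold))
print('threshold route:', list(by_threshold))
print('non-zero route: ', list(by_nonzero))
print('differ on:      ', only_nonzero)
```

If the two lists differ on real data, the extra features carry coefficients so small that dropping them is unlikely to matter, but it is worth knowing which list was saved.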

In [None]:
# now we save the selected list of features
pd.Series(selected_feats).to_csv('selected_features.csv', index=False)
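
A later notebook will need to read this list back. Whether `Series.to_csv` writes a header line by default has changed across pandas versions, so one option is to name the column and pass `header=True` explicitly. The following is a sketch of that round trip using a temporary directory; the file name, the column name `feature` and the three feature names are illustrative assumptions.

```python
import os
import tempfile

import pandas as pd

demo_feats = ['LotArea', 'OverallQual', 'GrLivArea']  # illustrative names

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'selected_features_demo.csv')

    # Name the column and write the header explicitly so the
    # file layout does not depend on the pandas version.
    pd.Series(demo_feats, name='feature').to_csv(path, index=False, header=True)

    # Read the file back and recover a plain Python list.
    reloaded = pd.read_csv(path)['feature'].tolist()

print(reloaded)
```

The reloaded list can then be used to subset a dataframe, for example `X_train[reloaded]`, before model building.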