# Stage 7: Model selection

In this stage, we evaluate several candidate models with default parameters, compare their cross-validation scores, and select the best of them for fine-tuning.

In [1]:
%pylab inline

import pandas as pd
import numpy as np

Populating the interactive namespace from numpy and matplotlib


# Preparing the data

In [2]:
dataset = pd.read_csv('data/house_prices.csv').set_index('Id')
target_y = dataset['SalePrice']

In [3]:
#
# We've saved base data processing/cleaning and splitting routines from Stage #3 to the common_data.py module
#
from common_data import data_train_test_split, data_process_and_clean_advanced_features

### Cleaning the data using pre-saved data pipeline

In [4]:
processed_dataset = data_process_and_clean_advanced_features(dataset, target_y)

Preparing the dataset with feature engineering


In [5]:
processed_dataset.head()

| Id | KitchenAbvGr | HalfBath | Fireplaces | BsmtUnfSF | EnclosedPorch | GrLivArea | BsmtFinSF1 | BsmtHalfBath | 2ndFlrSF | GarageArea | ... | !CentralAir | !Electrical | !KitchenQual | !FireplaceQu | !GarageQual | !GarageCond | !PavedDrive | !HouseAge | !LandQuality | !ExretiorScore |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.693147 | 0.693147 | 0.0 | 5.01728 | 0.0 | 7.444833 | 6.561031 | 0.0 | 6.751101 | 6.308098 | ... | 1 | 1 | 0.75 | 0.0 | 0.5 | 0.5 | 0.75 | 5 | 4.2 | 0.929985 |
| 2 | 0.693147 | 0.0 | 0.693147 | 5.652489 | 0.0 | 7.141245 | 6.886532 | 0.693147 | 0.0 | 6.133398 | ... | 1 | 1 | 0.5 | 0.5 | 0.5 | 0.5 | 0.75 | 31 | 4.4 | 0.519958 |
| 3 | 0.693147 | 0.693147 | 0.693147 | 6.075346 | 0.0 | 7.488294 | 6.188264 | 0.0 | 6.765039 | 6.411818 | ... | 1 | 1 | 0.75 | 0.5 | 0.5 | 0.5 | 0.75 | 7 | 3.95 | 0.929985 |
| 4 | 0.693147 | 0.0 | 0.693147 | 6.293419 | 5.609472 | 7.448916 | 5.379897 | 0.0 | 6.629363 | 6.466145 | ... | 1 | 1 | 0.75 | 0.75 | 0.5 | 0.5 | 0.75 | 91 | 4.05 | 0.538824 |
| 5 | 0.693147 | 0.693147 | 0.693147 | 6.196444 | 0.0 | 7.695758 | 6.486161 | 0.0 | 6.960348 | 6.729824 | ... | 1 | 1 | 0.75 | 0.5 | 0.5 | 0.5 | 0.75 | 8 | 4.15 | 0.929985 |


### Do train/test splitting

In [6]:
X_train, X_test, y_train, y_test = data_train_test_split(processed_dataset, target_y)

print(f"Train dataset length: {len(X_train):<5} ({len(X_train)/len(processed_dataset)*100:<2}%)")
print(f"Test  dataset length: {len(X_test):<5} ({len(X_test)/len(processed_dataset)*100:<2}%)")

Train dataset length: 1095  (75.0%)
Test  dataset length: 365   (25.0%)
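
The 75%/25% proportions printed above are what scikit-learn's `train_test_split` yields with `test_size=0.25`. Whether `data_train_test_split` actually wraps that function is an assumption, since `common_data.py` is not shown in this stage. A minimal sketch on synthetic data with the same row count (`X_demo`, `y_demo`, and `random_state=42` are illustrative choices, not taken from the project):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows (1460) as the processed dataset.
X_demo = pd.DataFrame({"feature": np.arange(1460, dtype=float)})
y_demo = pd.Series(np.arange(1460, dtype=float), name="target")

# Assumption: a 25% hold-out with a fixed seed for reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(len(X_tr), len(X_te))
```

Fixing `random_state` keeps the split identical across notebook runs, which matters when model scores are compared between stages.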


# Estimating the performance of several generic models on the dataset

In [8]:
from common_models import model_base, calc_cross_val_scores

## The base model (RandomForest)

In [11]:
calc_cross_val_scores(model_base(), X_train, y_train)

[ 0.18179293  0.14199651  0.13649459  0.14021901  0.17790797  0.12321717
  0.16219295  0.16327759  0.13761238  0.17653879]
Mean: 0.15412498869804825
StDev: 0.019651086743725222


0.15412498869804825
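
The output above shows ten per-fold scores, their mean and standard deviation, and the mean as the return value. The helper itself lives in `common_models.py` and is not reproduced here, so the following is only a sketch of how such a function could look. The RMSE metric, the `cv=10` default, and the name `calc_cross_val_scores_sketch` are assumptions, and the data is synthetic:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def calc_cross_val_scores_sketch(model, X, y, cv=10):
    """Hypothetical stand-in for common_models.calc_cross_val_scores."""
    # scikit-learn scorers are maximized, so the MSE is returned negated.
    neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=cv)
    rmse = np.sqrt(-neg_mse)  # one RMSE per fold, lower is better
    print(rmse)
    print(f"Mean: {rmse.mean()}")
    print(f"StDev: {rmse.std()}")
    return rmse.mean()


# Synthetic data only, so the sketch runs without the house-price files.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
mean_rmse = calc_cross_val_scores_sketch(Ridge(), X_demo, y_demo)
```

Returning the mean lets the caller collect one number per model for a later comparison, while the printed per-fold values show how stable that number is.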

## Ridge regression model

In [12]:
from sklearn.linear_model import Ridge

model_ridge = Ridge()

calc_cross_val_scores(model_ridge, X_train, y_train)

[ 0.17709318  0.14334069  0.1246967   0.1380504   0.14722836  0.12656402
  0.14324442  0.15597635  0.11926467  0.18440499]
Mean: 0.1459863776315146
StDev: 0.020413209245545506


0.14598637763151459

## Lasso regression model

In [13]:
from sklearn.linear_model import Lasso

model_lasso = Lasso()

calc_cross_val_scores(model_lasso, X_train, y_train)

[ 0.37040863  0.31222252  0.29459575  0.33147569  0.35237585  0.2823137
  0.34327353  0.30689178  0.29839396  0.33746527]
Mean: 0.3229416680604874
StDev: 0.026918107900728655


0.32294166806048741
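
`Lasso()` uses `alpha=1.0` by default. When the target and features are on a small numeric scale, a penalty that strong can shrink every coefficient to zero, which would be one possible explanation for the weak score above (this has not been verified on the house-price data). A minimal sketch on synthetic data, where `true_coef` and the alternative `alpha=0.001` are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic small-scale regression problem (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([0.3, -0.2, 0.1, 0.0, 0.0])
y_demo = X_demo @ true_coef + 0.05 * rng.normal(size=200)

# Count how many coefficients survive the L1 penalty for each alpha.
n_nonzero = {}
for alpha in (1.0, 0.001):
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    n_nonzero[alpha] = int(np.count_nonzero(model.coef_))
    print(f"alpha={alpha}: {n_nonzero[alpha]} non-zero coefficients")
```

If the same pattern holds for the real features, tuning `alpha` during fine-tuning could change Lasso's ranking considerably.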

## Stochastic Gradient Descent (SGD)

In [19]:
from sklearn.linear_model import SGDRegressor

model_sgd_regressor = SGDRegressor()

calc_cross_val_scores(model_sgd_regressor, X_train, y_train)

[  1.11766246e+12   1.52053516e+12   1.23507608e+12   2.16631464e+12
   1.64606878e+12   4.02017727e+12   5.24217792e+11   2.91767432e+12
   4.04727703e+11   1.76191015e+12]
Mean: 1731436436131.9321
StDev: 1035757434930.0978




1731436436131.9321
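
Scores on the order of 1e12 suggest the optimizer diverged rather than that the model is merely a poor fit. The scikit-learn documentation notes that SGD is sensitive to feature scaling, so a common remedy is to standardize the features inside a pipeline. The sketch below uses synthetic data with deliberately mismatched scales; the RMSE scoring and all `*_demo` names are assumptions, and the effect on the house-price features has not been measured here:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with deliberately mismatched feature scales (illustration only).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0, 1.0])

# The scaler is refit inside each fold, so test-fold statistics never leak into training.
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
neg_mse = cross_val_score(
    scaled_sgd, X_demo, y_demo, scoring="neg_mean_squared_error", cv=10
)
rmse_sgd = np.sqrt(-neg_mse)
print(f"Mean RMSE: {rmse_sgd.mean():.3f}")
```

Wrapping the scaler and the regressor in one pipeline means the same object can be passed to `calc_cross_val_scores` in place of the bare `SGDRegressor()`.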

## Simple linear regression

In [16]:
from sklearn.linear_model import LinearRegression

model_lin_reg = LinearRegression()
calc_cross_val_scores(model_lin_reg, X_train, y_train)

[ 0.17779686  0.14320863  0.1226069   0.13939943  0.15139162  0.13565912
  0.14321876  0.15422052  0.12368225  0.19589744]
Mean: 0.1487081538181148
StDev: 0.02176108425360529


0.14870815381811481

# Conclusion

We tried several models with default parameters to find the top-3 candidates for fine-tuning. As the scores show, simple linear regression performs slightly better than the RandomForest base model (by about 0.005), although that gap is smaller than the standard deviation across folds (about 0.02). This suggests, but does not prove, that feature engineering has added value to the dataset.

Cross-validation score summary (mean over 10 folds, lower is better):
- Ridge regression model - 0.146
- Simple linear regression - 0.149
- Random Forest (base model) - 0.154

Didn't pass:
- Lasso regression model - 0.323
- Stochastic Gradient Descent - 1731436436131.9321 (about 1.7e12, so the fit most likely diverged)
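
The ranking can be reproduced from the mean scores printed in the cells above. In this small sketch the scores are copied from those outputs, while the dictionary and variable names are illustrative:

```python
# Mean cross-validation scores copied from the cell outputs above (lower is better).
cv_scores = {
    "Random Forest (base model)": 0.15412498869804825,
    "Ridge regression": 0.1459863776315146,
    "Lasso regression": 0.3229416680604874,
    "Stochastic Gradient Descent": 1731436436131.9321,
    "Simple linear regression": 0.1487081538181148,
}

# Sort ascending by score and keep the three best candidates for fine-tuning.
ranked = sorted(cv_scores.items(), key=lambda item: item[1])
top3 = [name for name, _ in ranked[:3]]

for name, score in ranked:
    print(f"{name:<30} {score:.3f}")
```

Collecting the return values of `calc_cross_val_scores` into such a dictionary while the notebook runs would avoid copying numbers by hand.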

