## Selecting a model

We are facing a regression problem. Since the relationship between features and target is non-linear and interactions between features are to be expected, a decision-tree based approach seems appropriate.

In [1]:
# Load the "autoreload" extension
%load_ext autoreload

# always reload modules marked with "%aimport"
%autoreload 1

import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# import my method from the source code
%aimport features.build_features
%aimport visualization.visualize
from features.build_features import read_raw_data
import numpy as np
import pandas as pd

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit
from sklearn.ensemble import GradientBoostingRegressor
from math import sqrt

df = read_raw_data("../data/processed/eval.csv")
X = df.drop(['SalePrice','Id'], axis=1)
y = df['SalePrice'] 

gbr = GradientBoostingRegressor()
shuffle_split = ShuffleSplit(test_size=.5, train_size=.5, n_splits=10)
scores = cross_val_score(gbr, X, y, cv=shuffle_split, scoring='neg_mean_squared_log_error')
scores = [sqrt(abs(s)) for s in scores]

print(scores)




[0.15139392197163504, 0.1360043354727259, 0.1499187014755776, 0.1337637893727601, 0.14944426780692585, 0.15091743943287841, 0.14467944906123012, 0.13596761845723826, 0.1323531749571777, 0.14320722044664486]


With default parameters, the gradient boosting regressor already achieves reasonable results in cross-validation: the RMSLE stays between roughly 0.13 and 0.15 across the ten splits. Let's see what happens if we combine it with the feature reduction explored in the previous notebook.
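The scores printed above are obtained by taking the square root of the absolute value of `neg_mean_squared_log_error`, i.e. they are root mean squared logarithmic errors (RMSLE). As a minimal sketch, the metric can be written out from its definition and checked against scikit-learn; the `rmsle` helper and the three made-up prices are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, written out from its definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Made-up prices, for illustration only.
y_true = [200000.0, 150000.0, 90000.0]
y_pred = [210000.0, 140000.0, 100000.0]

manual = rmsle(y_true, y_pred)
via_sklearn = float(np.sqrt(mean_squared_log_error(y_true, y_pred)))
print(manual, via_sklearn)

# The ten split scores printed above, rounded to four decimals,
# summarised by their mean and standard deviation.
split_scores = np.array([0.1514, 0.1360, 0.1499, 0.1338, 0.1494,
                         0.1509, 0.1447, 0.1360, 0.1324, 0.1432])
print("mean={:.4f} std={:.4f}".format(split_scores.mean(), split_scores.std()))
```

Reporting the mean and spread of the split scores makes it easier to compare the runs that follow than reading ten raw numbers.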

In [18]:
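from sklearn.feature_selection import SelectKBest
# f_regression is the score function actually passed to SelectKBest below;
# mutual_info_regression is only named in the printed label.
from sklearn.feature_selection import f_regression, mutual_info_regression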

df = read_raw_data("../data/processed/eval.csv")
y = df['SalePrice']
X = df.drop(['SalePrice', 'Id'], axis=1)

selector = SelectKBest(f_regression, k=10)
X_new = selector.fit_transform(X, y)
mask_sk = selector.get_support()
print("Features retained with SelectKBest - mutual_info_regression :\n{}".format(X.columns[mask_sk]))

gbr = GradientBoostingRegressor()
shuffle_split = ShuffleSplit(test_size=.5, train_size=.5, n_splits=10)
scores = cross_val_score(gbr, X_new, y, cv=shuffle_split, scoring='neg_mean_squared_log_error')
scores = [sqrt(abs(s)) for s in scores]
print(scores)

Features retained with SelectKBest - mutual_info_regression :
Index(['OverallQual', 'ExterQual', 'BsmtQual', '1stFlrSF', 'GrLivArea',
       'FullBath', 'KitchenQual', 'TotRmsAbvGrd', 'GarageCars', 'GarageArea'],
      dtype='object')
[0.16826150189704298, 0.17361925618148322, 0.17124582598039265, 0.1783035727114947, 0.16417803746239823, 0.17253943610649441, 0.161793649683385, 0.15705002138473306, 0.16386381609246964, 0.16463183385725166]


The quality of the prediction visibly deteriorates: the RMSLE rises to roughly 0.16 to 0.18.
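One caveat with the cell above is that the selector is fitted on all rows before cross-validation, so the held-out halves have already influenced which features are kept. A possible way around this is to put the selector and the regressor into one `Pipeline`, so that selection is re-fitted on each training split. The sketch below assumes synthetic data from `make_regression` as a stand-in for the housing data, and uses mean squared error because the synthetic target can be negative; the names `X_demo`, `y_demo`, `pipe_demo` and the reduced `n_estimators` are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 200 rows, 30 features, 8 informative.
X_demo, y_demo = make_regression(n_samples=200, n_features=30,
                                 n_informative=8, noise=10.0, random_state=0)

# Selection and regression are fitted together on each training split only.
pipe_demo = Pipeline([
    ("select", SelectKBest(f_regression, k=10)),
    ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=0)),
])

cv = ShuffleSplit(n_splits=5, test_size=0.5, train_size=0.5, random_state=0)
neg_mse = cross_val_score(pipe_demo, X_demo, y_demo, cv=cv,
                          scoring="neg_mean_squared_error")
rmse = (-neg_mse) ** 0.5
print(rmse)
```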

In [19]:
from sklearn.feature_selection import RFE

rfegbr = GradientBoostingRegressor()
rfe = RFE(estimator=rfegbr, n_features_to_select=10, step=1)
X_new = rfe.fit_transform(X, y)
mask_rfe = rfe.get_support()
print("Features retained with Recursive Feature Elimination - Gradient boosting regressor :\n{}".format(X.columns[mask_rfe]))

gbr = GradientBoostingRegressor()
shuffle_split = ShuffleSplit(test_size=.5, train_size=.5, n_splits=10)
scores = cross_val_score(gbr, X_new, y, cv=shuffle_split, scoring='neg_mean_squared_log_error')
scores = [sqrt(abs(s)) for s in scores]
print(scores)

Features retained with Recursive Feature Elimination - Gradient boosting regressor :
Index(['LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'BsmtUnfSF',
       '1stFlrSF', 'GrLivArea', 'GarageYrBlt', 'GarageArea', 'BsmtFinSF'],
      dtype='object')
[0.15485111805775933, 0.15954230140236256, 0.15488490788950504, 0.1505861303581483, 0.1460030156791995, 0.13328329002850225, 0.13374513233538196, 0.15487189015202468, 0.12622506787525165, 0.14268790551625196]


With RFE the quality deteriorates as well, though less than with SelectKBest. It would be worthwhile to choose the number of features with a cross-validated grid search. The same holds for the parameters of the gradient boosting regressor.
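A rough sketch of such a search is given here, treating the number of retained features as one more hyperparameter next to the regressor's own. It assumes synthetic data and uses `SelectKBest` because it is cheap to fit; with RFE the same pattern should work by tuning `rfe__n_features_to_select` instead. The grid values and the names `X_demo`, `y_demo` and `search` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 150 rows, 20 features, 5 informative.
X_demo, y_demo = make_regression(n_samples=150, n_features=20,
                                 n_informative=5, noise=5.0, random_state=0)

pipe_demo = Pipeline([
    ("select", SelectKBest(f_regression)),
    ("gbr", GradientBoostingRegressor(n_estimators=30, random_state=0)),
])

# Step names prefix the parameter names: "<step>__<parameter>".
param_grid_demo = {
    "select__k": [5, 10, 20],
    "gbr__max_depth": [2, 3],
}

search = GridSearchCV(pipe_demo, param_grid=param_grid_demo, cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)
print(search.best_params_)
```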

A possible improvement for the feature selection might be to select features with cross-validation.

In [21]:
from sklearn.feature_selection import RFECV

rfegbr = GradientBoostingRegressor()
rfecv = RFECV(estimator=rfegbr, step=1)
X_new = rfecv.fit_transform(X, y)
mask_rfecv = rfecv.get_support()
print("Features retained with Recursive Feature Elimination CV - Gradient boosting regressor :\n{}".format(X.columns[mask_rfecv]))

gbr = GradientBoostingRegressor()
shuffle_split = ShuffleSplit(test_size=.5, train_size=.5, n_splits=10)
scores = cross_val_score(gbr, X_new, y, cv=shuffle_split, scoring='neg_mean_squared_log_error')
scores = [sqrt(abs(s)) for s in scores]
print(scores)

Features retained with Recursive Feature Elimination CV - Gradient boosting regressor :
Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'MasVnrArea', 'ExterQual', 'BsmtQual', 'BsmtExposure',
       'BsmtUnfSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'KitchenQual',
       'FireplaceQu', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'ScreenPorch', 'MSZoning_C (all)',
       'Neighborhood_Crawfor', 'Neighborhood_StoneBr', 'Condition1_Norm',
       'Exterior1st_BrkFace', 'Functional_Typ', 'SaleType_New',
       'SaleCondition_Abnorml', 'BsmtFinSF'],
      dtype='object')
[0.14611250382625218, 0.13399878708922558, 0.13557921763462974, 0.15332844168918816, 0.13635773036791987, 0.13502128010757636, 0.1321793668873622, 0.13906389927745016, 0.14328086127093473, 0.1486450145298405]


RFECV ends up selecting 31 features.
Let's freeze this number of features and tune the parameters of the gradient boosting regressor with a grid search.

In [24]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfegbr = GradientBoostingRegressor()
rfe = RFE(estimator=rfegbr, n_features_to_select=31, step=1)

gbr = GradientBoostingRegressor(learning_rate=0.1)
param_grid = {'gbr__max_depth': [3, 4, 5],
              'gbr__subsample': [1.0, 0.75]}
pipe = Pipeline([("rfe", rfe), ("gbr", gbr)])
# No separate pipe.fit() is needed: GridSearchCV fits clones of the pipeline itself.

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='neg_mean_squared_log_error')
grid.fit(X_train, y_train)

print("Best cross-validation score : {:.3f}".format(grid.best_score_))
print("Test set score : {:.3f}".format(grid.score(X_test, y_test)))
print("Best parameters : {}".format(grid.best_params_))


Best cross-validation score : -0.018
Test set score : -0.015
Best parameters : {'gbr__max_depth': 4, 'gbr__subsample': 0.75}


In [28]:
print("Best cross-validation score : {:.3f}".format(sqrt(abs(grid.best_score_))))

Best cross-validation score : 0.135


In [30]:
from sklearn.metrics import mean_squared_log_error
prediction = grid.predict(X_test)

for t,p in zip(y_test, prediction):
    print("truth: {}, predict.: {}, score: {}".format(t, p, sqrt(mean_squared_log_error([t], [p]))))


truth: 200624, predict.: 212339.1626418124, score: 0.05675203565080267
truth: 133000, predict.: 145334.05991101748, score: 0.08868518774155554
truth: 110000, predict.: 104717.11580466968, score: 0.0492173279183703
truth: 192000, predict.: 213697.33622340337, score: 0.10706479637589617
truth: 88000, predict.: 94141.73117734843, score: 0.06746386937116355
truth: 85000, predict.: 99724.93945941646, score: 0.1597627971206439
truth: 282922, predict.: 271499.6781304465, score: 0.04121013872689261
truth: 141000, predict.: 138670.64527751564, score: 0.016658108159337104
truth: 745000, predict.: 447567.877080425, score: 0.5095551195258459
truth: 148800, predict.: 158243.21134784553, score: 0.061529638388201136
truth: 208900, predict.: 201292.97178451822, score: 0.037094069758779824
truth: 136905, predict.: 142123.88120466113, score: 0.037411557273845375
truth: 225000, predict.: 229588.80098188133, score: 0.020189396009167027
truth: 123000, predict.: 122153.98707727317, score: 0.0069018610655007

Let's take a look at the subset of records where the score is at least 0.2:

In [116]:
# Join the test features with the true prices and the model's predictions.
result = pd.concat([X_test, y_test], axis=1)
result['Prediction'] = prediction

def msle(row):
    # Despite its name, this returns the root of the squared log error of one record.
    return sqrt(mean_squared_log_error([row['SalePrice']], [row['Prediction']]))

result['Score'] = result.apply(msle, axis=1)
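
Because each row holds a single truth/prediction pair, the per-record score computed above reduces to the absolute difference of the two `log1p` values, so it can also be obtained in one vectorized step. A minimal sketch, using three truth/prediction pairs copied from the printed output; the `demo` frame and the `OverPredicted` column are illustrative names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Three truth/prediction pairs copied from the printed output above.
demo = pd.DataFrame({
    "SalePrice": [200624, 133000, 745000],
    "Prediction": [212339.1626418124, 145334.05991101748, 447567.877080425],
})

# For a single record, sqrt(MSLE) equals |log1p(prediction) - log1p(truth)|.
demo["Score"] = (np.log1p(demo["Prediction"]) - np.log1p(demo["SalePrice"])).abs()

# Flag the direction of the error so both cases can be filtered easily.
demo["OverPredicted"] = demo["Prediction"] > demo["SalePrice"]
print(demo)
```

On a frame the size of the test set the difference in speed is negligible, but the vectorized form makes it explicit that the score is a symmetric error on the log scale.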

In [121]:
result[(result['Score'] >= 0.2) & (result['SalePrice'] < result['Prediction'])][X_test.columns[mask_rfecv]]


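| | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | ExterQual | BsmtQual | BsmtExposure | ... | ScreenPorch | MSZoning_C (all) | Neighborhood_Crawfor | Neighborhood_StoneBr | Condition1_Norm | Exterior1st_BrkFace | Functional_Typ | SaleType_New | SaleCondition_Abnorml | BsmtFinSF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 589 | 50.0 | 9100 | 5 | 6 | 1930 | 1960 | 0.0 | 3 | 3 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 535 | 70.0 | 7000 | 5 | 7 | 1910 | 1991 | 0.0 | 3 | 4 | 4 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 969 |
| 1163 | 60.0 | 12900 | 4 | 4 | 1969 | 1969 | 0.0 | 3 | 4 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1198 |
| 479 | 50.0 | 5925 | 4 | 7 | 1937 | 2000 | 435.0 | 3 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 168 |
| 223 | 70.0 | 10500 | 4 | 6 | 1971 | 1971 | 0.0 | 3 | 3 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 704 |
| 1355 | 102.0 | 10192 | 7 | 6 | 1968 | 1992 | 143.0 | 3 | 3 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 308 | 0.0 | 12342 | 4 | 5 | 1940 | 1950 | 0.0 | 3 | 3 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 262 |
| 666 | 0.0 | 18450 | 6 | 5 | 1965 | 1979 | 113.0 | 3 | 4 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 910 |
| 632 | 85.0 | 11900 | 7 | 5 | 1977 | 1977 | 209.0 | 3 | 3 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 822 |
| 986 | 59.0 | 5310 | 6 | 8 | 1910 | 2003 | 0.0 | 3 | 3 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |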


In [122]:
result[(result['Score'] >= 0.2) & (result['SalePrice'] > result['Prediction'])][X_test.columns[mask_rfecv]]


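| | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | ExterQual | BsmtQual | BsmtExposure | ... | ScreenPorch | MSZoning_C (all) | Neighborhood_Crawfor | Neighborhood_StoneBr | Condition1_Norm | Exterior1st_BrkFace | Functional_Typ | SaleType_New | SaleCondition_Abnorml | BsmtFinSF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1182 | 160.0 | 15623 | 10 | 5 | 1996 | 1996 | 0.0 | 4 | 5 | 3 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 2096 |
| 142 | 71.0 | 8520 | 5 | 4 | 1952 | 1952 | 0.0 | 3 | 3 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 507 |
| 678 | 80.0 | 11844 | 8 | 5 | 2008 | 2008 | 464.0 | 4 | 5 | 1 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| 393 | 0.0 | 7446 | 4 | 5 | 1941 | 1950 | 0.0 | 3 | 3 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 266 |
| 1000 | 74.0 | 10206 | 3 | 3 | 1952 | 1952 | 0.0 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 688 | 60.0 | 8089 | 8 | 6 | 2007 | 2007 | 0.0 | 4 | 4 | 3 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 945 |
| 1122 | 0.0 | 8926 | 4 | 3 | 1956 | 1956 | 0.0 | 3 | 3 | 0 | ... | 160 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 613 | 70.0 | 8402 | 5 | 5 | 2007 | 2007 | 0.0 | 3 | 4 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 206 |
| 769 | 47.0 | 53504 | 8 | 5 | 2003 | 2003 | 603.0 | 5 | 4 | 4 | ... | 210 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1416 |
| 546 | 70.0 | 8737 | 6 | 7 | 1923 | 1950 | 0.0 | 3 | 4 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 300 |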


Large errors occur in both directions: the two tables list ten over-predicted and ten under-predicted records. Let's check whether a different model would perform better on this set of test cases.

In [130]:
from sklearn.linear_model import Lasso

# Fit a lasso on the features retained by RFECV.
lasso = Lasso(alpha=0.01, max_iter=50000)
lasso.fit(X_train[X_train.columns[mask_rfecv]], y_train)

# Records on which the gradient boosting pipeline has a score of 0.2 or more.
hard = result[result['Score'] >= 0.2]
lasso_predict = lasso.predict(hard[X_test.columns[mask_rfecv]])

for t, s, l in zip(hard['SalePrice'], hard['Score'], lasso_predict):
    print("score-gbr: {}, score-lasso: {}".format(s, sqrt(mean_squared_log_error([t], [l]))))


score-gbr: 0.5095551195258459, score-lasso: 0.48132525736244425
score-gbr: 0.253986064146531, score-lasso: 0.18028837241712026
score-gbr: 0.2027022020808058, score-lasso: 0.08226160115742864
score-gbr: 0.24726726682222555, score-lasso: 0.5771497465329283
score-gbr: 0.20434316741195957, score-lasso: 0.8912464299537373
score-gbr: 0.5004749959713006, score-lasso: 0.29092881740443666
score-gbr: 0.27273862710495855, score-lasso: 0.6162200937705151
score-gbr: 0.2146313442831147, score-lasso: 0.049880758273772585
score-gbr: 0.22235238175008298, score-lasso: 0.3576178932817271
score-gbr: 0.33592992489058915, score-lasso: 0.17783630632239067
score-gbr: 0.30096257486977507, score-lasso: 0.3565740695985369
score-gbr: 0.283166946424112, score-lasso: 0.0652167706304958
score-gbr: 0.24626815879394393, score-lasso: 0.060072880830004394
score-gbr: 0.20313640077714723, score-lasso: 0.15878936198813776
score-gbr: 0.2976219359782135, score-lasso: 0.07556615557948376
score-gbr: 0.4092872577173132, score-l

In [142]:
lasso_predict_all = lasso.predict(X_test[X_test.columns[mask_rfecv]])
avg_score = 0
for t, lp, gbrp in zip(result['SalePrice'], lasso_predict_all, result['Prediction']):
    print("gbr: {:.3f}, lasso: {:.3f}, real: {}".format(gbrp, lp, t))
    mean_pred = (lp + gbrp)/2.0
    score = sqrt(mean_squared_log_error([t], [mean_pred]))
    avg_score += score
avg_score = avg_score/float(len(lasso_predict_all))
print(avg_score)

gbr: 212339.163, lasso: 252751.015, real: 200624
gbr: 145334.060, lasso: 132762.461, real: 133000
gbr: 104717.116, lasso: 117154.070, real: 110000
gbr: 213697.336, lasso: 217139.176, real: 192000
gbr: 94141.731, lasso: 112028.475, real: 88000
gbr: 99724.939, lasso: 85107.398, real: 85000
gbr: 271499.678, lasso: 237172.644, real: 282922
gbr: 138670.645, lasso: 151315.042, real: 141000
gbr: 447567.877, lasso: 460382.714, real: 745000
gbr: 158243.211, lasso: 155757.299, real: 148800
gbr: 201292.972, lasso: 212719.562, real: 208900
gbr: 142123.881, lasso: 147771.639, real: 136905
gbr: 229588.801, lasso: 233534.716, real: 225000
gbr: 122153.987, lasso: 119509.262, real: 123000
gbr: 120798.123, lasso: 118415.030, real: 119200
gbr: 149172.522, lasso: 141668.200, real: 145000
gbr: 228593.854, lasso: 229500.024, real: 190000
gbr: 128000.992, lasso: 110442.132, real: 123600
gbr: 130612.591, lasso: 122366.907, real: 149350
gbr: 175312.008, lasso: 186293.802, real: 155000
gbr: 128766.409, lasso: 1

Combining both models does improve the score. The best combination on the Kaggle test set is 20% lasso and 80% gbr: with the previously selected features and this weighted average of the two models we get a score of 0.12916 (achieved without subsampling in gbr, to reduce the randomness of the results).
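
A weighted blend like the 20/80 mix mentioned above can be explored by scanning the weight given to the lasso prediction. The sketch below is only an illustration: it uses the first eight rows of the printed comparison as data, and the `rmsle` helper, the weight grid and the variable names are assumptions. The weight it finds on eight rows says nothing about the Kaggle test set, and tuning the weight on the same data used to report the score would be optimistic.

```python
import numpy as np

# First eight rows of the printed gbr / lasso / real comparison above.
gbr_pred = np.array([212339.163, 145334.060, 104717.116, 213697.336,
                     94141.731, 99724.939, 271499.678, 138670.645])
lasso_pred = np.array([252751.015, 132762.461, 117154.070, 217139.176,
                       112028.475, 85107.398, 237172.644, 151315.042])
truth = np.array([200624, 133000, 110000, 192000,
                  88000, 85000, 282922, 141000], dtype=float)


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# w is the share given to the lasso; 1 - w goes to the gradient boosting model.
weights = np.linspace(0.0, 1.0, 11)
blend_scores = [rmsle(truth, w * lasso_pred + (1.0 - w) * gbr_pred)
                for w in weights]

best_weight = float(weights[int(np.argmin(blend_scores))])
print("best lasso weight on these rows:", best_weight)
print("RMSLE gbr only:", blend_scores[0], "lasso only:", blend_scores[-1])
```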

Just for fun, let's see what happens if we add an SVM on top. SVMs are known to be sensitive to the scaling of the data, so we should apply a scaler first.

In [None]:
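# LinearSVC is a classifier and lives in sklearn.svm, not sklearn.linear_model;
# LinearSVR is its regression counterpart.
from sklearn.svm import LinearSVR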
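
How scaling and a support vector regressor could be chained is sketched below, under stated assumptions: the data is synthetic (`make_regression`), some columns are rescaled by hand to mimic features on very different scales, and `SVR` with a linear kernel and `C=100.0` is an arbitrary illustrative choice rather than a tuned model. Putting the scaler inside the pipeline means it is fitted on the training folds only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 200 rows, 10 features.
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=5.0, random_state=0)

# Put a few columns on very different scales, as raw housing features would be.
X_demo = X_demo * np.array([1, 10, 100, 1000, 1, 1, 1, 1, 1, 1])

# The scaler is part of the pipeline, so each fold is scaled using
# statistics from its own training portion.
svr_pipe = make_pipeline(StandardScaler(), SVR(kernel="linear", C=100.0))
svr_scores = cross_val_score(svr_pipe, X_demo, y_demo, cv=3, scoring="r2")
print(svr_scores)
```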
