## Selecting a model

We are facing a regression problem. Since the relationship between the features and the target is non-linear and interactions between features are to be expected, a decision-tree-based approach seems appropriate.

In [5]:
# Load the "autoreload" extension
%load_ext autoreload

# always reload modules marked with "%aimport"
%autoreload 1

import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# import my method from the source code
%aimport features.build_features
%aimport visualization.visualize
from features.build_features import read_raw_data
import numpy as np
import pandas as pd

In [6]:
from math import sqrt

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score

df = read_raw_data("../data/processed/eval.csv")
X = df.drop(['SalePrice','Id'], axis=1)
y = df['SalePrice'] 

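gbr = GradientBoostingRegressor()
# 10 random 50/50 splits of the data
shuffle_split = ShuffleSplit(test_size=.5, train_size=.5, n_splits=10)
# scikit-learn reports the negated MSLE; sqrt(abs(.)) turns each value into an RMSLE
scores = cross_val_score(gbr, X, y, cv=shuffle_split, scoring='neg_mean_squared_log_error')
scores = [sqrt(abs(s)) for s in scores]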

print(scores)




[0.12897351819848882, 0.13999192900331167, 0.13894196737564943, 0.1482179188211841, 0.121826010298229, 0.14543364709674989, 0.15290697166880193, 0.1406155156790504, 0.13362100357469012, 0.14321808185345272]


The gradient boosting regressor with default parameters already achieves reasonable results in cross-validation (an RMSLE of roughly 0.12 to 0.15 per split). Let's see what happens if we combine it with the feature reduction explored in the previous notebook.

In [8]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression

df = read_raw_data("../data/processed/eval.csv")
y = df['SalePrice']
X = df.drop(['SalePrice', 'Id'], axis=1)

selector = SelectKBest(mutual_info_regression, k=10)
X_new = selector.fit_transform(X, y)
mask_sk = selector.get_support()
print("Features retained with SelectKBest - mutual_info_regression :\n{}".format(X.columns[mask_sk]))

gbr = GradientBoostingRegressor()
shuffle_split = ShuffleSplit(test_size=.5, train_size=.5, n_splits=10)
scores = cross_val_score(gbr, X_new, y, cv=shuffle_split, scoring='neg_mean_squared_log_error')
scores = [sqrt(abs(s)) for s in scores]
print(scores)



Features retained with SelectKBest - mutual_info_regression :
Index(['OverallQual', 'YearBuilt', 'ExterQual', 'BsmtQual', '1stFlrSF',
       'GrLivArea', 'FullBath', 'KitchenQual', 'GarageCars', 'GarageArea'],
      dtype='object')
[0.16172048183401694, 0.16008693354572967, 0.16449754709822062, 0.15465858292555415, 0.15434830203823713, 0.1549160768379923, 0.15710166420555327, 0.15878251573446817, 0.16620302922156283, 0.15977646011870936]


The quality of the prediction visibly deteriorates: the RMSLE rises from roughly 0.14 to roughly 0.16 on average.
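
To compare the two runs at a glance, the per-split scores can be summarised by their mean and standard deviation. The following is a small sketch using only the standard library; the values are rounded copies of the two outputs above, and the variable names are illustrative rather than part of the notebook.

```python
from statistics import mean, stdev

# RMSLE per split, rounded from the two outputs above
all_features = [0.1290, 0.1400, 0.1389, 0.1482, 0.1218,
                0.1454, 0.1529, 0.1406, 0.1336, 0.1432]
k_best_10 = [0.1617, 0.1601, 0.1645, 0.1547, 0.1543,
             0.1549, 0.1571, 0.1588, 0.1662, 0.1598]

for name, scores_demo in [("all features", all_features),
                          ("SelectKBest, k=10", k_best_10)]:
    print("{:<18} mean RMSLE = {:.4f} +/- {:.4f}".format(
        name, mean(scores_demo), stdev(scores_demo)))
```

Because `ShuffleSplit` is not seeded in the cells above, rerunning them will give slightly different numbers.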

In [9]:
from sklearn.feature_selection import RFE

rfegbr = GradientBoostingRegressor()
rfe = RFE(estimator=rfegbr, n_features_to_select=10, step=1)
X_new = rfe.fit_transform(X, y)
mask_rfe = rfe.get_support()
print("Features retained with Recursive Feature Elimination - Gradient boosting regressor :\n{}".format(X.columns[mask_rfe]))

gbr = GradientBoostingRegressor()
shuffle_split = ShuffleSplit(test_size=.5, train_size=.5, n_splits=10)
scores = cross_val_score(gbr, X_new, y, cv=shuffle_split, scoring='neg_mean_squared_log_error')
scores = [sqrt(abs(s)) for s in scores]
print(scores)

Features retained with Recursive Feature Elimination - Gradient boosting regressor :
Index(['LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'BsmtUnfSF',
       '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'GarageArea', 'BsmtFinSF'],
      dtype='object')
[0.16070260862181582, 0.12730982115573092, 0.1415031179495064, 0.14210617923356783, 0.15065696025518688, 0.1563119144878549, 0.152656803977064, 0.1497709630267808, 0.1375772876496778, 0.1448940587799028]


With RFE the quality deteriorates as well, though less so than with the SelectKBest feature selection. It would be worthwhile to select the ideal number of features using a cross-validated grid search. The same holds for the parameters of the gradient boosting regressor.
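
One way to tune the number of features and the regressor together is to put `RFE` and the regressor into a `Pipeline` and let `GridSearchCV` search over both. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_regression` instead of the housing data, small `n_estimators` to keep it fast, and `neg_mean_squared_error` as the score because the synthetic target can be negative. All names ending in `_demo` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the housing data: 8 features, 3 of them informative
X_demo, y_demo = make_regression(n_samples=120, n_features=8, n_informative=3,
                                 noise=5.0, random_state=0)

pipe_demo = Pipeline([
    ("rfe", RFE(estimator=GradientBoostingRegressor(n_estimators=20, random_state=0))),
    ("gbr", GradientBoostingRegressor(n_estimators=20, random_state=0)),
])

# Search over the number of retained features and a model parameter at once
param_grid_demo = {
    "rfe__n_features_to_select": [2, 4, 6],
    "gbr__max_depth": [2, 3],
}
grid_demo = GridSearchCV(pipe_demo, param_grid=param_grid_demo, cv=3,
                         scoring="neg_mean_squared_error")
grid_demo.fit(X_demo, y_demo)
print(grid_demo.best_params_)
```

Since the selection step sits inside the pipeline, it is refitted on each training fold, so the held-out fold never influences which features are kept.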

A possible improvement for the feature selection might be to select features with cross-validation.

In [10]:
from sklearn.feature_selection import RFECV

rfegbr = GradientBoostingRegressor()
rfecv = RFECV(estimator=rfegbr, step=1)
X_new = rfecv.fit_transform(X, y)
mask_rfecv = rfecv.get_support()
print("Features retained with Recursive Feature Elimination CV - Gradient boosting regressor :\n{}".format(X.columns[mask_rfecv]))

gbr = GradientBoostingRegressor()
shuffle_split = ShuffleSplit(test_size=.5, train_size=.5, n_splits=10)
scores = cross_val_score(gbr, X_new, y, cv=shuffle_split, scoring='neg_mean_squared_log_error')
scores = [sqrt(abs(s)) for s in scores]
print(scores)

Features retained with Recursive Feature Elimination CV - Gradient boosting regressor :
Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'MasVnrArea', 'ExterQual', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtUnfSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'KitchenQual', 'Fireplaces', 'FireplaceQu',
       'GarageYrBlt', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'ScreenPorch', 'PoolArea',
       'MoSold', 'YrSold', 'MSSubClass_50', 'MSSubClass_60',
       'MSZoning_C (all)', 'MSZoning_RH', 'MSZoning_RL', 'LotConfig_CulDSac',
       'LotConfig_FR3', 'Neighborhood_ClearCr', 'Neighborhood_Crawfor',
       'Neighborhood_Edwards', 'Neighborhood_OldTown', 'Neighborhood_StoneBr',
       'Condition1_Artery', 'Condition1_Norm', 'BldgType_1Fam',
       'RoofMatl_Tar&Grv', 'Roo

RFECV ends up selecting 67 features.
Let's freeze this number of features and continue by tuning the parameters of the gradient boosting regressor with a grid search.

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfegbr = GradientBoostingRegressor()
# Note: 31 features are requested here, not the 67 selected by RFECV above
rfe = RFE(estimator=rfegbr, n_features_to_select=31, step=1)

gbr = GradientBoostingRegressor(learning_rate=0.1)
param_grid = {'gbr__max_depth': [3, 4, 5],
              'gbr__subsample': [1.0, 0.75]}
pipe = Pipeline([("rfe", rfe), ("gbr", gbr)])

# GridSearchCV clones and fits the pipeline itself, so no separate pipe.fit() is needed
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='neg_mean_squared_log_error')
grid.fit(X_train, y_train)

# best_score_ and score() are negated MSLE values (closer to zero is better)
print("Best cross_validation score : {:.3f}".format(grid.best_score_))
print("Test set score : {:.3f}".format(grid.score(X_test, y_test)))
print("Best parameters : {}".format(grid.best_params_))

# produces best params: max_depth=4, subsample=0.75

In [28]:
print("Best cross_validation score : {:.3f}".format(sqrt(abs(grid.best_score_))))

Best cross_validation score : 0.135


In [15]:
from sklearn.metrics import mean_squared_log_error
gbr = GradientBoostingRegressor(learning_rate=0.1, max_depth=4)
gbr.fit(X_train[X_train.columns[mask_rfecv]], y_train)
prediction = gbr.predict(X_test[X_test.columns[mask_rfecv]])
#prediction = grid.predict(X_test)

for t,p in zip(y_test, prediction):
    print("truth: {}, predict.: {}, score: {}".format(t, p, sqrt(mean_squared_log_error([t], [p]))))


truth: 200624, predict.: 220956.3324818651, score: 0.0965321233009746
truth: 133000, predict.: 146742.48013534182, score: 0.09832938235148347
truth: 110000, predict.: 103861.57233517451, score: 0.05742085121801388
truth: 192000, predict.: 204001.11506368167, score: 0.0606297814021417
truth: 88000, predict.: 93586.40888537517, score: 0.061547675929460866
truth: 85000, predict.: 98620.68500065854, score: 0.14862814531351987
truth: 282922, predict.: 275521.4045709711, score: 0.026505827222779388
truth: 141000, predict.: 124139.65569987592, score: 0.12735173964129665
truth: 745000, predict.: 503843.3811288407, score: 0.39111810786920564
truth: 148800, predict.: 155113.890492316, score: 0.04155622852088392
truth: 208900, predict.: 190898.2881568332, score: 0.09011445368559734
truth: 136905, predict.: 137345.85620812117, score: 0.003214964423273514
truth: 225000, predict.: 239934.49373703235, score: 0.06426526448053593
truth: 123000, predict.: 129664.34378808783, score: 0.052764367329201534


Let's take a look at the subset of records where the score is 0.2 or higher:

In [16]:
# Collect the test features, the true prices and the GBR predictions in one frame
result = pd.concat([X_test, y_test], axis=1)
result['Prediction'] = prediction

def msle(row):
    # Per-record RMSLE between the true price and the GBR prediction
    return sqrt(mean_squared_log_error([row['SalePrice']], [row['Prediction']]))

result['Score'] = result.apply(msle, axis=1)
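
For a single record, the square root of the MSLE reduces to the absolute difference of the `log1p` values, so the row-wise `apply` can also be written as one vectorised expression. A minimal sketch, assuming NumPy is available; the three (truth, prediction) pairs are taken from the output above and the variable names are illustrative.

```python
import numpy as np

# First three (truth, prediction) pairs from the output above
truth = np.array([200624.0, 133000.0, 110000.0])
pred = np.array([220956.3324818651, 146742.48013534182, 103861.57233517451])

# Per-record RMSLE = |log(1 + prediction) - log(1 + truth)|
row_score = np.abs(np.log1p(pred) - np.log1p(truth))
print(row_score)
```

Applied to the frame above, the same idea would read `np.abs(np.log1p(result['Prediction']) - np.log1p(result['SalePrice']))`.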

In [17]:
result[(result['Score'] >= 0.2) & (result['SalePrice'] < result['Prediction'])][X_test.columns[mask_rfecv]]


Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,BsmtQual,BsmtCond,...,CentralAir_Y,Functional_Maj1,Functional_Sev,Functional_Typ,GarageType_Attchd,SaleType_New,SaleCondition_Abnorml,SaleCondition_Family,SaleCondition_Partial,BsmtFinSF
1322,107.0,10186,7,5,1992,1992,0.0,4,4,3,...,1,0,0,1,1,0,0,0,0,674
482,50.0,2500,7,8,1915,2005,0.0,4,3,3,...,1,0,0,0,1,0,0,0,0,299
589,50.0,9100,5,6,1930,1960,0.0,3,3,3,...,1,0,0,1,0,0,0,0,0,0
438,40.0,4280,5,6,1913,2002,0.0,3,3,3,...,0,0,0,1,0,0,0,0,0,365
224,103.0,13472,10,5,2003,2003,922.0,5,5,3,...,1,0,0,1,1,0,0,0,0,56
1163,60.0,12900,4,4,1969,1969,0.0,3,4,3,...,1,0,0,1,0,0,0,0,0,1198
963,122.0,11923,9,5,2007,2007,0.0,4,5,3,...,1,0,0,1,1,0,0,0,0,0
479,50.0,5925,4,7,1937,2000,435.0,3,2,3,...,1,0,0,1,0,0,0,0,0,168
223,70.0,10500,4,6,1971,1971,0.0,3,3,3,...,1,0,0,1,0,0,1,0,0,704
1355,102.0,10192,7,6,1968,1992,143.0,3,3,3,...,1,0,0,1,1,0,0,0,0,0


In [18]:
result[(result['Score'] >= 0.2) & (result['SalePrice'] > result['Prediction'])][X_test.columns[mask_rfecv]]


Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,BsmtQual,BsmtCond,...,CentralAir_Y,Functional_Maj1,Functional_Sev,Functional_Typ,GarageType_Attchd,SaleType_New,SaleCondition_Abnorml,SaleCondition_Family,SaleCondition_Partial,BsmtFinSF
1182,160.0,15623,10,5,1996,1996,0.0,4,5,3,...,1,0,0,1,1,0,1,0,0,2096
142,71.0,8520,5,4,1952,1952,0.0,3,3,3,...,1,0,0,1,0,0,0,0,0,507
688,60.0,8089,8,6,2007,2007,0.0,4,4,3,...,1,0,0,1,1,1,0,0,1,945
1122,0.0,8926,4,3,1956,1956,0.0,3,3,3,...,1,0,0,1,0,0,1,0,0,0
896,50.0,8765,4,6,1936,1950,0.0,3,3,3,...,0,0,0,1,0,0,1,0,0,285
769,47.0,53504,8,5,2003,2003,603.0,5,4,3,...,1,0,0,0,0,0,0,0,0,1416
546,70.0,8737,6,7,1923,1950,0.0,3,4,3,...,1,0,0,1,0,0,0,0,0,300


In 10 of the 17 cases listed above the prediction is too high. Let's check whether a different model performs better on this set of test cases.

In [21]:
from sklearn.linear_model import Lasso


lasso = Lasso(alpha=0.01, max_iter=50000)
lasso.fit(X_train[X_train.columns[mask_rfecv]], y_train)

# Records that the GBR predicts poorly (per-record score of 0.2 or more)
poor = result[result['Score'] >= 0.2]
lasso_predict = lasso.predict(poor[X_test.columns[mask_rfecv]])

for t, gbr_score, l in zip(poor['SalePrice'], poor['Score'], lasso_predict):
    print("score-gbr: {}, score-lasso: {}".format(gbr_score, sqrt(mean_squared_log_error([t], [l]))))


score-gbr: 0.39111810786920564, score-lasso: 0.4433921551144593
score-gbr: 0.23125335156286653, score-lasso: 0.20999327967534498
score-gbr: 0.20674583714046868, score-lasso: 0.09611199645152446
score-gbr: 0.28176347427252146, score-lasso: 0.18473655272177147
score-gbr: 0.49772553800758246, score-lasso: 0.2946475787921532
score-gbr: 0.3002008290091869, score-lasso: 0.735442041198981
score-gbr: 0.25711397355137855, score-lasso: 0.047528543222265185
score-gbr: 0.21315697530746824, score-lasso: 0.1697326283713263
score-gbr: 0.24439118172004903, score-lasso: 0.34553063460013256
score-gbr: 0.21062944633733416, score-lasso: 0.023804864719936702
score-gbr: 0.3510809243214936, score-lasso: 0.08529065648169798
score-gbr: 0.24141416024353468, score-lasso: 0.09900156267618243
score-gbr: 0.31771697681182864, score-lasso: 0.3819709806707827
score-gbr: 0.21901551711090228, score-lasso: 0.12846311343982109
score-gbr: 0.22561121067482404, score-lasso: 0.11183480820409919
score-gbr: 0.30601537693753755,

In [24]:
lasso_predict_all = lasso.predict(X_test[X_test.columns[mask_rfecv]])
avg_score = 0
for t, lp, gbrp in zip(result['SalePrice'], lasso_predict_all, result['Prediction']):
    print("gbr: {:.3f}, lasso: {:.3f}, real: {}".format(gbrp, lp, t))
    mean_pred = (lp + gbrp)/2.0
    score = sqrt(mean_squared_log_error([t], [mean_pred]))
    avg_score += score
avg_score = avg_score/float(len(lasso_predict_all))
print(avg_score)

gbr: 220956.332, lasso: 255254.880, real: 200624
gbr: 146742.480, lasso: 136376.851, real: 133000
gbr: 103861.572, lasso: 119430.607, real: 110000
gbr: 204001.115, lasso: 226267.606, real: 192000
gbr: 93586.409, lasso: 100473.819, real: 88000
gbr: 98620.685, lasso: 93297.713, real: 85000
gbr: 275521.405, lasso: 251539.541, real: 282922
gbr: 124139.656, lasso: 142889.158, real: 141000
gbr: 503843.381, lasso: 478181.953, real: 745000
gbr: 155113.890, lasso: 150006.192, real: 148800
gbr: 190898.288, lasso: 189219.786, real: 208900
gbr: 137345.856, lasso: 127254.255, real: 136905
gbr: 239934.494, lasso: 251073.990, real: 225000
gbr: 129664.344, lasso: 113145.046, real: 123000
gbr: 120141.373, lasso: 123069.418, real: 119200
gbr: 146195.281, lasso: 143412.404, real: 145000
gbr: 239434.169, lasso: 234397.490, real: 190000
gbr: 118998.500, lasso: 124193.641, real: 123600
gbr: 133535.639, lasso: 136460.594, real: 149350
gbr: 190599.079, lasso: 170636.865, real: 155000
gbr: 125238.805, lasso: 1

Combining both models actually improves the score. The best combination on the Kaggle test set is 20% lasso and 80% gbr: using the previously selected features and this weighted average of the two models, we get a score of 0.12916 (achieved without subsampling in gbr, to reduce the randomness of the results).
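
How such a blending weight can be found is sketched below: sweep the lasso weight from 0 to 1 and keep the weight with the lowest RMSLE. This is an illustration only, assuming NumPy; it uses the first eight (gbr, lasso, real) triples printed above, and the helper `rmsle` and the other names are hypothetical. Note that it computes the RMSLE over the whole set, whereas the loop above averages per-record scores.

```python
import numpy as np

# First eight (gbr, lasso, real) triples from the output above
gbr_pred = np.array([220956.332, 146742.480, 103861.572, 204001.115,
                     93586.409, 98620.685, 275521.405, 124139.656])
lasso_pred = np.array([255254.880, 136376.851, 119430.607, 226267.606,
                       100473.819, 93297.713, 251539.541, 142889.158])
real = np.array([200624, 133000, 110000, 192000,
                 88000, 85000, 282922, 141000], dtype=float)

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error over a whole set."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Sweep the lasso weight from 0 (pure gbr) to 1 (pure lasso)
weights = np.linspace(0.0, 1.0, 11)
blend_scores = [rmsle(real, w * lasso_pred + (1 - w) * gbr_pred) for w in weights]
best = int(np.argmin(blend_scores))
print("best lasso weight: {:.1f}, RMSLE: {:.4f}".format(weights[best], blend_scores[best]))
```

With only eight rows the selected weight is merely illustrative; a weight tuned on one hold-out set should be checked on data that was not used for the tuning.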

Let's, for fun, see what happens if we add an SVM on top. SVMs are known to be sensitive to the scaling of the data, so we should apply a scaler first.
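
What min-max scaling does can be sketched with plain NumPy: the minimum and maximum of each column are learned on the training data only and then applied to both sets. The toy values below are made up for illustration and are not taken from the data set.

```python
import numpy as np

# Toy feature matrix with columns on very different scales (hypothetical values)
train = np.array([[8450.0, 7.0], [9600.0, 6.0], [11250.0, 8.0]])
test = np.array([[14260.0, 5.0]])

# Learn the column-wise minimum and maximum on the training data only ...
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# ... and apply the same transformation to both sets
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled)
print(test_scaled)  # unseen data may fall outside [0, 1]
```

This mirrors the `fit_transform` on the training set followed by `transform` on the test set: the test data never contributes to the scaling parameters.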

In [51]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

min_max_scaler = MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train[X_train.columns[mask_rfecv]])
X_test_minmax = min_max_scaler.transform(X_test[X_test.columns[mask_rfecv]])

# Note: despite the variable name, this SVR uses a linear kernel
svr_rbf = SVR(kernel='linear', C=1000)
svr_rbf.fit(X_train_minmax, y_train)
svr_pred = svr_rbf.predict(X_test_minmax)

avg_score = 0  # reset the accumulator left over from the previous cell
for t, lp, gbrp, svr in zip(result['SalePrice'], lasso_predict_all, result['Prediction'], svr_pred):
    print("gbr: {:.3f}, lasso: {:.3f}, svr: {:.3f}, real: {}".format(gbrp, lp, svr, t))
    mean_pred = (lp + gbrp + svr)/3.0
    score = sqrt(mean_squared_log_error([t], [mean_pred]))
    avg_score += score
avg_score = avg_score/float(len(lasso_predict_all))
print(avg_score)

gbr: 220956.332, lasso: 255254.880, svr: 228379.197, real: 200624
gbr: 146742.480, lasso: 136376.851, svr: 133422.623, real: 133000
gbr: 103861.572, lasso: 119430.607, svr: 140980.687, real: 110000
gbr: 204001.115, lasso: 226267.606, svr: 206745.828, real: 192000
gbr: 93586.409, lasso: 100473.819, svr: 103079.327, real: 88000
gbr: 98620.685, lasso: 93297.713, svr: 108762.232, real: 85000
gbr: 275521.405, lasso: 251539.541, svr: 243332.609, real: 282922
gbr: 124139.656, lasso: 142889.158, svr: 150955.536, real: 141000
gbr: 503843.381, lasso: 478181.953, svr: 349642.969, real: 745000
gbr: 155113.890, lasso: 150006.192, svr: 159662.925, real: 148800
gbr: 190898.288, lasso: 189219.786, svr: 200371.570, real: 208900
gbr: 137345.856, lasso: 127254.255, svr: 131340.086, real: 136905
gbr: 239934.494, lasso: 251073.990, svr: 235793.216, real: 225000
gbr: 129664.344, lasso: 113145.046, svr: 138003.044, real: 123000
gbr: 120141.373, lasso: 123069.418, svr: 124832.247, real: 119200
gbr: 146195.281

In [60]:

avg_score = 0  # reset the accumulator before summing the per-record scores
for t, lp, gbrp, svr in zip(result['SalePrice'], lasso_predict_all, result['Prediction'], svr_pred):
    #print("gbr: {:.3f}, lasso: {:.3f}, svr: {:.3f}, real: {}".format(gbrp, lp, svr, t))
    mean_pred = 0.1*lp + 0.8*gbrp + 0.1*svr
    score = sqrt(mean_squared_log_error([t], [mean_pred]))
    avg_score += score
avg_score = avg_score/float(len(lasso_predict_all))
print(avg_score)

0.08121222862336014


In [112]:
lasso_scaled = Lasso(alpha=0.01, max_iter=50000)
lasso_scaled.fit(X_train_minmax, y_train)
lasso_predict_all_scaled = lasso_scaled.predict(X_test_minmax)

avg_score = 0  # reset the accumulator before summing the per-record scores
for t, lp, gbrp in zip(result['SalePrice'], lasso_predict_all_scaled, result['Prediction']):
    print("gbr: {:.3f}, lasso: {:.3f}, real: {}".format(gbrp, lp, t))
    mean_pred = 0.2*lp + 0.8*gbrp
    score = sqrt(mean_squared_log_error([t], [mean_pred]))
    avg_score += score
avg_score = avg_score/float(len(lasso_predict_all_scaled))
print(avg_score)

gbr: 220956.332, lasso: 255256.155, real: 200624
gbr: 146742.480, lasso: 136376.817, real: 133000
gbr: 103861.572, lasso: 119431.758, real: 110000
gbr: 204001.115, lasso: 226267.744, real: 192000
gbr: 93586.409, lasso: 100473.492, real: 88000
gbr: 98620.685, lasso: 93298.047, real: 85000
gbr: 275521.405, lasso: 251539.958, real: 282922
gbr: 124139.656, lasso: 142889.397, real: 141000
gbr: 503843.381, lasso: 478176.492, real: 745000
gbr: 155113.890, lasso: 150006.280, real: 148800
gbr: 190898.288, lasso: 189220.879, real: 208900
gbr: 137345.856, lasso: 127254.543, real: 136905
gbr: 239934.494, lasso: 251074.268, real: 225000
gbr: 129664.344, lasso: 113145.729, real: 123000
gbr: 120141.373, lasso: 123069.780, real: 119200
gbr: 146195.281, lasso: 143411.957, real: 145000
gbr: 239434.169, lasso: 234397.531, real: 190000
gbr: 118998.500, lasso: 124193.103, real: 123600
gbr: 133535.639, lasso: 136461.044, real: 149350
gbr: 190599.079, lasso: 170637.238, real: 155000
gbr: 125238.805, lasso: 1

In [64]:
result['Lasso'] = lasso_predict_all_scaled

def msle(row):
    mean_pred = 0.2*row['Lasso'] + 0.8*row['Prediction']
    return sqrt(mean_squared_log_error([row['SalePrice']], [mean_pred]))

result['Score-Combi'] = result.apply(msle, axis=1)


In [68]:
result[result['Score-Combi'] > 0.2]

Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,ExterCond,BsmtQual,...,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,BsmtFinSF,SalePrice,Prediction,Score,Lasso,Score-Combi
1182,160.0,15623,10,5,1996,1996,0.0,4,3,5,...,0,0,0,0,2096,745000,503843.381129,0.391118,478176.49235,0.401359
1322,107.0,10186,7,5,1992,1992,0.0,4,3,4,...,0,0,1,0,674,190000,239434.168945,0.231253,234397.530752,0.227037
142,71.0,8520,5,4,1952,1952,0.0,3,2,3,...,0,0,1,0,507,166000,125238.804839,0.281763,138000.228805,0.261589
393,0.0,7446,4,5,1941,1950,0.0,3,3,3,...,0,0,0,0,266,100000,85017.637086,0.16231,68989.164092,0.200745
107,50.0,6000,5,5,1948,1950,0.0,3,3,3,...,0,0,0,1,273,115000,95705.171293,0.183658,53315.275006,0.276413
688,60.0,8089,8,6,2007,2007,0.0,4,3,4,...,0,0,0,1,945,392000,238301.018099,0.497726,291959.287544,0.453676
1122,0.0,8926,4,3,1956,1956,0.0,3,3,3,...,0,0,0,0,0,112000,82954.719947,0.300201,53680.93935,0.373392
589,50.0,9100,5,6,1930,1960,0.0,3,3,3,...,0,0,1,0,0,79500,102809.097587,0.257114,75809.485825,0.203161
896,50.0,8765,4,6,1936,1950,0.0,3,3,3,...,0,0,0,0,285,106500,86054.927054,0.213157,89873.78261,0.204321
438,40.0,4280,5,6,1913,2002,0.0,3,3,3,...,0,0,1,0,365,90350,115363.106108,0.244391,127640.31198,0.265452


In [66]:
result[(result['Score-Combi'] > 0.2) & (result['SalePrice'] < 0.2*result['Lasso'] + 0.8*result['Prediction'])]

Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,ExterCond,BsmtQual,...,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,BsmtFinSF,SalePrice,Prediction,Score,Lasso,Score-Combi
1322,107.0,10186,7,5,1992,1992,0.0,4,3,4,...,0,0,1,0,674,190000,239434.168945,0.231253,234397.530752,0.227037
589,50.0,9100,5,6,1930,1960,0.0,3,3,3,...,0,0,1,0,0,79500,102809.097587,0.257114,75809.485825,0.203161
438,40.0,4280,5,6,1913,2002,0.0,3,3,3,...,0,0,1,0,365,90350,115363.106108,0.244391,127640.31198,0.265452
1163,60.0,12900,4,4,1969,1969,0.0,3,3,4,...,1,0,0,0,1198,108959,154787.824705,0.351081,118662.446508,0.303279
963,122.0,11923,9,5,2007,2007,0.0,4,3,5,...,0,0,1,0,0,239000,304258.786176,0.241414,263871.265604,0.214507
479,50.0,5925,4,7,1937,2000,435.0,3,3,2,...,1,0,0,0,168,89471,122932.394919,0.317717,131090.282816,0.330902
223,70.0,10500,4,6,1971,1971,0.0,3,3,3,...,0,0,0,0,704,97000,120750.752374,0.219016,110296.405945,0.201548
1355,102.0,10192,7,6,1968,1992,143.0,3,3,3,...,0,0,1,0,0,170000,213025.278228,0.225611,190116.648322,0.203869
308,0.0,12342,4,5,1940,1950,0.0,3,3,3,...,0,0,1,0,262,82500,112035.621039,0.306015,91538.162403,0.268739
666,0.0,18450,6,5,1965,1979,113.0,3,4,4,...,0,0,0,0,910,129000,191691.120898,0.39607,199106.100979,0.403777


In [106]:
from sklearn.neighbors import KNeighborsRegressor

knb = KNeighborsRegressor(n_neighbors=5)
knb.fit(X_train[X_train.columns[mask_rfecv]], y_train)
nn_predict = knb.predict(X_test[X_test.columns[mask_rfecv]])

avg_score = 0  # reset the accumulator before summing the per-record scores
for t, lp, gbrp, n in zip(result['SalePrice'], result['Lasso'], result['Prediction'], nn_predict):
    mean_pred = 0.1*lp + 0.8*gbrp + 0.1*n
    score = sqrt(mean_squared_log_error([t], [mean_pred]))
    print("gbr: {:.3f}, lasso: {:.3f}, nn: {:.3f}, mean: {:.3f}, real: {}, score: {}".format(gbrp, lp, n, mean_pred, t, score))
    avg_score += score
avg_score = avg_score/float(len(nn_predict))
print(avg_score)

gbr: 220956.332, lasso: 255256.155, nn: 267940.000, mean: 229084.681, real: 200624, score: 0.132658594626891
gbr: 146742.480, lasso: 136376.817, nn: 165100.000, mean: 147541.666, real: 133000, score: 0.10376074634310939
gbr: 103861.572, lasso: 119431.758, nn: 120280.000, mean: 107060.434, real: 110000, score: 0.027086640908995818
gbr: 204001.115, lasso: 226267.744, nn: 218230.000, mean: 207650.666, real: 192000, score: 0.07836141488273718
gbr: 93586.409, lasso: 100473.492, nn: 103800.000, mean: 95296.476, real: 88000, score: 0.07965515083575525
gbr: 98620.685, lasso: 93298.047, nn: 127180.000, mean: 100944.353, real: 85000, score: 0.1719162865240893
gbr: 275521.405, lasso: 251539.958, nn: 248800.000, mean: 270451.119, real: 282922, score: 0.045079700628896546
gbr: 124139.656, lasso: 142889.397, nn: 178395.000, mean: 131440.164, real: 141000, score: 0.07020765111705529
gbr: 503843.381, lasso: 478176.492, nn: 366852.200, mean: 487577.574, real: 745000, score: 0.42393410549114563
gbr: 155

In [107]:
result['Kn'] = nn_predict

def msle2(row):
    mean_pred = 0.1*row['Lasso'] + 0.8*row['Prediction'] + 0.1*row['Kn']
    return sqrt(mean_squared_log_error([row['SalePrice']], [mean_pred]))

result['Score-Combi2'] = result.apply(msle2, axis=1)

In [109]:
result[result['Score-Combi2'] > 0.2]

Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,ExterCond,BsmtQual,...,SaleCondition_Normal,SaleCondition_Partial,BsmtFinSF,SalePrice,Prediction,Score,Lasso,Score-Combi,Kn,Score-Combi2
1182,160.0,15623,10,5,1996,1996,0.0,4,3,5,...,0,0,2096,745000,503843.381129,0.391118,478176.49235,0.401359,366852.2,0.423934
1322,107.0,10186,7,5,1992,1992,0.0,4,3,4,...,1,0,674,190000,239434.168945,0.231253,234397.530752,0.227037,231658.0,0.225888
142,71.0,8520,5,4,1952,1952,0.0,3,2,3,...,1,0,507,166000,125238.804839,0.281763,138000.228805,0.261589,175000.0,0.233047
728,85.0,11475,5,5,1958,1958,95.0,3,3,3,...,0,0,0,110000,124259.39701,0.12189,146658.138343,0.157307,255920.0,0.238767
107,50.0,6000,5,5,1948,1950,0.0,3,3,3,...,0,1,273,115000,95705.171293,0.183658,53315.275006,0.276413,109200.0,0.214315
688,60.0,8089,8,6,2007,2007,0.0,4,3,4,...,0,1,945,392000,238301.018099,0.497726,291959.287544,0.453676,173000.0,0.502623
1122,0.0,8926,4,3,1956,1956,0.0,3,3,3,...,0,0,0,112000,82954.719947,0.300201,53680.93935,0.373392,85598.6,0.332829
589,50.0,9100,5,6,1930,1960,0.0,3,3,3,...,1,0,0,79500,102809.097587,0.257114,75809.485825,0.203161,96300.0,0.223978
438,40.0,4280,5,6,1913,2002,0.0,3,3,3,...,1,0,365,90350,115363.106108,0.244391,127640.31198,0.265452,138400.0,0.274543
1163,60.0,12900,4,4,1969,1969,0.0,3,3,4,...,0,0,1198,108959,154787.824705,0.351081,118662.446508,0.303279,189980.0,0.350478


In [146]:
# Stack the four base models: their predictions become extra features for a meta-model

# In-sample predictions of the base models on the training data
gbr_train = gbr.predict(X_train[X_train.columns[mask_rfecv]])
lasso_train = lasso_scaled.predict(X_train_minmax)
svr_train = svr_rbf.predict(X_train_minmax)
knn_train = knb.predict(X_train[X_train.columns[mask_rfecv]])

# Work on an explicit copy so that X_train itself is not modified
training = X_train[X_train.columns[mask_rfecv]].copy()
training['GBR'] = gbr_train
training['LASSO'] = lasso_train
training['SVR'] = svr_train
training['KNN'] = knn_train

In [161]:
# Meta-model: a gradient boosting regressor fitted on the features plus the base-model predictions
metagbr = GradientBoostingRegressor()
metagbr.fit(training, y_train)

gbr_test = gbr.predict(X_test[X_test.columns[mask_rfecv]])
lasso_test = lasso_scaled.predict(X_test_minmax)
svr_test = svr_rbf.predict(X_test_minmax)
knn_test = knb.predict(X_test[X_test.columns[mask_rfecv]])

# Same column layout as the training frame, built on an explicit copy
testing = X_test[X_test.columns[mask_rfecv]].copy()
testing['GBR'] = gbr_test
testing['LASSO'] = lasso_test
testing['SVR'] = svr_test
testing['KNN'] = knn_test

meta_preds = metagbr.predict(testing)
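
The cells above train the meta-model on in-sample predictions of the base models, which tends to make the base models look better than they are. A common alternative is to train the meta-model on out-of-fold predictions. The sketch below shows this with scikit-learn's `StackingRegressor` (available from version 0.22 on); it is an illustration on synthetic data, not the approach used in this notebook, and the names ending in `_s` as well as the choice of `LinearRegression` as final estimator are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the housing data
X_s, y_s = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(X_s, y_s, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=0)),
        ("lasso", Lasso(alpha=0.01, max_iter=50000)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=LinearRegression(),
    cv=5,               # the meta-model is trained on out-of-fold predictions
    passthrough=False,  # set to True to also feed the original features
)
stack.fit(X_s_train, y_s_train)
stack_preds = stack.predict(X_s_test)
print(stack.score(X_s_test, y_s_test))  # R^2 on the held-out data
```

Setting `passthrough=True` would come closest to the frames built above, where the meta-model sees both the original features and the base-model predictions.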

In [163]:
meta_score = []
avg_score = 0  # reset the accumulator before summing the per-record scores
for t, m, g, l, s, v in zip(y_test, meta_preds, testing['GBR'], testing['LASSO'], testing['SVR'], testing['KNN']):
    score = sqrt(mean_squared_log_error([t], [m]))
    print("gbr: {:.3f}, lasso: {:.3f}, svr: {:.3f}, knn: {:.3f}, meta: {:.3f}, real: {}, score: {:.5f}".format(g, l, s, v, m, t, score))
    avg_score += score
    meta_score.append(score)
avg_score = avg_score/float(len(meta_preds))
print(avg_score)
testing['Meta-score'] = meta_score

gbr: 220956.332, lasso: 255256.155, svr: 228379.197, knn: 267940.000, meta: 223866.985, real: 200624, score: 0.10962
gbr: 146742.480, lasso: 136376.817, svr: 133422.623, knn: 165100.000, meta: 147595.141, real: 133000, score: 0.10412
gbr: 103861.572, lasso: 119431.758, svr: 140980.687, knn: 120280.000, meta: 102824.410, real: 110000, score: 0.06746
gbr: 204001.115, lasso: 226267.744, svr: 206745.828, knn: 218230.000, meta: 206393.787, real: 192000, score: 0.07229
gbr: 93586.409, lasso: 100473.492, svr: 103079.327, knn: 103800.000, meta: 90607.821, real: 88000, score: 0.02920
gbr: 98620.685, lasso: 93298.047, svr: 108762.232, knn: 127180.000, meta: 97384.740, real: 85000, score: 0.13602
gbr: 275521.405, lasso: 251539.958, svr: 243332.609, knn: 248800.000, meta: 274724.445, real: 282922, score: 0.02940
gbr: 124139.656, lasso: 142889.397, svr: 150955.536, knn: 178395.000, meta: 125803.259, real: 141000, score: 0.11404
gbr: 503843.381, lasso: 478176.492, svr: 349642.969, knn: 366852.200, m

In [183]:
testing['Meta-score'] = meta_score
testing['SalePrice'] = y_test
testing['Meta-preds'] = meta_preds

In [184]:
testing[testing['Meta-score'] > 0.2]

Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,BsmtQual,BsmtCond,...,SaleCondition_Family,SaleCondition_Partial,BsmtFinSF,GBR,LASSO,SVR,KNN,Meta-score,SalePrice,Meta-preds
1182,160.0,15623,10,5,1996,1996,0.0,4,5,3,...,0,0,2096,503843.381129,478176.49235,349642.969175,366852.2,0.293216,745000,555666.421894
1322,107.0,10186,7,5,1992,1992,0.0,4,4,3,...,0,0,674,239434.168945,234397.530752,232428.791083,231658.0,0.243726,190000,242439.167137
482,50.0,2500,7,8,1915,2005,0.0,4,3,3,...,0,0,299,190599.078991,170637.237588,179334.552084,143800.0,0.208654,155000,190963.077979
142,71.0,8520,5,4,1952,1952,0.0,3,3,3,...,0,0,507,125238.804839,138000.228805,140358.577072,175000.0,0.284076,166000,124949.474383
426,0.0,12800,7,5,1989,1989,145.0,4,4,3,...,0,0,1518,230410.357881,253052.139036,233547.319044,217680.0,0.204261,275000,224193.351857
107,50.0,6000,5,5,1948,1950,0.0,3,3,3,...,0,1,273,95705.171293,53315.275006,106628.892185,109200.0,0.205446,115000,93642.438329
688,60.0,8089,8,6,2007,2007,0.0,4,4,3,...,0,1,945,238301.018099,291959.287544,264193.528676,173000.0,0.483766,392000,241650.910598
1122,0.0,8926,4,3,1956,1956,0.0,3,3,3,...,0,0,0,82954.719947,53680.93935,86134.562738,85598.6,0.359839,112000,78152.004349
589,50.0,9100,5,6,1930,1960,0.0,3,3,3,...,0,0,0,102809.097587,75809.485825,89162.435115,96300.0,0.250524,79500,102133.768944
438,40.0,4280,5,6,1913,2002,0.0,3,3,3,...,0,0,365,115363.106108,127640.31198,144873.260986,138400.0,0.215479,90350,112075.465018


In [206]:
# It seems that GBR is typically the biggest or the smallest prediction - check this
# by counting how often each model yields the largest / smallest prediction

model_cols = ['GBR', 'LASSO', 'SVR', 'KNN']
print(testing[model_cols].idxmax(axis=1).value_counts())
print(testing[model_cols].idxmin(axis=1).value_counts())

# Encode each row's ranking as a string of column indices, smallest prediction first
# (0=GBR, 1=LASSO, 2=SVR, 3=KNN), e.g. '0213' means GBR < SVR < LASSO < KNN
stats = [''.join(str(i) for i in row)
         for row in np.argsort(testing[model_cols].to_numpy(), axis=1)]
print(stats)

testing['Order'] = stats


KNN      126
GBR       86
SVR       78
LASSO     75
dtype: int64
KNN      129
LASSO     90
SVR       77
GBR       69
dtype: int64
['0213', '2103', '0132', '0231', '0123', '1023', '2310', '0123', '2310', '3102', '3102', '1203', '2013', '1032', '0123', '2103', '3210', '3012', '0123', '3120', '0123', '3012', '1203', '1023', '0321', '3201', '2310', '2310', '3210', '1203', '0213', '3012', '3012', '2301', '3201', '0312', '3210', '1032', '3210', '2301', '2103', '3102', '0312', '2130', '2130', '1023', '1023', '1032', '0312', '3120', '3210', '2013', '0231', '1302', '3021', '1302', '1203', '3210', '1203', '1023', '0132', '1203', '1023', '3102', '3021', '0321', '1203', '3021', '3012', '0132', '3012', '1203', '1203', '3021', '1032', '3021', '1023', '1230', '3021', '0321', '0312', '0312', '3021', '3012', '2301', '3120', '1023', '3012', '1320', '2301', '1023', '1032', '1320', '3201', '3012', '0123', '3021', '2130', '1320', '3102', '2103', '3210', '1230', '2103', '3012', '2103', '0132', '0213', '3012
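
How the order codes are formed can be checked by hand on the first three rows of base-model predictions printed earlier. The sketch below assumes NumPy; `demo_preds`, `demo_order` and `model_names` are illustrative names.

```python
import numpy as np

model_names = ['GBR', 'LASSO', 'SVR', 'KNN']

# First three rows of base-model predictions from the meta-model output above
demo_preds = np.array([
    [220956.332, 255256.155, 228379.197, 267940.000],
    [146742.480, 136376.817, 133422.623, 165100.000],
    [103861.572, 119431.758, 140980.687, 120280.000],
])

# argsort returns the column indices from the smallest to the largest prediction
demo_order = [''.join(str(i) for i in row) for row in np.argsort(demo_preds, axis=1)]
print(demo_order)  # ['0213', '2103', '0132']

# Decode the first code into model names, smallest prediction first
print([model_names[int(c)] for c in demo_order[0]])  # ['GBR', 'SVR', 'LASSO', 'KNN']
```

The three codes match the first three entries of the list printed above.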

In [210]:
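# "Too high": the sale price is above the meta prediction (the model under-predicts)
print("Too high")
print(testing[(testing['SalePrice'] > testing['Meta-preds']) & (testing['Meta-score'] > 0.12)]['Order'].value_counts())
# "Too low": the sale price is below the meta prediction (the model over-predicts)
print("Too low")
print(testing[(testing['SalePrice'] < testing['Meta-preds']) & (testing['Meta-score'] > 0.12)]['Order'].value_counts())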


Too high
3021    8
1023    4
3210    4
0312    3
0213    3
2310    2
1203    2
3201    2
1302    2
2103    2
0231    1
0123    1
1032    1
0132    1
Name: Order, dtype: int64
Too low
1203    7
3210    7
2301    4
2103    3
3120    3
3201    3
3021    3
2310    3
1230    3
2013    2
2130    2
0312    1
0132    1
1023    1
0123    1
2031    1
1302    1
Name: Order, dtype: int64


In [226]:
bad = len(testing[testing['Meta-score'] > 0.12])
good = len(testing[testing['Meta-score'] <= 0.12])

print(testing[(testing['Meta-score'] > 0.12)]['Order'].value_counts()/float(bad))
print('------------')
print(testing[(testing['Meta-score'] <= 0.12)]['Order'].value_counts()/float(good))


3210    0.134146
3021    0.134146
1203    0.109756
1023    0.060976
3201    0.060976
2103    0.060976
2310    0.060976
2301    0.048780
0312    0.048780
3120    0.036585
0213    0.036585
1230    0.036585
1302    0.036585
2013    0.024390
2130    0.024390
0123    0.024390
0132    0.024390
2031    0.012195
0231    0.012195
1032    0.012195
Name: Order, dtype: float64
------------
3012    0.106007
1203    0.106007
3210    0.067138
3201    0.060071
0213    0.056537
2103    0.053004
3021    0.053004
0123    0.049470
1023    0.045936
2013    0.042403
2310    0.042403
2130    0.035336
1032    0.035336
0321    0.031802
3120    0.031802
3102    0.031802
0312    0.031802
1230    0.024735
0132    0.021201
2301    0.017668
1320    0.017668
2031    0.014134
1302    0.014134
0231    0.010601
Name: Order, dtype: float64
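
The two lists above are easier to compare when they are aligned on the same index. The following is a minimal sketch of that idea, assuming a small made-up frame in place of `testing` (the `toy` frame and its values are hypothetical, not taken from the data):

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `testing` frame.
toy = pd.DataFrame({
    "Order": ["3021", "3021", "1203", "3012", "3012", "1203"],
    "Meta-score": [0.20, 0.15, 0.13, 0.05, 0.02, 0.08],
})

is_bad = toy["Meta-score"] > 0.12

# Align the per-group shares on the union of 'Order' values;
# an 'Order' missing from one group gets a share of 0.
shares = pd.DataFrame({
    "bad": toy.loc[is_bad, "Order"].value_counts(normalize=True),
    "good": toy.loc[~is_bad, "Order"].value_counts(normalize=True),
}).fillna(0.0)

# Positive diff: the 'Order' value is over-represented among bad predictions.
shares["diff"] = shares["bad"] - shares["good"]
shares = shares.sort_values("diff", ascending=False)
print(shares)
```

Sorting by `diff` puts the `Order` values that are over-represented among poorly predicted rows at the top.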


In [229]:
print(testing[(testing['Order'] == '3012')][['GBR', 'LASSO', 'SVR', 'KNN', 'SalePrice', 'Meta-preds']])


                GBR          LASSO            SVR       KNN  SalePrice  \
89    118998.499616  124193.102875  126973.897714   83480.0     123600   
811   146953.069951  152932.558509  174667.634187  141500.0     144500   
846   195605.081478  214901.257046  215817.986077  193500.0     213000   
360   147570.879906  164664.399421  178568.502929  123280.0     156000   
1274  129501.242734  146989.686010  170143.229320  100460.0     139000   
315   177098.848869  195783.669062  222634.263626  170340.0     188500   
1047  138711.842872  144003.191283  150825.061507  132980.0     145000   
511   185471.584270  213017.767790  217218.441080  158800.0     202665   
443   181247.543864  212067.506448  217279.183519  169879.0     172500   
353   106757.742643  113013.423361  116544.910921   99550.2     105900   
965   170002.309714  182564.775951  215086.319730  152600.0     178900   
1265  169149.763305  174693.938841  175952.582647  146120.8     183900   
381   189144.677417  203088.118533  21

In [251]:
# Simplified stack: keep only the GBR and LASSO predictions as meta-features.
# SVR and KNN predictions are computed but left out of the frame.

sgbr_train = gbr.predict(X_train[X_train.columns[mask_rfecv]])
slasso_train = lasso_scaled.predict(X_train_minmax)
ssvr_train = svr_rbf.predict(X_train_minmax)
sknn_train = knb.predict(X_train[X_train.columns[mask_rfecv]])

straining = pd.DataFrame(sgbr_train, columns=['GBR'])
straining['LASSO'] = slasso_train
#straining['SVR'] = ssvr_train
#straining['KNN'] = sknn_train

In [255]:
smetagbr = GradientBoostingRegressor()
smetagbr.fit(straining, y_train)

sgbr_test = gbr.predict(X_test[X_test.columns[mask_rfecv]])
slasso_test = lasso_scaled.predict(X_test_minmax)
ssvr_test = svr_rbf.predict(X_test_minmax)
sknn_test = knb.predict(X_test[X_test.columns[mask_rfecv]])

stesting = pd.DataFrame(sgbr_test, columns=['GBR'])
stesting['LASSO'] = slasso_test
#stesting['SVR'] = ssvr_test
#stesting['KNN'] = sknn_test

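# The fitted meta-model (smetagbr) is not used for the final prediction;
# a fixed 80/20 blend of GBR and LASSO is used instead.
#smeta_preds = smetagbr.predict(stesting)
smeta_preds = 0.8*stesting['GBR'] + 0.2*stesting['LASSO']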
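
The 0.8/0.2 weights above are fixed by hand. One possible alternative, sketched here under stated assumptions, is to scan the blend weight on a separate validation split and keep the weight with the lowest RMSLE. The arrays `y_val`, `gbr_val` and `lasso_val` are made-up illustrative values, and `rmsle` is a helper defined only for this sketch:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log1p as scikit-learn does."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Hypothetical validation targets and base-model predictions.
y_val = np.array([200000.0, 133000.0, 110000.0, 192000.0])
gbr_val = np.array([220000.0, 146000.0, 104000.0, 204000.0])
lasso_val = np.array([255000.0, 136000.0, 119000.0, 226000.0])

# Scan w in {0.0, 0.1, ..., 1.0}; w is the weight given to GBR.
best_w, best_score = None, float("inf")
for w in np.linspace(0.0, 1.0, 11):
    score = rmsle(y_val, w * gbr_val + (1.0 - w) * lasso_val)
    if score < best_score:
        best_w, best_score = float(w), score

print("best GBR weight: {:.1f}, RMSLE: {:.5f}".format(best_w, best_score))
```

The weight should be chosen on data that is not used for the final evaluation; tuning it on the test split would make the reported score optimistic.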

In [256]:
from sklearn.metrics import mean_squared_log_error

# Per-sample score of the blended prediction. For a single sample,
# sqrt(MSLE) equals |log1p(prediction) - log1p(real)|.
smeta_score = []
savg_score = 0
for t, m, g, l in zip(y_test, smeta_preds, stesting['GBR'], stesting['LASSO']):
    score = sqrt(mean_squared_log_error([t], [m]))
    print("gbr: {:.3f}, lasso: {:.3f}, meta: {:.3f}, real: {}, score: {:.5f}".format(g, l, m, t, score))
    savg_score += score
    smeta_score.append(score)
savg_score = savg_score/float(len(smeta_preds))
print(savg_score)
stesting['Meta-score'] = smeta_score

gbr: 220956.332, lasso: 255256.155, meta: 227816.297, real: 200624, score: 0.12711
gbr: 146742.480, lasso: 136376.817, meta: 144669.347, real: 133000, score: 0.08410
gbr: 103861.572, lasso: 119431.758, meta: 106975.609, real: 110000, score: 0.02788
gbr: 204001.115, lasso: 226267.744, meta: 208454.441, real: 192000, score: 0.08222
gbr: 93586.409, lasso: 100473.492, meta: 94963.826, real: 88000, score: 0.07616
gbr: 98620.685, lasso: 93298.047, meta: 97556.157, real: 85000, score: 0.13778
gbr: 275521.405, lasso: 251539.958, meta: 270725.115, real: 282922, score: 0.04407
gbr: 124139.656, lasso: 142889.397, meta: 127889.604, real: 141000, score: 0.09759
gbr: 503843.381, lasso: 478176.492, meta: 498710.003, real: 745000, score: 0.40136
gbr: 155113.890, lasso: 150006.280, meta: 154092.368, real: 148800, score: 0.03495
gbr: 190898.288, lasso: 189220.879, meta: 190562.806, real: 208900, score: 0.09187
gbr: 137345.856, lasso: 127254.543, meta: 135327.593, real: 136905, score: 0.01159
gbr: 239934
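
Note that the average printed by the loop above is a mean of per-sample scores. Since each per-sample score is an absolute log error, that average is a mean absolute log error, which is never larger than the RMSLE computed over all samples at once. A short check, using the `meta` and `real` values of the first six rows printed above:

```python
import numpy as np

# 'real' and 'meta' values from the first six printed rows.
real = np.array([200624, 133000, 110000, 192000, 88000, 85000], dtype=float)
meta = np.array([227816.297, 144669.347, 106975.609,
                 208454.441, 94963.826, 97556.157])

log_err = np.log1p(meta) - np.log1p(real)

# What the loop averages: the mean of the per-sample scores.
mean_abs_log_err = float(np.mean(np.abs(log_err)))

# RMSLE over all samples at once (square first, average, then take the root).
rmsle_all = float(np.sqrt(np.mean(log_err ** 2)))

print("mean of per-sample scores: {:.5f}".format(mean_abs_log_err))
print("RMSLE over all samples:    {:.5f}".format(rmsle_all))
```

The two numbers differ most when a few samples have large errors, such as the 745000 sale in the output above, so the loop's average should not be compared directly with an RMSLE reported elsewhere.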

In [282]:
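from sklearn.preprocessing import MinMaxScaler

# Stack four base models (GBR, LASSO, SVR, KNN): append their predictions to the
# RFECV-selected features, then fit a meta-model on the combined frame.

gbr_train = gbr.predict(X_train[X_train.columns[mask_rfecv]])
lasso_train = lasso_scaled.predict(X_train_minmax)
svr_train = svr_rbf.predict(X_train_minmax)
knn_train = knb.predict(X_train[X_train.columns[mask_rfecv]])

# Copy the selected columns so the new columns are not written into a view of X_train.
training = X_train[X_train.columns[mask_rfecv]].copy()
training['GBR'] = gbr_train
training['LASSO'] = lasso_train
training['SVR'] = svr_train
training['KNN'] = knn_train

# Scaled copy of the meta-features for scale-sensitive meta-models.
mmm = MinMaxScaler()
trmmm = mmm.fit_transform(training)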
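
In the cell above, the meta-features are predictions on `X_train` itself. If the base models were fitted on those same rows, their training predictions will look better than their test predictions, and the meta-model may over-trust them. A common alternative, sketched here on synthetic data and not taken from this notebook, is to build the meta-features from out-of-fold predictions with `cross_val_predict`. The dataset, the model settings and the names `X_demo`, `y_demo`, `oof` and `meta` are assumptions made for illustration:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_predict

# Synthetic regression data standing in for the house-price features.
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=0)

base_models = {
    "GBR": GradientBoostingRegressor(n_estimators=50, random_state=0),
    "LASSO": Lasso(alpha=1.0),
}

# Each row's prediction comes from a model that did not see that row.
oof = pd.DataFrame({
    name: cross_val_predict(model, X_demo, y_demo, cv=5)
    for name, model in base_models.items()
})

# Fit the meta-model on the out-of-fold predictions only.
meta = LinearRegression().fit(oof, y_demo)
print(dict(zip(oof.columns, meta.coef_)))
```

For predictions on new data, the base models would then be refitted on the full training set before their outputs are passed to `meta`.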

In [354]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import RadiusNeighborsRegressor

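# Radius-based neighbours on the unscaled meta-features. A test row with no
# training row within the radius gets a NaN prediction.
rkn = RadiusNeighborsRegressor(radius=100)
rkn.fit(training, y_train)

# 0.0835
#msvr = SVR(kernel='linear', C=100000, gamma=0.01)
#msvr.fit(trmmm, y_train)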

# 0.0804
#mkn = KNeighborsRegressor(n_neighbors=50, weights='distance')
#mkn.fit(training, y_train)

# 0.0841
#lr = LinearRegression() 
#lr.fit(training, y_train)

gbr_test = gbr.predict(X_test[X_test.columns[mask_rfecv]])
lasso_test = lasso_scaled.predict(X_test_minmax)
svr_test = svr_rbf.predict(X_test_minmax)
knn_test = knb.predict(X_test[X_test.columns[mask_rfecv]])

testing = X_test[X_test.columns[mask_rfecv]].copy()
testing['GBR'] = gbr_test
testing['LASSO'] = lasso_test
testing['SVR'] = svr_test
testing['KNN'] = knn_test

print(testing)
#temmm = mmm.transform(testing)
meta_preds = rkn.predict(testing)



      LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  YearRemodAdd  \
529           0.0    32668            6            3       1957          1975   
491          79.0     9490            6            7       1941          1950   
459           0.0     7015            5            4       1950          1950   
279          83.0    10005            7            5       1977          1977   
655          21.0     1680            6            5       1971          1971   
1013         60.0     7200            5            4       1910          2006   
1403         49.0    15256            8            5       2007          2007   
601          50.0     9000            6            6       1937          1950   
1182        160.0    15623           10            5       1996          1996   
687           0.0     5105            7            5       2004          2004   
1317         47.0     4230            7            5       2006          2007   
1003          0.0    11500  

  out=out, **kwargs)
  ret, rcount, out=ret, casting='unsafe', subok=False)


In [348]:
from sklearn.metrics import mean_squared_log_error

# Per-sample score of the meta prediction.
# mean_squared_log_error raises a ValueError if meta_preds contains NaN.
meta_score = []
avg_score = 0
for t, m, g, l, s, v in zip(y_test, meta_preds, testing['GBR'], testing['LASSO'], testing['SVR'], testing['KNN']):
    score = sqrt(mean_squared_log_error([t], [m]))
    print("gbr: {:.3f}, lasso: {:.3f}, svr: {:.3f}, knn: {:.3f}, meta: {:.3f}, real: {}, score: {:.5f}".format(g, l, s, v, m, t, score))
    avg_score += score
    meta_score.append(score)
avg_score = avg_score/float(len(meta_preds))
print(avg_score)
testing['Meta-score'] = meta_score

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
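
A likely cause of this error is the `RadiusNeighborsRegressor`: a query point with no training point inside the radius has nothing to average, so its prediction is NaN. The sketch below reproduces this on made-up one-dimensional data and shows one possible guard, filling the gaps with a fallback value. The arrays and the choice of the training mean as fallback are assumptions for illustration; in the notebook another model's prediction (for example the GBR column) could serve the same purpose.

```python
import warnings

import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor

# Made-up training data and two query points; the second one lies far
# away from every training point.
X_fit = np.array([[0.0], [1.0], [2.0]])
y_fit = np.array([10.0, 20.0, 30.0])
X_new = np.array([[1.5], [500.0]])

rnr = RadiusNeighborsRegressor(radius=1.0).fit(X_fit, y_fit)

# scikit-learn warns about empty neighbourhoods; silence it for the demo.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    preds = rnr.predict(X_new)

# Rows without any neighbour inside the radius come back as NaN.
missing = np.isnan(preds)
print("rows without neighbours:", int(missing.sum()))

# Fill the gaps with a fallback value before scoring.
fallback = y_fit.mean()
filled = np.where(missing, fallback, preds)
print(filled)
```

Scaling the features before fitting, or choosing a larger radius, would reduce the number of empty neighbourhoods, but checking `np.isnan(meta_preds)` before scoring is still a cheap safeguard.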