### Basic Overview

We will explore gradient boosting (using H2O's gradient boosting machine) to build a model that predicts housing prices from the relevant data.

In [1]:
import pandas as pd
import numpy as np
import sys
sys.path.append('../../common_routines/')
from relevant_functions import\
    evaluate_model_score_given_predictions,\
    evaluate_model_score


#### Get clean data first

In [2]:
train_data = pd.read_csv('../../cleaned_input/train_data.csv')
validation_data = pd.read_csv('../../cleaned_input/validation_data.csv')

In [3]:
train_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotArea', 'Street', 'LotShape',
       'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       ...
       'LogGarageArea', 'LogWoodDeckSF', 'LogOpenPorchSF', 'LogEnclosedPorch',
       'Log3SsnPorch', 'LogScreenPorch', 'LogPoolArea', 'LogMiscVal',
       'LogSalePrice', 'LogMasVnrArea_times_not_missing'],
      dtype='object', length=102)

In [4]:
train_validation_data = pd.concat([train_data, validation_data])

In [5]:
test_data = pd.read_csv('../../input/test.csv')

In [6]:
test_data.isnull().sum().sum()

7000

In [7]:
# Are they indeed clean?
train_data.isnull().sum().any()

False

In [8]:
validation_data.isnull().sum().any()

False

### Brief framework.

We will build the model according to the following framework (similar to what we did for PCA).

Given a set of columns, we should be able to do the following:

1. Train model on training set.

2. Validate on validation set.

3. Generate predictions on test data.

4. Do cross validation on combined set of training/validation data.


In [10]:
ALL_CATEGORICAL_COLUMNS = ['MSSubClass',
 'MSZoning',
 'Street',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'PoolQC',
 'Fence',
 'MiscFeature',
 'MoSold',
 'YrSold',
 'SaleType',
 'SaleCondition']

In [11]:
ALL_NUMERICAL_COLUMNS = ['LotArea',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea_times_not_missing',
 'MasVnrArea_not_missing',
 'BsmtFinSF1',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt_times_not_missing',
 'GarageYrBlt_not_missing',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal']

In [12]:
ALL_COLUMNS = ALL_CATEGORICAL_COLUMNS + ALL_NUMERICAL_COLUMNS

In [13]:
# Columns the model is to be trained on
# Check out ExterQual and YearBuilt instead of YearRemodAdd
# Check out BsmtCond
#cat_cols_in_model = ['MSSubClass', 'Neighborhood', 'ExterQual', 'Foundation', 'BsmtQual', 'BsmtCond',
#                     'BsmtFinType1']
#numeric_cols_in_model = ['GrLivArea', 'OverallQual', 'OverallCond', 'YearRemodAdd', 'BsmtFinSF1', 
#                         'TotalBsmtSF']
# Check out GarageCond
cat_cols_in_model = ['MSSubClass', 'Neighborhood']
numeric_cols_in_model = ['GrLivArea', 'OverallQual', 'OverallCond', 'YearRemodAdd', 'BsmtFinSF1', 
                         'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GarageArea', 'LowQualFinSF']

all_cols_in_model = cat_cols_in_model + numeric_cols_in_model

dep_var_col = 'LogSalePrice'

In [14]:
all_cols_in_model

['MSSubClass',
 'Neighborhood',
 'GrLivArea',
 'OverallQual',
 'OverallCond',
 'YearRemodAdd',
 'BsmtFinSF1',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'GarageArea',
 'LowQualFinSF']

#### Training model on the training set


In [15]:
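def get_h2o_frame_with_rel_factors(data):
    # Convert a pandas DataFrame to an H2OFrame and mark every categorical
    # column as a factor so the GBM does not treat it as numeric.
    data_h2o = h2o.H2OFrame(data)
    for col in ALL_CATEGORICAL_COLUMNS:
        data_h2o[col] = data_h2o[col].asfactor()
    return data_h2o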

In [16]:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Start (or connect to) a local H2O cluster before any H2OFrame is created.
h2o.init()

In [17]:
hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', seed=1)
hpr_1.train(x=all_cols_in_model, 
            y=dep_var_col, 
            training_frame=get_h2o_frame_with_rel_factors(train_data))

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [18]:
predict_out = hpr_1.predict(
    get_h2o_frame_with_rel_factors(train_data))                
train_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%


In [19]:
#evaluate_model_score_given_predictions(np.log(train_data['Predictions'].values), 
#                                       np.log(train_data[dep_var_col].values))

In [20]:
evaluate_model_score_given_predictions((train_data['Predictions'].values), 
                                       (train_data[dep_var_col].values))

0.07767421000623992
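
The source of `evaluate_model_score_given_predictions` is not shown in this notebook. The cross-validated score computed with it further down agrees with the RMSE that H2O reports, so a reasonable guess is that it computes the root mean squared error on the log prices. A minimal sketch under that assumption (the name `rmse` and the toy numbers are ours, not the helper's):

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error between two equal-length sequences."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy log-price values, purely illustrative (not taken from the dataset).
print(rmse([12.0, 11.5, 12.5], [12.1, 11.5, 12.3]))
```

Because the target is `LogSalePrice`, an error of this kind is measured in log units, which is why the commented-out variant that applies `np.log` a second time is not needed.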

#### Inspect the output model

It may seem trivial, but we should inspect the model to see what exactly it does. This matters in data science, where it is easy to get entangled in a quagmire of models and functions without clearly understanding what any of them does.


In [21]:
from h2o.tree import H2OTree
tree = H2OTree(model = hpr_1, tree_number = 0, tree_class = None)

In [22]:
tree

<h2o.tree.tree.H2OTree at 0x125358a20>

In [23]:
len(tree)

57

In [24]:
print(tree)

Tree related to model housing_price_regression. Tree number is 0, tree class is 'None'




In [25]:
tree.levels

[None,
 None,
 None,
 ['BrDale', 'BrkSide', 'Edwards', 'IDOTRR', 'MeadowV', 'OldTown'],
 ['Blmngtn',
  'Blueste',
  'ClearCr',
  'CollgCr',
  'Crawfor',
  'Gilbert',
  'Mitchel',
  'NAmes',
  'NPkVill',
  'NWAmes',
  'NoRidge',
  'NridgHt',
  'SWISU',
  'Sawyer',
  'SawyerW',
  'Somerst',
  'StoneBr',
  'Timber',
  'Veenker'],
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 ['Blmngtn',
  'Blueste',
  'Mitchel',
  'NAmes',
  'NPkVill',
  'SWISU',
  'Sawyer',
  'SawyerW'],
 ['ClearCr',
  'CollgCr',
  'Crawfor',
  'Gilbert',
  'NWAmes',
  'NoRidge',
  'NridgHt',
  'Somerst',
  'StoneBr',
  'Timber',
  'Veenker'],
 None,
 None,
 None,
 None,
 None,
 None,
 ['CollgCr', 'Edwards', 'Somerst', 'Timber'],
 ['Blmngtn',
  'Blueste',
  'BrDale',
  'BrkSide',
  'ClearCr',
  'Crawfor',
  'Gilbert',
  'IDOTRR',
  'MeadowV',
  'Mitchel',
  'NAmes',
  'NPkVill',
  'NWAmes',
  'NoRidge',
  'NridgHt',
  'OldTown',
  'SWISU',
  'Sawyer',
  '

In [26]:
tree.tree_number

0

In [27]:
tree.show()

Tree related to model housing_price_regression. Tree number is 0, tree class is 'None'




In [28]:
print(tree.root_node)

Node ID 0 
Left child node ID = 1
Right child node ID = 2

Splits on column OverallQual
Split threshold < 6.5 to the left node, >= 6.5 to the right node 

NA values go to the LEFT


In [29]:
print(tree.root_node.left_child)

Node ID 1 
Left child node ID = 3
Right child node ID = 4

Splits on column Neighborhood
  - Categorical levels going to the left node: ['BrDale', 'BrkSide', 'Edwards', 'IDOTRR', 'MeadowV', 'OldTown']
  - Categorical levels going to the right node: ['Blmngtn', 'Blueste', 'ClearCr', 'CollgCr', 'Crawfor', 'Gilbert', 'Mitchel', 'NAmes', 'NPkVill', 'NWAmes', 'NoRidge', 'NridgHt', 'SWISU', 'Sawyer', 'SawyerW', 'Somerst', 'StoneBr', 'Timber', 'Veenker']

NA values go to the RIGHT
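
To make the two printed nodes concrete, here is a small sketch that routes a single row through them by hand. It only reproduces the splits shown above for node 0 and node 1; the function name is hypothetical, `None` stands in for a missing value, and nothing here calls H2O.

```python
# Splits copied from the two printed nodes of tree 0.
LEFT_NEIGHBORHOODS = {'BrDale', 'BrkSide', 'Edwards', 'IDOTRR', 'MeadowV', 'OldTown'}

def route_two_levels(overall_qual, neighborhood):
    """Return the node ID a row reaches after the first two printed splits.

    None stands in for a missing (NA) value.
    """
    # Root node (ID 0): OverallQual < 6.5 goes left, and NA also goes left.
    if overall_qual is not None and overall_qual >= 6.5:
        return 2  # right child of the root; its own split is not printed above
    # Node 1: split on Neighborhood; NA goes right.
    if neighborhood in LEFT_NEIGHBORHOODS:
        return 3
    return 4

print(route_two_levels(5, 'OldTown'))   # 3
print(route_two_levels(5, 'NAmes'))     # 4
print(route_two_levels(8, 'NridgHt'))   # 2
```

Reading a few rows through the first tree this way is a quick sanity check that the model's first decisions (overall quality, then neighborhood) match intuition about house prices.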


#### Testing model on validation set

In [30]:
predict_out = hpr_1.predict(
    get_h2o_frame_with_rel_factors(validation_data))                
validation_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%


In [31]:
#evaluate_model_score_given_predictions(np.log(validation_data['Predictions'].values), 
#                                       np.log(validation_data[dep_var_col].values))

In [32]:
validation_score = evaluate_model_score_given_predictions((validation_data['Predictions'].values), 
                                                          (validation_data[dep_var_col].values))
print(validation_score)

0.1406531537323738


#### Generate predictions on test data

In [33]:
test_data_one_hot = pd.read_csv('../../cleaned_input/test_data_one_hot.csv')

In [34]:
test_data['LogMasVnrArea_times_not_missing'] = test_data_one_hot['LogMasVnrArea_times_not_missing']

In [35]:
test_data[all_cols_in_model].isnull().sum()

MSSubClass      0
Neighborhood    0
GrLivArea       0
OverallQual     0
OverallCond     0
YearRemodAdd    0
BsmtFinSF1      1
TotalBsmtSF     1
1stFlrSF        0
2ndFlrSF        0
GarageArea      1
LowQualFinSF    0
dtype: int64

In [36]:
hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', seed=1)
hpr_1.train(x=all_cols_in_model, 
            y=dep_var_col, 
            training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [37]:
predict_out = hpr_1.predict(
    get_h2o_frame_with_rel_factors(test_data))                
test_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%




#### Cross validation

In [38]:
# Do a 5-fold cross validation.
hpr_cross_val = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
                                             seed=1, 
                                             nfolds=5,
                                             keep_cross_validation_predictions=True)
hpr_cross_val.train(x=all_cols_in_model, 
                    y=dep_var_col, 
                    training_frame=get_h2o_frame_with_rel_factors(train_validation_data))


Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [39]:
hpr_cross_val.cross_validation_predictions

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  housing_price_regression


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.00632110403238809
RMSE: 0.07950537109144319
MAE: 0.05716325485543029
RMSLE: 0.006173452437315788
Mean Residual Deviance: 0.00632110403238809

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 0.01845607613430793
RMSE: 0.13585314179034627
MAE: 0.093444616995405
RMSLE: 0.010580527456288643
Mean Residual Deviance: 0.01845607613430793
Cross-Validation Metrics Summary: 


| | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|
| mae | 0.0933439 | 0.0065253 | 0.1025743 | 0.0764201 | 0.0961447 | 0.0999359 | 0.0916443 |
| mean_residual_deviance | 0.0184260 | 0.0028112 | 0.0221449 | 0.0114161 | 0.0185095 | 0.0223651 | 0.0176943 |
| mse | 0.0184260 | 0.0028112 | 0.0221449 | 0.0114161 | 0.0185095 | 0.0223651 | 0.0176943 |
| r2 | 0.8844129 | 0.0140069 | 0.8861869 | 0.907745 | 0.876609 | 0.851228 | 0.9002959 |
| residual_deviance | 0.0184260 | 0.0028112 | 0.0221449 | 0.0114161 | 0.0185095 | 0.0223651 | 0.0176943 |
| rmse | 0.1348554 | 0.0109546 | 0.1488117 | 0.1068459 | 0.1360497 | 0.1495498 | 0.1330200 |
| rmsle | 0.0104985 | 0.0008811 | 0.0115970 | 0.0082262 | 0.0105690 | 0.0116745 | 0.0104259 |


Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
|---|---|---|---|---|---|
| 2019-04-10 14:29:23 | 0.800 sec | 0.0 | 0.3993151 | 0.3098117 | 0.1594525 |
| 2019-04-10 14:29:23 | 0.808 sec | 1.0 | 0.3662428 | 0.2833855 | 0.1341338 |
| 2019-04-10 14:29:23 | 0.812 sec | 2.0 | 0.3366125 | 0.2590401 | 0.1133080 |
| 2019-04-10 14:29:23 | 0.816 sec | 3.0 | 0.3103217 | 0.2378703 | 0.0962996 |
| 2019-04-10 14:29:23 | 0.821 sec | 4.0 | 0.2865193 | 0.2182946 | 0.0820933 |
| ... | ... | ... | ... | ... | ... |
| 2019-04-10 14:29:23 | 1.017 sec | 46.0 | 0.0808845 | 0.0581656 | 0.0065423 |
| 2019-04-10 14:29:23 | 1.021 sec | 47.0 | 0.0805606 | 0.0579031 | 0.0064900 |
| 2019-04-10 14:29:23 | 1.025 sec | 48.0 | 0.0803071 | 0.0577252 | 0.0064492 |



See the whole table with table.as_data_frame()
Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| OverallQual | 626.3391724 | 1.0 | 0.5322870 |
| Neighborhood | 167.4724884 | 0.2673831 | 0.1423245 |
| GrLivArea | 151.6552582 | 0.2421296 | 0.1288824 |
| GarageArea | 54.3965988 | 0.0868485 | 0.0462283 |
| TotalBsmtSF | 44.1078262 | 0.0704216 | 0.0374845 |
| BsmtFinSF1 | 35.6226578 | 0.0568744 | 0.0302735 |
| 1stFlrSF | 27.0880566 | 0.0432482 | 0.0230205 |
| OverallCond | 23.7564754 | 0.0379291 | 0.0201892 |
| MSSubClass | 23.3876972 | 0.0373403 | 0.0198758 |


<bound method ModelBase.cross_validation_predictions of >
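
The output reports two cross-validation RMSE figures that are easy to confuse: 0.1358531 in the "Reported on cross-validation data" block and 0.1348554 in the `mean` column of the summary. The arithmetic below, using only numbers copied from that output, is a sketch of how they relate. That the first figure pools all holdout rows is our reading of the output, not something the notebook states.

```python
import math

# Values copied from the cross-validation output above.
pooled_cv_mse = 0.01845607613430793
pooled_cv_rmse = 0.13585314179034627
fold_mse = [0.0221449, 0.0114161, 0.0185095, 0.0223651, 0.0176943]
fold_rmse = [0.1488117, 0.1068459, 0.1360497, 0.1495498, 0.1330200]

# RMSE is the square root of MSE.
print(math.sqrt(pooled_cv_mse), pooled_cv_rmse)

# The "mean" column averages the per-fold values.
mean_fold_mse = sum(fold_mse) / len(fold_mse)
mean_fold_rmse = sum(fold_rmse) / len(fold_rmse)
print(mean_fold_mse, mean_fold_rmse)

# Averaging square roots gives a smaller number than taking the square
# root of the average, which is one reason 0.1348554 sits below 0.1358531.
print(math.sqrt(mean_fold_mse) > mean_fold_rmse)
```

When comparing models it helps to pick one of the two figures and use it consistently; the rest of this notebook uses the pooled one.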

In [40]:
def get_cross_validated_rmse(hpr_cross_val):
    # Each fold's frame holds predictions for its own holdout rows and zeros
    # elsewhere, so summing the frames gives one out-of-fold prediction per row.
    cv_preds = hpr_cross_val.cross_validation_predictions()
    result_cv = cv_preds[0]['predict'].as_data_frame().copy()
    for fold_preds in cv_preds[1:]:
        result_cv += fold_preds['predict'].as_data_frame()
    return evaluate_model_score_given_predictions(result_cv, train_validation_data['LogSalePrice'])

In [41]:
get_cross_validated_rmse(hpr_cross_val)

0.13585314441613844
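
`get_cross_validated_rmse` adds the per-fold prediction frames together. As we understand H2O's behaviour, each frame returned by `cross_validation_predictions()` covers every training row, with the fold's holdout predictions in place and zeros everywhere else, so the sum yields exactly one out-of-fold prediction per row. A toy sketch of that idea in plain Python (the data and the function name are hypothetical):

```python
# Hypothetical toy data: 6 rows split into 3 folds. Each fold's vector
# holds predictions for its own holdout rows and 0.0 everywhere else.
fold_predictions = [
    [12.1, 0.0, 0.0, 11.8, 0.0, 0.0],   # fold 1 held out rows 0 and 3
    [0.0, 11.9, 0.0, 0.0, 12.4, 0.0],   # fold 2 held out rows 1 and 4
    [0.0, 0.0, 12.0, 0.0, 0.0, 11.7],   # fold 3 held out rows 2 and 5
]

def combine_out_of_fold(fold_vectors):
    """Sum zero-filled fold vectors element-wise into one prediction per row."""
    return [sum(values) for values in zip(*fold_vectors)]

combined = combine_out_of_fold(fold_predictions)
print(combined)
```

The score above (0.13585314) matches the cross-validation RMSE H2O reports for the same model, which suggests the summed frame is indeed the set of holdout predictions.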

### Try somewhat of a greedy method to select columns

In this approach, we start with an empty model and, at every step, add the predictor that decreases the cross-validation error the most. We have not fully automated it (doing so would be straightforward), but have built the model manually, inspecting the results at each step.

As simple as it seems, this works quite well. In fact, we were able to get a much better model with almost no extra effort!
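
The manual procedure could be automated along the following lines. This is a sketch, not the code used in this notebook: `greedy_forward_selection`, its parameters, and the toy scorer are all hypothetical, and the scorer is a stand-in so that the example runs without H2O.

```python
def greedy_forward_selection(candidate_cols, score_fn, max_cols=None, min_improvement=0.0):
    """Add one column at a time, always picking the one with the lowest score.

    score_fn takes a list of columns and returns an error to minimise,
    for example a cross-validated RMSE. Stops when no candidate improves
    the score by more than min_improvement, or when max_cols is reached.
    """
    selected = []
    best_score = float('inf')
    remaining = list(candidate_cols)
    while remaining and (max_cols is None or len(selected) < max_cols):
        scores = {col: score_fn(selected + [col]) for col in remaining}
        best_col = min(scores, key=scores.get)
        if best_score - scores[best_col] <= min_improvement:
            break
        selected.append(best_col)
        remaining.remove(best_col)
        best_score = scores[best_col]
    return selected, best_score


# Hypothetical stand-in for a cross-validation scorer, so the sketch runs
# without H2O: each useful column lowers the error by a fixed amount.
TOY_GAINS = {'a': 0.10, 'b': 0.05, 'c': 0.0}

def toy_score(cols):
    return 0.40 - sum(TOY_GAINS[c] for c in cols)

print(greedy_forward_selection(['a', 'b', 'c'], toy_score))
```

In this notebook, a function that takes a column list and returns the cross-validated RMSE could play the role of `score_fn`, with `ALL_COLUMNS` as the candidates. Each step trains one model per remaining column, so a full run is considerably slower than a single fit.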

In [42]:
import operator
def get_cross_val_scores_new_col(base_model_cols, 
                                 train_validation_data=train_validation_data,
                                 nfolds=5,
                                 dep_var_col='LogSalePrice'):
    columns_to_cross_val_score = dict()
    for col in ALL_COLUMNS:
        # If the column was already included, skip it.
        if col in base_model_cols:
            continue
        
        cur_model_cols = base_model_cols + [col]            
        print(cur_model_cols)

        # Do a k-fold cross validation (nfolds defaults to 5).
        hpr_cross_val = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
                                                    seed=1, 
                                                    nfolds=nfolds,
                                                    keep_cross_validation_predictions=True)
        hpr_cross_val.train(x=cur_model_cols, 
                            y=dep_var_col, 
                            training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

        cv_score = get_cross_validated_rmse(hpr_cross_val)

        columns_to_cross_val_score[col] = cv_score
    
    sorted_cross_val_scores = sorted(columns_to_cross_val_score.items(), key=operator.itemgetter(1))
    return sorted_cross_val_scores

In [43]:
def get_cross_val_scores_given_cols(cur_model_cols,
                                    train_validation_data=train_validation_data,
                                    nfolds=5,
                                    dep_var_col='LogSalePrice'):

    # Do a k-fold cross validation (nfolds defaults to 5).
    hpr_cross_val = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
                                                 seed=1, 
                                                 nfolds=nfolds,
                                                 keep_cross_validation_predictions=True)
    hpr_cross_val.train(x=cur_model_cols, 
                        y=dep_var_col, 
                        training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

    cv_score = get_cross_validated_rmse(hpr_cross_val)

    return cv_score

In [44]:
sorted_cross_val_scores = get_cross_val_scores_new_col([])

['MSSubClass']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['MSZoning']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['Street']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['LotShape']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['LandContour']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['Utilities']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |

In [46]:
sorted_cross_val_scores

[('OverallQual', 0.2314695712659162),
 ('Neighborhood', 0.2661823676386924),
 ('GrLivArea', 0.2822379597113502),
 ('GarageCars', 0.28722454372514045),
 ('GarageArea', 0.28959681068562915),
 ('BsmtQual', 0.29633568722309883),
 ('ExterQual', 0.29727394225499093),
 ('KitchenQual', 0.2975780445221094),
 ('YearBuilt', 0.30534834110461134),
 ('TotalBsmtSF', 0.3069435946305675),
 ('GarageYrBlt_times_not_missing', 0.31341028785876873),
 ('GarageFinish', 0.3150945438439022),
 ('FullBath', 0.3202448719162201),
 ('1stFlrSF', 0.3232316651641163),
 ('GarageType', 0.32759433952831857),
 ('YearRemodAdd', 0.3288379886446765),
 ('MSSubClass', 0.3304419379314102),
 ('FireplaceQu', 0.33395878921281624),
 ('Foundation', 0.3346190527933579),
 ('TotRmsAbvGrd', 0.33766254948869984),
 ('Fireplaces', 0.3437448251514915),
 ('HeatingQC', 0.351077704317355),
 ('2ndFlrSF', 0.3542051099416797),
 ('LotArea', 0.35558970955529917),
 ('BsmtFinSF1', 0.3580490106303493),
 ('BsmtFinType1', 0.3580948577653219),
 ('OpenPorc

In [47]:
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual'])

['OverallQual', 'MSSubClass']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'MSZoning']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Street']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'LotShape']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'LandContour']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Utilities']
Parse progress: |█

In [48]:
sorted_cross_val_scores

[('GrLivArea', 0.19979076978516308),
 ('Neighborhood', 0.20158124327134033),
 ('LotArea', 0.2086117136262508),
 ('1stFlrSF', 0.20916027532063497),
 ('GarageArea', 0.21118839406544518),
 ('GarageCars', 0.21210903191574046),
 ('MSSubClass', 0.21339287395479262),
 ('TotalBsmtSF', 0.21349608711799245),
 ('MSZoning', 0.21633551604252482),
 ('2ndFlrSF', 0.21636079987039075),
 ('TotRmsAbvGrd', 0.21677203601911393),
 ('BsmtFinSF1', 0.21782344972120488),
 ('FullBath', 0.21834379416833274),
 ('GarageType', 0.21854836667132044),
 ('GarageYrBlt_times_not_missing', 0.22020065807453437),
 ('Fireplaces', 0.22057968012849063),
 ('GarageFinish', 0.22065427643136734),
 ('YearRemodAdd', 0.22066299321292457),
 ('YearBuilt', 0.22091091116415848),
 ('FireplaceQu', 0.2216876850495251),
 ('BedroomAbvGr', 0.22238461662138687),
 ('BsmtQual', 0.22456533301346177),
 ('GarageQual', 0.22474393094433626),
 ('KitchenQual', 0.2250805674269091),
 ('BsmtFullBath', 0.22559548343585886),
 ('LotShape', 0.22574296129005111)

In [49]:
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 'GrLivArea'])

['OverallQual', 'GrLivArea', 'MSSubClass']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'GrLivArea', 'MSZoning']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'GrLivArea', 'Street']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'GrLivArea', 'LotShape']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'GrLivArea', 'LandContour']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |████████████████████████████████████

In [50]:
sorted_cross_val_scores


[('Neighborhood', 0.1694131644108542),
 ('MSSubClass', 0.17819759720887973),
 ('BsmtFinSF1', 0.1800213472533662),
 ('YearBuilt', 0.18038389723413192),
 ('TotalBsmtSF', 0.1822986206795278),
 ('HouseStyle', 0.18253302632686508),
 ('MSZoning', 0.18259065820783588),
 ('BsmtFinType1', 0.18353022050933876),
 ('1stFlrSF', 0.18550341051481578),
 ('GarageCars', 0.18566147744113873),
 ('GarageArea', 0.18609418568613864),
 ('2ndFlrSF', 0.18622540072183352),
 ('GarageYrBlt_times_not_missing', 0.18640116709426663),
 ('GarageFinish', 0.18729361538215003),
 ('YearRemodAdd', 0.18759867383379622),
 ('LotArea', 0.18777064009045794),
 ('GarageType', 0.18781069592456234),
 ('BsmtQual', 0.18948315114534336),
 ('Foundation', 0.1900034916035423),
 ('BsmtFullBath', 0.1904430102390244),
 ('CentralAir', 0.19092182039277808),
 ('PavedDrive', 0.19118437694645365),
 ('KitchenQual', 0.19177763699748113),
 ('BsmtExposure', 0.19190851503790426),
 ('OverallCond', 0.19279611927608084),
 ('ExterQual', 0.1931605444800647

In [51]:
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 'GrLivArea', 'Neighborhood'])

['OverallQual', 'GrLivArea', 'Neighborhood', 'MSSubClass']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'GrLivArea', 'Neighborhood', 'MSZoning']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'GrLivArea', 'Neighborhood', 'Street']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'GrLivArea', 'Neighborhood', 'LotShape']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'GrLivArea', 'Neighborhood', 'LandContour']
Parse progress: |███████████████████████████████████████████████

gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'GrLivArea', 'Neighborhood', 'MiscVal']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [52]:
sorted_cross_val_scores

[('BsmtFinSF1', 0.15387954567502765),
 ('BsmtFinType1', 0.15825619424056567),
 ('MSSubClass', 0.16099514990084143),
 ('OverallCond', 0.16099929541431593),
 ('TotalBsmtSF', 0.16144762533207346),
 ('BsmtFullBath', 0.16184424659026292),
 ('GarageFinish', 0.1622912915655121),
 ('HouseStyle', 0.16239477237419978),
 ('GarageArea', 0.16255750896671112),
 ('1stFlrSF', 0.16275949924328942),
 ('YearRemodAdd', 0.16361569880621885),
 ('GarageYrBlt_times_not_missing', 0.16380598694309897),
 ('2ndFlrSF', 0.1641174670104861),
 ('GarageType', 0.1645776904312746),
 ('BsmtCond', 0.16463858276388774),
 ('GarageCars', 0.16466629727922324),
 ('LotArea', 0.16471722621750137),
 ('BsmtExposure', 0.16480961520427026),
 ('CentralAir', 0.1650653262307796),
 ('KitchenQual', 0.16522031052414815),
 ('YearBuilt', 0.16536760888334093),
 ('PavedDrive', 0.16543479908951994),
 ('ExterCond', 0.1658601228041285),
 ('BsmtUnfSF', 0.1661532936617741),
 ('BldgType', 0.16617356793725416),
 ('BsmtQual', 0.16617993010860788),
 (

In [53]:
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
                                                        'Neighborhood', 
                                                        'GrLivArea',
                                                        'BsmtFinSF1'])

['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'MSSubClass']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'MSZoning']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'Street']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'LotShape']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'LandCont

['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'EnclosedPorch']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', '3SsnPorch']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'ScreenPorch']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'PoolArea']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 

In [54]:
sorted_cross_val_scores

[('OverallCond', 0.14694732517713285),
 ('GarageYrBlt_times_not_missing', 0.14915727375504756),
 ('MSSubClass', 0.14920160377904446),
 ('GarageArea', 0.1492092311631473),
 ('GarageFinish', 0.1494785274171415),
 ('GarageCars', 0.1497141335614295),
 ('HouseStyle', 0.15037935932884597),
 ('LotArea', 0.15047523392305015),
 ('CentralAir', 0.15049034508004536),
 ('KitchenQual', 0.15054138979589854),
 ('GarageType', 0.15070059874120537),
 ('YearRemodAdd', 0.1508575552114522),
 ('YearBuilt', 0.15090898043958162),
 ('HeatingQC', 0.15102677616424),
 ('ExterCond', 0.1510782885947169),
 ('GarageYrBlt_not_missing', 0.1518280671139293),
 ('GarageCond', 0.15183641745812906),
 ('BsmtCond', 0.15186313469186313),
 ('GarageQual', 0.15186764515239182),
 ('2ndFlrSF', 0.1519704246326977),
 ('ExterQual', 0.15212416696553385),
 ('BldgType', 0.1521633957157672),
 ('PavedDrive', 0.15220060690747703),
 ('ScreenPorch', 0.15230428978581362),
 ('TotalBsmtSF', 0.15230958502617117),
 ('1stFlrSF', 0.1523891834338011),

In [55]:

sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
                                                        'Neighborhood', 
                                                        'GrLivArea',
                                                        'BsmtFinSF1',
                                                        'OverallCond'])


['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'OverallCond', 'MSSubClass']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'OverallCond', 'MSZoning']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'OverallCond', 'Street']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'OverallCond', 'LotShape']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['Overal

['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'OverallCond', 'GarageArea']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'OverallCond', 'WoodDeckSF']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'OverallCond', 'OpenPorchSF']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'OverallCond', 'EnclosedPorch']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 1

In [56]:
sorted_cross_val_scores

[('GarageArea', 0.1402491005087443),
 ('GarageCars', 0.1404932690041224),
 ('TotalBsmtSF', 0.1405420267570022),
 ('GarageYrBlt_times_not_missing', 0.14090687372915237),
 ('MSSubClass', 0.1409860704498476),
 ('GarageFinish', 0.1415681988542476),
 ('LotArea', 0.14186102574648687),
 ('YearBuilt', 0.14188030515077824),
 ('HouseStyle', 0.1426879548952321),
 ('GarageType', 0.14274524567252642),
 ('1stFlrSF', 0.14276047005609024),
 ('BsmtUnfSF', 0.14287212994547802),
 ('PavedDrive', 0.14334310596612665),
 ('YearRemodAdd', 0.14358773344437847),
 ('CentralAir', 0.14362752746123006),
 ('GarageCond', 0.14393671600844943),
 ('GarageYrBlt_not_missing', 0.14423634966694507),
 ('SaleCondition', 0.14462333397892604),
 ('BldgType', 0.14464588985478732),
 ('BsmtQual', 0.14464933736997063),
 ('2ndFlrSF', 0.14465994888352376),
 ('Foundation', 0.14468192074481517),
 ('GarageQual', 0.14476706842958617),
 ('FireplaceQu', 0.14500840487465094),
 ('Condition1', 0.1451132943102544),
 ('ScreenPorch', 0.1451224795

In [None]:

sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
                                                        'Neighborhood', 
                                                        'GrLivArea',
                                                        'BsmtFinSF1',
                                                        'OverallCond',
                                                        'GarageArea'])


['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'OverallCond', 'GarageArea', 'MSSubClass']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'OverallCond', 'GarageArea', 'MSZoning']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'OverallCond', 'GarageArea', 'Street']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'OverallCond', 'GarageArea', 'LotShape']
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |██████

In [None]:
#sorted_cross_val_scores

In [None]:
'''
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
                                                        'Neighborhood', 
                                                        'GrLivArea',
                                                        'BsmtFinSF1',
                                                        'GarageCars',
                                                        'OverallCond',
                                                        'YearBuilt'])
'''

In [None]:
#sorted_cross_val_scores

In [None]:
'''
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
                                                        'Neighborhood', 
                                                        'GrLivArea',
                                                        'BsmtFinSF1',
                                                        'GarageCars',
                                                        'OverallCond',
                                                        'YearBuilt',
                                                        'LotArea'])
'''

In [None]:
#sorted_cross_val_scores

In [None]:
'''
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
                                                        'Neighborhood', 
                                                        'GrLivArea',
                                                        'BsmtFinSF1',
                                                        'GarageCars',
                                                        'OverallCond',
                                                        'YearBuilt',
                                                        'LotArea',
                                                        'Fireplaces'])
'''


In [None]:
#sorted_cross_val_scores

In [None]:
'''
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
                                                        'Neighborhood', 
                                                        'GrLivArea',
                                                        'BsmtFinSF1',
                                                        'GarageCars',
                                                        'OverallCond',
                                                        'YearBuilt',
                                                        'LotArea',
                                                        'Fireplaces',
                                                        'CentralAir'])
'''


In [None]:
#sorted_cross_val_scores

In [None]:
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
                                                        'Neighborhood', 
                                                        'GrLivArea',
                                                        'BsmtFinSF1',
                                                        'GarageCars',
                                                        'OverallCond',
                                                        'YearBuilt',
                                                        'LotArea',
                                                        'Fireplaces',
                                                        'CentralAir',
                                                        'ExterCond'])



In [None]:
sorted_cross_val_scores
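
The cells above repeat one step by hand: score every remaining column when added to the current set, keep the best one, and run again. A minimal sketch of that loop follows. `forward_select`, `score_fn` and `toy_score` are hypothetical names introduced for illustration, not part of this notebook. In practice `score_fn` would wrap a cross validation call that returns an error where lower is better.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(candidates: Sequence[str],
                   score_fn: Callable[[List[str]], float],
                   start: Sequence[str] = (),
                   min_gain: float = 0.0) -> Tuple[List[str], float]:
    """Greedy forward selection that minimises score_fn (e.g. a CV RMSE)."""
    selected = list(start)
    best = score_fn(selected) if selected else float('inf')
    remaining = [c for c in candidates if c not in selected]
    while remaining:
        # Score every one-column extension; lowest error sorts first.
        scored = sorted((score_fn(selected + [c]), c) for c in remaining)
        new_score, new_col = scored[0]
        if best - new_score <= min_gain:
            break  # no candidate improves the score: we are on a plateau
        selected.append(new_col)
        remaining.remove(new_col)
        best = new_score
    return selected, best


# Toy scoring function (made-up numbers): useful columns lower the error,
# and every extra column carries a small penalty.
useful = {'OverallQual': 0.10, 'GrLivArea': 0.05}


def toy_score(cols: List[str]) -> float:
    return 0.30 - sum(useful.get(c, 0.0) for c in cols) + 0.001 * len(cols)


chosen, score = forward_select(['Street', 'GrLivArea', 'OverallQual'], toy_score)
print(chosen, score)
```

Setting `min_gain` above zero stops the search earlier, which helps when the remaining gains are smaller than the fold-to-fold noise.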

We see that the improvement in the cross validation score is diminishing, indicating that we have reached a plateau. Let us check our cross validation score now.

In [None]:
final_cols = ['OverallQual', 
              'Neighborhood', 
              'GrLivArea', 
              'BsmtFinSF1',
              'GarageCars',
              'OverallCond',
              'YearBuilt',
              'LotArea',
              'Fireplaces',
              'CentralAir',
              'ExterCond']

In [None]:
cv_score = get_cross_val_scores_given_cols(final_cols)
print(cv_score)


#### Complete summary of the cross validation metrics.

In [None]:
# Do a 5-fold cross validation (nfolds=5) and keep the per-fold predictions.
hpr_cross_val = H2ORandomForestEstimator(model_id='housing_price_regression', 
                                         seed=1, 
                                         nfolds=5,
                                         keep_cross_validation_predictions=True)
hpr_cross_val.train(x=final_cols, 
                    y=dep_var_col, 
                    training_frame=get_h2o_frame_with_rel_factors(train_validation_data))


In [None]:
hpr_cross_val

In [None]:
rmses = np.array([0.1566032, 0.1134766, 0.1250018, 0.1384239, 0.1279203])

In [None]:
rmses.std()

In [None]:
rmses.mean()
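
One detail worth keeping in mind when summarising fold scores: `ndarray.std()` uses `ddof=0` (population standard deviation) by default. With only five folds, the sample standard deviation (`ddof=1`) is noticeably larger. The sketch below reuses the five fold RMSEs listed above; the variable names other than `rmses` are introduced here for illustration.

```python
import numpy as np

# Per-fold RMSEs copied from the cell above.
rmses = np.array([0.1566032, 0.1134766, 0.1250018, 0.1384239, 0.1279203])

mean_rmse = rmses.mean()
pop_std = rmses.std()              # ddof=0, the NumPy default
sample_std = rmses.std(ddof=1)     # unbiased variance estimate, n - 1 in the denominator
std_error = sample_std / np.sqrt(len(rmses))  # standard error of the mean

print(f"mean={mean_rmse:.4f} pop_std={pop_std:.4f} "
      f"sample_std={sample_std:.4f} std_error={std_error:.4f}")
```

When comparing two models, a difference in mean RMSE that is smaller than the standard error is hard to distinguish from fold noise.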

### Quantifying validation error on validation data set.

Ideally this is not necessary, as we have already performed cross validation over the combined training and validation data. It is still useful, though, because it gives a comparable yardstick against other models and can help with ensembling as well.

In [None]:
# Train on the training set only, then score on the held-out validation set.
model_train_data = H2ORandomForestEstimator(model_id='housing_price_regression', 
                                            seed=1)
model_train_data.train(x=final_cols, 
                       y=dep_var_col, 
                       training_frame=get_h2o_frame_with_rel_factors(train_data))

predict_validation = model_train_data.predict(
    get_h2o_frame_with_rel_factors(validation_data))                

validation_data['Predictions'] = predict_validation['predict'].as_data_frame()
evaluate_model_score_given_predictions(validation_data['LogSalePrice'], validation_data['Predictions'])
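
The definition of `evaluate_model_score_given_predictions` lives in `relevant_functions` and is not shown in this notebook. Assuming it reports the root mean squared error between actual and predicted log prices (an assumption, although it would match the scale of the fold RMSEs above), a stand-in could look like the following. `rmse_of_logs` is a hypothetical name.

```python
import numpy as np


def rmse_of_logs(actual_log, predicted_log):
    """Root mean squared error between two sequences of log prices."""
    actual_log = np.asarray(actual_log, dtype=float)
    predicted_log = np.asarray(predicted_log, dtype=float)
    return float(np.sqrt(np.mean((actual_log - predicted_log) ** 2)))


# Toy values, not taken from the housing data.
perfect = rmse_of_logs([12.0, 11.5], [12.0, 11.5])
off = rmse_of_logs([0.0, 0.0], [3.0, 4.0])
print(perfect, off)
```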


In [None]:
# Dump validation data to a file, which will be used later for ensembling.
validation_data[['Id', 'LogSalePrice', 'Predictions']].to_csv('housing_price_random_forest_validation.csv', 
                                                              index=False)

### Making predictions on test data

Now, there is another important matter to consider. Before making predictions on the test data, we must make sure that the test data itself is usable.

One problem that could be relevant here is the presence of many null values in the test set that do not occur in the training set. Let us check whether that is the case for the columns we selected.

In [None]:
test_data[final_cols].isnull().sum()

Okay, things look good. Let us generate predictions on the test set.

In [None]:
final_model = H2ORandomForestEstimator(model_id='housing_price_regression', seed=1)
final_model.train(x=final_cols, 
                  y=dep_var_col, 
                  training_frame=get_h2o_frame_with_rel_factors(train_validation_data))


In [None]:
predict_out = final_model.predict(
    get_h2o_frame_with_rel_factors(test_data))                


In [None]:
test_data['LogSalePrice'] = predict_out['predict'].as_data_frame()

In [None]:
test_data['SalePrice'] = np.exp(test_data['LogSalePrice'])

In [None]:
test_data[['Id', 'SalePrice']].to_csv('housing_price_random_forest_predictions.csv', index=False)
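
Before writing the file, it can be worth checking the frame for problems that would invalidate a submission. The sketch below is one possible check, run on a toy frame. `check_submission` is a hypothetical helper, and the back-transform assumes `LogSalePrice` was created with `np.log` (if the cleaning step used `np.log1p`, the inverse would be `np.expm1`).

```python
import numpy as np
import pandas as pd


def check_submission(frame, id_col='Id', price_col='SalePrice'):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    prices = frame[price_col].dropna()
    if frame[id_col].duplicated().any():
        problems.append('duplicate ids')
    if frame[price_col].isnull().any():
        problems.append('missing prices')
    if not np.isfinite(prices).all():
        problems.append('non-finite prices')
    if (prices <= 0).any():
        problems.append('non-positive prices')
    return problems


# Toy frame with a repeated Id and a missing prediction.
toy = pd.DataFrame({'Id': [1, 2, 2],
                    'LogSalePrice': [11.7, np.nan, 12.1]})
toy['SalePrice'] = np.exp(toy['LogSalePrice'])
issues = check_submission(toy)
print(issues)
```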