### Basic Overview

We will explore gradient boosting methods (in the h2o framework) to build a model that predicts housing prices from the relevant data.

We will examine every step in detail, to make sure that we make full use of the library.

In [105]:
import pandas as pd
import numpy as np
import sys
sys.path.append('../../common_routines/')
from relevant_functions import\
    evaluate_model_score_given_predictions,\
    evaluate_model_score


#### Get clean data first

In [106]:
train_data = pd.read_csv('../../cleaned_input/train_data.csv')
validation_data = pd.read_csv('../../cleaned_input/validation_data.csv')

In [107]:
train_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotArea', 'Street', 'LotShape',
       'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       ...
       'LogGarageArea', 'LogWoodDeckSF', 'LogOpenPorchSF', 'LogEnclosedPorch',
       'Log3SsnPorch', 'LogScreenPorch', 'LogPoolArea', 'LogMiscVal',
       'LogSalePrice', 'LogMasVnrArea_times_not_missing'],
      dtype='object', length=102)

In [108]:
train_validation_data = pd.concat([train_data, validation_data])

In [109]:
test_data = pd.read_csv('../../cleaned_input/test.csv')

In [110]:
test_data.isnull().sum().sum()

27

In [111]:
## Are they indeed clean ?
train_data.isnull().sum().any()

False

In [112]:
validation_data.isnull().sum().any()

False

#### Get h2o up and running !


In [113]:
# Using h2o
import h2o
h2o.init(nthreads = -1, max_mem_size = 15)

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O cluster uptime: 16 hours 44 mins
H2O cluster timezone: America/New_York
H2O data parsing timezone: UTC
H2O cluster version: 3.24.0.1
H2O cluster version age: 11 days
H2O cluster name: H2O_from_python_babs4JESUS_auajwj
H2O cluster total nodes: 1
H2O cluster free memory: 12.84 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8


### Brief framework.

We will build the model according to the following framework (similar to what we did for PCA).

Given a set of columns, we should be able to do the following:

1. Train model on training set.

2. Validate on validation set.

3. Generate predictions on test data.

4. Do cross validation on combined set of training/validation data.


In [114]:
ALL_CATEGORICAL_COLUMNS = ['MSSubClass',
 'MSZoning',
 'Street',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'PoolQC',
 'Fence',
 'MiscFeature',
 'MoSold',
 'YrSold',
 'SaleType',
 'SaleCondition']

In [115]:
ALL_NUMERICAL_COLUMNS = ['LotArea',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea_times_not_missing',
 'MasVnrArea_not_missing',
 'BsmtFinSF1',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt_times_not_missing',
 'GarageYrBlt_not_missing',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal']

In [116]:
ALL_COLUMNS = ALL_CATEGORICAL_COLUMNS + ALL_NUMERICAL_COLUMNS

In [117]:
# Columns the model is to be trained on
# Check out ExterQual and YearBuilt instead of YearRemodAdd
# Check out BsmtCond
#cat_cols_in_model = ['MSSubClass', 'Neighborhood', 'ExterQual', 'Foundation', 'BsmtQual', 'BsmtCond',
#                     'BsmtFinType1']
#numeric_cols_in_model = ['GrLivArea', 'OverallQual', 'OverallCond', 'YearRemodAdd', 'BsmtFinSF1', 
#                         'TotalBsmtSF']
# Check out GarageCond
cat_cols_in_model = ['MSSubClass', 'Neighborhood']
numeric_cols_in_model = ['GrLivArea', 'OverallQual', 'OverallCond', 'YearRemodAdd', 'BsmtFinSF1', 
                         'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GarageArea', 'LowQualFinSF']

all_cols_in_model = cat_cols_in_model + numeric_cols_in_model
# Override the list above: start with a single predictor for now.
all_cols_in_model = ['OverallQual']
dep_var_col = 'LogSalePrice'

In [118]:
all_cols_in_model

['OverallQual']

#### Training model on the training set


In [119]:
def get_h2o_frame_with_rel_factors(data):
    """Convert a pandas DataFrame to an H2OFrame, marking the categorical columns as factors."""
    data_h2o = h2o.H2OFrame(data)
    for col in ALL_CATEGORICAL_COLUMNS:
        data_h2o[col] = data_h2o[col].asfactor()
    return data_h2o

In [120]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [121]:
hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', seed=1)
hpr_1.train(x=all_cols_in_model, 
            y=dep_var_col, 
            training_frame=get_h2o_frame_with_rel_factors(train_data))

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [122]:
predict_out = hpr_1.predict(
    get_h2o_frame_with_rel_factors(train_data))                
train_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%


In [123]:
evaluate_model_score_given_predictions((train_data['Predictions'].values), 
                                       (train_data[dep_var_col].values))

0.2297343624924363
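The scoring helper `evaluate_model_score_given_predictions` is imported from `relevant_functions`, and its source is not shown here. Since the target is `LogSalePrice` and a later helper is named `get_cross_validated_rmse`, it presumably computes the root mean squared error on the log scale. The following is only a sketch under that assumption; the name `rmse` is hypothetical and is not the actual helper.

```python
import numpy as np

def rmse(predictions, actuals):
    """Root mean squared error between two equally sized sequences.

    Assumed stand-in for evaluate_model_score_given_predictions;
    the real helper may differ.
    """
    predictions = np.asarray(predictions, dtype=float).ravel()
    actuals = np.asarray(actuals, dtype=float).ravel()
    return float(np.sqrt(np.mean((predictions - actuals) ** 2)))

# Errors of 3 and 4 give sqrt((9 + 16) / 2).
example_score = rmse([0.0, 0.0], [3.0, 4.0])
```

Because both inputs are already log prices, a score of about 0.23 would correspond to a typical multiplicative error of roughly 23-26% on the raw sale price.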

#### Inspect the output model

It may seem like a trivial thing, but shouldn't we inspect the model to see what exactly it does? This is especially important in data science, where it is very easy to get entangled in a quagmire of models and functions without clearly understanding what any of them does.


In [124]:
from h2o.tree import H2OTree
tree = H2OTree(model = hpr_1, tree_number = 0, tree_class = None)

In [125]:
tree

<h2o.tree.tree.H2OTree at 0x12fa327f0>

In [126]:
len(tree)

15

In [127]:
print(tree)

Tree related to model housing_price_regression. Tree number is 0, tree class is 'None'




In [128]:
tree.levels

[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]

In [129]:
tree.tree_number

0

In [130]:
tree.show()

Tree related to model housing_price_regression. Tree number is 0, tree class is 'None'




In [131]:
print(tree.root_node)

Node ID 0 
Left child node ID = 1
Right child node ID = 2

Splits on column OverallQual
Split threshold < 6.5 to the left node, >= 6.5 to the right node 

NA values go to the LEFT


In [132]:
print(tree.root_node.left_child)

Node ID 1 
Left child node ID = 3
Right child node ID = 4

Splits on column OverallQual
Split threshold < 4.5 to the left node, >= 4.5 to the right node 

NA values go to the RIGHT


#### Testing model on validation set

In [133]:
predict_out = hpr_1.predict(
    get_h2o_frame_with_rel_factors(validation_data))                
validation_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%


In [134]:
#evaluate_model_score_given_predictions(np.log(validation_data['Predictions'].values), 
#                                       np.log(validation_data[dep_var_col].values))

In [135]:
validation_score = evaluate_model_score_given_predictions((validation_data['Predictions'].values), 
                                                          (validation_data[dep_var_col].values))
print(validation_score)

0.23048405571983568


#### Generate predictions on test data

In [136]:
test_data_one_hot = pd.read_csv('../../cleaned_input/test_data_one_hot.csv')

In [137]:
test_data['LogMasVnrArea_times_not_missing'] = test_data_one_hot['LogMasVnrArea_times_not_missing']

In [138]:
test_data[all_cols_in_model].isnull().sum()

OverallQual    0
dtype: int64

In [139]:
hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', seed=1)
hpr_1.train(x=all_cols_in_model, 
            y=dep_var_col, 
            training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [140]:
predict_out = hpr_1.predict(
    get_h2o_frame_with_rel_factors(test_data))                
test_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%


#### Cross validation

In [141]:
# Do a 5-fold cross validation (nfolds=5).
hpr_cross_val = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
                                             seed=1, 
                                             nfolds=5,
                                             keep_cross_validation_predictions=True)
hpr_cross_val.train(x=all_cols_in_model, 
                    y=dep_var_col, 
                    training_frame=get_h2o_frame_with_rel_factors(train_validation_data))


Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [142]:
hpr_cross_val.cross_validation_predictions

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  housing_price_regression


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.05283147764757129
RMSE: 0.2298509900948249
MAE: 0.1729413862097753
RMSLE: 0.017775319501599753
Mean Residual Deviance: 0.05283147764757129

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 0.053578162226033635
RMSE: 0.23146957084254863
MAE: 0.17412129196219545
RMSLE: 0.017903618872038885
Mean Residual Deviance: 0.053578162226033635
Cross-Validation Metrics Summary: 


| | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|
| mae | 0.1739756 | 0.0092284 | 0.1981142 | 0.1597065 | 0.1712044 | 0.1744567 | 0.1663963 |
| mean_residual_deviance | 0.0534754 | 0.0066526 | 0.0690148 | 0.0393210 | 0.0531473 | 0.0527417 | 0.0531521 |
| mse | 0.0534754 | 0.0066526 | 0.0690148 | 0.0393210 | 0.0531473 | 0.0527417 | 0.0531521 |
| r2 | 0.6645809 | 0.0160242 | 0.6453004 | 0.6822407 | 0.6457014 | 0.6491643 | 0.700498 |
| residual_deviance | 0.0534754 | 0.0066526 | 0.0690148 | 0.0393210 | 0.0531473 | 0.0527417 | 0.0531521 |
| rmse | 0.2303484 | 0.0144050 | 0.2627068 | 0.1982952 | 0.2305369 | 0.2296556 | 0.2305475 |
| rmsle | 0.0178089 | 0.0011740 | 0.0204315 | 0.0151848 | 0.0177887 | 0.0177245 | 0.0179152 |


Scoring History: 


| | timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
|---|---|---|---|---|---|---|
| | 2019-04-12 12:23:09 | 0.162 sec | 0.0 | 0.3993151 | 0.3098117 | 0.1594525 |
| | 2019-04-12 12:23:09 | 0.165 sec | 1.0 | 0.3730871 | 0.2883666 | 0.1391940 |
| | 2019-04-12 12:23:09 | 0.166 sec | 2.0 | 0.3504063 | 0.2697649 | 0.1227845 |
| | 2019-04-12 12:23:09 | 0.167 sec | 3.0 | 0.3308971 | 0.2537511 | 0.1094929 |
| | 2019-04-12 12:23:09 | 0.168 sec | 4.0 | 0.3142081 | 0.2400349 | 0.0987267 |
| ... | ... | ... | ... | ... | ... | ... |
| | 2019-04-12 12:23:09 | 0.225 sec | 46.0 | 0.2298589 | 0.1729028 | 0.0528351 |
| | 2019-04-12 12:23:09 | 0.226 sec | 47.0 | 0.2298562 | 0.1729144 | 0.0528339 |
| | 2019-04-12 12:23:09 | 0.228 sec | 48.0 | 0.2298541 | 0.1729246 | 0.0528329 |



See the whole table with table.as_data_frame()
Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| OverallQual | 819.2989502 | 1.0 | 1.0 |


<bound method ModelBase.cross_validation_predictions of >

In [143]:
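def get_cross_validated_rmse(hpr_cross_val):
    # cross_validation_predictions() returns one prediction frame per fold.
    # Sum them into a single frame of predictions and score it against the
    # target of the (global) combined training/validation data.
    cv_preds = hpr_cross_val.cross_validation_predictions()
    result_cv = cv_preds[0]['predict'].as_data_frame().copy()
    for fold_preds in cv_preds[1:]:
        result_cv += fold_preds['predict'].as_data_frame()
    return evaluate_model_score_given_predictions(result_cv, train_validation_data['LogSalePrice'])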

In [144]:
get_cross_validated_rmse(hpr_cross_val)

0.2314695712659162
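Summing the per-fold frames in `get_cross_validated_rmse` only makes sense if each fold's frame covers every row and holds zeros outside its own holdout rows, which is how H2O documents `keep_cross_validation_predictions` as far as we understand it. The toy example below illustrates that idea with plain NumPy; the fold assignment and the numbers are made up for illustration.

```python
import numpy as np

# Toy target values and a made-up fold assignment for six rows.
actual = np.array([12.0, 11.5, 12.3, 11.9, 12.1, 11.7])
fold_of_row = np.array([0, 1, 2, 0, 1, 2])

# Pretend out-of-fold predictions for every row.
out_of_fold = actual + np.array([0.1, -0.2, 0.05, -0.1, 0.2, 0.0])

# One vector per fold: predictions on its holdout rows, zeros elsewhere.
per_fold = [np.where(fold_of_row == k, out_of_fold, 0.0) for k in range(3)]

# Summing the zero-padded vectors recovers one prediction per row.
combined = np.sum(per_fold, axis=0)

cv_rmse = float(np.sqrt(np.mean((combined - actual) ** 2)))
```

The value returned above (0.2314695...) agrees with the RMSE reported on cross-validation data in the model summary, which supports this reading. H2O also appears to offer `cross_validation_holdout_predictions()` on the model, which should return the combined frame directly.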

#### Model building in detail.

We make sure that we exploit the capabilities given by the h2o module to the maximum.

So in this section, we try to build models in multiple ways using different values for certain parameters and see how they perform. Here we go !
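Before varying `ntrees` and `learn_rate`, it may help to see how the two interact. The sketch below is a toy gradient boosting loop with squared-error loss and one-split trees (stumps), written in NumPy. It is not H2O's implementation, and the data is synthetic; the function names are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single split on x that minimises squared error on residual."""
    best = None
    xs = np.unique(x)
    for t in (xs[:-1] + xs[1:]) / 2.0:  # midpoints between distinct values
        left, right = residual[x < t], residual[x >= t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]  # threshold, left leaf value, right leaf value

def boost(x, y, ntrees, learn_rate):
    """Return training predictions after ntrees shrunken stumps."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(ntrees):
        # For squared error, the negative gradient is the residual.
        t, left_val, right_val = fit_stump(x, y - pred)
        pred += learn_rate * np.where(x < t, left_val, right_val)
    return pred

def rmse(pred, actual):
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

# Synthetic data: a quality score from 1 to 10 and a noisy log price.
rng = np.random.default_rng(0)
x = rng.integers(1, 11, size=200).astype(float)
y = 10.5 + 0.2 * x + rng.normal(0.0, 0.2, size=200)

fast_50 = rmse(boost(x, y, ntrees=50, learn_rate=0.1), y)
slow_50 = rmse(boost(x, y, ntrees=50, learn_rate=0.01), y)
slow_500 = rmse(boost(x, y, ntrees=500, learn_rate=0.01), y)
print(fast_50, slow_50, slow_500)
```

With a small learning rate, 50 trees are not enough to fit the training data, so the training error stays high until more trees are added. That is the trade-off the cells below explore with H2O itself.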

In [145]:
# # Build the model using default parameters and evaluate error on the training/validation set.

# hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
#                                      seed=1,
#                                      ntrees=50,
#                                      learn_rate=0.1,
#                                      stopping_rounds=0)
# hpr_1.train(x=all_cols_in_model, 
#             y=dep_var_col, 
#             training_frame=get_h2o_frame_with_rel_factors(train_data),

#             validation_frame=None)

# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(train_data))                
# train_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on training data : ",
#       evaluate_model_score_given_predictions((train_data['Predictions'].values), 
#                                              (train_data[dep_var_col].values)))



# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(validation_data))                
# validation_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on validation data : ",
#       evaluate_model_score_given_predictions((validation_data['Predictions'].values), 
#                                              (validation_data[dep_var_col].values)))



In [146]:
# # Tinker with the learning rate alone here.
# # Build the model with a lower learning rate (0.01) and evaluate error on the training/validation set.

# hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
#                                      seed=1,
#                                      ntrees=50,
#                                      learn_rate=0.01,
#                                      stopping_rounds=0)
# hpr_1.train(x=all_cols_in_model, 
#             y=dep_var_col, 
#             training_frame=get_h2o_frame_with_rel_factors(train_data),

#             validation_frame=None)

# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(train_data))                
# train_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on training data : ",
#       evaluate_model_score_given_predictions((train_data['Predictions'].values), 
#                                              (train_data[dep_var_col].values)))



# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(validation_data))                
# validation_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on validation data : ",
#       evaluate_model_score_given_predictions((validation_data['Predictions'].values), 
#                                              (validation_data[dep_var_col].values)))




In [147]:
# # Try increasing the number of trees here and see whether it makes any difference.

# hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
#                                      seed=1,
#                                      ntrees=500,
#                                      learn_rate=0.01,
#                                      stopping_rounds=0)
# hpr_1.train(x=all_cols_in_model, 
#             y=dep_var_col, 
#             training_frame=get_h2o_frame_with_rel_factors(train_data),

#             validation_frame=None)

# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(train_data))                
# train_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on training data : ",
#       evaluate_model_score_given_predictions((train_data['Predictions'].values), 
#                                              (train_data[dep_var_col].values)))



# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(validation_data))                
# validation_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on validation data : ",
#       evaluate_model_score_given_predictions((validation_data['Predictions'].values), 
#                                              (validation_data[dep_var_col].values)))




In [148]:
# # Now comes the real boosting tree part. Add in a validation frame and set a value for stopping rounds as well.

# hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
#                                      seed=1,
#                                      ntrees=50,
#                                      learn_rate=0.1,
#                                      stopping_rounds=5)
# hpr_1.train(x=all_cols_in_model, 
#             y=dep_var_col, 
#             training_frame=get_h2o_frame_with_rel_factors(train_data),

#             validation_frame=get_h2o_frame_with_rel_factors(validation_data))

# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(train_data))                
# train_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on training data : ",
#       evaluate_model_score_given_predictions((train_data['Predictions'].values), 
#                                              (train_data[dep_var_col].values)))



# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(validation_data))                
# validation_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on validation data : ",
#       evaluate_model_score_given_predictions((validation_data['Predictions'].values), 
#                                              (validation_data[dep_var_col].values)))




In [149]:
# # Same as above (validation frame plus stopping rounds), but
# # increase the number of trees to 500 and lower the learning rate to 0.01.
# hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
#                                      seed=1,
#                                      ntrees=500,
#                                      learn_rate=0.01,
#                                      stopping_rounds=5)
# hpr_1.train(x=all_cols_in_model, 
#             y=dep_var_col, 
#             training_frame=get_h2o_frame_with_rel_factors(train_data),

#             validation_frame=get_h2o_frame_with_rel_factors(validation_data))

# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(train_data))                
# train_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on training data : ",
#       evaluate_model_score_given_predictions((train_data['Predictions'].values), 
#                                              (train_data[dep_var_col].values)))



# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(validation_data))                
# validation_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on validation data : ",
#       evaluate_model_score_given_predictions((validation_data['Predictions'].values), 
#                                              (validation_data[dep_var_col].values)))




In [150]:
# # Now let us try cross validation along with stopping and see where we end up.
# hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
#                                      seed=1,
#                                      ntrees=500,
#                                      learn_rate=0.01,
#                                      stopping_rounds=5,
#                                      nfolds=5,
#                                      keep_cross_validation_predictions=True)
# hpr_1.train(x=all_cols_in_model, 
#             y=dep_var_col, 
#             training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

# print("Cross validation score is ", get_cross_validated_rmse(hpr_1))

In [151]:
# # Now let us try the same with fewer trees (50) and a higher learning rate (0.1).
# hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
#                                      seed=1,
#                                      ntrees=50,
#                                      learn_rate=0.1,
#                                      stopping_rounds=5,
#                                      nfolds=5,
#                                      keep_cross_validation_predictions=True)
# hpr_1.train(x=all_cols_in_model, 
#             y=dep_var_col, 
#             training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

# print("Cross validation score is ", get_cross_validated_rmse(hpr_1))

#### Conclusion.

At this stage, after seeing the outputs, it is difficult to say anything concrete. The model outputs will definitely depend on the number of predictors, and at this stage, with just one predictor, these settings do not seem to make much of a difference.

### Try somewhat of a greedy method to select columns

In this approach, we start with an empty model and, at every step, add the predictor that decreases the cross-validation error the most. We have not fully automated it (it would be extremely easy to do so), but have built the model manually after inspecting the results at each step.

As simple as it seems, this looks to work pretty well. In fact, we were able to get a much better model, with almost no extra effort !
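The manual procedure described above could be automated along the following lines. This is a minimal sketch, not the code used in this notebook: `greedy_forward_selection`, `score_fn`, and the toy penalties are hypothetical names and values introduced for illustration. A lower score is assumed to be better.

```python
def greedy_forward_selection(candidates, score_fn, max_cols=None, min_improvement=0.0):
    """Repeatedly add the candidate column that lowers score_fn the most.

    score_fn takes a list of column names and returns an error (lower is better).
    Stops when no candidate improves the score by more than min_improvement.
    Returns a list of (column, score) pairs in the order they were added.
    """
    selected = []
    best_score = float("inf")
    history = []
    while max_cols is None or len(selected) < max_cols:
        scored = [(score_fn(selected + [c]), c) for c in candidates if c not in selected]
        if not scored:
            break  # every candidate is already in the model
        score, col = min(scored)
        if best_score - score <= min_improvement:
            break  # the best remaining column does not help enough
        selected.append(col)
        best_score = score
        history.append((col, score))
    return history

# Toy scoring function: each column removes a fixed amount of error.
toy_gain = {"OverallQual": 8.0, "GrLivArea": 4.0, "Neighborhood": 2.0, "Street": 0.0}

def toy_score(cols):
    return 16.0 - sum(toy_gain[c] for c in cols)

history = greedy_forward_selection(list(toy_gain), toy_score)
print(history)
```

In this notebook, `score_fn` could be a thin wrapper around `get_cross_val_scores_given_cols`, at the cost of one cross-validated H2O model per candidate column per step.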

In [152]:
import operator
def get_cross_val_scores_new_col(base_model_cols, 
                                 ntrees=50,
                                 learn_rate=0.1,
                                 stopping_rounds=0,
                                 train_validation_data=train_validation_data,
                                 nfolds=5,
                                 dep_var_col='LogSalePrice'):
    columns_to_cross_val_score = dict()
    for col in ALL_COLUMNS:
        # If the column was already included, skip it.
        if col in base_model_cols:
            continue
        
        cur_model_cols = base_model_cols + [col]            
        print(cur_model_cols)

        # Do an nfolds-fold cross validation (5 by default).
        hpr_cross_val = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
                                                    seed=1, 
                                                    ntrees=ntrees,
                                                    learn_rate=learn_rate,
                                                    stopping_rounds=stopping_rounds,
                                                    nfolds=nfolds,
                                                    keep_cross_validation_predictions=True)
        hpr_cross_val.train(x=cur_model_cols, 
                            y=dep_var_col, 
                            training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

        cv_score = get_cross_validated_rmse(hpr_cross_val)

        columns_to_cross_val_score[col] = cv_score
    
    sorted_cross_val_scores = sorted(columns_to_cross_val_score.items(), key=operator.itemgetter(1))
    return sorted_cross_val_scores

In [153]:
def get_cross_val_scores_given_cols(cur_model_cols,
                                    ntrees=50,
                                    learn_rate=0.1,
                                    stopping_rounds=0,
                                    train_validation_data=train_validation_data,
                                    nfolds=5,
                                    dep_var_col='LogSalePrice'):

    # Do an nfolds-fold cross validation (5 by default).
    hpr_cross_val = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
                                                 seed=1, 
                                                 ntrees=ntrees,
                                                 learn_rate=learn_rate,
                                                 stopping_rounds=stopping_rounds,
                                                 nfolds=nfolds,
                                                 keep_cross_validation_predictions=True)
    hpr_cross_val.train(x=cur_model_cols, 
                        y=dep_var_col, 
                        training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

    cv_score = get_cross_validated_rmse(hpr_cross_val)

    return cv_score

In [154]:
# Let us begin our testing with a good number of trees and a low learning rate and see how it looks.
#sorted_cross_val_scores = get_cross_val_scores_new_col([],ntrees=500, learn_rate=0.01)

In [155]:
#sorted_cross_val_scores 

In [156]:
# Let us check out default values now. We would not expect much of a difference here.
#sorted_cross_val_scores = get_cross_val_scores_new_col([])

In [157]:
#sorted_cross_val_scores

In [158]:
# We do not see much of a difference here. Let us try with stopping rounds (we do not expect to see much of a difference
# here either).
# Let us begin our testing with a good number of trees and a low learning rate and see how it looks.
#sorted_cross_val_scores = get_cross_val_scores_new_col([],ntrees=500, learn_rate=0.01, stopping_rounds=5)

In [159]:
#sorted_cross_val_scores

In [160]:
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual'])

In [161]:
#sorted_cross_val_scores

In [162]:
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual'],ntrees=500, learn_rate=0.01, stopping_rounds=5)

In [163]:
#sorted_cross_val_scores

In [164]:
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 'GrLivArea'])

In [165]:
#sorted_cross_val_scores


In [166]:
# Do this after checking the earlier result to see if GrLivArea is the factor to be included.
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 'GrLivArea'],
#                                                       ntrees=500, learn_rate=0.01, stopping_rounds=5)

In [167]:
#sorted_cross_val_scores

In [168]:
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 'GrLivArea', 'Neighborhood'])

In [169]:
#sorted_cross_val_scores

In [170]:
# Do this after checking the earlier result to see if GrLivArea is the factor to be included.
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 'GrLivArea', 'Neighborhood'],
#                                                       ntrees=500, learn_rate=0.01, stopping_rounds=5)

In [171]:
#sorted_cross_val_scores

In [172]:
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 'GrLivArea', 'Neighborhood'],
#                                                       ntrees=500, learn_rate=0.01)

In [173]:
#sorted_cross_val_scores

If the results shown above look right, go for the final model and apply stopping rounds to it (you may want to check the 'heat' variable).


In [174]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1'])

In [175]:
# sorted_cross_val_scores

In [176]:

# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond'])


In [177]:
# sorted_cross_val_scores

In [178]:

# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea'])

In [179]:
# sorted_cross_val_scores

In [180]:

# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF'])


In [181]:
# sorted_cross_val_scores

In [182]:


# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt'])


In [183]:
# sorted_cross_val_scores

In [184]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish'])



In [185]:
# sorted_cross_val_scores

In [186]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea'])



In [187]:
# sorted_cross_val_scores

In [188]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir'])



In [189]:
# sorted_cross_val_scores

In [190]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir',
#                                                         'BsmtFinType1'])



In [191]:
# sorted_cross_val_scores

In [192]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir',
#                                                         'BsmtFinType1',
#                                                         'KitchenAbvGr'])



In [193]:
# sorted_cross_val_scores

In [194]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir',
#                                                         'BsmtFinType1',
#                                                         'KitchenAbvGr',
#                                                         'Condition1'])



In [195]:
# sorted_cross_val_scores

In [196]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir',
#                                                         'BsmtFinType1',
#                                                         'KitchenAbvGr',
#                                                         'Condition1',
#                                                         'ScreenPorch'])



In [197]:
# sorted_cross_val_scores

In [198]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir',
#                                                         'BsmtFinType1',
#                                                         'KitchenAbvGr',
#                                                         'Condition1',
#                                                         'ScreenPorch',
#                                                         'BsmtFinType2'])



In [199]:
# sorted_cross_val_scores

In [200]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir',
#                                                         'BsmtFinType1',
#                                                         'KitchenAbvGr',
#                                                         'Condition1',
#                                                         'ScreenPorch',
#                                                         'BsmtFinType2',
#                                                         'Heating'])



In [201]:
# cross_val_score_this_model = get_cross_val_scores_given_cols(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir',
#                                                         'BsmtFinType1',
#                                                         'KitchenAbvGr',
#                                                         'Condition1',
#                                                         'ScreenPorch',
#                                                         'BsmtFinType2',
#                                                         'Heating'])



In [202]:
# cross_val_score_this_model

In [203]:
# sorted_cross_val_scores

We see that the improvement in the cross validation score is tapering off, indicating that we have reached a plateau. Let us check our cross validation score with the final set of columns.
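
The column-by-column search above amounts to greedy forward selection: at each step, try every remaining column, keep the one with the best cross validation score, and stop once no candidate improves the score by a meaningful margin. A minimal sketch follows, assuming a lower score is better (as with RMSE). The names `forward_select`, `score_fn` and `min_improvement` are hypothetical and introduced only for illustration; they are not the notebook's `get_cross_val_scores_new_col`, and the toy scoring function stands in for an actual cross-validated model fit.

```python
def forward_select(candidates, score_fn, min_improvement=1e-4):
    """Greedy forward selection; score_fn(cols) returns an error (lower is better)."""
    selected = []
    best_score = float('inf')
    remaining = list(candidates)
    while remaining:
        # Score every one-column extension of the current selection.
        trial_scores = {col: score_fn(selected + [col]) for col in remaining}
        best_col = min(trial_scores, key=trial_scores.get)
        # Stop once the best extension no longer helps enough (the plateau).
        if best_score - trial_scores[best_col] < min_improvement:
            break
        selected.append(best_col)
        best_score = trial_scores[best_col]
        remaining.remove(best_col)
    return selected, best_score


# Toy stand-in for a cross-validated RMSE (hypothetical numbers): each useful
# column lowers the error, and every added column carries a small cost.
TOY_GAINS = {'OverallQual': 0.20, 'GrLivArea': 0.05, 'Noise': 0.0}

def toy_score(cols):
    return 0.40 - sum(TOY_GAINS[c] for c in cols) + 0.001 * len(cols)

selected_cols, final_score = forward_select(TOY_GAINS, toy_score)
print(selected_cols, round(final_score, 3))
```

In the toy run the uninformative column is rejected because adding it makes the score slightly worse. In the notebook, the scoring function would be a full cross-validated GBM fit, which is why each step is expensive.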

In [204]:
final_cols = ['OverallQual', 
              'Neighborhood', 
              'GrLivArea',
              'BsmtFinSF1',
              'OverallCond',
              'GarageArea',
              'TotalBsmtSF',
              'YearBuilt',
              'GarageFinish',
              'LotArea',
              'CentralAir',
              'BsmtFinType1',
              'KitchenAbvGr',
              'Condition1',
              'ScreenPorch',
              'BsmtFinType2',
              'Heating']

In [205]:
cv_score = get_cross_val_scores_given_cols(final_cols)
print(cv_score)


Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
0.1256114948485721


#### Run at home BEGIN

In [206]:
'''
cv_score = get_cross_val_scores_given_cols(final_cols, ntrees=500, learn_rate=0.01)
print(cv_score)
'''


'\ncv_score = get_cross_val_scores_given_cols(final_cols, ntrees=500, learn_rate=0.01)\nprint(cv_score)\n'

In [207]:
'''
cv_score = get_cross_val_scores_given_cols(final_cols, ntrees=500, learn_rate=0.01, stopping_rounds=5)
print(cv_score)
'''


'\ncv_score = get_cross_val_scores_given_cols(final_cols, ntrees=500, learn_rate=0.01, stopping_rounds=5)\nprint(cv_score)\n'

#### Run at home END

#### Complete summary of the cross validation metrics.

In [208]:
# Do a 5-fold cross validation (nfolds=5) and keep the holdout predictions.
hpr_cross_val = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
                                             seed=1, 
                                             nfolds=5,
                                             keep_cross_validation_predictions=True)
hpr_cross_val.train(x=final_cols, 
                    y=dep_var_col, 
                    training_frame=get_h2o_frame_with_rel_factors(train_validation_data))


Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [209]:
hpr_cross_val

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  housing_price_regression


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.005787660939965575
RMSE: 0.07607667802924609
MAE: 0.053875103388747125
RMSLE: 0.005919316221845295
Mean Residual Deviance: 0.005787660939965575

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 0.015778247201824646
RMSE: 0.12561149311199452
MAE: 0.08664433888908148
RMSLE: 0.00976176167592953
Mean Residual Deviance: 0.015778247201824646
Cross-Validation Metrics Summary: 


0,1,2,3,4,5,6,7
,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,0.0865637,0.0048704,0.0938477,0.0761395,0.0882947,0.0932474,0.0812892
mean_residual_deviance,0.0157500,0.0022251,0.0188923,0.0110734,0.0158627,0.0193542,0.0135676
mse,0.0157500,0.0022251,0.0188923,0.0110734,0.0158627,0.0193542,0.0135676
r2,0.9004954,0.0123754,0.9029038,0.9105138,0.8942537,0.8712566,0.9235492
residual_deviance,0.0157500,0.0022251,0.0188923,0.0110734,0.0158627,0.0193542,0.0135676
rmse,0.1248452,0.0090473,0.1374491,0.1052304,0.1259472,0.1391194,0.1164801
rmsle,0.0096992,0.0007248,0.0106937,0.0080755,0.0097665,0.0108362,0.0091243


Scoring History: 


0,1,2,3,4,5,6
,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
,2019-04-12 12:23:29,0.373 sec,0.0,0.3993151,0.3098117,0.1594525
,2019-04-12 12:23:29,0.378 sec,1.0,0.3661650,0.2832440,0.1340768
,2019-04-12 12:23:29,0.382 sec,2.0,0.3365929,0.2589587,0.1132948
,2019-04-12 12:23:29,0.386 sec,3.0,0.3103553,0.2374981,0.0963204
,2019-04-12 12:23:29,0.390 sec,4.0,0.2865502,0.2185276,0.0821110
---,---,---,---,---,---,---
,2019-04-12 12:23:29,0.593 sec,46.0,0.0774682,0.0548900,0.0060013
,2019-04-12 12:23:29,0.596 sec,47.0,0.0771927,0.0546091,0.0059587
,2019-04-12 12:23:29,0.598 sec,48.0,0.0766233,0.0542210,0.0058711



See the whole table with table.as_data_frame()
Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
OverallQual,613.5524902,1.0,0.5196104
Neighborhood,178.9126282,0.2916012,0.1515190
GrLivArea,154.8893890,0.2524468,0.1311740
TotalBsmtSF,55.8950195,0.0911006,0.0473368
GarageArea,55.5602303,0.0905550,0.0470533
BsmtFinSF1,29.2153645,0.0476167,0.0247422
OverallCond,24.1364708,0.0393389,0.0204409
CentralAir,15.8737736,0.0258719,0.0134433
LotArea,15.8381004,0.0258138,0.0134131
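
The output above shows two slightly different cross-validation RMSE values (about 0.1256 and 0.1248). As a sketch of where the gap comes from, the snippet below recomputes both from the printed numbers: the first is the square root of the MSE reported on cross-validation data, while the `mean` column of the summary table averages the five per-fold RMSE values. The figures are copied from the output above; the variable names are introduced here for illustration.

```python
import math

# MSE reported on cross-validation data (copied from the output above).
cv_mse = 0.015778247201824646
cv_rmse = math.sqrt(cv_mse)

# Per-fold RMSE values from the Cross-Validation Metrics Summary table.
fold_rmse = [0.1374491, 0.1052304, 0.1259472, 0.1391194, 0.1164801]
mean_fold_rmse = sum(fold_rmse) / len(fold_rmse)

print(round(cv_rmse, 4))         # square root of the overall CV MSE
print(round(mean_fold_rmse, 4))  # average of the five per-fold RMSEs
```

The first value matches the score printed by `get_cross_val_scores_given_cols(final_cols)` earlier, so that is the figure to compare across models.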




### Quantifying validation error on validation data set.

Ideally this is not necessary, as we have already performed cross validation over the entire training data set. It is still useful, though, as it gives a comparable yardstick against other models and can help with ensembling as well.

In [210]:
# Train on the training set only, then score on the held-out validation set.
model_train_data = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
                                                seed=1)
model_train_data.train(x=final_cols, 
                       y=dep_var_col, 
                       training_frame=get_h2o_frame_with_rel_factors(train_data))

predict_validation = model_train_data.predict(
    get_h2o_frame_with_rel_factors(validation_data))                

validation_data['Predictions'] = predict_validation['predict'].as_data_frame()
evaluate_model_score_given_predictions(validation_data['LogSalePrice'], validation_data['Predictions'])


Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%




0.13303899344520437
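
The helper `evaluate_model_score_given_predictions` is imported from `relevant_functions`, and its source is not shown here. Assuming it returns the root mean squared error between actual and predicted log prices (an assumption, though consistent with the scale of the value above), an equivalent check could look like the sketch below. `rmse_on_logs` is a hypothetical name.

```python
import math

def rmse_on_logs(actual_log, predicted_log):
    """Root mean squared error between two equal-length sequences of log prices."""
    actual = list(actual_log)
    predicted = list(predicted_log)
    if len(actual) != len(predicted):
        raise ValueError('inputs must have the same length')
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy example: the errors are 0.1, -0.1 and 0.2 on the log scale.
toy_actual = [12.0, 11.5, 12.3]
toy_predicted = [11.9, 11.6, 12.1]
print(rmse_on_logs(toy_actual, toy_predicted))
```

The function accepts any two sequences of numbers, so pandas columns such as `validation_data['LogSalePrice']` and `validation_data['Predictions']` could be passed in directly.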

#### 4 entries having values not encountered in training set for the 'Heating' column.

Hence we are ignoring it for now. This issue is also not present in the test data.
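
One way to find such cases is to compare the set of levels in each categorical column across the two data sets. The sketch below is an illustration under stated assumptions, not the check used in this notebook: `unseen_levels` is a hypothetical helper, and the toy frames stand in for `train_data` and `validation_data`.

```python
import pandas as pd

def unseen_levels(train_df, other_df, categorical_cols):
    """Map each column to the levels in other_df that never occur in train_df."""
    report = {}
    for col in categorical_cols:
        known = set(train_df[col].dropna().unique())
        seen_later = set(other_df[col].dropna().unique())
        extra = seen_later - known
        if extra:
            report[col] = sorted(extra)
    return report

# Toy frames; the values are chosen only for the example.
toy_train = pd.DataFrame({'Heating': ['GasA', 'GasA', 'GasW'],
                          'CentralAir': ['Y', 'N', 'Y']})
toy_valid = pd.DataFrame({'Heating': ['GasA', 'Wall', 'OthW'],
                          'CentralAir': ['Y', 'Y', 'N']})
print(unseen_levels(toy_train, toy_valid, ['Heating', 'CentralAir']))
```

Columns with no new levels are left out of the report, so an empty dictionary means every level was already seen during training.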

In [211]:
# Dump validation data to a file, which will be used later for ensembling.
validation_data[['Id', 'LogSalePrice', 'Predictions']].to_csv('housing_price_h2o_gradient_boosting_validation.csv', 
                                                              index=False)

### Making predictions on test data

Before making predictions on the test data, we should make sure that the test data is usable.

One problem that could be relevant here is a large number of null values in the test set that are absent from the training set. Let us check whether that is the case.

In [212]:
test_data[final_cols].isnull().sum()

OverallQual     0
Neighborhood    0
GrLivArea       0
BsmtFinSF1      1
OverallCond     0
GarageArea      1
TotalBsmtSF     1
YearBuilt       0
GarageFinish    0
LotArea         0
CentralAir      0
BsmtFinType1    0
KitchenAbvGr    0
Condition1      0
ScreenPorch     0
BsmtFinType2    0
Heating         0
dtype: int64

Only 'BsmtFinSF1', 'GarageArea' and 'TotalBsmtSF' have a single missing value each, so things look good. Let us generate predictions on the test set.

In [213]:
final_model = H2OGradientBoostingEstimator(model_id='housing_price_regression', seed=1)
final_model.train(x=final_cols, 
                  y=dep_var_col, 
                  training_frame=get_h2o_frame_with_rel_factors(train_validation_data))


Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [214]:
predict_out = final_model.predict(
    get_h2o_frame_with_rel_factors(test_data))                


Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%


In [215]:
test_data['LogSalePrice'] = predict_out['predict'].as_data_frame()

In [216]:
test_data['SalePrice'] = np.exp(test_data['LogSalePrice'])

In [217]:
test_data[['Id', 'SalePrice']].to_csv('housing_price_h2o_gradient_boosting_predictions.csv', index=False)
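
Because the model predicts `LogSalePrice`, an error on the log scale becomes a multiplicative error once the prediction is exponentiated. As a rough illustration, the sketch below converts the validation score reported earlier into price-scale factors, assuming that score is an RMSE on log prices (the helper that produced it is not shown, so this is an assumption).

```python
import math

# Validation score reported earlier, assumed to be an RMSE on log prices.
log_rmse = 0.13303899344520437

# A log-scale error of +/- log_rmse scales the price by exp(+/- log_rmse).
over_factor = math.exp(log_rmse)
under_factor = math.exp(-log_rmse)

print(f'typical over-prediction factor:  {over_factor:.3f}')
print(f'typical under-prediction factor: {under_factor:.3f}')
```

Read this way, a log-scale RMSE of about 0.13 corresponds to predictions that are typically off by somewhat more than ten percent of the sale price in either direction.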