### Basic Overview

We will explore gradient boosting (using the h2o framework) to build a model that predicts housing prices from the available data.

Along the way, we will examine every step in detail, so that we make full use of the library.

In [88]:
import pandas as pd
import numpy as np
import sys
sys.path.append('../../common_routines/')
from relevant_functions import\
    evaluate_model_score_given_predictions,\
    evaluate_model_score


#### Get clean data first

In [89]:
train_data = pd.read_csv('../../cleaned_input/train_data.csv')
validation_data = pd.read_csv('../../cleaned_input/validation_data.csv')

In [90]:
train_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotArea', 'Street', 'LotShape',
       'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       ...
       'LogGarageArea', 'LogWoodDeckSF', 'LogOpenPorchSF', 'LogEnclosedPorch',
       'Log3SsnPorch', 'LogScreenPorch', 'LogPoolArea', 'LogMiscVal',
       'LogSalePrice', 'LogMasVnrArea_times_not_missing'],
      dtype='object', length=102)

In [91]:
train_validation_data = pd.concat([train_data, validation_data])

In [92]:
test_data = pd.read_csv('../../cleaned_input/test.csv')

In [93]:
test_data.isnull().sum().sum()

27

In [94]:
## Are they indeed clean ?
train_data.isnull().sum().any()

False

In [95]:
validation_data.isnull().sum().any()

False

#### Get h2o up and running !


In [96]:
# Using h2o
import h2o
h2o.init(nthreads = -1, max_mem_size = 15)

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O cluster uptime: 16 hours 43 mins
H2O cluster timezone: America/New_York
H2O data parsing timezone: UTC
H2O cluster version: 3.24.0.1
H2O cluster version age: 11 days
H2O cluster name: H2O_from_python_babs4JESUS_auajwj
H2O cluster total nodes: 1
H2O cluster free memory: 12.84 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8


### Brief framework.

We will build according to the following framework (similar to what we did for PCA).

Given a set of columns, we should be able to do the following:

1. Train model on training set.

2. Validate on validation set.

3. Generate predictions on test data.

4. Do cross validation on combined set of training/validation data.


In [97]:
ALL_CATEGORICAL_COLUMNS = ['MSSubClass',
 'MSZoning',
 'Street',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'PoolQC',
 'Fence',
 'MiscFeature',
 'MoSold',
 'YrSold',
 'SaleType',
 'SaleCondition']

In [98]:
ALL_NUMERICAL_COLUMNS = ['LotArea',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea_times_not_missing',
 'MasVnrArea_not_missing',
 'BsmtFinSF1',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt_times_not_missing',
 'GarageYrBlt_not_missing',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal']

In [99]:
ALL_COLUMNS = ALL_CATEGORICAL_COLUMNS + ALL_NUMERICAL_COLUMNS

In [100]:
# Columns the model is to be trained on
# Check out ExterQual and YearBuilt instead of YearRemodAdd
# Check out BsmtCond
#cat_cols_in_model = ['MSSubClass', 'Neighborhood', 'ExterQual', 'Foundation', 'BsmtQual', 'BsmtCond',
#                     'BsmtFinType1']
#numeric_cols_in_model = ['GrLivArea', 'OverallQual', 'OverallCond', 'YearRemodAdd', 'BsmtFinSF1', 
#                         'TotalBsmtSF']
# Check out GarageCond
cat_cols_in_model = ['MSSubClass', 'Neighborhood']
numeric_cols_in_model = ['GrLivArea', 'OverallQual', 'OverallCond', 'YearRemodAdd', 'BsmtFinSF1', 
                         'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GarageArea', 'LowQualFinSF']

all_cols_in_model = cat_cols_in_model + numeric_cols_in_model
all_cols_in_model = ['OverallQual']
dep_var_col = 'LogSalePrice'

In [101]:
all_cols_in_model

['OverallQual']

#### Training model on the training set


In [102]:
def get_h2o_frame_with_rel_factors(test_data):
    test_data_h2o = h2o.H2OFrame(test_data)
    for col in ALL_CATEGORICAL_COLUMNS:
        test_data_h2o[col] = test_data_h2o[col].asfactor()
    return test_data_h2o

In [103]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [None]:
hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', seed=1)
hpr_1.train(x=all_cols_in_model, 
            y=dep_var_col, 
            training_frame=get_h2o_frame_with_rel_factors(train_data))

In [None]:
predict_out = hpr_1.predict(
    get_h2o_frame_with_rel_factors(train_data))                
train_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

In [None]:
evaluate_model_score_given_predictions((train_data['Predictions'].values), 
                                       (train_data[dep_var_col].values))
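
The helper `evaluate_model_score_given_predictions` is imported from a local module that is not shown in this notebook, so its exact definition is unknown here. Judging by the later helper name `get_cross_validated_rmse`, it plausibly computes the root-mean-squared error between predicted and actual `LogSalePrice`. Under that assumption, a minimal stand-in (the name `rmse` and the toy numbers are hypothetical) could look like this:

```python
import math


def rmse(predictions, actuals):
    """Root-mean-squared error between two equal-length sequences.

    Hypothetical stand-in for evaluate_model_score_given_predictions;
    the real helper may compute a different score.
    """
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy log-prices for illustration only, not taken from the housing data.
toy_predictions = [12.0, 11.5, 12.5]
toy_actuals = [12.1, 11.5, 12.3]
toy_score = rmse(toy_predictions, toy_actuals)
print(toy_score)
```

Because the target is already a log price, an RMSE on these values penalises relative rather than absolute price errors.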

#### Inspect the output model

It may seem trivial, but shouldn't we inspect the model to see exactly what it does? This is especially important in data science, where it is very easy to get entangled in a quagmire of models and functions without clearly understanding what any of them does.


In [104]:
from h2o.tree import H2OTree
tree = H2OTree(model = hpr_1, tree_number = 0, tree_class = None)

NameError: name 'hpr_1' is not defined

In [None]:
tree

In [None]:
len(tree)

In [None]:
print(tree)

In [None]:
tree.levels

In [None]:
tree.tree_number

In [None]:
tree.show()

In [None]:
print(tree.root_node)

In [None]:
print(tree.root_node.left_child)
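
Beyond printing individual nodes, it can help to summarise the whole tree. The sketch below walks a tree recursively and reports the number of leaves and the depth. It assumes that split nodes expose `left_child` and `right_child` attributes (as `tree.root_node.left_child` above suggests) and that leaf nodes have neither; this should be checked against the installed h2o version. The function name and the stand-in nodes are hypothetical, so the block runs without an H2O cluster.

```python
from types import SimpleNamespace


def count_leaves_and_depth(node):
    """Return (number_of_leaves, depth) for the subtree rooted at node.

    A node with no left_child and no right_child counts as a leaf of depth 0.
    """
    left = getattr(node, "left_child", None)
    right = getattr(node, "right_child", None)
    if left is None and right is None:
        return 1, 0
    leaves, depth = 0, 0
    for child in (left, right):
        if child is not None:
            child_leaves, child_depth = count_leaves_and_depth(child)
            leaves += child_leaves
            depth = max(depth, child_depth + 1)
    return leaves, depth


def make_leaf(value):
    """Stand-in leaf node (hypothetical, for illustration only)."""
    return SimpleNamespace(prediction=value)


# Stand-in tree with two splits on OverallQual; thresholds are made up.
toy_root = SimpleNamespace(
    split_feature="OverallQual",
    threshold=6.5,
    left_child=SimpleNamespace(
        split_feature="OverallQual",
        threshold=4.5,
        left_child=make_leaf(-0.05),
        right_child=make_leaf(-0.01),
    ),
    right_child=make_leaf(0.04),
)

toy_leaves, toy_depth = count_leaves_and_depth(toy_root)
print(toy_leaves, toy_depth)
```

With a trained model, the same function could be called as `count_leaves_and_depth(tree.root_node)`.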

#### Testing model on validation set

In [None]:
predict_out = hpr_1.predict(
    get_h2o_frame_with_rel_factors(validation_data))                
validation_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

In [None]:
#evaluate_model_score_given_predictions(np.log(validation_data['Predictions'].values), 
#                                       np.log(validation_data[dep_var_col].values))

In [None]:
validation_score = evaluate_model_score_given_predictions((validation_data['Predictions'].values), 
                                                          (validation_data[dep_var_col].values))
print(validation_score)

#### Generate predictions on test data

In [None]:
test_data_one_hot = pd.read_csv('../../cleaned_input/test_data_one_hot.csv')

In [None]:
test_data['LogMasVnrArea_times_not_missing'] = test_data_one_hot['LogMasVnrArea_times_not_missing']

In [None]:
test_data[all_cols_in_model].isnull().sum()

In [None]:
hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', seed=1)
hpr_1.train(x=all_cols_in_model, 
            y=dep_var_col, 
            training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

In [None]:
predict_out = hpr_1.predict(
    get_h2o_frame_with_rel_factors(test_data))                
test_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

#### Cross validation

In [None]:
# Do a 5-fold cross validation.
hpr_cross_val = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
                                             seed=1, 
                                             nfolds=5,
                                             keep_cross_validation_predictions=True)
hpr_cross_val.train(x=all_cols_in_model, 
                    y=dep_var_col, 
                    training_frame=get_h2o_frame_with_rel_factors(train_validation_data))


In [None]:
hpr_cross_val.cross_validation_predictions()

In [None]:
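def get_cross_validated_rmse(hpr_cross_val):
    # One prediction frame per fold; each spans all training rows and is
    # zero outside that fold's holdout rows, so summing the frames gives
    # one holdout prediction per row.
    cv_preds = hpr_cross_val.cross_validation_predictions()
    result_cv = cv_preds[0]['predict'].as_data_frame()['predict'].values.copy()
    for fold_preds in cv_preds[1:]:
        result_cv += fold_preds['predict'].as_data_frame()['predict'].values
    # Pass plain arrays, as in the earlier calls, to avoid pandas index misalignment.
    return evaluate_model_score_given_predictions(result_cv,
                                                  train_validation_data['LogSalePrice'].values)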

In [None]:
get_cross_validated_rmse(hpr_cross_val)
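
The summation in `get_cross_validated_rmse` relies on how the per-fold predictions are laid out: as far as we understand the H2O documentation, each fold's frame is expanded to the full number of training rows and filled with 0 outside that fold's holdout rows. The pure-Python sketch below illustrates why adding such vectors yields one out-of-fold prediction per row. The function name and the numbers are hypothetical.

```python
def combine_fold_predictions(fold_predictions):
    """Sum zero-filled per-fold prediction vectors element-wise."""
    n_rows = len(fold_predictions[0])
    combined = [0.0] * n_rows
    for fold in fold_predictions:
        if len(fold) != n_rows:
            raise ValueError("every fold vector must cover all rows")
        for row, value in enumerate(fold):
            combined[row] += value
    return combined


# Three folds over six rows; each row is non-zero in exactly one fold.
toy_folds = [
    [12.1, 0.0, 0.0, 11.8, 0.0, 0.0],
    [0.0, 11.9, 0.0, 0.0, 12.4, 0.0],
    [0.0, 0.0, 12.0, 0.0, 0.0, 11.7],
]
toy_out_of_fold = combine_fold_predictions(toy_folds)
print(toy_out_of_fold)
```

Each entry of the result comes from a model that never saw that row during training, which is what makes the resulting score a cross-validated one.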

#### Model building in detail.

We want to exploit the capabilities of the h2o module to the fullest.

So in this section we build models in several ways, varying a few parameters (number of trees, learning rate, stopping rounds, cross-validation folds), and see how they perform. Here we go!
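
To build some intuition for how the number of trees and the learning rate interact, here is a small from-scratch sketch of gradient boosting for squared error, using one-split trees (stumps) on a single predictor. It is not how H2O implements GBM; all function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on xs that minimises squared error."""
    best = None
    # Dropping the largest value keeps both sides of every split non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, ntrees, learn_rate):
    """Fit ntrees stumps to successive residuals; return training predictions."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    for _ in range(ntrees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        # Each tree only contributes a learn_rate-sized fraction of its fit.
        predictions = [
            p + learn_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return predictions


def training_mse(xs, ys, ntrees, learn_rate):
    predictions = boost(xs, ys, ntrees, learn_rate)
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Toy data: a quality score against a made-up log price.
toy_quality = [1, 2, 3, 4, 5, 6, 7, 8]
toy_log_price = [11.0, 11.2, 11.5, 11.7, 12.0, 12.2, 12.6, 12.9]

mse_fast = training_mse(toy_quality, toy_log_price, ntrees=50, learn_rate=0.1)
mse_slow_few = training_mse(toy_quality, toy_log_price, ntrees=50, learn_rate=0.01)
mse_slow_many = training_mse(toy_quality, toy_log_price, ntrees=500, learn_rate=0.01)
print(mse_fast, mse_slow_few, mse_slow_many)
```

On the training set, a low learning rate with few trees underfits, and adding trees recovers the fit. Whether that also helps on held-out data is exactly what the validation and cross-validation runs in this section are meant to check.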

In [None]:
# # Build the model using default parameters and evaluate error on the training/validation set.

# hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
#                                      seed=1,
#                                      ntrees=50,
#                                      learn_rate=0.1,
#                                      stopping_rounds=0)
# hpr_1.train(x=all_cols_in_model, 
#             y=dep_var_col, 
#             training_frame=get_h2o_frame_with_rel_factors(train_data),

#             validation_frame=None)

# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(train_data))                
# train_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on training data : ",
#       evaluate_model_score_given_predictions((train_data['Predictions'].values), 
#                                              (train_data[dep_var_col].values)))



# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(validation_data))                
# validation_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on validation data : ",
#       evaluate_model_score_given_predictions((validation_data['Predictions'].values), 
#                                              (validation_data[dep_var_col].values)))



In [None]:
# # Tinker with the learning rate alone here.
# # Build the model using default parameters and evaluate error on the training/validation set.

# hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
#                                      seed=1,
#                                      ntrees=50,
#                                      learn_rate=0.01,
#                                      stopping_rounds=0)
# hpr_1.train(x=all_cols_in_model, 
#             y=dep_var_col, 
#             training_frame=get_h2o_frame_with_rel_factors(train_data),

#             validation_frame=None)

# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(train_data))                
# train_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on training data : ",
#       evaluate_model_score_given_predictions((train_data['Predictions'].values), 
#                                              (train_data[dep_var_col].values)))



# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(validation_data))                
# validation_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on validation data : ",
#       evaluate_model_score_given_predictions((validation_data['Predictions'].values), 
#                                              (validation_data[dep_var_col].values)))




In [None]:
# # Try increasing the number of trees here and see whether it makes any difference.

# hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
#                                      seed=1,
#                                      ntrees=500,
#                                      learn_rate=0.01,
#                                      stopping_rounds=0)
# hpr_1.train(x=all_cols_in_model, 
#             y=dep_var_col, 
#             training_frame=get_h2o_frame_with_rel_factors(train_data),

#             validation_frame=None)

# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(train_data))                
# train_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on training data : ",
#       evaluate_model_score_given_predictions((train_data['Predictions'].values), 
#                                              (train_data[dep_var_col].values)))



# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(validation_data))                
# validation_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on validation data : ",
#       evaluate_model_score_given_predictions((validation_data['Predictions'].values), 
#                                              (validation_data[dep_var_col].values)))




In [None]:
# # Now comes the real boosting tree part. Add in a validation frame and set a value for stopping rounds as well.

# hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
#                                      seed=1,
#                                      ntrees=50,
#                                      learn_rate=0.1,
#                                      stopping_rounds=5)
# hpr_1.train(x=all_cols_in_model, 
#             y=dep_var_col, 
#             training_frame=get_h2o_frame_with_rel_factors(train_data),

#             validation_frame=get_h2o_frame_with_rel_factors(validation_data))

# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(train_data))                
# train_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on training data : ",
#       evaluate_model_score_given_predictions((train_data['Predictions'].values), 
#                                              (train_data[dep_var_col].values)))



# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(validation_data))                
# validation_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on validation data : ",
#       evaluate_model_score_given_predictions((validation_data['Predictions'].values), 
#                                              (validation_data[dep_var_col].values)))




In [None]:
# # Now comes the real boosting tree part. Add in a validation frame and set a value for stopping rounds as well.
# # Increases the number of trees to 500.
# hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
#                                      seed=1,
#                                      ntrees=500,
#                                      learn_rate=0.01,
#                                      stopping_rounds=5)
# hpr_1.train(x=all_cols_in_model, 
#             y=dep_var_col, 
#             training_frame=get_h2o_frame_with_rel_factors(train_data),

#             validation_frame=get_h2o_frame_with_rel_factors(validation_data))

# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(train_data))                
# train_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on training data : ",
#       evaluate_model_score_given_predictions((train_data['Predictions'].values), 
#                                              (train_data[dep_var_col].values)))



# predict_out = hpr_1.predict(
#     get_h2o_frame_with_rel_factors(validation_data))                
# validation_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

# print("Score on validation data : ",
#       evaluate_model_score_given_predictions((validation_data['Predictions'].values), 
#                                              (validation_data[dep_var_col].values)))




In [None]:
# # Now let us try cross validation along with stopping and see where we end up.
# hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
#                                      seed=1,
#                                      ntrees=500,
#                                      learn_rate=0.01,
#                                      stopping_rounds=5,
#                                      nfolds=5,
#                                      keep_cross_validation_predictions=True)
# hpr_1.train(x=all_cols_in_model, 
#             y=dep_var_col, 
#             training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

# print("Cross validation score is ", get_cross_validated_rmse(hpr_1))

In [None]:
# # Now let us try the same with a different number of trees.
# hpr_1 = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
#                                      seed=1,
#                                      ntrees=50,
#                                      learn_rate=0.1,
#                                      stopping_rounds=5,
#                                      nfolds=5,
#                                      keep_cross_validation_predictions=True)
# hpr_1.train(x=all_cols_in_model, 
#             y=dep_var_col, 
#             training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

# print("Cross validation score is ", get_cross_validated_rmse(hpr_1))

#### Conclusion.

At this stage, after seeing the outputs, it is difficult to say anything concrete. The model outputs will certainly depend on the number of predictors, and with just one predictor (`OverallQual`), changing these parameters does not appear to make much of a difference.

### Try somewhat of a greedy method to select columns

In this approach, we start with an empty model and, at every step, add the predictor that decreases the cross-validation error the most. We have not fully automated it (it would be easy to do so); instead, we built the model manually, inspecting the results at each step.

As simple as it seems, this appears to work pretty well. In fact, we were able to get a much better model with almost no extra effort!
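
For reference, the automated version of this procedure could be sketched as follows. The scoring function is passed in, so in this notebook something like `get_cross_val_scores_given_cols` could play that role. The name `greedy_forward_selection`, the stopping rule (stop when no candidate improves the score) and the toy scorer are assumptions made for illustration, not the procedure actually followed below.

```python
def greedy_forward_selection(candidate_columns, score_columns, max_columns=None):
    """Add, one at a time, the column that lowers the score the most.

    score_columns(list_of_columns) must return a number where lower is
    better, for example a cross-validated RMSE. Selection stops when no
    remaining candidate improves on the best score so far.
    """
    selected = []
    best_score = float("inf")
    remaining = list(candidate_columns)
    while remaining and (max_columns is None or len(selected) < max_columns):
        trial_scores = {col: score_columns(selected + [col]) for col in remaining}
        best_col = min(trial_scores, key=trial_scores.get)
        if trial_scores[best_col] >= best_score:
            break
        selected.append(best_col)
        best_score = trial_scores[best_col]
        remaining.remove(best_col)
    return selected, best_score


# Toy scorer: made-up gains per column plus a small penalty per column.
TOY_GAINS = {"OverallQual": 0.20, "GrLivArea": 0.10, "Neighborhood": 0.05}


def toy_score(columns):
    score = 0.40 - sum(TOY_GAINS.get(col, 0.0) for col in columns)
    return score + 0.001 * len(columns)


toy_selected, toy_best = greedy_forward_selection(
    ["Street", "GrLivArea", "OverallQual", "Neighborhood"], toy_score)
print(toy_selected, toy_best)
```

Each round trains one model per remaining candidate, so with real cross-validated models the cost grows quickly with the number of columns. That is one practical reason to inspect the results by hand and stop early.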

In [None]:
import operator
def get_cross_val_scores_new_col(base_model_cols, 
                                 ntrees=50,
                                 learn_rate=0.1,
                                 stopping_rounds=0,
                                 train_validation_data=train_validation_data,
                                 nfolds=5,
                                 dep_var_col='LogSalePrice'):
    columns_to_cross_val_score = dict()
    for col in ALL_COLUMNS:
        # If the column was already included, skip it.
        if col in base_model_cols:
            continue
        
        cur_model_cols = base_model_cols + [col]            
        print(cur_model_cols)

        # Run nfolds-fold cross validation (5 folds by default).
        hpr_cross_val = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
                                                    seed=1, 
                                                    ntrees=ntrees,
                                                    learn_rate=learn_rate,
                                                    stopping_rounds=stopping_rounds,
                                                    nfolds=nfolds,
                                                    keep_cross_validation_predictions=True)
        hpr_cross_val.train(x=cur_model_cols, 
                            y=dep_var_col, 
                            training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

        cv_score = get_cross_validated_rmse(hpr_cross_val)

        columns_to_cross_val_score[col] = cv_score
    
    sorted_cross_val_scores = sorted(columns_to_cross_val_score.items(), key=operator.itemgetter(1))
    return sorted_cross_val_scores

In [None]:
def get_cross_val_scores_given_cols(cur_model_cols,
                                    ntrees=50,
                                    learn_rate=0.1,
                                    stopping_rounds=0,
                                    train_validation_data=train_validation_data,
                                    nfolds=5,
                                    dep_var_col='LogSalePrice'):

    # Run nfolds-fold cross validation (5 folds by default), using the
    # parameters passed in rather than hard-coded values.
    hpr_cross_val = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
                                                 seed=1, 
                                                 ntrees=ntrees,
                                                 learn_rate=learn_rate,
                                                 stopping_rounds=stopping_rounds,
                                                 nfolds=nfolds,
                                                 keep_cross_validation_predictions=True)
    hpr_cross_val.train(x=cur_model_cols, 
                        y=dep_var_col, 
                        training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

    cv_score = get_cross_validated_rmse(hpr_cross_val)

    return cv_score

In [None]:
# Let us begin our testing with a good number of trees and a low learning rate and see how it looks.
#sorted_cross_val_scores = get_cross_val_scores_new_col([],ntrees=500, learn_rate=0.01)

In [None]:
#sorted_cross_val_scores 

In [None]:
# Let us check out default values now. We would not expect much of a difference here.
#sorted_cross_val_scores = get_cross_val_scores_new_col([])

In [None]:
#sorted_cross_val_scores

In [None]:
# We do not see much of a difference here. Let us try with stopping rounds (we do not expect to see much of a
# difference here either), again with a good number of trees and a low learning rate.
#sorted_cross_val_scores = get_cross_val_scores_new_col([],ntrees=500, learn_rate=0.01, stopping_rounds=5)

In [None]:
#sorted_cross_val_scores

In [None]:
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual'])

In [None]:
#sorted_cross_val_scores

In [None]:
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual'],ntrees=500, learn_rate=0.01, stopping_rounds=5)

In [None]:
#sorted_cross_val_scores

In [None]:
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 'GrLivArea'])

In [None]:
#sorted_cross_val_scores


In [None]:
# Do this after checking the earlier result to see if GrLivArea is the factor to be included.
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 'GrLivArea'],
#                                                       ntrees=500, learn_rate=0.01, stopping_rounds=5)

In [None]:
#sorted_cross_val_scores

In [None]:
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 'GrLivArea', 'Neighborhood'])

In [None]:
#sorted_cross_val_scores

In [None]:
# Do this after checking the earlier result to see if Neighborhood is the factor to be included.
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 'GrLivArea', 'Neighborhood'],
#                                                       ntrees=500, learn_rate=0.01, stopping_rounds=5)

In [None]:
#sorted_cross_val_scores

In [None]:
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 'GrLivArea', 'Neighborhood'],
#                                                       ntrees=500, learn_rate=0.01)

In [None]:
#sorted_cross_val_scores

If the results shown above look right, go for the final model and apply stopping rounds to it (you may also want to check the 'heat' variable).
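
As a reminder of what stopping rounds do, here is a simplified patience rule: training stops once the validation score has failed to improve on its best value for `stopping_rounds` consecutive rounds. As far as we understand, H2O's own rule works on moving averages of the stopping metric across scoring events and also applies a tolerance, so this is a deliberately simplified sketch rather than H2O's implementation. The function name and the scores are hypothetical.

```python
def stopping_point(validation_scores, stopping_rounds):
    """Return how many rounds are run under a simple patience rule.

    Lower scores are better. stopping_rounds <= 0 disables early stopping,
    so all rounds are run.
    """
    if stopping_rounds <= 0:
        return len(validation_scores)
    best_score = float("inf")
    rounds_without_improvement = 0
    for round_number, score in enumerate(validation_scores, start=1):
        if score < best_score:
            best_score = score
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= stopping_rounds:
                return round_number
    return len(validation_scores)


# Made-up validation errors: improving until round 4, then drifting upwards.
toy_scores = [0.30, 0.25, 0.22, 0.21, 0.215, 0.22, 0.23, 0.24, 0.26, 0.20]

rounds_with_patience_5 = stopping_point(toy_scores, stopping_rounds=5)
rounds_with_patience_2 = stopping_point(toy_scores, stopping_rounds=2)
rounds_without_stopping = stopping_point(toy_scores, stopping_rounds=0)
print(rounds_with_patience_5, rounds_with_patience_2, rounds_without_stopping)
```

In the toy series, the late improvement in round 10 is never reached when stopping is enabled, which shows the trade-off: a small patience saves training time but can stop before a later improvement.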


In [None]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1'])

In [None]:
# sorted_cross_val_scores

In [None]:

# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond'])


In [None]:
# sorted_cross_val_scores

In [None]:

# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea'])

In [None]:
# sorted_cross_val_scores

In [None]:

# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF'])


In [None]:
# sorted_cross_val_scores

In [None]:


# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt'])


In [None]:
# sorted_cross_val_scores

In [None]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish'])



In [None]:
# sorted_cross_val_scores

In [None]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea'])



In [None]:
# sorted_cross_val_scores

In [None]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir'])



In [None]:
# sorted_cross_val_scores

In [None]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir',
#                                                         'BsmtFinType1'])



In [None]:
# sorted_cross_val_scores

In [None]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir',
#                                                         'BsmtFinType1',
#                                                         'KitchenAbvGr'])



In [None]:
# sorted_cross_val_scores

In [None]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir',
#                                                         'BsmtFinType1',
#                                                         'KitchenAbvGr',
#                                                         'Condition1'])



In [None]:
# sorted_cross_val_scores

In [None]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir',
#                                                         'BsmtFinType1',
#                                                         'KitchenAbvGr',
#                                                         'Condition1',
#                                                         'ScreenPorch'])



In [None]:
# sorted_cross_val_scores

In [None]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir',
#                                                         'BsmtFinType1',
#                                                         'KitchenAbvGr',
#                                                         'Condition1',
#                                                         'ScreenPorch',
#                                                         'BsmtFinType2'])



In [None]:
# sorted_cross_val_scores

In [None]:
# sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir',
#                                                         'BsmtFinType1',
#                                                         'KitchenAbvGr',
#                                                         'Condition1',
#                                                         'ScreenPorch',
#                                                         'BsmtFinType2',
#                                                         'Heating'])



In [None]:
# cross_val_score_this_model = get_cross_val_scores_given_cols(['OverallQual', 
#                                                         'Neighborhood', 
#                                                         'GrLivArea',
#                                                         'BsmtFinSF1',
#                                                         'OverallCond',
#                                                         'GarageArea',
#                                                         'TotalBsmtSF',
#                                                         'YearBuilt',
#                                                         'GarageFinish',
#                                                         'LotArea',
#                                                         'CentralAir',
#                                                         'BsmtFinType1',
#                                                         'KitchenAbvGr',
#                                                         'Condition1',
#                                                         'ScreenPorch',
#                                                         'BsmtFinType2',
#                                                         'Heating'])



In [None]:
# cross_val_score_this_model

In [None]:
# sorted_cross_val_scores

We see that the cross validation score is now decreasing, indicating that we have reached a plateau and that adding further columns no longer helps. Let us check the cross validation score for the selected columns.

In [None]:
final_cols = ['OverallQual', 
              'Neighborhood', 
              'GrLivArea',
              'BsmtFinSF1',
              'OverallCond',
              'GarageArea',
              'TotalBsmtSF',
              'YearBuilt',
              'GarageFinish',
              'LotArea',
              'CentralAir',
              'BsmtFinType1',
              'KitchenAbvGr',
              'Condition1',
              'ScreenPorch',
              'BsmtFinType2',
              'Heating']

In [None]:
cv_score = get_cross_val_scores_given_cols(final_cols)
print(cv_score)


#### Run at home BEGIN

In [None]:
'''
cv_score = get_cross_val_scores_given_cols(final_cols, ntrees=500, learn_rate=0.01)
print(cv_score)
'''


In [None]:
'''
cv_score = get_cross_val_scores_given_cols(final_cols, ntrees=500, learn_rate=0.01, stopping_rounds=5)
print(cv_score)
'''


#### Run at home END

#### Complete summary of the cross validation metrics.

In [None]:
# Do a 5 fold cross validation (nfolds=5) and keep the per-fold predictions.
hpr_cross_val = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
                                             seed=1, 
                                             nfolds=5,
                                             keep_cross_validation_predictions=True)
hpr_cross_val.train(x=final_cols, 
                    y=dep_var_col, 
                    training_frame=get_h2o_frame_with_rel_factors(train_validation_data))


In [None]:
hpr_cross_val

### Quantifying validation error on validation data set.

Strictly speaking this is not necessary, as we have already performed cross validation over the entire training data set. It is still useful, because the score on a fixed validation set is a yardstick that can be compared directly against other models, and the validation predictions can be reused when ensembling.

In [None]:
# Train on the training set only, then predict on the held-out validation set.
model_train_data = H2OGradientBoostingEstimator(model_id='housing_price_regression', 
                                                seed=1)
model_train_data.train(x=final_cols, 
                       y=dep_var_col, 
                       training_frame=get_h2o_frame_with_rel_factors(train_data))

predict_validation = model_train_data.predict(
    get_h2o_frame_with_rel_factors(validation_data))                

validation_data['Predictions'] = predict_validation['predict'].as_data_frame()
evaluate_model_score_given_predictions(validation_data['LogSalePrice'], validation_data['Predictions'])


#### 4 entries have values not encountered in the training set for the 'Heating' column.

Hence we are ignoring it for now. This issue is not present in the test data either.
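
One way to locate such categorical values is to compare the distinct values of each column across the two frames. The following is a minimal sketch on toy data; the helper name `unseen_levels` and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd

def unseen_levels(train_df, other_df, cols):
    """Map each column to the sorted values found in other_df but not in train_df."""
    unseen = {}
    for col in cols:
        known = set(train_df[col].dropna().unique())
        extra = set(other_df[col].dropna().unique()) - known
        if extra:
            unseen[col] = sorted(extra)
    return unseen

# Toy frames standing in for train_data and validation_data (hypothetical values).
toy_train = pd.DataFrame({'Heating': ['GasA', 'GasA', 'GasW'],
                          'CentralAir': ['Y', 'N', 'Y']})
toy_valid = pd.DataFrame({'Heating': ['GasA', 'Floor', 'OthW'],
                          'CentralAir': ['Y', 'Y', 'N']})

unseen_result = unseen_levels(toy_train, toy_valid, ['Heating', 'CentralAir'])
print(unseen_result)
```

In this notebook the same call would presumably be `unseen_levels(train_data, validation_data, final_cols)`, restricted to the categorical columns.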

In [None]:
# Dump validation data to a file, which will be used later for ensembling.
validation_data[['Id', 'LogSalePrice', 'Predictions']].to_csv('housing_price_h2o_gradient_boosting_validation.csv', 
                                                              index=False)

### Making predictions on test data

There is another important matter to consider. Before making predictions on the test data, we need to make sure that the test data itself is in good shape.

One problem that could be relevant here is the presence of null values in the test set in columns that have none in the training set. Let us check whether that is the case.

In [None]:
test_data[final_cols].isnull().sum()
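
If a side-by-side view is preferred, the null counts of both frames can be placed next to each other so that only columns where the test set is worse stand out. This is a sketch on toy data; `null_count_comparison` and the toy frames are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def null_count_comparison(train_df, test_df, cols):
    """Return the columns where test_df has more nulls than train_df."""
    summary = pd.DataFrame({
        'train_nulls': train_df[cols].isnull().sum(),
        'test_nulls': test_df[cols].isnull().sum(),
    })
    return summary[summary['test_nulls'] > summary['train_nulls']]

# Toy frames standing in for train_validation_data and test_data (hypothetical values).
toy_train = pd.DataFrame({'GarageArea': [400.0, 500.0], 'LotArea': [8000, 9000]})
toy_test = pd.DataFrame({'GarageArea': [np.nan, 450.0], 'LotArea': [7000, 10000]})

null_report = null_count_comparison(toy_train, toy_test, ['GarageArea', 'LotArea'])
print(null_report)
```

An empty result means that none of the selected columns has more missing values in the test set than in the training set.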

Okay, things look good. Let us generate predictions on the test set.

In [None]:
final_model = H2OGradientBoostingEstimator(model_id='housing_price_regression', seed=1)
final_model.train(x=final_cols, 
                  y=dep_var_col, 
                  training_frame=get_h2o_frame_with_rel_factors(train_validation_data))


In [None]:
predict_out = final_model.predict(
    get_h2o_frame_with_rel_factors(test_data))                


In [None]:
test_data['LogSalePrice'] = predict_out['predict'].as_data_frame()

In [None]:
test_data['SalePrice'] = np.exp(test_data['LogSalePrice'])
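
The back-transformation has to mirror whatever transformation produced `LogSalePrice` during cleaning. That cleaning step is not shown here, so the sketch below assumes a plain natural logarithm (`np.log`), for which `np.exp` is the inverse. If `np.log1p` had been used instead, the inverse would be `np.expm1`. The prices are made-up values.

```python
import numpy as np
import pandas as pd

# Made-up sale prices, used only to illustrate the round trip.
prices = pd.Series([100000.0, 208500.0, 755000.0])

# Assumption: the target was built with the natural log.
log_prices = np.log(prices)
recovered = np.exp(log_prices)

# Alternative pairing, should the cleaning step have used log1p.
recovered_log1p = np.expm1(np.log1p(prices))

print(np.allclose(recovered, prices))
print(np.allclose(recovered_log1p, prices))
```

Mixing the two pairings (for example `np.exp` on a `log1p` target) shifts every predicted price by 1, which is small for house prices but still a systematic error.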

In [None]:
test_data[['Id', 'SalePrice']].to_csv('housing_price_h2o_gradient_boosting_predictions.csv', index=False)