### Basic Overview

We will explore random forest methods to build a model that predicts housing prices from the available data.

As detailed in https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/, we 
will use h2o instead of sklearn because h2o is better suited to handling categorical variables.
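
To make the concern about categorical variables concrete, here is a minimal sketch in plain Python (no h2o or sklearn involved). It compares the best split available when a categorical column has been one-hot encoded, where one split can only separate a single level from the rest, with the best split when the tree may send any subset of levels to the left. The toy rows, the level names, and the helper names `sse` and `split_sse` are made up for illustration.

```python
from itertools import combinations

# Toy data: (categorical level, log sale price). Values are invented.
rows = [
    ("A", 11.0), ("A", 11.2),
    ("B", 11.1), ("B", 11.3),
    ("C", 12.0), ("C", 12.2),
    ("D", 12.1), ("D", 12.3),
]


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_sse(left_levels):
    """Total SSE after sending `left_levels` left and all other levels right."""
    left = [y for level, y in rows if level in left_levels]
    right = [y for level, y in rows if level not in left_levels]
    return sse(left) + sse(right)


levels = sorted({level for level, _ in rows})

# One-hot encoding: a single split isolates exactly one level.
best_one_hot = min(split_sse({level}) for level in levels)

# Native categorical handling: a single split may group any subset of levels.
best_subset = min(
    split_sse(set(combo))
    for k in range(1, len(levels))
    for combo in combinations(levels, k)
)

print("best one-vs-rest split SSE:", round(best_one_hot, 4))
print("best subset split SSE:     ", round(best_subset, 4))
```

In this toy case the subset split groups the two cheap levels against the two expensive ones in one step, which a one-vs-rest split cannot do.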

In [1]:
import pandas as pd
import numpy as np
import sys
sys.path.append('../../common_routines/')
from relevant_functions import (
    evaluate_model_score_given_predictions,
    evaluate_model_score,
)


#### Get clean data first

In [2]:
train_data = pd.read_csv('../../cleaned_input/train_data.csv')
validation_data = pd.read_csv('../../cleaned_input/validation_data.csv')

In [3]:
train_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotArea', 'Street', 'LotShape',
       'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       ...
       'LogGarageArea', 'LogWoodDeckSF', 'LogOpenPorchSF', 'LogEnclosedPorch',
       'Log3SsnPorch', 'LogScreenPorch', 'LogPoolArea', 'LogMiscVal',
       'LogSalePrice', 'LogMasVnrArea_times_not_missing'],
      dtype='object', length=102)

In [4]:
train_validation_data = pd.concat([train_data, validation_data])

In [5]:
test_data = pd.read_csv('../../input/test.csv')

In [6]:
test_data.isnull().sum().sum()

7000

In [7]:
## Are they indeed clean ?
train_data.isnull().sum().any()

False

In [8]:
validation_data.isnull().sum().any()

False

#### Get h2o up and running !


In [9]:
# Using h2o
import h2o
h2o.init(nthreads = -1, max_mem_size = 15)

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_121"; OpenJDK Runtime Environment (Zulu 8.20.0.5-macosx) (build 1.8.0_121-b15); OpenJDK 64-Bit Server VM (Zulu 8.20.0.5-macosx) (build 25.121-b15, mixed mode)
  Starting server from /Users/babs4JESUS/anaconda3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/cz/3nvpl4mj0g5ds3hlsc15wxdr0000gn/T/tmpaezrj8r2
  JVM stdout: /var/folders/cz/3nvpl4mj0g5ds3hlsc15wxdr0000gn/T/tmpaezrj8r2/h2o_babs4JESUS_started_from_python.out
  JVM stderr: /var/folders/cz/3nvpl4mj0g5ds3hlsc15wxdr0000gn/T/tmpaezrj8r2/h2o_babs4JESUS_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O cluster uptime: 02 secs
H2O cluster timezone: America/New_York
H2O data parsing timezone: UTC
H2O cluster version: 3.24.0.1
H2O cluster version age: 9 days
H2O cluster name: H2O_from_python_babs4JESUS_dkx256
H2O cluster total nodes: 1
H2O cluster free memory: 13.33 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8


### Brief framework.

We will build the model according to the following framework (similar to what we did for PCA).

Given a set of columns, we should be able to do the following:

1. Train the model on the training set.

2. Validate it on the validation set.

3. Generate predictions on the test data.

4. Do cross-validation on the combined training/validation data.
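
Step 4 can be pictured with a small sketch of how k-fold cross-validation partitions the rows. This is plain Python for illustration only; h2o does its own fold assignment internally when `nfolds` is set, and the function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_rows, n_folds, seed=1):
    """Yield (train_indices, holdout_indices) pairs, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows round-robin into n_folds groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for holdout in folds:
        holdout_set = set(holdout)
        train = [i for i in indices if i not in holdout_set]
        yield train, holdout


for fold_number, (train, holdout) in enumerate(k_fold_indices(10, 5), start=1):
    print(fold_number, sorted(holdout), len(train))
```

Every row lands in exactly one holdout fold, so each row receives exactly one out-of-sample prediction.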


In [10]:
ALL_CATEGORICAL_COLUMNS = ['MSSubClass',
 'MSZoning',
 'Street',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'MasVnrType',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PavedDrive',
 'PoolQC',
 'Fence',
 'MiscFeature',
 'MoSold',
 'YrSold',
 'SaleType',
 'SaleCondition']

In [11]:
ALL_NUMERICAL_COLUMNS = ['LotArea',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea_times_not_missing',
 'MasVnrArea_not_missing',
 'BsmtFinSF1',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt_times_not_missing',
 'GarageYrBlt_not_missing',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal']

In [12]:
ALL_COLUMNS = ALL_CATEGORICAL_COLUMNS + ALL_NUMERICAL_COLUMNS

In [13]:
# Columns the model is trained on
# Check out ExterQual and YearBuilt instead of YearRemodAdd
# Check out BsmtCond
#cat_cols_in_model = ['MSSubClass', 'Neighborhood', 'ExterQual', 'Foundation', 'BsmtQual', 'BsmtCond',
#                     'BsmtFinType1']
#numeric_cols_in_model = ['GrLivArea', 'OverallQual', 'OverallCond', 'YearRemodAdd', 'BsmtFinSF1', 
#                         'TotalBsmtSF']
# Check out GarageCond
cat_cols_in_model = ['MSSubClass', 'Neighborhood']
numeric_cols_in_model = ['GrLivArea', 'OverallQual', 'OverallCond', 'YearRemodAdd', 'BsmtFinSF1', 
                         'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GarageArea', 'LowQualFinSF']

all_cols_in_model = cat_cols_in_model + numeric_cols_in_model

dep_var_col = 'LogSalePrice'

In [14]:
all_cols_in_model

['MSSubClass',
 'Neighborhood',
 'GrLivArea',
 'OverallQual',
 'OverallCond',
 'YearRemodAdd',
 'BsmtFinSF1',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'GarageArea',
 'LowQualFinSF']

#### Training model on the training set


In [15]:
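def get_h2o_frame_with_rel_factors(data):
    """Convert a pandas DataFrame to an H2OFrame with categorical columns as factors."""
    data_h2o = h2o.H2OFrame(data)
    # Mark every categorical column as a factor so h2o treats it as an enum.
    for col in ALL_CATEGORICAL_COLUMNS:
        data_h2o[col] = data_h2o[col].asfactor()
    return data_h2o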

In [16]:
from h2o.estimators.random_forest import H2ORandomForestEstimator

In [17]:
hpr_1 = H2ORandomForestEstimator(model_id='housing_price_regression', seed=1)
hpr_1.train(x=all_cols_in_model, 
            y=dep_var_col, 
            training_frame=get_h2o_frame_with_rel_factors(train_data))

Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%


In [18]:
predict_out = hpr_1.predict(
    get_h2o_frame_with_rel_factors(train_data))                
train_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

Parse progress: |█████████████████████████████████████████████████████████| 100%
drf prediction progress: |████████████████████████████████████████████████| 100%


In [19]:
#evaluate_model_score_given_predictions(np.log(train_data['Predictions'].values), 
#                                       np.log(train_data[dep_var_col].values))

In [20]:
evaluate_model_score_given_predictions((train_data['Predictions'].values), 
                                       (train_data[dep_var_col].values))

0.05519499883354487
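
The helper `evaluate_model_score_given_predictions` comes from `relevant_functions`, whose source is not shown in this notebook. Its result for the cross-validated model further down (0.13922626...) agrees with the RMSE that H2O reports for the same model, so a reasonable assumption is that it computes the root mean squared error between predictions and actual values. A minimal sketch under that assumption (the name `rmse` is ours, not the helper's):

```python
import math


def rmse(predictions, actuals):
    """Root mean squared error between two equal-length sequences."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Because the target is LogSalePrice, the error is measured on the log scale.
print(rmse([12.0, 11.5, 12.3], [12.1, 11.4, 12.3]))
```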

#### Inspect the output model

It may seem trivial, but we should inspect the model to see exactly what it does. This is especially important in data science, where it is very easy to get entangled in a quagmire of models and functions without clearly understanding what any of them does.


In [21]:
from h2o.tree import H2OTree
tree = H2OTree(model = hpr_1, tree_number = 0, tree_class = None)

In [22]:
tree

<h2o.tree.tree.H2OTree at 0x12576a710>

In [23]:
len(tree)

1349

In [24]:
print(tree)

Tree related to model housing_price_regression. Tree number is 0, tree class is 'None'




In [25]:
tree.levels

[None,
 ['Blueste',
  'BrDale',
  'BrkSide',
  'Edwards',
  'IDOTRR',
  'MeadowV',
  'Mitchel',
  'NAmes',
  'NPkVill',
  'OldTown',
  'SWISU',
  'Sawyer'],
 ['Blmngtn',
  'ClearCr',
  'CollgCr',
  'Crawfor',
  'Gilbert',
  'NWAmes',
  'NoRidge',
  'NridgHt',
  'SawyerW',
  'Somerst',
  'StoneBr',
  'Timber',
  'Veenker'],
 None,
 None,
 None,
 None,
 ['30', '40', '50', '70', '75', '160', '180', '190'],
 ['20', '45', '60', '80', '85', '90', '120'],
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 ['30', '40', '70', '160', '180'],
 ['50', '75', '190'],
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 ['20',
  '30',
  '40',
  '45',
  '50',
  '75',
  '80',
  '85',
  '90',
  '120',
  '160',
  '180',
  '190'],
 ['60', '70'],
 None,
 None,
 None,
 None,
 None,
 None,
 ['Blmngtn',
  'ClearCr',
  'CollgCr',
  'Gilbert',
  'NWAmes',
  'SawyerW',
  'Timber

In [26]:
tree.tree_number

0

In [27]:
tree.show()

Tree related to model housing_price_regression. Tree number is 0, tree class is 'None'




In [28]:
print(tree.root_node)

Node ID 0 
Left child node ID = 1
Right child node ID = 2

Splits on column Neighborhood
  - Categorical levels going to the left node: ['Blueste', 'BrDale', 'BrkSide', 'Edwards', 'IDOTRR', 'MeadowV', 'Mitchel', 'NAmes', 'NPkVill', 'OldTown', 'SWISU', 'Sawyer']
  - Categorical levels going to the right node: ['Blmngtn', 'ClearCr', 'CollgCr', 'Crawfor', 'Gilbert', 'NWAmes', 'NoRidge', 'NridgHt', 'SawyerW', 'Somerst', 'StoneBr', 'Timber', 'Veenker']

NA values go to the RIGHT


In [29]:
print(tree.root_node.left_child)

Node ID 1 
Left child node ID = 3
Right child node ID = 4

Splits on column GrLivArea
Split threshold < 1377.5 to the left node, >= 1377.5 to the right node 

NA values go to the LEFT
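
To check our reading of the two nodes printed above, here is a small plain-Python sketch that routes a row through them by hand. It does not use the H2OTree API; the function name `route_two_levels` and the dictionary row format are hypothetical, and sending a level that appears in neither list to the right is a simplifying assumption.

```python
# Levels that the root node (node 0) sends to its left child, as printed above.
ROOT_LEFT_LEVELS = {
    "Blueste", "BrDale", "BrkSide", "Edwards", "IDOTRR", "MeadowV",
    "Mitchel", "NAmes", "NPkVill", "OldTown", "SWISU", "Sawyer",
}

GRLIVAREA_THRESHOLD = 1377.5  # split threshold printed for node 1


def route_two_levels(row):
    """Return the ID of the node a row reaches after the first two splits."""
    neighborhood = row.get("Neighborhood")
    # Node 0: categorical split on Neighborhood; NA values go to the right.
    if neighborhood is None or neighborhood not in ROOT_LEFT_LEVELS:
        return 2
    # Node 1: numeric split on GrLivArea; NA values go to the left.
    area = row.get("GrLivArea")
    if area is None or area < GRLIVAREA_THRESHOLD:
        return 3
    return 4


print(route_two_levels({"Neighborhood": "NAmes", "GrLivArea": 1200}))
print(route_two_levels({"Neighborhood": "NridgHt", "GrLivArea": 2500}))
```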


#### Testing model on validation set

In [30]:
predict_out = hpr_1.predict(
    get_h2o_frame_with_rel_factors(validation_data))                
validation_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

Parse progress: |█████████████████████████████████████████████████████████| 100%
drf prediction progress: |████████████████████████████████████████████████| 100%


In [31]:
#evaluate_model_score_given_predictions(np.log(validation_data['Predictions'].values), 
#                                       np.log(validation_data[dep_var_col].values))

In [32]:
validation_score = evaluate_model_score_given_predictions((validation_data['Predictions'].values), 
                                                          (validation_data[dep_var_col].values))
print(validation_score)

0.13821357736206205


#### Generate predictions on test data

In [33]:
test_data_one_hot = pd.read_csv('../../cleaned_input/test_data_one_hot.csv')

In [34]:
test_data['LogMasVnrArea_times_not_missing'] = test_data_one_hot['LogMasVnrArea_times_not_missing']

In [35]:
test_data[all_cols_in_model].isnull().sum()

MSSubClass      0
Neighborhood    0
GrLivArea       0
OverallQual     0
OverallCond     0
YearRemodAdd    0
BsmtFinSF1      1
TotalBsmtSF     1
1stFlrSF        0
2ndFlrSF        0
GarageArea      1
LowQualFinSF    0
dtype: int64

In [36]:
hpr_1 = H2ORandomForestEstimator(model_id='housing_price_regression', seed=1)
hpr_1.train(x=all_cols_in_model, 
            y=dep_var_col, 
            training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%


In [37]:
predict_out = hpr_1.predict(
    get_h2o_frame_with_rel_factors(test_data))                
test_data['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()

Parse progress: |█████████████████████████████████████████████████████████| 100%
drf prediction progress: |████████████████████████████████████████████████| 100%




#### Cross validation

In [38]:
# Do a 5-fold cross validation (nfolds=5).
hpr_cross_val = H2ORandomForestEstimator(model_id='housing_price_regression', 
                                         seed=1, 
                                         nfolds=5,
                                         keep_cross_validation_predictions=True)
hpr_cross_val.train(x=all_cols_in_model, 
                    y=dep_var_col, 
                    training_frame=get_h2o_frame_with_rel_factors(train_validation_data))


Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%


In [39]:
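# Note: without call parentheses this evaluates to the bound method,
# so the notebook displays its repr (the model summary below).
hpr_cross_val.cross_validation_predictions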

Model Details
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  housing_price_regression


ModelMetricsRegression: drf
** Reported on train data. **

MSE: 0.019799110697422875
RMSE: 0.1407093127601115
MAE: 0.09757718828825032
RMSLE: 0.01093486090327993
Mean Residual Deviance: 0.019799110697422875

ModelMetricsRegression: drf
** Reported on cross-validation data. **

MSE: 0.01938395105795873
RMSE: 0.13922625850736178
MAE: 0.09577830267605716
RMSLE: 0.01083603165391223
Mean Residual Deviance: 0.01938395105795873
Cross-Validation Metrics Summary: 


| | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|
| mae | 0.0956881 | 0.0058530 | 0.1084217 | 0.0829694 | 0.0954006 | 0.0988700 | 0.0927787 |
| mean_residual_deviance | 0.0193422 | 0.0029547 | 0.0252202 | 0.0125499 | 0.0191893 | 0.0216115 | 0.0181401 |
| mse | 0.0193422 | 0.0029547 | 0.0252202 | 0.0125499 | 0.0191893 | 0.0216115 | 0.0181401 |
| r2 | 0.8790135 | 0.0117323 | 0.8703818 | 0.8985824 | 0.8720775 | 0.8562414 | 0.8977843 |
| residual_deviance | 0.0193422 | 0.0029547 | 0.0252202 | 0.0125499 | 0.0191893 | 0.0216115 | 0.0181401 |
| rmse | 0.1382107 | 0.0109537 | 0.1588085 | 0.1120262 | 0.1385254 | 0.1470084 | 0.134685 |
| rmsle | 0.0107523 | 0.0008834 | 0.0123688 | 0.0086034 | 0.0107532 | 0.0114827 | 0.0105534 |


Scoring History: 


| | timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance |
|---|---|---|---|---|---|---|
| | 2019-04-10 10:52:27 | 1.808 sec | 0.0 | | | |
| | 2019-04-10 10:52:27 | 1.833 sec | 1.0 | 0.1954341 | 0.1397954 | 0.0381945 |
| | 2019-04-10 10:52:27 | 1.854 sec | 2.0 | 0.1908295 | 0.1352168 | 0.0364159 |
| | 2019-04-10 10:52:27 | 1.873 sec | 3.0 | 0.1815573 | 0.1314005 | 0.0329631 |
| | 2019-04-10 10:52:27 | 1.892 sec | 4.0 | 0.1760658 | 0.1279177 | 0.0309992 |
| --- | --- | --- | --- | --- | --- | --- |
| | 2019-04-10 10:52:27 | 2.294 sec | 46.0 | 0.1414475 | 0.0981634 | 0.0200074 |
| | 2019-04-10 10:52:27 | 2.305 sec | 47.0 | 0.1413494 | 0.0980232 | 0.0199797 |
| | 2019-04-10 10:52:27 | 2.316 sec | 48.0 | 0.1411098 | 0.0977282 | 0.0199120 |



See the whole table with table.as_data_frame()
Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| OverallQual | 2756.4650879 | 1.0 | 0.3048628 |
| Neighborhood | 2168.5063477 | 0.7866983 | 0.2398350 |
| GrLivArea | 1208.7586670 | 0.4385177 | 0.1336877 |
| TotalBsmtSF | 714.8059692 | 0.2593198 | 0.0790570 |
| GarageArea | 542.7357788 | 0.1968956 | 0.0600261 |
| 1stFlrSF | 459.9797363 | 0.1668730 | 0.0508734 |
| YearRemodAdd | 366.4526367 | 0.1329430 | 0.0405294 |
| MSSubClass | 345.7257690 | 0.1254236 | 0.0382370 |
| BsmtFinSF1 | 175.9045105 | 0.0638153 | 0.0194549 |


<bound method ModelBase.cross_validation_predictions of >

In [40]:
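def get_cross_validated_rmse(hpr_cross_val):
    """Combine the per-fold holdout predictions and score them against LogSalePrice."""
    cv_preds = hpr_cross_val.cross_validation_predictions()
    # One prediction frame per fold. Summing them row-wise assembles a single
    # column of holdout predictions (this assumes each frame is zero outside its fold).
    result_cv = cv_preds[0]['predict'].as_data_frame().copy()
    for fold_preds in cv_preds[1:]:
        result_cv += fold_preds['predict'].as_data_frame()
    return evaluate_model_score_given_predictions(result_cv, train_validation_data['LogSalePrice'])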

In [41]:
get_cross_validated_rmse(hpr_cross_val)

0.13922626299526575

### Try somewhat of a greedy method to select columns

In this approach, we start with an empty model and, at every step, add the predictor that decreases the cross-validation error the most. We have not fully automated it (it would be extremely easy to do so), but have built the model manually after inspecting the results at each step.

As simple as it seems, this works pretty well. In fact, we were able to get a better model (a cross-validated RMSE of about 0.133 versus 0.139) with almost no extra effort!
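
For reference, the automation mentioned above could look roughly like the following sketch. It is not the procedure actually run in this notebook, which was done by hand. The function `greedy_forward_selection`, its `min_improvement` parameter, and the toy scoring function are hypothetical; in practice `score_fn` would wrap a cross-validated model fit.

```python
def greedy_forward_selection(candidates, score_fn, min_improvement=0.0):
    """Add, one at a time, the column that lowers the score the most.

    Stops when the best remaining candidate improves the score by no more
    than `min_improvement`. Lower scores are better (e.g. RMSE).
    """
    selected = []
    best_score = float("inf")
    remaining = list(candidates)
    while remaining:
        # Score every model formed by adding one more candidate column.
        scored = [(score_fn(selected + [col]), col) for col in remaining]
        score, col = min(scored)
        if best_score - score <= min_improvement:
            break
        selected.append(col)
        remaining.remove(col)
        best_score = score
    return selected, best_score


# Toy stand-in for a cross-validation score: each column lowers (or raises)
# the error by a fixed, invented amount.
TOY_GAINS = {"OverallQual": 0.20, "Neighborhood": 0.10,
             "GrLivArea": 0.05, "Street": -0.01}


def toy_score(cols):
    return 0.5 - sum(TOY_GAINS[c] for c in cols)


columns, score = greedy_forward_selection(list(TOY_GAINS), toy_score)
print(columns, round(score, 4))
```

With the helpers defined in this notebook, `score_fn` could be `get_cross_val_scores_given_cols` and `candidates` could be `ALL_COLUMNS`, at the cost of one cross-validated fit per candidate per step.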

In [42]:
import operator
def get_cross_val_scores_new_col(base_model_cols, 
                                 train_validation_data=train_validation_data,
                                 nfolds=5,
                                 dep_var_col='LogSalePrice'):
    columns_to_cross_val_score = dict()
    for col in ALL_COLUMNS:
        # If the column was already included, skip it.
        if col in base_model_cols:
            continue
        
        cur_model_cols = base_model_cols + [col]            
        print(cur_model_cols)

        # Do an nfolds-fold cross validation (5 folds by default).
        hpr_cross_val = H2ORandomForestEstimator(model_id='housing_price_regression', 
                                                 seed=1, 
                                                 nfolds=nfolds,
                                                 keep_cross_validation_predictions=True)
        hpr_cross_val.train(x=cur_model_cols, 
                            y=dep_var_col, 
                            training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

        cv_score = get_cross_validated_rmse(hpr_cross_val)

        columns_to_cross_val_score[col] = cv_score
    
    sorted_cross_val_scores = sorted(columns_to_cross_val_score.items(), key=operator.itemgetter(1))
    return sorted_cross_val_scores

In [43]:
def get_cross_val_scores_given_cols(cur_model_cols,
                                    train_validation_data=train_validation_data,
                                    nfolds=5,
                                    dep_var_col='LogSalePrice'):

    # Do an nfolds-fold cross validation (5 folds by default).
    hpr_cross_val = H2ORandomForestEstimator(model_id='housing_price_regression', 
                                             seed=1, 
                                             nfolds=nfolds,
                                             keep_cross_validation_predictions=True)
    hpr_cross_val.train(x=cur_model_cols, 
                        y=dep_var_col, 
                        training_frame=get_h2o_frame_with_rel_factors(train_validation_data))

    cv_score = get_cross_validated_rmse(hpr_cross_val)

    return cv_score

In [44]:
#sorted_cross_val_scores = get_cross_val_scores_new_col([])

In [45]:
#sorted_cross_val_scores

In [46]:
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual'])

In [47]:
#sorted_cross_val_scores

In [48]:
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 'Neighborhood'])

In [49]:
#sorted_cross_val_scores


In [50]:
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 'Neighborhood', 'GrLivArea'])

In [51]:
#sorted_cross_val_scores

In [52]:
#sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
#                                                        'Neighborhood', 
#                                                        'GrLivArea',
#                                                        'BsmtFinSF1'])

In [53]:
#sorted_cross_val_scores

In [54]:
'''
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
                                                        'Neighborhood', 
                                                        'GrLivArea',
                                                        'BsmtFinSF1',
                                                        'GarageCars'])
'''

In [55]:
#sorted_cross_val_scores

In [56]:
'''
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
                                                        'Neighborhood', 
                                                        'GrLivArea',
                                                        'BsmtFinSF1',
                                                        'GarageCars',
                                                        'OverallCond'])
'''

In [57]:
#sorted_cross_val_scores

In [58]:
'''
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
                                                        'Neighborhood', 
                                                        'GrLivArea',
                                                        'BsmtFinSF1',
                                                        'GarageCars',
                                                        'OverallCond',
                                                        'YearBuilt'])
'''

In [59]:
#sorted_cross_val_scores

In [60]:
'''
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
                                                        'Neighborhood', 
                                                        'GrLivArea',
                                                        'BsmtFinSF1',
                                                        'GarageCars',
                                                        'OverallCond',
                                                        'YearBuilt',
                                                        'LotArea'])
'''

In [61]:
#sorted_cross_val_scores

In [62]:
'''
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
                                                        'Neighborhood', 
                                                        'GrLivArea',
                                                        'BsmtFinSF1',
                                                        'GarageCars',
                                                        'OverallCond',
                                                        'YearBuilt',
                                                        'LotArea',
                                                        'Fireplaces'])
'''

In [63]:
#sorted_cross_val_scores

In [64]:
'''
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
                                                        'Neighborhood', 
                                                        'GrLivArea',
                                                        'BsmtFinSF1',
                                                        'GarageCars',
                                                        'OverallCond',
                                                        'YearBuilt',
                                                        'LotArea',
                                                        'Fireplaces',
                                                        'CentralAir'])
'''

In [65]:
#sorted_cross_val_scores

In [66]:
sorted_cross_val_scores = get_cross_val_scores_new_col(['OverallQual', 
                                                        'Neighborhood', 
                                                        'GrLivArea',
                                                        'BsmtFinSF1',
                                                        'GarageCars',
                                                        'OverallCond',
                                                        'YearBuilt',
                                                        'LotArea',
                                                        'Fireplaces',
                                                        'CentralAir',
                                                        'ExterCond'])



['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'GarageCars', 'OverallCond', 'YearBuilt', 'LotArea', 'Fireplaces', 'CentralAir', 'ExterCond', 'MSSubClass']
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'GarageCars', 'OverallCond', 'YearBuilt', 'LotArea', 'Fireplaces', 'CentralAir', 'ExterCond', 'MSZoning']
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'GarageCars', 'OverallCond', 'YearBuilt', 'LotArea', 'Fireplaces', 'CentralAir', 'ExterCond', 'Street']
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Ne

drf Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'GarageCars', 'OverallCond', 'YearBuilt', 'LotArea', 'Fireplaces', 'CentralAir', 'ExterCond', 'HeatingQC']
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'GarageCars', 'OverallCond', 'YearBuilt', 'LotArea', 'Fireplaces', 'CentralAir', 'ExterCond', 'Electrical']
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'GarageCars', 'OverallCond', 'YearBuilt', 'LotArea', 'Fireplaces', 'CentralAir', 'ExterCond', 'KitchenQual']
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Bui

Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'GarageCars', 'OverallCond', 'YearBuilt', 'LotArea', 'Fireplaces', 'CentralAir', 'ExterCond', 'BsmtFullBath']
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'GarageCars', 'OverallCond', 'YearBuilt', 'LotArea', 'Fireplaces', 'CentralAir', 'ExterCond', 'BsmtHalfBath']
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
['OverallQual', 'Neighborhood', 'GrLivArea', 'BsmtFinSF1', 'GarageCars', 'OverallCond', 'YearBuilt', 'LotArea', 'Fireplaces', 'CentralAir', 'ExterCond', 'FullBath']
Parse progr

In [67]:
sorted_cross_val_scores

[('YearRemodAdd', 0.13357889753316388),
 ('KitchenAbvGr', 0.13424023975099275),
 ('MSZoning', 0.1342506213873648),
 ('RoofMatl', 0.1343139636615266),
 ('2ndFlrSF', 0.13442283683669068),
 ('BldgType', 0.1344909458381618),
 ('MasVnrType', 0.13451367825811059),
 ('BsmtQual', 0.13457373111725413),
 ('KitchenQual', 0.13458859326117534),
 ('SaleCondition', 0.13458906806114015),
 ('BsmtFinType1', 0.13466932297464246),
 ('MiscFeature', 0.13471533408053524),
 ('BsmtExposure', 0.13472850803877107),
 ('3SsnPorch', 0.1347326983586935),
 ('PavedDrive', 0.1347579904448328),
 ('HalfBath', 0.1347948213206885),
 ('ExterQual', 0.13484117607853957),
 ('BedroomAbvGr', 0.13486481745263615),
 ('Functional', 0.13490828517001419),
 ('LowQualFinSF', 0.1349971910633761),
 ('TotalBsmtSF', 0.13501259407102242),
 ('Condition2', 0.1350321314567296),
 ('GarageYrBlt_not_missing', 0.135068976865008),
 ('PoolQC', 0.13509011638833593),
 ('BsmtUnfSF', 0.13509325551294343),
 ('GarageCond', 0.13510946370069643),
 ('Conditi

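The cross-validation scores for the remaining candidate columns are all tightly clustered (the best one, YearRemodAdd, gives about 0.1336), indicating that we have reached a plateau. Let us check the cross-validation score of the model without any further additions.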

In [68]:
final_cols = ['OverallQual', 
              'Neighborhood', 
              'GrLivArea', 
              'BsmtFinSF1',
              'GarageCars',
              'OverallCond',
              'YearBuilt',
              'LotArea',
              'Fireplaces',
              'CentralAir',
              'ExterCond']

In [69]:
cv_score = get_cross_val_scores_given_cols(final_cols)
print(cv_score)


Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
0.13318037477953817


#### Complete summary of the cross validation metrics.

In [70]:
# Do a 5-fold cross validation (nfolds=5).
hpr_cross_val = H2ORandomForestEstimator(model_id='housing_price_regression', 
                                         seed=1, 
                                         nfolds=5,
                                         keep_cross_validation_predictions=True)
hpr_cross_val.train(x=final_cols, 
                    y=dep_var_col, 
                    training_frame=get_h2o_frame_with_rel_factors(train_validation_data))


Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%


In [71]:
hpr_cross_val

Model Details
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  housing_price_regression


ModelMetricsRegression: drf
** Reported on train data. **

MSE: 0.01876046582064431
RMSE: 0.1369688498186515
MAE: 0.09613065753996555
RMSLE: 0.010654973596795337
Mean Residual Deviance: 0.01876046582064431

ModelMetricsRegression: drf
** Reported on cross-validation data. **

MSE: 0.0177370102727234
RMSE: 0.13318036744476794
MAE: 0.09314720558793582
RMSLE: 0.010349630298402355
Mean Residual Deviance: 0.0177370102727234
Cross-Validation Metrics Summary: 


,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,0.0931009,0.0058925,0.1077814,0.0831914,0.0878446,0.0950394,0.0916479
mean_residual_deviance,0.0177103,0.0027939,0.0245246,0.0128769,0.0156254,0.0191612,0.0163636
mse,0.0177103,0.0027939,0.0245246,0.0128769,0.0156254,0.0191612,0.0163636
r2,0.8892132,0.0097215,0.8739568,0.8959394,0.8958353,0.8725405,0.9077942
residual_deviance,0.0177103,0.0027939,0.0245246,0.0128769,0.0156254,0.0191612,0.0163636
rmse,0.1322852,0.0102709,0.1566032,0.1134766,0.1250018,0.1384239,0.1279203
rmsle,0.0102768,0.0008208,0.0121702,0.0087022,0.0096951,0.0107954,0.0100210


Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
,2019-04-10 10:56:32,0.931 sec,0.0,,,
,2019-04-10 10:56:32,0.941 sec,1.0,0.2173429,0.1545650,0.0472379
,2019-04-10 10:56:32,0.950 sec,2.0,0.2078081,0.1473969,0.0431842
,2019-04-10 10:56:32,0.958 sec,3.0,0.1991726,0.1426968,0.0396697
,2019-04-10 10:56:32,0.967 sec,4.0,0.1926141,0.1398376,0.0371002
---,---,---,---,---,---,---
,2019-04-10 10:56:33,1.333 sec,46.0,0.1377770,0.0962122,0.0189825
,2019-04-10 10:56:33,1.341 sec,47.0,0.1373581,0.0962104,0.0188673
,2019-04-10 10:56:33,1.351 sec,48.0,0.1371388,0.0961501,0.0188071



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
OverallQual,2514.7780762,1.0,0.2804140
GrLivArea,1697.8699951,0.6751570,0.1893234
Neighborhood,1609.9715576,0.6402042,0.1795222
YearBuilt,906.5980835,0.3605082,0.1010915
GarageCars,663.9132080,0.2640047,0.0740306
BsmtFinSF1,468.0740356,0.1861294,0.0521933
Fireplaces,447.5970764,0.1779867,0.0499100
LotArea,319.0757446,0.1268803,0.0355790
OverallCond,168.6915588,0.0670801,0.0188102




In [72]:
rmses = np.array([0.1566032, 0.1134766, 0.1250018, 0.1384239, 0.1279203])

In [73]:
rmses.std()

0.01452523640573192

In [74]:
rmses.mean()

0.13228515999999998
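The mean of the per-fold RMSEs (0.1323) is slightly lower than the RMSE of 0.1332 reported on cross-validation data. One likely reason, stated here as an assumption about how the overall figure is computed, is that the overall figure is based on the squared errors of all holdout predictions together rather than on an average of fold RMSEs. Since the square root is concave, the mean of the RMSEs can never exceed the RMSE obtained from the mean MSE. A minimal sketch using the per-fold numbers from the summary table:

```python
import numpy as np

# Per-fold MSE values copied from the cross-validation metrics summary.
fold_mse = np.array([0.0245246, 0.0128769, 0.0156254, 0.0191612, 0.0163636])
fold_rmse = np.sqrt(fold_mse)

# Average of the fold RMSEs (what rmses.mean() computes).
mean_of_rmse = fold_rmse.mean()

# RMSE derived from the average fold MSE (equal weight per fold).
rmse_of_mean_mse = np.sqrt(fold_mse.mean())

print(mean_of_rmse, rmse_of_mean_mse)
```

The second number is closer to the reported 0.1332; the remaining gap would be consistent with folds of slightly different sizes.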

### Quantifying validation error on validation data set.

Strictly speaking, this is not necessary, as we have already performed cross validation over the entire training data set. It is still useful, though: the score on the validation set is a yardstick that can be compared against other models, and the validation predictions can be used later for ensembling.

In [75]:
# Train on the training split only, then score on the held-out validation split.
model_train_data = H2ORandomForestEstimator(model_id='housing_price_regression', 
                                            seed=1)
model_train_data.train(x=final_cols, 
                       y=dep_var_col, 
                       training_frame=get_h2o_frame_with_rel_factors(train_data))

predict_validation = model_train_data.predict(
    get_h2o_frame_with_rel_factors(validation_data))                

validation_data['Predictions'] = predict_validation['predict'].as_data_frame()
evaluate_model_score_given_predictions(validation_data['LogSalePrice'], validation_data['Predictions'])


Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf prediction progress: |████████████████████████████████████████████████| 100%


0.12942634553451818
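The helper evaluate_model_score_given_predictions is imported from relevant_functions and its body is not shown in this notebook. Assuming it computes the root mean squared error between actual and predicted log prices (an assumption, though it would be consistent with the RMSE values above), a minimal equivalent could look like this; `rmse` and the toy values are hypothetical.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Toy log prices, not taken from the data set.
toy_actual = [12.0, 12.5, 11.8]
toy_predicted = [12.1, 12.3, 11.8]
toy_score = rmse(toy_actual, toy_predicted)
print(toy_score)
```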

In [76]:
# Dump validation data to a file, which will be used later for ensembling.
validation_data[['Id', 'LogSalePrice', 'Predictions']].to_csv('housing_price_random_forest_validation.csv', 
                                                              index=False)

### Making predictions on test data

Before making predictions on the test data, we need to make sure that the test data is usable.

One possible problem is that the test set contains many null values in columns that have none in the training set. Let us check whether that is the case.

In [77]:
test_data[final_cols].isnull().sum()

OverallQual     0
Neighborhood    0
GrLivArea       0
BsmtFinSF1      1
GarageCars      1
OverallCond     0
YearBuilt       0
LotArea         0
Fireplaces      0
CentralAir      0
ExterCond       0
dtype: int64

Okay, things look good: only BsmtFinSF1 and GarageCars have a single missing value each. Let us generate predictions on the test set.
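The notebook passes these columns to the model as they are. If the few missing entries ever had to be filled explicitly before prediction, one option (an assumption, not something this notebook does) is to impute them with statistics taken from the training data. `fill_missing_from_train` and the toy frames below are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_missing_from_train(train_df, test_df, cols):
    """Fill nulls in test_df[cols] with the training median (numeric) or mode."""
    filled = test_df.copy()
    for col in cols:
        if pd.api.types.is_numeric_dtype(train_df[col]):
            fill_value = train_df[col].median()
        else:
            fill_value = train_df[col].mode().iloc[0]
        filled[col] = filled[col].fillna(fill_value)
    return filled


# Toy frames that only mimic the column names used above.
toy_train = pd.DataFrame({'GarageCars': [1, 2, 2, 3],
                          'CentralAir': ['Y', 'Y', 'N', 'Y']})
toy_test = pd.DataFrame({'GarageCars': [2.0, np.nan],
                         'CentralAir': ['N', None]})
toy_filled = fill_missing_from_train(toy_train, toy_test,
                                     ['GarageCars', 'CentralAir'])
print(toy_filled)
```

Taking the fill values from the training data rather than from the test data keeps the test set from influencing the preprocessing.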

In [78]:
final_model = H2ORandomForestEstimator(model_id='housing_price_regression', seed=1)
final_model.train(x=final_cols, 
                  y=dep_var_col, 
                  training_frame=get_h2o_frame_with_rel_factors(train_validation_data))


Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%


In [79]:
predict_out = final_model.predict(
    get_h2o_frame_with_rel_factors(test_data))                


Parse progress: |█████████████████████████████████████████████████████████| 100%
drf prediction progress: |████████████████████████████████████████████████| 100%


In [80]:
test_data['LogSalePrice'] = predict_out['predict'].as_data_frame()

In [81]:
test_data['SalePrice'] = np.exp(test_data['LogSalePrice'])

In [82]:
test_data[['Id', 'SalePrice']].to_csv('housing_price_random_forest_predictions.csv', index=False)
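Converting the predictions back with np.exp is correct only if LogSalePrice was created with np.log. The cleaning step is not shown here, so that is an assumption; if np.log1p had been used instead, the matching inverse would be np.expm1. A small round-trip check on toy prices illustrates both pairs:

```python
import numpy as np

# Toy sale prices, not taken from the data set.
prices = np.array([120000.0, 250000.0])

# np.log is undone by np.exp.
recovered = np.exp(np.log(prices))

# np.log1p is undone by np.expm1.
recovered_1p = np.expm1(np.log1p(prices))

# Mixing the pairs leaves every price off by exactly one unit.
mismatch = np.exp(np.log1p(prices)) - prices
print(recovered, recovered_1p, mismatch)
```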