# h2o.automl
### Using h2o.automl to predict house sale prices for a Kaggle contest

#### Step 1
Loading the cleaned contest data.

In [26]:
# Import packages
import pandas as pd
import numpy as np

In [27]:
# Read in the data
otrain = pd.read_csv('train_clean.csv')
otest = pd.read_csv('test_clean.csv')

#### Step 2
Converting the "SalePrice" variable to log(1+x) form. This reduces the skew in the sale prices and brings the target closer to a normal distribution, which makes model development easier. Detailed steps and graphs showing this change are in the housing Jupyter file.

In [28]:
otrain["SalePrice"] = np.log1p(otrain["SalePrice"])
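
To make the transform concrete, here is a small standard-library sketch. The prices are made-up illustrative values, not contest data. It shows that `log1p` compresses the spread of large prices and that `expm1` undoes the transform, which is what Step 4 relies on.

```python
import math

# Made-up sale prices for illustration only (not from the contest data)
prices = [50_000, 150_000, 200_000, 755_000]

# log(1 + x) transform, the scalar equivalent of np.log1p
log_prices = [math.log1p(p) for p in prices]

# The raw prices differ by a factor of about 15; on the log scale
# the largest and smallest values are less than 3 units apart
raw_ratio = max(prices) / min(prices)
log_range = max(log_prices) - min(log_prices)

# expm1 is the inverse of log1p, so the original prices are recovered
recovered = [math.expm1(v) for v in log_prices]

print(f"raw max/min ratio: {raw_ratio:.1f}")
print(f"log-scale range:   {log_range:.2f}")
print([round(r) for r in recovered])
```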

#### Step 3
Using h2o to develop the models. The AutoML feature in h2o trains several families of models with varied hyperparameters (in this run: GBMs, deep learning grids, and stacked ensembles) and ranks them on a leaderboard. This is a quick and easy way to see which type of model may be best suited to a particular dataset, so that further analysis can focus on that type of model.
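
The idea of training several candidate models and ranking them by a cross-validated error can be sketched in a few lines of plain Python. This is only a toy illustration on synthetic data with two hand-written models; it is not how H2O implements AutoML, and every name in it (`fit_mean`, `fit_linear`, `cv_rmse`) is hypothetical.

```python
import math
import random

random.seed(123)

# Synthetic data: y depends linearly on x plus noise (illustration only)
xs = [random.uniform(0, 10) for _ in range(90)]
ys = [2.0 * x + 1.0 + random.gauss(0, 1) for x in xs]
data = list(zip(xs, ys))


def fit_mean(train):
    """Baseline: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_linear(train):
    """Ordinary least squares with a single feature."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def cv_rmse(fit, rows, nfolds=3):
    """Cross-validated RMSE: each fold is held out exactly once."""
    sq_errors = []
    for k in range(nfolds):
        holdout = rows[k::nfolds]
        train = [r for i, r in enumerate(rows) if i % nfolds != k]
        model = fit(train)
        sq_errors += [(model(x) - y) ** 2 for x, y in holdout]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Score every candidate and sort ascending by error, like a leaderboard
candidates = {"mean_baseline": fit_mean, "linear": fit_linear}
leaderboard = sorted((cv_rmse(f, data), name) for name, f in candidates.items())
for score, name in leaderboard:
    print(f"{name:15s} rmse={score:.3f}")
```

The sorted list plays the role of the leaderboard: the first entry is the candidate with the lowest held-out error.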

In [29]:
import h2o
h2o.init(min_mem_size='2G', max_mem_size='4G')
SEED = 123

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O cluster uptime: 1 hour 38 mins
H2O cluster timezone: America/New_York
H2O data parsing timezone: UTC
H2O cluster version: 3.24.0.1
H2O cluster version age: 2 months and 22 days
H2O cluster name: H2O_from_python_architmanuja_t4nwyz
H2O cluster total nodes: 1
H2O cluster free memory: 3.473 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4


In [30]:
train = h2o.H2OFrame(otrain)
test = h2o.H2OFrame(otest)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [31]:
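# Split train into roughly 70% training and 30% validation
# (split_frame assigns rows randomly, so the sizes are approximate; the seed makes the split repeatable)
train, valid = train.split_frame(ratios=[0.7], seed=SEED)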
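
As far as I understand it, `split_frame` assigns each row to a split at random rather than cutting at an exact row count, so the 70/30 proportions are only approximate and change between runs unless a seed is given. The sketch below imitates that behaviour in plain Python; `probabilistic_split` is a hypothetical helper for illustration, not H2O code.

```python
import random


def probabilistic_split(rows, ratio, seed):
    """Send each row to the first split with probability `ratio`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < ratio else second).append(row)
    return first, second


rows = list(range(1000))
train_rows, valid_rows = probabilistic_split(rows, 0.7, seed=123)

# Same seed, same split: this is what makes a run reproducible
train_again, _ = probabilistic_split(rows, 0.7, seed=123)

# The sizes are close to 700/300 but usually not exactly that
print(len(train_rows), len(valid_rows))
print(train_rows == train_again)
```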

In [32]:
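# Define the response (y) and predictor (X) columns
# The identifier column is named 'Id' (see Step 4), so it is excluded from the predictors under that name
ColsToDrop = ['Id']
y = 'SalePrice'
X = [name for name in train.columns if name not in [y] + ColsToDrop]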

In [34]:
# Run H2O AutoML to train and rank the candidate models
from h2o.automl import H2OAutoML
aml = H2OAutoML(max_models=20, sort_metric="RMSLE", nfolds=3, seed=SEED)
aml.train(x=X, y=y, training_frame=train, validation_frame=valid)

AutoML progress: |████████████████████████████████████████████████████████| 100%


In [36]:
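# Look at the top 10 models and their deviance and error metrics. The Kaggle contest is scored on RMSLE,
# so the leaderboard is sorted by lowest RMSLE. Note that SalePrice was already log-transformed in Step 2,
# so the rmse column is the one on the scale of the contest score; rmsle here takes a second log of the target.
lb = aml.leaderboard
lb.head()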

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
| --- | --- | --- | --- | --- | --- |
| StackedEnsemble_AllModels_AutoML_20190623_180943 | 0.0147122 | 0.121294 | 0.0147122 | 0.0809973 | 0.00938254 |
| StackedEnsemble_BestOfFamily_AutoML_20190623_180943 | 0.0155417 | 0.124667 | 0.0155417 | 0.083832 | 0.00964299 |
| DeepLearning_grid_1_AutoML_20190623_180943_model_1 | 0.0166823 | 0.12916 | 0.0166823 | 0.0897432 | 0.0100111 |
| DeepLearning_grid_1_AutoML_20190623_180943_model_5 | 0.0175625 | 0.132524 | 0.0175625 | 0.0901132 | 0.0102162 |
| DeepLearning_grid_1_AutoML_20190623_180943_model_3 | 0.0178059 | 0.133439 | 0.0178059 | 0.0904554 | 0.0103515 |
| DeepLearning_grid_1_AutoML_20190623_180943_model_4 | 0.0183866 | 0.135597 | 0.0183866 | 0.0897282 | 0.0104429 |
| GBM_2_AutoML_20190623_180943 | 0.0181754 | 0.134816 | 0.0181754 | 0.0943146 | 0.010472 |
| GBM_4_AutoML_20190623_180943 | 0.0183117 | 0.135321 | 0.0183117 | 0.095336 | 0.0105219 |
| GBM_3_AutoML_20190623_180943 | 0.0183322 | 0.135396 | 0.0183322 | 0.0954589 | 0.010523 |
| GBM_1_AutoML_20190623_180943 | 0.0188572 | 0.137321 | 0.0188572 | 0.0957997 | 0.0106557 |
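
A note on reading the metric columns: because `SalePrice` was replaced by `log1p(SalePrice)` in Step 2, an RMSE computed on the transformed target is already an RMSLE on the original prices. An RMSLE computed on the transformed target applies the log a second time, which is why the `rmsle` column is so much smaller than `rmse`. The sketch below checks this with made-up prices and hand-written `rmse` and `rmsle` helpers. These are illustrative definitions using `log1p`, not H2O's internal code.

```python
import math

# Made-up actual and predicted prices (illustration only)
actual = [120_000, 185_000, 310_000]
predicted = [130_000, 170_000, 300_000]


def rmse(a, b):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def rmsle(a, b):
    """RMSE computed after a log1p transform of both sequences."""
    return rmse([math.log1p(x) for x in a], [math.log1p(y) for y in b])


# What the models see after Step 2: log-transformed targets
log_actual = [math.log1p(x) for x in actual]
log_predicted = [math.log1p(x) for x in predicted]

rmse_on_logs = rmse(log_actual, log_predicted)    # same scale as the contest metric
rmsle_on_prices = rmsle(actual, predicted)        # identical by construction
rmsle_on_logs = rmsle(log_actual, log_predicted)  # a log of a log, much smaller

print(rmse_on_logs, rmsle_on_prices, rmsle_on_logs)
```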




In [47]:
# creating predictions for the test dataset based on the best model 
bestmodel = aml.leader
preds = bestmodel.predict(test)
predictions = h2o.as_list(preds)

stackedensemble prediction progress: |████████████████████████████████████| 100%




#### Step 4
Creating the submission file for the contest.

In [48]:
# Create submission output
submission3 = pd.concat([otest.Id.reset_index(drop=True), predictions], axis=1)

In [49]:
# Rename the prediction column to the name the contest expects
submission3 = submission3.rename(columns={"predict": "SalePrice"})

In [50]:
# Convert SalePrice from log(1+x) values back to the original price scale
submission3["SalePrice"] = np.expm1(submission3["SalePrice"])

In [51]:
# saving the data as a csv for easier submission
submission3.to_csv("submission_automl.csv", index=False)
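
Before uploading, it can be worth checking the file for obvious problems. The sketch below is one possible check using only the standard library. `check_submission` is a hypothetical helper, and the expected header (`Id`, `SalePrice`) is an assumption based on the columns built above. The demo runs on a throwaway file with made-up values rather than on `submission_automl.csv`.

```python
import csv
import math
import os
import tempfile


def check_submission(path):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        # Assumed layout: exactly two columns, Id then SalePrice
        if reader.fieldnames != ["Id", "SalePrice"]:
            problems.append(f"unexpected header: {reader.fieldnames}")
            return problems
        for line_no, row in enumerate(reader, start=2):
            try:
                price = float(row["SalePrice"])
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: non-numeric SalePrice")
                continue
            # A forgotten expm1 or a NaN prediction would show up here
            if not math.isfinite(price) or price <= 0:
                problems.append(f"line {line_no}: implausible SalePrice {price}")
    return problems


# Demo on a tiny throwaway file with made-up values
demo_path = os.path.join(tempfile.mkdtemp(), "demo_submission.csv")
with open(demo_path, "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerows([["Id", "SalePrice"], [1461, 125000.5], [1462, -3.0]])

print(check_submission(demo_path))
```

An empty list means the file passed these basic checks; it says nothing about how good the predictions are.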