# h20.automl 
### Using h2o.automl to predict house sale precises for Kaggle contest

#### Step 1
Downloading the contest data 

In [1]:
# Import packages
import pandas as pd
import numpy as np

In [2]:
# Read in the data
otrain = pd.read_csv('train_clean.csv')
otest= pd.read_csv('test_clean.csv')

#### Step 2
Converting the "SalePrices" variable to log(1+x) form. This step is performed for linearizing/normalizing the data for easier model development. Detailed steps and graphs to show this change are shown in the housing jupyter file. 

In [3]:
otrain["SalePrice"] = np.log1p(otrain["SalePrice"])

#### Step 3
Using h2o to develop the models. Automl feature in h2o basically runs all types of models with random features. This is a quick and easy way to see which type of model may be best suited for any particular dataset, and then further analysis can be performed only on that type of models. 

In [4]:
import h2o
h2o.init(min_mem_size='2G', max_mem_size='4G')
SEED = 123

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


0,1
H2O cluster uptime:,15 hours 21 mins
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.24.0.1
H2O cluster version age:,2 months and 23 days
H2O cluster name:,H2O_from_python_architmanuja_t4nwyz
H2O cluster total nodes:,1
H2O cluster free memory:,3.146 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


In [5]:
train = h2o.H2OFrame(otrain)
test = h2o.H2OFrame(otest)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [6]:
# Split train into 70% training and 30% validation
train, valid = train.split_frame([0.7])

In [7]:
# creating the x and y variables/columns
ColsToDrop = ['id']
y = 'SalePrice'
X = [name for name in train.columns if name not in [y] + ColsToDrop]

In [8]:
# performing the h2o.automl feature - developing the models
from h2o.automl import H2OAutoML
aml = H2OAutoML(max_models=250, sort_metric = "RMSLE", nfolds = 3, seed=123)
aml.train(x=X, y=y, training_frame=train, validation_frame=valid)

AutoML progress: |████████████████████████████████████████████████████████| 100%


In [9]:
# looking at the top 10 models and their associated deviance and errors - since the Kaggle contest is compared based on RMSLE, 
# our models are also sorted based on lowest RMSLE
lb = aml.leaderboard
lb.head() 

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
StackedEnsemble_BestOfFamily_AutoML_20190624_075216,0.0169765,0.130294,0.0169765,0.0814412,0.0100683
StackedEnsemble_AllModels_AutoML_20190624_075216,0.0177232,0.133129,0.0177232,0.0817716,0.0102697
DeepLearning_grid_1_AutoML_20190624_075216_model_22,0.0179866,0.134114,0.0179866,0.0870901,0.0103632
GBM_4_AutoML_20190624_075216,0.0187382,0.136888,0.0187382,0.0932486,0.0106476
DeepLearning_grid_1_AutoML_20190624_075216_model_23,0.0190205,0.137915,0.0190205,0.0920885,0.0106967
GBM_2_AutoML_20190624_075216,0.0189531,0.13767,0.0189531,0.0934635,0.0107056
GBM_grid_1_AutoML_20190624_075216_model_16,0.0189193,0.137547,0.0189193,0.0921308,0.0107222
GBM_3_AutoML_20190624_075216,0.0191298,0.13831,0.0191298,0.0934249,0.0107683
GBM_grid_1_AutoML_20190624_075216_model_10,0.0192198,0.138635,0.0192198,0.0932753,0.0107812
GBM_grid_1_AutoML_20190624_075216_model_4,0.0191835,0.138505,0.0191835,0.0926702,0.0108045




In [10]:
# creating predictions for the test dataset based on the best model 
bestmodel = aml.leader
preds = bestmodel.predict(test)
predictions = h2o.as_list(preds)

stackedensemble prediction progress: |████████████████████████████████████| 100%




#### Step 4
Creating the submission file for submitting the data. 

In [11]:
# Create submission output
submission3 = pd.concat([otest.Id.reset_index(drop=True), predictions], axis=1)

In [12]:
# Change column names
submission3 = submission3.rename(index=str, columns={"predict": "SalePrice"})

In [13]:
# changing the SalePrice variable from log(x+1) values back to normal
submission3["SalePrice"] = np.expm1(submission3["SalePrice"])

In [14]:
# saving the data as a csv for easier submission
submission3.to_csv("submission_automl.csv", index=False)

## The Best Model gave a score of 0.12244 on Kaggle.

#### Step 5
Choosing the second best model

In [24]:
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
SecondBest = h2o.get_model([mid for mid in model_ids if "StackedEnsemble_AllModels" in mid][0])

In [25]:
preds = SecondBest.predict(test)
predictions = h2o.as_list(preds)

stackedensemble prediction progress: |████████████████████████████████████| 100%




In [26]:
# Create submission output
submission3 = pd.concat([otest.Id.reset_index(drop=True), predictions], axis=1)

In [27]:
# Change column names
submission3 = submission3.rename(index=str, columns={"predict": "SalePrice"})

In [28]:
# changing the SalePrice variable from log(x+1) values back to normal
submission3["SalePrice"] = np.expm1(submission3["SalePrice"])

In [29]:
# saving the data as a csv for easier submission
submission3.to_csv("submission_automl_secondbest.csv", index=False)

## The Second Best Model gave a score of 0.12649 on Kaggle (worse than the best model).

#### Step 6
Taking the average of values shown by best model and second best model

In [30]:
#running predictions on the best model again to use for calculating the average
bestmodel = aml.leader
preds = bestmodel.predict(test)
predictions = h2o.as_list(preds)
submission3 = pd.concat([otest.Id.reset_index(drop=True), predictions], axis=1)
submission3 = submission3.rename(index=str, columns={"predict": "SalePrice"})
submission3["SalePrice"] = np.expm1(submission3["SalePrice"])
BestModelValues = submission3["SalePrice"]

stackedensemble prediction progress: |████████████████████████████████████| 100%




In [32]:
#running predictions on the second best model again to use for calculating the average
preds = SecondBest.predict(test)
predictions = h2o.as_list(preds)
submission3 = pd.concat([otest.Id.reset_index(drop=True), predictions], axis=1)
submission3 = submission3.rename(index=str, columns={"predict": "SalePrice"})
submission3["SalePrice"] = np.expm1(submission3["SalePrice"])
SecondBestModelValues = submission3["SalePrice"]

stackedensemble prediction progress: |████████████████████████████████████| 100%




In [118]:
#creating the Average of the top two best models and saving it into a new variable
pre = pd.Series(range(1461,2920))
AverageValues = (BestModelValues + SecondBestModelValues)/2
AvgCSV = pd.DataFrame(list(zip(pre,AverageValues)),columns=['Id', 'SalePrice'])

In [119]:
# saving to a csv
AvgCSV.to_csv("submission_automl_AVERAGEMODEL.csv", index=False)

## The Average of the Best and the Second Best Models gave a score of 0.12170 on Kaggle (Best so far for StackedEnsemble Models using h2o.automl feature). This put us in the top 28% rank at the time of the submission. 