## Load the H2O AutoML class and try it against the diamonds dataset

In [5]:
import h2o
from h2o.automl import H2OAutoML

h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_171"; Java(TM) SE Runtime Environment (build 1.8.0_171-b11); Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
  Starting server from /Users/gamino/anaconda3/envs/py3/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/_3/pf52_xvx7cv8w9ht736sn03w0000gn/T/tmprql6kuz6
  JVM stdout: /var/folders/_3/pf52_xvx7cv8w9ht736sn03w0000gn/T/tmprql6kuz6/h2o_gamino_started_from_python.out
  JVM stderr: /var/folders/_3/pf52_xvx7cv8w9ht736sn03w0000gn/T/tmprql6kuz6/h2o_gamino_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime:,01 secs
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.18.0.11
H2O cluster version age:,"14 days, 12 hours and 46 minutes"
H2O cluster name:,H2O_from_python_gamino_fmllre
H2O cluster total nodes:,1
H2O cluster free memory:,3.556 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


### Load diamonds.csv

This file first needs to be uploaded to H2O Flow.

In [6]:
# _locate is a private H2O helper that resolves a relative path to an absolute one
from h2o.utils.shared_utils import _locate
data_path = _locate("diamonds.csv")
data = h2o.import_file(data_path)

Parse progress: |█████████████████████████████████████████████████████████| 100%


### Split data into train, test, validate

In [7]:
# Roughly 70% train, 15% test, and the remaining 15% validation
train, test, valid = data.split_frame(ratios=[.7, .15])
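
As far as I know, `split_frame` assigns rows to partitions probabilistically rather than cutting exact fractions, so the three frames are only approximately 70%, 15%, and 15% of the data. The sketch below illustrates that idea in plain Python. `split_rows` is a hypothetical helper written for this illustration, not H2O's implementation.

```python
import random
from itertools import accumulate


def split_rows(rows, ratios, seed=42):
    """Hypothetical helper: assign each row to a partition at random.

    `ratios` lists the share of every partition except the last one,
    which receives the remainder (so len(ratios) + 1 parts come back).
    """
    rng = random.Random(seed)
    thresholds = list(accumulate(ratios))  # cumulative shares, e.g. 0.7 then about 0.85
    parts = [[] for _ in range(len(ratios) + 1)]
    for row in rows:
        draw = rng.random()
        # Count the thresholds at or below the draw to pick the partition
        index = sum(draw >= t for t in thresholds)
        parts[index].append(row)
    return parts


train_rows, test_rows, valid_rows = split_rows(range(10_000), [.7, .15])
print(len(train_rows), len(test_rows), len(valid_rows))
```

The partition sizes vary slightly with the seed, which is why a fixed seed is useful when results need to be reproducible.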


### Show columns

In [8]:
train.columns

['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z']

In [9]:
test

carat,cut,color,clarity,depth,table,price,x,y,z
0.26,Very Good,H,SI1,61.9,55,337,4.07,4.11,2.53
0.31,Ideal,J,SI2,62.2,54,344,4.35,4.37,2.71
0.2,Premium,E,SI2,60.2,62,345,3.79,3.75,2.27
0.23,Very Good,D,VS2,60.5,61,357,3.96,3.97,2.4
0.31,Good,H,SI1,64.0,54,402,4.29,4.31,2.75
0.25,Very Good,E,VS2,63.3,60,404,4.0,4.03,2.54
0.3,Premium,H,SI1,62.9,59,554,4.28,4.24,2.68
0.3,Premium,H,SI1,62.5,57,554,4.29,4.25,2.67
0.26,Very Good,E,VVS2,59.9,58,554,4.15,4.23,2.51
0.26,Very Good,D,VVS2,62.8,60,554,4.01,4.05,2.53




### Get columns for training data and for target value

x contains the names of the columns that will be used as features.

y is the name of the target variable.

In [10]:
# Identify predictors and response
x = train.columns
y = "price"
x.remove(y)

In [11]:
x

['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'x', 'y', 'z']

In [12]:
y

'price'
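
The same selection can be written without mutating a list in place. This is a small plain-Python sketch using the column names shown above; the names `features` and `target` are my own choice, picked to avoid confusion with the dataset's own `x` and `y` columns.

```python
# Column names as printed by train.columns above
columns = ['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z']

target = "price"
# Keep every column except the target, leaving `columns` untouched
features = [name for name in columns if name != target]

print(features)
print(target)
```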

### This is where the magic happens!

H2OAutoML trains several algorithms on the training set and ranks the resulting models to find the best one.


In [13]:
# Run AutoML for 180 seconds
aml = H2OAutoML(max_runtime_secs=180)
aml.train(x=x, y=y, training_frame=train)

AutoML progress: |████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


### The leaderboard shows the best-performing models

In [65]:
# View the AutoML Leaderboard
lb = aml.leaderboard
lb

model_id,mean_residual_deviance,rmse,mae,rmsle
GBM_grid_0_AutoML_20180607_180617_model_2,286930,535.658,274.17,
GBM_grid_0_AutoML_20180607_181251_model_3,288110,536.758,272.205,0.0942363
GBM_grid_0_AutoML_20180607_180617_model_1,288308,536.944,277.406,0.102236
GBM_grid_0_AutoML_20180607_181251_model_1,288673,537.283,278.491,0.101685
GBM_grid_0_AutoML_20180607_181251_model_5,289240,537.811,277.324,0.100033
GBM_grid_0_AutoML_20180607_180617_model_0,291223,539.651,281.23,0.101893
GBM_grid_0_AutoML_20180607_180617_model_3,291779,540.165,271.793,
GBM_grid_0_AutoML_20180607_181251_model_2,291884,540.263,277.073,0.0994824
GBM_grid_0_AutoML_20180607_181251_model_0,293097,541.384,281.363,0.102533
GBM_grid_0_AutoML_20180607_181251_model_4,298746,546.577,275.188,0.0990725
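
The leaderboard above is sorted by `mean_residual_deviance` (and therefore by `rmse`), but the ranking depends on the metric. A minimal sketch, using the `rmse` and `mae` values copied from the table, shows that a different model comes out on top when ranking by `mae`:

```python
# (model_id, rmse, mae) copied from the leaderboard above
leaderboard = [
    ("GBM_grid_0_AutoML_20180607_180617_model_2", 535.658, 274.17),
    ("GBM_grid_0_AutoML_20180607_181251_model_3", 536.758, 272.205),
    ("GBM_grid_0_AutoML_20180607_180617_model_1", 536.944, 277.406),
    ("GBM_grid_0_AutoML_20180607_181251_model_1", 537.283, 278.491),
    ("GBM_grid_0_AutoML_20180607_181251_model_5", 537.811, 277.324),
    ("GBM_grid_0_AutoML_20180607_180617_model_0", 539.651, 281.23),
    ("GBM_grid_0_AutoML_20180607_180617_model_3", 540.165, 271.793),
    ("GBM_grid_0_AutoML_20180607_181251_model_2", 540.263, 277.073),
    ("GBM_grid_0_AutoML_20180607_181251_model_0", 541.384, 281.363),
    ("GBM_grid_0_AutoML_20180607_181251_model_4", 546.577, 275.188),
]

best_by_rmse = min(leaderboard, key=lambda row: row[1])
best_by_mae = min(leaderboard, key=lambda row: row[2])

print("Lowest RMSE:", best_by_rmse[0])
print("Lowest MAE: ", best_by_mae[0])
```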




### Make predictions

In [17]:
preds = aml.predict(test)

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%


In [18]:
# These are the prices that need to be predicted
test['price']

price
337
344
345
357
402
404
554
554
554
554




In [19]:
# These are the predictions
preds

predict
463.944
403.19
573.689
445.473
513.694
501.353
590.693
586.282
653.47
596.303
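
To put a number on the gap between the two columns above, here is a plain-Python sketch that computes MAE and RMSE over the ten displayed rows. These are only the first rows of the test frame, so the result is an illustration and not the model's overall test score.

```python
from math import sqrt

# First ten test prices and AutoML predictions, copied from the outputs above
actual = [337, 344, 345, 357, 402, 404, 554, 554, 554, 554]
predicted = [463.944, 403.19, 573.689, 445.473, 513.694,
             501.353, 590.693, 586.282, 653.47, 596.303]

errors = [p - a for p, a in zip(predicted, actual)]

mae = sum(abs(e) for e in errors) / len(errors)      # mean absolute error
rmse = sqrt(sum(e * e for e in errors) / len(errors))  # root mean squared error

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print("All predictions above the true price:", all(e > 0 for e in errors))
```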




## Now let's try a Gradient Boosting Machine (GBM)

This compares AutoML against a single, manually configured H2O GBM.

In [23]:
# Import H2O GBM:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [24]:
# distribution is gaussian since this is a regression problem
model = H2OGradientBoostingEstimator(distribution='gaussian',
                                    ntrees=100,
                                    max_depth=4,
                                    learn_rate=0.1)

In [25]:
model.train(x=x, y=y, training_frame=train, validation_frame=valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


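### RMSE slightly better on the training data

The training RMSE is 532.40, slightly lower than the 535.66 of the AutoML leader. The comparison is not like-for-like, though: this figure is reported on the training data, and the same model scores 544.46 on the validation data.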

In [26]:
print(model)

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1528421256106_1


ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 283451.95732323057
RMSE: 532.4020635978326
MAE: 288.7472782782823
RMSLE: 0.1152764036062172
Mean Residual Deviance: 283451.95732323057

ModelMetricsRegression: gbm
** Reported on validation data. **

MSE: 296436.3973642804
RMSE: 544.4597297911761
MAE: 294.18529021312133
RMSLE: 0.11717214588231087
Mean Residual Deviance: 296436.3973642804
Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance,validation_rmse,validation_mae,validation_deviance
,2018-06-07 21:36:57,0.001 sec,0.0,3985.1110417,3023.0329563,15881110.0148942,3992.4900280,3024.8641092,15939976.6240625
,2018-06-07 21:36:57,0.050 sec,1.0,3622.9033243,2732.0680656,13125428.4970491,3625.6343790,2731.1352458,13145224.6503541
,2018-06-07 21:36:57,0.085 sec,2.0,3297.9048149,2472.2462836,10876176.1683887,3300.0325754,2470.9469260,10890214.9989235
,2018-06-07 21:36:57,0.103 sec,3.0,3007.3107088,2241.5720048,9043917.6991896,3007.4356649,2239.6791626,9044669.2787503
,2018-06-07 21:36:57,0.125 sec,4.0,2746.5965518,2034.6642295,7543792.6182017,2745.8599775,2032.6625673,7539747.0160709
---,---,---,---,---,---,---,---,---,---
,2018-06-07 21:36:58,1.399 sec,96.0,533.5118830,289.2538225,284634.9292559,545.0672241,294.4237733,297098.2787551
,2018-06-07 21:36:58,1.411 sec,97.0,533.2759382,288.9947942,284383.2263024,544.7951581,294.1768023,296801.7642895
,2018-06-07 21:36:58,1.423 sec,98.0,532.9586042,288.9296683,284044.8737432,544.3982221,294.0963496,296369.4242790



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
y,1434649362432.0000000,1.0,0.4611877
carat,1314949824512.0000000,0.9165653,0.4227086
clarity,197562040320.0000000,0.1377075,0.0635090
color,99812540416.0000000,0.0695728,0.0320861
z,38503088128.0000000,0.0268380,0.0123773
x,20378761216.0000000,0.0142047,0.0065510
cut,3082376192.0000000,0.0021485,0.0009909
depth,1331828352.0000000,0.0009283,0.0004281
table,501426400.0000000,0.0003495,0.0001612
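
The three numeric columns in the table above appear to be related in a simple way: `scaled_importance` is the relative importance divided by the largest value, and `percentage` is the relative importance divided by the total. The sketch below recomputes both from the printed `relative_importance` values so they can be checked against the table.

```python
# relative_importance values copied from the table above
relative = {
    "y": 1434649362432.0,
    "carat": 1314949824512.0,
    "clarity": 197562040320.0,
    "color": 99812540416.0,
    "z": 38503088128.0,
    "x": 20378761216.0,
    "cut": 3082376192.0,
    "depth": 1331828352.0,
    "table": 501426400.0,
}

top = max(relative.values())
total = sum(relative.values())

scaled = {name: value / top for name, value in relative.items()}
percentage = {name: value / total for name, value in relative.items()}

for name in relative:
    print(f"{name:8s} scaled={scaled[name]:.7f} percentage={percentage[name]:.7f}")
```

Read this way, `y` (the diamond's width column, not the target) and `carat` together account for most of the importance.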





In [27]:
# Show performance on the test set
perf = model.model_performance(test)

In [28]:
perf


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 295466.5312393864
RMSE: 543.5683317112821
MAE: 297.6899932238606
RMSLE: 0.11839695754707748
Mean Residual Deviance: 295466.5312393864
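
As a sanity check on the metric blocks, RMSE should be the square root of MSE. The sketch below verifies this for the train, validation, and test values printed in this notebook. Note also that the mean residual deviance equals the MSE in each block, which is what one would expect with a gaussian distribution.

```python
from math import isclose, sqrt

# (MSE, RMSE) pairs copied from the metric blocks in this notebook
reported = {
    "train": (283451.95732323057, 532.4020635978326),
    "validation": (296436.3973642804, 544.4597297911761),
    "test": (295466.5312393864, 543.5683317112821),
}

# Compare sqrt(MSE) with the reported RMSE, allowing for rounding
checks = {
    name: isclose(sqrt(mse), rmse, rel_tol=1e-6)
    for name, (mse, rmse) in reported.items()
}

for name, ok in checks.items():
    print(f"{name:10s} RMSE == sqrt(MSE): {ok}")
```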




In [29]:
test['price']

price
337
344
345
357
402
404
554
554
554
554




In [30]:
# Compare these predicted values with the test prices above
model.predict(test)

gbm prediction progress: |████████████████████████████████████████████████| 100%


predict
361.707
269.039
842.117
565.522
477.192
532.418
525.352
525.352
625.238
655.455
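
For a rough side-by-side view, the sketch below compares the manual GBM and the AutoML predictions on the ten displayed rows, counting which model lands closer to the true price on each row. Ten rows are far too few to rank the models; the test-set metrics above are the figures to rely on.

```python
# First ten test prices and both sets of predictions, copied from the outputs above
actual = [337, 344, 345, 357, 402, 404, 554, 554, 554, 554]
automl_preds = [463.944, 403.19, 573.689, 445.473, 513.694,
                501.353, 590.693, 586.282, 653.47, 596.303]
gbm_preds = [361.707, 269.039, 842.117, 565.522, 477.192,
             532.418, 525.352, 525.352, 625.238, 655.455]

automl_err = [abs(p - a) for p, a in zip(automl_preds, actual)]
gbm_err = [abs(p - a) for p, a in zip(gbm_preds, actual)]

# Number of rows where the manual GBM is closer to the true price
gbm_closer = sum(g < a for g, a in zip(gbm_err, automl_err))

automl_mae = sum(automl_err) / len(automl_err)
gbm_mae = sum(gbm_err) / len(gbm_err)

print(f"GBM closer on {gbm_closer} of {len(actual)} rows")
print(f"AutoML MAE: {automl_mae:.2f}")
print(f"GBM MAE:    {gbm_mae:.2f}")
```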


