# 1. Importing dependencies 

In [1]:
import h2o
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 2. Starting H2O

H2O will automatically check if an instance is already running and connect to it.

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O cluster uptime:,2 hours 36 mins
H2O cluster timezone:,America/Sao_Paulo
H2O data parsing timezone:,UTC
H2O cluster version:,3.26.0.3
H2O cluster version age:,"21 days, 12 hours and 27 minutes"
H2O cluster name:,H2O_from_python_Semantix_zpo2sn
H2O cluster total nodes:,1
H2O cluster free memory:,3.279 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


# 3a. Importing synthetic dataset from local file

In [3]:
# Raw string so the backslashes in the Windows path are not treated as escapes
people = pd.read_csv(r'C:\Intel\people.csv')

In [4]:
people

Unnamed: 0.1,Unnamed: 0,id,bloodType,age,healthyEating,activeLifestyle,income
0,1,1,A,31.516143,4,3,31300
1,2,2,A,55.050341,3,4,52500
2,3,3,O,37.221915,7,3,38700
3,4,4,O,59.501818,7,6,56800
4,5,5,O,62.201962,2,7,54900
5,6,6,AB,20.141155,5,9,23900
6,7,7,B,42.820958,3,6,37700
7,8,8,A,59.943695,1,3,54600
8,9,9,A,43.917446,5,7,41700
9,10,10,A,39.460893,5,7,35800


# 3b. Importing synthetic dataset from H2O (if already there)

In [5]:
people = h2o.get_frame("people.hex")

In [6]:
people

C1,id,bloodType,age,healthyEating,activeLifestyle,income
1,1,A,31.5161,4,3,31300
2,2,A,55.0503,3,4,52500
3,3,O,37.2219,7,3,38700
4,4,O,59.5018,7,6,56800
5,5,O,62.202,2,7,54900
6,6,AB,20.1412,5,9,23900
7,7,B,42.821,3,6,37700
8,8,A,59.9437,1,3,54600
9,9,A,43.9174,5,7,41700
10,10,A,39.4609,5,7,35800




# 4. Splitting the dataset

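We create three subsets: training, validation, and test. With `ratios = [0.8, 0.1]`, roughly 80% of the rows go to training, 10% to validation, and the remainder to test.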

In [28]:
train, valid, test = people.split_frame(
    ratios = [0.8, 0.1],
    destination_frames = ["people_train", "people_valid", "people_test"],
    seed = 123
    )

Let's take a look at the size of the three sets

In [8]:
print("%d/%d/%d" % (train.nrows, valid.nrows, test.nrows))

788/118/94
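The sizes are not exactly 800/100/100 because `split_frame` splits probabilistically rather than counting out exact row quotas. The sketch below illustrates the general idea with NumPy; it is not H2O's internal implementation, and the function name `probabilistic_split` is made up for this example.

```python
import numpy as np

def probabilistic_split(n_rows, ratios, seed=123):
    """Assign each row to a split by drawing one uniform number per row."""
    rng = np.random.default_rng(seed)
    draws = rng.random(n_rows)
    # Cumulative boundaries, e.g. [0.8, 0.1] -> [0.8, 0.9];
    # anything above the last boundary falls into the final split.
    bounds = np.cumsum(ratios)
    # searchsorted maps each draw to the index of the bucket it falls into.
    labels = np.searchsorted(bounds, draws, side="right")
    return [np.flatnonzero(labels == k) for k in range(len(ratios) + 1)]

train_idx, valid_idx, test_idx = probabilistic_split(1000, [0.8, 0.1])
print("%d/%d/%d" % (len(train_idx), len(valid_idx), len(test_idx)))
```

Each run with a different seed gives slightly different sizes, which is why the counts only approximate the requested ratios.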


# 5. Retrieving H2O Frame 

We can now retrieve the three frames from the H2O cluster, using the keys we set in `destination_frames`

In [9]:
train = h2o.get_frame("people_train")
valid = h2o.get_frame("people_valid")
test =  h2o.get_frame("people_test")

# 6. Running a GBM model 

In [10]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [11]:
y = "income"
ignoreFields = [y, "id"]
x = [i for i in train.names if i not in ignoreFields]

In [12]:
m1 = H2OGradientBoostingEstimator(model_id = "defaults")
m1.train(x, y, train, 
        validation_frame = valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


Let's look at the model's MAE (mean absolute error) on the three datasets
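As a reminder of what the metric measures, MAE is the average of the absolute differences between actual and predicted values, in the same units as the target (here, income). A minimal NumPy sketch follows; the `predicted` values are invented purely for illustration and do not come from any model in this notebook.

```python
import numpy as np

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

# First three incomes from the table above; predictions are hypothetical.
actual = [31300, 52500, 38700]
predicted = [32000, 51000, 39000]
example_mae = mean_absolute_error(actual, predicted)
print(example_mae)
```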

In [13]:
m1.mae(train=True)

844.115231896415

In [14]:
m1.mae(valid=True)

1412.689692125053

In [15]:
perf = m1.model_performance(test)
perf.show()   # print the full set of test metrics
perf.mae()    # call the method; a bare `perf.mae` only returns the bound method


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 2381694.5886023785
RMSE: 1543.273983647226
MAE: 1328.2845678162523
RMSLE: 0.041702167209093526
Mean Residual Deviance: 2381694.5886023785

1328.2845678162523

# 7. Let's overfit the data 

We now bump up the number of trees and max_depth to overfit the data

In [16]:
m2 = H2OGradientBoostingEstimator(model_id = "overfit", 
                                 ntrees = 1000,
                                 max_depth = 10)
m2.train(x, y, train, 
        validation_frame = valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [17]:
print("Train: %d --> %d"% (m1.mae(train=True), m2.mae(train=True)))
print("Valid: %d --> %d"% (m1.mae(valid=True), m2.mae(valid=True)))
print(" Test: %d --> %d"% (perf.mae(), m2.model_performance(test).mae()))

Train: 844 --> 22
Valid: 1412 --> 1582
 Test: 1328 --> 1490
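The training error collapses while the validation and test errors rise, which is the signature of overfitting. One simple way to quantify this is the gap between validation and training MAE. The snippet below is only arithmetic on the rounded numbers printed above; the helper name `generalization_gap` is made up for this example.

```python
# MAE values copied from the output above (rounded to whole units).
results = {
    "defaults": {"train": 844, "valid": 1412, "test": 1328},
    "overfit": {"train": 22, "valid": 1582, "test": 1490},
}

def generalization_gap(metrics):
    """Validation MAE minus training MAE; larger means more overfitting."""
    return metrics["valid"] - metrics["train"]

for name, metrics in results.items():
    print("%s: gap = %d" % (name, generalization_gap(metrics)))
```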


# 8. Crossvalidation

An alternative to a fixed train/valid/test split is cross-validation: we keep only a test set aside and let the training data play both the training and the validation role, one fold at a time
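To make the idea concrete, here is a generic k-fold sketch in NumPy. It is not how H2O assigns folds internally; the function names are made up, and the "model" is just the mean of the training incomes, standing in for a real learner.

```python
import numpy as np

def k_fold_indices(n_rows, n_folds, seed=123):
    """Yield (training, holdout) index arrays, one pair per fold."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, n_folds)
    for k in range(n_folds):
        holdout = folds[k]
        training = np.concatenate(
            [folds[j] for j in range(n_folds) if j != k])
        yield training, holdout

def cross_validated_mae(y, n_folds):
    """MAE where every row is predicted by a model that never saw it."""
    y = np.asarray(y, dtype=float)
    errors = np.empty_like(y)
    for training, holdout in k_fold_indices(len(y), n_folds):
        prediction = y[training].mean()  # stand-in model: training mean
        errors[holdout] = np.abs(y[holdout] - prediction)
    return float(errors.mean())

# First nine incomes from the table in section 3.
incomes = [31300, 52500, 38700, 56800, 54900, 23900, 37700, 54600, 41700]
print(cross_validated_mae(incomes, n_folds=3))
```

Every row ends up in the holdout set exactly once, so the resulting error estimate uses all of the training data without ever scoring a row with a model that was fitted on it.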

In [18]:
train, test = people.split_frame(
    ratios = [0.897],
    destination_frames = ["people_train", "people_test"],
    seed = 123
    )

In [19]:
print("%d/%d" % (train.nrows, test.nrows))

900/100


In [20]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [21]:
y = "income"
ignoreFields = [y, "id"]
x = [i for i in train.names if i not in ignoreFields]

In [22]:
m3 = H2OGradientBoostingEstimator(model_id = "def9folds", 
                                  nfolds = 9)
m3.train(x, y, train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [23]:
m3.mae(train=True)

908.9596506076389

In [24]:
m3.mae(xval=True)

1328.8341969124283

In [25]:
perf = m3.model_performance(test)
perf.mae()

1284.04442177443

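This model did better on the test set than the first default model (MAE 1284 vs. 1328), probably because it was trained on more data (900 rows instead of 788). It still overfits, since the training MAE (909) is well below the cross-validated MAE (1329), but not by as much.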

# 9. Let's overfit the model with cross-validation

We'll increase the number of trees and max_depth

In [26]:
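# Distinct model_id so the earlier "overfit" model (m2) is not overwritten
m4 = H2OGradientBoostingEstimator(model_id = "overfit9folds", 
                                 ntrees = 1000,
                                 max_depth = 10,
                                 nfolds = 9)
# No validation_frame here: rows of the old `valid` split are now part of `train`
m4.train(x, y, train)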

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [27]:
print("Train: %d --> %d"% (m3.mae(train=True), m4.mae(train=True)))
print(" XVal: %d --> %d"% (m3.mae(xval=True), m4.mae(xval=True)))
print(" Test: %d --> %d"% (perf.mae(), m4.model_performance(test).mae()))

Train: 908 --> 27
 XVal: 1328 --> 1429
 Test: 1284 --> 1423


# 10. Conclusions 

Cross-validation lets the model train on more of the data and gives an error estimate that does not depend on a single validation split, at the cost of longer training time.

We can still run into overfitting if we're not careful
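To get a feel for the training-time cost mentioned above: with `nfolds = 9`, H2O builds one model per fold plus a final model on the full training frame, so ten models in total. The rough count below assumes no early stopping and H2O's default of `ntrees = 50` for the models that did not set it; the helper `trees_built` is made up for this example.

```python
def trees_built(ntrees, nfolds=0):
    """Total trees grown: one model per fold plus the final model."""
    models = nfolds + 1 if nfolds else 1
    return models * ntrees

print("m1:", trees_built(50))              # defaults, no cross-validation
print("m2:", trees_built(1000))            # overfit, no cross-validation
print("m3:", trees_built(50, nfolds=9))    # defaults, 9 folds
print("m4:", trees_built(1000, nfolds=9))  # overfit, 9 folds
```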