## Tutorial 
Available at https://github.com/h2oai/h2o-tutorials/blob/master/h2o-open-tour-2016/chicago/intro-to-h2o.ipynb

In [1]:
import h2o
h2o.init(nthreads=-1, max_mem_size = 8)

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_121"; OpenJDK Runtime Environment (Zulu 8.20.0.5-macosx) (build 1.8.0_121-b15); OpenJDK 64-Bit Server VM (Zulu 8.20.0.5-macosx) (build 25.121-b15, mixed mode)
  Starting server from /opt/miniconda2/envs/py36h2o/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/32/40942p_n2jx6zw74vvpshddc0000gq/T/tmpi_j5lgp7
  JVM stdout: /var/folders/32/40942p_n2jx6zw74vvpshddc0000gq/T/tmpi_j5lgp7/h2o_etto_started_from_python.out
  JVM stderr: /var/folders/32/40942p_n2jx6zw74vvpshddc0000gq/T/tmpi_j5lgp7/h2o_etto_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


| Property | Value |
|---|---|
| H2O cluster uptime | 03 secs |
| H2O cluster timezone | Europe/Amsterdam |
| H2O data parsing timezone | UTC |
| H2O cluster version | 3.18.0.8 |
| H2O cluster version age | 4 days |
| H2O cluster name | H2O_from_python_etto_48kjf4 |
| H2O cluster total nodes | 1 |
| H2O cluster free memory | 7.111 Gb |
| H2O cluster total cores | 4 |
| H2O cluster allowed cores | 4 |


In [2]:
loan_csv = 'loan.csv'
data = h2o.import_file(loan_csv)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [6]:
data.shape

(163987, 15)

In [9]:
data.columns

['loan_amnt',
 'term',
 'int_rate',
 'emp_length',
 'home_ownership',
 'annual_inc',
 'purpose',
 'addr_state',
 'dti',
 'delinq_2yrs',
 'revol_util',
 'total_acc',
 'bad_loan',
 'longest_credit_length',
 'verification_status']

## Encode response variable
Since we want to train a binary classification model, we must ensure that the response is coded as a factor. If the response is 0/1, H2O will assume it's numeric, which means that H2O will train a regression model instead.

In [4]:
data['bad_loan'] = data['bad_loan'].asfactor()  # encode the binary response as a factor
data['bad_loan'].levels()  # optional: after encoding, this shows the two factor levels, '0' and '1'

[['0', '1']]

## Partition data

In [10]:
splits = data.split_frame(ratios=[0.7, 0.15], seed=1)
train = splits[0]
valid = splits[1]
test = splits[2]

Notice that split_frame() uses approximate splitting rather than exact splitting (for efficiency), so the three parts are not exactly 70%, 15% and 15% of the total rows.

In [16]:
print('samples: {:.0f} => {:.2f} %'.format(train.nrow, train.nrow * 100 / data.nrow))
print('samples: {:.0f} => {:.2f} %'.format(valid.nrow, valid.nrow * 100 / data.nrow))
print('samples: {:.0f} => {:.2f} %'.format(test.nrow, test.nrow * 100 / data.nrow))

samples: 114908 => 70.07 %
samples: 24498 => 14.94 %
samples: 24581 => 14.99 %
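Why the percentages above are close to, but not exactly, 70/15/15 can be illustrated with a small sketch. It assumes a probabilistic scheme in which every row draws a uniform random number and is assigned to a part by cumulative thresholds. The function name `approximate_split` is hypothetical, and this is an illustration of the idea, not H2O's actual implementation.

```python
import random

def approximate_split(n_rows, ratios, seed=1):
    """Assign each row to a part by drawing a uniform random number.

    Hypothetical helper for illustration only. The last part receives
    whatever probability mass is left over (1 - sum(ratios)).
    """
    rng = random.Random(seed)
    # Cumulative thresholds, e.g. ratios [0.7, 0.15] -> [0.7, 0.85]
    thresholds = []
    total = 0.0
    for r in ratios:
        total += r
        thresholds.append(total)
    counts = [0] * (len(ratios) + 1)
    for _ in range(n_rows):
        u = rng.random()
        # Index of the part = number of thresholds that u has passed
        part = sum(u >= t for t in thresholds)
        counts[part] += 1
    return counts

n_rows = 163987  # number of rows reported by data.shape
counts = approximate_split(n_rows, [0.7, 0.15], seed=1)
for c in counts:
    print('samples: {:.0f} => {:.2f} %'.format(c, c * 100 / n_rows))
```

Because each row is assigned independently, no global shuffle or exact count is needed, at the cost of part sizes that fluctuate slightly around the requested ratios.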


## Identify response and predictor variables

### x denotes the input variables; y denotes the dependent variable

y = 'bad_loan'
x = list(data.columns)  # work on a copy of the column names

In [20]:
x.remove(y)
x.remove('int_rate') # is correlated with outcome

### list of predictor columns 

In [22]:
x

['loan_amnt',
 'term',
 'emp_length',
 'home_ownership',
 'annual_inc',
 'purpose',
 'addr_state',
 'dti',
 'delinq_2yrs',
 'revol_util',
 'total_acc',
 'longest_credit_length',
 'verification_status']
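As an alternative to the two `remove()` calls, the same predictor list can be built in one step with a list comprehension. This is a minimal sketch on a plain Python list holding the column names shown earlier; the name `excluded` is introduced here for illustration only.

```python
# Column names as returned by data.columns above
columns = ['loan_amnt', 'term', 'int_rate', 'emp_length', 'home_ownership',
           'annual_inc', 'purpose', 'addr_state', 'dti', 'delinq_2yrs',
           'revol_util', 'total_acc', 'bad_loan', 'longest_credit_length',
           'verification_status']

y = 'bad_loan'
# Columns that must not be used as predictors: the response itself and
# int_rate, which is correlated with the outcome
excluded = {y, 'int_rate'}

# Keep the original column order and leave the source list unchanged
x = [c for c in columns if c not in excluded]
print(len(x))
```

Unlike `remove()`, this does not modify the original list, so the cell can be re-run without raising a `ValueError` for an already removed name.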

## H2O Machine learning

Now that we have prepared the data, we can train some models. We will start by training a single model from each of the H2O supervised algorithms:

* Generalized Linear Model (GLM)
* Random Forest (RF)
* Gradient Boosting Machine (GBM)
* Deep Learning (DL)
* Naive Bayes (NB)

### GLM

We first create an object of class `H2OGeneralizedLinearEstimator`.

In [27]:
# import GLM
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# Initialize the GLM estimator:
# Similar to R's glm() and H2O's R GLM, H2O's GLM has the "family" argument

glm_fit1 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit1')
glm_fit1.train(x=x, y=y, training_frame=train)

glm Model Build progress: |███████████████████████████████████████████████| 100%


### Train a GLM with lambda search
Next we will do some automatic tuning. Since we are training a GLM with regularization, we should find the right amount of regularization to avoid overfitting. The model parameter lambda controls the amount of regularization in a GLM, and H2O can search for its optimal value automatically when we set `lambda_search = True` and pass in a validation frame, which is used to evaluate model performance at each candidate value of lambda.

In [62]:
glm_fit2 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit2', lambda_search=True)
glm_fit2.train(x=x, y=y, training_frame=train, validation_frame=valid)

glm Model Build progress: |███████████████████████████████████████████████| 100%


### model comparison

In [63]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:98% !important; }</style>"))
glm_perf1 = glm_fit1.model_performance(test)
glm_perf2 = glm_fit2.model_performance(test)
print(glm_perf1)
print(glm_perf2)


ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.14215509900200168
RMSE: 0.37703461247211995
LogLoss: 0.4510784560183179
Null degrees of freedom: 24580
Residual degrees of freedom: 24529
Null deviance: 23672.922265642752
Residual deviance: 22175.919054772545
AIC: 22279.919054772545
AUC: 0.6774535578047158
Gini: 0.35490711560943167
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.1926472819204629: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 13615.0 | 6376.0 | 0.3189 | (6376.0/19991.0) |
| 1 | 1930.0 | 2660.0 | 0.4205 | (1930.0/4590.0) |
| Total | 15545.0 | 9036.0 | 0.3379 | (8306.0/24581.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.1926473 | 0.3904301 | 223.0 |
| max f2 | 0.1186182 | 0.5569293 | 308.0 |
| max f0point5 | 0.2774452 | 0.3539450 | 146.0 |
| max accuracy | 0.4942441 | 0.8144095 | 30.0 |
| max precision | 0.7445000 | 1.0 | 0.0 |
| max recall | 0.0012250 | 1.0 | 399.0 |
| max specificity | 0.7445000 | 1.0 | 0.0 |
| max absolute_mcc | 0.1984641 | 0.2111993 | 217.0 |
| max min_per_class_accuracy | 0.1798270 | 0.6269321 | 236.0 |


Gains/Lift Table: Avg response rate: 18.67 %



| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | cumulative_response_rate | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0100077 | 0.4653404 | 2.8735958 | 2.8735958 | 0.5365854 | 0.5365854 | 0.0287582 | 0.0287582 | 187.3595834 | 187.3595834 |
| 2 | 0.0200155 | 0.4286398 | 2.3946632 | 2.6341295 | 0.4471545 | 0.4918699 | 0.0239651 | 0.0527233 | 139.4663195 | 163.4129514 |
| 3 | 0.0300232 | 0.4021231 | 2.2422755 | 2.5035115 | 0.4186992 | 0.4674797 | 0.0224401 | 0.0751634 | 124.2275537 | 150.3511522 |
| 4 | 0.0400309 | 0.3853779 | 2.2858149 | 2.4490874 | 0.4268293 | 0.4573171 | 0.0228758 | 0.0980392 | 128.5814868 | 144.9087359 |
| 5 | 0.0500386 | 0.3692731 | 2.0463485 | 2.3685396 | 0.3821138 | 0.4422764 | 0.0204793 | 0.1185185 | 104.6348548 | 136.8539597 |
| 6 | 0.1000366 | 0.3097208 | 1.8911445 | 2.1299391 | 0.3531326 | 0.3977227 | 0.0945534 | 0.2130719 | 89.1144473 | 112.9939106 |
| 7 | 0.1500346 | 0.2743521 | 1.6863431 | 1.9821139 | 0.3148902 | 0.3701193 | 0.0843137 | 0.2973856 | 68.6343113 | 98.2113869 |
| 8 | 0.2000325 | 0.2482081 | 1.3943922 | 1.8352133 | 0.2603743 | 0.3426886 | 0.0697168 | 0.3671024 | 39.4392238 | 83.5213343 |
| 9 | 0.3000285 | 0.2107601 | 1.2810979 | 1.6505332 | 0.2392189 | 0.3082034 | 0.1281046 | 0.4952070 | 28.1097869 | 65.0533230 |





ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.1422245666567667
RMSE: 0.3771267249304492
LogLoss: 0.4512893512028316
Null degrees of freedom: 24580
Residual degrees of freedom: 24535
Null deviance: 23672.922265642752
Residual deviance: 22186.28708383361
AIC: 22278.28708383361
AUC: 0.676930980597042
Gini: 0.3538619611940841
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.19320759584208072: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 13637.0 | 6354.0 | 0.3178 | (6354.0/19991.0) |
| 1 | 1953.0 | 2637.0 | 0.4255 | (1953.0/4590.0) |
| Total | 15590.0 | 8991.0 | 0.3379 | (8307.0/24581.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.1932076 | 0.3883366 | 219.0 |
| max f2 | 0.1206225 | 0.5557548 | 305.0 |
| max f0point5 | 0.2608371 | 0.3519563 | 153.0 |
| max accuracy | 0.4748547 | 0.8144909 | 34.0 |
| max precision | 0.7362129 | 1.0 | 0.0 |
| max recall | 0.0012514 | 1.0 | 399.0 |
| max specificity | 0.7362129 | 1.0 | 0.0 |
| max absolute_mcc | 0.1980297 | 0.2086543 | 214.0 |
| max min_per_class_accuracy | 0.1803576 | 0.6263819 | 233.0 |


Gains/Lift Table: Avg response rate: 18.67 %



| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | cumulative_response_rate | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0100077 | 0.4615795 | 2.9606745 | 2.9606745 | 0.5528455 | 0.5528455 | 0.0296296 | 0.0296296 | 196.0674496 | 196.0674496 |
| 2 | 0.0200155 | 0.4250287 | 2.1551969 | 2.5579357 | 0.4024390 | 0.4776423 | 0.0215686 | 0.0511983 | 115.5196875 | 155.7935686 |
| 3 | 0.0300232 | 0.3990780 | 2.4599722 | 2.5252812 | 0.4593496 | 0.4715447 | 0.0246187 | 0.0758170 | 145.9972191 | 152.5281187 |
| 4 | 0.0400309 | 0.3813066 | 2.2422755 | 2.4545298 | 0.4186992 | 0.4583333 | 0.0224401 | 0.0982571 | 124.2275537 | 145.4529775 |
| 5 | 0.0500386 | 0.3664622 | 2.1116575 | 2.3859553 | 0.3943089 | 0.4455285 | 0.0211329 | 0.1193900 | 111.1657545 | 138.5955329 |
| 6 | 0.1000366 | 0.3089409 | 1.8824295 | 2.1342948 | 0.3515053 | 0.3985360 | 0.0941176 | 0.2135076 | 88.2429522 | 113.4294810 |
| 7 | 0.1500346 | 0.2734459 | 1.5861212 | 1.9516198 | 0.2961758 | 0.3644252 | 0.0793028 | 0.2928105 | 58.6121171 | 95.1619809 |
| 8 | 0.2000325 | 0.2478101 | 1.5294740 | 1.8461048 | 0.2855980 | 0.3447224 | 0.0764706 | 0.3692810 | 52.9473987 | 84.6104817 |
| 9 | 0.3000285 | 0.2103265 | 1.2593105 | 1.6505332 | 0.2351505 | 0.3082034 | 0.1259259 | 0.4952070 | 25.9310490 | 65.0533230 |
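To make the printed tables easier to read, the following sketch recomputes a few of the reported numbers for the first model (`glm_fit1`) from its confusion matrix, using the standard definitions of F1, error rate and lift. All inputs are copied from the output above; the variable names are introduced here for illustration.

```python
# Confusion matrix of glm_fit1 at the max-F1 threshold (actual class / predicted class)
tn, fp = 13615, 6376   # actual 0: predicted 0, predicted 1
fn, tp = 1930, 2660    # actual 1: predicted 0, predicted 1
n = tn + fp + fn + tp  # 24581 test rows

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Per-class and overall error rates (the "Error" column)
error_class0 = fp / (tn + fp)
error_class1 = fn / (fn + tp)
error_total = (fp + fn) / n

# Lift of the first group = response rate in that group / average response rate
avg_response_rate = (fn + tp) / n          # reported as 18.67 %
lift_group1 = 0.5365854 / avg_response_rate

print('F1: {:.7f}'.format(f1))
print('errors: {:.4f} {:.4f} {:.4f}'.format(error_class0, error_class1, error_total))
print('avg response rate: {:.2f} %'.format(100 * avg_response_rate))
print('lift of group 1: {:.4f}'.format(lift_group1))
```

The F1 value agrees with the `max f1` row of the metrics table, which is expected because the confusion matrix is reported at exactly that threshold.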






In [35]:
#test set AUC
print(glm_perf1.auc())
print(glm_perf2.auc())

0.6774535578047158
0.676930980597042
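The other summary metrics in the report are tied to each other by simple identities, which the sketch below checks against the printed values: Gini = 2·AUC − 1, LogLoss = residual deviance / (2·n) for a binomial model, and AIC = residual deviance + 2·k, where k is taken to be the difference between null and residual degrees of freedom plus one for the intercept. That reading of k is an assumption, but it is consistent with the numbers above.

```python
# Values copied from the two ModelMetricsBinomialGLM reports above
reports = {
    'glm_fit1': dict(auc=0.6774535578047158, gini=0.35490711560943167,
                     logloss=0.4510784560183179,
                     null_dof=24580, resid_dof=24529,
                     resid_dev=22175.919054772545, aic=22279.919054772545),
    'glm_fit2': dict(auc=0.676930980597042, gini=0.3538619611940841,
                     logloss=0.4512893512028316,
                     null_dof=24580, resid_dof=24535,
                     resid_dev=22186.28708383361, aic=22278.28708383361),
}

checks = {}
for name, r in reports.items():
    n = r['null_dof'] + 1                      # number of test rows
    k = r['null_dof'] - r['resid_dof'] + 1     # assumed: fitted coefficients incl. intercept
    checks[name] = dict(
        gini=2 * r['auc'] - 1,
        logloss=r['resid_dev'] / (2 * n),
        aic=r['resid_dev'] + 2 * k,
        k=k,
    )
    print(name, checks[name])
```

One practical reading: the lambda-search model (`glm_fit2`) ends up with fewer effective coefficients (k is smaller) at almost the same AUC and log loss.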


In [37]:
# Compare test AUC to the training AUC and validation AUC
print(glm_fit2.auc(train=True))
print(glm_fit2.auc(valid=True))

0.6735052047456153
0.6754201824374698


In [42]:
glm_fit2.scoring_history()

| | timestamp | duration | iteration | lambda | predictors | deviance_train | deviance_test |
|---|---|---|---|---|---|---|---|
| 0 | 2018-04-24 16:35:54 | 0.000 sec | 1 | .82E-1 | 1 | 0.948060 | 0.959066 |
| 1 | 2018-04-24 16:35:54 | 0.129 sec | 3 | .74E-1 | 3 | 0.945682 | 0.956569 |
| 2 | 2018-04-24 16:35:54 | 0.203 sec | 5 | .68E-1 | 3 | 0.943440 | 0.954204 |
| 3 | 2018-04-24 16:35:54 | 0.240 sec | 6 | .62E-1 | 3 | 0.941491 | 0.952142 |
| 4 | 2018-04-24 16:35:54 | 0.277 sec | 7 | .56E-1 | 3 | 0.939794 | 0.950342 |
| 5 | 2018-04-24 16:35:54 | 0.309 sec | 8 | .51E-1 | 3 | 0.938324 | 0.948778 |
| 6 | 2018-04-24 16:35:55 | 0.425 sec | 9 | .47E-1 | 3 | 0.937055 | 0.947422 |
| 7 | 2018-04-24 16:35:55 | 0.498 sec | 10 | .43E-1 | 6 | 0.934762 | 0.945084 |
| 8 | 2018-04-24 16:35:55 | 0.557 sec | 12 | .39E-1 | 6 | 0.931086 | 0.941378 |
| 9 | 2018-04-24 16:35:55 | 0.678 sec | 14 | .35E-1 | 7 | 0.927657 | 0.937944 |
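The lambda column in the scoring history shrinks by a roughly constant factor from row to row, which is what a geometric search grid looks like. The sketch below generates such a grid under two assumptions that are not stated in this notebook: 100 candidate values and a ratio of 1e-4 between the smallest and the largest lambda. The starting value 0.082 is the rounded first entry of the table, and the function name `lambda_grid` is hypothetical.

```python
def lambda_grid(lambda_max, n_lambdas=100, min_ratio=1e-4):
    """Geometric sequence from lambda_max down to lambda_max * min_ratio.

    n_lambdas and min_ratio are assumed settings for illustration.
    """
    step = min_ratio ** (1.0 / (n_lambdas - 1))  # constant shrink factor
    return [lambda_max * step ** i for i in range(n_lambdas)]

grid = lambda_grid(0.082)
step = grid[1] / grid[0]
print('shrink factor per step: {:.4f}'.format(step))
print(['{:.3f}'.format(v) for v in grid[:10]])
```

With these assumed settings each candidate is about 9% smaller than the previous one, which is roughly in line with the progression from .82E-1 to .35E-1 shown in the table; as lambda decreases, more predictors enter the model.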


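In [39]:
# `table` was never defined in the original cell, which raised a NameError.
# Assumed intent: show the gains/lift table of the lambda-search model as a data frame.
table = glm_perf2.gains_lift()
table.as_data_frame()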

## Train with deep learning

In [44]:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

### Default DL
DL will infer the response distribution from the response encoding if not specified explicitly through the distribution argument. 

In H2O's DL, early stopping is enabled by default, so below, it will use the training set and default stopping parameters to perform early stopping.

In [45]:
# init and train
dl_fit1 = H2ODeepLearningEstimator(model_id='dl_fit1', seed=1)
dl_fit1.train(x=x, y=y, training_frame=train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%


### Train a DL with new architecture and more epochs
Next we will increase the number of epochs used in the DL model by setting epochs=20 (the default is 10) and use a new architecture with two hidden layers of 10 units each (hidden=[10,10]). Increasing the number of epochs in a deep neural net may improve performance, but you have to be careful not to overfit. To find the optimal number of epochs automatically, use H2O's early stopping functionality. Unlike the rest of the H2O algorithms, H2O's DL uses early stopping by default, so for comparison we first turn it off in the next example by setting stopping_rounds=0.

In [52]:
dl_fit2 = H2ODeepLearningEstimator(model_id='dl_fit2',
                                  epochs=20,
                                  hidden=[10,10],
                                  stopping_rounds=0, #disable early stopping
                                  seed=1)
dl_fit2.train(x=x, y=y, training_frame=train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%


### DL with early stopping
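This is the same model as before, but with early stopping enabled. Passing a validation frame is recommended for early stopping; the call below does not pass one, so stopping is evaluated on training-set metrics.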

In [53]:
dl_fit3 = H2ODeepLearningEstimator(model_id='dl_fit3',
                                   epochs=20,
                                   hidden=[10,10],
                                   score_interval=1, #for early stopping
                                   stopping_rounds=3, #for early stopping
                                   stopping_metric='AUC', #for early stopping
                                   stopping_tolerance=0.0005, #for early stopping
                                   seed=1)
dl_fit3.train(x=x, y=y, training_frame=train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%
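How `stopping_rounds` and `stopping_tolerance` interact can be sketched with a simplified moving-average rule: training stops when the best recent moving average of the metric is no better than an earlier reference average by more than the relative tolerance. The function `should_stop` is hypothetical and written for illustration; H2O's actual rule may differ in detail.

```python
def should_stop(history, stopping_rounds=3, tolerance=0.0005, more_is_better=True):
    """Simplified early-stopping check on a list of metric values.

    Hypothetical helper: compares moving averages of length k over the
    last 2k scoring events. Returns True when no sufficient improvement is seen.
    """
    k = stopping_rounds
    if k == 0 or len(history) < 2 * k:
        return False  # disabled, or not enough scoring events yet
    recent = history[-2 * k:]
    # k + 1 moving averages of length k; the first one is the reference
    averages = [sum(recent[i:i + k]) / k for i in range(k + 1)]
    reference = averages[0]
    if more_is_better:  # e.g. AUC
        return max(averages[1:]) <= reference * (1 + tolerance)
    return min(averages[1:]) >= reference * (1 - tolerance)  # e.g. logloss

improving = [0.60, 0.62, 0.64, 0.66, 0.68, 0.70]
flat = [0.68, 0.68, 0.68, 0.68, 0.68, 0.68]
print(should_stop(improving))  # the metric keeps improving
print(should_stop(flat))       # the metric has stalled
```

Averaging over several scoring events makes the rule less sensitive to a single noisy score, which matters when the metric fluctuates from one scoring event to the next.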


In [54]:
dl_perf1 = dl_fit1.model_performance(test)
dl_perf2 = dl_fit2.model_performance(test)
dl_perf3 = dl_fit3.model_performance(test)

In [56]:

# Retrieve test set AUC
print(dl_perf1.auc())
print(dl_perf2.auc())
print(dl_perf3.auc())

0.6819236739321366
0.6794474180047688
0.6836110835932815


In [59]:
dl_fit2.scoring_history()

| | timestamp | duration | training_speed | epochs | iterations | samples | training_rmse | training_logloss | training_auc | training_lift | training_classification_error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-04-24 17:13:51 | 0.000 sec | | 0.0 | 0 | 0.0 | | | | | |
| 1 | 2018-04-24 17:13:52 | 0.830 sec | 196807 obs/sec | 0.87007 | 1 | 99978.0 | 0.37917 | 0.456005 | 0.664572 | 2.231624 | 0.384102 |
| 2 | 2018-04-24 17:13:57 | 5.857 sec | 272966 obs/sec | 13.055862 | 15 | 1500223.0 | 0.379191 | 0.457515 | 0.677786 | 2.558203 | 0.378944 |
| 3 | 2018-04-24 17:13:59 | 8.462 sec | 285395 obs/sec | 20.018484 | 23 | 2300284.0 | 0.377275 | 0.44999 | 0.680138 | 2.775923 | 0.364179 |
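The `epochs` column in this history is fractional because it counts passes over the training data: it equals the number of processed samples divided by the number of training rows. A quick check with the numbers above, using the 114908 training rows from the partition step:

```python
train_rows = 114908  # train.nrow from the partition step

# (samples, reported epochs) pairs copied from the scoring history above
history = [(99978.0, 0.87007), (1500223.0, 13.055862), (2300284.0, 20.018484)]

computed = [samples / train_rows for samples, _ in history]
for (samples, reported), epochs in zip(history, computed):
    print('{:>10.0f} samples => {:.6f} epochs (reported {})'.format(samples, epochs, reported))
```

This also explains why the final entry slightly exceeds the requested `epochs=20`: training proceeds in iterations of a fixed number of samples, and it ends at the first iteration boundary past the target.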


In [60]:
dl_fit3.scoring_history()

| | timestamp | duration | training_speed | epochs | iterations | samples | training_rmse | training_logloss | training_auc | training_lift | training_classification_error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-04-24 17:14:01 | 0.000 sec | | 0.0 | 0 | 0.0 | | | | | |
| 1 | 2018-04-24 17:14:01 | 0.595 sec | 219731 obs/sec | 0.87007 | 1 | 99978.0 | 0.381209 | 0.463554 | 0.660087 | 2.667063 | 0.344053 |
| 2 | 2018-04-24 17:14:02 | 1.862 sec | 243719 obs/sec | 3.482673 | 4 | 400187.0 | 0.379342 | 0.458296 | 0.667747 | 3.102502 | 0.327063 |
| 3 | 2018-04-24 17:14:03 | 2.940 sec | 263417 obs/sec | 6.095546 | 7 | 700427.0 | 0.382589 | 0.469197 | 0.6702 | 2.884782 | 0.367112 |
| 4 | 2018-04-24 17:14:05 | 4.084 sec | 266221 obs/sec | 8.706609 | 10 | 1000459.0 | 0.376837 | 0.450211 | 0.677644 | 2.721493 | 0.372067 |
| 5 | 2018-04-24 17:14:06 | 5.334 sec | 281876 obs/sec | 12.184374 | 14 | 1400082.0 | 0.379132 | 0.456918 | 0.674027 | 2.449344 | 0.365696 |
| 6 | 2018-04-24 17:14:07 | 6.575 sec | 291966 obs/sec | 15.669544 | 18 | 1800556.0 | 0.376353 | 0.448879 | 0.679561 | 2.558203 | 0.363066 |
| 7 | 2018-04-24 17:14:08 | 7.883 sec | 295761 obs/sec | 19.149807 | 22 | 2200466.0 | 0.383167 | 0.472424 | 0.678817 | 2.667063 | 0.369033 |
| 8 | 2018-04-24 17:14:09 | 8.219 sec | 298080 obs/sec | 20.018484 | 23 | 2300284.0 | 0.377063 | 0.45043 | 0.679843 | 2.775923 | 0.356594 |
