Install H2O before running this notebook, for example with conda:

`conda install -c h2oai h2o`

https://github.com/h2oai/h2o-tutorials

## Preprocessing

In [1]:
# Import the H2O package
import h2o
# Import H2O Grid Search:
from h2o.grid.grid_search import H2OGridSearch

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
  Starting server from C:\Users\zgeorge\AppData\Local\Continuum\anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\zgeorge\AppData\Local\Temp\tmpc_46in70
  JVM stdout: C:\Users\zgeorge\AppData\Local\Temp\tmpc_46in70\h2o_zgeorge_started_from_python.out
  JVM stderr: C:\Users\zgeorge\AppData\Local\Temp\tmpc_46in70\h2o_zgeorge_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime: 02 secs
H2O cluster timezone: America/Denver
H2O data parsing timezone: UTC
H2O cluster version: 3.22.0.2
H2O cluster version age: 12 days
H2O cluster name: H2O_from_python_zgeorge_i0at7e
H2O cluster total nodes: 1
H2O cluster free memory: 1.755 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4


In [3]:
#import data
funds_csv = "Buy More Data.csv"  # modify this for your machine
data = h2o.import_file(funds_csv)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [4]:
data.shape

(21353, 28)

In [5]:
data['target'] = data['target'].asfactor()  # encode the binary response as a factor
data['target'].levels()  #optional: after encoding, this shows the two factor levels, '0' and '1'

[['0', '1']]

In [6]:
# Partition data into 70%, 15%, 15% chunks
# Setting a seed will guarantee reproducibility

splits = data.split_frame(ratios=[0.7, 0.15], seed=1)  

train = splits[0]
valid = splits[1]
test = splits[2]

In [7]:
print(train.nrow)
print(valid.nrow)
print(test.nrow)

15010
3202
3141
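
The three row counts are close to 70/15/15 but not exact. H2O's `split_frame` assigns rows to splits probabilistically rather than cutting at exact boundaries, so the realized sizes are approximate. The sketch below illustrates that idea in plain Python without H2O; `approximate_split` is a hypothetical helper written for this note, not H2O's implementation, so its counts will not reproduce the ones printed above.

```python
import random

def approximate_split(n_rows, ratios, seed):
    """Assign each row to a split by drawing one uniform random number per row."""
    rng = random.Random(seed)
    # Cumulative cutoffs, e.g. [0.7, 0.85] for ratios [0.7, 0.15]
    cutoffs = []
    running = 0.0
    for r in ratios:
        running += r
        cutoffs.append(running)
    counts = [0] * (len(ratios) + 1)  # the last split takes the remainder
    for _ in range(n_rows):
        u = rng.random()
        for i, cutoff in enumerate(cutoffs):
            if u < cutoff:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts

# Row counts printed by the notebook for train, valid and test
observed = [15010, 3202, 3141]
total_rows = sum(observed)
fractions = [n / total_rows for n in observed]
print([round(f, 4) for f in fractions])

# A simulated probabilistic split of the same number of rows
simulated = approximate_split(total_rows, [0.7, 0.15], seed=1)
print(simulated)
```

If exact split sizes matter, one option is to shuffle row indices and slice them yourself before handing the frames to H2O.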


In [8]:
y = 'target'
x = list(data.columns)

In [9]:
x

['Locationid',
 'Location',
 'LocPropertyType',
 'LocProvince',
 'LocPostal',
 'LocDistrict',
 'LocRegion',
 'RegionName',
 'Market',
 'PopulationRank',
 'FundraisingYear',
 'cy_funds',
 'cy_1_funds',
 'cy_2_funds',
 'change',
 'target',
 'cy_checkins',
 'cy_1_checkins',
 'cy_2_checkins',
 'cy_contacts',
 'cy_1_contacts',
 'cy_2_contacts',
 'cy_images',
 'cy_1_images',
 'cy_2_images',
 'cy_notes',
 'cy_1_notes',
 'cy_2_notes']

In [10]:
x.remove('target')  #remove the response
x.remove('change')  
x.remove('cy_funds')

In [11]:
# List of predictor columns
x

['Locationid',
 'Location',
 'LocPropertyType',
 'LocProvince',
 'LocPostal',
 'LocDistrict',
 'LocRegion',
 'RegionName',
 'Market',
 'PopulationRank',
 'FundraisingYear',
 'cy_1_funds',
 'cy_2_funds',
 'cy_checkins',
 'cy_1_checkins',
 'cy_2_checkins',
 'cy_contacts',
 'cy_1_contacts',
 'cy_2_contacts',
 'cy_images',
 'cy_1_images',
 'cy_2_images',
 'cy_notes',
 'cy_1_notes',
 'cy_2_notes']
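
The same predictor list can be built without mutating `x` in place, which makes the exclusions explicit and safe to re-run. This is a sketch in plain Python using the column names printed above. The notebook does not say why `change` and `cy_funds` are dropped; a plausible reading (an assumption here) is that both describe the current-year result that `target` is derived from, so keeping them would leak the answer into the model.

```python
# Column names as printed by the notebook, grouped for readability
location_cols = ['Locationid', 'Location', 'LocPropertyType', 'LocProvince',
                 'LocPostal', 'LocDistrict', 'LocRegion', 'RegionName',
                 'Market', 'PopulationRank', 'FundraisingYear']
fund_cols = ['cy_funds', 'cy_1_funds', 'cy_2_funds', 'change', 'target']
activity_cols = [f'{prefix}_{kind}'
                 for kind in ('checkins', 'contacts', 'images', 'notes')
                 for prefix in ('cy', 'cy_1', 'cy_2')]
all_columns = location_cols + fund_cols + activity_cols

# The response plus the two columns the notebook removes by hand
excluded = {'target', 'change', 'cy_funds'}
predictors = [c for c in all_columns if c not in excluded]

print(len(all_columns), len(predictors))
```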

### Logistic Regression (GLM)

In [12]:
# Import H2O GLM:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

In [13]:
# Initialize the GLM estimator:
# Similar to R's glm() and H2O's R GLM, H2O's GLM has the "family" argument

glm_fit1 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit1')

In [14]:
glm_fit1.train(x=x, y=y, training_frame=train)

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [15]:
glm_fit2 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit2', lambda_search=True)
glm_fit2.train(x=x, y=y, training_frame=train, validation_frame=valid)

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [16]:
glm_perf1 = glm_fit1.model_performance(test)
glm_perf2 = glm_fit2.model_performance(test)

In [17]:
# Print model performance
print(glm_perf1)
print(glm_perf2)


ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.1325019732014012
RMSE: 0.36400820485450763
LogLoss: 0.42389688854309976
Null degrees of freedom: 3140
Residual degrees of freedom: 3017
Null deviance: 3147.763150497247
Residual deviance: 2662.9202538277395
AIC: 2910.9202538277395
AUC: 0.7872029837068241
pr_auc: 0.46590377805201455
Gini: 0.5744059674136481
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.2844315749107653: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 1895.0 | 617.0 | 0.2456 | (617.0/2512.0) |
| 1 | 208.0 | 421.0 | 0.3307 | (208.0/629.0) |
| Total | 2103.0 | 1038.0 | 0.2627 | (825.0/3141.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.2844316 | 0.5050990 | 196.0 |
| max f2 | 0.1512106 | 0.6524806 | 281.0 |
| max f0point5 | 0.3885001 | 0.5025641 | 132.0 |
| max accuracy | 0.4631851 | 0.8147087 | 90.0 |
| max precision | 0.9588363 | 1.0 | 0.0 |
| max recall | 0.0001034 | 1.0 | 399.0 |
| max specificity | 0.9588363 | 1.0 | 0.0 |
| max absolute_mcc | 0.3885001 | 0.3685933 | 132.0 |
| max min_per_class_accuracy | 0.2631339 | 0.7165605 | 210.0 |


Gains/Lift Table: Avg response rate: 20.03 %, avg score: 21.30 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0101878,0.6534986,3.1210254,3.1210254,0.625,0.7491132,0.625,0.7491132,0.0317965,0.0317965,212.1025437,212.1025437
,2,0.0200573,0.6078620,2.8995333,3.0120372,0.5806452,0.6276422,0.6031746,0.6893418,0.0286169,0.0604134,189.9533309,201.2037247
,3,0.0302451,0.5696614,3.4331280,3.1538783,0.6875,0.5891153,0.6315789,0.6555813,0.0349762,0.0953895,243.3127981,215.3878337
,4,0.0401146,0.5461892,3.0606185,3.1309335,0.6129032,0.5567126,0.6269841,0.6312564,0.0302067,0.1255962,206.0618493,213.0933454
,5,0.0503025,0.5292651,2.8089229,3.0657161,0.5625,0.5370944,0.6139241,0.6121856,0.0286169,0.1542130,180.8922893,206.5716126
,6,0.1002865,0.4626406,2.6717568,2.8693618,0.5350318,0.4904364,0.5746032,0.5515043,0.1335453,0.2877583,167.1756807,186.9361799
,7,0.1502706,0.4103146,1.9402043,2.5602988,0.3885350,0.4370110,0.5127119,0.5134207,0.0969793,0.3847377,94.0204348,156.0298833
,8,0.2002547,0.3733940,1.9720110,2.4134607,0.3949045,0.3921220,0.4833068,0.4831442,0.0985692,0.4833068,97.2010977,141.3460688
,9,0.3002229,0.3058401,1.3358784,2.0546475,0.2675159,0.3383750,0.4114528,0.4349390,0.1335453,0.6168521,33.5878404,105.4647499





ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.13244745245982614
RMSE: 0.3639333077087423
LogLoss: 0.42361583191300856
Null degrees of freedom: 3140
Residual degrees of freedom: 3000
Null deviance: 3147.763150497247
Residual deviance: 2661.154656077545
AIC: 2943.154656077545
AUC: 0.7874918989802842
pr_auc: 0.46594437658791943
Gini: 0.5749837979605683
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.28726403807664963: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 1903.0 | 609.0 | 0.2424 | (609.0/2512.0) |
| 1 | 211.0 | 418.0 | 0.3355 | (211.0/629.0) |
| Total | 2114.0 | 1027.0 | 0.2611 | (820.0/3141.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.2872640 | 0.5048309 | 196.0 |
| max f2 | 0.1479239 | 0.6530899 | 286.0 |
| max f0point5 | 0.3877807 | 0.5032523 | 134.0 |
| max accuracy | 0.4641578 | 0.8147087 | 93.0 |
| max precision | 0.9579584 | 1.0 | 0.0 |
| max recall | 0.0001036 | 1.0 | 399.0 |
| max specificity | 0.9579584 | 1.0 | 0.0 |
| max absolute_mcc | 0.3877807 | 0.3692555 | 134.0 |
| max min_per_class_accuracy | 0.2623813 | 0.7153662 | 213.0 |


Gains/Lift Table: Avg response rate: 20.03 %, avg score: 21.31 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0101878,0.6536031,3.1210254,3.1210254,0.625,0.7485498,0.625,0.7485498,0.0317965,0.0317965,212.1025437,212.1025437
,2,0.0200573,0.6069538,2.8995333,3.0120372,0.5806452,0.6275981,0.6031746,0.6890339,0.0286169,0.0604134,189.9533309,201.2037247
,3,0.0302451,0.5712177,3.4331280,3.1538783,0.6875,0.5900502,0.6315789,0.6556920,0.0349762,0.0953895,243.3127981,215.3878337
,4,0.0401146,0.5458395,2.8995333,3.0913014,0.5806452,0.5575078,0.6190476,0.6315356,0.0286169,0.1240064,189.9533309,209.1301385
,5,0.0503025,0.5281740,2.9649742,3.0657161,0.59375,0.5366799,0.6139241,0.6123243,0.0302067,0.1542130,196.4974165,206.5716126
,6,0.1002865,0.4623209,2.6717568,2.8693618,0.5350318,0.4902037,0.5746032,0.5514579,0.1335453,0.2877583,167.1756807,186.9361799
,7,0.1502706,0.4114277,1.9083977,2.5497191,0.3821656,0.4366212,0.5105932,0.5132601,0.0953895,0.3831479,90.8397720,154.9719086
,8,0.2002547,0.3734411,2.0674309,2.4293387,0.4140127,0.3919384,0.4864865,0.4829779,0.1033386,0.4864865,106.7430863,142.9338719
,9,0.3002229,0.3058950,1.3358784,2.0652385,0.2675159,0.3383789,0.4135737,0.4348293,0.1335453,0.6200318,33.5878404,106.5238465






In [18]:
# Retrieve test set AUC
print(glm_perf1.auc())
print(glm_perf2.auc())

0.7872029837068241
0.7874918989802842
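
Several of the figures in the `glm_perf1` report can be recomputed by hand, which is a useful sanity check on how to read the output. The sketch below uses only numbers copied from that report (the confusion matrix at the max-F1 threshold and the test AUC); the variable names are chosen for this note. It relies on the standard definitions of precision, recall and F1, and on the relation Gini = 2 × AUC − 1.

```python
# Confusion matrix for glm_fit1 on the test set at the max-F1 threshold
tn, fp = 1895, 617   # actual class 0: predicted 0, predicted 1
fn, tp = 208, 421    # actual class 1: predicted 0, predicted 1

# Per-class and overall error rates (the "Error" column of the table)
error_class0 = fp / (tn + fp)
error_class1 = fn / (fn + tp)
overall_error = (fp + fn) / (tn + fp + fn + tp)

# F1 at this threshold
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Gini coefficient from the reported test AUC
auc = 0.7872029837068241
gini = 2 * auc - 1

print(round(error_class0, 4), round(error_class1, 4), round(overall_error, 4))
print(round(f1, 7), gini)
```

The recomputed values agree with the report: error rates of 0.2456, 0.3307 and 0.2627, a max F1 of about 0.5051, and a Gini of about 0.5744.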


In [19]:
# Compare test AUC to the training AUC and validation AUC
print(glm_fit2.auc(train=True))
print(glm_fit2.auc(valid=True))

0.8148576616995139
0.7886099739452724


### Random Forest

In [20]:
# Import H2O RF:
from h2o.estimators.random_forest import H2ORandomForestEstimator

In [21]:
# Initialize the RF estimator:

rf_fit1 = H2ORandomForestEstimator(model_id='rf_fit1', seed=1)

In [22]:
rf_fit1.train(x=x, y=y, training_frame=train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [23]:
rf_fit2 = H2ORandomForestEstimator(model_id='rf_fit2', ntrees=100, seed=1)
rf_fit2.train(x=x, y=y, training_frame=train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [24]:
rf_perf1 = rf_fit1.model_performance(test)
rf_perf2 = rf_fit2.model_performance(test)

In [25]:
# Retrieve test set AUC
print(rf_perf1.auc())
print(rf_perf2.auc())

0.7782466102295627
0.7819528267495671


In [26]:
# Cross-validate performance (5 folds, trained on the full data set)
rf_fit3 = H2ORandomForestEstimator(model_id='rf_fit3', seed=1, nfolds=5)
rf_fit3.train(x=x, y=y, training_frame=data)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [27]:
print(rf_fit3.auc(xval=True))

0.7965930639332441
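
With `nfolds=5`, each row is held out exactly once while a model is trained on the other four folds, and the cross-validated AUC is computed from those held-out predictions. Note that `rf_fit3` is trained on all of `data`, including the rows used as `test` elsewhere, so this number is not directly comparable to the test-set AUCs above. The sketch below shows one simple way to form folds in plain Python, using the row index modulo the number of folds. `modulo_folds` and `cv_splits` are hypothetical helpers for illustration; how H2O actually assigns folds depends on its `fold_assignment` parameter and may differ.

```python
def modulo_folds(n_rows, nfolds):
    """Return a fold index (0..nfolds-1) per row using row_index % nfolds."""
    return [i % nfolds for i in range(n_rows)]

def cv_splits(n_rows, nfolds):
    """Yield (training_rows, holdout_rows) index lists, one pair per fold."""
    folds = modulo_folds(n_rows, nfolds)
    for k in range(nfolds):
        holdout = [i for i, f in enumerate(folds) if f == k]
        training = [i for i, f in enumerate(folds) if f != k]
        yield training, holdout

n_rows = 21353  # number of rows in the notebook's data frame
holdout_sizes = [len(holdout) for _, holdout in cv_splits(n_rows, 5)]
print(holdout_sizes)
```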


### Gradient Boosting

In [28]:
# Import H2O GBM:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [29]:
# GBM hyperparameters
gbm_params1 = {'learn_rate': [0.01, 0.1], 
                'max_depth': [3, 5, 9],
                'sample_rate': [0.8, 1.0],
                'col_sample_rate': [0.2, 0.5, 1.0]}
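
Before launching the grid it helps to know how many models it implies. With the default Cartesian strategy, a grid search trains one model per combination of hyperparameter values. The sketch below enumerates those combinations in plain Python with `itertools.product`; it does not call H2O, and the order in which H2O builds the models may differ.

```python
from itertools import product

# Same hyperparameter grid as in the cell above
gbm_params1 = {'learn_rate': [0.01, 0.1],
               'max_depth': [3, 5, 9],
               'sample_rate': [0.8, 1.0],
               'col_sample_rate': [0.2, 0.5, 1.0]}

names = list(gbm_params1)
combos = [dict(zip(names, values)) for values in product(*gbm_params1.values())]

print(len(combos))   # 2 * 3 * 2 * 3 combinations
print(combos[0])
```

Each of these models is trained with `ntrees=100`, so the grid is by far the most expensive cell in this section.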

In [30]:
# Initialize and train the GBM estimator:
gbm_grid1 = H2OGridSearch(model=H2OGradientBoostingEstimator,grid_id='gbm_grid1', hyper_params=gbm_params1)
gbm_grid1.train(x=x, y=y, training_frame=train, validation_frame=valid, ntrees=100, seed=1)

gbm Grid Build progress: |████████████████████████████████████████████████| 100%


In [31]:
gbm_fit2 = H2OGradientBoostingEstimator(model_id='gbm_fit2', ntrees=500, seed=1)
gbm_fit2.train(x=x, y=y, training_frame=train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [32]:
# Now let's use early stopping to find optimal ntrees

gbm_fit3 = H2OGradientBoostingEstimator(model_id='gbm_fit3', 
                                        ntrees=500, 
                                        score_tree_interval=5,     #used for early stopping
                                        stopping_rounds=3,         #used for early stopping
                                        stopping_metric='AUC',     #used for early stopping
                                        stopping_tolerance=0.0005, #used for early stopping
                                        seed=1)

# A validation_frame is recommended when using early stopping
gbm_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%
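
To make the four early-stopping parameters concrete: with `score_tree_interval=5` the model is scored every 5 trees, and training stops once the validation AUC has stopped improving by more than `stopping_tolerance` over `stopping_rounds` scoring events. The sketch below is a simplified reading of that rule based on moving averages, written in plain Python. It is an assumption-laden illustration, not H2O's code, and the `val_auc` values are made up.

```python
def should_stop(scores, stopping_rounds=3, stopping_tolerance=0.0005):
    """Simplified early-stopping check for a metric where larger is better (e.g. AUC)."""
    k = stopping_rounds
    if k == 0 or len(scores) < 2 * k:
        return False  # disabled, or not enough scoring events yet
    recent = scores[-2 * k:]
    # k + 1 moving averages, each over a window of k scoring events
    averages = [sum(recent[i:i + k]) / k for i in range(k + 1)]
    reference = averages[0]
    best_later = max(averages[1:])
    # Stop when no later average beats the reference by the relative tolerance
    return best_later <= reference * (1 + stopping_tolerance)

def first_stop(scores, **kwargs):
    """Return the number of scoring events after which training would stop, or None."""
    for n in range(1, len(scores) + 1):
        if should_stop(scores[:n], **kwargs):
            return n
    return None

# Hypothetical validation AUCs, one per scoring event (invented for illustration)
val_auc = [0.750, 0.770, 0.785, 0.795, 0.800, 0.803,
           0.8035, 0.8036, 0.8036, 0.8037, 0.8037, 0.8037]
stop_event = first_stop(val_auc)
print(stop_event)
```

Under this reading, a stop at scoring event `n` with `score_tree_interval=5` corresponds to roughly `5 * n` trees, well short of the `ntrees=500` ceiling.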


In [33]:
gbm_gridperf = gbm_grid1.get_grid(sort_by='auc', decreasing=True)
# Grab the top GBM model from the grid, chosen by validation AUC
best_gbm_model = gbm_gridperf.models[0]

gbm_perf1 = best_gbm_model.model_performance(test)
gbm_perf2 = gbm_fit2.model_performance(test)
gbm_perf3 = gbm_fit3.model_performance(test)

In [34]:
# Retrieve test set AUC
print(gbm_perf1.auc())
print(gbm_perf2.auc())
print(gbm_perf3.auc())

0.7913614016789363
0.7590307383066843
0.7865621803894565


In [35]:
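# varimp is a method: printing it without calling it shows the bound method,
# whose repr includes the full model summary. Call best_gbm_model.varimp()
# to get the variable importances on their own.
print(best_gbm_model.varimp)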

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  gbm_grid1_model_28


ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.0939164352192401
RMSE: 0.30645788490303216
LogLoss: 0.3112471965832744
Mean Per-Class Error: 0.1761286919831223
AUC: 0.9079513833253218
pr_auc: 0.7811770700775346
Gini: 0.8159027666506435
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3473552703651643: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 10890.0 | 960.0 | 0.081 | (960.0/11850.0) |
| 1 | 982.0 | 2178.0 | 0.3108 | (982.0/3160.0) |
| Total | 11872.0 | 3138.0 | 0.1294 | (1942.0/15010.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.3473553 | 0.6916481 | 197.0 |
| max f2 | 0.1796157 | 0.7561319 | 275.0 |
| max f0point5 | 0.5151581 | 0.7421095 | 132.0 |
| max accuracy | 0.3984780 | 0.8778814 | 176.0 |
| max precision | 0.9920373 | 1.0 | 0.0 |
| max recall | 0.0260822 | 1.0 | 388.0 |
| max specificity | 0.9920373 | 1.0 | 0.0 |
| max absolute_mcc | 0.3984780 | 0.6108500 | 176.0 |
| max min_per_class_accuracy | 0.2344234 | 0.8208861 | 249.0 |


Gains/Lift Table: Avg response rate: 21.05 %, avg score: 21.06 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100600,0.8840778,4.75,4.75,1.0,0.9287074,1.0,0.9287074,0.0477848,0.0477848,375.0,375.0
,2,0.0200533,0.8235887,4.75,4.75,1.0,0.8510829,1.0,0.8900241,0.0474684,0.0952532,375.0,375.0
,3,0.0300466,0.7777038,4.6866667,4.7289357,0.9866667,0.7998031,0.9955654,0.8600171,0.0468354,0.1420886,368.6666667,372.8935698
,4,0.0400400,0.7378799,4.5283333,4.6788686,0.9533333,0.7583026,0.9850250,0.8346308,0.0452532,0.1873418,352.8333333,367.8868552
,5,0.0500333,0.6996041,4.5600000,4.6551265,0.96,0.7185235,0.9800266,0.8114403,0.0455696,0.2329114,356.0000000,365.5126498
,6,0.1,0.5451970,3.8190000,4.2373418,0.804,0.6174350,0.8920720,0.7145023,0.1908228,0.4237342,281.9000000,323.7341772
,7,0.1500333,0.4371572,2.7703063,3.7481128,0.5832224,0.4900944,0.7890764,0.6396664,0.1386076,0.5623418,177.0306258,274.8112789
,8,0.2,0.3582846,2.1913333,3.3591772,0.4613333,0.3936205,0.7071952,0.5781959,0.1094937,0.6718354,119.1333333,235.9177215
,9,0.3,0.2432810,1.3765823,2.6983122,0.2898068,0.2959357,0.5680657,0.4841092,0.1376582,0.8094937,37.6582278,169.8312236




ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.12563969447335024
RMSE: 0.3544569007275077
LogLoss: 0.3979801598652918
Mean Per-Class Error: 0.2721098535394777
AUC: 0.8050844439938029
pr_auc: 0.5487940291249233
Gini: 0.6101688879876057
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.29511949289772904: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 2142.0 | 411.0 | 0.161 | (411.0/2553.0) |
| 1 | 254.0 | 395.0 | 0.3914 | (254.0/649.0) |
| Total | 2396.0 | 806.0 | 0.2077 | (665.0/3202.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.2951195 | 0.5429553 | 200.0 |
| max f2 | 0.1309875 | 0.6551476 | 292.0 |
| max f0point5 | 0.4414375 | 0.5424990 | 138.0 |
| max accuracy | 0.6023131 | 0.8272954 | 79.0 |
| max precision | 0.9806578 | 1.0 | 0.0 |
| max recall | 0.0195411 | 1.0 | 391.0 |
| max specificity | 0.9806578 | 1.0 | 0.0 |
| max absolute_mcc | 0.2951195 | 0.4146362 | 200.0 |
| max min_per_class_accuracy | 0.1966145 | 0.7203290 | 251.0 |


Gains/Lift Table: Avg response rate: 20.27 %, avg score: 20.72 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0103061,0.8379341,4.6347294,4.6347294,0.9393939,0.8997138,0.9393939,0.8997138,0.0477658,0.0477658,363.4729421,363.4729421
,2,0.0202998,0.7786006,3.7003082,4.1747066,0.75,0.8065861,0.8461538,0.8538663,0.0369800,0.0847458,270.0308166,317.4706649
,3,0.0302936,0.7298866,4.0086672,4.1199307,0.8125,0.7529234,0.8350515,0.8205655,0.0400616,0.1248074,300.8667180,311.9930742
,4,0.0402873,0.6854358,2.9294106,3.8246079,0.59375,0.7059701,0.7751938,0.7921388,0.0292758,0.1540832,192.9410632,282.4607924
,5,0.0502811,0.6458099,3.3919492,3.7386136,0.6875,0.6643818,0.7577640,0.7667461,0.0338983,0.1879815,239.1949153,273.8613634
,6,0.1002498,0.5276032,2.3126926,3.0278742,0.46875,0.5782137,0.6137072,0.6727735,0.1155624,0.3035439,131.2692604,202.7874180
,7,0.1502186,0.4384326,2.4977080,2.8515195,0.50625,0.4820712,0.5779626,0.6093383,0.1248074,0.4283513,149.7708012,185.1519529
,8,0.2001874,0.3557617,1.7576464,2.5784779,0.35625,0.3952564,0.5226209,0.5559013,0.0878274,0.5161787,75.7646379,157.8477869
,9,0.3001249,0.2465655,1.3876156,2.1819368,0.28125,0.2969576,0.4422477,0.4696765,0.1386749,0.6548536,38.7615562,118.1936831



Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
,2018-12-04 08:02:48,1 min 35.746 sec,0.0,0.4076825,0.5146532,0.5,0.0,1.0,0.7894737,0.4020768,0.5042899,0.5,0.0,1.0,0.7973142
,2018-12-04 08:02:48,1 min 35.762 sec,1.0,0.4002327,0.4970929,0.7838472,0.4954491,3.6159910,0.2770153,0.3957908,0.4895936,0.7507377,0.4228101,2.9153943,0.3191755
,2018-12-04 08:02:48,1 min 35.777 sec,2.0,0.3950060,0.4854956,0.7968698,0.5336195,4.1105769,0.2749500,0.3909855,0.4789946,0.7665359,0.4627749,3.7336443,0.2307933
,2018-12-04 08:02:48,1 min 35.793 sec,3.0,0.3899929,0.4750773,0.8019961,0.5573404,4.34375,0.2093271,0.3872385,0.4711547,0.7678845,0.4826368,4.0248966,0.2429731
,2018-12-04 08:02:48,1 min 35.809 sec,4.0,0.3852380,0.4646688,0.8138739,0.5751946,4.3096026,0.2299134,0.3835043,0.4629838,0.7749504,0.4936927,4.4984138,0.2439101
---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---
,2018-12-04 08:02:51,1 min 38.727 sec,96.0,0.3080079,0.3140165,0.9061847,0.7763366,4.75,0.1316456,0.3546308,0.3984393,0.8046089,0.5471952,4.6347294,0.2133042
,2018-12-04 08:02:51,1 min 38.758 sec,97.0,0.3077235,0.3134351,0.9064875,0.7770495,4.75,0.1315789,0.3544934,0.3980760,0.8049710,0.5480112,4.6347294,0.2158026
,2018-12-04 08:02:51,1 min 38.789 sec,98.0,0.3069511,0.3121404,0.9072710,0.7792928,4.75,0.1299800,0.3544351,0.3978643,0.8052347,0.5487733,4.6347294,0.2076827



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
cy_1_funds,2464.3447266,1.0,0.4641631
Market,1218.5041504,0.4944536,0.2295071
LocProvince,380.0971069,0.1542386,0.0715919
LocPropertyType,260.3808289,0.1056593,0.0490431
cy_2_funds,201.0982971,0.0816032,0.0378772
---,---,---,---
cy_2_contacts,12.3178082,0.0049984,0.0023201
cy_1_checkins,11.5123444,0.0046716,0.0021684
PopulationRank,9.4104185,0.0038186,0.0017725



See the whole table with table.as_data_frame()
<bound method ModelBase.varimp of >


### Deep Learning

In [36]:
# Import H2O DL:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

In [37]:
# Initialize and train the DL estimator:

dl_fit1 = H2ODeepLearningEstimator(model_id='dl_fit1', seed=1)
dl_fit1.train(x=x, y=y, training_frame=train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%


In [38]:
dl_fit2 = H2ODeepLearningEstimator(model_id='dl_fit2', 
                                   epochs=20, 
                                   hidden=[10,10], 
                                   stopping_rounds=0,  #disable early stopping
                                   seed=1)
dl_fit2.train(x=x, y=y, training_frame=train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%


In [39]:
dl_fit3 = H2ODeepLearningEstimator(model_id='dl_fit3', 
                                   epochs=20, 
                                   hidden=[10,10],
                                   score_interval=1,          #used for early stopping
                                   stopping_rounds=3,         #used for early stopping
                                   stopping_metric='AUC',     #used for early stopping
                                   stopping_tolerance=0.0005, #used for early stopping
                                   seed=1)
dl_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)

deeplearning Model Build progress: |██████████████████████████████████████| 100%


In [40]:
# Import H2O DL:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

In [41]:
# DL hyperparameters
activation_opt = ["Rectifier", "RectifierWithDropout", "Maxout", "MaxoutWithDropout"]
l1_opt = [0, 0.00001, 0.0001, 0.001, 0.01, 0.1]
l2_opt = [0, 0.00001, 0.0001, 0.001, 0.01, 0.1]
dl_params = {'activation': activation_opt, 'l1': l1_opt, 'l2': l2_opt}

# Search criteria
search_criteria = {'strategy': 'RandomDiscrete', 'max_runtime_secs': 120, 'seed':1}
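
The grid defined here is too large to search exhaustively in two minutes, which is why the search criteria ask for a `RandomDiscrete` strategy with a time budget. The sketch below, in plain Python without H2O, counts the full grid and draws a few random combinations from it. `n_models` is a hypothetical stand-in for however many models fit within `max_runtime_secs`, and Python's `random` module will not reproduce the combinations H2O picks with its own seed.

```python
import random
from itertools import product

# Same hyperparameter options as in the cell above
activation_opt = ["Rectifier", "RectifierWithDropout", "Maxout", "MaxoutWithDropout"]
l1_opt = [0, 0.00001, 0.0001, 0.001, 0.01, 0.1]
l2_opt = [0, 0.00001, 0.0001, 0.001, 0.01, 0.1]
dl_params = {'activation': activation_opt, 'l1': l1_opt, 'l2': l2_opt}

names = list(dl_params)
all_combos = [dict(zip(names, values)) for values in product(*dl_params.values())]
print(len(all_combos))   # 4 * 6 * 6 combinations in the full grid

n_models = 5             # hypothetical: models that fit in the time budget
rng = random.Random(1)
sampled = rng.sample(all_combos, n_models)  # sampling without replacement
for combo in sampled:
    print(combo)
```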

In [42]:
#dl_grid = H2OGridSearch(model=H2ODeepLearningEstimator,
#                        grid_id='dl_grid1',
#                        hyper_params=dl_params,
#                        search_criteria=search_criteria)

#dl_grid.train(x=x, y=y,
#              training_frame=train, 
#              validation_frame=valid, 
#              hidden=[10,10],
#              hyper_params=dl_params,
#              search_criteria=search_criteria)

#dl_gridperf = dl_grid.get_grid(sort_by='auc', decreasing=True)

In [43]:
# Grab the top DL model, chosen by validation AUC
#best_dl_model = dl_gridperf.models[0]

# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance

#dl_perf = best_dl_model.model_performance(test)
#print(dl_perf.auc())

In [44]:
dl_perf1 = dl_fit1.model_performance(test)
dl_perf2 = dl_fit2.model_performance(test)
dl_perf3 = dl_fit3.model_performance(test)

In [45]:
# Retrieve test set AUC
print(dl_perf1.auc())
print(dl_perf2.auc())
print(dl_perf3.auc())

0.7883194054864155
0.7902256766883031
0.7924553557866596
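
The test-set AUCs printed throughout the notebook are easier to compare side by side. The sketch below collects them into a dictionary and ranks them; every value is copied from the outputs above, and the dictionary keys are labels chosen for this note. The cross-validated AUC of `rf_fit3` is left out because it was computed on the full data set rather than on `test`.

```python
# Test-set AUCs as printed in the notebook
test_auc = {
    'glm_fit1': 0.7872029837068241,
    'glm_fit2': 0.7874918989802842,
    'rf_fit1': 0.7782466102295627,
    'rf_fit2': 0.7819528267495671,
    'gbm_grid1 (best)': 0.7913614016789363,
    'gbm_fit2': 0.7590307383066843,
    'gbm_fit3': 0.7865621803894565,
    'dl_fit1': 0.7883194054864155,
    'dl_fit2': 0.7902256766883031,
    'dl_fit3': 0.7924553557866596,
}

# Rank models from highest to lowest test AUC
ranked = sorted(test_auc.items(), key=lambda kv: kv[1], reverse=True)
for name, auc in ranked:
    print(f'{name:<18} {auc:.4f}')

spread = ranked[0][1] - ranked[-1][1]
print(f'spread: {spread:.4f}')
```

Apart from `gbm_fit2`, which was trained with 500 trees and no validation frame or early stopping, the models sit within about 0.015 AUC of each other on a 3,141-row test set, so the ranking among the top few should be read with caution.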


In [46]:
# Grab the top DL model, chosen by validation AUC
#best_dl_model = dl_gridperf.models[0]

# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance

#dl_perf = best_dl_model.model_performance(test)
#print(dl_perf.auc())