To use this notebook, first install H2O by running the following command in Anaconda Prompt:

`conda install -c h2oai h2o`

https://github.com/h2oai/h2o-tutorials


The goal of this project is to determine which locations are likely to benefit from In Flight usage (check-ins, images, contacts, notes), based on which other locations increased their fundraising after using In Flight. For example, if locations of a certain property type respond better to In Flight visits at certain times, we will suggest that hospitals visit those locations specifically during those times.

We could also, or instead, look at which kinds of check-ins are the most effective.

## Preprocessing

In [1]:
#Import the models we will be using
import h2o
# Import H2O Grid Search:
from h2o.grid.grid_search import H2OGridSearch

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
  Starting server from C:\Users\zgeorge\AppData\Local\Continuum\anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\zgeorge\AppData\Local\Temp\tmpbal43xsv
  JVM stdout: C:\Users\zgeorge\AppData\Local\Temp\tmpbal43xsv\h2o_zgeorge_started_from_python.out
  JVM stderr: C:\Users\zgeorge\AppData\Local\Temp\tmpbal43xsv\h2o_zgeorge_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime:,02 secs
H2O cluster timezone:,America/Denver
H2O data parsing timezone:,UTC
H2O cluster version:,3.22.0.2
H2O cluster version age:,13 days
H2O cluster name:,H2O_from_python_zgeorge_si7qa5
H2O cluster total nodes:,1
H2O cluster free memory:,1.755 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


In [3]:
#import data
funds_csv = "Dunder Mifflin Fundraising.csv"  # modify this for your machine
data = h2o.import_file(funds_csv)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [4]:
data.shape

(1914, 28)

In [5]:
data['target'] = data['target'].asfactor()  #encode the binary response as a factor
data['target'].levels()  #optional: after encoding, this shows the two factor levels, '0' and '1'

[['0', '1']]

In [6]:
# Partition data into 70%, 15%, 15% chunks
# Setting a seed will guarantee reproducibility

splits = data.split_frame(ratios=[0.7, 0.15], seed=1)  

train = splits[0]
valid = splits[1]
test = splits[2]

In [7]:
print(train.nrow)
print(valid.nrow)
print(test.nrow)

1361
262
291
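
The realized split sizes are close to, but not exactly, 70/15/15, because `split_frame` assigns rows to splits probabilistically rather than cutting at exact row counts. A minimal sketch of that check, using only the row counts printed above (the variable names are illustrative):

```python
# Row counts printed above for the train/valid/test frames.
counts = {"train": 1361, "valid": 262, "test": 291}
total = sum(counts.values())  # should equal the 1914 rows in `data`

# Realized share of rows in each split, to compare against 0.70 / 0.15 / 0.15.
fractions = {name: n / total for name, n in counts.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.3f}")
```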


In [8]:
y = 'target'
x = list(data.columns)

In [9]:
x

['Lookup',
 'Hour',
 'Days since last checkin',
 'Check Ins',
 'Contacts Created',
 'Images Uploaded',
 'Notes Written',
 'Checkin_w_images',
 'Checkin_w_Contact',
 'Checkin_w_Note',
 'Checkin_w_I_C',
 'Checkin_w_C_N',
 'Checkin_w_I_N',
 'Checkin_w_All',
 'Property Type',
 'RegionName',
 'PopulationRank',
 'Market',
 'LocRegion',
 'LocProvince',
 'Distance',
 'FundraisingYear',
 'day of week',
 'day of month',
 'Past 1-3 avg',
 'Past 3-6 avg',
 'Change in past week',
 'target']

In [10]:
x.remove('target')  #remove the response
x.remove('Lookup')  
x.remove('Market')
x.remove('RegionName')
x.remove('LocProvince') 

#x.remove('Checkin_w_I_C')
#x.remove('Checkin_w_C_N')
#x.remove('Checkin_w_All')
#x.remove('Checkin_w_Contact') 

In [11]:
# List of predictor columns
x

['Hour',
 'Days since last checkin',
 'Check Ins',
 'Contacts Created',
 'Images Uploaded',
 'Notes Written',
 'Checkin_w_images',
 'Checkin_w_Contact',
 'Checkin_w_Note',
 'Checkin_w_I_C',
 'Checkin_w_C_N',
 'Checkin_w_I_N',
 'Checkin_w_All',
 'Property Type',
 'PopulationRank',
 'LocRegion',
 'Distance',
 'FundraisingYear',
 'day of week',
 'day of month',
 'Past 1-3 avg',
 'Past 3-6 avg',
 'Change in past week']
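
As an alternative to repeated `x.remove(...)` calls, the predictor list can be built in one pass from an exclusion set. This is only a sketch: the column names are copied from the output above, and `all_columns`, `excluded`, and `predictors` are illustrative names, not part of the notebook.

```python
# Column names as listed in the output of data.columns above.
all_columns = [
    'Lookup', 'Hour', 'Days since last checkin', 'Check Ins',
    'Contacts Created', 'Images Uploaded', 'Notes Written',
    'Checkin_w_images', 'Checkin_w_Contact', 'Checkin_w_Note',
    'Checkin_w_I_C', 'Checkin_w_C_N', 'Checkin_w_I_N', 'Checkin_w_All',
    'Property Type', 'RegionName', 'PopulationRank', 'Market', 'LocRegion',
    'LocProvince', 'Distance', 'FundraisingYear', 'day of week',
    'day of month', 'Past 1-3 avg', 'Past 3-6 avg', 'Change in past week',
    'target',
]

# The response plus the identifier and location columns dropped above.
excluded = {'target', 'Lookup', 'Market', 'RegionName', 'LocProvince'}

# Keep the original column order while filtering out the excluded names.
predictors = [c for c in all_columns if c not in excluded]
print(len(predictors))
```

Unlike `list.remove`, this form does not raise an error if an excluded name is missing from the frame, which may or may not be what you want.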

### Logistic Regression (GLM)

In [12]:
# Import H2O GLM:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

In [13]:
# Initialize the GLM estimator:
# Similar to R's glm() and H2O's R GLM, H2O's GLM has the "family" argument

glm_fit1 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit1')

In [14]:
glm_fit1.train(x=x, y=y, training_frame=train)

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [15]:
glm_fit2 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit2', lambda_search=True)
glm_fit2.train(x=x, y=y, training_frame=train, validation_frame=valid)

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [16]:
glm_perf1 = glm_fit1.model_performance(test)
glm_perf2 = glm_fit2.model_performance(test)

In [17]:
# Print model performance
print(glm_perf1)
print(glm_perf2)


ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.15183921155448257
RMSE: 0.38966551240067754
LogLoss: 0.4691812752400193
Null degrees of freedom: 290
Residual degrees of freedom: 249
Null deviance: 301.9853527118239
Residual deviance: 273.06350218969123
AIC: 357.06350218969123
AUC: 0.7464431610085928
pr_auc: 0.3688944519339669
Gini: 0.4928863220171855
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.26219854288406946: 


,0.0,1.0,Error,Rate
0,164.0,65.0,0.2838,(65.0/229.0)
1,18.0,44.0,0.2903,(18.0/62.0)
Total,182.0,109.0,0.2852,(83.0/291.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.2621985,0.5146199,108.0
max f2,0.1574012,0.6561086,193.0
max f0point5,0.2802455,0.4577465,90.0
max accuracy,0.8207004,0.7835052,0.0
max precision,0.4955976,0.4545455,10.0
max recall,0.0145674,1.0,275.0
max specificity,0.8207004,0.9956332,0.0
max absolute_mcc,0.2450643,0.3606791,116.0
max min_per_class_accuracy,0.2621985,0.7096774,108.0


Gains/Lift Table: Avg response rate: 21.31 %, avg score: 22.93 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0103093,0.6071642,1.5645161,1.5645161,0.3333333,0.7058240,0.3333333,0.7058240,0.0161290,0.0161290,56.4516129,56.4516129
,2,0.0206186,0.5732731,1.5645161,1.5645161,0.3333333,0.5946306,0.3333333,0.6502273,0.0161290,0.0322581,56.4516129,56.4516129
,3,0.0309278,0.5265811,3.1290323,2.0860215,0.6666667,0.5550464,0.4444444,0.6185003,0.0322581,0.0645161,212.9032258,108.6021505
,4,0.0412371,0.4844738,1.5645161,1.9556452,0.3333333,0.5033124,0.4166667,0.5897033,0.0161290,0.0806452,56.4516129,95.5645161
,5,0.0515464,0.4567736,0.0,1.5645161,0.0,0.4704876,0.3333333,0.5658602,0.0,0.0806452,-100.0,56.4516129
,6,0.1030928,0.4123623,1.8774194,1.7209677,0.4,0.4300858,0.3666667,0.4979730,0.0967742,0.1774194,87.7419355,72.0967742
,7,0.1512027,0.3677313,2.0115207,1.8134164,0.4285714,0.3842434,0.3863636,0.4617863,0.0967742,0.2741935,101.1520737,81.3416422
,8,0.2027491,0.3352017,1.8774194,1.8296884,0.4,0.3487565,0.3898305,0.4330499,0.0967742,0.3709677,87.7419355,82.9688354
,9,0.3024055,0.2860656,2.2658509,1.9734238,0.4827586,0.3107206,0.4204545,0.3927368,0.2258065,0.5967742,126.5850945,97.3423754





ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.15273861005969747
RMSE: 0.39081787326029177
LogLoss: 0.46988500750163176
Null degrees of freedom: 290
Residual degrees of freedom: 264
Null deviance: 301.9853527118239
Residual deviance: 273.4730743659497
AIC: 327.4730743659497
AUC: 0.7379912663755459
pr_auc: 0.36083222989684344
Gini: 0.4759825327510918
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.24109958220411135: 


,0.0,1.0,Error,Rate
0,152.0,77.0,0.3362,(77.0/229.0)
1,15.0,47.0,0.2419,(15.0/62.0)
Total,167.0,124.0,0.3162,(92.0/291.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.2410996,0.5053763,123.0
max f2,0.1667237,0.6555556,201.0
max f0point5,0.2769934,0.4265403,89.0
max accuracy,0.4663460,0.7869416,7.0
max precision,0.4663460,0.5,7.0
max recall,0.0528875,1.0,275.0
max specificity,0.6113136,0.9956332,0.0
max absolute_mcc,0.2410996,0.3492783,123.0
max min_per_class_accuracy,0.2491063,0.6935484,110.0


Gains/Lift Table: Avg response rate: 21.31 %, avg score: 22.75 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0103093,0.5090527,0.0,0.0,0.0,0.5658904,0.0,0.5658904,0.0,0.0,-100.0,-100.0
,2,0.0206186,0.4872230,3.1290323,1.5645161,0.6666667,0.4993419,0.3333333,0.5326161,0.0322581,0.0322581,212.9032258,56.4516129
,3,0.0309278,0.4579470,3.1290323,2.0860215,0.6666667,0.4714787,0.4444444,0.5122370,0.0322581,0.0645161,212.9032258,108.6021505
,4,0.0412371,0.4327006,1.5645161,1.9556452,0.3333333,0.4474832,0.4166667,0.4960485,0.0161290,0.0806452,56.4516129,95.5645161
,5,0.0515464,0.4284449,1.5645161,1.8774194,0.3333333,0.4314590,0.4,0.4831306,0.0161290,0.0967742,56.4516129,87.7419355
,6,0.1030928,0.3802904,1.2516129,1.5645161,0.2666667,0.4012048,0.3333333,0.4421677,0.0645161,0.1612903,25.1612903,56.4516129
,7,0.1512027,0.3371161,2.3467742,1.8134164,0.5,0.3583441,0.3863636,0.4154966,0.1129032,0.2741935,134.6774194,81.3416422
,8,0.2027491,0.3095613,1.8774194,1.8296884,0.4,0.3216910,0.3898305,0.3916477,0.0967742,0.3709677,87.7419355,82.9688354
,9,0.3024055,0.2788812,1.7803115,1.8134164,0.3793103,0.2942106,0.3863636,0.3595377,0.1774194,0.5483871,78.0311457,81.3416422
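
The "max f1" values in the two reports can be reproduced from the confusion matrices printed with them. The sketch below assumes class `1` is the positive class and reads the counts directly from the tables above; `f1_from_counts` is a helper written for illustration, not an H2O function.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# glm_fit1: 44 true positives, 65 false positives, 18 false negatives.
p1, r1, f1_1 = f1_from_counts(tp=44, fp=65, fn=18)

# glm_fit2: 47 true positives, 77 false positives, 15 false negatives.
p2, r2, f1_2 = f1_from_counts(tp=47, fp=77, fn=15)

print(f"glm_fit1: precision={p1:.4f} recall={r1:.4f} f1={f1_1:.7f}")
print(f"glm_fit2: precision={p2:.4f} recall={r2:.4f} f1={f1_2:.7f}")
```

Both models reach their best F1 with precision below 0.5, so more than half of the locations flagged at these thresholds are false positives.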






In [18]:
# Retrieve test set AUC
print(glm_perf1.auc())
print(glm_perf2.auc())

0.7464431610085928
0.7379912663755459
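
Several of the other reported GLM metrics follow from the numbers already shown. The sketch below checks two relationships against the printed values: Gini = 2·AUC − 1, and AIC = residual deviance + 2·(number of fitted coefficients). The coefficient count is inferred here from the degrees of freedom, which is an assumption that happens to match both reports; `aic` is an illustrative helper.

```python
def aic(residual_deviance, null_df, residual_df):
    """AIC from deviance, inferring the coefficient count from the df values."""
    n_coefficients = null_df - residual_df + 1  # +1 for the intercept
    return residual_deviance + 2 * n_coefficients

# Gini coefficient of glm_fit1 from its test AUC.
auc_1 = 0.7464431610085928
gini_1 = 2 * auc_1 - 1

# AIC of each GLM from the values in the test reports above.
aic_1 = aic(273.06350218969123, null_df=290, residual_df=249)
aic_2 = aic(273.4730743659497, null_df=290, residual_df=264)

print(gini_1, aic_1, aic_2)
```

The lower AIC of `glm_fit2` comes almost entirely from the smaller penalty term: the regularized model keeps fewer nonzero coefficients while reaching nearly the same deviance.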


In [19]:
# Compare test AUC to the training AUC and validation AUC
print(glm_fit2.auc(train=True))
print(glm_fit2.auc(valid=True))

0.7289495616836701
0.7296534017971759


### Random Forest

In [20]:
# Import H2O RF:
from h2o.estimators.random_forest import H2ORandomForestEstimator

In [21]:
# Initialize the RF estimator:

rf_fit1 = H2ORandomForestEstimator(model_id='rf_fit1', seed=1, balance_classes = True)

In [22]:
rf_fit1.train(x=x, y=y, training_frame=train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [23]:
rf_fit2 = H2ORandomForestEstimator(model_id='rf_fit2', ntrees=100, seed=1, balance_classes = True)
rf_fit2.train(x=x, y=y, training_frame=train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [24]:
rf_perf1 = rf_fit1.model_performance(test)
rf_perf2 = rf_fit2.model_performance(test)

In [25]:
# Retrieve test set AUC
print(rf_perf1.auc())
print(rf_perf2.auc())

0.8717424989435132
0.8800535286660093


In [26]:
#Cross Validate Performance
rf_fit3 = H2ORandomForestEstimator(model_id='rf_fit3', seed=1, nfolds=5, balance_classes = True)
rf_fit3.train(x=x, y=y, training_frame=data)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [27]:
print(rf_fit3.auc(xval=True))

0.8905302421901437
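
With `nfolds=5`, every row is held out exactly once and scored by a model trained on the other four folds, which is why this cell trains on the full `data` frame rather than on `train`. The sketch below shows the general idea in plain Python. It is not H2O's fold assignment, and `kfold_indices` is a hypothetical helper.

```python
import random

def kfold_indices(n_rows, n_folds, seed):
    """Yield (train_indices, holdout_indices) pairs for k-fold cross-validation."""
    rng = random.Random(seed)
    order = list(range(n_rows))
    rng.shuffle(order)
    # Deal the shuffled rows into n_folds roughly equal folds.
    folds = [order[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        holdout_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, holdout_idx

cv_splits = list(kfold_indices(n_rows=1914, n_folds=5, seed=1))
for train_idx, holdout_idx in cv_splits:
    print(len(train_idx), len(holdout_idx))
```

Because the cross-validated AUC is computed over all 1914 rows, it is not directly comparable to the AUC on the 291-row test split.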


### Gradient Boosting

In [28]:
# Import H2O GBM:
from h2o.estimators.gbm import H2OGradientBoostingEstimator
encoding = 'auto'  # categorical encoding scheme, passed as a string

In [29]:
# GBM hyperparameters
gbm_params1 = {'learn_rate': [0.01, 0.1], 
                'max_depth': [3, 5, 9],
                'sample_rate': [0.8, 1.0],
                'col_sample_rate': [0.2, 0.5, 1.0]}
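
A grid search with the default Cartesian strategy trains one model per combination of these values. A quick sketch of how many models that implies, enumerating the combinations with `itertools.product` (`combos` is an illustrative name):

```python
from itertools import product

# Same hyperparameter grid as in the cell above.
gbm_params1 = {'learn_rate': [0.01, 0.1],
               'max_depth': [3, 5, 9],
               'sample_rate': [0.8, 1.0],
               'col_sample_rate': [0.2, 0.5, 1.0]}

# Every combination of one value per hyperparameter.
names = list(gbm_params1)
combos = [dict(zip(names, values)) for values in product(*gbm_params1.values())]

print(len(combos))  # 2 * 3 * 2 * 3 = 36
print(combos[0])
```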

In [30]:
# Initialize and train the GBM estimator:
# categorical_encoding is a model parameter, so it is passed through train() rather than to H2OGridSearch
gbm_grid1 = H2OGridSearch(model=H2OGradientBoostingEstimator, grid_id='gbm_grid1', hyper_params=gbm_params1)
gbm_grid1.train(x=x, y=y, training_frame=train, validation_frame=valid, ntrees=100, seed=1, categorical_encoding=encoding)

gbm Grid Build progress: |████████████████████████████████████████████████| 100%


In [31]:
gbm_fit2 = H2OGradientBoostingEstimator(model_id='gbm_fit2', ntrees=500, seed=1, balance_classes = True)
gbm_fit2.train(x=x, y=y, training_frame=train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [32]:
# Now let's use early stopping to find optimal ntrees

gbm_fit3 = H2OGradientBoostingEstimator(model_id='gbm_fit3', 
                                        ntrees=500, 
                                        score_tree_interval=5,     #used for early stopping
                                        stopping_rounds=3,         #used for early stopping
                                        stopping_metric='AUC',     #used for early stopping
                                        stopping_tolerance=0.0005, #used for early stopping
                                        seed=1,
                                        balance_classes = True)

# The use of a validation_frame is recommended when using early stopping
gbm_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%
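
The early-stopping parameters above work together: the model is scored every `score_tree_interval` trees, and training stops once a moving average of the stopping metric over `stopping_rounds` scoring events stops improving by more than `stopping_tolerance`. The function below is a simplified approximation of that rule for a "higher is better" metric such as AUC. It is not H2O's implementation, and the score history is made up for illustration.

```python
def should_stop(scores, stopping_rounds, tolerance):
    """Simplified moving-average early-stopping check (higher score is better)."""
    k = stopping_rounds
    if k == 0 or len(scores) < 2 * k:
        return False  # disabled, or not enough scoring events yet
    # Simple moving averages of length k over the scoring history.
    averages = [sum(scores[i:i + k]) / k for i in range(len(scores) - k + 1)]
    recent_best = max(averages[-k:])   # best of the k most recent averages
    reference = averages[-k - 1]       # the average just before those
    return recent_best <= reference * (1 + tolerance)

# Hypothetical validation AUC at successive scoring events.
history = [0.80, 0.85, 0.88, 0.90, 0.905, 0.905, 0.905, 0.905, 0.905, 0.905]

for n in range(1, len(history) + 1):
    if should_stop(history[:n], stopping_rounds=3, tolerance=0.0005):
        print(f"stop after scoring event {n}")
        break
```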


In [33]:
gbm_gridperf = gbm_grid1.get_grid(sort_by='auc', decreasing=True)
# Grab the top GBM model, chosen by validation AUC
best_gbm_model = gbm_gridperf.models[0]

gbm_perf1 = best_gbm_model.model_performance(test)
gbm_perf2 = gbm_fit2.model_performance(test)
gbm_perf3 = gbm_fit3.model_performance(test)

In [34]:
# Retrieve test set AUC
print(gbm_perf1.auc())
print(gbm_perf2.auc())
print(gbm_perf3.auc())

0.8797013663896323
0.8583603324411889
0.878574447105226


In [35]:
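# Note: varimp is a method. Printing it without calling it prints the bound method,
# whose repr is the full model summary shown below; call best_gbm_model.varimp() for just the table.
print(best_gbm_model.varimp)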

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  gbm_grid1_model_14


ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.08066536547054884
RMSE: 0.28401648802586943
LogLoss: 0.28890536869540756
Mean Per-Class Error: 0.06989031445622618
AUC: 0.9800902183368921
pr_auc: 0.9331209330072399
Gini: 0.9601804366737843
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3338791636654174: 


,0.0,1.0,Error,Rate
0,1010.0,37.0,0.0353,(37.0/1047.0)
1,50.0,264.0,0.1592,(50.0/314.0)
Total,1060.0,301.0,0.0639,(87.0/1361.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.3338792,0.8585366,175.0
max f2,0.2302233,0.9115044,239.0
max f0point5,0.3646576,0.8773181,159.0
max accuracy,0.3338792,0.9360764,175.0
max precision,0.7246297,1.0,0.0
max recall,0.1814096,1.0,278.0
max specificity,0.7246297,1.0,0.0
max absolute_mcc,0.3338792,0.8175743,175.0
max min_per_class_accuracy,0.2707173,0.9245463,209.0


Gains/Lift Table: Avg response rate: 23.07 %, avg score: 23.07 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0102866,0.7131796,4.3343949,4.3343949,1.0,0.7176938,1.0,0.7176938,0.0445860,0.0445860,333.4394904,333.4394904
,2,0.0205731,0.7085035,4.3343949,4.3343949,1.0,0.7106744,1.0,0.7141841,0.0445860,0.0891720,333.4394904,333.4394904
,3,0.0301249,0.7005862,4.3343949,4.3343949,1.0,0.7051702,1.0,0.7113260,0.0414013,0.1305732,333.4394904,333.4394904
,4,0.0404115,0.6883880,4.3343949,4.3343949,1.0,0.6944100,1.0,0.7070201,0.0445860,0.1751592,333.4394904,333.4394904
,5,0.0506980,0.6777218,4.3343949,4.3343949,1.0,0.6833340,1.0,0.7022142,0.0445860,0.2197452,333.4394904,333.4394904
,6,0.1006613,0.5921431,4.3343949,4.3343949,1.0,0.6307416,1.0,0.6667388,0.2165605,0.4363057,333.4394904,333.4394904
,7,0.1506245,0.4743554,3.4420195,4.0383874,0.7941176,0.5288941,0.9317073,0.6210147,0.1719745,0.6082803,244.2019483,303.8387448
,8,0.2005878,0.3635956,3.5057606,3.9057185,0.8088235,0.4134612,0.9010989,0.5693164,0.1751592,0.7834395,250.5760584,290.5718485
,9,0.3005143,0.2494908,1.7528803,3.1898603,0.4044118,0.2988201,0.7359413,0.4793714,0.1751592,0.9585987,75.2880292,218.9860309




ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.10738613094612717
RMSE: 0.3276982315273111
LogLoss: 0.3526392796944966
Mean Per-Class Error: 0.15703893881044073
AUC: 0.9051775780915703
pr_auc: 0.6992576931358087
Gini: 0.8103551561831406
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.23648281929888165: 


,0.0,1.0,Error,Rate
0,175.0,30.0,0.1463,(30.0/205.0)
1,11.0,46.0,0.193,(11.0/57.0)
Total,186.0,76.0,0.1565,(41.0/262.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.2364828,0.6917293,75.0
max f2,0.1690449,0.8006042,102.0
max f0point5,0.4231605,0.7239819,40.0
max accuracy,0.4231605,0.8702290,40.0
max precision,0.7202686,1.0,0.0
max recall,0.1178760,1.0,159.0
max specificity,0.7202686,1.0,0.0
max absolute_mcc,0.2364828,0.6006752,75.0
max min_per_class_accuracy,0.2043152,0.8146341,84.0


Gains/Lift Table: Avg response rate: 21.76 %, avg score: 22.47 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0114504,0.7146478,4.5964912,4.5964912,1.0,0.7198602,1.0,0.7198602,0.0526316,0.0526316,359.6491228,359.6491228
,2,0.0229008,0.7063768,4.5964912,4.5964912,1.0,0.7091718,1.0,0.7145160,0.0526316,0.1052632,359.6491228,359.6491228
,3,0.0305344,0.7022501,2.2982456,4.0219298,0.5,0.7053051,0.875,0.7122133,0.0175439,0.1228070,129.8245614,302.1929825
,4,0.0419847,0.6981562,3.0643275,3.7607656,0.6666667,0.7000096,0.8181818,0.7088850,0.0350877,0.1578947,206.4327485,276.0765550
,5,0.0534351,0.6774225,4.5964912,3.9398496,1.0,0.6882409,0.8571429,0.7044613,0.0526316,0.2105263,359.6491228,293.9849624
,6,0.1030534,0.5801487,3.5357625,3.7452891,0.7692308,0.6240870,0.8148148,0.6657625,0.1754386,0.3859649,253.5762483,274.5289149
,7,0.1526718,0.4240456,3.1821862,3.5622807,0.6923077,0.5105784,0.775,0.6153277,0.1578947,0.5438596,218.2186235,256.2280702
,8,0.2022901,0.3427659,1.4143050,3.0354187,0.3076923,0.3854640,0.6603774,0.5589460,0.0701754,0.6140351,41.4304993,203.5418736
,9,0.3015267,0.2214432,1.9446694,2.6764379,0.4230769,0.2866857,0.5822785,0.4693414,0.1929825,0.8070175,94.4669366,167.6437930



Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
,2018-12-05 11:40:31,15.661 sec,0.0,0.4212889,0.5401361,0.5,0.0,1.0,0.7692873,0.4127943,0.5242931,0.5,0.0,1.0,0.7824427
,2018-12-05 11:40:31,15.669 sec,1.0,0.4190541,0.5348765,0.9218133,0.6904241,4.0867152,0.1418075,0.4111523,0.5205021,0.8397946,0.4550935,3.3429027,0.2366412
,2018-12-05 11:40:31,15.676 sec,2.0,0.4167839,0.5296198,0.9471678,0.8476022,4.3343949,0.1058046,0.4092130,0.5160826,0.8701326,0.6322209,3.0643275,0.1564885
,2018-12-05 11:40:31,15.685 sec,3.0,0.4145087,0.5244291,0.9563068,0.8581450,4.3343949,0.1080088,0.4075015,0.5122329,0.8760804,0.6337763,3.0643275,0.1984733
,2018-12-05 11:40:31,15.695 sec,4.0,0.4120911,0.5189969,0.9604785,0.8825637,4.3343949,0.0911095,0.4056867,0.5082079,0.8853231,0.6607637,4.5964912,0.1984733
---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---
,2018-12-05 11:40:32,16.805 sec,96.0,0.2868209,0.2935556,0.9797815,0.9293814,4.3343949,0.0661278,0.3286572,0.3549953,0.9052632,0.7007329,4.5964912,0.1564885
,2018-12-05 11:40:32,16.817 sec,97.0,0.2861896,0.2925002,0.9798241,0.9293373,4.3343949,0.0705364,0.3283970,0.3543867,0.9049208,0.6977633,4.5964912,0.1564885
,2018-12-05 11:40:32,16.829 sec,98.0,0.2856431,0.2915466,0.9798013,0.9292479,4.3343949,0.0705364,0.3281990,0.3538633,0.9047497,0.7021267,4.5964912,0.1564885



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
day of month,2027.5406494,1.0,0.3504845
Past 1-3 avg,1288.7708740,0.6356326,0.2227794
Change in past week,800.5285645,0.3948274,0.1383809
Past 3-6 avg,529.7349854,0.2612697,0.0915710
LocRegion,329.9961243,0.1627568,0.0570438
---,---,---,---
Contacts Created,0.7660735,0.0003778,0.0001324
Checkin_w_All,0.3824240,0.0001886,0.0000661
Checkin_w_C_N,0.0000071,0.0000000,0.0000000



See the whole table with table.as_data_frame()
<bound method ModelBase.varimp of >
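
The columns of the variable importance table are related by simple rescaling: `scaled_importance` divides each relative importance by the largest one, and `percentage` divides it by the total over all variables. The sketch below reproduces this for the top five rows shown above. Because the printed table is truncated, the total is inferred from the first row's percentage rather than summed, which is an assumption.

```python
# Relative importances of the top five variables, copied from the table above.
relative = {
    'day of month': 2027.5406494,
    'Past 1-3 avg': 1288.7708740,
    'Change in past week': 800.5285645,
    'Past 3-6 avg': 529.7349854,
    'LocRegion': 329.9961243,
}

# scaled_importance: each value relative to the most important variable.
top = max(relative.values())
scaled = {name: value / top for name, value in relative.items()}

# percentage: share of the total importance. The total is inferred from the
# reported percentage of the top variable because the table is truncated.
implied_total = top / 0.3504845
percentage = {name: value / implied_total for name, value in relative.items()}

for name in relative:
    print(f"{name}: scaled={scaled[name]:.7f} percentage={percentage[name]:.7f}")
```

The top five variables account for about 86% of the total importance, and `day of month` alone for about 35%.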


### Deep Learning

In [36]:
# Import H2O DL:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

In [37]:
# Initialize and train the DL estimator:

dl_fit1 = H2ODeepLearningEstimator(model_id='dl_fit1', seed=1, balance_classes = True)
dl_fit1.train(x=x, y=y, training_frame=train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%


In [38]:
dl_fit2 = H2ODeepLearningEstimator(model_id='dl_fit2', 
                                   epochs=20, 
                                   hidden=[10,10], 
                                   stopping_rounds=0,  #disable early stopping
                                   seed=1,
                                   balance_classes = True)
dl_fit2.train(x=x, y=y, training_frame=train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%


In [39]:
dl_fit3 = H2ODeepLearningEstimator(model_id='dl_fit3', 
                                   epochs=20, 
                                   hidden=[10,10],
                                   score_interval=1,          #used for early stopping
                                   stopping_rounds=3,         #used for early stopping
                                   stopping_metric='AUC',     #used for early stopping
                                   stopping_tolerance=0.0005, #used for early stopping
                                   seed=1,
                                   balance_classes = True)
dl_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)

deeplearning Model Build progress: |██████████████████████████████████████| 100%


In [40]:
# Import H2O DL:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

In [41]:
# DL hyperparameters
activation_opt = ["Rectifier", "RectifierWithDropout", "Maxout", "MaxoutWithDropout"]
l1_opt = [0, 0.00001, 0.0001, 0.001, 0.01, 0.1]
l2_opt = [0, 0.00001, 0.0001, 0.001, 0.01, 0.1]
dl_params = {'activation': activation_opt, 'l1': l1_opt, 'l2': l2_opt}

# Search criteria
search_criteria = {'strategy': 'RandomDiscrete', 'max_runtime_secs': 120, 'seed':1}
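
With the `RandomDiscrete` strategy, the grid search draws combinations at random from the full grid until a budget (here `max_runtime_secs`) runs out, instead of training all of them. The sketch below shows the size of this grid and what a seeded random draw looks like. The fixed sample size of 5 stands in for the time budget and is purely illustrative. This is not H2O's sampling code.

```python
import random
from itertools import product

# Same option lists as in the cell above.
activation_opt = ["Rectifier", "RectifierWithDropout", "Maxout", "MaxoutWithDropout"]
l1_opt = [0, 0.00001, 0.0001, 0.001, 0.01, 0.1]
l2_opt = [0, 0.00001, 0.0001, 0.001, 0.01, 0.1]

# The full Cartesian grid: 4 activations x 6 l1 values x 6 l2 values.
all_combos = list(product(activation_opt, l1_opt, l2_opt))
print(len(all_combos))  # 144

# Draw a few distinct combinations without replacement, reproducibly.
rng = random.Random(1)
sampled = rng.sample(all_combos, 5)
for activation, l1, l2 in sampled:
    print(activation, l1, l2)
```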

In [42]:
#dl_grid = H2OGridSearch(model=H2ODeepLearningEstimator,
#                        grid_id='dl_grid1',
#                        hyper_params=dl_params,
#                        search_criteria=search_criteria)

# hyper_params and search_criteria belong to H2OGridSearch above, not to train()
#dl_grid.train(x=x, y=y,
#              training_frame=train, 
#              validation_frame=valid, 
#              hidden=[10,10])

#dl_gridperf = dl_grid.get_grid(sort_by='auc', decreasing=True)

In [43]:
# Grab the top DL model from the grid, chosen by validation AUC
#best_dl_model = dl_gridperf.models[0]

# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance

#dl_perf = best_dl_model.model_performance(test)
#print(dl_perf.auc())

In [44]:
dl_perf1 = dl_fit1.model_performance(test)
dl_perf2 = dl_fit2.model_performance(test)
dl_perf3 = dl_fit3.model_performance(test)

In [45]:
# Retrieve test set AUC
print(dl_perf1.auc())
print(dl_perf2.auc())
print(dl_perf3.auc())

0.7903930131004366
0.7675728975912101
0.7795464149880266


In [46]:
# Grab the top DL model from the grid, chosen by validation AUC
#best_dl_model = dl_gridperf.models[0]

# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance

#dl_perf = best_dl_model.model_performance(test)
#print(dl_perf.auc())

### AutoML (picks ensembles for you!)

In [47]:
#from h2o.automl import H2OAutoML

In [57]:
#aml = H2OAutoML(nfolds=5, balance_classes=True, max_models=20, seed=1, max_runtime_secs=900)
#aml.train(x = x, y = y, training_frame=train, validation_frame=valid)

AutoML progress: |████████████████████████████████████████████████████████| 100%


In [58]:
#lb = aml.leaderboard

In [59]:
#lb.head()

model_id,auc,logloss,mean_per_class_error,rmse,mse
GBM_2_AutoML_20181205_114825,0.914544,0.312158,0.181513,0.313152,0.0980645
GBM_grid_1_AutoML_20181205_114825_model_12,0.914308,0.307619,0.183103,0.312127,0.0974232
GBM_grid_1_AutoML_20181205_114825_model_7,0.913407,0.310253,0.192024,0.313129,0.0980495
GBM_3_AutoML_20181205_114112,0.911651,0.308763,0.188043,0.311434,0.0969911
GBM_3_AutoML_20181205_114825,0.911509,0.321457,0.18374,0.316994,0.100485
StackedEnsemble_BestOfFamily_AutoML_20181205_114825,0.911049,0.325435,0.171792,0.314702,0.099037
GBM_4_AutoML_20181205_114825,0.909727,0.33103,0.203648,0.320547,0.10275
GBM_grid_1_AutoML_20181205_114825_model_16,0.909479,0.420602,0.187084,0.368921,0.136103
GBM_grid_1_AutoML_20181205_114825_model_11,0.909456,0.323882,0.207315,0.317695,0.10093
GBM_4_AutoML_20181205_114112,0.909269,0.315524,0.18151,0.314723,0.0990509




In [61]:
# Get model ids for all models in the AutoML Leaderboard
#model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])

In [62]:
# Get the "All Models" Stacked Ensemble model
#se = h2o.get_model([mid for mid in model_ids if "StackedEnsemble_AllModels" in mid][0])

In [63]:
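# Get the Stacked Ensemble metalearner model
# Note: aml.leader is a GBM in this run and has no metalearner(), which is what raised
# the AttributeError below; query the stacked ensemble `se` instead:
#metalearner = h2o.get_model(se.metalearner()['name'])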

AttributeError: type object 'H2OGradientBoostingEstimator' has no attribute 'metalearner'

In [None]:
#metalearner.coef_norm()

In [None]:
#%matplotlib inline
#metalearner.std_coef_plot()

In [64]:
#h2o.save_model(aml.leader)

'C:\\Users\\zgeorge\\Dropbox (CMN Hospitals)\\George\\School\\Fall 2018\\Applied Machine Learning\\AML_code\\Final Project\\GBM_2_AutoML_20181205_114825'

In [65]:
#aml.leader.download_mojo()


'C:\\Users\\zgeorge\\Dropbox (CMN Hospitals)\\George\\School\\Fall 2018\\Applied Machine Learning\\AML_code\\Final Project\\GBM_2_AutoML_20181205_114825.zip'

In [66]:
#automlleader = aml.leader.model_performance(test)

In [70]:
#print(automlleader)


ModelMetricsBinomial: gbm
** Reported on test data. **

MSE: 0.11443888754233571
RMSE: 0.3382881723358588
LogLoss: 0.36242273686222015
Mean Per-Class Error: 0.20812790533878012
AUC: 0.8723059585857162
pr_auc: 0.6550562435747587
Gini: 0.7446119171714325
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.2803486167706243: 


,0.0,1.0,Error,Rate
0,201.0,28.0,0.1223,(28.0/229.0)
1,19.0,43.0,0.3065,(19.0/62.0)
Total,220.0,71.0,0.1615,(47.0/291.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.2803486,0.6466165,70.0
max f2,0.0431307,0.7338308,153.0
max f0point5,0.3816209,0.6578947,50.0
max accuracy,0.3816209,0.8522337,50.0
max precision,0.9010817,1.0,0.0
max recall,0.0152282,1.0,197.0
max specificity,0.9010817,1.0,0.0
max absolute_mcc,0.3248643,0.5490914,61.0
max min_per_class_accuracy,0.1545941,0.7772926,99.0


Gains/Lift Table: Avg response rate: 21.31 %, avg score: 18.36 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0103093,0.8899551,4.6935484,4.6935484,1.0,0.8998860,1.0,0.8998860,0.0483871,0.0483871,369.3548387,369.3548387
,2,0.0206186,0.8737693,4.6935484,4.6935484,1.0,0.8839540,1.0,0.8919200,0.0483871,0.0967742,369.3548387,369.3548387
,3,0.0309278,0.8645296,4.6935484,4.6935484,1.0,0.8723226,1.0,0.8853875,0.0483871,0.1451613,369.3548387,369.3548387
,4,0.0412371,0.8431202,3.1290323,4.3024194,0.6666667,0.8567315,0.9166667,0.8782235,0.0322581,0.1774194,212.9032258,330.2419355
,5,0.0515464,0.8136066,3.1290323,4.0677419,0.6666667,0.8289826,0.8666667,0.8683753,0.0322581,0.2096774,212.9032258,306.7741935
,6,0.1030928,0.6658755,2.5032258,3.2854839,0.5333333,0.7349357,0.7,0.8016555,0.1290323,0.3387097,150.3225806,228.5483871
,7,0.1512027,0.4365177,2.6820276,3.0934751,0.5714286,0.5444545,0.6590909,0.7198188,0.1290323,0.4677419,168.2027650,209.3475073
,8,0.2027491,0.3482195,2.8161290,3.0229634,0.6,0.3803563,0.6440678,0.6335148,0.1451613,0.6129032,181.6129032,202.2963368
,9,0.3024055,0.1985635,1.1329255,2.4001100,0.2413793,0.2648648,0.5113636,0.5120279,0.1129032,0.7258065,13.2925473,140.0109971
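
The `lift` and `gain` columns in the table above compare the response rate within each score group to the overall response rate of the test set (62 positives out of 291 rows, about 21.31%). A minimal sketch of that arithmetic, using values from the table (the helper names are illustrative):

```python
# Overall positive rate on the test set: 62 positives out of 291 rows.
positives, rows = 62, 291
base_rate = positives / rows

def lift(response_rate):
    """How many times more often positives occur in a group than overall."""
    return response_rate / base_rate

def gain(response_rate):
    """Lift expressed as a percentage improvement over the base rate."""
    return (lift(response_rate) - 1) * 100

# Group 1 has a response rate of 1.0; group 4 has 0.6666667 (two of three rows).
print(f"base rate: {base_rate:.4f}")
print(f"group 1: lift={lift(1.0):.7f} gain={gain(1.0):.7f}")
print(f"group 4: lift={lift(2 / 3):.7f} gain={gain(2 / 3):.7f}")
```

A lift of about 4.69 in the top groups means the highest-scored locations are positive nearly five times as often as a randomly chosen location.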




