# H2O

Let's use [H2O AutoML](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html) and see what we can build. It trains several models and stacks them into an ensemble for us, so this seems like 'stacking, the easy way out'.

In [1]:
import h2o
from h2o.automl import H2OAutoML

h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "11.0.10" 2021-01-19 LTS; Java(TM) SE Runtime Environment 18.9 (build 11.0.10+8-LTS-162); Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.10+8-LTS-162, mixed mode)
  Starting server from /Users/king/opt/anaconda3/envs/tabular/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/wn/c096zq791xd853brbq55tmq80000gn/T/tmp4b2hze_c
  JVM stdout: /var/folders/wn/c096zq791xd853brbq55tmq80000gn/T/tmp4b2hze_c/h2o_king_started_from_python.out
  JVM stderr: /var/folders/wn/c096zq791xd853brbq55tmq80000gn/T/tmp4b2hze_c/h2o_king_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,01 secs
H2O_cluster_timezone:,Europe/Athens
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.32.0.4
H2O_cluster_version_age:,1 month and 11 days
H2O_cluster_name:,H2O_from_python_king_jau9zi
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,2 Gb
H2O_cluster_total_cores:,8
H2O_cluster_allowed_cores:,8


In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import pickle
from pathlib import Path
from tqdm.notebook import trange, tqdm
### USE FOR LOCAL JUPYTER NOTEBOOKS ###
DOWNLOAD_DIR = Path('../download')
DATA_DIR = Path('../data')
SUBMISSIONS_DIR = Path('../submissions')
MODEL_DIR = Path('../models')
#######################################

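# h2o.import_file expects string paths, so convert the Path objects
X = h2o.import_file(path=str(DOWNLOAD_DIR / 'train_values.csv'))
y = h2o.import_file(path=str(DOWNLOAD_DIR / 'train_labels.csv'))
# Treat the target as categorical so AutoML runs classification, not regression
y['damage_grade'] = y['damage_grade'].asfactor()
# merge joins on the column the two frames share (building_id)
data = X.merge(y)
y_str = 'damage_grade'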

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [6]:
data.drop('building_id')

geo_level_1_id,geo_level_2_id,geo_level_3_id,count_floors_pre_eq,age,area_percentage,height_percentage,land_surface_condition,foundation_type,roof_type,ground_floor_type,other_floor_type,position,plan_configuration,has_superstructure_adobe_mud,has_superstructure_mud_mortar_stone,has_superstructure_stone_flag,has_superstructure_cement_mortar_stone,has_superstructure_mud_mortar_brick,has_superstructure_cement_mortar_brick,has_superstructure_timber,has_superstructure_bamboo,has_superstructure_rc_non_engineered,has_superstructure_rc_engineered,has_superstructure_other,legal_ownership_status,count_families,has_secondary_use,has_secondary_use_agriculture,has_secondary_use_hotel,has_secondary_use_rental,has_secondary_use_institution,has_secondary_use_school,has_secondary_use_industry,has_secondary_use_health_post,has_secondary_use_gov_office,has_secondary_use_use_police,has_secondary_use_other,damage_grade
30,266,1224,1,25,5,2,t,r,n,f,j,s,d,0,1,0,0,0,0,0,0,0,0,0,v,0,0,0,0,0,0,0,0,0,0,0,0,2
17,409,12182,2,0,13,7,t,r,n,f,q,s,d,0,1,0,0,0,0,0,0,0,0,0,v,1,0,0,0,0,0,0,0,0,0,0,0,3
17,716,7056,2,5,12,6,o,r,q,f,q,s,d,0,1,0,0,0,0,0,0,0,0,0,v,1,0,0,0,0,0,0,0,0,0,0,0,3
4,651,105,2,80,5,4,n,r,n,f,q,s,d,0,1,0,0,0,0,0,0,0,0,0,v,1,0,0,0,0,0,0,0,0,0,0,0,2
3,1387,3909,5,40,5,10,t,r,n,f,q,o,d,0,0,0,0,1,0,0,0,0,0,0,v,1,0,0,0,0,0,0,0,0,0,0,0,2
26,1132,6645,2,0,6,6,t,w,n,f,x,s,d,0,0,0,0,0,0,1,0,0,0,0,a,1,0,0,0,0,0,0,0,0,0,0,0,1
8,1297,9721,2,0,2,6,t,r,n,f,x,s,d,0,1,1,0,0,0,0,0,0,0,0,v,1,0,0,0,0,0,0,0,0,0,0,0,3
6,398,4512,2,30,10,5,t,r,n,f,q,t,d,0,1,0,0,0,0,0,0,0,0,0,v,0,1,1,0,0,0,0,0,0,0,0,0,3
7,555,2763,3,40,5,6,t,r,n,f,q,s,d,0,1,0,0,0,0,0,0,0,0,0,v,2,0,0,0,0,0,0,0,0,0,0,0,2
20,508,10459,2,5,7,6,t,w,q,f,q,s,d,0,1,0,0,0,0,0,1,0,0,0,v,1,0,0,0,0,0,0,0,0,0,0,0,1




In [7]:
aml = H2OAutoML(max_models=3, seed=1)
aml.train(y=y_str, training_frame=data.drop('building_id'))

AutoML progress: |████████████████████████████████████████████████████████| 100%
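
The training call drops `building_id` inline, and the earlier `data.drop('building_id')` cell only displays a copy, since `drop` returns a new frame and leaves `data` unchanged. One alternative, sketched here, is to hand the predictor columns to `train` through its `x` argument. The helper names `select_predictors` and `train_automl` are hypothetical, and the H2O import sits inside the function so the column selection can be tried without a running cluster.

```python
def select_predictors(columns, target, ignored=("building_id",)):
    """Return every column name except the target and the ignored identifiers."""
    return [c for c in columns if c != target and c not in ignored]


def train_automl(frame, target, max_models=3, seed=1):
    """Train H2O AutoML on an H2OFrame, keeping identifier columns out of the predictors."""
    # Imported here so that defining the helper does not require h2o or a JVM.
    from h2o.automl import H2OAutoML

    predictors = select_predictors(frame.columns, target)
    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(x=predictors, y=target, training_frame=frame)
    return aml


# Column selection on a few of the column names from the training frame.
example_columns = ["building_id", "geo_level_1_id", "age", "damage_grade"]
print(select_predictors(example_columns, "damage_grade"))
```

With the notebook's objects this would be called as `aml = train_automl(data, y_str)`.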


In [9]:
lb = aml.leaderboard
lb.head(rows=lb.nrows)

model_id,mean_per_class_error,logloss,rmse,mse,auc,aucpr
XGBoost_2_AutoML_20210312_134505,0.331618,0.593365,0.436941,0.190918,,
XGBoost_1_AutoML_20210312_134505,0.333698,0.578879,0.434243,0.188567,,
StackedEnsemble_AllModels_AutoML_20210312_134505,0.333935,0.584509,0.434848,0.189093,,
XGBoost_3_AutoML_20210312_134505,0.350392,0.594189,0.441993,0.195358,,
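
The leaderboard is ordered by `mean_per_class_error`, which is AutoML's default ranking metric for multiclass problems. The ranking is not the same under every column, though. The small sketch below re-sorts the four rows by `logloss` in plain Python; the numbers are copied from the leaderboard and nothing is retrained.

```python
# (model_id, mean_per_class_error, logloss) copied from the leaderboard.
leaderboard = [
    ("XGBoost_2_AutoML_20210312_134505", 0.331618, 0.593365),
    ("XGBoost_1_AutoML_20210312_134505", 0.333698, 0.578879),
    ("StackedEnsemble_AllModels_AutoML_20210312_134505", 0.333935, 0.584509),
    ("XGBoost_3_AutoML_20210312_134505", 0.350392, 0.594189),
]

by_class_error = sorted(leaderboard, key=lambda row: row[1])
by_logloss = sorted(leaderboard, key=lambda row: row[2])

print("best by mean_per_class_error:", by_class_error[0][0])
print("best by logloss:", by_logloss[0][0])
```

So `XGBoost_2` leads on per-class error while `XGBoost_1` has the lowest logloss. `H2OAutoML` takes a `sort_metric` argument if a different ranking is wanted.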




In [10]:
aml.leader

Model Details
H2OXGBoostEstimator :  XGBoost
Model Key:  XGBoost_2_AutoML_20210312_134505


Model Summary: 


number_of_trees
86.0




ModelMetricsMultinomial: xgboost
** Reported on train data. **

MSE: 0.10001471880493466
RMSE: 0.3162510376345581
LogLoss: 0.3318647840761985
Mean Per-Class Error: 0.15686326561521877
AUC: NaN
AUCPR: NaN
Multinomial auc values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).
Multinomial auc_pr values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class


actual,1,2,3,Error,Rate
1,19751.0,5060.0,313.0,0.213859,"5,373 / 25,124"
2,1852.0,138149.0,8258.0,0.068191,"10,110 / 148,259"
3,392.0,16052.0,70774.0,0.188539,"16,444 / 87,218"
Total,21995.0,159261.0,79345.0,0.122513,"31,927 / 260,601"



Top-3 Hit Ratios: 


k,hit_ratio
1,0.877487
2,0.993243
3,1.0



ModelMetricsMultinomial: xgboost
** Reported on cross-validation data. **

MSE: 0.19091750254440085
RMSE: 0.4369410744532961
LogLoss: 0.5933645232749909
Mean Per-Class Error: 0.33161784939064404
AUC: NaN
AUCPR: NaN
Multinomial auc values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).
Multinomial auc_pr values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class


actual,1,2,3,Error,Rate
1,13306.0,11288.0,530.0,0.470387,"11,818 / 25,124"
2,6047.0,122858.0,19354.0,0.171329,"25,401 / 148,259"
3,640.0,30160.0,56418.0,0.353138,"30,800 / 87,218"
Total,19993.0,164306.0,76302.0,0.261008,"68,019 / 260,601"
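
The headline numbers can be re-derived from the confusion matrix itself. The sketch below recomputes the per-class errors, their mean, and the overall error from the cross-validation counts, then sets the result next to the training error from the first matrix. The function names are made up for illustration.

```python
# Cross-validation confusion matrix: rows are actual classes 1-3,
# columns are predicted classes 1-3.
cv_matrix = [
    [13306, 11288, 530],
    [6047, 122858, 19354],
    [640, 30160, 56418],
]


def per_class_errors(matrix):
    """Fraction of each actual class (row) that was predicted as another class."""
    return [1 - row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_error(matrix):
    """Fraction of all rows that fall off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(row[i] for i, row in enumerate(matrix))
    return 1 - correct / total


class_errors = per_class_errors(cv_matrix)
mean_per_class_error = sum(class_errors) / len(class_errors)
cv_error = overall_error(cv_matrix)

# Overall error on the training data, from the "Rate" cell of the first matrix.
train_error = 31927 / 260601

print([round(e, 6) for e in class_errors])
print(round(mean_per_class_error, 6), round(cv_error, 6), round(train_error, 6))
```

The mean of the three per-class errors reproduces the leaderboard's `mean_per_class_error`, and class 1 is clearly the hardest one, with almost half of its buildings misclassified. The cross-validation error is roughly twice the training error, which suggests the leader fits the training data a good deal more closely than it generalises.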



Top-3 Hit Ratios: 


k,hit_ratio
1,0.738992
2,0.976205
3,1.0



Cross-Validation Metrics Summary: 


metric,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
accuracy,0.7389918,0.0014019057,0.7395292,0.73983115,0.73986953,0.739198,0.7365311
auc,,0.0,,,,,
aucpr,,0.0,,,,,
err,0.2610082,0.0014019057,0.26047084,0.26016885,0.26013047,0.260802,0.26346892
err_count,13603.8,73.04245,13576.0,13560.0,13558.0,13593.0,13732.0
logloss,0.59336454,0.004686571,0.5912132,0.58741236,0.593184,0.5949129,0.6001002
max_per_class_error,0.47037503,0.003523275,0.47133884,0.46907422,0.46536663,0.47500506,0.4710904
mean_per_class_accuracy,0.6683833,0.0016719104,0.66782254,0.66941315,0.6707706,0.6670731,0.6668373
mean_per_class_error,0.3316167,0.0016719104,0.3321775,0.33058685,0.3292294,0.33292696,0.33316272
mse,0.1909175,0.0009746323,0.19076952,0.18961366,0.19067271,0.19122799,0.19230364



Scoring History: 


timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error,training_auc,training_pr_auc
2021-03-12 14:52:19,31 min 56.100 sec,0.0,0.666667,1.098612,0.66532,,
2021-03-12 14:52:54,32 min 30.931 sec,5.0,0.473117,0.644989,0.231515,,
2021-03-12 14:53:30,33 min 7.111 sec,10.0,0.422873,0.541277,0.216822,,
2021-03-12 14:54:07,33 min 43.773 sec,15.0,0.403373,0.50122,0.206427,,
2021-03-12 14:54:42,34 min 19.161 sec,20.0,0.39089,0.475699,0.196488,,
2021-03-12 14:55:18,34 min 54.920 sec,25.0,0.381928,0.45759,0.188303,,
2021-03-12 14:55:54,35 min 31.146 sec,30.0,0.374563,0.442596,0.18089,,
2021-03-12 14:56:31,36 min 7.933 sec,35.0,0.368254,0.429849,0.174596,,
2021-03-12 14:57:08,36 min 44.866 sec,40.0,0.362656,0.418552,0.168806,,
2021-03-12 14:57:46,37 min 22.654 sec,45.0,0.356107,0.405766,0.162371,,
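
The first row of the scoring history (zero trees) is a useful sanity check: with no trees the model has nothing to go on, so it should behave like a uniform guess over the three damage grades. The sketch below works out what logloss and RMSE such a guess would give. The RMSE line assumes H2O's multinomial RMSE is the root mean of `(1 - p_true_class)^2`; that is my reading, not something the output states.

```python
import math

n_classes = 3  # damage grades 1, 2 and 3

# A model that assigns probability 1/3 to every class.
p_true_class = 1 / n_classes

uniform_logloss = -math.log(p_true_class)
# Assumption: the reported RMSE is the root mean of (1 - p_true_class)^2.
uniform_rmse = math.sqrt((1 - p_true_class) ** 2)

print(round(uniform_logloss, 6), round(uniform_rmse, 6))
```

Both values agree with the `training_logloss` and `training_rmse` reported at zero trees, so everything after that row is improvement over a pure guess.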



Variable Importances: 


variable,relative_importance,scaled_importance,percentage
geo_level_3_id,85010.039062,1.0,0.184311
geo_level_1_id,83860.4375,0.986477,0.181819
geo_level_2_id,79510.476562,0.935307,0.172388
age,34692.371094,0.408097,0.075217
area_percentage,32236.933594,0.379213,0.069893
foundation_type.r,23109.724609,0.271847,0.050104
height_percentage,19411.345703,0.228342,0.042086
ground_floor_type.v,8914.84668,0.104868,0.019328
has_superstructure_mud_mortar_stone,8517.291016,0.100192,0.018466
count_floors_pre_eq,6927.010742,0.081485,0.015019



See the whole table with table.as_data_frame()
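
To read the importance table: `scaled_importance` looks like `relative_importance` divided by the largest value in the column, and `percentage` like each variable's share of the total over all variables (the second part is an assumption, since only the top ten rows are printed). The sketch below checks the first relationship on the top five rows and adds up the share of the three `geo_level_*` columns.

```python
# (variable, relative_importance) for the top rows of the importance table.
relative_importance = [
    ("geo_level_3_id", 85010.039062),
    ("geo_level_1_id", 83860.4375),
    ("geo_level_2_id", 79510.476562),
    ("age", 34692.371094),
    ("area_percentage", 32236.933594),
]

largest = max(value for _, value in relative_importance)
scaled = {name: value / largest for name, value in relative_importance}

for name, value in scaled.items():
    print(f"{name}: {value:.6f}")

# Share of total importance held by the three geo_level_* columns,
# summed from the "percentage" column of the table.
geo_share = 0.184311 + 0.181819 + 0.172388
print(round(geo_share, 6))
```

The three location columns together carry a bit more than half of the total importance, so the model leans mostly on where a building is rather than on how it was built.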




In [13]:
X_test = h2o.import_file(path=str(DOWNLOAD_DIR / 'test_values.csv'))

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [21]:
preds = aml.predict(X_test)['predict']

xgboost prediction progress: |████████████████████████████████████████████| 100%


In [31]:
building_id_df = h2o.as_list(X_test['building_id'])
preds_def = h2o.as_list(preds)
my_sub = pd.concat([building_id_df, preds_def], axis=1)
my_sub = my_sub.set_index('building_id')
title = SUBMISSIONS_DIR / '03-12 h2o AutoML - 3 models - seed=1 - no data preprocessing.csv'
my_sub.to_csv(title)
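
One thing to watch: the saved file keeps H2O's `predict` column name. If whatever scores the file expects the label column to be called `damage_grade`, as in `train_labels.csv` (an assumption, the notebook does not say what format is required), a rename before saving would take care of it. The sketch below uses small made-up frames in place of the `h2o.as_list` results and writes to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Toy stand-ins for h2o.as_list(X_test['building_id']) and h2o.as_list(preds).
building_id_df = pd.DataFrame({"building_id": [300051, 99355, 890251]})
preds_df = pd.DataFrame({"predict": [2, 2, 3]})

submission = (
    pd.concat([building_id_df, preds_df], axis=1)
    .rename(columns={"predict": "damage_grade"})  # assumed target column name
    .set_index("building_id")
)

# Written to a buffer here; a real run would pass a path ending in .csv.
buffer = io.StringIO()
submission.to_csv(buffer)
print(buffer.getvalue())
```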

In [30]:
my_sub

building_id,predict
300051,2
99355,2
890251,2
745817,2
421793,3
...,...
310028,2
663567,2
1049160,2
442785,2
