# AutoML with H2O.ai
This Jupyter notebook runs AutoML with the H2O.ai library. It is part of the **Modeling** component of the Klee project.

- H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time-limit.
- H2O’s AutoML is also useful for advanced users: a single wrapper function performs many modeling-related tasks that would otherwise take many lines of code, which frees time for other parts of the data science pipeline such as data preprocessing, feature engineering, and model deployment.

In [112]:
# import the libraries
import h2o
from h2o.automl import H2OAutoML  # used in the AutoML section
from h2o.sklearn import H2OAutoMLClassifier
import os
import pandas as pd

In [113]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "14.0.1" 2020-04-14; Java(TM) SE Runtime Environment (build 14.0.1+7); Java HotSpot(TM) 64-Bit Server VM (build 14.0.1+7, mixed mode, sharing)
  Starting server from /Users/ai/opt/anaconda3/lib/python3.8/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/9n/d4pb9v495qdct3c11c8_c1t80000gn/T/tmpukyzr18t
  JVM stdout: /var/folders/9n/d4pb9v495qdct3c11c8_c1t80000gn/T/tmpukyzr18t/h2o_ai_started_from_python.out
  JVM stderr: /var/folders/9n/d4pb9v495qdct3c11c8_c1t80000gn/T/tmpukyzr18t/h2o_ai_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,03 secs
H2O_cluster_timezone:,America/New_York
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.34.0.3
H2O_cluster_version_age:,"21 days, 3 hours and 54 minutes"
H2O_cluster_name:,H2O_from_python_ai_76jp08
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,2 Gb
H2O_cluster_total_cores:,4
H2O_cluster_allowed_cores:,4


# Load the dataset

In [114]:
# Use local data file or download from GitHub

data_path = "/Users/ai/Desktop/heart.csv"

pd.read_csv(data_path)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [115]:
df = h2o.import_file(data_path)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [116]:
# descriptive statistics for the dataset

df.describe()

Rows:303
Cols:14




Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
type,int,int,int,int,int,int,int,int,int,real,int,int,int,int
mins,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,54.36633663366333,0.6831683168316832,0.9669966996699675,131.62376237623772,246.26402640264035,0.1485148514851485,0.5280528052805283,149.6468646864687,0.32673267326732675,1.0396039603960392,1.3993399339933994,0.7293729372937293,2.313531353135314,0.5445544554455446
maxs,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0
sigma,9.08210098983786,0.46601082333962385,1.0320524894832983,17.538142813517098,51.83075098793005,0.35619787492797644,0.525859596359298,22.905161114914087,0.4697944645223165,1.1610750220686346,0.6162261453459621,1.0226063649693276,0.6122765072781408,0.49883478416439136
zeros,0,96,143,0,0,258,147,0,204,99,21,175,2,138
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,63.0,1.0,3.0,145.0,233.0,1.0,0.0,150.0,0.0,2.3,0.0,0.0,1.0,1.0
1,37.0,1.0,2.0,130.0,250.0,0.0,1.0,187.0,0.0,3.5,0.0,0.0,2.0,1.0
2,41.0,0.0,1.0,130.0,204.0,0.0,0.0,172.0,0.0,1.4,2.0,0.0,2.0,1.0


In [117]:
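# define the response column; all other columns are used as predictors by default
y = "target"
# convert the response to a factor so H2O treats the task as classification
df[y] = df[y].asfactor()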

In [118]:
df[y].levels()

[['0', '1']]

In [119]:
# split the data into training and test sets (approximately 80/20)

splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]
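
`split_frame` assigns rows to splits with a random draw per row instead of cutting at an exact index, so `ratios = [0.8]` gives approximately, not exactly, 80% of the 303 rows to `train`. The pure-Python sketch below only illustrates that idea. `probabilistic_split` is a hypothetical helper, not H2O's implementation, and its row counts will not match H2O's for the same seed.

```python
import random


def probabilistic_split(n_rows, ratio, seed):
    """Assign each row index to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in range(n_rows):
        # Each row lands in train with probability `ratio`.
        (train if rng.random() < ratio else test).append(row)
    return train, test


train_rows, test_rows = probabilistic_split(n_rows=303, ratio=0.8, seed=1)
# The sizes add up to 303, but the train share is only close to 80%.
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, which is why the notebook passes `seed = 1`.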

In [120]:
splits

age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
44,1,1,120,263,0,1,173,0,0.0,2,0,3,1
52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
54,1,0,140,239,0,1,160,0,1.2,2,0,2,1
48,0,2,130,275,0,1,139,0,0.2,2,0,2,1


age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
57,1,2,150,168,0,1,174,0,1.6,2,0,2,1
59,1,0,135,234,0,1,161,0,0.5,1,0,3,1
54,1,2,150,232,0,0,165,0,1.6,2,0,3,1
65,0,2,160,360,0,0,151,0,0.8,2,0,2,1
66,1,0,120,302,0,0,151,0,0.4,1,0,2,1
52,1,1,134,201,0,1,158,0,0.8,2,1,2,1
41,1,1,135,203,0,1,132,0,0.0,1,0,1,1
58,1,2,140,211,1,0,165,0,0.0,2,0,2,1
29,1,1,130,204,0,0,202,0,0.0,2,0,2,1



In [111]:
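# Shut down the cluster only when you are finished. Running this cell before the
# AutoML cells ends the session, and h2o.init() must be called again to continue.
h2o.cluster().shutdown()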

H2O session _sid_a731 closed.


# AutoML
- Run AutoML, stopping after 60 seconds. The **max_runtime_secs** argument limits the AutoML run by time. With a time-based stopping criterion, the number of models trained varies between runs: on different hardware, or on the same machine with different available compute resources, AutoML may train more models in one run than in another.

- The test frame is passed explicitly to the leaderboard_frame argument here, which means that instead of using cross-validated metrics, we use test set metrics for generating the leaderboard.
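
The effect of a time budget on the number of trained models can be sketched without H2O. The example below uses a simulated clock so that it is deterministic. `train_within_budget`, the candidate names, and the per-model costs are all assumptions made for illustration and do not reflect how H2O schedules its models.

```python
def train_within_budget(candidates, budget_secs, cost_per_model_secs):
    """Return the candidates that finish before the time budget runs out.

    The clock is simulated; a real run would measure wall-clock time.
    """
    elapsed = 0.0
    trained = []
    for name in candidates:
        if elapsed + cost_per_model_secs > budget_secs:
            break  # budget exhausted: the remaining candidates are skipped
        elapsed += cost_per_model_secs
        trained.append(name)
    return trained


# Hypothetical candidate queue, named only for illustration.
queue = ["XGBoost_1", "GLM_1", "GBM_1", "XGBoost_2", "DRF_1", "GBM_2"]

fast_machine = train_within_budget(queue, budget_secs=60, cost_per_model_secs=10)
slow_machine = train_within_budget(queue, budget_secs=60, cost_per_model_secs=25)
print(len(fast_machine), len(slow_machine))
```

With the same 60-second budget, the faster simulated machine finishes all six candidates and the slower one only two. When the same set of models is needed on every run, `H2OAutoML` also accepts a `max_models` argument that bounds the run by model count instead of time.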

In [121]:
aml = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "Testing_frame")
# train() returns the leader model; it is kept in `a` for the variable importance cell
a = aml.train(y = y, training_frame = train, leaderboard_frame = test)

AutoML progress: |█████████
15:40:05.293: GBM_1_AutoML_1_20211028_153953 [GBM def_5] failed: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_1_AutoML_1_20211028_153953.  Details: ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 196.0.
ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 197.0.
ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 197.0.
ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 197.0.
ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 197.0.


██████████████████████████████

## Leaderboard
- View the AutoML Leaderboard. Since we specified a leaderboard_frame in the H2OAutoML.train() method for scoring and ranking the models, the AutoML leaderboard uses the performance on this data to rank the models.

- A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally, and the leaderboard is sorted by that metric. For binary classification, as in this notebook, the default ranking metric is AUC; for regression it is mean residual deviance. The `sort_metric` argument of `H2OAutoML` selects a different ranking metric.

In [122]:
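# variable importance of the leader model; the four columns are
# variable, relative_importance, scaled_importance, percentage
pd.DataFrame(a.varimp())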

Unnamed: 0,0,1,2,3
0,thal,58.004417,1.0,0.248755
1,cp,56.33604,0.971237,0.2416
2,ca,33.590401,0.579101,0.144054
3,oldpeak,33.516792,0.577832,0.143739
4,thalach,15.431341,0.266037,0.066178
5,age,12.437601,0.214425,0.053339
6,restecg,7.164467,0.123516,0.030725
7,chol,6.665822,0.114919,0.028587
8,trestbps,5.829804,0.100506,0.025001
9,slope,4.201881,0.072441,0.01802
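
The unnamed columns 0 to 3 of this table hold the variable name, relative importance, scaled importance, and percentage. As a quick check, the sketch below recomputes the scaled importance from the relative importance for the first four rows, using values copied from the table. The percentage column divides by the sum over all predictors, which cannot be recomputed here because only ten rows are shown.

```python
# Relative importances copied from the first four rows of the table above.
relative = {
    "thal": 58.004417,
    "cp": 56.33604,
    "ca": 33.590401,
    "oldpeak": 33.516792,
}

# Scaled importance divides each value by the largest one.
largest = max(relative.values())
scaled = {name: value / largest for name, value in relative.items()}

for name, value in scaled.items():
    print(f"{name}: {value:.6f}")
```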


In [123]:
aml.leaderboard.as_data_frame()

Unnamed: 0,model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse
0,XGBoost_1_AutoML_1_20211028_153953,0.934243,0.350666,0.929068,0.148263,0.334331,0.111777
1,GBM_3_AutoML_1_20211028_153953,0.92928,0.357638,0.905499,0.128412,0.33494,0.112185
2,XGBoost_grid_1_AutoML_1_20211028_153953_model_1,0.923697,0.438773,0.924328,0.144541,0.368744,0.135972
3,GBM_2_AutoML_1_20211028_153953,0.923077,0.365645,0.890528,0.119107,0.329324,0.108454
4,XRT_1_AutoML_1_20211028_153953,0.919975,0.382773,0.908195,0.148263,0.348381,0.121369
5,DRF_1_AutoML_1_20211028_153953,0.918114,0.378989,0.890504,0.148263,0.344504,0.118683
6,GBM_4_AutoML_1_20211028_153953,0.916873,0.378097,0.884919,0.12531,0.346682,0.120188
7,XGBoost_2_AutoML_1_20211028_153953,0.914392,0.366674,0.898191,0.151365,0.344146,0.118436
8,StackedEnsemble_AllModels_1_AutoML_1_20211028_153953,0.911911,0.376777,0.887712,0.132134,0.348733,0.121615
9,StackedEnsemble_AllModels_2_AutoML_1_20211028_153953,0.908189,0.376905,0.881952,0.132134,0.349131,0.121893
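
The ranking depends on which metric is used. The sketch below re-ranks the first four leaderboard rows, with values copied from the table above, by two different metrics in plain Python. `rank` and `COLUMNS` are hypothetical names introduced for this illustration. In H2O itself, the ranking metric is chosen with the `sort_metric` argument of `H2OAutoML`.

```python
# First four leaderboard rows copied from the table above:
# (model_id, auc, logloss, mean_per_class_error)
rows = [
    ("XGBoost_1_AutoML_1_20211028_153953", 0.934243, 0.350666, 0.148263),
    ("GBM_3_AutoML_1_20211028_153953", 0.92928, 0.357638, 0.128412),
    ("XGBoost_grid_1_AutoML_1_20211028_153953_model_1", 0.923697, 0.438773, 0.144541),
    ("GBM_2_AutoML_1_20211028_153953", 0.923077, 0.365645, 0.119107),
]
COLUMNS = {"auc": 1, "logloss": 2, "mean_per_class_error": 3}


def rank(rows, metric, higher_is_better):
    """Return model ids ordered from best to worst on `metric`."""
    index = COLUMNS[metric]
    ordered = sorted(rows, key=lambda row: row[index], reverse=higher_is_better)
    return [row[0] for row in ordered]


by_auc = rank(rows, "auc", higher_is_better=True)
by_error = rank(rows, "mean_per_class_error", higher_is_better=False)
print(by_auc[0])    # best of the four by AUC
print(by_error[0])  # best of the four by mean per-class error
```

Among these four rows, XGBoost_1 is best by AUC while GBM_2 has the lowest mean per-class error, so the choice of metric changes which model leads.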


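Stacked ensembles often rank at or near the top because they combine several models and tend to outperform any single one. In this run, however, XGBoost_1 leads the leaderboard and the two stacked ensembles rank ninth and tenth among the rows shown.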

## Testing the model

In [124]:
perf = aml.leader.model_performance(test)
perf

H2OConnectionError: Local server has died unexpectedly. RIP.

In [28]:
# Get model ids for all models in the AutoML Leaderboard
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
# Get the "All Models" Stacked Ensemble model
se = h2o.get_model([mid for mid in model_ids if "StackedEnsemble_AllModels" in mid][0])
# Get the Stacked Ensemble metalearner model
metalearner = se.metalearner()

In [29]:
metalearner.coef_norm()

{'Intercept': 0.5650406504065044,
 'GBM_2_AutoML_1_20211028_125422': 0.0,
 'GBM_3_AutoML_1_20211028_125422': 0.0,
 'GBM_4_AutoML_1_20211028_125422': 0.060827910095249106,
 'DRF_1_AutoML_1_20211028_125422': 0.06979153832718642,
 'GLM_1_AutoML_1_20211028_125422': 0.1861497773446067,
 'XGBoost_2_AutoML_1_20211028_125422': 0.0,
 'XGBoost_1_AutoML_1_20211028_125422': 0.043722748151260725}
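
A coefficient of 0.0 means the metalearner gives that base model no weight in the ensemble. The sketch below takes the dictionary printed above, drops the intercept and the zero entries, and orders the rest by weight. It is plain Python over the copied values and does not call H2O.

```python
# Standardized metalearner coefficients copied from the output above.
coefs = {
    "Intercept": 0.5650406504065044,
    "GBM_2_AutoML_1_20211028_125422": 0.0,
    "GBM_3_AutoML_1_20211028_125422": 0.0,
    "GBM_4_AutoML_1_20211028_125422": 0.060827910095249106,
    "DRF_1_AutoML_1_20211028_125422": 0.06979153832718642,
    "GLM_1_AutoML_1_20211028_125422": 0.1861497773446067,
    "XGBoost_2_AutoML_1_20211028_125422": 0.0,
    "XGBoost_1_AutoML_1_20211028_125422": 0.043722748151260725,
}

# Keep only base models with a non-zero weight, largest weight first.
contributing = {
    name: weight
    for name, weight in coefs.items()
    if name != "Intercept" and weight != 0.0
}
ranked = sorted(contributing.items(), key=lambda item: item[1], reverse=True)

for name, weight in ranked:
    print(f"{name}: {weight:.4f}")
```

In this output, four of the seven base models contribute, and the GLM carries the largest weight.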

## Save the best Model

In [27]:
h2o.save_model(aml.leader, path = "./BestModel")

'C:\\Users\\Abhishek\\Desktop\\Kinesso\\BestModel\\StackedEnsemble_BestOfFamily_2_AutoML_1_20211013_181100'
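
A saved model can be loaded back with `h2o.load_model`, using the path that `h2o.save_model` returns. The sketch below wraps both calls. `save_and_reload` is a hypothetical helper name, and it assumes a running H2O cluster of the same H2O version that saved the model.

```python
def save_and_reload(model, directory="./BestModel"):
    """Save an H2O model under `directory` and load it back from the returned path."""
    import h2o  # imported lazily; the calls below need a running H2O cluster

    # force=True overwrites an existing file with the same model id.
    saved_path = h2o.save_model(model=model, path=directory, force=True)
    return h2o.load_model(saved_path)
```

In this notebook, it would be called as `reloaded = save_and_reload(aml.leader)`, after which `reloaded.predict(test)` scores the test frame.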