# AutoML with H2O.ai
This Jupyter notebook runs AutoML using the H2O.ai library. It is part of the **Modeling** component (node) of the Klee project.

- H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time-limit.
- H2O’s AutoML can also be a helpful tool for the advanced user, by providing a simple wrapper function that performs a large number of modeling-related tasks that would typically require many lines of code, and by freeing up their time to focus on other aspects of the data science pipeline tasks such as data-preprocessing, feature engineering and model deployment.

##### In this notebook, a stacked ensemble from H2O.ai ranks first on the leaderboard; it combines the predictions of several of the models tested.

##### Limitation:
The H2O.ai library can test a lot of model types, but because of the limited computational power of my system, this run tests mostly tree-based models.

### Libraries used
- We use H2O.ai (the `h2o` Python package) to train, tune, and rank candidate models for the dataset.

#### Input:
The input to this notebook is a tabular dataset, e.g. the Titanic dataset used here.

#### Output:
The output of this notebook is the best model for the dataset, as ranked by the leaderboard metric.

In [1]:
# import the library

import h2o
from h2o.automl import H2OAutoML
import os
import pandas as pd

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_201"; Java(TM) SE Runtime Environment (build 1.8.0_201-b09); Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
  Starting server from /Users/bear/.local/lib/python3.8/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/lh/42j8mfjx069d1bkc2wlf2pw40000gn/T/tmpcvvzgi9s
  JVM stdout: /var/folders/lh/42j8mfjx069d1bkc2wlf2pw40000gn/T/tmpcvvzgi9s/h2o_bear_started_from_python.out
  JVM stderr: /var/folders/lh/42j8mfjx069d1bkc2wlf2pw40000gn/T/tmpcvvzgi9s/h2o_bear_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,03 secs
H2O_cluster_timezone:,America/New_York
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.34.0.7
H2O_cluster_version_age:,"21 days, 10 hours and 3 minutes"
H2O_cluster_name:,H2O_from_python_bear_5kyi1o
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.556 Gb
H2O_cluster_total_cores:,8
H2O_cluster_allowed_cores:,8


# Load the dataset

In [4]:
# Use local data file or download from GitHub
url = "https://github.com/nikbearbrown/Visual_Analytics/raw/main/CSV/titanic.csv"
df = h2o.import_file(url)
df.head()

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911.0,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272.0,7.0,,S
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276.0,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154.0,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101300.0,12.2875,,S
897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538.0,9.225,,S
898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972.0,7.6292,,Q
899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738.0,29.0,,S
900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657.0,7.2292,,C
901,3,"Davies, Mr. John Samuel",male,21.0,2,0,,24.15,,S




In [6]:
# descriptive statistics for the dataset

df.describe()

Rows:418
Cols:11




Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
type,int,int,string,enum,real,int,int,int,real,enum,enum
mins,892.0,1.0,,,0.17,0.0,0.0,680.0,0.0,,
mean,1100.5,2.2655502392344493,,,30.272590361445786,0.44736842105263147,0.39234449760765555,223850.98986486494,35.627188489208656,,
maxs,1309.0,3.0,,,76.0,8.0,9.0,3101298.0,512.3292,,
sigma,120.81045760473994,0.8418375519640504,,,14.181209235624422,0.8967595611217135,0.9814288785371694,369523.7764694362,55.907576179973844,,
zeros,0,0,0,,0,283,324,0,2,,
missing,0,0,0,0,86,0,0,122,1,327,0
0,892.0,3.0,"Kelly, Mr. James",male,34.5,0.0,0.0,330911.0,7.8292,,Q
1,893.0,3.0,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1.0,0.0,363272.0,7.0,,S
2,894.0,2.0,"Myles, Mr. Thomas Francis",male,62.0,0.0,0.0,240276.0,9.6875,,Q


In [7]:
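# define the response (dependent) variable; all remaining columns are used as predictors
# note: Fare has one missing value (see the describe() output above)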
y = "Fare"

In [8]:
# split the data into training (about 80%) and test (about 20%) frames

splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]

In [9]:
splits

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911.0,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272.0,7.0,,S
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154.0,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101300.0,12.2875,,S
897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538.0,9.225,,S
898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972.0,7.6292,,Q
899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738.0,29.0,,S
900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657.0,7.2292,,C
902,3,"Ilieff, Mr. Ylio",male,,0,0,349220.0,7.8958,,S
903,1,"Jones, Mr. Charles Cresson",male,46.0,0,0,694.0,26.0,,S


PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276.0,9.6875,,Q
901,3,"Davies, Mr. John Samuel",male,21.0,2,0,,24.15,,S
912,1,"Rothschild, Mr. Martin",male,55.0,1,0,,59.4,,C
929,3,"Cacic, Miss. Manda",female,21.0,0,0,315087.0,8.6625,,S
931,3,"Hee, Mr. Ling",male,,0,0,1601.0,56.4958,,S
943,2,"Pulbaum, Mr. Franz",male,27.0,0,0,,15.0333,,C
947,3,"Rice, Master. Albert",male,10.0,4,1,382652.0,29.125,,Q
955,3,"Bradley, Miss. Bridget Delia",female,22.0,0,0,334914.0,7.725,,Q
956,1,"Ryerson, Master. John Borie",male,13.0,2,2,,262.375,B57 B59 B63 B66,C
964,3,"Nieminen, Miss. Manta Josefina",female,29.0,0,0,3101300.0,7.925,,S



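The H2O documentation describes `split_frame` as a probabilistic split, so `ratios = [0.8]` yields approximately, not exactly, 80% of the rows in the training frame. The sketch below illustrates that idea in plain Python. It is not H2O's implementation: `probabilistic_split` is a hypothetical helper, and the ids are simply the `PassengerId` range shown in the output above.

```python
import random

def probabilistic_split(rows, train_ratio=0.8, seed=1):
    """Send each row to train with probability train_ratio, independently."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row is assigned on its own draw, so the sizes are only approximate
        (train if rng.random() < train_ratio else test).append(row)
    return train, test

# 418 ids, matching the PassengerId range 892..1309 in this dataset
passenger_ids = list(range(892, 1310))
train_ids, test_ids = probabilistic_split(passenger_ids)
print(len(train_ids), len(test_ids))
```

Because every row is assigned independently, the two sizes always add up to 418 but the train share only hovers around 80%. Fixing the seed makes the split repeatable.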
# AutoML
- Run AutoML, stopping after 60 seconds. The **max_runtime_secs** argument limits the AutoML run by time. With a time-limited stopping criterion, the number of models trained will vary between runs. If different hardware is used, or if the same machine has different compute resources available between runs, AutoML may be able to train more models on one run than on another.

- The test frame is passed explicitly to the leaderboard_frame argument here, which means that instead of using cross-validated metrics, we use test set metrics for generating the leaderboard.

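To see why a time budget makes the number of trained models depend on the hardware, here is a toy calculation. It is only an illustration under made-up assumptions, not a model of H2O's scheduler: `models_finished`, the per-model costs, and the `speed` factor are all hypothetical.

```python
def models_finished(train_costs_secs, budget_secs, speed=1.0):
    """Count how many models finish, in order, before the budget runs out.

    speed > 1 stands for faster hardware: each model takes cost / speed seconds.
    """
    elapsed = 0.0
    finished = 0
    for cost in train_costs_secs:
        elapsed += cost / speed
        if elapsed > budget_secs:
            break  # this model would not finish inside the budget
        finished += 1
    return finished

# Hypothetical per-model training times in seconds
costs = [5, 8, 12, 15, 18, 25, 30]

print(models_finished(costs, budget_secs=60))             # 5
print(models_finished(costs, budget_secs=60, speed=2.0))  # 7
```

With the same 60-second budget, the faster machine completes all seven hypothetical models while the slower one stops after five, so the two leaderboards would not contain the same candidates.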
In [10]:
aml = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "Testing_frame")
aml.train(y = y, training_frame = train, leaderboard_frame = test)

AutoML progress: |
01:02:02.338: _train param, Dropping bad and constant columns: [Name]


01:02:03.348: XGBoost_1_AutoML_1_20220112_10202 [XGBoost def_2] failed: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for XGBoost model: XGBoost_1_AutoML_1_20220112_10202_cv_1.  Details: ERRR on field: _response_column: Response contains missing values (NAs) - not supported by XGBoost.

01:02:03.353: _train param, Dropping bad and constant columns: [Name]

█
01:02:04.386: _train param, Dropping bad and constant columns: [Name]

█
01:02:05.403: _train param, Dropping unused columns: [Name]

█
01:02:06.413: _train param, Dropping bad and constant columns: [Name]

█
01:02:07.418: XGBoost_2_AutoML_1_20220112_10202 [XGBoost def_1] failed: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for XGBoost model: XGBoost_2_AutoML_1_20220112_10202_cv_1.  Details: ERRR on field: _response_column: Response contains missing values (NAs) - not supported 



## Leaderboard
- View the AutoML Leaderboard. Since we specified a leaderboard_frame in the H2OAutoML.train() method for scoring and ranking the models, the AutoML leaderboard uses the performance on this data to rank the models.

- A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally, and the leaderboard is sorted by that metric. For regression, the default ranking metric is mean residual deviance. A different ranking metric can be requested through the `sort_metric` argument of `H2OAutoML`.

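For reference, the regression columns on the leaderboard can be written out in a few lines of plain Python. This is a minimal sketch using the standard textbook definitions, with RMSLE computed on `log(1 + x)`; it is not H2O's code, and the fares in the example are made up. In the leaderboard of this run the `mean_residual_deviance` and `mse` columns hold the same values, which is what one expects for a squared-error (Gaussian) deviance.

```python
import math

def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and RMSLE for two equal-length sequences."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # RMSLE compares log(1 + value), so it needs non-negative inputs
    log_errors = [math.log1p(p) - math.log1p(a) for a, p in zip(actual, predicted)]
    rmsle = math.sqrt(sum(e * e for e in log_errors) / n)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "rmsle": rmsle}

# Hypothetical fares and predictions, for illustration only
actual = [7.0, 9.0, 29.0]
predicted = [8.0, 9.0, 25.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE and MSE punish large misses heavily, MAE weighs all misses linearly, and RMSLE measures relative error, so a model can rank differently depending on which column is used.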
In [11]:
aml.leaderboard.head()

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
StackedEnsemble_BestOfFamily_2_AutoML_1_20220112_10202,848.854,29.1351,848.854,15.2723,0.700064
GBM_4_AutoML_1_20220112_10202,920.86,30.3457,920.86,13.2641,0.470033
StackedEnsemble_AllModels_1_AutoML_1_20220112_10202,927.818,30.4601,927.818,15.9489,0.720741
GBM_grid_1_AutoML_1_20220112_10202_model_9,978.155,31.2755,978.155,15.1538,0.505271
StackedEnsemble_BestOfFamily_3_AutoML_1_20220112_10202,985.23,31.3884,985.23,16.0655,0.606395
StackedEnsemble_BestOfFamily_4_AutoML_1_20220112_10202,992.816,31.509,992.816,15.1833,0.517876
DRF_1_AutoML_1_20220112_10202,998.256,31.5952,998.256,13.5931,0.458207
GBM_grid_1_AutoML_1_20220112_10202_model_10,1019.92,31.9362,1019.92,14.6545,0.537303
GBM_grid_1_AutoML_1_20220112_10202_model_17,1020.91,31.9516,1020.91,14.5714,0.515761
StackedEnsemble_AllModels_2_AutoML_1_20220112_10202,1030.0,32.0936,1030.0,16.49,0.640455




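Stacked ensembles usually appear near the top of the leaderboard, because they combine the predictions of several base models. Here the leader is a BestOfFamily ensemble when ranked by mean residual deviance, although single models (GBM_4 and DRF_1) have lower MAE and RMSLE.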

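The ranking depends on the metric used. The sketch below re-sorts the leaderboard rows printed above by each metric in plain Python; the values are copied from that output, and the common `_AutoML_1_20220112_10202` part of each model id is left out for brevity. `best_by` is a hypothetical helper.

```python
# (model, mean_residual_deviance, rmse, mae, rmsle) copied from the leaderboard above
leaderboard = [
    ("StackedEnsemble_BestOfFamily_2", 848.854, 29.1351, 15.2723, 0.700064),
    ("GBM_4", 920.86, 30.3457, 13.2641, 0.470033),
    ("StackedEnsemble_AllModels_1", 927.818, 30.4601, 15.9489, 0.720741),
    ("GBM_grid_1_model_9", 978.155, 31.2755, 15.1538, 0.505271),
    ("StackedEnsemble_BestOfFamily_3", 985.23, 31.3884, 16.0655, 0.606395),
    ("StackedEnsemble_BestOfFamily_4", 992.816, 31.509, 15.1833, 0.517876),
    ("DRF_1", 998.256, 31.5952, 13.5931, 0.458207),
    ("GBM_grid_1_model_10", 1019.92, 31.9362, 14.6545, 0.537303),
    ("GBM_grid_1_model_17", 1020.91, 31.9516, 14.5714, 0.515761),
    ("StackedEnsemble_AllModels_2", 1030.0, 32.0936, 16.49, 0.640455),
]
columns = {"mean_residual_deviance": 1, "rmse": 2, "mae": 3, "rmsle": 4}

def best_by(metric):
    """Return the model name with the lowest value of the given metric."""
    index = columns[metric]
    return min(leaderboard, key=lambda row: row[index])[0]

for metric in columns:
    print(metric, "->", best_by(metric))
```

On these ten rows the ensemble leads on mean residual deviance and RMSE, GBM_4 leads on MAE, and DRF_1 leads on RMSLE, so "best model" should always be read together with the metric it was ranked by.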
## Testing of model

In [12]:
perf = aml.leader.model_performance(test)
perf


ModelMetricsRegressionGLM: stackedensemble
** Reported on test data. **

MSE: 848.8542772637696
RMSE: 29.135103865676705
MAE: 15.272297474451399
RMSLE: 0.7000639680799781
R^2: 0.7062323746863823
Mean Residual Deviance: 848.8542772637696
Null degrees of freedom: 83
Residual degrees of freedom: 80
Null deviance: 242885.75655845192
Residual deviance: 71303.75929015665
AIC: 814.8682261938338

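The numbers in this report are related to each other, and a quick check in plain Python makes the relationships explicit. The values are copied from the output above; reading the residual deviance as a sum of squared errors over the scored rows is an assumption based on the squared-error deviance, not something the report states.

```python
import math

# Values copied from the model_performance output above
mse = 848.8542772637696
rmse = 29.135103865676705
mean_residual_deviance = 848.8542772637696
residual_deviance = 71303.75929015665
null_degrees_of_freedom = 83

# RMSE is the square root of MSE
print(math.isclose(math.sqrt(mse), rmse))

# Total deviance divided by mean deviance gives the number of scored rows,
# which should be one more than the null degrees of freedom
scored_rows = residual_deviance / mean_residual_deviance
print(round(scored_rows), null_degrees_of_freedom + 1)
```

Both routes point to 84 scored test rows, roughly 20% of the 418 rows in the dataset, which is consistent with the 80/20 split requested earlier.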



In [13]:
# Get model ids for all models in the AutoML Leaderboard
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
# Get the "All Models" Stacked Ensemble model
se = h2o.get_model([mid for mid in model_ids if "StackedEnsemble_AllModels" in mid][0])
# Get the Stacked Ensemble metalearner model
metalearner = se.metalearner()

In [14]:
metalearner.coef_norm()

{'Intercept': 35.39018375680689,
 'GBM_4_AutoML_1_20220112_10202': 0.0,
 'DRF_1_AutoML_1_20220112_10202': 5.526465201060359,
 'GBM_3_AutoML_1_20220112_10202': 28.591036886229674,
 'GBM_2_AutoML_1_20220112_10202': 0.0,
 'GLM_1_AutoML_1_20220112_10202': 9.640862074758312,
 'GBM_1_AutoML_1_20220112_10202': 0.0}
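
One way to read these standardized coefficients is to rank the base models by coefficient magnitude and list the ones the metalearner set to zero. The sketch below does this in plain Python on the dictionary printed above; it is only an aid to interpretation, and the variable names are hypothetical.

```python
# Standardized metalearner coefficients copied from the output above
coef_norm = {
    "Intercept": 35.39018375680689,
    "GBM_4_AutoML_1_20220112_10202": 0.0,
    "DRF_1_AutoML_1_20220112_10202": 5.526465201060359,
    "GBM_3_AutoML_1_20220112_10202": 28.591036886229674,
    "GBM_2_AutoML_1_20220112_10202": 0.0,
    "GLM_1_AutoML_1_20220112_10202": 9.640862074758312,
    "GBM_1_AutoML_1_20220112_10202": 0.0,
}

# Leave out the intercept, then sort base models by absolute coefficient
base_models = {name: c for name, c in coef_norm.items() if name != "Intercept"}
ranked = sorted(base_models.items(), key=lambda item: abs(item[1]), reverse=True)

active = [name for name, c in ranked if c != 0.0]
zeroed = [name for name, c in ranked if c == 0.0]

print("Used by the metalearner:", active)
print("Zero coefficient:", zeroed)
```

In this run the "All Models" ensemble leans mostly on GBM_3, with smaller contributions from GLM_1 and DRF_1. A zero coefficient does not by itself mean a base model is weak: GBM_4 ranks second on the leaderboard, so its zero weight more likely indicates that its predictions add little once the other base models are included.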