# (Very) Basic H2O Demonstration
## Purpose
This notebook is my first test of the H2O Python API, which is used to interact with an H2O server for automated machine learning (AutoML). The goal is to understand the approach and the functions involved. No attempt is made at this stage to check the results of the generated models: the main goal is to have everything running without errors.

Much of the information and code here is taken from the [H2O online documentation](https://h2o-release.s3.amazonaws.com/h2o/rel-3.46.0/4/docs-website/h2o-py/docs/intro.html#installing-h2o-3). The main difference is that the dataset is downloaded with HuggingFace `datasets` and prepared with pandas before being converted into an H2O frame. Note that H2O can also load datasets that are already in a suitable format using its own import functions.

## Preparing the environment
The following commands are used to prepare the environment to run this notebook.
```bash
conda create --name h2o python=3.10
conda activate h2o
pip install https://h2o-release.s3.amazonaws.com/h2o/rel-3.46.0/4/Python/h2o-3.46.0.4-py2.py3-none-any.whl
pip install datasets
```

Any additional ML and data manipulation libraries can of course be used too. Please note that some compatibility issues between H2O and other libraries have been observed. Read the output of the `pip install` commands for additional information, and adjust package versions if necessary.

H2O requires Java to run. Please install a Java runtime that meets the Java requirements listed on the [H2O website](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html#java-requirements).
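
Before starting H2O, it can help to confirm that a Java runtime is visible from Python. The sketch below is only a convenience check built on the standard library. The helper name `find_java` is hypothetical and not part of H2O, and the check assumes that `java` is on the `PATH` and reports its version when called with `-version`.

```python
import shutil
import subprocess


def find_java():
    """Return the first line of `java -version`, or None if java is not on PATH."""
    java_path = shutil.which("java")
    if java_path is None:
        return None
    try:
        result = subprocess.run(
            [java_path, "-version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        # The executable exists but could not be queried; report its path only
        return java_path
    # Many JVMs print the version banner to stderr, so check it first
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else java_path


java_info = find_java()
print(java_info if java_info else "No Java runtime found on PATH")
```

This only confirms that some Java executable is reachable. Whether its version is supported still has to be checked against the H2O requirements page.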

In [72]:
import numpy as np
import h2o
from h2o.automl import H2OAutoML
from h2o.frame import H2OFrame
import datasets
import pandas as pd

## Starting H2O and Inspecting the Cluster
There are many tools for directly interacting with user-visible objects in the H2O cluster. Every new Python session begins by initializing a connection between the Python client and the H2O cluster. Note that `h2o.init()` accepts a number of arguments, which are described in the `h2o.init` section of the H2O documentation.

In [73]:
h2o.init()

# After making a successful connection, you can obtain a high-level summary of the cluster status:
h2o.cluster_info()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.371-b11, mixed mode)
  Starting server from C:\Users\User\anaconda3\envs\h2o\Lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\User\AppData\Local\Temp\tmpj8xlsr5c
  JVM stdout: C:\Users\User\AppData\Local\Temp\tmpj8xlsr5c\h2o_BC_started_from_python.out
  JVM stderr: C:\Users\User\AppData\Local\Temp\tmpj8xlsr5c\h2o_BC_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,01 secs
H2O_cluster_timezone:,Australia/Sydney
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.46.0.4
H2O_cluster_version_age:,16 days
H2O_cluster_name:,H2O_from_python_BC_71x5j9
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.531 Gb
H2O_cluster_total_cores:,12
H2O_cluster_allowed_cores:,12




In [74]:
# To list the current contents of the H2O cluster, you can use the h2o.ls command:
h2o.ls()




Unnamed: 0,key


## AutoML with H2O
Source: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html

## Load and prepare dataset

### Load Palmer penguins dataset from HuggingFace

In [75]:
# Load the Palmer penguins dataset from the HuggingFace Hub (only a 'train' split is requested)
dataset = datasets.load_dataset("SIH/palmer-penguins", split='train')

### Manually split the dataset
Note that a function such as scikit-learn's `train_test_split` could of course be used here instead. It is, however, important to keep the label attached to the input DataFrame prior to converting it to an H2O frame.

In [64]:
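dataset_df = pd.DataFrame(dataset)

# Encode 'sex' as a binary label: 1 for 'male', 0 for any other value
# (this includes missing values, which are therefore treated as 0)
dataset_df['sex'] = dataset_df['sex'].apply(lambda x: 1 if x=='male' else 0)
dataset_df = dataset_df.rename(columns={'sex': 'label'})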

# Identify predictors and response
x = list(dataset_df.columns)
y = "label"
x.remove(y)

# Training dataset is 80% of full dataset
train_df = dataset_df.sample(int(.8*len(dataset_df)), random_state=42)

# Test dataset is all rows (index) not in the train dataset
test_df = dataset_df.loc[~dataset_df.index.isin(train_df.index)]
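
The encoding above maps every value other than `'male'` to 0, so a missing `sex` value would silently become a 0 label. The following is a minimal sketch of a stricter alternative, assuming that rows without a known label should be dropped. The toy DataFrame is made up for illustration and is not the penguins data.

```python
import pandas as pd

# Hypothetical toy data standing in for the penguins DataFrame
toy_df = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7, 39.3, 38.9],
    "sex": ["male", "female", None, "female", "male", "female"],
})

# Map only the known categories; anything else (including None) becomes NaN
explicit = toy_df["sex"].map({"male": 1, "female": 0})

# Keep the label attached to the features, then drop rows with no label
encoded = toy_df.assign(label=explicit).drop(columns="sex").dropna(subset=["label"])
encoded["label"] = encoded["label"].astype(int)

# 80/20 split: sample the training rows, the remaining rows form the test set
train_df = encoded.sample(frac=0.8, random_state=42)
test_df = encoded.drop(train_df.index)

print(len(train_df), len(test_df))
```

Whether dropping unlabeled rows is appropriate depends on the dataset. The point here is only that the choice is made explicitly instead of by default.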

### Convert the pandas DataFrames into H2O frames
DataFrames can easily be converted into H2O frames using `h2o.H2OFrame`. The constructor also accepts dictionaries and numpy arrays, but we have experienced issues when passing a `pd.Series`.

In [67]:
# Convert to H2O frames
train = H2OFrame(train_df)
test = H2OFrame(test_df)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
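
One possible workaround for the `pd.Series` issue mentioned above is to wrap the Series in a one-column DataFrame before the conversion. The helper below is a pandas-only sketch. The name `as_dataframe` and the fallback column name are assumptions made for illustration, and the final call to `H2OFrame` is left as a comment because it needs a running cluster.

```python
import pandas as pd


def as_dataframe(obj, name="value"):
    """Return a DataFrame, wrapping a Series into a single named column."""
    if isinstance(obj, pd.Series):
        # Keep the Series name if it has one, otherwise use the fallback name
        column = obj.name if obj.name is not None else name
        return obj.to_frame(name=column)
    return pd.DataFrame(obj)


bill_length = pd.Series([39.1, 39.5, 40.3], name="bill_length_mm")
frame_ready = as_dataframe(bill_length)
print(frame_ready.shape)

# With a cluster running, frame_ready could then be converted with:
# H2OFrame(frame_ready)
```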


## Run H2O AutoML
The basic commands to run the AutoML process are shown below. Note that this is a very simple demonstration, and no further work is done at this point to check the results or comment on the models' performance. See this notebook as a "smoke test" rather than a comprehensive demonstration.

In [69]:
# Run AutoML, capping the number of base models with max_models (stacked ensembles are not counted)
aml = H2OAutoML(max_models=5, seed=1)
aml.train(x=x, y=y, training_frame=train)

# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%

09:20:40.77: AutoML: XGBoost is not available; skipping it.


model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse
GBM_4_AutoML_2_20240726_92040,0.976924,0.192053,0.973619,0.0617656,0.236095,0.0557409
StackedEnsemble_BestOfFamily_1_AutoML_2_20240726_92040,0.9765,0.192089,0.972984,0.0584048,0.234459,0.0549712
GBM_5_AutoML_2_20240726_92040,0.976448,0.194165,0.973338,0.0692283,0.238489,0.0568772
StackedEnsemble_AllModels_1_AutoML_2_20240726_92040,0.975124,0.200834,0.971997,0.0694136,0.244187,0.0596272
GBM_grid_1_AutoML_2_20240726_92040_model_5,0.975124,0.201927,0.972457,0.0789404,0.245279,0.0601619
GBM_2_AutoML_2_20240726_92040,0.97306,0.206992,0.969616,0.0727744,0.244937,0.0599941
GBM_3_AutoML_2_20240726_92040,0.97306,0.205731,0.969072,0.0796814,0.246748,0.0608843
GBM_grid_1_AutoML_2_20240726_92040_model_2,0.972584,0.204478,0.965924,0.0649413,0.242204,0.0586626
GBM_grid_1_AutoML_2_20240726_92040_model_3,0.967079,0.234733,0.963294,0.0752091,0.260505,0.0678629
GLM_1_AutoML_2_20240726_92040,0.963745,0.24765,0.948011,0.0858474,0.263402,0.0693806


In [76]:
# List items available in H2O
h2o.ls()




Unnamed: 0,key


## Shut down server
Remember to shut down the server at the end of the session. Note that `h2o.init()` checks whether a server is already running before starting a new one. The shutdown method can also be used to kill the server in case of runtime issues or to interrupt a process.

In [71]:
# Shutdown the H2O server
h2o.shutdown(prompt=False)

H2O session _sid_b33b closed.
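
To make sure the server is shut down even when a cell raises an error, the start and stop calls can be wrapped in a context manager. This is a minimal sketch and not part of the H2O API: `h2o_session` is a hypothetical helper, and it assumes that `h2o.cluster().shutdown(prompt=False)` is available in the installed version as the cluster-object form of the shutdown call used above.

```python
from contextlib import contextmanager


@contextmanager
def h2o_session(**init_kwargs):
    """Start (or connect to) H2O, then always shut it down on exit."""
    # Imported lazily so this helper can be defined before h2o is installed
    import h2o

    h2o.init(**init_kwargs)
    try:
        yield h2o
    finally:
        # Runs whether the body finished normally or raised an exception
        h2o.cluster().shutdown(prompt=False)


# Usage (keyword arguments are forwarded to h2o.init as-is):
# with h2o_session(nthreads=2, max_mem_size="2G") as h2o_client:
#     h2o_client.ls()
```

Note that this shuts the server down even if `h2o.init()` only connected to an instance that was already running, which may not be wanted on a shared server.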


