# AutoML in action

In [1]:
!pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o mlflow==1.30 boto3 awscli -q

## Configuration: the only part of the notebook to edit

In [2]:
!aws --endpoint-url $MINIO_ENDPOINT_URL s3 cp s3://bpk/titanic_cleaned.parquet .

download: s3://bpk/titanic_cleaned.parquet to ./titanic_cleaned.parquet


In [3]:
name = "titanic"
dataset_path = "titanic_cleaned.parquet"
use_cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
target_col = "Survived"
categorical_cols = ["Survived", "Pclass"]
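
Because everything after this cell runs unchanged, a typo in the configuration only surfaces later, inside H2O. A minimal sketch of an early sanity check follows. The helper `validate_config` and the literal column list are hypothetical and not part of the original notebook; in practice the column list would come from the loaded frame.

```python
def validate_config(available_cols, use_cols, target_col, categorical_cols):
    """Return a list of problems; an empty list means the config looks consistent."""
    available = set(available_cols)
    problems = []
    # Every feature must exist in the dataset.
    for col in use_cols:
        if col not in available:
            problems.append(f"feature column not in dataset: {col}")
    # The target must exist and must not double as a feature.
    if target_col not in available:
        problems.append(f"target column not in dataset: {target_col}")
    if target_col in use_cols:
        problems.append(f"target column is also listed as a feature: {target_col}")
    # Columns to be converted to categorical must exist as well.
    for col in categorical_cols:
        if col not in available:
            problems.append(f"categorical column not in dataset: {col}")
    return problems


# Hypothetical column list standing in for the columns of the loaded dataset.
columns = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
problems = validate_config(
    columns,
    use_cols=["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"],
    target_col="Survived",
    categorical_cols=["Survived", "Pclass"],
)
print(problems or "configuration looks consistent")
```

Returning a list instead of raising on the first problem means all misconfigured names are reported in one pass.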

## AutoML magic starts here

Nothing below this point needs to change for a different dataset.

In [4]:
import h2o
from h2o.automl import H2OAutoML

h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.17" 2022-10-18; OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu220.04); OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu220.04, mixed mode, sharing)
  Starting server from /opt/conda/lib/python3.8/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmppfq4dx13
  JVM stdout: /tmp/tmppfq4dx13/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmppfq4dx13/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime: 02 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.40.0.1
H2O_cluster_version_age: 24 days
H2O_cluster_name: H2O_from_python_unknownUser_5gkcz5
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 1.201 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2


In [5]:
df = h2o.import_file(dataset_path)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [27]:
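# Workaround: H2O imports the pandas index stored in the Parquet file
# ("__index_level_0__") as a regular column, so drop it before training.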
df = df.drop("__index_level_0__")

In [28]:
for b in categorical_cols:
    df[b] = df[b].asfactor()

In [29]:
train, test = df.split_frame(ratios=[.7])
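
H2O documents `split_frame` as an approximate, probabilistic split rather than an exact one, so the 70/30 ratio holds only roughly and the partition changes between runs unless a `seed` is passed. The sketch below illustrates that idea in plain Python. It is not H2O's implementation, and `probabilistic_split` is a hypothetical helper.

```python
import random


def probabilistic_split(rows, ratio, seed=None):
    """Send each row to the first part with probability `ratio`, else to the second."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        # One independent uniform draw per row: part sizes are only approximately
        # ratio and (1 - ratio) of the total.
        (first if rng.random() < ratio else second).append(row)
    return first, second


rows = list(range(1000))
train_rows, test_rows = probabilistic_split(rows, 0.7, seed=42)
print(len(train_rows), len(test_rows))  # close to 700 / 300, not exactly
```

With a fixed seed the same rows land in the same part on every run, which is the reason to pass `seed` to `split_frame` when results need to be comparable across runs.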

In [30]:
import mlflow
from mlflow.models.signature import infer_signature

In [31]:
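with mlflow.start_run(run_name=name):
    mlflow.set_tag("author", "bpk")

    # Keep the time budget in one variable so the logged parameter always matches what AutoML uses.
    max_runtime_secs = 60
    aml = H2OAutoML(max_runtime_secs=max_runtime_secs)
    mlflow.log_param("max_runtime_secs", max_runtime_secs)
    aml.train(x=use_cols, y=target_col, training_frame=train, validation_frame=test)

    # Infer the model's input/output schema from the held-out frame.
    signature = infer_signature(test.drop(target_col).as_data_frame(), test[target_col].as_data_frame())

    # Log the leader's metrics on the held-out frame, then register the leader model.
    # The registry name is derived from `name`, so this cell needs no edits for other datasets.
    mlflow.log_text(str(aml.leader.model_performance(test)), "model_performance.txt")
    mlflow.h2o.log_model(aml.leader, "leader", registered_model_name=f"{name}-automl", signature=signature)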

AutoML progress: |
14:32:56.237: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.

███████████████████████████████████████████████████████████████| (done) 100%


Registered model 'titanic-automl' already exists. Creating a new version of this model...
2023/03/05 14:34:01 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: titanic-automl, version 3
Created version '3' of model 'titanic-automl'.
