[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alexwolson/postdocbootcamp2023/blob/main/lab_3_1_automl.ipynb)

# UofT DSI-CARTE Postdoc Bootcamp
#### Thursday, July 20, 2023
#### AutoML - Introduction to H2O - Lab 1, Day 3
#### Teaching team: Alex Olson, Nakul Upadhya, Shehnaz Islam
##### Lab author: Shehnaz Islam, shehnaz.islam@mail.utoronto.ca, edited by Alex Olson

### Introduction
**Automated machine learning (AutoML)** is a fairly young field that aims to build automated workflows that take raw data as input and produce predictions. Such a workflow should handle preprocessing, model selection, hyperparameter tuning, and the other stages of the ML process without manual intervention.

Several AutoML frameworks exist, each automating a different subset of the machine learning workflow, from preprocessing to model development. In this lab we discuss one such framework, **H2O**.

### H2O
**H2O** is an open-source machine learning platform that includes a widely used AutoML module. Given a dataset, H2O AutoML searches over several algorithm families (such as generalized linear models, random forests, gradient boosting machines, XGBoost, and deep neural networks) and their hyperparameters to find the best-performing model. It does this by training a set of default models, running random grid searches, and combining the results into stacked ensembles. This frees the user from manual algorithm selection and hyperparameter tuning.
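
To make the idea of "searching a space of models and ranking them" concrete, here is a minimal pure-Python sketch. It is not how H2O is implemented: the toy dataset, the threshold "model", and names such as `make_threshold_model` are assumptions made up for illustration only.

```python
import random

# Toy 1-D binary dataset (an assumption for illustration): label is 1 when x > 0.6
rng = random.Random(0)
data = [(x, int(x > 0.6)) for x in (rng.random() for _ in range(200))]
train_rows, valid_rows = data[:150], data[150:]

def accuracy(predict, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

def make_threshold_model(threshold):
    """A one-hyperparameter 'model': predict 1 when x exceeds the threshold."""
    return lambda x: int(x > threshold)

# Random search: sample a hyperparameter, score the candidate on validation data
leaderboard = []
for i in range(20):
    threshold = rng.random()
    model = make_threshold_model(threshold)
    leaderboard.append({
        "model_id": f"threshold_{i}",
        "threshold": threshold,
        "valid_accuracy": accuracy(model, valid_rows),
    })

# A second "algorithm": always predict the majority class seen in training
majority = int(sum(y for _, y in train_rows) * 2 > len(train_rows))
leaderboard.append({
    "model_id": "majority_baseline",
    "threshold": None,
    "valid_accuracy": accuracy(lambda x: majority, valid_rows),
})

# Rank all candidates, best first; the first entry plays the role of the "leader"
leaderboard.sort(key=lambda row: row["valid_accuracy"], reverse=True)
leader = leaderboard[0]
print(leader["model_id"], leader["valid_accuracy"])
```

An AutoML framework runs the same kind of loop at a much larger scale, with real learning algorithms, cross-validation, and ensembling of the best candidates.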

### Why AutoML?
AutoML is meant to improve the quality of a data scientist's work, not to remove data scientists from the loop.
Experts can use AutoML to work more efficiently by focusing on the best-performing pipelines, and non-experts can use AutoML systems without a broad ML education.

## H2O for Classification

Now, let's see how to use H2O AutoML for a binary classification task: the breast cancer dataset, which has 30 features and 569 samples, each belonging to either the 'malignant' or the 'benign' class.

In [None]:
# Install standard dependencies
!pip install -q requests tabulate future matplotlib

In [None]:
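# macOS >= El Capitan (Darwin kernel 15+) requires an additional flag to pip here - check if we are running on macOS
import platform
# Compare the major Darwin version as an integer; a string comparison would wrongly treat '9.x' as newer than '15.x'
if platform.system() == 'Darwin' and int(platform.release().split('.')[0]) >= 15: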
    !pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o --user --trusted-host h2o-release.s3.amazonaws.com
else:
    !pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o --trusted-host h2o-release.s3.amazonaws.com

In [None]:
# Make necessary imports
import h2o
from h2o.automl import H2OAutoML
import pandas as pd
from sklearn.model_selection import train_test_split

### Data Loading

In [None]:
# Load breast cancer dataset and prepare for h2o
import h2o
import pandas as pd  # used by load_and_prep_data below
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

def load_and_prep_data(load_data_func, test_size=0.33, random_state=0):
    # Load the dataset using the provided function
    dataset = load_data_func()
    X, y = dataset.data, dataset.target

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    # Get the feature names
    feature_names = dataset.feature_names

    # Convert datasets to pandas DataFrames
    X_train = pd.DataFrame(X_train, columns=feature_names)
    X_test = pd.DataFrame(X_test, columns=feature_names)
    y_train = pd.DataFrame(y_train)
    y_test = pd.DataFrame(y_test)

    # Add the target column
    X_train['target'] = y_train
    X_test['target'] = y_test

    return X_train, X_test

# Initialize h2o and remove any previous data
h2o.init()
h2o.remove_all()

# Load and preprocess the data
X_train, X_test = load_and_prep_data(load_breast_cancer)

In [None]:
# Convert pandas dataframe to h2o frame
train = h2o.H2OFrame(X_train)
test = h2o.H2OFrame(X_test)

# Identify predictors and response
x = train.columns
y = "target"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

In [None]:
X_train

In [None]:
# Run AutoML with a 5 minute time limit to save time
# (when neither max_runtime_secs nor max_models is set, the run is limited to 1 hour)
aml = H2OAutoML(max_runtime_secs = 300, seed = 1, project_name = "breast_cancer")
aml.train(x = x, y = y, training_frame = train)
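
The run above is bounded only by time. If you would rather cap the number of models, `H2OAutoML` also accepts a `max_models` argument, which counts base models and excludes the stacked ensembles. The sketch below wraps this in a small helper; the helper name `make_automl`, the project name, and the default values are assumptions chosen for illustration.

```python
def make_automl(max_models=20, max_runtime_secs=300, seed=1):
    """Build (but do not train) an AutoML object with two stopping criteria.

    The run stops as soon as either limit is reached.
    """
    # Imported lazily so that defining the helper does not require h2o
    from h2o.automl import H2OAutoML

    return H2OAutoML(
        max_models=max_models,              # cap on base models (ensembles excluded)
        max_runtime_secs=max_runtime_secs,  # cap on total runtime in seconds
        seed=seed,                          # for reproducibility
        project_name="breast_cancer_max_models",  # hypothetical project name
    )
```

After `h2o.init()`, it could be used as `aml_capped = make_automl()` followed by `aml_capped.train(x = x, y = y, training_frame = train)`. Note that runs limited by time are generally not reproducible even with a fixed seed, because the number of models trained depends on the machine's speed.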

## View training results

We can use **aml.leaderboard** to view the top models built by AutoML and compare their performance. Explore the **leaderboard** to see the different models built by AutoML.

In [None]:
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)

## View the top model

We can use **aml.leader** to view the single top model built by AutoML and its performance.

In [None]:
# The leader model is stored here
aml.leader

We can also look at the best model according to a specific metric, such as **AUC**, or for a specific algorithm, such as **XGBoost**.

In [None]:
# Get the best model based on AUC and using XGBoost
aml.get_best_model(criterion = "AUC", algorithm = "XGBoost")
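
As a reminder of what AUC measures, it can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following pure-Python sketch computes it by brute force over all pairs. The function name `pairwise_auc` and the toy labels and scores are assumptions for illustration; H2O computes the metric internally with its own implementation.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the leaderboard treats higher AUC as better.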

## Model performance

Now that we have trained our models, we can estimate the leader's performance on the held-out test data. Passing `test_data=test` to the `aml.leader.model_performance()` method computes the metrics on the test set rather than on the training data.

In [None]:
# Get the performance of the leader model
aml.leader.model_performance(test_data=test)
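
The object returned by `model_performance()` prints a full report, but individual metrics can also be read from it. The helper below is a small sketch for a binary classifier; the name `summarize_binary_performance` is hypothetical, and it assumes the metrics object exposes `auc()` and `logloss()`, as H2O's binomial metrics do.

```python
def summarize_binary_performance(model, test_frame):
    """Return a few headline test-set metrics for a binary classifier as a dict."""
    perf = model.model_performance(test_data=test_frame)
    return {
        "auc": perf.auc(),          # ranking quality, higher is better
        "logloss": perf.logloss(),  # probability calibration, lower is better
    }
```

For example, `summarize_binary_performance(aml.leader, test)` returns a plain dictionary that is easy to log or compare across models.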

## Model explainability

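We can also get a detailed summary of a model built by AutoML using the **explain()** method, which produces a set of explanation plots and tables for the model, such as variable importance and partial dependence plots.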

In [None]:
# Get a detailed summary of the best XGBoost model
best_xgb = aml.get_best_model(algorithm = "XGBoost")
best_xgb.explain(test)

## Your Turn

The task we presented is a binary classification task. Now, we want you to try AutoML on a regression task, using the **California Housing** dataset. The dataset contains 8 features and 20,640 samples, and the task is to predict the median house value for each census block group in California. The dataset is available in the **sklearn.datasets** module, and you can load it with the **fetch_california_housing()** function. You can find more information about the dataset [here](https://scikit-learn.org/stable/datasets/index.html#california-housing-dataset).

In [None]:
from sklearn.datasets import fetch_california_housing
import h2o

# Initialize h2o and remove any previous data
h2o.init()
h2o.remove_all()

# Load and preprocess the data
X_train, X_test = load_and_prep_data(fetch_california_housing)

In [None]:
# Convert pandas dataframe to h2o frame
train = h2o.H2OFrame(X_train)
test = h2o.H2OFrame(X_test)

# Identify predictors and response
x = train.columns
y = "target"
x.remove(y)

# For regression, the response should be numeric
train[y] = train[y].asnumeric()
test[y] = test[y].asnumeric()
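
Because the response is numeric, models are compared on error metrics such as RMSE and MAE rather than on accuracy or AUC. As a quick refresher, the sketch below computes both by hand. The function names and the toy values are assumptions for illustration; H2O reports these metrics for you.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy example: errors are -1, 0, and 2
y_true = [3.0, 1.0, 2.0]
y_pred = [2.0, 1.0, 4.0]
print(mae(y_true, y_pred))   # 1.0
print(rmse(y_true, y_pred))  # sqrt(5/3), about 1.29
```

Both metrics are in the units of the target, so lower values mean better predictions.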

In [None]:
## Your code here - build an H2OAutoML model for the California Housing dataset, select a model, and report on its performance
## What's the lowest test error (e.g. RMSE) you can get?

