# WS 12: AutoML with AutoGluon Hands-on Module

In this hands-on module, we will see how the [AutoGluon](https://auto.gluon.ai/stable/index.html) package simplifies training high-quality, optimized machine learning models, using sample datasets from the UCI Machine Learning Repository.
We start by installing the `autogluon` and `ucimlrepo` packages with `pip`.

In [None]:
!pip install autogluon
!pip install ucimlrepo

Now we import packages and load the following three healthcare-related datasets from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/):


*   [Heart Disease data](https://archive.ics.uci.edu/dataset/45/heart+disease)
*   [Breast Cancer data](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic) from University of Wisconsin
*   [Diabetes data](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008) representing ten years of clinical care at 130 US hospitals









In [None]:

import pandas as pd
import numpy as np
from ucimlrepo import fetch_ucirepo

In [None]:
# load in the heart disease dataset from UCI
heart_disease = fetch_ucirepo(id=45)

# data (as pandas dataframes)
X = heart_disease.data.features
y = heart_disease.data.targets



In [None]:

# variable information
print(heart_disease.variables)

The heart disease dataset uses a multi-class label with integer values ranging from 0 to 4, with the following semantics:


*   0 (no heart disease)
*   1-4 (increasing severity of heart disease)

To simplify the classification problem, we convert these to binary labels, with 0/1 indicating the absence/presence of heart disease.



In [None]:
# finalize the heart disease dataset in a single DataFrame with predictors and labels
heart_disease_df = X.assign(
    # y is a single-column DataFrame, so take its only column as a Series and
    # convert the 0-4 labels to binary (1=heart disease, 0=no heart disease)
    binary_label=(y.iloc[:, 0] > 0).astype(int)
)

In [None]:
heart_disease_df

In [None]:
heart_disease_df['binary_label'].value_counts()

Next we load the breast cancer dataset. This dataset contains features that describe the characteristics of cell nuclei present in a digitized image of a fine needle aspirate of a breast mass. The labels are binary (two-class), with 'B' representing a benign mass and 'M' representing a malignant mass.



In [None]:
# now we load in the breast cancer dataset from UCI
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)

# data (as pandas dataframes)
X = breast_cancer_wisconsin_diagnostic.data.features
y = breast_cancer_wisconsin_diagnostic.data.targets


In [None]:
print(breast_cancer_wisconsin_diagnostic.variables)

In [None]:
breast_cancer_df = X.assign(
    Diagnosis=y
)

In [None]:
breast_cancer_df

In [None]:
breast_cancer_df['Diagnosis'].value_counts()

Next we load the diabetes dataset. This dataset was constructed with the goal of predicting early readmission of diabetes patients, that is, readmission within 30 days of discharge.

In [None]:
# fetch dataset
diabetes_data = fetch_ucirepo(id=296)

# data (as pandas dataframes)
X = diabetes_data.data.features
y = diabetes_data.data.targets

In [None]:
print(diabetes_data.variables)

In [None]:
diabetes_df = X.assign(
    readmitted=y
)

In [None]:
diabetes_df['readmitted'].value_counts()

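Now we split the three datasets into 80%/20% training/test sets so that we can evaluate our tuned models on unseen test data at the very end. For the diabetes data, we first convert the `readmitted` column to a binary label (1 if the value is `'<30'`, 0 otherwise).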

In [None]:
# splitting the heart disease dataset into training and test sets using DataFrame.sample()
hd_train = heart_disease_df.sample(frac=0.8)
hd_test = heart_disease_df.drop(hd_train.index)

In [None]:
hd_train['binary_label'].value_counts(normalize=True)

In [None]:
hd_test['binary_label'].value_counts(normalize=True)

In [None]:
bc_train = breast_cancer_df.sample(frac=0.8)
bc_test = breast_cancer_df.drop(bc_train.index)

In [None]:
bc_train['Diagnosis'].value_counts(normalize=True)

In [None]:
bc_test['Diagnosis'].value_counts(normalize=True)

In [None]:
diabetes_binary = diabetes_df.assign(
    binary_label=lambda x: x['readmitted'].map(lambda label: 1 if label == '<30' else 0)
).drop(columns='readmitted')
diabetes_train = diabetes_binary.sample(frac=0.8)
diabetes_test = diabetes_binary.drop(diabetes_train.index)

In [None]:
diabetes_train['binary_label'].value_counts(normalize=True)

In [None]:
diabetes_test['binary_label'].value_counts(normalize=True)
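
The splits above draw rows uniformly at random without a seed, so the class proportions printed for the training and test sets can differ slightly and will change from run to run. As a hedged alternative, the sketch below samples within each class using a fixed seed. The helper `stratified_split` and the `toy_*` data are hypothetical names introduced only for illustration, and the sketch assumes the DataFrame has a unique index.

```python
import pandas as pd

def stratified_split(df, label, frac=0.8, seed=42):
    """Return (train, test) where each class contributes `frac` of its rows to train."""
    # sample within each class, with a fixed seed for reproducibility
    train = df.groupby(label, group_keys=False).sample(frac=frac, random_state=seed)
    # every row that was not drawn for training goes to the test set
    test = df.drop(train.index)
    return train, test

# toy data standing in for a real dataset: 80 negatives and 20 positives
toy_df = pd.DataFrame({
    "feature": range(100),
    "binary_label": [0] * 80 + [1] * 20,
})
toy_train, toy_test = stratified_split(toy_df, "binary_label")

# both splits keep the original 80/20 class balance
print(toy_train["binary_label"].value_counts(normalize=True))
print(toy_test["binary_label"].value_counts(normalize=True))
```

With a highly imbalanced label such as early readmission, keeping the positive rate identical in both splits makes test metrics like F1 easier to compare across runs.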

Now, for illustrative purposes, we look at how much code it takes to implement a *(highly simplified)* version of the k-fold bagging and stacked ensembling that AutoGluon performs automatically, using scikit-learn, another popular machine learning framework for Python, together with XGBoost.

In [None]:
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier
from tqdm import tqdm

In [None]:
# Note: in newer versions of scikit-learn (>= 1.4), RandomForestClassifier can handle NaNs natively
# This saves us from having to impute them explicitly, which we would normally need to do
heart_disease_df

In [None]:
# separate features X from targets y
X = hd_train.drop(columns=['binary_label'])
y = hd_train['binary_label']
# initialize the Kfold object for doing kfold cross validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# construct arrays for storing the out-of-fold predictions for the models
oof_preds_rf = np.zeros(len(X))
oof_preds_xgb = np.zeros(len(X))

# specify the classifiers that will be in each layer
layers = [RandomForestClassifier, XGBClassifier]
layer_preds = [oof_preds_rf, oof_preds_xgb]
# save the bagged models of each layer in lists
layer_bags = [list(), list()]
# loop over our layers
for i, layer in enumerate(layers):
  print(f"Performing k-fold cross validation at layer {i} with {layer}")
  # do the K-fold cross validation loop
  for train_idx, val_idx in tqdm(kf.split(X), total=5):
      # split inputs and outputs into training and validation
      X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
      y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

      # if we are in a layer past the first layer, the inputs need to include
      # the predictions from the prior layer
      if i > 0:
        X_train = np.column_stack([
            X_train.to_numpy(),
            layer_preds[i-1][train_idx] # include preds from prior layer
        ])
        X_val = np.column_stack([
            X_val.to_numpy(),
            layer_preds[i-1][val_idx]
        ])

      # fit a model from the given layer on the training fold
      model = layer()
      model.fit(X_train, y_train)
      # evaluate it on the validation fold and save oof predictions
      layer_preds[i][val_idx] = model.predict_proba(X_val)[:, 1]

      # save the model in our layer bag
      layer_bags[i].append(model)

# final meta model: logistic regression fit on the out-of-fold predictions from both layers
meta_features = np.column_stack(layer_preds)

meta_model = LogisticRegression()
meta_model.fit(meta_features, y)

Now, to evaluate this custom meta model on the unseen test data, we need to:


1.   Get one set of predictions from each of the 5 Random Forest models in the first layer
2.   Get one set of predictions from each of the 5 XGBoost models in the second layer, appending the averaged first-layer predictions as an extra input feature for the second-layer models
3.   Average the predictions made at each layer into a single set of predictions per layer
4.   Append these two sets of predictions together into the final features to feed to the meta model (logistic regression)
5.   Get the final predictions from the meta model



In [None]:
X_test = hd_test.drop(columns=['binary_label'])
y_test = hd_test['binary_label']

layer_preds_test = []

# steps 1 and 2 - Getting separate predictions from models in each layer
for i in range(len(layer_bags)):
  # concatenate predictions from the prior layer to the features if needed
  if i > 0:
    X_test_stacked = np.column_stack([X_test.to_numpy(), layer_preds_test[i-1]])
  else:
    X_test_stacked = X_test

  # step 3 - compute average predictions across all models in the layer
  current_layer_preds = sum([
      model.predict_proba(X_test_stacked)[:, 1]
      for model in layer_bags[i]
  ]) / len(layer_bags[i])

  # save the predictions for the layer
  layer_preds_test.append(current_layer_preds)

# Step 4 - concatenate layer predictions
meta_features_test = np.column_stack(layer_preds_test)

# Step 5 - get final predictions
final_predictions = meta_model.predict(meta_features_test)

# get the accuracy and f1 score
print(f"F1: {f1_score(y_test, final_predictions)}")
print(f"Accuracy: {accuracy_score(y_test, final_predictions)}")


We see that capturing even a minimal version of what AutoGluon does under the hood at training and inference time took a considerable amount of code, understanding, and index manipulation, and we did not even implement greedy ensemble weighting or model pruning. \
Now we will see how AutoGluon's `TabularPredictor` class can automatically fit a weighted ensemble on the same dataset, with automatic k-fold cross-validation, bagging, and stacking, while evaluating a much larger suite of models for inclusion in the final ensemble.
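
To give a sense of what the greedy weighting step we skipped might involve, here is a minimal sketch of greedy forward selection with replacement over out-of-fold probabilities. It is an illustration under stated assumptions, not AutoGluon's actual implementation: the function `greedy_ensemble_weights`, the choice of log loss as the selection criterion, the number of rounds, and the `toy_*` arrays are all hypothetical.

```python
import numpy as np

def greedy_ensemble_weights(oof_probs, y_true, n_rounds=20):
    """Greedy forward selection with replacement over model predictions.

    oof_probs: shape (n_models, n_samples), positive-class probabilities
    y_true:    shape (n_samples,), 0/1 labels
    Returns one weight per model; weights sum to 1, unpicked models get 0.
    """
    oof_probs = np.asarray(oof_probs, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    n_models = oof_probs.shape[0]
    counts = np.zeros(n_models, dtype=int)
    running_sum = np.zeros(oof_probs.shape[1])

    for round_idx in range(1, n_rounds + 1):
        # score each candidate by the log loss of the ensemble if it were added
        losses = []
        for m in range(n_models):
            avg = np.clip((running_sum + oof_probs[m]) / round_idx, 1e-7, 1 - 1e-7)
            losses.append(-np.mean(y_true * np.log(avg) + (1 - y_true) * np.log(1 - avg)))
        # keep the candidate that helps most; a model may be picked repeatedly
        best = int(np.argmin(losses))
        counts[best] += 1
        running_sum += oof_probs[best]

    # a model's weight is the fraction of rounds in which it was picked
    return counts / counts.sum()

# toy out-of-fold predictions (made-up numbers, not from the models above)
toy_y = np.array([0, 0, 1, 1, 0, 1])
toy_oof = np.array([
    [0.2, 0.3, 0.7, 0.8, 0.4, 0.6],  # reasonably good model
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # uninformative model
    [0.1, 0.4, 0.9, 0.6, 0.2, 0.7],  # good model with different errors
])
toy_weights = greedy_ensemble_weights(toy_oof, toy_y)
print(toy_weights)
```

In this toy setup the uninformative model never improves the ensemble, so its weight stays at zero, which is the kind of implicit model pruning that greedy weighting provides.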

In [None]:
from autogluon.tabular import TabularPredictor

In [None]:
# Fitting a tabular predictor on the Heart Disease Dataset
predictor_hd = TabularPredictor( # construct the predictor
    label='binary_label', eval_metric='roc_auc'
).fit( # call the fit method
    hd_train,
    num_bag_folds=3 # perform k-fold cross validation
)

Now we evaluate the AutoGluon model on the test dataset and show a leaderboard that breaks down performance across all models trained while constructing the ensemble.

In [None]:
predictor_hd.evaluate(hd_test)

In [None]:
predictor_hd.leaderboard(hd_test)

In [None]:
bc_train

Here we train another `TabularPredictor` on the breast cancer data, showing that AutoGluon can handle categorical targets out of the box, without manually encoding them as integers or one-hot vectors.

In [None]:
# here we do the same with the breast cancer dataset
predictor_bc = TabularPredictor( # construct the predictor
    label='Diagnosis', eval_metric='roc_auc'
).fit( # call the fit method
    bc_train
)

In [None]:
predictor_bc.evaluate(bc_test)

In [None]:
predictor_bc.leaderboard(bc_test)

We see that AutoGluon's `TabularPredictor` achieves performance competitive with that of state-of-the-art baseline models with minimal code and machine learning engineering knowledge. However, let's assess how well it scales to larger datasets with heterogeneous input types by testing it on the diabetes dataset. Here we will also see how to tune the decision threshold of a binary classifier with AutoGluon to try to achieve better recall and performance on the positive class.
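
Before using AutoGluon's built-in tooling, it may help to see the idea behind threshold tuning in isolation. The sketch below scans a grid of candidate thresholds over predicted probabilities and keeps the one with the highest F1 score. It is a simplified illustration, not AutoGluon's implementation: the helpers `f1_at_threshold` and `best_f1_threshold`, the threshold grid, and the `toy_*` arrays are hypothetical.

```python
import numpy as np

def f1_at_threshold(y_true, probs, threshold):
    """F1 score for the positive class when predicting 1 if prob >= threshold."""
    preds = (probs >= threshold).astype(int)
    tp = int(np.sum((preds == 1) & (y_true == 1)))
    fp = int(np.sum((preds == 1) & (y_true == 0)))
    fn = int(np.sum((preds == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def best_f1_threshold(y_true, probs, candidates=None):
    """Return (threshold, f1) for the candidate with the highest F1."""
    if candidates is None:
        candidates = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95
    scores = [f1_at_threshold(y_true, probs, t) for t in candidates]
    best = int(np.argmax(scores))  # ties resolve to the lowest threshold
    return float(candidates[best]), float(scores[best])

# toy imbalanced validation data (made-up numbers): 2 positives out of 10
toy_labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
toy_probs = np.array([0.04, 0.11, 0.12, 0.16, 0.21, 0.22, 0.27, 0.36, 0.32, 0.47])

default_f1 = f1_at_threshold(toy_labels, toy_probs, 0.5)
tuned_threshold, tuned_f1 = best_f1_threshold(toy_labels, toy_probs)
print(f"F1 at 0.5: {default_f1:.3f}")
print(f"Best threshold: {tuned_threshold:.2f}, F1: {tuned_f1:.3f}")
```

With a rare positive class, predicted probabilities often sit below 0.5, so the default threshold can miss every positive case. Lowering the threshold trades some precision for recall. The threshold should be chosen on validation data rather than on the test set, so that the final test metrics remain an unbiased estimate.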

In [None]:
diabetes_train

In [None]:
predictor_diabetes = TabularPredictor( # construct the predictor
    label='binary_label', eval_metric='roc_auc'
).fit( # call the fit method
    diabetes_train.sample(frac=0.4), # training on a subsample of the data for faster training
    excluded_model_types=['NN_TORCH', 'FASTAI', 'CAT'], # excluding neural nets and CatBoost for faster training
)

In [None]:
predictor_diabetes.evaluate(diabetes_test)

In [None]:
# now we optimize the threshold to try and boost the f1 score
threshold = predictor_diabetes.calibrate_decision_threshold(metric='f1')

In [None]:
predictor_diabetes.set_decision_threshold(threshold)

In [None]:
predictor_diabetes.evaluate(diabetes_test)

In [None]:
diabetes_preds = predictor_diabetes.predict(diabetes_test)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
ConfusionMatrixDisplay.from_predictions(diabetes_test['binary_label'], diabetes_preds)