# WS 12 AutoML with AutoGluon Hands-on Module


## 1. Introduction
In this hands-on module, we will see how to simplify the process of training high-quality, optimized machine learning models on sample datasets from the UCI Machine Learning Repository using the [AutoGluon](https://auto.gluon.ai/stable/index.html) package.

## 2. Dataset introduction and loading data from UCI ML Repository
Now we import packages and load the following three healthcare-related datasets from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/):


*   [Breast Cancer data](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic) from University of Wisconsin
*   [Diabetes data](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008) representing ten years of clinical care at 130 US hospitals
*   [Drug Reviews data](https://archive.ics.uci.edu/dataset/461/drug+review+dataset+druglib+com) providing patient reviews on specific drugs

The first two datasets will be used to demonstrate AutoGluon's `TabularPredictor` class and how it enables us to train high-fidelity ensemble models without needing to worry about pre-processing.

The third dataset will allow us to explore AutoGluon's `MultiModalPredictor` and how it allows us to train models on plain-text inputs.









In [None]:
import pandas as pd
import numpy as np
from ucimlrepo import fetch_ucirepo

In [None]:
# here we fix a random seed for reproducibility purposes
np.random.seed(913)

Now we load the breast cancer dataset. This dataset contains features that describe the characteristics of cell nuclei present in a digitized image of a fine needle aspirate of a breast mass. The labels are binary (two-class), with 'B' representing a benign mass and 'M' representing a malignant mass.



In [None]:
# now we load in the breast cancer dataset from UCI
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)

# data (as pandas dataframes)
X = breast_cancer_wisconsin_diagnostic.data.features
y = breast_cancer_wisconsin_diagnostic.data.targets


In [None]:
print(breast_cancer_wisconsin_diagnostic.variables)

Here we combine our input features `X` and target labels `y` into a single pandas `DataFrame` to make them easier to work with.

In [None]:
breast_cancer_df = X.assign(
    Diagnosis=y
)

In [None]:
breast_cancer_df

In [None]:
breast_cancer_df['Diagnosis'].value_counts()

Next we load the Diabetes dataset. This dataset was constructed with the goal of predicting the early readmission of diabetes patients within 30 days of discharge.

In [None]:
# fetch dataset
diabetes_data = fetch_ucirepo(id=296)

# data (as pandas dataframes)
X = diabetes_data.data.features
y = diabetes_data.data.targets

In [None]:
print(diabetes_data.variables)

In [None]:
y

Here we combine the input features `X` and targets `y` into a single `DataFrame`. We also convert the multi-class readmission labels to a binary target as follows:

1. Original label `'NO'` for "no readmission" -> `0`
2. Original label `'>30'` for "readmitted after more than 30 days" -> `1`
3. Original label `'<30'` for "readmitted within 30 days" -> `1`

This is done purely for pedagogical purposes, so we can demonstrate how to calibrate a model's classification threshold using AutoGluon. AutoGluon is perfectly capable of handling multi-class targets, but converting to binary makes the example simpler.

In [None]:
diabetes_df = X.assign(
    # y is a one-column DataFrame, so select that column as a Series before mapping
    readmitted=y.iloc[:, 0].map(lambda readmit: 0 if readmit == 'NO' else 1) # convert to a binary target
)

In [None]:
diabetes_df['readmitted'].value_counts()

Since this is a relatively large dataset, for performance reasons, we downsample to only include 20% of the original dataset in our subsequent examples.

In [None]:
# because this is such a large dataset, we down-sample it to only include 20% of the data
diabetes_df_downsamp = diabetes_df.sample(frac=0.2)

In [None]:
diabetes_df_downsamp['readmitted'].value_counts()

Now we split the two datasets into 80%/20% training/test sets, so that we can evaluate our tuned models at the very end on unseen test data.
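
The cells that follow draw a simple random 80% sample, so the class proportions in the two splits can drift slightly apart. As a hedged alternative, the sketch below samples 80% within each class so both splits keep the same class ratio. The toy data and the `toy_*` names are hypothetical and not part of the workshop datasets.

```python
import pandas as pd

# Hypothetical toy data: 70 benign ('B') and 30 malignant ('M') rows
toy_df = pd.DataFrame({
    "feature": range(100),
    "Diagnosis": ["B"] * 70 + ["M"] * 30,
})

# Sample 80% of the rows within each class, so both splits keep the 70/30 ratio
toy_train = toy_df.groupby("Diagnosis").sample(frac=0.8, random_state=913)
toy_test = toy_df.drop(toy_train.index)

print(toy_train["Diagnosis"].value_counts(normalize=True))
print(toy_test["Diagnosis"].value_counts(normalize=True))
```

Stratifying matters most when a class is rare, since a purely random split can leave very few positive examples in the test set.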

In [None]:
bc_train = breast_cancer_df.sample(frac=0.8)
bc_test = breast_cancer_df.drop(bc_train.index)

In [None]:
bc_train['Diagnosis'].value_counts(normalize=True)

In [None]:
bc_test['Diagnosis'].value_counts(normalize=True)

In [None]:
diabetes_train = diabetes_df_downsamp.sample(frac=0.8)
diabetes_test = diabetes_df_downsamp.drop(diabetes_train.index)

In [None]:
diabetes_train['readmitted'].value_counts(normalize=True)

In [None]:
diabetes_test['readmitted'].value_counts(normalize=True)

## 3. Introduction to AutoGluon Tabular Predictor
Now we will see how AutoGluon's `TabularPredictor` class can be used to automatically fit a weighted ensemble on the breast cancer dataset, with automatic K-fold cross validation, bagging, and stacking, and with a large suite of models evaluated for inclusion in the final ensemble.

In [None]:
from autogluon.tabular import TabularPredictor

Here we construct a `TabularPredictor` by calling its constructor, specifying the name of the column containing our labels/targets as well as the evaluation metric we want AutoGluon to use to score the models it considers for inclusion in its final ensemble. We choose `'roc_auc'` for AUROC (Area Under the ROC Curve), which is a reasonable default for binary classification problems because it does not depend on a particular decision threshold and is less sensitive to class imbalance than accuracy.
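
To build some intuition for this metric, AUROC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. The helper `auroc` and the toy labels and scores are hypothetical, and the pairwise loop is only practical for small inputs.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A pair counts 1 if the positive scores higher, 0.5 on a tie, 0 otherwise
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = malignant) and predicted probabilities
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auroc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

Because only the ordering of the scores matters, the value does not change if a different decision threshold is chosen later.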

We then call the `.fit()` method on the `TabularPredictor` and pass it our training data along with two more parameters:
1. `num_bag_folds` - The number of data folds used in model bagging and K-fold cross validation. According to the AutoGluon documentation, this increases training time by roughly a factor of k.
2. `excluded_model_types` - Allows us to specify models to leave out of the leaderboard training and ensemble model for faster training time. Here we exclude neural network models and CatBoost, which can take longer to train than our other models. Feel free to comment out/delete this line if you would prefer to train an ensemble that includes these models.
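
To make the idea behind `num_bag_folds` concrete, the sketch below splits row indices into k folds so that every row is held out for validation exactly once. The helper `kfold_indices` and its round-robin assignment are illustrative assumptions; AutoGluon's actual fold assignment may differ (for example, it may shuffle or stratify).

```python
def kfold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k (train, validation) pairs."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]  # round-robin assignment
    splits = []
    for i, val_idx in enumerate(folds):
        # Training rows are all rows that are not in the current validation fold
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        splits.append((train_idx, val_idx))
    return splits

splits = kfold_indices(n_rows=9, k=3)
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

With k folds, each model type is fitted k times, which is where the roughly k-fold increase in training time comes from.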

In [None]:
# here we construct and fit a TabularPredictor on the breast cancer training data
predictor_bc = TabularPredictor( # construct the predictor
    label='Diagnosis', eval_metric='roc_auc'
).fit( # call the fit method
    bc_train,
    num_bag_folds=3,
    excluded_model_types=['NN_TORCH', 'FASTAI', 'CAT'] # exclude neural nets and CatBoost for faster training
)

After fitting our model, we can call `.evaluate()` to see how it performs on the held-out test data. We can also call the `.leaderboard()` method to see a breakdown of performance across the various models trained and evaluated in the ensemble.

In [None]:
# Now we evaluate the Breast Cancer model on the test data
predictor_bc.evaluate(bc_test)

In [None]:
# Output summary of information about models produced during fit() as a pd.DataFrame. Includes information on test and validation scores for all models, model training times, inference times, and stack levels
predictor_bc.leaderboard(bc_test)

## 4. Comparison with Scikit Learn Toy Implementation of Bagging + Stacking
We see that our accuracy and precision rival those of the benchmark models listed on the UCI Machine Learning Repository page for this dataset.


Now, for illustrative purposes, we will take a brief look at how much code it takes to implement a similar *(highly simplified)* k-fold bagging + stacking ensemble, like the one AutoGluon builds automatically, using Scikit-learn, another popular machine learning framework for Python.


In [None]:
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier
from tqdm import tqdm

In [None]:
# our target labels are the text characters 'M' and 'B'
# here we convert them to numeric 1/0 labels for the Scikit-learn and XGBoost classifiers
bc_train_binary = bc_train.assign(
    binary_label=lambda x: x['Diagnosis'].map(lambda diag: 1 if diag == 'M' else 0)
).drop(columns='Diagnosis')
bc_test_binary = bc_test.assign(
    binary_label=lambda x: x['Diagnosis'].map(lambda diag: 1 if diag == 'M' else 0)
).drop(columns='Diagnosis')

In [None]:
# separate features X from targets y
X = bc_train_binary.drop(columns=['binary_label'])
y = bc_train_binary['binary_label']
# initialize the Kfold object for doing kfold cross validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# construct arrays for storing the out-of-fold predictions for the models
oof_preds_rf = np.zeros(len(X))
oof_preds_xgb = np.zeros(len(X))

# specify the classifier used in each layer, its out-of-fold predictions,
# and a list ("bag") that will hold the fitted models for that layer
layers = [RandomForestClassifier, XGBClassifier]
layer_preds = [oof_preds_rf, oof_preds_xgb]
layer_bags = [list(), list()]
# loop over our layers
for i, layer in enumerate(layers):
  print(f"Performing k-fold cross validation at layer {i} with {layer}")
  # do the K-fold cross validation loop
  for train_idx, val_idx in tqdm(kf.split(X), total=5):
      # split inputs and outputs into training and validation
      X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
      y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

      # if we are in a layer past the first layer, the inputs need to include
      # the predictions from the prior layer
      if i > 0:
        X_train = np.column_stack([
            X_train.to_numpy(),
            layer_preds[i-1][train_idx] # include preds from prior layer
        ])
        X_val = np.column_stack([
            X_val.to_numpy(),
            layer_preds[i-1][val_idx]
        ])

      # fit a model from the given layer on the training fold
      model = layer()
      model.fit(X_train, y_train)
      # evaluate it on the validation fold and save oof predictions
      layer_preds[i][val_idx] = model.predict_proba(X_val)[:, 1]

      # save the model in our layer bag
      layer_bags[i].append(model)

# final meta model: Weighted ensemble of the predictions from the prior layers
meta_features = np.column_stack(layer_preds)

meta_model = LogisticRegression()
meta_model.fit(meta_features, y)

Now, to evaluate this custom meta-model on the unseen test data, we need to:


1.   Get one set of predictions from each of the 5 Random Forest models in the first layer
2.   Get one set of predictions from each of the 5 XGBoost models in the second layer, appending the averaged first-layer predictions as an extra input feature
3.   Average the predictions made at each layer into a single set of predictions per layer
4.   Stack these two sets of predictions together as the final features to feed to the meta-model (Logistic Regression)
5.   Get the final predictions from the meta-model



In [None]:
X_test = bc_test_binary.drop(columns=['binary_label'])
y_test = bc_test_binary['binary_label']

layer_preds_test = []

# steps 1 and 2 - Getting separate predictions from models in each layer
for i in range(len(layer_bags)):
  # concatenate predictions from prior layer to features if needed
  if i > 0:
    X_test_stacked = np.column_stack([X_test.to_numpy(), layer_preds_test[i-1]])
  else:
    X_test_stacked = X_test

  # step 3 - compute average predictions across all models in the layer
  current_layer_preds = sum([
      model.predict_proba(X_test_stacked)[:, 1]
      for model in layer_bags[i]
  ]) / len(layer_bags[i])

  # save the predictions for the layer
  layer_preds_test.append(current_layer_preds)

# Step 4 - concatenate layer predictions
meta_features_test = np.column_stack(layer_preds_test)

# Step 5 - get final predictions
final_predictions = meta_model.predict(meta_features_test)

# get the accuracy and f1 score
print(f"F1: {f1_score(y_test, final_predictions)}")
print(f"Accuracy: {accuracy_score(y_test, final_predictions)}")


## 5. Applying AutoGluon to a more complex dataset and threshold tuning
We can see that a considerable amount of code, understanding, and index manipulation was required in order to get an ensemble implementation using Scikit-learn that approaches the functionality which AutoGluon provides in only 3-4 lines of code. The toy implementation above also does not perform any greedy model weighting or pruning of models, and only incorporated two model families in the ensemble.
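
To give a sense of what greedy model weighting can look like, the sketch below implements greedy forward selection with replacement over validation predictions, in the style of Caruana et al.'s ensemble selection. This is an assumption about the general technique, not AutoGluon's actual code. The function names, the choice of accuracy as the scoring metric, and the toy predictions are all hypothetical.

```python
def accuracy(y_true, probs, threshold=0.5):
    """Fraction of rows where the thresholded probability matches the label."""
    return sum((p >= threshold) == bool(y) for y, p in zip(y_true, probs)) / len(y_true)

def greedy_ensemble_weights(model_probs, y_true, n_rounds=10):
    """Greedy forward selection with replacement over validation predictions.

    model_probs maps a model name to its list of predicted probabilities.
    Returns weights proportional to how often each model was picked.
    """
    picks = []                         # model names chosen so far (repeats allowed)
    running_sum = [0.0] * len(y_true)  # sum of the picked models' probabilities
    for _ in range(n_rounds):
        best_name, best_score = None, -1.0
        for name, probs in model_probs.items():
            # Score the ensemble average if this model were added next
            candidate = [(s + p) / (len(picks) + 1) for s, p in zip(running_sum, probs)]
            score = accuracy(y_true, candidate)
            if score > best_score:  # ties keep the earlier model
                best_name, best_score = name, score
        picks.append(best_name)
        running_sum = [s + p for s, p in zip(running_sum, model_probs[best_name])]
    return {name: picks.count(name) / len(picks) for name in model_probs}

# Hypothetical validation labels and per-model predicted probabilities
y_val = [1, 0, 1, 0]
val_probs = {
    "rf": [0.9, 0.2, 0.4, 0.1],
    "xgb": [0.6, 0.7, 0.9, 0.3],
    "noise": [0.3, 0.6, 0.2, 0.7],
}
weights = greedy_ensemble_weights(val_probs, y_val, n_rounds=4)
print(weights)
```

In this toy run the unhelpful `noise` model is never selected, so it ends up with zero weight, which is the pruning effect the paragraph above refers to.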

Here we train another `TabularPredictor` on the Diabetes readmission data to assess how well it scales to larger datasets with heterogeneous input types. We will also see how to tune the decision threshold of a binary classifier with AutoGluon to try to achieve better recall on the positive class.

In [None]:
diabetes_train

In [None]:
predictor_diabetes = TabularPredictor( # construct the predictor
    label='readmitted', eval_metric='roc_auc'
).fit( # call the fit method
    train_data=diabetes_train,
    excluded_model_types=['NN_TORCH', 'FASTAI', 'CAT'], # exclude neural nets and CatBoost for faster training
)

In [None]:
predictor_diabetes.evaluate(diabetes_test)

The recall of 0.54 tells us that only ~54% of patients who were readmitted to a hospital were successfully classified as such by our model. If we want to tune our classification threshold to achieve a better recall/sensitivity/TPR (true-positive rate), as we often do when working with medical data and developing tests for the presence of risks/conditions, AutoGluon makes this very easy.
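
As a small worked example of how the decision threshold trades precision against recall, the sketch below scores a handful of hypothetical predicted probabilities at two thresholds. The helper `precision_recall` and the toy values are illustrative only and are not taken from the diabetes model.

```python
def precision_recall(y_true, probs, threshold):
    """Precision and recall of the positive class at a given decision threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels (1 = readmitted) and predicted probabilities
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]
toy_probs = [0.9, 0.6, 0.45, 0.3, 0.55, 0.4, 0.2, 0.1]

for threshold in (0.5, 0.35):
    precision, recall = precision_recall(toy_y, toy_probs, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the threshold from 0.5 to 0.35 raises recall from 0.50 to 0.75 on this toy data, at the cost of precision dropping from 0.67 to 0.60.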

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

In [None]:
diabetes_preds = predictor_diabetes.predict(diabetes_test)

In [None]:
ConfusionMatrixDisplay.from_predictions(diabetes_test['readmitted'], diabetes_preds)

In [None]:
# now we calibrate the decision threshold of our model, using the F1 score as the calibration metric
threshold = predictor_diabetes.calibrate_decision_threshold(metric='f1')

Conceptual Question - Why did we calibrate the decision threshold to optimize the [F1 score](https://en.wikipedia.org/wiki/F-score) (the harmonic mean of precision and recall) if our goal was to achieve higher recall? Why not just use recall as the calibration metric directly?

In [None]:
predictor_diabetes.set_decision_threshold(threshold)

In [None]:
predictor_diabetes.evaluate(diabetes_test)

In [None]:
diabetes_preds = predictor_diabetes.predict(diabetes_test)

In [None]:
ConfusionMatrixDisplay.from_predictions(diabetes_test['readmitted'], diabetes_preds)


## 6. Tackling a Text Classification Problem with AutoGluon Tabular and MultiModal
Now we will see how AutoGluon can also seamlessly handle plain-text columns, such as written reviews, using both the `TabularPredictor` class and the more sophisticated `MultiModalPredictor`, which trains a full transformer neural network to make predictions on text-based data.
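
As rough intuition for how a tabular model can consume free text at all, one classical approach is to turn each review into counts of word n-grams and treat those counts as ordinary numeric columns. The sketch below shows that idea only. It is an illustrative assumption, not a description of AutoGluon's internal text handling, and the helper `ngram_counts` and the review text are hypothetical.

```python
from collections import Counter

def ngram_counts(text, n_max=2):
    """Count word n-grams (n = 1..n_max) in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

toy_review = "no side effects no problems"  # hypothetical review text
toy_counts = ngram_counts(toy_review)
print(toy_counts.most_common(3))
```

Count-based features like these ignore word order beyond the n-gram window, which is one reason a transformer-based model can do better on longer reviews.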

In [None]:
from autogluon.multimodal import MultiModalPredictor

In [None]:
# now we load in the Drug Reviews dataset from UCI Machine learning repository
drug_reviews_druglib_com = fetch_ucirepo(id=461)

# data (as pandas dataframes)
X = drug_reviews_druglib_com.data.features

In [None]:
print(drug_reviews_druglib_com.variables)

In [None]:
X

In [None]:
# construct a 3-class label based on the numeric rating from 1-10
targets = X['rating'].map(lambda rating: 'positive' if rating >= 7 else 'neutral' if rating >= 4 else 'negative')

In [None]:
drug_reviews_df = X[['benefitsReview', 'sideEffectsReview', 'commentsReview']].assign(
    target=targets
)

In [None]:
drug_reviews_df

In [None]:
drug_reviews_df['target'].value_counts()

In [None]:
# we will downsample the three classes to achieve a smaller dataset for demonstration purposes
drug_reviews_downsamp = pd.concat([
    drug_reviews_df.query('target == "positive"').sample(n=400),
    drug_reviews_df.query('target == "neutral"').sample(n=400),
    drug_reviews_df.query('target == "negative"').sample(n=400)
])

In [None]:
drug_reviews_train = drug_reviews_downsamp.sample(frac=0.8)
drug_reviews_test = drug_reviews_downsamp.drop(drug_reviews_train.index)

In [None]:
drug_reviews_train['target'].value_counts(normalize=True)

In [None]:
drug_reviews_test['target'].value_counts(normalize=True)

In [None]:
# first we fit a TabularPredictor to the dataset to see how it performs
predictor_tab = TabularPredictor(label='target', eval_metric='acc')

In [None]:
predictor_tab.fit(drug_reviews_train)

In [None]:
# now we evaluate the model's performance on the test data
predictor_tab.evaluate(drug_reviews_test)

In [None]:
ConfusionMatrixDisplay.from_predictions(drug_reviews_test['target'], predictor_tab.predict(drug_reviews_test))

In [None]:
print(classification_report(drug_reviews_test['target'], predictor_tab.predict(drug_reviews_test)))

In [None]:
# now we construct our model using the MultiModalPredictor class
predictor = MultiModalPredictor(label='target', eval_metric='acc')

In [None]:
import gc

# release the earlier predictors to free memory before training the transformer model
del predictor_tab
del predictor_bc
del predictor_diabetes

gc.collect()

In [None]:
predictor.fit(drug_reviews_train, time_limit=180)

In [None]:
predictor.evaluate(drug_reviews_test)

In [None]:
predictions = predictor.predict(drug_reviews_test)

In [None]:
predictions.value_counts()

In [None]:
ConfusionMatrixDisplay.from_predictions(drug_reviews_test['target'], predictions)