
# Stacking and AutoML

In this notebook, we'll illustrate stacking and automatic machine learning.

The data are from S. Moro, P. Cortez and P. Rita (2014) “A Data-Driven Approach to Predict the Success of Bank Telemarketing,” Decision Support Systems 62, 22-31.

The outcome variable is a binary variable indicating whether a person subscribes to a bank term deposit (y = 1).

# Python setup

Note: Re-running the AutoML code may require downgrading NumPy and then restarting the session. After restarting, re-import NumPy. You can skip this step if you do not intend to re-run the AutoML code.
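
A quick way to decide whether the downgrade applies is to check the installed NumPy version first. The sketch below assumes that NumPy 2.x is the version that needs downgrading, which is only inferred from the install log in this notebook (2.0.2 replaced by 1.26.3). The helper `numpy_major_version` is a hypothetical name introduced for illustration.

```python
import numpy as np


def numpy_major_version(version: str) -> int:
    """Return the major version number from a version string such as '2.0.2'."""
    return int(version.split(".")[0])


# Assumption: only NumPy 2.x requires the downgrade step
if numpy_major_version(np.__version__) >= 2:
    print(f"NumPy {np.__version__} detected; the downgrade step may be needed.")
else:
    print(f"NumPy {np.__version__} detected; the downgrade step can likely be skipped.")
```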

In [1]:
# Downgrade NumPy and then restart the session by clicking the Runtime tab
!pip uninstall -y numpy
!pip install numpy==1.26.3

Found existing installation: numpy 2.0.2
Uninstalling numpy-2.0.2:
  Successfully uninstalled numpy-2.0.2
Collecting numpy==1.26.3
  Downloading numpy-1.26.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Installing collected packages: numpy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.3 which is incompatible.
Successfully installed numpy-1.26.3


As usual, we'll start by importing the libraries we'll use.

In [None]:
!pip install flaml
!pip install xgboost==1.6.0

!pip install h2o

# Import relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import roc_curve, roc_auc_score

# Load and examine data

We'll import the data from the course GitHub repository and apply the same data cleaning and pre-processing as in [the previous notebook for this example](https://colab.research.google.com/github/chansen776/MBA-ML-Course-Materials/blob/main/Code/BankDepositExample2.ipynb).

In [None]:
# Import data
file = "https://raw.githubusercontent.com/chansen776/MBA-ML-Course-Materials/main/Data/bank-additional-full.csv"
rawdata = pd.read_csv(file, sep=";")
print(rawdata.shape)
print(rawdata.columns)
print(rawdata.dtypes)

In [None]:
# Recode outcome from "yes" and "no" to 1 and 0
# (map avoids the downcasting FutureWarning that replace raises in recent pandas)
rawdata["y"] = rawdata["y"].map({"no": 0, "yes": 1})

In [None]:
# Create variable indicating not previously contacted and replace 999's in pdays with 0's
rawdata["never_contacted"] = np.where(rawdata["pdays"] == 999, 1, 0)
rawdata["never_contacted"] = rawdata["never_contacted"].astype("category")
rawdata["pdays"] = np.where(rawdata["pdays"] == 999, 0, rawdata["pdays"])
rawdata["pdays"].describe()

In [None]:
# Drop duration column
rawdata = rawdata.drop(columns=["duration"])

In [None]:
# Split the data into training (80%) and validation (20%) sets
train, val = train_test_split(rawdata, test_size=0.2, random_state=94)

# Stacking

It's hard to know the right model to try in any given setting.

It is also possible that different models perform better on different observations within the same data, i.e., there may be no single "best" model.

Rather than choose a single model, we might instead wish to combine the candidate models to try to obtain a better overall prediction rule.

Stacking is one approach to building such a model combination. The basic idea is to take the (out-of-sample) predictions from each candidate model and simply use them as predictors in a new (usually simple) model aimed at predicting the outcome.

For regression tasks, the stacking model is usually built by applying standard least squares linear regression to predict the outcome with the (out-of-sample) predictions from the baseline prediction rules as features.
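
As a minimal sketch of the regression case, the code below builds out-of-sample predictions with cross-validation and then regresses the outcome on them by least squares. The synthetic data, the two base learners, and all variable names are assumptions chosen for illustration; they are not part of the bank deposit example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic data, used only for illustration
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Two hypothetical base learners
base_learners = {
    "ridge": Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Out-of-sample predictions: each observation is predicted by a model
# that was fit on the folds that do not contain it
cv = KFold(n_splits=5, shuffle=True, random_state=0)
oos_preds = np.column_stack(
    [cross_val_predict(model, X, y, cv=cv) for model in base_learners.values()]
)

# Stacking step: least squares of the outcome on the out-of-sample predictions
stacker = LinearRegression().fit(oos_preds, y)
stacking_weights = dict(zip(base_learners, stacker.coef_))
print(stacking_weights)
```

Each coefficient can be read as the weight the stacked prediction places on the corresponding base learner.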

For binary classification tasks, a simple baseline is to build the stacking model by applying logistic regression to predict the outcome with the (out-of-sample) predictions from the baseline prediction rules as features.
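
The classification case can be sketched by hand in the same way: collect out-of-sample predicted probabilities from each base learner and fit a logistic regression on them. The synthetic data, the choice of base learners, and the variable names below are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary outcome, used only for illustration
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)

# Two hypothetical base learners
base_learners = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Out-of-sample P(y = 1) from each base learner, one column per learner
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oos_probs = np.column_stack(
    [
        cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
        for model in base_learners.values()
    ]
)

# Stacking step: logistic regression of the outcome on those probabilities
meta_model = LogisticRegression(max_iter=1000).fit(oos_probs, y)

# To score new data, refit each base learner on all the training data,
# then pass their predicted probabilities to the meta-model
X_new = X[:5]  # stand-in for new observations
new_probs = np.column_stack(
    [model.fit(X, y).predict_proba(X_new)[:, 1] for model in base_learners.values()]
)
stacked_prob = meta_model.predict_proba(new_probs)[:, 1]
print(stacked_prob)
```

scikit-learn's `StackingClassifier` automates these steps, which is why the rest of this section uses it rather than the manual version.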

Here, we're going to apply stacking in the bank deposit example using the models [we previously tried.](https://colab.research.google.com/github/chansen776/MBA-ML-Course-Materials/blob/main/Code/BankDepositExample2.ipynb)


In [None]:
import warnings
from sklearn.exceptions import ConvergenceWarning

# Suppress only ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Prepare training and validation data
X_train = train.drop(columns=["y"])
y_train = train["y"]
X_val = val.drop(columns=["y"])
y_val = val["y"]

# Get dummy variables for categorical features
X_train_full = pd.get_dummies(X_train, drop_first=False)
X_val_full = pd.get_dummies(X_val, drop_first=False)

# Align validation data to ensure same columns as training data
X_val_full = X_val_full.reindex(columns=X_train_full.columns, fill_value=0)

# Prepare data specifically for Logistic Regression (drop one dummy variable per category)
X_train_logistic = pd.get_dummies(X_train, drop_first=True)
X_val_logistic = pd.get_dummies(X_val, drop_first=True)
X_val_logistic = X_val_logistic.reindex(columns=X_train_logistic.columns, fill_value=0)


# Define a custom wrapper that stores the reduced design matrices for logistic regression.
# Note: fit, predict, and predict_proba ignore the X argument and use the stored matrices.
class CustomLogisticRegression:
    def __init__(self, model, X_train, X_val):
        self.model = model
        self.X_train = X_train
        self.X_val = X_val

    def fit(self, X, y):
        self.model.fit(self.X_train, y)
        return self  # Return self, following the scikit-learn convention

    def predict(self, X):
        return self.model.predict(self.X_val)

    def predict_proba(self, X):
        return self.model.predict_proba(self.X_val)


# Instantiate individual models
logistic_model = CustomLogisticRegression(
    model=LogisticRegression(max_iter=1000, random_state=94, penalty=None),
    X_train=X_train_logistic,
    X_val=X_val_logistic,
)
rf_model = RandomForestClassifier(
    n_estimators=1000, min_samples_leaf=15, random_state=94
)
gbc_model = GradientBoostingClassifier(
    random_state=94, learning_rate=0.1, max_depth=4, n_estimators=60
)

gbcES = GradientBoostingClassifier(
    learning_rate=0.1,
    max_depth=4,
    n_estimators=200,
    validation_fraction=0.2,
    n_iter_no_change=10,
    tol=1e-4,
    random_state=94,
)


# Constant model: predicts the training base rate for every observation
class ConstantModel(BaseEstimator, ClassifierMixin):
    # No __init__ or get_params needed: BaseEstimator supplies both
    # for an estimator without hyperparameters

    def fit(self, X, y):
        self.mean_ = np.mean(y)  # Share of positives in the data passed to fit
        self.classes_ = np.unique(y)
        return self  # Return self for compatibility with scikit-learn estimators

    def predict_proba(self, X):
        # Same [P(y = 0), P(y = 1)] row for every observation
        return np.full((X.shape[0], 2), [1 - self.mean_, self.mean_])

    def predict(self, X):
        # Always predict class 0 (subscriptions are the minority class)
        return np.zeros(X.shape[0], dtype=int)


constant_model = ConstantModel()
constant_model.fit(X_train_full, y_train)

# Create a stacking classifier with Logistic Regression as the meta-model
stacking_model = StackingClassifier(
    estimators=[
        # The bare LogisticRegression (not the wrapper) is passed here, so it is
        # fit on the full design matrix like the other base learners
        ("logistic", logistic_model.model),
        ("random_forest", rf_model),
        ("gradient_boosting", gbc_model),
        ("gradient_boosting_es", gbcES),
        ("constant", constant_model),
    ],
    final_estimator=LogisticRegression(max_iter=1000, random_state=94),
)

# Fit the stacking model: the base learners see the full design matrix, and the
# meta-model is fit on their cross-validated predictions
stacking_model.fit(X_train_full, y_train)

# Make predictions on the validation set
y_pred_stack = stacking_model.predict(X_val_full)
y_pred_prob_stack = stacking_model.predict_proba(X_val_full)[:, 1]

# Evaluate the stacking model
stacking_classification_metrics = classification_report(
    y_val, y_pred_stack, output_dict=True
)
print(pd.DataFrame(stacking_classification_metrics))

ConfusionMatrixDisplay.from_predictions(y_val, y_pred_stack)
plt.show()

# ROC Curve
fpr_stack, tpr_stack, thresholds_stack = roc_curve(y_val, y_pred_prob_stack)
roc_auc_stack = roc_auc_score(y_val, y_pred_prob_stack)

plt.plot(fpr_stack, tpr_stack, label=f"Stacking (area = {roc_auc_stack:.2f})")

# Plot individual models for comparison
# Logistic Regression
logistic_model.model.fit(X_train_logistic, y_train)
y_pred_prob_logistic = logistic_model.model.predict_proba(X_val_logistic)[:, 1]
fpr_logistic, tpr_logistic, _ = roc_curve(y_val, y_pred_prob_logistic)
roc_auc_logistic = roc_auc_score(y_val, y_pred_prob_logistic)
plt.plot(
    fpr_logistic,
    tpr_logistic,
    label=f"Logistic Regression (area = {roc_auc_logistic:.2f})",
)

# Random Forest
rf_model.fit(X_train_full, y_train)
y_pred_prob_rf = rf_model.predict_proba(X_val_full)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_val, y_pred_prob_rf)
roc_auc_rf = roc_auc_score(y_val, y_pred_prob_rf)
plt.plot(fpr_rf, tpr_rf, label=f"Random Forest (area = {roc_auc_rf:.2f})")

# Gradient Boosting
gbc_model.fit(X_train_full, y_train)
y_pred_prob_gbc = gbc_model.predict_proba(X_val_full)[:, 1]
fpr_gbc, tpr_gbc, _ = roc_curve(y_val, y_pred_prob_gbc)
roc_auc_gbc = roc_auc_score(y_val, y_pred_prob_gbc)
plt.plot(fpr_gbc, tpr_gbc, label=f"Gradient Boosting (area = {roc_auc_gbc:.2f})")

# Gradient Boosting with Early Stopping
gbcES.fit(X_train_full, y_train)
y_pred_prob_gbcES = gbcES.predict_proba(X_val_full)[:, 1]
fpr_gbcES, tpr_gbcES, _ = roc_curve(y_val, y_pred_prob_gbcES)
roc_auc_gbcES = roc_auc_score(y_val, y_pred_prob_gbcES)
plt.plot(
    fpr_gbcES,
    tpr_gbcES,
    label=f"Gradient Boosting Early Stopping (area = {roc_auc_gbcES:.2f})",
)

# Constant Model
y_pred_prob_constant = constant_model.predict_proba(X_val_full)[:, 1]
fpr_constant, tpr_constant, _ = roc_curve(y_val, y_pred_prob_constant)
roc_auc_constant = roc_auc_score(y_val, y_pred_prob_constant)
plt.plot(
    fpr_constant, tpr_constant, label=f"Constant Model (area = {roc_auc_constant:.2f})"
)

# Finalize the plot
plt.plot([0, 1], [0, 1], color="black", lw=2, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

In [None]:
# Weights ordered same way as learners - logistic, random forest,
# boosting, boosting early stopping, constant
stacking_weights = (
    stacking_model.final_estimator_.coef_
    if hasattr(stacking_model.final_estimator_, "coef_")
    else None
)
if stacking_weights is not None:
    print("Stacking Weights:")
    print(stacking_weights)
else:
    print("Final estimator does not support coefficients.")

# FLAML

In [None]:
warnings.filterwarnings("ignore", category=UserWarning)

from flaml import AutoML

X_train = train.drop(columns=["y"])
y_train = train["y"]
X_val = val.drop(columns=["y"])
y_val = val["y"]

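# Get dummy variables for our categorical features.
X_train = pd.get_dummies(X_train, drop_first=False)
X_val = pd.get_dummies(X_val, drop_first=False)

# Align validation data to ensure same columns as training data
X_val = X_val.reindex(columns=X_train.columns, fill_value=0)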

model_config = {
    "task": "classification",
    "time_budget": 300,  # time budget in seconds
    "eval_method": "cv",
    "split_type": KFold(n_splits=5, shuffle=True, random_state=94),
    "ensemble": True,
    "verbose": 0,  # only display warnings
}


automl = AutoML()
automl.fit(X_train, y_train, **model_config)

In [None]:
print(automl.model)
print(
    "The ",
    automl.best_iteration,
    "-th iteration is the best, completed in ",
    round(automl.time_to_find_best_model, 1),
    " seconds.",
    sep="",
)

In [None]:
y_pred_flaml = automl.predict(X_val)
y_pred_prob_flaml = automl.predict_proba(X_val)[:, 1]

# Evaluate the model
flaml_classification_metrics = classification_report(
    y_val, y_pred_flaml, output_dict=True
)
print(pd.DataFrame(flaml_classification_metrics))

ConfusionMatrixDisplay.from_predictions(y_val, y_pred_flaml)
plt.show()

# ROC Curve
fpr_flaml, tpr_flaml, thresholds_flaml = roc_curve(y_val, y_pred_prob_flaml)
roc_auc_flaml = roc_auc_score(y_val, y_pred_prob_flaml)

plt.plot(
    fpr_logistic,
    tpr_logistic,
    label=f"Logistic Regression (area = {roc_auc_logistic:.2f})",
)
plt.plot(fpr_rf, tpr_rf, label=f"Random Forest (area = {roc_auc_rf:.2f})")
plt.plot(fpr_gbc, tpr_gbc, label=f"Gradient Boosting (area = {roc_auc_gbc:.2f})")
plt.plot(
    fpr_gbcES,
    tpr_gbcES,
    label=f"Gradient Boosting Early Stopping (area = {roc_auc_gbcES:.2f})",
)
plt.plot(fpr_flaml, tpr_flaml, label=f"FLAML (area = {roc_auc_flaml:.2f})")
plt.plot([0, 1], [0, 1], color="black", lw=2, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

# H2O

In [None]:
import h2o
from h2o.automl import H2OAutoML

h2o.init()

In [None]:
# Convert pandas DataFrame to H2OFrame
train_h2o = h2o.H2OFrame(train)
val_h2o = h2o.H2OFrame(val)

# Need to convert y variable to factor if going to use h2o
train_h2o["y"] = train_h2o["y"].asfactor()
val_h2o["y"] = val_h2o["y"].asfactor()

In [None]:
# Takes ~ 30 minutes to run

X_train = train.drop(columns=["y"])
y_train = train["y"]
X_val = val.drop(columns=["y"])
y_val = val["y"]

# Run AutoML for 20 base models
aml = H2OAutoML(max_models=20, seed=42, nfolds=5)
aml.train(x=list(X_train.columns), y=y_train.name, training_frame=train_h2o)

# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)

In [None]:
# Evaluate the best model on the validation data
y_pred_aml_all = aml.leader.predict(val_h2o)  # Returns 3 columns.
# First column = binary prediction
# Second column = probability y = 0
# Third column = probability y = 1

# Flatten the single-column results to 1-D arrays for scikit-learn
y_pred_aml = y_pred_aml_all[:, 0].as_data_frame(use_pandas=True).to_numpy().ravel()
y_pred_prob_aml = y_pred_aml_all[:, 2].as_data_frame(use_pandas=True).to_numpy().ravel()

# Evaluate the model
aml_classification_metrics = classification_report(y_val, y_pred_aml, output_dict=True)
print(pd.DataFrame(aml_classification_metrics))

ConfusionMatrixDisplay.from_predictions(y_val, y_pred_aml)
plt.show()

# ROC Curve
fpr_aml, tpr_aml, thresholds_aml = roc_curve(y_val, y_pred_prob_aml)
roc_auc_aml = roc_auc_score(y_val, y_pred_prob_aml)

plt.plot(
    fpr_logistic,
    tpr_logistic,
    label=f"Logistic Regression (area = {roc_auc_logistic:.2f})",
)
plt.plot(fpr_rf, tpr_rf, label=f"Random Forest (area = {roc_auc_rf:.2f})")
plt.plot(fpr_gbc, tpr_gbc, label=f"Gradient Boosting (area = {roc_auc_gbc:.2f})")
plt.plot(
    fpr_gbcES,
    tpr_gbcES,
    label=f"Gradient Boosting Early Stopping (area = {roc_auc_gbcES:.2f})",
)
plt.plot(fpr_flaml, tpr_flaml, label=f"FLAML (area = {roc_auc_flaml:.2f})")
plt.plot(fpr_aml, tpr_aml, label=f"H2O AutoML (area = {roc_auc_aml:.2f})")
plt.plot([0, 1], [0, 1], color="black", lw=2, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()