### Direct Marketing in Banking - Propensity Modelling with Tabular Data

# Part 2: XGBoost and Hyperparameter Optimization

> *This notebook works well with the `Python 3 (Data Science 3.0)` kernel on SageMaker Studio*

This workshop explores a tabular, [binary classification](https://en.wikipedia.org/wiki/Binary_classification) use-case with significant **class imbalance**: predicting which of a bank's customers are likely to respond to a targeted marketing campaign.

In this second notebook, you'll tackle the challenge with the [SageMaker built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) and [automatic hyperparameter tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html).

> ⚠️ **You must** have run [Notebook 1 Autopilot and AutoGluon.ipynb](1%20Autopilot%20and%20AutoGluon.ipynb) before this notebook (at least to the point of having queried a data snapshot from SageMaker Feature Store)


## Contents

> ℹ️ **Tip:** You can use the Table of Contents panel in the left sidebar on JupyterLab / SageMaker Studio, to view and navigate sections

1. **[Prepare our environment](#Prepare-our-environment)**
1. **[Understand the algorithm requirements](#Understand-the-algorithm-requirements)**
1. **[Prepare training and test data](#Prepare-training-and-test-data)**
1. **[Train a model](#Train-a-model)**
1. **[Batch inference](#Batch-inference)**
1. **[Hyperparameter Optimization (HPO)](#Hyperparameter-Optimization-(HPO))**
1. **[Deploy and test the optimized model](#Deploy-and-test-the-optimized-model)**
1. **[Conclusions](#Conclusions)**

## Prepare our environment

As in the previous notebook, we'll start by importing libraries and configuring AWS/SageMaker service connections:

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import json
import time

# External Dependencies:
import boto3  # General-purpose AWS SDK for Python
import numpy as np  # For matrix operations and numerical processing
import pandas as pd  # Tabular data utilities
import sagemaker  # High-level SDK specifically for Amazon SageMaker

# Local Helper Functions:
import util

# Setting up SageMaker parameters
sgmk_session = sagemaker.Session()  # Connect to SageMaker APIs
region = sgmk_session.boto_session.region_name  # The AWS Region we're using (e.g. 'ap-southeast-1')
bucket_name = sgmk_session.default_bucket()  # Select an Amazon S3 bucket
bucket = boto3.resource("s3").Bucket(bucket_name)
bucket_prefix = "sm101/direct-marketing"  # Location in the bucket to store our files
sgmk_role = sagemaker.get_execution_role()  # IAM Execution Role to use for permissions

print(f"s3://{bucket_name}/{bucket_prefix}")
print(sgmk_role)

## Understand the algorithm requirements

As discussed on the [XGBoost algorithm doc page](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html), there are 2 ways to use XGBoost in SageMaker: As a pre-built algorithm (no script required), or as a framework (with your own custom training script).

In this example we'll use the pre-built algorithm mode, so we only need to fetch the container image URI:

In [None]:
train_image_uri = sagemaker.image_uris.retrieve("xgboost", region=region, version="1.5-1")
print(train_image_uri)

## Prepare training and test data

We'll **re-use the snapshot** queried from SageMaker Feature Store in the previous notebook, reading all CSVs under the S3 prefix into a combined dataframe.

▶️ **Check** the `data_extract_s3uri` here matches your `data_extract_s3uri` from notebook 1

In [None]:
data_extract_s3uri = f"s3://{bucket_name}/{bucket_prefix}/data-extract"
data_extract_prefix = data_extract_s3uri[len("s3://"):].partition("/")[2]

full_df = pd.concat(
    [
        pd.read_csv(f"s3://{s3obj.bucket_name}/{s3obj.key}")
        for s3obj in bucket.objects.filter(Prefix=data_extract_prefix)
        if s3obj.key.lower().endswith(".csv")
    ],
    axis=0,
)
full_df

However, some extra data preparation is required because, at the time of writing, this XGBoost algorithm version doesn't fully support string categorical features.

Below we **one-hot encode the categorical fields** before splitting the data into train and test sets as done previously:

In [None]:
df_model_data = full_df.drop(
    columns=[
        "customer_id", "event_time", "write_time", "api_invocation_time", "is_deleted", "row_number"
    ],
    errors="ignore",  # Your DF may not have 'row_number' if you didn't do a time travel query
)
df_model_data

# One-hot encode string categorical fields (dtype=int keeps 0/1 values, not True/False, in the CSVs):
df_model_data = pd.get_dummies(df_model_data, dtype=int)

# Shuffle and split the dataset (70% train, 20% validation, 10% test):
train_data, validation_data, test_data = np.split(
    df_model_data.sample(frac=1, random_state=1729),
    [int(0.7 * len(df_model_data)), int(0.9 * len(df_model_data))],
)

# Create CSV files for Train / Validation / Test
train_data.to_csv("data/train.csv", index=False, header=False)
validation_data.to_csv("data/validation.csv", index=False, header=False)
test_data.to_csv("data/test.csv", index=False, header=False)

df_model_data
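To make the encoding step more concrete, here is a minimal, dependency-free sketch of what one-hot encoding does to a string column. The `one_hot_encode` helper and the toy records are hypothetical illustrations (not part of the workshop's `util` module, and not how pandas implements it); the `<column>_<value>` naming simply mirrors the default naming of `pd.get_dummies`.

```python
def one_hot_encode(records, categorical_columns):
    """Replace each listed string column with 0/1 indicator columns."""
    # Collect the distinct values seen in each categorical column
    levels = {
        col: sorted({rec[col] for rec in records}) for col in categorical_columns
    }
    encoded = []
    for rec in records:
        # Keep non-categorical fields unchanged, in their original order
        row = {k: v for k, v in rec.items() if k not in categorical_columns}
        # Append one indicator field per distinct value of each categorical column
        for col in categorical_columns:
            for level in levels[col]:
                row[f"{col}_{level}"] = int(rec[col] == level)
        encoded.append(row)
    return encoded


# Toy records for illustration only (not taken from the workshop dataset):
toy_records = [
    {"y": 0, "age": 41, "marital": "married"},
    {"y": 1, "age": 29, "marital": "single"},
    {"y": 0, "age": 55, "marital": "divorced"},
]
encoded_records = one_hot_encode(toy_records, ["marital"])
for row in encoded_records:
    print(row)
```

Note that the numeric fields (including the target `y`) stay in place and only the string column is expanded, which is why the number of columns grows with the number of distinct category values.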

The algorithm-specific datasets can then be uploaded to Amazon S3, as we did for AutoGluon-Tabular:

In [None]:
model_data_s3uri = f"s3://{bucket_name}/{bucket_prefix}/model-data-xgb"

train_data_s3uri = model_data_s3uri + "/train/data.csv"
train_data.to_csv(train_data_s3uri, index=False, header=False)
validation_data_s3uri = model_data_s3uri + "/validation/data.csv"
validation_data.to_csv(validation_data_s3uri, index=False, header=False)
test_data_s3uri = model_data_s3uri + "/test/data.csv"
test_data.to_csv(test_data_s3uri, index=False, header=False)

## Train a model

With the parameters collected and data prepared in a compatible format, we're ready to train an initial model.

As in the previous AutoGluon example, this process uses the [Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) SDK class to define and run the training job.

Unlike the AutoGluon example:

- The XGBoost algorithm expects separate `train` and `validation` channels instead of folders under a single `training` prefix
- We'll demonstrate using [SageMaker managed spot](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html) to save cost, since this algorithm can train on a more basic CPU-only instance type

In [None]:
%%time

xgb_estimator = sagemaker.estimator.Estimator(
    base_job_name="xgboost",
    role=sgmk_role,  # IAM role for job permissions
    image_uri=train_image_uri,  # XGBoost built-in algorithm container
    instance_count=1,
    instance_type="ml.m5.xlarge",  # Type of compute instance
    max_run=25 * 60,  # Limit job to 25 minutes
    
    use_spot_instances=True,  # Use spot instances to reduce cost
    max_wait=30 * 60,  # Maximum clock time (including spot delays)

    output_path=f"s3://{bucket_name}/{bucket_prefix}/train-output",
)

xgb_estimator.set_hyperparameters(
    num_round=50,  # int: [1,300]
    max_depth=5,  # int: [1,10]
    alpha=2.5,  # float: [0,5]
    eta=0.5,  # float: [0,1]
    objective="binary:logistic",
    eval_metric="auc",
)

# Launch a SageMaker Training job by passing the S3 path of the datasets:
xgb_estimator.fit({
    "train": sagemaker.inputs.TrainingInput(train_data_s3uri, content_type="csv"),
    "validation": sagemaker.inputs.TrainingInput(validation_data_s3uri, content_type="csv"),
})
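If you'd like to quantify what managed spot saved on a job, one option is to compare the `TrainingTimeInSeconds` and `BillableTimeInSeconds` fields returned by the DescribeTrainingJob API. The sketch below is a hedged illustration: `managed_spot_savings` is a hypothetical helper and the example values are made up. In the notebook you could instead pass in the result of `boto3.client("sagemaker").describe_training_job(TrainingJobName=xgb_estimator.latest_training_job.name)`.

```python
def managed_spot_savings(job_description):
    """Return the fraction of training time that was not billed."""
    training_seconds = job_description["TrainingTimeInSeconds"]
    billable_seconds = job_description["BillableTimeInSeconds"]
    if training_seconds <= 0:
        return 0.0  # Avoid dividing by zero for jobs that never started training
    return 1.0 - billable_seconds / training_seconds


# Hypothetical values for illustration only (not real job output):
example_description = {"TrainingTimeInSeconds": 120, "BillableTimeInSeconds": 42}
print(f"Managed spot savings: {managed_spot_savings(example_description):.0%}")
```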

## Batch inference

In the previous notebook we deployed our model to a real-time endpoint and made inference requests to test its accuracy - before shutting the endpoint down to release the infrastructure.

But SageMaker can instead orchestrate batch inference for us with [SageMaker Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html): spinning up a temporary cluster and shutting it down as soon as all the input data is processed.

To get started, you can create a [Transformer object](https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html) directly from `estimator.transformer(...)`. However, in this example we'll go via `create_model()` first so we can easily register the model later:

In [None]:
xgb_model = xgb_estimator.create_model()

Because SageMaker Batch Transform orchestrates the process of sending the data through the model and consolidating the outputs, there is a range of extra parameters beyond the basic output location and instance size/type.

By default, SageMaker Batch Transform treats each file in the input S3 prefix as one request payload and generates an output file of the same name, appending `.out`. Below we configure more specific handling for tabular data though:

- Interpret each line of input files as a separate record with `split_type`, and interpret each line of output data as separate record with `assemble_with`.
- Make `MultiRecord` batch requests up to `max_payload` Megabytes each - allowing up to `max_concurrent_transforms` concurrent requests per instance.
- Exclude the `y` target label column (which is present in the test data) from model requests with `input_filter`.
- Include the input data as well as the predictions in the result with `join_source`.

The result will still be a single `.csv.out` file for each `.csv` input, but SageMaker has control of individual request batch sizes to optimize resource use.
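As a rough mental model of how these options combine, the sketch below mimics the record handling locally in plain Python. It is only an illustration under simplifying assumptions: `simulate_transform_record` and `dummy_model` are hypothetical stand-ins, and the real service also handles batching, payload limits and concurrency.

```python
received_by_model = []  # Records what the (fake) model was actually sent


def dummy_model(model_input):
    """Hypothetical stand-in for the XGBoost container: returns a fixed score."""
    received_by_model.append(model_input)
    return "0.25"


def simulate_transform_record(csv_line, predict_fn):
    """Mimic input_filter="$[1:]" and join_source="Input" for one CSV record."""
    fields = csv_line.split(",")
    model_input = ",".join(fields[1:])  # input_filter: drop the leading target field
    prediction = predict_fn(model_input)  # Only the filtered record reaches the model
    return f"{csv_line},{prediction}"  # join_source: original record plus prediction


input_file = "1,35,0,1\n0,52,1,0"  # Two records: target first, then features
output_lines = [
    simulate_transform_record(line, dummy_model)  # split_type="Line": one record per line
    for line in input_file.split("\n")
]
print("\n".join(output_lines))  # assemble_with="Line": newline-delimited output
```

The joined output keeps the original target column, which is what lets us compare actual and predicted values directly from the result file.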

In [None]:
eval_s3uri = f"s3://{bucket_name}/{bucket_prefix}/xgb-evaluation"

xgb_transformer = xgb_model.transformer(
    output_path=eval_s3uri,  # S3 output location
    instance_count=1,  # Number of instances to spin up for the job
    instance_type="ml.m5.large",  # Instance type to use for inference
    strategy="MultiRecord",  # Request inference in batches, for efficiency
    accept="text/csv",  # Request CSV response format
    assemble_with="Line",  # Consolidate response records with newlines between
    max_concurrent_transforms=2,  # Each instance is sent up to N requests concurrently
    max_payload=1,  # Max size per request (in Megabytes)
)

xgb_transformer.base_transform_job_name = "sm101-dm-xgboost"
xgb_transformer.transform(
    test_data_s3uri,
    content_type="text/csv",  # Test data is in CSV format
    split_type="Line",  # Each line of test data is a separate record
    join_source="Input",  # Output joined data including the input features as well as prediction
    input_filter="$[1:]",  # Exclude the leading (actual target value) field
    # wait=True,  # (Default True) Block the notebook kernel until the job completes
    # logs=True,  # (Default True) Stream job logs to the notebook
)

Once the job completes, we can read the dataframe direct from Amazon S3:

In [None]:
df_eval = pd.read_csv(
    eval_s3uri + "/data.csv.out",
    header=None,
    names=test_data.columns.tolist() + ["y_prob"],
)
df_eval

This algorithm only outputs positive-class probability scores for binary classification - not including assigned class labels like AutoGluon-Tabular.

For assessing performance we can either assume a particular `decision_threshold` (for example, scores over 0.5 are assigned to class 1) - or take whatever threshold maximises the F1 score of the model.
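To illustrate the second option, here is a minimal sketch of picking an F1-maximising threshold by brute force. The helper names and toy data are hypothetical, and this is not necessarily how `util.reporting` does it internally; the sketch also assumes that scores greater than or equal to the threshold are assigned to the positive class.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when scores >= threshold are assigned to the positive class."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    if tp == 0:
        return 0.0  # No true positives means precision and recall are both zero
    return 2 * tp / (2 * tp + fp + fn)


def best_f1_threshold(y_true, y_prob):
    """Try each distinct score as a threshold and keep the one with the highest F1."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, y_prob, t))


# Toy labels and scores for illustration only:
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [0.05, 0.20, 0.35, 0.40, 0.70, 0.90]
toy_threshold = best_f1_threshold(toy_labels, toy_scores)
print(toy_threshold, f1_at_threshold(toy_labels, toy_scores, toy_threshold))
```

Because the threshold is chosen on the same data used for reporting, the resulting F1 is an optimistic estimate; fixing `decision_threshold` in advance avoids that.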

Run the cell below to produce a model quality report similar to our AutoGluon-Tabular example earlier:

In [None]:
report = util.reporting.generate_binary_classification_report(
    y_real=df_eval["y"].values,
    y_predict_proba=df_eval["y_prob"].values,
    # y_predict_label not available for XGBoost output format
    # Optionally set decision_threshold=0.5 to apply a specific threshold, instead of maximizing F1:
    # decision_threshold=0.5,
    class_names_list=["Did not enroll", "Enrolled"],
    title="Initial XGBoost model",
)

# Store the model quality report locally and on Amazon S3:
with open("data/report-xgboost.json", "w") as f:
    json.dump(report, f, indent=2)
model_quality_s3uri = f"s3://{bucket_name}/{bucket_prefix}/{xgb_model.name}/model-quality.json"
!aws s3 cp data/report-xgboost.json {model_quality_s3uri}

We can register this alternative model as a new **version** in the same **model package group** from earlier, as shown below:

In [None]:
xgb_model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    model_package_group_name="sm101-dm",
    description="Initial XGBoost model",
    model_metrics=sagemaker.model_metrics.ModelMetrics(
        model_statistics=sagemaker.model_metrics.MetricsSource(
            content_type="application/json",
            s3_uri=model_quality_s3uri,
        ),
    ),
    domain="MACHINE_LEARNING",
    task="CLASSIFICATION",
    sample_payload_url=test_data_s3uri,
)

▶️ **Open** your model group in SageMaker Model Registry

- You can `Shift+Click` or `Control+Click` to **select multiple versions** in the model group
- With multiple versions selected, you can `Right Click` to `Compare model versions` for a side-by-side comparison of different models' charts and statistics.

## Hyperparameter Optimization (HPO)

> ⏰ *Note, with the default settings below, the hyperparameter tuning job can take up to ~20 minutes to complete.*

While AutoML frameworks like AutoGluon try to encapsulate model ensembling, single-algorithm approaches like XGBoost can often benefit from **hyperparameter tuning** to find the best values for settings like `alpha`, `eta` and `max_depth` on a particular problem.

Exploring these parameter combinations by hand can be time-consuming - especially if considering more than a couple of parameters.

[SageMaker automatic model tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) can run intelligent exploration and optimization jobs for you automatically, so you can focus on building and applying insights - rather than managing these experiments.

As shown below, you can set up a [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html) wrapper around your standard `Estimator`. The key requirements are:

1. Your job outputs at least one **metric** which the tuner can maximize or minimize (this is handled automatically for most built-in algorithms)
1. Specify **ranges for the hyperparameters** you'd like to explore
1. Specify the **strategy and resource limits** for the job

SageMaker HPO supports a range of [strategies](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html) including exploratory tools like Grid and Random search, and efficient HPO-oriented optimization tools like Bayesian Optimization and Hyperband.

In this example, we'll use Bayesian search to optimize the Area Under the ROC Curve (AUC) of our XGBoost model. See [Machine Learning Key Concepts](https://docs.aws.amazon.com/machine-learning/latest/dg/amazon-machine-learning-key-concepts.html) for more info if you're unfamiliar with these metrics.
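As a rough intuition for the AUC metric, it can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison on toy data; `roc_auc` is a hypothetical helper written for illustration, not the implementation used by the algorithm container.

```python
def roc_auc(y_true, y_score):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # Ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy labels and scores for illustration only:
auc_labels = [0, 0, 1, 0, 1, 1]
auc_scores = [0.05, 0.20, 0.35, 0.40, 0.70, 0.90]
print(roc_auc(auc_labels, auc_scores))
```

Unlike F1, this metric does not depend on a decision threshold, which makes it a convenient tuning objective for imbalanced problems like this one.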

In [None]:
# import required HPO objects
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# Target metric is already built in to the algorithm, so we just specify the name:
objective = "validation:auc"

# Configure hyperparameter ranges to explore:
ranges = {
    "num_round": IntegerParameter(1, 300),
    "max_depth": IntegerParameter(1, 10),
    "alpha": ContinuousParameter(0, 5),
    "eta": ContinuousParameter(0, 1),
}

# Configure the tuner:
xgb_tuner = HyperparameterTuner(
    estimator=xgb_estimator,  # The SageMaker estimator object
    hyperparameter_ranges=ranges,
    max_jobs=15,  # Max total number of training jobs
    max_parallel_jobs=3,  # How many training jobs can run in parallel
    strategy="Bayesian",  # the internal optimization strategy of HPO
    objective_metric_name=objective,
    objective_type="Maximize",  # For AUC, higher = better
)

# Start the job:
xgb_tuner.fit(
    {
        "train": sagemaker.inputs.TrainingInput(train_data_s3uri, content_type="csv"),
        "validation": sagemaker.inputs.TrainingInput(validation_data_s3uri, content_type="csv"),
    },
    wait=True,  # Optionally block the notebook until the job is complete.
)

Note that `max_parallel_jobs` creates a trade-off between job run time and result quality: The more jobs are run in parallel, the faster the `max_jobs` will be completed, but the less information the strategy has about completed jobs when selecting parameter combinations to try next.
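A back-of-envelope way to see this trade-off is to count how many sequential batches of jobs the tuner would need. The sketch below is a deliberate simplification: `tuning_rounds` is a hypothetical helper that assumes every training job takes the same time, whereas the real tuner starts new jobs as soon as slots free up.

```python
import math


def tuning_rounds(max_jobs, max_parallel_jobs):
    """Rough number of sequential batches if every job took the same time."""
    return math.ceil(max_jobs / max_parallel_jobs)


for parallel in (1, 3, 15):
    rounds = tuning_rounds(15, parallel)
    print(f"max_parallel_jobs={parallel}: ~{rounds} sequential batch(es)")
```

With `max_parallel_jobs` equal to `max_jobs`, everything runs in a single batch, so a Bayesian strategy has no completed results to learn from and behaves much like random search.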

As with training and transform jobs, hyperparameter tuning runs separately from the notebook so won't be interrupted if you lose connection or shut down. You can track job progress in the [Training > Hyperparameter tuning jobs page](https://console.aws.amazon.com/sagemaker/home?#/hyper-tuning-jobs) of the SageMaker console and the [DescribeHyperParameterTuningJob API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeHyperParameterTuningJob.html).

You can always set `wait=False` or interrupt the cell (Square stop ⏹ button in the toolbar) to continue working in the notebook while HPO runs in the background. You can later resume waiting for the active job by calling `tuner.wait()` as shown below. Just like the Estimator, you won't be able to `deploy()` the tuner's model until the tuning job is complete.

In [None]:
xgb_tuner.wait()

The individual training jobs created by the model tuning are listed in SageMaker just like manually-created ones, and the HPO job builds up a leaderboard of models based on the objective metric.

In this example we'll simply deploy the "best" model, but you can also explore the jobs for deeper insights: See [this sample notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/hyperparameter_tuning/analyze_results/HPO_Analyze_TuningJob_Results.ipynb) for examples.
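For a sense of what the leaderboard represents, the sketch below ranks a list of training job summaries by their final objective value. The field names follow the shape of the `TrainingJobSummaries` returned by the ListTrainingJobsForHyperParameterTuningJob API, but `rank_tuning_jobs` is a hypothetical helper and the summaries are made-up values, not real tuning results.

```python
def rank_tuning_jobs(job_summaries, maximize=True):
    """Sort job summaries that have a final objective metric, best first."""
    scored = [
        job
        for job in job_summaries
        if "FinalHyperParameterTuningJobObjectiveMetric" in job
    ]
    return sorted(
        scored,
        key=lambda job: job["FinalHyperParameterTuningJobObjectiveMetric"]["Value"],
        reverse=maximize,  # Highest first when maximizing (e.g. AUC)
    )


# Hypothetical summaries for illustration only:
example_summaries = [
    {
        "TrainingJobName": "job-a",
        "TunedHyperParameters": {"eta": "0.3", "max_depth": "4"},
        "FinalHyperParameterTuningJobObjectiveMetric": {
            "MetricName": "validation:auc",
            "Value": 0.76,
        },
    },
    {
        "TrainingJobName": "job-b",
        "TunedHyperParameters": {"eta": "0.1", "max_depth": "6"},
        "FinalHyperParameterTuningJobObjectiveMetric": {
            "MetricName": "validation:auc",
            "Value": 0.79,
        },
    },
    # A job that hasn't finished yet has no final metric, so it is skipped:
    {"TrainingJobName": "job-c", "TunedHyperParameters": {"eta": "0.7", "max_depth": "2"}},
]
leaderboard = rank_tuning_jobs(example_summaries)
for job in leaderboard:
    metric = job["FinalHyperParameterTuningJobObjectiveMetric"]
    print(job["TrainingJobName"], metric["Value"], job["TunedHyperParameters"])
```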


## Deploy and test the optimized model

You can directly call the [HyperparameterTuner.deploy(...)](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html#sagemaker.tuner.HyperparameterTuner.deploy) method to deploy the winning model to an endpoint - but as before, we'll create a `Model` object first to link back to SageMaker Model Registry.

In [None]:
best_job_name = xgb_tuner.best_training_job()
print("Best training job from HPO run:", best_job_name)

hpo_model = sagemaker.estimator.Estimator.attach(best_job_name).create_model()

In [None]:
hpo_predictor = hpo_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    serializer=sagemaker.serializers.CSVSerializer(),
    deserializer=sagemaker.deserializers.CSVDeserializer(),
)

Once deployed, we can evaluate the performance by sending real-time inference requests as before with AutoGluon.

Remember that the output of this algorithm is different: in CSV format with only the positive class probability.
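Since the CSV deserializer returns nested lists of strings, the scores need flattening and converting to numbers before they can be compared with the labels. The sketch below shows the idea in plain Python; `flatten_scores` is a hypothetical helper and both example layouts are assumptions for illustration, since the exact row layout of the response may depend on the container version.

```python
def flatten_scores(deserialized_rows):
    """Flatten a nested list of numeric strings into a flat list of floats."""
    return [float(value) for row in deserialized_rows for value in row]


# Hypothetical deserialized responses in two possible layouts:
scores_in_one_row = [["0.12", "0.87", "0.40"]]
scores_one_per_row = [["0.12"], ["0.87"], ["0.40"]]

print(flatten_scores(scores_in_one_row))
print(flatten_scores(scores_one_per_row))
```

Both layouts flatten to the same list of floats, so downstream reporting code doesn't need to care which one the endpoint returned.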

In [None]:
# getting the predicted probabilities of the best model
hpo_probabilities = np.array(
    hpo_predictor.predict(test_data.drop(["y"], axis=1).values),
    dtype=float,
).squeeze()

hpo_report = util.reporting.generate_binary_classification_report(
    y_real=test_data["y"].values,
    y_predict_proba=hpo_probabilities,
    # y_predict_label not available for XGBoost output format
    # Optionally set decision_threshold=0.5 to apply a specific threshold, instead of maximizing F1:
    # decision_threshold=0.5,
    class_names_list=["Did not enroll", "Enrolled"],
    title="HP-tuned XGBoost model",
)

# Store the model quality report locally and on Amazon S3:
with open("data/report-xgbhpo.json", "w") as f:
    json.dump(hpo_report, f, indent=2)
hpo_quality_s3uri = f"s3://{bucket_name}/{bucket_prefix}/{hpo_model.name}/model-quality.json"
!aws s3 cp data/report-xgbhpo.json {hpo_quality_s3uri}

...And finally, we can register this tuned model as a third candidate version in our SageMaker Model Registry group:

In [None]:
hpo_model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    model_package_group_name="sm101-dm",
    description="HP-tuned XGBoost model",
    model_metrics=sagemaker.model_metrics.ModelMetrics(
        model_statistics=sagemaker.model_metrics.MetricsSource(
            content_type="application/json",
            s3_uri=hpo_quality_s3uri,
        ),
    ),
    domain="MACHINE_LEARNING",
    task="CLASSIFICATION",
    sample_payload_url=test_data_s3uri,
)

You can compare the model charts and statistics side-by-side in SageMaker Studio's Model Registry UI to assess their performance - as shown in the screenshot below:

![](img/model-registry-compare.png "Screenshot of side-by-side comparison in SageMaker Studio Model Registry UI")

Note that:

- F1-related comparisons may not be entirely fair: Our XGBoost models' metrics automatically inferred the F1-maximising threshold and used it to drive decisions, whereas AutoGluon-Tabular used its own threshold selection algorithm to assign labels.
- This model package group can contain versions with different I/O contracts (Our XGBoost models expect one-hot encoded inputs, and our AutoGluon model produces JSON output instead of CSV). You could consider also attaching [data quality reports](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-data-quality.html) to fully specify the expected distribution of model inputs and outputs from training, and attaching additional lineage metadata.

## Conclusions

In this example we used an alternative built-in algorithm for tabular data on SageMaker, and showed how its performance can be improved by efficient, [automatic hyperparameter tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) without manually exploring combinations.

We used a relatively small number of trials in this HPO run to keep the run time short, so you might not have seen much improvement. However, HPO is particularly useful when the space of parameters becomes large and you can allocate sufficient compute resources for the algorithm to explore the best combinations for you.

Although [SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html) is perhaps the quickest way to deliver strong initial results on a new tabular data project, SageMaker built-in algorithms support a wide range of use-cases from text and vision to more niche tabular problem types. Combining built-in algorithms with SageMaker HPO can really boost their accuracy.

In fact, you'll see that Autopilot uses many of these same tools under the hood: Creating HPO jobs when running in HPO mode, using SageMaker Processing for data pre-processing experiments, and making use of the XGBoost and AutoGluon-Tabular algorithms.

Check out the other workshops in this repository to dive deeper on custom ML with bring-your-own-script training jobs.

## Releasing cloud resources

As mentioned in the previous notebook, you should shut down any created inference endpoints when finished experimenting. You may also choose to clear out your Amazon S3 storage, in which case do remember to delete your SageMaker Feature Store Feature Group and Model Registry Model Group first.

You can un-comment the below code to delete the inference endpoint created by this notebook:

In [None]:
# hpo_predictor.delete_endpoint(delete_endpoint_config=True)