# Amazon SageMaker with XGBoost and Hyperparameter Tuning
_**Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes**_

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 50)

import time

---

## Prepare our Environment

We'll need to:

- **import** some useful libraries (as in any Python notebook)
- **configure** the S3 bucket and folder where data should be stored (to keep our environment tidy)
- **connect** to AWS in general (with [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)) and SageMaker in particular (with the [sagemaker SDK](https://sagemaker.readthedocs.io/en/stable/)), to use the cloud services

While `boto3` is the general AWS SDK for Python, `sagemaker` provides some powerful, higher-level interfaces designed specifically for ML workflows.

In [None]:
# setting up SageMaker parameters
import sagemaker
import boto3

boto_session = boto3.Session()
region = boto_session.region_name
sm_session = sagemaker.Session()           # one SageMaker session, reused below
bucket_name = sm_session.default_bucket()  # default SageMaker bucket for this account and region
bucket_prefix = "xgboost-example"  # Location in the bucket to store our files
sm_client = boto_session.client("sagemaker")
sm_role = sagemaker.get_execution_role()

In [None]:
dataset_path = "<dataset path local or from s3>"  # replace this placeholder with your CSV path

In [None]:
df = pd.read_csv(dataset_path) 

In [None]:
df.head()

In [None]:
target_label = "<write your target label name here>"  # replace this placeholder with your label column

In [None]:
# Upload CSV files to S3 for SageMaker processing and training
rawdata_uri = sm_session.upload_data(
    path=dataset_path,
    bucket=bucket_name,
    key_prefix=bucket_prefix,
)

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor

# Processor that runs a scikit-learn script on a managed instance,
# reusing the execution role (sm_role) defined during setup
sklearn_processor = SKLearnProcessor(
    framework_version='0.20.0',
    role=sm_role,
    instance_type='ml.m5.xlarge',
    instance_count=1,
)


In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from datetime import datetime

job_name='xgboost-processing-'+datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

output_path_train='s3://'+bucket_name+'/'+job_name+'/processing/output/train/'
output_path_val='s3://'+bucket_name+'/'+job_name+'/processing/output/validation/'
output_path_test='s3://'+bucket_name+'/'+job_name+'/processing/output/test/'

sklearn_processor.run(
    code='scripts/preprocess.py',
    job_name=job_name, 
    #arguments = ['arg1', 'arg2'],
    inputs=[ProcessingInput(
        source=dataset_path,
        #source = 's3_path_to_data'
        destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output/train', destination = output_path_train),
        ProcessingOutput(source='/opt/ml/processing/output/validation', destination = output_path_val),
        ProcessingOutput(source='/opt/ml/processing/output/test', destination = output_path_test)]
)
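
The contents of `scripts/preprocess.py` are not shown in this notebook. As a rough, dependency-free sketch of what such a script might do, the function below shuffles a CSV, moves the label to the first column (the layout the built-in XGBoost algorithm expects for CSV training data, with no header row), and writes `train`, `validation` and `test` folders. The function name `split_csv`, the default 70/15/15 split, the file names, and the choice to keep a header row only in `test.csv` are all assumptions made for illustration. A real script would more likely use pandas and scikit-learn, reading from `/opt/ml/processing/input` and writing under `/opt/ml/processing/output`.

```python
import csv
import random
from pathlib import Path


def split_csv(input_csv, output_dir, target_label, fractions=(0.7, 0.15), seed=42):
    """Split one CSV into train/validation/test files (illustrative sketch)."""
    with open(input_csv, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    # Move the label to the first column, as the built-in XGBoost CSV format expects.
    label_idx = header.index(target_label)

    def reorder(row):
        return [row[label_idx]] + row[:label_idx] + row[label_idx + 1:]

    header = reorder(header)
    rows = [reorder(r) for r in rows]

    # Shuffle reproducibly, then cut into three consecutive slices.
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * fractions[0])
    n_val = int(len(rows) * fractions[1])
    splits = {
        "train": rows[:n_train],
        "validation": rows[n_train:n_train + n_val],
        "test": rows[n_train + n_val:],
    }

    for name, part in splits.items():
        folder = Path(output_dir) / name
        folder.mkdir(parents=True, exist_ok=True)
        with open(folder / f"{name}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            if name == "test":
                writer.writerow(header)  # assumption: only test.csv keeps column names
            writer.writerows(part)
    return {name: len(part) for name, part in splits.items()}
```

Inside a processing container, a call such as `split_csv("/opt/ml/processing/input/<your file>.csv", "/opt/ml/processing/output", "<your label>")` would produce the three folders that the `ProcessingOutput` entries above copy to S3.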

## Using SageMaker Built-in XGBoost

We'll use SageMaker's built-in XGBoost algorithm, which gives us performance-optimized, pre-implemented functionality such as multi-instance parallelization and support for multiple input formats.

In general, to use the pre-built algorithms, we'll need to:

- Refer to the Common Parameters docs to see the high-level configuration and what features each algorithm has
- Refer to the algorithm docs to understand the detail of the data formats and (hyper)-parameters it supports

From these docs, we'll understand what data format we need to upload to S3 (next), and how to get the container image URI of the algorithm... which is listed on the Common Parameters page but can also be extracted through the SDK:

In [None]:
# specify container
training_image = sagemaker.image_uris.retrieve("xgboost", region=region, version="1.0-1")

print(training_image)

Training a model on SageMaker follows the same steps as with other ML libraries (e.g. scikit-learn):

- Initiate a session (we did this up top).
- Instantiate an estimator object for our algorithm (XGBoost).
- Define its hyperparameters.
- Start the training job.

A small competition!

SageMaker's XGBoost includes 38 parameters. You can find more information about them here. For simplicity, we choose to experiment only with a few of them.

Finally, we create the training job using the high-level Estimator API.

The Estimator class provides a familiar, scikit-learn-like API for fit()ting models to data, deploy()ing models to real-time endpoints, or running batch inference jobs.

In [None]:
# Define the data input channels for the training job:
s3_input_train = sagemaker.inputs.TrainingInput(output_path_train, content_type="csv")
s3_input_validation = sagemaker.inputs.TrainingInput(output_path_val, content_type="csv")

In [None]:
# Instantiate an XGBoost estimator object
estimator = sagemaker.estimator.Estimator(
    image_uri=training_image,      # XGBoost algorithm container
    instance_type="ml.m5.xlarge",  # type of training instance
    instance_count=1,              # number of instances to be used
    role=sm_role,                # IAM role to be used
    max_run=20*60,                 # Maximum allowed active runtime
    use_spot_instances=True,       # Use spot instances to reduce cost
    max_wait=30*60,                # Maximum clock time (including spot delays)
)

# scale_pos_weight is a parameter that controls the relative weights of the classes.
# Because the data set is so highly skewed, we set it to the ratio of negative to positive examples.
scale_pos_weight = np.count_nonzero(df[target_label].values==0) / np.count_nonzero(df[target_label].values)

# define its hyperparameters
estimator.set_hyperparameters(
    num_round=150,     # int: [1,300]
    max_depth=5,     # int: [1,10]
    alpha=2,         # float: [0,5]
    eta=0.5,           # float: [0,1]
    objective="binary:logistic",
    eval_metric= "auc,accuracy",
    scale_pos_weight=scale_pos_weight,  # set the balance between the 2 classes
)

# start a training (fitting) job
estimator.fit({ "train": s3_input_train, "validation": s3_input_validation })
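
As a quick sanity check on the `scale_pos_weight` formula used above, the sketch below reproduces the negative-to-positive ratio on a made-up label list. The helper name `negative_to_positive_ratio` and the toy labels are assumptions for illustration only.

```python
def negative_to_positive_ratio(labels):
    """Return (#negatives / #positives) for binary 0/1 labels."""
    positives = sum(1 for y in labels if y != 0)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive examples; the ratio is undefined")
    return negatives / positives


toy_labels = [0] * 95 + [1] * 5  # 95 negatives, 5 positives
ratio = negative_to_positive_ratio(toy_labels)
print(ratio)  # 19.0
```

With a ratio of 19, errors on each positive example are weighted roughly as heavily as errors on 19 negative examples, which counteracts the class imbalance during training.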



## Deploy the model


Now that we've trained the XGBoost algorithm on our data, deploying the model (hosting it behind a real-time endpoint) is just one function call!

This deployment might take **up to 10 minutes**, and by default the code will wait for the deployment to complete.

If you like, you can instead:

- Un-comment the `wait=False` parameter
- Use the [Endpoints page of the SageMaker Console](https://console.aws.amazon.com/sagemaker/home?#/endpoints) to check the status of the deployment
- Skip over the *Evaluation* section below (which won't run until the deployment is complete), and start the Hyperparameter Optimization job - which will take a while to run too, so can be started in parallel


In [None]:
# Real-time endpoint:
model_name='xgboost-model-'+datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
predictor = estimator.deploy(
    model_name=model_name,
    initial_instance_count=1,
    instance_type="ml.m5.large",
    # wait=False,  # Remember, predictor.predict() won't work until deployment finishes!
)

### Evaluation

Since SageMaker is a general purpose ML platform and our endpoint is a web service, we'll need to be explicit that we're sending in tabular data (_serialized_ in CSV string format for the HTTPS request) and expect a tabular response (to be _deserialized_ from CSV to numpy).

In the SageMaker SDK (from v2), this packing and unpacking of the payload for the web endpoint is handled by [serializer classes](https://sagemaker.readthedocs.io/en/stable/api/inference/serializers.html) and [deserializer classes](https://sagemaker.readthedocs.io/en/stable/api/inference/deserializers.html).

Unfortunately the pre-built `CSVDeserializer` produces nested Python lists of strings, rather than a numpy array of numbers - so rather than bothering to implement a custom class (like the examples [here](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/deserializers.py)) we'll be lazy and take this as a post-processing step.

With this setup ready, requesting inferences is as easy as calling `predictor.predict()`:

In [None]:
test_df=pd.read_csv(output_path_test+'test.csv')

In [None]:
predictor.serializer = sagemaker.serializers.CSVSerializer()
predictor.deserializer = sagemaker.deserializers.CSVDeserializer()

In [None]:
X_test_numpy = test_df.drop([target_label], axis=1).values

predictions = np.array(predictor.predict(X_test_numpy), dtype=float).squeeze()
predictions
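
To make the post-processing step concrete without calling an endpoint, the sketch below parses a made-up CSV response body into nested lists of strings (the shape described above for `CSVDeserializer`) and then flattens it into floats. The sample body and the helper name `parse_scores` are assumptions for illustration; the notebook itself does the equivalent conversion with `np.array(..., dtype=float).squeeze()`.

```python
import csv
import io


def parse_scores(csv_body):
    """Turn a CSV response body into a flat list of float scores."""
    nested = list(csv.reader(io.StringIO(csv_body)))  # nested lists of strings
    return [float(value) for row in nested for value in row]


sample_body = "0.12,0.87,0.45"  # hypothetical endpoint response
scores = parse_scores(sample_body)
print(scores)  # [0.12, 0.87, 0.45]
```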

In [None]:
test_results = pd.concat(
    [
        pd.Series(predictions, name="y_pred", index=test_df.index),
        test_df,
    ],
    axis=1
)
test_results.head()

In [None]:
import util

In [None]:
util.plotting.generate_classification_report(
    y_real=test_results[target_label],
    y_predict_proba=test_results["y_pred"],
    decision_threshold=0.5,
    class_names_list=["good", "default"],
    title="Initial risk model",
)
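
The `util` module is local to this project and is not shown here. To illustrate what the `decision_threshold` argument implies, the sketch below converts predicted probabilities into class labels and counts the confusion-matrix cells. The function name `confusion_counts`, the toy data, and the use of `>=` at the threshold are assumptions; the report helper may differ in its details.

```python
def confusion_counts(y_true, y_proba, threshold=0.5):
    """Count TP, FP, TN, FN after thresholding predicted probabilities."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, proba in zip(y_true, y_proba):
        predicted = 1 if proba >= threshold else 0
        if predicted == 1 and actual == 1:
            counts["tp"] += 1
        elif predicted == 1:
            counts["fp"] += 1
        elif actual == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


y_true = [0, 0, 1, 1, 0, 1]
y_proba = [0.1, 0.6, 0.8, 0.4, 0.2, 0.9]
print(confusion_counts(y_true, y_proba, threshold=0.5))  # {'tp': 2, 'fp': 1, 'tn': 2, 'fn': 1}
print(confusion_counts(y_true, y_proba, threshold=0.3))  # {'tp': 3, 'fp': 1, 'tn': 2, 'fn': 0}
```

Lowering the threshold in this toy data recovers the missed positive without adding a false positive. On real, unbalanced data the trade-off is usually less favorable, so the threshold is worth tuning.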

---

## Hyperparameter Optimization (HPO)
*Note, with the default settings below, the hyperparameter tuning job can take up to ~20 minutes to complete.*

We will use SageMaker HyperParameter Optimization (HPO) to automate the search process. Specifically, we **specify a range**, or a list of possible values in the case of categorical hyperparameters, for each of the hyperparameters that we plan to tune.

SageMaker hyperparameter tuning will automatically launch **multiple training jobs** with different hyperparameter settings, evaluate results of those training jobs based on a predefined "objective metric", and select the hyperparameter settings for future attempts based on previous results. For each hyperparameter tuning job, we will specify the maximum number of HPO tries (`max_jobs`) and how many of these can happen in parallel (`max_parallel_jobs`).

Tip: `max_parallel_jobs` creates a **trade-off between performance and speed** (better hyperparameter values vs how long it takes to find these values). If `max_parallel_jobs` is large, then HPO is faster, but the discovered values may not be optimal. Smaller `max_parallel_jobs` will increase the chance of finding optimal values, but HPO will take more time to finish.
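
A back-of-the-envelope sketch of this trade-off follows. It assumes, as a simplification, that jobs run in synchronized waves of `max_parallel_jobs`; real tuning jobs start new training jobs as earlier ones finish, so treat the numbers as rough. The helper name `tuning_waves` is hypothetical.

```python
import math


def tuning_waves(max_jobs, max_parallel_jobs):
    """Estimate sequential waves and how many jobs can build on earlier results."""
    waves = math.ceil(max_jobs / max_parallel_jobs)
    # Jobs in the first wave start with no prior results to learn from.
    informed_jobs = max(max_jobs - max_parallel_jobs, 0)
    return waves, informed_jobs


print(tuning_waves(20, 4))   # (5, 16)
print(tuning_waves(20, 20))  # (1, 0)
print(tuning_waves(20, 1))   # (20, 19)
```

Under this simplification, running all 20 jobs at once finishes in a single wave, but no job benefits from earlier results, while running them one at a time lets 19 jobs learn from their predecessors at the cost of 20 sequential waves.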

Next we'll specify the objective metric that we'd like to tune and its definition, which includes the regular expression (regex) needed to extract that metric from the CloudWatch logs of the training job. Since we are using the built-in XGBoost algorithm here, it emits two predefined metrics, **validation:auc** and **train:auc**, and we chose to monitor *validation:auc*, as you can see below. In this case (because it's pre-built for us), we only need to specify the metric name.

For more information on SageMaker HPO, please refer to the [documentation](https://sagemaker.readthedocs.io/en/stable/tuner.html).

In [None]:
# import required HPO objects
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

# set up hyperparameter ranges
ranges = {
    "num_round": IntegerParameter(1, 300),
    "max_depth": IntegerParameter(1, 10),
    "alpha": ContinuousParameter(0, 5),
    "eta": ContinuousParameter(0, 1),
}

# set up the objective metric
objective = 'validation:auc'

# instantiate a HPO object
tuner = HyperparameterTuner(
    estimator=estimator,              # the SageMaker estimator object
    hyperparameter_ranges=ranges,     # the range of hyperparameters
    max_jobs=20,                      # total number of HPO jobs
    max_parallel_jobs=4,              # how many HPO jobs can run in parallel
    strategy="Bayesian",              # the internal optimization strategy of HPO
    objective_metric_name=objective,  # the objective metric to be used for HPO
    objective_type="Maximize",        # maximize or minimize the objective metric
)

### Launch HPO
Now we can launch a hyperparameter tuning job by calling the `fit()` function. After the hyperparameter tuning job is created, we can go to the SageMaker console to track its progress until it is completed.

In [None]:
# start HPO
tuner.fit({ "train": s3_input_train, "validation": s3_input_validation })

HPO jobs often take quite a long time to finish and as such, sometimes you may want to free up the notebook and then resume the wait later.

Just like the Estimator, we won't be able to `deploy()` the model until the HPO tuning job is complete; and the status is visible through both the [AWS Console](https://console.aws.amazon.com/sagemaker/home?#/hyper-tuning-jobs) and the [SageMaker API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeHyperParameterTuningJob.html). We could, for example, write a script that polls the API until the job reaches a terminal status.
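
A minimal sketch of such a polling loop is shown here. It relies on the `DescribeHyperParameterTuningJob` API linked above, called through a boto3 SageMaker client; the function name `wait_for_tuning_job`, the 60-second default interval, and the injectable `sleep` argument are assumptions made for illustration.

```python
import time

# Statuses after which the tuning job will not change any further
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_tuning_job(client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll DescribeHyperParameterTuningJob until the job reaches a terminal status."""
    while True:
        response = client.describe_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=job_name
        )
        status = response["HyperParameterTuningJobStatus"]
        print(f"{job_name}: {status}")
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
```

In this notebook it could be called as `wait_for_tuning_job(sm_client, tuner.latest_tuning_job.name)`, assuming the tuning job was started from the `tuner` object above.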

### Deploy and test optimized model
Deploying the best model is another simple `.deploy()` call:

In [None]:
# deploy the best model from HPO
model_name='xgboost-hpo-model-'+datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

hpo_predictor = tuner.deploy(
    model_name=model_name,
    initial_instance_count=1,
    instance_type="ml.m5.large",
    serializer=sagemaker.serializers.CSVSerializer(),
    deserializer=sagemaker.deserializers.CSVDeserializer(),
)

Once deployed, we can now evaluate the performance of the best model.

In [None]:
# getting the predicted probabilities of the best model
hpo_predictions = np.array(hpo_predictor.predict(X_test_numpy), dtype=float).squeeze()
print(hpo_predictions)

util.plotting.generate_classification_report(
    y_real=test_results[target_label],
    y_predict_proba=hpo_predictions,
    decision_threshold=0.5,
    class_names_list=["good", "default"],
    title="HPO risk model",
)

## Feature Importance

In [None]:

from sagemaker import clarify
clarify_processor = clarify.SageMakerClarifyProcessor(role=sm_role,
                                                      instance_count=1,
                                                      instance_type='ml.m5.xlarge',
                                                      sagemaker_session=sm_session)

In [None]:
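# Baseline record for SHAP: the first test row without its first column (assumes the label is column 0)
shap_config = clarify.SHAPConfig(baseline=[test_df.iloc[0,1:].values.tolist()],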
                                 num_samples=15,
                                 agg_method='mean_abs',
                                 save_local_shap_values=False)

explainability_output_path = 's3://{}/{}/clarify-explainability'.format(bucket_name, bucket_prefix)
explainability_data_config = clarify.DataConfig(s3_data_input_path=output_path_train,
                                s3_output_path=explainability_output_path,
                                label=target_label,
                                headers=df.columns.to_list(),
                                dataset_type='text/csv')



In [None]:
model_config = clarify.ModelConfig(model_name=model_name,
                                   instance_type='ml.m5.xlarge',
                                   instance_count=1,
                                   accept_type='text/csv',
                                   content_type='text/csv')


In [None]:
clarify_processor.run_explainability(data_config=explainability_data_config,
                                     model_config=model_config,
                                     explainability_config=shap_config)
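
The `agg_method='mean_abs'` setting asks Clarify to summarize per-record SHAP values into one global importance per feature by averaging their absolute values. The toy sketch below illustrates that aggregation on made-up local SHAP values; the helper name `mean_abs_importance` and the feature names are hypothetical, and this is not Clarify's own code.

```python
def mean_abs_importance(local_shap_rows, feature_names):
    """Average the absolute SHAP value of each feature across all records."""
    n_rows = len(local_shap_rows)
    totals = [0.0] * len(feature_names)
    for row in local_shap_rows:
        for i, value in enumerate(row):
            totals[i] += abs(value)
    return {name: total / n_rows for name, total in zip(feature_names, totals)}


# Each inner list holds one record's SHAP values, one per feature (made-up numbers)
local_shap = [
    [0.5, -0.25],
    [-0.5, 0.25],
    [1.0, 0.0],
    [-1.0, 0.5],
]
importance = mean_abs_importance(local_shap, ["income", "age"])
print(importance)  # {'income': 0.75, 'age': 0.25}
```

Taking absolute values matters here: the signed contributions of `income` cancel to zero on average, even though the feature strongly influences individual predictions.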

