# Customizing open-source AWS XGBoost algorithm container

This notebook demonstrates how to customize the AWS XGBoost algorithm container. The modifications are based on version 1.2-2 of the AWS built-in XGBoost algorithm.

A Docker script pulls a copy of this open-source container, swaps some of the Python scripts, and builds an image hosted in a private AWS ECR registry. The training and inference APIs of the customized container remain the same as those of the original built-in container.


 
# Case study: Producing SHAP values in batch inference

As an example, we produce [SHAP values](https://arxiv.org/abs/1705.07874) together with predictions at inference time.
SHAP values indicate which features contribute most to the predicted value and are often
used to explain a model's behavior.
The [shap package](https://github.com/slundberg/shap) can be used to compute SHAP values and comes pre-installed in the SageMaker XGBoost container. In
training mode, SageMaker Debugger can be configured to collect SHAP values as post-training debugging logs. Amazon SageMaker
Clarify can similarly be used to collect SHAP values in post-processing.
Here, we are interested in computing SHAP values at inference time (online). Getting SHAP values together with predictions
is useful in many practical situations. For example, in user churn prediction, real-time SHAP values can be used to
identify the key drivers behind a user's likelihood to depart and to personalize the website's content to improve that user's
experience.

This workflow can serve as a baseline for tweaking the image as needed. See the accompanying code for an example in which two scripts
in the official container are modified so that the algorithm returns SHAP feature-importance values together with the inference prediction.
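
To make the additivity behind SHAP values concrete, the following minimal sketch computes exact Shapley values for a toy three-feature function by enumerating feature subsets. The function `toy_model`, the baseline, and the input are hypothetical and chosen only for illustration; the container relies on the shap package rather than this brute-force approach, which is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    """Hypothetical model: two linear terms plus one interaction."""
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        masked = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(masked)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
base_value = toy_model(baseline)
phi = shapley_values(toy_model, x, baseline)
prediction = toy_model(x)

# Additivity: the base value plus all contributions recovers the prediction.
print(base_value, phi, prediction)
```

The interaction term is split evenly between the two features involved, which is why the contributions sum exactly to the difference between the prediction and the base value.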

## Prerequisites

In [None]:
import os
import boto3
import re
import sagemaker
from sagemaker.session import Session  # s3_input was replaced by TrainingInput in SageMaker SDK v2
from sagemaker.inputs import TrainingInput
import pandas as pd
import numpy as np

In [None]:
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
s3_client = boto3.client("s3")
account = boto3.client("sts").get_caller_identity().get("Account")

# S3 bucket where the training data is located.
data_bucket = "sagemaker-sample-files"
data_prefix = "datasets/tabular/uci_abalone"
data_bucket_path = f"s3://{data_bucket}"

# S3 bucket for saving code and model artifacts.
# Can specify a different bucket and prefix
output_bucket = sagemaker.Session().default_bucket()
output_prefix = "sagemaker/DEMO-xgboost-abalone-default"
output_bucket_path = f"s3://{output_bucket}"

## Preparing sample data
We use the [Abalone](https://archive.ics.uci.edu/ml/datasets/abalone) dataset. The sample data is already prepared in libsvm format in a public S3 bucket. We copy the data to our own location.
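
For readers unfamiliar with the libsvm format, each line holds a label followed by sparse `index:value` pairs. The sketch below parses one such line in plain Python. The helper `parse_libsvm_line` is hypothetical, and the sample line uses illustrative values that are not guaranteed to match a row of the prepared files.

```python
def parse_libsvm_line(line):
    """Split a libsvm line into (label, {feature_index: value})."""
    parts = line.strip().split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# Illustrative record: a label followed by eight 1-indexed features.
sample = "15 1:1 2:0.455 3:0.365 4:0.095 5:0.514 6:0.2245 7:0.101 8:0.15"
label, features = parse_libsvm_line(sample)
print(label, len(features))
```

Note that libsvm feature indices conventionally start at 1, so index 0 is unused in a line like this.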

In [None]:
%%time
for data_category in ["train", "test", "validation"]:
    
    data_key = "{0}/{1}/abalone.{1}".format(data_prefix, data_category)
    output_key = "{0}/{1}/abalone.{1}".format(output_prefix, data_category)
    data_filename = "abalone.{}".format(data_category)
    s3_client.download_file(data_bucket, data_key, data_filename)   
    s3_client.upload_file(data_filename, output_bucket, output_key)


## Pulling and modifying the AWS XGBoost algorithm container 

The bash script `docker_build.sh` pulls the XGBoost image whose URI is determined by the `image_uris.retrieve` method, modifies it, and pushes it to a private AWS ECR repository. (You may need to run the script twice if authentication fails on the first attempt.)

In [None]:
# initialize parameters
version = '1.2-2'

# Look up the URI of the built-in XGBoost image for this region and version.
# Change `version` above to base the custom image on a different release.
xgboost_container = sagemaker.image_uris.retrieve(region=region, framework='xgboost', version=version)
# Save the base image URI so that the Docker build can extend it.
with open('src/base_image_uri.txt', 'w') as f:
    f.write(xgboost_container)
custom_container = f"{account}.dkr.ecr.{region}.amazonaws.com/custom_images/sagemaker-xgboost-{version}"
print(custom_container)
!cd src && ./docker_build.sh
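
The custom image URI above follows the usual ECR layout of registry host, repository path, and optional tag. As a rough illustration, the sketch below splits such a URI into its parts. The helper `split_ecr_uri` and the account ID `123456789012` are hypothetical and are not part of `docker_build.sh`.

```python
def split_ecr_uri(uri):
    """Break an ECR image URI into its registry parts (illustrative helper)."""
    host, _, repo_and_tag = uri.partition("/")
    account, _, rest = host.partition(".dkr.ecr.")
    region = rest.split(".")[0]
    repository, _, tag = repo_and_tag.partition(":")
    return {
        "account": account,
        "region": region,
        "repository": repository,
        # Docker resolves an untagged image reference to "latest".
        "tag": tag or "latest",
    }


# Placeholder account ID and region, used only for this example.
example_uri = "123456789012.dkr.ecr.us-east-1.amazonaws.com/custom_images/sagemaker-xgboost-1.2-2"
parts = split_ecr_uri(example_uri)
print(parts)
```

Because the URI built in the notebook carries no explicit tag, Docker treats it as referring to the `latest` tag of that repository.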

## Train a model to use in inference

Next, we train a model to use for inference. We have not modified the algorithm on the training side, so the original image could have been used for training as well. In either case, the training API is the same, and we use the built-in training script as is.

In [None]:
%%time
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"reg:squarederror",
        "num_round":"50"}
# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=custom_container, 
                                          base_job_name='xgboost-custom',
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=f"s3://{output_bucket}/{output_prefix}")

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(output_bucket, output_prefix, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(output_bucket, output_prefix, 'validation'), content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})

## Inference
We use batch transform to make predictions and compute SHAP values on the test data. Inside the container, we have modified the data encoding so that the vectorized inference outputs can be stacked into a CSV-style output.

In [None]:
transformer = estimator.transformer(instance_count=1, instance_type='ml.m4.xlarge',
                                  model_name='custom-xgboost', 
                                  output_path=f"s3://{output_bucket}/{output_prefix}/results")
transformer.transform(f"s3://{output_bucket}/{output_prefix}/test/abalone.test")

### Inspecting the inference output
The inference output now contains the prediction, the a priori expected estimate (the SHAP base value), and the SHAP values that indicate each feature's contribution to the predicted value.

In [None]:
# Inspect the output. The third column of the raw output file is always zero and corresponds to the label column of the training data, so usecols skips it.
df = pd.read_csv(f"s3://{output_bucket}/{output_prefix}/results/abalone.test.out", 
                 index_col=None, header=None, 
                 names=['prediction', 'base_values', 'shap 1', 'shap 2', 'shap 3', 'shap 4', 'shap 5', 'shap 6', 'shap 7', 'shap 8'], 
                 usecols=[0,1,3,4,5,6,7,8,9,10])
df
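
As a sanity check on this column layout, SHAP values are additive: the base value plus the per-feature contributions should reproduce the model's raw output, which for the `reg:squarederror` objective is the prediction itself. The sketch below parses one CSV row in the layout read above and checks that property. The row is made up for illustration and `parse_row` is a hypothetical helper, so treat this as a sketch rather than a description of the container's exact output.

```python
import csv
import io

# Hypothetical output row: prediction, base value, an unused zero column,
# then eight SHAP values. The numbers are invented for illustration.
raw = "9.5,9.9,0.0,0.1,-0.2,0.05,-0.3,0.15,-0.1,-0.05,-0.05\n"


def parse_row(fields):
    """Map one CSV row to its prediction, base value, and SHAP values."""
    values = [float(v) for v in fields]
    return {
        "prediction": values[0],
        "base_value": values[1],
        "shap": values[3:11],  # column 2 is skipped, matching usecols
    }


row = parse_row(next(csv.reader(io.StringIO(raw))))
reconstructed = row["base_value"] + sum(row["shap"])
# With additive SHAP values, the reconstruction matches the prediction.
print(row["prediction"], round(reconstructed, 6))
```

Running the same check over the real `df` (summing the SHAP columns and adding `base_values`) is a quick way to confirm that the columns were read in the intended order.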

In [None]:
import shap
shap.initjs()
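# Pass only the SHAP columns; the first two columns hold the prediction and the base value.
shap.plots.force(df['base_values'][0], df.iloc[:, 2:].to_numpy())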

## Conclusion

This notebook shows how to customize the production AWS XGBoost algorithm image to create a custom image that stays as close to the base image as possible. Besides package consistency, a side benefit of this approach is that the training and inference APIs of the customized image and the built-in base image remain the same: only the image URI needs to be swapped, which helps maintain consistency. In addition, we can reuse many of the built-in functionalities already available in the base image without duplicating code.

## Clean up
The following cell, once uncommented, deletes the CloudFormation stack that deployed the resources for this example.

In [None]:
# Delete the CloudFormation stack.
# WARNING: THIS WILL DELETE THIS NOTEBOOK AND ANY CODE CHANGES.
# cft_client = boto3.client('cloudformation')
# cft_client.delete_stack(StackName='xgboost-extension-sample')