# Classification with Amazon SageMaker XGBoost algorithm
_**Managed training for building a classification model with Amazon SageMaker XGBoost**_

## Introduction

This notebook demonstrates the use of Amazon SageMaker XGBoost to train and host a classification model. [XGBoost (eXtreme Gradient Boosting)](https://xgboost.readthedocs.io) is a popular and efficient machine learning algorithm used for regression and classification tasks on tabular datasets. It implements a technique known as gradient boosting on trees. It performs remarkably well in machine learning competitions and gets a lot of attention from customers.

We use the [MNIST data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html) stored in [LIBSVM](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) format.

---
## Setup

This notebook was tested in Amazon SageMaker Studio on an ml.t3.medium instance with the Python 3 (Data Science) kernel.

Let's start by specifying:
1. The S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting.
1. The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. If more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

In [27]:
%%time

import os
import boto3
import re
import sagemaker

# Get a SageMaker-compatible role used by this Notebook Instance.
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

### update below values appropriately ###
bucket = sagemaker.Session().default_bucket()
####

print(region)

us-east-1
CPU times: user 84.1 ms, sys: 2.15 ms, total: 86.3 ms
Wall time: 185 ms


### Fetching the dataset

The following code downloads the data, splits it into train and validation datasets, and uploads the files to S3.

In [28]:
from data_utils import load_mnist, upload_to_s3

bucket = sagemaker.Session().default_bucket()
prefix = "DEMO-smdebug-xgboost-mnist"

In [30]:
%%time

train_file, validation_file = load_mnist()
upload_to_s3(train_file, bucket, f"{prefix}/train/mnist.train.libsvm")
upload_to_s3(validation_file, bucket, f"{prefix}/validation/mnist.validation.libsvm")

Writing to s3://sagemaker-us-east-1-365792799466/DEMO-smdebug-xgboost-mnist/train/mnist.train.libsvm
Writing to s3://sagemaker-us-east-1-365792799466/DEMO-smdebug-xgboost-mnist/validation/mnist.validation.libsvm
CPU times: user 3.74 s, sys: 420 ms, total: 4.16 s
Wall time: 1min 9s
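
For readers unfamiliar with LIBSVM format: each line holds a label followed by sparse `index:value` pairs, and zero-valued features are omitted. The sketch below illustrates that layout on a toy feature vector. It assumes the conventional 1-based feature indices, and `to_libsvm_line` is a hypothetical helper written for this illustration; it is not part of `data_utils`, whose implementation is not shown here.

```python
def to_libsvm_line(label, features):
    """Encode one example as a LIBSVM line: '<label> <index>:<value> ...'.

    Indices are 1-based (assumed convention) and zero-valued features are
    skipped, which is what makes the format sparse.
    Hypothetical helper for illustration; not part of data_utils.
    """
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


# A digit labeled 5 with a toy 6-pixel feature vector.
line = to_libsvm_line(5, [0, 0.5, 0, 0, 1.0, 0.25])
print(line)
```

Only three of the six features appear in the encoded line, because the zero-valued ones are dropped.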


## Training the XGBoost model

Now that we have the data uploaded to S3, we will use the XGBoost container to run our training.

To run our training job on SageMaker, we construct a `sagemaker.estimator.Estimator` object, which accepts several constructor arguments:

* __image_uri__: The URI of the XGBoost container image that SageMaker runs for training and prediction.
* __role__: The IAM role ARN.
* __instance_type__ *(optional)*: The type of SageMaker instance used for training.
* __sagemaker_session__ *(optional)*: The session used to train on SageMaker.
* __hyperparameters__ *(optional)*: A dictionary of hyperparameters passed to the training job.

In [53]:
from sagemaker import image_uris

# Retrieve the XGBoost container image URI for the region where this notebook is running.
region = boto3.Session().region_name
container = image_uris.retrieve("xgboost", region, version="latest")

In [54]:
from sagemaker import get_execution_role

role = get_execution_role()
base_job_name = "mnist-xgboost-classification"
bucket_path = "s3://{}".format(bucket)

num_round = 50
save_interval = 3
hyperparameters = {
    "max_depth": "5",
    "eta": "0.1",
    "gamma": "4",
    "min_child_weight": "6",
    "silent": "0",
    "objective": "multi:softmax",
    "num_class": "10",  # num_class is required for 'multi:*' objectives
    "num_round": num_round,
}

In [55]:
from sagemaker.estimator import Estimator

xgboost_algorithm_mode_estimator = Estimator(
    role=role,
    base_job_name=base_job_name,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    image_uri=container,
    hyperparameters=hyperparameters,
)

In [39]:
from sagemaker.inputs import TrainingInput

train_s3_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "train"), content_type="libsvm"
)
validation_s3_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "validation"), content_type="libsvm"
)


xgboost_algorithm_mode_estimator.fit(
    {"train": train_s3_input, "validation": validation_s3_input}, wait=True
)

2021-08-11 09:45:05 Starting - Starting the training job...
2021-08-11 09:45:30 Starting - Launching requested ML instancesProfilerReport-1628675105: InProgress
...
2021-08-11 09:46:04 Starting - Preparing the instances for training.........
2021-08-11 09:47:36 Downloading - Downloading input data
2021-08-11 09:47:36 Training - Downloading the training image....
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value multi:softmax to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
[09:48:05] 48000x781 matrix with 7194988 entries loaded from /opt/ml/input/data/train
[09:48:05] 12000x781 matrix with 1799168 entries loaded from /opt/ml/input/data/validation
INFO:root:Distributed node 

## Train XGBoost Estimator on MNIST data using Spot Instance

Training on Spot Instances only requires a small configuration change: set `use_spot_instances=True` on the estimator. The example below also sets `max_run` and `max_wait`, both in seconds.
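
When a managed spot training job finishes, SageMaker reports the training seconds and the billable seconds, and the savings follow from their ratio. The sketch below shows that arithmetic together with a sanity check that `max_wait` is at least `max_run`. Both helper functions are hypothetical, and the numbers are made up for illustration; they are not taken from the job in this notebook.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage saved with spot pricing: (1 - billable / training) * 100."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


def check_spot_limits(max_run, max_wait):
    """max_wait covers waiting for capacity plus training, so it must be >= max_run."""
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return True


# Hypothetical numbers, not taken from the training job in this notebook.
savings = spot_savings_percent(training_seconds=200, billable_seconds=50)
print(f"Managed spot training savings: {savings:.1f}%")

# The limits used in this notebook pass the check.
check_spot_limits(max_run=3600, max_wait=3600)
```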

In [47]:
# Built-in algorithm mode with managed spot training
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

boto_session = boto3.Session(region_name=region)
session = Session(boto_session=boto_session)

xgboost_algorithm_mode_estimator = Estimator(
    role=role,
    base_job_name=base_job_name,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    image_uri=container,
    hyperparameters=hyperparameters,
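    use_spot_instances=True,
    max_run=3600,  # maximum training time, in seconds
    max_wait=3600,  # maximum time to wait for spot capacity plus training, in seconds; must be >= max_run
)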

train_s3_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "train"), content_type="libsvm"
)
validation_s3_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "validation"), content_type="libsvm"
)

In [48]:
xgboost_algorithm_mode_estimator.fit(
    {"train": train_s3_input, "validation": validation_s3_input}, wait=True
)

2021-08-11 10:32:52 Starting - Starting the training job...
2021-08-11 10:33:15 Starting - Launching requested ML instancesProfilerReport-1628677972: InProgress
...
2021-08-11 10:33:47 Starting - Preparing the instances for training............
2021-08-11 10:35:45 Downloading - Downloading input data...
2021-08-11 10:36:16 Training - Training image download completed. Training in progress.....
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value multi:softmax to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
[10:36:14] 48000x781 matrix with 7194988 entries loaded from /opt/ml/input/data/train
[10:36:14] 12000x781 matrix with 1799168 entries loaded from /opt/ml/input/data/validation


## Hyperparameter tuning with XGBoost Estimator on MNIST data using Spot Instance


Now that we have prepared the dataset and trained a model, note that algorithm settings called "hyperparameters" can dramatically affect the performance of the trained model. For example, the XGBoost algorithm has dozens of hyperparameters, and we need to pick the right values for them to achieve the desired training results. Because the best setting also depends on the dataset, it is almost impossible to pick it without searching, and a good search algorithm can find it in an automated and effective way.

We will use SageMaker hyperparameter tuning to automate the search. Specifically, we specify a range, or a list of possible values in the case of categorical hyperparameters, for each hyperparameter that we plan to tune. SageMaker hyperparameter tuning automatically launches multiple training jobs with different hyperparameter settings, evaluates the results of those training jobs based on a predefined "objective metric", and selects the hyperparameter settings for future attempts based on previous results. Each hyperparameter tuning job gets a budget (a maximum number of training jobs) and completes once that many training jobs have been executed.
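
To make the "budgeted search" idea concrete, here is a minimal sketch that uses plain random search over two hyperparameters. This is a stand-in, not the strategy SageMaker uses (its default tuning strategy is Bayesian optimization), and `toy_validation_error` is a made-up function that replaces a real training job so the example runs locally.

```python
import random


def toy_validation_error(eta, max_depth):
    """Stand-in for a training job: returns a made-up validation error.

    By construction the minimum is at eta=0.3, max_depth=6.
    """
    return (eta - 0.3) ** 2 + 0.01 * (max_depth - 6) ** 2


def random_search(max_jobs, seed=0):
    """Try max_jobs random settings and keep the one with the lowest error."""
    rng = random.Random(seed)
    best = None
    for _ in range(max_jobs):
        candidate = {"eta": rng.uniform(0.0, 1.0), "max_depth": rng.randint(1, 10)}
        error = toy_validation_error(**candidate)
        if best is None or error < best["error"]:
            best = {**candidate, "error": error}
    return best


# The budget (max_jobs) bounds how many settings are evaluated.
best = random_search(max_jobs=20)
print(best)
```

A larger budget can only keep or improve the best error found here, which is the same trade-off between cost and result quality that the tuning job's budget controls.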


Now we configure the hyperparameter tuning job with the SDK by specifying the following information:
* The Estimator to use for hyperparameter optimization (HPO). We created this in the earlier training step.
* The ranges of the hyperparameters we want to tune.
* The total number of training jobs to run and how many of them run simultaneously. More parallel jobs finish tuning sooner but may sacrifice accuracy, because each new job has fewer completed results to learn from. We recommend setting the number of parallel jobs to less than 10% of the total number of training jobs (we set it higher in this example to keep it short).
* The objective metric used to evaluate training results. In this example, we select *validation:merror* as the objective metric, and the goal is to minimize its value throughout the hyperparameter tuning process. The objective metric has to be among the metrics that the algorithm emits during training. The built-in XGBoost algorithm emits a number of metrics, and *validation:merror* is one of them. If you bring your own algorithm to SageMaker, make sure that your algorithm actually emits whatever objective metric you select.



We will tune three hyperparameters in this example:
* *eta*: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter shrinks the feature weights to make the boosting process more conservative.
* *min_child_weight*: Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process gives up further partitioning. In a linear regression task, this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is.
* *max_depth*: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.
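
As a rough illustration of the shrinkage that *eta* applies, the sketch below scales each tree's contribution by `eta` before adding it to the prediction. It is a simplification under stated assumptions: the tree outputs are made-up numbers, `boosted_prediction` is a hypothetical helper, and in real boosting the trees themselves would differ for different `eta` values because each tree is fit to the residuals left by the previous ones.

```python
def boosted_prediction(base_score, tree_outputs, eta):
    """Add each tree's output, scaled by the learning rate eta, to the base score."""
    return base_score + eta * sum(tree_outputs)


# Hypothetical raw outputs of three trees for a single example.
tree_outputs = [0.8, 0.4, 0.2]

full_step = boosted_prediction(0.5, tree_outputs, eta=1.0)
shrunk = boosted_prediction(0.5, tree_outputs, eta=0.1)
print(full_step, shrunk)
```

With a small `eta`, the same trees move the prediction much less per boosting round, which is why lower values are more conservative and usually need more rounds.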

In [56]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

# Define hyperparameter ranges.
hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "min_child_weight": ContinuousParameter(1, 10),
    "max_depth": IntegerParameter(1, 10),
}

Next we'll specify the objective metric that we'd like to tune and its definition, which includes the regular expression (regex) needed to extract that metric from the CloudWatch logs of the training job. Since we are using the built-in XGBoost algorithm here, it emits predefined metrics such as *validation:merror* and *train:merror*, and we monitor *validation:merror*, the multiclass classification error rate on the validation set, as you can see below. In this case, we only need to specify the metric name and do not need to provide a regex. If you bring your own algorithm, your algorithm emits metrics by itself. In that case, you'll need to add a metric definition here that describes the format of those metrics through a regex, so that SageMaker knows how to extract them from your CloudWatch logs.

In [64]:
objective_metric_name = "validation:merror"
objective_type = 'Minimize'
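
For the bring-your-own-algorithm case mentioned above, a metric definition pairs a metric name with a regex that captures the value from a log line. The sketch below applies such a regex locally with Python's `re` module. The `"Name"`/`"Regex"` keys follow the shape the SageMaker Python SDK accepts for `metric_definitions`; the regex and the log line are assumptions made up for this illustration, so adapt them to whatever your algorithm actually prints.

```python
import re

# Hypothetical metric definition in the "Name"/"Regex" shape used for custom algorithms.
metric_definitions = [
    {"Name": "validation:merror", "Regex": r"validation-merror:([0-9.]+)"},
]

# Hypothetical log line; a real algorithm's log format may differ.
log_line = "[12] train-merror:0.0312 validation-merror:0.0475"

for definition in metric_definitions:
    match = re.search(definition["Regex"], log_line)
    if match:
        # The first capture group holds the metric value.
        value = float(match.group(1))
        print(definition["Name"], value)
```

Testing the regex against a sample log line like this before launching a tuning job is a cheap way to confirm that the objective metric will be picked up.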

Now, we'll create a `HyperparameterTuner` object, to which we pass:
- The XGBoost estimator we created above
- Our hyperparameter ranges
- The objective metric name and type (minimize or maximize)
- Tuning resource configurations, such as the total number of training jobs to run and how many training jobs can run in parallel

In [65]:
tuner = HyperparameterTuner(
    xgboost_algorithm_mode_estimator,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs=2,
    max_parallel_jobs=2,
    objective_type=objective_type,
)

## Launch hyperparameter tuning
Now we can launch a hyperparameter tuning job by calling the `fit()` function. After the hyperparameter tuning job is created, we can go to the SageMaker console to track its progress until it is completed.

In [None]:
train_s3_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "train"), content_type="libsvm"
)
validation_s3_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "validation"), content_type="libsvm"
)

tuner.fit(
    {"train": train_s3_input, "validation": validation_s3_input}, wait=True
)

....................................

### Note on Distributed training

SageMaker's XGBoost algorithm supports distributed training by default. We just need to increase `instance_count` (the number of instances) when creating the estimator.
 
```python
xgboost_algorithm_mode_estimator = Estimator(
    role=role,
    base_job_name=base_job_name,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    image_uri=container,
    hyperparameters=hyperparameters,
)
```
