# Classification with Amazon SageMaker XGBoost algorithm
_**Managed training for building a classification model with Amazon SageMaker XGBoost*_

## Introduction

This notebook demonstrates the use of Amazon SageMaker XGBoost to train and host a classification model. [XGBoost (eXtreme Gradient Boosting)](https://xgboost.readthedocs.io) is a popular and efficient machine learning algorithm used for regression and classification tasks on tabular datasets. It implements a technique know as gradient boosting on trees, and performs remarkably well in machine learning competitions, and gets a lot of attention from customers. 

We use the [MNIST data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html) stored in [LIBSVM](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) format.

---
## Setup

This notebook was tested in Amazon SageMaker Studio on a ml.t3.medium instance with Python 3 (Data Science) kernel.

Let's start by specifying:
1. The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.
1. The IAM role arn used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with a the appropriate full IAM role arn string(s).

In [27]:
%%time

import os
import boto3
import re
import sagemaker

# Get a SageMaker-compatible role used by this Notebook Instance.
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

### update below values appropriately ###
bucket = sagemaker.Session().default_bucket()
prefix = "sagemaker/DEMO-xgboost-dist-script"
####

print(region)

us-east-1
CPU times: user 84.1 ms, sys: 2.15 ms, total: 86.3 ms
Wall time: 185 ms


### Fetching the dataset

Following code downloads the data and splits the data into train/validation datasets and upload files to S3.

In [28]:
from data_utils import load_mnist, upload_to_s3

bucket = sagemaker.Session().default_bucket()
prefix = "DEMO-smdebug-xgboost-mnist"

In [30]:
%%time

train_file, validation_file = load_mnist()
upload_to_s3(train_file, bucket, f"{prefix}/train/mnist.train.libsvm")
upload_to_s3(validation_file, bucket, f"{prefix}/validation/mnist.validation.libsvm")

Writing to s3://sagemaker-us-east-1-365792799466/DEMO-smdebug-xgboost-mnist/train/mnist.train.libsvm
Writing to s3://sagemaker-us-east-1-365792799466/DEMO-smdebug-xgboost-mnist/validation/mnist.validation.libsvm
CPU times: user 3.74 s, sys: 420 ms, total: 4.16 s
Wall time: 1min 9s


## Training the XGBoost model

After setting training parameters, we kick off training, and poll for status until training is completed, which in this example, takes between few minutes.

To run our training script on SageMaker, we construct a sagemaker.xgboost.estimator.XGBoost estimator, which accepts several constructor arguments:

* __entry_point__: The path to the Python script SageMaker runs for training and prediction.
* __role__: Role ARN
* __train_instance_type__ *(optional)*: The type of SageMaker instances for training. __Note__: Because Scikit-learn does not natively support GPU training, Sagemaker Scikit-learn does not currently support training on GPU instance types.
* __sagemaker_session__ *(optional)*: The session used to train on Sagemaker.
* __hyperparameters__ *(optional)*: A dictionary passed to the train function as hyperparameters.

In [32]:
from sagemaker import image_uris

# Below changes the region to be one where this notebook is running
region = boto3.Session().region_name
container = sagemaker.image_uris.retrieve("xgboost", region, "0.90-2")

In [33]:
from sagemaker import get_execution_role

role = get_execution_role()
base_job_name = "mnist-xgboost-classification"
bucket_path = "s3://{}".format(bucket)

num_round = 250
save_interval = 30
hyperparameters = {
    "max_depth": "5",
    "eta": "0.1",
    "gamma": "4",
    "min_child_weight": "6",
    "silent": "0",
    "objective": "multi:softmax",
    "num_class": "10",  # num_class is required for 'multi:*' objectives
    "num_round": num_round,
}

In [34]:
from sagemaker.estimator import Estimator

xgboost_algorithm_mode_estimator = Estimator(
    role=role,
    base_job_name=base_job_name,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    image_uri=container,
    hyperparameters=hyperparameters
    
)

In [23]:
from sagemaker.session import TrainingInput

train_s3_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "train"), content_type="libsvm"
)
validation_s3_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "validation"), content_type="libsvm"
)

# This is a fire and forget event. By setting wait=False, you just submit the job to run in the background.
# Amazon SageMaker will start one training job and release control to next cells in the notebook.
# Follow this notebook to see status of the training job.
xgboost_algorithm_mode_estimator.fit(
    {"train": train_s3_input, "validation": validation_s3_input}, wait=True
)

## Train XGBoost Estimator on MNIST data using Spot Instance

Training using spot instance is just adding a simple configuration attribute 'use_spot_instances' to the estimator class.

In [25]:
# Open Source distributed script mode
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput
from sagemaker.xgboost.estimator import XGBoost

boto_session = boto3.Session(region_name=region)
session = Session(boto_session=boto_session)
script_path = "abalone.py"

xgboost_algorithm_mode_estimator = Estimator(
    role=role,
    base_job_name=base_job_name,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    image_uri=container,
    hyperparameters=hyperparameters,
    use_spot_instances=True,
    max_run=3600,
    max_wait=3600
    
)

train_s3_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "train"), content_type="libsvm"
)
validation_s3_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "validation"), content_type="libsvm"
)

# This is a fire and forget event. By setting wait=False, you just submit the job to run in the background.
# Amazon SageMaker will start one training job and release control to next cells in the notebook.
# Follow this notebook to see status of the training job.


In [None]:
xgb_script_mode_estimator.fit({"train": train_input, "validation": validation_input})

2021-08-10 17:37:34 Starting - Starting the training job...ProfilerReport-1628617054: InProgress
...
2021-08-10 17:38:31 Starting - Launching requested ML instances......
2021-08-10 17:39:32 Starting - Preparing the instances for training.........
2021-08-10 17:40:53 Downloading - Downloading input data...
2021-08-10 17:41:32 Training - Training image download completed. Training in progress..[34m[2021-08-10 17:41:30.020 ip-10-0-141-79.ec2.internal:1 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None[0m
[34m[2021-08-10:17:41:30:INFO] Imported framework sagemaker_xgboost_container.training[0m
[34m[2021-08-10:17:41:30:INFO] No GPUs detected (normal if no gpus installed)[0m
[34m[2021-08-10:17:41:30:INFO] Invoking user training script.[0m
[34m[2021-08-10:17:41:45:INFO] Module abalone does not provide a setup.py. [0m
[34mGenerating setup.py[0m
[34m[2021-08-10:17:41:45:INFO] Generating setup.cfg[0m
[34m[2021-08-10:17:41:45:INFO] Generating MANIFEST.in[0m
[34m[2021-08-10:1


2021-08-10 17:41:54 Uploading - Uploading generated training model
2021-08-10 17:42:53 Completed - Training job completed
ProfilerReport-1628617054: NoIssuesFound
Training seconds: 104
Billable seconds: 47
Managed Spot Training savings: 54.8%
