# Training SageMaker Models using XGBoost on SageMaker Managed Spot Training


The XGBoost algorithm can be used either as a built-in Amazon SageMaker algorithm or as a framework, in the same way as TensorFlow. Using XGBoost as a framework provides more flexibility than using it as a built-in algorithm because it enables more advanced scenarios in which pre-processing and post-processing scripts are incorporated into your training script. Using XGBoost as a built-in Amazon SageMaker algorithm is how you had to use the original XGBoost Release 0.72 version; nothing changes here except the version of XGBoost that you use.

## Use XGBoost as a built-in algorithm

The example here is almost the same as [Regression with Amazon SageMaker XGBoost algorithm](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone.ipynb).

This notebook tackles the same problem with the same solution, but it has been modified to run on SageMaker Managed Spot infrastructure. SageMaker Managed Spot uses [EC2 Spot Instances](https://aws.amazon.com/ec2/spot/) to run training jobs at a lower cost.

Please read the original notebook and try it out to gain an understanding of the ML use-case and how it is being solved. We will not delve into that here in this notebook.

In [1]:
!pip install -qU awscli boto3 sagemaker

You are using pip version 10.0.1, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


## First set up variables and define functions

Again, we won't explain the code below in detail; it has been lifted almost verbatim from [Regression with Amazon SageMaker XGBoost algorithm](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone.ipynb).

In [2]:
%%time

import os
import boto3
import re
import sagemaker

role = sagemaker.get_execution_role()
region = boto3.Session().region_name
# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = sagemaker.Session().default_bucket()

prefix = 'sagemaker/DEMO-xgboost-regression'
# customize to your bucket where you have stored the data
bucket_path = 's3://{}'.format(bucket)

CPU times: user 1.05 s, sys: 123 ms, total: 1.18 s
Wall time: 1.94 s


### Fetching the dataset

The following functions split the data into train/validation/test datasets and upload the files to S3.

In [3]:
%%time

import io
import boto3
import random

def data_split(FILE_DATA, FILE_TRAIN, FILE_VALIDATION, FILE_TEST, PERCENT_TRAIN, PERCENT_VALIDATION, PERCENT_TEST):
    data = [l for l in open(FILE_DATA, 'r')]
    train_file = open(FILE_TRAIN, 'w')
    valid_file = open(FILE_VALIDATION, 'w')
    tests_file = open(FILE_TEST, 'w')

    num_of_data = len(data)
    num_train = int((PERCENT_TRAIN/100.0)*num_of_data)
    num_valid = int((PERCENT_VALIDATION/100.0)*num_of_data)
    num_tests = int((PERCENT_TEST/100.0)*num_of_data)

    data_fractions = [num_train, num_valid, num_tests]
    split_data = [[],[],[]]

    rand_data_ind = 0

    for split_ind, fraction in enumerate(data_fractions):
        for i in range(fraction):
            rand_data_ind = random.randint(0, len(data)-1)
            split_data[split_ind].append(data[rand_data_ind])
            data.pop(rand_data_ind)

    for l in split_data[0]:
        train_file.write(l)

    for l in split_data[1]:
        valid_file.write(l)

    for l in split_data[2]:
        tests_file.write(l)

    train_file.close()
    valid_file.close()
    tests_file.close()

def write_to_s3(fobj, bucket, key):
    return boto3.Session(region_name=region).resource('s3').Bucket(bucket).Object(key).upload_fileobj(fobj)

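def upload_to_s3(bucket, channel, filename):
    # Include the filename in the key so the object lands at the URL printed below.
    key = prefix + '/' + channel + '/' + filename
    url = 's3://{}/{}'.format(bucket, key)
    print('Writing to {}'.format(url))
    # Close the file handle once the upload has finished.
    with open(filename, 'rb') as fobj:
        write_to_s3(fobj, bucket, key)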

CPU times: user 8 µs, sys: 1 µs, total: 9 µs
Wall time: 12.6 µs
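
As an alternative sketch (not the method this notebook uses), the same percentage-based split can be written by shuffling once with a fixed seed, which makes the split reproducible across runs. The function name `split_lines` and the `seed` parameter are assumptions introduced for illustration.

```python
import random

def split_lines(lines, percent_train, percent_validation, percent_test, seed=0):
    """Shuffle a copy of `lines` and cut it into train/validation/test lists."""
    shuffled = list(lines)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    num_train = int(n * percent_train / 100)
    num_valid = int(n * percent_validation / 100)
    num_tests = int(n * percent_test / 100)

    train = shuffled[:num_train]
    valid = shuffled[num_train:num_train + num_valid]
    tests = shuffled[num_train + num_valid:num_train + num_valid + num_tests]
    return train, valid, tests
```

With the percentages used in this notebook, the call would be `split_lines(open(FILE_DATA).readlines(), 70, 15, 15)`.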


### Data ingestion

In [4]:
%%time
import urllib.request

# Load the dataset
FILE_DATA = 'abalone'
urllib.request.urlretrieve("https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/abalone", FILE_DATA)

#split the downloaded data into train/test/validation files
FILE_TRAIN = 'abalone.train'
FILE_VALIDATION = 'abalone.validation'
FILE_TEST = 'abalone.test'
PERCENT_TRAIN = 70
PERCENT_VALIDATION = 15
PERCENT_TEST = 15
data_split(FILE_DATA, FILE_TRAIN, FILE_VALIDATION, FILE_TEST, PERCENT_TRAIN, PERCENT_VALIDATION, PERCENT_TEST)

#upload the files to the S3 bucket
upload_to_s3(bucket, 'train', FILE_TRAIN)
upload_to_s3(bucket, 'validation', FILE_VALIDATION)
upload_to_s3(bucket, 'test', FILE_TEST)

Writing to s3://sagemaker-us-west-2-959484541615/sagemaker/DEMO-xgboost-regression/train/abalone.train
Writing to s3://sagemaker-us-west-2-959484541615/sagemaker/DEMO-xgboost-regression/validation/abalone.validation
Writing to s3://sagemaker-us-west-2-959484541615/sagemaker/DEMO-xgboost-regression/test/abalone.test
CPU times: user 180 ms, sys: 44.5 ms, total: 225 ms
Wall time: 2.1 s


### Training the XGBoost model

After setting the training parameters, we kick off training and poll for status until the job finishes, which takes a few minutes in this example (about 3 minutes in the run shown below).

In [5]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(region, 'xgboost', '0.90-1')

In [6]:
%%time

from time import gmtime, strftime

job_name = 'DEMO-xgboost-regression-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Training job", job_name)

train_use_spot_instances = True
train_max_run = 3600
train_max_wait = 7200 if train_use_spot_instances else None
checkpoint_s3_uri = bucket_path + '/' + prefix + '/checkpoints/' + job_name if train_use_spot_instances else None
print("Checkpoint path:", checkpoint_s3_uri)

create_training_params = \
{
    "AlgorithmSpecification": {
        "TrainingImage": container,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": bucket_path + "/" + prefix + "/single-xgboost"
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.m4.4xlarge",
        "VolumeSizeInGB": 5
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "silent":"0",
        "objective":"reg:linear",
        "num_round":"50"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": train_max_run,
        "MaxWaitTimeInSeconds": train_max_wait
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": bucket_path + "/" + prefix + '/train',
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "libsvm",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": bucket_path + "/" + prefix + '/validation',
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "libsvm",
            "CompressionType": "None"
        }
    ],
    "EnableManagedSpotTraining": train_use_spot_instances,
    "CheckpointConfig": { 
        "S3Uri": checkpoint_s3_uri
    }
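}

# boto3 rejects None values, so drop the Spot-only settings when running On-Demand.
if not train_use_spot_instances:
    del create_training_params["StoppingCondition"]["MaxWaitTimeInSeconds"]
    del create_training_params["CheckpointConfig"]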

client = boto3.client('sagemaker', region_name=region)
client.create_training_job(**create_training_params)

import time

status = client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print(status)
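# 'Stopped' is also a terminal status; without it the loop would never exit for a stopped job.
while status not in ('Completed', 'Failed', 'Stopped'):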
    time.sleep(60)
    status = client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
    print(status)

Training job DEMO-xgboost-regression-2019-09-27-05-09-28
Checkpoint path: s3://sagemaker-us-west-2-959484541615/sagemaker/DEMO-xgboost-regression/checkpoints/DEMO-xgboost-regression-2019-09-27-05-09-28
InProgress
InProgress
InProgress
Completed
CPU times: user 97.5 ms, sys: 10.3 ms, total: 108 ms
Wall time: 3min
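
The polling loop in the cell above could be wrapped in a small reusable helper. The sketch below is one possible shape, assuming a boto3 SageMaker client like the `client` created above; the names `wait_for_training_job`, `poll_seconds`, and `timeout_seconds` are hypothetical and not part of boto3.

```python
import time

# Statuses after which a training job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_training_job(client, job_name, poll_seconds=60, timeout_seconds=None):
    """Poll describe_training_job until the job reaches a terminal status.

    Returns the final status string, or raises TimeoutError if
    `timeout_seconds` elapses first (hypothetical safety net).
    """
    start = time.monotonic()
    while True:
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        print(status)
        if status in TERMINAL_STATUSES:
            return status
        if timeout_seconds is not None and time.monotonic() - start >= timeout_seconds:
            raise TimeoutError(
                "Training job {} still {} after {} s".format(job_name, status, timeout_seconds)
            )
        time.sleep(poll_seconds)
```

It would be called as `wait_for_training_job(client, job_name)` right after `client.create_training_job(**create_training_params)`.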


In [7]:
training_job_description = client.describe_training_job(TrainingJobName=job_name)
print("TrainingTimeInSeconds", training_job_description["TrainingTimeInSeconds"])
print("BillableTimeInSeconds", training_job_description["BillableTimeInSeconds"])

TrainingTimeInSeconds 40
BillableTimeInSeconds 20


## Use XGBoost as a framework

For Managed Spot Training using XGBoost, we need to configure three things:
1. Enable the `train_use_spot_instances` constructor arg, a simple self-explanatory boolean.
2. Set the `train_max_wait` constructor arg. This is an int arg representing the amount of time, in seconds, you are willing to wait for Spot infrastructure to become available. Some instance types are harder to get at Spot prices and you may have to wait longer. You are not charged for time spent waiting for Spot infrastructure to become available; you are only charged for actual compute time once Spot instances have been successfully procured.
3. Set up a `checkpoint_s3_uri` constructor arg. This arg tells SageMaker the S3 location where checkpoints should be saved (assuming your algorithm has been modified to save checkpoints periodically). While not strictly necessary, checkpointing is highly recommended for Managed Spot Training jobs because Spot instances can be interrupted at short notice, and resuming from the latest checkpoint preserves the progress saved before the interruption.
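
To illustrate what resuming from a checkpoint involves, here is a minimal, library-free sketch of the lookup a training script might do at startup. SageMaker syncs a local checkpoint directory (`/opt/ml/checkpoints` by default) with `checkpoint_s3_uri`, so the script only needs to look at local files. The helper name `latest_checkpoint` and the `xgboost-checkpoint.<iteration>` file naming are assumptions for illustration, not necessarily what `abalone.py` does.

```python
import os
import re

def latest_checkpoint(checkpoint_dir, prefix="xgboost-checkpoint"):
    """Return (path, iteration) of the newest checkpoint, or (None, None) if none exist.

    Assumes checkpoint files are named "<prefix>.<iteration>" (hypothetical convention).
    """
    if not os.path.isdir(checkpoint_dir):
        return None, None

    pattern = re.compile(r"^{}\.(\d+)$".format(re.escape(prefix)))
    best_path, best_iteration = None, None
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        if not match:
            continue  # Ignore files that are not checkpoints.
        iteration = int(match.group(1))
        if best_iteration is None or iteration > best_iteration:
            best_path = os.path.join(checkpoint_dir, name)
            best_iteration = iteration
    return best_path, best_iteration
```

A training script could call `latest_checkpoint("/opt/ml/checkpoints")` when it starts and, if a path is returned, load that model and train only the remaining rounds.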

Feel free to toggle the `train_use_spot_instances` variable to see the effect of running the same job using regular (a.k.a. "On Demand") infrastructure.

Note that `train_max_wait` can be set if and only if `train_use_spot_instances` is enabled and **must** be greater than or equal to `train_max_run`.
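
The constraint above can be checked before constructing the estimator. The following is a minimal sketch; `validate_spot_settings` is a hypothetical helper, not part of the SageMaker SDK.

```python
def validate_spot_settings(train_use_spot_instances, train_max_run, train_max_wait):
    """Raise ValueError if the Spot settings violate the rules described above."""
    if not train_use_spot_instances:
        # train_max_wait only makes sense for Spot jobs.
        if train_max_wait is not None:
            raise ValueError(
                "train_max_wait may only be set when train_use_spot_instances is enabled"
            )
        return
    if train_max_wait is None:
        raise ValueError("train_max_wait must be set when train_use_spot_instances is enabled")
    if train_max_wait < train_max_run:
        raise ValueError("train_max_wait must be greater than or equal to train_max_run")
```

For the values used in this notebook, `validate_spot_settings(True, 3600, 7200)` passes silently.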

In [8]:
import uuid

train_use_spot_instances = True
train_max_run = 3600
train_max_wait = 7200 if train_use_spot_instances else None

checkpoint_suffix = str(uuid.uuid4())

if train_use_spot_instances:
    checkpoint_s3_uri = bucket_path + '/' + prefix + '/checkpoints/' + checkpoint_suffix
    print("Checkpoint path:", checkpoint_s3_uri)
else:
    checkpoint_s3_uri = None

Checkpoint path: s3://sagemaker-us-west-2-959484541615/sagemaker/DEMO-xgboost-regression/checkpoints/482ac248-2a4e-404c-9c45-0c6bc09b73ae


In [9]:
from sagemaker.session import s3_input
from sagemaker.xgboost.estimator import XGBoost

hyperparameters = {
    "max_depth":"5",
    "eta":"0.2",
    "gamma":"4",
    "min_child_weight":"6",
    "subsample":"0.7",
    "silent":"0",
    "objective":"reg:linear",
    "num_round":"50"
}
instance_type = "ml.m4.4xlarge"
output_path = "s3://{}/{}/{}/output".format(bucket, prefix, "single-xgboost")
content_type = "libsvm"

xgb_script_mode_estimator = XGBoost(
    entry_point="abalone.py",
    hyperparameters=hyperparameters,
    image_name=container,
    role=role, 
    train_instance_count=1,
    train_instance_type=instance_type,
    framework_version="0.90-1",
    output_path=output_path,
    train_use_spot_instances=train_use_spot_instances,
    train_max_run=train_max_run,
    train_max_wait=train_max_wait,
    checkpoint_s3_uri=checkpoint_s3_uri)

xgb_script_mode_estimator.fit(
    {
        "train": s3_input(
            "s3://{}/{}/{}".format(bucket, prefix, "train"),
            content_type=content_type
        ),
        "validation": s3_input(
            "s3://{}/{}/{}".format(bucket, prefix, "validation"),
            content_type=content_type
        )
    }
)

2019-09-27 05:12:29 Starting - Starting the training job...
2019-09-27 05:12:30 Starting - Launching requested ML instances......
2019-09-27 05:13:56 Starting - Preparing the instances for training......
2019-09-27 05:14:47 Downloading - Downloading input data...
2019-09-27 05:15:25 Training - Training image download completed. Training in progress.
2019-09-27 05:15:25 Uploading - Uploading generated training model
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Invoking user training script.
INFO:sagemaker-containers:Module abalone does not provide a setup.py.
Generating setup.py
INFO:sagemaker-containers:Generating setup.cfg
INFO:sagemaker-containers:Generating MANIFEST.in
INFO:sagemaker-containers:Installing module with the following command:
/usr/bin/python3 -


2019-09-27 05:15:31 Completed - Training job completed
Training seconds: 44
Billable seconds: 18
Managed Spot Training savings: 59.1%


## Savings
Towards the end of the job output you should see two lines:

- `Training seconds: X` : This is the actual compute time your training job used.
- `Billable seconds: Y` : This is the time you will be billed for after Spot discounting is applied.

If you enabled the `train_use_spot_instances` variable, you should see a notable difference between `X` and `Y`, reflecting the cost savings from choosing Managed Spot Training. This is reported in an additional line:

- `Managed Spot Training savings: (1-Y/X)*100 %`
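
As a worked example of this formula, the sketch below plugs in the numbers from the two runs recorded in this notebook. The function name `spot_savings_percent` is an assumption introduced for illustration.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Compute (1 - Y/X) * 100, where X is training seconds and Y is billable seconds."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100

# Framework run above: Training seconds 44, Billable seconds 18.
print("Managed Spot Training savings: {:.1f}%".format(spot_savings_percent(44, 18)))
# Built-in algorithm run above: TrainingTimeInSeconds 40, BillableTimeInSeconds 20.
print("Managed Spot Training savings: {:.1f}%".format(spot_savings_percent(40, 20)))
```

The first line prints `59.1%`, matching the job log above; the second prints `50.0%`.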