## Explore, Develop, Train, Optimize and Deploy the Built-in XGBoost Algorithm


In this notebook, we show how to use Amazon SageMaker to develop, train, tune and deploy a built-in XGBoost model. We continue to use the Boston Housing dataset, available in scikit-learn: https://scikit-learn.org/stable/datasets/index.html#boston-dataset

**This sample is provided for demonstration purposes. Make sure to conduct appropriate testing if you derive your own code from it!**

In [None]:
import datetime
import tarfile

import boto3
from sagemaker import get_execution_role
import sagemaker



sm_boto3 = boto3.client('sagemaker')

sess = sagemaker.Session()

region = sess.boto_session.region_name

bucket = sess.default_bucket()  # this could also be a hard-coded bucket name

print('Using bucket ' + bucket)

### Prerequisites: prepare the raw dataset
#### We load a dataset from the scikit-learn library, split it and upload it to S3.

In [None]:
from sklearn.model_selection import train_test_split
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np

data = load_boston()

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX['target'] = y_test

trainX.to_csv('boston_train.csv')
testX.to_csv('boston_test.csv')

In [None]:
trainX.head()

In [None]:
trainX.describe()

#### Upload the dataset to S3 as input data for this demo

In [None]:
# send data to S3. SageMaker will take training data from s3
trainpath = sess.upload_data(
    path='boston_train.csv', bucket=bucket,
    key_prefix='sagemaker/xgboostcontainer/raw-data')

testpath = sess.upload_data(
    path='boston_test.csv', bucket=bucket,
    key_prefix='sagemaker/xgboostcontainer/raw-data')

print('Raw training dataset stored in S3 at:', trainpath)
print('Raw test dataset stored in S3 at:', testpath)

### Amazon SageMaker Experiments – Organize, Track And Compare Your Machine Learning Trainings

* Doc https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html
* SDK https://sagemaker-experiments.readthedocs.io/en/latest/index.html

In [None]:
import sys
!{sys.executable} -m pip install sagemaker-experiments

SageMaker Experiments automatically tracks the inputs, parameters, configurations, and results of your iterations as trials. You can assign, group, and organize these trials into experiments. SageMaker Experiments is integrated with Amazon SageMaker Studio providing a visual interface to browse your active and past experiments, compare trials on key performance metrics, and identify the best performing models.

In [None]:
import time
from time import strftime

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

create_date = strftime("%Y-%m-%d-%H-%M-%S")
experiment_name = "Boston-Housing-XGBoost-Trial-{}".format(create_date)
demo_experiment = Experiment.create(experiment_name = experiment_name,
                                    description = "Demo experiment using SageMaker Experiments to organize, track and compare trials"
                                   )

A trial is a set of steps called trial components that produce a machine learning model. A trial is part of a single Amazon SageMaker experiment.

In [None]:
create_date = strftime("%Y-%m-%d-%H-%M-%S")

demo_trial = Trial.create(trial_name = "Boston-Housing-XGBoost-Trial-{}".format(create_date),
                          experiment_name = experiment_name
                         )

We now create a trial component, which is a stage of a machine learning trial. A trial is composed of one or more trial components. Trial components include pre-processing jobs, training jobs, and batch transform jobs.

In [None]:
with Tracker.create(display_name="Dataset", sagemaker_boto_client=sm_boto3) as tracker:
    tracker.log_parameters({
        "train-test-split": 75  # percentage of rows kept for training (test_size=0.25)
    })
    # we can log the s3 uri to the dataset we just uploaded
    tracker.log_input(name="boston-housing-training-dataset", media_type="s3/uri", value=trainpath)
    tracker.log_input(name="boston-housing-test-dataset", media_type="s3/uri", value=testpath)

In [None]:
dataset_trial_component = tracker.trial_component
demo_trial.add_trial_component(dataset_trial_component)

### Data preprocessing with Amazon SageMaker Processing
Amazon SageMaker Processing allows you to run steps for data pre- or post-processing, feature engineering, data validation, or model evaluation workloads on Amazon SageMaker.

* Doc https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html
* SDK https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html


#### Write preprocessing script with scikit-learn

This simple script converts the data into a format compatible with the SageMaker built-in XGBoost algorithm: it moves the target to the first column of the training and test datasets, drops some feature columns, and removes the header row. In real-world cases, you can imagine a more complete pre-processing setup with Amazon SageMaker Processing.
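
To make the target layout concrete, here is a minimal sketch that uses only the Python standard library instead of pandas. The helper `to_xgboost_csv` and the sample rows are hypothetical and are not part of the notebook's script; they only illustrate the expected output, with the label in the first column and no header row.

```python
import csv
import io


def to_xgboost_csv(rows, header, target="target", keep=None):
    """Return CSV text with the target first, the kept features after it, and no header."""
    # By default keep every column except the target, in their original order.
    keep = keep if keep is not None else [c for c in header if c != target]
    index = {name: i for i, name in enumerate(header)}
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow([row[index[target]]] + [row[index[c]] for c in keep])
    return out.getvalue()


# Hypothetical sample: three features plus the target, of which we keep two features.
header = ["CRIM", "ZN", "INDUS", "target"]
rows = [["0.1", "18.0", "2.31", "24.0"], ["0.2", "0.0", "7.07", "21.6"]]
print(to_xgboost_csv(rows, header, keep=["CRIM", "ZN"]))
```

The script in the next cell gets the same layout with pandas by selecting `cols_xgboost` and writing with `header=False, index=False`.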

In [None]:
%%writefile preprocessing.py

import argparse
import os

import pandas as pd
import numpy as np

columns = ['CRIM', 'ZN', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'target']

if __name__=='__main__':
    
    sagemaker_processing_input_path = '/opt/ml/processing/input'
    sagemaker_processing_output_path = '/opt/ml/processing/output'

    parser = argparse.ArgumentParser()
    parser.add_argument('--train-file', type=str, default='boston_train.csv')
    parser.add_argument('--test-file', type=str, default='boston_test.csv')
    parser.add_argument('--input-dir', type=str, default=sagemaker_processing_input_path)
    parser.add_argument('--output-dir', type=str, default=sagemaker_processing_output_path)

    args, _ = parser.parse_known_args()
    print('Received arguments {}'.format(args))

    print('reading data')
    train_df = pd.read_csv(os.path.join(args.input_dir, args.train_file))
    test_df = pd.read_csv(os.path.join(args.input_dir, args.test_file))
        
    cols_xgboost = columns[-1:] + columns[:-1]
    
    train_df = train_df[cols_xgboost]
    test_df = test_df[cols_xgboost]
    
    # Create local output directories
    if not os.path.exists(os.path.join(args.output_dir,'train')):
        os.makedirs(os.path.join(args.output_dir,'train'))
        print('creating the processed train directory')

    if not os.path.exists(os.path.join(args.output_dir,'test')):
        os.makedirs(os.path.join(args.output_dir,'test'))
        print('creating the processed test directory')
    
    output_train_data_path = os.path.join(args.output_dir,'train',args.train_file)
    train_df.to_csv(output_train_data_path,header=False,index=False)
    print('Saved the processed training dataset')

    
    output_test_data_path = os.path.join(args.output_dir,'test',args.test_file)
    test_df.to_csv(output_test_data_path,header=False,index=False)
    print('Saved the processed test dataset')

#### Test the code locally in this notebook environment

This code runs in the environment (Docker image) associated with this notebook. You can also run it from the command line, outside the notebook.

In [None]:
! python preprocessing.py  --input-dir './' \
                           --output-dir './processed' 

#### Process data with Amazon SageMaker Processing Job

You can run a scikit-learn script to do data processing on SageMaker.
The code below runs a processing job using the SKLearnProcessor class from the Amazon SageMaker Python SDK to execute a scikit-learn script that you provide.

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor

region = boto3.session.Session().region_name
role = get_execution_role()
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)

Define the input and output S3 locations for the SageMaker Processing job run with SKLearnProcessor.

In [None]:
input_data_s3 = 's3://{}/sagemaker/xgboostcontainer/raw-data'.format(bucket)
print('Raw dataset at S3 location:',input_data_s3)
output_data_s3_prefix = 's3://{}/sagemaker/xgboostcontainer/processed'.format(bucket)
print('Processed dataset at S3 location:',output_data_s3_prefix)
output_data_s3_train = output_data_s3_prefix + '/train'
output_data_s3_test = output_data_s3_prefix + '/test'

#### Run SageMaker Processing Job with SageMaker SDK

See the SDK reference
https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
current_date = strftime("%Y-%m-%d-%H-%M-%S")

sklearn_processor.run(code='preprocessing.py',
                      inputs=[ProcessingInput(
                        source=input_data_s3,
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='xgboost_train_data',
                                                source='/opt/ml/processing/output/train',
                                               destination = output_data_s3_train),
                               ProcessingOutput(output_name='xgboost_test_data',
                                                source='/opt/ml/processing/output/test',
                                               destination = output_data_s3_test)],
                      experiment_config={ "TrialName": demo_trial.trial_name, "TrialComponentDisplayName": "Preprocessing-{}".format(current_date)}
                     )

### SageMaker Training with built-in XGBoost

Amazon SageMaker provides several built-in machine learning algorithms that you can use for a variety of problem types.
<br>
Using the built-in algorithm version of XGBoost is simpler than using the open source version, because you don’t have to write a training script. 

* Doc https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html
* SDK https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html


In [None]:
from sagemaker.image_uris import retrieve 
from sagemaker.session import Session

# look up the URI of the built-in XGBoost container image for this region.
# set `version` to the XGBoost container version you prefer.
container = retrieve(region=boto3.Session().region_name,
                     framework='xgboost',
                     version='1.0-1')
print(container)

Set the hyperparameters for SageMaker Built-in XGBoost.
<br>
As the objective, we use reg:squarederror here, which indicates a regression task with squared loss.

The list of available hyperparameters can be found here:
<br>
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html

In [None]:
# initialize hyperparameters
hyperparameters = {
        "max_depth":"10",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"reg:squarederror",
        "num_round":"200"}

#### Launching a training job with the Python SDK

In [None]:
# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=container, 
                                          hyperparameters=hyperparameters,
                                          role=role,
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge')



Define the data type and paths to the training and validation datasets

In [None]:
from sagemaker.inputs import TrainingInput
content_type = "csv"
train_input = TrainingInput("s3://{}/sagemaker/xgboostcontainer/processed/{}/".format(bucket, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/sagemaker/xgboostcontainer/processed/{}/".format(bucket, 'test'), content_type=content_type)

Execute the XGBoost training job

In [None]:
current_date = strftime("%Y-%m-%d-%H-%M-%S")

estimator.fit({'train': train_input, 'validation': validation_input},       
              experiment_config={
                "TrialName": demo_trial.trial_name,
                "TrialComponentDisplayName": "Training-{}".format(current_date)},
              wait=False
             )
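
Because `fit` is called with `wait=False`, it returns before training has finished. The sketch below shows one way to block until the job is done, assuming a boto3 SageMaker client such as `sm_boto3` is passed in. The helper name `wait_for_training_job` is hypothetical; `training_job_completed_or_stopped` is a waiter exposed by the boto3 SageMaker client.

```python
def wait_for_training_job(sm_client, job_name):
    """Block until the training job has finished, then return its final status."""
    # The waiter polls DescribeTrainingJob until the job is no longer running.
    waiter = sm_client.get_waiter("training_job_completed_or_stopped")
    waiter.wait(TrainingJobName=job_name)
    description = sm_client.describe_training_job(TrainingJobName=job_name)
    return description["TrainingJobStatus"]
```

In this notebook it could be called as `wait_for_training_job(sm_boto3, estimator.latest_training_job.name)` before deploying the model.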

### Deploy an endpoint for real-time inference with SageMaker SDK 

Here we deploy the trained model to an Amazon SageMaker endpoint with the SageMaker SDK. The training job above was launched with `wait=False`, so make sure it has completed before running the next cell.
<br>
Note that one could also use the more extensive process of creating a model from S3 artifacts, and deploy a model that was trained in a different session or even outside of SageMaker.

In [None]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge', endpoint_name ='xgboost-endpoint',
                             tags=None, wait=False)

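Invoke the endpoint with the boto3 Python SDK. The endpoint was created with `wait=False`, so it must reach the `InService` status before it can be invoked.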

In [None]:
import pandas as pd 
import numpy as np 

runtime = boto3.client('sagemaker-runtime')

prediction_data = np.array([0.09178,0.0,6.416,84.1,2.6463,5.0,296.0,16.6,395.5,9.04]).reshape((1,10))
serialized_data = pd.DataFrame(prediction_data).to_csv(header=False, index=False).encode('utf-8')
print(serialized_data)

In [None]:
# csv serialization
response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    Body=serialized_data,
    ContentType='text/csv')

print(response['Body'].read())
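
The response body is a stream of bytes that can be read only once, so it is usually worth storing the result of `read()` and converting it to numbers. The sketch below assumes the container returns predictions as plain text separated by commas or newlines; the exact separator may depend on the container version, so the hypothetical helper `parse_predictions` accepts both.

```python
def parse_predictions(body: bytes):
    """Convert a CSV-style response body into a list of floats."""
    text = body.decode("utf-8")
    # Treat newlines like commas so both separators are handled the same way.
    tokens = text.replace("\n", ",").split(",")
    return [float(token) for token in tokens if token.strip()]


# Made-up payload, standing in for response['Body'].read()
print(parse_predictions(b"23.5,19.8\n"))
```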

### Batch prediction with batch transform

Run inference when you don't need a persistent endpoint.

In [None]:
ingestedpath = sess.upload_data(
    path='./processed/test/boston_test.csv', bucket=bucket,
    key_prefix='sagemaker/xgboostcontainer/ingested-data')

print('Ingested data stored in S3 at:', ingestedpath)

In [None]:
# The location of the test dataset
batch_input = 's3://{}/sagemaker/xgboostcontainer/ingested-data/'.format(bucket)

# The location to store the results of the batch transform job
batch_output = 's3://{}/sagemaker/xgboostcontainer/batch-predicted-data/'.format(bucket)


# Define a Transformer from the trained XGBoost Estimator
transformer = estimator.transformer(instance_count=1, instance_type='ml.m5.xlarge',
                                            output_path=batch_output,accept='text/csv',assemble_with='Line')

In [None]:
# input_filter="$[1:]" drops the first column (the target) before inference;
# join_source="Input" appends each prediction to its original input record
transformer.transform(data=batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line', 
                      input_filter="$[1:]", join_source= "Input", output_filter="$")

print('Transform job name: ' + transformer.latest_transform_job.job_name)
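
The filters used in the transform call can be hard to picture, so here is a local sketch of what happens to a single CSV record. It is only an illustration in plain Python, not the service implementation, and `simulate_batch_record` and the constant model are hypothetical.

```python
def simulate_batch_record(line, predict):
    """Mimic input_filter="$[1:]", join_source="Input" and output_filter="$" on one CSV line."""
    fields = line.strip().split(",")
    features = fields[1:]                 # input_filter="$[1:]": drop the first column (target)
    prediction = predict(features)        # only the filtered record is sent to the model
    joined = fields + [str(prediction)]   # join_source="Input": append prediction to the full input
    return ",".join(joined)               # output_filter="$": keep the whole joined record


def constant_model(features):
    """Stand-in for the deployed model; always returns the same made-up value."""
    return 23.5


print(simulate_batch_record("24.0,0.09,0.0,6.4", constant_model))
```

Each output line therefore contains the original target, the features, and the prediction side by side, which makes it easy to compare predicted and actual values.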

### SageMaker Hyperparameters Tuning with Built-in XGBoost

Check out the SageMaker documentation for How Hyperparameter Tuning Works
<br>
https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html

As with the SageMaker training job SDK, we configure the SageMaker estimator here and pre-set the hyperparameters that we consider fixed (no need to tune).

In [None]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m5.xlarge',
                                    sagemaker_session=sess)

xgb.set_hyperparameters(objective='reg:squarederror',
                        num_round=50,
                        rate_drop=0.3)

Given an objective metric and a set of the hyperparameters to be tuned, the tuning job optimizes a model for the metric that you choose.

<br>
For this regression problem, we use Root Mean Square Error (RMSE) on the validation set as the objective metric of the tuning job; the best job is the one that minimizes this error.
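
As a reminder of what the metric measures, here is a minimal RMSE sketch in plain Python. The function name `rmse` and the sample values are illustrative only; during tuning the metric is computed by the training container and reported as `validation:rmse`.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Square Error between two equally long sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: one prediction is off by 2, the other is exact.
print(rmse([24.0, 21.6], [22.0, 21.6]))
```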

In [None]:
objective_metric_name = 'validation:rmse'
objective_type = 'Minimize'

We perform automatic model tuning with the following hyperparameters:

- eta: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter actually shrinks the feature weights to make the boosting process more conservative.
- alpha: L1 regularization term on weights. Increasing this value makes models more conservative.
- min_child_weight: Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is.
- max_depth: Maximum depth of a tree. Increasing this value makes the model more complex and likely to be overfitted.
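
To illustrate the effect of eta, the toy sketch below boosts the simplest possible learner, a constant fitted to the current residuals, under squared loss. This is not XGBoost's tree learner and the function `boost_constant` is hypothetical; it only shows how a smaller eta makes each boosting step more conservative.

```python
def boost_constant(y, eta, num_round):
    """Boost a constant learner on squared error; return the prediction after each round."""
    prediction = 0.0
    history = []
    for _ in range(num_round):
        residuals = [target - prediction for target in y]
        step = sum(residuals) / len(residuals)  # best constant fit to the residuals
        prediction += eta * step                # eta shrinks the contribution of each step
        history.append(prediction)
    return history


y = [10.0, 20.0, 30.0]
for eta in (1.0, 0.2):
    print(eta, [round(p, 2) for p in boost_constant(y, eta, 3)])
```

With eta set to 1.0 the prediction jumps to the mean of the targets in a single round, while with 0.2 it closes only a fifth of the remaining gap per round, which is why smaller values usually need a larger `num_round`.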

In [None]:
hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
                        'min_child_weight': ContinuousParameter(1, 10),
                        'alpha': ContinuousParameter(0, 2),
                        'max_depth': IntegerParameter(1, 10)}

#### Launch the SageMaker hyperparameter tuning job

In [None]:
tuner = HyperparameterTuner(xgb,
                            objective_metric_name=objective_metric_name,
                            objective_type=objective_type,
                            hyperparameter_ranges=hyperparameter_ranges,
                            max_jobs=4,
                            max_parallel_jobs=2)

In [None]:
from sagemaker.inputs import TrainingInput
content_type = "csv"
train_input = TrainingInput("s3://{}/sagemaker/xgboostcontainer/processed/{}/".format(bucket, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/sagemaker/xgboostcontainer/processed/{}/".format(bucket, 'test'), content_type=content_type)

In [None]:
tuner.fit({'train': train_input, 'validation': validation_input},
          include_cls_metadata=False,wait=False)

#### Fetch results about a hyperparameter tuning job and make them accessible for analytics

In [None]:
# get tuner results in a df (the tuning job was launched with wait=False, so rerun this cell to refresh)
results = tuner.analytics().dataframe()
results.head(16)
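
Once the results are available, a common next step is to pick the best training job. The sketch below works on a list of dictionaries and assumes each record has the keys `TrainingJobName`, `TrainingJobStatus` and `FinalObjectiveValue`, which is what the analytics dataframe is expected to contain. The helper `best_training_job` and the sample records are hypothetical.

```python
def best_training_job(records, minimize=True):
    """Return the completed record with the best objective value, or None if there is none."""
    completed = [r for r in records if r.get("TrainingJobStatus") == "Completed"]
    if not completed:
        return None
    pick = min if minimize else max
    return pick(completed, key=lambda r: r["FinalObjectiveValue"])


# Made-up records, for illustration only.
sample_records = [
    {"TrainingJobName": "job-a", "TrainingJobStatus": "Completed", "FinalObjectiveValue": 3.4},
    {"TrainingJobName": "job-b", "TrainingJobStatus": "Completed", "FinalObjectiveValue": 2.9},
    {"TrainingJobName": "job-c", "TrainingJobStatus": "InProgress", "FinalObjectiveValue": None},
]
print(best_training_job(sample_records)["TrainingJobName"])
```

With the dataframe above, the records could be obtained with `results.to_dict("records")`; since the objective type is `Minimize`, the default `minimize=True` applies.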

### The end: please don't forget to delete the endpoint!

In [None]:
sm_boto3.delete_endpoint(EndpointName=predictor.endpoint_name)