## Explore, Develop, Train, Optimize and Deploy Built-in algorithm XGBoost


* Doc https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html
* SDK https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html

In this notebook, we show how to use Amazon SageMaker to develop, train, tune and deploy a built-in XGBoost model. We continue to use the Boston Housing dataset, present in Scikit-Learn: https://scikit-learn.org/stable/datasets/index.html#boston-dataset

**This sample is provided for demonstration purposes. Make sure to conduct appropriate testing if you derive code from it for your own use cases!**

In [14]:
import datetime
import tarfile

import boto3
from sagemaker import get_execution_role
import sagemaker



sm_boto3 = boto3.client('sagemaker')

sess = sagemaker.Session()

region = sess.boto_session.region_name

bucket = sess.default_bucket()  # this could also be a hard-coded bucket name

print('Using bucket ' + bucket)

Using bucket sagemaker-eu-west-1-707684582322


### Prerequisites: prepare the dataset
#### We load a dataset from sklearn library, split it and send it to S3. 

If you have executed the previous demo, boston_train.csv and boston_test.csv should already be in the current path. 
<br>
Otherwise, run the following cell, which loads the Boston dataset with scikit-learn and saves it to the current path. 

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np

data = load_boston()

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX['target'] = y_test

trainX.to_csv('boston_train.csv')
testX.to_csv('boston_test.csv')

In [16]:
trainX.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.09103,0.0,2.46,0.0,0.488,7.155,92.2,2.7006,3.0,193.0,17.8,394.12,4.82,37.9
1,3.53501,0.0,19.58,1.0,0.871,6.152,82.6,1.7455,5.0,403.0,14.7,88.01,15.02,15.6
2,0.03578,20.0,3.33,0.0,0.4429,7.82,64.5,4.6947,5.0,216.0,14.9,387.31,3.76,45.4
3,0.38735,0.0,25.65,0.0,0.581,5.613,95.6,1.7572,2.0,188.0,19.1,359.29,27.26,15.7
4,0.06724,0.0,3.24,0.0,0.46,6.333,17.2,5.2146,4.0,430.0,16.9,375.21,7.34,22.6


In [17]:
trainX.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
count,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0
mean,3.468655,11.596306,11.119974,0.076517,0.558326,6.323496,69.14723,3.76344,9.361478,405.311346,18.263852,358.767124,12.513298,22.907916
std,8.313983,23.093394,6.953401,0.266175,0.119118,0.720086,27.703149,2.112633,8.601322,166.060463,2.263954,87.511867,7.14769,9.429546
min,0.00906,0.0,1.21,0.0,0.385,3.863,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.08193,0.0,5.13,0.0,0.453,5.89,46.25,2.0754,4.0,279.0,16.6,376.14,6.865,16.9
50%,0.26938,0.0,9.69,0.0,0.538,6.226,78.1,3.1121,5.0,330.0,18.6,391.34,11.22,21.7
75%,3.242325,20.0,18.1,0.0,0.639,6.6645,93.85,5.25095,16.0,666.0,20.2,395.76,16.395,26.6
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


### Data Wrangler for data exploration

Upload the dataset to S3 as input data for this demo

In [18]:
# send data to S3. SageMaker will take training data from s3
trainpath = sess.upload_data(
    path='boston_train.csv', bucket=bucket,
    key_prefix='sagemaker/xgboostcontainer/raw-data')

testpath = sess.upload_data(
    path='boston_test.csv', bucket=bucket,
    key_prefix='sagemaker/xgboostcontainer/raw-data')

print('Raw dataset will be stored S3 at:', trainpath)
print('Raw dataset will be stored S3 at:', testpath)

Raw dataset will be stored S3 at: s3://sagemaker-eu-west-1-707684582322/sagemaker/xgboostcontainer/raw-data/boston_train.csv
Raw dataset will be stored S3 at: s3://sagemaker-eu-west-1-707684582322/sagemaker/xgboostcontainer/raw-data/boston_test.csv


#### Amazon SageMaker Experiments – Organize, Track And Compare Your Machine Learning Trainings

In [19]:
import sys
!{sys.executable} -m pip install sagemaker-experiments



In [20]:
import time
from time import strftime

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

experiment_name = "Boston-Housing-Prediction"
demo_experiment = Experiment.create(experiment_name = experiment_name,
                                    description = "Demo experiment using SageMaker to organize, track and compare training runs"
                                   )

ClientError: An error occurred (ValidationException) when calling the CreateExperiment operation: Experiment names must be unique within an AWS account and region. Experiment with name (Boston-Housing-Prediction) already exists.
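
The `ClientError` above is what a re-run of this cell produces, because experiment names must be unique within an account and region. One possible way to make the cell re-runnable is to fall back to loading the existing experiment. The sketch below assumes that `Experiment` exposes a `load(experiment_name=...)` classmethod and that the duplicate-name error contains the text "already exists", as in the message above. The helper name `get_or_create_experiment` is hypothetical, and the class is passed in as a parameter so the helper can be tried without AWS credentials.

```python
def get_or_create_experiment(experiment_cls, experiment_name, description):
    """Create the experiment, or load it when the name is already taken.

    `experiment_cls` is expected to be `smexperiments.experiment.Experiment`
    (assumed to provide `create` and `load` classmethods).
    """
    try:
        return experiment_cls.create(
            experiment_name=experiment_name, description=description
        )
    except Exception as err:
        # Assumption: a duplicate name is reported with "already exists" in the
        # error message, as in the ValidationException shown above.
        if "already exists" not in str(err):
            raise
        return experiment_cls.load(experiment_name=experiment_name)
```

In this notebook the call would look like `demo_experiment = get_or_create_experiment(Experiment, experiment_name, "Demo experiment")`. Any other error is re-raised unchanged.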

In [21]:
create_date = strftime("%Y-%m-%d-%H-%M-%S")

demo_trial = Trial.create(trial_name = "Boston-Housing-XGBoost-Trial-{}".format(create_date),
                          experiment_name = experiment_name
                         )

In [22]:
with Tracker.create(display_name="Dataset", sagemaker_boto_client=sm_boto3) as tracker:
    tracker.log_parameters({
        "train-test-split": 75  # percent of rows used for training (test_size=0.25 above)
    })
    # we can log the s3 uri to the dataset we just uploaded
    tracker.log_input(name="boston-housing-training-dataset", media_type="s3/uri", value=trainpath)
    tracker.log_input(name="boston-housing-test-dataset", media_type="s3/uri", value=testpath)

In [23]:
dataset_trial_component = tracker.trial_component
demo_trial.add_trial_component(dataset_trial_component)

## Data preprocessing with Amazon SageMaker Processing
Amazon SageMaker Processing allows you to run steps for data pre- or post-processing, feature engineering, data validation, or model evaluation workloads on Amazon SageMaker.

* Doc https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html
* SDK https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html


#### Write preprocessing script with scikit-learn

This simple script converts the data into the format expected by SageMaker built-in XGBoost: it moves the target column to the front of the training and test datasets, drops some columns, and writes the files without a header or index. In real-world cases, you can imagine a more complete preprocessing setup with Amazon SageMaker Processing. 
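
To make the target layout concrete, here is a minimal standard-library sketch of the same transformation on two rows: target first, then the selected features, no header line. The helper name `to_xgboost_csv` is hypothetical and is not part of the notebook's script, which uses pandas instead. The sample values come from the `trainX.head()` output above, reduced to two features for brevity.

```python
import csv
import io


def to_xgboost_csv(rows, feature_names, target_name="target"):
    """Return headerless CSV text with the target first, then the features."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[target_name]] + [row[name] for name in feature_names])
    return buffer.getvalue()


# Two rows taken from the trainX.head() output, keeping only CRIM and RM.
rows = [
    {"CRIM": 0.09103, "RM": 7.155, "target": 37.9},
    {"CRIM": 3.53501, "RM": 6.152, "target": 15.6},
]
print(to_xgboost_csv(rows, feature_names=["CRIM", "RM"]))
```

Columns that are not listed in `feature_names` are dropped, which mirrors how the script below keeps only the columns named in its `columns` list.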

In [12]:
%%writefile preprocessing.py

import argparse
import os

import pandas as pd
import numpy as np

columns = ['CRIM', 'ZN', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'target']

if __name__=='__main__':
    
    sagemaker_processing_input_path = '/opt/ml/processing/input'
    sagemaker_processing_output_path = '/opt/ml/processing/output'

    parser = argparse.ArgumentParser()
    parser.add_argument('--train-file', type=str, default='boston_train.csv')
    parser.add_argument('--test-file', type=str, default='boston_test.csv')
    parser.add_argument('--input-dir', type=str, default=sagemaker_processing_input_path)
    parser.add_argument('--output-dir', type=str, default=sagemaker_processing_output_path)

    args, _ = parser.parse_known_args()
    print('Received arguments {}'.format(args))

    print('reading data')
    train_df = pd.read_csv(os.path.join(args.input_dir, args.train_file))
    test_df = pd.read_csv(os.path.join(args.input_dir, args.test_file))
        
    cols_xgboost = columns[-1:] + columns[:-1]
    
    train_df = train_df[cols_xgboost]
    test_df = test_df[cols_xgboost]
    
    # Create local output directories
    if not os.path.exists(os.path.join(args.output_dir,'train')):
        os.makedirs(os.path.join(args.output_dir,'train'))
        print('creating the processed train directory')

    if not os.path.exists(os.path.join(args.output_dir,'test')):
        os.makedirs(os.path.join(args.output_dir,'test'))
        print('creating the processed test directory')
    
    output_train_data_path = os.path.join(args.output_dir,'train',args.train_file)
    train_df.to_csv(output_train_data_path,header=False,index=False)
    print('Saved the processed training dataset')

    
    output_test_data_path = os.path.join(args.output_dir,'test',args.test_file)
    test_df.to_csv(output_test_data_path,header=False,index=False)
    print('Saved the processed test dataset')

Writing preprocessing.py


#### Test the code locally in this notebook environment

In [13]:
! python preprocessing.py  --input-dir './' \
                           --output-dir './processed' 

Received arguments Namespace(input_dir='./', output_dir='./processed', test_file='boston_test.csv', train_file='boston_train.csv')
reading data
creating the processed train directory
creating the processed test directory
Saved the processed training dataset
Saved the processed test dataset


#### Process data with Amazon SageMaker Processing Job

You can run a scikit-learn script to do data processing on SageMaker.
The code below runs a processing job using the SKLearnProcessor class from the Amazon SageMaker Python SDK to execute a scikit-learn script that you provide. 

In [24]:
from sagemaker.sklearn.processing import SKLearnProcessor

region = boto3.session.Session().region_name
role = get_execution_role()
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)

INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Defaulting to only available Python version: py3


Define the input and output S3 locations for the SageMaker Processing Job with SKLearnProcessor

In [25]:
input_data_s3 = 's3://{}/sagemaker/xgboostcontainer/raw-data'.format(bucket)
print('Raw dataset at S3 location:',input_data_s3)
output_data_s3_prefix = 's3://{}/sagemaker/xgboostcontainer/processed'.format(bucket)
print('Processed dataset at S3 location:',output_data_s3_prefix)
output_data_s3_train = output_data_s3_prefix + '/train'
output_data_s3_test = output_data_s3_prefix + '/test'

Raw dataset at S3 location: s3://sagemaker-eu-west-1-707684582322/sagemaker/xgboostcontainer/raw-data
Processed dataset at S3 location: s3://sagemaker-eu-west-1-707684582322/sagemaker/xgboostcontainer/processed


#### Run SageMaker Processing Job with SageMaker SDK

See the SDK reference
https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html

In [26]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
current_date = strftime("%Y-%m-%d-%H-%M-%S")

sklearn_processor.run(code='preprocessing.py',
                      inputs=[ProcessingInput(
                        source=input_data_s3,
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='xgboost_train_data',
                                                source='/opt/ml/processing/output/train',
                                               destination = output_data_s3_train),
                               ProcessingOutput(output_name='xgboost_test_data',
                                                source='/opt/ml/processing/output/test',
                                               destination = output_data_s3_test)],
                      experiment_config={ "TrialName": demo_trial.trial_name, "TrialComponentDisplayName": "Preprocessing-{}".format(current_date)}
                     )

INFO:sagemaker:Creating processing-job with name sagemaker-scikit-learn-2021-02-03-16-09-25-286



Job Name:  sagemaker-scikit-learn-2021-02-03-16-09-25-286
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-707684582322/sagemaker/xgboostcontainer/raw-data', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-707684582322/sagemaker-scikit-learn-2021-02-03-16-09-25-286/input/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'xgboost_train_data', 'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-707684582322/sagemaker/xgboostcontainer/processed/train', 'LocalPath': '/opt/ml/processing/output/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'xgboost_test_data', 'S3Output': {'S3Uri': 's3://sagemaker-eu-wes

## SageMaker Training with built-in XGBoost

Amazon SageMaker provides several built-in machine learning algorithms that you can use for a variety of problem types.
<br>
Using the built-in algorithm version of XGBoost is simpler than using the open source version, because you don’t have to write a training script. 


In [27]:
from sagemaker.image_uris import retrieve 
from sagemaker.session import Session

# Look up the URI of the SageMaker-managed XGBoost container image for this region.
# Set `version` to the XGBoost container version you want to use.
container = retrieve(region=boto3.Session().region_name,
                     framework='xgboost',
                     version='1.0-1')
print(container)

INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Defaulting to only available Python version: py3
INFO:sagemaker.image_uris:Defaulting to only supported image scope: cpu.


141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3


Set the hyperparameters for SageMaker Built-in XGBoost.
<br>
For the learning objective, we use reg:squarederror, which specifies a regression task with squared loss. 

List of available hyperparameters can be found here 
<br>
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
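
For reference, a sketch of what the squared-loss objective looks like per training example, in generic notation introduced here (label $y_i$, current prediction $\hat{y}_i$, first and second derivatives $g_i$ and $h_i$ with respect to $\hat{y}_i$). The factor $\tfrac{1}{2}$ is the usual convention and is an assumption about the exact scaling:

$$
\ell(y_i, \hat{y}_i) = \tfrac{1}{2}\,(\hat{y}_i - y_i)^2,
\qquad
g_i = \frac{\partial \ell}{\partial \hat{y}_i} = \hat{y}_i - y_i,
\qquad
h_i = \frac{\partial^2 \ell}{\partial \hat{y}_i^2} = 1
$$

Because $h_i = 1$ for every example under this loss, the sum of hessians in a tree node equals the number of examples in that node, so `min_child_weight` behaves like a minimum example count per leaf.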

In [28]:
# initialize hyperparameters
hyperparameters = {
        "max_depth":"10",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"reg:squarederror",
        "num_round":"200"}

#### Launching a training job with the Python SDK

In [29]:
# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=container, 
                                          hyperparameters=hyperparameters,
                                          role=role,
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge')



Define the data type and paths to the training and validation datasets

In [30]:
from sagemaker.inputs import TrainingInput
content_type = "csv"
train_input = TrainingInput("s3://{}/sagemaker/xgboostcontainer/processed/{}/".format(bucket, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/sagemaker/xgboostcontainer/processed/{}/".format(bucket, 'test'), content_type=content_type)

Execute the XGBoost training job

In [31]:
current_date = strftime("%Y-%m-%d-%H-%M-%S")

estimator.fit({'train': train_input, 'validation': validation_input},       
              experiment_config={
                "TrialName": demo_trial.trial_name,
                "TrialComponentDisplayName": "Training-{}".format(current_date)},
              wait=False
             )

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2021-02-03-16-14-05-120


#### Deploy with SageMaker SDK

Here we deploy the trained model to an Amazon SageMaker endpoint with the SageMaker SDK. 
<br>
Note that one could also use the more extensive process of creating a model from S3 artifacts, and deploy a model that was trained in a different session or even outside of SageMaker.

In [66]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge', endpoint_name ='xgboost-endpoint',
                             tags=None, wait=False)

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2021-02-03-17-35-06-534
INFO:sagemaker:Creating endpoint with name xgboost-endpoint


ClientError: An error occurred (ValidationException) when calling the CreateEndpoint operation: The provided tags "Tag(tagname, tagvalue),Tag(sagemaker:project-id, p-1llgzzxekxpq),Tag(sagemaker:project-name, mlops-cicd-demo),Tag(sagemaker:project-id, p-1llgzzxekxpq),Tag(sagemaker:project-name, mlops-cicd-demo)" must not have duplicate keys.

#### Invoke with boto3 python SDK

In [31]:
import pandas as pd 
import numpy as np 

runtime = boto3.client('sagemaker-runtime')

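# NOTE: this payload holds all 13 raw features, while preprocessing.py keeps only 10 of them.
# For a model trained on the processed data, send those 10 features in the training column order.
prediction_data = np.array([0.09178,0.0,4.05,0.0,0.51,6.416,84.1,2.6463,5.0,296.0,16.6,395.5,9.04]).reshape((1,13))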
serialized_data = pd.DataFrame(prediction_data).to_csv(header=False, index=False).encode('utf-8')
print(serialized_data)

b'0.09178,0.0,4.05,0.0,0.51,6.416,84.1,2.6463,5.0,296.0,16.6,395.5,9.04\n'


In [35]:
# csv serialization
response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    Body=serialized_data,
    ContentType='text/csv')

print(response['Body'].read())

b'26.668699264526367'
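
The request body is a single headerless CSV line without the label, and the response body is the prediction as text. As a small sketch of both directions without pandas, the helpers below build the payload and parse the reply. The names `to_csv_payload` and `parse_prediction` are hypothetical, and the parser assumes a single-row request like the one above.

```python
def to_csv_payload(features):
    """Serialize one feature vector as a single headerless CSV line (UTF-8 bytes)."""
    return (",".join(str(float(value)) for value in features) + "\n").encode("utf-8")


def parse_prediction(body):
    """Convert the raw response bytes of a single-row request into a float."""
    return float(body.decode("utf-8"))


features = [0.09178, 0.0, 4.05, 0.0, 0.51, 6.416, 84.1, 2.6463, 5.0, 296.0, 16.6, 395.5, 9.04]
payload = to_csv_payload(features)
print(payload)

# Parse the response bytes printed by the cell above.
print(parse_prediction(b"26.668699264526367"))
```

The resulting `payload` can be passed as `Body` to `runtime.invoke_endpoint` with `ContentType='text/csv'`, in place of the pandas-based `serialized_data`.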


## Don't forget to delete the endpoint!

In [None]:
# `predictor` is the object returned by estimator.deploy() above
sm_boto3.delete_endpoint(EndpointName=predictor.endpoint_name)

## Batch prediction with batch transform

In [37]:
ingestedpath = sess.upload_data(
    path='./processed/test/boston_test.csv', bucket=bucket,
    key_prefix='sagemaker/xgboostcontainer/ingested-data')

print('Ingested data will be stored S3 at:', ingestedpath)

Ingested data will be stored S3 at: s3://sagemaker-eu-west-1-707684582322/sagemaker/xgboostcontainer/ingested-data/boston_test.csv


In [38]:
# The location of the test dataset
batch_input = 's3://{}/sagemaker/xgboostcontainer/ingested-data/'.format(bucket)

# The location to store the results of the batch transform job
batch_output = 's3://{}/sagemaker/xgboostcontainer/batch-predicted-data/'.format(bucket)


# Define a Transformer from the trained XGBoost Estimator
transformer = estimator.transformer(instance_count=1, instance_type='ml.m5.xlarge',
                                            output_path=batch_output,accept='text/csv',assemble_with='Line')

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2021-02-03-16-25-37-387


In [43]:
#transformer.transform(data=batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line', input_filter='$[:29]')
transformer.transform(data=batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line', 
                      input_filter="$[1:]", join_source= "Input", output_filter="$")

print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)

INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2021-02-03-16-40-38-028


...........................
[2021-02-03:16:45:00:INFO] No GPUs detected (normal if no gpus installed)
[2021-02-03:16:45:00:INFO] No GPUs detected (normal if no gpus installed)
[2021-02-03:16:45:00:INFO] nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;

worker_rlimit_nofile 4096;

events {
  worker_connections 2048;
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;

  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }

  server {
    listen 8080 deferred;
    client_max_body_size 0;

    keepalive_timeout 3;

    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gunicorn;
    }

    l
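
The `input_filter`, `join_source` and `output_filter` arguments passed to `transformer.transform` above control which columns reach the model and what ends up in the output file. The sketch below emulates their effect on one CSV line in plain Python. It is an illustration of the semantics, not the SageMaker implementation, and `emulate_batch_line` and `fake_predict` are hypothetical names.

```python
def emulate_batch_line(line, predict):
    """Mimic input_filter="$[1:]", join_source="Input", output_filter="$" for one CSV line."""
    columns = line.rstrip("\n").split(",")
    # input_filter="$[1:]": drop column 0 (the target) before calling the model
    model_input = ",".join(columns[1:])
    prediction = predict(model_input)
    # join_source="Input": append the prediction to the original, unfiltered record
    joined = columns + [str(prediction)]
    # output_filter="$": keep the whole joined record
    return ",".join(joined)


def fake_predict(csv_features):
    """Hypothetical stand-in for the model: returns a constant, not a real prediction."""
    return 20.0


print(emulate_batch_line("37.9,0.09103,0.0,7.155\n", fake_predict))
```

Each output line therefore carries the target, the features and the prediction side by side, which makes it straightforward to compare predictions against the known targets of the test set.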

## SageMaker Hyperparameters Tuning with Built-in XGBoost

Check out the SageMaker documentation for How Hyperparameter Tuning Works
<br>
https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html

As with the training job above, we configure the SageMaker estimator here and pre-set the hyperparameters that we consider fixed (no need to tune).  

In [47]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m5.xlarge',
                                    sagemaker_session=sess)

xgb.set_hyperparameters(objective='reg:squarederror',
                        num_round=50,
                        rate_drop=0.3)

Given an objective metric and a set of the hyperparameters to be tuned, the tuning job optimizes a model for the metric that you choose.
<br>
For a regression problem, we use Root Mean Square Error (RMSE) on the validation set as the objective metric for the tuning job; the best job is the one that minimizes this error.  
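
As a reminder of what the metric measures, here is a minimal sketch of the standard RMSE definition in plain Python. The function name `rmse` and the sample values are illustrative only. In the tuning job itself, the `validation:rmse` value is emitted by the XGBoost container for the validation channel, not computed by notebook code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equally long, non-empty sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values: one prediction is off by 2, the other is exact.
print(rmse([3.0, 5.0], [1.0, 5.0]))
```

RMSE is expressed in the same unit as the target, so a lower value means predictions that are closer to the true values on average.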

In [48]:
objective_metric_name = 'validation:rmse'
objective_type = 'Minimize'

We perform automatic model tuning with following hyperparameters

- eta: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter actually shrinks the feature weights to make the boosting process more conservative.
- alpha: L1 regularization term on weights. Increasing this value makes models more conservative.
- min_child_weight: Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is.
- max_depth: Maximum depth of a tree. Increasing this value makes the model more complex and likely to be overfitted.
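
To illustrate the shrinkage role of `eta` described in the list above, the sketch below runs gradient boosting for squared error with the simplest possible base learner: each round fits a single constant (the mean residual) instead of a tree. This is a deliberately simplified stand-in, not XGBoost itself, and starting from a prediction of 0.0 is an assumption made for the example. The function name `boost_constant` is hypothetical.

```python
def boost_constant(targets, eta, num_round):
    """Boost a constant predictor for squared error and return the prediction after each round."""
    prediction = 0.0
    history = []
    for _ in range(num_round):
        residuals = [t - prediction for t in targets]
        step = sum(residuals) / len(residuals)  # best constant fit to the residuals
        prediction += eta * step                # eta shrinks the contribution of each round
        history.append(prediction)
    return history


targets = [10.0, 20.0, 30.0]
print(boost_constant(targets, eta=1.0, num_round=3))
print(boost_constant(targets, eta=0.2, num_round=3))
```

With `eta=1.0` the first round already lands on the mean of the targets, while with `eta=0.2` each round closes only 20% of the remaining gap. Smaller values therefore need more rounds (`num_round`) but make each individual step more conservative.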

In [56]:
hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
                        'min_child_weight': ContinuousParameter(1, 10),
                        'alpha': ContinuousParameter(0, 2),
                        'max_depth': IntegerParameter(1, 10)}

#### Launch the SageMaker hyperparameter tuning job

In [57]:
tuner = HyperparameterTuner(xgb,
                            objective_metric_name=objective_metric_name,
                            objective_type=objective_type,
                            hyperparameter_ranges=hyperparameter_ranges,
                            max_jobs=4,
                            max_parallel_jobs=2)

In [58]:
from sagemaker.inputs import TrainingInput
content_type = "csv"
train_input = TrainingInput("s3://{}/sagemaker/xgboostcontainer/processed/{}/".format(bucket, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/sagemaker/xgboostcontainer/processed/{}/".format(bucket, 'test'), content_type=content_type)

In [60]:
tuner.fit({'train': train_input, 'validation': validation_input},
          include_cls_metadata=False,wait=False)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating hyperparameter tuning job with name: sagemaker-xgboost-210203-1723


#### Fetch results about a hyperparameter tuning job and make them accessible for analytics

In [None]:
# get tuner results in a df
results = tuner.analytics().dataframe()
results.head(16)
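
Once the tuning job has finished, the row with the lowest objective value identifies the best training job. The sketch below works on plain dictionaries, such as those produced by `results.to_dict("records")`, so that it can be tried without AWS access. It assumes the analytics DataFrame has `TrainingJobName` and `FinalObjectiveValue` columns. The helper name `best_job` is hypothetical and the sample records are placeholders, not results from this notebook.

```python
def best_job(records, metric_key="FinalObjectiveValue", minimize=True):
    """Return the record with the best objective value, skipping missing or NaN values."""
    scored = [
        r for r in records
        if r.get(metric_key) is not None and r[metric_key] == r[metric_key]  # NaN != NaN
    ]
    if not scored:
        raise ValueError("no job has reported the objective metric yet")
    chooser = min if minimize else max
    return chooser(scored, key=lambda r: r[metric_key])


# Placeholder records for illustration only.
example_records = [
    {"TrainingJobName": "job-a", "FinalObjectiveValue": 4.2},
    {"TrainingJobName": "job-b", "FinalObjectiveValue": 3.1},
    {"TrainingJobName": "job-c", "FinalObjectiveValue": float("nan")},
]
print(best_job(example_records)["TrainingJobName"])
```

`minimize=True` matches the `objective_type = 'Minimize'` setting used for `validation:rmse` in this notebook.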