# Regression with Amazon SageMaker XGBoost algorithm (SDK V3)

---


_**Single machine training for regression with Amazon SageMaker XGBoost algorithm - Migrated to SDK V3**_

---
## Contents
1. [Introduction](#Introduction)
2. [Setup](#Setup)
3. [Training the XGBoost model](#Training-the-XGBoost-model)
  1. [Training with ModelTrainer (V3)](#Training-with-ModelTrainer-(V3))
  2. [Tuning with HyperparameterTuner (V3)](#Tuning-with-HyperparameterTuner-(V3))
4. [Set up hosting for the model](#Set-up-hosting-for-the-model)
  1. [Deploy with ModelBuilder (V3)](#Deploy-with-ModelBuilder-(V3))
5. [Validate the model for use](#Validate-the-model-for-use)
6. [Appendix](#Appendix)
  1. [Data split and upload](#Data-split-and-upload)
  2. [Data ingestion](#Data-ingestion)

---
## Introduction

This notebook demonstrates the use of Amazon SageMaker's implementation of the XGBoost algorithm to train and host a regression model using **SageMaker SDK V3**. We use the [Abalone data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html) originally from the UCI data repository [1]. More details about the original dataset can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names). In the libsvm-converted [version](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html), the nominal feature (Male/Female/Infant) has been converted into a real-valued feature. The age of an abalone is predicted from eight physical measurements. The dataset has already been processed and stored on S3. The scripts used to process the data can be found in the [Appendix](#Appendix). They cover downloading the data, splitting it into train, validation, and test sets, and uploading the results to an S3 bucket.

>[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
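
To make the libsvm format mentioned above concrete, here is a minimal sketch of how one record can be read in plain Python. The helper name `parse_libsvm_line` and the sample record are illustrative assumptions, not part of the SageMaker SDK or of this notebook's pipeline; each line is a label followed by `index:value` pairs.

```python
def parse_libsvm_line(line):
    """Split one libsvm-formatted line into (label, {feature_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# Illustrative record in the same shape as the Abalone libsvm file:
# the label (age target) comes first, followed by eight index:value features.
sample = "15 1:1 2:0.455 3:0.365 4:0.095 5:0.514 6:0.2245 7:0.101 8:0.15"
label, features = parse_libsvm_line(sample)
print(f"label={label}, number of features={len(features)}")
```

The training and inference cells in this notebook pass such lines through unchanged, so no parsing is required to follow the rest of the walkthrough.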

## Setup


This notebook was tested in Amazon SageMaker Studio on an ml.t3.medium instance with the Python 3 (Data Science) kernel.

Let's start by specifying:
1. The S3 buckets and prefixes to use for the training data and for saving the model. These should be in the same region as the notebook instance, training, and hosting.
1. The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create one. Note that if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

In [None]:
!pip3 install -U sagemaker

In [None]:
%%time

import os
import boto3
import re

# V3 imports
from sagemaker.core.helper.session_helper import Session, get_execution_role

role = get_execution_role()

# TODO: if get_execution_role() does not return the role you need, set your
# SageMaker Domain or user execution role ARN explicitly, for example:
# role = "arn:aws:iam::<account-id>:role/<role-name>"

region = boto3.Session().region_name
s3_client = boto3.client("s3")
sagemaker_session = Session()

# S3 bucket where the training data is located.
data_bucket = "sagemaker-sample-files"
data_prefix = "datasets/tabular/uci_abalone"
data_bucket_path = f"s3://{data_bucket}"

# S3 bucket for saving code and model artifacts.
# Handle default_bucket carefully - it may be a property or method
output_bucket = sagemaker_session.default_bucket
if callable(output_bucket):
    output_bucket = output_bucket()

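# Fall back to the conventional default bucket name and create it if it is missing
if not isinstance(output_bucket, str) or not output_bucket:
    from botocore.exceptions import ClientError

    account_id = boto3.client("sts").get_caller_identity()["Account"]
    output_bucket = f"sagemaker-{region}-{account_id}"
    try:
        s3_client.head_bucket(Bucket=output_bucket)
    except ClientError as err:
        # head_bucket reports a missing bucket as a ClientError with code "404";
        # re-raise anything else (for example, 403 access denied)
        if err.response["Error"]["Code"] not in ("404", "NoSuchBucket"):
            raise
        if region == "us-east-1":
            s3_client.create_bucket(Bucket=output_bucket)
        else:
            s3_client.create_bucket(
                Bucket=output_bucket,
                CreateBucketConfiguration={"LocationConstraint": region},
            )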

output_prefix = "sagemaker/DEMO-xgboost-abalone-v3"
output_bucket_path = f"s3://{output_bucket}"

for data_category in ["train", "test", "validation"]:
    data_key = "{0}/{1}/abalone.{1}".format(data_prefix, data_category)
    output_key = "{0}/{1}/abalone.{1}".format(output_prefix, data_category)
    data_filename = "abalone.{}".format(data_category)
    s3_client.download_file(data_bucket, data_key, data_filename)
    s3_client.upload_file(data_filename, output_bucket, output_key)
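
To see which object keys the copy loop above reads from and writes to, the key formatting can be reproduced offline. This is a small sketch that reuses the two prefixes from the cell above; the helper name `abalone_keys` is hypothetical and no S3 call is made.

```python
data_prefix = "datasets/tabular/uci_abalone"
output_prefix = "sagemaker/DEMO-xgboost-abalone-v3"


def abalone_keys(data_category):
    """Return the (source key, destination key) pair for one data split."""
    data_key = f"{data_prefix}/{data_category}/abalone.{data_category}"
    output_key = f"{output_prefix}/{data_category}/abalone.{data_category}"
    return data_key, output_key


for category in ["train", "test", "validation"]:
    source_key, destination_key = abalone_keys(category)
    print(f"{source_key} -> {destination_key}")
```

Each split therefore lands under its own `<output_prefix>/<split>/` folder, which is the prefix the training channels point at later.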

## Training the XGBoost model

After setting training parameters, we kick off training, and poll for status until training is completed.

Training can be done by either calling SageMaker Training with a set of hyperparameters values to train with, or by leveraging SageMaker Automatic Model Tuning ([AMT](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)). AMT, also known as hyperparameter tuning (HPO), finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.

In this notebook, both methods are used for demonstration purposes, but the model that the HPO job creates is the one that is eventually hosted. To deploy the model created by the standalone training job instead, set the `deploy_amt_model` variable below to `False`.
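
The idea behind a tuning job (try several hyperparameter combinations, keep the one with the best objective value) can be sketched without any AWS calls. The example below is a toy random search over a made-up objective function. It is not how AMT works internally (this notebook requests the Bayesian strategy), and `random_search`, `toy_objective`, and `toy_ranges` are hypothetical names used only for illustration.

```python
import random


def random_search(objective, ranges, max_jobs, seed=0):
    """Sample `max_jobs` parameter sets uniformly and keep the lowest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(max_jobs):
        params = {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    # Hypothetical stand-in for "train a model and report validation RMSE"
    return (params["eta"] - 0.3) ** 2 + (params["subsample"] - 0.8) ** 2


toy_ranges = {"eta": (0.1, 0.5), "subsample": (0.5, 1.0)}
best_params, best_score = random_search(toy_objective, toy_ranges, max_jobs=6)
print(f"Best parameters: {best_params}")
print(f"Best score: {best_score:.4f}")
```

In the real tuning job, each call to the objective corresponds to a full training job, which is why `max_jobs` and `max_parallel_jobs` directly control cost and run time.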

### Initializing common variables 

In [None]:
# V3 imports
from sagemaker.core import image_uris

container = image_uris.retrieve("xgboost", region, "1.7-1")
deploy_amt_model = True

### Training with ModelTrainer (V3)

In [None]:
%%time
import boto3
from time import gmtime, strftime
import time

# V3 imports
from sagemaker.train import ModelTrainer
from sagemaker.train.configs import InputData, Compute, StoppingCondition, OutputDataConfig

# Configure input data
train_input = InputData(
    channel_name="train",
    data_source=f"{output_bucket_path}/{output_prefix}/train",
    content_type="libsvm"
)

validation_input = InputData(
    channel_name="validation",
    data_source=f"{output_bucket_path}/{output_prefix}/validation",
    content_type="libsvm"
)

# Configure compute resources
compute_config = Compute(
    instance_type="ml.m5.2xlarge",
    instance_count=1,
    volume_size_in_gb=5
)

# Configure stopping condition
stopping_config = StoppingCondition(max_runtime_in_seconds=3600)

# Configure output
output_config = OutputDataConfig(
    s3_output_path=f"{output_bucket_path}/{output_prefix}/single-xgboost"
)

# Create ModelTrainer
trainer = ModelTrainer(
    training_image=container,
    role=role,
    compute=compute_config,
    stopping_condition=stopping_config,
    output_data_config=output_config,
    hyperparameters={
        "max_depth": "5",
        "eta": "0.2",
        "gamma": "4",
        "min_child_weight": "6",
        "subsample": "0.7",
        # "reg:squarederror" replaces "reg:linear", which XGBoost has deprecated
        "objective": "reg:squarederror",
        "num_round": "50",
        "verbosity": "2"
    },
    sagemaker_session=sagemaker_session
)

# Start training
print("Starting training job. It will take between 5 and 6 minutes to complete.")
trainer.train(
    input_data_config=[train_input, validation_input],
    wait=True
)

# Get the training job name and model artifacts
training_job_name = trainer._latest_training_job.training_job_name
model_artifacts = trainer._latest_training_job.model_artifacts.s3_model_artifacts
print(f"Training job name: {training_job_name}")
print(f"Model artifacts: {model_artifacts}")

Note that the "validation" channel is configured as well. The SageMaker XGBoost algorithm computes the RMSE on the data passed to the "validation" channel and writes it to the CloudWatch logs.
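
For reference, RMSE (root mean squared error) is the square root of the mean squared difference between labels and predictions. The following is a minimal sketch of that formula in plain Python, using made-up numbers; it is independent of how the XGBoost container computes the metric.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration only
example_rmse = rmse([10, 12, 8], [11, 12, 6])
print(f"RMSE: {example_rmse:.4f}")
```

Lower values are better, which is why the tuning job in the next section minimizes `validation:rmse`.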

### Tuning with HyperparameterTuner (V3)

To create a tuning job using the V3 HyperparameterTuner, you need to:

1. Create a base ModelTrainer with static hyperparameters
2. Define hyperparameter ranges using ContinuousParameter and IntegerParameter
3. Create HyperparameterTuner with the base trainer and ranges
4. Call tuner.tune() to start the tuning job

In [None]:
%%time
from time import gmtime, strftime, sleep

# V3 imports
from sagemaker.train.tuner import HyperparameterTuner
from sagemaker.core.parameter import ContinuousParameter, IntegerParameter

# Define hyperparameter ranges
hyperparameter_ranges = {
    "eta": ContinuousParameter(min_value=0.1, max_value=0.5),
    "gamma": ContinuousParameter(min_value=0, max_value=5),
    "min_child_weight": ContinuousParameter(min_value=0, max_value=120),
    "subsample": ContinuousParameter(min_value=0.5, max_value=1),
    "alpha": ContinuousParameter(min_value=0, max_value=2),
    "max_depth": IntegerParameter(min_value=0, max_value=10),
    "num_round": IntegerParameter(min_value=1, max_value=4000)
}

# Create base trainer with static hyperparameters only
base_trainer = ModelTrainer(
    training_image=container,
    role=role,
    compute=compute_config,
    stopping_condition=StoppingCondition(max_runtime_in_seconds=43200),
    output_data_config=output_config,
    hyperparameters={
        # "reg:squarederror" replaces "reg:linear", which XGBoost has deprecated
        "objective": "reg:squarederror",
        "verbosity": "2"
    },
    sagemaker_session=sagemaker_session
)

# Create HyperparameterTuner
tuner = HyperparameterTuner(
    model_trainer=base_trainer,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=6,
    max_parallel_jobs=2,
    strategy="Bayesian"
)

# Start tuning
print("Starting tuning job. It will take between 12 and 17 minutes to complete.")
tuner.tune(
    inputs=[train_input, validation_input],
    wait=True
)

# Get best training job name
best_job_name = tuner.best_training_job()
print(f"Best training job: {best_job_name}")

# Get model artifacts from best training job
from sagemaker.core.resources import TrainingJob
best_training_job = TrainingJob.get(training_job_name=best_job_name)
tuned_model_artifacts = best_training_job.model_artifacts.s3_model_artifacts
print(f"Model artifacts from best job: {tuned_model_artifacts}")

## Set up hosting for the model

### Deploy with ModelBuilder (V3)

In V3, we use ModelBuilder to deploy models to endpoints. This simplifies the process compared to the 3-step deployment in V2.

In [None]:
%%time
import boto3
from time import gmtime, strftime

# V3 imports
from sagemaker.serve.model_builder import ModelBuilder
from sagemaker.serve.mode.function_pointers import Mode

# Determine which model to deploy
if deploy_amt_model:
    model_data = tuned_model_artifacts
    model_source = "tuning job"
else:
    model_data = model_artifacts
    model_source = "training job"

print(f"Deploying model from {model_source}")
print(f"Model data: {model_data}")

# Create ModelBuilder with correct parameter names
model_builder = ModelBuilder(
    s3_model_data_url=model_data,  # NOT model_data
    role_arn=role,                  # NOT role
    image_uri=container,
    mode=Mode.SAGEMAKER_ENDPOINT,
    sagemaker_session=sagemaker_session
)

# Build the model
model = model_builder.build()

# Deploy to endpoint
endpoint_name = f'DEMO-XGBoostEndpoint-{strftime("%Y-%m-%d-%H-%M-%S", gmtime())}'
print(f"Creating endpoint with name: {endpoint_name}. This will take between 9 and 11 minutes to complete.")

endpoint = model_builder.deploy(
    instance_type="ml.m5.xlarge",
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    wait=True
)

print(f"Endpoint deployed: {endpoint.endpoint_name}")
print(f"Endpoint ARN: {endpoint.endpoint_arn}")

## Validate the model for use

Finally, we can validate the model for use. In V3, we use `endpoint.invoke()` instead of the boto3 runtime client.

First, download the test data.

In [None]:
FILE_TEST = "abalone.test"
s3 = boto3.client("s3")
s3.download_file(data_bucket, f"{data_prefix}/test/{FILE_TEST}", FILE_TEST)

Start with a single prediction.

In [None]:
!head -1 abalone.test > abalone.single.test

In [None]:
%%time
import math

file_name = "abalone.single.test"
with open(file_name, "r") as f:
    payload = f.read().strip()

# V3: Use endpoint.invoke() instead of runtime_client.invoke_endpoint()
result = endpoint.invoke(
    body=payload,
    content_type="text/x-libsvm"
)

# V3: Parse response manually
response_body = result.body.read().decode('utf-8')

# Handle different response formats
if ',' in response_body.strip():
    predictions = [math.ceil(float(num.strip())) for num in response_body.strip().split(',') if num.strip()]
elif '\n' in response_body.strip():
    predictions = [math.ceil(float(num.strip())) for num in response_body.strip().split('\n') if num.strip()]
else:
    predictions = [math.ceil(float(response_body.strip()))]

label = payload.strip(" ").split()[0]
print(f"Label: {label}\nPrediction: {predictions[0]}")
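
The response-parsing branches above handle comma-separated, newline-separated, and single-value bodies. As a sketch, the same logic can be collected into one helper that can be tested offline; `parse_predictions` is a hypothetical name, and the assumption that the endpoint returns only these three shapes is taken from the branches in the cell above.

```python
import re


def parse_predictions(response_body):
    """Parse a comma- or newline-separated string of numbers into floats."""
    tokens = re.split(r"[,\n]", response_body.strip())
    return [float(token) for token in tokens if token.strip()]


# Made-up response bodies in the three shapes handled above
print(parse_predictions("9.1,10.5\n"))
print(parse_predictions("9.1\n10.5"))
print(parse_predictions("7.25"))
```

Rounding (for example with `math.ceil`, as this notebook does) can then be applied to the returned list as a separate step.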

OK, a single prediction works. Let's run a whole batch to see how accurate the predictions are.

In [None]:
import sys
import math

# V3: Updated to use endpoint.invoke()
def do_predict(data, endpoint_obj, content_type):
    payload = "\n".join(data)
    result = endpoint_obj.invoke(
        body=payload,
        content_type=content_type
    )
    
    # Parse response manually
    response_body = result.body.read().decode('utf-8')
    response_body = response_body.strip("\n")
    
    # Handle different response formats
    if '\n' in response_body:
        preds = [float(num.strip()) for num in response_body.split('\n') if num.strip()]
    elif ',' in response_body:
        preds = [float(num.strip()) for num in response_body.split(',') if num.strip()]
    else:
        preds = [float(response_body)]
    
    preds = [math.ceil(num) for num in preds]
    return preds


def batch_predict(data, batch_size, endpoint_obj, content_type):
    items = len(data)
    arrs = []

    for offset in range(0, items, batch_size):
        # Slicing past the end of a list is safe, so the final short batch needs no special case
        arrs.extend(do_predict(data[offset : offset + batch_size], endpoint_obj, content_type))
        sys.stdout.write(".")
    return arrs
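
The batching used by `batch_predict` can be checked offline, without an endpoint. The sketch below isolates the chunking step; `iter_batches` is a hypothetical helper and the rows are placeholder strings.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for offset in range(0, len(items), batch_size):
        yield items[offset : offset + batch_size]


# Placeholder rows standing in for libsvm lines
rows = [f"row-{i}" for i in range(7)]
batches = list(iter_batches(rows, 3))
print([len(batch) for batch in batches])
```

Every row appears in exactly one batch and the original order is preserved, so predictions can be concatenated and compared with the labels position by position.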

The following helps us calculate the Median Absolute Percent Error (MdAPE) on the batch dataset. 

In [None]:
%%time
import json
import numpy as np

with open(FILE_TEST, "r") as f:
    payload = f.read().strip()

labels = [int(line.split(" ")[0]) for line in payload.split("\n")]
test_data = [line for line in payload.split("\n")]
preds = batch_predict(test_data, 100, endpoint, "text/x-libsvm")

print(
    "\n Median Absolute Percent Error (MdAPE) = ",
    np.median(np.abs(np.array(labels) - np.array(preds)) / np.array(labels)),
)
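
As a small worked example of the metric, MdAPE is the median of the absolute errors divided by the true labels. The sketch below uses made-up numbers and the standard library only; `mdape` is a hypothetical helper name, and it assumes that no label is zero.

```python
from statistics import median


def mdape(labels, predictions):
    """Median absolute percent error, expressed as a fraction (0.1 means 10%)."""
    errors = [abs(y - p) / abs(y) for y, p in zip(labels, predictions)]
    return median(errors)


# Made-up labels and predictions: relative errors are 0.1, 0.25 and 0.0
example_mdape = mdape([10, 8, 20], [9, 10, 20])
print(f"MdAPE: {example_mdape:.2f}")
```

Because it uses the median rather than the mean, a few badly predicted samples have little effect on the reported value.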

### Delete Endpoint

Once you are done using the endpoint, you can delete it. In V3, you need to delete the endpoint and endpoint config separately.

In [None]:
# V3: Delete endpoint and endpoint config separately
from sagemaker.core.resources import EndpointConfig

# Get endpoint config name before deleting
endpoint_config_name = endpoint.endpoint_config_name

# Delete the endpoint
print(f"Deleting endpoint: {endpoint.endpoint_name}")
endpoint.delete()

# Delete the endpoint config
print(f"Deleting endpoint config: {endpoint_config_name}")
endpoint_config = EndpointConfig.get(endpoint_config_name=endpoint_config_name)
endpoint_config.delete()

print("Cleanup complete!")

## Appendix

### Data split and upload

The following functions split the data into train, validation, and test datasets and upload the files to S3.

In [None]:
import io
import boto3
import random


def data_split(
    FILE_DATA,
    FILE_TRAIN,
    FILE_VALIDATION,
    FILE_TEST,
    PERCENT_TRAIN,
    PERCENT_VALIDATION,
    PERCENT_TEST,
):
    data = [l for l in open(FILE_DATA, "r")]
    train_file = open(FILE_TRAIN, "w")
    valid_file = open(FILE_VALIDATION, "w")
    tests_file = open(FILE_TEST, "w")

    num_of_data = len(data)
    num_train = int((PERCENT_TRAIN / 100.0) * num_of_data)
    num_valid = int((PERCENT_VALIDATION / 100.0) * num_of_data)
    num_tests = int((PERCENT_TEST / 100.0) * num_of_data)

    data_fractions = [num_train, num_valid, num_tests]
    split_data = [[], [], []]

    rand_data_ind = 0

    for split_ind, fraction in enumerate(data_fractions):
        for i in range(fraction):
            rand_data_ind = random.randint(0, len(data) - 1)
            split_data[split_ind].append(data[rand_data_ind])
            data.pop(rand_data_ind)

    for l in split_data[0]:
        train_file.write(l)

    for l in split_data[1]:
        valid_file.write(l)

    for l in split_data[2]:
        tests_file.write(l)

    train_file.close()
    valid_file.close()
    tests_file.close()


def write_to_s3(fobj, bucket, key):
    return (
        boto3.Session(region_name=region)
        .resource("s3")
        .Bucket(bucket)
        .Object(key)
        .upload_fileobj(fobj)
    )


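def upload_to_s3(bucket, channel, filename):
    # `prefix` is a module-level variable set in the "Data ingestion" cell below
    # Include the file name in the key so the object lands at the printed URL
    key = f"{prefix}/{channel}/{filename}"
    url = f"s3://{bucket}/{key}"
    print(f"Writing to {url}")
    with open(filename, "rb") as fobj:
        write_to_s3(fobj, bucket, key)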
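
For comparison, here is a compact, reproducible variant of the split step. It is a sketch under stated assumptions rather than a replacement for `data_split`: the name `split_lines` and the fixed seed are hypothetical, it works on an in-memory list instead of files, and any rows left over after integer truncation go to the test set instead of being dropped.

```python
import random


def split_lines(lines, percent_train, percent_validation, seed=42):
    """Shuffle `lines` with a fixed seed and split into train/validation/test lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    num_train = int(len(shuffled) * percent_train / 100)
    num_valid = int(len(shuffled) * percent_validation / 100)
    train = shuffled[:num_train]
    validation = shuffled[num_train : num_train + num_valid]
    test = shuffled[num_train + num_valid :]
    return train, validation, test


# Placeholder records standing in for libsvm lines
records = [f"record-{i}\n" for i in range(100)]
train, validation, test = split_lines(records, 70, 15)
print(len(train), len(validation), len(test))
```

Fixing the seed makes the split repeatable across runs, which helps when comparing models trained at different times.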

### Data ingestion

Next, we read the dataset from the existing repository into memory, for preprocessing prior to training. This processing could be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as this one, reading into memory isn't onerous, though it would be for larger datasets.

In [None]:
%%time
s3 = boto3.client("s3")

bucket = output_bucket
prefix = "sagemaker/DEMO-xgboost-abalone-v3"

# Load the dataset
FILE_DATA = "abalone"
s3.download_file(
    f"sagemaker-example-files-prod-{region}",
    "datasets/tabular/uci_abalone/abalone.libsvm",
    FILE_DATA,
)

# split the downloaded data into train/test/validation files
FILE_TRAIN = "abalone.train"
FILE_VALIDATION = "abalone.validation"
FILE_TEST = "abalone.test"
PERCENT_TRAIN = 70
PERCENT_VALIDATION = 15
PERCENT_TEST = 15
data_split(
    FILE_DATA,
    FILE_TRAIN,
    FILE_VALIDATION,
    FILE_TEST,
    PERCENT_TRAIN,
    PERCENT_VALIDATION,
    PERCENT_TEST,
)

# upload the files to the S3 bucket
upload_to_s3(bucket, "train", FILE_TRAIN)
upload_to_s3(bucket, "validation", FILE_VALIDATION)
upload_to_s3(bucket, "test", FILE_TEST)