## CatBoost and XGBoost Script Mode Training and Serving

This sample trains a simple CatBoost model and an XGBoost model using the SageMaker XGBoost Docker image, and then performs inference. It works on your *local computer* or in the *AWS Cloud*.

#### Prerequisites:
1. Install required Python packages:
   `pip install -r requirements.txt`
2. Make sure Docker Desktop is installed and running on your computer:
   `docker ps`
3. Configure AWS credentials on your local machine so that the Docker image can be pulled from Amazon ECR.

In [3]:
import os
import sagemaker
import pandas as pd
from sagemaker.serializers import CSVSerializer  # replaces the deprecated csv_serializer
from sagemaker.xgboost import XGBoost
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

In [4]:
sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()

prefix = "xgboost_catboost"

## Downloading Data
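Load the scikit-learn diabetes dataset and write the training, validation, and test splits to local CSV files, unless those files already exist.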

In [5]:
local_train = './data/train/diabetes_train.csv'
local_validation = './data/validation/diabetes_validation.csv'
local_test = './data/test/diabetes_test.csv'

In [6]:
if os.path.isfile(local_train) and \
        os.path.isfile(local_validation) and \
        os.path.isfile(local_test):
    print('Training dataset exist. Skipping Download')
else:
    print('Downloading training dataset')

    os.makedirs("./data", exist_ok=True)
    os.makedirs("./data/train", exist_ok=True)
    os.makedirs("./data/validation", exist_ok=True)
    os.makedirs("./data/test", exist_ok=True)

    data = load_diabetes()

    X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25, random_state=45)
    X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=45)

    trainX = pd.DataFrame(X_train, columns=data.feature_names)
    trainX['target'] = y_train

    # Build the validation frame from the validation split, not the test split
    valX = pd.DataFrame(X_val, columns=data.feature_names)
    valX['target'] = y_val

    testX = pd.DataFrame(X_test, columns=data.feature_names)

    trainX.to_csv(local_train, header=None, index=False)
    valX.to_csv(local_validation, header=None, index=False)
    testX.to_csv(local_test, header=None, index=False)

    print('Downloading completed')

Training dataset exist. Skipping Download
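The two `train_test_split` calls above first hold out 25% of the rows and then halve that holdout. As a rough check, the sketch below reproduces the resulting row counts. It assumes the diabetes dataset has 442 rows and that a fractional `test_size` is rounded up. The helper name `split_sizes` is hypothetical and not part of the sample.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_remaining, n_held_out) for a fractional test_size.

    Assumption: the held-out count is rounded up and the remainder
    keeps every other row.
    """
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


n_total = 442  # rows in scikit-learn's diabetes dataset
n_train, n_holdout = split_sizes(n_total, 0.25)  # first split: train vs. holdout
n_val, n_test = split_sizes(n_holdout, 0.5)      # second split: validation vs. test

print(n_train, n_val, n_test)
```

Under these assumptions the split is roughly 75% training, 12.5% validation, and 12.5% test.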


## Model Training
The training job below runs on an `ml.m5.xlarge` instance. To train in **local mode** instead, set `training_instance_type` to `"local"`. Note: when launching in local mode for the first time, the container image download might take a few minutes to complete.

In [7]:
training_instance_type = "ml.m5.xlarge"
train_location = sess.upload_data(
    local_train, key_prefix="{}/data/{}".format(prefix, "train")
)
validation_location = sess.upload_data(
    local_validation, key_prefix="{}/data/{}".format(prefix, "validation")
)

In [None]:
train_location, validation_location
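The cell above has no recorded output. As a sketch of what to expect, `upload_data` on a single file returns an S3 URI made of the bucket, the key prefix, and the file name. The bucket name below is a hypothetical placeholder in the default `sagemaker-<region>-<account id>` form, and `expected_s3_uri` is an illustrative helper, not an SDK function.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the URI a single-file upload is expected to land at:
    s3://<bucket>/<key_prefix>/<file name>."""
    file_name = posixpath.basename(local_path)
    return "s3://{}/{}/{}".format(bucket, key_prefix, file_name)


example_bucket = "sagemaker-us-east-1-111122223333"  # hypothetical bucket name
example_prefix = "xgboost_catboost"                  # same prefix as the notebook

train_uri = expected_s3_uri(
    example_bucket,
    "{}/data/{}".format(example_prefix, "train"),
    "./data/train/diabetes_train.csv",
)
print(train_uri)
```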

In [9]:
hyperparameters = {"num_round": 6, "max_depth": 5}

estimator_parameters = {
    "entry_point": "multi_model_hpo.py",
    "source_dir": "code",
    "dependencies": ["my_custom_library"],
    "instance_type": training_instance_type,
    "instance_count": 1,
    "hyperparameters": hyperparameters,
    "role": role,
    "base_job_name": "xgboost-model",
    "framework_version": "1.0-1",
    "py_version": "py3",
}

estimator = XGBoost(**estimator_parameters)

To train a single model without hyperparameter tuning, uncomment the next cell.

In [10]:
# estimator.fit({'train': train_location, 'validation': validation_location})
# print('Completed model training')

The following cells define a hyperparameter optimization (HPO) job.

In [11]:
from sagemaker.tuner import (
    IntegerParameter,
#     CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    "eta": ContinuousParameter(0.2, 0.3),
    "max_depth": IntegerParameter(3, 4)
}

objective_metric_name = "validation:rmse"
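The objective name `validation:rmse` refers to the root mean squared error on the validation channel. For reference, here is a minimal pure-Python sketch of that metric. The function `rmse` and the sample numbers are illustrative only. In the actual job the value is reported by the training container, not computed this way in the notebook.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up targets and predictions, only to exercise the function
example_rmse = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(example_rmse)
```

Because lower RMSE is better, the tuner in this sample is configured with `objective_type='Minimize'`.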

In [12]:
tuner = HyperparameterTuner(
    estimator, 
    objective_metric_name,
    hyperparameter_ranges, 
    max_jobs=4, 
    max_parallel_jobs=2, 
    objective_type='Minimize'
)
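To make the search space and job budget concrete, the sketch below draws four candidate configurations from the same ranges and groups them two at a time, mirroring `max_jobs=4` and `max_parallel_jobs=2`. It uses plain random sampling as a stand-in. The tuner's default strategy is Bayesian, so this is not how SageMaker actually picks candidates. `sample_config` is a hypothetical helper.

```python
import random


def sample_config(rng):
    """Draw one candidate from the notebook's ranges (random-search stand-in)."""
    return {
        "eta": rng.uniform(0.2, 0.3),      # ContinuousParameter(0.2, 0.3)
        "max_depth": rng.randint(3, 4),    # IntegerParameter(3, 4), bounds inclusive
    }


max_jobs = 4
max_parallel_jobs = 2

rng = random.Random(0)  # fixed seed so the sketch is repeatable
candidates = [sample_config(rng) for _ in range(max_jobs)]

# Group the candidates into batches that could run at the same time
waves = [
    candidates[i:i + max_parallel_jobs]
    for i in range(0, len(candidates), max_parallel_jobs)
]

for wave_index, wave in enumerate(waves, start=1):
    print("wave", wave_index, wave)
```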

In [14]:
tuner.fit({"train": train_location, "validation": validation_location}, include_cls_metadata=False)


................................................................................................!


Retrieve the details of the best training job from the tuning job.

In [15]:
job_name = tuner.latest_tuning_job.name
attached_tuner = HyperparameterTuner.attach(job_name)

In [16]:
attached_tuner.describe()["BestTrainingJob"]

{'TrainingJobName': 'sagemaker-xgboost-220802-0855-003-b8a5e508',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:327838496401:training-job/sagemaker-xgboost-220802-0855-003-b8a5e508',
 'CreationTime': datetime.datetime(2022, 8, 2, 8, 58, 45, tzinfo=tzlocal()),
 'TrainingStartTime': datetime.datetime(2022, 8, 2, 9, 0, 51, tzinfo=tzlocal()),
 'TrainingEndTime': datetime.datetime(2022, 8, 2, 9, 2, 44, tzinfo=tzlocal()),
 'TrainingJobStatus': 'Completed',
 'TunedHyperParameters': {'eta': '0.27402597074722407', 'max_depth': '3'},
 'FinalHyperParameterTuningJobObjectiveMetric': {'MetricName': 'validation:rmse',
  'Value': 0.04377000033855438},
 'ObjectiveStatus': 'Succeeded'}
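The tuned hyperparameters in the output above come back as strings. If you want to reuse them, for example in a later training run, they need to be converted to numbers first. The sketch below uses a trimmed copy of the recorded output. The helper `parse_tuned` and its type mapping are assumptions for illustration.

```python
# Trimmed copy of the BestTrainingJob output shown above
best_training_job = {
    "TrainingJobName": "sagemaker-xgboost-220802-0855-003-b8a5e508",
    "TunedHyperParameters": {"eta": "0.27402597074722407", "max_depth": "3"},
    "FinalHyperParameterTuningJobObjectiveMetric": {
        "MetricName": "validation:rmse",
        "Value": 0.04377000033855438,
    },
}


def parse_tuned(raw_params):
    """Convert string-valued hyperparameters to Python numbers.

    Assumption: only "eta" (float) and "max_depth" (int) need casting;
    anything else is left as a string.
    """
    casters = {"eta": float, "max_depth": int}
    return {name: casters.get(name, str)(value) for name, value in raw_params.items()}


best_params = parse_tuned(best_training_job["TunedHyperParameters"])
print(best_params)
```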

# Deployment options

## Deploy best model to real time endpoint
Parameters for launching an `ml.m5.xlarge` instance and deploying the best model from the HPO job. Uncomment the next two cells to use this option.

In [17]:
# predictor_params = {
#     "endpoint_name": "xgboost-catboost-ensemble",
#     "entry_point": "multi_model_deploy.py",
#     "dependencies": ["my_custom_library"],
#     "source_dir": "code",
#     "initial_instance_count": 1,
#     "instance_type": "ml.m5.xlarge"
# }

In [18]:
# predictor = attached_tuner.deploy(**predictor_params)

## Deploy best model to a serverless endpoint
Parameters for deploying the best model from the HPO job as a serverless endpoint.

In [19]:
from sagemaker.serverless.serverless_inference_config import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144,
    max_concurrency=1,
)
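The `memory_size_in_mb` value above is 6144. As far as I know, serverless endpoints accept memory sizes only in 1 GB steps from 1024 MB to 6144 MB. Treat that list as an assumption and check the current SageMaker documentation. Under that assumption, a small guard like the hypothetical `check_memory_size` below can catch a bad value before deployment.

```python
# Assumed set of valid sizes: 1024, 2048, 3072, 4096, 5120, 6144 (verify against current docs)
ALLOWED_MEMORY_MB = tuple(range(1024, 6144 + 1, 1024))


def check_memory_size(memory_size_in_mb):
    """Return the value unchanged if it is in the assumed allowed set, else raise."""
    if memory_size_in_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(
            "memory_size_in_mb must be one of {}, got {}".format(
                ALLOWED_MEMORY_MB, memory_size_in_mb
            )
        )
    return memory_size_in_mb


memory_mb = check_memory_size(6144)  # the value used in this notebook
print(memory_mb)
```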

In [None]:
estimator = attached_tuner.best_estimator()

In [21]:
predictor = estimator.deploy(serverless_inference_config=serverless_config)

-------!

In [None]:
predictor.endpoint_context()

## Deploying trained model
You can also deploy a trained model directly and invoke it.

Uncomment the cell below to deploy directly from the estimator object.

In [None]:
# endpoint_name = "xgboost-catboost-endpoint"
# predictor = estimator.deploy(
#         initial_instance_count=1, instance_type="ml.m5.xlarge", endpoint_name=endpoint_name
#     )


If you already have a previously trained model, pass its S3 URI in the `model_data` field to create a model object for deployment. There is no need to retrain the model with the estimator.

In [None]:
# from sagemaker.xgboost.model import XGBoostModel

# inference_model = XGBoostModel(
#     model_data=model_data,
#     role=role,
#     entry_point="multi_model_deploy.py",
#     framework_version="1.0-1",
#     dependencies=["my_custom_library"],
#     source_dir="code",
# )

The entry point script `multi_model_deploy.py` loads the multiple models in the model artifact and runs inference against each one. The returned result is the mean of the individual model outputs. This is a simple demonstration of how to work with multiple models; you can design the model ensemble as you need.
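The averaging step can be sketched in a few lines. This is not the content of `multi_model_deploy.py`, which is not shown here. It is a minimal illustration of row-wise mean ensembling. The function name `ensemble_mean` and the sample numbers are made up.

```python
from statistics import fmean


def ensemble_mean(predictions_by_model):
    """Average per-row predictions across models.

    predictions_by_model maps a model name to its list of predictions;
    every model must return one prediction per input row.
    """
    lengths = {len(preds) for preds in predictions_by_model.values()}
    if len(lengths) != 1:
        raise ValueError("need at least one model, all with the same number of predictions")
    # zip(*...) groups the predictions for the same row together
    return [fmean(row) for row in zip(*predictions_by_model.values())]


# Made-up predictions for two input rows
combined = ensemble_mean({
    "xgboost": [150.0, 98.0],
    "catboost": [154.0, 102.0],
})
print(combined)
```

A plain mean weights every model equally. A weighted mean or a stacked model are common alternatives when one model is clearly stronger.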

In [None]:
# predictor = inference_model.deploy(
#     initial_instance_count=1,
#     instance_type="ml.m5.xlarge",
# )

# Invoke the model

In [22]:

from sagemaker.serializers import NumpySerializer, JSONSerializer, CSVSerializer
from sagemaker.deserializers import NumpyDeserializer, JSONDeserializer
predictor.serializer = CSVSerializer()
predictor.deserializer = JSONDeserializer()


In [None]:
with open(local_test, 'r') as f:
    payload = f.read().strip()

predictions = predictor.predict(payload)
print('predictions: {}'.format(predictions))
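The cell above sends the whole test file in a single request. Endpoints limit the size of a request, so for larger files it can help to split the CSV payload into smaller batches of rows. The sketch below shows one way to do that. `batch_csv_payload` is a hypothetical helper and the sample rows are made up.

```python
def batch_csv_payload(payload, rows_per_request):
    """Split a header-less CSV string into chunks of at most rows_per_request rows."""
    if rows_per_request < 1:
        raise ValueError("rows_per_request must be at least 1")
    rows = [line for line in payload.splitlines() if line.strip()]
    return [
        "\n".join(rows[i:i + rows_per_request])
        for i in range(0, len(rows), rows_per_request)
    ]


# Made-up three-row payload with two features per row
sample_payload = "0.1,0.2\n0.3,0.4\n0.5,0.6"
batches = batch_csv_payload(sample_payload, rows_per_request=2)
print(batches)
```

Each element of `batches` could then be passed to `predictor.predict` in turn and the results concatenated in order.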

## Clean up resources
Delete the deployed endpoint when you no longer need it.

In [None]:
# predictor.delete_endpoint()