## Introduction

This is the second notebook in the series. It covers the model training stage of the ML workflow.

Here, we put on the hat of the `Data Scientist` and perform the modeling task: training a model, tuning its hyperparameters, evaluating it, and registering high-performing candidate models in a model registry. This task is highly iterative, so we also need to track our experiments until we reach the desired results.

We will learn how to scale model development using managed SageMaker training and experiment tracking, combined with curated feature data pulled from SageMaker Feature Store. You'll also tune at scale with SageMaker's automatic hyperparameter tuning and, finally, register the best-performing model in SageMaker Model Registry.

![Notebook2](images/Notebook2.png)



Let's get started!

**Important:** For this example, we will use XGBoost-Ray. XGBoost-Ray integrates well with the Ray Tune hyperparameter optimization library and implements advanced fault-tolerance mechanisms. We will use `ray.data` to load the training, validation, and test data (in Parquet format) from the offline store of the Feature Store. Then we will run a hyperparameter optimization job to find the best hyperparameters. Finally, we will register the best-performing model in the Model Registry.

In [2]:
%store -r

In [3]:
train_feature_group_name

'fs-train--2023-07-04-13-37-02'

In [4]:
!pip install -U sagemaker ray==2.5.0 modin[ray]==0.22.1 pydantic==1.10.10 xgboost_ray tensorboardx

Collecting tensorboardx
  Using cached tensorboardX-2.6.1-py2.py3-none-any.whl (101 kB)
Collecting ray[default]>=1.13.0 (from modin[ray]==0.22.1)
  Using cached ray-2.5.1-cp310-cp310-manylinux2014_x86_64.whl (56.2 MB)
INFO: pip is looking at multiple versions of tensorboardx to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of ray[default] to determine which version is compatible with other requirements. This could take a while.

In [5]:
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.sklearn.model import SKLearnModel
from time import gmtime, strftime
import boto3
import sys
import sagemaker
import json
import os

from sagemaker.model_metrics import ModelMetrics, MetricsSource
from sagemaker.analytics import ExperimentAnalytics
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner
# SageMaker Experiments
from sagemaker.experiments.run import Run
from sagemaker.utils import unique_name_from_base

from sagemaker import image_uris
from sagemaker.inputs import TrainingInput

In [6]:
# Useful SageMaker variables
sess = sagemaker.Session()
bucket = sess.default_bucket()
role_arn= sagemaker.get_execution_role()
region = sess.boto_region_name
s3_client = boto3.client('s3', region_name=region)
sagemaker_client = boto3.client('sagemaker')

enable_local_mode_training = False
model_name = 'xgboost-model'

experiment_name = unique_name_from_base('synthetic-housing-XGB-regression')

model_path = f's3://{bucket}/{s3_prefix}/output/model/xgb'

**Get the `ResolvedOutputS3Uri` of the Feature Group**

We can obtain the S3 location where each Feature Group stores its data in Parquet format.

In [7]:
fs_train_group = FeatureGroup(
        name=train_feature_group_name, 
        sagemaker_session=sess
    )

fs_train_data_loc = fs_train_group.describe().get("OfflineStoreConfig").get("S3StorageConfig").get("ResolvedOutputS3Uri")
fs_train_data_loc

's3://sagemaker-us-east-1-523914011708/aws-sm-ray-workshop/data/feature-store/train/523914011708/sagemaker/us-east-1/offline-store/fs-train--2023-07-04-13-37-02-1688478127/data'

In [8]:
fs_val_group = FeatureGroup(
        name=validation_feature_group_name, 
        sagemaker_session=sess
    )

fs_val_data_loc = fs_val_group.describe().get("OfflineStoreConfig").get("S3StorageConfig").get("ResolvedOutputS3Uri")
fs_val_data_loc

's3://sagemaker-us-east-1-523914011708/aws-sm-ray-workshop/data/feature-store/validation/523914011708/sagemaker/us-east-1/offline-store/fs-validation--2023-07-04-13-37-02-1688478127/data'

In [9]:
fs_test_group = FeatureGroup(
        name=test_feature_group_name, 
        sagemaker_session=sess
    )

fs_test_data_loc = fs_test_group.describe().get("OfflineStoreConfig").get("S3StorageConfig").get("ResolvedOutputS3Uri")
fs_test_data_loc

's3://sagemaker-us-east-1-523914011708/aws-sm-ray-workshop/data/feature-store/test/523914011708/sagemaker/us-east-1/offline-store/fs-test--2023-07-04-13-37-02-1688478127/data'
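
The three cells above repeat the same chained `.get(...)` lookup, which fails with an unhelpful `AttributeError` if a feature group has no offline store. A minimal sketch of a shared helper follows. The function name `resolved_offline_store_uri` and the sample dictionary are assumptions made for illustration; the only thing taken from the notebook is the key path `OfflineStoreConfig` → `S3StorageConfig` → `ResolvedOutputS3Uri`.

```python
from typing import Any, Mapping


def resolved_offline_store_uri(description: Mapping[str, Any]) -> str:
    """Return the offline-store S3 URI from a feature group description.

    `description` is assumed to be shaped like the dict returned by
    `FeatureGroup.describe()` in the cells above.
    """
    offline_cfg = description.get("OfflineStoreConfig")
    if not offline_cfg:
        raise ValueError("Feature group has no offline store configured")
    uri = offline_cfg.get("S3StorageConfig", {}).get("ResolvedOutputS3Uri")
    if not uri:
        raise ValueError("Offline store has no ResolvedOutputS3Uri")
    return uri


# Hypothetical stand-in for a describe() response; only the keys used above.
example_description = {
    "OfflineStoreConfig": {
        "S3StorageConfig": {
            "ResolvedOutputS3Uri": "s3://example-bucket/offline-store/data"
        }
    }
}
print(resolved_offline_store_uri(example_description))
```

In the notebook this could be called as `resolved_offline_store_uri(fs_train_group.describe())`, and likewise for the validation and test groups.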

## SageMaker Training

Now that we've prepared our training and test data, we can move on to SageMaker's hosted training functionality, [SageMaker Training](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html). Hosted training is preferred for actual training runs, especially large-scale, distributed ones. Unlike training a model on a local computer or server, SageMaker hosted training spins up a separate cluster of machines, managed by SageMaker, to train your model. Before starting hosted training, the data must be in S3, or in an EFS or FSx for Lustre file system. We uploaded the data to S3 in the previous notebook, so we're good to go here.

In [10]:
%%writefile ./pipeline_scripts/train/script.py
import subprocess
import sys
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sagemaker','ray', 'xgboost_ray', 'pyarrow >= 6.0.1'])
import os
import time

import argparse
import json
import logging
import boto3
import sagemaker
# Experiments
from sagemaker.session import Session
from sagemaker.experiments.run import load_run

import ray
from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig
from ray.data import Dataset
from ray.air.result import Result
from ray.air.checkpoint import Checkpoint
from sagemaker_ray_helper import RayHelper 

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

def read_parameters():
    parser = argparse.ArgumentParser()

    # Hyperparameters are described here.
    parser.add_argument('--max_depth', type=int)
    parser.add_argument('--eta', type=float)
    parser.add_argument('--min_child_weight', type=int)
    parser.add_argument('--subsample', type=float)
    parser.add_argument('--verbosity', type=int)
    parser.add_argument('--num_round', type=int)
    parser.add_argument('--tree_method', type=str, default="auto")
    parser.add_argument('--predictor', type=str, default="auto")

    # Sagemaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output_data_dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION'))
    parser.add_argument('--sm_hosts', type=str, default=os.environ.get('SM_HOSTS'))
    parser.add_argument('--sm_current_host', type=str, default=os.environ.get('SM_CURRENT_HOST'))
    
    parser.add_argument('--num_ray_workers', type=int,default=3)
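    # type=bool would treat any non-empty string (even "False") as True, so parse the text explicitly
    parser.add_argument('--use_gpu', type=lambda v: str(v).lower() in ('true', '1', 'yes'), default=False)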
    # parse region
    parser.add_argument('--region', type=str, default='us-east-1')
    
    parser.add_argument('--target_col', type=str, default='price')
    
    try:
        from sagemaker_training import environment
        env = environment.Environment()
        parser.add_argument('--n_jobs', type=int, default=env.num_cpus)
    except Exception:  # e.g. sagemaker_training is not installed (local runs)
        parser.add_argument('--n_jobs', type=int, default=4)

    args, _ = parser.parse_known_args()
    return args

def load_dataset(fs_data_loc, target_col="price"):
    """
    Loads the data as a Ray dataset from the offline Feature Store S3 location.
    Args:
        fs_data_loc (str): S3 location of the feature group's offline store data
        target_col (str): the target column (dropped only for the test set)
    Returns:
        ds (ray.data.dataset): Ray dataset that contains the requested data from the Feature Store
    """
    # Drop columns added by the feature store
    cols_to_drop = ["record_id", "event_time","write_time", 
                    "api_invocation_time", "is_deleted", 
                    'year', "month", "day", "hour"]
                    
    
    # A simple check for whether this is test data
    # If so, add the target column to the list of columns to drop
    if '/test/' in fs_data_loc:
        cols_to_drop.append(target_col)

    ds = ray.data.read_parquet(fs_data_loc)
    ds = ds.drop_columns(cols_to_drop)
    print(f"{fs_data_loc} count is {ds.count()}")

    return ds

def train_xgboost(ds_train, ds_val, params, num_workers, use_gpu = False, target_col = "price") -> Result:
    """
    Creates an XGBoost trainer, trains it, and returns the result.
    Args:
        ds_train (ray.data.dataset): Training dataset
        ds_val (ray.data.dataset): Validation dataset
        params (dict): Hyperparameters
        num_workers (int): Number of workers to distribute the training across
        use_gpu (bool): Whether the training job should use GPUs
        target_col (str): Target column
    Returns:
        result (ray.air.result.Result): Result of the training job
    """
    """
    params = {
        "tree_method": "approx",
        "objective": "reg:squarederror",
        "eval_metric": ["mae", "rmse"],
    }
    """
    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        # NOTE: the label column is hard-coded here; the target_col argument is not used
        label_column="PRICE",
        params=params,
        datasets={"train": ds_train, "valid": ds_val},
        num_boost_round=100,
    )
    result = trainer.fit()
    print("<==== Start Training Metrics ====>")
    print(result.metrics)
    print("<==== END Training Metrics ====>")

    return result

def main():
    # Get SageMaker host information from runtime environment variables
    sm_hosts = json.loads(args.sm_hosts)
    sm_current_host = args.sm_current_host
    
    hyperparams = {
        'max_depth': args.max_depth,
        'min_child_weight': args.min_child_weight,
        'eta': args.eta,
        'subsample': args.subsample,
        "tree_method": "approx",
        "objective": "reg:squarederror",
        "eval_metric": ["mae", "rmse"],
        "num_round": 100
    }

    ds_train = load_dataset(args.train, args.target_col)
    ds_validation = load_dataset(args.validation, args.target_col)
    
    result = train_xgboost(ds_train, ds_validation, hyperparams, args.num_ray_workers, args.use_gpu, args.target_col)
    metrics = result.metrics
    checkpoint = result.checkpoint.to_directory(path=os.path.join(args.model_dir, f'model-{metrics["trial_id"]}.xgb'))
    trainMAE = metrics['train-mae']
    trainRMSE = metrics['train-rmse']
    valMAE = metrics['valid-mae']
    valRMSE = metrics['valid-rmse']
    print('[1] #011train-mae:{}'.format(trainMAE))
    print('[2] #011train-rmse:{}'.format(trainRMSE))
    print('[3] #011validation-mae:{}'.format(valMAE))
    print('[4] #011validation-rmse:{}'.format(valRMSE))
    
    local_testing = False
    try:
        load_run(sagemaker_session=sess)
    except:
        local_testing = True
    if not local_testing: # Track experiment if using SageMaker Training
        with load_run(sagemaker_session=sess) as run:
            run.log_metric('train-mae', trainMAE)
            run.log_metric('train-rmse', trainRMSE)
            run.log_metric('validation-mae', valMAE)
            run.log_metric('validation-rmse', valRMSE)
    
if __name__ == '__main__':
    ray_helper = RayHelper()
    
    ray_helper.start_ray()
    args = read_parameters()
    sess = sagemaker.Session(boto3.Session(region_name=args.region))

    start = time.time()
    main()
    taken = time.time() - start
    print(f"TOTAL TIME TAKEN: {taken:.2f} seconds")
    
    
    

Overwriting ./pipeline_scripts/train/script.py
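
Hyperparameters reach the training script as command-line strings, so a boolean flag such as `--use_gpu` needs an explicit converter: argparse's `type=bool` calls `bool()` on the raw text, and any non-empty string, including `"False"`, is truthy. The sketch below contrasts the two behaviours. The helper name `str_to_bool` is hypothetical and is not part of the script above.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse common textual booleans and reject anything else."""
    normalized = value.strip().lower()
    if normalized in {"true", "1", "yes"}:
        return True
    if normalized in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


# type=bool evaluates bool("False"), which is True for any non-empty string.
naive = argparse.ArgumentParser()
naive.add_argument("--use_gpu", type=bool, default=False)
naive_value = naive.parse_args(["--use_gpu", "False"]).use_gpu

# An explicit converter yields the intended value.
explicit = argparse.ArgumentParser()
explicit.add_argument("--use_gpu", type=str_to_bool, default=False)
explicit_value = explicit.parse_args(["--use_gpu", "False"]).use_gpu

print(naive_value, explicit_value)  # prints: True False
```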


In [11]:
!cp ./common/sagemaker_ray_helper.py ./pipeline_scripts/train/

In [12]:
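hyperparams = {
    "max_depth": "5",
    "eta": "0.2",
    "min_child_weight": "6",
    "subsample": "0.7",
    # "reg:linear" is deprecated in XGBoost; "reg:squarederror" is its replacement
    # and matches the objective that script.py sets internally.
    "objective": "reg:squarederror",
}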

train_instance_type = 'ml.c5.2xlarge'

estimator_parameters = {
    'source_dir': './pipeline_scripts/train/',
    'entry_point': 'script.py',
    'framework_version': '1.7-1',
    'instance_type': train_instance_type,
    'instance_count': 2,
    'hyperparameters': hyperparams,
    'role': role_arn,
    'base_job_name': 'XGBoost-model',
    'output_path': model_path,
    'image_scope': 'training'
}

inputs = {'train': TrainingInput(fs_train_data_loc), 'validation': TrainingInput(fs_val_data_loc)}
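
The training script reads its settings with `parse_known_args`, so a hyperparameter without a matching `add_argument` call is set aside rather than rejected. The script registers no `--objective` argument, which means the `objective` entry above has no effect on training. The sketch below illustrates this with sample values. It assumes each hyperparameter reaches the script as a `--name value` pair, and it re-declares only a few of the script's arguments.

```python
import argparse

# Sample values for illustration; not tied to any particular training job.
sample_hyperparams = {
    "max_depth": "5",
    "eta": "0.2",
    "objective": "reg:squarederror",
}

# Assumption: every hyperparameter is passed to the script as "--name value".
argv = []
for name, value in sample_hyperparams.items():
    argv += [f"--{name}", value]

parser = argparse.ArgumentParser()
parser.add_argument("--max_depth", type=int)
parser.add_argument("--eta", type=float)

# parse_known_args returns unrecognized arguments instead of exiting with an error.
args, unknown = parser.parse_known_args(argv)
print(args.max_depth, args.eta, unknown)
# prints: 5 0.2 ['--objective', 'reg:squarederror']
```

Checking `unknown` at the top of a script, or logging it, is a cheap way to notice hyperparameters that are silently ignored.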


In [13]:
# Importing display from IPython.core.display is deprecated; use IPython.display instead
from IPython.display import display, HTML
from sagemaker.xgboost.estimator import XGBoost

display(
    HTML(
        '<b>Review the <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job</a> After About 5 Minutes</b>'.format(
            region, experiment_name
        )
    )
)

with Run(experiment_name=experiment_name, run_name='XGBoost-run') as run:
    estimator = XGBoost(**estimator_parameters)
    estimator.fit(inputs)



INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py3.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.c5.2xlarge.
INFO:sagemaker:Creating training-job with name: XGBoost-model-2023-07-04-20-37-38-389


Using provided s3_resource
2023-07-04 20:37:38 Starting - Starting the training job...
2023-07-04 20:37:53 Starting - Preparing the instances for training......
2023-07-04 20:38:52 Downloading - Downloading input data...
2023-07-04 20:39:37 Training - Training image download completed. Training in progress...
[2023-07-04 20:39:48.416 ip-10-0-81-246.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-07-04 20:39:48.435 ip-10-0-81-246.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2023-07-04:20:39:48:INFO] Imported framework sagemaker_xgboost_container.training
[2023-07-04:20:39:48:INFO] No GPUs detected (normal if no gpus installed)
[2023-07-04:20:39:48:INFO] Invoking user training script.
[2023-07-04:20:39:48:INFO] Module script does not provide a setup.py.
Generating setup.py
[2023-07-04:20:39:48:INFO] Generating setup.cfg
[2023-07-04:20:39:48:INFO] Gene

In [14]:
hyperparameter_ranges = {
    "max_depth": IntegerParameter(1, 8),
    "eta": ContinuousParameter(0.2, 1),
    "min_child_weight": IntegerParameter(0, 120),
    "subsample": ContinuousParameter(0.2, 1),
}

objective_metric_name = 'validation:rmse'
objective_type = 'Minimize'
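
The tuner can only optimize a metric it is able to read from the training job's logs, which is why the training script prints lines such as `[4] #011validation-rmse:<value>`. The sketch below shows how a regular expression can pull the value out of such a line. The pattern is an illustrative assumption, not the metric definition that SageMaker ships, and `extract_validation_rmse` is a hypothetical helper.

```python
import re

# Illustrative pattern (an assumption, not SageMaker's own metric definition).
VALIDATION_RMSE = re.compile(
    r"#011validation-rmse:([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)"
)


def extract_validation_rmse(log_text):
    """Return the last validation RMSE found in the log text, or None."""
    matches = VALIDATION_RMSE.findall(log_text)
    return float(matches[-1]) if matches else None


# Same format string as the print statement in the training script.
val_rmse = 10061.348633
log_line = '[4] #011validation-rmse:{}'.format(val_rmse)
print(extract_validation_rmse(log_line))  # prints: 10061.348633
```

If the printed format and the pattern drift apart, the objective metric is never captured, so it is worth changing the two together.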

In [15]:
tuner_parameters = {
                    'estimator': estimator,
                    'objective_metric_name': objective_metric_name,
                    'hyperparameter_ranges': hyperparameter_ranges,
                    # 'metric_definitions': metric_definitions,
                    'max_jobs': 4,
                    'max_parallel_jobs': 2,
                    'objective_type': objective_type
                    }
    
tuner = HyperparameterTuner(**tuner_parameters)

tuning_job_name = f'xgb-model-tuning-{strftime("%d-%H-%M-%S", gmtime())}'
display(
    HTML(
        '<b>Review the <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/hyper-tuning-jobs/{}">Tuning Job</a> After About 5 Minutes</b>'.format(
            region, tuning_job_name
        )
    )
)
tuner.fit(inputs, job_name=tuning_job_name)
tuner.wait()

Using provided s3_resource


INFO:sagemaker:Creating hyperparameter tuning job with name: xgb-model-tuning-04-20-42-28


..........................................................................................!
!


In [16]:
tuner_metrics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)
tuner_metrics.dataframe().sort_values(['FinalObjectiveValue'], ascending=True).head(5)

| | eta | max_depth | min_child_weight | subsample | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.389778 | 7.0 | 65.0 | 0.732408 | xgb-model-tuning-04-20-42-28-002-7dbed3fb | Completed | 10061.348633 | 2023-07-04 20:43:42+00:00 | 2023-07-04 20:46:35+00:00 | 173.0 |
| 3 | 0.294678 | 4.0 | 40.0 | 0.302141 | xgb-model-tuning-04-20-42-28-001-08d509b3 | Completed | 10950.479492 | 2023-07-04 20:43:39+00:00 | 2023-07-04 20:46:37+00:00 | 178.0 |
| 0 | 0.876835 | 7.0 | 118.0 | 0.961943 | xgb-model-tuning-04-20-42-28-004-c028b040 | Completed | 13059.387695 | 2023-07-04 20:47:54+00:00 | 2023-07-04 20:50:02+00:00 | 128.0 |
| 1 | 0.589236 | 1.0 | 89.0 | 0.732445 | xgb-model-tuning-04-20-42-28-003-51e7aefc | Completed | 15361.978516 | 2023-07-04 20:47:02+00:00 | 2023-07-04 20:49:15+00:00 | 133.0 |
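
Because the objective type is `Minimize`, the best trial is the one with the smallest `FinalObjectiveValue`. The sketch below reproduces that selection in plain Python, using the job names and objective values from the results above, and computes how much lower the best validation RMSE is than the worst. The variable names are chosen for illustration only.

```python
# Job names and objective values copied from the tuning results above.
results = [
    {"TrainingJobName": "xgb-model-tuning-04-20-42-28-002-7dbed3fb",
     "FinalObjectiveValue": 10061.348633},
    {"TrainingJobName": "xgb-model-tuning-04-20-42-28-001-08d509b3",
     "FinalObjectiveValue": 10950.479492},
    {"TrainingJobName": "xgb-model-tuning-04-20-42-28-004-c028b040",
     "FinalObjectiveValue": 13059.387695},
    {"TrainingJobName": "xgb-model-tuning-04-20-42-28-003-51e7aefc",
     "FinalObjectiveValue": 15361.978516},
]

# Minimize: the smallest objective value wins.
best = min(results, key=lambda r: r["FinalObjectiveValue"])
worst = max(results, key=lambda r: r["FinalObjectiveValue"])

# Relative reduction in validation RMSE from the worst to the best trial.
improvement = 1 - best["FinalObjectiveValue"] / worst["FinalObjectiveValue"]

print(best["TrainingJobName"])
print(f"Best trial RMSE is {improvement:.1%} lower than the worst trial")
```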


In [18]:
model_package_group_name = 'synthetic-housing-models-ray'
#model_package_group_name = unique_name_from_base('synthetic-housing-models-ray')

In [20]:
sagemaker_client.create_model_package_group(ModelPackageGroupName=model_package_group_name,
                                            ModelPackageGroupDescription='Models predicting synthetic housing prices')                                            

{'ModelPackageGroupArn': 'arn:aws:sagemaker:us-east-1:523914011708:model-package-group/synthetic-housing-models-ray-1688505916-5329',
 'ResponseMetadata': {'RequestId': 'acf58269-1e8d-4d23-b42d-c29fe9f2fcef',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'acf58269-1e8d-4d23-b42d-c29fe9f2fcef',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '132',
   'date': 'Tue, 04 Jul 2023 21:26:13 GMT'},
  'RetryAttempts': 0}}
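
Re-running the cell above with a group name that already exists is expected to fail, so a notebook that gets executed more than once may want to check first. The sketch below is one way to do that, assuming the boto3 SageMaker client's `list_model_package_groups` call with its `NameContains` filter. `ensure_model_package_group` and `FakeSageMakerClient` are hypothetical names; the fake client exists only so the example runs without AWS access, and the helper inspects just the first page of results.

```python
def ensure_model_package_group(client, group_name, description):
    """Create the group unless one with this exact name is already listed.

    Returns True if the group was created, False if it already existed.
    Only the first page of list results is inspected in this sketch.
    """
    response = client.list_model_package_groups(NameContains=group_name)
    existing = {
        summary["ModelPackageGroupName"]
        for summary in response.get("ModelPackageGroupSummaryList", [])
    }
    if group_name in existing:
        return False
    client.create_model_package_group(
        ModelPackageGroupName=group_name,
        ModelPackageGroupDescription=description,
    )
    return True


class FakeSageMakerClient:
    """Hypothetical in-memory stand-in for boto3.client('sagemaker')."""

    def __init__(self):
        self.groups = []

    def list_model_package_groups(self, NameContains=""):
        names = [name for name in self.groups if NameContains in name]
        return {
            "ModelPackageGroupSummaryList": [
                {"ModelPackageGroupName": name} for name in names
            ]
        }

    def create_model_package_group(self, ModelPackageGroupName,
                                   ModelPackageGroupDescription):
        self.groups.append(ModelPackageGroupName)
        return {}


fake_client = FakeSageMakerClient()
group = "synthetic-housing-models-ray"
first_call = ensure_model_package_group(fake_client, group, "Housing price models")
second_call = ensure_model_package_group(fake_client, group, "Housing price models")
print(first_call, second_call)  # prints: True False
```

In the notebook, the real `sagemaker_client` would take the place of the fake client.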

In [21]:
from helper_library import *
# Register model
best_estimator = tuner.best_estimator()
model_metrics = create_training_job_metrics(best_estimator, s3_prefix, region, bucket)

INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py3.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.c5.2xlarge.


2023-07-04 20:47:01 Starting - Preparing the instances for training
2023-07-04 20:47:01 Downloading - Downloading input data
2023-07-04 20:47:01 Training - Training image download completed. Training in progress.
2023-07-04 20:47:01 Uploading - Uploading generated training model
2023-07-04 20:47:01 Completed - Resource reused by training job: xgb-model-tuning-04-20-42-28-003-51e7aefc
[2023-07-04 20:44:34.527 ip-10-0-76-161.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-07-04 20:44:34.549 ip-10-0-76-161.ec2.internal:7 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
[2023-07-04:20:44:34:INFO] Imported framework sagemaker_xgboost_container.training
[2023-07-04:20:44:34:INFO] Failed to parse hyperparameter _tuning_objective_metric value validation:rmse to Json.
Returning the value itself
[2023-07-04:20:44:34:INFO] No GPUs detected (normal

In [22]:
model_package = best_estimator.register(content_types=['text/csv'],
                                        response_types=['application/json'],
                                        inference_instances=['ml.t2.medium', 'ml.m5.xlarge'],
                                        transform_instances=['ml.m5.xlarge'],
                                        image_uri=best_estimator.image_uri,
                                        model_package_group_name=model_package_group_name,
                                        model_metrics=model_metrics,
                                        approval_status='PendingManualApproval',
                                        description='XGBoost model to predict synthetic housing prices',
                                        model_name=model_name,
                                        name=model_name)
model_package_arn = model_package.model_package_arn

In [23]:
%store model_package_arn
%store model_name
%store model_package_group_name
%store model_metrics

Stored 'model_package_arn' (str)
Stored 'model_name' (str)
Stored 'model_package_group_name' (str)
Stored 'model_metrics' (ModelMetrics)
