## Introduction

This is the second notebook in the series. It explores the model training stage of the ML workflow.

Here, we put on the hat of the `Data Scientist` and perform the modeling task: training a model, tuning hyperparameters, evaluating the model, and registering high-performing candidate models in a model registry. This task is highly iterative, so we also need to track our experiments until we reach the desired results.

We will learn how to scale model development using managed SageMaker training and experiment tracking, combined with curated feature data pulled from SageMaker Feature Store. You'll also tune at scale using SageMaker's automatic hyperparameter tuning capabilities, and finally register the best-performing model in SageMaker Model Registry.

![Notebook2](images/Notebook2.png)



Let's get started!

**Important:** For this example, we will use XGBoost-Ray. XGBoost-Ray integrates well with the Ray Tune hyperparameter optimization library and implements advanced fault-tolerance handling mechanisms. We will use `ray.data` to load training, validation, and testing data (in Parquet format) from the offline store of the Feature Store. Then we will run a hyperparameter optimization job to find the best hyperparameters. Finally, we will register the best-performing model in the Model Registry.

In [3]:
%store -r

In [4]:
feature_group_name

'fs-ray-synthetic_home-price-2023-07-16-10-43-16'

In [5]:
!pip install -U sagemaker ray==2.5.0 modin[ray]==0.22.1 pydantic==1.10.10 xgboost_ray tensorboardx

Collecting ray[default]>=1.13.0 (from modin[ray]==0.22.1)
  Using cached ray-2.5.1-cp310-cp310-manylinux2014_x86_64.whl (56.2 MB)
INFO: pip is looking at multiple versions of ray[default] to determine which version is compatible with other requirements. This could take a while.


In [6]:
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.sklearn.model import SKLearnModel
from time import gmtime, strftime
import boto3
import sys
import sagemaker
import json
import os

from sagemaker.model_metrics import ModelMetrics, MetricsSource
from sagemaker.analytics import ExperimentAnalytics
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner
# SageMaker Experiments
from sagemaker.experiments.run import Run
from sagemaker.utils import unique_name_from_base

from sagemaker import image_uris
from sagemaker.inputs import TrainingInput

In [134]:
# Useful SageMaker variables
sess = sagemaker.Session()
bucket = sess.default_bucket()
role_arn= sagemaker.get_execution_role()
region = sess.boto_region_name
s3_client = boto3.client('s3', region_name=region)
sagemaker_client = boto3.client('sagemaker')

enable_local_mode_training = False
model_name = 'xgboost-model-synth-house-price'

experiment_name = unique_name_from_base('synthetic-housing-XGB-regression')

run_name = unique_name_from_base('XGBoost-run')

model_path = f's3://{bucket}/{s3_prefix}/output/model/xgb'

**Get the `ResolvedOutputS3Uri` of the Feature Group**

We can obtain the location where each Feature Group is storing data in parquet format.
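
One way to look this up is the SageMaker `DescribeFeatureGroup` API, whose response nests the location under `OfflineStoreConfig` → `S3StorageConfig` → `ResolvedOutputS3Uri`. The snippet below is a minimal sketch of extracting that field. The helper name `resolved_offline_store_uri` and the sample response are hypothetical placeholders, not output from this notebook.

```python
def resolved_offline_store_uri(describe_response: dict) -> str:
    """Return the offline store S3 location from a DescribeFeatureGroup response."""
    try:
        s3_config = describe_response["OfflineStoreConfig"]["S3StorageConfig"]
        return s3_config["ResolvedOutputS3Uri"]
    except KeyError as err:
        # Feature groups without an offline store have no such entry.
        raise ValueError("feature group has no resolved offline store location") from err


# Illustrative response fragment with placeholder values (assumption, not real output).
sample_response = {
    "FeatureGroupName": "example-feature-group",
    "OfflineStoreConfig": {
        "S3StorageConfig": {
            "S3Uri": "s3://example-bucket/feature-store",
            "ResolvedOutputS3Uri": (
                "s3://example-bucket/feature-store/111122223333/sagemaker/"
                "us-east-1/offline-store/example-feature-group-1234567890/data"
            ),
        }
    },
}

print(resolved_offline_store_uri(sample_response))
```

In this notebook, the response would come from `sagemaker_client.describe_feature_group(FeatureGroupName=feature_group_name)`.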

## SageMaker Training

Now that we've prepared our training and test data, we can move on to use SageMaker's hosted training functionality - [SageMaker Training](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html). Hosted training is preferred for doing actual training, especially large-scale, distributed training. Unlike training a model on a local computer or server, SageMaker hosted training will spin up a separate cluster of machines managed by SageMaker to train your model. Before starting hosted training, the data must be in S3, or an EFS or FSx for Lustre file system. We uploaded to S3 in the previous notebook, so we're good to go here.

In [8]:
%%writefile ./pipeline_scripts/train/script.py
import subprocess
import sys
# subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'pandas==1.5.2', 'sagemaker','ray[all]==2.4.0', 'modin[ray]==0.18.0', 'xgboost_ray', 'pyarrow >= 6.0.1','pydantic==1.10.10', 'gpustat==1.0.0'])

import os
import time
from glob import glob
import argparse
import json
import logging
import boto3
import sagemaker
import numpy as np
import modin.pandas as pd

# Experiments
from sagemaker.session import Session
from sagemaker.experiments.run import load_run

import ray
from xgboost_ray import RayDMatrix, RayParams, train

from ray.air.config import ScalingConfig
from ray.data import Dataset
from ray.air.result import Result
from ray.air.checkpoint import Checkpoint
from sagemaker_ray_helper import RayHelper 

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

def read_parameters():
    parser = argparse.ArgumentParser()

    # Hyperparameters are described here.
    parser.add_argument('--max_depth', type=int)
    parser.add_argument('--eta', type=float)
    parser.add_argument('--min_child_weight', type=int)
    parser.add_argument('--subsample', type=float)
    parser.add_argument('--verbosity', type=int)
    parser.add_argument('--num_round', type=int)
    parser.add_argument('--tree_method', type=str, default="auto")
    parser.add_argument('--predictor', type=str, default="auto")

    # Sagemaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output_data_dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION'))
    parser.add_argument('--sm_hosts', type=str, default=os.environ.get('SM_HOSTS'))
    parser.add_argument('--sm_current_host', type=str, default=os.environ.get('SM_CURRENT_HOST'))
    
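    parser.add_argument('--num_ray_workers', type=int, default=6)
    # argparse's type=bool treats any non-empty string (including "False") as True,
    # so parse the flag value explicitly.
    parser.add_argument('--use_gpu', type=lambda v: str(v).lower() in ('true', '1', 'yes'), default=False)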
    # parse region
    parser.add_argument('--region', type=str, default='us-east-1')
    
    parser.add_argument('--target_col', type=str, default='price')
    
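    try:
        # Use the CPU count reported by the SageMaker training toolkit when available
        from sagemaker_training import environment
        env = environment.Environment()
        parser.add_argument('--n_jobs', type=int, default=env.num_cpus)
    except Exception:
        # Fall back to a fixed default outside of SageMaker (e.g. local testing)
        parser.add_argument('--n_jobs', type=int, default=4)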

    args, _ = parser.parse_known_args()
    return args

def load_dataset(path, num_workers, target_col="price"):
    """
    Loads the data as a Ray dataset from the CSV files under `path`.
    Args:
        path (str): directory containing the CSV files (a SageMaker input channel)
        num_workers (int): number of partitions to split the dataset into
        target_col (str): the target column (dropped only for the test set)
    Returns:
        ds (ray.data.Dataset): Ray dataset that contains the requested data
    """
    cols_to_drop = []
    # A simple check for whether this is test data.
    # If so, add the target column to the list of columns to drop.
    if '/test/' in path:
        cols_to_drop.append(target_col)

    csv_files = glob(os.path.join(path, "*.csv"))
    print(f"found {len(csv_files)} files")
    ds = ray.data.read_csv(path)
    if cols_to_drop:
        ds = ds.drop_columns(cols_to_drop)
    print(f"{path} count is {ds.count()}")

    return ds.repartition(num_workers)

def train_xgboost(ds_train, ds_val, params, num_workers, target_col = "price") -> None:
    """
    Trains an XGBoost model with XGBoost-Ray, saves it to the model directory,
    and logs the final validation metrics.
    Args:
        ds_train (ray.data.Dataset): Training dataset
        ds_val (ray.data.Dataset): Validation dataset
        params (dict): Hyperparameters
        num_workers (int): number of workers to distribute the training across
        target_col (str): target column (currently unused; the label column is hard-coded to 'PRICE' below)
    Returns:
        None
    """
    
    train_set = RayDMatrix(ds_train, 'PRICE')
    val_set = RayDMatrix(ds_val, 'PRICE')
    
    evals_result = {}
    
    trainer = train(
        params=params,
        dtrain=train_set,
        evals_result=evals_result,
        evals=[(val_set, "validation")],
        verbose_eval=False,
        num_boost_round=100,
        ray_params=RayParams(num_actors=num_workers, cpus_per_actor=1),
    )
    
    output_path=os.path.join(args.model_dir, 'model.xgb')
    
    trainer.save_model(output_path)
    
    valMAE = evals_result["validation"]["mae"][-1]
    valRMSE = evals_result["validation"]["rmse"][-1]
 
    print('[3] #011validation-mae:{}'.format(valMAE))
    print('[4] #011validation-rmse:{}'.format(valRMSE))
    
    local_testing = False
    try:
        load_run(sagemaker_session=sess)
    except:
        local_testing = True
    if not local_testing: # Track experiment if using SageMaker Training
        with load_run(sagemaker_session=sess) as run:
            run.log_metric('validation-mae', valMAE)
            run.log_metric('validation-rmse', valRMSE)

def main():
    # Get SageMaker host information from runtime environment variables
    sm_hosts = json.loads(args.sm_hosts)
    sm_current_host = args.sm_current_host
    
    hyperparams = {
        'max_depth': args.max_depth,
        'min_child_weight': args.min_child_weight,
        'eta': args.eta,
        'subsample': args.subsample,
        "tree_method": "approx",
        "objective": "reg:squarederror",
        "eval_metric": ["mae", "rmse"],
        "num_round": 100,
        "seed": 47
    }

    ds_train = load_dataset(args.train, args.num_ray_workers, args.target_col)
    ds_validation = load_dataset(args.validation, args.num_ray_workers, args.target_col)
    
    trainer = train_xgboost(ds_train, ds_validation, hyperparams, args.num_ray_workers, args.target_col)

    
if __name__ == '__main__':
    ray_helper = RayHelper()
    
    ray_helper.start_ray()
    args = read_parameters()
    sess = sagemaker.Session(boto3.Session(region_name=args.region))

    start = time.time()
    main()
    taken = time.time() - start
    print(f"TOTAL TIME TAKEN: {taken:.2f} seconds")
    
    
    

Overwriting ./pipeline_scripts/train/script.py


In [9]:
!cp -r ./common/* ./pipeline_scripts/train/

In [10]:
hyperparams = {
    "max_depth": "5",
    "eta": "0.2",
    "min_child_weight": "6",
    "subsample": "0.7",
    # "objective": "reg:squarederror",
}

train_instance_type = 'ml.c5.xlarge'

estimator_parameters = {
    'source_dir': './pipeline_scripts/train/',
    'entry_point': 'script.py',
    'framework_version': '1.7-1',
    'instance_type': train_instance_type,
    'instance_count': 2,
    'hyperparameters': hyperparams,
    'role': role_arn,
    'base_job_name': 'XGBoost-model',
    'output_path': model_path,
    'image_scope': 'training',
    'env': {
        'MODIN_AUTOIMPORT_PANDAS': '1', 
        'SAGEMAKER_REQUIREMENTS': 'requirements.txt', # path relative to `source_dir` above.
    }
}

inputs = {'train': TrainingInput(train_s3_destination), 'validation': TrainingInput(val_s3_destination)}


In [11]:
from IPython.display import display, HTML
from sagemaker.xgboost.estimator import XGBoost
# from sagemaker.sklearn.estimator import SKLearn

# The training job name is only generated when fit() is called,
# so link to the training jobs list rather than to a specific job.
display(
    HTML(
        '<b>Review the <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs">Training Job</a> After About 5 Minutes</b>'.format(
            region
        )
    )
)

with Run(experiment_name=experiment_name, run_name=run_name) as run:
    estimator = XGBoost(**estimator_parameters)
    estimator.fit(inputs)


INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py3.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.c5.xlarge.
INFO:sagemaker:Creating training-job with name: XGBoost-model-2023-07-19-15-25-16-308


Using provided s3_resource
2023-07-19 15:25:16 Starting - Starting the training job...
2023-07-19 15:25:31 Starting - Preparing the instances for training......
2023-07-19 15:26:32 Downloading - Downloading input data...
2023-07-19 15:27:02 Training - Downloading the training image..
2023-07-19 15:27:23 Training - Training image download completed. Training in progress.
[2023-07-19 15:27:27.768 ip-10-0-210-229.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-07-19 15:27:27.790 ip-10-0-210-229.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2023-07-19:15:27:28:INFO] Imported framework sagemaker_xgboost_container.training
[2023-07-19:15:27:28:INFO] No GPUs detected (normal if no gpus installed)
[2023-07-19:15:27:28:INFO] Invoking user training script.
[2023-07-19:15:27:28:INFO] Module script does not provide a setup.py.
Generating setup.py
[2023-07-19:15:27:28:INF

### Verify Ray Cluster
In the output from the previous step, right after the Ray head node is initialized, you should see the `ray.cluster_resources()` output. It will look like this:

<span style="color:#208ffb">All workers present and accounted for <br/>
{'CPU': 8.0, 'memory': xxxx, 'object_store_memory': xxxx, 'node:10.2.xxx.xxx': 1.0, 'node:10.2.xxx.xxx': 1.0}</span>
<br></br>
This confirms that there were 2 `ml.c5.xlarge` instances, with a total of 8 CPUs, in the Ray cluster that processed this training job.
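
The same check can be done programmatically. The snippet below is a minimal sketch that counts nodes and CPUs in a dictionary shaped like the `ray.cluster_resources()` output above. The helper name `summarize_cluster` and the sample values (IP addresses, memory sizes) are placeholders chosen for illustration.

```python
from typing import Tuple


def summarize_cluster(resources: dict) -> Tuple[int, float]:
    """Return (node count, total CPUs) from a ray.cluster_resources()-style dict."""
    node_count = sum(
        1
        for key in resources
        # Each node appears as a "node:<ip>" key; skip internal marker keys
        # (e.g. "node:__internal_head__") that some Ray versions add.
        if key.startswith("node:") and not key.startswith("node:__")
    )
    return node_count, float(resources.get("CPU", 0.0))


# Placeholder dictionary mirroring the structure shown above (made-up values).
sample_resources = {
    "CPU": 8.0,
    "memory": 10_000_000_000.0,
    "object_store_memory": 4_000_000_000.0,
    "node:10.2.0.1": 1.0,
    "node:10.2.0.2": 1.0,
}

nodes, cpus = summarize_cluster(sample_resources)
instance_count, vcpus_per_instance = 2, 4  # ml.c5.xlarge provides 4 vCPUs
assert nodes == instance_count
assert cpus == instance_count * vcpus_per_instance
print(f"{nodes} nodes, {cpus:.0f} CPUs")
```

Inside the training container, `ray.cluster_resources()` would supply the dictionary instead of `sample_resources`.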

## Hyperparameter Tuning

Instead of manually configuring your hyperparameter values and training with SageMaker Training, you can also train with Amazon SageMaker Automatic Model Tuning (AMT). AMT, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in the best-performing model, as measured by a metric that you choose.
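
The objective metric is read from each training job's log output, which is why the training script above prints lines such as `[4] #011validation-rmse:...`. The snippet below is a minimal sketch of how a regular expression can pull the value out of such lines. The pattern and the sample log values are illustrative assumptions, not necessarily what the SageMaker XGBoost container uses.

```python
import re

# Hypothetical pattern for log lines like "[4] #011validation-rmse:8302.69".
RMSE_PATTERN = re.compile(r"validation-rmse:([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def extract_rmse(log_lines):
    """Return the last validation RMSE found in the log lines, or None."""
    value = None
    for line in log_lines:
        match = RMSE_PATTERN.search(line)
        if match:
            value = float(match.group(1))
    return value


# Sample log lines in the format printed by the training script (made-up values).
sample_logs = [
    "[3] #011validation-mae:6100.5",
    "[4] #011validation-rmse:8302.693359",
]

print(extract_rmse(sample_logs))
```

With `objective_type` set to `Minimize`, the tuning job would favor the training job whose extracted value is lowest.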

In [12]:
hyperparameter_ranges = {
    "max_depth": IntegerParameter(1, 8),
    "eta": ContinuousParameter(0.1, 0.5),
    "min_child_weight": IntegerParameter(0, 120),
    "subsample": ContinuousParameter(0.2, 1),
}

objective_metric_name = 'validation:rmse'
objective_type = 'Minimize'

In [13]:
tuner_parameters = {
                    'estimator': estimator,
                    'objective_metric_name': objective_metric_name,
                    'hyperparameter_ranges': hyperparameter_ranges,
                    # 'metric_definitions': metric_definitions,
                    'max_jobs': 10,
                    'max_parallel_jobs': 5,
                    'objective_type': objective_type
                    }
    
tuner = HyperparameterTuner(**tuner_parameters)

tuning_job_name = f'xgb-model-tuning-{strftime("%d-%H-%M-%S", gmtime())}'
display(
    HTML(
        '<b>Review the <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/hyper-tuning-jobs/{}">Tuning Job</a> After About 5 Minutes</b>'.format(
            region, tuning_job_name
        )
    )
)
tuner.fit(inputs, job_name=tuning_job_name)
tuner.wait()

Using provided s3_resource


INFO:sagemaker:Creating hyperparameter tuning job with name: xgb-model-tuning-19-15-30-39


................................................................................................................!
!


In [14]:
tuner_metrics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)
tuner_metrics.dataframe().sort_values(['FinalObjectiveValue'], ascending=True).head(5)

| | eta | max_depth | min_child_weight | subsample | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 0.137447 | 6.0 | 78.0 | 0.815319 | xgb-model-tuning-19-15-30-39-001-b4f93606 | Completed | 8302.693359 | 2023-07-19 15:32:11+00:00 | 2023-07-19 15:35:55+00:00 | 224.0 |
| 3 | 0.1 | 8.0 | 94.0 | 0.772631 | xgb-model-tuning-19-15-30-39-007-1f22db72 | Completed | 8877.764648 | 2023-07-19 15:36:27+00:00 | 2023-07-19 15:39:05+00:00 | 158.0 |
| 7 | 0.313556 | 2.0 | 53.0 | 0.975413 | xgb-model-tuning-19-15-30-39-003-21c773b2 | Completed | 9517.50293 | 2023-07-19 15:32:02+00:00 | 2023-07-19 15:35:31+00:00 | 209.0 |
| 8 | 0.240968 | 4.0 | 118.0 | 0.909318 | xgb-model-tuning-19-15-30-39-002-5a6bd673 | Completed | 9861.994141 | 2023-07-19 15:32:08+00:00 | 2023-07-19 15:35:26+00:00 | 198.0 |
| 0 | 0.5 | 8.0 | 57.0 | 0.833239 | xgb-model-tuning-19-15-30-39-010-1856d481 | Completed | 9987.393555 | 2023-07-19 15:36:32+00:00 | 2023-07-19 15:39:15+00:00 | 163.0 |


In [118]:
%%writefile ./pipeline_scripts/inference/script.py

import json
import os
import pickle as pkl

import numpy as np
import tarfile
import xgboost as xgb
import sagemaker_xgboost_container.encoder as xgb_encoder


def model_fn(model_dir):
    """
    Deserialize and return fitted model.
    """
    booster = xgb.Booster()
    booster.load_model(os.path.join(model_dir, 'model.xgb'))
    return booster


def input_fn(request_body, request_content_type):
    """
    The SageMaker XGBoost model server receives the request data body and the content type,
    and invokes the `input_fn`.

    Return a DMatrix (an object that can be passed to predict_fn).
    """
    print(f'Incoming format type is {request_content_type}')
    if request_content_type == "text/csv":
        decoded_payload = request_body.strip()
        # np.float was removed in NumPy 1.24; the builtin float is equivalent
        return xgb_encoder.csv_to_dmatrix(decoded_payload, dtype=float)
    elif request_content_type == "text/libsvm":
        return xgb_encoder.libsvm_to_dmatrix(request_body)
    else:
        raise ValueError(
            "Content type {} is not supported.".format(request_content_type)
        )


def predict_fn(input_data, model):
    """
    SageMaker XGBoost model server invokes `predict_fn` on the return value of `input_fn`.

    Return a two-dimensional NumPy array where the first columns are predictions
    and the remaining columns are the feature contributions (SHAP values) for that prediction.
    """
    prediction = model.predict(input_data)
    feature_contribs = model.predict(input_data, pred_contribs=True, validate_features=False)
    output = np.hstack((prediction[:, np.newaxis], feature_contribs))
    return output


def output_fn(predictions, content_type):
    """
    After invoking predict_fn, the model server invokes `output_fn`.
    """
    print(f'outgoing format type is {content_type}')
    print (predictions)
    if content_type == "text/csv":
        return ','.join(str(x[0]) for x in predictions)
    else:
        raise ValueError("Content type {} is not supported.".format(content_type))

Overwriting ./pipeline_scripts/inference/script.py


In [135]:
from helper_library import *
# Register model
best_estimator = tuner.best_estimator()
#best_estimator = estimator
model_metrics = create_training_job_metrics(best_estimator, s3_prefix, region, bucket)

INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py3.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.c5.xlarge.


2023-07-19 15:36:30 Starting - Preparing the instances for training
2023-07-19 15:36:30 Downloading - Downloading input data
2023-07-19 15:36:30 Training - Training image download completed. Training in progress.
2023-07-19 15:36:30 Uploading - Uploading generated training model
2023-07-19 15:36:30 Completed - Resource reused by training job: xgb-model-tuning-19-15-30-39-010-1856d481
[2023-07-19 15:33:04.718 ip-10-0-180-84.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-07-19 15:33:04.740 ip-10-0-180-84.ec2.internal:7 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
[2023-07-19:15:33:05:INFO] Imported framework sagemaker_xgboost_container.training
[2023-07-19:15:33:05:INFO] Failed to parse hyperparameter _tuning_objective_metric value validation:rmse to Json.
Returning the value itself
[2023-07-19:15:33:05:INFO] No GPUs detected (normal

In [None]:
"""

model = XGBoostModel(model_data=xgb_model.model_data,
                     role = role,
                     name = model_name,
                     model_package_group_name=model_package_group,
                     model_metrics=model_metrics)
"""

In [136]:
#model_package_group_name = 'synthetic-housing-models-ray'
model_package_group_name = unique_name_from_base('synthetic-housing-models-ray-')

In [137]:
sagemaker_client.create_model_package_group(ModelPackageGroupName=model_package_group_name,
                                            ModelPackageGroupDescription='Models predicting synthetic housing prices')                                            

{'ModelPackageGroupArn': 'arn:aws:sagemaker:us-east-1:523914011708:model-package-group/synthetic-housing-models-ray--1689808577-5420',
 'ResponseMetadata': {'RequestId': '0fdbd45e-e158-44d3-ab1a-9835cf5d1e25',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '0fdbd45e-e158-44d3-ab1a-9835cf5d1e25',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '133',
   'date': 'Wed, 19 Jul 2023 23:16:18 GMT'},
  'RetryAttempts': 0}}

In [138]:
from sagemaker.xgboost.model import XGBoostModel
print(model_data_path)
xgb_inference_model = XGBoostModel(
    model_data=best_estimator.model_data,
    role=role_arn,
    name = model_name,
    entry_point="./pipeline_scripts/inference/script.py",
    framework_version="1.7-1"
)

s3://sagemaker-us-east-1-523914011708/aws-sm-ray-workshop/output/model/xgb/XGBoost-model-2023-07-19-15-25-16-308/output/model.tar.gz


In [142]:
xgb_model_package = xgb_inference_model.register(content_types=['text/csv'],
                                        response_types=['text/csv'],  # output_fn in the inference script only supports text/csv
                                        inference_instances=['ml.t2.medium', 'ml.m5.xlarge'],
                                        transform_instances=['ml.m5.xlarge'],
                                        image_uri=best_estimator.image_uri,
                                        model_package_group_name=model_package_group_name,
                                        model_metrics=model_metrics,
                                        approval_status='PendingManualApproval',
                                        description='XGBoost model to predict synthetic housing prices',
                                        # model_package_name=model_name,
                                    )



In [143]:
model_package_arn = xgb_model_package.model_package_arn

In [124]:
model_package = best_estimator.register(content_types=['text/csv'],
                                        response_types=['text/csv'],  # output_fn in the inference script only supports text/csv
                                        inference_instances=['ml.t2.medium', 'ml.m5.xlarge'],
                                        transform_instances=['ml.m5.xlarge'],
                                        image_uri=best_estimator.image_uri,
                                        model_package_group_name=model_package_group_name,
                                        model_metrics=model_metrics,
                                        approval_status='PendingManualApproval',
                                        description='XGBoost model to predict synthetic housing prices',
                                        model_name=model_name,
                                        name=model_name,
                                        # model_data=model_data_path,
                                        entry_point="./pipeline_scripts/inference/script.py",)
model_package_arn = model_package.model_package_arn

In [125]:
model_package

<sagemaker.model.ModelPackage at 0x7f60bd403f10>

In [144]:
%store model_package_arn
%store model_name
%store model_package_group_name
%store model_metrics
%store model_data_path

Stored 'model_package_arn' (str)
Stored 'model_name' (str)
Stored 'model_package_group_name' (str)
Stored 'model_metrics' (ModelMetrics)
Stored 'model_data_path' (str)


In [127]:
# Hard-coded offline store locations; replace with the values for your own feature groups.
fs_train_data_loc = 's3://sagemaker-us-east-1-523914011708/aws-sm-ray-workshop/data/feature-store/train/523914011708/sagemaker/us-east-1/offline-store/fs-train--2023-07-04-13-37-02-1688478127/data'
fs_val_data_loc = 's3://sagemaker-us-east-1-523914011708/aws-sm-ray-workshop/data/feature-store/validation/523914011708/sagemaker/us-east-1/offline-store/fs-train--2023-07-04-13-37-02-1688478127/data'
fs_test_data_loc = 's3://sagemaker-us-east-1-523914011708/aws-sm-ray-workshop/data/feature-store/test/523914011708/sagemaker/us-east-1/offline-store/fs-test--2023-07-04-13-37-02-1688478127/data'

In [128]:
%store fs_train_data_loc
%store fs_val_data_loc
%store fs_test_data_loc


In [31]:
model_data_path

's3://sagemaker-us-east-1-523914011708/aws-sm-ray-workshop/output/model/xgb/XGBoost-model-2023-07-19-15-25-16-308/output/model.tar.gz'

In [114]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer, CSVDeserializer
predictor = xgb_inference_model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
    serializer=CSVSerializer(),
    deserializer=CSVDeserializer()
)

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.c5.xlarge.
INFO:sagemaker:Creating model with name: sagemaker-xgboost-2023-07-19-19-41-32-453
INFO:sagemaker:Creating endpoint-config with name sagemaker-xgboost-2023-07-19-19-41-33-274
INFO:sagemaker:Creating endpoint with name sagemaker-xgboost-2023-07-19-19-41-33-274


----!

In [115]:
import pandas as pd 

df = pd.read_csv("./data/processed/test/4e6a5839a2c240ae9ae86ad10e699a7c_000000.csv")

dropped_df = df.drop(columns=["PRICE"])
df.head(5)
# Get a real-time prediction (only predicting the 1st 5 row to reduce output size)
#

| | NUM_BATHROOMS | NUM_BEDROOMS | FRONT_PORCH | LOT_ACRES | DECK | SQUARE_FEET | YEAR_BUILT | GARAGE_SPACES | PRICE |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.724528 | -1.45194 | 1.006354 | -0.309725 | 0.991371 | 0.404904 | -1.161628 | -1.327666 | 387076 |
| 1 | 1.4387 | 0.693784 | 1.006354 | 0.573522 | -1.008705 | 0.514101 | -0.35465 | -1.327666 | 480152 |
| 2 | -1.417989 | -0.021457 | 1.006354 | -0.831644 | -1.008705 | -0.154838 | -1.363372 | 1.320898 | 354698 |
| 3 | 0.010355 | 1.409025 | -0.993687 | -0.751349 | 0.991371 | 1.325636 | 1.057561 | -0.444812 | 646437 |
| 4 | 0.010355 | -0.021457 | -0.993687 | 0.453079 | -1.008705 | -2.337322 | -1.363372 | 1.320898 | 129137 |


In [116]:
preds = predictor.predict(dropped_df[:5].to_csv(index=False, header=False))

In [117]:
preds

[['397074.56', '483460.0', '360082.28', '652707.94', '117097.086']]
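
The endpoint returns the predictions as strings because of the `CSVDeserializer`. As a quick sanity check, the sketch below converts them to floats and compares them with the `PRICE` values of the same five rows shown earlier. The variable names `actual_prices` and `mean_abs_error` are introduced here for illustration only.

```python
# Values copied from the outputs above; CSVDeserializer yields a list of string rows.
preds = [['397074.56', '483460.0', '360082.28', '652707.94', '117097.086']]
actual_prices = [387076, 480152, 354698, 646437, 129137]

# Convert the single CSV row of predictions to floats
predicted_prices = [float(value) for value in preds[0]]

# Mean absolute error over the five sampled rows
abs_errors = [abs(p - a) for p, a in zip(predicted_prices, actual_prices)]
mean_abs_error = sum(abs_errors) / len(abs_errors)

print(f"MAE over {len(abs_errors)} rows: {mean_abs_error:.2f}")
```

Five rows are far too few to judge model quality; the comparison only confirms that the endpoint returns values on the expected scale.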