## Introduction

This is our fourth notebook (Lab 2), which dives deep into automating machine learning workflows to create a more repeatable path to production.

Here, we will put on the hat of an `ML Engineer` and automate the tasks within our machine learning workflow as well as orchestrate its steps. To do this, we'll build pipeline steps that combine the components from all the previous notebooks into a single entity. This pipeline provides a repeatable ML workflow with some reliability built in through minimal quality gates.

For this task we will use Amazon SageMaker Pipelines to build an end-to-end machine learning pipeline.

![Notebook4](images/Notebook-4.png)

Keep in mind that CI/CD practices are typically more aligned with the *Reliable* stage, so you'll notice we have not yet considered a more robust set of pipelines that accounts for the lifecycle of each stage (build vs. deploy), source/version control, automated triggers, or additional quality gates.

Let's get started!

In [2]:
!pip install -U sagemaker

In [3]:
!pip show sagemaker

Name: sagemaker
Version: 2.171.0
Summary: Open source library for training and deploying models on Amazon SageMaker.
Home-page: https://github.com/aws/sagemaker-python-sdk/
Author: Amazon Web Services
Author-email: 
License: Apache License 2.0
Location: /opt/conda/lib/python3.10/site-packages
Requires: attrs, boto3, cloudpickle, google-pasta, importlib-metadata, jsonschema, numpy, packaging, pandas, pathos, platformdirs, protobuf, PyYAML, schema, smdebug-rulesconfig, tblib
Required-by: 


In [4]:
%store -r

In [5]:
# Processing imports
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

# SageMaker Pipeline imports
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep, CreateModelStep, TransformStep, TuningStep
from sagemaker.workflow.model_step import ModelStep

from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
    ParameterFloat,
)

# Other imports
import json
import time
from time import gmtime, strftime
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.model import Model
from sagemaker.tuner import IntegerParameter, HyperparameterTuner
from sagemaker.inputs import TrainingInput
from sagemaker.lambda_helper import Lambda
from sagemaker.workflow.lambda_step import (
    LambdaStep,
    LambdaOutput,
    LambdaOutputTypeEnum,
)

# To test the endpoint once it's deployed
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer, CSVDeserializer
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

import sagemaker
import json
import boto3
from sagemaker.model_metrics import ModelMetrics, MetricsSource
import pandas as pd
from sagemaker.feature_store.feature_group import FeatureGroup
from helper_library import *

from sagemaker.workflow.steps import CacheConfig

**Session variables**

In [6]:
# Useful SageMaker variables
session = PipelineSession()
bucket = session.default_bucket()
role_arn= sagemaker.get_execution_role()
region = session.boto_region_name
sagemaker_client = boto3.client('sagemaker')
aws_account_id = boto3.client('sts').get_caller_identity().get('Account')
lambda_role = create_lambda_iam_role('LambdaSageMakerExecutionRole')
# Data paths in S3
s3_prefix = 'aws-sm-ray-workshop'
bucket_prefix = f'{s3_prefix}/data/feature-store'
model_prefix = f'{s3_prefix}/models'
output_path = f's3://{bucket}/{s3_prefix}/data/sm_processed'
fs_s3_path = f's3://{bucket}/{s3_prefix}/data/feature-store'

Using ARN from existing role: LambdaSageMakerExecutionRole
Done


## Model Build pipeline with SageMaker Pipelines

[Amazon SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-sdk.html) lets you create a directed acyclic graph (DAG) containing the pipeline steps needed to build and/or deploy machine learning models. Each pipeline, created through the provided Python SDK, is a series of interconnected steps. The same pipeline can also be exported as a JSON pipeline definition.

The structure of a pipeline's DAG is determined by the data dependencies between steps. These data dependencies are created when the properties of a step's output are passed as the input to another step. The following image is a pipeline DAG that we'll be creating for our training pipeline:
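To make the idea of dependency-driven ordering concrete, here is a minimal sketch that uses Python's standard `graphlib` module rather than SageMaker itself. The step names mirror the ones used in this notebook; the `dependencies` mapping and the `execution_order` helper are illustrative assumptions, not part of the SageMaker SDK.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes.
# The names mirror this notebook; the mapping itself is a simplified illustration.
dependencies = {
    "PreprocessData": set(),
    "FeatureStoreIngestion": {"PreprocessData"},
    "HPTuning": {"FeatureStoreIngestion"},
    "EvaluateModel": {"HPTuning"},
    "CheckEvaluation": {"EvaluateModel"},
}


def execution_order(deps):
    """Return one valid execution order in which every step follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because each step here depends on exactly one predecessor, only one ordering is possible. With branching dependencies, independent steps could run in parallel.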

![](images/sagemaker-pipelines-dag.png)

#### Pipeline Parameters

SageMaker Pipelines supports [pipeline parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-parameters.html), which let you provide runtime values for each run of your pipeline. This allows you to change key inputs for each run without changing your pipeline code (e.g., the raw input data).

Here, we'll define the parameters and set a default value for each. Parameters also make the pipeline reusable: you can override these defaults when you start an execution later in the notebook.
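The relationship between defaults and per-run overrides can be sketched in plain Python. This is a conceptual model only: `DEFAULTS` copies the default values defined in this notebook, while `resolve_parameters` and its validation rules are hypothetical and do not reproduce the service's exact behavior.

```python
# Declared defaults, mirroring the pipeline parameters defined in this notebook.
DEFAULTS = {
    "ProcessingInstanceCount": 1,
    "bucket_prefix": "aws-ray-mlops-workshop/feature-store",
    "RMSEThreshold": 15000.0,
}


def resolve_parameters(defaults, overrides=None):
    """Merge per-run overrides onto the defaults without mutating them.

    Rejecting unknown names and mismatched types is an assumption of this sketch.
    """
    overrides = overrides or {}
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"Unknown pipeline parameters: {sorted(unknown)}")
    resolved = dict(defaults)
    for name, value in overrides.items():
        expected_type = type(defaults[name])
        if not isinstance(value, expected_type):
            raise TypeError(f"{name} expects {expected_type.__name__}")
        resolved[name] = value
    return resolved


# One run with a stricter quality gate; every other parameter keeps its default.
run_params = resolve_parameters(DEFAULTS, {"RMSEThreshold": 10000.0})
print(run_params)
```

With the SDK, the equivalent override is a dictionary passed when starting an execution, for example `training_pipeline.start(parameters={"RMSEThreshold": 10000.0})`.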

In [7]:
# Upload raw data to S3
local_data_path_ray = "data/raw/ray/house_pricing.csv"
raw_data_s3_prefix = '{}/data/raw'.format(s3_prefix)
raw_s3 = session.upload_data(path=local_data_path_ray, key_prefix=raw_data_s3_prefix)

In [8]:
# Optional step
# Delete all files under the S3 prefix before beginning preprocessing.
# This prevents duplicated data when running this workshop multiple times.

s3 = boto3.resource('s3')
print(bucket)
bucket_obj = s3.Bucket(bucket)
print(f"{s3_prefix}/data/sm_processed/")
files = bucket_obj.objects.filter(Prefix=f"{s3_prefix}/data/sm_processed/")
files.delete()

sagemaker-us-east-1-523914011708
aws-sm-ray-workshop/data/sm_processed/


[{'ResponseMetadata': {'RequestId': 'NEN60782JHAXRY5B',
   'HostId': 'Y0blAkRqXU4TyYkJUbgbHp5qu4g6cDKsC2cyjPws/R01hEQ85byZLPuo/zfeNkD7Gq1WGq7IKxc=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'Y0blAkRqXU4TyYkJUbgbHp5qu4g6cDKsC2cyjPws/R01hEQ85byZLPuo/zfeNkD7Gq1WGq7IKxc=',
    'x-amz-request-id': 'NEN60782JHAXRY5B',
    'date': 'Tue, 11 Jul 2023 00:35:10 GMT',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3',
    'connection': 'close'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'aws-sm-ray-workshop/data/sm_processed/test/651e40caff3b45c1973bf015e9137453_000000.csv'},
   {'Key': 'aws-sm-ray-workshop/data/sm_processed/validation/bdb8f4f66c1b4d41b99ef65fd492bbc9_000000.csv'},
   {'Key': 'aws-sm-ray-workshop/data/sm_processed/train/4a5d3d63121946fca22d79e859b7c1ad_000000.csv'}]}]

In [None]:
# Optional step
# Delete all Feature Groups whose names contain 'fs-'.
# This prevents duplicated feature groups when running this workshop multiple times.

sm_client = boto3.client('sagemaker', region_name='us-east-1')
sagemaker_session = sagemaker.Session(boto3.Session(region_name='us-east-1'))
response = sm_client.list_feature_groups(
    NameContains='fs-'
)

for feature in response["FeatureGroupSummaries"]:
    print(f'deleting {feature["FeatureGroupName"]}')
    resp = sm_client.delete_feature_group(
        FeatureGroupName=feature["FeatureGroupName"]
    )

In [9]:
processing_instance_count = ParameterInteger(
    name='ProcessingInstanceCount',
    default_value=1
)
"""
train_feature_group_name = ParameterString(
    name='train_feature_group_name',
    default_value='fs-train-synthetic-housing-data'
)

validation_feature_group_name = ParameterString(
    name='validation_feature_group_name',
    default_value='fs-val-synthetic-housing-data'
)

test_feature_group_name = ParameterString(
    name='test_feature_group_name',
    default_value='fs-test-synthetic-housing-data'
)
"""
bucket_prefix = ParameterString(
    name='bucket_prefix',
    default_value='aws-ray-mlops-workshop/feature-store'
)

rmse_threshold = ParameterFloat(name="RMSEThreshold", default_value=15000.0)

train_feature_group_name = 'fs-train-synthetic-housing-data'
validation_feature_group_name = 'fs-val-synthetic-housing-data'
test_feature_group_name = 'fs-test-synthetic-housing-data'

#### Setup Step Caching Configuration

This configuration can be enabled on pipeline steps so that SageMaker Pipelines automatically checks for a previous successful run of the step with the same values for specific parameters. If one is found, Pipelines propagates the results of that run to the next step without re-running the step, saving both time and compute costs.
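The caching behavior can be pictured as a lookup keyed on the step's arguments, with an expiry window. The sketch below is a conceptual model in plain Python, not SageMaker's implementation: `cache_key`, `run_step`, and the in-memory dictionary are hypothetical, and the 12-hour window mirrors the `PT12H` value used in the next cell.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(hours=12)  # mirrors expire_after="PT12H"


def cache_key(step_name, arguments):
    """Hash the step name and its arguments into a stable key."""
    payload = json.dumps({"step": step_name, "args": arguments}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(cache, step_name, arguments, compute, now=None):
    """Return (result, cache_hit), reusing a fresh cached result when one exists."""
    now = now or datetime.now(timezone.utc)
    key = cache_key(step_name, arguments)
    entry = cache.get(key)
    if entry is not None and now - entry["finished_at"] < CACHE_TTL:
        return entry["result"], True
    result = compute(arguments)
    cache[key] = {"result": result, "finished_at": now}
    return result, False


cache = {}
t0 = datetime(2023, 7, 11, tzinfo=timezone.utc)
args = {"instance_type": "ml.m5.xlarge", "instance_count": 1}

# The S3 URI is a placeholder standing in for a step's output.
first = run_step(cache, "PreprocessData", args, lambda a: "s3://example-bucket/train", now=t0)
second = run_step(cache, "PreprocessData", args, lambda a: "unused", now=t0 + timedelta(hours=1))
third = run_step(cache, "PreprocessData", args, lambda a: "recomputed", now=t0 + timedelta(hours=13))
print(first, second, third)
```

The second call reuses the first result because the arguments match and the entry is still fresh; the third call recomputes because the entry is older than 12 hours.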

In [10]:
cache_config = CacheConfig(enable_caching=True, expire_after="PT12H")

#### SageMaker Processing step

This should look very similar to the SageMaker Processing Job you configured in notebook 2. The only new line of code is the `ProcessingStep` line at the bottom of the cell below which allows us to take the Processing Job configuration and include it as a pipeline step. 

In [11]:
preprocess_data_processor = SKLearnProcessor(
    framework_version='1.0-1',
    role=role_arn,
    instance_type='ml.m5.xlarge',
    instance_count=processing_instance_count,
    base_job_name='preprocess-data',
    sagemaker_session=session,
    
)

preprocess_dataset_step = ProcessingStep(
    name='PreprocessData',
    code='./pipeline_scripts/preprocessing/script.py',
    processor=preprocess_data_processor,
    inputs=[
        ProcessingInput(
            source=raw_s3,
            destination='/opt/ml/processing/input',
            s3_data_distribution_type='ShardedByS3Key'
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='train',
            destination=f'{output_path}/train',
            source='/opt/ml/processing/output/train'
        ),
        ProcessingOutput(
            output_name='validation',
            destination=f'{output_path}/validation',
            source='/opt/ml/processing/output/validation'
        ),
        ProcessingOutput(
            output_name='test',
            destination=f'{output_path}/test',
            source='/opt/ml/processing/output/test'
        )
    ],
    cache_config=cache_config
)

In [12]:
from sagemaker.workflow.functions import Join

feature_store_ingestion = SKLearnProcessor(
    framework_version='1.0-1',
    role=role_arn,
    instance_type='ml.m5.2xlarge',
    instance_count=processing_instance_count,
    base_job_name='feature-store-ingestion',
    sagemaker_session=session
)

processor_args = feature_store_ingestion.run(
    code="./pipeline_scripts/feature-store/script.py",
    inputs=[
        ProcessingInput(
            source=preprocess_dataset_step.properties.ProcessingOutputConfig.Outputs[
                'train'
            ].S3Output.S3Uri,
            destination='/opt/ml/processing/input/train'
        ),
        ProcessingInput(
            source=preprocess_dataset_step.properties.ProcessingOutputConfig.Outputs[
                'validation'
            ].S3Output.S3Uri,
            destination='/opt/ml/processing/input/validation'
        ),
        ProcessingInput(
            source=preprocess_dataset_step.properties.ProcessingOutputConfig.Outputs[
                'test'
            ].S3Output.S3Uri,
            destination='/opt/ml/processing/input/test'
        )
    ],
    outputs=[
        ProcessingOutput(output_name="train", s3_upload_mode='Continuous', app_managed=True, feature_store_output = sagemaker.processing.FeatureStoreOutput(feature_group_name = train_feature_group_name)),
        ProcessingOutput(output_name="validation", s3_upload_mode='Continuous', app_managed=True, feature_store_output = sagemaker.processing.FeatureStoreOutput(feature_group_name = validation_feature_group_name)),
    ],  
    arguments=['--train_feature_group_name', train_feature_group_name,
                '--validation_feature_group_name', validation_feature_group_name,
                '--test_feature_group_name', test_feature_group_name,
                '--bucket_prefix', bucket_prefix,
                '--role_arn', role_arn,
                '--region', region,
    ]
)
 
feature_store_ingestion_step = ProcessingStep(
    name='FeatureStoreIngestion',
    step_args=processor_args,
    cache_config=cache_config
)
    
    
            



## Hyperparameter Tuning Step
Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.

This configuration should also look very similar to the SageMaker Training job you ran in notebook 2. The only new code is the `TuningStep` at the end of the second cell below, which lets us run the tuning job as a step in our pipeline.

You can learn more about [Hyperparameter Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html) in the SageMaker docs.
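To illustrate what a tuning job does at its core, here is a small, self-contained search over the same hyperparameter ranges this notebook uses. Two things are assumptions: the sketch uses plain random search (SageMaker's default strategy is Bayesian optimization), and `toy_rmse` is a made-up stand-in for the validation RMSE that a real training job would report.

```python
import random

# Ranges copied from this notebook's tuning configuration: (low, high, kind).
RANGES = {
    "max_depth": (1, 8, "int"),
    "eta": (0.2, 1.0, "float"),
    "min_child_weight": (0, 120, "int"),
    "subsample": (0.2, 1.0, "float"),
}


def sample(ranges, rng):
    """Draw one hyperparameter configuration uniformly from the ranges."""
    config = {}
    for name, (low, high, kind) in ranges.items():
        config[name] = rng.randint(low, high) if kind == "int" else rng.uniform(low, high)
    return config


def toy_rmse(config):
    """Hypothetical objective standing in for a training job's validation RMSE."""
    return 5000 + 300 * abs(config["max_depth"] - 4) + 2000 * abs(config["eta"] - 0.5)


def random_search(ranges, objective, max_jobs, seed=0):
    """Evaluate max_jobs random configurations and return the one with the lowest score."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_jobs):
        config = sample(ranges, rng)
        trials.append((objective(config), config))
    best_score, best_config = min(trials, key=lambda trial: trial[0])
    return best_score, best_config, trials


best_score, best_config, trials = random_search(RANGES, toy_rmse, max_jobs=6)
print(best_score, best_config)
```

The `max_jobs=6` budget matches the tuner configuration in this notebook; the objective is minimized, as with `validation:rmse`.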

In [13]:
from sagemaker.xgboost.estimator import XGBoost

hyperparams = {
    # Tuned hyperparameters
    "max_depth": "4",
    "eta": "0.810249",
    "min_child_weight": "79",
    "subsample": "0.984023",
    "objective": "reg:squarederror",
    # Training job params
    "train_feature_group_name": train_feature_group_name,
    "validation_feature_group_name": validation_feature_group_name,
    "role_arn": role_arn,
    "region": region,
}

train_instance_type = 'ml.c5.2xlarge'

estimator_parameters = {
    'source_dir': './pipeline_scripts/train/',
    'entry_point': 'script-pipeline.py',
    'framework_version': '1.7-1',
    'instance_type': train_instance_type,
    'instance_count': 1,
    'hyperparameters': hyperparams,
    'role': role_arn,
    'base_job_name': 'XGBoost-model',
    'output_path': f's3://{bucket}/{s3_prefix}/',
    'image_scope': 'training'
}

estimator = XGBoost(**estimator_parameters)

# A single TrainingStep is kept here for reference; the pipeline uses the TuningStep defined below instead.
# training_step = TrainingStep(
#     name='TrainXGBModel',
#     estimator=estimator,
#     cache_config=cache_config
# )
# training_step.add_depends_on([feature_store_ingestion_step])

In [14]:
hyperparameter_ranges = {
    "max_depth": IntegerParameter(1, 8),
    "eta": ContinuousParameter(0.2, 1),
    "min_child_weight": IntegerParameter(0, 120),
    "subsample": ContinuousParameter(0.2, 1),
}

objective_metric_name = 'validation:rmse'
objective_type = 'Minimize'
tuner_parameters = {
                    'estimator': estimator,
                    'objective_metric_name': objective_metric_name,
                    'hyperparameter_ranges': hyperparameter_ranges,
                    # 'metric_definitions': metric_definitions,
                    'max_jobs': 6,
                    'max_parallel_jobs': 4,
                    'objective_type': objective_type
                }
    
tuner = HyperparameterTuner(**tuner_parameters)

tuning_step = TuningStep(
    name="HPTuning",
    tuner=tuner,
    cache_config=cache_config,
)
tuning_step.add_depends_on([feature_store_ingestion_step])

#### Model evaluation step

After the tuning step in our pipeline, we'll want to evaluate the performance of the best model. To do that, we create a SageMaker Processing step that runs the evaluation code we specify (`./pipeline_scripts/evaluate/script.py`, written in the next cell). In this version, the script loads the model artifact produced by the tuning step, reads the validation metrics (`valid-mae` and `valid-rmse`) stored with it, and writes them to an `evaluation.json` report.
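For reference, the two metrics that appear in the evaluation report can be computed from predictions as shown in this minimal sketch. The `regression_report` helper and the sample values are made up for illustration; the nested dictionary follows the `regression_metrics` layout used by the evaluation script.

```python
import json
from math import sqrt


def regression_report(y_true, y_pred):
    """Compute MAE and RMSE and wrap them in the report layout used by the evaluation script."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mae = sum(abs(error) for error in errors) / len(errors)
    rmse = sqrt(sum(error * error for error in errors) / len(errors))
    return {
        "regression_metrics": {
            "mae": {"value": mae},
            "rmse": {"value": rmse},
        }
    }


# Made-up house prices and predictions, for illustration only.
report = regression_report([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
print(json.dumps(report, indent=2))
```

RMSE penalizes large errors more heavily than MAE, which is one reason it is a common choice for a quality gate.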

In [135]:
%%writefile ./pipeline_scripts/evaluate/script.py
import subprocess
import sys
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sagemaker', 'ray', 'modin[ray]', 'pydantic==1.10.10', 'xgboost_ray'])
import os
import time
import tarfile
import argparse
import json
import logging
import boto3
import sagemaker
import glob

import pathlib
import numpy as np
from math import sqrt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Sagemaker specific imports
from sagemaker.session import Session
from sagemaker.experiments.run import load_run
import modin.pandas as pd
# Ray specific imports
import ray
from ray.air.checkpoint import Checkpoint
from ray.train.xgboost import XGBoostCheckpoint, XGBoostPredictor
import ray.cloudpickle as cloudpickle

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

if __name__ == "__main__":
    logger.debug('Starting evaluation.')
    
    model_dir = '/opt/ml/processing/model/'
    for file in os.listdir(model_dir):
        logger.info(file)
    
    logger.debug('Loading model.')
    model_path = os.path.join(model_dir, 'model.tar.gz')
    # Open the .tar.gz file
    with tarfile.open(model_path, 'r:gz') as tar:
        # Extract all files to the model directory
        tar.extractall(path=model_dir)

    for file in os.listdir(model_dir):
        logger.debug(file)
                     
    # Load the serialized model data from a file
    with open(f'{model_dir}model.pkl', "rb") as f:
        serialized_model = f.read()

    # Deserialize the model using cloudpickle
    result = cloudpickle.loads(serialized_model)
    metrics = result.metrics

    # See the regression metrics
    # see: https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-model-quality-metrics.html
    logger.debug('extracting metrics.')
    mae = metrics['valid-mae']
    rmse = metrics['valid-rmse']
    report_dict = {
        'regression_metrics': {
            'mae': {
                'value': mae,
            },
            'rmse': {
                'value': rmse,
            },
        },
    }

    output_dir = '/opt/ml/processing/evaluation'
    pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)

    logger.info('Writing out evaluation report with rmse: %f', rmse)
    evaluation_path = f'{output_dir}/evaluation.json'
    with open(evaluation_path, 'w') as f:
        f.write(json.dumps(report_dict))

Overwriting ./pipeline_scripts/evaluate/script.py


In [136]:
evaluation_processor = SKLearnProcessor(
    framework_version='0.23-1',
    role=role_arn,
    instance_type='ml.m5.xlarge',
    instance_count=processing_instance_count,
    base_job_name='evaluation',
    sagemaker_session=session,
)

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3


In [137]:
# Specify where we'll store the model evaluation results so
# that other steps can access those results
evaluation_report = PropertyFile(
    name='EvaluationReport',
    output_name='evaluation',
    path='evaluation.json',
)

evaluation_step = ProcessingStep(
    name='EvaluateModel',
    processor=evaluation_processor,
    inputs=[
        ProcessingInput(
            source=tuning_step.get_top_model_s3_uri(
                top_k=0, s3_bucket=bucket, prefix=s3_prefix
            ),
            destination='/opt/ml/processing/model',
        ),
    ],
    outputs=[
        ProcessingOutput(
            output_name='evaluation', source='/opt/ml/processing/evaluation'
        ),
    ],
    code='./pipeline_scripts/evaluate/script.py',
    property_files=[evaluation_report],
)

In [138]:
model_metrics = ModelMetrics(
    model_statistics=MetricsSource(
        s3_uri='{}/evaluation.json'.format(
            evaluation_step.arguments['ProcessingOutputConfig']['Outputs'][0]['S3Output'][
                'S3Uri'
            ]
        ),
        content_type='application/json',
    )
)

best_model = Model(
    image_uri=estimator.training_image_uri(),
    model_data=tuning_step.get_top_model_s3_uri(top_k=0, s3_bucket=bucket, prefix=s3_prefix),
    source_dir=estimator.source_dir,
    entry_point=estimator.entry_point,
    role=role_arn,
    sagemaker_session=session
)

model_registry_args = best_model.register(
    content_types=['text/csv'],
    response_types=['application/json'],
    inference_instances=['ml.t2.medium', 'ml.m5.xlarge'],
    transform_instances=['ml.m5.xlarge'],
    model_package_group_name=model_package_group_name,
    approval_status='PendingManualApproval',
    model_metrics=model_metrics
)

register_step = ModelStep(
    name='RegisterTrainedModel',
    step_args=model_registry_args
)



In [139]:
from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.functions import Join

metrics_fail_step = FailStep(
    name="RMSEFail",
    error_message=Join(on=" ", values=["Execution failed due to RMSE >", rmse_threshold]),
)

# Condition step for evaluating model quality and branching execution
cond_lte = ConditionLessThanOrEqualTo(
    left=JsonGet(
        step_name=evaluation_step.name,
        property_file=evaluation_report,
        json_path='regression_metrics.rmse.value',
    ),
    right=rmse_threshold,
)
condition_step = ConditionStep(
    name='CheckEvaluation',
    conditions=[cond_lte],
    if_steps=[register_step],
    else_steps=[metrics_fail_step],
)

In [140]:
# pipeline_name = 'synthetic-housing-training-pipeline-{}'.format(strftime('%d-%H-%M-%S', gmtime()))
pipeline_name = 'synthetic-housing-training-pipeline-ray'
step_list = [
             preprocess_dataset_step,
             feature_store_ingestion_step,
             tuning_step,
             #training_step,
             evaluation_step,
             condition_step
             #register_step
            ]

training_pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        processing_instance_count,
        #train_feature_group_name,
        #validation_feature_group_name,
        #test_feature_group_name,
        bucket_prefix,
        rmse_threshold
    ],
    steps=step_list
)

# Note: If an existing pipeline has the same name it will be overwritten.
training_pipeline.upsert(role_arn=role_arn)

# Viewing the pipeline definition with all the string variables interpolated can help debug the pipeline. It is commented out here due to its length.
#json.loads(training_pipeline.definition())



Using provided s3_resource








{'PipelineArn': 'arn:aws:sagemaker:us-east-1:523914011708:pipeline/synthetic-housing-training-pipeline-ray',
 'ResponseMetadata': {'RequestId': 'e6bc05be-1350-4604-8b18-abff1421372b',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'e6bc05be-1350-4604-8b18-abff1421372b',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '107',
   'date': 'Tue, 11 Jul 2023 19:22:21 GMT'},
  'RetryAttempts': 0}}

In [141]:
json.loads(training_pipeline.definition())



Using provided s3_resource






{'Version': '2020-12-01',
 'Metadata': {},
 'Parameters': [{'Name': 'ProcessingInstanceCount',
   'Type': 'Integer',
   'DefaultValue': 1},
  {'Name': 'bucket_prefix',
   'Type': 'String',
   'DefaultValue': 'aws-ray-mlops-workshop/feature-store'},
  {'Name': 'RMSEThreshold', 'Type': 'Float', 'DefaultValue': 15000.0}],
 'PipelineExperimentConfig': {'ExperimentName': {'Get': 'Execution.PipelineName'},
  'TrialName': {'Get': 'Execution.PipelineExecutionId'}},
 'Steps': [{'Name': 'PreprocessData',
   'Type': 'Processing',
   'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': 'ml.m5.xlarge',
      'InstanceCount': {'Get': 'Parameters.ProcessingInstanceCount'},
      'VolumeSizeInGB': 30}},
    'AppSpecification': {'ImageUri': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:1.0-1-cpu-py3',
     'ContainerEntrypoint': ['python3',
      '/opt/ml/processing/input/code/script.py']},
    'RoleArn': 'arn:aws:iam::523914011708:role/service-role/AmazonSageM

In [142]:
# This is where you could optionally override parameter defaults 
execution = training_pipeline.start()

In [143]:
execution.describe()

{'PipelineArn': 'arn:aws:sagemaker:us-east-1:523914011708:pipeline/synthetic-housing-training-pipeline-ray',
 'PipelineExecutionArn': 'arn:aws:sagemaker:us-east-1:523914011708:pipeline/synthetic-housing-training-pipeline-ray/execution/sf5ubdjeq3m1',
 'PipelineExecutionDisplayName': 'execution-1689103343183',
 'PipelineExecutionStatus': 'Executing',
 'CreationTime': datetime.datetime(2023, 7, 11, 19, 22, 23, 9000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2023, 7, 11, 19, 22, 23, 9000, tzinfo=tzlocal()),
 'CreatedBy': {'UserProfileArn': 'arn:aws:sagemaker:us-east-1:523914011708:user-profile/d-zbtbfrmc31iz/user-1',
  'UserProfileName': 'user-1',
  'DomainId': 'd-zbtbfrmc31iz'},
 'LastModifiedBy': {'UserProfileArn': 'arn:aws:sagemaker:us-east-1:523914011708:user-profile/d-zbtbfrmc31iz/user-1',
  'UserProfileName': 'user-1',
  'DomainId': 'd-zbtbfrmc31iz'},
 'ResponseMetadata': {'RequestId': '769bb605-dfcb-42b6-8f63-a3def41328f1',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {

In [40]:
execution.wait()

WaiterError: Waiter PipelineExecutionComplete failed: Waiter encountered a terminal failure state: For expression "PipelineExecutionStatus" we matched expected path: "Failed"

List the steps in the execution. These are the steps in the pipeline that have been resolved by the step executor service.

In [None]:
execution.list_steps()

### Examining the Evaluation

Examine the resulting model evaluation after the pipeline completes. Download the resulting `evaluation.json` file from S3 and print the report.
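The report is also what the pipeline's condition step inspects: it reads the value at `regression_metrics.rmse.value` and compares it with the RMSE threshold. The sketch below mimics that check in plain Python. `json_get` and `passes_quality_gate` are hypothetical helpers that only handle dot-separated keys, and the sample values are rounded from the metrics shown later in this notebook.

```python
import json


def json_get(document, json_path):
    """Walk a dot-separated path through nested dictionaries."""
    value = document
    for key in json_path.split("."):
        value = value[key]
    return value


def passes_quality_gate(report, threshold, json_path="regression_metrics.rmse.value"):
    """Return True when the reported RMSE is at or below the threshold."""
    return json_get(report, json_path) <= threshold


sample_report = json.loads(
    '{"regression_metrics": {"mae": {"value": 5261.3}, "rmse": {"value": 7024.5}}}'
)

print(passes_quality_gate(sample_report, threshold=15000.0))
print(passes_quality_gate(sample_report, threshold=5000.0))
```

With the default threshold of 15000.0 the model would be registered; with a threshold of 5000.0 the pipeline would take the failure branch.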

In [None]:
from pprint import pprint


evaluation_json = sagemaker.s3.S3Downloader.read_file(
    "{}/evaluation.json".format(
        evaluation_step.arguments["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"]
    )
)
pprint(json.loads(evaluation_json))

### Lineage

Review the lineage of the artifacts generated by the pipeline.

In [None]:
import time
from sagemaker.lineage.visualizer import LineageTableVisualizer


viz = LineageTableVisualizer(sagemaker.session.Session())
for execution_step in reversed(execution.list_steps()):
    print(execution_step)
    display(viz.show(pipeline_execution_step=execution_step))
    time.sleep(5)

In [None]:
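# Clean up: delete the pipelines created by this workshop.
from botocore.exceptions import ClientError

go = True  # Set to False to skip the cleanup
pipelines = [
    "synthetic-housing-training-pipeline-ray",
    "synthetic-housing-training-pipeline-09-12-47-53",
    "synthetic-housing-training-pipeline-09-13-45-45",
]

for pipeline_name in pipelines:
    if go:
        try:
            print(f"Deleting pipeline {pipeline_name}")
            sagemaker_client.delete_pipeline(PipelineName=pipeline_name)
        except ClientError as error:
            # For example, the pipeline may not exist; report it and continue.
            print(f"Could not delete {pipeline_name}: {error}")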


In [None]:
script = "s3://sagemaker-us-east-1-523914011708/EvaluateModel-19ecf21a25720cfb701ca57f9e01cfa4/input/code/script.py"
test = "s3://sagemaker-us-east-1-523914011708/aws-sm-ray-workshop/data/sm_processed/test"
model = "s3://sagemaker-us-east-1-523914011708/aws-sm-ray-workshop/pipelines-catx4geg05zv-TrainXGBModel-wnOd8iMBW9/output/model.tar.gz"
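# Stop "Run All" here: the remaining cells are scratch work for debugging the evaluation step locally.
raise RuntimeError('Stopping here; run the cells below manually if needed.')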


In [100]:
import subprocess
import sys
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sagemaker', 'ray', 'modin[ray]', 'pydantic==1.10.10'])
import os
import time
import tarfile
import argparse
import json
import logging
import boto3
import sagemaker
import glob

import pickle
import pathlib
import numpy as np
from math import sqrt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Sagemaker specific imports
from sagemaker.session import Session
from sagemaker.experiments.run import load_run
import modin.pandas as pd
# Ray specific imports
import ray
from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig
from ray.data import Dataset
from ray.air.result import Result
from ray.air.checkpoint import Checkpoint





In [101]:
train_feature_group_name = 'fs-train-synthetic-housing-data'
validation_feature_group_name = 'fs-val-synthetic-housing-data'
test_feature_group_name = 'fs-test-synthetic-housing-data'

In [102]:
def load_data():
    fs_group = FeatureGroup(name=test_feature_group_name, sagemaker_session=session)  
    query = fs_group.athena_query()
    table = query.table_name
    query_string = f'SELECT {features_to_select} FROM "sagemaker_featurestore"."{table}" ORDER BY record_id'
    query_results = 'sagemaker-featurestore'
    output_location = f's3://{bucket}/{query_results}/query_results/'
    query.run(
        query_string=query_string, 
        output_location=output_location
    )
    query.wait()
    df = query.as_dataframe()
    return df

In [108]:
from xgboost import DMatrix
import ray.cloudpickle as cloudpickle

print('Starting evaluation.')
#ray.init()  
#model_dir = '/opt/ml/processing/model'
model_dir = './common/'
model_path = os.path.join(model_dir, 'model.tar.gz')
print(model_path)

# Open the .tar.gz file
with tarfile.open(model_path, 'r:gz') as tar:
    # Extract all files
    tar.extractall(path=model_dir)

# Optional: Print the list of extracted files
with tarfile.open(model_path, 'r:gz') as tar:
    file_names = tar.getnames()
    print("Extracted files:")
    for name in file_names:
        print(name)

df = load_data()    
y_test = df.iloc[:, 0].to_numpy()
df.drop(df.columns[0], axis=1, inplace=True)
X_test = df.to_numpy()

# Load the serialized model data from a file
with open(f'{model_dir}model.pkl', "rb") as f:
    serialized_model = f.read()

# Deserialize the model using cloudpickle
result = cloudpickle.loads(serialized_model)

#checkpoint = ray.train.xgboost.XGBoostCheckpoint.from_directory(f'{model_dir}model.pkl/')
predictor = ray.train.xgboost.XGBoostPredictor.from_checkpoint(result.checkpoint)
predictions = predictor.predict(X_test)
result.metrics
# model = checkpoint.get_model()
# Convert DataFrame to DMatrix
#dtest = DMatrix(df, label=y_test)
#model.eval(dtest)

Starting evaluation.
./common/model.tar.gz
Extracted files:
model.pkl


INFO:sagemaker:Query 7e37b0f9-08d8-40c3-a571-0d05a9c7ffb6 is being executed.
INFO:sagemaker:Query 7e37b0f9-08d8-40c3-a571-0d05a9c7ffb6 successfully executed.


{'train-mae': 3891.672102539063,
 'train-rmse': 5032.198575273299,
 'valid-mae': 5261.345394222348,
 'valid-rmse': 7024.516821349464,
 'time_this_iter_s': 0.02758312225341797,
 'should_checkpoint': True,
 'done': True,
 'training_iteration': 101,
 'trial_id': '97470_00000',
 'date': '2023-07-11_16-33-02',
 'timestamp': 1689093182,
 'time_total_s': 3.961212158203125,
 'pid': 945,
 'hostname': 'ip-10-0-237-52.ec2.internal',
 'node_ip': '10.0.237.52',
 'config': {},
 'time_since_restore': 3.961212158203125,
 'iterations_since_restore': 101,
 'experiment_tag': '0'}

In [109]:
predictions

array([421312.72, 421312.72, 397031.66, ..., 210423.03, 279956.5 ,
       486753.7 ], dtype=float32)

In [110]:
y_test

array([271801, 271801, 391223, ..., 395752, 194462, 428049])

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mse = mean_squared_error(y_test, predictions)
rmse = sqrt(mse)

In [None]:
rmse

In [None]:
# Function to download a model artifact from the SageMaker Model Registry
from urllib.parse import urlparse


def download_model(model_name, download_path='./common/model.tar.gz'):
    # Create a Boto3 SageMaker client
    sagemaker_client = boto3.client('sagemaker')

    # Get the details of the model package (model_name is the model package name or ARN)
    response = sagemaker_client.describe_model_package(
        ModelPackageName=model_name
    )

    # Retrieve the S3 URI of the model artifact
    model_data_url = response['InferenceSpecification']['Containers'][0]['ModelDataUrl']
    print(model_data_url)

    # download_file expects a bucket name and an object key, so split the s3:// URI
    parsed = urlparse(model_data_url)
    s3_client = boto3.client('s3')
    s3_client.download_file(parsed.netloc, parsed.path.lstrip('/'), download_path)

    print("Model downloaded successfully.")
    return download_path