<h1>Model Build Workflow</h1>

In this notebook we will show how to define a model build workflow that orchestrates the previous steps (processing, training) and registers models in the SageMaker Model Registry. We will use Amazon SageMaker Pipelines for the workflow orchestration and lineage.

Orchestrating and automating the model build workflow is a prerequisite for any ML CI/CD setup, because CI/CD automation must be able to execute the steps that produce a model, and those steps vary by use case. Typically, the "build" stage of a CI/CD pipeline executes a workflow that a Data Scientist has previously defined.

Amazon SageMaker Pipelines supports a pipeline Domain Specific Language (DSL), which is a declarative JSON specification. This DSL defines a Directed Acyclic Graph (DAG) of pipeline parameters and SageMaker job steps. The SageMaker Python SDK streamlines the generation of the pipeline DSL using constructs that are already familiar to engineers and scientists alike.

SageMaker Model Registry is where trained models are stored, versioned, and managed. Data Scientists and Machine Learning Engineers can compare model versions, approve models for deployment, and deploy models from different AWS accounts, all from a single Model Registry.

Let's define the variables first.

In [None]:
# Check SageMaker Python SDK version
import sagemaker
print(sagemaker.__version__)

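# packaging is installed as a dependency of the SageMaker Python SDK
from packaging import version

# Compare parsed versions so that strings such as '2.23.0.dev0' do not break the check
if version.parse(sagemaker.__version__) < version.parse('2.22.0'):
    raise Exception("This notebook requires at least SageMaker Python SDK version 2.22.0. Please install it via pip.")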

In [None]:
import boto3
import time

role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.Session()
bucket_name = sagemaker_session.default_bucket()
prefix = 'endtoendmlsm'

print(region)
print(role)
print(bucket_name)

<h2>Define Pipeline</h2>

In this section, we will define a model build workflow for the pre-processing and training operations that we have executed manually in the previous notebooks. The workflow definition will also include steps to register models in the SageMaker model registry.

Our objective is to define the pipeline shown in the diagram below:

<img src="./workflow.png" />

The pipeline will execute the following steps:
<ul>
    <li>Run a SM Processing job to execute data preparation and generate a featurizer model</li>
    <ul>
        <li>Register the featurizer model in the SM Model Registry</li>
        <li>Run a SM Training job to train the XGBoost model</li>
        <ul><li>Register the XGBoost model in the SM Model Registry</li></ul>
    </ul>
</ul>

Note: when custom inference logic is required, SM automatically adds the repack model steps to convert the models into a format suitable for the SM Model Registry.

<h3>Pipeline parameters</h3>

We define workflow parameters to parametrize the pipeline, so that the values used in pipeline executions and schedules can vary without modifying the pipeline definition.

The supported parameter types include:

* `ParameterString` - representing a `str` Python type
* `ParameterInteger` - representing an `int` Python type
* `ParameterFloat` - representing a `float` Python type

In [None]:
from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
    ParameterFloat,
)

# ---------------------
# Processing parameters
# ---------------------

# The path to the raw data.
raw_data_path = 's3://{0}/{1}/data/raw/'.format(bucket_name, prefix)
raw_data_path_param = ParameterString(name="raw_data_path", default_value=raw_data_path)

# The output path to the training data.
train_data_path = 's3://{0}/{1}/data/preprocessed/train/'.format(bucket_name, prefix)
train_data_path_param = ParameterString(name="train_data_path", default_value=train_data_path)

# The output path to the validation data.
val_data_path = 's3://{0}/{1}/data/preprocessed/val/'.format(bucket_name, prefix)
val_data_path_param = ParameterString(name="val_data_path", default_value=val_data_path)

# The output path to the featurizer model.
model_path = 's3://{0}/{1}/output/sklearn/'.format(bucket_name, prefix)
model_path_param = ParameterString(name="model_path", default_value=model_path)

# The instance type for the processing job.
processing_instance_type_param = ParameterString(name="processing_instance_type", default_value='ml.m5.large')

# The instance count for the processing job.
processing_instance_count_param = ParameterInteger(name="processing_instance_count", default_value=1)

# The train/test split ratio parameter.
train_test_split_ratio_param = ParameterString(name="train_test_split_ratio", default_value='0.2')

# -------------------
# Training parameters
# -------------------
        
# XGB hyperparameters.
max_depth_param = ParameterInteger(name="max_depth", default_value=3)
eta_param = ParameterFloat(name="eta", default_value=0.1)
gamma_param = ParameterInteger(name="gamma", default_value=6)
min_child_weight_param = ParameterInteger(name="min_child_weight", default_value=6)
objective_param = ParameterString(name="objective", default_value='reg:logistic')
num_round_param = ParameterInteger(name="num_round", default_value=20)

# The instance type for the training job.
training_instance_type_param = ParameterString(name="training_instance_type", default_value='ml.m5.xlarge')

# The instance count for the training job.
training_instance_count_param = ParameterInteger(name="training_instance_count", default_value=1)

# The training output path for the model.
output_path = 's3://{0}/{1}/output/'.format(bucket_name, prefix)
output_path_param = ParameterString(name="output_path", default_value=output_path)

# --------------------------
# Register models parameters
# --------------------------

# The default instance type for deployment.
deploy_instance_type_param = ParameterString(name="deploy_instance_type", default_value='ml.m5.2xlarge')

# The approval status for models added to the registry.
model_approval_status_param = ParameterString(name="model_approval_status", default_value='PendingManualApproval')


<h3>Processing Step</h3>

Now, we can start by defining the processing step that will prepare our dataset, as seen in module 02_data_exploration_and_feature_eng.

In [None]:
!pygmentize ../02_data_exploration_and_feature_eng/source_dir/preprocessor.py

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(role=role,
                                     instance_type=processing_instance_type_param,
                                     instance_count=processing_instance_count_param,
                                     framework_version='0.20.0')

inputs = [ProcessingInput(input_name='raw_data', 
                          source=raw_data_path_param, destination='/opt/ml/processing/input')]

outputs = [ProcessingOutput(output_name='train_data', 
                            source='/opt/ml/processing/train', destination=train_data_path_param),
           ProcessingOutput(output_name='val_data', 
                            source='/opt/ml/processing/val', destination=val_data_path_param),
           ProcessingOutput(output_name='model', 
                            source='/opt/ml/processing/model', destination=model_path_param)]

code_path = '../02_data_exploration_and_feature_eng/source_dir/preprocessor.py'

In [None]:
from sagemaker.workflow.steps import ProcessingStep

processing_step = ProcessingStep(
    name='Processing', 
    code=code_path,
    processor=sklearn_processor,
    inputs=inputs,
    outputs=outputs,
    job_arguments=['--train-test-split-ratio', train_test_split_ratio_param]
)

print(processing_step)

<h3>Training Step</h3>

In [None]:
!pygmentize ../03_train_model/source_dir/training.py

In [None]:
from sagemaker.xgboost import XGBoost

hyperparameters = {
    "max_depth": max_depth_param,
    "eta": eta_param,
    "gamma": gamma_param,
    "min_child_weight": min_child_weight_param,
    "silent": 0,
    "objective": objective_param,
    "num_round": num_round_param
}

entry_point='training.py'
source_dir='../03_train_model/source_dir/'
code_location = 's3://{0}/{1}/code'.format(bucket_name, prefix)

estimator = XGBoost(
    entry_point=entry_point,
    source_dir=source_dir,
    output_path=output_path_param,
    code_location=code_location,
    hyperparameters=hyperparameters,
    instance_type=training_instance_type_param,
    instance_count=training_instance_count_param,
    framework_version="0.90-2",
    py_version="py3",
    role=role
)

In [None]:
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

training_step = TrainingStep(
    name='Training',
    estimator=estimator,
    inputs={
        'train': TrainingInput(
            s3_data=processing_step.properties.ProcessingOutputConfig.Outputs[
                'train_data'
            ].S3Output.S3Uri,
            content_type='text/csv'
        ),
        'validation': TrainingInput(
            s3_data=processing_step.properties.ProcessingOutputConfig.Outputs[
                'val_data'
            ].S3Output.S3Uri,
            content_type='text/csv'
        )      
    },
)

print(training_step)

<h3>Register Model Steps</h3>

<h4>Featurizer Model</h4>

In [None]:
model_package_group_name_featurizer = 'end-to-end-ml-sagemaker-sklearn-featurizer'

In [None]:
inference_image_uri = sagemaker.image_uris.retrieve(
    framework='sklearn',
    region=region,
    version='0.20.0',
    py_version='py3',
    instance_type=deploy_instance_type_param,
    image_scope='inference'
)
print(inference_image_uri)

In [None]:
import os
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.sklearn.estimator import SKLearn

dummy_estimator = SKLearn(sagemaker_session=sagemaker_session,
                          entry_point='inference.py',
                          source_dir='../04_deploy_model/sklearn_source_dir',
                          image_uri=inference_image_uri,
                          role=role,
                          instance_type=deploy_instance_type_param,
                          instance_count=1)

register_featurizer_step = RegisterModel(
    name='RegisterFeaturizerModel',
    estimator=dummy_estimator,
    entry_point='inference.py',
    source_dir='../04_deploy_model/sklearn_source_dir',
    image_uri=inference_image_uri,
    model_data=processing_step.properties.ProcessingOutputConfig.Outputs['model'].S3Output.S3Uri,
    content_types=['text/csv'],
    response_types=['application/json', 'text/csv'],
    inference_instances=[deploy_instance_type_param],
    transform_instances=['ml.c5.4xlarge'],
    model_package_group_name=model_package_group_name_featurizer,
    approval_status=model_approval_status_param
)

<h4>XGBoost Model</h4>

In [None]:
model_package_group_name_xgboost = 'end-to-end-ml-sagemaker-xgboost'

In [None]:
inference_image_uri=sagemaker.image_uris.retrieve(
    framework='xgboost',
    region=region,
    version='0.90-2',
    py_version='py3',
    instance_type=deploy_instance_type_param,
    image_scope='inference'
)
print(inference_image_uri)

In [None]:
register_xgboost_step=RegisterModel(
    name='RegisterXGBoostModel',
    estimator=estimator,
    entry_point='inference.py',
    source_dir='../04_deploy_model/xgboost_source_dir',
    image_uri=inference_image_uri,
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=['text/csv', 'application/json'],
    response_types=['text/csv', 'application/json'],
    inference_instances=[deploy_instance_type_param],
    transform_instances=['ml.c5.4xlarge'],
    model_package_group_name=model_package_group_name_xgboost,
    approval_status=model_approval_status_param
)

<h3>Pipeline</h3>

In [None]:
from sagemaker.workflow.pipeline import Pipeline

pipeline_name = 'end-to-end-ml-sagemaker-pipeline'

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        raw_data_path_param,
        train_data_path_param,
        val_data_path_param,
        model_path_param,
        processing_instance_type_param,
        processing_instance_count_param,
        train_test_split_ratio_param,
        max_depth_param,
        eta_param,
        gamma_param,
        min_child_weight_param,
        objective_param,
        num_round_param,
        training_instance_type_param,
        training_instance_count_param,
        output_path_param,
        deploy_instance_type_param,
        model_approval_status_param
    ],
    steps=[processing_step, training_step, register_featurizer_step, register_xgboost_step],
    sagemaker_session=sagemaker_session,
)

In [None]:
import json
definition = json.loads(pipeline.definition())
definition
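
The parsed definition can also be inspected programmatically, for example to see which steps depend on the output of other steps in the DAG. The cell below is a minimal sketch, not part of the original workflow: `step_dependencies` is a hypothetical helper, and it assumes that property references are serialized as `{"Get": "Steps.<StepName>..."}` objects, which should be verified against the definition printed above. The `sample_definition` dictionary is a hand-written, heavily simplified stand-in for a real definition.

```python
import re


def step_dependencies(definition):
    """Map each step name to the set of step names whose properties it references."""
    # Assumption: references to other steps look like {"Get": "Steps.<StepName>...."}
    pattern = re.compile(r"^Steps\.([^.\[]+)")

    def find_refs(node, found):
        # Walk nested dicts and lists, collecting referenced step names
        if isinstance(node, dict):
            ref = node.get("Get")
            if isinstance(ref, str):
                match = pattern.match(ref)
                if match:
                    found.add(match.group(1))
            for value in node.values():
                find_refs(value, found)
        elif isinstance(node, list):
            for item in node:
                find_refs(item, found)
        return found

    return {
        step["Name"]: find_refs(step.get("Arguments", {}), set())
        for step in definition.get("Steps", [])
    }


# Simplified, hand-written example (not the output of pipeline.definition())
sample_definition = {
    "Steps": [
        {
            "Name": "Processing",
            "Type": "Processing",
            "Arguments": {"S3Uri": {"Get": "Parameters.raw_data_path"}},
        },
        {
            "Name": "Training",
            "Type": "Training",
            "Arguments": {
                "S3Uri": {
                    "Get": "Steps.Processing.ProcessingOutputConfig.Outputs['train_data'].S3Output.S3Uri"
                }
            },
        },
    ]
}

sample_dependencies = step_dependencies(sample_definition)
print(sample_dependencies)
```

Under the same assumption, calling `step_dependencies(definition)` on the dictionary loaded above would list, for each step, the steps it reads from.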

<h2>Insert and Execute the pipeline</h2>

In [None]:
response = pipeline.upsert(role_arn=role)

pipeline_arn = response["PipelineArn"]
print(pipeline_arn)

In [None]:
execution = pipeline.start()
print(execution.arn)
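
The execution above uses the default value of every pipeline parameter. To vary some values without changing the pipeline definition, `Pipeline.start` accepts a `parameters` dictionary keyed by parameter name. The cell below is a minimal sketch: `start_with_overrides` is a hypothetical helper, not part of the SDK, which rejects names that are not declared on the pipeline before starting the execution.

```python
def start_with_overrides(pipeline, **overrides):
    """Start a pipeline execution, overriding only the given parameters.

    `pipeline` is expected to expose `parameters` (objects with a `name`)
    and `start(parameters=...)`, as sagemaker.workflow.pipeline.Pipeline does.
    """
    known = {p.name for p in pipeline.parameters}
    unknown = sorted(set(overrides) - known)
    if unknown:
        raise ValueError("Unknown pipeline parameters: {0}".format(", ".join(unknown)))
    # Parameters that are not overridden keep their default values
    return pipeline.start(parameters=overrides)
```

For example, `start_with_overrides(pipeline, max_depth=5, training_instance_type='ml.m5.2xlarge')` would start a second execution with a deeper tree and a larger training instance.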

<h3>Wait for pipeline execution</h3>

In [None]:
%%time
execution.wait()

While waiting for the pipeline execution to complete (it takes about 10 minutes), feel free to use the left side panel in SageMaker Studio to review the pipeline definition and execution status.
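
The execution status can also be reviewed from code. The cell below is a minimal sketch with two hypothetical helpers that condense the list returned by `execution.list_steps()`. It relies on the `StepName`, `StepStatus` and `FailureReason` keys of each step entry; `sample_steps` is made-up data for illustration only.

```python
def summarize_steps(steps):
    """Return {step name: status} for the entries returned by execution.list_steps()."""
    return {step["StepName"]: step["StepStatus"] for step in steps}


def failed_steps(steps):
    """Return {step name: failure reason} for the steps that failed."""
    return {
        step["StepName"]: step.get("FailureReason", "")
        for step in steps
        if step["StepStatus"] == "Failed"
    }


# Made-up example data, shaped like entries of execution.list_steps()
sample_steps = [
    {"StepName": "Processing", "StepStatus": "Succeeded"},
    {"StepName": "Training", "StepStatus": "Failed", "FailureReason": "example reason"},
]

print(summarize_steps(sample_steps))
print(failed_steps(sample_steps))
```

Once the execution has finished, `summarize_steps(execution.list_steps())` gives a quick overview of the outcome of each step.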

<h2>Approve models in registry</h2>

In [None]:
steps = execution.list_steps()

# Use new variable names so that the RegisterModel step object defined above is not overwritten
register_sklearn_step_info = next(s for s in steps if s['StepName'] == 'RegisterFeaturizerModel')
register_xgboost_step_info = next(s for s in steps if s['StepName'] == 'RegisterXGBoostModel')

sklearn_model_package_arn = register_sklearn_step_info['Metadata']['RegisterModel']['Arn']
xgboost_model_package_arn = register_xgboost_step_info['Metadata']['RegisterModel']['Arn']

print(sklearn_model_package_arn)
print(xgboost_model_package_arn)

In [None]:
sm_client = boto3.client('sagemaker')

sm_client.update_model_package(
    ModelPackageArn=sklearn_model_package_arn,
    ModelApprovalStatus="Approved",
)

In [None]:
sm_client.update_model_package(
    ModelPackageArn=xgboost_model_package_arn,
    ModelApprovalStatus="Approved",
)
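
In an automated deployment stage, the execution object is usually not available, so the model package ARN has to be looked up in the registry instead. The cell below is a minimal sketch of that lookup using the `list_model_packages` operation of the SageMaker client; `latest_approved_model_package_arn` is a hypothetical helper name.

```python
def latest_approved_model_package_arn(sm_client, model_package_group_name):
    """Return the ARN of the most recently created approved model package, or None."""
    response = sm_client.list_model_packages(
        ModelPackageGroupName=model_package_group_name,
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response["ModelPackageSummaryList"]
    return summaries[0]["ModelPackageArn"] if summaries else None
```

For example, `latest_approved_model_package_arn(sm_client, model_package_group_name_xgboost)` should return the ARN of the XGBoost model package approved above, provided no newer approved version exists in the group.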

<h2>Deploy real-time endpoint from models in the registry</h2>

In [None]:
sklearn_mp_response = sm_client.describe_model_package(ModelPackageName = sklearn_model_package_arn)
xgboost_mp_response = sm_client.describe_model_package(ModelPackageName = xgboost_model_package_arn)

sklearn_container = sklearn_mp_response['InferenceSpecification']['Containers'][0]['Image']
sklearn_model_data = sklearn_mp_response['InferenceSpecification']['Containers'][0]['ModelDataUrl']
print(sklearn_container)
print(sklearn_model_data)
print()

xgboost_container = xgboost_mp_response['InferenceSpecification']['Containers'][0]['Image']
xgboost_model_data = xgboost_mp_response['InferenceSpecification']['Containers'][0]['ModelDataUrl']
print(xgboost_container)
print(xgboost_model_data)

In [None]:
# Keep everything up to and including the last '/' of each model artifact URI
sklearn_model_path = sklearn_model_data[0:sklearn_model_data.rfind('/')] + '/'
xgboost_model_path = xgboost_model_data[0:xgboost_model_data.rfind('/')] + '/'
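
The cell above derives the S3 prefix that contains each model artifact, so that the inference source code can be uploaded next to it. The same operation can be written as a small reusable function, sketched below; `s3_parent_prefix` is a hypothetical helper and the URI is a made-up example.

```python
def s3_parent_prefix(s3_uri):
    """Return the S3 URI up to and including the last '/' (the containing prefix)."""
    return s3_uri.rsplit("/", 1)[0] + "/"


# Made-up example URI, for illustration only
example_uri = "s3://example-bucket/endtoendmlsm/output/model.tar.gz"
example_prefix = s3_parent_prefix(example_uri)
print(example_prefix)
```

Using one function for both models avoids mixing up the two URIs when computing the slice index.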

In [None]:
!tar -cvzf sklearn_sourcedir.tar.gz -C ../04_deploy_model/sklearn_source_dir/ .
!aws s3 cp sklearn_sourcedir.tar.gz {sklearn_model_path}
!tar -cvzf xgboost_sourcedir.tar.gz -C ../04_deploy_model/xgboost_source_dir/ .
!aws s3 cp xgboost_sourcedir.tar.gz {xgboost_model_path}

In [None]:
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel

sklearn_model = Model(image_uri = sklearn_container,
                      model_data = sklearn_model_data,
                      env = {
                          'SAGEMAKER_PROGRAM' : 'inference.py',
                          'SAGEMAKER_SUBMIT_DIRECTORY' : sklearn_model_path + 'sklearn_sourcedir.tar.gz',
                      },
                      role = role,
                      sagemaker_session = sagemaker_session) 

xgboost_model = Model(image_uri = xgboost_container,
                      model_data = xgboost_model_data,
                      env = {
                          'SAGEMAKER_PROGRAM' : 'inference.py',
                          'SAGEMAKER_SUBMIT_DIRECTORY' : xgboost_model_path + 'xgboost_sourcedir.tar.gz',
                      },
                      role = role,
                      sagemaker_session = sagemaker_session)

pipeline_model_name = 'end-to-end-ml-sm-xgb-skl-pipeline-{0}'.format(str(int(time.time())))

pipeline_model = PipelineModel(
    name=pipeline_model_name, 
    role=role,
    models=[
        sklearn_model, 
        xgboost_model],
    sagemaker_session=sagemaker_session)

endpoint_name = 'end-to-end-ml-sm-pipeline-endpoint-{0}'.format(str(int(time.time())))
print(endpoint_name)

pipeline_model.deploy(initial_instance_count=1, 
                      instance_type='ml.m5.2xlarge', 
                      endpoint_name=endpoint_name)

<h3>Execute inference</h3>

In [None]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.predictor import Predictor

predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer())

payload = "TID008,HAWT,64,80,46,21,55,55,7,34,SE"
print(predictor.predict(payload))

Finally, we can clean up resources.

In [None]:
predictor.delete_endpoint()