# Amazon SageMaker model training & tuning pipeline

***This notebook works best with the `Data Science 3.0` kernel on an `ml.t3.medium` instance type***.

Customers can use SageMaker Pipelines to build scalable machine learning pipelines that preprocess data and train machine learning models. With SageMaker Pipelines, customers have a toolkit for every part of the machine learning lifecycle that provides deep customizations and tuning options to fit every organization. Customers have the freedom to customize SageMaker Pipelines to specific use cases, but also to create generic machine learning pipelines that can be reused across different use cases.

From a bird's-eye view, a machine learning pipeline usually consists of three general steps: a preprocessing step where the data is transformed, a training step where a machine learning model is trained, and an evaluation step that tests the performance of the trained model. If the model performs well on the objective metric you're optimizing for, it becomes a candidate model for deployment to one or more environments. Candidate models should be registered in SageMaker Model Registry to catalog and store key metadata for each model version.

--- 

These steps have a lot in common, even across different machine learning use cases. Customers who want training pipelines that can be reused across an organization can use SageMaker Pipelines to create parameterized, generic training pipelines. Parameters let customers supply values to the pipeline at execution time without changing the pipeline code itself.

**This notebook** demonstrates how SageMaker Pipelines can be used to string together a sequence of data processing, model training, tuning and evaluation steps to train a binary classification machine learning model using [`scikit-learn`](https://pypi.org/project/scikit-learn/). The trained model can then be used for batch inference (see [`1_batch_transform_pipeline`](./1_batch_transform_pipeline.ipynb)) or hosted on a SageMaker endpoint for real-time inference (see [`2_realtime_inference`](./2_realtime_inference.ipynb)).


### SageMaker Pipelines
Amazon SageMaker Pipelines is a purpose-built, easy-to-use CI/CD service for machine learning. With SageMaker Pipelines, customers can create machine learning workflows with an easy-to-use Python SDK, and then visualize and manage workflows using Amazon SageMaker Studio.


#### SageMaker Pipeline steps and parameters
SageMaker Pipelines works on the concept of steps. The order in which steps are executed is inferred from the dependencies each step has. If a step depends on the output of a previous step, it is not executed until that step has completed successfully.
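
To make the idea of dependency-driven ordering concrete, here is a minimal sketch using Python's standard `graphlib` module. It is an illustration only, not how SageMaker resolves the order internally; the step names are shorthand stand-ins for the steps built later in this notebook.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps whose output it consumes.
# The names are illustrative stand-ins, not the step names from config.yml.
dependencies = {
    "preprocess": set(),
    "tuning": {"preprocess"},
    "evaluate": {"tuning", "preprocess"},
    "condition": {"evaluate"},
}

# static_order() yields a step only after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Because every step here depends on the one before it, only one valid order exists, regardless of the order in which the steps are listed in the dictionary.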

SageMaker Pipeline Parameters are input parameters specified when triggering a pipeline execution. They need to be explicitly defined when creating the pipeline and contain default values. For more information, see https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-sdk.html.

#### SageMaker Pipeline DAG

When creating a SageMaker Pipeline, SageMaker creates a Directed Acyclic Graph (DAG) that customers can visualize in Amazon SageMaker Studio. The DAG can be used to track pipeline executions, outputs and metrics. In this notebook, a SageMaker Pipeline with the following DAG is created:

## Predict customer orders with Random Forest Classifier

### Data

This notebook uses PrestoDB to extract `tpc-h` data through the `tpc-h connector`. Data extraction, preprocessing, and the splitting of data into train, test, and validation datasets all happen in the preprocessing step of this SageMaker pipeline.

***To configure PrestoDB on your EC2 instance***, see the instructions in the [README](./README.md) file.


### Overview 

This model is a binary classification model created using the scikit-learn `RandomForestClassifier`. It categorizes input data into high value/low value order classes.

Training data: the training data for this model is available via PrestoDB tables and is read into Pandas through the PrestoDB Python client. This data is then read into an Apache Spark dataframe (although the model training happens only using the data in the Pandas dataframe).

* Data is read using queries from PrestoDB and any feature engineering required is done as part of the query itself.

* Note that ingestion of raw data into PrestoDB tables is outside the scope of this project and it is assumed that for the purpose of model training the data can simply be queried from PrestoDB tables.
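
As a rough illustration of doing feature engineering inside the query, the sketch below aggregates line items per order into the three features and the target named in this notebook's configuration. It makes several assumptions: an in-memory SQLite table stands in for PrestoDB, the column names follow the TPC-H `lineitem` schema, and both the query and the `HIGH_VALUE_THRESHOLD` cut-off are hypothetical rather than taken from the actual preprocessing script.

```python
import sqlite3

# In-memory stand-in for the TPC-H `lineitem` table (only the columns used here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineitem (l_orderkey INTEGER, l_extendedprice REAL, "
    "l_discount REAL, l_quantity REAL)"
)
conn.executemany(
    "INSERT INTO lineitem VALUES (?, ?, ?, ?)",
    [
        (1, 1000.0, 0.0, 10),
        (1, 3000.0, 0.5, 20),
        (2, 500.0, 0.0, 5),
    ],
)

HIGH_VALUE_THRESHOLD = 2000.0  # hypothetical cut-off, not taken from the notebook

# Features and label are computed by the query itself, one row per order.
query = """
SELECT
    l_orderkey,
    SUM(l_extendedprice) AS total_extended_price,
    AVG(l_discount)      AS avg_discount,
    SUM(l_quantity)      AS total_quantity,
    CASE WHEN SUM(l_extendedprice) > ? THEN 1 ELSE 0 END AS high_value_order
FROM lineitem
GROUP BY l_orderkey
ORDER BY l_orderkey
"""
rows = conn.execute(query, (HIGH_VALUE_THRESHOLD,)).fetchall()
for row in rows:
    print(row)
```

Pushing the aggregation into the query means the Python side only receives one row per order, already shaped for training.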


In [2]:
# import sys
# !{sys.executable} -m pip install -r requirements.txt

In [3]:
## Import boto3, the SageMaker SDK and the helper modules used throughout this notebook
import json
import boto3
import time
import logging
import sagemaker
import sagemaker.session
from typing import Dict, List
from datetime import datetime
from utils import (load_config,
                    print_pipeline_execution_summary)
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
    ParameterFloat,
)

from sagemaker.estimator import Estimator
from sklearn.metrics import roc_auc_score
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.functions import Join
from sagemaker.workflow.steps import TuningStep
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, CategoricalParameter

from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.processing import ScriptProcessor
from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.model_metrics import MetricsSource, ModelMetrics
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [4]:
## set the logger to track all of the logs as this pipeline runs
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

### Load the `config.yml` file that contains information used across this pipeline

In [5]:
config = load_config('config.yml')
logger.info(json.dumps(config, indent=2))

[2024-02-23 23:42:22,880] p4797 {2294058105.py:2} INFO - {
  "aws": {
    "region": "us-east-1",
    "sagemaker_execution_role": "arn:aws:iam::218208277580:role/service-role/AmazonSageMaker-ExecutionRole-20230911T184036",
    "s3_bucket": "sagemaker-{region}-{account_id}",
    "s3_prefix": "mlops-pipeline-model"
  },
  "presto": {
    "host": "3.93.186.209",
    "parameter": "8080",
    "presto_credentials": "presto-credentials"
  },
  "pipeline": {
    "training_pipeline_name": "mlops-pipeline-presto",
    "transform_pipeline_name": "mlops-batch-inference",
    "execution_display_name": "mlops-prestodb-pipeline",
    "tags": [
      {
        "Key": "team",
        "Value": "my-team"
      }
    ]
  },
  "training_step": {
    "training_target": "high_value_order",
    "training_features": [
      "total_extended_price",
      "avg_discount",
      "total_quantity"
    ],
    "sklearn_framework_version": "0.23-1",
    "n_estimators": 75,
    "max_depth": 10,
    "min_samples_split": 2

In [6]:
## initialize the SageMaker session, region, role, bucket and pipeline session
session = sagemaker.session.Session()
region = session.boto_region_name
pipeline_session = PipelineSession()

role = config['aws']['sagemaker_execution_role']
ci = boto3.client('sts').get_caller_identity()
bucket = config['aws']['s3_bucket'].format(account_id=ci['Account'], region=region)
prefix = config['aws']['s3_prefix']  # Prefix to S3 artifacts

logger.info(f"bucket={bucket}, prefix={prefix}, role={role}")

[2024-02-23 23:42:22,956] p4797 {credentials.py:1278} INFO - Found credentials in shared credentials file: ~/.aws/credentials
[2024-02-23 23:42:23,197] p4797 {credentials.py:1278} INFO - Found credentials in shared credentials file: ~/.aws/credentials
[2024-02-23 23:42:23,567] p4797 {credentials.py:1278} INFO - Found credentials in shared credentials file: ~/.aws/credentials
[2024-02-23 23:42:23,678] p4797 {2480719557.py:11} INFO - bucket=sagemaker-us-east-1-218208277580, prefix=mlops-pipeline-model, role=arn:aws:iam::218208277580:role/service-role/AmazonSageMaker-ExecutionRole-20230911T184036
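
The bucket name in the log above comes from filling the `{region}` and `{account_id}` placeholders in the configured template. A minimal sketch of that substitution, using the values shown in this run's output (the `artifact_root` name is introduced here only for illustration):

```python
# Template and prefix as shown in the config output; values as logged by this run.
bucket_template = "sagemaker-{region}-{account_id}"
prefix = "mlops-pipeline-model"

bucket = bucket_template.format(region="us-east-1", account_id="218208277580")

# Hypothetical helper value: the S3 location under which pipeline artifacts are grouped.
artifact_root = f"s3://{bucket}/{prefix}"
print(bucket)
print(artifact_root)
```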


### Set parameters that are used throughout the training pipeline

In [7]:
# Convert your list to a JSON string
training_features_str = json.dumps(config['training_step']['training_features'])
logger.info(f"the training features being used for this pipeline --> {training_features_str}")

# Define new pipeline parameters
host_parameter = ParameterString(name="HostParameter", default_value=config['presto']['host'])
port_parameter = ParameterString(name="PortParameter", default_value=config['presto']['parameter'])
target_parameter = ParameterString(name="Target", default_value=config['training_step']['training_target'])
feature_parameter = ParameterString(name="Feature", default_value=training_features_str)

## presto credential key and region pipeline parameters
presto_parameter = ParameterString(name="PrestoParameter", default_value=config['presto']['presto_credentials'])
region_parameter = ParameterString(name="Region", default_value=config['aws']['region'])

# Training hyperparameters exposed as pipeline parameters
n_estimators_parameter = ParameterInteger(name="NEstimators", default_value=config['training_step']['n_estimators'])
max_depth_parameter = ParameterInteger(name="MaxDepth", default_value=config['training_step']['max_depth'])
min_samples_split_parameter = ParameterInteger(name="MinSamplesSplit", default_value=config['training_step']['min_samples_split'])
max_features_parameter = ParameterString(name="MaxFeatures", default_value=config['training_step']['max_features'])
model_approval_status = ParameterString(
    name="ModelApprovalStatus", default_value=config['register_model_step']['approval_status']
)

[2024-02-23 23:42:23,693] p4797 {1164995085.py:3} INFO - the training features being used for this pipeline --> ["total_extended_price", "avg_discount", "total_quantity"]
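
A string parameter can only carry text, which is why the feature list is serialized with `json.dumps` before being wrapped in a `ParameterString`. The sketch below shows the round trip; the decoding side is an assumption about what a consuming script would do, since the scripts themselves are not shown in this notebook.

```python
import json

training_features = ["total_extended_price", "avg_discount", "total_quantity"]

# Pipeline side: serialize the list so it fits into a string parameter.
feature_parameter_value = json.dumps(training_features)

# Script side (hypothetical): the value arrives as one command-line string
# and is decoded back into a list before selecting columns.
decoded_features = json.loads(feature_parameter_value)
print(decoded_features)
```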


<a id='parameters'></a>

### Pipeline input parameters

Pipeline Parameters are input parameters supplied when triggering a pipeline execution. They need to be explicitly defined when creating the pipeline and contain default values.

Create parameters for the inputs to the pipeline. In this case, parameters will be used for:
- `ModelGroup` - Which registry to register the trained model with.
- `InputData` - S3 URI to pipeline input data.
- `PreprocessScript` - S3 URI to python script to preprocess the data.
- `EvaluateScript` - S3 URI to python script to evaluate the trained model.
- `MaximumTrainingJobs` - How many training jobs to allow when hyperparameter tuning the model.
- `MaximumParallelTrainingJobs` - How many training jobs to allow in parallel when hyperparameter tuning the model.
- `AccuracyConditionThreshold` - Only register models with the model registry if they have at least this classification accuracy.
- `ProcessingInstanceType` - What EC2 instance type to use for processing.
- `TrainingInstanceType` - What EC2 instance type to use for training.
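
The relationship between declared defaults and per-execution overrides can be sketched in plain Python. This is a simplified model, not the SageMaker implementation: `resolve_parameters` is a hypothetical helper, the default values are placeholders, and rejecting undeclared names is a choice made for this sketch.

```python
from typing import Any, Dict


def resolve_parameters(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return the values an execution would see: defaults, replaced by any overrides."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        # In this sketch, only parameters declared up front may be overridden.
        raise KeyError(f"undeclared pipeline parameters: {sorted(unknown)}")
    return {**defaults, **overrides}


# Parameter names match this notebook; the values are hypothetical placeholders.
defaults = {
    "MaximumTrainingJobs": 4,
    "MaximumParallelTrainingJobs": 2,
    "AccuracyConditionThreshold": 0.7,
}
resolved = resolve_parameters(defaults, {"AccuracyConditionThreshold": 0.9})
print(resolved)
```

Only the overridden value changes; every other parameter keeps the default it was declared with.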

In [8]:
# To what Registry to register the model and its versions.
model_group = ParameterString(name="ModelGroup", default_value=config['register_model_step']['model_group'])

# Maximum amount of training jobs to allow in the HP tuning
max_training_jobs = ParameterInteger(name="MaximumTrainingJobs", default_value=config['tuning_step']['maximum_training_jobs'])

# Maximum number of training jobs to run in parallel in the HP tuning
max_parallel_training_jobs = ParameterInteger(name="MaximumParallelTrainingJobs", default_value=config['tuning_step']['maximum_parallel_training_jobs'])

# Accuracy threshold to decide whether or not to register the model with Model Registry
accuracy_condition_threshold = ParameterFloat(name="AccuracyConditionThreshold", default_value=config['evaluation_step']['accuracy_condition_threshold'])

# What instance type to use for processing.
processing_instance_type = ParameterString(
    name="ProcessingInstanceType", default_value=config['data_processing_step']['processing_instance_type']
)

# What instance type to use for training.
training_instance_type = ParameterString(name="TrainingInstanceType", default_value=config['training_step']['instance_type'])

<a id='preprocess'></a>

## Preprocess data step
--- 
In the first step, an `SKLearnProcessor` is created and used in a `ProcessingStep`. The preprocessing script connects to PrestoDB, queries the data, splits it into train, test and validation datasets, and writes them to an Amazon S3 bucket. Later steps read these files to train and evaluate the model.
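
The three-way split can be sketched with the standard library alone. This is not the notebook's preprocessing script: `split_rows`, the 70/15/15 fractions and the seed are all assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_rows(
    rows: Sequence,
    train_frac: float = 0.7,
    validation_frac: float = 0.15,
    seed: int = 42,
) -> Tuple[List, List, List]:
    """Shuffle rows and cut them into train, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test


train, validation, test = split_rows(range(100))
print(len(train), len(validation), len(test))
```

Each row lands in exactly one of the three lists, so the test set stays unseen during training and tuning.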

In [9]:
# Create SKlearn processor object,
# The object contains information about what instance type to use, the IAM role to use etc.
# A managed processor comes with a preconfigured container, so only specifying version is required.
sklearn_processor = SKLearnProcessor(framework_version=config['training_step']['sklearn_framework_version'],
                                     role=role,
                                     instance_type=processing_instance_type,
                                     instance_count=config['data_processing_step']['instance_count'],
                                     tags=config['data_processing_step']['tags'])

# Use the sklearn_processor in a SageMaker Pipelines ProcessingStep
# Configure the ProcessingStep
step_preprocess_data = ProcessingStep(
    name=config['data_processing_step']['step_name'],
    processor=sklearn_processor,
    inputs=[],  # No static inputs required as data fetching is part of the script
    outputs=[
        ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/train",
            destination=Join(
                on="/",
                values=[
                    "s3://{}".format(bucket),
                    prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "train",
                ],
            ),
        ),
        ProcessingOutput(
            output_name="validation",
            source="/opt/ml/processing/validation",
            destination=Join(
                on="/",
                values=[
                    "s3://{}".format(bucket),
                    prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "validation",
                ],
            ),
        ),
        ProcessingOutput(
            output_name="test",
            source="/opt/ml/processing/test",
            destination=Join(
                on="/",
                values=[
                    "s3://{}".format(bucket),
                    prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "test",
                ],
            ),
        ),
    ],
    code = config['scripts']['preprocess_data'],
    job_arguments=[
        "--host", host_parameter,
        "--port", port_parameter,
        "--presto_credentials_key", presto_parameter,
        "--region", region_parameter,
    ]
)

[2024-02-23 23:42:24,033] p4797 {image_uris.py:581} INFO - Defaulting to only available Python version: py3
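
The `job_arguments` passed to the step reach the processing script as ordinary command-line arguments. The script is not shown in this notebook, so the following is only a sketch of how such a script might read them; `parse_args`, the defaults and the example host name are hypothetical.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the flags that the ProcessingStep passes via job_arguments."""
    parser = argparse.ArgumentParser(description="Hypothetical preprocessing entry point")
    parser.add_argument("--host", required=True)
    parser.add_argument("--port", type=int, default=8080)
    parser.add_argument("--presto_credentials_key", required=True)
    parser.add_argument("--region", required=True)
    return parser.parse_args(argv)


# Simulate the argument list the step would hand to the script.
args = parse_args([
    "--host", "presto.example.internal",
    "--port", "8080",
    "--presto_credentials_key", "presto-credentials",
    "--region", "us-east-1",
])
print(args.host, args.port, args.presto_credentials_key, args.region)
```

Since the port is declared as a string parameter in the pipeline, converting it with `type=int` on the script side is one way to get a usable number.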


<a id='train'></a>

## Train model step
In the second step, the train and test outputs from the previous processing step are used to train and tune a model.

---

We use the `SKLearn` estimator from the SageMaker SDK and the `RandomForestClassifier` from scikit-learn to train the ML model. The `HyperparameterTuner` class is used to run automatic model tuning, which determines the set of hyperparameters that provides the best performance (maximizes the AUC metric).

In [10]:
# Fetch container to use for training
image_uri = sagemaker.image_uris.retrieve(
    framework="sklearn",
    region=region,
    version=config['training_step']['sklearn_framework_version'],
    py_version="py3",
    instance_type=config['training_step']['instance_type'],
)
logger.info(f"training step image_uri={image_uri}")

sklearn_estimator = SKLearn(
    entry_point=config['scripts']['training_script'],
    role=role,
    instance_count=config['training_step']['instance_count'],
    instance_type=config['training_step']['instance_type'],
    framework_version=config['training_step']['sklearn_framework_version'],
    base_job_name=config['training_step']['base_job_name'],
    hyperparameters={
        "n_estimators": config['training_step']['n_estimators'],
        "max_depth": config['training_step']['max_depth'],  
        "features": config['training_step']['training_features'],
        "target": config['training_step']['training_target'],
    },
    tags=config['training_step']['tags']
)

# Create the HyperparameterTuner object. Hyperparameter ranges are read from config.yml.
rf_tuner = HyperparameterTuner(
                estimator=sklearn_estimator,
                objective_metric_name=config['tuning_step']['objective_metric_name'],
                hyperparameter_ranges={
                    "n_estimators": IntegerParameter(config['tuning_step']['hyperparam_ranges']['n_estimators'][0], config['tuning_step']['hyperparam_ranges']['n_estimators'][1]),
                    "max_depth": IntegerParameter(config['tuning_step']['hyperparam_ranges']['max_depth'][0], config['tuning_step']['hyperparam_ranges']['max_depth'][1]),
                    "min_samples_split": IntegerParameter(config['tuning_step']['hyperparam_ranges']['min_samples_split'][0], config['tuning_step']['hyperparam_ranges']['min_samples_split'][1]),
                    "max_features": CategoricalParameter(config['tuning_step']['hyperparam_ranges']['max_features'])
                },
                max_jobs=config['tuning_step']['maximum_training_jobs'], ## reducing this for testing purposes
                metric_definitions=config['tuning_step']['metric_definitions'],
                max_parallel_jobs=config['tuning_step']['maximum_parallel_training_jobs'], ## reducing this for testing purposes
)


step_tuning = TuningStep(
    name=config['tuning_step']['step_name'],
    tuner=rf_tuner,
    inputs={
        "train": TrainingInput(
            s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "test": TrainingInput(
            s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            content_type="text/csv",
        ),
    },
)

[2024-02-23 23:42:24,118] p4797 {4266246662.py:9} INFO - training step image_uri=683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3
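
The tuner learns the objective metric by matching regular expressions from `metric_definitions` against the training job's log output. The actual definitions live in `config.yml` and are not shown here, so the metric name, regex and log line below are hypothetical; the sketch only demonstrates the extraction mechanism.

```python
import re

# Hypothetical metric definition in the Name/Regex shape used by metric_definitions.
metric_definition = {"Name": "auc", "Regex": r"auc: ([0-9.]+)"}

# Hypothetical line printed by the training script.
log_line = "[INFO] validation finished, auc: 0.9137"

# The first capture group of the regex holds the metric value.
match = re.search(metric_definition["Regex"], log_line)
auc = float(match.group(1)) if match else None
print(metric_definition["Name"], auc)
```

If the training script never prints a line that matches the regex, no metric is captured, so the log format and the regex need to be kept in sync.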


## Evaluate model step
---

When a model is trained, it's common to evaluate the model on unseen data before registering it with the model registry. This ensures the model registry isn't cluttered with poorly performing model versions. The model evaluation step checks that the trained and tuned model reaches a configurable accuracy threshold; only then is the model registered with the model registry, from where it can subsequently be approved and deployed. If the model accuracy does not meet the threshold, the pipeline fails and the model is not registered.

In [11]:
# Create ScriptProcessor object.
# The object contains information about what container to use, what instance type etc.
evaluate_model_processor = ScriptProcessor(
    image_uri=image_uri,
    command=["python3"],
    instance_type=config['evaluation_step']['instance_type'],
    instance_count=config['evaluation_step']['instance_count'],
    role=role,
)

# Create a PropertyFile
# A PropertyFile is used to be able to reference outputs from a processing step, for instance to use in a condition step.
# For more information, visit https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-propertyfile.html
evaluation_report = PropertyFile(
    name="EvaluationReport", output_name="evaluation", path=config['evaluation_step']['evaluation_filename']
)


step_evaluate_model = ProcessingStep(
    name=config['evaluation_step']['step_name'],
    processor=evaluate_model_processor,
    inputs=[
        ProcessingInput(
            source=step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=bucket),
            destination="/opt/ml/processing/model",
            input_name="model.tar.gz" 
        ),
        ProcessingInput(
            source=step_preprocess_data.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            destination="/opt/ml/processing/test",
            input_name="test.csv" 
        ),
    ],
    outputs=[
        ProcessingOutput(
            output_name="evaluation",
            source="/opt/ml/processing/evaluation",
            destination=Join(
                on="/",
                values=[
                    "s3://{}".format(bucket),
                    prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "evaluation",
                ]
            )
        )
    ],
    code = config['scripts']['evaluation'],
    property_files=[evaluation_report],
    job_arguments=[
        "--target", target_parameter,
        "--features", feature_parameter,
    ]
)
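
The condition step later in this notebook reads the value at `binary_classification_metrics.accuracy.value` from the evaluation report. The evaluation script is not shown, so the following is a sketch of a report with that nested layout; `build_evaluation_report` and the sample labels are hypothetical.

```python
import json
from typing import Sequence


def build_evaluation_report(y_true: Sequence[int], y_pred: Sequence[int]) -> dict:
    """Compute accuracy and wrap it in the nested layout the condition step reads."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    accuracy = correct / len(y_true)
    return {"binary_classification_metrics": {"accuracy": {"value": accuracy}}}


# Three of the four hypothetical predictions match the labels.
report = build_evaluation_report([1, 0, 1, 1], [1, 0, 0, 1])
print(json.dumps(report, indent=2))
```

A real evaluation script would write this JSON to the file named by `evaluation_filename` inside `/opt/ml/processing/evaluation` so that the property file can pick it up.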

<a id='register'></a>

## Register model step
If the trained model meets the model performance requirements, a new model version is registered with the model registry for further analysis. To attach model metrics to the model version, create a [ModelMetrics](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-model-quality-metrics.html) object using the evaluation report created in the evaluation step. Then, create the RegisterModel step.

The model is registered with the Model Registry with its approval status set to `PendingManualApproval`. This means the model cannot be deployed to a SageMaker endpoint until its status in the registry is changed to `Approved`, either manually via the SageMaker console, programmatically, or through a Lambda function.

In [12]:
# Create ModelMetrics object using the evaluation report from the evaluation step
# A ModelMetrics object contains metrics captured from a model.
model_metrics = ModelMetrics(
    model_statistics=MetricsSource(
        s3_uri=Join(
            on="/",
            values=[
                step_evaluate_model.arguments["ProcessingOutputConfig"]["Outputs"][0]["S3Output"][
                    "S3Uri"
                ],
                config['evaluation_step']['evaluation_filename'],
            ],
        ),
        content_type="application/json",
    )
)

# Create a RegisterModel step, which registers the model with SageMaker Model Registry.
step_register_model = RegisterModel(
    name=config['register_model_step']['step_name'],
    estimator=sklearn_estimator,
    model_data=step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=bucket),
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=config['register_model_step']['inference_instance_types'],
    transform_instances=config['register_model_step']['transform_instance_types'],
    model_package_group_name=model_group,
    approval_status=model_approval_status,
    model_metrics=model_metrics,
    tags=config['register_model_step']['tags']
    
)



<a id='condition'></a>

## Accuracy condition step
Adding conditions to the pipeline is done with a ConditionStep.
In this case, we only want to register the new model version with the model registry if the new model meets an accuracy condition.

In [13]:
step_fail = FailStep(
    name=config['fail_step']['step_name'],
    error_message=Join(on=" ", values=["Execution failed due to Accuracy <", accuracy_condition_threshold]),
)

In [14]:
# Create accuracy condition to ensure the model meets performance requirements.
# Models with a test accuracy lower than the condition will not be registered with the model registry.
cond_gte = ConditionGreaterThanOrEqualTo(
    left=JsonGet(
        step_name=step_evaluate_model.name,
        property_file=evaluation_report,
        json_path="binary_classification_metrics.accuracy.value",
    ),
    right=accuracy_condition_threshold,
)

# Create a SageMaker Pipelines ConditionStep, using the condition above.
# Enter the steps to perform if the condition returns True / False.
step_cond = ConditionStep(
    name=config['condition_step']['step_name'],
    conditions=[cond_gte],
    if_steps=[step_register_model],
    else_steps=[step_fail],  # fail the pipeline when the accuracy threshold is not met
)
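
The logic of the condition step can be mimicked in plain Python: follow the dotted path into the evaluation report, compare the value with the threshold, and pick a branch. This is a simplified stand-in for `JsonGet` and `ConditionGreaterThanOrEqualTo`, not their implementation; `json_get`, `next_step` and the branch labels are hypothetical.

```python
from functools import reduce
from typing import Any, Dict


def json_get(document: Dict[str, Any], json_path: str) -> Any:
    """Follow a dot-separated path through nested dictionaries."""
    return reduce(lambda node, key: node[key], json_path.split("."), document)


def next_step(report: Dict[str, Any], threshold: float) -> str:
    """Return which branch runs: register when accuracy >= threshold, else fail."""
    accuracy = json_get(report, "binary_classification_metrics.accuracy.value")
    return "register_model" if accuracy >= threshold else "fail"


# Hypothetical evaluation report with the nested layout used by the json_path above.
sample_report = {"binary_classification_metrics": {"accuracy": {"value": 0.91}}}
print(next_step(sample_report, threshold=0.8))
print(next_step(sample_report, threshold=0.95))
```

Note that the comparison is inclusive: a model whose accuracy equals the threshold is still registered.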

<a id='orchestrate'></a>

## Pipeline Creation: Orchestrate all steps

Now that all pipeline steps are defined, they can be combined into a pipeline.

In [15]:
# Create a SageMaker Pipeline.
# Each parameter for the pipeline must be set as a parameter explicitly when the pipeline is created.
# Also pass in each of the steps created above.
# Note that the order of execution is determined from each step's dependencies on other steps,
# not on the order they are passed in below.
pipeline = Pipeline(
    name=config['pipeline']['training_pipeline_name'],
    parameters=[
        processing_instance_type,
        training_instance_type,
        accuracy_condition_threshold,
        model_group,
        max_parallel_training_jobs,
        max_training_jobs,
        host_parameter,
        region_parameter,
        presto_parameter,
        port_parameter,
        target_parameter, 
        feature_parameter,
        model_approval_status,
    ],
    steps=[
            step_preprocess_data, 
            step_tuning, 
            step_evaluate_model, 
            step_cond],
)

In [16]:
# Submit pipeline
pipeline_upsert_tags = config['pipeline']['tags']
pipeline.upsert(role_arn=role, tags=pipeline_upsert_tags)



{'PipelineArn': 'arn:aws:sagemaker:us-east-1:218208277580:pipeline/mlops-pipeline-presto',
 'ResponseMetadata': {'RequestId': '692aa45e-cc63-48c7-b1d8-11ef9cbb5f2a',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '692aa45e-cc63-48c7-b1d8-11ef9cbb5f2a',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '89',
   'date': 'Fri, 23 Feb 2024 23:42:26 GMT'},
  'RetryAttempts': 0}}

## Start pipeline with different parameters.
Now that the pipeline is created, it can be started with custom parameters making the pipeline agnostic to who is triggering it, but also to the scripts and data used. The pipeline can be started using the CLI, the SageMaker Studio UI or the SDK and below there is a screenshot of what it looks like in the SageMaker Studio UI.

#### Starting the pipeline with the SDK
In the example below, the pipeline is started with parameter values read from `config.yml`. Passing different values for these parameters runs the same pipeline against a different configuration without changing the pipeline code.

In [17]:
# Start the pipeline, overriding selected parameters with values from config.yml
execution = pipeline.start(
    execution_display_name=config['pipeline']['execution_display_name'],
    parameters=dict(
        AccuracyConditionThreshold=config['evaluation_step']['accuracy_condition_threshold'],
        MaximumParallelTrainingJobs=config['tuning_step']['maximum_parallel_training_jobs'],
        MaximumTrainingJobs=config['tuning_step']['maximum_training_jobs'],
        ModelGroup=config['register_model_step']['model_group'],
    ),
)

In [18]:
execution.describe()

{'PipelineArn': 'arn:aws:sagemaker:us-east-1:218208277580:pipeline/mlops-pipeline-presto',
 'PipelineExecutionArn': 'arn:aws:sagemaker:us-east-1:218208277580:pipeline/mlops-pipeline-presto/execution/978rkuzjppke',
 'PipelineExecutionDisplayName': 'mlops-prestodb-pipeline',
 'PipelineExecutionStatus': 'Executing',
 'CreationTime': datetime.datetime(2024, 2, 23, 23, 42, 27, 347000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2024, 2, 23, 23, 42, 27, 347000, tzinfo=tzlocal()),
 'CreatedBy': {},
 'LastModifiedBy': {},
 'ResponseMetadata': {'RequestId': '33eddd53-2de9-4133-b8d6-665686b57695',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '33eddd53-2de9-4133-b8d6-665686b57695',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '407',
   'date': 'Fri, 23 Feb 2024 23:42:27 GMT'},
  'RetryAttempts': 0}}

In [None]:
st = time.perf_counter()
logger.info(f"starting pipeline={pipeline.name}")
execution.wait()
elapsed_time = time.perf_counter() - st
logger.info(f"pipeline={pipeline.name} took {elapsed_time:.2f} seconds to run")

[2024-02-23 23:42:27,568] p4797 {661491485.py:2} INFO - starting pipeline=mlops-pipeline-presto


In [None]:
print_pipeline_execution_summary(execution.list_steps(), pipeline.name)

#### Now that the model is registered, access it manually in the SageMaker Studio Model Registry console or programmatically in the next notebook, approve it, and run the second part of this solution: the batch transform step.