# Batch Transform on Amazon SageMaker Pipelines Integrated with Presto


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)

---

## Predict customer orders with Random Forest Classifier

### Data

This notebook uses PrestoDB to extract `tpc-h` data through the `tpc-h connector`. Data extraction, preprocessing, and the splitting of data into train, test, and validation datasets all happen in the preprocessing step of this SageMaker pipeline.

***To configure PrestoDB on your EC2 instance, see***: [PrestoDB EC2 Connection](https://normanlimxk.com/2020/09/15/creating-a-presto-cluster-on-ec2/)


### Overview 
**Disclaimer** This notebook was created using [Amazon SageMaker Studio](https://aws.amazon.com/sagemaker/studio/) and the `Python3(DataScience) kernel`. SageMaker Studio is required for the visualizations of the DAG and model metrics to work.

The purpose of this notebook is to demonstrate how SageMaker Pipelines can be used to create a generic Scikit-Learn training pipeline that preprocesses data and then trains, tunes, evaluates, and registers new machine learning models with the SageMaker model registry. The pipeline is designed to be reusable across teams, customers, and use cases. All scripts to preprocess the data and evaluate the trained model have been prepared in advance and are available here: 

---

This model is a binary classification model created using the scikit-learn `RandomForestClassifier`. It categorizes input data into high value/low value order classes. 

Training data: the training data for this model is available via PrestoDB tables and is read into Pandas through the PrestoDB Python client. This data is then read into an Apache Spark dataframe (although the model training happens only using the data in the Pandas dataframe).

* Data is read using queries from PrestoDB and any feature engineering required is done as part of the query itself.

* Note that ingestion of raw data into PrestoDB tables is outside the scope of this project and it is assumed that for the purpose of model training the data can simply be queried from PrestoDB tables.
---
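
As a rough illustration of doing feature engineering inside the query itself, the sketch below builds a PrestoDB SQL string that derives a binary high-value label from the order total. The table path `tpch.tiny.orders`, the simplified column names (`orderkey`, `totalprice`, `orderpriority`), the threshold, and the helper name `build_order_query` are all assumptions made for illustration; they are not taken from this project's scripts.

```python
def build_order_query(threshold: float, table: str = "tpch.tiny.orders") -> str:
    """Return a SQL string that labels each order as high value (1) or low value (0)."""
    # Assumed column names; adjust them to match your tpc-h connector configuration.
    return (
        "SELECT orderkey, totalprice, orderpriority, "
        f"CASE WHEN totalprice > {threshold:.2f} THEN 1 ELSE 0 END AS high_value "
        f"FROM {table}"
    )


query = build_order_query(150000)
print(query)
```

Deriving the label in SQL keeps the Pandas side of the preprocessing step free of labeling logic, at the cost of tying the threshold to the query text.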

### This notebook specifically focuses on the second portion of this solution, i.e.: 

1. Extract the latest approved model from the model registry

2. Process and send batch data from PrestoDB to Amazon S3

3. Utilize the batch data for the batch transform and inference step, record the start and end times, and send the output to an S3 path. 


In [None]:
!pip install -U sagemaker --quiet # Ensure correct version of SageMaker is installed

In [None]:
## Import the libraries needed to initialize the session
import json
import logging  # used below to configure the notebook logger
import boto3
import sagemaker
from utils import *
import sagemaker.session
from typing import Dict, List, Optional, Tuple, Union
from sagemaker.workflow.pipeline_context import PipelineSession

In [None]:
## set the logger to track all of the logs as this pipeline runs
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

### Load the `config.yml` file that contains settings used across this pipeline

In [None]:
config = load_config('config.yml')
logger.info(json.dumps(config, indent=2))
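
The rest of the notebook reads many nested keys from `config`, so a missing entry only surfaces as a `KeyError` several cells later. A minimal sketch of an up-front check is shown below. The key names are collected from the cells in this notebook, while the helper `missing_config_keys` and the example dictionary are hypothetical.

```python
# Sections and keys that the cells in this notebook read from the config.
REQUIRED_KEYS = {
    "aws": ["region", "sagemaker_execution_role"],
    "general": ["prefix", "pipeline_name", "model_group", "model_name",
                "transform_step_name", "transform_pipeline_name"],
    "pipeline_parameters": ["presto_host", "port_parameter", "user_parameter"],
    "training_params": ["training_features", "training_target"],
    "input_params": ["processing_instance_type", "training_instance_type",
                     "inference_instance_type", "instance_count"],
    "dir_scripts": ["batch_transform_script", "batch_inference"],
    "pipeline_tags": ["transform_model_tags", "transformer_pipeline_upsert_tags"],
}


def missing_config_keys(cfg: dict) -> list:
    """Return dotted names of required keys that are absent from cfg."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        section_values = cfg.get(section) or {}
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return missing


# Hypothetical, deliberately incomplete config used only to show the output shape.
example_cfg = {"aws": {"region": "us-west-2"}, "general": {"prefix": "demo"}}
problems = missing_config_keys(example_cfg)
print(problems[:3])
```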

In [None]:
## initialize the sagemaker session, region, role bucket and pipeline session
session = sagemaker.session.Session()
pipeline_session = PipelineSession()
role = config['aws']['sagemaker_execution_role']
session_bucket = session.default_bucket()

logger.info(f"the pipeline bucket being used for this pipeline execution -> {session_bucket}")
logger.info(f"the sagemaker execution role being used across this pipeline -> {role}")

In [None]:
prefix = config['general']['prefix']  # Prefix to S3 artifacts
pipeline_name = config['general']['pipeline_name']  # SageMaker Pipeline name
order_model_group = config['general']['model_group']

logger.info(f"the prefix for the pipeline name -> {prefix}, pipeline name of this execution -> {pipeline_name}, model group name -> {order_model_group}")

In [None]:
logger.info(f"{config['training_params']['training_features']}")

In [None]:
import json
from sagemaker.workflow.parameters import ParameterString, ParameterInteger, ParameterFloat

# Convert your list to a JSON string
training_features_str = json.dumps(config['training_params']['training_features'])
logger.info(f"the training features being used for this pipeline --> {training_features_str}")

# Define new pipeline parameters
host_parameter = ParameterString(name="HostParameter", default_value=config['pipeline_parameters']['presto_host'])
port_parameter = ParameterString(name="PortParameter", default_value=config['pipeline_parameters']['port_parameter'])
user_parameter = ParameterString(name="UserParameter", default_value=config['pipeline_parameters']['user_parameter'])
target_parameter = ParameterString(name="Target", default_value=config['training_params']['training_target'])
feature_parameter = ParameterString(name="Feature", default_value=training_features_str)

# Log the expression that each pipeline parameter resolves to at execution time
logger.info(f"the feature parameter being used for training -> {feature_parameter.expr}")
logger.info(f"the host parameter being used from the presto config -> {host_parameter.expr}")
logger.info(f"the port parameter being used from the presto config -> {port_parameter.expr}")
logger.info(f"the user parameter being used from the presto config -> {user_parameter.expr}")
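
SageMaker pipeline parameters hold scalar values (string, integer, float, or boolean), which is why the feature list above is serialized with `json.dumps` before it is wrapped in a `ParameterString`. A script that receives the string has to decode it again. The round trip can be sketched as follows; the feature names are hypothetical placeholders.

```python
import json

# Hypothetical feature names, used only to illustrate the encoding.
features = ["orderpriority", "shippriority", "totalprice"]

encoded = json.dumps(features)   # what the ParameterString would carry
decoded = json.loads(encoded)    # what a receiving script would recover

print(encoded)
print(decoded == features)
```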

<a id='parameters'></a>

### Pipeline input parameters

Pipeline parameters are the inputs supplied when triggering a pipeline execution. They need to be explicitly defined when creating the pipeline and contain default values.

Create parameters for the inputs to the pipeline. In this case, parameters will be used for:

- `ProcessingInstanceType` - What EC2 instance type to use for processing.
- `TrainingInstanceType` - What EC2 instance type to use for training.
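
To make the relationship between default values and per-execution overrides concrete, the plain-Python sketch below mimics the merge. The helper `resolve_parameters` is hypothetical and not part of the SageMaker SDK, the instance types are example values, and rejecting undeclared names is a design choice of this sketch.

```python
from typing import Optional


def resolve_parameters(defaults: dict, overrides: Optional[dict] = None) -> dict:
    """Combine declared default values with the overrides given for one run."""
    overrides = overrides or {}
    unknown = set(overrides) - set(defaults)
    if unknown:
        # This sketch only accepts names that were declared with a default.
        raise KeyError(f"undeclared parameters: {sorted(unknown)}")
    return {**defaults, **overrides}


defaults = {
    "ProcessingInstanceType": "ml.m5.xlarge",
    "TrainingInstanceType": "ml.m5.xlarge",
}
resolved = resolve_parameters(defaults, {"ProcessingInstanceType": "ml.m5.2xlarge"})
print(resolved)
```

With the SDK, an override of this kind is passed as a dictionary to the `parameters` argument of `Pipeline.start`.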

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
    ParameterFloat,
)

# What instance type to use for processing.
processing_instance_type = ParameterString(
    name="ProcessingInstanceType", default_value=config['input_params']['processing_instance_type']
)

# What instance type to use for training.
training_instance_type = ParameterString(name="TrainingInstanceType", default_value=config['input_params']['training_instance_type'])

## initializing the sklearn processor
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1", role=role, instance_type=processing_instance_type, instance_count=1
)

#### Create an Image URI object to use while creating the model from the approved model in the registry

In [None]:
## represents the framework version
FRAMEWORK_VERSION = "0.23-1"

# Fetch the scikit-learn container image used when creating the model
image_uri = sagemaker.image_uris.retrieve(
    framework="sklearn",
    region=config['aws']['region'],
    version=FRAMEWORK_VERSION,
    py_version="py3",
    instance_type=config['input_params']['processing_instance_type'],
)

### Next, approve the model
---
Approve the model to launch the model deployment process

In [None]:
sm = boto3.client("sagemaker")

# list all model packages and select the latest one
model_packages = []

for p in sm.get_paginator('list_model_packages').paginate(
        ModelPackageGroupName=config['general']['model_group'],
        SortBy="CreationTime",
        SortOrder="Descending",
    ):
    model_packages.extend(p["ModelPackageSummaryList"])

if len(model_packages) == 0:
    raise Exception(f"No model package is found for {config['general']['model_group']} model package group")

## print the latest model, approve it
latest_model_package_arn = model_packages[0]["ModelPackageArn"]
print(latest_model_package_arn)

The following statement sets the ModelApprovalStatus for the model package to Approved. The model package state change will launch the EventBridge rule and the rule will launch the CodePipeline CI/CD pipeline with model deployment.

In [None]:
## updating the latest model package to approved status to use it for batch inference
model_package_update_response = sm.update_model_package(
    ModelPackageArn=latest_model_package_arn,
    ModelApprovalStatus="Approved",
)

## PART 2: Batch Transform Pipeline: Prepare Batch Data & Perform Batch Inference

### The first step is to get the latest batch data from Presto and use it for the batch transform step

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.functions import Join
from sagemaker.workflow.execution_variables import ExecutionVariables

# Use the sklearn_processor in a SageMaker Pipelines ProcessingStep
# Configure the ProcessingStep
batch_data_prep = ProcessingStep(
    name="Preprocess-Batch-Data",
    processor=sklearn_processor,
    outputs=[
        ProcessingOutput(
            output_name="batch",
            source="/opt/ml/processing/batch",
            destination=Join(
                on="/",
                values=[
                    "s3://{}".format(session_bucket),
                    prefix,
                    ExecutionVariables.PIPELINE_EXECUTION_ID,
                    "batch",
                ], 
            ),
        ),
    ],
    code = config['dir_scripts']['batch_transform_script'],
    job_arguments=[
        ## these job parameters are required in the process of getting batch data from presto
        ## and then send it to s3 for the process of batch inference
        "--host", host_parameter, ## represents the host parameter for the batch data
        "--port", port_parameter, ## represents the port for the EC2
        "--user", user_parameter, ## represents the username for the presto
    ],
)
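
The `job_arguments` above reach the processing script as command-line arguments. How the script reads them is not shown in this notebook; a minimal sketch using `argparse` might look like the following. The default port and the example values are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the Presto connection arguments passed through job_arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", type=str, required=True)
    # The value arrives as a string because the pipeline parameter is a ParameterString.
    parser.add_argument("--port", type=int, default=8080)
    parser.add_argument("--user", type=str, required=True)
    return parser.parse_args(argv)


# Example values only; a real run receives them from the pipeline parameters.
args = parse_args(["--host", "ec2-host.example.com", "--port", "8080", "--user", "presto"])
print(args.host, args.port, args.user)
```

The script would then write its output under `/opt/ml/processing/batch`, the `source` of the `ProcessingOutput`, so that SageMaker uploads it to the S3 destination configured on the step.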

### Batch Transform Configuration begins below:
---

1. Create the model with the model image URI, referring to the `inference.py` script that grabs information on which features to use while making predictions.

2. Create the model which automatically triggers the training and the preprocess data step

3. Run the transformer step on the created model
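
The cells that follow take the first package returned for the model group and then check its approval status. A more defensive variant, sketched here with plain dictionaries so that it runs without AWS access, filters to approved packages first and then picks the newest one. The helper name `latest_approved_arn` and the sample entries are hypothetical; the field names follow the `ModelPackageSummaryList` entries used elsewhere in this notebook.

```python
from datetime import datetime


def latest_approved_arn(summaries):
    """Return the ARN of the newest package whose status is Approved, or None."""
    approved = [s for s in summaries if s.get("ModelApprovalStatus") == "Approved"]
    if not approved:
        return None
    newest = max(approved, key=lambda s: s["CreationTime"])
    return newest["ModelPackageArn"]


# Made-up summaries shaped like entries of ModelPackageSummaryList.
sample = [
    {"ModelPackageArn": "arn:example:model-package/orders/3",
     "ModelApprovalStatus": "PendingManualApproval",
     "CreationTime": datetime(2024, 3, 1)},
    {"ModelPackageArn": "arn:example:model-package/orders/2",
     "ModelApprovalStatus": "Approved",
     "CreationTime": datetime(2024, 2, 1)},
    {"ModelPackageArn": "arn:example:model-package/orders/1",
     "ModelApprovalStatus": "Approved",
     "CreationTime": datetime(2024, 1, 1)},
]
print(latest_approved_arn(sample))
```

The `list_model_packages` API also accepts a `ModelApprovalStatus` filter, which moves the same filtering to the service side.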

In [None]:
client = boto3.client("sagemaker")
list_model_packages_response = client.list_model_packages(ModelPackageGroupName=config['general']['model_group'])
list_model_packages_response

latest_model_version_arn = list_model_packages_response["ModelPackageSummaryList"][0][
    "ModelPackageArn"
]
print(latest_model_version_arn)

In [None]:
try:
    latest_approved_model_package = client.describe_model_package(ModelPackageName=latest_model_version_arn)

    if latest_approved_model_package['ModelApprovalStatus'] == "Approved":
        logger.info(f"The latest approved model package is --> {latest_approved_model_package}")
        model_data_url = latest_approved_model_package['InferenceSpecification']['Containers'][0]['ModelDataUrl']
        logger.info(f"The model data for the latest approved model arn {latest_model_version_arn} is stored in {model_data_url}")
    else:
        # If the model approval status is not Approved, raise an error
        error_message = f"ModelApprovalStatus is not Approved. Current status: {latest_approved_model_package['ModelApprovalStatus']}"
        logger.error(error_message)
        raise ValueError(error_message)

except Exception as e:
    logger.error(f"An error occurred while tracking the approved model: {str(e)}")
    raise e



In [None]:
from sagemaker.model import Model

## create the model image based on the model data and refer to the inference script as an entry point for 
## batch inference
model = Model(
    image_uri=image_uri,
    entry_point=config['dir_scripts']['batch_inference'],
    model_data=model_data_url,
    sagemaker_session=pipeline_session,
    role=role,
)

#### Create the model image from the approved model for batch inference in the next step

In [None]:
from sagemaker.inputs import CreateModelInput
from sagemaker.workflow.model_step import ModelStep

step_create_model = ModelStep(
    name=config['general']['model_name'],
    step_args=model.create(instance_type=config['input_params']['inference_instance_type']),
)

### Define a Transform Step to Perform Batch Transformation

Now that a model instance is defined, create a Transformer instance with the appropriate model type, compute instance type, and desired output S3 URI.

Specifically, pass in the `ModelName` from the `properties` of the `ModelStep` defined above, `step_create_model`. The step's `properties` attribute matches the object model of the DescribeModel response object.
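
The transformer below combines `strategy="MultiRecord"` with line-based splitting and assembly. As a simplified mental model (not the service's actual implementation), the input file is split into lines, as many lines as fit under the payload limit are packed into each request, and the per-request outputs are joined back with newlines. The helpers `pack_records` and `assemble`, the byte limit, and the sample rows are hypothetical.

```python
def pack_records(lines, max_payload_bytes):
    """Greedily group newline-delimited records into payloads under a size limit."""
    payloads, current, size = [], [], 0
    for line in lines:
        record = line + "\n"
        record_size = len(record.encode("utf-8"))
        # Start a new payload when adding this record would exceed the limit.
        if current and size + record_size > max_payload_bytes:
            payloads.append("".join(current))
            current, size = [], 0
        current.append(record)
        size += record_size
    if current:
        payloads.append("".join(current))
    return payloads


def assemble(outputs):
    """Join per-request outputs with newlines, in the spirit of assemble_with='Line'."""
    return "\n".join(o.strip("\n") for o in outputs)


rows = ["1,100.0,5", "2,250.5,1", "3,75.2,3", "4,999.9,2"]  # made-up CSV rows
payloads = pack_records(rows, max_payload_bytes=24)
print(len(payloads))
print(assemble(["0\n1", "1\n0\n"]))
```

The practical consequence is that the inference script should expect several CSV rows per request and return one prediction per row.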

In [None]:
import datetime
from sagemaker.transformer import Transformer

# Capture the current time for recording the start and end time for the batch transform step
current_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')


transformer = Transformer(
    model_name=step_create_model.properties.ModelName,
    instance_type=config['input_params']['inference_instance_type'],
    instance_count=config['input_params']['instance_count'],
    strategy="MultiRecord",
    accept="text/csv",
    assemble_with="Line",
    output_path=f"s3://{session_bucket}/batch_transform_output",
    tags = config['pipeline_tags']['transform_model_tags'], 
    env={
        'START_TIME': current_time, 
        'END_TIME': '',
    }
    
)

### Pass in the transformer instance and a TransformInput that points at the batch data produced by the preprocessing step

In [None]:
from sagemaker.inputs import TransformInput
from sagemaker.workflow.steps import TransformStep

# Use the S3 output of the batch preprocessing step as the input to the transform job
transform_input = TransformInput(
    data=batch_data_prep.properties.ProcessingOutputConfig.Outputs[
                "batch" ## this refers to the batch data that is configured within s3 after the batch preprocessing step
            ].S3Output.S3Uri,
    
    content_type="text/csv", 
    split_type="Line")

step_transform = TransformStep(
    name=config['general']['transform_step_name'], transformer=transformer, inputs=transform_input, 
)

In [None]:
from sagemaker.workflow.pipeline import Pipeline

pipeline_name = config['general']['transform_pipeline_name']

burner_monitor_pipeline = Pipeline(
    name=pipeline_name,
    parameters=
    [processing_instance_type,
    host_parameter,
    port_parameter,
    user_parameter,
    target_parameter, 
    feature_parameter,],
    
    steps=[
        batch_data_prep,
        step_create_model, 
        step_transform,
    ],
)

In [None]:
burner_monitor_pipeline.upsert(role_arn=role, tags = config['pipeline_tags']['transformer_pipeline_upsert_tags'])

In [None]:
execution_burner = burner_monitor_pipeline.start()

In [None]:
execution_burner.describe()

In [None]:
execution_burner.wait()

In [None]:
execution_burner.list_steps()
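
The `START_TIME` value set on the transformer is captured when the notebook cell runs, not when the transform job starts. If per-step timings are needed, one option is to derive them from the step list after the execution finishes. The sketch below assumes each entry carries `StepName`, `StartTime`, and `EndTime` (the latter two as `datetime` objects); the helper `step_durations`, the sample entries, and the step name `Transform` are made up for illustration.

```python
from datetime import datetime, timedelta


def step_durations(steps):
    """Map each finished step name to its wall-clock duration."""
    durations = {}
    for step in steps:
        start, end = step.get("StartTime"), step.get("EndTime")
        # Steps that are still running have no EndTime yet and are skipped.
        if start is not None and end is not None:
            durations[step["StepName"]] = end - start
    return durations


# Made-up entries shaped like the dictionaries returned by list_steps().
sample_steps = [
    {"StepName": "Preprocess-Batch-Data",
     "StartTime": datetime(2024, 1, 1, 12, 0, 0),
     "EndTime": datetime(2024, 1, 1, 12, 4, 30)},
    {"StepName": "Transform", "StartTime": datetime(2024, 1, 1, 12, 5, 0)},
]
durations = step_durations(sample_steps)
print(durations)
```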

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)
