# Part 1: Creating a SageMaker Processing Job from a SageMaker Data Wrangler Flow Template

<div class="alert alert-warning"> 
	⚠️ <strong> PRE-REQUISITE: </strong> Before proceeding with this notebook, please ensure that you have executed the <code>00_setup_data_wrangler.ipynb</code> notebook.
</div>

We will demonstrate how to define a SageMaker Processing Job based on an existing SageMaker Data Wrangler Flow definition.

## 1. Initialization

In [None]:
%store -r ins_claim_flow_uri

In [None]:
import sagemaker
import json
import string
import boto3

sm_client = boto3.client("sagemaker")
sess = sagemaker.Session()

bucket = sess.default_bucket()
prefix = 'aws-data-wrangler-workflows'

FLOW_TEMPLATE_URI = ins_claim_flow_uri

flow_file_name = FLOW_TEMPLATE_URI.split("/")[-1]
flow_export_name = flow_file_name.replace(".flow", "")
flow_export_id = flow_export_name.replace("flow-", "")
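To illustrate what the three string operations above produce, the sketch below applies them to a made-up URI. The bucket, prefix, and file name are hypothetical; in this notebook the real value comes from `ins_claim_flow_uri`.

```python
def derive_flow_names(flow_uri):
    """Return (file name, export name, export id) for a flow template URI."""
    file_name = flow_uri.split("/")[-1]            # last path segment
    export_name = file_name.replace(".flow", "")   # drop the ".flow" extension
    export_id = export_name.replace("flow-", "")   # drop the "flow-" prefix
    return file_name, export_name, export_id


# Hypothetical URI, for illustration only
example_uri = "s3://example-bucket/data_wrangler_flows/flow-01-12-34-abcd1234.flow"
example_names = derive_flow_names(example_uri)
print(example_names)
```

Note that `str.replace` removes every occurrence of the substring, so a file name that contains `flow-` or `.flow` more than once would be altered in more places than the prefix and extension.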

## 2. Download Flow template from Amazon S3

We download the flow template from S3 in order to parse its content and retrieve the following information:
* Source datasets, including dataset names and S3 URIs
* Output node, including node ID and output path

This information is then passed as parameters to the SageMaker Processing Job.

In [None]:
sagemaker.s3.S3Downloader.download(FLOW_TEMPLATE_URI, ".")

#### Parsing input and output parameters from flow template

In [None]:
with open(flow_file_name, 'r') as f:
    data = json.load(f)

# The last node in the flow is assumed to be the output node.
output_node = data['nodes'][-1]['node_id']
output_path = data['nodes'][-1]['outputs'][0]['name']

# Collect the name and S3 URI of every source dataset.
source_nodes = [node for node in data['nodes'] if node['type'] == "SOURCE"]
input_source_names = [node['parameters']['dataset_definition']['name'] for node in source_nodes]
input_source_uris = [node['parameters']['dataset_definition']['s3ExecutionContext']['s3Uri'] for node in source_nodes]

output_name = f"{output_node}.{output_path}"
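The parsing logic can be tried without a real flow file. The sketch below runs the same lookups on a hand-written, heavily trimmed flow definition; the node IDs, dataset name, and S3 URI are hypothetical, and only the keys read by the cell above are included. Like the cell above, it assumes the last node is the output node.

```python
import json

# Hypothetical, trimmed flow definition. A real .flow file has many more fields.
example_flow = json.loads("""
{
  "nodes": [
    {
      "node_id": "node-source-1",
      "type": "SOURCE",
      "parameters": {
        "dataset_definition": {
          "name": "claims.csv",
          "s3ExecutionContext": {"s3Uri": "s3://example-bucket/claims.csv"}
        }
      },
      "outputs": [{"name": "default"}]
    },
    {
      "node_id": "node-transform-2",
      "type": "TRANSFORM",
      "parameters": {},
      "outputs": [{"name": "default"}]
    }
  ]
}
""")


def parse_flow(flow):
    """Return the output name and a {dataset name: S3 URI} mapping."""
    last_node = flow["nodes"][-1]  # assumption: the output node is last
    output_name = f"{last_node['node_id']}.{last_node['outputs'][0]['name']}"
    sources = {}
    for node in flow["nodes"]:
        if node["type"] == "SOURCE":
            definition = node["parameters"]["dataset_definition"]
            sources[definition["name"]] = definition["s3ExecutionContext"]["s3Uri"]
    return output_name, sources


example_output_name, example_sources = parse_flow(example_flow)
print(example_output_name)
print(example_sources)
```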

## 3. Create SageMaker Processing Job from Data Wrangler Flow template

### 3.1 SageMaker Processing Inputs

Below are the inputs required by the SageMaker Python SDK to launch a processing job.

#### Source datasets

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

data_sources = []

for name, uri in zip(input_source_names, input_source_uris):
    data_sources.append(ProcessingInput(
        source=uri,
        destination=f"/opt/ml/processing/{name}",
        input_name=name,
        s3_data_type="S3Prefix",
        s3_input_mode="File",
        s3_data_distribution_type="FullyReplicated"
    ))

#### Flow Input

In [None]:
## Input - Flow
flow_input = ProcessingInput(
    source=FLOW_TEMPLATE_URI,
    destination="/opt/ml/processing/flow",
    input_name="flow",
    s3_data_type="S3Prefix",
    s3_input_mode="File",
    s3_data_distribution_type="FullyReplicated"
)

processing_job_inputs=[flow_input] + data_sources

### 3.2. SageMaker Processing Output

In [None]:
s3_output_prefix = f"export-{flow_export_name}/output"
s3_output_path = f"s3://{bucket}/{s3_output_prefix}"
print(f"Flow S3 export result path: {s3_output_path}")

processing_job_output = ProcessingOutput(
    output_name=output_name,
    source="/opt/ml/processing/output",
    destination=s3_output_path,
    s3_upload_mode="EndOfJob"
)

processing_job_outputs=[processing_job_output]

### 3.3. Create Processor Object

In [None]:
# IAM role for executing the processing job.
iam_role = sagemaker.get_execution_role()
aws_region = sess.boto_region_name

# Unique processing job name. Give a unique name every time you re-execute processing jobs
processing_job_name = f"data-wrangler-flow-processing-{flow_export_id}"

# Data Wrangler Container URI.
container_uri = sagemaker.image_uris.retrieve(
    framework='data-wrangler',
    region=aws_region
)

# Processing Job Instance count and instance type.
instance_count = 2
instance_type = "ml.m5.4xlarge"

# Size in GB of the EBS volume to use for storing data during processing
volume_size_in_gb = 30

# Network Isolation mode; default is off
enable_network_isolation = False

# KMS key for per object encryption; default is None
kms_key = None


# Content type for each output. Data Wrangler supports CSV as default and Parquet.
processing_job_output_content_type = "CSV"

# Output configuration used as processing job container arguments 
processing_job_output_config = {
    output_name: {
        "content_type": processing_job_output_content_type
    }
}

To launch the Processing Job in a workflow compatible with the SageMaker Python SDK, you will create a `Processor` object. The processor can then be integrated in the following workflows:

* Amazon SageMaker Pipelines
* AWS Step Functions (through AWS Step Functions Data Science SDK)
* Apache Airflow (through a Python Operator, using Amazon SageMaker SDK)

In [None]:
from sagemaker.processing import Processor
from sagemaker.network import NetworkConfig

processor = Processor(
    role=iam_role,
    image_uri=container_uri,
    instance_count=instance_count,
    instance_type=instance_type,
    volume_size_in_gb=volume_size_in_gb,
    network_config=NetworkConfig(enable_network_isolation=enable_network_isolation),
    sagemaker_session=sess,
    output_kms_key=kms_key
)

## 4. Create SageMaker Estimator

Another building block for orchestrating ML workflows is an Estimator, which is the base for a Training Job. The estimator can be integrated in the following workflows:

* Amazon SageMaker Pipelines
* AWS Step Functions (through AWS Step Functions Data Science SDK)
* Apache Airflow (through Amazon SageMaker Operator for Apache Airflow)

In [None]:
import boto3
from sagemaker.estimator import Estimator

region = boto3.Session().region_name

# Estimator Instance count and instance type.
instance_count = 1
instance_type = "ml.m5.4xlarge"

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.2-1",
    py_version="py3",
    instance_type=instance_type,
)
xgb_train = Estimator(
    image_uri=image_uri,
    instance_type=instance_type,
    instance_count=instance_count,
    role=iam_role,
)
xgb_train.set_hyperparameters(
    objective="reg:squarederror",
    num_round=3,
)

# Part 2: Creating a workflow with SageMaker Pipelines

## 1. Define Pipeline Steps and Parameters
To create a SageMaker pipeline, you will first create a `ProcessingStep` using the Data Wrangler processor defined above.

In [None]:
from sagemaker.workflow.steps import ProcessingStep

data_wrangler_step = ProcessingStep(
    name="DataWranglerProcessingStep",
    processor=processor,
    inputs=processing_job_inputs, 
    outputs=processing_job_outputs,
    job_arguments=[f"--output-config '{json.dumps(processing_job_output_config)}'"],
)
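The `job_arguments` entry above is a single string that embeds the output configuration as JSON inside single quotes. The sketch below builds the same string for a hypothetical output name and checks that the JSON can be recovered from it. `shlex` is used here only to show that the quoting is consistent; how the Data Wrangler container itself parses the argument is not covered by this notebook.

```python
import json
import shlex

# Hypothetical output name in the "<node_id>.<output_path>" form used above
example_output_name = "node-transform-2.default"
example_output_config = {example_output_name: {"content_type": "CSV"}}

# Same construction as the job_arguments entry in the ProcessingStep
example_argument = f"--output-config '{json.dumps(example_output_config)}'"
print(example_argument)

# Split shell-style: the single quotes keep the JSON together as one token
flag, payload = shlex.split(example_argument)
recovered_config = json.loads(payload)
print(flag, recovered_config)
```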

Next, add a `TrainingStep` that trains a model on the preprocessed training data set, followed by a `RegisterModel` step that registers the trained model.

You can also add more steps. To learn more about adding steps to a pipeline, see [Define a Pipeline](http://docs.aws.amazon.com/sagemaker/latest/dg/define-pipeline.html) in the SageMaker documentation.

In [None]:
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep
from sagemaker.workflow.step_collections import RegisterModel

xgb_input_content_type = None

if processing_job_output_content_type == "CSV":
    xgb_input_content_type = 'text/csv'
elif processing_job_output_content_type == "Parquet":
    xgb_input_content_type = 'application/x-parquet'

training_step = TrainingStep(
    name="DataWranglerTrain",
    estimator=xgb_train,
    inputs={
        "train": TrainingInput(
            s3_data=data_wrangler_step.properties.ProcessingOutputConfig.Outputs[
                output_name
            ].S3Output.S3Uri,
            content_type=xgb_input_content_type
        )
    }
)

register_step = RegisterModel(
    name="DataWranglerRegisterModel",
    estimator=xgb_train,
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.t2.medium", "ml.m5.large"],
    transform_instances=["ml.m5.large"],
    model_package_group_name="DataWrangler-PackageGroup"
)
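The `if`/`elif` mapping in the cell above leaves `xgb_input_content_type` as `None` for any value other than `"CSV"` or `"Parquet"`. One alternative, sketched below, is a lookup table that fails loudly on an unexpected format. The helper name `training_content_type` is hypothetical and not part of the SageMaker SDK.

```python
# Data Wrangler output format -> MIME type, as used in the cell above
CONTENT_TYPE_BY_FORMAT = {
    "CSV": "text/csv",
    "Parquet": "application/x-parquet",
}


def training_content_type(output_format):
    """Return the MIME type for a Data Wrangler output format."""
    try:
        return CONTENT_TYPE_BY_FORMAT[output_format]
    except KeyError:
        supported = ", ".join(sorted(CONTENT_TYPE_BY_FORMAT))
        raise ValueError(
            f"Unsupported output format {output_format!r}; expected one of: {supported}"
        ) from None


print(training_content_type("CSV"))
```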

### Define Pipeline Parameters
Now you will create the SageMaker pipeline that combines the steps created above so it can be executed.

Define pipeline parameters that you can use to parametrize the pipeline. Parameters enable custom pipeline executions and schedules without having to modify the pipeline definition.

The parameters defined in this notebook are:

- `instance_type` - The ml.* instance type of the processing job.
- `instance_count` - The instance count of the processing job.

Note that the `Processor` above was constructed with literal values, so these parameters are declared on the pipeline but only affect the processing job if they are passed to the `Processor` in place of those literals.

In [None]:
from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
)
# Define Pipeline Parameters
instance_type = ParameterString(name="InstanceType", default_value="ml.m5.4xlarge")
instance_count = ParameterInteger(name="InstanceCount", default_value=1)

You will now create a pipeline with the steps and parameters defined above.

In [None]:
import time
import uuid

from sagemaker.workflow.pipeline import Pipeline

# Create a unique pipeline name with flow export name
pipeline_name = f"pipeline-{flow_export_name}"

# Combine pipeline steps
pipeline_steps = [data_wrangler_step, training_step, register_step]

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[instance_type, instance_count],
    steps=pipeline_steps,
    sagemaker_session=sess
)

### (Optional) Examining the pipeline definition

The JSON of the pipeline definition can be examined to confirm the pipeline is well-defined and 
the parameters and step properties resolve correctly.

In [None]:
import json

definition = json.loads(pipeline.definition())
definition
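Rather than reading the whole JSON by eye, the definition can be summarized programmatically. The sketch below works on a hand-written stand-in for the dictionary returned by `json.loads(pipeline.definition())`. The key names (`Parameters`, `Steps`, `Name`, `Type`, `DefaultValue`) reflect the pipeline definition format as understood here and should be checked against the actual output; the helper `summarize_definition` is hypothetical.

```python
# Hypothetical, trimmed pipeline definition used only for illustration
example_definition = {
    "Version": "2020-12-01",
    "Parameters": [
        {"Name": "InstanceType", "Type": "String", "DefaultValue": "ml.m5.4xlarge"},
        {"Name": "InstanceCount", "Type": "Integer", "DefaultValue": 1},
    ],
    "Steps": [
        {"Name": "DataWranglerProcessingStep", "Type": "Processing", "Arguments": {}},
        {"Name": "DataWranglerTrain", "Type": "Training", "Arguments": {}},
        {"Name": "DataWranglerRegisterModel", "Type": "RegisterModel", "Arguments": {}},
    ],
}


def summarize_definition(definition):
    """Return ({parameter name: default}, [(step name, step type), ...])."""
    parameters = {
        p["Name"]: p.get("DefaultValue") for p in definition.get("Parameters", [])
    }
    steps = [(s["Name"], s["Type"]) for s in definition.get("Steps", [])]
    return parameters, steps


example_parameters, example_steps = summarize_definition(example_definition)
print(example_parameters)
print(example_steps)
```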

## 2. Submit the pipeline to SageMaker and start execution

Submit the pipeline definition to the SageMaker Pipeline service and start an execution. The role passed in 
will be used by the Pipeline service to create all the jobs defined in the steps.

In [None]:
iam_role = sagemaker.get_execution_role()
pipeline.upsert(role_arn=iam_role)
execution = pipeline.start()

### Pipeline Operations: Examine and Wait for Pipeline Execution

Wait for the pipeline execution to complete.

In [None]:
execution.wait()

List the steps in the execution. These are the steps in the pipeline that have been resolved by the step 
executor service.

In [None]:
execution.list_steps()

You can visualize the pipeline execution status and details in Studio. For details, please refer to 
[View, Track, and Execute SageMaker Pipelines in SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-studio.html).

# Part 3: Cleanup

## Pipeline cleanup
Set the `pipeline_deletion` flag below to `True` to delete the SageMaker Pipeline created in this notebook.

In [None]:
pipeline_deletion = True

In [None]:
if pipeline_deletion:
    pipeline.delete()

## Model cleanup
Set the `model_deletion` flag below to `True` to delete the model package versions and the model package group registered by this notebook.

In [None]:
model_deletion = True

In [None]:
if model_deletion:
    model_package_group_name = register_step.steps[0].model_package_group_name

    model_package_list = sm_client.list_model_packages(
        ModelPackageGroupName = model_package_group_name
    )

    for model_package in model_package_list["ModelPackageSummaryList"]:
        sm_client.delete_model_package(
            ModelPackageName=model_package["ModelPackageArn"]
        )

    sm_client.delete_model_package_group(
        ModelPackageGroupName = model_package_group_name
    )
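The cell above reads a single `list_model_packages` response. That API returns results in pages linked by `NextToken`, so a group with many versions may need several calls before the group can be deleted. The sketch below shows one way to follow the token; the helper `iter_model_package_arns` and the fake client function are hypothetical, and in this notebook `sm_client.list_model_packages` would be passed as `list_page`.

```python
def iter_model_package_arns(list_page, group_name):
    """Yield every model package ARN in a group, following NextToken."""
    kwargs = {"ModelPackageGroupName": group_name}
    while True:
        response = list_page(**kwargs)
        for summary in response["ModelPackageSummaryList"]:
            yield summary["ModelPackageArn"]
        token = response.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token


# Fake two-page response standing in for the real API, for illustration only
def fake_list_page(ModelPackageGroupName, NextToken=None):
    pages = {
        None: {
            "ModelPackageSummaryList": [{"ModelPackageArn": "arn:example:1"}],
            "NextToken": "page-2",
        },
        "page-2": {
            "ModelPackageSummaryList": [{"ModelPackageArn": "arn:example:2"}],
        },
    }
    return pages[NextToken]


example_arns = list(iter_model_package_arns(fake_list_page, "DataWrangler-PackageGroup"))
print(example_arns)
```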

## Experiment cleanup
Set `experiment_deletion` flag below to `True` to delete the SageMaker Experiment and Trials created by the Pipeline execution in this notebook.

In [None]:
experiment_deletion = True

In [None]:
from botocore.exceptions import ClientError

if experiment_deletion:
    experiment_name = pipeline_name
    trial_name = execution.arn.split("/")[-1]
    
    components_in_trial = sm_client.list_trial_components(TrialName=trial_name)
    print('TrialComponentNames:')
    for component in components_in_trial['TrialComponentSummaries']:
        component_name = component['TrialComponentName']
        print(f"\t{component_name}")
        sm_client.disassociate_trial_component(TrialComponentName=component_name, TrialName=trial_name)
        try:
            # comment out to keep trial components
            sm_client.delete_trial_component(TrialComponentName=component_name)
        except ClientError:
            # component is associated with another trial
            continue
        # to prevent throttling
        time.sleep(.5)
    sm_client.delete_trial(TrialName=trial_name)
    try:
        sm_client.delete_experiment(ExperimentName=experiment_name)
        print(f"\nExperiment {experiment_name} deleted")
    except ClientError:
        # experiment already existed and had other trials
        print(f"\nExperiment {experiment_name} in use by other trials. Will not delete")