## Introduction

This notebook shows how to use the AWS Step Functions Data Science SDK to create and manage workflows. The Step Functions SDK is an open source library that lets data scientists create and execute machine learning workflows using AWS Step Functions and Amazon SageMaker. For more information, see the following:
* [AWS Step Functions](https://aws.amazon.com/step-functions/)
* [AWS Step Functions Developer Guide](https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html)
* [AWS Step Functions Data Science SDK](https://aws-step-functions-data-science-sdk.readthedocs.io)

In this notebook we will use the SDK to create steps, link them together to create a workflow, and execute the workflow in AWS Step Functions. 

In [594]:
# import sys
# !{sys.executable} -m pip install --upgrade pip
# !{sys.executable} -m pip install -qU awscli boto3 "sagemaker>=2.0.0"
# !{sys.executable} -m pip install -qU "stepfunctions>=2.0.0"
# !{sys.executable} -m pip show sagemaker stepfunctions

## Prerequisites

It is assumed that the Lambda function that checks whether the model already exists, and the required IAM roles for SageMaker and Step Functions, have already been created. <br/>
This notebook uses the Step Functions Data Science SDK together with SageMaker.


## 1. Preprocessing logic script

The preprocessing script below is uploaded to S3 and used by the preprocessing job. The scripts in this section contain the logic written for the processing activities. Once a script is uploaded to S3, its location can be passed to the job as a parameter.

In [1]:
v_workflow_execution_role = "arn:aws:iam::014257795134:role/ds-mlops-stepfunction-role" # Step function IAM role ARN
v_preprocessing_iam_role = "arn:aws:iam::014257795134:role/ds-mlops-sagemaker-role" # IAM role for preprocessing container
v_preprocessing_instance_type = "ml.m5.xlarge" # Instance type for preprocessing container; adjust it to the workload
v_s3_input_bucket = "ds-mlops-s3" # S3 bucket for input and output data
v_prefix_for_input_data = "transformed/monitoring/inbound/ll/scoreinput"  # Prefix where data is stored
v_prefix_for_score_output = "transformed/monitoring/outbound/ll/scoreoutput/ll"  # Prefix where data is stored
v_prefix_for_code_location = "code/score_ll_processing_script.py" # prefix where code is stored
v_prefix_for_post_code_loc = "code/score_ll_post_processing_script.py" # prefix where code is stored
v_score_instance_type = "ml.m5.xlarge" # Instance type for the batch transform (scoring) job
v_validation_scoring_instance_type = "ml.m5.large" # Instance type for batch scoring
v_model_name = "ds-mlops-linear-learner-02" # Name of the DS_MLOPS model to use
# Set the model name above to run this for XGBoost or Linear Learner
v_region = 'us-east-1' # AWS region

# VV added after design review
sec_groups = ["sg-01d629a900f9b4d92"]
subnets = ["subnet-07bd1dfe6aee76227",
           "subnet-076950ecc89d4340b",
           "subnet-0c5a462cb45a14bab"]

# Files should not be over-written

In [2]:
%%writefile score_ll_processing_script.py
import pandas as pd
import json
import glob
#path = r'/opt/ml/processing/input' # Input path
path='/opt/ml/processing/input'
all_files = glob.glob(path + "/*.csv.out")
# all_files=['s3://ds-mlops-s3/data/scoreoutput/lr/2022/02/17/18/batchscoring.csv.out']
counter = 0
print(all_files)
for filename in all_files:
    print("hi")
    df = pd.read_csv(filename,header=None)
    df = df.sample(frac=.25)
    df=df.iloc[:,1:]
    # Create a multiline json
    json_list = json.loads(df.to_json(orient = "records"))
    output_path = "/opt/ml/processing/outputfile"
    print(output_path)
    counter = counter + 1
    data = {}
    data["captureData"]={
            "endpointInput": {
                "observedContentType": "text/csv",
                "mode": "INPUT",
                "data": "132,25,113.2,96,269.9,107,229.1,87,7.1,7,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1",
                "encoding": "CSV"
            },
            "endpointOutput": {
                "observedContentType": "text/csv; charset=utf-8",
                "mode": "OUTPUT",
                "data": "6295.23876953125",
                "encoding": "CSV"
            }
        }
    data["eventMetadata"] = {
            "eventId": "",
            "inferenceTime": "2"
        }
    data["eventVersion"] = "0"
    with open(output_path, 'w') as f:
        print(output_path)
        for item in json_list:
            item = list(item.values())
            inpitem = ','.join([str(elem) for elem in item[:-1]])
            #outitem = ','.join([str(elem) for elem in item[-1]])
            outitem=str(item[-1])
            data["captureData"]["endpointInput"]["data"] = inpitem
            data["captureData"]["endpointOutput"]["data"] =outitem
            f.write(json.dumps(data) + "\n")  # Write one valid JSON object per line
# Data push

Overwriting score_ll_processing_script.py


In [3]:
!aws s3 cp score_ll_processing_script.py s3://$v_s3_input_bucket/$v_prefix_for_code_location # Uploading preprocessing code to S3


upload: ./score_ll_processing_script.py to s3://ds-mlops-s3/code/score_ll_processing_script.py
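
The script above writes one capture-style record per scored row. As a minimal sketch of that record layout (assuming, as in the script, that the last column of a row is the model output and the remaining columns are the model input), the hypothetical helper `make_capture_record` builds one record and serializes it with `json.dumps`, so that each line of the output file is a valid JSON object. The helper name and the sample row are illustrative and are not part of the notebook.

```python
import json


def make_capture_record(row):
    """Build one capture-style record; the last value in `row` is treated as the model output."""
    values = [str(v) for v in row]
    return {
        "captureData": {
            "endpointInput": {
                "observedContentType": "text/csv",
                "mode": "INPUT",
                "data": ",".join(values[:-1]),
                "encoding": "CSV",
            },
            "endpointOutput": {
                "observedContentType": "text/csv; charset=utf-8",
                "mode": "OUTPUT",
                "data": values[-1],
                "encoding": "CSV",
            },
        },
        "eventMetadata": {"eventId": "", "inferenceTime": "2"},
        "eventVersion": "0",
    }


# Illustrative row: three input features followed by one prediction.
sample_row = [132, 25, 113.2, 6295.23876953125]
json_line = json.dumps(make_capture_record(sample_row))
print(json_line)
```

Building a fresh dictionary for every row also avoids reusing one mutable `data` object across records.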


In [4]:
%%writefile score_ll_post_processing_script.py
import pandas as pd
import json
import glob
#path = r'/opt/ml/processing/input' # Input path
path='/opt/ml/processing/input'
all_files = glob.glob(path + "/*.csv.out")
# all_files=['s3://ds-mlops-s3/data/scoreoutput/lr/2022/02/17/18/batchscoring.csv.out']
counter = 0
print(all_files)
for filename in all_files:
    print("hi")
    df = pd.read_csv(filename,header=None)
    df = df.sample(frac=.25)
    df=df.iloc[:,1:]
    # Create a multiline json
    json_list = json.loads(df.to_json(orient = "records"))
    output_path = "/opt/ml/processing/ll/outputfile.jsonl" #path to the linear learner
    print(output_path)
    counter = counter + 1
    data = {}
    data["captureData"]={
            "endpointInput": {
                "observedContentType": "text/csv",
                "mode": "INPUT",
                "data": "132,25,113.2,96,269.9,107,229.1,87,7.1,7,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1",
                "encoding": "CSV"
            },
            "endpointOutput": {
                "observedContentType": "text/csv; charset=utf-8",
                "mode": "OUTPUT",
                "data": "6295.23876953125",
                "encoding": "CSV"
            }
        }
    data["eventMetadata"] = {
            "eventId": "",
            "inferenceTime": "2"
        }
    data["eventVersion"] = "0"
    with open(output_path, 'w') as f:
        print(output_path)
        for item in json_list:
            item = list(item.values())
            inpitem = ','.join([str(elem) for elem in item[:-1]])
            #outitem = ','.join([str(elem) for elem in item[-1]])
            outitem=str(item[-1])
            data["captureData"]["endpointInput"]["data"] = inpitem
            data["captureData"]["endpointOutput"]["data"] =outitem
            f.write(json.dumps(data) + "\n")  # Write one valid JSON object per line
# Data push



Overwriting score_ll_post_processing_script.py


In [5]:
!aws s3 cp score_ll_post_processing_script.py s3://$v_s3_input_bucket/$v_prefix_for_post_code_loc # Uploading post-processing code to S3


upload: ./score_ll_post_processing_script.py to s3://ds-mlops-s3/code/score_ll_post_processing_script.py


## 2. Parameters

The parameters that must be changed in order to run the SDK code below are the `v_`-prefixed variables set in the first code cell of section 1.


In [6]:
# pwd

In [7]:
# !aws s3 cp /home/ec2-user/SageMaker/Updated_Notebooks_Aug5/scoringdata.csv s3://ds-mlops-s3/data/scoreinput/

## 3. Import the required modules from the SDK

In [8]:
import sagemaker
import stepfunctions
import logging

from stepfunctions.steps import *
from stepfunctions.workflow import Workflow
from stepfunctions import steps
from stepfunctions.inputs import ExecutionInput
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput
import uuid
from sagemaker.inputs import TrainingInput
import boto3
from sagemaker.network import NetworkConfig

stepfunctions.set_stream_logger(level=logging.INFO)

## 4. Create workflow

In the following cells, you will define the steps used in the workflow. Then you will create, visualize, and execute the workflow.

Steps relate to states in AWS Step Functions. For more information, see [States](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-states.html) in the *AWS Step Functions Developer Guide*. For more information on the AWS Step Functions Data Science SDK APIs, see: https://aws-step-functions-data-science-sdk.readthedocs.io. 

## 4.1 Creating Pre-Processing step

In [9]:
processor = Processor(
    image_uri='683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3',
    role=v_preprocessing_iam_role,
    instance_count=1,
    instance_type=v_preprocessing_instance_type,
    network_config=NetworkConfig(security_group_ids=sec_groups, subnets=subnets),
)

In [10]:
import datetime
now = datetime.datetime.now()  # Take one timestamp so all four parts are consistent
year = now.strftime("%Y")
month = now.strftime("%m")
day = now.strftime("%d")
hour = now.strftime("%H")
print(year)
print(month)
print(day)
print(hour)
# batchscoringprefix=year,"/",month,"/",day/hour
# print(year/month/day/hour)
outputloc = "s3://{}/{}/{}/{}/{}/{}".format(v_s3_input_bucket, v_prefix_for_score_output, year, month, day, hour)  # Output location for the batch scoring
print(outputloc)

2022
03
28
02
s3://ds-mlops-s3/transformed/monitoring/outbound/ll/scoreoutput/ll/2022/03/28/02
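
The cell above builds a date-partitioned output location of the form `prefix/YYYY/MM/DD/HH`. The sketch below shows the same idea as a small function that formats a single timestamp once, which keeps the four components consistent even if the cell happens to run across an hour boundary. The function name `dated_s3_prefix` is hypothetical, and the fixed timestamp is used only so that the example is reproducible.

```python
import datetime


def dated_s3_prefix(bucket, prefix, when):
    """Return s3://bucket/prefix/YYYY/MM/DD/HH for a single timestamp."""
    return "s3://{}/{}/{}".format(bucket, prefix, when.strftime("%Y/%m/%d/%H"))


# Fixed timestamp matching the run shown above; use datetime.datetime.now() in practice.
example = dated_s3_prefix(
    "ds-mlops-s3",
    "transformed/monitoring/outbound/ll/scoreoutput/ll",
    datetime.datetime(2022, 3, 28, 2),
)
print(example)
```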


In [11]:
input_data = "s3://{}/{}".format("ds-mlops-s3","data/scoreinput/scoringdata.csv")
input_code = "s3://{}/{}".format(v_s3_input_bucket,v_prefix_for_code_location)
output_data = "s3://{}/{}".format(v_s3_input_bucket,"preprocess-data/score")

inputs = [
    ProcessingInput(
        source=input_data, destination="/opt/ml/processing/input", input_name="input"
    ),
    ProcessingInput(
        source=input_code,
        destination="/opt/ml/processing/input/code",
        input_name="code",
    ),
]

outputs = [
    ProcessingOutput(
        source="/opt/ml/processing/ll",
        destination="{}/{}".format(output_data, "ll"),
        output_name="ll_data",
    )
    
]
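
A SageMaker Processing job uploads what the container writes under each `ProcessingOutput` source directory to the matching S3 destination. Files written elsewhere in the container are not covered by that output. The sketch below is a hypothetical local check, not part of the SageMaker SDK, that tells whether a path written by a script falls under one of the declared source directories. The two paths tested are the `output_path` values used by the scripts in section 1.

```python
from pathlib import PurePosixPath


def is_under_output_source(file_path, output_sources):
    """Return True if file_path lies inside one of the declared output source directories."""
    path = PurePosixPath(file_path)
    return any(PurePosixPath(src) in path.parents for src in output_sources)


declared_sources = ["/opt/ml/processing/ll"]
print(is_under_output_source("/opt/ml/processing/ll/outputfile.jsonl", declared_sources))
print(is_under_output_source("/opt/ml/processing/outputfile", declared_sources))
```

Running a check like this before submitting a job can catch a script that writes to a directory the job never uploads.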

In [12]:
# SageMaker expects unique names for each job, model and endpoint.
# If these names are not unique, the execution will fail. Pass them dynamically for each execution using placeholders.

# VV updated after review

execution_input = ExecutionInput(
    schema={
        "PreprocessingJobName": str,
        "scoringstep": str,
        "PostprocessingJobName": str,
    }
)
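
The schema above names the three placeholders that every execution must supply. As a minimal sketch, the hypothetical helper `check_execution_inputs` below performs a local pre-flight check of an inputs dictionary against such a schema before the workflow is started. It is an illustration only and does not describe how the SDK itself handles the schema. The sample job names are made up.

```python
def check_execution_inputs(schema, inputs):
    """Raise ValueError if inputs has a missing key, an unexpected key, or a wrong value type."""
    missing = sorted(set(schema) - set(inputs))
    unexpected = sorted(set(inputs) - set(schema))
    wrong_type = sorted(
        key for key in schema.keys() & inputs.keys()
        if not isinstance(inputs[key], schema[key])
    )
    if missing or unexpected or wrong_type:
        raise ValueError(
            "missing={}, unexpected={}, wrong_type={}".format(missing, unexpected, wrong_type)
        )
    return inputs


schema = {"PreprocessingJobName": str, "scoringstep": str, "PostprocessingJobName": str}
good_inputs = {
    "PreprocessingJobName": "pre-001",
    "scoringstep": "score-001",
    "PostprocessingJobName": "post-001",
}
check_execution_inputs(schema, good_inputs)  # passes silently
```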

In [13]:
preprocessing_step = ProcessingStep(
    state_id='Pre-processing', 
    processor=processor,
    job_name=execution_input["PreprocessingJobName"], 
    inputs=inputs, 
    outputs=outputs, 
    experiment_config=None, 
    container_entrypoint=["python3", "/opt/ml/processing/input/code/score_ll_processing_script.py"], # DS needs to change this directory /path
    wait_for_completion=True
)

## 4.5 Create a batch transform step

Once the steps above are defined, we score a small data set to confirm that all the components work together.

In [14]:
sagemaker_execution_role = sagemaker.get_execution_role()
lr = sagemaker.transformer.Transformer(
    model_name=v_model_name,
    instance_count=1,
    instance_type=v_score_instance_type,
    assemble_with='Line',
    output_path=outputloc,  # "s3://{}/{}".format(v_s3_input_bucket, v_prefix_for_score_output)
    accept='text/csv',
    base_transform_job_name='scorelinearlearner',
)


In [15]:
from sagemaker.inputs import TransformInput

batch_scoring = TransformStep(
    state_id="batchscoring-step",
    job_name=execution_input["scoringstep"],
    transformer=lr,
    model_name=v_model_name,
    data="{}/{}".format(output_data, "test"), # location for test data
    data_type='S3Prefix',
    content_type="text/csv",
    split_type='Line',
    wait_for_completion=True,
    input_filter="$[1:]",
    join_source='Input'
       
)
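
The transform step above combines `input_filter="$[1:]"` with `join_source='Input'`. For CSV records, this means the first column is dropped before the record is sent to the model, and the prediction is then appended to the original input record in the output. The sketch below is a simplified local illustration of that behavior for a single CSV line. The function `simulate_batch_record` and the stand-in model `fake_model` are hypothetical and do not call SageMaker.

```python
def simulate_batch_record(csv_line, predict):
    """Mimic input_filter="$[1:]" plus join_source="Input" for one CSV record."""
    columns = csv_line.split(",")
    features = columns[1:]  # input_filter "$[1:]": drop the first column
    prediction = predict(features)  # stand-in for the model call
    return ",".join(columns + [str(prediction)])  # join_source "Input": input record + output


def fake_model(features):
    """Hypothetical model: returns the sum of the feature values."""
    return sum(float(x) for x in features)


print(simulate_batch_record("id-7,1.5,2.5", fake_model))
```

This layout is why the post-processing script drops the first column and treats the last column as the model output.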

In [16]:
ses="{}/{}/".format(output_data, "ll")
print(ses)

s3://ds-mlops-s3/preprocess-data/score/ll/


## Post Processing

In [17]:
input_data = outputloc
input_code = "s3://{}/{}".format(v_s3_input_bucket,v_prefix_for_post_code_loc)
output_data = outputloc

inputs = [
    ProcessingInput(
        source=input_data, destination="/opt/ml/processing/input", input_name="input"
    ),
    ProcessingInput(
        source=input_code,
        destination="/opt/ml/processing/input/code",
        input_name="code",
    ),
]

outputs = [
    ProcessingOutput(
        source="/opt/ml/processing/ll",
        destination="{}/{}".format(output_data, "ll"),
        output_name="test_data",
    )
    
]

In [18]:
postprocessing_step = ProcessingStep(
    state_id='Post-processing', 
    processor=processor,
    job_name=execution_input["PostprocessingJobName"], 
    inputs=inputs, 
    outputs=outputs, 
    experiment_config=None, 
    container_entrypoint=["python3", "/opt/ml/processing/input/code/score_ll_post_processing_script.py"], # DS needs to change this directory /path
    wait_for_completion=True
)

## 4.6 Chain together steps for the basic path

The following cell links together the steps you've created into a sequential group called `basic_path`. We chain the pre-processing, batch scoring, and post-processing steps to create the basic path. See [Chain](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/states.html#stepfunctions.steps.states.Chain) in the AWS Step Functions Data Science SDK documentation.

After chaining together the three steps for the basic path, we will visualize it.

In [19]:
# Chain the pre-processing, batch scoring, and post-processing steps in order
basic_path=Chain([preprocessing_step,batch_scoring,postprocessing_step])
# basic_path=Chain([preprocessing_step])
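
In Amazon States Language terms, a sequential chain means that each state names its successor in `Next` and the final state sets `End` to true. The sketch below builds such a skeleton for the three state IDs used in this notebook. It is a simplified illustration: the function `chain_to_asl` is hypothetical, it is not the SDK's implementation, and real task states also carry fields such as `Resource` and `Parameters`.

```python
import json


def chain_to_asl(state_names):
    """Build a minimal sequential state machine: each state points to the next, the last one ends."""
    states = {}
    for index, name in enumerate(state_names):
        state = {"Type": "Task"}
        if index + 1 < len(state_names):
            state["Next"] = state_names[index + 1]
        else:
            state["End"] = True
        states[name] = state
    return {"StartAt": state_names[0], "States": states}


definition = chain_to_asl(["Pre-processing", "batchscoring-step", "Post-processing"])
print(json.dumps(definition, indent=4))
```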


## 4.7 Define the workflow instance

The following cell defines the [workflow](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Workflow) with the path we just defined.

After defining the workflow, we will render the graph to see what our workflow looks like.

In [20]:
# Next, we define the workflow
basic_workflow = Workflow(
    name="ds-mlops-ll-score-step-function",
    definition=basic_path,
    role=v_workflow_execution_role
)

#Render the workflow
basic_workflow.render_graph()

## 4.8 Review the Amazon States Language code for your workflow

The following renders the JSON of the [Amazon States Language](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-amazon-states-language.html) definition of the workflow you created. 

In [21]:
print(basic_workflow.definition.to_json(pretty=True)) # This JSON is later used to create the parameterized CloudFormation template

{
    "StartAt": "Pre-processing",
    "States": {
        "Pre-processing": {
            "Resource": "arn:aws:states:::sagemaker:createProcessingJob.sync",
            "Parameters": {
                "ProcessingJobName.$": "$$.Execution.Input['PreprocessingJobName']",
                "ProcessingInputs": [
                    {
                        "InputName": "input",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://ds-mlops-s3/data/scoreinput/scoringdata.csv",
                            "LocalPath": "/opt/ml/processing/input",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    },
                    {
                        "InputName": "code",
                        "AppManag

## 4.9 Create the workflow on AWS Step Functions

Create the workflow in AWS Step Functions with [create](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Workflow.create).

In [22]:
basic_workflow.create()

[ERROR] A workflow with the same name already exists on AWS Step Functions. To update a workflow, use Workflow.update().


'arn:aws:states:us-east-1:014257795134:stateMachine:ds-mlops-ll-score-step-function'

In [23]:
basic_workflow.update(definition=basic_workflow.definition,role=basic_workflow.role)

[INFO] Workflow updated successfully on AWS Step Functions. All execute() calls will use the updated definition and role within a few seconds.


'arn:aws:states:us-east-1:014257795134:stateMachine:ds-mlops-ll-score-step-function'

## 5 Execute the workflow

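Run the workflow with [execute](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Workflow.execute). Because the workflow runs a pre-processing job, a batch transform job, and a post-processing job in sequence, each with `wait_for_completion=True`, the execution takes as long as those jobs need to finish.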

In [24]:
# Generate unique names for the pre-processing, batch scoring, and post-processing jobs in the Step Functions workflow
preprocessing_job_name = "ll-score-preprocessing-{}".format(
    uuid.uuid1().hex
)  # Each pre-processing job requires a unique name
scoring_job_name = "ll-score-{}".format(
    uuid.uuid1().hex
)  # Each batch transform (scoring) job requires a unique name
batch_scoring_job_name = "ll-post-preprocessing-{}".format(
    uuid.uuid1().hex
)  # Despite its name, this variable holds the post-processing job name
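
Job names built this way need to be unique and must also fit SageMaker's naming rules. The sketch below assumes those rules are: at most 63 characters, only ASCII letters, digits, and hyphens, with a letter or digit at both ends. Check the current service documentation before relying on that assumption. The helpers `is_valid_job_name` and `make_job_name` are hypothetical, and `uuid4` is used here in place of `uuid1` so that the suffix is random rather than derived from the host and time.

```python
import uuid


def is_valid_job_name(name):
    """Check the assumed SageMaker job-name rules (length and allowed characters)."""
    return (
        0 < len(name) <= 63
        and name.isascii()
        and name[0].isalnum()
        and name[-1].isalnum()
        and all(ch.isalnum() or ch == "-" for ch in name)
    )


def make_job_name(prefix):
    """Return prefix plus a random 32-character hex suffix, or raise if the result is invalid."""
    name = "{}-{}".format(prefix, uuid.uuid4().hex)
    if not is_valid_job_name(name):
        raise ValueError("invalid job name: {}".format(name))
    return name


print(make_job_name("ll-score-preprocessing"))
```

With a 32-character suffix and one hyphen, a prefix can be at most 30 characters long under the assumed 63-character limit.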

In [25]:
basic_workflow_execution = basic_workflow.execute(
    inputs={
        "PreprocessingJobName": preprocessing_job_name,  # Each SageMaker processing job requires a unique name
        "scoringstep": scoring_job_name,  # Each batch transform job requires a unique name
        "PostprocessingJobName": batch_scoring_job_name,
    }
)

[INFO] Workflow execution started successfully on AWS Step Functions.


## 5.1 Review the execution progress

Render workflow progress with [render_progress](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Execution.render_progress).

This generates a snapshot of the current state of your workflow as it executes. This is a static image. Run the cell again to check progress. 

In [26]:
basic_workflow_execution.render_progress()

## Clean-up steps

For cleanup guidance, see https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html

In [548]:
# # Clean up endpoint (set endpoint_name first; it is not defined in this notebook)
# client = boto3.client("sagemaker", region_name=v_region)
# response = client.delete_endpoint(EndpointName=endpoint_name)