## Introduction

This notebook describes using the AWS Step Functions Data Science SDK to create and manage workflows. The Step Functions SDK is an open source library that allows data scientists to easily create and execute machine learning workflows using AWS Step Functions and Amazon SageMaker. For more information, see the following.
* [AWS Step Functions](https://aws.amazon.com/step-functions/)
* [AWS Step Functions Developer Guide](https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html)
* [AWS Step Functions Data Science SDK](https://aws-step-functions-data-science-sdk.readthedocs.io)

In this notebook we will use the SDK to create steps, link them together to create a workflow, and execute the workflow in AWS Step Functions. 

In [7]:
# import sys
# !{sys.executable} -m pip install --upgrade pip
# !{sys.executable} -m pip install -qU awscli boto3 "sagemaker>=2.0.0"
# !{sys.executable} -m pip install -qU "stepfunctions>=2.0.0"
# !{sys.executable} -m pip show sagemaker stepfunctions

In [1]:
from sagemaker.processing import Processor

In [2]:
print(Processor.JOB_CLASS_NAME)

processing-job


## Prerequisites

This notebook assumes that the Lambda function that checks whether the model already exists, and the required IAM roles for SageMaker and Step Functions, have already been created. <br/>
In this notebook we use the SageMaker support built into the Step Functions SDK.

<span style="color:red">**AP: Do these items have to be configured again for new projects? I suppose at least the model existence check has to be configured once every time a new model is created/added, right?**</span>

<span style="color:green">**Charles: No, they don't need to be. The same logic applies. In the individual cells below, I have commented on which portions of the code need changes if the project changes. The same applies to the XGBoost model, for example.**</span>

## 1. Preprocessing logic script

Below is the preprocessing script that the processing job will run. It contains the logic we developed for our preprocessing activities. We write it to a local file here, upload it to S3 in a later step, and then pass its S3 location to the processing job as an input.

In [8]:
%%writefile ds-mlops-preprocessing-linear-learner-script.py
# Importing required library
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
# you can put any value here according to your situation
chunksize = 10000
from sklearn import preprocessing

data = pd.read_csv('/opt/ml/processing/input/inputdata.csv') # Note this is the input location where we will get our data set downloaded
df = data.copy()
df.drop(columns=['Unnamed: 0', 'id', 'url', 'region', 'region_url', 'VIN', 'size', 'image_url', 'description', 'state', 'lat', 'long'], inplace=True)
imr = SimpleImputer(strategy='mean')
imr = imr.fit(df[['odometer']])
imputed_data = imr.transform(df[['odometer']])
df['odometer'] = pd.DataFrame(imputed_data)
def encode_features(dataframe):
    result = dataframe.copy()
    encoders = {}
    for column in result.columns:
        if result.dtypes[column] == object:  # np.object was removed in newer NumPy releases
            encoders[column] = preprocessing.LabelEncoder()
            result[column] = encoders[column].fit_transform(result[column])
    return result, encoders

# 'description'
encoded_df, encoders = encode_features(df.astype(str))
encoded_df.ffill(inplace=True)  # forward-fill; equivalent to the deprecated fillna(method='pad')
train_data, validation_data, test_data = np.split(encoded_df.sample(frac=1, random_state=1729), [int(0.7 * len(encoded_df)), int(0.9*len(encoded_df))]) # Splitting dataset 
train_data.to_csv('/opt/ml/processing/train/train.csv', index=False, header=False) # train data
validation_data.to_csv('/opt/ml/processing/validation/validation_data.csv', index=False, header=False) # validation data
test_data = test_data.iloc[:,1:] # removing column where we have to do predictions
test_data.to_csv('/opt/ml/processing/test/test.csv', index=False, header=False) # test data 

Overwriting ds-mlops-preprocessing-linear-learner-script.py
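The script splits the shuffled data 70/20/10 by passing two cut points to `np.split`. As a minimal sketch of the same idea using only the standard library, the hypothetical helper `split_rows` below (not part of the script) shuffles a list and cuts it at the same two fractions:

```python
import random


def split_rows(rows, train_frac=0.7, validation_end=0.9, seed=1729):
    """Shuffle rows and cut them at two points, mirroring the np.split call in the script."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    first_cut = int(train_frac * len(shuffled))       # end of the training slice
    second_cut = int(validation_end * len(shuffled))  # end of the validation slice
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]


train, validation, test = split_rows(range(100))
print(len(train), len(validation), len(test))
```

The second cut point is an absolute position (90% of the rows), not a fraction of what remains, which is why the validation slice ends up with 20% of the data.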


## 2. Parameters

Below is the list of parameters that must be changed to run this notebook in a different environment.


In [3]:
import sagemaker

In [4]:
v_workflow_execution_role = "arn:aws:iam::014257795134:role/ds-mlops-stepfunction-role" # Step function IAM role ARN
v_preprocessing_iam_role = "arn:aws:iam::014257795134:role/ds-mlops-sagemaker-role" # IAM role for preprocessing container
v_preprocessing_instance_type = "ml.m5.xlarge" # Instance type for preprocessing container it changes as per workload
v_s3_input_bucket = "ds-mlops-s3" # S3 bucket for input and output data
v_prefix_for_input_data = "data/input/inputdata.csv"  # Prefix where data is stored
v_prefix_for_code_location = "code/ds-mlops-preprocessing_linear-learner_script.py" # prefix where code is stored
v_lambda_function_name = "ds-mlops-linear-learner-lambda-test" # Name of lambda function for triggering training pipeline.
v_region = 'us-east-1' # AWS region
v_model_container = sagemaker.image_uris.retrieve('linear-learner', v_region) # Linear Learner container image
v_train_instance_type = "ml.m5.xlarge" # Instance type for training
v_validation_scoring_instance_type = "ml.m5.xlarge" # Instance type for batch scoring
v_model_name = "ds-mlops-linear-learner-02" # Name of DS_MLOPS model to be kept

<span style="color:red">**AP: Would it make sense to specify a project string or sub-bucket in addition to this?**</span>
    
<span style="color:red">v_prefix_for_code_location = "code/ds-mlops-preprocessing_linear-learner_script.py" # prefix where code is stored</span>

<span style="color:green">**Charles: You could use all the aforementioned components and add the string that you've mentioned. With everything else unchanged, you could add a project string and change the file names so that updates follow accordingly. You can use the same bucket and the rest of the parameters.**</span>
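One way to follow this suggestion is to build every S3 location from the bucket, a project string, and the existing prefix. The sketch below is only an illustration: `s3_uri` is a hypothetical helper and `v_project` is an assumed new parameter, neither of which exists elsewhere in this notebook.

```python
def s3_uri(bucket, *parts):
    """Join a bucket name and key parts into an s3:// URI, skipping empty parts."""
    key = "/".join(p.strip("/") for p in parts if p and p.strip("/"))
    return "s3://{}/{}".format(bucket, key) if key else "s3://{}".format(bucket)


v_s3_input_bucket = "ds-mlops-s3"
v_project = "used-cars"  # hypothetical project string, not defined in this notebook

code_uri = s3_uri(v_s3_input_bucket, v_project, "code/ds-mlops-preprocessing_linear-learner_script.py")
data_uri = s3_uri(v_s3_input_bucket, v_project, "data/input/inputdata.csv")
print(code_uri)
print(data_uri)
```

With this layout the bucket and file names stay the same across projects, and only `v_project` changes.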



## 3. Import the required modules from the SDK and upload the code to S3

In [5]:
import stepfunctions
import logging

from stepfunctions.steps import *
from stepfunctions.workflow import Workflow
from stepfunctions.inputs import ExecutionInput
from sagemaker.processing import Processor,ProcessingInput, ProcessingOutput
import uuid
import sagemaker
from sagemaker.inputs import TrainingInput
import boto3
from sagemaker.network import NetworkConfig

sec_groups = ["sg-01d629a900f9b4d92"]
subnets = ["subnet-07bd1dfe6aee76227",
           "subnet-076950ecc89d4340b",
           "subnet-0c5a462cb45a14bab"]
stepfunctions.set_stream_logger(level=logging.INFO)

In [6]:
!aws s3 cp ds-mlops-preprocessing-linear-learner-script.py s3://$v_s3_input_bucket/$v_prefix_for_code_location # Uploading preprocessing code on s3

upload: ./ds-mlops-preprocessing-linear-learner-script.py to s3://ds-mlops-s3/code/ds-mlops-preprocessing_linear-learner_script.py


## 4. Create workflow

In the following cells, you will define the steps used in the workflow. Then you will create, visualize, and execute the workflow.

Steps relate to states in AWS Step Functions. For more information, see [States](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-states.html) in the *AWS Step Functions Developer Guide*. For more information on the AWS Step Functions Data Science SDK APIs, see: https://aws-step-functions-data-science-sdk.readthedocs.io. 

## 4.1 Creating Pre-Processing step

In [8]:
processor = Processor(image_uri='683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3',
                      network_config=NetworkConfig(security_group_ids=sec_groups, subnets=subnets),
                      role=v_preprocessing_iam_role,
                      instance_count=1,
                      instance_type=v_preprocessing_instance_type)

In [9]:
input_data = "s3://{}/{}".format(v_s3_input_bucket,v_prefix_for_input_data)
input_code = "s3://{}/{}".format(v_s3_input_bucket,v_prefix_for_code_location)
output_data = "s3://{}/{}".format(v_s3_input_bucket,"preprocess-data")

inputs = [
    ProcessingInput(
        source=input_data, destination="/opt/ml/processing/input", input_name="input"
    ),
    ProcessingInput(
        source=input_code,
        destination="/opt/ml/processing/input/code",
        input_name="code",
    ),
]

outputs = [
    ProcessingOutput(
        source="/opt/ml/processing/train",
        destination="{}/{}".format(output_data,"train"),
        output_name="train_data",
    ),
    ProcessingOutput(
        source="/opt/ml/processing/test",
        destination="{}/{}".format(output_data, "test"),
        output_name="test_data",
    ),
    ProcessingOutput(
        source="/opt/ml/processing/validation",
        destination="{}/{}".format(output_data, "validation"),
        output_name="validation_data",
    )
]

In [10]:
# Generate unique names for Pre-Processing Job, Training Job, and Model Evaluation Job for the Step Functions Workflow
training_job_name = "linear-learner-training-{}".format(
    uuid.uuid1().hex
)  # Each Training Job requires a unique name
preprocessing_job_name = "linear-learner-preprocessing-{}".format(
    uuid.uuid1().hex
)  # Each Preprocessing job requires a unique name,
evaluation_job_name = "linear-learner-evaluation-{}".format(
    uuid.uuid1().hex
)  # Each Evaluation Job requires a unique name
scoring_job_name = "linear-learner-score-{}".format(
    uuid.uuid1().hex
)  # Each Scoring (batch transform) Job requires a unique name
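The four names above repeat the same pattern, so they can be generated by one helper that also checks the result before it is sent to SageMaker. This is a sketch under two assumptions that are not stated in this notebook: that job names are limited to 63 characters, and that they may contain only letters, digits, and hyphens without a leading or trailing hyphen. `make_job_name` is a hypothetical helper name.

```python
import re
import uuid

# Assumed constraints on SageMaker job names (see the lead-in above).
MAX_JOB_NAME_LENGTH = 63
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$")


def make_job_name(prefix):
    """Return '<prefix>-<32 hex chars>' and fail early if the name looks invalid."""
    name = "{}-{}".format(prefix, uuid.uuid4().hex)
    if len(name) > MAX_JOB_NAME_LENGTH or not JOB_NAME_PATTERN.match(name):
        raise ValueError("Invalid job name: {}".format(name))
    return name


training_job_name = make_job_name("linear-learner-training")
preprocessing_job_name = make_job_name("linear-learner-preprocessing")
print(training_job_name)
print(preprocessing_job_name)
```

Because the random suffix takes 33 characters including the hyphen, prefixes longer than 30 characters are rejected under the assumed limit.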

In [11]:
# SageMaker expects unique names for each job, model and endpoint.
# If these names are not unique the execution will fail. Pass these dynamically for each execution using placeholders.
execution_input = ExecutionInput(
    schema={
        "PreprocessingJobName": str,
        "TrainingJobName": str,
        "EvaluationProcessingJobName": str,
        "ModelName": str,
        "ScoreJobName":str
    }
)
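The schema above lists the keys that every execution must supply. As a local pre-flight check, the sketch below compares an inputs dictionary against such a schema before anything is submitted; `check_execution_inputs` is a hypothetical helper written for this illustration, not an SDK function.

```python
def check_execution_inputs(schema, inputs):
    """Return a list of problems; an empty list means the inputs match the schema."""
    problems = []
    for key, expected_type in schema.items():
        if key not in inputs:
            problems.append("missing key: {}".format(key))
        elif not isinstance(inputs[key], expected_type):
            problems.append("wrong type for {}: expected {}".format(key, expected_type.__name__))
    for key in inputs:
        if key not in schema:
            problems.append("unexpected key: {}".format(key))
    return problems


schema = {
    "PreprocessingJobName": str,
    "TrainingJobName": str,
    "EvaluationProcessingJobName": str,
    "ModelName": str,
    "ScoreJobName": str,
}
inputs = {
    "PreprocessingJobName": "job-a",
    "TrainingJobName": "job-b",
    "EvaluationProcessingJobName": "job-c",
    "ModelName": "ds-mlops-linear-learner-02",
}
print(check_execution_inputs(schema, inputs))
```

Here the check reports the missing `ScoreJobName` key, which would otherwise only surface when the workflow runs.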

In [12]:
preprocessing_step = ProcessingStep(
    state_id='Pre-processing', 
    processor=processor,
    job_name=execution_input["PreprocessingJobName"], 
    inputs=inputs, 
    outputs=outputs, 
    experiment_config=None, 
    container_entrypoint=["python3", "/opt/ml/processing/input/code/ds-mlops-preprocessing_linear-learner_script.py"], # Change this path if the script name or location changes
    wait_for_completion=True
)

## 4.2 Train trigger lambda function (Check if model with same name exists)

In the following cell, we define a lambda step that will invoke the previously created lambda function as part of our Step Function workflow. See [LambdaStep](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/compute.html#stepfunctions.steps.compute.LambdaStep) in the AWS Step Functions Data Science SDK documentation to learn more.

In [13]:
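client = boto3.client('sagemaker')  # SageMaker client
try:
    # Delete any existing model with this name so the workflow can recreate it
    client.delete_model(ModelName=v_model_name)
except client.exceptions.ClientError:
    # No model with this name exists yet, so there is nothing to delete
    pass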

In [14]:
lambda_step = compute.LambdaStep(
    'Start Training',
    parameters={  
        "FunctionName": v_lambda_function_name
    }
)
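The Lambda function itself is created outside this notebook and its code is not shown here. Purely as an illustration of what a model existence check could look like, the sketch below uses the SageMaker `list_models` API with its `NameContains` filter. The function names, the `ModelName` event field, and the returned dictionary are all assumptions; note that the `LambdaStep` above passes no payload, so a real function would need another way to receive the model name.

```python
def model_exists(sagemaker_client, model_name):
    """Return True if a SageMaker model with exactly this name exists."""
    kwargs = {"NameContains": model_name}
    while True:
        response = sagemaker_client.list_models(**kwargs)
        if any(m["ModelName"] == model_name for m in response.get("Models", [])):
            return True
        token = response.get("NextToken")
        if not token:
            return False
        kwargs["NextToken"] = token  # fetch the next page of results


def lambda_handler(event, context):
    """Hypothetical handler; the 'ModelName' event field is an assumption."""
    import boto3  # imported lazily so model_exists can be exercised without AWS access

    model_name = event["ModelName"]
    exists = model_exists(boto3.client("sagemaker"), model_name)
    return {"ModelName": model_name, "ModelExists": exists}
```

`NameContains` matches substrings, so the exact comparison inside `model_exists` is what prevents a model such as `ds-mlops-linear-learner-02-old` from being mistaken for `ds-mlops-linear-learner-02`.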

## 4.3 Train model

### Create a SageMaker Training Step 

In the following cells, we restore the tuned hyperparameters, define the estimator, and create the training step from that estimator. See  [TrainingStep](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/sagemaker.html#stepfunctions.steps.sagemaker.TrainingStep) in the AWS Step Functions Data Science SDK documentation to learn more.

In [4]:
# Restore the best hyperparameters stored by the experiment notebook.
%store -r ll_best_hyper_parameter

In [6]:
ll_best_hyper_parameter # The stored parameters are now available in this notebook. Export them from the experiment notebook and refer to them by name here instead of hardcoding values.

In [15]:
sess = sagemaker.Session()
training_output = 's3://{}/models'.format(v_s3_input_bucket) # model output locations
linear = sagemaker.estimator.Estimator(v_model_container,
                                       v_preprocessing_iam_role, 
                                       instance_count = 1, 
                                       instance_type = v_train_instance_type,
                                       output_path=training_output,
                                       sagemaker_session = sess,
                                       security_group_ids=sec_groups,
                                       subnets=subnets)

linear.set_hyperparameters(epochs = 50,
                           l1 = 0.00035080090763972647,
                           learning_rate = 0.000199542496841376,
                           mini_batch_size = 512,
                           predictor_type = "regressor")
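The values passed to `set_hyperparameters` above are hardcoded, although the intent is to reuse the values restored into `ll_best_hyper_parameter`. The sketch below shows one way to combine fixed defaults with stored values. It assumes that `ll_best_hyper_parameter` is a dictionary mapping hyperparameter names to values, which this notebook does not confirm, and `merge_hyperparameters` is a hypothetical helper.

```python
def merge_hyperparameters(defaults, stored):
    """Start from the defaults and let stored (tuned) values override them."""
    merged = dict(defaults)
    merged.update({k: v for k, v in (stored or {}).items() if v is not None})
    return merged


defaults = {"epochs": 50, "mini_batch_size": 512, "predictor_type": "regressor"}

# Assumed shape of the restored variable; the real contents come from the experiment notebook.
ll_best_hyper_parameter = {
    "l1": 0.00035080090763972647,
    "learning_rate": 0.000199542496841376,
}

hyperparameters = merge_hyperparameters(defaults, ll_best_hyper_parameter)
print(hyperparameters)
# In the notebook this would then be passed on as:
# linear.set_hyperparameters(**hyperparameters)
```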

In [16]:
training_step = TrainingStep(
    'Model Training(linear)', 
    estimator=linear,
    data={
        'train': TrainingInput("{}/{}".format(output_data,"train"), content_type='text/csv'),
        'validation': TrainingInput("{}/{}".format(output_data, "validation"), content_type='text/csv')
    },
    job_name=execution_input['TrainingJobName'],
    wait_for_completion=True
)

## 4.4 Create a Model

In the following cell, we define a model step that will create a model in Amazon SageMaker using the artifacts created during the TrainingStep. See  [ModelStep](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/sagemaker.html#stepfunctions.steps.sagemaker.ModelStep) in the AWS Step Functions Data Science SDK documentation to learn more.

The model creation step typically follows the training step. The Step Functions SDK provides the [get_expected_model](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/sagemaker.html#stepfunctions.steps.sagemaker.TrainingStep.get_expected_model) method in the TrainingStep class to provide a reference for the trained model artifacts. Please note that this method is only useful when the ModelStep directly follows the TrainingStep.

In [17]:
model_step = ModelStep(
    'Save Model',
    model=training_step.get_expected_model(),
    model_name=execution_input['ModelName'],
    result_path='$.ModelStepResults'
)

## 4.5 Create a batch transform step

Once all of the above steps are defined, we score a small data set with a batch transform job to confirm that all components work together.

In [18]:
from sagemaker.inputs import TransformInput

batch_scoring = TransformStep(
    state_id="validation-step",
    job_name=execution_input['ScoreJobName'],
    transformer=linear.transformer(instance_count=1,
                                instance_type=v_validation_scoring_instance_type),
    data="{}/{}".format(output_data, "test"), # location for test data
    model_name=execution_input['ModelName'],
    content_type="text/csv"
)

## 4.6 Chain together steps for the basic path

The following cell links together the steps you've created into a sequential group called `basic_path`. We chain the five steps defined above (pre-processing, Lambda trigger, training, model creation, and batch scoring) in order. See [Chain](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/states.html#stepfunctions.steps.states.Chain) in the AWS Step Functions Data Science SDK documentation.

After chaining the steps together, we will visualize the basic path as part of the workflow graph.

In [19]:
# Chain the five steps in execution order
basic_path=Chain([preprocessing_step, 
                  lambda_step,
                  training_step,
                  model_step,
                  batch_scoring])
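To make the effect of chaining concrete, the sketch below builds a bare Amazon States Language skeleton by hand: each state points to its successor with `Next`, and the last state is marked with `End`. It uses placeholder `Pass` states instead of the real SageMaker and Lambda task states, and `chain_to_asl` is a hypothetical helper, not how the SDK generates its definition.

```python
import json


def chain_to_asl(state_names):
    """Build a linear state machine skeleton from an ordered list of state names."""
    if not state_names:
        raise ValueError("At least one state is required")
    states = {}
    for index, name in enumerate(state_names):
        state = {"Type": "Pass"}  # placeholder type; the real workflow uses Task states
        if index + 1 < len(state_names):
            state["Next"] = state_names[index + 1]
        else:
            state["End"] = True
        states[name] = state
    return {"StartAt": state_names[0], "States": states}


definition = chain_to_asl([
    "Pre-processing",
    "Start Training",
    "Model Training(linear)",
    "Save Model",
    "validation-step",
])
print(json.dumps(definition, indent=4))
```

The state names match the `state_id` values used in this notebook, so the printed skeleton has the same `StartAt` and `Next` ordering as the full definition reviewed in section 4.8.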

## 4.7 Define the workflow instance

The following cell defines the [workflow](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Workflow) with the path we just defined.

After defining the workflow, we will render the graph to see what our workflow looks like.

In [20]:
# Next, we define the workflow
basic_workflow = Workflow(
    name="ds-mlops-dev-linear-learner-step-function",
    definition=basic_path,
    role=v_workflow_execution_role
)

#Render the workflow
basic_workflow.render_graph()

## 4.8 Review the Amazon States Language code for your workflow

The following renders the JSON of the [Amazon States Language](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-amazon-states-language.html) definition of the workflow you created. 

In [21]:
print(basic_workflow.definition.to_json(pretty=True)) # This JSON can serve as the basis for a parameterized CloudFormation template

{
    "StartAt": "Pre-processing",
    "States": {
        "Pre-processing": {
            "Resource": "arn:aws:states:::sagemaker:createProcessingJob.sync",
            "Parameters": {
                "ProcessingJobName.$": "$$.Execution.Input['PreprocessingJobName']",
                "ProcessingInputs": [
                    {
                        "InputName": "input",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://ds-mlops-s3/data/input/inputdata.csv",
                            "LocalPath": "/opt/ml/processing/input",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    },
                    {
                        "InputName": "code",
                        "AppManaged": fa

## 4.9 Create the workflow on AWS Step Functions

Create the workflow in AWS Step Functions with [create](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Workflow.create).

In [22]:
basic_workflow.create()

[ERROR] A workflow with the same name already exists on AWS Step Functions. To update a workflow, use Workflow.update().


'arn:aws:states:us-east-1:014257795134:stateMachine:ds-mlops-dev-linear-learner-step-function'

<span style="color:red">**AP: Receiving an error message at this step:**</span>

[ERROR] A workflow with the same name already exists on AWS Step Functions. To update a workflow, use Workflow.update().

<span style="color:green">**Charles: The workflow already exists, which is why you are receiving this error. The next step updates the workflow.**</span>

In [23]:
basic_workflow.update(definition=basic_workflow.definition,
                      role=basic_workflow.role)

[INFO] Workflow updated successfully on AWS Step Functions. All execute() calls will use the updated definition and role within a few seconds.


'arn:aws:states:us-east-1:014257795134:stateMachine:ds-mlops-dev-linear-learner-step-function'

In [24]:
import time
time.sleep(30)

<span style="color:red">**AP: What's the reason for waiting 30 seconds here? Is this to give the system time to update/refresh/provision something?**</span>

<span style="color:green">**Charles: Yes, for a refresh and update.**</span>
    

## 5. Execute the workflow

Run the workflow with [execute](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Workflow.execute). The workflow runs the processing, training, model creation, and batch transform steps in sequence, so it takes several minutes to complete.

In [25]:
# Generate unique names for Pre-Processing Job, Training Job, and Model Evaluation Job for the Step Functions Workflow
training_job_name = "ll-boost-training-{}".format(
    uuid.uuid1().hex
)  # Each Training Job requires a unique name
preprocessing_job_name = "ll-boost-preprocessing-{}".format(
    uuid.uuid1().hex
)  # Each Preprocessing job requires a unique name,
evaluation_job_name = "ll-boost-evaluation-{}".format(
    uuid.uuid1().hex
)  # Each Evaluation Job requires a unique name
scoring_job_name = "ll-boost-score-{}".format(
    uuid.uuid1().hex
)  # Each Scoring (batch transform) Job requires a unique name

In [26]:
basic_workflow_execution = basic_workflow.execute(
    inputs={
        "PreprocessingJobName": preprocessing_job_name,  # Each pre processing job (SageMaker processing job) requires a unique name,
        "TrainingJobName": training_job_name,  # Each Sagemaker Training job requires a unique name,
        "EvaluationProcessingJobName": evaluation_job_name,  # Each SageMaker processing job requires a unique name,
        "ModelName" : v_model_name, # Name of model ,
        "ScoreJobName" : scoring_job_name
    }
)

[INFO] Workflow execution started successfully on AWS Step Functions.


## 5.1 Review the execution progress

Render workflow progress with [render_progress](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Execution.render_progress).

This generates a snapshot of the current state of your workflow as it executes. This is a static image. Run the cell again to check progress. 

In [37]:
basic_workflow_execution.render_progress()

<span style="color:red">**AP: This seems to take a long time. Is that due to the machine image (instance size) selected or to be expected given the process overhead?**</span>

<span style="color:green"> Charles: Whenever we invoke SageMaker model training, there is a boot process that takes some time, by default approximately 4 minutes per instance. The training itself and the subsequent steps each take no more than 2 or 3 minutes. Initiating and executing the workflow, however, does take time.</span>
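Since each run takes several minutes, re-running the progress cell by hand can be replaced with a small polling loop. The sketch below is one possible approach, not part of the SDK: `wait_for_execution` is a hypothetical helper that repeatedly calls a status function until the execution reaches one of the terminal statuses reported by Step Functions.

```python
import time

# Terminal execution statuses reported by AWS Step Functions.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def wait_for_execution(get_status, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("Execution still {} after {} seconds".format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with canned statuses instead of a real execution.
statuses = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
final_status = wait_for_execution(lambda: next(statuses), poll_seconds=0)
print(final_status)  # SUCCEEDED
```

Against a real run, `get_status` could wrap the boto3 Step Functions call `describe_execution(executionArn=...)["status"]`, assuming the execution ARN is available from the execution object returned by `execute`.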

In [36]:
%store v_model_name

Stored 'v_model_name' (str)
