# Push data to the respective S3 bucket to trigger the training pipeline

Open questions:
* What happens when we push this data to S3?
* How does it trigger the pipeline?
* Where is this event mapped, and when is it defined?
* What if my data source needs to change?
* Will I get a notification of the pipeline execution?
* What if some steps fail in between? How do we monitor and fix the issue?
* What if we need to change the preprocessing logic?
* What if a model hyperparameter needs to be changed?


In [39]:
! aws s3 cp /home/ec2-user/SageMaker/WipCoe/Notebooks_LatestCode/WiproFormat/train.csv s3://wi-cred-datalake-dev-raw/titanic/input/data/train.csv

upload: ../Notebooks_LatestCode/WiproFormat/train.csv to s3://wi-cred-datalake-dev-raw/titanic/input/data/train.csv
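
The notebook does not show how the upload is mapped to the pipeline. One common wiring, assumed here purely for illustration, is an S3 event notification that invokes a Lambda function, which then starts the Step Functions execution. The sketch below covers only the parsing half: it reads the bucket and key from a sample notification and builds an input dictionary. The helper name `build_execution_input` and the `InputData` key are hypothetical and not part of this project.

```python
import uuid
from urllib.parse import unquote_plus


def build_execution_input(event, model_name="Titanic-linear-learner-02"):
    """Turn an S3 'ObjectCreated' notification into a workflow input dict.

    Hypothetical helper: the real trigger Lambda is not shown in this notebook.
    """
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["s3"]["object"]["key"])
    suffix = uuid.uuid1().hex  # same uniqueness scheme the notebook uses
    return {
        "InputData": "s3://{}/{}".format(bucket, key),  # hypothetical key
        "PreprocessingJobName": "Titanic-ll-preprocessing-{}".format(suffix),
        "TrainingJobName": "Titanic-ll-training-{}".format(suffix),
        "ModelName": model_name,
    }


# Minimal sample event containing only the fields the helper reads.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "wi-cred-datalake-dev-raw"},
                "object": {"key": "titanic/input/data/train.csv"},
            }
        }
    ]
}
execution_input_payload = build_execution_input(sample_event)
print(execution_input_payload["InputData"])
```

Inside a real Lambda handler, the resulting dictionary would typically be serialized with `json.dumps` and passed as the `input` argument of the Step Functions client's `start_execution` call.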


# Push data to trigger scoring

Open questions:
* How is the event triggered?
* Which version of the model would the scoring file refer to?
* Is there a way to add additional attributes for better insight?
* Are we linking the scoring file with the outcome?

In [3]:
!aws s3 cp scorewolabel.csv s3://wi-cred-datalake-dev-raw/titanic/data/scoreinput/

upload: ./scorewolabel.csv to s3://wi-cred-datalake-dev-raw/titanic/data/scoreinput/scorewolabel.csv


## Introduction

This notebook describes using the AWS Step Functions Data Science SDK to create and manage workflows. The Step Functions SDK is an open source library that allows data scientists to easily create and execute machine learning workflows using AWS Step Functions and Amazon SageMaker. For more information, see the following.
* [AWS Step Functions](https://aws.amazon.com/step-functions/)
* [AWS Step Functions Developer Guide](https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html)
* [AWS Step Functions Data Science SDK](https://aws-step-functions-data-science-sdk.readthedocs.io)

In this notebook we will use the SDK to create steps, link them together to create a workflow, and execute the workflow in AWS Step Functions. 

In [46]:
# import sys
# !{sys.executable} -m pip install --upgrade pip
# !{sys.executable} -m pip install -qU awscli boto3 "sagemaker>=2.0.0"
# !{sys.executable} -m pip install -qU "stepfunctions>=2.0.0"
# !{sys.executable} -m pip show sagemaker stepfunctions

## Prerequisites

It is assumed that the Lambda function that checks whether the model already exists, and the required IAM roles for SageMaker and Step Functions, have already been created. <br/>
In this notebook we use the SageMaker steps provided by the Step Functions SDK.


## 1. Preprocessing logic script

Below is the preprocessing script, which we upload to S3 for use in the preprocessing job. It contains the logic we generated for the preprocessing activities. Once it is uploaded, its S3 location can be passed as a parameter.

In [47]:
%%writefile titanic-preprocessing-linear-learner-script.py
# Import the required libraries
import glob

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()
path = r'/opt/ml/processing/input' # Input path
all_files = glob.glob(path + "/*.csv")
#all_files=["/home/ec2-user/SageMaker/WipCoe/Notebooks_LatestCode/WiproFormat/train.csv"]
#read them into pandas
df_list = [pd.read_csv(filename) for filename in all_files]
data = pd.concat(df_list)
df = data.copy()
#change this as per the data set
#create the engineered categorical features
df['cabin_multiple'] = df.Cabin.apply(lambda x: 0 if pd.isna(x) else len(x.split(' ')))
df['cabin_adv'] = df.Cabin.apply(lambda x: str(x)[0])
df['numeric_ticket'] = df.Ticket.apply(lambda x: 1 if x.isnumeric() else 0)
df['ticket_letters'] = df.Ticket.apply(lambda x: ''.join(x.split(' ')[:-1]).replace('.','').replace('/','').lower() if len(x.split(' ')[:-1]) >0 else 0)
df['name_title'] = df.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())
#impute nulls for continuous data 
#df.Age = df.Age.fillna(training.Age.mean())
df.Age = df.Age.fillna(df.Age.median())
#df.Fare = df.Fare.fillna(training.Fare.mean())
df.Fare = df.Fare.fillna(df.Fare.median())
#drop null 'embarked' rows. Only 2 instances of this in training and 0 in test 
df.dropna(subset=['Embarked'],inplace = True)
#tried log norm of sibsp (not used)
df['norm_sibsp'] = np.log(df.SibSp+1)
# log norm of fare (used)
df['norm_fare'] = np.log(df.Fare+1)
# convert Pclass to a string category for pd.get_dummies()
df.Pclass = df.Pclass.astype(str)
#created dummy variables from categories (also can use OneHotEncoder)
all_dummies = pd.get_dummies(df[['PassengerId','Survived','Pclass','Sex','Age','SibSp','Parch','norm_fare','Embarked','cabin_adv','cabin_multiple','numeric_ticket','name_title']])
#scaling
encoded_df = all_dummies.copy()
encoded_df[['Age','SibSp','Parch','norm_fare']]= scale.fit_transform(encoded_df[['Age','SibSp','Parch','norm_fare']])
#shuffle, then split into train (70%), validation (20%) and test (10%); change the cut points if a different split is required
train_data, validation_data, test_data = np.split(encoded_df.sample(frac=1, random_state=1729), [int(0.7 * len(encoded_df)), int(0.9*len(encoded_df))]) # Splitting dataset 
train_data=train_data.drop(columns=['PassengerId'])#id is used for ground truth, if there is no id column in data create custom and use that
#workflow path
train_data.to_csv('/opt/ml/processing/train/train.csv', index=False, header=False) # train data
train_data.to_csv('/opt/ml/processing/trainbase/train_baseline.csv', index=False, header=True) # baseline data
#local path
# train_data.to_csv('titanic/train.csv', index=False, header=False) # xtrain data
# train_data.to_csv('titanic/train_baseline.csv', index=False, header=True) # baseline data
validbsline_data=validation_data.copy()
validation_data=validation_data.drop(columns=['PassengerId'])#id is used for ground truth, if there is no id column in data create custom and use that
validation_data.to_csv('/opt/ml/processing/validation/validation_data.csv', index=False, header=False) # validation data
#validation_data.to_csv('titanic/validation_data.csv', index=False, header=False) # validation data
validbsline= validbsline_data.drop(columns=['Survived']) # remove the label column that the model has to predict
validbsline.to_csv('/opt/ml/processing/baselinemodeldrift/baselinemodeldrift.csv', index=False, header=False)
#validbsline.to_csv('titanic/baselinemodeldrift.csv', index=False, header=False)
#validation data without label---one set
groundtrth=validbsline_data[['PassengerId','Survived']]#ground truth data should have only the Id and target value columns, to correlate ground truth with predicted values
groundtrth.to_csv('/opt/ml/processing/groundtruth/groundtruth.csv', index=False, header=True)
#groundtrth.to_csv('titanic/groundtruth.csv', index=False, header=True)
#ground truth (only label and ID)--2nd set
test_data = test_data.iloc[:,1:] # drops the first column (PassengerId); the Survived label column is still present
test_data.to_csv('/opt/ml/processing/test/test.csv', index=False, header=False) # test data 
#test_data.to_csv('titanic/test.csv', index=False, header=False) # test data 

Overwriting titanic-preprocessing-linear-learner-script.py
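
To make the feature-engineering lambdas in the script easier to follow, here is a minimal sketch that applies the same string logic to single values in plain Python, without pandas. The function names mirror the derived column names; the sample values are illustrative, and a missing cabin is represented here by `None` rather than a pandas NaN.

```python
def cabin_multiple(cabin):
    """Number of cabins listed; 0 when the value is missing (None here)."""
    return 0 if cabin is None else len(cabin.split(' '))


def ticket_letters(ticket):
    """Lower-cased ticket prefix without '.' or '/', or 0 when there is no prefix."""
    parts = ticket.split(' ')[:-1]
    if len(parts) > 0:
        return ''.join(parts).replace('.', '').replace('/', '').lower()
    return 0


def name_title(name):
    """Title between the comma and the first period of the name."""
    return name.split(',')[1].split('.')[0].strip()


print(cabin_multiple('C23 C25 C27'))          # → 3
print(cabin_multiple(None))                   # → 0
print(ticket_letters('STON/O2. 3101282'))     # → stono2
print(ticket_letters('113803'))               # → 0
print(name_title('Braund, Mr. Owen Harris'))  # → Mr
```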


## 2. Parameters

Below is the list of parameters that must be changed in order to run the SDK code that follows.


In [48]:
import sagemaker
v_workflow_execution_role = "arn:aws:iam::525102048888:role/poc-sagemaker-step-functi-MachineLearningWorkflowE-1XFI2UPRXFTXE" # Step function IAM role ARN
v_preprocessing_iam_role = "arn:aws:iam::525102048888:role/service-role/AmazonSageMaker-ExecutionRole-20191105T125227" # IAM role for preprocessing container
v_lambda_execution_role = "arn:aws:iam::525102048888:role/poc-sagemaker-step-functi-LambaForDataGenerationEx-PKONGQTFWLRF"
v_preprocessing_instance_type = "ml.m5.xlarge" # Instance type for preprocessing container it changes as per workload
v_s3_input_bucket = "ds-mlops-dev" # S3 bucket for input and output data
v_prefix_for_input_data = "titanic/data/input/inputdata.csv"  # Prefix where data is stored
v_prefix_for_code_location = "titanic/code/titanic-preprocessing-linear-learner-script.py" # prefix where code is stored
v_lambda_function_name = "ds-mlops-linear-learner-lambda-test" # Name of lambda function for triggering training pipeline.
v_region = 'us-east-1' # AWS region
v_model_container = sagemaker.image_uris.retrieve('linear-learner', v_region) # Linear Learner container image
v_train_instance_type = "ml.m5.xlarge" # Instance type for training
v_validation_scoring_instance_type = "ml.m5.large" # Instance type for batch scoring
v_model_name = "Titanic-linear-learner-02" # Name of DS_MLOPS model to be kept
# VV added after design review
v_threshold=3000
v_prefix_for_train_lambda="code/query_training_status.zip"
##
# sec_groups = ["sg-044e0e7ce4f5721c0"]
# subnets = ["subnet-0cf0e3f46326aa259",
#            "subnet-0156b7f5500cf0b78",
#            "subnet-032420199163cff9b"]

## 3. Import the required modules from the SDK and upload the code to S3

In [49]:
import stepfunctions
import logging

from stepfunctions.steps import *
from stepfunctions.workflow import Workflow
from stepfunctions import steps
from stepfunctions.inputs import ExecutionInput
from sagemaker.processing import Processor,ProcessingInput, ProcessingOutput
import uuid
import sagemaker
from sagemaker.inputs import TrainingInput
import boto3
from sagemaker.network import NetworkConfig
from sagemaker.sklearn.processing import SKLearnProcessor

stepfunctions.set_stream_logger(level=logging.INFO)

In [50]:
!aws s3 cp titanic-preprocessing-linear-learner-script.py s3://$v_s3_input_bucket/$v_prefix_for_code_location # Uploading preprocessing code on s3

upload: ./titanic-preprocessing-linear-learner-script.py to s3://ds-mlops-dev/titanic/code/titanic-preprocessing-linear-learner-script.py


In [51]:
!cp titanic-preprocessing-linear-learner-script.py /home/ec2-user/SageMaker/WipCoe/VW-TitanicDeployment/train_preprocessing/

In [52]:
!cp titanic-preprocessing-linear-learner-script.py /home/ec2-user/SageMaker/WipCoe/VW-TitanicDeployment/train_preprocessing/titanic-preprocessing-xgb-script.py

The Lambda function below queries the training job status and returns its final metric values.

In [53]:
%%writefile query_training_status.py
import boto3
import logging
import json

logger = logging.getLogger()
logger.setLevel(logging.INFO)
sm_client = boto3.client('sagemaker')

#Retrieve training job name from event and return the final training metrics.
def lambda_handler(event, context):

    if ('TrainingJobName' in event):
        job_name = event['TrainingJobName']

    else:
        raise KeyError('TrainingJobName key not found in function input!'+
                      ' The input received was: {}.'.format(json.dumps(event)))

    #Query boto3 API to check training status.
    try:
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        logger.info("Training job:{} has status:{}.".format(job_name,
            response['TrainingJobStatus']))

    except Exception as e:
        message = ('Failed to read training status!'+ 
                   ' The training job may not exist or the job name may be incorrect.'+ 
                   ' Check SageMaker to confirm the job name.')
        logger.error('{} Attempted to read job name: {}. Error: {}'.format(message, job_name, e))
        #Re-raise so the state fails here instead of indexing into an error string below.
        raise

    #We can't marshall datetime objects in JSON response. So convert
    #all datetime objects returned to unix time.
    for index, metric in enumerate(response['FinalMetricDataList']):
        metric['Timestamp'] = metric['Timestamp'].timestamp()

    return {
        'statusCode': 200,
        'trainingMetrics': response['FinalMetricDataList']
    }


Overwriting query_training_status.py
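
The timestamp conversion at the end of the handler exists because the standard `json` encoder rejects `datetime` objects. The sketch below reproduces that behaviour on a single metric entry. The entry is a hypothetical stand-in shaped like one item of `FinalMetricDataList`; the metric name and value are made up.

```python
import datetime
import json

# Hypothetical metric entry shaped like one item of FinalMetricDataList.
metric = {
    "MetricName": "example:metric",
    "Value": 0.38,
    "Timestamp": datetime.datetime(2022, 8, 17, 3, 45, 26, tzinfo=datetime.timezone.utc),
}

try:
    json.dumps(metric)
    serializable_before = True
except TypeError:
    # The json module cannot encode datetime objects out of the box.
    serializable_before = False

# Same conversion the handler applies to every metric: datetime to unix time.
metric["Timestamp"] = metric["Timestamp"].timestamp()
encoded = json.dumps(metric)

print(serializable_before)  # → False
print(encoded)
```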


In [54]:
## VV added after review
! zip query_training_status.zip query_training_status.py

updating: query_training_status.py (deflated 55%)


In [55]:
## VV added after review
! aws s3 cp query_training_status.zip s3://$v_s3_input_bucket/$v_prefix_for_train_lambda

upload: ./query_training_status.zip to s3://ds-mlops-dev/code/query_training_status.zip


In [56]:
## VV added after review
function_name = 'LinearLearnerQuery-training-status'
try:
    lambda_client = boto3.client('lambda')
    response = lambda_client.create_function(
        FunctionName=function_name,
        Runtime='python3.7',
        Role=v_lambda_execution_role,
        Handler='query_training_status.lambda_handler',
        Code={
            'S3Bucket':v_s3_input_bucket,
            'S3Key': v_prefix_for_train_lambda
        },
        Description='Queries a SageMaker training job and return the results.',
        Timeout=15,
        MemorySize=128
    )
except:
    print("exception")

exception
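
The bare `except` above hides the reason for the failure. One possible cause, which is an assumption because the notebook does not record the error, is that a function with this name already exists. The sketch below shows one way to handle that case explicitly: try to create the function and, on a name conflict, update its code instead. The helper name `create_or_update_function` is hypothetical; the client is passed in so the logic can be tried without AWS access.

```python
def create_or_update_function(lambda_client, function_name, role_arn, bucket, key):
    """Create the Lambda function, or refresh its code if it already exists.

    `lambda_client` is expected to behave like boto3.client('lambda').
    """
    try:
        lambda_client.create_function(
            FunctionName=function_name,
            Runtime='python3.7',  # same runtime string as the notebook cell
            Role=role_arn,
            Handler='query_training_status.lambda_handler',
            Code={'S3Bucket': bucket, 'S3Key': key},
            Description='Queries a SageMaker training job and return the results.',
            Timeout=15,
            MemorySize=128,
        )
        return 'created'
    except lambda_client.exceptions.ResourceConflictException:
        # The function name is already taken: upload the new code package instead.
        lambda_client.update_function_code(
            FunctionName=function_name, S3Bucket=bucket, S3Key=key
        )
        return 'updated'


# Example call (requires AWS credentials, so it is left commented out):
# import boto3
# create_or_update_function(boto3.client('lambda'), function_name,
#                           v_lambda_execution_role, v_s3_input_bucket,
#                           v_prefix_for_train_lambda)
```

Any other error, such as a missing S3 object or an invalid role, still propagates with its original message instead of being reduced to the word "exception".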


## 4. Create workflow

In the following cells, we define the steps used in our workflow. Then we create, visualize, and execute the workflow.

Steps relate to states in AWS Step Functions. For more information, see [States](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-states.html) in the *AWS Step Functions Developer Guide*. For more information on the AWS Step Functions Data Science SDK APIs, see: https://aws-step-functions-data-science-sdk.readthedocs.io. 

## 4.1 Creating Pre-Processing step

In [57]:
input_data = "s3://{}/{}".format(v_s3_input_bucket,v_prefix_for_input_data)
input_code = "s3://{}/{}".format(v_s3_input_bucket,v_prefix_for_code_location)
output_data = "s3://{}/{}".format(v_s3_input_bucket,"titanic/preprocess-data")

inputs = [
    ProcessingInput(
        source=input_data, destination="/opt/ml/processing/input", input_name="input"
    ),
    ProcessingInput(
        source=input_code,
        destination="/opt/ml/processing/input/code",
        input_name="code",
    ),
]

outputs = [
    ProcessingOutput(
        source="/opt/ml/processing/train",
        destination="{}/{}".format(output_data,"train"),
        output_name="train_data",
    ),
    ProcessingOutput(
        source="/opt/ml/processing/test",
        destination="{}/{}".format(output_data, "test"),
        output_name="test_data",
    ),
    ProcessingOutput(
        source="/opt/ml/processing/validation",
        destination="{}/{}".format(output_data, "validation"),
        output_name="validation_data",
    ),
    ProcessingOutput(
        source="/opt/ml/processing/baselinemodeldrift",
        destination="{}/{}".format(output_data,"baselinemodeldrift"),
        output_name="baselinemodeldrift",
    ),
    ProcessingOutput(
        source="/opt/ml/processing/groundtruth",
        destination="{}/{}".format(output_data,"groundtruth"),
        output_name="groundtruth",
    ),
    
    ProcessingOutput(
        source="/opt/ml/processing/trainbase",
        destination="{}/{}".format(output_data,"trainbase"),
        output_name="trainbase",
    )   
]

In [58]:
print(input_data)

s3://ds-mlops-dev/titanic/data/input/inputdata.csv


In [59]:
# Generate unique names for Pre-Processing Job, Training Job, and Model Evaluation Job for the Step Functions Workflow
training_job_name = "Titanic-ll-training-{}".format(
    uuid.uuid1().hex
)  # Each Training Job requires a unique name
preprocessing_job_name = "Titanic-ll-preprocessing-{}".format(
    uuid.uuid1().hex
)  # Each Preprocessing job requires a unique name,
evaluation_job_name = "Titanic-ll-evaluation-{}".format(
    uuid.uuid1().hex
)  # Each Evaluation Job requires a unique name
scoring_job_name = "Titanic-ll-score-{}".format(
    uuid.uuid1().hex
)  # Each scoring job requires a unique name

In [60]:
# SageMaker expects unique names for each job, model and endpoint.
# If these names are not unique the execution will fail. Pass these dynamically for each execution using placeholders.

##VV updated after review

execution_input = ExecutionInput(
    schema={
        "PreprocessingJobName": str,
        "TrainingJobName": str,
        "EvaluationProcessingJobName": str,
        "ModelName": str,
        "EndPointConfig":str,
        "EndpointName":str,
        "ScoreJobName":str
    }
)

In [61]:
processor =Processor(image_uri='683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3',
                     role=v_preprocessing_iam_role,
                     instance_count=1,
                     instance_type=v_preprocessing_instance_type
                    #network_config = NetworkConfig(security_group_ids = sec_groups, subnets = subnets)
                     )
#processor.JOB_CLASS_NAME= 'processing-job'

In [62]:
# Generate unique names for Pre-Processing Job, Training Job, and Model Evaluation Job for the Step Functions Workflow
training_job_name = "Titanic-ll-training-{}".format(
    uuid.uuid1().hex
)  # Each Training Job requires a unique name
preprocessing_job_name = "Titanic-ll--preprocessing-{}".format(
    uuid.uuid1().hex
)  # Each Preprocessing job requires a unique name,
evaluation_job_name = "Titanic-ll--evaluation-{}".format(
    uuid.uuid1().hex
)  # Each Evaluation Job requires a unique name
scoring_job_name = "Titanic-ll-score-{}".format(
    uuid.uuid1().hex
)  # Each scoring job requires a unique name
endpoint_name = "Titanic-endpoint-{}".format(
    uuid.uuid1().hex
)  # Each endpoint requires a unique name
endpoint_config_name = "Titanic-endpoint-config-{}".format(
    uuid.uuid1().hex
)  # Each endpoint config requires a unique name
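
The naming cells above repeat the same prefix-plus-uuid pattern. As a minimal sketch, the pattern can be wrapped in a helper that also checks the result against the job-name constraint documented for SageMaker training jobs (at most 63 characters, alphanumerics and hyphens). The helper name `unique_job_name` is hypothetical, and applying the same constraint to every resource type is an assumption.

```python
import re
import uuid

# Name constraint as documented for SageMaker training jobs: up to 63 characters,
# alphanumerics and hyphens, starting and ending with an alphanumeric character.
JOB_NAME_PATTERN = re.compile(r'^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$')


def unique_job_name(prefix):
    """Append a uuid1 hex suffix, as the notebook does, and validate the result."""
    name = "{}-{}".format(prefix, uuid.uuid1().hex)
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError("Invalid SageMaker job name: {}".format(name))
    return name


training_name = unique_job_name("Titanic-ll-training")
print(training_name, len(training_name))
```

Since the uuid suffix alone takes 32 characters, a prefix longer than 30 characters fails the check here rather than later, when the workflow execution tries to start the job.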

In [63]:
#processor.JOB_CLASS_NAME= 'processing-job'
preprocessing_step = ProcessingStep(
    state_id='Pre-processing', 
    processor=processor,
    #JOB_CLASS_NAME='processing_job',
    job_name=preprocessing_job_name, 
    inputs=inputs, 
    outputs=outputs, 
    experiment_config=None, 
    container_entrypoint=["python3", "/opt/ml/processing/input/code/titanic-preprocessing-linear-learner-script.py"], # DS needs to change this directory /path
    wait_for_completion=True
)

## 4.2 Train trigger lambda function (Check if model with same name exists)

In the following cell, we define a lambda step that will invoke the previously created lambda function as part of our Step Function workflow. See [LambdaStep](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/compute.html#stepfunctions.steps.compute.LambdaStep) in the AWS Step Functions Data Science SDK documentation to learn more.

In [64]:
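from botocore.exceptions import ClientError

client = boto3.client('sagemaker') # getting sagemaker client 
try:
    client.delete_model(
        ModelName=v_model_name # delete the model if one already exists with this name
    )
except ClientError:
    pass # most likely no model with this name exists yet, so there is nothing to delete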

In [65]:
lambda_step = compute.LambdaStep(
    'Start Training',
    parameters={  
        "FunctionName": v_lambda_function_name
    }
)

In [66]:
##VV added after review
lambda_step_acc = compute.LambdaStep(
    'Query Training Results',
    parameters={  
        "FunctionName": function_name,
        'Payload':{
            "TrainingJobName.$": '$.TrainingJobName'
        }
    }
)

In [67]:
##VV added after review
check_accuracy_step = steps.states.Choice(
    'RMSE < Threshold'
)

In [68]:
##VV added after review
fail_step = steps.states.Fail(
    'Model Accuracy Too Low',
    comment='Validation accuracy lower than threshold'
)

## 4.3 Train model

### Create a SageMaker Training Step 

In the following cells, we define the estimator and create the training step that uses it. See  [TrainingStep](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/sagemaker.html#stepfunctions.steps.sagemaker.TrainingStep) in the AWS Step Functions Data Science SDK documentation to learn more.

In [69]:
sess = sagemaker.Session()
training_output = 's3://{}/models'.format(v_s3_input_bucket) # model output locations
linear = sagemaker.estimator.Estimator(v_model_container,
                                       v_preprocessing_iam_role, 
                                       instance_count = 1, 
                                       instance_type = v_train_instance_type,
                                       output_path=training_output,
                                       sagemaker_session = sess,
                                       #security_group_ids=sec_groups,
                                       #subnets=subnets
                                      )

linear.set_hyperparameters(epochs = 50,
                           l1 = 0.0014121036264995773,
                           learning_rate = 0.012104062065442999,
                           mini_batch_size = 256,
                           predictor_type = "regressor")

In [70]:
training_step = TrainingStep(
    'Model Training(linear)', 
    estimator=linear,
    data={
        'train': TrainingInput("{}/{}/".format(output_data,"train"), content_type='text/csv'),
        'validation': TrainingInput("{}/{}/".format(output_data, "validation"), content_type='text/csv')
    },
    job_name=execution_input['TrainingJobName'],
    wait_for_completion=True
)

In [71]:
print(output_data)

s3://ds-mlops-dev/titanic/preprocess-data


## 4.4 Create a Model

In the following cell, we define a model step that will create a model in Amazon SageMaker using the artifacts created during the TrainingStep. See  [ModelStep](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/sagemaker.html#stepfunctions.steps.sagemaker.ModelStep) in the AWS Step Functions Data Science SDK documentation to learn more.

The model creation step typically follows the training step. The Step Functions SDK provides the [get_expected_model](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/sagemaker.html#stepfunctions.steps.sagemaker.TrainingStep.get_expected_model) method in the TrainingStep class to provide a reference for the trained model artifacts. Please note that this method is only useful when the ModelStep directly follows the TrainingStep.

In [72]:
model_step = ModelStep(
    'Save Model',
    model=training_step.get_expected_model(),
    model_name=execution_input['ModelName'],
    result_path='$.ModelStepResults'
)

## 4.5 Create a batch transform step

Once all the above steps are done, we perform scoring on a small data set to check that all the components work correctly.

In [73]:
from sagemaker.inputs import TransformInput

batch_scoring = TransformStep(
    state_id="validation-step",
    job_name=execution_input['ScoreJobName'],
    transformer=linear.transformer(instance_count=1,
                                instance_type=v_validation_scoring_instance_type),
    data="{}/{}".format(output_data, "test"), # location for test data
    model_name=execution_input['ModelName'],
    content_type="text/csv"
)

## Create endpoint config and endpoint steps

In [74]:
##VV added after review
endpoint_config_step = steps.EndpointConfigStep(
    "Create Model Endpoint Config",
    endpoint_config_name=execution_input['EndPointConfig'],
    model_name=execution_input['ModelName'],
    initial_instance_count=1,
    instance_type='ml.m4.xlarge'
)

In [75]:
##VV added after review
endpoint_step = steps.EndpointStep(
    'Update Model Endpoint',
    endpoint_name=execution_input['EndpointName'],
    endpoint_config_name=execution_input['EndPointConfig'],
    update=False
)

In [76]:
##VV added after review
threshold_rule = steps.choice_rule.ChoiceRule.NumericLessThan(variable=lambda_step_acc.output()['Payload']['trainingMetrics'][0]['Value'], value=v_threshold)
check_accuracy_step.add_choice(rule=threshold_rule, next_step=endpoint_config_step)
check_accuracy_step.default_choice(next_step=fail_step)
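
The branching configured above can be simulated in plain Python to see which state an execution would move to for a given Lambda result. This is only a sketch of the decision logic, not of how Step Functions evaluates it. The sample payloads are hypothetical and shaped like the return value of `query_training_status.py`.

```python
def next_state(lambda_output, threshold=3000):
    """Mimic the Choice state: compare the first training metric to the threshold."""
    value = lambda_output['Payload']['trainingMetrics'][0]['Value']
    if value < threshold:
        return 'Create Model Endpoint Config'
    return 'Model Accuracy Too Low'  # default choice


# Hypothetical Lambda outputs with made-up metric values.
good = {'Payload': {'statusCode': 200,
                    'trainingMetrics': [{'MetricName': 'm', 'Value': 0.42,
                                         'Timestamp': 1660707926.0}]}}
bad = {'Payload': {'statusCode': 200,
                   'trainingMetrics': [{'MetricName': 'm', 'Value': 4500.0,
                                        'Timestamp': 1660707926.0}]}}

print(next_state(good))  # → Create Model Endpoint Config
print(next_state(bad))   # → Model Accuracy Too Low
```

Note that the rule reads index 0 of the metric list, so it relies on the metric of interest being the first entry returned for the training job.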

In [77]:
##VV added after review
endpoint_config_step.next(endpoint_step)

Update Model Endpoint EndpointStep(resource='arn:aws:states:::sagemaker:createEndpoint', parameters={'EndpointConfigName': <stepfunctions.inputs.placeholders.ExecutionInput object at 0x7f81b3e63160>, 'EndpointName': <stepfunctions.inputs.placeholders.ExecutionInput object at 0x7f81b3e63198>}, type='Task')

## 4.6 Chain together steps for the basic path

The following cell links together the steps you've created into a sequential group called `basic_path`. We chain the preprocessing, training, model, metric query, and accuracy check steps to create our basic path. See [Chain](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/states.html#stepfunctions.steps.states.Chain) in the AWS Step Functions Data Science SDK documentation.

After chaining together the steps for the basic path, we will visualize it.

In [78]:
# Chain the preprocessing, training, model, metric query and accuracy check steps
basic_path=Chain([preprocessing_step,
                  training_step,
                  model_step,
                  lambda_step_acc,
                  check_accuracy_step])
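
Conceptually, chaining makes each state point to the one that follows it. The sketch below builds a simplified, Amazon States Language-like dictionary from the ordered state names used in this notebook. It illustrates the idea only and is not the SDK's implementation; the helper `chain_states` is hypothetical.

```python
def chain_states(state_names):
    """Link states sequentially: each one points to the next; the last has no 'Next'."""
    states = {}
    for current, following in zip(state_names, state_names[1:] + [None]):
        states[current] = {'Next': following} if following else {}
    return {'StartAt': state_names[0], 'States': states}


definition = chain_states([
    'Pre-processing',
    'Model Training(linear)',
    'Save Model',
    'Query Training Results',
    'RMSE < Threshold',
])
print(definition['StartAt'])               # → Pre-processing
print(definition['States']['Save Model'])  # → {'Next': 'Query Training Results'}
```

The last state has no `Next` entry here because, in this workflow, it is the Choice state, whose outgoing transitions come from its choice rules and default choice.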

In [79]:
# # First we chain the start pass state,preprocessing_step,
# basic_path=Chain([training_step])

## 4.7 Define the workflow instance

The following cell defines the [workflow](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Workflow) with the path we just defined.

After defining the workflow, we will render the graph to see what our workflow looks like.

In [80]:
# Next, we define the workflow
basic_workflow = Workflow(
    name="Titanic-linear-learner-step-function",
    definition=basic_path,
    role=v_workflow_execution_role
)

#Render the workflow
basic_workflow.render_graph()

## 4.8 Review the Amazon States Language code for your workflow

The following renders the JSON of the [Amazon States Language](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-amazon-states-language.html) definition of the workflow you created. 

In [81]:
print(basic_workflow.definition.to_json(pretty=True)) # The parameterized CloudFormation template is later built from this JSON definition

{
    "StartAt": "Pre-processing",
    "States": {
        "Pre-processing": {
            "Resource": "arn:aws:states:::sagemaker:createProcessingJob.sync",
            "Parameters": {
                "ProcessingJobName": "Titanic-ll--preprocessing-73cec0161a0611ed8cef065a32dff597",
                "ProcessingInputs": [
                    {
                        "InputName": "input",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://ds-mlops-dev/titanic/data/input/inputdata.csv",
                            "LocalPath": "/opt/ml/processing/input",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    },
                    {
                        "InputName": "code",
                 

## 4.9 Create the workflow on AWS Step Functions

Create the workflow in AWS Step Functions with [create](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Workflow.create).

In [82]:
basic_workflow.create()

[ERROR] A workflow with the same name already exists on AWS Step Functions. To update a workflow, use Workflow.update().


'arn:aws:states:us-east-1:525102048888:stateMachine:Titanic-linear-learner-step-function'

In [83]:
basic_workflow.update(definition=basic_workflow.definition,role=basic_workflow.role)

[INFO] Workflow updated successfully on AWS Step Functions. All execute() calls will use the updated definition and role within a few seconds.


'arn:aws:states:us-east-1:525102048888:stateMachine:Titanic-linear-learner-step-function'

## 5 Execute the workflow

Run the workflow with [execute](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Workflow.execute). Because the workflow runs SageMaker processing and training jobs, the execution takes some time to complete.
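
Before starting an execution, it can help to check the inputs against the schema declared in `ExecutionInput`. The sketch below does this with a plain dictionary copy of that schema. The helper `check_inputs` and the sample values are hypothetical, and the SDK may perform its own validation.

```python
# Plain-Python copy of the schema declared in ExecutionInput above.
EXPECTED_SCHEMA = {
    "PreprocessingJobName": str,
    "TrainingJobName": str,
    "EvaluationProcessingJobName": str,
    "ModelName": str,
    "EndPointConfig": str,
    "EndpointName": str,
    "ScoreJobName": str,
}


def check_inputs(inputs, schema=EXPECTED_SCHEMA):
    """Return a list of problems: missing keys, wrong types, unexpected keys."""
    problems = []
    for key, expected_type in schema.items():
        if key not in inputs:
            problems.append("missing: {}".format(key))
        elif not isinstance(inputs[key], expected_type):
            problems.append("wrong type: {}".format(key))
    for key in inputs:
        if key not in schema:
            problems.append("unexpected: {}".format(key))
    return problems


# Hypothetical inputs: ScoreJobName is missing and Region is not in the schema.
sample_inputs = {
    "PreprocessingJobName": "Titanic-ll-preprocessing-abc",
    "TrainingJobName": "Titanic-ll-training-abc",
    "EvaluationProcessingJobName": "Titanic-ll-evaluation-abc",
    "ModelName": "Titanic-linear-learner-02",
    "EndPointConfig": "Titanic-endpoint-config-abc",
    "EndpointName": "Titanic-endpoint-abc",
    "Region": "us-east-1",
}
print(check_inputs(sample_inputs))  # → ['missing: ScoreJobName', 'unexpected: Region']
```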

In [84]:
basic_workflow_execution = basic_workflow.execute(
    inputs={
        "PreprocessingJobName": preprocessing_job_name,  # Each pre processing job (SageMaker processing job) requires a unique name,
        "TrainingJobName": training_job_name,  # Each Sagemaker Training job requires a unique name,
        "EvaluationProcessingJobName": evaluation_job_name,  # Each SageMaker processing job requires a unique name,
        "ModelName" : v_model_name, # Name of model ,
        "ScoreJobName" : scoring_job_name,
        "EndpointName" : endpoint_name,
        "EndPointConfig" : endpoint_config_name
    }
)

[INFO] Workflow execution started successfully on AWS Step Functions.


## 5.1 Review the execution progress

Render workflow progress with [render_progress](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Execution.render_progress).

This generates a snapshot of the current state of your workflow as it executes. This is a static image. Run the cell again to check progress. 

In [85]:
basic_workflow_execution.render_progress()

In [18]:
import pandas as pd
filename="scorewolabel20220817034014_130312380.csv.out"
#filename="badodometersd20220623120734_1889418018.csv.out"
df = pd.read_csv(filename,header=None)
df = df.sample(frac=.25)
print(df.head(5))
df=df.iloc[: ,6:]#.drop([4],axis = 1)
print(df.head(5))

                      0        1      2                  3   \
530  2022-08-17 03:45:26  XGboost  Batch  53220220817034526   
22   2022-08-17 03:45:26  XGboost  Batch   2320220817034526   
623  2022-08-17 03:45:26  XGboost  Batch  62520220817034526   
535  2022-08-17 03:45:26  XGboost  Batch  53720220817034526   
285  2022-08-17 03:45:26  XGboost  Batch  28720220817034526   

                                                    4        5         6   \
530  s3://wi-cred-datalake-dev-raw/data/repscoreinp...  Titanic -0.101340   
22   s3://wi-cred-datalake-dev-raw/data/repscoreinp...  Titanic -1.103064   
623  s3://wi-cred-datalake-dev-raw/data/repscoreinp...  Titanic -0.640730   
535  s3://wi-cred-datalake-dev-raw/data/repscoreinp...  Titanic  1.208607   
285  s3://wi-cred-datalake-dev-raw/data/repscoreinp...  Titanic  0.052771   

           7         8         9   ...  37  38  39  40  41  42  43  44  45  \
530 -0.475199 -0.474326 -0.880201  ...   0   0   0   1   0   0   0   0   0   
22

In [12]:
!aws s3 cp s3://wi-cred-datalake-dev-raw/transformed/scoring/outbound/batch/ll/2022/06/23/12/badodometersd20220623120734_1889418018.csv.out .

download: s3://wi-cred-datalake-dev-raw/transformed/scoring/outbound/batch/ll/2022/06/23/12/badodometersd20220623120734_1889418018.csv.out to ./badodometersd20220623120734_1889418018.csv.out


In [36]:
def jsonl_to_csv(input_path, bkt,output_path,filename,modelname,monitorjobname,mtrrefpath,mdmonitorjobname):
    # one column name per field in the scoring table: set the count to the number of desired columns, same as the batch scoring data file
    col_name = ['col_' + str(i) for i in range(47)]
    df = pd.DataFrame(columns=col_name)
    i = 0
    for line in smart_open(input_path, 'rb'):
        input_row = json.loads(line.decode('utf8'))
        input_data = input_row['captureData']['endpointInput']['data']
        input_data = input_data.split(',')
        inference_id = input_row['eventMetadata']['inferenceId']
        runtime=datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        output = input_row['captureData']['endpointOutput']['data'].replace('\n', '')
        input_data.insert(0, 'Titanic')#added use case name
        input_data.insert(0, 'NA for realtime data')
        input_data.insert(0, inference_id)
        input_data.insert(0, 'realtime')
        input_data.insert(0, modelname)
        input_data.insert(0, runtime)
        
        input_data.append(output)
        df.loc[i] = input_data
        i = i+1
    csv_buffer = StringIO()
    df.to_csv(csv_buffer,index=False)
    s=writecsv(csv_buffer,bkt,output_path+filename+'.csv')
    df1=df.iloc[:,3:4] # read the inference id from position
    df2=df1.copy(deep=True)
    df2.columns=['Inferenceid']
    df2['monitorjobname']=monitorjobname
    df2['modelname']=modelname
    df2['mdmonitorjobname']=mdmonitorjobname
    csv_buffer = StringIO()
    df2.to_csv(csv_buffer,index=False)
    s=writecsv(csv_buffer,bkt,mtrrefpath+filename+'.csv')
    return 'Transfer is done'    
def writecsv(content,buckname,buckobj):
    clnt=boto3.resource('s3')
    clnt.Object(buckname, buckobj).put(Body=content.getvalue())
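
To show what `jsonl_to_csv` does to each captured record, here is a minimal sketch that converts a single JSON line into one report row using only the standard library. The sample record is hypothetical and contains only the fields the function reads; the helper name `capture_line_to_row` is also an illustration.

```python
import csv
import io
import json

# Hypothetical capture record with only the fields that jsonl_to_csv reads.
sample_line = json.dumps({
    "captureData": {
        "endpointInput": {"data": "0.5,-1.2,1"},
        "endpointOutput": {"data": "0.87\n"},
    },
    "eventMetadata": {"inferenceId": "abc-123"},
})


def capture_line_to_row(line, modelname, runtime):
    """Build one report row: metadata columns, input features, then the prediction."""
    record = json.loads(line)
    features = record['captureData']['endpointInput']['data'].split(',')
    prediction = record['captureData']['endpointOutput']['data'].replace('\n', '')
    inference_id = record['eventMetadata']['inferenceId']
    # Same column order that the repeated insert(0, ...) calls produce.
    metadata = [runtime, modelname, 'realtime', inference_id,
                'NA for realtime data', 'Titanic']
    return metadata + features + [prediction]


row = capture_line_to_row(sample_line, 'XGboost', '2022-08-18 04:00:00')
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
print(buffer.getvalue().strip())
```

The inference id lands at index 3 of the row, which matches the `df.iloc[:,3:4]` slice that `jsonl_to_csv` uses to build the monitoring reference file.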

In [21]:
event={"RTReportPath": "transformed/titanic/scoring/outbound/realtime/xg/2022/08/18/04/",
      "notif_sub": "Data Drift Monitor -XGboost - Realtime",
      "prep_jsonlpath": "transformed/titanic/monitoring/inbound/realtime/xg/2022/08/18/04/",
      "monitoropkey": "transformed/titanic/monitoring/outbound/datadrift/realtime/xg/2022/08/18/04/",
      "modelname": "XGboost",
      "reportopkey": "transformed/titanic/monitoring/reporting/drift/realtime/xg/2022/08/18/04/",
      "inpjsonline": "s3://wi-cred-datalake-dev-raw/transformed/titanic/monitoring/inbound/realtime/xg/2022/08/18/04/",
      "endtime": "2022-08-18T05:00:00Z",
      "outjsonpath": "s3://wi-cred-datalake-dev-raw/transformed/titanic/monitoring/outbound/datadrift/realtime/xg/2022/08/18/04/",
      "starttime": "2022-08-18T04:00:00Z",
      "MonitorJobName": "Wi-MLOPS-ModelMonitor-RT-xg-datadrift-3680709594",
      "infertype": "Realtime",
      "baselinecons": "s3://wi-cred-datalake-dev-s3-mlops-config/titanic/training/inbound/baseline/datadrift/constraints.json",
      "payload_src": "transformed/titanic/monitoring/inbound/currentrun/realtime/xg/",
      "mtrrefpath": "transformed/titanic/monitoring/reporting/scoremonitorbridge/realtime/xg/2022/08/18/04/",
      "MD_MonitorJobName": "Wi-MLOPS-ModelMonitor-RT-xg-modeldrift-3680709844",
      "baselinestat": "s3://wi-cred-datalake-dev-s3-mlops-config/titanic/training/inbound/baseline/datadrift/statistics.json",
      "md_monitoroppath": "s3://wi-cred-datalake-dev-raw/transformed/titanic/monitoring/outbound/modeldrift/realtime/xg/2022/08/18/04/"
    }

In [37]:
import boto3
import json
import datetime
from io import StringIO
from smart_open import smart_open
s3_client=boto3.client('s3')
output_path=event['RTReportPath']
inputJsonpath=event['prep_jsonlpath']
monitorjobname=event['MonitorJobName']
mdmonitorjobname=event['MD_MonitorJobName']
modelname=event['modelname']
mtrrefpath=event['mtrrefpath']
v_s3_input_bucket="wi-cred-datalake-dev-raw"
print("hirealtimereport")
response = s3_client.list_objects(Bucket=v_s3_input_bucket,Prefix=inputJsonpath)
print("hirealtimereport_1")
file_list=list()
for i in response['Contents']:
    filepath= 's3://'+ v_s3_input_bucket+ '/'+i['Key']
    file_list.append(filepath)
for i in file_list:
    filename=i.split('/')[-1][0:-6]
    jsonl_to_csv(i,v_s3_input_bucket,output_path,filename,modelname,monitorjobname,mtrrefpath,mdmonitorjobname)

hirealtimereport
hirealtimereport_1


In [27]:
!pip install smart_open

Collecting smart_open
  Downloading smart_open-6.0.0-py3-none-any.whl (58 kB)
Installing collected packages: smart-open
Successfully installed smart-open-6.0.0
