## PyTorch 2 Complete Project Workflow in Amazon SageMaker

## Workflow Automation with the AWS Step Functions Data Science SDK <a class="anchor" id="WorkflowAutomation">

In the previous notebooks, we prototyped various steps of a PyTorch project within the notebook itself.  Notebooks are great for prototyping, but they are generally not used in production-ready machine learning pipelines.  For example, a simple pipeline in SageMaker includes the following steps:  

1. Training the model.
2. Creating a SageMaker Model object that wraps the model artifact for serving.
3. Creating a SageMaker Endpoint Configuration specifying how the model should be served (e.g. hardware type and amount).
4. Deploying the trained model to the configured SageMaker Endpoint.  

The AWS Step Functions Data Science SDK automates the process of creating and running these kinds of workflows using AWS Step Functions and SageMaker.  It does this by allowing you to create workflows using short, simple Python scripts that define workflow steps and chain them together.  Under the hood, all the workflow steps are coordinated by AWS Step Functions without any need for you to manage the underlying infrastructure.  
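
To make "under the hood" more concrete, here is a minimal sketch of the kind of state machine definition that chaining the four steps above produces, written in plain Python with no AWS calls. The `chain_states` helper and the state names are hypothetical. The `Resource` values are the Step Functions service integration identifiers for SageMaker. The definition the SDK actually generates also carries a `Parameters` block for each state, which is omitted here.

```python
import json

# Step name -> Step Functions service integration for SageMaker.
# The ".sync" suffix makes Step Functions wait for the job to finish.
PIPELINE = [
    ("Train Model", "arn:aws:states:::sagemaker:createTrainingJob.sync"),
    ("Create Model", "arn:aws:states:::sagemaker:createModel"),
    ("Create Endpoint Config", "arn:aws:states:::sagemaker:createEndpointConfig"),
    ("Create Endpoint", "arn:aws:states:::sagemaker:createEndpoint"),
]


def chain_states(pipeline):
    """Link Task states in order: each points to the next, the last one ends."""
    states = {}
    for index, (name, resource) in enumerate(pipeline):
        state = {"Type": "Task", "Resource": resource}
        if index + 1 < len(pipeline):
            state["Next"] = pipeline[index + 1][0]
        else:
            state["End"] = True
        states[name] = state
    return {"StartAt": pipeline[0][0], "States": states}


definition = chain_states(PIPELINE)
print(json.dumps(definition, indent=2))
```

The SDK's `Chain` class, used later in this notebook, plays the role of `chain_states` here.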

To begin, install the Step Functions Data Science SDK:  

In [None]:
import sys

!{sys.executable} -m pip install --quiet --upgrade stepfunctions

First, we'll import the variables stored from previous notebooks.

In [None]:
%store -r

### Add an IAM policy to your SageMaker role <a class="anchor" id="IAMPolicy">

**If you are running this notebook on an Amazon SageMaker notebook instance**, the IAM role assumed by your notebook instance needs permission to create and run workflows in AWS Step Functions. To provide this permission to the role, do the following.

1. Open the Amazon [SageMaker console](https://console.aws.amazon.com/sagemaker/). 
2. Select **Notebook instances** and choose the name of your notebook instance
3. Under **Permissions and encryption** select the role ARN to view the role on the IAM console
4. Choose **Attach policies** and search for `AWSStepFunctionsFullAccess`.
5. Select the check box next to `AWSStepFunctionsFullAccess` and choose **Attach policy**

If you are running this notebook in a local environment, the SDK will use your configured AWS CLI configuration. For more information, see [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).


### Create an execution role for Step Functions <a class="anchor" id="CreateExecutionRole">

You also need to create an execution role for Step Functions to enable that service to access SageMaker and other service functionality.

1. Go to the [IAM console](https://console.aws.amazon.com/iam/)
2. Select **Roles** and then **Create role**.
3. Under **Choose the service that will use this role** select **Step Functions**
4. Choose **Next** until you can enter a **Role name**
5. Enter a name such as `StepFunctionsWorkflowExecutionRole` and then select **Create role**


Select your newly created role and attach a policy to it. The following steps attach a policy that grants broad access to the services Step Functions can orchestrate. As a good practice, you should grant access only to the resources you need.  

1. Under the **Permissions** tab, click **Add inline policy**
2. Enter the following in the **JSON** tab

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "events:PutTargets",
                "events:DescribeRule",
                "events:PutRule"
            ],
            "Resource": "arn:aws:events:*:*:rule/*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "sagemaker.amazonaws.com"
                }
            }
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeTrainingJob",
                "sagemaker:CreateModel",
                "sagemaker:DeleteEndpointConfig",
                "batch:SubmitJob",
                "dynamodb:DeleteItem",
                "sagemaker:StopProcessingJob",
                "sagemaker:CreateProcessingJob",
                "sagemaker:CreateTrainingJob",
                "batch:TerminateJob",
                "batch:DescribeJobs",
                "sagemaker:DescribeTransformJob",
                "sns:Publish",
                "ecs:RunTask",
                "sagemaker:StopHyperParameterTuningJob",
                "sagemaker:DeleteEndpoint",
                "dynamodb:GetItem",
                "glue:GetJobRun",
                "ecs:StopTask",
                "ecs:DescribeTasks",
                "sagemaker:CreateTransformJob",
                "sagemaker:ListTags",
                "sagemaker:CreateEndpoint",
                "dynamodb:PutItem",
                "lambda:InvokeFunction",
                "sqs:SendMessage",
                "dynamodb:UpdateItem",
                "glue:BatchStopJobRun",
                "sagemaker:StopTrainingJob",
                "sagemaker:DescribeHyperParameterTuningJob",
                "sagemaker:UpdateEndpoint",
                "sagemaker:CreateEndpointConfig",
                "glue:StartJobRun",
                "sagemaker:StopTransformJob",
                "sagemaker:CreateHyperParameterTuningJob",
                "glue:GetJobRuns"
            ],
            "Resource": "*"
        }
    ]
}
```

3. Choose **Review policy** and give the policy a name such as `StepFunctionsWorkflowExecutionPolicy`
4. Choose **Create policy**. You will be redirected to the details page for the role.
5. Copy the **Role ARN** at the top of the **Summary**
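
The inline policy above is deliberately broad. As a starting point for narrowing it, the sketch below keeps only the actions that belong to a chosen set of services. The `restrict_actions` helper is hypothetical, and the policy shown is an abbreviated copy of the one above. Filtering by service prefix is only a first pass: a production policy should also scope `Resource` to specific ARNs.

```python
import copy
import json

# Abbreviated copy of the broad policy above (illustration only).
broad_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["events:PutTargets", "events:DescribeRule", "events:PutRule"],
            "Resource": "arn:aws:events:*:*:rule/*",
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "*",
            "Condition": {
                "StringEquals": {"iam:PassedToService": "sagemaker.amazonaws.com"}
            },
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateProcessingJob",
                "sagemaker:CreateTrainingJob",
                "sagemaker:CreateModel",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:CreateEndpoint",
                "batch:SubmitJob",
                "dynamodb:PutItem",
                "glue:StartJobRun",
            ],
            "Resource": "*",
        },
    ],
}


def restrict_actions(policy, allowed_services):
    """Return a copy of the policy keeping only actions of the given services."""
    narrowed = copy.deepcopy(policy)
    kept_statements = []
    for statement in narrowed["Statement"]:
        actions = statement["Action"]
        if isinstance(actions, str):  # IAM allows a single action as a string
            actions = [actions]
        kept = [a for a in actions if a.split(":", 1)[0] in allowed_services]
        if kept:  # drop statements left with no actions
            statement["Action"] = kept
            kept_statements.append(statement)
    narrowed["Statement"] = kept_statements
    return narrowed


narrowed_policy = restrict_actions(broad_policy, {"events", "iam", "sagemaker"})
print(json.dumps(narrowed_policy, indent=2))
```

The `events` and `iam` statements are kept here on the assumption that the synchronous SageMaker integrations and role passing in this pipeline depend on them. Test any narrowed policy against a real workflow run before relying on it.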

In [None]:
# Paste the StepFunctionsWorkflowExecutionRole ARN from above
workflow_execution_role = 'arn:aws:iam::001772452635:role/StepFunctionsWorkflowExecutionRole'
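
A mistyped role ARN typically surfaces only later, when the workflow is created, so it can help to check the pasted value first. This is a minimal sketch: the `is_iam_role_arn` helper is hypothetical, and the pattern covers only the standard `aws` partition with a 12-digit account ID.

```python
import re

# Standard-partition IAM role ARN: 12-digit account ID, optional path, role name.
ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::\d{12}:role/[\w+=,.@/-]+$")


def is_iam_role_arn(value):
    """Return True if the string looks like an IAM role ARN."""
    return bool(ROLE_ARN_PATTERN.match(value))


# 123456789012 is a placeholder account ID, not a real account.
print(is_iam_role_arn("arn:aws:iam::123456789012:role/StepFunctionsWorkflowExecutionRole"))
print(is_iam_role_arn("StepFunctionsWorkflowExecutionRole"))
```

The check is purely syntactic. It does not confirm that the role exists or that Step Functions is allowed to assume it.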

### Set up a TrainingPipeline <a class="anchor" id="TrainingPipeline">

You can use a state machine workflow to create a model retraining pipeline. The AWS Step Functions Data Science SDK provides several Amazon SageMaker workflow steps that you can use to construct an ML pipeline. In this tutorial you will create the following steps:
    
[ProcessingStep](https://aws-step-functions-data-science-sdk.readthedocs.io/en/stable/sagemaker.html#stepfunctions.steps.sagemaker.ProcessingStep) - Preprocesses the dataset by running a SageMaker Processing job.

[TrainingStep](https://aws-step-functions-data-science-sdk.readthedocs.io/en/stable/sagemaker.html#stepfunctions.steps.sagemaker.TrainingStep) - Creates the training step and passes the defined estimator.

[ModelStep](https://aws-step-functions-data-science-sdk.readthedocs.io/en/stable/sagemaker.html#stepfunctions.steps.sagemaker.ModelStep) - Creates a model in SageMaker using the artifacts created during the TrainingStep.

[EndpointConfigStep](https://aws-step-functions-data-science-sdk.readthedocs.io/en/stable/sagemaker.html#stepfunctions.steps.sagemaker.EndpointConfigStep) - Creates the endpoint config step to define the new configuration for our endpoint.

[EndpointStep](https://aws-step-functions-data-science-sdk.readthedocs.io/en/stable/sagemaker.html#stepfunctions.steps.sagemaker.EndpointStep) - Creates the endpoint step, which creates our model endpoint (or updates an existing one when `update=True`).

The following code cells import the required classes and then define each step of this simple pipeline:

In [None]:
from stepfunctions import steps
from sagemaker.sklearn.processing import SKLearnProcessor
from stepfunctions.steps.sagemaker import ProcessingStep
from sagemaker.processing import ProcessingInput, ProcessingOutput
from stepfunctions.steps import TrainingStep, ModelStep
from stepfunctions.inputs import ExecutionInput
from stepfunctions.workflow import Workflow

We will use the uuid library to create a unique ID to track the different resources for each workflow run.

In [None]:
from uuid import uuid4
id = uuid4().hex
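
Because the same ID is reused in the training job, model, and endpoint names, it is convenient to build those names in one place and check them against SageMaker's naming rules (alphanumeric characters and hyphens, at most 63 characters). The `make_resource_name` helper and the `pytorch-workflow` prefix below are hypothetical and are not part of the SDK.

```python
import re
from uuid import uuid4

# SageMaker resource names: alphanumerics and hyphens, starting and
# ending with an alphanumeric character, at most 63 characters long.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
MAX_NAME_LENGTH = 63


def make_resource_name(prefix, run_id, suffix):
    """Join the parts with hyphens and fail early if the result is invalid."""
    name = "{}-{}-{}".format(prefix, run_id, suffix)
    if len(name) > MAX_NAME_LENGTH or not NAME_PATTERN.match(name):
        raise ValueError("Invalid SageMaker resource name: {!r}".format(name))
    return name


run_id = uuid4().hex  # 32 hexadecimal characters
names = {
    kind: make_resource_name("pytorch-workflow", run_id, kind)
    for kind in ("Train", "Model", "Endpoint")
}
print(names)
```

Failing before the workflow starts is cheaper than discovering an invalid name after preprocessing and training have already run.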

In [None]:
# SageMaker expects unique names for each job, model and endpoint. 
# If these names are not unique the execution will fail.
execution_input = ExecutionInput(schema={
    'TrainingJobName': str,
    'ModelName': str,
    'EndpointName': str
})
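
The schema maps each placeholder name to the type its value must have when the workflow is executed. As a rough, SDK-independent illustration of that contract, the sketch below checks an input dictionary against a schema of the same shape. The `check_inputs` helper is hypothetical and is not how the SDK validates inputs internally.

```python
schema = {"TrainingJobName": str, "ModelName": str, "EndpointName": str}


def check_inputs(schema, inputs):
    """Return a list of problems; an empty list means the inputs fit the schema."""
    problems = []
    for key, expected_type in schema.items():
        if key not in inputs:
            problems.append("missing key: {}".format(key))
        elif not isinstance(inputs[key], expected_type):
            problems.append("{} should be {}".format(key, expected_type.__name__))
    for key in inputs:
        if key not in schema:
            problems.append("unexpected key: {}".format(key))
    return problems


# A deliberately incomplete input: wrong type for one key, another key missing.
print(check_inputs(schema, {"TrainingJobName": "job-1", "ModelName": 42}))
```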

In [None]:
import sagemaker
from sagemaker import get_execution_role
from time import gmtime, strftime

sess = sagemaker.Session()

# Create the SKLearn Processor
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=get_execution_role(),
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)

In [None]:
# Create the ProcessingStep
sf_processing_job_name = "pytorch-2-stepfunctions-processing-{}".format(strftime("%d-%H-%M-%S", gmtime()))
workflow_name = "preprocessing-step-workflow-{}".format(strftime("%d-%H-%M-%S", gmtime()))

code_uri = sess.upload_data("preprocessing.py", bucket=bucket, key_prefix="data/sklearn_processing/code")

output_destination = 's3://{}/{}/data'.format(bucket, s3_prefix)
train_destination = '{}/train'.format(output_destination)
test_destination = '{}/test'.format(output_destination)

preprocessing_step = ProcessingStep(
                    'Preprocessing',
                    job_name=sf_processing_job_name,
                    processor=sklearn_processor,
                    wait_for_completion=True,
                    inputs=[ProcessingInput(
                                source=raw_s3,
                                destination='/opt/ml/processing/input',
                                s3_data_distribution_type='ShardedByS3Key',
                                input_name='input-1'),
                            ProcessingInput(
                                source=code_uri,
                                destination='/opt/ml/processing/input/code',
                                input_name='code')],
                    outputs=[ProcessingOutput(output_name='test',
                                destination=test_destination,
                                source='/opt/ml/processing/test'),
                            ProcessingOutput(output_name='train',
                                destination=train_destination,
                                source='/opt/ml/processing/train')],
                    container_entrypoint=['python3', '/opt/ml/processing/input/code/preprocessing.py'])
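
The step above fixes a contract for the script: raw data appears under `/opt/ml/processing/input` inside the container, and whatever the script writes to `/opt/ml/processing/train` and `/opt/ml/processing/test` is uploaded to the two S3 destinations. The actual `preprocessing.py` comes from an earlier notebook and is not shown here, so the following is only a hypothetical stand-in that illustrates the contract with a plain CSV split. The file names, the 80/20 split, and the `split_rows` helper are all assumptions.

```python
import csv
import os
import tempfile


def split_rows(input_dir, train_dir, test_dir, train_fraction=0.8):
    """Read every CSV in input_dir and write train.csv / test.csv splits."""
    rows = []
    for file_name in sorted(os.listdir(input_dir)):
        if file_name.endswith(".csv"):
            with open(os.path.join(input_dir, file_name), newline="") as handle:
                rows.extend(csv.reader(handle))
    cutoff = int(len(rows) * train_fraction)
    for directory, name, part in (
        (train_dir, "train.csv", rows[:cutoff]),
        (test_dir, "test.csv", rows[cutoff:]),
    ):
        os.makedirs(directory, exist_ok=True)
        with open(os.path.join(directory, name), "w", newline="") as handle:
            csv.writer(handle).writerows(part)
    return cutoff, len(rows) - cutoff


# Local dry run in a temporary directory. Inside the container the three
# paths would be the /opt/ml/processing/... locations configured above.
base = tempfile.mkdtemp()
input_dir = os.path.join(base, "input")
os.makedirs(input_dir)
with open(os.path.join(input_dir, "raw.csv"), "w", newline="") as handle:
    csv.writer(handle).writerows([[i, i * 2] for i in range(10)])

counts = split_rows(input_dir, os.path.join(base, "train"), os.path.join(base, "test"))
print(counts)
```

Running the script against a temporary directory like this is a quick way to test preprocessing logic before paying for a Processing job.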

In [None]:
inputs = {'train': train_destination,
         'test': test_destination}

In [None]:
from sagemaker.pytorch import PyTorch

training_step = steps.TrainingStep(
    'Model Training', 
    estimator=PyTorch(**estimator_parameters),
    data=inputs,
    job_name=execution_input['TrainingJobName'],
    wait_for_completion=True
)

In [None]:
model_step = steps.ModelStep(
    'Save Model',
    model=training_step.get_expected_model(),
    model_name=execution_input['ModelName'],
    instance_type='ml.c5.xlarge'
)

In [None]:
training_step.get_expected_model()

In [None]:
endpoint_config_step = steps.EndpointConfigStep(
    "Create Model Endpoint Config",
    endpoint_config_name=execution_input['ModelName'],
    model_name=execution_input['ModelName'],
    initial_instance_count=1,
    instance_type='ml.c5.xlarge'
)

In [None]:
endpoint_step = steps.EndpointStep(
    'Create Inference Endpoint',
    endpoint_name=execution_input['EndpointName'],
    endpoint_config_name=execution_input['ModelName'],
    update=False
)

In [None]:
workflow_definition = steps.Chain([
    preprocessing_step,
    training_step,
    model_step,
    endpoint_config_step,
    endpoint_step
])

In [None]:
workflow = Workflow(
    name='MyTrainingRoutine{}'.format(id),
    definition=workflow_definition,
    role=workflow_execution_role,
    execution_input=execution_input
)

### Visualizing the workflow <a class="anchor" id="VisualizingWorkflow">

You can now view the workflow definition and visualize it as a graph. This workflow and graph represent your training pipeline, from preprocessing the data and training the model through deploying it.

In [None]:
print(workflow.definition.to_json(pretty=True))

In [None]:
workflow.render_graph()

### Creating and executing the pipeline <a class="anchor" id="CreatingExecutingPipeline">

Before the workflow can be run for the first time, the pipeline must be created using the `create` method:

In [None]:
workflow.create()

Now the workflow can be started by invoking the pipeline's `execute` method:

In [None]:
training_job_name = 'BostonHousing-{}-Train'.format(id)
model_name = 'BostonHousing-{}-Model'.format(id)
endpoint_name = 'BostonHousing-{}-Endpoint'.format(id)

execution = workflow.execute(
    inputs={
        'TrainingJobName': training_job_name, # Each SageMaker job requires a unique name
        'ModelName': model_name, # Each model requires a unique name
        'EndpointName': endpoint_name # Each Endpoint requires a unique name
    }
)

Use the `list_executions` method to list all executions for the workflow you created, including the one we just started.  After a pipeline is created, it can be executed as many times as needed, for example on a schedule for retraining on new data.  (For purposes of this notebook just execute the workflow one time to save resources.)  The output will include a list you can click through to access a view of the execution in the AWS Step Functions console.

In [None]:
workflow.list_executions(html=True)

While the workflow is running, you can check workflow progress inside this notebook with the `render_progress` method.  This generates a snapshot of the current state of your workflow as it executes. This is a static image. Run the cell again to check progress while the workflow is running.

In [None]:
execution.render_progress()

#### BEFORE proceeding with the rest of the notebook:

Wait until the workflow completes with status **Succeeded**, which will take a few minutes.  You can check status with `render_progress` above, or open in a new browser tab the **Inspect in AWS Step Functions** link in the cell output.  
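
If you would rather block until the workflow finishes than re-run the progress cell by hand, a small polling loop is enough. In this sketch, `wait_for_terminal_status` and the `get_status` callable are hypothetical. In practice `get_status` could wrap a Step Functions `DescribeExecution` call, whose `status` field takes values such as `RUNNING`, `SUCCEEDED`, `FAILED`, `TIMED_OUT`, and `ABORTED`. A canned sequence of statuses stands in for the service so the example runs without AWS access.

```python
import time

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def wait_for_terminal_status(get_status, poll_seconds=30, max_polls=120):
    """Call get_status() until it returns a terminal status or polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("Workflow did not finish within the polling budget")


# Stand-in for the real service: two RUNNING responses, then SUCCEEDED.
canned = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
final_status = wait_for_terminal_status(lambda: next(canned), poll_seconds=0)
print(final_status)
```

Capping the number of polls keeps a stuck workflow from blocking the notebook indefinitely.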

To view the details of the completed workflow execution, from model training through deployment, use the `list_events` method, which lists all events in the workflow execution.

In [None]:
execution.list_events(reverse_order=True, html=False)

Once we have the endpoint name, we can use it to instantiate a PyTorchPredictor object that wraps the endpoint.  This PyTorchPredictor can be used to make predictions, as shown in the following code cell.  

#### BEFORE running the following code cell:

Go to the [SageMaker console](https://console.aws.amazon.com/sagemaker/), click **Endpoints** in the left panel, and make sure that the endpoint status is **InService**.  If the status is **Creating**, wait until it changes, which may take several minutes.

In [None]:
import numpy as np
from sagemaker.deserializers import JSONDeserializer
from sagemaker.pytorch import PyTorchPredictor
from sagemaker.serializers import JSONSerializer

# Wrap the endpoint and exchange JSON in both directions.
# In SageMaker Python SDK v2, the serializer and deserializer classes
# set the request content type and the accepted response type.
workflow_predictor = PyTorchPredictor(
    endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

results = [workflow_predictor.predict(x_test[i]) for i in range(0, 10)]
print('predictions: \t{}'.format(np.array(results).round(decimals=1)))
print('target values: \t{}'.format(y_test[:10].round(decimals=1)))
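
Comparing ten predictions with ten targets by eye only goes so far, and a single error figure makes runs easier to compare. Below is a minimal sketch of root-mean-square error in plain Python. The `rmse` helper is hypothetical, and the numbers are placeholders for illustration, not output from this endpoint.

```python
import math


def rmse(predictions, targets):
    """Root-mean-square error between two equal-length sequences of numbers."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Placeholder numbers for illustration only, not results from this endpoint.
example_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(round(example_error, 3))
```

With the variables from the cell above, the call might look like `rmse(np.array(results).flatten(), y_test[:10])`, assuming each prediction is a single number.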

The AWS Step Functions Data Science SDK lets you create many other workflows to automate your machine learning tasks.  For example, you could create a workflow to automate model retraining on a periodic basis.  Such a workflow could include a test of model quality after training, with one branch for failing the test (no model deployment) and another for passing it (the model is deployed).  Other possible workflow steps include Automatic Model Tuning, data preprocessing with AWS Glue, and more.  

For a detailed example of a retraining workflow, see the AWS ML Blog post [Automating model retraining and deployment using the AWS Step Functions Data Science SDK for Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/automating-model-retraining-and-deployment-using-the-aws-step-functions-data-science-sdk-for-amazon-sagemaker/).
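
The quality gate described above corresponds to a Choice state in the Amazon States Language. The sketch below builds such a state as a plain Python dictionary. The state names, the `$.rmse` input path, and the threshold are hypothetical, and the SDK offers its own classes for building choice states, which are not shown here.

```python
import json


def quality_gate(metric_path, threshold, pass_state, fail_state):
    """Build a Choice state that deploys only when the metric is below threshold."""
    return {
        "Type": "Choice",
        "Choices": [
            {
                "Variable": metric_path,
                "NumericLessThan": threshold,
                "Next": pass_state,
            }
        ],
        "Default": fail_state,  # taken when no rule matches
    }


states = {
    "Check Model Quality": quality_gate("$.rmse", 5.0, "Deploy Model", "Reject Model"),
    "Reject Model": {
        "Type": "Fail",
        "Error": "ModelQualityTooLow",
        "Cause": "Metric did not meet the threshold",
    },
}
print(json.dumps(states, indent=2))
```

Routing the failing branch to a `Fail` state marks the execution as failed, so a rejected model is visible in the execution list rather than passing silently.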

### Cleanup <a class="anchor" id="Cleanup">

The workflow we created above deployed a model to an endpoint.  To avoid billing charges for an unused endpoint, you can delete it using the following code:

In [None]:
workflow_predictor.delete_endpoint(delete_endpoint_config=True)

## Extensions <a class="anchor" id="Extensions">

We've covered a lot of content in these notebooks:  SageMaker Processing for data transformation, Local Mode for prototyping training and inference code, Automatic Model Tuning, and SageMaker hosted training and inference.  These are central elements for most deep learning workflows in SageMaker.  Additionally, we examined how the AWS Step Functions Data Science SDK helps automate deep learning workflows after completion of the prototyping phase of a project.

Besides all of the SageMaker features explored above, there are many other features that may be applicable to your project.  For example, to handle common problems during deep learning model training such as vanishing or exploding gradients, **SageMaker Debugger** is useful.  To manage common problems such as data drift after a model is in production, **SageMaker Model Monitor** can be applied.