#########################
# Make sure you are using the `conda_python3` Jupyter Kernel for SageMaker!
## We will install the necessary libraries.

#########################

# TensorFlow Training and Serving in SageMaker "Script Mode"

Script mode is a training script format for TensorFlow that lets you execute any TensorFlow training script in SageMaker with minimal modification. The [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native TensorFlow support sets up training-related environment variables and executes your training script. In this tutorial, we use the SageMaker Python SDK to launch a training job and deploy the trained model.

Script mode supports training with a Python script, a Python module, or a shell script. In this example, we use a Python script to train a classification model on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). It shows how you can train a model in SageMaker using TensorFlow 1.x and TensorFlow 2.x scripts with the SageMaker Python SDK. In addition, this notebook demonstrates how to perform real-time inference with the [SageMaker TensorFlow Serving container](https://github.com/aws/sagemaker-tensorflow-serving-container). The TensorFlow Serving container is the default inference method for script mode. For full documentation on the TensorFlow Serving container, see the [deployment guide](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst).


# Set up the environment

Let's start by setting up the environment:

In [None]:
!pip uninstall -y tensorflow tensorflow-estimator tb-nightly tf-estimator-nightly tensorboard tensorboardX tensorflow-hub tensorflow-metadata tfds-nightly

In [None]:
!pip3 uninstall -y tensorflow tensorflow-estimator tb-nightly tf-estimator-nightly tensorboard tensorboardX tensorflow-hub tensorflow-metadata tfds-nightly

In [None]:
!pip list

In [None]:
!pip3 install tensorflow==2.0.0b1 --upgrade --ignore-installed --no-cache --user # tensorboard==1.14.0

In [None]:
!pip3 install sagemaker --upgrade --ignore-installed --no-cache --user

In [None]:
!pip install --upgrade --ignore-installed --no-cache stepfunctions

In [None]:
!pip3 install requests==2.20.1 --user

### Restart the Kernel to Recognize New Dependencies Above

In [None]:
from IPython.display import display_html
display_html("<script>Jupyter.notebook.kernel.restart()</script>", raw=True)

In [None]:
!pip3 list

## Add a policy to your SageMaker role in IAM

**If you are running this notebook on an Amazon SageMaker notebook instance**, the IAM role assumed by your notebook instance needs permission to create and run workflows in AWS Step Functions. To provide this permission to the role, do the following.

1. Open the Amazon [SageMaker console](https://console.aws.amazon.com/sagemaker/). 
2. Select **Notebook instances** and choose the name of your notebook instance
3. Under **Permissions and encryption** select the role ARN to view the role on the IAM console
4. Choose **Attach policies** and search for `AWSStepFunctionsFullAccess`.
5. Select the check box next to `AWSStepFunctionsFullAccess` and choose **Attach policy**

If you are running this notebook in a local environment, the SDK will use your configured AWS CLI configuration. For more information, see [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).

Next, create an execution role in IAM for Step Functions. 

### Create an execution role for Step Functions

You need an execution role so that you can create and execute workflows in Step Functions.

1. Go to the [IAM console](https://console.aws.amazon.com/iam/)
2. Select **Roles** and then **Create role**.
3. Under **Choose the service that will use this role** select **Step Functions**
4. Choose **Next** until you can enter a **Role name**
5. Enter a name such as `StepFunctionsWorkflowExecutionRole` and then select **Create role**


Attach a policy to the role you created. The following steps attach a broadly scoped policy (most of its statements use `"Resource": "*"`). As a good practice, you should grant access only to the resources you need.

1. Under the **Permissions** tab, click **Add inline policy**
2. Enter the following in the **JSON** tab

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateTransformJob",
                "sagemaker:DescribeTransformJob",
                "sagemaker:StopTransformJob",
                "sagemaker:CreateTrainingJob",
                "sagemaker:DescribeTrainingJob",
                "sagemaker:StopTrainingJob",
                "sagemaker:CreateHyperParameterTuningJob",
                "sagemaker:DescribeHyperParameterTuningJob",
                "sagemaker:StopHyperParameterTuningJob",
                "sagemaker:CreateModel",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:CreateEndpoint",
                "sagemaker:DeleteEndpointConfig",
                "sagemaker:DeleteEndpoint",
                "sagemaker:UpdateEndpoint",
                "sagemaker:ListTags",
                "lambda:InvokeFunction",
                "sqs:SendMessage",
                "sns:Publish",
                "ecs:RunTask",
                "ecs:StopTask",
                "ecs:DescribeTasks",
                "dynamodb:GetItem",
                "dynamodb:PutItem",
                "dynamodb:UpdateItem",
                "dynamodb:DeleteItem",
                "batch:SubmitJob",
                "batch:DescribeJobs",
                "batch:TerminateJob",
                "glue:StartJobRun",
                "glue:GetJobRun",
                "glue:GetJobRuns",
                "glue:BatchStopJobRun"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "sagemaker.amazonaws.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "events:PutTargets",
                "events:PutRule",
                "events:DescribeRule"
            ],
            "Resource": [
                "arn:aws:events:*:*:rule/StepFunctionsGetEventsForSageMakerTrainingJobsRule",
                "arn:aws:events:*:*:rule/StepFunctionsGetEventsForSageMakerTransformJobsRule",
                "arn:aws:events:*:*:rule/StepFunctionsGetEventsForSageMakerTuningJobsRule",
                "arn:aws:events:*:*:rule/StepFunctionsGetEventsForECSTaskRule",
                "arn:aws:events:*:*:rule/StepFunctionsGetEventsForBatchJobsRule"
            ]
        }
    ]
}
```

3. Choose **Review policy** and give the policy a name such as `StepFunctionsWorkflowExecutionPolicy`
4. Choose **Create policy**. You will be redirected to the details page for the role.
5. Copy the **Role ARN** at the top of the **Summary**
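
To illustrate the least-privilege advice above, here is a minimal sketch that trims an IAM policy document down to the actions of selected services. The helper name `restrict_actions` and the shortened sample policy are hypothetical. This is not an official AWS tool, and the resulting policy has not been validated against a running workflow.

```python
import copy
import json


def restrict_actions(policy, allowed_prefixes):
    """Return a copy of `policy` keeping only actions whose service prefix is allowed.

    Statements left with no actions are dropped entirely.
    """
    trimmed = copy.deepcopy(policy)
    kept = []
    for statement in trimmed["Statement"]:
        actions = statement["Action"]
        if isinstance(actions, str):  # IAM also allows a single string here
            actions = [actions]
        actions = [a for a in actions if a.split(":")[0] in allowed_prefixes]
        if actions:
            statement["Action"] = actions
            kept.append(statement)
    trimmed["Statement"] = kept
    return trimmed


# Shortened, illustrative subset of the policy shown above.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateTrainingJob",
                "sagemaker:CreateModel",
                "lambda:InvokeFunction",
                "glue:StartJobRun",
            ],
            "Resource": "*",
        },
        {"Effect": "Allow", "Action": ["iam:PassRole"], "Resource": "*"},
    ],
}

narrow = restrict_actions(sample_policy, {"sagemaker", "iam"})
print(json.dumps(narrow, indent=4))
```

The original policy object is left unchanged, so you can compare the two before pasting anything into the IAM console.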

### Import the required modules & create the AWS SageMaker execution role

Now import the required modules from the Step Functions SDK and AWS SageMaker, configure an S3 bucket, and get the AWS SageMaker execution role.

In [2]:
import os
import sagemaker
import stepfunctions
import logging

from stepfunctions.template.pipeline import TrainingPipeline

sagemaker_session = sagemaker.Session()
stepfunctions.set_stream_logger(level=logging.INFO)

#bucket = sagemaker_session.default_bucket()
#prefix = 'sagemaker/DEMO-tensorflow-mnist'

# SageMaker Execution Role
# You can use sagemaker.get_execution_role() if running inside sagemaker's notebook instance
sagemaker_execution_role = sagemaker.get_execution_role() #Replace with ARN if not in an AWS SageMaker notebook

## Setup the Service Execution Role and Region
Get the IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these roles. Note: if more than one role is required for notebook instances, training, and/or hosting, replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).

In [3]:
#role = get_execution_role()
print('RoleARN:  {}\n'.format(sagemaker_execution_role))

region = sagemaker_session.boto_session.region_name
print('Region:  {}'.format(region))

RoleARN:  arn:aws:iam::835319576252:role/service-role/AmazonSageMaker-ExecutionRole-20191006T135881

Region:  us-east-1


## Training Data

The MNIST dataset has been loaded to the public S3 buckets ``sagemaker-sample-data-<REGION>`` under the prefix ``tensorflow/mnist``.

In [4]:
original_training_data_uri = 's3://sagemaker-sample-data-{}/tensorflow/mnist'.format(region)
print(original_training_data_uri)

s3://sagemaker-sample-data-us-east-1/tensorflow/mnist


### Copy the Training Data to Your Notebook Disk

In [5]:
local_data_path = './data'

In [6]:
!aws --region {region} s3 cp --recursive {original_training_data_uri} {local_data_path}

download: s3://sagemaker-sample-data-us-east-1/tensorflow/mnist/eval_labels.npy to data/eval_labels.npy
download: s3://sagemaker-sample-data-us-east-1/tensorflow/mnist/train_labels.npy to data/train_labels.npy
download: s3://sagemaker-sample-data-us-east-1/tensorflow/mnist/eval_data.npy to data/eval_data.npy
download: s3://sagemaker-sample-data-us-east-1/tensorflow/mnist/train_data.npy to data/train_data.npy


There are four ``.npy`` files under this prefix:
* ``train_data.npy``
* ``eval_data.npy``
* ``train_labels.npy``
* ``eval_labels.npy``

In [7]:
!ls {local_data_path}

eval_data.npy  eval_labels.npy	train_data.npy	train_labels.npy


### Upload the Data to S3 for Distributed Training Across Many Workers
We are going to use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value, stored in `training_data_uri`, identifies that location; we will use it later when we start the training job.

The cell below sets the S3 bucket and prefix to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting.

In [8]:
bucket = sagemaker_session.default_bucket()
data_prefix = 'sagemaker/tensorflow-mnist/data'

In [9]:
training_data_uri = sagemaker_session.upload_data(path=local_data_path, bucket=bucket, key_prefix=data_prefix)
print(training_data_uri)

s3://sagemaker-us-east-1-835319576252/sagemaker/tensorflow-mnist/data


In [10]:
!aws s3 ls --recursive {training_data_uri}

2020-02-23 19:14:35   31360128 sagemaker/tensorflow-mnist/data/eval_data.npy
2020-02-23 19:14:35      40128 sagemaker/tensorflow-mnist/data/eval_labels.npy
2020-02-23 19:14:35  172480128 sagemaker/tensorflow-mnist/data/train_data.npy
2020-02-23 19:14:35     220128 sagemaker/tensorflow-mnist/data/train_labels.npy
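
As a quick sanity check, the file sizes in the listing above can be turned into example counts. This is a minimal sketch under stated assumptions: a 128-byte `.npy` header, 4-byte values (`float32` pixels and `int32` labels), and 28x28 = 784 pixels per MNIST image. None of these are confirmed by the notebook, so treat the result as a plausibility check only.

```python
# Sizes in bytes, copied from the `aws s3 ls` listing above.
sizes = {
    "train_data.npy": 172480128,
    "eval_data.npy": 31360128,
    "train_labels.npy": 220128,
    "eval_labels.npy": 40128,
}

HEADER_BYTES = 128  # assumption: size of the .npy header
PIXELS = 28 * 28    # MNIST images are 28x28 = 784 pixels
ITEM_BYTES = 4      # assumption: float32 pixels and int32 labels


def row_count(size_bytes, values_per_row):
    """Number of rows implied by a file size under the assumed layout."""
    payload = size_bytes - HEADER_BYTES
    row_bytes = values_per_row * ITEM_BYTES
    if payload % row_bytes:
        raise ValueError("size is not consistent with the assumed layout")
    return payload // row_bytes


counts = {
    name: row_count(size, PIXELS if name.endswith("_data.npy") else 1)
    for name, size in sizes.items()
}
print(counts)
```

Under these assumptions the sizes divide evenly, and the data and label files agree on the number of examples in each split.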


## Train
https://sagemaker.readthedocs.io/en/stable/using_tf.html#distributed-training

This tutorial's training script was adapted from TensorFlow's official [CNN MNIST example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/layers/cnn_mnist.py). We have modified it to handle the ``model_dir`` parameter passed in by SageMaker. This is an S3 path which can be used for data sharing during distributed training and checkpointing and/or model persistence. We have also added an argument-parsing function to handle processing training-related variables.

At the end of the training job we have added a step to export the trained model to the path stored in the environment variable ``SM_MODEL_DIR``, which always points to ``/opt/ml/model``. This is critical because SageMaker uploads all the model artifacts in this folder to S3 at the end of training.
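
The argument handling described above might look like the following minimal sketch. It is not the notebook's actual training script: the function name `parse_args`, the `--epochs` hyperparameter, and the `SM_CHANNEL_TRAINING` variable (assumed to hold the local path of a channel named `training`) are illustrative assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # S3 path passed in by SageMaker (see the description above).
    parser.add_argument("--model_dir", type=str, default=None)
    # Hypothetical hyperparameter, for illustration only.
    parser.add_argument("--epochs", type=int, default=1)
    # Local directory whose contents SageMaker uploads to S3 after training.
    parser.add_argument(
        "--sm_model_dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    # Assumed environment variable for the local path of the "training" channel.
    parser.add_argument(
        "--train", type=str, default=os.environ.get("SM_CHANNEL_TRAINING")
    )
    # Ignore unrecognized arguments instead of failing.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--model_dir", "s3://example-bucket/model", "--epochs", "3"])
print(args.model_dir, args.epochs, args.sm_model_dir)
```

Reading defaults from the environment keeps the script runnable outside SageMaker, where those variables are not set.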

### Training Script

In [11]:
!ls ./src/mnist_keras_tf2.py

./src/mnist_keras_tf2.py


You can add custom Python modules to the `src/requirements.txt` file. They will be installed automatically and made available to your training script.

In [12]:
!cat ./src/requirements.txt

# Python dependencies go here...

## Use Step Functions to run training in SageMaker


### Train with SageMaker `TensorFlow` Estimator

https://sagemaker.readthedocs.io/en/stable/using_tf.html#distributed-training

The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container, uploading your script to an S3 location, and creating a SageMaker training job. Let's call out a couple of important parameters here:

* `py_version` is set to `'py3'` to indicate that we are using script mode since legacy mode supports only Python 2. Though Python 2 will be deprecated soon, you can use script mode with Python 2 by setting `py_version` to `'py2'` and `script_mode` to `True`.

* `distributions` is used to configure the distributed training setup. It's required only if you are doing distributed training either across a cluster of instances or across multiple GPUs. Here we are using parameter servers as the distributed training schema. SageMaker training jobs run on homogeneous clusters. To make parameter server more performant in the SageMaker setup, we run a parameter server on every instance in the cluster, so there is no need to specify the number of parameter servers to launch. Script mode also supports distributed training with [Horovod](https://github.com/horovod/horovod). You can find the full documentation on how to configure `distributions` [here](https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#distributed-training). 

Notes:  
* This example was written for two (2) `ml.p3.2xlarge` instances, which will likely require a SageMaker instance limit increase from Support.

* The cell below instead uses a single `ml.c5.2xlarge` instance. Adjust `train_instance_count` and `train_instance_type` as needed.

* To recognize the `requirements.txt`, we must include `src/setup.py` per [this](https://github.com/aws/sagemaker-python-sdk/issues/911) GitHub issue.
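
For comparison, the two `distributions` settings mentioned above can be sketched as plain dictionaries. The parameter-server form is the one used in this notebook. The Horovod form follows what is assumed to be the `mpi` convention of SageMaker Python SDK 1.x, and the helper `make_distributions` and its `processes_per_host` value are hypothetical, so check the linked documentation before relying on it.

```python
def make_distributions(strategy, processes_per_host=1):
    """Build a `distributions` dict for the TensorFlow estimator.

    `strategy` is "parameter_server" or "horovod" (names chosen for this sketch).
    """
    if strategy == "parameter_server":
        return {"parameter_server": {"enabled": True}}
    if strategy == "horovod":
        # Assumption: Horovod jobs are configured through the "mpi" key.
        return {"mpi": {"enabled": True, "processes_per_host": processes_per_host}}
    raise ValueError("unknown strategy: {}".format(strategy))


ps_config = make_distributions("parameter_server")
hvd_config = make_distributions("horovod", processes_per_host=2)
print(ps_config)
print(hvd_config)
```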

In [13]:
from sagemaker.tensorflow import TensorFlow

# Output Bucket is defined below
#model_output_path = 's3://{}/sagemaker/tensorflow-mnist/training-runs'.format(bucket)

mnist_estimator = TensorFlow(entry_point='mnist_keras_tf2.py',
                             source_dir='./src',
#                             output_path=model_output_path,
                             role=sagemaker_execution_role,
                             train_instance_count=1,
                             train_instance_type='ml.c5.2xlarge',
                             framework_version='2.0.0',
                             py_version='py3',
                             enable_sagemaker_metrics=True,
                             script_mode=True,
                             distributions={'parameter_server': {'enabled': True}},
                             # Assuming the loglines from the TensorFlow training job are as follows:
                             #    Test loss    : 0.0635053280624561
                             #    Test accuracy: 0.9821
                             metric_definitions=[
                                 {'Name': 'test:loss', 'Regex': 'Test loss    : ([0-9\\.]+)'},
                                 {'Name': 'test:accuracy', 'Regex': 'Test accuracy: ([0-9\\.]+)'},
                             ])
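
To check the `metric_definitions` regular expressions before launching a job, you can run them locally against sample log lines. This is a minimal sketch using Python's `re` module. SageMaker applies the patterns on its side, so this only approximates that behavior, and the helper `extract_metrics` is hypothetical.

```python
import re

metric_definitions = [
    {"Name": "test:loss", "Regex": "Test loss    : ([0-9\\.]+)"},
    {"Name": "test:accuracy", "Regex": "Test accuracy: ([0-9\\.]+)"},
]

# Sample log lines taken from the comment in the estimator cell above.
log_lines = [
    "Test loss    : 0.0635053280624561",
    "Test accuracy: 0.9821",
]


def extract_metrics(lines, definitions):
    """Return {metric name: value} for every definition that matches a line."""
    found = {}
    for line in lines:
        for definition in definitions:
            match = re.search(definition["Regex"], line)
            if match:
                found[definition["Name"]] = float(match.group(1))
    return found


metrics = extract_metrics(log_lines, metric_definitions)
print(metrics)
```

Note that the patterns depend on the exact spacing of the log lines, so a change to the print format in the training script would silently stop the metric from being captured.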

### Build a training pipeline with the Step Functions SDK

A typical task for a data scientist is to train a model and deploy that model to an endpoint. Without the Step Functions SDK, this is a four-step process on SageMaker:

1. Training the model
2. Creating the model on SageMaker
3. Creating an endpoint configuration
4. Deploying the trained model to the configured endpoint

The Step Functions SDK provides the [TrainingPipeline](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/pipelines.html#stepfunctions.template.pipeline.train.TrainingPipeline) API to simplify this procedure. The following configures `pipeline` with the necessary parameters to define a training pipeline.

In [14]:
print(training_data_uri)

s3://sagemaker-us-east-1-835319576252/sagemaker/tensorflow-mnist/data


In [15]:
# paste the StepFunctionsWorkflowExecutionRole ARN from above
workflow_execution_role = "arn:aws:iam::835319576252:role/StepFunctionsWorkflowExecutionRole" 

In [16]:
model_output_s3_bucket = '{}/sagemaker/tensorflow-mnist/training-runs'.format(bucket)
print(model_output_s3_bucket)

sagemaker-us-east-1-835319576252/sagemaker/tensorflow-mnist/training-runs


In [17]:
pipeline = TrainingPipeline(
    estimator=mnist_estimator,
    role=workflow_execution_role,
    inputs=training_data_uri,
    s3_bucket=model_output_s3_bucket
)

### Visualize the pipeline

You can now view the workflow definition, and also visualize it as a graph. This workflow and graph represent your training pipeline.

#### View the workflow definition

In [18]:
print(pipeline.workflow.definition.to_json(pretty=True))

{
    "StartAt": "Training",
    "States": {
        "Training": {
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Parameters": {
                "AlgorithmSpecification.$": "$$.Execution.Input['Training'].AlgorithmSpecification",
                "OutputDataConfig.$": "$$.Execution.Input['Training'].OutputDataConfig",
                "StoppingCondition.$": "$$.Execution.Input['Training'].StoppingCondition",
                "ResourceConfig.$": "$$.Execution.Input['Training'].ResourceConfig",
                "RoleArn.$": "$$.Execution.Input['Training'].RoleArn",
                "InputDataConfig.$": "$$.Execution.Input['Training'].InputDataConfig",
                "HyperParameters.$": "$$.Execution.Input['Training'].HyperParameters",
                "TrainingJobName.$": "$$.Execution.Input['Training'].TrainingJobName"
            },
            "Type": "Task",
            "Next": "Create Model"
        },
        "Create Model": {
            "Par

#### Visualize the workflow graph

In [19]:
pipeline.render_graph()
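
When the rendered graph is not available (for example, outside Jupyter), the order of states can be read from the workflow definition by following `Next` links from `StartAt`. The following is a minimal sketch using a hypothetical two-state definition that keeps only the state names visible in the output above. Marking `Create Model` as terminal is a simplification, since the real definition continues past it.

```python
def state_order(definition):
    """Follow `Next` links from `StartAt` and return state names in order."""
    order = []
    name = definition["StartAt"]
    while name is not None and name not in order:  # guard against cycles
        order.append(name)
        name = definition["States"][name].get("Next")
    return order


toy_definition = {
    "StartAt": "Training",
    "States": {
        "Training": {"Type": "Task", "Next": "Create Model"},
        # Simplification: the real pipeline has further states after this one.
        "Create Model": {"Type": "Task", "End": True},
    },
}

print(state_order(toy_definition))
```

With the real pipeline, parsing the JSON printed by `pipeline.workflow.definition.to_json(pretty=True)` with `json.loads` should yield a dictionary of this shape. The sketch only handles linear workflows, not branching states.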

### Create and execute the pipeline on AWS Step Functions

Create the pipeline in AWS Step Functions with [create](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Workflow.create).

In [20]:
pipeline.create()

[INFO] Workflow created successfully on AWS Step Functions.


'arn:aws:states:us-east-1:835319576252:stateMachine:training-pipeline-2020-02-23-19-14-47'

Run the workflow with [execute](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Workflow.execute). A link will be provided after the following cell is executed. Following this link, you can monitor your pipeline execution on Step Functions' console.

In [21]:
execution = pipeline.execute()

[INFO] Workflow execution started successfully on AWS Step Functions.


In [28]:
execution.render_progress()

In [94]:
import json
events = execution.list_events()

endpoint_name = json.loads(events[18]['taskScheduledEventDetails']['parameters'])['EndpointName']

In [95]:
event_output = json.loads(events[21]['stateExitedEventDetails']['output'])
event_output

{'EndpointArn': 'arn:aws:sagemaker:us-east-1:835319576252:endpoint/training-pipeline-2020-02-23-19-15-03',
 'SdkHttpMetadata': {'HttpHeaders': {'Content-Length': '105',
   'Content-Type': 'application/x-amz-json-1.1',
   'Date': 'Sun, 23 Feb 2020 19:17:54 GMT',
   'x-amzn-RequestId': '8ed6f43f-5389-47a7-b5aa-1efe325f1b45'},
  'HttpStatusCode': 200},
 'SdkResponseMetadata': {'RequestId': '8ed6f43f-5389-47a7-b5aa-1efe325f1b45'}}

In [96]:
endpoint_arn = event_output['EndpointArn']

In [98]:
# TODO:  Retrieve the predictor from the pipeline/workflow above
# predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.c5.2xlarge')

predictor = sagemaker.predictor.RealTimePredictor(endpoint=endpoint_name)

### Invoke the Endpoint

Let's load the training data that we copied to the notebook disk earlier and use it as input for inference.

In [99]:
import numpy as np

train_data = np.load('{}/train_data.npy'.format(local_data_path))
train_labels = np.load('{}/train_labels.npy'.format(local_data_path))

The formats of the input and the output data correspond directly to the request and response formats of the `Predict` method in the [TensorFlow Serving REST API](https://www.tensorflow.org/serving/api_rest). SageMaker's TensorFlow Serving endpoints can also accept additional input formats that are not part of the TensorFlow REST API, including the simplified JSON format, line-delimited JSON objects ("jsons" or "jsonlines"), and CSV data.

In this example we are using a `numpy` array as input, which is meant to be serialized into the simplified JSON format. TensorFlow Serving can also process multiple items at once, as you can see in the following code. You can find the complete documentation on how to make predictions against a TensorFlow Serving SageMaker endpoint [here](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#making-predictions-against-a-sagemaker-endpoint).

Examine the prediction result from the TensorFlow 2.0 model.



In [100]:
predictions = predictor.predict(train_data[:50])
for i in range(0, 50):
    prediction = predictions['predictions'][i]
    label = train_labels[i]
    print('Prediction is {}, Label is {}, Matched: {}'.format(prediction, label, prediction == label))

ParamValidationError: Parameter validation failed:
Invalid type for parameter Body, value: [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]], type: <class 'numpy.ndarray'>, valid types: <class 'bytes'>, <class 'bytearray'>, file-like object
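
The `ParamValidationError` above indicates that the request body reached the endpoint client as a raw `numpy.ndarray`, whereas only bytes, a bytearray, or a file-like object are accepted. One likely cause is that the predictor was created without a serializer. The following is a minimal sketch of the missing conversion, with a hypothetical helper `to_json_body`. The way to wire a serializer into the predictor is noted in the comments as an assumption about SageMaker Python SDK 1.x and has not been verified against this endpoint.

```python
import json


def to_json_body(batch):
    """Serialize a batch to UTF-8 encoded JSON bytes.

    Accepts a nested list or any object with a `tolist()` method,
    such as a numpy array.
    """
    rows = batch.tolist() if hasattr(batch, "tolist") else batch
    return json.dumps(rows).encode("utf-8")


body = to_json_body([[0.0, 0.5], [1.0, 0.0]])
print(type(body).__name__, body)

# Assumption (SDK 1.x): a predictor built with JSON (de)serializers performs
# this conversion itself, for example:
#   from sagemaker.predictor import (RealTimePredictor,
#                                    json_serializer, json_deserializer)
#   predictor = RealTimePredictor(endpoint=endpoint_name,
#                                 serializer=json_serializer,
#                                 deserializer=json_deserializer,
#                                 content_type='application/json')
```

The sketch does not address whether each returned prediction is a class label or a vector of scores; inspect the response before comparing it with `train_labels`.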