In [72]:
!pip install -q xgboost==0.90
!pip install -q stepfunctions

## Setup

### Add a policy to your SageMaker role in IAM

**If you are running this notebook on an Amazon SageMaker notebook instance**, the IAM role assumed by your notebook instance needs permission to create and run workflows in AWS Step Functions. To provide this permission to the role, do the following.

1. Open the Amazon [SageMaker console](https://console.aws.amazon.com/sagemaker/). 
2. Select **Notebook instances** and choose the name of your notebook instance
3. Under **Permissions and encryption** select the role ARN to view the role on the IAM console
4. Choose **Attach policies** and search for `AWSStepFunctionsFullAccess`.
5. Select the check box next to `AWSStepFunctionsFullAccess` and choose **Attach policy**

If you are running this notebook in a local environment, the SDK uses the credentials and settings from your AWS CLI configuration. For more information, see [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).

Next, create an execution role in IAM for Step Functions. 

### Create an execution role for Step Functions

You need an execution role so that you can create and execute workflows in Step Functions.

1. Go to the [IAM console](https://console.aws.amazon.com/iam/)
2. Select **Roles** and then **Create role**.
3. Under **Choose the service that will use this role** select **Step Functions**
4. Choose **Next** until you can enter a **Role name**
5. Enter a name such as `StepFunctionsWorkflowExecutionRole` and then select **Create role**
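
Choosing **Step Functions** as the trusted service in the console steps above amounts to attaching a trust policy to the role. As a rough sketch, the document looks like the one built below; the function name is hypothetical, and the commented-out `create_role` call is only an illustration of where the document would go if you scripted this step instead of using the console.

```python
import json


def step_functions_trust_policy():
    """Return a trust policy that lets the Step Functions service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "states.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


trust_policy_json = json.dumps(step_functions_trust_policy(), indent=4)
print(trust_policy_json)

# Illustration only (requires credentials that allow iam:CreateRole):
# import boto3
# boto3.client("iam").create_role(
#     RoleName="StepFunctionsWorkflowExecutionRole",
#     AssumeRolePolicyDocument=trust_policy_json,
# )
```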


Attach a policy to the role you created. The following steps attach a policy that lets Step Functions call SageMaker and several other services on any resource (`"Resource": "*"`). As a good practice, grant access only to the resources you need.

1. Under the **Permissions** tab, click **Add inline policy**
2. Enter the following in the **JSON** tab

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateTransformJob",
                "sagemaker:DescribeTransformJob",
                "sagemaker:StopTransformJob",
                "sagemaker:CreateTrainingJob",
                "sagemaker:DescribeTrainingJob",
                "sagemaker:StopTrainingJob",
                "sagemaker:CreateHyperParameterTuningJob",
                "sagemaker:DescribeHyperParameterTuningJob",
                "sagemaker:StopHyperParameterTuningJob",
                "sagemaker:CreateModel",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:CreateEndpoint",
                "sagemaker:DeleteEndpointConfig",
                "sagemaker:DeleteEndpoint",
                "sagemaker:UpdateEndpoint",
                "sagemaker:ListTags",
                "lambda:InvokeFunction",
                "sqs:SendMessage",
                "sns:Publish",
                "ecs:RunTask",
                "ecs:StopTask",
                "ecs:DescribeTasks",
                "dynamodb:GetItem",
                "dynamodb:PutItem",
                "dynamodb:UpdateItem",
                "dynamodb:DeleteItem",
                "batch:SubmitJob",
                "batch:DescribeJobs",
                "batch:TerminateJob",
                "glue:StartJobRun",
                "glue:GetJobRun",
                "glue:GetJobRuns",
                "glue:BatchStopJobRun"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "sagemaker.amazonaws.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "events:PutTargets",
                "events:PutRule",
                "events:DescribeRule"
            ],
            "Resource": [
                "arn:aws:events:*:*:rule/StepFunctionsGetEventsForSageMakerTrainingJobsRule",
                "arn:aws:events:*:*:rule/StepFunctionsGetEventsForSageMakerTransformJobsRule",
                "arn:aws:events:*:*:rule/StepFunctionsGetEventsForSageMakerTuningJobsRule",
                "arn:aws:events:*:*:rule/StepFunctionsGetEventsForECSTaskRule",
                "arn:aws:events:*:*:rule/StepFunctionsGetEventsForBatchJobsRule"
            ]
        }
    ]
}
```

3. Choose **Review policy** and give the policy a name such as `StepFunctionsWorkflowExecutionPolicy`
4. Choose **Create policy**. You will be redirected to the details page for the role.
5. Copy the **Role ARN** at the top of the **Summary**
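
To make the least-privilege advice above more concrete, here is a minimal sketch that builds a narrower statement covering only the training-job actions and restricts them to job names with a given prefix. The function name, the placeholder account ID `123456789012`, and the `training-pipeline-` prefix are assumptions for illustration. A working pipeline would still need the model, endpoint, `iam:PassRole`, and `events` statements from the full policy.

```python
import json


def training_only_statement(account_id, region, job_prefix):
    """Build a policy that allows training-job actions on matching job names only."""
    training_job_arn = "arn:aws:sagemaker:{}:{}:training-job/{}*".format(
        region, account_id, job_prefix
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "sagemaker:CreateTrainingJob",
                    "sagemaker:DescribeTrainingJob",
                    "sagemaker:StopTrainingJob",
                ],
                "Resource": training_job_arn,
            }
        ],
    }


# Placeholder values; substitute your own account, region, and job-name prefix.
scoped_policy = training_only_statement("123456789012", "us-east-1", "training-pipeline-")
print(json.dumps(scoped_policy, indent=4))
```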

### Import the required modules from the SDK

In [73]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

# Dataset

In [74]:
%store -r spark_processing_job_s3_output_prefix

In [75]:
print('Previous Spark Processing Job Name: {}'.format(spark_processing_job_s3_output_prefix))

Previous Spark Processing Job Name: amazon-reviews-spark-processor-2020-03-28-04-41-56


In [76]:
prefix_train = '{}/output/tfidf-train'.format(spark_processing_job_s3_output_prefix)
prefix_validation = '{}/output/tfidf-validation'.format(spark_processing_job_s3_output_prefix)
prefix_test = '{}/output/tfidf-test'.format(spark_processing_job_s3_output_prefix)

tfidf_train_path = './{}'.format(prefix_train)
tfidf_validation_path = './{}'.format(prefix_validation)
tfidf_test_path = './{}'.format(prefix_test)

tfidf_train_s3_uri = 's3://{}/{}'.format(bucket, prefix_train)
tfidf_validation_s3_uri = 's3://{}/{}'.format(bucket, prefix_validation)
tfidf_test_s3_uri = 's3://{}/{}'.format(bucket, prefix_test)

print(tfidf_train_s3_uri)
print(tfidf_validation_s3_uri)
print(tfidf_test_s3_uri)

s3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-processor-2020-03-28-04-41-56/output/tfidf-train
s3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-processor-2020-03-28-04-41-56/output/tfidf-validation
s3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-processor-2020-03-28-04-41-56/output/tfidf-test


In [77]:
s3_input_train_data = sagemaker.s3_input(s3_data=tfidf_train_s3_uri, content_type='text/csv')
s3_input_validation_data = sagemaker.s3_input(s3_data=tfidf_validation_s3_uri, content_type='text/csv')
s3_input_test_data = sagemaker.s3_input(s3_data=tfidf_test_s3_uri, content_type='text/csv')

print(s3_input_train_data.config)
print(s3_input_validation_data.config)
print(s3_input_test_data.config)

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-processor-2020-03-28-04-41-56/output/tfidf-train', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'text/csv'}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-processor-2020-03-28-04-41-56/output/tfidf-validation', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'text/csv'}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-processor-2020-03-28-04-41-56/output/tfidf-test', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'text/csv'}
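
The printed `config` dictionaries above show the channel structure that is passed to the training job. As a sketch of that shape, the hypothetical helper below rebuilds the same dictionary without the SDK; the bucket and prefix in the example call are placeholders.

```python
def s3_channel_config(s3_uri, content_type, distribution="FullyReplicated"):
    """Build a channel config with the same keys as the output printed above."""
    return {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": distribution,
            }
        },
        "ContentType": content_type,
    }


# Placeholder URI for illustration.
example_channel = s3_channel_config("s3://example-bucket/example-prefix/output/tfidf-train", "text/csv")
print(example_channel)
```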


In [78]:
!aws s3 ls $tfidf_train_s3_uri/ 

2020-03-28 04:49:32          0 _SUCCESS
2020-03-28 04:48:52 1839696670 part-00000-75daeb1d-1477-4fd5-8436-f52122d97da1-c000.csv


In [79]:
!aws s3 ls $tfidf_validation_s3_uri/

2020-03-28 04:50:27          0 _SUCCESS
2020-03-28 04:50:25  102405623 part-00000-870d4bbb-d00e-4572-ac1e-9fa130c60189-c000.csv


In [80]:
!aws s3 ls $tfidf_test_s3_uri/

2020-03-28 04:51:04          0 _SUCCESS
2020-03-28 04:51:01  102471944 part-00000-5ca41884-1bc3-41dd-a1d1-869b714e36f6-c000.csv


In [81]:
!aws s3 cp --recursive $tfidf_train_s3_uri $tfidf_train_path
!aws s3 cp --recursive $tfidf_validation_s3_uri $tfidf_validation_path
!aws s3 cp --recursive $tfidf_test_s3_uri $tfidf_test_path

download: s3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-processor-2020-03-28-04-41-56/output/tfidf-train/_SUCCESS to amazon-reviews-spark-processor-2020-03-28-04-41-56/output/tfidf-train/_SUCCESS
download: s3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-processor-2020-03-28-04-41-56/output/tfidf-train/part-00000-75daeb1d-1477-4fd5-8436-f52122d97da1-c000.csv to amazon-reviews-spark-processor-2020-03-28-04-41-56/output/tfidf-train/part-00000-75daeb1d-1477-4fd5-8436-f52122d97da1-c000.csv
download: s3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-processor-2020-03-28-04-41-56/output/tfidf-validation/_SUCCESS to amazon-reviews-spark-processor-2020-03-28-04-41-56/output/tfidf-validation/_SUCCESS
download: s3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-processor-2020-03-28-04-41-56/output/tfidf-validation/part-00000-870d4bbb-d00e-4572-ac1e-9fa130c60189-c000.csv to amazon-reviews-spark-processor-2020-03-28-04-41-56/output/tfidf-validation/part-0000

In [82]:
!cat src/xgboost_reviews.py

import os
import argparse
import pickle as pkl
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, classification_report, confusion_matrix
from sklearn import metrics
from sklearn.base import BaseEstimator, TransformerMixin
import nltk
import re
import xgboost as xgb
from xgboost import XGBClassifier
import glob


# Note:  header=None
def load_dataset(path, sep, header):
    data = pd.concat([pd.read_csv(f, sep=sep, header=header) for f in glob.glob('{}/*.csv'.format(path))], ignore_index = True)

    labels = data.iloc[:,0]
    features = data.drop(data.columns[0], axis=1)
    
    if header is None:
        # Adjust the column names after dropping the 0th column above
        # New column names are 0 (inclusive) to len(features.columns) (exclusive)
        new_column_names = list(range(0, len(features.columns)))
        features.columns = new_column_names

    return features, labels


def model_fn(model_dir):
    """
    :par

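The `load_dataset` function shown above treats the first CSV column as the label and the remaining columns as features, and it reads every `*.csv` file under a directory. The sketch below illustrates the same convention with only the standard library (it is not the script's pandas implementation); the function name `load_rows` and the sample values are made up for the example.

```python
import csv
import glob
import os
import tempfile


def load_rows(path):
    """Read all *.csv files under path; column 0 is the label, the rest are features."""
    labels, features = [], []
    for filename in sorted(glob.glob(os.path.join(path, "*.csv"))):
        with open(filename, newline="") as f:
            for row in csv.reader(f):
                if not row:
                    continue
                labels.append(float(row[0]))
                features.append([float(value) for value in row[1:]])
    return features, labels


with tempfile.TemporaryDirectory() as tmp:
    # Mimic the Spark output layout: one part file plus an empty _SUCCESS marker.
    with open(os.path.join(tmp, "part-00000.csv"), "w", newline="") as f:
        csv.writer(f).writerows([[1, 0.5, 0.0], [0, 0.1, 0.9]])
    with open(os.path.join(tmp, "_SUCCESS"), "w"):
        pass
    features, labels = load_rows(tmp)

print(labels)
print(features)
```

The `_SUCCESS` marker is skipped because only files matching `*.csv` are read, which mirrors the glob pattern in `load_dataset`.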
In [83]:
from sagemaker.xgboost import XGBoost

# TODO: Possible bug that produces a doubled `s3://s3://` prefix in pipelines (does not seem to occur in Script Mode).
#       See https://github.com/aws/aws-step-functions-data-science-sdk-python/issues/32
model_output_path = 's3://{}/models/amazon-reviews/script-mode/training-runs'.format(bucket)

xgb_estimator = XGBoost(entry_point='xgboost_reviews.py', 
                        source_dir='src/',
                        role=role,
                        train_instance_count=1, 
#                        train_instance_type='local',
                        train_instance_type='ml.c5.4xlarge',
                        framework_version='0.90-2',
                        py_version='py3',
                        output_path=model_output_path,
                        hyperparameters={'objective':'binary:logistic',
                                         'num_round': 1,
                                         'max_depth': 5},
                        enable_cloudwatch_metrics=True,
                       )
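
The TODO in the cell above mentions a possibly doubled `s3://s3://` scheme. One defensive option, sketched here under the assumption that the doubling comes from prepending `s3://` to a value that already has it, is a small normalizing helper. The name `normalize_s3_uri` is hypothetical and is not part of either SDK.

```python
def normalize_s3_uri(uri):
    """Return uri with exactly one leading 's3://' scheme."""
    scheme = "s3://"
    path = uri.strip()
    # Strip every leading scheme so a doubled prefix collapses to one.
    while path.startswith(scheme):
        path = path[len(scheme):]
    return scheme + path.lstrip("/")


print(normalize_s3_uri("s3://s3://example-bucket/models/amazon-reviews"))
print(normalize_s3_uri("example-bucket/models/amazon-reviews"))
```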

### Build a training pipeline with the Step Functions SDK

A typical task for a data scientist is to train a model and deploy that model to an endpoint. Without the Step Functions SDK, this is a four-step process on SageMaker that includes the following.

1. Training the model
2. Creating the model on SageMaker
3. Creating an endpoint configuration
4. Deploying the trained model to the configured endpoint
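
The four steps listed above map naturally onto a chain of Step Functions task states. The following is a simplified sketch of such a state machine definition built as a plain dictionary. The state names other than `Training` are assumptions, the `Parameters` blocks are omitted, and the definition generated by the SDK contains more detail than this.

```python
import json

# (state name, Step Functions service-integration resource) in execution order.
STEPS = [
    ("Training", "arn:aws:states:::sagemaker:createTrainingJob.sync"),
    ("Create Model", "arn:aws:states:::sagemaker:createModel"),
    ("Configure Endpoint", "arn:aws:states:::sagemaker:createEndpointConfig"),
    ("Deploy", "arn:aws:states:::sagemaker:createEndpoint"),
]


def build_definition(steps):
    """Chain the steps so each task points to the next and the last one ends."""
    states = {}
    for index, (name, resource) in enumerate(steps):
        state = {"Type": "Task", "Resource": resource}
        if index + 1 < len(steps):
            state["Next"] = steps[index + 1][0]
        else:
            state["End"] = True
        states[name] = state
    return {"StartAt": steps[0][0], "States": states}


definition = build_definition(STEPS)
print(json.dumps(definition, indent=4))
```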

The Step Functions SDK provides the [TrainingPipeline](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/pipelines.html#stepfunctions.template.pipeline.train.TrainingPipeline) API to simplify this procedure. The following configures `pipeline` with the necessary parameters to define a training pipeline.


In [84]:
# paste the StepFunctionsWorkflowExecutionRole ARN from above
workflow_execution_role = "arn:aws:iam::835319576252:role/StepFunctionsWorkflowExecutionRole"
#workflow_execution_role = "XXXX"

In [85]:
from stepfunctions.template.pipeline import TrainingPipeline

pipeline = TrainingPipeline(
    estimator=xgb_estimator,
    role=workflow_execution_role,
    inputs={'train': s3_input_train_data, 
            'validation': s3_input_validation_data},
    s3_bucket=bucket)
#    s3_bucket=model_output_path)

## Visualize the pipeline
You can now view the workflow definition, and also visualize it as a graph. This workflow and graph represent your training pipeline.


### View the pipeline definition

In [86]:
print(pipeline.workflow.definition.to_json(pretty=True))

{
    "StartAt": "Training",
    "States": {
        "Training": {
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Parameters": {
                "AlgorithmSpecification.$": "$$.Execution.Input['Training'].AlgorithmSpecification",
                "OutputDataConfig.$": "$$.Execution.Input['Training'].OutputDataConfig",
                "StoppingCondition.$": "$$.Execution.Input['Training'].StoppingCondition",
                "ResourceConfig.$": "$$.Execution.Input['Training'].ResourceConfig",
                "RoleArn.$": "$$.Execution.Input['Training'].RoleArn",
                "InputDataConfig.$": "$$.Execution.Input['Training'].InputDataConfig",
                "HyperParameters.$": "$$.Execution.Input['Training'].HyperParameters",
                "TrainingJobName.$": "$$.Execution.Input['Training'].TrainingJobName",
                "DebugHookConfig.$": "$$.Execution.Input['Training'].DebugHookConfig"
            },
            "Type": "Task",
 

### Visualize the pipeline graph

In [87]:
pipeline.render_graph()

### Create and execute the pipeline on AWS Step Functions

Create the pipeline in AWS Step Functions with [create](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Workflow.create).

In [88]:
pipeline.create()

'arn:aws:states:us-east-1:835319576252:stateMachine:training-pipeline-2020-03-28-21-21-26'

### Run the pipeline 

Running the following cell prints a link. Follow this link to monitor your pipeline execution in the Step Functions console.

In [89]:
execution = pipeline.execute()

In [102]:
# Waiting for this:  https://github.com/aws/aws-step-functions-data-science-sdk-python/issues/32

execution.render_progress()

## *** YOU MUST WAIT FOR THE ABOVE PIPELINE TO COMPLETE BEFORE CONTINUING! ***
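
Instead of watching the progress widget, the wait can be scripted. The sketch below polls a status callable until the execution reaches a terminal state. The helper `wait_for_execution` is hypothetical; it assumes the callable returns a dictionary with a `status` key, as the Step Functions `DescribeExecution` response does. The demonstration uses canned responses so that it runs without AWS access.

```python
import time

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def wait_for_execution(describe, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Call describe() until its 'status' is terminal, then return that status."""
    waited = 0
    while True:
        status = describe()["status"]
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("Execution still {} after {} seconds".format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Offline demonstration with canned responses instead of a real execution.
fake_responses = iter([{"status": "RUNNING"}, {"status": "RUNNING"}, {"status": "SUCCEEDED"}])
final_status = wait_for_execution(lambda: next(fake_responses), poll_seconds=0, sleep=lambda s: None)
print(final_status)
```

In the notebook, the call would presumably look like `wait_for_execution(execution.describe)`, assuming `execution.describe()` returns the execution description as a dictionary.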

### Review the execution events

In [91]:
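import json
events = execution.list_events()

# Hard-coded indices such as events[21] and events[18] raise
# "IndexError: list index out of range" when the execution failed or is still
# running, so search the events for the endpoint details instead.
endpoint_arn = None
endpoint_name = None
for event in events:
    if 'stateExitedEventDetails' in event:
        output = json.loads(event['stateExitedEventDetails'].get('output') or '{}')
        endpoint_arn = output.get('EndpointArn', endpoint_arn)
    if 'taskScheduledEventDetails' in event:
        parameters = json.loads(event['taskScheduledEventDetails'].get('parameters') or '{}')
        endpoint_name = parameters.get('EndpointName', endpoint_name)

endpoint_name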

In [None]:
# TODO: Retrieve the predictor from the pipeline/workflow above
# predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.c5.2xlarge')

#predictor = sagemaker.predictor.RealTimePredictor(endpoint=endpoint_name)
predictor = sagemaker.xgboost.model.XGBoostPredictor(endpoint_name=endpoint_name)
predictor

# Deploy Endpoint

From an external application, you can use the following code to make a prediction.

# TODO:  This is erroring out with `Please provide a model_fn implementation.`

In [None]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

In [None]:
# import time

# # https://towardsdatascience.com/xgboost-in-amazon-sagemaker-28e5e354dbcd
# from sagemaker.predictor import csv_serializer

# xgb_endpoint_name = 'xgboost-script-pipeline-{}'.format(time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime()))
# xgb_endpoint_name

In [92]:
# ## Deploy trained XGBoost model endpoint to perform predictions
# xgb_predictor = xgb_estimator.deploy(initial_instance_count = 1, 
#                                      instance_type = 'ml.m4.xlarge',
#                                      endpoint_name=xgb_endpoint_name)

# xgb_predictor.content_type = 'text/csv'
# xgb_predictor.serializer = csv_serializer
# xgb_predictor.deserializer = None

In [93]:
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, classification_report, confusion_matrix

sm_runtime = boto3.client('sagemaker-runtime')

#payload_500_samples = X_test[:500].to_csv(index=False, header=False).rstrip()

# response_500_samples = sm_runtime.invoke_endpoint(
#     EndpointName=endpoint_name,
#     Body=payload.encode('utf-8'),
#     ContentType='text/csv')['Body'].read()

In [94]:
# predictions_500_samples = np.fromstring(response_500_samples, sep=',')
# predictions_500_samples_0_or_1 = np.where(predictions_500_samples > 0.5, 1, 0)
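
The commented-out cells above serialize feature rows to CSV, send them to the endpoint, and threshold the returned scores at 0.5. As a sketch of that round trip without pandas, NumPy, or a live endpoint, the two hypothetical helpers below build the payload and parse a response body; the sample rows and the response bytes are made up.

```python
def rows_to_csv_payload(rows):
    """Serialize feature rows to headerless CSV text, one row per line."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def parse_predictions(body, threshold=0.5):
    """Parse comma- or newline-separated scores and convert them to 0/1 labels."""
    text = body.decode("utf-8") if isinstance(body, bytes) else body
    scores = [float(token) for token in text.replace("\n", ",").split(",") if token.strip()]
    labels = [1 if score > threshold else 0 for score in scores]
    return scores, labels


payload_text = rows_to_csv_payload([[0.0, 0.25, 0.5], [1.0, 0.0, 0.75]])
print(payload_text)

# Made-up response body standing in for invoke_endpoint(...)['Body'].read().
scores, predicted_labels = parse_predictions(b"0.91,0.12")
print(scores, predicted_labels)
```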

In [95]:
# print('Test Accuracy: ', accuracy_score(y_test[:500], predictions_500_samples_0_or_1))
# print('Test Precision: ', precision_score(y_test[:500], predictions_500_samples_0_or_1, average=None))

In [96]:
# import seaborn as sn
# import pandas as pd
# import matplotlib.pyplot as plt

# df_cm_test = confusion_matrix(y_test[:500], predictions_500_samples_0_or_1)
# df_cm_test
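
For readers who want to see what the accuracy, precision, and confusion-matrix calls above compute, here is a small standard-library sketch for the binary case. It follows the scikit-learn layout (rows are true labels, columns are predicted labels); the function names and the sample labels are made up for illustration.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Fraction of predictions that match the true label."""
    total = sum(sum(row) for row in matrix)
    return (matrix[0][0] + matrix[1][1]) / total


def positive_precision(matrix):
    """TP / (TP + FP): how many predicted positives are truly positive."""
    predicted_positive = matrix[0][1] + matrix[1][1]
    return matrix[1][1] / predicted_positive


sample_true = [1, 0, 1, 1, 0]
sample_pred = [1, 0, 0, 1, 1]
cm = confusion_matrix_2x2(sample_true, sample_pred)
print(cm)
print(accuracy_from_matrix(cm))
print(positive_precision(cm))
```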

In [97]:
# import itertools

# import matplotlib.pyplot as plt
# %matplotlib inline
# %config InlineBackend.figure_format='retina'

# def plot_conf_mat(cm, classes, title, cmap = plt.cm.Greens):
#     print(cm)
#     plt.imshow(cm, interpolation='nearest', cmap=cmap)
#     plt.title(title)
#     plt.colorbar()
#     tick_marks = np.arange(len(classes))
#     plt.xticks(tick_marks, classes, rotation=45)
#     plt.yticks(tick_marks, classes)

#     fmt = 'd'
#     thresh = cm.max() / 2.
#     for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
#         plt.text(j, i, format(cm[i, j], fmt),
#         horizontalalignment="center",
#         color="black" if cm[i, j] > thresh else "black")

#         plt.tight_layout()
#         plt.ylabel('True label')
#         plt.xlabel('Predicted label')

# # Plot non-normalized confusion matrix
# plt.figure()
# fig, ax = plt.subplots(figsize=(6,4))
# plot_conf_mat(df_cm_test, classes=['Not Positive Sentiment', 'Positive Sentiment'], 
#                           title='Confusion matrix')
# plt.show()

In [98]:
# from sklearn import metrics

# import matplotlib.pyplot as plt
# %matplotlib inline
# %config InlineBackend.figure_format='retina'

# auc = round(metrics.roc_auc_score(y_test, preds_test), 4)
# print('AUC is ' + repr(auc))

# fpr, tpr, _ = metrics.roc_curve(y_test, preds_test)

# plt.title('ROC Curve')
# plt.plot(fpr, tpr, 'b',
# label='AUC = %0.2f'% auc)
# plt.legend(loc='lower right')
# plt.plot([0,1],[0,1],'r--')
# plt.xlim([-0.1,1.1])
# plt.ylim([-0.1,1.1])
# plt.ylabel('True Positive Rate')
# plt.xlabel('False Positive Rate')
# plt.show()

In [99]:
payload = """Very funny. A typical mid 50's comedy."""
response = sm_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=payload.encode('utf-8'),
    ContentType='text/csv')['Body'].read()

# `predictions` and `raw_outputs` were never assigned in this cell,
# so print the raw endpoint response instead.
print('Response: {}'.format(response.decode('utf-8')))

ValidationError: An error occurred (ValidationError) when calling the InvokeEndpoint operation: Endpoint training-pipeline-2020-03-28-15-52-56 of account 835319576252 not found.

In [100]:
# NOTE: `xgb_predictor` is only defined if the commented-out deploy cell above is run first.
predictions, raw_outputs = xgb_predictor.predict(["""That movie was absolutely awful."""])
print('Predictions: {}'.format(predictions))
print('Raw outputs: {}'.format(raw_outputs))

NameError: name 'xgb_predictor' is not defined