# Deploying an Amazon Comprehend Model with SageMaker Pipelines

This example notebook shows how to train and deploy a custom text classifier using Amazon Comprehend and SageMaker Pipelines.

Before you start, make sure that your SageMaker execution role has the following policies attached:

- `ComprehendFullAccess`
- `AmazonSageMakerFullAccess`
- `AWSLambda_FullAccess`

Your SageMaker execution role should already have access to S3. If it does not, you can attach the S3 full access policy.
You will also need to add `iam:PassRole` as an inline policy.

Finally, you will need the following trust policies.
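
The policy documents themselves are not reproduced here. As a minimal sketch, the snippet below builds an inline `iam:PassRole` policy and a trust policy as plain Python dictionaries. The list of service principals is an assumption based on the services this notebook uses the role with (SageMaker, Amazon Comprehend, and AWS Lambda); the helper names and the example ARN are hypothetical.

```python
import json

# Assumption: the role must be assumable by the services used in this notebook.
ASSUMED_SERVICES = (
    "sagemaker.amazonaws.com",
    "comprehend.amazonaws.com",
    "lambda.amazonaws.com",
)


def pass_role_policy(role_arn):
    """Inline policy allowing the role to be passed to other AWS services."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "iam:PassRole", "Resource": role_arn}
        ],
    }


def trust_policy(services=ASSUMED_SERVICES):
    """Trust policy letting the listed services assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": list(services)},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Hypothetical ARN, for illustration only.
example_role_arn = "arn:aws:iam::123456789012:role/ExampleSageMakerRole"
print(json.dumps(pass_role_policy(example_role_arn), indent=2))
print(json.dumps(trust_policy(), indent=2))
```

The printed JSON can be pasted into the IAM console. Narrow the principals and the `Resource` to what your account actually requires.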

## Prerequisites

First, we install `s3fs` (so that pandas can read directly from S3), import the SageMaker SDK, and set some default variables such as `role_arn` for permissioned execution and `default_bucket` to store model artifacts.

In [1]:
%pip install s3fs

In [42]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [86]:
import boto3
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.lambda_step import (
    LambdaStep,
    LambdaOutput,
    LambdaOutputTypeEnum,
)
from sagemaker.lambda_helper import Lambda
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.parameters import ParameterInteger, ParameterString

region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
role_arn = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()

In [87]:
!aws s3 cp s3://aws-ml-blog/artifacts/comprehend-custom-classification/comprehend-train.csv s3://$default_bucket/
!aws s3 cp s3://aws-ml-blog/artifacts/comprehend-custom-classification/comprehend-test.csv s3://$default_bucket/

download: s3://aws-ml-blog/artifacts/comprehend-custom-classification/comprehend-train.csv to ./comprehend-train.csv
download: s3://aws-ml-blog/artifacts/comprehend-custom-classification/comprehend-test.csv to ./comprehend-test.csv


Next, we define parameters that can be set for the execution of the pipeline. They serve as variables. We define the following:

- `ProcessingInstanceCount`: The number of processing instances to use for the execution of the pipeline
- `ProcessingInstanceType`: The instance type of the processing instances
- `TrainData`: Location of the training data in S3
- `TestData`: Location of the test data in S3
- `RoleArn`: ARN (Amazon Resource Name) of the role used for pipeline execution
- `ModelOutput`: Location of the target S3 path for the Amazon Comprehend model artifact

Amazon Comprehend creates its own validation set when training, so there is no need to provide one.

In [None]:
# Inspect the training file (it has no header row)
import pandas as pd

trainFrame = pd.read_csv(
    "s3://aws-ml-blog/artifacts/comprehend-custom-classification/comprehend-train.csv",
    header=None,
)
trainFrame

In [None]:
# Inspect the test file (it has no header row)
testFrame = pd.read_csv(
    "s3://aws-ml-blog/artifacts/comprehend-custom-classification/comprehend-test.csv",
    header=None,
)
testFrame

In [90]:
processing_instance_count = ParameterInteger(name="ProcessingInstanceCount", default_value=1)
processing_instance_type = ParameterString(
    name="ProcessingInstanceType", default_value="ml.m5.xlarge"
)

input_train = ParameterString(
    name="TrainData",
    default_value=f"s3://{default_bucket}/comprehend-train.csv",
)

input_test = ParameterString(
    name="TestData",
    default_value=f"s3://{default_bucket}/comprehend-test.csv",
)

iam_role_arn = ParameterString(
    name="RoleArn",
    default_value=role_arn,
)

model_output = ParameterString(name="ModelOutput", default_value=f"s3://{default_bucket}/model")
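
Because these parameters act as variables, their defaults can be overridden for a single execution. The sketch below validates a set of overrides against the parameter names defined above. `build_overrides` is a hypothetical helper and the bucket name is made up.

```python
# Parameter names as defined in this notebook.
PIPELINE_PARAMETERS = {
    "ProcessingInstanceCount",
    "ProcessingInstanceType",
    "TrainData",
    "TestData",
    "RoleArn",
    "ModelOutput",
}


def build_overrides(**overrides):
    """Return the overrides as a dict, rejecting names the pipeline does not define."""
    unknown = set(overrides) - PIPELINE_PARAMETERS
    if unknown:
        raise ValueError(f"Unknown pipeline parameters: {sorted(unknown)}")
    return dict(overrides)


overrides = build_overrides(
    ProcessingInstanceType="ml.m5.2xlarge",
    TrainData="s3://example-bucket/other-train.csv",  # hypothetical location
)
print(overrides)

# Once the pipeline has been created and upserted, the overrides can be passed as:
# execution = pipeline.start(parameters=overrides)
```

Any parameter that is not overridden keeps the `default_value` given in its definition.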

We use [SKLearnProcessor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#sagemaker.sklearn.processing.SKLearnProcessor) to run Python scripts that train and deploy Amazon Comprehend models using `boto3`. In the next cell, we create the `SKLearnProcessor` instance that the following steps share.

In [91]:
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name="comprehend-process",
    sagemaker_session=sagemaker_session,
    role=role_arn,
)

The first Amazon SageMaker [ProcessingStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html?highlight=ProcessingStep#sagemaker.workflow.steps.ProcessingStep) provides a containerized execution environment to run the `prepare_data.py` script.

In [92]:
preprocess = ProcessingStep(
    name="ComprehendProcess",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=input_train, destination="/opt/ml/processing/input_train"),
        ProcessingInput(source=input_test, destination="/opt/ml/processing/input_test"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="prepare_data.py",
)
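
The contents of `prepare_data.py` are not shown in this notebook. As a rough sketch of what such a script might do, the helper below writes records in the two-column, header-less CSV layout (label first, then text) that Amazon Comprehend custom classification accepts in multi-class mode. The function names and the cleaning rules are assumptions, not the script used here.

```python
import csv
import io


def to_comprehend_rows(records):
    """Yield (label, text) pairs with whitespace collapsed, skipping incomplete rows."""
    for label, text in records:
        label = str(label).strip()
        text = " ".join(str(text).split())  # also removes embedded newlines
        if label and text:
            yield label, text


def write_comprehend_csv(records, handle):
    """Write the cleaned rows without a header and return the number of rows written."""
    writer = csv.writer(handle)
    count = 0
    for row in to_comprehend_rows(records):
        writer.writerow(row)
        count += 1
    return count


# Made-up records for illustration.
buffer = io.StringIO()
written = write_comprehend_csv(
    [("SPORTS", "A  late goal\nwon the match"), ("", "row without a label")],
    buffer,
)
print(written)
print(buffer.getvalue())
```

Inside the processing container, a real script would read from `/opt/ml/processing/input_train` and `/opt/ml/processing/input_test` and write to `/opt/ml/processing/train` and `/opt/ml/processing/test`, matching the inputs and outputs declared in the step above.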

The second Amazon SageMaker processing step trains the Amazon Comprehend model by running `train_eval_comprehend.py`. Amazon Comprehend automatically evaluates the performance on an evaluation set. We will use that score as a condition for deploying the model.

In [93]:
evaluation_report = PropertyFile(
    name="ComprehendEvaluationReport",
    output_name="evaluation",
    path="evaluation.json",
)

In [94]:
comprehend_train_and_eval = ProcessingStep(
    name="ComprehendTrainAndEval",
    processor=sklearn_processor,
    job_arguments=[
        "--train-input-file",
        preprocess.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
        "--train-output-path",
        model_output,
        "--iam-role-arn",
        role_arn,
    ],
    code="train_eval_comprehend.py",
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
        ProcessingOutput(output_name="arn", source="/opt/ml/processing/arn"),
    ],
    property_files=[evaluation_report],
)
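
`train_eval_comprehend.py` is not listed in this notebook. The sketch below shows one way such a script could train a classifier, wait for it, and write `evaluation.json`, using the Comprehend client operations `create_document_classifier` and `describe_document_classifier`. The function names, polling interval, and language code are assumptions; the client is passed in so the logic can be exercised without AWS access.

```python
import json
import os
import time


def train_and_evaluate(comprehend, name, train_s3_uri, output_s3_uri, role_arn,
                       language_code="en", poll_seconds=60):
    """Start training, poll until it finishes, and return (classifier_arn, report)."""
    arn = comprehend.create_document_classifier(
        DocumentClassifierName=name,
        DataAccessRoleArn=role_arn,
        InputDataConfig={"S3Uri": train_s3_uri},
        OutputDataConfig={"S3Uri": output_s3_uri},
        LanguageCode=language_code,  # assumption: English training data
    )["DocumentClassifierArn"]

    while True:
        props = comprehend.describe_document_classifier(
            DocumentClassifierArn=arn
        )["DocumentClassifierProperties"]
        status = props["Status"]
        if status == "TRAINED":
            break
        if status in ("IN_ERROR", "STOPPED"):
            raise RuntimeError(f"Training ended with status {status}")
        time.sleep(poll_seconds)

    metrics = props["ClassifierMetadata"]["EvaluationMetrics"]
    return arn, {"Accuracy": metrics["Accuracy"]}


def write_report(report, directory):
    """Write the report as evaluation.json, the file name the PropertyFile expects."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "evaluation.json")
    with open(path, "w") as handle:
        json.dump(report, handle)
    return path


# In the processing container one would call, for example:
# comprehend = boto3.client("comprehend")
# arn, report = train_and_evaluate(comprehend, "my-classifier", train_uri, output_uri, role)
# write_report(report, "/opt/ml/processing/evaluation")
```

Keeping `Accuracy` as a top-level key lets the `json_path="Accuracy"` lookup used later in the notebook read it directly.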

The third Amazon SageMaker processing step deploys the Amazon Comprehend model by running `deploy_comprehend.py`. If the `Accuracy` reported after training is lower than the threshold set in the condition step (0.65), this step does not run and the pipeline stops here.

In [95]:
step_deploy_model = ProcessingStep(
    name="ComprehendDeploy",
    processor=sklearn_processor,
    job_arguments=[
        "--arn-path",
        comprehend_train_and_eval.properties.ProcessingOutputConfig.Outputs["arn"].S3Output.S3Uri,
    ],
    code="deploy_comprehend.py",
    outputs=[
        ProcessingOutput(output_name="endpoint_arn", source="/opt/ml/processing/endpoint_arn")
    ],
)
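
`deploy_comprehend.py` is likewise not listed. A minimal sketch of the deployment logic, assuming the script creates a real-time endpoint with the Comprehend client operations `create_endpoint` and `describe_endpoint`, could look like this. The function name, the number of inference units, and the polling interval are assumptions.

```python
import time


def deploy_endpoint(comprehend, endpoint_name, model_arn,
                    inference_units=1, poll_seconds=30):
    """Create a Comprehend endpoint and block until it is in service."""
    endpoint_arn = comprehend.create_endpoint(
        EndpointName=endpoint_name,
        ModelArn=model_arn,
        DesiredInferenceUnits=inference_units,
    )["EndpointArn"]

    while True:
        status = comprehend.describe_endpoint(
            EndpointArn=endpoint_arn
        )["EndpointProperties"]["Status"]
        if status == "IN_SERVICE":
            return endpoint_arn
        if status == "FAILED":
            raise RuntimeError(f"Endpoint {endpoint_arn} failed to deploy")
        time.sleep(poll_seconds)


# In the processing container one would call, for example:
# comprehend = boto3.client("comprehend")
# endpoint_arn = deploy_endpoint(comprehend, "my-endpoint", classifier_arn)
```

The returned ARN is what the step would then write under `/opt/ml/processing/endpoint_arn` so that later steps can pick it up.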

In [96]:
comprehend_train_and_eval.properties.ProcessingOutputConfig.Outputs["arn"].S3Output.S3Uri

<sagemaker.workflow.properties.Properties at 0x7efecddf1590>

In [97]:
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet

# Deploy only when the reported accuracy is at least 0.65
cond_gte = ConditionGreaterThanOrEqualTo(
    left=JsonGet(
        step_name="ComprehendTrainAndEval",
        property_file=evaluation_report,
        json_path="Accuracy",
    ),
    right=0.65,
)

step_cond = ConditionStep(
    name="ComprehendAccuracyCondition",
    conditions=[cond_gte],
    if_steps=[step_deploy_model],
    else_steps=[],
)
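
To make the gate concrete, the following sketch reproduces the same comparison locally on the contents of an `evaluation.json` file. It assumes the file holds a top-level `Accuracy` key, as implied by `json_path="Accuracy"`. `should_deploy` is a hypothetical helper and the sample values are made up.

```python
import json


def should_deploy(report_text, threshold=0.65, key="Accuracy"):
    """Return True when the reported metric is at least the threshold."""
    report = json.loads(report_text)
    return float(report[key]) >= threshold


print(should_deploy('{"Accuracy": 0.87}'))  # True
print(should_deploy('{"Accuracy": 0.60}'))  # False
```

Running this against the report from a training job is a quick way to predict whether the condition step will take the deployment branch.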

Finally, the deployed model can be used for inference. At this stage, we use an AWS Lambda function, wrapped in a [LambdaStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.lambda_step.LambdaStep), to call the Amazon Comprehend endpoint with a text of our choice.

In [98]:
example_text = (
    "Italian EBU Member RAI has won the 65th Eurovision Song Contest with the song "
    + "Zitti e buoni performed by Måneskin. It's the 3rd win for Italy who last triumphed in 1990. "
    + "26 countries took part in the Grand Final of the world’s largest live music event, "
    + "hosted by Dutch EBU Members NPO, NOS and AVROTROS on Saturday 22 May in Rotterdam. "
    + "Måneskin wrote the winning song which finished the night with 524 points, 25 points "
    + "ahead of 2nd placed France represented by Barbara Pravi singing Voila. Switzerland’s Gjon’s Tears with Tout l’Univers finished in third place."
)

In [99]:
# Custom Lambda Step
function_name = "sagemaker-lambda-step-endpoint-test"

# Lambda helper class can be used to create the Lambda function
func = Lambda(
    function_name=function_name,
    session=sagemaker_session,
    execution_role_arn=role_arn,
    script="test_comprehend_lambda.py",
    handler="test_comprehend_lambda.lambda_handler",
)

test_endpoint = LambdaStep(
    name="LambdaStep",
    lambda_func=func,
    inputs={
        "endpoint_arn_path": step_deploy_model.properties.ProcessingOutputConfig.Outputs[
            "endpoint_arn"
        ].S3Output.S3Uri,
        "text": example_text,
    },
)
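
The handler in `test_comprehend_lambda.py` is not shown. As a sketch of the inference call it would need, the helpers below use the Comprehend client operation `classify_document` and pick the highest-scoring class from the response. The function names are hypothetical, and reading the endpoint ARN from the S3 location passed as `endpoint_arn_path` is left out.

```python
def top_class(response):
    """Return the best-scoring class from a classify_document response, or None."""
    classes = response.get("Classes", [])
    if not classes:
        return None
    best = max(classes, key=lambda item: item["Score"])
    return {"label": best["Name"], "score": best["Score"]}


def classify_text(comprehend, endpoint_arn, text):
    """Send the text to a Comprehend custom classification endpoint."""
    response = comprehend.classify_document(Text=text, EndpointArn=endpoint_arn)
    return top_class(response)


# Inside a Lambda handler one would call, for example:
# comprehend = boto3.client("comprehend")
# result = classify_text(comprehend, endpoint_arn, event["text"])
```

The `text` key matches the input name given to the `LambdaStep` above.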

In [100]:
step_deploy_model.properties.ProcessingOutputConfig.Outputs["endpoint_arn"].S3Output.S3Uri

<sagemaker.workflow.properties.Properties at 0x7efecdde50d0>

In [101]:
from sagemaker.workflow.pipeline import Pipeline

pipeline_name = "ComprehendPipeline"
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        processing_instance_type,
        processing_instance_count,
        input_train,
        input_test,
        iam_role_arn,
        model_output,
    ],
    steps=[preprocess, comprehend_train_and_eval, step_cond, test_endpoint],
)

Once the pipeline is defined, we register it with `upsert` and then start an execution.

In [102]:
pipeline.upsert(role_arn=role_arn)

{'PipelineArn': 'arn:aws:sagemaker:eu-west-1:727118255515:pipeline/comprehendpipeline',
 'ResponseMetadata': {'RequestId': '779a1921-3ced-4f47-98fe-561f5d599f43',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '779a1921-3ced-4f47-98fe-561f5d599f43',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '86',
   'date': 'Thu, 25 Nov 2021 15:14:01 GMT'},
  'RetryAttempts': 0}}

In [103]:
execution = pipeline.start()

In [104]:
execution.wait()

WaiterError: Waiter PipelineExecutionComplete failed: Waiter encountered a terminal failure state: For expression "PipelineExecutionStatus" we matched expected path: "Failed"

In [None]:
execution.list_steps()
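
When an execution fails, the output of `execution.list_steps()` is the first place to look. The sketch below filters a list of step dictionaries down to the failed ones, relying on the `StepName`, `StepStatus`, and `FailureReason` keys that SageMaker returns for each step. `failed_steps` is a hypothetical helper, and the sample data is made up rather than taken from the failed run above.

```python
def failed_steps(steps):
    """Return (step name, failure reason) for every step whose status is Failed."""
    return [
        (step["StepName"], step.get("FailureReason", ""))
        for step in steps
        if step.get("StepStatus") == "Failed"
    ]


# Made-up sample in the shape returned by execution.list_steps().
sample_steps = [
    {"StepName": "ComprehendProcess", "StepStatus": "Succeeded"},
    {
        "StepName": "ComprehendTrainAndEval",
        "StepStatus": "Failed",
        "FailureReason": "example failure reason",
    },
]

for name, reason in failed_steps(sample_steps):
    print(f"{name}: {reason}")

# In the notebook: failed_steps(execution.list_steps())
```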

## Conclusion

In this notebook, we have seen how to create a SageMaker pipeline that trains, evaluates, and conditionally deploys an Amazon Comprehend custom classifier on your own dataset.

## Clean up

In [24]:
pipeline.delete()

AttributeError: module 'boto3' has no attribute 'delete_endpoint'
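
The `AttributeError` above arises because `delete_endpoint` is an operation of the Comprehend client, not of the `boto3` module itself. Deleting the pipeline also leaves the Comprehend resources in place. The sketch below removes an endpoint and a classifier through a client that is passed in. `delete_comprehend_resources` is a hypothetical helper, and the ARNs must come from your own pipeline run.

```python
def delete_comprehend_resources(comprehend, endpoint_arn=None, classifier_arn=None):
    """Delete the endpoint first, then the classifier; return the ARNs requested for deletion."""
    deleted = []
    if endpoint_arn:
        comprehend.delete_endpoint(EndpointArn=endpoint_arn)
        deleted.append(endpoint_arn)
    if classifier_arn:
        comprehend.delete_document_classifier(DocumentClassifierArn=classifier_arn)
        deleted.append(classifier_arn)
    return deleted


# In the notebook, for example:
# comprehend = boto3.client("comprehend")
# delete_comprehend_resources(comprehend, endpoint_arn=..., classifier_arn=...)
```

Endpoint deletion is asynchronous, so the classifier deletion may be rejected until the endpoint is gone; in that case, retry after a short wait.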