## Amazon Comprehend Custom

[Amazon Comprehend](https://aws.amazon.com/comprehend/) provides pre-trained and user-trainable tools for discovering insights and relationships in text.

<!--Pre-trained [features](https://aws.amazon.com/comprehend/features/) include:

- Sentiment analysis: Classifying the overall sentiment of a text (as Positive, Negative, Neutral, or Mixed)
- -->

This post uses the Yahoo! Answers corpus cited in the paper [Text Understanding from Scratch](https://arxiv.org/abs/1502.01710) by Xiang Zhang and Yann LeCun. The dataset is available on the [AWS Open Data Registry](https://registry.opendata.aws/fast-ai-nlp/).


In [None]:
!aws s3 cp s3://aws-ml-blog/artifacts/comprehend-custom-classification/comprehend-test.csv ./data/comprehend/
!aws s3 cp s3://aws-ml-blog/artifacts/comprehend-custom-classification/comprehend-train.csv ./data/comprehend/
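
Before uploading, it can help to check how many training documents each class has. The sketch below assumes the downloaded file follows the headerless `label,text` layout that Comprehend's CSV format expects in multi-class mode; `count_labels` is a hypothetical helper name, not part of any AWS library.

```python
import csv
from collections import Counter


def count_labels(csv_path):
    """Count documents per class in a headerless `label,text` CSV file.

    Assumption: the first column of every row is the class label.
    """
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if row:  # csv.reader yields [] for blank lines; skip them
                counts[row[0]] += 1
    return counts


# Example usage in the notebook, once the download cell has run:
# count_labels("data/comprehend/comprehend-train.csv").most_common()
```

A heavily skewed class distribution here is worth knowing about before spending time on training.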

In [None]:
# Python Built-Ins:
import os

# External Dependencies:
import boto3

# For our example, the S3 bucket has already been created & named for us:
account_id = os.environ["AWS_ACCOUNT_ID"]
region = os.environ["AWS_DEFAULT_REGION"]
bucket_name = f"comprehend-custom-{account_id}-{region}"

comprehend = boto3.client("comprehend")
s3 = boto3.resource("s3")
bucket = s3.Bucket(bucket_name)

print(f"Using S3 bucket: {bucket_name}")

In [None]:
bucket.Object("data/comprehend-train.csv").upload_file("data/comprehend/comprehend-train.csv")
bucket.Object("data/comprehend-test.csv").upload_file("data/comprehend/comprehend-test.csv")

In [None]:
# The Comprehend Execution Role has also been created already, but we need to look it up:
ssm = boto3.client("ssm")
ssm_response = ssm.get_parameters(Names=["/workshop/ComprehendRoleArn"])

try:
    comprehend_param = next(
        param for param in ssm_response["Parameters"]
        if param["Name"] == "/workshop/ComprehendRoleArn"
    )
    comprehend_role_arn = comprehend_param["Value"]
    print(comprehend_role_arn)
except StopIteration as e:
    print(ssm_response)
    raise ValueError("Couldn't retrieve Comprehend Execution Role from SSM") from e

In [None]:
create_cls_resp = comprehend.create_document_classifier(
    DocumentClassifierName="yahoo-answers",
    DataAccessRoleArn=comprehend_role_arn,
    InputDataConfig={
        "DataFormat": "COMPREHEND_CSV", # (Or "AUGMENTED_MANIFEST" for SM Ground Truth)
        "S3Uri": f"s3://{bucket_name}/data/comprehend-train.csv",
    },
    OutputDataConfig={
        "S3Uri": f"s3://{bucket_name}/training-output",
    },
    LanguageCode="en", # See also 'es'|'fr'|'de'|'it'|'pt'|'ar'|'hi'|'ja'|'ko'|'zh'|'zh-TW'
)

print(create_cls_resp)
cls_arn = create_cls_resp["DocumentClassifierArn"]
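
Training is asynchronous: `create_document_classifier` returns as soon as the request is accepted, and the classifier can only be used for classification jobs once training has finished. A minimal polling sketch follows. The helper name `wait_for_classifier` and the 60-second interval are assumptions for illustration; the `sleep` parameter exists only so the wait can be stubbed out when testing.

```python
import time


def wait_for_classifier(comprehend, classifier_arn, poll_seconds=60, sleep=time.sleep):
    """Poll a document classifier until it leaves the pending states.

    `comprehend` is a boto3 Comprehend client. Returns the final status
    string (for example "TRAINED" or "IN_ERROR").
    """
    while True:
        props = comprehend.describe_document_classifier(
            DocumentClassifierArn=classifier_arn
        )["DocumentClassifierProperties"]
        status = props["Status"]
        if status not in ("SUBMITTED", "TRAINING"):
            return status
        sleep(poll_seconds)


# Example usage in the notebook:
# status = wait_for_classifier(comprehend, cls_arn)
# if status != "TRAINED":
#     raise RuntimeError(f"Classifier training ended with status {status}")
```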

In [None]:
create_job_resp = comprehend.start_document_classification_job(
    #JobName='string',
    DocumentClassifierArn=cls_arn,
    InputDataConfig={
        "S3Uri": f"s3://{bucket_name}/data/comprehend-test.csv",
        "InputFormat": "ONE_DOC_PER_LINE",
    },
    OutputDataConfig={
        "S3Uri": f"s3://{bucket_name}/results",
        # Optionally add "KmsKeyId" here to encrypt the output with your own KMS key
    },
    DataAccessRoleArn=comprehend_role_arn,
)

print(create_job_resp)
job_id = create_job_resp["JobId"]
# Also JobStatus: 'SUBMITTED'|'IN_PROGRESS'|'COMPLETED'|'FAILED'|'STOP_REQUESTED'|'STOPPED'
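
The classification job also runs asynchronously, so its results are not in S3 until `JobStatus` reaches a terminal value. One way to wait is sketched below. `wait_for_job` and the 30-second interval are illustrative assumptions; the status names are the ones listed in the comment above.

```python
import time

# Statuses after which the job will not change state again
TERMINAL_JOB_STATUSES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_job(comprehend, job_id, poll_seconds=30, sleep=time.sleep):
    """Poll a document classification job until it reaches a terminal status.

    `comprehend` is a boto3 Comprehend client. Returns the job properties
    dict from the final describe call.
    """
    while True:
        props = comprehend.describe_document_classification_job(JobId=job_id)[
            "DocumentClassificationJobProperties"
        ]
        if props["JobStatus"] in TERMINAL_JOB_STATUSES:
            return props
        sleep(poll_seconds)


# Example usage in the notebook:
# job_props = wait_for_job(comprehend, job_id)
# print(job_props["JobStatus"], job_props["OutputDataConfig"]["S3Uri"])
```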

In [None]:
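# Comprehend writes to a job-specific prefix, so look up the actual output URI rather than guessing it:
output_uri = comprehend.describe_document_classification_job(JobId=job_id)["DocumentClassificationJobProperties"]["OutputDataConfig"]["S3Uri"]
!aws s3 cp $output_uri ./data/comprehend/results/output.tar.gz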

In [None]:
!cd data/comprehend/results && tar xvzf output.tar.gz

In [None]:
# Example record from the extracted output (one JSON object per input line):
{"File": "comprehend-test.csv", "Line": "9", "Classes": [{"Name": "ENTERTAINMENT_AND_MUSIC", "Score": 0.9685}, {"Name": "EDUCATION_AND_REFERENCE", "Score": 0.0159}, {"Name": "BUSINESS_AND_FINANCE", "Score": 0.0102}]}
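
Records like the one above can be reduced to a single predicted label per input line. The sketch below assumes the extracted archive contains one JSON object per line with the `Line` and `Classes` fields shown; `top_classes` is a hypothetical helper, and the `predictions.jsonl` file name in the usage comment is an assumption to check against the actual `tar` output.

```python
import json


def top_classes(jsonl_lines):
    """Yield (line_number, best_class_name, score) for each prediction record."""
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:  # tolerate blank lines
            continue
        record = json.loads(raw)
        # Pick the class with the highest confidence score
        best = max(record["Classes"], key=lambda c: c["Score"])
        yield int(record["Line"]), best["Name"], best["Score"]


# Example usage in the notebook (file name is an assumption):
# with open("data/comprehend/results/predictions.jsonl", encoding="utf-8") as f:
#     for line_no, label, score in top_classes(f):
#         print(line_no, label, score)
```

Taking the maximum explicitly, rather than the first entry of `Classes`, avoids relying on the order in which the service lists the classes.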