# Amazon Textract and Amazon Comprehend AI Services
### Example of extracting insights from a PDF Document


## Contents 
1. [Background](#Background)
1. [Notes and Configuration](#Notes-and-Configuration)
1. [Functions](#Functions)
1. [Amazon Textract](#Amazon-Textract)
1. [Amazon Comprehend](#Amazon-Comprehend)
1. [Key Phrase Extraction](#Key-Phrase-Extraction)
1. [Sentiment Analysis](#Sentiment-Analysis)
1. [Entity Recognition](#Entity-Recognition)
1. [PII Entity Recognition](#PII-Entity-Recognition)
1. [Topic Modeling](#Topic-Modeling)


  
## Background
The goal of this exercise is to extract insights from an existing PDF document. Amazon Textract first extracts the text from the document, and several Amazon Comprehend APIs then analyze that text to produce insights about the document.  

The PDF document used in this example is a compiled list of tweets or other social media posts. Each post is separated by a URL that points to that posting. When the text is extracted from the PDF document, the text is re-assembled into a single line of text which is the full text of the tweet or post. The resulting text file contains one tweet/post per line.

## Notes and Configuration
* Kernel `Python 3 (Data Science)` works well with this notebook
* The CSV results files produced by this script use the pipe '|' symbol as a delimiter. When viewing these files in SageMaker Studio, be sure to change the Delimiter to 'pipe'.


#### Regarding IAM Roles and Permissions:

Within SageMaker Studio, each SageMaker User has an IAM Role known as the `SageMaker Execution Role`. Each Notebook for this user will run with this Role and the Permissions specified by this Role. The name of this Role can be found in the Details section of each SageMaker User in the AWS Console.

For the code in this notebook to run, the `SageMaker Execution Role` needs additional permissions that allow it to use Amazon Textract and Amazon Comprehend. In the AWS Console, navigate to the IAM service and attach these two managed policies to your SageMaker Execution Role:
- AmazonTextractFullAccess
- AmazonComprehendFullAccess

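Also, an Amazon Comprehend service Role needs to be created to grant Amazon Comprehend read access to your input data. This is an IAM Role that Amazon Comprehend can assume, so its trust policy must allow the `comprehend.amazonaws.com` service principal.  
While creating the Role, click the Attach Policies button and add AmazonS3FullAccess. Complete this step and name your Role as follows: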

`asyncS3ComprehendServiceRole`

Once you have created the role, copy the ARN (it will be in the format `arn:aws:iam::<AccountID>:role/asyncS3ComprehendServiceRole`)

Apart from AmazonS3FullAccess, the default Policies are sufficient when creating this new Role (i.e., no other Policies need to be added/modified).

Lastly, the `SageMaker Execution Role` must be allowed to Pass the Comprehend Service Role. To allow this, you must attach a Policy to the `SageMaker Execution Role`. Below, the Resource entry is the ARN of the Comprehend service Role which you created. You can either create this as a new Policy and attach it or add it as an in-line Policy.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "iam:GetRole",
                    "iam:PassRole"
                ],
                "Resource": "<ENTER YOUR ARN HERE>"
            }
        ]
    }
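
As a rough sketch, the same policy document can also be built in Python and attached as an in-line Policy with boto3. The helper name `build_pass_role_policy`, the account ID in the example ARN, and the Policy name `AllowPassComprehendRole` are placeholders chosen for illustration and do not come from this notebook. The call that actually attaches the Policy is left commented out because it needs IAM write permissions.

```python
import json


def build_pass_role_policy(comprehend_role_arn):
    """Return the policy document shown above as a dict for the given Role ARN."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["iam:GetRole", "iam:PassRole"],
                "Resource": comprehend_role_arn,
            }
        ],
    }


# Placeholder ARN for illustration only; substitute the ARN you copied.
example_arn = "arn:aws:iam::123456789012:role/asyncS3ComprehendServiceRole"
policy_json = json.dumps(build_pass_role_policy(example_arn), indent=4)
print(policy_json)

# Attaching the document as an in-line Policy (requires IAM write permissions):
# import boto3
# boto3.client("iam").put_role_policy(
#     RoleName="<YOUR SAGEMAKER EXECUTION ROLE NAME>",
#     PolicyName="AllowPassComprehendRole",
#     PolicyDocument=policy_json,
# )
```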



In [None]:
import os
import json
import sys
import time
import boto3
from sagemaker import get_execution_role

## Setup
Set some variables that will be used throughout this example

NOTE: Update the comprehend_role to be the ARN you copied in the step above

In [None]:
region = 'us-east-1'

# change this to an existing S3 bucket in your AWS account
bucket = '<ENTER YOUR BUCKET NAME HERE>'

# this is the role that will be used by the async calls to Comprehend (the sentiment detection job and TopicModeling at the end of this lab)
comprehend_role = '<ENTER YOUR ARN HERE>'

# this is where the various analysis results files will be stored on the local file system of this SageMaker instance
results_dir = './results'
!mkdir -p $results_dir

# the pdf file to be analyzed by Textract
textract_src_filename = 'Alabama2.pdf'

# the name of the file where the JSON results from Textract are saved
json_textract_results_filename = f'{results_dir}/textract-results.json'

# the post-processed results of the JSON results
textract_results_filename = f'{results_dir}/textract-results.txt'

# the post-processed results of the JSON results where each line is less than 5000 chars
trimmed_textract_results_filename = f'{results_dir}/trimmed_textract-results.txt'

# the results of Amazon Comprehend - Key Phrases detection
comprehend_keyphrases_results_filename = f'{results_dir}/comp-keyphrases.csv'

# the results of Amazon Comprehend - Sentiment Analysis
comprehend_sentiments_results_filename = f'{results_dir}/comp-sentiment.csv'

# the results of Amazon Comprehend - Entities Detection
comprehend_entities_results_filename = f'{results_dir}/comp-entities.csv'

# the results of Amazon Comprehend - PII Entities Detection
comprehend_pii_entities_results_filename = f'{results_dir}/comp-pii_entities.csv'

# the results of Amazon Comprehend - Topics Detection
comprehend_topics_results_filename = f'{results_dir}/comp-topics.csv'

# this is the IAM Role that defines which permissions this SageMaker instance has
sm_execution_role = get_execution_role()


Let's download the dataset and copy it to our bucket

In [None]:
!curl -O https://lofgren.house.gov/sites/lofgren.house.gov/files/Alabama2.pdf
!aws s3 cp Alabama2.pdf s3://{bucket}

---
## Functions
The following is a convenience function to calculate frequencies


In [None]:

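def CalcFrequencies(di):
    '''
    CalcFrequencies()
    Input: dict with keys and numeric values
    Returns: dict with the same keys and numeric frequency
    '''
    # use a separate name so the built-in sum() is not shadowed
    total = sum(di.values())

    # guard against division by zero when every count is 0
    if total == 0:
        return {d: 0.0 for d in di}

    return {d: di[d] / total for d in di}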
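
To make the input and output of this helper concrete, here is a small self-contained sketch. It uses a stand-in function named `calc_frequencies` (a hypothetical name, so that it does not clash with the notebook's own `CalcFrequencies`) and made-up labels, and it tallies the counts with `collections.Counter` rather than manual dictionary bookkeeping.

```python
from collections import Counter


def calc_frequencies(counts):
    """Convert a dict of counts into a dict of relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: value / total for key, value in counts.items()}


# Made-up labels for illustration; Counter builds the {label: count} dict.
labels = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEUTRAL"]
example_counts = Counter(labels)
example_freq = calc_frequencies(example_counts)

for label, value in example_freq.items():
    print('%s: %.2f' % (label, value))
```

The frequencies always sum to 1, which makes the results of the different analyses easier to compare than raw counts.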


---
## Amazon Textract
Amazon Textract is a machine learning service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.  
  
In the next few cells the following steps will be performed:
1. A specified PDF document will be uploaded to Amazon S3 to be analyzed by Amazon Textract.  
1. The result of this analysis is a JSON file with each element containing details about a specific instance of text in the PDF.  
1. This JSON file is copied from S3 to this local SageMaker instance.  
1. The JSON file is then read and post-processed to produce a text file with one tweet (or other social media post) per line.  


In [None]:
# create a boto3 session
# this session will be used for the remainder of this notebook
session = boto3.Session(region_name=region)


In [None]:
# create the Textract Job
textract_client = session.client('textract')

response = textract_client.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': bucket,
            'Name': textract_src_filename
        }
    })

jobId = response['JobId']

print('Started Textract job at %s' % (time.ctime()))
print('JobId: %s' % (jobId))


In [None]:
# Get the current job status
response = textract_client.get_document_text_detection(JobId=jobId)
response['JobStatus']
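
The cell above checks the status only once, so it may need to be re-run until the job has finished. A minimal polling sketch is shown below, assuming the job reports `IN_PROGRESS` until it reaches a final state. The function name `wait_for_textract_job` and its `poll_seconds` and `max_polls` parameters are hypothetical helpers, not part of boto3.

```python
import time


def wait_for_textract_job(client, job_id, poll_seconds=5, max_polls=120):
    """Poll get_document_text_detection until the job leaves IN_PROGRESS.

    Returns the last response; raises TimeoutError if max_polls is exhausted.
    """
    for _ in range(max_polls):
        response = client.get_document_text_detection(JobId=job_id)
        if response['JobStatus'] != 'IN_PROGRESS':
            return response
        time.sleep(poll_seconds)
    raise TimeoutError('Textract job %s is still IN_PROGRESS' % job_id)


# In this notebook the call would look like this (left commented out here):
# response = wait_for_textract_job(textract_client, jobId)
# response['JobStatus']
```

Passing the client in as an argument keeps the helper independent of any particular session, and the poll limit prevents the cell from hanging indefinitely.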

In [None]:
# We now extract the results as a JSON List

pages = []
if response['JobStatus'] == 'SUCCEEDED':
    while('NextToken' in response):
        pages.append(response)
        response = textract_client.get_document_text_detection(JobId=jobId, NextToken=response['NextToken'])
    pages.append(response)

In [None]:
# iterate through the Textract responses and write the text of each LINE entry to a file
# (WORD entries repeat the text of their parent LINE entries, so including them would duplicate every line)

with open(textract_results_filename, 'w') as fd:
    for resp in pages:
        for blk in resp['Blocks']:
            if blk['BlockType'] == 'LINE':
                # if 'https' is found at the beginning of the line, we assume a new paragraph of text will be started
                loc = blk['Text'].find('https')
                if 0 <= loc <= 2:
                    fd.write('\n')
                else:
                    fd.write('%s ' % blk['Text'])
                    
print('See results file: %s\n' % textract_results_filename)
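
To illustrate how the URL lines split the extracted text into posts, here is a small sketch that applies the same rule to a few hand-written blocks. The function name `group_posts` and the sample blocks are made up for this example. Unlike the cell above, the sketch returns a list instead of writing a file, reads only `LINE` blocks (on the assumption that `WORD` blocks repeat the same text), and skips empty posts.

```python
def group_posts(blocks):
    """Join LINE blocks into one string per post, splitting at URL lines."""
    posts = []
    current = []
    for blk in blocks:
        if blk['BlockType'] != 'LINE':
            continue  # PAGE blocks carry no text; WORD blocks repeat LINE text
        loc = blk['Text'].find('https')
        if 0 <= loc <= 2:
            # a URL at the start of a line closes the current post
            if current:
                posts.append(' '.join(current))
            current = []
        else:
            current.append(blk['Text'])
    if current:
        posts.append(' '.join(current))
    return posts


# Hand-written blocks in the shape returned by Textract (illustration only).
sample_blocks = [
    {'BlockType': 'PAGE'},
    {'BlockType': 'LINE', 'Text': 'First post, part one'},
    {'BlockType': 'LINE', 'Text': 'and part two'},
    {'BlockType': 'LINE', 'Text': 'https://example.com/post/1'},
    {'BlockType': 'LINE', 'Text': 'Second post'},
    {'BlockType': 'WORD', 'Text': 'Second'},
]

for post in group_posts(sample_blocks):
    print(post)
```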

In [None]:
# save the entire results set to a local file
# this file isn't used in the remaining example, but you can open this JSON file in your Jupyter Notebook and view the elements returned by Textract
with open(json_textract_results_filename, 'w') as fd:
    json.dump(pages, fd)
print(json_textract_results_filename)

---
## Amazon Comprehend
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to discover insights from text. The service provides APIs for Keyphrase Extraction, Sentiment Analysis, Entity Recognition, Topic Modeling, and Language Detection so you can easily integrate natural language processing into your applications. The following cells will walk through several examples of how to use the API.  


In [None]:
# create the comprehend boto3 client (from the existing boto3 session)
comp_client = session.client('comprehend')

In [None]:
# Let's start by loading the textract file that we will use for the next few examples
lines = []
with open(textract_results_filename) as fd:
    lines = fd.readlines()

---
## Sentiment Analysis
Use Amazon Comprehend to determine the Sentiment of each line of text from the Textract analysis.

### Sync vs Async
If your text is short, such as an online review or a sentence or two that does not exceed 5,000 bytes, you can call Comprehend synchronously and pass the text string directly. If your text exceeds that size, you have to start an asynchronous job, check its status, and then process the output as you like. Below I show both examples. Note the brevity of the sync calls compared to the async ones.
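
Because the limit is expressed in bytes, text with non-ASCII characters can exceed it even when its character count looks safe. The sketch below, with the hypothetical helper names `utf8_size` and `truncate_utf8`, measures the UTF-8 size of a string and trims it to the limit without splitting a multi-byte character. It is one possible approach, not something the Comprehend API provides.

```python
MAX_SYNC_BYTES = 5000  # the size limit quoted above for synchronous calls


def utf8_size(text):
    """Return the number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode('utf-8'))


def truncate_utf8(text, max_bytes=MAX_SYNC_BYTES):
    """Trim text so that its UTF-8 encoding fits within max_bytes."""
    encoded = text.encode('utf-8')
    if len(encoded) <= max_bytes:
        return text
    # errors='ignore' drops a partial multi-byte character left at the cut point
    return encoded[:max_bytes].decode('utf-8', errors='ignore')


sample = 'é' * 3000  # 3,000 characters, but 2 bytes each in UTF-8
print(len(sample), utf8_size(sample))    # prints: 3000 6000

trimmed = truncate_utf8(sample)
print(len(trimmed), utf8_size(trimmed))  # prints: 2500 5000
```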

### Sync Example

Here is a simple example to demonstrate a synchronous call to Comprehend to get the sentiment from a piece of text

In [None]:
comp_client.detect_sentiment(Text="I do not like green eggs and ham",
                    LanguageCode='en')

We can take each line of text that we extracted from Textract and send it synchronously to Comprehend, in batches, to get the sentiment  

In [None]:
sentiments = {}
batch_size = 25
with open(comprehend_sentiments_results_filename, 'w') as fd:
    for i in range(0, len(lines), batch_size):
        batch = [op[:4998] for op in lines[i:i+batch_size]]        
        response = comp_client.batch_detect_sentiment(TextList=batch, LanguageCode='en')
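        for line_result in response['ResultList']:
            sentiment = line_result['Sentiment']
            if sentiment in sentiments:
                sentiments[sentiment] += 1
            else:
                sentiments[sentiment] = 1
            # 'Index' is the position of this result within the batch; it can differ from the
            # position in ResultList when some documents are reported in ErrorList instead
            text = batch[line_result['Index']]
            # always end the record with a single newline, even if truncation removed it
            fd.write('%s|%s\n' % (sentiment, text.rstrip('\n')))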

In [None]:
# Now let's calculate the sentiment frequencies

freq = CalcFrequencies(sentiments)
print('Frequencies:')
for d in sentiments:
    print('%s: %.2f' % (d, freq[d]))        
    


### Async Example

In [None]:
with open(trimmed_textract_results_filename, 'w') as fd:
    for line in lines:
        fd.write(line[:4998])

In [None]:
s3_sentiment_input = "s3://%s/%s" % (bucket, os.path.basename(trimmed_textract_results_filename))
s3_sentiment_output = "s3://%s/%s" % (bucket, "sentiment_output")
! aws s3 cp {trimmed_textract_results_filename} {s3_sentiment_input}

In [None]:
response = comp_client.start_sentiment_detection_job(
    InputDataConfig={
        'S3Uri': s3_sentiment_input,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': s3_sentiment_output
    },
    DataAccessRoleArn=comprehend_role,
    LanguageCode='en'
)

jobId = response['JobId']

This next cell takes about 7 minutes to run

In [None]:
%%time
response = comp_client.describe_sentiment_detection_job(JobId=jobId)
# a new job is first reported as SUBMITTED before it moves to IN_PROGRESS, so wait on both states
while response['SentimentDetectionJobProperties']['JobStatus'] in ('SUBMITTED', 'IN_PROGRESS'):
    time.sleep(10)
    response = comp_client.describe_sentiment_detection_job(JobId=jobId)
response['SentimentDetectionJobProperties']['JobStatus']

In [None]:
s3uri = response['SentimentDetectionJobProperties']['OutputDataConfig']['S3Uri']

In [None]:
! aws s3 cp {s3uri} results/

In [None]:
! tar zxvf results/output.tar.gz

In [None]:
! head -10 output
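
The extracted `output` file can then be tallied in the same way as the sync results. The sketch below assumes each line of the file is a JSON record with a `Sentiment` field, as is expected for `ONE_DOC_PER_LINE` input. The helper name `tally_sentiments` and the sample records are made up for illustration and are not real results.

```python
import json
from collections import Counter


def tally_sentiments(json_lines):
    """Count the Sentiment label of each JSON record, skipping blank lines."""
    counts = Counter()
    for raw in json_lines:
        raw = raw.strip()
        if not raw:
            continue
        record = json.loads(raw)
        # records without a Sentiment field are grouped under UNKNOWN
        counts[record.get('Sentiment', 'UNKNOWN')] += 1
    return counts


# Made-up records in the assumed output shape (not real results).
sample_output = [
    '{"File": "trimmed_textract-results.txt", "Line": 0, "Sentiment": "NEUTRAL"}',
    '{"File": "trimmed_textract-results.txt", "Line": 1, "Sentiment": "NEGATIVE"}',
    '',
    '{"File": "trimmed_textract-results.txt", "Line": 2, "Sentiment": "NEUTRAL"}',
]
print(tally_sentiments(sample_output).most_common())

# With the file extracted in the previous cells:
# with open('output') as fd:
#     print(tally_sentiments(fd).most_common())
```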

## Key Phrase Extraction
Use Amazon Comprehend to extract Key Phrases in the text from the Textract analysis.


In [None]:
#keep a running total of the various key phrases
keyphrase_counts = {}

for i in range(0, len(lines), batch_size):
    batch = [op[:4998] for op in lines[i:i+batch_size]]        
    response = comp_client.batch_detect_key_phrases(TextList=batch, LanguageCode='en')
    for idx, line_result in enumerate(response['ResultList'], start=0):
        for keyphrase in line_result['KeyPhrases']:
            kp = keyphrase['Text']
            if kp in keyphrase_counts:
                keyphrase_counts[kp] += 1
            else:
                keyphrase_counts[kp] = 1

sorted_keyphrase_counts = dict(sorted(keyphrase_counts.items(), key=lambda x: x[1], reverse=True))


In [None]:
# calculate the frequency of each key phrase
freq = CalcFrequencies(sorted_keyphrase_counts)

# the results file is in csv format and includes the raw counts and the frequency
with open(comprehend_keyphrases_results_filename, 'w') as fd:
    fd.write('key_phrase|count|frequency\n')
    for kp in sorted_keyphrase_counts:  
        fd.write('%s|%d|%.4f\n' % (kp, sorted_keyphrase_counts[kp], freq[kp]))

print('See results file: %s' % (comprehend_keyphrases_results_filename))

In [None]:
i = 0
for kp in sorted_keyphrase_counts:  
    i += 1
    print('%s|%d|%.4f' % (kp, sorted_keyphrase_counts[kp], freq[kp]))
    if i>10:
        break

---
## Entity Recognition
Use Amazon Comprehend to detect Entities in the text from the Textract analysis.  
What are the types of Entities?
* PERSON, ORGANIZATION, DATE, QUANTITY, LOCATION, TITLE, COMMERCIAL_ITEM, EVENT, OTHER

In [None]:
%%time
entities = {}

with open(comprehend_entities_results_filename, 'w') as fd:
    for i in range(0, len(lines), batch_size):
        batch = [op[:4998] for op in lines[i:i+batch_size]]        
        response = comp_client.batch_detect_entities(TextList=batch, LanguageCode='en')
        for idx, line_result in enumerate(response['ResultList'], start=0):
            for entity in line_result['Entities']:
                etype = entity['Type']
                if etype in entities:
                    entities[etype] += 1
                else:
                    entities[etype] = 1
                fd.write('%s|%s\n' % (etype, entity['Text']))
                
sorted_entities = dict(sorted(entities.items(), key=lambda x: x[1], reverse=True))

In [None]:
freq = CalcFrequencies(sorted_entities)
print('Frequencies:')
for d in sorted_entities:
    print('%s: %.2f' % (d, freq[d]))        
                    

---
## PII Entity Recognition
Use Amazon Comprehend to detect PII Entities in the text from the Textract analysis.  
What are the types of PII Entities?  
* NAME, DATE_TIME, ADDRESS, USERNAME, URL, EMAIL, PHONE, CREDIT_DEBIT_EXPIRY, PASSWORD, AGE, among others


In [None]:
pii_entities = {}

with open(comprehend_pii_entities_results_filename, 'w') as fd:
    for line in lines:
        # maximum text length for Comprehend Entities is 5,000 characters
        line = line[:4998]       
        if len(line) > 1:
            response = comp_client.detect_pii_entities(Text=line, LanguageCode='en')
            for entity in response['Entities']:
                etype = entity['Type']
                if etype in pii_entities:
                    pii_entities[etype] += 1
                else:
                    pii_entities[etype] = 1
                fd.write('%s|%s\n' % (etype, line[entity['BeginOffset']:entity['EndOffset']]))

print('\n')

# sort the dictionary by values
sorted_pii_entities = dict(sorted(pii_entities.items(), key=lambda x: x[1], reverse=True))
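
Unlike the Entities response, the PII response used above is sliced by offsets rather than read from a `Text` field. The sketch below shows that slicing on a hand-written example. The helper name `extract_pii_spans`, the sample sentence, and the sample entities are made up for illustration and were not produced by Comprehend.

```python
def extract_pii_spans(text, entities):
    """Return (type, matched text) pairs by slicing text with each entity's offsets."""
    return [
        (entity['Type'], text[entity['BeginOffset']:entity['EndOffset']])
        for entity in entities
    ]


sample_text = 'Contact Jane Doe at jane@example.com'

# Hand-written entities in the shape of a detect_pii_entities response.
sample_entities = [
    {'Type': 'NAME', 'BeginOffset': 8, 'EndOffset': 16, 'Score': 0.99},
    {'Type': 'EMAIL', 'BeginOffset': 20, 'EndOffset': 36, 'Score': 0.99},
]

for etype, span in extract_pii_spans(sample_text, sample_entities):
    print('%s|%s' % (etype, span))
```

The offsets refer to the exact string that was sent to Comprehend, so the slice must be taken from the truncated line rather than from the original one.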

In [None]:
freq = CalcFrequencies(sorted_pii_entities)
print('Frequencies:')
for d in sorted_pii_entities:
    print('%s: %.2f' % (d, freq[d]))        
                    