This cross-service notebook walks you through using Textract's DetectDocumentText API to extract text from a JPG/JPEG/PNG image file, and then using Comprehend's DetectEntities API to find entities in the extracted text.

To use the Boto3 Python SDK, you need to configure your AWS credentials. In the code cell below, replace "KeyID" with your AWS Access Key ID and replace "AccessKey" with your AWS Secret Access Key.

In [None]:
!aws configure set aws_access_key_id "KeyID"
!aws configure set aws_secret_access_key "AccessKey"

After setting your security credentials, import the libraries the notebook uses. You also need to set the name of the S3 bucket that holds your image, the name of the image itself, and the AWS Region. In the code below, replace "bucket-name" with the name of your bucket, replace "document-name" with the name of the image file you want to analyze, and replace "region" with the name of the Region you are operating in.

In [None]:
import boto3
import io
from PIL import Image
from IPython.display import display
import json
import pandas as pd

bucket = 'bucket-name'
document = 'document-name'
region = 'region'
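
Before calling any AWS service, it can help to confirm that the document name you set points at one of the image types this notebook works with. The following is a minimal sketch, not part of the original notebook: the helper name is_supported_image and the case-insensitive extension check are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Image extensions this notebook is written for.
# Assumption: the extension is matched case-insensitively.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def is_supported_image(document_name):
    """Return True if the S3 object key ends in a supported image extension."""
    # S3 keys use forward slashes, so PurePosixPath parses them on any OS.
    suffix = PurePosixPath(document_name).suffix.lower()
    return suffix in SUPPORTED_EXTENSIONS


print(is_supported_image("scans/receipt.PNG"))
print(is_supported_image("notes.txt"))
```

In the notebook you could call is_supported_image(document) right after setting the variables above and stop early if it returns False.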

Next, create a function that connects to both S3 and Textract via the Boto3 SDK. The function in the following code starts by connecting to the S3 resource and retrieving the image you specified from the bucket you specified. It then connects to Textract and calls the DetectDocumentText API to extract the text in the image. Finally, it displays the image and returns the lines of text found in the document as a list.

In [None]:
def process_text_detection(bucket, document):
    # Get the document from S3
    s3_connection = boto3.resource('s3')

    s3_object = s3_connection.Object(bucket, document)
    s3_response = s3_object.get()

    # opening binary stream using an in-memory bytes buffer
    stream = io.BytesIO(s3_response['Body'].read())
    # loading stream into image
    image = Image.open(stream)

    # Detect text in the document
    client = boto3.client('textract', region_name=region)

    # process using S3 object
    response = client.detect_document_text(
        Document={'S3Object': {'Bucket': bucket, 'Name': document}})

    # Get the text blocks
    blocks = response['Blocks']

    # List to store image lines in document
    line_list = []

    # Collect the text of every LINE block Textract detected
    for block in blocks:
        if block["BlockType"] == "LINE":
            line_list.append(block["Text"])

    # Display the image
    display(image)
    return line_list

lines = process_text_detection(bucket, document)
print("Text found: " + str(lines))
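
Each block in a DetectDocumentText response also carries a Confidence value between 0 and 100, so the same loop can skip lines that Textract is unsure about. The following is a sketch that works on a hand-written sample response rather than a real API call. The helper name extract_lines, the min_confidence parameter, and the sample data are assumptions for illustration only.

```python
def extract_lines(response, min_confidence=0.0):
    """Return the text of LINE blocks whose Confidence is at least min_confidence."""
    kept = []
    for block in response.get("Blocks", []):
        # PAGE and WORD blocks are skipped; only LINE blocks are collected.
        if block.get("BlockType") != "LINE":
            continue
        if block.get("Confidence", 0.0) >= min_confidence:
            kept.append(block["Text"])
    return kept


# Hand-written sample shaped like a DetectDocumentText response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Hello from Seattle", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "Hello", "Confidence": 99.5},
        {"BlockType": "LINE", "Text": "sm udged", "Confidence": 41.7},
    ]
}

print(extract_lines(sample_response, min_confidence=80.0))
```

With min_confidence left at its default of 0.0, the helper returns every LINE block, which matches what process_text_detection collects.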

You can now send the lines you extracted from the image to Comprehend and use the service's DetectEntities API to find all entities within those lines. The code below iterates through the list of lines returned by the "process_text_detection" function you wrote earlier and calls the DetectEntities operation on every line.

In [None]:
# Create a list to hold the entities found for every line
response_entities = []

# Connect to comprehend
comprehend = boto3.client(service_name='comprehend', region_name=region)

print('Calling DetectEntities:')
print("------")
# Iterate through the lines in the list of lines
for line in lines:

    # construct a list to hold all found entities for a single line
    entities_list = []

    # Call the DetectEntities operation and pass it a line from lines
    found_entities = comprehend.detect_entities(Text=line, LanguageCode='en')
    # The detected entities are returned as a list under the "Entities" key
    for entity in found_entities["Entities"]:
        # Append the text of each found entity to the list of entities
        entities_list.append(entity["Text"])
    # Print the entities found for this line, if any
    if entities_list:
        print("Entities found:")
        for entity_text in entities_list:
            print(entity_text)
    # Add all found entities for this line to the list of all entities found
    response_entities.append(entities_list)
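
Calling DetectEntities once per line means one request per line of text. Comprehend also offers a BatchDetectEntities operation that accepts up to 25 documents per call, so the lines could be sent in groups instead. The following is a sketch under stated assumptions: the function names chunked and detect_entities_in_batches are hypothetical, and FakeComprehend is an offline stand-in (it simply treats capitalized words as entities) so that the example runs without AWS credentials.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def detect_entities_in_batches(comprehend_client, lines, language_code="en", batch_size=25):
    """Return one list of entity texts per input line, using BatchDetectEntities."""
    all_entities = [[] for _ in lines]
    offset = 0
    for batch in chunked(lines, batch_size):
        response = comprehend_client.batch_detect_entities(
            TextList=batch, LanguageCode=language_code)
        for result in response["ResultList"]:
            # "Index" is the zero-based position of the text within this batch.
            position = offset + result["Index"]
            all_entities[position] = [entity["Text"] for entity in result["Entities"]]
        offset += len(batch)
    return all_entities


class FakeComprehend:
    """Offline stand-in for illustration only; not how Comprehend finds entities."""

    def batch_detect_entities(self, TextList, LanguageCode):
        return {
            "ResultList": [
                {"Index": i,
                 "Entities": [{"Text": w} for w in text.split() if w[0].isupper()]}
                for i, text in enumerate(TextList)
            ],
            "ErrorList": [],
        }


sample_lines = ["Jane moved to Boston", "nothing here", "call Amazon"]
print(detect_entities_in_batches(FakeComprehend(), sample_lines, batch_size=2))
```

In the notebook you would pass the comprehend client created above in place of FakeComprehend() and keep the default batch_size. Lines that Comprehend reports under "ErrorList" are left with an empty entity list in this sketch.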

Now that you have a list of the lines extracted by Textract and the entities found in those lines, you can create a dataframe that lets you see both. In the code below, a Pandas dataframe is constructed, displaying the lines found in the input image and their associated entities.

In [None]:
entities_dict = {"Lines":lines, "Entities":response_entities}
df = pd.DataFrame(entities_dict, columns=["Lines","Entities"])
print(df)
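
The dataframe above keeps each line's entities together in a single list-valued cell. If you would rather have one row per entity, for example to count or filter entities, you can flatten the two lists first. The following is a minimal sketch in plain Python. The helper name flatten_entities and the sample data are assumptions for illustration.

```python
def flatten_entities(lines, entities_per_line):
    """Return one {"Line", "Entity"} record per entity found in each line."""
    rows = []
    for line, entities in zip(lines, entities_per_line):
        for entity in entities:
            rows.append({"Line": line, "Entity": entity})
    return rows


# Sample data shaped like the notebook's lines and response_entities lists.
sample_lines = ["Jane moved to Boston", "nothing here"]
sample_entities = [["Jane", "Boston"], []]

for row in flatten_entities(sample_lines, sample_entities):
    print(row)
```

Passing the returned list to pd.DataFrame gives a table with "Line" and "Entity" columns. Note that lines with no entities produce no rows in this form.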