# Textract Textractor 

Textractor helps accelerate your PoCs by allowing you to quickly extract text, forms and tables from documents using Amazon Textract. It can generate output in different formats including raw JSON, JSON for each page in a document, text, text in reading order, key/values exported as CSV, and tables exported as CSV.

This notebook uses several Textractor packages, each covering one feature: the caller, the response parser, the overlayer, and the pretty printer. The source repository is at https://github.com/aws-samples/amazon-textract-textractor. All packages are also available on PyPI.

Ensure you're using the **conda_mxnet_latest_p37** kernel. **Python 3.7** is required to run the code. 

First, you'll install or upgrade to the latest versions of the Textract caller and response parser packages. The textract-helper package (https://github.com/aws-samples/amazon-textract-textractor/tree/master/helper) also uses the caller, overlayer, and pretty-printer packages.

In [None]:
!python -m pip install -q amazon-textract-caller --upgrade
!python -m pip install -q amazon-textract-response-parser --upgrade

In [None]:
import boto3
import trp
import trp.trp2 as t2
# Textract Caller
from textractcaller.t_call import call_textract, Textract_Features
# Textract Response Parser
from trp import Document

In [None]:
# Amazon Textract client
textract = boto3.client('textract')

#Document
documentName = "employmentapp.png"

In [None]:
#display the document
from IPython.display import Image
Image(documentName)

# Textract Overlayer 

Textract overlayer returns the bounding boxes of the detected elements, scaled to the document's dimensions, so they are easy to draw on top of the document for visualization.

In [None]:
!python -m pip install -q amazon-textract-overlayer 

In [None]:
from PIL import Image as PImage, ImageDraw
image = PImage.open(documentName)

#use textract caller and overlayer to get bounding boxes
from textractoverlayer.t_overlay import DocumentDimensions, get_bounding_boxes
from textractcaller.t_call import Textract_Features, Textract_Types, call_textract


doc = call_textract(input_document = documentName, features = [Textract_Features.FORMS, Textract_Features.TABLES])

# image is a PIL.Image.Image in this case
document_dimension: DocumentDimensions = DocumentDimensions(doc_width=image.size[0], doc_height=image.size[1])
    
#return the bounding boxes for word, form, and cell types
overlay=[Textract_Types.WORD, Textract_Types.FORM, Textract_Types.CELL]

bounding_box_list = get_bounding_boxes(textract_json=doc, document_dimensions=document_dimension, overlay_features=overlay)

In [None]:
#Show the overlay drawing of the bounding boxes on the document
rgb_im = image.convert('RGB')
draw = ImageDraw.Draw(rgb_im)

# check the implementation in amazon-textract-helper for ways to associate different colors to types
for bbox in bounding_box_list:
    draw.rectangle(xy=[bbox.xmin, bbox.ymin, bbox.xmax, bbox.ymax], outline=(128, 128, 0), width=2)

from IPython.display import display
display(rgb_im)
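
Textract reports each bounding box as ratios of the page width and height, which is why the overlayer needs the document dimensions. The cell below is a minimal sketch of that scaling step in plain Python. The helper name `bbox_to_pixels` and the sample box are made up for illustration, and this is not the overlayer's actual implementation.

```python
from typing import Dict, Tuple


def bbox_to_pixels(bounding_box: Dict[str, float],
                   doc_width: int,
                   doc_height: int) -> Tuple[int, int, int, int]:
    """Scale a ratio-based BoundingBox to pixel corners (xmin, ymin, xmax, ymax)."""
    xmin = round(bounding_box["Left"] * doc_width)
    ymin = round(bounding_box["Top"] * doc_height)
    xmax = round((bounding_box["Left"] + bounding_box["Width"]) * doc_width)
    ymax = round((bounding_box["Top"] + bounding_box["Height"]) * doc_height)
    return xmin, ymin, xmax, ymax


# Hand-made example box (assumption, not real Textract output): it starts 10%
# from the left and 20% from the top, spanning 50% of the width and 25% of
# the height of a 1000x800 pixel image.
sample_box = {"Left": 0.1, "Top": 0.2, "Width": 0.5, "Height": 0.25}
pixel_box = bbox_to_pixels(sample_box, doc_width=1000, doc_height=800)
print(pixel_box)
```

The resulting tuple has the same corner layout that `draw.rectangle` expects in the cell above.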

# Textract Response Parser

Use the Textract response parser library to parse the JSON returned by Textract. By default, Textract does not return the identified elements in a particular order in the JSON response. This sample walks each page and prints its lines and words with their confidence scores, followed by the table cells and form fields.

In [None]:
# Call Amazon Textract
response = call_textract(input_document = documentName, 
                         features = [Textract_Features.FORMS, Textract_Features.TABLES])

from trp import Document
doc = Document(response)

# Iterate over elements in the document
for page in doc.pages:
    # Print lines and words
    for line in page.lines:
        print("Line: {}--{}".format(line.text, line.confidence))
        for word in line.words:
            print("Word: {}--{}".format(word.text, word.confidence))

    # Print tables
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}-{}".format(r, c, cell.text, cell.confidence))

    # Print fields
    for field in page.form.fields:
        key = field.key.text if field.key else ""
        value = field.value.text if field.value else ""
        print("Field: Key: {}, Value: {}".format(key, value))

    # Get field by key
    key = "Phone Number:"
    field = page.form.getFieldByKey(key)
    if field:
        print("Field: Key: {}, Value: {}".format(field.key, field.value))

    # Search fields by key
    key = "address"
    fields = page.form.searchFieldsByKey(key)
    for field in fields:
        print("Field: Key: {}, Value: {}".format(field.key, field.value))
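
To show what the parser saves you from doing by hand, here is a rough sketch of resolving one form field directly from the raw block list. In the raw JSON, a `KEY_VALUE_SET` block with entity type `KEY` points to its `VALUE` block and to its `WORD` children through `Relationships`. The `sample_response` below is hand-built and heavily simplified (an assumption; real responses also carry geometry, confidence, and more), and the helper names are hypothetical.

```python
from typing import Dict, List

# Hand-built, simplified stand-in for a Textract response (assumption).
sample_response = {
    "Blocks": [
        {"Id": "w1", "BlockType": "WORD", "Text": "Phone"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
        {"Id": "w3", "BlockType": "WORD", "Text": "555-0100"},
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    ]
}


def related_ids(block: dict, relationship_type: str) -> List[str]:
    """Collect the Ids of all relationships of one type on a block."""
    ids: List[str] = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == relationship_type:
            ids.extend(relationship["Ids"])
    return ids


def block_text(block: dict, blocks_by_id: Dict[str, dict]) -> str:
    """Join the text of the WORD children of a KEY or VALUE block."""
    children = [blocks_by_id[i] for i in related_ids(block, "CHILD")]
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def extract_key_values(response: dict) -> Dict[str, str]:
    """Map each key's text to the text of the value block(s) it points to."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs: Dict[str, str] = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if is_key:
            value_text = " ".join(block_text(blocks_by_id[i], blocks_by_id)
                                  for i in related_ids(block, "VALUE"))
            pairs[block_text(block, blocks_by_id)] = value_text
    return pairs


key_values = extract_key_values(sample_response)
print(key_values)
```

This sketch ignores selection elements (checkboxes) and overwrites repeated keys, both of which the response parser handles more carefully.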

## Order blocks (WORDS, LINES, TABLE, KEY_VALUE_SET) by geometry y-axis

This sample is similar to the cell above, but it uses the order_blocks_by_geo function. Using the serializer/deserializer (TDocumentSchema), it shows how to change the structure and order the elements while keeping the schema intact, so no changes are needed to integrate with existing processing.

In [None]:
#from textractcaller.t_call import call_textract, Textract_Features
from trp.trp2 import TDocument, TDocumentSchema
from trp.t_pipeline import order_blocks_by_geo
import trp


j = call_textract(input_document = documentName, features=[Textract_Features.FORMS, Textract_Features.TABLES])
# t_doc is not yet ordered
t_doc = TDocumentSchema().load(j)
# the ordered_doc has elements ordered by y-coordinate (top to bottom of page)
ordered_doc = order_blocks_by_geo(t_doc)
# send to trp for further processing logic
trp_doc = trp.Document(TDocumentSchema().dump(ordered_doc))
print(trp_doc)
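
The idea behind ordering by the y-axis can be illustrated without any library. The following sketch sorts hand-made `LINE` blocks by the `Top` value of their bounding box. It only demonstrates the concept on simplified sample data (an assumption), and it is not how `order_blocks_by_geo` is implemented, which works on the full `TDocument` structure.

```python
from typing import List

# Hand-made, simplified blocks in arbitrary order (assumption: real Textract
# blocks carry more fields, such as Id, Confidence, and Relationships).
sample_blocks = [
    {"BlockType": "LINE", "Text": "Signature",
     "Geometry": {"BoundingBox": {"Top": 0.90, "Left": 0.10}}},
    {"BlockType": "LINE", "Text": "Employment Application",
     "Geometry": {"BoundingBox": {"Top": 0.05, "Left": 0.30}}},
    {"BlockType": "WORD", "Text": "Employment",
     "Geometry": {"BoundingBox": {"Top": 0.05, "Left": 0.30}}},
    {"BlockType": "LINE", "Text": "Full Name: Jane Doe",
     "Geometry": {"BoundingBox": {"Top": 0.20, "Left": 0.10}}},
]


def order_lines_top_to_bottom(blocks: List[dict]) -> List[dict]:
    """Keep only LINE blocks and sort them by the top edge of their box."""
    lines = [b for b in blocks if b["BlockType"] == "LINE"]
    return sorted(lines, key=lambda b: b["Geometry"]["BoundingBox"]["Top"])


ordered_text = [b["Text"] for b in order_lines_top_to_bottom(sample_blocks)]
print(ordered_text)
```

A plain sort by `Top` is enough for single-column pages; multi-column layouts would need extra handling that this sketch does not attempt.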

# Textract Prettyprinter

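Textract prettyprinter converts the Textract JSON output into formats that are easier to read and to consume in other systems, e.g. CSV, LaTeX, or Markdown. It is published on PyPI as amazon-textract-prettyprinter; install it first if the import in the cell after the image fails.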

In [None]:
#new image
imageName="patient_intake_form_sample.jpg"

#display the image
from IPython.display import Image
Image(imageName)

In [None]:
#format Textract output and print in CSV format 
from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_string, get_tables_string
from textractcaller.t_call import Textract_Features, Textract_Types, call_textract

textract_json = call_textract(input_document= imageName, features=[Textract_Features.FORMS, Textract_Features.TABLES])
print(get_string(textract_json=textract_json,
               table_format=Pretty_Print_Table_Format.csv,
               output_type=[Textract_Pretty_Print.FORMS, Textract_Pretty_Print.TABLES]))
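
For downstream systems it can help to see what a CSV rendering of a table amounts to. The cell below is a small sketch that turns rows of cell text into CSV with Python's standard `csv` module. The `sample_rows` data and the `table_to_csv` helper are invented for illustration; the pretty printer's own output format may differ in quoting and layout.

```python
import csv
import io
from typing import List


def table_to_csv(rows: List[List[str]]) -> str:
    """Serialize rows of cell text to a CSV string with minimal quoting."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample table (assumption): a header row and two data rows.
# The last row contains a comma, so the writer has to quote that cell.
sample_rows = [["Item", "Qty"], ["Widget", "2"], ["Bolt, M4", "10"]]
csv_text = table_to_csv(sample_rows)
print(csv_text)
```

Using the `csv` module rather than joining strings by hand keeps cells that contain commas or quotes intact.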

In [None]:
#call Textract
j = call_textract(input_document=imageName, features=[Textract_Features.FORMS])

# Print the key/value pairs to spot keys that share the same name
from textractprettyprinter.t_pretty_print import get_forms_string
print(get_forms_string(j))
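
Once the key/value pairs are available as plain Python tuples, duplicated key names can also be found programmatically instead of by reading the printout. This is a minimal sketch on invented sample pairs (an assumption); `group_duplicate_keys` is a hypothetical helper, not part of any Textractor package.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Invented sample pairs (assumption), standing in for keys and values
# extracted from a form where "Name" appears twice.
form_pairs = [
    ("Name", "Jane Doe"),
    ("Date", "2021-01-05"),
    ("Name", "John Doe"),
    ("Phone", "555-0100"),
]


def group_duplicate_keys(pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return keys that occur more than once, with all of their values."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        # Normalize so "Name" and " name " count as the same key.
        grouped[key.strip().lower()].append(value)
    return {key: values for key, values in grouped.items() if len(values) > 1}


duplicates = group_duplicate_keys(form_pairs)
print(duplicates)
```

Keys that come back from this check are the ones that need extra context, such as their position on the page, to tell their values apart.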