<h1> <b> Amazon Textract Primitives and APIs</b> </h1>

Amazon Textract is a document analysis service that detects and extracts printed text, handwriting, and structured data (such as fields of interest and their values, and tables) from images and scans of documents. Amazon Textract's machine learning models have been trained on millions of documents so that virtually any document type you upload is automatically recognized and processed for text extraction. When information is extracted from documents, the service returns a confidence score for each element it identifies so that you can make informed decisions about how to use the results. Today's workshop focuses on four Textract APIs, each of which plays a different role in document processing.

<h3> <strong>For more details about each of these APIs, along with when to use them, refer to the Workshop Guide under Module 1 </strong> </h3>

In [None]:
# Install the helper libraries used to parse and pretty-print Textract responses
!python -m pip install amazon-textract-prettyprinter
!python -m pip install amazon-textract-response-parser

import os
import pprint

import boto3
from trp import Document
# General-purpose pretty-printing helpers (used for lines, forms, and tables)
from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_lines_string, get_string
# Expense-specific helpers; get_string is aliased so it does not shadow the general one
from textractprettyprinter.t_pretty_print_expense import Textract_Expense_Pretty_Print
from textractprettyprinter.t_pretty_print_expense import get_string as get_expense_string

<h1> Detect Text with Amazon Textract - Local Document </h1>
Amazon Textract performs OCR through the Detect Document Text API. This API returns all of the raw text found on the input document, which in this example is read from a local file.

In [None]:
# Initialize the connection to Amazon Textract
textract = boto3.client('textract')

# Select the document
document = 'w2example.jpg'


In [None]:
# Send the document to the Detect Document Text API.
# The file handle is named `f` so that `document` keeps holding the file name.
with open(document, 'rb') as f:
    imageBytes = f.read()

textract_response = textract.detect_document_text(Document={'Bytes': imageBytes})

In [None]:
#Print the parsed results
doc = Document(textract_response)
for page in doc.pages:
    # Print lines and words
    for line in page.lines:
        print("Line: {}--{}".format(line.text, line.confidence))
        for word in line.words:
            print("Word: {}--{}".format(word.text, word.confidence))
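
Because each element comes back with a confidence score, one common next step is to keep only the lines above a chosen threshold. The snippet below is a minimal sketch of that idea. The `sample_response` data, the `confident_lines` helper, and the 90.0 threshold are illustrative assumptions, not part of the workshop. The sketch relies only on the `Blocks`, `BlockType`, `Text`, and `Confidence` keys of the raw response.

```python
# Made-up data shaped like a DetectDocumentText response (assumption for illustration).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Employee's social security number", "Confidence": 99.1},
        {"BlockType": "LINE", "Id": "l2", "Text": "W4ges, tips", "Confidence": 62.4},
        {"BlockType": "WORD", "Id": "w1", "Text": "Employee's", "Confidence": 99.5},
    ]
}


def confident_lines(response, min_confidence=90.0):
    """Return (text, confidence) for LINE blocks at or above the threshold."""
    return [
        (block["Text"], block["Confidence"])
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


# Print only the lines the service was reasonably sure about.
for text, confidence in confident_lines(sample_response):
    print(f"{text} ({confidence:.1f}%)")
```

In the notebook, the same helper could be called on `textract_response` instead of the sample data. The right threshold depends on how costly a misread value is for your use case.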

In [None]:
#Optional Step to see fully raw Textract Output. Please refer to Workshop guide for more details! 
pprint.pprint(textract_response)
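
The raw output is long, so a quick summary can be easier to read than the full dump. As a rough sketch, the snippet below counts how many blocks of each type a response contains. The `sample_raw_response` data and the `count_block_types` helper are hypothetical and exist only for illustration.

```python
from collections import Counter

# Made-up data shaped like a raw Textract response (assumption for illustration).
sample_raw_response = {
    "DocumentMetadata": {"Pages": 1},
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Wages, tips"},
        {"BlockType": "WORD", "Id": "w1", "Text": "Wages,"},
        {"BlockType": "WORD", "Id": "w2", "Text": "tips"},
    ],
}


def count_block_types(response):
    """Count the blocks in a response, grouped by their BlockType."""
    return Counter(block["BlockType"] for block in response.get("Blocks", []))


summary = count_block_types(sample_raw_response)
for block_type, count in sorted(summary.items()):
    print(f"{block_type}: {count}")
```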

In [None]:
#Using a post-processing library to clean up output
pretty_printed = get_lines_string(textract_json=textract_response)
print(pretty_printed)

<h1> Detect Text with Amazon Textract - S3 Document </h1>
Amazon Textract performs OCR through the Detect Document Text API. This API returns all of the raw text found on the input document, which in this example is read from an Amazon S3 bucket.

In [None]:
s3BucketName = 'reinvent316-84342323'
documentName = 'w2example.jpg'

textracts3_response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

pretty_printeds3 = get_lines_string(textract_json=textracts3_response)
print(pretty_printeds3)
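
The local and S3 examples differ only in the `Document` argument: local files are passed as `Bytes`, while S3 objects are passed as an `S3Object` reference. One possible way to keep both paths in a single place is sketched below. The `build_document_arg` helper and its parameter names are assumptions introduced for illustration and are not part of boto3.

```python
def build_document_arg(local_path=None, bucket=None, key=None):
    """Build the Document argument for a synchronous Textract call.

    Hypothetical helper: pass `local_path` for a local file,
    or `bucket` and `key` for an object stored in Amazon S3.
    """
    if local_path is not None:
        # Local files are sent inline as raw bytes.
        with open(local_path, "rb") as f:
            return {"Bytes": f.read()}
    if bucket and key:
        # S3 documents are referenced by bucket and object name.
        return {"S3Object": {"Bucket": bucket, "Name": key}}
    raise ValueError("Provide either local_path or both bucket and key")


# Example with placeholder names; no AWS call is made here.
print(build_document_arg(bucket="my-example-bucket", key="w2example.jpg"))
```

The returned dictionary could then be passed as `Document=...` to `textract.detect_document_text`, as in the cells above.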

<h1> Analyze Document with Amazon Textract - Local Document (Forms) </h1>
The Analyze Document API builds on top of the Detect Document Text API and also detects structure within a document, such as tables and form fields (key-value pairs). This example requests form data.

In [None]:

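document = 'invoice.jpg'

# Call the Analyze Document API, requesting form (key-value) data.
# The file handle is named `f` so that `document` keeps holding the file name.
with open(document, "rb") as f:
    response = textract.analyze_document(
        Document={
            'Bytes': f.read(),
        },
        FeatureTypes=["FORMS"])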
    

In [None]:
#post-process the results
print(get_string(textract_json=response,
               output_type=[Textract_Pretty_Print.FORMS]))
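
To make the key-value structure more concrete, the following is a minimal sketch of how form pairs can be pulled out of the raw block list without a helper library. In a FORMS response, `KEY_VALUE_SET` blocks tagged `KEY` point to their `VALUE` blocks, and both point to their `WORD` children. The `sample_forms_response` data and the helper names are assumptions made for illustration.

```python
# Made-up data shaped like an AnalyzeDocument (FORMS) response (assumption).
sample_forms_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
        {"Id": "w3", "BlockType": "WORD", "Text": "12345"},
    ]
}


def related_ids(block, relationship_type):
    """Return the IDs a block links to through the given relationship type."""
    return [
        block_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == relationship_type
        for block_id in rel["Ids"]
    ]


def block_text(block, blocks_by_id):
    """Join the text of all WORD children of a block."""
    children = (blocks_by_id[i] for i in related_ids(block, "CHILD"))
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def extract_key_values(response):
    """Map each form key to the text of its linked value block(s)."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if is_key:
            key = block_text(block, blocks_by_id)
            value = " ".join(block_text(blocks_by_id[v], blocks_by_id)
                             for v in related_ids(block, "VALUE"))
            pairs[key] = value
    return pairs


print(extract_key_values(sample_forms_response))
```

This sketch ignores checkboxes (`SELECTION_ELEMENT` blocks) and repeated keys, both of which a real form may contain.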

<h1> Analyze Document with Amazon Textract - Local Document (Tables) </h1>
The Analyze Document API builds on top of the Detect Document Text API and also detects structure within a document, such as tables and form fields (key-value pairs). This example requests table data.

In [None]:
document = 'invoice.jpg'

# Call the Analyze Document API, requesting table data.
# The file handle is named `f` so that `document` keeps holding the file name.
with open(document, "rb") as f:
    response = textract.analyze_document(
        Document={
            'Bytes': f.read(),
        },
        FeatureTypes=["TABLES"])


In [None]:
#post-process the results
print(get_string(textract_json=response,
               output_type=[Textract_Pretty_Print.TABLES]))
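
For a sense of what the pretty-printer works from, here is a rough sketch of rebuilding a table as a list of rows from the raw blocks. In a TABLES response, each `TABLE` block lists its `CELL` children, and each cell carries a 1-based `RowIndex` and `ColumnIndex` plus its `WORD` children. The `sample_tables_response` data and the helper names are assumptions for illustration, and merged cells are not handled.

```python
# Made-up data shaped like an AnalyzeDocument (TABLES) response (assumption).
sample_tables_response = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
        {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
        {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
        {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w5"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Blue"},
        {"Id": "w4", "BlockType": "WORD", "Text": "widget"},
        {"Id": "w5", "BlockType": "WORD", "Text": "9.99"},
    ]
}


def child_ids(block):
    """Return the IDs of a block's CHILD relationships."""
    return [i for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD" for i in rel["Ids"]]


def table_to_rows(table_block, blocks_by_id):
    """Convert one TABLE block into a list of rows of cell text."""
    cells = {}
    for cell_id in child_ids(table_block):
        cell = blocks_by_id[cell_id]
        if cell["BlockType"] != "CELL":
            continue
        words = [blocks_by_id[w]["Text"] for w in child_ids(cell)
                 if blocks_by_id[w]["BlockType"] == "WORD"]
        cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    # Cells missing from the response are filled with empty strings.
    return [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]


def extract_tables(response):
    """Return every table in the response as a list of rows."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    return [table_to_rows(b, blocks_by_id)
            for b in response["Blocks"] if b["BlockType"] == "TABLE"]


for row in extract_tables(sample_tables_response)[0]:
    print(row)
```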

<h1> Analyze Expense with Amazon Textract</h1>
The Analyze Expense API is a purpose-built API designed to extract line-item details, in addition to key-value pairs, from invoices and receipts.

In [19]:
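document = "invoice.jpg"

# Read the invoice; the file handle is named `f` so that `document` keeps holding the file name
with open(document, 'rb') as f:
    imageBytes = f.read()

# Call the Analyze Expense API and print the raw response
response = textract.analyze_expense(Document={'Bytes': imageBytes})
pprint.pprint(response)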

{'DocumentMetadata': {'Pages': 1},
 'ExpenseDocuments': [{'ExpenseIndex': 1,
                       'LineItemGroups': [{'LineItemGroupIndex': 1,
                                           'LineItems': [{'LineItemExpenseFields': [{'LabelDetection': {'Confidence': 99.30142211914062,
                                                                                                        'Geometry': {'BoundingBox': {'Height': 0.03678973391652107,
                                                                                                                                     'Left': 0.0591685026884079,
                                                                                                                                     'Top': 0.42734193801879883,
                                                                                                                                     'Width': 0.4202221632003784},
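
The nesting in the raw expense output above is deep, so it can help to flatten it. The snippet below is a minimal sketch that walks `ExpenseDocuments`, then `LineItemGroups`, then `LineItems`, then `LineItemExpenseFields`, and collects each field's type and detected value. The `sample_expense_response` data and the `extract_line_items` helper are assumptions for illustration. Real responses also include geometry, confidence scores, and label detections, which are ignored here.

```python
# Made-up data shaped like an AnalyzeExpense response (assumption for illustration).
sample_expense_response = {
    "DocumentMetadata": {"Pages": 1},
    "ExpenseDocuments": [
        {
            "ExpenseIndex": 1,
            "LineItemGroups": [
                {
                    "LineItemGroupIndex": 1,
                    "LineItems": [
                        {"LineItemExpenseFields": [
                            {"Type": {"Text": "ITEM"},
                             "ValueDetection": {"Text": "Blue widget", "Confidence": 99.3}},
                            {"Type": {"Text": "PRICE"},
                             "ValueDetection": {"Text": "9.99", "Confidence": 97.2}},
                        ]},
                    ],
                }
            ],
        }
    ],
}


def extract_line_items(response):
    """Flatten every line item into a {field type: detected value} dictionary."""
    items = []
    for expense_doc in response.get("ExpenseDocuments", []):
        for group in expense_doc.get("LineItemGroups", []):
            for line_item in group.get("LineItems", []):
                item = {}
                for field in line_item.get("LineItemExpenseFields", []):
                    field_type = field.get("Type", {}).get("Text", "UNKNOWN")
                    item[field_type] = field.get("ValueDetection", {}).get("Text", "")
                items.append(item)
    return items


for item in extract_line_items(sample_expense_response):
    print(item)
```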
                                                                     