# Module 1 - Processing Documents with Amazon Textract

Amazon Textract provides synchronous and asynchronous operations that return only the [text detected in a document](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-detecting.html). For both sets of operations, the following information is returned in multiple Block objects:

- The lines and words of detected text
- The relationships between the lines and words of detected text
- The page that the detected text appears on
- The location of the lines and words of text on the document page
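
To make the list above concrete, here is a minimal sketch of how those four pieces of information show up in a response. The `sample_response` dictionary is hand-written for illustration (hypothetical IDs, text, and coordinates), not real Textract output, and `describe_lines` is a hypothetical helper name. The field names (`BlockType`, `Relationships`, `Page`, `Geometry.BoundingBox`) follow the Textract `Block` structure, where bounding-box values are ratios of the page width and height.

```python
# Hand-written sample shaped like a Textract response (all values are hypothetical).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "page-1", "Page": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["line-1"]}]},
        {"BlockType": "LINE", "Id": "line-1", "Page": 1, "Text": "Account Summary",
         "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.05, "Width": 0.30, "Height": 0.02}},
         "Relationships": [{"Type": "CHILD", "Ids": ["word-1", "word-2"]}]},
        {"BlockType": "WORD", "Id": "word-1", "Page": 1, "Text": "Account",
         "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.05, "Width": 0.14, "Height": 0.02}}},
        {"BlockType": "WORD", "Id": "word-2", "Page": 1, "Text": "Summary",
         "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.05, "Width": 0.15, "Height": 0.02}}},
    ]
}


def describe_lines(response):
    """Return text, child words, page, and position for every LINE block."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    lines = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        # A LINE points at its WORD blocks through a CHILD relationship.
        word_ids = [
            block_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for block_id in rel["Ids"]
        ]
        words = [blocks_by_id[i]["Text"] for i in word_ids
                 if blocks_by_id[i]["BlockType"] == "WORD"]
        box = block["Geometry"]["BoundingBox"]
        lines.append({
            "text": block["Text"],
            "words": words,
            "page": block.get("Page", 1),
            "left": box["Left"],
            "top": box["Top"],
        })
    return lines


for line in describe_lines(sample_response):
    print(line)
```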

The synchronous API is [`DetectDocumentText`](https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html), whereas the asynchronous API is [`StartDocumentTextDetection`](https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentTextDetection.html). 
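
The asynchronous API is not exercised in this notebook, so the following is only a rough sketch of how it is typically used: start a job, poll `GetDocumentTextDetection` until the job leaves `IN_PROGRESS`, then follow `NextToken` to collect every page of results. The helper name `detect_text_async` and its `poll_seconds` and `max_polls` parameters are assumptions made for illustration. The client is passed in so the function can be tried with either a real `boto3.client("textract")` or a stub.

```python
import time


def detect_text_async(textract_client, bucket, key, poll_seconds=5, max_polls=60):
    """Start an asynchronous text-detection job and return all of its blocks."""
    job = textract_client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job is no longer IN_PROGRESS, or give up after max_polls tries.
    for _ in range(max_polls):
        result = textract_client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")

    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Job {job_id} ended with status {result['JobStatus']}")

    # Results can span several responses; keep fetching while a NextToken is returned.
    blocks = list(result["Blocks"])
    next_token = result.get("NextToken")
    while next_token:
        result = textract_client.get_document_text_detection(
            JobId=job_id, NextToken=next_token
        )
        blocks.extend(result["Blocks"])
        next_token = result.get("NextToken")
    return blocks
```

Polling keeps the sketch short. For larger workloads, the `NotificationChannel` parameter of `StartDocumentTextDetection` lets Textract publish job completion to an Amazon SNS topic instead.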


# Step 1: Setup notebook <a id="step1"></a>

In this step, we install and import the libraries that are used throughout this notebook.

In [None]:
!python -m pip install -q amazon-textract-response-parser --upgrade
!python -m pip install -q amazon-textract-caller --upgrade
!python -m pip install -q amazon-textract-prettyprinter==0.0.16
!python -m pip install -q amazon-textract-textractor --upgrade

In [None]:
# Restart the kernel so the newly installed packages are picked up (uncomment the lines below to run)
# import IPython
# IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
# Code Cell: 1
import boto3
import botocore
import sagemaker
import pandas as pd
from IPython.display import Image, display, JSON
from textractcaller.t_call import call_textract, Textract_Features, call_textract_expense
from textractprettyprinter.t_pretty_print import convert_table_to_list
from trp import Document
import os

# variables
data_bucket = sagemaker.Session().default_bucket()
region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
role = sagemaker.get_execution_role()

print(f"SageMaker role is: {role}\nDefault SageMaker Bucket: s3://{data_bucket}")

# AWS service clients used in this notebook
s3 = boto3.client('s3')
textract = boto3.client('textract', region_name=region)
comprehend = boto3.client('comprehend', region_name=region)

---
# Step 2: Extract Text with Amazon Textract <a id="step2"></a>

Amazon Textract is an ML-powered OCR service that detects and extracts text from documents. Text data in the form of WORDS and LINES can be extracted from documents using the Amazon Textract `DetectDocumentText` API. Let's extract the words and lines from the bank statement.

In [None]:
# Code Cell: 2
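# Call Amazon Textract on the sample document stored in S3.
# Note: file_key must already hold the S3 object key of that document; it is not set in the cells shown above.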
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': data_bucket,
            'Name': file_key
        }
    })


# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print(item["Text"])

As you can see, we were able to extract the LINES and WORDS from the document, but we lost some of its structural formatting. For example, the document contains a few tables, and we would like to extract their contents in a tabular structure. Let's do that next.

---
# Step 3: Analyzing complex documents using Amazon Textract <a id="step3"></a>

Amazon Textract [analyzes documents and forms](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html) for relationships among detected text. Amazon Textract analysis operations return six categories of document extraction, listed below; each category is known as a _feature_ of Amazon Textract document analysis. Document analysis is available through the [`AnalyzeDocument`](https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeDocument.html) synchronous API and the [`StartDocumentAnalysis`](https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentAnalysis.html) asynchronous API.

- **`Text`**: Amazon Textract returns the raw text, i.e., the LINEs and WORDs extracted from a document.
  
- **`FORMS`**: Form data is returned as KEY / VALUE pairs for form documents.
  
- **`TABLES`**: Amazon Textract can extract tables, table cells, the items within table cells, table titles and footers, and the type of table. Amazon Textract can also be programmed to return the results in a JSON, CSV, or TXT file. 
  
- **`QUERIES`**: When processing a document with Amazon Textract, you may add queries to your analysis to specify what information you need. This involves passing a question, such as "What is the customer's social security number?" to Amazon Textract. 
  
- **`SIGNATURES`**: Amazon Textract can detect the locations of signatures in text documents. These are returned as geometry objects with bounding boxes that provide the location of a signature on the page, alongside the confidence that a signature is in that location.
  
- **`LAYOUT`**: Layout is a new feature that automatically extracts layout elements such as paragraphs, titles, and more from documents. Layout builds on Textract’s word and line detection by grouping the text into these layout elements and ordering the text and elements as a human would read them (i.e., left to right, top to bottom). Layout is helpful when processing the extracted text with Large Language Models (LLMs), because it preserves the layout and reading order of the text in the document. The following image shows an example of how document Layout detection and extraction works.
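
As an illustration of the `QUERIES` feature described in the list above, the sketch below builds the arguments for an `AnalyzeDocument` call with a question attached and reads the answer back out of a response. The helper names `build_query_request` and `parse_query_answers`, the alias `SSN`, and the hand-written `sample_query_response` are all assumptions made for this example. The field names (`QueriesConfig`, `QUERY` and `QUERY_RESULT` blocks, the `ANSWER` relationship) follow the Textract API.

```python
def build_query_request(bucket, key, questions):
    """Build keyword arguments for analyze_document from {alias: question} pairs."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {
            "Queries": [
                {"Text": text, "Alias": alias} for alias, text in questions.items()
            ]
        },
    }


def parse_query_answers(response):
    """Map each query alias (or its text if no alias was set) to its first answer."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    answers = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        name = block["Query"].get("Alias") or block["Query"]["Text"]
        # A QUERY block links to its QUERY_RESULT blocks through an ANSWER relationship.
        answer_ids = [
            block_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "ANSWER"
            for block_id in rel["Ids"]
        ]
        answers[name] = blocks_by_id[answer_ids[0]]["Text"] if answer_ids else None
    return answers


# Hand-written sample (hypothetical values), shaped like a QUERIES response.
sample_query_response = {
    "Blocks": [
        {"BlockType": "QUERY", "Id": "q1",
         "Query": {"Text": "What is the customer's social security number?", "Alias": "SSN"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
        {"BlockType": "QUERY_RESULT", "Id": "a1", "Text": "000-00-0000", "Confidence": 98.0},
    ]
}

print(parse_query_answers(sample_query_response))
```

With a real client, the request would be sent as `textract.analyze_document(**build_query_request(bucket, key, questions))`.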

We will look at how to extract a few features like `TABLES` and `LAYOUT`. Run the code cell below to extract `TABLES` data from our sample document.

In [None]:
# Code Cell: 3
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': data_bucket,
            'Name': file_key
        }
    },
    FeatureTypes=["TABLES"])

response

As you can see, the response from Amazon Textract is a large JSON object that contains a lot of information. Let's parse the table data out of this response using the Textract response parser tool that we installed earlier. To learn how the Textract table response works, refer to the [documentation](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-tables.html).

In [None]:
# Code Cell: 4
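# Parse the raw JSON response with the Textract response parser (trp)
tdoc = Document(response)
dfs = []

# Convert every table on every page into a list of rows, then into a DataFrame
for page in tdoc.pages:
    for table in page.tables:
        tab_list = convert_table_to_list(trp_table=table)
        print(tab_list)
        dfs.append(pd.DataFrame(tab_list))

# Keep the first table for inspection (raises IndexError if no table was detected)
df1 = dfs[0]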

In the code cell above, we extracted the tables as Python lists and then converted them to Pandas DataFrames. You can also extract tables in other formats, such as CSV or TSV. Refer to the [PrettyPrinter](https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter) documentation for more. Now let's look at the first DataFrame.

In [None]:
# Code Cell: 5
df1
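
For readers curious about what the response parser does with the raw JSON, here is a simplified sketch that rebuilds a table grid directly from the blocks: a `TABLE` block points to `CELL` blocks (each with a 1-based `RowIndex` and `ColumnIndex`), and each `CELL` points to its `WORD` blocks. The function name `tables_from_blocks` and the sample data are assumptions for illustration. This is not how `trp` is implemented, and it ignores merged cells and selection elements.

```python
import csv
import io


def tables_from_blocks(response):
    """Rebuild each TABLE block as a list of rows of cell text."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}

    def children(block, wanted_type):
        # Follow CHILD relationships and keep only blocks of the wanted type.
        return [
            blocks_by_id[block_id]
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for block_id in rel["Ids"]
            if blocks_by_id[block_id]["BlockType"] == wanted_type
        ]

    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = children(block, "CELL")
        if not cells:
            continue
        n_rows = max(cell["RowIndex"] for cell in cells)
        n_cols = max(cell["ColumnIndex"] for cell in cells)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for cell in cells:
            text = " ".join(word["Text"] for word in children(cell, "WORD"))
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = text
        tables.append(grid)
    return tables


# Hand-written 2x2 sample (hypothetical values), shaped like a TABLES response.
sample_table_response = {"Blocks": [
    {"BlockType": "TABLE", "Id": "t1",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"BlockType": "CELL", "Id": "c1", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"BlockType": "CELL", "Id": "c2", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"BlockType": "CELL", "Id": "c3", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"BlockType": "CELL", "Id": "c4", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"BlockType": "WORD", "Id": "w1", "Text": "Date"},
    {"BlockType": "WORD", "Id": "w2", "Text": "Amount"},
    {"BlockType": "WORD", "Id": "w3", "Text": "01/02"},
    {"BlockType": "WORD", "Id": "w4", "Text": "100.00"},
]}

tables = tables_from_blocks(sample_table_response)

# Write the first table as CSV text using only the standard library.
buffer = io.StringIO()
csv.writer(buffer).writerows(tables[0])
csv_text = buffer.getvalue()
print(csv_text)
```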

In the next cell, add the `LAYOUT` feature to the `FeatureTypes` list, e.g. `FeatureTypes=["TABLES", "LAYOUT"]`. Once done, execute the code cell.

In [None]:
# Code Cell: 6
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': data_bucket,
            'Name': file_key
        }
    },
    FeatureTypes=["TABLES"]) # Add "LAYOUT" here

response

As before, we get a large JSON response, so let's run a small script to extract the layout text from the document. Note that the JSON response also contains the `TABLES` output, so our previous table-parsing code still works. The script `textract_linearize_layout.py` gathers all the LAYOUT entities that Textract has extracted from the document and prints out the text.

In [None]:
from textract_linearize_layout import LinearizeLayout
layout = LinearizeLayout(response)
full_text = layout.get_text().strip()
print(full_text)

Notice that even though our document has multiple columns, Textract preserved the proper reading order, including titles, paragraphs, lists, and so on. Compare this to the text output you received in **Step 2: Extract Text with Amazon Textract** and you will see a noticeable difference in how the text is extracted with `LAYOUT`. Preserving this formatting and layout information benefits almost all of the generative AI use cases that follow.
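
The contents of `textract_linearize_layout.py` are not shown here, so the following is only a simplified stand-in that illustrates the idea, not that script's implementation. It walks the `LAYOUT_*` blocks in the order they appear in the response (assumed here to be reading order), joins the text of their child `LINE` blocks, and skips headers, footers, and page numbers. The function name `linearize_layout`, the Markdown-style heading prefixes, and the sample data are choices made for this sketch.

```python
def linearize_layout(response, skip_types=("LAYOUT_HEADER", "LAYOUT_FOOTER", "LAYOUT_PAGE_NUMBER")):
    """Join the text of LAYOUT_* blocks in response order, one paragraph per element."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    parts = []
    for block in response["Blocks"]:
        block_type = block["BlockType"]
        if not block_type.startswith("LAYOUT_") or block_type in skip_types:
            continue
        # Only direct LINE children carry text; container elements such as
        # LAYOUT_LIST contribute nothing themselves in this sketch.
        line_texts = [
            blocks_by_id[block_id]["Text"]
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for block_id in rel["Ids"]
            if blocks_by_id[block_id]["BlockType"] == "LINE"
        ]
        if not line_texts:
            continue
        text = " ".join(line_texts)
        if block_type == "LAYOUT_TITLE":
            text = "# " + text
        elif block_type == "LAYOUT_SECTION_HEADER":
            text = "## " + text
        parts.append(text)
    return "\n\n".join(parts)


# Hand-written sample (hypothetical values), shaped like a LAYOUT response.
sample_layout_response = {"Blocks": [
    {"BlockType": "LINE", "Id": "l1", "Text": "Monthly Statement"},
    {"BlockType": "LINE", "Id": "l2", "Text": "Your balance increased"},
    {"BlockType": "LINE", "Id": "l3", "Text": "during this period."},
    {"BlockType": "LINE", "Id": "l4", "Text": "1"},
    {"BlockType": "LAYOUT_TITLE", "Id": "t1",
     "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
    {"BlockType": "LAYOUT_TEXT", "Id": "x1",
     "Relationships": [{"Type": "CHILD", "Ids": ["l2", "l3"]}]},
    {"BlockType": "LAYOUT_PAGE_NUMBER", "Id": "n1",
     "Relationships": [{"Type": "CHILD", "Ids": ["l4"]}]},
]}

layout_text = linearize_layout(sample_layout_response)
print(layout_text)
```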

---
# Step 4: Process documents at scale <a id="step4"></a>

To process documents in bulk, you can stitch together a workflow that extracts all the desired features from your documents. In our case, we have pre-deployed a workflow using AWS Step Functions that uses AWS Lambda functions to process documents in bulk and extract their text with the `LAYOUT` feature.
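
The pre-deployed workflow itself is not shown in this notebook. As a rough idea of what one step in such a pipeline might do, the sketch below starts one asynchronous `StartDocumentAnalysis` job with the `LAYOUT` feature per document. It is not the Step Functions workflow described above: the helper name `start_layout_jobs` is hypothetical, and a real pipeline would also need to handle throttling, retries, and result collection through `GetDocumentAnalysis`.

```python
def start_layout_jobs(textract_client, bucket, keys):
    """Start one asynchronous LAYOUT analysis job per S3 key; return {key: JobId}."""
    job_ids = {}
    for key in keys:
        response = textract_client.start_document_analysis(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["LAYOUT"],
        )
        job_ids[key] = response["JobId"]
    return job_ids
```

The client is passed in as an argument, so the same function works with a real `boto3.client("textract")` or with a stub during testing.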

Let's download some sample documents that we will use to process.

!curl <url> --output sample-docs.zip

---
# Conclusion <a id="conclusion"></a>

In this notebook we extracted plain text (lines and words) from a document using Amazon Textract. We also extracted tables from the document and looked at a few additional ways Amazon Textract can help extract layout-related components. In the next notebook we will start using generative AI models with Amazon Bedrock on some of the text extracted by our document processing workflow.