# Module 1 - Processing Documents with Amazon Textract

Amazon Textract provides synchronous and asynchronous operations that return only the [text detected in a document](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-detecting.html). For both sets of operations, the following information is returned in multiple Block objects:

- The lines and words of detected text
- The relationships between the lines and words of detected text
- The page that the detected text appears on
- The location of the lines and words of text on the document page
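
To make the `Block` structure concrete, here is a minimal sketch of walking a response by hand. The `sample_response` dictionary is a hand-written, heavily trimmed stand-in for a real Textract response (real blocks also carry `Geometry`, `Confidence`, and other fields), and `lines_with_words` is a hypothetical helper name, not part of any AWS library.

```python
def lines_with_words(response):
    """Return (page, line_text, [word_texts]) for every LINE block."""
    # Index all blocks by Id so CHILD relationships can be resolved.
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    results = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                words.extend(blocks_by_id[i]["Text"] for i in rel["Ids"])
        # Default to page 1 if the Page field is missing.
        results.append((block.get("Page", 1), block["Text"], words))
    return results


# Hand-written sample data (an assumption for illustration only).
sample_response = {
    "Blocks": [
        {"Id": "p1", "BlockType": "PAGE", "Page": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"Id": "l1", "BlockType": "LINE", "Page": 1, "Text": "Hello world",
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "w1", "BlockType": "WORD", "Page": 1, "Text": "Hello"},
        {"Id": "w2", "BlockType": "WORD", "Page": 1, "Text": "world"},
    ]
}

for page, text, words in lines_with_words(sample_response):
    print(f"page {page}: {text!r} -> {words}")
```

The same traversal pattern (look up child `Ids` in a dictionary of blocks) applies to the other block types used later in this notebook.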

The synchronous API is [`DetectDocumentText`](https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html), whereas the asynchronous API is [`StartDocumentTextDetection`](https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentTextDetection.html). 
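
The difference between the two call styles can be sketched as follows. This is a simplified illustration, not the code used by the helper libraries in this notebook: the function names `detect_sync` and `detect_async` and the fixed polling interval are assumptions, and a production workflow would more likely rely on completion notifications than on polling. The client is passed in as a parameter, so any `boto3.client('textract')` instance can be used.

```python
import time


def detect_sync(textract_client, bucket, key):
    """Synchronous call: the blocks come back in the same response."""
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["Blocks"]


def detect_async(textract_client, bucket, key, poll_seconds=5):
    """Asynchronous call: start a job, poll until it ends, then page through results."""
    job = textract_client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        result = textract_client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)

    if result["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job {job_id} ended with {result['JobStatus']}")

    # Results may be split across pages linked by NextToken.
    blocks = list(result["Blocks"])
    while result.get("NextToken"):
        result = textract_client.get_document_text_detection(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result["Blocks"])
    return blocks


# Example usage (requires AWS credentials and an existing S3 object):
# blocks = detect_sync(boto3.client("textract"), "my-bucket", "path/to/document.pdf")
```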


# Step 1: Set up the notebook <a id="step1"></a>

In this step, we install and import the libraries that are used throughout this notebook.

<div class="alert alert-block alert-info"> 
    <b>NOTE:</b> You can ignore any errors or warnings printed during the <code>pip install</code> commands.
</div>

In [None]:
!pip install -q --disable-pip-version-check --root-user-action=ignore pip --upgrade 

In [None]:
!python -m pip install -q --disable-pip-version-check --root-user-action=ignore amazon-textract-response-parser tabulate --upgrade
!python -m pip install -q --disable-pip-version-check --root-user-action=ignore amazon-textract-prettyprinter --upgrade
!python -m pip install -q --disable-pip-version-check --root-user-action=ignore amazon-textract-caller --upgrade
!python -m pip install -q --disable-pip-version-check --root-user-action=ignore boto3 langchain anthropic chromadb lark transformers sentence-transformers --upgrade

In [None]:
# Code Cell: 1
import boto3
import botocore
import sagemaker
import pandas as pd
from IPython.display import Image, display, JSON
import os

# variables
data_bucket = sagemaker.Session().default_bucket()
region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
role = sagemaker.get_execution_role()

print(f"SageMaker role is: {role}\nDefault SageMaker Bucket: s3://{data_bucket}")

s3=boto3.client('s3')
textract = boto3.client('textract', region_name=region)

---
# Step 2: Extract Text with Amazon Textract <a id="step2"></a>

Amazon Textract is an ML-powered OCR service that detects and extracts text from documents. The `DetectDocumentText` API returns this text in the form of WORDS and LINES. Let's extract the words and lines from the sample document.

In [None]:
from IPython.display import IFrame
IFrame("./scientific_paper.pdf", width=600, height=800)

In [None]:
!aws s3 cp ./scientific_paper.pdf s3://{data_bucket}/idp/textract/ --only-show-errors

In [None]:
input_document = f"s3://{data_bucket}/idp/textract/scientific_paper.pdf"

In [None]:
# Code Cell: 2
import textractcaller as tc
from textractprettyprinter.t_pretty_print import get_lines_string


# Call Amazon Textract
textract_response_json = tc.call_textract(
                            input_document=input_document,
                            features=[],
                            call_mode=tc.Textract_Call_Mode.FORCE_SYNC,
                            boto3_textract_client=textract,
                        )

# Print detected text
plain_full_text = get_lines_string(textract_json=textract_response_json)
print(plain_full_text)

As you can see, we extracted the LINES and WORDS from the document, but we lost some of its structural formatting. For example, the document contains a few tables, and we would like to extract their contents in a tabular structure. Let's do that next.

---
# Step 3: Analyzing complex documents using Amazon Textract <a id="step3"></a>

Amazon Textract [analyzes documents and forms](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html) for relationships among detected text. Analysis operations return six categories of extracted data, listed below; each category is known as a _feature_ of Amazon Textract document analysis. Document analysis is available via the synchronous [`AnalyzeDocument`](https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeDocument.html) API and the asynchronous [`StartDocumentAnalysis`](https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentAnalysis.html) API.

- **`Text`**: Textract returns the raw text, i.e. LINEs and WORDs, extracted from a document.
  
- **`FORMS`**: Form data is returned as KEY/VALUE pairs for form documents.
  
- **`TABLES`**: Amazon Textract can extract tables, table cells, the items within table cells, table titles and footers, and the type of table. Amazon Textract can also be programmed to return the results in a JSON, CSV, or TXT file. 
  
- **`QUERIES`**: When processing a document with Amazon Textract, you may add queries to your analysis to specify what information you need. This involves passing a question, such as "What is the customer's social security number?" to Amazon Textract. 
  
- **`SIGNATURES`**: Amazon Textract can detect the locations of signatures in text documents. These are returned as geometry objects with bounding boxes that provide the location of a signature on the page, alongside the confidence that a signature is in that location.
  
- **`LAYOUT`**: Layout is a new feature that enables you to automatically extract layout elements such as paragraphs, titles, and more from documents. Layout builds on Textract's word and line detection by automatically grouping the text into these layout elements and ordering the text and elements as a human would read (i.e., left to right, top to bottom). Layout is helpful when processing the extracted text with Large Language Models (LLMs), since it lets you preserve the layout and reading order of the text in the document. The following image shows an example of how document Layout detection and extraction works.
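
As a rough illustration of how these features map onto an `AnalyzeDocument` request, the sketch below builds the keyword arguments for `boto3`'s `analyze_document` call. The helper name `build_analyze_request` and the bucket name are assumptions made for this example. Note that raw text is always returned, so it has no entry in `FeatureTypes`.

```python
# Feature names accepted in the FeatureTypes list of AnalyzeDocument.
VALID_FEATURES = {"TABLES", "FORMS", "QUERIES", "SIGNATURES", "LAYOUT"}


def build_analyze_request(bucket, key, features, queries=None):
    """Build keyword arguments for textract.analyze_document (hypothetical helper)."""
    unknown = set(features) - VALID_FEATURES
    if unknown:
        raise ValueError(f"Unknown feature types: {sorted(unknown)}")
    # Queries are only meaningful together with the QUERIES feature.
    if ("QUERIES" in features) != bool(queries):
        raise ValueError("The QUERIES feature and a list of queries must be supplied together")

    request = {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": list(features),
    }
    if queries:
        request["QueriesConfig"] = {"Queries": [{"Text": q} for q in queries]}
    return request


request = build_analyze_request(
    "my-bucket",  # placeholder bucket name
    "idp/textract/scientific_paper.pdf",
    ["TABLES", "QUERIES"],
    queries=["What is the title of the paper?"],
)
print(request)

# With a real client, the request could then be sent as:
# response = textract.analyze_document(**request)
```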

We will look at how to extract a few features like `TABLES` and `LAYOUT`. Run the code cell below to extract `TABLES` data from our sample document.

In [None]:
# Code Cell: 3
from textractcaller.t_call import call_textract, Textract_Features

# Call Amazon Textract
textract_response_json = call_textract(
                            input_document=input_document,
                            features=[Textract_Features.TABLES],
                            call_mode=tc.Textract_Call_Mode.FORCE_SYNC,
                            boto3_textract_client=textract,
                        )

textract_response_json

As you can see, the response from Amazon Textract is a large JSON object that contains a lot of information. Let's parse the table data out of this response using the Textract response parser that we installed earlier. To learn how the Textract table response works, refer to the [documentation](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-tables.html).
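
To give a feel for what the parser does, here is a simplified sketch of rebuilding a table grid directly from the blocks: a `TABLE` block points to its `CELL` children, and each `CELL` carries a `RowIndex`, a `ColumnIndex`, and `WORD` children. The helper `tables_to_rows` and the sample data are assumptions for illustration; the sketch ignores merged cells, selection elements, and table titles, which the response parser handles for you.

```python
def tables_to_rows(response):
    """Rebuild each TABLE block as a list of rows of cell text."""
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def child_ids(block):
        return [i for rel in block.get("Relationships", [])
                if rel["Type"] == "CHILD" for i in rel["Ids"]]

    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = [blocks[i] for i in child_ids(block)
                 if blocks[i]["BlockType"] == "CELL"]
        if not cells:
            continue
        n_rows = max(c["RowIndex"] for c in cells)
        n_cols = max(c["ColumnIndex"] for c in cells)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for cell in cells:
            words = [blocks[i]["Text"] for i in child_ids(cell)
                     if blocks[i]["BlockType"] == "WORD"]
            # RowIndex and ColumnIndex are 1-based.
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
        tables.append(grid)
    return tables


# Hand-written sample data (an assumption for illustration only).
sample_table_response = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c11", "c12", "c21", "c22"]}]},
        {"Id": "c11", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "c12", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
        {"Id": "c21", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
        {"Id": "c22", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Score"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Model"},
        {"Id": "w4", "BlockType": "WORD", "Text": "A"},
    ]
}

print(tables_to_rows(sample_table_response))
```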

In [None]:
# Code Cell: 4

from textractprettyprinter.t_pretty_print import convert_table_to_list
from trp import Document

tdoc = Document(textract_response_json)
dfs = list()

for page in tdoc.pages:
    for table in page.tables:
        tab_list = convert_table_to_list(trp_table=table)
        dfs.append(pd.DataFrame(tab_list))

df1 = dfs[0]

In the code cell above, we extracted each table as a Python list and converted it to a pandas DataFrame. You can also extract tables in other formats such as CSV or TSV; refer to the [PrettyPrinter](https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter) documentation for more. Now let's look at the first DataFrame.
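
If you only need CSV text, one option that avoids extra dependencies is Python's built-in `csv` module, applied to the same kind of list of rows produced above. This is a minimal sketch using only the standard library, not the PrettyPrinter API; `rows_to_csv` and `sample_rows` are hypothetical names and data.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (lists of strings) to CSV text."""
    buffer = io.StringIO()
    # lineterminator="\n" avoids the default "\r\n" row endings.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hand-written sample rows; fields containing commas are quoted automatically.
sample_rows = [
    ["Metric", "Value"],
    ["Accuracy", "0.95"],
    ["Note", "small, noisy set"],
]
print(rows_to_csv(sample_rows))
```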

In [None]:
# Code Cell: 5
df1

In the next cell, we will add the `LAYOUT` feature to the `FeatureTypes` list. Once done, execute the code cell.

<div class="alert alert-block alert-info"> 
    <b>INSTRUCTION:</b> Add the <b>LAYOUT</b> feature to the FeatureTypes list e.g. <b>features=[Textract_Features.TABLES, Textract_Features.LAYOUT]</b> in the code cell below.
</div>

In [None]:
# Code Cell: 6
from textractcaller.t_call import call_textract, Textract_Features

# Call Amazon Textract
textract_response_json = call_textract(
                            input_document=input_document,
                            features=[Textract_Features.TABLES, Textract_Features.LAYOUT], # Add Textract_Features.LAYOUT here
                            call_mode=tc.Textract_Call_Mode.FORCE_SYNC,
                            boto3_textract_client=textract,
                        )

textract_response_json

As before, we get a large JSON response, so let's run a small script to extract out the Layout text from the document. Note, the JSON response will also contain the TABLE output in it so our previous table parsing code will also work. The function `get_text_from_layout_json` included in `textractprettyprinter` gathers all the LAYOUT entities Textract has extracted from the document and prints out the text in a linearized format.
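
Conceptually, linearization walks the `LAYOUT_*` blocks and stitches together the text of their child `LINE` blocks. The sketch below is a deliberately simplified illustration of that idea and is not how `get_text_from_layout_json` is implemented: it assumes the layout blocks appear in reading order in the response, and the markdown prefixes, the helper name `linearize_layout`, and the sample data are all assumptions.

```python
# Markdown prefixes for a few layout block types (an illustrative choice).
PREFIXES = {"LAYOUT_TITLE": "# ", "LAYOUT_SECTION_HEADER": "## "}


def linearize_layout(response):
    """Join the LINE text under each LAYOUT_* block, in response order."""
    blocks = {b["Id"]: b for b in response["Blocks"]}
    parts = []
    for block in response["Blocks"]:
        if not block["BlockType"].startswith("LAYOUT_"):
            continue
        line_texts = [
            blocks[i]["Text"]
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for i in rel["Ids"]
            if blocks[i]["BlockType"] == "LINE"
        ]
        # Layout blocks without direct LINE children are skipped here.
        if line_texts:
            parts.append(PREFIXES.get(block["BlockType"], "") + " ".join(line_texts))
    return "\n\n".join(parts)


# Hand-written sample data (an assumption for illustration only).
sample_layout_response = {
    "Blocks": [
        {"Id": "lt", "BlockType": "LAYOUT_TITLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"Id": "lh", "BlockType": "LAYOUT_SECTION_HEADER",
         "Relationships": [{"Type": "CHILD", "Ids": ["l2"]}]},
        {"Id": "lx", "BlockType": "LAYOUT_TEXT",
         "Relationships": [{"Type": "CHILD", "Ids": ["l3", "l4"]}]},
        {"Id": "l1", "BlockType": "LINE", "Text": "A Sample Paper"},
        {"Id": "l2", "BlockType": "LINE", "Text": "Introduction"},
        {"Id": "l3", "BlockType": "LINE", "Text": "This paragraph spans"},
        {"Id": "l4", "BlockType": "LINE", "Text": "two detected lines."},
    ]
}

print(linearize_layout(sample_layout_response))
```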

In [None]:
# Code Cell: 7
from textractprettyprinter.t_pretty_print import get_text_from_layout_json
from IPython.display import display_markdown


layout = get_text_from_layout_json(textract_json=textract_response_json, generate_markdown=True)
linearized_full_text = layout[1].strip()

colored_text = f"<div style='color: yellow;'>{linearized_full_text}</div>"
display_markdown(colored_text, raw=True)

Notice that even though our document has multiple columns, Textract preserved the proper reading order of the document, with titles, paragraphs, lists, and so on. Compare this to the text output you received in **Step 2: Extract Text with Amazon Textract** and you will see a noticeable difference in how the text is extracted with `LAYOUT`. Preserving this formatting and layout information is beneficial for almost all generative AI use cases.

---
# Step 4: Compare plain text to linearized text

Now let's compare the plain text that we extracted in Step 2 with the linearized text that we extracted using the `LAYOUT` feature in Step 3, side by side.

In [None]:
from IPython.display import display_markdown, HTML, display

plain_full_text_html = plain_full_text.replace('\n', '<br>')
linearized_full_text_html = linearized_full_text.replace('\n', '<br>')

html= f'''<table border="1" style="background-color: white; color: black; width: 100%;">
  <thead>
    <tr>
      <th style="text-align: left;font-size: 20px;">Plain Text</th>
      <th style="text-align: left; font-size: 20px;">Linearized Plain Text</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left; color: red">{plain_full_text_html}</td>
      <td style="text-align: left; color: green">{linearized_full_text_html}</td>
    </tr>
  </tbody>
</table>
'''

display(HTML(html))

On the right, you can clearly see the improvement in formatting, reading order, and details such as tables, which are extracted in a much more meaningful way than on the left.

---
# Step 5: Process documents at scale <a id="step4"></a>

To process documents in bulk, you can stitch together a workflow that extracts all the desired features from your documents. For this workshop, we have pre-deployed an AWS Step Functions workflow in your workshop account that uses AWS Lambda functions to process documents in bulk and extract their text with the `LAYOUT` feature.

We have also prepared a number of sample documents to run through this workflow.

Let's download some sample documents that we will use.

In [None]:
!curl https://ws-assets-prod-iad-r-iad-ed304a55c2ca1aee.s3.us-east-1.amazonaws.com/f1fecd1d-3ce1-41b9-9180-1c673aad9105/document_samples.zip --output sample-docs.zip

Once the zip file is downloaded, we will unzip the file and upload all the documents into an Amazon S3 bucket location.

In [None]:
!unzip sample-docs.zip -d sample-docs

Feel free to open the directory named `sample-docs` from the directory browser on the left panel and have a look at some of the sample documents. We will be looking at a few of these documents as we progress through the workshop as well. 

We will use the AWS CLI (Command Line Interface) to upload the files to the S3 location that is the default `data_bucket` for SageMaker Studio. Note: this can technically be any S3 location; however, for the purposes of the workshop, we have created the workflow to read the files from the `data_bucket` location.

As soon as the files are copied to S3, an AWS Lambda function is invoked, which in turn starts our document processing workflow created with AWS Step Functions.
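
The workshop's actual Lambda function is not shown in this notebook. As a rough sketch of the pattern, a function triggered by an S3 upload typically reads the bucket and key from the event and starts one Step Functions execution per object. The function name `start_workflows`, the shape of the execution input, and the state machine ARN are all assumptions for illustration; the client is passed in so the logic can be exercised without AWS access.

```python
import json
from urllib.parse import unquote_plus


def start_workflows(event, sfn_client, state_machine_arn):
    """Start one Step Functions execution per uploaded S3 object (hypothetical)."""
    execution_arns = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            # The input payload shape is an assumption, not the workshop's format.
            input=json.dumps({"bucket": bucket, "key": key}),
        )
        execution_arns.append(response["executionArn"])
    return execution_arns


# Inside a real Lambda handler this might be wired up as:
# def lambda_handler(event, context):
#     return start_workflows(event, boto3.client("stepfunctions"), STATE_MACHINE_ARN)
```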

In [None]:
!aws s3 cp sample-docs s3://{data_bucket}/uploads --recursive

Let's take a brief look at the workflow. Follow <a href="https://console.aws.amazon.com/states" target="_blank">this link</a> to access the AWS Step Functions console, where we can see the workflow in action. The workflow should have started by now. Once in the AWS Step Functions console, in the "Execution" tab, click on the first execution to see the details.
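
If you prefer to check progress from the notebook instead of the console, a sketch along these lines could list the executions that are still running. The helper name `running_executions` is hypothetical, and you would need to supply the ARN of the pre-deployed state machine, which is not given in this notebook.

```python
def running_executions(sfn_client, state_machine_arn):
    """Return the names of executions that are still in the RUNNING state."""
    names = []
    kwargs = {"stateMachineArn": state_machine_arn, "statusFilter": "RUNNING"}
    while True:
        page = sfn_client.list_executions(**kwargs)
        names.extend(execution["name"] for execution in page["executions"])
        token = page.get("nextToken")
        if not token:
            return names
        # Request the next page of results.
        kwargs["nextToken"] = token


# Example usage (the ARN below is a placeholder, not a real resource):
# sfn = boto3.client("stepfunctions", region_name=region)
# print(running_executions(sfn, "arn:aws:states:REGION:ACCOUNT_ID:stateMachine:NAME"))
```

An empty list suggests that no executions are currently running for that state machine.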

Run the following code cell to see all the files created with the extracted data from the documents you uploaded in the previous step.

In [None]:
prefix = 'textract-linearized-output/'

# List objects within the specified prefix
paginator = s3.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket=data_bucket, Prefix=prefix)

# Iterate through the pages and print out the keys
for page in page_iterator:
    if 'Contents' in page:  # Check if the page has contents
        for obj in page['Contents']:
            print(obj['Key'])

Let's display the content of one of the files. We've written a small utility script that takes the S3 path of the `.txt` file and returns its contents as a list. Go ahead and copy an S3 path from any of the documents above. Note: the path must not contain the `.txt` file name, for example `doc_path = "textract-linearized-output/uploads/health_plan"`.

In [None]:
doc_path = "" #paste a path here within quotes, must not contain the .txt file name

In [None]:
from read_doc_from_s3 import read_document

if doc_path:
    print(f"Reading files in {doc_path}")
    full_path = f"s3://{data_bucket}/{doc_path}" 
    document = read_document(doc_path=full_path)
    # print(document)
    for i, doc in enumerate(document):
        print(f"==========Page {i+1}==========")
        print(doc)
        print("\n")
else:
    print("Please enter a value for doc_path")

<div class="alert alert-block alert-warning"> 
    <b>NOTE:</b> Proceed to the next module only after the workflow execution is complete.
</div>

---
# Conclusion <a id="conclusion"></a>

In this notebook, we extracted plain text (words and lines) from a document using Amazon Textract. We also extracted tables from the document and looked at a few additional ways Amazon Textract can help extract layout-related components from it. In the next notebook, we will start using generative AI models with Amazon Bedrock on some of the text extracted by our document processing workflow.