# Unlocking the Power of Intelligent Document Processing with Amazon Textract and Amazon Bedrock
---

<div class="alert alert-block alert-info"> 
    <b>NOTE:</b> If you are using SageMaker Studio <b>Classic</b>, you will need a Jupyter kernel with Python 3.9 or above to run this notebook. If you are in Amazon SageMaker Studio, you can use the `Data Science 3.0` image.
</div>

<div class="alert alert-block alert-warning"> 
    <b>NOTE:</b> You will need third-party model access to the Anthropic Claude 3 Sonnet and Haiku models to run this notebook. Verify that you have access by going to the <a href="https://console.aws.amazon.com/bedrock" target="_blank">Amazon Bedrock console</a> > left menu "Model access". The "Access status" for Anthropic Claude must show "Access granted" in green. If you do not have access, click the "Edit" button on the top right > select the model checkbox > click the "Save changes" button at the bottom. You should have access to the model within a few moments.
</div>



---------------
In this notebook, we will explore the powerful capabilities of Amazon Textract and Amazon Bedrock for intelligent document processing (IDP). IDP involves automatically extracting valuable information from documents, enabling organizations to streamline document-centric workflows, reduce operational costs, and gain insights from their data.

Amazon Textract uses advanced machine learning and computer vision technologies to accurately extract text, data, and metadata from various document formats, including PDFs, images, and scanned documents. By automating this process, Textract eliminates the need for manual data entry, increasing efficiency and reducing the risk of errors.

Amazon Bedrock, on the other hand, provides access to state-of-the-art foundation models (FMs) that can understand and process both text and visual data. These multi-modal models can accurately identify and extract relevant information from structured, semi-structured, and unstructured documents, enabling tasks such as form extraction, table extraction, and intelligent question answering.

Together, Textract and Bedrock form a powerful combination for building intelligent document processing pipelines. Textract handles the initial document ingestion and text extraction, while Bedrock's FMs provide advanced understanding and extraction capabilities.

Throughout this notebook, we will explore the key features of both services and learn how to integrate them into your workflows using Python libraries like Textractor and Rhubarb. By the end of this notebook, you will have a solid understanding of how to leverage AWS for efficient and accurate document processing, enabling your organization to unlock valuable insights from its data assets.


## Basic building blocks
---
This lab is an introduction to the libraries and interfaces used in subsequent intelligent document processing with generative AI labs. It introduces the key libraries that will be used, along with providing code samples that you can modify in your own workflows.

### Setup Prerequisites
---

First we need to:
1. Install Textractor. This open source Python library makes it easy to parse and handle the JSON output from Amazon Textract.
2. Install Rhubarb. This open source Python library makes it easy to use Amazon Bedrock multi-modal capabilities for IDP.
3. Install the SageMaker Python SDK. It provides a SageMaker session context, which in turn gives you access to the default S3 storage bucket from SageMaker notebook environments.


In [1]:
%pip install "amazon-textract-textractor[pdf]"
%pip install pyrhubarb
%pip install -U boto3 sagemaker

%pip install -q amazon-textract-response-parser --upgrade
%pip install -q amazon-textract-caller --upgrade
%pip install -q amazon-textract-prettyprinter==0.0.16
%pip install -q amazon-textract-textractor --upgrade


In [2]:
#Import script libraries and create global variables
import json
import sagemaker
import pandas as pd
from IPython.display import Image, display, JSON
from textractcaller.t_call import call_textract, Textract_Features, call_textract_expense
from textractprettyprinter.t_pretty_print import convert_table_to_list
from trp import Document

role = sagemaker.get_execution_role()
data_bucket = sagemaker.Session().default_bucket()
region = sagemaker.Session().boto_region_name
print(f"SageMaker bucket is {data_bucket}, and SageMaker Execution Role is {role}. Current region is {region}")


### Use Boto3 to call the Bedrock API
---
Sending a prompt directly to Bedrock with Boto3 is straightforward and gives you fine-grained control over your prompt and inference parameters.

In [None]:
import boto3
from botocore.exceptions import ClientError
bedrock = boto3.client(service_name="bedrock-runtime", region_name=region)

# Create a global function to call Bedrock.
def get_response_from_claude(prompt, temp=1, model='sonnet'):
    """
    Invokes an Anthropic Claude 3 model on Amazon Bedrock to run a text
    inference using the input provided in the request body.

    :param prompt:  The prompt that you want Claude 3 to use.
    :param temp:    The temperature to use when invoking Claude. Default is 1.
    :param model:   The Claude model to use: 'haiku' or 'sonnet'. Default is 'sonnet'.
    :return:        Tuple of (text response, input token count, output token count)
    """

    # Map the short model name to its Bedrock model ID
    model_dict = {
        "haiku": "anthropic.claude-3-haiku-20240307-v1:0",
        "sonnet": "anthropic.claude-3-sonnet-20240229-v1:0",
    }
    model_id = model_dict[model]

    # Build a Messages API request with a single text-only user turn
    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "temperature": temp,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": prompt,
                    },
                ],
            }
        ],
    }

    try:
        response = bedrock.invoke_model(
            modelId=model_id,
            body=json.dumps(request_body),
        )

        # Decode the response body and read the token usage
        result = json.loads(response.get("body").read())
        input_tokens = result["usage"]["input_tokens"]
        output_tokens = result["usage"]["output_tokens"]

        # The current Bedrock Claude Messages API only supports text content in responses
        text_response = result["content"][0]["text"]

        # Return a tuple with 3 values
        return text_response, input_tokens, output_tokens
    except ClientError as err:
        print(
            f"Couldn't invoke model. Here's why: {err.response['Error']['Code']}: {err.response['Error']['Message']}"
        )
        raise



Now we'll verify that the models can be called. Try changing the prompt below, and switch the model between `sonnet` and `haiku`. If you see an error code, verify that you have [requested access](https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html#getting-started-model-access) for the chosen model.

In [None]:
print(get_response_from_claude("Tell me a short story", temp=1, model='sonnet'))
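
To make the request and response shapes used by `get_response_from_claude` easier to inspect, the sketch below separates them into two small helpers that run without calling AWS. The helper names `build_claude_request` and `parse_claude_response` are hypothetical, and `sample_result` is a hand-written stand-in for a decoded response body, not real model output.

```python
import json


def build_claude_request(prompt, temperature=1.0, max_tokens=4096):
    """Return the JSON request body for a single-turn, text-only message."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "temperature": temperature,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
    }
    return json.dumps(body)


def parse_claude_response(result):
    """Extract (text, input_tokens, output_tokens) from a decoded response body."""
    # Join every text block in case the reply is split across several blocks.
    text = "".join(
        part["text"] for part in result["content"] if part.get("type") == "text"
    )
    usage = result["usage"]
    return text, usage["input_tokens"], usage["output_tokens"]


# Hand-written sample shaped like a decoded response body (assumption for illustration).
sample_result = {
    "content": [{"type": "text", "text": "Hello!"}],
    "usage": {"input_tokens": 12, "output_tokens": 3},
}

text, tokens_in, tokens_out = parse_claude_response(sample_result)
print(text, tokens_in, tokens_out)
```

Keeping request construction and response parsing separate from the network call makes both easy to unit test before spending tokens on a real invocation.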

### 1. Basic usage with Amazon Textract and Amazon Textractor
---
This code block demonstrates how to use the Amazon Textract service directly through the AWS SDK for Python (Boto3) to extract text from a document stored in an Amazon S3 bucket. It first uploads the document to S3, then calls the `detect_document_text` method of the Textract client to perform text detection on the document. The response from Textract is then parsed using the `textractor.parsers.response_parser` module to create a more user-friendly representation of the detected text.

In [None]:
import boto3
from textractor.parsers import response_parser

textract = boto3.client('textract')
s3 = boto3.client("s3")

# first we upload the file to S3
s3.upload_file(Filename='../samples/discharge-summary.png', Bucket=data_bucket, Key='samples/discharge-summary.png')

# Next we use the Textract DetectDocumentText action to get all the text in the document.
textract_response = textract.detect_document_text(
    Document={
        'S3Object': {'Bucket': data_bucket, 'Name': 'samples/discharge-summary.png'}
    },
)

# Textractor provides a parser to give us a summary of the contents and a string with the detected text
document = response_parser.parse(textract_response)

# the document object contains a summary of what textract returned 
print(document)
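
For a sense of what the parser is working with, the raw `DetectDocumentText` response is a dictionary whose `Blocks` list contains `PAGE`, `LINE`, and `WORD` entries. The following is a minimal sketch that pulls out the line text by hand; `lines_from_textract_response` is a hypothetical helper, and `sample_response` is a heavily trimmed, hand-written stand-in for a real response.

```python
def lines_from_textract_response(response):
    """Return the text of every LINE block, in the order Textract returned them."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Trimmed, hand-written sample (real responses also carry geometry, confidence, etc.).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Discharge Summary"},
        {"BlockType": "WORD", "Id": "w1", "Text": "Discharge"},
        {"BlockType": "WORD", "Id": "w2", "Text": "Summary"},
        {"BlockType": "LINE", "Id": "l2", "Text": "Patient Name: Jane Doe"},
    ]
}

print("\n".join(lines_from_textract_response(sample_response)))
```

In practice the Textractor parser is the more convenient route, since it also keeps the layout and relationship information that this sketch ignores.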

### Extracting tabular data using Amazon Textract

In this step we will take a brief look at how to extract table information from documents. We start with the discharge summary uploaded above, then move on to a sample bank statement, which has two tables.

In [None]:
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': data_bucket,
            'Name': 'samples/discharge-summary.png'
        }
    },
    FeatureTypes=["TABLES"])

response

As you can see, the response from Amazon Textract is a large JSON object that contains a lot of information. Let's parse the table data out of this response using the Textract response parser tool that we installed earlier. To learn how the Textract table response is structured, refer to the documentation.


In [None]:
#print(response)
doc = Document(response)
for page in doc.pages:
    # Print tables
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}".format(r, c, cell.text))
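
To illustrate what the response parser resolves for us, here is a rough sketch of the same traversal done by hand. In the `AnalyzeDocument` response, a `TABLE` block lists its `CELL` blocks through a `CHILD` relationship, each `CELL` carries a 1-based `RowIndex` and `ColumnIndex`, and its own `CHILD` relationship points at `WORD` blocks. The helper name `tables_from_textract_response` and the sample data are made up for illustration, and the sketch ignores merged cells and selection elements.

```python
def tables_from_textract_response(response):
    """Rebuild each TABLE block as a list of rows, each row a list of cell strings."""
    blocks = response.get("Blocks", [])
    blocks_by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Collect the IDs referenced by every CHILD relationship of a block.
        ids = []
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "CHILD":
                ids.extend(relationship["Ids"])
        return ids

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                blocks_by_id[word_id]["Text"]
                for word_id in child_ids(cell)
                if blocks_by_id[word_id]["BlockType"] == "WORD"
            ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((row for row, _ in cells), default=0)
        n_cols = max((col for _, col in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables


def _cell(cell_id, row, col, word_ids):
    # Small factory for the hand-written sample below.
    return {
        "BlockType": "CELL", "Id": cell_id, "RowIndex": row, "ColumnIndex": col,
        "Relationships": [{"Type": "CHILD", "Ids": word_ids}],
    }


# Hand-written 2x2 table (assumption for illustration, not real Textract output).
sample_response = {
    "Blocks": [
        {"BlockType": "TABLE", "Id": "t1",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
        _cell("c1", 1, 1, ["w1"]),
        _cell("c2", 1, 2, ["w2"]),
        _cell("c3", 2, 1, ["w3"]),
        _cell("c4", 2, 2, ["w4", "w5"]),
        {"BlockType": "WORD", "Id": "w1", "Text": "Date"},
        {"BlockType": "WORD", "Id": "w2", "Text": "Amount"},
        {"BlockType": "WORD", "Id": "w3", "Text": "01/02"},
        {"BlockType": "WORD", "Id": "w4", "Text": "10.00"},
        {"BlockType": "WORD", "Id": "w5", "Text": "USD"},
    ]
}

for row in tables_from_textract_response(sample_response)[0]:
    print(row)
```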

In the code cells above, we used the Textract AnalyzeDocument API to extract information from the document, then used the Textract response parser `Document` class to parse the tables out of the JSON response. Additional tooling makes this easier: the `call_textract` method from the Textract Caller tool that we installed earlier wraps the API call, and the Textract pretty printer tool presents the tables in a more human-readable way. This set of tools makes it easy to call the Textract API and parse its JSON output, and we will use it in subsequent sections for both purposes.

In [None]:
file = '../samples/account_statement.png'
resp = call_textract(input_document=file, features=[Textract_Features.TABLES])
tdoc = Document(resp)
dfs = list()

for page in tdoc.pages:
    for table in page.tables:
        tab_list = convert_table_to_list(trp_table=table)
        print(tab_list)
        dfs.append(pd.DataFrame(tab_list))

df1 = dfs[0]
df2 = dfs[1]

In the code cell above, we extracted each table as a Python list and then converted it to a pandas DataFrame. You can also extract tables in other formats such as CSV and TSV. Refer to the PrettyPrinter documentation for more. Now let's look at the DataFrames.

In [None]:
df1

In [None]:
df2
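
If you need CSV rather than a DataFrame, a nested list of rows can also be written out with the standard library alone. This is a minimal sketch that assumes the table is a list of rows of cell values, which is how the lists above are used; `table_to_csv` is a hypothetical helper, and `sample_table` is made-up data.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize a list of rows (each a list of cell values) to a CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Strip stray whitespace that OCR output often leaves around cell text.
        writer.writerow([str(cell).strip() for cell in row])
    return buffer.getvalue()


# Made-up rows standing in for one extracted table.
sample_table = [
    ["Date ", "Description", "Amount"],
    ["01/02", "Coffee, large", "4.50"],
]

print(table_to_csv(sample_table))
```

The `csv` module quotes cells that contain commas, so values such as `Coffee, large` survive a round trip.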

### Using the Textractor library
---
This library provides a higher-level interface for working with Amazon Textract. Instead of calling Textract directly, you can use Textractor's [caller methods](https://aws-samples.github.io/amazon-textract-textractor/textractor.html), which abstract away some of the complexities of interacting with the Textract service. In this example, Textractor is used to extract text from a document stored in an S3 bucket.

For the first example below, we use Amazon Textract's [synchronous API](https://docs.aws.amazon.com/textract/latest/dg/sync.html), which allows us to quickly analyze a **single page** of a document.

In [None]:
from textractor import Textractor
extractor = Textractor(region_name=region)

document = extractor.detect_document_text('s3://' + data_bucket + '/samples/discharge-summary.png')
print(document)

---
You can also use Textractor to extract text from a local file on your machine, rather than a file stored in S3. This can be more convenient for smaller projects or local testing.

In [None]:
from textractor import Textractor
extractor = Textractor(region_name=region)

document = extractor.detect_document_text("../samples/discharge-summary.png")
print(document.get_text())

---
Textractor also works with Textract's asynchronous methods, which are useful for processing multi-page documents. The example below first uploads a PDF file to S3, then calls Textractor's `start_document_text_detection` method to initiate an asynchronous text detection job. The results of this job are then printed out page by page.


The asynchronous API helps us process large multi-page documents without blocking the application. The document is sent to the endpoint, which automatically splits it and parallelizes OCR detection across all of the pages, so the application does not have to.

In [None]:
from textractor import Textractor
extractor = Textractor(region_name=region)

# first we upload the file to S3
s3.upload_file(Filename='../samples/employee_enrollment.pdf', Bucket=data_bucket, Key='samples/employee_enrollment.pdf')

document = extractor.start_document_text_detection(file_source="s3://" + data_bucket + "/samples/employee_enrollment.pdf",
    save_image=False)

for page in document.pages:
    print(page)
    print("\n----------------\n")
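
For reference, the asynchronous flow can also be driven with plain Boto3: start a job, poll `get_document_text_detection` until `JobStatus` leaves `IN_PROGRESS`, then follow `NextToken` to collect every page of results. The sketch below shows that polling loop only. The helper name `wait_for_text_detection` and its `poll_seconds` and `max_polls` parameters are assumptions for illustration, and this is not a description of how Textractor implements it internally.

```python
import time


def wait_for_text_detection(client, job_id, poll_seconds=5.0, max_polls=120):
    """Poll a Textract text detection job and return all Blocks once it succeeds.

    `client` is expected to behave like boto3.client("textract").
    """
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        status = result["JobStatus"]
        if status == "IN_PROGRESS":
            time.sleep(poll_seconds)
            continue
        if status != "SUCCEEDED":
            # This sketch treats FAILED and PARTIAL_SUCCESS alike as errors.
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")

        # Results are paginated: keep following NextToken until it is absent.
        blocks = list(result.get("Blocks", []))
        next_token = result.get("NextToken")
        while next_token:
            result = client.get_document_text_detection(
                JobId=job_id, NextToken=next_token
            )
            blocks.extend(result.get("Blocks", []))
            next_token = result.get("NextToken")
        return blocks
    raise TimeoutError(f"Textract job {job_id} did not finish after {max_polls} polls")


# Usage against the real service (requires AWS credentials and an uploaded file):
# textract = boto3.client("textract")
# job = textract.start_document_text_detection(
#     DocumentLocation={"S3Object": {"Bucket": data_bucket,
#                                    "Name": "samples/employee_enrollment.pdf"}}
# )
# blocks = wait_for_text_detection(textract, job["JobId"])
```

Polling is the simplest option for a notebook; for production workloads a notification-driven design usually avoids the repeated status calls.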

---
You can retrieve the full text content of the document by calling `get_text()` on the Textractor `document` object. The resulting string can then be used as input for any further processing. For a fully event-driven workflow, you could also leverage Amazon Textract's [integration with SNS](https://docs.aws.amazon.com/textract/latest/dg/api-async-roles.html#api-async-roles-all-topics) to start additional processing workflows.

In [None]:
print("\nDetected Text \n=========================\n")
doc_text = document.get_text()
print(doc_text)
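
As a rough illustration of the event-driven variant, the sketch below reads Textract completion notifications out of an SNS-triggered event, for example inside an AWS Lambda handler. It assumes the standard SNS event envelope, where each record holds a JSON string under `Sns.Message`, and the job fields `JobId`, `Status`, and `DocumentLocation` that Textract publishes. The helper name `jobs_from_sns_event` and the sample event are made up for illustration.

```python
import json


def jobs_from_sns_event(event):
    """Return (job_id, status, bucket, key) for each Textract notification."""
    jobs = []
    for record in event.get("Records", []):
        # The SNS message body is itself a JSON-encoded string.
        message = json.loads(record["Sns"]["Message"])
        location = message.get("DocumentLocation", {})
        jobs.append(
            (
                message["JobId"],
                message["Status"],
                location.get("S3Bucket"),
                location.get("S3ObjectName"),
            )
        )
    return jobs


# Hand-written sample event (bucket and job ID are placeholders).
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "JobId": "example-job-id",
                        "Status": "SUCCEEDED",
                        "API": "StartDocumentTextDetection",
                        "DocumentLocation": {
                            "S3Bucket": "example-bucket",
                            "S3ObjectName": "samples/employee_enrollment.pdf",
                        },
                    }
                )
            }
        }
    ]
}

for job in jobs_from_sns_event(sample_event):
    print(job)
```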

### 2. Basic usage with Bedrock
---
Now we can use Amazon Bedrock to generate a response based on the document text extracted by Textractor. The cell below constructs a prompt that includes the document text, then calls the previously defined function `get_response_from_claude` to generate a response based on that prompt.

In [None]:
prompt = f"""

Given the document

<document>{doc_text}</document>

What is the employee's name?
"""

response = get_response_from_claude(prompt)

print (f"Our prompt has {response[1]} input tokens and Claude returned {response[2]} output tokens \n\n=========================\n")
print(response[0])
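
When you ask several questions about the same document, it can help to build the prompt in one place so the document text is always wrapped in a matching pair of tags. A minimal sketch, where `build_document_prompt` is a hypothetical helper and the sample text is made up:

```python
def build_document_prompt(document_text, question):
    """Wrap the document text in <document> tags and append the question."""
    return (
        "Given the document\n\n"
        f"<document>\n{document_text.strip()}\n</document>\n\n"
        f"{question.strip()}\n"
    )


# Made-up document text standing in for the output of document.get_text().
sample_text = "Employee Name: Jane Doe\nStart Date: 2024-01-15"

print(build_document_prompt(sample_text, "What is the employee's name?"))
```

The returned string can be passed directly as the `prompt` argument of `get_response_from_claude`.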

### 3. Basic usage with Rhubarb
---

The Rhubarb library provides a high-level interface for using Bedrock's multi-modal models (models that can process both text and images/documents). The example below creates a `DocAnalysis` object from a local PDF file, then uses that object to generate a response to a textual query with one of Bedrock's multi-modal models.


Under the hood, Rhubarb uses pdfplumber to read each page of the document and convert it into an image. Each image is then passed to a multi-modal model, which analyzes it and returns the data requested in the prompt. The `DocAnalysis` object wraps all of these tasks into a single function call.

Compared to using Amazon Textract for OCR followed by a separate entity extraction step, Rhubarb does both in a single step.

In [None]:
import boto3
from rhubarb import DocAnalysis, LanguageModels

session = boto3.Session()


da = DocAnalysis(file_path="../samples/employee_enrollment.pdf",
                  modelId=LanguageModels.CLAUDE_HAIKU_V1,
                  boto3_session=session)
resp = da.run(message="What is the employee's name?")
resp
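
To give a feel for what a multi-modal request looks like at the API level, the sketch below builds a Claude 3 Messages API body that pairs one page image with a question. It only illustrates the request format, not Rhubarb's internal code; `build_image_question_request` is a hypothetical helper, and the image bytes are a placeholder rather than a real PNG.

```python
import base64
import json


def build_image_question_request(image_bytes, question,
                                 media_type="image/png", max_tokens=1024):
    """Return a JSON request body with one base64 image block and one text block."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": encoded,
                        },
                    },
                    {"type": "text", "text": question},
                ],
            }
        ],
    }
    return json.dumps(body)


# Placeholder bytes; in practice read them from a rendered page image file.
fake_page = b"not-a-real-png"
request_json = build_image_question_request(fake_page, "What is the employee's name?")
print(request_json[:80])
```

The resulting string is the kind of value you would pass as `body` to `invoke_model` on the `bedrock-runtime` client, as in the text-only function earlier in this notebook.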

## Cleanup
---
Let's delete the sample files we uploaded earlier.

In [None]:
s3.delete_object(Bucket=data_bucket, Key='samples/discharge-summary.png')
s3.delete_object(Bucket=data_bucket, Key='samples/employee_enrollment.pdf')