# Unlocking the Power of Intelligent Document Processing with Amazon Textract and Amazon Bedrock
---

<div class="alert alert-block alert-info"> 
    <b>NOTE:</b> You will need to use a Jupyter Kernel with Python 3.9 or above to use this notebook. If you are in Amazon SageMaker Studio, you can use the `Data Science 3.0` image.
</div>

<div class="alert alert-block alert-warning"> 
    <b>NOTE:</b> You will need third-party model access to the Anthropic Claude 3 Sonnet and Haiku models to run this notebook. Verify that you have access by going to the <a href="https://console.aws.amazon.com/bedrock" target="_blank">Amazon Bedrock console</a> > left menu "Model access". The "Access status" for Anthropic Claude must show "Access granted" in green. If you do not have access, click the "Edit" button on the top right > select the model checkbox > click the "Save changes" button at the bottom. You should have access to the model within a few moments.
</div>



---------------
In this notebook, we will explore the powerful capabilities of Amazon Textract and Amazon Bedrock for intelligent document processing (IDP). IDP involves automatically extracting valuable information from documents, enabling organizations to streamline document-centric workflows, reduce operational costs, and gain insights from their data.

Amazon Textract uses advanced machine learning and computer vision technologies to accurately extract text, data, and metadata from various document formats, including PDFs, images, and scanned documents. By automating this process, Textract eliminates the need for manual data entry, increasing efficiency and reducing the risk of errors.

Amazon Bedrock, on the other hand, provides access to state-of-the-art large language models (LLMs) that can understand and process both text and visual data. These multi-modal models can accurately identify and extract relevant information from structured, semi-structured, and unstructured documents, enabling tasks such as form extraction, table extraction, and intelligent question answering.

Together, Textract and Bedrock form a powerful combination for building intelligent document processing pipelines. Textract handles the initial document ingestion and text extraction, while Bedrock's LLMs provide advanced understanding and extraction capabilities.

Throughout this notebook, we will explore the key features of both services and learn how to integrate them into your workflows using Python libraries like Textractor and Rhubarb. By the end of this notebook, you will have a solid understanding of how to leverage AWS for efficient and accurate document processing, enabling your organization to unlock valuable insights from its data assets.


## Basic building blocks
---
This lab introduces the libraries and interfaces used in the subsequent labs on intelligent document processing with generative AI, and provides code samples that you can adapt to your own workflows.

### Setup Prerequisites
---

First, we need to:
1. Install Textractor. This open source Python library makes it easy to parse and handle the JSON output from Amazon Textract.
1. Install Rhubarb. This open source Python library makes it easy to use Amazon Bedrock's multimodal capabilities for IDP.
1. Install the SageMaker Python SDK, which gives you access to a SageMaker session context.


In [None]:
!python -m pip install "amazon-textract-textractor[pdf]"
!python -m pip install pyrhubarb
!python -m pip install sagemaker

In [None]:
#Import script libraries and create global variables
import json
import sagemaker

role = sagemaker.get_execution_role()
data_bucket = sagemaker.Session().default_bucket()
region = sagemaker.Session().boto_region_name
print(f"SageMaker bucket is {data_bucket}, and SageMaker Execution Role is {role}. Current region is {region}")

### Use Boto3 to call the Bedrock API
---
Sending a prompt directly to Bedrock with Boto3 is straightforward and gives you fine-grained control over your prompt and parameters.

In [None]:
import boto3
from botocore.exceptions import ClientError
bedrock = boto3.client(service_name="bedrock-runtime", region_name=region)

# Create a global function to call Bedrock.
def get_response_from_claude(prompt, temp=1, model='sonnet'):
    """
    Invokes an Anthropic Claude 3 model to run a text inference using the input
    provided in the request body.

    :param prompt:  The prompt that you want Claude 3 to use.
    :param temp:    The temperature to use when invoking Claude. Default is 1.
    :param model:   The Claude model to use, either 'haiku' or 'sonnet'. Default is 'sonnet'.
    :return:        Tuple of (text response, input token count, output token count)
    """

    # Map the short model name to its Bedrock model ID
    model_dict = {
        "haiku": "anthropic.claude-3-haiku-20240307-v1:0",
        "sonnet": "anthropic.claude-3-sonnet-20240229-v1:0",
    }
    model_id = model_dict[model]

    # Build a text-only request in the Anthropic Messages format
    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "temperature": temp,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": prompt,
                    },
                ],
            }
        ],
    }

    try:
        response = bedrock.invoke_model(
            modelId=model_id,
            body=json.dumps(request_body),
        )

        # Parse the response body and read the token usage
        result = json.loads(response.get("body").read())
        input_tokens = result["usage"]["input_tokens"]
        output_tokens = result["usage"]["output_tokens"]

        # The current Bedrock Claude Messages API only supports text content in responses
        text_response = result["content"][0]["text"]

        # Return a tuple with 3 values
        return text_response, input_tokens, output_tokens
    except ClientError as err:
        print(
            f"Couldn't invoke {model_id}. Here's why: {err.response['Error']['Code']}: {err.response['Error']['Message']}"
        )
        raise
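
To see what the function above sends and receives without calling AWS, the sketch below splits request construction and response parsing into two small helpers and runs them on a hand-written response. The helper names `build_claude_request` and `parse_claude_response` and the sample response values are illustrative assumptions; the field names are the ones used in the function above.

```python
import json

def build_claude_request(prompt, temp=1.0, max_tokens=4096):
    """Return the JSON string body for a text-only Claude 3 Messages request."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "temperature": temp,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
    }
    return json.dumps(body)

def parse_claude_response(raw_body):
    """Extract (text, input_tokens, output_tokens) from a decoded response body."""
    result = json.loads(raw_body)
    usage = result["usage"]
    return result["content"][0]["text"], usage["input_tokens"], usage["output_tokens"]

# Hand-written stand-in for the bytes returned by response["body"].read()
sample_body = json.dumps({
    "content": [{"type": "text", "text": "Hello!"}],
    "usage": {"input_tokens": 12, "output_tokens": 3},
})

request_json = build_claude_request("Say hello", temp=0)
text, input_tokens, output_tokens = parse_claude_response(sample_body)
print(text, input_tokens, output_tokens)
```

Keeping these two steps separate from the network call makes it possible to check prompt formatting and response handling in a unit test before spending any tokens.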



### 1. Basic usage with Amazon Textract and Amazon Textractor
---
This code block demonstrates how to use the Amazon Textract service directly through the AWS SDK for Python (Boto3) to extract text from a document stored in an Amazon S3 bucket. It first uploads the document to S3, then calls the `detect_document_text` method of the Textract client to perform text detection on the document. The response from Textract is then parsed using the `textractor.parsers.response_parser` module to create a more user-friendly representation of the detected text.

In [None]:
import boto3
from textractor.parsers import response_parser

textract = boto3.client('textract')
s3 = boto3.client("s3")

# first we upload the file to S3
s3.upload_file(Filename='../samples/discharge-summary.png', Bucket=data_bucket, Key='samples/discharge-summary.png')

# Next, we use Textract's DetectDocumentText action to get all the text in the document.
textract_response = textract.detect_document_text(
    Document={'S3Object': {'Bucket': data_bucket, 'Name': 'samples/discharge-summary.png'}}
)
	
# Textractor provides a parser to give us a summary of the contents and a string with the detected text
document = response_parser.parse(textract_response)

# the document object contains a summary of what textract returned 
print(document)
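
For comparison, here is a minimal sketch of reading the raw Textract JSON by hand, which is the work the Textractor parser does for you. A text detection response carries a flat `Blocks` list whose entries have a `BlockType` (such as `PAGE`, `LINE`, or `WORD`), a `Text`, and a `Confidence` score. The helper name `lines_from_textract_response` and the trimmed sample response are illustrative assumptions, not output from the sample document.

```python
def lines_from_textract_response(response, min_confidence=0.0):
    """Return the text of every LINE block at or above min_confidence, in response order."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]

# Hand-written, heavily trimmed stand-in for a real response (illustrative values only)
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Discharge Summary", "Confidence": 99.1},
        {"BlockType": "WORD", "Id": "w1", "Text": "Discharge", "Confidence": 99.3},
        {"BlockType": "WORD", "Id": "w2", "Text": "Summary", "Confidence": 98.9},
        {"BlockType": "LINE", "Id": "l2", "Text": "Patient Name", "Confidence": 62.0},
    ]
}

print(lines_from_textract_response(sample_response))
print(lines_from_textract_response(sample_response, min_confidence=90))
```

The same function should also accept the `textract_response` dictionary from the cell above, since it only relies on the `Blocks` list.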

### Using the Textractor library
---
This library provides a higher-level interface for working with Amazon Textract. Instead of calling Textract directly, you can use Textractor's [caller methods](https://aws-samples.github.io/amazon-textract-textractor/textractor.html), which abstract away some of the complexity of interacting with the Textract service. In this example, Textractor extracts text from a document stored in an S3 bucket.

In [None]:
from textractor import Textractor
extractor = Textractor(region_name=region)

document = extractor.detect_document_text('s3://' + data_bucket + '/samples/discharge-summary.png')
print(document)

---
You can also use Textractor to extract text from a local file on your machine, rather than a file stored in S3. This can be more convenient for smaller projects or local testing.

In [None]:
from textractor import Textractor
extractor = Textractor(region_name=region)

document = extractor.detect_document_text("../samples/discharge-summary.png")
print(document.get_text())

---
Textractor also works with Textract's asynchronous methods, which are useful for processing multi-page documents. The following example first uploads a PDF file to S3, then calls Textractor's `start_document_text_detection` method to start an asynchronous text detection job, and finally prints the results page by page.

In [None]:
from textractor import Textractor
extractor = Textractor(region_name=region)

# first we upload the file to S3
s3.upload_file(Filename='../samples/employee_enrollment.pdf', Bucket=data_bucket, Key='samples/employee_enrollment.pdf')

document = extractor.start_document_text_detection(file_source="s3://" + data_bucket + "/samples/employee_enrollment.pdf",
    save_image=False)

for page in document.pages:
    print(page)
    print("\n----------------\n")
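
The asynchronous flow can also be written directly against Boto3, which shows the start, poll, and paginate steps involved. The sketch below is an assumption about how one might do this by hand, not Textractor's implementation; the function name `wait_for_text_detection` and its `poll_seconds` and `max_polls` parameters are hypothetical. It treats any final status other than `SUCCEEDED` as an error.

```python
import time

def wait_for_text_detection(client, bucket, key, poll_seconds=5, max_polls=120):
    """Start an asynchronous text detection job and return all result blocks."""
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended with status {result['JobStatus']}")

    # Results are paginated; follow NextToken until it is absent
    blocks = list(result.get("Blocks", []))
    while "NextToken" in result:
        result = client.get_document_text_detection(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result.get("Blocks", []))
    return blocks
```

In this notebook it could be called as `wait_for_text_detection(boto3.client("textract"), data_bucket, "samples/employee_enrollment.pdf")`. Passing the client in as a parameter also lets the function be tested with a stub object instead of a live AWS connection.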

---
You can retrieve the full text content of the document from the Textractor `document` object. This text can then be passed to a model on Amazon Bedrock, for example as context in a Retrieval-Augmented Generation (RAG) workflow.

In [None]:
print("\nDetected Text \n=========================\n")
doc_text = document.get_text()
print(doc_text)

### 2. Basic usage with Bedrock
---
Now we can use Amazon Bedrock to generate a response based on the document text extracted by Textractor. The code below constructs a prompt that includes the document text, then calls the previously defined `get_response_from_claude` function to answer a question about it.

In [None]:
prompt = f"""

Given the document

<document>{doc_text}</document>

What is the employee's name?
"""

response = get_response_from_claude(prompt)

print (f"Our prompt has {response[1]} input tokens and Claude returned {response[2]} output tokens \n\n=========================\n")
print(response[0])
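
When the same document is queried with several questions, the prompt template can be wrapped in a small helper so that the opening and closing `<document>` tags always match. This is a minimal sketch; the function name `build_document_prompt` and the sample text are illustrative assumptions.

```python
def build_document_prompt(document_text, question):
    """Wrap the document in XML-style tags and append the question."""
    return (
        "Given the document\n\n"
        f"<document>{document_text}</document>\n\n"
        f"{question}\n"
    )

# Made-up sample text standing in for the output of document.get_text()
example_prompt = build_document_prompt(
    "Employee Name: Jane Doe", "What is the employee's name?"
)
print(example_prompt)
```

The result can be passed to the function defined earlier, and the returned tuple unpacked into named variables for readability: `text, input_tokens, output_tokens = get_response_from_claude(example_prompt)`.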

### 3. Basic usage with Rhubarb
---

The Rhubarb library provides a high-level interface for using Bedrock's multi-modal models (models that can process both text and images or documents). The example below creates a `DocAnalysis` object from a local PDF file, then uses that object to answer a text question with one of Bedrock's multi-modal models.

In [None]:
import boto3
from rhubarb import DocAnalysis

session = boto3.Session()

da = DocAnalysis(file_path="../samples/employee_enrollment.pdf", boto3_session=session)
resp = da.run(message="What is the employee's name?")
resp
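
To illustrate what a multi-modal request looks like at the Bedrock level, the sketch below builds a Claude 3 Messages body that pairs one base64-encoded image with a text question. It is not Rhubarb's implementation, only an assumption about the kind of request a library like it would need to send; the function name `build_image_request` is hypothetical and the image bytes are a placeholder.

```python
import base64
import json

def build_image_request(image_bytes, question, media_type="image/png", max_tokens=1024):
    """Return a JSON body pairing one base64-encoded image with a text question."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": encoded,
                        },
                    },
                    {"type": "text", "text": question},
                ],
            }
        ],
    }
    return json.dumps(body)

# Placeholder bytes stand in for a real PNG read from disk
example_body = build_image_request(b"not-a-real-image", "What is the employee's name?")
print(json.loads(example_body)["messages"][0]["content"][0]["source"]["media_type"])
```

With real image bytes, the body could be sent with `bedrock.invoke_model(modelId=..., body=example_body)` in the same way as the text-only function earlier in this notebook. Image content blocks take image formats such as PNG or JPEG, so a PDF would likely need to be rendered to page images first.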

## Cleanup
---
Let's delete the sample files we uploaded earlier.

In [None]:
s3.delete_object(Bucket=data_bucket, Key='samples/discharge-summary.png')
s3.delete_object(Bucket=data_bucket, Key='samples/employee_enrollment.pdf')