# Augment Intelligent Document Processing with generative AI using Amazon Bedrock
---

<div class="alert alert-block alert-info"> 
    <b>NOTE:</b> You will need to use a Jupyter Kernel with Python 3.9 or above to use this notebook. If you are in Amazon SageMaker Studio, you can use the `Data Science 3.0` image.
</div>

<div class="alert alert-block alert-warning"> 
    <b>NOTE:</b> You will need third-party model access to the Anthropic Claude 3 Haiku model to run this notebook. To verify that you have access, go to the <a href="https://console.aws.amazon.com/bedrock" target="_blank">Amazon Bedrock console</a> > left menu "Model access". The "Access status" for Anthropic Claude 3 Haiku must show "Access granted" in green. If you do not have access, click the "Edit" button on the top right > select the model checkbox > click the "Save changes" button at the bottom. You should have access to the model within a few moments.
</div>

In this notebook, we demonstrate how you can use Amazon Textract as a document loader for Amazon Bedrock to extract data from documents and apply generative AI capabilities within the various IDP phases. We will perform the following tasks with Anthropic Claude 3 Haiku:

- Classification
- Summarization
- Standardization
- Spell check and corrections
- Multi-page summarization

## Setup Prerequisites
---

First, we need to:
1. Install Amazon Textractor. This open source Python library makes it easy to parse and handle the JSON output from Amazon Textract.
1. Import script libraries and create global variables.
1. Create a global function for calling Amazon Bedrock.
1. Upload our example files to S3.

In [None]:
!python -m pip install amazon-textract-textractor[pdf]

In [None]:
# Import script libraries and create global variables
import json
import os
import sys
import sagemaker
import datetime
import boto3
from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.parsers import response_parser

extractor = Textractor(region_name="us-east-2")
textract = boto3.client('textract')
s3 = boto3.client("s3")
# If you run this notebook in a Workshop Studio environment, set region_name to us-west-2 for Bedrock models.
bedrock = boto3.client(service_name="bedrock-runtime", region_name="us-west-2")
role = sagemaker.get_execution_role()
data_bucket = sagemaker.Session().default_bucket()
print(f"SageMaker bucket is {data_bucket}, and SageMaker Execution Role is {role}")

In [None]:
# Create a global function to call Bedrock.
import logging

from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)


def get_response_from_claude(prompt):
    """
    Invokes Anthropic Claude 3 Haiku to run a text inference using the input
    provided in the request body.

    :param prompt: The prompt that you want Claude 3 Haiku to use.
    :return: A tuple of (response text, input token count, output token count).
    """

    # Build the Messages API request body with the text prompt
    model_id = "anthropic.claude-3-haiku-20240307-v1:0"
    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "temperature": 0.5,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": prompt,
                    },
                ],
            }
        ],
    }

    try:
        response = bedrock.invoke_model(
            modelId=model_id,
            body=json.dumps(request_body),
        )

        # Parse the response body and read the token usage
        result = json.loads(response.get("body").read())
        input_tokens = result["usage"]["input_tokens"]
        output_tokens = result["usage"]["output_tokens"]
        # The current Bedrock Claude Messages API only supports text content in responses
        text_response = result["content"][0]["text"]

        # Return a tuple with 3 values
        return text_response, input_tokens, output_tokens
    except ClientError as err:
        logger.error(
            "Couldn't invoke Claude 3 Haiku. Here's why: %s: %s",
            err.response["Error"]["Code"],
            err.response["Error"]["Message"],
        )
        raise




In [None]:
# Upload the sample files to S3
sample_files = [
    'samples/discharge-summary.png',
    'samples/hand_written_note.pdf',
    'samples/health_plan.pdf',
    ]
for file_key in sample_files:
    s3.upload_file(Filename='./'+file_key, Bucket=data_bucket, Key=file_key)


## 1. Classification
---

First, we extract the text from the document as a plain string that we can pass to our LLM in a prompt.
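
For orientation, the raw `detect_document_text` response is a dictionary whose `Blocks` list contains `PAGE`, `LINE`, and `WORD` entries. The sketch below joins the `LINE` blocks of a small hand-made response. The helper name `lines_from_textract_response` and the sample dictionary are illustrative assumptions, not real Textract output, and the Textractor parser used in this notebook handles far more than this.

```python
def lines_from_textract_response(response: dict) -> str:
    """Join the text of every LINE block in a Textract-style response."""
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
    return "\n".join(lines)


# Hand-made response fragment for illustration only (hypothetical content).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Discharge Summary"},
        {"BlockType": "WORD", "Text": "Discharge"},
        {"BlockType": "WORD", "Text": "Summary"},
        {"BlockType": "LINE", "Text": "Patient: John Doe"},
    ]
}

print(lines_from_textract_response(sample_response))
```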

In [None]:
# Examples 1, 2, and 3 use the discharge summary document. You can find the sample documents in the samples folder. Take a look at this document before proceeding.
file_key = 'samples/discharge-summary.png'

# We will use the Textract detect document text action to get all the text in the document.
textract_response = textract.detect_document_text(
    Document={
        'S3Object': {'Bucket': data_bucket, 'Name': file_key}
    },
)

# Textractor provides a parser to give us a summary of the contents and a string with the detected text
document = response_parser.parse(textract_response)

print(document)
doc_text = document.get_text()
print (f"\nDetected Text \n=========================\n")
print(doc_text)

---

Claude can classify a document based on its content, given a list of classes.
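
Even when asked to skip the preamble, a model reply can carry stray whitespace, punctuation, or extra words, so it can help to validate the answer against the class list before using it downstream. The following is a minimal sketch; `normalize_class` and `ALLOWED_CLASSES` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
ALLOWED_CLASSES = ("DISCHARGE_SUMMARY", "RECEIPT", "PRESCRIPTION")


def normalize_class(raw_answer: str, allowed=ALLOWED_CLASSES) -> str:
    """Map a model reply to exactly one allowed class, or raise ValueError."""
    upper = raw_answer.strip().upper().replace(" ", "_")
    cleaned = upper.strip(".'\"`")
    if cleaned in allowed:
        return cleaned
    # Fall back to a substring search in case the model added extra words.
    matches = [name for name in allowed if name in upper]
    if len(matches) == 1:
        return matches[0]
    raise ValueError(f"Could not map reply to a known class: {raw_answer!r}")


print(normalize_class(" discharge_summary.\n"))
```

Raising on an ambiguous or unknown reply, rather than guessing, keeps misclassified documents from silently flowing into later IDP phases.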

In [None]:
prompt = f"""

Given a list of classes and a document, classify the document into one of the given classes. 

<classes>DISCHARGE_SUMMARY, RECEIPT, PRESCRIPTION</classes>
<document>{doc_text}</document>



Skip any preamble text and just give the class name.
"""

response = get_response_from_claude(prompt)

print(f"The provided document is = {response[0]}")



## 2. Summarization
---

Summarize large pieces of text from a document into shorter, more concise explanations. In this block, we will ask Claude to summarize the discharge summary.

Try editing the prompt to provide a suggested next step. 
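
Because the summarization prompt ends with an opening `<summary>` tag, the reply may come back with a closing tag, with both tags, or with neither. The sketch below handles all three cases; `strip_tag` is a hypothetical helper introduced here, and the sample strings are made up.

```python
import re


def strip_tag(text: str, tag: str) -> str:
    """Return the content inside <tag>...</tag> if present, else the trimmed text."""
    name = re.escape(tag)
    match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
    if match:
        return match.group(1).strip()
    # Only one of the two tags (or neither) is present: drop whichever appears.
    return text.replace(f"<{tag}>", "").replace(f"</{tag}>", "").strip()


print(strip_tag("Patient recovered well.\n</summary>", "summary"))
```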

In [None]:
prompt = f"""

Given a full document, give me a concise summary. 

<document>{doc_text}</document>

Skip any preamble text and just give the summary.

<summary>"""

summary = get_response_from_claude(prompt)

print (f"Our prompt has {summary[1]} input tokens and Claude returned {summary[2]} output tokens \n\n=========================\n")
print(summary[0].replace("</summary>","").strip())

## 3. Standardization
---

Let's standardize the dates in our discharge summary document. The document has dates in `DD-MON-YYYY` format, and we want to convert all of them to `DD/MM/YYYY` format. We will use simple prompt engineering techniques to show Claude an example of the output structure and have it generate the output as JSON key-value pairs.
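
Since the model is asked to return JSON with reformatted dates, the reply can be checked programmatically before it is trusted. The sketch below parses a reply and validates every value against a `strptime` pattern (`%d/%m/%Y` corresponds to the `DD/MM/YYYY` pattern in the prompt's output instructions), and it also shows a deterministic conversion that can serve as a cross-check. `parse_dates`, `convert_date`, the key names, and the sample values are hypothetical and chosen only for illustration.

```python
import json
from datetime import datetime


def parse_dates(json_text: str, date_format: str) -> dict:
    """Parse a JSON object of date strings and check each against date_format."""
    data = json.loads(json_text)
    parsed = {}
    for key, value in data.items():
        # strptime raises ValueError if the value does not match the pattern.
        parsed[key] = datetime.strptime(value, date_format).date()
    return parsed


def convert_date(value: str, source_format: str, target_format: str) -> str:
    """Convert a date string from one strptime/strftime pattern to another."""
    return datetime.strptime(value, source_format).strftime(target_format)


# Made-up model reply, not taken from the sample document.
sample_reply = '{"admitted_date": "07/09/2020", "discharge_date": "08/09/2020"}'
print(parse_dates(sample_reply, "%d/%m/%Y"))
print(convert_date("07-Sep-2020", "%d-%b-%Y", "%d/%m/%Y"))
```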

In [None]:
question = "Can you give me the patient admitted and discharge dates?"

prompt = f"""

Given a full document, answer the question and format the output in the format specified. 

<format>
{{
  "key_name":"key_value"
}}
</format>
<document>{doc_text}</document>
<question>{question}</question>

<output_instructions>
Skip any preamble text and just generate the JSON.
Format the dates in the value fields precisely in this format <format>DD/MM/YYYY</format>. 
</output_instructions>


"""

dates = get_response_from_claude(prompt)

print (f"Our prompt has {dates[1]} input tokens and Claude returned {dates[2]} output tokens \n\n=========================\n")
print(dates[0].strip())

## 4. Spell check and corrections
---

Perform grammatical and spelling corrections on text extracted from a handwritten document.

In this example, we provide Textract with a handwritten note. We take the text extracted from the note and send it to Claude to make corrections.
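
To review what the model actually changed, the extracted text and the corrected text can be compared word by word. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; `word_changes` is a hypothetical helper and the sample sentences are made up rather than taken from the handwritten note.

```python
import difflib


def word_changes(original: str, corrected: str) -> list[tuple[str, str]]:
    """Return (before, after) pairs for each span of words that changed."""
    before_words = original.split()
    after_words = corrected.split()
    matcher = difflib.SequenceMatcher(a=before_words, b=after_words)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (" ".join(before_words[i1:i2]), " ".join(after_words[j1:j2]))
            )
    return changes


for before, after in word_changes(
    "Pateint was admited on monday", "Patient was admitted on Monday"
):
    print(f"{before!r} -> {after!r}")
```

A review step like this makes it easier to spot cases where the model rewrote content instead of only fixing spelling and grammar.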

In [None]:
file_key = 'samples/hand_written_note.pdf'

textract_response = textract.detect_document_text(
    Document={
        'S3Object': {'Bucket': data_bucket, 'Name': file_key}
    },
)

document = response_parser.parse(textract_response)

print(document)
doc_text = document.get_text()
print (f"\nDetected Text \n=========================\n")
print(doc_text)

In [None]:
prompt = f"""

Given a detailed 'Document':

<document>{doc_text}</document>

perform spelling and grammatical corrections. Ensure the output is coherent, polished, and free from errors. 

Skip any preamble text and give the answer.
<answer>"""

answer = get_response_from_claude(prompt)
print (f"Our prompt has {answer[1]} input tokens and Claude returned {answer[2]} output tokens.")
print("\nCorrected Text")
print("==============")
print(answer[0].replace("</answer>","").strip())

## 5. Multi-page summarization
---

We will now attempt to summarize a multi-page document.

We will use the Textractor document loader to process a multi-page file stored on S3. Textractor makes the asynchronous call to Textract and handles the wait for the response.

Textractor will give us a page-by-page summary of the document.

In [None]:
file_key = 'samples/health_plan.pdf'
file_s3_uri = f"s3://{data_bucket}/{file_key}"

document = extractor.start_document_text_detection(
    file_source=file_s3_uri,
    # features=[TextractFeatures.LAYOUT],
    save_image=False,
)
document.pages

In [None]:
print(f"The document has {len(document.words)} total words")

Since Claude 3 Haiku supports a 200K-token context window, we don't need to split the document further.
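
For longer documents, a quick size check before calling the model can catch inputs that will not fit. The sketch below is a rough estimate only: the ratio of 4 characters per token is a crude assumption and not Claude's tokenizer, and `fits_in_context` is a hypothetical helper. The `input_tokens` value returned by `get_response_from_claude` is the actual count once a call has been made.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; chars_per_token is an assumed ratio."""
    return int(len(text) / chars_per_token)


def fits_in_context(
    text: str,
    context_window: int = 200_000,
    reserved_output_tokens: int = 2048,
    chars_per_token: float = 4.0,
) -> bool:
    """Check whether the estimated prompt size plus the output budget fits."""
    estimated = estimate_tokens(text, chars_per_token)
    return estimated + reserved_output_tokens <= context_window


sample_text = "word " * 1000  # stand-in for the extracted document text
print(estimate_tokens(sample_text), fits_in_context(sample_text))
```

The default `reserved_output_tokens` mirrors the `max_tokens` value of 2048 used in the request body, so the check leaves room for the reply as well as the prompt.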

In [None]:
doc_text = document.get_text()

In [None]:
prompt = f"""

Given a full document

<document>{doc_text}</document>

Give me a concise summary. 

Skip any preamble text and just give the summary.

<summary>"""

summary = get_response_from_claude(prompt)

print (f"Our prompt has {summary[1]} input tokens and Claude returned {summary[2]} output tokens \n\n=========================\n")
print(summary[0].replace("</summary>","").strip())

## Cleanup
---
Let's delete the sample files we uploaded earlier.

In [None]:
for file_key in sample_files:
    s3.delete_object(Bucket=data_bucket, Key=file_key)