# Augment Intelligent Document Processing with generative AI using Amazon Bedrock
---

<div class="alert alert-block alert-info"> 
    <b>NOTE:</b> You will need to use a Jupyter Kernel with Python 3.9 or above to use this notebook. If you are in Amazon SageMaker Studio, you can use the `Data Science 3.0` image.
</div>

<div class="alert alert-block alert-warning"> 
    <b>NOTE:</b> You will need 3rd party model access to Anthropic Claude V1 model to be able to run this notebook. Verify if you have access to the model by going to <a href="https://console.aws.amazon.com/bedrock" target="_blank">Amazon Bedrock console</a> > left menu "Model access". The "Access status" for Anthropic Claude must be in "Access granted" status in green. If you do not have access, then click "Edit" button on the top right > select the model checkbox > click "Save changes" button at the bottom. You should have access to the model within a few moments.
</div>

In this notebook, we demonstrate how you can integrate Amazon Textract with LangChain as a document loader to extract data from documents and use generative AI capabilities within the various IDP phases. We will perform the following with different LLMs.

- Classification
- Summarization
- Standardization
- Spell check corrections

In [None]:
!pip install -U boto3 langchain faiss-cpu transformers
!pip install amazon-textract-textractor pypdf Pillow transformers

In [62]:
import json
import os
import sys
import sagemaker
import boto3

role = sagemaker.get_execution_role()
data_bucket = sagemaker.Session().default_bucket()
bedrock = boto3.client('bedrock-runtime')
s3 = boto3.client("s3")
print(f"SageMaker bucket is {data_bucket}, and SageMaker Execution Role is {role}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
SageMaker bucket is sagemaker-us-east-1-710096454740, and SageMaker Execution Role is arn:aws:iam::710096454740:role/service-role/AmazonSageMaker-ExecutionRole-20220504T135260


## 1. Classification
---

Classify a document based on its content, given a list of classes.

In [20]:
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/discharge-summary.png")
document = loader.load()

template = """

Given a list of classes, classify the document into one of these classes. Skip any preamble text and just give the class name.

<classes>DISCHARGE_SUMMARY, RECEIPT, PRESCRIPTION</classes>
<document>{doc_text}</document>
<classification>"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v1")

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
class_name = llm_chain.run(document[0].page_content)

print(f"The provided document is a = {class_name}")



The provided document is a =  DISCHARGE_SUMMARY
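The raw completion appears to carry a leading space, and a model can occasionally add words around the label. The following is a minimal sketch, assuming you want to validate the label before routing the document downstream. `ALLOWED_CLASSES` and `normalize_class` are hypothetical names introduced here for illustration; they are not part of LangChain or Amazon Bedrock.

```python
# Class list copied from the <classes> tag in the prompt above.
ALLOWED_CLASSES = ("DISCHARGE_SUMMARY", "RECEIPT", "PRESCRIPTION")

def normalize_class(raw_output, allowed=ALLOWED_CLASSES):
    """Return the single allowed class found in the model output, or None."""
    cleaned = raw_output.replace("</classification>", "").strip().upper()
    if cleaned in allowed:
        return cleaned
    # Fall back to a substring search in case the model added extra words.
    matches = [name for name in allowed if name in cleaned]
    return matches[0] if len(matches) == 1 else None

print(normalize_class(" DISCHARGE_SUMMARY"))  # DISCHARGE_SUMMARY
```

Returning `None` for anything ambiguous lets the caller decide whether to retry the prompt or send the document for manual review.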


## 2. Summarization
---

Summarize large pieces of text from a document into shorter, more concise explanations. In this block we will summarize a single page.

In [None]:
!pip install anthropic

In [28]:
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/discharge-summary.png")
document = loader.load()

template = """

Given a full document, give me a concise summary. Skip any preamble text and just give the summary.

<document>{doc_text}</document>
<summary>"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v1")

num_tokens = bedrock_llm.get_num_tokens(document[0].page_content)
print (f"Our prompt has {num_tokens} tokens \n\n=========================\n")

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
summary = llm_chain.run(document[0].page_content)

print(summary.replace("</summary>","").strip())

Our prompt has 797 tokens 


35 yo M admitted for epigastric abdominal pain, nausea, fatigue. Found to likely have ulcer. Discharged with activity restrictions, antibiotics, diet changes, and follow up.
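Because the prompt ends with an open `<summary>` tag, the model may emit a closing `</summary>` tag, which the cell above removes with `replace`. A slightly more defensive sketch, assuming you also want to discard anything the model writes after the closing tag, is shown here. `text_before_closing_tag` is a hypothetical helper, not a LangChain function.

```python
def text_before_closing_tag(completion, tag):
    """Return the completion up to the first closing tag, stripped of whitespace."""
    closing = f"</{tag}>"
    # partition() returns the whole string as `head` when the tag is absent.
    head, _, _ = completion.partition(closing)
    return head.strip()

print(text_before_closing_tag(" 35 yo M admitted.</summary>\nAnything else?", "summary"))
```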


### Multi-page summarization

We will now attempt to summarize a multi-page document. In order to extract a multi-page PDF we first need to upload it to an S3 bucket.

In [33]:
!aws s3 cp ./samples/health_plan.pdf s3://{data_bucket}/bedrock-sample/health_plan.pdf

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
upload: samples/health_plan.pdf to s3://sagemaker-us-east-1-710096454740/bedrock-sample/health_plan.pdf


In [35]:
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock

bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v1")

loader = AmazonTextractPDFLoader(f"s3://{data_bucket}/bedrock-sample/health_plan.pdf")
document = loader.load()

document

[Document(page_content='Health Benefit Summary Plan Description Revised 01-01-2022 BENEFITS Healthcare Policy Plan Vision: To provide high quality, affordable healthcare for all citizens. Mission: To implement policy reforms and programs that expand access to healthcare, reduce costs, and improve health outcomes. Goals: 1. Achieve universal healthcare coverage. Provide health insurance for all citizens regardless of income or health status. 2. Reduce healthcare costs for individuals and government. Implement policies and programs to lower premiums, out-of-pocket costs, and overall healthcare spending. 3. Improve population health. Invest in public health programs and prevention to promote healthy lifestyles, reduce health risks, and improve health outcomes. 4. Support healthcare innovation. Invest in research and new technologies to improve treatments, cures, and the healthcare system. Policy Reforms: 1. Establish a public healthcare option. Provide a government-run health plan to comp

The Amazon Textract PDF loader returns the text of each page as a separate document. With a 100k-token context window, we have plenty of room and don't need to split the pages any further. Let's check the token count of each page.

In [38]:
num_docs = len(document)
print (f"There are {num_docs} pages in the document")
for index, doc in enumerate(document):
    num_tokens_first_doc = bedrock_llm.get_num_tokens(doc.page_content)
    print (f"Page {index+1} has approx. {num_tokens_first_doc} tokens")

There are 5 pages in the document
Page 1 has approx. 533 tokens
Page 2 has approx. 1323 tokens
Page 3 has approx. 997 tokens
Page 4 has approx. 1643 tokens
Page 5 has approx. 867 tokens
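To make the comparison with the context window explicit, the per-page counts can be added up and checked against a budget. This is a rough sketch: the 100k window comes from the text above, while `RESERVED_TOKENS` is an assumed allowance for the prompt instructions and the generated summary, not a documented limit.

```python
# Per-page token counts copied from the output above.
page_tokens = [533, 1323, 997, 1643, 867]

CONTEXT_WINDOW = 100_000   # context size mentioned in the text above
RESERVED_TOKENS = 2_000    # assumed reserve for instructions and the response

def fits_in_context(token_counts, window=CONTEXT_WINDOW, reserved=RESERVED_TOKENS):
    """Return True when all pages together fit within the usable window."""
    return sum(token_counts) <= window - reserved

total = sum(page_tokens)
print(f"Total: {total} tokens, fits: {fits_in_context(page_tokens)}")
```

The five pages total 5,363 tokens, so the whole document is far below the budget.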


We will use LangChain `load_summarize_chain` with a `map_reduce` chain type. For more information on Summarization techniques with LangChain refer to [this document](https://python.langchain.com/docs/use_cases/summarization).
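Conceptually, a map-reduce summary first summarizes each page on its own (map) and then summarizes the combined page summaries (reduce). The sketch below illustrates only that pattern in plain Python. It is not LangChain's implementation, which also manages prompts and token limits; `map_reduce_summarize` and `first_sentence` are hypothetical names, and `first_sentence` is a toy stand-in for an LLM call.

```python
from typing import Callable, List

def map_reduce_summarize(pages: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each page (map), then summarize the joined summaries (reduce)."""
    page_summaries = [summarize(page) for page in pages]
    return summarize("\n\n".join(page_summaries))

def first_sentence(text: str) -> str:
    """Toy stand-in for an LLM call: keep only the text before the first period."""
    return text.split(".")[0].strip() + "."

pages = [
    "Page one covers deductibles. It has more detail.",
    "Page two covers co-pays. It has more detail.",
]
print(map_reduce_summarize(pages, first_sentence))  # Page one covers deductibles.
```

With a real model, the `summarize` callable is invoked once per page plus once for the final reduction, so the cost grows with the number of pages.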

In [39]:
from langchain.chains.summarize import load_summarize_chain

summary_chain = load_summarize_chain(llm=bedrock_llm, chain_type='map_reduce',
                                     verbose=True # Set verbose=True if you want to see the prompts being used
                                    )
output = summary_chain.run(document)



> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Write a concise summary of the following:


"Health Benefit Summary Plan Description Revised 01-01-2022 BENEFITS Healthcare Policy Plan Vision: To provide high quality, affordable healthcare for all citizens. Mission: To implement policy reforms and programs that expand access to healthcare, reduce costs, and improve health outcomes. Goals: 1. Achieve universal healthcare coverage. Provide health insurance for all citizens regardless of income or health status. 2. Reduce healthcare costs for individuals and government. Implement policies and programs to lower premiums, out-of-pocket costs, and overall healthcare spending. 3. Improve population health. Invest in public health programs and prevention to promote healthy lifestyles, reduce health risks, and improve health outcomes. 4. Support healthcare innovation. Invest in research and new technologie

In [40]:
print(output.strip())

Here is a concise summary:

The health benefit plan is a self-funded employer plan governed by ERISA. UMR and Express Scripts administer the medical and pharmacy benefits, respectively. The plan covers employees and dependents, funded by employee/employer contributions. It serves as both the summary plan description and official plan document. 

The plan has individual/family deductibles and out-of-pocket maximums. Expenses can combine to meet family limits, but individuals won't pay more than individual amounts. Pharmacy/medical costs apply to the same OOP max. Co-pays for some services don't apply to deductibles but do apply to OOP max. After deductibles, members pay coinsurance until reaching OOP max. Some costs are excluded from OOP max.

Providers can't waive required member costs; claims may be denied if they do. Waived claims may be reprocessed with proof of payment. Deductibles must be met each year before benefits pay. Deductible amounts apply to in/out-of-network limits. Some

## 3. Standardization
---

Let's try to standardize dates from our discharge summary document. Note that the document has dates in `DD-MON-YYYY` format, and we want to convert all of those dates to `DD/MM/YYYY` format. We will use simple prompt engineering techniques to show Claude an example and have it generate the output in JSON format (key-value pairs).

In [47]:
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/discharge-summary.png")
document = loader.load()

bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v1")

template1 = """

Given a full document, answer the question and format the output in the format specified. Skip any preamble text and just generate the JSON.

<format>
{{
  "key_name":"key_value"
}}
</format>
<document>{doc_text}</document>
<question>{question}</question>"""

template2 = """

Given a JSON document, format the dates in the value fields precisely in the provided format. Skip any preamble text and just generate the JSON.

<format>DD/MM/YYYY</format>
<json_document>{json_doc}</json_document>
"""


prompt1 = PromptTemplate(template=template1, input_variables=["doc_text", "question"])
llm_chain = LLMChain(prompt=prompt1, llm=bedrock_llm, verbose=True)

prompt2 = PromptTemplate(template=template2, input_variables=["json_doc"])
llm_chain2 = LLMChain(prompt=prompt2, llm=bedrock_llm, verbose=True)

chain = ( 
    llm_chain 
    | {'json_doc': lambda x: x['text'] }  
    | llm_chain2
)

std_op = chain.invoke({ "doc_text": document[0].page_content, 
                        "question": "Can you give me the patient admitted and discharge dates?"})

print(std_op['text'])



> Entering new LLMChain chain...
Prompt after formatting:

Given a full document, answer the question and format the output in the format specified. Skip any preamble text and just generate the JSON.

<format>
{
  "key_name":"key_value"
}
</format>
<document>Not a Memorial Hospital Of Collier Reg: PN/S/11011, Non-Profit Contact: (999)-(888)-(1234) Physician Hospital Discharge Summary Provider: Mateo Jackson, Phd Patient: John Doe Provider's Pt ID: 00988277891 Patient Gender: Male Attachment Control Number: XA/7B/00338763 Visit (Encounter) Admitted: 07-Sep-2020 Discharged: 08-Sep-2020 Discharged to: Home with support services Assessment Reported Symptoms / History 35 yo M c/o stomach problems since 2 montsh ago. Patient reports epigastric abdominal pain non-radiating. Pain is of present illness: described as gnawing and burning, intermitent lasting 1-2 hours, and gotten progressively worse. Antacids used to alleviate pain but not anymore; nothing exhacerbates pai

And we get a nicely formatted JSON output.
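Since date reformatting is deterministic, the model's result can be cross-checked in plain Python. The following is a minimal sketch under a few assumptions: the first chain returns a flat JSON object, the key names in `sample` are illustrative only (the model chooses its own keys), and the month abbreviation is parsed with `%b`, which assumes an English locale. `standardize_dates` is a hypothetical helper.

```python
import json
from datetime import datetime

def standardize_dates(json_text, source_format="%d-%b-%Y", target_format="%d/%m/%Y"):
    """Reformat every value that parses as a date; leave all other values alone."""
    result = {}
    for key, value in json.loads(json_text).items():
        try:
            result[key] = datetime.strptime(value, source_format).strftime(target_format)
        except (TypeError, ValueError):
            result[key] = value  # not a date string in the expected format
    return result

# Hypothetical first-step output; the dates come from the discharge summary.
sample = '{"admitted_date": "07-Sep-2020", "discharged_date": "08-Sep-2020"}'
print(standardize_dates(sample))
```

Comparing this result with the model's JSON is a cheap way to catch a swapped day and month.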

## 4. Spell check and corrections
---

Perform grammatical and spelling corrections on text extracted from a handwritten document.

In [58]:
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/hand_written_note.pdf")
document = loader.load()

template = """

Given a detailed 'Document', perform spelling and grammatical corrections. Ensure the output is coherent, polished, and free from errors. Skip any preamble text and give the answer.

<document>{doc_text}</document>
<answer>
"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
llm = Bedrock(client=bedrock, model_id="anthropic.claude-v1")
llm_chain = LLMChain(prompt=prompt, llm=llm)

try:
    txt = document[0].page_content
    std_op = llm_chain.run({"doc_text": txt})
    
    print("Extracted text")
    print("==============")
    print(txt)

    print("\nCorrected text")
    print("==============")
    print(std_op.strip())
    print("\n")
except Exception as e:
    print(str(e))

Extracted text
Patient John Doe, who was ad mitta with sever pnequonia, has shown Signif i art improumet & can be safely discharged. Follow w/s are scheduled for nen week. Patient John Doe, who was ad mitta with sever pnequonia, has shown Signif i art improumet & can be safely discharged. Follow w/s are scheduled for nen week. 

Corrected text
Patient John Doe, who was admitted with severe pneumonia, has shown significant improvement and can be safely discharged. Follow-up appointments are scheduled for next week.




## Cleanup
---
Let's delete the PDF file we uploaded earlier.

In [51]:
!aws s3api delete-object --bucket {data_bucket} --key bedrock-sample/health_plan.pdf

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
