# Module 2 - Document classification & summarization
---

<div class="alert alert-block alert-info"> 
    <b>NOTE:</b> You will need to use a Jupyter Kernel with Python 3.9 or above to use this notebook. If you are in Amazon SageMaker Studio, you can use the `Data Science 3.0` image.
</div>

In this notebook, we demonstrate how you can integrate Amazon Textract with LangChain as a document loader to extract data from documents and use generative AI capabilities within the various IDP phases. We will perform the following with different LLMs.

- Classification
- Summarization
- Spell check corrections

For the documents, we will use samples that our workflow processed in the previous notebook. The samples are present in the directory `sample-docs`.

In [None]:
import json
import os
import sys
import sagemaker
import boto3

role = sagemaker.get_execution_role()
data_bucket = sagemaker.Session().default_bucket()
bedrock = boto3.client('bedrock-runtime')
s3 = boto3.client("s3")
print(f"SageMaker bucket is {data_bucket}, and SageMaker Execution Role is {role}")

In [None]:
br = boto3.client('bedrock')
resp = br.list_foundation_models(byProvider='anthropic')
for model in resp['modelSummaries']:
    print(model['modelId'])

In [None]:
MODEL_ID = "anthropic.claude-instant-v1"

# 1. Document Classification
---

Classify a document based on its content, given a list of classes. For this exercise, we will use the `sample-docs/mixed_sample_0.pdf` document. Let's take a look at the document.

In [None]:
from IPython.display import IFrame
document_path=f"s3://{data_bucket}/textract-linearized-output/uploads/mixed_sample_0"
IFrame("./sample-docs/mixed_sample_0.pdf", width=600, height=800)

As we can see, the document is a multi-page PDF that contains several kinds of documents. Specifically, it contains the following.

- A bank statement
- Patient discharge summary
- Health Plan document
- Doctor's notes
- Driver's License
- Invoice

We will first load all the text that has been extracted and stored in S3 and then try to classify each page using an LLM.

In [None]:
from read_doc_from_s3 import read_document
document = read_document(doc_path=document_path)
document

Now that we have loaded the extracted document text from S3, let's define a list of classes for our LLM.

In [None]:
document_classes = ['BANK_STMT','DISCHARGE_SUMMARY', 'HEALTH_PLAN', 'DOCTORS_NOTE', 'ID_DOCUMENT', 'INVOICE']
classes = ",".join(document_classes)
classes

In [None]:
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

template = """

Given a list of classes, classify the document into one of these classes. Skip any preamble text and just give the class name.

<classes>{classes}</classes>
<document>{doc_text}</document>
<classification>"""

prompt = PromptTemplate(template=template, input_variables=["classes","doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id=MODEL_ID, model_kwargs={'temperature':0})

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)

for index, doc in enumerate(document):
    class_name = llm_chain.run({"classes": classes, "doc_text": doc})
    print(f"Page {index+1} is a document of type {class_name}")
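
The model's reply is free text, so it can carry stray whitespace, a closing tag, or a label that is not in the list. A minimal sketch of normalizing the reply against `document_classes` is shown below. The helper name `normalize_class` and the `UNKNOWN` fallback are hypothetical and not part of the workshop code.

```python
import re

document_classes = ['BANK_STMT', 'DISCHARGE_SUMMARY', 'HEALTH_PLAN',
                    'DOCTORS_NOTE', 'ID_DOCUMENT', 'INVOICE']

def normalize_class(raw_reply, allowed, fallback="UNKNOWN"):
    """Map a raw LLM reply to one of the allowed class names (hypothetical helper)."""
    # Drop XML-style tags such as </classification>, then trim and upper-case.
    cleaned = re.sub(r"<[^>]+>", " ", raw_reply).strip().upper()
    # Exact match first.
    if cleaned in allowed:
        return cleaned
    # Otherwise accept the reply only if exactly one allowed class appears in it.
    hits = [name for name in allowed if name in cleaned]
    return hits[0] if len(hits) == 1 else fallback

print(normalize_class(" HEALTH_PLAN</classification>\n", document_classes))
```

In the loop above, this could be applied as `normalize_class(class_name, document_classes)` before printing, so that ambiguous replies are flagged rather than passed downstream.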



# 2. Document Summarization
---

Summarize large pieces of text from a document into smaller, more concise explanations. In this block we will perform a single-page summary. We will use the `sample-docs/health_plan.pdf` document for summarization. As before, let's look at the document and load its extracted text from S3.

In [None]:
from IPython.display import IFrame

summary_document_path=f"s3://{data_bucket}/textract-linearized-output/uploads/health_plan"

IFrame("./sample-docs/health_plan.pdf", width=600, height=800)

Let's load this document's extracted text pages from S3. We will first select a single page (page 2) of the document and summarize it.

In [None]:
from read_doc_from_s3 import read_document
document = read_document(doc_path=summary_document_path)
page_2 = document[1]
print(page_2)

In [None]:
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

template = """

Given a full document, give me a concise summary. Skip any preamble text and just give the summary.

<document>{doc_text}</document>
<summary>"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id=MODEL_ID, model_kwargs={'temperature':0})

num_tokens = bedrock_llm.get_num_tokens(page_2)
print(f"The page text has {num_tokens} tokens \n\n=========================\n")

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
summary = llm_chain.run(page_2)

print(summary.replace("</summary>","").strip())

# Multi-page summarization
---

We will now summarize the entire multi-page health plan document. Let's load all pages of the document from our S3 location, but this time we will load each page in the LangChain `Document` schema by passing `return_as="langchain_doc"` to our `read_document` function. Add that parameter to the function call, execute the code cell, and notice the structure of `full_document`, which is a list of `Document(...)` objects. We need the document in this format, rather than plain text, because the subsequent summary generation code uses the `load_summarize_chain` method.

<div class="alert alert-block alert-info"> 
    <b>INSTRUCTION:</b> Go ahead and add the new argument <b>return_as="langchain_doc"</b> to the <b>read_document()</b> function and execute the code cell.
</div>

In [None]:
from read_doc_from_s3 import read_document

# add the parameter to the read_document() function call below
full_document = read_document(doc_path=summary_document_path) 

full_document

With a 100k-token context window, the document fits comfortably, so we don't need to split it further. Let's check the token count of each page.

In [None]:
from langchain.llms import Bedrock
bedrock_llm = Bedrock(client=bedrock, model_id=MODEL_ID, model_kwargs={'temperature':0})

# Count pages from full_document, the list of Document objects loaded above
num_docs = len(full_document)
print(f"There are {num_docs} pages in the document")
for index, doc in enumerate(full_document):
    num_tokens = bedrock_llm.get_num_tokens(doc.page_content)
    print(f"Page {index+1} has approx. {num_tokens} tokens")
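
The per-page counts can be summed and compared against the context window to confirm that no further splitting is needed. The sketch below assumes the 100k-token window mentioned above. The helper name `fits_in_context`, the 4,000-token output reserve, and the example counts are assumptions for illustration, not values from the workshop.

```python
def fits_in_context(page_token_counts, context_window=100_000, reserved_for_output=4_000):
    """Return (fits, total) for a list of per-page token counts (hypothetical helper)."""
    total = sum(page_token_counts)
    # Leave headroom for the generated summary (the reserve is an assumed value).
    return total + reserved_for_output <= context_window, total

# Illustrative counts only; in the notebook these would come from get_num_tokens().
example_counts = [850, 1200, 990]
fits, total = fits_in_context(example_counts)
print(f"Total tokens: {total}, fits in one call: {fits}")
```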

We will use the LangChain `load_summarize_chain` method with the `map_reduce` chain type. For more information on summarization techniques with LangChain, refer to [this document](https://python.langchain.com/docs/use_cases/summarization).

In [None]:
from langchain.chains.summarize import load_summarize_chain

summary_chain = load_summarize_chain(llm=bedrock_llm, chain_type='map_reduce',
                                     verbose=True # Set verbose=True if you want to see the prompts being used
                                    )
output = summary_chain.run(full_document)

In [None]:
print(output.strip())
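
To make the `map_reduce` idea concrete, here is a simplified plain-Python illustration: each page is summarized on its own (map), and the page summaries are then joined and summarized once more (reduce). This is only a sketch of the concept, not LangChain's implementation, which uses its own prompts and additional steps. The names `map_reduce_summarize` and `first_sentence` and the sample page strings are made up for illustration, and `first_sentence` merely stands in for an LLM call.

```python
def map_reduce_summarize(pages, summarize):
    """Summarize each page (map), then summarize the joined page summaries (reduce)."""
    page_summaries = [summarize(page) for page in pages]  # map step
    combined = "\n".join(page_summaries)
    return summarize(combined)                             # reduce step

# Stand-in for an LLM call: keeps only the first sentence of the text.
def first_sentence(text):
    return text.split(".")[0].strip() + "."

# Made-up page text, for illustration only.
pages = ["Plan covers dental. Details follow.", "Deductible is listed. More text."]
print(map_reduce_summarize(pages, first_sentence))
```

With a real model, `summarize` would be a function that sends the text to the LLM with a summarization prompt and returns the reply.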

# 3. [BONUS] Spell check and corrections
---

This is an optional bonus exercise. In this exercise, we perform grammatical and spelling corrections on text extracted from a handwritten document. Let's load the document and its extracted text.

In [None]:
from IPython.display import IFrame

hand_written_doc_path=f"s3://{data_bucket}/textract-linearized-output/uploads/hand_written_note"

IFrame("./sample-docs/hand_written_note.pdf", width=600, height=500)

We only need the plain text of the document in this case, so let's load that from S3.

In [None]:
from read_doc_from_s3 import read_document
document = read_document(doc_path=hand_written_doc_path)
print(document[0])

As you can see, the extracted text contains several mistakes because the handwriting in the document is hard to read. Let's attempt to rectify this using an LLM.

In [None]:
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain


template = """

Given a detailed 'Document', perform spelling and grammatical corrections. Ensure the output is coherent, polished, and free from errors. Skip any preamble text and give the answer.

<document>{doc_text}</document>
<answer>
"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
llm = Bedrock(client=bedrock, model_id=MODEL_ID, model_kwargs={'temperature':0})
llm_chain = LLMChain(prompt=prompt, llm=llm)

try:
    txt = document[0]
    std_op = llm_chain.run({"doc_text": txt})
    
    print("Extracted text")
    print("==============")
    print(txt)

    print("\nCorrected text")
    print("==============")
    print(std_op.strip())
    print("\n")
except Exception as e:
    print(str(e))
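
To see exactly what the model changed, the extracted and corrected texts can be compared word by word. The sketch below uses Python's standard `difflib` module. The helper name `changed_words` and the sample strings are made up for illustration; in the notebook, `txt` and `std_op.strip()` would be passed instead.

```python
import difflib

def changed_words(original, corrected):
    """Return (before, after) pairs for word runs that differ between two texts."""
    a, b = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record the differing run from each side; either side may be empty.
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes

# Made-up strings for illustration only.
print(changed_words("Pleese call me tomorow", "Please call me tomorrow"))
```

Reviewing the pairs is a quick way to check that the model corrected spelling without rewriting content it should have left alone.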

## Cleanup
---

We will perform cleanup at the end of the workshop.

## Conclusion
---

In this module, we classified the pages of a multi-page PDF document using the text extracted from it. We also summarized a single page and a full multi-page document to generate concise summaries. As a bonus, we took a sample document with poorly handwritten text and used an LLM to correct the grammatical and spelling mistakes in its extracted text. In the next module, we will explore how to perform structured entity extraction from our documents.