# Module 3 - Structured data extraction
---

<div class="alert alert-block alert-info"> 
    <b>NOTE:</b> You will need to use a Jupyter Kernel with Python 3.9 or above to use this notebook. If you are in Amazon SageMaker Studio, you can use the "Data Science 3.0" image.
</div>

In this notebook, we will walk through how to perform _"templating, normalizations, and entity extractions"_ on text from documents. We will use a document and its extracted text from our workflow in Module 1, where the text was extracted using the Amazon Textract `AnalyzeDocument` API with the `LAYOUT` feature. We will then use an LLM and prompt engineering techniques to get the data in the desired format.

<div class="alert alert-block alert-info"> 
    <b>NOTE:</b> You can ignore any WARNINGS during the `pip installs`.
</div>

In [None]:
!pip install -U faiss-cpu

In [None]:
import json
import os
import sys
import sagemaker
import boto3

role = sagemaker.get_execution_role()
data_bucket = sagemaker.Session().default_bucket()
bedrock = boto3.client('bedrock-runtime')
br = boto3.client('bedrock')
s3 = boto3.client('s3')
textract = boto3.client('textract')
print(f"SageMaker bucket is {data_bucket}, and SageMaker Execution Role is {role}")

In [None]:
MODEL_ID = "anthropic.claude-instant-v1"

# Templating & Normalizations
---

The most common way to extract information out of documents is via key-value pairs. At times, you may want the output from your document to be in a specific format so that it is easier to consume in your downstream system. One way to achieve this is to specify a template of the output structure.

In this notebook, we will use a document that has Form components in it as well as some dense text laid out in columns. We will use Amazon Textract's layout feature to read the document in the correct reading order. However, our final goal is to get a specific set of information (entities) in a specific format so that we can easily consume the output downstream.

Let's take a look at the document.

In [None]:
from IPython.display import Image
input_document_path=f"s3://{data_bucket}/output/discharge-summary/text_pages"
Image(filename='./sample-docs/discharge-summary.png',width=500)

We will try to extract the following information from the document in key-value pair format.

- Doctor's name
- Provider ID
- Patient's name
- Patient ID
- Patient gender
- Patient age
- Admitted date
- Discharge date
- Discharged to
- Drug allergies

Since the linearized text extracted from the document is already stored in S3, we will read the document text from that location.

In [None]:
from read_doc_from_s3 import read_document
document = read_document(doc_path=input_document_path)
full_text = document[0].strip()

print(full_text)

The text above comes from the LAYOUT feature. We have written a small linearizer function that generates the text in the proper reading order.

## Define the extraction template
---
Based on the fields we need to extract, we will define a template that will be used by the LLM to extract the entities. Let's start by creating a template with the first six values we want to extract from the document.

In [None]:
# import json
output_template= {
    "doctor_name":{ "type": "string", "description": "The doctor or provider's full name" },
    "provider_id":{ "type": "string", "description": "The doctor or provider's ID" },
    "patient_name":{ "type": "string", "description": "The patient's full name" },
    "patient_id":{ "type": "string", "description": "The patient's ID" },
    "patient_gender":{ "type": "string", "description": "The patient's gender" },
    "patient_age":{ "type": "number",  "description": "The patient's age" }
}

In [None]:
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

template = """

You are a helpful assistant. Please extract the following details from the document and format the output as JSON using the keys. Skip any preamble text and generate the final answer.

<details>
{details}
</details>

<keys>
{keys}
</keys>

<document>
{doc_text}
</document>

<final_answer>"""


details = "\n".join([f"{key}: {value['description']}" for key, value in output_template.items()])
keys = "\n".join([f"{key}" for key, value in output_template.items()])

prompt = PromptTemplate(template=template, input_variables=["details", "keys", "doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id=MODEL_ID, model_kwargs={'temperature':0})

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
output = llm_chain.run({"doc_text": full_text, "details": details, "keys": keys})

print(output)
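
The model returns its answer as a plain string, so a downstream system still has to turn it into a dictionary. The following is a minimal sketch of that step, assuming the output contains a single JSON object that may be surrounded by stray text such as a closing tag. The helper names `extract_json` and `coerce_types`, and the sample values, are hypothetical and used for illustration only.

```python
import json

# Hypothetical raw model output; real output will differ.
sample_output = """ {
  "doctor_name": "Jane Doe",
  "patient_age": "47"
}
</final_answer>"""

# Same shape as the `output_template` defined earlier (trimmed for brevity).
sample_template = {
    "doctor_name": {"type": "string", "description": "The doctor or provider's full name"},
    "patient_age": {"type": "number", "description": "The patient's age"},
}


def extract_json(raw_text):
    """Parse the span between the first '{' and the last '}' in the text."""
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in model output")
    return json.loads(raw_text[start:end + 1])


def coerce_types(values, template):
    """Cast values to the type declared in the template, where possible."""
    result = {}
    for key, spec in template.items():
        value = values.get(key)
        if value is not None and spec["type"] == "number":
            try:
                number = float(value)
                # Keep whole numbers as int so that 47 does not become 47.0
                value = int(number) if number.is_integer() else number
            except (TypeError, ValueError):
                pass  # leave the value untouched if it is not numeric
        result[key] = value
    return result


record = coerce_types(extract_json(sample_output), sample_template)
print(record)
```

Keys that the model did not return come back as `None`, which makes missing fields easy to detect before the record is passed on.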

## More Structure with LangChain Response Schemas
---

In the above example, we provided `keys` and `details` to the prompt template by building strings from the extraction template. A cleaner way is to define the schema in code and let a LangChain output parser generate the format instructions for the LLM. LangChain offers `ResponseSchema` with `StructuredOutputParser` for this purpose; here we use `PydanticOutputParser`. To use `PydanticOutputParser`, **we define the template using a small Python class instead of free-form JSON**. We can then include the generated format instructions in our prompt template, and later use the same output parser to turn the model's response into an object that is easy to convert into a dictionary.

First, let's add the rest of the entities that we want to extract from the document. Go ahead and un-comment the commented lines (delete the `#` sign) in the `ExtractionEntities` class below and execute the code cell. We will also try to split the patient's first and last names into separate fields.

In the following code cell, we use the `ExtractionEntities` class to create the format instruction text, and also initialize an `output_parser` that will be used later to parse the output.

<div class="alert alert-block alert-info"> 
    <b>INSTRUCTION:</b> delete the "#" sign from the lines to un-comment, in the code block below.
</div>

In [None]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import PydanticOutputParser
from langchain.pydantic_v1 import BaseModel, Field, validator
    
class ExtractionEntities(BaseModel):
    doctor_name: str = Field(description="The doctor or provider's full name")
    provider_id: str = Field(description="The doctor or provider's ID")
    # NOTICE: Here we have split the patient's first and last name
    patient_first_name: str = Field(description="The patient's first and middle name") 
    patient_last_name: str = Field(description="The patient's last name")
    patient_id: str = Field(description="The patient's ID")
    patient_gender: str = Field(description="The patient's gender")
    patient_age: str = Field(description="The patient's age")
    #  Un-comment the lines below
    # admitted_date: str = Field(description="Date the patient was admitted to the hospital")
    # discharge_date: str = Field(description="Date the patient was discharged from the hospital")
    # discharged_to: str = Field(description="The disposition of where the patient was released or discharged to")
    # drug_allergies: str = Field(description="The patient's known drug allergies")
    
    
# Build a parser from the Pydantic class and generate the format instructions for the prompt
output_parser = PydanticOutputParser(pydantic_object=ExtractionEntities)
format_instructions= output_parser.get_format_instructions()
print(format_instructions)

What we did above is use the output parser's `get_format_instructions()` method to build an instruction prompt from the desired template. This removes the need to build our own prompt from the template JSON every time a new entity is added. You could also store the template in a file and maintain it separately, so that the code simply reads the file to generate the prompt instructions.
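
To illustrate the idea of keeping the template in a separate file, here is a minimal, standard-library-only sketch. It is not how LangChain builds its format instructions; the file name `extraction_template.json` and the helpers `load_template` and `build_format_instructions` are hypothetical, and the generated text is only a rough stand-in for what an output parser would produce.

```python
import json
import os
import tempfile

# Hypothetical template content; in practice this lives in a file you maintain.
template_fields = {
    "doctor_name": {"type": "string", "description": "The doctor or provider's full name"},
    "patient_age": {"type": "number", "description": "The patient's age"},
}


def load_template(path):
    """Read the extraction template from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def build_format_instructions(template):
    """Render a JSON-schema style instruction block from the template."""
    schema = {
        "type": "object",
        "properties": {
            key: {"type": spec["type"], "description": spec["description"]}
            for key, spec in template.items()
        },
        "required": list(template),
    }
    return (
        "Return a single JSON object that conforms to this JSON schema:\n"
        + json.dumps(schema, indent=2)
    )


# Write the template to a temporary file so that this sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    template_path = os.path.join(tmp_dir, "extraction_template.json")
    with open(template_path, "w", encoding="utf-8") as f:
        json.dump(template_fields, f)
    file_instructions = build_format_instructions(load_template(template_path))

print(file_instructions)
```

With this layout, adding a new entity means editing the JSON file only; the prompt-building code stays unchanged.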

Our code here is pretty similar to before with the exception of the format instructions in the prompt template. We also instruct the model to strictly adhere to the format instructions when generating the output, so that our `output_parser` can parse it.

In [None]:
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

template = """

You are a helpful assistant. Please extract the following details from the document and strictly follow the instructions described in the format instructions to format the output. Skip any preamble text and generate the final answer. Do not generate incomplete answer.

<format_instructions>
{format_instructions}
</format_instructions>

<document>
{doc_text}
</document>

<final_answer>"""


prompt = PromptTemplate(template=template, 
                        input_variables=["doc_text"],
                        partial_variables={"format_instructions": format_instructions})

bedrock_llm = Bedrock(client=bedrock, model_id=MODEL_ID, model_kwargs={'temperature':0})

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
# Only doc_text is needed here; format_instructions is already bound as a partial variable
output = llm_chain.run({"doc_text": full_text})

parsed_output= output_parser.parse(output)
parsed_output

In [None]:
final_output = dict(parsed_output)
final_output

In [None]:
final_output['doctor_name']

## Correction of incomplete generation by the LLM
---

In some cases, the model may not generate the entire structured output as prompted, resulting in incomplete JSON or missing values. We can make the implementation more robust, as a production deployment would require, by catching the parsing error and re-prompting the model to correct its output. In the code cell below, if the `output_parser.parse()` method fails to parse the LLM's output, the most likely cause is that the LLM generated incomplete or malformed output. The `except` block then catches the error and uses `OutputFixingParser` to re-prompt the model for the full response.

In [None]:
from langchain.output_parsers import OutputFixingParser
from langchain.schema import OutputParserException

try:
    parsed_output= output_parser.parse(output)
except OutputParserException as e:
    new_parser = OutputFixingParser.from_llm(
        parser=output_parser,
        llm=bedrock_llm
    )
    parsed_output = new_parser.parse(output)
    
final_output = dict(parsed_output)
final_output
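
If you want explicit control over how many times the model is re-prompted, the same catch-and-retry pattern can be written by hand. The sketch below uses only the standard library and plain JSON parsing; `parse_with_retries` is a hypothetical helper, and `fake_fix` is a stand-in for a real call to the model that would receive the bad output and the error message.

```python
import json


def parse_with_retries(raw_output, fix_fn, max_attempts=2):
    """Try to parse JSON; on failure ask `fix_fn` for a corrected completion."""
    candidate = raw_output
    last_error = None
    for attempt in range(max_attempts + 1):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError as err:
            last_error = err
            if attempt == max_attempts:
                break
            # Hand the broken output and the parser error back to the fixer
            candidate = fix_fn(candidate, str(err))
    raise ValueError(
        f"Could not parse model output after {max_attempts} fix attempts: {last_error}"
    )


def fake_fix(bad_output, error_message):
    """Stand-in for re-prompting the LLM; returns a hard-coded, complete JSON string."""
    return '{"doctor_name": "Jane Doe", "provider_id": "P-001"}'


# Hypothetical truncated model output
truncated = '{"doctor_name": "Jane Doe", "provider_'
fixed = parse_with_retries(truncated, fake_fix)
print(fixed)
```

Bounding the number of attempts keeps a persistently malformed response from triggering an unbounded number of model calls.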

# Value standardization
---

We were able to get structured key-values out of the document using the LLM so far. We would also like to standardize some of the outputs. For example we would like the dates in the output to be of DD/MM/YYYY format instead of DD-Mon-YYYY format. Let's see if we can quickly update the format instructions to achieve this.

For the two date keys, we will add an additional instruction: `This should be formatted in DD/MM/YYYY format.` Go ahead and modify the Field description of both date fields so that it looks like the example below.

<div class="alert alert-block alert-info"> 
    <b>INSTRUCTION:</b> modify the Field description. Do this step for both the <b>admitted_date</b> and <b>discharge_date</b> fields. For example - <br><br>
    <b>Field(description="The patient's date of birth, this should be formatted in DD/MM/YYYY format")</b>
</div>

In [None]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import PydanticOutputParser
from langchain.pydantic_v1 import BaseModel, Field, validator
from datetime import date

class DateEntities(BaseModel):
    # Modify the field description for the next two fields
    admitted_date: str = Field(description="Date the patient was admitted to the hospital")
    discharge_date: str = Field(description="Date the patient was discharged from the hospital")
    ######
    
output_parser = PydanticOutputParser(pydantic_object=DateEntities)
format_instructions= output_parser.get_format_instructions()
print(format_instructions)

In [None]:
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.output_parsers import OutputFixingParser
from langchain.schema import OutputParserException

template = """

You are a helpful assistant. Please extract the following details from the document and strictly follow the instructions described in the format instructions and additional instructions to format the output. Skip any preamble text and generate the final answer. Do not generate incomplete answer.

<format_instructions>
{format_instructions}
</format_instructions>

<document>
{doc_text}
</document>

<final_answer>"""


prompt = PromptTemplate(template=template, 
                        input_variables=["doc_text"],
                        partial_variables={"format_instructions": format_instructions})

bedrock_llm = Bedrock(client=bedrock, model_id=MODEL_ID, model_kwargs={'temperature':0})

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
# Only doc_text is needed here; format_instructions is already bound as a partial variable
output = llm_chain.run({"doc_text": full_text})

try:
    parsed_output= output_parser.parse(output)
except OutputParserException as e:
    new_parser = OutputFixingParser.from_llm(
        parser=output_parser,
        llm=bedrock_llm
    )
    parsed_output = new_parser.parse(output)
    
final_output = dict(parsed_output)
final_output

In [None]:
print(final_output['admitted_date'])
print(final_output['discharge_date'])
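
Because the date format is requested through the prompt, it can be worth checking the result in code before passing it downstream. The following is a minimal sketch of such a check, assuming the model returns either the requested DD/MM/YYYY format or one of a few other common formats. The helper `normalize_date`, the list of accepted formats, and the sample dates are assumptions made for illustration.

```python
from datetime import datetime

# Input formats we are willing to accept, tried in order (an assumption of this sketch).
KNOWN_FORMATS = ("%d/%m/%Y", "%d-%b-%Y", "%d-%B-%Y", "%Y-%m-%d")


def normalize_date(value):
    """Return the date as DD/MM/YYYY, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%d/%m/%Y")
        except ValueError:
            continue  # try the next format
    return None


print(normalize_date("07-Mar-2023"))
print(normalize_date("07/03/2023"))
print(normalize_date("not a date"))
```

A `None` result signals that the value needs attention, for example by re-prompting the model or flagging the record for review.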

## Cleanup
---

We will perform cleanup at the end of the workshop.

## Conclusion
---

In this module, we performed structured data extraction from our document using an LLM with Amazon Bedrock and the text generated by Amazon Textract's LAYOUT feature. We first used JSON to define a template, and then used prompt engineering techniques to get the desired output from the LLM. Then, we used a cleaner, more programmatic way to define the data template and used it to extract the data. We also looked at how to handle incomplete results from the LLM by catching the parsing error and re-prompting the model to complete the output. Finally, we standardized the date formats by simply adding a formatting instruction to the description of the date fields. In the next module, we will perform table self-querying and look at some of the basics of loading document text into a vector database.