# Extract Information with Amazon Bedrock

> This notebook should work well with the **`conda_python3`** kernel in SageMaker Notebook instances.

---

In this notebook, we demonstrate how to extract custom information from processed documents.

We will use [Amazon Bedrock](https://aws.amazon.com/bedrock/), invoking an LLM through a `boto3` call and passing the attribute schema and the processed document as inputs.

Note: all processed texts are stored in [processed-files](processed-files/) after running the previous notebook. If you have not run it yet, go back to `01_process_pdf.ipynb` first.

---

# 1. PREPARATIONS

In [39]:
import json
import os

import boto3
import botocore

As before, let's set up a Bedrock client:

In [28]:
bedrock_runtime = boto3.client("bedrock-runtime")

---

# 2. PROCESSING

### Input document

First, let's load one of the processed documents. Choose the version (Textract or Bedrock) that looked better in your visual inspection of the alternatives:

In [29]:
input_path = "demo-files"
output_path = "processed-files"
s3_bucket = "information-extraction-workshop"

In [None]:
processed_file = "2310.06825v1_bedrock.txt"  # enter the document name here

with open(f"{output_path}/{processed_file}", "r") as file:
    document = file.read()

print(document[:1000] + "...")  # preview the document

### Prompt template

We will put together a prompt that instructs the model to extract attributes. You can find example system and user prompts below. Feel free to iterate on the prompt!

In [31]:
system_prompt = """You are an AI assistant expert in extracting information from documents.
Carefully read the document given below in <document></document> tags.
Extract attributes listed below in <attributes></attributes> tags from the document.
The answer must contain the extracted attributes in JSON format. Do NOT include any other information in the answer.
If the attribute has multiple values, provide them as a list in this format: ["value1", "value2", "value3"].
If the attribute requires providing a description or free-form text, the value of the attribute must contain this text.
Note that some attributes are not directly stated in the document, but their values are implicitly defined in the text.
Do your best to extract a full value for each requested attribute from the document.
Output the JSON in <json></json> tags.
"""

prompt = """Here is the document:
<document>
{document}
</document>

Here are the attributes to extract:
<attributes>
{attributes}
</attributes>"""

Notice that we keep `{document}` and `{attributes}` placeholders. We will populate them with the actual data when running the prompt.
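
As a minimal sketch of how named placeholders like these get filled, Python's built-in `str.format` substitutes each `{name}` field with the matching keyword argument. The toy template and the values below are made up for illustration; they are not the workshop prompt or a real document:

```python
# Toy template using the same style of named placeholders (illustrative only)
toy_template = """<document>
{document}
</document>
<attributes>
{attributes}
</attributes>"""

# Each keyword argument replaces the placeholder with the same name
filled = toy_template.format(
    document="The quick brown fox jumps over the lazy dog.",
    attributes="1. Animal. The first animal mentioned in the text.",
)

print(filled)
```

Only braces in the template itself are interpreted; curly braces inside the substituted values are inserted unchanged.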

### Attributes schema

Let's define a list of attributes we want to extract. Think about what information you would like to get out of the document. Don't be afraid to get creative!

Remember to describe each attribute beyond its name. This helps the LLM identify the attributes in the document. We recommend the following format:

```
1. Attribute 1. Description of the attribute 1.
2. Attribute 2. Description of the attribute 2.
...
```
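
If you prefer to keep attribute definitions as data, the numbered format above can be generated rather than typed by hand. This is only a sketch: `attribute_defs`, `format_attributes`, and the second attribute are hypothetical names and values chosen for illustration.

```python
# Hypothetical attribute definitions as (name, description) pairs
attribute_defs = [
    ("Title", "The title of the document."),
    ("Authors", "The full names of all authors, as a list."),
]


def format_attributes(defs):
    """Render (name, description) pairs as a numbered list, one per line."""
    return "\n".join(
        f"{i}. {name}. {description}"
        for i, (name, description) in enumerate(defs, start=1)
    )


example_attributes = format_attributes(attribute_defs)
print(example_attributes)
```

The resulting string can be used wherever a hand-written attribute list is expected.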

In [33]:
#########################################
#    TASK 1: Define desired attributes
#########################################

attributes = """
1. Title. The title of the document.
...
"""

#########################################
#    TASK 1: Define desired attributes
#########################################

* If you have trouble after a few minutes, check out `answers/answers.ipynb`

### Insert variables

Now we have both the document and the attributes. Let's insert them into the prompt template before we send it to Bedrock!

In [None]:
#########################################
#    TASK 2: Insert documents and attributes into the prompt
#########################################

prompt = prompt

print(prompt)

#########################################
#    TASK 2: Insert documents and attributes into the prompt
#########################################

* If you have trouble after a few minutes, check out `answers/answers.ipynb`

### Invoke Bedrock

Now we can finally send our document to Amazon Bedrock!

In [41]:
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": prompt}],
    }
]

In [None]:
%%time

modelId = "anthropic.claude-3-haiku-20240307-v1:0"  # (change this to try different model versions)

body = json.dumps(
    {
        "max_tokens": 4_096,
        "system": system_prompt,
        "temperature": 0.0,
        "messages": messages,
        "anthropic_version": "bedrock-2023-05-31",
    }
)


try:
    response = bedrock_runtime.invoke_model_with_response_stream(
        body=body,
        modelId=modelId,
        accept="application/json",
        contentType="application/json",
    )

    output = ""

    stream = response.get("body")

    if stream:
        for event in stream:
            chunk = event.get("chunk")
            if chunk:
                chunk_obj = json.loads(chunk.get("bytes").decode())
                if "delta" in chunk_obj:
                    delta_obj = chunk_obj.get("delta", None)
                    if delta_obj:
                        text = delta_obj.get("text", None)
                        # Deltas without text (e.g. the final stop-reason delta) are skipped
                        if text:
                            output += text
                            print(text, end="")
        
        print("")

except botocore.exceptions.ClientError as error:
    if error.response["Error"]["Code"] == "AccessDeniedException":
        print(
            f"\x1b[41m{error.response['Error']['Message']}\
                \nTo troubleshoot this issue please refer to the following resources.\
                \nhttps://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_access-denied.html\
                \nhttps://docs.aws.amazon.com/bedrock/latest/userguide/security-iam.html\x1b[0m\n"
        )

    else:
        raise error

We got the LLM output! Is it accurate? If not, try going back and tweaking the prompt. 
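
The streaming loop above can also be factored into a small helper, which makes the accumulation logic testable without calling Bedrock. The sketch below assumes events shaped like the ones handled in the loop (a `chunk` holding JSON `bytes` with an optional `delta.text`); `collect_stream_text` and `fake_events` are hypothetical names, and the events are hand-written stand-ins, not real Bedrock output.

```python
import json


def collect_stream_text(events):
    """Concatenate the text deltas found in an iterable of stream events."""
    parts = []
    for event in events:
        chunk = event.get("chunk")
        if not chunk:
            continue
        payload = json.loads(chunk["bytes"].decode())
        # Events without a delta, or with a delta that has no text, are skipped
        text = (payload.get("delta") or {}).get("text")
        if text:
            parts.append(text)
    return "".join(parts)


# Hand-written stand-ins for stream events (assumed shape, not real output)
fake_events = [
    {"chunk": {"bytes": json.dumps({"type": "message_start"}).encode()}},
    {"chunk": {"bytes": json.dumps({"delta": {"text": "<json>{"}}).encode()}},
    {"chunk": {"bytes": json.dumps({"delta": {"text": "}</json>"}}).encode()}},
    {"chunk": {"bytes": json.dumps({"delta": {"stop_reason": "end_turn"}}).encode()}},
]

collected = collect_stream_text(fake_events)
print(collected)
```

With a real response, the helper would be called as `collect_stream_text(response["body"])`.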

Notice that the output is still plain text. We can't easily put this text into a table or a database. How can we go from the text output to an actual JSON object? This will be your final task in this notebook!

In [47]:
#########################################
#    TASK 3: Parse the output to get JSON
#########################################

json_file = {}

#########################################
#    TASK 3: Parse the output to get JSON
#########################################

* If you have trouble after a few minutes, check out `answers/answers.ipynb`
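
One possible approach, shown here only as a sketch and not necessarily the one in `answers/answers.ipynb`, is to pull out the text between the `<json></json>` tags with a regular expression and hand it to `json.loads`. The helper name `extract_json` and the sample string are hypothetical.

```python
import json
import re


def extract_json(llm_output):
    """Return the object parsed from between <json> and </json>, or None."""
    # DOTALL lets "." match newlines, since the JSON usually spans several lines
    match = re.search(r"<json>(.*?)</json>", llm_output, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        # The tags were present but their content was not valid JSON
        return None


# Made-up model output for illustration
sample_output = 'Some preamble.\n<json>\n{"Title": "Example", "Authors": ["A", "B"]}\n</json>'

parsed = extract_json(sample_output)
print(parsed)
```

Returning `None` on failure is a design choice for this sketch; in a pipeline you may prefer to raise an error or retry the request.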

Congrats! You were able to convert an unstructured document into a JSON file!

## Optional tasks

Once you are done, consider experimenting with one of the tasks below:

Prompt engineering:
- Add chain-of-thought reasoning instructions to improve the extraction accuracy. This would let the LLM "think" by responding with a few sentences before outputting the JSON. Do the explanations make sense?
- Some studies report that the best Q&A performance is achieved when the prompt combines both the text and the visual version of the document. Can you implement this?
- Add few-shot example(s). If you struggle to come up with one, generate them! How does that affect model performance?
- Switch to a different LLM and compare the answer quality. Which model is doing best?
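
For the few-shot task, one way to supply examples is as earlier conversation turns: each example becomes a user message followed by the assistant reply you would like to see, and the real prompt goes last. The sketch below follows the message structure used earlier in this notebook; `build_few_shot_messages` and the example strings are hypothetical.

```python
def build_few_shot_messages(examples, final_prompt):
    """Build a messages list with (user_text, assistant_text) examples first."""

    def turn(role, text):
        return {"role": role, "content": [{"type": "text", "text": text}]}

    messages = []
    for user_text, assistant_text in examples:
        messages.append(turn("user", user_text))
        messages.append(turn("assistant", assistant_text))
    # The actual request always comes last, as a user turn
    messages.append(turn("user", final_prompt))
    return messages


# Made-up example pair for illustration
few_shot_examples = [
    (
        "<document>Report by Jane Doe.</document>\n<attributes>1. Author.</attributes>",
        '<json>{"Author": "Jane Doe"}</json>',
    ),
]

few_shot_messages = build_few_shot_messages(few_shot_examples, "Here is the document: ...")
print(len(few_shot_messages))
```

The returned list can be passed as `"messages"` in the request body in place of the single-message list.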

Database:
- Try uploading the resulting JSON object to DynamoDB. Start by searching the `boto3` documentation for a DynamoDB client and implementing a put request. Next, head over to DynamoDB in the AWS console and check whether the table has been populated.
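
A minimal sketch of the DynamoDB step, under stated assumptions: the table name `extracted-attributes` and the partition key `document_id` are hypothetical, and the table must already exist. The `boto3` resource interface does not accept Python floats, so the sketch converts them to `Decimal` before the put request. The AWS call itself is left commented out so the block runs without credentials.

```python
import json
from decimal import Decimal


def to_dynamodb_item(document_id, extracted):
    """Prepare extracted attributes for DynamoDB (floats become Decimal)."""
    # Round-tripping through JSON converts every float to Decimal
    item = json.loads(json.dumps(extracted), parse_float=Decimal)
    item["document_id"] = document_id  # assumed partition key name
    return item


def save_item(table, item):
    """Write one item using a boto3 DynamoDB Table resource."""
    table.put_item(Item=item)


# Made-up extraction result for illustration
item = to_dynamodb_item("2310.06825v1", {"Title": "Example", "Score": 0.5})
print(item)

# With AWS credentials and an existing table (names are assumptions):
# import boto3
# table = boto3.resource("dynamodb").Table("extracted-attributes")
# save_item(table, item)
```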

Orchestration:
- Currently, document processing and information extraction are split into two notebooks. Can you merge both steps into a single helper function? 

## Next steps

Scaling this pipeline to hundreds or thousands of documents would require setting up cloud infrastructure to orchestrate the information extraction steps. See the diagram below for one of the recommended options:

![screenshots/workshop_arch.png](screenshots/workshop_arch.png)