# 03 - Layout-aware text extraction with Amazon Textract

In [2]:
%pip install -q amazon-textract-textractor[pdf]

Note: you may need to restart the kernel to use updated packages.


In [3]:
!uname -a

Linux sagemaker-ds-3-instance 4.14.336-257.568.amzn2.x86_64 #1 SMP Sat Mar 23 09:49:55 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux


In [4]:
!sudo apt-get update -y 2> /dev/null

In [5]:
!sudo apt install poppler-utils -y 2> /dev/null

In [6]:
# used by amazon-textract-textractor to visualize images with extraction results
%pip install -q pdf2image

Note: you may need to restart the kernel to use updated packages.


In [7]:
ls raw_documents/

[0m[01;34mAmazon[0m/  [01;34mprepared[0m/


In [8]:
ls raw_documents/prepared/

[0m[01;34mAmazon[0m/  metadata.json


In [9]:
ls raw_documents/prepared/Amazon/

annual_report_2021.pdf  annual_report_2022.pdf


In [10]:
!python -m json.tool raw_documents/prepared/metadata.json

[
    {
        "company": "Amazon",
        "year": "2022",
        "doc_url": "https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/Amazon-2022-Annual-Report.pdf",
        "local_pdf_path": "raw_documents/prepared/Amazon/annual_report_2022.pdf",
        "pages_kept": [
            15,
            17,
            18,
            47,
            48
        ]
    },
    {
        "company": "Amazon",
        "year": "2021",
        "doc_url": "https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/Amazon-2021-Annual-Report.pdf",
        "local_pdf_path": "raw_documents/prepared/Amazon/annual_report_2021.pdf",
        "pages_kept": [
            14,
            16,
            17,
            18,
            46,
            47
        ]
    }
]


## Extraction with textractor

In [11]:
import sagemaker

default_sagemaker_bucket = sagemaker.Session().default_bucket()
sagemaker_execution_role = sagemaker.get_execution_role()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [14]:
import boto3
from textractor import Textractor
from textractor.data.constants import TextractFeatures

region = boto3.session.Session().region_name
# extractor = Textractor(profile_name="default")
extractor = Textractor(region_name=region)

input_document = "raw_documents/prepared/Amazon/annual_report_2022.pdf"

document = extractor.start_document_analysis(
    file_source=input_document,
    s3_upload_path=f"s3://{default_sagemaker_bucket}/input_documents/",
    s3_output_path=f"s3://{default_sagemaker_bucket}/output_documents/",
    features=[TextractFeatures.LAYOUT],
    save_image=False
)

In [15]:
document.document

In [16]:
document.pages[0]

This Page (1) holds the following data:
Words - 788
Lines - 54
Key-values - 0
Checkboxes - 0
Tables - 0
Queries - 0
Signatures - 0
Expense documents - 0
Layout elements - 10

In [17]:
print(document.pages[4].to_markdown())

AMAZON.COM, INC. 

# CONSOLIDATED STATEMENTS OF OPERATIONS 

(in millions, except per share data) 



Year Ended December 31,
2020	2021	2022
Net product sales	$	215,915	$	241,787	$	242,901
Net service sales	170,149	228,035	271,082
Total net sales	386,064	469,822	513,983
Operating expenses:
Cost of sales	233,307	272,344	288,831
Fulfillment	58,517	75,111	84,299
Technology and content	42,740	56,052	73,213
Sales and marketing	22,008	32,551	42,238
General and administrative	6,668	8,823	11,891
Other operating expense (income), net	(75)	62	1,263
Total operating expenses	363,165	444,943	501,735
Operating income	22,899	24,879	12,248
Interest income	555	448	989
Interest expense	(1,647)	(1,809)	(2,367)
Other income (expense), net	2,371	14,633	(16,806)
Total non-operating income (expense)	1,279	13,272	(18,184)
Income (loss) before income taxes	24,178	38,151	(5,936)
Benefit (provision) for income taxes	(2,863)	(4,791)	3,217
Equity-method investment activity, net of tax	16	4	(3)
Net income (loss)	$	

In [18]:
%pip install -U -q pydantic 2> /dev/null

Note: you may need to restart the kernel to use updated packages.


In [19]:
%pip install -U -q "anthropic[bedrock]"

Note: you may need to restart the kernel to use updated packages.


In [20]:
from anthropic import Anthropic

anthropic_client = Anthropic()

In [21]:
anthropic_client.count_tokens(document.pages[0].to_markdown())

1038

## Use LLM to review and improve the extracted document

Here we use Anthropic Claude 3 models through Amazon Bedrock to further improve the markdown extracted by Amazon Textract, so that the LLM can answer questions about the documents accurately later on.

In [22]:
import boto3
import json
import logging
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock", region_name="us-west-2")
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

In [23]:
# bedrock.list_foundation_models()

In [24]:
# llm_model_id = "anthropic.claude-3-haiku-20240307-v1:0"
llm_model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

In [25]:
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


def generate_message(bedrock_runtime, model_id, system_prompt, messages, max_tokens):

    body=json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "system": system_prompt,
            "messages": messages
        }
    )
    response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    response_body = json.loads(response.get('body').read())

    return response_body


def call_llm(user_input, model_id, system_prompt, bedrock_runtime, max_tokens=1000):
    """Handle calls to the Anthropic Claude Messages API. Returns None if a ClientError occurs."""
    try:
        # Prompt with user turn only.
        user_message =  {"role": "user", "content": user_input}
        messages = [user_message]
        return generate_message(bedrock_runtime, model_id, system_prompt, messages, max_tokens)
    except ClientError as err:
        message = err.response["Error"]["Message"]
        logger.error("A client error occurred: %s", message)
        print(f"A client error occurred: {message}")



Below we test the helper functions by calling the LLM.

In [26]:
%%time
user_input = "hello"
system_prompt = "reply in a friendly manner"

call_llm(user_input, llm_model_id, system_prompt, bedrock_runtime, max_tokens=1000)

CPU times: user 16 ms, sys: 4.49 ms, total: 20.5 ms
Wall time: 603 ms


{'id': 'msg_011efXny4sw5LZG8wEaapfuQ',
 'type': 'message',
 'role': 'assistant',
 'content': [{'type': 'text', 'text': 'Hello! How can I assist you today?'}],
 'model': 'claude-3-sonnet-28k-20240229',
 'stop_reason': 'end_turn',
 'stop_sequence': None,
 'usage': {'input_tokens': 13, 'output_tokens': 12}}
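
The response body above is a plain dictionary, so the generated text has to be read out of its `content` list. The following is a minimal sketch of how that could be done, assuming the response shape shown in the output above. `extract_text` and `was_truncated` are hypothetical helper names, not part of this notebook or of any SDK.

```python
def extract_text(response_body):
    """Join the text blocks of a Messages API response body (hypothetical helper)."""
    if response_body is None:
        # call_llm returns None when a ClientError was caught and logged.
        raise ValueError("No response body; the LLM call failed.")
    blocks = response_body.get("content", [])
    return "".join(b["text"] for b in blocks if b.get("type") == "text")


def was_truncated(response_body):
    """Return True when generation stopped because max_tokens was reached."""
    return response_body.get("stop_reason") == "max_tokens"


# Sample shaped like the output above (values copied from that cell).
sample = {
    "content": [{"type": "text", "text": "Hello! How can I assist you today?"}],
    "stop_reason": "end_turn",
}
print(extract_text(sample))
print(was_truncated(sample))
```

Checking `stop_reason` may be worthwhile for long pages, since a reply cut off at `max_tokens` would leave the improved markdown incomplete.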

In [27]:
user_prompt = """
Improve the markdown while keeping all original information. Put the improved markdown inside <results> XML tags with no explanation:
\n{markdown_doc}
""".strip()

system_prompt = "Your task is to review and improve the results of Amazon textract in markdown."


def improve_textract_markdown_output(document, llm_model_id):
    improved_markdown = []
    for i in range(len(document.pages)):
        user_input = user_prompt.format(markdown_doc=document.pages[i].to_markdown())
        result = call_llm(user_input, llm_model_id, system_prompt, bedrock_runtime, max_tokens=3000)
        # Extract the text between the <results> XML tags only.
        improved_markdown.append(result["content"][0]["text"].split("<results>")[-1].split("</results>")[0].strip())
    return improved_markdown
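
The `split`-based extraction in `improve_textract_markdown_output` returns the whole reply unchanged when the model omits the `<results>` tags, which can be hard to notice. One possible alternative, sketched below under the assumption that a reply contains at most one `<results>` block, uses a regular expression and makes the missing-tag case explicit. `extract_results_block` is a hypothetical name introduced for illustration.

```python
import re

# Non-greedy match so only the content of the first <results> block is captured;
# DOTALL lets "." span the newlines inside the markdown.
_RESULTS_RE = re.compile(r"<results>(.*?)</results>", re.DOTALL)


def extract_results_block(text, fallback=None):
    """Return the text inside <results>...</results>, or `fallback` if absent."""
    match = _RESULTS_RE.search(text)
    if match:
        return match.group(1).strip()
    return fallback


reply = "Sure.\n<results>\n# Title\n\nSome text.\n</results>"
print(extract_results_block(reply))
print(extract_results_block("reply without tags"))
```

Unlike the `split` version, this sketch also returns the fallback when the closing tag is missing, for example when the reply was cut off at `max_tokens`. Whether that is preferable to keeping the partial text depends on how the pages are used afterwards.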

In [28]:
# res = improve_textract_markdown_output(document, llm_model_id)

In [29]:
import os
raw_base_directory = "raw_documents"
prepared_base_directory = os.path.join(raw_base_directory, "prepared/")
prepared_base_directory

'raw_documents/prepared/'

In [30]:
import json

with open(
    os.path.join(prepared_base_directory, "metadata.json"), "r"
) as prepared_pdfs_metadata_obj:
    prepared_pdfs_metadata = json.load(prepared_pdfs_metadata_obj)


In [31]:
prepared_pdfs_metadata

[{'company': 'Amazon',
  'year': '2022',
  'doc_url': 'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/Amazon-2022-Annual-Report.pdf',
  'local_pdf_path': 'raw_documents/prepared/Amazon/annual_report_2022.pdf',
  'pages_kept': [15, 17, 18, 47, 48]},
 {'company': 'Amazon',
  'year': '2021',
  'doc_url': 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/Amazon-2021-Annual-Report.pdf',
  'local_pdf_path': 'raw_documents/prepared/Amazon/annual_report_2021.pdf',
  'pages_kept': [14, 16, 17, 18, 46, 47]}]

In [32]:
def extract_pages_as_markdown(input_document):

    document = extractor.start_document_analysis(
        file_source=input_document,
        s3_upload_path=f"s3://{default_sagemaker_bucket}/input_documents/",
        s3_output_path=f"s3://{default_sagemaker_bucket}/output_documents/",
        features=[TextractFeatures.LAYOUT],
        save_image=False
    )

    res = improve_textract_markdown_output(document, llm_model_id)
    pages = [{"page": indx, "page_text": text} for indx, text in enumerate(res)]
    return pages


def extract_docs_into_markdown(docs_metadata):
    results = []
    for doc_meta in docs_metadata:
        doc_result_with_metadata = {}
        doc_result_with_metadata["metadata"] = doc_meta
        doc_result_with_metadata["name"] = doc_meta["doc_url"].split("/")[-1]
        doc_result_with_metadata["source_location"] = doc_meta["doc_url"]
        doc_result_with_metadata["pages"] = extract_pages_as_markdown(doc_meta["local_pdf_path"])
        results.append(doc_result_with_metadata)
    return results
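
Note that the `page` value stored by `extract_pages_as_markdown` is the 0-based position within the prepared PDF, not the page number in the original annual report. If `pages_kept` in the metadata lists the original page numbers in the same order as the prepared pages (an assumption; the metadata does not say so explicitly, nor whether the numbers are 0- or 1-based), the two could be joined as sketched below. `attach_original_page_numbers` is a hypothetical helper.

```python
def attach_original_page_numbers(pages, pages_kept):
    """Add an 'original_page' field to each page record (hypothetical helper).

    Assumes pages_kept[i] is the original page number of prepared page i.
    """
    if len(pages) != len(pages_kept):
        raise ValueError(
            f"Expected {len(pages_kept)} pages, got {len(pages)}."
        )
    return [
        {**page, "original_page": original}
        for page, original in zip(pages, pages_kept)
    ]


# Stand-in page records with the same shape as those built above.
pages = [{"page": i, "page_text": f"text {i}"} for i in range(5)]
pages_kept = [15, 17, 18, 47, 48]  # copied from metadata.json for the 2022 report
enriched = attach_original_page_numbers(pages, pages_kept)
print(enriched[3])
```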

In [33]:
%%time

results = extract_docs_into_markdown(prepared_pdfs_metadata)

CPU times: user 1.99 s, sys: 63.9 ms, total: 2.05 s
Wall time: 4min 9s


In [34]:
results[0]

{'metadata': {'company': 'Amazon',
  'year': '2022',
  'doc_url': 'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/Amazon-2022-Annual-Report.pdf',
  'local_pdf_path': 'raw_documents/prepared/Amazon/annual_report_2022.pdf',
  'pages_kept': [15, 17, 18, 47, 48]},
 'name': 'Amazon-2022-Annual-Report.pdf',
 'source_location': 'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/Amazon-2022-Annual-Report.pdf',
 'pages': [{'page': 0,
   'page_text': "## Competition\n\nOur businesses encompass a large variety of product types, service offerings, and delivery channels. The worldwide marketplace in which we compete is evolving rapidly and intensely competitive, and we face a broad array of competitors from many different industry sectors around the world. Our current and potential competitors include:\n\n1. Physical, e-commerce, and omnichannel retailers, publishers, vendors, distributors, manufacturers, and producers of the products we offer and sell to consumers and busine

In [35]:
from utils.helpers import store_list_to_s3

In [36]:
ssm = boto3.client("ssm")

In [37]:
s3_bucket_name_parameter = "/AgenticLLMAssistantWorkshop/AgentDataBucketParameter"

In [38]:
s3_bucket_name = ssm.get_parameter(Name=s3_bucket_name_parameter)
s3_bucket_name = s3_bucket_name["Parameter"]["Value"]

In [39]:
processed_documents_s3_key = "documents_processed.json"

In [40]:
store_list_to_s3(s3_bucket_name, processed_documents_s3_key, results)
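
`store_list_to_s3` comes from `utils.helpers`, whose source is not shown in this notebook. The sketch below is one plausible way such a helper could work, not the workshop's actual implementation: it serializes the list to JSON and writes it with the S3 `put_object` call. The function name `store_list_to_s3_sketch` and the `FakeS3Client` stand-in are hypothetical; the client is passed in as an argument so the logic can be exercised without AWS credentials.

```python
import json


def store_list_to_s3_sketch(s3_client, bucket, key, items):
    """Serialize `items` to JSON and upload it as a single S3 object."""
    body = json.dumps(items).encode("utf-8")
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="application/json",
    )
    return len(body)  # number of bytes written


class FakeS3Client:
    """In-memory stand-in that mimics only put_object, for local testing."""

    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body, **kwargs):
        self.objects[(Bucket, Key)] = Body
        return {}


fake_s3 = FakeS3Client()
sample_results = [{"name": "Amazon-2022-Annual-Report.pdf", "pages": []}]
size = store_list_to_s3_sketch(
    fake_s3, "example-bucket", "documents_processed.json", sample_results
)
print(size)
```

With a real client, the call would look like `store_list_to_s3_sketch(boto3.client("s3"), s3_bucket_name, processed_documents_s3_key, results)`.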

In [41]:
results[0]

{'metadata': {'company': 'Amazon',
  'year': '2022',
  'doc_url': 'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/Amazon-2022-Annual-Report.pdf',
  'local_pdf_path': 'raw_documents/prepared/Amazon/annual_report_2022.pdf',
  'pages_kept': [15, 17, 18, 47, 48]},
 'name': 'Amazon-2022-Annual-Report.pdf',
 'source_location': 'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/Amazon-2022-Annual-Report.pdf',
 'pages': [{'page': 0,
   'page_text': "## Competition\n\nOur businesses encompass a large variety of product types, service offerings, and delivery channels. The worldwide marketplace in which we compete is evolving rapidly and intensely competitive, and we face a broad array of competitors from many different industry sectors around the world. Our current and potential competitors include:\n\n1. Physical, e-commerce, and omnichannel retailers, publishers, vendors, distributors, manufacturers, and producers of the products we offer and sell to consumers and busine

In [42]:
!aws s3 ls s3://{s3_bucket_name}

2024-04-18 14:35:34      42445 documents_processed.json


In [43]:
print(s3_bucket_name)

serverlessllmassistantstac-agentdatabucket67afdfb9-2puzdhyuanzd
