# 03 - Layout-aware text extraction with Amazon Textract

In [1]:
%pip install -q amazon-textract-textractor[pdf]

Note: you may need to restart the kernel to use updated packages.


In [2]:
!uname -a

Linux default 4.14.336-257.562.amzn2.x86_64 #1 SMP Sat Feb 24 09:50:35 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux


In [3]:
!sudo apt-get update -y 2> /dev/null

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease   
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1356 kB]
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [1898 kB]
Fetched 3483 kB in 2s (1637 kB/s)                         
Reading package lists... Done


In [4]:
!sudo apt install poppler-utils -y 2> /dev/null

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 32 not upgraded.


In [5]:
# used by amazon-textract-textractor to visualize images with extraction results
%pip install -q pdf2image

Note: you may need to restart the kernel to use updated packages.


In [6]:
ls raw_documents/

[0m[01;34mAmazon[0m/  [01;34mprepared[0m/


In [7]:
ls raw_documents/prepared/

[0m[01;34mAmazon[0m/  metadata.json


In [8]:
ls raw_documents/prepared/Amazon/

annual_report_2021.pdf  annual_report_2022.pdf


In [9]:
!python -m json.tool raw_documents/prepared/metadata.json

[
    {
        "company": "Amazon",
        "year": "2022",
        "doc_url": "https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/Amazon-2022-Annual-Report.pdf",
        "local_pdf_path": "raw_documents/prepared/Amazon/annual_report_2022.pdf",
        "pages_kept": [
            15,
            17,
            18,
            47,
            48
        ]
    },
    {
        "company": "Amazon",
        "year": "2021",
        "doc_url": "https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/Amazon-2021-Annual-Report.pdf",
        "local_pdf_path": "raw_documents/prepared/Amazon/annual_report_2021.pdf",
        "pages_kept": [
            14,
            16,
            17,
            18,
            46,
            47
        ]
    }
]


## Extraction with textractor

In [10]:
import sagemaker

default_sagemaker_bucket = sagemaker.Session().default_bucket()
sagemaker_execution_role = sagemaker.get_execution_role()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [11]:
import boto3
from textractor import Textractor
from textractor.data.constants import TextractFeatures

region = boto3.session.Session().region_name
# extractor = Textractor(profile_name="default")
extractor = Textractor(region_name=region)

input_document = "raw_documents/prepared/Amazon/annual_report_2022.pdf"

document = extractor.start_document_analysis(
    file_source=input_document,
    s3_upload_path=f"s3://{default_sagemaker_bucket}/input_documents/",
    s3_output_path=f"s3://{default_sagemaker_bucket}/output_documents/",
    features=[TextractFeatures.LAYOUT],
    save_image=False
)

In [12]:
document.document

In [13]:
document.pages[0]

This Page (1) holds the following data:
Words - 788
Lines - 54
Key-values - 0
Checkboxes - 0
Tables - 0
Queries - 0
Signatures - 0
Expense documents - 0
Layout elements - 10

In [14]:
print(document.pages[4].to_markdown())

AMAZON.COM, INC. 

# CONSOLIDATED STATEMENTS OF OPERATIONS 

(in millions, except per share data) 



Year Ended December 31,
2020	2021	2022
Net product sales	$	215,915	$	241,787	$	242,901
Net service sales	170,149	228,035	271,082
Total net sales	386,064	469,822	513,983
Operating expenses:
Cost of sales	233,307	272,344	288,831
Fulfillment	58,517	75,111	84,299
Technology and content	42,740	56,052	73,213
Sales and marketing	22,008	32,551	42,238
General and administrative	6,668	8,823	11,891
Other operating expense (income), net	(75)	62	1,263
Total operating expenses	363,165	444,943	501,735
Operating income	22,899	24,879	12,248
Interest income	555	448	989
Interest expense	(1,647)	(1,809)	(2,367)
Other income (expense), net	2,371	14,633	(16,806)
Total non-operating income (expense)	1,279	13,272	(18,184)
Income (loss) before income taxes	24,178	38,151	(5,936)
Benefit (provision) for income taxes	(2,863)	(4,791)	3,217
Equity-method investment activity, net of tax	16	4	(3)
Net income (loss)	$	

In [15]:
%pip install -U -q pydantic 2> /dev/null

Note: you may need to restart the kernel to use updated packages.


In [16]:
%pip install -U -q "anthropic[bedrock]"

Note: you may need to restart the kernel to use updated packages.


In [17]:
from anthropic import Anthropic

anthropic_client = Anthropic()

In [18]:
anthropic_client.count_tokens(document.pages[0].to_markdown())

1038

## Use LLM to review and improve the extracted document

Here we use Anthropic Claude 3 models through Amazon Bedrock to further improve the markdown extracted by Amazon Textract, so that the LLM can answer questions about the documents accurately later on.

In [19]:
import boto3
import json
import logging
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock", region_name="us-west-2")
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

In [20]:
# bedrock.list_foundation_models()

In [21]:
# llm_model_id = "anthropic.claude-3-haiku-20240307-v1:0"
llm_model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

In [22]:
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


def generate_message(bedrock_runtime, model_id, system_prompt, messages, max_tokens):

    body=json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "system": system_prompt,
            "messages": messages
        }
    )
    response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    response_body = json.loads(response.get('body').read())

    return response_body


def call_llm(user_input, model_id, system_prompt, bedrock_runtime, max_tokens=1000):
    """Handle calls to the Anthropic Claude Messages API."""
    try:
        # Prompt with user turn only.
        user_message = {"role": "user", "content": user_input}
        messages = [user_message]
        return generate_message(bedrock_runtime, model_id, system_prompt, messages, max_tokens)
    except ClientError as err:
        message = err.response["Error"]["Message"]
        logger.error("A client error occurred: %s", message)
        # Re-raise so callers never receive an implicit None in place of a response.
        raise
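
Because the extraction later calls the LLM once per page, a transient failure such as throttling can interrupt a long run. The following is a minimal sketch of a retry wrapper with exponential backoff. `call_with_retries` and its parameters are hypothetical names introduced here for illustration; they are not part of the notebook or of boto3.

```python
import time


def call_with_retries(func, *args, retry_on=(Exception,), max_attempts=3,
                      base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff.

    Hypothetical helper: retry_on, max_attempts, base_delay and sleep are
    illustrative parameters, not settings taken from the notebook.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

One possible use is to wrap `generate_message`, which lets `ClientError` propagate, for example `call_with_retries(generate_message, bedrock_runtime, llm_model_id, system_prompt, messages, 3000, retry_on=(ClientError,))`. Whether every `ClientError` is worth retrying depends on the error code, so a real implementation would likely inspect it first.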



Below we test the helper functions by calling the LLM.

In [23]:
%%time
user_input = "hello"
system_prompt = "reply in a friendly manner"

call_llm(user_input, llm_model_id, system_prompt, bedrock_runtime, max_tokens=1000)

CPU times: user 15.4 ms, sys: 13 µs, total: 15.5 ms
Wall time: 5.01 s


{'id': 'msg_01EgPgjYdJC787Qetuh5ug55',
 'type': 'message',
 'role': 'assistant',
 'content': [{'type': 'text', 'text': 'Hello! How can I assist you today?'}],
 'model': 'claude-3-sonnet-28k-20240229',
 'stop_reason': 'end_turn',
 'stop_sequence': None,
 'usage': {'input_tokens': 13, 'output_tokens': 12}}
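
The response body above is a plain dictionary, so the generated text and the token usage can be pulled out with ordinary dictionary access. The sketch below uses a hypothetical helper, `summarize_response`, and the sample response shown above as input.

```python
def summarize_response(response_body):
    """Collect the text blocks, stop reason and token usage of a response body.

    Hypothetical helper written for illustration; it only relies on the keys
    visible in the sample response printed above.
    """
    text = "".join(
        block["text"]
        for block in response_body.get("content", [])
        if block.get("type") == "text"
    )
    usage = response_body.get("usage", {})
    return {
        "text": text,
        "stop_reason": response_body.get("stop_reason"),
        "total_tokens": usage.get("input_tokens", 0) + usage.get("output_tokens", 0),
    }


# The sample response from the test call above.
sample = {
    "id": "msg_01EgPgjYdJC787Qetuh5ug55",
    "type": "message",
    "role": "assistant",
    "content": [{"type": "text", "text": "Hello! How can I assist you today?"}],
    "model": "claude-3-sonnet-28k-20240229",
    "stop_reason": "end_turn",
    "stop_sequence": None,
    "usage": {"input_tokens": 13, "output_tokens": 12},
}

summary = summarize_response(sample)
print(summary["text"], summary["total_tokens"])
```

A `stop_reason` of `max_tokens` instead of `end_turn` indicates that the reply was cut off at the `max_tokens` limit, which is worth checking when long pages are rewritten.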

In [24]:
user_prompt = """
Improve the markdown while keeping all original information. Put the improved markdown inside <results> XML tags with no explanation:
\n{markdown_doc}
""".strip()

system_prompt = "Your task is to review and improve the results of Amazon Textract in markdown."


def improve_textract_markdown_output(document, llm_model_id):
    improved_markdown = []
    for page in document.pages:
        user_input = user_prompt.format(markdown_doc=page.to_markdown())
        result = call_llm(user_input, llm_model_id, system_prompt, bedrock_runtime, max_tokens=3000)
        # Extract the text between the <results> XML tags only.
        improved_markdown.append(result["content"][0]["text"].split("<results>")[-1].split("</results>")[0].strip())
    return improved_markdown
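
The `split`-based extraction above silently returns the whole reply when the model omits the `<results>` tags. As an alternative, here is a minimal sketch that uses a regular expression and makes the missing-tag case explicit. `extract_results` and `fallback` are hypothetical names, and unlike the code above this version takes the first tagged span, not the text after the last opening tag.

```python
import re

# Non-greedy match of everything between the first <results> ... </results> pair.
RESULTS_PATTERN = re.compile(r"<results>(.*?)</results>", re.DOTALL)


def extract_results(text, fallback=None):
    """Return the stripped text inside <results> tags, or fallback if absent."""
    match = RESULTS_PATTERN.search(text)
    if match:
        return match.group(1).strip()
    return fallback


reply = "Here you go:\n<results>\n# Title\n\nBody text.\n</results>"
print(extract_results(reply))
```

Returning `None` for a reply without tags lets the caller decide whether to keep the original Textract markdown for that page or to call the model again.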

In [25]:
# res = improve_textract_markdown_output(document, llm_model_id)

In [26]:
import os
raw_base_directory = "raw_documents"
prepared_base_directory = os.path.join(raw_base_directory, "prepared/")
prepared_base_directory

'raw_documents/prepared/'

In [27]:
import json

with open(
    os.path.join(prepared_base_directory, "metadata.json"), "r"
) as prepared_pdfs_metadata_obj:
    prepared_pdfs_metadata = json.load(prepared_pdfs_metadata_obj)


In [28]:
prepared_pdfs_metadata

[{'company': 'Amazon',
  'year': '2022',
  'doc_url': 'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/Amazon-2022-Annual-Report.pdf',
  'local_pdf_path': 'raw_documents/prepared/Amazon/annual_report_2022.pdf',
  'pages_kept': [15, 17, 18, 47, 48]},
 {'company': 'Amazon',
  'year': '2021',
  'doc_url': 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/Amazon-2021-Annual-Report.pdf',
  'local_pdf_path': 'raw_documents/prepared/Amazon/annual_report_2021.pdf',
  'pages_kept': [14, 16, 17, 18, 46, 47]}]

In [29]:
def extract_pages_as_markdown(input_document):

    document = extractor.start_document_analysis(
        file_source=input_document,
        s3_upload_path=f"s3://{default_sagemaker_bucket}/input_documents/",
        s3_output_path=f"s3://{default_sagemaker_bucket}/output_documents/",
        features=[TextractFeatures.LAYOUT],
        save_image=False
    )

    res = improve_textract_markdown_output(document, llm_model_id)
    pages = [{"page": indx, "page_text": text} for indx, text in enumerate(res)]
    return pages


def extract_docs_into_markdown(docs_metadata):
    results = []
    for doc_meta in docs_metadata:
        doc_result_with_metadata = {}
        doc_result_with_metadata["metadata"] = doc_meta
        doc_result_with_metadata["name"] = doc_meta["doc_url"].split("/")[-1]
        doc_result_with_metadata["source_location"] = doc_meta["doc_url"]
        doc_result_with_metadata["pages"] = extract_pages_as_markdown(doc_meta["local_pdf_path"])
        results.append(doc_result_with_metadata)
    return results
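
Each page record above stores only its 0-based position in the prepared PDF. If the original page numbers are needed later, for example to cite a page of the full annual report, they could be attached from `pages_kept`. The sketch below assumes that `pages_kept` lists the original page numbers in the same order as the pages of the prepared PDF; the notebook does not confirm this, nor whether those numbers are 0- or 1-based. `attach_original_page_numbers` and the `original_page` key are hypothetical.

```python
def attach_original_page_numbers(pages, pages_kept):
    """Return copies of the page records with an added 'original_page' key.

    Assumption: pages_kept[i] is the page of the source report that became
    page i of the prepared PDF.
    """
    if len(pages) != len(pages_kept):
        raise ValueError(
            f"expected {len(pages_kept)} pages, got {len(pages)}"
        )
    return [
        dict(page, original_page=original)
        for page, original in zip(pages, pages_kept)
    ]


example_pages = [
    {"page": 0, "page_text": "first kept page"},
    {"page": 1, "page_text": "second kept page"},
]
print(attach_original_page_numbers(example_pages, [15, 17]))
```

The length check guards against a mismatch between the metadata and the number of pages Textract actually returned.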

In [30]:
%%time

results = extract_docs_into_markdown(prepared_pdfs_metadata)

CPU times: user 1.87 s, sys: 62.5 ms, total: 1.93 s
Wall time: 2min 52s


In [31]:
results[0]

{'metadata': {'company': 'Amazon',
  'year': '2022',
  'doc_url': 'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/Amazon-2022-Annual-Report.pdf',
  'local_pdf_path': 'raw_documents/prepared/Amazon/annual_report_2022.pdf',
  'pages_kept': [15, 17, 18, 47, 48]},
 'name': 'Amazon-2022-Annual-Report.pdf',
 'source_location': 'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/Amazon-2022-Annual-Report.pdf',
 'pages': [{'page': 0,
   'page_text': "## Competition\n\nOur businesses encompass a wide variety of product types, service offerings, and delivery channels. The worldwide marketplace is rapidly evolving and intensely competitive, and we face a diverse array of competitors from various industry sectors globally. Our current and potential competitors include:\n\n1. Physical, e-commerce, and omnichannel retailers, publishers, vendors, distributors, manufacturers, and producers of products we offer and sell to consumers and businesses.\n2. Publishers, producers, and d

In [32]:
from utils.helpers import store_list_to_s3

In [33]:
ssm = boto3.client("ssm")

In [34]:
s3_bucket_name_parameter = "/AgenticLLMAssistantWorkshop/AgentDataBucketParameter"

In [35]:
s3_bucket_name = ssm.get_parameter(Name=s3_bucket_name_parameter)
s3_bucket_name = s3_bucket_name["Parameter"]["Value"]

In [36]:
processed_documents_s3_key = "documents_processed.json"

In [37]:
store_list_to_s3(s3_bucket_name, processed_documents_s3_key, results)
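
The implementation of `store_list_to_s3` lives in `utils.helpers` and is not shown in this notebook. As a rough idea of what such a helper might do, the sketch below serializes a list to JSON and writes it with the S3 `put_object` call, then reads it back with `get_object`. The function names are hypothetical, and the client is passed in explicitly (for example `boto3.client("s3")`) so the logic can be exercised without AWS access.

```python
import json


def store_list_to_s3_sketch(s3_client, bucket, key, items):
    """Serialize items as JSON and upload them to s3://bucket/key.

    Hypothetical stand-in for utils.helpers.store_list_to_s3, whose actual
    behavior is not shown in this notebook.
    """
    body = json.dumps(items, ensure_ascii=False).encode("utf-8")
    s3_client.put_object(
        Bucket=bucket, Key=key, Body=body, ContentType="application/json"
    )
    return f"s3://{bucket}/{key}"


def load_list_from_s3_sketch(s3_client, bucket, key):
    """Download s3://bucket/key and parse it back into a Python list."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return json.loads(response["Body"].read().decode("utf-8"))
```

Reading the object back once after the upload is a cheap way to confirm that the stored JSON round-trips before a later notebook depends on it.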
