# 05 - Structured metadata extraction for analytical question answering

In this notebook we use the output of the `04-question-answering-using-rag.ipynb` notebook to generate structured data containing specified entities extracted from each of the documents. We then use this structured data to answer questions.

Notebook overview:
1. Define the embedding model to initialize the vectorstore created in `04-question-answering-using-rag.ipynb` (faiss_vector_store).
2. Define the LLM model that will be used to extract the entities.
3. Define the entities to be extracted.
4. Extract the entities.

Requirements:

- To extract relevant entities from the provided documents, the LLM needs a clear definition of what it is trying to extract. For example, if you want to extract the profit of a company from one of its financial reports, the question "What is the profit?" might not be specific enough, since the report can contain the profits of other companies. It therefore helps if your documents in S3 carry relevant metadata (for example, company and year) that can be used in the queries to make the entities more specific.


In [1]:
import boto3
from botocore.config import Config

In [2]:
ssm_client = boto3.client("ssm")

In [3]:
bedrock_region_parameter = "/AgenticLLMAssistant/bedrock_region"
bedrock_endpoint_parameter = "/AgenticLLMAssistant/bedrock_endpoint"
s3_bucket_name_parameter = "/AgenticLLMAssistant/AgentDataBucketParameter"

BEDROCK_REGION = ssm_client.get_parameter(Name=bedrock_region_parameter)
BEDROCK_REGION = BEDROCK_REGION["Parameter"]["Value"]

S3_BUCKET_NAME = ssm_client.get_parameter(Name=s3_bucket_name_parameter)
S3_BUCKET_NAME = S3_BUCKET_NAME["Parameter"]["Value"]

BEDROCK_REGION, S3_BUCKET_NAME

('us-east-1', 'assistantbackendstack-agentdatabucket67afdfb9-pqa1faq3bool')

In [4]:
# LLM_MODEL_ID = "anthropic.claude-v1"
LLM_MODEL_ID = "anthropic.claude-v2"

In [5]:
retry_config = Config(
    region_name=BEDROCK_REGION, retries={"max_attempts": 10, "mode": "standard"}
)
bedrock_runtime = boto3.client("bedrock-runtime", config=retry_config)
bedrock = boto3.client("bedrock", config=retry_config)

First get all of the documents that were used to create the vectorstore.

In [6]:
from utils.helpers import load_list_from_s3

In [7]:
s3_key = "documents_processed.json"
documents_processed = load_list_from_s3(S3_BUCKET_NAME, s3_key)

In [8]:
len(documents_processed), len(documents_processed[0]["pages"])

(2, 5)

## Define the embedding model to embed the queries.

Make sure to use the same embedding model that was used to create the embeddings for the documents.
In this notebook, we will continue to use the Titan embedding model from Amazon Bedrock.

In [9]:
from langchain.embeddings import BedrockEmbeddings

# Define an embedding model to generate embeddings
embedding_model_id = "amazon.titan-embed-text-v1"
embedding_model = BedrockEmbeddings(model_id=embedding_model_id, client=bedrock_runtime)

#### Initialize the vectorstore using the embedding model

In [10]:
from langchain.vectorstores import FAISS

faiss = FAISS.load_local("faiss_vector_store", embedding_model)

In [11]:
# uncomment the following to show the stored chunks in the vector store.
# faiss.docstore._dict

## Define the generation model, which does the extraction

For the generation model you are free to choose any model you like. In this notebook we use the Anthropic Claude v2 model available through Amazon Bedrock.

Define an Amazon Bedrock generation model

- If you did not use Amazon Bedrock in the `04-question-answering-using-rag.ipynb` notebook, please go to that notebook first and follow the installation steps mentioned there.

In [12]:
from langchain.embeddings import BedrockEmbeddings
from langchain.llms.bedrock import Bedrock as LangchainBedrock

# Define a generation model to generate text
claude_llm = LangchainBedrock(
    model_id=LLM_MODEL_ID,
    client=bedrock_runtime,
    model_kwargs={"max_tokens_to_sample": 1024, "temperature": 0.0},
)

## Define the entities to be extracted
For the sake of simplicity we are considering entities that can be described using a `description` and a `type`.<br>
You can easily generate a description for your entities using an LLM with the prompt "Can you give a description of what '{entity}' means regarding a company, in a maximum of 2 sentences".

The `rag_query` represents the query that will be used to find relevant context to extract the entity from. This query can contain the available metadata about the document you are trying to extract the entity from. In our case, we have `company` and `year` as available metadata for each document.<br>
It is wise to experiment with the `rag_query` to improve the returned context.

In [13]:
entity_list = {
    "revenue": {
        "description": "Total income from goods sold or services provided in {company} in {year} exactly as defined in the document within <document></document> XML tags.",
        "rag_query": "What is the total revenue of {company} in {year}.",
    },
    "risks": {
        "description": "Summary of risks impacting {company} in {year} as defined in the document within <document></document> XML tags.",
        "rag_query": "What are the main risks for {company} in {year}.",
    },
    "human_capital": {
        "description": "The total number of employees in {company} in {year} exactly as defined in the document within <document></document> XML tags.",
        "rag_query": "What is the total number of employees in {company} in {year}.",
    },
}
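As a quick illustration of how document metadata sharpens a query, the sketch below fills the `{company}` and `{year}` placeholders for every entity. The `toy_entities`, `toy_metadata` and `build_queries` names are hypothetical and only mirror the structure of `entity_list`.

```python
# Toy entity definitions mirroring the structure of entity_list.
toy_entities = {
    "revenue": {"rag_query": "What is the total revenue of {company} in {year}."},
    "human_capital": {
        "rag_query": "What is the total number of employees in {company} in {year}."
    },
}

# Hypothetical document metadata; in this notebook it comes from S3.
toy_metadata = {"company": "Amazon", "year": 2021}


def build_queries(entities, metadata):
    """Fill the metadata placeholders of every entity's rag_query."""
    return {
        name: spec["rag_query"].format(**metadata)
        for name, spec in entities.items()
    }


queries = build_queries(toy_entities, toy_metadata)
for name, query in queries.items():
    print(f"{name}: {query}")
```

Printing the formatted queries before running the retrieval makes it easy to spot a placeholder that was left unfilled.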

In [14]:
# Returns the chunk text followed by all tables (in markdown) present on the page the chunk came from.
def add_tables(chunk):
    # The company and year the chunk corresponds to
    chunk_company = chunk.metadata["company"]
    chunk_year = chunk.metadata["year"]

    # The document and page that the chunk came from
    doc = [
        document
        for document in documents_processed
        if (
            document["metadata"]["company"] == chunk_company
            and document["metadata"]["year"] == chunk_year
        )
    ][0]
    pages = doc["pages"]
    page = [
        page
        for page in pages
        if (int(page["page"]) == int(chunk.metadata["page_number"]))
    ][0]

    # The tables (in markdown) that are present on the page that the chunk came from
    tables = ""
    for table_markdown in page["page_tables"]:
        tables += "\n" + table_markdown
    return chunk.page_content + "\n" + tables.strip()
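Note that `add_tables` indexes `[0]` into its list comprehensions, so it raises an `IndexError` when no document or page matches the chunk metadata (for example, if `year` is stored as a string in one place and as an integer in another). A more defensive lookup could look like the sketch below; `find_page_tables` and the toy data are hypothetical and only assume the `documents_processed` structure used above.

```python
def find_page_tables(documents, company, year, page_number):
    """Return the markdown tables of one page, or [] if nothing matches."""
    doc = next(
        (
            d
            for d in documents
            if d["metadata"]["company"] == company
            and d["metadata"]["year"] == year
        ),
        None,
    )
    if doc is None:
        return []
    page = next(
        (p for p in doc["pages"] if int(p["page"]) == int(page_number)),
        None,
    )
    return page["page_tables"] if page is not None else []


# Toy data following the structure assumed for documents_processed.
toy_documents = [
    {
        "metadata": {"company": "Amazon", "year": "2021"},
        "pages": [{"page": "37", "page_tables": ["| a | b |\n|---|---|\n| 1 | 2 |"]}],
    }
]

print(find_page_tables(toy_documents, "Amazon", "2021", 37))
print(find_page_tables(toy_documents, "Amazon", "2020", 37))  # no matching document
```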

Use the following cell to experiment with the semantic search queries, and refine them to get the best results.

In [15]:
# Experiment with different rag queries
rag_query = "What is the total revenue of {company} in {year}."

# Referring to the Annual report of Amazon in 2021
company = "Amazon"
year = 2021

# Find context relevant to the query
for doc in faiss.similarity_search(
    query=rag_query.format(
        company=company,
        year=year,
    ),
    k=1,
    filter={
        # "document_s3_metadata": {"company": company},
        "company": company,
        # "year": year
    },
    fetch_k=1000,
):
    # print(doc.metadata)
    print("\n\n")
    print(add_tables(doc))
    print("\n\n")




Net product sales
Net service sales
Total net sales
Operating expenses:
Cost of sales
Fulfillment
Technology and content
Marketing
General and administrative
Other operating expense (income), net
Total operating expenses
Operating income
Interest income
Interest expense
Other income (expense), net
Total non-operating income (expense)
Income before income taxes
Provision for income taxes
Equity-method investment activity, net of tax
Net income
Basic earnings per share
Diluted earnings per share
Weighted-average shares used in computation of earnings per share:
Basic
Diluted
CONSOLIDATED STATEMENTS OF OPERATIONS
See accompanying notes to consolidated financial statements.
AMAZON.COM, INC.
(in millions, except per share data)
37
Year Ended December 31,
2019
2020
2021
$
$
$
160,408
215,915
241,787
120,114
170,149
228,035
386,064
280,522
469,822
165,536
233,307
272,344
40,232
58,517
75,111
35,931
42,740
56,052
18,878
22,008
32,551
8,823
6,668
5,203
62
(75)
201
444,943
363,165
265,981
14,

## Prepare the context for the LLM to extract entities
For each combination of entity and document we retrieve relevant context to present to the generation model, which will use the context during extraction.

Retrieve the chunks for each combination of document and entity and store them.

To improve the relevance of the retrieved documents you could use the document metadata added during creation of the vector store. If your documents in S3 have any S3 metadata attached to them, you should be able to reference it here using `document['metadata']`.
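As far as we understand, LangChain's FAISS wrapper applies the `filter` after the nearest-neighbour search: it fetches `fetch_k` candidates, drops those whose metadata does not match, and returns up to `k` of the rest. The pure-Python sketch below imitates that behaviour on toy data to show why a small `fetch_k` can return fewer than `k` results. The `post_filter_search` helper and the distances are made up for illustration and are not part of LangChain.

```python
def post_filter_search(candidates, metadata_filter, k, fetch_k):
    """Imitate post-filtering: rank, keep fetch_k, filter, return top k.

    `candidates` is a list of (distance, metadata) tuples; lower is closer.
    """
    ranked = sorted(candidates, key=lambda item: item[0])[:fetch_k]
    matching = [
        item
        for item in ranked
        if all(item[1].get(key) == value for key, value in metadata_filter.items())
    ]
    return matching[:k]


toy_candidates = [
    (0.10, {"company": "Other Corp"}),
    (0.12, {"company": "Other Corp"}),
    (0.15, {"company": "Amazon"}),
    (0.20, {"company": "Amazon"}),
]

# With fetch_k=2 both fetched candidates are filtered out.
few = post_filter_search(toy_candidates, {"company": "Amazon"}, k=1, fetch_k=2)
# With a larger fetch_k the closest matching chunk is found.
many = post_filter_search(toy_candidates, {"company": "Amazon"}, k=1, fetch_k=4)
print(few)
print(many)
```

This is why the cells in this notebook pass a `fetch_k` much larger than `k` when filtering on metadata.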

In [16]:
map_doc_and_entity_to_chuncks = {}

for document in documents_processed:
    print("-" * 79)
    print(document["source_location"])
    map_doc_and_entity_to_chuncks[document["source_location"]] = {}

    # Use the S3 metadata to provide the prompt with relevant information
    company = document["metadata"]["company"]
    year = document["metadata"]["year"]

    print(f"Executing entity search on doc: {document['source_location']}")
    for entity in entity_list.keys():
        query = entity_list[entity]["rag_query"].format(
            company=company,
            year=year,
        )
        print(f"Query: {query}")

        document_entity_retrieved_chunks = faiss.similarity_search(
            query=query,
            k=4,
            filter={
                "document_source_location": document["source_location"],
            },
            fetch_k=200,
        )

        map_doc_and_entity_to_chuncks[document["source_location"]][entity] = {
            "chunks": document_entity_retrieved_chunks,
            "metadata": {
                "company": company,
                "year": year,
            },
        }

-------------------------------------------------------------------------------
s3://assistantbackendstack-agentdatabucket67afdfb9-pqa1faq3bool/prepared_pdf_documents/Amazon/annual_report_2022.pdf
Executing entity search on doc: s3://assistantbackendstack-agentdatabucket67afdfb9-pqa1faq3bool/prepared_pdf_documents/Amazon/annual_report_2022.pdf
Query: What is the total revenue of Amazon in 2022.
Query: What are the main risks for Amazon in 2022.
Query: What is the total number of employees in Amazon in 2022.
-------------------------------------------------------------------------------
s3://assistantbackendstack-agentdatabucket67afdfb9-pqa1faq3bool/prepared_pdf_documents/Amazon/annual_report_2021.pdf
Executing entity search on doc: s3://assistantbackendstack-agentdatabucket67afdfb9-pqa1faq3bool/prepared_pdf_documents/Amazon/annual_report_2021.pdf
Query: What is the total revenue of Amazon in 2021.
Query: What are the main risks for Amazon in 2021.
Query: What is the total number of emp

## Define the entities to extract

In [17]:
entity_list

{'revenue': {'description': 'Total income from goods sold or services provided in {company} in {year} exactly as defined in the document within <document></document> XML tags.',
  'rag_query': 'What is the total revenue of {company} in {year}.'},
 'risks': {'description': 'Summary of risks impacting {company} in {year} as defined in the document within <document></document> XML tags.',
  'rag_query': 'What are the main risks for {company} in {year}.'},
 'human_capital': {'description': 'The total number of employees in {company} in {year} exactly as defined in the document within <document></document> XML tags.',
  'rag_query': 'What is the total number of employees in {company} in {year}.'}}

In [18]:
entity_list["revenue"]["description"]

'Total income from goods sold or services provided in {company} in {year} exactly as defined in the document within <document></document> XML tags.'

In [19]:
%pip install -U pydantic

Collecting pydantic
  Using cached pydantic-2.6.4-py3-none-any.whl.metadata (85 kB)
Using cached pydantic-2.6.4-py3-none-any.whl (394 kB)
Installing collected packages: pydantic
  Attempting uninstall: pydantic
    Found existing installation: pydantic 2.5.1
    Uninstalling pydantic-2.5.1:
      Successfully uninstalled pydantic-2.5.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
amazon-sagemaker-jupyter-scheduler 3.0.6 requires pydantic==1.*, but you have pydantic 2.6.4 which is incompatible.
gluonts 0.13.7 requires pydantic~=1.7, but you have pydantic 2.6.4 which is incompatible.
Successfully installed pydantic-2.6.4
Note: you may need to restart the kernel to use updated packages.


In [30]:
import json
from datetime import date
from pydantic import BaseModel, Field
from typing import List


class RevenueEntity(BaseModel):
    revenue: float = Field(
        description=entity_list["revenue"]["description"], default=None
    )
    revenue_reasoning: str = Field(
        description="Put here the text from the document used to infer the value for the revenue field.",
        default=None,
    )
    revenue_unit: str = Field(
        description="Put here the unit of the revenue using ISO alphabetic code.",
        default=None,
    )
    revenue_unit_reasoning: str = Field(
        description="Put here the text from the document used to infer the value for the revenue_unit field.",
        default=None,
    )


class RisksEntity(BaseModel):
    risks: str = Field(description=entity_list["risks"]["description"], default=None)
    risks_reasoning: str = Field(
        description="Put here the text from the document used to infer the value for the risks field.",
        default=None,
    )


class HumanCapitalEntity(BaseModel):
    human_capital: int = Field(
        description=entity_list["human_capital"]["description"], default=None
    )
    human_capital_reasoning: str = Field(
        description="Put here the text from the document used to infer the value for the human_capital field.",
        default=None,
    )


entity_schema = {}
entity_schema["revenue"] = RevenueEntity
entity_schema["risks"] = RisksEntity
entity_schema["human_capital"] = HumanCapitalEntity

In [31]:
entity_schema["revenue"].schema()

{'properties': {'revenue': {'default': None,
   'description': 'Total income from goods sold or services provided in {company} in {year} exactly as defined in the document within <document></document> XML tags.',
   'title': 'Revenue',
   'type': 'number'},
  'revenue_reasoning': {'default': None,
   'description': 'Put here the text from the document used to infer the value for the revenue field.',
   'title': 'Revenue Reasoning',
   'type': 'string'},
  'revenue_unit': {'default': None,
   'description': 'Put here the unit of the revenue using ISO alphabetic code.',
   'title': 'Revenue Unit',
   'type': 'string'},
  'revenue_unit_reasoning': {'default': None,
   'description': 'Put here the text from the document used to infer the value for the revenue_unit field.',
   'title': 'Revenue Unit Reasoning',
   'type': 'string'}},
 'title': 'RevenueEntity',
 'type': 'object'}
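One caveat with the schema above: the extraction prompt later asks the model to put `null` when a value is missing, but in pydantic 2 a field annotated as plain `float`, `int` or `str` rejects an explicit `None` even when its default is `None`. A possible variation, shown here as a sketch with a hypothetical `OptionalRevenueEntity` class, is to annotate the fields as `Optional[...]`. Note that the generated JSON schema then differs from the output shown above.

```python
from typing import Optional

from pydantic import BaseModel, Field


class OptionalRevenueEntity(BaseModel):
    # Optional[...] lets the model accept an explicit null from the LLM.
    revenue: Optional[float] = Field(
        default=None,
        description="Total revenue, or null if it is not in the document.",
    )
    revenue_unit: Optional[str] = Field(
        default=None,
        description="Unit of the revenue using ISO alphabetic code.",
    )


# An answer that follows the "put null" rule validates without errors.
missing = OptionalRevenueEntity(revenue=None, revenue_unit=None)
found = OptionalRevenueEntity(revenue=469822000000, revenue_unit="USD")
print(missing.revenue, found.revenue)
```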

In [32]:
import pydantic

In [33]:
pydantic.__version__

'2.6.4'

In [34]:
pydantic

<module 'pydantic' from '/opt/conda/lib/python3.10/site-packages/pydantic/__init__.py'>

## Defining example extractions for all entities

To boost the performance of the extraction for some entities, we add an extraction example to leverage 1-shot in-context learning.
The example is chosen based on the type of the entity that is being extracted.

In [35]:
print(entity_list.keys())

dict_keys(['revenue', 'risks', 'human_capital'])


In [36]:
example_template = """
Example {index}: Given the information inside <schema> and <documents>, the correct output is inside <json> below:

<schema>
{serialized_json_schema}
</schema>

<documents>
{document_excerpts}
</documents>

Correct output:
<json>
{json_output}
</json>
"""

example_pairs = {
    "revenue": [
        {
            "document_excerpts": "Page 20 - After the 10% increase in the number of customers, the sales for Company Inc in 2019 was $324,483 million.",
            "json_output": json.dumps(
                {
                    "revenue": 324483000000,
                    "revenue_reasoning": "The sales for Company Inc in 2019 was $324,483 million.",
                    "revenue_unit": "USD",
                    "revenue_unit_reasoning": "The financial report is in US dollars as stated on page 20.",
                },
                indent=1,
            ),
        }
    ],
    "human_capital": [
        {
            "document_excerpts": "Despite the COVID-19 pandemic, in 2019, Company Inc employed 349,329 employees worldwide.",
            "json_output": json.dumps(
                {
                    "human_capital": 349329,
                    "human_capital_reasoning": "In 2019, Company Inc employed 349,329 employees worldwide.",
                },
                indent=1,
            ),
        }
    ],
    "risks": [
        {
            "document_excerpts": """Competition continues to intensify, including with the development of new business models and the entry of new and well-funded competitors, and as our competitors enter into business combinations or alliances and established companies in other market segments expand to become competitive with our business. In addition, new and enhanced technologies, including search, web and infrastructure computing services, digital content, etc.""",
            "json_output": json.dumps(
                {
                    "risks": "The main risks are: \n* Competition from new entrants\n* Increased competition because of new technologies",
                    "risks_reasoning": "Competition continues to intensify, including with the development of new business models and the entry of new and well-funded competitors. In addition, new and enhanced technologies continue to increase our competition.",
                },
                indent=1,
            ),
        }
    ],
}

few_shot_examples = {}

for entity in entity_list.keys():
    combined_examples = "\n"
    for idx, current_example in enumerate(example_pairs[entity]):
        serialized_json_schema = json.dumps(
            entity_schema[entity].schema(), indent=1
        )
        combined_examples += example_template.format(
            index=idx + 1,
            serialized_json_schema=serialized_json_schema,
            document_excerpts=current_example["document_excerpts"],
            json_output=current_example["json_output"],
        )
    few_shot_examples[entity] = combined_examples


print(few_shot_examples["revenue"])



Example 1: Given the information inside <schema> and <documents>, the correct output is inside <json> below:

<schema>
{
 "properties": {
  "revenue": {
   "default": null,
   "description": "Total income from goods sold or services provided in {company} in {year} exactly as defined in the document within <document></document> XML tags.",
   "title": "Revenue",
   "type": "number"
  },
  "revenue_reasoning": {
   "default": null,
   "description": "Put here the text from the document used to infer the value for the revenue field.",
   "title": "Revenue Reasoning",
   "type": "string"
  },
  "revenue_unit": {
   "default": null,
   "description": "Put here the unit of the revenue using ISO alphabetic code.",
   "title": "Revenue Unit",
   "type": "string"
  },
  "revenue_unit_reasoning": {
   "default": null,
   "description": "Put here the text from the document used to infer the value for the revenue_unit field.",
   "title": "Revenue Unit Reasoning",
   "type": "string"
  }
 },
 "t

## Define the prompt to extract the entities

The prompt for the extraction was inspired by the paper [PromptNER: Prompting For Named Entity Recognition](https://arxiv.org/abs/2305.15444).

The main components are a clear goal, one or more examples, and chain-of-thought reasoning.

We use the Claude v2 model available through Amazon Bedrock.

In [38]:
import json
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.schema import HumanMessage, SystemMessage
from langchain.output_parsers import PydanticOutputParser


ENTITY_EXTRACTION_PROMPT_TEMPLATE = """\n\nHuman: Extract the information described by the json schema inside the <schema></schema> XML tags from the documents inside <documents></documents> XML tags.
Follow the rules inside the <rules></rules> XML tags during extraction:
<rules>
1. You must output a valid JSON.
2. You must extract the value for each field from the text inside <documents></documents>, and the value must match the description and type in the JSON schema.
3. Expand numbers into full digits format: example 1: 212,765,000,000 becomes 212765000000, example 2: $469.822 million becomes 469822000, example 3: 132,452 people becomes 132452.
4. Don't use comma as thousands separator in the numbers you extract. For example 212,765 must be written as 212765.
5. Consider the context inside <context></context> XML tags.
6. If the document does not contain the value, put null.
</rules>

The JSON schema inside the <schema></schema> XML tags contains the information to extract:
<schema>
{serialized_json_schema}
</schema>

Extract information from the documents inside <documents></documents> XML tags below:
<documents>
{document_excerpts}
</documents>

Use the metadata inside the <context></context> XML tags when relevant to assist you during extraction:
<context>
The company is {company}.
The year of the financial report is {year}.
</context>

Follow the extraction examples inside the <examples></examples> XML tags below:
<examples>
{few_shot_examples}
</examples>

Only write the JSON output inside <json></json> XML tags without further explanation.

\n\nAssistant: <json>\n"""

llm_chain = LLMChain(
    llm=claude_llm,
    prompt=PromptTemplate.from_template(ENTITY_EXTRACTION_PROMPT_TEMPLATE),
)


def get_extracted_entity(
    entity, metadata, chunks, entity_schema, few_shot_examples_dict
):
    """Build the prompt for one entity and run the extraction chain on it."""
    document_excerpts = ""
    for chunk in chunks:
        document_excerpts += "\n".join(
            [
                f"- Below Excerpt of page {chunk.metadata['original_page_number']}:",
                "\n",
                f"{add_tables(chunk)}",
            ]
        )

    # parser = PydanticOutputParser(pydantic_object=entity_schema[entity])
    few_shot_examples = few_shot_examples_dict.get(entity, "")
    serialized_json_schema = json.dumps(
        entity_schema[entity].schema(), indent=1
    )

    prompt = ENTITY_EXTRACTION_PROMPT_TEMPLATE.format(
        entity=entity,
        metadata=metadata,
        serialized_json_schema=serialized_json_schema,
        company=metadata["company"],
        year=metadata["year"],
        document_excerpts=document_excerpts,
        few_shot_examples=few_shot_examples,
    )

    result = llm_chain.predict(
        entity=entity,
        metadata=metadata,
        serialized_json_schema=serialized_json_schema,
        company=metadata["company"],
        year=metadata["year"],
        document_excerpts=document_excerpts,
        few_shot_examples=few_shot_examples,
    )

    return result, prompt
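One detail worth checking in the prompt: the entity descriptions contain `{company}` and `{year}` placeholders, and because the serialized schema is passed to the template as a value, those placeholders appear to reach the model unformatted (they are also visible in the printed few-shot example). If you prefer to fill them in, a sketch like the following could be applied to the schema before serializing it. The `fill_schema_placeholders` helper and the toy schema are hypothetical.

```python
import json


def fill_schema_placeholders(schema, company, year):
    """Return a copy of a JSON schema with {company} and {year} filled in."""
    filled = json.loads(json.dumps(schema))  # cheap deep copy
    for field in filled.get("properties", {}).values():
        description = field.get("description", "")
        # str.replace is used instead of str.format so that any other
        # braces in the description are left untouched.
        field["description"] = description.replace(
            "{company}", str(company)
        ).replace("{year}", str(year))
    return filled


toy_schema = {
    "properties": {
        "revenue": {
            "description": "Total income of {company} in {year}.",
            "type": "number",
        }
    },
    "title": "RevenueEntity",
    "type": "object",
}

filled_schema = fill_schema_placeholders(toy_schema, "Amazon", 2021)
print(json.dumps(filled_schema, indent=1))
```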

In [39]:
# Define a list to collect all the extracted items
import json

extracted_entities = []
k = 2

# Create a prompt for each combination of document and entity
for document in list(map_doc_and_entity_to_chuncks.keys()):
    # You could additionally use the metadata of documents defined in S3 to customize these prompts
    # This metadata can also be found in the `documents_processed` variable under the key `metadata`.
    document_name = document.split("/")[-1]
    # extracted_entities[document] = {}

    for entity in entity_list.keys():
        current_entity_result = {}
        current_entity_result["entity_type"] = entity
        current_entity_result["source_doc"] = document

        print("-" * 79)
        print(f"Starting extraction of {entity} from document {document_name}")
        try:
            chunks = map_doc_and_entity_to_chuncks[document][entity]["chunks"]
            if not chunks:
                print(
                    f"WARNING: The input chunks are empty for {document} and {entity}. Skipping."
                )
                continue
            metadata = map_doc_and_entity_to_chuncks[document][entity]["metadata"]
            # extraction_example = entity_list[entity]['example']
            # extraction_example = extraction_example,
            current_entity_result.update(metadata)
            result, prompt = get_extracted_entity(
                entity=entity,
                metadata=metadata,
                chunks=chunks,
                entity_schema=entity_schema,
                few_shot_examples_dict=few_shot_examples,
            )
        # On failure, log the error and skip this entity so that a missing or
        # stale `result` from an earlier iteration is never stored.
        except Exception as e:
            print(f"Exception: {e}")
            continue

        current_entity_result["result"] = result
        current_entity_result["prompt"] = prompt
        extracted_entities.append(current_entity_result)

    print(f"Finished extracting entities from document: {document}")
    print("-" * 79)

-------------------------------------------------------------------------------
Starting extraction of revenue from document annual_report_2022.pdf
-------------------------------------------------------------------------------
Starting extraction of risks from document annual_report_2022.pdf
-------------------------------------------------------------------------------
Starting extraction of human_capital from document annual_report_2022.pdf
Finished extracting entities from document: s3://assistantbackendstack-agentdatabucket67afdfb9-pqa1faq3bool/prepared_pdf_documents/Amazon/annual_report_2022.pdf
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Starting extraction of revenue from document annual_report_2021.pdf
-------------------------------------------------------------------------------
Starting extraction of risks from document annual_report_2021.pdf
-----------------
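If a prompt exceeds the model's context window, one option is to retry the extraction with fewer chunks. The sketch below is one way to do that, assuming a hypothetical `extract_fn` callable that raises an exception when the call fails; `fake_extract` stands in for the LLM call so that the example runs on its own.

```python
def extract_with_fewer_chunks(extract_fn, chunks, min_chunks=1):
    """Call extract_fn(chunks), dropping the last chunk after each failure.

    Returns the extraction result and the number of chunks that were used.
    """
    remaining = list(chunks)
    last_error = None
    while len(remaining) >= min_chunks:
        try:
            return extract_fn(remaining), len(remaining)
        except Exception as error:  # narrow this to the real error type
            last_error = error
            # Chunks are ordered by relevance, so drop the last one.
            remaining = remaining[:-1]
    raise RuntimeError("Extraction failed for every context size") from last_error


def fake_extract(chunks):
    """Stand-in for the LLM call: fails when given more than two chunks."""
    if len(chunks) > 2:
        raise ValueError("context too long")
    return "{...}"


result, used = extract_with_fewer_chunks(fake_extract, ["c1", "c2", "c3", "c4"])
print(f"Extraction succeeded with {used} chunks")
```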

In [40]:
for extraction in extracted_entities:
    print("-" * 79)
    print(extraction.keys())
    print(extraction["result"])

-------------------------------------------------------------------------------
dict_keys(['entity_type', 'source_doc', 'company', 'year', 'result', 'prompt'])
{
  "revenue": 513983000000,
  "revenue_reasoning": "Total net sales for Amazon in 2022 was $513,983 million.",
  "revenue_unit": "USD",
  "revenue_unit_reasoning": "The financial report is in US dollars."
}
</json>
-------------------------------------------------------------------------------
dict_keys(['entity_type', 'source_doc', 'company', 'year', 'result', 'prompt'])
{
  "risks": "The main risks impacting Amazon in 2022 are: competition from new business models and well-funded competitors, competitors forming alliances, established companies expanding into Amazon's segments, and new technologies enabling increased competition. Other risks include limited experience and lack of first-mover advantage in new segments, failure of new products/services to be adopted, disruption risks from new products/services, lack of profitab

In [41]:
parsed_entities = []

for extraction in extracted_entities:
    current_entity = {}
    # current_entity["entity_type"] = extraction["entity_type"]
    current_entity["source_doc"] = extraction["source_doc"]
    current_entity["company"] = extraction["company"]
    current_entity["year"] = extraction["year"]

    print("-" * 79)
    result = extraction["result"]

    try:
        parser = PydanticOutputParser(
            pydantic_object=entity_schema[extraction["entity_type"]]
        )
        # The prompt already ends with the start of the output, <json>\n,
        # to make the model abide by the output format. The completion
        # therefore starts at {.., so we add <json>\n here to complete the XML tags.
        result = "<json>\n" + result
        json_result = parser.parse(result).dict()
        print(json_result)
        current_entity.update(json_result)
    except Exception as e:
        print(f"Failed to parse output with error {e} for the following")
        print(result)

    parsed_entities.append(current_entity)

-------------------------------------------------------------------------------
{'revenue': 513983000000.0, 'revenue_reasoning': 'Total net sales for Amazon in 2022 was $513,983 million.', 'revenue_unit': 'USD', 'revenue_unit_reasoning': 'The financial report is in US dollars.'}
-------------------------------------------------------------------------------
{'risks': "The main risks impacting Amazon in 2022 are: competition from new business models and well-funded competitors, competitors forming alliances, established companies expanding into Amazon's segments, and new technologies enabling increased competition. Other risks include limited experience and lack of first-mover advantage in new segments, failure of new products/services to be adopted, disruption risks from new products/services, lack of profitability in new segments, and unsuccessful sustainability initiatives.", 'risks_reasoning': 'Our businesses are rapidly evolving and intensely competitive, and we have many competito

Inspect extraction result and extraction prompt to verify their structure
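When the parser fails, it can help to look at the JSON payload on its own. The following standard-library sketch pulls the first JSON object out of a completion, with or without the surrounding `<json></json>` tags. The `parse_json_completion` helper is hypothetical, and unlike the Pydantic parser it does not validate field types.

```python
import json
import re


def parse_json_completion(completion):
    """Extract and decode the JSON object from an LLM completion.

    Returns None when no valid JSON object is found.
    """
    match = re.search(r"\{.*\}", completion, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


# A completion shaped like the raw results above: JSON, then a closing tag.
raw_completion = (
    '{\n  "human_capital": 349329,\n'
    '  "human_capital_reasoning": "Company Inc employed 349,329 employees."\n}\n'
    "</json>"
)
parsed = parse_json_completion(raw_completion)
print(parsed)
```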

#### Store the extracted entities in a DataFrame

In [42]:
parsed_entities

[{'source_doc': 's3://assistantbackendstack-agentdatabucket67afdfb9-pqa1faq3bool/prepared_pdf_documents/Amazon/annual_report_2022.pdf',
  'company': 'Amazon',
  'year': '2022',
  'revenue': 513983000000.0,
  'revenue_reasoning': 'Total net sales for Amazon in 2022 was $513,983 million.',
  'revenue_unit': 'USD',
  'revenue_unit_reasoning': 'The financial report is in US dollars.'},
 {'source_doc': 's3://assistantbackendstack-agentdatabucket67afdfb9-pqa1faq3bool/prepared_pdf_documents/Amazon/annual_report_2022.pdf',
  'company': 'Amazon',
  'year': '2022',
  'risks': "The main risks impacting Amazon in 2022 are: competition from new business models and well-funded competitors, competitors forming alliances, established companies expanding into Amazon's segments, and new technologies enabling increased competition. Other risks include limited experience and lack of first-mover advantage in new segments, failure of new products/services to be adopted, disruption risks from new products/se

In [43]:
from collections import defaultdict

# Group by company and year into defaultdict
grouped = defaultdict(lambda: defaultdict(list))
for d in parsed_entities:
    # del d["entity_type"]
    grouped[d["company"]][d["year"]].append(d)

# Flatten into list of dicts for DataFrame
results = []
for comp, years in grouped.items():
    for year, items in years.items():
        result = {"company": comp, "year": year}
        for item in items:
            for k, v in item.items():
                if k not in result:
                    result[k] = v
        results.append(result)

print(results)

[{'company': 'Amazon', 'year': '2022', 'source_doc': 's3://assistantbackendstack-agentdatabucket67afdfb9-pqa1faq3bool/prepared_pdf_documents/Amazon/annual_report_2022.pdf', 'revenue': 513983000000.0, 'revenue_reasoning': 'Total net sales for Amazon in 2022 was $513,983 million.', 'revenue_unit': 'USD', 'revenue_unit_reasoning': 'The financial report is in US dollars.', 'risks': "The main risks impacting Amazon in 2022 are: competition from new business models and well-funded competitors, competitors forming alliances, established companies expanding into Amazon's segments, and new technologies enabling increased competition. Other risks include limited experience and lack of first-mover advantage in new segments, failure of new products/services to be adopted, disruption risks from new products/services, lack of profitability in new segments, and unsuccessful sustainability initiatives.", 'risks_reasoning': 'Our businesses are rapidly evolving and intensely competitive, and we have man

In [45]:
import pandas as pd

entities_df = pd.DataFrame(results)
entities_df

| | company | year | source_doc | revenue | revenue_reasoning | revenue_unit | revenue_unit_reasoning | risks | risks_reasoning | human_capital | human_capital_reasoning |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Amazon | 2022 | s3://assistantbackendstack-agentdatabucket67af... | 513983000000.0 | Total net sales for Amazon in 2022 was $513,98... | USD | The financial report is in US dollars. | The main risks impacting Amazon in 2022 are: c... | Our businesses are rapidly evolving and intens... | 1541000 | As of December 31, 2022, we employed approxima... |
| 1 | Amazon | 2021 | s3://assistantbackendstack-agentdatabucket67af... | 469822000000.0 | Total net sales in 2021 was $469,822 million a... | USD | The financial report is in US dollars. | The main risks impacting Amazon in 2021 are: i... | Competition continues to intensify, including ... | 1608000 | As of December 31, 2021, we employed approxima... |


In [46]:
entities_df.to_csv("extracted_entities.csv")

## Upload structured metadata CSV file to S3

In [47]:
!aws s3 cp ./extracted_entities.csv s3://{S3_BUCKET_NAME}/structured_metadata/extracted_entities.csv

upload: ./extracted_entities.csv to s3://assistantbackendstack-agentdatabucket67afdfb9-pqa1faq3bool/structured_metadata/extracted_entities.csv
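Once the entities are in tabular form, analytical questions reduce to simple arithmetic over the rows. The sketch below uses only the standard library and copies the revenue and employee figures from the table above; the helper names are hypothetical, and `year` is written as an integer here although the notebook stores it as a string.

```python
# Values copied from the extracted entities table in this notebook.
rows = [
    {"company": "Amazon", "year": 2022, "revenue": 513983000000.0, "human_capital": 1541000},
    {"company": "Amazon", "year": 2021, "revenue": 469822000000.0, "human_capital": 1608000},
]

by_year = {row["year"]: row for row in rows}


def revenue_growth(by_year, start_year, end_year):
    """Relative revenue change between two years."""
    start = by_year[start_year]["revenue"]
    end = by_year[end_year]["revenue"]
    return (end - start) / start


def revenue_per_employee(row):
    """Revenue divided by the extracted employee count."""
    return row["revenue"] / row["human_capital"]


growth = revenue_growth(by_year, 2021, 2022)
print(f"Revenue growth 2021 -> 2022: {growth:.1%}")
for year in sorted(by_year):
    print(year, round(revenue_per_employee(by_year[year])))
```

The same questions could be answered with pandas on `entities_df`; plain dictionaries are used here only to keep the example dependency-free.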
