<a href="https://colab.research.google.com/github/girijesh-ai/llamaIndex-projects/blob/main/Metadata_Management.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting Metadata for Better Document Indexing and Understanding

In many cases, especially with long documents, a chunk of text may lack the context needed to distinguish it from other, similar chunks. One remedy is to manually label each chunk in the dataset or knowledge base, but this is labour-intensive and time-consuming for a large or continually updated set of documents.

To combat this, we use LLMs to extract contextual information about each document and attach it to the chunks, which helps the retrieval and language models disambiguate similar-looking passages.

We do this through our brand-new `MetadataExtractor` modules.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
!pip install llama-index pypdf

Collecting llama-index
  Downloading llama_index-0.8.53.post3-py3-none-any.whl (794 kB)
Collecting pypdf
  Downloading pypdf-3.16.4-py3-none-any.whl (276 kB)
Collecting aiostream<0.6.0,>=0.5.2 (from llama-index)
  Downloading aiostream-0.5.2-py3-none-any.whl (39 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from llama-index)
  Downloading dataclasses_json-0.5.14-py3-none-any.whl (26 kB)
Collecting deprecated>=1.2.9.3 (from llama-index)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)
Collecting langchain>=0.0.303 (from llama-index)
  Downloading langchain-0.0.325-py3-none-any.whl (1.9 MB)
Collecting openai>=

In [None]:
import nest_asyncio

nest_asyncio.apply()

import os
import openai

# Read the key from the environment instead of hardcoding a secret in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

In [None]:
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.schema import MetadataMode

In [None]:
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=512)

We create a node parser that extracts the document title and a set of hypothetical questions that each document chunk can answer, and stores both as metadata on the chunk.

The same cell imports `SummaryExtractor` and `KeywordExtractor`, and shows how to create your own custom extractor
based on the `MetadataFeatureExtractor` base class. Only `TitleExtractor` and `QuestionsAnsweredExtractor` are added to the pipeline here.

In [None]:
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
    EntityExtractor,
    MetadataFeatureExtractor,
)
from llama_index.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(
    separator=" ", chunk_size=512, chunk_overlap=128
)


class CustomExtractor(MetadataFeatureExtractor):
    def extract(self, nodes):
        metadata_list = [
            {
                "custom": (
                    node.metadata["document_title"]
                    + "\n"
                    + node.metadata["excerpt_keywords"]
                )
            }
            for node in nodes
        ]
        return metadata_list


metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ],
)

node_parser = SimpleNodeParser.from_defaults(
    text_splitter=text_splitter,
    metadata_extractor=metadata_extractor,
)
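
The `CustomExtractor` above reads `document_title` and `excerpt_keywords` directly, so it would raise a `KeyError` on any node where an earlier extractor has not written those keys (the metadata printed later in this notebook has no `excerpt_keywords` entry). Below is a minimal sketch of a more tolerant version of the same logic. `FakeNode` and `extract_custom` are hypothetical stand-ins so the example runs without LlamaIndex installed; they are not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Hypothetical stand-in for a node: just text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def extract_custom(nodes):
    """Return one metadata dict per node, in order, tolerating missing keys."""
    metadata_list = []
    for node in nodes:
        parts = [
            node.metadata.get("document_title", ""),
            node.metadata.get("excerpt_keywords", ""),
        ]
        # Join only the parts that are present, so a missing key does not raise.
        metadata_list.append({"custom": "\n".join(p for p in parts if p)})
    return metadata_list


# Illustrative values only; the second node has no keywords.
nodes = [
    FakeNode("chunk a", {"document_title": "Uber 10-K", "excerpt_keywords": "EBITDA, revenue"}),
    FakeNode("chunk b", {"document_title": "Lyft 10-K"}),
]
custom_metadata = extract_custom(nodes)
print(custom_metadata)
```

The same `.get(...)` pattern could be used inside `CustomExtractor.extract` if the class is added to the `extractors` list.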

In [None]:
from llama_index import SimpleDirectoryReader

We first load the 10-K annual SEC reports for Uber (2019) and Lyft (2020).

In [None]:
!mkdir -p data
!wget -O "data/10k-132.pdf" "https://www.dropbox.com/scl/fi/6dlqdk6e2k1mjhi8dee5j/uber.pdf?rlkey=2jyoe49bg2vwdlz30l76czq6g&dl=1"
!wget -O "data/10k-vFinal.pdf" "https://www.dropbox.com/scl/fi/qn7g3vrk5mqb18ko4e5in/lyft.pdf?rlkey=j6jxtjwo8zbstdo4wz3ns8zoj&dl=1"

--2023-10-28 08:44:22--  https://www.dropbox.com/scl/fi/6dlqdk6e2k1mjhi8dee5j/uber.pdf?rlkey=2jyoe49bg2vwdlz30l76czq6g&dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://uc32e2a0663cd9f3a20b8fa1e023.dl.dropboxusercontent.com/cd/0/inline/CGfXzjhFTtiHNbyHQYHNVTgTInmmni-hMCPmEjilNhRCKb-DO7E-TwMukruSTfnKpCpZ6dQ9oNWBhapCJP6bleQbLJrG51SVKuSygG9cNK7hK5ABjWtmil4M2YKPiTx7vZo/file?dl=1# [following]
--2023-10-28 08:44:22--  https://uc32e2a0663cd9f3a20b8fa1e023.dl.dropboxusercontent.com/cd/0/inline/CGfXzjhFTtiHNbyHQYHNVTgTInmmni-hMCPmEjilNhRCKb-DO7E-TwMukruSTfnKpCpZ6dQ9oNWBhapCJP6bleQbLJrG51SVKuSygG9cNK7hK5ABjWtmil4M2YKPiTx7vZo/file?dl=1
Resolving uc32e2a0663cd9f3a20b8fa1e023.dl.dropboxusercontent.com (uc32e2a0663cd9f3a20b8fa1e023.dl.dropboxusercontent.com)... 162.125.5.15, 2620:100:601d:15::a27d:50f
Connecting

In [None]:
# Note the uninformative document file name, which may be a common scenario in a production setting
uber_docs = SimpleDirectoryReader(input_files=["data/10k-132.pdf"]).load_data()
uber_front_pages = uber_docs[0:3]
uber_content = uber_docs[63:69]
uber_docs = uber_front_pages + uber_content

In [None]:
uber_nodes = node_parser.get_nodes_from_documents(uber_docs)

Extracting questions:   0%|          | 0/21 [00:00<?, ?it/s]
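
Before the extractors run, the parser splits each page into overlapping chunks using the `TokenTextSplitter` configured earlier (`chunk_size=512`, `chunk_overlap=128`). The sketch below illustrates the overlap idea only. It treats whitespace-separated words as tokens, which is a simplifying assumption, and `split_with_overlap` is a hypothetical helper rather than the library's implementation.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "a b c d e f g h i j".split()
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last token of the previous one, so a sentence cut at a chunk boundary still appears with some of its surrounding context in the next chunk.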

In [None]:
uber_nodes[3].metadata

{'page_label': '3',
 'file_name': '10k-132.pdf',
 'document_title': 'Exploring the Diverse Landscape of 2019: A Comprehensive Annual Report on Uber Technologies, Inc.',
 'questions_this_excerpt_can_answer': '1. Is Uber Technologies, Inc. considered a well-known seasoned issuer according to Rule 405 of the Securities Act?\n2. Has Uber Technologies, Inc. filed all the required reports under Section 13 or 15(d) of the Securities Exchange Act of 1934 in the past 12 months?\n3. Has Uber Technologies, Inc. submitted all the necessary Interactive Data Files as required by Rule 405 of Regulation S-T in the preceding 12 months?'}

In [None]:
# Note the uninformative document file name, which may be a common scenario in a production setting
lyft_docs = SimpleDirectoryReader(
    input_files=["data/10k-vFinal.pdf"]
).load_data()
lyft_front_pages = lyft_docs[0:3]
lyft_content = lyft_docs[68:73]
lyft_docs = lyft_front_pages + lyft_content

In [None]:
lyft_nodes = node_parser.get_nodes_from_documents(lyft_docs)

Extracting questions:   0%|          | 0/19 [00:00<?, ?it/s]

In [None]:
lyft_nodes[1].metadata

{'page_label': '2',
 'file_name': '10k-vFinal.pdf',
 'document_title': 'Lyft, Inc. Annual Report on Form 10-K for the fiscal year ended December 31, 2020',
 'questions_this_excerpt_can_answer': '1. Is Lyft, Inc. required to file reports pursuant to Section 13 or 15(d) of the Act?\n2. Has Lyft, Inc. filed all reports required by the Securities Exchange Act of 1934 in the past 12 months?\n3. Has Lyft, Inc. submitted every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T in the past 12 months?'}

Since we are asking fairly sophisticated questions, we use a sub-question query engine for all QnA pipelines below, and prompt it to pay more attention to the relevance of the retrieved sources.
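
As a rough illustration of what a sub-question engine does, the sketch below breaks a comparison question into one sub-question per company and metric, answers each one separately, and merges the results. In the notebook an LLM generates the sub-questions and a query engine answers them; here the question template is hard-coded and `stub_answer` is a hypothetical lookup seeded with the reference values quoted in the notebook's comments, so none of this reflects the library's internals.

```python
from itertools import product


def decompose(companies, metrics, year):
    """Build one (company, metric, question) triple per company/metric pair."""
    return [
        (
            company,
            metric,
            f"What was the cost due to {metric} for {company} in {year} in millions of USD?",
        )
        for company, metric in product(companies, metrics)
    ]


def answer_and_combine(sub_questions, answer_fn):
    """Answer each sub-question independently and nest results by company."""
    combined = {}
    for company, metric, question in sub_questions:
        combined.setdefault(company, {})[metric] = answer_fn(question)
    return combined


# Hypothetical stand-in for the retrieval + LLM step.
REFERENCE = {
    ("Uber", "research and development"): 4836,
    ("Uber", "sales and marketing"): 4626,
    ("Lyft", "research and development"): 1505.6,
    ("Lyft", "sales and marketing"): 814,
}


def stub_answer(question):
    for (company, metric), value in REFERENCE.items():
        if company in question and metric in question:
            return value
    raise KeyError(question)


sub_questions = decompose(
    ["Uber", "Lyft"], ["research and development", "sales and marketing"], 2019
)
combined = answer_and_combine(sub_questions, stub_answer)
print(len(sub_questions), combined)
```

The point of the decomposition is that each sub-question names exactly one company, so retrieval for it succeeds only if the index can tell Uber chunks from Lyft chunks.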

## Querying an Index With No Extra Metadata

In [None]:
from copy import deepcopy

nodes_no_metadata = deepcopy(uber_nodes) + deepcopy(lyft_nodes)
for node in nodes_no_metadata:
    node.metadata = {
        k: node.metadata[k]
        for k in node.metadata
        if k in ["page_label", "file_name"]
    }
print(
    "LLM sees:\n",
    nodes_no_metadata[9].get_content(metadata_mode=MetadataMode.ALL),
)

LLM sees:
 [Excerpt from document]
page_label: 65
file_name: 10k-132.pdf
Excerpt:
-----
income (loss) attributable to  Uber Technologies, Inc. to Adjusted EBITDA. 
            
  Year Ended December 31,   2017 to 2018   2018 to 2019   
(In millions, exce pt percenta ges)  2017   2018   2019   % Chan ge  % Chan ge  
Adjusted EBITDA ................................  $ (2,642) $ (1,847) $ (2,725)  30%  (48)%
-----


In [None]:
from llama_index import VectorStoreIndex
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

In [None]:
index_no_metadata = VectorStoreIndex(
    nodes=nodes_no_metadata,
    service_context=ServiceContext.from_defaults(llm=OpenAI(model="gpt-4")),
)
engine_no_metadata = index_no_metadata.as_query_engine(
    similarity_top_k=10,
)

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
final_engine_no_metadata = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[
        QueryEngineTool(
            query_engine=engine_no_metadata,
            metadata=ToolMetadata(
                name="sec_filing_documents",
                description="financial information on companies",
            ),
        )
    ],
    use_async=True,
)

In [None]:
response_no_metadata = final_engine_no_metadata.query(
    """
    What was the cost due to research and development v.s. sales and marketing for uber and lyft in 2019 in millions of USD?
    Give your answer as a JSON.
    """
)
# Correct answer:
# {"Uber": {"Research and Development": 4836, "Sales and Marketing": 4626},
#  "Lyft": {"Research and Development": 1505.6, "Sales and Marketing": 814 }}

Generated 4 sub questions.
[sec_filing_documents] Q: What was the cost due to research and development for Uber in 2019 in millions of USD?
[sec_filing_documents] Q: What was the cost due to sales and marketing for Uber in 2019 in millions of USD?
[sec_filing_documents] Q: What was the cost due to research and development for Lyft in 2019 in millions of USD?
[sec_filing_documents] Q: What was the cost due to sales and marketing for Lyft in 2019 in millions of USD?
[sec_filing_documents] A: The cost due to research and development for Lyft in 2019 was $1,505.640 million.
[sec_filing_documents] A: The cost due to sales and marketing for Lyft in 2019 was $814.122 million.
[sec_filing_documents] A: The cost due to sales and marketing for Uber in 2019 was $814,122 in thousands, which is equivalent to $814.122 mill

In [None]:
print(response_no_metadata.response)

{
  "Uber": {
    "Research and Development": 1505.64,
    "Sales and Marketing": 814.122
  },
  "Lyft": {
    "Research and Development": 1505.64,
    "Sales and Marketing": 814.122
  }
}


**RESULT**: As we can see, the QnA agent cannot tell which retrieved chunks belong to which company. As a result it reports Lyft's 2019 figures for both Uber and Lyft.

## Querying an Index With Extracted Metadata

In [None]:
print(
    "LLM sees:\n",
    (uber_nodes + lyft_nodes)[9].get_content(metadata_mode=MetadataMode.ALL),
)

LLM sees:
 [Excerpt from document]
page_label: 65
file_name: 10k-132.pdf
document_title: Exploring the Diverse Landscape of 2019: A Comprehensive Annual Report on Uber Technologies, Inc.
questions_this_excerpt_can_answer: 1. What was the adjusted EBITDA for Uber Technologies, Inc. for the year 2019 and how does it compare to the previous two years?
2. What was the percentage change in adjusted EBITDA for Uber Technologies, Inc. from 2017 to 2018 and from 2018 to 2019?
3. How much income (loss) attributable to Uber Technologies, Inc. was reported for the year 2019 and how does it compare to the previous two years?
Excerpt:
-----
income (loss) attributable to  Uber Technologies, Inc. to Adjusted EBITDA. 
            
  Year Ended December 31,   2017 to 2018   2018 to 2019   
(In millions, exce pt percenta ges)  2017   2018   2019   % Chan ge  % Chan ge  
Adjusted EBITDA ................................  $ (2,642) $ (1,847) $ (2,725)  30%  (48)%
-----
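
The layout printed above can be mimicked in a few lines, which makes it easier to see why the extra metadata helps: two chunks with identical text become distinguishable once their titles are prepended. This is a sketch that copies the printed format only. `render_for_llm` is a hypothetical helper, not the library's `get_content`, and the shared excerpt text is made up for illustration.

```python
def render_for_llm(text, metadata):
    """Prepend 'key: value' metadata lines to a chunk, mirroring the printed layout."""
    metadata_lines = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"[Excerpt from document]\n{metadata_lines}\nExcerpt:\n-----\n{text}\n-----"


# Hypothetical excerpt that could plausibly appear in either filing.
shared_text = "Research and development expenses increased year over year."

uber_view = render_for_llm(
    shared_text,
    {
        "file_name": "10k-132.pdf",
        "document_title": "Exploring the Diverse Landscape of 2019: A Comprehensive Annual Report on Uber Technologies, Inc.",
    },
)
lyft_view = render_for_llm(
    shared_text,
    {
        "file_name": "10k-vFinal.pdf",
        "document_title": "Lyft, Inc. Annual Report on Form 10-K for the fiscal year ended December 31, 2020",
    },
)

print(uber_view)
print(lyft_view)
```

With only `file_name` available, the two renderings would differ by an uninformative file name; with `document_title`, each one names its company explicitly.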


In [None]:
index = VectorStoreIndex(
    nodes=uber_nodes + lyft_nodes,
    service_context=ServiceContext.from_defaults(llm=OpenAI(model="gpt-4")),
)
engine = index.as_query_engine(
    similarity_top_k=10,
)

In [None]:
final_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[
        QueryEngineTool(
            query_engine=engine,
            metadata=ToolMetadata(
                name="sec_filing_documents",
                description="financial information on companies.",
            ),
        )
    ],
    use_async=True,
)

In [None]:
response = final_engine.query(
    """
    What was the cost due to research and development v.s. sales and marketing for uber and lyft in 2019 in millions of USD?
    Give your answer as a JSON.
    """
)
# Correct answer:
# {"Uber": {"Research and Development": 4836, "Sales and Marketing": 4626},
#  "Lyft": {"Research and Development": 1505.6, "Sales and Marketing": 814 }}

Generated 4 sub questions.
[sec_filing_documents] Q: What was the cost due to research and development for Uber in 2019 in millions of USD?
[sec_filing_documents] Q: What was the cost due to sales and marketing for Uber in 2019 in millions of USD?
[sec_filing_documents] Q: What was the cost due to research and development for Lyft in 2019 in millions of USD?
[sec_filing_documents] Q: What was the cost due to sales and marketing for Lyft in 2019 in millions of USD?
[sec_filing_documents] A: The cost due to sales and marketing for Uber in 2019 was $4,626 million.
[sec_filing_documents] A: The cost due to research and development for Lyft in 2019 was $1,505.640 million.
[sec_filing_documents] A: The cost due to research and development for Uber in 2019 was $4,836 million.
[sec_filing_do

In [None]:
print(response.response)

{
  "Uber": {
    "Research and Development": 4836,
    "Sales and Marketing": 4626
  },
  "Lyft": {
    "Research and Development": 1505.64,
    "Sales and Marketing": 814.122
  }
}


**RESULT**: As we can see, the LLM answers the questions correctly: with the extracted metadata, the figures for both Uber and Lyft match the reference values.
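
Because both queries ask for JSON, the comparison with the reference values can be automated instead of checked by eye. The sketch below parses the two responses printed in this notebook and lists every figure that differs from the reference by more than a small relative tolerance. `find_mismatches` and the 0.1% tolerance are assumptions chosen for illustration.

```python
import json
import math

# Reference values quoted in the notebook's "Correct answer" comment.
EXPECTED = {
    "Uber": {"Research and Development": 4836, "Sales and Marketing": 4626},
    "Lyft": {"Research and Development": 1505.6, "Sales and Marketing": 814},
}


def find_mismatches(response_text, expected, rel_tol=1e-3):
    """Return (company, metric, got, want) for each value outside the tolerance."""
    answer = json.loads(response_text)
    wrong = []
    for company, metrics in expected.items():
        for metric, want in metrics.items():
            got = answer.get(company, {}).get(metric)
            is_number = isinstance(got, (int, float))
            if not is_number or not math.isclose(got, want, rel_tol=rel_tol):
                wrong.append((company, metric, got, want))
    return wrong


# Responses copied from the two runs shown in this notebook.
no_metadata_text = """
{"Uber": {"Research and Development": 1505.64, "Sales and Marketing": 814.122},
 "Lyft": {"Research and Development": 1505.64, "Sales and Marketing": 814.122}}
"""
with_metadata_text = """
{"Uber": {"Research and Development": 4836, "Sales and Marketing": 4626},
 "Lyft": {"Research and Development": 1505.64, "Sales and Marketing": 814.122}}
"""

no_metadata_errors = find_mismatches(no_metadata_text, EXPECTED)
with_metadata_errors = find_mismatches(with_metadata_text, EXPECTED)
print("without metadata:", no_metadata_errors)
print("with metadata:", with_metadata_errors)
```

The tolerance absorbs rounding differences such as 1505.64 versus 1505.6, while still flagging the two Uber figures that the metadata-free index got wrong.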