# Midterm Certification Challenge: Building and Deploying a RAG Application
DUE DATE: Before 4:00 PM PT on May 2 (before next Thursday's class!)

Record the total time it takes you to complete the challenge.

You have access to all boilerplate code from the course, and we highly encourage you to leverage it!

**Deliverables:**

**Build 🏗️**

* Data: Meta 10-K Filings
* LLM: OpenAI GPT-3.5-turbo
* Embedding Model: text-embedding-3-small
* Infrastructure: LangChain or LlamaIndex (you choose)
* Vector Store: Qdrant
* Deployment: Chainlit, Hugging Face

**Ship 🚢**

* Evaluate your answers to the following questions:
    * "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"
    * "Who are Meta's 'Directors' (i.e., members of the Board of Directors)?"
* Record a <10 min Loom video walkthrough
$$ Extra Credit: Baseline retrieval performance w/ RAGAS, change something about your RAG system to improve it, then show the improvement quantitatively!

**Share 🚀**
* Share lessons not yet learned in #aie2-general

## Install Dependencies

In [1]:
import nest_asyncio

nest_asyncio.apply()

In [2]:
!pip install llama-parse llama_index -qU


## Set Environment Variables

In [3]:
import os
from getpass import getpass

# set openai key
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API Key:")

OpenAI API Key:··········


In [4]:
# set llama cloud key
os.environ["LLAMA_CLOUD_API_KEY"] = getpass("Llama Cloud API Key:")

Llama Cloud API Key:··········


## Download the Data

In [5]:
# download the data
!mkdir 'data'
!wget 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf' -O 'data/Meta_10k.pdf'

--2024-05-02 08:59:39--  https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf
Resolving d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)... 108.138.113.53, 108.138.113.114, 108.138.113.63, ...
Connecting to d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)|108.138.113.53|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2481466 (2.4M) [application/pdf]
Saving to: ‘data/Meta_10k.pdf’


2024-05-02 08:59:40 (46.0 MB/s) - ‘data/Meta_10k.pdf’ saved [2481466/2481466]



## Llama Index Vanilla RAG

In [None]:
# import dependencies
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# load the document
documents = SimpleDirectoryReader('data').load_data()

# create an index from the documents
index = VectorStoreIndex.from_documents(documents)

# create a query engine for the index
query_engine = index.as_query_engine()

In [None]:
# check how many Document objects were loaded (these are documents, not nodes)
len(documents)

147

In [None]:
# query the engine
query1 = "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"
response = query_engine.query(query1)
print(response)

The total value of 'Cash and cash equivalents' as of December 31, 2023 was $14.681 billion.


In [None]:
# query the engine
query2 = "Who are Meta's 'Directors' (i.e., members of the Board of Directors)?"
response = query_engine.query(query2)
print(response)

Meta's 'Directors' or members of the Board of Directors include individuals who have the authority to call stockholders, such as a majority of the board of directors, the chairman of the board of directors, the chief executive officer, or the president.


## RAG with Llama Parse

In [None]:
# import dependencies
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# setup parser
parser = LlamaParse(
    result_type="markdown"
)

# load and parse the document
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    input_files=['data/Meta_10k.pdf'],
    file_extractor=file_extractor
).load_data()

# create an index from the parsed markdown
index = VectorStoreIndex.from_documents(documents)

# create a new query engine for the index
new_query_engine = index.as_query_engine()

Started parsing the file under job_id 192255e6-0a6d-4ad7-9d19-b0b8d2c67cad


In [None]:
len(documents)

1

In [None]:
documents[0].text[:1000]

"## UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 FORM 10-K\n\n(Mark One) ☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the fiscal year ended December 31, 2023\n\n☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 or For the transition period from to Commission File Number: 001-35551\n\nMeta Platforms, Inc. Meta (Exact name of registrant as specified in its charter) Delaware 20-1665019 (State or other jurisdiction of incorporation or organization) 1 Meta Way, Menlo Park, California 94025 (I.R.S. Employer Identification Number) (Address of principal executive offices and Zip Code) (650) 543-4800 (Registrant's telephone number, including area code) Securities registered pursuant to Section 12(b) of the Act:\n\n|Title of each class|Trading symbol(s)|Name of each exchange on which registered|\n|---|---|---|\n|Class A Common Stock, $0.000006 par value|META|The Nasdaq Stock Market L

In [None]:
# query the engine
query1 = "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"
response = new_query_engine.query(query1)
print(response)

$65.40 billion


In [None]:
# query the engine
query2 = "Who are Meta's 'Directors' (i.e., members of the Board of Directors)?"
response = new_query_engine.query(query2)
print(response)

The members of Meta's Board of Directors are referred to as the Compensation, Nominating & Governance Committee of the Board.


## RAG with Llama Parse + Recursive Chunking

In [22]:
# import dependencies
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import MarkdownElementNodeParser

parsing_instruction = """The provided document is an annual report filed by Meta Platforms,
        Inc. with the Securities and Exchange Commission (SEC).
        This form provides detailed financial information about the company's performance for a specific year.
        It includes unaudited financial statements, management discussion and analysis, and other relevant disclosures required by the SEC.
        It contains many tables.
        Try to be precise while answering the questions"""

# setup parser
parser = LlamaParse(
    result_type="markdown",
    parsing_instruction=parsing_instruction
)

# load and parse the document
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    input_files=['data/Meta_10k.pdf'],
    file_extractor=file_extractor
).load_data()

# setup markdown node parser
node_parser = MarkdownElementNodeParser()

# parse the markdown document
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

# create an index from the parsed markdown
index = VectorStoreIndex(nodes=base_nodes+objects)

# create a new query engine for the index
recursive_query_engine = index.as_query_engine()

Started parsing the file under job_id e22c88c7-4e1b-46b0-b5fb-c02f8ed1fb41


70it [00:00, 16995.73it/s]
columns
  field required (type=value_error.missing)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 482, in _agive_response_single
    structured_response = await program.acall(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 92, in acall
    answer = await self._llm.astructured_predict(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/instrumentation/dispatcher.py", line 307, in async_wrapper
    result = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/llms/llm.py", line 391, in astructured_predict
    result = await program.acall(**prompt_args)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/base.py", line 223, in acall
    return _parse_tool_calls(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/bas

In [23]:
len(nodes)

269

In [24]:
len(base_nodes)

129

In [25]:
# query the engine
query1 = "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"
response = recursive_query_engine.query(query1)
print(response)

The total value of 'Cash and cash equivalents' as of December 31, 2023, was $41,862.


In [26]:
# query the engine
query2 = "Who are Meta's 'Directors' (i.e., members of the Board of Directors)?"
response = recursive_query_engine.query(query2)
print(response)

Mark Zuckerberg is the Board Chair and Chief Executive Officer of Meta Platforms, Inc. Susan Li is the Chief Financial Officer of Meta Platforms, Inc.


## RAG with Llama Parse + LangChain RecursiveCharacterTextSplitter

In [None]:
!pip install -qU langchain langchain-core langchain-community langchain-openai unstructured

In [None]:
!pip install -qU qdrant-client

First, let's save the LlamaParse markdown document to disk.

In [None]:
# import dependencies
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

# define the parsing instruction before the parser that uses it
parsing_instruction = """The provided document is an annual report filed by Meta Platforms, Inc. with the Securities and Exchange Commission (SEC).
This form provides detailed financial information about the company's performance for a specific year.
It includes financial statements, management discussion and analysis, and other relevant disclosures required by the SEC.
It contains many tables and some signature pages.

Extract the signatures as a table.
"""

# setup parser
parser = LlamaParse(
    result_type="markdown",
    parsing_instruction=parsing_instruction
)

# load and parse the document
file_extractor = {".pdf": parser}
llama_parse_documents = SimpleDirectoryReader(
    input_files=['data/Meta_10k.pdf'],
    file_extractor=file_extractor
).load_data()

# save markdown file ("w" overwrites, so re-running this cell does not duplicate the text)
data_file = "./data/output.md"
with open(data_file, "w") as f:
    for doc in llama_parse_documents:
        f.write(doc.text + '\n')

Now we'll set up the LangChain RAG pipeline with Qdrant.

In [144]:
# import dependencies
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Qdrant
from langchain_community.document_loaders import DirectoryLoader
from langchain_openai.embeddings import OpenAIEmbeddings

# load the document
loader = DirectoryLoader(path='data/', glob="**/*.md", show_progress=True)
documents = loader.load()

# split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)

# instantiate embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# create the vectorstore
qdrant_vector_store = Qdrant.from_documents(
    documents=docs,
    embedding=embeddings,
    location=":memory:",
    collection_name="meta_10k"
)

# setup our retriever
qdrant_retriever = qdrant_vector_store.as_retriever()

100%|██████████| 1/1 [00:04<00:00,  4.01s/it]
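
To make the `chunk_size` and `chunk_overlap` parameters above concrete, here is a minimal pure-Python sketch of fixed-size splitting with overlap. It is a simplification, not LangChain's implementation: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks before falling back to raw characters. The function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows sharing chunk_overlap characters.

    Hypothetical helper for illustration only; it ignores separators entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not made up only of overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Under this simplified scheme, the notebook's settings (`chunk_size=2000`, `chunk_overlap=100`) would start each chunk 1,900 characters after the previous one.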


Set up the RAG prompt

In [188]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """
CONTEXT:
{context}

QUERY:
{question}

The provided context is an annual report filed by Meta Platforms, Inc. with the Securities and Exchange Commission (SEC).
This form provides detailed financial information about the company's performance for a specific year.
It includes financial statements, management discussion and analysis, and other relevant disclosures required by the SEC.
It contains many tables and some signature pages. All members of the board need to sign the document.

Answer the query above only using the context provided. If you don't know the answer, simply say 'I don't know'.
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

Finally, we create our chain...

In [189]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-3.5-turbo")

rag_chain = (
    {"question": itemgetter("question"), "context": itemgetter("question") | qdrant_retriever}
    | RunnablePassthrough().assign(context=itemgetter("context"))
    | {"response":rag_prompt | chat_model | StrOutputParser(), "context": itemgetter("context")}
)

In [190]:
# query the rag_chain
query1 = "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"
response = rag_chain.invoke({"question": query1})
print(response['response'])

The total value of 'Cash and cash equivalents' as of December 31, 2023, was $41,862 million.
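
The pipelines in this notebook report this figure in different forms ("$14.681 billion", "$65.40 billion", "$41,862", "$41,862 million"), which makes side-by-side comparison error-prone. A minimal sketch of normalizing such strings to whole dollars follows. `parse_usd` and its regular expression are hypothetical helpers, not part of LangChain or LlamaIndex, and they only handle the formats seen in these outputs.

```python
import re
from decimal import Decimal
from typing import Optional

# Multipliers for the unit words that appear in the answers (assumed set).
_UNITS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}
_AMOUNT_RE = re.compile(
    r"\$\s*(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion)?",
    re.IGNORECASE,
)


def parse_usd(answer: str) -> Optional[int]:
    """Return the first dollar amount in `answer` as whole dollars, or None."""
    match = _AMOUNT_RE.search(answer)
    if match is None:
        return None
    number = Decimal(match.group(1).replace(",", ""))
    unit = (match.group(2) or "").lower()
    # A missing unit word is taken literally, i.e. as plain dollars.
    return int(number * _UNITS.get(unit, 1))


answers = [
    "The total value was $14.681 billion.",
    "$65.40 billion",
    "The total value was $41,862.",
    "The total value was $41,862 million.",
]
for text in answers:
    print(parse_usd(text), "<-", text)
```

Parsed this way, the answer that drops the unit word ("$41,862") comes out six orders of magnitude smaller than "$41,862 million", so a numeric comparison flags it even though the digits match.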


In [191]:
for context in response['context']:
    print(context)

page_content='Due within one year $ December 31, 2023 7,120 Due after one year to five years 16,421 Total 23,541\n\nInstruments Measured at Fair Value on Non-recurring Basis\n\nOur non-marketable equity securities accounted for using the measurement alternative are measured at fair value on a non-recurring basis and are classified within Level 3 of the fair value hierarchy because we use significant unobservable inputs to estimate their fair value. Assets remeasured at fair value on a non-recurring basis within Level 3 during the years ended December 31, 2023 and 2022 were $53 million and $198 million, respectively. For additional information, see Note 6 — Non-marketable Equity Securities.\n\nNote 6. Non-marketable Equity Securities\n\nOur non-marketable equity securities are investments in privately-held companies without readily determinable fair values. The following table summarizes our non-marketable equity securities that were measured using measurement alternative and equity met

In [193]:
# query the rag_chain
query2 = "Who are Meta's 'Directors' (i.e., members of the Board of Directors)?"
response = rag_chain.invoke({"question": query2})
print(response['response'])

The members of Meta's Board of Directors are listed as follows in the context provided:
- Mark Zuckerberg
- Susan Li
- Aaron Anderson
- Peggy Alford
- Marc L. Andreessen
- Andrew W. Houston
- Nancy Killefer
- Robert M. Kimmitt
- Sheryl K. Sandberg
- Tracey T. Travis
- Tony Xu


In [194]:
for context in response['context']:
    print(context)

page_content='Sheryl Sandberg\nChief Operating Officer\nMarch 31, 2022\n\nMeta Platforms, Inc. - Annual Report\n\nTable of Contents\n\nItem 1B. Unresolved Staff Comments\n\nNone.\n\nItem 1C. Cybersecurity\n\nAt Meta, cybersecurity risk management is an important part of our overall risk management efforts. Our industry is prone to cybersecurity threats and attacks, and we regularly experience cybersecurity incidents of varying degrees. We believe we are a particularly attractive target as a result of our prominence and scale, the types and volume of personal data and content on our systems, and the evolving nature of our products and services. Our products and services reach billions of users and involve the collection, storage, processing, and transmission of a large amount of data. In addition, our business and operations span numerous geographies around the world, involve thousands of employees, contractors, vendors, developers, partners, and other third parties, and rely on softwar

## RAG with Llama Parse + Recursive Query Engine + Cohere Reranking

In [6]:
!pip install -qU llama-index-postprocessor-cohere-rerank


In [7]:
import os
from getpass import getpass

# set cohere api key
os.environ['COHERE_API_KEY'] = getpass('Cohere API Key:')

Cohere API Key:··········


In [8]:
# setup default llm and embedding model for the rag pipeline
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

In [96]:
# import dependencies
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.postprocessor.cohere_rerank import CohereRerank

# parsing_instruction = """The provided document is an annual report filed by Meta Platforms, Inc. with the Securities and Exchange Commission (SEC).
# This form provides detailed financial information about the company's performance for a specific year.
# It includes financial statements, management discussion and analysis, and other relevant disclosures required by the SEC.
# It contains many tables.

# In the Power of Attorney section of this document, there is a signature table. Make sure to extract all the data from that signature table. Remove all of the '/s/' and '/s' signature prefixes and create a markdown table including the headers 'Signature', 'Title' and 'Date'.
# """

parsing_instruction = """The provided document is an annual report filed by Meta Platforms, Inc. with the Securities and Exchange Commission (SEC).
This form provides detailed financial information about the company's performance for a specific year.
It includes financial statements, management discussion and analysis, and other relevant disclosures required by the SEC.
It contains many tables and some signature pages.

Extract the signatures as a table.
"""

# parsing_instruction = """The provided document is an annual report filed by Meta Platforms, Inc. with the Securities and Exchange Commission (SEC).
# This form provides detailed financial information about the company's performance for a specific year.
# It includes financial statements, management discussion and analysis, and other relevant disclosures required by the SEC.
# It contains many tables.

# There is a signature page inside this document.
# Remove all of the '/s/' and '/s' signature prefixes and create a table from that page, with the headers 'Signature', 'Title' and 'Date'.
# """

# setup parser
parser = LlamaParse(
    result_type="markdown",
    parsing_instruction=parsing_instruction
)

# load and parse the document
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    input_files=['data/Meta_10k.pdf'],
    file_extractor=file_extractor
).load_data()

Started parsing the file under job_id 2c38731c-6e9c-41aa-8fb8-007d54d2d2d8
....

In [97]:
# save the parsed markdown document ("w" overwrites, so re-running this cell does not duplicate the text)
data_file = "./data/output.md"
with open(data_file, "w") as f:
    for doc in documents:
        f.write(doc.text + '\n')

In [98]:
target_page = 132
print(documents[0].text.split("\n---\n")[target_page])

|Signature|Title|Date|
|---|---|---|
|/s/ Mark Zuckerberg|Board Chair and Chief Executive Officer|February 1, 2024|
|/s/ Susan Li|Chief Financial Officer|February 1, 2024|
|/S/ Aaron Anderson|Chief Accounting Officer|February 1, 2024|
|/s/ Peggy Alford|Director|February 1, 2024|
|/s/ Marc L. Andreessen|Director|February 1, 2024|
|/s/ Andrew W. Houston|Director|February 1, 2024|
|/s/ Nancy Killefer|Director|February 1, 2024|
|/s/ Robert M. Kimmitt|Director|February 1, 2024|
|/s/ Sheryl K. Sandberg|Director|February 1, 2024|
|/s/ Tracey T. Travis|Director|February 1, 2024|
|/s/ Tony Xu|Director|February 1, 2024|
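
The signature table above separates officers from directors in its `Title` column, which is what the board-membership question depends on. As a rough sketch, the rows can be parsed and filtered in plain Python. The helper names are hypothetical, and treating any title that mentions "Director" or "Board" as a board member is an assumption made for illustration, not a rule stated in the filing.

```python
import re
from typing import Dict, List

# The table exactly as printed above.
SIGNATURE_TABLE = """\
|Signature|Title|Date|
|---|---|---|
|/s/ Mark Zuckerberg|Board Chair and Chief Executive Officer|February 1, 2024|
|/s/ Susan Li|Chief Financial Officer|February 1, 2024|
|/S/ Aaron Anderson|Chief Accounting Officer|February 1, 2024|
|/s/ Peggy Alford|Director|February 1, 2024|
|/s/ Marc L. Andreessen|Director|February 1, 2024|
|/s/ Andrew W. Houston|Director|February 1, 2024|
|/s/ Nancy Killefer|Director|February 1, 2024|
|/s/ Robert M. Kimmitt|Director|February 1, 2024|
|/s/ Sheryl K. Sandberg|Director|February 1, 2024|
|/s/ Tracey T. Travis|Director|February 1, 2024|
|/s/ Tony Xu|Director|February 1, 2024|
"""


def parse_markdown_table(table: str) -> List[Dict[str, str]]:
    """Turn a simple pipe-delimited markdown table into a list of row dicts."""
    lines = [line.strip() for line in table.strip().splitlines() if line.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the header row and the |---| separator row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def board_members(rows: List[Dict[str, str]]) -> List[str]:
    """Keep rows whose title mentions 'director' or 'board' (an assumption)."""
    names = []
    for row in rows:
        title = row["Title"].lower()
        if "director" in title or "board" in title:
            # Drop the '/s/' signature prefix, in either letter case.
            names.append(re.sub(r"^/s/?\s*", "", row["Signature"], flags=re.IGNORECASE))
    return names


directors = board_members(parse_markdown_table(SIGNATURE_TABLE))
print(directors)
```

Under that assumption the filter keeps nine names and leaves out the two signatories whose titles are purely officer roles (Chief Financial Officer and Chief Accounting Officer).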


In [99]:
# setup markdown node parser
node_parser = MarkdownElementNodeParser(llm=OpenAI(model="gpt-3.5-turbo"), num_workers=8)

# parse the markdown document
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

140it [00:00, 14999.94it/s]
columns
  field required (type=value_error.missing)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 482, in _agive_response_single
    structured_response = await program.acall(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 92, in acall
    answer = await self._llm.astructured_predict(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/instrumentation/dispatcher.py", line 307, in async_wrapper
    result = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/llms/llm.py", line 391, in astructured_predict
    result = await program.acall(**prompt_args)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/base.py", line 223, in acall
    return _parse_tool_calls(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/ba

In [100]:
len(nodes)

424

In [101]:

len(base_nodes)

144

In [120]:
base_nodes[128]

TextNode(id_='942c6da5-6ede-4257-84aa-751bf9325af2', embedding=None, metadata={'file_path': 'data/Meta_10k.pdf', 'file_name': 'Meta_10k.pdf', 'file_type': 'application/pdf', 'file_size': 2481466, 'creation_date': '2024-05-02', 'last_modified_date': '2024-02-02'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='c048b20b-fd07-44df-b07b-462f0c623390', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': 'data/Meta_10k.pdf', 'file_name': 'Meta_10k.pdf', 'file_type': 'application/pdf', 'file_size': 2481466, 'creation_date': '2024-05-02', 'last_modified_date': '2024-02-02'}, hash='c3c23ef0027388046e3d5392d01653102bc476918c76dc19aa33c0b3b483a654'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='4c310d89-5adc-4067-b4db-6d1ad0b9df13', node_type=<ObjectType.TEXT: '1'>, metadata={'table_df': "{'Depreciation and amortization': {0: 'Right-of-use assets', 1: 'Total deferred tax liabilities', 2

In [121]:
base_nodes[128].text

'The valuation allowance was approximately $2.88 billion and $2.49 billion as of December 31, 2023 and 2022, respectively, primarily related to U.S. state tax credit carryforwards, U.S. foreign tax credits, unrealized losses in marketable securities, and certain foreign tax attributes for which we do not believe a tax benefit is more likely than not to be realized.\n\nAs of December 31, 2023, the U.S. federal and state net operating loss carryforwards were $200 million and $2.78 billion, which will begin to expire in 2035 and 2031, respectively, if not utilized. We have federal tax credit carryforwards of $490 million, which will begin to expire in 2029, if not utilized, and state tax credit carryforwards of $4.08 billion, most of which do not expire.\n\nUtilization of our net operating loss and tax credit carryforwards may be subject to substantial annual limitations due to the ownership change limitations provided by the Internal Revenue Code and similar state provisions. Such annual

In [122]:
len(objects)

140

In [134]:
objects[126]

IndexNode(id_='c61cab3e-87f1-44c8-a80e-2abc77f79c27', embedding=None, metadata={'col_schema': 'Column: Depreciation and amortization\nType: Numeric\nSummary: Values related to depreciation and amortization\n\nColumn: Right-of-use assets\nType: Numeric\nSummary: Values related to right-of-use assets\n\nColumn: Total deferred tax liabilities\nType: Numeric\nSummary: Values related to total deferred tax liabilities\n\nColumn: Net deferred tax assets\nType: Numeric\nSummary: Values related to net deferred tax assets', 'file_path': 'data/Meta_10k.pdf', 'file_name': 'Meta_10k.pdf', 'file_type': 'application/pdf', 'file_size': 2481466, 'creation_date': '2024-05-02', 'last_modified_date': '2024-02-02'}, excluded_embed_metadata_keys=['col_schema'], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='c048b20b-fd07-44df-b07b-462f0c623390', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': 'data/Meta_10k.pdf', 'file_name': 'Meta_10k.pdf

In [123]:
objects[120].text

"The table contains information about signatures, titles, and dates. The existing table id is 'id_1048'. The table should be kept.,\nwith the following columns:\n"

In [126]:
objects[120].obj.metadata

{'table_df': "{'Name': {0: '[Signature Name 1]', 1: '[Signature Name 2]'}, 'Title': {0: '[Title 1]', 1: '[Title 2]'}, 'Date': {0: '[Date 1]', 1: '[Date 2]'}}",
 'table_summary': "The table contains information about signatures, titles, and dates. The existing table id is 'id_1048'. The table should be kept.,\nwith the following columns:\n",
 'file_path': 'data/Meta_10k.pdf',
 'file_name': 'Meta_10k.pdf',
 'file_type': 'application/pdf',
 'file_size': 2481466,
 'creation_date': '2024-05-02',
 'last_modified_date': '2024-02-02'}

In [135]:
# setup reranker
cohere_rerank = CohereRerank(top_n=5)

# create an index from the parsed markdown
index = VectorStoreIndex(nodes=base_nodes+objects)

# create a new query engine for the index
recursive_query_engine = index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[cohere_rerank],
    verbose=True
)
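
The query engine above works in two stages: it retrieves `similarity_top_k=15` candidates and then lets the reranker keep the best `top_n=5`. The toy sketch below mirrors that shape with stand-in scoring functions. The word-overlap and Jaccard scorers are assumptions chosen to keep the example self-contained; they are not what the embedding model or Cohere Rerank compute, and all function names are hypothetical.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def overlap_score(query: str, doc: str) -> float:
    """Cheap first-stage score: number of distinct query words found in the doc."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def jaccard_score(query: str, doc: str) -> float:
    """Stand-in for a reranker: Jaccard similarity of the two word sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    first_stage: Scorer,
    reranker: Scorer,
    similarity_top_k: int,
    top_n: int,
) -> List[str]:
    # Stage 1: keep a generous shortlist using the cheap score.
    shortlist = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)
    shortlist = shortlist[:similarity_top_k]
    # Stage 2: re-order only the shortlist with the second scorer.
    return sorted(shortlist, key=lambda d: reranker(query, d), reverse=True)[:top_n]


docs = [
    "cash and cash equivalents were 41,862 million",
    "cash flow from operating activities",
    "the board of directors met in december",
    "marketable securities and cash",
]
top = retrieve_then_rerank(
    "cash and cash equivalents", docs, overlap_score, jaccard_score,
    similarity_top_k=3, top_n=2,
)
print(top)
```

The point of the split is cost: the second scorer only ever sees the shortlist, so it can be slower or more expensive than the first-stage score.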

In [136]:
# query the engine
query1 = "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"
response = recursive_query_engine.query(query1)
print(response)

[1;3;38;2;11;159;203mRetrieval entering 590db1d7-47a6-444a-b315-97704bc04138: TextNode
[0m[1;3;38;2;237;90;200mRetrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
[0m[1;3;38;2;11;159;203mRetrieval entering 72475249-f252-4d2a-86a9-07d7ea48160b: TextNode
[0m[1;3;38;2;237;90;200mRetrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
[0m[1;3;38;2;11;159;203mRetrieval entering b9e72c08-6383-43e3-8bb9-0d6b3e60225a: TextNode
[0m[1;3;38;2;237;90;200mRetrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
[0m[1;3;38;2;11;159;203mRetrieval entering b8050b7f-1e3c-495b-b12d-5d16364edfdd: TextNode
[0m[1;3;38;2;237;90;200mRetrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
[0m[1;3;38;2;11;159;203mRetrieval 

In [143]:
# query the engine
query2 = "Who are Meta's 'Directors' (i.e., members of the Board of Directors)?"
response = recursive_query_engine.query(query2)
print(response)

[1;3;38;2;11;159;203mRetrieval entering ac8366a1-9e82-4b92-8e74-3d577bcd8d0c: TextNode
[0m[1;3;38;2;237;90;200mRetrieving from object TextNode with query Who are Meta's 'Directors' (i.e., members of the Board of Directors)?
[0mThe members of Meta's Board of Directors are not explicitly mentioned in the provided context information.


In [93]:
from google.colab import files
parsed_md = files.upload()

Saving output.md to output.md


## RAG with Llama Parse + Recursive Query Engine + FlagEmbedding Reranker (powered by BAAI/bge-reranker-large)

In [41]:
!pip install -qU llama-index-postprocessor-flag-embedding-reranker git+https://github.com/FlagOpen/FlagEmbedding.git



In [39]:
# setup default llm and embedding model for the rag pipeline
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

In [40]:
# import dependencies
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import MarkdownElementNodeParser

parsing_instruction = """The provided document is an annual report filed by Meta Platforms,
        Inc. with the Securities and Exchange Commission (SEC).
        This form provides detailed financial information about the company's performance for a specific year.
        It includes unaudited financial statements, management discussion and analysis, and other relevant disclosures required by the SEC.
        It contains many tables.
        Try to be precise while answering the questions"""

# setup parser
parser = LlamaParse(
    result_type="markdown",
    parsing_instruction=parsing_instruction
)

# load and parse the document
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    input_files=['data/Meta_10k.pdf'],
    file_extractor=file_extractor
).load_data()

# setup markdown node parser
node_parser = MarkdownElementNodeParser(llm=OpenAI(model="gpt-3.5-turbo"), num_workers=8)

# parse the markdown document
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

Started parsing the file under job_id 4afdad10-d3c2-4cc9-9dd2-d5678e00a872


65it [00:00, 19248.08it/s]
columns
  field required (type=value_error.missing)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 482, in _agive_response_single
    structured_response = await program.acall(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/response_synthesizers/refine.py", line 92, in acall
    answer = await self._llm.astructured_predict(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/instrumentation/dispatcher.py", line 307, in async_wrapper
    result = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/core/llms/llm.py", line 391, in astructured_predict
    result = await program.acall(**prompt_args)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/base.py", line 223, in acall
    return _parse_tool_calls(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/program/openai/bas

In [42]:
len(nodes)

255

In [43]:
len(base_nodes)

125

In [44]:
base_nodes[0].text

'Meta Platforms, Inc. - Annual Report 2023\n\n Meta Platforms, Inc. - Annual Report 2023\n\n SECURITIES AND EXCHANGE COMMISSION\n\nWashington, D.C. 20549\n\n FORM 10-K\n\nANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n\nFor the fiscal year ended December 31, 2023\n\nCommission File Number: 001-35551\n\n Meta Platforms, Inc. (Meta)\n\nDelaware | 20-1665019\n\n1 Meta Way, Menlo Park, California 94025\n\nTelephone: (650) 543-4800\n\n Securities Registered'

In [45]:
objects[25].text

'Financial performance summary including revenue, expenses, and net income for the years 2020, 2021, and 2022.,\nwith the following table title:\nFinancial Performance Summary,\nwith the following columns:\n- Year Ended December 31: None\n- 2022: None\n- 2021: None\n- 2020: None\n'

In [46]:
objects[25].obj.metadata

{'table_df': "{' ': {0: 'Revenue', 1: 'Cost of Revenue', 2: 'Research and Development', 3: 'Marketing and Sales', 4: 'General and Administrative', 5: 'Total Costs and Expenses', 6: 'Income from Operations', 7: 'Interest and Other Income (Expense), Net', 8: 'Income Before Provision for Income Taxes', 9: 'Provision for Income Taxes', 10: 'Net Income'}, 'Year Ended December 31': {0: '$2,023,134.902', 1: '$25,959', 2: '$38,483', 3: '$12,301', 4: '$11,408', 5: '$88,151', 6: '$46,751', 7: '$677', 8: '$47,428', 9: '$8,330', 10: '$39,098'}, '2022': {0: '$2,022,116.609', 1: '$25,249', 2: '$35,338', 3: '$15,262', 4: '$11,816', 5: '$87,665', 6: '$28,944', 7: '($125)', 8: '$28,819', 9: '$5,619', 10: '$23,200'}, '2021': {0: '$1,117,929', 1: '$22,649', 2: '$24,655', 3: '$14,043', 4: '$9,829', 5: '$71,176', 6: '$46,753', 7: '$531', 8: '$47,284', 9: '$7,914', 10: '$39,370'}, '2020': {0: ' ', 1: ' ', 2: ' ', 3: ' ', 4: ' ', 5: ' ', 6: ' ', 7: ' ', 8: ' ', 9: ' ', 10: ' '}}",
 'table_summary': 'Financia
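
The `table_df` value above is a string holding a Python dict literal, not a DataFrame, so it needs to be parsed before the cells can be checked. The following is a minimal sketch of one way to do that with the standard library; the helper names `parse_table_df` and `table_rows` are hypothetical, and `sample` is a two-row, two-column excerpt copied from the output above rather than the full table.

```python
import ast


def parse_table_df(table_df_str):
    """Parse the stringified dict stored under 'table_df' into {column: {row_index: cell}}."""
    parsed = ast.literal_eval(table_df_str)  # evaluates literals only, never arbitrary code
    if not isinstance(parsed, dict):
        raise ValueError("expected a dict literal")
    return parsed


def table_rows(columns):
    """Yield one list of cells per row, following the column order of the dict."""
    names = list(columns)
    row_ids = sorted(columns[names[0]]) if names else []
    for row_id in row_ids:
        yield [columns[name].get(row_id, "") for name in names]


# Excerpt of the metadata string shown above (rows 0 and 10, first column and '2022').
sample = "{' ': {0: 'Revenue', 10: 'Net Income'}, '2022': {0: '$2,022,116.609', 10: '$23,200'}}"

columns = parse_table_df(sample)
rows = list(table_rows(columns))
for row in rows:
    print(" | ".join(row))
```

In the notebook, the same helper could be called as `parse_table_df(objects[25].obj.metadata["table_df"])`. Printing the rows this way makes parsing problems easier to spot: the Revenue cells, for example, look as if a year label was fused onto the figure, so it is worth comparing parsed tables against the filing before trusting answers built on them.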

In [47]:
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

# set up the reranker: keep the 5 best of the retrieved candidates
reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large"
)

# create an index from the parsed markdown
index = VectorStoreIndex(nodes=base_nodes+objects)

# create a new query engine for the index
recursive_query_engine = index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[reranker],
    verbose=True
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [48]:
# query the engine
query1 = "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"
response = recursive_query_engine.query(query1)
print(response)

Retrieval entering 4115a9b4-2577-482b-8483-19c813574220: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
Retrieval entering 110037b5-a681-4135-a21a-da0a977e41de: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
Retrieval entering 6b8d35be-e7bd-40ac-b036-ea6c9c64ce64: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
Retrieval entering ac5228c1-3d55-406a-8711-84e37caf5170: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
The total value of 'Cash and cas

In [49]:
# query the engine
query2 = "Who are Meta's 'Directors' (i.e., members of the Board of Directors)?"
response = recursive_query_engine.query(query2)
print(response)

The members of Meta's Board of Directors are individuals who hold positions such as the chairman of the board of directors, the chief executive officer, or the president.
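
The answer above describes roles rather than naming anyone, which suggests the chunks that reached the LLM may not have contained the list of directors. One way to check is to look at the nodes attached to the response. The sketch below assumes the response object exposes `source_nodes`, where each entry has a `score` and a `node` with `get_content()`, as LlamaIndex responses do; the helper name `summarize_sources` is hypothetical, and the stub classes exist only so the block runs on its own with made-up placeholder text.

```python
def summarize_sources(response, max_chars=200):
    """Return one line per retrieved node: rank, score, and the start of its text."""
    lines = []
    hits = getattr(response, "source_nodes", None) or []
    for rank, hit in enumerate(hits, start=1):
        score = "n/a" if hit.score is None else f"{hit.score:.3f}"
        text = hit.node.get_content().replace("\n", " ")
        lines.append(f"{rank}. score={score} {text[:max_chars]}")
    return lines


# Stand-ins for the real response and node types, with placeholder content.
class _StubNode:
    def __init__(self, text):
        self._text = text

    def get_content(self):
        return self._text


class _StubHit:
    def __init__(self, text, score):
        self.node = _StubNode(text)
        self.score = score


class _StubResponse:
    def __init__(self, hits):
        self.source_nodes = hits


demo = _StubResponse([
    _StubHit("first placeholder chunk\nwith a line break", 0.87),
    _StubHit("second placeholder chunk", None),
])
demo_lines = summarize_sources(demo, max_chars=80)
print("\n".join(demo_lines))
```

In the notebook, `print("\n".join(summarize_sources(response)))` after the query would list what the reranker passed through. If none of the listed chunks mention the directors, the problem is in retrieval (for example `similarity_top_k`, `top_n`, or how the document was parsed) rather than in the LLM's answer synthesis.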
