# RAG Retrieval Optimization - Sentence Window Parsing Technique with Amazon Bedrock and LlamaIndex

# Small to big retrieval

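In this tutorial, we showcase sentence window retrieval, a small-to-big technique for answering queries that depend on fine-grained details.
The `SentenceWindowNodeParser` splits each document into single-sentence nodes and stores a window of the surrounding sentences in each node's metadata. Retrieval matches the query against the individual sentences; before the results reach the LLM, a `MetadataReplacementPostProcessor` swaps each retrieved sentence for its window, so the synthesized response has the surrounding context to work with.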

- Vector Database (Faiss / local)
- LLM (Amazon Bedrock - Claude 3 Sonnet)
- Embeddings Model (Bedrock Titan Text Embeddings v2.0)
- Datasets (Amazon's 10-K SEC filings for 2022 and 2023)
- LlamaIndex SentenceWindowNodeParser (this example builds on the LlamaIndex reference documentation at https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/MetadataReplacementDemo/)
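
To make the sentence window idea concrete, here is a minimal, library-free sketch of what a sentence window parser produces. It is not the LlamaIndex implementation: the function name `build_sentence_windows`, the regex-based sentence splitter, and the plain-dict node layout are assumptions made for illustration. Only the metadata keys `window` and `original_text` are taken from this notebook.

```python
import re


def build_sentence_windows(text, window_size=3):
    """Split text into sentences and attach a window of neighbours to each one.

    Hypothetical helper for illustration; LlamaIndex uses its own splitter.
    """
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        # Take up to `window_size` sentences on each side of sentence i.
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "text": sentence,  # the part that would be embedded
                "metadata": {
                    "original_text": sentence,
                    "window": " ".join(sentences[start:end]),
                },
            }
        )
    return nodes


sample = "A is first. B is second. C is third. D is fourth."
for node in build_sentence_windows(sample, window_size=1):
    print(node["text"], "->", node["metadata"]["window"])
```

Each node stays small enough to embed precisely, while the `window` entry carries the context that is later handed to the LLM.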


In [None]:
#!pip install llama-index
#%pip install llama-index-llms-bedrock
#%pip install llama-index-embeddings-bedrock
#!pip uninstall pydantic -y
#!pip install pydantic
#%pip install sqlalchemy==2.0.21 --force-reinstall --quiet
#%pip install llama-index-embeddings-instructor

In [2]:
from llama_index.embeddings.bedrock import BedrockEmbedding

In [3]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.core import Settings

In [4]:
import json
from typing import Sequence, List
from llama_index.core.settings import Settings
from llama_index.llms.bedrock import Bedrock
from llama_index.embeddings.bedrock import BedrockEmbedding, Models
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.node_parser import SentenceSplitter

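# create the sentence window node parser; window_size is the number of
# sentences captured on each side of the matched sentence
# (these values are the parser's defaults, spelled out explicitly)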
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# base node parser is a sentence splitter
text_splitter = SentenceSplitter()

llm = Bedrock(model="anthropic.claude-3-sonnet-20240229-v1:0")
embed_model = BedrockEmbedding(model="amazon.titan-embed-text-v2:0")

Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 256
Settings.text_splitter = text_splitter

from llama_index.core.llms import ChatMessage
from llama_index.core.tools import BaseTool, FunctionTool
import nest_asyncio
nest_asyncio.apply()

In [None]:
#!mkdir -p 'data/amazon/'
#!wget 'https://s2.q4cdn.com/299287126/files/doc_financials/2023/q4/c7c14359-36fa-40c3-b3ca-5bf7f3fa0b96.pdf' -O 'data/amazon/amazon_2023.pdf'
#!wget 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q4/d2fde7ee-05f7-419d-9ce8-186de4c96e25.pdf' -O 'data/amazon/amazon_2022.pdf'

In [6]:
# load data
amazon_secfiles = SimpleDirectoryReader(input_dir="../data/lab03/amazon/").load_data()

In [7]:
nodes = node_parser.get_nodes_from_documents(amazon_secfiles)

In [8]:
base_nodes = text_splitter.get_nodes_from_documents(amazon_secfiles)

In [9]:
from llama_index.core import VectorStoreIndex

sentence_index = VectorStoreIndex(nodes)

In [10]:
base_index = VectorStoreIndex(base_nodes)

In [11]:
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

query_engine = sentence_index.as_query_engine(
    similarity_top_k=5,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
window_response = query_engine.query(
    "Whats Amazons ownership stake in Rivian??"
)
print(window_response)

According to the information provided, as of December 31, 2022, Amazon held 158 million shares of Rivian's Class A common stock, representing an approximate 17% ownership interest and an approximate 16% voting interest in Rivian. As of December 31, 2023, Amazon's ownership stake in Rivian was approximately 16% with an approximate 15% voting interest.


In [12]:
window = window_response.source_nodes[0].node.metadata["window"]
sentence = window_response.source_nodes[0].node.metadata["original_text"]

print(f"Window: {window}")
print("------------------")
print(f"Original Sentence: {sentence}")

Window: (“Rivian”).  Our investment in Rivian’s preferred stock was accounted for at cost, with adjustments for
observable changes in prices or impairments, prior to Rivian’s initial public offering in November 2021, which resulted in the conversion of our preferred stock
to Class A common stock.  As of December 31, 2023, we held 158 million shares of Rivian’s Class A common stock, representing an approximate 16%
ownership interest, and an approximate 15% voting interest.  We determined that we have the ability to exercise significant influence over Rivian through our
equity investment, our commercial arrangement for the purchase of electric vehicles and jointly-owned intellectual property, and one of our employees serving
on Rivian’s board of directors.  We elected the fair value option to account for our equity investment in Rivian, which is included in “Marketable securities” on
our consolidated balance sheets, and had a fair value of $2.9 billion and $3.7 billion as of December 31,
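
The output above shows that the retrieved node carries both the matched sentence and its wider window. The swap that the postprocessor performs before synthesis can be sketched roughly as follows. This is an illustration with plain dicts, not the library's code: `replace_text_with_metadata` is a hypothetical name, and the fallback to the node's own text when the key is missing is an assumption about sensible behaviour.

```python
def replace_text_with_metadata(nodes, target_metadata_key="window"):
    """Return copies of nodes whose text is replaced by a metadata value.

    Hypothetical helper; nodes are plain dicts with "text" and "metadata".
    """
    replaced = []
    for node in nodes:
        # Assumed fallback: keep the original text if the key is absent.
        new_text = node["metadata"].get(target_metadata_key, node["text"])
        # Build a new dict so the input nodes are left untouched.
        replaced.append({**node, "text": new_text})
    return replaced


retrieved = [
    {"text": "B is second.", "metadata": {"window": "A is first. B is second. C is third."}},
    {"text": "No window here.", "metadata": {}},
]
for node in replace_text_with_metadata(retrieved):
    print(node["text"])
```

The design point is that similarity search runs on the short sentence, while the LLM only ever sees the longer replacement text.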

# Contrast with normal VectorStoreIndex

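Naive RAG over the default sentence-splitter chunks fails to pinpoint the necessary details: in the run below, the answer reports fair values for the investment but not the ownership percentage.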

In [13]:
query_engine = base_index.as_query_engine(similarity_top_k=2)
vector_response = query_engine.query(
    "Whats Amazons ownership stake in Rivian?"
)
print(vector_response)

Based on the information provided, Amazon does not disclose its exact ownership stake in Rivian. However, it is mentioned that Amazon's equity investment in Rivian had a fair value of $15.6 billion as of December 31, 2021 and $2.9 billion as of December 31, 2022. The investment was subject to regulatory sales restrictions resulting in a discount for lack of marketability of approximately $800 million as of December 31, 2021, which expired in Q1 2022.
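
One quick way to tell whether retrieval or the LLM is responsible for a missing detail is to check whether any retrieved chunk mentions the phrase you expect. The helper below is a small sketch for that; `chunks_mentioning` is a hypothetical name, and the two example strings are short excerpts copied from outputs shown in this notebook.

```python
def chunks_mentioning(chunks, phrase):
    """Return the indices of chunks that contain `phrase`, ignoring case."""
    needle = phrase.lower()
    return [i for i, chunk in enumerate(chunks) if needle in chunk.lower()]


example_chunks = [
    "Cash $ 10,666 Level 1 securities: Money market funds 27,899",
    "representing an approximate 16% ownership interest",
]
print(chunks_mentioning(example_chunks, "ownership interest"))  # [1]
```

In this notebook, the chunks could be collected with `[n.node.text for n in vector_response.source_nodes]`; an empty result would suggest that the relevant passage never reached the LLM.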


In [14]:
for source_node in window_response.source_nodes:
    print(source_node.node.metadata["original_text"])
    print("--------")

We determined that we have the ability to exercise significant influence over Rivian through our
equity investment, our commercial arrangement for the purchase of electric vehicles and jointly-owned intellectual property, and one of our employees serving
on Rivian’s board of directors. 
--------
We determined that we have the ability to exercise significant influence over Rivian through our
equity investment, our commercial arrangement for the purchase of electric vehicles, and one of our employees serving on Rivian’s board of directors. 
--------
Required summarized financial information of Rivian as disclosed in its most recent SEC filings is as follows (in millions):
Year Ended
December 31, 2020Year Ended
December 31, 2021Nine Months Ended
September 30, 2022
Revenues $ — $ 55 $ 995 
Gross profit — (465) (2,123)
Loss from operations (1,021) (4,220) (5,061)
Net loss (1,018) (4,688) (5,029)
December 31, 2021 September 30, 2022
Total current assets $ 18,559 $ 14,424 
Total assets 22,294

In [15]:
for node in vector_response.source_nodes:
    print(node.node.text)
    print("--------")

Table of Contents
 December 31, 2022
  Cost or
Amortized
CostGross
Unrealized
GainsGross
Unrealized
LossesTotal
Estimated
Fair Value
Cash $ 10,666 $ — $ — $ 10,666 
Level 1 securities:
Money market funds 27,899 — — 27,899 
Equity securities (1)(3) 3,709 
Level 2 securities:
Foreign government and agency securities 537 — (2) 535 
U.S. government and agency securities 2,301 — (155) 2,146 
Corporate debt securities 23,111 — (484) 22,627 
Asset-backed securities 2,721 — (149) 2,572 
Other fixed income securities 249 — (12) 237 
$ 67,484 $ — $ (802)$ 70,391 
Less: Restricted cash, cash equivalents, and marketable securities (2) (365)
Total cash, cash equivalents, and marketable securities $ 70,026 
___________________
(1)The related unrealized gain (loss) recorded in “Other income (expense), net” was $448 million, $11.6 billion, and $(13.6) billion for the years ended
December 31, 2020, 2021, and 2022.
(2)We are required to pledge or otherwise restrict a portion of our cash, cash equivalent