# RAG Retrieval Optimization - Sentence Window Parsing Technique with Amazon Bedrock and LlamaIndex

Sentence window is another technique that enhances the retrieval process by focusing on individual sentences while providing surrounding context. In this approach, documents are parsed into single sentences, each with a "window" of surrounding sentences. During retrieval, the system finds the most relevant individual sentences. However, instead of using only these single sentences, it replaces them with their corresponding windows, which include a specified number of sentences before and after the retrieved sentence. This method allows for more fine-grained retrieval of specific information while still providing necessary context, potentially improving the relevance and coherence of the generated responses.
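To make the idea concrete, here is a minimal, library-free sketch of how each sentence can be paired with its window. The helper names `split_sentences` and `build_sentence_windows` are hypothetical, and the regex-based splitter is a simplification. This is not how LlamaIndex implements the parser; it only illustrates the data shape.

```python
import re
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_sentence_windows(text: str, window_size: int = 1) -> List[Dict[str, str]]:
    """Pair every sentence with up to `window_size` neighbors on each side."""
    sentences = split_sentences(text)
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)                  # clamp at the first sentence
        end = min(len(sentences), i + window_size + 1)   # clamp at the last sentence
        nodes.append(
            {
                "original_text": sentence,               # what gets embedded and searched
                "window": " ".join(sentences[start:end]),  # what is handed to the LLM
            }
        )
    return nodes


text = "Alpha is first. Beta is second. Gamma is third. Delta is fourth."
nodes = build_sentence_windows(text, window_size=1)
for node in nodes:
    print(node["original_text"], "->", node["window"])
```

Sentences at the start or end of a document simply get a shorter window, because the slice is clamped to the document boundaries.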

In this lab, we demonstrate how to use the sentence window technique for post-retrieval processing with LlamaIndex. Specifically, we use the SentenceWindowNodeParser module to split Amazon's SEC filing documents into individual sentences, creating a node for each sentence and storing a configurable "window" of surrounding sentences in the node's metadata. We then use the MetadataReplacementPostProcessor module to replace each retrieved sentence with its associated "window" metadata, which gives the LLM more context for final response generation.
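The post-retrieval step described above can be sketched without any library. In this toy version, word overlap stands in for embedding similarity, and the functions `score`, `retrieve`, and `replace_with_window` are hypothetical names chosen for illustration. The sample sentences are made up. The sketch mirrors the idea behind metadata replacement, not LlamaIndex's actual implementation.

```python
from typing import Dict, List


def score(query: str, sentence: str) -> int:
    """Toy relevance score (assumption): count of lowercase words shared with the query."""
    query_words = set(query.lower().split())
    sentence_words = set(sentence.lower().replace(".", "").split())
    return len(query_words & sentence_words)


def retrieve(query: str, nodes: List[Dict], top_k: int = 1) -> List[Dict]:
    """Rank nodes by their single-sentence text and keep the best `top_k`."""
    return sorted(nodes, key=lambda n: score(query, n["text"]), reverse=True)[:top_k]


def replace_with_window(nodes: List[Dict], target_key: str = "window") -> List[Dict]:
    """Swap each node's text for the window stored in its metadata, if present."""
    return [{**n, "text": n["metadata"].get(target_key, n["text"])} for n in nodes]


# Made-up nodes: one sentence each, with a one-sentence window on either side.
nodes = [
    {"text": "The warehouse opened in spring.",
     "metadata": {"window": "The warehouse opened in spring. It stores ten thousand items."}},
    {"text": "It stores ten thousand items.",
     "metadata": {"window": "The warehouse opened in spring. It stores ten thousand items. "
                            "Staff work in two shifts."}},
    {"text": "Staff work in two shifts.",
     "metadata": {"window": "It stores ten thousand items. Staff work in two shifts."}},
]

hits = retrieve("how many items", nodes, top_k=1)   # matching happens on single sentences
context = replace_with_window(hits)                 # the LLM would receive the wider window
print("Matched sentence:", hits[0]["text"])
print("Context passed on:", context[0]["text"])
```

Matching on single sentences keeps retrieval precise, while the replacement step restores the surrounding context before generation.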

Here are the components we used:

- Vector Database (Faiss / local)
- LLM (Amazon Bedrock - Amazon Nova Pro)
- Embeddings Model (Bedrock Titan Text Embeddings v2.0)
- Datasets (Amazon's 10-K SEC filings from 2022 and 2023)
- LlamaIndex SentenceWindowNodeParser (this example is built on the reference LlamaIndex documentation available at https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/MetadataReplacementDemo/)


Small to Large Retrieval (Reference - https://docs.llamaindex.ai/en/stable/optimizing/production_rag/)

![Sentence window retrieval diagram](sentence-window.png)

## Pre-req
You must run the [`workshop_setup.ipynb`](../lab00-setup/workshop_setup.ipynb) notebook in `lab00-setup` before starting this lab.

In [1]:
import warnings
warnings.warn("Warning: if you did not run lab00-setup, please go back and run the lab00 notebook") 



### > Setup
We start by importing the necessary LlamaIndex libraries.

In [2]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.core import Settings
from IPython.display import Markdown, display, HTML
from termcolor import colored

In [None]:
import os
from pathlib import Path

# Define the config content
config_content = """[profile default]
region = us-west-2
output = json
"""

# Get the path to the .aws directory and config file
home_dir = "/home/sagemaker-user"  # SageMaker specific path
aws_dir = os.path.join(home_dir, ".aws")
config_path = os.path.join(aws_dir, "config")

# Check if config file already exists
if os.path.exists(config_path):
    print(f"AWS config file already exists at {config_path}. No changes made.")
else:
    # Create the .aws directory if it doesn't exist
    os.makedirs(aws_dir, exist_ok=True)
    
    # Create the config file with the content
    with open(config_path, "w") as f:
        f.write(config_content + "\n")
    
    print(f"Created directory {aws_dir} and created AWS config file at {config_path}")

We select Amazon Nova Pro as our LLM. For the embedding model, we select Amazon Titan Text Embeddings v2.0.
Note that we are using LlamaIndex's SentenceWindowNodeParser.

In [3]:
import json
from typing import Sequence, List
from llama_index.core.settings import Settings
from llama_index.llms.bedrock_converse import BedrockConverse
from llama_index.embeddings.bedrock import BedrockEmbedding, Models
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.node_parser import SentenceSplitter

from llama_index.core.llms import ChatMessage
from llama_index.core.tools import BaseTool, FunctionTool
import nest_asyncio

# create the sentence window node parser w/ default settings
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# base node parser is a sentence splitter
text_splitter = SentenceSplitter()

profile_name = "default"

# define the LLM
llm = BedrockConverse(
    model="us.amazon.nova-pro-v1:0",
    profile_name=profile_name,
)

embed_model = BedrockEmbedding(model = "amazon.titan-embed-text-v2:0")

Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 256
Settings.text_splitter = text_splitter

nest_asyncio.apply()

### > Document Ingestion
We ingest and index the data stored in the data directory. The `amazon` folder contains SEC 10-K filings from 2022 and 2023.

In [4]:
# load data
amazon_secfiles = SimpleDirectoryReader(input_dir="../data/lab03/amazon/").load_data()

### > Build the vector indexes

We want to compare the quality of generation with and without sentence windows.
To do that, we first create a standard index to show naive RAG, then use `node_parser`
to build a second index whose nodes carry the expanded window.

**Without** sentence window

In [5]:
base_nodes = text_splitter.get_nodes_from_documents(amazon_secfiles)

In [6]:
base_index = VectorStoreIndex(base_nodes)

**With `SentenceWindowNodeParser`** from above

Preparing this index may take up to 5 minutes.

In [7]:
nodes = node_parser.get_nodes_from_documents(amazon_secfiles)

sentence_window_index = VectorStoreIndex(nodes)

### Test Using Naive RAG

In [8]:
query = "Whats Amazons ownership stake in Rivian?"

In [9]:
query_engine = base_index.as_query_engine(similarity_top_k=2)
naive_response = query_engine.query(query)

print(colored(naive_response, "green"))

Amazon's equity investment in Rivian had a fair value of $15.6 billion as of December 31, 2021, and $2.9 billion as of December 31, 2022. However, the specific ownership percentage is not provided in the context.


### > Test RAG Using Sentence Window

In [10]:
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

query_engine = sentence_window_index.as_query_engine(
    similarity_top_k=2,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
sentence_window_response = query_engine.query(query)

print(colored(sentence_window_response, "green"))

Amazon holds an approximate 16% ownership interest in Rivian.


### > Display the results side-by-side 

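Notice that the answer produced with the sentence window is more accurate and more relevant: it states the ownership percentage, which the naive RAG answer could not find in its context.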

In [11]:
import pandas as pd

# Create the first table
df = pd.DataFrame({
    'Sentence Window': [query, sentence_window_response],
    'Naive RAG': [query, naive_response]
})

output=""
output += df.style.hide()._repr_html_()
# output += "&nbsp;"

display(HTML(output))

| Sentence Window | Naive RAG |
| --- | --- |
| Whats Amazons ownership stake in Rivian? | Whats Amazons ownership stake in Rivian? |
| Amazon holds an approximate 16% ownership interest in Rivian. | Amazon's equity investment in Rivian had a fair value of $15.6 billion as of December 31, 2021, and $2.9 billion as of December 31, 2022. However, the specific ownership percentage is not provided in the context. |


### > Review the Sentence and Window

Let's take a look at the retrieved sentences and the corresponding windows used as context.

In [12]:
for source_node in sentence_window_response.source_nodes:
    
    print(colored("\nSentence: \n", "green"))

    print(source_node.node.metadata["original_text"])
    
    print(colored("\nWindow: \n", "green"))
    print(source_node.node.metadata["window"])

    print(colored("\n-------------------\n", "green"))

Sentence:
We determined that we have the ability to exercise significant influence over Rivian through our
equity investment, our commercial arrangement for the purchase of electric vehicles and jointly-owned intellectual property, and one of our employees serving
on Rivian’s board of directors. 
Window:
(“Rivian”).  Our investment in Rivian’s preferred stock was accounted for at cost, with adjustments for
observable changes in prices or impairments, prior to Rivian’s initial public offering in November 2021, which resulted in the conversion of our preferred stock
to Class A common stock.  As of December 31, 2023, we held 158 million shares of Rivian’s Class A common stock, representing an approximate 16%
ownership interest, and an approximate 15% voting interest.  We determined that we have the ability to exercise significant influence over Rivian through our
equity investment, our commercial arrangement for the purchase of electric vehicles and jointly-owned i