# RAG Sentence Window Retrieval Optimization
* Notebook by Adam Lang
* Date: 3/23/2025

# Overview
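* This notebook walks through sentence window retrieval for RAG: each sentence is embedded and indexed on its own, and at query time the matched sentence is replaced by a window of its surrounding sentences before the text reaches the LLM.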


# Architecture
* The pipeline is built from the following components:

1. Data
  * To get data, we will load a web page using the `llama-index-readers-web` package.

2. Embeddings
  * We will use open-source Hugging Face embeddings via the LlamaIndex library.

3. LLM
  * We will use OpenAI models through the LlamaIndex APIs.

4. APIs
  * We will use LlamaIndex.

## Install Dependencies

In [1]:
%%capture
!pip install llama-index llama-index-readers-web

In [2]:
%%capture
!pip install llama-index-embeddings-huggingface

In [3]:
%%capture
!pip install llama-index-llms-openai

## Load Data
* We will fetch a single web page and extract its content using the LlamaIndex `SimpleWebPageReader`.
* With `html_to_text=True`, the HTML is converted to text and the page is loaded as a `Document`.

In [4]:
## imports to load webpage
from llama_index.readers.web import SimpleWebPageReader
from IPython.display import Markdown, display # to display in notebook
import os

In [6]:
## load webpage as Document
documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["http://paulgraham.com/worked.html"]
)

In [7]:
## view docs
documents[0]

Document(id_='http://paulgraham.com/worked.html', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='![](https://s.turbifycdn.com/aah/paulgraham/essays-5.gif)|\n![](https://sep.turbifycdn.com/ca/Img/trans_1x1.gif)|\n[![](https://s.turbifycdn.com/aah/paulgraham/essays-6.gif)](index.html)  \n  \n| ![What I Worked On](https://s.turbifycdn.com/aah/paulgraham/what-i-worked-\non-4.gif)  \n  \nFebruary 2021  \n  \nBefore college the two main things I worked on, outside of school, were\nwriting and programming. I didn\'t write essays. I wrote what beginning writers\nwere supposed to write then, and probably still are: short stories. My stories\nwere awful. They had hardly any plot, just characters with strong feelings,\nwhich I imagined made them deep.  \n  \nThe first programs I tried writing were on the IBM 1401

Summary
* The page was scraped and loaded as a single `Document`. Its text begins with image links from the page header, followed by the essay itself.

## Setup LLM & Embedding Models

### LLM Setup

In [8]:
## imports for llm and embeddings
from llama_index.llms.openai import OpenAI
from getpass import getpass
import os

OPENAI_API_KEY = getpass("Enter your Open AI key: ")

Enter your Open AI key: ··········


In [10]:
## set Open AI env variable
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

Note: The full list of OpenAI models available through the API is documented here: https://platform.openai.com/docs/api-reference/models/list

In [11]:
## initialize the LLM model from llama-index
llm = OpenAI(model='gpt-4o-mini-2024-07-18',
             temperature=0.0)

### Embeddings setup
* This is the model I will use: `BAAI/bge-base-en-v1.5`
* Model card: https://huggingface.co/BAAI/bge-base-en-v1.5

In [12]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

## load embedding model of choice
model_embed = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5",
                                   max_length=512)

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: BAAI/bge-base-en-v1.5


INFO:sentence_transformers.SentenceTransformer:2 prompts are loaded, with the keys: ['query', 'text']


## Sentence Window Retriever Setup
* Now we can set up the Sentence Window Retriever.
* From the LlamaIndex docs on `SentenceWindowNodeParser`:
  * "Splits a document into Nodes, with each node being a sentence. Each node contains a window from the surrounding sentences in the metadata."
* Documentation: https://docs.llamaindex.ai/en/v0.10.17/api/llama_index.core.node_parser.SentenceWindowNodeParser.html

In [17]:
## llama index imports
from llama_index.core import Settings ## replaces the deprecated `ServiceContext` from legacy llama-index
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core.node_parser import SentenceWindowNodeParser

Note: Recall that each node holds a single sentence, which is what gets matched against a query. `window_size` is the number of sentences on each side of that sentence that are stored alongside it as metadata.
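To make the windowing concrete, here is a minimal sketch in plain Python. It is not the `SentenceWindowNodeParser` implementation: the regex splitter, the helper names `split_sentences` and `build_sentence_windows`, and the sample text are all assumptions made for illustration. It assumes `window_size` counts the sentences kept on each side of the target sentence.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_sentence_windows(text, window_size=1):
    """Return one record per sentence, each carrying its surrounding window."""
    sentences = split_sentences(text)
    records = []
    for i, sentence in enumerate(sentences):
        # Clamp the window at the start and end of the document.
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        records.append(
            {
                "original_text": sentence,
                "window": " ".join(sentences[start:end]),
            }
        )
    return records


# Hypothetical sample text, not taken from the crawled page.
sample = "I wrote stories. They were awful. Then I tried programming. It went better."
records = build_sentence_windows(sample, window_size=1)
for record in records:
    print(record["original_text"], "->", record["window"])
```

The keys `original_text` and `window` mirror the metadata keys passed to the node parser in this notebook. A larger `window_size` gives the LLM more context per match, at the cost of longer prompts.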

In [14]:
## choose a window size
window_size = 4

### Step 1: Create Node Parser

In [15]:
## node parser setup
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=window_size,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

### Step 2: Configure Settings (formerly Service Context)
* `ServiceContext` is deprecated in current LlamaIndex releases. The global `Settings` object replaces it.
* Migration guide: https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/service_context_migration/

In [18]:
## legacy approach with the deprecated `ServiceContext`, kept for reference
# sent_context = ServiceContext.from_defaults(
#     llm=llm,
#     embed_model=model_embed,
#     node_parser=node_parser,
# )
## current approach: register the components on the global `Settings`
Settings.llm = llm
Settings.embed_model = model_embed
Settings.node_parser = node_parser


### Step 3: Vector Index

In [19]:
# build the index: documents are split by `Settings.node_parser` and embedded with `model_embed`
sent_index = VectorStoreIndex.from_documents(
    documents, embed_model=model_embed,
)



## Create Sentence Window Query Engine

In [20]:
## imports from llama index
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from llama_index.core.postprocessor import SentenceTransformerRerank

In [21]:
## parameters controlling retrieval and reranking
re_rank = True
similarity_top_k = 6  # nodes fetched by vector similarity
rerank_top_n = 4      # nodes kept after reranking
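The two parameters above describe a two-stage pipeline: a cheap first pass keeps `similarity_top_k` candidates, then a more careful scorer keeps the best `rerank_top_n` of those. The sketch below illustrates only that control flow. The scoring functions `token_overlap` and `toy_rerank_score` are toy stand-ins invented for this example, not the BGE embedding or reranker models, and the corpus and query are hypothetical.

```python
def token_overlap(query, text):
    """Toy first-stage score: fraction of query tokens found in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens)


def toy_rerank_score(query, text):
    """Toy second-stage score: overlap minus a small penalty for longer texts."""
    return token_overlap(query, text) - 0.01 * len(text.split())


def retrieve_then_rerank(query, candidates, similarity_top_k, rerank_top_n):
    # Stage 1: keep the similarity_top_k best candidates by the cheap score.
    first_pass = sorted(
        candidates, key=lambda t: token_overlap(query, t), reverse=True
    )[:similarity_top_k]
    # Stage 2: rescore only the survivors and keep the rerank_top_n best.
    return sorted(
        first_pass, key=lambda t: toy_rerank_score(query, t), reverse=True
    )[:rerank_top_n]


# Hypothetical corpus and query, with smaller cutoffs than the notebook uses.
corpus = [
    "the author wrote short stories",
    "the author learned programming on an ibm 1401",
    "painting was a later interest",
    "the author wrote essays and also wrote lisp and also started a company much later",
    "cooking dinner",
]
top = retrieve_then_rerank(
    "what the author wrote", corpus, similarity_top_k=3, rerank_top_n=2
)
print(top)
```

Since the second stage only sees what the first stage kept, `rerank_top_n` should not exceed `similarity_top_k`.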

In [22]:
## set up postprocessors
## swap each retrieved sentence for the window stored in its metadata
postproc = MetadataReplacementPostProcessor(target_metadata_key="window")

## optionally rerank the results with a BGE reranking model
if re_rank:
    rerank = SentenceTransformerRerank(
        top_n=rerank_top_n,
        model="BAAI/bge-reranker-base",
    )
    post_proc = [postproc, rerank]
else:
    post_proc = [postproc]

## query engine
sent_window_engine = sent_index.as_query_engine(
    similarity_top_k=similarity_top_k,
    node_postprocessors=post_proc,
)


In [23]:
## function for sentence window retriever
def sent_window_retriever(sent_window_engine, query):
    """Run the query and return the answer text plus the context passages used."""
    ## run the query
    window_response = sent_window_engine.query(query)

    ## collect the text of each source node (the window, after metadata replacement)
    context = [node.node.get_content() for node in window_response.source_nodes]

    response = window_response.response

    return response, context
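The `context` list returned by this function should hold windows rather than single sentences, because the metadata replacement step swaps each node's text for the value stored under the `window` key. Below is a rough sketch of that swap using plain dictionaries. The helper `replace_text_with_window` and the sample nodes are hypothetical and are not the `MetadataReplacementPostProcessor` code. The fallback to the original text when no window is stored is an assumption of this sketch.

```python
def replace_text_with_window(nodes, target_metadata_key="window"):
    """Return copies of the nodes whose text is swapped for the stored window."""
    replaced = []
    for node in nodes:
        # Assumption: fall back to the node's own text if no window is stored.
        window = node["metadata"].get(target_metadata_key, node["text"])
        replaced.append({**node, "text": window})
    return replaced


# Hypothetical retrieved nodes, represented as plain dicts for illustration.
retrieved = [
    {
        "text": "They were awful.",
        "metadata": {
            "window": "I wrote stories. They were awful. Then I tried programming.",
            "original_text": "They were awful.",
        },
    },
    {"text": "No window stored here.", "metadata": {}},
]

context_texts = [node["text"] for node in replace_text_with_window(retrieved)]
print(context_texts)
```

The sentence that matched the query remains available under `original_text`, which is useful when comparing what was matched against what was sent to the LLM.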

# Test it out

In [24]:
query = "What did the author do when he was growing up?"

In [None]:
response, context = sent_window_retriever(sent_window_engine, query)

In [None]:
## print the retrieved context windows
context