
<a href="https://colab.research.google.com/github/edumunozsala/llamaindex-RAG-techniques/blob/main/sentence-window-retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentence Window Retrieval

In this notebook, we use the `SentenceWindowNodeParser` to parse documents into single sentences per node. Each node also contains a “window” with the sentences on either side of the node sentence.

Then, during retrieval, before the retrieved sentences are passed to the LLM, each single sentence is replaced with a window containing the surrounding sentences using the `MetadataReplacementPostProcessor`.

This is most useful for large documents/indexes, as it helps to retrieve more fine-grained details. The window size is configurable; this notebook sets `window_size=3`, that is, 3 sentences on either side of the original sentence.

In this case, chunk size settings are not used, in favor of following the window settings.
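
To make the idea concrete, here is a minimal pure-Python sketch of what a sentence window looks like. It is not the `SentenceWindowNodeParser` implementation: the function name `build_sentence_windows`, the dictionary layout, and the naive regex splitter are assumptions made for illustration only.

```python
import re


def build_sentence_windows(text, window_size=3):
    """Split text into sentences and attach a window of neighbours to each one."""
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    sketch_nodes = []
    for i, sentence in enumerate(sentences):
        # Clamp the window to the document boundaries.
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        sketch_nodes.append(
            {
                "text": sentence,
                "metadata": {
                    "original_text": sentence,
                    "window": " ".join(sentences[start:end]),
                },
            }
        )
    return sketch_nodes


sketch_nodes = build_sentence_windows(
    "A one. B two. C three. D four. E five.", window_size=1
)
print(sketch_nodes[2]["text"])                # C three.
print(sketch_nodes[2]["metadata"]["window"])  # B two. C three. D four.
```

Only the single sentence would be embedded and matched against the query; the window is carried along as metadata for later use.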

Original code from LlamaIndex: https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/MetadataReplacementDemo.html

In [28]:
import os
import openai

from llama_index.node_parser import (
    SentenceWindowNodeParser,
)
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index import VectorStoreIndex, StorageContext, load_index_from_storage
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor, SentenceTransformerRerank


Load the API Keys:

In [1]:
from dotenv import load_dotenv

# Load the environment variables from a .env file
load_dotenv()

True
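
`load_dotenv()` returning `True` only indicates that a `.env` file was found and loaded, not that a particular key is present. A small check like the one below can fail early with a clear message. The helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and used only for this sketch.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or the environment."
        )
    return value


# Demonstration with a throwaway variable (hypothetical, not read by any library).
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In this notebook the call would be `require_env("OPENAI_API_KEY")`, since that is the variable the OpenAI client reads.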

## Setup


If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

### Load the data

In [2]:
from pathlib import Path
from llama_index import download_loader

PDFReader = download_loader("PDFReader")

loader = PDFReader()
documents = loader.load_data(file=Path('./data/Attention is all you need.pdf'))

In [3]:
documents

[Document(id_='622bc59c-e6aa-4c09-8db2-c40869e787b7', embedding=None, metadata={'page_label': '1', 'file_name': 'Attention is all you need.pdf'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='215a4e4cfd187b7f1951b4796bf528de8e3c7a090794c107967fc03077f7dc5c', text='Attention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗†\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propos

## Window Sentence Retrieval Setup


In [8]:
# Create the sentence window node parser with a window of 3 sentences on each side
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

In [9]:
from llama_index import Document

text = "hello. foo bar. cat dog. mouse"

nodes = node_parser.get_nodes_from_documents([Document(text=text)])

In [10]:
print([x.text for x in nodes])
print(nodes[1].metadata["window"])

['hello. ', 'foo bar. ', 'cat dog. ', 'mouse']
hello.  foo bar.  cat dog.  mouse


### Building the index

In [12]:
# Define the LLM
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
# Create the context definition
sentence_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    # embed_model="local:BAAI/bge-large-en-v1.5"
    node_parser=node_parser,
)



In [14]:
# Build the Vector index
sentence_index = VectorStoreIndex.from_documents(
    documents, service_context=sentence_context
)

In [15]:
# Save the index
sentence_index.storage_context.persist(persist_dir="./sentence_index")

The next cell builds the vector index and saves it to disk or, if the index has already been created, loads it from disk instead.

In [17]:

if not os.path.exists("./sentence_index"):
    sentence_index = VectorStoreIndex.from_documents(
        documents, service_context=sentence_context
    )

    sentence_index.storage_context.persist(persist_dir="./sentence_index")
else:
    sentence_index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir="./sentence_index"),
        service_context=sentence_context
    )

## Building the postprocessor



Here, we use the `MetadataReplacementPostProcessor` to replace the sentence in each node with its surrounding context.

In [19]:
# Set up the postprocessor that swaps each node's text for its "window" metadata
postproc = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

Let's check how it works on the toy nodes created earlier:

In [20]:
nodes

[TextNode(id_='aced4204-e75c-4a70-9140-1e895ba69582', embedding=None, metadata={'window': 'hello.  foo bar.  cat dog. ', 'original_text': 'hello. '}, excluded_embed_metadata_keys=['window', 'original_text'], excluded_llm_metadata_keys=['window', 'original_text'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='6bee2a5e-fedc-4714-b3e5-9106ff090ab2', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='204c6c483f8a92b3b1132064f4ed114086b04b20b728c2b197baaae3228edd7e'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='9f2eb94e-58d7-4e1e-aaf7-04e88cdc6c32', node_type=<ObjectType.TEXT: '1'>, metadata={'window': 'hello.  foo bar.  cat dog.  mouse', 'original_text': 'foo bar. '}, hash='e49ab919742030657182ad4734208dcbbf31753db2584a13f1abc0a01f0bc178')}, hash='7436ba15cfb360ab416be071645e988c23ea7bd0da37900c1e05d46d97816510', text='hello. ', start_char_idx=0, end_char_idx=7, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metad

In [21]:
from llama_index.schema import NodeWithScore
from copy import deepcopy

scored_nodes = [NodeWithScore(node=x, score=1.0) for x in nodes]
nodes_old = [deepcopy(n) for n in nodes]
print(nodes_old[1].text)

foo bar. 


After applying the replacement postprocessor, the node text becomes the full window:

In [22]:
replaced_nodes = postproc.postprocess_nodes(scored_nodes)
print(replaced_nodes[1].text)

hello.  foo bar.  cat dog.  mouse
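
The replacement step itself is conceptually simple. The following is a rough sketch of the idea on plain dictionaries, not the library's implementation; `replace_with_window` and the dictionary layout are assumptions for illustration. Unlike the cell above, this sketch returns new objects and leaves its input untouched.

```python
def replace_with_window(sketch_nodes, target_metadata_key="window"):
    """Return copies of the nodes whose text is swapped for the stored window."""
    replaced = []
    for node in sketch_nodes:
        metadata = dict(node["metadata"])
        # Fall back to the original text when the metadata key is missing.
        new_text = metadata.get(target_metadata_key, node["text"])
        replaced.append({"text": new_text, "metadata": metadata})
    return replaced


toy_nodes = [
    {"text": "foo bar.", "metadata": {"window": "hello. foo bar. cat dog."}},
    {"text": "no window here.", "metadata": {}},
]
toy_replaced = replace_with_window(toy_nodes)
print(toy_replaced[0]["text"])  # hello. foo bar. cat dog.
print(toy_replaced[1]["text"])  # no window here.
```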


## Create and Run the Query engine


Now we can create the query engine that uses the replacement postprocessor:

In [24]:
sentence_window_engine = sentence_index.as_query_engine(
    similarity_top_k=6, node_postprocessors=[postproc]
)

And invoke the engine:

In [25]:
window_response = sentence_window_engine.query(
    "How are transformers related to convolutional neural networks?"
)

In [26]:
from llama_index.response.notebook_utils import display_response

display_response(window_response)

**`Final Response:`** Transformers are related to convolutional neural networks (CNNs) in that they both have been used as building blocks in sequence transduction models. However, transformers differ from CNNs in their architecture. While CNNs use convolutional layers to compute hidden representations in parallel for all input and output positions, transformers rely entirely on an attention mechanism to draw global dependencies between input and output. This allows transformers to be more parallelizable and can achieve state-of-the-art results in translation tasks with faster training times compared to architectures based on recurrent or convolutional layers.

We can also check the original sentence that was retrieved for each node, as well as the actual window of sentences that was sent to the LLM.

In [27]:
window = window_response.source_nodes[0].node.metadata["window"]
sentence = window_response.source_nodes[0].node.metadata["original_text"]

print(f"Window: {window}")
print("------------------")
print(f"Original Sentence: {sentence}")

Window: Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.comNoam Shazeer∗
Google Brain
noam@google.comNiki Parmar∗
Google Research
nikip@google.comJakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.comAidan N. Gomez∗†
University of Toronto
aidan@cs.toronto.eduŁukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder.  The best
performing models also connect the encoder and decoder through an attention
mechanism.  We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely.  Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring signiﬁcantly
less time to train.  Our 
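
The cell above inspects only the first source node. To compare all retrieved nodes at once, a small helper along these lines may be useful. `summarize_source_nodes` is a hypothetical name, and the stand-in objects built with `SimpleNamespace` only mimic the `.score` and `.node.metadata` attributes used in this notebook so that the sketch runs on its own.

```python
from types import SimpleNamespace


def summarize_source_nodes(source_nodes, max_chars=80):
    """Collect rank, score, matched sentence and window length for each source node."""
    rows = []
    for rank, scored in enumerate(source_nodes, start=1):
        metadata = scored.node.metadata
        rows.append(
            {
                "rank": rank,
                "score": scored.score,
                # The sentence that was actually matched against the query.
                "original_text": metadata.get("original_text", "")[:max_chars],
                # Size of the context that replaces it before the LLM call.
                "window_chars": len(metadata.get("window", "")),
            }
        )
    return rows


# Stand-in objects (hypothetical) so the sketch runs without an index.
fake_source_nodes = [
    SimpleNamespace(
        score=0.9,
        node=SimpleNamespace(
            metadata={"original_text": "foo bar. ", "window": "hello.  foo bar.  cat dog. "}
        ),
    )
]
summary_rows = summarize_source_nodes(fake_source_nodes)
for row in summary_rows:
    print(row)
```

With a real response, the call would be `summarize_source_nodes(window_response.source_nodes)`.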

## Adding a reranker in the post processor stage

In the deeplearning.ai course "building-eval-advanced-rag-llamaindex", a reranker is added to the postprocessing stage to improve the retrieved context. The next section refactors the code into two functions and adds the reranker.

In [29]:

def build_sentence_window_index(
    documents,
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    sentence_window_size=3,
    save_dir="sentence_index",
):
    # Create the sentence window node parser with the requested window size
    node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=sentence_window_size,
        window_metadata_key="window",
        original_text_metadata_key="original_text",
    )
    # create the context
    sentence_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=embed_model,
        node_parser=node_parser,
    )
    # Build the index and persist it, or load it from disk if it already exists
    if not os.path.exists(save_dir):
        sentence_index = VectorStoreIndex.from_documents(
            documents, service_context=sentence_context
        )
        sentence_index.storage_context.persist(persist_dir=save_dir)
    else:
        sentence_index = load_index_from_storage(
            StorageContext.from_defaults(persist_dir=save_dir),
            service_context=sentence_context,
        )

    return sentence_index


def get_sentence_window_query_engine(
    sentence_index, similarity_top_k=6, rerank_top_n=2
):
    # define postprocessors
    postproc = MetadataReplacementPostProcessor(target_metadata_key="window")
    rerank = SentenceTransformerRerank(
        top_n=rerank_top_n, model="BAAI/bge-reranker-base"
    )

    sentence_window_engine = sentence_index.as_query_engine(
        similarity_top_k=similarity_top_k, node_postprocessors=[postproc, rerank]
    )
    return sentence_window_engine
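
The two parameters work as a funnel: `similarity_top_k` candidates are retrieved by embedding similarity, and the reranker then keeps only the best `rerank_top_n` of them. The sketch below illustrates that second step with a toy word-overlap score. The score is a deliberately simple stand-in, not the `BAAI/bge-reranker-base` cross-encoder, and the names `overlap_score` and `rerank_passages` are assumptions for illustration.

```python
def overlap_score(query, passage):
    """Toy relevance score: fraction of query words that occur in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / max(len(query_words), 1)


def rerank_passages(query, passages, top_n=2, score_fn=overlap_score):
    """Re-score every candidate against the query and keep the top_n best."""
    ranked = sorted(passages, key=lambda p: score_fn(query, p), reverse=True)
    return ranked[:top_n]


candidates = [
    "convolutional layers compute hidden representations",
    "the transformer relies on attention",
    "training took twelve hours",
]
top_passages = rerank_passages("attention in the transformer", candidates, top_n=2)
print(top_passages[0])  # the transformer relies on attention
```

A real reranker scores each query and passage pair jointly with a model, which is slower than embedding lookup and is the reason it is applied only to the small candidate set.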

Now we can call these functions to create the sentence window query engine and invoke it:

In [31]:
# Create the index and the retriever
index = build_sentence_window_index(
    documents,
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
    save_dir="./sentence_index",
)
# Get the query engine
query_engine = get_sentence_window_query_engine(index, similarity_top_k=6)



In [32]:
window_response = query_engine.query(
    "How are transformers related to convolutional neural networks?"
)

In [33]:
from llama_index.response.notebook_utils import display_response

display_response(window_response)

**`Final Response:`** Transformers are related to convolutional neural networks in that they are both used for sequence transduction tasks. However, transformers differ from convolutional neural networks in their architecture. While convolutional neural networks use convolutional layers to process input data, transformers rely solely on attention mechanisms. Transformers have been shown to achieve superior results in terms of quality, parallelizability, and training time compared to convolutional neural networks.