# Sentence Window Retriever

In [19]:
import os 
from dotenv import load_dotenv, find_dotenv

In [20]:
load_dotenv('/home/santhosh/Projects/courses/Pinnacle/.env')

True

In [3]:
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']

Download data

We will be using Chapter 3 of the IPCC Sixth Assessment Report (AR6), Working Group II.

https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf

In [26]:
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20.7M  100 20.7M    0     0   319k      0  0:01:06  0:01:06 --:--:--  340k


Load Data

In [27]:
from pathlib import Path
from llama_index.readers.file import PDFReader

In [28]:
loader = PDFReader()

In [29]:
documents = loader.load_data(file=Path('./IPCC_AR6_WGII_Chapter03.pdf'))

In [7]:
len(documents)

172

In [30]:
print(documents[0].text)

SPM
379
3
Oceans and Coastal 
Ecosystems and Their Services
This chapter should be cited as:
Cooley, S., D.  Schoeman, L. Bopp, P . Boyd, S. Donner, D.Y . Ghebrehiwet, S.-I. Ito, W. Kiessling, P . Martinetto, E. Ojea, 
M.-F . Racault, B. Rost, and M. Skern-Mauritzen, 2022: Oceans and Coastal Ecosystems and Their Services. In: Climate 
Change 2022: Impacts, Adaptation and Vulnerability. Contribution of Working Group II to the Sixth Assessment Report of 
the Intergovernmental Panel on Climate Change [H.-O. Pörtner, D.C. Roberts, M. Tignor, E.S. Poloczanska, K. Mintenbeck, 
A. Alegría, M. Craig, S. Langsdorf, S. Löschke, V . Möller, A. Okem, B. Rama (eds.)]. Cambridge University Press, Cambridge, 
UK and New York, NY , USA, pp. 379–550, doi:10.1017/9781009325844.005.
Coordinating Lead Authors: Sarah R. Cooley (USA) and David S. Schoeman (Australia)
Lead Authors: Laurent Bopp (France), Philip Boyd (Australia/UK), Simon Donner (Canada), Shin-
Ichi Ito (Japan), Wolfgang Kiessling (Germany), 

Configure OpenAI LLM

In [31]:
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4o-mini", temperature=0.1)

Load a sentence-level embedding model from HuggingFace

In [32]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5")

The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.


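Note: older LlamaIndex releases bundled the LLM and embedding model into a `ServiceContext`. With the `llama_index.core` imports used in this notebook, the embedding model is passed directly to the index instead. The `llm` defined above only takes effect if it is also passed on, for example via `as_query_engine(llm=llm)`; the query engines below do not do this and so use the library's default LLM.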

# Sentence Window Indexing

Create Sentence Window Node Parser


In [33]:
from llama_index.core.node_parser import SentenceWindowNodeParser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
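To make the parser's output concrete, here is a minimal pure-Python sketch of the idea, not the library's implementation. It assumes a naive regex sentence splitter, and `build_sentence_windows` is a hypothetical helper; only the metadata keys `window` and `original_text` come from the cell above.

```python
import re


def build_sentence_windows(text, window_size=3):
    """Split text into sentences and attach a window of neighbours to each one."""
    # Assumption: a naive splitter that breaks after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                # The node text (what gets embedded) is the single sentence.
                "text": sentence,
                "metadata": {
                    "original_text": sentence,
                    # The window holds the sentence plus its neighbours on each side.
                    "window": " ".join(sentences[start:end]),
                },
            }
        )
    return nodes


toy_text = "One. Two. Three. Four. Five."
toy_nodes = build_sentence_windows(toy_text, window_size=1)
print(toy_nodes[2]["metadata"])
```

With `window_size=3`, as configured above, each node would carry up to three sentences on either side of the embedded sentence.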

Extract the nodes

In [34]:
nodes = node_parser.get_nodes_from_documents(documents)

Building index

In [35]:
from llama_index.core import VectorStoreIndex

In [36]:
sentence_index = VectorStoreIndex(nodes, embed_model=embed_model, show_progress=True)

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/1036 [00:00<?, ?it/s]

# Querying

## MetadataReplacementPostProcessor

Here, we will configure the `MetadataReplacementPostProcessor` in the query engine.

It replaces the actual sentence in each node with its surrounding context.

In [37]:
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
sentence_query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
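The replacement step can be pictured with a small sketch. This is an illustration on plain dictionaries, not the library's code: `replace_with_metadata` is a hypothetical helper, and falling back to the original text when the key is missing is an assumption of this sketch.

```python
def replace_with_metadata(retrieved_nodes, target_metadata_key="window"):
    """Return copies of the nodes whose text is swapped for a metadata value."""
    replaced = []
    for node in retrieved_nodes:
        # Assumption: keep the original text if the metadata key is absent.
        new_text = node["metadata"].get(target_metadata_key, node["text"])
        replaced.append({**node, "text": new_text})
    return replaced


retrieved = [
    {
        "text": "The AMOC will decline.",
        "metadata": {
            "original_text": "The AMOC will decline.",
            "window": "Earlier sentence. The AMOC will decline. Later sentence.",
        },
    }
]
for_llm = replace_with_metadata(retrieved)
print(for_llm[0]["text"])
```

The retrieval score is still computed against the single sentence; only the text handed to the LLM changes.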

In [38]:
sentence_window_response = sentence_query_engine.query(
    "What are the concerns surrounding the AMOC?"
)

In [39]:
print(sentence_window_response)

There is low confidence in the quantification of AMOC changes in the 20th century due to low agreement in quantitative reconstructed and simulated trends. Direct observational records since the mid-2000s are considered too short to determine the relative contributions of internal variability, natural forcing, and anthropogenic forcing to AMOC change. It is very likely that the AMOC will decline over the 21st century for all scenarios, but it is not expected to involve an abrupt collapse before 2100.


We can also check the original sentence that was retrieved for each node, as well as the actual window of sentences that was sent to the LLM.

In [40]:
sentence = sentence_window_response.source_nodes[0].node.metadata["original_text"]
print(sentence)

2.3.3.4, 9.2.3 (Fox-Kemper 
et al., 2021; Gulev et al., 
2021)
The AMOC will decline over the 21st century 
(high confidence, but low confidence for 
quantitative projections).



In [41]:
window = sentence_window_response.source_nodes[0].node.metadata["window"]
print(window)

4.3.2.2, 9.6.3 (Fox-Kemper 
et al., 2021; Lee et al., 
2021)
Extreme sea levels
Relative sea level rise is driving a global increase 
in the frequency of extreme sea levels (high 
confidence).
 9.6.4 (Fox-Kemper et al., 
2021)
Rising mean relative sea level will continue to 
drive an increase in the frequency of extreme sea 
levels (high confidence).
 9.6.4 (Fox-Kemper et al., 
2021)
Ocean circulation
Ocean stratification
‘The upper ocean has become more stably 
stratified since at least 1970 […] (virtually 
certain).’
9.2.1.3 (Fox-Kemper et al., 
2021)
‘Upper-ocean stratification will continue to 
increase throughout the 21st century (virtually 
certain).’
9.2.1.3 (Fox-Kemper et al., 
2021)
Eastern boundary 
upwelling systems
‘Only the California current system 
has experienced some large-scale 
upwelling-favourable wind intensification since 
the 1980s (medium confidence).’
9.2.5 (Fox-Kemper et al., 
2021)
‘Eastern boundary upwelling systems will 
change, with a dipole spatial patter

# Base Retriever

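Create a node parser that splits the documents into chunks. With its default settings, `SentenceSplitter` produces multi-sentence chunks that respect sentence boundaries, not one node per sentence.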

In [42]:
# base node parser is a sentence splitter
from llama_index.core.text_splitter import SentenceSplitter
sentence_splitter = SentenceSplitter()

Extract the nodes

In [43]:
base_nodes = sentence_splitter.get_nodes_from_documents(documents)

Build index

In [44]:
base_index = VectorStoreIndex(base_nodes, embed_model=embed_model, show_progress=True)

Generating embeddings:   0%|          | 0/459 [00:00<?, ?it/s]

Configure Query Engine

In [45]:
base_query_engine = base_index.as_query_engine(similarity_top_k=2)

In [46]:
response = base_query_engine.query(
    "What are the concerns surrounding the AMOC?"
)

In [47]:
print(response)

The concerns surrounding the AMOC include potential weakening due to climate change, which could have significant impacts on regional and global climate patterns, including sea level rise, temperature changes, and extreme weather events.


Well, that didn't work. Let's bump up the top k!


In [30]:
query_engine = base_index.as_query_engine(similarity_top_k=5)
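Raising `similarity_top_k` simply widens the pool of chunks passed to the LLM. A minimal sketch of that selection step, using made-up similarity scores and a hypothetical `top_k` helper (the real index computes the scores from embeddings):

```python
import heapq


def top_k(scored_nodes, k=2):
    """Return the k (score, node_id) pairs with the highest similarity."""
    return heapq.nlargest(k, scored_nodes, key=lambda pair: pair[0])


# Toy scores for illustration only; they are not taken from the notebook run.
toy_scores = [
    (0.61, "chunk-a"),
    (0.83, "chunk-b"),
    (0.47, "chunk-c"),
    (0.79, "chunk-d"),
    (0.52, "chunk-e"),
]
print(top_k(toy_scores, k=2))
print(top_k(toy_scores, k=5))
```

A larger `k` makes it more likely that a relevant chunk is included, at the cost of more tokens and more unrelated text in the prompt.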

In [31]:
response = query_engine.query(
    "What are the concerns surrounding the AMOC?"
)

In [32]:
print(response)

Concerns surrounding the AMOC include low confidence in reconstructed and modelled AMOC changes for the 20th century, a projected decline over the 21st century with high confidence but low confidence for quantitative projections, and potential implications for global climate due to the uncertainties in future AMOC behavior.


# Analysis

So the `SentenceWindowNodeParser` + `MetadataReplacementPostProcessor` combo is the clear winner here. But why?

Embeddings at a sentence level seem to capture more fine-grained details, like the word `AMOC`.
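One intuition for this is dilution: in a long chunk, a single term contributes less to the overall vector. The sketch below illustrates that with a toy bag-of-words embedding and cosine similarity. This is an assumption made for illustration; dense models such as `bge-small-en-v1.5` work differently, so treat it as an analogy rather than a description of the actual embeddings.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Toy 'embedding': lower-cased token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


query = bag_of_words("AMOC decline")
sentence = bag_of_words("The AMOC will decline")
chunk = bag_of_words(
    "The AMOC will decline while Arctic sea ice coverage is the lowest "
    "since 1850 and upwelling systems will change"
)

sentence_score = cosine(query, sentence)
chunk_score = cosine(query, chunk)
print(sentence_score > chunk_score)
```

Both texts contain the query terms, but the single sentence scores higher because nothing else competes for weight in its vector.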

Let's look at the retrieved sentences for the user query!

In [33]:
for source_node in sentence_window_response.source_nodes:
  print(source_node.node.metadata["original_text"])
  print("--------")

Over the 21st century, AMOC will very likely decline for all SSP 
scenarios but will not involve an abrupt collapse before 2100 (WGI 
AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021).

--------
Continuous observation of the Atlantic meridional overturning 
circulation (AMOC) has improved the understanding of its variability 
(Frajka-Williams et  al., 2019), but there is low confidence in the 
quantification of AMOC changes in the 20th century because of low 
agreement in quantitative reconstructed and simulated trends (WGI 
AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., 2021). 

--------


Here, we can see that the sentence window index easily retrieved two nodes that talk about AMOC.

Remember, the embeddings are based purely on the original sentence here, but the LLM actually ends up reading the surrounding context as well!

Now, let's try to dissect why the naive vector index failed.


In [36]:
# check whether each retrieved node mentions "AMOC"
for node in response.source_nodes:
    print("Is AMOC mentioned in the node?:", "AMOC" in node.node.text)
    print("--------")

Is AMOC mentioned in the node?: True
--------
Is AMOC mentioned in the node?: False
--------
Is AMOC mentioned in the node?: False
--------
Is AMOC mentioned in the node?: False
--------
Is AMOC mentioned in the node?: False
--------


So the source node at index 0 mentions AMOC, but what did this text actually look like?

In [38]:
print(response.source_nodes[0].node.text)

’9.2.5 (Fox-Kemper et al. 
2021)‘Eastern boundary upwelling systems will 
change, with a dipole spatial pattern within 
each system of reduction at low latitude and 
enhancement at high latitude (high confidence).’9.2.5 (Fox-Kemper et al. 
2021)
Atlantic overturning 
circulation (AMOC)There is low confidence in reconstructed and 
modelled AMOC changes for the 20th century.2.3.3.4, 9.2.3 (Fox-Kemper 
et al. 2021; Gulev et al. 
2021)The AMOC will decline over the 21st century 
(high confidence, but low confidence for 
quantitative projections).4.3.2.3, 9.2.3 (Fox-Kemper 
et al. 2021; Lee et al. 
2021)
Sea ice
Arctic sea ice 
changes‘Current Arctic sea ice coverage levels are the 
lowest since at least 1850 for both annual mean 
and late-summer values (high confidence).’2.3.2.1, 9.3.1 (Fox-Kemper 
et al. 2021; Gulev et al. 
2021)‘The Arctic will become practically ice-free in 
September by the end of the 21st century under 
SSP2-4.5, SSP3-7.0 and SSP5-8.5[…](high 
confidence).’4.3.2.1, 9.

So AMOC is discussed, but sadly it is buried in the middle of the chunk, between unrelated table rows.

With LLMs, text in the middle of the retrieved context is often ignored or used less effectively.

A recent paper [Lost in the Middle](https://arxiv.org/abs/2307.03172) discusses this.
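A quick way to check where a term sits inside a chunk is to compute its relative character offset. A minimal sketch, where `relative_position` is a hypothetical helper and the chunk is a synthetic string:

```python
def relative_position(text, term):
    """Return the first occurrence of term as a fraction of the text length, or None."""
    index = text.find(term)
    if index == -1 or not text:
        return None
    return index / len(text)


# Synthetic chunk: 40 filler characters, the term, then 56 more filler characters.
chunk_text = "x" * 40 + "AMOC" + "y" * 56
print(relative_position(chunk_text, "AMOC"))
```

Applied to the notebook's data, this could be called as `relative_position(response.source_nodes[0].node.text, "AMOC")`; values well away from 0 and 1 indicate the term is in the middle of the chunk.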

# Compare Base Retriever and Sentence Window Retriever

Let's evaluate how well the sentence window retriever works compared to the base retriever.

We define and load an eval benchmark dataset and then run different evaluations over it.

## Create Evaluation Dataset

*Note: This can be expensive, especially with GPT-4. Use caution and tune the sample size to fit your budget.*

In [49]:
import random
import nest_asyncio
nest_asyncio.apply()

In [50]:
len(base_nodes)

459

Randomly sample a few nodes for the evaluation

In [51]:
num_nodes_eval = 1

In [52]:
sample_eval_nodes = random.sample(base_nodes, num_nodes_eval)
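Because the sample is random, each run evaluates on different nodes. If repeatable evaluation sets are wanted, one option is a seeded generator. A minimal sketch, assuming a hypothetical `sample_nodes` helper and an arbitrary seed of 42:

```python
import random


def sample_nodes(nodes, k, seed=42):
    """Draw k nodes without replacement using a private, seeded generator."""
    rng = random.Random(seed)  # does not touch the global random state
    return rng.sample(nodes, k)


# Stand-in for base_nodes; any list works.
toy_pool = [f"node-{i}" for i in range(10)]
first = sample_nodes(toy_pool, 3)
second = sample_nodes(toy_pool, 3)
print(first == second)
```

The same call with the same seed returns the same nodes, which makes scores comparable across runs.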

Configure the LLM used for evaluation

In [53]:
gpt4 = OpenAI(model="gpt-4o")

In [60]:
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

In [69]:
# generate questions
dataset_generator = RagDatasetGenerator(
    sample_eval_nodes,
    llm=gpt4,
    show_progress=True,
    num_questions_per_chunk=2,
)

In [70]:
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes()

100%|██████████| 1/1 [00:02<00:00,  2.99s/it]
100%|██████████| 2/2 [00:04<00:00,  2.32s/it]


In [71]:
eval_dataset.save_json("ipcc_eval_qr_dataset.json")

In [67]:
from llama_index.core.llama_dataset import LabelledRagDataset

In [72]:
eval_dataset = LabelledRagDataset.from_json("ipcc_eval_qr_dataset.json")

In [80]:
eval_questions = [example.query for example in eval_dataset.examples]
eval_responses = [example.reference_answer for example in eval_dataset.examples]

In [79]:
len(eval_questions)

2

Query the base retriever and sentence window retriever query engines for the responses.

In [81]:
max_samples = 2

In [82]:
from llama_index.core.evaluation.eval_utils import get_responses
base_responses = get_responses(
    eval_questions[:max_samples],
    base_query_engine,
    show_progress=True
)

100%|██████████| 2/2 [00:03<00:00,  1.53s/it]


In [61]:
sentence_window_responses = get_responses(
    eval_questions[:max_samples],
    sentence_query_engine,
    show_progress=True
)

100%|██████████| 2/2 [00:07<00:00,  3.56s/it]


Configure RAG Triad of Metrics

In [66]:
from llama_index.core.evaluation import CorrectnessEvaluator, RelevancyEvaluator, FaithfulnessEvaluator

evaluator_c = CorrectnessEvaluator(llm=gpt4)
evaluator_r = RelevancyEvaluator(llm=gpt4)
evaluator_f = FaithfulnessEvaluator(llm=gpt4)

Define the BatchEvalRunner for computing the metrics

In [67]:
evaluator_dict = {
    "correctness": evaluator_c,
    "faithfulness": evaluator_f,
    "relevancy": evaluator_r
}

In [69]:
from llama_index.core.evaluation import BatchEvalRunner
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)

Compute metrics for the base retriever

In [70]:
base_eval_results = await batch_runner.aevaluate_responses(
    queries=eval_questions[:max_samples],
    responses=base_responses[:max_samples],
    reference=eval_responses[:max_samples],
)

100%|██████████| 6/6 [00:11<00:00,  1.86s/it]


Compute the metrics for the sentence window retriever

In [71]:
sentence_window_eval_results = await batch_runner.aevaluate_responses(
    queries=eval_questions[:max_samples],
    responses=sentence_window_responses[:max_samples],
    reference=eval_responses[:max_samples],
)

100%|██████████| 6/6 [00:08<00:00,  1.48s/it]


Display the results

In [72]:
from llama_index.core.evaluation.eval_utils import get_results_df
results_df = get_results_df(
    [sentence_window_eval_results, base_eval_results],
    ["Sentence Window Retriever", "Base Retriever"],
    ["correctness", "relevancy", "faithfulness"]
)

In [73]:
display(results_df)

|   | names | correctness | relevancy | faithfulness |
|---|---|---|---|---|
| 0 | Sentence Window Retriever | 4.5 | 1.0 | 1.0 |
| 1 | Base Retriever | 4.5 | 1.0 | 1.0 |
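Each cell in the results is a single number per metric and retriever. A minimal sketch of that kind of aggregation, assuming each value is the plain mean of the per-question scores; `summarize` is a hypothetical helper and the per-question scores below are made-up toy values, not the notebook's actual evaluation results:

```python
from statistics import mean


def summarize(per_question_scores, metric_keys):
    """Average the per-question scores for each requested metric."""
    return {key: mean(per_question_scores[key]) for key in metric_keys}


# Toy per-question scores for two questions (illustrative only).
toy_results = {
    "correctness": [4.0, 5.0],
    "relevancy": [1.0, 1.0],
    "faithfulness": [1.0, 1.0],
}
summary = summarize(toy_results, ["correctness", "relevancy", "faithfulness"])
print(summary)
```

With only two questions behind each average, small differences between retrievers are easy to miss, so a larger sample is worth the extra cost before drawing conclusions.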
