#Sentence Window Retriever

In [None]:
!pip install llama-index

#Configure OpenAI API Key

In [None]:
import os
os.environ["OPENAI_API_KEY"] = 'YOUR OPENAI API KEY'
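
Hard-coding the key in a cell makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming you are happy to read the key from the environment and only prompt when it is missing. `ensure_api_key` is a hypothetical helper name, not part of llama-index or the OpenAI client.

```python
import os
from getpass import getpass


def ensure_api_key(env_var="OPENAI_API_KEY", prompt=getpass):
    """Return the key stored in `env_var`, prompting for it only if it is missing.

    Hypothetical helper for illustration; not part of llama-index or openai.
    """
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value so it never appears in the cell output
        key = prompt(f"Enter {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling `ensure_api_key()` once, before the LLM is created, would then replace the hard-coded assignment.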

#Download data

We will use Chapter 3 ("Oceans and Coastal Ecosystems and Their Services") of the IPCC AR6 Working Group II report.

In [None]:
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20.7M  100 20.7M    0     0  30.6M      0 --:--:-- --:--:-- --:--:-- 30.6M


#Load Data

In [None]:
from pathlib import Path
from llama_index import download_loader

PDFReader = download_loader("PDFReader")

loader = PDFReader()
documents = loader.load_data(file=Path('/content/IPCC_AR6_WGII_Chapter03.pdf'))

In [None]:
len(documents)

172

In [None]:
print(documents[0].text)

SPM379
3
Oceans and Coastal 
Ecosystems and Their Services
This chapter should be cited as:
Cooley, S., D.  Schoeman, L.  Bopp, P .  Boyd, S.  Donner, D.Y .  Ghebrehiwet, S.-I.  Ito, W.  Kiessling, P .  Martinetto, E.  Ojea, 
M.-F . Racault, B.  Rost, and M.  Skern-Mauritzen, 2022: Oceans and Coastal Ecosystems and Their Services. In: Climate 
Change 2022: Impacts, Adaptation and Vulnerability. Contribution of Working Group II to the Sixth Assessment Report of 
the Intergovernmental Panel on Climate Change [H.-O.  Pörtner, D.C.  Roberts, M.  Tignor, E.S.  Poloczanska, K.  Mintenbeck, 
A. Alegría, M.  Craig, S.  Langsdorf, S.  Löschke, V .  Möller, A.  Okem, B.  Rama (eds.)]. Cambridge University Press, Cambridge, 
UK and New York, NY , USA, pp.  379–550, doi:10.1017/9781009325844.005.Coordinating Lead Authors: Sarah R. Cooley (USA) and David S. Schoeman (Australia)
Lead Authors: Laurent Bopp (France), Philip Boyd (Australia/UK), Simon Donner (Canada), Shin-
Ichi Ito (Japan), Wolfgang K

Configure OpenAI LLM

In [None]:
from llama_index.llms import OpenAI
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

Load a sentence-level embedding model from HuggingFace

In [None]:
from llama_index.embeddings import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-mpnet-base-v2", max_length=512
)

Create Service Context by providing LLM and Embedding model

In [None]:
from llama_index import ServiceContext
ctx = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model
)

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


#Sentence Window Indexing

Create Sentence Window Node Parser


In [None]:
from llama_index.node_parser import SentenceWindowNodeParser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
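
To make the idea concrete, here is a rough, library-free sketch of what a sentence window parser does: every sentence becomes its own node, and up to `window_size` neighbouring sentences on each side are stored in the node's metadata. The regex splitter and the `build_sentence_windows` name are assumptions made for illustration. The real `SentenceWindowNodeParser` uses its own sentence tokenizer and node classes.

```python
import re


def build_sentence_windows(text, window_size=3):
    """Split `text` into sentences and attach a window of neighbours to each one."""
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "text": sentence,  # only this single sentence gets embedded
                "metadata": {
                    "original_text": sentence,
                    "window": " ".join(sentences[start:end]),
                },
            }
        )
    return nodes


toy = "A one. B two. C three. D four. E five."
for node in build_sentence_windows(toy, window_size=1):
    print(node["text"], "->", node["metadata"]["window"])
```

The metadata keys mirror the `window_metadata_key` and `original_text_metadata_key` values configured above.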

Extract the nodes

In [None]:
nodes = node_parser.get_nodes_from_documents(documents)

Building index

In [None]:
from llama_index import VectorStoreIndex
sentence_index = VectorStoreIndex(nodes, service_context=ctx, show_progress=True)

Generating embeddings:   0%|          | 0/11087 [00:00<?, ?it/s]

#Querying

## MetadataReplacementPostProcessor

Here, we will configure the `MetadataReplacementPostProcessor` in the query engine.

It replaces the single sentence stored in each retrieved node with its surrounding context (the window kept in the node's metadata).

In [None]:
from llama_index.postprocessor import MetadataReplacementPostProcessor
sentence_query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
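
The replacement step itself is simple. The sketch below imitates it on plain Python objects, assuming the postprocessor only swaps each node's text for the value stored under the target metadata key. `ToyNode` and `replace_with_metadata` are hypothetical stand-ins, not llama-index classes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a retrieved node."""
    text: str
    metadata: dict = field(default_factory=dict)


def replace_with_metadata(nodes, target_metadata_key="window"):
    """Swap each node's text for the value stored under `target_metadata_key`."""
    for node in nodes:
        # Keep the existing text when the key is absent.
        node.text = node.metadata.get(target_metadata_key, node.text)
    return nodes


retrieved = [
    ToyNode(
        text="AMOC will very likely decline.",
        metadata={
            "window": "Records are short. AMOC will very likely decline. "
            "No abrupt collapse is expected."
        },
    ),
    ToyNode(text="A node without a window."),
]
for node in replace_with_metadata(retrieved):
    print(node.text)
```

Retrieval still ranks nodes by the embedding of the single sentence. Only the text handed to the LLM changes.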

In [None]:
sentence_window_response = sentence_query_engine.query(
    "What are the concerns surrounding the AMOC?"
)

In [None]:
print(sentence_window_response)

There is low confidence in the quantification of Atlantic Meridional Overturning Circulation (AMOC) changes in the 20th century due to low agreement in quantitative reconstructed and simulated trends. Additionally, direct observational records since the mid-2000s remain too short to determine the relative contributions of internal variability, natural forcing, and anthropogenic forcing to AMOC change. However, it is very likely that AMOC will decline for all SSP scenarios over the 21st century, but there will not be an abrupt collapse before 2100.


We can also check the original sentence that was retrieved for each node, as well as the actual window of sentences that was sent to the LLM.

In [None]:
sentence = sentence_window_response.source_nodes[0].node.metadata["original_text"]
print(sentence)

Over the 21st century, AMOC will very likely decline for all SSP 
scenarios but will not involve an abrupt collapse before 2100 (WGI 
AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021).



In [None]:
window = sentence_window_response.source_nodes[0].node.metadata["window"]
print(window)

Nevertheless, projected future annual cumulative upwelling wind 
changes at most locations and seasons remain within ±10–20% of 
present-day values (medium confidence) (WGI AR6 Section  9.2.3.5; 
Fox-Kemper et al., 2021).
 Continuous observation of the Atlantic meridional overturning 
circulation (AMOC) has improved the understanding of its variability 
(Frajka-Williams et  al., 2019), but there is low confidence in the 
quantification of AMOC changes in the 20th century because of low 
agreement in quantitative reconstructed and simulated trends (WGI 
AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., 2021). 
 Direct observational records since the mid-2000s remain too short to 
determine the relative contributions of internal variability, natural 
forcing and anthropogenic forcing to AMOC change (high confidence) 
(WGI AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., 
2021).  Over the 21st century, AMOC will very likely decline for all SSP 
scenari

#Base Retriever

Create node parser to extract the sentences from the document

In [None]:
# base node parser is a sentence splitter
from llama_index.text_splitter import SentenceSplitter
sentence_splitter = SentenceSplitter()

Extract the nodes

In [None]:
base_nodes = sentence_splitter.get_nodes_from_documents(documents)

Build index

In [None]:
base_index = VectorStoreIndex(base_nodes, service_context=ctx, show_progress=True)

Configure Query Engine

In [None]:
base_query_engine = base_index.as_query_engine(similarity_top_k=2)

In [None]:
response = base_query_engine.query(
    "What are the concerns surrounding the AMOC?"
)

In [None]:
print(response)

The concerns surrounding the AMOC are related to its potential slowdown or collapse. This could have significant impacts on global climate patterns, including changes in temperature, precipitation, and sea level rise. The AMOC plays a crucial role in redistributing heat around the planet, and any disruption to its functioning could have far-reaching consequences for ecosystems, weather patterns, and human societies.


Well, that didn't work. Let's bump up the top k!


In [None]:
query_engine = base_index.as_query_engine(similarity_top_k=5)

In [None]:
response = query_engine.query(
    "What are the concerns surrounding the AMOC?"
)

In [None]:
print(response)

There is low confidence in reconstructed and modelled AMOC (Atlantic Meridional Overturning Circulation) changes for the 20th century. However, it is projected to decline over the 21st century with high confidence, although there is low confidence in quantitative projections.


# Analysis

So the `SentenceWindowNodeParser` + `MetadataReplacementPostProcessor` combo is the clear winner here. But why?

Embeddings at a sentence level seem to capture more fine-grained details, like the word `AMOC`.
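
One way to build intuition for this is a toy experiment in which bag-of-words vectors stand in for real embeddings (a strong simplification, since the notebook uses a neural sentence-transformer). The same sentence is scored against the query on its own and again when buried in a larger chunk of unrelated text. The stopword list and the example texts are made up for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption, not a standard list).
STOPWORDS = {"what", "are", "the", "for", "all", "will", "to", "and", "is", "of"}


def bag_of_words(text):
    """Lower-case, tokenise and count the non-stopword terms in `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


query = bag_of_words("What are the concerns surrounding the AMOC?")
sentence = "AMOC will very likely decline for all SSP scenarios."
# A larger chunk: the same sentence surrounded by unrelated material.
chunk = (
    "Marine protected areas are the most widely implemented management approach. "
    + sentence
    + " Protection and restoration can support coastal ecosystems."
)

sentence_score = cosine(query, bag_of_words(sentence))
chunk_score = cosine(query, bag_of_words(chunk))
print(f"sentence: {sentence_score:.3f}  chunk: {chunk_score:.3f}")
```

In this toy setting the lone sentence scores higher because the unrelated text dilutes the chunk's vector. Real embeddings behave less mechanically, so treat this as intuition rather than proof.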

Let's look at the retrieved sentences for the user query!

In [None]:
for source_node in sentence_window_response.source_nodes:
  print(source_node.node.metadata["original_text"])
  print("--------")

Over the 21st century, AMOC will very likely decline for all SSP 
scenarios but will not involve an abrupt collapse before 2100 (WGI 
AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021).

--------
Direct observational records since the mid-2000s remain too short to 
determine the relative contributions of internal variability, natural 
forcing and anthropogenic forcing to AMOC change (high confidence) 
(WGI AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., 
2021). 
--------


Here, we can see that the sentence window index easily retrieved two nodes that talk about AMOC.

Remember, the embeddings are based purely on the original sentence here, but the LLM actually ends up reading the surrounding context as well!

Now, let's try to dissect why the naive vector index failed.


In [None]:
# check whether each retrieved node contains the text "AMOC"
for node in response.source_nodes:
    print("Is AMOC mentioned in the node?:", "AMOC" in node.node.text)
    print("--------")

Is AMOC mentioned in the node?: False
--------
Is AMOC mentioned in the node?: False
--------
Is AMOC mentioned in the node?: False
--------
Is AMOC mentioned in the node?: True
--------
Is AMOC mentioned in the node?: False
--------


The check above shows that only the source node at index 3 mentions AMOC. What does this text actually look like?

In [None]:
print(response.source_nodes[3].node.text)

A full assessment of 
climate-change impacts on human health is found in Chapter  7 and 
Cross-Chapter Box ILLNESS in Chapter 2.
3.6.3.2 Cross-Cutting Solutions for Coastal and Ocean 
Ecosystems
SROCC concluded that protection, restoration and pollution reduction 
can support ocean and coastal ecosystems (high confidence), and that 
EbA lowers climate risks locally and provides multiple societal benefits 
(high confidence) (IPCC, 2019c). This section updates the assessment 
of the effectiveness of these strategies for addressing climate impacts.
3.6.3.2.1  Area-based protection: MPAs for adapting to climate 
change
Marine protected areas are the most widely implemented area-based 
management approach (Section  3.6.2.3.2), commonly intended 
to conserve, preserve or restore biodiversity and habitats, protect 
species or manage resources (especially fisheries) (National Research 
Council, 2001). By August 2021, 7.74% of the ocean was protected 
(in both MPAs and OECMs) (UNEP-WCMC and IUC

So AMOC is discussed, but sadly the node that mentions it sits in the middle of the retrieved context rather than at the start or the end.

LLMs often ignore, or make poor use of, text in the middle of a long retrieved context.

A recent paper [Lost in the Middle](https://arxiv.org/abs/2307.03172) discusses this.
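
A common mitigation is to reorder the retrieved nodes so that the strongest matches sit at the two ends of the context and the weakest in the middle. The sketch below is one simple way to do that on `(name, score)` pairs. It is an illustration under that assumption, not the paper's method, and `reorder_for_long_context` is a hypothetical helper name.

```python
def reorder_for_long_context(scored_nodes):
    """Place the highest-scoring items at both ends and the weakest in the middle.

    `scored_nodes` is a list of (name, score) pairs.
    """
    reordered = []
    # Walk from the lowest score to the highest, alternating front and back.
    for i, item in enumerate(sorted(scored_nodes, key=lambda pair: pair[1])):
        if i % 2 == 0:
            reordered.insert(0, item)
        else:
            reordered.append(item)
    return reordered


retrieved = [("n0", 0.9), ("n1", 0.8), ("n2", 0.7), ("n3", 0.6), ("n4", 0.5)]
print([name for name, _ in reorder_for_long_context(retrieved)])
```

llama-index appears to ship a postprocessor along these lines (`LongContextReorder`), but check the documentation for your installed version before relying on it.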

#Compare Base Retriever and Sentence Window Retriever

Let's evaluate how well the sentence window retriever works compared to the base retriever.

We define and load an eval benchmark dataset and then run different evaluations over it.

##Create Evaluation Dataset

*Note: This can be expensive, especially with GPT-4. Use caution and tune the sample size to fit your budget.*

In [None]:
import random
import nest_asyncio
nest_asyncio.apply()

In [None]:
len(base_nodes)

459

Randomly sample a few nodes for the evaluation

In [None]:
num_nodes_eval = 100

In [None]:
sample_eval_nodes = random.sample(base_nodes, num_nodes_eval)
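
Because the sample is random, each run produces a different evaluation set, which makes scores hard to compare across runs. A small sketch of a seeded variant follows. The `sample_nodes` helper and the seed value are assumptions for illustration, and the toy list stands in for `base_nodes`.

```python
import random


def sample_nodes(nodes, k, seed=42):
    """Return up to `k` items drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, leaves the global RNG untouched
    return rng.sample(nodes, min(k, len(nodes)))


toy_nodes = [f"node-{i}" for i in range(459)]
first = sample_nodes(toy_nodes, 100)
second = sample_nodes(toy_nodes, 100)
print(first == second)
```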

Configure Service Context for evaluation

In [None]:
eval_service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))

In [None]:
# generate questions
from llama_index.evaluation import DatasetGenerator
dataset_generator = DatasetGenerator(
    sample_eval_nodes,
    service_context=eval_service_context,
    show_progress=True,
    num_questions_per_chunk=2,
)

In [None]:
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes()





In [None]:
eval_dataset.save_json("ipcc_eval_qr_dataset.json")

In [None]:
from llama_index.evaluation import QueryResponseDataset
eval_dataset = QueryResponseDataset.from_json("ipcc_eval_qr_dataset.json")


In [None]:
eval_questions   = eval_dataset.questions
eval_responses   = [r for (_, r) in eval_dataset.qr_pairs]

In [None]:
len(eval_questions)

200

Query the base retriever and sentence window retriever query engines for the responses.

In [None]:
max_samples = 100

In [None]:
from llama_index.evaluation.eval_utils import get_responses
base_responses = get_responses(
    eval_questions[:max_samples],
    base_query_engine,
    show_progress=True
)





In [None]:
sentence_window_responses = get_responses(
    eval_questions[:max_samples],
    sentence_query_engine,
    show_progress=True
)





Configure RAG Triad of Metrics

In [None]:
from llama_index.evaluation import (
    CorrectnessEvaluator,
    RelevancyEvaluator,
    FaithfulnessEvaluator,
)

evaluator_c = CorrectnessEvaluator(service_context=eval_service_context)
evaluator_r = RelevancyEvaluator(service_context=eval_service_context)
evaluator_f = FaithfulnessEvaluator(service_context=eval_service_context)

Define the BatchEvalRunner for computing the metrics

In [None]:
evaluator_dict = {
    "correctness": evaluator_c,
    "faithfulness": evaluator_f,
    "relevancy": evaluator_r
}

In [None]:
from llama_index.evaluation import BatchEvalRunner
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)
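
Before launching the runs, it helps to estimate how many judge calls they will trigger. The sketch below assumes the runner schedules one evaluation job per (question, evaluator) pair. `count_eval_jobs` is a hypothetical helper, not a llama-index function.

```python
def count_eval_jobs(num_questions, evaluator_names):
    """One job per (question, evaluator) pair (assumption about the runner)."""
    return num_questions * len(evaluator_names)


evaluators = ["correctness", "faithfulness", "relevancy"]
jobs_per_engine = count_eval_jobs(100, evaluators)
total_jobs = 2 * jobs_per_engine  # base retriever + sentence window retriever
print(jobs_per_engine, total_jobs)
```

With GPT-4 as the judge, this job count is what drives the cost, so lowering `max_samples` is the simplest lever.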

Compute metrics for the base retriever

In [None]:
base_eval_results = await batch_runner.aevaluate_responses(
    queries=eval_questions[:max_samples],
    responses=base_responses[:max_samples],
    reference=eval_responses[:max_samples],
)





Compute the metrics for the sentence window retriever

In [None]:
sentence_window_eval_results = await batch_runner.aevaluate_responses(
    queries=eval_questions[:max_samples],
    responses=sentence_window_responses[:max_samples],
    reference=eval_responses[:max_samples],
)





Display the results

In [None]:
from llama_index.evaluation.eval_utils import get_results_df
results_df = get_results_df(
    [sentence_window_eval_results, base_eval_results],
    ["Sentence Window Retriever", "Base Retriever"],
    ["correctness", "relevancy", "faithfulness"]
)

In [None]:
display(results_df)

| | names | correctness | relevancy | faithfulness |
|---|---|---|---|---|
| 0 | Sentence Window Retriever | 4.21 | 0.9 | 0.93 |
| 1 | Base Retriever | 3.89 | 0.74 | 0.84 |
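
To put the gap in perspective, the sketch below computes the absolute and relative differences from the numbers reported above. The scales are an assumption based on my understanding of these evaluators: correctness looks like a mean score on a 1 to 5 scale, while relevancy and faithfulness look like pass rates between 0 and 1.

```python
# Scores copied from the results table.
results = {
    "Sentence Window Retriever": {"correctness": 4.21, "relevancy": 0.90, "faithfulness": 0.93},
    "Base Retriever": {"correctness": 3.89, "relevancy": 0.74, "faithfulness": 0.84},
}

window = results["Sentence Window Retriever"]
base = results["Base Retriever"]

relative_gain = {}
for metric in window:
    delta = window[metric] - base[metric]
    relative_gain[metric] = delta / base[metric]
    print(f"{metric:<13} +{delta:.2f} ({relative_gain[metric]:+.1%})")
```

These differences come from a single run over 100 generated questions, so they indicate a trend rather than a statistically robust result.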
