# Auto Merging Retriever

In this notebook, we showcase our `AutoMergingRetriever`, which looks at a set of leaf nodes and recursively "merges" subsets of leaf nodes that reference a parent node beyond a given threshold. This allows us to consolidate potentially disparate, smaller contexts into a larger context that might help synthesis.
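To make the merging rule concrete, here is a minimal pure-Python sketch of the idea, not the library's implementation. The function name `auto_merge`, the toy node ids, and the `ratio_thresh=0.5` value are assumptions made for illustration: a parent replaces its retrieved children once the fraction of its children that were retrieved exceeds the threshold, and the check repeats so merges can propagate upward.

```python
from collections import defaultdict


def auto_merge(retrieved, parent_of, children_of, ratio_thresh=0.5):
    """Repeatedly replace groups of retrieved nodes with their parent.

    A parent replaces its retrieved children when the fraction of its
    children that were retrieved is strictly greater than ratio_thresh.
    """
    current = set(retrieved)
    changed = True
    while changed:
        changed = False
        # Group the currently selected nodes by their parent.
        by_parent = defaultdict(set)
        for node in current:
            parent = parent_of.get(node)
            if parent is not None:
                by_parent[parent].add(node)
        # Merge every group that covers enough of its parent's children.
        for parent, kids in by_parent.items():
            ratio = len(kids) / len(children_of[parent])
            if ratio > ratio_thresh:
                current -= kids
                current.add(parent)
                changed = True
    return current


# Hypothetical three-level hierarchy: root -> mid_a, mid_b -> four leaves each.
children_of = {
    "root": ["mid_a", "mid_b"],
    "mid_a": ["a1", "a2", "a3", "a4"],
    "mid_b": ["b1", "b2", "b3", "b4"],
}
parent_of = {
    child: parent for parent, kids in children_of.items() for child in kids
}

# Three of mid_a's four leaves were retrieved, so they merge into mid_a.
merged = auto_merge(["a1", "a2", "a3", "b1"], parent_of, children_of)
print(sorted(merged))
```

In this toy run, `mid_a` replaces its three retrieved leaves, while `b1` stays as is because only one of `mid_b`'s four children was retrieved.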

You can define this hierarchy yourself over a set of documents, or you can use the `HierarchicalNodeParser`, a text parser that takes in a candidate set of documents and outputs an entire hierarchy of nodes, from coarse to fine.

In [1]:
import os 
from dotenv import load_dotenv, find_dotenv

In [2]:
# Locate the nearest .env file instead of relying on a machine-specific absolute path
load_dotenv(find_dotenv(usecwd=True))

True

In [3]:
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙 first.

## Load Data

Let's first load the Llama 2 paper: https://arxiv.org/pdf/2307.09288.pdf. This will be our test data.

In [4]:
!mkdir -p 'data/'
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"

--2024-12-06 18:48:56--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.3.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/2307.09288 [following]
URL transformed to HTTPS due to an HSTS policy
--2024-12-06 18:48:57--  https://arxiv.org/pdf/2307.09288
Reusing existing connection to arxiv.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: ‘data/llama2.pdf’


2024-12-06 18:48:59 (6.03 MB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]



In [5]:
from pathlib import Path
from llama_index.readers.file import PyMuPDFReader

In [6]:
from llama_index.core import SimpleDirectoryReader

In [7]:
docs0 = SimpleDirectoryReader(input_files=["./data/llama2.pdf"]).load_data()

In [8]:
loader = PyMuPDFReader()
# Alternative to SimpleDirectoryReader in the previous cell; this overwrites docs0.
docs0 = loader.load(file_path=Path("./data/llama2.pdf"))

By default, the PDF reader creates a separate doc for each page.
For the sake of this notebook, we stitch docs together into one doc.
This will help us better highlight auto-merging capabilities that "stitch" chunks together later on.

In [9]:
from llama_index.core import Document

doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]

## Parse Chunk Hierarchy from Text, Load into Storage

In this section we make use of the `HierarchicalNodeParser`. This will output a hierarchy of nodes, from top-level nodes with bigger chunk sizes to child nodes with smaller chunk sizes, where each child node has a parent node with a bigger chunk size.

By default, the hierarchy is:
- 1st level: chunk size 2048
- 2nd level: chunk size 512
- 3rd level: chunk size 128


We then load these nodes into storage. The leaf nodes are indexed and retrieved via a vector store - these are the nodes that will first be directly retrieved via similarity search. The other nodes will be retrieved from a docstore.
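As a rough illustration of how such a hierarchy can be built, the sketch below splits text recursively using the three default sizes listed above. It is a simplification under stated assumptions: it counts characters rather than tokens, ignores sentence boundaries and overlap, and `ToyNode` and `build_hierarchy` are hypothetical names rather than LlamaIndex classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyNode:
    node_id: int
    text: str
    level: int
    parent_id: Optional[int] = None
    child_ids: List[int] = field(default_factory=list)


def build_hierarchy(text, chunk_sizes=(2048, 512, 128)):
    """Split text into nested chunks, one level per entry in chunk_sizes."""
    nodes = []

    def split(segment, level, parent_id):
        size = chunk_sizes[level]
        for start in range(0, len(segment), size):
            # node_id equals the node's position in the nodes list.
            node = ToyNode(len(nodes), segment[start:start + size], level, parent_id)
            nodes.append(node)
            if parent_id is not None:
                nodes[parent_id].child_ids.append(node.node_id)
            # Recurse into the next, finer level if there is one.
            if level + 1 < len(chunk_sizes):
                split(node.text, level + 1, node.node_id)

    split(text, 0, None)
    return nodes


hierarchy = build_hierarchy("x" * 4096)
toy_leaves = [n for n in hierarchy if not n.child_ids]
toy_roots = [n for n in hierarchy if n.parent_id is None]
print(len(hierarchy), len(toy_leaves), len(toy_roots))  # prints: 42 32 2
```

Only the 128-character chunks end up without children, which mirrors why the leaf level is the one that gets embedded and searched.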

In [10]:
from llama_index.core.node_parser import HierarchicalNodeParser, SentenceSplitter

In [11]:
node_parser = HierarchicalNodeParser.from_defaults()

In [12]:
nodes = node_parser.get_nodes_from_documents(docs)

In [13]:
len(nodes)

1009

Here we import a simple helper function for fetching "leaf" nodes within a node list.
These are nodes that don't have children of their own.
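Conceptually, these helpers are simple filters over the parent/child relationships. The sketch below shows the idea on plain dictionaries; the record layout and the `toy_leaf_ids` / `toy_root_ids` names are assumptions for illustration and do not reflect how LlamaIndex stores node relationships.

```python
# Toy node records (hypothetical structure, not LlamaIndex node objects).
toy_nodes = [
    {"id": "root", "parent": None, "children": ["mid"]},
    {"id": "mid", "parent": "root", "children": ["leaf_1", "leaf_2"]},
    {"id": "leaf_1", "parent": "mid", "children": []},
    {"id": "leaf_2", "parent": "mid", "children": []},
]


def toy_leaf_ids(nodes):
    """Leaf nodes are the ones with no children."""
    return [n["id"] for n in nodes if not n["children"]]


def toy_root_ids(nodes):
    """Root nodes are the ones with no parent."""
    return [n["id"] for n in nodes if n["parent"] is None]


print(toy_leaf_ids(toy_nodes))  # prints: ['leaf_1', 'leaf_2']
print(toy_root_ids(toy_nodes))  # prints: ['root']
```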

In [14]:
from llama_index.core.node_parser import get_leaf_nodes, get_root_nodes

In [15]:
leaf_nodes = get_leaf_nodes(nodes)

In [16]:
len(leaf_nodes)

783

In [17]:
root_nodes = get_root_nodes(nodes)

### Load into Storage

We define a docstore, which we load all nodes into.

We then define a `VectorStoreIndex` containing just the leaf-level nodes.
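The split between the two stores can be pictured with a small stand-in: a dictionary that holds every node (the "docstore") and a similarity search that only sees the leaves. Everything in this sketch is an assumption for illustration, including the bag-of-words `embed` function, which is a crude substitute for a real embedding model, and the toy node texts.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# "Docstore": every node keyed by id, including the non-leaf parents.
toy_docstore = {
    "p1": {"text": "safety data in RLHF improves safety scores", "parent": None},
    "p2": {"text": "pretraining uses public data from many sources", "parent": None},
    "l1": {"text": "safety data in RLHF", "parent": "p1"},
    "l2": {"text": "improves safety scores", "parent": "p1"},
    "l3": {"text": "pretraining uses public data", "parent": "p2"},
}

# "Vector index": only the leaf nodes are embedded and searchable.
toy_leaf_ids = ["l1", "l2", "l3"]
toy_leaf_vectors = {i: embed(toy_docstore[i]["text"]) for i in toy_leaf_ids}


def retrieve(query, top_k=2):
    """Return the ids of the top_k leaves most similar to the query."""
    q = embed(query)
    ranked = sorted(
        toy_leaf_ids, key=lambda i: cosine(q, toy_leaf_vectors[i]), reverse=True
    )
    return ranked[:top_k]


hits = retrieve("safety data RLHF")
# Parents are never searched directly; they are looked up in the docstore.
hit_parents = {toy_docstore[i]["parent"] for i in hits}
print(hits, hit_parents)
```

Because parents are fetched by id rather than by similarity, they only need to live in the docstore, which is why the vector index can stay leaf-only.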

In [18]:
# define storage context
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.storage import StorageContext
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

docstore = SimpleDocumentStore()

# insert nodes into docstore
docstore.add_documents(nodes)

# define storage context (will include vector store by default too)
storage_context = StorageContext.from_defaults(docstore=docstore)

In [20]:
# Load the leaf nodes into the vector index
from llama_index.core import VectorStoreIndex

base_index = VectorStoreIndex(
    leaf_nodes,
    embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
    storage_context=storage_context,
)

## Define Retriever

In [19]:
from llama_index.core.retrievers.auto_merging_retriever import AutoMergingRetriever

In [21]:
base_retriever = base_index.as_retriever(similarity_top_k=6)
retriever = AutoMergingRetriever(base_retriever, storage_context, verbose=True)

In [25]:
# query_str = "What were some lessons learned from red-teaming?"
# query_str = "Can you tell me about the key concepts for safety finetuning"
query_str = (
    "What could be the potential outcomes of adjusting the amount of safety data used in the RLHF stage?"
)

base_nodes = base_retriever.retrieve(query_str)

nodes = retriever.retrieve(query_str)

> Merging 3 nodes into parent node.
> Parent node id: 17ef2f43-9553-4dca-bc57-e22127535a59.
> Parent node text: We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: an...



In [26]:
len(nodes)

6

In [27]:
len(base_nodes)

6

In [24]:
from llama_index.core.response.notebook_utils import display_source_node

for node in nodes:
    display_source_node(node, source_length=10000)

**Node ID:** 18cdca60-c495-40ed-90d8-8584451a6b7b<br>**Similarity:** 0.20969704892734822<br>**Text:** Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗
Louis Martin†
Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini<br>

**Node ID:** a954be12-18ac-46c1-a375-dfb5c8f947fc<br>**Similarity:** 0.12247519370800712<br>**Text:** Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten<br>

In [25]:
for node in base_nodes:
    display_source_node(node, source_length=10000)

**Node ID:** 18cdca60-c495-40ed-90d8-8584451a6b7b<br>**Similarity:** 0.20969704892734822<br>**Text:** Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗
Louis Martin†
Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini<br>

**Node ID:** a954be12-18ac-46c1-a375-dfb5c8f947fc<br>**Similarity:** 0.12247519370800712<br>**Text:** Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten<br>

## Plug it into Query Engine

In [26]:
from llama_index.core.query_engine import RetrieverQueryEngine

In [27]:
query_engine = RetrieverQueryEngine.from_args(retriever)
base_query_engine = RetrieverQueryEngine.from_args(base_retriever)

In [28]:
response = query_engine.query(query_str)

In [29]:
print(str(response))

Adjusting the amount of safety data used in the RLHF stage could potentially impact the model's performance and generalization capabilities. It may lead to improved safety and robustness of the chat models, as well as influence the fine-tuning process to enhance the overall quality of the models.


In [39]:
base_response = base_query_engine.query(query_str)

In [40]:
print(str(base_response))

Adjusting the amount of safety data used in the RLHF stage could potentially lead to improvements in model safety performance without significantly impacting the helpfulness score distribution. It may help the model align with safety guidelines early on, laying a foundation for high-quality human preference data annotation. Additionally, increasing the amount of safety data in model training could result in a significant improvement in the mean safety reward model score while keeping the helpfulness counterpart relatively stable. Furthermore, the addition of more safety training data may gradually eliminate the most unsafe responses, as indicated by the disappearance of the left tail of safety reward model scores.


## Evaluation

We evaluate how well the hierarchical retriever works compared to the baseline retriever in a more quantitative manner.

**WARNING**: This can be *expensive*, especially with GPT-4-class models such as the `gpt-4o` model used below. Use caution and tune the sample size to fit your budget.

In [32]:
from llama_index.core.llama_dataset.generator import RagDatasetGenerator
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.llms.openai import OpenAI
import nest_asyncio

nest_asyncio.apply()

In [31]:
gpt4 = OpenAI(model='gpt-4o')

In [39]:
# NOTE: run this if the dataset isn't already saved
dataset_generator = RagDatasetGenerator(
    root_nodes[:2],
    llm=gpt4,
    show_progress=True,
    num_questions_per_chunk=3,
)

In [41]:
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes()

100%|██████████| 2/2 [00:02<00:00,  1.36s/it]
100%|██████████| 3/3 [00:06<00:00,  2.08s/it]
100%|██████████| 3/3 [00:12<00:00,  4.10s/it]


In [42]:
eval_dataset.save_json("data/llama2_eval_qr_dataset.json")

In [43]:
# optional
eval_dataset = LabelledRagDataset.from_json(
    "data/llama2_eval_qr_dataset.json"
)

### Compare Results

We run evaluations on each of the retrievers: correctness, semantic similarity, relevance, and faithfulness.
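Each evaluator produces one score per question, and the comparison table reports a single number per metric. A minimal sketch of that roll-up, assuming a plain mean per metric, is shown below. The `mean_scores` helper and the per-question scores are invented purely for illustration and are not results from this notebook.

```python
from statistics import mean


def mean_scores(scores_by_metric):
    """Average the per-question scores for each metric."""
    return {metric: mean(scores) for metric, scores in scores_by_metric.items()}


# Invented per-question scores for one retriever (illustrative only).
toy_scores = {
    "correctness": [4.0, 3.0, 2.0, 1.0],  # graded on a numeric scale
    "relevancy": [1.0, 0.0, 1.0, 0.0],  # pass/fail encoded as 1.0/0.0
}

summary = mean_scores(toy_scores)
print(summary)  # prints: {'correctness': 2.5, 'relevancy': 0.5}
```

With pass/fail metrics, the mean is simply the fraction of questions that passed, so on a handful of questions one changed verdict moves the score noticeably.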

In [33]:
import asyncio
import nest_asyncio

nest_asyncio.apply()

In [34]:
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    RelevancyEvaluator,
    FaithfulnessEvaluator,
    PairwiseComparisonEvaluator,
)


from collections import defaultdict
import pandas as pd

In [35]:
gpt4 = OpenAI(temperature=0, model="gpt-4o")

In [55]:
# One evaluator per metric, plus a pairwise evaluator for head-to-head comparison
evaluator_c = CorrectnessEvaluator(llm=gpt4)
evaluator_s = SemanticSimilarityEvaluator(embed_model=OpenAIEmbedding(model='text-embedding-3-small'))
evaluator_r = RelevancyEvaluator(llm=gpt4)
evaluator_f = FaithfulnessEvaluator(llm=gpt4)
pairwise_evaluator = PairwiseComparisonEvaluator(llm=gpt4)

In [44]:
from llama_index.core.evaluation.eval_utils import get_responses, get_results_df
from llama_index.core.evaluation import BatchEvalRunner

In [46]:
eval_qs = [example.query for example in eval_dataset.examples]
ref_response_strs = [example.reference_answer for example in eval_dataset.examples]

In [47]:
pred_responses = get_responses(eval_qs, query_engine, show_progress=True)

100%|██████████| 6/6 [00:09<00:00,  1.64s/it]


In [48]:
base_pred_responses = get_responses(
    eval_qs, base_query_engine, show_progress=True
)

100%|██████████| 6/6 [00:03<00:00,  1.78it/s]


In [49]:
import numpy as np

pred_response_strs = [str(p) for p in pred_responses]
base_pred_response_strs = [str(p) for p in base_pred_responses]

In [50]:
evaluator_dict = {
    "correctness": evaluator_c,
    "faithfulness": evaluator_f,
    "relevancy": evaluator_r,
    "semantic_similarity": evaluator_s,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)

In [51]:
eval_results = await batch_runner.aevaluate_responses(
    eval_qs, responses=pred_responses, reference=ref_response_strs
)

100%|██████████| 24/24 [00:14<00:00,  1.66it/s]


In [52]:
base_eval_results = await batch_runner.aevaluate_responses(
    eval_qs, responses=base_pred_responses, reference=ref_response_strs
)

100%|██████████| 24/24 [00:14<00:00,  1.63it/s]


In [53]:
results_df = get_results_df(
    [eval_results, base_eval_results],
    ["Auto Merging Retriever", "Base Retriever"],
    ["correctness", "relevancy", "faithfulness", "semantic_similarity"],
)
display(results_df)

| | names | correctness | relevancy | faithfulness | semantic_similarity |
|---|---|---|---|---|---|
| 0 | Auto Merging Retriever | 2.5 | 0.333333 | 0.0 | 0.902941 |
| 1 | Base Retriever | 2.5 | 0.166667 | 0.0 | 0.89918 |


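**Analysis**: On this small sample of 6 questions, the results are roughly the same. Correctness is tied at 2.5, faithfulness is 0.0 for both retrievers, and semantic similarity differs by less than 0.01. Relevancy is higher for the auto-merging retriever (0.33 vs. 0.17), but with only 6 questions that gap corresponds to a single question.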

Let's also try to see which answer GPT-4 prefers with our pairwise evals.