# Recursive Retriever + Node References

This guide shows how you can use recursive retrieval to traverse node relationships and fetch nodes based on "references".

Node references are a powerful concept. When you first perform retrieval, you may want to retrieve the reference as opposed to the raw text. You can have multiple references point to the same node.

In this guide we explore some different usages of node references:
- **Chunk references**: Different chunk sizes referring to a bigger chunk
- **Metadata references**: Summaries + Generated Questions referring to a bigger chunk

## Load Data + Setup

In this section we download the Llama 2 paper and create an initial set of nodes (chunk size 1024).

In [None]:
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"

In [1]:
from pathlib import Path
from llama_hub.file.pdf.base import PDFReader
from llama_index.response.notebook_utils import display_source_node
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.llms import OpenAI
import json

In [2]:
loader = PDFReader()
docs0 = loader.load_data(file=Path("./data/llama2.pdf"))

In [3]:
from llama_index import Document

doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]

In [4]:
from llama_index.node_parser import SimpleNodeParser
from llama_index.schema import IndexNode

In [5]:
node_parser = SimpleNodeParser.from_defaults(chunk_size=1024)

In [6]:
base_nodes = node_parser.get_nodes_from_documents(docs)
# set node ids to be a constant
for idx, node in enumerate(base_nodes):
    node.id_ = f"node-{idx}"

In [7]:
llm = OpenAI(model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)

## Baseline Retriever

Define a baseline retriever that simply fetches the top-k raw text nodes by embedding similarity.

In [8]:
base_index = VectorStoreIndex(base_nodes)
base_retriever = base_index.as_retriever(similarity_top_k=2)

In [9]:
retrievals = base_retriever.retrieve(
    "Can you tell me about the key concepts for safety finetuning"
)

In [10]:
for n in retrievals:
    display_source_node(n, source_length=1500)

**Node ID:** node-22<br>**Similarity:** 0.8322523818232979<br>**Text:** Wethendefinebestpracticesforsafeandhelpfulmodelresponses: themodelshouldfirstaddressimmediate
safetyconcernsifapplicable,thenaddressthepromptbyexplainingthepotentialriskstotheuser,andfinally
provide additional information if possible.We also ask the annotators to avoid negative user experience
categories (see Appendix A.5.2).The guidelines are meant to be a general guide for the model and are
iteratively refined and revised to include newly identified risks.4.2.2 Safety Supervised Fine-Tuning
InaccordancewiththeestablishedguidelinesfromSection4.2.1,wegatherpromptsanddemonstrations
ofsafemodelresponsesfromtrainedannotators,andusethedataforsupervisedfine-tuninginthesame
manner as described in Section 3.1.An example can be found in Table 5.The annotators are instructed to initially come up with prompts that they think could potentially induce
themodel toexhibit unsafebehavior, i.e.,perform redteaming, asdefined bythe guidelines.Subsequently,
annotators are tasked with crafting a safe and helpful response that the model should produce.4.2.3 Safety RLHF
Weobserveearlyinthedevelopmentof Llama 2-Chat thatitisabletogeneralizefromthesafedemonstrations
insupervisedfine-tuning.Themodelquicklylearnstowritedetailedsaferesponses,addresssafetyconcerns,
explainwhythetopicmightbesensitive,andprovideadditionalhelpfulinformation.Inparticular,when
the model outputs safe responses, they are often more detailed than what the average annotator writes.Therefore, after gathering only a few thousan...<br>

**Node ID:** node-18<br>**Similarity:** 0.79782506307983<br>**Text:** Next,wedescribe
theprocessofoursafetyalignment(Section4.2),explaininghowwecollectedsafety-relatedannotationsand
utilizedSFTandRLHF,andpresentexperimentalresults.Then,wediscusstheredteamingweperformedto
furtherunderstandandimprovemodelsafety(Section4.3).Finally,wepresentquantitativesafetyevaluations
ofLlama 2-Chat (Section 4.4).We also share a model card in the Appendix, in Table 52.4.1 Safety in Pretraining
It is important to understand what is in the pretraining data both to increase transparency and to shed
lightonrootcausesofpotentialdownstreamissues,suchaspotentialbiases.Thiscaninformwhat,ifany,
downstream mitigations to consider, and help guide appropriate model use.In this section, we analyze the
pretraining datafor distributionsof languages,demographic representations,and toxicity.Wealso present
the results of testing the pretrained models on existing safety benchmarks.Steps Taken to Pretrain Responsibly.We followed Meta’s standard privacy and legal review processes for
each dataset used in training.We did not use any Meta user data in training.We excluded data from certain
sitesknowntocontainahighvolumeofpersonalinformationaboutprivateindividuals.Wemadeabest
effort to train our models efficiently to reduce the carbon footprint of pretraining (Section 2.2.1).Sharing our
modelsbroadlywillreducetheneedforotherstotrainsimilarmodels.Noadditionalfilteringwasconducted
onthedatasets,toallow Llama 2 tobemorewidelyusableacrosstasks(e.g.,itcanbebetterusedforhate
speechclassif...<br>

In [11]:
query_engine_base = RetrieverQueryEngine.from_args(
    base_retriever, service_context=service_context
)

In [12]:
response = query_engine_base.query(
    "Can you tell me about the key concepts for safety finetuning"
)
print(str(response))

The key concepts for safety fine-tuning include addressing immediate safety concerns, explaining potential risks to the user, and providing additional information if possible. The guidelines for safety fine-tuning are meant to be a general guide for the model and are iteratively refined and revised to include newly identified risks. Safety fine-tuning can be done through supervised fine-tuning, where prompts and demonstrations of safe model responses are gathered from trained annotators, or through reinforcement learning from human feedback (RLHF), where human preference data is collected to train a safety reward model. The goal of safety fine-tuning is to make the model more robust to unsafe behavior and improve its ability to generate nuanced and detailed safe responses.


## Chunk References: Smaller Child Chunks Referring to Bigger Parent Chunk

In this usage example, we show how to build a graph of smaller chunks pointing to bigger parent chunks.

During query-time, we retrieve smaller chunks, but we follow references to bigger chunks. This allows us to have more context for synthesis.

In [13]:
sub_chunk_sizes = [128, 256, 512]
sub_node_parsers = [
    SimpleNodeParser.from_defaults(chunk_size=c) for c in sub_chunk_sizes
]

all_nodes = []
for base_node in base_nodes:
    for n in sub_node_parsers:
        sub_nodes = n.get_nodes_from_documents([base_node])
        sub_inodes = [
            IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes
        ]
        all_nodes.extend(sub_inodes)

    # also add original node to node
    base_inode = IndexNode.from_text_node(base_node, base_node.node_id)
    all_nodes.append(base_inode)

In [14]:
all_nodes_dict = {n.node_id: n for n in all_nodes}

In [15]:
vector_index_chunk = VectorStoreIndex(all_nodes, service_context=service_context)

In [16]:
vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=2)

In [17]:
retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=all_nodes_dict,
    verbose=True,
)

In [18]:
nodes = retriever_chunk.retrieve(
    "Can you tell me about the key concepts for safety finetuning"
)
for node in nodes:
    display_source_node(node, source_length=2000)

[36;1m[1;3mRetrieving with query id None: Can you tell me about the key concepts for safety finetuning
[0m[38;5;200m[1;3mRetrieved node with id, entering: node-21
[0m[36;1m[1;3mRetrieving with query id node-21: Can you tell me about the key concepts for safety finetuning
[0m[38;5;200m[1;3mRetrieved node with id, entering: node-22
[0m[36;1m[1;3mRetrieving with query id node-22: Can you tell me about the key concepts for safety finetuning
[0m

**Node ID:** node-21<br>**Similarity:** 0.8693572142828401<br>**Text:** 22

TruthfulQA ↑ToxiGen ↓
MPT7B 29.13 22.32
30B 35.25 22.61
Falcon7B 25.95 14.53
40B 40.39 23.44
Llama 17B 27.42 23.00
13B 41.74 23.08
33B 44.19 22.57
65B 48.71 21.77
Llama 27B 33.29 21.25
13B 41.86 26.10
34B 43.45 21.19
70B 50.18 24.60
Table 11: Evaluation of pretrained LLMs on automatic safety benchmarks.For TruthfulQA, we present the
percentageofgenerationsthatarebothtruthfulandinformative(thehigherthebetter).ForToxiGen,we
present the percentage of toxic generations (the smaller, the better).Benchmarks give a summary view ofmodel capabilities and behaviors that allow us to understand general
patternsinthemodel,buttheydonotprovideafullycomprehensiveviewoftheimpactthemodelmayhave
onpeopleorreal-worldoutcomes;thatwouldrequirestudyofend-to-endproductdeployments.Further
testing and mitigation should be done to understand bias and other social issues for the specific context
in which a system may be deployed.For this, it may be necessary to test beyond the groups available in
theBOLDdataset(race,religion,andgender).AsLLMsareintegratedanddeployed,welookforwardto
continuing research that will amplify their potential for positive impact on these important social issues.4.2 Safety Fine-Tuning
In this section, we describe our approach to safety fine-tuning, including safety categories, annotation
guidelines,andthetechniquesweusetomitigatesafetyrisks.Weemployaprocesssimilartothegeneral
fine-tuning methods as described in Section 3, with some notable differences related to safety concerns.Specifically, we use the following techniques in safety fine-tuning:
1.Supervised Safety Fine-Tuning : We initialize by gathering adversarial prompts and safe demonstra-
tions that are then included in the general supervised fine-tuning process (Section 3.1).This teaches
themodeltoalignwithoursafetyguidelinesevenbeforeRLHF,andthuslaysthefoundationfor
high-quality human preference data annotation.2.Safety RLHF : Subsequently, we integrate safety in the general RLHF pipeline described in Se...<br>

**Node ID:** node-22<br>**Similarity:** 0.854811252011206<br>**Text:** Wethendefinebestpracticesforsafeandhelpfulmodelresponses: themodelshouldfirstaddressimmediate
safetyconcernsifapplicable,thenaddressthepromptbyexplainingthepotentialriskstotheuser,andfinally
provide additional information if possible.We also ask the annotators to avoid negative user experience
categories (see Appendix A.5.2).The guidelines are meant to be a general guide for the model and are
iteratively refined and revised to include newly identified risks.4.2.2 Safety Supervised Fine-Tuning
InaccordancewiththeestablishedguidelinesfromSection4.2.1,wegatherpromptsanddemonstrations
ofsafemodelresponsesfromtrainedannotators,andusethedataforsupervisedfine-tuninginthesame
manner as described in Section 3.1.An example can be found in Table 5.The annotators are instructed to initially come up with prompts that they think could potentially induce
themodel toexhibit unsafebehavior, i.e.,perform redteaming, asdefined bythe guidelines.Subsequently,
annotators are tasked with crafting a safe and helpful response that the model should produce.4.2.3 Safety RLHF
Weobserveearlyinthedevelopmentof Llama 2-Chat thatitisabletogeneralizefromthesafedemonstrations
insupervisedfine-tuning.Themodelquicklylearnstowritedetailedsaferesponses,addresssafetyconcerns,
explainwhythetopicmightbesensitive,andprovideadditionalhelpfulinformation.Inparticular,when
the model outputs safe responses, they are often more detailed than what the average annotator writes.Therefore, after gathering only a few thousand supervised demonstrations, we switched entirely to RLHF to
teachthemodelhowtowritemorenuancedresponses.ComprehensivetuningwithRLHFhastheadded
benefit that it may make the model more robust to jailbreak attempts (Bai et al., 2022a).WeconductRLHFbyfirstcollectinghumanpreferencedataforsafetysimilartoSection3.2.2: annotators
writeapromptthattheybelievecanelicitunsafebehavior,andthencomparemultiplemodelresponsesto
theprompts,selectingtheresponsethatissafestaccordingtoasetofguidelines.Wethenusethehu...<br>

In [19]:
query_engine_chunk = RetrieverQueryEngine.from_args(
    retriever_chunk, service_context=service_context
)

In [20]:
response = query_engine_chunk.query(
    "Can you tell me about the key concepts for safety finetuning"
)
print(str(response))

[36;1m[1;3mRetrieving with query id None: Can you tell me about the key concepts for safety finetuning
[0m[38;5;200m[1;3mRetrieved node with id, entering: node-21
[0m[36;1m[1;3mRetrieving with query id node-21: Can you tell me about the key concepts for safety finetuning
[0m[38;5;200m[1;3mRetrieved node with id, entering: node-22
[0m[36;1m[1;3mRetrieving with query id node-22: Can you tell me about the key concepts for safety finetuning
[0mThe key concepts for safety fine-tuning include supervised safety fine-tuning, safety RLHF (Reinforcement Learning from Human Feedback), and safety context distillation. In supervised safety fine-tuning, adversarial prompts and safe demonstrations are gathered and included in the fine-tuning process to align the model with safety guidelines. Safety RLHF involves training a safety-specific reward model and gathering challenging adversarial prompts to optimize the model's safety through rejection sampling and PPO (Proximal Policy Optimiz

## Metadata References: Summaries + Generated Questions referring to a bigger chunk

In this usage example, we show how to define additional context that references the source node.

This additional context includes summaries as well as generated questions.

During query-time, we retrieve smaller chunks, but we follow references to bigger chunks. This allows us to have more context for synthesis.

In [21]:
from llama_index.node_parser import SimpleNodeParser
from llama_index.schema import IndexNode
from llama_index.node_parser.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    MetadataExtractor,
)

In [22]:
metadata_extractor = MetadataExtractor(
    extractors=[
        SummaryExtractor(summaries=["self"], show_progress=True),
        QuestionsAnsweredExtractor(questions=5, show_progress=True),
    ],
)

In [None]:
# run metadata extractor across base nodes, get back dictionaries
metadata_dicts = metadata_extractor.extract(base_nodes)

In [23]:
# cache metadata dicts
def save_metadata_dicts(path):
    with open(path, "w") as fp:
        for m in metadata_dicts:
            fp.write(json.dumps(m) + "\n")


def load_metadata_dicts(path):
    with open(path, "r") as fp:
        metadata_dicts = [json.loads(l) for l in fp.readlines()]
        return metadata_dicts

In [56]:
save_metadata_dicts("data/llama2_metadata_dicts.jsonl")

In [24]:
metadata_dicts = load_metadata_dicts("data/llama2_metadata_dicts.jsonl")

In [25]:
# all nodes consists of source nodes, along with metadata
all_nodes = base_nodes
for idx, d in enumerate(metadata_dicts):
    inode_q = IndexNode(
        text=d["questions_this_excerpt_can_answer"], index_id=base_nodes[idx].node_id
    )
    inode_s = IndexNode(text=d["section_summary"], index_id=base_nodes[idx].node_id)
    all_nodes.extend([inode_q, inode_s])

In [26]:
all_nodes_dict = {n.node_id: n for n in all_nodes}

In [27]:
## Load index into vector index
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)

vector_index_metadata = VectorStoreIndex(all_nodes, service_context=service_context)

In [28]:
vector_retriever_metadata = vector_index_metadata.as_retriever(similarity_top_k=2)

In [29]:
retriever_metadata = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_metadata},
    node_dict=all_nodes_dict,
    verbose=True,
)

In [30]:
nodes = retriever_metadata.retrieve(
    "Can you tell me about the key concepts for safety finetuning"
)
for node in nodes:
    display_source_node(node, source_length=2000)

[36;1m[1;3mRetrieving with query id None: Can you tell me about the key concepts for safety finetuning
[0m[38;5;200m[1;3mRetrieved node with id, entering: node-21
[0m[36;1m[1;3mRetrieving with query id node-21: Can you tell me about the key concepts for safety finetuning
[0m[38;5;200m[1;3mRetrieved node with id, entering: node-27
[0m[36;1m[1;3mRetrieving with query id node-27: Can you tell me about the key concepts for safety finetuning
[0m

**Node ID:** node-21<br>**Similarity:** 0.8455204474983238<br>**Text:** 22

TruthfulQA ↑ToxiGen ↓
MPT7B 29.13 22.32
30B 35.25 22.61
Falcon7B 25.95 14.53
40B 40.39 23.44
Llama 17B 27.42 23.00
13B 41.74 23.08
33B 44.19 22.57
65B 48.71 21.77
Llama 27B 33.29 21.25
13B 41.86 26.10
34B 43.45 21.19
70B 50.18 24.60
Table 11: Evaluation of pretrained LLMs on automatic safety benchmarks.For TruthfulQA, we present the
percentageofgenerationsthatarebothtruthfulandinformative(thehigherthebetter).ForToxiGen,we
present the percentage of toxic generations (the smaller, the better).Benchmarks give a summary view ofmodel capabilities and behaviors that allow us to understand general
patternsinthemodel,buttheydonotprovideafullycomprehensiveviewoftheimpactthemodelmayhave
onpeopleorreal-worldoutcomes;thatwouldrequirestudyofend-to-endproductdeployments.Further
testing and mitigation should be done to understand bias and other social issues for the specific context
in which a system may be deployed.For this, it may be necessary to test beyond the groups available in
theBOLDdataset(race,religion,andgender).AsLLMsareintegratedanddeployed,welookforwardto
continuing research that will amplify their potential for positive impact on these important social issues.4.2 Safety Fine-Tuning
In this section, we describe our approach to safety fine-tuning, including safety categories, annotation
guidelines,andthetechniquesweusetomitigatesafetyrisks.Weemployaprocesssimilartothegeneral
fine-tuning methods as described in Section 3, with some notable differences related to safety concerns.Specifically, we use the following techniques in safety fine-tuning:
1.Supervised Safety Fine-Tuning : We initialize by gathering adversarial prompts and safe demonstra-
tions that are then included in the general supervised fine-tuning process (Section 3.1).This teaches
themodeltoalignwithoursafetyguidelinesevenbeforeRLHF,andthuslaysthefoundationfor
high-quality human preference data annotation.2.Safety RLHF : Subsequently, we integrate safety in the general RLHF pipeline described in Se...<br>

**Node ID:** node-27<br>**Similarity:** 0.8430926818606254<br>**Text:** In each case, we took the overall lessons as a guide
to helpfurther modelsafetytraining, and specificallytook data fromthese exercisesformodel fine-tuning,
model feedback training, and as a signal for other safety model training.Multiple additionalrounds ofred teaming wereperformed over severalmonths tomeasure the robustness
of each new model as it was released internally.We defined the robustness of a model, γ, with respect to
a red teaming exercise executed by a set of experts as the average number of created prompts that would
triggeraviolatingresponsefromthemodelperpersonperhour.Asanexample,onour7Bmodel,wehadan
evolution of γ: 1.8→0.45over several red teaming iterations and model refinements.Robustness will likely
continue to improve with additional red teaming efforts.Another magnitude that we tracked as new models
were produced was the percentage of prompts triggering violating responses discovered in the previous red
teaming exercises that were mitigated in a given new candidate release.On average, we had a 90% rejection
rate model over model.4.4 Safety Evaluation of Llama 2-Chat
SafetyHumanEvaluation.Wecollectedroughly2,000adversarialpromptsforhumanevaluationaccording
to the safety categories in Section 4.2.1, in which 1,351 prompts are single-turn and 623 are multi-turn.Examples of evaluation prompts and responses can be found in Appendix A.4.6.We then asked raters to judge models for safety violations on a five-point Likert scale with the following
definitions:
•5- No safety violations and very helpful
•4- No safety violations and only minor non-safety issues
•3- No safety violations but not helpful or other major non-safety issues
•2- Mild or moderate safety violations
29

(a)Overall violation percentage.(b)Overall safety and helpfulness mean rating.Figure17: Overallsafetymeasures.Left: Llama 2-Chat haslowviolationpercentageoverallacrossmodel
sizes.Right: Llama 2-Chat has high safety and helpfulness mean rating overall across model sizes.It is
importan...<br>

In [31]:
query_engine_metadata = RetrieverQueryEngine.from_args(
    retriever_metadata, service_context=service_context
)

In [32]:
response = query_engine_metadata.query(
    "Can you tell me about the key concepts for safety finetuning"
)
print(str(response))

[36;1m[1;3mRetrieving with query id None: Can you tell me about the key concepts for safety finetuning
[0m[38;5;200m[1;3mRetrieved node with id, entering: node-21
[0m[36;1m[1;3mRetrieving with query id node-21: Can you tell me about the key concepts for safety finetuning
[0m[38;5;200m[1;3mRetrieved node with id, entering: node-27
[0m[36;1m[1;3mRetrieving with query id node-27: Can you tell me about the key concepts for safety finetuning
[0mThe key concepts for safety fine-tuning include supervised safety fine-tuning, safety RLHF (Reinforcement Learning from Human Feedback), safety context distillation, safety categories, and annotation guidelines. Supervised safety fine-tuning involves gathering adversarial prompts and safe demonstrations to align the model with safety guidelines. Safety RLHF integrates safety into the RLHF pipeline by training a safety-specific reward model and gathering challenging adversarial prompts. Safety context distillation refines the RLHF pipel

## Evaluation

We evalate how well our recursive retrieval + node reference methods work. We evaluate both chunk references as well as metadata references. We use embedding similarity lookup to retrieve the reference nodes.

We compare both methods against a baseline retriever where we fetch the raw nodes directly.

In terms of metrics, we evaluate using both hit-rate and MRR.

### Dataset Generation

We first generate a dataset of questions from the set of text chunks.

In [33]:
from llama_index.finetuning import (
    generate_qa_embedding_pairs,
    EmbeddingQAFinetuneDataset,
)

In [36]:
eval_dataset = generate_qa_embedding_pairs(base_nodes)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 231/231 [19:05<00:00,  4.96s/it]


In [39]:
eval_dataset.save_json("data/llama2_eval_dataset.json")

In [40]:
# optional
eval_dataset = EmbeddingQAFinetuneDataset.from_json("data/llama2_eval_dataset.json")

In [47]:
# TODO: generalize into eval functions
from tqdm import tqdm


def evaluate(
    dataset,
    retriever,
    verbose=False,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]

        # for tmp in retrieved_nodes:
        #     print(f"NODE: {tmp.node.node_id}")
        #     print(tmp.node.get_content())

        rank = None
        for idx, id in enumerate(retrieved_ids):
            if id == expected_id:
                rank = idx + 1
                break

        is_hit = rank is not None  # assume 1 relevant doc
        mrr = 0 if rank is None else 1 / rank

        eval_result = {
            "is_hit": is_hit,
            "mrr": mrr,
            "retrieved": retrieved_ids,
            "expected": expected_id,
            "query": query_id,
        }
        eval_results.append(eval_result)

    return eval_results

### Compare Results

We run evaluations on each of the retrievers to measure hit rate and MRR.

We find that retrievers with node references (either chunk or metadata) tend to perform better than retrieving the raw chunks.

In [69]:
import pandas as pd

# set vector retriever similarity top k to higher
top_k = 10


def display_results(names, results_arr):
    """Display results from evaluate."""

    hit_rates = []
    mrrs = []
    for name, results in zip(names, results_arr):
        results_df = pd.DataFrame(results)
        hit_rate = results_df["is_hit"].mean()
        mrr = results_df["mrr"].mean()
        hit_rates.append(hit_rate)
        mrrs.append(mrr)

    final_df = pd.DataFrame({"retrievers": names, "hit_rate": hit_rates, "mrr": mrrs})
    display(final_df)

In [None]:
vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=top_k)
retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=all_nodes_dict,
    verbose=True,
)
results_chunk = evaluate(eval_dataset, retriever_chunk)

In [None]:
vector_retriever_metadata = vector_index_metadata.as_retriever(similarity_top_k=top_k)
retriever_metadata = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_metadata},
    node_dict=all_nodes_dict,
    verbose=True,
)

results_metadata = evaluate(eval_dataset, retriever_metadata)

In [55]:
base_retriever = base_index.as_retriever(similarity_top_k=top_k)
results_base = evaluate(eval_dataset, base_retriever)

100%|██████████████████████████████████████████████████████████████| 509/509 [01:26<00:00,  5.88it/s]


In [71]:
display_results(
    [
        "Base Retriever",
        "Retriever (Chunk References)",
        "Retriever (Metadata References)",
    ],
    [results_base, results_chunk, results_metadata],
)

Unnamed: 0,retrievers,hit_rate,mrr
0,Base Retriever,0.269155,0.191413
1,Retriever (Chunk References),0.292731,0.254551
2,Retriever (Metadata References),0.286837,0.240858
