<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/evaluation/retrieval/retriever_eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval Evaluation

This notebook uses our `RetrieverEvaluator` to evaluate the quality of any Retriever module defined in LlamaIndex.

We evaluate with several metrics: hit rate, MRR, precision, recall, AP (average precision), and NDCG. For each question, these metrics score the retrieved results against the ground-truth context.
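To make the simpler metrics concrete, here is a minimal sketch of hit rate, MRR, precision, and recall over lists of node ids. It follows the textbook definitions and is not the library's implementation; the function names and the toy ids are made up for illustration.

```python
def hit_rate(expected_ids, retrieved_ids):
    """1.0 if at least one expected id was retrieved, else 0.0."""
    return 1.0 if any(doc_id in expected_ids for doc_id in retrieved_ids) else 0.0


def mrr(expected_ids, retrieved_ids):
    """Reciprocal rank of the first relevant result (0.0 if none is found)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected_ids:
            return 1.0 / rank
    return 0.0


def precision(expected_ids, retrieved_ids):
    """Fraction of the retrieved ids that are relevant."""
    if not retrieved_ids:
        return 0.0
    hits = len(set(retrieved_ids) & set(expected_ids))
    return hits / len(retrieved_ids)


def recall(expected_ids, retrieved_ids):
    """Fraction of the relevant ids that were retrieved."""
    if not expected_ids:
        return 0.0
    hits = len(set(retrieved_ids) & set(expected_ids))
    return hits / len(expected_ids)


# Toy example (hypothetical ids): one relevant chunk, found at rank 2 of 2.
expected = ["node_7"]
retrieved = ["node_3", "node_7"]

scores = {
    "hit_rate": hit_rate(expected, retrieved),
    "mrr": mrr(expected, retrieved),
    "precision": precision(expected, retrieved),
    "recall": recall(expected, retrieved),
}
print(scores)
```

Here the relevant chunk is found, so hit rate and recall are 1.0, but it sits at rank 2 (MRR 0.5) and only one of the two retrieved chunks is relevant (precision 0.5).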

To ease the burden of creating the eval dataset in the first place, we can rely on synthetic data generation.

## Setup

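Here we load the data and parse it into Nodes. The original example uses the Paul Graham essay; this run points the reader at a local document folder instead. We then index the nodes with our simple vector index and get a retriever.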

In [None]:
%pip install llama-index-llms-openai llama-index-llms-ollama llama-index-embeddings-huggingface llama-index-readers-file matplotlib

In [1]:
import nest_asyncio

nest_asyncio.apply()

In [2]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
# from llama_index.llms.openai import OpenAI

Download Data

In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

In [3]:
# documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
documents = SimpleDirectoryReader("/home/daimler/workspaces/agents-course-huggingface/chat-neus-catala/data/documents").load_data()

In [4]:
node_parser = SentenceSplitter(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents, show_progress=True)

Parsing nodes:   0%|          | 0/516 [00:00<?, ?it/s]

In [5]:
# By default, node ids are random UUIDs. To keep the ids stable across runs, we set them manually.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"

In [6]:
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Settings control global defaults
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
Settings.llm = Ollama(model="llama3.2:latest", request_timeout=360.0)
# llm = OpenAI(model="gpt-4")

In [7]:
import torch
torch.cuda.empty_cache()

vector_index = VectorStoreIndex(nodes)
retriever = vector_index.as_retriever(similarity_top_k=2)
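As a rough picture of what `similarity_top_k=2` asks for, the sketch below ranks a few toy vectors by cosine similarity to a query vector and keeps the best two. It is only an illustration, assuming cosine similarity as the scoring function; the helper names and the 2-dimensional vectors are hypothetical and do not reflect how `VectorStoreIndex` is implemented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=2):
    """Return the k (node_id, score) pairs most similar to the query."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical): real embeddings have hundreds of dimensions.
toy_vectors = {
    "node_0": [1.0, 0.0],
    "node_1": [0.6, 0.8],
    "node_2": [0.0, 1.0],
}

results = top_k([1.0, 0.0], toy_vectors, k=2)
for node_id, score in results:
    print(node_id, round(score, 3))
```

A larger `k` can only keep or raise hit rate and recall, but it tends to lower precision when each query has few relevant chunks.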

### Try out Retrieval

We first try the retriever on a single query to check that it returns sensible chunks.

In [8]:
# retrieved_nodes = retriever.retrieve("What did the author do growing up?")
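# Catalan query, roughly: "What did the Ravensbrück prisoners have to go through before joining the forced labor?"
retrieved_nodes = retriever.retrieve("Que havien de passar Les presoneres de Ravensbrück abans d incorporar-se als treballs forçats?")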
retrieved_nodes

[NodeWithScore(node=TextNode(id_='node_178', embedding=None, metadata={'page_label': '87', 'file_name': 'NeusCatala-llibre.pdf', 'file_path': '/home/daimler/workspaces/agents-course-huggingface/chat-neus-catala/data/documents/NeusCatala-llibre.pdf', 'file_type': 'application/pdf', 'file_size': 6741968, 'creation_date': '2025-03-21', 'last_modified_date': '2025-03-13'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='623a3cbe-fd46-4548-ae1b-6f0ba75ede8b', node_type='4', metadata={'page_label': '87', 'file_name': 'NeusCatala-llibre.pdf', 'file_path': '/home/daimler/workspaces/agents-course-huggingface/chat-neus-catala/data/documents/NeusCatala-llibre.pdf', 'file_type': 'application/pdf', 'file_size': 6741968,

In [9]:
from llama_index.core.response.notebook_utils import display_source_node

for node in retrieved_nodes:
    display_source_node(node, source_length=1000)

**Node ID:** node_178<br>**Similarity:** 0.6955495215777152<br>**Text:** Encara es pregunta de
quin país seria la dona que li sal-
và la vida...
Treballs forçats i càstigs
Les presoneres de Ravensbrück
havien de passar una quarante-
na abans d’incorporar-se als tre-
balls forçats. Però el grup de la
Neus no l’acabà. Als 20 dies ja
les portaren al recompte a la pla-
ça central. La sirena les arranca-
va del catre a les 3 de la matina-
da: tocava vestir-se, calçar-se, fer
el llit, prendre el cafè i passar el
primer recompte davant de la
barraca. A l’hivern, l’aigua esta-
va gelada i no es podien ni ren-
tar. Després, desfilaven cap a la
plaça central del camp on, a les
4 en punt, havien de formar en
quadres. Les feien estar dretes i
immòbils durant 5 hores, a 22
graus sota zero. “I esperàvem
aterrides quin seria el primer crim
del dia”, rememora la Neus.
La primera feina que els obli-
garen a fer a les de la barraca
22 fou a pic i pala. Les tragueren
del camp en grups de quaranta.
“Travessàvem Fürstenberg amb una
aixada enorme a l’esquena. Un cop
arribàvem...<br>

**Node ID:** node_326<br>**Similarity:** 0.6838852637493945<br>**Text:** Peregrinatge a Ravensbrück. Maig 2005.
Pàg. 93 - La Neus Català amb la seva estimada “Titi”, Teresa Menot, en un peregrinatge al camp de  Ravensbrück. Arxiu
familiar Neus Català
Pàg. 94 - “Nosaltres, les deportades quan ens trobem sempre riem, perqué sóm les tres del Kommando de les Gandu-
les,  per això aquestes rialles!! Els altres estan mirant perquè riem com a ximples”  Trobada de tres dones del Kommando
Faul (Kommando de les Gandules) d’Holleischen durant el peregrinatge al camp de Ravensbrück. Maig 2005.
Pàg. 97 - - - - - “Les valentes, al camp de Ravensbrück perquè quan plorava alguna la consolàvem, i les gandules a
Holleischen””Qui no plorava de cara, plorava d’amagat, a qui no li podien caure les llàgrimes allà tancades..era nor-
mal...!” La Neus Català i l’Assumpta Montellà, autora del llibre sobre La maternitat d’Elna. Bressol dels exiliats . Xerrada
sobre xarxes d’exili, solidaritat i Resistència. Centre Social de Sants. 10 d’octubre de 2006.  Foto-muntatge de Maria
Pren...<br>

## Build an Evaluation dataset of (query, context) pairs

Here we build a simple evaluation dataset over the existing text corpus.

We use our `generate_question_context_pairs` to generate a set of (question, context) pairs over a given unstructured text corpus. This uses the LLM to auto-generate questions from each context chunk.

We get back an `EmbeddingQAFinetuneDataset` object. At a high level, this contains a set of ids mapping to queries and relevant doc chunks, as well as the corpus itself.
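The shape of that object can be pictured with plain dictionaries. The sketch below is an assumption-laden stand-in, not the library code: `fake_generate_questions` is a hypothetical placeholder for the LLM call, and the result is an ordinary dict that mirrors the `queries`, `corpus`, and `relevant_docs` mappings.

```python
import uuid


def fake_generate_questions(text, num_questions):
    """Hypothetical stand-in for the LLM: returns canned questions."""
    return [f"Question {i + 1} about: {text[:20]}" for i in range(num_questions)]


def build_pairs(corpus, num_questions_per_chunk=2):
    """Build query -> relevant-chunk mappings for every chunk in the corpus."""
    queries = {}
    relevant_docs = {}
    for node_id, text in corpus.items():
        for question in fake_generate_questions(text, num_questions_per_chunk):
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            # Each generated question points back at the chunk it came from.
            relevant_docs[query_id] = [node_id]
    return {"queries": queries, "corpus": corpus, "relevant_docs": relevant_docs}


# Toy corpus (hypothetical text) keyed by node id.
toy_corpus = {
    "node_0": "The prisoners had to pass a quarantine.",
    "node_1": "The first job was done with pick and shovel.",
}

dataset = build_pairs(toy_corpus)
for query_id, question in dataset["queries"].items():
    print(question, "->", dataset["relevant_docs"][query_id])
```

Because every question is tied to exactly one source chunk in this sketch, a retriever scores a hit on a query only if it returns that specific chunk.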

In [10]:
from llama_index.core.evaluation import (
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset,
)
from llama_index.core.llama_dataset.legacy.embedding import (
    DEFAULT_QA_GENERATE_PROMPT_TMPL,
)


In [11]:
# Extend the default prompt with an instruction to skip introductory text and formatting.
qa_prompt = (
    DEFAULT_QA_GENERATE_PROMPT_TMPL
    + "\n Please respond directly to the question without any introductory text or formatting."
)

qa_dataset = generate_question_context_pairs(
    nodes,
    # llm=llm,
    num_questions_per_chunk=2,
    qa_generate_prompt_tmpl=qa_prompt,
)

100%|██████████| 1142/1142 [12:44<00:00,  1.49it/s]


In [12]:
queries = qa_dataset.queries.values()
print(list(queries)[:20])

['What is the 10th number in the given sequence?', 'Is the ISBN a unique identifier?', 'Who is responsible for editing the book and how can you contact them?', "What is the publication's Dipòsit Legal number?", 'What is the name of the author of the book "Memòria i lluita"?', 'Who are the individuals mentioned as having the same last name "Belenguer Mercadé"?', 'What type of assessment will you be conducting - a multiple-choice test or a short-answer examination?', 'What is the purpose of creating a presentation, and what are its components?', 'Is there a specific type of presentation mentioned in the given context information that needs to be tested on the quiz/examination?', 'What is the number of lines in the given context information?', 'Is the number provided 6 being presented as a line count for a poem or a code?', 'What organization is represented by the phrase "Fundació Pere Ardiaca" and who is its president?', 'Who is described as being "la primera a posar-se-hi", indicating t

In [13]:
# [optional] save
qa_dataset.save_json("pg_eval_dataset.json")

In [14]:
# [optional] load
qa_dataset = EmbeddingQAFinetuneDataset.from_json("pg_eval_dataset.json")

## Use `RetrieverEvaluator` for Retrieval Evaluation

We're now ready to run our retrieval evals. We'll run our `RetrieverEvaluator` over the eval dataset that we generated.

We define a helper function, `display_results`, that averages the per-query metrics into a one-row summary table.

In [16]:
include_cohere_rerank = False

if include_cohere_rerank:
    !pip install cohere -q

In [17]:
from llama_index.core.evaluation import RetrieverEvaluator

metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]

if include_cohere_rerank:
    metrics.append(
        "cohere_rerank_relevancy"  # requires COHERE_API_KEY environment variable to be set
    )

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)

In [18]:
# try it out on a sample query
sample_id, sample_query = list(qa_dataset.queries.items())[1]
sample_expected = qa_dataset.relevant_docs[sample_id]

eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)

Query: Is the ISBN a unique identifier?
Metrics: {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.5, 'recall': 1.0, 'ap': 1.0, 'ndcg': 1.0}
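The `ap` and `ndcg` values in this result can be reproduced with a short sketch. It uses the textbook definitions with binary relevance, which may differ in detail from the library's implementation, and it assumes one expected chunk retrieved at rank 1 of 2; the node ids are hypothetical.

```python
import math


def average_precision(expected_ids, retrieved_ids):
    """Mean of precision@rank over the ranks where a relevant id appears."""
    expected = set(expected_ids)
    if not expected:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected:
            hits += 1
            total += hits / rank
    return total / len(expected)


def ndcg(expected_ids, retrieved_ids):
    """DCG with binary relevance, normalized by the best achievable DCG."""
    expected = set(expected_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved_ids, start=1)
        if doc_id in expected
    )
    # Ideal ranking: all relevant ids placed at the top of the list.
    ideal_hits = min(len(expected), len(retrieved_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Relevant chunk at rank 1 (matches the all-1.0 pattern in the sample result).
ap_first = average_precision(["node_a"], ["node_a", "node_b"])
ndcg_first = ndcg(["node_a"], ["node_a", "node_b"])

# Same chunk pushed down to rank 2: both scores drop.
ap_second = average_precision(["node_a"], ["node_b", "node_a"])
ndcg_second = ndcg(["node_a"], ["node_b", "node_a"])

print(ap_first, ndcg_first, ap_second, round(ndcg_second, 4))
```

With a single relevant chunk, AP reduces to the reciprocal rank, which is why `ap` and `mrr` coincide in the outputs shown in this notebook.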



In [19]:
# evaluate every query one at a time and print the per-query metrics
for sample_id, sample_query in qa_dataset.queries.items():
    sample_expected = qa_dataset.relevant_docs[sample_id]

    eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
    print(eval_result)

Query: What is the 10th number in the given sequence?
Metrics: {'hit_rate': 0.0, 'mrr': 0.0, 'precision': 0.0, 'recall': 0.0, 'ap': 0.0, 'ndcg': 0.0}

Query: Is the ISBN a unique identifier?
Metrics: {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.5, 'recall': 1.0, 'ap': 1.0, 'ndcg': 1.0}

Query: Who is responsible for editing the book and how can you contact them?
Metrics: {'hit_rate': 0.0, 'mrr': 0.0, 'precision': 0.0, 'recall': 0.0, 'ap': 0.0, 'ndcg': 0.0}

Query: What is the publication's Dipòsit Legal number?
Metrics: {'hit_rate': 0.0, 'mrr': 0.0, 'precision': 0.0, 'recall': 0.0, 'ap': 0.0, 'ndcg': 0.0}

Query: What is the name of the author of the book "Memòria i lluita"?
Metrics: {'hit_rate': 0.0, 'mrr': 0.0, 'precision': 0.0, 'recall': 0.0, 'ap': 0.0, 'ndcg': 0.0}

Query: Who are the individuals mentioned as having the same last name "Belenguer Mercadé"?
Metrics: {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.5, 'recall': 1.0, 'ap': 1.0, 'ndcg': 1.0}

Query: What type of assessment 

In [20]:
# try it out on an entire dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

In [21]:
import pandas as pd


def display_results(name, eval_results):
    """Average the per-query metrics into a one-row summary DataFrame."""

    # Collect one {metric_name: value} dict per evaluated query.
    metric_dicts = [eval_result.metric_vals_dict for eval_result in eval_results]
    full_df = pd.DataFrame(metric_dicts)

    # `metrics` already contains "cohere_rerank_relevancy" when
    # include_cohere_rerank is True, so every metric is averaged here.
    columns = {
        "retrievers": [name],
        **{k: [full_df[k].mean()] for k in metrics},
    }

    return pd.DataFrame(columns)

In [22]:
display_results("top-2 eval", eval_results)

|   | retrievers | hit_rate | mrr | precision | recall | ap | ndcg |
|---|---|---|---|---|---|---|---|
| 0 | top-2 eval | 0.183691 | 0.159579 | 0.091846 | 0.183691 | 0.159579 | 0.165893 |
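These aggregate numbers are internally consistent if every query has exactly one relevant chunk, which is an assumption here (it matches `recall` equaling `hit_rate` and `ap` equaling `mrr`). Under that assumption, the sketch below re-derives precision from hit rate and splits the hits into rank-1 and rank-2 shares. The values are copied from the summary table; the variable names are illustrative.

```python
# Aggregate values copied from the summary table.
top_k = 2
hit_rate, mrr = 0.183691, 0.159579
precision, recall, ap = 0.091846, 0.183691, 0.159579

# Assumption: one relevant chunk per query and top_k retrieved chunks.
# Then precision is 1/top_k on a hit and 0 otherwise.
precision_from_hits = hit_rate / top_k

# With top_k = 2: hit_rate = s1 + s2 and mrr = s1 + 0.5 * s2,
# where s1 and s2 are the shares of queries hit at rank 1 and rank 2.
share_rank_2 = 2 * (hit_rate - mrr)
share_rank_1 = hit_rate - share_rank_2

print(f"precision from hit rate: {precision_from_hits:.6f}")
print(f"share of queries hit at rank 1: {share_rank_1:.6f}")
print(f"share of queries hit at rank 2: {share_rank_2:.6f}")
```

Read this way, the retriever finds the source chunk in its top two results for roughly 18% of the generated questions, and most of those hits land at rank 1.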
