# Pipeline for evaluating a RAG dataset

In [1]:
from dotenv import load_dotenv
load_dotenv()

import nest_asyncio
nest_asyncio.apply()

## Response Evaluation

Response evaluation measures how good the model's responses are, given the retrieved context and a ground-truth answer. Does the response match the retrieved context? Does it also match the query? Does it match the reference answer or guidelines?

The evaluation modules cover the following criteria:

- Correctness: Whether the generated answer matches that of the reference answer given the query (requires labels).
- Semantic Similarity: Whether the predicted answer is semantically similar to the reference answer (requires labels).
- Faithfulness: Whether the answer is faithful to the retrieved contexts (in other words, whether there is hallucination).
- Context Relevancy: Whether retrieved context is relevant to the query.
- Answer Relevancy: Whether the generated answer is relevant to the query.
- Guideline Adherence: Whether the predicted answer adheres to specific guidelines.
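
To make the Semantic Similarity criterion concrete, here is a minimal sketch that scores a predicted answer against a reference answer by the cosine similarity of their embedding vectors. The three-dimensional vectors and the 0.8 threshold are invented for illustration; a real evaluator would obtain embeddings from an embedding model, and its similarity function and threshold may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings" (real ones have hundreds of dimensions).
reference_embedding = [0.9, 0.1, 0.3]
predicted_embedding = [0.8, 0.2, 0.4]
unrelated_embedding = [-0.1, 0.9, -0.2]

THRESHOLD = 0.8  # assumed pass mark, chosen for this example only

sim_close = cosine_similarity(reference_embedding, predicted_embedding)
sim_far = cosine_similarity(reference_embedding, unrelated_embedding)
print(f"similar answer:   {sim_close:.3f}, passing={sim_close >= THRESHOLD}")
print(f"unrelated answer: {sim_far:.3f}, passing={sim_far >= THRESHOLD}")
```

The closer the score is to 1, the more the two answers point in the same direction in embedding space; the threshold turns that score into a pass/fail decision.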

## Load predefined dataset

Let's start by evaluating an existing dataset. More datasets can be downloaded from: https://llamahub.ai/?tab=llama_datasets

In [2]:
from llama_index.core.llama_dataset import download_llama_dataset

# download the labelled dataset and its source documents
rag_dataset, documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./data/paul_graham"
)

In [3]:
rag_dataset.to_pandas()

Unnamed: 0,query,reference_contexts,reference_answer,reference_answer_by,query_by
0,"In the essay, the author mentions his early ex...",[What I Worked On\n\nFebruary 2021\n\nBefore c...,The first computer the author used for program...,ai (gpt-4),ai (gpt-4)
1,The author switched his major from philosophy ...,[What I Worked On\n\nFebruary 2021\n\nBefore c...,The two specific influences that led the autho...,ai (gpt-4),ai (gpt-4)
2,"In the essay, the author discusses his initial...",[I couldn't have put this into words when I wa...,The two main influences that initially drew th...,ai (gpt-4),ai (gpt-4)
3,The author mentions his shift of interest towa...,[I couldn't have put this into words when I wa...,The author shifted his interest towards Lisp a...,ai (gpt-4),ai (gpt-4)
4,"In the essay, the author mentions his interest...",[So I looked around to see what I could salvag...,"The author in the essay is Paul Graham, who wa...",ai (gpt-4),ai (gpt-4)
5,The author discusses his decision to write a b...,[So I looked around to see what I could salvag...,The author decided to write a book on Lisp hac...,ai (gpt-4),ai (gpt-4)
6,"In the essay, the author mentions a quick deci...","[I didn't want to drop out of grad school, but...",The author decided to attempt writing his diss...,ai (gpt-4),ai (gpt-4)
7,The author describes the atmosphere and practi...,"[I didn't want to drop out of grad school, but...","According to the author's account, the student...",ai (gpt-4),ai (gpt-4)
8,"In the essay, the author discusses his experie...","[We actually had one of those little stoves, f...","In the essay, the author explains that paintin...",ai (gpt-4),ai (gpt-4)
9,The author shares his work experience at a com...,"[We actually had one of those little stoves, f...","Interleaf, the company where the author worked...",ai (gpt-4),ai (gpt-4)


In [15]:
from llama_index.core import VectorStoreIndex

# a basic RAG pipeline, uses service context defaults
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

## Evaluate with RagEvaluatorPack

In [8]:
from llama_index.core.llama_pack import download_llama_pack

RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./pack")

rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine,  # built with the same source Documents as the rag_dataset
    rag_dataset=rag_dataset,
)
benchmark_df = rag_evaluator.run()

2it [00:41, 20.89s/it]
2it [00:21, 10.70s/it]
2it [00:30, 15.46s/it]
2it [00:20, 10.06s/it]
2it [00:32, 16.50s/it]
2it [00:18,  9.04s/it]
2it [00:23, 11.52s/it]
2it [00:24, 12.27s/it]
2it [00:20, 10.41s/it]
2it [00:32, 16.00s/it]
2it [00:30, 15.14s/it]
2it [00:21, 10.67s/it]
2it [00:32, 16.07s/it]
2it [00:29, 14.53s/it]
2it [00:23, 11.88s/it]
2it [00:23, 11.61s/it]
2it [00:30, 15.45s/it]
2it [00:37, 18.73s/it]
2it [00:19,  9.59s/it]
2it [00:41, 20.87s/it]
2it [00:28, 14.04s/it]
2it [00:27, 13.95s/it]


In [9]:
benchmark_df.head()

metrics,base_rag
mean_correctness_score,4.011364
mean_relevancy_score,0.772727
mean_faithfulness_score,1.0
mean_context_similarity_score,0.926472


## Low-level evaluation

In [13]:
# generate prediction dataset
prediction_dataset = await rag_dataset.amake_predictions_with(
    predictor=query_engine, show_progress=True
)

In [27]:
prediction_dataset.to_pandas()[:10]

Unnamed: 0,response,contexts
0,The author's first experience with programming...,[What I Worked On\n\nFebruary 2021\n\nBefore c...
1,The author was influenced to develop an intere...,[All that seemed left for philosophy were edge...
2,The author mentions that two main influences t...,[All that seemed left for philosophy were edge...
3,The author shifted his interest towards Lisp b...,"[Its brokenness did, as so often happens, gene..."
4,The author of the essay tries to reconcile his...,[If he even knew about the strange classes I w...
5,The author decided to write a book on Lisp hac...,"[Its brokenness did, as so often happens, gene..."
6,The author made a quick decision to claim that...,[If he even knew about the strange classes I w...
7,The students and faculty at the Accademia di B...,[The students and faculty in the painting depa...
8,The process of painting a still life differs f...,[The students and faculty in the painting depa...
9,Interleaf had added a scripting language to th...,"[I wanted to go back to RISD, but I was now br..."


In [19]:
import tqdm
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    EvaluationResult,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    SemanticSimilarityEvaluator,
)

judges = {}
judges["correctness"] = CorrectnessEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4-1106-preview")
)

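# keep every result; `correctness_result` still holds the last one after the loop
correctness_results = []
for example, prediction in tqdm.tqdm(
    zip(rag_dataset.examples, prediction_dataset.predictions)
):
    correctness_result = judges["correctness"].evaluate(
        query=example.query,
        response=prediction.response,
        reference=example.reference_answer,
    )
    correctness_results.append(correctness_result)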

44it [04:31,  6.17s/it]


In [20]:
correctness_result

EvaluationResult(query='Paul Graham mentions his experience of leaving YC and no longer working with Jessica. How does he describe this experience and what does it reveal about his personal and professional relationship with Jessica?', contexts=None, response='Paul Graham describes his experience of leaving YC and no longer working with Jessica as the worst thing about leaving YC. This reveals that his personal and professional relationship with Jessica was significant and valued by him. Despite the challenges and stresses associated with his work, the absence of working with Jessica stood out as a particularly negative aspect of his departure from Y Combinator.', passing=True, feedback='The generated answer is relevant and correct in capturing the essence of Paul Graham\'s sentiment about leaving YC and no longer working with Jessica. It correctly identifies that the relationship was significant and valued. However, it does not use the specific metaphor of "pulling up a deeply rooted 
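
A single result is hard to interpret on its own, so per-example results are usually summarized. The sketch below shows one way to do that. `ScoredResult` is a hypothetical stand-in for the evaluator's result objects (assumed here to expose a numeric `score` and a boolean `passing`), and the scores are invented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScoredResult:
    """Hypothetical stand-in for one evaluation result."""

    score: float   # correctness score, assumed to be on a 1-5 scale
    passing: bool  # whether the score cleared the evaluator's threshold


def summarize(results):
    """Return the mean score and the fraction of passing results."""
    if not results:
        raise ValueError("no results to summarize")
    return {
        "mean_score": mean(r.score for r in results),
        "pass_rate": sum(r.passing for r in results) / len(results),
    }


# Invented scores, for illustration only.
results = [
    ScoredResult(4.5, True),
    ScoredResult(3.0, False),
    ScoredResult(5.0, True),
    ScoredResult(4.0, True),
]
summary = summarize(results)
print(summary)
```

Reporting the pass rate next to the mean helps because a few very low scores can hide behind an acceptable average.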

## Generate custom RAG dataset for evaluation from custom documents

This will be useful later, when we work with our own dataset.

In [25]:
# generate questions against chunks
from llama_index.core.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms.openai import OpenAI
from llama_index.core import SimpleDirectoryReader

# LLM used to generate the questions and reference answers
# (passed directly, since ServiceContext is deprecated)
gpt_4_llm = OpenAI(model="gpt-4", temperature=0.3)

# Note: this directory also holds the downloaded rag_dataset.json, so that file
# is read as a document too (visible as JSON in reference_contexts below).
# Pointing the reader at a directory containing only the essay files avoids this.
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

# instantiate a RagDatasetGenerator
dataset_generator = RagDatasetGenerator.from_documents(
    documents,
    llm=gpt_4_llm,
    num_questions_per_chunk=2,  # number of questions to generate per node
    show_progress=True,
)


Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
rag_dataset = dataset_generator.generate_dataset_from_nodes()

In [30]:
rag_dataset.to_pandas()

Unnamed: 0,query,reference_contexts,reference_answer,reference_answer_by,query_by
0,Describe the author's initial experience with ...,"[{\n ""examples"": [\n {\n ...",The author's initial experience with programmi...,ai (gpt-4),ai (gpt-4)
1,How did the author's approach to programming c...,"[{\n ""examples"": [\n {\n ...","With the advent of microcomputers, the author'...",ai (gpt-4),ai (gpt-4)
2,"""Describe the author's initial interest in phi...","[In college I was going to study philosophy, w...",The author initially found philosophy appealin...,ai (gpt-4),ai (gpt-4)
3,"""Discuss the two specific influences that spar...","[In college I was going to study philosophy, w...",The author's interest in AI was sparked by two...,ai (gpt-4),ai (gpt-4)
4,"""What were the two specific influences that le...","[You had to type programs on punch cards, then...",The two specific influences that led the autho...,ai (gpt-4),ai (gpt-4)
...,...,...,...,...,...
137,"""Explain how Paul Graham connects the evolutio...","[Customary VC practice had once, like the cust...",Paul Graham draws a parallel between the evolu...,ai (gpt-4),ai (gpt-4)
138,"In Paul Graham's essay, he uses the analogy of...",[Presumably aliens need numbers and errors and...,Paul Graham uses the Pythagorean theorem and L...,ai (gpt-4),ai (gpt-4)
139,Paul Graham describes his experience of leavin...,[Presumably aliens need numbers and errors and...,"The metaphor of ""pulling up a deeply rooted tr...",ai (gpt-4),ai (gpt-4)
140,"In Paul Graham's metaphor, what does the ""deep...",[We'd been working on YC almost the whole time...,"In Paul Graham's metaphor, the ""deeply rooted ...",ai (gpt-4),ai (gpt-4)


## Evaluate retriever

Measures the capability of the retriever to retrieve the right sources. Are the retrieved sources relevant to the query?

The core retrieval evaluation steps revolve around the following:
- Dataset generation: Given an unstructured text corpus, synthetically generate (question, context) pairs.
- Retrieval Evaluation: Given a retriever and a set of questions, evaluate retrieved results using ranking metrics.

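Metrics used:
- mrr: mean reciprocal rank, the average over all queries of 1/rank of the first relevant result (0 when no relevant result is retrieved).
- hit_rate: the fraction of queries for which a relevant result appears among the top-k retrieved results.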
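
As an illustration of how these two metrics are computed, here is a small self-contained sketch. The query and node ids are toy data, and the helper names are made up for this example; they are not the library's implementation.

```python
def hit_rate(retrieved_ids, expected_ids):
    """1.0 if any expected document appears among the retrieved ones, else 0.0."""
    return 1.0 if any(doc_id in expected_ids for doc_id in retrieved_ids) else 0.0


def reciprocal_rank(retrieved_ids, expected_ids):
    """1 / rank of the first relevant document (ranks start at 1), or 0.0 if none."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected_ids:
            return 1.0 / rank
    return 0.0


# Toy data: query id -> ranked retrieved ids, and query id -> relevant ids.
retrieved = {"q1": ["n1", "n7"], "q2": ["n4", "n2"], "q3": ["n5", "n6"]}
relevant = {"q1": ["n1"], "q2": ["n2"], "q3": ["n9"]}

hits = [hit_rate(retrieved[q], relevant[q]) for q in retrieved]
rrs = [reciprocal_rank(retrieved[q], relevant[q]) for q in retrieved]

mean_hit_rate = sum(hits) / len(hits)  # q1 and q2 hit, q3 misses
mrr = sum(rrs) / len(rrs)              # ranks 1 and 2, then a miss
print(f"hit_rate={mean_hit_rate:.3f}, mrr={mrr:.3f}")
```

Hit rate only asks whether a relevant node was retrieved at all, while MRR also rewards ranking it first, so MRR can never exceed hit rate.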

In [57]:
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline

# create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=25),
    ]
)

# run the pipeline
nodes = pipeline.run(documents=documents)
index_nodes = VectorStoreIndex(nodes)

In [55]:
len(nodes)

38
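
The `chunk_size` and `chunk_overlap` settings control how the documents are cut into nodes. The toy sketch below shows the overlap idea on a list of words. It is only an illustration: the hypothetical `chunk_words` helper counts words and ignores sentence boundaries, whereas the real splitter works on tokens and tries to keep sentences intact.

```python
def chunk_words(words, chunk_size, chunk_overlap):
    """Split words into windows of chunk_size that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = chunk_words(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last word of the previous one, so text that straddles a boundary still appears whole in at least one chunk.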

In [61]:
from llama_index.core.evaluation import (
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset,
)
from llama_index.llms.openai import OpenAI

llm = OpenAI(temperature=0, model="gpt-4")
qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)
qa_dataset.save_json("data/paul_graham/retriever_eval_dataset.json")

100%|██████████| 38/38 [02:49<00:00,  4.47s/it]


In [None]:
# To reuse the saved dataset instead of regenerating it, uncomment:
# qa_dataset = EmbeddingQAFinetuneDataset.from_json("data/paul_graham/retriever_eval_dataset.json")

In [58]:
from llama_index.core.evaluation import RetrieverEvaluator

metrics = ["mrr", "hit_rate"]

retriever = index_nodes.as_retriever(similarity_top_k=2)
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)

In [62]:
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]

eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)

Query: "Describe the author's early experiences with programming on the IBM 1401. What were some of the challenges he faced and how did these experiences shape his understanding of programming?"
Metrics: {'mrr': 1.0, 'hit_rate': 1.0}



In [63]:
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

In [68]:
import pandas as pd


def display_results(name, eval_results):
    """Summarize evaluation results as a one-row DataFrame of mean hit_rate and mrr."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()
    columns = {"retrievers": [name], "hit_rate": [hit_rate], "mrr": [mrr]}

    metric_df = pd.DataFrame(columns)

    return metric_df

In [69]:
display_results("top-2 eval", eval_results)

Unnamed: 0,retrievers,hit_rate,mrr
0,top-2 eval,0.842105,0.769737


In [70]:
from llama_index.core.response.notebook_utils import display_source_node

retrieved_nodes = retriever.retrieve(
    "In the essay, the author mentions his early experiences with programming. "
    "Describe the first computer he used for programming, the language he used, "
    "and the challenges he faced."
)
for node in retrieved_nodes:
    display_source_node(node, source_length=1000)

**Node ID:** db667069-f991-4332-a110-fe38b99b0697<br>**Similarity:** 0.8743050242610546<br>**Text:** What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them...<br>

**Node ID:** 26e8864e-e1e1-4f48-8a20-2561a79c7c3f<br>**Similarity:** 0.8739117193294521<br>**Text:** With microcomputers, everything changed. Now you could have a computer sitting right in front of you, on a desk, that could respond to your keystrokes as it was running instead of just churning through a stack of punch cards and then stopping. [1]

The first of my friends to get a microcomputer built it himself. It was sold as a kit by Heathkit. I remember vividly how impressed and envious I felt watching him sitting in front of it, typing programs right into the computer.

Computers were expensive in those days and it took me years of nagging before I convinced my father to buy one, a TRS-80, in about 1980. The gold standard then was the Apple II, but a TRS-80 was good enough. This was when I really started programming. I wrote simple games, a program to predict how high my model rockets would fly, and a word processor that my father used to write at least one book. There was only room in memory for about 2 pages of text, so he'd write 2 pages at a time and then print them out, but...<br>

## References

- https://docs.llamaindex.ai/en/latest/module_guides/evaluating/evaluating_with_llamadatasets/#using-a-labelledragdataset
- https://www.llamaindex.ai/blog/introducing-llama-datasets-aadb9994ad9e
- https://docs.llamaindex.ai/en/stable/module_guides/evaluating/evaluating_with_llamadatasets/#building-a-labelledragdataset
- https://docs.llamaindex.ai/en/stable/examples/evaluation/retrieval/retriever_eval/