# RAG pipeline evaluation using DeepEval

[DeepEval](https://www.confident-ai.com/) is a framework to evaluate [Retrieval Augmented Generation](https://www.deepset.ai/blog/llms-retrieval-augmentation) (RAG) pipelines.
It supports metrics like context relevance, answer correctness, faithfulness, and more.

For more information about evaluators, supported metrics and usage, check out:

* [DeepEvalEvaluator](https://docs.haystack.deepset.ai/docs/deepevalevaluator)
* [Model based evaluation](https://docs.haystack.deepset.ai/docs/model-based-evaluation)

This notebook shows how to use the [DeepEval-Haystack](https://haystack.deepset.ai/integrations/deepeval) integration to evaluate a RAG pipeline against various metrics.

## Prerequisites:

- [OpenAI](https://openai.com/) key
    - **DeepEval** uses OpenAI models for computing some metrics, so we need an OpenAI key.

In [1]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

Enter OpenAI API key: ········
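When the notebook is re-run in an environment where the key is already exported, prompting again is unnecessary. The following is a minimal sketch of a guard around the prompt; `ensure_env` is a hypothetical helper name introduced here for illustration and is not part of Haystack or DeepEval.

```python
import os
from getpass import getpass


def ensure_env(name, prompt=getpass):
    """Set os.environ[name] from an interactive prompt only if it is missing or empty.

    `ensure_env` is a hypothetical helper, not part of any library.
    """
    if not os.environ.get(name):
        os.environ[name] = prompt(f"Enter {name}:")
    return os.environ[name]
```

Calling `ensure_env("OPENAI_API_KEY")` would then ask for the key only on the first run of a session.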


## Install dependencies

In [None]:
!pip install haystack-ai
!pip install "datasets>=2.6.1"
!pip install deepeval-haystack

## Create a RAG pipeline

We'll first need to create a RAG pipeline. Refer to [this tutorial](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline) for a detailed walkthrough of creating RAG pipelines.

In this notebook, we're using the [SQuAD v2](https://huggingface.co/datasets/rajpurkar/squad_v2) dataset to get the contexts, questions, and ground truth answers.





**Initialize the document store**



In [2]:
from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

dataset = load_dataset("rajpurkar/squad_v2", split="validation")
documents = list(set(dataset["context"]))
docs = [Document(content=doc) for doc in documents]
document_store.write_documents(docs)

1204

In [None]:
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.dataclasses import ChatMessage

retriever = InMemoryBM25Retriever(document_store, top_k=3)

chat_message = ChatMessage.from_user(
    text="""Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""
)
chat_prompt_builder = ChatPromptBuilder(template=[chat_message], required_variables="*")
chat_generator = OpenAIChatGenerator(model="gpt-4o-mini")

**Build the RAG pipeline**

In [None]:
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder

rag_pipeline = Pipeline()
# Add components to your pipeline
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("chat_prompt_builder", chat_prompt_builder)
rag_pipeline.add_component("llm", chat_generator)
rag_pipeline.add_component("answer_builder", AnswerBuilder())

# Now, connect the components to each other
rag_pipeline.connect("retriever", "chat_prompt_builder.documents")
rag_pipeline.connect("chat_prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

<haystack.core.pipeline.pipeline.Pipeline object at 0x16a07b210>
🚅 Components
  - retriever: InMemoryBM25Retriever
  - chat_prompt_builder: ChatPromptBuilder
  - llm: OpenAIChatGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - retriever.documents -> chat_prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - chat_prompt_builder.prompt -> llm.messages (List[ChatMessage])
  - llm.replies -> answer_builder.replies (List[ChatMessage])

**Running the pipeline**

In [5]:
question = "In what country is Normandy located?"

response = rag_pipeline.run(
    {"retriever": {"query": question}, "chat_prompt_builder": {"question": question}, "answer_builder": {"query": question}}
)

In [6]:
print(response["answer_builder"]["answers"][0].data)

Normandy is located in France.


We're done building our RAG pipeline. Let's evaluate it now!

## Get questions, contexts, responses and ground truths for evaluation

For computing most metrics, we will need to provide the following to the evaluator:
1. Questions
2. Generated responses
3. Retrieved contexts
4. Ground truth (Specifically, this is needed for `context precision`, `context recall` and `answer correctness` metrics)

We'll start with three random questions from the dataset (see below) and get the matching `contexts` and `responses` for them.

### Helper function to get context and responses for our questions


In [7]:
def get_contexts_and_responses(questions, pipeline):
    contexts = []
    responses = []
    for question in questions:
        response = pipeline.run(
            {
                "retriever": {"query": question},
                "chat_prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )

        contexts.append([d.content for d in response["answer_builder"]["answers"][0].documents])
        responses.append(response["answer_builder"]["answers"][0].data)
    return contexts, responses

In [8]:
question_map = {
    "Which mountain range influenced the split of the regions?": 0,
    "What is the prize offered for finding a solution to P=NP?": 1,
    "Which Californio is located in the upper part?": 2
}
questions = list(question_map.keys())
contexts, responses = get_contexts_and_responses(questions, rag_pipeline)

### Ground truths, review all fields

Now that we have the questions, contexts, and responses, we'll also get the matching ground truth answers.

In [9]:
ground_truths = [""] * len(question_map)

for question, index in question_map.items():
    idx = dataset["question"].index(question)
    ground_truths[index] = dataset["answers"][idx]["text"][0]
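SQuAD v2 also contains unanswerable questions, whose `answers["text"]` list is empty, so indexing `[0]` as in the cell above would raise an `IndexError` for them. The three questions used here all have answers, but a guarded lookup is safer if the question set changes. This is a minimal sketch; `first_answer` is a hypothetical helper, and the toy records below only imitate the shape of the dataset's `answers` field.

```python
def first_answer(answers, default=""):
    """Return the first reference answer text, or `default` if there is none."""
    texts = answers.get("text", [])
    return texts[0] if texts else default


# Toy records shaped like the SQuAD v2 `answers` field (values are illustrative).
answerable = {"text": ["Tehachapis"], "answer_start": [0]}
unanswerable = {"text": [], "answer_start": []}

print(first_answer(answerable))
print(repr(first_answer(unanswerable)))
```

In the loop above, this would replace `dataset["answers"][idx]["text"][0]` with `first_answer(dataset["answers"][idx])`.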

In [10]:
print("Questions:\n")
print("\n".join(questions))

Questions:

Which mountain range influenced the split of the regions?
What is the prize offered for finding a solution to P=NP?
Which Californio is located in the upper part?


In [11]:
print("Contexts:\n")
for c in contexts:
    print(c[0])

Contexts:

The state is most commonly divided and promoted by its regional tourism groups as consisting of northern, central, and southern California regions. The two AAA Auto Clubs of the state, the California State Automobile Association and the Automobile Club of Southern California, choose to simplify matters by dividing the state along the lines where their jurisdictions for membership apply, as either northern or southern California, in contrast to the three-region point of view. Another influence is the geographical phrase South of the Tehachapis, which would split the southern region off at the crest of that transverse range, but in that definition, the desert portions of north Los Angeles County and eastern Kern and San Bernardino Counties would be included in the southern California region due to their remoteness from the central valley and interior desert landscape.
If a problem X is in C and hard for C, then X is said to be complete for C. This means that X is the hardest p

In [12]:
print("Responses:\n")
print("\n".join(responses))

Responses:

The mountain range that influenced the split of the regions is the Tehachapi Mountains.
The prize offered for finding a solution to P=NP is US$1,000,000.
The provided information does not mention any Californios or any details related to California. A Californio refers to a Hispanic person who was born in California during the period when it was part of Mexico, before becoming part of the United States. Without more context or specific information about Californios, it is not possible to answer the question regarding which Californio is located in the upper part.


In [13]:
print("Ground truths:\n")
print("\n".join(ground_truths))

Ground truths:

Tehachapis
$1,000,000
Monterey


## Evaluate the RAG pipeline





Now that we have the `questions`, `contexts`, `responses`, and `ground_truths`, we can begin our pipeline evaluation and compute all the supported metrics.
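Each of these lists is indexed by question, so they must all have the same length before they are passed to an evaluator. A quick sanity check can catch a misaligned list early. This is a minimal sketch; `check_aligned` is a hypothetical helper and not part of the integration.

```python
def check_aligned(questions, contexts, responses, ground_truths=None):
    """Raise ValueError if the per-question lists differ in length.

    Returns the number of questions when everything lines up.
    """
    lengths = {
        "questions": len(questions),
        "contexts": len(contexts),
        "responses": len(responses),
    }
    # Ground truths are optional because only some metrics need them.
    if ground_truths is not None:
        lengths["ground_truths"] = len(ground_truths)
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Input lists are not aligned: {lengths}")
    return lengths["questions"]
```

With the variables from this notebook, the call would be `check_aligned(questions, contexts, responses, ground_truths)`.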

## Metrics computation

In addition to evaluating the final responses of the LLM, it is important that we also evaluate the individual components of the RAG pipeline, as they can significantly impact the overall performance. Therefore, there are different metrics to evaluate the retriever, the generator, and the overall pipeline. For a full list of available metrics and their expected inputs, check out the [DeepEvalEvaluator Docs](https://docs.haystack.deepset.ai/docs/deepevalevaluator).

The [DeepEval documentation](https://deepeval.com/docs/metrics-introduction) explains the individual metrics and gives a simple example for each of them.

### Contextual Precision

The contextual precision metric measures our RAG pipeline's retriever by evaluating whether items in our contexts that are relevant to the given input are ranked higher than irrelevant ones.

In [14]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_precision_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_PRECISION, metric_params={"model":"gpt-4o-mini"})
context_precision_pipeline.add_component("evaluator", evaluator)

In [15]:
evaluation_results = context_precision_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])



Metrics Summary

  - ✅ Contextual Precision (score: 1.0, threshold: 0.0, strict: False, evaluation model: gpt-4o-mini, reason: The score is 1.00 because the relevant node ranks first, providing a direct connection to the influence of the Tehachapi mountain range on regional divisions. The subsequent nodes, which discuss unrelated topics such as structural geology and climate, are ranked lower, ensuring that the relevant information is prioritized., error: None)

For test case:

  - input: Which mountain range influenced the split of the regions?
  - actual output: The mountain range that influenced the split of the regions is the Tehachapi Mountains.
  - expected output: Tehachapis
  - context: None
  - retrieval context: ['The state is most commonly divided and promoted by its regional tourism groups as consisting of northern, central, and southern California regions. The two AAA Auto Clubs of the state, the California State Automobile Association and the Automobile Club of Southern

[[{'name': 'contextual_precision', 'score': 1.0, 'explanation': 'The score is 1.00 because the relevant node ranks first, providing a direct connection to the influence of the Tehachapi mountain range on regional divisions. The subsequent nodes, which discuss unrelated topics such as structural geology and climate, are ranked lower, ensuring that the relevant information is prioritized.'}], [{'name': 'contextual_precision', 'score': 0.5, 'explanation': 'The score is 0.50 because while the second node provides a clear and direct answer regarding the prize for solving P=NP, the first and third nodes, ranked higher, do not address the question and instead focus on unrelated topics. This results in a mixed ranking where relevant information is overshadowed by irrelevant nodes.'}], [{'name': 'contextual_precision', 'score': 0.0, 'explanation': "The score is 0.00 because all nodes in the retrieval contexts are irrelevant to the input question about Californios. The first node ranks highest b
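To see how ranking affects this score, the sketch below assumes the weighted cumulative precision formula described in the DeepEval documentation: for each relevant node at rank k, take the fraction of relevant nodes seen up to k, then average over the relevant nodes. In DeepEval the relevance verdicts come from the evaluation LLM; here they are hard-coded 0/1 flags, and `contextual_precision` is a hypothetical function written for illustration only.

```python
def contextual_precision(relevance):
    """Weighted cumulative precision over a ranked list of 0/1 relevance flags."""
    relevant_so_far = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            # Precision at rank k, counted only at ranks holding a relevant node.
            total += relevant_so_far / k
    return total / relevant_so_far if relevant_so_far else 0.0


print(contextual_precision([1, 0, 0]))  # 1.0: the only relevant node is ranked first
print(contextual_precision([0, 1, 0]))  # 0.5: the only relevant node is ranked second
print(contextual_precision([0, 0, 0]))  # 0.0: no relevant node retrieved
```

Under that assumption, these three toy rankings reproduce the scores of 1.0, 0.5, and 0.0 reported in the output above.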

### Contextual Recall

Contextual recall measures the extent to which the retrieved contexts align with the `ground truth`.

In [16]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_recall_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RECALL, metric_params={"model":"gpt-4o-mini"})
context_recall_pipeline.add_component("evaluator", evaluator)

In [17]:
evaluation_results = context_recall_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])



Metrics Summary

  - ✅ Contextual Recall (score: 0.0, threshold: 0.0, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.00 because the sentence 'Monterey' does not appear in any part of the node(s) in retrieval context., error: None)

For test case:

  - input: Which Californio is located in the upper part?
  - actual output: The provided information does not mention any Californios or any details related to California. A Californio refers to a Hispanic person who was born in California during the period when it was part of Mexico, before becoming part of the United States. Without more context or specific information about Californios, it is not possible to answer the question regarding which Californio is located in the upper part.
  - expected output: Monterey
  - context: None
  - retrieval context: ['In the centre of Basel, the first major city in the course of the stream, is located the "Rhine knee"; this is a major bend, where the overall direction of the Rh

[[{'name': 'contextual_recall', 'score': 0.0, 'explanation': "The score is 0.00 because the sentence 'Monterey' does not appear in any part of the node(s) in retrieval context."}], [{'name': 'contextual_recall', 'score': 1.0, 'explanation': "The score is 1.00 because the term 'Tehachapis' is directly referenced in the retrieval context, making it perfectly aligned with the expected output."}], [{'name': 'contextual_recall', 'score': 1.0, 'explanation': "The score is 1.00 because the sentence '$1,000,000' is directly supported by the 2nd node in retrieval context, which clearly states 'There is a US$1,000,000 prize for resolving the problem...'. This strong alignment confirms the accuracy of the expected output."}]]


### Contextual Relevancy

The contextual relevancy metric measures the quality of our RAG pipeline's retriever by evaluating the overall relevance of the context for a given question.

In [18]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RELEVANCE, metric_params={"model":"gpt-4o-mini"})
context_relevancy_pipeline.add_component("evaluator", evaluator)

In [19]:
evaluation_results = context_relevancy_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])



Metrics Summary

  - ✅ Contextual Relevancy (score: 0.07692307692307693, threshold: 0.0, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.08 because the retrieval context primarily discusses various unrelated topics, such as jurisdiction and experimental processes, which do not address the influence of a mountain range on regional splits. The only relevant statement mentions 'South of the Tehachapis' but lacks specificity about the mountain range's influence, making it insufficient to connect to the input question., error: None)

For test case:

  - input: Which mountain range influenced the split of the regions?
  - actual output: The mountain range that influenced the split of the regions is the Tehachapi Mountains.
  - expected output: None
  - context: None
  - retrieval context: ['The state is most commonly divided and promoted by its regional tourism groups as consisting of northern, central, and southern California regions. The two AAA Auto Clubs of the sta

[[{'name': 'contextual_relevance', 'score': 0.07692307692307693, 'explanation': "The score is 0.08 because the retrieval context primarily discusses various unrelated topics, such as jurisdiction and experimental processes, which do not address the influence of a mountain range on regional splits. The only relevant statement mentions 'South of the Tehachapis' but lacks specificity about the mountain range's influence, making it insufficient to connect to the input question."}], [{'name': 'contextual_relevance', 'score': 0.14285714285714285, 'explanation': "The score is 0.14 because the majority of the retrieval context focuses on theoretical aspects of complexity classes and implications of the P=NP problem, which do not address the prize. However, the relevant statements mention that 'The P versus NP problem is one of the Millennium Prize Problems' and 'There is a US$1,000,000 prize for resolving the problem,' which directly answer the input question."}], [{'name': 'contextual_relevan
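The low scores above follow from how the metric is defined. Assuming the ratio described in the DeepEval documentation (relevant statements divided by all statements extracted from the retrieval context), long passages with a single useful sentence score poorly even when they contain the answer. The sketch below is illustrative only: `relevancy_ratio` is a hypothetical function, and the verdict list is made up, since the actual statement counts are not shown in the output.

```python
def relevancy_ratio(verdicts):
    """Fraction of statements flagged relevant; 0.0 for an empty list."""
    if not verdicts:
        return 0.0
    return sum(bool(v) for v in verdicts) / len(verdicts)


# Illustrative verdicts: one relevant statement among thirteen extracted ones.
verdicts = [True] + [False] * 12
print(round(relevancy_ratio(verdicts), 4))
```

A ratio such as one in thirteen would be consistent with the 0.0769 score of the first question, which suggests that retrieving shorter or fewer passages is one way to raise this metric.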

### Answer Relevancy

The answer relevancy metric measures the quality of our RAG pipeline's response by evaluating how relevant the response is to the provided question.

In [20]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

answer_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.ANSWER_RELEVANCY, metric_params={"model":"gpt-4o-mini"})
answer_relevancy_pipeline.add_component("evaluator", evaluator)

In [21]:
evaluation_results = answer_relevancy_pipeline.run(
    {"evaluator": {"questions": questions, "responses": responses, "contexts": contexts}}
)
print(evaluation_results["evaluator"]["results"])



Metrics Summary

  - ✅ Answer Relevancy (score: 1.0, threshold: 0.0, strict: False, evaluation model: gpt-4o-mini, reason: The score is 1.00 because the response directly addressed the question about the prize for solving P=NP without including any irrelevant statements., error: None)

For test case:

  - input: What is the prize offered for finding a solution to P=NP?
  - actual output: The prize offered for finding a solution to P=NP is US$1,000,000.
  - expected output: None
  - context: None
  - retrieval context: ['If a problem X is in C and hard for C, then X is said to be complete for C. This means that X is the hardest problem in C. (Since many problems could be equally hard, one might say that X is one of the hardest problems in C.) Thus the class of NP-complete problems contains the most difficult problems in NP, in the sense that they are the ones most likely not to be in P. Because the problem P = NP is not solved, being able to reduce a known NP-complete problem, Π2, to 

[[{'name': 'answer_relevancy', 'score': 1.0, 'explanation': 'The score is 1.00 because the response directly addressed the question about the prize for solving P=NP without including any irrelevant statements.'}], [{'name': 'answer_relevancy', 'score': 1.0, 'explanation': 'The score is 1.00 because the response directly addressed the question about the mountain range influencing the split of the regions without any irrelevant statements.'}], [{'name': 'answer_relevancy', 'score': 0.6, 'explanation': 'The score is 0.60 because the output included irrelevant statements that did not address the question about the location of Californios, indicating a lack of specific information. However, some relevant content was present, which is why the score is not lower.'}]]
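The printed `results` value is a list with one entry per question, each holding a list of dictionaries with `name`, `score`, and `explanation` keys. To compare runs, it can be handy to reduce that structure to one average per metric. This is a minimal sketch; `mean_scores` is a hypothetical helper, and the sample data copies only the scores from the output above.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(results):
    """Average the `score` of each metric across all evaluated questions."""
    by_metric = defaultdict(list)
    for per_question in results:
        for entry in per_question:
            by_metric[entry["name"]].append(entry["score"])
    return {name: mean(scores) for name, scores in by_metric.items()}


# Scores copied from the answer relevancy output above; explanations omitted.
sample = [
    [{"name": "answer_relevancy", "score": 1.0}],
    [{"name": "answer_relevancy", "score": 1.0}],
    [{"name": "answer_relevancy", "score": 0.6}],
]
print(mean_scores(sample))
```

In the notebook, the same helper could be applied directly to `evaluation_results["evaluator"]["results"]`.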


### Faithfulness

The faithfulness metric measures the quality of our RAG pipeline's responses by evaluating whether each response factually aligns with the contents of the context we provided.

In [22]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

faithfulness_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.FAITHFULNESS, metric_params={"model": "gpt-4o-mini"})
faithfulness_pipeline.add_component("evaluator", evaluator)

In [None]:
evaluation_results = faithfulness_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

**Our pipeline evaluation using DeepEval is now complete!**

**Haystack Useful Sources**

* [Docs](https://docs.haystack.deepset.ai/docs/intro)
* [Tutorials](https://haystack.deepset.ai/tutorials)
* [Other Cookbooks](https://github.com/deepset-ai/haystack-cookbook)