# RAG pipeline evaluation using DeepEval

[DeepEval](https://www.confident-ai.com/) is a framework to evaluate [Retrieval Augmented Generation](https://www.deepset.ai/blog/llms-retrieval-augmentation) (RAG) pipelines.
It supports metrics such as contextual relevancy, answer relevancy, faithfulness, and more.

For more information about evaluators, supported metrics and usage, check out:

* [DeepEvalEvaluator](https://docs.haystack.deepset.ai/docs/deepevalevaluator)
* [Model based evaluation](https://docs.haystack.deepset.ai/docs/model-based-evaluation)

This notebook shows how to use the [DeepEval-Haystack](https://haystack.deepset.ai/integrations/deepeval) integration to evaluate a RAG pipeline against various metrics.

## Prerequisites

- [OpenAI](https://openai.com/) key
    - **DeepEval** uses OpenAI models for computing some metrics, so we need an OpenAI key.

In [1]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

Enter OpenAI API key: ········


## Install dependencies

In [None]:
!pip install --upgrade haystack-ai
!pip install "datasets>=2.6.1"
!pip install deepeval-haystack
!pip install --upgrade deepeval

Collecting deepeval
  Using cached deepeval-3.2.4-py3-none-any.whl.metadata (17 kB)

## Create a RAG pipeline

We'll first need to create a RAG pipeline. Refer to [this tutorial](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline) for a detailed walkthrough of creating RAG pipelines.

In this notebook, we're using the [SQuAD v2](https://huggingface.co/datasets/rajpurkar/squad_v2) dataset to get the contexts, questions, and ground truth answers.





**Initialize the document store**



In [3]:
from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

dataset = load_dataset("rajpurkar/squad_v2", split="validation")
documents = list(set(dataset["context"]))
docs = [Document(content=doc) for doc in documents]
document_store.write_documents(docs)

1204

In [4]:
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.dataclasses import ChatMessage

retriever = InMemoryBM25Retriever(document_store, top_k=3)

chat_message = ChatMessage.from_user(
    text="""Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""
)
chat_prompt_builder = ChatPromptBuilder(template=[chat_message], required_variables="*")
chat_generator = OpenAIChatGenerator(model="gpt-4o-mini")

**Build the RAG pipeline**

In [5]:
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder

rag_pipeline = Pipeline()
# Add components to your pipeline
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("chat_prompt_builder", chat_prompt_builder)
rag_pipeline.add_component("llm", chat_generator)
rag_pipeline.add_component(name="answer_builder", instance=AnswerBuilder())

# Now, connect the components to each other
rag_pipeline.connect("retriever", "chat_prompt_builder.documents")
rag_pipeline.connect("chat_prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

<haystack.core.pipeline.pipeline.Pipeline object at 0x14dbabc90>
🚅 Components
  - retriever: InMemoryBM25Retriever
  - chat_prompt_builder: ChatPromptBuilder
  - llm: OpenAIChatGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - retriever.documents -> chat_prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - chat_prompt_builder.prompt -> llm.messages (List[ChatMessage])
  - llm.replies -> answer_builder.replies (List[ChatMessage])

**Run the pipeline**

In [6]:
question = "In what country is Normandy located?"

response = rag_pipeline.run(
    {"retriever": {"query": question}, "chat_prompt_builder": {"question": question}, "answer_builder": {"query": question}}
)

In [7]:
print(response["answer_builder"]["answers"][0].data)

Normandy is located in France.


We're done building our RAG pipeline. Let's evaluate it now!

## Get questions, contexts, responses and ground truths for evaluation

To compute most metrics, we need to provide the following to the evaluator:

1. Questions
2. Generated responses
3. Retrieved contexts
4. Ground truths (needed specifically for the `contextual precision` and `contextual recall` metrics)
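
The four inputs listed above are parallel lists: item `i` of each list belongs to the same question. As a minimal sketch (the helper name `check_eval_inputs` is hypothetical and not part of Haystack or DeepEval), a quick check like the following can catch misaligned inputs before any LLM calls are made:

```python
def check_eval_inputs(questions, contexts, responses, ground_truths=None):
    """Raise ValueError if the evaluation inputs are not aligned one-to-one.

    Hypothetical helper for illustration; returns the number of samples.
    """
    n = len(questions)
    named = {"contexts": contexts, "responses": responses}
    if ground_truths is not None:
        named["ground_truths"] = ground_truths

    # Every list must have exactly one entry per question.
    for name, values in named.items():
        if len(values) != n:
            raise ValueError(f"{name} has {len(values)} items, expected {n}")

    # Each question comes with a list of retrieved context strings.
    for i, ctx in enumerate(contexts):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise ValueError(f"contexts[{i}] must be a list of strings")
    return n
```

Calling it with the lists built later in this notebook would simply return the number of questions if everything lines up.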

We'll start with three questions picked from the dataset (see below) and get the matching `contexts` and `responses` for them.

### Helper function to get context and responses for our questions


In [8]:
def get_contexts_and_responses(questions, pipeline):
    contexts = []
    responses = []
    for question in questions:
        response = pipeline.run(
            {
                "retriever": {"query": question},
                "chat_prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )

        contexts.append([d.content for d in response["answer_builder"]["answers"][0].documents])
        responses.append(response["answer_builder"]["answers"][0].data)
    return contexts, responses

In [9]:
question_map = {
    "Which mountain range influenced the split of the regions?": 0,
    "What is the prize offered for finding a solution to P=NP?": 1,
    "Which Californio is located in the upper part?": 2
}
questions = list(question_map.keys())
contexts, responses = get_contexts_and_responses(questions, rag_pipeline)

### Ground truths, review all fields

Now that we have questions, contexts, and responses, we'll also get the matching ground truth answers.

In [10]:
ground_truths = [""] * len(question_map)

for question, index in question_map.items():
    idx = dataset["question"].index(question)
    ground_truths[index] = dataset["answers"][idx]["text"][0]

In [11]:
print("Questions:\n")
print("\n".join(questions))

Questions:

Which mountain range influenced the split of the regions?
What is the prize offered for finding a solution to P=NP?
Which Californio is located in the upper part?


In [12]:
print("Contexts:\n")
for c in contexts:
    print(c[0])

Contexts:

The state is most commonly divided and promoted by its regional tourism groups as consisting of northern, central, and southern California regions. The two AAA Auto Clubs of the state, the California State Automobile Association and the Automobile Club of Southern California, choose to simplify matters by dividing the state along the lines where their jurisdictions for membership apply, as either northern or southern California, in contrast to the three-region point of view. Another influence is the geographical phrase South of the Tehachapis, which would split the southern region off at the crest of that transverse range, but in that definition, the desert portions of north Los Angeles County and eastern Kern and San Bernardino Counties would be included in the southern California region due to their remoteness from the central valley and interior desert landscape.
If a problem X is in C and hard for C, then X is said to be complete for C. This means that X is the hardest p

In [13]:
print("Responses:\n")
print("\n".join(responses))

Responses:

The Tehachapi Mountains influenced the split of the regions, specifically the division between northern and southern California.
The prize offered for finding a solution to the P vs NP problem is US$1,000,000.
The term "Californio" typically refers to the Spanish-speaking inhabitants of California during the period of Mexican rule from 1821 to 1848. However, the provided context does not contain information relevant to Californios or their geographical locations. If you are referring to a specific aspect related to Californios or a particular individual, please provide more context or clarity for a more accurate response.


In [14]:
print("Ground truths:\n")
print("\n".join(ground_truths))

Ground truths:

Tehachapis
$1,000,000
Monterey


## Evaluate the RAG pipeline





Now that we have the `questions`, `contexts`, `responses`, and `ground truths`, we can begin our pipeline evaluation and compute all the supported metrics.

## Metrics computation

In addition to evaluating the final responses of the LLM, it is important to evaluate the individual components of the RAG pipeline, as they can significantly impact overall performance. Therefore, there are different metrics to evaluate the retriever, the generator, and the overall pipeline. For a full list of available metrics and their expected inputs, check out the [DeepEvalEvaluator docs](https://docs.haystack.deepset.ai/docs/deepevalevaluator).

The [DeepEval documentation](https://docs.confident-ai.com/docs/metrics-introduction) explains the individual metrics with simple examples for each of them.
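
To keep track of which inputs each metric is run with in this notebook, a small lookup table can help. This is only a sketch: `METRIC_INPUTS` and `build_evaluator_inputs` are hypothetical names, and the mapping mirrors the cells of this notebook rather than an authoritative list of required inputs, so check the DeepEvalEvaluator docs for the version you have installed.

```python
# Inputs each metric is run with in this notebook (keys of the "evaluator" dict).
# Assumption: this mirrors the notebook cells, not the integration's own spec.
METRIC_INPUTS = {
    "CONTEXTUAL_PRECISION": ("questions", "contexts", "responses", "ground_truths"),
    "CONTEXTUAL_RECALL": ("questions", "contexts", "responses", "ground_truths"),
    "CONTEXTUAL_RELEVANCE": ("questions", "contexts", "responses"),
    "ANSWER_RELEVANCY": ("questions", "contexts", "responses"),
    "FAITHFULNESS": ("questions", "contexts", "responses"),
}


def build_evaluator_inputs(metric_name, available):
    """Select only the inputs that `metric_name` needs from `available`."""
    needed = METRIC_INPUTS[metric_name]
    missing = [key for key in needed if key not in available]
    if missing:
        raise KeyError(f"{metric_name} is missing inputs: {missing}")
    # Shape matches the dict passed to Pipeline.run() in this notebook.
    return {"evaluator": {key: available[key] for key in needed}}
```

The returned dictionary has the same shape as the argument passed to `run()` in the evaluation cells, so only the ground-truth-based metrics receive `ground_truths`.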

### Contextual Precision

The contextual precision metric measures our RAG pipeline's retriever by evaluating whether items in our contexts that are relevant to the given input are ranked higher than irrelevant ones.
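
To build intuition for why ranking matters, here is a minimal sketch of a rank-weighted precision score. It assumes the relevance of each retrieved context is already known (in DeepEval an LLM judge decides this), and the exact formula DeepEval applies may differ from this illustration. The function name is hypothetical.

```python
def weighted_cumulative_precision(relevance_flags):
    """Score a ranked list of contexts.

    relevance_flags[k] is True if the context at rank k+1 is relevant.
    The flags are hand-written stand-ins for an LLM judge's verdicts.
    """
    n_relevant = sum(relevance_flags)
    if n_relevant == 0:
        return 0.0

    total = 0.0
    hits = 0
    for k, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k, counted only at relevant ranks
    return total / n_relevant


print(weighted_cumulative_precision([True, False, False]))  # 1.0
print(weighted_cumulative_precision([False, False, True]))  # about 0.33
```

The same single relevant context scores 1.0 when it is ranked first and roughly 0.33 when it is ranked last, which is the behavior the metric description above refers to.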

In [15]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_precision_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_PRECISION, metric_params={"model":"gpt-4o-mini"})
context_precision_pipeline.add_component("evaluator", evaluator)



ModuleNotFoundError: No module named 'deepeval.evaluate.types'; 'deepeval.evaluate' is not a package

In [None]:
evaluation_results = context_precision_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

### Contextual Recall

Contextual recall measures the extent to which the contexts align with the `ground truth`.
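
As a rough illustration of the idea, the sketch below scores recall as the fraction of ground truth statements that can be found in the retrieved contexts. It is a toy: plain case-insensitive substring matching stands in for the LLM-based attribution that DeepEval performs, and the function name is hypothetical.

```python
def toy_contextual_recall(ground_truth_statements, contexts):
    """Fraction of ground truth statements attributable to the contexts.

    Assumption: substring matching replaces the LLM judge used by DeepEval.
    """
    if not ground_truth_statements:
        return 0.0
    joined = " ".join(contexts).lower()
    attributed = sum(1 for s in ground_truth_statements if s.lower() in joined)
    return attributed / len(ground_truth_statements)


contexts = ["Another influence is the geographical phrase South of the Tehachapis."]
print(toy_contextual_recall(["Tehachapis"], contexts))  # 1.0
print(toy_contextual_recall(["Monterey"], contexts))    # 0.0
```

A ground truth that appears in a retrieved context is counted as recalled, while one that appears nowhere in the contexts contributes nothing to the score.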

In [29]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_recall_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RECALL, metric_params={"model":"gpt-4"})
context_recall_pipeline.add_component("evaluator", evaluator)

In [30]:
evaluation_results = context_recall_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

[[{'name': 'contextual_recall', 'score': 1.0, 'explanation': "The score is 1.00 because every aspect of the expected output was successfully found in the retrieval context. Great work matching 'Tehachapis' to the information provided in the 1st node!"}], [{'name': 'contextual_recall', 'score': 1.0, 'explanation': "The score is 1.00 because the expected output, '$1,000,000', is precisely matched in the 2nd node in retrieval context which makes it perfectly accurate. There's no element missing or unaccounted for, hence the perfect score. Well done!"}], [{'name': 'contextual_recall', 'score': 0.0, 'explanation': "The score is 0.00 because the sentence 'Monterey' in the expected output could not be attributed to any node in the retrieval context."}]]
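
The results come back as one list per question, each holding dictionaries with `name`, `score`, and `explanation` keys. A small helper (hypothetical, not part of the integration) can average the scores per metric. The sample below reuses the scores printed above, with the explanations shortened to `"..."` for brevity.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(results):
    """Average the score of each metric across all questions."""
    scores = defaultdict(list)
    for per_question in results:
        for entry in per_question:
            scores[entry["name"]].append(entry["score"])
    return {name: mean(values) for name, values in scores.items()}


# Same structure as evaluation_results["evaluator"]["results"];
# explanations are elided here.
sample = [
    [{"name": "contextual_recall", "score": 1.0, "explanation": "..."}],
    [{"name": "contextual_recall", "score": 1.0, "explanation": "..."}],
    [{"name": "contextual_recall", "score": 0.0, "explanation": "..."}],
]
print(mean_scores(sample))  # contextual_recall averages to about 0.67
```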


### Contextual Relevancy

The contextual relevancy metric measures the quality of our RAG pipeline's retriever by evaluating the overall relevance of the context for a given question.

In [31]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RELEVANCE, metric_params={"model":"gpt-4"})
context_relevancy_pipeline.add_component("evaluator", evaluator)

In [32]:
evaluation_results = context_relevancy_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

[[{'name': 'contextual_relevance', 'score': 0.5, 'explanation': "The score is 0.50 because the irrelevant sentences mainly discuss temperature data in Victoria and regional divisions in California, instead of addressing the influence of a mountain range on the splitting of regions. Although these sentences pertain to geographical features and characteristics, they do not directly respond to the specific query regarding mountain ranges' impact on regional splits."}], [{'name': 'contextual_relevance', 'score': 0.46153846153846156, 'explanation': 'The score is 0.46 because the irrelevant sentences, drawn from nodes 2 and 3 of the retrieval context, delve into discussions and complex examples of computational problem-solving, without directly addressing the specific query of the prize offered for a solution to P=NP.'}], [{'name': 'contextual_relevance', 'score': 0.0, 'explanation': 'The score is 0.00 because the input question is asking about a Californio in a specific location, however, a

### Answer Relevancy

The answer relevancy metric measures the quality of our RAG pipeline's response by evaluating how relevant the response is to the provided question.

In [33]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

answer_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.ANSWER_RELEVANCY, metric_params={"model":"gpt-4"})
answer_relevancy_pipeline.add_component("evaluator", evaluator)

In [34]:
evaluation_results = answer_relevancy_pipeline.run(
    {"evaluator": {"questions": questions, "responses": responses, "contexts": contexts}}
)
print(evaluation_results["evaluator"]["results"])

KeyboardInterrupt: 

#### Note

When this notebook was created, version 0.20.57 of [deepeval](https://github.com/confident-ai/deepeval/tree/v0.20.57) required contexts to calculate Answer Relevancy. Future versions will no longer require the context field. Specifically, the upcoming release of deepeval-haystack will no longer treat the context field as mandatory.

### Faithfulness

The faithfulness metric measures the quality of our RAG pipeline's responses by evaluating whether each response factually aligns with the contents of the context we provided.

In [None]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

faithfulness_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.FAITHFULNESS, metric_params={"model": "gpt-4"})
faithfulness_pipeline.add_component("evaluator", evaluator)

In [None]:
evaluation_results = faithfulness_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

**Our pipeline evaluation using DeepEval is now complete!**

**Haystack Useful Sources**

* [Docs](https://docs.haystack.deepset.ai/docs/intro)
* [Tutorials](https://haystack.deepset.ai/tutorials)
* [Other Cookbooks](https://github.com/deepset-ai/haystack-cookbook)