# Tutorial: Evaluating RAG Pipelines

- **Level**: Intermediate
- **Time to complete**: 15 minutes
- **Components Used**: `InMemoryDocumentStore`, `InMemoryEmbeddingRetriever`, `PromptBuilder`, `OpenAIGenerator`, `DocumentMRREvaluator`, `FaithfulnessEvaluator`, `SASEvaluator`
- **Prerequisites**: You need an API key from an active OpenAI account, because this tutorial uses OpenAI's gpt-3.5-turbo model: https://platform.openai.com/api-keys
- **Goal**: After completing this tutorial, you'll have learned how to evaluate your RAG pipelines using some of the model-based evaluation frameworks integrated into Haystack.

> This tutorial uses Haystack 2.0. To learn more, read the [Haystack 2.0 announcement](https://haystack.deepset.ai/blog/haystack-2-release) or visit the [Haystack 2.0 Documentation](https://docs.haystack.deepset.ai/docs/intro).

## Overview

In this tutorial, you will learn how to evaluate Haystack pipelines, in particular Retrieval-Augmented Generation ([RAG](https://www.deepset.ai/blog/llms-retrieval-augmentation)) pipelines.
1. You will first build a pipeline that answers medical questions based on PubMed data.
2. You will build an evaluation pipeline that uses metrics such as Document MRR and Answer Faithfulness.
3. You will run your RAG pipeline and evaluate the output with your evaluation pipeline.

Haystack provides a wide range of [`Evaluators`](https://docs.haystack.deepset.ai/v2.1-unstable/docs/evaluators) which can perform 2 types of evaluations:
- [Model-Based evaluation](https://docs.haystack.deepset.ai/v2.1-unstable/docs/model-based-evaluation)
- [Statistical evaluation](https://docs.haystack.deepset.ai/v2.1-unstable/docs/statistical-evaluation)

We will use some of these evaluation techniques in this tutorial to evaluate a RAG pipeline that is designed to answer questions on PubMed data.

>🧑‍🍳 As well as Haystack's own evaluation metrics, you can also integrate with a number of evaluation frameworks. See the integrations and examples below 👇
>
> - [Evaluate with DeepEval](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/rag_eval_deep_eval.ipynb)
>
> - [Evaluate with RAGAS](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/rag_eval_ragas.ipynb)
>
> - [Evaluate with UpTrain](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/rag_eval_uptrain.ipynb)

### Evaluating RAG Pipelines
RAG pipelines ultimately consist of at least 2 steps:
- Retrieval
- Generation

To evaluate a full RAG pipeline, we have to evaluate each of these steps in isolation as well as the pipeline as a whole. While retrieval can in some cases be evaluated with statistical metrics that require labels, doing the same for the generation step is not straightforward. Instead, we often rely on model-based metrics for the generation step, where an LLM acts as the 'evaluator'.


## Preparing the Colab Environment

- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/v2.0/docs/enabling-gpu-acceleration)
- [Set logging level to INFO](https://docs.haystack.deepset.ai/v2.0/docs/setting-the-log-level)

## Installing Haystack

Install Haystack 2.0, [datasets](https://pypi.org/project/datasets/) and sentence-transformers with `pip`:

In [2]:
%%bash

pip install git+https://github.com/deepset-ai/haystack.git@main
pip install "datasets>=2.6.1"
pip install "sentence-transformers>=2.2.0"


### Enabling Telemetry

Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See [Telemetry](https://docs.haystack.deepset.ai/v2.0/docs/enabling-telemetry) for more details.

In [None]:
from haystack.telemetry import tutorial_running

tutorial_running(35)

## Create the RAG Pipeline to Evaluate

To evaluate a RAG pipeline, we need a RAG pipeline to start with. So, we will start by creating a question answering pipeline.

> 💡 For a complete tutorial on creating Retrieval-Augmented Generation pipelines, check out the [Creating Your First QA Pipeline with Retrieval-Augmentation Tutorial](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline).

For this tutorial, we will be using [a labeled PubMed dataset](https://huggingface.co/datasets/vblagoje/PubMedQA_instruction/viewer/default/train?row=0) with questions, contexts and answers. This way, we can use the contexts as Documents, and we also have the required labeled data that we need for some of the evaluation metrics we will be using.

First, let's fetch the prepared dataset and extract `all_documents`, `all_questions` and `all_ground_truth_answers`:

> ℹ️ The dataset is quite large, so we're using only the first 10000 rows in this example. You can increase this number if you want to.


In [40]:
from datasets import load_dataset
from haystack import Document

dataset = load_dataset("vblagoje/PubMedQA_instruction", split="train")
dataset = dataset.select(range(10000))
all_documents = [Document(content=doc["context"]) for doc in dataset]
all_questions = [doc["instruction"] for doc in dataset]
all_ground_truth_answers = [doc["response"] for doc in dataset]

Next, let's build a simple indexing pipeline and write `all_documents` into a DocumentStore. Here, we're using the `InMemoryDocumentStore`.

> `InMemoryDocumentStore` is the simplest DocumentStore to get started with. It requires no external dependencies and it's a good option for smaller projects and debugging. But it doesn't scale up so well to larger Document collections, so it's not a good choice for production systems. To learn more about the different types of external databases that Haystack supports, see [DocumentStore Integrations](https://haystack.deepset.ai/integrations?type=Document+Store).

In [41]:
from typing import List
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

document_store = InMemoryDocumentStore()

document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

indexing = Pipeline()
indexing.add_component(instance=document_embedder, name="document_embedder")
indexing.add_component(instance=document_writer, name="document_writer")

indexing.connect("document_embedder.documents", "document_writer.documents")

indexing.run({"document_embedder": {"documents": all_documents}})


{'document_writer': {'documents_written': 10000}}

Now that we have our data ready, we can create a simple RAG pipeline.

In this example, we'll be using:
- [`InMemoryEmbeddingRetriever`](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever), which retrieves the documents most relevant to the query.
- [`OpenAIGenerator`](https://docs.haystack.deepset.ai/docs/OpenAIGenerator) to generate answers to queries. You can replace `OpenAIGenerator` in your pipeline with another `Generator`. Check out the full list of generators [here](https://docs.haystack.deepset.ai/docs/generators).

In [43]:
import os
from getpass import getpass
from haystack.components.builders import AnswerBuilder, PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

template = """
        You have to answer the following question based on the given context information only.

        Context:
        {% for document in documents %}
            {{ document.content }}
        {% endfor %}

        Question: {{question}}
        Answer:
        """

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    "query_embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
)
rag_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=3))
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=template))
rag_pipeline.add_component("generator", OpenAIGenerator(model="gpt-3.5-turbo"))
rag_pipeline.add_component("answer_builder", AnswerBuilder())

rag_pipeline.connect("query_embedder", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "generator")
rag_pipeline.connect("generator.replies", "answer_builder.replies")
rag_pipeline.connect("generator.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7dee64432500>
🚅 Components
  - query_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - generator: OpenAIGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - query_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - prompt_builder.prompt -> generator.prompt (str)
  - generator.replies -> answer_builder.replies (List[str])
  - generator.meta -> answer_builder.meta (List[Dict[str, Any]])

### Asking a Question

When asking a question, use the `run()` method of the pipeline. Make sure to provide the question to all components that require it as input. In this case these are the `query_embedder`, the `prompt_builder` and the `answer_builder`.

In [44]:
question = "Do high levels of procalcitonin in the early phase after pediatric liver transplantation indicate poor postoperative outcome?"

response = rag_pipeline.run(
    {"query_embedder": {"text": question}, "prompt_builder": {"question": question}, "answer_builder": {"query": question}}
)
print(response["answer_builder"]["answers"][0].data)

Yes, high levels of procalcitonin in the early phase after pediatric liver transplantation indicate poor postoperative outcome as patients with high PCT levels on postoperative day 2 had higher International Normalized Ratio values, suffered more often from primary graft non-function, had longer stays in the pediatric intensive care unit and on mechanical ventilation.
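
Because the same question string has to reach three components, it can help to assemble the input dictionary in one place. The helper below is a hypothetical convenience (`build_rag_inputs` is not part of Haystack); it only reproduces the dictionary shape passed to `run()` above.

```python
from typing import Dict


def build_rag_inputs(question: str) -> Dict[str, Dict[str, str]]:
    """Hypothetical helper: route one question to every component that needs it."""
    return {
        "query_embedder": {"text": question},      # embedded for retrieval
        "prompt_builder": {"question": question},  # fills {{question}} in the template
        "answer_builder": {"query": question},     # attached to the built answer
    }


inputs = build_rag_inputs("Does statin use improve outcomes?")
print(sorted(inputs))
```

With this in place, a call would look like `rag_pipeline.run(build_rag_inputs(question))`, which keeps the three copies of the question from drifting apart.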


## Evaluate the Pipeline

For this tutorial, let's evaluate the pipeline with the following metrics:

- [Document Mean Reciprocal Rank](https://docs.haystack.deepset.ai/v2.1-unstable/docs/documentmrrevaluator): Evaluates retrieved documents using ground truth labels. It checks at what rank ground truth documents appear in the list of retrieved documents.
- [Semantic Answer Similarity](https://docs.haystack.deepset.ai/v2.1-unstable/docs/sasevaluator): Evaluates predicted answers using ground truth labels. It checks the semantic similarity of a predicted answer and the ground truth answer using a fine-tuned language model.
- [Faithfulness](https://docs.haystack.deepset.ai/v2.1-unstable/docs/faithfulnessevaluator): Uses an LLM to evaluate whether a generated answer can be inferred from the provided contexts. Does not require ground truth labels.
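
To make the first metric concrete, here is a minimal plain-Python sketch of reciprocal rank and its mean. It illustrates the definition only and is not Haystack's implementation; the function names are hypothetical, and documents are represented as plain strings rather than `Document` objects.

```python
from typing import List, Sequence


def reciprocal_rank(ground_truth: Sequence[str], retrieved: Sequence[str]) -> float:
    """Return 1/rank of the first retrieved item found in the ground truth, else 0."""
    truth = set(ground_truth)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in truth:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    all_truth: Sequence[Sequence[str]], all_retrieved: Sequence[Sequence[str]]
) -> float:
    """Average the per-query reciprocal ranks."""
    scores: List[float] = [
        reciprocal_rank(t, r) for t, r in zip(all_truth, all_retrieved)
    ]
    return sum(scores) / len(scores)


# Two toy queries: the relevant document is ranked 1st for one and 3rd for the other.
truth = [["doc_a"], ["doc_b"]]
retrieved = [["doc_a", "doc_x", "doc_y"], ["doc_x", "doc_y", "doc_b"]]
print(mean_reciprocal_rank(truth, retrieved))
```

A relevant document at rank 1 contributes 1.0 and one at rank 3 contributes 1/3, so retrievers that place the right document lower in the list are penalized quickly.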


First, let's actually run our RAG pipeline with a set of questions, and make sure we have the ground truth labels (both answers and documents) for these questions. Let's start with 50 random questions and labels 👇

> 📝 **Some Notes:**
>
> 1. For a full list of available metrics, check out the [Haystack Evaluators](https://docs.haystack.deepset.ai/v2.1-unstable/docs/evaluators).
>
> 2. In our dataset, each example question has 1 ground truth document as its label. However, in some scenarios more than 1 ground truth document may be provided as labels. This is why we provide a list of `ground_truth_documents` for each question.

In [45]:
import random

questions, ground_truth_answers, ground_truth_docs = zip(*random.sample(list(zip(all_questions, all_ground_truth_answers, all_documents)), 50))
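
The cell above zips the three lists together before sampling so that every question stays paired with its own answer and document, then unzips the sample with `zip(*...)`. The toy sketch below shows the same idiom on made-up data (all names here are illustrative stand-ins), with an optional `random.seed` call for reproducible samples.

```python
import random

# Toy stand-ins for all_questions, all_ground_truth_answers and all_documents.
toy_questions = ["q0", "q1", "q2", "q3"]
toy_answers = ["a0", "a1", "a2", "a3"]
toy_docs = ["d0", "d1", "d2", "d3"]

random.seed(0)  # optional: makes the random sample reproducible

# Sample whole (question, answer, document) triples, never the lists separately.
triples = list(zip(toy_questions, toy_answers, toy_docs))
sampled = random.sample(triples, 2)

# zip(*...) "unzips" the sampled triples back into three parallel tuples.
qs, ans, docs = zip(*sampled)

# Each position still refers to the same original row.
for q, a, d in zip(qs, ans, docs):
    assert q[1:] == a[1:] == d[1:]
```

Sampling the three lists independently would break this pairing and make the ground truth labels meaningless.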


Next, let's run our pipeline and make sure to track what our pipeline returns as answers, and which documents it retrieves:

In [None]:
rag_answers = []
retrieved_docs = []

for question in list(questions):
  response = rag_pipeline.run({"query_embedder": {"text": question},
                              "prompt_builder": {"question": question},
                              "answer_builder": {"query": question}})
  print(f"Question: {question}")
  print("Answer from pipeline:")
  print(response["answer_builder"]["answers"][0].data)
  print("\n-----------------------------------\n")

  rag_answers.append(response["answer_builder"]["answers"][0].data)
  retrieved_docs.append(response["answer_builder"]["answers"][0].documents)

While each evaluator is a component that can be run individually in Haystack, they can also be added into a pipeline. This way, we can construct an `eval_pipeline` that includes all evaluators for the metrics we want to evaluate our pipeline on.

In [47]:
from haystack.components.evaluators.document_mrr import DocumentMRREvaluator
from haystack.components.evaluators.faithfulness import FaithfulnessEvaluator
from haystack.components.evaluators.sas_evaluator import SASEvaluator

eval_pipeline = Pipeline()
eval_pipeline.add_component("doc_mrr_evaluator", DocumentMRREvaluator())
eval_pipeline.add_component("groundness_evaluator", FaithfulnessEvaluator())
eval_pipeline.add_component("sas_evaluator", SASEvaluator(model="sentence-transformers/all-MiniLM-L6-v2"))

results = eval_pipeline.run({
    "doc_mrr_evaluator": {"ground_truth_documents": list([d] for d in ground_truth_docs) , "retrieved_documents": retrieved_docs},
    "groundness_evaluator": {"questions": list(questions), "contexts": list([d.content] for d in ground_truth_docs), "responses": rag_answers},
    "sas_evaluator": {"predicted_answers": rag_answers, "ground_truth_answers": list(ground_truth_answers)}
})
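
The `sas_evaluator` scores each predicted answer against its ground truth answer by semantic similarity. With embedding models such as the one used here, that similarity is commonly computed as the cosine similarity between the two answer embeddings. The sketch below illustrates that idea with made-up three-dimensional vectors; it is an assumption-laden illustration, not the `SASEvaluator` code, and real embeddings have far more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for a predicted answer, its ground truth, and an unrelated text.
predicted_vec = [0.2, 0.8, 0.1]
truth_vec = [0.25, 0.7, 0.05]
unrelated_vec = [0.9, -0.1, 0.4]

print(cosine_similarity(predicted_vec, truth_vec))      # close to 1
print(cosine_similarity(predicted_vec, unrelated_vec))  # much lower
```

Answers that point in nearly the same direction in embedding space score close to 1, even when their wording differs.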

### Constructing an Evaluation Report

Once we've run our evaluation pipeline, we can also create a full evaluation report. Haystack provides an `EvaluationRunResult` which we can use to display a `score_report` 👇

In [48]:
from haystack.evaluation.eval_run_result import EvaluationRunResult

data = {
    "inputs": {
        "question": list(questions),
        "contexts": list([d.content] for d in ground_truth_docs),
        "answer": list(ground_truth_answers),
        "predicted_answer": rag_answers,
    },
    "results": {
        "Mean Reciprocal Rank": {"individual_scores": results["doc_mrr_evaluator"]["individual_scores"],
                                 "score": results["doc_mrr_evaluator"]["score"]},
        "Semantic Answer Similarity": {"individual_scores": results["sas_evaluator"]["individual_scores"],
                                       "score": results["sas_evaluator"]["score"]},
        "Faithfulness": {"individual_scores": results["groundness_evaluator"]["individual_scores"],
                         "score": results["groundness_evaluator"]["score"]}
        },
}
evaluation_result = EvaluationRunResult(run_name="pubmed_rag_pipeline", inputs=data["inputs"], results=data["results"])
evaluation_result.score_report()

| | score |
|---|---|
| Mean Reciprocal Rank | 0.986667 |
| Semantic Answer Similarity | 0.764463 |
| Faithfulness | 0.98 |
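
The aggregate scores in this report look consistent with a plain average of the 50 per-question scores. As a sanity check, the sketch below assumes that exactly one question has a reciprocal rank of 1/3, exactly one has a faithfulness of 0.0, and the other 49 score 1.0 in each case. That distribution is an assumption made for illustration, not something the report itself states.

```python
# Assumed per-question scores for the 50 sampled questions (illustrative only).
mrr_scores = [1.0] * 49 + [1.0 / 3.0]
faithfulness_scores = [1.0] * 49 + [0.0]

# The aggregate is taken to be the arithmetic mean of the individual scores.
mrr = sum(mrr_scores) / len(mrr_scores)
faithfulness = sum(faithfulness_scores) / len(faithfulness_scores)

print(round(mrr, 6))           # 0.986667
print(round(faithfulness, 2))  # 0.98
```

Under that assumption, a single poorly ranked document or a single unfaithful answer moves the aggregate by at most 0.02 with 50 samples, which is worth keeping in mind when comparing runs.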


#### Extra: Convert the Report into a Pandas DataFrame

In addition, you can display your evaluation results as a pandas DataFrame 👇

In [49]:
results_df = evaluation_result.to_pandas()
results_df

| | question | contexts | answer | predicted_answer | Mean Reciprocal Rank | Semantic Answer Similarity | Faithfulness |
|---|---|---|---|---|---|---|---|
| 0 | Does expression profile-defined classification... | [This study was conducted to gain insight into... | This study has shed light on heterogeneity in ... | Yes, the expression profile-defined classifica... | 1.0 | [[0.8346112]] | 1.0 |
| 1 | Does use of statin during hospitalization impr... | [To examine the relationship between statin us... | In-hospital statin use in ICH patients is asso... | Based on the context information provided, the... | 1.0 | [[0.77155924]] | 1.0 |
| 2 | Does serum of patients with septic shock stimu... | [To describe the concentrations of sTREM-1 in ... | Levels of sTREM-1 correlated with sepsis sever... | Yes, sera of patients with septic shock evoked... | 1.0 | [[0.83037823]] | 1.0 |
| 3 | Do adult mouse subventricular zones stimulate ... | [Patients with glioblastoma multiforme (GBM) h... | Taken together, these data demonstrate the sig... | Yes, adult mouse subventricular zones do stimu... | 1.0 | [[0.6407193]] | 1.0 |
| 4 | Does rapid ventricular pacing produce myocardi... | [Rapid ventricular pacing reduces the incidenc... | Rapid ventricular pacing protects the myocardi... | No, rapid ventricular pacing produces myocardi... | 1.0 | [[0.901664]] | 0.0 |
| 5 | Does blockade of the programmed death-1 pathwa... | [Effective therapeutic interventions for chron... | Analogous to the effects in other chronic lung... | Yes, blockade of the programmed death-1 (PD-1)... | 1.0 | [[0.8659028]] | 1.0 |
| 6 | Is thrombosis confined to the portal vein a co... | [There is a lack of agreement regarding preexi... | A thrombosis confined to the portal vein per s... | Thrombosis confined to the portal vein is not ... | 1.0 | [[0.65946174]] | 1.0 |
| 7 | Is mac-2-binding protein a diagnostic marker f... | [Biliary tract carcinoma is a deadly disease, ... | Biliary Mac-2BP levels, especially when used i... | Yes, Mac-2-binding protein (Mac-2BP) is identi... | 1.0 | [[0.8994812]] | 1.0 |
| 8 | Is hIC2 a novel dosage-dependent regulator of ... | [22q11 deletion syndrome arises from recombina... | Our results demonstrate a novel role for Hic2 ... | Based on the context information provided, it ... | 1.0 | [[0.9038536]] | 1.0 |
| 9 | Does population-based study reveal new risk-st... | [The risk of progression of Barrett's esophagu... | Low-grade dysplasia, abnormal DNA ploidy, and ... | Yes, the population-based study reveals a new ... | 0.333333 | [[0.5243304]] | 1.0 |
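
In the table above, each Semantic Answer Similarity value is wrapped in a nested single-element container such as `[[0.8346112]]`, which makes sorting or averaging the column awkward. The helper below is a hypothetical sketch (`unwrap_score` is not a Haystack or pandas function) that peels such values down to plain floats. It uses plain Python lists and assumes the real values behave the same way; if the column holds NumPy arrays instead, the type check would need adapting.

```python
from typing import Any, List


def unwrap_score(value: Any) -> float:
    """Peel single-element containers like [[0.83]] down to a float."""
    while isinstance(value, (list, tuple)):
        if len(value) != 1:
            raise ValueError(f"expected a single-element container, got {value!r}")
        value = value[0]
    return float(value)


# Three values copied from the Semantic Answer Similarity column above.
sas_column = [[[0.8346112]], [[0.77155924]], [[0.5243304]]]
sas_scores: List[float] = [unwrap_score(v) for v in sas_column]

# Position of the lowest-scoring answer, a natural candidate for manual review.
lowest = min(range(len(sas_scores)), key=sas_scores.__getitem__)
print(sas_scores, lowest)
```

Once the scores are plain floats, the lowest-scoring rows are easy to pull out and inspect by hand.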


## What's next

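🎉 Congratulations! You've learned how to evaluate a RAG pipeline with Haystack's evaluators, combining label-based metrics (Document MRR, Semantic Answer Similarity) with a model-based metric (Faithfulness) that needs no ground truth labels.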

If you liked this tutorial, you may also enjoy:
- [Serializing Haystack Pipelines](https://haystack.deepset.ai/tutorials/29_serializing_pipelines)
- [Creating Your First QA Pipeline with Retrieval-Augmentation](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline)

To stay up to date on the latest Haystack developments, you can [sign up for our newsletter](https://landing.deepset.ai/haystack-community-updates?utm_campaign=developer-relations&utm_source=moel_based_evaluation). Thanks for reading!