# ⚖️ Benchmark, compare and find the best LLM for your RAG pipeline

In this tutorial, we will show you how to compare LLMs and choose the best one for your use case using [Argilla](https://github.com/argilla-io/argilla), [Ragas](https://github.com/explodinggradients/ragas) and [Haystack](https://github.com/deepset-ai/haystack).

> [!NOTE]
>
> We use `haystack` for RAG and `ragas` for evaluation in this tutorial, but these can easily be replaced by other libraries.
> - RAG alternatives to `haystack` include `langchain` or `llama-index`.
> - Evaluation alternatives to `ragas` include `haystack` or `distilabel`.

We will walk you through the steps to:

- Set up a RAG pipeline using Haystack.
- Evaluate the performance of your RAG pipeline using Ragas.
- Compare the performance of your RAG pipeline for different LLMs in Argilla.
- Choose the best LLM for your use case.

This tutorial is based on the [Haystack documentation](https://haystack.deepset.ai/tutorials/35_evaluating_rag_pipelines) and the [Ragas documentation](https://docs.ragas.io/en/stable/howtos/applications/compare_llms.html).

## Getting started

### Deploy the Argilla server

If you have already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla by following [this guide](https://docs.argilla.io/latest/getting_started/quickstart/).


### Set up the environment

To complete this tutorial, you need to install the following libraries via pip.


In [None]:
%pip install -qqq "haystack-ai" \
            "datasets>=2.6.1" \
            "sentence-transformers>=3.0.0" \
            "ragas" \
            "argilla"

## Set up a RAG pipeline using Haystack

We will use the [PubMedQA_instruction](https://huggingface.co/datasets/vblagoje/PubMedQA_instruction) dataset to create a RAG system. This dataset contains documents, questions and answers from PubMed. We will use the `context` field as the document, the `instruction` field as the question and the `response` field as the ground truth answer. 

### Load a dataset

In [None]:
from datasets import load_dataset
from haystack import Document

dataset = load_dataset("vblagoje/PubMedQA_instruction", split="train")
dataset = dataset.select(range(1000))
all_documents = [Document(content=doc["context"]) for doc in dataset]
all_questions = [doc["instruction"] for doc in dataset]
all_ground_truth_answers = [doc["response"] for doc in dataset]

### Index documents 

We will use the `SentenceTransformersDocumentEmbedder` to create embeddings for the documents. The `DocumentWriter` will write the documents to the document store. The `InMemoryDocumentStore` will store the documents in memory.

In [None]:
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

document_store = InMemoryDocumentStore()

document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

indexing = Pipeline()
indexing.add_component(instance=document_embedder, name="document_embedder")
indexing.add_component(instance=document_writer, name="document_writer")

indexing.connect("document_embedder.documents", "document_writer.documents")

indexing.run({"document_embedder": {"documents": all_documents}})

Now that our data is embedded and indexed for semantic search, we can create a simple RAG pipeline. The `InMemoryDocumentStore` can also be saved to and loaded from disk using the `save_to_disk` and `load_from_disk` methods, which is useful for demos and baselines.

For production use cases, you might want to use a more robust document store, such as Elasticsearch, Weaviate or LanceDB.
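To build intuition for what an embedding retriever does with such an index, here is a minimal pure-Python sketch of similarity search. The three-dimensional toy vectors, the document IDs and the `cosine_similarity` and `top_k` helpers are all illustrative assumptions. Real embeddings have hundreds of dimensions, and the document store's actual scoring function may differ from the cosine similarity used here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the IDs of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy index: document ID -> embedding
toy_index = {
    "doc_liver": [0.9, 0.1, 0.0],
    "doc_prostate": [0.1, 0.9, 0.0],
    "doc_heart": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
ranked = top_k(query, toy_index, k=2)
print(ranked)
```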

### Create a RAG pipeline

Now that the documents are indexed, we can build the RAG pipeline, choose two LLMs and evaluate their performance. We will use the [Free Serverless Inference API](https://huggingface.co/docs/api-inference/en/index) from Hugging Face to compare two models: `microsoft/Phi-3.5-mini-instruct` and `meta-llama/Llama-3.1-8B-Instruct`.

> [!NOTE]
> The models that are available for free through the Hugging Face API are subject to change. You can find the latest free models [here](https://huggingface.co/models?inference=warm&pipeline_tag=text-generation&sort=trending).

In [10]:
import os
from getpass import getpass
from haystack.components.builders import AnswerBuilder, PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

if "HF_TOKEN" not in os.environ:
    os.environ["HF_TOKEN"] = getpass("Enter HuggingFace API key:")

template = """
        You have to answer the following question based on the given context information only.

        Context:
        {% for document in documents %}
            {{ document.content }}
        {% endfor %}

        Question: {{question}}
        Answer:
        """

def get_rag_pipeline(model):
    query_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
    retriever = InMemoryEmbeddingRetriever(document_store, top_k=3)
    prompt_builder = PromptBuilder(template=template)
    answer_builder = AnswerBuilder()
    generator = HuggingFaceAPIGenerator(api_type="serverless_inference_api", api_params={"model": model})
    rag_pipeline = Pipeline()
    rag_pipeline.add_component("query_embedder", query_embedder)
    rag_pipeline.add_component("retriever", retriever)
    rag_pipeline.add_component("prompt_builder", prompt_builder)
    rag_pipeline.add_component("generator", generator)
    rag_pipeline.add_component("answer_builder", answer_builder)

    rag_pipeline.connect("query_embedder", "retriever.query_embedding")
    rag_pipeline.connect("retriever", "prompt_builder.documents")
    rag_pipeline.connect("prompt_builder", "generator")
    rag_pipeline.connect("generator.replies", "answer_builder.replies")
    rag_pipeline.connect("generator.meta", "answer_builder.meta")
    rag_pipeline.connect("retriever", "answer_builder.documents")
    return rag_pipeline 

rag_pipeline_llama = get_rag_pipeline("meta-llama/Llama-3.1-8B-Instruct")
rag_pipeline_phi = get_rag_pipeline("microsoft/Phi-3.5-mini-instruct")

Let's now test the pipelines by asking a question.

In [11]:
question = "Do high levels of procalcitonin in the early phase after pediatric liver transplantation indicate poor postoperative outcome?"

response_llama = rag_pipeline_llama.run(
    {
        "query_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
response_phi = rag_pipeline_phi.run(
    {
        "query_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
response_llama["answer_builder"]["answers"][0].data, response_phi["answer_builder"]["answers"][0].data

Batches: 100%|██████████| 1/1 [00:00<00:00, 17.16it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 15.81it/s]


(' Yes. According to the text, patients with high PCT levels on postoperative day (POD) 2 had higher International Normalized Ratio values on POD 5 and suffered more often from primary graft non-function. They also had a longer stay in the pediatric intensive care unit and on mechanical ventilation. There was no correlation between PCT elevation and systemic infection. However, PCT levels were correlated with peak serum lactate levels immediately after graft reperfusion and elevation of serum aminotransferases on POD 1. Therefore, high levels of procalcitonin in the early phase after pediatric liver transplantation indicate poor postoperative outcome.       \n        \n        The best answer is Yes.',
 '\n\n        Answer: Yes, high levels of procalcitonin in the early phase after pediatric liver transplantation are associated with poor postoperative outcomes, including higher International Normalized Ratio values, primary graft non-function, longer stay in the pediatric intensive car
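Every pipeline run in this tutorial passes the same question to three components. As a small convenience, the input dictionary could be built by a helper. The `build_pipeline_inputs` name is hypothetical and not part of Haystack; the keys mirror the component names used in `get_rag_pipeline` above.

```python
def build_pipeline_inputs(question: str) -> dict:
    """Build the run() inputs shared by the embedder, prompt builder and answer builder."""
    return {
        "query_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }


# Made-up example question, only to show the resulting structure
inputs = build_pipeline_inputs("Does aspirin reduce fever?")
print(inputs)
```

With this helper, a call would look like `rag_pipeline_llama.run(build_pipeline_inputs(question))`.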

## Evaluate the performance of the RAG pipeline

### Generate responses using the different LLMs

For each LLM, we will generate a response to 25 questions and we will store the answers and the retrieved documents. Let's first sample 25 questions from the dataset.

In [12]:
import random

questions, ground_truth_answers, ground_truth_docs = zip(
    *random.sample(list(zip(all_questions, all_ground_truth_answers, all_documents)), 25)
)
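Because `random.sample` draws a different subset on every run, the scores are not directly comparable between runs. If reproducibility matters, one option is to sample with a seeded generator. The sketch below uses toy lists and a hypothetical `sample_aligned` helper; the seed value is arbitrary.

```python
import random


def sample_aligned(seed, k, *columns):
    """Sample k rows from parallel lists, keeping the lists aligned."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    rows = rng.sample(list(zip(*columns)), k)
    return tuple(zip(*rows))


# Toy stand-ins for all_questions and all_ground_truth_answers
toy_questions = [f"q{i}" for i in range(10)]
toy_answers = [f"a{i}" for i in range(10)]

qs_1, as_1 = sample_aligned(42, 3, toy_questions, toy_answers)
qs_2, as_2 = sample_aligned(42, 3, toy_questions, toy_answers)
print(qs_1 == qs_2)  # the same seed gives the same sample
```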

Now we can generate the responses for each of the RAG pipelines.


In [None]:
def generate_responses(rag_pipeline):
    rag_answers = []
    retrieved_docs = []


    for question in questions:
        response = rag_pipeline.run(
            {
                "query_embedder": {"text": question},
                "prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )

        rag_answers.append(response["answer_builder"]["answers"][0].data)
        retrieved_docs.append(response["answer_builder"]["answers"][0].documents)
    return rag_answers, retrieved_docs

rag_answers_llama, retrieved_docs_llama = generate_responses(rag_pipeline_llama)
rag_answers_phi, retrieved_docs_phi = generate_responses(rag_pipeline_phi)

### Use LLMs as evaluators with Ragas

For our LLM evaluation we will use a selection of [the built-in Ragas evaluation metrics](https://docs.ragas.io/en/stable/getstarted/evaluation.html). These metrics are forwarded to an LLM, which acts as a judge to evaluate the performance of the RAG pipeline. We will be using the following metrics:

- Faithfulness - Measures the factual consistency of the answer to the context based on the question.
- Answer_relevancy - Measures how relevant the answer is to the question.
- Answer_correctness - Measures how correct the answer is.

The evaluation metrics are computed on a Hugging Face `Dataset` with the columns `question`, `reference`, `answer` and `retrieved_contexts`. The required columns depend on the selected evaluation metrics; when the dataset is misconfigured, Ragas will throw an error stating that a required column is missing. Let's format our data.

In [33]:
from datasets import Dataset

dataset_llama = Dataset.from_dict({
    "question": questions,
    "reference": ground_truth_answers,
    "answer": rag_answers_llama,
    "retrieved_contexts": [[doc.content for doc in retrieved_docs] for retrieved_docs in retrieved_docs_llama],
})

dataset_phi = Dataset.from_dict({
    "question": questions,
    "reference": ground_truth_answers,
    "answer": rag_answers_phi,
    "retrieved_contexts": [[doc.content for doc in retrieved_docs] for retrieved_docs in retrieved_docs_phi],
})
dataset_llama[0]

{'question': 'Is african-american race a predictor of seminal vesicle invasion after radical prostatectomy?',
 'reference': 'AA men have an increased risk of SVI after RP, particularly among men with Gleason ≤ 6 disease. This might represent racial differences in the biology of PCa disease progression, which contribute to poorer outcomes in AA men.',
 'answer': " - Yes, according to the study, after adjusting for known predictors of adverse pathologic features, AA race remained a predictor of SVI. \n         - No, according to the study, there was no significant difference in extraprostatic spread, positive surgical margin, lymph node involvement, or adverse pathologic features across race groups. However, among patients with ≥ 1 adverse pathologic features, AA men had higher rate of seminal vesicle invasion (SVI) compared with CS men. \n         - Maybe, according to the study, after adjusting for known predictors of adverse pathologic features AA race remained a predictor of SVI. How
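Before handing data to the evaluator, it can be useful to check the column layout ourselves. The following sketch is not part of Ragas: `missing_columns` and `check_equal_lengths` are hypothetical helpers, and the required column set is simply the one listed in this tutorial.

```python
REQUIRED_COLUMNS = {"question", "reference", "answer", "retrieved_contexts"}


def missing_columns(data: dict, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from the data."""
    return sorted(set(required) - set(data))


def check_equal_lengths(data: dict) -> bool:
    """True if every column has the same number of rows."""
    return len({len(values) for values in data.values()}) <= 1


# Toy data that is deliberately missing the "reference" column
toy = {"question": ["q1"], "answer": ["a1"], "retrieved_contexts": [["ctx"]]}
print(missing_columns(toy))
print(check_equal_lengths(toy))
```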

Normally, Ragas is run on the whole dataset and only reports an average score for each metric. Instead, we will run it on one sample at a time to get the scores for each metric per question. This allows us to see the actual scores for each of the questions and get a much better understanding of the strengths and weaknesses of the RAG pipeline when evaluating it in Argilla.

In [39]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
)

metrics = [
    faithfulness,
    answer_relevancy,
    answer_correctness,
]

def evaluate_ragas(sample):
    ds = Dataset.from_list([sample])
    sample.update(evaluate(ds, metrics=metrics, show_progress=False))
    return sample
    

result_llama = dataset_llama.map(evaluate_ragas, batched=False)
result_phi = dataset_phi.map(evaluate_ragas, batched=False)
result_llama[0]

{'question': 'Is african-american race a predictor of seminal vesicle invasion after radical prostatectomy?',
 'reference': 'AA men have an increased risk of SVI after RP, particularly among men with Gleason ≤ 6 disease. This might represent racial differences in the biology of PCa disease progression, which contribute to poorer outcomes in AA men.',
 'answer': " - Yes, according to the study, after adjusting for known predictors of adverse pathologic features, AA race remained a predictor of SVI. \n         - No, according to the study, there was no significant difference in extraprostatic spread, positive surgical margin, lymph node involvement, or adverse pathologic features across race groups. However, among patients with ≥ 1 adverse pathologic features, AA men had higher rate of seminal vesicle invasion (SVI) compared with CS men. \n         - Maybe, according to the study, after adjusting for known predictors of adverse pathologic features AA race remained a predictor of SVI. How
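Per-sample scores can still be aggregated afterwards to recover a dataset-level average per metric. Here is a minimal sketch with made-up toy scores; `average_scores` is a hypothetical helper, and it assumes every row carries a numeric (non-NaN) value for each metric.

```python
from statistics import mean

METRICS = ("faithfulness", "answer_relevancy", "answer_correctness")


def average_scores(rows, metrics=METRICS):
    """Average each metric over a list of per-sample score dictionaries."""
    return {metric: mean(row[metric] for row in rows) for metric in metrics}


# Toy per-sample scores (not real results)
toy_rows = [
    {"faithfulness": 1.0, "answer_relevancy": 0.5, "answer_correctness": 0.5},
    {"faithfulness": 0.5, "answer_relevancy": 1.0, "answer_correctness": 0.0},
]
averages = average_scores(toy_rows)
print(averages)
```

Under those assumptions, the same helper could be applied to the rows of `result_llama` and `result_phi` to compare the two models at a glance.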

## Review and correct the feedback in Argilla

We can now review the initial LLM suggestions in Argilla and correct them where needed. To do that, we will first create a new Argilla dataset with the correct settings. We will add fields to store the fixed information: the question, the ground truth answer and the retrieved contexts. We will also add questions to store the LLM suggestions for the answer and the evaluation scores. Lastly, we will add metadata that makes it easier to filter for edge cases later.

### Create a new Argilla dataset

In [None]:
import argilla as rg

client = rg.Argilla(api_url="https://davidberenstein1957-argilla.hf.space/", api_key="12345678")

values = list(range(11))
settings = rg.Settings(
    fields=[
        rg.TextField(name="question", description="The question asked to the RAG pipeline"),
        rg.TextField(name="reference", description="The ground truth answer to the question"),
        rg.TextField(name="retrieved_contexts_llama", description="The retrieved contexts from the RAG pipeline with Llama-3.1-8B-Instruct"),
        rg.TextField(name="retrieved_contexts_phi", description="The retrieved contexts from the RAG pipeline with Phi-3.5-mini-instruct"),
    ], # fields are expected to represent fixed values
    questions=[
        rg.TextQuestion(name="answer_llama", description="The answer to the question from the RAG pipeline with Llama-3.1-8B-Instruct"),
        rg.RatingQuestion(name="faithfulness_llama", description="How faithful is the answer to the question based on the context?", values=values),
        rg.RatingQuestion(name="answer_relevancy_llama", description="How relevant is the answer to the question?", values=values),
        rg.RatingQuestion(name="answer_correctness_llama", description="How correct is the answer to the question?", values=values),
        rg.TextQuestion(name="answer_phi", description="The answer to the question from the RAG pipeline with Phi-3.5-mini-instruct"),
        rg.RatingQuestion(name="faithfulness_phi", description="How faithful is the answer to the question based on the context?", values=values),
        rg.RatingQuestion(name="answer_relevancy_phi", description="How relevant is the answer to the question?", values=values),
        rg.RatingQuestion(name="answer_correctness_phi", description="How correct is the answer to the question?", values=values),
    ], # are expected to represent model suggestions
    metadata=[
        rg.FloatMetadataProperty(name="faithfullness_difference"),
        rg.FloatMetadataProperty(name="answer_relevancy_difference"),
        rg.FloatMetadataProperty(name="answer_correctness_difference"),
    ], # can be used to filter the dataset later
)

dataset = rg.Dataset(
    "haystack-ragas-rag-evaluation",
    client=client,
    settings=settings,
)
dataset = dataset.create()

### Upload the data to Argilla

We can log records to the dataset by passing a `List[Dict]` to `dataset.records.log`. However, the keys of each record need to align with the fields, questions and metadata in the Argilla settings. Therefore, we will first format the data and then log it.

In [None]:
records = []

def format_contexts_as_html_summaries(contexts: list[str]):
    # Wrap each retrieved context in a collapsible <details> element
    context_dict = {f"Retrieved doc {k}": value for k, value in enumerate(contexts)}
    html_content = "<body>"
    for k, v in context_dict.items():
        html_content += f'<details><summary>{k}</summary>\n<p>{v}</p>\n</details>\n'
    html_content += "</body>"
    return html_content

for row_1, row_2 in zip(result_llama, result_phi):
    records.append(
        {
            "question": row_1["question"],
            "reference": row_1["reference"],
            "retrieved_contexts_llama": format_contexts_as_html_summaries(row_1["retrieved_contexts"]),
            "retrieved_contexts_phi": format_contexts_as_html_summaries(row_2["retrieved_contexts"]),
            "answer_llama": row_1["answer"],
            "answer_phi": row_2["answer"],
            "faithfulness_llama": int(row_1["faithfulness"]*10),
            "answer_relevancy_llama": int(row_1["answer_relevancy"]*10),
            "answer_correctness_llama": int(row_1["answer_correctness"]*10),
            "faithfulness_phi": int(row_2["faithfulness"]*10),
            "answer_relevancy_phi": int(row_2["answer_relevancy"]*10),
            "answer_correctness_phi": int(row_2["answer_correctness"]*10),
            "faithfullness_difference": abs(row_1["faithfulness"] - row_2["faithfulness"]),
            "answer_relevancy_difference": abs(row_1["answer_relevancy"] - row_2["answer_relevancy"]),
            "answer_correctness_difference": abs(row_1["answer_correctness"] - row_2["answer_correctness"]),
        }
    )
dataset.records.log(records)
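Note that `int(score * 10)` truncates rather than rounds, and `int()` raises a `ValueError` when given NaN. If a metric score can come back as NaN (an assumption here, for example when the judge's output cannot be parsed), a more defensive conversion may help. The `to_rating` helper below is a hypothetical sketch; how a missing rating should then be logged to Argilla is left open.

```python
import math


def to_rating(score, scale=10):
    """Map a score in [0, 1] to an integer rating in [0, scale], or None if unusable."""
    if score is None or math.isnan(score):
        return None  # no rating available for this sample
    clamped = min(max(score, 0.0), 1.0)  # guard against values slightly outside [0, 1]
    return round(clamped * scale)


print(to_rating(0.99))          # rounds up, whereas int(0.99 * 10) truncates to 9
print(to_rating(float("nan")))  # None instead of a ValueError
```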



We can now have a look at the actual performance of the RAG pipeline for each of the LLMs. Besides powerful search and filtering capabilities, Argilla also allows sorting the records based on the metadata fields we added before. For example, let's sort the records by the `faithfullness_difference` field to get the records with the biggest difference in faithfulness scores between the two LLMs.

![ragas_haystack_evaluation](./images/ragas_haystack_evaluation.png)
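The same ordering can be reproduced locally on the list of record dictionaries, which is handy for a quick look before opening the UI. This is a plain-Python sketch with toy values; `sort_by_difference` is a hypothetical helper, and the key keeps the spelling of the metadata property defined in the settings above.

```python
def sort_by_difference(records, key="faithfullness_difference"):
    """Order records so the largest score gap between the two models comes first."""
    return sorted(records, key=lambda record: record[key], reverse=True)


# Toy records with made-up differences
toy_records = [
    {"question": "q1", "faithfullness_difference": 0.1},
    {"question": "q2", "faithfullness_difference": 0.7},
    {"question": "q3", "faithfullness_difference": 0.4},
]
ordered = [record["question"] for record in sort_by_difference(toy_records)]
print(ordered)
```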