# Evaluating a Haystack RAG pipeline with FlowJudge

## Overview

This tutorial demonstrates how to evaluate a Retrieval-Augmented Generation (RAG) pipeline built with Haystack using `flow-eval`. We'll showcase how to:

1. Set up a basic RAG pipeline using Haystack
2. Integrate `flow-eval` evaluators into a Haystack evaluation pipeline
3. Assess the RAG system's performance using multiple metrics:
   - Semantic Answer Similarity (SAS)
   - Context Relevancy
   - Faithfulness

Key highlights:

- Use of the `HaystackLMEvaluator` class to seamlessly incorporate `flow-eval` evaluators
- Demonstration of custom metric creation for tailored evaluations
- Utilization of both pre-built Haystack evaluators and custom `flow-eval` evaluators

By the end of this tutorial, you'll have a clear understanding of how to comprehensively evaluate your Haystack RAG pipelines using `flow-eval`, enabling you to iteratively improve your system's performance without relying on proprietary large language models.

### Additional requirements

- Haystack: Make sure you have Haystack installed. You can install it via pip:
  ```bash
  pip install haystack-ai
  ```

- Sentence Transformers: Make sure you have Sentence Transformers installed. You can install it via pip:
  ```bash
  pip install "sentence-transformers>=3.0.0"
  ```

- Set your free HuggingFace API token as an environment variable:
    ```python
    import os
    os.environ["HF_TOKEN"] = "your_token_here"
    ```

    You can get your HuggingFace API token [here](https://huggingface.co/settings/tokens).

Note that this notebook primarily demonstrates the integration of `flow-eval` with Haystack for evaluating RAG pipelines. While we do set up a basic RAG pipeline using Haystack, the main emphasis is on the evaluation process using `flow-eval` evaluators.

For detailed explanations on building RAG pipelines with Haystack, please refer to the official [Haystack documentation](https://docs.haystack.deepset.ai/docs/intro).

## Dataset

For this tutorial, we are going to use a subset of the `LegalBench` dataset, which contains contracts and questions about those contracts.

In [1]:
try:
    from datasets import Dataset
except ImportError as e:
    print("datasets is not installed. ")
    print("Please run `pip install datasets` to install it.")
    print("\nAfter installation, restart the kernel and run this cell again.")
    raise SystemExit(f"Stopping execution due to missing datasets dependency: {e}")

In [2]:
from datasets import load_dataset

ds = load_dataset("flowaicom/legalbench_contracts_qa_subset", "default")

In [None]:
ds

This dataset contains:
- Questions: A question about the contract.
- Context: The contract itself.
- Original answer: The original answer to the question which can be considered as the ground truth.
- Answer: The answer used for generating the answer with reasoning, which can include noise with respect to the original answer.
- Answer with reasoning: An answer to the question including the reasoning for the answer based on the contract.

For this tutorial:
- We use instances without perturbations (where original_answer == answer)
- The contract text (context) is used to create documents
- We use `answer_with_reasoning` as the ground truth for evaluators

In [None]:
try:
    from haystack import Document
except ImportError:
    print("Haystack is not installed. ")
    print("Please install it according to the 'Additional Requirements' section above.")
    print("\nAfter installation, restart the kernel and run this cell again.")
    raise SystemExit("Stopping execution due to missing Haystack dependency.")

filtered_ds = ds.filter(lambda x: x['original_answer'] == x['answer'])

all_documents = [Document(content=context) for context in filtered_ds['train']['context']]
all_questions = [q for q in filtered_ds['train']['question']]
all_ground_truths = [a for a in filtered_ds['train']['answer_with_reasoning']]

print(f"Number of documents: {len(all_documents)}")
print(f"Number of questions: {len(all_questions)}")
print(f"Number of ground truths: {len(all_ground_truths)}")

In [None]:
from IPython.display import Markdown, display

display(Markdown(f"**Question:** {all_questions[0]}"))
display(Markdown(f"**Context:** {all_documents[0].content}"))
display(Markdown(f"**Ground truth answer:** {all_ground_truths[0]}"))

## Creating a RAG pipeline with Haystack

We will be creating a very simple RAG pipeline with Haystack.

For more detailed explanations about building the RAG pipeline, please refer to this tutorial in the Haystack documentation: [Tutorial: Evaluating RAG pipelines](https://haystack.deepset.ai/tutorials/35_evaluating_rag_pipelines).

>Note that we have made minor modifications to the pipeline for this tutorial. In particular, we are using `HuggingFaceAPIChatGenerator` and `ChatPromptBuilder`.

### Indexing the documents

We need to index the documents so we can later use a retriever to find the most similar document to the question.

We are using the `InMemoryDocumentStore`, which is a simple in-memory document store that doesn't require setting up a database.

We are also using a small open-source embedding model from Sentence Transformers to convert the documents into embeddings.

Finally, we are using the `DocumentWriter` to write the documents into the document store.
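
To make the later retrieval step concrete, here is a minimal pure-Python sketch of what an embedding retriever does conceptually: score every stored vector against the query vector and keep the top `k`. The toy 3-dimensional vectors and the `top_k_by_dot_product` helper are hypothetical illustrations, not Haystack's implementation, and the dot product is used here only as one common similarity function.

```python
import heapq


def top_k_by_dot_product(query, doc_vectors, k=3):
    """Return (doc_id, score) pairs for the k highest dot-product scores."""
    scored = (
        (doc_id, sum(q * d for q, d in zip(query, vector)))
        for doc_id, vector in doc_vectors.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings standing in for real document vectors (hypothetical values).
toy_store = {
    "contract_a": [0.9, 0.1, 0.0],
    "contract_b": [0.2, 0.8, 0.1],
    "contract_c": [0.0, 0.1, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

ranked = top_k_by_dot_product(toy_query, toy_store, k=2)
print(ranked)
```

In the real pipeline, the vectors come from the embedding model and the ranking is handled by the document store and retriever.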

In [None]:
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

document_store = InMemoryDocumentStore()

document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

indexing = Pipeline()
indexing.add_component(instance=document_embedder, name="document_embedder")
indexing.add_component(instance=document_writer, name="document_writer")

indexing.connect("document_embedder.documents", "document_writer.documents")

indexing.run({"document_embedder": {"documents": all_documents}})

### Create the RAG pipeline

Haystack lets us easily create a RAG pipeline using:

- `InMemoryEmbeddingRetriever`, which retrieves the documents most relevant to the query.
- `HuggingFaceAPIChatGenerator` to generate the answer to the question. We are going to use a small open model for this example.

>Note you can use the free serverless inference API from HuggingFace to quickly experiment with different models. However, it's rate-limited and not suitable for production. To make use of the API, you just need to provide [your free HuggingFace API token](https://huggingface.co/settings/tokens).



In [None]:
from haystack.components.builders import AnswerBuilder, ChatPromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.utils import Secret
from haystack.utils.hf import HFGenerationAPIType
from haystack.dataclasses import ChatMessage

api_type = HFGenerationAPIType.SERVERLESS_INFERENCE_API
llm = HuggingFaceAPIChatGenerator(
    api_type=api_type,
    api_params={"model": "microsoft/Phi-3.5-mini-instruct"},
    token=Secret.from_env_var("HF_TOKEN"),
)


template_str = """
You have to answer the following question based on the given context information only.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""
template = [ChatMessage.from_user(template_str)]
prompt_builder = ChatPromptBuilder(template=template)

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    "query_embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
)
rag_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=3))
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", llm)
rag_pipeline.add_component("answer_builder", AnswerBuilder())

rag_pipeline.connect("query_embedder", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

Let's test the pipeline with a single question.

In [None]:
# Quick test of the pipeline
question = "Does CNN permit using bots to artificially increase page visits for certain content?"

response = rag_pipeline.run(
    {
        "query_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
print(response["answer_builder"]["answers"][0].data)


In [None]:
# display the retrieved documents and similarity scores
for i, doc in enumerate(response['answer_builder']['answers'][0].documents, 1):
    display(Markdown(f"""**Document {i} (Score: {doc.score:.4f}):**\n\n{doc.content[:500]}..."""))

## Evaluating the pipeline

With our initial RAG pipeline prototype in place, we can now focus on evaluation.

To showcase the integration of `FlowJudge` within the Haystack framework, we'll evaluate the pipeline using both statistical and model-based evaluators.

Haystack employs the concept of an __Evaluation pipeline__, which computes scoring metrics to assess the RAG pipeline's performance.

Our evaluation pipeline will incorporate three key metrics:
- __Semantic Answer Similarity (SAS)__: Measures the semantic similarity between generated and ground truth answers, going beyond simple lexical matching.
- __Context Relevancy__: Determines how well the retrieved documents align with the given query.
- __Faithfulness__: Assesses the extent to which the generated answer is grounded in the retrieved documents.
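
As a rough intuition for SAS with an embedding model, each answer is turned into a vector and the pair is scored by how closely the two vectors point in the same direction. The sketch below uses hand-written toy vectors instead of a real embedding model; `cosine_similarity` and the vectors are illustrative assumptions, not the code of Haystack's `SASEvaluator`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


# Toy "embeddings" of a predicted answer and two candidate ground truths.
predicted = [1.0, 2.0, 0.0]
close_truth = [2.0, 4.0, 0.0]      # same direction as `predicted`
unrelated_truth = [0.0, 0.0, 3.0]  # orthogonal to `predicted`

print(cosine_similarity(predicted, close_truth))
print(cosine_similarity(predicted, unrelated_truth))
```

A score near 1 indicates answers with similar meaning under the embedding model, while a score near 0 indicates unrelated answers.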

For context relevancy and faithfulness, we'll leverage `FlowJudge` evaluators, eliminating the need for proprietary large models like GPT-4 or Claude 3.5 Sonnet.

### Obtaining generated answers

Our first step is to generate answers using the RAG pipeline.

>Note: We're using HuggingFace's free serverless inference API, so this step may take several minutes. To avoid rate limits, we're processing only 20 questions. If execution fails, rerun the cell; note that the loop starts again from the first question because the result lists are reset at the top of the cell.

In [None]:
questions = all_questions[:20]
ground_truths = all_ground_truths[:20]

rag_answers = []
retrieved_docs = []

for question in questions:
    response = rag_pipeline.run(
        {
            "query_embedder": {"text": question},
            "prompt_builder": {"question": question},
            "answer_builder": {"query": question},
        }
    )
    print(f"Question: {question}")
    print("Answer from pipeline:")
    print(response["answer_builder"]["answers"][0].data)
    print("\n-----------------------------------\n")

    rag_answers.append(response["answer_builder"]["answers"][0].data)
    retrieved_docs.append(response["answer_builder"]["answers"][0].documents)
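
Because the loop above starts from empty lists every time, one way to make reruns cheaper is to cache each answer under its question and skip questions that already have one. The sketch below is a generic pattern, not part of Haystack: `answer_fn` is a hypothetical stand-in for the `rag_pipeline.run(...)` call, and the retry count and delay are arbitrary assumptions.

```python
import time


def answer_with_cache(questions, answer_fn, cache, max_retries=3, delay_s=0.0):
    """Fill `cache` (question -> answer), skipping questions already answered."""
    for question in questions:
        if question in cache:
            continue  # answered in a previous (possibly interrupted) run
        for attempt in range(1, max_retries + 1):
            try:
                cache[question] = answer_fn(question)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up; cached answers survive for the next run
                time.sleep(delay_s * attempt)  # simple linear backoff
    return [cache[q] for q in questions]


# Demo with a fake, flaky answer function (hypothetical, for illustration only).
calls = {"count": 0}


def flaky_answer(question):
    calls["count"] += 1
    if calls["count"] == 2:
        raise RuntimeError("simulated rate limit")
    return f"answer to: {question}"


cache = {}
answers = answer_with_cache(["q1", "q2", "q3"], flaky_answer, cache)
print(answers)
```

Keeping `cache` in its own notebook cell means that rerunning the loop cell only requests answers for the questions that are still missing.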

We now convert the retrieved documents into a single string so `FlowJudge` can format the prompt properly under the hood.

In [11]:
# Concatenate the retrieved documents into a single string, one document per line
str_retrieved_docs = []
for docs in retrieved_docs:
    str_retrieved_docs.append("".join(doc.content + "\n" for doc in docs))

In [None]:
display(Markdown(f"**Retrieved documents:** {str_retrieved_docs[0]}"))

### Evaluators in Haystack

__Evaluators__ in Haystack are versatile components that can operate independently or as integral parts of a pipeline.

We'll construct an evaluation pipeline to efficiently obtain scores from all evaluators in a single pass. Additionally, Haystack provides functionality to generate a comprehensive evaluation report.

#### Creating flow-eval evaluators using the HaystackLMEvaluator class

We can use our integration with Haystack to create `flow-eval` evaluators in a flexible way. The process is as follows:
1. Create an `LMEval` that will be used to compute the score for the evaluator.
2. Initialize the model - We are using the transformers configuration for Flow-Judge-v0.1.
3. Instantiate the `HaystackLMEvaluator` evaluator.

> **Important Note on Model Selection:**
> 
> There's a known issue with Phi-3 models producing gibberish outputs for contexts exceeding 4096 tokens (including input and output). While this has been addressed in recent transformers library updates, it still remains an issue in the vLLM engine. We recommend the following:
> 
> - For longer contexts: Use the `Flow-Judge-v0.1_HF` model configuration.
> - **Caveat:** Inference with transformers is significantly slower than with optimized runtimes.
> 
> This approach ensures reliable outputs for extensive contexts, albeit with a trade-off in processing speed.
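
If you want a quick warning before sending long inputs to a runtime affected by this issue, a rough length check can help. The sketch below relies on a crude assumption of about four characters per token for English text; this is only a heuristic, and an exact count requires the model's own tokenizer. The `may_exceed_token_limit` helper and the amount reserved for output are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; NOT a substitute for the real tokenizer."""
    return int(len(text) / chars_per_token) + 1


def may_exceed_token_limit(texts, limit=4096, reserved_for_output=512):
    """Return indices of inputs whose estimated size may exceed the budget."""
    budget = limit - reserved_for_output
    return [i for i, text in enumerate(texts) if estimate_tokens(text) > budget]


# Toy inputs: one short context and one very long one.
toy_contexts = ["short contract clause", "x" * 20_000]
print(may_exceed_token_limit(toy_contexts))
```

Inputs flagged by a check like this are candidates for the transformers-based configuration mentioned above.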

In [None]:
from flow_eval.integrations.haystack import HaystackLMEvaluator
from flow_eval.lm import LMEval, RubricItem
from flow_eval.lm.metrics import RESPONSE_FAITHFULNESS_5POINT
from flow_eval.lm.models import Vllm, Hf, Llamafile, Baseten

# Create a model using Hugging Face Transformers with Flash Attention
model = Hf()

# Or if not running on Ampere GPU or newer, create a model using no flash attn and Hugging Face Transformers
# model = Hf(flash_attn=False)

# Creating a model using Vllm
# model = Vllm()

# If you have other applications open taking up VRAM, you can use less VRAM by setting gpu_memory_utilization to a lower value.
# model = Vllm(gpu_memory_utilization=0.70)

# If you are running on an Apple Silicon Mac, you can create a model using Llamafile
# model = Llamafile()

# Or create a model using Baseten if you don't want to run locally.
# As a pre-requisite step:
#  - Sign up to Baseten
#  - Generate an api key https://app.baseten.co/settings/api_keys
#  - Set the api key as an environment variable & initialize:
# import os
# os.environ["BASETEN_API_KEY"] = "your_api_key"
# model = Baseten()

We create the context relevancy metric from scratch. To learn more about creating custom metrics, refer to the [custom metrics tutorial](https://github.com/flowaicom/flow-judge/blob/main/examples/2_custom_evaluation_criteria.ipynb).

In [14]:
# Context relevancy
cr_criteria = "Based on the provided query and context, how relevant and sufficient is the context for responding to the query?"
cr_rubric = [
    RubricItem(
        score=1,
        description="The context provided is not relevant or insufficient to respond to the query."
    ),
    RubricItem(
        score=2,
        description="The context is mostly irrelevant to the query. It may contain some tangentially related information but is insufficient for adequately responding to the query."
    ),
    RubricItem(
        score=3,
        description="The context is somewhat relevant to the query. It contains some information that could be used to partially respond to the query, but key details are missing for a complete response."
    ),
    RubricItem(
        score=4,
        description="The context is mostly relevant to the query. It contains most of the necessary information to respond to the query, but may be missing some minor details."
    ),
    RubricItem(
        score=5,
        description="The context is highly relevant to the query. It contains all the necessary information to comprehensively respond to the query without needing any additional context."
    )
]
cr_eval = LMEval(
    name="Context Relevancy",
    criteria=cr_criteria,
    rubric=cr_rubric,
    input_columns=["question"],
    output_column="contexts"
)

To create the faithfulness evaluator, we use the `RESPONSE_FAITHFULNESS_5POINT` preset from the `flow-eval` library as a template.

> Note that the input and output column names must match the keys used in the RAG pipeline, so we override the preset's required inputs and outputs. The score descriptions remain valid with these changes.

In [None]:
ff_criteria = RESPONSE_FAITHFULNESS_5POINT.criteria
ff_rubric = RESPONSE_FAITHFULNESS_5POINT.rubric

display(Markdown(f"**Criteria:** {ff_criteria}"))
display(Markdown("**Rubric:**"))

for item in ff_rubric:
    display(Markdown(f"- **Score {item.score}:** {item.description}"))

ff_eval = LMEval(
    name="Faithfulness",
    criteria=ff_criteria,
    rubric=ff_rubric,
    input_columns=["question", "contexts"],
    output_column="predicted_answers"
)

We can now create the Flow Judge evaluators:

In [16]:
cr_evaluator = HaystackLMEvaluator(
    eval=cr_eval,
    model=model, # the Flow-Judge-v0.1 model instance created above
    progress_bar=True,
    raise_on_failure=True, # to raise an error when pipeline run fails
    save_results=True, # to save evaluation results to disk
    fail_on_parse_error=False # if True, fail on a parsing error; if False, return "Error" and a score of -1
)

ff_evaluator = HaystackLMEvaluator(
    eval=ff_eval,
    model=model,
    progress_bar=True,
    raise_on_failure=True,
    save_results=True,
    fail_on_parse_error=False
)
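
Since a parsing failure is reported with a score of -1 when `fail_on_parse_error=False`, those entries should be excluded before averaging. The helper below is a minimal sketch that operates on a plain list of scores; how you extract that list from the evaluator output depends on the library version, and the `mean_valid_score` name is hypothetical.

```python
from statistics import mean


def mean_valid_score(scores, error_score=-1):
    """Average the scores, ignoring entries that mark a parsing error.

    Returns (mean, number_of_errors); the mean is None when nothing is valid.
    """
    valid = [s for s in scores if s != error_score]
    n_errors = len(scores) - len(valid)
    return (mean(valid) if valid else None), n_errors


# Toy scores on the 1-5 scale, with one parsing failure marked as -1.
toy_scores = [5, 4, -1, 3]
print(mean_valid_score(toy_scores))
```

Reporting the number of errors alongside the mean makes it visible when an average rests on only a few successfully parsed judgments.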

#### Haystack evaluators

Now let's create the semantic answer similarity evaluator using the Haystack implementation. This evaluator will use the same embedding model as the retriever in the RAG pipeline.

In [17]:
from haystack.components.evaluators.sas_evaluator import SASEvaluator

sas_evaluator = SASEvaluator(model="sentence-transformers/all-MiniLM-L6-v2")

### Evaluation pipeline

We can now create a Haystack evaluation pipeline that evaluates the RAG pipeline outputs and returns the evaluation results.

In [18]:
eval_pipeline = Pipeline()

# add components to the pipeline
eval_pipeline.add_component("sas_evaluator", sas_evaluator)
eval_pipeline.add_component("cr_evaluator", cr_evaluator)
eval_pipeline.add_component("ff_evaluator", ff_evaluator)

>Note that executing the following cell might take a while to complete due to the size of the inputs, especially if running on a machine with limited resources.

In [None]:
results = eval_pipeline.run(
    {
        "sas_evaluator": {
            'predicted_answers': rag_answers,
            'ground_truth_answers': ground_truths,
        },
        "cr_evaluator": {
            'question': questions,
            'contexts': str_retrieved_docs,
        },
        "ff_evaluator": {
            'question': questions,
            'contexts': str_retrieved_docs,
            'predicted_answers': rag_answers,
        }
    }
)

In [None]:
results

### Evaluation report

Haystack provides a convenient way to generate an evaluation report using the `EvaluationRunResult` class.

In [None]:
from haystack.evaluation.eval_run_result import EvaluationRunResult

inputs = {
    "question": questions,
    "contexts": str_retrieved_docs,
    "answer": ground_truths,
    "predicted_answer": rag_answers,
}

evaluation_result = EvaluationRunResult(run_name="report", inputs=inputs, results=results)
evaluation_result.score_report()


We can also easily convert the results to a pandas DataFrame.

In [None]:
results_df = evaluation_result.to_pandas()
results_df.head(5)

## Summary

In this tutorial, we demonstrated how to evaluate a Retrieval-Augmented Generation (RAG) pipeline using `flow-eval` and Haystack. Key aspects covered include:

1. Setting up a basic RAG pipeline with Haystack:
   - Using `InMemoryDocumentStore` for document storage
   - Implementing `SentenceTransformersDocumentEmbedder` for document embedding
   - Utilizing `HuggingFaceAPIChatGenerator` for answer generation

2. Creating custom evaluators with `FlowJudge`:
   - Developing a custom metric for context relevancy
   - Adapting a preset metric for faithfulness
   - Using the `HaystackLMEvaluator` class to integrate FlowJudge evaluators into Haystack

3. Building a comprehensive evaluation pipeline:
   - Incorporating both FlowJudge and native Haystack evaluators
   - Using `SASEvaluator` for semantic answer similarity

4. Executing the evaluation and analyzing results:
   - Running the evaluation pipeline on a subset of questions
   - Utilizing `EvaluationRunResult` to generate a summary report
   - Converting results to a pandas DataFrame for further analysis

5. Demonstrating the flexibility of `flow-eval`:
   - Seamless integration with Haystack's evaluation framework
   - Ability to create custom metrics and adapt existing ones
   - Using open-source models to avoid reliance on proprietary large language models

This tutorial showcases how `flow-eval` can be effectively used to evaluate and iteratively improve RAG pipelines built with Haystack, providing a comprehensive assessment of performance across multiple dimensions including semantic similarity, context relevancy, and faithfulness.
