# Evaluating a Haystack RAG pipeline with FlowJudge

## Overview

This tutorial demonstrates how to evaluate a Retrieval-Augmented Generation (RAG) pipeline built with Haystack using `FlowJudge`. We'll showcase how to:

1. Set up a basic RAG pipeline using Haystack
2. Integrate `FlowJudge` evaluators into a Haystack evaluation pipeline
3. Assess the RAG system's performance using multiple metrics:
   - Semantic Answer Similarity (SAS)
   - Context Relevancy
   - Faithfulness

Key highlights:

- Using the `HaystackFlowJudge` class to incorporate `FlowJudge` evaluators
- Creating custom metrics for tailored evaluations
- Combining pre-built Haystack evaluators with custom `FlowJudge` evaluators

By the end of this tutorial, you'll have a clear understanding of how to comprehensively evaluate your Haystack RAG pipelines using `FlowJudge`, enabling you to iteratively improve your system's performance without relying on proprietary large language models.

### Additional requirements

- Haystack: Make sure you have Haystack installed. You can install it via pip:
  ```bash
  pip install haystack-ai
  ```

- Sentence Transformers: Make sure you have Sentence Transformers installed. You can install it via pip:
  ```bash
  pip install "sentence-transformers>=3.0.0"
  ```

- Set your free HuggingFace API token as an environment variable:
  ```python
  import os
  os.environ["HF_TOKEN"] = "your_token_here"
  ```

  You can create a HuggingFace API token in your [account settings](https://huggingface.co/settings/tokens).
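
A missing or empty token usually only surfaces later as an authentication error from the API. The sketch below shows one way to fail early with a clearer message. The helper name `require_env` is hypothetical and not part of Haystack or `FlowJudge`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for this tutorial; not part of Haystack or FlowJudge.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "Set it before building the pipeline."
        )
    return value
```

Calling `require_env("HF_TOKEN")` at the top of the notebook stops execution immediately if the token has not been set.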

Note that this notebook primarily demonstrates the integration of `FlowJudge` with Haystack for evaluating RAG pipelines. While we do set up a basic RAG pipeline using Haystack, the main emphasis is on the evaluation process using FlowJudge evaluators.

For detailed explanations on building RAG pipelines with Haystack, please refer to the official [Haystack documentation](https://docs.haystack.deepset.ai/docs/intro).

## Dataset

For this tutorial, we are going to use a subset of the `LegalBench` dataset, which pairs contracts with questions about them.

In [1]:
try:
    from datasets import Dataset
except ImportError as e:
    print("datasets is not installed. ")
    print("Please run `pip install datasets` to install it.")
    print("\nAfter installation, restart the kernel and run this cell again.")
    raise SystemExit(f"Stopping execution due to missing datasets dependency: {e}")

In [2]:
from datasets import load_dataset

ds = load_dataset("flowaicom/legalbench_contracts_qa_subset", "default")

In [2]:
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'context', 'original_answer', 'answer', 'answer_with_reasoning'],
        num_rows: 100
    })
})

This dataset contains:
- Questions: A question about the contract.
- Context: The contract itself.
- Original answer: The original answer to the question which can be considered as the ground truth.
- Answer: The answer used for generating the answer with reasoning, which can include noise with respect to the original answer.
- Answer with reasoning: An answer to the question including the reasoning for the answer based on the contract.

For this tutorial:
- We use instances without perturbations (where `original_answer == answer`)
- The contract text (`context`) is used to create documents
- We use `answer_with_reasoning` as the ground truth for evaluators

In [3]:
try:
    from haystack import Document
except ImportError:
    print("Haystack is not installed. ")
    print("Please install it according to the 'Additional Requirements' section above.")
    print("\nAfter installation, restart the kernel and run this cell again.")
    raise SystemExit("Stopping execution due to missing Haystack dependency.")

filtered_ds = ds.filter(lambda x: x['original_answer'] == x['answer'])

all_documents = [Document(content=context) for context in filtered_ds['train']['context']]
all_questions = list(filtered_ds['train']['question'])
all_ground_truths = list(filtered_ds['train']['answer_with_reasoning'])

print(f"Number of documents: {len(all_documents)}")
print(f"Number of questions: {len(all_questions)}")
print(f"Number of ground truths: {len(all_ground_truths)}")

Number of documents: 59
Number of questions: 59
Number of ground truths: 59


In [4]:
from IPython.display import Markdown, display

display(Markdown(f"**Question:** {all_questions[0]}"))
display(Markdown(f"**Context:** {all_documents[0].content}"))
display(Markdown(f"**Ground truth answer:** {all_ground_truths[0]}"))

**Question:** Does CNN permit using bots to artificially increase page visits for certain content?

**Context:** The following terms apply to content submitted by users, and user conduct, on the Site's Interactive Areas, including CNN iReport:

(A) Interactive Areas. The Site may contain comments sections, discussion forums, or other interactive features, including CNN iReport, CNN's user-generated news community ("Interactive Areas") in which you may post or upload user-generated content, including but not limited to iReports, comments, video, photos, messages, other materials or items (collectively, "User Content"). You are solely responsible for your use of any Interactive Areas and you use them at your own risk. Interactive Areas are available for individuals aged 13 years or older. By submitting User Content to an Interactive Area, you represent that you are 13 years of age or older and, if you are under the age of 18, you either are an emancipated minor, or have obtained the legal consent of your parent or legal guardian to enter into these Terms of Use, submit content, participate on the Site, and fulfill the obligations set forth in these Terms of Use, which forms a binding contract between you and CNN. Employees of Cable News Network, Inc. may not submit User Content without permission from their supervisors.

(B) Community Guidelines. By submitting any User Content or participating in an Interactive Area within or in connection with the Site, you agree to abide by the following rules of conduct:
You agree not to upload, post or otherwise transmit any User Content that:
violates or infringes in any way upon the rights of others, including any statements which may defame, harass, stalk or threaten others.
you know to be false, misleading or inaccurate.
contains blatant expressions of bigotry, racism, racially or ethnically offensive content, hate speech, abusiveness, vulgarity or profanity.
contains or advocates pornography or sexually explicit content, pedophilia, incest, bestiality, or that is otherwise obscene or lewd.
violates any law or advocates or provides instruction on dangerous, illegal, or predatory acts, or discusses illegal activities with the intent to commit them.
advocates violent behavior.
poses a reasonable threat to personal or public safety.
contains violent images of killing or physical abuse that appear to have been captured solely, or principally, for exploitive, prurient, or gratuitous purposes.
is protected by copyright, trademark, trade secret, right of publicity or other proprietary right without the express permission of the owner of such copyright, trademark, trade secret, right of publicity or other proprietary right. The burden of determining that any User Content is not protected by copyright, trademark, trade secret, right of publicity or other proprietary right rests with you. You shall be solely liable for any damage resulting from any infringement of copyrights, trademarks, trade secrets, rights of publicity or other proprietary rights or any other harm resulting from such a submission. Any person determined by CNN, in its sole discretion, to have violated the intellectual property or other rights of others shall be barred from submitting or posting any further material on the Site.
does not generally pertain to the designated topic or theme of any Interactive Area.
contains any unsolicited or unauthorized advertising or promotional materials with respect to products or services, "junk mail," "spam," "chain letters," "pyramid schemes," or any other form of solicitation.
You agree not to engage in activity that would constitute a criminal offense or give rise to a civil liability.
You agree that if necessary, you have the consent of each and every identifiable natural person in any submission to use such persons name or likeness in the manner contemplated by the Site.
You agree that any person who appears in your submission who is a current member of the Screen Actors Guild (SAG), the American Federation of Television and Radio Actors (AFTRA) or any other rights society is not entitled to compensation by CNN.
You agree not to impersonate any person or entity, including, but not limited to, CNN or any CNN employee, or falsely state or otherwise misrepresent your affiliation with any person or entity.
You agree not to represent or suggest, directly or indirectly, CNN's endorsement of User Content.
You agree not to interfere with any other user's right to privacy, including by harvesting or collecting personally-identifiable information about the Site users or posting private information about a third party.
You agree not to upload, post or otherwise transmit any User Content, software or other materials which contain a virus or other harmful or disruptive component.
You agree not to interfere with or disrupt the Site or the servers or networks connected to the Site, or disobey any requirements, procedures, policies or regulations of networks connected to the Site.
You agree not to reproduce, duplicate, copy, sell, resell or exploit for any commercial purpose, any portion of the Site, use the Site, or access to the Site.
You agree not to use any service, technology or automated system to artificially inflate the page views that your User Content receives. This includes pay-per-click services, web "robots" and any other current or future technologies. You also agree not to direct any third party to use these services, technologies or automated systems on your behalf.
You agree not to use any technology, service or automated system to post more User Content than an individual could upload in a given period of time. You also agree not to direct any third party to use these services, technologies or automated systems on your behalf.
Any conduct that in CNN's sole discretion restricts or inhibits anyone else from using or enjoying the Site will not be permitted. CNN reserves the right in its sole discretion to remove or edit User Content by you and to terminate Your Account for any reason.
CNN does not vouch for the accuracy or credibility of any User Content, and does not take any responsibility or assume any liability for any actions you may take as a result of reading User Content posted on the Site. Through your use of Interactive Areas, you may be exposed to content that you may find offensive, objectionable, harmful, inaccurate or deceptive. There may also be risks of dealing with underage persons, people acting under false pretense, international trade issues and foreign nationals. By using Interactive Areas, you assume all associated risks.

**Ground truth answer:** No, CNN does not permit using bots to artificially increase page visits for certain content. 

This is explicitly stated in the Community Guidelines, which prohibits users from using "any service, technology or automated system to artificially inflate the page views that your User Content receives." This rule specifically includes "web 'robots' and any other current or future technologies" and extends to directing third parties to use these services on the user's behalf. This means that any attempt to manipulate page views through automated means is a direct violation of CNN's terms of use and can result in the removal of the content or even termination of the user's account.

## Creating a RAG pipeline with Haystack

We will be creating a very simple RAG pipeline with Haystack.

For more detailed explanations of building the RAG pipeline, please refer to the Haystack tutorial [Tutorial: Evaluating RAG pipelines](https://haystack.deepset.ai/tutorials/35_evaluating_rag_pipelines).

>Note that we have made minor modifications to the pipeline for this tutorial. In particular, we are using `HuggingFaceAPIChatGenerator` and `ChatPromptBuilder`.

### Indexing the documents

We need to index the documents so that a retriever can later find the documents most similar to each question.

We are using the `InMemoryDocumentStore`, a simple in-memory document store that doesn't require setting up a database.

We are also using a small open-source embedding model from Sentence Transformers to convert the documents into embeddings.

Finally, we are using the `DocumentWriter` to write the documents into the document store.

In [5]:
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

document_store = InMemoryDocumentStore()

document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

indexing = Pipeline()
indexing.add_component(instance=document_embedder, name="document_embedder")
indexing.add_component(instance=document_writer, name="document_writer")

indexing.connect("document_embedder.documents", "document_writer.documents")

indexing.run({"document_embedder": {"documents": all_documents}})



ID 'c0836ba2f9d26978182748cd7560eb3f4e88a82850b29a2243b1cc63da11b6e5' already exists
ID '88987c8c6fbb5fac2bbe6689fce472c5df6f7d650fbe42d218dcf7e1b77aa975' already exists
ID '676bedf216ff078645168c0b1567e4c78dbc22ec710fc74216a74f7a0b5167fc' already exists
ID 'f4fcb43ca4803f5ac8a34374d7cbf9cf248347e5d9ca63f32aaac6296794de6d' already exists
ID 'c59aba33746d10420a8e8f09a70d2fa5961a81fc9c41cdc47850a49f6568808b' already exists
ID '8805408d3f6c8ae955cf721219c0b4b3891438faa059189e1fd4c1111712c205' already exists


{'document_writer': {'documents_written': 53}}
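
The output above reports 53 documents written although 59 were passed in. The six "already exists" messages account for the difference: several questions share the same contract text, and `DuplicatePolicy.SKIP` skips a document whose ID is already in the store. The sketch below illustrates the idea in plain Python, assuming IDs are derived from a hash of the content (Haystack generates an ID from the document's fields when none is given; the exact hashing scheme here is an illustrative assumption, and the function names are hypothetical).

```python
import hashlib
from typing import Dict, List


def content_id(text: str) -> str:
    """Derive a deterministic ID from the document text (illustrative scheme)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def write_skipping_duplicates(store: Dict[str, str], texts: List[str]) -> int:
    """Write texts into the store, skipping any whose ID already exists."""
    written = 0
    for text in texts:
        doc_id = content_id(text)
        if doc_id in store:
            # Same content was already stored: skip it instead of failing.
            continue
        store[doc_id] = text
        written += 1
    return written


store: Dict[str, str] = {}
texts = ["Terms A", "Terms B", "Terms A"]  # the third entry repeats the first
written = write_skipping_duplicates(store, texts)
print(f"documents_written: {written}")
```

A practical consequence is that the document store can hold fewer entries than there are questions, so a question's position in `all_questions` does not identify its document in the store.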

### Create the RAG pipeline

Haystack lets us easily create a RAG pipeline using:

- `InMemoryEmbeddingRetriever`, which retrieves the documents most relevant to the query.
- `HuggingFaceAPIChatGenerator`, which generates the answer to the question. We are going to use a small open model for this example.

>Note that you can use the free serverless inference API from HuggingFace to quickly experiment with different models. However, it is rate-limited and not suitable for production. To use the API, you just need to provide [your free HuggingFace API token](https://huggingface.co/settings/tokens).



In [6]:
from haystack.components.builders import AnswerBuilder, ChatPromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
from haystack.utils.hf import HFGenerationAPIType

# Use HuggingFace's free serverless inference API; the token is read from HF_TOKEN.
api_type = HFGenerationAPIType.SERVERLESS_INFERENCE_API
llm = HuggingFaceAPIChatGenerator(
    api_type=api_type,
    api_params={"model": "microsoft/Phi-3.5-mini-instruct"},
    token=Secret.from_env_var("HF_TOKEN"),
)


template_str = """
You have to answer the following question based on the given context information only.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""
template = [ChatMessage.from_user(template_str)]
prompt_builder = ChatPromptBuilder(template=template)

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    "query_embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
)
rag_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=3))
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", llm)
rag_pipeline.add_component("answer_builder", AnswerBuilder())

rag_pipeline.connect("query_embedder", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7ac1526ebe50>
🚅 Components
  - query_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: ChatPromptBuilder
  - llm: HuggingFaceAPIChatGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - query_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.messages (List[ChatMessage])
  - llm.replies -> answer_builder.replies (List[ChatMessage])

Let's test the pipeline with a single question.

In [7]:
# Quick test of the pipeline
question = "Does CNN permit using bots to artificially increase page visits for certain content?"

response = rag_pipeline.run(
    {
        "query_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
print(response["answer_builder"]["answers"][0].data)


No, CNN does not permit the use of any technology, service or automated system, including bots, to artificially inflate page views, post more User Content than an individual could upload in a given period, or direct any third party to use these systems on their behalf. Users agree not to engage in any activity that could disrupt the Site or violate any of these terms.
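
The same question has to be passed to three components on every run. A small helper keeps those inputs consistent; the sketch below mirrors the dictionary used in this tutorial, and the name `build_rag_inputs` is hypothetical.

```python
from typing import Any, Dict


def build_rag_inputs(question: str) -> Dict[str, Dict[str, Any]]:
    """Build the input dictionary expected by the RAG pipeline in this tutorial."""
    return {
        "query_embedder": {"text": question},    # embedded for retrieval
        "prompt_builder": {"question": question},  # rendered into the prompt
        "answer_builder": {"query": question},   # attached to the final answer
    }


inputs = build_rag_inputs("Does CNN permit using bots?")
print(sorted(inputs))
```

With this helper, a run becomes `rag_pipeline.run(build_rag_inputs(question))`.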


In [8]:
# display the retrieved documents and similarity scores
for i, doc in enumerate(response['answer_builder']['answers'][0].documents, 1):
    display(Markdown(f"""**Document {i} (Score: {doc.score:.4f}):**\n\n{doc.content[:500]}..."""))

**Document 1 (Score: 0.3840):**

The Site contains copyrighted material, trademarks and other proprietary information, including, but not limited to, text, software, photos, video, graphics, music and sound, and the entire contents of the Site are copyrighted as a collective work under the United States copyright laws. CNN owns copyright in the selection, coordination, arrangement and enhancement of such content, as well as in the content original to it. You may not modify, publish, transmit, participate in the transfer or sale...

**Document 2 (Score: 0.3641):**

The following terms apply to content submitted by users, and user conduct, on the Site's Interactive Areas, including CNN iReport:

(A) Interactive Areas. The Site may contain comments sections, discussion forums, or other interactive features, including CNN iReport, CNN's user-generated news community ("Interactive Areas") in which you may post or upload user-generated content, including but not limited to iReports, comments, video, photos, messages, other materials or items (collectively, "Use...

**Document 3 (Score: 0.3209):**

7. Indemnification.
You agree to defend, indemnify and hold harmless CNN, its affiliates and their respective directors, officers, employees and agents from and against all claims and expenses, including attorneys' fees, arising out of the use of the Site by you or your Account. CNN reserves the right to take over the exclusive defense of any claim for which we are entitled to indemnification under this Section. In such event, you shall provide CNN with such cooperation as is reasonably requeste...
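
In the output above, the excerpt that contains the answer (the Interactive Areas terms) is ranked second rather than first. When the source context of each question is known, as it is in this dataset, the position of that document in the retrieved list can be checked directly. The following is a minimal sketch with placeholder strings; the function names are hypothetical and this is not one of the evaluators used later in the tutorial.

```python
from typing import Optional, Sequence


def gold_rank(retrieved: Sequence[str], gold: str) -> Optional[int]:
    """Return the 1-based rank of the gold document, or None if it was not retrieved."""
    for rank, content in enumerate(retrieved, start=1):
        if content == gold:
            return rank
    return None


def reciprocal_rank(rank: Optional[int]) -> float:
    """Score 1/rank for a retrieved gold document and 0.0 for a miss."""
    return 0.0 if rank is None else 1.0 / rank


# Placeholder contents standing in for the three retrieved documents.
retrieved = ["copyright terms", "interactive areas terms", "indemnification terms"]
gold = "interactive areas terms"

rank = gold_rank(retrieved, gold)
print(rank, reciprocal_rank(rank))
```

With the tutorial's variables, `retrieved` would be the `content` of each document on the answer object and `gold` would be `all_documents[0].content` for the first question.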

## Evaluating the pipeline

With our initial RAG pipeline prototype in place, we can now focus on evaluation.

To showcase the integration of `FlowJudge` within the Haystack framework, we'll evaluate the pipeline using both statistical and model-based evaluators.

Haystack employs the concept of an __Evaluation pipeline__, which computes scoring metrics to assess the RAG pipeline's performance.

Our evaluation pipeline will incorporate three key metrics:
- __Semantic Answer Similarity (SAS)__: Measures the semantic similarity between generated and ground truth answers, going beyond simple lexical matching.
- __Context Relevancy__: Determines how well the retrieved documents align with the given query.
- __Faithfulness__: Assesses the extent to which the generated answer is grounded in the retrieved documents.
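
To make SAS more concrete: similarity of this kind is commonly computed by embedding both answers and comparing the vectors, for example with cosine similarity. The sketch below uses small hand-written vectors in place of real embeddings, so the numbers are purely illustrative, and Haystack's own evaluator may differ in its details.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for sentence embeddings (assumed values).
generated = [0.2, 0.7, 0.1]
ground_truth = [0.25, 0.65, 0.05]
unrelated = [0.9, -0.1, 0.4]

print(f"generated vs ground truth: {cosine_similarity(generated, ground_truth):.3f}")
print(f"generated vs unrelated:    {cosine_similarity(generated, unrelated):.3f}")
```

Two answers worded differently can still score close to 1 if their embeddings point in a similar direction, which is what distinguishes this metric from lexical overlap.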

For context relevancy and faithfulness, we'll leverage `FlowJudge` evaluators, eliminating the need for proprietary large models like GPT-4 or Claude 3.5 Sonnet.

### Obtaining generated answers

Our first step is to generate answers using the RAG pipeline.

>Note: We're using HuggingFace's free serverless inference API, which may take several minutes. To avoid rate limits, we're processing only 20 questions. If execution fails, you can resume from the last successful point by rerunning the cell.
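
Because the API is rate-limited, individual calls may fail transiently. One common mitigation is to retry with exponential backoff. The sketch below is a generic helper, not something Haystack or `FlowJudge` provides; the name `call_with_retries` and its parameters are hypothetical, and in practice the `except` clause should be narrowed to the error your client actually raises when rate-limited.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Demo with a function that fails twice before succeeding.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky, max_attempts=5, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

A pipeline call can be wrapped as `call_with_retries(lambda: rag_pipeline.run(inputs))`, where `inputs` is the dictionary passed to the pipeline for one question.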

In [9]:
questions = all_questions[:20]
ground_truths = all_ground_truths[:20]

# Keep results from an earlier partial run so that rerunning this cell
# resumes after the last successful question instead of starting over.
if "rag_answers" not in globals():
    rag_answers = []
    retrieved_docs = []

for question in questions[len(rag_answers):]:
    response = rag_pipeline.run(
        {
            "query_embedder": {"text": question},
            "prompt_builder": {"question": question},
            "answer_builder": {"query": question},
        }
    )
    answer = response["answer_builder"]["answers"][0]
    print(f"Question: {question}")
    print("Answer from pipeline:")
    print(answer.data)
    print("\n-----------------------------------\n")

    rag_answers.append(answer.data)
    retrieved_docs.append(answer.documents)

Question: Does CNN permit using bots to artificially increase page visits for certain content?
Answer from pipeline:
No, CNN does not permit the use of any technology, service or automated system, including bots, to artificially inflate page views, post more User Content than an individual could upload in a given period, or direct any third party to use these systems on their behalf. Users agree not to engage in any activity that could disrupt the Site or violate any of these terms.

-----------------------------------



Question: Does eBay promise that the product data it provides will be somewhat accurate?
Answer from pipeline:
Yes, eBay does acknowledge and warrant that the product data provided is accurate. However, it is important to note that eBay cannot promise that the content provided through their services will always be available, accurate, complete, and up-to-date. Users are responsible for ensuring the accuracy of the content they provide for their eBay listings.

-----------------------------------



Question: Will certain terms remain in force notwithstanding a users termination of the service?
Answer from pipeline:
Yes, certain terms will remain in force even if a user terminates the service. The specified sections that will survive termination of the Terms or the user's Account include:

1. 4 (Your Content): This section likely deals with the user's ownership and responsibilities related to the content they create or upload on the services.

2. 6 (Things You Cannot Do): This section probably outlines prohibited actions and activities on the services, thereby governing user behavior.

3. 10 (Indemnity): This section may require users to indemnify or protect the service providers from legal liability arising from their use of the services.

4. 11 (Disclaimers): These sections likely contain important disclaimers of warranty, limitations of liability, or other legal notices limiting the service provider's responsibilities to the user.

5. 12 (Limitation of Liability): This section 

Question: Where a Netflix subscriber requires customer support, is there an alternative means to obtaining customer support apart from enabling Netflix to remotely access the device or devices in question?
Answer from pipeline:
Yes, according to the provided context, if a Netflix subscriber does not want to consent to support through the remote access tool, they can still receive assistance from Customer Support through other means. The document states, "In certain instances, Customer Service may best be able to assist you by using a remote access support tool through which we have full access to your computer. If you do not want us to have this access, you should not consent to support through the remote access tool, and we will assist you through other means." Therefore, alternative channels or methods of customer support are available for users who opt-out of remote access.

-----------------------------------



Question: Can Apple disclose personal information about me if asked to by the government?
Answer from pipeline:
Yes, Apple can disclose personal information about you if they determine it is necessary to comply with any applicable law, regulation, legal process, or governmental request. This may include sharing information for investigations or complaints regarding your use of the Site, or in response to court orders or legal processes where Apple is legally required to provide such information.

-----------------------------------



Question: Am I allowed to scrape data on the services for a research project Im doing?
Answer from pipeline:
No, according to the provided context, you are not allowed to scrape data from the services using automated means, such as robots, spiders, scripts, software, or any manual or automatic device, tool, or process designed to data mine or scrape content from the Services. Doing so could result in civil, criminal, and/or administrative penalties, fines, or sanctions. Therefore, for a research project, you would need to request permission from The New York Times Rights and Permissions Department or the copyright holder identified in the copyright notice contained in the Content.

-----------------------------------



Question: Does Amazon take responsibility for items sold by third parties through Amazon?
Answer from pipeline:
No, Amazon does not take responsibility for items sold by third parties through Amazon. According to the context, when you purchase any product or service offered by these third parties, you are purchasing directly from those third parties, not from Amazon. Amazon also does not warrant, examine, or evaluate the offerings of these businesses or individuals. Users are advised to carefully review the privacy statements and other conditions of use of those third parties.

-----------------------------------



Question: Does NYT check all user content before its uploaded and posted?
Answer from pipeline:
The provided context does not explicitly state whether The New York Times (NYT) checks all user content before it is uploaded and posted. However, NYT's Terms of Service do prohibit users from engaging in actions that could harm the Services, a third party, or violate applicable laws or intellectual property rights. It implies that there are certain controls and restrictions in place to maintain the integrity of the Services, but it does not directly address the procedure for content review before posting. Therefore, based on the given context, it cannot be definitively stated whether NYT checks all user content before it is uploaded and posted.

-----------------------------------



Question: Are there some changes to the terms that Instructure might not let me know about?
Answer from pipeline:
Yes, Instructure may revise the Acceptable Use Policy guidelines from time to time, and some of these revisions could meaningfully reduce your rights. Although Instructure will generally provide notice of significant revisions through notification methods such as email to the email address associated with your account, it is possible that not all changes are communicated to users. In such cases, you would be bound by and subject to the revised Acceptable Use Policy guidelines once they come into effect, as per the agreement stated in the Miscellaneous section of the context.

-----------------------------------



Question: If someone else is using content I posted on Instagram, could this prevent the content from being deleted?
Answer from pipeline:
Yes, if someone else is using your content on Instagram in accordance with existing licenses and they have not deleted it, this could prevent the content from being immediately deleted. The license you granted allows them to use the content as outlined in the terms, and until they remove it or you delete your account, the content remains under those licensing terms and may not be deleted within 90 days of your request. However, eventually, your content may still be deleted if the person no longer uses it or if Instagram determines that deletion is necessary for legal, safety, security, or compliance reasons.

Additionally, legal actions or regulations may also play a role in content removal timelines. In cases where deletion within 90 days is not technically feasible, Instagram will retain the content for as long as necessary, but the exact duration

Question: Is eBay entitled to decide which party pays the costs of arbitration?
Answer from pipeline:
Yes, according to the context provided, eBay has the entitlement to decide which party will bear the costs of arbitration. Under Section D titled "Arbitration Fees," it states, "If You are unable to afford the arbitration costs, Zoom will advance those costs to You, subject to the arbitrators determination if costs should be reimbursed to Zoom if Zoom prevails." This implies that eBay may advance arbitration costs to the opposing party, but it might be reimbursed by the same party if eBay wins the arbitration case. 

Additionally, the section on Dispute Procedure under Section F mentions that if a party is unable to afford the arbitration costs, the other party may absorb those costs. However, it should be noted that the arbitrator may decide whether the costs should be reimbursed to eBay, depending on who prevails in the arbitration. Thus, while eBay may contribute initially to arbitr

Question: Is it sufficient for me to disclose in the edit summary the compensation I received for the edits I made?
Answer from pipeline:
No, it is not sufficient to disclose compensation in the edit summary for edits made. The context provided does not specifically mention that you must disclose compensation for edits. Instead, the context discusses various aspects of Apple's Terms of Use, disclosure of information, enforcement and violation consequences, liability limitations, dispute resolutions, and the conditions under which the services may be used, among others. These sections do not dictate any requirements for participants to disclose compensation received for edits or contributions. Any such requirement would likely be outlined explicitly in the terms of that specific context. Always refer to the specific terms and conditions in each scenario.

-----------------------------------



Question: Does Verizon need to tell me if it makes changes to the services?
Answer from pipeline:
Yes, Verizon must inform you if changes are made to the services. According to the provided context, Verizon will let you know when they update the Services. Using the Services after these changes take effect means you agree to the new terms, although if you don't agree to the updates, you must cease using the Services and close your account, especially if you're a parent or guardian overseeing a minor child's account.

-----------------------------------



Question: Does Shopify agree to ensure the encryption of all data provided to it?
Answer from pipeline:
No, according to the context provided, while credit card information is always encrypted during transfer over networks, it is stated that "Your Materials (not including credit card information), may be transferred unencrypted and involve transmissions over various networks." This implies that not all data provided to Shopify is guaranteed to be encrypted.

-----------------------------------



Question: Are there some Disney products that I cant access anonymously?
Answer from pipeline:
Yes, based on Section E of the provided context, some Disney products may require you to create an account, which includes providing your contact information, username, and password. By creating an account, Disney has the ability to verify your identity and maintain the confidentiality of the information you provide. This means that you cannot access these Disney products anonymously as they require account registration and information sharing to participate or secure additional benefits.

-----------------------------------



Question: Can Twitch continue using my content for advertising and marketing after Ive deleted that content?
Answer from pipeline:
No, Twitch cannot continue using your content for advertising and marketing after you delete your content. According to the context provided, the rights granted by you to Twitch and its sub-licensees for User Content terminate once you delete such content from the Twitch Services or generally by closing your account. There are exceptions noted, such as if Twitch used the content for promotional purposes or if others copied or stored portions of the User Content (e.g., made a Clip). However, in your case, the deletion of the content would generally result in the termination of Twitch's rights to use it for advertising and marketing.

-----------------------------------



Question: Does Facebook take responsibility for racist content posted on Facebook?
Answer from pipeline:
Facebook does not take responsibility for individual content posted on its platform. According to the provided context, users are expressly advised not to use language that abuses or discriminates on the basis of race, gender, sexual preference, age, region, disability, etc., as hate speech is grounds for immediate and permanent suspension of access to all or part of the service. Additionally, users are instructed to engage in debates without resorting to personal attacks.

Furthermore, Facebook's Terms of Use state that the company is not responsible for the content users post. Any submitted content by a user, in their view, represents that they have all the necessary legal rights to upload, post, or submit such content and it will not violate any law or the rights of any person. Therefore, while Facebook does have measures in place against hate speech and abuse, it provides explic

Question: Are there circumstances in which a security breach of a users Microsoft account could give rise to a claim by Microsoft against such user?
Answer from pipeline:
Yes, according to the context provided, Microsoft may hold the user liable for any losses incurred by Microsoft or another party due to someone else using the user's account or password. This suggests that in cases of unauthorized use of a Microsoft account, Microsoft could potentially take legal action against the user whose account was compromised, leading to a claim. Additionally, the user is responsible for maintaining the confidentiality of their account and password, and any unauthorized use could constitute an unlawful and unfair business practice. Therefore, if Microsoft suffers damages because of such a security breach, it might seek appropriate legal remedies, which could include making a claim against the user. However, the user is not liable for any loss that Microsoft may incur due to unauthorized use of 

Question: Does NYT require parental consent in respect of persons aged 17 who wish to access NYT services?
Answer from pipeline:
No, NYT requires parental consent for individuals under the age of 18 to access NYT services. Specifically, it states that anyone under 18 years old must ask their parent or legal guardian to review and agree to the Terms of Service before using any part of the services, or have their parent or guardian complete the purchase and/or registration on their behalf. Therefore, for persons aged 17, parental consent would indeed be required according to these terms.

-----------------------------------



Question: Does Instagram become the owner of content that I post on Instagram?
Answer from pipeline:
No, Instagram does not claim ownership of your content. However, when you share, post, or upload content on Instagram, you grant Instagram a non-exclusive, royalty-free, transferable, sub-licensable, worldwide license to host, use, distribute, modify, run, copy, publicly perform or display, translate, and create derivative works of your content. This license expires when your content is deleted from Instagram's systems. You retain the rights to share your content with anyone else outside of Instagram.

-----------------------------------



We now convert the retrieved documents into a single string so `FlowJudge` can format the prompt properly under the hood.

In [10]:
# Concatenate the retrieved documents for each question into a single string,
# with each document's content followed by a newline
str_retrieved_docs = []
for docs in retrieved_docs:
    str_retrieved_docs.append("".join(doc.content + "\n" for doc in docs))

In [11]:
display(Markdown(f"**Retrieved documents:** {str_retrieved_docs[0]}"))

**Retrieved documents:** The Site contains copyrighted material, trademarks and other proprietary information, including, but not limited to, text, software, photos, video, graphics, music and sound, and the entire contents of the Site are copyrighted as a collective work under the United States copyright laws. CNN owns copyright in the selection, coordination, arrangement and enhancement of such content, as well as in the content original to it. You may not modify, publish, transmit, participate in the transfer or sale, create derivative works, or in any way exploit, any of the content, in whole or in part. You may download copyrighted material for your personal use only. Except as otherwise expressly permitted under copyright law, no copying, redistribution, retransmission, publication or commercial exploitation of downloaded material will be permitted without the express permission of CNN and the copyright owner. In the event of any permitted copying, redistribution or publication of copyrighted material, no changes in or deletion of author attribution, trademark legend or copyright notice shall be made. You acknowledge that you do not acquire any ownership rights by downloading copyrighted material.
The following terms apply to content submitted by users, and user conduct, on the Site's Interactive Areas, including CNN iReport:

(A) Interactive Areas. The Site may contain comments sections, discussion forums, or other interactive features, including CNN iReport, CNN's user-generated news community ("Interactive Areas") in which you may post or upload user-generated content, including but not limited to iReports, comments, video, photos, messages, other materials or items (collectively, "User Content"). You are solely responsible for your use of any Interactive Areas and you use them at your own risk. Interactive Areas are available for individuals aged 13 years or older. By submitting User Content to an Interactive Area, you represent that you are 13 years of age or older and, if you are under the age of 18, you either are an emancipated minor, or have obtained the legal consent of your parent or legal guardian to enter into these Terms of Use, submit content, participate on the Site, and fulfill the obligations set forth in these Terms of Use, which forms a binding contract between you and CNN. Employees of Cable News Network, Inc. may not submit User Content without permission from their supervisors.

(B) Community Guidelines. By submitting any User Content or participating in an Interactive Area within or in connection with the Site, you agree to abide by the following rules of conduct:
You agree not to upload, post or otherwise transmit any User Content that:
violates or infringes in any way upon the rights of others, including any statements which may defame, harass, stalk or threaten others.
you know to be false, misleading or inaccurate.
contains blatant expressions of bigotry, racism, racially or ethnically offensive content, hate speech, abusiveness, vulgarity or profanity.
contains or advocates pornography or sexually explicit content, pedophilia, incest, bestiality, or that is otherwise obscene or lewd.
violates any law or advocates or provides instruction on dangerous, illegal, or predatory acts, or discusses illegal activities with the intent to commit them.
advocates violent behavior.
poses a reasonable threat to personal or public safety.
contains violent images of killing or physical abuse that appear to have been captured solely, or principally, for exploitive, prurient, or gratuitous purposes.
is protected by copyright, trademark, trade secret, right of publicity or other proprietary right without the express permission of the owner of such copyright, trademark, trade secret, right of publicity or other proprietary right. The burden of determining that any User Content is not protected by copyright, trademark, trade secret, right of publicity or other proprietary right rests with you. You shall be solely liable for any damage resulting from any infringement of copyrights, trademarks, trade secrets, rights of publicity or other proprietary rights or any other harm resulting from such a submission. Any person determined by CNN, in its sole discretion, to have violated the intellectual property or other rights of others shall be barred from submitting or posting any further material on the Site.
does not generally pertain to the designated topic or theme of any Interactive Area.
contains any unsolicited or unauthorized advertising or promotional materials with respect to products or services, "junk mail," "spam," "chain letters," "pyramid schemes," or any other form of solicitation.
You agree not to engage in activity that would constitute a criminal offense or give rise to a civil liability.
You agree that if necessary, you have the consent of each and every identifiable natural person in any submission to use such persons name or likeness in the manner contemplated by the Site.
You agree that any person who appears in your submission who is a current member of the Screen Actors Guild (SAG), the American Federation of Television and Radio Actors (AFTRA) or any other rights society is not entitled to compensation by CNN.
You agree not to impersonate any person or entity, including, but not limited to, CNN or any CNN employee, or falsely state or otherwise misrepresent your affiliation with any person or entity.
You agree not to represent or suggest, directly or indirectly, CNN's endorsement of User Content.
You agree not to interfere with any other user's right to privacy, including by harvesting or collecting personally-identifiable information about the Site users or posting private information about a third party.
You agree not to upload, post or otherwise transmit any User Content, software or other materials which contain a virus or other harmful or disruptive component.
You agree not to interfere with or disrupt the Site or the servers or networks connected to the Site, or disobey any requirements, procedures, policies or regulations of networks connected to the Site.
You agree not to reproduce, duplicate, copy, sell, resell or exploit for any commercial purpose, any portion of the Site, use the Site, or access to the Site.
You agree not to use any service, technology or automated system to artificially inflate the page views that your User Content receives. This includes pay-per-click services, web "robots" and any other current or future technologies. You also agree not to direct any third party to use these services, technologies or automated systems on your behalf.
You agree not to use any technology, service or automated system to post more User Content than an individual could upload in a given period of time. You also agree not to direct any third party to use these services, technologies or automated systems on your behalf.
Any conduct that in CNN's sole discretion restricts or inhibits anyone else from using or enjoying the Site will not be permitted. CNN reserves the right in its sole discretion to remove or edit User Content by you and to terminate Your Account for any reason.
CNN does not vouch for the accuracy or credibility of any User Content, and does not take any responsibility or assume any liability for any actions you may take as a result of reading User Content posted on the Site. Through your use of Interactive Areas, you may be exposed to content that you may find offensive, objectionable, harmful, inaccurate or deceptive. There may also be risks of dealing with underage persons, people acting under false pretense, international trade issues and foreign nationals. By using Interactive Areas, you assume all associated risks.
7. Indemnification.
You agree to defend, indemnify and hold harmless CNN, its affiliates and their respective directors, officers, employees and agents from and against all claims and expenses, including attorneys' fees, arising out of the use of the Site by you or your Account. CNN reserves the right to take over the exclusive defense of any claim for which we are entitled to indemnification under this Section. In such event, you shall provide CNN with such cooperation as is reasonably requested by CNN.
8. Termination.
CNN may terminate or suspend these Terms of Use at any time without notice to you. Without limiting the foregoing, CNN shall have the right to immediately terminate Your Account in the event of any conduct by you which CNN, in its sole discretion, considers to be unacceptable, or in the event of any breach by you of these Terms of Use. The provisions of Sections 1 - 13 shall survive termination of these Terms of Use.
9. Trademarks.
CNN, its parent, subsidiaries and affiliates, own all rights to their logos and trademarks used in connection with the Site. All other logos and trademarks appearing on the Site are the property of their respective owners.


### Evaluators in Haystack

__Evaluators__ in Haystack are versatile components that can operate independently or as integral parts of a pipeline.

We'll construct an evaluation pipeline to efficiently obtain scores from all evaluators in a single pass. Additionally, Haystack provides functionality to generate a comprehensive evaluation report.

#### Creating FlowJudge evaluators using the HaystackFlowJudge class

We can use our integration with Haystack to create FlowJudge evaluators in a flexible way. The process is as follows:
1. Create a `CustomMetric` that will be used to compute the score for the evaluator.
2. Initialize the model - We are using the transformers configuration for Flow-Judge-v0.1.
3. Instantiate the `HaystackFlowJudge` evaluator.

> **Important Note on Model Selection:**
> 
> There's a known issue with Phi-3 models producing gibberish outputs for contexts exceeding 4096 tokens (input and output combined). While this has been addressed in recent updates to the transformers library, it remains an issue in the vLLM engine. We recommend the following:
> 
> - For longer contexts: Use the `Flow-Judge-v0.1_HF` model configuration.
> - **Caveat:** Inference with transformers is significantly slower than with optimized runtimes.
> 
> This approach ensures reliable outputs for extensive contexts, albeit with a trade-off in processing speed.
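
Because the issue depends on the combined input and output length, a rough size check before choosing a backend can be useful. The sketch below is only a heuristic and is not part of `flow-judge`: it assumes roughly four characters per token (a rule of thumb for English text, not the actual Phi-3 tokenizer count), and `estimate_tokens` and `exceeds_limit` are hypothetical helper names. The output budget of 1000 mirrors the `max_new_tokens` value that appears in the run logs later in this notebook.

```python
# Rough heuristic only: ~4 characters per token is a rule of thumb for English
# text, not the count the real Phi-3 tokenizer would produce.
CHARS_PER_TOKEN = 4    # assumption
CONTEXT_LIMIT = 4096   # threshold mentioned in the note above
OUTPUT_BUDGET = 1000   # mirrors max_new_tokens shown in the run logs


def estimate_tokens(text: str) -> int:
    """Approximate the number of tokens in `text` (hypothetical helper)."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def exceeds_limit(question: str, context: str) -> bool:
    """Return True if input plus output budget likely exceeds the limit."""
    total = estimate_tokens(question) + estimate_tokens(context) + OUTPUT_BUDGET
    return total > CONTEXT_LIMIT


# Placeholder strings standing in for a short and a long retrieved context
short_context = "CNN may terminate Your Account." * 10
long_context = "You agree not to upload harmful content. " * 400

print(exceeds_limit("Does CNN permit bots?", short_context))
print(exceeds_limit("Does CNN permit bots?", long_context))
```

If many samples are flagged, the transformers-based configuration recommended above is likely the safer choice. For an exact count, the model's own tokenizer would be needed.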

In [12]:
from flow_judge.integrations.haystack import HaystackFlowJudge
from flow_judge.metrics.metric import CustomMetric, RubricItem
from flow_judge.metrics.presets import RESPONSE_FAITHFULNESS_5POINT

from flow_judge.models import Vllm #, Llamafile, Hf

# If you are running on an Ampere GPU or newer, create a model using vLLM
model = Vllm()

# If you are low on VRAM, you can use the AWQ-quantized version of the model with vLLM
# model = Vllm(quantized=True)

# If you are not running on an Ampere GPU or newer, use Hugging Face Transformers without flash attention
# model = Hf(flash_attn=False)

# If you are not running on an Nvidia GPU (for example, macOS on Apple Silicon), use Llamafile
# model = Llamafile()

INFO:flow_judge.models.huggingface:Downloading the model from Hugging Face Hub using hf-transfer for faster downloads...


INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


We create the context relevancy metric from scratch. To learn more about creating custom metrics, refer to the [custom metrics tutorial](https://github.com/flowaicom/flow-judge/blob/main/examples/2_custom_evaluation_criteria.ipynb).

In [13]:
# Context relevancy
cr_criteria = "Based on the provided query and context, how relevant and sufficient is the context for responding to the query?"
cr_rubric = [
    RubricItem(
        score=1,
        description="The context provided is not relevant or insufficient to respond to the query."
    ),
    RubricItem(
        score=2,
        description="The context is mostly irrelevant to the query. It may contain some tangentially related information but is insufficient for adequately responding to the query."
    ),
    RubricItem(
        score=3,
        description="The context is somewhat relevant to the query. It contains some information that could be used to partially respond to the query, but key details are missing for a complete response."
    ),
    RubricItem(
        score=4,
        description="The context is mostly relevant to the query. It contains most of the necessary information to respond to the query, but may be missing some minor details."
    ),
    RubricItem(
        score=5,
        description="The context is highly relevant to the query. It contains all the necessary information to comprehensively respond to the query without needing any additional context."
    )
]
cr_metric = CustomMetric(
    name="Context Relevancy",
    criteria=cr_criteria,
    rubric=cr_rubric,
    required_inputs=["question"],
    required_output="contexts"
)

To create the faithfulness evaluator, we use the `RESPONSE_FAITHFULNESS_5POINT` preset from the flow-judge library as a template.

> Note that the metric's required inputs and output must use the keys that the RAG pipeline produces, so we update them to match. The preset's score descriptions remain applicable after this change.
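
A small check can catch key mismatches before a long evaluation run. This is a minimal sketch using plain Python data rather than the `flow-judge` API: `missing_keys` is a hypothetical helper, and the input dictionaries are placeholders standing in for the real pipeline outputs.

```python
def missing_keys(required_inputs, required_output, run_inputs):
    """Return the metric keys that are absent from the evaluator's run inputs."""
    expected = list(required_inputs) + [required_output]
    return [key for key in expected if key not in run_inputs]


# Keys declared by the faithfulness metric in this tutorial
ff_required_inputs = ["question", "contexts"]
ff_required_output = "predicted_answers"

# Placeholder data standing in for the real pipeline outputs
ff_run_inputs = {
    "question": ["Does CNN permit bots?"],
    "contexts": ["You agree not to use any automated system ..."],
    "predicted_answers": ["No."],
}
print(missing_keys(ff_required_inputs, ff_required_output, ff_run_inputs))

# A mismatched key name (hypothetical "response") is reported as missing
bad_inputs = {"question": ["q"], "contexts": ["c"], "response": ["a"]}
print(missing_keys(ff_required_inputs, ff_required_output, bad_inputs))
```

An empty list means every key the metric declares is present in the inputs.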

In [14]:
ff_criteria = RESPONSE_FAITHFULNESS_5POINT.criteria
ff_rubric = RESPONSE_FAITHFULNESS_5POINT.rubric

display(Markdown(f"**Criteria:** {ff_criteria}"))
display(Markdown("**Rubric:**"))

for item in ff_rubric:
    display(Markdown(f"- **Score {item.score}:** {item.description}"))

ff_metric = CustomMetric(
    name="Faithfulness",
    criteria=ff_criteria,
    rubric=ff_rubric,
    required_inputs=["question", "contexts"],
    required_output="predicted_answers"
)

**Criteria:** Based on the given context, evaluate how consistent and faithful the generated response is to the context. The response should not contain any hallucinated or fabricated information that is not supported by the context.

**Rubric:**

- **Score 1:** The response is completely inconsistent with the provided context. It contains significant amount of hallucinated or fabricated information that directly contradicts or is not supported at all by the context.

- **Score 2:** The response is mostly inconsistent with the provided context. While it may contain some information from the context, it introduces a substantial amount of hallucinated or fabricated details that deviate from the context.

- **Score 3:** The response is somewhat consistent with the provided context. It includes a mix of information from the context and some hallucinated or fabricated details. The fabrications are minor and do not significantly contradict the context.

- **Score 4:** The response is mostly consistent with the provided context. The vast majority of the content is supported by the context, with only minor and inconsequential inconsistencies or fabrications, if any.

- **Score 5:** The response is completely consistent with and faithful to the provided context. All details in the response are directly supported by the context, without any hallucinated or fabricated information.

We can now create the Flow Judge evaluators:

In [15]:
cr_evaluator = HaystackFlowJudge(
    metric=cr_metric,
    model=model,  # the Flow-Judge-v0.1 model created above
    progress_bar=True,
    raise_on_failure=True,  # raise an error if the pipeline run fails
    save_results=True,  # save evaluation results to disk
    fail_on_parse_error=False,  # if True, fail on a parsing error; if False, return "Error" and a score of -1
)

ff_evaluator = HaystackFlowJudge(
    metric=ff_metric,
    model=model,
    progress_bar=True,
    raise_on_failure=True,
    save_results=True,
    fail_on_parse_error=False
)
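
With `fail_on_parse_error=False`, the comment in the cell above indicates that a parsing failure yields "Error" feedback and a score of -1. Such sentinel values would distort an average if left in. The sketch below shows one way to exclude them. It assumes each result is a dictionary with `feedback` and `score` keys, the data is illustrative rather than real evaluator output, and `mean_valid_score` is a hypothetical helper.

```python
from statistics import mean

PARSE_ERROR_SCORE = -1  # sentinel described in the evaluator comment above


def mean_valid_score(results):
    """Average the scores, skipping parse-error sentinels, and count failures."""
    valid = [r["score"] for r in results if r["score"] != PARSE_ERROR_SCORE]
    n_failed = len(results) - len(valid)
    return (mean(valid) if valid else None), n_failed


# Illustrative data, not real evaluator output
example_results = [
    {"feedback": "Fully supported by the context.", "score": 5},
    {"feedback": "Error", "score": -1},
    {"feedback": "Mostly supported.", "score": 4},
]

avg, failed = mean_valid_score(example_results)
print(avg, failed)
```

Reporting the number of failures alongside the mean makes it clear how many samples the average actually covers.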

#### Haystack evaluators

Now let's create the semantic answer similarity evaluator using the Haystack implementation. This evaluator will use the same embedding model as the retriever in the RAG pipeline.

In [16]:
from haystack.components.evaluators.sas_evaluator import SASEvaluator

sas_evaluator = SASEvaluator(model="sentence-transformers/all-MiniLM-L6-v2")
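
To build intuition for the metric: as we understand it, with a bi-encoder model such as `all-MiniLM-L6-v2`, semantic answer similarity comes down to embedding the predicted and ground-truth answers and comparing them with cosine similarity, then averaging over samples. The sketch below illustrates that idea in plain Python with made-up three-dimensional vectors in place of real embeddings. It is not Haystack's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical values, one vector per answer)
predicted = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ground_truth = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

# One similarity per (predicted, ground truth) pair, then the mean
individual_scores = [cosine_similarity(p, g) for p, g in zip(predicted, ground_truth)]
score = sum(individual_scores) / len(individual_scores)
print(individual_scores, score)
```

Identical vectors score 1.0, while vectors pointing in different directions score lower.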

### Evaluation pipeline

We can now create a Haystack evaluation pipeline that evaluates the outputs of the RAG pipeline and collects the results.

In [17]:
eval_pipeline = Pipeline()

# add components to the pipeline
eval_pipeline.add_component("sas_evaluator", sas_evaluator)
eval_pipeline.add_component("cr_evaluator", cr_evaluator)
eval_pipeline.add_component("ff_evaluator", ff_evaluator)

> Note that executing the following cell might take a while due to the size of the inputs, especially on a machine with limited resources.

In [18]:
results = eval_pipeline.run(
    {
        "sas_evaluator": {
            'predicted_answers': rag_answers,
            'ground_truth_answers': ground_truths,
        },
        "cr_evaluator": {
            'question': questions,
            'contexts': str_retrieved_docs,
        },
        "ff_evaluator": {
            'question': questions,
            'contexts': str_retrieved_docs,
            'predicted_answers': rag_answers,
        }
    }
)

INFO:haystack.core.pipeline.base:Warming up component sas_evaluator...
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
INFO:haystack.core.pipeline.pipeline:Running component sas_evaluator


INFO:haystack.core.pipeline.pipeline:Running component cr_evaluator
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class (https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)
INFO:flow_judge.models.huggingface:Automatically determined batch size: 2
Processing batches:   0%|          | 0/10 [00:00<?, ?it/s]Both `max_new_tokens` (=1000) and `max_length`(=8192) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Processing batches:  10%|█         | 1/10 [00:17<02:35, 17.26s/it]Both `max_new_tokens` (=1000) and `max_length`(=8192) seem to have been set. `max_new_tok

In [19]:
results

{'sas_evaluator': {'score': 0.8931622713804245,
  'individual_scores': [0.8243861198425293,
   0.9228897094726562,
   0.8793361186981201,
   0.9031497836112976,
   0.8966434597969055,
   0.8381413221359253,
   0.9597147703170776,
   0.9309449195861816,
   0.798747181892395,
   0.9507496356964111,
   0.818832516670227,
   0.853968620300293,
   0.8608775734901428,
   0.8927598595619202,
   0.930637001991272,
   0.9335268139839172,
   0.8244764804840088,
   0.9270716309547424,
   0.9546440243721008,
   0.9617478847503662]},
 'cr_evaluator': {'results': [{'feedback': 'The context provided is highly relevant to the query about whether CNN permits using bots to artificially increase page visits for certain content. The context includes specific clauses that directly address this issue:\n\n1. "You agree not to use any technology, service or automated system to artificially inflate the page views that your User Content receives."\n2. "You also agree not to direct any third party to use these s

### Evaluation report

Haystack provides a convenient way to generate an evaluation report using the `EvaluationRunResult` class.

In [20]:
from haystack.evaluation.eval_run_result import EvaluationRunResult

inputs = {
    "question": questions,
    "contexts": str_retrieved_docs,
    "answer": ground_truths,
    "predicted_answer": rag_answers,
}

evaluation_result = EvaluationRunResult(run_name="report", inputs=inputs, results=results)
evaluation_result.score_report()


| | metrics | score |
|---|---|---|
| 0 | sas_evaluator | 0.893162 |
| 1 | cr_evaluator | 4.45 |
| 2 | ff_evaluator | 4.2 |
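
The SAS score in the report matches the arithmetic mean of the `individual_scores` in the raw results. The quick check below uses the values copied from the `sas_evaluator` output above:

```python
# individual_scores copied from the sas_evaluator output above
individual_scores = [
    0.8243861198425293, 0.9228897094726562, 0.8793361186981201, 0.9031497836112976,
    0.8966434597969055, 0.8381413221359253, 0.9597147703170776, 0.9309449195861816,
    0.798747181892395, 0.9507496356964111, 0.818832516670227, 0.853968620300293,
    0.8608775734901428, 0.8927598595619202, 0.930637001991272, 0.9335268139839172,
    0.8244764804840088, 0.9270716309547424, 0.9546440243721008, 0.9617478847503662,
]

mean_score = sum(individual_scores) / len(individual_scores)
print(round(mean_score, 6))
```

The result agrees with the 0.893162 shown for `sas_evaluator` in the report.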


We can also easily convert the results to a pandas DataFrame.

In [22]:
results_df = evaluation_result.to_pandas()
results_df.head(5)

| | question | contexts | answer | predicted_answer | sas_evaluator | cr_evaluator | ff_evaluator |
|---|---|---|---|---|---|---|---|
| 0 | Does CNN permit using bots to artificially inc... | The Site contains copyrighted material, tradem... | No, CNN does not permit using bots to artifici... | No, CNN does not permit the use of any technol... | 0.824386 | 5.0 | 5.0 |
| 1 | Does eBay promise that the product data it pro... | When you provide content using our Services (d... | No, eBay does not promise that the product dat... | Yes, eBay does acknowledge and warrant that th... | 0.92289 | 4.0 | 2.0 |
| 2 | Will certain terms remain in force notwithstan... | 16. Termination\nYou may terminate these Terms... | Yes, certain terms will remain in force notwit... | Yes, certain terms will remain in force even i... | 0.879336 | 5.0 | 4.0 |
| 3 | Where a Netflix subscriber requires customer s... | 8. Miscellaneous\n\n8.1. Governing Law. These ... | Yes, according to the provided context, there ... | Yes, according to the provided context, if a N... | 0.90315 | 5.0 | 5.0 |
| 4 | Can Apple disclose personal information about ... | Violation of These Terms of Use\nApple may dis... | Yes, Apple can disclose personal information a... | Yes, Apple can disclose personal information a... | 0.896643 | 5.0 | 5.0 |
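
Per-sample scores make it possible to spot cases where the metrics disagree. In the second row above, the predicted answer starts with "Yes" while the ground truth starts with "No", yet SAS is still high (0.92289). The faithfulness score of 2.0 is what signals the problem. The sketch below flags such rows using the scores from the five rows shown. It uses plain dictionaries instead of the DataFrame so it runs on its own, and the cutoff of 3.0 is a hypothetical choice for illustration.

```python
# Scores copied from the five rows displayed above
rows = [
    {"index": 0, "sas": 0.824386, "cr": 5.0, "ff": 5.0},
    {"index": 1, "sas": 0.92289, "cr": 4.0, "ff": 2.0},
    {"index": 2, "sas": 0.879336, "cr": 5.0, "ff": 4.0},
    {"index": 3, "sas": 0.90315, "cr": 5.0, "ff": 5.0},
    {"index": 4, "sas": 0.896643, "cr": 5.0, "ff": 5.0},
]

FF_THRESHOLD = 3.0  # hypothetical cutoff for "low faithfulness"

# Collect the indices of rows whose faithfulness score falls below the cutoff
flagged = [row["index"] for row in rows if row["ff"] < FF_THRESHOLD]
print(flagged)
```

The flagged rows are good candidates for manual review of the retrieved context and the generated answer.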


## Summary

In this tutorial, we demonstrated how to evaluate a Retrieval-Augmented Generation (RAG) pipeline using `FlowJudge` and Haystack. Key aspects covered include:

1. Setting up a basic RAG pipeline with Haystack:
   - Using `InMemoryDocumentStore` for document storage
   - Implementing `SentenceTransformersDocumentEmbedder` for document embedding
   - Utilizing `HuggingFaceAPIChatGenerator` for answer generation

2. Creating custom evaluators with `FlowJudge`:
   - Developing a custom metric for context relevancy
   - Adapting a preset metric for faithfulness
   - Using the `HaystackFlowJudge` class to integrate FlowJudge evaluators into Haystack

3. Building a comprehensive evaluation pipeline:
   - Incorporating both FlowJudge and native Haystack evaluators
   - Using `SASEvaluator` for semantic answer similarity

4. Executing the evaluation and analyzing results:
   - Running the evaluation pipeline on a subset of questions
   - Utilizing `EvaluationRunResult` to generate a summary report
   - Converting results to a pandas DataFrame for further analysis

5. Demonstrating the flexibility of `FlowJudge`:
   - Seamless integration with Haystack's evaluation framework
   - Ability to create custom metrics and adapt existing ones
   - Using open-source models to avoid reliance on proprietary large language models

This tutorial showcases how `FlowJudge` can be effectively used to evaluate and iteratively improve RAG pipelines built with Haystack, providing a comprehensive assessment of performance across multiple dimensions including semantic similarity, context relevancy, and faithfulness.