# Evaluating a Haystack RAG pipeline with FlowJudge

## Overview

This tutorial shows how to use FlowJudge evaluators with Haystack, a popular framework for building RAG pipelines.

We will use our Haystack integration, the `HaystackFlowJudge` class, a Haystack component that wraps the `FlowJudge` class.

### Additional requirements

- Haystack: Make sure you have Haystack installed. You can install it via pip:
  ```bash
  pip install haystack-ai
  ```

- Sentence Transformers: Make sure you have Sentence Transformers installed. You can install it via pip:
  ```bash
  pip install "sentence-transformers>=3.0.0"
  ```



## Dataset

For this tutorial, we use a subset of the `LegalBench` dataset, which contains contracts and questions about them.

We have introduced some noise into the dataset by swapping the answers to some questions. See the dataset card for more details.

In [1]:
from datasets import load_dataset

ds = load_dataset("flowaicom/legalbench_contracts_qa_subset", "default")

In [2]:
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'context', 'original_answer', 'answer', 'answer_with_reasoning'],
        num_rows: 100
    })
})

The dataset contains:
- Questions: A question about the contract.
- Context: The contract itself.
- Original answer: The original answer to the question which can be considered as the ground truth.
- Answer: The answer used for generating the answer with reasoning, which can include noise with respect to the original answer.
- Answer with reasoning: An answer to the question including the reasoning for the answer based on the contract.

Let's prepare the data for our RAG pipeline.

We are going to use the context (the contract) to create the documents.

For this example, we only use instances without perturbations, that is, rows where `answer` matches `original_answer`.

We use `answer_with_reasoning` as the ground truth for the evaluators.


In [3]:
from haystack import Document

filtered_ds = ds.filter(lambda x: x['original_answer'] == x['answer'])

all_documents = [Document(content=context) for context in filtered_ds['train']['context']]
all_questions = list(filtered_ds['train']['question'])
all_ground_truths = list(filtered_ds['train']['answer_with_reasoning'])

print(f"Number of documents: {len(all_documents)}")
print(f"Number of questions: {len(all_questions)}")
print(f"Number of ground truths: {len(all_ground_truths)}")

Number of documents: 59
Number of questions: 59
Number of ground truths: 59
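
The filter keeps only rows whose `answer` matches `original_answer`; in the run above, 59 of the 100 rows pass this check. A minimal sketch of the same check on plain dictionaries, using invented rows rather than the real dataset, shows how clean and perturbed instances separate:

```python
# Toy rows that mimic two of the dataset's columns; the values are invented.
rows = [
    {"id": 1, "original_answer": "Yes", "answer": "Yes"},
    {"id": 2, "original_answer": "No", "answer": "Yes"},  # perturbed: answer was swapped
    {"id": 3, "original_answer": "No", "answer": "No"},
]


def is_clean(row):
    """A row is clean when its answer was not swapped."""
    return row["original_answer"] == row["answer"]


clean_ids = [row["id"] for row in rows if is_clean(row)]
perturbed_ids = [row["id"] for row in rows if not is_clean(row)]

print("clean:", clean_ids)
print("perturbed:", perturbed_ids)
```

The perturbed rows are not needed here, but they could serve later as negative examples when checking whether an evaluator catches wrong answers.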


For efficiency, we are only using this very small dataset for this tutorial. Below is an example of one of the questions in the dataset.

In [4]:
from IPython.display import Markdown, display

display(Markdown(f"**Question:** {all_questions[0]}"))
display(Markdown(f"**Context:** {all_documents[0].content}"))
display(Markdown(f"**Ground truth answer:** {all_ground_truths[0]}"))

**Question:** Does CNN permit using bots to artificially increase page visits for certain content?

**Context:** The following terms apply to content submitted by users, and user conduct, on the Site's Interactive Areas, including CNN iReport:

(A) Interactive Areas. The Site may contain comments sections, discussion forums, or other interactive features, including CNN iReport, CNN's user-generated news community ("Interactive Areas") in which you may post or upload user-generated content, including but not limited to iReports, comments, video, photos, messages, other materials or items (collectively, "User Content"). You are solely responsible for your use of any Interactive Areas and you use them at your own risk. Interactive Areas are available for individuals aged 13 years or older. By submitting User Content to an Interactive Area, you represent that you are 13 years of age or older and, if you are under the age of 18, you either are an emancipated minor, or have obtained the legal consent of your parent or legal guardian to enter into these Terms of Use, submit content, participate on the Site, and fulfill the obligations set forth in these Terms of Use, which forms a binding contract between you and CNN. Employees of Cable News Network, Inc. may not submit User Content without permission from their supervisors.

(B) Community Guidelines. By submitting any User Content or participating in an Interactive Area within or in connection with the Site, you agree to abide by the following rules of conduct:
You agree not to upload, post or otherwise transmit any User Content that:
violates or infringes in any way upon the rights of others, including any statements which may defame, harass, stalk or threaten others.
you know to be false, misleading or inaccurate.
contains blatant expressions of bigotry, racism, racially or ethnically offensive content, hate speech, abusiveness, vulgarity or profanity.
contains or advocates pornography or sexually explicit content, pedophilia, incest, bestiality, or that is otherwise obscene or lewd.
violates any law or advocates or provides instruction on dangerous, illegal, or predatory acts, or discusses illegal activities with the intent to commit them.
advocates violent behavior.
poses a reasonable threat to personal or public safety.
contains violent images of killing or physical abuse that appear to have been captured solely, or principally, for exploitive, prurient, or gratuitous purposes.
is protected by copyright, trademark, trade secret, right of publicity or other proprietary right without the express permission of the owner of such copyright, trademark, trade secret, right of publicity or other proprietary right. The burden of determining that any User Content is not protected by copyright, trademark, trade secret, right of publicity or other proprietary right rests with you. You shall be solely liable for any damage resulting from any infringement of copyrights, trademarks, trade secrets, rights of publicity or other proprietary rights or any other harm resulting from such a submission. Any person determined by CNN, in its sole discretion, to have violated the intellectual property or other rights of others shall be barred from submitting or posting any further material on the Site.
does not generally pertain to the designated topic or theme of any Interactive Area.
contains any unsolicited or unauthorized advertising or promotional materials with respect to products or services, "junk mail," "spam," "chain letters," "pyramid schemes," or any other form of solicitation.
You agree not to engage in activity that would constitute a criminal offense or give rise to a civil liability.
You agree that if necessary, you have the consent of each and every identifiable natural person in any submission to use such persons name or likeness in the manner contemplated by the Site.
You agree that any person who appears in your submission who is a current member of the Screen Actors Guild (SAG), the American Federation of Television and Radio Actors (AFTRA) or any other rights society is not entitled to compensation by CNN.
You agree not to impersonate any person or entity, including, but not limited to, CNN or any CNN employee, or falsely state or otherwise misrepresent your affiliation with any person or entity.
You agree not to represent or suggest, directly or indirectly, CNN's endorsement of User Content.
You agree not to interfere with any other user's right to privacy, including by harvesting or collecting personally-identifiable information about the Site users or posting private information about a third party.
You agree not to upload, post or otherwise transmit any User Content, software or other materials which contain a virus or other harmful or disruptive component.
You agree not to interfere with or disrupt the Site or the servers or networks connected to the Site, or disobey any requirements, procedures, policies or regulations of networks connected to the Site.
You agree not to reproduce, duplicate, copy, sell, resell or exploit for any commercial purpose, any portion of the Site, use the Site, or access to the Site.
You agree not to use any service, technology or automated system to artificially inflate the page views that your User Content receives. This includes pay-per-click services, web "robots" and any other current or future technologies. You also agree not to direct any third party to use these services, technologies or automated systems on your behalf.
You agree not to use any technology, service or automated system to post more User Content than an individual could upload in a given period of time. You also agree not to direct any third party to use these services, technologies or automated systems on your behalf.
Any conduct that in CNN's sole discretion restricts or inhibits anyone else from using or enjoying the Site will not be permitted. CNN reserves the right in its sole discretion to remove or edit User Content by you and to terminate Your Account for any reason.
CNN does not vouch for the accuracy or credibility of any User Content, and does not take any responsibility or assume any liability for any actions you may take as a result of reading User Content posted on the Site. Through your use of Interactive Areas, you may be exposed to content that you may find offensive, objectionable, harmful, inaccurate or deceptive. There may also be risks of dealing with underage persons, people acting under false pretense, international trade issues and foreign nationals. By using Interactive Areas, you assume all associated risks.

**Ground truth answer:** No, CNN does not permit using bots to artificially increase page visits for certain content. 

This is explicitly stated in the Community Guidelines, which prohibits users from using "any service, technology or automated system to artificially inflate the page views that your User Content receives." This rule specifically includes "web 'robots' and any other current or future technologies" and extends to directing third parties to use these services on the user's behalf. This means that any attempt to manipulate page views through automated means is a direct violation of CNN's terms of use and can result in the removal of the content or even termination of the user's account.

## Creating a RAG pipeline with Haystack

We will create a very simple RAG pipeline with Haystack. For a more detailed explanation of how to build the RAG pipeline, please refer to this tutorial in the Haystack documentation: [Tutorial: Evaluating RAG pipelines](https://haystack.deepset.ai/tutorials/35_evaluating_rag_pipelines)

>Note that we have made minor modifications to the pipeline for this tutorial. In particular, we are using `HuggingFaceAPIChatGenerator` and `ChatPromptBuilder`.

### Indexing the documents

We need to index the documents so we can later use a retriever to find the most similar document to the question.

We are using the `InMemoryDocumentStore`, which is a simple in-memory document store that doesn't require setting up a database.

We are also using a small open-source embedding model from Sentence Transformers to convert the documents into embeddings.

Finally, we are using the `DocumentWriter` to write the documents into the document store.

In [5]:
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

document_store = InMemoryDocumentStore()

document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

indexing = Pipeline()
indexing.add_component(instance=document_embedder, name="document_embedder")
indexing.add_component(instance=document_writer, name="document_writer")

indexing.connect("document_embedder.documents", "document_writer.documents")

indexing.run({"document_embedder": {"documents": all_documents}})



Batches:   0%|          | 0/2 [00:00<?, ?it/s]

ID 'c0836ba2f9d26978182748cd7560eb3f4e88a82850b29a2243b1cc63da11b6e5' already exists
ID '88987c8c6fbb5fac2bbe6689fce472c5df6f7d650fbe42d218dcf7e1b77aa975' already exists
ID '676bedf216ff078645168c0b1567e4c78dbc22ec710fc74216a74f7a0b5167fc' already exists
ID 'f4fcb43ca4803f5ac8a34374d7cbf9cf248347e5d9ca63f32aaac6296794de6d' already exists
ID 'c59aba33746d10420a8e8f09a70d2fa5961a81fc9c41cdc47850a49f6568808b' already exists
ID '8805408d3f6c8ae955cf721219c0b4b3891438faa059189e1fd4c1111712c205' already exists


{'document_writer': {'documents_written': 53}}
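
The indexing output reports 53 documents written even though 59 were passed in: six inputs collide with an ID that is already in the store, and `DuplicatePolicy.SKIP` drops them (hence the "already exists" messages). The sketch below imitates that behaviour with a plain dictionary keyed by a content hash. It is a simplification for illustration only: hashing just the text is an assumption, and `content_id` and `write_skip_duplicates` are hypothetical helpers, not Haystack functions.

```python
import hashlib


def content_id(text):
    """Hypothetical stand-in for a document ID: a SHA-256 hash of the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def write_skip_duplicates(store, texts):
    """Add texts to `store` (a dict), skipping IDs that already exist.

    Returns the number of texts actually written.
    """
    written = 0
    for text in texts:
        doc_id = content_id(text)
        if doc_id in store:
            continue  # mirrors a "skip" duplicate policy: keep the first copy
        store[doc_id] = text
        written += 1
    return written


store = {}
texts = ["contract A", "contract B", "contract A", "contract C", "contract B"]
written = write_skip_duplicates(store, texts)
print(f"written: {written} of {len(texts)}")
```

One practical consequence is that the document store holds fewer entries than there are questions, so several questions share the same source contract.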

### Create the RAG pipeline

Haystack lets us easily create a RAG pipeline using:

- `InMemoryEmbeddingRetriever`, which retrieves the documents most relevant to the query.
- `HuggingFaceAPIChatGenerator` to generate the answer to the question. We are going to use a small open model for this example.

>Note that you can use the free serverless inference API from HuggingFace to quickly experiment with different models. However, it is rate-limited and not suitable for production. To use the API, you just need to provide [your free HuggingFace API token](https://huggingface.co/settings/tokens).



In [6]:
from haystack.components.builders import AnswerBuilder, ChatPromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.utils import Secret
from haystack.utils.hf import HFGenerationAPIType
from haystack.dataclasses import ChatMessage

# The HuggingFace token is read from the HF_TOKEN environment variable
api_type = HFGenerationAPIType.SERVERLESS_INFERENCE_API
llm = HuggingFaceAPIChatGenerator(
    api_type=api_type,
    api_params={"model": "microsoft/Phi-3.5-mini-instruct"},
    token=Secret.from_env_var("HF_TOKEN"),
)


template_str = """
You have to answer the following question based on the given context information only.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""
template = [ChatMessage.from_user(template_str)]
prompt_builder = ChatPromptBuilder(template=template)

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    "query_embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
)
rag_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=3))
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", llm)
rag_pipeline.add_component("answer_builder", AnswerBuilder())

rag_pipeline.connect("query_embedder", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

<haystack.core.pipeline.pipeline.Pipeline object at 0x79588c4e1ea0>
🚅 Components
  - query_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: ChatPromptBuilder
  - llm: HuggingFaceAPIChatGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - query_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.messages (List[ChatMessage])
  - llm.replies -> answer_builder.replies (List[ChatMessage])

Let's test the pipeline with a single question.

In [17]:
# Quick test of the pipeline
question = "Does CNN permit using bots to artificially increase page visits for certain content?"

response = rag_pipeline.run(
    {
        "query_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
print(response["answer_builder"]["answers"][0].data)


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

No, CNN does not permit the use of any technology, service, or automated system to artificially inflate page views that User Content receives. This includes pay-per-click services, web "robots," and any other current or future technologies. Users are also prohibited from directing any third parties to use these services, technologies, or automated systems on their behalf.


In [28]:
# display the retrieved documents and similarity scores
for i, doc in enumerate(response['answer_builder']['answers'][0].documents, 1):
    display(Markdown(f"""**Document {i} (Score: {doc.score:.4f}):**\n\n{doc.content[:500]}..."""))

**Document 1 (Score: 0.3840):**

The Site contains copyrighted material, trademarks and other proprietary information, including, but not limited to, text, software, photos, video, graphics, music and sound, and the entire contents of the Site are copyrighted as a collective work under the United States copyright laws. CNN owns copyright in the selection, coordination, arrangement and enhancement of such content, as well as in the content original to it. You may not modify, publish, transmit, participate in the transfer or sale...

**Document 2 (Score: 0.3641):**

The following terms apply to content submitted by users, and user conduct, on the Site's Interactive Areas, including CNN iReport:

(A) Interactive Areas. The Site may contain comments sections, discussion forums, or other interactive features, including CNN iReport, CNN's user-generated news community ("Interactive Areas") in which you may post or upload user-generated content, including but not limited to iReports, comments, video, photos, messages, other materials or items (collectively, "Use...

**Document 3 (Score: 0.3209):**

7. Indemnification.
You agree to defend, indemnify and hold harmless CNN, its affiliates and their respective directors, officers, employees and agents from and against all claims and expenses, including attorneys' fees, arising out of the use of the Site by you or your Account. CNN reserves the right to take over the exclusive defense of any claim for which we are entitled to indemnification under this Section. In such event, you shall provide CNN with such cooperation as is reasonably requeste...
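
The scores above come from comparing the query embedding with each stored document embedding and keeping the `top_k=3` best matches. A rough sketch of that ranking step, using cosine similarity and hand-written toy vectors, follows. The real embeddings come from the Sentence Transformers model, the similarity function used by the document store may differ (cosine is an assumption here), and `top_k_by_cosine` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_by_cosine(query, docs, k):
    """Return the k (name, score) pairs with the highest similarity to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; the names and values are invented.
docs = {
    "copyright": [1.0, 0.2, 0.0],
    "community": [0.6, 0.8, 0.0],
    "indemnity": [0.0, 0.0, 1.0],
}
query = [0.0, 1.0, 0.0]

ranked = top_k_by_cosine(query, docs, k=2)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

Note that in the test run above the passage containing the answer is ranked second, not first, which is one reason to retrieve more than a single document.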

## Evaluating the pipeline

Now that we have our very first prototype of a RAG pipeline, we can start evaluating it.

To demonstrate how to use `FlowJudge` in the Haystack framework, we are going to evaluate the pipeline using a combination of statistical evaluators and model-based evaluators.

Haystack uses the concept of an __evaluation pipeline__: a pipeline that computes the scoring metrics used to measure the performance of the RAG pipeline.

Our evaluation pipeline will consist of:
- Semantic Answer Similarity: Evaluates the semantic similarity between the generated answer and the ground truth, rather than their lexical overlap.
- Context Relevancy: Assesses the relevance of the retrieved documents to the query.
- Faithfulness: Evaluates to what extent a generated answer is based on the retrieved documents.

For context relevancy and faithfulness, we will create FlowJudge evaluators so we don't need to rely on proprietary large models.
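
Each evaluator consumes a different slice of the data collected from the RAG pipeline. The sketch below, built on invented toy values, shows one way to line the pieces up: answer similarity compares predictions with ground truths, context relevancy looks at questions and retrieved contexts, and faithfulness needs questions, contexts, and predictions. The dictionary keys and the `check_aligned` helper are illustrative assumptions, not the exact parameter names of any evaluator.

```python
# Toy data standing in for the outputs gathered from the RAG pipeline.
questions = ["Does X allow bots?", "Can Y share my data?"]
ground_truths = ["No, bots are prohibited.", "Yes, when required by law."]
predicted_answers = ["No, X forbids bots.", "Yes, if the law requires it."]
retrieved_contexts = [
    ["You agree not to use automated systems.", "General terms."],
    ["Y may disclose data to comply with legal process."],
]


def check_aligned(*columns):
    """Ensure every column has one entry per question; return that count."""
    lengths = {len(column) for column in columns}
    if len(lengths) != 1:
        raise ValueError(f"Misaligned inputs, lengths found: {sorted(lengths)}")
    return lengths.pop()


n = check_aligned(questions, ground_truths, predicted_answers, retrieved_contexts)

# Which inputs each metric needs (key names are illustrative only).
evaluator_inputs = {
    "answer_similarity": {
        "ground_truths": ground_truths,
        "predictions": predicted_answers,
    },
    "context_relevancy": {
        "questions": questions,
        "contexts": retrieved_contexts,
    },
    "faithfulness": {
        "questions": questions,
        "contexts": retrieved_contexts,
        "predictions": predicted_answers,
    },
}

print(f"{n} examples prepared for {len(evaluator_inputs)} metrics")
```

Checking alignment up front is cheap and catches the common mistake of evaluating answers against the wrong question after a partially failed generation run.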

### Obtain the generated answers

First of all, we need to obtain the generated answers from the RAG pipeline.

>Note that since we are using the free serverless inference API from HuggingFace, the generation process takes a few minutes. We are limiting the number of questions to 20 for now to avoid hitting the rate limit. If the execution crashes, you can run the cell again; as written, it starts over from the first question.
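
Resuming after a failure only works if finished answers are kept between runs. A minimal sketch of that idea follows: answers are cached by question and skipped on the next attempt. Here `answer_fn` stands in for the call to `rag_pipeline.run(...)`, and `answer_all`, `flaky_answer`, and the in-memory `cache` are hypothetical names introduced for illustration.

```python
def answer_all(questions, answer_fn, cache):
    """Answer every question, reusing results already stored in `cache`.

    If `answer_fn` raises, the answers collected so far remain in `cache`,
    so a later call continues with the first unanswered question.
    """
    for question in questions:
        if question in cache:
            continue  # already answered in an earlier attempt
        cache[question] = answer_fn(question)
    return [cache[question] for question in questions]


# Simulate a rate-limited API that fails the first time it sees "q2".
calls = []
failed_once = set()


def flaky_answer(question):
    calls.append(question)
    if question == "q2" and question not in failed_once:
        failed_once.add(question)
        raise RuntimeError("rate limit hit")
    return f"answer to {question}"


cache = {}
questions = ["q1", "q2", "q3"]

try:
    answer_all(questions, flaky_answer, cache)
except RuntimeError:
    pass  # the first attempt stops at q2; q1 is already cached

answers = answer_all(questions, flaky_answer, cache)
print(answers)
print(calls)
```

In a notebook, the cache would need to live in a cell that is not re-executed on retry, or be persisted to disk, for the resume to take effect.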

In [12]:
questions = all_questions[:20]
ground_truths = all_ground_truths[:20]

rag_answers = []
retrieved_docs = []

for question in questions:
    response = rag_pipeline.run(
        {
            "query_embedder": {"text": question},
            "prompt_builder": {"question": question},
            "answer_builder": {"query": question},
        }
    )
    print(f"Question: {question}")
    print("Answer from pipeline:")
    print(response["answer_builder"]["answers"][0].data)
    print("\n-----------------------------------\n")

    rag_answers.append(response["answer_builder"]["answers"][0].data)
    retrieved_docs.append(response["answer_builder"]["answers"][0].documents)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Question: Does CNN permit using bots to artificially increase page visits for certain content?
Answer from pipeline:
No, CNN does not permit the use of any technology, service, or automated system to artificially inflate the page views that your User Content receives. This includes pay-per-click services, web "robots," and any other current or future technologies. Additionally, you are not allowed to direct any third party to use these services, technologies, or automated systems on your behalf.

-----------------------------------

Question: Does eBay promise that the product data it provides will be somewhat accurate?
Answer from pipeline:
Yes, eBay represents and warrants that the product data it provides is accurate. However, they cannot guarantee that the content provided through their Services will always be available, complete, and up-to-date.

-----------------------------------

Question: Will certain terms remain in force notwithstanding a users termination of the service?
Answer from pipeline:
Yes, certain terms will remain in force notwithstanding a user's termination of the service. Specifically, sections 4 (Your Content), 6 (Things You Cannot Do), 10 (Indemnity), 11 (Disclaimers), 12 (Limitation of Liability), 13 (Governing Law and Venue), 16 (Termination), and 17 (Miscellaneous) will continue to be effective, as stated in the provided context. This means these sections will survive any termination of the service or the user's account. Additionally, the context mentions that these AUP Guidelines, which naturally should survive termination, will remain valid as per the Terms. Hence, certain terms indeed remain enforceable despite the service termination.

-----------------------------------

Question: Where a Netflix subscriber requires customer support, is there an alternative means to obtaining customer support apart from enabling Netflix to remotely access the device or devices in question?
Answer from pipeline:
Yes, according to the provided context, if a Netflix subscriber does not want to enable remote access support through their device, they can still obtain assistance through other means. However, the specific alternative means are not detailed in the given context. Typically, customers can reach out through contact details provided on the Netflix Help Center webpage, such as phone numbers, email addresses, or live chat support options.

-----------------------------------

Question: Can Apple disclose personal information about me if asked to by the government?
Answer from pipeline:
Yes, according to the context provided, Apple may disclose any information about you, including your identity, if they determine that such disclosure is necessary in connection with any investigation or complaint regarding your use of the Site, or to identify, contact, or bring legal action against someone causing injury to or interference with Apples rights or property. This action may occur when Apple believes it's required by law or to comply with a legal process, government request, or to protect the rights, property, or personal safety of Apple and its stakeholders.

-----------------------------------

Question: Am I allowed to scrape data on the services for a research project Im doing?
Answer from pipeline:
No, you are not allowed to scrape data from the Services using automated means, such as robots, spiders, scripts, software, or any manual or automatic device, tool, or process designed to mine, scrape, or collect data from the Services. Engaging in such prohibited use may result in civil, criminal, and/or administrative penalties, fines, or sanctions against you. It's recommended to seek permission from The New York Times Company or the copyright holder before attempting to access or use their content for research or any other purpose.

-----------------------------------

Question: Does Amazon take responsibility for items sold by third parties through Amazon?
Answer from pipeline:
No, Amazon does not take responsibility for items sold by third parties through Amazon services. The context clearly states that when purchasing products or services from parties other than Amazon, you are directly purchasing from those third parties. Amazon acknowledges that they are not responsible for or liable for the actions, products, or content of these third-party businesses or individuals. Users are advised to carefully review their privacy statements and other conditions of use. Amazon also states that they are not responsible for examining or evaluating the offerings of any of these businesses or individuals.

-----------------------------------

Question: Does NYT check all user content before its uploaded and posted?
Answer from pipeline:
The provided context does not explicitly mention NYT's procedures for checking all user-generated content before uploading and posting it on their services. However, users agree not to use services or content in violation of intellectual property rights, and to respect copyright laws, which implies that NYT may have policies and measures in place to prevent the violation of these laws. Detailed information regarding content review processes would typically be found in additional sections of the terms of service or within specific content moderation policies set by NYT.

-----------------------------------

Question: Are there some changes to the terms that Instructure might not let me know about?
Answer from pipeline:
Yes, Instructure may revise the Acceptable Use Policy (AUP) Guidelines from time to time, potentially with changes that may significantly alter your rights. However, if there are meaningful reductions in these rights, Instructure is committed to providing notice to affected users, for example, through an email to the associated account email address. It's important to stay informed about any such revisions by regularly checking the posted and updated versions of the AUP Guidelines on their website.

-----------------------------------

Question: If someone else is using content I posted on Instagram, could this prevent the content from being deleted?
Answer from pipeline:
Yes, if someone else is using your content in accordance with the license you granted to Verizon Media, and they have not deleted it, your content may not be deleted within 90 days of account or content deletion request. This is because the license allows the use of your content, and as long as it is being used under the terms of this license, it will retain its visibility even if other users are interacting with it. However, it's important to note that deletion will still occur eventually if it doesn't violate any laws or regulations, and the exact duration for retaining such content would be determined on a case-by-case basis.

-----------------------------------

Question: Is eBay entitled to decide which party pays the costs of arbitration?
Answer from pipeline:
Yes, according to the provided context, eBay has the authority to decide which party pays the costs of arbitration. This is outlined in Section 3 ("Costs of Arbitration") of the agreement. It states that the payment of all filing, administration, and arbitrator fees will be governed by the American Arbitration Association's rules, unless otherwise stated in the agreement. Furthermore, if the relief sought is $10,000 or less, the requesting party must submit a Demand for Arbitration, and eBay will make arrangements to pay these fees directly to the AAA. Additionally, the arbitrator may determine the claim(s) to be frivolous and in such cases, the claimant agrees to reimburse eBay for all related fees. This implies that eBay has the right and responsibility to decide on the cost arrangements of the arbitration process.

-----------------------------------

Question: Is it sufficient for me to disclose in the edit summary the compensation I received for the edits I made?
Answer from pipeline:
No, based on the provided context, it is not stated that disclosing compensation in the edit summary for the edits made is required or sufficient. The terms primarily focus on privacy, compliance with legal processes, data preservation and disclosure, data usage limits, jurisdiction, and risk management associated with the Site, Services, and Products. It does not mention anything about the process or visibility of compensation for edits. Please refer to the specific terms of service or policy section to ascertain the requirements regarding compensation disclosure.

-----------------------------------

Question: Does Verizon need to tell me if it makes changes to the services?
Answer from pipeline:
Yes, Verizon will inform you when they make changes to the services. According to the provided context, Verizon may change the terms at any time and will notify you when these changes become effective. Using the services after the changes are implemented means you agree to the new terms. If you do not agree to the new terms, you are expected to stop using the services, close your Microsoft account (if applicable), and assist any minor children with closing their respective Microsoft accounts.

-----------------------------------

Question: Does Shopify agree to ensure the encryption of all data provided to it?
Answer from pipeline:
No, Shopify does not explicitly agree to ensure the encryption of all data provided to it. However, it is mentioned that credit card information is always encrypted during transfer over networks. For other data, there is no specific mention of encryption, and the Terms of Service state that all materials may be transferred unencrypted and involve transmissions over various networks and changes to conform to technical requirements. Therefore, while credit card information is encrypted, there is no clear provision in the provided context that Shopify agrees to ensure the encryption of all data provided to it.

-----------------------------------

Question: Are there some Disney products that I cant access anonymously?
Answer from pipeline:
Yes, there are certain Disney Products that may require you to create an account or provide account information in order to access or participate fully. For these products, you agree to maintain accurate and current information, including contact details. Additionally, you are responsible for the security of your account and must not share your account information with others. If there are any breaches of security or unauthorized use of your account, you are required to promptly notify Disney. Therefore, in these cases, you cannot access these Disney Products anonymously.

-----------------------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Question: Can Twitch continue using my content for advertising and marketing after Ive deleted that content?
Answer from pipeline:
No, Twitch cannot continue using your content for advertising and marketing after you have deleted it. According to the context provided, the rights granted to Twitch with respect to streaming live and pre-recorded audio-visual works terminate once you delete such User Content from the Twitch Services, except under specific circumstances such as if Twitch has used it for promotional purposes or if others copied or stored portions of the User Content. Therefore, if you delete your content, Twitch should not use it for advertising and marketing purposes.

-----------------------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Question: Does Facebook take responsibility for racist content posted on Facebook?
Answer from pipeline:
Facebook does not take responsibility for racist content posted on its platform. According to the provided context, the company acknowledges that hate speech of any kind is grounds for immediate and permanent suspension of access to all or part of the Service. However, it also states that while Facebook does not and cannot review every Submission and is not responsible for the content of these messages, it reserves the right to delete, move, or edit Submissions that it, in its sole discretion, deems as abusive, defamatory, obscene, in violation of copyright or trademark laws, in violation of these Terms of Service, or otherwise unacceptable. Thus, Facebook takes steps to address abusive content, but responsibility for the content ultimately lies with the individual user who posted it.

-----------------------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Question: Are there circumstances in which a security breach of a users Microsoft account could give rise to a claim by Microsoft against such user?
Answer from pipeline:
Yes, there are circumstances in which a security breach of a user's Microsoft account could give rise to a claim by Microsoft against such user. If the user fails to maintain the confidentiality of their password and account, resulting in unauthorized use or damage, Microsoft could potentially hold the user liable for any losses incurred by Microsoft or other parties due to someone else using their account or password. The user may be sued or required to provide monetary compensation if legal action is taken by Microsoft. Claims for such situations are unlawful and unfair business practices, which could cause irreparable harm to Microsoft. It's worth noting that while Microsoft may disclose user information under specific situations, as described in the provided context, unauthorized access or misuse of the account di

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Question: Does NYT require parental consent in respect of persons aged 17 who wish to access NYT services?
Answer from pipeline:
No, NYT requires parental consent for persons aged 13 or older, not specifically for persons aged 17. According to the provided context, any person who is 13 years or older can use the Services in the USA and the UK, while individuals aged 16 and older can use it anywhere else. If someone under 18 would like to use the Services, they should ask their parent or legal guardian to review and agree to these Terms of Service before they use or complete any purchases or registrations on their behalf. The requirement specifically stipulates the need for parental consent for minors under the age of 18. The age requirement for parental consent does not specify 17; it states 13 and above. However, it's implied that anyone under 18 must have parental consent.

-----------------------------------



Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Question: Does Instagram become the owner of content that I post on Instagram?
Answer from pipeline:
No, Instagram does not claim ownership of the content you post on Instagram. You are free to share your content with anyone else at your discretion. However, Instagram grants you a non-exclusive, royalty-free, transferable, sub-licensable, worldwide license to host, use, distribute, modify, run, copy, publicly perform or display, translate, and create derivative works of your content, as long as this is consistent with your privacy and application settings. This license ends when your content is deleted from Instagram's systems. Instagram's usage of your content is primarily for providing the Service and enhancing your user experience on the platform.

-----------------------------------



We need to convert the retrieved documents to a single string for FlowJudge evaluators.

In [22]:
str_retrieved_docs = []
for docs in retrieved_docs:
    # Concatenate the content of every retrieved document, one per line
    str_retrieved_docs.append("".join(doc.content + "\n" for doc in docs))
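
A later cell output in this notebook shows each retrieved context prefixed with a `Context 1:` style header, which the loop above does not produce. The following is a minimal sketch of a helper that would produce that format. The name `docs_to_context` is hypothetical, and `SimpleNamespace` objects stand in for Haystack documents, assuming only their `.content` attribute is needed.

```python
from types import SimpleNamespace


def docs_to_context(docs):
    """Join retrieved documents into one string with 'Context N:' headers."""
    return "".join(
        f"Context {i}:\n{doc.content}\n" for i, doc in enumerate(docs, 1)
    )


# Hypothetical stand-ins for retrieved documents (only `.content` is used)
example_docs = [
    SimpleNamespace(content="First chunk."),
    SimpleNamespace(content="Second chunk."),
]
example_context = docs_to_context(example_docs)
print(example_context)
```

Numbering the chunks can make it easier for the judge model's feedback to point at a specific retrieved passage.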

In [23]:
str_retrieved_docs

['The Site contains copyrighted material, trademarks and other proprietary information, including, but not limited to, text, software, photos, video, graphics, music and sound, and the entire contents of the Site are copyrighted as a collective work under the United States copyright laws. CNN owns copyright in the selection, coordination, arrangement and enhancement of such content, as well as in the content original to it. You may not modify, publish, transmit, participate in the transfer or sale, create derivative works, or in any way exploit, any of the content, in whole or in part. You may download copyrighted material for your personal use only. Except as otherwise expressly permitted under copyright law, no copying, redistribution, retransmission, publication or commercial exploitation of downloaded material will be permitted without the express permission of CNN and the copyright owner. In the event of any permitted copying, redistribution or publication of copyrighted material,

In [24]:
display(Markdown(f"**Retrieved documents:** {str_retrieved_docs[0]}"))

**Retrieved documents:** The Site contains copyrighted material, trademarks and other proprietary information, including, but not limited to, text, software, photos, video, graphics, music and sound, and the entire contents of the Site are copyrighted as a collective work under the United States copyright laws. CNN owns copyright in the selection, coordination, arrangement and enhancement of such content, as well as in the content original to it. You may not modify, publish, transmit, participate in the transfer or sale, create derivative works, or in any way exploit, any of the content, in whole or in part. You may download copyrighted material for your personal use only. Except as otherwise expressly permitted under copyright law, no copying, redistribution, retransmission, publication or commercial exploitation of downloaded material will be permitted without the express permission of CNN and the copyright owner. In the event of any permitted copying, redistribution or publication of copyrighted material, no changes in or deletion of author attribution, trademark legend or copyright notice shall be made. You acknowledge that you do not acquire any ownership rights by downloading copyrighted material.
The following terms apply to content submitted by users, and user conduct, on the Site's Interactive Areas, including CNN iReport:

(A) Interactive Areas. The Site may contain comments sections, discussion forums, or other interactive features, including CNN iReport, CNN's user-generated news community ("Interactive Areas") in which you may post or upload user-generated content, including but not limited to iReports, comments, video, photos, messages, other materials or items (collectively, "User Content"). You are solely responsible for your use of any Interactive Areas and you use them at your own risk. Interactive Areas are available for individuals aged 13 years or older. By submitting User Content to an Interactive Area, you represent that you are 13 years of age or older and, if you are under the age of 18, you either are an emancipated minor, or have obtained the legal consent of your parent or legal guardian to enter into these Terms of Use, submit content, participate on the Site, and fulfill the obligations set forth in these Terms of Use, which forms a binding contract between you and CNN. Employees of Cable News Network, Inc. may not submit User Content without permission from their supervisors.

(B) Community Guidelines. By submitting any User Content or participating in an Interactive Area within or in connection with the Site, you agree to abide by the following rules of conduct:
You agree not to upload, post or otherwise transmit any User Content that:
violates or infringes in any way upon the rights of others, including any statements which may defame, harass, stalk or threaten others.
you know to be false, misleading or inaccurate.
contains blatant expressions of bigotry, racism, racially or ethnically offensive content, hate speech, abusiveness, vulgarity or profanity.
contains or advocates pornography or sexually explicit content, pedophilia, incest, bestiality, or that is otherwise obscene or lewd.
violates any law or advocates or provides instruction on dangerous, illegal, or predatory acts, or discusses illegal activities with the intent to commit them.
advocates violent behavior.
poses a reasonable threat to personal or public safety.
contains violent images of killing or physical abuse that appear to have been captured solely, or principally, for exploitive, prurient, or gratuitous purposes.
is protected by copyright, trademark, trade secret, right of publicity or other proprietary right without the express permission of the owner of such copyright, trademark, trade secret, right of publicity or other proprietary right. The burden of determining that any User Content is not protected by copyright, trademark, trade secret, right of publicity or other proprietary right rests with you. You shall be solely liable for any damage resulting from any infringement of copyrights, trademarks, trade secrets, rights of publicity or other proprietary rights or any other harm resulting from such a submission. Any person determined by CNN, in its sole discretion, to have violated the intellectual property or other rights of others shall be barred from submitting or posting any further material on the Site.
does not generally pertain to the designated topic or theme of any Interactive Area.
contains any unsolicited or unauthorized advertising or promotional materials with respect to products or services, "junk mail," "spam," "chain letters," "pyramid schemes," or any other form of solicitation.
You agree not to engage in activity that would constitute a criminal offense or give rise to a civil liability.
You agree that if necessary, you have the consent of each and every identifiable natural person in any submission to use such persons name or likeness in the manner contemplated by the Site.
You agree that any person who appears in your submission who is a current member of the Screen Actors Guild (SAG), the American Federation of Television and Radio Actors (AFTRA) or any other rights society is not entitled to compensation by CNN.
You agree not to impersonate any person or entity, including, but not limited to, CNN or any CNN employee, or falsely state or otherwise misrepresent your affiliation with any person or entity.
You agree not to represent or suggest, directly or indirectly, CNN's endorsement of User Content.
You agree not to interfere with any other user's right to privacy, including by harvesting or collecting personally-identifiable information about the Site users or posting private information about a third party.
You agree not to upload, post or otherwise transmit any User Content, software or other materials which contain a virus or other harmful or disruptive component.
You agree not to interfere with or disrupt the Site or the servers or networks connected to the Site, or disobey any requirements, procedures, policies or regulations of networks connected to the Site.
You agree not to reproduce, duplicate, copy, sell, resell or exploit for any commercial purpose, any portion of the Site, use the Site, or access to the Site.
You agree not to use any service, technology or automated system to artificially inflate the page views that your User Content receives. This includes pay-per-click services, web "robots" and any other current or future technologies. You also agree not to direct any third party to use these services, technologies or automated systems on your behalf.
You agree not to use any technology, service or automated system to post more User Content than an individual could upload in a given period of time. You also agree not to direct any third party to use these services, technologies or automated systems on your behalf.
Any conduct that in CNN's sole discretion restricts or inhibits anyone else from using or enjoying the Site will not be permitted. CNN reserves the right in its sole discretion to remove or edit User Content by you and to terminate Your Account for any reason.
CNN does not vouch for the accuracy or credibility of any User Content, and does not take any responsibility or assume any liability for any actions you may take as a result of reading User Content posted on the Site. Through your use of Interactive Areas, you may be exposed to content that you may find offensive, objectionable, harmful, inaccurate or deceptive. There may also be risks of dealing with underage persons, people acting under false pretense, international trade issues and foreign nationals. By using Interactive Areas, you assume all associated risks.
7. Indemnification.
You agree to defend, indemnify and hold harmless CNN, its affiliates and their respective directors, officers, employees and agents from and against all claims and expenses, including attorneys' fees, arising out of the use of the Site by you or your Account. CNN reserves the right to take over the exclusive defense of any claim for which we are entitled to indemnification under this Section. In such event, you shall provide CNN with such cooperation as is reasonably requested by CNN.
8. Termination.
CNN may terminate or suspend these Terms of Use at any time without notice to you. Without limiting the foregoing, CNN shall have the right to immediately terminate Your Account in the event of any conduct by you which CNN, in its sole discretion, considers to be unacceptable, or in the event of any breach by you of these Terms of Use. The provisions of Sections 1 - 13 shall survive termination of these Terms of Use.
9. Trademarks.
CNN, its parent, subsidiaries and affiliates, own all rights to their logos and trademarks used in connection with the Site. All other logos and trademarks appearing on the Site are the property of their respective owners.


### Evaluators in Haystack

Evaluators are components that can be run individually or as part of a Haystack pipeline.

We are going to construct an evaluation pipeline to obtain scores from all the evaluators in one pass. Haystack also allows us to construct an evaluation report.

#### Creating FlowJudge evaluators using the HaystackFlowJudge class

We can use our integration with Haystack to create FlowJudge evaluators in a flexible way. The process is as follows:
1. Create a `CustomMetric` that will be used to compute the score for the evaluator.
2. Initialize the model. In this case, we are using a quantized version of Flow-Judge-v0.1 running on the vLLM engine.
3. Create the `HaystackFlowJudge` evaluator.


In [39]:
from flow_judge.integrations.haystack import HaystackFlowJudge
from flow_judge.models.model_factory import ModelFactory
from flow_judge.metrics.metric import CustomMetric, RubricItem
from flow_judge.metrics.presets import RESPONSE_FAITHFULNESS_5POINT

# Create a model using ModelFactory
model = ModelFactory.create_model("Flow-Judge-v0.1-AWQ") # ! Replace with "Flow-Judge-v0.1_HF_no_flsh_attn" if you are not running on an Ampere GPU


INFO 10-01 15:04:34 awq_marlin.py:89] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 10-01 15:04:34 llm_engine.py:223] Initializing an LLM engine (v0.6.1.post2) with config: model='flowaicom/Flow-Judge-v0.1-AWQ', speculative_config=None, tokenizer='flowaicom/Flow-Judge-v0.1-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, serv

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 10-01 15:04:36 model_runner.py:1008] Loading model weights took 2.1861 GB
INFO 10-01 15:04:38 gpu_executor.py:122] # GPU blocks: 3094, # CPU blocks: 682


In [51]:
# Context relevancy
cr_criteria = "Based on the provided query and context, how relevant and sufficient is the context for responding to the query?"
cr_rubric = [
    RubricItem(
        score=1,
        description="The context provided is not relevant or insufficient to respond to the query."
    ),
    RubricItem(
        score=2,
        description="The context is mostly irrelevant to the query. It may contain some tangentially related information but is insufficient for adequately responding to the query."
    ),
    RubricItem(
        score=3,
        description="The context is somewhat relevant to the query. It contains some information that could be used to partially respond to the query, but key details are missing for a complete response."
    ),
    RubricItem(
        score=4,
        description="The context is mostly relevant to the query. It contains most of the necessary information to respond to the query, but may be missing some minor details."
    ),
    RubricItem(
        score=5,
        description="The context is highly relevant to the query. It contains all the necessary information to comprehensively respond to the query without needing any additional context."
    )
]
cr_metric = CustomMetric(
    name="Context Relevancy",
    criteria=cr_criteria,
    rubric=cr_rubric,
    required_inputs=["question"],
    required_output="contexts"
)

To create the faithfulness evaluator, we are going to use the `RESPONSE_FAITHFULNESS_5POINT` preset from the flow-judge library as a template.

> Note that the evaluator expects specific input and output keys, so we update `required_inputs` and `required_output` to match the keys used in the RAG pipeline. The score descriptions in the rubric remain valid after this change.
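
To make the key matching concrete, here is a small sketch of a sanity check that compares a metric's keys with the keyword arguments we plan to pass to the evaluator. The helper `check_metric_keys` is hypothetical and is not part of flow-judge or Haystack. It only assumes that an evaluator takes one input per entry in `required_inputs` plus one for `required_output`, which is consistent with the evaluator inputs printed later in this notebook.

```python
def check_metric_keys(required_inputs, required_output, run_kwargs):
    """Return (missing, unexpected) keys for a planned evaluator call."""
    expected = set(required_inputs) | {required_output}
    provided = set(run_kwargs)
    return sorted(expected - provided), sorted(provided - expected)


# Keys of the faithfulness metric defined in this tutorial
planned_inputs = {
    "question": ["..."],
    "contexts": ["..."],
    "predicted_answers": ["..."],
}
missing, unexpected = check_metric_keys(
    ["question", "contexts"], "predicted_answers", planned_inputs
)
print(missing, unexpected)
```

An empty `missing` list means every key the metric refers to is covered by the planned call.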

In [52]:
ff_criteria = RESPONSE_FAITHFULNESS_5POINT.criteria
ff_rubric = RESPONSE_FAITHFULNESS_5POINT.rubric

display(Markdown(f"**Criteria:** {ff_criteria}"))
display(Markdown("**Rubric:**"))

for item in ff_rubric:
    display(Markdown(f"- **Score {item.score}:** {item.description}"))

ff_metric = CustomMetric(
    name="Faithfulness",
    criteria=ff_criteria,
    rubric=ff_rubric,
    required_inputs=["question", "contexts"],
    required_output="predicted_answers"
)

**Criteria:** Based on the given context, evaluate how consistent and faithful the generated response is to the context. The response should not contain any hallucinated or fabricated information that is not supported by the context.

**Rubric:**

- **Score 1:** The response is completely inconsistent with the provided context. It contains significant amount of hallucinated or fabricated information that directly contradicts or is not supported at all by the context.

- **Score 2:** The response is mostly inconsistent with the provided context. While it may contain some information from the context, it introduces a substantial amount of hallucinated or fabricated details that deviate from the context.

- **Score 3:** The response is somewhat consistent with the provided context. It includes a mix of information from the context and some hallucinated or fabricated details. The fabrications are minor and do not significantly contradict the context.

- **Score 4:** The response is mostly consistent with the provided context. The vast majority of the content is supported by the context, with only minor and inconsequential inconsistencies or fabrications, if any.

- **Score 5:** The response is completely consistent with and faithful to the provided context. All details in the response are directly supported by the context, without any hallucinated or fabricated information.

We can now create the Flow Judge evaluators:

In [53]:
cr_evaluator = HaystackFlowJudge(
    metric=cr_metric,
    model=model, # the vLLM instance of Flow-Judge-v0.1
    progress_bar=True,
    raise_on_failure=True, # to raise an error when pipeline run fails
    save_results=True, # to save evaluation results to disk
    fail_on_parse_error=False # if True, fail on a parsing error; otherwise return "Error" and a score of -1
)

ff_evaluator = HaystackFlowJudge(
    metric=ff_metric,
    model=model,
    progress_bar=True,
    raise_on_failure=True,
    save_results=True,
    fail_on_parse_error=False
)


#### Haystack evaluators

Now let's create the semantic answer similarity evaluator using the Haystack implementation. This evaluator will use the same embedding model as the retriever in the RAG pipeline.

In [54]:
from haystack.components.evaluators.sas_evaluator import SASEvaluator

sas_evaluator = SASEvaluator(model="sentence-transformers/all-MiniLM-L6-v2")
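
For intuition, as far as we understand, semantic answer similarity with a bi-encoder model like this one comes down to the cosine similarity between the embedding of the predicted answer and the embedding of the ground truth, averaged over all pairs. The sketch below illustrates that idea with plain Python. The toy vectors are made-up stand-ins for real embeddings, and `sas_scores` is a hypothetical helper, not the Haystack implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def sas_scores(pred_vecs, truth_vecs):
    """Per-pair similarities plus their mean, mirroring the evaluator's output keys."""
    individual = [cosine(p, t) for p, t in zip(pred_vecs, truth_vecs)]
    return {"score": sum(individual) / len(individual), "individual_scores": individual}


# Toy 2-d "embeddings": the first pair is identical, the second is orthogonal
toy = sas_scores([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
print(toy)
```

Because the score depends only on embedding similarity, two answers can score highly while disagreeing on a legal detail, which is one reason to complement it with the FlowJudge evaluators.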

### Evaluation pipeline

It's time to put it all together. We are going to create an evaluation pipeline that will evaluate the RAG pipeline.

In [55]:
eval_pipeline = Pipeline()

# add components to the pipeline
eval_pipeline.add_component("sas_evaluator", sas_evaluator)
# eval_pipeline.add_component("cr_evaluator", cr_evaluator)
# eval_pipeline.add_component("ff_evaluator", ff_evaluator)

In [56]:
# run pipeline
results = eval_pipeline.run(
    {
        "sas_evaluator": {
            'predicted_answers': rag_answers,
            'ground_truth_answers': ground_truths,
        }
    }
)

INFO:haystack.core.pipeline.base:Warming up component sas_evaluator...
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
INFO:haystack.core.pipeline.pipeline:Running component sas_evaluator


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [58]:
eval_pipeline.add_component("cr_evaluator", cr_evaluator)

In [59]:
cr_evaluator

<flow_judge.integrations.haystack.HaystackFlowJudge object at 0x7594da041780>
cr_evaluator
Inputs:
  - question: List[str]
  - contexts: List[str]
Outputs:
  - results: List[Dict[str, Any]]

In [92]:
str_retrieved_docs[:2]

['Context 1:\nThe Site contains copyrighted material, trademarks and other proprietary information, including, but not limited to, text, software, photos, video, graphics, music and sound, and the entire contents of the Site are copyrighted as a collective work under the United States copyright laws. CNN owns copyright in the selection, coordination, arrangement and enhancement of such content, as well as in the content original to it. You may not modify, publish, transmit, participate in the transfer or sale, create derivative works, or in any way exploit, any of the content, in whole or in part. You may download copyrighted material for your personal use only. Except as otherwise expressly permitted under copyright law, no copying, redistribution, retransmission, publication or commercial exploitation of downloaded material will be permitted without the express permission of CNN and the copyright owner. In the event of any permitted copying, redistribution or publication of copyright

In [79]:
results = eval_pipeline.run(
    {
        "sas_evaluator": {
            'predicted_answers': rag_answers,
            'ground_truth_answers': ground_truths,
        },
        "cr_evaluator": {
            'question': questions,
            'contexts': str_retrieved_docs,
        }
    }
)

INFO:haystack.core.pipeline.base:Warming up component sas_evaluator...
INFO:haystack.core.pipeline.pipeline:Running component sas_evaluator


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:haystack.core.pipeline.pipeline:Running component cr_evaluator
Processed prompts:   0%|          | 0/20 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]



Processed prompts: 100%|██████████| 20/20 [01:03<00:00,  3.20s/it, est. speed input: 889.65 toks/s, output: 258.20 toks/s]

In [81]:
ff_evaluator

<flow_judge.integrations.haystack.HaystackFlowJudge object at 0x7594da042140>
Inputs:
  - question: List[str]
  - contexts: List[str]
  - predicted_answers: List[str]
Outputs:
  - results: List[Dict[str, Any]]

In [82]:
eval_pipeline.add_component("ff_evaluator", ff_evaluator)

In [84]:
results = eval_pipeline.run(
    {
        "sas_evaluator": {
            'predicted_answers': rag_answers,
            'ground_truth_answers': ground_truths,
        },
        "cr_evaluator": {
            'question': questions,
            'contexts': str_retrieved_docs,
        },
        "ff_evaluator": {
            'question': questions,
            'contexts': str_retrieved_docs,
            'predicted_answers': rag_answers,
        }
    }
)

INFO:haystack.core.pipeline.base:Warming up component sas_evaluator...
INFO:haystack.core.pipeline.pipeline:Running component sas_evaluator


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:haystack.core.pipeline.pipeline:Running component cr_evaluator
Processed prompts: 100%|██████████| 20/20 [01:08<00:00,  3.44s/it, est. speed input: 825.76 toks/s, output: 247.98 toks/s]

In [85]:
results

{'sas_evaluator': {'score': 0.854657432436943,
  'individual_scores': [0.7936003804206848,
   0.8575922250747681,
   0.872139036655426,
   0.8882833123207092,
   0.8823309540748596,
   0.7627882957458496,
   0.940034031867981,
   0.8782095909118652,
   0.8595547676086426,
   0.6018112897872925,
   0.874082088470459,
   0.7303538918495178,
   0.7400498986244202,
   0.9672666788101196,
   0.8900795578956604,
   0.9298853874206543,
   0.9177019596099854,
   0.8769631385803223,
   0.9166738986968994,
   0.9137482643127441]},
 'cr_evaluator': {'results': [None,
   None,
   None,
   None,
   None,
   None,
   None,
   None,
   None,
   None,
   {'feedback': 'The context provided is highly relevant and sufficiently detailed to answer the query about whether eBay can decide which party pays the costs of arbitration. The context includes specific information from the "Costs of Arbitration" section, which directly addresses the question. It states that "Payment of all filing, administration and 
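
The output above contains several `None` entries in the `cr_evaluator` results, so it is worth checking how many samples were actually evaluated before interpreting any scores. Here is a minimal sketch, assuming each evaluated entry is a dict and each missing one is `None`. The helper name `split_results` and the toy data, including the `score` key, are illustrative assumptions.

```python
def split_results(results):
    """Separate evaluated entries from missing (None) ones, keeping their positions."""
    evaluated = {i: r for i, r in enumerate(results) if r is not None}
    missing = [i for i, r in enumerate(results) if r is None]
    return evaluated, missing


# Toy data shaped like the output above; the feedback text and score are made up
toy_results = [None, None, {"feedback": "Relevant context.", "score": 5}]
evaluated, missing = split_results(toy_results)
print(f"{len(evaluated)} evaluated, missing indices: {missing}")
```

With the real output, this would be called as `split_results(results["cr_evaluator"]["results"])`.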

### Evaluation report

Haystack provides a convenient way to generate an evaluation report using the `EvaluationRunResult` class.

In [87]:
from haystack.evaluation.eval_run_result import EvaluationRunResult

inputs = {
    "question": questions,
    "contexts": str_retrieved_docs,
    "answer": ground_truths,
    "predicted_answer": rag_answers,
}

evaluation_result = EvaluationRunResult(run_name="report", inputs=inputs, results=results)
evaluation_result.score_report()


ValueError: Aggregate score missing for cr_evaluator.
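
The error indicates that `EvaluationRunResult` expects an aggregate `score` entry for every evaluator, whereas the `cr_evaluator` output shown earlier only contains a `results` list. One possible workaround, sketched here under assumptions, is to derive the aggregate ourselves before building the report. The helper `add_aggregate_scores` is hypothetical. It assumes each evaluated result is a dict with a numeric `score` key, and it adds an `individual_scores` key by analogy with the `sas_evaluator` output. Whether the report accepts `None` entries in `individual_scores` is not verified here.

```python
def add_aggregate_scores(evaluator_output, score_key="score"):
    """Return a copy of the output with 'score' (mean) and 'individual_scores' added."""
    individual = [
        None if r is None else r.get(score_key)
        for r in evaluator_output["results"]
    ]
    valid = [s for s in individual if s is not None]
    # Mean over evaluated samples only; NaN if nothing was evaluated
    mean = sum(valid) / len(valid) if valid else float("nan")
    return {**evaluator_output, "score": mean, "individual_scores": individual}


# Toy output shaped like the evaluator results above (values are made up)
toy_output = {
    "results": [
        None,
        {"feedback": "Mostly relevant.", "score": 4},
        {"feedback": "Partially relevant.", "score": 2},
    ]
}
patched = add_aggregate_scores(toy_output)
print(patched["score"], patched["individual_scores"])
```

The same helper could then be applied to the `cr_evaluator` and `ff_evaluator` entries of `results` before passing them to `EvaluationRunResult`. Note that the mean skips samples without a score, so it should be read together with the number of missing entries.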