# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In this notebook, we'll look at how [Ragas](https://github.com/explodinggradients/ragas) can help you evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph, Ragas is framework-agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 5: Making Adjustments and Re-Evaluating

But first! Let's set up some dependencies!

## Dependencies and API Keys:

> NOTE: Please skip the pip install commands if you are running the notebook locally.

In [1]:
#!pip install -qU ragas==0.2.10
#!pip install -qU cohere langchain_cohere

In [3]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

Next, in case we use Cohere contextual compression, we need our Cohere key.

And finally, if we are going to use tracing, we need our LangSmith key.

In [73]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")
os.environ["LANGSMITH_API_KEY"] = getpass("Please enter your Langsmith API key!")
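Re-running the cell above prompts for all three keys every time. As a minimal sketch, assuming you would rather reuse a key that is already set in the environment, a small helper can prompt only when the variable is missing. The name `ensure_env` is hypothetical and not part of the original notebook.

```python
import os
from getpass import getpass


def ensure_env(name: str, prompt: str) -> str:
    """Return os.environ[name], prompting for it only when it is missing or empty.

    `ensure_env` is an illustrative helper, not part of any library.
    """
    if not os.environ.get(name):
        os.environ[name] = getpass(prompt)
    return os.environ[name]
```

It could then be called as `ensure_env("OPENAI_API_KEY", "Please enter your OpenAI API key!")`, and likewise for the other two keys.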

Choose the chunking and retrieval strategies you want to evaluate!

In [176]:
# By default, we are using regular text splitting/chunking and naive Qdrant retrieval.
SEMANTIC_SPLIT = False
COHERE_RETRIEVER = False

os.environ["LANGSMITH_TRACING"]="true"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"]="https://api.smith.langchain.com"

if SEMANTIC_SPLIT and COHERE_RETRIEVER:
    os.environ["LANGSMITH_PROJECT"] = "Assignment-8-Rag-Evaluation-Semantic-Cohere"
elif SEMANTIC_SPLIT:
    os.environ["LANGSMITH_PROJECT"] = "Assignment-8-Rag-Evaluation-Semantic-Naive" 
elif COHERE_RETRIEVER:
    os.environ["LANGSMITH_PROJECT"] = "Assignment-8-Rag-Evaluation-Naive-Cohere"
else:
    os.environ["LANGSMITH_PROJECT"] = "Assignment-8-Rag-Evaluation-Naive-Naive"


## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll start by downloading the webpages that will serve as our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [151]:
!mkdir -p data


In [152]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31524    0 31524    0     0  68584      0 --:--:-- --:--:-- --:--:-- 69436


In [153]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70549    0 70549    0     0   128k      0 --:--:-- --:--:-- --:--:--  128k


Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [177]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge-graph-based approach to create data. This is extremely useful because it lets us create complex queries rather simply. The additional test set complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `sdg_llm` (which will generate our questions, summaries, and more), and our `sdg_embeddings` which will be useful in building our graph.

In [178]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
sdg_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
sdg_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))


The generator will build our knowledge graph under the hood and, from there, generate the personas and scenarios used to construct our queries.

If we do not pass a query distribution, it will use a 1/3, 1/3, 1/3 split across single-hop specific, multi-hop specific, and multi-hop abstract queries.

In [179]:
from ragas.testset import TestsetGenerator

sdg_generator = TestsetGenerator(llm=sdg_llm, embedding_model=sdg_embeddings)
sdg_dataset = sdg_generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/24 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [180]:
sdg_dataset.to_pandas()

|   | user_input | reference_contexts | reference | synthesizer_name |
|---|------------|--------------------|-----------|------------------|
| 0 | What role does Microsoft play in the developme... | [The ethics of this space remain diabolically ... | Microsoft Research has produced models like Ph... | single_hop_specifc_query_synthesizer |
| 1 | Wot are the implications of LLMs on the use of... | [and software engineer, LLMs are infuriating. ... | LLMs have significant implications for the use... | single_hop_specifc_query_synthesizer |
| 2 | What is a notable characteristic of GPT-4 in t... | [Simon Willison’s Weblog Subscribe Stuff we fi... | GPT-4 is part of the breakthrough year for Lar... | single_hop_specifc_query_synthesizer |
| 3 | What do you expect to see in 2024 regarding AI... | [the document includes some of the clearest ex... | I’m hoping 2024 sees significant amounts of de... | single_hop_specifc_query_synthesizer |
| 4 | Why do some people think LLMs are all hot air ... | [<1-hop>\n\nThe ethics of this space remain di... | Some people think LLMs are all hot air because... | multi_hop_abstract_query_synthesizer |
| 5 | Wht are the legal implicashuns of using LLMs t... | [<1-hop>\n\nThe ethics of this space remain di... | The legal implications of using LLMs trained o... | multi_hop_abstract_query_synthesizer |
| 6 | What are some positive applications of generat... | [<1-hop>\n\nthe document includes some of the ... | Positive applications of generative AI can out... | multi_hop_abstract_query_synthesizer |
| 7 | What are the ethical concerns surrounding the ... | [<1-hop>\n\nThe ethics of this space remain di... | The ethical concerns surrounding the use of La... | multi_hop_abstract_query_synthesizer |
| 8 | What significant advancements in Large Languag... | [<1-hop>\n\nSimon Willison’s Weblog Subscribe ... | In Simon Willison's weblog for 2023, he highli... | multi_hop_specific_query_synthesizer |
| 9 | What advancements were made in LLMs in 2024, p... | [<1-hop>\n\nalso collected 211 definitions on ... | In 2024, significant advancements in Large Lan... | multi_hop_specific_query_synthesizer |


## LangChain RAG

Now we'll construct our LangChain RAG, which we will evaluate using the test data created above!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [181]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


Now that we have our data loaded, let's split it into chunks! 

We are providing:
  * two options for chunking: **Naive** or **Semantic**, and
  * two options for retrieval: **Naive** (get the top 5 matches) or **Cohere** (get the top 20 matches, then rerank them with Cohere and keep the best 5)

**Naive Splitting/Chunking**

In [182]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def text_split():
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    split_documents = text_splitter.split_documents(docs)
    print(f"Number of naive chunks: {len(split_documents)}")
    # Calculate and print chunk length statistics
    chunk_lengths = [len(doc.page_content) for doc in split_documents]
    print(f"Max chunk length: {max(chunk_lengths)}")
    print(f"Min chunk length: {min(chunk_lengths)}")
    print(f"Average chunk length: {sum(chunk_lengths)/len(chunk_lengths):.2f}")
    return split_documents

**Semantic Splitting/Chunking**

In [183]:
# SemanticChunker lives in the separate langchain_experimental package,
# which is not part of the pip installs above.
from langchain_experimental.text_splitter import SemanticChunker

# Embedding model used to compare adjacent sentences.
# Here we use the same model as the one used later for retrieval.
sc_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Initialize the SemanticChunker.
# For blog article text we want meaningful chunks that are not overly big.
# The "percentile" threshold looks at the distribution of embedding distances between
# adjacent sentences within a document and splits wherever the distance exceeds the
# given percentile. The default is 95, i.e. only the 5% least similar neighbors are split.
# Lowering it to 85 creates more breakpoints, which keeps the chunks from getting too big.
semantic_splitter = SemanticChunker(
    embeddings=sc_embeddings, 
    breakpoint_threshold_type="percentile", # Or standard_deviation, interquartile, gradient
    breakpoint_threshold_amount=85
)
def semantic_split():
    # Split the documents
    semantic_split_documents = semantic_splitter.split_documents(docs)

    # Check the result
    print(f"Number of semantic chunks: {len(semantic_split_documents)}") 
    # Calculate and print chunk length statistics
    chunk_lengths = [len(doc.page_content) for doc in semantic_split_documents]
    print(f"Max chunk length: {max(chunk_lengths)}")
    print(f"Min chunk length: {min(chunk_lengths)}")
    print(f"Average chunk length: {sum(chunk_lengths)/len(chunk_lengths):.2f}")
    return semantic_split_documents
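To make the percentile threshold more concrete, here is a minimal pure-Python sketch of the idea: compute a percentile over the distances between adjacent sentences and split wherever a distance exceeds it. The distance values are made up for illustration, and the helper names `percentile` and `breakpoints` are hypothetical; the real `SemanticChunker` also embeds and groups sentences, which is omitted here.

```python
def percentile(values: list[float], q: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def breakpoints(distances: list[float], threshold_percentile: float) -> list[int]:
    """Indices of adjacent-sentence gaps whose distance exceeds the percentile threshold."""
    threshold = percentile(distances, threshold_percentile)
    return [i for i, distance in enumerate(distances) if distance > threshold]


# Made-up distances between 11 consecutive sentences (10 gaps).
distances = [0.10, 0.12, 0.08, 0.45, 0.11, 0.09, 0.30, 0.13, 0.07, 0.60]

print("95th percentile breakpoints:", breakpoints(distances, 95))
print("85th percentile breakpoints:", breakpoints(distances, 85))
```

With these toy distances, the 95th percentile produces a single breakpoint while the 85th produces two, which is why a lower percentile tends to give more, smaller chunks.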

#### ❓ Question: 

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?
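One way to build intuition for this question is a simplified fixed-size splitter. The sketch below is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); it only shows how an overlap makes consecutive chunks share text at their boundaries. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=0))
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Without overlap, text that straddles a boundary is cut in two; with overlap, the end of one chunk reappears at the start of the next, so a sentence near a boundary is more likely to survive intact in at least one chunk.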

Now that we have the helper functions for our two chunking/splitting options, we are ready to construct our RAG retriever. We will do this in two steps: first, we set up the Qdrant vector DB; then, we set up the retriever node for the graph.

**Qdrant vector store** Depending on `SEMANTIC_SPLIT`, we will choose how to split the data before adding it to the vector store.

In [184]:
# We need an embedding model for Qdrant to use
from langchain_openai import OpenAIEmbeddings

# Qdrant imports
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams


embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Our in-memory Qdrant vector store.
client = QdrantClient(":memory:")
client.create_collection(
    collection_name="ai_across_years",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=embeddings,
)

# Depending on which type of chunking we are evaluating, we will add the documents to the vector store.
split_documents = semantic_split() if SEMANTIC_SPLIT else text_split()

_ = vector_store.add_documents(documents=split_documents)


Number of naive chunks: 75
Max chunk length: 998
Min chunk length: 406
Average chunk length: 880.93


**With Naive chunking:**

Number of naive chunks: 75 \
Max chunk length: 998 \
Min chunk length: 406 \
Average chunk length: 880.55

**With Semantic chunking: Percentile: 85**

Number of semantic chunks: 70\
Max chunk length: 5088\
Min chunk length: 22\
Average chunk length: 858.27

Now we are ready to construct our RAG retriever node.

We will create two retriever nodes:
   - a **naive** retriever that queries the Qdrant vector store directly and returns the top 5 matches,
   - and a **cohere_retriever** that gets the top 20 matches and then contextually compresses them down to 5.

Reranking, used here for contextual compression, re-scores the retrieved documents with a reranker model and keeps only the most relevant ones.\
This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.
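The two-stage idea can be sketched in plain Python. In this toy version, a cheap word-overlap count stands in for vector similarity and a Jaccard score stands in for the reranker; both scorers, the sample documents, and the function names are assumptions made for illustration, and neither resembles what Qdrant or Cohere actually compute.

```python
def tokens(text: str) -> set[str]:
    return set(text.lower().split())


def cheap_score(query: str, doc: str) -> int:
    """Stand-in for fast vector similarity: number of shared words."""
    return len(tokens(query) & tokens(doc))


def careful_score(query: str, doc: str) -> float:
    """Stand-in for a slower reranker: Jaccard similarity, which penalizes off-topic padding."""
    q, d = tokens(query), tokens(doc)
    return len(q & d) / len(q | d) if q | d else 0.0


def retrieve_then_rerank(query: str, docs: list[str], k: int, top_n: int) -> list[str]:
    # Stage 1: cast a wide net with the cheap scorer.
    candidates = sorted(docs, key=lambda doc: cheap_score(query, doc), reverse=True)[:k]
    # Stage 2: re-score only those candidates with the careful scorer and keep the best.
    return sorted(candidates, key=lambda doc: careful_score(query, doc), reverse=True)[:top_n]


docs = [
    "are llm agents useful or is it all hype about tools and loops and demos",
    "llm agents useful",
    "cooking pasta at home",
    "llm pricing dropped in 2024",
]
result = retrieve_then_rerank("are llm agents useful", docs, k=3, top_n=2)
print(result)
```

Here the first stage ranks the long, rambling document highest because it shares the most words with the query, while the second stage promotes the short, focused one. The careful scorer only ever sees the `k` candidates, which is what keeps the slower step affordable.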

In [185]:
# Cohere imports
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# We define both our retriever nodes and later, we will add the correct one to our graph.

# Cohere retriever node.
# We first retrieve more documents (k=20), which gives the reranker a larger pool to choose from.
base_retriever = vector_store.as_retriever(search_kwargs={"k": 20})
def cohere_retrieve(state):
  # top_n sets how many documents the reranker returns. ContextualCompressionRetriever
  # has no search_kwargs option, so the final count has to be set on the compressor.
  compressor = CohereRerank(model="rerank-v3.5", top_n=5)
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=base_retriever
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}
  
# Naive retriever node.
naive_base_retriever = vector_store.as_retriever(search_kwargs={"k": 5})
def retrieve(state):
  retrieved_docs = naive_base_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}


### Augmented

Let's create a simple RAG prompt!

In [186]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4o-mini` to avoid using the same model as our judge model.

In [187]:
from langchain_openai import ChatOpenAI

generator_llm = ChatOpenAI(model="gpt-4o-mini")

Then we can create a `generate` node!

In [188]:
def rag_generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = generator_llm.invoke(messages)
  return {"response" : response.content}
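To see what the generator model actually receives, the sketch below reproduces the context-joining step with plain string formatting. `ToyDocument` and `build_prompt` are hypothetical stand-ins for LangChain's `Document` and `ChatPromptTemplate`, and the template is a shortened version of the prompt above.

```python
from dataclasses import dataclass

# Shortened stand-in for the RAG prompt above.
PROMPT_TEMPLATE = """\
### Question
{question}

### Context
{context}
"""


@dataclass
class ToyDocument:
    """Minimal stand-in exposing the one attribute the generate node reads."""
    page_content: str


def build_prompt(question: str, context_docs: list[ToyDocument]) -> str:
    # Retrieved chunks are separated by a blank line, as in rag_generate.
    docs_content = "\n\n".join(doc.page_content for doc in context_docs)
    return PROMPT_TEMPLATE.format(question=question, context=docs_content)


prompt = build_prompt(
    "How are LLM agents useful?",
    [ToyDocument("First retrieved chunk."), ToyDocument("Second retrieved chunk.")],
)
print(prompt)
```

Printing the assembled prompt like this is a quick way to check whether the retrieved chunks are sensible before looking at evaluation scores.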

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [189]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [190]:
graph_builder = StateGraph(State)
if COHERE_RETRIEVER:
  graph_builder.add_sequence([cohere_retrieve, rag_generate])
  graph_builder.add_edge(START, "cohere_retrieve")
else:
  graph_builder.add_sequence([retrieve, rag_generate])
  graph_builder.add_edge(START, "retrieve")
  
graph = graph_builder.compile()
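Conceptually, each node receives the current state and returns only the keys it wants to update. The sketch below mimics that flow for a fixed sequence in plain Python. It illustrates the idea only and is not how LangGraph is implemented (LangGraph also handles reducers, branching, streaming, and more); `run_sequence` and the toy nodes are hypothetical.

```python
from typing import Callable

Node = Callable[[dict], dict]


def run_sequence(nodes: list[Node], initial_state: dict) -> dict:
    """Run nodes in order, merging each node's partial update into the state."""
    state = dict(initial_state)
    for node in nodes:
        state.update(node(state))
    return state


def toy_retrieve(state: dict) -> dict:
    return {"context": [f"chunk about {state['question']}"]}


def toy_generate(state: dict) -> dict:
    return {"response": f"Answer based on {len(state['context'])} chunk(s)."}


final_state = run_sequence([toy_retrieve, toy_generate], {"question": "LLM agents"})
print(final_state)
```

This is why the graph's output contains `question`, `context`, and `response` together: the retrieval node adds `context`, and the generation node adds `response` on top of it.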

In [191]:
# print(graph.get_graph().draw_ascii())

Let's do a test to make sure it's doing what we'd expect.

In [192]:
response = graph.invoke({"question" : "How are LLM agents useful?"})

In [193]:
response["response"]

'LLM agents are considered useful primarily in two contexts: as tools that can act on behalf of users, similar to a travel agent, and as systems that utilize LLMs to solve problems through the execution of tasks in a loop. However, there is significant skepticism about their practicality due to concerns regarding their ability to distinguish truth from fiction, leading to issues of "gullibility." This limitation raises questions about the reliability of LLM agents in making meaningful decisions.\n\nDespite the challenges, LLMs have demonstrated particular effectiveness in writing code, as the grammar of programming languages is less complex than natural languages. This capability has become one of the most recognized applications of LLMs, even as there is a need for better criticism and discussions around their ethical implications and potential negative impacts. Overall, while LLM agents show promise, their utility is still under scrutiny, and their full potential may not be realized 

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated using SDG above through our application to get context and responses.

In [196]:
import time

for test_row in sdg_dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(5)  # Pause between questions (helps avoid API rate limits)

In [197]:
sdg_dataset.to_pandas()

|   | user_input | retrieved_contexts | reference_contexts | response | reference | synthesizer_name |
|---|------------|--------------------|--------------------|----------|-----------|------------------|
| 0 | What role does Microsoft play in the developme... | [If you can gather the right data, and afford ... | [The ethics of this space remain diabolically ... | Microsoft is involved in the development of La... | Microsoft Research has produced models like Ph... | single_hop_specifc_query_synthesizer |
| 1 | Wot are the implications of LLMs on the use of... | [Code may be the best application\n\nThe ethic... | [and software engineer, LLMs are infuriating. ... | The implications of Large Language Models (LLM... | LLMs have significant implications for the use... | single_hop_specifc_query_synthesizer |
| 2 | What is a notable characteristic of GPT-4 in t... | [I’m relieved that this has changed completely... | [Simon Willison’s Weblog Subscribe Stuff we fi... | A notable characteristic of GPT-4 in the conte... | GPT-4 is part of the breakthrough year for Lar... | single_hop_specifc_query_synthesizer |
| 3 | What do you expect to see in 2024 regarding AI... | [Law is not ethics. Is it OK to train models o... | [the document includes some of the clearest ex... | In 2024, I expect to see a significant increas... | I’m hoping 2024 sees significant amounts of de... | single_hop_specifc_query_synthesizer |
| 4 | Why do some people think LLMs are all hot air ... | [A lot of people are yet to be sold on their v... | [<1-hop>\n\nThe ethics of this space remain di... | Some people think LLMs are "all hot air" due t... | Some people think LLMs are all hot air because... | multi_hop_abstract_query_synthesizer |
| 5 | Wht are the legal implicashuns of using LLMs t... | [Since then, almost every major LLM (and most ... | [<1-hop>\n\nThe ethics of this space remain di... | The legal implications of using LLMs trained o... | The legal implications of using LLMs trained o... | multi_hop_abstract_query_synthesizer |
| 6 | What are some positive applications of generat... | [Code may be the best application\n\nThe ethic... | [<1-hop>\n\nthe document includes some of the ... | Some positive applications of generative AI th... | Positive applications of generative AI can out... | multi_hop_abstract_query_synthesizer |
| 7 | What are the ethical concerns surrounding the ... | [Since then, almost every major LLM (and most ... | [<1-hop>\n\nThe ethics of this space remain di... | The ethical concerns surrounding the use of La... | The ethical concerns surrounding the use of La... | multi_hop_abstract_query_synthesizer |
| 8 | What significant advancements in Large Languag... | [Simon Willison’s Weblog\n\nSubscribe\n\nThing... | [<1-hop>\n\nSimon Willison’s Weblog Subscribe ... | In his weblog, Simon Willison highlighted seve... | In Simon Willison's weblog for 2023, he highli... | multi_hop_specific_query_synthesizer |
| 9 | What advancements were made in LLMs in 2024, p... | [The rise of inference-scaling “reasoning” mod... | [<1-hop>\n\nalso collected 211 definitions on ... | In 2024, significant advancements were made in... | In 2024, significant advancements in Large Lan... | multi_hop_specific_query_synthesizer |


Then we can convert that table into an `EvaluationDataset`, which will make the process of evaluation smoother.

In [198]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(sdg_dataset.to_pandas())

We'll need to select a judge model - in this case we're using `gpt-4.1-mini`, which differs from the `gpt-4o-mini` model used to generate our synthetic data and our responses.

In [199]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

Next up - we simply evaluate on our desired metrics!

In [200]:
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity,
)
from ragas import evaluate, RunConfig

# The default max_workers is 16; we lower it to make fewer concurrent API calls.
custom_run_config = RunConfig(
    timeout=300,          # 5 minutes max for operations
    max_retries=15,       # More retries for rate limits
    max_wait=90,          # Longer wait between retries
    max_workers=8,        # Fewer concurrent API calls
    log_tenacity=True     # Log retry attempts
)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[65]: AttributeError('StringIO' object has no attribute 'statements')


{'context_recall': 0.7778, 'faithfulness': 0.8526, 'factual_correctness': 0.4775, 'answer_relevancy': 0.8710, 'context_entity_recall': 0.4893, 'noise_sensitivity_relevant': 0.3974}



<span style="color:green">**Results for the three configurations (first run of each)**</span>

| Chunking  | Retrieval | Context recall | Faithfulness | Factual correctness | Answer relevancy | Context entity recall | Noise sensitivity |
|-----------|-----------|----------------|--------------|---------------------|------------------|-----------------------|-------------------|
| TextSplit | Naive     | 0.7803         | 0.8319       | 0.6000              | 0.7959           | 0.4514                | 0.3639            |
| TextSplit | Cohere    | 0.7917         | 0.9021       | 0.6050              | 0.8736           | 0.4550                | 0.3862            |
| Semantic  | Naive     | 0.8788         | 0.9032       | 0.5718              | 0.8735           | 0.4793                | 0.4513            |


Simple chunking, Naive retrieval: Two runs

| metric                     | run1   | run2   |
|----------------------------|--------|--------|
| context_recall             | 0.7803 | 0.6153 |
| faithfulness               | 0.8319 | 0.7748 |
| factual_correctness        | 0.6000 | 0.5492 |
| answer_relevancy           | 0.7959 | 0.7931 |
| context_entity_recall      | 0.4514 | 0.4310 |
| noise_sensitivity_relevant | 0.3639 | 0.2666 |


Semantic chunking, Naive Retrieval: Three runs:

| metric                     | run1   | run2   | run3   |
|----------------------------|--------|--------|--------|
| context_recall             | 0.8788 | 0.8417 | 0.5389 |
| faithfulness               | 0.9032 | 0.8616 | 0.9088 |
| factual_correctness        | 0.5718 | 0.5191 | 0.4145 |
| answer_relevancy           | 0.8735 | 0.9570 | 0.8648 |
| context_entity_recall      | 0.4793 | 0.3603 | 0.4496 |
| noise_sensitivity_relevant | 0.4513 | 0.5918 | 0.4446 |





Simple Chunking, Cohere Retrieval: Three runs:

| metric                     | run1   | run2   | run3   |
|----------------------------|--------|--------|--------|
| context_recall             | 0.7917 | 0.5909 | 0.7778 |
| faithfulness               | 0.9021 | 0.6857 | 0.8526 |
| factual_correctness      | 0.6050 | 0.5442 | 0.4775 |
| answer_relevancy         | 0.8736 | 0.8682 | 0.8710 |
| context_entity_recall    | 0.4550 | 0.4618 | 0.4893 |
| noise_sensitivity_relevant | 0.3862 | 0.3018 | 0.3974 |
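Because the scores vary noticeably between runs of the same configuration, it helps to look at the mean and the spread before comparing configurations. The sketch below does this for `context_recall`, using the per-run values copied from the tables above; the `summarize` helper is hypothetical, and with only two or three runs per configuration the result is a rough indication rather than a statistical test.

```python
from statistics import mean, stdev

# context_recall per run, copied from the results tables above.
CONTEXT_RECALL_RUNS = {
    "TextSplit + Naive": [0.7803, 0.6153],
    "Semantic + Naive": [0.8788, 0.8417, 0.5389],
    "TextSplit + Cohere": [0.7917, 0.5909, 0.7778],
}


def summarize(runs_by_config: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Map each configuration to (mean, sample standard deviation) of its runs."""
    return {name: (mean(scores), stdev(scores)) for name, scores in runs_by_config.items()}


summary = summarize(CONTEXT_RECALL_RUNS)
for name, (average, spread) in summary.items():
    print(f"{name:<20} mean={average:.4f}  stdev={spread:.4f}")
```

For this metric, the run-to-run standard deviation of every configuration (above 0.1) is larger than the gap between the highest and lowest configuration means (about 0.06), so a single run is not enough to rank the configurations on context recall. The same summary can be repeated for the other metrics by swapping in their rows.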






#### ❓ Question: 

Which system performed better, on what metrics, and why?