# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In the following notebook, we'll be looking at how [Ragas](https://github.com/explodinggradients/ragas) can be helpful in a number of ways when looking to evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 6: Making Adjustments and Re-Evaluating

But first! Let's set some dependencies!

## Dependencies and API Keys:

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [1]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

### Useful functions

In [37]:
import json
# For debugging
def printJSON(j):
    output = json.dumps(j, indent=2)
    lines = output.split("\n")
    for line in lines:
        print(line)

## Generating Synthetic Test Data

We wil be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - which should hopefull be familiar at this point since it's our Loan Data use-case!

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [2]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [3]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

  for match in re.finditer('{0}\s*'.format(re.escape(sent)), self.original_text):
  txt = re.sub('(?<={0})\.'.format(am), '∯', txt)
  txt = re.sub('(?<={0})\.'.format(am), '∯', txt)


In [4]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/17 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/31 [00:00<?, ?it/s]

Property 'summary' already exists in node '402a75'. Skipping!
Property 'summary' already exists in node 'ee6272'. Skipping!
Property 'summary' already exists in node '5749c1'. Skipping!
Property 'summary' already exists in node '1881e0'. Skipping!
Property 'summary' already exists in node '1e76b5'. Skipping!
Property 'summary' already exists in node '2abb97'. Skipping!
Property 'summary' already exists in node '59ca4e'. Skipping!
Property 'summary' already exists in node 'b6fce1'. Skipping!
Property 'summary' already exists in node '0cde7f'. Skipping!
Property 'summary' already exists in node '452262'. Skipping!
Property 'summary' already exists in node '2d01ee'. Skipping!
Property 'summary' already exists in node '77c593'. Skipping!
Property 'summary' already exists in node 'f79fab'. Skipping!
Property 'summary' already exists in node 'da003e'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/6 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/41 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '452262'. Skipping!
Property 'summary_embedding' already exists in node 'f79fab'. Skipping!
Property 'summary_embedding' already exists in node '59ca4e'. Skipping!
Property 'summary_embedding' already exists in node '0cde7f'. Skipping!
Property 'summary_embedding' already exists in node 'da003e'. Skipping!
Property 'summary_embedding' already exists in node '402a75'. Skipping!
Property 'summary_embedding' already exists in node 'b6fce1'. Skipping!
Property 'summary_embedding' already exists in node '1e76b5'. Skipping!
Property 'summary_embedding' already exists in node 'ee6272'. Skipping!
Property 'summary_embedding' already exists in node '77c593'. Skipping!
Property 'summary_embedding' already exists in node '2abb97'. Skipping!
Property 'summary_embedding' already exists in node '2d01ee'. Skipping!
Property 'summary_embedding' already exists in node '5749c1'. Skipping!
Property 'summary_embedding' already exists in node '1881e0'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [5]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Why FAFSA keep changin and what new stuff we g...,[Application and Verification Guide Introducti...,FAFSA keep changin cause the FAFSA Simplificat...,single_hop_specifc_query_synthesizer
1,Wut is FAFSA?,[Chapter 1: The Application Process We removed...,FAFSA renewal functionality has been deferred ...,single_hop_specifc_query_synthesizer
2,What is an ISIR and how does a school receive it?,[The FPS also checks the application for possi...,The Institutional Student Information Record (...,single_hop_specifc_query_synthesizer
3,"As a Financial Aid Administrator, what are the...",[2. The disclosure of their FTI by the IRS to ...,"Federal student aid information, including Fed...",single_hop_specifc_query_synthesizer
4,"According to FAFSA guidelines, how is family s...",[<1-hop>\n\nequal the tax filer(s) plus depend...,Family size determination on the FAFSA for an ...,multi_hop_abstract_query_synthesizer
5,According to the FAFSA form and Title IV progr...,[<1-hop>\n\nSubmission of a court order or off...,To determine a student's independent status fo...,multi_hop_abstract_query_synthesizer
6,What specific types of Federal Tax Information...,[<1-hop>\n\n2. The disclosure of their FTI by ...,The specific types of Federal Tax Information ...,multi_hop_abstract_query_synthesizer
7,According to federal financial aid regulations...,"[<1-hop>\n\nOrphan, Ward of the Court, or in F...",The receipt of child support directly impacts ...,multi_hop_abstract_query_synthesizer
8,How does the ISIR reflect both the results of ...,[<1-hop>\n\nThe FPS also checks the applicatio...,The ISIR (Institutional Student Information Re...,multi_hop_specific_query_synthesizer
9,How does the IRS provide Federal Tax Informati...,[<1-hop>\n\n2. The disclosure of their FTI by ...,The IRS provides Federal Tax Information (FTI)...,multi_hop_specific_query_synthesizer


## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the above created test data!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [5]:
path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

Now that we have our data loaded, let's split it into chunks!

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

1102

#### ❓ Question: 

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

#### Answer:

The key idea is preserving context. Because we are using the RecursiveCharacterTextSplitter chunker we look for text blocks of max size `chunk_size` that break logically (eg. end of sentence, end of paragraph). However sometimes, there could be text adjacent to the text block that make sense to be in that chunk in order to preserve context. Hence by implementing an overlap, you have 2 chunks that have that overlap text to preserve the context. Now if the corpus that is to be chunked has a very clear method of demarcating ideas/meaning eg. a structured document or bug descriptions then there's a natural way of chunking without the need for overalap.  

Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [7]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in memory QDrant vector store.

In [8]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="loan_data",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="loan_data",
    embedding=embeddings,
)

We can now add our documents to our vector store.

In [9]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [10]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

Now we can produce a node for retrieval!

In [11]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### Augmented

Let's create a simple RAG prompt!

In [12]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4o-nano` to avoid using the same model as our judge model.

In [15]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-nano")

Then we can create a `generate` node!

In [16]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [17]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [18]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

Let's do a test to make sure it's doing what we'd expect.

In [19]:
response = graph.invoke({"question" : "What are the different kinds of loans?"})

In [20]:
response["response"]

'Based on the provided context, the different kinds of loans mentioned are:\n\n1. Direct Loan\n2. Direct Unsubsidized Loan\n\nThe context also discusses various aspects related to these loans, such as loan periods, transfer rules, and repayment plans, but the specific types of loans explicitly named are "Direct Loan" and "Direct Unsubsidized Loan."'

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated usign SDG above through our application to get context and responses.

In [21]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [22]:
dataset.samples[0].eval_sample.response

'The different types of academic calendars that affect Title IV aid disbursement are:\n\n1. Standard Term: Includes traditional semesters, trimesters, and quarters, which contain between 14 and 21 weeks of instructional time. Aid is disbursed based on these standard terms, with specific rules for timing aligned with the length and structure of the terms.\n\n2. Nonstandard Term: These are academic calendars that do not follow the traditional semester or trimester structure. They may have varying lengths and configurations, affecting how and when aid is awarded and disbursed.\n\n3. Non-term: Programs with non-term calendars do not have specific academic terms like semesters or quarters. Disbursement timing and aid calculations are based on non-term-based rules, especially for programs with nonstandard durations.\n\n4. Subscription-Based Program: These programs have unique structures that may not align with traditional academic years or terms, influencing the way aid is calculated and dis

Then we can convert that table into a `EvaluationDataset` which will make the process of evaluation smoother.

In [23]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

We'll need to select a judge model - in this case we're using the same model that was used to generate our Synthetic Data.

In [24]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

Next up - we simply evaluate on our desired metrics!

In [25]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.8506, 'faithfulness': 0.9032, 'factual_correctness': 0.6217, 'answer_relevancy': 0.8783, 'context_entity_recall': 0.3755, 'noise_sensitivity_relevant': 0.2780}

In [40]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,factual_correctness,answer_relevancy,context_entity_recall,noise_sensitivity_relevant
0,Wht are the diffrent types of academic calende...,"[use of a Scheduled Academic Year (SAY), BBAY ...","[non-term (includes clock-hour calendars), or ...",The different types of academic calendars that...,The types of academic calendars that affect Ti...,1.0,0.857143,0.25,0.982859,0.5,0.692308
1,"i dont get what Volume 8, Chapter 3 say about ...",[Work in a Standard Term=). If a standard term...,[Inclusion of Clinical Work in a Standard Term...,"According to Volume 8, Chapter 3, when clinica...","See Volume 8, Chapter 3 for additional guidanc...",1.0,1.0,0.55,0.912482,0.666667,0.785714
2,Whaat are Non-Term Charactristics in student a...,[Non-Term Characteristics\nA program that meas...,[Non-Term Characteristics A program that measu...,Non-Term Characteristics in student aid progra...,A program has Non-Term Characteristics if it m...,1.0,1.0,0.79,0.976705,0.125,0.0
3,where i find info on Chapters 5 and 6 for dire...,[information on Direct Loan annual loan limit ...,[both the credit or clock hours and the weeks ...,You can find information on Chapters 5 and 6 f...,For information on Direct Loan annual loan lim...,1.0,0.5,0.67,0.94078,0.75,0.0
4,What are the disbursement requirements and tim...,"[2), you must disburse the Title IV funds duri...",[<1-hop>\n\nboth the credit or clock hours and...,In clock-hour or non-term credit-hour programs...,In clock-hour or non-term credit-hour programs...,0.625,0.733333,0.54,0.955672,0.190476,0.1875
5,What are the disbursement requirements and tim...,[information on Direct Loan annual loan limit ...,[<1-hop>\n\nboth the credit or clock hours and...,The disbursement requirements and timing diffe...,In clock-hour or non-term credit-hour programs...,0.4,1.0,0.64,0.941295,0.5,0.0625
6,How do the disbursement requirements and timin...,[section below.\nExcept as noted above for the...,[<1-hop>\n\nboth the credit or clock hours and...,"In federal student aid programs, the disbursem...",In clock-hour or non-term credit-hour programs...,0.833333,0.785714,0.39,0.96787,0.333333,0.428571
7,How does accelerated student progression in a ...,[both the credit or clock hours and the weeks ...,[<1-hop>\n\nboth the credit or clock hours and...,Accelerated student progression in a clock-hou...,Accelerated student progression in a clock-hou...,0.2,0.9,0.8,0.937455,0.5,0.210526
8,What are the disbursement timing requirements ...,[Disbursement Timing in Subscription-Based Pro...,[<1-hop>\n\nDisbursement Timing in Subscriptio...,"In subscription-based programs, there are spec...","In subscription-based programs, for the first ...",0.5,1.0,0.42,0.967442,0.1,0.181818
9,how do appendix a and appendix b help with und...,"[2), you must disburse the Title IV funds duri...",[<1-hop>\n\nboth the credit or clock hours and...,Appendix A and Appendix B provide detailed gui...,appendix a gives examples that show how the ru...,0.75,0.5,0.64,0.908577,0.0,0.230769


## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see how the model improves or doesn't improve!

> NOTE: This will be using Cohere's Rerank model - please be sure to [sign-up for an API key!](https://docs.cohere.com/reference/about)

In [27]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")


We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.

In [28]:
adjusted_example_retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Reranking, or contextual compression, is a technique that uses a reranker to compress the retrieved documents into a smaller set of documents.

This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.

In [29]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

def retrieve_adjusted(state):
  compressor = CohereRerank(model="rerank-v3.5")
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=adjusted_example_retriever, search_kwargs={"k": 5}
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

We can simply rebuild our graph with the new retriever!

In [31]:
class AdjustedState(TypedDict):
  question: str
  context: List[Document]
  response: str

adjusted_graph_builder = StateGraph(AdjustedState).add_sequence([retrieve_adjusted, generate])
adjusted_graph_builder.add_edge(START, "retrieve_adjusted")
adjusted_graph = adjusted_graph_builder.compile()

In [32]:
response = adjusted_graph.invoke({"question" : "What are the different kinds of loans?"})
response["response"]

'The provided context mentions different types of loans related to federal student aid, specifically:\n\n- Federal PLUS Loans\n- Federal Family Education Loan (FFEL) Program loans (which include some loan types prior to July 1, 2010)\n- Direct Subsidized Loans\n- Direct Unsubsidized Loans\n- Student Direct PLUS Loans (a type of Direct PLUS Loan)\n\nWhile the context primarily discusses Direct Subsidized Loans, Direct Unsubsidized Loans, and PLUS Loans, it indicates that these are among the different kinds of loans available in the federal student aid programs.'

In [33]:
import time
import copy

rerank_dataset = copy.deepcopy(dataset)

for test_row in rerank_dataset:
  response = adjusted_graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(2) # To try to avoid rate limiting.

In [34]:
rerank_dataset.samples[0].eval_sample.response

'The different types of academic calendars that affect Title IV aid disbursement are:\n\n1. Standard Term: Includes semesters, trimesters, and quarters, which contain between 14 and 21 weeks of instructional time and have specific start and end dates within a set time frame.\n\n2. Nonstandard Term: Similar to standard terms but may vary in length and schedule, still within a defined period for class sessions.\n\n3. Non-term: Classes do not begin and end within a set time frame, and academic progress can be measured in credit hours or clock hours.\n\n4. Subscription-based: Used by subscription-based programs where students pay per term with the expectation of completing a specific number of credit hours during that term.'

In [35]:
rerank_evaluation_dataset = EvaluationDataset.from_pandas(rerank_dataset.to_pandas())

In [36]:
result = evaluate(
    dataset=rerank_evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.7519, 'faithfulness': 0.8402, 'factual_correctness': 0.6017, 'answer_relevancy': 0.9491, 'context_entity_recall': 0.3318, 'noise_sensitivity_relevant': 0.2941}

In [41]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,factual_correctness,answer_relevancy,context_entity_recall,noise_sensitivity_relevant
0,Wht are the diffrent types of academic calende...,"[use of a Scheduled Academic Year (SAY), BBAY ...","[non-term (includes clock-hour calendars), or ...",The different types of academic calendars that...,The types of academic calendars that affect Ti...,1.0,0.857143,0.25,0.982859,0.5,0.692308
1,"i dont get what Volume 8, Chapter 3 say about ...",[Work in a Standard Term=). If a standard term...,[Inclusion of Clinical Work in a Standard Term...,"According to Volume 8, Chapter 3, when clinica...","See Volume 8, Chapter 3 for additional guidanc...",1.0,1.0,0.55,0.912482,0.666667,0.785714
2,Whaat are Non-Term Charactristics in student a...,[Non-Term Characteristics\nA program that meas...,[Non-Term Characteristics A program that measu...,Non-Term Characteristics in student aid progra...,A program has Non-Term Characteristics if it m...,1.0,1.0,0.79,0.976705,0.125,0.0
3,where i find info on Chapters 5 and 6 for dire...,[information on Direct Loan annual loan limit ...,[both the credit or clock hours and the weeks ...,You can find information on Chapters 5 and 6 f...,For information on Direct Loan annual loan lim...,1.0,0.5,0.67,0.94078,0.75,0.0
4,What are the disbursement requirements and tim...,"[2), you must disburse the Title IV funds duri...",[<1-hop>\n\nboth the credit or clock hours and...,In clock-hour or non-term credit-hour programs...,In clock-hour or non-term credit-hour programs...,0.625,0.733333,0.54,0.955672,0.190476,0.1875
5,What are the disbursement requirements and tim...,[information on Direct Loan annual loan limit ...,[<1-hop>\n\nboth the credit or clock hours and...,The disbursement requirements and timing diffe...,In clock-hour or non-term credit-hour programs...,0.4,1.0,0.64,0.941295,0.5,0.0625
6,How do the disbursement requirements and timin...,[section below.\nExcept as noted above for the...,[<1-hop>\n\nboth the credit or clock hours and...,"In federal student aid programs, the disbursem...",In clock-hour or non-term credit-hour programs...,0.833333,0.785714,0.39,0.96787,0.333333,0.428571
7,How does accelerated student progression in a ...,[both the credit or clock hours and the weeks ...,[<1-hop>\n\nboth the credit or clock hours and...,Accelerated student progression in a clock-hou...,Accelerated student progression in a clock-hou...,0.2,0.9,0.8,0.937455,0.5,0.210526
8,What are the disbursement timing requirements ...,[Disbursement Timing in Subscription-Based Pro...,[<1-hop>\n\nDisbursement Timing in Subscriptio...,"In subscription-based programs, there are spec...","In subscription-based programs, for the first ...",0.5,1.0,0.42,0.967442,0.1,0.181818
9,how do appendix a and appendix b help with und...,"[2), you must disburse the Title IV funds duri...",[<1-hop>\n\nboth the credit or clock hours and...,Appendix A and Appendix B provide detailed gui...,appendix a gives examples that show how the ru...,0.75,0.5,0.64,0.908577,0.0,0.230769


#### ❓ Question: 

Which system performed better, on what metrics, and why?

#### Answer:

Without Reranking:
`{'context_recall': 0.8506, 'faithfulness': 0.9032, 'factual_correctness': 0.6217, 'answer_relevancy': 0.8783, 'context_entity_recall': 0.3755, 'noise_sensitivity_relevant': 0.2780}`

With Reranking:
`{'context_recall': 0.7519, 'faithfulness': 0.8402, 'factual_correctness': 0.6017, 'answer_relevancy': 0.9491, 'context_entity_recall': 0.3318, 'noise_sensitivity_relevant': 0.2941}`

The key metric we are trying to improve is `answer_relevancy` as that is ultimately what we want with a RAG - are we answering the question correctly with the given data we provide the system. Also that's the whole point of including the rerank step where we use Cohere's model which is trained with Q&A data to give us the most relevent chunks to include in the context. 

But notice both `answer_relevancy` and `noise_sensitivity_relevant` increased while the rest decreased. I was perplexed so I asked Cursor (you can see the prompt and answer in PROMPTS.md). The key thing is this: with using the reranker we get better relevance in the context at the expense of coverage. The 5 reranked chunks are more focused and relevant but may contain less comprehensive information than 5 best semantic chunks. 


