# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [2]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the strutured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [3]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path=f"./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [4]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

### 🗂️ Structure of `Document` Object

#### 📌 `metadata`
```python
{
    'source': './data/complaints.csv',
    'row': 0,
    'Date received': '03/27/25',
    'Product': 'Student loan',
    'Sub-product': 'Federal student loan servicing',
    'Issue': 'Dealing with your lender or servicer',
    'Sub-issue': 'Trouble with how payments are being handled',
    'Consumer complaint narrative': (
        "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. "
        "However, payments were not re-amortized on my federal student loans currently "
        "serviced by Nelnet until very recently. The new payment amount that is effective "
        "starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} "
        "per month to {$360.00} per month. I'm fortunate that my current financial position "
        "allows me to be able to handle the increased payment amount, but I am sure there are "
        "likely many borrowers who are not in the same position. The re-amortization should have "
        "occurred once the forbearance ended to reduce the impact to borrowers."
    ),
    'Company public response': 'None',
    'Company': 'Nelnet, Inc.',
    'State': 'IL',
    'ZIP code': '60030',
    'Tags': 'None',
    'Consumer consent provided?': 'Consent provided',
    'Submitted via': 'Web',
    'Date sent to company': '03/27/25',
    'Company response to consumer': 'Closed with explanation',
    'Timely response?': 'Yes',
    'Consumer disputed?': 'N/A',
    'Complaint ID': '12686613'
}
```

#### 📄 `page_content`
```text
The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.
```


In [5]:
# I am inspecting the data type and the csv file

print(len(loan_complaint_data))
print(dir(loan_complaint_data[823]))
print(type(loan_complaint_data[823]))
print(loan_complaint_data[823].metadata)

825
['__abstractmethods__', '__annotations__', '__class__', '__class_getitem__', '__class_vars__', '__copy__', '__deepcopy__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__fields__', '__fields_set__', '__firstlineno__', '__format__', '__ge__', '__get_pydantic_core_schema__', '__get_pydantic_json_schema__', '__getattr__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__pretty__', '__private_attributes__', '__pydantic_complete__', '__pydantic_computed_fields__', '__pydantic_core_schema__', '__pydantic_custom_init__', '__pydantic_decorators__', '__pydantic_extra__', '__pydantic_fields__', '__pydantic_fields_set__', '__pydantic_generic_metadata__', '__pydantic_init_subclass__', '__pydantic_parent_namespace__', '__pydantic_post_init__', '__pydantic_private__', '__pydantic_root_model__', '__pydantic_serializer__', '__pydantic_setattr_handlers__', '__pydantic_valid

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [6]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply look at each review as a document, and use cosine-similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [7]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

In [8]:
type(naive_retriever)
dir(naive_retriever)

['InputType',
 'OutputType',
 '__abstractmethods__',
 '__annotations__',
 '__class__',
 '__class_getitem__',
 '__class_vars__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__fields__',
 '__fields_set__',
 '__firstlineno__',
 '__format__',
 '__ge__',
 '__get_pydantic_core_schema__',
 '__get_pydantic_json_schema__',
 '__getattr__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__or__',
 '__orig_bases__',
 '__parameters__',
 '__pretty__',
 '__private_attributes__',
 '__pydantic_complete__',
 '__pydantic_computed_fields__',
 '__pydantic_core_schema__',
 '__pydantic_custom_init__',
 '__pydantic_decorators__',
 '__pydantic_extra__',
 '__pydantic_fields__',
 '__pydantic_fields_set__',
 '__pydantic_generic_metadata__',
 '__pydantic_init_subclass__',
 '__pydantic_parent_namespace__',
 '__pydantic_post_init__',
 '__pydan

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [9]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [10]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [11]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [12]:
type(naive_retrieval_chain)

langchain_core.runnables.base.RunnableSequence

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [13]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

"Based on the provided complaints, the most common issues with loans appear to be:\n\n- Dealing with loan servicer misconduct, such as errors in loan balances, misapplied payments, wrongful denials of payment plans, and difficulties in applying payments correctly.\n- Incorrect or inconsistent reporting of loan information, including account status and balances, leading to credit score impacts.\n- Struggles with repayment plans, including issues with interest capitalization, improper transfers of loans without notification, and inability to pay down principal.\n- Problems related to loan handling, such as inaccurate information, unauthorized transfers, and privacy violations.\n- Long-term forbearance and mismanagement leading to increased debt and financial hardship.\n\nOverall, a prevalent theme is the mismanagement and mishandling by loan servicers, which affects borrowers' balances, credit reports, and ability to manage their loans effectively."

In [14]:
sample_test = naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})

In [15]:
sample_test['context']

[Document(metadata={'source': './data/complaints.csv', 'row': 474, 'Date received': '04/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Struggling to repay your loan', 'Sub-issue': 'Problem with forgiveness, cancellation, or discharge', 'Consumer complaint narrative': 'I sent a dispute settlement to all three credit bureaus over 30 days ago and still havent gotten a response. This is negligent and not in compliance with the law. Im asking for this to be looked into and handled.', 'Company public response': 'None', 'Company': 'Nelnet, Inc.', 'State': 'MI', 'ZIP code': '48111', 'Tags': 'None', 'Consumer consent provided?': 'Consent provided', 'Submitted via': 'Web', 'Date sent to company': '04/27/25', 'Company response to consumer': 'Closed with explanation', 'Timely response?': 'Yes', 'Consumer disputed?': 'N/A', 'Complaint ID': '13205525', '_id': 'ab98fee89f3f4527a711c50b3f43c83c', '_collection_name': 'LoanComplaints'}, page_content='I sent

In [16]:
sample_test['response']

AIMessage(content='Yes, based on the information provided, some complaints did not get handled in a timely manner. Specifically, at least one complaint (Complaint ID: 12709087) was marked as not handled in a timely manner, with the response marked as "No." Additionally, multiple complaints mention delays or lack of response over extended periods, such as complaints that have been ongoing for over a year without resolution.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 80, 'prompt_tokens': 7357, 'total_tokens': 7437, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': 'fp_38343a2f8f', 'id': 'chatcmpl-ByLC6m8rgqxAW9gQphYMzosEVmLko', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--6227492e-4809-4

In [17]:
dir(sample_test['response'])

['__abstractmethods__',
 '__add__',
 '__annotations__',
 '__class__',
 '__class_getitem__',
 '__class_vars__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__fields__',
 '__fields_set__',
 '__firstlineno__',
 '__format__',
 '__ge__',
 '__get_pydantic_core_schema__',
 '__get_pydantic_json_schema__',
 '__getattr__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__pretty__',
 '__private_attributes__',
 '__pydantic_complete__',
 '__pydantic_computed_fields__',
 '__pydantic_core_schema__',
 '__pydantic_custom_init__',
 '__pydantic_decorators__',
 '__pydantic_extra__',
 '__pydantic_fields__',
 '__pydantic_fields_set__',
 '__pydantic_generic_metadata__',
 '__pydantic_init_subclass__',
 '__pydantic_parent_namespace__',
 '__pydantic_post_init__',
 '__pydantic_private__',
 '__pydantic_root_model__',
 '__pydantic_serialize

In [18]:
sample_test['response'].content

'Yes, based on the information provided, some complaints did not get handled in a timely manner. Specifically, at least one complaint (Complaint ID: 12709087) was marked as not handled in a timely manner, with the response marked as "No." Additionally, multiple complaints mention delays or lack of response over extended periods, such as complaints that have been ongoing for over a year without resolution.'

In [19]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily because of financial hardships, lack of clear communication from loan servicers, unmanageable interest accumulation, misinformation or insufficient information about repayment terms, and adverse impacts on credit reports. Many borrowers found themselves unable to increase payments due to their living expenses, while others experienced unexpected loan transfers, mismanagement, or delinquency reporting without proper notification. Additionally, some borrowers faced difficulties with payment plans, foregone opportunities for debt relief, and confusion over the status of their loans, all contributing to their inability to repay in a timely manner.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set-up! Let's see it happen down below!


In [20]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data, )

We'll construct the same chain - only changing the retriever.

In [21]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [22]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints, the most common issue with loans appears to be dealing with the lender or servicer, particularly related to inaccurate or misleading information, improper application of payments, or issues with loan terms. Specific sub-issues include disputes over fees charged, trouble with how payments are handled (such as being unable to pay down principal), and receiving bad or incomplete information about loan balances or history.'

In [23]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all these complaints were responded to with a "Closed with explanation" status and the responses are marked as "Timely response?": "Yes." Therefore, it appears that none of the complaints went unresolved or were handled outside of the expected timeframe.'

In [24]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for various reasons, including:\n- Problems with payment plans, such as being steered into incorrect forbearances or having their autopayments unenrolled without proper communication.\n- Poor communication from lenders or servicers, leading to misunderstandings about payment deadlines, overdue status, or the status of forbearance or deferment applications.\n- Lack of response or acknowledgment from loan servicers after requests for forbearance or deferment, resulting in continued billing and negative impact on credit scores.\n- Errors or issues in billing systems, such as payments being reversed or accounts being reported as past due without proper notification.\n- Transitions between loan servicers that caused a lack of awareness about the loan status or payment requirements.\nOverall, inadequate communication, administrative errors, and inappropriate handling of payment plans contributed to borrowers' failure to repay their loans."

It's not clear that this is better or worse, if only we had a way to test this (SPOILERS: We do, the second half of the notebook will cover this)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

#### ✅ Answer:

An example query where BM25 outperforms embeddings is when searching for documents that contain an **exact phrase**.

**Query:**

```python
"Search for complaints where the exact phrase 'loan paid in full' appears."
```

This type of query demands **precise lexical matching**, which is BM25’s strength. In this case, we ran the query using both retrieval pipelines:

### 🔍 Results

**BM25 Result:**
> Based on the provided context, there are no complaints that explicitly mention the exact phrase 'loan paid in full.' Therefore, I cannot find any complaints where this specific phrase appears.

**Embedding Result:**
> Based on the provided documents, there are multiple complaints where the exact phrase 'loan paid in full' appears, specifically in the context of patients stating that their loans still do not show as paid in full despite paying them off.  
>  
> For example:  
> - At row 59, the complaint narrative explicitly mentions: "I paid off loan # XXXX on XX/XX/year>, it has been 90 days and it is still not showing as paid in full."  
>  
> Therefore, the complaints where the exact phrase 'loan paid in full' appears are centered around cases where individuals have paid off their loans but the status has not been updated to 'paid in full', leading to frustration and disputes.

---

### ✅ Justification

This example demonstrates that **BM25 is superior for exact phrase queries**:

- BM25 correctly returned *no matches* because the **exact phrase** `'loan paid in full'` does not exist in the dataset.
- The embedding-based retriever, however, **hallucinated** — returning conceptually similar results that did **not contain the exact phrase**, violating the query constraint.

---

### 📌 Conclusion

**BM25 is better than embeddings** when:

- The query demands **exact term or phrase matching**
- You want to avoid **semantic generalization**
- Hallucination risks must be minimized (e.g., legal or financial audits)

This makes BM25 a strong choice for precision-sensitive applications like compliance, audits, or exact keyword search.


In [25]:
query = "Search for complaints where the exact phrase 'loan paid in full' appears."

bm25_result = bm25_retrieval_chain.invoke({"question": query})
naive_result = naive_retrieval_chain.invoke({"question": query})

print("BM25 RESULT:\n", bm25_result["response"].content)
print("\n---\n")
print("EMBEDDING RESULT:\n", naive_result["response"].content)


BM25 RESULT:
 I found several complaints where the exact phrase 'loan paid in full' appears. Specifically, there are two instances:

1. A complaint where the individual mentions that a loan serviced by Mohela states on the loan summary page that it was paid in full. They refer to documentation indicating this status on their student aid account.

2. A complaint where the person notes that their student loans, managed by the Default Resolution Group, show in their documentation that certain loans were paid in full as of specific dates.

If you need detailed information from these complaints or further assistance, please let me know!

---

EMBEDDING RESULT:
 I found multiple complaints where the exact phrase "loan paid in full" appears in the consumer complaint narratives.


In [26]:
bm25_result['context']

[Document(metadata={'source': './data/complaints.csv', 'row': 824, 'Date received': '04/24/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': 'I have student loans managed by aidvantage. I have the payments setup from my bank account. It should be a breeze for me to pay them. No matter what I do the payments just fail. I logged into aidvantage and all of my payments have failed. Its depressing and I need them to be paid. It really should not be this hard. 7 months is a long time to keep failing ... every other bill I setup manage to get paid from the same account. I went to XXXX  and see the exact same complaints and tragedy. I need this resolved, I want my payments accepted.', 'Company public response': 'None', 'Company': 'Maximus Federal Services, Inc.', 'State': 'MD', 'ZIP code': '20735', 'Tags': 'Servicemember', '

In [27]:
naive_result['context']

[Document(metadata={'source': './data/complaints.csv', 'row': 59, 'Date received': '04/02/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Received bad information about your loan', 'Consumer complaint narrative': "In my previous Complaint # XXXX I stated that student loan # XXXX is still not showing as being paid in full. I received a response but it still hasn't been corrected and the complaint was closed. I paid off loan # XXXX on XX/XX/year>, it has been 90 days and it is still not showing as paid in full, despite showing a XXXX balance.", 'Company public response': 'None', 'Company': 'Maximus Federal Services, Inc.', 'State': 'NV', 'ZIP code': '891XX', 'Tags': 'None', 'Consumer consent provided?': 'Consent provided', 'Submitted via': 'Web', 'Date sent to company': '04/10/25', 'Company response to consumer': 'Closed with explanation', 'Timely response?': 'Yes', 'Consumer disputed?': 'N/A'

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [28]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [29]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [30]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided information, a common issue with loans, particularly student loans, is dealing with errors and mismanagement by lenders or servicers. This includes problems such as receiving incorrect information about the loan, errors in balances, misapplied payments, wrongful denials of payment plans, and mishandling of the loan data. \n\nIn summary, one of the most common issues is **mismanagement and errors related to loan balances, payment application, and communication from lenders or servicers**.'

In [31]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, yes, some complaints did not get handled in a timely manner. For example, the complaint about the student loan account review has been open since some earlier date (indicated as XXXX in the documents) and has not been resolved after nearly 18 months, indicating a significant delay. Additionally, the complaint about payments not being applied to the account mentions ongoing issues over 2-3 weeks. Although the responses from the companies were marked as "Closed with explanation" and responses were "Yes" in terms of timeliness, the ongoing unresolved issues suggest that some complaints experienced delays or were not fully resolved promptly.'

In [32]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for several reasons based on the provided context:\n\n1. **Lack of Clear Communication and Awareness:** Some borrowers were unaware they had to repay their loans or were not informed about repayment requirements. For example, a borrower was unaware until a certain point that they needed to start paying back their financial aid, leading to confusion and missed payments.\n\n2. **Issues with Loan Transfer and Notification:** Borrowers reported that loans were transferred between servicers (e.g., to NelNet) without proper notification, and they did not receive timely or adequate communication about repayment obligations.\n\n3. **Difficulty with Payment Plans and Account Access:** Borrowers experienced technical issues, such as being locked out of online accounts or receiving inconsistent or incorrect account information, which hindered their ability to stay current on payments.\n\n4. **Accumulation of Interest and Loan Management Challenges:** Many bo

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had....more than one query!

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how is it to set-up? Not bad! Let's see it down below!



In [33]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [34]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [35]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be problems with how lenders or servicers handle borrower accounts. This includes errors or discrepancies in loan balances, interest calculations, and account statuses, as well as misapplication of payments, incorrect reporting to credit bureaus, and inadequate communication or lack of transparency about loan terms, transfer of servicing, or account updates. Many complaints highlight issues such as unauthorized interest accrual, mismanagement during loan transfers, errors in credit reporting, and poor communication regarding loan modifications or statuses.'

In [36]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, some complaints did not get handled in a timely manner. Several examples include:\n\n- A complaint regarding a student loan application that remained unprocessed for over 16 months without resolution or communication from the company.\n- Multiple instances where consumers reported that their issues, such as disputes over credit reporting, loan discharges, or account errors, were not addressed promptly, with follow-ups indicating delays of over a year or more.\n- A specific complaint highlighted that a complaint filed with the CFPB had not received a response after over 16 months.\n- Another complaint noted a delay of nearly 18 months with no resolution despite multiple follow-ups.\n\nWhile some complaints were marked as "timely response? Yes," the narratives suggest that many consumers experienced significant delays or lack of response over extended periods.'

In [37]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans primarily due to a combination of systemic issues and servicing practices, including:\n\n1. **Misleading or Bad Information from Servicers:** Some borrowers were not properly informed about payment schedules, delinquency status, or available repayment options such as income-driven plans or rehabilitation. They relied on servicers' guidance, which often led to unfavorable outcomes like long-term forbearances or improper consolidations, resulting in increased balances and interest.\n\n2. **Forbearance Steering and Lack of Alternatives:** Borrowers reported being repeatedly placed into forbearance without adequate notification or explanation of the consequences, such as interest capitalization and reset of forgiveness timelines. When forbearances expired, they faced pressure to consolidate into larger loans or make unaffordable payments, often under false urgency, which contributed to default or financial hardship.\n\n3. **Errors and Inaccurate Infor

In [38]:
from langchain_core.callbacks import CallbackManagerForRetrieverRun

# You must pass a dummy run manager
run_manager = CallbackManagerForRetrieverRun.get_noop_manager()

query = "What is the most common issue with loans?"
reformulated_queries = multi_query_retriever.generate_queries(query, run_manager)

print("🔁 Reformulated Queries:")
for q in reformulated_queries:
    print("-", q)

🔁 Reformulated Queries:
- What are the leading problems borrowers face with loans?  
- Which issues are most frequently encountered in loan agreements?  
- What are the primary challenges associated with obtaining or managing loans?


In [39]:
from langchain_core.callbacks import CallbackManagerForRetrieverRun

# You must pass a dummy run manager
run_manager = CallbackManagerForRetrieverRun.get_noop_manager()

query = "What is the most common issue with loans?"
reformulated_queries = multi_query_retriever.generate_queries(query, run_manager)

print("🔁 Reformulated Queries:")
for q in reformulated_queries:
    print("-", q)

🔁 Reformulated Queries:
- 1. What are the typical problems borrowers face with loans?  
- 2. Which issues tend to occur most frequently during the loan repayment process?  
- 3. What are the main challenges or complications commonly associated with loans?


#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

##### ✅ Answer:

Generating multiple reformulations of a user query can improve **recall** by increasing the likelihood of retrieving all relevant documents, even if they use different wording or phrasing than the original query.

Here’s how it helps:

1. **Overcomes vocabulary mismatch**  
   Reformulations can use synonyms or alternative expressions, helping the retriever match documents that might not share exact words with the original query.

2. **Expands semantic coverage**  
   Each reformulated query may emphasize a different aspect of the original intent, allowing the retriever to surface diverse but relevant content.

3. **Mitigates retriever sensitivity to phrasing**  
   Embedding-based or sparse retrieval models often perform inconsistently when queries are phrased differently. Multiple reformulations reduce the risk of missing relevant results due to one poorly-phrased query.

4. **Aggregates broader evidence**  
   By combining documents retrieved from all reformulations, the system builds a more comprehensive context pool for answering the original question.

✅ As a result, the multi-query approach improves the **robustness and completeness** of Retrieval-Augmented Generation (RAG) systems by reducing missed relevant content.


## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [40]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [41]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [42]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [43]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [44]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [45]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be problems related to federal student loan servicing. These include errors in loan balances, misapplied payments, wrongful denials of payment plans, and issues with loan account management such as incorrect information on credit reports, disputes over interest rates, and problems caused by loan transfers and servicing practices. There are multiple complaints highlighting systemic breakdowns, misreporting, and disputes over terms and account status.'

In [46]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all the complaints that are explicitly marked as "Timely response?" with "No" indicate that they were not handled in a timely manner. Specifically, at least two complaints involving Mohela (Complaint IDs: 12709087 and 12935889) mention that they did not receive a response within the expected timeframe. These complaints had delayed responses, with the complaint at row 441 (Complaint ID: 12709087) explicitly stating "Timely response?":"No".\n\nTherefore, yes, some complaints did not get handled in a timely manner.'

In [47]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People often failed to pay back their loans due to a variety of circumstances, including financial hardship, lack of proper information or transparency about their loans, and issues related to loan management. For example, some individuals experienced severe financial difficulties after graduation, making it difficult or impossible to make payments. Others faced problems with their loan servicers, such as being unaware of when payments were due, not being properly informed about changes in loan ownership, or encountering administrative errors like incorrect reporting or failure to notify them about payment obligations. Additionally, some borrowers were misled by educational institutions, leading to taking on loans they were unprepared to repay, especially when the quality of education or employment prospects did not meet expectations.'

Overall, the performance *seems* largely the same. We can leverage a tool like [Ragas]() to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes 2, or more, retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [48]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [49]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [50]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be mismanagement and misinformation by loan servicers. Specific recurring problems include:\n\n- Errors and inaccuracies in loan balances and interest calculations.\n- Failure to properly notify borrowers about changes such as loan transfers, repayments, or default status.\n- Inability to obtain clear or correct information about loan terms, balances, and repayment options.\n- Issues related to improper handling of deferments, forbearances, and repayment plans.\n- Unclear or incorrect reporting on credit reports, leading to negative impacts on credit scores.\n- Problems with loan forgiveness, cancellation, or discharge processes.\n- Unresponsive or unhelpful customer service providing vague or misleading information.\n- Unauthorized or unnotified changes to loan status and increased interest due to mismanagement.\n\nOverall, the pattern indicates that a core issue is the mishandling of loan data and communi

In [51]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, some complaints were not handled in a timely manner. Specifically, there are multiple instances where responses from the companies were late or delayed beyond the expected timeframes. For example:\n\n- Complaint ID 12709087 (submitted 03/28/25): The response was marked as "N/A" (disputed or not responded to) and "Dispute? N/A," indicating it was not addressed timely.\n- Complaint ID 12935889 (submitted 04/11/25): Response was "No" for timely response.\n- Complaint ID 13056764 (submitted 04/18/25): The response was marked "Yes" for responsiveness, but the complaint details indicate prior delays.\n- Complaint ID 13205525 (submitted 04/27/25): Response was "Yes" for timely response, but the complainant states ongoing unresolved issues.\n\nOverall, there is evidence that some complaints, especially regarding inconsistent or delayed responses, were not handled promptly.'

In [52]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for several reasons, including:\n\n1. **Lack of communication and notification:** Many borrowers were not properly notified about when payments were due, changes in loan servicing, or account status updates, leading to missed payments and delinquency.\n\n2. **Complex or mismanaged loan information:** Borrowers faced confusion due to inconsistent or incorrect account balances, mismatched numbers, and lack of clear documentation about their loans, interest, and repayment status.\n\n3. **Limited or unhelpful repayment options:** Borrowers were often only offered forbearance or deferment without adequate explanations of the long-term consequences, such as interest accumulation, which increased the total amount owed.\n\n4. **Unfair or predatory servicing practices:** Servicers steered borrowers into long-term forbearances, failed to inform them about income-driven repayment plans or forgiveness options, and improperly reported delinquencies or late pay

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences based on their semantic similarity based on a number of [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example which will:

Calculate all distances between sentences, and then break apart sequences of setences that exceed a given percentile among all distances.

In [53]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [54]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [55]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [56]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [57]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [58]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints, the most common issue with loans appears to be problems related to loan servicing and communication. Many complaints highlight issues such as:\n\n- Struggling to repay or confusion about payment plans and amounts\n- Lack of transparency and miscommunication from servicers\n- Problems with loan account information being inaccurate, default statuses, or improper reporting\n- Difficulties in obtaining proper verification or documentation\n- Unauthorized or improper reporting of debt or account status\n\nWhile the specific "most common issue" isn\'t explicitly quantified in the data, the recurring themes suggest that the most frequent problem involves **mismanagement and miscommunication by loan servicers, leading to confusion, incorrect account statuses, and difficulties in repayment efforts**.  \n\nIf you need a more precise answer, I would say:  \n**The most common issue with loans is mismanagement by servicers, including inaccurate reporting, poor com

In [59]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes, several complaints in the provided data indicate that they were handled in a timely manner, with responses marked as "Yes" under the "Timely response?" field. However, the complaint about "Received bad information about your loan" (Complaint ID: 13331376) involving Nelnet, Inc. was closed with an explanation, which suggests that their issue was addressed, although the details of whether it was resolved to the complainant\'s satisfaction are not specified.\n\nMost other complaints, such as those regarding trouble with payments, disputes about account information, or issues with service, also received timely responses. Since the context does not explicitly mention any complaints that were not handled in time, the available data suggests that, according to the records, all complaints were responded to promptly.\n\nIf the question pertains to whether any complaints were left unresolved or delayed beyond a reasonable period, the records do not indicate such a case. Therefore, based on

In [60]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for various reasons, including:\n\n- Receiving bad or unclear information about their loans, such as being told they are in forbearance until a future date without written confirmation.\n- Difficulties with loan servicing processes, such as delays, miscommunication, or alleged stalling tactics by lenders or servicers.\n- Problems with the loan documentation or paperwork, which can lead to rejection or delays in loan forgiveness or discharge processes.\n- Disputes over the legitimacy or status of the loans, including claims that loans are illegally reported, or that they have been transferred or mishandled improperly.\n- Financial hardships or misunderstandings about payment obligations, especially after changes like the end of COVID-19 forbearance or re-amortization issues.\n- Errors or misreporting of loan status, which can negatively impact credit scores and result in default status being assigned mistakenly.\n- In some cases, legal or privacy i

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

##### ✅ Answer:

Semantic chunking, which groups text based on meaning rather than arbitrary length, may behave suboptimally with short and repetitive sentences (as in FAQs) due to the following:

- **Over-segmentation:** Each FAQ entry may be treated as its own standalone semantic unit, resulting in too many small chunks.
- **Redundancy:** The high degree of repetition (e.g., similar questions phrased slightly differently) can confuse clustering or embedding models, reducing retrieval quality.
- **Context loss:** Since each chunk is short, meaningful context may be lost, making it harder for the retriever to disambiguate similar queries.

##### 🔧 Suggested Adjustments:

1. **Concatenate Related Sentences:** Manually or heuristically group FAQs with similar themes into larger composite chunks before embedding.
2. **Use Metadata:** Attach question IDs, tags, or categories to enrich semantic meaning and aid disambiguation.
3. **Adjust Chunking Parameters:** Lower the sensitivity of semantic boundaries or increase the minimum chunk size.
4. **Apply Deduplication or Clustering:** Use semantic similarity (e.g., cosine similarity) to merge or group near-duplicate FAQs prior to chunking.
5. **Hybrid Approach:** Combine semantic chunking with a rule-based or sliding window technique to ensure minimum context length.

These strategies help mitigate fragmentation and enhance recall by ensuring more informative and discriminative chunks.


# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against eachother.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparision between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

  ##### HINTS:

- LangSmith provides detailed information about latency and cost.

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

In [61]:
### YOUR CODE HERE
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"Session 09 - Advanced Retrieval"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API Key: ")

In [103]:
import random
import pandas as pd
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)
from langchain.callbacks import get_openai_callback

# Reproducibility
random.seed(42)

# LLM + embedding setup
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0.7))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

# Query types
query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]

# Generator init
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# Generate
print("Generating synthetic dataset...")
with get_openai_callback() as cb:
    dataset = generator.generate_with_langchain_docs(
        loan_complaint_data[:50],
        testset_size=10,
        query_distribution=query_distribution,
    )
    print(f"💰 Tokens used: {cb.total_tokens} | Cost: ${cb.total_cost:.4f}")



Generating synthetic dataset...


Applying SummaryExtractor:   0%|          | 0/31 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/50 [00:00<?, ?it/s]

Node 054e7961-1b3d-448b-bbcd-2568a475c7de does not have a summary. Skipping filtering.
Node 5df2fecb-31a9-4e4c-8aab-a2f372788b9c does not have a summary. Skipping filtering.
Node 9072337c-cac8-4b96-a09e-517f749fade3 does not have a summary. Skipping filtering.
Node 353ab4e2-4835-49b7-acd6-6470d7d53ede does not have a summary. Skipping filtering.
Node 35fb8ec3-3b67-48f9-97e2-03d359f3b397 does not have a summary. Skipping filtering.
Node 15f3a024-d455-4c4a-a106-51968f9c4021 does not have a summary. Skipping filtering.
Node ab1a1371-5e2c-43dd-af5f-0e1a8b1ca8cc does not have a summary. Skipping filtering.
Node d76f9cdc-32ae-4835-971e-de8feb1fe72d does not have a summary. Skipping filtering.
Node 333ad8e2-afb1-4f32-9ac9-dd279d9854e1 does not have a summary. Skipping filtering.
Node 783765cb-db2c-46fc-bd8e-2ba23a1330b7 does not have a summary. Skipping filtering.
Node 7c2bbd43-386f-4408-819d-12b46a5bcf5b does not have a summary. Skipping filtering.
Node cbf1b887-204b-4f20-a539-a6e80d8aac27 d

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/131 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

💰 Tokens used: 111244 | Cost: $0.3606
✅ Generated 11 synthetic test cases


TypeError: 'Testset' object is not subscriptable

In [115]:
import pandas as pd

df = dataset.to_pandas()
pd.set_option('display.max_colwidth', 300)
display(df)

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,How has Nelnet's handling of the re-amortization process affected borrowers following the end of the federal student loan COVID-19 forbearance program?,"[The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my pa...","Nelnet's handling of the re-amortization process has resulted in a significant increase in payment amounts for borrowers, with payments nearly doubling from $180.00 per month to $360.00 per month. This change occurred only recently, despite the federal student loan COVID-19 forbearance program e...",single_hop_specifc_query_synthesizer
1,Why is my Income-Driven Repayment amount incorrect despite submitting all required documents?,"[I submitted my annual Income-Driven Repayment ( IDR ) recertification to Aidvantage on time and in full, using the correct process. My income is approximately $ XXXX/year, which is below 150 % of the federal poverty guideline. Under the SAVE Plan or any valid IDR/IBR plan , this income level qu...","Aidvantage assigned an incorrect repayment amount that does not align with the income level provided, which qualifies for a lower payment under the SAVE Plan or any valid IDR/IBR plan. Despite submitting the recertification on time and in full, Aidvantage has not processed the application correc...",single_hop_specifc_query_synthesizer
2,Why is Studentaid.gov not providing clear information about my loan servicer?,"[According to Studentaid.gov, Im to get an email and a letter of notice that my servicer is being changed. studentaid.gov says that my issuer is nelnet. Nelnet says my issuer is somewhere else. With that said, the only reason i would have know would be because i am trying to make a payment. Now ...","According to Studentaid.gov, you should receive an email and a letter of notice about your servicer being changed. However, there is confusion as Studentaid.gov indicates your issuer is Nelnet, while Nelnet claims it is somewhere else. This lack of clear communication is causing issues with know...",single_hop_specifc_query_synthesizer
3,Wut is the FCRA?,"[I am writing to formally dispute inaccurate information on my credit report. This dispute is submitted under the Fair Credit Reporting Act ( FCRA ), 15 U.S.C. 16811, which requires credit reporting agencies to conduct a reasonable reinvestigation of disputed items.\n\nRecent findings by the Con...","The FCRA, or Fair Credit Reporting Act, is a federal law that requires credit reporting agencies to conduct a reasonable reinvestigation of disputed items on a credit report.",single_hop_specifc_query_synthesizer
4,Why Aid Avantage not tell me about my loans?,[I am devastated. I would like to report a situation with my student loans. I recently found out my loans were in default. I have never been in default with my student loans. They notified me on my email instead to mail that I was delinquent. Needless to say I had nine accounts that went 90 days...,"The individual was notified via email instead of mail that they were delinquent, which led to their loans going into default without their prior knowledge.",single_hop_specifc_query_synthesizer
5,"How can a student manage their loan account issues with Aidvantage and address a delinquency reported by EdFinancial Services, considering the need for documentation and the request for delinquency removal?","[<1-hop>\n\nTo Whom It May Concern, I am writing to formally respond to your recent correspondence regarding my loan account with Aidvantage. While I appreciate your prompt acknowledgment of my concerns, I must reiterate that during my phone call on XX/XX/2025, the customer service representativ...","To manage loan account issues with Aidvantage, a student should request all future communications in writing to maintain an accurate record of interactions. They should also request specific documentation, such as a comprehensive ledger of all emails and notifications, and a copy of the contract...",multi_hop_abstract_query_synthesizer
6,How do federal student loan regulations relate to EdFinancial's mismanagement and privacy violations?,"[<1-hop>\n\nTo Whom It May Concern, I am writing to formally request verification of the student loan ( s ) currently being serviced under my name. According to my review of my account on StudentAid.gov and communications with your office, I have been unable to locate or obtain a copy of the sig...","Federal student loan regulations are designed to protect borrowers by ensuring fair administration of loans and safeguarding personal information. In the context provided, EdFinancial's mismanagement is evident through their failure to honor in-school deferment rights, as mandated by federal reg...",multi_hop_abstract_query_synthesizer
7,How do federal student loan regulations relate to privacy violations and compromised personal information in the context of a loan servicer's data breach?,"[<1-hop>\n\nTo Whom It May Concern, I am writing to formally request verification of the student loan ( s ) currently being serviced under my name. According to my review of my account on StudentAid.gov and communications with your office, I have been unable to locate or obtain a copy of the sig...","Federal student loan regulations, such as the Fair Debt Collection Practices Act, entitle borrowers to receive documentation verifying their debt, including the signed Master Promissory Note. However, a data breach involving entities affiliated with the Department of Education led to unauthorize...",multi_hop_abstract_query_synthesizer
8,"How does the Fair Credit Reporting Act (FCRA) protect individuals from unauthorized access to private information and ensure accurate credit reporting, especially in cases where companies like DOGE and EDFinancial Services are involved?","[<1-hop>\n\nDOGE improperly accessed my private information. Per FCRA 611 ( a ) ( 1 ) ( a ), I have the right to have my private data remain private, as I did not give DOGE authorization to access my student loan information., <2-hop>\n\nI am writing to formally dispute inaccurate information on...","The Fair Credit Reporting Act (FCRA) provides protections for individuals by ensuring that their private information remains confidential unless authorization is given, as highlighted in FCRA 611(a)(1)(a). This is relevant in cases where companies like DOGE improperly access private information ...",multi_hop_specific_query_synthesizer
9,What are the implications of FDCPA violations by loan servicers in the context of illegal student loan reporting and collections?,"[<1-hop>\n\nThis account was transferred to Nelnet ( NNI ) from XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX ; who had engaged in acts of deception and made numerous errors on the account, as well as having broken numerous consumer protection laws, including the FDCPA, FCRA, TCPA, Indiana Senior Cons...","The implications of FDCPA violations by loan servicers in the context of illegal student loan reporting and collections include potential legal actions against the servicers for engaging in unfair and deceptive practices. In the provided context, the loan servicer continued to report and collect...",multi_hop_specific_query_synthesizer


In [116]:
# Save the test dataset to avoid regenerating it
dataset.to_pandas().to_csv("test_dataset.csv", index=False)
print("Test dataset saved to test_dataset.csv")


Test dataset saved to test_dataset.csv


In [164]:
import os
import time
import copy
import pandas as pd
from ragas.evaluation import EvaluationDataset
from ragas import evaluate, RunConfig
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity,
)
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
from langchain.callbacks import get_openai_callback


def evaluate_retriever(name: str, pipeline, dataset):
    print(f"📊 Running evaluation for: {name}")

    test_dataset = copy.deepcopy(dataset)

    latencies = []
    total_tokens = 0
    total_cost = 0
    prompt_number = 1

    overall_start_time = time.time()

    for test_row in test_dataset:
        question = test_row.eval_sample.user_input
        start_time = time.time()
        with get_openai_callback() as cb:
            response = pipeline.invoke({"question": question})
        
        latency = time.time() - start_time
        latencies.append(latency)
        total_tokens += cb.total_tokens
        total_cost += cb.total_cost

        answer = str(response["response"])
        test_row.eval_sample.response = answer
        test_row.eval_sample.retrieved_contexts = [doc.page_content for doc in response["context"]]

        print(f"Prompt {prompt_number} | ⏱️ {latency:.2f}s | Tokens {cb.total_tokens} | Cost {cb.total_cost:.6f}")
        prompt_number += 1

    total_duration = time.time() - overall_start_time
    avg_latency = sum(latencies) / len(latencies) if latencies else 0
    avg_tokens_per_query = total_tokens / len(test_dataset)


    print(f"\n📊 Summary for {name}:")
    print(f"🔢 Total tokens: {total_tokens}")
    print(f"⏱️ Avg latency: {avg_latency:.2f}s")
    print(f"🔢 Avg tokens per query: {avg_tokens_per_query:.2f}")
    print(f"⏱️ Total duration: {total_duration:.2f}s")
    print(f"💰 Total cost: {total_cost:.6f}")

    evaluator_model = "gpt-4.1-nano"
    evaluation_dataset = EvaluationDataset.from_pandas(test_dataset.to_pandas())
    evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=evaluator_model, temperature=0))

    with get_openai_callback() as cb:
        result = evaluate(
            experiment_name=name,
            dataset=evaluation_dataset,
            metrics=[
                LLMContextRecall(),
                Faithfulness(),
                FactualCorrectness(),
                ResponseRelevancy(),
                ContextEntityRecall(),
                NoiseSensitivity(),
            ],
            llm=evaluator_llm,
            run_config=RunConfig(timeout=300)
        )

    evaluation_cost = cb.total_cost
    evaluation_tokens = cb.total_tokens

    print(f"Evaluation LLM model: {evaluator_model}")
    print('Evaluation results:', result)
    print('Evaluation tokens:', evaluation_tokens)
    print(f'Evaluation cost: {evaluation_cost:.6f}')

    return {
        "name": name,
        "summary": {
            "total_runtime": round(total_duration, 2),
            "avg_latency": round(avg_latency, 2),
            "total_tokens": int(total_tokens),
            "total_queries": len(test_dataset),
            "total_cost": round(total_cost, 6),
            "avg_tokens_per_query": round(avg_tokens_per_query, 2),
        },
        "latencies": [round(lat, 2) for lat in latencies],
        "evaluation_result": result,
        "evaluation_cost": {
            "eval_tokens": evaluation_tokens,
            "eval_cost": round(evaluation_cost, 6),
        }
    }



In [165]:
from ragas.testset import Testset

subset = Testset(list(dataset)[:1])
evaluate_retriever("naive_retriever_test", naive_retrieval_chain, subset)


📊 Running evaluation for: naive_retriever_test
Prompt 1 | ⏱️ 4.93s | Tokens 6087 | Cost 0.000239

📊 Summary for naive_retriever_test:
🔢 Total tokens: 6087
⏱️ Avg latency: 4.93s
🔢 Avg tokens per query: 6087.00
⏱️ Total duration: 4.93s
💰 Total cost: 0.000239


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

Exception raised in Job[5]: ValueError(setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10,) + inhomogeneous part.)


Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 1.0000, 'faithfulness': 1.0000, 'factual_correctness(mode=f1)': 0.7600, 'answer_relevancy': 0.9520, 'context_entity_recall': 0.1429, 'noise_sensitivity(mode=relevant)': nan}
Evaluation tokens: 62802
Evaluation cost: 0.012510


{'name': 'naive_retriever_test',
 'summary': {'total_runtime': 4.93,
  'avg_latency': 4.93,
  'total_tokens': 6087,
  'total_queries': 1,
  'total_cost': 0.000239,
  'avg_tokens_per_query': 6087.0},
 'latencies': [4.93],
 'evaluation_result': {'context_recall': 1.0000, 'faithfulness': 1.0000, 'factual_correctness(mode=f1)': 0.7600, 'answer_relevancy': 0.9520, 'context_entity_recall': 0.1429, 'noise_sensitivity(mode=relevant)': nan},
 'evaluation_cost': {'eval_tokens': 62802, 'eval_cost': 0.01251}}

In [166]:
# def run_all_retriever_evaluations(pipelines: dict, dataset):
#     """
#     Runs evaluation across multiple retriever pipelines.

#     Args:
#         pipelines (dict): Dictionary where key is a pipeline name and value is the retriever chain.
#         dataset: Ragas synthetic test set.

#     Returns:
#         dict: A dictionary of results per pipeline.
#     """
#     results = {}
#     for name, pipeline in pipelines.items():
#         print(f"\n🚀 Starting evaluation for: {name}")
#         results[name] = evaluate_retriever(name, pipeline, dataset)
#     return results


In [167]:
naive_evaluation = evaluate_retriever("Naive", naive_retrieval_chain, dataset)

📊 Running evaluation for: Naive
Prompt 1 | ⏱️ 6.72s | Tokens 6106 | Cost 0.000679
Prompt 2 | ⏱️ 9.32s | Tokens 6187 | Cost 0.000749
Prompt 3 | ⏱️ 5.97s | Tokens 6484 | Cost 0.000742
Prompt 4 | ⏱️ 3.04s | Tokens 8763 | Cost 0.000908
Prompt 5 | ⏱️ 6.01s | Tokens 6109 | Cost 0.000697
Prompt 6 | ⏱️ 6.16s | Tokens 8484 | Cost 0.001007
Prompt 7 | ⏱️ 6.42s | Tokens 8793 | Cost 0.000980
Prompt 8 | ⏱️ 6.26s | Tokens 6158 | Cost 0.000687
Prompt 9 | ⏱️ 7.96s | Tokens 8859 | Cost 0.001012
Prompt 10 | ⏱️ 5.79s | Tokens 6802 | Cost 0.000729
Prompt 11 | ⏱️ 5.70s | Tokens 8476 | Cost 0.000953

📊 Summary for Naive:
🔢 Total tokens: 81221
⏱️ Avg latency: 6.30s
🔢 Avg tokens per query: 7383.73
⏱️ Total duration: 69.34s
💰 Total cost: 0.009144


Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

Exception raised in Job[29]: ValueError(setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10,) + inhomogeneous part.)
Exception raised in Job[4]: TimeoutError()
Exception raised in Job[28]: TimeoutError()
Exception raised in Job[34]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[53]: OutputParserException(Invalid json output: {
    "statements": [
        {
          "statement": "The \\"Fair Credit Reporting Act\\" (FCRA) provides protections to individuals regarding access to private information and accurate credit reporting.",
          "reason": "The context explicitly states that the FCRA provides protections related to access to private information and accurate credit reporting.",
          "verdict": 1
        },
        {
          "statement": "The protections are especially relevant when companies such as DOGE and 

Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 0.8727, 'faithfulness': 1.0000, 'factual_correctness(mode=f1)': 0.7855, 'answer_relevancy': 0.9421, 'context_entity_recall': 0.5169, 'noise_sensitivity(mode=relevant)': 0.0618}
Evaluation tokens: 727682
Evaluation cost: 0.140503


In [168]:
bm25_evaluation = evaluate_retriever("BM25", bm25_retrieval_chain, dataset)

📊 Running evaluation for: BM25
Prompt 1 | ⏱️ 4.19s | Tokens 4345 | Cost 0.000498
Prompt 2 | ⏱️ 4.72s | Tokens 3128 | Cost 0.000397
Prompt 3 | ⏱️ 2.57s | Tokens 2481 | Cost 0.000323
Prompt 4 | ⏱️ 1.31s | Tokens 2737 | Cost 0.000297
Prompt 5 | ⏱️ 3.71s | Tokens 5971 | Cost 0.000665
Prompt 6 | ⏱️ 6.00s | Tokens 4406 | Cost 0.000586
Prompt 7 | ⏱️ 1.51s | Tokens 3169 | Cost 0.000372
Prompt 8 | ⏱️ 3.28s | Tokens 3722 | Cost 0.000456
Prompt 9 | ⏱️ 4.51s | Tokens 3486 | Cost 0.000470
Prompt 10 | ⏱️ 3.19s | Tokens 4637 | Cost 0.000592
Prompt 11 | ⏱️ 4.37s | Tokens 3425 | Cost 0.000433

📊 Summary for BM25:
🔢 Total tokens: 41507
⏱️ Avg latency: 3.58s
🔢 Avg tokens per query: 3773.36
⏱️ Total duration: 39.38s
💰 Total cost: 0.005088


Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

Exception raised in Job[5]: ValueError(setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (4,) + inhomogeneous part.)
Exception raised in Job[53]: ValueError(operands could not be broadcast together with shapes (21,) (1,20) )
Exception raised in Job[28]: TimeoutError()
Exception raised in Job[58]: TimeoutError()


Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 1.0000, 'faithfulness': 0.9770, 'factual_correctness(mode=f1)': 0.7745, 'answer_relevancy': 0.6793, 'context_entity_recall': 0.4873, 'noise_sensitivity(mode=relevant)': 0.1713}
Evaluation tokens: 435078
Evaluation cost: 0.080423


In [169]:
multi_query_evaluation = evaluate_retriever("Multi-Query", multi_query_retrieval_chain, dataset)

📊 Running evaluation for: Multi-Query
Prompt 1 | ⏱️ 9.72s | Tokens 10065 | Cost 0.001108
Prompt 2 | ⏱️ 12.68s | Tokens 15133 | Cost 0.001681
Prompt 3 | ⏱️ 7.98s | Tokens 15590 | Cost 0.001717
Prompt 4 | ⏱️ 2.16s | Tokens 12249 | Cost 0.001269
Prompt 5 | ⏱️ 4.01s | Tokens 9471 | Cost 0.001009
Prompt 6 | ⏱️ 8.26s | Tokens 15456 | Cost 0.001752
Prompt 7 | ⏱️ 6.64s | Tokens 13795 | Cost 0.001508
Prompt 8 | ⏱️ 9.88s | Tokens 12452 | Cost 0.001397
Prompt 9 | ⏱️ 7.84s | Tokens 14536 | Cost 0.001595
Prompt 10 | ⏱️ 7.62s | Tokens 11128 | Cost 0.001265
Prompt 11 | ⏱️ 14.27s | Tokens 14741 | Cost 0.001627

📊 Summary for Multi-Query:
🔢 Total tokens: 144616
⏱️ Avg latency: 8.28s
🔢 Avg tokens per query: 13146.91
⏱️ Total duration: 91.06s
💰 Total cost: 0.015926


Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

Exception raised in Job[7]: OutputParserException(Invalid json output: The provided output does not satisfy the schema constraints. Here is the fixed output in the required JSON format:

{
  "text": "{\n  \"statements\": [\n    {\n      \"statement\": \"The answer begins with a statement indicating that the information provided is the basis for the reasons listed.\",\n      \"reason\": \"The detailed explanation confirms that the initial statement summarizes the reasons for the incorrect billing amounts.\",\n      \"verdict\": 1\n    },\n    {\n      \"statement\": \"Multiple reasons are presented to explain why the Income-Driven Repayment amount might be incorrect despite submitting all required documents.\",\n      \"reason\": \"The detailed reasons include processing delays, miscommunication, technical issues, and policy delays, which support the statement.\",\n      \"verdict\": 1\n    },\n    {\n      \"statement\": \"The first reason states that processing delays or errors by ser

Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 1.0000, 'faithfulness': 1.0000, 'factual_correctness(mode=f1)': 0.8736, 'answer_relevancy': 0.7661, 'context_entity_recall': 0.3929, 'noise_sensitivity(mode=relevant)': 0.1667}
Evaluation tokens: 840800
Evaluation cost: 0.158888


In [170]:
parent_document_evaluation = evaluate_retriever("Parent-Document", parent_document_retrieval_chain, dataset)

📊 Running evaluation for: Parent-Document
Prompt 1 | ⏱️ 10.60s | Tokens 3365 | Cost 0.000401
Prompt 2 | ⏱️ 4.71s | Tokens 3198 | Cost 0.000408
Prompt 3 | ⏱️ 2.60s | Tokens 3225 | Cost 0.000394
Prompt 4 | ⏱️ 0.92s | Tokens 4735 | Cost 0.000498
Prompt 5 | ⏱️ 2.32s | Tokens 4467 | Cost 0.000494
Prompt 6 | ⏱️ 9.90s | Tokens 5408 | Cost 0.000704
Prompt 7 | ⏱️ 5.18s | Tokens 4008 | Cost 0.000488
Prompt 8 | ⏱️ 6.56s | Tokens 3859 | Cost 0.000494
Prompt 9 | ⏱️ 3.46s | Tokens 4089 | Cost 0.000471
Prompt 10 | ⏱️ 2.98s | Tokens 3802 | Cost 0.000486
Prompt 11 | ⏱️ 2.03s | Tokens 4369 | Cost 0.000504

📊 Summary for Parent-Document:
🔢 Total tokens: 44525
⏱️ Avg latency: 4.66s
🔢 Avg tokens per query: 4047.73
⏱️ Total duration: 51.26s
💰 Total cost: 0.005343


Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

Exception raised in Job[46]: TimeoutError()


Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 0.9091, 'faithfulness': 1.0000, 'factual_correctness(mode=f1)': 0.8500, 'answer_relevancy': 0.8454, 'context_entity_recall': 0.3595, 'noise_sensitivity(mode=relevant)': 0.1780}
Evaluation tokens: 444213
Evaluation cost: 0.080766


In [171]:
contextual_compression_evaluation = evaluate_retriever("Contextual Compression", contextual_compression_retrieval_chain, dataset)

📊 Running evaluation for: Contextual Compression
Prompt 1 | ⏱️ 6.43s | Tokens 2554 | Cost 0.000324
Prompt 2 | ⏱️ 14.21s | Tokens 2453 | Cost 0.000360
Prompt 3 | ⏱️ 2.52s | Tokens 2455 | Cost 0.000287
Prompt 4 | ⏱️ 2.11s | Tokens 4180 | Cost 0.000447
Prompt 5 | ⏱️ 3.37s | Tokens 2106 | Cost 0.000269
Prompt 6 | ⏱️ 7.06s | Tokens 3701 | Cost 0.000500
Prompt 7 | ⏱️ 5.01s | Tokens 3398 | Cost 0.000433
Prompt 8 | ⏱️ 7.76s | Tokens 2417 | Cost 0.000320
Prompt 9 | ⏱️ 2.93s | Tokens 3142 | Cost 0.000384
Prompt 10 | ⏱️ 2.90s | Tokens 2576 | Cost 0.000350
Prompt 11 | ⏱️ 5.88s | Tokens 3576 | Cost 0.000459

📊 Summary for Contextual Compression:
🔢 Total tokens: 32558
⏱️ Avg latency: 5.47s
🔢 Avg tokens per query: 2959.82
⏱️ Total duration: 60.18s
💰 Total cost: 0.004131


Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 0.8727, 'faithfulness': 0.9935, 'factual_correctness(mode=f1)': 0.7873, 'answer_relevancy': 0.8476, 'context_entity_recall': 0.4667, 'noise_sensitivity(mode=relevant)': 0.0675}
Evaluation tokens: 379014
Evaluation cost: 0.069740


In [172]:
ensemble_evaluation = evaluate_retriever("Ensemble", ensemble_retrieval_chain, dataset)

📊 Running evaluation for: Ensemble
Prompt 1 | ⏱️ 12.37s | Tokens 17324 | Cost 0.001813
Prompt 2 | ⏱️ 12.59s | Tokens 17048 | Cost 0.001857
Prompt 3 | ⏱️ 7.50s | Tokens 18319 | Cost 0.001926
Prompt 4 | ⏱️ 5.45s | Tokens 19940 | Cost 0.002067
Prompt 5 | ⏱️ 6.67s | Tokens 16565 | Cost 0.001733
Prompt 6 | ⏱️ 15.31s | Tokens 20395 | Cost 0.002291
Prompt 7 | ⏱️ 8.68s | Tokens 18982 | Cost 0.002026
Prompt 8 | ⏱️ 10.43s | Tokens 14550 | Cost 0.001573
Prompt 9 | ⏱️ 12.02s | Tokens 18452 | Cost 0.002049
Prompt 10 | ⏱️ 9.28s | Tokens 17457 | Cost 0.001887
Prompt 11 | ⏱️ 8.60s | Tokens 22957 | Cost 0.002441

📊 Summary for Ensemble:
🔢 Total tokens: 201989
⏱️ Avg latency: 9.90s
🔢 Avg tokens per query: 18362.64
⏱️ Total duration: 108.90s
💰 Total cost: 0.021664


Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

Exception raised in Job[5]: TimeoutError()
Exception raised in Job[10]: TimeoutError()
Exception raised in Job[11]: TimeoutError()
Exception raised in Job[16]: TimeoutError()
Exception raised in Job[17]: TimeoutError()
Exception raised in Job[22]: TimeoutError()
Exception raised in Job[34]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[40]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[47]: TimeoutError()
Exception raised in Job[53]: TimeoutError()
Exception raised in Job[59]: TimeoutError()
Exception raised in Job[64]: TimeoutError()
Exception raised in Job[65]: TimeoutError()


Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 1.0000, 'faithfulness': 1.0000, 'factual_correctness(mode=f1)': 0.7427, 'answer_relevancy': 0.8460, 'context_entity_recall': 0.3198, 'noise_sensitivity(mode=relevant)': 0.1538}
Evaluation tokens: 1106277
Evaluation cost: 0.205831


In [173]:
semantic_evaluation = evaluate_retriever("Semantic", semantic_retrieval_chain, dataset)

📊 Running evaluation for: Semantic
Prompt 1 | ⏱️ 4.89s | Tokens 6099 | Cost 0.000675
Prompt 2 | ⏱️ 4.01s | Tokens 6700 | Cost 0.000762
Prompt 3 | ⏱️ 5.02s | Tokens 6306 | Cost 0.000694
Prompt 4 | ⏱️ 1.14s | Tokens 6789 | Cost 0.000703
Prompt 5 | ⏱️ 1.74s | Tokens 5440 | Cost 0.000614
Prompt 6 | ⏱️ 7.06s | Tokens 7516 | Cost 0.000896
Prompt 7 | ⏱️ 10.28s | Tokens 6557 | Cost 0.000773
Prompt 8 | ⏱️ 3.15s | Tokens 6231 | Cost 0.000725
Prompt 9 | ⏱️ 6.60s | Tokens 7660 | Cost 0.000893
Prompt 10 | ⏱️ 15.14s | Tokens 7407 | Cost 0.000875
Prompt 11 | ⏱️ 4.39s | Tokens 6508 | Cost 0.000758

📊 Summary for Semantic:
🔢 Total tokens: 73213
⏱️ Avg latency: 5.77s
🔢 Avg tokens per query: 6655.73
⏱️ Total duration: 63.42s
💰 Total cost: 0.008367


Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

Exception raised in Job[10]: TimeoutError()
Exception raised in Job[16]: TimeoutError()


Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 1.0000, 'faithfulness': 0.9893, 'factual_correctness(mode=f1)': 0.8227, 'answer_relevancy': 0.8662, 'context_entity_recall': 0.3311, 'noise_sensitivity(mode=relevant)': 0.1116}
Evaluation tokens: 691847
Evaluation cost: 0.139370


In [250]:
all_results = {
    "Naive": naive_evaluation,
    "BM25": bm25_evaluation,
    "Multi-Query": multi_query_evaluation,
    "Parent-Document": parent_document_evaluation,
    "Contextual Compression": contextual_compression_evaluation,
    "Ensemble": ensemble_evaluation,
    "Semantic": semantic_evaluation,
}


In [251]:
import pandas as pd
from IPython.display import display
import numpy as np

def evaluation_result_to_dict(result):
    """
    Convert EvaluationResult.scores (list of per-example dicts) to average metric values.
    """
    scores_list = result.scores
    if not scores_list or not isinstance(scores_list, list):
        return {}

    aggregated = {}
    for key in scores_list[0].keys():
        values = [s[key] for s in scores_list if s[key] is not None and not (isinstance(s[key], float) and np.isnan(s[key]))]
        if values:
            aggregated[key] = round(float(np.mean(values)), 4)

    return aggregated

def display_summary_table(results: dict):
    """
    Displays a summary table comparing retriever pipelines with averaged RAGAS metrics.
    
    Args:
        results (dict): A dictionary of evaluation results from evaluate_retriever().
    """
    summary_data = []

    for _, result in results.items():
        eval_scores = evaluation_result_to_dict(result["evaluation_result"])

        row = {
            "Pipeline": result["name"],
            **result["summary"],
            "ContextRecall": eval_scores.get("context_recall"),
            "Faithfulness": eval_scores.get("faithfulness"),
            "FactualCorrectness": eval_scores.get("factual_correctness(mode=f1)"),
            "AnswerRelevancy": eval_scores.get("answer_relevancy"),
            "ContextEntityRecall": eval_scores.get("context_entity_recall"),
            "NoiseSensitivity": eval_scores.get("noise_sensitivity(mode=relevant)"),
        }

        summary_data.append(row)

    df = pd.DataFrame(summary_data)
    display(df.sort_values(by="ContextRecall", ascending=False).reset_index(drop=True))


In [252]:
display_summary_table(all_results)


Unnamed: 0,Pipeline,total_runtime,avg_latency,total_tokens,total_queries,total_cost,avg_tokens_per_query,ContextRecall,Faithfulness,FactualCorrectness,AnswerRelevancy,ContextEntityRecall,NoiseSensitivity
0,BM25,39.38,3.58,41507,11,0.005088,3773.36,1.0,0.977,0.7745,0.6793,0.4873,0.1713
1,Multi-Query,91.06,8.28,144616,11,0.015926,13146.91,1.0,1.0,0.8736,0.7661,0.3929,0.1667
2,Semantic,63.42,5.77,73213,11,0.008367,6655.73,1.0,0.9893,0.8227,0.8662,0.3311,0.1116
3,Ensemble,108.9,9.9,201989,11,0.021664,18362.64,1.0,1.0,0.7427,0.846,0.3198,0.1538
4,Parent-Document,51.26,4.66,44525,11,0.005343,4047.73,0.9091,1.0,0.85,0.8454,0.3595,0.178
5,Naive,69.34,6.3,81221,11,0.009144,7383.73,0.8727,1.0,0.7855,0.9421,0.5169,0.0618
6,Contextual Compression,60.18,5.47,32558,11,0.004131,2959.82,0.8727,0.9935,0.7873,0.8476,0.4667,0.0675
