# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [2]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the strutured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [3]:
# from langchain_community.document_loaders.csv_loader import CSVLoader
# from datetime import datetime, timedelta

# loader = CSVLoader(
#     file_path=f"./data/complaints.csv",
#     metadata_columns=[
#       "Date received", 
#       "Product", 
#       "Sub-product", 
#       "Issue", 
#       "Sub-issue", 
#       "Consumer complaint narrative", 
#       "Company public response", 
#       "Company", 
#       "State", 
#       "ZIP code", 
#       "Tags", 
#       "Consumer consent provided?", 
#       "Submitted via", 
#       "Date sent to company", 
#       "Company response to consumer", 
#       "Timely response?", 
#       "Consumer disputed?", 
#       "Complaint ID"
#     ]
# )

# loan_complaint_data = loader.load()

# for doc in loan_complaint_data:
#     doc.page_content = doc.metadata["Consumer complaint narrative"]

In [4]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
loan_docs = loader.load()

In [5]:
len(loan_docs)

76

Let's look at an example document to see if everything worked as expected!

In [6]:
loan_docs[0]

Document(metadata={'producer': 'GPL Ghostscript 10.00.0', 'creator': 'wkhtmltopdf 0.12.6', 'creationdate': "D:20241217172423Z00'00'", 'source': 'data/Applications_and_Verification_Guide.pdf', 'file_path': 'data/Applications_and_Verification_Guide.pdf', 'total_pages': 76, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': "D:20241217172423Z00'00'", 'trapped': '', 'modDate': "D:20241217172423Z00'00'", 'creationDate': "D:20241217172423Z00'00'", 'page': 0}, page_content='Application and Verification Guide\nIntroduction\nThis guide is intended for college financial aid administrators and counselors who help students with the financial aid\nprocess4completing the Free Application for Federal Student Aid (FAFSA®) form, verifying information, and making\ncorrections and other changes to the information reported on the FAFSA form.\nThroughout the Federal Student Aid Handbook, we use <college,= <school,= and <institution= interchangeably unless a\nmore spec

## 📄 Document Object Structure

### ➤ `Document`
- **metadata**: `dict`
  - `producer`: `'GPL Ghostscript 10.00.0'`
  - `creator`: `'wkhtmltopdf 0.12.6'`
  - `creationdate`: `'D:20241217172423Z00\'00\''`
  - `source`: `'data/Applications_and_Verification_Guide.pdf'`
  - `file_path`: `'data/Applications_and_Verification_Guide.pdf'`
  - `total_pages`: `76`
  - `format`: `'PDF 1.7'`
  - `title`: `''`
  - `author`: `''`
  - `subject`: `''`
  - `keywords`: `''`
  - `moddate`: `'D:20241217172423Z00\'00\''`
  - `trapped`: `''`
  - `modDate`: `'D:20241217172423Z00\'00\''`
  - `creationDate`: `'D:20241217172423Z00\'00\''`
  - `page`: `0`
- **page_content**: `str`
  - Begins with:
    ```
    Application and Verification Guide
    Introduction
    This guide is intended for college financial aid administrators ...
    ```



## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [7]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_docs,
    embeddings,
    location=":memory:",
    collection_name="LoanData"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply look at each review as a document, and use cosine-similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [8]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [9]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [10]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [11]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [None]:
# naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

"Based on the provided complaints, the most common issues with loans appear to be:\n\n- Dealing with loan servicer misconduct, such as errors in loan balances, misapplied payments, wrongful denials of payment plans, and difficulties in applying payments correctly.\n- Incorrect or inconsistent reporting of loan information, including account status and balances, leading to credit score impacts.\n- Struggles with repayment plans, including issues with interest capitalization, improper transfers of loans without notification, and inability to pay down principal.\n- Problems related to loan handling, such as inaccurate information, unauthorized transfers, and privacy violations.\n- Long-term forbearance and mismanagement leading to increased debt and financial hardship.\n\nOverall, a prevalent theme is the mismanagement and mishandling by loan servicers, which affects borrowers' balances, credit reports, and ability to manage their loans effectively."

In [None]:
# sample_test = naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})

In [None]:
# sample_test['context']

[Document(metadata={'source': './data/complaints.csv', 'row': 474, 'Date received': '04/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Struggling to repay your loan', 'Sub-issue': 'Problem with forgiveness, cancellation, or discharge', 'Consumer complaint narrative': 'I sent a dispute settlement to all three credit bureaus over 30 days ago and still havent gotten a response. This is negligent and not in compliance with the law. Im asking for this to be looked into and handled.', 'Company public response': 'None', 'Company': 'Nelnet, Inc.', 'State': 'MI', 'ZIP code': '48111', 'Tags': 'None', 'Consumer consent provided?': 'Consent provided', 'Submitted via': 'Web', 'Date sent to company': '04/27/25', 'Company response to consumer': 'Closed with explanation', 'Timely response?': 'Yes', 'Consumer disputed?': 'N/A', 'Complaint ID': '13205525', '_id': 'ab98fee89f3f4527a711c50b3f43c83c', '_collection_name': 'LoanComplaints'}, page_content='I sent

In [None]:
# sample_test['response']

AIMessage(content='Yes, based on the information provided, some complaints did not get handled in a timely manner. Specifically, at least one complaint (Complaint ID: 12709087) was marked as not handled in a timely manner, with the response marked as "No." Additionally, multiple complaints mention delays or lack of response over extended periods, such as complaints that have been ongoing for over a year without resolution.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 80, 'prompt_tokens': 7357, 'total_tokens': 7437, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': 'fp_38343a2f8f', 'id': 'chatcmpl-ByLC6m8rgqxAW9gQphYMzosEVmLko', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--6227492e-4809-4

In [None]:
# dir(sample_test['response'])

['__abstractmethods__',
 '__add__',
 '__annotations__',
 '__class__',
 '__class_getitem__',
 '__class_vars__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__fields__',
 '__fields_set__',
 '__firstlineno__',
 '__format__',
 '__ge__',
 '__get_pydantic_core_schema__',
 '__get_pydantic_json_schema__',
 '__getattr__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__pretty__',
 '__private_attributes__',
 '__pydantic_complete__',
 '__pydantic_computed_fields__',
 '__pydantic_core_schema__',
 '__pydantic_custom_init__',
 '__pydantic_decorators__',
 '__pydantic_extra__',
 '__pydantic_fields__',
 '__pydantic_fields_set__',
 '__pydantic_generic_metadata__',
 '__pydantic_init_subclass__',
 '__pydantic_parent_namespace__',
 '__pydantic_post_init__',
 '__pydantic_private__',
 '__pydantic_root_model__',
 '__pydantic_serialize

In [None]:
# sample_test['response'].content

'Yes, based on the information provided, some complaints did not get handled in a timely manner. Specifically, at least one complaint (Complaint ID: 12709087) was marked as not handled in a timely manner, with the response marked as "No." Additionally, multiple complaints mention delays or lack of response over extended periods, such as complaints that have been ongoing for over a year without resolution.'

In [None]:
# naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily because of financial hardships, lack of clear communication from loan servicers, unmanageable interest accumulation, misinformation or insufficient information about repayment terms, and adverse impacts on credit reports. Many borrowers found themselves unable to increase payments due to their living expenses, while others experienced unexpected loan transfers, mismanagement, or delinquency reporting without proper notification. Additionally, some borrowers faced difficulties with payment plans, foregone opportunities for debt relief, and confusion over the status of their loans, all contributing to their inability to repay in a timely manner.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set-up! Let's see it happen down below!


In [12]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_docs, )

We'll construct the same chain - only changing the retriever.

In [13]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [None]:
# bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints, the most common issue with loans appears to be dealing with the lender or servicer, particularly related to inaccurate or misleading information, improper application of payments, or issues with loan terms. Specific sub-issues include disputes over fees charged, trouble with how payments are handled (such as being unable to pay down principal), and receiving bad or incomplete information about loan balances or history.'

In [None]:
# bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all these complaints were responded to with a "Closed with explanation" status and the responses are marked as "Timely response?": "Yes." Therefore, it appears that none of the complaints went unresolved or were handled outside of the expected timeframe.'

In [None]:
# bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for various reasons, including:\n- Problems with payment plans, such as being steered into incorrect forbearances or having their autopayments unenrolled without proper communication.\n- Poor communication from lenders or servicers, leading to misunderstandings about payment deadlines, overdue status, or the status of forbearance or deferment applications.\n- Lack of response or acknowledgment from loan servicers after requests for forbearance or deferment, resulting in continued billing and negative impact on credit scores.\n- Errors or issues in billing systems, such as payments being reversed or accounts being reported as past due without proper notification.\n- Transitions between loan servicers that caused a lack of awareness about the loan status or payment requirements.\nOverall, inadequate communication, administrative errors, and inappropriate handling of payment plans contributed to borrowers' failure to repay their loans."

It's not clear that this is better or worse, if only we had a way to test this (SPOILERS: We do, the second half of the notebook will cover this)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

#### ✅ Answer:

An example query where BM25 outperforms embeddings is when searching for documents that contain an **exact phrase**.

**Query:**

```python
"Search for complaints where the exact phrase 'loan paid in full' appears."
```

This type of query demands **precise lexical matching**, which is BM25’s strength. In this case, we ran the query using both retrieval pipelines:

### 🔍 Results

**BM25 Result:**
> Based on the provided context, there are no complaints that explicitly mention the exact phrase 'loan paid in full.' Therefore, I cannot find any complaints where this specific phrase appears.

**Embedding Result:**
> Based on the provided documents, there are multiple complaints where the exact phrase 'loan paid in full' appears, specifically in the context of patients stating that their loans still do not show as paid in full despite paying them off.  
>  
> For example:  
> - At row 59, the complaint narrative explicitly mentions: "I paid off loan # XXXX on XX/XX/year>, it has been 90 days and it is still not showing as paid in full."  
>  
> Therefore, the complaints where the exact phrase 'loan paid in full' appears are centered around cases where individuals have paid off their loans but the status has not been updated to 'paid in full', leading to frustration and disputes.

---

### ✅ Justification

This example demonstrates that **BM25 is superior for exact phrase queries**:

- BM25 correctly returned *no matches* because the **exact phrase** `'loan paid in full'` does not exist in the dataset.
- The embedding-based retriever, however, **hallucinated** — returning conceptually similar results that did **not contain the exact phrase**, violating the query constraint.

---

### 📌 Conclusion

**BM25 is better than embeddings** when:

- The query demands **exact term or phrase matching**
- You want to avoid **semantic generalization**
- Hallucination risks must be minimized (e.g., legal or financial audits)

This makes BM25 a strong choice for precision-sensitive applications like compliance, audits, or exact keyword search.


In [None]:
# query = "Search for complaints where the exact phrase 'loan paid in full' appears."

# bm25_result = bm25_retrieval_chain.invoke({"question": query})
# naive_result = naive_retrieval_chain.invoke({"question": query})

# print("BM25 RESULT:\n", bm25_result["response"].content)
# print("\n---\n")
# print("EMBEDDING RESULT:\n", naive_result["response"].content)


BM25 RESULT:
 I found several complaints where the exact phrase 'loan paid in full' appears. Specifically, there are two instances:

1. A complaint where the individual mentions that a loan serviced by Mohela states on the loan summary page that it was paid in full. They refer to documentation indicating this status on their student aid account.

2. A complaint where the person notes that their student loans, managed by the Default Resolution Group, show in their documentation that certain loans were paid in full as of specific dates.

If you need detailed information from these complaints or further assistance, please let me know!

---

EMBEDDING RESULT:
 I found multiple complaints where the exact phrase "loan paid in full" appears in the consumer complaint narratives.


In [None]:
# bm25_result['context']

[Document(metadata={'source': './data/complaints.csv', 'row': 824, 'Date received': '04/24/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': 'I have student loans managed by aidvantage. I have the payments setup from my bank account. It should be a breeze for me to pay them. No matter what I do the payments just fail. I logged into aidvantage and all of my payments have failed. Its depressing and I need them to be paid. It really should not be this hard. 7 months is a long time to keep failing ... every other bill I setup manage to get paid from the same account. I went to XXXX  and see the exact same complaints and tragedy. I need this resolved, I want my payments accepted.', 'Company public response': 'None', 'Company': 'Maximus Federal Services, Inc.', 'State': 'MD', 'ZIP code': '20735', 'Tags': 'Servicemember', '

In [None]:
# naive_result['context']

[Document(metadata={'source': './data/complaints.csv', 'row': 59, 'Date received': '04/02/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Received bad information about your loan', 'Consumer complaint narrative': "In my previous Complaint # XXXX I stated that student loan # XXXX is still not showing as being paid in full. I received a response but it still hasn't been corrected and the complaint was closed. I paid off loan # XXXX on XX/XX/year>, it has been 90 days and it is still not showing as paid in full, despite showing a XXXX balance.", 'Company public response': 'None', 'Company': 'Maximus Federal Services, Inc.', 'State': 'NV', 'ZIP code': '891XX', 'Tags': 'None', 'Consumer consent provided?': 'Consent provided', 'Submitted via': 'Web', 'Date sent to company': '04/10/25', 'Company response to consumer': 'Closed with explanation', 'Timely response?': 'Yes', 'Consumer disputed?': 'N/A'

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [14]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [15]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [None]:
# contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided information, a common issue with loans, particularly student loans, is dealing with errors and mismanagement by lenders or servicers. This includes problems such as receiving incorrect information about the loan, errors in balances, misapplied payments, wrongful denials of payment plans, and mishandling of the loan data. \n\nIn summary, one of the most common issues is **mismanagement and errors related to loan balances, payment application, and communication from lenders or servicers**.'

In [None]:
# contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, yes, some complaints did not get handled in a timely manner. For example, the complaint about the student loan account review has been open since some earlier date (indicated as XXXX in the documents) and has not been resolved after nearly 18 months, indicating a significant delay. Additionally, the complaint about payments not being applied to the account mentions ongoing issues over 2-3 weeks. Although the responses from the companies were marked as "Closed with explanation" and responses were "Yes" in terms of timeliness, the ongoing unresolved issues suggest that some complaints experienced delays or were not fully resolved promptly.'

In [None]:
# contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for several reasons based on the provided context:\n\n1. **Lack of Clear Communication and Awareness:** Some borrowers were unaware they had to repay their loans or were not informed about repayment requirements. For example, a borrower was unaware until a certain point that they needed to start paying back their financial aid, leading to confusion and missed payments.\n\n2. **Issues with Loan Transfer and Notification:** Borrowers reported that loans were transferred between servicers (e.g., to NelNet) without proper notification, and they did not receive timely or adequate communication about repayment obligations.\n\n3. **Difficulty with Payment Plans and Account Access:** Borrowers experienced technical issues, such as being locked out of online accounts or receiving inconsistent or incorrect account information, which hindered their ability to stay current on payments.\n\n4. **Accumulation of Interest and Loan Management Challenges:** Many bo

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had....more than one query!

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how is it to set-up? Not bad! Let's see it down below!



In [16]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [17]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [None]:
# multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be problems with how lenders or servicers handle borrower accounts. This includes errors or discrepancies in loan balances, interest calculations, and account statuses, as well as misapplication of payments, incorrect reporting to credit bureaus, and inadequate communication or lack of transparency about loan terms, transfer of servicing, or account updates. Many complaints highlight issues such as unauthorized interest accrual, mismanagement during loan transfers, errors in credit reporting, and poor communication regarding loan modifications or statuses.'

In [None]:
# multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, some complaints did not get handled in a timely manner. Several examples include:\n\n- A complaint regarding a student loan application that remained unprocessed for over 16 months without resolution or communication from the company.\n- Multiple instances where consumers reported that their issues, such as disputes over credit reporting, loan discharges, or account errors, were not addressed promptly, with follow-ups indicating delays of over a year or more.\n- A specific complaint highlighted that a complaint filed with the CFPB had not received a response after over 16 months.\n- Another complaint noted a delay of nearly 18 months with no resolution despite multiple follow-ups.\n\nWhile some complaints were marked as "timely response? Yes," the narratives suggest that many consumers experienced significant delays or lack of response over extended periods.'

In [None]:
# multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans primarily due to a combination of systemic issues and servicing practices, including:\n\n1. **Misleading or Bad Information from Servicers:** Some borrowers were not properly informed about payment schedules, delinquency status, or available repayment options such as income-driven plans or rehabilitation. They relied on servicers' guidance, which often led to unfavorable outcomes like long-term forbearances or improper consolidations, resulting in increased balances and interest.\n\n2. **Forbearance Steering and Lack of Alternatives:** Borrowers reported being repeatedly placed into forbearance without adequate notification or explanation of the consequences, such as interest capitalization and reset of forgiveness timelines. When forbearances expired, they faced pressure to consolidate into larger loans or make unaffordable payments, often under false urgency, which contributed to default or financial hardship.\n\n3. **Errors and Inaccurate Infor

In [18]:
from langchain_core.callbacks import CallbackManagerForRetrieverRun

# You must pass a dummy run manager
run_manager = CallbackManagerForRetrieverRun.get_noop_manager()

query = "What is the most common issue with loans?"
reformulated_queries = multi_query_retriever.generate_queries(query, run_manager)

print("🔁 Reformulated Queries:")
for q in reformulated_queries:
    print("-", q)

🔁 Reformulated Queries:
- What are the primary challenges people face when applying for or managing loans?  
- Can you identify the most frequently reported problems associated with loans?  
- What are the common pitfalls or issues encountered in loan agreements or repayment processes?


In [None]:
# from langchain_core.callbacks import CallbackManagerForRetrieverRun

# # You must pass a dummy run manager
# run_manager = CallbackManagerForRetrieverRun.get_noop_manager()

# query = "What is the most common issue with loans?"
# reformulated_queries = multi_query_retriever.generate_queries(query, run_manager)

# print("🔁 Reformulated Queries:")
# for q in reformulated_queries:
#     print("-", q)

🔁 Reformulated Queries:
- 1. What are the typical problems borrowers face with loans?  
- 2. Which issues tend to occur most frequently during the loan repayment process?  
- 3. What are the main challenges or complications commonly associated with loans?


#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

##### ✅ Answer:

Generating multiple reformulations of a user query can improve **recall** by increasing the likelihood of retrieving all relevant documents, even if they use different wording or phrasing than the original query.

Here’s how it helps:

1. **Overcomes vocabulary mismatch**  
   Reformulations can use synonyms or alternative expressions, helping the retriever match documents that might not share exact words with the original query.

2. **Expands semantic coverage**  
   Each reformulated query may emphasize a different aspect of the original intent, allowing the retriever to surface diverse but relevant content.

3. **Mitigates retriever sensitivity to phrasing**  
   Embedding-based or sparse retrieval models often perform inconsistently when queries are phrased differently. Multiple reformulations reduce the risk of missing relevant results due to one poorly-phrased query.

4. **Aggregates broader evidence**  
   By combining documents retrieved from all reformulations, the system builds a more comprehensive context pool for answering the original question.

✅ As a result, the multi-query approach improves the **robustness and completeness** of Retrieval-Augmented Generation (RAG) systems by reducing missed relevant content.


## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [19]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_docs
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [20]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [21]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [22]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [23]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [None]:
# parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be problems related to federal student loan servicing. These include errors in loan balances, misapplied payments, wrongful denials of payment plans, and issues with loan account management such as incorrect information on credit reports, disputes over interest rates, and problems caused by loan transfers and servicing practices. There are multiple complaints highlighting systemic breakdowns, misreporting, and disputes over terms and account status.'

In [None]:
# parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all the complaints that are explicitly marked as "Timely response?" with "No" indicate that they were not handled in a timely manner. Specifically, at least two complaints involving Mohela (Complaint IDs: 12709087 and 12935889) mention that they did not receive a response within the expected timeframe. These complaints had delayed responses, with the complaint at row 441 (Complaint ID: 12709087) explicitly stating "Timely response?":"No".\n\nTherefore, yes, some complaints did not get handled in a timely manner.'

In [None]:
# parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People often failed to pay back their loans due to a variety of circumstances, including financial hardship, lack of proper information or transparency about their loans, and issues related to loan management. For example, some individuals experienced severe financial difficulties after graduation, making it difficult or impossible to make payments. Others faced problems with their loan servicers, such as being unaware of when payments were due, not being properly informed about changes in loan ownership, or encountering administrative errors like incorrect reporting or failure to notify them about payment obligations. Additionally, some borrowers were misled by educational institutions, leading to taking on loans they were unprepared to repay, especially when the quality of education or employment prospects did not meet expectations.'

Overall, the performance *seems* largely the same. We can leverage a tool like [Ragas]() to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes 2, or more, retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [24]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [25]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [None]:
# ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be mismanagement and misinformation by loan servicers. Specific recurring problems include:\n\n- Errors and inaccuracies in loan balances and interest calculations.\n- Failure to properly notify borrowers about changes such as loan transfers, repayments, or default status.\n- Inability to obtain clear or correct information about loan terms, balances, and repayment options.\n- Issues related to improper handling of deferments, forbearances, and repayment plans.\n- Unclear or incorrect reporting on credit reports, leading to negative impacts on credit scores.\n- Problems with loan forgiveness, cancellation, or discharge processes.\n- Unresponsive or unhelpful customer service providing vague or misleading information.\n- Unauthorized or unnotified changes to loan status and increased interest due to mismanagement.\n\nOverall, the pattern indicates that a core issue is the mishandling of loan data and communi

In [None]:
# ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, some complaints were not handled in a timely manner. Specifically, there are multiple instances where responses from the companies were late or delayed beyond the expected timeframes. For example:\n\n- Complaint ID 12709087 (submitted 03/28/25): The response was marked as "N/A" (disputed or not responded to) and "Dispute? N/A," indicating it was not addressed timely.\n- Complaint ID 12935889 (submitted 04/11/25): Response was "No" for timely response.\n- Complaint ID 13056764 (submitted 04/18/25): The response was marked "Yes" for responsiveness, but the complaint details indicate prior delays.\n- Complaint ID 13205525 (submitted 04/27/25): Response was "Yes" for timely response, but the complainant states ongoing unresolved issues.\n\nOverall, there is evidence that some complaints, especially regarding inconsistent or delayed responses, were not handled promptly.'

In [None]:
# ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for several reasons, including:\n\n1. **Lack of communication and notification:** Many borrowers were not properly notified about when payments were due, changes in loan servicing, or account status updates, leading to missed payments and delinquency.\n\n2. **Complex or mismanaged loan information:** Borrowers faced confusion due to inconsistent or incorrect account balances, mismatched numbers, and lack of clear documentation about their loans, interest, and repayment status.\n\n3. **Limited or unhelpful repayment options:** Borrowers were often only offered forbearance or deferment without adequate explanations of the long-term consequences, such as interest accumulation, which increased the total amount owed.\n\n4. **Unfair or predatory servicing practices:** Servicers steered borrowers into long-term forbearances, failed to inform them about income-driven repayment plans or forgiveness options, and improperly reported delinquencies or late pay

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences based on their semantic similarity based on a number of [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example which will:

Calculate all distances between sentences, and then break apart sequences of setences that exceed a given percentile among all distances.

In [26]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [27]:
semantic_documents = semantic_chunker.split_documents(loan_docs[:20])

Let's create a new vector store.

In [28]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [29]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [30]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [None]:
# semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints, the most common issue with loans appears to be problems related to loan servicing and communication. Many complaints highlight issues such as:\n\n- Struggling to repay or confusion about payment plans and amounts\n- Lack of transparency and miscommunication from servicers\n- Problems with loan account information being inaccurate, default statuses, or improper reporting\n- Difficulties in obtaining proper verification or documentation\n- Unauthorized or improper reporting of debt or account status\n\nWhile the specific "most common issue" isn\'t explicitly quantified in the data, the recurring themes suggest that the most frequent problem involves **mismanagement and miscommunication by loan servicers, leading to confusion, incorrect account statuses, and difficulties in repayment efforts**.  \n\nIf you need a more precise answer, I would say:  \n**The most common issue with loans is mismanagement by servicers, including inaccurate reporting, poor com

In [None]:
# semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes, several complaints in the provided data indicate that they were handled in a timely manner, with responses marked as "Yes" under the "Timely response?" field. However, the complaint about "Received bad information about your loan" (Complaint ID: 13331376) involving Nelnet, Inc. was closed with an explanation, which suggests that their issue was addressed, although the details of whether it was resolved to the complainant\'s satisfaction are not specified.\n\nMost other complaints, such as those regarding trouble with payments, disputes about account information, or issues with service, also received timely responses. Since the context does not explicitly mention any complaints that were not handled in time, the available data suggests that, according to the records, all complaints were responded to promptly.\n\nIf the question pertains to whether any complaints were left unresolved or delayed beyond a reasonable period, the records do not indicate such a case. Therefore, based on

In [None]:
# semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for various reasons, including:\n\n- Receiving bad or unclear information about their loans, such as being told they are in forbearance until a future date without written confirmation.\n- Difficulties with loan servicing processes, such as delays, miscommunication, or alleged stalling tactics by lenders or servicers.\n- Problems with the loan documentation or paperwork, which can lead to rejection or delays in loan forgiveness or discharge processes.\n- Disputes over the legitimacy or status of the loans, including claims that loans are illegally reported, or that they have been transferred or mishandled improperly.\n- Financial hardships or misunderstandings about payment obligations, especially after changes like the end of COVID-19 forbearance or re-amortization issues.\n- Errors or misreporting of loan status, which can negatively impact credit scores and result in default status being assigned mistakenly.\n- In some cases, legal or privacy i

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

##### ✅ Answer:

Semantic chunking, which groups text based on meaning rather than arbitrary length, may behave suboptimally with short and repetitive sentences (as in FAQs) due to the following:

- **Over-segmentation:** Each FAQ entry may be treated as its own standalone semantic unit, resulting in too many small chunks.
- **Redundancy:** The high degree of repetition (e.g., similar questions phrased slightly differently) can confuse clustering or embedding models, reducing retrieval quality.
- **Context loss:** Since each chunk is short, meaningful context may be lost, making it harder for the retriever to disambiguate similar queries.

##### 🔧 Suggested Adjustments:

1. **Concatenate Related Sentences:** Manually or heuristically group FAQs with similar themes into larger composite chunks before embedding.
2. **Use Metadata:** Attach question IDs, tags, or categories to enrich semantic meaning and aid disambiguation.
3. **Adjust Chunking Parameters:** Lower the sensitivity of semantic boundaries or increase the minimum chunk size.
4. **Apply Deduplication or Clustering:** Use semantic similarity (e.g., cosine similarity) to merge or group near-duplicate FAQs prior to chunking.
5. **Hybrid Approach:** Combine semantic chunking with a rule-based or sliding window technique to ensure minimum context length.

These strategies help mitigate fragmentation and enhance recall by ensuring more informative and discriminative chunks.


# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against eachother.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparision between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

  ##### HINTS:

- LangSmith provides detailed information about latency and cost.

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

In [31]:
### YOUR CODE HERE
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"Session 09 - Advanced Retrieval Loan Docs"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API Key: ")

In [None]:
# import random
# import pandas as pd
# from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# from ragas.llms import LangchainLLMWrapper
# from ragas.embeddings import LangchainEmbeddingsWrapper
# from ragas.testset import TestsetGenerator
# from ragas.testset.synthesizers import (
#     SingleHopSpecificQuerySynthesizer,
#     MultiHopAbstractQuerySynthesizer,
#     MultiHopSpecificQuerySynthesizer,
# )
# from langchain.callbacks import get_openai_callback

# # Reproducibility
# random.seed(42)

# # LLM + embedding setup
# generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0.7))
# generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

# # Query types
# query_distribution = [
#     (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
#     (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
#     (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
# ]

# # Generator init
# generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# # Generate
# print("Generating synthetic dataset...")
# with get_openai_callback() as cb:
#     dataset = generator.generate_with_langchain_docs(
#         loan_docs[:50],
#         testset_size=10,
#         query_distribution=query_distribution,
#     )
#     print(f"💰 Tokens used: {cb.total_tokens} | Cost: ${cb.total_cost:.4f}")



Generating synthetic dataset...


Applying SummaryExtractor:   0%|          | 0/31 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/50 [00:00<?, ?it/s]

Node 054e7961-1b3d-448b-bbcd-2568a475c7de does not have a summary. Skipping filtering.
Node 5df2fecb-31a9-4e4c-8aab-a2f372788b9c does not have a summary. Skipping filtering.
Node 9072337c-cac8-4b96-a09e-517f749fade3 does not have a summary. Skipping filtering.
Node 353ab4e2-4835-49b7-acd6-6470d7d53ede does not have a summary. Skipping filtering.
Node 35fb8ec3-3b67-48f9-97e2-03d359f3b397 does not have a summary. Skipping filtering.
Node 15f3a024-d455-4c4a-a106-51968f9c4021 does not have a summary. Skipping filtering.
Node ab1a1371-5e2c-43dd-af5f-0e1a8b1ca8cc does not have a summary. Skipping filtering.
Node d76f9cdc-32ae-4835-971e-de8feb1fe72d does not have a summary. Skipping filtering.
Node 333ad8e2-afb1-4f32-9ac9-dd279d9854e1 does not have a summary. Skipping filtering.
Node 783765cb-db2c-46fc-bd8e-2ba23a1330b7 does not have a summary. Skipping filtering.
Node 7c2bbd43-386f-4408-819d-12b46a5bcf5b does not have a summary. Skipping filtering.
Node cbf1b887-204b-4f20-a539-a6e80d8aac27 d

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/131 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

💰 Tokens used: 111244 | Cost: $0.3606
✅ Generated 11 synthetic test cases


TypeError: 'Testset' object is not subscriptable

In [44]:
import random
import pandas as pd
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)
from langchain.callbacks import get_openai_callback

# Reproducibility
random.seed(42)

# LLM + embedding setup
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0.7))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

# Query types
query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]

# Generator init
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# Generate
print("Generating synthetic dataset...")
with get_openai_callback() as cb:
    dataset = generator.generate_with_langchain_docs(
        loan_docs[:50],
        testset_size=10,
        query_distribution=query_distribution,
    )
    print(f"💰 Tokens used: {cb.total_tokens} | Cost: ${cb.total_cost:.4f}")



Generating synthetic dataset...


Applying HeadlinesExtractor:   0%|          | 0/44 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/50 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/70 [00:00<?, ?it/s]

Property 'summary' already exists in node 'e45506'. Skipping!
Property 'summary' already exists in node 'c07a5e'. Skipping!
Property 'summary' already exists in node '290e95'. Skipping!
Property 'summary' already exists in node '1e903a'. Skipping!
Property 'summary' already exists in node 'b91f7a'. Skipping!
Property 'summary' already exists in node '6664b0'. Skipping!
Property 'summary' already exists in node 'b736f8'. Skipping!
Property 'summary' already exists in node '01fb2c'. Skipping!
Property 'summary' already exists in node '3b5ee8'. Skipping!
Property 'summary' already exists in node '149d80'. Skipping!
Property 'summary' already exists in node '7da800'. Skipping!
Property 'summary' already exists in node '0aedac'. Skipping!
Property 'summary' already exists in node '7c9a37'. Skipping!
Property 'summary' already exists in node 'a7ba06'. Skipping!
Property 'summary' already exists in node '7ca893'. Skipping!
Property 'summary' already exists in node '0c92dd'. Skipping!
Property

Applying CustomNodeFilter:   0%|          | 0/37 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/134 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '3b5ee8'. Skipping!
Property 'summary_embedding' already exists in node '01fb2c'. Skipping!
Property 'summary_embedding' already exists in node '290e95'. Skipping!
Property 'summary_embedding' already exists in node 'b736f8'. Skipping!
Property 'summary_embedding' already exists in node '2007d1'. Skipping!
Property 'summary_embedding' already exists in node 'e45506'. Skipping!
Property 'summary_embedding' already exists in node 'b91f7a'. Skipping!
Property 'summary_embedding' already exists in node '2bd8fb'. Skipping!
Property 'summary_embedding' already exists in node 'c07a5e'. Skipping!
Property 'summary_embedding' already exists in node 'a7ba06'. Skipping!
Property 'summary_embedding' already exists in node '149d80'. Skipping!
Property 'summary_embedding' already exists in node '7c9a37'. Skipping!
Property 'summary_embedding' already exists in node '7ca893'. Skipping!
Property 'summary_embedding' already exists in node 'e12951'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

💰 Tokens used: 247858 | Cost: $0.7015


In [45]:
import pandas as pd

df = dataset.to_pandas()
pd.set_option('display.max_colwidth', 300)
display(df)

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,"What changes did the Consolidated Appropriations Act, 2021 bring to the FAFSA process?","[Application and Verification Guide Introduction This guide is intended for college financial aid administrators and counselors who help students with the financial aid process4completing the Free Application for Federal Student Aid (FAFSA®) form, verifying information, and making corrections an...","The Consolidated Appropriations Act, 2021, which included the FAFSA Simplification Act, mandated a significant overhaul of federal student aid, including changes to the Free Application for Federal Student Aid (FAFSA) form, need analysis, and many policies and procedures for schools participatin...",single_hop_specifc_query_synthesizer
1,Wht is the purpose of the FAFSA Submission Summary in the financial aid process?,"[The FPS also checks the application for possible inconsistencies and mistakes. For instance, if a dependent student reported the parents9 marital status as married but reported the family size as <2,= the edit checks would catch the inconsistency. Even when data is inconsistent, the FPS may be ...","The FAFSA Submission Summary shows the information the student originally provided, the SAI, results of the eligibility matches, information about aid history, and information about any inconsistencies identified through the FPS edits. It is made available to the student online or mailed as a pa...",single_hop_specifc_query_synthesizer
2,What information can financial aid administrators view on the ISIR?,"[2. The disclosure of their FTI by the IRS to the Department; 3. The use of their FTI by a Department official to determine an applicant9s eligibility for federal student aid and the amount for which they are eligible; and 4. The redisclosure of FTI by the Department to an eligible institution, ...",Financial aid administrators can view the transferred data on the ISIR.,single_hop_specifc_query_synthesizer
3,What FTI mean and what info it include?,[Federal Tax Information The following data received by the Department from the IRS are considered FTI: Tax Year (ex. Award year 2025-26 is based on 2023 tax year information from the IRS) Tax Filing Status Adjusted Gross Income (AGI) Number of Exemptions and Number of Dependents Income Earned f...,"FTI stands for Federal Tax Information and includes data such as Tax Year, Tax Filing Status, Adjusted Gross Income (AGI), Number of Exemptions and Dependents, Income Earned from Work, Taxes Paid, Educational Credits, Untaxed IRA distributions, IRA deductions and payments, Tax exempt interest, U...",single_hop_specifc_query_synthesizer
4,Wut shud a student do if they need to contact the FSAIC to sync their data with the SSA?,"[Because the Department matches the student9s name and SSN with the Social Security Administration (SSA), the name on the FAFSA form should match the one in the SSA9s records (i.e., as it appears on the student9s Social Security card). Students (except citizens of one of the Freely Associated St...","After the student receives confirmation that SSA has corrected its records, the student must contact the Federal Student Aid Information Center (FSAIC) and ask them to manually sync their data with SSA.",single_hop_specifc_query_synthesizer
5,How is the definition of a parent on the FAFSA relevant to the completion of the form for students with unmarried biological parents living together?,[<1-hop>\n\nThis question asks if a 2023 IRS Form 1040 or 1040-NR was (or will be) completed. It also asks about income earned in a foreign country or if the student spouse filed a tax return in a U.S. territory. See the <Student Tax Filing Status (19)= question for additional information. Stude...,The definition of a parent on the FAFSA is relevant to the completion of the form for students with unmarried biological parents living together because both parents must report their information on the FAFSA form. This is because the FAFSA considers both biological and adoptive parents who are ...,multi_hop_abstract_query_synthesizer
6,"How does the FAFSA form address situations where a student's parent has a unique tax filing status, such as not filing a U.S. tax return due to foreign income or other reasons?","[<1-hop>\n\nParent Tax Filing Status (37) This question asks if the parent filed (or will file) a 2023 IRS 1040 or 1040-NR. If the answer is <no=, the parent will be asked to select one of the following reasons for not filing: 1. The parent filed or will file a tax return with Puerto Rico or ano...",The FAFSA form addresses situations where a student's parent has a unique tax filing status by requiring manual entry of income and tax data if the parent did not file a U.S. tax return due to foreign income or other reasons. If a parent indicates they earned income in a foreign country but did ...,multi_hop_abstract_query_synthesizer
7,"What information must a student provide about their spouse when completing the FAFSA form, and how is this information used in the FAFSA Submission Summary?",[<1-hop>\n\nThe ISIR will only display the federal school code of the receiving school. The information of other schools the student included on the FAFSA form will not appear except on the FAFSA Submission Summary and on ISIRs sent to state grant agencies. See Volume 6 of the 2025-26 FAFSA Spec...,"When completing the FAFSA form, a student must provide their spouse's identity information, including name, Social Security number, date of birth, and email address, if the spouse is a required contributor. This information should match what appears on the spouse's social security card. If the s...",multi_hop_abstract_query_synthesizer
8,What happens if a parent says they didn’t file a tax return but has foreign income affecting SAI?,"[<1-hop>\n\nwages and business and farm income) only. This may be the same number as your AGI. Fiscal Year Tax Returns For a fiscal year (rather than calendar year) tax return, the individual should use information from the return that includes the greater length of time in 2023. For example, an...","If a parent indicates they did not file a U.S. tax return for reasons other than low income, this is treated as conflicting information, and documentation is requested to resolve it. If a parent did not file any tax return because they did not earn any income and their state of legal residence i...",multi_hop_specific_query_synthesizer
9,"How should a student report taxable grants and scholarships on the FAFSA, and what role does IRS Form 1040 Schedule 1 play in this process?","[<1-hop>\n\nManually Entered Data/Manually Provided Tax-Payer Data IRA rollover into another IRA or qualified plan. Typically indicated as ""ROLLOVER"" on IRS Form 1040-line 4. Pension rollover into an IRA or other qualified plan. Typically indicated as ""ROLLOVER"" on IRS Form 1040-line 5. Earned i...","A student should report only the amount of grants and scholarships received that was reported as taxable income on their tax return. This includes the taxable portions of fellowships, assistantships, stipends, and employer tuition reimbursements. These amounts are typically indicated on IRS Form...",multi_hop_specific_query_synthesizer


In [46]:
# Save the test dataset to avoid regenerating it
dataset.to_pandas().to_csv("loan_test_dataset.csv", index=False)
print("Test dataset saved to loan_test_dataset.csv")


Test dataset saved to loan_test_dataset.csv


In [107]:
import os
import time
import copy
import pandas as pd
import json
import numpy as np
from ragas.evaluation import EvaluationDataset
from ragas import evaluate, RunConfig
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity,
)
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
from langchain.callbacks import get_openai_callback

def evaluation_result_to_dict(result):
    """
    Convert EvaluationResult.scores (list of per-example dicts) to average metric values.
    """
    scores_list = result.scores
    if not scores_list or not isinstance(scores_list, list):
        return {}

    aggregated = {}
    for key in scores_list[0].keys():
        values = [s[key] for s in scores_list if s[key] is not None and not (isinstance(s[key], float) and np.isnan(s[key]))]
        if values:
            aggregated[key] = round(float(np.mean(values)), 4)

    return aggregated


def evaluate_retriever(name: str, pipeline, dataset):
    print(f"📊 Running evaluation for: {name}")

    test_dataset = copy.deepcopy(dataset)

    latencies = []
    total_tokens = 0
    total_cost = 0
    prompt_number = 1
    prompt_logs = []

    overall_start_time = time.time()

    for test_row in test_dataset:
        question = test_row.eval_sample.user_input
        start_time = time.time()
        with get_openai_callback() as cb:
            response = pipeline.invoke({"question": question})
        
        latency = time.time() - start_time
        latencies.append(latency)
        total_tokens += cb.total_tokens
        total_cost += cb.total_cost

        test_row.eval_sample.response = str(response["response"])
        test_row.eval_sample.retrieved_contexts = [doc.page_content for doc in response["context"]]

        token_usage = {
            "completion_tokens": response["response"].response_metadata['token_usage']["completion_tokens"],
            "prompt_tokens": response["response"].response_metadata['token_usage']["prompt_tokens"],
            "total_tokens": response["response"].response_metadata['token_usage']["total_tokens"]
        }

        model_name = response["response"].response_metadata['model_name']

        prompt_logs.append({
            "prompt_number": prompt_number,
            "question": question,
            "response": response["response"].content,
            "token_usage": token_usage,
            "model_name": model_name,
            "retrieved_contexts": [doc.page_content for doc in response["context"]],
            "retrieved_metadata": [doc.metadata for doc in response["context"]],
            "latency": round(latency, 2),
            "tokens": cb.total_tokens,
            "cost": round(cb.total_cost, 6),
            "timestamp": time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(start_time))
        })

        print(f"Prompt {prompt_number} | {response["response"].content} |⏱️ {latency:.2f}s | Tokens {cb.total_tokens} | Cost {cb.total_cost:.6f}")
        prompt_number += 1

    total_duration = time.time() - overall_start_time
    avg_latency = sum(latencies) / len(latencies) if latencies else 0
    avg_tokens_per_query = total_tokens / len(test_dataset)

    print(f"\n📊 Summary for {name}:")
    print(f"🔢 Total tokens: {total_tokens}")
    print(f"⏱️ Avg latency: {avg_latency:.2f}s")
    print(f"🔢 Avg tokens per query: {avg_tokens_per_query:.2f}")
    print(f"⏱️ Total duration: {total_duration:.2f}s")
    print(f"💰 Total cost: {total_cost:.6f}")

    evaluator_model = "gpt-4.1-nano"
    evaluation_dataset = EvaluationDataset.from_pandas(test_dataset.to_pandas())
    evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=evaluator_model, temperature=0))

    with get_openai_callback() as cb:
        result = evaluate(
            experiment_name=name,
            dataset=evaluation_dataset,
            metrics=[
                LLMContextRecall(),
                Faithfulness(),
                FactualCorrectness(),
                ResponseRelevancy(),
                ContextEntityRecall(),
                NoiseSensitivity(),
            ],
            llm=evaluator_llm,
            run_config=RunConfig(timeout=300)
        )

    evaluation_cost = cb.total_cost
    evaluation_tokens = cb.total_tokens

    print(f"Evaluation LLM model: {evaluator_model}")
    print('Evaluation results:', result)
    print('Evaluation tokens:', evaluation_tokens)
    print(f'Evaluation cost: {evaluation_cost:.6f}')

    evaluation_result_dict = evaluation_result_to_dict(result)

    output_dir = "loan_docs_retriever_logs"
    os.makedirs(output_dir, exist_ok=True)
    log_filename = f'{name.lower().replace(" ", "_")}_logs.json'
    output_path = os.path.join(output_dir, log_filename)

    full_output = {
        "name": name,
        "summary": {
            "total_runtime": round(total_duration, 2),
            "avg_latency": round(avg_latency, 2),
            "total_tokens": int(total_tokens),
            "total_queries": len(test_dataset),
            "total_cost": round(total_cost, 6),
            "avg_tokens_per_query": round(avg_tokens_per_query, 2),
        },
        "latencies": [round(lat, 2) for lat in latencies],
        "evaluation_result": evaluation_result_dict,
        "evaluation_cost": {
            "eval_model": evaluator_model,
            "eval_tokens": evaluation_tokens,
            "eval_cost": round(evaluation_cost, 6),
        },
        "prompt_logs_path": output_path,
        "prompt_logs": prompt_logs,
    }

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(full_output, f, indent=2, ensure_ascii=False)

    print(f"✅ Full result saved to: {output_path}")

    del full_output['prompt_logs']
    return full_output

In [106]:
from ragas.testset import Testset

subset = Testset(list(dataset)[:3])
evaluate_retriever("naive_retriever_test", naive_retrieval_chain, subset)


📊 Running evaluation for: naive_retriever_test
Prompt 1 | The Consolidated Appropriations Act, 2021, specifically through the FAFSA Simplification Act, brought significant changes to the FAFSA process, primarily focusing on streamlining and simplifying the application. The key changes include:

1. Introduction of the FUTURE Act and the FA-DDX: The act authorized a direct data exchange with the IRS (the FA-DDX), replacing the previous IRS Data Retrieval Tool (IRS-DRT). This allows most applicants' income and tax information to be automatically imported from the IRS with their consent, reducing the need for self-reporting.

2. Mandatory Consent and Approval: All applicants and contributors (parents/spouses) must provide consent and approval for the Department to retrieve and use Federal Tax Information (FTI) via the FA-DDX, ensuring privacy and compliance.

3. Removal of the IRS Data Retrieval Tool: The IRS-DRT was retired after the 2023-24 application cycle, making the FA-DDX the primar

Evaluating:   0%|          | 0/18 [00:00<?, ?it/s]

Exception raised in Job[10]: TimeoutError()
Exception raised in Job[4]: TimeoutError()


Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 1.0000, 'faithfulness': 1.0000, 'factual_correctness(mode=f1)': 0.9067, 'answer_relevancy': 0.9810, 'context_entity_recall': 0.5000, 'noise_sensitivity(mode=relevant)': 0.2568}
Evaluation tokens: 244253
Evaluation cost: 0.036949
✅ Full result saved to: loan_docs_retriever_logs/naive_retriever_test_logs.json


{'name': 'naive_retriever_test',
 'summary': {'total_runtime': 8.22,
  'avg_latency': 2.74,
  'total_tokens': 29784,
  'total_queries': 3,
  'total_cost': 0.001783,
  'avg_tokens_per_query': 9928.0},
 'latencies': [4.7, 1.56, 1.96],
 'evaluation_result': {'context_recall': 1.0,
  'faithfulness': 1.0,
  'factual_correctness(mode=f1)': 0.9067,
  'answer_relevancy': 0.981,
  'context_entity_recall': 0.5,
  'noise_sensitivity(mode=relevant)': 0.2568},
 'evaluation_cost': {'eval_model': 'gpt-4.1-nano',
  'eval_tokens': 244253,
  'eval_cost': 0.036949},
 'prompt_logs_path': 'loan_docs_retriever_logs/naive_retriever_test_logs.json'}

In [166]:
# def run_all_retriever_evaluations(pipelines: dict, dataset):
#     """
#     Runs evaluation across multiple retriever pipelines.

#     Args:
#         pipelines (dict): Dictionary where key is a pipeline name and value is the retriever chain.
#         dataset: Ragas synthetic test set.

#     Returns:
#         dict: A dictionary of results per pipeline.
#     """
#     results = {}
#     for name, pipeline in pipelines.items():
#         print(f"\n🚀 Starting evaluation for: {name}")
#         results[name] = evaluate_retriever(name, pipeline, dataset)
#     return results


In [108]:
naive_evaluation = evaluate_retriever("Naive", naive_retrieval_chain, dataset)

📊 Running evaluation for: Naive
Prompt 1 | The Consolidated Appropriations Act, 2021, through the FAFSA Simplification Act, introduced several significant changes to the FAFSA process, including:

1. Implementation of the FUTURE Act Direct Data Exchange (FA-DDX): This replaced the previous IRS Data Retrieval Tool (DRT) and allows the Department of Education to obtain federal tax information directly from the IRS with the consent of the applicants and contributors. This streamlined income reporting by eliminating the need for most applicants and their families to self-report their IRS-reported income and tax data. The FTI transferred via FA-DDX is considered verified for Title IV purposes.

2. Consent and Approval Requirements: Both students and their contributors (parents/spouses) must provide consent and approval for the Department to access and use their federal tax information via the FA-DDX.

3. Enhanced Data Handling and Privacy Measures: The process now involves explicit consent,

Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

Exception raised in Job[17]: ValueError(setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10,) + inhomogeneous part.)
Exception raised in Job[10]: TimeoutError()
Exception raised in Job[22]: TimeoutError()
Exception raised in Job[28]: TimeoutError()
Exception raised in Job[34]: TimeoutError()
Exception raised in Job[46]: TimeoutError()
Exception raised in Job[58]: TimeoutError()
Exception raised in Job[64]: TimeoutError()


Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 0.9091, 'faithfulness': 0.9955, 'factual_correctness(mode=f1)': 0.7636, 'answer_relevancy': 0.9499, 'context_entity_recall': 0.3715, 'noise_sensitivity(mode=relevant)': 0.1200}
Evaluation tokens: 953293
Evaluation cost: 0.144740
✅ Full result saved to: loan_docs_retriever_logs/naive_logs.json


In [109]:
bm25_evaluation = evaluate_retriever("BM25", bm25_retrieval_chain, dataset)

📊 Running evaluation for: BM25
Prompt 1 | The Consolidated Appropriations Act, 2021, through the FAFSA Simplification Act, brought significant changes to the FAFSA process, notably implementing the FUTURE Act Direct Data Exchange (FA-DDX) with the IRS. This replaced the previous IRS Data Retrieval Tool (IRS-DRT), allowing most applicants and their contributors to have their federal tax information (FTI) automatically imported into the FAFSA form with consent and approval. This change streamlined the application process, making it faster and less burdensome for students by reducing the need for manual income reporting and enhancing the accuracy of income data used for financial aid eligibility determination. Additionally, the transfer of FTI via FA-DDX is considered verified for Title IV purposes, further simplifying compliance and verification procedures. |⏱️ 3.18s | Tokens 4078 | Cost 0.000453
Prompt 2 | The purpose of the FAFSA Submission Summary in the financial aid process is to pr

Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

Exception raised in Job[10]: TimeoutError()
Exception raised in Job[4]: TimeoutError()
Exception raised in Job[22]: TimeoutError()
Exception raised in Job[34]: TimeoutError()
Exception raised in Job[40]: TimeoutError()


Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 0.9818, 'faithfulness': 0.9899, 'factual_correctness(mode=f1)': 0.7945, 'answer_relevancy': 0.9500, 'context_entity_recall': 0.4110, 'noise_sensitivity(mode=relevant)': 0.1902}
Evaluation tokens: 504282
Evaluation cost: 0.079412
✅ Full result saved to: loan_docs_retriever_logs/bm25_logs.json


In [110]:
multi_query_evaluation = evaluate_retriever("Multi-Query", multi_query_retrieval_chain, dataset)

📊 Running evaluation for: Multi-Query
Prompt 1 | The Consolidated Appropriations Act, 2021, through the FAFSA Simplification Act, brought significant changes to the FAFSA process, primarily focusing on streamlining and simplifying application procedures. The key changes include:

1. **Introduction of the FAFSA Simplification Act**: Mandated a comprehensive overhaul of the FAFSA form, need analysis, and related policies, in conjunction with the FUTURE Act.

2. **Implementation of the Future Act Direct Data Exchange (FA-DDX)**: Replaced the IRS Data Retrieval Tool (DRT) with a secure, near-real-time data exchange system that automatically imports certain federal tax information (FTI) directly from the IRS into the FAFSA after obtaining applicant consent. This reduces the need for self-reporting income and tax data, enhances accuracy, and considers transferred FTI as verified for Title IV purposes.

3. **Enhanced Consent and Approval Requirements**: Applicants and contributors (parents/sp

Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

Exception raised in Job[4]: TimeoutError()
Exception raised in Job[5]: TimeoutError()
Exception raised in Job[10]: TimeoutError()
Exception raised in Job[17]: TimeoutError()
Exception raised in Job[22]: TimeoutError()
Exception raised in Job[23]: TimeoutError()
Exception raised in Job[28]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[40]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[46]: TimeoutError()
Exception raised in Job[47]: TimeoutError()
Exception raised in Job[52]: TimeoutError()
Exception raised in Job[53]: TimeoutError()
Exception raised in Job[59]: TimeoutError()
Exception raised in Job[65]: TimeoutError()


Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 0.9091, 'faithfulness': 1.0000, 'factual_correctness(mode=f1)': 0.8473, 'answer_relevancy': 0.9488, 'context_entity_recall': 0.5500, 'noise_sensitivity(mode=relevant)': 0.1250}
Evaluation tokens: 998899
Evaluation cost: 0.150473
✅ Full result saved to: loan_docs_retriever_logs/multi-query_logs.json


In [111]:
parent_document_evaluation = evaluate_retriever("Parent-Document", parent_document_retrieval_chain, dataset)

📊 Running evaluation for: Parent-Document
Prompt 1 | The Consolidated Appropriations Act, 2021, through the FAFSA Simplification Act, brought significant changes to the FAFSA process, including:

1. Implementation of the FAFSA Simplification Act alongside the FUTURE Act to streamline the FAFSA application process.
2. Introduction of the FUTURE Act Direct Data Exchange (FA-DDX) with the IRS, replacing the previous IRS Data Retrieval Tool (DRT). The FA-DDX automates the transfer of income and tax information from the IRS to the FAFSA form, reducing the need for applicants and their families to self-report this information.
3. The transferred federal tax information (FTI) via the FA-DDX is considered verified for Title IV purposes.
4. Applicants and contributors must provide consent and approval for the Department to obtain and use their IRS tax information through the FA-DDX.
5. Updates to the FAFSA application and verification processes, including date updates, resource links, and the r

Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

Exception raised in Job[41]: ValueError(setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.)
Exception raised in Job[64]: TimeoutError()


Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 1.0000, 'faithfulness': 0.9813, 'factual_correctness(mode=f1)': 0.7855, 'answer_relevancy': 0.8622, 'context_entity_recall': 0.3666, 'noise_sensitivity(mode=relevant)': 0.2921}
Evaluation tokens: 412029
Evaluation cost: 0.065158
✅ Full result saved to: loan_docs_retriever_logs/parent-document_logs.json


In [112]:
contextual_compression_evaluation = evaluate_retriever("Contextual Compression", contextual_compression_retrieval_chain, dataset)

📊 Running evaluation for: Contextual Compression
Prompt 1 | The Consolidated Appropriations Act, 2021, through the FAFSA Simplification Act, brought several significant changes to the FAFSA process:

1. Implementation of the FAFSA Simplification Act alongside the FUTURE Act to streamline the FAFSA application process.
2. Introduction of the FA-DDX (FUTURE Act Direct Data Exchange), which replaced the IRS Data Retrieval Tool (DRT). This new tool automates the transfer of income and tax information from the IRS directly to the FAFSA, reducing the need for applicants to self-report their income. The FTI transferred via FA-DDX is considered verified for Title IV purposes.
3. Mandatory consent and approval are now required from applicants (and contributors such as spouses or parents) for the Department to access Federal Tax Information from the IRS via FA-DDX.
4. Updates to application deadlines, processing timelines, and procedures for submitting and correcting FAFSA applications.
5. Chang

Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

Exception raised in Job[4]: TimeoutError()
Exception raised in Job[16]: TimeoutError()
Exception raised in Job[22]: TimeoutError()
Exception raised in Job[28]: TimeoutError()
Exception raised in Job[64]: TimeoutError()


Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 1.0000, 'faithfulness': 0.9697, 'factual_correctness(mode=f1)': 0.8782, 'answer_relevancy': 0.9440, 'context_entity_recall': 0.4068, 'noise_sensitivity(mode=relevant)': 0.1558}
Evaluation tokens: 425287
Evaluation cost: 0.067210
✅ Full result saved to: loan_docs_retriever_logs/contextual_compression_logs.json


In [113]:
ensemble_evaluation = evaluate_retriever("Ensemble", ensemble_retrieval_chain, dataset)

📊 Running evaluation for: Ensemble
Prompt 1 | The Consolidated Appropriations Act, 2021, through the FAFSA Simplification Act, brought several significant changes to the FAFSA process:

1. **Implementation of the FUTURE Act and FA-DDX:**  
   - Replaced the IRS Data Retrieval Tool (DRT) with the FUTURE Act Direct Data Exchange (FA-DDX), enabling near-real-time, secure transfer of specific federal tax information (FTI) from the IRS to the FAFSA.  
   - This change reduces the need for applicants and contributors to self-report income and tax data, since FTI transferred via FA-DDX is considered verified for Title IV purposes.

2. **Consent and Approval Requirements:**  
   - Applicants and contributors now must explicitly provide consent and approval for the Department to obtain and use FTI via the FA-DDX, including a separate approval process for each aid cycle.  
   - Signatures are required in addition to consent and approval, either electronically via the FSA ID or in signed paper fo

Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

Exception raised in Job[4]: TimeoutError()
Exception raised in Job[5]: TimeoutError()
Exception raised in Job[10]: TimeoutError()
Exception raised in Job[11]: TimeoutError()
Exception raised in Job[16]: TimeoutError()
Exception raised in Job[17]: TimeoutError()
Exception raised in Job[23]: TimeoutError()
Exception raised in Job[28]: TimeoutError()
Exception raised in Job[34]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[46]: TimeoutError()
Exception raised in Job[47]: TimeoutError()
Exception raised in Job[53]: TimeoutError()
Exception raised in Job[58]: TimeoutError()
Exception raised in Job[59]: TimeoutError()
Exception raised in Job[65]: TimeoutError()


Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 1.0000, 'faithfulness': 0.9899, 'factual_correctness(mode=f1)': 0.8727, 'answer_relevancy': 0.9501, 'context_entity_recall': 0.5466, 'noise_sensitivity(mode=relevant)': 0.3333}
Evaluation tokens: 1017422
Evaluation cost: 0.151372
✅ Full result saved to: loan_docs_retriever_logs/ensemble_logs.json


In [114]:
semantic_evaluation = evaluate_retriever("Semantic", semantic_retrieval_chain, dataset)

📊 Running evaluation for: Semantic
Prompt 1 | The Consolidated Appropriations Act, 2021, through the FAFSA Simplification Act, introduced significant changes to the FAFSA process. The key changes include:

1. **Implementation of the FA-DDX:** The Act authorized a direct data exchange (FA-DDX) with the IRS, replacing the previous IRS Data Retrieval Tool (DRT). This allows most applicants to have their federal tax information automatically transferred and considered verified, reducing the need for self-reporting income information.

2. **Requirements for Consent and Approval:** Both students and contributors (such as parents and spouses) must provide consent and approval for the Department to obtain and use their federal tax information via the FA-DDX. This ensures privacy and compliance with legal standards.

3. **Changes to Application Processes:** The FAFSA Partner Portal was updated to remove the ability for financial aid administrators to initiate applications on behalf of students,

Evaluating:   0%|          | 0/66 [00:00<?, ?it/s]

Exception raised in Job[4]: TimeoutError()
Exception raised in Job[10]: TimeoutError()
Exception raised in Job[22]: TimeoutError()
Exception raised in Job[23]: TimeoutError()
Exception raised in Job[28]: TimeoutError()
Exception raised in Job[34]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[46]: TimeoutError()
Exception raised in Job[58]: TimeoutError()
Exception raised in Job[64]: TimeoutError()


Evaluation LLM model: gpt-4.1-nano
Evaluation results: {'context_recall': 1.0000, 'faithfulness': 0.9930, 'factual_correctness(mode=f1)': 0.7891, 'answer_relevancy': 0.9489, 'context_entity_recall': 0.2083, 'noise_sensitivity(mode=relevant)': 0.1069}
Evaluation tokens: 705683
Evaluation cost: 0.122638
✅ Full result saved to: loan_docs_retriever_logs/semantic_logs.json


In [115]:
all_results = {
    "Naive": naive_evaluation,
    "BM25": bm25_evaluation,
    "Multi-Query": multi_query_evaluation,
    "Parent-Document": parent_document_evaluation,
    "Contextual Compression": contextual_compression_evaluation,
    "Ensemble": ensemble_evaluation,
    "Semantic": semantic_evaluation,
}


In [116]:
import pandas as pd
from IPython.display import display

def display_summary_table(results: dict):
    """
    Displays a summary table comparing retriever pipelines with averaged RAGAS metrics.
    
    Args:
        results (dict): A dictionary of evaluation results from evaluate_retriever().
    """
    summary_data = []

    for _, result in results.items():
        eval_scores = result["evaluation_result"]

        row = {
            "Pipeline": result["name"],
            **result["summary"],
            "ContextRecall": eval_scores.get("context_recall"),
            "Faithfulness": eval_scores.get("faithfulness"),
            "FactualCorrectness": eval_scores.get("factual_correctness(mode=f1)"),
            "AnswerRelevancy": eval_scores.get("answer_relevancy"),
            "ContextEntityRecall": eval_scores.get("context_entity_recall"),
            "NoiseSensitivity": eval_scores.get("noise_sensitivity(mode=relevant)"),
        }

        summary_data.append(row)

    df = pd.DataFrame(summary_data)
    display(df.sort_values(by="ContextRecall", ascending=False).reset_index(drop=True))


In [117]:
display_summary_table(all_results)


Unnamed: 0,Pipeline,total_runtime,avg_latency,total_tokens,total_queries,total_cost,avg_tokens_per_query,ContextRecall,Faithfulness,FactualCorrectness,AnswerRelevancy,ContextEntityRecall,NoiseSensitivity
0,Ensemble,90.8,8.25,163327,11,0.017491,14847.91,1.0,0.9899,0.8727,0.9501,0.5466,0.3333
1,Contextual Compression,52.27,4.75,38018,11,0.004537,3456.18,1.0,0.9697,0.8782,0.944,0.4068,0.1558
2,Parent-Document,71.44,6.49,32814,11,0.004018,2983.09,1.0,0.9813,0.7855,0.8622,0.3666,0.2921
3,Semantic,42.27,3.84,65412,11,0.007299,5946.55,1.0,0.993,0.7891,0.9489,0.2083,0.1069
4,BM25,34.39,3.12,47793,11,0.005475,4344.82,0.9818,0.9899,0.7945,0.95,0.411,0.1902
5,Multi-Query,82.08,7.46,149441,11,0.016196,13585.55,0.9091,1.0,0.8473,0.9488,0.55,0.125
6,Naive,38.87,3.53,113514,11,0.012209,10319.45,0.9091,0.9955,0.7636,0.9499,0.3715,0.12


| Pipeline                  | Total Runtime (s)     | Avg Latency (s)      | Total Tokens         | Total Cost ($)         | Avg Tokens/Query     | Context Recall         | Faithfulness             | Factual Correctness       | Answer Relevancy         | Context Entity Recall     | Noise Sensitivity         |
|---------------------------|------------------------|------------------------|------------------------|--------------------------|------------------------|--------------------------|----------------------------|-----------------------------|----------------------------|-----------------------------|----------------------------|
| **Naive**                 | 🟡 *38.87*             | 🟡 *3.53*             | 🟡 *113,514*          | 🔴 **0.012209**         | 🔴 **10,319.45**      | 🔴 **0.9091**           | 🟡 *0.9955*               | 🔴 **0.7636**             | 🟡 *0.9499*              | 🔴 **0.3715**             | 🟢 **0.1200**             |
| **BM25**                  | 🟢 **34.39**           | 🟢 **3.12**           | 🟢 **47,793**         | 🟡 *0.005475*           | 🟡 *4,344.82*         | 🟡 *0.9818*             | 🟡 *0.9899*               | 🟡 *0.7945*               | 🟢 **0.9500**             | 🟡 *0.4110*               | 🔴 **0.1902**             |
| **Multi-Query**           | 🔴 **82.08**           | 🔴 **7.46**           | 🔴 **149,441**        | 🔴 **0.016196**         | 🔴 **13,585.55**      | 🔴 **0.9091**           | 🟢 **1.0000**             | 🟡 *0.8473*               | 🟡 *0.9488*              | 🟢 **0.5500**             | 🟡 *0.1250*              |
| **Parent-Document**       | 🟡 *71.44*             | 🟡 *6.49*             | 🟢 **32,814**         | 🟢 **0.004018**         | 🟢 **2,983.09**       | 🟢 **1.0000**           | 🔴 **0.9813**             | 🔴 **0.7855**             | 🔴 **0.8622**             | 🔴 **0.3666**             | 🔴 **0.2921**             |
| **Contextual Compression**| 🟡 *52.27*             | 🟡 *4.75*             | 🟡 *38,018*           | 🟢 **0.004537**         | 🟢 **3,456.18**       | 🟢 **1.0000**           | 🔴 **0.9697**             | 🟢 **0.8782**             | 🟡 *0.9440*              | 🟡 *0.4068*               | 🟡 *0.1558*              |
| **Ensemble**              | 🔴 **90.80**           | 🔴 **8.25**           | 🔴 **163,327**        | 🔴 **0.017491**         | 🔴 **14,847.91**      | 🟢 **1.0000**           | 🟡 *0.9899*               | 🟡 *0.8727*               | 🟢 **0.9501**             | 🟡 *0.5466*               | 🔴 **0.3333**             |
| **Semantic**              | 🟢 **42.27**           | 🟢 **3.84**           | 🟡 *65,412*           | 🟡 *0.007299*           | 🟡 *5,946.55*         | 🟢 **1.0000**           | 🟢 **0.9930**             | 🔴 **0.7891**             | 🟡 *0.9489*              | 🔴 **0.2083**             | 🟢 **0.1069**             |
