# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [4]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [5]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [6]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [7]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

## Task 3: Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [8]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each complaint as a document and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [9]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
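To make the cosine-similarity ranking concrete, here is a minimal sketch in plain Python. The vectors, `cosine_similarity`, and `top_k` are made up for illustration; the real retriever delegates this work to the vector store, and real embeddings from `text-embedding-3-small` have 1536 dimensions rather than 3.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = sorted(
        enumerate(doc_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [index for index, _ in scored[:k]]

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vec = [1.0, 0.05, 0.0]

nearest = top_k(toy_query_vec, toy_doc_vecs, k=2)
```

With `k=10`, the same idea returns the ten complaints whose embeddings point in the direction closest to the query embedding.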

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [10]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [11]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [12]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : RunnablePassthrough.assign passes every key through unchanged and
    #              re-assigns "context" to its own value from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)
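If the LCEL syntax feels opaque, the data flow of the chain can be sketched in plain Python. Everything here is a hypothetical stand-in: `fake_retriever` replaces `naive_retriever`, `fake_llm` replaces `rag_prompt | chat_model`, and `TEMPLATE` is a shortened version of the prompt. It only illustrates how the dictionaries move through the three steps, not how LangChain executes them.

```python
def fake_retriever(question):
    # Hypothetical stand-in for naive_retriever: returns a fixed list of "documents".
    return ["complaint A", "complaint B"]

def fake_llm(prompt):
    # Hypothetical stand-in for the model call: reports the size of the prompt it received.
    return f"(answer based on a prompt of {len(prompt)} characters)"

TEMPLATE = "Query:\n{question}\n\nContext:\n{context}\n"

def plain_python_chain(inputs):
    # Step 1: build {"context", "question"} from the incoming {"question"} dict.
    step_one = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: mirrors RunnablePassthrough.assign(context=...), which keeps every key
    # and re-assigns "context" to itself, so the dict is effectively unchanged.
    step_two = {**step_one, "context": step_one["context"]}
    # Step 3: format the prompt, call the model, and keep the context next to the response.
    prompt = TEMPLATE.format(**step_two)
    return {"response": fake_llm(prompt), "context": step_two["context"]}

result = plain_python_chain({"question": "What is the most common issue with loans?"})
```

Returning the context alongside the response is what lets us inspect which documents each retriever actually supplied.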

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry-picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [13]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the information provided, the most common issues with loans appear to be related to mismanagement and errors by loan servicers, including errors in loan balances, misapplied payments, wrongful denials of payment plans, inaccuracies on credit reports, and difficulties in applying payments correctly. These problems often lead to incorrect reporting, increased balances, and increased financial hardship for borrowers.'

In [14]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided data, yes, some complaints did not get handled in a timely manner. Specifically, there are complaints where the response was marked as "No" for timely response:\n\n- Complaint ID 12709087 (MOHELA) received on 03/28/25, where the response was "Closed with explanation" and marked as "Timely response?": No.\n- Complaint ID 12832400 (Maximus Federal Services, Inc.) received on 04/05/25, which was marked as "Timely response?": Yes.\n- Complaint ID 12973003 (EdFinancial Services) received on 04/14/25, and marked as "Timely response?": Yes.\n- Complaint ID 12975634 (Maximus Federal Services, Inc.) received on 04/14/25, marked as "Timely response?": Yes.\n- Complaint ID 13062402 (Nelnet, Inc.) received on 04/18/25, marked as "Timely response?": Yes.\n- Complaint ID 13056764 (EdFinancial Services) received on 04/18/25, marked as "Timely response?": Yes.\n- Several other complaints, such as the one received on 04/24/25 (Maximus) with response "Closed with explanation" and 

In [15]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to a combination of factors highlighted in the complaints:\n\n1. **Lack of Clear Communication and Notification:** Many borrowers were not properly notified about important changes such as loan transferences, when repayment was to start, or updates on payment plans. For example, some complain about loans being transferred without notice, or being expected to start payments before the grace period without explanation.\n\n2. **Complex and Hidden Interest Accumulation:** Borrowers expressed difficulty understanding how interest accumulates, especially when loans are placed in forbearance or deferment, leading to continued interest that negates their payments and increases overall debt.\n\n3. **Inadequate or Restrictive Payment Options:** Many complain that the available options, such as forbearance or deferment, either extend the repayment period or cause interest to grow significantly. Some describe the inability to make additional pay

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
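As a rough illustration of the scoring idea, here is a small pure-Python sketch of the Okapi BM25 formula. The whitespace tokenizer, the toy documents, and the `k1`/`b` values are assumptions chosen for the example, not necessarily what LangChain's `BM25Retriever` uses internally.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_counts:
                continue  # terms absent from the document contribute nothing
            # Rare terms get a larger inverse-document-frequency weight.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_counts[term]
            # Longer-than-average documents are penalised via length normalisation.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical mini-corpus, for illustration only.
toy_docs = [
    "nelnet charged a late fee on my student loan",
    "my payment was applied to the wrong loan",
    "the servicer never answered my calls",
]
toy_scores = bm25_scores("nelnet late fee", toy_docs)
```

Only the first toy document shares any words with the query, so it is the only one with a non-zero score. That is the sparse, exact-match behaviour that separates BM25 from embedding-based retrieval.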

This retriever is very straightforward to set up! Let's see it happen down below!


In [16]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)

We'll construct the same chain - only changing the retriever.

In [17]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [18]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with loans appears to be problems related to dealing with lenders or servicers, such as disputes over fees, repayment practices, or inaccurate or confusing information about loans. Specifically, complaints often involve difficulty applying payments correctly, lack of clear information about loan balances or terms, and issues with the accuracy of loan data or repayment calculations.'

In [19]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all of the complaints listed indicate that the companies responded in a "timely" manner, with responses marked as "Yes" under the "Timely response?" field. Therefore, it appears that no complaints were left unresolved or handled outside of the appropriate timeframe.'

In [20]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People often fail to pay back their loans due to a variety of reasons, including issues with payment plans, miscommunication, or administrative errors. In the provided context, specific reasons include:\n\n- Problems with payment plans, such as being steered into the wrong types of forbearances or experiencing complications with their repayment options.\n- Lack of communication from loan servicers, such as not receiving responses to requests for forbearance or deferment, or being unaware of transfer or changes in loan servicing.\n- Errors or glitches in payment processing, such as payments being reversed or not properly enrolled in autopay, leading to missed payments and negative credit impacts.\n- Deception or mismanagement by loan servicers, which can include selling loans to entities that refuse to help or providing bad information about loan status.\n- Lack of proper notification or transparency, which results in borrowers being unaware of their repayment obligations or account st

It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: We do, and the second half of the notebook will cover it.)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

##### ✅ Answer:

Here's an example query where BM25 would likely perform better than embeddings:

## Example Query:
**"What are the specific fees charged by Nelnet for late payments?"**

## Why BM25 would be better:

### 1. **Exact Keyword Matching**
- BM25 excels at finding documents containing specific terms like "Nelnet", "fees", "late payments"
- It would prioritize documents that contain these exact words, even if they're not semantically similar
- Embeddings might instead surface documents that are semantically close but never mention these terms (e.g., general complaints about payment problems)

### 2. **Company/Entity Names**
- "Nelnet" is a specific company name that BM25 would treat as a high-value term
- Embeddings might group it with other loan servicers, diluting the relevance
- BM25's inverse-document-frequency weighting gives more weight to terms that appear in few documents, which often includes proper nouns

### 3. **Technical/Specific Terminology**
- Terms like "late payments", "fees", "charges" are specific financial terms
- BM25 would find documents containing these exact terms
- Embeddings might retrieve documents about general payment issues instead

### 4. **Sparse Information Retrieval**
- When looking for specific factual information (like fee amounts), BM25's keyword-based approach is more precise
- Embeddings might return broader, more general documents about payment problems

### 5. **Domain-Specific Vocabulary**
- Financial/loan terminology often has precise meanings
- BM25 treats each term independently, while embeddings might conflate related but distinct concepts
- For example, "late payment fees" vs "processing fees" vs "origination fees" are different concepts

## Real-world scenario:
If a user is specifically looking for Nelnet's late payment fee structure, BM25 would likely return documents that explicitly mention "Nelnet" and "late payment fees", while embeddings might return documents about general payment issues with any servicer.

This demonstrates BM25's strength in **precision** for specific, keyword-heavy queries versus embeddings' strength in **semantic understanding** for conceptual queries.

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [21]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)
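Conceptually, the compression step is "score every candidate against the query, then keep the best few". The sketch below shows that shape in plain Python. `toy_relevance` is a hypothetical word-overlap scorer standing in for a learned reranking model such as Cohere Rerank, and the candidate texts are invented.

```python
def toy_relevance(query, document):
    # Hypothetical scorer: fraction of query words that appear in the document.
    # A real reranker uses a learned model instead of word overlap.
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words)

def rerank(query, candidates, top_n):
    """Keep only the top_n candidates, ordered by relevance to the query."""
    ranked = sorted(candidates, key=lambda doc: toy_relevance(query, doc), reverse=True)
    return ranked[:top_n]

# Pretend these four documents came back from the base retriever.
toy_candidates = [
    "my autopay enrollment was cancelled without notice",
    "late fee charged after payment was misapplied",
    "servicer reported a late payment to the credit bureau",
    "forbearance ended and my payment doubled",
]
compressed = rerank("late payment fee", toy_candidates, top_n=2)
```

This is why the base retriever fetches a generous `k`: the reranker can only promote documents that were retrieved in the first place.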

Let's create our chain again, and see how this does!

In [22]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [23]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with loans appears to be problems related to improper handling and misinformation by lenders or servicers. Specific issues include errors in loan balances, misapplied payments, wrongful denials of payment plans, lack of clear communication, and mishandling of loan data. These problems often lead to increased debt despite payments, disputes over account accuracy, and privacy violations.'

In [24]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, some complaints were handled in a timely manner, as indicated by the responses marked "Yes" for "Timely response?" However, there are also complaints where issues remain unresolved or lengthy delays are noted, such as the complaint about pending response after over a year and nearly 18 months with no resolution. \n\nTherefore, while some complaints did receive timely responses, others experienced significant delays or have not yet been resolved.'

In [25]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to a combination of factors highlighted in the complaints:\n\n1. Lack of understanding and communication: Borrowers were often not informed about their repayment obligations or the details of their loans, including interest accumulation and repayment terms.\n\n2. Accumulation of interest and stagnant wages: Interest continued to grow even when payments were deferred or reduced, making loans harder to pay off over time. Borrowers also faced stagnant wages, which made increasing payments difficult.\n\n3. Limited repayment options: Many borrowers felt that available options, such as forbearance or deferment, extended the repayment period and increased total interest, without providing manageable solutions.\n\n4. Financial hardship and lack of support: Borrowers, especially young or low-income individuals, struggled with balancing loan repayment alongside everyday expenses, leading to difficulties in making payments.\n\n5. Administrative

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how hard is it to set up? Not bad! Let's see it down below!



In [26]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)
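The retrieve-then-merge part of this process can be sketched without any LLM. In the sketch below the reformulated queries are hard-coded (in the real retriever an LLM writes them), and `toy_retrieve` is a hypothetical word-overlap retriever standing in for `naive_retriever`.

```python
def toy_retrieve(query, corpus, k=2):
    # Hypothetical retriever: rank documents by how many query words they share.
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def multi_query_retrieve(queries, corpus, k=2):
    """Retrieve for every query and return the unique documents in first-seen order."""
    unique_docs = []
    for query in queries:
        for doc in toy_retrieve(query, corpus, k):
            if doc not in unique_docs:
                unique_docs.append(doc)
    return unique_docs

# Invented mini-corpus, for illustration only.
toy_corpus = [
    "payment difficulties after forbearance ended",
    "repayment issues with my servicer",
    "trouble reaching customer support",
    "loan payment problems since the transfer",
]
# Hard-coded reformulations standing in for LLM-generated queries.
toy_queries = ["loan payment problems", "repayment issues", "payment difficulties"]
merged = multi_query_retrieve(toy_queries, toy_corpus)
```

Here a single query returns two documents, while the three reformulations together surface three unique ones, which is the recall gain this retriever is after.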

In [27]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [28]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be problems related to mismanagement by loan servicers, including errors in loan balances, misapplied payments, wrongful denials of payment plans, and inadequate communication or transparency. Many complaints highlight errors in account handling, improper reporting, and disputes over loan terms and balances.'

In [29]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints data, it appears that several complaints indicate issues with timely handling, response, or resolution of complaints or issues by the companies involved. For example:\n\n- One complaint (row 441) indicates the response was **not timely** ("Timely response?": "No").\n- Multiple complaints report long delays or failure to respond within expected timeframes, including complaints from rows 66, 90, 95, and others, mentioning delays of over 30 days, months, or even years without resolution or proper handling.\n- Several complaints explicitly mention that the companies failed to address issues promptly or did not respond at all, indicating that complaints were not handled in a timely manner.\n\nTherefore, yes, some complaints did not get handled in a timely manner, as evidenced by multiple complaints stating delays, lack of response, or overdue resolutions.'

In [30]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily because of several interconnected reasons highlighted in the complaints:\n\n1. **Accumulation of Unmanageable Interest:** Many borrowers were not adequately informed that interest would continue to accrue and compound during forbearance or deferment, leading to an increasing balance that becomes difficult to repay.\n\n2. **Lack of Clear Communication and Guidance:** Borrowers often received little to no notice or explanations about their repayment obligations, changes in loan servicing, or available repayment options such as income-driven plans or loan rehabilitation.\n\n3. **Predatory and Misleading Practices:** Some borrowers experienced forbearance steering, where they were repeatedly placed into long-term forbearances without being informed about more sustainable options, causing their balances to balloon.\n\n4. **Systemic Service Failures:** Errors in loan balances, misapplied payments, improper transfers between servicers, and inac

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

##### ✅ Answer:

Here's how generating multiple reformulations of a user query can improve recall:

## How Multiple Query Reformulations Improve Recall:

### 1. **Addressing Vocabulary Mismatch**
- **Problem**: Users and documents may use different terms for the same concept
- **Solution**: Multiple reformulations can include synonyms, related terms, and alternative phrasings
- **Example**: 
  - Original: "loan payment problems"
  - Reformulations: "payment difficulties", "repayment issues", "trouble with payments", "payment challenges"

### 2. **Capturing Different Query Intentions**
- **Problem**: A single query might not capture all possible interpretations
- **Solution**: Different reformulations can explore various aspects of the query
- **Example**:
  - Original: "student loan issues"
  - Reformulations: 
    - "problems with student loans"
    - "student loan complaints"
    - "difficulties with education financing"
    - "issues with federal student aid"

### 3. **Handling Query Ambiguity**
- **Problem**: Queries can be ambiguous or underspecified
- **Solution**: Multiple reformulations can disambiguate and explore different meanings
- **Example**:
  - Original: "loan problems"
  - Reformulations:
    - "problems with loan payments"
    - "problems with loan servicers"
    - "problems with loan applications"
    - "problems with loan forgiveness"

### 4. **Expanding Semantic Coverage**
- **Problem**: Documents might be relevant but use different semantic expressions
- **Solution**: Reformulations can cover broader semantic space
- **Example**:
  - Original: "late payment fees"
  - Reformulations:
    - "penalties for missed payments"
    - "charges for overdue amounts"
    - "fees for delayed payments"
    - "consequences of not paying on time"

### 5. **Overcoming Embedding Limitations**
- **Problem**: Embeddings might miss relevant documents due to semantic drift
- **Solution**: Multiple queries increase the chance of finding relevant documents
- **Mechanism**: Each reformulation creates a different vector, potentially retrieving different document sets

## Technical Process:

1. **Query Generation**: LLM creates multiple reformulations of the original query
2. **Parallel Retrieval**: Each reformulation is used to retrieve documents independently
3. **Deduplication**: Remove duplicate documents across all retrieval results
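4. **Ranking** (optional): Combine and rank the unique documents using a fusion algorithm (like Reciprocal Rank Fusion). The Multi-Query Retriever described in this task skips this step and uses all unique retrieved documents as context.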

## Benefits:

- **Higher Recall**: More relevant documents are found across different query formulations
- **Better Coverage**: Documents that might be missed by a single query are captured
- **Robustness**: Less dependent on the specific wording of the original query
- **Comprehensive Results**: Provides a more complete picture of available relevant information

This approach essentially "casts a wider net" by exploring the query space from multiple angles, significantly improving the chances of finding all relevant documents.

## Task 8: Parent Document Retriever

The Parent Document Retriever is a "small-to-big" strategy that works as follows:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent documents".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
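The "search small, return big" idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: chunks are split by word count rather than by characters, a shared-word count stands in for vector similarity, and the parent texts and helper names are invented.

```python
def split_into_chunks(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def build_child_index(parents, chunk_size):
    """Return (chunk, parent_id) pairs linking each child chunk to its parent."""
    return [
        (chunk, parent_id)
        for parent_id, text in parents.items()
        for chunk in split_into_chunks(text, chunk_size)
    ]

def retrieve_parents(query, parents, child_index, k=2):
    """Search the child chunks, then return the unique parents of the best matches."""
    query_words = set(query.lower().split())
    # Hypothetical similarity: shared-word count stands in for vector similarity.
    ranked = sorted(
        child_index,
        key=lambda pair: len(query_words & set(pair[0].lower().split())),
        reverse=True,
    )
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]

# Invented parent documents, for illustration only.
toy_parents = {
    "doc-1": "my servicer misapplied a payment and then charged a late fee on the account",
    "doc-2": "the forbearance program ended and nobody told me when repayment would start",
}
toy_index = build_child_index(toy_parents, chunk_size=5)
results = retrieve_parents("late fee", toy_parents, toy_index, k=2)
```

Both matching chunks belong to the same parent, so the retriever returns that one full document instead of two fragments.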

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [31]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension; you'll need to change this if you're using a different embedding model.

In [32]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [33]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [34]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [35]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [36]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be related to "Dealing with your lender or servicer," including specific problems such as discrepancies in loan balances, misapplied payments, wrongful denials of payment plans, and issues with loan reporting or verification. These issues reflect systemic problems in loan servicing and communication between lenders and borrowers.'

In [37]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, several complaints indicate that they were not handled in a timely manner. Specifically:\n\n- The complaint regarding the student loan application submitted to MOHELA found that the consumer was told it would take 15 days for someone to contact them, but they had not heard back as of the date of the complaint, which was after that period. Additionally, repeated calls with wait times of four hours or more suggest delays in handling the issue.\n- Another complaint involved waiting on hold for several hours (up to seven hours at one point) with MOHELA, indicating significant delays in resolving issues.\n- The complaint about disputes sent to credit bureaus over 30 days ago also remained unresolved.\n\nIn summary, yes, some complaints did not get handled in a timely manner.'

In [38]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans primarily due to a variety of challenges including financial hardship, mismanagement, lack of proper information, and issues related to the status of their lenders or educational institutions. Specific reasons mentioned include:\n\n- Experiencing severe financial hardship after graduation, making it difficult to make or maintain payments.\n- Being misled or inadequately informed about the long-term financial consequences of taking out loans, especially when attending schools facing financial instability.\n- The inability to secure employment in their field of study, which hampers their ability to repay loans.\n- Withdrawal of communication or failure of loan servicers to provide clear information and proper notification regarding repayment obligations.\n- Mistakes or issues with credit reporting and collection practices, which can complicate repayment efforts.\n\nOverall, these factors combined can hinder individuals' ability to successfully repay

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the performance question more rigorously.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes two or more retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.
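For intuition, here is a minimal sketch of weighted Reciprocal Rank Fusion in plain Python. The constant `c=60` comes from the original RRF paper; the per-list weights, the function name, and the toy rankings are assumptions for illustration rather than a copy of LangChain's implementation.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, weights, c=60):
    """Fuse several ranked lists into one using weighted Reciprocal Rank Fusion."""
    fused = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(ranking, start=1):
            # Each list contributes weight / (c + rank), so documents nearer
            # the top of a list contribute more to the fused score.
            fused[doc] += weight / (c + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Toy rankings from two hypothetical retrievers (document IDs are made up).
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused_ranking = reciprocal_rank_fusion(
    [keyword_ranking, vector_ranking], weights=[0.5, 0.5]
)
```

Because only ranks are used, the fusion works even when the retrievers produce scores on incomparable scales (for example BM25 scores versus cosine similarities).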

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [39]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.
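To make the fusion step concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion over plain lists of document IDs. The function name `weighted_rrf`, the toy rankings, and the smoothing constant `c=60` (the value used in the linked RRF paper) are assumptions for illustration; this is not a copy of LangChain's implementation.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns weight / (rank + c) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores = defaultdict(float)
    for doc_ids, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(doc_ids, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Two hypothetical retrievers that disagree on ordering
keyword_ranking = ["a", "b", "c"]
vector_ranking = ["b", "c", "a"]
fused = weighted_rrf([keyword_ranking, vector_ranking], weights=[0.5, 0.5])
print(fused)
```

Documents that rank well in several lists rise to the top even if no single retriever placed them first, which is why equal weights are a reasonable starting point.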

In [40]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [41]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the complaints provided, appears to be problems related to the handling and management of student loans by servicers. The prevalent issues include:\n\n- **Dealing with your lender or servicer** (e.g., receiving bad information, improper handling of loan accounts, problems with paying or applying payments, and inappropriate loan transfers)\n- **Trouble with how payments are being handled** (e.g., inability to apply payments correctly, interest capitalization, or misapplication of payments)\n- **Incorrect or misleading information about loan balances and status** (such as inaccurate reporting, hidden or unexplained interest growth, or errors in loan details)\n- **Lack of transparency and communication** (including failure to inform borrowers of loan transfer, default status, or changes in loan terms)\n- **Problems with loan classification or mismanagement** (e.g., incorrect loan type classification, improper ending of deferments, or mishandling

In [42]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the information provided, yes, there were complaints that did not get handled in a timely manner. Specifically:\n\n- Complaint ID 12709087 received on 03/28/25 was marked as "No" for timely response.\n- Complaint ID 12935889 received on 04/11/25 was also marked as "No" for timely response.\n- Complaint ID 12668396 received on 03/26/25 was marked as "No" for timely response.\n\nMost other complaints indicate responses that were timely ("Yes") or are not explicitly marked as late responses, but at least these three clearly indicate delays in handling. \n\nSo, the answer is: Yes, some complaints did not get handled in a timely manner.'

In [43]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to a combination of factors highlighted in the complaints:\n\n1. **Lack of Clear or Accurate Information**: Borrowers were often misinformed or insufficiently informed about their repayment options, loan balances, interest accrual, and the status of their loans, which led to confusion and unintentional delinquency.\n\n2. **Predatory or Coercive Servicing Practices**: Many borrowers experienced steering into long-term forbearances, default, or consolidation without being informed of alternative options like income-driven repayment plans or loan rehabilitation. This practice, known as "forbearance steering," increased their total debt due to accumulated interest and reduced their chances of manageable repayment.\n\n3. **Unfair Collection and Reporting Practices**: Complaints include improper reporting of delinquency, failure to notify borrowers of missed payments, and misreporting loan statuses, which damaged credit scores and hindered

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example, which calculates the distances between consecutive sentences and then splits the text wherever a distance exceeds a given percentile of all distances.
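As a rough illustration of the percentile idea, the sketch below splits a list of sentences using hand-written toy vectors in place of real embeddings. The helper names (`cosine_distance`, `percentile`, `percentile_chunks`) and the vectors are assumptions made for this example; the real `SemanticChunker` relies on an embedding model and does more work than this.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def percentile_chunks(sentences, vectors, pct=95):
    """Start a new chunk wherever the distance to the previous sentence exceeds the threshold."""
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy data: two sentences about payments, then two about credit reports
sentences = ["Pay A.", "Pay B.", "Credit A.", "Credit B."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
chunks = percentile_chunks(sentences, vectors, pct=50)
print(chunks)
```

With only three distances, a 50th-percentile threshold is used here so that the single large jump stands out; on a real corpus a much higher percentile is typical.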

In [44]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [45]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [46]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [47]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [48]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [49]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided data, the most common issue with loans appears to be related to problems with repayment and servicing, such as difficulties in making payments, inaccurate billing or payment amounts, and lack of clear communication from loan servicers. Many complaints mention issues like auto-debit problems, incorrect interest calculations, and delays or failures in processing recertifications or payment plans. These recurring themes suggest that mismanagement or miscommunication regarding loan repayment is a prevalent concern among borrowers.'

In [50]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided data, most complaints indicate that responses from the companies were marked as "Closed with explanation" and are noted as being handled in a timely manner ("Yes" for timely response). However, one complaint explicitly states that the company, Nelnet, did not respond to the complaint—despite multiple attempts—and the complaint remains unresolved with no proper handling mentioned.\n\nSpecifically:\n- The complaint regarding the transferred account with alleged misconduct (Complaint ID: 13331376) shows that Nelnet never responded to the consumer\'s letters, despite acknowledgment of receipt and multiple follow-ups. The company\'s response was "Closed with explanation," but it appears that the complaints about misconduct and errors were not addressed properly or promptly.\n\n**Conclusion:**  \nYes, there was at least one complaint that did not get handled in a timely manner, or potentially at all, as per the complaint narrative.'

In [51]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for various reasons, including issues with the processing or handling of their payments, misunderstandings about their loan status, difficulties with loan servicers providing inaccurate or incomplete information, and delays or errors in re-amortization or reactivation of payments after forbearance periods ended. Additionally, some borrowers faced disputes over the legitimacy of their debts or experienced improper reporting of their loan status, which contributed to non-repayment. In some cases, the lack of transparency, poor communication, or alleged deliberate stalling by loan servicers also hindered borrowers' ability to manage and repay their loans effectively."

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

##### ✅ Answer:

Here's how semantic chunking would behave with short, highly repetitive sentences (like FAQs) and how to adjust the algorithm:

### 1. **Over-chunking Problem**
- **Issue**: Short, repetitive sentences will have very similar embeddings
- **Result**: The algorithm might create too many tiny chunks or fail to find meaningful breakpoints
- **Example**: FAQ sentences like "What is X?" "How do I Y?" "Where can I find Z?" would all have similar semantic vectors

### 2. **Poor Semantic Differentiation**
- **Issue**: Repetitive content lacks semantic diversity
- **Result**: The similarity thresholding methods (percentile, standard deviation, etc.) won't find clear breakpoints
- **Problem**: All sentences appear equally similar, so no natural chunking boundaries are detected

### 3. **Ineffective Thresholding**
- **Issue**: Methods like percentile or standard deviation assume semantic variation
- **Result**: With repetitive content, these methods may create arbitrary or inconsistent chunks
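- **Example**: If all sentence similarities fall between 0.85 and 0.90, a percentile threshold still splits at the largest few distances, even though those gaps reflect noise rather than real topic changes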
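A quick way to see why a relative threshold struggles here: the sketch below takes a set of made-up cosine distances that all sit in a narrow band (similarities of roughly 0.87 to 0.90) and applies a 95th-percentile cutoff. The distance values are invented for illustration only.

```python
import statistics

# Hypothetical distances between consecutive FAQ sentences; all are nearly identical
distances = [0.10, 0.11, 0.12, 0.10, 0.11, 0.13, 0.12, 0.11, 0.10, 0.12]

# With n=20, the last cut point is the 95th percentile
threshold = statistics.quantiles(distances, n=20, method="inclusive")[-1]

# A breakpoint is any gap above the threshold
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
spread = max(distances) - min(distances)

print(f"threshold={threshold:.4f}, breakpoints={breakpoints}, spread={spread:.2f}")
```

A split is still produced even though the whole range of distances spans only a few hundredths, so the chosen boundary is essentially arbitrary.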

## How to Adjust the Algorithm:

### 1. **Use Structural Chunking Instead**
```python
# Instead of semantic chunking, use rule-based chunking
from langchain_text_splitters import RecursiveCharacterTextSplitter

structural_chunker = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
```

### 2. **Adjust Threshold Parameters**
```python
# A higher percentile means fewer breakpoints, so near-duplicate sentences stay grouped
semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=98  # the percentile default is 95
)
```

### 3. **Pre-filter Repetitive Content**
```python
# Remove or consolidate repetitive sentences before chunking
def deduplicate_sentences(documents):
    seen_content = set()
    filtered_docs = []
    for doc in documents:
        if doc.page_content not in seen_content:
            seen_content.add(doc.page_content)
            filtered_docs.append(doc)
    return filtered_docs
```

### 4. **Use Hybrid Approach**
```python
# Combine semantic and structural chunking
def hybrid_chunking(documents):
    # First, use structural chunking for repetitive sections
    structural_chunks = structural_chunker.split_documents(documents)
    
    # Then, apply semantic chunking only to non-repetitive sections
    semantic_chunks = []
    for chunk in structural_chunks:
        if has_semantic_variation(chunk):
            semantic_chunks.extend(semantic_chunker.split_documents([chunk]))
        else:
            semantic_chunks.append(chunk)
    
    return semantic_chunks
```

### 5. **Adjust Embedding Strategy**
```python
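# Use domain-specific embeddings or fine-tuned models
from sentence_transformers import SentenceTransformer

# 'domain-specific-model' is a placeholder; substitute a model fine-tuned for your domain.
# Note: SemanticChunker expects a LangChain Embeddings object, so a raw
# SentenceTransformer would need a wrapper before being passed to it.
domain_embeddings = SentenceTransformer('domain-specific-model')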
```

### 6. **Implement Content-Aware Chunking**
```python
def content_aware_chunking(documents):
    chunks = []
    for doc in documents:
        # Check if content is repetitive
        if is_repetitive_content(doc.page_content):
            # Use larger chunks for repetitive content
            chunks.extend(structural_chunker.split_documents([doc]))
        else:
            # Use semantic chunking for varied content
            chunks.extend(semantic_chunker.split_documents([doc]))
    return chunks
```
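The snippet above calls `is_repetitive_content` without defining it. One possible sketch, assuming a purely lexical heuristic, flags text as repetitive when the share of distinct words is low. The `max_unique_ratio` cutoff of 0.5 is an arbitrary assumption and would need tuning, since longer texts naturally repeat more words.

```python
import re


def is_repetitive_content(text, max_unique_ratio=0.5):
    """Heuristic: text is repetitive if few of its words are distinct."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return False
    unique_ratio = len(set(words)) / len(words)
    return unique_ratio <= max_unique_ratio


faq_text = (
    "How do I reset my password? How do I reset my email? "
    "How do I reset my phone? How do I reset my PIN?"
)
narrative_text = (
    "The servicer applied my payment to the wrong account "
    "and never corrected the error."
)

print(is_repetitive_content(faq_text))
print(is_repetitive_content(narrative_text))
```

An embedding-based check (for example, the spread of distances between consecutive sentences) would be a closer match to what semantic chunking actually measures, at the cost of extra embedding calls.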

## Best Practices for Repetitive Content:

1. **Analyze content first**: Determine if semantic chunking is appropriate
2. **Use structural chunking**: For FAQs and repetitive content, rule-based chunking is often better
3. **Combine approaches**: Use semantic chunking only where it adds value
4. **Adjust thresholds**: Use higher thresholds for repetitive content
5. **Consider domain-specific solutions**: FAQ content might benefit from question-answer pair chunking

## Alternative for FAQs:
```python
# For FAQ content, chunk by Q&A pairs instead
from langchain_core.documents import Document

def faq_chunking(documents):
    chunks = []
    for doc in documents:
        # Split by question-answer patterns (extract_qa_pairs is a helper you supply)
        qa_pairs = extract_qa_pairs(doc.page_content)
        for qa in qa_pairs:
            chunks.append(Document(page_content=qa, metadata=doc.metadata))
    return chunks
```
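`extract_qa_pairs` is left undefined above. A minimal sketch follows, assuming each question starts a line with `Q:` and is followed by its answer; real FAQ pages vary widely, so the pattern is an assumption that would need adapting to the actual format.

```python
import re


def extract_qa_pairs(text):
    """Return one string per Q&A pair, assuming each question line starts with 'Q:'."""
    # Split immediately before every line that begins with "Q:"
    blocks = re.split(r"(?m)^(?=Q:)", text)
    return [block.strip() for block in blocks if block.strip().startswith("Q:")]


faq = (
    "Q: What is X?\n"
    "A: X is a thing.\n"
    "\n"
    "Q: How do I Y?\n"
    "A: Press the button.\n"
)
pairs = extract_qa_pairs(faq)
print(pairs)
```

Keeping each question together with its answer means a retrieved chunk is always a complete, self-contained unit.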

The key insight is that semantic chunking works best with semantically diverse content. For repetitive content like FAQs, traditional structural chunking or domain-specific approaches are often more effective.

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever-specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

In [52]:
# Add code here

In [54]:
# ============================================================================
# STEP 1: SET UP LANGSMITH DATASET
# ============================================================================

import os
import time
from getpass import getpass 
from datetime import datetime
from collections import defaultdict
from langsmith import Client

# Setup LangSmith API key
os.environ["LANGCHAIN_API_KEY"] = getpass("Please enter your LANGCHAIN API key!")

# Create LangSmith client
client = Client()

# Create dataset for evaluation
dataset_name = f"goldendataset-{datetime.now().strftime('%Y%m%d-%H%M%S')}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Retrieval Methods Evaluation Dataset"
)

print(f" Dataset URL: https://smith.langchain.com/datasets/{langsmith_dataset.id}")
print(f" Dataset: {dataset_name}")

 Dataset URL: https://smith.langchain.com/datasets/332652ca-a3a1-4476-a487-0c7c76b566a5
 Dataset: goldendataset-20250729-033937


In [55]:
# =============================================================================
# STEP 2: LOAD DOCUMENTS
# =============================================================================

# Use the loan_complaint_data already loaded in the notebook
print(f"📊 Loaded {len(loan_complaint_data)} loan complaint documents")
print(f"📋 Sample document: {loan_complaint_data[0].page_content[:200]}...")

📊 Loaded 825 loan complaint documents
📋 Sample document: The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new...


In [56]:
# =============================================================================
# STEP 3: GENERATE GOLDEN DATASET USING ABSTRACTED SDG
# =============================================================================

from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

import random
sample_docs = random.sample(loan_complaint_data, min(50, len(loan_complaint_data)))

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

testset = generator.generate_with_langchain_docs(sample_docs[:20], testset_size=10)

# Extract questions
test_questions = list(testset.to_pandas()['user_input'])
print(f" Generated {len(test_questions)} questions")

testset.to_pandas()

Node 94e6048c-7d08-4191-aebf-bba94ae7d4fa does not have a summary. Skipping filtering.
Node baada44b-c39c-4e9f-8259-0ced2560097d does not have a summary. Skipping filtering.
Node 3d5ac006-558d-43e2-a427-8a833f1ebff3 does not have a summary. Skipping filtering.
Node 566b8b5b-e0be-4c64-9b5c-d806a643fed1 does not have a summary. Skipping filtering.
Node 868c1e29-e765-4398-96c7-3d5e9ec0446a does not have a summary. Skipping filtering.
Node c1086a1a-1175-40ef-a564-8b1ca5125bd2 does not have a summary. Skipping filtering.
Node 585e26d7-9f08-4d07-842d-a332778a42ee does not have a summary. Skipping filtering.
Node 14dc12b5-7570-452b-8a0c-332cb1b9ec6e does not have a summary. Skipping filtering.


 Generated 10 questions


| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | How come XXXX XXXX just go and look at my stuf... | [XXXX XXXX violated my FERPA rights by accessi... | XXXX XXXX violated my FERPA rights by accessin... | single_hop_specifc_query_synthesizer |
| 1 | Wut FCRA requriements were not folowed in this... | [I am filing this complaint against XXXX due ... | Under the Fair Credit Reporting Act (FCRA), lo... | single_hop_specifc_query_synthesizer |
| 2 | What issue occurred after the student loan was... | [Federal request for payment history and recor... | After the student loan was transferred to XXXX... | single_hop_specifc_query_synthesizer |
| 3 | How FERPA got broke by Nelnet data breach? | [I am filing this complaint against Nelnet due... | The Nelnet data breach raised concerns over a ... | single_hop_specifc_query_synthesizer |
| 4 | Howw can a loan servicer's repeated payment re... | [I've made regular payments on my loan but the... | In the provided context, the borrower's credit... | single_hop_specifc_query_synthesizer |
| 5 | If a student loan was paid off as part of a bo... | [<1-hop>\n\nMy student loan was part of a borr... | When a student loan is paid off through a borr... | multi_hop_specific_query_synthesizer |
| 6 | What was the impact of the Department of Gover... | [<1-hop>\n\nOn XX/XX/year>, XXXX XXXX instruct... | The Department of Government Efficiency gained... | multi_hop_specific_query_synthesizer |
| 7 | How have issues with NelNet's online account s... | [<1-hop>\n\nI spoke with a representative on t... | Borrowers have faced significant challenges wi... | multi_hop_specific_query_synthesizer |
| 8 | How did Nelnet's data breach and online accoun... | [<1-hop>\n\nI am filing this complaint against... | Nelnet's data breach led to concerns about una... | multi_hop_specific_query_synthesizer |
| 9 | How did the actions of the Department of Gover... | [<1-hop>\n\nI am submitting this complaint due... | The compromise of personal information for stu... | multi_hop_specific_query_synthesizer |


In [65]:
# =============================================================================
# STEP 4: EVALUATE RETRIEVERS WITH REDUCED QUESTIONS + COST TRACKING (FIXED)
# =============================================================================

import time
from ragas import EvaluationDataset, evaluate
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas.llms import LangchainLLMWrapper
from ragas import RunConfig
from langchain_openai import ChatOpenAI
from langchain.callbacks import LangChainTracer
from langsmith import Client

# Setup evaluation LLM for Ragas
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))
custom_run_config = RunConfig(timeout=360)

# Setup LangSmith tracer for cost tracking
langsmith_client = Client()
tracer = LangChainTracer(project_name="retriever-evaluation")

# Collect existing retrievers from notebook
retrievers = {
    "Naive": naive_retrieval_chain,
    "BM25": bm25_retrieval_chain,
    "Multi-Query": multi_query_retrieval_chain,
    "Parent Document": parent_document_retrieval_chain,
    "Contextual Compression": contextual_compression_retrieval_chain,
    "Ensemble": ensemble_retrieval_chain
}

# Use only the first 10 questions
limited_testset = list(testset)[:10]
print(f"📊 Using {len(limited_testset)} questions for evaluation")

# Evaluate each retriever
results = {}
for name, chain in retrievers.items():
    print(f"📊 Evaluating {name}...")
    
    # Track latency and cost
    latencies = []
    total_cost = 0.0
    eval_dataset_samples = []
    
    for test_row in limited_testset:
        # Measure latency with cost tracking
        start_time = time.time()
        
        result = chain.invoke(
            {"question": test_row.eval_sample.user_input},
            config={"callbacks": [tracer], "tags": [f"retriever_{name}"]}
        )
        
        end_time = time.time()
        latencies.append(end_time - start_time)
        
        # Extract response and context
        response = result["response"].content if hasattr(result["response"], 'content') else str(result["response"])
        retrieved_contexts = [doc.page_content for doc in result["context"]]
        
        eval_sample = {
            'user_input': test_row.eval_sample.user_input,
            'response': response,
            'retrieved_contexts': retrieved_contexts,
            'reference_contexts': test_row.eval_sample.reference_contexts,
            'reference': test_row.eval_sample.reference
        }
        eval_dataset_samples.append(eval_sample)
    
    # Create evaluation dataset
    import pandas as pd
    eval_df = pd.DataFrame(eval_dataset_samples)
    evaluation_dataset = EvaluationDataset.from_pandas(eval_df)
    
    # Run Ragas evaluation
    ragas_result = evaluate(
        dataset=evaluation_dataset,
        metrics=[
            LLMContextRecall(), 
            Faithfulness(), 
            FactualCorrectness(), 
            ResponseRelevancy(), 
            ContextEntityRecall(), 
            NoiseSensitivity()
        ],
        llm=evaluator_llm,
        run_config=custom_run_config
    )
    
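    # Fetch this retriever's traced runs from LangSmith (matched by tag) to total their cost.
    # Runs left over from earlier executions in the same project are counted as well.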
    tag_filter = f"has(tags, 'retriever_{name}')"
    recent_runs = list(langsmith_client.list_runs(
        project_name="retriever-evaluation",
        limit=50,
        filter=tag_filter
    ))
    
    for run in recent_runs:
        if run.total_cost:
            total_cost += float(run.total_cost)  # 🔥 Convert Decimal to float
    
    # Calculate metrics
    avg_latency = sum(latencies) / len(latencies)
    cost_per_query = total_cost / len(latencies) if len(latencies) > 0 else 0.0
    
    # Store results
    results[name] = {
        'ragas_scores': ragas_result,
        'avg_latency': avg_latency,
        'total_latency': sum(latencies),
        'total_cost': total_cost,
        'cost_per_query': cost_per_query,
        'num_queries': len(latencies)
    }
    
    print(f"✅ {name} completed - Latency: {avg_latency:.2f}s | Cost: ${total_cost:.4f}")

print("🎉 All evaluations completed!")

📊 Using 10 questions for evaluation
📊 Evaluating Naive...


Exception raised in Job[11]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[47]: TimeoutError()
Exception raised in Job[53]: TimeoutError()


✅ Naive completed - Latency: 5.64s | Cost: $0.0107
📊 Evaluating BM25...


Exception raised in Job[16]: TimeoutError()


✅ BM25 completed - Latency: 5.03s | Cost: $0.0053
📊 Evaluating Multi-Query...


Exception raised in Job[23]: AttributeError('StringIO' object has no attribute 'statements')
Exception raised in Job[41]: AttributeError('StringIO' object has no attribute 'statements')
Exception raised in Job[11]: AttributeError('StringIO' object has no attribute 'statements')
Exception raised in Job[17]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[46]: TimeoutError()
Exception raised in Job[47]: TimeoutError()
Exception raised in Job[53]: TimeoutError()
Exception raised in Job[59]: TimeoutError()


✅ Multi-Query completed - Latency: 6.19s | Cost: $0.0144
📊 Evaluating Parent Document...


Exception raised in Job[59]: AttributeError('StringIO' object has no attribute 'statements')


✅ Parent Document completed - Latency: 3.87s | Cost: $0.0063
📊 Evaluating Contextual Compression...


Exception raised in Job[36]: AttributeError('StringIO' object has no attribute 'classifications')


✅ Contextual Compression completed - Latency: 3.66s | Cost: $0.0042
📊 Evaluating Ensemble...


Exception raised in Job[5]: TimeoutError()
Exception raised in Job[11]: TimeoutError()
Exception raised in Job[16]: TimeoutError()
Exception raised in Job[17]: TimeoutError()
Exception raised in Job[29]: TimeoutError()
Exception raised in Job[34]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[47]: TimeoutError()
Exception raised in Job[53]: TimeoutError()


✅ Ensemble completed - Latency: 10.00s | Cost: $0.0074
🎉 All evaluations completed!


In [66]:
# =============================================================================
# STEP 5: ANALYZE AND DISPLAY RESULTS WITH CORRECTED RAGAS METRICS
# =============================================================================

import pandas as pd

# Compile results with Ragas metrics, performance data, and cost
summary_data = []
for name, result_data in results.items():
    ragas_scores = result_data['ragas_scores']
    
    # 🔥 CORRECTED: Extract Ragas metric scores from DataFrame
    df = ragas_scores.to_pandas()
    
    # Extract scores using correct column names from the DataFrame
    context_recall = df['context_recall'].mean() if 'context_recall' in df.columns else 0.0
    faithfulness = df['faithfulness'].mean() if 'faithfulness' in df.columns else 0.0
    factual_correctness = df['factual_correctness'].mean() if 'factual_correctness' in df.columns else 0.0
    response_relevancy = df['answer_relevancy'].mean() if 'answer_relevancy' in df.columns else 0.0  # Note: answer_relevancy, not response_relevancy
    context_entity_recall = df['context_entity_recall'].mean() if 'context_entity_recall' in df.columns else 0.0
    noise_sensitivity = df['noise_sensitivity_relevant'].mean() if 'noise_sensitivity_relevant' in df.columns else 0.0  # Note: noise_sensitivity_relevant
    
    summary_data.append({
        'Retriever': name,
        'Context Recall': f"{context_recall:.3f}",
        'Faithfulness': f"{faithfulness:.3f}",
        'Factual Correctness': f"{factual_correctness:.3f}",
        'Response Relevancy': f"{response_relevancy:.3f}",
        'Context Entity Recall': f"{context_entity_recall:.3f}",
        'Noise Sensitivity': f"{noise_sensitivity:.3f}",
        'Avg Latency (s)': f"{result_data['avg_latency']:.2f}",
        'Total Cost ($)': f"{result_data['total_cost']:.4f}",
        'Cost per Query ($)': f"{result_data['cost_per_query']:.4f}",
        'Queries': result_data['num_queries']
    })

results_df = pd.DataFrame(summary_data)

print("📊 RAGAS EVALUATION RESULTS WITH PERFORMANCE METRICS & COST")
print("=" * 100)
print(results_df.to_string(index=False))

# Calculate composite scores for ranking
print("\n📈 COMPOSITE PERFORMANCE ANALYSIS")
print("=" * 50)

for name, result_data in results.items():
    ragas_scores = result_data['ragas_scores']
    
    # 🔥 CORRECTED: Extract quality metrics from DataFrame
    df = ragas_scores.to_pandas()
    
    quality_metrics = [
        df['faithfulness'].mean() if 'faithfulness' in df.columns else 0.0,
        df['factual_correctness'].mean() if 'factual_correctness' in df.columns else 0.0,
        df['answer_relevancy'].mean() if 'answer_relevancy' in df.columns else 0.0,  # Note: answer_relevancy
        df['context_recall'].mean() if 'context_recall' in df.columns else 0.0
    ]
    
    avg_quality = sum(quality_metrics) / len(quality_metrics) if quality_metrics else 0.0
    avg_latency = result_data['avg_latency']
    
    print(f"{name:20} | Quality: {avg_quality:.3f} | Latency: {avg_latency:.2f}s | Score/Speed: {avg_quality/avg_latency:.3f}")

print("\n💡 ANALYSIS:")
print("- Higher Quality scores are better")
print("- Lower Latency is better") 
print("- Score/Speed ratio shows quality per second")
print("- Consider cost implications for production use")
print("- Contextual Compression may have additional API costs (Cohere Rerank)")
print("- Multi-Query generates additional LLM calls, increasing cost and latency")

📊 RAGAS EVALUATION RESULTS WITH PERFORMANCE METRICS & COST
             Retriever Context Recall Faithfulness Factual Correctness Response Relevancy Context Entity Recall Noise Sensitivity Avg Latency (s) Total Cost ($) Cost per Query ($)  Queries
                 Naive          0.817        0.840               0.493              0.849                 0.362             0.430            5.64         0.0107             0.0011       10
                  BM25          0.755        0.762               0.474              0.858                 0.375             0.254            5.03         0.0053             0.0005       10
           Multi-Query          0.900        0.898               0.382              0.863                 0.513             0.458            6.19         0.0144             0.0014       10
       Parent Document          0.777        0.827               0.494              0.939                 0.449             0.447            3.87         0.0063             0.0006      

## 📊 **Recommendation for Loan Complaint Data:**

**For this loan complaint dataset, the Parent Document retriever emerges as the optimal choice**, delivering the best overall balance of quality, speed, and cost-effectiveness. Here's why:

**Parent Document achieves the highest efficiency** with a Score/Speed ratio of 0.196, combining strong quality (0.759) with excellent response times (3.87s) and reasonable cost ($0.0006 per query). Its standout feature is exceptional response relevancy (0.939) - the highest among all retrievers - which is crucial for loan complaint data where users need precise, contextually appropriate answers about their specific issues.

**While Ensemble delivers the highest quality score (0.846)** with perfect context recall, its 10-second response time makes it impractical for customer service applications where users expect quick responses. **Multi-Query offers similar quality (0.761)** but at 60% higher latency and 2.3x the cost, making it inefficient for high-volume deployment.
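The composite figures quoted for Parent Document and Multi-Query can be reproduced from the results table above. The sketch below hard-codes those table values and repeats the arithmetic from the analysis cell: the quality score is the mean of faithfulness, factual correctness, response relevancy, and context recall, and Score/Speed divides that mean by the average latency.

```python
# Values copied from the printed results table
table = {
    "Parent Document": {
        "faithfulness": 0.827,
        "factual_correctness": 0.494,
        "answer_relevancy": 0.939,
        "context_recall": 0.777,
        "avg_latency": 3.87,
    },
    "Multi-Query": {
        "faithfulness": 0.898,
        "factual_correctness": 0.382,
        "answer_relevancy": 0.863,
        "context_recall": 0.900,
        "avg_latency": 6.19,
    },
}

quality_keys = ["faithfulness", "factual_correctness", "answer_relevancy", "context_recall"]

composite = {}
for name, row in table.items():
    # Mean of the four quality metrics, then quality per second of latency
    quality = sum(row[key] for key in quality_keys) / len(quality_keys)
    composite[name] = {
        "quality": round(quality, 3),
        "score_per_second": round(quality / row["avg_latency"], 3),
    }

for name, values in composite.items():
    print(name, values)
```

Because the four metrics are weighted equally, a retriever with one weak metric (such as Multi-Query's factual correctness) can still match another on the composite score.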

**For loan complaint scenarios**, Parent Document's "small-to-big" approach is particularly effective: it searches on focused complaint excerpts but returns full complaint contexts, ensuring users get comprehensive information about similar loan issues while maintaining fast retrieval. The combination of sub-4-second response times, factual correctness that matches or exceeds the other retrievers in the table above (0.494), and high response relevancy makes it well suited to production deployment where customer satisfaction depends on both speed and accuracy.

**Bottom line**: Parent Document retrieval provides the optimal solution for loan complaint data, maximizing both user experience and operational efficiency.