# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = ""

In [None]:
os.environ["COHERE_API_KEY"] = ""

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the strutured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [6]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path=f"./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [7]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [8]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply look at each review as a document, and use cosine-similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [9]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [10]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [11]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [12]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [13]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the complaints provided, appears to be dealing with the mismanagement and inaccuracies related to loan balances, interest calculations, and the handling of payments by servicers. Many complaints involve errors in loan balances, misapplied payments, incorrect reporting of account status, and problems with loan transfers or communications from servicers. Additionally, issues with unfair or predatory repayment practices and difficulty in obtaining accurate information are prevalent.\n\nIn summary, the most common issue is **mismanagement of student loan accounts, including errors in balances, interest, and payment application, along with poor communication from loan servicers.**'

In [14]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

"Based on the provided complaints, yes, there are several complaints indicating that issues were not handled in a timely manner. Specifically, two complaints directly mention a failure to respond within the expected time frame:\n\n1. Complaint from 03/28/25 (Complaint ID: 12709087) about a loan application not being processed, where the response was marked as 'No' for timely response.\n2. Complaint from 04/05/25 (Complaint ID: 12832400) regarding a complaint to the CFPB about Aidvantage, which also was marked as 'Yes' for timely response, but the ongoing issue suggests delayed actions.\n\nAdditionally, multiple complaints highlight long delays, lack of responses, or failure to resolve issues within reasonable or legal timeframes, such as the complaint from 04/14/25 (Complaint ID: 12973003) about auto-pay and account issues that persisted beyond promised timelines.\n\nTherefore, yes, several complaints not only experienced delays but in at least one case explicitly were not handled in a

In [15]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including:\n\n1. **Accumulation of interest and inability to increase payments**: Borrowers often found that lowering their payments led to interest accumulating faster than they could pay it off, making the debt grow over time.\n\n2. **Financial hardship and job-related issues**: Many borrowers assumed they would secure jobs supporting loan repayment but faced stagnating wages, unemployment, or employment in low-paying or unstable jobs, making it hard to meet repayment obligations.\n\n3. **Mismanagement and lack of clear information from servicers**: Some borrowers experienced poor communication, unexpected transfer of loans without notice, and unclear interest calculations, leading to confusion and inability to keep up with payments.\n\n4. **Inability to qualify for forgiveness programs**: Many did not qualify for programs like PSLF or TLF, which could reduce their debt, leaving them with unmanageable payments.\n\n5. **Loan 

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set-up! Let's see it happen down below!


In [16]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data, )

We'll construct the same chain - only changing the retriever.

In [17]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [18]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with loans appears to be related to dealing with lenders or servicers, particularly issues with miscommunication, incorrect information, or problems with repayment and fees. Specifically, recurring issues include:\n\n- Disputes over fees charged and incorrect explanations about loan balances or terms\n- Difficulty applying payments correctly or paying down principal\n- Receiving inaccurate or bad information about loans or loan history\n- Problems with loan repayment terms, interest calculations, or loan duration\n\nTherefore, the most common issue is problems related to dealing with lenders or servicers, especially involving miscommunication, incorrect information, and repayment difficulties.'

In [19]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, all of them include responses from the companies that were marked as "Timely response?": "Yes." Therefore, it appears that all complaints were handled in a timely manner.'

In [20]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including issues with their payment plans, miscommunication or lack of communication from the loan servicers, being unaware of account status changes, or experiencing mistakes and complications caused by the lenders or servicers. For instance, some borrowers were not properly informed about their loan transfers, repayment schedule changes, or issues with autopay enrollment, leading to missed payments and negative credit impacts. Others faced problems with billing errors, incorrect account information, or prolonged delays in response from the loan servicing companies, which hindered their ability to make or track payments effectively.'

It's not clear that this is better or worse, if only we had a way to test this (SPOILERS: We do, the second half of the notebook will cover this)

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

### Answer:
* example query: ""Philippines Income Tax Act 1997 Section 45"
* This query is looking for a very specific law section. The exact wording and number (“Section 45”) matter a lot.
* Legal references often have unique identifiers (numbers, subsections, letter codes) that are not semantically similar to anything else. Embedding models can miss their exactness because they compress meaning into a vector space where “Section 45” and “Section 44” might be close. BM25 treats them as totally different terms.

</div>

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [21]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [22]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [23]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with loans appears to be problems related to dealing with lenders or servicers, including receiving incorrect or bad information about loans, errors in loan balances, misapplied payments, wrongful denials of payment plans, and mishandling of loan data. Many complaints involve confusion over balances, unapproved transfers of loans, lack of communication, and disputes over accuracy and privacy violations.'

In [24]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided context, there is an indication that some complaints did not get handled in a timely manner. For example, the complaint from the individual regarding their student loans and account review has been open since back when the complaint was submitted (specific dates are redacted with XXXX), and it mentions that it has been nearly 18 months with no resolution. Additionally, the complaint from the person with payments totaling $3100 that were not applied to their account also references a timeline of over 2-3 weeks and ongoing issues. \n\nTherefore, yes, some complaints did not get handled in a timely manner.'

In [25]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to a combination of factors including lack of understanding about repayment responsibilities, poor communication from lenders and servicers, and financial hardships. Specifically:\n\n- Many borrowers were unaware that they needed to repay their student loans, especially if they were not adequately informed by financial aid officers.\n- There were issues with loan servicers failing to notify borrowers of payment due dates, changes in loan ownership, or terms of repayment.\n- Borrowers often faced difficulties accessing accurate or consistent account information, leading to confusion about balances and interest.\n- Restricted payment options like forbearance and deferment resulted in continued interest accumulation, making it harder to pay off the loans and sometimes increasing total debt.\n- Some borrowers experienced increased balances despite payments due to accumulation of interest and a lack of clear, transparent breakdowns of the

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had....more than one query!

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how is it to set-up? Not bad! Let's see it down below!



In [26]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [27]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [28]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided data, the most common issue with loans appears to be problems related to "Dealing with your lender or servicer," which includes sub-issues such as errors in loan balances, misapplied payments, wrongful denials of payment plans, and bad information about loans. A recurring theme is mismanagement by loan servicers, inaccurate reporting, and difficulties in resolving disputes or clarifying account details.\n\nTherefore, the most common issue with loans from the given data is:\n\n**Problems with loan servicers, including mismanagement, incorrect information, and difficulties in communication or resolution.**'

In [29]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, some complaints indicate that they were not handled in a timely manner. Specifically:\n\n- Several complaints note delayed responses or lack of response from the companies, despite multiple follow-ups. For example:\n  - Complaint from 04/18/25 (Complaint ID: 13062402) against Nelnet states that inaccuracies in reporting were not corrected within required timeframes, and disputes were not resolved promptly.\n  - Complaint from 04/21/25 (Complaint ID: 13091395) against Maximus Federal Services reports that the company did not respond within the required 15 days.\n  - Complaint from 04/27/25 (Complaint ID: 13205525) against Nelnet mentions that over 30 days had passed without response to a dispute.\n  - Complaint from 05/08/25 (Complaint ID: 13410623) against Maximus Federal Services details that they waited over 18 months for resolution, and the complaint was inadequately addressed.\n\nThese examples suggest that, in multiple instances, complaints 

In [30]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to factors such as mismanagement by loan servicers, inaccurate or delayed communication about payments, and the accumulation of interest during periods of forbearance or deferment. Many borrowers experienced issues like being steered into long-term forbearances with little notice about the consequences, or being unable to apply additional payments toward the principal, which prolonged their debt. Others faced errors in reporting their payment history or account status, which adversely affected their credit scores. Additionally, some borrowers were misled about repayment options, interest accrual, and loan forgiveness programs, making it difficult for them to manage and pay off their loans effectively. These systemic issues, combined with financial hardships and lack of transparent information, contributed to the failure of many individuals to pay back their loans.'

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

</div>

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### Answer:

- Generating multiple reformulations of a user query can improve recall because it helps the retriever find more relevant documents that might use different wording or phrasing than the original query.
- This improves recall because it reduces the risk of missing relevant documents that use different words, structures, or terms from the original query.

</div>

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [31]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [32]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [33]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [34]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [35]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [36]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be related to incorrect or misleading information in the credit reporting process, such as errors in loan balances, account status, or interest rates, as well as issues with loan servicer misconduct, including misapplied payments, wrongful denials of payment plans, and disputes over debt legitimacy.'

In [37]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, all four complaints indicate that they were not handled in a timely manner. Specifically, the first two complaints involving Mohela\'s student loan servicing explicitly mention that responses or communications took longer than expected, and the complaints were marked as "No" for "Timely response." \n\nTherefore, yes, some complaints did not get handled in a timely manner.'

In [38]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans due to several reasons highlighted in the complaints. These include:\n\n1. Lack of clear communication and proper notification from loan servicers about when payments were due, such as failing to inform borrowers about payment start dates or changes in loan ownership.\n2. Challenges in managing repayment during financial hardship, such as experiencing severe financial difficulties, unemployment, or health issues, which made it difficult to make consistent payments.\n3. Mismanagement or failures by institutions, including not reevaluating payments based on the borrower's circumstances, or reporting delinquency while issues were unresolved.\n4. Misrepresentation or lack of transparency by educational institutions about the value of the degrees, career prospects, and the debt they encouraged students to take on, leading to financial hardship after graduation.\n5. Issues related to loan servicers' practices, such as improper reporting of late payments

Overall, the performance *seems* largely the same. We can leverage a tool like [Ragas]() to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes 2, or more, retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [39]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [40]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [41]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints, the most common issues with student loans appear to be:\n\n- Dealing with lenders or servicers issues, including errors in loan balances, misapplied payments, wrongful denials of payment plans, and trouble with how payments are handled.\n- Problems arising from loan transfer and lack of clear communication about loan terms, balances, interest calculations, or transfers without proper notification.\n- Incorrect or inconsistent information reported on credit reports and mismanagement leading to credit score drops.\n- Lack of proper documentation (e.g., missing signed promissory notes) and failure of servicers to provide required legal loan documents.\n- Issues related to loan repayment difficulties due to interest accumulation, improper loan consolidation handling, and problems with loan discharge or forgiveness programs.\n- Unauthorized or inaccurate reporting and breach of privacy (FERPA violations).\n\nIn summary, the most common issue seems to be **

In [42]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints data, several complaints indicate that issues were not handled in a timely manner. Specifically:\n\n- Complaint ID 12973003 from EdFinancial Services (NJ) submitted on 04/14/25 reports that the company responded "Yes" to timely response, but the complaint states the issue had been ongoing for 2-3 weeks with no resolution, and the customer was still experiencing problems after waiting over 2-3 weeks, suggesting a delay.\n\n- Complaint ID 12654977 from MOHELA (MD) submitted on 03/25/25 indicates that Mohela\'s response was "No" to timely response, implying they did not handle the issue promptly. The complaint mentions that payments were not being processed correctly for over a month, and multiple complaints and escalations to no avail.\n\n- Complaint ID 13062402 from Nelnet (PA) received on 04/18/25 shows the response was "Yes" to timely, but the complaint indicates the error persisted for over 10 days beyond the promised 48 hours, and the problem was no

In [43]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to a combination of factors such as unmanageable interest accumulation, lack of clear communication and proper notification from loan servicers about payment obligations, misleading steering into forbearance or deferment options that led to increased debt, systemic mismanagement, and insufficient information about repayment options like income-driven plans or loan forgiveness programs. Additionally, some struggled because of financial hardships, employment issues, or misunderstandings about the status of their loans, including transfers between loan servicers and errors in reporting, all contributing to their inability to make timely payments.'

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences based on their semantic similarity based on a number of [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example which will:

Calculate all distances between sentences, and then break apart sequences of setences that exceed a given percentile among all distances.

In [44]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [45]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [46]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [47]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [48]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [49]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the complaints provided, appears to be problems related to loan servicing and handling. Specific issues include incorrect or delayed information about loan balances, repayment plans, or payment processing; improper reporting or errors on credit reports; and issues with loan forgiveness or discharge documentation. Many complaints also highlight a lack of communication, transparency, and accountability from loan servicers like Nelnet, EdFinancial, and Aidvantage.'

In [50]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, it appears that several complaints were marked as handled in a timely manner, with responses noted as "Yes" for timely response and "Closed with explanation" for company responses. However, there is at least one complaint (Complaint ID: 13331376) where the consumer reported that despite the company response being "Closed with explanation," the issues remained unresolved, and the consumer continued to face problems such as transfer of accounts, errors, and alleged misconduct.\n\nIn particular, the complaint regarding the transfer of a student loan to Nelnet (ID: 13331376) involved ongoing issues and allegations of misconduct, despite the company\'s response indicating the complaint was closed. This suggests that some complaints may not have been fully handled to the consumer\'s satisfaction or in a manner that resolved the underlying issues.\n\nTherefore, while the records indicate many complaints were marked as handled "timely," there are indications 

In [51]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'Based on the provided information, people failed to pay back their loans for various reasons, including:\n\n1. Lack of clear communication and transparency from lenders or servicers, leading to confusion about loan status and terms.\n2. Difficulties in verifying or providing required documentation for loan forgiveness or discharge, resulting in delays and stalls.\n3. Disputes over the legitimacy or accuracy of reported loan information, often due to errors or alleged improper reporting.\n4. Technical issues or errors in payment processing, such as payments not being received or processed correctly.\n5. Concerns over improper or illegal practices by loan servicers, including unauthorized access to personal data, or reporting debts that are now legally void.\n6. Financial hardships or changes in personal circumstances, although not explicitly detailed in the complaints, can also be a factor.\n\nIn summary, failures to pay back loans often stem from administrative issues, disputes over l

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

</div>

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### Answer:
When sentences are short and highly repetitive, like in FAQs, semantic chunking often over-splits the text into tiny pieces, creating many near-duplicate chunks that can clutter the index and confuse retrieval. This can also separate questions from their answers, which reduces the usefulness of the results and may cause the retriever to return only partial context. To address this, I would adjust the algorithm to chunk by full FAQ pairs (question and answer together) instead of individual sentences, merge related FAQs in the same category until they reach a minimum token size (around 300–500 tokens), and keep only a small overlap of 20–50 tokens between merged groups to preserve context without creating redundancy. I’d also normalize and deduplicate entries by hashing the question text, filter near-duplicates during retrieval using maximal marginal relevance, and combine BM25 with embeddings for hybrid retrieval. Finally, I would include metadata like category, FAQ ID, and section headers in each chunk so that retrieval remains precise and contextually relevant.

</div>

# 🤝 Breakout Room Part #2

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other. 
You can use the loans or bills dataset.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparision between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

</div>

In [None]:
import os
import time
import statistics
import unicodedata
from copy import deepcopy
from typing import List, Tuple, Dict, Callable

# config
FOLDER = "bills"                 # change if needed
EVAL_K = 8                       # main top-k for recall-oriented run
EVAL_K_PREC = 2                  # low top-k secondary run to improve precision
DO_SMALL_CHUNK_COMPARISON = True # set to False if you do not want the small-chunk run

# -----------------------
# load and chunk documents
# -----------------------

from langchain_community.document_loaders import PyMuPDFLoader, PyPDFLoader, PDFPlumberLoader
from langchain.docstore.document import Document

pdf_files = sorted([f for f in os.listdir(FOLDER) if f.lower().endswith(".pdf")])

def load_pdf_any(path: str) -> List[Document]:
    # try pymupdf first
    try:
        return PyMuPDFLoader(path).load()
    except Exception:
        pass
    # then pypdf
    try:
        return PyPDFLoader(path).load()
    except Exception:
        pass
    # finally pdfplumber
    try:
        return PDFPlumberLoader(path).load()
    except Exception:
        return []

docs = []
skipped = []
for fname in pdf_files:
    path = os.path.join(FOLDER, fname)
    got = load_pdf_any(path)
    if got:
        docs.extend(got)
    else:
        skipped.append(fname)

print(f"loaded pages: {len(docs)} | files attempted: {len(pdf_files)} | skipped: {len(skipped)}")
if skipped:
    print("skipped files:", skipped[:10])

from langchain.text_splitter import RecursiveCharacterTextSplitter

# use a legal-text-friendly chunk size
splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=80)
chunks = splitter.split_documents(docs)

# add stable ids
for i, d in enumerate(chunks):
    base = d.metadata.get("source", f"src_{i}").split("/")[-1].replace(" ", "_")
    d.metadata["id"] = f"{base}#chunk_{i}"

print("chunks:", len(chunks))
print("sample metadata:", chunks[0].metadata if chunks else "no chunks")

# -----------------------
# build retrievers: bm25, dense/faiss, hybrid
# -----------------------

from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

bm25 = BM25Retriever.from_documents(chunks)
bm25.k = EVAL_K

hf_emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
faiss_db = FAISS.from_documents(chunks, hf_emb)

def dense_retrieve(q: str, k: int = EVAL_K):
    return faiss_db.similarity_search(q, k=k)

def hybrid_retrieve(q: str, k: int = EVAL_K):
    # simple, stable merge: prefer dense rank, then fill with bm25 uniques
    bm_docs = bm25.get_relevant_documents(q)
    de_docs = dense_retrieve(q, k=max(k, len(bm_docs)))
    seen = set()
    merged = []
    for d in de_docs:
        key = (d.metadata.get("source"), d.metadata.get("id"))
        if key not in seen:
            merged.append(d)
            seen.add(key)
        if len(merged) == k:
            return merged
    for d in bm_docs:
        key = (d.metadata.get("source"), d.metadata.get("id"))
        if key not in seen:
            merged.append(d)
            seen.add(key)
        if len(merged) == k:
            break
    return merged[:k]

def timed(fn, *args, **kwargs):
    t0 = time.time()
    out = fn(*args, **kwargs)
    t1 = time.time()
    return out, (t1 - t0) * 1000.0

# -----------------------
# golden dataset helpers
# -----------------------

import regex as re

def normalize_txt(s: str) -> str:
    # normalize unicode, collapse whitespace, remove hyphen linebreaks
    s = unicodedata.normalize("NFKC", s)
    s = s.replace("-\n", "")
    s = s.replace("\n", " ")
    s = re.sub(r"\s+", " ", s).strip()
    return s

id_to_text = {d.metadata["id"]: d.page_content for d in chunks}
id_to_normtext = {cid: normalize_txt(txt) for cid, txt in id_to_text.items()}
id_to_src = {d.metadata["id"]: d.metadata.get("source", "") for d in chunks}

def find_chunks_globally_by_regex(patterns, max_hits=3, flags=re.IGNORECASE, filename_contains=None):
    # filename_contains can be a string or list to filter by source filename
    if isinstance(patterns, str):
        patterns = [patterns]
    regs = [re.compile(p, flags) for p in patterns]
    if filename_contains and isinstance(filename_contains, str):
        filename_contains = [filename_contains]
    hits = []
    for cid, txt in id_to_normtext.items():
        src = id_to_src[cid]
        if filename_contains:
            if not any(k.lower() in src.lower() for k in filename_contains):
                continue
        for rg in regs:
            if rg.search(txt):
                hits.append(cid)
                break
        if len(hits) >= max_hits:
            break
    return hits

golden: List[Dict] = []

def add_gold(question: str, answer: str, ground_truth_doc_ids: List[str]):
    missing = [i for i in ground_truth_doc_ids if i not in id_to_text]
    if missing:
        raise ValueError(f"these ids are not in chunks: {missing}")
    entry = {
        "question": question.strip(),
        "answer": answer.strip(),
        "ground_truth_doc_ids": ground_truth_doc_ids
    }
    golden.append(entry)
    return entry

def preview_gold(n: int = 5):
    return deepcopy(golden[:n])

# -----------------------
# build a clean golden set for ai bills only
# -----------------------

# file name anchors for precision
FN_SBN25 = "20250725 SBN 25 AI Regulation Act.pdf"
FN_SB29  = "SB 29 - AI Regulation Act - Sen Pia.pdf"
FN_HB2186 = "HB02186.pdf"
FN_HB7913 = "HB07913.pdf"
FN_HB10944 = "HB10944.pdf"  # present in folder if provided

# regex patterns for common sections
pat_short_title = [r"\bshort\s+title\b", r"\bthis\s+act\s+shall\s+be\s+known\s+as\b", r"\bshall\s+be\s+known\s+as\b"]
pat_policy = [r"\bdeclaration\s+of\s+policy\b", r"\bit\s+is\s+the\s+policy\s+of\s+the\s+state\b"]
pat_objectives = [r"\bobjectives?\b", r"\bgoals?\b"]
pat_coverage = [r"\bcoverage\b", r"\bscope\b"]
pat_definitions = [r"\bdefinition[s]?\s+of\s+terms\b", r"\bfor\s+the\s+purposes?\s+of\s+this\s+act\b"]
pat_effectivity = [r"\bshall\s+take\s+effect\b", r"\beffectivity\b"]
pat_separability = [r"\bseparability\s+clause\b"]
pat_repealing = [r"\brepealing\s+clause\b"]
pat_appropriations = [r"\bappropriation[s]?\b"]
# bill-specific patterns
pat_naic = [r"\bnational\s+ai\s+commission\b", r"\bnaic\b"]
pat_aib = [r"\bartificial\s+intelligence\s+board\b", r"\baib\b"]
pat_penalties = [r"\bpenalt(y|ies)\b", r"\badministrative\s+sanctions?\b", r"\bviolations?\b"]

def first_or_none(lst): 
    return lst[0:1]

# sbn 25
s_sbn25_title = first_or_none(find_chunks_globally_by_regex(pat_short_title, filename_contains=FN_SBN25))
s_sbn25_policy = first_or_none(find_chunks_globally_by_regex(pat_policy, filename_contains=FN_SBN25))
s_sbn25_obj = first_or_none(find_chunks_globally_by_regex(pat_objectives, filename_contains=FN_SBN25))
s_sbn25_cov = first_or_none(find_chunks_globally_by_regex(pat_coverage, filename_contains=FN_SBN25))
s_sbn25_defs = first_or_none(find_chunks_globally_by_regex(pat_definitions, filename_contains=FN_SBN25))
s_sbn25_naic = first_or_none(find_chunks_globally_by_regex(pat_naic, filename_contains=FN_SBN25))
s_sbn25_pen = first_or_none(find_chunks_globally_by_regex(pat_penalties, filename_contains=FN_SBN25))
s_sbn25_eff = first_or_none(find_chunks_globally_by_regex(pat_effectivity, filename_contains=FN_SBN25))
s_sbn25_sep = first_or_none(find_chunks_globally_by_regex(pat_separability, filename_contains=FN_SBN25))
s_sbn25_rep = first_or_none(find_chunks_globally_by_regex(pat_repealing, filename_contains=FN_SBN25))

# sb 29
s_sb29_title = first_or_none(find_chunks_globally_by_regex(pat_short_title, filename_contains=FN_SB29))
s_sb29_policy = first_or_none(find_chunks_globally_by_regex(pat_policy, filename_contains=FN_SB29))
s_sb29_obj = first_or_none(find_chunks_globally_by_regex(pat_objectives, filename_contains=FN_SB29))
s_sb29_cov = first_or_none(find_chunks_globally_by_regex(pat_coverage, filename_contains=FN_SB29))
s_sb29_defs = first_or_none(find_chunks_globally_by_regex(pat_definitions, filename_contains=FN_SB29))
s_sb29_naic = first_or_none(find_chunks_globally_by_regex(pat_naic, filename_contains=FN_SB29))
s_sb29_pen = first_or_none(find_chunks_globally_by_regex(pat_penalties, filename_contains=FN_SB29))
s_sb29_eff = first_or_none(find_chunks_globally_by_regex(pat_effectivity, filename_contains=FN_SB29))
s_sb29_sep = first_or_none(find_chunks_globally_by_regex(pat_separability, filename_contains=FN_SB29))
s_sb29_rep = first_or_none(find_chunks_globally_by_regex(pat_repealing, filename_contains=FN_SB29))

# hb 2186 (ai education)
h_2186_title = first_or_none(find_chunks_globally_by_regex(pat_short_title, filename_contains=FN_HB2186))
h_2186_policy = first_or_none(find_chunks_globally_by_regex(pat_policy, filename_contains=FN_HB2186))
h_2186_obj = first_or_none(find_chunks_globally_by_regex([r"\bstatement\s+of\s+objectives\b"] + pat_objectives, filename_contains=FN_HB2186))
h_2186_scope = first_or_none(find_chunks_globally_by_regex([r"\bscope\s+and\s+coverage\b"] + pat_coverage, filename_contains=FN_HB2186))
h_2186_curriculum = first_or_none(find_chunks_globally_by_regex([r"\bcurriculum\s+development\b"], filename_contains=FN_HB2186))
h_2186_teacher = first_or_none(find_chunks_globally_by_regex([r"\bteacher\s+training\b", r"\bcapacity[-\s]?building\b"], filename_contains=FN_HB2186))
h_2186_eff = first_or_none(find_chunks_globally_by_regex(pat_effectivity, filename_contains=FN_HB2186))

# hb 7913 (council + ai board)
h_7913_title = first_or_none(find_chunks_globally_by_regex(pat_short_title, filename_contains=FN_HB7913))
h_7913_policy = first_or_none(find_chunks_globally_by_regex(pat_policy, filename_contains=FN_HB7913))
h_7913_defs = first_or_none(find_chunks_globally_by_regex(pat_definitions, filename_contains=FN_HB7913))
h_7913_council = first_or_none(find_chunks_globally_by_regex([r"\bphilippine\s+council\s+on\s+artificial\s+intelligence\b"], filename_contains=FN_HB7913))
h_7913_aib = first_or_none(find_chunks_globally_by_regex(pat_aib, filename_contains=FN_HB7913))
h_7913_wholegov = first_or_none(find_chunks_globally_by_regex([r"\bwhole\s+of\s+government\s+approach\b"], filename_contains=FN_HB7913))
h_7913_database = first_or_none(find_chunks_globally_by_regex([r"\bcentral\s+database\b"], filename_contains=FN_HB7913))
h_7913_pen = first_or_none(find_chunks_globally_by_regex(pat_penalties, filename_contains=FN_HB7913))
h_7913_eff = first_or_none(find_chunks_globally_by_regex(pat_effectivity, filename_contains=FN_HB7913))

# hb 10944 optional, only if present in folder
h_10944_title = first_or_none(find_chunks_globally_by_regex(pat_short_title, filename_contains=FN_HB10944)) if any(FN_HB10944 in s for s in id_to_src.values()) else []
h_10944_policy = first_or_none(find_chunks_globally_by_regex(pat_policy, filename_contains=FN_HB10944)) if h_10944_title else []

def add_if_found(q: str, cids: List[str]):
    # use the located ground truth as the answer text for ragas stability
    if not cids:
        print(f"note: no ground truth found for '{q}'")
        return
    gt_texts = [id_to_text[cids[0]]]
    add_gold(q, gt_texts[0][:3000], cids)  # keep answer non-empty but trimmed

# sbn 25 questions
add_if_found("what is the short title of sbn 25?", s_sbn25_title)
add_if_found("what is the declaration of policy under sbn 25?", s_sbn25_policy)
add_if_found("what are the objectives under sbn 25?", s_sbn25_obj)
add_if_found("what is the coverage of sbn 25?", s_sbn25_cov)
add_if_found("what are the definitions under sbn 25?", s_sbn25_defs)
add_if_found("what body does sbn 25 create to oversee ai?", s_sbn25_naic)
add_if_found("are there penalties or administrative sanctions in sbn 25?", s_sbn25_pen)
add_if_found("what is the effectivity clause of sbn 25?", s_sbn25_eff)
add_if_found("what is the separability clause of sbn 25?", s_sbn25_sep)
add_if_found("what is the repealing clause of sbn 25?", s_sbn25_rep)

# sb 29 questions
add_if_found("what is the short title of sb 29?", s_sb29_title)
add_if_found("what is the declaration of policy under sb 29?", s_sb29_policy)
add_if_found("what are the objectives under sb 29?", s_sb29_obj)
add_if_found("what is the coverage of sb 29?", s_sb29_cov)
add_if_found("what are the definitions under sb 29?", s_sb29_defs)
add_if_found("what body does sb 29 create to regulate ai?", s_sb29_naic)
add_if_found("are there penalties or administrative sanctions in sb 29?", s_sb29_pen)
add_if_found("what is the effectivity clause of sb 29?", s_sb29_eff)
add_if_found("what is the separability clause of sb 29?", s_sb29_sep)
add_if_found("what is the repealing clause of sb 29?", s_sb29_rep)

# hb 2186 questions
add_if_found("what is the short title of hb 2186?", h_2186_title)
add_if_found("what is the declaration of policy under hb 2186?", h_2186_policy)
add_if_found("what are the objectives of hb 2186?", h_2186_obj)
add_if_found("what is the scope and coverage of hb 2186?", h_2186_scope)
add_if_found("what is covered in curriculum development under hb 2186?", h_2186_curriculum)
add_if_found("what are the teacher training provisions under hb 2186?", h_2186_teacher)
add_if_found("what is the effectivity clause of hb 2186?", h_2186_eff)

# hb 7913 questions
add_if_found("what is the short title of hb 7913?", h_7913_title)
add_if_found("what is the declaration of policy under hb 7913?", h_7913_policy)
add_if_found("what are the definitions under hb 7913?", h_7913_defs)
add_if_found("what is the philippine council on ai in hb 7913?", h_7913_council)
add_if_found("what is the artificial intelligence board in hb 7913?", h_7913_aib)
add_if_found("what is the whole of government approach in hb 7913?", h_7913_wholegov)
add_if_found("what is the central database requirement in hb 7913?", h_7913_database)
add_if_found("are there penalties or violations listed in hb 7913?", h_7913_pen)
add_if_found("what is the effectivity clause of hb 7913?", h_7913_eff)

# hb 10944 optional samples
if h_10944_title:
    add_if_found("what is the short title of hb 10944?", h_10944_title)
    add_if_found("what is the declaration of policy under hb 10944?", h_10944_policy)

print("golden size now:", len(golden))
print("golden preview:", preview_gold(5))

# -----------------------
# ragas metrics shim and dataset builder
# -----------------------

METRICS = None
NAMES = None

try:
    # older function api
    from ragas.metrics import context_precision, context_recall, context_relevancy
    METRICS = [context_precision, context_recall, context_relevancy]
    NAMES = ["context_precision", "context_recall", "context_relevancy"]
except Exception:
    try:
        # newer class api
        from ragas.metrics import ContextPrecision, ContextRecall, ContextRelevancy
        METRICS = [ContextPrecision(), ContextRecall(), ContextRelevancy()]
        NAMES = ["context_precision", "context_recall", "context_relevancy"]
    except Exception:
        from ragas.metrics import context_precision, context_recall
        METRICS = [context_precision, context_recall]
        NAMES = ["context_precision", "context_recall"]

from ragas import evaluate
from datasets import Dataset
import pandas as pd

def build_eval_dataset(
    golden_items: List[Dict],
    k: int,
    retrieve_fn: Callable[[str, int], List[Document]],
):
    # ragas expects: question, contexts (list[str]), ground_truths (list[str]), answer, reference
    rows = []
    for row in golden_items:
        q = row["question"]
        hits = retrieve_fn(q, k=k) or []
        contexts = [d.page_content for d in hits]

        # build ground truth texts from the saved ids
        gts = [id_to_text[i] for i in row["ground_truth_doc_ids"] if i in id_to_text]

        # use gt as both answer and reference to keep api differences happy
        answer = gts[0] if gts else row.get("answer", "")
        reference = "\n\n".join(gts) if gts else answer

        rows.append({
            "question": q,
            "contexts": contexts,
            "ground_truths": gts if gts else [reference] if reference else [],
            "ground_truth": gts[0] if gts else reference,
            "answer": answer,
            "reference": reference,
        })
    return Dataset.from_list(rows)

def eval_retriever(golden_items, name: str, retrieve_fn, k: int = EVAL_K, n_latency: int = 10):
    ds = build_eval_dataset(golden_items, k=k, retrieve_fn=retrieve_fn)
    report = evaluate(ds, metrics=METRICS)

    # robust getter across ragas versions
    def get_metric(rep, key):
        try:
            return float(rep[key])
        except Exception:
            pass
        try:
            return float(getattr(rep, "scores", {}).get(key))
        except Exception:
            pass
        try:
            df = rep.to_pandas()
            if key in df.columns:
                return float(df[key].iloc[0])
        except Exception:
            pass
        return None

    vals = {k: get_metric(report, k) for k in NAMES}

    # latency probe
    times = []
    for row in golden_items[:max(1, min(n_latency, len(golden_items)))]:
        t0 = time.time()
        _ = retrieve_fn(row["question"], k)
        t1 = time.time()
        times.append((t1 - t0) * 1000.0)
    avg_ms = statistics.mean(times) if times else None

    out = {"retriever": name, "avg_latency_ms": round(avg_ms, 1) if avg_ms else None}
    out.update({k: (round(v, 3) if v is not None else None) for k, v in vals.items()})
    return out

# -----------------------
# evaluate retrievers (recall-oriented run at k=EVAL_K)
# -----------------------

def bm25_retrieve(q: str, k: int = EVAL_K):
    return bm25.get_relevant_documents(q)[:k]

results = []
for name, fn in [
    ("bm25", bm25_retrieve),
    ("dense", dense_retrieve),
    ("hybrid", hybrid_retrieve),
]:
    results.append(
        eval_retriever(
            golden_items=golden,
            name=name,
            retrieve_fn=fn,
            k=EVAL_K,
            n_latency=min(10, len(golden)),
        )
    )

# optional small-chunk comparison
if DO_SMALL_CHUNK_COMPARISON:
    splitter_small = RecursiveCharacterTextSplitter(chunk_size=350, chunk_overlap=60)
    chunks_small = splitter_small.split_documents(docs)

    bm25_small = BM25Retriever.from_documents(chunks_small); bm25_small.k = EVAL_K
    faiss_db_small = FAISS.from_documents(chunks_small, hf_emb)

    def dense_retrieve_small(q: str, k: int = EVAL_K):
        return faiss_db_small.similarity_search(q, k=k)

    def hybrid_retrieve_small(q: str, k: int = EVAL_K):
        de_docs = dense_retrieve_small(q, k=max(k, 16))
        bm_docs = bm25_small.get_relevant_documents(q)
        seen = set(); merged = []
        for d in de_docs:
            key = (d.metadata.get("source"), d.metadata.get("id"))
            if key not in seen:
                merged.append(d); seen.add(key)
            if len(merged) == k: return merged
        for d in bm_docs:
            key = (d.metadata.get("source"), d.metadata.get("id"))
            if key not in seen:
                merged.append(d); seen.add(key)
            if len(merged) == k: break
        return merged[:k]

    for name, fn in [
        ("bm25_smallchunk", lambda q, k=EVAL_K: bm25_small.get_relevant_documents(q)[:k]),
        ("dense_smallchunk", dense_retrieve_small),
        ("hybrid_smallchunk", hybrid_retrieve_small),
    ]:
        results.append(
            eval_retriever(
                golden_items=golden,
                name=name,
                retrieve_fn=fn,
                k=EVAL_K,
                n_latency=min(10, len(golden)),
            )
        )

# -----------------------
# precision-oriented second run at k=EVAL_K_PREC
# -----------------------

results_lowk = []
for name, fn in [
    ("bm25_k2", lambda q, k=EVAL_K_PREC: bm25.get_relevant_documents(q)[:k]),
    ("dense_k2", lambda q, k=EVAL_K_PREC: dense_retrieve(q, k=EVAL_K_PREC)),
    ("hybrid_k2", lambda q, k=EVAL_K_PREC: hybrid_retrieve(q, k=EVAL_K_PREC)),
]:
    results_lowk.append(
        eval_retriever(
            golden_items=golden,
            name=name,
            retrieve_fn=fn,
            k=EVAL_K_PREC,
            n_latency=min(10, len(golden)),
        )
    )

# -----------------------
# build results tables and summaries
# -----------------------

from pandas import DataFrame

cols = ["retriever"] + NAMES + ["avg_latency_ms"]

df = DataFrame(results)
df = df[[c for c in cols if c in df.columns]]
sort_key = "context_recall" if "context_recall" in df.columns else (NAMES[0] if len(NAMES) else None)
if sort_key:
    df = df.sort_values(by=sort_key, ascending=False)
df = df.reset_index(drop=True)

df_lowk = DataFrame(results_lowk)
df_lowk = df_lowk[[c for c in cols if c in df_lowk.columns]]
if sort_key and sort_key in df_lowk.columns:
    df_lowk = df_lowk.sort_values(by=sort_key, ascending=False).reset_index(drop=True)

print("\n=== retriever comparison (k = EVAL_K) ===")
print(df.to_string(index=False))

print("\n=== retriever comparison (k = EVAL_K_PREC) ===")
print(df_lowk.to_string(index=False))

def quick_summary(df_in: DataFrame):
    for _, row in df_in.iterrows():
        name = row["retriever"]
        lat = row.get("avg_latency_ms", None)
        parts = []
        for m in ["context_precision", "context_recall", "context_relevancy"]:
            if m in df_in.columns and m in row:
                parts.append(f"{m}={row[m]}")
        metrics_str = ", ".join(parts) if parts else "no ragas metrics found"
        print(f"- {name}: {metrics_str} | avg_latency_ms={lat}")

print("\n=== quick summary (k = EVAL_K) ===")
quick_summary(df)

print("\n=== quick summary (k = EVAL_K_PREC) ===")
quick_summary(df_lowk)

# end of script


loaded pages: 93 | files attempted: 8 | skipped: 0
chunks: 214
sample metadata: {'producer': '', 'creator': 'Image Capture Plus', 'creationdate': '2025-07-02T09:41:16+00:00', 'source': 'bills/20250725 SBN 25 AI Regulation Act.pdf', 'file_path': 'bills/20250725 SBN 25 AI Regulation Act.pdf', 'total_pages': 16, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2025-07-02T09:41:16+00:00', 'trapped': '', 'modDate': 'D:20250702094116Z', 'creationDate': 'D:20250702094116Z', 'page': 0, 'id': '20250725_SBN_25_AI_Regulation_Act.pdf#chunk_0'}
note: no ground truth found for 'what is the separability clause of sbn 25?'
note: no ground truth found for 'what is the repealing clause of sbn 25?'
note: no ground truth found for 'what is the separability clause of sb 29?'
note: no ground truth found for 'what is the repealing clause of sb 29?'
note: no ground truth found for 'what is the declaration of policy under hb 7913?'
note: no ground truth found for 'what

Evaluating: 100%|██████████| 60/60 [00:57<00:00,  1.04it/s]
Evaluating:  72%|███████▏  | 43/60 [01:24<00:29,  1.73s/it]Exception raised in Job[56]: InternalServerError(upstream connect error or disconnect/reset before headers. reset reason: connection termination)
Evaluating: 100%|██████████| 60/60 [02:28<00:00,  2.48s/it]
Evaluating: 100%|██████████| 60/60 [01:38<00:00,  1.64s/it]
Evaluating: 100%|██████████| 60/60 [01:09<00:00,  1.16s/it]
Evaluating: 100%|██████████| 60/60 [01:40<00:00,  1.68s/it]
Evaluating: 100%|██████████| 60/60 [00:49<00:00,  1.21it/s]
Evaluating: 100%|██████████| 60/60 [00:32<00:00,  1.85it/s]
Evaluating: 100%|██████████| 60/60 [00:43<00:00,  1.38it/s]
Evaluating: 100%|██████████| 60/60 [00:45<00:00,  1.31it/s]



=== retriever comparison (k = EVAL_K) ===
        retriever  context_precision  context_recall  avg_latency_ms
             bm25              0.415             1.0             0.6
            dense              0.526             0.0            16.9
           hybrid              0.594             0.0            16.2
  bm25_smallchunk              0.800             0.0             0.9
 dense_smallchunk              0.962             0.0            16.8
hybrid_smallchunk              1.000             0.0            17.5

=== retriever comparison (k = EVAL_K_PREC) ===
retriever  context_precision  context_recall  avg_latency_ms
 dense_k2                0.0           0.167            16.4
  bm25_k2                0.0           0.000             0.6
hybrid_k2                0.5           0.000            13.3

=== quick summary (k = EVAL_K) ===
- bm25: context_precision=0.415, context_recall=1.0 | avg_latency_ms=0.6
- dense: context_precision=0.526, context_recall=0.0 | avg_latency_ms=16.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### Analysis & Observations:

Cost

* BM25 & BM25 Smallchunk
* Extremely low latency (~0.6–0.9 ms), meaning they’re very cheap to run.
* No external embedding model calls, so zero token-based cost.
* Dense & Hybrid Models
* Much higher latency (~13–17 ms) due to embedding lookups and similarity search.
* Embedding-based retrievers incur token embedding costs if not cached.
* Hybrid combines both costs (BM25 + dense) without significant latency savings.

Latency
* Fastest: BM25 variants (<1 ms) — ideal for real-time, cost-sensitive queries.
* Slowest: Dense/Hybrid smallchunk (~16–17 ms) — latency could stack if chaining multiple retrievals per user query.
* Trade-off: Higher-latency retrievers generally bring semantic matching advantages, but here recall issues reduce that benefit.

Performance

BM25 (Regular)

* Context Precision: 0.415 (lower precision, more noise)

* Context Recall: 1.0 (perfect recall in current small golden set)

* Good for coverage but needs better ranking for precision.

Dense & Hybrid (Regular)

* Higher precision (0.526–0.594) but 0.0 recall — indicating poor match rate to gold set due to vocabulary or semantic mismatch.

* Strength is underutilized due to small and literal golden dataset.

Smallchunk Variants

* Very high precision (0.800–1.0) but still 0.0 recall — meaning they find exact matches very accurately but miss most gold answers.

* Works for pinpoint fact retrieval, but risky for broad coverage.

Key Observation

* Current dataset size and missing gold entries skew recall scores — BM25 appears to dominate recall only because the golden set aligns well with keyword matching.
* Dense/hybrid retrievers are underperforming not because they’re weak, but because the queries and gold answers don’t match semantically enough for vector search to trigger.
* Smallchunk boosts precision significantly, but without improved recall, it’s not viable alone for coverage-heavy use cases.


</div>