# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [49]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [4]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [5]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [52]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

## Task 3: Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [53]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each complaint as a document, and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [54]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
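
To make the "cosine similarity, top `k`" idea concrete, here is a minimal pure-Python sketch. The 3-dimensional vectors and document ids are made up for illustration (the real embeddings have 1536 dimensions), and the vector store uses an optimized index rather than a sort like this.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy "embeddings", NOT produced by text-embedding-3-small.
toy_doc_vecs = {
    "payment_misapplied": [0.9, 0.1, 0.0],
    "forbearance_ended": [0.1, 0.9, 0.1],
    "credit_report_error": [0.0, 0.2, 0.9],
}
toy_query_vec = [0.8, 0.3, 0.0]

top_two = top_k(toy_query_vec, toy_doc_vecs, k=2)
print(top_two)
```

The query vector points mostly along the first axis, so the document that also points that way ranks first.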

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [55]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-5-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [56]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-5-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [57]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # RunnablePassthrough.assign passes the dict from the previous step through unchanged,
    # (re)assigning "context" to the value it already holds
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values format our prompt object, which is piped
    #              into the LLM; the resulting message is stored under "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)
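
If the piping syntax feels opaque, the data flow of the chain can be mimicked with plain functions and dicts. This is only a sketch of how the values move from step to step: `fake_retriever` and `fake_llm` are hypothetical stand-ins, not LangChain objects, and nothing here calls an API.

```python
def fake_retriever(question):
    # Hypothetical stand-in for naive_retriever: returns a fixed list of "documents".
    return [f"complaint related to: {question}"]

def fake_llm(prompt_inputs):
    # Hypothetical stand-in for `rag_prompt | chat_model`.
    return f"Answer based on {len(prompt_inputs['context'])} document(s)."

def run_chain(inputs):
    # Step 1: the dict literal builds "context" and "question" from the input.
    step_1 = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: RunnablePassthrough.assign keeps every key and (re)sets "context".
    step_2 = {**step_1, "context": step_1["context"]}
    # Step 3: produce "response" and carry "context" forward for inspection.
    return {"response": fake_llm(step_2), "context": step_2["context"]}

result = run_chain({"question": "What is the most common issue with loans?"})
print(result["response"])
```

Returning `"context"` alongside `"response"` is what lets us inspect the retrieved documents later, which matters once we start evaluating retrievers.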

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [58]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Dealing with your lender or servicer.\n\nIn the provided complaints, the most common issue category is problems with the loan servicer/lender, including misinformation, misapplied payments, balance errors, and trouble with payment handling or plans.'

In [59]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes. There is at least one complaint not handled in a timely manner:\n\n- MOHELA, Complaint ID 12709087, Date received 03/28/25. Timely response? No. The narrative describes multiple delays and no follow-up despite promises that it would be addressed within days.'

In [60]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'Based on the provided complaints, people didn’t repay their loans for a mix of financial hardship and servicing problems. The main reasons include:\n\n- Economic hardship and insufficient income\n  - Many borrowers cannot afford higher monthly payments due to low or unstable income (contract/freelance work, living expenses rising, stagnating wages).\n  - For some, even “affordable” payments under income-driven plans are still unaffordable when basic needs are prioritized.\n\n- Interest keeps growing even when payments are paused\n  - Forbearance or deferment delays repayment but interest still accrues, so balances can increase and the required payoff becomes larger later.\n\n- Inability to access or qualify for relief programs\n  - PSLF or Teacher Loan Forgiveness often not available to many borrowers, leaving them with debt they can’t easily erase.\n\n- Servicing and communication issues\n  - Loans frequently transferred between servicers (e.g., Great Lakes, Navient, Nelnet, EdFinanc

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set up! Let's see it happen down below!


In [61]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)
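
For intuition about what happens under the hood, here is a minimal sketch of Okapi BM25 scoring in pure Python. It assumes whitespace tokenization, `k1=1.5`, `b=0.75`, and a non-negative IDF variant; the retriever's actual tokenizer, defaults, and IDF formula may differ, and the three toy documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a larger weight than common ones.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates (k1) and is normalized by document length (b).
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus, not taken from complaints.csv.
toy_docs = [
    "mohela misapplied my payment",
    "nelnet raised my monthly payment",
    "my payment was late",
]
toy_scores = bm25_scores("mohela payment", toy_docs)
best_doc = toy_docs[toy_scores.index(max(toy_scores))]
print(best_doc)
```

Every document contains "payment", so that term contributes little; the rare token "mohela" is what pushes the first document to the top.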

We'll construct the same chain - only changing the retriever.

In [62]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [63]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Dealing with your lender or loan servicer.\n\nIn the provided samples, all the complaints are about issues with the loan servicer (including disputes over fees, how payments are applied, and getting accurate loan information).'

In [16]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all the complaints in the context received responses that were marked as "Company response to consumer: Closed with explanation" and "Timely response?": "Yes." This indicates that the complaints were handled in a timely manner.'

In [66]:
bm25_retrieval_chain.invoke({"question" : "What can you say about 1992 data?"})["response"].content

'I don’t know anything about 1992 data from the information you provided.\n\nWhat the context shows:\n- The excerpts come from a complaints dataset (complaints.csv) about student loans.\n- Dates in the sample are in 2025 (e.g., Date received: 04/10/25, 03/28/25, 04/30/25, etc.).\n- The entries involve various loan servicers (MOHELA, Maximus, EdFinancial, Nelnet, etc.) and issues like misinformation, reporting errors, or payment problems.\n\nThere’s no mention of 1992 data in the provided excerpts. If you have 1992 data you want analyzed or summarized, please share that data or clarify what you want to know.'

It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: We do - the second half of the notebook will cover this.)

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

</div>

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### Answer:

Example query:
“Find complaints that mention ‘Form 1099-C’ (cancellation of debt).”

BM25 rewards exact token overlap, so it will prioritize documents that literally contain the rare, alphanumeric token “1099-C” (including punctuation variants). Embedding search smooths meaning and may retrieve semantically related texts about “debt forgiveness” or “tax forms” without the exact string, hurting precision. For short, keyword-heavy queries with acronyms, codes, section numbers, or product names, BM25 typically outperforms embeddings.

</div>

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [67]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)
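
The retrieve-then-compress pattern can be sketched without any API calls. In this sketch a toy word-overlap score stands in for the reranking model, so it only illustrates the shape of the pipeline (many candidates in, a few re-scored documents out), not how Cohere's model actually scores text. The function names and candidates are hypothetical.

```python
def overlap_score(query, text):
    """Toy relevance score: fraction of query words present in the text."""
    query_words = set(query.lower().split())
    return len(query_words & set(text.lower().split())) / len(query_words)

def rerank(query, candidates, top_n=3, score_fn=overlap_score):
    """Re-score the first-stage candidates and keep only the best top_n."""
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:top_n]

# Pretend these came back from a first-stage retriever with a generous k.
toy_candidates = [
    "my loan was transferred without notice",
    "i was charged a late fee",
    "the servicer charged a late payment fee twice",
    "my payment was applied to the wrong loan",
]
compressed = rerank("late payment fee", toy_candidates, top_n=2)
print(compressed)
```

This is also why the naive retriever was given `k=10`: the reranker can only promote documents that the first stage returned.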

Let's create our chain again, and see how this does!

In [68]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [69]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue is problems dealing with the lender or loan servicer (e.g., receiving bad or incorrect information about the loan).'

In [70]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'No. In the provided records, all listed complaints show a timely response (Timely response? Yes) for the following IDs: 12975634, 12973003, and 13298273.'

In [71]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'Based on the provided complaints, several intertwined reasons explain why people failed to repay their student loans:\n\n- Lack of understanding and notification\n  - Many borrowers did not realize they had to repay, especially if they weren’t clearly informed by financial aid officers.\n  - Some borrowers were not told when repayment would start or what their obligations were.\n\n- Servicing and administrative issues\n  - Loans were transferred between servicers (sometimes without the borrower’s knowledge or consent), leading to confusion.\n  - Borrowers faced locked accounts, incorrect or inconsistent information online, and notices not being received or understood.\n  - Accounts and balances were reported inconsistently (e.g., mismatched principal, interest, and balance figures).\n\n- Interest accrual and payment options\n  - Interest continued to accrue during forbearance or deferment, and even when payments were made, the balance often did not shrink.\n  - Lower monthly payments 

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how hard is it to set up? Not bad! Let's see it down below!



In [72]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)
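
The three steps above boil down to "retrieve per query, then take the de-duplicated union". Here is a minimal sketch under stated assumptions: the reformulations are hard-coded (the real retriever asks the LLM for them), and `toy_retrieve` is a hypothetical keyword lookup standing in for the vector store.

```python
def multi_query_retrieve(queries, retrieve):
    """Run every query through `retrieve` and return the unique documents in order."""
    seen = set()
    unique_docs = []
    for query in queries:
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

# Hypothetical keyword index standing in for a real retriever.
toy_index = {
    "late": ["doc_a", "doc_b"],
    "delayed": ["doc_b", "doc_c"],
    "overdue": ["doc_d"],
}

def toy_retrieve(query):
    return [doc for word in query.lower().split() for doc in toy_index.get(word, [])]

# Hand-written reformulations; in the notebook the LLM would generate these.
toy_queries = ["late response", "delayed response", "overdue response"]
merged = multi_query_retrieve(toy_queries, toy_retrieve)
print(merged)
```

The original wording alone would only have found `doc_a` and `doc_b`; the reformulations surface the other two.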

In [73]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [74]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue in the provided loan complaints is problems dealing with the lender or loan servicer. This includes issues like misapplied payments, incorrect balances, forbearance/repayment mismanagement, and poor communication from the lender or servicer.'

In [75]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes. There is at least one complaint that was not handled in a timely manner.\n\n- Example: MOHELA, Complaint ID 12709087\n  - Date received: 03/28/25\n  - Timely response? No\n  - Issue: Graduated loan application not processed and no timely update despite multiple inquiries.'

In [76]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'Based on the complaints in the provided data, people didn’t fail to pay back their loans for mainly a mix of servicer/administrative problems and affordability issues. Key patterns include:\n\n- Servicer errors and misreporting\n  - Payments were misapplied or not applied to the principal, leaving borrowers with balances that kept growing.\n  - Delinquency and default statuses were reported to credit bureaus without proper notice or opportunity to cure.\n  - Accounts were incorrectly placed in delinquency or default (including “90 days past due,” “186 days late,” etc.) even after payment or during forbearance.\n  - Borrowers were transferred between servicers (e.g., Great Lakes, Nelnet, MOHELA, Aidvantage) with little or no notice, causing confusion about when/how to pay.\n  - Borrowers had no access to account information or login credentials, making it hard to monitor or fix status.\n  - In some cases, borrowers reported that the loan status that was reported to the credit bureaus d

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

</div>

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### Answer:

- Generating multiple reformulations of a user query improves recall by covering a wider range of the wordings, synonyms, and phrasings that may appear in the documents. Different authors may use different terminology for the same concept. For example, "heart attack" and "myocardial infarction" can appear in separate documents but refer to the same thing.
- By reformulating the query into multiple variations, the retriever can capture documents that would be missed if it relied only on the original phrasing.

</div>

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
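
Before wiring this up with LangChain, here is a toy version of the small-to-big idea. It is a sketch only: it splits on a fixed number of words and uses word overlap in place of embedding similarity, and all names and documents are hypothetical.

```python
def split_into_children(text, chunk_size):
    """Split a parent document into child chunks of `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def build_child_index(parents, chunk_size=5):
    """Return (child_text, parent_id) pairs; this plays the role of the vector store."""
    return [
        (child, parent_id)
        for parent_id, text in parents.items()
        for child in split_into_children(text, chunk_size)
    ]

def retrieve_parents(query, child_index, parents, k=1):
    """Search the small child chunks, but return the big parent documents."""
    query_words = set(query.lower().split())
    ranked = sorted(
        child_index,
        key=lambda item: len(query_words & set(item[0].lower().split())),
        reverse=True,
    )
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:  # several children may share one parent
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]

# The dict plays the role of the docstore that holds the un-split parents.
toy_parents = {
    "c1": "my payment was misapplied by the servicer and my balance went up after that",
    "c2": "the forbearance ended and nobody told me when payments would resume again",
}
toy_child_index = build_child_index(toy_parents)
found = retrieve_parents("forbearance ended", toy_child_index, toy_parents, k=1)
print(found)
```

The match happens on a five-word child chunk, yet the full complaint comes back, including the part about payments resuming.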

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [77]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [78]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [79]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [80]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [81]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [82]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Dealing with your lender or loan servicer (servicing issues) appears to be the most common issue in these loan complaints.'

In [83]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes. Two complaints were not handled in a timely manner:\n\n- Complaint ID 12709087 (Date received 03/28/25) – Timely response? No (MOHELA, CA)\n- Complaint ID 12935889 (Date received 04/11/25) – Timely response? No (MOHELA, CO)\n\nThe other two listed complaints had timely responses (Yes).'

In [84]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'Based on the complaints in the provided data, there isn’t a single reason people failed to pay back their loans. Instead, failures to repay arose from a mix of financial hardship, school-related factors, and loan servicing/administrative issues. Key themes include:\n\n- Financial hardship after graduation\n  - Unemployment or underemployment and low income made regular payments unaffordable.\n  - Health problems, housing costs, and other living expenses reduced people’s ability to pay.\n  - For some, taking out loans for schools with poor outcomes left them with debt they couldn’t realistically repay.\n\n- High debt and poor career prospects from the school experience\n  - Borrowers felt misled about the value of degrees, job outcomes, or the stability of the institution (some schools closed or faced financial trouble).\n  - This led to limited or unstable income, making debt repayment difficult or impossible.\n\n- Borrowing/use of deferment or forbearance increasing long-term costs\n

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [36]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)
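
To see what rank fusion does with those weights, here is a small sketch of weighted Reciprocal Rank Fusion. The constant `c=60` follows the RRF paper and is an assumption about the defaults here; the two rankings are made up.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each document earns weight / (rank + c) per list."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a sparse and a dense retriever.
toy_bm25_ranking = ["a", "b", "c"]
toy_dense_ranking = ["b", "d", "a"]

fused = weighted_rrf([toy_bm25_ranking, toy_dense_ranking], weights=[0.5, 0.5])
print(fused)
```

Documents that appear in both lists ("a" and "b") beat documents that appear in only one, even when the latter rank well in their own list. Only ranks are used, so the retrievers' raw scores never need to be comparable.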

We'll pack *all* of these retrievers together in an ensemble.

In [37]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [38]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issues with student loan complaints revolve around:\n\n- Dealing with lenders or servicers, including errors in loan balances, misapplied payments, wrongful denials of payment plans, and trouble with how payments are being handled.\n- Incorrect or inconsistent information being reported on credit reports, including late payments that cause credit score drops without proper notice.\n- Lack of transparency or clear communication about loan terms, balances, interest, or changes in loan servicing.\n- Mishandling of loan transfers, including unauthorized transfers, improper classification of loan types, and failure to follow federal regulations regarding notices and collection practices.\n- Problems with loan consolidation, including lack of disclosure and unexpected payment amounts.\n- Technical issues such as receiving bad information, inadequate investigation of disputes, or errors in data reporting.\n\nThe recurring theme is that many comp

In [39]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes, according to the provided complaints, some complaints did not get handled in a timely manner. Specifically, there are instances where responses were delayed beyond the expected timeframes—such as complaints marked as "No" in the "Timely response?" field, indicating responses were not received promptly. For example, one complaint about a delinquent account reported on 04/01/25 was marked as "No" in timely response, and the response was "Closed with explanation" after a delay. Similarly, other complaints from the same period show delays in response, which suggests that some complaints were not handled promptly.'

In [40]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans primarily due to factors such as:\n\n- Lack of clear communication from lenders or servicers about repayment obligations, due dates, and changes in loan status, leading to unawareness of when repayment was due or that payments had resumed.\n- Being steered into forbearance or deferment without understanding that interest would continue to accrue and capitalize, increasing overall debt instead of reducing it.\n- Difficulties in navigating complex repayment options, such as income-driven plans, rehabilitation programs, or consolidations, often without proper guidance or information.\n- Errors or mishandling by loan servicers, including incorrect account information, misapplication of payments, or failure to notify borrowers about transfer of loans or delinquency status.\n- Administrative issues like missed notices, improper reporting to credit bureaus, or unapproved transfer of loans, which negatively impacted credit scores.\n- Circumstances such as

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of a number of [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example which will:

Calculate all distances between adjacent sentences, and then break apart sequences of sentences wherever the distance exceeds a given percentile among all distances.

In [85]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)
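
Here is a simplified, pure-Python sketch of the `percentile` strategy. It assumes a 95th-percentile cut-off and uses made-up 2-dimensional sentence vectors; the real chunker embeds sentences with the embedding model and has extra details (such as combining each sentence with its neighbours) that are left out here.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

def percentile(values, pct):
    """Linear-interpolation percentile, with pct between 0 and 100."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

def semantic_chunks(sentences, vectors, pct=95):
    """Start a new chunk wherever the distance to the next sentence is unusually large."""
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Hypothetical sentences and vectors: two about payments, two about credit reports.
toy_sentences = [
    "My payment doubled.",
    "The new payment is too high.",
    "My credit report is wrong.",
    "The bureau will not fix it.",
]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

toy_chunks = semantic_chunks(toy_sentences, toy_vectors)
print(toy_chunks)
```

The only large jump in distance sits between the second and third sentences, so that is where the text is split.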

Now we can split our documents.

In [86]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [87]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [88]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [89]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [90]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue in the provided data is problems Dealing with your lender or servicer (issues related to interacting with the loan servicer, such as communication problems, misinformation, and trouble with payments).'

In [91]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'No. In the provided data, every complaint shows a timely response (Timely response? = Yes) for all entries listed.'

In [92]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'Based on the complaints in the provided data, people didn’t fail to pay back loans mainly because of lack of funds or unwillingness, but because of administrative and servicer issues that disrupted or delayed payments. Key reasons include:\n\n- Forbearance end not handled correctly\n  - Loans weren’t re-amortized when forbearance ended, causing unexpectedly large monthly payments.\n  - Borrowers were told different things (e.g., ongoing forbearance longer than written) or didn’t receive written confirmation, creating uncertainty about what they owed.\n\n- Poor communication and transparency from lenders/servicers\n  - Little to no contact after payments resumed.\n  - Written documentation was missing or unclear, leaving borrowers unsure of their status or exact amounts.\n  - Difficulty getting clear explanations from staff, with long wait times and dropped calls.\n\n- Technical and access problems with payment systems\n  - Trouble logging in or navigating the online portals.\n  - Auto

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

</div>

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### Answer:

- Naive semantic chunking tends to over-merge many items into one blob, because adjacent sentences all look very similar in embedding space.
- Adjust by making each FAQ Q+A pair the atomic unit, then add light, topic-coherent batching with strict guards (for example, caps on chunk length and on the number of pairs per chunk).

</div>
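
The adjustment described in the answer can be sketched in a few lines. This is a minimal illustration, not the algorithm used by any chunking library: `bag_of_words` is a crude token-count stand-in for a real embedding model, and `chunk_faq`, `min_similarity`, `max_pairs`, and `max_chars` are hypothetical names and thresholds chosen for the example.

```python
import re
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def chunk_faq(pairs, min_similarity=0.3, max_pairs=3, max_chars=600):
    """Group (question, answer) pairs into chunks without ever splitting a pair."""
    chunks, current, previous_vector = [], [], None
    for question, answer in pairs:
        unit = f"Q: {question}\nA: {answer}"  # the atomic unit
        vector = bag_of_words(f"{question} {answer}")
        if current:
            same_topic = cosine(previous_vector, vector) >= min_similarity
            # Guards: cap the number of pairs and the total length of a chunk.
            fits = (
                len(current) < max_pairs
                and sum(len(u) for u in current) + len(unit) <= max_chars
            )
            if not (same_topic and fits):
                chunks.append("\n\n".join(current))
                current = []
        current.append(unit)
        previous_vector = vector
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Toy FAQ data, invented for this example.
faq_pairs = [
    ("How do I reset my password?", "Use the reset password link on the login page."),
    ("How do I change my password?", "Open settings and choose change password."),
    ("What is the late payment fee?", "The late payment fee is 25 dollars."),
]
chunks = chunk_faq(faq_pairs)
for chunk in chunks:
    print(chunk, end="\n---\n")
```

With the toy data, the two password questions are batched together and the fee question starts a new chunk. Lowering `max_pairs` to 1 turns every pair into its own chunk, which is the safest setting when the answers are highly repetitive.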

# 🤝 Breakout Room Part #2

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other. 
You can use the loans or bills dataset.

You are expected to:

1. Create a "golden dataset"
   - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever-specific* Ragas metrics
   - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

</div>
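
One way to plan the comparison is a small harness that runs every retriever over the golden dataset and records latency next to a quality score. The sketch below is only a planning aid under stated assumptions: `compare_retrievers` is a hypothetical helper, the retrievers are plain Python callables standing in for LangChain retrievers, and the substring "hit rate" is a crude substitute for the Ragas retriever metrics the activity asks for. Cost is not measured here.

```python
import time
from statistics import mean


def compare_retrievers(retrievers, golden_rows, k=5):
    """Run each retriever over (question, reference_snippet) rows.

    retrievers: dict mapping a name to a callable(question) -> list of strings.
    Returns one report row per retriever, best hit rate first, then lowest latency.
    """
    report = []
    for name, retrieve in retrievers.items():
        latencies, hits = [], []
        for question, reference in golden_rows:
            start = time.perf_counter()
            contexts = retrieve(question)[:k]
            latencies.append(time.perf_counter() - start)
            # A "hit" means some retrieved context contains the reference snippet.
            hits.append(1.0 if any(reference in c for c in contexts) else 0.0)
        report.append(
            {
                "retriever": name,
                "hit_rate": mean(hits),
                "mean_latency_s": mean(latencies),
            }
        )
    return sorted(report, key=lambda row: (-row["hit_rate"], row["mean_latency_s"]))


# Toy corpus and retrievers, invented for this example.
corpus = [
    "Late fees were charged after autopay failed.",
    "The servicer never answered my calls.",
    "My payment was applied to the wrong loan.",
]


def keyword_retriever(question):
    words = set(question.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]


def first_doc_retriever(question):
    return corpus[:1]


golden_rows = [
    ("why were late fees charged", "autopay failed"),
    ("was my payment applied correctly", "wrong loan"),
]
report = compare_retrievers(
    {"keyword": keyword_retriever, "first_doc": first_doc_retriever},
    golden_rows,
)
for row in report:
    print(row)
```

In the real activity, each callable would wrap a retriever's `invoke` method and return the page content of the retrieved documents, and the hit rate would be replaced by the chosen Ragas metrics.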

## Setup

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/jay/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jay/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"PSI - SDG - {uuid4().hex[0:8]}"

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Data Preparation

In [5]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "bills/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

In [6]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [7]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

In [8]:
from ragas.testset.graph import Node, NodeType

### NOTICE: We're using a subset of the data for this example - this is to keep costs/time down.
for doc in docs[:20]:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 20, relationships: 0)

In [9]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

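# Use a separate name so the imported default_transforms function is not shadowed
# (re-running the cell would otherwise try to call a list).
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)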
kg

Applying SummaryExtractor:   0%|          | 0/20 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/20 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/60 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 20, relationships: 227)

In [11]:
kg.save("bills/ai_law.json")
bills_data_kg = KnowledgeGraph.load("bills/ai_law.json")
bills_data_kg

KnowledgeGraph(nodes: 20, relationships: 227)

In [12]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=bills_data_kg)

In [13]:
from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

# Each tuple pairs a synthesizer with its share of the generated questions.
query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]

In [14]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

|   | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | As an AI researcher and developer, how does th... | `[Republic of the Philippines \nHOUSE OF REPRES...` | The context indicates that the Future of Jobs ... | single_hop_specifc_query_synthesizer |
| 1 | What is the role of the Department of Informat... | `[age-appropriate curriculum modules in consult...` | The Department of Information and Communicatio... | single_hop_specifc_query_synthesizer |
| 2 | What is the role of Rep. Robert Nazal in the A... | `[Republic of the Philippines \nHOUSE OF REPRES...` | Rep. Robert Nazal introduced the bill known as... | single_hop_specifc_query_synthesizer |
| 3 | What is the significance of implementing AI ed... | `[C.\nTo foster creativity, problem-solving, a...` | The integration of AI education in the Philipp... | single_hop_specifc_query_synthesizer |
| 4 | What is the role of DepEd in AI education and ... | `[relevance. Feedback gathered from these pilot...` | DepEd is responsible for designing and impleme... | single_hop_specifc_query_synthesizer |
| 5 | how does the requirement for existing AI syste... | `[<1-hop>\n\n1 \nSec. 12. Contents of AI Regist...` | the requirement for existing AI systems to reg... | multi_hop_abstract_query_synthesizer |
| 6 | H0w do the AI portal and the unpredictable AI ... | `[<1-hop>\n\n1 \nk) Issue advisory opinions on ...` | The context describes creating an online porta... | multi_hop_abstract_query_synthesizer |
| 7 | how GPT like models help bridge digital divide... | `[<1-hop>\n\n1\n2\n3\n4\n5\n6\n7\n8 \n9\n10\n11...` | The context shows that early AI education, inc... | multi_hop_abstract_query_synthesizer |
| 8 | How does the Philippines' House Bill on instit... | `[<1-hop>\n\nRepublic of the Philippines \nHOUS...` | The House Bill on institutionalizing AI educat... | multi_hop_specific_query_synthesizer |
| 9 | How does the Artificial Intelligence Education... | `[<1-hop>\n\nRepublic of the Philippines \nHOUS...` | The Artificial Intelligence Education Act of 2... | multi_hop_specific_query_synthesizer |


In [18]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

Applying SummaryExtractor:   0%|          | 0/20 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/20 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/60 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [19]:
dataset.to_pandas()

|   | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | How does the Philippines plan to incorporate A... | `[Republic of the Philippines \nHOUSE OF REPRES...` | The bill proposes to institutionalize AI liter... | single_hop_specifc_query_synthesizer |
| 1 | What does the Department of Information and Co... | `[age-appropriate curriculum modules in consult...` | The Department of Information and Communicatio... | single_hop_specifc_query_synthesizer |
| 2 | As an AI researcher and developer, how does th... | `[Republic of the Philippines \nHOUSE OF REPRES...` | The Act aims to institutionalize the integrati... | single_hop_specifc_query_synthesizer |
| 3 | What does the Department of Science and Techno... | `[C.\nTo foster creativity, problem-solving, a...` | The Department of Science and Technology, in c... | single_hop_specifc_query_synthesizer |
| 4 | so like how does the investments in infrastruc... | `[<1-hop>\n\nage-appropriate curriculum modules...` | the investments in infrastructure and learning... | multi_hop_abstract_query_synthesizer |
| 5 | How does the framework for responsible AI deve... | `[<1-hop>\n\nAI presents enormous opportunities...` | The framework for responsible AI development a... | multi_hop_abstract_query_synthesizer |
| 6 | How does the legislation promote careers in sc... | `[<1-hop>\n\nage-appropriate curriculum modules...` | The legislation promotes careers in science an... | multi_hop_abstract_query_synthesizer |
| 7 | How does the development of the philippine ai ... | `[<1-hop>\n\n1\n2\n3\n4\n5\n6\n7\n8 \n9\n10\n11...` | The development of the philippine ai roadmap i... | multi_hop_abstract_query_synthesizer |
| 8 | How does the Department of Education plan to i... | `[<1-hop>\n\nC.\nTo foster creativity, problem...` | The Department of Education aims to integrate ... | multi_hop_specific_query_synthesizer |
| 9 | How does the Department of Education plan to p... | `[<1-hop>\n\nC.\nTo foster creativity, problem...` | The Department of Education aims to promote AI... | multi_hop_specific_query_synthesizer |


In [21]:
from langsmith import Client

client = Client()

dataset_name = "Philippines AI Bills HW 2"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Philippines AI Bills HW"
)

In [22]:
# Upload each synthetic row to the LangSmith dataset as one example.
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )

## Basic RAG Chain

In [23]:
rag_documents = docs

In [24]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
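
To build intuition for what `chunk_size = 500` and `chunk_overlap = 50` mean, here is a deliberately simplified sliding-window splitter. It is an assumption-laden sketch, not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); `sliding_window_chunks` is a hypothetical helper that only shows how consecutive chunks share their boundary characters.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size windows whose edges overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts 450 characters later
    # Stop before a window that would contain nothing but overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample_text = "".join(str(i % 10) for i in range(1000))
window_chunks = sliding_window_chunks(sample_text)
print([len(chunk) for chunk in window_chunks])  # [500, 500, 100]
```

The last 50 characters of each window reappear at the start of the next one, so a sentence that straddles a boundary is still fully visible in at least one chunk.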

In [25]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [26]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="AI Bills RAG"
)

In [27]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

### Augment

In [28]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

In [29]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

In [30]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

In [31]:
rag_chain.invoke({"question" : "How much is the penalty for spreading disinformation?"})

'The penalty for using AI to create or disseminate disinformation is a fine of One Million Pesos (Php 1,000,000) to Five Million Pesos (Php 5,000,000), or imprisonment of three (3) years to ten (10) years, or both, at the discretion of the court.'

Testing a question that is not in the data:

In [32]:
rag_chain.invoke({"question" : "How much is the penalty for beating the red light?"})

"I don't know."

### LangSmith Set-up

In [33]:
eval_llm = ChatOpenAI(model="gpt-4.1")

In [34]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

empathy_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "empathy": "Is this response empathetic? Does it make the user feel like they are being heard?",
        },
        "llm" : eval_llm
    }
)

#### LangSmith Evals

In [35]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        empathy_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'yellow-soap-88' at:
https://smith.langchain.com/o/a96da657-090f-454d-8abf-7967e1d8a759/datasets/381a6db1-3018-4369-96d7-918d3bf7da41/compare?selectedSessions=3bfad255-43f1-43ab-a510-45aa2aabad9e




0it [00:00, ?it/s]

|   | inputs.question | outputs.output | error | reference.answer | feedback.correctness | feedback.helpfulness | feedback.empathy | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | How does Rep. Robert Nazal's Artificial Intell... | Rep. Robert Nazal's Artificial Intelligence Ed... |  | Rep. Robert Nazal's Artificial Intelligence Ed... | 1 | 1 | 0 | 3.824804 | 1a929557-4086-48d9-95ed-a93bf155fec8 | b8463930-fef5-4c37-9f3d-fc5a3383c9c0 |
| 1 | How does the Department of Education's monitor... | Based on the provided context, the Department ... |  | The Department of Education (DepEd) is tasked ... | 1 | 0 | 0 | 6.311069 | 00e8742d-e875-4359-99eb-7f1ab5de78c4 | 4c0c4457-5683-4863-a694-c61d99130586 |
| 2 | How does the Department of Education plan to p... | According to the provided context, the Departm... |  | The Department of Education aims to promote AI... | 1 | 1 | 0 | 7.199681 | 5880e39d-2980-499d-9891-fb87993ffb87 | d8d214c0-e603-4b68-a523-6f6cdf629817 |
| 3 | How does the Department of Education plan to i... | The Department of Education (DepEd) plans to i... |  | The Department of Education aims to integrate ... | 1 | 1 | 0 | 9.40788 | ed0007a0-3405-4626-b958-d393af7b02b3 | accb16ec-09af-477e-813b-c5409bc41fa6 |
| 4 | How does the development of the philippine ai ... | Based on the provided context, the development... |  | The development of the philippine ai roadmap i... | 1 | 1 | 0 | 4.703974 | d6afb36f-9973-4ff0-b728-fa2de21f5d48 | fda56b03-dba0-4da0-b97c-17de48eba456 |
| 5 | How does the legislation promote careers in sc... | The legislation promotes careers in science an... |  | The legislation promotes careers in science an... | 0 | 0 | 0 | 2.690439 | 031e34a1-f97a-43e1-bb34-3d0e18072464 | 9c2a64f9-0cda-479a-9efa-50dd15c4003b |
| 6 | How does the framework for responsible AI deve... | The framework for responsible AI development i... |  | The framework for responsible AI development a... | 1 | 1 | 0 | 2.556346 | a617e0d2-345f-4505-8231-f891efd5ec1d | 4b84dcf1-f78e-4dcb-b4b1-387aeb4fc8cf |
| 7 | so like how does the investments in infrastruc... | Based on the provided context from the AI Regu... |  | the investments in infrastructure and learning... | 1 | 1 | 0 | 5.941099 | b7bd3380-5a00-4d74-9290-eb57be5bf9db | 62e004ad-5839-4c76-8102-16c29d454af1 |
| 8 | What does the Department of Science and Techno... | The Department of Science and Technology (DOST... |  | The Department of Science and Technology, in c... | 1 | 1 | 0 | 3.479386 | 2241fbde-48a2-4daf-9d6a-da1e03b0b12f | 236df582-2bd7-4944-a2fe-5ebc8cfd8e47 |
| 9 | As an AI researcher and developer, how does th... | The Artificial Intelligence Education Act of 2... |  | The Act aims to institutionalize the integrati... | 1 | 1 | 0 | 7.859935 | f5352b23-e285-46d0-8f07-e62a62325766 | 7627708c-c0c1-46ac-bb4e-08c1bd0ca0fe |


##### HINTS:

- LangSmith provides detailed information about latency and cost.
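
Alongside the LangSmith UI, the experiment table above can be condensed into a few numbers for the write-up. The sketch below is one possible way to do that: the scores and execution times are copied by hand from that table, and the `summary` layout is an assumption of this example. Cost does not appear in the table, so it still has to be read from LangSmith.

```python
from statistics import mean

# (correctness, helpfulness, empathy, execution_time_s), copied row by row
# from the experiment table above.
rows = [
    (1, 1, 0, 3.824804),
    (1, 0, 0, 6.311069),
    (1, 1, 0, 7.199681),
    (1, 1, 0, 9.40788),
    (1, 1, 0, 4.703974),
    (0, 0, 0, 2.690439),
    (1, 1, 0, 2.556346),
    (1, 1, 0, 5.941099),
    (1, 1, 0, 3.479386),
    (1, 1, 0, 7.859935),
]

summary = {
    "correctness": mean([row[0] for row in rows]),
    "helpfulness": mean([row[1] for row in rows]),
    "empathy": mean([row[2] for row in rows]),
    "mean_latency_s": mean([row[3] for row in rows]),
    "max_latency_s": max(row[3] for row in rows),
}
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

For this baseline run the averages come out to 0.90 for correctness, 0.80 for helpfulness, and 0.00 for empathy. Repeating the same summary for each retriever gives a like-for-like list to compare.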

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### Analysis & Observations:

</div>