# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [1]:
from dotenv import load_dotenv
import os

load_dotenv()

True

In [None]:
# import os
# import getpass

# os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [None]:
# os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the strutured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [2]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path=f"./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [3]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [4]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply look at each review as a document, and use cosine-similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [5]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [6]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [7]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [8]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [9]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issues with loans, based on the complaints provided, include:\n\n- Errors in loan balances and misapplied payments\n- Receiving incorrect or bad information about loans\n- Problems with loan servicing, such as difficulty applying payments to principal or paying off smaller loans early\n- Discrepancies and inaccuracies in loan information on credit reports\n- Unauthorized transfers or mishandling of loans\n- Difficulty with loan forgiveness, cancellation, or discharge\n- Issues with loan account management, privacy violations, or improper handling of data\n\nOverall, a prevalent issue appears to be mismanagement or errors by loan servicers, including inaccurate balances, poor communication, and difficulties in repayment processes.'

In [10]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, yes, some complaints did not get handled in a timely manner. Specifically, at least one complaint, submitted to MOHELA on 03/28/25 regarding delays in processing a loan application, was marked as "N/A" for timely response, indicating it was not responded to within the expected time frame. Additionally, multiple complaints mention delays and lack of response from the companies, with some stating that they have been waiting over a year for resolution.'

In [11]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including:\n\n1. **Lack of clear communication and transparency from loan servicers**: Several complaints mention that borrowers were not properly notified about loan transfers, resumption of payments, or changes in payment requirements, leading to unintentional delinquencies.\n\n2. **Financial hardship and inability to afford payments**: Many borrowers faced economic difficulties, such as stagnant wages, unemployment, or high living expenses, making it impossible to keep up with payments.\n\n3. **Interest accumulation and complex loan terms**: The continuous accumulation of interest, especially when loans were placed in forbearance or deferment, reduced the total amount owed over time and made repayment more difficult.\n\n4. **Difficulty in navigating repayment plans and eligibility**: Some borrowers found it hard to access income-driven repayment plans or faced issues with how additional payments were applied, effectively ex

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set-up! Let's see it happen down below!


In [12]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data, )

We'll construct the same chain - only changing the retriever.

In [13]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [14]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, a common issue with loans, particularly federal student loans, appears to be difficulties dealing with the lender or servicer, including problems with fee disclosures, payment application, and obtaining accurate or timely information. Specific issues mentioned include disputes over fees charged, how payments are applied (e.g., to interest rather than principal), receiving incorrect or bad information about loans, and frustrations with loan management and communication.\n\nTherefore, the most common issue with loans, as reflected in these complaints, is problems related to "Dealing with your lender or servicer," such as miscommunication, incorrect information, or difficulties in managing payments and fees.\n\nIf you were looking for a more general answer, I would say that common issues involve miscommunication, lack of transparency, or errors made by lenders or servicers.\n\nLet me know if you\'d like a more specific summary!'

In [15]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all the complaints included in the context were responded to in a timely manner. Specifically, each entry indicates that the response time was "Yes" for the "Timely response?" field. Therefore, there is no evidence to suggest that any complaints did not get handled in a timely manner.'

In [16]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including issues with the management and communication from loan servicers, problems with payment plans, and complications arising from loan transfers and miscommunications. Specifically, some borrowers experienced:\n\n- Being steered into incorrect types of forbearances or repayment plans, leading to increased loan balances.\n- Lack of response or assistance from loan servicers when seeking hardship accommodations or deferments.\n- Automatic payments being unenrolled or not processed due to transfer of loans between agencies, without proper notification.\n- Poor communication, such as failure to inform borrowers about loan status changes, overdue payments, or the requirement to re-establish autopay.\n- Errors or delays in the processing of forbearance applications, causing missed or reversed payments.\n- Administrative mistakes, like inaccurate or insufficient notice before collections or reporting negative credit impacts, an

It's not clear that this is better or worse, if only we had a way to test this (SPOILERS: We do, the second half of the notebook will cover this)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

#### Answer:

Named entities - companies, addresses, people, etc.
Embeddings of named entities can actually be misleading (imagine the embedding for Apple). Keyword search is a more direct way to retrieve data for named entities.

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [17]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [18]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [19]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints, the most common issue with loans appears to be problems related to the handling and management of the loans by lenders or servicers. These issues include errors in loan balances, misapplied payments, wrongful denials of payment plans, lack of proper communication, incorrect or inconsistent loan information, and mishandling of loan transfers. These problems often lead to confusion, incorrect billing, and disputes, indicating that administrative and servicing errors are a prevalent concern among borrowers.'

In [20]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes, based on the provided context, there are complaints that have not been handled in a timely manner. For example, one complaint regarding student loans has been open for over a year without resolution, and another has been pending for nearly 18 months. Additionally, a complaint about issues with account review and response from a loan servicer has also been ongoing for over a year.'

In [21]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for several reasons highlighted in the complaints:\n\n1. **Lack of Clear Information and Communication:** Borrowers were often not properly informed about their repayment obligations, loan transfer processes, or due dates. For example, some were unaware of the need to repay loans or were not notified when their loans were transferred between servicers.\n\n2. **Confusing and Inconsistent Account Information:** Borrowers experienced difficulties accessing accurate account details, leading to confusion about their balances and payment status. Discrepancies in outstanding balances and interest calculations contributed to the inability to manage repayment effectively.\n\n3. **Accumulation of Interest Despite Payments:** When loans were placed into forbearance or deferment, interest continued to accrue, often negating any payments made. This increased the total amount owed over time, making it harder to pay off the loans.\n\n4. **Limited Payment Options

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had....more than one query!

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how is it to set-up? Not bad! Let's see it down below!



In [22]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [23]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [24]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

"The most common issues with loans, based on the complaints provided, include:\n\n- Errors in loan balances and account information\n- Misapplications or mismanagement of payments\n- Unauthorized or incorrect reporting to credit bureaus\n- Problems with loan servicers, such as harassing calls, lack of communication, or failure to address disputes\n- Difficulties in obtaining accurate information about loan terms, payment plans, or forgiveness options\n- Issues stemming from loan sales and transfers between servicers\n- Inaccuracy in reporting late payments or delinquencies\n- Breaches of privacy and potential FERPA violations\n\nOverall, the most frequently cited issue appears to be **mismanagement and errors in account information, including incorrect balances, late payments, and improper reporting**, which significantly impacts borrowers' credit and financial stability. Additionally, problems with servicer communication and transparency are also prevalent.\n\nIf you need more specifi

In [25]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes, there were complaints that did not get handled in a timely manner. Specifically, the complaint from 03/28/25 regarding a student loan with MOHELA was marked "No" for timely response, indicating it was not handled promptly.'

In [26]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans primarily due to systemic issues and challenges related to loan servicing and borrower awareness. The key reasons include:\n\n1. **Lack of clear and timely communication:** Borrowers were often not notified when their loans were transferred between servicers or when forbearance or payment status changed, leading to confusion about payment obligations.\n\n2. **Inadequate guidance on repayment options:** Many borrowers were steered into forbearance or deferment without being informed about alternative options like income-driven repayment plans or loan rehabilitation, which could have been more manageable and preserved forgiveness benefits.\n\n3. **Accumulation of interest during forbearance:** While in forbearance, interest continued to accrue and capitalize, increasing the total amount owed, making it more difficult to pay off over time.\n\n4. **Complex and opaque loan servicing practices:** Borrowers faced complicated procedures, inconsistent info

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

#### Answer:

Imagine you have 10 golden chunks in the vector DB. Each version of the user query may have an 80% probability of retrieving all the golden chunks, so each might pull 8 golden chunks. With n versions you have a 1 - (0.2)^n probability of getting all golden chunks (that probability is equal to the probability across a test set).

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [27]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [28]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [29]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [30]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [31]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [32]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

"Based on the provided context, the most common issue with loans appears to be problems related to federal student loan servicing, such as errors in loan balances, misapplied payments, wrongful denials of payment plans, discrepancies in interest rates, and issues with credit reporting and information accuracy. Many complaints highlight systemic breakdowns, mismanagement by servicers, or inaccuracies in the reporting process.\n\nHowever, it is important to note that the context primarily describes specific complaints rather than indicating a single most common issue. Among these, issues with loan servicing, including misreporting and errors in balances or interest rates, seem prevalent.\n\nIf I had to summarize, the most common issues involve:\n\n- Errors in loan balances and misapplied payments\n- Discrepancies in interest rates and balances\n- Problems with loan reportage and credit reporting\n- Handling of forbearance and repayment plans\n\nThese recurring themes suggest that systemi

In [33]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, it appears that at least two complaints were marked as "Timely response?": "No," indicating they were not handled in a timely manner. Specifically, the complaints with IDs 12709087 and 12935889 from Mohela involved delays in resolution, with the complainant stating they had not heard from anyone after waiting several weeks, despite the company\'s responses indicating they did not respond promptly.\n\nTherefore, yes, some complaints did not get handled in a timely manner.'

In [34]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans primarily due to factors such as financial hardship, lack of proper information or transparency about repayment obligations, and issues with loan servicing. For example, some borrowers experienced difficulties because they were unaware when payments were expected to begin, had their payments resumed prematurely, or were unable to obtain proper guidance on repayment options after completing their studies. Others faced challenges because their educational institutions misrepresented the value of their degrees, or because the institutions faced financial problems that impacted graduates' employment prospects and ability to repay their loans. Additionally, issues like unauthorized reporting of late payments, failure to notify borrowers of changes in loan ownership, and difficulties in verifying the legitimacy of their debts further contributed to repayment problems."

Overall, the performance *seems* largely the same. We can leverage a tool like [Ragas]() to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes 2, or more, retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [35]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [36]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [37]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the complaints provided, appears to be dealing with your lender or servicer, specifically problems such as errors in loan balances, misapplied payments, wrongful denials of payment plans, bad or incorrect information about the loan, and poor communication or lack of proper documentation from the loan servicer. Many complaints also highlight issues with incorrect or inconsistent reporting of account status, improper handling of payment plans, and difficulties in obtaining accurate information or resolving disputes.'

In [38]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, it appears that some complaints were handled in a timely manner, while others experienced delays or were closed with explanations. Specifically:\n\n- Several complaints received timely responses and were closed with explanations, indicating they were addressed promptly.\n- However, multiple complaints, such as those with complaint IDs 12709087, 12935889, 12739706, and others, were marked as "Closed with explanation" and noted to have responses that were delayed (e.g., "No response" or "complaint took more than 30 days" in some cases).\n- Some complaints explicitly mention failures to respond or delays in investigation, such as complaints from EdFinancial Services (MOHELA, Maximus, EdFinancial), where responses were either late or the issue was unresolved after extended periods, including one marked "No" for timely response.\n- Others involved ongoing issues with no resolution or follow-up despite multiple contacts.\n\nIn summary, yes, some complaints 

In [39]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for several reasons, as reflected in the complaints provided:\n\n1. **Lack of Clear Communication or Notice**: Many borrowers were not properly informed about when their payments would restart after forbearance or deferment, leading to unintentional delinquency and missed payments. For example, some complaints mention not receiving notices or emails about repayment obligations or account transfers.\n\n2. **Systemic Errors and Mismanagement**: Errors such as incorrect account balances, mismatched figures, or late reporting to credit bureaus occurred due to mismanagement by loan servicers. Complaints highlight issues like incorrect account statuses, failure to verify debt legitimacy, and improper reporting that negatively impacted credit scores.\n\n3. **Inappropriate Handling of Forbearance and Payment Plans**: Borrowers were steered into long-term forbearance or improperly managed payment plans, often without understanding the accruing interest or 

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences based on their semantic similarity based on a number of [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example which will:

Calculate all distances between sentences, and then break apart sequences of setences that exceed a given percentile among all distances.

In [40]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [41]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [42]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [43]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [44]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [45]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be problems related to the handling and servicing of federal student loans. Specifically, frequent issues include:\n\n- Difficulty with repayment problems, such as struggling to repay or issues with payment plans and re-amortization.\n- Errors or disputes about account status, including incorrect reporting or default statuses.\n- Problems with loan servicing entities, such as miscommunication, lack of transparency, or unauthorized account activity.\n- Issues with loan information being used improperly or illegally, including privacy breaches, unauthorized data access, or incorrect reporting to credit bureaus.\n- Challenges in obtaining clear information or documentation about loan terms, balances, or payments.\n- Concerns about illegal reporting, collection practices, or the continued reporting of voided or legally questionable loans.\n\nOverall, a common theme is mismanagement or errors by loan servicers, 

In [46]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, some complaints did not get handled in a timely manner. Specifically, in the complaint regarding the transfer of the account to Nelnet, the consumer mentioned that despite multiple certified mail letters acknowledging receipt and detailing misconduct, Nelnet never responded to the complaint or provided any answers. This indicates that the complaint was not handled in a timely or effective manner. \n\nHowever, other complaints, such as those about autopay setup, payment issues, or disputes about information, show responses marked as "Closed with explanation" and, for some, responses were marked "Yes" for being timely. \n\nIn summary, at least one complaint (the transfer with misconduct allegations) was not handled in a timely manner, as evidenced by the lack of response despite multiple follow-up efforts.'

In [47]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including issues such as poor communication or lack of transparency with their lenders or servicers, difficulties in providing or verifying necessary documentation for loan forgiveness, problems with payment processing or account status inaccuracies, and disputes over the legitimacy or legality of the debt. Additionally, some borrowers experienced delays or errors related to re-amortization after forbearance periods ended, or faced complications due to improper handling of their personal information and data breaches. In some cases, borrowers felt the loan servicers were stalling or providing bad information, which hindered their ability to make payments or resolve issues.'

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

#### Answer:

Semantic chunking might group these repetivite sentences together if they are semantically similar.

Not sure if this is a thing, but to improve this I would consider just having an LLM chunk a document directly. Ask it to repeat back the document as a list/array where the elements are chunks, and the LLM decides how to keep semantically similar content contained within chunks.

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against eachother.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparision between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

In [None]:
### YOUR CODE HERE

In [48]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - Assignment 09 - {uuid4().hex[0:8]}"

In [50]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

In [64]:
len(docs)

269

In [65]:
docs[0]

Document(metadata={'producer': 'GPL Ghostscript 10.00.0', 'creator': 'wkhtmltopdf 0.12.6', 'creationdate': "D:20250418120630Z00'00'", 'source': 'data/Academic_Calenders_Cost_of_Attendance_and_Packaging.pdf', 'file_path': 'data/Academic_Calenders_Cost_of_Attendance_and_Packaging.pdf', 'total_pages': 57, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': "D:20250418120630Z00'00'", 'trapped': '', 'modDate': "D:20250418120630Z00'00'", 'creationDate': "D:20250418120630Z00'00'", 'page': 0}, page_content='Volume 3\nAcademic Calendars, Cost of Attendance, and\nPackaging\nIntroduction\nThis volume of the Federal Student Aid (FSA) Handbook discusses the academic calendar, payment period, and\ndisbursement requirements for awarding aid under the Title IV student financial aid programs, determining a student9s\ncost of attendance, and packaging Title IV aid.\nThroughout this volume of the Handbook, the words "we," "our," and "us" refer to the United States De

In [66]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from ragas.testset import TestsetGenerator

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/17 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/31 [00:00<?, ?it/s]

Property 'summary' already exists in node '7af88e'. Skipping!
Property 'summary' already exists in node '6de8f8'. Skipping!
Property 'summary' already exists in node 'e81b19'. Skipping!
Property 'summary' already exists in node '3b4a7f'. Skipping!
Property 'summary' already exists in node '128058'. Skipping!
Property 'summary' already exists in node '9776b3'. Skipping!
Property 'summary' already exists in node '67af12'. Skipping!
Property 'summary' already exists in node 'fdc653'. Skipping!
Property 'summary' already exists in node '3590d3'. Skipping!
Property 'summary' already exists in node '53cb30'. Skipping!
Property 'summary' already exists in node '75977e'. Skipping!
Property 'summary' already exists in node '6fe065'. Skipping!
Property 'summary' already exists in node '389246'. Skipping!
Property 'summary' already exists in node '19a278'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/6 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/43 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '7af88e'. Skipping!
Property 'summary_embedding' already exists in node '6de8f8'. Skipping!
Property 'summary_embedding' already exists in node 'fdc653'. Skipping!
Property 'summary_embedding' already exists in node 'e81b19'. Skipping!
Property 'summary_embedding' already exists in node '9776b3'. Skipping!
Property 'summary_embedding' already exists in node '389246'. Skipping!
Property 'summary_embedding' already exists in node '3b4a7f'. Skipping!
Property 'summary_embedding' already exists in node '3590d3'. Skipping!
Property 'summary_embedding' already exists in node '53cb30'. Skipping!
Property 'summary_embedding' already exists in node '75977e'. Skipping!
Property 'summary_embedding' already exists in node '19a278'. Skipping!
Property 'summary_embedding' already exists in node '128058'. Skipping!
Property 'summary_embedding' already exists in node '6fe065'. Skipping!
Property 'summary_embedding' already exists in node '67af12'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [79]:
from langsmith import Client

client = Client()

dataset_name = f"Loan Synthetic Data - Assignment 09 - RAGAS - {uuid4().hex[0:8]}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Loan Synthetic Data - Assignment 09 - RAGAS"
)

for idx, row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": row["user_input"]
      },
      outputs={
          "answer": row["reference"]
      },
      metadata={
          "context": row["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )


In [None]:
from langsmith.evaluation import evaluate
from ragas.metrics import (
    Faithfulness, FactualCorrectness, ResponseRelevancy,
    ContextEntityRecall, NoiseSensitivity,
)
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity

from ragas.integrations.langchain import EvaluatorChain
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

class PatchedEvaluator(EvaluatorChain):
    def evaluate_run(self, run, example=None, evaluator_run_id=None):
        return super().evaluate_run(run, example)
    
metrics = [
    LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()
]
llm = LangchainLLMWrapper(ChatOpenAI(model_name="gpt-4o-mini", temperature=0))
ragas_evaluators = [
    PatchedEvaluator(metric=m, llm=llm) for m in metrics
]

def run_eval(invocable, invocable_name, dataset_name):
    # Run evaluation with all metrics tracked in LangSmith
    evaluate(
        invocable.invoke,
        data=dataset_name,
        evaluators=ragas_evaluators,
        metadata={"revision_id": invocable_name},
    )

In [98]:
# from langsmith import evaluate           # SDK function you asked about

# # (1) Build one EvaluatorChain per metric
# llm = LangchainLLMWrapper(ChatOpenAI(model_name="gpt-4o-mini", temperature=0))
# metrics = [
#     Faithfulness(), FactualCorrectness(), ResponseRelevancy(),
#     ContextEntityRecall(), NoiseSensitivity(),
# ]
# custom_evaluators = [
#     EvaluatorChain(metric=m, llm=llm) for m in metrics
# ]

# # (2) Run your application against a LangSmith dataset
# results = evaluate(
#     app_or_chain_callable,     # your RAG or other LCEL callable
#     data="my-dataset-name",    # dataset slug / id
#     evaluators=custom_evaluators,
#     prediction_key="answer",   # whatever key your callable returns
#     upload_results=True,       # set False for local dry-run
# )
# print(results)

In [99]:
invocables = {
    "naive": naive_retrieval_chain,
    "bm25": bm25_retrieval_chain, 
    "contextual_compression": contextual_compression_retrieval_chain,
    "multi_query": multi_query_retrieval_chain,
    "parent_document": parent_document_retrieval_chain, 
    "ensemble": ensemble_retrieval_chain, 
    "semantic": semantic_retrieval_chain, 
}

In [100]:
from tqdm import tqdm

for invocable_name, invocable in tqdm(invocables.items()):
  print("Beginning evaluation of", invocable_name)
  result = run_eval(invocable, invocable_name, dataset_name)
  print(result)
  print("Evaluation complete")

  0%|          | 0/7 [00:00<?, ?it/s]

Beginning evaluation of naive
View the evaluation results for experiment: 'weary-punishment-5' at:
https://smith.langchain.com/o/6848a1f0-9351-4081-bbdc-eaa03dcd1c1a/datasets/285f08e3-dc5f-46d8-9fc6-719a1666bc03/compare?selectedSessions=63185981-0760-41cb-a0a1-bcdfb9571027




0it [00:00, ?it/s]

Error running evaluator PatchedEvaluator(verbose=False, metric=Faithfulness(_required_columns={<MetricType.SINGLE_TURN: 'single_turn'>: {'response', 'retrieved_contexts', 'user_input'}}, name='faithfulness', llm=LangchainLLMWrapper(langchain_llm=LangchainLLMWrapper(...)), output_type=<MetricOutputType.CONTINUOUS: 'continuous'>, nli_statements_message=NLIStatementPrompt(instruction=Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context., examples=[(NLIStatementInput(context='John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He 

None
Evaluation complete
Beginning evaluation of bm25
View the evaluation results for experiment: 'virtual-place-95' at:
https://smith.langchain.com/o/6848a1f0-9351-4081-bbdc-eaa03dcd1c1a/datasets/285f08e3-dc5f-46d8-9fc6-719a1666bc03/compare?selectedSessions=d01daa55-033c-46cb-b760-7b737e5ce1df




0it [00:00, ?it/s]

Error running evaluator PatchedEvaluator(verbose=False, metric=Faithfulness(_required_columns={<MetricType.SINGLE_TURN: 'single_turn'>: {'response', 'retrieved_contexts', 'user_input'}}, name='faithfulness', llm=LangchainLLMWrapper(langchain_llm=LangchainLLMWrapper(...)), output_type=<MetricOutputType.CONTINUOUS: 'continuous'>, nli_statements_message=NLIStatementPrompt(instruction=Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context., examples=[(NLIStatementInput(context='John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He 

None
Evaluation complete
Beginning evaluation of contextual_compression
View the evaluation results for experiment: 'impressionable-kitten-48' at:
https://smith.langchain.com/o/6848a1f0-9351-4081-bbdc-eaa03dcd1c1a/datasets/285f08e3-dc5f-46d8-9fc6-719a1666bc03/compare?selectedSessions=88c71305-3bd1-47e1-87fb-6eff94806835




0it [00:00, ?it/s]

Error running evaluator PatchedEvaluator(verbose=False, metric=Faithfulness(_required_columns={<MetricType.SINGLE_TURN: 'single_turn'>: {'response', 'retrieved_contexts', 'user_input'}}, name='faithfulness', llm=LangchainLLMWrapper(langchain_llm=LangchainLLMWrapper(...)), output_type=<MetricOutputType.CONTINUOUS: 'continuous'>, nli_statements_message=NLIStatementPrompt(instruction=Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context., examples=[(NLIStatementInput(context='John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He 

None
Evaluation complete
Beginning evaluation of multi_query
View the evaluation results for experiment: 'drab-trousers-35' at:
https://smith.langchain.com/o/6848a1f0-9351-4081-bbdc-eaa03dcd1c1a/datasets/285f08e3-dc5f-46d8-9fc6-719a1666bc03/compare?selectedSessions=958625ac-0529-4b93-b65f-62036c979d43




0it [00:00, ?it/s]

Error running evaluator PatchedEvaluator(verbose=False, metric=Faithfulness(_required_columns={<MetricType.SINGLE_TURN: 'single_turn'>: {'response', 'retrieved_contexts', 'user_input'}}, name='faithfulness', llm=LangchainLLMWrapper(langchain_llm=LangchainLLMWrapper(...)), output_type=<MetricOutputType.CONTINUOUS: 'continuous'>, nli_statements_message=NLIStatementPrompt(instruction=Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context., examples=[(NLIStatementInput(context='John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He 

None
Evaluation complete
Beginning evaluation of parent_document
View the evaluation results for experiment: 'mealy-tray-96' at:
https://smith.langchain.com/o/6848a1f0-9351-4081-bbdc-eaa03dcd1c1a/datasets/285f08e3-dc5f-46d8-9fc6-719a1666bc03/compare?selectedSessions=30c4e56f-feb3-4495-90ef-63f59ff38c80




0it [00:00, ?it/s]

Error running evaluator PatchedEvaluator(verbose=False, metric=Faithfulness(_required_columns={<MetricType.SINGLE_TURN: 'single_turn'>: {'response', 'retrieved_contexts', 'user_input'}}, name='faithfulness', llm=LangchainLLMWrapper(langchain_llm=LangchainLLMWrapper(...)), output_type=<MetricOutputType.CONTINUOUS: 'continuous'>, nli_statements_message=NLIStatementPrompt(instruction=Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context., examples=[(NLIStatementInput(context='John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He 

None
Evaluation complete
Beginning evaluation of ensemble
View the evaluation results for experiment: 'aching-tooth-77' at:
https://smith.langchain.com/o/6848a1f0-9351-4081-bbdc-eaa03dcd1c1a/datasets/285f08e3-dc5f-46d8-9fc6-719a1666bc03/compare?selectedSessions=c445b221-1fe9-4eee-9e43-c91d7925e5ed




0it [00:00, ?it/s]

Error running evaluator PatchedEvaluator(verbose=False, metric=Faithfulness(_required_columns={<MetricType.SINGLE_TURN: 'single_turn'>: {'response', 'retrieved_contexts', 'user_input'}}, name='faithfulness', llm=LangchainLLMWrapper(langchain_llm=LangchainLLMWrapper(...)), output_type=<MetricOutputType.CONTINUOUS: 'continuous'>, nli_statements_message=NLIStatementPrompt(instruction=Your task is to judge the faithfulness of a series of statements based on a given context. For each statement you must return verdict as 1 if the statement can be directly inferred based on the context or 0 if the statement can not be directly inferred based on the context., examples=[(NLIStatementInput(context='John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He 

KeyboardInterrupt: 

In [None]:
from ragas import EvaluationDataset
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig
import copy

def run_eval_ragas(invocable, dataset):

  dataset_this = copy.deepcopy(dataset)

  for test_row in dataset_this:
    response = invocable.invoke({"question" : test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"].content
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

  evaluation_dataset = EvaluationDataset.from_pandas(dataset_this.to_pandas())
  evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

  custom_run_config = RunConfig(timeout=360)
  result = evaluate(
      dataset=evaluation_dataset,
      metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
      llm=evaluator_llm,
      run_config=custom_run_config
  )
  return result