# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [82]:
from dotenv import load_dotenv
import os

load_dotenv()

True

In [83]:
# import os
# import getpass

# os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [84]:
# os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [85]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [86]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [87]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)


## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each complaint as a document, and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later
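
To make "cosine similarity, top `k`" concrete, here's a minimal pure-Python sketch of what the vector store does conceptually. The 2-D vectors are made up for illustration (real embeddings from `text-embedding-3-small` have 1536 dimensions), and `cosine_similarity` and `top_k` are hypothetical helper names, not LangChain or Qdrant APIs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy 2-D "embeddings" (made up for illustration).
toy_doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
toy_query_vec = [0.9, 0.1]

top_two = top_k(toy_query_vec, toy_doc_vecs, k=2)
```

A real vector store uses an index rather than sorting every document, but the ranking it approximates is the same idea.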

In [None]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [None]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [None]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [None]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [None]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issues with loans appear to involve errors and mismanagement related to loan balances, payments, or information. Specific issues include:\n\n- Errors in loan balances, misapplied payments, or wrongful denials of payment plans.\n- Incorrect information on credit reports, such as wrongful delinquency reporting.\n- Problems with how payments are being handled, including restrictions on applying funds to principal or paying off loans faster.\n- Discrepancies in loan amounts, balances, and interest, often growing despite payments.\n- Unethical practices like improper transfer of loans, lack of transparency, and failure to communicate changes.\n- Problems with loan sale and transfer notifications.\n- Disputes over false or misleading information and violations of privacy laws.\n\nGiven this, one of the most common issues appears to be **"Dealing with the lender or servicer"**, especially involving errors, mismanagement, or lack of transparency.

In [None]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, some complaints were not handled in a timely manner. For example:\n\n- One complaint submitted to MOHELA on 03/28/25 was marked as "Not timely" response, indicating it was handled late.\n- Multiple complaints indicate delays or failure to resolve issues promptly, such as an unresolved dispute for over 1 year or responses marked as "None" or "Closed with explanation" despite ongoing issues.\n\nTherefore, there is evidence that some complaints did not get handled promptly.'

In [None]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People often fail to pay back their loans due to a combination of financial hardships, mismanagement, lack of clear communication, and unmanageable interest accumulation. Based on the provided complaints:\n\n- Many borrowers are unable to increase their payments without jeopardizing their ability to cover basic living expenses such as bills, food, and transportation.\n- Loan interest continues to accumulate even when in forbearance or deferment, increasing the total amount owed over time and extending the payoff period.\n- Borrowers report not being adequately informed or notified about loan status changes, transfers between servicers, or obligations, leading to missed payments or credit reporting errors.\n- Some borrowers' loans were mismanaged or transferred without proper notification; they were unaware of payment resumption or owed amounts, resulting in delinquency and negative credit impacts.\n- Certain repayment plans or inability to apply extra payments directly to the principa

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
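
If you'd like to see the scoring idea spelled out, here's a rough sketch of one common Okapi BM25 variant in pure Python. It assumes whitespace tokenization and the parameter values `k1=1.5` and `b=0.75`; the `bm25_scores` helper and the toy complaints are made up for illustration, and the library behind `BM25Retriever` may differ in its details.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with an Okapi BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a larger inverse-document-frequency weight.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency saturates via k1; b normalizes for document length.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores

# Toy complaints (made up for illustration).
toy_docs = [
    "nelnet doubled my monthly payment",
    "my payment was applied to the wrong loan",
    "the servicer never answered my calls",
]
toy_scores = bm25_scores("nelnet payment", toy_docs)
best_doc = toy_docs[toy_scores.index(max(toy_scores))]
```

Notice that a document sharing no words with the query scores exactly zero, which is the "sparse" behaviour that separates BM25 from embedding similarity.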

This retriever is very straightforward to set up! Let's see it happen down below!


In [None]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)

We'll construct the same chain - only changing the retriever.

In [None]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [None]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be problems dealing with lenders or servicers, such as disputes over fees, applying payments correctly, or obtaining accurate information about the loan. A specific recurring issue is the difficulty in managing payments—such as applying extra funds to principal— and receiving incomplete or incorrect information about loan balances or terms. Many complaints also involve feeling misled or experiencing delays in service and communication.'

In [None]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all the complaints in the context received timely responses from the companies. Specifically, the complaints that detail ongoing issues mention that the companies responded "Closed with explanation" and were marked as "Timely response? Yes." Therefore, it appears that no complaints were left unhandled in a timely manner.'

In [None]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including issues with payment plans, miscommunication or lack of communication from lenders or servicers, technical problems with payments (such as payments being reversed or not processed), and sometimes because they were not properly informed about changes or statuses of their loans. In some cases, borrowers were led into incorrect repayment or forbearance options, and others experienced automated or administrative errors that prevented their payments from being properly processed or acknowledged. Additionally, some borrowers cited being unaware of transfers between loan servicers and not receiving important notifications, which contributed to missed or delayed payments.'

It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: We do, and the second half of the notebook will cover it.)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

#### Answer:

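Queries built around named entities (companies, addresses, people, etc.), for example "Which complaints mention Nelnet?".
Embeddings of named entities can actually be misleading (imagine the embedding for "Apple", which blends the fruit and the company), whereas keyword search matches the exact token, so it is a more direct way to retrieve data for named entities.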

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
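
The two steps above can be sketched in plain Python. This is only an illustration of the retrieve-then-rerank shape: `word_overlap` stands in for the first-stage retriever and `overlap_ratio` stands in for a learned reranking model, and both (along with the toy documents and the `k` and `top_n` values) are assumptions made up for this example.

```python
def retrieve_then_rerank(query, docs, first_stage_score, rerank_score, k=10, top_n=3):
    """Fetch k candidates with a cheap scorer, then keep the top_n by a stronger scorer."""
    candidates = sorted(docs, key=lambda d: first_stage_score(query, d), reverse=True)[:k]
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:top_n]

def word_overlap(query, doc):
    """Cheap first-stage stand-in: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def overlap_ratio(query, doc):
    """Stand-in for a reranking model: shared words relative to document length."""
    return word_overlap(query, doc) / len(doc.lower().split())

# Toy documents (made up for illustration).
toy_docs = [
    "late payment fee charged twice",
    "payment fee",
    "my late payment fee was reversed after a very long wait on the phone",
    "loan transferred without notice",
]
top_docs = retrieve_then_rerank(
    "late payment fee", toy_docs, word_overlap, overlap_ratio, k=3, top_n=2
)
```

The first stage casts a wide net (`k` candidates), and the second stage is free to reorder them before cutting down to `top_n`, so the final order can differ from the first-stage order.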

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [None]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [None]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [None]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with loans appears to be problems related to dealing with lenders or servicers, including errors and mismanagement such as incorrect information, misapplied payments, unauthorized transfers, and lack of communication. Specifically, for student loans, issues like incorrect balances, untransparent interest calculations, mishandling of data, and violations of privacy laws are prominent. Overall, miscommunication, errors, and mishandling by loan servicers seem to be the most frequent issues reported.'

In [None]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, at least one complaint did not get handled in a timely manner. Specifically, the complaint from the individual regarding their student loan account issues has been open for over a year and a half (nearly 18 months) without resolution, despite claims of waiting for responses and attempting multiple follow-ups. The complaint from the other individual regarding issues with payments not appearing on their account appears to have been responded to by the company within a timely manner, as indicated by the "Timely response?": "Yes" statement.\n\nTherefore, yes, some complaints, like the one involving the nearly 18-month unresolved issue, were not handled in a timely manner.'

In [None]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans primarily due to several issues highlighted in the complaints:\n\n1. Lack of Clear Information: Borrowers were often not properly informed by financial aid officers or loan servicers about the requirement to repay loans, leading to unawareness of repayment obligations.\n\n2. Administrative and Communication Issues: Changes in loan servicers (e.g., transfers without notice), failure to notify borrowers about payment due dates, or updates, and difficulty accessing accurate information online hindered borrowers from making timely payments.\n\n3. Problematic Loan Management: Many borrowers experienced discrepancies in their account information, continued interest accrual even during deferment or forbearance, and were unable to receive detailed breakdowns of their loans' interest and balance, making it difficult to understand or manage their debt.\n\n4. Limited Payment Options and High Interest: The available options, such as forbearance or deferment, 

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
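
The three steps above boil down to a "union of results" loop. Here's a minimal sketch, assuming the query variants have already been written (in the real retriever an LLM generates them); `multi_query_retrieve` and the `toy_index` lookup table are hypothetical stand-ins, not LangChain APIs.

```python
def multi_query_retrieve(query_variants, retrieve):
    """Run every query variant and return the unique documents in first-seen order."""
    seen = set()
    unique_docs = []
    for variant in query_variants:
        for doc in retrieve(variant):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

# Hypothetical lookup table standing in for a real retriever.
toy_index = {
    "why did borrowers miss payments": ["doc_a", "doc_b"],
    "reasons for loan default": ["doc_b", "doc_c"],
    "what caused late loan payments": ["doc_c", "doc_d"],
}

merged_docs = multi_query_retrieve(list(toy_index), lambda q: toy_index[q])
```

Each variant alone returns two documents here, while the merged result covers four, which is the whole point of the technique.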

So, how hard is it to set up? Not bad! Let's see it down below!



In [None]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [None]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [None]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans appears to be problems related to servicing and management of the loans, including errors in loan balances, misapplied payments, improper reporting of delinquency or default, lack of communication from servicers, and difficulties in obtaining accurate and transparent information about the loan terms and status. Many complaints also involve issues with how payments are handled, incorrect credit reporting, and servicing failures that lead to financial hardship for borrowers.'

In [None]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, some complaints indicate that issues were not handled in a timely manner. For example:\n\n- One complaint from EdFinancial Services (Complaint ID: 12709087) was marked "No" under timely response, meaning it was not addressed within the expected timeframe, with a delay of over 2-3 weeks.\n- Another complaint from MOHELA (Complaint ID: 12654977) was also marked "No" for timely response, with wait times exceeding 3 hours and no reply after multiple follow-ups.\n- Several complaints from Maximus Federal Services (Aidvantage) mention delays and lack of response over periods of over a year, indicating persistent handling delays.\n\nTherefore, the answer is: Yes, some complaints did not get handled in a timely manner.'

In [None]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to a combination of factors highlighted in the complaints and narratives:\n\n1. Lack of clear and adequate information from lenders and servicers about repayment obligations, interest accrual, and available repayment options, leading to unintentional delinquency.\n2. Mismanagement and errors by loan servicers, such as misapplied or missing payments, incorrect account reporting, or delays in updating account statuses, which negatively impacted credit scores and caused confusion.\n3. High interest rates, especially when interest continued to accrue during forbearance or deferment, making repayment more difficult and extending the payoff period.\n4. Insufficient communication regarding loan transfers, changes in servicers, or resumption of payments, resulting in borrowers being unaware of when repayment was due or how to manage their loans.\n5. Limited access to or knowledge of alternative repayment plans such as income-driven repayment

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

#### Answer:

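Imagine you have 10 golden chunks in the vector DB. Suppose each version of the user query has an 80% probability of retrieving any given golden chunk, so a single query pulls about 8 of the 10 on average. If the `n` versions miss independently, each golden chunk is retrieved by at least one of them with probability 1 - (0.2)^n, and that per-chunk probability is also the expected recall across a test set: 80% with one query, 96% with two, and about 99% with three. In practice reformulations are correlated, so the gain is smaller, but the union of their results still covers more golden chunks than any single query.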
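
As a quick back-of-the-envelope check of the 1 - (0.2)^n figure, here's a tiny sketch. It assumes each query version retrieves a given golden chunk with probability 0.8 and that the versions miss independently, which is an idealization; `chance_chunk_is_retrieved` is a hypothetical helper name.

```python
def chance_chunk_is_retrieved(per_query_hit_rate, n_queries):
    """Probability that at least one of n independent queries retrieves a given chunk."""
    return 1 - (1 - per_query_hit_rate) ** n_queries

# Per-chunk retrieval probability for 1, 2, and 3 query versions.
recall_by_n = {n: round(chance_chunk_is_retrieved(0.8, n), 3) for n in (1, 2, 3)}
```

Under these assumptions the chance of catching any one golden chunk goes from 0.8 to 0.96 to 0.992 as we add query versions.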

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents
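
Here's a minimal sketch of "search small, return big" using plain dictionaries and lists. It assumes fixed-size character chunks and a word-overlap score in place of embedding similarity; every name below (`build_child_index`, `retrieve_parents`, `shared_words`, the toy parents) is made up for illustration.

```python
def build_child_index(parent_docs, chunk_size):
    """Split each parent into fixed-size chunks, remembering which parent each came from."""
    child_index = []  # list of (chunk_text, parent_id)
    for parent_id, text in parent_docs.items():
        for start in range(0, len(text), chunk_size):
            child_index.append((text[start:start + chunk_size], parent_id))
    return child_index

def retrieve_parents(query, child_index, parent_docs, score, k=2):
    """Rank child chunks, then return the distinct parents of the top-k chunks."""
    ranked = sorted(child_index, key=lambda item: score(query, item[0]), reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parent_docs[pid] for pid in parent_ids]

def shared_words(query, chunk):
    """Stand-in for embedding similarity: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

# Toy parent documents (made up for illustration).
toy_parents = {
    "p1": "My servicer misapplied my payment. I called many times and got no answer.",
    "p2": "The loan was transferred without notice. I never received a statement.",
}
toy_children = build_child_index(toy_parents, chunk_size=40)
results = retrieve_parents(
    "transferred without notice", toy_children, toy_parents, shared_words, k=1
)
```

The match happens on a 40-character child chunk, but what comes back is the full parent text, including the sentence about the missing statement that the chunk itself did not contain.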

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [None]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [None]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [None]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [None]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [None]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [None]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be related to errors in loan servicing, such as misreporting of account information, errors in loan balances, misapplied payments, wrongful denials of payment plans, and disputes over interest rates and fees. Many consumers also report misconduct by loan servicers, including incorrect or misleading information, unverified debt collection, and issues arising from loan transfers.'

In [None]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided context, all the complaints listed were handled in a timely manner. Specifically, the complaint regarding dispute settlement with credit bureaus (Complaint ID: 13205525) was responded to within 30 days, which is generally considered timely. \n\nHowever, the complaints about issues with student loan processing, payment handling, and communication from Mohela (Complaint IDs: 12709087 and 12935889) explicitly indicate that responses were "No" or "Closed with explanation" but prior messages suggest these issues were not resolved promptly, and it was noted that some responses were delayed or not handled in a timely fashion.\n\nTherefore, yes, some complaints did not get handled in a timely manner, notably the ones involving Mohela, where responses were marked as "No" or lacking response within expected timeframes.'

In [None]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People often fail to pay back their loans due to various reasons, including financial hardship, mismanagement, or lack of proper information or support. The provided complaints highlight some specific causes, such as:\n\n- Severe financial hardship post-graduation making it difficult to keep up with payments.\n- Lack of clear communication from loan servicers regarding payment obligations, repayment start dates, or changes in terms.\n- Misrepresentation or lack of transparency about the value of education, job prospects, and long-term financial consequences, leading to unexpected debt burdens.\n- Problems with loan servicing practices, such as failure to notify borrowers about payments, changes in loan ownership, or issues with verification of debt legitimacy.\n- Reliance on deferment and forbearance options that increase interest and complicate repayment.\n- Additional complications like issues with credit reporting or legal questions about the legitimacy of the debts.\n\nIn summary,

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the performance question more rigorously.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes two or more retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.
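
For intuition, here's a small sketch of weighted Reciprocal Rank Fusion in plain Python. It assumes the constant `c=60` suggested in the RRF paper and that each retriever's weight simply scales its contribution; `weighted_rrf` and the toy rankings are hypothetical, and the `EnsembleRetriever` implementation may differ in details such as tie-breaking.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each list adds weight / (c + rank) to a document's score."""
    fused = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (c + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)

# Toy rankings (made up for illustration).
toy_rankings = [
    ["doc_a", "doc_b", "doc_c"],  # e.g. from a keyword retriever
    ["doc_b", "doc_c", "doc_a"],  # e.g. from an embedding retriever
]
fused_order = weighted_rrf(toy_rankings, weights=[0.5, 0.5])
```

Only ranks matter here, never raw scores, which is why RRF can combine retrievers whose scores live on completely different scales.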

In [None]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [None]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [None]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the complaints data provided, appears to be related to the mishandling and inaccuracies in loan servicing. Specifically, frequent issues include errors in loan balances, misapplied payments, wrongful default reporting, incorrect credit reporting, inadequate communication from servicers, and problems with loan transfer and data handling. Many complaints highlight failures to properly notify borrowers about default status, incorrect account statuses, and mishandling of applications for income-driven repayment or forgiveness programs. \n\nIn summary, the most common issue is **servicer misconduct involving errors, miscommunication, and improper handling of student loan accounts**.'

In [None]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided data, yes, there are several complaints indicating that complaints were not handled in a timely manner. Specifically:\n\n- Complaint ID 12709087 (on 03/28/25): The complaint was marked "Timely response?": No, which suggests it was not handled promptly.\n- Complaint ID 12935889 (on 04/11/25): The response was "Closed with explanation" but marked "Timely response?": No, indicating delays.\n- Complaint ID 13062402 (on 04/18/25): The response was "Closed with explanation" with "Timely response?": Yes.\n- Complaint ID 12650717 (on 03/25/25): Marked "Timely response?": No.\n- Complaint ID 13001206 (on 04/16/25): Marked "Timely response?": Yes, but it was marked "Closed with explanation."\n- Several other complaints also show delays or inadequate follow-up, often marked "Closed with explanation" or with no response from the companies.\n\nOverall, multiple complaints explicitly state that they were not handled within the expected timeframe or that responses were delayed,

In [None]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to factors such as:\n\n- Lack of adequate information about how interest compounds, leading to rising balances even when making payments.\n- Limited or confusing communication from loan servicers about payment obligations, due dates, or changes in servicing companies.\n- Financial hardships, including unemployment, low income, or unexpected expenses, making it difficult to afford payments.\n- Mismanagement or misapplication of payments, often resulting in increased debt and poor credit outcomes.\n- Unfair and often opaque repayment processes, including restrictions on paying down principal or paying more quickly, which prolongs debt.\n- In some cases, borrowers were unaware of their upcoming repayment start dates or owed amounts due to inadequate notifications.\n- Errors in loan balance reporting, account status, or incorrect transfers between servicers have contributed to confusion and difficulty in repayment.\n- Borrowers also cite

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example, which will:

Calculate the distances between consecutive sentences, and then split the sequence wherever a distance exceeds a given percentile of all distances.
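To make the percentile rule concrete, here is a minimal pure-Python sketch of the idea. It is not LangChain's implementation: the toy 2-D vectors stand in for real embeddings, the helper names (`cosine_distance`, `percentile`, `percentile_chunks`) are hypothetical, and the default of 95 is an assumed cut-off for illustration.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def percentile_chunks(sentences, vectors, pct=95.0):
    """Split wherever the distance to the next sentence exceeds the percentile threshold."""
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large semantic jump: close the current chunk and start a new one
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Toy data: hand-written vectors, NOT real embeddings
sentences = [
    "My loan payment was late.",
    "The servicer charged a late fee.",
    "I love hiking.",
    "Mountains are beautiful.",
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

chunks = percentile_chunks(sentences, vectors)
for chunk in chunks:
    print(chunk)
```

With these toy vectors the only large jump is between the second and third sentences, so the two loan sentences end up in one chunk and the two hiking sentences in another.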

In [None]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [None]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [None]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [None]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [None]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [None]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints data, one of the most common issues with loans appears to be problems related to loan servicing and communication. Specific frequent issues include:\n\n- Struggling to repay or misunderstandings about repayment terms (e.g., incorrect payment amounts, delays, or re-amortization issues).\n- Poor communication from loan servicers, including lack of timely or clear information about loan status, payments, or changes (e.g., misinformation about loan status, servicing changes, or availability of accounts).\n- Errors in credit reporting, such as incorrect defaults, delinquencies, or improper account statuses.\n- Unauthorized or improper collection activities and violations of privacy laws.\n- Problems with auto-debit setup, missing payments, or discrepancies in payment records.\n\nIn summary, common issues revolve around mismanagement or miscommunication by loan servicers, errors in account information or reporting, and difficulties borrowers face in managing

In [None]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, several instances indicate that complaints were handled in a timely manner. For example, complaints with IDs 13331376, 13207537, 13425612, 13281034, 13020950, 12962044, 13179688, and 13347464 all received responses labeled as "Closed with explanation" within the expected 30-day period, and each explicitly states "Timely response?": "Yes". \n\nThere are no complaints in the provided data suggesting that any complaints were not handled in a timely manner.'

In [None]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including issues such as receiving bad or unclear information about their loans or payment status, difficulties in communicating with their loan servicers, and delays or errors in processing their payments. Some borrowers also experienced disputes over the legitimacy or status of their loans, leading to delays or refusal in repayment. Additionally, in certain cases, borrowers faced problems related to incorrect reporting of their loan status, default accusations despite never being in default, or complications due to changes in their loan transfer or servicing agencies. These issues often caused stress, confusion, and in some cases, legal disputes, which contributed to the failure to repay the loans as expected.'

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

#### Answer:

Semantic chunking might group these repetitive sentences together if they are semantically similar.

Not sure if this is a thing, but to improve this I would consider having an LLM chunk the document directly: ask it to return the document as a list/array whose elements are chunks, and let the LLM decide how to keep semantically similar content within the same chunk.

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
   - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever-specific* Ragas metrics
   - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.
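
If you also want a quick local latency number alongside LangSmith's, a small timing helper is one option. The sketch below is an assumption-laden illustration: `measure_latency`, `summarize`, and `fake_invoke` are hypothetical names, and `fake_invoke` merely stands in for a chain's `.invoke` (for example `naive_retrieval_chain.invoke`). It measures wall-clock time only and says nothing about cost.

```python
import statistics
import time

def measure_latency(invoke, payloads, repeats=1):
    """Return per-call wall-clock latencies (in seconds) for `invoke` over `payloads`."""
    latencies = []
    for _ in range(repeats):
        for payload in payloads:
            start = time.perf_counter()
            invoke(payload)
            latencies.append(time.perf_counter() - start)
    return latencies

def summarize(latencies):
    """Collapse a list of latencies into a few headline numbers."""
    return {
        "calls": len(latencies),
        "mean_s": statistics.mean(latencies),
        "max_s": max(latencies),
    }

# Stand-in for a real chain; replace with something like `naive_retrieval_chain.invoke`
def fake_invoke(payload):
    time.sleep(0.01)
    return {"response": f"echo: {payload['question']}"}

questions = [{"question": "What is the most common issue with loans?"}]
summary = summarize(measure_latency(fake_invoke, questions, repeats=3))
print(summary)
```

Running the same questions through every chain gives comparable numbers, although network variance means a handful of calls is only a rough signal.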

In [None]:
### YOUR CODE HERE

In [None]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - Assignment 09 - {uuid4().hex[0:8]}"

In [None]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

In [None]:
len(docs)

269

In [None]:
docs[0]

Document(metadata={'producer': 'GPL Ghostscript 10.00.0', 'creator': 'wkhtmltopdf 0.12.6', 'creationdate': "D:20250418120630Z00'00'", 'source': 'data/Academic_Calenders_Cost_of_Attendance_and_Packaging.pdf', 'file_path': 'data/Academic_Calenders_Cost_of_Attendance_and_Packaging.pdf', 'total_pages': 57, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': "D:20250418120630Z00'00'", 'trapped': '', 'modDate': "D:20250418120630Z00'00'", 'creationDate': "D:20250418120630Z00'00'", 'page': 0}, page_content='Volume 3\nAcademic Calendars, Cost of Attendance, and\nPackaging\nIntroduction\nThis volume of the Federal Student Aid (FSA) Handbook discusses the academic calendar, payment period, and\ndisbursement requirements for awarding aid under the Title IV student financial aid programs, determining a student9s\ncost of attendance, and packaging Title IV aid.\nThroughout this volume of the Handbook, the words "we," "our," and "us" refer to the United States De

In [None]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from ragas.testset import TestsetGenerator

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/17 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/31 [00:00<?, ?it/s]

Property 'summary' already exists in node 'f29b2a'. Skipping!
Property 'summary' already exists in node '8b67fd'. Skipping!
Property 'summary' already exists in node '866700'. Skipping!
Property 'summary' already exists in node 'ab944f'. Skipping!
Property 'summary' already exists in node '8df04a'. Skipping!
Property 'summary' already exists in node '592892'. Skipping!
Property 'summary' already exists in node 'e6a1bd'. Skipping!
Property 'summary' already exists in node '4f5176'. Skipping!
Property 'summary' already exists in node 'ffaecf'. Skipping!
Property 'summary' already exists in node '38dfce'. Skipping!
Property 'summary' already exists in node '2b167b'. Skipping!
Property 'summary' already exists in node '7d4f32'. Skipping!
Property 'summary' already exists in node 'd3293f'. Skipping!
Property 'summary' already exists in node '5d48a1'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/6 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/43 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'f29b2a'. Skipping!
Property 'summary_embedding' already exists in node '8b67fd'. Skipping!
Property 'summary_embedding' already exists in node '8df04a'. Skipping!
Property 'summary_embedding' already exists in node '866700'. Skipping!
Property 'summary_embedding' already exists in node '7d4f32'. Skipping!
Property 'summary_embedding' already exists in node '38dfce'. Skipping!
Property 'summary_embedding' already exists in node 'ab944f'. Skipping!
Property 'summary_embedding' already exists in node '4f5176'. Skipping!
Property 'summary_embedding' already exists in node '592892'. Skipping!
Property 'summary_embedding' already exists in node 'ffaecf'. Skipping!
Property 'summary_embedding' already exists in node 'd3293f'. Skipping!
Property 'summary_embedding' already exists in node 'e6a1bd'. Skipping!
Property 'summary_embedding' already exists in node '5d48a1'. Skipping!
Property 'summary_embedding' already exists in node '2b167b'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [111]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | How do academic calendars impact federal stude... | `[Chapter 1 Academic Years, Academic Calendars,...` | Academic calendars, including the definition o... | single_hop_specifc_query_synthesizer |
| 1 | What is 34 CFR 668.3(a) about in relation to a... | `[Regulatory Citations Academic year minimums: ...` | Regulatory Citations Academic year minimums: 3... | single_hop_specifc_query_synthesizer |
| 2 | How does the inclusion of clinical work in a s... | `[Inclusion of Clinical Work in a Standard Term...` | Inclusion of clinical work in a standard term ... | single_hop_specifc_query_synthesizer |
| 3 | Can you explain what the FWS program is and ho... | `[Non-Term Characteristics A program that measu...` | The Federal Work-Study (FWS) Program is an exc... | single_hop_specifc_query_synthesizer |
| 4 | whn clinical work is in a non-term calndar, ho... | `[<1-hop>\n\nInclusion of Clinical Work in a St...` | When clinical work is included in a non-term c... | multi_hop_abstract_query_synthesizer |
| 5 | Hwo do the regultory citatons for acedemic pro... | `[<1-hop>\n\nChapter 1 Academic Years, Academic...` | The regulatory citations, such as 34 CFR 668.3... | multi_hop_abstract_query_synthesizer |
| 6 | what are the titl iv program requirments for a... | `[<1-hop>\n\nChapter 1 Academic Years, Academic...` | The requirements for the academic year under t... | multi_hop_abstract_query_synthesizer |
| 7 | guidance on calculating grant awards and loan ... | `[<1-hop>\n\nboth the credit or clock hours and...` | The context explains that the amount of Pell G... | multi_hop_abstract_query_synthesizer |
| 8 | How does Volume 8 relate to the inclusion of c... | `[<1-hop>\n\nInclusion of Clinical Work in a St...` | Volume 8 provides guidance on the inclusion of... | multi_hop_specific_query_synthesizer |
| 9 | Hwo do Appendix A and Appendix B relate to the... | `[<1-hop>\n\nboth the credit or clock hours and...` | Appendix A provides guidance on calculating Pe... | multi_hop_specific_query_synthesizer |


In [118]:
from langsmith import Client

client = Client()

dataset_name = f"Loan Synthetic Data - Assignment 09 - RAGAS - {uuid4().hex[0:8]}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Loan Synthetic Data - Assignment 09 - RAGAS"
)

for idx, row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": row["user_input"]
      },
      outputs={
          "answer": row["reference"],
          "context": row["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )

# for idx, row in dataset.to_pandas().iterrows():
#     client.create_example(
#     inputs={"question": row["user_input"]},
#     outputs={"ground_truth": row["reference"]},
#     metadata={
#         "context": row["reference_contexts"]
#     },
#     dataset_id=langsmith_dataset.id,
# )


In [None]:
# from langsmith.evaluation import evaluate
# from ragas.metrics import (
#     Faithfulness, FactualCorrectness, ResponseRelevancy,
#     ContextEntityRecall, NoiseSensitivity,
# )
# from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity

# from ragas.integrations.langchain import EvaluatorChain
# from langchain_openai import ChatOpenAI
# from ragas.llms import LangchainLLMWrapper

# class PatchedEvaluator(EvaluatorChain):
#     def evaluate_run(self, run, example=None, evaluator_run_id=None):
#         return super().evaluate_run(run, example)
    
# metrics = [
#     LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()
# ]
# llm = LangchainLLMWrapper(ChatOpenAI(model_name="gpt-4o-mini", temperature=0))
# ragas_evaluators = [
#     PatchedEvaluator(metric=m, llm=llm) for m in metrics
# ]

# def run_eval(invocable, invocable_name, dataset_name):
#     # Run evaluation with all metrics tracked in LangSmith
#     evaluate(
#         invocable.invoke,
#         data=dataset_name,
#         evaluators=ragas_evaluators,
#         metadata={"revision_id": invocable_name},
#     )

In [None]:
from ragas.dataset_schema import SingleTurnSample, MultiTurnSample
from ragas.metrics import (
    Faithfulness, 
    FactualCorrectness, 
    ResponseRelevancy, 
    ContextEntityRecall, 
    NoiseSensitivity,
    LLMContextRecall,
    ToolCallAccuracy,
    AgentGoalAccuracyWithReference,
    TopicAdherenceScore
)
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
from typing import Dict, Any, Callable, Type, List, Union, Optional


class MetricFactory:
    """
    Advanced factory for creating evaluation functions for different Ragas metrics.
    Handles different sample types and metric requirements.
    """
    
    def __init__(self, llm=None):
        """
        Initialize the factory with an optional LLM.
        
        Args:
            llm: Optional LLM wrapper. If None, creates a default one.
        """
        if llm is None:
            self.llm = LangchainLLMWrapper(ChatOpenAI(model_name="gpt-4o-mini", temperature=0))
        else:
            self.llm = llm
        
        # Define which metrics use which sample types
        self.single_turn_metrics = {
            Faithfulness, 
            FactualCorrectness, 
            ResponseRelevancy, 
            ContextEntityRecall, 
            NoiseSensitivity,
            LLMContextRecall
        }
        
        self.multi_turn_metrics = {
            ToolCallAccuracy,
            AgentGoalAccuracyWithReference,
            TopicAdherenceScore
        }
        
        # Define metric signatures and required parameters
        self.metric_signatures = {
            Faithfulness: {
                "required_params": ["user_input", "response", "retrieved_contexts"],
                "optional_params": []
            },
            FactualCorrectness: {
                # FactualCorrectness compares the response against the reference answer
                "required_params": ["response", "reference"],
                "optional_params": []
            },
            ResponseRelevancy: {
                "required_params": ["user_input", "response", "retrieved_contexts"],
                "optional_params": []
            },
            ContextEntityRecall: {
                "required_params": ["reference", "retrieved_contexts"],
                "optional_params": []
            },
            NoiseSensitivity: {
                "required_params": ["user_input", "response", "reference", "retrieved_contexts"],
                "optional_params": []
            },
            LLMContextRecall: {
                "required_params": ["user_input", "response", "reference", "retrieved_contexts"],
                "optional_params": []
            },
            ToolCallAccuracy: {
                "required_params": ["user_input", "reference_tool_calls"],
                "optional_params": []
            },
            AgentGoalAccuracyWithReference: {
                "required_params": ["user_input", "reference"],
                "optional_params": []
            },
            TopicAdherenceScore: {
                "required_params": ["user_input", "reference_topics"],
                "optional_params": ["mode"]
            }
        }
    
    def create_evaluator(self, metric_class: Type, sample_type: str = "auto", **metric_kwargs) -> Callable:
        """
        Create an evaluation function for a specific metric.
        
        Args:
            metric_class: The Ragas metric class to use
            sample_type: Type of sample to create ("auto", "single_turn", "multi_turn")
            **metric_kwargs: Additional keyword arguments for the metric constructor
        
        Returns:
            A function that evaluates inputs/outputs using the specified metric
        """
        def evaluation_function(inputs: Dict[str, Any], outputs: Dict[str, Any], reference_outputs: Dict[str, Any]) -> Any:
            """
            Generic evaluation function that works with any Ragas metric.
            
            Args:
                inputs: Dictionary containing the question or conversation
                outputs: Dictionary containing the response with 'content' key
                reference_outputs: Dictionary containing the context and other reference data
            
            Returns:
                Evaluation score from the specified metric
            """
            # print("inputs", inputs)
            # print("outputs", outputs)
            # print("reference_outputs", reference_outputs)
            
            # Determine sample type
            if sample_type == "auto":
                if metric_class in self.single_turn_metrics:
                    use_single_turn = True
                elif metric_class in self.multi_turn_metrics:
                    use_single_turn = False
                else:
                    # Default to single turn for unknown metrics
                    use_single_turn = True
            else:
                use_single_turn = (sample_type == "single_turn")
            
            # Create the appropriate sample based on metric requirements
            if not use_single_turn:
                # No multi-turn sample builder is defined on this class, so fail
                # explicitly rather than with an AttributeError at call time.
                raise NotImplementedError(
                    f"{metric_class.__name__} needs a MultiTurnSample, which this factory does not build."
                )
            sample = self._create_single_turn_sample(metric_class, inputs, outputs, reference_outputs)
            
            # Create and use the metric with any additional kwargs
            scorer = metric_class(llm=self.llm, **metric_kwargs)
            return scorer.single_turn_score(sample)
        
        return evaluation_function
    
    def _create_single_turn_sample(self, metric_class: Type, inputs: Dict[str, Any], outputs: Dict[str, Any], reference_outputs: Dict[str, Any]) -> SingleTurnSample:
        """Create a SingleTurnSample based on metric requirements."""
        signature = self.metric_signatures.get(metric_class, {})
        required_params = signature.get("required_params", ["user_input", "response", "retrieved_contexts"])
        
        # Build sample kwargs based on required parameters
        sample_kwargs = {}
        
        for param in required_params:
            if param == "user_input":
                sample_kwargs[param] = inputs["question"]
            elif param == "response":
                sample_kwargs[param] = outputs["response"].content
            elif param == "retrieved_contexts":
                # Score what the chain actually retrieved, not the golden reference contexts
                sample_kwargs[param] = [doc.page_content for doc in outputs["context"]]
            elif param == "reference":
                sample_kwargs[param] = reference_outputs["answer"]
        
        return SingleTurnSample(**sample_kwargs)
        

In [167]:
# from advanced_metric_factory import MetricFactory
from ragas.metrics import Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity, LLMContextRecall, ToolCallAccuracy, AgentGoalAccuracyWithReference, TopicAdherenceScore
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model_name="gpt-4o-mini", temperature=0))
factory = MetricFactory(evaluator_llm)

metrics = [
    LLMContextRecall, Faithfulness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
]
evaluators = [factory.create_evaluator(metric) for metric in metrics]

In [168]:
# from ragas.dataset_schema import SingleTurnSample 
# from ragas.metrics import Faithfulness
# from ragas.llms import LangchainLLMWrapper
# from langchain_openai import ChatOpenAI


# evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model_name="gpt-4o-mini", temperature=0))
# def faith_smith(inputs, outputs, reference_outputs):
#     print("inputs", inputs)
#     print("outputs", outputs)
#     print("reference_outputs", reference_outputs)
#     sample = SingleTurnSample(
#         user_input=inputs["question"],
#         response=outputs["response"].content,
#         retrieved_contexts=reference_outputs["context"]
#     )
#     scorer = Faithfulness(llm=evaluator_llm)
#     return scorer.single_turn_score(sample)

# # res = faith_smith(
# #     {"question": "When was the first super bowl?"}, 
# #     {"response": {"content": "The first superbowl was held on Jan 15, 1967"}}, 
# #     {"context": ["The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."]}
# # )
# # res

In [169]:
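from langsmith.evaluation import evaluate

def run_eval(invocable, invocable_name, dataset_name):
    # Run evaluation with all metrics tracked in LangSmith and return the
    # experiment results so the caller can inspect or print them
    return evaluate(
        invocable.invoke,
        data=dataset_name,
        evaluators=evaluators,
        metadata={"revision_id": invocable_name},
    )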

In [170]:
invocables = {
    "naive": naive_retrieval_chain,
    "bm25": bm25_retrieval_chain, 
    "contextual_compression": contextual_compression_retrieval_chain,
    "multi_query": multi_query_retrieval_chain,
    "parent_document": parent_document_retrieval_chain, 
    "ensemble": ensemble_retrieval_chain, 
    "semantic": semantic_retrieval_chain, 
}

In [171]:
from tqdm import tqdm

for invocable_name, invocable in tqdm(invocables.items()):
  print("Beginning evaluation of", invocable_name)
  result = run_eval(invocable, invocable_name, dataset_name)
  print(result)
  print("Evaluation complete")

  0%|          | 0/7 [00:00<?, ?it/s]

Beginning evaluation of naive
View the evaluation results for experiment: 'spotless-stick-78' at:
https://smith.langchain.com/o/6848a1f0-9351-4081-bbdc-eaa03dcd1c1a/datasets/bfd9b876-db8d-423c-a478-1bbf4338fc0a/compare?selectedSessions=bf417a7c-1d3b-436b-9a24-34b3308124e7




0it [00:00, ?it/s]

inputs {'question': 'How do Chapters 1 and 3 relate to the inclusion of clinical work within standard terms and the definition of academic years for Title IV purposes?'}
outputs {'response': AIMessage(content='Chapters 1 and 3 relate to the inclusion of clinical work within standard terms and the definition of academic years for Title IV purposes by establishing clear guidelines and definitions for what constitutes meaningful attendance and full-time enrollment. \n\nChapter 1 typically addresses the criteria for including clinical or practical work within the academic calendar, clarifying how such work is recognized as part of the academic year. This ensures that clinical activities are properly integrated into the standard measures used to determine student status, eligibility, and the academic calendar for Title IV purposes.\n\nChapter 3 further defines the parameters of the academic year, including how clinical work or internships are counted towards full-time enrollment. It clarifi

  0%|          | 0/7 [00:32<?, ?it/s]


KeyboardInterrupt: 

In [None]:
import copy

from langchain_openai import ChatOpenAI
# Note: this binds `evaluate` to the Ragas function, shadowing any LangSmith
# `evaluate` imported earlier in the notebook.
from ragas import EvaluationDataset, RunConfig, evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity

def run_eval_ragas(invocable, dataset):

  dataset_this = copy.deepcopy(dataset)

  for test_row in dataset_this:
    response = invocable.invoke({"question" : test_row.eval_sample.user_input})
    test_row.eval_sample.response = response["response"].content
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

  evaluation_dataset = EvaluationDataset.from_pandas(dataset_this.to_pandas())
  evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

  custom_run_config = RunConfig(timeout=360)
  result = evaluate(
      dataset=evaluation_dataset,
      metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
      llm=evaluator_llm,
      run_config=custom_run_config
  )
  return result
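
Once each retriever has been run through `run_eval_ragas`, the per-sample scores still need to be compiled into a single comparison. The following is a rough sketch of one way to do that in plain Python; `mean_scores` and `rank_retrievers` are hypothetical helpers, and the input shape (a list of per-sample `{metric: score}` dicts for each retriever) is an assumption about how you export the results.

```python
import math

def mean_scores(rows):
    """Average each metric over per-sample score dicts, skipping None and NaN values."""
    totals, counts = {}, {}
    for row in rows:
        for metric, value in row.items():
            if value is None or (isinstance(value, float) and math.isnan(value)):
                continue  # failed or missing scores should not drag the mean down
            totals[metric] = totals.get(metric, 0.0) + value
            counts[metric] = counts.get(metric, 0) + 1
    return {metric: totals[metric] / counts[metric] for metric in totals}

def rank_retrievers(per_retriever_rows, metric):
    """Return (retriever_name, mean_score) pairs for `metric`, best first.

    Retrievers with no valid score for `metric` are omitted.
    """
    means = {name: mean_scores(rows) for name, rows in per_retriever_rows.items()}
    scored = [(name, m[metric]) for name, m in means.items() if metric in m]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The rows for each retriever could likely be obtained from the evaluation result's `to_pandas()` output, for example by selecting the metric columns and calling `.to_dict("records")`; check the column names in your Ragas version before relying on that.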