# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [97]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [98]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [99]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [100]:
loan_complaint_data[1]

Document(metadata={'source': './data/complaints.csv', 'row': 1, 'Date received': '05/09/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Received bad information about your loan', 'Consumer complaint narrative': 'I submitted my annual Income-Driven Repayment ( IDR ) recertification to Aidvantage on time and in full, using the correct process. My income is approximately $ XXXX/year, which is below 150 % of the federal poverty guideline. Under the SAVE Plan or any valid IDR/IBR plan , this income level qualifies for a $ XXXX monthly payment. \n\nInstead, Aidvantage assigned me a $ XXXX/month paymentan amount that is not legally or mathematically possible based on my income. I contacted them and was placed in a two-month administrative forbearance. After that, they returned with the exact same incorrect payment amount. \n\nWhen I pressed for clarification, Aidvantage responded that my IDR applic

## Task 3: Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [101]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each complaint as a document, and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later.

In [102]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
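
To make "cosine similarity plus top `k`" concrete, here is a minimal pure-Python sketch of the ranking step. It is not how the vector store implements search internally; the helper names (`cosine_similarity`, `top_k`) and the tiny 3-dimensional vectors are assumptions made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Hypothetical 3-dimensional "embeddings"; real embedding vectors are far larger.
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vec = [0.8, 0.2, 0.0]

# Indices of the two closest toy documents, best match first.
nearest = top_k(toy_query_vec, toy_doc_vecs, k=2)
```

Because cosine similarity only compares direction, the second toy vector wins here even though the first has a larger component along the first axis.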

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [103]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [104]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [105]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : re-assigned unchanged via RunnablePassthrough.assign, so both "context" and "question"
    #              pass through to the next step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry-picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [106]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the information provided, the most common issues with loans appear to be:\n\n- Dealing with lenders or servicers, including receiving bad or incorrect information about the loan, and issues related to loan balances, misapplied payments, and wrongful denials of payment plans.\n- Problems with payment application, such as inability to apply extra funds to principal or pay off smaller loans more quickly.\n- Errors or discrepancies in loan account status, balances, or reports, including being reported as delinquent without justification.\n- Challenges related to loan management, including improper loan transfers, lack of communication, and disputes over interest and repayment amounts.\n- Issues with loan forgiveness, cancellation, or discharge, especially when loans are mismanaged or aggressively pursued despite borrower difficulties or legal protections.\n\nIn summary, the most common issues seem to revolve around poor loan management, miscommunication, incorrect account informa

In [11]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided context, yes, there are complaints that did not get handled in a timely manner. Specifically, at least two complaints were marked as "Not in time" for response:\n\n- Complaint ID 12709087 (submitted to MOHELA), which was marked as "Timely response?": No.\n- Complaint ID 12832400 (submitted to Maximus Federal Services, Inc. KY), which was marked as "Timely response?": Yes (so this one was handled timely).\n\nAdditionally, multiple complaints mention long delays or lack of response over extended periods (e.g., over 1 year) and ongoing unresolved issues, indicating some complaints were indeed not addressed in a timely manner.'

In [12]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans primarily due to a variety of issues highlighted in the complaints, including:\n\n1. **Difficulty Managing Interest and Payment Options:** Many borrowers were offered limited options such as forbearance or deferment, but these led to interest accumulating, making it harder to pay down the principal or catch up on payments. Lowering payments often resulted in interest negating progress, extending the repayment period and increasing total debt.\n\n2. **Unclear or Uncommunicated Loan Terms and Transfers:** Borrowers reported being inadequately informed about when payments would resume, changes in loan servicers (e.g., from Great Lakes to Nelnet or Navient) without proper notification, or being unaware of the precise status of their loans. This lack of communication led to missed payments or delinquency.\n\n3. **Lack of Assistance or Proper Payment Plans:** Several complaints indicated that borrowers could not get proper repayment plans, or their atte

In [28]:
naive_retrieval_chain.invoke({"question" : "What happened to complaint ID 13347464?"})["response"].content

'I do not have information about what happened to complaint ID 13347464.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model), which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
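
To see what "based on the words they both contain" looks like in practice, here is a minimal sketch of one common variant of the BM25 scoring formula. It assumes whitespace tokenization and typical default parameters (`k1=1.5`, `b=0.75`); the function name `bm25_scores` and the toy documents are made up for illustration, and the retriever used in this notebook may differ in its details.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            # Rare terms get a larger inverse-document-frequency weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates (k1) and is normalized by document length (b).
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores

# Toy documents, invented for this sketch.
toy_docs = [
    "servicer placed my account in forbearance without asking",
    "my payment was misapplied by the servicer",
    "the servicer gave bad information about my loan balance",
]
toy_scores = bm25_scores("misapplied payment", toy_docs)
best_match = toy_scores.index(max(toy_scores))
```

Only the second toy document shares any words with the query, so it is the only one with a non-zero score. That is the sparse, exact-match behavior that sets BM25 apart from embedding search.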

This retriever is very straightforward to set up! Let's see it happen down below!


In [113]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)

We'll construct the same chain - only changing the retriever.

In [114]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [115]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with loans appears to be problems related to dealing with lenders or servicers, including issues with fees, payment application, and inaccurate or bad information about the loans. Specific sub-issues include disagreements over fees, trouble with how payments are being handled (such as applying payments correctly), and receiving incorrect information about loan balances or terms. \n\nTherefore, the most common issue is related to the borrower’s difficulties in dealing with lenders or servicers, often involving issues like misapplied payments, incorrect information, or dissatisfaction with how the loans are managed.'

In [16]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all the complaints detailed in the context were responded to by the companies with a response described as "Closed with explanation," and the responses are marked as timely. Therefore, it appears that no complaints went unhandled or unresolved in a timely manner.'

In [17]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including difficulties with payment plans, issues with loan servicing companies, lack of communication or delays in response from lenders, and problems related to errors or mismanagement by loan servicers. Specifically, some borrowers experienced problems with their payment plans or are struggling to navigate complex forbearance options, while others faced issues of unacknowledged forbearance requests or technical errors that led to missed payments or negative impacts on their credit scores. Additionally, some complaints indicate that borrowers were unaware of changes in loan servicing or transfers to new companies, and that poor communication and administrative errors contributed to their inability to repay loans effectively.'

In [27]:
bm25_retrieval_chain.invoke({"question" : "What happened to complaint ID 13347464?"})["response"].content

'The complaint with ID 13347464 was submitted on 05/05/25 regarding issues with federal student loan servicing. The consumer was disputing two loans totaling $10,000, claiming that the lender failed to provide proof that the loans match the Master Promissory Note and incorrectly advised the consumer to file an FTC Identity Theft Report, despite no identity theft occurring. The company, Maximus Federal Services, responded by closing the complaint with an explanation, implying they believed their actions were appropriate under the law.'

It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: We do, and the second half of the notebook will cover it.)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

#### Answer #1:

I asked Cursor which kinds of queries do better with a keyword-search approach (e.g., BM25) than with an embedding approach (see PROMPTS.md). My understanding is that short queries referring to entities with exact matches in the corpus do better with BM25. For example, for "What happened to complaint ID 13347464?", the BM25 retriever found the record but the naive retriever did not.

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
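
The two-stage idea can be sketched in a few lines of plain Python. The scoring function below is only a stand-in: a real reranker uses a learned model to score each query-document pair, whereas `overlap_score` (a name made up for this sketch) just counts shared words so the example stays self-contained.

```python
def overlap_score(query, text):
    """Stand-in relevance score: fraction of query words present in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)

def rerank(query, candidates, top_n):
    """Keep only the top_n candidates, ordered by the stand-in relevance score."""
    ranked = sorted(candidates, key=lambda text: overlap_score(query, text), reverse=True)
    return ranked[:top_n]

# Pretend these came back from a first-stage retriever with a large k.
toy_candidates = [
    "the servicer transferred my loan without notice",
    "late fees were added to my loan payment",
    "my loan payment was applied to the wrong account",
    "I could not reach anyone by phone",
]

# Second stage: "compress" four candidates down to the two most relevant.
compressed = rerank("loan payment applied wrong", toy_candidates, top_n=2)
```

The first stage is tuned for recall (cast a wide net), while the second stage is tuned for precision (keep only the best few), which is why a generous `k` upstream helps.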

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [124]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [125]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [126]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints, the most common issue with loans appears to be problems related to dealing with lenders or servicers, such as receiving bad or incorrect information, errors in loan balances, misapplied payments, and issues with loan transfers and handling. Many complaints also involve lack of communication, incorrect account information, and disputes over loan details, which can lead to customer confusion and credit reporting issues.'

In [21]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided context, there are indications that some complaints experienced delays in being handled. For example, the complaint about the student loan account review, submitted over a year prior and still unresolved after nearly 18 months, suggests a significant delay. Additionally, the complaint about payments not being applied to the account also mentions a follow-up, implying ongoing unresolved issues.\n\nHowever, for the complaint regarding the issue with auto pay, the response states "Timely response? Yes," indicating that this particular complaint was addressed in a timely manner.\n\nTherefore, yes, some complaints did not get handled in a timely manner, as evidenced by the complaint about unresolved student loan issues that remained open despite a long duration.'

In [22]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans mainly due to a combination of factors including a lack of understanding of the repayment obligations, poor communication from lenders or servicers, and the ongoing accumulation of interest which made the debt grow even while making payments. Additional reasons included being unaware of the need to repay loans, receiving incorrect or incomplete information about their loan status, and facing difficulties in managing payments due to economic hardships, stagnant wages, or the effects of interest accumulating on their loans. Some borrowers also experienced issues related to their loans being transferred without proper Notification, further complicating their ability to stay informed and make timely payments.'

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` new queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context.
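
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not LangChain's implementation: `fake_generate_queries` and `fake_retrieve` are hypothetical stand-ins for the LLM call and the vector store, and including the original question among the queries is a choice made for this sketch.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Run retrieval for every reformulation and return the unique documents."""
    unique_docs = []
    seen = set()
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:  # de-duplicate while keeping first-seen order
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

# Hypothetical stand-ins: a real setup would call an LLM and a vector store.
def fake_generate_queries(question):
    return [question, "reasons borrowers default", "causes of missed loan payments"]

fake_index = {
    "Why did people fail to pay back their loans?": ["doc_a", "doc_b"],
    "reasons borrowers default": ["doc_b", "doc_c"],
    "causes of missed loan payments": ["doc_c", "doc_d"],
}

def fake_retrieve(query):
    return fake_index.get(query, [])

merged = multi_query_retrieve(
    "Why did people fail to pay back their loans?", fake_generate_queries, fake_retrieve
)
```

Each reformulation surfaces documents the others miss, and the union is what reaches the prompt. The trade-off is a larger context and one extra LLM call per question.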

So, how hard is it to set up? Not bad! Let's see it down below!



In [127]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [128]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [129]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

"The most common issue with loans, based on the provided complaints, appears to be problems related to dealing with lenders or servicers, especially involving mishandling of payments, misapplication of funds, confusing or incorrect loan balances, miscommunication, and issues with loan account management such as errors in loan balances, interest calculations, and account transfers. Many complaints also highlight a lack of transparency, inadequate communication, and wrongful reporting that negatively impacts borrowers' credit and financial stability.\n"

In [26]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, some complaints indicate that they were not handled in a timely manner. Specifically, there are multiple instances where responses from companies were delayed beyond the expected timeframe:\n\n- On 04/02/25, a complaint submitted to Maximus Federal Services, Inc. was marked as "Timely response?": No.\n- On 04/01/25, a complaint to MOHELA was also marked as "Timely response?": No.\n- Similarly, a complaint sent on 03/25/25 to MOHELA was marked as "Timely response?": No.\n- Another complaint to Maximus on 04/30/25 was marked "Timely response?": Yes, so handled promptly in this case.\n\nAdditionally, some complaints reference ongoing issues that have persisted for over a year or more without resolution, despite multiple follow-ups and escalations, suggesting that certain issues were not addressed in a timely manner.\n\nIn summary, yes, there are complaints indicating that some issues were not handled promptly, with delays ranging from several days t

In [27]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to a combination of factors including:\n\n1. Lack of clear information about repayment options: Many borrowers were not adequately informed about income-driven repayment plans, loan forgiveness programs, or the long-term impact of forbearance and deferment options. This led to unmanageable interest accumulation and difficult repayment scenarios.\n\n2. Mismanagement and misinformation by loan servicers: Several complaints highlight instances where servicers used tactics such as forbearance steering, wrongful denial or misapplication of payments, and providing incorrect loan balance information. These practices often kept borrowers in prolonged debt cycles and prevented access to better repayment options.\n\n3. Accruing interest and rising balances: Forbearance and deferment, while temporarily relieving payment obligations, often resulted in interest continuing to accrue and compound, vastly increasing the total amount owed over time.\

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

#### Answer #2

I posed this question to Cursor (see PROMPTS.md), and my understanding is this: generating multiple reformulations of the user query increases vocabulary diversity and covers more variations of the query's intent. That gives (possibly) better semantic coverage, so more of the relevant chunks may be retrieved. Also, users sometimes write queries that are poorly structured or incomplete; having the LLM reformulate the query can correct that, which again leads to more relevant chunks being retrieved.

## Task 8: Parent Document Retriever

The Parent Document Retriever is a "small-to-big" strategy that works as follows:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
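
Here is a minimal sketch of the "search small, return big" bookkeeping in plain Python. It assumes a naive fixed-width splitter and uses substring matching as a stand-in for vector similarity search; all helper names and the toy complaints are made up for illustration.

```python
def split_into_children(text, chunk_size):
    """Naive fixed-width splitter, standing in for a real text splitter."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def build_stores(parents, chunk_size):
    """Return (docstore, child_index): parents by id, plus (child_text, parent_id) pairs."""
    docstore = {}
    child_index = []
    for parent_id, text in enumerate(parents):
        docstore[parent_id] = text
        for child in split_into_children(text, chunk_size):
            child_index.append((child, parent_id))
    return docstore, child_index

def retrieve_parents(query, docstore, child_index):
    """Match the query against child chunks, then return the de-duplicated parents."""
    parent_ids = []
    for child, parent_id in child_index:
        # Substring match stands in for vector similarity search.
        if query.lower() in child.lower() and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids]

# Toy parent documents, invented for this sketch.
toy_parents = [
    "My servicer changed twice. Nobody told me about autopay being cancelled.",
    "I was placed in forbearance for months. Interest kept growing the whole time.",
]
docstore, child_index = build_stores(toy_parents, chunk_size=40)

# The match happens on a 40-character child, but the full parent comes back.
results = retrieve_parents("autopay", docstore, child_index)
```

The key design point is the separation of stores: children live in the searchable index, parents live in a plain key-value store, and the parent id on each child links the two.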

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [130]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [131]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [132]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore=parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [133]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [134]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [135]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

"The most common issue with loans, based on the provided complaints, appears to be problems related to **incorrect or misleading information on credit reports, mismanagement by servicers, and disputes over account balances or interest rates**. Many complaints involve errors in loan balances, misapplied payments, wrongful denials of payment plans, and inaccuracies in reporting to credit bureaus. These issues often stem from systemic errors, miscommunication, or misconduct by loan servicers.\n\nPlease let me know if you'd like more specific information or details on particular types of loan issues."

In [34]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, several complaints were identified as not being handled in a timely manner. Specifically, the complaints regarding the student loan service issues filed with MOHELA on 03/28/25 and 04/11/25 both received responses marked as "No" for being timely responded to. Additionally, the complaint with Nelnet filed on 04/27/25 was marked as "Yes" for timely response, indicating it was handled promptly. \n\nTherefore, yes, some complaints, including those filed with MOHELA, did not get handled in a timely manner.'

In [35]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including experiencing severe financial hardship, mismanagement or misrepresentation by educational institutions, and issues related to loan servicing or communication failures. Specifically, some borrowers faced unexpected or undelayed payment obligations, lack of proper information about repayment terms or grace periods, and difficulties in managing or verifying their debts. Additionally, some individuals struggled due to joblessness or inability to secure employment in their field, which made consistent repayment impossible. Others encountered administrative problems such as improper reporting, failure to notify about payments, or issues stemming from the transfer or collection of their loans.'

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the performance question more rigorously.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.
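
As a rough sketch of how weighted Reciprocal Rank Fusion combines rankings: each document earns `weight / (c + rank)` from every list it appears in, and the sums decide the final order. The constant `c=60` is the value commonly used with RRF; the function name `weighted_rrf` and the toy rankings are assumptions for illustration.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists: each doc earns weight / (c + rank) per list it appears in."""
    fused = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] += weight / (c + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical rankings from two retrievers (best match first).
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused_ranking = weighted_rrf([keyword_ranking, vector_ranking], weights=[0.5, 0.5])
```

Because only ranks matter, the retrievers' raw scores never need to be on the same scale, which is what makes it practical to mix BM25 with embedding-based retrievers.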

In [136]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [137]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [138]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

"Based on the provided context, the most common issues with loans seem to involve:\n\n- Errors or bad information about loan balances and interest calculations.\n- Difficulties with how payments are being handled, including being unable to apply additional payments to principal.\n- Lack of transparency and inadequate communication about loan terms, balances, and changes.\n- Unauthorized transfer or mishandling of loans.\n- Incorrect or inconsistent reporting of account status and delinquency.\n- Mishandling of loan consolidation, including lack of proper disclosure.\n- Problems with repayment plans, interest accrual during forbearance, and changes in loan classifications.\n- Disputes over loan ownership, validation, or fraudulent activity.\n\nWhile the specific most common issue isn't directly quantified in the data, a recurring theme is that many complaints involve **errors or bad information about loan balances, interest, and account status**, as well as poor handling of repayment an

In [39]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, some complaints indicate that complaints were not handled in a timely manner. Specifically:\n\n- Complaint ID 12744910 (row 66) from 04/18/25 about payments showing late; response was "Yes" to timely response but indicates delays of 10 days due to action not being taken within promised time frame.\n- Complaint ID 12668396 (row 95) from 04/21/25 about account delinquencies and lengthy wait times; response was "No" to timely response, and the complaint notes over a month of trying to get assistance with unacceptable wait times.\n- Complaint ID 12935889 (row 89) from 04/11/25 about account status and long wait times of over four hours; response was "No" to timely response.\n- Several other complaints mention ongoing issues with delays, repeated follow-ups, or being left unaddressed over prolonged periods.\n\nIn summary, multiple complaints reflect that issues were not resolved in a timely manner, with delays ranging from several days to over a year,

In [40]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans mainly due to a combination of mismanagement by loan servicers, lack of clear communication, and adverse economic circumstances. Many borrowers were not properly informed about the status of their loans, whether they were in deferment or forbearance, or the changes in loan transfer or servicing companies that affected their payment obligations. This lack of transparency often led to unintentional missed payments or late payments, negatively impacting their credit scores.\n\nAdditionally, various reports indicate that borrowers were steered into long-term forbearances instead of more manageable repayment plans like income-driven repayment or loan rehabilitation, which could have helped them avoid interest accumulation and default. Some borrowers also experienced issues such as incorrect or inconsistent loan account information, unauthorized transfer of loans, and failure to receive timely notices, which contributed to their inability to keep up wit

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example, which calculates the distances between consecutive sentences and then breaks the sequence apart wherever a distance exceeds a given percentile of all distances.
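To make the `percentile` rule concrete, here is a minimal sketch in plain Python. It is not the `SemanticChunker` implementation: the toy 2-D vectors stand in for real embeddings, and the helper names (`cosine_distance`, `percentile`, `split_by_percentile`) and the 95th-percentile default are assumptions chosen for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, pct):
    """Linear-interpolation percentile (pct in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def split_by_percentile(sentences, vectors, pct=95.0):
    """Group consecutive sentences, breaking where the distance to the
    next sentence exceeds the chosen percentile of all distances."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for i, distance in enumerate(distances):
        if distance > threshold:  # large semantic jump: close the chunk
            chunks.append(current)
            current = []
        current.append(sentences[i + 1])
    chunks.append(current)
    return chunks


# Toy data: two sentences about payments, then two about credit reporting.
sentences = [
    "My payment was applied to the wrong loan.",
    "The extra payment did not reduce my principal.",
    "My credit report shows a delinquency.",
    "The bureau never corrected the report.",
]
vectors = [(1.0, 0.0), (0.9, 0.1), (0.1, 0.9), (0.0, 1.0)]

chunks = split_by_percentile(sentences, vectors)
for chunk in chunks:
    print(chunk)
```

Only the jump between the second and third sentence exceeds the threshold here, so the toy corpus splits into a payments chunk and a credit-reporting chunk.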

In [139]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [140]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [141]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [142]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [143]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [144]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the complaints provided, appears to be miscommunication and errors in the servicing and reporting of student loans. Specific recurring issues include:\n\n- Trouble with understanding and managing payment plans, such as incorrect billing amounts, failure to process applications, and issues with auto-debit setup.\n- Problems related to the status of loans, including default notices, inaccurate reporting, and delays or discrepancies in account information.\n- Lack of transparency and communication from loan servicers regarding account status, payment requirements, or changes in servicing.\n\nOverall, a significant number of complaints involve improper handling, reporting errors, or lack of clear communication from loan servicers, leading to borrower confusion, financial harm, and legal concerns.'

In [47]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, some complaints were not handled in a timely manner. Specifically, there are instances where complaints were marked as "Closed with explanation" and note that the company did not respond to the complaint or did so after the deadline. For example, complaints related to Nelnet, Inc. from IN and NJ did not receive prompt responses, and in some cases, the level of response or resolution suggests delays or lack of timely handling. \n\nHowever, the record also indicates that some complaints received responses marked as "Yes" for timely response, meaning they were handled within an acceptable timeframe.  \n\nIn summary, while many complaints were responded to in a timely manner, there are notable cases where handling was delayed or responses were lacking, indicating that some complaints did not get handled promptly.'

In [48]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including difficulties with loan servicing, lack of transparency from lenders, disputes over the legitimacy or accuracy of reported debt, and issues related to the handling or verification of their loan information. Some faced problems with improper or delayed payments, miscommunications, or legal disputes. In certain cases, borrowers believed their loans had been improperly reported, were in default despite never being in default, or faced delays and errors in payment processing. Additionally, some borrowers experienced problems with loan forgiveness, discharge, or the legality of their debt due to legislative or administrative issues.'

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

#### Answer #3:

I posed this question to Cursor (see PROMPTS.md), and my understanding is as follows. Semantic chunking relies on semantic variation between sentences to detect natural breakpoints. Short, repetitive sentences are highly similar to one another, so there is little variation to work with. The result could be either very large, meaningless chunks or over-fragmentation into tiny, equally meaningless chunks. One way to adjust the algorithm is to exploit the document's structure (e.g., a TOC chunker), since such a document is already organized into sections that keep related content together:

1. To prevent overly large chunks: if the sentences cross a section boundary, start a new chunk.
2. To prevent overly small chunks: if a chunk has too few sentences, keep adding sentences until you hit a section boundary.
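The two guards described in this answer could be sketched roughly as follows. This is only an illustration of the idea, not part of `SemanticChunker`: the function name `chunk_with_sections`, the `(section, sentence)` input shape, the `semantic_breaks` set, and the `min_sentences` parameter are all hypothetical.

```python
def chunk_with_sections(items, semantic_breaks, min_sentences=2):
    """Chunk sentences using section boundaries as guard rails.

    items: list of (section, sentence) pairs in document order.
    semantic_breaks: set of indices i where the semantic splitter
        proposes starting a new chunk at items[i].
    min_sentences: a proposed break is ignored while the current
        chunk has fewer sentences than this.
    """
    chunks, current = [], []
    for i, (section, sentence) in enumerate(items):
        if current:
            # Guard 1: never let a chunk span two sections.
            crossed_section = section != items[i - 1][0]
            # Guard 2: honor a semantic break only once the chunk is big enough.
            wants_break = i in semantic_breaks and len(current) >= min_sentences
            if crossed_section or wants_break:
                chunks.append(current)
                current = []
        current.append(sentence)
    if current:
        chunks.append(current)
    return chunks


faq = [
    ("Payments", "How do I make a payment?"),
    ("Payments", "How do I change my payment date?"),
    ("Payments", "How do I set up autopay?"),
    ("Forbearance", "How do I request forbearance?"),
    ("Forbearance", "How long does forbearance last?"),
]

# A noisy splitter that proposes a break after nearly every sentence.
guarded_chunks = chunk_with_sections(faq, semantic_breaks={1, 4})
for chunk in guarded_chunks:
    print(chunk)
```

In this toy FAQ the premature breaks are suppressed and the only split that survives is the one at the section boundary.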


# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

## Activity #1 Code

In [107]:
# Install required packages for evaluation
!pip install -q ragas langsmith langchain-community datasets pandas numpy pillow rapidfuzz

/bin/bash: /Users/foohm/AIMakerSpace/AIE7.session9/09_Advanced_Retrieval_Eval/venv/bin/pip: /Users/foohm/AIMakerSpace/AIE7/09_Advanced_Retrieval_Eval/venv/bin/python3.13: bad interpreter: No such file or directory


Python(89217) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.


In [108]:
import os
import getpass
from uuid import uuid4

# LangSmith API configuration
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter your LangSmith API Key: ")
os.environ["LANGCHAIN_PROJECT"] = f"AIM AdvRetrievalRAG - {uuid4().hex[0:8]}"

# Confirm OpenAI API key is set (already configured in earlier cells)
#if "OPENAI_API_KEY" not in os.environ:
#    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key: ")

print("✅ API keys configured successfully!")

✅ API keys configured successfully!


## Building the Golden Data Set

We'll create a synthetic evaluation dataset using RAGAS synthetic data generation from the loan complaint data, following the "Abstracted SDG" approach.

This code has been extracted to a separate notebook `Golden_Dataset_Generator_Standalone.ipynb` so that we don't generate the golden data set each time we run the notebook. 

In [109]:
dataset_name = "loan-complaints-golden-dataset-standalone-20250725-105810"

## Evaluation Sanity Checkpoint

Using the golden dataset and the evaluators, we'll evaluate the `naive_retrieval_chain` and collect overall metrics, including average cost and latency.

In [110]:
# Use Existing Golden Dataset for Direct RAGAS Evaluation
from langsmith import Client
from datasets import Dataset

print("🚀 Loading existing golden dataset for RAGAS evaluation...")

# Initialize LangSmith client to load the golden dataset
client = Client()

try:
    # Get examples from the existing golden dataset
    golden_examples = list(client.list_examples(dataset_name=dataset_name))
    print(f"✅ Loaded {len(golden_examples)} examples from golden dataset: {dataset_name}")
    
    # Convert to format compatible with direct RAGAS evaluation
    evaluation_data = []
    for example in golden_examples:
        eval_sample = {
            "user_input": example.inputs["question"],
            "reference_contexts": example.metadata.get("reference_contexts", []),
            "reference": example.outputs["answer"],
            # These will be populated during evaluation:
            "response": None,
            "retrieved_contexts": None
        }
        evaluation_data.append(eval_sample)
    
    # Create a simple class to hold evaluation data (similar to RAGAS format)
    class EvaluationSample:
        def __init__(self, data):
            self.user_input = data["user_input"]
            self.reference_contexts = data["reference_contexts"]
            self.reference = data["reference"]
            self.response = data["response"]
            self.retrieved_contexts = data["retrieved_contexts"]
    
    # Convert to evaluation dataset format
    evaluation_dataset = [EvaluationSample(data) for data in evaluation_data]
    
    print("📋 Sample questions from golden dataset:")
    for i, sample in enumerate(evaluation_dataset[:3]):
        print(f"  {i+1}. {sample.user_input[:100]}...")
    
    print(f"\n📊 Golden dataset ready for evaluation with {len(evaluation_dataset)} test cases")

except Exception as e:
    print(f"❌ Error loading golden dataset: {e}")
    print("Please ensure the golden dataset has been created and is available in LangSmith")

🚀 Loading existing golden dataset for RAGAS evaluation...
✅ Loaded 15 examples from golden dataset: loan-complaints-golden-dataset-standalone-20250725-105810
📋 Sample questions from golden dataset:
  1. How does the illegal access by DOGE-related ventures, as described in the context, relate to the vio...
  2. How does the CFPB's involvement relate to the failure of TransUnion to properly investigate and repo...
  3. How do violations of the Higher Education Act and FERPA relate to the data breaches and mishandling ...

📊 Golden dataset ready for evaluation with 15 test cases


## Defining the generic evaluation function

In [121]:
# Direct RAGAS Evaluation Function with Cost and Latency Tracking (CORRECTED APPROACH)
from langchain_community.callbacks import get_openai_callback
from ragas import evaluate, EvaluationDataset, RunConfig
from ragas.metrics import (
    LLMContextRecall, Faithfulness, FactualCorrectness,
    ResponseRelevancy, ContextRecall, NoiseSensitivity
)
from ragas.llms import LangchainLLMWrapper
import pandas as pd
import time

def evaluate_rag_chain_with_tracking(chain, dataset, method_name):
    """
    Evaluate a RAG chain using correct RAGAS approach from semantic chunking notebook.
    
    Args:
        chain: The RAG chain to evaluate
        dataset: List of EvaluationSample objects with test cases
        method_name: Name of the method for reporting
    
    Returns:
        Dictionary with RAGAS metrics and cost/latency data
    """
    print(f"🚀 Evaluating {method_name} retrieval chain...")
    print(f"📊 Test cases: {len(dataset)}")
    
    start_time = time.time()
    
    with get_openai_callback() as cb:
        # Step 1: Generate responses for each test case
        print("   Step 1: Generating responses...")
        for i, test_sample in enumerate(dataset):
            if i % 5 == 0:  # Progress indicator
                print(f"      Processing question {i+1}/{len(dataset)}")
            
            # Get response from the RAG chain
            response = chain.invoke({"question": test_sample.user_input})
            
            # Populate the sample with generated response and retrieved contexts
            test_sample.response = response["response"].content if hasattr(response["response"], 'content') else str(response["response"])
            test_sample.retrieved_contexts = [
                context.page_content for context in response["context"]
            ]
        
        # Step 2: Convert to pandas DataFrame with correct RAGAS format
        print("   Step 2: Creating RAGAS DataFrame...")
        ragas_data = []
        for sample in dataset:
            ragas_data.append({
                "user_input": sample.user_input,
                "response": sample.response,
                "retrieved_contexts": sample.retrieved_contexts,
                "reference_contexts": sample.reference_contexts,
                "reference": sample.reference
            })
        
        ragas_df = pd.DataFrame(ragas_data)
        
        # Step 3: Create EvaluationDataset from pandas DataFrame
        evaluation_dataset = EvaluationDataset.from_pandas(ragas_df)
        
        # Step 4: Set up evaluator LLM (following semantic chunking notebook approach)
        evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano", temperature=0))
        
        # Step 5: Configure run settings
        custom_run_config = RunConfig(timeout=360)
        
        # Step 6: Run RAGAS evaluation (following semantic chunking notebook approach)
        print("   Step 3: Running RAGAS evaluation...")
        
        ragas_results = evaluate(
            dataset=evaluation_dataset,
            metrics=[
                LLMContextRecall(),
                Faithfulness(), 
                FactualCorrectness(),
                ResponseRelevancy(),
                ContextRecall(),
                NoiseSensitivity()
            ],
            llm=evaluator_llm,
            run_config=custom_run_config
        )
        
        # Step 7: Convert results to a pandas DataFrame (means are calculated below)
        print("   Step 4: Processing RAGAS results...")
        results_df = ragas_results.to_pandas()
    
    end_time = time.time()
    
    # Comprehensive results with cost and latency
    results = {
        # RAGAS metrics (calculated as means from pandas DataFrame)
        # NOTE: ContextEntityRecall is not in the metrics list above, so no
        # 'context_entity_recall' column is produced and this entry falls back to 0.0.
        "context_entity_recall": results_df['context_entity_recall'].mean() if 'context_entity_recall' in results_df.columns else 0.0,
        "faithfulness": results_df['faithfulness'].mean() if 'faithfulness' in results_df.columns else 0.0,
        "factual_correctness": results_df['factual_correctness(mode=f1)'].mean() if 'factual_correctness(mode=f1)' in results_df.columns else 0.0,
        "response_relevancy": results_df['answer_relevancy'].mean() if 'answer_relevancy' in results_df.columns else 0.0,
        "context_recall": results_df['context_recall'].mean() if 'context_recall' in results_df.columns else 0.0,
        "noise_sensitivity": results_df['noise_sensitivity(mode=relevant)'].mean() if 'noise_sensitivity(mode=relevant)' in results_df.columns else 0.0,
        
        # Cost metrics (USD)
        "total_cost_usd": cb.total_cost,
        "cost_per_query": cb.total_cost / len(dataset) if len(dataset) > 0 else 0,
        
        # Token metrics
        "total_tokens": cb.total_tokens,
        "prompt_tokens": cb.prompt_tokens,
        "completion_tokens": cb.completion_tokens,
        "tokens_per_query": cb.total_tokens / len(dataset) if len(dataset) > 0 else 0,
        
        # Latency metrics (seconds)
        "total_latency_seconds": end_time - start_time,
        "latency_per_query": (end_time - start_time) / len(dataset) if len(dataset) > 0 else 0,
        
        # Efficiency metrics
        "cost_per_token": cb.total_cost / cb.total_tokens if cb.total_tokens > 0 else 0,
        "tokens_per_second": cb.total_tokens / (end_time - start_time) if (end_time - start_time) > 0 else 0
    }
    
    print(f"✅ {method_name} evaluation completed!")
    print(f"   💰 Total Cost: ${results['total_cost_usd']:.4f}")
    print(f"   🔢 Total Tokens: {results['total_tokens']:,}")
    print(f"   ⏱️ Total Time: {results['total_latency_seconds']:.2f} seconds")
    print(f"   📊 Average RAGAS Score: {sum([results[metric] for metric in ['context_entity_recall', 'faithfulness', 'factual_correctness', 'response_relevancy', 'context_recall', 'noise_sensitivity']]) / 6:.4f}")
    
    # Debug: Show available columns
    print(f"   🔍 Available RAGAS columns: {list(results_df.columns)}")
    
    return results

print("✅ Direct RAGAS evaluation function created following your pandas + mean approach!")

✅ Direct RAGAS evaluation function created following your pandas + mean approach!
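Note that in `evaluate_rag_chain_with_tracking` the timer and the `get_openai_callback` block wrap both response generation and RAGAS scoring, so the totals it reports include the evaluator's LLM calls as well as the chain's. If you want the retrieval chain's latency on its own, one option is to time each `invoke` call separately. The sketch below assumes nothing about LangChain: `time_generation` and `fake_chain` are hypothetical names, and `fake_chain` merely stands in for a real chain's `invoke` method.

```python
import time
from statistics import mean


def time_generation(invoke, questions):
    """Call invoke({"question": q}) for each question and return the
    responses together with per-query latencies in seconds."""
    responses, latencies = [], []
    for question in questions:
        start = time.perf_counter()
        responses.append(invoke({"question": question}))
        latencies.append(time.perf_counter() - start)
    return responses, latencies


def fake_chain(inputs):
    """Stand-in for chain.invoke; a real chain would call the retriever and the LLM."""
    time.sleep(0.01)
    return {"response": f"answer to: {inputs['question']}", "context": []}


responses, latencies = time_generation(fake_chain, ["q1", "q2", "q3"])
print(f"mean latency per query: {mean(latencies):.3f}s")
print(f"max latency per query:  {max(latencies):.3f}s")
```

With a real chain you would pass something like `naive_retrieval_chain.invoke` in place of `fake_chain`, and run the RAGAS scoring afterwards, outside the timed loop.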


#### Naive Retrieval 

In [123]:
# Evaluate Naive Retrieval Chain
print("=" * 80)
print("EVALUATING NAIVE RETRIEVAL CHAIN")
print("=" * 80)

# Create a copy of the golden dataset for naive evaluation (each method needs its own copy)
naive_dataset = [EvaluationSample({
    "user_input": sample.user_input,
    "reference_contexts": sample.reference_contexts,
    "reference": sample.reference,
    "response": None,
    "retrieved_contexts": None
}) for sample in evaluation_dataset]

# Run evaluation with comprehensive tracking
naive_results = evaluate_rag_chain_with_tracking(
    chain=naive_retrieval_chain,
    dataset=naive_dataset,
    method_name="Naive"
)

print("\n📊 NAIVE RETRIEVAL RESULTS:")
print("-" * 40)
print("RAGAS Metrics:")
for metric in ['context_entity_recall', 'faithfulness', 'factual_correctness', 'response_relevancy', 'context_recall', 'noise_sensitivity']:
    print(f"  {metric.replace('_', ' ').title()}: {naive_results[metric]:.4f}")

print("\nCost & Performance:")
print(f"  Total Cost: ${naive_results['total_cost_usd']:.4f}")
print(f"  Cost per Query: ${naive_results['cost_per_query']:.4f}")
print(f"  Total Tokens: {naive_results['total_tokens']:,}")
print(f"  Tokens per Query: {naive_results['tokens_per_query']:.1f}")
print(f"  Total Latency: {naive_results['total_latency_seconds']:.2f} seconds")
print(f"  Latency per Query: {naive_results['latency_per_query']:.2f} seconds")

EVALUATING NAIVE RETRIEVAL CHAIN
🚀 Evaluating Naive retrieval chain...
📊 Test cases: 15
   Step 1: Generating responses...
      Processing question 1/15
      Processing question 6/15
      Processing question 11/15
   Step 2: Creating RAGAS DataFrame...
   Step 3: Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

Exception raised in Job[60]: OutputParserException(Invalid json output: Since the resumption of federal loan payments, I have received little to no contact. The last time I called, I was told I was in forbearance until 2040, but there is nothing in writing. I am having trouble logging in with my credentials, and when I call for assistance, there is a long wait followed by a disconnection once someone finally answers. The lack of transparency and accountability has caused undue stress.
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE )


   Step 4: Processing RAGAS results...
✅ Naive evaluation completed!
   💰 Total Cost: $0.2096
   🔢 Total Tokens: 1,174,695
   ⏱️ Total Time: 420.01 seconds
   📊 Average RAGAS Score: 0.6360
   🔍 Available RAGAS columns: ['user_input', 'retrieved_contexts', 'reference_contexts', 'response', 'reference', 'context_recall', 'faithfulness', 'factual_correctness(mode=f1)', 'answer_relevancy', 'noise_sensitivity(mode=relevant)']

📊 NAIVE RETRIEVAL RESULTS:
----------------------------------------
RAGAS Metrics:
  Context Entity Recall: 0.0000
  Faithfulness: 1.0000
  Factual Correctness: 0.8487
  Response Relevancy: 0.9284
  Context Recall: 0.9667
  Noise Sensitivity: 0.0722

Cost & Performance:
  Total Cost: $0.2096
  Cost per Query: $0.0140
  Total Tokens: 1,174,695
  Tokens per Query: 78313.0
  Total Latency: 420.01 seconds
  Latency per Query: 28.00 seconds
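
Each evaluation cell rebuilds the golden dataset by hand so that one retriever's responses never leak into another's run. A small helper could factor that pattern out. The sketch below is an assumption-laden illustration: `Sample` is a hypothetical dataclass that mirrors the fields of `EvaluationSample`, and `fresh_copy` is a hypothetical helper name.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional


@dataclass
class Sample:
    """Hypothetical stand-in mirroring the fields of EvaluationSample."""
    user_input: str
    reference: str
    reference_contexts: List[str] = field(default_factory=list)
    response: Optional[str] = None
    retrieved_contexts: Optional[List[str]] = None


def fresh_copy(samples):
    """Return new samples that keep the question and references
    but clear the fields each retriever fills in during evaluation."""
    return [
        replace(
            s,
            reference_contexts=list(s.reference_contexts),  # avoid sharing the list
            response=None,
            retrieved_contexts=None,
        )
        for s in samples
    ]


golden = [Sample(user_input="q", reference="a", reference_contexts=["ctx"])]

# Simulate one retriever's run mutating its own copy.
run = fresh_copy(golden)
run[0].response = "generated"
run[0].retrieved_contexts = ["doc"]
run[0].reference_contexts.append("extra")

print(golden[0])  # the golden sample is unchanged
```

Copying the `reference_contexts` list, rather than sharing it, keeps a mutation in one run from showing up in the golden data or in another retriever's copy.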


#### BM25 Retrieval

In [146]:
# Evaluate BM25 Retrieval Chain
print("=" * 80)
print("EVALUATING BM25 RETRIEVAL CHAIN") 
print("=" * 80)

# Create a copy of the golden dataset for BM25 evaluation
bm25_query_dataset = [EvaluationSample({
    "user_input": sample.user_input,
    "reference_contexts": sample.reference_contexts,
    "reference": sample.reference,
    "response": None,
    "retrieved_contexts": None
}) for sample in evaluation_dataset]

# Run evaluation with comprehensive tracking
bm25_results = evaluate_rag_chain_with_tracking(
    chain=bm25_retrieval_chain,
    dataset=bm25_query_dataset,
    method_name="BM25"
)

print("\n📊 BM25 RETRIEVAL RESULTS:")
print("-" * 40)
print("RAGAS Metrics:")
for metric in ['context_entity_recall', 'faithfulness', 'factual_correctness', 'response_relevancy', 'context_recall', 'noise_sensitivity']:
    print(f"  {metric.replace('_', ' ').title()}: {bm25_results[metric]:.4f}")

print("\nCost & Performance:")
print(f"  Total Cost: ${bm25_results['total_cost_usd']:.4f}")
print(f"  Cost per Query: ${bm25_results['cost_per_query']:.4f}")
print(f"  Total Tokens: {bm25_results['total_tokens']:,}")
print(f"  Tokens per Query: {bm25_results['tokens_per_query']:.1f}")
print(f"  Total Latency: {bm25_results['total_latency_seconds']:.2f} seconds")
print(f"  Latency per Query: {bm25_results['latency_per_query']:.2f} seconds")

EVALUATING BM25 RETRIEVAL CHAIN
🚀 Evaluating BM25 retrieval chain...
📊 Test cases: 15
   Step 1: Generating responses...
      Processing question 1/15
      Processing question 6/15
      Processing question 11/15
   Step 2: Creating RAGAS DataFrame...
   Step 3: Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

Exception raised in Job[23]: ValueError(setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (4,) + inhomogeneous part.)


   Step 4: Processing RAGAS results...
✅ BM25 evaluation completed!
   💰 Total Cost: $0.1112
   🔢 Total Tokens: 652,058
   ⏱️ Total Time: 273.25 seconds
   📊 Average RAGAS Score: 0.6104
   🔍 Available RAGAS columns: ['user_input', 'retrieved_contexts', 'reference_contexts', 'response', 'reference', 'context_recall', 'faithfulness', 'factual_correctness(mode=f1)', 'answer_relevancy', 'noise_sensitivity(mode=relevant)']

📊 BM25 RETRIEVAL RESULTS:
----------------------------------------
RAGAS Metrics:
  Context Entity Recall: 0.0000
  Faithfulness: 1.0000
  Factual Correctness: 0.8327
  Response Relevancy: 0.7972
  Context Recall: 0.9667
  Noise Sensitivity: 0.0661

Cost & Performance:
  Total Cost: $0.1112
  Cost per Query: $0.0074
  Total Tokens: 652,058
  Tokens per Query: 43470.5
  Total Latency: 273.25 seconds
  Latency per Query: 18.22 seconds


#### Multi Query Retrieval 

In [147]:
# Evaluate Multi-Query Retrieval Chain
print("=" * 80)
print("EVALUATING MULTI-QUERY RETRIEVAL CHAIN") 
print("=" * 80)

# Create a copy of the golden dataset for Multi-Query evaluation
multi_query_dataset = [EvaluationSample({
    "user_input": sample.user_input,
    "reference_contexts": sample.reference_contexts,
    "reference": sample.reference,
    "response": None,
    "retrieved_contexts": None
}) for sample in evaluation_dataset]

# Run evaluation with comprehensive tracking
multi_query_results = evaluate_rag_chain_with_tracking(
    chain=multi_query_retrieval_chain,
    dataset=multi_query_dataset,
    method_name="Multi-Query"
)

print("\n📊 MULTI-QUERY RETRIEVAL RESULTS:")
print("-" * 40)
print("RAGAS Metrics:")
for metric in ['context_entity_recall', 'faithfulness', 'factual_correctness', 'response_relevancy', 'context_recall', 'noise_sensitivity']:
    print(f"  {metric.replace('_', ' ').title()}: {multi_query_results[metric]:.4f}")

print("\nCost & Performance:")
print(f"  Total Cost: ${multi_query_results['total_cost_usd']:.4f}")
print(f"  Cost per Query: ${multi_query_results['cost_per_query']:.4f}")
print(f"  Total Tokens: {multi_query_results['total_tokens']:,}")
print(f"  Tokens per Query: {multi_query_results['tokens_per_query']:.1f}")
print(f"  Total Latency: {multi_query_results['total_latency_seconds']:.2f} seconds")
print(f"  Latency per Query: {multi_query_results['latency_per_query']:.2f} seconds")

EVALUATING MULTI-QUERY RETRIEVAL CHAIN
🚀 Evaluating Multi-Query retrieval chain...
📊 Test cases: 15
   Step 1: Generating responses...
      Processing question 1/15
      Processing question 6/15
      Processing question 11/15
   Step 2: Creating RAGAS DataFrame...
   Step 3: Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

Exception raised in Job[58]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1-nano in organization org-JdudSPvfr6a6LEV8KvMepUOB on tokens per min (TPM): Limit 200000, Used 200000, Requested 6412. Please try again in 1.923s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[60]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1-nano in organization org-JdudSPvfr6a6LEV8KvMepUOB on tokens per min (TPM): Limit 200000, Used 193711, Requested 8059. Please try again in 531ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[55]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1-nano in organization org-JdudSPvfr6a6LEV8KvMepUOB on tokens per min (TPM): Limit 200000

   Step 4: Processing RAGAS results...
✅ Multi-Query evaluation completed!
   💰 Total Cost: $0.2814
   🔢 Total Tokens: 1,570,529
   ⏱️ Total Time: 642.54 seconds
   📊 Average RAGAS Score: 0.6286
   🔍 Available RAGAS columns: ['user_input', 'retrieved_contexts', 'reference_contexts', 'response', 'reference', 'context_recall', 'faithfulness', 'factual_correctness(mode=f1)', 'answer_relevancy', 'noise_sensitivity(mode=relevant)']

📊 MULTI-QUERY RETRIEVAL RESULTS:
----------------------------------------
RAGAS Metrics:
  Context Entity Recall: 0.0000
  Faithfulness: 0.9872
  Factual Correctness: 0.8287
  Response Relevancy: 0.8626
  Context Recall: 0.9615
  Noise Sensitivity: 0.1316

Cost & Performance:
  Total Cost: $0.2814
  Cost per Query: $0.0188
  Total Tokens: 1,570,529
  Tokens per Query: 104701.9
  Total Latency: 642.54 seconds
  Latency per Query: 42.84 seconds


#### Parent Document Retrieval

In [148]:
# Evaluate Parent-Document Retrieval Chain
print("=" * 80)
print("EVALUATING PARENT-DOCUMENT RETRIEVAL CHAIN") 
print("=" * 80)

# Create a copy of the golden dataset for Parent-Document evaluation
parent_document_dataset = [EvaluationSample({
    "user_input": sample.user_input,
    "reference_contexts": sample.reference_contexts,
    "reference": sample.reference,
    "response": None,
    "retrieved_contexts": None
}) for sample in evaluation_dataset]

# Run evaluation with comprehensive tracking
parent_document_results = evaluate_rag_chain_with_tracking(
    chain=parent_document_retrieval_chain,
    dataset=parent_document_dataset,
    method_name="Parent-Document"
)

print("\n📊 PARENT-DOCUMENT RETRIEVAL RESULTS:")
print("-" * 40)
print("RAGAS Metrics:")
for metric in ['context_entity_recall', 'faithfulness', 'factual_correctness', 'response_relevancy', 'context_recall', 'noise_sensitivity']:
    print(f"  {metric.replace('_', ' ').title()}: {parent_document_results[metric]:.4f}")

print("\nCost & Performance:")
print(f"  Total Cost: ${parent_document_results['total_cost_usd']:.4f}")
print(f"  Cost per Query: ${parent_document_results['cost_per_query']:.4f}")
print(f"  Total Tokens: {parent_document_results['total_tokens']:,}")
print(f"  Tokens per Query: {parent_document_results['tokens_per_query']:.1f}")
print(f"  Total Latency: {parent_document_results['total_latency_seconds']:.2f} seconds")
print(f"  Latency per Query: {parent_document_results['latency_per_query']:.2f} seconds")

EVALUATING PARENT-DOCUMENT RETRIEVAL CHAIN
🚀 Evaluating Parent-Document retrieval chain...
📊 Test cases: 15
   Step 1: Generating responses...
      Processing question 1/15
      Processing question 6/15
      Processing question 11/15
   Step 2: Creating RAGAS DataFrame...
   Step 3: Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

   Step 4: Processing RAGAS results...
✅ Parent-Document evaluation completed!
   💰 Total Cost: $0.1089
   🔢 Total Tokens: 622,929
   ⏱️ Total Time: 213.86 seconds
   📊 Average RAGAS Score: 0.6204
   🔍 Available RAGAS columns: ['user_input', 'retrieved_contexts', 'reference_contexts', 'response', 'reference', 'context_recall', 'faithfulness', 'factual_correctness(mode=f1)', 'answer_relevancy', 'noise_sensitivity(mode=relevant)']

📊 PARENT-DOCUMENT RETRIEVAL RESULTS:
----------------------------------------
RAGAS Metrics:
  Context Entity Recall: 0.0000
  Faithfulness: 0.9939
  Factual Correctness: 0.8033
  Response Relevancy: 0.8593
  Context Recall: 0.9333
  Noise Sensitivity: 0.1323

Cost & Performance:
  Total Cost: $0.1089
  Cost per Query: $0.0073
  Total Tokens: 622,929
  Tokens per Query: 41528.6
  Total Latency: 213.86 seconds
  Latency per Query: 14.26 seconds


#### Contextual Compression Retrieval

In [149]:
# Evaluate Contextual Compression Retrieval Chain
print("=" * 80)
print("EVALUATING CONTEXTUAL COMPRESSION RETRIEVAL CHAIN") 
print("=" * 80)

# Create a copy of the golden dataset for Contextual Compression evaluation
contextual_compression_dataset = [EvaluationSample({
    "user_input": sample.user_input,
    "reference_contexts": sample.reference_contexts,
    "reference": sample.reference,
    "response": None,
    "retrieved_contexts": None
}) for sample in evaluation_dataset]

# Run evaluation with comprehensive tracking
contextual_compression_results = evaluate_rag_chain_with_tracking(
    chain=contextual_compression_retrieval_chain,
    dataset=contextual_compression_dataset,
    method_name="Contextual Compression"
)

print("\n📊 CONTEXTUAL COMPRESSION RETRIEVAL RESULTS:")
print("-" * 40)
print("RAGAS Metrics:")
for metric in ['context_entity_recall', 'faithfulness', 'factual_correctness', 'response_relevancy', 'context_recall', 'noise_sensitivity']:
    print(f"  {metric.replace('_', ' ').title()}: {contextual_compression_results[metric]:.4f}")

print("\nCost & Performance:")
print(f"  Total Cost: ${contextual_compression_results['total_cost_usd']:.4f}")
print(f"  Cost per Query: ${contextual_compression_results['cost_per_query']:.4f}")
print(f"  Total Tokens: {contextual_compression_results['total_tokens']:,}")
print(f"  Tokens per Query: {contextual_compression_results['tokens_per_query']:.1f}")
print(f"  Total Latency: {contextual_compression_results['total_latency_seconds']:.2f} seconds")
print(f"  Latency per Query: {contextual_compression_results['latency_per_query']:.2f} seconds")

EVALUATING CONTEXTUAL COMPRESSION RETRIEVAL CHAIN
🚀 Evaluating Contextual Compression retrieval chain...
📊 Test cases: 15
   Step 1: Generating responses...
      Processing question 1/15
      Processing question 6/15
      Processing question 11/15
   Step 2: Creating RAGAS DataFrame...
   Step 3: Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

   Step 4: Processing RAGAS results...
✅ Contextual Compression evaluation completed!
   💰 Total Cost: $0.0959
   🔢 Total Tokens: 537,380
   ⏱️ Total Time: 168.95 seconds
   📊 Average RAGAS Score: 0.6158
   🔍 Available RAGAS columns: ['user_input', 'retrieved_contexts', 'reference_contexts', 'response', 'reference', 'context_recall', 'faithfulness', 'factual_correctness(mode=f1)', 'answer_relevancy', 'noise_sensitivity(mode=relevant)']

📊 CONTEXTUAL COMPRESSION RETRIEVAL RESULTS:
----------------------------------------
RAGAS Metrics:
  Context Entity Recall: 0.0000
  Faithfulness: 0.9386
  Factual Correctness: 0.8253
  Response Relevancy: 0.9137
  Context Recall: 0.9000
  Noise Sensitivity: 0.1173

Cost & Performance:
  Total Cost: $0.0959
  Cost per Query: $0.0064
  Total Tokens: 537,380
  Tokens per Query: 35825.3
  Total Latency: 168.95 seconds
  Latency per Query: 11.26 seconds


#### Ensemble Retrieval

In [150]:
# Evaluate Ensemble Retrieval Chain
print("=" * 80)
print("EVALUATING ENSEMBLE RETRIEVAL CHAIN") 
print("=" * 80)

# Create a copy of the golden dataset for Ensemble evaluation
ensemble_dataset = [EvaluationSample({
    "user_input": sample.user_input,
    "reference_contexts": sample.reference_contexts,
    "reference": sample.reference,
    "response": None,
    "retrieved_contexts": None
}) for sample in evaluation_dataset]

# Run evaluation with comprehensive tracking
ensemble_results = evaluate_rag_chain_with_tracking(
    chain=ensemble_retrieval_chain,
    dataset=ensemble_dataset,
    method_name="Ensemble"
)

print("\n📊 ENSEMBLE RETRIEVAL RESULTS:")
print("-" * 40)
print("RAGAS Metrics:")
for metric in ['context_entity_recall', 'faithfulness', 'factual_correctness', 'response_relevancy', 'context_recall', 'noise_sensitivity']:
    print(f"  {metric.replace('_', ' ').title()}: {ensemble_results[metric]:.4f}")

print("\nCost & Performance:")
print(f"  Total Cost: ${ensemble_results['total_cost_usd']:.4f}")
print(f"  Cost per Query: ${ensemble_results['cost_per_query']:.4f}")
print(f"  Total Tokens: {ensemble_results['total_tokens']:,}")
print(f"  Tokens per Query: {ensemble_results['tokens_per_query']:.1f}")
print(f"  Total Latency: {ensemble_results['total_latency_seconds']:.2f} seconds")
print(f"  Latency per Query: {ensemble_results['latency_per_query']:.2f} seconds")

EVALUATING ENSEMBLE RETRIEVAL CHAIN
🚀 Evaluating Ensemble retrieval chain...
📊 Test cases: 15
   Step 1: Generating responses...
      Processing question 1/15
      Processing question 6/15
      Processing question 11/15
   Step 2: Creating RAGAS DataFrame...
   Step 3: Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

Exception raised in Job[49]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1-nano in organization org-JdudSPvfr6a6LEV8KvMepUOB on tokens per min (TPM): Limit 200000, Used 197898, Requested 11946. Please try again in 2.953s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[70]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1-nano in organization org-JdudSPvfr6a6LEV8KvMepUOB on tokens per min (TPM): Limit 200000, Used 199331, Requested 6430. Please try again in 1.728s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[11]: TimeoutError()
Exception raised in Job[23]: TimeoutError()
Exception raised in Job[29]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[4

   Step 4: Processing RAGAS results...
✅ Ensemble evaluation completed!
   💰 Total Cost: $0.3221
   🔢 Total Tokens: 1,863,670
   ⏱️ Total Time: 670.08 seconds
   📊 Average RAGAS Score: 0.6259
   🔍 Available RAGAS columns: ['user_input', 'retrieved_contexts', 'reference_contexts', 'response', 'reference', 'context_recall', 'faithfulness', 'factual_correctness(mode=f1)', 'answer_relevancy', 'noise_sensitivity(mode=relevant)']

📊 ENSEMBLE RETRIEVAL RESULTS:
----------------------------------------
RAGAS Metrics:
  Context Entity Recall: 0.0000
  Faithfulness: 0.9973
  Factual Correctness: 0.8333
  Response Relevancy: 0.7936
  Context Recall: 1.0000
  Noise Sensitivity: 0.1315

Cost & Performance:
  Total Cost: $0.3221
  Cost per Query: $0.0215
  Total Tokens: 1,863,670
  Tokens per Query: 124244.7
  Total Latency: 670.08 seconds
  Latency per Query: 44.67 seconds


#### Semantic Chunking 

In [151]:
# Evaluate Semantic Chunking Retrieval Chain
print("=" * 80)
print("EVALUATING SEMANTIC CHUNKING RETRIEVAL CHAIN") 
print("=" * 80)

# Create a copy of the golden dataset for Semantic Chunking evaluation
semantic_dataset = [EvaluationSample({
    "user_input": sample.user_input,
    "reference_contexts": sample.reference_contexts,
    "reference": sample.reference,
    "response": None,
    "retrieved_contexts": None
}) for sample in evaluation_dataset]

# Run evaluation with comprehensive tracking
semantic_results = evaluate_rag_chain_with_tracking(
    chain=semantic_retrieval_chain,
    dataset=semantic_dataset,
    method_name="Semantic Chunking"
)

print("\n📊 SEMANTIC CHUNKING RETRIEVAL RESULTS:")
print("-" * 40)
print("RAGAS Metrics:")
for metric in ['context_entity_recall', 'faithfulness', 'factual_correctness', 'response_relevancy', 'context_recall', 'noise_sensitivity']:
    print(f"  {metric.replace('_', ' ').title()}: {semantic_results[metric]:.4f}")

print("\nCost & Performance:")
print(f"  Total Cost: ${semantic_results['total_cost_usd']:.4f}")
print(f"  Cost per Query: ${semantic_results['cost_per_query']:.4f}")
print(f"  Total Tokens: {semantic_results['total_tokens']:,}")
print(f"  Tokens per Query: {semantic_results['tokens_per_query']:.1f}")
print(f"  Total Latency: {semantic_results['total_latency_seconds']:.2f} seconds")
print(f"  Latency per Query: {semantic_results['latency_per_query']:.2f} seconds")

EVALUATING SEMANTIC CHUNKING RETRIEVAL CHAIN
🚀 Evaluating Semantic Chunking retrieval chain...
📊 Test cases: 15
   Step 1: Generating responses...
      Processing question 1/15
      Processing question 6/15
      Processing question 11/15
   Step 2: Creating RAGAS DataFrame...
   Step 3: Running RAGAS evaluation...


Evaluating:   0%|          | 0/90 [00:00<?, ?it/s]

Exception raised in Job[59]: ValueError(setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10,) + inhomogeneous part.)
Exception raised in Job[41]: ValueError(setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10,) + inhomogeneous part.)


   Step 4: Processing RAGAS results...
✅ Semantic Chunking evaluation completed!
   💰 Total Cost: $0.2005
   🔢 Total Tokens: 1,024,934
   ⏱️ Total Time: 438.78 seconds
   📊 Average RAGAS Score: 0.6398
   🔍 Available RAGAS columns: ['user_input', 'retrieved_contexts', 'reference_contexts', 'response', 'reference', 'context_recall', 'faithfulness', 'factual_correctness(mode=f1)', 'answer_relevancy', 'noise_sensitivity(mode=relevant)']

📊 SEMANTIC CHUNKING RETRIEVAL RESULTS:
----------------------------------------
RAGAS Metrics:
  Context Entity Recall: 0.0000
  Faithfulness: 0.9846
  Factual Correctness: 0.8473
  Response Relevancy: 0.8683
  Context Recall: 0.9667
  Noise Sensitivity: 0.1718

Cost & Performance:
  Total Cost: $0.2005
  Cost per Query: $0.0134
  Total Tokens: 1,024,934
  Tokens per Query: 68328.9
  Total Latency: 438.78 seconds
  Latency per Query: 29.25 seconds


## Generating final report

In [153]:
# Comprehensive Comparison: All 7 RAG Approaches
import pandas as pd

print("=" * 100)
print("COMPREHENSIVE COMPARISON: ALL 7 RAG RETRIEVAL APPROACHES")
print("=" * 100)

# Collect all results (NOTE: You need to run all evaluation cells above first)
all_results = {
    "Naive": naive_results,
    "BM25": bm25_results,
    "Multi-Query": multi_query_results,
    "Parent-Document": parent_document_results,
    "Contextual Compression": contextual_compression_results,
    "Ensemble": ensemble_results,
    "Semantic Chunking": semantic_results
}

# Create comprehensive comparison data
comparison_data = {
    "RAG Approach": [],
    "Context Entity Recall": [],
    "Faithfulness": [],
    "Factual Correctness": [],
    "Response Relevancy": [],
    "Context Recall": [],
    "Noise Sensitivity": [],
    "Average RAGAS Score": [],
    "Total Cost (USD)": [],
    "Cost per Query (USD)": [],
    "Total Tokens": [],
    "Tokens per Query": [],
    "Total Latency (seconds)": [],
    "Latency per Query (seconds)": [],
    "Cost Efficiency (Score/USD)": [],
    "Speed Efficiency (Score/second)": []
}

# Populate data for each approach
for approach_name, results in all_results.items():
    avg_ragas_score = sum([results[metric] for metric in ['context_entity_recall', 'faithfulness', 'factual_correctness', 'response_relevancy', 'context_recall', 'noise_sensitivity']]) / 6
    
    comparison_data["RAG Approach"].append(approach_name)
    comparison_data["Context Entity Recall"].append(f"{results['context_entity_recall']:.4f}")
    comparison_data["Faithfulness"].append(f"{results['faithfulness']:.4f}")
    comparison_data["Factual Correctness"].append(f"{results['factual_correctness']:.4f}")
    comparison_data["Response Relevancy"].append(f"{results['response_relevancy']:.4f}")
    comparison_data["Context Recall"].append(f"{results['context_recall']:.4f}")
    comparison_data["Noise Sensitivity"].append(f"{results['noise_sensitivity']:.4f}")
    comparison_data["Average RAGAS Score"].append(f"{avg_ragas_score:.4f}")
    comparison_data["Total Cost (USD)"].append(f"${results['total_cost_usd']:.4f}")
    comparison_data["Cost per Query (USD)"].append(f"${results['cost_per_query']:.4f}")
    comparison_data["Total Tokens"].append(f"{results['total_tokens']:,}")
    comparison_data["Tokens per Query"].append(f"{results['tokens_per_query']:.1f}")
    comparison_data["Total Latency (seconds)"].append(f"{results['total_latency_seconds']:.2f}")
    comparison_data["Latency per Query (seconds)"].append(f"{results['latency_per_query']:.2f}")
    comparison_data["Cost Efficiency (Score/USD)"].append(f"{avg_ragas_score / results['total_cost_usd'] if results['total_cost_usd'] > 0 else 0:.2f}")
    comparison_data["Speed Efficiency (Score/second)"].append(f"{avg_ragas_score / results['total_latency_seconds'] if results['total_latency_seconds'] > 0 else 0:.4f}")

# Create comprehensive comparison DataFrame
comprehensive_df = pd.DataFrame(comparison_data)

print("📊 COMPREHENSIVE EVALUATION RESULTS:")
print("=" * 100)
print(comprehensive_df.to_string(index=False))

# Performance analysis
print("\n🏆 PERFORMANCE ANALYSIS:")
print("=" * 50)

# Find best performing approach by average RAGAS score
ragas_scores = [sum([all_results[approach][metric] for metric in ['context_entity_recall', 'faithfulness', 'factual_correctness', 'response_relevancy', 'context_recall', 'noise_sensitivity']]) / 6 for approach in all_results.keys()]
best_approach_idx = ragas_scores.index(max(ragas_scores))
best_approach = list(all_results.keys())[best_approach_idx]
best_score = max(ragas_scores)

print(f"🥇 BEST OVERALL PERFORMANCE: {best_approach}")
print(f"   Average RAGAS Score: {best_score:.4f}")

# Find most cost-effective approach
cost_effectiveness = [ragas_scores[i] / all_results[list(all_results.keys())[i]]['total_cost_usd'] if all_results[list(all_results.keys())[i]]['total_cost_usd'] > 0 else 0 for i in range(len(ragas_scores))]
most_cost_effective_idx = cost_effectiveness.index(max(cost_effectiveness))
most_cost_effective = list(all_results.keys())[most_cost_effective_idx]

print(f"\n💰 MOST COST-EFFECTIVE: {most_cost_effective}")
print(f"   Cost Efficiency: {max(cost_effectiveness):.2f} score per USD")

# Find fastest approach
latencies = [all_results[approach]['total_latency_seconds'] for approach in all_results.keys()]
fastest_approach_idx = latencies.index(min(latencies))
fastest_approach = list(all_results.keys())[fastest_approach_idx]

print(f"\n⚡ FASTEST APPROACH: {fastest_approach}")
print(f"   Total Latency: {min(latencies):.2f} seconds")

# Individual metric winners
print(f"\n🎯 INDIVIDUAL METRIC LEADERS:")
metrics = ['context_entity_recall', 'faithfulness', 'factual_correctness', 'response_relevancy', 'context_recall', 'noise_sensitivity']
# Noise sensitivity is the one metric here where a lower value is better
lower_is_better = {'noise_sensitivity'}
for metric in metrics:
    metric_scores = [all_results[approach][metric] for approach in all_results.keys()]
    # Pick the minimum for lower-is-better metrics, the maximum otherwise
    leader_score = min(metric_scores) if metric in lower_is_better else max(metric_scores)
    best_metric_idx = metric_scores.index(leader_score)
    best_metric_approach = list(all_results.keys())[best_metric_idx]
    print(f"   {metric.replace('_', ' ').title()}: {best_metric_approach} ({leader_score:.4f})")

print("\n" + "=" * 100)
print("🎉 EVALUATION COMPLETE: All 7 RAG approaches have been comprehensively evaluated!")
print("=" * 100)

COMPREHENSIVE COMPARISON: ALL 7 RAG RETRIEVAL APPROACHES
📊 COMPREHENSIVE EVALUATION RESULTS:
          RAG Approach Context Entity Recall Faithfulness Factual Correctness Response Relevancy Context Recall Noise Sensitivity Average RAGAS Score Total Cost (USD) Cost per Query (USD) Total Tokens Tokens per Query Total Latency (seconds) Latency per Query (seconds) Cost Efficiency (Score/USD) Speed Efficiency (Score/second)
                 Naive                0.0000       1.0000              0.8487             0.9284         0.9667            0.0722              0.6360          $0.2096              $0.0140    1,174,695          78313.0                  420.01                       28.00                        3.03                          0.0015
                  BM25                0.0000       1.0000              0.8327             0.7972         0.9667            0.0661              0.6104          $0.1112              $0.0074      652,058          43470.5                  273.25      

For a better-formatted version of this table, see the Google Sheet: https://docs.google.com/spreadsheets/d/1kXyngWvo80IcnFxfVWKmefz9HLa9gzRvztBHjG4YrYg/edit?gid=605947219#gid=605947219

# Analysis

The results are really interesting. 

* Right off the bat, **Context Entity Recall** was 0 for every approach. I was pretty perplexed by this, so I checked with Cursor (see PROMPTS.md - Prompt 5). The key idea behind Context Entity Recall is, of course, entities. The reason we see zero for this metric is that many of the entities in the data are **anonymized** because they may be PII, so a value of zero is **to be expected**. One caveat: the "Available RAGAS columns" output above does not list a context entity recall column, so the zero may also be a default value rather than a computed score.
* Next is **Response Relevancy**, as it is ultimately what we are looking for in a RAG application. Interestingly, the Naive approach has the best score (0.9284), but not by much over the next best, Contextual Compression (0.9137). This makes sense because the documents are small, so over-optimizing the chunking or retrieval could be counterproductive. But let's look at cost and latency: Contextual Compression cut total latency by **more than half** (168.95 seconds vs. 420.01 seconds for Naive). The tracked cost is also roughly half ($0.0959 vs. $0.2096 for the whole run), but that does not include the cost of using Cohere's model, so that needs to be factored in, and the numbers could change if pricing changes.
* Next is **Faithfulness**. Both Naive and BM25 have a score of 1, and all the rest are close (0.98-0.99) except for *Contextual Compression* (0.94). My hunch is that, compared to the previous exercise with PDF documents, we are now dealing with customer complaints, and because each document is small, the data does not lend itself well to the extra step of compression. I checked with Cursor on this, and its answer was in line with that hunch. According to Cursor (see Prompt 6), we are doing more harm than good because (a) with a limited context pool, reranking removes useful information, and (b) over-optimization filters out context that is actually needed. In essence, less context means less faithful answers.
* **Factual Correctness** - Most of the approaches are close. Parent-Document is the worst (0.8033), but not by much. That makes sense because the documents are so small that splitting them even further actually hurts the overall Factual Correctness score.
* **Context Recall** - Ensemble had a perfect score (1.0), while Contextual Compression had the lowest score at 0.90. We saw this in the last exercise too: just because the retrieved chunks are semantically similar doesn't mean they are relevant to the question.
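* **Noise Sensitivity** - Naive (0.0722) and BM25 (0.0661) have the lowest scores, well below, for example, Contextual Compression (0.1173) and Semantic Chunking (0.1718). In Ragas, a lower Noise Sensitivity is better, so Naive and BM25 are actually the least affected by noisy context here, not the worst. A possible explanation: (a) BM25 is keyword search rather than semantic similarity, so it works best when the query contains important keywords (otherwise it could be taken for a ride); (b) all the other methods try to get more semantically similar chunks into the context than the Naive approach does, which gives the model more material to be misled by, so it makes sense that the Naive approach has a lower Noise Sensitivity.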
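
To sanity-check the comparisons above, here is a minimal sketch that recomputes them from the numbers printed in the results. It is not part of the evaluation pipeline: `results`, `HIGHER_IS_BETTER`, and `adjusted_score` are hypothetical names, and inverting Noise Sensitivity as `1 - value` is an assumption made purely for illustration (Ragas treats lower Noise Sensitivity as better). Context Entity Recall is dropped because it is 0 everywhere, and Multi-Query is left out because its row is not visible in the truncated table.

```python
# Numbers copied from the printed results above. Multi-Query is omitted
# because its row is not visible in the truncated comparison table.
results = {
    "Naive": dict(faithfulness=1.0000, factual_correctness=0.8487,
                  response_relevancy=0.9284, context_recall=0.9667,
                  noise_sensitivity=0.0722, cost=0.2096, latency=420.01),
    "BM25": dict(faithfulness=1.0000, factual_correctness=0.8327,
                 response_relevancy=0.7972, context_recall=0.9667,
                 noise_sensitivity=0.0661, cost=0.1112, latency=273.25),
    "Parent-Document": dict(faithfulness=0.9939, factual_correctness=0.8033,
                            response_relevancy=0.8593, context_recall=0.9333,
                            noise_sensitivity=0.1323, cost=0.1089, latency=213.86),
    "Contextual Compression": dict(faithfulness=0.9386, factual_correctness=0.8253,
                                   response_relevancy=0.9137, context_recall=0.9000,
                                   noise_sensitivity=0.1173, cost=0.0959, latency=168.95),
    "Ensemble": dict(faithfulness=0.9973, factual_correctness=0.8333,
                     response_relevancy=0.7936, context_recall=1.0000,
                     noise_sensitivity=0.1315, cost=0.3221, latency=670.08),
    "Semantic Chunking": dict(faithfulness=0.9846, factual_correctness=0.8473,
                              response_relevancy=0.8683, context_recall=0.9667,
                              noise_sensitivity=0.1718, cost=0.2005, latency=438.78),
}

# Metrics where a higher value is better.
HIGHER_IS_BETTER = ["faithfulness", "factual_correctness",
                    "response_relevancy", "context_recall"]


def adjusted_score(r):
    """Mean of the higher-is-better metrics plus inverted noise sensitivity."""
    parts = [r[m] for m in HIGHER_IS_BETTER]
    # Assumption: flip noise sensitivity so that higher means better.
    parts.append(1.0 - r["noise_sensitivity"])
    return sum(parts) / len(parts)


adjusted = {name: adjusted_score(r) for name, r in results.items()}
best = max(adjusted, key=adjusted.get)

# Cost and latency relative to the Naive baseline (tracked cost only,
# so Cohere rerank charges are not included).
baseline = results["Naive"]
for name, r in results.items():
    print(f"{name:<24} adjusted={adjusted[name]:.4f} "
          f"cost x{r['cost'] / baseline['cost']:.2f} "
          f"latency x{r['latency'] / baseline['latency']:.2f}")
print("Best adjusted score:", best)
```

Under these assumptions, Naive ranks first among the six approaches listed, and Contextual Compression runs at roughly 0.46x the tracked cost and 0.40x the latency of Naive. A different way of combining the metrics could of course change the ranking.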