# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [2]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [3]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [4]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [5]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each complaint as a document, and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [6]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
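
To make the "cosine similarity, top `k`" idea concrete, here is a minimal pure-Python sketch. It is not how Qdrant implements search; the vectors are made-up 3-dimensional toys (real `text-embedding-3-small` vectors have 1536 dimensions), and `cosine_similarity`, `top_k`, `toy_doc_vecs`, and `toy_query_vec` are hypothetical names used only for illustration.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    # Score every document against the query, then keep the k best indices.
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy "embeddings" (assumption: hand-written values, not model output).
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vec = [1.0, 0.0, 0.1]

nearest = top_k(toy_query_vec, toy_doc_vecs, k=2)
print(nearest)
```

A real vector store avoids scoring every document by using an approximate nearest-neighbor index, but the ranking criterion is the same.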

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [7]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [8]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [9]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry-picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [10]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided data, the most common issues with loans appear to involve problems with loan servicing, such as errors in loan balances, misapplied payments, wrongful denials of payment plans, and issues related to loan status and reporting. Many complaints also highlight difficulties in dealing with lenders or servicers, including lack of communication, incorrect information, and handling of payment options like forbearance or deferment.\n\nIn summary, the most common issue seems to be **problems with loan servicing**, including administrative errors, miscommunication, and mishandling of payments and account information.'

In [11]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes, some complaints did not get handled in a timely manner. Specifically, at least one complaint, submitted to MOHELA on 03/28/25 regarding a delayed application process, was marked as "Not timely," indicating a delay in handling. Additionally, multiple complaints related to ongoing issues such as unresponsive customer service, unresolved account errors, and delays in dispute responses suggest that there have been instances where complaints were not addressed promptly.'

In [12]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for several reasons, including:\n\n1. **Accumulation of interest during deferment or forbearance:** Borrowers often found that interest continued to grow even when payments were postponed, making repayment more difficult over time.\n\n2. **Lack of clear or timely communication from lenders or servicers:** Many borrowers were unaware of when their repayment obligations resumed or had difficulty understanding their loan status due to poor or inconsistent information.\n\n3. **Financial hardship and inability to afford payments:** Borrowers reported that increasing monthly payments would conflict with their basic living expenses, such as bills, food, and transportation.\n\n4. **Mismanagement and administrative issues:** Cases of improper loan transfers, incorrect reporting, and lack of access to accounts or understanding of balances contributed to repayment difficulties.\n\n5. **Systemic issues and lack of support:** Many borrowers felt misled about t

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set up! Let's see it happen down below!


In [13]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)
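
For intuition about what a BM25 score measures, here is a minimal pure-Python sketch of the scoring formula. It is an illustration only, not the implementation behind `BM25Retriever`: the whitespace tokenizer, the `k1=1.5` and `b=0.75` defaults, and the non-negative IDF variant are assumptions, and `bm25_scores` and `toy_docs` are hypothetical names.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Naive tokenization: lowercase and split on whitespace (assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates, and long documents are penalized.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores

toy_docs = [
    "nelnet misapplied my payment",
    "my servicer never answered the phone",
    "interest kept growing during forbearance",
]
toy_scores = bm25_scores("nelnet payment", toy_docs)
print(toy_scores)
```

Only the first toy document shares words with the query, so it is the only one with a non-zero score. This is the "exact words" behavior that separates BM25 from embedding-based retrieval.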

We'll construct the same chain - only changing the retriever.

In [14]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [15]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

"Based on the provided context, the most common issue with loans appears to be related to dealing with lenders or servicers, particularly issues with the accuracy or transparency of information, handling of payments, and problematic practices like applying payments in a way that prolongs debt or charging unexpected fees. Specific recurring issues include:\n\n- Disputes over fees charged or incorrect fees\n- Difficulties in obtaining accurate loan or payment information\n- Problems with how payments are applied (e.g., not applying to principal)\n- Inaccurate or bad information about loans\n- Issues stemming from loan servicers' practices that seem predatory or untrustworthy\n\nWhile the context focuses on complaints about federal student loan servicers, these issues reflect broader common challenges in the loan process.\n\nIf you need a specific concise answer:  \nThe most common issue with loans is difficulties and disputes related to loan servicing, including miscommunication, imprope

In [16]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all the complaints listed were responded to in a timely manner, as indicated by the "Timely response?" field being marked "Yes" for each case. Therefore, there is no evidence that any complaints were not handled in a timely manner.'

In [17]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including issues with the handling of their payment plans, lack of communication from the servicers, and problems with their payments being reversed or not properly processed. Some specific causes include:\n\n- Being steered into the wrong types of forbearance or having their loan transferred without proper notification.\n- Automatic payments being discontinued or not set up correctly, leading to missed or late payments.\n- Lack of response or help from loan servicers when borrowers seek assistance or file for deferments or forbearances.\n- Poor communication and notification from loan servicers regarding payment status, due dates, or account changes, sometimes leading to negative impacts on credit scores.\n- Servicers submitting accounts as past due without proper notice, even though borrowers made regular payments.\n\nOverall, these issues point to failures on the part of loan servicers to effectively communicate with borrow

It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: We do. The second half of the notebook will cover this.)

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

</div>

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### Answer:

- Example Query: Find complaints about JuanHand online lending app payment processing errors.
- BM25 excels at finding documents containing the exact term. Embeddings might return semantically similar results about other loan services like Billease, but when you need information about JuanHand specifically, exact matching is crucial.

</div>

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [18]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)
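
The "retrieve wide, then keep only the best few" pattern can be sketched without any API calls. In the sketch below, `toy_overlap_score` is a hypothetical stand-in for a real reranking model (it only counts shared words, which a model like Cohere's Rerank does not do), and `rerank_and_compress` and the toy candidates are likewise made up for illustration.

```python
def toy_overlap_score(query, doc):
    # Hypothetical relevance score: fraction of query words found in the document.
    query_terms = set(query.lower().split())
    doc_terms = set(doc.lower().split())
    return len(query_terms & doc_terms) / len(query_terms)

def rerank_and_compress(query, candidates, score_fn, top_n):
    # Re-score every candidate against the query, then keep only the top_n.
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

# Pretend these came back from a first-stage retriever with a large k.
toy_candidates = [
    "autopay failed for three weeks",
    "my late payment was reported to the credit bureau",
    "servicer changed without notice",
    "payment was late because autopay failed",
]

compressed = rerank_and_compress(
    "late payment autopay", toy_candidates, toy_overlap_score, top_n=2
)
print(compressed)
```

The first stage favors recall (many candidates) and the second stage favors precision (few, well-ordered documents), which is why we kept `k` at 10 for the base retriever.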

Let's create our chain again, and see how this does!

In [19]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [20]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints, a common issue with loans, particularly student loans, appears to be problems related to dealing with lenders or servicers, including errors in loan balances, misapplied payments, wrongful denials of payment plans, and mishandling of account information. Many complaints also involve difficulties with understanding repayment options, accumulating interest, and issues with loan disclosures or privacy violations. \n\nIn summary, a most common issue is mismanagement or mishandling by loan servicers, leading to errors, confusion, and financial hardship for borrowers.'

In [21]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, there are indications that some complaints did not get handled in a timely manner. For example:\n\n- One complaint has been open since around 18 months without resolution, involving issues like account reviews and violations.\n- Another complaint involving legal violations has not received follow-up from the company.\n- A customer reported that issues with their auto-pay setup persisted for over 2-3 weeks despite multiple calls.\n\nWhile the company responses in these cases state "Yes" for timely response, the ongoing unresolved issues and extended delays suggest that some complaints were indeed not addressed in a timely manner.'

In [22]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People often fail to pay back their loans due to a combination of factors, including lack of clear information about repayment obligations, difficulties with managing interest accumulation, and limited options for affordable repayment plans. In some cases, borrowers are unaware that they need to repay their loans or are not properly notified about changes in loan servicing or ownership, leading to missed payments or confusion about their balances. Additionally, high interest rates, compounded over time, can make it challenging to reduce the principal, especially if the borrower can only afford minimal payments, causing the debt to grow or remain unpaid. Financial hardships, lack of guidance from loan servicers, and limited access to affordable payment options like income-driven plans can ultimately contribute to borrowers' inability to fully repay their loans."

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
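
The three steps above can be sketched in plain Python. This is a simplified illustration, not the `MultiQueryRetriever` source: the query variants are hard-coded rather than generated by an LLM, and `multi_query_retrieve`, `toy_index`, and the `doc_*` identifiers are hypothetical.

```python
def multi_query_retrieve(query_variants, retrieve_fn):
    # Run every query variant and keep the union of results,
    # dropping duplicates while preserving first-seen order.
    seen = set()
    unique_docs = []
    for variant in query_variants:
        for doc in retrieve_fn(variant):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

# Stand-in for a retriever: each query variant maps to the documents it would return.
toy_index = {
    "payment problems": ["doc_a", "doc_b"],
    "billing issues": ["doc_b", "doc_c"],
    "misapplied payments": ["doc_d", "doc_a"],
}

# Stand-in for step 1: in practice an LLM would write these reformulations.
toy_variants = list(toy_index)

merged = multi_query_retrieve(toy_variants, lambda q: toy_index.get(q, []))
print(merged)
```

No single variant retrieves all four documents, but their union does. That is the recall gain, and the extra documents are also where the added noise can come from.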

So, how hard is it to set up? Not bad! Let's see it down below!



In [23]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [24]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [25]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issues with loans include:\n\n- Trouble with how payments are being handled, such as difficulty applying extra funds to principal or_payments being misapplied (e.g., primarily to interest).\n- Problems with loan servicing, including errors in balances, misapplied payments, and wrongful denials of payment plans.\n- Issues with loan balances increasing unexpectedly due to mismanagement or improper handling of interest (e.g., forbearance steering leading to interest capitalization).\n- Inaccurate or misleading credit reporting related to loan status.\n- Lack of proper communication from servicers about loan status, default notices, or changes in repayment terms.\n- Problems with accessing and verifying loan information, including identity or fraud concerns.\n- Servicer misconduct such as neglecting borrower rights, mishandling documents, or failing to process requests for deferment, consolidation, or forgiveness.\n\nOverall, the most recurri

In [26]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, it appears that many complaints were handled in a timely manner, with responses marked as "Yes" for timely response, and some complaints explicitly stating responses occurred within required timeframes. \n\nHowever, there are also several complaints where responses were delayed or not received within the expected time, including cases marked as "No" for timely response, where complainants experienced wait times exceeding 15 days or where no response was received for over a year.\n\nTherefore, the answer is: **Yes, some complaints did not get handled in a timely manner.**'

In [27]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for various reasons, including:\n\n1. Lack of clear information about the true cost of loans and interest accrual, leading borrowers to underestimate how much they owed.\n2. Difficulty managing interest that continued to accumulate during forbearance or deferment, making loans harder to pay off over time.\n3. Limited or confusing repayment options, with some borrowers not qualifying for loan forgiveness programs and feeling misled about available assistance.\n4. Financial hardships such as unemployment, medical issues, or homelessness, which made payments impossible.\n5. Challenges with loan servicers providing inaccurate or misleading information, poor communication, or mishandling accounts, which hindered borrowers' ability to manage or understand their loans.\n6. Errors or discrepancies in loan reporting and account status that negatively impacted credit scores and created additional obstacles to repayment.\n7. Experiences of being pushed into 

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

</div>

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### Answer:

- A single user query might miss relevant documents due to vocabulary mismatch. For example, if a user asks "payment problems", the LLM might generate variations like billing issues, transaction errors, payment processing difficulties, or account charge problems. By combining results from all query variations, you get the union of relevant documents rather than being limited to what a single query formulation can retrieve. This significantly improves recall at the potential cost of some precision.

</div>

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
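
Here is a minimal sketch of the "search small, return big" bookkeeping. It makes several simplifying assumptions: chunks are split by word count instead of with a `RecursiveCharacterTextSplitter`, similarity is a toy word-overlap count instead of vector similarity, and all names (`split_into_children`, `retrieve_parents`, `toy_parents`) are hypothetical.

```python
def split_into_children(parent_docs, words_per_chunk):
    # Each child chunk remembers the id of the parent it came from.
    children = []
    for parent_id, text in parent_docs.items():
        words = text.split()
        for start in range(0, len(words), words_per_chunk):
            chunk = " ".join(words[start:start + words_per_chunk])
            children.append((chunk, parent_id))
    return children

def retrieve_parents(query, children, parent_docs, k):
    # Toy similarity (assumption): number of lowercase words shared with the query.
    query_terms = set(query.lower().split())
    ranked = sorted(
        children,
        key=lambda child: len(query_terms & set(child[0].lower().split())),
        reverse=True,
    )
    # Map the best child chunks back to their parents, without duplicates.
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parent_docs[parent_id] for parent_id in parent_ids]

toy_parents = {
    "p1": "My autopay was cancelled without notice. Later the servicer reported me as delinquent.",
    "p2": "I asked for an income driven repayment plan. The application was lost twice.",
}
toy_children = split_into_children(toy_parents, words_per_chunk=7)
results = retrieve_parents("autopay cancelled", toy_children, toy_parents, k=1)
print(results)
```

The query matches only the first chunk of `p1`, but the full parent comes back, including the sentence about being reported as delinquent that the matching chunk did not contain.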

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [28]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [29]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [30]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [31]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [32]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [33]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be problems related to mismanagement and lack of transparency. Specific issues include:\n\n- Struggling to repay loans due to financial hardship and inadequate information about loan terms or consequences.\n- Receiving bad or incomplete information about loans from lenders or servicers.\n- Unauthorized or unexpected changes to loan terms, such as interest rate increases, incorrect balances, or lack of disclosure during processes like loan consolidation.\n- Systemic breakdowns impacting credit reporting and loan servicing, leading to errors and negative credit score impacts.\n\nIn summary, the most common issue involves borrowers facing difficulties because of insufficient or misleading information, mismanagement, or systemic errors in loan handling and reporting.'

In [34]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, several complaints indicate that issues were not handled in a timely manner. For example:\n\n- One complaint from 03/28/25 mentions that the consumer was told their application would be expedited and they would be contacted within 15 days, but as of the date of the complaint, no one had reached out.\n- Multiple complaints from 04/11/25 describe long wait times (up to 7 hours or more) to speak with representatives and mention that issues remained unresolved or unaddressed for extended periods.\n- Also, a complaint from 04/24/25 highlights repeated failed attempts to correct errors, with no resolution and ongoing difficulties, indicating delays and unresponsiveness.\n\nFurthermore, the metadata shows that for some complaints the response was marked "No" for timely response, which suggests that some issues did not get handled promptly.\n\nTherefore, yes, some complaints were not handled in a timely manner.'

In [35]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to financial hardship caused by inadequate post-graduation employment prospects, misrepresentations about the value of their education, and lack of transparency from the educational institutions and loan servicers. For example, one complainant was unable to secure employment in their field after attending a college that closed and did not provide real career support, leading to severe financial difficulties and difficulty in making loan payments. Others experienced issues such as being required to start payments before the grace period ended, or having their payment obligations miscommunicated or improperly managed by loan servicers. Additionally, some faced complications like unverified or disputed debt reporting, which further hindered their ability to manage or repay their loans. Overall, these issues highlight a combination of systemic mismanagement, lack of clear information, and personal financial hardships as reasons why indiv

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes 2, or more, retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.
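
As a rough sketch of how Reciprocal Rank Fusion combines rankings: each document earns `weight / (c + rank)` from every retriever that returned it, and the sums decide the final order. The constant `c=60` follows the linked paper; the function name `weighted_rrf` and the toy rankings are hypothetical, and the details may differ from what `EnsembleRetriever` does internally.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    # Sum weight / (c + rank) over every list a document appears in.
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Toy rankings from two retrievers (assumption: made-up document ids).
toy_bm25_ranking = ["a", "b", "c"]
toy_dense_ranking = ["b", "d", "a"]

fused = weighted_rrf([toy_bm25_ranking, toy_dense_ranking], weights=[0.5, 0.5])
print(fused)
```

Because only ranks are used, the retrievers' raw scores never need to be on the same scale, which is what makes it practical to mix BM25 with embedding-based retrievers.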

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [36]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [37]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [38]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issues with loans, based on the complaints data, include:\n\n- Dealing with lender or servicer issues such as mismanagement, incorrect reporting, or poor communication.\n- Problems with how payments are being handled, including inability to allocate payments properly or being steered into high-interest forbearances.\n- Errors and inaccuracies in loan balances, interest charges, or credit reporting.\n- Lack of transparency or proper documentation, including missing or improperly managed loan records.\n- Problems related to loan forgiveness, discharge, or discharge eligibility.\n- Unauthorized transfer of loans and improper handling of loan data.\n- Poor customer service and failure to resolve disputes effectively.\n- Issues with interest accrual and improper interest charges.\n- Misleading or incomplete information about loan terms and repayment options.\n\nIn summary, the most common and recurring issue appears to be the mishandling of loan servicing, including mismana

In [39]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, some complaints indicate that issues were not handled in a timely manner. For example:\n\n- Complaint ID 12973003 (from EdFinancial Services, NJ): The customer mentioned that it is over 2-3 weeks and they are still experiencing the same issue, which suggests a delay beyond the promised timeframe.\n- Complaint ID 12935889 (from MOHELA, CO): The complainant was dissatisfied with the response time, as their complaint was not resolved promptly.\n- Complaint ID 12410063 (from EdFinancial Services, CA): The customer reported ongoing struggles over nearly two and a half years, with repeated follow-ups and delays in resolution.\n\nAdditionally, some complaints explicitly mention delays or failure to resolve issues promptly, indicating that not all complaints were handled in a timely manner.\n\nIf you need a specific definitive answer: yes, some complaints did not get handled in a timely manner.'

In [40]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for several reasons, including:\n\n- Lack of clear communication from loan servicers about when payments were expected to resume or if the loan had been transferred to a new servicer.\n- Difficulty in understanding or managing complex repayment options, interest accumulation, and the impact of deferment or forbearance.\n- Being misled by servicers into forbearance or consolidation without proper explanation of potential consequences like interest capitalization or loss of forgiveness eligibility.\n- Changes in loan transfer status without notification, leading to unawareness of repayment obligations.\n- Financial hardships that made monthly payments unaffordable, combined with insufficient support or guidance from lenders.\n- Errors or inaccuracies in loan balance reporting, debt management, or credit reporting, causing confusion and financial hardship.\n- Technical issues, such as trouble with online portals, misapplied payments, or inaccurate ac

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example.

It calculates the distance between each pair of consecutive sentences, and then splits the sequence wherever a distance exceeds a given percentile of all distances.
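
To make the percentile idea concrete, here is a minimal standard-library sketch of the thresholding step. It is not LangChain's implementation: the 2-D vectors are hypothetical stand-ins for real embeddings, the helper names (`cosine_distance`, `percentile`, `percentile_chunks`) are made up for illustration, and the 95th percentile is an assumed setting.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def percentile(values, pct):
    """Linear-interpolation percentile (0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def percentile_chunks(sentences, vectors, pct=95.0):
    """Group consecutive sentences, starting a new chunk wherever the
    distance to the previous sentence exceeds the pct-th percentile."""
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks

# Hypothetical toy data: two sentences about payments, two about privacy
toy_sentences = [
    "My payment was misapplied.",
    "The payment posted late.",
    "My data was shared.",
    "They leaked my records.",
]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]

chunks = percentile_chunks(toy_sentences, toy_vectors)
print(chunks)
```

In this toy case the only large jump is between the second and third sentences, so the sketch yields two chunks of two sentences each.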

In [41]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [42]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [43]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [44]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [45]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [46]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints data, the most common issue with loans appears to be problems related to loan servicing and communication issues. Specific recurring issues include:\n\n- Trouble with how payments are being handled, such as incorrect payment amounts or re-amortization delays.\n- Difficulties in obtaining clear or accurate information about loan status, balances, or payment plans.\n- Problems with reporting and account status errors, such as loans being reported in default or delinquency falsely.\n- Issues with auto-debit setup and verification.\n- Concerns over improper reporting, breach of privacy, or illegal collection practices.\n\nOverall, challenges related to loan management, including lack of transparency, miscommunication, and administrative errors, seem to be the most prevalent issues with loans in this dataset.\n\nIf you need a specific summary or further details, please let me know!'

In [47]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, it appears that some complaints did not get handled in a timely manner. Specifically, multiple complaints state that the companies responded with "Closed with explanation" and there is no indication that they addressed the issues within the expected timeframe. \n\nFor example:\n- The complaint about Nelnet regarding unresponded letters and alleged misconduct indicates no subsequent response to the consumer\'s inquiries.\n- Several complaints mention that, despite the companies\' responses, the issues remain unresolved, and the consumers continue to face problems such as incorrect billing, unauthorized reporting, or lack of communication.\n\nWhile some responses are marked "Yes" for timely response, the lack of resolution or ongoing disputes suggests that not all complaints were effectively handled in a timely manner. \n\nTherefore, yes, some complaints did not get handled in a timely manner.'

In [48]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including issues with loan servicing, miscommunication, or administrative problems. Some specific reasons highlighted in the complaints include:\n\n- Receiving bad or incomplete information about their loan status or repayment terms.\n- Administrative delays or errors, such as failure to re-amortize payments after forbearance or missing payment records.\n- Disputes over the legitimacy of the debt or treatment of their loan accounts.\n- Problems with loan documentation or certification, leading to delays or complications in forgiveness or discharge.\n- Allegations of illegal reporting or collection practices, which may contribute to confusion or default status errors.\n- Personal financial difficulties, while not explicitly detailed, can also be implied as a general reason for default if borrowers are unable to meet payments.\n\nIn summary, failure to pay back loans was often related to administrative errors, communication issu

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

</div>

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### Answer:
Short, repetitive sentences such as FAQ entries tend to have very similar embeddings, so the semantic distances between consecutive sentences are small. FAQ structures depend on keeping each question together with its answer, but semantic chunking might merge multiple Q&A pairs if they are topically similar.

To adjust the algorithm, lower the breakpoint threshold so that it detects subtler semantic shifts between different FAQ topics. Additionally, add custom logic that respects FAQ formatting patterns before applying semantic chunking. Lastly, combine semantic chunking with rule-based splitting that preserves Q&A pair integrity.

</div>
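
As an illustration of the rule-based splitting mentioned in the answer, the sketch below keeps each question with its answer before any semantic chunking is applied. It assumes a hypothetical FAQ layout where questions start with `Q:` and answers with `A:`. Real FAQ data may need a different pattern, and `split_faq_pairs` is an illustrative helper, not part of LangChain.

```python
import re

def split_faq_pairs(text):
    """Split FAQ text into (question, answer) tuples.

    Assumes every question is prefixed with 'Q:' and every answer with 'A:'.
    """
    pattern = re.compile(r"Q:\s*(.*?)\s*A:\s*(.*?)\s*(?=Q:|\Z)", re.DOTALL)
    return [(q.strip(), a.strip()) for q, a in pattern.findall(text)]

faq_text = """Q: How do I reset my password?
A: Use the reset link on the login page.
Q: How do I change my password?
A: Open account settings and choose a new one."""

pairs = split_faq_pairs(faq_text)
for question, answer in pairs:
    print(f"{question} -> {answer}")
```

Each pair could then be stored as its own document, so two near-identical questions never end up merged into one chunk.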

# 🤝 Breakout Room Part #2

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other. 
You can use the loans or bills dataset.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

</div>

##### HINTS:

- LangSmith provides detailed information about latency and cost.
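
If you also want a quick local latency number alongside the LangSmith traces, a small timing helper is one option. This is a rough sketch: `time_calls` and `fake_retriever` are hypothetical names, and the stand-in function should be swapped for a real retriever's `invoke` method.

```python
import statistics
import time

def time_calls(fn, inputs, repeats=3):
    """Call fn on every input `repeats` times and summarize latency in seconds."""
    latencies = []
    for item in inputs:
        for _ in range(repeats):
            start = time.perf_counter()
            fn(item)
            latencies.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(latencies),
        "max_s": max(latencies),
        "calls": len(latencies),
    }

# Stand-in for `retriever.invoke`; replace with a real retriever when available
def fake_retriever(query):
    return [query.upper()]

stats = time_calls(fake_retriever, ["late payment", "loan transfer"], repeats=2)
print(stats)
```

Wall-clock timings like these include network variance for API-backed retrievers, so it helps to repeat each query a few times and compare means rather than single calls.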

In [5]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Andrei\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Andrei\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
import os
import time
import getpass
import pandas as pd
from typing import List, Dict, Any

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key: ")
os.environ["COHERE_API_KEY"] = getpass.getpass("Enter your Cohere API Key: ")
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter your LangSmith API Key: ")

# Enable LangSmith tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "AI/LLM Engineering"

In [None]:
from datasets import load_dataset

from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

from langchain_community.vectorstores import Qdrant
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.retrievers import ParentDocumentRetriever, EnsembleRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import reasoning, multi_context  # Ragas 0.1.x location of the evolution types
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy
)

from langsmith import Client
from langsmith.evaluation import evaluate as langsmith_evaluate
from langsmith.schemas import Run, Example

In [13]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

langsmith_client = Client()

dataset = load_dataset("squad", split="validation[:20]")

documents = []
for item in dataset:
    doc = Document(
        page_content=f"Title: {item['title']}\n\nContext: {item['context']}",
        metadata={
            "title": item['title'],
            "context": item['context'],
            "source": "squad"
        }
    )
    documents.append(doc)


OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable

In [None]:
retrievers = {}

# Naive Retriever
vectorstore = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="naive_retrieval"
)
retrievers["naive"] = vectorstore.as_retriever(search_kwargs={"k": 5})

# BM25 Retriever
retrievers["bm25"] = BM25Retriever.from_documents(documents, k=5)

# Reranking Retriever
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
compressor = CohereRerank(model="rerank-v3.5", top_k=5)
retrievers["reranking"] = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

# Multi-query Retriever
retrievers["multi_query"] = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm
)

# Parent Document Retriever
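# Create the (empty) collection explicitly: `Qdrant.from_documents` cannot
# infer the vector size from an empty document list
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

parent_client = QdrantClient(location=":memory:")
parent_client.create_collection(
    collection_name="parent_retrieval",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)  # text-embedding-3-small dimension
)
parent_vectorstore = Qdrant(
    client=parent_client,
    collection_name="parent_retrieval",
    embeddings=embeddings
)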
store = InMemoryStore()
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50
)

parent_retriever = ParentDocumentRetriever(
    vectorstore=parent_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    search_kwargs={"k": 5}
)
parent_retriever.add_documents(documents)
retrievers["parent_document"] = parent_retriever

# Ensemble Retriever
ensemble_retrievers = [
    retrievers["naive"],
    retrievers["bm25"],
    retrievers["reranking"]
]
weights = [0.4, 0.3, 0.3]
# `EnsembleRetriever` has no `search_kwargs` field; each member retriever sets its own k
retrievers["ensemble"] = EnsembleRetriever(
    retrievers=ensemble_retrievers,
    weights=weights
)
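
The `weights` passed to the ensemble control how much each member's ranking counts when the result lists are fused. LangChain's `EnsembleRetriever` uses a weighted form of Reciprocal Rank Fusion for this. The sketch below shows the idea in plain Python. It is not the library's code: the document IDs are made up, and the constant `c=60` is an assumed smoothing value.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document IDs.

    Each document receives weight / (rank + c) from every list it appears in;
    documents are returned in order of descending total score.
    """
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a dense and a sparse retriever
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([dense, sparse], weights=[0.5, 0.5])
print(fused)
```

A document that ranks reasonably well in both lists (`doc_b` here) can overtake one that is first in only one list, which is the behaviour that makes ensembles robust to a single retriever's blind spots.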

In [None]:
generator = TestsetGenerator.from_langchain(
    generator_llm = llm,
    critic_llm = llm,
    embeddings=embeddings
)

testset = generator.generate_with_langchain_docs(
    documents,
    test_size=50
)

In [17]:
dataset_name = "retrieval-evaluation-dataset"
testset_df = testset.to_pandas()

examples = []
for _, row in testset_df.iterrows():
    example = {
        "question": row["question"],
        "ground_truth": row["ground_truth"],
        "contexts": row.get("contexts", [])
    }
    examples.append(example)

NameError: name 'testset' is not defined

In [18]:
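# Start from a clean slate: delete any existing dataset with the same name
try:
    langsmith_client.delete_dataset(dataset_name=dataset_name)
except Exception:
    pass  # Nothing to delete on the first run

try:
    langsmith_dataset = langsmith_client.create_dataset(
        dataset_name=dataset_name,
        description="Retrieval evaluation dataset generated with Ragas"
    )

    langsmith_client.create_examples(
        inputs=[{"question": ex["question"]} for ex in examples],
        outputs=[{"ground_truth": ex["ground_truth"], "contexts": ex["contexts"]} for ex in examples],
        dataset_id=langsmith_dataset.id
    )

except Exception as e:
    # Surface the failure instead of silently swallowing it
    print(f"Failed to create LangSmith dataset: {e}")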

In [None]:
def retrieval_accuracy_evaluator(run: Run, example: Example) -> dict:
    """Evaluate if the correct context was retrieved."""
    if not run.outputs or "contexts" not in run.outputs:
        return {"key": "retrieval_accuracy", "score": 0.0}
    
    retrieved_contexts = run.outputs["contexts"]
    expected_contexts = example.outputs.get("contexts", [])

    if not expected_contexts:
        return {"key": "retrieval_accuracy", "score": 0.0}
    
    overlap = 0
    for expected in expected_contexts:
        for retrieved in retrieved_contexts:
            if expected.lower() in retrieved.lower() or retrieved.lower() in expected.lower():
                overlap += 1
                break

    accuracy = overlap / len(expected_contexts) if expected_contexts else 0
    return {"key": "retrieval_accuracy", "score": accuracy}

In [None]:
def answer_quality_evaluator(run: Run, example: Example) -> dict:
    """Evaluate answer quality using LLM."""
    if not run.outputs or "answer" not in run.outputs:
        return {"key": "answer_quality", "score": 0.0}
    
    answer = run.outputs["answer"]
    ground_truth = example.outputs.get("ground_truth", "")
    question = example.inputs["question"]
    
    # Use LLM to evaluate answer quality
    evaluation_prompt = f"""
    Question: {question}
    Ground Truth: {ground_truth}
    Generated Answer: {answer}
    
    Rate the quality of the generated answer compared to the ground truth on a scale of 0-1:
    - 0: Completely incorrect or irrelevant
    - 0.5: Partially correct but missing key information
    - 1: Accurate and complete
    
    Return only the numerical score.
    """
    
    try:
        score_text = llm.invoke(evaluation_prompt).content
        score = float(score_text.strip())
        score = max(0.0, min(1.0, score))  # Clamp between 0 and 1
    except Exception:
        # Fall back to 0.0 if the LLM call fails or the reply is not a bare number
        score = 0.0
    
    return {"key": "answer_quality", "score": score}
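
One weak spot in an LLM-as-judge evaluator like the one above is the `float(...)` call: it fails whenever the model wraps the number in extra words. A more forgiving parser might look like the sketch below. `parse_score` is a hypothetical helper, not something from LangSmith or Ragas, and it simply takes the first number it finds.

```python
import re

def parse_score(text, default=0.0):
    """Extract the first number from an LLM reply and clamp it to [0, 1].

    Returns `default` when the reply contains no number at all.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        return default
    return max(0.0, min(1.0, float(match.group())))

print(parse_score("0.5"))
print(parse_score("Score: 0.75"))
print(parse_score("I cannot rate this."))
```

Because it picks the first number, a reply such as "on a scale of 0-1 I would say 0.5" would be misread as 0, so keeping the "Return only the numerical score" instruction in the prompt still matters.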

In [None]:
def context_relevance_evaluator(run: Run, example: Example) -> dict:
    """Evaluate how relevant retrieved contexts are to the question."""
    if not run.outputs or "contexts" not in run.outputs:
        return {"key": "context_relevance", "score": 0.0}
    
    contexts = run.outputs["contexts"]
    question = example.inputs["question"]
    
    if not contexts:
        return {"key": "context_relevance", "score": 0.0}
    
    # Use LLM to evaluate context relevance
    context_text = "\n\n".join(contexts[:3])  # Use top 3 contexts
    evaluation_prompt = f"""
    Question: {question}
    Retrieved Contexts: {context_text}
    
    Rate how relevant these contexts are to answering the question on a scale of 0-1:
    - 0: Completely irrelevant
    - 0.5: Somewhat relevant but missing key information
    - 1: Highly relevant and contains information needed to answer the question
    
    Return only the numerical score.
    """
    
    try:
        score_text = llm.invoke(evaluation_prompt).content
        score = float(score_text.strip())
        score = max(0.0, min(1.0, score))
    except Exception:
        # Fall back to 0.0 if the LLM call fails or the reply is not a bare number
        score = 0.0
    
    return {"key": "context_relevance", "score": score}

def create_langsmith_evaluators():
    """Return the custom evaluators used for the LangSmith experiments."""
    return [
        retrieval_accuracy_evaluator,
        answer_quality_evaluator,
        context_relevance_evaluator
    ]

In [None]:
def evaluate_retriever(retriever, retriever_name, testset, dataset_id=None, langsmith_evaluators=None):
    """Evaluate a single retriever using both Ragas and LangSmith metrics."""
    print(f"\n📊 Evaluating {retriever_name} retriever...")
    
    # Create retriever function for LangSmith
    def retriever_chain(inputs: dict) -> dict:
        """Retriever chain for LangSmith evaluation."""
        question = inputs["question"]
        
        try:
            # Retrieve documents
            retrieved_docs = retriever.invoke(question)
            contexts = [doc.page_content for doc in retrieved_docs]
            
            # Generate answer using RAG chain
            context_str = "\n\n".join(contexts)
            prompt_template = """Based on the following context, answer the question.

Context: {context}

Question: {question}

Answer:"""
            
            prompt = ChatPromptTemplate.from_template(prompt_template)
            rag_chain = prompt | llm | StrOutputParser()
            
            answer = rag_chain.invoke({
                "context": context_str,
                "question": question
            })
            
            return {
                "contexts": contexts,
                "answer": answer,
                "question": question
            }
            
        except Exception as e:
            print(f"  Error in retriever chain: {e}")
            return {
                "contexts": [],
                "answer": "Error occurred during retrieval",
                "question": question
            }
    
    # Prepare evaluation data for Ragas
    eval_data = {
        "question": [],
        "contexts": [],
        "answer": [],
        "ground_truth": []
    }
    
    # Process each question in testset
    testset_df = testset.to_pandas()
    
    for _, row in testset_df.iterrows():
        question = row["question"]
        ground_truth = row["ground_truth"]
        
        try:
            result = retriever_chain({"question": question})
            
            eval_data["question"].append(question)
            eval_data["contexts"].append(result["contexts"])
            eval_data["answer"].append(result["answer"])
            eval_data["ground_truth"].append(ground_truth)
            
        except Exception as e:
            print(f"  Error processing question: {e}")
            continue
    
    # Run Ragas evaluation
    ragas_results = None
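    if len(eval_data["question"]) > 0:
        # Ragas `evaluate` expects a Hugging Face `Dataset`, not a pandas DataFrame
        from datasets import Dataset
        eval_dataset = Dataset.from_dict(eval_data)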
        
        try:
            start_time = time.time()
            
            ragas_result = evaluate(
                eval_dataset,
                metrics=[
                    context_precision,
                    context_recall,
                    faithfulness,
                    answer_relevancy
                ]
            )
            
            evaluation_time = time.time() - start_time
            
            ragas_results = {
                "context_precision": ragas_result["context_precision"],
                "context_recall": ragas_result["context_recall"],
                "faithfulness": ragas_result["faithfulness"],
                "answer_relevancy": ragas_result["answer_relevancy"],
                "evaluation_time": evaluation_time,
                "num_questions": len(eval_dataset)
            }
            
            print(f"  ✅ Ragas evaluation completed ({evaluation_time:.2f}s)")
            
        except Exception as e:
            print(f"  ❌ Ragas evaluation failed: {e}")
    
    # Run LangSmith evaluation
    langsmith_results = None
    if dataset_id and langsmith_evaluators:
        try:
            print(f"  🔧 Running LangSmith evaluation...")
            
            experiment_name = f"{retriever_name}-experiment"
            
            langsmith_result = langsmith_evaluate(
                retriever_chain,
                data=dataset_id,
                evaluators=langsmith_evaluators,
                experiment_prefix=experiment_name,
                metadata={"retriever_type": retriever_name}
            )
            
            # Extract LangSmith metrics
            langsmith_results = {}
            for result in langsmith_result:
                # Each row is a dict; scores live under ["evaluation_results"]["results"]
                for eval_result in result["evaluation_results"]["results"]:
                    if eval_result.score is None:
                        continue
                    langsmith_results.setdefault(eval_result.key, []).append(eval_result.score)
            
            # Calculate averages
            for metric, scores in langsmith_results.items():
                langsmith_results[metric] = sum(scores) / len(scores) if scores else 0.0
            
            print(f"  ✅ LangSmith evaluation completed")
            
        except Exception as e:
            print(f"  ❌ LangSmith evaluation failed: {e}")
    
    # Combine results
    combined_results = {
        "retriever": retriever_name,
        "num_questions": len(eval_data["question"])
    }
    
    if ragas_results:
        combined_results.update(ragas_results)
    
    if langsmith_results:
        # Add LangSmith metrics with prefix
        for metric, score in langsmith_results.items():
            combined_results[f"langsmith_{metric}"] = score
    
    return combined_results

In [None]:
def run_comprehensive_evaluation():
    """Run the complete evaluation pipeline with both Ragas and LangSmith."""
    print("\n🎯 Starting Comprehensive Evaluation")
    print("=" * 50)
    
    # Reuse the objects built in the cells above (`retrievers`, `testset`,
    # and the LangSmith dataset named `dataset_name`) instead of rebuilding them
    langsmith_evaluators = [
        retrieval_accuracy_evaluator,
        answer_quality_evaluator,
        context_relevance_evaluator
    ]
    
    # Evaluate each retriever
    all_results = []
    
    for retriever_name, retriever in retrievers.items():
        result = evaluate_retriever(
            retriever, 
            retriever_name, 
            testset, 
            dataset_name,  # LangSmith `evaluate` accepts a dataset name
            langsmith_evaluators
        )
        if result:
            all_results.append(result)
    
    return all_results

In [None]:
def analyze_results(results):
    """Analyze results and provide comprehensive report."""
    print("\n📈 ANALYSIS REPORT")
    print("=" * 50)
    
    if not results:
        print("❌ No results to analyze")
        return
    
    # Create results DataFrame
    df = pd.DataFrame(results)
    
    # Display Ragas results table
    print("\n📊 RAGAS EVALUATION RESULTS:")
    print("-" * 80)
    # Only the four Ragas metrics computed in `evaluate_retriever` are shown
    print(f"{'Retriever':<15} {'Precision':<10} {'Recall':<10} {'Faithfulness':<12} {'Answer Rel.':<10}")
    print("-" * 80)
    
    for _, row in df.iterrows():
        print(f"{row['retriever']:<15} "
              f"{row.get('context_precision', 0):<10.3f} "
              f"{row.get('context_recall', 0):<10.3f} "
              f"{row.get('faithfulness', 0):<12.3f} "
              f"{row.get('answer_relevancy', 0):<10.3f}")
    
    # Display LangSmith results if available
    langsmith_cols = [col for col in df.columns if col.startswith('langsmith_')]
    if langsmith_cols:
        print("\n📊 LANGSMITH EVALUATION RESULTS:")
        print("-" * 80)
        print(f"{'Retriever':<15} ", end="")
        for col in langsmith_cols:
            metric_name = col.replace('langsmith_', '').replace('_', ' ').title()
            print(f"{metric_name:<15} ", end="")
        print()
        print("-" * 80)
        
        for _, row in df.iterrows():
            print(f"{row['retriever']:<15} ", end="")
            for col in langsmith_cols:
                print(f"{row.get(col, 0):<15.3f} ", end="")
            print()
    
    # Calculate composite scores
    ragas_metrics = ['context_precision', 'context_recall', 'context_relevancy', 'faithfulness', 'answer_relevancy']
    available_ragas = [metric for metric in ragas_metrics if metric in df.columns]
    
    if available_ragas:
        df['ragas_composite'] = df[available_ragas].mean(axis=1)
    
    langsmith_metrics = [col for col in langsmith_cols if 'accuracy' in col or 'quality' in col or 'relevance' in col]
    if langsmith_metrics:
        df['langsmith_composite'] = df[langsmith_metrics].mean(axis=1)
    
    # Find best performers
    if 'ragas_composite' in df.columns:
        best_ragas = df.loc[df['ragas_composite'].idxmax()]
        print(f"\n🏆 BEST RAGAS PERFORMER:")
        print(f"Best Overall (Ragas): {best_ragas['retriever']} (composite score: {best_ragas['ragas_composite']:.3f})")
    
    if 'langsmith_composite' in df.columns:
        best_langsmith = df.loc[df['langsmith_composite'].idxmax()]
        print(f"\n🏆 BEST LANGSMITH PERFORMER:")
        print(f"Best Overall (LangSmith): {best_langsmith['retriever']} (composite score: {best_langsmith['langsmith_composite']:.3f})")
    
    # Individual metric leaders
    print(f"\n🥇 METRIC LEADERS:")
    if 'context_precision' in df.columns:
        best_precision = df.loc[df['context_precision'].idxmax()]
        print(f"Best Precision: {best_precision['retriever']} ({best_precision['context_precision']:.3f})")
    
    if 'context_recall' in df.columns:
        best_recall = df.loc[df['context_recall'].idxmax()]
        print(f"Best Recall: {best_recall['retriever']} ({best_recall['context_recall']:.3f})")
    
    if 'evaluation_time' in df.columns:
        fastest = df.loc[df['evaluation_time'].idxmin()]
        # `evaluation_time` measures the Ragas scoring run, not retrieval latency
        print(f"Fastest Ragas evaluation: {fastest['retriever']} ({fastest['evaluation_time']:.2f}s)")
    
    # Cost analysis (estimated)
    print(f"\n💰 COST ANALYSIS (Estimated):")
    cost_order = ["bm25", "naive", "parent_document", "reranking", "multi_query", "ensemble"]
    cost_levels = ["Lowest", "Low", "Medium", "Medium-High", "High", "Highest"]
    
    for retriever, cost in zip(cost_order, cost_levels):
        if retriever in df['retriever'].values:
            print(f"{retriever:<15}: {cost}")
    
    # Enhanced performance analysis paragraph
    print(f"\n📝 COMPREHENSIVE PERFORMANCE ANALYSIS:")
    print("-" * 50)
    
    best_overall = None
    if 'ragas_composite' in df.columns:
        best_overall = df.loc[df['ragas_composite'].idxmax()]
    elif available_ragas:
        best_overall = df.loc[df[available_ragas[0]].idxmax()]
    
    if best_overall is not None:
        analysis = f"""
Based on comprehensive evaluation using both Ragas and LangSmith frameworks on {len(results)} 
retrieval methods with the SQuAD dataset, the {best_overall['retriever']} retriever achieved 
the best overall performance.

EVALUATION FRAMEWORK INSIGHTS:
• Ragas Metrics: Focused on context precision, recall, relevancy, faithfulness, and answer relevancy
• LangSmith Metrics: Custom evaluators for retrieval accuracy, answer quality, and context relevance
• Dataset: {best_overall.get('num_questions', 'N/A')} synthetic questions generated from SQuAD documents

KEY FINDINGS:
• Best Overall Performer: {best_overall['retriever']} 
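• Ragas Composite Score: {format(best_overall['ragas_composite'], '.3f') if 'ragas_composite' in best_overall.index else 'N/A'}
• LangSmith Composite Score: {format(best_overall['langsmith_composite'], '.3f') if 'langsmith_composite' in best_overall.index else 'N/A'}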

RETRIEVER-SPECIFIC INSIGHTS:
• BM25: Fastest execution, excellent for keyword-heavy queries
• Naive Semantic: Good baseline performance with embedding similarity
• Reranking: Higher precision through secondary ranking, increased cost
• Multi-Query: Enhanced recall through query expansion, higher latency
• Parent-Document: Better context preservation through chunk-to-document mapping
• Ensemble: Combines strengths of multiple approaches for robust performance

PRODUCTION RECOMMENDATIONS:
For SQuAD-like factual Q&A datasets, {best_overall['retriever']} provides optimal 
balance of accuracy and performance. Consider ensemble methods for maximum accuracy 
or BM25 for latency-critical applications.

COST-PERFORMANCE TRADE-OFFS:
• Low-cost option: BM25 (no embedding/API costs)
• Balanced option: Naive semantic search
• High-accuracy option: Ensemble or reranking
• Specialized option: Parent-document for long-form content
        """.strip()
    else:
        analysis = "Evaluation completed but insufficient data for comprehensive analysis."
    
    print(analysis)
    
    # Save results
    df.to_csv("comprehensive_retrieval_results.csv", index=False)
    print(f"\n💾 Results saved to 'comprehensive_retrieval_results.csv'")
    
    return df, analysis


<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### Analysis & Observations:

</div>