# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [107]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [108]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

In [109]:
# Enable LangSmith (same as test notebook)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API Key: ")

## Task 2: Data Collection and Preparation

We'll be using our Use Case Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [6]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/Projects_with_Domains.csv",
    metadata_columns=[
      "Project Title",
      "Project Domain",
      "Secondary Domain",
      "Description",
      "Judge Comments",
      "Score",
      "Project Name",
      "Judge Score"
    ]
)

synthetic_usecase_data = loader.load()

for doc in synthetic_usecase_data:
    doc.page_content = doc.metadata["Description"]

In [8]:
print(f"Number of documents: {len(synthetic_usecase_data)}")

Number of documents: 50


## Task 3: Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "Synthetic_Usecases".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [9]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    synthetic_usecase_data,
    embeddings,
    location=":memory:",
    collection_name="Synthetic_Usecases"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each project description as a document, and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later
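
To make the idea concrete, here is a minimal sketch of top-`k` retrieval by cosine similarity. The two-dimensional vectors and document ids are made up for illustration; in the notebook the vectors come from the embedding model and the search is handled by the vector store.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy embeddings (real embeddings have many more dimensions)
toy_vectors = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.7, 0.7],
    "doc_c": [0.0, 1.0],
}

top_two = top_k([1.0, 0.1], toy_vectors, k=2)
print(top_two)
```

A larger `k` widens the candidate pool, which is why we over-fetch here before reranking later.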

In [10]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [11]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [12]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [13]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # RunnablePassthrough.assign passes every key through unchanged and re-assigns "context"
    # to the value of the "context" key from the previous step (effectively a no-op here)
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [15]:
naive_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'Based on the provided data, the most common project domain appears to be "Healthcare / MedTech," which is mentioned multiple times in the dataset.'

In [12]:
naive_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Yes, there are usecases related to security. Specifically, one project titled "LatticeFlow" is described as an AI-powered platform optimizing logistics routes for sustainability, and it is associated with the Healthcare / MedTech domain and the secondary domain of Security.'

In [13]:
naive_retrieval_chain.invoke({"question" : "Can you count how many projects belonged to Security domain?"})["response"].content

# Not good: the chain cannot count the projects that belong to the Security domain.

'There are 0 projects that directly belong to the Security domain in the provided data.'

In [14]:
naive_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'The judges had various comments about the fintech projects. For example, they described the project "PlanPilot 35" as "A clever solution with measurable environmental benefit," and "PixelSense" as having a "Comprehensive and technically mature approach." Overall, the judges viewed the projects positively, highlighting the strength of ideas, technical execution, and real-world impact in some cases.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
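
For intuition, here is a rough sketch of Okapi-style BM25 scoring in plain Python. The whitespace tokenizer, the `k1` and `b` values, the smoothed IDF form, and the toy documents are all assumptions for illustration; this is not necessarily the exact variant used by `BM25Retriever`.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with an Okapi BM25-style formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates, and long documents are penalised
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical documents
toy_docs = [
    "a toolkit that reduces compute costs",
    "a medical imaging solution for early diagnosis",
    "an assistant that drafts legal documents",
]

toy_scores = bm25_scores("compute costs", toy_docs)
best_doc = toy_docs[toy_scores.index(max(toy_scores))]
print(best_doc)
```

Documents that share no terms with the query score exactly zero, which is the main difference from embedding-based similarity.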

This retriever is very straightforward to set up! Let's see it happen down below!


In [16]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(synthetic_usecase_data)



We'll construct the same chain - only changing the retriever.

In [17]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [18]:
bm25_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'Based on the provided data, the most common project domain is not explicitly stated, but among the sample projects listed, the domains include Productivity Assistants, Legal / Compliance, Data / Analytics, and Healthcare / MedTech. Since only a small sample is shown, I cannot definitively determine which domain is most common overall. If you have access to the full dataset, it would be best to analyze the frequency of each domain there.'

In [45]:
response = bm25_retrieval_chain.invoke({"question" : "What is the most common project domain?"})

In [52]:
response["response"].content

'Based on the provided data, the most common project domain appears to be "E‑commerce / Marketplaces," as it is mentioned as a secondary domain for some projects. However, since only a few sample entries are provided and "Project Domain" entries include "Productivity Assistants," "Legal / Compliance," "Data / Analytics," and "Healthcare / MedTech," and no clear count for each, the most frequent project domain cannot be determined definitively from this snippet alone.\n\nIf considering the sample data, "E‑commerce / Marketplaces" appears more than once as a secondary domain, but for the primary domains, there isn\'t enough information.\n\nTherefore, I do not have enough data to conclusively identify the most common project domain.'

In [16]:
bm25_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Based on the provided information, there are no specific use cases related to security mentioned in the context.'

In [19]:
bm25_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'The judges commented that the fintech project, PulseAI 50, was technically ambitious and well-executed.'

It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: we do, and the second half of the notebook will cover it.)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

##### ✅ Answer: BM25 excels at exact or near-exact keyword matching because it is based on the bag-of-words representation. It works best when:

- The user query contains exact phrases that appear in the document
- Searching for technical terms, proper nouns, or domain-specific vocabulary
- The query consists of important keywords that appear in the retrieved documents

Let's check a scenario where the BM25 retriever generates a "better" (by vibe check) response than the embedding-based naive retriever. Here I use a query that is not very descriptive (or even grammatically correct) but instead reuses a selection of phrases from the description of the project named "TaskFlow 13". You will see that the BM25 response is more descriptive and satisfying. 👇

In [20]:
# query to test BM25 vs naive retriever
query = "project compute costs by 40%"

response_naive = naive_retrieval_chain.invoke({"question" : query})["response"].content
response_bm25 = bm25_retrieval_chain.invoke({"question" : query})["response"].content

print(f"Naive Retrieval Response: \n {response_naive}")
print(f"\n\n BM25 Retrieval Response: \n {response_bm25}")



Naive Retrieval Response: 
 The project described as a "dynamic model pruning toolkit" reduces compute costs by 40%.


 BM25 Retrieval Response: 
 The project that focuses on reducing compute costs by 40% is "TaskFlow 13," which is in the Legal / Compliance domain with a secondary domain in E‑commerce / Marketplaces.


## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
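
The two steps above can be sketched without any external service. In this toy version a simple word-overlap score stands in for the reranking model (an assumption purely for illustration; a real reranker scores each query-document pair with a trained model), and the candidate texts are hypothetical.

```python
def overlap_score(query, text):
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)

def rerank(query, candidates, top_n):
    """Re-score every candidate against the query and keep only the top_n."""
    ranked = sorted(candidates, key=lambda text: overlap_score(query, text), reverse=True)
    return ranked[:top_n]

# Step 1: pretend these came back from a first-stage retriever with a large k
candidates = [
    "route planning for logistics fleets",
    "federated learning toolkit improving privacy",
    "privacy audit assistant for healthcare data",
    "music recommendation engine",
]

# Step 2: "compress" the candidates down to the two most relevant ones
compressed = rerank("healthcare privacy", candidates, top_n=2)
print(compressed)
```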

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [19]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5", top_n=7)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [20]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [23]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'The most common project domain in the provided data is "Healthcare / MedTech," which appears multiple times among the listed projects.'

In [24]:
contextual_compression_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Yes, there are use cases related to security. For example, one project describes "a federated learning toolkit improving privacy in healthcare applications," which is directly related to security and privacy concerns.'

In [25]:
contextual_compression_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'The judges\' comments on the fintech projects were generally positive. For example, one project received praise for its "excellent code quality and use of open-source libraries," and another was noted for being a "clever solution with measurable environmental benefit." Overall, the feedback highlighted the technical quality, promising ideas, and real-world impact of the projects related to fintech.'

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
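
Steps 2 and 3 can be sketched as follows. The query variants are hard-coded here in place of the LLM call from step 1, and `toy_index` is a hypothetical stand-in for a real retriever.

```python
def multi_query_retrieve(queries, retrieve):
    """Run every query through `retrieve` and return the de-duplicated union in first-seen order."""
    seen = set()
    unique_docs = []
    for query in queries:
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

# Hypothetical retriever: maps each query variant to the document ids it would return
toy_index = {
    "security use cases": ["doc_1", "doc_2"],
    "projects about privacy": ["doc_2", "doc_3"],
    "cybersecurity applications": ["doc_4", "doc_1"],
}

# In the real retriever these variants are generated by the LLM
query_variants = list(toy_index)

context_docs = multi_query_retrieve(query_variants, lambda q: toy_index.get(q, []))
print(context_docs)
```

Each variant on its own returns two documents, while the union contains four.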

So, how hard is it to set up? Not bad! Let's see it down below!



In [21]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
) # by default it will generate 3 queries ....

In [22]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [28]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'The most common project domain in the provided data is "Healthcare / MedTech," appearing multiple times among the projects.'

In [29]:
multi_query_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Yes, there are use cases related to security. Specifically, the project "LearnWise 39" within the "Data / Analytics" domain focuses on an AI model compression suite enabling on-device reasoning for IoT sensors, which can have security implications. Additionally, "SecureNest 28" in the "Legal / Compliance" domain is a hardware-aware model quantization benchmark suite, indicating a focus on security-related benchmarking.'

In [30]:
multi_query_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'Judges had various opinions about the fintech projects. For example, they described some projects as having "excellent code quality and use of open-source libraries," indicating a positive view of technical execution. Others were noted for being "conceptually strong," "technically ambitious and well-executed," or having "impressive real-world impact." Some projects received high scores, such as 96 and 94, with comments highlighting their originality and environmental benefits. However, there were also mentions of minor issues or the need for more benchmarking results, suggesting that while the projects were viewed favorably overall, some improvements or additional validation were recommended.'

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

##### ✅ Answer: 

The recall metric answers the question "of all relevant documents that exist, what percentage are successfully retrieved?" A higher recall means we are finding most of the relevant context, reducing the chance of missing important information. Multi-query retrieval (essentially generating multiple variants of a user query) improves recall in the following ways:

- Users often ask for the same information in different ways (e.g. different personas, tones, phrasing, synonyms), so multiple variants can address query ambiguity
- It expands semantic coverage and increases the chances of finding documents that are relevant but expressed differently
- Even the best embedding models can miss relevant documents due to vocabulary mismatches, sentence structure, tonality, and contextual variations, so multiple queries act as a safety net
- It increases the diversity of the document pool, and the union of all retrieved docs provides a more comprehensive context
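
As a small worked example of the metric itself, the sketch below computes recall for a single query and for the union of several query variants. The document ids and relevance labels are invented for illustration and are not results from this notebook.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were actually retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

# Hypothetical ground truth and retrieval results
relevant = {"doc_1", "doc_3", "doc_4"}
single_query_hits = ["doc_1", "doc_2"]
multi_query_hits = ["doc_1", "doc_2", "doc_3", "doc_4"]

single_recall = recall(single_query_hits, relevant)
multi_recall = recall(multi_query_hits, relevant)
print(f"single query: {single_recall:.2f}, multi query: {multi_recall:.2f}")
```

Note that the union also pulls in `doc_2`, which is not relevant: recall improves, but precision can drop.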


## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
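
Here is a minimal sketch of the "search small, return big" idea using plain dictionaries. Word overlap stands in for vector similarity, and the parent texts, the word-based splitter, and the chunk size are all assumptions for illustration.

```python
def split_into_chunks(text, chunk_size):
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Hypothetical parent documents, kept in a plain key-value store
parents = {
    "p1": "federated learning toolkit for hospitals that keeps patient records private",
    "p2": "route optimizer for delivery fleets that lowers fuel use and emissions",
}

# Index the small child chunks, each pointing back to its parent
child_to_parent = {}
for parent_id, text in parents.items():
    for chunk in split_into_chunks(text, chunk_size=4):
        child_to_parent[chunk] = parent_id

def retrieve_parents(query, k=2):
    """Rank child chunks against the query, then return their de-duplicated parents."""
    query_words = set(query.lower().split())
    ranked_children = sorted(
        child_to_parent,
        key=lambda chunk: len(query_words & set(chunk.split())),
        reverse=True,
    )
    parent_ids = []
    for chunk in ranked_children[:k]:
        parent_id = child_to_parent[chunk]
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]

result = retrieve_parents("patient records")
print(result)
```

Both matching child chunks belong to the same parent, so the full parent document is returned once.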

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [24]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = synthetic_usecase_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=500)  # changing chunk size from 750 to 500 seemed to improve response quality

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [25]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [26]:
#store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [27]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [28]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [30]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'The most common project domain in the provided data appears to be "Healthcare / MedTech," as it is mentioned twice among the sample entries.'

In [37]:
parent_document_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Based on the provided context, there do not appear to be any use cases specifically about security. The projects mentioned focus mainly on federated learning to improve privacy in healthcare applications, but there is no explicit mention of security-related use cases.'

In [38]:
parent_document_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'The judges\' comments about the fintech projects included describing one project as "a clever solution with measurable environmental benefit," and another as "technically ambitious and well-executed." These comments suggest that the judges found the fintech-related project to be both innovative and effectively implemented.'

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the question about performance more rigorously.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes 2, or more, retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.
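
As a rough sketch of weighted Reciprocal Rank Fusion: each document receives `weight / (c + rank)` from every ranking it appears in, and the contributions are summed. The constant `c=60` follows the RRF paper; the rankings and weights below are hypothetical, and the library's implementation may differ in details.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, weights, c=60):
    """Fuse several ranked lists of document ids into a single ranking."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of a ranking contribute more
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of a keyword retriever and a vector retriever
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking], weights=[0.5, 0.5])
print(fused)
```

`doc_b` comes out on top because it ranks highly in both lists, even though it is first in only one of them.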

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [29]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [30]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [29]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'The most common project domain in the provided data is Healthcare / MedTech, as it appears multiple times among the listed projects.'

In [42]:
ensemble_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Yes, there are use cases related to security. Specifically, the project titled "SecureNest 28" falls under the legal and compliance domain and involves a hardware-aware model quantization benchmark suite, which relates to security and compliance measures.'

In [43]:
ensemble_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'The judges\' comments on the fintech projects were generally positive but varied in specifics. For example:\n\n- The project "Pathfinder 27" received praise for "excellent code quality and use of open-source libraries."\n- "SecureNest 28" was recognized as "conceptually strong but results need more benchmarking."\n- "Pathfinder 24" was noted for being "well-structured and scalable" with good potential for commercialization.\n\nOverall, judges highlighted strengths such as code quality, scalability, and potential for commercialization, while also noting areas like benchmarking and UI design that could be improved.'

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of a number of [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
   - `percentile`
   - `standard_deviation`
   - `interquartile`
   - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example, which calculates the distances between all adjacent sentences and then splits sequences of sentences wherever the distance exceeds a given percentile of all distances.
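
The percentile idea can be sketched in plain Python. The sentences, their two-dimensional "embeddings", and the 75th-percentile cutoff are made up for illustration; `SemanticChunker` works with real embeddings and its internals may differ.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.hypot(*a) * math.hypot(*b))

def percentile(values, pct):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

def chunk_by_percentile(sentences, vectors, pct):
    """Start a new chunk wherever the distance to the previous sentence exceeds the percentile."""
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Hypothetical sentences with hand-made embeddings (two topics)
sentences = [
    "The toolkit prunes model weights.",
    "Pruning lowers compute costs.",
    "Judges praised the live demo.",
    "They gave it a high score.",
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

chunks = chunk_by_percentile(sentences, vectors, pct=75)
print(chunks)
```

Lowering `pct` lowers the threshold, so more boundaries qualify and the chunks get smaller.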

In [31]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [32]:
semantic_documents = semantic_chunker.split_documents(synthetic_usecase_data[:20])
len(semantic_documents)

20

Let's create a new vector store.

In [33]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Synthetic_Usecase_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [34]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [35]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [36]:
semantic_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'Based on the provided data, the most common project domain appears to be "Legal / Compliance," which is mentioned more than once compared to other domains.'

In [50]:
semantic_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Yes, there are usecases related to security in the provided data. Specifically:\n\n1. "MediMind 17" with the project name "BioForge," is described as a medical imaging solution improving early diagnosis through vision transformers, and it is categorized under the Security domain.\n2. "InsightAI 1" with the project name "Project Aurora," involves a low-latency inference system for multimodal agents in autonomous systems, also categorized under the Security domain.\n3. "SecureNest 12" with the project name "Neural Canvas," features a low-latency inference system for multimodal agents in autonomous systems within the Security domain.\n\nThese projects are focused on security-related applications, particularly involving inference systems and autonomous systems.'

In [51]:
semantic_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'Judges had quite positive comments about the fintech projects. For example, they described the projects as "Comprehensive and technically mature" and "Technically ambitious and well-executed." In addition, some projects were noted for being "Well-structured and scalable" with "good potential for commercialization," and others as "A forward-looking idea with solid supporting data." Overall, the judges appreciated the technical quality, ambition, and potential impact of the fintech-related projects.'

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

##### ✅ Answer

Note that semantic chunking works by comparing adjacent sentence embeddings, creating chunks where topic shifts occur based on semantic meaning rather than fixed character counts.

With FAQ-style content, semantic chunking may tend to:

- Over-merge unrelated items - questions starting with "How do I...", "What is...", "Can I..." show high similarity scores despite different topics
- Struggle with nuance - short sentences lack semantic richness, making distinctions harder
- Create inconsistent chunks - similar phrasing may override topical differences

In such a case I can think of three approaches:

1. Lower the breakpoint threshold so that splits happen more readily, e.g.,

   ```python
   chunker = SemanticChunker(
       embeddings,
       breakpoint_threshold_type="percentile",
       breakpoint_threshold_amount=70  # More aggressive splitting
   )
   ```

2. Use `breakpoint_threshold_type="gradient"` - the [LangChain documentation](https://python.langchain.com/docs/how_to/semantic-chunker/#gradient) mentions that this method is useful when chunks are highly correlated with each other. The idea is to apply anomaly detection to the gradient of the distance array so that the distribution becomes wider and boundaries are easier to identify in highly semantic data.

3. Add metadata (question type, category) before chunking - this is a bit Meta!


# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

### Data Processing + Document Generation Using PDF + CSV Data

There are some issues with the current `Projects_with_Domain.csv` dataset. For example,

- Same project name appears for diffeent project titles (e.g., `Neural Canvas` name maps to both `CreateFlow 33` and `TrendLens 32`)

- Same descriptions appear for different projects (e.g.,  "A low-latency inference system..." appears multiple times)

- The same project name has different scores (e.g., "BioForge" appears with scores 9.7, 7.5, 7.0, 7.8)

- Domain inconsistencies: Same project titles in different domains


It is interesting to note that project titles are indeed unique but project descriptions and names are not, which suggests spurious data. However, what we really care about are the project descriptions and the primary and secondary domains, and we may add other secondary metadata (such as judge comments) directly into the Document page content.
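
A quick way to surface these inconsistencies is to group one column by another and keep only the keys that map to more than one value. The sketch below uses plain Python; the rows are hypothetical stand-ins, and the column names `Project Name` and `Project Title` are assumptions about the CSV header.

```python
from collections import defaultdict

# Hypothetical rows; column names are assumed, not taken from the real CSV header
rows = [
    {"Project Title": "CreateFlow 33", "Project Name": "Neural Canvas", "Description": "description A"},
    {"Project Title": "TrendLens 32", "Project Name": "Neural Canvas", "Description": "description B"},
    {"Project Title": "Example Title 7", "Project Name": "BioForge", "Description": "description A"},
]


def ambiguous(rows, key_col, value_col):
    """Return keys of key_col that map to more than one distinct value of value_col."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_col]].add(row[value_col])
    return {key: sorted(values) for key, values in groups.items() if len(values) > 1}


names_with_many_titles = ambiguous(rows, "Project Name", "Project Title")
descriptions_with_many_titles = ambiguous(rows, "Description", "Project Title")

print(names_with_many_titles)
print(descriptions_with_many_titles)
```

The same helper can be pointed at any pair of columns (name vs. score, title vs. domain) to check the other issues listed above.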

In [110]:
import pandas as pd
from langchain.schema import Document 

def create_cleaned_dataset_from_existing_docs(docs: list[Document], save_data: bool = True) -> list[Document]:
    """
    Hybrid approach: Clean existing Documents by removing duplicates while 
    preserving ALL original metadata and augmenting page_content.
    
    Args:
        docs: List of Documents with metadata (e.g., from CSVLoader)
        save_data: If True, also write the cleaned rows to a CSV file under ./data/
    
    Returns:
        List of cleaned Documents with augmented page_content and preserved metadata
    """
    
    # Convert docs to dataframe for deduplication
    records = []
    for i, doc in enumerate(docs):
        record = doc.metadata.copy()
        record['Description'] = doc.metadata.get('Description', doc.page_content)
        record['_original_idx'] = i
        record['_page_content'] = doc.page_content  # Preserve original
        records.append(record)
    
    df = pd.DataFrame(records)
    
    # Clean: keep only unique descriptions, preferring higher judge scores
    df_cleaned = (
        df.sort_values(by="Judge Score", ascending=False)
        .drop_duplicates(subset=["Description"], keep="first")
        .reset_index(drop=True)
    )
    
    print(f"Length of records after cleaning: {len(df_cleaned)}")
    print(f"Removed {len(df) - len(df_cleaned)} duplicate descriptions")
    
    # Create cleaned documents with augmented page_content
    cleaned_docs = []
    for _, row in df_cleaned.iterrows():
        # Augment page_content for better semantic chunking (avoiding tiny chunks)
        page_content = f"""
Description: {row['Description']}

Primary Domain: {row['Project Domain']}

Secondary Domain: {row['Secondary Domain']}

Expert comments: {row['Judge Comments']}
        """.strip()
        
        # Preserve ALL original metadata
        metadata = row.to_dict()
        
        # Add processing metadata for tracking
        metadata.update({
            'source': metadata.get('source', 'csv_cleaned'),
            'doc_type': 'project',
            'augmented_content': True,  # Flag that content was enhanced
            'quality_tier': 'high' if float(row.get('Judge Score', 0)) > 8.0 else 'medium',
        })
        
        doc = Document(
            page_content=page_content,
            metadata=metadata
        )
        cleaned_docs.append(doc)
    
    # Optional: Save cleaned CSV for reference
    if save_data:
        cleaned_csv_path = "./data/cleaned_Projects_with_Domains.csv"
        df_to_save = df_cleaned.drop(columns=['_original_idx', '_page_content', 'source', 'row'], errors='ignore')
        df_to_save.to_csv(cleaned_csv_path, index=False)
        print(f"Cleaned dataset saved to {cleaned_csv_path}")
    
    return cleaned_docs

# Usage:
csv_docs = create_cleaned_dataset_from_existing_docs(synthetic_usecase_data)

Length of records after cleaning: 20
Removed 30 duplicate descriptions
Cleaned dataset saved to ./data/cleaned_Projects_with_Domains.csv
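
One caveat on the "sort by score, then drop duplicates" step: if the judge scores arrive as strings (an assumption; it depends on how the metadata was loaded), a string sort is lexicographic and would rank "9.7" above "10.0". The sketch below shows the same keep-the-best-per-description idea in plain Python with an explicit numeric conversion. The records and the helper name `keep_best_per_description` are hypothetical.

```python
def keep_best_per_description(records):
    """Keep one record per description, preferring the highest numeric score."""
    best = {}
    for record in records:
        key = record["Description"]
        score = float(record["Judge Score"])  # compare as numbers, not strings
        if key not in best or score > float(best[key]["Judge Score"]):
            best[key] = record
    return list(best.values())


# Hypothetical records with string-typed scores
records = [
    {"Description": "desc A", "Judge Score": "9.7"},
    {"Description": "desc A", "Judge Score": "10.0"},
    {"Description": "desc B", "Judge Score": "7.5"},
]

cleaned = keep_best_per_description(records)

# A descending string sort puts "9.7" first because "9" > "1" character-wise
string_sorted = sorted(["9.7", "10.0"], reverse=True)

print(cleaned)
print(string_sorted)
```

In pandas, converting the column with `pd.to_numeric` before sorting would have the same effect.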


In [111]:
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader

# load pdf data 
path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
pdf_docs = loader.load()

print(f"Number of documents loaded from PDF files: {len(pdf_docs)}")

Number of documents loaded from PDF files: 64


In [112]:
# combine csv and pdf documents (using all of the data available)
final_evaluation_docs = pdf_docs + csv_docs
print(f"Total number of documents (from PDF and CSV sources): {len(final_evaluation_docs)}")

Total number of documents (from PDF and CSV sources): 84


### Redefine Retrievers (using Recursive Text Split Chunking) with Combined Dataset 

The original retrievers were built using only the 50 CSV documents, but our test set was generated from the combined 84 documents (PDF + CSV). We need to rebuild all retrievers using the `final_evaluation_docs` to ensure they search the same corpus that generated our test questions.


In [113]:
# Rebuild main vectorstore with combined data
print("🔄 Rebuilding main vectorstore with combined dataset...")
vectorstore_updated = Qdrant.from_documents(
    final_evaluation_docs,  # ✅ Use combined PDF + CSV docs
    embeddings,
    location=":memory:",
    collection_name="Synthetic_Usecases_Updated"
)
print(f"✅ Vectorstore created with {len(final_evaluation_docs)} documents")

# Rebuild naive retriever
naive_retriever_updated = vectorstore_updated.as_retriever(search_kwargs={"k": 10})
print("✅ Naive retriever updated")

# Rebuild BM25 retriever
bm25_retriever_updated = BM25Retriever.from_documents(final_evaluation_docs)
bm25_retriever_updated.k = 10  # Match the k value of other retrievers
print("✅ BM25 retriever updated")

# Rebuild contextual compression retriever
compressor_updated = CohereRerank(model="rerank-v3.5", top_n=7)
compression_retriever_updated = ContextualCompressionRetriever(
    base_compressor=compressor_updated, 
    base_retriever=naive_retriever_updated
)
print("✅ Contextual compression retriever updated")

# Rebuild multi-query retriever
multi_query_retriever_updated = MultiQueryRetriever.from_llm(
    retriever=naive_retriever_updated, 
    llm=chat_model
)
print("✅ Multi-query retriever updated")

# Rebuild parent document retriever
# Create new QDrant client and collection
client_updated = QdrantClient(location=":memory:")
client_updated.create_collection(
    collection_name="full_documents_updated",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore_updated = QdrantVectorStore(
    collection_name="full_documents_updated", 
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"), 
    client=client_updated
)

# Create parent document retriever
parent_document_retriever_updated = ParentDocumentRetriever(
    vectorstore=parent_document_vectorstore_updated,
    docstore=InMemoryStore(),
    child_splitter=child_splitter,  # Reuse the same splitter
)

# Add the combined documents
parent_document_retriever_updated.add_documents(final_evaluation_docs, ids=None)
print("✅ Parent document retriever updated")


# Rebuild ensemble retriever
retriever_list_updated = [
    bm25_retriever_updated, 
    naive_retriever_updated, 
    parent_document_retriever_updated, 
    compression_retriever_updated, 
    multi_query_retriever_updated
]
equal_weighting = [1/len(retriever_list_updated)] * len(retriever_list_updated)

ensemble_retriever_updated = EnsembleRetriever(
    retrievers=retriever_list_updated, 
    weights=equal_weighting
)
print("✅ Ensemble retriever updated")
print(f"\n🎉 All retrievers rebuilt with {len(final_evaluation_docs)} documents!")



🔄 Rebuilding main vectorstore with combined dataset...
✅ Vectorstore created with 84 documents
✅ Naive retriever updated
✅ BM25 retriever updated
✅ Contextual compression retriever updated
✅ Multi-query retriever updated
✅ Parent document retriever updated
✅ Ensemble retriever updated

🎉 All retrievers rebuilt with 84 documents!
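
For intuition on what the equal weights do: to my knowledge, `EnsembleRetriever` combines the ranked lists from its retrievers with weighted Reciprocal Rank Fusion. The sketch below is a plain-Python illustration of that idea, not the library's implementation; the function name `weighted_rrf`, the document ids, and the constant `c=60` (which I understand to be the usual default) are assumptions.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each hit adds weight / (rank + c) to its document's score."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical ranked results from three retrievers
rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c"],
]
weights = [1 / len(rankings)] * len(rankings)  # equal weighting, as in the cell above

fused = weighted_rrf(rankings, weights)
print(fused)
```

Documents that several retrievers rank highly rise to the top, which is why the ensemble tends to favour recall over precision.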


In [114]:
# Rebuild the RAG chains with updated retrievers
print("🔄 Rebuilding RAG chains...")

# Naive retrieval chain
naive_retrieval_chain_updated = (
    {"context": itemgetter("question") | naive_retriever_updated, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

# BM25 retrieval chain
bm25_retrieval_chain_updated = (
    {"context": itemgetter("question") | bm25_retriever_updated, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

# Contextual compression chain
contextual_compression_retrieval_chain_updated = (
    {"context": itemgetter("question") | compression_retriever_updated, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

# Multi-query chain
multi_query_retrieval_chain_updated = (
    {"context": itemgetter("question") | multi_query_retriever_updated, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

# Parent document chain
parent_document_retrieval_chain_updated = (
    {"context": itemgetter("question") | parent_document_retriever_updated, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

# Ensemble chain
ensemble_retrieval_chain_updated = (
    {"context": itemgetter("question") | ensemble_retriever_updated, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

print("✅ All RAG chains rebuilt!")


🔄 Rebuilding RAG chains...
✅ All RAG chains rebuilt!
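
All six chains share the same three-step shape, so it can help to see the data flow without the LCEL syntax. The sketch below is a plain-Python analogue with stand-in functions; `build_chain`, `fake_retrieve`, and `fake_generate` are hypothetical and replace the real retriever and the `rag_prompt | chat_model` step.

```python
def build_chain(retrieve, generate):
    """Return a callable that mirrors the retrieve -> pass through -> answer flow."""
    def chain(inputs):
        question = inputs["question"]
        # Step 1: fetch context for the question and carry the question along
        state = {"context": retrieve(question), "question": question}
        # Step 2: the assign step re-assigns "context" to itself, so state is unchanged
        # Step 3: generate a response and return it together with the context
        return {"response": generate(state), "context": state["context"]}
    return chain


def fake_retrieve(question):
    """Stand-in retriever: returns one fake document."""
    return [f"doc about {question}"]


def fake_generate(state):
    """Stand-in for prompt + model: summarises its inputs."""
    return f"answer to {state['question']!r} using {len(state['context'])} doc(s)"


chain = build_chain(fake_retrieve, fake_generate)
result = chain({"question": "BM25"})
print(result)
```

Because only the retriever differs between chains, a small factory like this (applied to the real LCEL pieces) would remove most of the repetition in the cell above.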


#### Use RAGAS for Synthetic Data Generation

In [115]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

generator_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-4.1-mini")
    )
generator_embeddings = LangchainEmbeddingsWrapper(
    OpenAIEmbeddings(model="text-embedding-3-small")
    )

In [42]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
testset = generator.generate_with_langchain_docs(documents=final_evaluation_docs, testset_size=15)

Applying HeadlinesExtractor:   0%|          | 0/21 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/84 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node

Applying SummaryExtractor:   0%|          | 0/38 [00:00<?, ?it/s]

Property 'summary' already exists in node '3f83d3'. Skipping!
Property 'summary' already exists in node '7f3a37'. Skipping!
Property 'summary' already exists in node 'c490d8'. Skipping!
Property 'summary' already exists in node '042eac'. Skipping!
Property 'summary' already exists in node '69261f'. Skipping!
Property 'summary' already exists in node 'dead5c'. Skipping!
Property 'summary' already exists in node '56e422'. Skipping!
Property 'summary' already exists in node '7ccb8c'. Skipping!
Property 'summary' already exists in node 'c33412'. Skipping!
Property 'summary' already exists in node '245656'. Skipping!
Property 'summary' already exists in node '1fce7e'. Skipping!
Property 'summary' already exists in node '187874'. Skipping!
Property 'summary' already exists in node '5517c6'. Skipping!
Property 'summary' already exists in node '2bf359'. Skipping!
Property 'summary' already exists in node '74e74a'. Skipping!
Property 'summary' already exists in node '9f8df6'. Skipping!

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/48 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '042eac'. Skipping!
Property 'summary_embedding' already exists in node '245656'. Skipping!
Property 'summary_embedding' already exists in node '9e10fb'. Skipping!
Property 'summary_embedding' already exists in node '56e422'. Skipping!
Property 'summary_embedding' already exists in node '69261f'. Skipping!
Property 'summary_embedding' already exists in node '7f3a37'. Skipping!
Property 'summary_embedding' already exists in node 'dead5c'. Skipping!
Property 'summary_embedding' already exists in node '7ccb8c'. Skipping!
Property 'summary_embedding' already exists in node '5517c6'. Skipping!
Property 'summary_embedding' already exists in node 'c33412'. Skipping!
Property 'summary_embedding' already exists in node 'c490d8'. Skipping!
Property 'summary_embedding' already exists in node '3f83d3'. Skipping!
Property 'summary_embedding' already exists in node '1fce7e'. Skipping!
Property 'summary_embedding' already exists in node '187874'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/15 [00:00<?, ?it/s]

In [43]:
testset_df = testset.to_pandas()
testset_df

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What are the expected developments for ChatGPT...,[Introduction ChatGPT launched in November 202...,The provided context does not specify any deta...,single_hop_specifc_query_synthesizer
1,OpenAI how does it help work?,[Table 1: ChatGPT daily message counts (millio...,The context explains that OpenAI's ChatGPT is ...,single_hop_specifc_query_synthesizer
2,How does the usage of ChatGPT differ among use...,[Variation by Occupation Figure 23 presents va...,"Users in nonprofessional occupations, includin...",single_hop_specifc_query_synthesizer
3,What does the paper reveal about ChatGPT usage...,[Conclusion This paper studies the rapid growt...,"The paper indicates that in the US, ChatGPT us...",single_hop_specifc_query_synthesizer
4,How does the message volume comparison between...,[<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%...,"In June 2024, non-work messages accounted for ...",multi_hop_abstract_query_synthesizer
5,How does the rapid growth and widespread adopt...,[<1-hop>\n\nConclusion This paper studies the ...,The first context segment highlights that by J...,multi_hop_abstract_query_synthesizer
6,Based on the analysis of ChatGPT's usage patte...,[<1-hop>\n\nConclusion This paper studies the ...,The context indicates that ChatGPT has experie...,multi_hop_abstract_query_synthesizer
7,Based on the message volume comparison between...,[<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%...,"In June 2024, non-work messages accounted for ...",multi_hop_abstract_query_synthesizer
8,How does the increase in total message volume ...,[<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%...,The data shows that total messages increased f...,multi_hop_abstract_query_synthesizer
9,How did ChatGPT's usage patterns and message v...,[<1-hop>\n\nTable 1: ChatGPT daily message cou...,"Between June 2024 and June 2025, ChatGPT exper...",multi_hop_specific_query_synthesizer


In [None]:
# store testset_df as a csv file 
# testset_df.to_csv("testset.csv", index=False)

# load testset_df from csv file
testset_df = pd.read_csv("testset.csv")

# Convert string representations of lists back to actual lists
import ast
testset_df['reference_contexts'] = testset_df['reference_contexts'].apply(ast.literal_eval)
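
The `ast.literal_eval` step is needed because a list column does not survive a CSV round trip: it is written as its string representation and read back as a plain string. A minimal sketch with the standard library (the row contents are hypothetical):

```python
import ast
import csv
import io

# Hypothetical test-set row with a list-valued column
original = {"user_input": "q1", "reference_contexts": ["ctx one", "ctx two"]}

# Write the row to an in-memory CSV; the list is stored via its str() form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(original))
writer.writeheader()
writer.writerow(original)

# Read it back: every field is now a string
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
raw = loaded["reference_contexts"]

# literal_eval safely parses the string back into a Python list
restored = ast.literal_eval(raw)

print(type(raw).__name__, raw)
print(type(restored).__name__, restored)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.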


### Evaluations Part I: Retrievers with Recursive Text Split Chunking

In [None]:
import time 
from retriever_evaluator import evaluate_retriever, SUGGESTED_METRICS, get_suggested_metrics, RetrieverType, print_results

list_of_retrievers = [
    (RetrieverType.NAIVE, naive_retrieval_chain_updated),
    (RetrieverType.BM25, bm25_retrieval_chain_updated),
    (RetrieverType.COMPRESSION, contextual_compression_retrieval_chain_updated),
    (RetrieverType.MULTI_QUERY, multi_query_retrieval_chain_updated),
    (RetrieverType.PARENT_DOCUMENT, parent_document_retrieval_chain_updated),
    (RetrieverType.ENSEMBLE, ensemble_retrieval_chain_updated)
]


final_results = {}
for i, (retriever_type, chain) in enumerate(list_of_retrievers):
    result = evaluate_retriever(
        retriever_chain=chain,
        name=retriever_type,
        testset_df=testset_df,
        evaluator_llm=generator_llm,
        metrics=get_suggested_metrics(retriever_type)
    )
    print_results(result)
    final_results[retriever_type] = result

    # Add delay between retrievers (wait 90 seconds after each one) to avoid rate limits
    if i < len(list_of_retrievers) - 1:  # Don't wait after the last one
        print("Waiting 90 seconds to avoid rate limits...")
        time.sleep(90)




🔍 Evaluating: naive retriever

🗑️  Deleted existing project: naive_retriever_eval
📁 LangSmith Project: naive_retriever_eval

📊 Retrieving for 10 test questions...
 Retrieval for naive completed for 10 eval samples.

Running RAGAS evaluation ...


Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

{'context_recall': 0.9417, 'context_precision': 0.8886, 'context_entity_recall': 0.4262}
✅ RAGAS evaluation completed in 62.90s

💰 Fetching cost and latency from LangSmith...

📊 RESULTS: naive

🎯 QUALITY METRICS (RAGAS):
----------------------------------------------------------------------
  context_recall.................................... 0.9417
  context_precision................................. 0.8886
  context_entity_recall............................. 0.4262

 ⚡ COST and LATENCY METRICS:
----------------------------------------------------------------------
  Cost per query (USD):............................ $0.000284
  Prompt cost per query (USD):............................ $0.000239
  Completion cost per query (USD):............................ $0.000046
  Latency (P50):............................ 3.075s
  Latency (P99):............................ 3.788s
  Tokens per query:............................ 2499.940


Waiting 90 seconds to avoid rate limits...

🔍 Evaluating: bm

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

{'context_recall': 0.9167, 'context_precision': 0.7740, 'context_entity_recall': 0.4483}
✅ RAGAS evaluation completed in 46.44s

💰 Fetching cost and latency from LangSmith...

📊 RESULTS: bm25

🎯 QUALITY METRICS (RAGAS):
----------------------------------------------------------------------
  context_recall.................................... 0.9167
  context_precision................................. 0.7740
  context_entity_recall............................. 0.4483

 ⚡ COST and LATENCY METRICS:
----------------------------------------------------------------------
  Cost per query (USD):............................ $0.000276
  Prompt cost per query (USD):............................ $0.000234
  Completion cost per query (USD):............................ $0.000042
  Latency (P50):............................ 3.197s
  Latency (P99):............................ 4.529s
  Tokens per query:............................ 2447.850


Waiting 90 seconds to avoid rate limits...

🔍 Evaluating: com

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

{'context_recall': 0.8833, 'context_precision': 0.9462, 'context_entity_recall': 0.3885}
✅ RAGAS evaluation completed in 23.61s

💰 Fetching cost and latency from LangSmith...

📊 RESULTS: compression

🎯 QUALITY METRICS (RAGAS):
----------------------------------------------------------------------
  context_recall.................................... 0.8833
  context_precision................................. 0.9462
  context_entity_recall............................. 0.3885

 ⚡ COST and LATENCY METRICS:
----------------------------------------------------------------------
  Cost per query (USD):............................ $0.000211
  Prompt cost per query (USD):............................ $0.000166
  Completion cost per query (USD):............................ $0.000045
  Latency (P50):............................ 3.396s
  Latency (P99):............................ 5.691s
  Tokens per query:............................ 1770.140


Waiting 90 seconds to avoid rate limits...

🔍 Evaluati

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

Exception raised in Job[20]: TimeoutError()


{'context_recall': 0.9750, 'context_precision': 0.8484, 'context_entity_recall': 0.3714}
✅ RAGAS evaluation completed in 188.04s

💰 Fetching cost and latency from LangSmith...

📊 RESULTS: multi_query

🎯 QUALITY METRICS (RAGAS):
----------------------------------------------------------------------
  context_recall.................................... 0.9750
  context_precision................................. 0.8484
  context_entity_recall............................. nan

 ⚡ COST and LATENCY METRICS:
----------------------------------------------------------------------
  Cost per query (USD):............................ $0.000241
  Prompt cost per query (USD):............................ $0.000196
  Completion cost per query (USD):............................ $0.000045
  Latency (P50):............................ 4.635s
  Latency (P99):............................ 8.972s
  Tokens per query:............................ 2075.500


Waiting 90 seconds to avoid rate limits...

🔍 Evaluating

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

{'context_recall': 0.8300, 'context_precision': 0.8917, 'context_entity_recall': 0.3049}
✅ RAGAS evaluation completed in 20.92s

💰 Fetching cost and latency from LangSmith...

📊 RESULTS: parent_document

🎯 QUALITY METRICS (RAGAS):
----------------------------------------------------------------------
  context_recall.................................... 0.8300
  context_precision................................. 0.8917
  context_entity_recall............................. 0.3049

 ⚡ COST and LATENCY METRICS:
----------------------------------------------------------------------
  Cost per query (USD):............................ $0.000112
  Prompt cost per query (USD):............................ $0.000077
  Completion cost per query (USD):............................ $0.000035
  Latency (P50):............................ 2.708s
  Latency (P99):............................ 3.677s
  Tokens per query:............................ 859.800


Waiting 90 seconds to avoid rate limits...

🔍 Evalu

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

Exception raised in Job[23]: TimeoutError()


{'context_recall': 0.9750, 'context_precision': 0.8103, 'context_entity_recall': 0.3642}
✅ RAGAS evaluation completed in 191.31s

💰 Fetching cost and latency from LangSmith...

📊 RESULTS: ensemble

🎯 QUALITY METRICS (RAGAS):
----------------------------------------------------------------------
  context_recall.................................... 0.9750
  context_precision................................. 0.8103
  context_entity_recall............................. nan

 ⚡ COST and LATENCY METRICS:
----------------------------------------------------------------------
  Cost per query (USD):............................ $0.000233
  Prompt cost per query (USD):............................ $0.000195
  Completion cost per query (USD):............................ $0.000038
  Latency (P50):............................ 5.584s
  Latency (P99):............................ 9.635s
  Tokens per query:............................ 2046.700




In [None]:
from typing import Any

def print_metrics_results_table(final_results: dict[str, Any]):
    "Print evaluation results in a formatted table"
    
    all_data = []
    
    # Mapping for shorter column names
    metric_name_map = {
        'context_recall': 'Recall',
        'context_precision': 'Precision',
        'context_entity_recall': 'Entity_Recall'
    }
    
    for retriever_type, result in final_results.items():
        row = {'Retriever': result['name']}
        
        # Add quality metrics with shorter names
        for metric_name, score in result['quality_metrics'].items():
            short_name = metric_name_map.get(metric_name, metric_name)
            row[short_name] = score
        
        # Add performance metrics with shorter names
        perf = result['performance_metrics']
        row['Cost ($)'] = perf['cost_per_query_usd']
        row['Prompt ($)'] = perf['prompt_cost_per_query_usd']
        row['Compl ($)'] = perf['completion_cost_per_query_usd']
        row['Tokens'] = perf['total_tokens_per_query']
        row['P50 (s)'] = perf['p50_latency_seconds']
        row['P99 (s)'] = perf['p99_latency_seconds']
        
        all_data.append(row)
    
    # Create DataFrame
    results_df = pd.DataFrame(all_data)
    
    # Reorder columns: Retriever, Quality metrics, Cost metrics, Latency metrics, Tokens
    desired_order = [
        'Retriever',
        'Recall', 'Precision', 'Entity_Recall',  # Quality metrics together
        'Cost ($)', 'Prompt ($)', 'Compl ($)',   # Cost metrics together
        'P50 (s)', 'P99 (s)',                     # Latency metrics together
        'Tokens'                                   # Tokens last
    ]
    
    # Only include columns that actually exist in the dataframe
    column_order = [col for col in desired_order if col in results_df.columns]
    results_df = results_df[column_order]
    
    # Convert Tokens to integer
    if 'Tokens' in results_df.columns:
        results_df['Tokens'] = results_df['Tokens'].astype(int)
    
    # Print with nice formatting
    print("\n" + "="*120)
    print("📊 COMPLETE EVALUATION RESULTS")
    print("="*120 + "\n")
    print(results_df.to_string(index=False, float_format=lambda x: f'{x:.4f}'))  # type: ignore
    print("\n" + "="*120 + "\n")
    
    return results_df

In [120]:
# Re-run with updated shorter column names
results_df = print_metrics_results_table(final_results)


📊 COMPLETE EVALUATION RESULTS

      Retriever  Recall  Precision  Entity_Recall    Cost ($)  Prompt ($)   Compl ($)  P50 (s)  P99 (s)  Tokens
          naive  0.9417     0.8886         0.4262 0.000284401 0.000238525 0.000045876   3.0750   3.7876    2499
           bm25  0.9167     0.7740         0.4483 0.000276246 0.000234298 0.000041948   3.1975   4.5290    2447
    compression  0.8833     0.9462         0.3885 0.000210587 0.000165823 0.000044764   3.3960   5.6913    1770
    multi_query  0.9750     0.8484            NaN 0.000241135 0.000196355  0.00004478   4.6345   8.9723    2075
parent_document  0.8300     0.8917         0.3049 0.000112173 0.000077249 0.000034924   2.7085   3.6767     859
       ensemble  0.9750     0.8103            NaN 0.000233305 0.000195125  0.00003818   5.5840   9.6349    2046
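
In the table above, `Entity_Recall` is NaN for `multi_query` and `ensemble`, the two runs that logged a `TimeoutError`, even though the raw score dictionaries printed during evaluation show 0.3714 and 0.3642. One plausible explanation (an assumption, since the `retriever_evaluator` module is not shown here) is that a timed-out sample yields a NaN score and the two numbers come from different averaging rules. The sketch below shows the difference with hypothetical per-sample scores.

```python
import math


def plain_mean(scores):
    """Arithmetic mean; a single NaN makes the whole result NaN."""
    return sum(scores) / len(scores)


def nan_aware_mean(scores):
    """Mean over the valid scores only, skipping NaN entries."""
    valid = [s for s in scores if not math.isnan(s)]
    return sum(valid) / len(valid) if valid else float("nan")


# Hypothetical per-sample scores; the NaN stands for a timed-out evaluation job
scores = [0.5, 0.25, float("nan"), 0.75]

print(plain_mean(scores))
print(nan_aware_mean(scores))
```

If this is what happened, the NaN-skipping averages are still usable for comparison, but they are computed over fewer samples than the other retrievers' scores.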




### Evaluations Part II: Retrievers with Semantic Chunking

In [142]:
from langchain_experimental.text_splitter import SemanticChunker

# Create semantic chunker
semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="standard_deviation",
)

# Apply semantic chunking to all documents
print("🔄 Creating semantic chunks from combined dataset...")
semantic_documents = semantic_chunker.split_documents(final_evaluation_docs)
print(f"✅ Created {len(semantic_documents)} semantic chunks from {len(final_evaluation_docs)} documents")

# Create new vectorstore with semantic chunks
print("🔄 Creating vectorstore with semantic chunks...")
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Semantic_Chunks_Vectorstore"
)
print(f"✅ Semantic vectorstore created")

🔄 Creating semantic chunks from combined dataset...
✅ Created 88 semantic chunks from 84 documents
🔄 Creating vectorstore with semantic chunks...
✅ Semantic vectorstore created
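
Getting 88 chunks from 84 documents means almost nothing was split. A likely reason, as I understand the `standard_deviation` threshold type, is that a breakpoint is only placed where the distance exceeds the mean plus a multiple of the standard deviation, and the default multiple of 3 is a high bar (worth checking against the installed version). The sketch below illustrates the rule in plain Python; the `distances` values and the helper name `std_breakpoints` are hypothetical.

```python
from statistics import mean, pstdev


def std_breakpoints(distances, amount=3.0):
    """Indices whose distance exceeds mean + amount * population standard deviation."""
    threshold = mean(distances) + amount * pstdev(distances)
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical cosine distances between consecutive sentences
distances = [0.10, 0.12, 0.11, 0.30, 0.13, 0.12, 0.28, 0.11, 0.14, 0.45]

splits_default = std_breakpoints(distances)            # amount=3: very strict
splits_looser = std_breakpoints(distances, amount=1.0)  # lower multiple: more splits

print(splits_default)
print(splits_looser)
```

With these made-up distances the default multiple produces no breakpoints at all, while a multiple of 1 splits at the two largest jumps, so passing a smaller `breakpoint_threshold_amount` is one knob to try.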


In [143]:
# Rebuild naive retriever with semantic chunks
semantic_naive_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k": 10})
print("✅ Semantic naive retriever created")

# Rebuild BM25 retriever with semantic chunks
semantic_bm25_retriever = BM25Retriever.from_documents(semantic_documents)
semantic_bm25_retriever.k = 10
print("✅ Semantic BM25 retriever created")

# Rebuild contextual compression retriever with semantic chunks
semantic_compressor = CohereRerank(model="rerank-v3.5", top_n=7)
semantic_compression_retriever = ContextualCompressionRetriever(
    base_compressor=semantic_compressor, 
    base_retriever=semantic_naive_retriever
)
print("✅ Semantic compression retriever created")

# Rebuild multi-query retriever with semantic chunks
semantic_multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=semantic_naive_retriever, 
    llm=chat_model
)
print("✅ Semantic multi-query retriever created")

# Rebuild parent document retriever with semantic chunks
semantic_client = QdrantClient(location=":memory:")
semantic_client.create_collection(
    collection_name="semantic_parent_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

semantic_parent_vectorstore = QdrantVectorStore(
    collection_name="semantic_parent_documents", 
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"), 
    client=semantic_client
)

semantic_parent_retriever = ParentDocumentRetriever(
    vectorstore=semantic_parent_vectorstore,
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
)
semantic_parent_retriever.add_documents(semantic_documents, ids=None)
print("✅ Semantic parent document retriever created")

# Rebuild ensemble retriever with semantic chunks
semantic_retriever_list = [
    semantic_bm25_retriever, 
    semantic_naive_retriever, 
    semantic_parent_retriever, 
    semantic_compression_retriever, 
    semantic_multi_query_retriever
]
semantic_equal_weighting = [1/len(semantic_retriever_list)] * len(semantic_retriever_list)

semantic_ensemble_retriever = EnsembleRetriever(
    retrievers=semantic_retriever_list, 
    weights=semantic_equal_weighting
)
print("✅ Semantic ensemble retriever created")
print(f"\n🎉 All semantic retrievers built with {len(semantic_documents)} semantic chunks!")

✅ Semantic naive retriever created
✅ Semantic BM25 retriever created
✅ Semantic compression retriever created
✅ Semantic multi-query retriever created
✅ Semantic parent document retriever created
✅ Semantic ensemble retriever created

🎉 All semantic retrievers built with 88 semantic chunks!


In [145]:
print("🔄 Building RAG chains with semantic retrievers...")

# Naive retrieval chain with semantic chunks
semantic_naive_chain = (
    {"context": itemgetter("question") | semantic_naive_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

# BM25 retrieval chain with semantic chunks
semantic_bm25_chain = (
    {"context": itemgetter("question") | semantic_bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

# Contextual compression chain with semantic chunks
semantic_compression_chain = (
    {"context": itemgetter("question") | semantic_compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

# Multi-query chain with semantic chunks
semantic_multi_query_chain = (
    {"context": itemgetter("question") | semantic_multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

# Parent document chain with semantic chunks
semantic_parent_chain = (
    {"context": itemgetter("question") | semantic_parent_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

# Ensemble chain with semantic chunks
semantic_ensemble_chain = (
    {"context": itemgetter("question") | semantic_ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

print("✅ All semantic RAG chains built!")

🔄 Building RAG chains with semantic retrievers...
✅ All semantic RAG chains built!


In [146]:
semantic_retrievers_list = [
    (RetrieverType.NAIVE, semantic_naive_chain),
    (RetrieverType.BM25, semantic_bm25_chain),
    (RetrieverType.COMPRESSION, semantic_compression_chain),
    (RetrieverType.MULTI_QUERY, semantic_multi_query_chain),
    (RetrieverType.PARENT_DOCUMENT, semantic_parent_chain),
    (RetrieverType.ENSEMBLE, semantic_ensemble_chain)
]

semantic_results = {}
for i, (retriever_type, chain) in enumerate(semantic_retrievers_list):
    result = evaluate_retriever(
        retriever_chain=chain,
        name=retriever_type,
        testset_df=testset_df,
        evaluator_llm=generator_llm,
        metrics=get_suggested_metrics(retriever_type)
    )
    print_results(result)
    semantic_results[retriever_type] = result
    
    # Add delay between retrievers to avoid rate limits
    if i < len(semantic_retrievers_list) - 1:
        print("Waiting 90 seconds to avoid rate limits...")
        time.sleep(90)


🔍 Evaluating: naive retriever

🗑️  Deleted existing project: naive_retriever_eval
📁 LangSmith Project: naive_retriever_eval

📊 Retrieving for 10 test questions...
 Retrieval for naive completed for 10 eval samples.

Running RAGAS evaluation ...


Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

{'context_recall': 0.9417, 'context_precision': 0.8736, 'context_entity_recall': 0.4025}
✅ RAGAS evaluation completed in 40.73s

💰 Fetching cost and latency from LangSmith...

📊 RESULTS: naive

🎯 QUALITY METRICS (RAGAS):
----------------------------------------------------------------------
  context_recall.................................... 0.9417
  context_precision................................. 0.8736
  context_entity_recall............................. 0.4025

 ⚡ COST and LATENCY METRICS:
----------------------------------------------------------------------
  Cost per query (USD):............................ $0.000278
  Prompt cost per query (USD):............................ $0.000232
  Completion cost per query (USD):............................ $0.000046
  Latency (P50):............................ 3.176s
  Latency (P99):............................ 5.340s
  Tokens per query:............................ 2435.340


Waiting 90 seconds to avoid rate limits...

🔍 Evaluating: bm

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

{'context_recall': 0.9500, 'context_precision': 0.7507, 'context_entity_recall': 0.4214}
✅ RAGAS evaluation completed in 41.45s

💰 Fetching cost and latency from LangSmith...

📊 RESULTS: bm25

🎯 QUALITY METRICS (RAGAS):
----------------------------------------------------------------------
  context_recall.................................... 0.9500
  context_precision................................. 0.7507
  context_entity_recall............................. 0.4214

 ⚡ COST and LATENCY METRICS:
----------------------------------------------------------------------
  Cost per query (USD):............................ $0.000275
  Prompt cost per query (USD):............................ $0.000229
  Completion cost per query (USD):............................ $0.000047
  Latency (P50):............................ 3.152s
  Latency (P99):............................ 4.499s
  Tokens per query:............................ 2402.790


Waiting 90 seconds to avoid rate limits...

🔍 Evaluating: com

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

{'context_recall': 0.9417, 'context_precision': 0.9111, 'context_entity_recall': 0.3536}
✅ RAGAS evaluation completed in 24.54s

💰 Fetching cost and latency from LangSmith...

📊 RESULTS: compression

🎯 QUALITY METRICS (RAGAS):
----------------------------------------------------------------------
  context_recall.................................... 0.9417
  context_precision................................. 0.9111
  context_entity_recall............................. 0.3536

 ⚡ COST and LATENCY METRICS:
----------------------------------------------------------------------
  Cost per query (USD):............................ $0.000206
  Prompt cost per query (USD):............................ $0.000162
  Completion cost per query (USD):............................ $0.000043
  Latency (P50):............................ 4.148s
  Latency (P99):............................ 7.103s
  Tokens per query:............................ 1732.130


Waiting 90 seconds to avoid rate limits...

🔍 Evaluati

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

{'context_recall': 0.9417, 'context_precision': 0.8787, 'context_entity_recall': 0.3419}
✅ RAGAS evaluation completed in 65.57s

💰 Fetching cost and latency from LangSmith...

📊 RESULTS: multi_query

🎯 QUALITY METRICS (RAGAS):
----------------------------------------------------------------------
  context_recall.................................... 0.9417
  context_precision................................. 0.8787
  context_entity_recall............................. 0.3419

 ⚡ COST and LATENCY METRICS:
----------------------------------------------------------------------
  Cost per query (USD):............................ $0.000229
  Prompt cost per query (USD):............................ $0.000181
  Completion cost per query (USD):............................ $0.000048
  Latency (P50):............................ 4.757s
  Latency (P99):............................ 7.022s
  Tokens per query:............................ 1987.940


Waiting 90 seconds to avoid rate limits...

🔍 Evaluati

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

{'context_recall': 0.8500, 'context_precision': 0.8944, 'context_entity_recall': 0.2988}
✅ RAGAS evaluation completed in 18.05s

💰 Fetching cost and latency from LangSmith...

📊 RESULTS: parent_document

🎯 QUALITY METRICS (RAGAS):
----------------------------------------------------------------------
  context_recall.................................... 0.8500
  context_precision................................. 0.8944
  context_entity_recall............................. 0.2988

 ⚡ COST and LATENCY METRICS:
----------------------------------------------------------------------
  Cost per query (USD):............................ $0.000124
  Prompt cost per query (USD):............................ $0.000087
  Completion cost per query (USD):............................ $0.000037
  Latency (P50):............................ 2.531s
  Latency (P99):............................ 3.607s
  Tokens per query:............................ 960.320


Waiting 90 seconds to avoid rate limits...

🔍 Evalu

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

{'context_recall': 0.9750, 'context_precision': 0.7945, 'context_entity_recall': 0.4492}
✅ RAGAS evaluation completed in 117.21s

💰 Fetching cost and latency from LangSmith...

📊 RESULTS: ensemble

🎯 QUALITY METRICS (RAGAS):
----------------------------------------------------------------------
  context_recall.................................... 0.9750
  context_precision................................. 0.7945
  context_entity_recall............................. 0.4492

 ⚡ COST and LATENCY METRICS:
----------------------------------------------------------------------
  Cost per query (USD):............................ $0.000224
  Prompt cost per query (USD):............................ $0.000183
  Completion cost per query (USD):............................ $0.000041
  Latency (P50):............................ 5.793s
  Latency (P99):............................ 9.271s
  Tokens per query:............................ 1928.650




In [147]:
# Display results in a formatted table
semantic_results_df = print_metrics_results_table(semantic_results)


📊 COMPLETE EVALUATION RESULTS

      Retriever  Recall  Precision  Entity_Recall    Cost ($)  Prompt ($)   Compl ($)  P50 (s)  P99 (s)  Tokens
          naive  0.9417     0.8736         0.4025 0.000278139 0.000231999  0.00004614   3.1760   5.3404    2435
           bm25  0.9500     0.7507         0.4214 0.000275427 0.000228563 0.000046864   3.1525   4.4986    2402
    compression  0.9417     0.9111         0.3536 0.000205769 0.000162361 0.000043408   4.1480   7.1026    1732
    multi_query  0.9417     0.8787         0.3419 0.000228875 0.000181087 0.000047788   4.7570   7.0220    1987
parent_document  0.8500     0.8944         0.2988 0.000124139 0.000086663 0.000037476   2.5310   3.6066     960
       ensemble  0.9750     0.7945         0.4492 0.000223561 0.000182633 0.000040928   5.7925   9.2705    1928




In [149]:
import numpy as np

# Create comparison table with consecutive rows for Original vs Semantic
comparison_data = []

for retriever_type in final_results.keys():
    if retriever_type in semantic_results:
        orig = final_results[retriever_type]
        sem = semantic_results[retriever_type]
        
        # Original row
        orig_row = {
            'Retriever': retriever_type,
            'Chunking': 'Recursive Splitting',
            'Recall': orig['quality_metrics'].get('context_recall', None),
            'Precision': orig['quality_metrics'].get('context_precision', None),
            'Entity_Recall': orig['quality_metrics'].get('context_entity_recall', None),
            'Cost ($)': orig['performance_metrics']['cost_per_query_usd'],
            'P50 (s)': orig['performance_metrics']['p50_latency_seconds'],
            'P99 (s)': orig['performance_metrics']['p99_latency_seconds'],
            'Tokens': int(orig['performance_metrics']['total_tokens_per_query']),
        }
        
        # Semantic row
        sem_row = {
            'Retriever': retriever_type,
            'Chunking': 'Semantic',
            'Recall': sem['quality_metrics'].get('context_recall', None),
            'Precision': sem['quality_metrics'].get('context_precision', None),
            'Entity_Recall': sem['quality_metrics'].get('context_entity_recall', None),
            'Cost ($)': sem['performance_metrics']['cost_per_query_usd'],
            'P50 (s)': sem['performance_metrics']['p50_latency_seconds'],
            'P99 (s)': sem['performance_metrics']['p99_latency_seconds'],
            'Tokens': int(sem['performance_metrics']['total_tokens_per_query']),
        }
        
        comparison_data.append(orig_row)
        comparison_data.append(sem_row)

comparison_df = pd.DataFrame(comparison_data)

print("\n" + "="*120)
print("📊 ORIGINAL vs SEMANTIC CHUNKING COMPARISON")
print("="*120 + "\n")

# Custom formatting function
def format_value(val):
    if pd.isna(val):
        return 'N/A'
    elif isinstance(val, (int, np.integer)):
        return f'{val}'
    else:
        return f'{val:.4f}'

# Apply formatting for display
display_df = comparison_df.copy()
for col in display_df.columns:
    if col not in ['Retriever', 'Chunking']:
        display_df[col] = display_df[col].apply(format_value)

print(display_df.to_string(index=False))
print("\n" + "="*120 + "\n")


📊 ORIGINAL vs SEMANTIC CHUNKING COMPARISON

      Retriever            Chunking Recall Precision Entity_Recall Cost ($) P50 (s) P99 (s) Tokens
          naive Recursive Splitting 0.9417    0.8886        0.4262   0.0003  3.0750  3.7876   2499
          naive            Semantic 0.9417    0.8736        0.4025   0.0003  3.1760  5.3404   2435
           bm25 Recursive Splitting 0.9167    0.7740        0.4483   0.0003  3.1975  4.5290   2447
           bm25            Semantic 0.9500    0.7507        0.4214   0.0003  3.1525  4.4986   2402
    compression Recursive Splitting 0.8833    0.9462        0.3885   0.0002  3.3960  5.6913   1770
    compression            Semantic 0.9417    0.9111        0.3536   0.0002  4.1480  7.1026   1732
    multi_query Recursive Splitting 0.9750    0.8484           N/A   0.0002  4.6345  8.9723   2075
    multi_query            Semantic 0.9417    0.8787        0.3419   0.0002  4.7570  7.0220   1987
parent_document Recursive Splitting 0.8300    0.8917        0.30

### **Conclusions**

**Note**: High recall minimizes Type II errors (false negatives: relevant docs that were not retrieved), while high precision minimizes Type I errors (false positives: retrieved docs that are not relevant).
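
To make the two error types concrete, here is a minimal sketch using the classic set-based definitions of precision and recall. This is only an illustration: the RAGAS `context_precision` and `context_recall` metrics reported above are LLM-judged and computed differently. The helper `precision_recall` and the document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query (illustrative only)."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant docs that were retrieved
    fp = len(retrieved - relevant)   # Type I errors: retrieved but not relevant
    fn = len(relevant - retrieved)   # Type II errors: relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 docs retrieved, 3 docs actually relevant.
retrieved_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = ["d1", "d2", "d5"]

precision, recall = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here two of the four retrieved docs are relevant (precision 0.50), and two of the three relevant docs were found (recall about 0.67).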

- In terms of recall, ensemble and multi-query retrievers perform best (on average).

    - These retrievers cast a wider net, capturing more relevant docs in context.

- Compression retrievers were the clear winner in terms of precision.

    - Reranking filters out irrelevant docs during retrieval, minimizing Type I errors (false positives).

- Parent document retrievers incurred the lowest costs and latencies.

    - This is because they return fewer chunks than the other retrievers, so less context reaches the LLM (even though each retrieved parent document may be larger than an individual chunk).

- BM25 struggles with precision.

    - Keyword matching retrieves more loosely related docs.

- Semantic vs. recursive chunking:

| Metric | Trend | Explanation |
|--------|-------|-------------|
| **Recall** | Semantic is slightly better for most retrievers | Semantic chunks preserve topic coherence, improving retrieval of related content |
| **Precision** | Recursive is slightly better for most retrievers | Fixed-size recursive chunks may be more focused without including adjacent context |
| **Cost/Tokens** | Semantic slightly reduces tokens | Tokens per query drop slightly, likely because the retrieved context is a little shorter |
| **Latency** | Mixed | Semantic chunking doesn't significantly impact latency |
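
The recall and precision trends in the table can be sanity-checked against the comparison output above. The sketch below is a rough check rather than a full analysis: it hard-codes the (recall, precision) pairs for the four retrievers whose rows are fully visible in that output and averages the semantic-minus-recursive difference. The names `scores` and `mean_delta` are hypothetical.

```python
# (recall, precision) pairs copied from the comparison output above.
scores = {
    "naive":       {"recursive": (0.9417, 0.8886), "semantic": (0.9417, 0.8736)},
    "bm25":        {"recursive": (0.9167, 0.7740), "semantic": (0.9500, 0.7507)},
    "compression": {"recursive": (0.8833, 0.9462), "semantic": (0.9417, 0.9111)},
    "multi_query": {"recursive": (0.9750, 0.8484), "semantic": (0.9417, 0.8787)},
}


def mean_delta(metric_index):
    """Average of (semantic - recursive) across retrievers for one metric."""
    deltas = [
        row["semantic"][metric_index] - row["recursive"][metric_index]
        for row in scores.values()
    ]
    return sum(deltas) / len(deltas)


mean_recall_delta = mean_delta(0)      # index 0 = recall
mean_precision_delta = mean_delta(1)   # index 1 = precision

print(f"Mean recall delta:    {mean_recall_delta:+.4f}")
print(f"Mean precision delta: {mean_precision_delta:+.4f}")
```

On these four rows the average recall difference is slightly positive and the average precision difference is slightly negative. With only 10 test questions per run, differences of this size should be read as indicative rather than conclusive.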

**Summary**: Semantic chunking improves recall (finds more relevant docs) but slightly reduces precision (includes more noise). Token/cost savings are marginal.