# Advanced Retrieval with LangChain

In this notebook, we'll explore several methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

> You do not need to run the following cells if you are running this notebook locally. 

In [None]:
#!pip install -qU langchain langchain-openai langchain-cohere rank_bm25

We'll also be using [Qdrant's](https://qdrant.tech/documentation/frameworks/langchain/) (pronounced "Quadrant") VectorDB in "memory" mode, so we can run it locally in our Colab environment.

In [None]:
#!pip install -qU qdrant-client

We'll also provide our OpenAI key, as well as our Cohere API key.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [2]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")
os.environ["RAGAS_APP_TOKEN"] = getpass.getpass("Please enter your Ragas API key!")
os.environ["LANGCHAIN_PROJECT"] = f"AIM - Advanced RAG - {uuid4().hex[0:8]}"

## Task 2: Data Collection and Preparation

We'll be using some reviews from the 4 movies in the John Wick franchise today to explore the different retrieval strategies.

These were obtained from IMDB, and are available in the [AIM Data Repository](https://github.com/AI-Maker-Space/DataRepository).

### Data Collection

We can simply `wget` these from GitHub.

You could use any review data you wanted in this step - just be careful to make sure your metadata is aligned with your choice.

In [4]:
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv -O john_wick_1.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv -O john_wick_2.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw3.csv -O john_wick_3.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw4.csv -O john_wick_4.csv

--2025-03-01 10:03:00--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19628 (19K) [text/plain]
Saving to: ‘john_wick_1.csv’


2025-03-01 10:03:00 (5.19 MB/s) - ‘john_wick_1.csv’ saved [19628/19628]

--2025-03-01 10:03:00--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14747 (14K) [text/plain]
Saving to: ‘john_wick_2.csv’


2025-03-01 10:03:01

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

- Self-Query: Wants as much metadata as we can provide
- Time-weighted: Wants temporal data

> NOTE: While we're creating a temporal relationship based on when these movies came out for illustrative purposes, it needs to be clear that the "time-weighting" in the Time-weighted Retriever is based on when the document was *accessed* last - not when it was created.

In [4]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

documents = []

for i in range(1, 5):
  loader = CSVLoader(
      file_path=f"john_wick_{i}.csv",
      metadata_columns=["Review_Date", "Review_Title", "Review_Url", "Author", "Rating"]
  )

  movie_docs = loader.load()
  for doc in movie_docs:

    # Add the "Movie Title" (John Wick 1, 2, ...)
    doc.metadata["Movie_Title"] = f"John Wick {i}"

    # convert "Rating" to an `int`, if no rating is provided - assume 0 rating
    doc.metadata["Rating"] = int(doc.metadata["Rating"]) if doc.metadata["Rating"] else 0

    # newer movies have a more recent "last_accessed_at"
    doc.metadata["last_accessed_at"] = datetime.now() - timedelta(days=4-i)

  documents.extend(movie_docs)

Let's look at an example document to see if everything worked as expected!

In [5]:
documents[0]

Document(metadata={'source': 'john_wick_1.csv', 'row': 0, 'Review_Date': '6 May 2015', 'Review_Title': ' Kinetic, concise, and stylish; John Wick kicks ass.\n', 'Review_Url': '/review/rw3233896/?ref_=tt_urv', 'Author': 'lnvicta', 'Rating': 8, 'Movie_Title': 'John Wick 1', 'last_accessed_at': datetime.datetime(2025, 2, 27, 20, 5, 1, 537598)}, page_content=": 0\nReview: The best way I can describe John Wick is to picture Taken but instead of Liam Neeson it's Keanu Reeves and instead of his daughter it's his dog. That's essentially the plot of the movie. John Wick (Reeves) is out to seek revenge on the people who took something he loved from him. It's a beautifully simple premise for an action movie - when action movies get convoluted, they get bad i.e. A Good Day to Die Hard. John Wick gives the viewers what they want: Awesome action, stylish stunts, kinetic chaos, and a relatable hero to tie it all together. John Wick succeeds in its simplicity.")

## Task 3: Setting Up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "JohnWick".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [6]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWick"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply look at each review as a document, and use cosine-similarity to fetch the 10 most relevant documents.
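
To make the idea concrete, here's a minimal pure-Python sketch of cosine-similarity top-`k` search. The function names and the tiny 3-dimensional vectors are hypothetical stand-ins for illustration - real embeddings have far more dimensions, and a vector store like Qdrant uses its own optimized search rather than this brute-force loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy "embeddings" (assumption: not produced by any real model)
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vec = [1.0, 0.0, 0.0]

nearest = top_k(toy_query_vec, toy_doc_vecs, k=2)
print(nearest)
```

Setting `k` higher returns more (and progressively less similar) documents, which is exactly the trade-off we're making when we pick `10`.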

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [7]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [8]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-3.5-turbo` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [9]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-3.5-turbo")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [10]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [11]:
naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick based on the reviews provided in the context.'

In [12]:
naive_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". Here is the URL to that review: \'/review/rw4854296/?ref_=tt_urv\'.'

In [13]:
naive_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, an ex-hitman comes out of retirement to seek vengeance after gangsters kill his dog and steal everything from him. This sets off a chain of events where he faces off against various enemies, including professional killers and mobsters. The movie is known for its intense action, thrilling fights, and suspenseful storyline.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
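
If you're curious what that looks like under the hood, here's a rough sketch of a textbook BM25 scoring function in plain Python. This is an assumption-laden illustration: it uses whitespace tokenization, a common non-negative IDF variant, and the often-quoted defaults `k1=1.5` and `b=0.75`. The `rank_bm25` package that LangChain wraps may differ in its exact IDF formula and tokenization.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Longer-than-average documents are penalised, shorter ones are boosted
        length_norm = 1.0 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical toy reviews (not taken from the dataset)
toy_reviews = [
    "john wick is a stylish action movie",
    "the dog was the heart of the story",
    "keanu reeves delivers great action",
]
toy_scores = bm25_scores("dog story", toy_reviews)
print(toy_scores)
```

Notice that a document only scores above zero if it shares at least one exact word with the query - that's the "sparse" part, and it's why BM25 and embedding-based retrieval tend to surface different documents.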

This retriever is very straightforward to set up! Let's see it happen down below!


In [14]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents)

We'll construct the same chain - only changing the retriever.

In [15]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [16]:
bm25_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Overall, opinions on John Wick vary. Some people really enjoyed the action and style of the movie, while others found it lacking substance and depth. It seems like there are mixed reviews on whether people generally liked John Wick.'

In [17]:
bm25_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for "John Wick 4". Here is the URL to that review: /review/rw8946038/?ref_=tt_urv'

In [18]:
bm25_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, the protagonist is a former hitman seeking vengeance for the death of his beloved dog, given to him by his deceased wife. The movie is known for its intense action sequences and the main character's skills in combat."

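It's not clear that this is better or worse - but notice that the answer to our first question shifted from a confident "yes" to "mixed reviews", which tells us BM25 is surfacing a different set of reviews than our naive retriever!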

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
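
The two steps above can be sketched in a few lines of plain Python. Note that `toy_relevance` is a hypothetical word-overlap score we made up for illustration - it is only a stand-in for a real reranking model, which scores each (query, document) pair with a trained model instead.

```python
def toy_relevance(query, text):
    """Hypothetical stand-in for a reranker: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)

def rerank_and_compress(query, candidates, top_n):
    """Re-score every first-stage candidate, then keep only the top_n best."""
    reranked = sorted(candidates, key=lambda text: toy_relevance(query, text), reverse=True)
    return reranked[:top_n]

# Pretend these came back from a first-stage retriever (hypothetical data)
first_stage_results = [
    "great soundtrack and lighting",
    "the action scenes are incredible",
    "keanu reeves action scenes are the best action scenes",
    "i watched it on a plane",
]

compressed = rerank_and_compress("best action scenes", first_stage_results, top_n=2)
print(compressed)
```

The first stage is cheap and casts a wide net, while the second stage is more expensive per document but only has to look at a handful of candidates.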

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [19]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [20]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [21]:
contextual_compression_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick based on the reviews provided.'

In [22]:
contextual_compression_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. Here is the URL to that review:\n- Review: A Masterpiece & Brilliant Sequel\n- URL: /review/rw4854296/?ref_=tt_urv'

In [23]:
contextual_compression_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, the protagonist, John Wick, is coerced back into the world of assassins by a mobster who blows up his house after Wick refuses to help. Wick is then tasked with killing the mobster's sister in Rome, leading to a series of events where Wick becomes the target of multiple contract killers. Eventually, Wick seeks revenge on the mobster who betrayed him."

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
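
Here's a small sketch of those three steps, assuming we already have some way to generate queries and some way to retrieve. Both `fake_generate_queries` and `fake_retrieve` are hypothetical stand-ins - in the real retriever an LLM writes the new queries and a vector store answers them.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for every generated query and keep each document once, in first-seen order."""
    unique_docs = []
    seen = set()
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

# Hypothetical stand-in for the LLM that rewrites the user's question
def fake_generate_queries(question):
    return [
        "was john wick well received",
        "john wick audience reaction",
        "john wick review sentiment",
    ]

# Hypothetical stand-in for a retriever: a fixed lookup table of results
FAKE_INDEX = {
    "was john wick well received": ["review_a", "review_b"],
    "john wick audience reaction": ["review_b", "review_c"],
    "john wick review sentiment": ["review_a", "review_d"],
}

def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])

merged = multi_query_retrieve(
    "Did people generally like John Wick?", fake_generate_queries, fake_retrieve
)
print(merged)
```

Because each rewritten query can pull in documents the others missed, the final context is usually larger than what a single query would return.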

So, how hard is it to set up? Not bad! Let's see it down below!



In [24]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [25]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [26]:
multi_query_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Yes, people generally liked John Wick based on the reviews provided. Many reviewers praised the film for its action sequences, slickness, brutality, and Keanu Reeves' performance. Some described it as the best action film of the year and one of the best in the past decade. It was seen as a stylish, violent, and entertaining movie that exceeded expectations."

In [27]:
multi_query_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"Yes, there is a review with a rating of 10. Here is the URL to that review: '/review/rw4854296/?ref_=tt_urv'"

In [28]:
multi_query_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, a retired assassin named John Wick seeks revenge when his dog is killed and his car is stolen, leading to a lot of carnage. He is forced back into the world of assassins when he must help take over the Assassin's Guild. Wick travels to Italy, Canada, and Manhattan, killing numerous assassins along the way."

## Task 8: Parent Document Retriever

The Parent Document Retriever follows a "small-to-big" strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent documents".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
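
A toy version of "search small, return big" might look like the sketch below. Everything here is a simplifying assumption: we split on words instead of characters, we use a made-up word-overlap score instead of embeddings, and the "stores" are just a dict and a list.

```python
def split_words(text, words_per_chunk):
    """Split text into consecutive chunks of at most words_per_chunk words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

def word_overlap(query, text):
    """Toy relevance score: number of query words that appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

# "Docstore": parent id -> full parent text (hypothetical toy data)
toy_docstore = {
    "p1": "john wick is a retired hitman who returns after gangsters kill his dog",
    "p2": "the soundtrack is loud and the lighting is full of neon colors",
}

# "Vector store": (child chunk, parent id) pairs
toy_child_index = [
    (chunk, parent_id)
    for parent_id, text in toy_docstore.items()
    for chunk in split_words(text, words_per_chunk=4)
]

def small_to_big(query, child_index, docstore, k):
    """Search over the child chunks, then return each matching parent once."""
    best_children = sorted(
        child_index, key=lambda pair: word_overlap(query, pair[0]), reverse=True
    )[:k]
    parent_ids = []
    for _, parent_id in best_children:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[parent_id] for parent_id in parent_ids]

toy_parents = small_to_big("neon lighting", toy_child_index, toy_docstore, k=2)
print(toy_parents)
```

Both of the best-matching child chunks belong to the same parent here, so we get that one full parent document back rather than two fragments.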

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [29]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [30]:
client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
    collection_name="full_documents", embeddings=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [31]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [32]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [33]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [34]:
parent_document_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, generally people liked John Wick.'

In [35]:
parent_document_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. Here is the URL to that review: /review/rw4854296/?ref_=tt_urv.'

In [36]:
parent_document_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, a retired assassin named John Wick is forced back into action when someone steals his car, leading to a lot of violence and carnage. He also helps Ian McShane take over the Assassin's Guild by traveling to Italy, Canada, and Manhattan to kill many other assassins along the way."

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.
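
To see how weights and ranks interact, here's a minimal sketch of weighted Reciprocal Rank Fusion. The function name and document IDs are hypothetical, and the constant `c=60` is an assumption taken from the RRF paper linked above - treat this as an illustration of the idea rather than the exact code LangChain runs.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings: each document earns weight / (rank + c) from every list it appears in."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from two different retrievers
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([keyword_ranking, vector_ranking], weights=[0.5, 0.5])
print(fused)
```

Documents that rank well in *several* lists float to the top, even if no single retriever put them first. Only ranks are used, so we never need to compare raw scores across retrievers.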

In [37]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We've packed *all* of these retrievers together in an ensemble, so let's build our chain around it.

In [38]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [39]:
ensemble_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick based on the reviews provided in the context.'

In [40]:
ensemble_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". Here is the URL to that review:\nhttps://www.imdb.com/review/rw4854296/?ref_=tt_urv'

In [41]:
ensemble_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In the movie "John Wick," an ex-hitman comes out of retirement to seek vengeance against the gangsters who killed his dog and took everything from him. He becomes entangled in a web of violence and destruction as he becomes the target of hitmen and bounty hunters. The movie is filled with intense action, shootouts, and thrilling fights as John Wick seeks retribution.'

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
   - `percentile`
   - `standard_deviation`
   - `interquartile`
   - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

> NOTE: You do not need to run this cell if you're running this locally

In [None]:
#!pip install -qU langchain_experimental

We'll use the `percentile` thresholding method for this example. It calculates the distances between consecutive sentences, then splits the text wherever a distance exceeds a given percentile of all distances.
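
Here's a rough sketch of percentile-based splitting, assuming the distances between neighbouring sentences have already been computed. The sentences, the distance values, and the 75th-percentile cut-off are all hypothetical - the real `SemanticChunker` derives its distances from embeddings and has its own defaults.

```python
import math

def percentile(values, pct):
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100.0
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

def chunk_by_percentile(sentences, distances, pct):
    """Split wherever distances[i] (between sentences[i] and sentences[i + 1]) exceeds the percentile."""
    threshold = percentile(distances, pct)
    chunks = []
    current = [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Big semantic jump: close the current chunk and start a new one
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

toy_sentences = [
    "John Wick is an action movie.",
    "Keanu Reeves plays the lead.",
    "The fights are tightly choreographed.",
    "I saw it in a crowded theater.",
    "The popcorn was stale.",
]
# Hypothetical distances between neighbouring sentences (one fewer than sentences)
toy_distances = [0.1, 0.2, 0.9, 0.15]

toy_chunks = chunk_by_percentile(toy_sentences, toy_distances, pct=75)
print(toy_chunks)
```

A higher percentile means fewer, larger chunks, while a lower one splits more aggressively.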

In [42]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [43]:
semantic_documents = semantic_chunker.split_documents(documents)

Let's create a new vector store.

In [44]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWickSemantic"
)

We'll use naive retrieval for this example.

In [45]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [46]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [47]:
semantic_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Overall, people generally liked John Wick based on the reviews provided.'

In [48]:
semantic_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". Here is the URL to that review: \'/review/rw4854296/?ref_=tt_urv\'.'

In [49]:
semantic_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, the main character John Wick seeks revenge on the people who killed his dog and took something he loved from him. It's a simple premise for an action movie that delivers awesome action, stylish stunts, kinetic chaos, and a relatable hero."

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
   - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever-specific* Ragas metrics
   - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.
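
If you'd also like a quick local sanity check on latency, a tiny timing helper like the sketch below can work. `timed_call` is a hypothetical helper name of our own, and wall-clock timing of a single call is noisy, so treat the numbers as a rough guide only.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) measured with a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Toy usage with a built-in function standing in for a chain call
total, seconds = timed_call(sum, [1, 2, 3])
print(total, f"{seconds:.6f}s")
```

In the notebook you could pass a chain's `invoke` method and its input dictionary in the same way, then average over several questions.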

#### Wall of Imports

RAG-relevant metrics include: Context Precision, Context Recall, Context Entities Recall, Noise Sensitivity, Response Relevancy, and Faithfulness
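
To build some intuition for the two context metrics, here's a simplified sketch that assumes we already know the relevance verdicts. This reflects our understanding of the general idea, not Ragas' actual implementation: the Ragas metrics imported below obtain their verdicts from an LLM judge, so check the Ragas documentation for the exact definitions.

```python
def context_precision(relevance_flags):
    """Average of precision@k over the ranks k where a relevant context appears.

    relevance_flags[i] is True if the context retrieved at rank i + 1 was judged relevant.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def context_recall(claims_supported):
    """Fraction of reference-answer claims that the retrieved context supports."""
    if not claims_supported:
        return 0.0
    return sum(claims_supported) / len(claims_supported)

# Hypothetical verdicts for one question
precision_example = context_precision([True, False, True])
recall_example = context_recall([True, True, False, True])
print(precision_example, recall_example)
```

Precision rewards putting the relevant contexts near the top of the list, while recall asks whether the retrieved contexts cover everything the reference answer needs.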

In [67]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import EvaluationDataset, RunConfig, evaluate
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    LLMContextPrecisionWithoutReference,
    LLMContextRecall,
    Faithfulness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity,
)
from ragas.testset import TestsetGenerator

#### Test Data Generation

In [51]:
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(documents, testset_size=10)
dataset.to_pandas()

Applying SummaryExtractor:   0%|          | 0/44 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/100 [00:00<?, ?it/s]

Node 40b202eb-8a31-43e7-b211-7b0a9d5bea61 does not have a summary. Skipping filtering.
Node c8b7ad1f-e40c-4eb8-9631-c769477b239c does not have a summary. Skipping filtering.
Node e62d0b5e-f377-408d-8b6d-2ca9d8d9e641 does not have a summary. Skipping filtering.
Node 17d1e661-ce7e-4ec4-bcb1-3478c1c83da1 does not have a summary. Skipping filtering.
Node cf4b9820-ba03-4943-b54b-2ea781c89fac does not have a summary. Skipping filtering.
Node c3e81d30-f249-4382-8c73-c0dc0aa23a92 does not have a summary. Skipping filtering.
Node b35a1873-d7a0-4ae6-8228-cc6ea99a2579 does not have a summary. Skipping filtering.
Node 32c11c40-562d-472f-ad98-962c03ec8175 does not have a summary. Skipping filtering.
Node aa9a4e6b-0815-4baa-a0e6-b36724c46b65 does not have a summary. Skipping filtering.
Node 8175c664-8da6-4df9-a110-625a6c330cb7 does not have a summary. Skipping filtering.
Node 79df99b1-7d80-40ac-9f91-69700c17fd3b does not have a summary. Skipping filtering.
Node fd071b55-7e4d-4cc7-ba39-ad3aa1d55df9 d

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/244 [00:00<?, ?it/s]

Applying OverlapScoreBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | How does the film 'John Wick' compare to 'Take... | `[: 0\nReview: The best way I can describe John...` | The film 'John Wick' can be described as simil... | single_hop_specifc_query_synthesizer |
| 1 | Wht is the poplarity of Jon Wick films? | `[: 2\nReview: With the fourth installment scor...` | The fourth installment of John Wick is scoring... | single_hop_specifc_query_synthesizer |
| 2 | What distinguishes Chad Stahelski's direction ... | `[: 3\nReview: John wick has a very simple reve...` | Chad Stahelski's direction in John Wick is dis... | single_hop_specifc_query_synthesizer |
| 3 | How do Russian mobsters influence the plot of ... | `[: 4\nReview: Though he no longer has a taste ...` | The Russian mobsters are responsible for attac... | single_hop_specifc_query_synthesizer |
| 4 | What John Wick do in first movie? | `[: 5\nReview: Ultra-violent first entry with l...` | In the original John Wick (2014), an ex-hit-ma... | single_hop_specifc_query_synthesizer |
| 5 | What makes John Wick 3 stand out in terms of a... | `[<1-hop>\n\n: 19\nReview: The inevitable third...` | John Wick 3 stands out in terms of action sequ... | multi_hop_specific_query_synthesizer |
| 6 | How John Wick Chapter 2 and John Wick: Chapter... | `[<1-hop>\n\n: 19\nReview: John Wick: Chapter 4...` | John Wick Chapter 2 is described as relentless... | multi_hop_specific_query_synthesizer |
| 7 | How does the reception of John Wick 4 compare ... | `[<1-hop>\n\n: 19\nReview: The inevitable third...` | The reception of John Wick 4 appears to be mix... | multi_hop_specific_query_synthesizer |
| 8 | How does the narrative coherence and character... | `[<1-hop>\n\n: 11\nReview: JOHN WICK is a rare ...` | The narrative coherence and character developm... | multi_hop_specific_query_synthesizer |
| 9 | How does John Wick: Chapter 3 - Parabellum mai... | `[<1-hop>\n\n: 13\nReview: Following on from tw...` | John Wick: Chapter 3 - Parabellum maintains it... | multi_hop_specific_query_synthesizer |


In [53]:
df = dataset.to_pandas()
df.to_csv('testdataset.csv', index=False)

#### Naive RAG Evaluation

In [68]:
for test_row in dataset:
  response = naive_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

# Rebuild the evaluation dataset from the rows updated in the loop above.
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextPrecisionWithoutReference(), LLMContextRecall(), Faithfulness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result


Exception raised in Job[5]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[47]: TimeoutError()


{'llm_context_precision_without_reference': 0.6183, 'context_recall': 0.7833, 'faithfulness': 0.9405, 'answer_relevancy': 0.7637, 'context_entity_recall': 0.5333, 'noise_sensitivity_relevant': 0.4585}
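The evaluation cells in this section all repeat the same loop: invoke a chain on each test question, then store the answer and the retrieved contexts on the sample. A small helper could factor that out. This is only a sketch, assuming each chain returns a dict with a `response` message and a `context` list of documents, as the cell above does. `attach_responses`, `StubChain`, and the stand-in rows are hypothetical names used so the example runs without any API calls.

```python
from types import SimpleNamespace


def attach_responses(dataset, chain):
    """Run `chain` on every test row and store its answer and contexts in place."""
    for test_row in dataset:
        response = chain.invoke({"question": test_row.eval_sample.user_input})
        test_row.eval_sample.response = response["response"].content
        test_row.eval_sample.retrieved_contexts = [
            context.page_content for context in response["context"]
        ]
    return dataset


class StubChain:
    """Hypothetical stand-in for a retrieval chain, so the sketch runs offline."""

    def invoke(self, inputs):
        doc = SimpleNamespace(page_content="Review: a stub document")
        message = SimpleNamespace(content="stub answer to: " + inputs["question"])
        return {"response": message, "context": [doc]}


# Stand-in rows that mimic the `eval_sample` attributes used in the loop above.
rows = [
    SimpleNamespace(
        eval_sample=SimpleNamespace(user_input="What John Wick do in first movie?")
    )
]
attach_responses(rows, StubChain())
print(rows[0].eval_sample.response)
print(rows[0].eval_sample.retrieved_contexts)
```

With the real objects, the same call would presumably look like `attach_responses(dataset, bm25_retrieval_chain)`, followed by the `EvaluationDataset.from_pandas(...)` and `evaluate(...)` steps shown in the cells.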

#### BM25 Evaluation

In [69]:
for test_row in dataset:
  response = bm25_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

# Rebuild the evaluation dataset from the rows updated in the loop above.
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextPrecisionWithoutReference(), LLMContextRecall(), Faithfulness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

{'llm_context_precision_without_reference': 0.3833, 'context_recall': 0.4133, 'faithfulness': 0.8000, 'answer_relevancy': 0.3933, 'context_entity_recall': 0.4100, 'noise_sensitivity_relevant': 0.4125}

#### Contextual Compression Evaluation

In [70]:
for test_row in dataset:
  response = contextual_compression_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

# Rebuild the evaluation dataset from the rows updated in the loop above.
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextPrecisionWithoutReference(), LLMContextRecall(), Faithfulness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

{'llm_context_precision_without_reference': 0.8000, 'context_recall': 0.6700, 'faithfulness': 0.8564, 'answer_relevancy': 0.7654, 'context_entity_recall': 0.6517, 'noise_sensitivity_relevant': 0.1833}

#### Multi-Query Evaluation

In [71]:
for test_row in dataset:
  response = multi_query_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

# Rebuild the evaluation dataset from the rows updated in the loop above.
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextPrecisionWithoutReference(), LLMContextRecall(), Faithfulness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Exception raised in Job[5]: TimeoutError()
Exception raised in Job[29]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[47]: TimeoutError()
Exception raised in Job[59]: TimeoutError()


{'llm_context_precision_without_reference': 0.5256, 'context_recall': 0.9167, 'faithfulness': 0.9033, 'answer_relevancy': 0.7632, 'context_entity_recall': 0.5083, 'noise_sensitivity_relevant': 0.4167}

#### Parent Document Evaluation

In [72]:
for test_row in dataset:
  response = parent_document_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

# Rebuild the evaluation dataset from the rows updated in the loop above.
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextPrecisionWithoutReference(), LLMContextRecall(), Faithfulness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

{'llm_context_precision_without_reference': 0.7000, 'context_recall': 0.5517, 'faithfulness': 0.6978, 'answer_relevancy': 0.8595, 'context_entity_recall': 0.5450, 'noise_sensitivity_relevant': 0.0892}

#### Ensemble Evaluation

In [73]:
for test_row in dataset:
  response = ensemble_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

# Rebuild the evaluation dataset from the rows updated in the loop above.
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextPrecisionWithoutReference(), LLMContextRecall(), Faithfulness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Exception raised in Job[5]: TimeoutError()
Exception raised in Job[11]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[47]: TimeoutError()


{'llm_context_precision_without_reference': 0.6339, 'context_recall': 0.9667, 'faithfulness': 0.7665, 'answer_relevancy': 0.8519, 'context_entity_recall': 0.6333, 'noise_sensitivity_relevant': 0.3600}
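To read the six result dictionaries side by side, it can help to pick the best retriever per metric. The sketch below copies the scores printed in this section. It assumes higher is better for every metric except `noise_sensitivity_relevant`, where lower is better. `best_per_metric` is a hypothetical helper, not part of Ragas.

```python
# Scores copied from the result dictionaries printed in this section.
METRICS = [
    "llm_context_precision_without_reference",
    "context_recall",
    "faithfulness",
    "answer_relevancy",
    "context_entity_recall",
    "noise_sensitivity_relevant",
]
RAW_SCORES = {
    "Naive": [0.6183, 0.7833, 0.9405, 0.7637, 0.5333, 0.4585],
    "BM25": [0.3833, 0.4133, 0.8000, 0.3933, 0.4100, 0.4125],
    "Contextual Compression": [0.8000, 0.6700, 0.8564, 0.7654, 0.6517, 0.1833],
    "Multi-Query": [0.5256, 0.9167, 0.9033, 0.7632, 0.5083, 0.4167],
    "Parent Document": [0.7000, 0.5517, 0.6978, 0.8595, 0.5450, 0.0892],
    "Ensemble": [0.6339, 0.9667, 0.7665, 0.8519, 0.6333, 0.3600],
}
results = {name: dict(zip(METRICS, scores)) for name, scores in RAW_SCORES.items()}

# Assumption: every metric is "higher is better" except noise sensitivity.
LOWER_IS_BETTER = {"noise_sensitivity_relevant"}


def best_per_metric(results):
    """Return {metric: retriever name} for the best score on each metric."""
    best = {}
    for metric in METRICS:
        pick = min if metric in LOWER_IS_BETTER else max
        best[metric] = pick(results, key=lambda name: results[name][metric])
    return best


best = best_per_metric(results)
for metric, name in best.items():
    print(f"{metric:42s} {name:24s} {results[name][metric]:.4f}")
```

These numbers come from only 10 synthetic questions, and several runs logged `TimeoutError` for individual jobs, so small differences between retrievers should be treated with caution.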

#### Create LangSmith Dataset
A LangSmith dataset makes it easier to compare metrics across the retrieval pipelines.

In [55]:
from langsmith import Client

langsmith_client = Client()

dataset_name = "Advanced RAG"

langsmith_dataset = langsmith_client.create_dataset(
    dataset_name=dataset_name,
    description=dataset_name
)

In [56]:
# iterrows() yields (index, row) pairs; only the row is needed here.
for _, data_row in dataset.to_pandas().iterrows():
  langsmith_client.create_example(
      inputs={
          "question": data_row["user_input"]
      },
      outputs={
          "answer": data_row["reference"]
      },
      metadata={
          "context": data_row["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )

#### Configure Evaluation Model

In [57]:
eval_llm = ChatOpenAI(model="gpt-4o")

#### Define LangSmith Evaluators
We only need one evaluator here, since this notebook relies mainly on Ragas for its evaluations.

In [58]:
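# Note: this import rebinds `evaluate`, which was imported from ragas earlier, to LangSmith's `evaluate`.
# Re-import `evaluate` from ragas before re-running any of the Ragas evaluation cells.
from langsmith.evaluation import LangChainStringEvaluator, evaluate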

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

#### Naive RAG LangSmith Evaluation

In [59]:
evaluate(
    naive_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'elderly-condition-45' at:
https://smith.langchain.com/o/dda31471-d9b4-4423-9065-04bbf7ece062/datasets/4e3bb4ac-212c-44a7-9972-b1dd7fb0b3c0/compare?selectedSessions=1dc98b4e-59e8-4a98-ad69-408c41977cf3




Error running evaluator <DynamicRunEvaluator evaluate> on run 8b4e9537-744c-42c5-bbb1-2410fcb11c88: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

Unnamed: 0,inputs.question,outputs.response,outputs.context,error,reference.answer,feedback.wrapper,execution_time,example_id,id
0,How does John Wick: Chapter 3 - Parabellum mai...,content='To maintain its unique appeal while d...,[page_content=': 24\nReview: John Wick: Chapte...,,John Wick: Chapter 3 - Parabellum maintains it...,,1.534019,cee8e8d5-f8cb-4cbd-be93-92d47cd69559,8b4e9537-744c-42c5-bbb1-2410fcb11c88
1,How does the narrative coherence and character...,"content=""I don't know."" additional_kwargs={'re...",[page_content=': 19\nReview: The inevitable th...,,The narrative coherence and character developm...,,1.022953,968140bc-8ade-44f3-9fb9-c5db51a7231d,eebd86c9-d756-45e1-8af4-49378dc4b0cd
2,How does the reception of John Wick 4 compare ...,"content='Based on the reviews provided, John W...",[page_content=': 19\nReview: John Wick: Chapte...,,The reception of John Wick 4 appears to be mix...,,1.419759,e3f2020b-816b-4353-a457-75df03eb847f,05b710f5-2644-4580-be36-78cd9199b4ec
3,How John Wick Chapter 2 and John Wick: Chapter...,"content=""I don't have specific information com...",[page_content=': 19\nReview: John Wick: Chapte...,,John Wick Chapter 2 is described as relentless...,,1.266918,7321dad7-d531-419b-a310-4d97440d454e,eab48d70-b002-4381-b4b5-c362bb0b9233
4,What makes John Wick 3 stand out in terms of a...,"content=""John Wick 3 stands out in terms of ac...",[page_content=': 3\nReview: John wick has a ve...,,John Wick 3 stands out in terms of action sequ...,,1.503118,428145ee-3e0d-4322-b7bc-f0f0df7cd735,88b6c0eb-76c6-4b32-8c95-eb01f9ea1fe9
5,What John Wick do in first movie?,"content='In the first John Wick movie, John Wi...",[page_content=': 5\nReview: Ultra-violent firs...,,"In the original John Wick (2014), an ex-hit-ma...",,0.865102,7e8f48db-1869-4bbd-aa76-53c66f548378,9b9994a7-addb-4d3d-90f1-396f9ea50170
6,How do Russian mobsters influence the plot of ...,content='Russian mobsters influence the plot o...,[page_content=': 18\nReview: When the story be...,,The Russian mobsters are responsible for attac...,,0.975777,0ee076bc-3637-4aa1-9503-f28698333c17,70097608-90dc-4c51-991f-6aafd2b42f7c
7,What distinguishes Chad Stahelski's direction ...,"content=""Chad Stahelski's direction in John Wi...",[page_content=': 3\nReview: John wick has a ve...,,Chad Stahelski's direction in John Wick is dis...,,1.16376,8947bfc1-e67d-48d3-ae98-dfe1410c911a,e66b3af8-8543-43ac-ad98-0667b2b637bd
8,Wht is the poplarity of Jon Wick films?,content='The John Wick films have been consist...,"[page_content=': 9\nReview: At first glance, J...",,The fourth installment of John Wick is scoring...,,2.178945,69319b0b-968f-488b-94c2-808351d509a2,db45139d-8a10-4a38-8578-a7e233dbc673
9,How does the film 'John Wick' compare to 'Take...,"content=""In terms of plot, both 'John Wick' an...",[page_content=': 0\nReview: The best way I can...,,The film 'John Wick' can be described as simil...,,2.650874,94ff1162-6c93-4901-bbe2-cf5e2948db4c,6c7776cd-61ce-400d-9e89-9e92379973f8
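In the table above the `feedback.wrapper` column is empty and the evaluator logged a `ValueError`, so no QA grade was recorded. The error message is truncated, so the exact cause is not shown. One plausible cause is that the chain returns two output keys (`response` and `context`) while the `"qa"` evaluator expects a single prediction. A mapping function could resolve that ambiguity. The sketch below is an assumption-laden illustration, not a confirmed fix: `prepare_qa_data` is a hypothetical name, and the stand-in objects only mimic the `outputs` and `inputs` attributes of a run and an example.

```python
from types import SimpleNamespace


def prepare_qa_data(run, example):
    """Map a run/example pair onto single prediction, reference, and input values."""
    response = run.outputs["response"]
    # The chain returns a chat message; also handle dict-serialized and plain-string outputs.
    if hasattr(response, "content"):
        prediction = response.content
    elif isinstance(response, dict) and "content" in response:
        prediction = response["content"]
    else:
        prediction = str(response)
    return {
        "prediction": prediction,
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }


# Hypothetical stand-ins so the sketch runs without LangSmith or an API key.
fake_run = SimpleNamespace(
    outputs={"response": SimpleNamespace(content="A stub answer."), "context": []}
)
fake_example = SimpleNamespace(
    inputs={"question": "What John Wick do in first movie?"},
    outputs={"answer": "A stub reference answer."},
)
prepared = prepare_qa_data(fake_run, fake_example)
print(prepared)
```

If this is indeed the cause, the function could be passed when the evaluator is built, for example `LangChainStringEvaluator("qa", config={"llm": eval_llm}, prepare_data=prepare_qa_data)`. It is worth checking the installed LangSmith version's documentation for the `prepare_data` argument before relying on this.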


#### BM25 LangSmith Evaluation

In [60]:
evaluate(
    bm25_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'stupendous-energy-43' at:
https://smith.langchain.com/o/dda31471-d9b4-4423-9065-04bbf7ece062/datasets/4e3bb4ac-212c-44a7-9972-b1dd7fb0b3c0/compare?selectedSessions=955131e8-c661-4e27-adf7-4319a98bd9fc




Error running evaluator <DynamicRunEvaluator evaluate> on run d8376b50-60b1-4b4d-89e5-e74037f74c07: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

Unnamed: 0,inputs.question,outputs.response,outputs.context,error,reference.answer,feedback.wrapper,execution_time,example_id,id
0,How does John Wick: Chapter 3 - Parabellum mai...,"content=""John Wick: Chapter 3 - Parabellum mai...",[page_content=': 24\nReview: John Wick: Chapte...,,John Wick: Chapter 3 - Parabellum maintains it...,,1.196603,cee8e8d5-f8cb-4cbd-be93-92d47cd69559,d8376b50-60b1-4b4d-89e5-e74037f74c07
1,How does the narrative coherence and character...,"content=""I don't know the specifics of how nar...",[page_content=': 19\nReview: The inevitable th...,,The narrative coherence and character developm...,,0.653177,968140bc-8ade-44f3-9fb9-c5db51a7231d,90b2a6a6-fe80-4924-ac51-7df0a6423c53
2,How does the reception of John Wick 4 compare ...,"content=""Based on the reviews provided, it see...",[page_content=': 2\nReview: The first three Jo...,,The reception of John Wick 4 appears to be mix...,,1.192218,e3f2020b-816b-4353-a457-75df03eb847f,374adad3-8a7b-4baf-a27e-ef2b82932cc5
3,How John Wick Chapter 2 and John Wick: Chapter...,content='In terms of action and character deve...,[page_content=': 16\nReview: John Wick Chapter...,,John Wick Chapter 2 is described as relentless...,,1.919266,7321dad7-d531-419b-a310-4d97440d454e,eb14b4f8-f159-4b4e-8d46-bfea983e5add
4,What makes John Wick 3 stand out in terms of a...,content='John Wick 3 stands out in terms of ac...,[page_content=': 22\nReview: Lets contemplate ...,,John Wick 3 stands out in terms of action sequ...,,0.686215,428145ee-3e0d-4322-b7bc-f0f0df7cd735,20c33e8d-8748-4019-9a64-46e154579bd3
5,What John Wick do in first movie?,"content=""I'm sorry, but the context provided d...",[page_content=': 11\nReview: Who needs a 2hr a...,,"In the original John Wick (2014), an ex-hit-ma...",,0.51904,7e8f48db-1869-4bbd-aa76-53c66f548378,85867c2b-2966-4f89-8009-8cbb71af49a9
6,How do Russian mobsters influence the plot of ...,content='I don\'t know the specific details ab...,[page_content=': 11\nReview: Who needs a 2hr a...,,The Russian mobsters are responsible for attac...,,0.532849,0ee076bc-3637-4aa1-9503-f28698333c17,70193206-e95a-4c9a-80b5-0ba35c3e1260
7,What distinguishes Chad Stahelski's direction ...,"content=""Chad Stahelski's direction in John Wi...",[page_content=': 19\nReview: John Wick: Chapte...,,Chad Stahelski's direction in John Wick is dis...,,0.754211,8947bfc1-e67d-48d3-ae98-dfe1410c911a,3b08298b-79e4-469f-b8ed-67b682e4366b
8,Wht is the poplarity of Jon Wick films?,"content=""I don't know."" additional_kwargs={'re...",[page_content=': 19\nReview: The inevitable th...,,The fourth installment of John Wick is scoring...,,0.436681,69319b0b-968f-488b-94c2-808351d509a2,dd4fdd7b-4dcd-433b-9a80-525589529437
9,How does the film 'John Wick' compare to 'Take...,"content=""I'm sorry, I don't have the specific ...",[page_content=': 20\nReview: In a world where ...,,The film 'John Wick' can be described as simil...,,0.802305,94ff1162-6c93-4901-bbe2-cf5e2948db4c,9be22599-b1c4-423c-a05b-92f13ccdaf68


#### Contextual Compression LangSmith Evaluation

In [61]:
evaluate(
    contextual_compression_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'abandoned-process-81' at:
https://smith.langchain.com/o/dda31471-d9b4-4423-9065-04bbf7ece062/datasets/4e3bb4ac-212c-44a7-9972-b1dd7fb0b3c0/compare?selectedSessions=24104c65-67f2-4687-ad60-3ad68f28b347




Error running evaluator <DynamicRunEvaluator evaluate> on run 8b5cd23a-dfbf-4609-a8f4-432e8f1652a3: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

Unnamed: 0,inputs.question,outputs.response,outputs.context,error,reference.answer,feedback.wrapper,execution_time,example_id,id
0,How does John Wick: Chapter 3 - Parabellum mai...,"content=""John Wick: Chapter 3 - Parabellum mai...",[page_content=': 24\nReview: John Wick: Chapte...,,John Wick: Chapter 3 - Parabellum maintains it...,,1.75735,cee8e8d5-f8cb-4cbd-be93-92d47cd69559,8b5cd23a-dfbf-4609-a8f4-432e8f1652a3
1,How does the narrative coherence and character...,"content=""I don't have specific information rel...",[page_content=': 19\nReview: The inevitable th...,,The narrative coherence and character developm...,,1.46111,968140bc-8ade-44f3-9fb9-c5db51a7231d,d7489ee6-86c1-4a37-932d-d3d391e69dbc
2,How does the reception of John Wick 4 compare ...,"content=""Based on the reviews provided, John W...",[page_content=': 20\nReview: In a world where ...,,The reception of John Wick 4 appears to be mix...,,1.332468,e3f2020b-816b-4353-a457-75df03eb847f,a4559539-0989-437b-af0c-3e333bc72071
3,How John Wick Chapter 2 and John Wick: Chapter...,"content=""I'm sorry, I don't have specific info...",[page_content=': 19\nReview: John Wick: Chapte...,,John Wick Chapter 2 is described as relentless...,,1.66997,7321dad7-d531-419b-a310-4d97440d454e,83e34d02-8101-4b44-a1ba-f3452f72f65c
4,What makes John Wick 3 stand out in terms of a...,content='John Wick 3 stands out in terms of ac...,[page_content=': 16\nReview: John Wick 3 is wi...,,John Wick 3 stands out in terms of action sequ...,,1.099758,428145ee-3e0d-4322-b7bc-f0f0df7cd735,9f2c15a9-8fd8-4127-b1c9-8349a42ca384
5,What John Wick do in first movie?,"content='In the first John Wick movie, John Wi...",[page_content=': 5\nReview: Ultra-violent firs...,,"In the original John Wick (2014), an ex-hit-ma...",,1.075636,7e8f48db-1869-4bbd-aa76-53c66f548378,c92d60aa-db65-4b2c-9c21-78d8ab5bdf53
6,How do Russian mobsters influence the plot of ...,content='The Russian mobsters influence the pl...,[page_content=': 20\nReview: After resolving h...,,The Russian mobsters are responsible for attac...,,3.394296,0ee076bc-3637-4aa1-9503-f28698333c17,3c019838-e3bd-4dd6-8744-68ea96780839
7,What distinguishes Chad Stahelski's direction ...,"content=""Chad Stahelski's direction in John Wi...",[page_content=': 3\nReview: John wick has a ve...,,Chad Stahelski's direction in John Wick is dis...,,1.275004,8947bfc1-e67d-48d3-ae98-dfe1410c911a,3cfd546b-040c-4bfe-85d6-83488df6c864
8,Wht is the poplarity of Jon Wick films?,"content='I am sorry, but I do not have specifi...",[page_content=': 11\nReview: JOHN WICK is a ra...,,The fourth installment of John Wick is scoring...,,0.943287,69319b0b-968f-488b-94c2-808351d509a2,8faa8ad6-19f1-428b-8ea8-3c190c38a755
9,How does the film 'John Wick' compare to 'Take...,"content=""Based on the provided context, both '...",[page_content=': 11\nReview: JOHN WICK is a ra...,,The film 'John Wick' can be described as simil...,,1.929755,94ff1162-6c93-4901-bbe2-cf5e2948db4c,9f81bee8-6ebc-4fb6-9ae2-62ee21353e6c


#### Multi-Query LangSmith Evaluation

In [62]:
evaluate(
    multi_query_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'large-rabbit-43' at:
https://smith.langchain.com/o/dda31471-d9b4-4423-9065-04bbf7ece062/datasets/4e3bb4ac-212c-44a7-9972-b1dd7fb0b3c0/compare?selectedSessions=ee6ca056-06b5-44cf-a205-8319d4a30e7d




Error running evaluator <DynamicRunEvaluator evaluate> on run 5c71f072-3fd5-40ee-bbce-d705627ed7c5: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

Unnamed: 0,inputs.question,outputs.response,outputs.context,error,reference.answer,feedback.wrapper,execution_time,example_id,id
0,How does John Wick: Chapter 3 - Parabellum mai...,"content=""John Wick: Chapter 3 - Parabellum mai...",[page_content=': 24\nReview: John Wick: Chapte...,,John Wick: Chapter 3 - Parabellum maintains it...,,3.378689,cee8e8d5-f8cb-4cbd-be93-92d47cd69559,5c71f072-3fd5-40ee-bbce-d705627ed7c5
1,How does the narrative coherence and character...,"content=""I don't know the specific details of ...",[page_content=': 19\nReview: The inevitable th...,,The narrative coherence and character developm...,,2.874796,968140bc-8ade-44f3-9fb9-c5db51a7231d,e2713310-b01f-4ffd-a06f-ffeb6168ca38
2,How does the reception of John Wick 4 compare ...,"content='Based on the reviews provided, the re...",[page_content=': 19\nReview: John Wick: Chapte...,,The reception of John Wick 4 appears to be mix...,,3.740345,e3f2020b-816b-4353-a457-75df03eb847f,90f6cff7-60f4-45a9-a3b7-97ca9923612c
3,How John Wick Chapter 2 and John Wick: Chapter...,"content=""I don't have specific details compari...",[page_content=': 19\nReview: John Wick: Chapte...,,John Wick Chapter 2 is described as relentless...,,3.990491,7321dad7-d531-419b-a310-4d97440d454e,163c5359-2054-4214-abef-5dbffd7ed948
4,What makes John Wick 3 stand out in terms of a...,content='John Wick 3 stands out in terms of ac...,[page_content=': 3\nReview: John wick has a ve...,,John Wick 3 stands out in terms of action sequ...,,2.730219,428145ee-3e0d-4322-b7bc-f0f0df7cd735,46af1d3f-b83a-4787-afc3-92cd27b0f751
5,What John Wick do in first movie?,"content='In the first John Wick movie, John Wi...",[page_content=': 19\nReview: If you've seen th...,,"In the original John Wick (2014), an ex-hit-ma...",,5.718881,7e8f48db-1869-4bbd-aa76-53c66f548378,cc43fd85-3488-425c-a975-4524ceb46735
6,How do Russian mobsters influence the plot of ...,content='The Russian mobsters influence the pl...,[page_content=': 18\nReview: When the story be...,,The Russian mobsters are responsible for attac...,,6.780662,0ee076bc-3637-4aa1-9503-f28698333c17,22eb3c4b-af61-4fe3-b543-705315b4d76e
7,What distinguishes Chad Stahelski's direction ...,"content=""Chad Stahelski's direction in John Wi...",[page_content=': 3\nReview: John wick has a ve...,,Chad Stahelski's direction in John Wick is dis...,,2.507542,8947bfc1-e67d-48d3-ae98-dfe1410c911a,6419dd0a-ddf4-4b77-8624-4b47f14bb5c6
8,Wht is the poplarity of Jon Wick films?,content='The John Wick films have been very po...,[page_content=': 20\nReview: In a world where ...,,The fourth installment of John Wick is scoring...,,2.90257,69319b0b-968f-488b-94c2-808351d509a2,dbf96cd3-9f4b-4a07-a079-de0b64d5c4dd
9,How does the film 'John Wick' compare to 'Take...,"content=""In terms of plot and execution, both ...",[page_content=': 0\nReview: The best way I can...,,The film 'John Wick' can be described as simil...,,3.508383,94ff1162-6c93-4901-bbe2-cf5e2948db4c,acaf28b4-4706-4108-a4f5-45dd8d013478
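Since the QA feedback column is empty, the `execution_time` column is the main comparable signal in the LangSmith tables so far. The sketch below averages it per retriever using the values copied from the four tables above. The unit is presumably seconds. With only 10 runs per retriever, this is a rough indication rather than a benchmark.

```python
from statistics import mean

# execution_time values copied from the LangSmith result tables above.
execution_times = {
    "Naive": [1.534019, 1.022953, 1.419759, 1.266918, 1.503118,
              0.865102, 0.975777, 1.16376, 2.178945, 2.650874],
    "BM25": [1.196603, 0.653177, 1.192218, 1.919266, 0.686215,
             0.51904, 0.532849, 0.754211, 0.436681, 0.802305],
    "Contextual Compression": [1.75735, 1.46111, 1.332468, 1.66997, 1.099758,
                               1.075636, 3.394296, 1.275004, 0.943287, 1.929755],
    "Multi-Query": [3.378689, 2.874796, 3.740345, 3.990491, 2.730219,
                    5.718881, 6.780662, 2.507542, 2.90257, 3.508383],
}

# Average per retriever, then sort from fastest to slowest.
mean_times = {name: mean(times) for name, times in execution_times.items()}
ranking = sorted(mean_times, key=mean_times.get)

for name in ranking:
    print(f"{name:24s} {mean_times[name]:.3f}")
```

In these runs the multi-query chain is the slowest on average, which is consistent with it issuing extra LLM and retrieval calls per question.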


#### Parent Document LangSmith Evaluation

In [63]:
evaluate(
    parent_document_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'artistic-fish-1' at:
https://smith.langchain.com/o/dda31471-d9b4-4423-9065-04bbf7ece062/datasets/4e3bb4ac-212c-44a7-9972-b1dd7fb0b3c0/compare?selectedSessions=d4e31f91-048a-4f63-8311-7c08e5016d5d




Error running evaluator <DynamicRunEvaluator evaluate> on run 015c0bd7-30de-4e9e-91be-1569a3246d17: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

Unnamed: 0,inputs.question,outputs.response,outputs.context,error,reference.answer,feedback.wrapper,execution_time,example_id,id
0,How does John Wick: Chapter 3 - Parabellum mai...,"content=""John Wick: Chapter 3 - Parabellum mai...",[page_content=': 24\nReview: John Wick: Chapte...,,John Wick: Chapter 3 - Parabellum maintains it...,,1.370694,cee8e8d5-f8cb-4cbd-be93-92d47cd69559,015c0bd7-30de-4e9e-91be-1569a3246d17
1,How does the narrative coherence and character...,content='The narrative coherence and character...,[page_content=': 19\nReview: The inevitable th...,,The narrative coherence and character developm...,,1.477476,968140bc-8ade-44f3-9fb9-c5db51a7231d,9f7d13d5-e705-4040-a4cb-42c6cdba38cd
2,How does the reception of John Wick 4 compare ...,"content='Based on the reviews provided, the re...",[page_content=': 19\nReview: John Wick: Chapte...,,The reception of John Wick 4 appears to be mix...,,1.205801,e3f2020b-816b-4353-a457-75df03eb847f,cb65b930-acef-4cba-8441-6414f169efa7
3,How John Wick Chapter 2 and John Wick: Chapter...,"content=""Based on the provided context, John W...",[page_content=': 19\nReview: John Wick: Chapte...,,John Wick Chapter 2 is described as relentless...,,2.478003,7321dad7-d531-419b-a310-4d97440d454e,22fb3206-3578-4b59-893f-8e500312fc07
4,What makes John Wick 3 stand out in terms of a...,"content='In John Wick 3, what makes it stand o...",[page_content=': 1\nReview: I'm a fan of the J...,,John Wick 3 stands out in terms of action sequ...,,0.885871,428145ee-3e0d-4322-b7bc-f0f0df7cd735,bff4392c-6351-459c-ab1e-cbea66173f12
5,What John Wick do in first movie?,"content='In the first John Wick movie, John Wi...",[page_content=': 5\nReview: Ultra-violent firs...,,"In the original John Wick (2014), an ex-hit-ma...",,1.127982,7e8f48db-1869-4bbd-aa76-53c66f548378,bf165d6b-9c8f-45e7-b37b-e4e3bc2c22fd
6,How do Russian mobsters influence the plot of ...,"content=""Russian mobsters influence the plot o...",[page_content=': 20\nReview: John Wick is some...,,The Russian mobsters are responsible for attac...,,0.998202,0ee076bc-3637-4aa1-9503-f28698333c17,63e48d3f-6209-4a94-8bf2-83ee51a501eb
7,What distinguishes Chad Stahelski's direction ...,"content=""Chad Stahelski's direction in John Wi...",[page_content=': 18\nReview: Ever since the or...,,Chad Stahelski's direction in John Wick is dis...,,0.990364,8947bfc1-e67d-48d3-ae98-dfe1410c911a,97729ce6-ecb3-4066-beb0-9fe2e89f4d54
8,Wht is the poplarity of Jon Wick films?,content='The John Wick films seem to be quite ...,[page_content=': 20\nReview: In a world where ...,,The fourth installment of John Wick is scoring...,,0.997374,69319b0b-968f-488b-94c2-808351d509a2,7c4f41bc-3325-4ff7-8662-ab4c3df74f72
9,How does the film 'John Wick' compare to 'Take...,"content=""Based on the provided context, 'John ...",[page_content=': 11\nReview: JOHN WICK is a ra...,,The film 'John Wick' can be described as simil...,,1.241361,94ff1162-6c93-4901-bbe2-cf5e2948db4c,3b982289-875c-4b23-b305-f37a3458175a


#### Ensemble LangSmith Evaluation

In [64]:
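# Run the ensemble retrieval chain over every example in the LangSmith dataset
# and grade each response with the QA evaluator.
evaluate(
    ensemble_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)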

View the evaluation results for experiment: 'excellent-copy-13' at:
https://smith.langchain.com/o/dda31471-d9b4-4423-9065-04bbf7ece062/datasets/4e3bb4ac-212c-44a7-9972-b1dd7fb0b3c0/compare?selectedSessions=7972a8e7-fbbb-422b-8812-a9355e310bab




Error running evaluator <DynamicRunEvaluator evaluate> on run fe455257-895a-49b1-a6c0-8660c994f568: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

Unnamed: 0,inputs.question,outputs.response,outputs.context,error,reference.answer,feedback.wrapper,execution_time,example_id,id
0,How does John Wick: Chapter 3 - Parabellum mai...,"content=""John Wick: Chapter 3 - Parabellum mai...",[page_content=': 24\nReview: John Wick: Chapte...,,John Wick: Chapter 3 - Parabellum maintains it...,,3.896029,cee8e8d5-f8cb-4cbd-be93-92d47cd69559,fe455257-895a-49b1-a6c0-8660c994f568
1,How does the narrative coherence and character...,"content=""I don't have the specific information...",[page_content=': 19\nReview: The inevitable th...,,The narrative coherence and character developm...,,3.280386,968140bc-8ade-44f3-9fb9-c5db51a7231d,b1cd79b7-3404-4d20-a298-d588432e15c3
2,How does the reception of John Wick 4 compare ...,"content='Based on the reviews provided, the re...",[page_content=': 19\nReview: John Wick: Chapte...,,The reception of John Wick 4 appears to be mix...,,4.467646,e3f2020b-816b-4353-a457-75df03eb847f,b696dabc-852b-4971-86d5-e231796df42b
3,How John Wick Chapter 2 and John Wick: Chapter...,"content=""I'm sorry, but I don't have enough in...",[page_content=': 19\nReview: John Wick: Chapte...,,John Wick Chapter 2 is described as relentless...,,3.345671,7321dad7-d531-419b-a310-4d97440d454e,ce205817-f40d-4f60-af21-2deffdd90b98
4,What makes John Wick 3 stand out in terms of a...,"content=""In John Wick 3, the action sequences ...",[page_content=': 1\nReview: I'm a fan of the J...,,John Wick 3 stands out in terms of action sequ...,,9.599096,428145ee-3e0d-4322-b7bc-f0f0df7cd735,7318c71c-4206-441b-9314-6366e951a8ff
5,What John Wick do in first movie?,"content='John Wick, in the first movie, is an ...",[page_content=': 5\nReview: Ultra-violent firs...,,"In the original John Wick (2014), an ex-hit-ma...",,3.530078,7e8f48db-1869-4bbd-aa76-53c66f548378,2036a12f-437c-4ee4-9803-06692e7cbe91
6,How do Russian mobsters influence the plot of ...,"content=""Russian mobsters influence the plot o...",[page_content=': 18\nReview: When the story be...,,The Russian mobsters are responsible for attac...,,4.819144,0ee076bc-3637-4aa1-9503-f28698333c17,9b41edf4-9bca-4347-a792-f3e11525059f
7,What distinguishes Chad Stahelski's direction ...,"content=""Chad Stahelski's direction in John Wi...",[page_content=': 3\nReview: John wick has a ve...,,Chad Stahelski's direction in John Wick is dis...,,3.323437,8947bfc1-e67d-48d3-ae98-dfe1410c911a,2b5c03d1-9aee-433a-a6e4-edb287aa75b6
8,Wht is the poplarity of Jon Wick films?,"content='The John Wick films, particularly the...",[page_content=': 20\nReview: In a world where ...,,The fourth installment of John Wick is scoring...,,3.366329,69319b0b-968f-488b-94c2-808351d509a2,845d6b2d-dec2-48be-83ae-7c0400768861
9,How does the film 'John Wick' compare to 'Take...,"content=""The film 'John Wick' has been compare...",[page_content=': 0\nReview: The best way I can...,,The film 'John Wick' can be described as simil...,,3.982251,94ff1162-6c93-4901-bbe2-cf5e2948db4c,f14552af-1cbe-4539-8731-fb2810f69ffd


#### Semantic LangSmith Evaluation

In [65]:
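# Run the semantic-chunking retrieval chain over the same dataset
# and grade each response with the QA evaluator.
evaluate(
    semantic_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)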

View the evaluation results for experiment: 'dependable-test-84' at:
https://smith.langchain.com/o/dda31471-d9b4-4423-9065-04bbf7ece062/datasets/4e3bb4ac-212c-44a7-9972-b1dd7fb0b3c0/compare?selectedSessions=c66c721c-70aa-4b7d-9d02-1039fb5489b0




Error running evaluator <DynamicRunEvaluator evaluate> on run 77522441-55ac-4726-91f3-08393036fbba: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

Unnamed: 0,inputs.question,outputs.response,outputs.context,error,reference.answer,feedback.wrapper,execution_time,example_id,id
0,How does John Wick: Chapter 3 - Parabellum mai...,"content=""John Wick: Chapter 3 - Parabellum mai...",[page_content=': 24\nReview: John Wick: Chapte...,,John Wick: Chapter 3 - Parabellum maintains it...,,1.077527,cee8e8d5-f8cb-4cbd-be93-92d47cd69559,77522441-55ac-4726-91f3-08393036fbba
1,How does the narrative coherence and character...,content='The narrative coherence and character...,[page_content='The story goes deeper and the a...,,The narrative coherence and character developm...,,1.490263,968140bc-8ade-44f3-9fb9-c5db51a7231d,6772f933-8d02-44f9-8b09-4ddd91743c76
2,How does the reception of John Wick 4 compare ...,"content='Based on the reviews provided, the re...",[page_content=': 18\nReview: Ever since the or...,,The reception of John Wick 4 appears to be mix...,,2.002714,e3f2020b-816b-4353-a457-75df03eb847f,2c1f0bec-e292-420e-848d-43075b1f2f7f
3,How John Wick Chapter 2 and John Wick: Chapter...,"content=""I don't know the specific details abo...",[page_content=': 19\nReview: John Wick: Chapte...,,John Wick Chapter 2 is described as relentless...,,1.525434,7321dad7-d531-419b-a310-4d97440d454e,6f9c97ac-0f44-467e-aba3-9ab77d2f3f1e
4,What makes John Wick 3 stand out in terms of a...,content='John Wick 3 stands out in terms of ac...,[page_content='This is EXACTLY what you want o...,,John Wick 3 stands out in terms of action sequ...,,1.409272,428145ee-3e0d-4322-b7bc-f0f0df7cd735,56c261cb-a3d4-4e93-9994-d7f99dc73542
5,What John Wick do in first movie?,content='John Wick seeks revenge on the gangst...,[page_content='John Wick (Reeves) is out to se...,,"In the original John Wick (2014), an ex-hit-ma...",,0.80587,7e8f48db-1869-4bbd-aa76-53c66f548378,288d713f-83f0-4cb0-9e74-f4275461f942
6,How do Russian mobsters influence the plot of ...,content='The Russian mobsters influence the pl...,"[page_content='One day, he's getting gas and a...",,The Russian mobsters are responsible for attac...,,0.868326,0ee076bc-3637-4aa1-9503-f28698333c17,ce93eace-9655-4ccd-b5d6-e24e876d525f
7,What distinguishes Chad Stahelski's direction ...,"content=""Chad Stahelski's direction in John Wi...",[page_content='Directed by Chad Stahelski who'...,,Chad Stahelski's direction in John Wick is dis...,,1.345579,8947bfc1-e67d-48d3-ae98-dfe1410c911a,737760f2-7bc4-452a-bdf3-b234380c5269
8,Wht is the poplarity of Jon Wick films?,"content=""I don't know."" additional_kwargs={'re...",[page_content='John Wick was cool.' metadata={...,,The fourth installment of John Wick is scoring...,,0.813832,69319b0b-968f-488b-94c2-808351d509a2,d1e4baf1-d097-4f3e-9d02-19dfa1041866
9,How does the film 'John Wick' compare to 'Take...,"content=""The film 'John Wick' has been describ...",[page_content='John Wick (Reeves) is out to se...,,The film 'John Wick' can be described as simil...,,1.133374,94ff1162-6c93-4901-bbe2-cf5e2948db4c,5da22807-3629-44d5-b7fe-8524d9b4aeb6


#### Evaluation Results

Based on the table below, and setting aside the Semantic chunking approach for now (its quality metrics are still blank), the retrievers can be roughly ranked as follows:

- **Lowest Cost:** Parent Document
- **Fastest:** BM25
- **Best Context Precision:** Contextual Compression
- **Best Context Recall:** Ensemble
- **Most Faithful:** Naive RAG
- **Most Relevant:** Parent Document
- **Best Context Entity Recall:** Contextual Compression
- **Least Noise Sensitive:** Parent Document
- **Best Overall Performance (Average of the Five Quality Metrics, Excluding Noise Sensitivity):** Ensemble
- **Most Balanced (Between Latency, Cost, and Performance):** Contextual Compression or Parent Document

Before settling on a particular retriever, though, it is worth ranking the metrics by importance for your project and setting a tolerance level for each. For example, would it be OK to sacrifice some precision for lower cost? Or some latency for better faithfulness? In these results no single retriever wins on every metric, so some trade-off appears unavoidable.
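One way to make that prioritization concrete is a small weighted-scoring helper. The sketch below is illustrative only: the `pick_retriever` function, the weights, and the latency and cost ceilings are all hypothetical choices, not part of the notebook. The numbers are copied from the results table, with cost taken as the average cost per query.

```python
# Values copied from the results table; "cost" is the average cost per query (USD).
# Semantic is left out because its quality metrics are blank.
RESULTS = {
    "Naive RAG": {"latency": 1.458, "cost": 0.002,
                  "context_precision": 0.6183, "faithfulness": 0.9405},
    "BM25": {"latency": 0.869, "cost": 0.001,
             "context_precision": 0.3833, "faithfulness": 0.8000},
    "Contextual Compression": {"latency": 1.594, "cost": 0.001,
                               "context_precision": 0.8000, "faithfulness": 0.8564},
    "Multi-Query": {"latency": 3.814, "cost": 0.0024,
                    "context_precision": 0.5256, "faithfulness": 0.9033},
    "Parent Document": {"latency": 1.279, "cost": 0.0001,
                        "context_precision": 0.7000, "faithfulness": 0.6978},
    "Ensemble": {"latency": 4.362, "cost": 0.0029,
                 "context_precision": 0.6339, "faithfulness": 0.7665},
}


def pick_retriever(results, weights, limits):
    """Rank retrievers by weighted score, dropping any that exceed a limit.

    Hypothetical helper: `weights` maps metric name -> importance,
    `limits` maps a column name -> maximum tolerated value.
    """
    ranked = []
    for name, row in results.items():
        # Tolerance check: skip retrievers that exceed any ceiling.
        if any(row[key] > ceiling for key, ceiling in limits.items()):
            continue
        score = sum(weight * row[metric] for metric, weight in weights.items())
        ranked.append((name, score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Example priorities (assumed): precision and faithfulness matter equally,
# and anything slower than 2.0 or costlier than $0.002 per query is ruled out.
weights = {"context_precision": 0.5, "faithfulness": 0.5}
limits = {"latency": 2.0, "cost": 0.002}

ranking = pick_retriever(RESULTS, weights, limits)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Changing the weights or limits changes the winner, which is the point: the "best" retriever depends on which metrics the project cares about most.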


| Retriever Type | Avg Latency (s) | Cost | Context Precision | Context Recall | Faithfulness | Response Relevancy | Context Entity Recall | Average | Noise Sensitivity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Naive RAG | 1.458 | Total: $0.02, Avg: $0.002 | 0.6183 | 0.7833 | 0.9405 | 0.7637 | 0.5333 | 0.72782 | 0.4585 |
| BM25 | 0.869 | Total: $0.01, Avg: $0.001 | 0.3833 | 0.4133 | 0.8000 | 0.3933 | 0.4100 | 0.47998 | 0.4125 |
| Contextual Compression | 1.594 | Total: $0.01, Avg: $0.001 | 0.8000 | 0.6700 | 0.8564 | 0.7654 | 0.6517 | 0.7487 | 0.1833 |
| Multi-Query | 3.814 | Total: $0.024, Avg: $0.0024 | 0.5256 | 0.9167 | 0.9033 | 0.7632 | 0.5083 | 0.72342 | 0.4167 |
| Parent Document | 1.279 | Total: $0.001, Avg: $0.0001 | 0.7000 | 0.5517 | 0.6978 | 0.8595 | 0.5450 | 0.6708 | 0.0892 |
| Ensemble | 4.362 | Total: $0.029, Avg: $0.0029 | 0.6339 | 0.9667 | 0.7665 | 0.8519 | 0.6333 | 0.77046 | 0.3600 |
| Semantic | 1.246 | Total: $0.016, Avg: $0.0016 |  |  |  |  |  | |  |
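
The Average column and the per-metric winners listed earlier can be recomputed from the table. The following is a minimal sketch with the scores copied by hand; the variable names are illustrative, and Semantic is omitted because its metric cells are empty.

```python
# The five quality metrics that feed the "Average" column, in table order.
METRICS = [
    "context_precision",
    "context_recall",
    "faithfulness",
    "response_relevancy",
    "context_entity_recall",
]

# Scores copied from the results table, one list per retriever.
scores = {
    "Naive RAG": [0.6183, 0.7833, 0.9405, 0.7637, 0.5333],
    "BM25": [0.3833, 0.4133, 0.8000, 0.3933, 0.4100],
    "Contextual Compression": [0.8000, 0.6700, 0.8564, 0.7654, 0.6517],
    "Multi-Query": [0.5256, 0.9167, 0.9033, 0.7632, 0.5083],
    "Parent Document": [0.7000, 0.5517, 0.6978, 0.8595, 0.5450],
    "Ensemble": [0.6339, 0.9667, 0.7665, 0.8519, 0.6333],
}

# Unweighted mean of the five metrics for each retriever.
averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Retriever with the highest score on each individual metric.
best_per_metric = {
    metric: max(scores, key=lambda name: scores[name][i])
    for i, metric in enumerate(METRICS)
}

best_overall = max(averages, key=averages.get)

for name, avg in averages.items():
    print(f"{name}: {avg:.5f}")
print(best_per_metric)
print("Best overall:", best_overall)
```

Note that this is an unweighted mean, so it treats all five metrics as equally important.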