# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In the following notebook, we'll be looking at how [Ragas](https://github.com/explodinggradients/ragas) can be helpful in a number of ways when looking to evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 6: Making Adjustments and Re-Evaluating

But first! Let's set some dependencies!

## Dependencies and API Keys:

> NOTE: Please skip the pip install commands if you are running the notebook locally.

In [1]:
#!pip install -qU ragas==0.2.10
#!pip install -qU cohere langchain_cohere

In [3]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination! 

And next, in case we use cohere cntextual compression, we need our cohere key. 

And then, if we are going to use tracing, we need our Langsmith key.

In [1]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")

## Generating Synthetic Test Data

We wil be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - and download our webpages which we'll be using for our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [None]:
!mkdir data

In [None]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

In [None]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [None]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `sdg_llm` (which will generate our questions, summaries, and more), and our `sdg_embeddings` which will be useful in building our graph.

In [3]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
sdg_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
sdg_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))


This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.

If we do not give the query distribution it will use 1/3, 1/3, 1/3 split for single hop specific, multihop specific and multihop abstract.

In [None]:
from ragas.testset import TestsetGenerator

sdg_generator = TestsetGenerator(llm=sdg_llm, embedding_model=sdg_embeddings)
sdg_dataset = sdg_generator.generate_with_langchain_docs(docs, testset_size=10)

In [None]:
sdg_dataset.to_pandas()

## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the above created test data!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [None]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

Now that we have our data loaded, let's split it into chunks! 

We are providing:
  * two options for chunking - **Naive or Semantic**, and
  * two options for retrieval: **Naive** (get top 5) or **Cohere** (get top 20 and then juice stuff from them using cohere)

**Helper functions: Naive Splitting/Chunking**

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def text_split():
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    split_documents = text_splitter.split_documents(docs)
    len(split_documents)
    print(f"Number of naive chunks: {len(split_documents)}") 
     # Calculate and print chunk length statistics
    chunk_lengths = [len(doc.page_content) for doc in split_documents]
    print(f"Max chunk length: {max(chunk_lengths)}")
    print(f"Min chunk length: {min(chunk_lengths)}")
    print(f"Average chunk length: {sum(chunk_lengths)/len(chunk_lengths):.2f}")
    return split_documents

**Helper functions: Semantic Splitting/Chunking**

In [8]:
from langchain_experimental.text_splitter import SemanticChunker # Note: Often in 'experimental'

# TRY: Hybrid chunking, two pass. First pass is semantic and then second pass is the naive chunking.
# Must we match the model we use later for retrieval/generation?
sc_embeddings = OpenAIEmbeddings(model="text-embedding-3-small") 

# Initialize the SemanticChunker
# Ok, so for blog article text, we want meaningful chunks, we don't want overly big chunks.
# We start with percentile - it looks at the distribution of similarity scores between adjacent 
# sentences within that document and splits where the similarity drops into a lower percentile.
# Default is bottom 5% similarity scores. I want to avoid overly large chunks, so I will change
# it to ensure that the chunks are not too big.
semantic_splitter = SemanticChunker(
    embeddings=sc_embeddings, 
    breakpoint_threshold_type="percentile", # Or standard_deviation, interquartile, gradient
    breakpoint_threshold_amount=85
)
def semantic_split():
    # Split the documents
    semantic_split_documents = semantic_splitter.split_documents(docs)

    # 5. Check the result
    print(f"Number of semantic chunks: {len(semantic_split_documents)}") 
    # Calculate and print chunk length statistics
    chunk_lengths = [len(doc.page_content) for doc in semantic_split_documents]
    print(f"Max chunk length: {max(chunk_lengths)}")
    print(f"Min chunk length: {min(chunk_lengths)}")
    print(f"Average chunk length: {sum(chunk_lengths)/len(chunk_lengths):.2f}")
    return semantic_split_documents

**Some chunking results**

|Attribute|Naive chunking| Semantic Chunking (Percentile - 85)|
|---------|--------------|------------------|
|Number of chunks| 75 | 70|
|Max chunks|998|5088|
|Min chunk| 406| 22|
|Avg Chunk| 880.55| 858.27|



#### ❓ Question: 

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

#### <span style="color:green"> TL;DR - Maintain context continuity across chunks.

<span style="color:green"> We do not want a split to happen mid sentence. We can lose context that way. Any sequence or idea that **spans** chunk boundary is likely to be kept intact with the help of overlap. This helps downstream applications (our RAG pipeline in this case) get better semantic context.

**Qdrant vector db set-up**

In [9]:
# We need an embedding model for Qdrant to use
from langchain_openai import OpenAIEmbeddings

# Qdrant imports
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams


embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Our in memory QDrant vector store. We are setting it up. We will load data into it later.
client = QdrantClient(":memory:")
client.create_collection(
    collection_name="ai_across_years",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=embeddings,
)


In [44]:
# By default, we are using regular text splitting/chunking and Naive QDrant retrieval.
SEMANTIC_SPLIT = False
COHERE_RETRIEVER = True

Now that we have the helper functions for our two chunking/splitting options and the setup for Qdrant, \
we are ready to load our data into the Qdrant vector db (vector_store) and then we will set our retriever node to use this vector store for the graph.

**qdrant vector db: load data into it, depending on SEMANTIC_SPLIT**


In [None]:
# Depending on which type of chunking we are evaluating, we will add the documents to the vector store.
split_documents = None
if (SEMANTIC_SPLIT):
  split_documents = semantic_split()
else:
  split_documents = text_split()

_ = vector_store.add_documents(documents=split_documents)


**Helper Fucntions: Naive and Cohere Retrievers**

 We will create helper functions for two retriever nodes,
   - one that uses a **regular or naive** retrieval using a naive Qdrant vector DB - this one gets the top 5 matches, 
   - and another **cohere_retriever** that gets top 20 matches and then contextually compresses them down into 5.

Reranking, or contextual compression, is a technique that uses a reranker to compress the retrieved documents into a smaller set of documents.\
This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.

In [54]:
# Cohere imports
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# We define both our retriever nodes and later, we will add the correct one to our graph.

# Cohere retriever node.
# We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.
cohere_base_retriever = vector_store.as_retriever(search_kwargs={"k": 20})
def cohere_retrieve(state):
  compressor = CohereRerank(model="rerank-v3.5")
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=cohere_base_retriever, search_kwargs={"k": 5}
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}
  
# Naive retriever node.
naive_base_retriever = vector_store.as_retriever(search_kwargs={"k": 5})
def naive_retrieve(state):
  retrieved_docs = naive_base_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}


### Augmented

Let's create a simple RAG prompt!

In [55]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4o-mini` to avoid using the same model as our judge model.

In [56]:
from langchain_openai import ChatOpenAI

generator_llm = ChatOpenAI(model="gpt-4o-mini")

Then we can create a `generate` node!

In [57]:
def rag_generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = generator_llm.invoke(messages)
  return {"response" : response.content}

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [58]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [59]:
graph_builder = StateGraph(State)
if COHERE_RETRIEVER:
  graph_builder.add_sequence([cohere_retrieve, rag_generate])
  graph_builder.add_edge(START, "cohere_retrieve")
else:
  graph_builder.add_sequence([naive_retrieve, rag_generate])
  graph_builder.add_edge(START, "naive_retrieve")
  
graph = graph_builder.compile()

In [60]:
# print(graph.get_graph().draw_ascii())

Let's do a test to make sure it's doing what we'd expect.

In [None]:
response = graph.invoke({"question" : "How are LLM agents useful?"})

In [None]:
response["response"]

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries that we generated in sdg_dataset. For each query we will run the RAG chain and get the context and response.\
Earlier, we had user_input(query), reference_contexts(ideal context), reference(ideal response).\
 Now we add two more columns: response (RAG response) and retrieved_contexts (RAG context). \
So, we will have everything we need to hand it off to evaluation.

In [63]:
import time
for test_row in sdg_dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(5)

time.sleep(5)

In [None]:
sdg_dataset.to_pandas()

Then we can convert that table into a `EvaluationDataset` which will make the process of evaluation smoother.

In [65]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(sdg_dataset.to_pandas())

We'll need to select a judge model - in this case we're using the same model that was used to generate our Synthetic Data.

In [66]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

Next up - we simply evaluate on our desired metrics!

In [None]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

# default max_workers is 16.
# custom_run_config = RunConfig(timeout=360, max_workers=8)
custom_run_config = RunConfig(
    timeout=300,          # 5 minutes max for operations
    max_retries=15,       # More retries for rate limits
    max_wait=90,          # Longer wait between retries
    max_workers=8,        # Fewer concurrent API calls
    log_tenacity=True     # Log retry attempts
)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result



Some data to show variability from run to run. 

Naive chunking, Naive retrieval: Two runs

|metric        |run1 | run2 |
|--------------|-----|------|
|context_recall| 0.7803 | 0.6153 |
|faithfulness| 0.8319 | 0.7748 | 
|factual_correctness'| 0.6000 | 0.5492 |
|answer_relevancy'| 0.7959 | 0.7931|
|context_entity_recall'| 0.4514 | 0.4310 |
|noise_sensitivity_relevant'| 0.3639 | 0.2666|


Semantic chunking, Naive Retrieval: Three runs:
|metric        |run1 | run2 | run3 |
|--------------|-----|------|------|
|context_recall| 0.8788| .8417| 0.5389|
|faithfulness | 0.9032|0.8616| 0.9088|
|factual_correctness| 0.5718 | 0.5191| 0.4145|
|answer_relevancy| 0.8735| 0.9570| 0.8648|
| context_entity_recall| 0.4793 | 0.3603| 0.4496|
| noise_sensitivity_relevant| 0.4513| 0.5918|0.4446|


Naive Chunking, Cohere Retrieval: Three runs:

| metric                     | run1   | run2   | run3   |
|----------------------------|--------|--------|--------|
| context_recall             | 0.7917 | 0.5909 | 0.7778 |
| faithfulness               | 0.9021 | 0.6857 | 0.8526 |
| factual_correctness      | 0.6050 | 0.5442 | 0.4775 |
| answer_relevancy         | 0.8736 | 0.8682 | 0.8710 |
| context_entity_recall    | 0.4550 | 0.4618 | 0.4893 |
| noise_sensitivity_relevant | 0.3862 | 0.3018 | 0.3974 |

<span style="color:green"> **RESULTS for the three runs**

|Chunking|Retrieval|  Context Recall| Faithfulness | Factual_correctness | Answer_relevancy| Context-Entity-Recall | Noise_sensitivity_relevant |
| -------|---------| ---------------|--------------|---------------------|------------------|-----------------------|-------------------|
|Naive| Naive  |     0.7000     |    0.8346    |    0.5300           |     0.9337       |   0.3842              |   0.3458          |
| Naive| Cohere|     0.7917     |    0.9021    |    0.6050           |     0.8736       |   0.4550              |   0.3862          |
| Semantic | Naive |     0.8788     |    0.9032    |    0.5718           |     0.8735       |   0.4793              |   0.4513          |

#### ❓ Question: 

Which system performed better, on what metrics, and why?

#### <span style="color:green"> TL;DR

  -  <span style="color:green">As you can see in the above tables, there is a some variance in the results. This makes sense since LLMs are probabilistic by nature.
  - <span style="color:green">However, we can see that semantic chunking and cohere retrieval DID performed better than naive chunking and naive retrieval.
  - <span style="color:green">This is likely because semantic chunking is more context aware and cohere retrieval is more accurate.

---

### <span style="color:green">RAG System Performance Summary

<span style="color:green">Semantic chunking provided the biggest boost for metrics related to *finding and using* the right context effectively. Adding a reranker  significantly helped improve the *factual accuracy* of the final answer, even with basic chunking. The purely naive approach struggled most with context utilization.

---
### <span style="color:green">RAG System Performance Details

#### <span style="color:green">Semantic Chunking + Naive Retrieval

*   **Wins:** This system scored highest on:
    *   `Context Recall` (0.8788): It was much better at retrieving *all* the relevant information needed to answer the question.
        *   **Why:** Semantic chunking likely created more contextually complete chunks, making it easier for the naive retrieval to grab the necessary information.
    *   `Context Entity Recall` (0.4793): It excelled at recalling specific named entities (like people, places) from the context.
        *   **Why:** Similar to Context Recall, better-formed chunks likely kept relevant entities together?
    *   `Noise Sensitivity` (0.4513): It was decent at handling irrelevant or distracting information within the context.
        *   **Why:** Semantically coherent chunks might make it easier for the system to focus on the core information despite noise?
        * I think this could also be an artifact of LLM probability. We created some pretty big chunks with semantic chunking and sometimes that can lead to halluicnations or more sensitivity to noise. I am not trusting this one much.
    *   `Faithfulness` (0.9032): Tied for the best score, indicating its answers closely matched the information in the retrieved context.
        *   **Why:** When relevant context is retrieved well (high Context Recall), the answer is more likely to be faithful to it?
*   **Average:** Scored decently on `Factual Correctness` (0.5718), slightly below System 2.
*   **Lagged:** Scored lower on `Answer Relevancy` (0.8736) compared to the baseline. I am not sure why?! Same for Naive Chunking and Cohere Retrieval below.


---

#### <span style="color:green">Naive Chunking + Cohere Rerank Retrieval

*   **Wins:** This system scored highest on:
    *   `Factual Correctness` (0.6050): It was the best at providing factually accurate answers based on the context.
        *   **Why:** The Cohere reranker likely re-ordered the naively retrieved chunks to prioritize the most relevant and accurate ones *before* sending them to the language model for answer generation.
    *   `Faithfulness` (0.9021): Tied for the best, showing answers aligned well with the provided context.
        *   **Why:** The reranker helps ensure the context used for generation is highly relevant.
*   **Average:** Showed significant improvement over the baseline on `Context Recall` (0.7917) and `Context Entity Recall` (0.4550), demonstrating the benefit of the reranker even with naive chunks.
*   **Lagged:** Scored lower on `Answer Relevancy` (0.8736) compared to the baseline. I am not sure why?!

---

#### <span style="color:green">Naive Chunking + Naive Retrieval (Baseline)

*   **Wins:** Surprisingly scored highest on `Answer Relevancy` (0.9337).
    *   **Why:** This might indicate the baseline system gave answers directly related to the question prompt, perhaps simpler ones, without necessarily leveraging the retrieved context as effectively as the other systems. High relevancy doesn't guarantee correctness or context usage.
*   **Lagged:** This system performed worst on almost all other metrics, especially those measuring how well the context was utilized (`Context Recall`, `Faithfulness`, `Factual Correctness`, `Context Entity Recall`, `Noise Sensitivity`).
    *   **Why:** Basic chunking can split information awkwardly, and naive retrieval without reranking can bring back less relevant context, hindering the LLM's ability to generate high-quality, context-grounded answers.

---