# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In this notebook, we'll look at how [Ragas](https://github.com/explodinggradients/ragas) can help you evaluate your RAG applications: generating synthetic test data, scoring a pipeline, and measuring the effect of a change!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 5: Making Adjustments and Re-Evaluating

But first! Let's set some dependencies!

## Dependencies and API Keys:

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [1]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a repeatable way to measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - which should hopefully be familiar at this point since it's our Loan Data use-case!

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [2]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

In [3]:
len(docs)

269

In [4]:
docs[:20]

[Document(metadata={'source': 'data/Applications_and_Verification_Guide.pdf', 'file_path': 'data/Applications_and_Verification_Guide.pdf', 'page': 0, 'total_pages': 76, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'wkhtmltopdf 0.12.6', 'producer': 'GPL Ghostscript 10.00.0', 'creationDate': "D:20241217172423Z00'00'", 'modDate': "D:20241217172423Z00'00'", 'trapped': ''}, page_content='Application and Verification Guide\nIntroduction\nThis guide is intended for college financial aid administrators and counselors who help students with the financial aid\nprocess4completing the Free Application for Federal Student Aid (FAFSA®) form, verifying information, and making\ncorrections and other changes to the information reported on the FAFSA form.\nThroughout the Federal Student Aid Handbook, we use <college,= <school,= and <institution= interchangeably unless a\nmore specific use is given. Similarly, <student,= <applicant,= and <aid recipient= are sy

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

Building the knowledge graph by hand is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [5]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [6]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/18 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/28 [00:00<?, ?it/s]

Property 'summary' already exists in node '559eda'. Skipping!
Property 'summary' already exists in node 'c89122'. Skipping!
Property 'summary' already exists in node '74af45'. Skipping!
Property 'summary' already exists in node '498002'. Skipping!
Property 'summary' already exists in node 'f6007b'. Skipping!
Property 'summary' already exists in node '8ed74b'. Skipping!
Property 'summary' already exists in node '5bfcb7'. Skipping!
Property 'summary' already exists in node 'b28e84'. Skipping!
Property 'summary' already exists in node '25f78d'. Skipping!
Property 'summary' already exists in node 'f6e826'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/58 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'f6007b'. Skipping!
Property 'summary_embedding' already exists in node 'f6e826'. Skipping!
Property 'summary_embedding' already exists in node '8ed74b'. Skipping!
Property 'summary_embedding' already exists in node '25f78d'. Skipping!
Property 'summary_embedding' already exists in node '5bfcb7'. Skipping!
Property 'summary_embedding' already exists in node '74af45'. Skipping!
Property 'summary_embedding' already exists in node 'c89122'. Skipping!
Property 'summary_embedding' already exists in node '498002'. Skipping!
Property 'summary_embedding' already exists in node 'b28e84'. Skipping!
Property 'summary_embedding' already exists in node '559eda'. Skipping!


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [7]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | How does the Internal Revenue Service (IRS) fa... | [Application and Verification Guide Introducti... | The Internal Revenue Service (IRS) facilitates... | single_hop_specifc_query_synthesizer |
| 1 | What recent change has been made regarding the... | [Chapter 1: The Application Process We removed... | The <Returning FAFSA Filers= section has been ... | single_hop_specifc_query_synthesizer |
| 2 | How does the NSLDS financial aid history page ... | [The FPS also checks the application for possi... | If an applicant has defaulted on a federal stu... | single_hop_specifc_query_synthesizer |
| 3 | As a Financial Aid Counselor, what is the proc... | [2. The disclosure of their FTI by the IRS to ... | For FAFSA contributors, each student and requi... | single_hop_specifc_query_synthesizer |
| 4 | How has the FAFSA Simplification Act impacted ... | [<1-hop>\n\nApplication and Verification Guide... | The FAFSA Simplification Act, passed as part o... | multi_hop_abstract_query_synthesizer |
| 5 | How does the determination of independent stud... | [<1-hop>\n\nThe FPS also checks the applicatio... | A student who qualifies as independent—such as... | multi_hop_abstract_query_synthesizer |
| 6 | How does veteran status determination for fina... | [<1-hop>\n\nveteran, or will be one by June 30... | Veteran status determination for financial aid... | multi_hop_abstract_query_synthesizer |
| 7 | if i fill out fafsa form online and my tax inf... | [<1-hop>\n\nequal the tax filer(s) plus depend... | if you do the fafsa online and your tax info g... | multi_hop_abstract_query_synthesizer |
| 8 | According to Chapter 4 and Chapter 5, how do a... | [<1-hop>\n\nequal the tax filer(s) plus depend... | A student's marital status, as described in Ch... | multi_hop_specific_query_synthesizer |
| 9 | According to the FAFSA Simplification Act and ... | [<1-hop>\n\nThe FPS also checks the applicatio... | The FAFSA Simplification Act, as described in ... | multi_hop_specific_query_synthesizer |


## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the above created test data!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

Now that we have our data loaded, let's split it into chunks!

In [14]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
print(len(docs))
print(len(split_documents))

269
1102


#### ❓ Question: 

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

#### ✅ Answer:
The `chunk_overlap` parameter in `RecursiveCharacterTextSplitter` controls how many characters are shared between consecutive text chunks. Its purpose is to preserve important context that might otherwise be lost at chunk boundaries.

When documents are split into chunks for processing (e.g., for embeddings or retrieval), information near the end of a chunk may be critical for understanding the beginning of the next one. By overlapping a portion of text (here, 200 characters out of each chunk of up to 1,000 characters), the splitter helps maintain semantic continuity between chunks.

This improves downstream performance in retrieval-augmented generation (RAG), question answering, and summarization tasks by ensuring smoother transitions and richer context for each chunk.
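
To make the overlap concrete, here is a minimal sketch using plain fixed-width windows. It is a simplification: `RecursiveCharacterTextSplitter` splits on separators such as paragraph and line breaks rather than at fixed offsets, and the helper name `chunk_with_overlap` is hypothetical, not part of LangChain.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary is still fully visible in at least one chunk (as long as it is shorter than the overlap).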

Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [10]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in-memory Qdrant vector store.

In [11]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="loan_data",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="loan_data",
    embedding=embeddings,
)

We can now add our documents to our vector store.

In [12]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [15]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
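
Conceptually, a retriever with `k=5` over a cosine-distance collection returns the five stored vectors most similar to the query embedding. The sketch below shows that idea with a brute-force search over toy two-dimensional vectors; the function names and data are hypothetical, and this is not a description of how Qdrant implements search internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" standing in for the 1536-dimensional vectors used above.
toy_vectors = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.7, 0.7],
    "doc_c": [0.0, 1.0],
}
nearest = top_k([1.0, 0.1], toy_vectors, k=2)
print(nearest)
```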

Now we can produce a node for retrieval!

In [16]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### Augmented

Let's create a simple RAG prompt!

In [17]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4.1-nano` to avoid using the same model as our judge model.

In [18]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-nano")

Then we can create a `generate` node!

In [19]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [20]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.
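
As a rough mental model of what a sequence of nodes does, the sketch below runs plain functions in order and merges each returned partial update into a shared state dictionary. This is an illustration only, assuming simple overwrite semantics for each key; `run_sequence` and the toy nodes are hypothetical and are not LangGraph's implementation.

```python
from typing import Callable


def run_sequence(nodes: list[Callable[[dict], dict]], initial_state: dict) -> dict:
    """Call each node in order, merging its partial update into the state."""
    state = dict(initial_state)
    for node in nodes:
        update = node(state)   # a node returns only the keys it wants to set
        state.update(update)   # later nodes can read what earlier nodes wrote
    return state


def toy_retrieve(state: dict) -> dict:
    # Stand-in for the retrieval node: writes the "context" key.
    return {"context": [f"passage about {state['question']}"]}


def toy_generate(state: dict) -> dict:
    # Stand-in for the generation node: reads "context", writes "response".
    return {"response": f"Answer built from {len(state['context'])} passage(s)."}


final_state = run_sequence([toy_retrieve, toy_generate], {"question": "loans"})
print(final_state["response"])
```

This is why `retrieve` only returns `{"context": ...}` and `generate` only returns `{"response": ...}`: each node contributes its own keys, and the question stays in the state throughout.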

In [21]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

Let's do a test to make sure it's doing what we'd expect.

In [22]:
response = graph.invoke({"question" : "What are the different kinds of loans?"})

In [23]:
response["response"]

'The provided context references various types of loans related to student financial assistance, specifically focusing on Direct Loans. Based on this information, the different kinds of loans mentioned are:\n\n1. **Direct Loans**  \n   - These include both subsidized and unsubsidized loans that are part of the federal student aid program. The context discusses the eligibility, repayment plans, interest accrual, and other aspects of Direct Loans.\n\n2. **Direct Unsubsidized Loans**  \n   - A specific type of Direct Loan where interest may accrue while the borrower is in school, and the borrower has the option to pay the interest during this period.\n\nWhile the context primarily emphasizes Direct Loans and their variants, it does not explicitly mention other distinct loan types such as Perkins Loans, Stafford Loans, or PLUS Loans. Therefore, based solely on the provided information, the different kinds of loans are:\n\n- **Direct Loans (including Direct Unsubsidized Loans)**\n\nIf you n

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated using SDG above through our application to get context and responses.

In [24]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [25]:
dataset.samples[0].eval_sample.response

'The Internal Revenue Service (IRS) facilitates the FAFSA application process for students seeking federal student aid through the FUTURE Act Direct Data Exchange (FA-DDX). This system establishes a secure connection between the IRS and the Department of Education via an application programming interface, allowing for near-real-time transfer of certain federal tax information (FTI) directly from the IRS to the FAFSA form. This process streamlines the application by eliminating the need for most applicants and their family members to self-report income and tax data, as the transferred FTI is considered verified for Title IV purposes.\n\nRegarding changes in the transfer of tax information, the FA-DDX has replaced the previous IRS Data Retrieval Tool (IRS-DRT). Unlike the IRS-DRT, which allowed applicants to opt in, the FA-DDX requires explicit consent and approval from the student and each contributor (such as spouse or parents) for the Department to access and use their FTI. Additional

Then we can convert that table into an `EvaluationDataset` which will make the process of evaluation smoother.

In [26]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

We'll need to select a judge model - in this case we're using `gpt-4.1-mini`, a smaller model from the same family as the `gpt-4.1` model that was used to generate our synthetic data.

In [27]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

Next up - we simply evaluate on our desired metrics!

In [28]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.8357, 'faithfulness': 0.8306, 'factual_correctness': 0.6642, 'answer_relevancy': 0.9362, 'context_entity_recall': 0.4583, 'noise_sensitivity_relevant': 0.2298}

## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see how the system improves or doesn't improve!

> NOTE: This will be using Cohere's Rerank model - please be sure to [sign-up for an API key!](https://docs.cohere.com/reference/about)

In [29]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")


We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.

In [30]:
adjusted_example_retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Reranking (exposed in LangChain as contextual compression) takes the larger set of retrieved documents, scores each one against the query with a dedicated reranker model, and keeps only the highest-scoring few.

This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.
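
The two-stage pattern can be sketched without any external services: a cheap first pass selects `k` candidates, then a more careful scorer reorders them and keeps `top_n`. Both scoring functions below are hypothetical word-overlap stand-ins chosen for illustration; they are not how the vector store or Cohere's reranker actually score documents.

```python
def cheap_score(query: str, doc: str) -> int:
    """First-stage stand-in: how many query words appear in the document."""
    doc_words = set(doc.lower().split())
    return sum(word in doc_words for word in query.lower().split())


def careful_score(query: str, doc: str) -> float:
    """Second-stage stand-in: fraction of the document's words that match the query."""
    query_words = set(query.lower().split())
    doc_words = doc.lower().split()
    return sum(word in query_words for word in doc_words) / len(doc_words)


def retrieve_then_rerank(query: str, corpus: list[str], k: int, top_n: int) -> list[str]:
    # Stage 1: broad, cheap retrieval of k candidates.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:k]
    # Stage 2: slower, more discriminating scoring on the small candidate set only.
    reranked = sorted(candidates, key=lambda d: careful_score(query, d), reverse=True)
    return reranked[:top_n]


corpus = [
    "direct loans are one of many federal student aid programs with various rules",
    "direct subsidized loans",
    "campus parking permits",
    "pell grants",
]
results = retrieve_then_rerank("direct loans", corpus, k=3, top_n=2)
print(results)
```

In this toy corpus the first stage cannot tell the two loan documents apart, while the second stage promotes the more focused one, which is the effect we hope a reranker has on real retrieved chunks.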

In [31]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

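# Build the reranker and the compression retriever once, rather than on every call.
# The number of documents kept after reranking is set by CohereRerank's `top_n`
# (default 3, to our knowledge); ContextualCompressionRetriever takes no `search_kwargs`.
compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
  base_compressor=compressor, base_retriever=adjusted_example_retriever
)

def retrieve_adjusted(state):
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}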

We can simply rebuild our graph with the new retriever!

In [32]:
class AdjustedState(TypedDict):
  question: str
  context: List[Document]
  response: str

adjusted_graph_builder = StateGraph(AdjustedState).add_sequence([retrieve_adjusted, generate])
adjusted_graph_builder.add_edge(START, "retrieve_adjusted")
adjusted_graph = adjusted_graph_builder.compile()

In [33]:
response = adjusted_graph.invoke({"question" : "What are the different kinds of loans?"})
response["response"]

'The context mentions two specific types of loans: \n1. Direct Subsidized Loans\n2. Direct Unsubsidized Loans\nAdditionally, it references Federal PLUS Loans, which were made under the Federal Family Education Loan (FFEL) Program before July 1, 2010.\n\nBased on this information, the different kinds of loans are:\n- Direct Subsidized Loans\n- Direct Unsubsidized Loans\n- Federal PLUS Loans'

In [34]:
import time
import copy

rerank_dataset = copy.deepcopy(dataset)

for test_row in rerank_dataset:
  response = adjusted_graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(5) # To try to avoid rate limiting.
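
A fixed pause is the simplest way to stay under a rate limit. An alternative, sketched here under the assumption that rate-limit failures surface as exceptions, is to retry with exponential backoff so that we only wait when a call actually fails. The helper `call_with_backoff` and the simulated `flaky_call` are hypothetical; real code should catch the specific rate-limit exception raised by the client library rather than a bare `Exception`.

```python
import time


def call_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), waiting base_delay * 2**attempt seconds after each failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # assumption: narrow this to the client's rate-limit error
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)


# Simulated call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


recorded_delays = []  # record delays instead of sleeping, to keep the demo instant
outcome = call_with_backoff(flaky_call, sleep=recorded_delays.append)
print(outcome, recorded_delays)
```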

In [35]:
rerank_dataset.samples[0].eval_sample.response

'The Internal Revenue Service (IRS) facilitates the FAFSA application process for students seeking federal student aid by enabling a direct data exchange called the FA-DDX (FUTURE Act Direct Data Exchange). This exchange allows the Department of Education to obtain federal tax information (FTI) directly from the IRS to help complete the FAFSA form. The implementation of the FA-DDX has replaced the previous tool, the IRS Data Retrieval Tool (DRT), which was retired after the 2023-24 application cycle. \n\nUnlike the IRS-DRT, which applicants could opt into, the FA-DDX requires applicants and contributors (such as spouses and parents) to provide consent and approval for the Department to retrieve and use their FTI from the IRS. The transferred tax information via the FA-DDX is considered verified for Title IV purposes, streamlining the application process by reducing the need for applicants to self-report income and tax data. \n\nAdditionally, if an applicant or contributor files an amen

In [36]:
rerank_evaluation_dataset = EvaluationDataset.from_pandas(rerank_dataset.to_pandas())

In [37]:
result = evaluate(
    dataset=rerank_evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.8054, 'faithfulness': 0.9021, 'factual_correctness': 0.6942, 'answer_relevancy': 0.9412, 'context_entity_recall': 0.5252, 'noise_sensitivity_relevant': 0.2230}

#### ❓ Question: 

Which system performed better, on what metrics, and why?

#### ✅ Answer:

### 🧪 RAG Evaluation Comparison

Below is a comparison of the two RAG systems based on Ragas metrics:

| Metric                     | System 1 (Baseline RAG) | System 2 (with Cohere Rerank) | Better System |
|---------------------------|--------------------------|-------------------------------|----------------|
| **Context Recall**        | 0.8357                   | 0.8054                        | System 1       |
| **Faithfulness**          | 0.8306                   | 0.9021                        | System 2       |
| **Factual Correctness**   | 0.6642                   | 0.6942                        | System 2       |
| **Answer Relevancy**      | 0.9362                   | 0.9412                        | System 2       |
| **Context Entity Recall** | 0.4583                   | 0.5252                        | System 2       |
| **Noise Sensitivity (↓)** | 0.2298                   | 0.2230                        | System 2       |
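
The "Better System" column can also be derived programmatically. The sketch below copies the two result dictionaries printed earlier and computes the per-metric difference, treating noise sensitivity as lower-is-better; the helper name `compare` and the tie-handling choice are assumptions made for illustration.

```python
baseline = {
    "context_recall": 0.8357,
    "faithfulness": 0.8306,
    "factual_correctness": 0.6642,
    "answer_relevancy": 0.9362,
    "context_entity_recall": 0.4583,
    "noise_sensitivity_relevant": 0.2298,
}
rerank = {
    "context_recall": 0.8054,
    "faithfulness": 0.9021,
    "factual_correctness": 0.6942,
    "answer_relevancy": 0.9412,
    "context_entity_recall": 0.5252,
    "noise_sensitivity_relevant": 0.2230,
}
LOWER_IS_BETTER = {"noise_sensitivity_relevant"}


def compare(base: dict, candidate: dict) -> dict:
    """Map each metric to (candidate - base, name of the better system)."""
    rows = {}
    for metric, base_score in base.items():
        delta = round(candidate[metric] - base_score, 4)
        improved = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
        # Assumption: an exact tie is reported as "baseline".
        rows[metric] = (delta, "rerank" if improved else "baseline")
    return rows


comparison = compare(baseline, rerank)
for metric, (delta, winner) in comparison.items():
    print(f"{metric:28s} {delta:+.4f}  better: {winner}")
```

Looking at the size of each delta, not just its sign, helps separate the larger shifts (faithfulness, context entity recall) from differences that are likely within run-to-run noise.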

---

### ✅ Conclusion

- **System 2**, which incorporates **Cohere Rerank**, outperforms the baseline RAG system (System 1) on **five out of six metrics**:
  - It produces answers that are more **faithful**, **factually correct**, and **relevant**.
  - It retains more **key entities** from the context.
  - It is **slightly less sensitive to noise**.

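- **System 1** slightly outperforms in **context recall**, suggesting its retrieved context covered a bit more of the reference answers, though not necessarily with the most helpful passages.

- Since Ragas scores are best read directionally, the very small gaps in **answer relevancy** (+0.0050) and **noise sensitivity** (-0.0068) are closer to a tie than a clear win.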

---

### 🤖 Why Cohere Rerank Helps

Cohere Rerank improves RAG performance by:
- **Reordering retrieved documents** using a relevance score computed for each question-document pair, rather than relying on embedding similarity alone.
- Prioritizing chunks that are **both topically aligned and informative**, which leads to:
  - **Higher-quality context** being passed to the LLM,
  - **Less noise** in the prompt,
  - And therefore, **better answer generation**.

🔍 Even with slightly lower context recall, System 2 selects **higher-value context**, which boosts the overall performance across metrics.

---

🏁 **Bottom line**: On this test set, adding Cohere Rerank improved downstream answer quality by focusing not on retrieving more, but on retrieving **smarter**.
