# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In the following notebook, we'll be looking at how [Ragas](https://github.com/explodinggradients/ragas) can be helpful in a number of ways when looking to evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 6: Making Adjustments and Re-Evaluating

But first! Let's set some dependencies!

## Dependencies and API Keys:

> NOTE: Please skip the pip install commands if you are running the notebook locally.

In [9]:
#!pip install -qU ragas==0.2.10

In [10]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [28]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

OPTIONALLY:

We can also provide a Ragas API key - which you can sign-up for [here](https://app.ragas.io/).

In [29]:
os.environ["RAGAS_APP_TOKEN"] = getpass("Please enter your Ragas API key!")

## Generating Synthetic Test Data

We wil be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - and download our webpages which we'll be using for our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [30]:
!mkdir data

mkdir: cannot create directory ‘data’: File exists


In [31]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31427    0 31427    0     0  75778      0 --:--:-- --:--:-- --:--:-- 75910


In [32]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70286    0 70286    0     0   149k      0 --:--:-- --:--:-- --:--:--  149k


Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [33]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [34]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [35]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Generating personas: 100%|██████████| 2/2 [00:01<00:00,  1.31it/s]                                           
Generating Scenarios: 100%|██████████| 3/3 [00:09<00:00,  3.11s/it]
Generating Samples: 100%|██████████| 12/12 [01:08<00:00,  5.68s/it]


In [36]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What contributions has Anthropic made to the d...,[Code may be the best application The ethics o...,Anthropic has produced better-than-GPT-3 class...,single_hop_specifc_query_synthesizer
1,Why LLMs better at code than Spanish?,[Based Development As a computer scientist and...,LLMs are more capable of writing code because ...,single_hop_specifc_query_synthesizer
2,"How did the field of Artificial Intelligence, ...",[Simon Willison’s Weblog Subscribe Stuff we fi...,"The field of Artificial Intelligence, which da...",single_hop_specifc_query_synthesizer
3,Who is Simon Willison and what role does he pl...,[easy to follow. The rest of the document incl...,Simon Willison is an individual who has writte...,single_hop_specifc_query_synthesizer
4,"How do the ethics of AI, particularly concerni...",[<1-hop>\n\nCode may be the best application T...,"The ethics of AI, especially regarding trainin...",multi_hop_abstract_query_synthesizer
5,Considering the advancements in Large Language...,[<1-hop>\n\nCode may be the best application T...,The gullibility of language models is a signif...,multi_hop_abstract_query_synthesizer
6,How do the challenges of using Large Language ...,[<1-hop>\n\nCode may be the best application T...,The challenges of using Large Language Models ...,multi_hop_abstract_query_synthesizer
7,What are the ethical challenges associated wit...,[<1-hop>\n\nCode may be the best application T...,The ethical challenges associated with the tra...,multi_hop_abstract_query_synthesizer
8,What are the advancements and challenges assoc...,[<1-hop>\n\nfeed with the model and talk about...,Gemini and Gemini 2.0 represent significant ad...,multi_hop_specific_query_synthesizer
9,How did the developments in Large Language Mod...,[<1-hop>\n\neasy to follow. The rest of the do...,"In 2023, Large Language Models (LLMs) marked a...",multi_hop_specific_query_synthesizer


#### OPTIONAL:

If you've provided your Ragas API key - you can use this web interface to look at the created data!

In [37]:
dataset.upload()

Testset uploaded! View at https://app.ragas.io/dashboard/alignment/testset/8a7ea925-4eca-45a0-a64f-3a4d8d036eff


'https://app.ragas.io/dashboard/alignment/testset/8a7ea925-4eca-45a0-a64f-3a4d8d036eff'

## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the above created test data!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [38]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

Now that we have our data loaded, let's split it into chunks!

In [39]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

74

#### ❓ Question: 

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

Providing chunk overlap allows us to ensure that key information between chunks is not lost. RecursiveCharacterTextSplitter attempts to split data in a logical way, but it may not always be able to do so without losing context. By providing a chunk overlap, we can increase our chances to preserve key context across chunks and not lose that important information when chunking the data.


Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [40]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in memory QDrant vector store.

In [41]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="ai_across_years",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=embeddings,
)

We can now add our documents to our vector store.

In [42]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [43]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

Now we can produce a node for retrieval!

In [44]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### Augmented

Let's create a simple RAG prompt!

In [45]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4o-mini` to avoid using the same model as our judge model.

In [46]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

Then we can create a `generate` node!

In [47]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [48]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [49]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

Let's do a test to make sure it's doing what we'd expect.

In [50]:
response = graph.invoke({"question" : "How are LLM agents useful?"})

In [51]:
response["response"]

'LLM agents are useful primarily because they can efficiently handle tasks such as code generation. They are surprisingly easy to build, requiring only a few hundred lines of Python, although obtaining the necessary training data and computational resources can be a challenge. \n\nOne of the standout applications of LLMs is in software development, where their efficiency in generating code, despite their tendency to hallucinate, can be mitigated by their ability to execute the generated code and make corrections as needed. This makes them particularly effective in coding environments.\n\nAdditionally, LLMs can be run on personal devices, making them more accessible than before. While there are concerns about their negatives, such as ethical issues, reliability, and environmental impact, there is a recognition of their potential when used responsibly. Overall, the effectiveness of LLMs in generating code is one of the key highlights of their utility.'

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated usign SDG above through our application to get context and responses.

In [52]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [53]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,What contributions has Anthropic made to the d...,[OpenAI are not the only game in town here. Go...,[Code may be the best application The ethics o...,"Based on the provided context, Anthropic has m...",Anthropic has produced better-than-GPT-3 class...,single_hop_specifc_query_synthesizer
1,Why LLMs better at code than Spanish?,[Code may be the best application\n\nThe ethic...,[Based Development As a computer scientist and...,Large Language Models (LLMs) are better at cod...,LLMs are more capable of writing code because ...,single_hop_specifc_query_synthesizer
2,"How did the field of Artificial Intelligence, ...",[Simon Willison’s Weblog\n\nSubscribe\n\nStuff...,[Simon Willison’s Weblog Subscribe Stuff we fi...,"The field of Artificial Intelligence (AI), whi...","The field of Artificial Intelligence, which da...",single_hop_specifc_query_synthesizer
3,Who is Simon Willison and what role does he pl...,[Simon Willison’s Weblog\n\nSubscribe\n\nStuff...,[easy to follow. The rest of the document incl...,Simon Willison is a commentator and analyst in...,Simon Willison is an individual who has writte...,single_hop_specifc_query_synthesizer
4,"How do the ethics of AI, particularly concerni...","[Since then, almost every major LLM (and most ...",[<1-hop>\n\nCode may be the best application T...,"The ethics of AI, particularly concerning trai...","The ethics of AI, especially regarding trainin...",multi_hop_abstract_query_synthesizer
5,Considering the advancements in Large Language...,[a browser? 40.5k 49.2k How to implement Q&A a...,[<1-hop>\n\nCode may be the best application T...,"The gullibility of language models, particular...",The gullibility of language models is a signif...,multi_hop_abstract_query_synthesizer
6,How do the challenges of using Large Language ...,[Code may be the best application\n\nThe ethic...,[<1-hop>\n\nCode may be the best application T...,The challenges of using Large Language Models ...,The challenges of using Large Language Models ...,multi_hop_abstract_query_synthesizer
7,What are the ethical challenges associated wit...,"[Since then, almost every major LLM (and most ...",[<1-hop>\n\nCode may be the best application T...,The ethical challenges associated with the tra...,The ethical challenges associated with the tra...,multi_hop_abstract_query_synthesizer
8,What are the advancements and challenges assoc...,[I wrote about this at the time in The killer ...,[<1-hop>\n\nfeed with the model and talk about...,The advancements and challenges associated wit...,Gemini and Gemini 2.0 represent significant ad...,multi_hop_specific_query_synthesizer
9,How did the developments in Large Language Mod...,[Simon Willison’s Weblog\n\nSubscribe\n\nStuff...,[<1-hop>\n\neasy to follow. The rest of the do...,"In 2023, the developments in Large Language Mo...","In 2023, Large Language Models (LLMs) marked a...",multi_hop_specific_query_synthesizer


Then we can convert that table into a `EvaluationDataset` which will make the process of evaluation smoother.

In [54]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

We'll need to select a judge model - in this case we're using the same model that was used to generate our Synthetic Data.

In [55]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

Next up - we simply evaluate on our desired metrics!

In [56]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:  24%|██▎       | 17/72 [02:00<14:46, 16.12s/it]Exception raised in Job[24]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-TU5fm55zJrncrgPcg3lg23B6 on tokens per min (TPM): Limit 30000, Used 28901, Requested 2337. Please try again in 2.476s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Evaluating:  28%|██▊       | 20/72 [02:34<09:43, 11.22s/it]Exception raised in Job[16]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-TU5fm55zJrncrgPcg3lg23B6 on tokens per min (TPM): Limit 30000, Used 29704, Requested 1867. Please try again in 3.142s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Evaluating:  29%|██▉       | 21/72 [03:01<13:27, 15.84s/it]Exception raised in Job[7]: RateLimitErro

{'context_recall': 0.8214, 'faithfulness': 0.7833, 'factual_correctness': 0.5125, 'answer_relevancy': 0.8699, 'context_entity_recall': 0.4903, 'noise_sensitivity_relevant': 0.2863}

## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see how the model improves or doesn't improve!

> NOTE: This will be using Cohere's Rerank model (which was updated fairly [recently](https://docs.cohere.com/v2/changelog/rerank-v3.5)) - please be sure to [sign-up for an API key!](https://docs.cohere.com/reference/about)

In [57]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")

In [58]:
#!pip install -qU cohere langchain_cohere


We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.

In [59]:
retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Reranking, or contextual compression, is a technique that uses a reranker to compress the retrieved documents into a smaller set of documents.

This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.

In [60]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

def retrieve_adjusted(state):
  compressor = CohereRerank(model="rerank-v3.5")
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever, search_kwargs={"k": 5}
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

We can simply rebuild our graph with the new retriever!

In [61]:
class State(TypedDict):
  question: str
  context: List[Document]
  response: str

graph_builder = StateGraph(State).add_sequence([retrieve_adjusted, generate])
graph_builder.add_edge(START, "retrieve_adjusted")
graph = graph_builder.compile()

In [62]:
response = graph.invoke({"question" : "How are LLM agents useful?"})
response["response"]

'LLM agents can be useful primarily in the area of writing code, as they demonstrate significant capability in this domain. The grammar rules of programming languages are less complex than those of natural languages, which enhances the effectiveness of LLMs in coding tasks. However, the context also expresses skepticism regarding the broader utility of LLM agents due to issues like gullibility, as LLMs may believe and act on false information. This limitation raises concerns about their reliability in making meaningful decisions on behalf of users. Overall, while LLMs have demonstrated some functional capabilities, particularly in coding, there are significant challenges and criticisms regarding their overall effectiveness and the potential downsides of their use.'

In [63]:
import time

for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(2) # To try to avoid rate limiting.

In [64]:
result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:  26%|██▋       | 19/72 [02:30<11:23, 12.89s/it]Exception raised in Job[24]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-TU5fm55zJrncrgPcg3lg23B6 on tokens per min (TPM): Limit 30000, Used 29161, Requested 2337. Please try again in 2.996s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Evaluating:  28%|██▊       | 20/72 [02:48<12:33, 14.50s/it]Exception raised in Job[25]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-TU5fm55zJrncrgPcg3lg23B6 on tokens per min (TPM): Limit 30000, Used 29305, Requested 2494. Please try again in 3.598s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Evaluating:  31%|███       | 22/72 [03:09<10:51, 13.02s/it]Exception raised in Job[7]: RateLimitErro

{'context_recall': 0.8393, 'faithfulness': 0.7939, 'factual_correctness': 0.5017, 'answer_relevancy': 0.8695, 'context_entity_recall': 0.5012, 'noise_sensitivity_relevant': 0.2463}

#### ❓ Question: 

Which system performed better, on what metrics, and why?

pulling 20 resullts and then using Cohere's reranking model to rerank the top 5 likely improved context recall as we are reranking to pull the most relevant responses before evaluating. This is the metric that saw the largest improvement.


**context_recall from 0.6327 to 0.7167**
pulling 20 results and then reranking before evaluating the top 5 likely improved context recall as we are reranking to pull the most relevant responses before evaluating. This is the metric that saw the largest improvement.

**faithfulness from 0.8297 to 0.7500**
Reranking prioritizes relevance, but it does not necessarily filter for faithfulness, meaning some answers could have pulled in less strictly accurate or inferred information.

**noise_sensitivity_relevant from 0.1821 to 0.2286**
noise likely went up very slightly even though reranking was used because we pulled form a larger set of results (20 vs 5) and could have introduced more incorrect responses to be reranked.


Saw minor changes between system here.
**factual_correctness from 0.4482 to 0.4327**
**answer_relevancy from 0.8712 to 0.8741**
**context_entity_recall from 0.3993 to 0.4100**


I ran the tests twice and got two different results. I still saw a very slight improvement in context recall and context entity recall. but not large changes across the board.  Running the test each time takes over a half hour and costs several dollars so I didn't continue to run it after those runs.


two blobs for comparison

{'context_recall': 0.6327, 'faithfulness': 0.8297, 'factual_correctness': 0.4482, 'answer_relevancy': 0.8712, 'context_entity_recall': 0.3993, 'noise_sensitivity_relevant': 0.1821}

{'context_recall': 0.7167, 'faithfulness': 0.7500, 'factual_correctness': 0.4327, 'answer_relevancy': 0.8741, 'context_entity_recall': 0.4100, 'noise_sensitivity_relevant': 0.2286}

Second run and both blobs

{'context_recall': 0.8214, 'faithfulness': 0.7833, 'factual_correctness': 0.5125, 'answer_relevancy': 0.8699, 'context_entity_recall': 0.4903, 'noise_sensitivity_relevant': 0.2863}

{'context_recall': 0.8393, 'faithfulness': 0.7939, 'factual_correctness': 0.5017, 'answer_relevancy': 0.8695, 'context_entity_recall': 0.5012, 'noise_sensitivity_relevant': 0.2463}
