# **RAG Evaluation using RAGAS**

Authored by [Kalyan KS](https://www.linkedin.com/in/kalyanksnlp/). To stay updated with LLMs, RAG and Agents, you can follow him on [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/), [Twitter](https://x.com/kalyan_kpl) and [YouTube](https://youtube.com/@kalyanksnlp?si=ZdoC0WPN9TmAOvKB).

- RAGAS is a popular open-source library for RAG evaluation.
- RAGAS includes metrics for evaluating both the retriever and generator components of a RAG system.

In [1]:
!pip install -qU ragas langchain-openai

In [2]:
from google.colab import userdata
import os
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# **RAG Retriever Evaluation**

## **Context Precision**

- Context Precision is a metric that evaluates how well a RAG retriever ranks relevant chunks within the retrieved contexts.

- Formula is
$$
\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left( \text{Precision@k} \times v_k \right)}{\text{Total number of relevant items in the top } K \text{ results}}
$$

Here $K$ is the number of retrieved chunks and $v_k \in \{0, 1\}$ indicates whether the chunk at rank $k$ is relevant.
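To make the formula concrete, here is a minimal pure-Python sketch that computes Context Precision@K from a list of 0/1 relevance flags. The function name `context_precision_at_k` and the flag lists are illustrative assumptions, not part of the RAGAS API. In RAGAS the relevance of each chunk is judged by the evaluator LLM.

```python
def context_precision_at_k(relevance):
    """relevance: 0/1 flags v_k for the ranked chunks, rank 1 first (hypothetical input)."""
    hits = 0            # relevant chunks seen so far
    weighted_sum = 0.0  # running sum of Precision@k * v_k
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k
        precision_at_k = hits / k
        weighted_sum += precision_at_k * v_k
    # Divide by the number of relevant chunks in the top K (0.0 if there are none)
    return weighted_sum / hits if hits else 0.0

# Only the top-ranked chunk is relevant
print(context_precision_at_k([1, 0, 0, 0]))  # 1.0

# Relevant chunks sit at ranks 2 and 4, so the score drops
print(context_precision_at_k([0, 1, 0, 1]))  # 0.5
```

The second call shows that the metric rewards ranking: the same number of relevant chunks placed lower in the list yields a lower score.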

In [None]:
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithReference

# Set up the LLM
llm = ChatOpenAI(model="gpt-4o-mini")

# Initialize the metric
evaluator_llm = LangchainLLMWrapper(llm)
context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)

# Define the test case
query = "Will it rain this afternoon?"
response = "There's a 60% chance of rain after 2 PM today."
reference = "Expect a 60% probability of rainfall this afternoon after 2 PM."
context = [
    "The weather forecast indicates a 60% chance of rain starting after 2 PM today.",
    "Temperatures will drop slightly in the afternoon due to cloud cover.",
    "Yesterday’s forecast was unrelated to today’s weather patterns.",
    "Rain is more likely in the northern regions this afternoon."
]

sample = SingleTurnSample(
    user_input= query,
    reference= reference,
    retrieved_contexts=context,
)

# Compute the metric score
await context_precision.single_turn_ascore(sample)

0.9999999999

## **Context Recall**
- Context Recall is computed as the ratio of the number of ground truth claims supported by the context to the total number of ground truth claims.
- Formula is
$$
\text{Context Recall} = \frac{|\text{Number of GT claims that can be attributed to context}|}{|\text{Total number of claims in GT}|}
$$

Here "GT" refers to ground truth.
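The ratio can be sketched in a few lines of plain Python. The helper `context_recall_score` and the hand-labelled claims below are hypothetical. In RAGAS the evaluator LLM splits the reference into claims and decides whether each one is attributable to the retrieved context.

```python
def context_recall_score(gt_claims):
    """gt_claims: list of (claim, attributable_to_context) pairs (hypothetical labels)."""
    if not gt_claims:
        return 0.0
    attributed = sum(1 for _, supported in gt_claims if supported)
    return attributed / len(gt_claims)

# Assumed breakdown of a reference answer into three claims, all supported
gt_claims = [
    ("A power outage occurred last night", True),
    ("A thunderstorm caused the outage", True),
    ("Electrical infrastructure was damaged", True),
]
print(context_recall_score(gt_claims))  # 1.0

# If the context missed the third claim, recall drops to 2/3
partial = gt_claims[:2] + [("Electrical infrastructure was damaged", False)]
print(round(context_recall_score(partial), 3))  # 0.667
```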

In [None]:
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import LLMContextRecall

# Set up the LLM
llm = ChatOpenAI(model="gpt-4o-mini")

# Initialize the metric
evaluator_llm = LangchainLLMWrapper(llm)
context_recall = LLMContextRecall(llm=evaluator_llm)

# Define the test case
query = "What caused the power outage last night?"
response = "The power outage was due to a severe thunderstorm that damaged power lines."
reference = "Last night's power outage resulted from a thunderstorm causing damage to electrical infrastructure."
context = [
    "A severe thunderstorm passed through the area last night, bringing strong winds.",
    "Power lines were reported damaged around 10 PM due to fallen trees from the storm."
]

sample = SingleTurnSample(
    user_input= query,
    response= response,
    reference= reference,
    retrieved_contexts= context,
)

# Compute the metric score
await context_recall.single_turn_ascore(sample)

1.0

## **Context Entities Recall**

- Context Entities Recall is computed as the ratio of the number of entities common to the reference and the retrieved context to the total number of entities in the reference.
- Formula is
$$
\text{Context Entities Recall} = \frac{\text{Number of common entities between RCE and RE}}{\text{Total number of entities in RE}}
$$

Here RE represents the reference entities and RCE represents the retrieved context entities.
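As a rough illustration, the formula reduces to a set intersection. The function name `context_entity_recall_score` and the entity sets below are assumptions chosen for the example. In RAGAS the entities are extracted by the evaluator LLM, so the actual sets may differ.

```python
def context_entity_recall_score(reference_entities, context_entities):
    """Fraction of reference entities that also appear in the retrieved context."""
    re_set = set(reference_entities)
    if not re_set:
        return 0.0
    common = re_set & set(context_entities)
    return len(common) / len(re_set)

# Hypothetical extraction: the reference mentions France and Paris,
# while the retrieved context mentions France and Europe.
score = context_entity_recall_score({"France", "Paris"}, {"France", "Europe"})
print(score)  # 0.5
```

Only one of the two reference entities is found in the context, so half of the reference entities are recalled.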

In [None]:
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas import SingleTurnSample
from ragas.metrics import ContextEntityRecall
import asyncio

# Set up the LLM
llm = ChatOpenAI(model="gpt-4o-mini")

# Initialize the metric
evaluator_llm = LangchainLLMWrapper(llm)
context_entity_recall = ContextEntityRecall(llm=evaluator_llm)

# Define the test case
query = "What is the capital city of France?"
reference = "The capital city of France is Paris."
response = "Paris is the capital of France."
context = "France is a country in Europe with a rich history and culture."

sample = SingleTurnSample(
    user_input=query,
    reference=reference,
    response=response,
    retrieved_contexts=[context],
)

# Compute the metric
score = asyncio.run(context_entity_recall.single_turn_ascore(sample))

# Output the result
print(f"Context Entities Recall Score: {score}")

Context Entities Recall Score: 0.4999999975


# **RAG Generator Evaluation**

## **Response Relevancy**

- The Response Relevancy metric evaluates how relevant a generated response is to the original user query.
- Formula is

$$
\text{Response Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o) = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\| \|E_o\|}
$$

Where:

- $E_o$: Embedding of the user query.
- $E_{g_i}$: Embedding of the $i$-th synthetic query, generated by the LLM from the response.
- $N$: Number of synthetic queries (default is 3).
- $\cos(E_{g_i}, E_o)$: Cosine similarity between the two embeddings.
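The averaging step can be sketched with toy vectors. The two-dimensional embeddings below are made up for illustration, and the helper names `cosine` and `response_relevancy_score` are not RAGAS functions. Real embeddings come from the embedding model and have many more dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def response_relevancy_score(query_emb, synthetic_embs):
    """Mean cosine similarity between the query and each synthetic query."""
    return sum(cosine(g, query_emb) for g in synthetic_embs) / len(synthetic_embs)

E_o = [1.0, 0.0]                            # toy embedding of the user query
E_g = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings of N = 3 synthetic queries

score = response_relevancy_score(E_o, E_g)
print(round(score, 4))  # 0.569
```

The three similarities are 1, 0 and about 0.7071, so the mean is about 0.569. A response that leads to synthetic queries close to the original query scores near 1.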

In [None]:
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas import SingleTurnSample
from ragas.metrics import ResponseRelevancy
import asyncio

# Set up the LLM
llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)

# Set up the embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
evaluator_embeddings = LangchainEmbeddingsWrapper(embeddings)

# Initialize the metric
response_relevancy = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings)

# Define the test case
query = "What is the tallest mountain in the world?"
response = "Mount Everest is the tallest mountain in the world, standing at 8,848 meters (29,029 feet)."
context = [
        "Mount Everest, located in the Himalayas on the border between Nepal and China, has an elevation of 8,848 meters above sea level, making it the highest peak on Earth.",
        "The height of Mount Everest was officially determined to be 8,848 meters by the Survey of India in 1955."
    ]

sample = SingleTurnSample(
    user_input = query,
    response = response,
    retrieved_contexts = context
)

# Compute the metric
score = asyncio.run(response_relevancy.single_turn_ascore(sample))

# Display the score
print(f"Response Relevancy Score: {score}")

Response Relevancy Score: 0.9102625215899899


## **Faithfulness**

- The Faithfulness metric measures how factually consistent a generated response is with the retrieved context.
- It is computed as the ratio of the number of claims in the response supported by the retrieved context to the total number of claims in the response.
- Formula is

$$
\text{Faithfulness Score} = \frac{\text{Number of claims in the response supported by the retrieved context}}{\text{Total number of claims in the response}}
$$
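A minimal sketch of the ratio, assuming the response has already been split into claims and each claim has a support verdict. The `faithfulness_score` helper and the verdicts are hypothetical. In RAGAS both steps are carried out by the evaluator LLM.

```python
def faithfulness_score(verdicts):
    """verdicts: maps each response claim to True if the retrieved context supports it."""
    if not verdicts:
        return 0.0
    return sum(verdicts.values()) / len(verdicts)

# Assumed claim breakdown of a response, with hand-labelled verdicts
verdicts = {
    "Eating fruits and vegetables daily improves your diet": True,
    "Drinking enough water improves your diet": True,
    "Avoiding processed foods improves your diet": True,
}
print(faithfulness_score(verdicts))  # 1.0

# One extra claim with no support in the context lowers the score to 3/4
verdicts["Skipping breakfast improves your diet"] = False
print(faithfulness_score(verdicts))  # 0.75
```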

In [None]:
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas import SingleTurnSample
from ragas.metrics import Faithfulness
import asyncio

# Set up the LLM
llm = ChatOpenAI(model="gpt-4o-mini")

# Initialize the metric
evaluator_llm = LangchainLLMWrapper(llm)
faithfulness = Faithfulness(llm=evaluator_llm)

# Define the test case
query = "What are some tips for maintaining a healthy diet?"
response = "Eating fruits and vegetables daily, drinking enough water, and avoiding processed foods can improve your diet."
context = [
    "A healthy diet includes regular consumption of fruits and vegetables.",
    "Staying hydrated by drinking sufficient water is essential for good health.",
    "Processed foods should be limited to maintain a balanced diet."
]

sample = SingleTurnSample(
    user_input = query,
    response = response,
    retrieved_contexts = context,
)

# Compute the metric
score = asyncio.run(faithfulness.single_turn_ascore(sample))

# Output the result
print(f"Faithfulness Score: {score}")

Faithfulness Score: 1.0


## **Relevant Noise Sensitivity**

- Relevant Noise Sensitivity is the ratio of the number of incorrect claims in a model’s response that are entailed (i.e., supported or implied) by relevant retrieved chunks to the total number of response claims.

- A relevant chunk is a chunk that contains at least one claim from the ground truth answer.

- Formula is
$$
\text{Relevant Noise Sensitivity} = \frac{\text{Number of incorrect claims in the model response entailed by relevant chunks}}{\text{Total number of claims in the response}}
$$
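The counting behind this formula can be sketched as follows. The function `relevant_noise_sensitivity_score` and the hand-labelled verdicts are assumptions for illustration. In RAGAS the evaluator LLM decides which claims are correct and which chunks entail them.

```python
def relevant_noise_sensitivity_score(claims):
    """claims: list of (is_correct, entailed_by_relevant_chunk) pairs, one per response claim."""
    if not claims:
        return 0.0
    # Count claims that are wrong but still supported by a relevant chunk
    noisy = sum(1 for is_correct, entailed in claims if not is_correct and entailed)
    return noisy / len(claims)

# Hypothetical verdicts for a two-claim response
claims = [
    (True, True),   # "Leonardo da Vinci painted the Mona Lisa": correct
    (False, True),  # "It was painted in the 15th century": incorrect, yet entailed by a relevant chunk
]
print(relevant_noise_sensitivity_score(claims))  # 0.5
```

Lower is better for this metric: a score of 0 means no incorrect claim in the response was picked up from a relevant chunk.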

In [3]:
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas import SingleTurnSample
from ragas.metrics import NoiseSensitivity
import asyncio

# Set up the LLM
llm = ChatOpenAI(model="gpt-4o-mini")

# Initialize the metric
evaluator_llm = LangchainLLMWrapper(llm)
relevant_noise_sensitivity = NoiseSensitivity(llm=evaluator_llm, mode="relevant")

# Define the test case
query = "Who painted the Mona Lisa and in what century was it painted?"
response = "Leonardo da Vinci painted the Mona Lisa, and it was painted in the 15th century."
reference = "Leonardo da Vinci painted the Mona Lisa in the 16th century."
context = [
    "The Mona Lisa is a famous portrait painted by Leonardo da Vinci. It is believed to have been started in the early 1500s. Some art historians date its completion to around 1519, placing it firmly in the 15th century according to certain periodizations."
]

sample = SingleTurnSample(
    user_input = query,
    response = response,
    reference = reference,
    retrieved_contexts = context,
)

# Compute the metric
score = asyncio.run(relevant_noise_sensitivity.single_turn_ascore(sample))

# Output the result
print(f"Relevant Noise Sensitivity Score: {score}")

Relevant Noise Sensitivity Score: 0.5


## **Irrelevant Noise Sensitivity**

- Irrelevant Noise Sensitivity is the ratio of the number of incorrect claims in a model’s response that are entailed (i.e., supported or implied) by irrelevant retrieved chunks to the total number of response claims.

- An irrelevant chunk is a chunk with no claims from the ground truth answer.

- Formula is
$$
\text{Irrelevant Noise Sensitivity} = \frac{\text{Number of incorrect claims in the model response entailed by irrelevant chunks}}{\text{Total number of claims in the response}}
$$
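The sketch below first marks chunks as irrelevant when they contain no ground truth claims, then counts incorrect response claims entailed by those chunks. All names and labels are hypothetical. In RAGAS these judgments come from the evaluator LLM.

```python
def irrelevant_noise_sensitivity_score(chunk_gt_claim_counts, response_claims):
    """
    chunk_gt_claim_counts: number of ground truth claims found in each retrieved chunk.
    response_claims: dicts with "correct" (bool) and "entailed_by" (set of chunk indices).
    """
    if not response_claims:
        return 0.0
    # A chunk is irrelevant when it contains no ground truth claims
    irrelevant = {i for i, n in enumerate(chunk_gt_claim_counts) if n == 0}
    noisy = sum(
        1 for c in response_claims
        if not c["correct"] and (c["entailed_by"] & irrelevant)
    )
    return noisy / len(response_claims)

# Assumed labels: chunk 0 contains the ground truth claim, chunk 1 does not
chunk_gt_claim_counts = [1, 0]
response_claims = [
    {"correct": False, "entailed_by": set()},  # wrong author, supported by no chunk
    {"correct": False, "entailed_by": {1}},    # not in the reference, entailed by the irrelevant chunk
]
print(irrelevant_noise_sensitivity_score(chunk_gt_claim_counts, response_claims))  # 0.5
```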

In [None]:
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas import SingleTurnSample
from ragas.metrics import NoiseSensitivity
import asyncio

# Set up the LLM
llm = ChatOpenAI(model="gpt-4o-mini")

# Initialize the metric
evaluator_llm = LangchainLLMWrapper(llm)
irrelevant_noise_sensitivity = NoiseSensitivity(llm=evaluator_llm, mode="irrelevant")

# Define the test case
query = "Who wrote the novel 'Pride and Prejudice'?"
response = "Charlotte Brontë wrote 'Pride and Prejudice,' and she is famous for 'Jane Eyre.'"
reference = "Jane Austen wrote 'Pride and Prejudice.'"
context = [
    "Jane Austen published 'Pride and Prejudice' in 1813, a classic romance novel.",
    "Charlotte Brontë, a renowned author, is best known for her novel 'Jane Eyre,' published in 1847."
]

sample = SingleTurnSample(
    user_input = query,
    response = response,
    reference = reference,
    retrieved_contexts = context,
)

# Compute the metric
score = asyncio.run(irrelevant_noise_sensitivity.single_turn_ascore(sample))

# Output the result
print(f"Irrelevant Noise Sensitivity Score: {score}")

Irrelevant Noise Sensitivity Score: 0.5


# **RAG Evaluation using RAGAS - Full Example**

In [4]:
import pandas as pd
from langchain_openai import ChatOpenAI
from ragas import EvaluationDataset, evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    Faithfulness,
    LLMContextRecall,
)

In [11]:
# Sample dataframe (replace with your actual dataframe)
data = {
    "query": ["What is the capital of France?", "Who invented the telephone?"],
    "reference": ["The capital of France is Paris.", "Alexander Graham Bell invented the telephone."],
    "response": ["The capital of France is Paris.", "The telephone was invented by Alexander Graham Bell."],
    "context": [
        ["France is a country in Europe. Its capital is Paris."],
        ["Alexander Graham Bell was an inventor. He is credited with inventing the telephone."]
    ]
}
df = pd.DataFrame(data)

In [12]:
# Prepare the data as a list of dictionaries for EvaluationDataset
evaluation_data = [
    {
        "user_input": row["query"],
        "reference": row["reference"],
        "response": row["response"],
        "retrieved_contexts": row["context"]
    }
    for _, row in df.iterrows()
]

# Create an EvaluationDataset object
dataset = EvaluationDataset.from_list(evaluation_data)

In [13]:
# Set up the LLM
llm = ChatOpenAI(model="gpt-4o-mini")

In [14]:
# Initialize the metrics
evaluator_llm = LangchainLLMWrapper(llm)
context_recall = LLMContextRecall(llm=evaluator_llm)
faithfulness = Faithfulness(llm=evaluator_llm)


In [15]:
# Define the metrics to evaluate
metrics = [
    faithfulness,
    context_recall
]

In [16]:
# Compute the metric scores
results = evaluate(
    dataset=dataset,
    metrics=metrics
)

In [19]:
print(results)

{'faithfulness': 1.0000, 'context_recall': 1.0000}
