# **RAG Evaluation using DeepEval**

Authored by [Kalyan KS](https://www.linkedin.com/in/kalyanksnlp/). To stay updated with LLMs, RAG and Agents, you can follow him on [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/), [Twitter](https://x.com/kalyan_kpl) and [YouTube](https://youtube.com/@kalyanksnlp?si=ZdoC0WPN9TmAOvKB).



- DeepEval is a simple-to-use, open-source LLM evaluation framework.
- DeepEval includes popular metrics to evaluate both the retriever and generator components of a RAG system.

In [None]:
!pip install -qU deepeval

In [None]:
from google.colab import userdata
import os
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

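Every `LLMTestCase` in this notebook is created with the `query` (passed as `input`) and the `response` (passed as `actual_output`).

Pass these two along with whatever other inputs the chosen metric needs, such as the `reference` (passed as `expected_output`) or the retrieved `context`.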
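As a quick reference, the sketch below records which `LLMTestCase` fields each metric cell in this notebook passes. The mapping is read off the cells in this notebook rather than taken from DeepEval's documentation, and `FIELDS_USED` and `missing_fields` are hypothetical helper names, not part of the library.

```python
# Fields passed to LLMTestCase for each metric, as used in this notebook's cells.
# This table is an illustration, not DeepEval's own API.
FIELDS_USED = {
    "ContextualPrecisionMetric": {"input", "actual_output", "expected_output", "retrieval_context"},
    "ContextualRecallMetric": {"input", "actual_output", "expected_output", "retrieval_context"},
    "ContextualRelevancyMetric": {"input", "actual_output", "retrieval_context"},
    "AnswerRelevancyMetric": {"input", "actual_output"},
    "FaithfulnessMetric": {"input", "actual_output", "retrieval_context"},
    "HallucinationMetric": {"input", "actual_output", "context"},
}


def missing_fields(metric_name, provided):
    """Return the fields from the table above that are absent from `provided`, sorted."""
    return sorted(FIELDS_USED[metric_name] - set(provided))


# A recall test case without a reference lacks expected_output.
print(missing_fields("ContextualRecallMetric", ["input", "actual_output", "retrieval_context"]))
# Answer relevancy only needs the query and the response.
print(missing_fields("AnswerRelevancyMetric", ["input", "actual_output"]))
```

Running a check like this before building test cases can catch a missing column early, before any evaluation model is called.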

# **RAG Retriever Evaluation**

## **Context Precision**

- Context Precision is a metric that evaluates how well a RAG system ranks relevant chunks within the retrieved contexts.

- Formula is
$$
\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left( \text{Precision@k} \times v_k \right)}{\text{Total number of relevant items in the top } K \text{ results}}
$$
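The formula can be traced by hand with a few lines of plain Python. This is a minimal sketch, assuming $v_k$ is 1 when the chunk at rank $k$ is relevant and 0 otherwise, and that $\text{Precision@k}$ is the share of relevant chunks among the top $k$. The relevance verdicts are hand-assigned here, whereas DeepEval obtains them from the evaluation model; `context_precision_at_k` is a hypothetical helper, not a DeepEval function.

```python
def context_precision_at_k(verdicts):
    """verdicts[k-1] is 1 if the chunk at rank k is relevant, else 0."""
    relevant_so_far = 0
    weighted_sum = 0.0
    for k, v_k in enumerate(verdicts, start=1):
        relevant_so_far += v_k
        precision_at_k = relevant_so_far / k   # Precision@k
        weighted_sum += precision_at_k * v_k   # only relevant ranks contribute
    total_relevant = sum(verdicts)
    return weighted_sum / total_relevant if total_relevant else 0.0


# Relevant chunk ranked first, three irrelevant chunks after it.
print(context_precision_at_k([1, 0, 0, 0]))  # 1.0
# Same chunks, but the relevant one is ranked second.
print(context_precision_at_k([0, 1, 0, 0]))  # 0.5
```

The second call shows why this metric is about ranking: the set of retrieved chunks is the same, but the score halves when the relevant chunk slips one position.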

In [None]:
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

# Initialize the metric
metric = ContextualPrecisionMetric(
    threshold=0.75,
    model="gpt-4o-mini",
    include_reason=True
)

# Define the test case
query = "Will it rain this afternoon?"
response = "There's a 60% chance of rain after 2 PM today."
reference = "Expect a 60% probability of rainfall this afternoon after 2 PM."
context = [
    "The weather forecast indicates a 60% chance of rain starting after 2 PM today.",
    "Temperatures will drop slightly in the afternoon due to cloud cover.",
    "Yesterday’s forecast was unrelated to today’s weather patterns.",
    "Rain is more likely in the northern regions this afternoon."
]

test_case = LLMTestCase(
    input=query,
    actual_output=response,
    expected_output=reference,
    retrieval_context=context
)

# Compute the metric
metric.measure(test_case)

# Display score and explanation
print(f"Context Precision Score: {metric.score}")
print(f"Explanation: {metric.reason}")


Context Precision Score: 1.0
Explanation: The score is 1.00 because all relevant nodes, like the first node, directly state a 60% chance of rain after 2 PM, placing them at the top of the ranking. The irrelevant nodes rank lower since they provide details such as temperatures dropping (2nd node) or unrelated forecasts (3rd node) that do not answer the question regarding rain. This clear hierarchy of relevant over irrelevant information justifies the perfect score.


## **Context Recall**
- Context Recall is computed as the ratio of the number of ground truth claims supported by the context to the total number of ground truth claims.
- Formula is
$$
\text{Context Recall} = \frac{\text{Number of GT claims that can be attributed to the context}}{\text{Total number of claims in GT}}
$$

Here "GT" refers to ground truth.
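The ratio can be sketched in plain Python. The split of the reference into claims and the True/False labels below are hand-made assumptions for illustration; in DeepEval both steps are carried out by the evaluation model. `context_recall` is a hypothetical helper name.

```python
def context_recall(claim_supported):
    """claim_supported maps each ground-truth claim to True if the context supports it."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported.values()) / len(claim_supported)


# Both claims are backed by the retrieved context.
full_context = {
    "A thunderstorm occurred last night.": True,
    "The storm damaged electrical infrastructure.": True,
}
print(context_recall(full_context))  # 1.0

# Hypothetical case: the chunk about damaged power lines was not retrieved.
partial_context = {
    "A thunderstorm occurred last night.": True,
    "The storm damaged electrical infrastructure.": False,
}
print(context_recall(partial_context))  # 0.5
```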

In [None]:
from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase

# Initialize the metric
metric = ContextualRecallMetric(
    threshold=0.8,
    model="gpt-4o-mini",
    include_reason=True
)

# Define the test case
query = "What caused the power outage last night?"
response = "The power outage was due to a severe thunderstorm that damaged power lines."
reference = "Last night's power outage resulted from a thunderstorm causing damage to electrical infrastructure."
context = [
    "A severe thunderstorm passed through the area last night, bringing strong winds.",
    "Power lines were reported damaged around 10 PM due to fallen trees from the storm."
]

test_case = LLMTestCase(
    input=query,
    actual_output=response,
    expected_output=reference,
    retrieval_context=context
)

# Compute the metric
metric.measure(test_case)

# Display score and explanation
print(f"Context Recall Score: {metric.score}")
print(f"Explanation: {metric.reason}")


Context Recall Score: 1.0
Explanation: The score is 1.00 because all aspects of the expected output are directly supported by the information in the nodes in retrieval context, specifically aligning perfectly with the details of the thunderstorm and the resulting power outage.


## **Context Relevancy**
- Context Relevancy is computed as the ratio of the number of statements in the context relevant to the user query to the total number of statements in the context.
- Formula is
$$
\text{Contextual Relevancy} = \frac{\text{Number of statements in the context relevant to the query}}{\text{Total number of statements in the context}}
$$
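As a worked instance, take the stock market example in this section. The retrieved context holds three statements, and the reasoning printed in the cell output treats only the tech sell-off statement as relevant, so

$$
\text{Contextual Relevancy} = \frac{1}{3} \approx 0.33
$$

A reader might reasonably count the inflation statement as relevant too, which would give $2/3$. The score depends on the verdicts of the evaluation model, so it is worth reading the reason alongside the number.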

In [None]:
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

# Initialize the metric
metric = ContextualRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

# Define the test case
query = "Why did the stock market drop today?"
response = "The stock market dropped due to concerns over rising inflation rates and a tech sector sell-off."
context = [
    "Recent economic reports showed inflation reaching a 5-year high this month.",
    "Major tech companies reported disappointing earnings, triggering a sell-off.",
    "The weather was sunny and pleasant throughout the day."
]

test_case = LLMTestCase(
    input=query,
    actual_output=response,
    retrieval_context=context
)

# Compute the metric
metric.measure(test_case)

# Display score and explanation
print(f"Context Relevancy Score: {metric.score}")
print(f"Reasoning: {metric.reason}")


Context Relevancy Score: 0.3333333333333333
Reasoning: The score is 0.33 because while the relevant statement indicates that 'Major tech companies reported disappointing earnings, triggering a sell-off,' it is overshadowed by the irrelevant reasons regarding inflation and weather, which do not address the stock market drop directly.


# **RAG Generator Evaluation**

## **Response Relevancy**

- The Response Relevancy metric evaluates how relevant a generated response is to the original user query.
- Formula is

$$
\text{Response Relevancy Score} = \frac{\text{Number of statements relevant to the user query in the response}}{\text{Total number of statements in the response}}
$$

**Note** - The Response Relevancy metric is also referred to as Answer Relevancy.
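As a worked instance, assume the response in the example for this metric is split into four statements: three coding tips (practice daily, read documentation, build small projects) and one remark about the weather. Only the three tips address the query, so

$$
\text{Response Relevancy Score} = \frac{3}{4} = 0.75
$$

This agrees with the score reported in the cell output, although the exact statement split is chosen by the evaluation model and is assumed here.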

In [None]:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Initialize the metric
metric = AnswerRelevancyMetric(
    threshold=0.75,
    model="gpt-4o-mini",
    include_reason=True
)

# Define the test case
query = "How can I improve my coding skills?"
response = "Practice daily, read documentation, and build small projects. The weather is nice today."

test_case = LLMTestCase(
    input=query,
    actual_output=response
)

# Compute the score
metric.measure(test_case)

# Display score and explanation
print(f"Answer Relevancy Score: {metric.score}")
print(f"Explanation: {metric.reason}")


Answer Relevancy Score: 0.75
Explanation: The score is 0.75 because the output included an irrelevant statement about the weather, which does not contribute to the topic of improving coding skills. However, it still provided some relevant tips, which justifies the current score.


## **Faithfulness**

- The Faithfulness metric measures how factually consistent a generated response is with the retrieved context.
- The Faithfulness metric is computed as the ratio of the number of claims in the response supported by the retrieved context to the total number of claims in the response.
- Formula is

$$
\text{Faithfulness Score} = \frac{\text{Number of claims in the response supported by the retrieved context}}{\text{Total number of claims in the response}}
$$
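The sketch below applies this ratio to hand-labeled claims and also reports which claims lack support. The claim split and labels are assumptions for illustration (DeepEval derives them with the evaluation model), the fourth claim is a hypothetical addition that does not appear in the notebook, and `faithfulness` is a hypothetical helper name.

```python
def faithfulness(claims):
    """claims is a list of (claim_text, supported_by_context) pairs."""
    unsupported = [text for text, supported in claims if not supported]
    score = (len(claims) - len(unsupported)) / len(claims) if claims else 0.0
    return score, unsupported


claims = [
    ("Eating fruits and vegetables daily improves your diet.", True),
    ("Drinking enough water improves your diet.", True),
    ("Avoiding processed foods improves your diet.", True),
    # Hypothetical extra claim with no support in the retrieved context.
    ("Skipping breakfast speeds up weight loss.", False),
]

score, unsupported = faithfulness(claims)
print(score)        # 0.75
print(unsupported)  # the single unsupported claim
```

With only the first three claims, as in the notebook's example, every claim is supported and the score is 1.0.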

In [None]:
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Initialize the metric
metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

# Define the test case
query = "What are some tips for maintaining a healthy diet?"
response = "Eating fruits and vegetables daily, drinking enough water, and avoiding processed foods can improve your diet."
context = [
    "A healthy diet includes regular consumption of fruits and vegetables.",
    "Staying hydrated by drinking sufficient water is essential for good health.",
    "Processed foods should be limited to maintain a balanced diet."
]

test_case = LLMTestCase(
    input=query,
    actual_output=response,
    retrieval_context=context
)

# Compute the score
metric.measure(test_case)

# Display score and explanation
print(f"Faithfulness Score: {metric.score}")
print(f"Explanation: {metric.reason}")


Faithfulness Score: 1.0
Explanation: The score is 1.00 because there are no contradictions, indicating that the actual output aligns perfectly with the retrieval context. Great job maintaining consistency!


## **Hallucination**

- The Hallucination metric measures how factually inconsistent a generated response is with the provided context. Unlike the other metrics here, a lower score is better.
- In DeepEval, the Hallucination metric is computed as the ratio of the number of context entries contradicted by the response to the total number of context entries. The test case supplies these entries through the `context` argument rather than `retrieval_context`.
- Formula is

$$
\text{Hallucination Score} = \frac{\text{Number of contexts contradicted by the response}}{\text{Total number of contexts}}
$$
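Because a lower Hallucination score is better, its threshold reads differently from the other metrics. DeepEval's documentation describes this metric's `threshold` as a maximum passing value; treat that as an assumption to verify against the installed version. The sketch below shows the two comparison directions with a hypothetical helper `passes`, which is not a DeepEval function.

```python
def passes(score, threshold, lower_is_better=False):
    """Compare a metric score with its threshold in the appropriate direction."""
    if lower_is_better:
        return score <= threshold   # e.g. Hallucination: threshold is an upper bound
    return score >= threshold       # e.g. Faithfulness: threshold is a lower bound


# Scores and thresholds taken from the examples in this notebook.
print(passes(1.0, 0.7))                          # Faithfulness: True
print(passes(1 / 3, 0.7))                        # Context Relevancy: False
print(passes(1 / 3, 0.6, lower_is_better=True))  # Hallucination: True
```

The same score of about 0.33 fails for Context Relevancy and passes for Hallucination, which is why the direction of each metric matters when reading results.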



In [None]:
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# Initialize the metric
metric = HallucinationMetric(
    threshold=0.6,
    model="gpt-4o-mini",
    include_reason=True
)

# Define the test case
query = "What are some effective exercises for building strength?"
response = "Lifting weights and doing push-ups can help build muscle strength."
context = [
    "Strength training often involves weight lifting to increase muscle mass.",
    "Bodyweight exercises like push-ups are effective for building strength.",
    "Yoga improves both flexibility and muscular strength through specific poses."
]

test_case = LLMTestCase(
    input=query,
    actual_output=response,
    context=context
)

# Compute the score
metric.measure(test_case)

# Display score and explanation
print(f"Hallucination Score: {metric.score}")
print(f"Reasoning: {metric.reason}")


Hallucination Score: 0.3333333333333333
Reasoning: The score is 0.33 because the actual output aligns with the provided contexts regarding strength training and bodyweight exercises, but it contradicts by omitting yoga and its benefits, indicating some misinformation.


# **RAG Evaluation using DeepEval - Full Example**

In [None]:
import pandas as pd
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, ContextualRelevancyMetric

In [None]:
# DataFrame with RAG outputs
data = {
    "query": ["What is the capital of France?", "Who won the 2020 US election?"],
    "reference": ["The capital of France is Paris.", "Joe Biden won the 2020 US election."],
    "response": ["The capital of France is Paris.", "Joe Biden won the election in 2020."],
    "context": [
        ["France is a country in Europe.", "The capital of France is Paris."],
        ["The 2020 US election was held on November 3.", "Joe Biden was declared the winner."]
    ]
}
df = pd.DataFrame(data)

In [None]:
# Convert DataFrame rows to LLMTestCase objects
def create_test_cases(df):
    test_cases = []
    for index, row in df.iterrows():
        test_case = LLMTestCase(
            input=row["query"],
            actual_output=row["response"],
            retrieval_context=row["context"]
        )
        test_cases.append(test_case)
    return test_cases

# Create test cases from the DataFrame
test_cases = create_test_cases(df)

In [None]:
# Initialize the metrics
faithfulness_metric = FaithfulnessMetric(
    threshold=0.7,  # Minimum score to pass (0-1 scale)
    model="gpt-4o-mini",
    include_reason=True
)

context_relevancy_metric = ContextualRelevancyMetric(
    threshold=0.7,  # Minimum score to pass (0-1 scale)
    model="gpt-4o-mini",
    include_reason=True
)

In [None]:
# Compute metric scores
evaluation_results = evaluate(
    test_cases=test_cases,
    metrics=[faithfulness_metric, context_relevancy_metric]
)

Evaluating 2 test case(s) in parallel: |██████████|100% (2/2) [Time Taken: 00:05,  2.99s/test case]



Metrics Summary

  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-4o-mini, reason: The score is 1.00 because there are no contradictions, indicating that the actual output perfectly aligns with the retrieval context., error: None)
  - ❌ Contextual Relevancy (score: 0.5, threshold: 0.7, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.50 because while the relevant statement 'The capital of France is Paris.' directly answers the query, the irrelevant context mentions 'France is a country in Europe,' which does not help in identifying the capital., error: None)

For test case:

  - input: What is the capital of France?
  - actual output: The capital of France is Paris.
  - expected output: None
  - context: None
  - retrieval context: ['France is a country in Europe.', 'The capital of France is Paris.']


Metrics Summary

  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-4o-mini, reason: The score




In [None]:
# Process and display results
results = []
for test_result in evaluation_results.test_results:
    # Extract metrics data
    faithfulness_data = next(m for m in test_result.metrics_data if m.name == "Faithfulness")
    context_relevancy_data = next(m for m in test_result.metrics_data if m.name == "Contextual Relevancy")

    # Append result dictionary
    results.append({
        "query": test_result.input,
        "response": test_result.actual_output,
        "faithfulness_score": faithfulness_data.score,
        "faithfulness_reason": faithfulness_data.reason,
        "context_relevancy_score": context_relevancy_data.score,
        "context_relevancy_reason": context_relevancy_data.reason
    })

In [None]:
# Convert results to DataFrame for easier viewing
results_df = pd.DataFrame(results)

# Print the results
print("Evaluation Results:")
print(results_df)

Evaluation Results:
                            query                             response  \
0  What is the capital of France?      The capital of France is Paris.   
1   Who won the 2020 US election?  Joe Biden won the election in 2020.   

   faithfulness_score                                faithfulness_reason  \
0                 1.0  The score is 1.00 because there are no contrad...   
1                 1.0  The score is 1.00 because there are no contrad...   

   context_relevancy_score                           context_relevancy_reason  
0                      0.5  The score is 0.50 because while the relevant s...  
1                      0.5  The score is 0.50 because while the statement ...  
