### Evaluating Retrieval-Augmented Generation (RAG) Applications: Importance and Impact  

Evaluating RAG-based applications is essential for ensuring that their responses are reliable, relevant, and efficient. Unlike a standalone LLM, a RAG application combines a large language model with a vector store and retrieval components, which adds complexity. Each component of the pipeline therefore needs to be evaluated.

#### Key Pitfalls in RAG Applications
- **Hallucination**: Even with retrieved context, LLMs can still generate inaccurate or misleading responses.  
- **Irrelevant Retrieval**: Poor retrieval results in the model relying on incomplete or incorrect context.  
- **Latency & Efficiency**: Combining retrieval and generation can introduce delays, requiring performance optimization. 

#### Advantages of Proper Evaluation
- **Improved Accuracy**: Ensures the factual correctness of responses by using a knowledge base.  
- **Enhanced Relevance**: Helps refine retrieval mechanisms, ensuring responses are more context-aware.  
- **Robustness & Adaptability**: Identifies failure points, enabling continuous improvements to model behavior.  

#### Impact on Future Development  
A structured evaluation process enables better benchmarking, allowing for iterative improvements in retrieval models, embeddings, and LLM fine-tuning. It also promotes transparency in AI decision-making, paving the way for more explainable and ethical AI systems, which in turn supports wider adoption.

---

### Setup

* Run `setup.sh` from the root of the `evaluation` folder.
* If the script cannot be executed, make it executable first: `chmod u+x setup.sh`.
* The script installs all required dependencies.
* Finally, select `eval` as the kernel in the notebook.

---

*(OPTIONAL STEP)*

**RAGAs** provides a cloud platform where a dataset and evaluation results can be stored and viewed.
To use it follow this link: [RAGAs.io](https://app.ragas.io/).
* Sign-up
* Retrieve the **token**
* Create a `.env` file with the following content:
```bash
RAGAS_APP_TOKEN=apt.......-9f6ed
```

---

**Note**
* In this project I used **DeepEval** to generate a synthetic dataset and uploaded it to **ConfidentAI**, so I pull it from my own account.
* If you used my `generate` notebook with **RAGAs** to generate test data, you can load that as well.
* If you want to follow along and pull the data from **DeepEval**, make sure you have a `.env` file containing your **DeepEval API key**.

In [1]:
import os
from dotenv import load_dotenv
from deepeval import login_with_confident_api_key

# If you have followed the optional step, execute this cell
# My .env file is stored at the root of the evaluation folder so I can
# combine all environment variables for all frameworks
load_dotenv("../.env")

deepeval_api_key: str = os.getenv("DEEPEVAL_API_KEY")

# You should get a message letting you know you are logged-in.
login_with_confident_api_key(deepeval_api_key)

### Loading the evaluation dataset

According to **RAGAs** an **evaluation dataset** is a homogeneous collection of **data samples** with the sole purpose of measuring the capabilities and performance of an AI application.

- **Structure**:

    - Contains either **SingleTurnSample** or **MultiTurnSample** object instances, each of them representing a unique interaction between a **Persona** and the AI-system.

    - **NOTE**: The dataset can contain **ONLY** a single type of sample. The two types cannot be mixed in one dataset.

**Samples** represent a single unit of interaction with the underlying system. As mentioned they can be either **SingleTurnSample** or **MultiTurnSample**.

For this project I focus solely on **SingleTurnSample** objects, since I'm evaluating independent queries, not multi-turn interactions.

In [2]:
import deepeval.dataset

deepeval_eval_dataset = deepeval.dataset.EvaluationDataset()
deepeval_eval_dataset.pull(
    alias=os.getenv("DATASET_ALIAS"),
    auto_convert_goldens_to_test_cases=True
)

In [3]:
from typing import List
from ragas import EvaluationDataset, SingleTurnSample

# After pulling my dataset from ConfidentAI I need to convert it into proper RAGAs evaluation dataset

samples: List[SingleTurnSample] = []
for test_case in deepeval_eval_dataset.test_cases:
    single_turn_sample = SingleTurnSample(
        user_input=test_case.input,
        retrieved_contexts=test_case.retrieval_context,
        reference_contexts=test_case.context,
        response=test_case.actual_output,
        reference=test_case.expected_output
    )
    samples.append(single_turn_sample)
    
evaluation_dataset = EvaluationDataset(samples)

In [None]:
from langchain_ollama.chat_models import ChatOllama
from langchain_ollama.embeddings import OllamaEmbeddings

from ragas import RunConfig, DiskCacheBackend
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

load_dotenv("../../env/rag.env") 

llm = ChatOllama(
    model=os.getenv("CHAT_MODEL"),
    base_url="http://localhost:11434",
    temperature=float(os.getenv("TEMPERATURE")),
    num_ctx=int(os.getenv("LLM_CONTEXT_WINDOW_TOKENS")),
    format="json"
)

embeddings = OllamaEmbeddings(
    model=os.getenv("EMBEDDING_MODEL"),
    base_url="http://localhost:11434"
)

run_config = RunConfig(
    timeout = 86400, # 24 hours; tune depending on GPU, model, test size, etc.
    max_wait = 600,
    log_tenacity = True
)

cacher = DiskCacheBackend(".cache")

llm = LangchainLLMWrapper(
    langchain_llm=llm,
    run_config=run_config,
    cache=cacher
)

embeddings = LangchainEmbeddingsWrapper(
    embeddings=embeddings,
    run_config=run_config,
    cache=cacher
)

### Metrics

Metrics in **RAGAs** can be classified by the mechanism they use to evaluate the application:

- **LLM-based** - an LLM performs the evaluation. A single metric may require more than one query to the LLM. These metrics also mimic a user to some extent, since they tend to be non-deterministic and unpredictable.

- **Non-LLM based** - no LLM is required or used for the evaluation. These metrics tend to be much faster and more predictable than LLM-based ones. However, in my opinion they may not be the best fit for evaluating a RAG system, where the same query can be submitted multiple times and yield a different response each time.

Both metric types can be further classified as **SingleTurnMetric** or **MultiTurnMetric**. As noted earlier, only **SingleTurnMetric** is relevant to this project.

---

### **Context Precision**

> **TL;DR:** What fraction of the retrieved chunks are **actually relevant**?

---

### **Definition**:

**Context Precision** is a crucial metric for evaluating the **retrieval capabilities** of a RAG (Retrieval-Augmented Generation) system. Since the **LLM's response** depends heavily on the **context retrieved from the knowledge base**, it is essential to have a reliable retrieval mechanism that fetches **only relevant** information.  

By achieving **high context precision**, the system ensures that the retrieved information is **more relevant** and **ranked higher**, leading to **more accurate responses** and **reduced hallucinations**.

---

### **Formula:**
For each rank **k** in the retrieved list:

$\text{Precision@k} = \frac{\text{Number of relevant chunks in the top k results}}{\text{Total number of chunks in the top k results}}$

We then compute **Context Precision@K** as the **mean of all Precision@k values**, weighted by relevance:

$\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \text{Precision@k} \times \text{Relevance}(k)}{\text{Total number of relevant chunks in top K results}}$

Where:
- **Precision@k** is the proportion of relevant chunks among the top k retrieved chunks.
- **Relevance(k)** is a binary indicator
    - 1 if the chunk at rank (k) is relevant
    - 0 otherwise
- **K** is the total number of retrieved chunks
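
The formula above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the RAGAs implementation: the function name `context_precision_at_k` is hypothetical, the relevance verdicts are assumed to be already available as a list of 0/1 values in rank order, and returning `0.0` when nothing is relevant is an assumption.

```python
from typing import Sequence


def context_precision_at_k(relevance: Sequence[int]) -> float:
    """Mean of Precision@k over the ranks that hold a relevant chunk.

    `relevance` holds one binary verdict per retrieved chunk, in rank order.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # assumption: no relevant chunk retrieved scores 0

    hits = 0
    weighted_sum = 0.0
    for k, rel in enumerate(relevance, start=1):
        hits += rel                       # relevant chunks seen in the top k
        weighted_sum += (hits / k) * rel  # Precision@k counts only at relevant ranks
    return weighted_sum / total_relevant


# Same two relevant chunks, ranked differently
print(round(context_precision_at_k([1, 0, 1, 0]), 3))  # 0.833
print(round(context_precision_at_k([0, 1, 0, 1]), 3))  # 0.5
```

Both lists contain two relevant chunks out of four, yet the score drops when the relevant chunks sit lower in the ranking. The sketch keeps fractions such as 2/3 unrounded, so its results can differ in the third decimal from a hand calculation that rounds each Precision@k first.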

---

### **Example Calculation**
Suppose our **retriever fetches 4 chunks**, and **2 of them** are relevant. We calculate **Precision@k** at each rank:

| Rank $k$ | Retrieved Chunk | Relevant? | Precision@k |
|----------|-----------------|-----------|-------------|
| 1        | ✅ Relevant      | ✅ (1)     | $\frac{1}{1} = 1.00$ |
| 2        | ❌ Not Relevant  | ❌ (0)     | $\frac{1}{2} = 0.50$ |
| 3        | ✅ Relevant      | ✅ (1)     | $\frac{2}{3} \approx 0.67$ |
| 4        | ❌ Not Relevant  | ❌ (0)     | $\frac{2}{4} = 0.50$ |

Now, using the **Context Precision formula**:

$\text{Context Precision@4} = \frac{(1 \times 1) + (\frac{1}{2} \times 0) + (\frac{2}{3} \times 1) + (\frac{2}{4} \times 0)}{2} = \frac{1 + \frac{2}{3}}{2} = \frac{5}{6} \approx 0.833$

Thus, **Context Precision@4 ≈ 0.833**.

---

### **Why is This Useful?**

- **Evaluates Retriever Quality** → Ensures retrieved chunks contain **relevant** information.

- **Also evaluates the Re-ranker capabilities** → Ensures relevant chunks are ranked higher.

- **Improves RAG Performance** → Helps **reduce hallucinations** and improves **LLM accuracy**.

- **Higher values** for this metric would signify a good retriever, ranking **relevant chunks** high.

In [None]:
from ragas.metrics import (
    LLMContextPrecisionWithReference,
    LLMContextPrecisionWithoutReference,
    NonLLMContextPrecisionWithReference
)
from prompts.custom_context_precision_prompt import MyContextPrecisionPrompt

# For each piece of context, the LLM determines whether or not it is relevant for answering the input 
# with respect to the expected output
llm_context_precision_with_re = LLMContextPrecisionWithReference(
    context_precision_prompt=MyContextPrecisionPrompt,
    max_retries=5
)

# Same metric, but instead of reference the actual response is used
llm_context_precision_without_re = LLMContextPrecisionWithoutReference(
    context_precision_prompt=MyContextPrecisionPrompt,    
    max_retries=5
)

# This particular version uses a distance measure to determine the semantic similarity
# between chunks from the retrieved context and reference context.
# It builds a cartesian product coupling each reference context with context from the retrieved nodes.
# For each couple a score is computed and the maximum is taken.
# Finally, the metric discards all scores that do not exceed the threshold and computes Context Precision@K
# using the formula shown earlier. The only difference here is that no LLM is used.
context_precision_with_reference = NonLLMContextPrecisionWithReference()

### **Context Recall**

### TL;DR:
> **How many of the relevant chunks were actually retrieved? Did we miss relevant chunks?**

---

### **Definition**:

**Context Recall** measures how many of the **relevant documents (or pieces of information)** were successfully retrieved. This metric ensures that the retrieval process does not **miss important information** that could be used to generate a response. A **higher recall** means that fewer relevant documents were left out, making it crucial for applications that prioritize completeness over precision. Since recall focuses on **not missing relevant data**, it always requires a reference set for comparison.

---

### LLM-Based Context Recall

This metric evaluates recall by analyzing the **claims** in the reference (expected) response and checking whether they can be attributed to the retrieved context. In an ideal scenario, **all claims** in the reference answer should be **supported** by the retrieved context.

### **Formula**:

**Context Recall** = $\frac{\text{Number of claims in the reference supported by the retrieved context}}{\text{Total number of claims in the reference}}$

A recall score **closer to 1** indicates that most relevant data has been successfully retrieved, while a **lower recall** means that key information was missed.
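
As a quick worked instance of the formula above (the numbers are hypothetical and not taken from this project's dataset): if the reference answer decomposes into four claims and three of them can be attributed to the retrieved context, then

$\text{Context Recall} = \frac{3}{4} = 0.75$

which means one quarter of the reference information was not covered by retrieval.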

### **Non-LLM-Based Context Recall**

### **Definition**:

**Non-LLM-Based Context Recall** measures how many of the **relevant documents (or pieces of information)** were successfully retrieved **without relying on a language model** for evaluation. This metric ensures that the retrieval process does not **miss important information** that could be used to generate a response. A **higher recall** means that fewer relevant documents were left out, making it crucial for applications that prioritize completeness over precision. Since recall focuses on **not missing relevant data**, it always requires a reference set for comparison.

---

### **How It Works**
This metric calculates recall based on **string similarity** rather than using an LLM to assess the retrieved content. It follows these steps:

1. **Extract Retrieved and Reference Contexts**  
   - Uses `retrieved_contexts` (what was fetched) and `reference_contexts` (ground truth).  
2. **Compute Similarity Scores**  
   - Compares each **retrieved context** against each **reference context** using a similarity function.
3. **Find the Best Match for Each Reference Context**  
   - For each **reference context**, it selects the **retrieved context** with the **highest similarity score**.
4. **Apply a Threshold (`0.5` by default)**  
   - If the similarity score is **above the threshold**, the reference context is considered **retrieved successfully**.
5. **Compute Recall Score**  
   - The recall score is the proportion of reference contexts that have at least one matching retrieved context **above the threshold**.

---

### **Formula**

**Non-LLM-Based Context Recall** =  $\frac{\text{Number of reference contexts with a high-scoring retrieved match}}{\text{Total number of reference contexts}} $

Where:
- A **retrieved match** is valid if its **similarity score > threshold**.
- The **threshold** is configurable (default = `0.5`).
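
The steps above can be sketched as follows. This is a minimal illustration and not the RAGAs source: the function names are hypothetical, and `difflib.SequenceMatcher` is used only as a stand-in similarity function, which is an assumption (the similarity measure RAGAs applies by default may differ).

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence


def string_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; an assumption, not necessarily what RAGAs uses."""
    return SequenceMatcher(None, a, b).ratio()


def non_llm_context_recall(
    retrieved_contexts: Sequence[str],
    reference_contexts: Sequence[str],
    threshold: float = 0.5,
    similarity: Callable[[str, str], float] = string_similarity,
) -> float:
    if not reference_contexts:
        return 0.0  # assumption: nothing to recall scores 0

    matched = 0
    for reference in reference_contexts:
        # Best match for this reference context across all retrieved contexts
        best = max(
            (similarity(retrieved, reference) for retrieved in retrieved_contexts),
            default=0.0,
        )
        if best > threshold:  # strictly above the threshold
            matched += 1
    return matched / len(reference_contexts)


retrieved = ["alpha beta gamma", "delta epsilon"]
reference = ["alpha beta gamma", "0123456789"]
print(non_llm_context_recall(retrieved, reference))  # 0.5
```

The first reference context is found verbatim (similarity 1.0), while the second shares no characters with any retrieved context (similarity 0.0), so half of the reference contexts count as retrieved. Passing a different `similarity` callable or `threshold` changes how strict the match is.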

---

### **Example Calculation**
#### **Given:**
- `retrieved_contexts = ["text A", "text B", "text C"]`
- `reference_contexts = ["text X", "text Y"]`
- **Similarity scores:**  
  - `text A` ↔ `text X` = **0.7** ✅
  - `text B` ↔ `text X` = **0.4** ❌
  - `text C` ↔ `text X` = **0.6** ✅ (max = 0.7)
  - `text A` ↔ `text Y` = **0.2** ❌
  - `text B` ↔ `text Y` = **0.5** ❌ (equal to the threshold, not above it)
  - `text C` ↔ `text Y` = **0.3** ❌ (max = 0.5)

#### **Final Score Computation**
- `text X` has a match (`0.7 > 0.5`) → ✅ **1**
- `text Y` has no match (`0.5` is not above `0.5`) → ❌ **0**
- **Score = (1+0) / 2 = 0.5 (50%)**

---

### **Why is This Useful?**
- **Evaluates Retriever Performance Without an LLM** → Ensures relevance using **string similarity**, avoiding model biases.
- **Helps Optimize RAG Pipelines** → Higher recall ensures more **comprehensive** context retrieval.
- **Works Well with Threshold-Based Comparisons** → Adjustable strictness for different applications.

A recall score **closer to 1** means that most relevant contexts were retrieved, while a **lower recall** suggests that key information was missed.

In [None]:
from ragas.metrics import LLMContextRecall, NonLLMContextRecall   
from prompts.custom_context_recall_prompt import MyContextRecallPrompt

# For each claim in the reference, verify if it can be attributed to the retrieved context
llm_context_recall = LLMContextRecall(
    context_recall_prompt=MyContextRecallPrompt,
    max_retries=5
)

context_recall = NonLLMContextRecall()

### **Context Entities Recall**

### TL;DR:
> **It is a measure of what fraction of entities are recalled from reference**

---

### Definition:
`ContextEntityRecall` metric measures the recall of the retrieved context, based on the number of entities present in both `reference` and `retrieved_contexts`, relative to the total number of entities in the `reference`. In simple terms, it evaluates how well the retrieved contexts capture the entities from the original reference.

---

### Formula:

To compute this metric, we define two sets:

- **RE**: The set of entities in the reference.
- **RCE**: The set of entities in the retrieved contexts.

We determine the number of entities common to both sets ($|\text{RCE} \cap \text{RE}|$) and divide it by the total number of entities in the reference ($|\text{RE}|$). The formula is:

$ \text{Context Entity Recall} = \frac{\text{Number of common entities between RCE and RE}}{\text{Total number of entities in RE}} = \frac{|\text{RCE} \cap \text{RE}|}{|\text{RE}|} $

---

### **Example Calculation**

#### **Scenario:**

We have a **reference** and two retrieved contexts:

**Reference:**
> "The Taj Mahal is an ivory-white marble mausoleum on the right bank of the river Yamuna in the Indian city of Agra. It was commissioned in 1631 by the Mughal emperor Shah Jahan to house the tomb of his favorite wife, Mumtaz Mahal."

**Retrieved Contexts:**

- **High Entity Recall Context:**
  > "The Taj Mahal is a symbol of love and architectural marvel located in Agra, India. It was built by the Mughal emperor Shah Jahan in memory of his beloved wife, Mumtaz Mahal. The structure is renowned for its intricate marble work and beautiful gardens surrounding it."

- **Low Entity Recall Context:**
  > "The Taj Mahal is an iconic monument in India. It is a UNESCO World Heritage Site and attracts millions of visitors annually. The intricate carvings and stunning architecture make it a must-visit destination."

#### **Step 1: Extract Entities**

Entities in the **Reference (RE)**:
> \["Taj Mahal", "Yamuna", "Agra", "1631", "Shah Jahan", "Mumtaz Mahal"\]

Entities in **High Recall Context (RCE1)**:
> \["Taj Mahal", "Agra", "Shah Jahan", "Mumtaz Mahal", "India"\]

Entities in **Low Recall Context (RCE2)**:
> \["Taj Mahal", "UNESCO", "India"\]

#### **Step 2: Compute Context Entity Recall**

For **High Recall Context (RCE1)**:
$ \text{Context Entity Recall} = \frac{4}{6} = 0.67 $ 

For **Low Recall Context (RCE2)**:
$ \text{Context Entity Recall} = \frac{1}{6} = 0.17 $

Since the first context retains more entities from the reference, it has a **higher entity recall**, indicating it is **more comprehensive** in capturing the essential information.
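
The calculation above reduces to a set intersection, which can be checked with a short sketch. The entity lists are copied from the example; in the real metric they are extracted by an LLM, so hard-coding them and matching them by exact string equality are simplifying assumptions, and `context_entity_recall` is a hypothetical helper name.

```python
from typing import Iterable


def context_entity_recall(
    reference_entities: Iterable[str],
    context_entities: Iterable[str],
) -> float:
    reference = set(reference_entities)
    if not reference:
        return 0.0  # assumption: an empty reference scores 0
    common = reference & set(context_entities)  # RCE ∩ RE
    return len(common) / len(reference)


RE = ["Taj Mahal", "Yamuna", "Agra", "1631", "Shah Jahan", "Mumtaz Mahal"]
RCE1 = ["Taj Mahal", "Agra", "Shah Jahan", "Mumtaz Mahal", "India"]
RCE2 = ["Taj Mahal", "UNESCO", "India"]

print(round(context_entity_recall(RE, RCE1), 2))  # 0.67
print(round(context_entity_recall(RE, RCE2), 2))  # 0.17
```

Entities that appear only in the retrieved context ("India", "UNESCO") do not affect the score, because the denominator counts reference entities only.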

---

### **Key Insights:**

* **Higher Entity Recall Improves Answer Completeness:** If entities are important for context, a higher recall ensures critical information is included in the retrieved context.

* **Useful for Fact-Heavy Applications:** Applications in **legal, medical, and historical domains** benefit significantly from high entity recall.

* **Balances with Context Precision:** While high recall is beneficial, retrieving too many irrelevant entities can introduce noise. **Optimizing both recall and precision** is crucial for effective retrieval in RAG systems.

* **Comparison of Retrieval Mechanisms:** If two retrieval mechanisms fetch different contexts, **Context Entity Recall** helps determine which one is better at preserving key entities from the reference.

In [None]:
from ragas.metrics import ContextEntityRecall
from prompts.custom_context_entities_recall_prompt import MyContextEntitiesRecallPrompt

context_entity_recall = ContextEntityRecall(
    context_entity_recall_prompt=MyContextEntitiesRecallPrompt,
    max_retries=5
)

### **Noise Sensitivity**

### TL;DR:
> **How prone is the system to generating incorrect claims from retrieved contexts?**

---

### **Definition**:
**Noise Sensitivity** measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents. This metric evaluates the robustness of a Retrieval-Augmented Generation (RAG) system against potentially misleading or noisy information.

---

### **Conceptual Insight**:
The metric fundamentally tests how easily an LLM can be "tricked" into generating factually incorrect responses. It distinguishes between two critical scenarios:

1. **Relevant Context Noise**: 
   - More subtle and dangerous
   - Noise is camouflaged within seemingly pertinent information
   - High risk of inadvertently incorporating incorrect claims

2. **Irrelevant Context Noise**:
   - A robust LLM should completely resist this
   - Contexts unrelated to the user's query
   - Zero tolerance for incorporating unrelated information

---

### LLM-Based Noise Sensitivity

The metric assesses noise sensitivity by decomposing the response and reference into individual claims, then analyzing:
- Whether claims are correct according to the ground truth
- Whether incorrect claims can be attributed to retrieved contexts (either relevant or irrelevant)

---

### **Calculation Approach**:

1. **Decompose Statements**: Break down the reference and response into individual claims using an LLM evaluator
2. **Context Verification**: Check if claims can be supported by retrieved contexts using NLI techniques
3. **Identify Relevant Contexts**: Determine which contexts support the ground truth
4. **Map Response Claims**: Map each claim in the response to contexts that support it
5. **Identify Incorrect Claims**: Determine which response claims are unsupported by the ground truth
6. **Calculate Sensitivity**: Compute the proportion of incorrect claims derived from contexts

---

### **Formula**:

Let's define the following:
- $S_r$ = Set of response statements
- $S_i$ = Set of incorrect statements in the response (not supported by ground truth)
- $C_r$ = Set of relevant contexts (contexts that support ground truth)
- $C_i$ = Set of irrelevant contexts (contexts that don't support ground truth)
- $f(s,c)$ = Function that returns 1 if statement $s$ can be inferred from context $c$, 0 otherwise

For each statement $s$ in the response, we can define:
- $\text{relevant\_faithful}(s) = \max_{c \in C_r} f(s,c)$ (1 if statement can be inferred from any relevant context)
- $\text{irrelevant\_faithful}(s) = \max_{c \in C_i} f(s,c) \cdot (1 - \text{relevant\_faithful}(s))$ (1 if statement can be inferred from irrelevant context and not from any relevant context)

Then:

**Relevant Mode Noise Sensitivity**:
$$\text{NS}_{\text{relevant}} = \frac{\sum_{s \in S_i} \text{relevant\_faithful}(s)}{|S_r|}$$

**Irrelevant Mode Noise Sensitivity**:
$$\text{NS}_{\text{irrelevant}} = \frac{\sum_{s \in S_i} \text{irrelevant\_faithful}(s)}{|S_r|}$$

Where $|S_r|$ is the total number of statements in the response.

In plain English:
- Relevant mode: Proportion of response statements that are both incorrect and supported by relevant contexts
- Irrelevant mode: Proportion of response statements that are both incorrect and supported only by irrelevant contexts

A score **closer to 0** indicates better performance, suggesting:
- Fewer incorrect claims
- Less influence from noisy or irrelevant contexts
- More robust response generation
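
To make the two formulas concrete, here is a minimal sketch that computes both modes from precomputed verdicts. It assumes the LLM-based steps (claim decomposition and NLI checks) have already produced boolean inputs; the function name and the input layout are hypothetical, and the data is invented purely for illustration.

```python
from typing import List, Tuple


def noise_sensitivity(
    incorrect: List[bool],            # incorrect[s]: statement s is not supported by ground truth
    supports: List[List[bool]],       # supports[s][c]: f(s, c), statement s inferable from context c
    context_is_relevant: List[bool],  # context_is_relevant[c]: context c belongs to C_r
) -> Tuple[float, float]:
    """Return (NS_relevant, NS_irrelevant) following the formulas above."""
    total = len(incorrect)  # |S_r|
    if total == 0:
        return 0.0, 0.0  # assumption: an empty response scores 0

    relevant_hits = 0
    irrelevant_hits = 0
    for is_incorrect, row in zip(incorrect, supports):
        relevant_faithful = any(
            f and rel for f, rel in zip(row, context_is_relevant)
        )
        irrelevant_faithful = (
            any(f and not rel for f, rel in zip(row, context_is_relevant))
            and not relevant_faithful
        )
        if is_incorrect and relevant_faithful:
            relevant_hits += 1
        if is_incorrect and irrelevant_faithful:
            irrelevant_hits += 1
    return relevant_hits / total, irrelevant_hits / total


# Four response statements, two contexts (context 0 relevant, context 1 irrelevant)
incorrect = [False, False, True, True]
supports = [
    [True, False],   # correct, backed by the relevant context
    [True, False],   # correct, backed by the relevant context
    [True, True],    # incorrect, inferable from the relevant context
    [False, True],   # incorrect, inferable only from the irrelevant context
]
print(noise_sensitivity(incorrect, supports, [True, False]))  # (0.25, 0.25)
```

The third statement counts toward the relevant mode only, because the irrelevant-mode term is zeroed out whenever a relevant context also supports the statement.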

---

### **Modes**:
- **Relevant Mode** (default): Focuses on noise in relevant contexts
- **Irrelevant Mode**: Analyzes noise from irrelevant retrieved documents

---

### **Evaluation Expectations**:
- **Relevant Context**: Some noise tolerance, but should be minimal
- **Irrelevant Context**: Virtually zero noise should be incorporated
  - A high-quality LLM should completely disregard irrelevant information
  - Any noise from irrelevant contexts indicates a significant vulnerability

---

### **Example Scenario**:
Consider a query about the Life Insurance Corporation of India (LIC):
- **Question**: "What is the Life Insurance Corporation of India (LIC) known for?"
- **Reference**: "The Life Insurance Corporation of India (LIC) is the largest insurance company in India, established in 1956 through the nationalization of the insurance industry. It is known for managing a large portfolio of investments."
- **Response**: "The Life Insurance Corporation of India (LIC) is the largest insurance company in India, known for its vast portfolio of investments. LIC contributes to the financial stability of the country."
- **Retrieved Contexts**: Mix of relevant information about LIC and irrelevant information about the Indian economy
- **Noise Sensitivity Score**: 0.33 (one incorrect claim out of three total claims)

---

### **Key Insights**:
- Helps identify system vulnerabilities to hallucination
- Provides a quantitative measure of response reliability
- Distinguishes between subtle (relevant) and gross (irrelevant) information distortions
- Serves as a critical component in comprehensive RAG system evaluation

In [None]:
from ragas.metrics import NoiseSensitivity
from prompts.faithfulness.custom_nli_generator_prompt import MyNLIStatementPrompt
from prompts.faithfulness.custom_statement_generator_prompt import MyStatementGeneratorPrompt

noise_sensitivity_mode_relevant = NoiseSensitivity(
    nli_statements_prompt=MyNLIStatementPrompt,
    statement_generator_prompt=MyStatementGeneratorPrompt,
    max_retries=5
)

noise_sensitivity_mode_irrelevant = NoiseSensitivity(
    mode="irrelevant",
    nli_statements_prompt=MyNLIStatementPrompt,
    statement_generator_prompt=MyStatementGeneratorPrompt,
    max_retries=5
)

### **Response Relevancy**

### TL;DR:
> **How well does the answer address the original user query?**

---

### **Definition**:
**Response Relevancy** measures how closely an AI-generated response aligns with the original user input. It evaluates the answer's ability to directly and appropriately address the user's question, penalizing responses that are incomplete, off-topic, or include unnecessary details. It doesn't judge the **factual accuracy**.

---

### **Conceptual Insight**:
The metric focuses on the semantic alignment between:
- The original user query
- The generated response

Key evaluation criteria:
1. **Direct Address**: Does the answer directly tackle the user's question?
2. **Information Completeness**: Does the response provide sufficient information?
3. **Topic Coherence**: Does the answer stay focused on the query's intent?

---

### **Calculation Approach**:

1. **Question Generation**: 
   - Use the LLM to generate artificial questions based on the response
   - Default is to create 3 variant questions
   - These questions should capture the essence of the response

2. **Semantic Similarity**:
   - Compute cosine similarity between:
     - Original user input embedding
     - Embeddings of generated questions
   - Measures how closely the questions match the original query

3. **Scoring**:
   - Average the cosine similarity scores
   - Higher scores indicate better relevance
   - Scores typically range between 0 and 1

---

### **Formula**:

**Response Relevancy** = $\frac{\sum_{i=1}^{N} \text{cosine\_similarity}(E_{g_{i}}, E_{o})}{N}$ 

Where:
- $E_{g_{i}}$: Embedding of the i-th generated question
- $E_{o}$: Embedding of the original user input
- N: Number of generated questions (default: 3)
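
The formula can be illustrated with a small sketch in plain Python. The two-dimensional vectors below are toy stand-ins for real embeddings, and the helper names are hypothetical; in the actual metric the embeddings come from the configured embedding model and the questions from the LLM.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a zero vector has no similarity to anything
    return dot / (norm_a * norm_b)


def response_relevancy(
    original_embedding: Sequence[float],
    generated_embeddings: Sequence[Sequence[float]],
) -> float:
    """Mean cosine similarity between E_o and each E_g_i."""
    if not generated_embeddings:
        return 0.0  # assumption: no generated questions scores 0
    similarities = [
        cosine_similarity(generated, original_embedding)
        for generated in generated_embeddings
    ]
    return sum(similarities) / len(similarities)


e_o = [1.0, 0.0]                             # toy embedding of the user input
e_g = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy embeddings of N = 3 generated questions
print(round(response_relevancy(e_o, e_g), 3))  # 0.569
```

Here the first generated question points in the same direction as the user input (similarity 1), the second is orthogonal (0), and the third sits halfway (about 0.707), so the average lands near 0.57.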

---

### **Evaluation Expectations**:
- **High Score (Close to 1)**: 
  - Answer directly addresses the query
  - Comprehensive and focused response
  - Minimal irrelevant information

- **Low Score (Close to 0)**: 
  - Response is off-topic
  - Incomplete or evasive answer
  - Includes excessive unrelated details

---

### **Important Limitations**:
- **No Factuality Check**: 
  - Measures relevance, not accuracy
  - Does not verify the truthfulness of the response
- **Embedding-Based**: 
  - Relies on semantic similarity
  - May not catch nuanced relevance

---

### **Example Scenario**:
- **User Query**: "When was the first Super Bowl?"
- **Good Response**: "The first Super Bowl was held on Jan 15, 1967, between the Green Bay Packers and Kansas City Chiefs."
- **Poor Response**: "Football is a popular sport in the United States with many interesting historical moments."

### **Key Insights**:
- Helps evaluate response quality beyond simple keyword matching
- Provides a quantitative measure of semantic alignment
- Supports improving AI system's query understanding
- Identifies potential issues with off-topic or unfocused responses

In [None]:
from ragas.metrics import ResponseRelevancy
from prompts.custom_response_relevance_prompt import MyResponseRelevancePrompt

response_relevancy = ResponseRelevancy(
    question_generation=MyResponseRelevancePrompt
)

### **Faithfulness**

### TL;DR:
> **How factually consistent is the response with the retrieved context?**

---

### **Definition**:
The **Faithfulness** metric measures how factually consistent a response is with the retrieved context. It ranges from 0 to 1, with higher scores indicating better consistency.

A response is considered faithful if all of its claims can be supported by the retrieved context.

---

### **Calculation Approach**:

1. **Identify Claims**: 
   - Break down the response into individual statements
   - Examine each claim systematically

2. **Context Verification**:
   - Check each claim to see if it can be inferred from the retrieved context
   - Determine the support level of each statement

3. **Scoring**:
   - Compute the faithfulness score using the formula:
     
     $\text{Faithfulness Score} = \frac{\text{Number of claims supported by the retrieved context}}{\text{Total number of claims in the response}}$

---

### **Example**:

**Question**: Where and when was Einstein born?

**Context**: Albert Einstein (born 14 March 1879) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time.

**High Faithfulness Answer**: Einstein was born in Germany on 14th March 1879.

**Low Faithfulness Answer**: Einstein was born in Germany on 20th March 1879.

#### Calculation Steps:
- **Step 1**: Break the generated answer into individual statements
- **Step 2**: Verify if each statement can be inferred from the given context
- **Step 3**: Apply the faithfulness formula
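
Applied to the Einstein example, the three steps can be sketched as below. The way each answer is split into claims and the support verdicts are written by hand here as an assumption; in the metric itself both are produced by a model. `faithfulness_score` is a hypothetical helper name.

```python
from typing import Dict


def faithfulness_score(claim_verdicts: Dict[str, bool]) -> float:
    """Share of claims that can be inferred from the retrieved context."""
    if not claim_verdicts:
        return 0.0  # assumption: a response with no claims scores 0
    supported = sum(claim_verdicts.values())
    return supported / len(claim_verdicts)


# Steps 1 and 2: claims and their (hand-assigned) verdicts against the context
high_faithfulness_answer = {
    "Einstein was born in Germany.": True,
    "Einstein was born on 14 March 1879.": True,
}
low_faithfulness_answer = {
    "Einstein was born in Germany.": True,
    "Einstein was born on 20 March 1879.": False,  # the context says 14 March
}

# Step 3: apply the formula
print(faithfulness_score(high_faithfulness_answer))  # 1.0
print(faithfulness_score(low_faithfulness_answer))   # 0.5
```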

---

### **Evaluation Expectations**:
- **Perfect Score (1.0)**: 
  - All claims are directly supported by the context
  - No extraneous or unsupported information

- **Partial Score**: 
  - Some claims are supported
  - Partial consistency with the retrieved context

- **Low Score (Close to 0)**: 
  - Most claims cannot be verified
  - Significant deviation from the original context

---

### **Advanced Verification**:
**HHEM-2.1-Open** can be used as a classifier model to:
- Detect hallucinations in LLM-generated text
- Cross-check claims with the given context
- Efficiently determine claim inferability
- Avoid **biases** by not using `LLM-As-A-Judge`

---

### **Key Insights**:
- Measures the factual consistency of AI-generated responses
- Helps identify potential hallucinations or fabrications
- Provides a quantitative assessment of contextual alignment
- Supports improving the reliability of AI-generated content

In [None]:
from ragas.metrics import FaithfulnesswithHHEM
from prompts.faithfulness.custom_statement_generator_prompt import MyStatementGeneratorPrompt

faithfulness = FaithfulnesswithHHEM(
    statement_generator_prompt=MyStatementGeneratorPrompt,
    max_retries=5,
    device="cuda:0",
)

You are using a model of type HHEMv2Config to instantiate a model of type HHEMv2. This is not supported for all configurations of models and can yield errors.


### **Factual Correctness**

### TL;DR:
> **How factually consistent is a model's response with the ground truth?**

---

### **Definition**:
**Factual Correctness** measures the degree to which a model's response aligns with the reference by assessing the factual consistency between the two. It does this by breaking down both the response and reference into distinct claims and then determining whether these claims match using natural language inference (NLI). This metric helps evaluate how accurately the generated response retains factual integrity.

The factual correctness score ranges from **0 to 1**, where higher values indicate better factual alignment.

---

### **Claim Breakdown and Classification**:
This metric identifies three types of claims:
- **True Positives (TP):** Claims that are present in both the response and the reference.
- **False Positives (FP):** Claims that are in the response but not supported by the reference.
- **False Negatives (FN):** Claims that are in the reference but missing from the response.

---

### **Calculation Modes**:
The metric can be computed in three different modes:

#### **Precision Mode:**
> Measures how much of the information in the response is factually correct.

$ \text{Precision} = \frac{TP}{TP + FP} $

- High precision means the response avoids including unsupported claims.
- Penalizes responses that introduce hallucinated information.

#### **Recall Mode:**
> Measures how much of the reference information is retained in the response.

$ \text{Recall} = \frac{TP}{TP + FN} $

- High recall means the response covers most or all of the reference claims.
- Penalizes responses that omit key factual details.

#### **F1 Mode (Default):**
> Balances precision and recall for a comprehensive factual correctness score.

$ \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $
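
The three modes can be reproduced from the claim counts alone. The following is a minimal sketch, not the library's code; the function name and the choice to return `0.0` when a denominator is zero are assumptions made for illustration.

```python
def factual_correctness_score(tp: int, fp: int, fn: int, mode: str = "f1") -> float:
    """Compute precision, recall or F1 from claim counts."""
    # Guard against empty denominators (assumed convention: score 0.0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if mode == "precision":
        return precision
    if mode == "recall":
        return recall
    if mode == "f1":
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)
    raise ValueError(f"unknown mode: {mode!r}")


# Example: 3 supported claims, 1 unsupported claim, 2 missing claims.
for mode in ("precision", "recall", "f1"):
    print(mode, round(factual_correctness_score(3, 1, 2, mode), 3))
# precision 0.75
# recall 0.6
# f1 0.667
```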

---

### **Controlling the Number of Claims**
Each sentence in the response and reference can be decomposed into multiple claims. The granularity of this decomposition is determined by **atomicity** and **coverage**.

#### **Atomicity:**
Controls how much a sentence is broken down:
- **High Atomicity:** Breaks a sentence into fine-grained claims.
- **Low Atomicity:** Keeps a sentence more intact with minimal decomposition.

#### **Coverage:**
Determines how much information is extracted:
- **High Coverage:** Extracts all details from the original sentence.
- **Low Coverage:** Focuses on key information, omitting minor details.

---

### **Example of Atomicity and Coverage Adjustments**
#### **High Atomicity & High Coverage:**
```python
scorer = FactualCorrectness(mode="precision", atomicity="high", coverage="high")
```
**Original Sentence:**  
> "Marie Curie was a Polish and naturalized-French physicist and chemist who conducted pioneering research on radioactivity."

**Decomposed Claims:**
- "Marie Curie was a Polish physicist."
- "Marie Curie was a naturalized-French physicist."
- "Marie Curie was a chemist."
- "Marie Curie conducted pioneering research on radioactivity."

---

### **Practical Application**
- **High Atomicity & High Coverage** → Best for detailed fact-checking and claim extraction.
- **Low Atomicity & Low Coverage** → Suitable for summarization or when only key facts are needed.

This flexibility ensures the metric can be tailored to different levels of granularity based on the application.

---

### **Key Takeaways:**
* Factual correctness evaluates the factual overlap between a response and a reference.  
* It provides **precision**, **recall**, and **F1-score** options for measurement.  
* Atomicity and coverage control the **granularity** of claim decomposition.  
* Helps identify hallucinations (FP) and missing information (FN) in responses.  

This metric is crucial for **evaluating Retrieval-Augmented Generation (RAG) systems** where factual consistency is a priority!

In [None]:
from ragas.metrics import FactualCorrectness
from prompts.faithfulness.custom_nli_generator_prompt import MyNLIStatementPrompt

factual_correctness = FactualCorrectness(
    nli_prompt=MyNLIStatementPrompt
)

### **Answer Correctness**

### TL;DR:  
> **How well does a model’s response align with the ground truth in terms of factual correctness and completeness?**

---

### **Definition**:  
**Answer Correctness** evaluates how accurately a model-generated response corresponds to a given ground truth. It does this by breaking down both texts into discrete factual statements and determining whether these statements match. The metric assesses correctness through a combination of **factual alignment** and **semantic similarity**.

The correctness score ranges from **0 to 1**, where higher values indicate better alignment with the ground truth.

---

### **Statement Breakdown and Classification**:  
To assess correctness, the response and the ground truth are decomposed into factual statements, which are then categorized as follows:

- **True Positives (TP):** Statements in the response that are also in the ground truth.
- **False Positives (FP):** Statements in the response that are not supported by the ground truth.
- **False Negatives (FN):** Statements in the ground truth that are missing from the response.

---

### **Calculation Formula**:  
The **F1 Score** used for Answer Correctness is defined as:

$ F1 = \frac{|TP|}{|TP| + 0.5 \times (|FP| + |FN|)} $ 

- This formula ensures a balance between **precision (avoiding hallucinated facts)** and **recall (retaining key information).**
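- The **0.5 factor on FP and FN** is not a separately tuned weight: it follows algebraically from taking the harmonic mean of precision and recall, so each unsupported or missing statement lowers the score equally.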
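
A short derivation shows that this expression is the standard F1 score written in terms of counts. Here $P$ and $R$ are shorthand for precision and recall (notation introduced only for this derivation):

$$
\begin{aligned}
P &= \frac{|TP|}{|TP| + |FP|}, \qquad R = \frac{|TP|}{|TP| + |FN|} \\
F1 &= \frac{2PR}{P + R} = \frac{2|TP|}{2|TP| + |FP| + |FN|} = \frac{|TP|}{|TP| + 0.5 \times (|FP| + |FN|)}
\end{aligned}
$$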

---

### **Computation Steps**:  
1. **Extract factual statements** from both the response and the ground truth.  
2. **Classify statements** into TP, FP, and FN using a **Correctness Classifier** based on natural language inference (NLI).  
3. **Compute factual correctness** using the F1-like score.  
4. **Compute semantic similarity** between the response and the ground truth, so that acceptable wording variations are not penalized.  
5. **Combine both scores** using predefined weight factors (default: factual correctness = 0.75, semantic similarity = 0.25).  

---

### **Computation Modes**:  

#### **Factual Alignment Mode**  
> Focuses on exact factual correctness using the TP, FP, and FN classification.  

$ F1 = \frac{|TP|}{|TP| + 0.5 \times (|FP| + |FN|)} $

- Ensures that incorrect statements (FP) and missing information (FN) reduce the score.

#### **Semantic Similarity Mode**  
> Measures overall alignment by incorporating semantic closeness between response and ground truth.  

- Uses **embeddings-based similarity scoring** (e.g., cosine similarity).
- Allows for flexible wording but still ensures factual correctness.

#### **Weighted Combination Mode**  
> A hybrid approach balancing factual alignment and semantic similarity.  

$ \text{Score} = w_1 \times \text{Factual Correctness} + w_2 \times \text{Semantic Similarity} $

(Default: $w_1 = 0.75$, $w_2 = 0.25$)
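
To make the weighted combination concrete, here is a minimal sketch that computes the factual F1 from claim counts, a cosine similarity from two embedding vectors, and the weighted sum. The function names and the tiny two-dimensional vectors are illustrative assumptions; a real run would use embeddings produced by an embedding model.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_correctness_score(
    tp: int,
    fp: int,
    fn: int,
    response_embedding: Sequence[float],
    reference_embedding: Sequence[float],
    weights: Tuple[float, float] = (0.75, 0.25),
) -> float:
    """Weighted sum of the factual F1 score and the semantic similarity."""
    denominator = tp + 0.5 * (fp + fn)
    factual = tp / denominator if denominator else 0.0
    similarity = cosine_similarity(response_embedding, reference_embedding)
    w1, w2 = weights
    return w1 * factual + w2 * similarity


# 2 matching statements, 1 unsupported, 1 missing; toy 2-D "embeddings".
score = answer_correctness_score(2, 1, 1, [1.0, 0.0], [1.0, 1.0])
print(round(score, 3))  # 0.677
```

Setting `weights=(1.0, 0.0)` reduces the score to the factual alignment component alone, and `weights=(0.0, 1.0)` to the semantic similarity alone.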

---

### **Controlling Granularity of Statements**  

The metric supports **customizing the granularity** of extracted statements using:  

#### **Atomicity:**  
- **High Atomicity** → Extracts fine-grained claims from a sentence.  
- **Low Atomicity** → Retains larger sentence-level claims.  

#### **Coverage:**  
- **High Coverage** → Extracts all possible claims from a response.  
- **Low Coverage** → Focuses only on the key facts.  

This flexibility ensures adaptability for different evaluation needs.

---

### **Example of Atomicity and Coverage Adjustments**  
#### **High Atomicity & High Coverage Example:**
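```python
# Assumption: AnswerCorrectness takes no `atomicity`/`coverage` arguments; granularity is set via the statement generator prompt.
scorer = AnswerCorrectness(statement_generator_prompt=MyStatementGeneratorPrompt)
```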
**Original Sentence:**  
> "The Eiffel Tower was designed by Gustave Eiffel and completed in 1889."  

**Decomposed Claims:**  
- "The Eiffel Tower was designed by Gustave Eiffel."  
- "The Eiffel Tower was completed in 1889."  

---

### **Practical Applications**  
- **High Atomicity & High Coverage** → Best for **fact-checking and evaluation** in RAG applications.  
- **Low Atomicity & Low Coverage** → Suitable for **summary-level correctness**.  

---

### **Key Takeaways:**  
* Measures factual correctness and completeness in generated answers.  
* Uses **True Positives (TP), False Positives (FP), and False Negatives (FN)** for scoring.  
* Provides an **F1-like score** for correctness evaluation.  
* Supports **semantic similarity** as an additional scoring mode.  
* Can be adjusted for **granularity (atomicity) and coverage** to control claim decomposition.  

This metric is essential for **evaluating Retrieval-Augmented Generation (RAG) systems**, ensuring factually grounded responses!

In [None]:
from ragas.metrics import AnswerCorrectness
from prompts.custom_correctness_classifier_prompt import MyCorrectnessClassifier
from prompts.faithfulness.custom_statement_generator_prompt import MyStatementGeneratorPrompt

answer_correctness = AnswerCorrectness(
    correctness_prompt=MyCorrectnessClassifier,
    statement_generator_prompt=MyStatementGeneratorPrompt,
    max_retries=5
)

In [None]:

from ragas.evaluation import evaluate, EvaluationResult

# Evaluate
results: EvaluationResult = evaluate(
    dataset=evaluation_dataset,
    metrics=[
        llm_context_precision_with_re,      # Metric 1
        llm_context_precision_without_re,   # Metric 1
        context_precision_with_reference,   # Metric 1
        llm_context_recall,                 # Metric 2
        context_recall,                     # Metric 2
        context_entity_recall,              # Metric 3
        noise_sensitivity_mode_relevant,    # Metric 4
        noise_sensitivity_mode_irrelevant,  # Metric 4
        response_relevancy,                 # Metric 5
        faithfulness,                       # Metric 6
        factual_correctness,                # Metric 7
        answer_correctness                  # Metric 8
    ],
    llm=llm,
    embeddings=embeddings,
    run_config=run_config
)

Evaluating: 100%|██████████| 260/260 [3:15:39<00:00, 45.15s/it]   


In [None]:
# Save results locally
result_df = results.to_pandas()
result_df.to_csv('metrics_evaluation.csv', index=False)
results

In [None]:
# If you have supplied a RAGAs app token
results.upload()