### Importing dependencies

In [None]:
import os

import pandas as pd
from dotenv import load_dotenv

from langchain_ollama import (
    ChatOllama, OllamaEmbeddings
)

from ragas import (
    RunConfig, DiskCacheBackend
)
from ragas.metrics import (
    LLMContextPrecisionWithReference,
    LLMContextRecall,
    ContextEntityRecall,
    NoiseSensitivity,
    ResponseRelevancy,
    Faithfulness,
    FactualCorrectness,
    SemanticSimilarity # Non-LLM based one
)
from ragas import EvaluationDataset
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.evaluation import evaluate, EvaluationResult

from prompts.metrics.custom_context_recall_prompt import MyContextRecallPrompt
from prompts.metrics.custom_context_precision_prompt import MyContextPrecisionPrompt
from prompts.metrics.custom_response_relevance_prompt import MyResponseRelevancePrompt
from prompts.metrics.custom_context_entities_recall_prompt import MyContextEntitiesRecallPrompt
from prompts.metrics.faithfulness.custom_nli_generator_prompt import MyNLIStatementPrompt
from prompts.metrics.faithfulness.custom_statement_generator_prompt import MyStatementGeneratorPrompt

# Load RAG parameters
load_dotenv("../../env/rag.env")

### Loading the evaluation dataset

- https://docs.ragas.io/en/latest/concepts/components/eval_sample/
- https://docs.ragas.io/en/latest/concepts/components/eval_dataset/

According to **RAGAs** an **evaluation dataset** is a homogeneous collection of **data samples** with the sole purpose of measuring the capabilities and performance of an AI application.

- **Structure**:
    - Contains either **SingleTurnSample** or **MultiTurnSample** object instances, representing a unique interaction or a set of interactions respectively between a **Persona** and the AI-system.
    - **NOTE**: The dataset can contain **ONLY** a single type of sample. The two types cannot be mixed in one dataset.

**Samples** represent a single unit of interaction with the underlying system. As mentioned they can be either **SingleTurnSample** or **MultiTurnSample**.

For this project I focus solely on **SingleTurnSample** objects, since I'm evaluating independent queries, not multi-turn interactions.

If you have your own custom synthetic data, load it as required. Make sure you convert your entries into an `EvaluationDataset`.

The schema that you need to respect is:
```python
{
    "user_input": "...", # query/question
    "reference": "...",  # expected answer
    "response": "...",   # actual answer (LLM output)
    # optional and not relevant here, but generated by RAGAs during synthetic data creation
    "reference_contexts": [...], # context used to generate queries
    "retrieved_contexts": [...]  # context retrieved at runtime to answer a question
}
```
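Before handing a custom JSONL file to `EvaluationDataset.from_jsonl`, it can help to sanity-check each record against the schema above. The following is a minimal sketch using only the standard library. The helper name `validate_record` and the choice of `REQUIRED_KEYS` are assumptions made for illustration (based on the fields the metrics in this notebook read), not part of RAGAs.

```python
import json

# Assumption: these are the fields used by the metrics in this notebook.
REQUIRED_KEYS = {"user_input", "reference", "response", "retrieved_contexts"}


def validate_record(line: str) -> dict:
    """Parse one JSONL line and check that the expected fields are present."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - set(record)
    if missing:
        raise ValueError(f"Missing fields: {sorted(missing)}")
    if not isinstance(record["retrieved_contexts"], list):
        raise TypeError("retrieved_contexts must be a list of strings")
    return record


# Hypothetical sample record, for illustration only
sample_line = json.dumps({
    "user_input": "Where is the Taj Mahal?",
    "reference": "The Taj Mahal is in Agra, India.",
    "response": "It is located in Agra.",
    "retrieved_contexts": ["The Taj Mahal is a mausoleum in Agra."],
})
record = validate_record(sample_line)
print(sorted(record))
```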

In [None]:
# The datasets are located at ../datasets
# Specify the experiment id to load the corresponding dataset
test_id: str = input("Please specify experiment id (Ex. 1): ")

try:
    evaluation_dataset = EvaluationDataset.from_jsonl(
        path=f"../datasets/{test_id}_dataset.jsonl"
    )
except FileNotFoundError:
    print(f"Dataset for experiment id {test_id} not found. Please enter a correct id.")
    raise  # Stop here: later cells need `evaluation_dataset` to be defined

### Instantiate required objects for interacting with RAGAs

- **RAGAs** requires an **LLM** and an **embedding model**, depending on which metrics are computed. Both models must be wrapped in *wrapper* objects. Integrations such as `langchain`, `llama-index` and `haystack` are supported.

- Additionally, a **RunConfig** can be used to modify the framework's default behaviour, for example the timeout values and the maximum number of retries for failed operations.

- **NOTE**: depending on the LLM and the GPU, you may need to adjust the `timeout` value; otherwise you will run into a `TimeoutException`.

- Lastly, **RAGAs** ships a single implementation for caching intermediate steps on disk: the **DiskCacheBackend** class. It is useful if the kernel freezes, since cached operations are not carried out again.

In [None]:
run_config = RunConfig(
    timeout=86400,    # 24 hours on waiting for a single operation
    max_retries=20,   # Max retries before giving up
    max_wait=600,     # Max wait between retries
    max_workers=4,    # Concurrent requests
    log_tenacity=True # Print retry attempts
)

# This stores data generation and evaluation results locally on disk
# When using it for the first time, it will create a .cache folder
# When using it again, it will read from that folder and finish almost instantly
cacher = DiskCacheBackend(cache_dir=f".cache-{test_id}")

ollama_llm = ChatOllama(
    model=os.getenv("DATA_GENERATION_MODEL"),
    base_url="http://localhost:11434", # Can vary for you
    temperature=float(os.getenv("TEMPERATURE")),
    num_ctx=int(os.getenv("LLM_CONTEXT_WINDOW_TOKENS")),
    format="json" # We need to enforce JSON output, since most outputs would be validated by a pydantic model
)

ollama_embeddings = OllamaEmbeddings(
    model=os.getenv("EMBEDDING_MODEL"),
    base_url="http://localhost:11434" # Can vary for you
)

ragas_llm = LangchainLLMWrapper(
    langchain_llm=ollama_llm,
    run_config=run_config,
    cache=cacher
)

ragas_embeddings = LangchainEmbeddingsWrapper(
    embeddings=ollama_embeddings,
    run_config=run_config,
    cache=cacher
)

### Metrics

A metric is a quantitative measure used to evaluate the performance of an LLM-based application. Metrics help us assess how well an application as a whole, or its individual components, performs relative to the dataset used. With their help we can iterate towards a more *accurate*, *faithful* and *precise* LLM application.

Metrics in **RAGAs** can be classified by the mechanism they use to evaluate the application:

- **LLM-based** - an LLM performs the evaluation. More than one prompt may be submitted to the LLM to score a single metric. This type of metric approximates a human-like evaluation.
- **Non-LLM based** - no LLM is required or used to perform the evaluation. These metrics tend to be much faster and more predictable than the previous type. However, in my opinion they are not the best fit for evaluating a RAG system, where the same query can be submitted multiple times and produce a differently phrased response each time.

Both metric types can be further classified into **SingleTurnMetric** and **MultiTurnMetric**. As noted before, only **SingleTurnMetric** is relevant and used in this project.

![Metric types in RAGAs](../../img/ragas/metrics_mindmap.webp "Metrics RAGAs")

### **Context Precision**

> **TL;DR:** What fraction of the retrieved chunks are **actually relevant**, and how highly are they ranked?

---

### **Definition**:

**Context Precision** is a crucial metric for evaluating the **retrieval capabilities** of a RAG (Retrieval-Augmented Generation) system. Since the **LLM's response** depends heavily on the **context retrieved from the knowledge base**, it is essential to have a reliable retrieval mechanism that fetches **only relevant** information.  

By achieving **high context precision**, the system ensures that the retrieved information is **more relevant** and **ranked higher**, leading to **more accurate responses** and **reduced hallucinations**.

---

### **Approach**:

For each entry in `retrieved_contexts`, an object consisting of `input`, `reference` and `context` is submitted to the LLM for evaluation. The LLM classifies the context as either useful or not useful for arriving at the answer. The final score is then computed from these verdicts.

---

### **Formula:**
For each retrieved chunk at **rank (k)**:

**Precision@k** = $\frac{\text{Number of relevant chunks in the top k}}{\text{Total number of chunks in the top k}} = \frac{\text{true positives@k}}{\text{true positives@k + false positives@k}}$

We then compute **Context Precision@K** as the **relevance-weighted mean of the Precision@k values**, so only ranks that hold a relevant chunk contribute:

**Context Precision@K** = $\frac{\sum_{k=1}^{K} (\text{Precision@k} \times \text{Relevance}(k))}{\text{Total number of relevant chunks in top K results}}$

Where:
- **Precision@k** is the proportion of relevant chunks among the top (k) retrieved chunks.
- **Relevance(k)** indicates whether the chunk at rank (k) is relevant:

$$
\text{Relevance}(k) =
\begin{cases}
1, & \text{if the chunk at rank } k \text{ is relevant} \\
0, & \text{otherwise}
\end{cases}
$$

- **K** is the total number of retrieved chunks.

---

### **Example Calculation**
Suppose our **retriever fetches 4 chunks**, and **2 of them** are relevant. We calculate **Precision@k** at each rank:

| Rank (k) | Retrieved Chunk | Relevant? | Precision@k |
|----------|-----------------|-----------|-------------|
| 1        | ✅ Relevant      | ✅ (1)     | $\frac{1}{1} = 1.00$ |
| 2        | ❌ Not Relevant  | ❌ (0)     | $\frac{1}{2} = 0.50$ |
| 3        | ✅ Relevant      | ✅ (1)     | $\frac{2}{3} \approx 0.67$ |
| 4        | ❌ Not Relevant  | ❌ (0)     | $\frac{2}{4} = 0.50$ |

Now, using the **Context Precision formula**:


**Context Precision@4** = $\frac{(1 \times 1) + (\frac{1}{2} \times 0) + (\frac{2}{3} \times 1) + (\frac{1}{2} \times 0)}{2} = \frac{1 + \frac{2}{3}}{2} = \frac{5}{6} \approx 0.833$

Thus, **Context Precision@4 ≈ 0.833**.
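The calculation above can be sketched in a few lines of Python. This is an illustrative re-implementation of the formula, not the RAGAs source; the function name `context_precision_at_k` is an assumption, and the relevance verdicts are passed in directly instead of being produced by an LLM. Because Precision@k is kept unrounded, the four-chunk example evaluates to 5/6 ≈ 0.833.

```python
def context_precision_at_k(verdicts: list[int]) -> float:
    """verdicts[i] is 1 if the chunk at rank i + 1 is relevant, else 0."""
    relevant_so_far = 0
    weighted_sum = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        relevant_so_far += verdict
        precision_at_rank = relevant_so_far / rank
        # Only ranks holding a relevant chunk contribute (Relevance(k) = 1)
        weighted_sum += precision_at_rank * verdict
    total_relevant = sum(verdicts)
    return weighted_sum / total_relevant if total_relevant else 0.0


# The four-chunk example: relevant chunks at ranks 1 and 3
print(round(context_precision_at_k([1, 0, 1, 0]), 3))

# The same two relevant chunks ranked first score higher
print(round(context_precision_at_k([1, 1, 0, 0]), 3))
```

Comparing the two calls shows why the metric also reflects ranking quality: the set of relevant chunks is identical, only their positions differ.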

---

### **Why is This Useful?**

- **Evaluates Retriever Quality** → Ensures retrieved chunks contain **relevant** information.

- **Also Evaluates the Re-ranker's Capabilities** → Ensures relevant chunks are ranked higher.

- **Improves RAG Performance** → Helps **reduce hallucinations** and improves **LLM accuracy**, since relevant information would come first in the context.

- **Higher values** for this metric would signify a good retriever, ranking **relevant chunks** high.

In [None]:
context_precision = LLMContextPrecisionWithReference(
    name="context_precision",
    context_precision_prompt=MyContextPrecisionPrompt(),
    max_retries=20
)

### **Context Recall**

### TL;DR:
> **How many of the relevant chunks were actually retrieved? Did we miss relevant chunks?**

---

### **Definition**:

**Context Recall** measures how many of the **relevant documents (or pieces of information)** were successfully retrieved from the knowledge base. This metric ensures that the retrieval process does not **miss important information** that could be used to generate a response. A **higher recall** means that fewer relevant documents were left out, making it crucial for applications that prioritize completeness over precision. Since recall focuses on **not missing relevant data**, it always requires a reference (expected output) for comparison.

---

### **Approach**:

This metric evaluates recall by analyzing the **claims** in the reference (expected response) and checking whether they can be attributed to the retrieved context (does the context support the claims). In an ideal scenario, **all claims** in the reference answer should be **supported** by the retrieved context.

### **Formula**:

**Context Recall** = $\frac{\text{Number of claims in the reference supported by the retrieved context}}{\text{Total number of claims in the reference}}$

A recall score **closer to 1** indicates that most relevant data has been successfully retrieved, while a **lower recall** means that key information was missed.

---

### **Why is This Useful?**
- **Evaluates Retriever Performance** → Ensures all information required for answering is fetched.
- **Helps Optimize RAG Pipelines** → Higher recall ensures more **comprehensive** context retrieval.


In [None]:
context_recall = LLMContextRecall(
    name="context_recall",
    context_recall_prompt=MyContextRecallPrompt(),
    max_retries=20
)

### **Context Entities Recall**

### TL;DR:
> **What fraction of the entities in the reference also appear in the retrieved contexts?**

---

### Definition:
`ContextEntityRecall` metric measures the recall of the retrieved context, based on the number of entities present in both `reference` and `retrieved_contexts`, relative to the total number of entities in the `reference` alone. In simple terms, it evaluates how well the retrieved contexts capture the entities from the original reference.

---

### Formula:

To compute this metric, we define two sets:

- **RE**: The set of entities in the reference.
- **RCE**: The set of entities in the retrieved contexts.

We count the entities common to both sets ($|\text{RCE} \cap \text{RE}|$) and divide by the total number of entities in the reference ($|\text{RE}|$). The formula is:

$\text{Context Entity Recall} = \frac{\text{Number of common entities between RCE and RE}}{\text{Total number of entities in RE}} = \frac{|\text{RCE} \cap \text{RE}|}{|\text{RE}|}$

---

### **Example Calculation**

#### **Scenario:**

We have a **reference** and two retrieved contexts:

**Reference:**
> "The Taj Mahal is an ivory-white marble mausoleum on the right bank of the river Yamuna in the Indian city of Agra. It was commissioned in 1631 by the Mughal emperor Shah Jahan to house the tomb of his favorite wife, Mumtaz Mahal."

**Retrieved Contexts:**

- **High Entity Recall Context:**
  > "The Taj Mahal is a symbol of love and architectural marvel located in Agra, India. It was built by the Mughal emperor Shah Jahan in memory of his beloved wife, Mumtaz Mahal. The structure is renowned for its intricate marble work and beautiful gardens surrounding it."

- **Low Entity Recall Context:**
  > "The Taj Mahal is an iconic monument in India. It is a UNESCO World Heritage Site and attracts millions of visitors annually. The intricate carvings and stunning architecture make it a must-visit destination."

#### **Step 1: Extract Entities**

Entities in the **Reference (RE)**:
> \["Taj Mahal", "Yamuna", "Agra", "1631", "Shah Jahan", "Mumtaz Mahal"\]

Entities in **High Recall Context (RCE1)**:
> \["Taj Mahal", "Agra", "Shah Jahan", "Mumtaz Mahal", "India"\]

Entities in **Low Recall Context (RCE2)**:
> \["Taj Mahal", "UNESCO", "India"\]

#### **Step 2: Compute Context Entity Recall**

For **High Recall Context (RCE1)**:
$ \text{Context Entity Recall} = \frac{4}{6} = 0.67 $ 

For **Low Recall Context (RCE2)**:
$ \text{Context Entity Recall} = \frac{1}{6} = 0.17 $

Since the first context retains more entities from the reference, it has a **higher entity recall**, indicating it is **more comprehensive** in capturing the essential information.
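The two scores above can be reproduced with plain Python sets. This is only a sketch of the formula: the function name `context_entity_recall` is an assumption, and the entity lists are copied by hand from Step 1, whereas RAGAs extracts entities with an LLM.

```python
def context_entity_recall(reference_entities: set[str], context_entities: set[str]) -> float:
    """|RCE ∩ RE| / |RE|, returning 0.0 when the reference has no entities."""
    if not reference_entities:
        return 0.0
    return len(reference_entities & context_entities) / len(reference_entities)


RE = {"Taj Mahal", "Yamuna", "Agra", "1631", "Shah Jahan", "Mumtaz Mahal"}
RCE1 = {"Taj Mahal", "Agra", "Shah Jahan", "Mumtaz Mahal", "India"}
RCE2 = {"Taj Mahal", "UNESCO", "India"}

print(round(context_entity_recall(RE, RCE1), 2))  # 0.67
print(round(context_entity_recall(RE, RCE2), 2))  # 0.17
```

Note that entities present only in the context (such as "India" or "UNESCO") do not affect the score, since the denominator counts reference entities only.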

---

### **Key Insights:**

* **Higher Entity Recall Improves Answer Completeness:** If entities are important for context, a higher recall ensures critical information is included in the retrieved context.

* **Useful for Fact-Heavy Applications:** Applications in **legal, medical, and historical domains** benefit significantly from high entity recall.

* **Balances with Context Precision:** While high recall is beneficial, retrieving too many irrelevant entities can introduce noise. **Optimizing both recall and precision** is crucial for effective retrieval in RAG systems.

* **Comparison of Retrieval Mechanisms:** If two retrieval mechanisms fetch different contexts, **Context Entity Recall** helps determine which one is better at preserving key entities from the reference.

In [None]:
context_entity_recall = ContextEntityRecall(
    context_entity_recall_prompt=MyContextEntitiesRecallPrompt(),
    max_retries=20
)

### **Noise Sensitivity**

### TL;DR:
> **How prone is the system to generating incorrect claims from retrieved contexts?**

---

### **Definition**:
**Noise Sensitivity** measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents. This metric evaluates the robustness of a Retrieval-Augmented Generation (RAG) system against potentially misleading or noisy information.

---

### **Conceptual Insight**:
The metric fundamentally tests how easily an LLM can be "tricked" into generating factually incorrect responses. It distinguishes between two critical scenarios:

1. **Relevant Context Noise**: 
   - More subtle and dangerous
   - Noise is camouflaged within seemingly pertinent information
   - High risk of inadvertently incorporating incorrect claims

2. **Irrelevant Context Noise**:
   - Comes from contexts unrelated to the user's query
   - A robust LLM should completely resist this
   - Zero tolerance for incorporating unrelated information

---

### LLM-Based Noise Sensitivity

The metric assesses noise sensitivity by decomposing the response and reference into individual claims, then analyzing:
- Whether claims are correct according to the ground truth
- Whether incorrect claims can be attributed to retrieved contexts (either relevant or irrelevant)

---

### **Calculation Approach**:

1. **Decompose Statements**: Break down the reference and response into individual claims using an LLM evaluator
2. **Context Verification**: Check if claims can be supported by retrieved contexts using NLI techniques
3. **Identify Relevant Contexts**: Determine which contexts support the ground truth
4. **Map Response Claims**: Map each claim in the response to contexts that support it
5. **Identify Incorrect Claims**: Determine which response claims are unsupported by the ground truth
6. **Calculate Sensitivity**: Compute the proportion of incorrect claims derived from contexts

---

### **Formula**:

Let's define the following:
- $S_r$ = Set of response statements
- $S_i$ = Set of incorrect statements in the response (not supported by ground truth)
- $C_r$ = Set of relevant contexts (contexts that support ground truth)
- $C_i$ = Set of irrelevant contexts (contexts that don't support ground truth)
- $f(s,c)$ = Function that returns 1 if statement $s$ can be inferred from context $c$, 0 otherwise

For each statement $s$ in the response, we can define:
- $\text{relevant\_faithful}(s) = \max_{c \in C_r} f(s,c)$ (1 if statement can be inferred from any relevant context)
- $\text{irrelevant\_faithful}(s) = \max_{c \in C_i} f(s,c) \cdot (1 - \text{relevant\_faithful}(s))$ (1 if statement can be inferred from irrelevant context and not from any relevant context)

Then:

**Relevant Mode Noise Sensitivity**:
$$\text{NS}_{\text{relevant}} = \frac{\sum_{s \in S_i} \text{relevant\_faithful}(s)}{|S_r|}$$

**Irrelevant Mode Noise Sensitivity**:
$$\text{NS}_{\text{irrelevant}} = \frac{\sum_{s \in S_i} \text{irrelevant\_faithful}(s)}{|S_r|}$$

Where $|S_r|$ is the total number of statements in the response.

In plain English:
- Relevant mode: Proportion of response statements that are both incorrect and supported by relevant contexts
   - an incorrect statement is a statement in the `response` that is not supported by the `reference`
   - a relevant context is a context that supports information from the `reference`
- Irrelevant mode: Proportion of response statements that are both incorrect and supported only by irrelevant contexts
   - an irrelevant context is a context that doesn't support any statement in the `reference`

A score **closer to 0** indicates better performance, suggesting:
- Fewer incorrect claims
- Less influence from noisy or irrelevant contexts
- More robust response generation
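To make the two formulas concrete, here is a small sketch that computes both modes from pre-computed verdicts. It is an illustration of the definitions above, not the RAGAs implementation: the function name, its argument names and the toy verdicts are assumptions, and in practice every verdict would come from LLM/NLI calls.

```python
def noise_sensitivity(incorrect, supported_by, context_is_relevant, mode="relevant"):
    """
    incorrect[s]           -> True if response statement s is not supported by the reference
    supported_by[s][c]     -> True if statement s can be inferred from context c, i.e. f(s, c)
    context_is_relevant[c] -> True if context c supports the reference
    """
    if not incorrect:
        return 0.0
    hits = 0
    for s, is_incorrect in enumerate(incorrect):
        relevant_faithful = any(
            supported_by[s][c] for c, rel in enumerate(context_is_relevant) if rel
        )
        # Counts only when no relevant context already supports the statement
        irrelevant_faithful = (not relevant_faithful) and any(
            supported_by[s][c] for c, rel in enumerate(context_is_relevant) if not rel
        )
        faithful = relevant_faithful if mode == "relevant" else irrelevant_faithful
        hits += int(is_incorrect and faithful)
    return hits / len(incorrect)  # |S_r| = number of response statements


# Toy setup: 3 response statements, context 0 is relevant, context 1 is irrelevant
incorrect = [False, False, True]
context_is_relevant = [True, False]
supported_by = [[True, False], [True, False], [True, False]]

print(noise_sensitivity(incorrect, supported_by, context_is_relevant, mode="relevant"))
print(noise_sensitivity(incorrect, supported_by, context_is_relevant, mode="irrelevant"))
```

In this toy setup the single incorrect statement is inferable from the relevant context, so it counts towards the relevant mode (1 of 3 statements) and not towards the irrelevant mode.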

---

### **Modes**:
- **Relevant Mode** (default): Focuses on noise in relevant contexts
- **Irrelevant Mode**: Analyzes noise from irrelevant retrieved documents

---

### **Evaluation Expectations**:
- **Relevant Context**: Some noise tolerance, but should be minimal
- **Irrelevant Context**: Virtually zero noise should be incorporated
  - A high-quality LLM should completely disregard irrelevant information
  - Any noise from irrelevant contexts indicates a significant vulnerability

---

### **Example Scenario**:
Consider a query about the Life Insurance Corporation of India (LIC):
- **Question**: "What is the Life Insurance Corporation of India (LIC) known for?"
- **Reference**: "The Life Insurance Corporation of India (LIC) is the largest insurance company in India, established in 1956 through the nationalization of the insurance industry. It is known for managing a large portfolio of investments."
- **Response**: "The Life Insurance Corporation of India (LIC) is the largest insurance company in India, known for its vast portfolio of investments. LIC contributes to the financial stability of the country."
- **Retrieved Contexts**: Mix of relevant information about LIC and irrelevant information about the Indian economy
- **Noise Sensitivity Score**: 0.33 (one incorrect claim out of three total claims)

---

### **Key Insights**:
- Helps identify system vulnerabilities to hallucination
- Provides a quantitative measure of response reliability
- Distinguishes between subtle (relevant) and gross (irrelevant) information distortions
- Serves as a critical component in comprehensive RAG system evaluation

In [None]:
# For this project we will focus on the `relevant` mode
# This means how sensitive is the LLM to noise from relevant context
noise_sensitivity = NoiseSensitivity(
    nli_statements_prompt=MyNLIStatementPrompt(),
    statement_generator_prompt=MyStatementGeneratorPrompt(),
    max_retries=20
)

### **Response Relevancy**

### TL;DR:
> **How well does the answer address the original user query?**

---

### **Definition**:
**Response Relevancy** measures how closely an AI-generated response aligns with the original user input. It evaluates the answer's ability to directly and appropriately address the user's question, penalizing responses that are incomplete, off-topic, or include unnecessary details. It doesn't judge the **factual accuracy**.

---

### **Conceptual Insight**:
The metric focuses on the semantic alignment between:
- The original user query
- The generated response

Key evaluation criteria:
1. **Direct Address**: Does the answer directly tackle the user's question?
2. **Information Completeness**: Does the response provide sufficient information?
3. **Topic Coherence**: Does the answer stay focused on the query's intent?

---

### **Calculation Approach**:

1. **Question Generation**: 
   - Use the LLM to generate artificial questions based on the response
   - Default is to create 3 variant questions
   - These questions should capture the essence of the response

2. **Semantic Similarity**:
   - Compute cosine similarity between:
     - Original user input embedding
     - Embeddings of generated questions
   - Measures how closely the questions match the original query

3. **Scoring**:
   - Average the cosine similarity scores
   - Higher scores indicate better relevance
   - Scores typically range between 0 and 1

---

### **Formula**:

**Response Relevancy** = $\frac{\sum_{i=1}^{N} \text{cosine\_similarity}(E_{g_{i}}, E_{o})}{N}$ 

Where:
- $E_{g_{i}}$: Embedding of the i-th generated question
- $E_{o}$: Embedding of the original user input
- N: Number of generated questions (default: 3)
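The formula can be sketched with the standard library alone. The two-dimensional vectors below are made-up stand-ins for real embeddings, and the function names are assumptions for illustration; in RAGAs the questions are generated by the LLM and embedded with the configured embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def response_relevancy(query_embedding, generated_question_embeddings) -> float:
    """Mean cosine similarity between the original query and each generated question."""
    similarities = [
        cosine_similarity(generated, query_embedding)
        for generated in generated_question_embeddings
    ]
    return sum(similarities) / len(similarities)


# Toy embeddings: one question identical to the query, one partly aligned, one unrelated
query = [1.0, 0.0]
generated = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(round(response_relevancy(query, generated), 3))
```

A response from which the LLM can only reconstruct questions unlike the original query drags the mean down, which is how off-topic answers end up with low scores.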

---

### **Evaluation Expectations**:
- **High Score (Close to 1)**: 
  - Answer directly addresses the query
  - Comprehensive and focused response
  - Minimal irrelevant information

- **Low Score (Close to 0)**: 
  - Response is off-topic
  - Incomplete or evasive answer
  - Includes excessive unrelated details

---

### **Important Limitations**:
- **No Factuality Check**: 
  - Measures relevance, not accuracy
  - Does not verify the truthfulness of the response
- **Embedding-Based**: 
  - Relies on semantic similarity
  - May not catch nuanced relevance

---

### **Example Scenario**:
- **User Query**: "When was the first Super Bowl?"
- **Good Response**: "The first Super Bowl was held on Jan 15, 1967, between the Green Bay Packers and Kansas City Chiefs."
- **Poor Response**: "Football is a popular sport in the United States with many interesting historical moments."

### **Key Insights**:
- Helps evaluate response quality beyond simple keyword matching
- Provides a quantitative measure of semantic alignment
- Supports improving AI system's query understanding
- Identifies potential issues with off-topic or unfocused responses

In [None]:
response_relevancy = ResponseRelevancy(
    question_generation=MyResponseRelevancePrompt()
)

### **Faithfulness**

### TL;DR:
> **How factually consistent is the response with the retrieved context?**

---

### **Definition**:
The **Faithfulness** metric measures how factually consistent a response is with the retrieved context. It ranges from 0 to 1, with higher scores indicating better consistency.

A response is considered faithful if all of its claims can be supported by the retrieved context.

---

### **Calculation Approach**:

1. **Identify Claims**: 
   - Break down the response into individual statements
   - Examine each claim systematically

2. **Context Verification**:
   - Check each claim to see if it can be inferred from the retrieved context
   - Determine the support level of each statement

3. **Scoring**:
   - Compute the faithfulness score using the formula:
     
     $\text{Faithfulness Score} = \frac{\text{Number of claims supported by the retrieved context}}{\text{Total number of claims in the response}}$

---

### **Example**:

**Question**: Where and when was Einstein born?

**Context**: Albert Einstein (born 14 March 1879) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time.

**High Faithfulness Answer**: Einstein was born in Germany on 14th March 1879.

**Low Faithfulness Answer**: Einstein was born in Germany on 20th March 1879.

#### Calculation Steps:
- **Step 1**: Break the generated answer into individual statements
- **Step 2**: Verify if each statement can be inferred from the given context
- **Step 3**: Apply the faithfulness formula
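Applied to the two answers above, the steps can be sketched as follows. The claim decomposition and the supported/unsupported verdicts are assigned by hand here purely for illustration; in RAGAs both steps are performed by the LLM, and the helper name `faithfulness_score` is an assumption.

```python
def faithfulness_score(claim_verdicts: dict[str, bool]) -> float:
    """Fraction of response claims that can be inferred from the retrieved context."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts.values()) / len(claim_verdicts)


# Steps 1 and 2 (hand-made): claims from each answer and whether the context supports them
high_faithfulness_answer = {
    "Einstein was born in Germany.": True,
    "Einstein was born on 14 March 1879.": True,
}
low_faithfulness_answer = {
    "Einstein was born in Germany.": True,
    "Einstein was born on 20 March 1879.": False,  # the context states 14 March 1879
}

# Step 3: apply the formula
print(faithfulness_score(high_faithfulness_answer))  # 1.0
print(faithfulness_score(low_faithfulness_answer))   # 0.5
```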

---

### **Evaluation Expectations**:
- **Perfect Score (1.0)**: 
  - All claims are directly supported by the context
  - No extraneous or unsupported information

- **Partial Score**: 
  - Some claims are supported
  - Partial consistency with the retrieved context

- **Low Score (Close to 0)**: 
  - Most claims cannot be verified
  - Significant deviation from the original context

---

### **Advanced Verification**:
**HHEM-2.1-Open** can be used as a classifier model to:
- Detect hallucinations in LLM-generated text
- Cross-check claims with the given context
- Efficiently determine claim inferability
- Avoid **biases** by not using `LLM-As-A-Judge`

---

### **Key Insights**:
- Measures the factual consistency of AI-generated responses
- Helps identify potential hallucinations or fabrications
- Provides a quantitative assessment of contextual alignment
- Supports improving the reliability of AI-generated content

In [None]:
faithfulness = Faithfulness(
    nli_statements_prompt=MyNLIStatementPrompt(),
    statement_generator_prompt=MyStatementGeneratorPrompt(),
    max_retries=20,
)

### **Factual Correctness**

### TL;DR:
> **How factually consistent is a model's response with the ground truth?**

---

### **Definition**:
**Factual Correctness** measures the degree to which a model's `response` aligns with the `reference` by assessing the factual consistency between the two. It does this by breaking down both the response and reference into distinct claims and then determining whether these claims match using natural language inference (NLI). This metric helps evaluate how accurately the generated response retains factual integrity.

The factual correctness score ranges from **0 to 1**, where higher values indicate better factual alignment.

---

### **Claim Breakdown and Classification**:
This metric identifies three types of claims:
- **True Positives (TP):** Claims that are present in both the response and the reference.
- **False Positives (FP):** Claims that are in the response but not supported by the reference.
- **False Negatives (FN):** Claims that are in the reference but missing from the response.

---

### **Calculation Modes**:
The metric can be computed in three different modes:

#### **Precision Mode:**
> Measures how much of the information in the response is factually correct.

$ \text{Precision} = \frac{TP}{TP + FP} $

- High precision means the response avoids including unsupported claims.
- Penalizes responses that introduce hallucinated information.

#### **Recall Mode:**
> Measures how much of the reference information is retained in the response.

$ \text{Recall} = \frac{TP}{TP + FN} $

- High recall means the response includes all important reference claims.
- Penalizes responses that omit key factual details.

#### **F1 Mode (Default):**
> Balances precision and recall for a comprehensive factual correctness score.

$ \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $
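The three modes differ only in how the TP, FP and FN counts are combined. The sketch below shows that final step under the assumption that the claims have already been classified; the function name and the counts are hypothetical and chosen only for illustration.

```python
def factual_correctness_score(tp: int, fp: int, fn: int, mode: str = "f1") -> float:
    """Combine claim counts into a precision, recall or F1 score."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if mode == "precision":
        return precision
    if mode == "recall":
        return recall
    if mode == "f1":
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)
    raise ValueError(f"Unknown mode: {mode}")


# Hypothetical counts: 3 matching claims, 1 unsupported claim, 2 missing claims
for mode in ("precision", "recall", "f1"):
    print(mode, round(factual_correctness_score(tp=3, fp=1, fn=2, mode=mode), 2))
```

With these counts the response is penalized more for omissions (recall) than for unsupported claims (precision), and F1 lands between the two.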

---

### **Controlling the Number of Claims**
Each sentence in the response and reference can be decomposed into multiple claims. The granularity of this decomposition is determined by **atomicity** and **coverage**.

#### **Atomicity:**
Controls how much a sentence is broken down:
- **High Atomicity:** Breaks a sentence into fine-grained claims.
- **Low Atomicity:** Keeps a sentence more intact with minimal decomposition.

#### **Coverage:**
Determines how much information is extracted:
- **High Coverage:** Extracts all details from the original sentence.
- **Low Coverage:** Focuses on key information, omitting minor details.

---

### **Example of Atomicity and Coverage Adjustments**
#### **High Atomicity & High Coverage:**
```python
scorer = FactualCorrectness(mode="precision", atomicity="high", coverage="high")
```
**Original Sentence:**  
> "Marie Curie was a Polish and naturalized-French physicist and chemist who conducted pioneering research on radioactivity."

**Decomposed Claims:**
- "Marie Curie was a Polish physicist."
- "Marie Curie was a naturalized-French physicist."
- "Marie Curie was a chemist."
- "Marie Curie conducted pioneering research on radioactivity."

---

### **Practical Application**
- **High Atomicity & High Coverage** → Best for detailed fact-checking and claim extraction.
- **Low Atomicity & Low Coverage** → Suitable for summarization or when only key facts are needed.

This flexibility ensures the metric can be tailored to different levels of granularity based on the application.

---

### **Key Takeaways:**
* Factual correctness evaluates the factual overlap between a response and a reference.  
* It provides **precision**, **recall**, and **F1-score** options for measurement.  
* Atomicity and coverage control the **granularity** of claim decomposition.  
* Helps identify hallucinations (FP) and missing information (FN) in responses.  

This metric is crucial for **evaluating Retrieval-Augmented Generation (RAG) systems** where factual consistency is a priority!

In [None]:
factual_correctness = FactualCorrectness(
    nli_prompt=MyNLIStatementPrompt()
)

### **Semantic Similarity**

### TL;DR:
> **How semantically similar are two texts?**

---

### **Definition**:
**Semantic Similarity** measures the degree of similarity between two texts. It uses an `embedding model` to convert each text into a high-dimensional vector (an `embedding`) and computes the similarity between the two vectors, for example with `cosine similarity`. Two texts may carry very similar information while being phrased differently. This metric determines whether they are close in meaning regardless of the wording.

---

### **Approach**:
Compute the embedding of both the `response` and the `reference`, normalize the vectors, and compute the score. The score ranges **from 0 to 1**, with higher values indicating higher similarity. This metric is great for assessing the `generator component` without requiring an LLM.
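As a rough sketch of this approach: once both vectors are normalized to unit length, their dot product equals the cosine similarity. The vectors below are made-up two-dimensional stand-ins for real embeddings, and the helper names are assumptions rather than RAGAs functions.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def semantic_similarity_score(response_embedding, reference_embedding) -> float:
    """Dot product of the two unit-length vectors, i.e. their cosine similarity."""
    response_unit = normalize(response_embedding)
    reference_unit = normalize(reference_embedding)
    return sum(x * y for x, y in zip(response_unit, reference_unit))


# Toy embeddings pointing in similar, but not identical, directions
print(round(semantic_similarity_score([3.0, 4.0], [4.0, 3.0]), 2))
```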

In [None]:
semantic_similarity = SemanticSimilarity(
    threshold=0.7, # Default is 0.5 = 50%
)

### Evaluation

Finally, after all relevant metrics are initialized and customized as needed we can evaluate. 

In [None]:
results: EvaluationResult = evaluate(
    dataset=evaluation_dataset,
    metrics=[
        context_precision,
        context_recall,
        context_entity_recall,              
        noise_sensitivity,    
        response_relevancy,                 
        faithfulness,                       
        factual_correctness,                
        semantic_similarity                 
    ],
    llm=ragas_llm,
    embeddings=ragas_embeddings,
    experiment_name=f"{test_id}_evaluation",
    run_config=run_config,
    show_progress=True
)

In [None]:
# Make sure the directory exists
os.makedirs("./res", exist_ok=True)

# Save results locally (optional)
result_df: pd.DataFrame = results.to_pandas()
result_df.to_csv(f'./res/{test_id}_eval_results.csv', index=False)

# Display metric scores
results.scores

### Upload to the platform (Optional)

In [None]:
if os.getenv("RAGAS_APP_TOKEN"):
    results.upload()