### Confident AI

1. In short, **Confident AI** is the cloud platform that accompanies the **DeepEval** framework. It stores **datasets**, **evaluations** and **monitoring data**.

2. To use the **Confident AI** platform, create an account here: [Confident AI](https://www.confident-ai.com/)

3. After signing up, an **API key** is generated, which can be used to interact with the platform from inside the notebook.

---

Example of .env file:
```bash
DEEPEVAL_RESULTS_FOLDER=<folder> # Results of evaluations can be saved locally
DEEPEVAL_API_KEY=<your api key>  # Relevant if you want to use Confident AI
DEEPEVAL_TELEMETRY_OPT_OUT="YES" # Opt out of telemetry
```

In [1]:
import os
from dotenv import load_dotenv
from deepeval import login_with_confident_api_key

# Loads the environment variables from a `.env` file.
# If you want to use Confident AI be sure to create one in this directory.
load_dotenv()

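# os.getenv returns None when the variable is missing, so check before logging in.
deepeval_api_key = os.getenv("DEEPEVAL_API_KEY")

if deepeval_api_key:
    # You should get a message letting you know that you are logged in.
    login_with_confident_api_key(deepeval_api_key)
else:
    print("DEEPEVAL_API_KEY is not set; skipping the Confident AI login.")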

### LLMTestCase

Unlike **RAGAs**, where a single interaction between a user and the AI system is represented by either a **SingleTurnSample** or a **MultiTurnSample**, **DeepEval** uses the so-called **LLMTestCase**/**MLLMTestCase** and **ConversationalTestCase**. Only the **LLMTestCase** is relevant for this project. An **LLMTestCase** carries the same information as a **RAGAs** sample, just under different field names: `input`, `actual_output`, `expected_output`, etc.

![Image showcasing what a LLMTestCase is.](https://confident-docs.s3.amazonaws.com/llm-test-case.svg "LLMTestCase")
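
To make the field correspondence concrete, here is a minimal sketch of the renaming from RAGAs column names to `LLMTestCase` field names. The mapping mirrors the upload helper used later in this notebook; the helper name `ragas_row_to_test_case_kwargs` and the sample row are hypothetical and only serve as an illustration.

```python
import ast

# Mapping from RAGAs column names to LLMTestCase field names,
# taken from the upload helper used later in this notebook.
RAGAS_TO_DEEPEVAL = {
    "user_input": "input",
    "response": "actual_output",
    "reference": "expected_output",
    "reference_contexts": "context",
    "retrieved_contexts": "retrieval_context",
}

# Columns that a CSV export stores as stringified Python lists.
LIST_COLUMNS = {"reference_contexts", "retrieved_contexts"}


def ragas_row_to_test_case_kwargs(row: dict) -> dict:
    """Rename RAGAs fields to their LLMTestCase equivalents (hypothetical helper)."""
    kwargs = {}
    for ragas_name, deepeval_name in RAGAS_TO_DEEPEVAL.items():
        value = row[ragas_name]
        if ragas_name in LIST_COLUMNS and isinstance(value, str):
            value = ast.literal_eval(value)  # "['a', 'b']" -> ['a', 'b']
        kwargs[deepeval_name] = value
    return kwargs


# Made-up sample row, shaped like one line of a RAGAs CSV export.
row = {
    "user_input": "What is the baggage allowance?",
    "response": "One carry-on bag.",
    "reference": "One carry-on bag per passenger.",
    "reference_contexts": "['Each passenger may bring one carry-on bag.']",
    "retrieved_contexts": "['Each passenger may bring one carry-on bag.']",
}
kwargs = ragas_row_to_test_case_kwargs(row)
print(kwargs)
```

The resulting dictionary is meant to be passed on as `LLMTestCase(**kwargs)`.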

### Evaluation

**Evaluation** should be a crucial component of every application that uses **AI**. **DeepEval** provides more than 14 evaluation metrics so that one can iterate quickly towards a better LLM application. Each default metric uses **LLM-as-a-judge**. If none of the built-in metrics meets the requirements, **GEval** can be used to define custom evaluation criteria. Alternatively, there is the **DAGMetric**, which serves a similar purpose to **GEval** but uses a graph to make the scoring more **deterministic**.

When evaluating a test case, multiple metrics can be used. The test **passes** if and only if every metric's score reaches its **threshold**, and **fails** in any other case.
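
The pass/fail rule can be sketched in a few lines of plain Python. `MetricResult` and `case_passes` are hypothetical stand-ins, not DeepEval classes, and the sketch assumes that a metric passes when its score is greater than or equal to its threshold.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical container for one metric's outcome on one test case."""
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # Assumption: reaching the threshold counts as a pass.
        return self.score >= self.threshold


def case_passes(results: list[MetricResult]) -> bool:
    """A test case passes only if every metric passes."""
    return all(result.passed for result in results)


# Made-up scores: one metric above its threshold, one below.
results = [
    MetricResult("Correctness", score=1.0, threshold=0.5),
    MetricResult("Faithfulness", score=0.6, threshold=0.7),
]
print(case_passes(results))
```

A single metric below its threshold is enough to fail the whole test case, which is why the example evaluates to `False`.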

Evaluation workflow:
![Image of the evaluation workflow](https://d2lsxfc3p6r9rv.cloudfront.net/workflow.png "Evaluation workflow")

---

**Confident AI** lets users compare evaluation runs with one another. In some instances a new evaluation run might yield better overall results yet contain a **failing** test that was previously **successful**. This is known as a **regression** and is marked in **red**. Tests marked in **green** show an improvement.

![Image of evaluation run comparison](https://confident-docs.s3.us-east-1.amazonaws.com/comparison-page.png "Evaluation comparison")
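
The idea behind the comparison view can be illustrated locally. The following is only a sketch of the concept, not how the platform implements it: `compare_runs` is a hypothetical helper, and the pass/fail values are made up.

```python
def compare_runs(previous: dict[str, bool], current: dict[str, bool]) -> dict[str, list[str]]:
    """Classify test cases that appear in both runs (True = passed)."""
    shared = previous.keys() & current.keys()
    return {
        # Passed before, fails now: shown in red on the platform.
        "regressions": sorted(n for n in shared if previous[n] and not current[n]),
        # Failed before, passes now: shown in green on the platform.
        "improvements": sorted(n for n in shared if not previous[n] and current[n]),
    }


# Made-up results of two evaluation runs on the same dataset.
run_1 = {"test_case_0": True, "test_case_1": False, "test_case_2": False}
run_2 = {"test_case_0": False, "test_case_1": True, "test_case_2": True}

diff = compare_runs(run_1, run_2)
print(diff)
```

In this made-up example the second run passes more tests overall, yet `test_case_0` is a regression.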

* Here is the LLM development workflow that is highly recommended with **Confident AI**:
    - Curate a dataset (unless you already have one available)
    - Run evaluations with the dataset
    - Analyze evaluation results
    - Improve LLM application based on evaluation results
    - Run another evaluation on the same dataset

### Evaluation Dataset

An **evaluation dataset** is a collection of **LLMTestCase** objects or so-called **Golden** objects. A **Golden** is structurally similar to an **LLMTestCase**, but it does not require `actual_output` and `retrieval_context`; these can be generated by your LLM application at evaluation time.

Datasets can be **pushed** to, **stored** on and **pulled** from the **Confident AI** platform.
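
The relationship between a golden and a test case can be sketched with plain dataclasses. `GoldenSketch`, `CaseSketch`, `golden_to_case` and `fake_app` are hypothetical stand-ins for illustration, not DeepEval classes; the assumption is that the application returns an answer together with the chunks it retrieved.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GoldenSketch:
    """Stand-in for a golden: no actual_output or retrieval_context yet."""
    input: str
    expected_output: Optional[str] = None


@dataclass
class CaseSketch:
    """Stand-in for a test case: the golden plus what the application produced."""
    input: str
    actual_output: str
    retrieval_context: list[str]
    expected_output: Optional[str] = None


def golden_to_case(golden: GoldenSketch,
                   app: Callable[[str], tuple[str, list[str]]]) -> CaseSketch:
    """Run the application on the golden's input and fill in the missing fields."""
    answer, retrieved = app(golden.input)
    return CaseSketch(
        input=golden.input,
        actual_output=answer,
        retrieval_context=retrieved,
        expected_output=golden.expected_output,
    )


def fake_app(query: str) -> tuple[str, list[str]]:
    """Placeholder for a RAG pipeline; returns a canned answer and context."""
    return f"Answer to: {query}", ["retrieved chunk"]


case = golden_to_case(GoldenSketch("What is a golden?", "A test case template."), fake_app)
print(case)
```

Because the outputs are produced at evaluation time, the same goldens can be reused after every change to the application.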

### Synthetic datasets

**DeepEval** can also be used to generate **synthetic datasets**. The **Synthesizer** object is highly customizable.

Example:
```python
from deepeval.synthesizer import Synthesizer
from deepeval.dataset import EvaluationDataset

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(document_paths=['example.txt', 'example.docx', 'example.pdf'])

# Since the synthesizer generates so-called goldens they don't have actual_output and retrieval_context fields
# You can generate them prior to evaluation or during evaluation time
dataset = EvaluationDataset(goldens=goldens)
```

### LLM provider

**DeepEval** uses **OpenAI** as its default LLM provider; however, **Ollama** is also available. To use it, execute the code cell below. This generates a `.deepeval` file that stores key-value pairs about that particular LLM provider, such as the model name and the base URL.

In [7]:
!deepeval set-ollama llama3.1:latest --base-url="http://localhost:11434/"
!deepeval set-ollama-embeddings mxbai-embed-large --base-url="http://localhost:11434"

🙌 Congratulations! You're now using a local Ollama model for all evals that 
require an LLM.
🙌 Congratulations! You're now using Ollama embeddings for all evals that 
require text embeddings.


### Pushing a **dataset** to **Confident AI**

Since I already have a dataset generated by **RAGAs**, I would like to create a **DeepEval** equivalent and upload it to the cloud so that I can use it in the future.

In [None]:
import ast 
import typing as t
import pandas as pd
from pandas import DataFrame
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

def upload_ragas_dataset_to_confident_ai(filepath: str, dataset_name: str):
    try:
        dataset: DataFrame = pd.read_csv(filepath)
        test_cases: t.List[LLMTestCase] = []
        for _, row in dataset.iterrows():
            test_cases.append(
                LLMTestCase(
                    input=row['user_input'],
                    actual_output=row['response'],
                    expected_output=row['reference'],
                    context=ast.literal_eval(row['reference_contexts']),
                    retrieval_context=ast.literal_eval(row['retrieved_contexts']),        
                )
            )
            
        deepeval_dataset: EvaluationDataset = EvaluationDataset(test_cases)
        deepeval_dataset.push(
            alias=dataset_name,
            auto_convert_test_cases_to_goldens=True
        )
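    except FileNotFoundError as fnfe:
        # The CSV file does not exist at the given path.
        print(fnfe.strerror)
    except TypeError as te:
        # Raised, for example, when a column does not hold the expected type.
        print(str(te))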
    

In [73]:
upload_ragas_dataset_to_confident_ai("../ragas/dataset.csv", "RAGAs Dataset")



Opening in existing browser session.


### Pulling a **dataset** from **Confident AI**

If you already have a dataset on the platform, use the `pull` method and specify its name/alias.

In [1]:
from deepeval.dataset import EvaluationDataset

evaluation_dataset: EvaluationDataset = EvaluationDataset()
evaluation_dataset.pull("RAGAs Dataset")

### Evaluating a dataset/set of test cases using metric/s

Evaluation in **DeepEval** works as follows:
* Pull or create a dataset containing test cases or goldens:
    - In the case of **goldens**, have the LLM application generate the `actual_output` and `retrieval_context` fields.
    - **Goldens** are usually preferred, since one can modify the `prompt template` during evaluation.
* Think about the application and its different use cases:
    - What is it doing?
    - How can I ensure it is doing what it is supposed to do?
    - Check out the existing metrics or create your own.
    - Create the metric/s.
* Run the evaluation:
    - In notebooks, use the `evaluate` function.
    - If you choose to create test cases in Python file/s:
        - The filename/s should start with `test_` and can be run using `deepeval test run <filename>`.
        - Optionally, pass in flags like `-c` to use the cache.

In [None]:
from deepeval.evaluate import (
    evaluate, 
    EvaluationResult
)
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
)

# For running an evaluation in notebook use this approach
results: EvaluationResult = evaluate(
    test_cases=[evaluation_dataset.test_cases[0]],
    metrics=[correctness_metric],
    use_cache=True
)

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:00, 42.39test case/s]



Metrics Summary

  - ✅ Correctness (GEval) (score: 1.0, threshold: 0.5, strict: False, evaluation model: llama3.1:latest (Ollama), reason: Actual output matches expected output exactly and meets all requirements specified., error: None)

For test case:

  - input: what happend with spacex
  - actual output: The query about SpaceX cannot be determined as there is no mention of it in the provided context. The context only discusses potential issues and resolutions for in-flight services, booking-related issues, and special assistance with airlines.
  - expected output: There is no mention of SpaceX in the provided context.
  - context: ['Special Assistance\n\nRagas Airlines provides special assistance services for passengers with disabilities, unaccompanied minors, and those requiring medical support. Below is a detailed breakdown of how to request and prepare for these services.\n\nPassengers with Disabilities\n\nRagas Airlines ensures accessibility for passengers requiring wheelchair






Opening in existing browser session.


### **DAGMetric**

This metric can be used instead of **GEval** to build custom metric logic in a more deterministic way. It makes use of a directed acyclic graph (DAG). The basic idea is to create a decision tree of nodes and directed edges, whose leaf nodes represent the score/verdict. Users can specify their own rules, conditions and steps for cases where the pre-defined metrics are not enough.

Example:
![Image DAG](https://confident-docs.s3.amazonaws.com/dag-formatting-metric.svg)

There are **four** main types of nodes:
* **TaskNode**: usually used for pre-processing; its output is consumed by other nodes
* **BinaryJudgementNode**: takes a `criteria`, which very often refers to the output of a **TaskNode**, and returns either `True` or `False`
* **NonBinaryJudgementNode**: the same as the **BinaryJudgementNode**, except that its verdict can take values other than `True` or `False`
* **VerdictNode**: a leaf node that holds a score, determining the final output of the **DAGMetric** based on the path taken
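
How a verdict path turns into a score can be illustrated with a toy traversal. This is a plain-Python sketch, not DeepEval code: `VerdictSketch`, `JudgementSketch` and `run_dag` are hypothetical, the `judge` callables stand in for the LLM judge, and dividing the 0-10 leaf score by 10 is an assumption about how the final score ends up in the 0-1 range.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Union


@dataclass
class VerdictSketch:
    verdict: Union[bool, str]
    score: Optional[int] = None                # leaf: score on a 0-10 scale
    child: Optional["JudgementSketch"] = None  # or continue with another judgement


@dataclass
class JudgementSketch:
    criteria: str
    judge: Callable[[str], Union[bool, str]]   # stand-in for the LLM judge
    children: list[VerdictSketch] = field(default_factory=list)


def run_dag(node: JudgementSketch, text: str) -> float:
    """Follow the verdict matching the judge's answer until a leaf is reached."""
    answer = node.judge(text)
    for verdict in node.children:
        if verdict.verdict == answer:
            if verdict.child is not None:
                return run_dag(verdict.child, text)
            return verdict.score / 10  # assumption: normalize 0-10 to 0-1
    raise ValueError(f"no verdict for answer {answer!r}")


# Non-binary judgement: the verdict is one of several labels.
length_node = JudgementSketch(
    criteria="How long is the text?",
    judge=lambda t: "short" if len(t.split()) <= 5 else "long",
    children=[VerdictSketch("short", score=10), VerdictSketch("long", score=6)],
)
# Binary judgement: the verdict is True or False.
root = JudgementSketch(
    criteria="Does the text mention a summary?",
    judge=lambda t: "summary" in t.lower(),
    children=[VerdictSketch(False, score=0), VerdictSketch(True, child=length_node)],
)

print(run_dag(root, "Summary: all good"))
print(run_dag(root, "No overview here"))
```

The same input always follows the same path once the judges have answered, which is where the more deterministic behaviour comes from.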

### Required arguments:
* `input`
* `actual_output`

### **Context Utilization** *(Using DAGMetric)*

### **TL;DR**:
> **How effectively is each piece of retrieved context utilized in the final response? Are we making good use of what we've retrieved?**

---

## **Definition**:

**Context Utilization** measures how effectively the retrieved context is utilized in the generated response. This metric ensures that the generation process makes appropriate use of the available context. A **higher utilization score** means that more of the relevant retrieved information was incorporated into the response, making it crucial for applications that need to maximize the value of retrieved context. Unlike relevance metrics that focus on what was retrieved, utilization focuses on what was actually used.

---

## **DAG-Based Context Utilization**

This metric evaluates context utilization by analyzing individual context items and tracking how their information flows into the final response. It uses a **directed acyclic graph (DAG)**, which DeepEval exposes as the `DeepAcyclicGraph` class, to make more deterministic judgments about context usage patterns.

## **Formula**:

**Context Utilization** = Weighted combination of:
- *Information Coverage* (percentage of relevant retrieved information used)
- *Distribution Balance* (how evenly information was drawn from different context items)
- *Relevance Alignment* (whether the most relevant context items were utilized most effectively)

A utilization score **closer to 1** indicates efficient use of the retrieved context, while a **lower score** suggests that valuable retrieved information was ignored or underutilized.
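
The weighted combination can be written out as a small function. The weights below are purely illustrative assumptions, since the definition above does not fix them, and `context_utilization` is a hypothetical helper rather than part of DeepEval.

```python
# Hypothetical weights; the definition above does not specify them.
WEIGHTS = {
    "information_coverage": 0.5,
    "distribution_balance": 0.2,
    "relevance_alignment": 0.3,
}


def context_utilization(components: dict[str, float]) -> float:
    """Weighted sum of the three component scores, each expected in [0, 1]."""
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Made-up component scores for a single response.
score = context_utilization({
    "information_coverage": 0.8,
    "distribution_balance": 0.7,
    "relevance_alignment": 1.0,
})
print(round(score, 2))
```

Since the weights sum to 1, the combined score stays in the same 0 to 1 range as its components.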

---

## **How It Works**

This metric calculates utilization using a decision tree approach with the following steps:

1. **Process Context Items**  
   - Breaks down each context item in the retrieved list into distinct information units.
   - Rates the relevance of each context item to the query.

2. **Track Information Usage**  
   - For each context item, determines what percentage of its information appears in the final response.
   - Notes whether information was used directly, modified, or contradicted.

3. **Analyze Distribution Patterns**  
   - Evaluates how evenly information was utilized across all context items.
   - Checks if the response drew from many context items or focused on just a few.

4. **Evaluate Relevance-Utilization Alignment**  
   - Determines if the most relevant context items were given appropriate weight in the response.
   - Highlights misalignments where highly relevant context was underutilized.

5. **Calculate Overall Information Coverage**  
   - Computes the total percentage of available relevant information that made it into the response.
   - Adjusts based on the importance of the information that was used or omitted.

---

## **Why is This Useful?**

- **Complements Retrieval Metrics** → While recall measures if you retrieved the right context, utilization measures if you used it effectively.
- **Identifies Generator Weaknesses** → Reveals if your model ignores important retrieved information.
- **Optimizes Context Window Usage** → Helps determine if you're wasting tokens on context that never gets used.
- **Improves Response Completeness** → Ensures all relevant retrieved information is considered in the response.
- **Detects Pattern Biases** → Reveals if your model systematically ignores certain types of information.

A utilization score **closer to 1** means that the retrieved context was used efficiently, while a **lower score** suggests that the generator is not making good use of the available information.

In [10]:
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    TaskNode,
    BinaryJudgementNode,
    NonBinaryJudgementNode,
    VerdictNode,
)
from deepeval.metrics import DAGMetric
from deepeval.test_case import LLMTestCaseParams

# Step 1: Process each context item individually
process_contexts_node = TaskNode(
    instructions="""
    The retrieved_context is a list of separate context items. For each context item:
    1. Identify it as 'Context Item #N' (where N is its position in the list)
    2. Extract the key information units/facts from that context item
    3. Rate the relevance of each context item to the query (high/medium/low)
    """,
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
    output_label="Processed Context Items",
    children=[]  # Will be populated below
)

# Step 2: Track utilization of each context item
track_context_usage_node = TaskNode(
    instructions="""
    For each 'Context Item #N' identified in the previous step:
    1. Determine if information from this context item appears in the actual_output
    2. Estimate what percentage of the key information from this context item was used (0-100%)
    3. Note whether the information was used directly, modified, or contradicted
    4. Create a utilization summary for each context item
    """,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    output_label="Context Item Utilization",
    children=[]  # Will be populated below
)

# Step 3: Analyze per-item utilization patterns
context_distribution_node = NonBinaryJudgementNode(
    criteria="How evenly was information utilized across all context items? Did the response draw from many context items or just a few?",
    children=[
        VerdictNode(verdict="Well-distributed across most context items", score=10),
        VerdictNode(verdict="Moderately distributed across several context items", score=7),
        VerdictNode(verdict="Heavily skewed toward few context items", score=4),
        VerdictNode(verdict="Drew from only one context item", score=2),
    ],
)

# Step 4: Evaluate relevance-utilization alignment
relevance_utilization_node = BinaryJudgementNode(
    criteria="Were the most relevant context items (as identified in step 1) utilized more than less relevant ones?",
    children=[
        VerdictNode(verdict=False, score=3),
        VerdictNode(verdict=True, score=10),
    ],
)

# Step 5: Calculate overall information coverage
information_coverage_node = NonBinaryJudgementNode(
    criteria="Overall, what percentage of the total relevant information available across all context items was utilized in the response?",
    children=[
        VerdictNode(verdict="Excellent (80-100%)", score=10),
        VerdictNode(verdict="Good (60-79%)", score=8),
        VerdictNode(verdict="Adequate (40-59%)", score=6),
        VerdictNode(verdict="Poor (20-39%)", score=4),
        VerdictNode(verdict="Very poor (0-19%)", score=2),
    ],
)

# Connect the nodes
process_contexts_node.children = [track_context_usage_node]
track_context_usage_node.children = [
    context_distribution_node, 
    relevance_utilization_node,
    information_coverage_node
]

# Create the DAG
dag = DeepAcyclicGraph(root_nodes=[process_contexts_node])

# Create the metric
multi_context_utilization_metric = DAGMetric(
    name="Multi-Context Utilization Analysis",
    dag=dag,
    threshold=0.7,
    verbose_mode=True
)

from deepeval.evaluate import (
    evaluate, 
    EvaluationResult
)

from deepeval.test_case import LLMTestCaseParams

# For running an evaluation in notebook use this approach
results: EvaluationResult = evaluate(
    test_cases=evaluation_dataset.test_cases,
    metrics=[multi_context_utilization_metric],
    use_cache=True
)

Evaluating 52 test case(s) in parallel: |          |  0% (0/52) [Time Taken: 2:35:00, ?test case/s]


RemoteProtocolError: Server disconnected without sending a response.

### **Answer Relevancy**

It uses an **LLM as a judge** to determine how well the response answers the user input. This metric is especially relevant for testing the **generator** of a **RAG pipeline**, since it measures how relevant the output is to the user query.

---

### Required arguments:
* `input`
* `actual_output`

---

### Approach:

The LLM is used to extract statements from the response, which are then compared with the user input. The LLM classifies each statement as follows:
* `yes` if relevant
* `no` if irrelevant
* `idk` if undetermined or partially relevant.

---

### Formula:

__Answer Relevancy__ = $\frac{\text{number of relevant statements}}{\text{total number of statements}}$
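
As a worked example of the formula, the sketch below computes the score from a list of verdicts. `answer_relevancy_score` is a hypothetical helper, and counting `idk` verdicts as relevant is an assumption, exposed as a parameter so the other interpretation can be tried as well.

```python
def answer_relevancy_score(verdicts: list[str], idk_is_relevant: bool = True) -> float:
    """Share of statements judged relevant to the user input."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    normalized = [verdict.strip().lower() for verdict in verdicts]
    relevant = normalized.count("yes")
    if idk_is_relevant:
        # Assumption: undetermined statements are not held against the answer.
        relevant += normalized.count("idk")
    return relevant / len(normalized)


# Made-up verdicts for a response that was split into four statements.
verdicts = ["yes", "yes", "no", "idk"]
print(answer_relevancy_score(verdicts))
print(answer_relevancy_score(verdicts, idk_is_relevant=False))
```

With four statements, the two interpretations give 3/4 and 2/4 respectively, so the treatment of `idk` can noticeably shift the score.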

---

This metric is very similar to **Response Relevancy** in **RAGAs** in terms of what it evaluates, but it works differently. **RAGAs** uses the LLM to generate a set of questions from the response, embeds them and compares each one with the user query using **semantic (cosine) similarity**; the mean of these similarities is the final score. **DeepEval** decomposes the response into statements and uses the LLM to judge the relevance of each one (no embedding is performed).

|           | Answer Relevancy | Response Relevancy |
|-----------|------------------|--------------------|
| Framework | DeepEval | RAGAs |
| Approach  | Statement decomposition + LLM-as-a-judge | Question generation + Semantic similarity |
| Score     | $\in [0, 1]$ | $\in [0, 1]$ |
| Component | Generator | Generator |

In [None]:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.evaluate import evaluate, EvaluationResult
from prompts.llama31_answer_relevancy_prompt import Llama31AnswerRelevancyTemplate

# Currently, my custom template works OK in some instances, however in general the original one is superior.

answer_relevancy = AnswerRelevancyMetric(
    #threshold=0.7,
    verbose_mode=True,
    #evaluation_template=Llama31AnswerRelevancyTemplate
)

results: EvaluationResult = evaluate(
    test_cases=evaluation_dataset.test_cases[40:48],
    metrics=[answer_relevancy],
    identifier="Answer Relevancy with default template Cases [40:48]"
)

### **Faithfulness**

The **Faithfulness** metric measures how factually consistent a response is with the retrieval context. It ranges from 0 to 1, with higher scores indicating better consistency.

A response is considered faithful if its claims can be supported by the retrieved context.

---

### Required arguments:

* `input`
* `actual_output`
* `retrieval_context`

---

### Approach:

1. **Identify Claims**:
   - Break down the response into individual statements

2. **Identify Truths**:
   - Break down the retrieval context into individual statements

3. **Use NLI**:
   - Using natural language inference (NLI), determine whether each claim in the response is supported by the statements extracted from the context

---

### Formula:

__Faithfulness__ = $\frac{\text{Number of truthful claims}}{\text{Total number of claims}}$
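
The formula can be illustrated with a short sketch that also returns the unsupported claims. `faithfulness_score` is a hypothetical helper, and the claims and their verdicts are made up; in the real metric the verdicts come from the LLM judge.

```python
def faithfulness_score(claim_verdicts: dict[str, bool]) -> tuple[float, list[str]]:
    """Return the share of supported claims and the list of unsupported ones."""
    if not claim_verdicts:
        raise ValueError("at least one claim is required")
    unsupported = [claim for claim, supported in claim_verdicts.items() if not supported]
    score = (len(claim_verdicts) - len(unsupported)) / len(claim_verdicts)
    return score, unsupported


# Made-up claims extracted from a response; True means the context supports the claim.
claim_verdicts = {
    "Passengers may bring one carry-on bag.": True,
    "Checked bags are limited to 23 kg.": True,
    "Pets are allowed in the cabin.": False,
    "Special assistance must be requested in advance.": True,
}
score, unsupported = faithfulness_score(claim_verdicts)
print(score, unsupported)
```

Three of the four claims are supported, so the score is 3/4; the remaining claim is what a reason for the score would point to.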

---

### Framework Comparison:

|               | Faithfulness | Faithfulness |
|---------------|--------------|--------------|
| Framework     | DeepEval | RAGAs |
| Approach      | Statement decomposition, LLM-as-a-judge with NLI | Statement decomposition, LLM-as-a-judge with NLI |
| Components    | Generator | Generator |
| Scoring       | $\in [0, 1]$ | $\in [0, 1]$ |
| Customization | - | Can use HHEM-2.1-Open classifier |

In [20]:
from deepeval.metrics import FaithfulnessMetric
from deepeval.evaluate import evaluate, EvaluationResult
from prompts.llama31_faithfulness_prompt import Llama31FaithfulnessTemplate 

faithfulness = FaithfulnessMetric(
    #threshold=0.7,
    verbose_mode=True,
    evaluation_template=Llama31FaithfulnessTemplate
)

results: EvaluationResult = evaluate(
    test_cases=evaluation_dataset.test_cases[0:10],
    metrics=[faithfulness],
    identifier="Faithfulness with my own template Cases [0:10]",
    run_async=False
)

Evaluating 10 test case(s) sequentially: |          |  0% (0/10) [Time Taken: 00:00, ?test case/s]

Evaluating 10 test case(s) sequentially: |          |  0% (0/10) [Time Taken: 02:46, ?test case/s]


TypeError: Llama31FaithfulnessTemplate.generate_reason() got an unexpected keyword argument 'contradictions'
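
The error above indicates that the metric calls `generate_reason` with a `contradictions` keyword argument that the custom template does not accept. Below is a minimal sketch of a template method with a compatible signature. `FaithfulnessTemplateSketch` is a hypothetical stand-in, the `score` parameter and the prompt wording are assumptions, and a real template would presumably extend DeepEval's own faithfulness template and keep its expected output format.

```python
class FaithfulnessTemplateSketch:
    """Hypothetical stand-in for a custom faithfulness prompt template."""

    @staticmethod
    def generate_reason(score: float, contradictions: list[str], **_ignored) -> str:
        # Accept `contradictions` (the keyword named in the error message) and
        # tolerate any further keyword arguments instead of failing on them.
        listed = "\n".join(f"- {item}" for item in contradictions) or "- none"
        return (
            f"The faithfulness score is {score:.2f}.\n"
            f"Contradictions found:\n{listed}\n"
            "Explain the score in one or two sentences."
        )


prompt = FaithfulnessTemplateSketch.generate_reason(
    score=0.75,
    contradictions=["The response claims pets are allowed in the cabin."],
)
print(prompt)
```

Comparing the custom template's method signatures against those of the template shipped with the installed DeepEval version is probably the quickest way to find such mismatches.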

### **Contextual Precision**



In [18]:
for result in results.test_results:
    print(f"{result.name} - {result.metrics_data[0].score}")

test_case_3 - 0.75
test_case_0 - 0.5714285714285714
test_case_4 - 0.75
test_case_5 - 0.6666666666666666
test_case_9 - 0.6666666666666666
test_case_8 - 1.0
test_case_1 - 1.0
test_case_7 - 0.7142857142857143
test_case_2 - 0.8571428571428571
test_case_6 - 1.0
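
Per-case scores like the ones printed above are easier to read with a short summary. The sketch below reuses those printed scores; the threshold of 0.7 is a hypothetical value chosen for illustration, not the one used in the run.

```python
from statistics import mean

# Scores copied from the output above.
scores = {
    "test_case_3": 0.75,
    "test_case_0": 0.5714285714285714,
    "test_case_4": 0.75,
    "test_case_5": 0.6666666666666666,
    "test_case_9": 0.6666666666666666,
    "test_case_8": 1.0,
    "test_case_1": 1.0,
    "test_case_7": 0.7142857142857143,
    "test_case_2": 0.8571428571428571,
    "test_case_6": 1.0,
}

THRESHOLD = 0.7  # hypothetical threshold

average = mean(scores.values())
failing = sorted(name for name, score in scores.items() if score < THRESHOLD)
pass_rate = 1 - len(failing) / len(scores)

print(f"mean score: {average:.3f}")
print(f"pass rate:  {pass_rate:.0%}")
print(f"below threshold: {failing}")
```

Listing the cases below the threshold gives a natural starting point for inspecting individual results on the platform.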
