### Confident AI (Optional)

1. In short, **Confident AI** is a cloud-based platform that is fully compatible with the **DeepEval** framework and can store generated **datasets**, **evaluation** results, etc.

2. If you want to use the **Confident AI** platform, create an account here: [Confident AI](https://www.confident-ai.com/)

3. After signing up, an **API key** is generated, which can be used to interact with the platform from inside the notebook.

---

Example of `.env` file:
```bash
DEEPEVAL_RESULTS_FOLDER=<folder> # Results of evaluations can be saved locally
DEEPEVAL_API_KEY=<your api key>  # Relevant if you want to use Confident AI
DEEPEVAL_TELEMETRY_OPT_OUT="YES" # Remove telemetry
```

#### Import all dependencies.

In [None]:
import os
import json
from typing import (
    Final, List, Dict, Any, Union
)

from dotenv import load_dotenv

from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric    
)
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval import login_with_confident_api_key
from deepeval.evaluate.configs import (
    AsyncConfig, DisplayConfig, TestRunResultDisplay
)
from deepeval.evaluate.evaluate import EvaluationResult

# Custom prompts for the DeepEval metrics
from prompts.custom_faithfulness_prompt import MyFaithfulnessTemplate
from prompts.custom_answer_relevancy_prompt import MyAnswerRelevancyTemplate
from prompts.custom_contextual_recall_prompt import MyContextualRecallTemplate
from prompts.custom_contextual_precision_prompt import MyContextualPrecisionTemplate
from prompts.custom_contextual_relevancy_prompt import MyContextualRelevancyTemplate

# Loads the environment variables from a `.env` file.
# If you want to use Confident AI, be sure to create one in the parent directory with the required variables.
load_dotenv("../.env")

# Loads all the RAG parameters like chunk size, chunk overlap, temperature, etc.
load_dotenv("../../env/rag.env")

#### Optional step, if you want to use the cloud platform.

In [None]:
if os.getenv("DEEPEVAL_API_KEY"):
    deepeval_api_key: str = os.getenv("DEEPEVAL_API_KEY")
    
    # You should get a message letting you know you are logged-in.
    login_with_confident_api_key(deepeval_api_key)

### LLMTestCase

- Unlike **RAGAs**, where a single interaction between a user and the AI system is represented by a **SingleTurnSample**, **DeepEval** uses the concept of a so-called **LLMTestCase**.

- **LLMTestCase** objects carry the same fields as their **RAGAs** counterparts, just under different names: `input`, `actual_output`, `expected_output`, etc.
One can easily generate a synthetic dataset with one framework, convert it into the format expected by the other framework and use it there.
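
The renaming between the two frameworks can be sketched as a plain dictionary. The RAGAs key names are the ones found in this notebook's dataset; the helper `ragas_to_deepeval_kwargs` is a hypothetical name used only for illustration, not part of either library.

```python
from typing import Any, Dict

# RAGAs field name -> DeepEval LLMTestCase field name (as used in this notebook)
RAGAS_TO_DEEPEVAL: Dict[str, str] = {
    "user_input": "input",
    "response": "actual_output",
    "reference": "expected_output",
    "retrieved_contexts": "retrieval_context",
    "reference_contexts": "context",
}


def ragas_to_deepeval_kwargs(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Rename the keys of a RAGAs sample; keys without a mapping are dropped."""
    return {
        deepeval_key: sample[ragas_key]
        for ragas_key, deepeval_key in RAGAS_TO_DEEPEVAL.items()
        if ragas_key in sample
    }


# Made-up sample, only to show the renaming
sample = {
    "user_input": "What is RAG?",
    "response": "Retrieval-augmented generation.",
    "reference": "RAG stands for retrieval-augmented generation.",
    "retrieved_contexts": ["RAG combines retrieval with generation."],
    "reference_contexts": ["RAG = retrieval-augmented generation."],
}
kwargs = ragas_to_deepeval_kwargs(sample)
print(sorted(kwargs))
```

The resulting dictionary could then be unpacked into the test case constructor, e.g. `LLMTestCase(**kwargs)`.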

![Image showcasing what a LLMTestCase is.](../../img/deepeval/LLM-testcase.png "LLMTestCase")

- Put simply, an **LLMTestCase** represents a **single, atomic unit of interaction with your LLM app**.
- An **interaction** can mean different things depending on the application being evaluated and the scope.
- When evaluating an **AI Agent**, an **interaction** can mean:
    - **Agent Level**: The entire process initiated by the agent, including all intermediary steps
    - **RAG pipeline level**: Just the **RAG** pipeline - **retriever** + **generator**
    - Individual components level:
        - **Retriever**: Retrieving relevant chunks and ranking them accordingly
        - **Generator**: Generating a relevant, complete answer free of hallucinations

![Image showcasing what an interaction means](../../img/deepeval/llm-interaction.png "LLMTestCase interaction")

### Evaluation

**Evaluation** should be a crucial component of every application that uses **AI**. **DeepEval** provides more than 30 metrics for evaluation so that one can iterate quickly towards a better LLM application. Each default metric is evaluated by an LLM (`LLM-as-a-judge`).

Optionally, one can use [GEval](https://www.deepeval.com/docs/metrics-llm-evals) to define custom evaluation criteria if none of the other metrics meets the requirements.

Alternatively, there's the [DAG](https://www.deepeval.com/docs/metrics-dag) metric, whose purpose is similar to that of **GEval**; however, it uses a graph and is fully **deterministic**.

When evaluating a test case, multiple metrics can be used. The test is **positive** **iff** every metric reaches its **threshold**, and **negative** in any other case.
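
The pass/fail rule can be sketched in a few lines. This is an illustration, not DeepEval code: the function name `case_passes` is hypothetical, the metric names and threshold values are made up, and the sketch assumes that a metric passes when its score is at least its threshold.

```python
from typing import Dict


def case_passes(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """A test case is positive only if every metric reaches its threshold."""
    return all(scores[name] >= thresholds[name] for name in thresholds)


# Illustrative thresholds (assumption of this sketch)
thresholds = {"answer_relevancy": 0.5, "faithfulness": 0.5}

print(case_passes({"answer_relevancy": 0.9, "faithfulness": 0.7}, thresholds))  # True
print(case_passes({"answer_relevancy": 0.9, "faithfulness": 0.3}, thresholds))  # False
```

A single failing metric is enough to mark the whole test case as negative, regardless of how high the other scores are.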

Evaluation workflow:

![Image showcasing the evaluation steps](../../img/deepeval/evaluation-workflow.png "Evaluation workflow")

### Evaluation Dataset

An **evaluation dataset** is just a collection of **LLMTestCase** objects or so-called **Golden** objects. A **Golden** is structurally the same as an **LLMTestCase**; however, it lacks the `actual_output` and `retrieval_context` fields, which can be generated by your LLM at evaluation time.

Datasets can be **pushed** to, **stored** on and **pulled** from the **Confident AI** platform.
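
How a golden turns into a full test case at evaluation time can be sketched with plain dataclasses. `GoldenSketch`, `CaseSketch`, `to_test_case` and `fake_rag_app` are hypothetical stand-ins written for this illustration; they are not the DeepEval classes and only mirror the fields discussed above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class GoldenSketch:
    """Stand-in for a golden: no actual output or retrieval context yet."""
    input: str
    expected_output: Optional[str] = None


@dataclass
class CaseSketch:
    """Stand-in for a test case: the golden plus what the app produced."""
    input: str
    actual_output: str
    retrieval_context: List[str]
    expected_output: Optional[str] = None


def to_test_case(
    golden: GoldenSketch,
    rag_app: Callable[[str], Tuple[str, List[str]]],
) -> CaseSketch:
    """Run the application on the golden's input and attach its outputs."""
    answer, chunks = rag_app(golden.input)
    return CaseSketch(
        input=golden.input,
        actual_output=answer,
        retrieval_context=chunks,
        expected_output=golden.expected_output,
    )


def fake_rag_app(query: str) -> Tuple[str, List[str]]:
    """Toy application used only to make the sketch runnable."""
    return f"Echo: {query}", [f"chunk about {query}"]


case = to_test_case(GoldenSketch("What is RAG?", "An explanation of RAG."), fake_rag_app)
print(case.actual_output)
```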

### LLM provider

**DeepEval** uses **OpenAI** as its default LLM provider; however, **Ollama** is also available. To use it, execute the code cell below. This generates a `.deepeval` file that stores key-value pairs describing that particular LLM provider, such as the model name and the base URL.

In [None]:
EVALUATION_MODEL: Final[str] = os.getenv("EVALUATION_MODEL")
EMBEDDING_MODEL: Final[str] = os.getenv("EMBEDDING_MODEL")

print(f"Using [{EVALUATION_MODEL}] as evaluation model")
print(f"Using [{EMBEDDING_MODEL}] as embedding model")

# If you want to use a custom model be sure to configure this properly.
# https://www.deepeval.com/integrations/models/ollama
! deepeval set-ollama {EVALUATION_MODEL} --base-url="http://localhost:11434/"
! deepeval set-ollama-embeddings {EMBEDDING_MODEL} --base-url="http://localhost:11434"

### Loading dataset

* If you already have a dataset on the platform just use the `pull` method and specify the name/alias.
* Alternatively, you can save your synthetic dataset on disk and import it from a `json` or `csv` file.
* Note also that if your dataset consists of **goldens** you need to generate the `actual_output` and `retrieval_context` first.

For this project I use **RAGAs** for synthetic test data generation. The full datasets are stored under `../ragas_eval/datasets` and can be loaded from `jsonl` files. You can also use datasets created by **DeepEval**.

In [None]:
# The datasets are located under `../ragas_eval/datasets`
# If you want, you can use something else.
experiment_id: int = int(input("Please specify experiment_id (Ex. 1): "))

ragas_samples: List[Dict[str, Any]] = []
try:
    # Your filepath may vary
    # Alternatively, you can pull a dataset from ConfidentAI
    with open(
        file=f"../datasets/{experiment_id}_dataset.jsonl",
        mode="r",
        encoding="utf-8"
    ) as file:
        for line in file:
            if line.strip():  # Skip empty lines
                ragas_samples.append(json.loads(line))

    # Convert from JSON to LLMTestCase.
    # Keep in mind that RAGAs and DeepEval use different names for the same parameters.
    # We are mapping the RAGAs parameters to the DeepEval ones.
    test_cases: List[LLMTestCase] = []
    for ragas_sample in ragas_samples:
        test_case = LLMTestCase(
            input=ragas_sample["user_input"],
            actual_output=ragas_sample["response"],
            expected_output=ragas_sample["reference"],
            retrieval_context=ragas_sample["retrieved_contexts"],
            context=ragas_sample["reference_contexts"],
        )
        test_cases.append(test_case)
    
    # The fully loaded and ready for evaluation dataset 
    evaluation_dataset = EvaluationDataset(test_cases=test_cases)
except FileNotFoundError as e:
    # Re-raise with a clearer message while keeping the original traceback
    raise FileNotFoundError(
        f"File: `../datasets/{experiment_id}_dataset.jsonl` containing test cases not found!"
    ) from e
except json.JSONDecodeError as e:
    raise ValueError(
        f"Error parsing JSONL file: {e}"
    ) from e
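
Before building test cases, it can help to check that every record carries the expected RAGAs keys, since a missing key would otherwise surface as a `KeyError` in the middle of the conversion. The following is a minimal sketch with made-up records; `REQUIRED_KEYS` and `missing_keys` are hypothetical names introduced here.

```python
import json
from typing import Any, Dict, List

# Keys that the conversion to LLMTestCase in this notebook reads from each record
REQUIRED_KEYS = (
    "user_input",
    "response",
    "reference",
    "retrieved_contexts",
    "reference_contexts",
)


def missing_keys(sample: Dict[str, Any]) -> List[str]:
    """Return the required keys that are absent from a record."""
    return [key for key in REQUIRED_KEYS if key not in sample]


# Made-up JSONL lines: the second one is incomplete
lines = [
    '{"user_input": "q", "response": "a", "reference": "r", '
    '"retrieved_contexts": [], "reference_contexts": []}',
    '{"user_input": "q", "response": "a"}',
]

for number, line in enumerate(lines, start=1):
    problems = missing_keys(json.loads(line))
    if problems:
        print(f"line {number}: missing {problems}")
```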

### RAG Triad

The RAG triad is composed of three metrics:

- **Answer Relevancy**
    - This particular metric can assess the `prompt template` submitted to the LLM after the retrieval phase.
    - If the score is insufficient, it might be worth experimenting with the template itself: provide more fine-grained instructions, be more explicit or add few-shot examples.

- **Faithfulness**
    - This metric evaluates the `LLM` and its ability to incorporate context into the generation process.
    - Low scores usually indicate an `LLM` that **hallucinates** or fails to use the context properly. Alternatively, they might be due to noise in the retrieved context (large `chunk sizes` or high `top-k`).

- **Contextual Relevancy**
    - This metric verifies whether the `retrieved context` is relevant for providing a good/complete answer to the query.
    
    - It is influenced by the `top-k`, `chunk size` and `embedding model`.
        - A good `embedding model` helps capture relevant chunks of context and detect different nuances.
        - `top-k` and `chunk size` either help or hinder context retrieval by including or missing important information.

If a given RAG application scores high on all 3 metrics, one can be reasonably confident that well-suited `hyperparameters` are being used. `Hyperparameters` are parameters that can influence the RAG pipeline either positively or negatively. For example, a good `embedding model` helps retrieve relevant data and picks up on nuances, whereas a bad one does not.

If you have a simple RAG application that doesn't use `tool calls` or employ an agent, those 3 metrics can be a good starting point.

![RAG Triad](../../img/deepeval/RAG-triad.png  "RAG Triad")

### **Answer Relevancy**

This metric uses an **LLM as a judge** to determine how well the response answers the user input.

`Is the answer complete, on-topic and concise?`

This metric is especially relevant for testing the **RAG pipeline's generator**, since it determines how relevant the generated output is with respect to the query.

The main **hyperparameter** tested by this metric is the `prompt template` that the **LLM** receives as its instruction for creating an output (can it instruct the LLM to output relevant and helpful answers based on the `retrieval_context`?). Tweaking the template submitted to the LLM can yield higher scores.

---

### Approach:

The LLM is used to decompose the `actual_output` into statements, and each statement is then classified by the LLM with respect to the `input`:

* `yes` if relevant

* `no` if irrelevant

* `idk` if undetermined or partially relevant.

---

### Formula:

* Number of relevant statements = Statements marked as `yes` or `idk`

* **Answer Relevancy** = $\frac{\text{Number of relevant statements}}{\text{Total number of statements}}$
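
The formula above can be sketched as a small function over the list of verdicts. This is an illustration of the arithmetic only, with made-up verdicts; `answer_relevancy_score` is a hypothetical name, and the empty case (no statements) is not covered by the formula, so the sketch simply rejects it.

```python
from typing import List


def answer_relevancy_score(verdicts: List[str]) -> float:
    """Share of statements judged `yes` or `idk`; `no` counts as irrelevant."""
    if not verdicts:
        # Assumption of this sketch: an empty list is treated as an error
        raise ValueError("at least one verdict is required")
    relevant = sum(1 for verdict in verdicts if verdict in ("yes", "idk"))
    return relevant / len(verdicts)


# Four statements: two relevant, one undetermined, one irrelevant
print(answer_relevancy_score(["yes", "yes", "idk", "no"]))  # 0.75
```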

In [None]:
answer_relevancy = AnswerRelevancyMetric(
    include_reason=False,
    evaluation_template=MyAnswerRelevancyTemplate,
)

### **Faithfulness**

The **Faithfulness** metric measures how factually consistent a response is with the retrieval context.
It verifies whether the LLM uses the retrieved context properly, without hallucinating or contradicting it.

Output that contains hallucinations (contradictions or made-up information) is penalized. The score ranges from 0 to 1, with higher scores indicating better factual consistency.

The metric tests the `generator` component in the RAG pipeline and tries to verify if the output contradicts factual information from the **retrieval_context**.

A response is considered faithful if its claims can be supported by the retrieved context (ideally we would want a response whose claims/statements are all supported by a node/chunk in the context).

---

### Approach:

1. **Identify Claims**:
   - Break down the response into individual statements

2. **Identify Truths**:
   - Break down the retrieval context into individual statements
   - One can configure the number of `truths` to be extracted using the `truths_extraction_limit` parameter

3. **Use the LLM**
   - Use the LLM to determine whether each claim in the response is truthful (it neither contradicts information in the retrieval context nor introduces new information from outside the context)

---

### Formula:

* Number of truthful claims = Statements/claims from the response marked as `yes` or `idk` **(neither contradicting a truth nor introducing information from outside the context)**

* **Faithfulness** = $\frac{\text{Number of truthful claims}}{\text{Total number of claims}}$
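
As a worked example with made-up numbers: suppose the response is split into 5 claims, of which 3 are marked `yes`, 1 is marked `idk` and 1 is marked `no`. Following the formula above, the `yes` and `idk` claims count as truthful:

$$
\text{Faithfulness} = \frac{3 + 1}{5} = 0.8
$$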

In [None]:
faithfulness = FaithfulnessMetric(
    include_reason=False,
    evaluation_template=MyFaithfulnessTemplate,
)

### **Contextual Precision**

The **Contextual Precision** metric measures the **re-ranker's capability to rank relevant** `nodes` higher than irrelevant ones. This ensures that important information appears at the beginning of the context, which may lead to better generation.

An LLM usually focuses more on information found at the beginning of the context. Irrelevant nodes that are ranked higher than they should be are penalized (lower score).

---

### Approach:

The **LLM** determines for each **piece of context** whether or not it's relevant for answering the **input** with respect to the **expected output**. Then the `precision@k` is computed at each rank. Finally, the values are averaged using the weighted cumulative precision (WCP) formula.

---

### Formula:

* **Note that not every retrieved node is necessarily relevant**
* **k** is the rank (position of the retrieved node in the context)
* **n** is the total number of retrieved nodes
* **$r_{k}$** $\in \{0, 1\}$ is the relevance indicator (1 if the node at rank k is relevant, 0 otherwise)
* **Contextual Precision** = $\frac{1}{\text{Number of relevant nodes}} \times \sum_{k=1}^{n} \left(\frac{\text{Number of relevant nodes up to rank } k}{k} \times r_{k}\right)$
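
The formula can be sketched directly in code. The input is the list of relevance indicators $r_k$ in ranked order; the function name `contextual_precision` is hypothetical, and returning 0 when no node is relevant is an assumption of this sketch, since the formula is undefined in that case.

```python
from typing import List


def contextual_precision(relevance: List[int]) -> float:
    """relevance[k-1] is r_k: 1 if the node at rank k is relevant, 0 otherwise."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # Assumption: no relevant nodes at all gives a score of 0

    weighted_sum = 0.0
    relevant_so_far = 0
    for k, r_k in enumerate(relevance, start=1):
        relevant_so_far += r_k
        # precision@k only contributes at ranks that hold a relevant node
        weighted_sum += (relevant_so_far / k) * r_k
    return weighted_sum / total_relevant


# Relevant nodes ranked first: perfect score
print(contextual_precision([1, 1, 0]))            # 1.0
# Same nodes, but an irrelevant one is ranked first: (1/2 + 2/3) / 2
print(round(contextual_precision([0, 1, 1]), 3))  # 0.583
```

Both rankings contain the same two relevant nodes; only their position differs, which is exactly what the metric is meant to capture.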

In [None]:
contextual_precision = ContextualPrecisionMetric(
    include_reason=False,
    evaluation_template=MyContextualPrecisionTemplate
)

### **Contextual Recall**

The **Contextual Recall** metric measures the extent to which the `retrieval_context` aligns with the `expected_output`.

This metric penalizes a **retriever** which misses important/relevant nodes during retrieval, since that would lead to an incomplete response. It uses the `expected output` as a reference for the `retrieval context`.

The main `hyperparameter` evaluated by this metric is the `embedding model`. Since the `embedding model` is used by the retriever when performing semantic similarity search, a good model can positively affect the application, and a poor one can hurt it.

---

### Approach:

The **LLM** determines for each statement in the `expected_output` whether or not it can be attributed to a node from the `retrieval context`.

---

### Formula:

* An **attributable statement** is one that contains information present in a node of the context.
* **Contextual Recall** = $\frac{\text{Number of attributable statements}}{\text{Total number of statements}}$
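
Besides the score itself, it is often useful to see which statements of the expected output could not be attributed, since those point to content the retriever missed. The sketch below uses made-up statements and attributions; `contextual_recall` is a hypothetical helper, and each statement is paired with the index of the node it was attributed to (or `None`).

```python
from typing import List, Optional, Tuple


def contextual_recall(
    attributions: List[Tuple[str, Optional[int]]],
) -> Tuple[float, List[str]]:
    """Return the recall score and the statements no retrieved node supports."""
    if not attributions:
        # Assumption of this sketch: an empty list is treated as an error
        raise ValueError("at least one statement is required")
    missing = [statement for statement, node in attributions if node is None]
    score = (len(attributions) - len(missing)) / len(attributions)
    return score, missing


score, missing = contextual_recall([
    ("Paris is the capital of France.", 0),
    ("It has about two million inhabitants.", 2),
    ("The Eiffel Tower opened in 1889.", None),  # not found in any node
])
print(round(score, 3))
print(missing)
```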

In [None]:
contextual_recall = ContextualRecallMetric(
    include_reason=False,
    evaluation_template=MyContextualRecallTemplate
)

### **Contextual Relevancy**

The **Contextual Relevancy** metric measures the extent to which the `retrieval_context` is useful for generating a proper response relative to the **input**.

The hyperparameters measured in this case are the `chunk size` and `top-k`.

If the `chunk size` is large, the fetched nodes may contain the required information but also redundant information, which can cause the LLM to hallucinate. The other parameter, `top-k`, limits the number of nodes that are considered as context for the generation phase.

---

### Approach:

The **LLM** receives the **input** and **retrieval_context** and is prompted to decide for each statement found in the context, whether or not it can be of use for answering the input.

---

### Formula:

* A **relevant statement** is a statement found in the context that can be used to answer the user input.

* **Contextual Relevancy** = $\frac{\text{Number of relevant statements}}{\text{Total number of statements}}$
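
To illustrate how `top-k` interacts with this formula, the sketch below computes the score over the first `top_k` nodes of a made-up ranked retrieval result. Each node is represented by the relevance flags of its statements; the function name and the data are hypothetical.

```python
from typing import List


def contextual_relevancy(nodes: List[List[bool]], top_k: int) -> float:
    """Share of relevant statements among the statements of the first top_k nodes."""
    statements = [flag for node in nodes[:top_k] for flag in node]
    if not statements:
        # Assumption of this sketch: an empty context is treated as an error
        raise ValueError("the selected context contains no statements")
    return sum(statements) / len(statements)


# Ranked nodes; each inner list holds one flag per statement in that node
ranked_nodes = [
    [True, True],           # rank 1
    [True, False],          # rank 2
    [False, False, False],  # rank 3
    [False],                # rank 4
]

for k in (1, 2, 4):
    print(k, contextual_relevancy(ranked_nodes, k))
```

In this toy data the score drops from 1.0 to 0.75 to 0.375 as `top_k` grows, because the lower-ranked nodes add mostly irrelevant statements.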

In [None]:
contextual_relevancy = ContextualRelevancyMetric(
    include_reason=False,
    evaluation_template=MyContextualRelevancyTemplate
)

#### Configs for fine-grained customization of the evaluation process

In [None]:
# https://www.deepeval.com/docs/evaluation-flags-and-configs
async_conf = AsyncConfig(
    run_async=True,
)

display_conf = DisplayConfig(
    show_indicator=True,
    print_results=True,
    verbose_mode=False,
    # Displays only the failed tests
    # The ones whose score didn't exceed the threshold
    display_option=TestRunResultDisplay.FAILING
)

### Evaluation

In [None]:
results: EvaluationResult = evaluate(
    test_cases=evaluation_dataset.test_cases,
    metrics=[
        answer_relevancy,
        faithfulness,
        contextual_precision,
        contextual_recall,
        contextual_relevancy,
    ],
    #hyperparameters={},
    identifier=f"{experiment_id}_experiment",
    async_config=async_conf,
    display_config=display_conf,
    # cache_config=cache_conf
)

#### Store results

In [None]:
# Extract the scores 
scores: Dict[str, List[Dict[str, Union[str, float]]]] = {} # Metric name -> list of test case name and score

for test_result in results.test_results: # Iterate over the test cases
    # Get the test case number; convert to int, so we can sort it properly
    test_case: int = int(test_result.name.split("_")[-1])
    for metric in test_result.metrics_data: # Iterate over the metrics
        # Create the list for a metric on first use, then append to it
        scores.setdefault(metric.name, []).append(
            { "test_case": test_case, "score": metric.score }
        )
            
# Sort properly, since DeepEval doesn't evaluate in the original order
for metric_name in scores:
    scores[metric_name] = sorted(scores[metric_name], key=lambda x: x["test_case"])
    
# Store on disk (create the output folder first if it does not exist yet)
os.makedirs("./res", exist_ok=True)
with open(f"./res/{experiment_id}_eval.jsonl", "w", encoding="utf-8") as f:
    for metric_name, metric_scores in scores.items():
        record = {
            "metric": metric_name,
            "scores": metric_scores  # This is the sorted list of {"test_case": ..., "score": ...}
        }
        f.write(json.dumps(record) + "\n")
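
Once the results are on disk, a quick summary per metric can be computed from the stored records. The sketch below assumes the record layout written above (one JSON object per line with `metric` and `scores`); the helper name `mean_score_per_metric` and the example values are made up for illustration.

```python
import json
from statistics import mean
from typing import Dict, Iterable


def mean_score_per_metric(jsonl_lines: Iterable[str]) -> Dict[str, float]:
    """Average the stored scores of each metric; entries without a score are skipped."""
    averages: Dict[str, float] = {}
    for line in jsonl_lines:
        if not line.strip():
            continue  # Skip empty lines
        record = json.loads(line)
        values = [
            entry["score"]
            for entry in record["scores"]
            if entry["score"] is not None
        ]
        if values:
            averages[record["metric"]] = mean(values)
    return averages


# Made-up records in the same layout as the stored file
example_lines = [
    json.dumps({"metric": "Faithfulness", "scores": [
        {"test_case": 0, "score": 1.0}, {"test_case": 1, "score": 0.5}]}),
    json.dumps({"metric": "Answer Relevancy", "scores": [
        {"test_case": 0, "score": 0.75}, {"test_case": 1, "score": 0.25}]}),
]
print(mean_score_per_metric(example_lines))
```

Since the function accepts any iterable of lines, an open file handle for the stored `jsonl` file could be passed in directly.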