### Setup

* Run the `setup.sh` script located in the root of the `evaluation` folder.
* If the script cannot be executed, make it executable first: `chmod u+x setup.sh`.
* The script installs all required dependencies.
* Finally, select the `eval` kernel in the notebook.

### Confident AI

1. In short, **Confident AI** is a cloud-based platform fully compatible with the **DeepEval** project. It can store generated **datasets** and the results of **evaluations**.

2. To use the **Confident AI** platform, create an account here: [Confident AI](https://www.confident-ai.com/)

3. After signing up, an **API key** is generated, which can be used to interact with the platform from inside the notebook.

4. Bear in mind that using **Confident AI** with **DeepEval** requires keeping `deepeval` at its most recent version; otherwise you will get an exception.

---

Example of `.env` file:
```bash
DEEPEVAL_RESULTS_FOLDER=<folder> # Results of evaluations can be saved locally
DEEPEVAL_API_KEY=<your api key>  # Relevant if you want to use Confident AI
DEEPEVAL_TELEMETRY_OPT_OUT="YES" # Remove telemetry
```

In [1]:
import os
from dotenv import load_dotenv
from deepeval import login_with_confident_api_key

# Loads the environment variables from a `.env` file.
# If you want to use Confident AI, be sure to create a `.env` file with these variables in the parent directory.
load_dotenv("../.env")

deepeval_api_key: str = os.getenv("DEEPEVAL_API_KEY")

# You should get a message letting you know you are logged-in.
login_with_confident_api_key(deepeval_api_key)

### LLMTestCase

- In **RAGAs**, a single interaction between a user and the AI system is represented by a **SingleTurnSample**. **DeepEval** uses the concept of an **LLMTestCase** instead.

- **LLMTestCase** objects carry the same information as their **RAGAs** counterparts, just under different field names: `input`, `actual_output`, `expected_output`, etc.
One can therefore easily generate a synthetic dataset with one framework, convert it into the format expected by the other, and use it there.
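
The conversion between the two formats mostly comes down to renaming fields. The snippet below is a minimal sketch that works on plain dictionaries, so it needs neither library installed. The mapping table and the helper `to_deepeval_fields` are illustrative names, not part of either framework's API.

```python
from typing import Any, Dict

# RAGAs `SingleTurnSample` field -> DeepEval `LLMTestCase` field.
RAGAS_TO_DEEPEVAL: Dict[str, str] = {
    "user_input": "input",
    "response": "actual_output",
    "reference": "expected_output",
    "retrieved_contexts": "retrieval_context",
    "reference_contexts": "context",
}


def to_deepeval_fields(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Rename the keys of a RAGAs-style record; unmapped or None-valued fields are dropped."""
    return {
        RAGAS_TO_DEEPEVAL[key]: value
        for key, value in sample.items()
        if key in RAGAS_TO_DEEPEVAL and value is not None
    }


# Hypothetical record, used only to illustrate the renaming.
record = {
    "user_input": "What is DeepEval?",
    "response": "An evaluation framework for LLM applications.",
    "retrieved_contexts": ["DeepEval is an open-source LLM evaluation framework."],
    "reference": None,  # no reference answer available for this record
}
fields = to_deepeval_fields(record)
print(fields)
```

With `deepeval` installed, the resulting dictionary could then be unpacked into a test case, e.g. `LLMTestCase(**fields)`.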

![Image showcasing what a LLMTestCase is.](../../img/LLM-testcase.png "LLMTestCase")

- Put simply, an **LLMTestCase** represents a **single, atomic unit of interaction with your LLM app**.
    - An **interaction** can mean different things depending on the application being evaluated and the scope.
        - If one evaluates an **AI Agent** an **interaction** can mean:
            - **Agent Level**: The entire process initiated by the agent including all intermediary steps
            - **RAG pipeline level**: Just the **RAG** workflow - **retriever** + **generator**
            - Individual components level:
                - **Retriever**: Retrieving relevant chunks and ranking them accordingly
                - **Generator**: Generating relevant, complete answer free of hallucinations

![Image showcasing what an interaction means](../../img/llm-interaction.png "LLMTestCase interaction")

### Evaluation

**Evaluation** should be a crucial component of every application that uses **AI**. **DeepEval** provides more than 30 metrics for evaluation, so that one can easily iterate towards a better LLM application. Each default metric uses **LLM-as-a-judge**. Optionally, one can use **GEval** to set custom evaluation criteria if none of the other metrics meets the requirements. Alternatively, there's the **DAGMetric**, whose purpose is similar to that of **GEval**; however, it uses a graph and is fully **deterministic**.

When evaluating a test case, multiple metrics can be used. The test case **passes** if and only if every metric meets its **threshold** and **fails** otherwise.
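
The pass/fail rule can be sketched in a few lines of plain Python. This is an illustration of the logic only, not DeepEval's implementation. It assumes that a metric passes when its score is greater than or equal to its threshold, and the helper name `case_passes` and the example numbers are hypothetical.

```python
from typing import Dict


def case_passes(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Return True only if every metric's score reaches its threshold."""
    return all(scores[name] >= thresholds[name] for name in thresholds)


# Hypothetical thresholds and scores for a single test case.
thresholds = {"answer_relevancy": 0.7, "faithfulness": 0.8}

print(case_passes({"answer_relevancy": 0.9, "faithfulness": 0.85}, thresholds))
print(case_passes({"answer_relevancy": 0.9, "faithfulness": 0.60}, thresholds))
```

In the second call a single metric falls short, which is enough to fail the whole test case.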

Evaluation workflow:

![Image showcasing the evaluation steps](../../img/evaluation-workflow.png "Evaluation workflow")

### Evaluation Dataset

An **evaluation dataset** is just a collection of **LLMTestCase** objects or so-called **Golden** objects. A **Golden** is structurally the same as an **LLMTestCase**, except that it has no `actual_output` and `retrieval_context` fields; these can be generated by your LLM application at evaluation time.

Datasets can be **pushed** to, **stored** on and **pulled** from the **Confident AI** platform.
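
To make the relationship between a golden and a test case concrete, here is a minimal sketch using stand-in dataclasses instead of the real DeepEval classes. `GoldenRecord`, `FilledCase`, `to_test_case` and `fake_rag_app` are all hypothetical names. The stub application stands in for a real RAG pipeline that returns an answer together with the chunks it retrieved.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class GoldenRecord:
    """Stand-in for a golden: no `actual_output`, no `retrieval_context`."""
    input: str
    expected_output: Optional[str] = None


@dataclass
class FilledCase:
    """Stand-in for a test case: the golden plus what the application produced."""
    input: str
    actual_output: str
    retrieval_context: List[str]
    expected_output: Optional[str] = None


def to_test_case(
    golden: GoldenRecord,
    rag_app: Callable[[str], Tuple[str, List[str]]],
) -> FilledCase:
    """Run the application on the golden's input and attach its outputs."""
    answer, contexts = rag_app(golden.input)
    return FilledCase(
        input=golden.input,
        actual_output=answer,
        retrieval_context=contexts,
        expected_output=golden.expected_output,
    )


def fake_rag_app(query: str) -> Tuple[str, List[str]]:
    """Stub pipeline used only for this illustration."""
    return f"Echo: {query}", ["stub context"]


case = to_test_case(GoldenRecord(input="What is a golden?"), fake_rag_app)
print(case)
```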

### LLM provider

**DeepEval** uses **OpenAI** as its default LLM provider; however, **Ollama** is also available. To use it, execute the code cell below. This generates a `.deepeval` file that stores key-value pairs describing that particular LLM provider, such as the model name and base URL.

In [None]:
import os
from typing import Final
from dotenv import load_dotenv

load_dotenv("../../env/rag.env")

CHAT_MODEL: Final[str] = os.getenv("CHAT_MODEL")
EMBEDDING_MODEL: Final[str] = os.getenv("EMBEDDING_MODEL")

! deepeval set-ollama {CHAT_MODEL} --base-url="http://localhost:11434/"
! deepeval set-ollama-embeddings {EMBEDDING_MODEL} --base-url="http://localhost:11434"

🙌 Congratulations! You're now using a local Ollama model for all evals that 
require an LLM.
🙌 Congratulations! You're now using Ollama embeddings for all evals that 
require text embeddings.


### Loading dataset

* If you already have a dataset on the platform just use the `pull` method and specify the name/alias.
* Alternatively, you can save your synthetic dataset on disk and import it from a `json` or `csv` file.
* Finally, you can store a dataset on **ragas.io**, however before using it you need to convert it.
* Note also that if your dataset consists of **goldens** you need to generate the `actual_output` and `retrieval_context` first.

For this project I use **RAGAs** for synthetic test data generation. The full datasets are stored under `/ragas/datasets` and can be loaded from `jsonl` files.

In [None]:
import json
from typing import List, Dict
from ragas import EvaluationDataset, SingleTurnSample

# The datasets are located under `../ragas/datasets`
filepath: str = input("Please specify which dataset to evaluate (only the file name): ")

goldens: List[Dict] = []
try:
    with open(file=f"../ragas/datasets/{filepath}.jsonl", mode="r", encoding="utf-8") as file:
        for line in file:
            if line.strip():  # Skip empty lines
                goldens.append(json.loads(line))

    samples: List[SingleTurnSample] = []
    for golden in goldens:
        single_turn_sample = SingleTurnSample(**golden)
        samples.append(single_turn_sample)
        
    evaluation_dataset = EvaluationDataset(samples)
except FileNotFoundError:
    print(f"File: `../ragas/datasets/{filepath}.jsonl` containing goldens not found!")
except json.JSONDecodeError as e:
    print(f"Error parsing JSONL file: {e}")


### Setup for **RAGAs**

**DeepEval** supports the original **RAGAs** metrics, so the cell below just initializes the required structures.

In [None]:
import os
from dotenv import load_dotenv
from langchain_ollama import ChatOllama, OllamaEmbeddings

from ragas import RunConfig, DiskCacheBackend
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

run_config = RunConfig(
    timeout=86400,    # 24 hours on waiting for a single operation
    max_retries=20,   # Max retries before giving up
    max_wait=600,     # Max wait between retries
    max_workers=4,    # Concurrent requests
    log_tenacity=True # Print retry attempts
)

# This stores data generation and evaluation results locally on disk
# When using it for the first time, it will create a .cache folder
# When using it again, it will read from that folder and finish almost instantly
# For each dataset, a different cache folder is created
cacher = DiskCacheBackend(cache_dir=f".cache-{filepath}")

load_dotenv("../../env/rag.env")

ragas_llm = LangchainLLMWrapper(
    langchain_llm=ChatOllama(
        model=os.getenv("CHAT_MODEL"),
        base_url="http://localhost:11434",
        temperature=float(os.getenv("TEMPERATURE")),
        num_ctx=int(os.getenv("LLM_CONTEXT_WINDOW_TOKENS")),
        format="json"
    ),
    run_config=run_config,
    cache=cacher
)

ragas_embeddings = LangchainEmbeddingsWrapper(
    embeddings=OllamaEmbeddings(
        model=os.getenv("EMBEDDING_MODEL"),
        base_url="http://localhost:11434"
    ),
    run_config=run_config,
    cache=cacher
)

### Evaluating a dataset using metrics

Evaluation in **DeepEval** works as follows:

* Pull or create a dataset containing test cases or goldens:
    - In the case of **goldens** have the LLM generate the `actual_output` and `retrieval_context` fields.
    - Usually the **goldens** are preferred since one can modify the `prompt template` during evaluation.

* Think about the application and the different use cases:
    - What is it doing?
    - How could I assure it's doing what it's supposed to be doing?
    - Check out the existing metrics or create your own one.
    - Select metric/s.

* Run the evaluation:
    - In notebooks use the `evaluate` method.
    - If you choose to create test cases in python file/s:
        - The filename/s should start with `test_` and can be run using `deepeval test run <filename>`.
        - Optionally, pass in flags like `-c` to use the cache.

### RAG Triad

The RAG triad is composed of three metrics:
- **Answer Relevancy**
    - This particular metric can assess the `prompt template`.
    - If the score is insufficient, it might be worth experimenting with the template itself: try giving more detailed, fine-grained instructions or provide few-shot examples.
- **Faithfulness**
    - This metric evaluates the `LLM` and its ability to incorporate context into the generation process.
    - Low scores would usually signify a bad `LLM model` that **hallucinates** or fails to use the context properly.
- **Contextual Relevancy**
    - This metric verifies if the `retrieved context` is relevant for providing a good/complete answer to the query.
    - It reflects the choice of `top-k`, `chunk_size` and `embedding model`.
        - The `embedding model` would be useful to capture relevant chunks of context.
        - `top-k` and `chunk-size` will either help or hinder the context retrieval by including or missing important information.

If a given RAG application scores high on all 3 metrics, one can be confident that suitable `hyperparameters` are being used. `Hyperparameters` are parameters that can influence the RAG pipeline either positively or negatively. For example, a good `embedding model` retrieves relevant data and picks up on nuances, whereas a bad one does not. If you have a simple RAG application that doesn't use `tool calls` or employ an agent, then those 3 metrics can be a good starting point.

![RAG Triad](../../img/RAG-triad.png  "RAG Triad")
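
The mapping from each triad metric to the hyperparameters it reflects can be written down as a small lookup table. The sketch below is only an illustration of the reasoning above. The helper `suggest_tuning`, the metric keys and the threshold of `0.5` are assumptions made for the example.

```python
from typing import Dict, List

# Which hyperparameters each RAG triad metric reflects, as described above.
TRIAD_HYPERPARAMETERS: Dict[str, List[str]] = {
    "answer_relevancy": ["prompt template"],
    "faithfulness": ["LLM"],
    "contextual_relevancy": ["top-k", "chunk size", "embedding model"],
}


def suggest_tuning(scores: Dict[str, float], threshold: float = 0.5) -> Dict[str, List[str]]:
    """For every triad metric scoring below the threshold, list what to revisit."""
    return {
        metric: TRIAD_HYPERPARAMETERS[metric]
        for metric, score in scores.items()
        if metric in TRIAD_HYPERPARAMETERS and score < threshold
    }


# Hypothetical scores: generation looks fine, retrieval does not.
suggestions = suggest_tuning(
    {"answer_relevancy": 0.9, "faithfulness": 0.8, "contextual_relevancy": 0.3}
)
print(suggestions)
```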

### **Answer Relevancy**

It uses an **LLM as a judge** to determine how well the response answers the user input: is the answer complete, on-topic and concise? This metric is especially relevant for testing the **generator of a RAG pipeline**, since it determines how relevant the generated output is to the query. The main **hyperparameter** tested by this metric is the `prompt_template` that the **LLM** receives as its instruction for creating an output (can it instruct the LLM to produce relevant and helpful answers based on the `retrieval_context`?). Tweaking the template submitted to the LLM can yield higher scores.

---

### Approach:

The LLM is used to decompose the `actual_output` into claims, and each claim is then classified by the LLM with respect to the `input`:

* `yes` if relevant

* `no` if irrelevant

* `idk` if non-determined or partially relevant.

---

### Formula:

* Number of relevant statements = Statements marked as `yes` or `idk`

* **Answer Relevancy** = $\frac{\text{Number of relevant statements}}{\text{Total number of statements}}$
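
Once the LLM has produced its verdicts, the score is simple arithmetic. The following is a minimal sketch of the formula above, assuming the verdicts arrive as a list of strings. The function name and the example verdicts are hypothetical, and raising an error for an empty list is a choice made for this sketch.

```python
from typing import List


def answer_relevancy_score(verdicts: List[str]) -> float:
    """Share of statements whose verdict is `yes` or `idk`."""
    if not verdicts:
        raise ValueError("At least one verdict is required.")
    relevant = sum(1 for verdict in verdicts if verdict in ("yes", "idk"))
    return relevant / len(verdicts)


# Four statements: two relevant, one irrelevant, one undetermined.
print(answer_relevancy_score(["yes", "yes", "no", "idk"]))  # 0.75
```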

---

### Framework Comparison:

This metric is very similar, if not identical, to **Response Relevancy** in **RAGAs** in terms of what it evaluates; however, it does so in a different way. In **RAGAs**, hypothetical questions are generated from the response, their embeddings are computed, and the score is the mean **semantic similarity** between those questions and the original input. In **DeepEval**, the response is decomposed into claims and for each one the LLM determines its relevance to the query/input (no embedding model is required).

|           | Answer Relevancy | Response Relevancy |
|-----------|------------------|--------------------|
| Framework | DeepEval | RAGAs |
| Approach  | Statement decomposition + LLM-as-a-judge | Hypothetical questions (LLM) + Semantic similarity |
| Score     | $\in [0, 1]$ | $\in [0, 1]$ |
| Component | Generator | Generator |
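
To illustrate the embedding-based approach used by **RAGAs**, here is a simplified sketch that averages the cosine similarity between the embedding of the original input and the embeddings of the generated questions. It is not the library's implementation. The two-dimensional vectors are toy values, and in practice the embeddings would come from an embedding model.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_question_similarity(
    input_embedding: Sequence[float],
    question_embeddings: List[Sequence[float]],
) -> float:
    """Mean similarity between the input and each generated question."""
    similarities = [cosine_similarity(input_embedding, q) for q in question_embeddings]
    return sum(similarities) / len(similarities)


# Toy embeddings: one generated question matches the input, the other is orthogonal.
score = mean_question_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(score)  # 0.5
```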

In [None]:
from deepeval.metrics import AnswerRelevancyMetric
from prompts.custom_answer_relevancy_prompt import MyAnswerRelevancyTemplate

answer_relevancy = AnswerRelevancyMetric(
    include_reason=False,
    evaluation_template=MyAnswerRelevancyTemplate,
)

### **Faithfulness**

The **Faithfulness** metric measures how factually consistent a response is with the retrieval context. It verifies whether the LLM uses the retrieved context properly and does not hallucinate. Output that contains hallucinations (contradictions or made-up information) is penalized. The score ranges from 0 to 1, with higher scores indicating better factual consistency. The metric tests the `generator` component of the RAG pipeline by checking whether the output contradicts factual information from the `retrieval_context`. A response is considered faithful if its claims are supported by the retrieved context (ideally, every claim/statement in the response is supported by a node/chunk in the context).

---

### Approach:

1. **Identify Claims**:
   - Break down the response into individual statements

2. **Identify Truths**:
   - Break down the retrieval context into individual statements
   - One can configure the number of `truths` to be extracted using the `truths_extraction_limit` parameter

3. **Use LLM**:
   - Use the LLM to determine whether each claim in the response is truthful (does not contradict information in the retrieval context)

---

### Formula:

* Number of truthful claims = Statements/claims from the response marked as `yes` or `idk` **(not contradicting a truth)**

* **Faithfulness** = $\frac{\text{Number of truthful claims}}{\text{Total number of claims}}$
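
As a worked example with hypothetical numbers (not taken from any dataset in this project): suppose the response is decomposed into 5 claims, of which the LLM marks 3 as `yes`, 1 as `idk` and 1 as `no`. The `idk` claim does not contradict any truth, so it counts as truthful:

$$
\text{Faithfulness} = \frac{3 + 1}{5} = 0.8
$$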

---

### Framework Comparison:

|               | Faithfulness | Faithfulness |
|---------------|--------------|--------------|
| Framework     | DeepEval | RAGAs |
| Approach      | Statement decomposition, NLI, LLM-as-a-judge | Statement decomposition, NLI, LLM-as-a-judge |
| Components    | Generator | Generator |
| Scoring       | $\in [0, 1]$ | $\in [0, 1]$ |
| Customization | - | Can use HHEM-2.1-Open classifier |

In [None]:
from deepeval.metrics import FaithfulnessMetric
from prompts.custom_faithfulness_prompt import MyFaithfulnessTemplate 

faithfulness = FaithfulnessMetric(
    include_reason=False,
    evaluation_template=MyFaithfulnessTemplate,
)

### **Contextual Precision**

The **Contextual Precision** metric measures the **re-ranker's ability to rank relevant** `nodes`/`pieces of context` higher than irrelevant ones. This ensures that important information appears at the beginning of the context, which may lead to better generation, since an LLM usually pays more attention to information found at the beginning of the context. Irrelevant nodes that are ranked above relevant ones are penalized (lower score).

---

### Approach:

The **LLM** determines for each **piece of context** whether or not it's relevant for answering the **input** with respect to the **expected output**.

---

### Formula:

* **Note that not every retrieved node is necessarily relevant.**
* **$k$** is the rank (position) of a node in the retrieval context
* **$n$** is the total number of retrieved nodes
* **$r_{k}$** $\in \{0, 1\}$ indicates whether the node at rank $k$ is relevant ($1$) or not ($0$)
* **Contextual Precision** = $\frac{1}{\text{Number of relevant nodes}} \times \sum_{k=1}^{n} \left(\frac{\text{Number of relevant nodes up to rank } k}{k} \times r_{k}\right)$
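
The formula is easier to follow when computed step by step. Below is a minimal sketch that takes the per-rank relevance flags (1 for relevant, 0 for irrelevant) as a list and applies the formula directly. The function name is illustrative, and returning `0.0` when no node is relevant is an assumption made for this sketch.

```python
from typing import List


def contextual_precision(relevance: List[int]) -> float:
    """Weighted cumulative precision over ranked nodes (1 = relevant, 0 = irrelevant)."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # assumption: no relevant node means the lowest score

    relevant_so_far = 0
    weighted_sum = 0.0
    for k, r_k in enumerate(relevance, start=1):
        relevant_so_far += r_k
        # Precision at rank k only contributes when the node at rank k is relevant.
        weighted_sum += (relevant_so_far / k) * r_k
    return weighted_sum / total_relevant


print(contextual_precision([1, 1, 0]))  # relevant nodes ranked first
print(contextual_precision([0, 1, 1]))  # an irrelevant node ranked first
```

The two calls use the same nodes in a different order. The first ranking scores `1.0`, while the second scores `7/12` (about `0.58`) because an irrelevant node sits at the top.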

---

### Framework Comparison:

Both metrics work the exact same way and use the same formula, so one would expect almost identical results.

|              | Contextual Precision | Context Precision |
|--------------|----------------------|-------------------|
| Framework    | DeepEval | RAGAs |
| Approach     | LLM-as-a-judge | LLM-as-a-judge |
| Components   | Retriever | Retriever |
| Scoring      | $\in [0, 1]$ | $\in [0, 1]$ |


In [None]:
from deepeval.metrics import ContextualPrecisionMetric
from prompts.custom_contextual_precision_prompt import MyContextualPrecisionTemplate

contextual_precision = ContextualPrecisionMetric(
    include_reason=False,
    evaluation_template=MyContextualPrecisionTemplate
)

### **Contextual Recall**

The **Contextual Recall** metric measures the extent to which the `retrieval_context` aligns with the `expected_output`. This metric penalizes a **retriever** which misses important/relevant nodes during retrieval, since that would lead to an incomplete response. It uses the `expected output` as a reference for the `retrieval context`. The main `hyperparameter` which is evaluated by the metric is the `embedding model`. Since the `embedding model` is used by the retriever when performing semantic similarity search a good model can positively affect the application. The inverse is also true.

---

### Approach:

The **LLM** determines for each statement in the `expected_output` whether or not it can be attributed to a node from the `retrieval context`.

---

### Formula:

An **attributable statement** is one that contains information present in a node of the context.
* **Contextual Recall** = $ \frac{\text{Number of attributable statements}}{\text{Total number of statements}} $

---

### Framework Comparison:

Both metrics work almost identically. The `expected output` is decomposed into statements and for each of them the LLM determines if it can be attributed to a node from the `retrieved context`. The only minor difference is that in **RAGAs** the input is also explicitly used in determining that.

|              | Contextual Recall | Context Recall |
|--------------|-------------------|----------------|
| Framework    | DeepEval | RAGAs |
| Approach     | LLM-as-a-judge | LLM-as-a-judge |
| Components   | Retriever | Retriever |
| Scoring      | $\in [0, 1]$ | $\in [0, 1]$ |


In [None]:
from deepeval.metrics import ContextualRecallMetric
from prompts.custom_contextual_recall_prompt import MyContextualRecallTemplate

contextual_recall = ContextualRecallMetric(
    include_reason=False,
    evaluation_template=MyContextualRecallTemplate
)

### **Contextual Relevancy**

The **Contextual Relevancy** metric measures the extent to which the `retrieval_context` is useful for generating a proper response to the **input**. The hyperparameters assessed in this case are the `chunk size` and `top-k`. If the `chunk size` is large, the retrieved nodes may contain the required information but also redundant information, which can cause the LLM to hallucinate. The other parameter, `top-k`, limits the number of nodes considered as context for the generation phase.

---

### Approach:

The **LLM** receives the **input** and **retrieval_context** and is prompted to decide for each statement found in the context, whether or not it can be of use for answering the input.

---

### Formula:

A **relevant statement** is a statement found in the context that can be used to answer the user input.
* **Contextual Relevancy** = $ \frac{\text{Number of relevant statements}}{\text{Total number of statements}} $
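
The effect of `top-k` on this score can be illustrated with a small sketch. It assumes the LLM verdicts are already available as one list of booleans per retrieved node (one flag per statement in that node). The function name and the verdicts are hypothetical, and returning `0.0` for an empty context is a choice made for this sketch.

```python
from typing import List


def contextual_relevancy(verdicts_per_node: List[List[bool]]) -> float:
    """Share of relevant statements across all retrieved nodes."""
    statements = [verdict for node in verdicts_per_node for verdict in node]
    if not statements:
        return 0.0  # assumption: an empty context is treated as irrelevant
    return sum(statements) / len(statements)


# Hypothetical verdicts for four ranked nodes; the last two add only noise.
nodes = [[True, True], [True, False], [False, False], [False]]

print(contextual_relevancy(nodes[:2]))  # top-k = 2
print(contextual_relevancy(nodes[:4]))  # top-k = 4
```

With `top-k = 2` the score is `3/4`. Raising `top-k` to 4 adds three irrelevant statements and lowers the score to `3/7`.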

In [None]:
from deepeval.metrics import ContextualRelevancyMetric
from prompts.custom_contextual_relevancy_prompt import MyContextualRelevancyTemplate

contextual_relevancy = ContextualRelevancyMetric(
    include_reason=False,
    evaluation_template=MyContextualRelevancyTemplate
)

### **Contextual Entities Recall**

This metric has no native **DeepEval** implementation; it is used here through the **RAGAs** integration. For an explanation of how it works, see the evaluation notebook in the `ragas` folder.

In [None]:
from deepeval.metrics.ragas import RAGASContextualEntitiesRecall

ragas_contextual_entities_recall = RAGASContextualEntitiesRecall(
    model=llm
)

Finally, we evaluate the application using our dataset and metrics.

In [None]:
from deepeval import evaluate
from deepeval.evaluate.configs import (
    AsyncConfig,
    CacheConfig,
    DisplayConfig,
    TestRunResultDisplay
)
from deepeval.evaluate.evaluate import EvaluationResult

async_conf = AsyncConfig(
    run_async=True,
    throttle_value=120,
    max_concurrent=10
)

cache_conf = CacheConfig(
    write_cache=True,
    use_cache=True
)

display_conf = DisplayConfig(
    show_indicator=True,
    print_results=True,
    verbose_mode=False,
    # Displays only the failed tests
    # The ones whose score didn't exceed the threshold
    display_option=TestRunResultDisplay.FAILING
)

# There's also an `ErrorConfig` for further customization
results: EvaluationResult = evaluate(
    test_cases=evaluation_dataset.test_cases,
    metrics=[
        answer_relevancy,
        ragas_answer_relevancy,
        faithfulness,
        ragas_faithfulness,
        contextual_precision,
        ragas_contextual_precision,
        contextual_recall,
        ragas_contextual_recall,
        ragas_contextual_entities_recall,
        contextual_relevancy,
    ],
    async_config=async_conf,
    display_config=display_conf,
    cache_config=cache_conf
)