### Setup

* Run the `setup.sh` script from the root of the `evaluation` folder.
* If the script cannot be executed, first run `chmod u+x setup.sh`.
* The script installs all required dependencies.
* Finally, select the resulting environment in the notebook by choosing `eval` as the kernel.

### Confident AI

1. In short, **Confident AI** is a cloud-based platform, fully compatible with the **DeepEval** project, which can store generated **datasets** and the results of **evaluations**.

2. If you want to use the **Confident AI** platform, create an account here: [Confident AI](https://www.confident-ai.com/)

3. After signing up, an **API key** will be generated, which can be used to interact with the platform from inside the notebook.

4. Bear in mind that if you want to use **Confident AI** with **DeepEval**, you need to keep `deepeval` at the most recent version; otherwise you will get an exception.

---

Example of `.env` file:
```bash
DEEPEVAL_RESULTS_FOLDER=<folder> # Results of evaluations can be saved locally
DEEPEVAL_API_KEY=<your api key>  # Relevant if you want to use Confident AI
DEEPEVAL_TELEMETRY_OPT_OUT="YES" # Remove telemetry
```

In [None]:
import os
from dotenv import load_dotenv
from deepeval import login_with_confident_api_key

# Loads the environment variables from a `.env` file.
# If you want to use Confident AI, create the `.env` file one directory above this notebook.
load_dotenv("../.env")

deepeval_api_key: str = os.getenv("DEEPEVAL_API_KEY")

# You should get a message letting you know you are logged-in.
login_with_confident_api_key(deepeval_api_key)

### LLMTestCase

Unlike **RAGAs**, where a single interaction between a user and the AI system is represented by either a **SingleTurnSample** or a **MultiTurnSample**, **DeepEval** uses the so-called **LLMTestCase** and **ConversationalTestCase**. For this project, only the **LLMTestCase** is relevant. **LLMTestCase** objects carry the same information as their **RAGAs** counterparts, just under different field names: `input`, `actual_output`, `expected_output`, etc.

![Image showcasing what a LLMTestCase is.](https://deepeval-docs.s3.amazonaws.com/llm-test-case.svg "LLMTestCase")

### Evaluation

**Evaluation** should be a crucial component of every application that uses **AI**. **DeepEval** provides more than 14 evaluation metrics, so one can easily iterate towards a better LLM application. Each default metric uses **LLM-as-a-judge**. Optionally, one can use **GEval** to define custom evaluation criteria if none of the other metrics meets the requirements. Alternatively, there's the **DAGMetric**, whose purpose is similar to that of **GEval**; however, it uses a graph and is fully **deterministic**.

When evaluating a test case, multiple metrics can be used. The test is **positive** if and only if every metric meets its **threshold**, and **negative** otherwise.
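
The pass/fail rule above can be sketched in a few lines of plain Python. This is only an illustration: `MetricResult` and `case_passes` are hypothetical names, not DeepEval classes, and the sketch assumes that a metric passes when its score is at least its threshold.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical stand-in for one metric's outcome (not a DeepEval class)."""
    name: str
    score: float
    threshold: float

    def passed(self) -> bool:
        # Assumption of this sketch: passing means score >= threshold.
        return self.score >= self.threshold


def case_passes(results: list[MetricResult]) -> bool:
    # The test case is positive only if every single metric passes.
    return all(result.passed() for result in results)


results = [
    MetricResult("answer_relevancy", score=0.9, threshold=0.7),
    MetricResult("faithfulness", score=0.6, threshold=0.7),
]
print(case_passes(results))  # False: faithfulness is below its threshold
```

A single failing metric is enough to make the whole test case negative, regardless of how well the other metrics score.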

Evaluation workflow:
![Image of evaluation workflow](https://d2lsxfc3p6r9rv.cloudfront.net/workflow.png "Evaluation comparison")

### Evaluation Dataset

An **evaluation dataset** is a collection of **LLMTestCase** or so-called **Golden** objects. A **Golden** is structurally the same as an **LLMTestCase**, except that it lacks the `actual_output` and `retrieval_context` fields; these can be generated by your LLM at evaluation time.
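
To make the relationship concrete, here is a minimal sketch of turning a golden into a test case. The dataclasses `SketchGolden` and `SketchTestCase` and the function `fake_rag_pipeline` are hypothetical stand-ins written for this example; they only mirror the field names mentioned in this notebook and are not DeepEval's own classes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchGolden:
    """Stand-in for a golden: nothing generated by the application yet."""
    input: str
    expected_output: Optional[str] = None


@dataclass
class SketchTestCase:
    """Stand-in for a test case: a golden plus the generated fields."""
    input: str
    actual_output: str
    expected_output: Optional[str] = None
    retrieval_context: list[str] = field(default_factory=list)


def fake_rag_pipeline(query: str) -> tuple[str, list[str]]:
    # Placeholder for the real application: returns an answer and its context.
    context = ["Paris is the capital of France."]
    return "Paris", context


def to_test_case(golden: SketchGolden) -> SketchTestCase:
    # The missing fields are filled in at evaluation time.
    answer, context = fake_rag_pipeline(golden.input)
    return SketchTestCase(
        input=golden.input,
        actual_output=answer,
        expected_output=golden.expected_output,
        retrieval_context=context,
    )


golden = SketchGolden(input="What is the capital of France?", expected_output="Paris")
test_case = to_test_case(golden)
print(test_case.actual_output, test_case.retrieval_context)
```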

Datasets can be **pushed** to, **stored** on, and **pulled** from the **Confident AI** platform.

### LLM provider

**DeepEval** uses **OpenAI** as its LLM provider by default; however, **Ollama** is also available. To use it, execute the code cell below. This generates a `.deepeval` file that stores key-value pairs describing that particular LLM provider, such as the model name, base URL and so on.

In [2]:
!deepeval set-ollama llama3.1:latest --base-url="http://localhost:11434/"
!deepeval set-ollama-embeddings mxbai-embed-large --base-url="http://localhost:11434"

🙌 Congratulations! You're now using a local Ollama model for all evals that 
require an LLM.
🙌 Congratulations! You're now using Ollama embeddings for all evals that 
require text embeddings.


### Pulling a **dataset** from **Confident AI**

* If you already have a dataset on the platform just use the `pull` method and specify the name/alias.
* Alternatively, you can save your synthetic dataset on disk and import it from a `json` or `csv` file.

In [None]:
from deepeval.dataset import EvaluationDataset

evaluation_dataset = EvaluationDataset()
evaluation_dataset.pull(
    os.getenv("DATASET_ALIAS"),
    auto_convert_goldens_to_test_cases=True
)

### Setup for **RAGAs**

**DeepEval** supports the original **RAGAs** metrics, so the cell below just initializes the required structures.

In [None]:
from dotenv import load_dotenv
from langchain_ollama import ChatOllama, OllamaEmbeddings

load_dotenv("../../env/rag.env")

llm = ChatOllama(
    model=os.getenv("CHAT_MODEL"),
    base_url="http://localhost:11434",
    temperature=float(os.getenv("TEMPERATURE")),
    num_ctx=int(os.getenv("LLM_CONTEXT_WINDOW_TOKENS")),
    format="json"
)

embeddings = OllamaEmbeddings(
    model=os.getenv("EMBEDDING_MODEL"),
    base_url="http://localhost:11434"
)

### Evaluating a dataset using metrics

Evaluation in **DeepEval** works as follows:
* Pull or create a dataset containing test cases or goldens:
    - In the case of **goldens** have the LLM generate the `actual_output` and `retrieval_context` fields.
    - Usually the **goldens** are preferred since one can modify the `prompt template` during evaluation.
* Think about the application and the different use cases:
    - What is it doing?
    - How could I assure it's doing what it's supposed to be doing?
    - Check out the existing metrics or create your own one.
    - Create metric/s.
* Run the evaluation:
    - In notebooks use the `evaluate` method.
    - If you choose to create test cases in Python file(s):
        - The filename(s) should start with `test_`, and each file can be run using `deepeval test run <filename>`.
        - Optionally, pass in flags like `-c` to use the cache.

### **Answer Relevancy**

It uses the **LLM as a judge** to determine how well the response answers the user input. This metric is especially relevant for testing the **generator** of the **RAG pipeline**, since it measures how relevant the generated output is to the query. The main **hyperparameter** tested by this metric is the `prompt_template` that the **LLM** receives as its instruction for creating an output (can it instruct the LLM to produce relevant and helpful outputs based on the `retrieval_context`?).

---

### Approach:

The LLM is used to decompose the `actual_output` into claims, and each of those claims is then classified by the LLM with respect to the `input`:
* `yes` if relevant
* `no` if irrelevant
* `idk` if undetermined or partially relevant.

---

### Formula:

* Number of relevant statements = Statements marked as `yes` or `idk`  
* **Answer Relevancy** = $\frac{\text{Number of relevant statements}}{\text{Total number of statements}}$
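
The formula can be sketched as a small function over the verdicts the judge returns. The verdict list below is made up for illustration, and returning `0.0` for an empty list is an assumption of this sketch rather than documented DeepEval behaviour.

```python
def answer_relevancy(verdicts: list[str]) -> float:
    """Fraction of statements whose verdict is `yes` or `idk`."""
    if not verdicts:
        return 0.0  # assumption of this sketch: no statements means score 0
    relevant = sum(1 for verdict in verdicts if verdict in ("yes", "idk"))
    return relevant / len(verdicts)


# Hypothetical verdicts for a response decomposed into four statements.
verdicts = ["yes", "idk", "no", "yes"]
score = answer_relevancy(verdicts)
print(score)  # 0.75
```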

---

### Framework Comparison:

This metric is very similar, if not identical, to **Response Relevancy** in **RAGAs** in terms of what it evaluates; however, it does so in a different way. In **RAGAs**, hypothetical questions are generated from the response, their embeddings are computed, and the score is the mean **semantic similarity** between those questions and the original input. In **DeepEval**, the response is decomposed into claims, and for each claim the LLM determines its relevance to the query/input (no embedding model is required).

|           | Answer Relevancy | Response Relevancy |
|-----------|------------------|--------------------|
| Framework | DeepEval | RAGAs |
| Approach  | Statement decomposition + LLM-as-a-judge | Hypothetical questions + Semantic similarity |
| Score     | $\in [0, 1]$ | $\in [0, 1]$ |
| Component | Generator | Generator |
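
For contrast, the **RAGAs**-style scoring step can be sketched with toy vectors. The embeddings below are invented two-dimensional values, the function names are hypothetical, and the sketch assumes that each generated question is compared with the original input using cosine similarity; it is not RAGAs' actual implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def response_relevancy(input_embedding: list[float],
                       question_embeddings: list[list[float]]) -> float:
    # Mean similarity between the input and each hypothetical question.
    similarities = [cosine_similarity(input_embedding, q) for q in question_embeddings]
    return sum(similarities) / len(similarities)


# Toy embeddings: one generated question matches the input, one is orthogonal.
input_embedding = [1.0, 0.0]
question_embeddings = [[1.0, 0.0], [0.0, 1.0]]
score = response_relevancy(input_embedding, question_embeddings)
print(score)  # 0.5
```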

In [None]:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics.ragas import RAGASAnswerRelevancyMetric
from prompts.custom_answer_relevancy_prompt import MyAnswerRelevancyTemplate

answer_relevancy = AnswerRelevancyMetric(
    evaluation_template=MyAnswerRelevancyTemplate,
    include_reason=False
)
ragas_answer_relevancy = RAGASAnswerRelevancyMetric(
    model=llm,
    embeddings=embeddings
)

### **Faithfulness**

The **Faithfulness** metric measures how factually consistent a response is with the retrieval context. It ranges from 0 to 1, with higher scores indicating better factual consistency. The metric tests the `generator` component of the RAG pipeline and verifies whether the output contradicts factual information from the **retrieval_context**.

A response is considered faithful if its claims can be supported by the retrieved context (ideally we would want a response whose claims/statements are all supported by a node/chunk in the context).

---

### Approach:

1. **Identify Claims**:
   - Break down the response into individual statements

2. **Identify Truths**:
   - Break down the retrieval context into individual statements
   - One can configure the number of `truths` to be extracted using the `truths_extraction_limit` parameter

3. **Use LLM**:
   - Use the LLM to determine whether each claim contradicts any truth from the context

---

### Formula:

* Number of truthful claims = Statements/claims from the response marked as `yes` or `idk` **(not contradicting a truth)**
* **Faithfulness** = $\frac{\text{Number of truthful claims}}{\text{Total number of claims}}$
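
As a worked example with made-up numbers (not taken from any real evaluation): suppose the response is decomposed into 5 claims, of which the LLM marks 3 as `yes`, 1 as `idk` and 1 as `no`. Under the formula above, the `idk` claim still counts as truthful because it does not contradict a truth:

$$
\text{Faithfulness} = \frac{3 + 1}{5} = 0.8
$$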

---

### Framework Comparison:

|              | Faithfulness | Faithfulness |
|--------------|-------------------------|---------------------|
| Framework    | DeepEval                | RAGAs               |
| Approach     | Statement decomposition, LLM-as-a-judge | Statement decomposition, LLM-as-a-judge |
| Components   | Generator             | Generator           |
| Scoring      | $\in [0, 1]$          | $\in [0, 1]$        |
| Customization| - | Can use HHEM-2.1-Open classifier |

In [None]:
from deepeval.metrics import FaithfulnessMetric
from deepeval.metrics.ragas import RAGASFaithfulnessMetric
from prompts.custom_faithfulness_prompt import MyFaithfulnessTemplate 

faithfulness = FaithfulnessMetric(
    include_reason=False,
    evaluation_template=MyFaithfulnessTemplate
)

ragas_faithfulness = RAGASFaithfulnessMetric(
    model=llm
)

### **Contextual Precision**

The **Contextual Precision** metric measures the **re-ranker's ability to rank relevant** `nodes`/`pieces of context` higher than irrelevant ones in the retrieval context. This ensures that important information appears at the beginning of the context, which may lead to better generation, since an LLM usually focuses more on information found at the beginning of the context. Irrelevant nodes that are ranked higher than they should be are penalized (lower score).

---

### Approach:

The **LLM** determines for each **piece of context** whether or not it's relevant for answering the **input** with respect to the **expected output**.

---

### Formula:

* **Note that not every retrieved node is relevant**
* **k** is the rank (position of a node in the retrieval context)
* **n** is the total number of retrieved nodes
* **$r_{k}$** $\in \{0, 1\}$ is the relevance of the node at rank **k** (1 if relevant, 0 otherwise)
* **Contextual Precision** = $\frac{1}{\text{Number of relevant nodes}} \times \sum_{k=1}^{n} \left(\frac{\text{Number of relevant nodes up to rank } k}{k} \times r_{k}\right)$
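
The formula can be checked on small examples with the sketch below. The relevance lists are invented, `relevance[k - 1]` plays the role of $r_{k}$, and returning `0.0` when no node is relevant is an assumption of this sketch.

```python
def contextual_precision(relevance: list[int]) -> float:
    """relevance[k - 1] is r_k: 1 if the node at rank k is relevant, else 0."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # assumption of this sketch: no relevant nodes means score 0
    weighted_sum = 0.0
    relevant_so_far = 0
    for k, r_k in enumerate(relevance, start=1):
        relevant_so_far += r_k
        # Precision at rank k, counted only when the node at rank k is relevant.
        weighted_sum += (relevant_so_far / k) * r_k
    return weighted_sum / total_relevant


well_ranked = contextual_precision([1, 1, 0])    # relevant nodes first
poorly_ranked = contextual_precision([0, 1, 1])  # irrelevant node on top
print(well_ranked, poorly_ranked)
```

Both lists contain the same two relevant nodes; only their order differs. The first ordering scores 1.0, while the second scores 7/12 (about 0.58) because an irrelevant node occupies the top rank.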

---

### Framework Comparison:

Both metrics work in exactly the same way and use the same formula, so one would expect almost identical results.

|              | Contextual Precision | Context Precision |
|--------------|-------------------------|---------------------|
| Framework    | DeepEval                | RAGAs               |
| Approach     | LLM-as-a-judge | LLM-as-a-judge |
| Components   | Retriever             | Retriever           |
| Scoring      | $\in [0, 1]$          | $\in [0, 1]$        |


In [None]:
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.metrics.ragas import RAGASContextualPrecisionMetric

contextual_precision = ContextualPrecisionMetric(
    threshold=0.7,
)

ragas_contextual_precision = RAGASContextualPrecisionMetric(
    threshold=0.5,
    model=llm
)

### **Contextual Recall**

The **Contextual Recall** metric measures the extent to which the `retrieval_context` aligns with the `expected_output`. This metric penalizes a **retriever** which misses important/relevant nodes during retrieval, since that would lead to an incomplete response. It uses the `expected_output` as a reference for the `retrieval_context`. In other words, it evaluates whether the embedding model in the retriever is able to accurately capture and retrieve relevant information based on the context of the input.

---

### Approach:

The **LLM** determines for each statement in the `expected_output` whether or not it can be attributed to a node from the `retrieval context`.

---

### Formula:

An **attributable statement** is one that contains information present in a node of the context.
* **Contextual Recall** = $\frac{\text{Number of attributable statements}}{\text{Total number of statements}}$

---

### Framework Comparison:

Both metrics work almost identically. The `expected_output` is decomposed into statements, and for each of them the LLM determines whether it can be attributed to a node from the `retrieval_context`. The only minor difference is that **RAGAs** also explicitly uses the input when making that decision.

|              | Contextual Recall | Context Recall |
|--------------|-------------------------|---------------------|
| Framework    | DeepEval                | RAGAs               |
| Approach     | LLM-as-a-judge | LLM-as-a-judge |
| Components   | Retriever             | Retriever           |
| Scoring      | $\in [0, 1]$          | $\in [0, 1]$        |


In [None]:
from deepeval.metrics import ContextualRecallMetric
from deepeval.metrics.ragas import RAGASContextualRecallMetric
from prompts.llama31_contextual_recall_prompt import Llama31ContextualRecallTemplate

contextual_recall = ContextualRecallMetric(
    threshold=0.7,
    evaluation_template=Llama31ContextualRecallTemplate
)

ragas_contextual_recall = RAGASContextualRecallMetric(
    threshold=0.5,
    model=llm
)

### **Contextual Relevancy**

The **Contextual Relevancy** metric measures the extent to which the `retrieval_context` is useful for generating a proper response to the **input**. In other words, it evaluates whether the text chunk size and top-K of the retriever allow it to retrieve information without too many irrelevancies.

---

### Approach:

The **LLM** receives the **input** and the **retrieval_context** and is prompted to decide, for each statement found in the context, whether it is useful for answering the input.

---

### Formula:

A **relevant statement** is a statement found in the context that can be used to answer the user input.
* **Contextual Relevancy** = $\frac{\text{Number of relevant statements}}{\text{Total number of statements}}$

---

### Framework Comparison:

No such metric is available in **RAGAs**.

In [None]:
from deepeval.metrics import ContextualRelevancyMetric
from prompts.llama31_contextual_relevancy_prompt import Llama31ContextualRelevancyTemplate

contextual_relevancy = ContextualRelevancyMetric(
    threshold=0.7,
    evaluation_template=Llama31ContextualRelevancyTemplate
)

### **ContextualEntitiesRecall**

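This metric has no native implementation in **DeepEval**; the cell below uses the **RAGAs** version through the `deepeval.metrics.ragas` module. For an explanation of how it works, please go to `/evaluation/ragas/evalute.ipynb`.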

In [None]:
from deepeval.metrics.ragas import RAGASContextualEntitiesRecall

ragas_contextual_entities_recall = RAGASContextualEntitiesRecall(
    threshold=0.5,
    model=llm
)

### **Hallucination**

The **Hallucination** metric measures the ability of the **LLM** to properly utilize context. In some instances, even a perfect prompt template with all the relevant context can still result in an invalid or partially invalid response that contradicts data from the knowledge base. This metric ensures that such cases are penalized.

---

### Approach:

The **LLM** receives a list of `context` nodes (data which will be part of the knowledge base) and the `actual_output`. Unlike the `Faithfulness` metric, where statements from the `actual_output` are compared against nodes in the `retrieval_context`, this one uses nodes from the `context` and checks that the `actual_output` does not contradict them. A contradiction is a statement/claim in the `actual_output` that is fully inconsistent with a context node. Missing information is not a contradiction.

---

### Formula:
 
* **Hallucination** = $ \frac{\text{Number of contradicted contexts}}{\text{Total number of contexts}} $
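
A minimal sketch of this formula, using invented per-context verdicts. Note that, unlike the other metrics in this notebook, a higher value here means more contradictions. The function name and the `0.0` result for an empty list are assumptions of this sketch.

```python
def hallucination_score(contradicted: list[bool]) -> float:
    """contradicted[i] is True if the output contradicts context node i."""
    if not contradicted:
        return 0.0  # assumption of this sketch: no contexts means score 0
    return sum(contradicted) / len(contradicted)


# Hypothetical verdicts: the output contradicts one of four context nodes.
score = hallucination_score([False, True, False, False])
print(score)  # 0.25
```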

---

### Framework Comparison:

No such metric is available in **RAGAs**.

In [None]:
from deepeval.metrics import HallucinationMetric
from prompts.llama31_hallucination_prompt import Llama31HallucinationTemplate

hallucination = HallucinationMetric(
    threshold=0.7,
    evaluation_template=Llama31HallucinationTemplate
)

In [None]:
import deepeval

@deepeval.log_hyperparameters(model="gpt-4", prompt_template="...")
def custom_parameters():
    return {}

In [None]:
from deepeval import evaluate
from deepeval.evaluate.configs import (
    AsyncConfig,
    CacheConfig,
    DisplayConfig,
    TestRunResultDisplay
)

async_conf = AsyncConfig(
    run_async=True,
    throttle_value=120,
    max_concurrent=10
)

cache_conf = CacheConfig(
    write_cache=True,
    use_cache=True
)

display_conf = DisplayConfig(
    show_indicator=True,
    print_results=True,
    verbose_mode=False,
    # Displays only the failed tests
    # The ones whose score didn't exceed the threshold
    display_option=TestRunResultDisplay.FAILING
)

# There's also an `ErrorConfig` for further customization
# evaluate(
    
# )
