### Confident AI

1. In short, **Confident AI** is the cloud-based platform that accompanies the **DeepEval** framework; it stores **datasets**, **evaluations** and **monitoring data**.

2. To use the **Confident AI** platform, create an account here: [Confident AI](https://www.confident-ai.com/)

3. After signing up, an **API key** is generated, which can be used to interact with the platform from inside the notebook.

---

Example of a `.env` file:
```bash
DEEPEVAL_RESULTS_FOLDER=<folder> # Results of evaluations can be saved locally
DEEPEVAL_API_KEY=<your api key>  # Relevant if you want to use Confident AI
DEEPEVAL_TELEMETRY_OPT_OUT="YES" # Opt out of telemetry
```

In [1]:
import os
from dotenv import load_dotenv
from deepeval import login_with_confident_api_key

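# Load the environment variables from a `.env` file.
# If you want to use Confident AI, be sure to create one in this directory.
load_dotenv()

# os.getenv returns None when the variable is missing, so fail early with a clear message.
deepeval_api_key = os.getenv("DEEPEVAL_API_KEY")
if not deepeval_api_key:
    raise RuntimeError("DEEPEVAL_API_KEY is not set; add it to the .env file.")

# You should get a message letting you know you are logged in.
login_with_confident_api_key(deepeval_api_key)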

### LLMTestCase

Unlike **RAGAs**, where a single interaction between a user and the AI system is represented by either a **SingleTurnSample** or a **MultiTurnSample**, **DeepEval** uses the so-called **LLMTestCase**/**MLLMTestCase** and **ConversationalTestCase**. For this project, **LLMTestCase** is the relevant one. An **LLMTestCase** carries the same information as a **RAGAs** sample, just under different field names: `input`, `actual_output`, `expected_output`, etc.

![Image showcasing what a LLMTestCase is.](https://confident-docs.s3.amazonaws.com/llm-test-case.svg "LLMTestCase")
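
The field correspondence can be written down as a plain mapping. The sketch below uses only the standard library; the RAGAs names on the left are the columns of the dataset used later in this notebook, and `rename_ragas_fields` is a hypothetical helper that belongs to neither library.

```python
# Mapping from RAGAs single-turn field names to LLMTestCase field names.
RAGAS_TO_DEEPEVAL = {
    "user_input": "input",
    "response": "actual_output",
    "reference": "expected_output",
    "reference_contexts": "context",
    "retrieved_contexts": "retrieval_context",
}


def rename_ragas_fields(sample: dict) -> dict:
    """Hypothetical helper: rename known RAGAs keys and drop everything else."""
    return {
        RAGAS_TO_DEEPEVAL[key]: value
        for key, value in sample.items()
        if key in RAGAS_TO_DEEPEVAL
    }


sample = {
    "user_input": "What is DeepEval?",
    "response": "An LLM evaluation framework.",
    "reference": "DeepEval is a framework for evaluating LLM applications.",
}
print(rename_ragas_fields(sample))
```

The resulting dictionary has the keyword names that an `LLMTestCase` expects, so it could be unpacked into the constructor.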

### Evaluation

**Evaluation** should be a crucial component of every application that uses **AI**. **DeepEval** provides more than 14 evaluation metrics, which makes it easy to iterate towards a better LLM application. Each default metric uses **LLM-as-a-judge**. If none of the built-in metrics meets the requirements, one can use **GEval** to define custom evaluation criteria. Alternatively, there is the **DAGMetric**, which serves a similar purpose to **GEval** but uses a graph and is fully **deterministic**.

When evaluating a test case, multiple metrics can be used; the test is **positive** iff every metric passes its **threshold**, and **negative** in any other case.
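
The pass rule amounts to a logical AND over all metrics. The following is a minimal sketch of that rule, not DeepEval's implementation; it assumes that higher scores are better and that a score equal to the threshold passes. The function name and the metric names are made up for illustration.

```python
def all_metrics_pass(scores: dict, thresholds: dict) -> bool:
    """Return True only if every metric's score reaches its threshold.

    Assumption: higher is better and a score equal to the threshold passes.
    """
    return all(scores[name] >= thresholds[name] for name in thresholds)


thresholds = {"Correctness": 0.5, "Faithfulness": 0.7}

# Every metric reaches its threshold, so the test is positive.
print(all_metrics_pass({"Correctness": 0.8, "Faithfulness": 0.9}, thresholds))  # True

# A single metric below its threshold makes the whole test negative.
print(all_metrics_pass({"Correctness": 0.8, "Faithfulness": 0.6}, thresholds))  # False
```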

Evaluation workflow:
![Image of evaluation workflow](https://d2lsxfc3p6r9rv.cloudfront.net/workflow.png "Evaluation workflow")

---

**Confident AI** enables users to compare evaluation runs with one another. In some instances a new evaluation run might yield better overall results yet contain a **failing** test that was previously **successful**. This is known as a **regression** and is marked in **red**. Tests marked in **green** show an improvement.

![Image of evaluation run comparison](https://confident-docs.s3.us-east-1.amazonaws.com/comparison-page.png "Evaluation comparison")
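
To make the idea of a regression concrete, here is a small standard-library sketch that compares the pass/fail outcomes of two runs. It only illustrates the concept and is not how the platform computes its comparison; `compare_runs` and the test-case names are hypothetical.

```python
def compare_runs(previous: dict, current: dict) -> dict:
    """Classify the test cases shared by two evaluation runs.

    Both arguments map a test-case name to True (passed) or False (failed).
    """
    shared = previous.keys() & current.keys()
    return {
        "regressions": sorted(n for n in shared if previous[n] and not current[n]),
        "improvements": sorted(n for n in shared if not previous[n] and current[n]),
    }


run_1 = {"case_a": True, "case_b": False, "case_c": False}
run_2 = {"case_a": False, "case_b": True, "case_c": True}

# run_2 passes more tests overall, yet case_a has regressed.
print(compare_runs(run_1, run_2))
# {'regressions': ['case_a'], 'improvements': ['case_b', 'case_c']}
```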

* The LLM development workflow that is highly recommended with **Confident AI** is:
    - Curate a dataset (unless you already have one available)
    - Run evaluations with dataset
    - Analyze evaluation results
    - Improve LLM application based on evaluation results
    - Run another evaluation on the same dataset

### Evaluation Dataset

An **evaluation dataset** is just a collection of **LLMTestCase** objects or so-called **Golden** objects. A **Golden** is structurally similar to an **LLMTestCase**, but its `actual_output` and `retrieval_context` fields are typically left unset, since they can be generated by your LLM application at evaluation time.
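
The relationship between a golden and a test case can be sketched with plain dataclasses. The classes below are simplified stand-ins, not DeepEval's `Golden` or `LLMTestCase`, and `fake_rag_app` is a placeholder for a real application that returns an answer together with the chunks it retrieved.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class GoldenSketch:
    """Stand-in for a golden: no actual_output or retrieval_context yet."""
    input: str
    expected_output: Optional[str] = None


@dataclass
class CaseSketch:
    """Stand-in for a test case: the golden plus what the application produced."""
    input: str
    actual_output: str
    expected_output: Optional[str] = None
    retrieval_context: List[str] = field(default_factory=list)


def fill_golden(
    golden: GoldenSketch,
    app: Callable[[str], Tuple[str, List[str]]],
) -> CaseSketch:
    """Run the application on the golden's input and record its output."""
    answer, retrieved = app(golden.input)
    return CaseSketch(
        input=golden.input,
        actual_output=answer,
        expected_output=golden.expected_output,
        retrieval_context=retrieved,
    )


def fake_rag_app(question: str) -> Tuple[str, List[str]]:
    """Placeholder for a real RAG pipeline: returns (answer, retrieved chunks)."""
    return f"Answer to: {question}", ["chunk 1", "chunk 2"]


golden = GoldenSketch(input="What is a golden?", expected_output="A test case template.")
print(fill_golden(golden, fake_rag_app))
```

Because the outputs are produced at evaluation time, the same goldens can be reused after every change to the application.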

Datasets can be **pushed** to, **stored** on and **pulled** from the **Confident AI** platform.

### Synthetic datasets

**DeepEval** can be used to generate **synthetic datasets** as well. The **Synthesizer** object is highly customizable.

Example:
```python
from deepeval.synthesizer import Synthesizer
from deepeval.dataset import EvaluationDataset

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(document_paths=['example.txt', 'example.docx', 'example.pdf'])

# Since the synthesizer generates so-called goldens they don't have actual_output and retrieval_context fields
# You can generate them prior to evaluation or during evaluation time
dataset = EvaluationDataset(goldens=goldens)
```

In [None]:
# Set Ollama as LLM provider

!deepeval set-local-model --model-name=llama3.1:latest --base-url="http://localhost:11434/" --api-key="ollama" --format=json

🙌 Congratulations! You're now using a local model for all evals that require an
LLM.


### Pushing a **dataset** to **Confident AI**

Since I already have a dataset generated by **RAGAs**, I would like to create a **DeepEval** equivalent and upload it to the cloud for future use.

In [None]:
import ast 
import typing as t
import pandas as pd
from pandas import DataFrame
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

def upload_ragas_dataset_to_confident_ai(filepath: str, dataset_name: str):
    try:
        dataset: DataFrame = pd.read_csv(filepath)
        test_cases: t.List[LLMTestCase] = []
        for _, row in dataset.iterrows():
            test_cases.append(
                LLMTestCase(
                    input=row['user_input'],
                    actual_output=row['response'],
                    expected_output=row['reference'],
                    context=ast.literal_eval(row['reference_contexts']),
                    retrieval_context=ast.literal_eval(row['retrieved_contexts']),        
                )
            )
            
        deepeval_dataset: EvaluationDataset = EvaluationDataset(test_cases)
        deepeval_dataset.push(
            alias=dataset_name,
            auto_convert_test_cases_to_goldens=True
        )
    except FileNotFoundError as fnfe:
        print(fnfe.strerror)
    except (TypeError, ValueError, SyntaxError) as err:
        # ast.literal_eval raises ValueError/SyntaxError on malformed list strings
        print(str(err))
    

In [73]:
upload_ragas_dataset_to_confident_ai("../ragas/dataset.csv", "RAGAs Dataset")


Opening in existing browser session.
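
The `ast.literal_eval` calls in the upload function are needed because a CSV file stores list-valued columns as their string representation. Here is a minimal standard-library sketch of that step; the row content is made up, and the notebook itself reads the file with pandas rather than `csv`.

```python
import ast
import csv
import io

# In-memory stand-in for a RAGAs CSV export; the row content is made up.
raw_csv = io.StringIO(
    "user_input,retrieved_contexts\n"
    "What is DeepEval?,\"['chunk one', 'chunk two']\"\n"
)

row = next(csv.DictReader(raw_csv))

# The column arrives as a plain string, not as a list.
print(type(row["retrieved_contexts"]).__name__)  # str

# literal_eval parses Python literals without executing arbitrary code.
contexts = ast.literal_eval(row["retrieved_contexts"])
print(contexts)  # ['chunk one', 'chunk two']
```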


### Pulling a **dataset** from **Confident AI**

If you already have a dataset on the platform, just use the `pull` method and specify its name/alias.

In [2]:
from deepeval.dataset import EvaluationDataset

evaluation_dataset: EvaluationDataset = EvaluationDataset()
evaluation_dataset.pull("RAGAs Dataset")

### Evaluating a dataset/set of test cases using metric/s

Evaluation in **DeepEval** works as follows:
* Pull or create a dataset containing test cases or goldens:
    - In the case of **goldens**, have the LLM generate the `actual_output` and `retrieval_context` fields.
    - Usually **goldens** are preferred, since one can modify the `prompt template` during evaluation.
* Think about the application and its different use cases:
    - What is it doing?
    - How can I make sure it's doing what it's supposed to be doing?
    - Check out the existing metrics or create your own.
    - Create the metric/s.
* Run the evaluation:
    - In notebooks, use the `evaluate` method.
    - If you choose to create test cases in Python file/s:
        - The filename/s should start with `test_` and can be run using `deepeval test run <filename>`.
        - Optionally, pass in flags like `-c` to use the cache.

In [None]:
from deepeval.evaluate import (
    evaluate, 
    EvaluationResult
)
from deepeval.metrics import GEval
from deepeval.models.llms import OllamaModel
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
    model=OllamaModel(),
    threshold=0.5
)

# For running an evaluation in notebook use this approach
results: EvaluationResult = evaluate(
    test_cases=evaluation_dataset,
    metrics=[correctness_metric],
    use_cache=True,
    run_async=False
)