### Configuration

* The **Opik** platform can be hosted locally, either in a container or in a cluster: [local hosting](https://www.comet.com/docs/opik/self-host/overview)
* Alternatively, there's a free cloud-based platform: [cloud version](https://www.comet.com/signup).
    * When using the cloud version, an API key and a workspace need to be specified.
    * Similarly to **DeepEval**, the cloud version of **Opik** has an intuitive UI, where datasets, evaluation results and more can be stored.
    * For the cloud-based approach, make sure to add the following to your `.env` file in the parent folder:
    ```bash
        OPIK_API_KEY=<your-api-key>
        OPIK_WORKSPACE=<your-workspace>
        # Setting this will automatically log traces for the project (Optional)
        OPIK_PROJECT_NAME=<project-name>
        OPIK_USAGE_REPORT_ENABLED=false  # Disable telemetry (Optional)
        # By default creates a file called ~/.opik.config (on Linux) if it doesn't exist
        OPIK_CONFIG_PATH=<filepath-to-your-config-file>  # Overwrite the file location (Optional)
    ```
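
Before configuring the client, it can help to check that the mandatory settings are actually present. The following is a minimal sketch, assuming only the variable names listed above; the helper `missing_opik_vars` is a hypothetical name and not part of the Opik SDK.

```python
import os
from typing import List, Mapping

# Variable names taken from the `.env` example; only these two are
# mandatory for the cloud version, the others are optional.
REQUIRED_VARS = ("OPIK_API_KEY", "OPIK_WORKSPACE")


def missing_opik_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or blank (hypothetical helper)."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment
print(missing_opik_vars({"OPIK_API_KEY": "abc"}))  # → ['OPIK_WORKSPACE']
```

Calling `missing_opik_vars()` without arguments checks the real process environment, so it should be called after `load_dotenv(...)`.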

In [1]:
import os
import opik
from typing import Final
from dotenv import load_dotenv
from opik.exceptions import ConfigurationError

load_dotenv("../.env")

OPIK_API_KEY: Final[str] = os.getenv("OPIK_API_KEY")
OPIK_WORKSPACE: Final[str] = os.getenv("OPIK_WORKSPACE")
OPIK_PROJECT_NAME: Final[str] = os.getenv("OPIK_PROJECT_NAME")

try:
    # If you're using the locally hosted version, set `use_local=True` and provide the `url`
    opik.configure(
        api_key=OPIK_API_KEY,
        workspace=OPIK_WORKSPACE
    )
except (ConfigurationError, ConnectionError) as ce:
    print(f"Error occurred: {ce}")

OPIK: Opik is already configured. You can check the settings by viewing the config file at /home/p3tr0vv/Desktop/Evaluation-Approaches-for-Retrieval-Augmented-Generation-RAG-/evaluation/opik/.opik.config


### LLM and tracing

* **Opik** uses **OpenAI** as the LLM provider by default. To override that, create a `LiteLLMChatModel` instance with the model you want to use and specify your [input parameters](https://docs.litellm.ai/docs/completion/input).
* Optionally, to trace all calls made through `litellm` to the **Opik** platform, create an `OpikLogger` and register it in the `litellm` callbacks.

In [2]:
import os
import litellm
from dotenv import load_dotenv
from opik.evaluation.models import LiteLLMChatModel
from litellm.integrations.opik.opik import OpikLogger

load_dotenv("../../env/rag.env")

# https://docs.litellm.ai/docs/completion/input
eval_model = LiteLLMChatModel(
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    model_name=f"ollama/{os.getenv('CHAT_MODEL')}",
    temperature=float(os.getenv("TEMPERATURE")),
    top_p=float(os.getenv("TOP_P")),
    response_format={
        "type": "json_object"
    },
    api_base="http://localhost:11434",
    num_retries=3,
)

# This will trace all calls submitted to the LLM to Opik (Optional)
opik_logger = OpikLogger()
litellm.callbacks = [opik_logger]

### Evaluation

An evaluation/experiment in **Opik** involves the following:
- an experiment, which is a single evaluation of the LLM application
    - during an experiment, all items of the dataset are iterated over
    - an experiment consists of two main components:
        - configuration:
            - key-value pairs unique to each experiment can be stored to track and compare different evaluation runs and to verify which set of hyperparameters yields the best performance
        - experiment items:
            - individual items from a dataset
            - they are evaluated using the specified metrics and receive a trace, a score and additional metadata
- a dataset
    - **Opik** supports datasets, which are collections of samples that the LLM application will be evaluated on.
    - Datasets can be created and deleted.
- an evaluation task
    - receives a dataset item as input and returns a dictionary containing all parameters required by the metrics
- a set of metrics
    - the ones relevant for this project are the LLM-as-a-judge metrics
    - additionally, one can subclass `BaseMetric` to implement a custom metric
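
How these pieces fit together can be sketched with a toy loop in plain Python. This is only an illustration of the concept and not Opik's implementation: `run_experiment`, `exact_match` and `toy_task` are hypothetical names, the dataset is a plain list of dictionaries, and a metric is any callable that returns a float.

```python
from typing import Any, Callable, Dict, List

# Toy stand-ins, NOT Opik classes
DatasetItem = Dict[str, Any]
Task = Callable[[DatasetItem], Dict[str, Any]]
Metric = Callable[..., float]


def exact_match(output: str, expected_output: str, **_: Any) -> float:
    """Hypothetical metric: 1.0 if the output equals the reference, else 0.0."""
    return 1.0 if output.strip() == expected_output.strip() else 0.0


def toy_task(item: DatasetItem) -> Dict[str, Any]:
    """Map a dataset item to the parameters the metrics expect."""
    return {"output": item["actual_output"], "expected_output": item["expected_output"]}


def run_experiment(
    dataset: List[DatasetItem],
    task: Task,
    metrics: Dict[str, Metric],
    config: Dict[str, Any],
) -> Dict[str, Any]:
    """Iterate over all dataset items, run the task, and score its output with every metric."""
    items = []
    for item in dataset:
        task_output = task(item)
        scores = {name: metric(**task_output) for name, metric in metrics.items()}
        items.append({"item": item, "scores": scores})
    # The configuration is stored alongside the scored experiment items
    return {"config": config, "items": items}


dataset = [
    {"input": "2 + 2?", "actual_output": "4", "expected_output": "4"},
    {"input": "Capital of France?", "actual_output": "Lyon", "expected_output": "Paris"},
]
experiment = run_experiment(dataset, toy_task, {"exact_match": exact_match}, {"model": "toy"})
print([entry["scores"]["exact_match"] for entry in experiment["items"]])  # → [1.0, 0.0]
```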

In [3]:
from dotenv import load_dotenv
from typing import Dict, Any, List
from opik import Dataset
from opik.api_objects.dataset.rest_operations import ApiError

load_dotenv("../.env")

DATASET_ALIAS: Final[str] = os.getenv("DATASET_ALIAS")

# Create an `Opik` client for interacting with the platform
opik_client = opik.Opik(
    project_name=OPIK_PROJECT_NAME,
    workspace=OPIK_WORKSPACE,
    api_key=OPIK_API_KEY,
)

try:
    # Fetch the dataset
    opik_dataset: Dataset = opik_client.get_dataset(name=DATASET_ALIAS)
except ApiError as ae:
    # If not available fetch it from `DeepEval`
    # Convert it into a list of Opik Dataset Items and upload to `Opik`
    # Alternatively: 
    #   opik_dataset.read_jsonl_from_file("path/to/file.jsonl")
    #   opik_dataset.insert_from_pandas(dataframe=df)

    from deepeval.dataset import EvaluationDataset
    from deepeval import login_with_confident_api_key
    
    print(f"{ae.status_code}: {ae.body['errors']}")
    print("Fetching from DeepEval and then uploading to the Opik platform")
    
    login_with_confident_api_key(os.getenv("DEEPEVAL_API_KEY"))
    deepeval_dataset = EvaluationDataset()
    deepeval_dataset.pull(
        alias=DATASET_ALIAS,
        auto_convert_goldens_to_test_cases=True
    )
    
    opik_dataset: Dataset = opik_client.create_dataset(
        name=DATASET_ALIAS,
        description="Evaluation dataset from DeepEval"
    )
    
    opik_dataset_items: List[Dict[str, Any]] = [vars(test_case) for test_case in deepeval_dataset.test_cases]
    opik_dataset.insert(opik_dataset_items)

### Evaluation task

The purpose of this function is to allow the `actual output` to be computed at runtime. However, since a complete dataset is already available, each dataset item can simply be mapped to a dictionary. **Opik** follows a concept very similar to **DeepEval**'s, where the use of **Golden**s is encouraged.
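
For contrast, a task that computes the output at runtime might look like the sketch below. It assumes a hypothetical `toy_rag_app` function standing in for a real RAG pipeline, and it treats a dataset item as a plain dictionary with `input` and `expected_output` keys; neither name comes from Opik.

```python
from typing import Any, Dict, List, Tuple


def toy_rag_app(question: str) -> Tuple[str, List[str]]:
    """Hypothetical placeholder for a real RAG pipeline: returns (answer, retrieved chunks)."""
    retrieved = [f"Stub passage about: {question}"]
    answer = f"Stub answer to: {question}"
    return answer, retrieved


def runtime_evaluation_task(item: Dict[str, Any]) -> Dict[str, Any]:
    """Generate the answer while the evaluation runs instead of reading a stored one."""
    answer, retrieved = toy_rag_app(item["input"])
    return {
        "input": item["input"],
        "output": answer,
        "expected_output": item["expected_output"],
        "context": retrieved,
    }


result = runtime_evaluation_task({"input": "What is Opik?", "expected_output": "An evaluation platform"})
print(result["output"])  # → Stub answer to: What is Opik?
```

This notebook does not need that, because the `actual_output` is already stored in each dataset item.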

In [4]:
from opik.api_objects.dataset.dataset_item import DatasetItem

# This function is used during evaluation
# For each item in the dataset, this function will be called
# The output of the function is a dictionary containing the relevant parameters for the metrics
def evaluation_task(item: DatasetItem) -> Dict[str, Any]:
    return {
        "input": item['input'],
        "output": item['actual_output'],
        "expected_output": item['expected_output'],
        "context": item['retrieval_context'],
        "reference_context": item['context']
    }

### Answer Relevance

This metric evaluates how pertinent the `actual_output` is to the `input`. Missing or off-topic information is penalized; the goal is a fully pertinent answer with no redundancies. **Opik** computes this differently from **RAGAs** and **DeepEval**. **RAGAs** generates *hypothetical questions* from the `actual_output` and averages their semantic similarity to the original question. **DeepEval** uses a multi-step approach: the `actual_output` is first decomposed into statements, a verdict is computed for each statement, and the final score is derived from the verdicts. **Opik** uses a single prompt template to evaluate the relevance of the answer, so it is important to have control over that template. For that reason I've created a custom metric that works the same way as the default one but allows the template to be overridden manually, so that different templates can be tested.
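
To make the difference between the two aggregation schemes concrete, here is a small sketch of only the final arithmetic. It assumes the LLM and embedding steps have already run: the vectors are toy embeddings, the verdicts are plain booleans, and the function names are hypothetical rather than taken from either framework. The **DeepEval**-style score is taken here as the share of relevant statements.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ragas_style_score(question_vec: Sequence[float],
                      generated_question_vecs: List[Sequence[float]]) -> float:
    """Mean similarity between the original question and the hypothetical questions."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


def deepeval_style_score(verdicts: Sequence[bool]) -> float:
    """Share of statements in the answer that were judged relevant."""
    return sum(verdicts) / len(verdicts)


# One generated question matches the original exactly, the other is orthogonal to it
print(ragas_style_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # → 0.5
# Three of four statements were judged relevant
print(deepeval_style_score([True, True, False, True]))  # → 0.75
```

**Opik**'s default metric has no such aggregation step: the score comes directly from the judge model's response to the single template.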

In [None]:
from custom.my_answer_relevance import MyAnswerRelevance

# Keep in mind that you can also override the `few shot examples` with your own
# Be sure to set `few_shot_examples_no_context` to your custom list of examples, since my custom metric doesn't use context
my_answer_relevance_metric = MyAnswerRelevance(
    model=eval_model,
    project_name=OPIK_PROJECT_NAME
)

In [None]:
from opik.evaluation.metrics.llm_judges.answer_relevance.metric import AnswerRelevance
from opik.evaluation.metrics.llm_judges.context_precision.metric import ContextPrecision
from opik.evaluation.metrics.llm_judges.context_recall.metric import ContextRecall
from opik.evaluation.metrics.llm_judges.factuality.metric import Factuality
from opik.evaluation.metrics.llm_judges.usefulness.metric import Usefulness
from opik.evaluation.metrics.llm_judges.hallucination.metric import Hallucination

answer_relevance_metric = AnswerRelevance(
    model=eval_model,
    require_context=False,
    project_name=OPIK_PROJECT_NAME
)

context_precision_metric = ContextPrecision(
    model=eval_model,
    project_name=OPIK_PROJECT_NAME
)

context_recall_metric = ContextRecall(
    model=eval_model,
    project_name=OPIK_PROJECT_NAME
)

factuality_metric = Factuality(
    model=eval_model,
    project_name=OPIK_PROJECT_NAME
)

usefulness_metric = Usefulness(
    model=eval_model,
    project_name=OPIK_PROJECT_NAME
)

hallucination_metric = Hallucination(
    model=eval_model,
    project_name=OPIK_PROJECT_NAME
)

In [None]:
from opik.evaluation import evaluate
from opik.evaluation.evaluation_result import EvaluationResult

# There's a function called `evaluate_experiment`, which allows adding a metric to an already existing experiment.
# It doesn't re-run the evaluation for metric scores that already exist.
# https://www.comet.com/docs/opik/evaluation/update_existing_experiment

eval_res: EvaluationResult = evaluate(
    dataset=opik_dataset,
    task=evaluation_task,
    scoring_metrics=[
        answer_relevance_metric,
        context_precision_metric,
        context_recall_metric,
        factuality_metric,
        usefulness_metric,
        hallucination_metric
    ],
    experiment_name="First evaluation ever using Opik",
    project_name=os.getenv("OPIK_PROJECT_NAME"),
    experiment_config={
        "model": os.getenv("CHAT_MODEL")
    }
)

Evaluation:   0%|          | 0/48 [00:00<?, ?it/s]OPIK: Started logging traces to the "Evaluation Approaches for RAG" project at https://www.comet.com/opik/api/v1/session/redirect/projects/?trace_id=019691d7-a282-716b-a3b2-a145dc28568b&path=aHR0cHM6Ly93d3cuY29tZXQuY29tL29waWsvYXBpLw==.
Evaluation: 100%|██████████| 48/48 [31:18<00:00, 39.14s/it]
