### Configuration

* The **Opik** platform can either be hosted locally in a container or a cluster: [local hosting](https://www.comet.com/docs/opik/self-host/overview)

* Alternatively, a free cloud-based platform is available: [cloud version](https://www.comet.com/signup).

    * When using the cloud version, an API key and a workspace need to be specified.

    * Like **DeepEval**, the cloud version of **Opik** has an intuitive UI where datasets, evaluation results and more can be stored.

    * For the cloud-based approach, make sure to add the following to your `.env` file in the parent folder:
    ```bash
        OPIK_API_KEY=<your-api-key>
        OPIK_WORKSPACE=<your-workspace>
        # Setting this will automatically log traces for the project (Optional)
        OPIK_PROJECT_NAME=<project-name>
        OPIK_USAGE_REPORT_ENABLED=false  # Disable telemetry (Optional)
        # By default creates a file called ~/.opik.config (on Linux) if it doesn't exist
        OPIK_CONFIG_PATH=<filepath-to-your-config-file>  # Overwrite the file location (Optional)
    ```
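
Before these values are passed to `opik.configure`, it can help to fail fast when a required variable is missing or blank. The following is a minimal sketch that assumes the variable names listed above; `require_env` is a hypothetical helper and not part of **Opik**:

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of a required environment variable or raise."""
    # Fall back to the real process environment when no mapping is given
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Example with an explicit mapping instead of the real environment
example_env = {"OPIK_API_KEY": "abc123", "OPIK_WORKSPACE": "  "}
api_key = require_env("OPIK_API_KEY", example_env)
```

Calling `require_env("OPIK_WORKSPACE", example_env)` would raise a `RuntimeError`, because the value contains only whitespace.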

In [1]:
import os
import opik
from typing import Final, Type
from dotenv import load_dotenv
from opik.exceptions import ConfigurationError

load_dotenv("../.env")

# Note: str(None) yields the string "None" when a variable is not set
OPIK_API_KEY: Final[str] = str(os.getenv("OPIK_API_KEY"))
OPIK_WORKSPACE: Final[str] = str(os.getenv("OPIK_WORKSPACE"))
OPIK_PROJECT_NAME: Final[str] = str(os.getenv("OPIK_PROJECT_NAME"))

try:
    # If you're using the locally hosted version, set `use_local=True` and provide the URL
    opik.configure(
        api_key=OPIK_API_KEY,
        workspace=OPIK_WORKSPACE
    )
except (ConfigurationError, ConnectionError) as ce:
    print(f"Error occurred: {str(ce)}. Please try again!")

OPIK: Opik is already configured. You can check the settings by viewing the config file at /home/p3tr0vv/Desktop/Evaluation-Approaches-for-Retrieval-Augmented-Generation-RAG-/evaluation/opik_eval/.opik.config


### LLM and tracing

* **Opik** uses **OpenAI** as the LLM provider by default. To override that, create a `LiteLLMChatModel` instance with the model you want to use and specify your [input parameters](https://docs.litellm.ai/docs/completion/input).

* Optionally, to trace all calls to `litellm` on the **Opik** platform, create an `OpikLogger` and register it in the `litellm` callbacks.

In [None]:
import os
import litellm
from dotenv import load_dotenv
from opik.evaluation.models import LiteLLMChatModel
from litellm.integrations.opik.opik import OpikLogger

load_dotenv("../../env/rag.env")

CHAT_MODEL: Final[str] = str(os.getenv("CHAT_MODEL"))

# https://docs.litellm.ai/docs/completion/input
eval_model = LiteLLMChatModel(
    model_name=f"ollama_chat/{CHAT_MODEL}",
    temperature=0.0,
    response_format={
        "type": "json_object"
    },
    api_base="http://localhost:11434",
    num_retries=3,
)

# This will trace all calls submitted to the LLM to Opik (Optional)
opik_logger = OpikLogger()
litellm.callbacks = [opik_logger]

### Evaluation

For an evaluation/experiment in **Opik** the following is required:

- an experiment:
    - a single evaluation of the LLM application
   
    - during an experiment all items of the dataset get iterated on
   
    - an experiment consists of two main components:
   
        - configuration:
            - key-value pairs unique to each experiment can be stored here to track and compare evaluation runs and to determine which set of hyperparameters yields the best performance

        - experiment items:
            - individual items from a dataset, which consist of `input`, `response`, `expected response` and `context`
            - they get evaluated using the specified metrics and receive a trace, score and additional metadata

- a dataset:
    - **Opik** supports datasets, which are collections of samples that the LLM application is evaluated on.

- an evaluation task:
    - a function that maps dataset items to a dictionary
    - receives a dataset item as input and returns a dictionary that contains all parameters required by the metrics

- a set of metrics:
    - quantitative measures that help us assess the performance of the LLM application
    - the ones relevant for this project are the LLM-as-a-judge metrics (an LLM evaluates the experiment)
    - additionally, one can subclass the `BaseMetric` class to implement a custom metric

In [3]:
from typing import Dict, Any

from opik import Dataset
from opik.api_objects.dataset.rest_operations import ApiError

# Create an `Opik` client for interacting with the platform
opik_client = opik.Opik(
    project_name=OPIK_PROJECT_NAME,
    workspace=OPIK_WORKSPACE,
    api_key=OPIK_API_KEY,
)

# Enter the dataset id without the extension.
# The datasets are stored under `/ragas_eval/datasets`
dataset_id: Final[str] = input("Enter the dataset id (Ex: dataset_1): ")
dataset_filepath: Final[str] = f"../ragas_eval/datasets/{dataset_id}.jsonl"

try:
    # Fetch the dataset
    dataset: Dataset = opik_client.get_dataset(name=dataset_id)
except ApiError:
    # Assumed to mean that the dataset does not exist yet: create it and load the items from the file
    dataset: Dataset = opik_client.create_dataset(
        name=dataset_id,
        description="Evaluation dataset from DeepEval"
    )
    
    dataset.read_jsonl_from_file(
        file_path=dataset_filepath,
        keys_mapping={
            # From RAGAs to Opik mappings
            "user_input": "input",
            "response": "output",
            "reference": "expected_output",
            "retrieved_contexts": "context",
        }
    )

OPIK: Created a "test_1" dataset at https://www.comet.com/opik/api/v1/session/redirect/datasets/?dataset_id=0196d8cf-637c-7f50-921d-51293f0f4a8f&path=aHR0cHM6Ly93d3cuY29tZXQuY29tL29waWsvYXBpLw==.
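
The `keys_mapping` argument above renames the RAGAs field names to the ones used in this notebook. The sketch below only illustrates the intended effect of that mapping on plain JSONL lines; it is not how **Opik** implements it, and `remap_jsonl` is a hypothetical helper. Keeping unmapped keys unchanged is an assumption of this sketch.

```python
import json
from typing import Any, Dict, List

# Same mapping as in the cell above: RAGAs field name -> Opik field name
KEYS_MAPPING: Dict[str, str] = {
    "user_input": "input",
    "response": "output",
    "reference": "expected_output",
    "retrieved_contexts": "context",
}


def remap_jsonl(lines: List[str], mapping: Dict[str, str]) -> List[Dict[str, Any]]:
    """Parse JSONL lines and rename keys; unmapped keys are kept unchanged."""
    items: List[Dict[str, Any]] = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        items.append({mapping.get(key, key): value for key, value in record.items()})
    return items


# A single made-up record for illustration
sample = ['{"user_input": "What is RAG?", "response": "A technique.", "retrieved_contexts": ["ctx"]}']
remapped = remap_jsonl(sample, KEYS_MAPPING)
```

After this step each record exposes the `input`, `output`, `expected_output` and `context` keys that the evaluation task reads.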


### Evaluation task

The purpose of this function is to allow the `actual output` to be computed at runtime. Since we already have a complete dataset, we can simply map each dataset item to a dictionary. **Opik** follows a concept very similar to that of **DeepEval**, which encourages the use of **Golden**s: the actual output can be generated at evaluation time.

In [4]:
from opik.api_objects.dataset.dataset_item import DatasetItem

# This function is used during evaluation
# For each item in the dataset, this function will be called
# The output of the function is a dictionary containing the relevant parameters for the metrics
def evaluation_task(item: DatasetItem) -> Dict[str, Any]:
    return {
        "input": item['input'],
        "output": item['output'],
        "expected_output": item['expected_output'],
        "context": item['context']
    }
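
For comparison, a task that generates the output at evaluation time could look roughly like the sketch below. It uses plain dictionaries instead of Opik dataset items, and both `make_evaluation_task` and `fake_rag` are hypothetical names; a real pipeline would call the RAG application where `fake_rag` is used. The context is taken from the dataset item here only to keep the example short.

```python
from typing import Any, Callable, Dict


def make_evaluation_task(
    generate_answer: Callable[[str], str],
) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Build a task that computes the output when the item is evaluated."""

    def task(item: Dict[str, Any]) -> Dict[str, Any]:
        return {
            "input": item["input"],
            "output": generate_answer(item["input"]),  # computed at evaluation time
            "expected_output": item["expected_output"],
            "context": item["context"],
        }

    return task


# Stand-in generator used only for this illustration
def fake_rag(question: str) -> str:
    return f"Answer to: {question}"


runtime_task = make_evaluation_task(fake_rag)
result = runtime_task(
    {
        "input": "What is Opik?",
        "expected_output": "An evaluation platform.",
        "context": ["Opik docs"],
    }
)
```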

### Hallucination

This metric verifies whether the `output` generated for the `input` is based on information from the `context`. It is the counterpart of the **Faithfulness** metric known from **DeepEval** and **RAGAs**; however, instead of maximizing the score, we want to minimize it. As with all metrics in **Opik**, the evaluation by default submits a single prompt to the LLM without breaking the process into multiple steps. In contrast, **DeepEval** and **RAGAs** first decompose the output (and, in the case of **DeepEval**, also the context) and derive the final verdict from that decomposition.

In [None]:
from custom.hallucination.metric import Hallucination

hallucination = Hallucination(
    model=eval_model,
    project_name=OPIK_PROJECT_NAME
)

### Answer Relevance

This metric evaluates the pertinence of the `actual_output` with respect to the `input`. Missing or off-topic information is penalized. The goal is a fully relevant answer with no redundancies. **Opik** evaluates this differently from **RAGAs** and **DeepEval**. **RAGAs** takes the `actual_output`, generates *hypothetical questions* from it and computes the average semantic similarity between them and the original question. **DeepEval** uses a multi-step approach, where the `actual_output` is first decomposed into statements, a verdict is computed for each of them and the final score is derived from the verdicts. **Opik** uses a single template to evaluate the relevance of the answer, so it is important to have control over the prompt template.

In [None]:
from custom.answer_relevancy.metric import AnswerRelevance

answer_relevance = AnswerRelevance(
    model=eval_model,
    project_name=OPIK_PROJECT_NAME
)
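
To make the contrast concrete, the RAGAs-style computation described above (average similarity between the original question and the hypothetical questions) can be sketched as follows. This is an illustration only, not the RAGAs implementation: the vectors are toy stand-ins for real embeddings, and `mean_question_similarity` is a hypothetical name.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_question_similarity(
    original: Sequence[float], generated: Sequence[Sequence[float]]
) -> float:
    """Average similarity between the original question and each generated one."""
    if not generated:
        return 0.0
    return sum(cosine_similarity(original, g) for g in generated) / len(generated)


# Toy embeddings: one generated question matches exactly, one is orthogonal
original_q = [1.0, 0.0]
hypothetical_qs = [[1.0, 0.0], [0.0, 1.0]]
score = mean_question_similarity(original_q, hypothetical_qs)
```

The single-prompt approach in **Opik** has no such intermediate quantities, which is why the prompt template carries most of the weight.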

### Context Precision

Context precision is a metric that evaluates the retriever component of a RAG application. A high context precision means that relevant information is retrieved and ranked high, so that the user input is augmented with relevant context, which maximizes the relevance and completeness of the LLM output. Unlike **RAGAs** and **DeepEval**, there is no multi-step approach here: we submit a single prompt template, so it is worth experimenting with the prompt itself to get the best possible results. You can also override the examples I've provided in the metric.

### Context Recall

This metric also evaluates the `retriever`, in particular whether all the required information from the knowledge base has been retrieved. The retriever tends to have a large influence on the final answer of the RAG pipeline, so it's crucial to have a metric that tests exactly that. To test the `recall` of the pipeline, we submit a single prompt and instruct the LLM to determine, for each statement in the `expected output`, whether or not it is supported by a node in the context. The final formula is:
* **Context Recall** = $\frac{\text{number of statements supported by the context}}{\text{total number of statements in the expected output}}$
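
As a small worked example of the formula, assume the LLM has already returned one supported/unsupported verdict per statement of the `expected output`. The helper name `context_recall` and the handling of an empty verdict list (returning `0.0`) are assumptions of this sketch:

```python
from typing import Sequence


def context_recall(supported: Sequence[bool]) -> float:
    """Fraction of expected-output statements that are supported by the context."""
    if not supported:
        return 0.0  # assumption: no statements means no measurable recall
    return sum(supported) / len(supported)


# Four statements in the expected output, three of them supported by the context
verdicts = [True, True, False, True]
recall = context_recall(verdicts)  # 3 / 4 = 0.75
```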

### Experiment

In [None]:
import pandas as pd
from opik.evaluation import evaluate
from opik.evaluation.evaluation_result import EvaluationResult

test_id: Final[int] = int(input("Enter the test id (Ex: 1): "))
df: pd.DataFrame = pd.read_csv("../../experiments.csv")

row = df[df['test_id'] == test_id].iloc[0]

TOP_K = int(row['top_k'])                     # Number of top-k retrieved contexts during the generation
MAX_TOKENS = int(row['max_tokens_to_sample']) # Tokens used when generating the response
CHUNK_SIZE = int(row['chunk_size'])           # Size of the chunk used to split the context during the generation
CHUNK_OVERLAP = int(row['chunk_overlap'])     # Overlap of the chunk used to split the context during the generation
MODEL = str(row['chat_model'])                # The model used to generate the response, not the evaluation model
TEMPERATURE = float(row['temperature'])       # Temperature used to generate the response
DESCRIPTION = str(row.get('description'))

eval_res: EvaluationResult = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[
        hallucination,
        answer_relevance
    ],
    experiment_name=f"{dataset_id}_evaluation",
    project_name=OPIK_PROJECT_NAME,
    experiment_config={
        "top_k": TOP_K,
        "max_tokens": MAX_TOKENS,
        "chunk_size": CHUNK_SIZE,
        "chunk_overlap": CHUNK_OVERLAP,
        "temperature": TEMPERATURE,
        "model": MODEL,
        "description": DESCRIPTION
    }
)

### Results

In [None]:
eval_res.test_results