### Configuration

* The **Opik** platform can be self-hosted, either locally in a container or on a cluster: [local hosting](https://www.comet.com/docs/opik/self-host/overview)

* Alternatively, there's a free cloud-based platform: [cloud version](https://www.comet.com/signup).
    * When using the cloud version, an API key and a workspace need to be specified.
    * Both can be retrieved from the platform.
    * Similarly to **DeepEval**, the cloud version of **Opik** has an intuitive UI, where datasets, evaluation results and more can be stored.
    * For the cloud-based approach, make sure to add the following to your `.env` file in the parent folder:
    ```bash
        OPIK_API_KEY=<your-api-key>
        OPIK_WORKSPACE=<your-workspace>
        # Setting this will automatically log traces for the project (Optional)
        OPIK_PROJECT_NAME=<project-name>
        OPIK_USAGE_REPORT_ENABLED=false  # Disable telemetry (Optional)
        # By default creates a file called ~/.opik.config (on Linux) if it doesn't exist
        OPIK_CONFIG_PATH=<filepath-to-your-config-file>  # Overwrite the file location (Optional)
    ```
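
Before configuring the client, it can help to check that the mandatory settings are actually present. The following is a minimal sketch, assuming only `OPIK_API_KEY` and `OPIK_WORKSPACE` are required for the cloud version; the helper name `missing_opik_settings` is hypothetical and not part of **Opik**.

```python
import os
from typing import List, Mapping

# Keys the cloud setup above treats as mandatory; the others are optional.
REQUIRED_KEYS = ("OPIK_API_KEY", "OPIK_WORKSPACE")


def missing_opik_settings(env: Mapping[str, str]) -> List[str]:
    """Return the required keys that are absent or blank in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


# Check the current process environment (after `load_dotenv` has run)
missing = missing_opik_settings(os.environ)
if missing:
    print(f"Missing settings: {', '.join(missing)}")
```

Failing early this way avoids passing an empty or placeholder value to the platform.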

### Dependencies

In [None]:
import os
import pandas as pd
from typing import Final, Dict, Any
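try:
    from distutils.util import strtobool  # `distutils` was removed in Python 3.12
except ImportError:
    def strtobool(value: str) -> bool:
        """Fallback for Python versions without `distutils`."""
        return value.strip().lower() in {"y", "yes", "t", "true", "on", "1"}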

from dotenv import load_dotenv

import opik
from opik import Dataset
from opik.exceptions import ConfigurationError
from opik.evaluation import evaluate
from opik.evaluation.models import LiteLLMChatModel
from opik.evaluation.evaluation_result import EvaluationResult
from opik.api_objects.dataset.dataset_item import DatasetItem
from opik.api_objects.dataset.rest_operations import ApiError

# Uncomment for tracing. In my case there were some errors
# import litellm
# from litellm.integrations.opik.opik import OpikLogger

# Assumes you have created the `.env` file described above.
# It is not part of the project by default since it contains sensitive data.
load_dotenv("../.env")

# Retrieve all the evaluation parameters and other ones relevant for experiments
load_dotenv("../../env/rag.env")

### Configure Opik

In [None]:
OPIK_API_KEY: Final[str] = str(os.getenv("OPIK_API_KEY"))
OPIK_WORKSPACE: Final[str] = str(os.getenv("OPIK_WORKSPACE"))
OPIK_PROJECT_NAME: Final[str] = str(os.getenv("OPIK_PROJECT_NAME"))

try:
    # If you're using the locally hosted version, set `use_local=True` and provide the `url`
    opik.configure(
        api_key=OPIK_API_KEY,
        workspace=OPIK_WORKSPACE,
    )
except (ConfigurationError, ConnectionError) as ce:
    # Chain the original exception so the root cause stays visible in the traceback
    raise RuntimeError(f"Error occurred: {ce}. Please try again!") from ce

### LLM and tracing

* **Opik** uses **OpenAI** as the LLM provider by default. To override that, create a `LiteLLMChatModel` instance with the model you want to use and specify your [input parameters](https://docs.litellm.ai/docs/completion/input).
* Optionally, to trace all calls to `litellm` on the **Opik** platform, create an `OpikLogger` and register it in the `litellm` callbacks.

In [None]:
# Same for all experiments
EVALUATION_MODEL: Final[str] = str(os.getenv("EVALUATION_MODEL"))
EVALUATION_TEMPERATURE: Final[float] = float(os.getenv("EVALUATION_TEMPERATURE"))

# https://docs.litellm.ai/docs/completion/input
eval_model = LiteLLMChatModel(
    model_name=f"ollama_chat/{EVALUATION_MODEL}",
    temperature=EVALUATION_TEMPERATURE,
    response_format={
        "type": "json_object" # Make sure this is set, since some metrics require JSON output
    },
    api_base="http://localhost:11434", # Might differ in your case
    num_retries=3,
)

# This will trace all calls submitted to the LLM to Opik (Optional)
# I'm getting errors so I don't use it.
# opik_logger = OpikLogger()
# litellm.callbacks = [opik_logger]

### Evaluation

For an evaluation/experiment in **Opik** the following is required:

- an experiment:
    - a single evaluation of the LLM application
    - during an experiment all items of the dataset get iterated on
    - an experiment consists of two main components:
        - configuration:
            - one can store key-value pairs unique to each experiment to track and compare different evaluation runs and determine which set of hyperparameters yields the best performance

        - experiment items:
            - individual items from a dataset, which consist of `input`, `response`, `expected response` and `context`
            - they get evaluated using the specified metrics and receive a trace, score and additional metadata

- a dataset:
    - **Opik** supports datasets, which are collections of samples that the LLM application is evaluated on

- an evaluation task:
    - a function that maps dataset items to a dictionary
    - receives a dataset item as input and returns a dictionary that contains all parameters required by the metrics

- a set of metrics:
    - quantitative measures, which help us assess the performance of the LLM application
    - additionally, one can subclass `BaseMetric` to implement a custom metric
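
To make the interplay of these parts concrete, here is a simplified plain-Python model of an experiment. It is not how **Opik** implements `evaluate`; `run_experiment`, `exact_match` and the toy dataset are hypothetical names used purely for illustration.

```python
from typing import Any, Callable, Dict, List

Item = Dict[str, Any]
Task = Callable[[Item], Dict[str, Any]]
Metric = Callable[..., float]  # hypothetical stand-in for a metric object


def run_experiment(dataset: List[Item], task: Task, metrics: Dict[str, Metric]) -> List[Dict[str, float]]:
    """Apply the task to every dataset item and score the result with each metric."""
    rows = []
    for item in dataset:
        task_output = task(item)  # dictionary with the parameters the metrics need
        rows.append({name: metric(**task_output) for name, metric in metrics.items()})
    return rows


def exact_match(output: str, expected_output: str, **_: Any) -> float:
    """Toy metric: 1.0 if the output equals the expected output, else 0.0."""
    return float(output.strip() == expected_output.strip())


toy_dataset = [
    {"input": "2 + 2?", "output": "4", "expected_output": "4", "context": ["2 + 2 = 4"]},
    {"input": "Capital of France?", "output": "Lyon", "expected_output": "Paris", "context": []},
]
scores = run_experiment(toy_dataset, task=lambda item: item, metrics={"exact_match": exact_match})
print(scores)
```

Each row of `scores` corresponds to one experiment item, with one score per metric.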

In [None]:
# Create an `Opik` client for interacting with the platform
opik_client = opik.Opik(
    project_name=OPIK_PROJECT_NAME,
    workspace=OPIK_WORKSPACE,
    api_key=OPIK_API_KEY,
)

# Enter the dataset id without the extension. 
# The datasets would be stored under `../ragas_eval/datasets`.
# Alternatively provide your filepath.
experiment_id: Final[int] = int(input("Enter the experiment id (Ex: 1): "))
dataset_filepath: Final[str] = f"../ragas_eval/datasets/{experiment_id}_dataset.jsonl"

try:
    # Fetch the dataset
    dataset: Dataset = opik_client.get_dataset(name=f"{experiment_id}_dataset")
except ApiError:
    # If it doesn't exist already, just create it
    dataset = opik_client.create_dataset(
        name=f"{experiment_id}_dataset",
        description=f"Evaluation dataset [{experiment_id}]"
    )
    
    dataset.read_jsonl_from_file(
        file_path=dataset_filepath,
        keys_mapping={
            # From RAGAs to Opik mappings
            # Alternatively, map it from your fields to the ones expected by Opik
            # The ones expected by Opik are on the right-hand side
            "user_input": "input",
            "response": "output",
            "reference": "expected_output",
            "retrieved_contexts": "context",
        }
    )

### Evaluation task

The purpose of this function is to allow the `actual output` to be computed at runtime. If you have a custom **RAG** pipeline, you can take each individual input, compute the output and retrieve the context at evaluation time. However, since we already have a complete dataset, we can simply map each dataset item to a dictionary. **Opik** follows a concept very similar to **DeepEval**, where the usage of **Golden**s is encouraged, meaning that the actual output can be generated at `evaluation time`.
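
For comparison, a task that generates the output at evaluation time could look roughly like the sketch below. The pipeline signature (question in, answer plus retrieved contexts out) and the names `make_runtime_task` and `fake_pipeline` are assumptions made for illustration; they are not part of this project or of **Opik**.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical pipeline signature: question -> (answer, retrieved contexts)
RagPipeline = Callable[[str], Tuple[str, List[str]]]


def make_runtime_task(pipeline: RagPipeline) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Build an evaluation task that generates `output` and `context` on the fly."""
    def task(item: Dict[str, Any]) -> Dict[str, Any]:
        answer, contexts = pipeline(item["input"])
        return {
            "input": item["input"],
            "output": answer,
            "expected_output": item["expected_output"],
            "context": contexts,
        }
    return task


def fake_pipeline(question: str) -> Tuple[str, List[str]]:
    """Placeholder for a real RAG pipeline, used only to make the sketch runnable."""
    return f"Answer to: {question}", ["some retrieved chunk"]


runtime_task = make_runtime_task(fake_pipeline)
print(runtime_task({"input": "What is Opik?", "expected_output": "An evaluation platform."}))
```

With this variant the dataset only needs `input` and `expected_output`; everything else is produced when the experiment runs.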

In [None]:
# This function is used during evaluation
# For each item in the dataset, this function will be called
# The output of the function is a dictionary containing the relevant parameters for the metrics
def evaluation_task(item: DatasetItem) -> Dict[str, Any]:
    return {
        "input": item['input'],
        "output": item['output'],
        "expected_output": item['expected_output'],
        "context": item['context']
    }

### Hallucination

This metric verifies if the `output` generated relative to the `input` is based on information from the `context`. Cases where statements contain **contradictory** or **made-up** information are penalized.

This is the counterpart of the **Faithfulness** metric known from **DeepEval** and **RAGAs**; however, instead of maximizing the score, we want to minimize it.

The evaluation takes place by submitting a single prompt to the LLM, providing instructions and examples. Since we expect a JSON output, we provide a schema and parse the output to retrieve the score.
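
The parsing step can be sketched as follows. The key names `score` and `reason` and the `[0, 1]` score range are assumptions for illustration; the actual schema is defined by the metric's prompt.

```python
import json
from typing import Tuple


def parse_judge_reply(raw: str) -> Tuple[float, str]:
    """Extract (score, reason) from a JSON reply; the key names are assumed."""
    payload = json.loads(raw)
    score = float(payload["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"Score out of range: {score}")
    return score, str(payload.get("reason", ""))


print(parse_judge_reply('{"score": 1.0, "reason": "The output contradicts the context."}'))
```

Validating the range right after parsing makes malformed judge replies fail loudly instead of skewing the averages later on.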

The metric expects:
- `input`
- `output`
- `context`

**DeepEval** first decomposes both the `output` and `context` into statements and then it checks for hallucinations. The final score is: **Faithfulness** = $ \frac{\text{number of attributed statements in the response}}{\text{all statements in the response}} $.

**RAGAs** works in a similar fashion to **DeepEval**; however, the `context` doesn't get decomposed into so-called `truths`.

In [None]:
from custom.hallucination.metric import Hallucination

hallucination = Hallucination(
    model=eval_model,
    project_name=OPIK_PROJECT_NAME
)

### Answer Relevance

The following metric evaluates the pertinence of the `output` with respect to the `input`. Missing or off-topic information is penalized. The goal is a complete, on-topic answer that addresses the `input` directly.

The goal is to maximize this metric along three dimensions:
- completeness
- conciseness
- relevance

**RAGAs** takes the `output` and generates *hypothetical questions*; the score is then the average semantic similarity between them and the original question. There's no `LLM-as-a-judge` in this case.

**DeepEval** uses a multi-step approach: the `output` is first decomposed into statements, a verdict is computed for each of them, and the final score is derived using the formula: **Answer Relevancy** = $ \frac{\text{relevant statements in the response}}{\text{all statements in the response}} $
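
The similarity-based idea described for **RAGAs** can be illustrated with a small sketch. It assumes the questions have already been embedded; the two-dimensional vectors below are toy values, and `mean_question_similarity` is a hypothetical helper, not a **RAGAs** function.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_question_similarity(question: Sequence[float], generated: List[Sequence[float]]) -> float:
    """Average similarity between the original question and each generated one."""
    if not generated:
        return 0.0
    return sum(cosine_similarity(question, g) for g in generated) / len(generated)


# Toy 2-D "embeddings"; real ones would come from an embedding model
original = [1.0, 0.0]
hypothetical = [[1.0, 0.0], [0.0, 1.0]]
print(mean_question_similarity(original, hypothetical))  # 0.5
```

One generated question matches the original exactly and the other is orthogonal to it, so the average is 0.5.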

In [None]:
from custom.answer_relevancy.metric import AnswerRelevance

answer_relevance = AnswerRelevance(
    model=eval_model,
    project_name=OPIK_PROJECT_NAME
)

### Context Precision

Context precision is a metric that evaluates the retriever component of a RAG application.

A **high** context precision means that the retriever is **retrieving relevant information and ranking it high**, so that the user input is augmented with relevant context, maximizing the relevance and completeness of the LLM output.

The higher a relevant context is ranked, the better: it appears earlier in the prompt, so the LLM can make better use of it and avoid the [Lost In The Middle](https://arxiv.org/abs/2307.03172) problem.

### Formula
* WCP = $ \frac{\sum_{k=1}^{K}(\text{Precision@k} \times v_{k})}{\text{Total number of relevant items in top K results}} $

* `K` is the `limit` or number of documents fetched by the retriever

* $ v_{k} $ is either 1 or 0, depending on the relevance of the context
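
The formula can be traced with a short sketch. It assumes that Precision@k is the number of relevant contexts among the top `k` divided by `k`, and that a list without any relevant context scores 0; the function name `weighted_context_precision` is hypothetical.

```python
from typing import Sequence


def weighted_context_precision(relevance: Sequence[int]) -> float:
    """Compute WCP from per-rank relevance flags v_k (1 = relevant, 0 = not)."""
    relevant_so_far = 0
    weighted_sum = 0.0
    for k, v_k in enumerate(relevance, start=1):
        if v_k:
            relevant_so_far += 1
            weighted_sum += relevant_so_far / k  # Precision@k, counted only where v_k = 1
    # Assumption: no relevant context at all yields a score of 0
    return weighted_sum / relevant_so_far if relevant_so_far else 0.0


print(weighted_context_precision([1, 1, 0]))  # 1.0
print(weighted_context_precision([0, 0, 1]))  # one relevant context, ranked last
```

The same single relevant context scores 1.0 at rank 1 but only 1/3 at rank 3, which is how the metric rewards ranking relevant contexts early.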

In [None]:
from custom.context_precision.metric import ContextPrecision

context_precision = ContextPrecision(
    model=eval_model,
    project_name=OPIK_PROJECT_NAME
)

### Context Recall

This metric evaluates the `retriever`'s ability to retrieve all the information required to address a particular `input`. The retriever tends to exert a strong influence over the final answer of the RAG pipeline, so it's crucial to have a metric that tests exactly that.

This metric first decomposes the `expected output` into individual statements, each of which is then classified as either:
- attributed (supported by the `context`)
- not-attributed (not supported by the `context`)

The final formula is:
* **Context Recall** = $\frac{\text{number of statements supported by context}}{\text{total number of statements in the expected output}} $
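
Once every statement has a label, the score is a simple ratio. The sketch below assumes the classification has already been done (by the LLM) and that an empty list of statements scores 0; `context_recall` is a hypothetical helper, not the metric class used in this notebook.

```python
from typing import List


def context_recall(verdicts: List[str]) -> float:
    """Fraction of expected-output statements labelled 'attributed'."""
    if not verdicts:
        return 0.0  # assumption: nothing to attribute means a score of 0
    return sum(v == "attributed" for v in verdicts) / len(verdicts)


print(context_recall(["attributed", "not-attributed", "attributed", "attributed"]))  # 0.75
```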

In [None]:
from custom.context_recall.metric import ContextRecall

context_recall = ContextRecall(
    model=eval_model,
    project_name=OPIK_PROJECT_NAME
)

### Mean Reciprocal Rank

This metric evaluates the `retriever`'s effectiveness by measuring how highly it ranks the first relevant document.

In this project, we need to evaluate a RAG pipeline. When we retrieve data/context, it can either be relevant or not. For this metric I compare each individual retrieved context with the `expected output`. If the semantic similarity score exceeds a predefined `threshold`, that context is deemed relevant. The reciprocal rank is then the multiplicative inverse of the rank of the first relevant document.

To compute the `MRR` for the full dataset, we average the reciprocal ranks of all entries in our dataset. This metric is `deterministic`, i.e. repeated executions should yield the same results, up to a small margin of error.

* **MRR** = $\frac{1}{N} \times \sum_{i=1}^{N}(\frac{1}{\text{rank}_{i}})$

* `N` is the number of entries in the dataset

* $\text{rank}_{i}$ is the rank of the first context deemed relevant for the `i`-th entry
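
The thresholding and averaging described above can be sketched in plain Python. The similarity scores are toy values, and treating an entry without any relevant context as a reciprocal rank of 0 is an assumption; the function names are hypothetical and independent of the `ReciprocalRank` and `MeanReciprocalRank` classes used below.

```python
from typing import List, Sequence


def reciprocal_rank(similarities: Sequence[float], threshold: float) -> float:
    """1 / rank of the first context whose similarity exceeds the threshold."""
    for rank, similarity in enumerate(similarities, start=1):
        if similarity > threshold:
            return 1.0 / rank
    return 0.0  # assumption: no relevant context contributes 0


def mean_reciprocal_rank(entries: List[Sequence[float]], threshold: float) -> float:
    """Average the reciprocal ranks over all dataset entries."""
    if not entries:
        return 0.0
    return sum(reciprocal_rank(s, threshold) for s in entries) / len(entries)


# One list of context similarities (in retrieval order) per dataset entry
entries = [[0.9, 0.2], [0.1, 0.8], [0.1, 0.3]]
print(mean_reciprocal_rank(entries, threshold=0.7))  # 0.5
```

The three entries contribute 1, 1/2 and 0 respectively, giving an average of 0.5.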

In [None]:
from custom.mean_reciprocal_rank.reciprocal_rank.metric import ReciprocalRank
from custom.mean_reciprocal_rank.metric import MeanReciprocalRank

# For each data entry we compute the reciprocal rank individually
reciprocal_ranks = [
    ReciprocalRank(
        project_name=OPIK_PROJECT_NAME
    )
    for _ in range(len(dataset.get_items()))
]

# This aggregate metric will then compute the average of all scores to get the MRR
mrr = MeanReciprocalRank(
    name="MRR",
    metrics=reciprocal_ranks,
    project_name=OPIK_PROJECT_NAME
)

### Preparation

In [None]:
df: pd.DataFrame = pd.read_csv("../../experiments.csv")

row = df[df['test_id'] == experiment_id].iloc[0]

TOP_K: int = int(row['top_k'])                     # Number of top-k retrieved contexts during the generation
MAX_TOKENS: int = int(row['max_tokens_to_sample']) # Tokens used when generating the response
CHUNK_SIZE: int = int(row['chunk_size'])           # Size of the chunk used to split the context during the generation
CHUNK_OVERLAP: int = int(row['chunk_overlap'])     # Overlap of the chunk used to split the context during the generation
MODEL: str = str(row['chat_model'])                # The model used to generate the response, not the evaluation model
TEMPERATURE: float = float(row['temperature'])     # Temperature used to generate the response
DESCRIPTION: str = str(row.get('description'))

use_vanilla_rag: bool = bool(
    strtobool(os.getenv("VANILLA_RAG"))
)

if use_vanilla_rag:
    RAG_APPROACH: str = "vanilla"
else:
    RAG_APPROACH: str = "query_fusion" # RAG-Fusion

print(f"""
Evaluating experiment {experiment_id}:

top-k = {TOP_K}
max tokens = {MAX_TOKENS}
chunk size = {CHUNK_SIZE}
chunk overlap = {CHUNK_OVERLAP}
model = {MODEL}                
temperature = {TEMPERATURE}
description = {DESCRIPTION.strip()}
RAG approach = {RAG_APPROACH}
""")

### Evaluation

In [None]:
eval_res: EvaluationResult = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[
        hallucination,
        answer_relevance,
        context_precision,
        context_recall,
        mrr
    ],
    experiment_name=f"{experiment_id}_evaluation",
    project_name=OPIK_PROJECT_NAME,
    experiment_config={
        "top_k": TOP_K,
        "max_tokens": MAX_TOKENS,
        "chunk_size": CHUNK_SIZE,
        "chunk_overlap": CHUNK_OVERLAP,
        "temperature": TEMPERATURE,
        "model": MODEL,
        "description": DESCRIPTION,
        "rag_approach": RAG_APPROACH
    }
)

### Results

In [None]:
import json

results = {}

# You can store the results however you prefer
# I use this approach since it makes it easy to aggregate the scores and average each metric
for test_result in eval_res.test_results:
    for score_result in test_result.score_results:
        metric_name: str = score_result.name
        
        if metric_name not in results:
            results[metric_name] = []
        
        results[metric_name].append({
            'value': score_result.value,
            'reason': score_result.reason
        })

# Save to JSON (create the output directory first if it doesn't exist yet)
os.makedirs("./res", exist_ok=True)
with open(f'./res/{experiment_id}_evaluation.json', 'w') as f:
    json.dump(results, f, indent=4)

Check out the average for each metric.

In [None]:
try:
    with open(file=f"./res/{experiment_id}_evaluation.json", mode="r") as f:
        scores = json.load(f)

    # `metric_results` avoids shadowing the `results` dictionary built above
    for metric, metric_results in scores.items():
        avg_metric = sum(result['value'] for result in metric_results) / len(metric_results)
        print(f"{metric}: {avg_metric:.4f}")
except FileNotFoundError:
    print("No evaluation results found.")