<img src="https://raw.githubusercontent.com/comet-ml/opik/main/apps/opik-documentation/documentation/static/img/opik-logo.svg" width="250"/>

# LLM-Based Evaluation with Opik

In this exercise, you'll evaluate LLM applications with LLM-as-a-judge metrics. You can use OpenAI or open-source models via LiteLLM. To make the exercise a little more exciting, you'll run your evaluations on HaluBench, a popular hallucination dataset.

# Imports & Configuration

In [1]:
%pip install opik openai comet_ml litellm --quiet

In [2]:
import opik
from opik import Opik, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import (Hallucination, AnswerRelevance)
from opik.integrations.openai import track_openai
import openai
import os
from datetime import datetime
from getpass import getpass
import litellm
from litellm.integrations.opik.opik import OpikLogger
from opik.opik_context import get_current_span_data

opik_logger = OpikLogger()
# In order to log LiteLLM traces to Opik, you will need to set the Opik callback
litellm.callbacks = [opik_logger]

# Define project name to enable tracing
os.environ["OPIK_PROJECT_NAME"] = "llm-based-eval"


* 'fields' has been removed


In [3]:
# opik configs
if "OPIK_API_KEY" not in os.environ:
    os.environ["OPIK_API_KEY"] = getpass("Enter your Opik API key: ")

opik.configure()

Enter your Opik API key: ··········


OPIK: Opik is already configured. You can check the settings by viewing the config file at /root/.opik.config


# OpenAI configuration (ignore if you're using LiteLLM)
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

MODEL = "gpt-4o-mini"
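
The API-key cells in this notebook all follow the same check-then-prompt pattern. A minimal sketch of a reusable helper is shown below. `ensure_env` is a hypothetical name (it is not part of Opik or LiteLLM), and the `prompt_fn` parameter is an assumption added only so the helper can be exercised without an interactive prompt.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_env(name: str, prompt: str, prompt_fn: Callable[[str], str] = getpass) -> str:
    """Return os.environ[name], prompting for a value only when it is not already set.

    Hypothetical helper for illustration; not part of any library used in this notebook.
    """
    if name not in os.environ:
        os.environ[name] = prompt_fn(prompt)
    return os.environ[name]


# Non-interactive demonstration with a stand-in prompt function and a dummy variable name.
os.environ.pop("DEMO_API_KEY", None)
first = ensure_env("DEMO_API_KEY", "Enter key: ", prompt_fn=lambda _: "dummy-value")
# The second call finds the variable already set, so the prompt function is never used.
second = ensure_env("DEMO_API_KEY", "Enter key: ", prompt_fn=lambda _: "other-value")
print(first, second)
```

In the notebook itself the default `getpass` would be kept, so keys are typed without being echoed.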

In [4]:
# Hugging Face configuration for accessing the Llama 3.2 model
if "HF_TOKEN" not in os.environ:
  os.environ["HF_TOKEN"] = getpass("Enter your Hugging Face Key: ")

Enter your Hugging Face Key: ··········


In [5]:
# Get the Opik client and set the model
client = opik.Opik()

MODEL = "huggingface/meta-llama/Llama-3.2-1B-Instruct"

# Prompts & Templates

In [6]:
prompt_template = """Use the following context to answer my question:

### CONTEXT:
{context}

### QUESTION:
{question}
"""
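
To see what the model will actually receive, the template can be rendered with `str.format`. The sketch below repeats the template so it runs on its own; the sample context and question are made up for illustration and do not come from HaluBench.

```python
prompt_template = """Use the following context to answer my question:

### CONTEXT:
{context}

### QUESTION:
{question}
"""

# Made-up sample values, used only to show the rendered prompt.
sample_context = "The Eiffel Tower is located in Paris."
sample_question = "Where is the Eiffel Tower?"

rendered = prompt_template.format(context=sample_context, question=sample_question)
print(rendered)
```

Only the template string is parsed for `{...}` placeholders, so passages that themselves contain braces are inserted unchanged.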

# Dataset

In [7]:
# Create dataset
dataset = client.get_or_create_dataset(
    name="HaluBench", description="HaluBench dataset"
)

In [8]:
import pandas as pd

df = pd.read_parquet(
    "hf://datasets/PatronusAI/HaluBench/data/test-00000-of-00001.parquet"
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [9]:
df.head()

Unnamed: 0,id,passage,question,answer,label,source_ds
0,d3fb4c3c-d21b-480a-baa0-98d6d0d17c1d,Hoping to rebound from the road loss to the Ch...,Which team scored the longest field goal kick ...,"['Rams', 'second', 'Marc Bulger', 'Kevin Curtis']",FAIL,DROP
1,8603663e-c53b-46db-a482-a867f12ff3b4,"As of the census of 2000, there were 218,590 p...",How many percent were not Irish?,87.1,FAIL,DROP
2,c63a73e5-2c91-489b-bd24-af150ddfa82c,Hoping to rebound from the road loss to the Ch...,How many yards was the second longest field go...,42,FAIL,DROP
3,52db14ed-5426-46ec-b0ae-4ef843b2d692,Hoping to rebound from their tough overtime ro...,How long was the last touchdown?,18-yard,FAIL,DROP
4,31b36417-aad1-412c-b0e5-9c1faaed233f,"As of the census of 2000, there were 218,590 p...",How many in percent from the census weren't Ir...,87.1,FAIL,DROP


In [10]:
cleaned_ds = df.drop(['answer', 'label', 'source_ds', 'id'], axis=1).iloc[0:100]
cleaned_ds.head()

Unnamed: 0,passage,question
0,Hoping to rebound from the road loss to the Ch...,Which team scored the longest field goal kick ...
1,"As of the census of 2000, there were 218,590 p...",How many percent were not Irish?
2,Hoping to rebound from the road loss to the Ch...,How many yards was the second longest field go...
3,Hoping to rebound from their tough overtime ro...,How long was the last touchdown?
4,"As of the census of 2000, there were 218,590 p...",How many in percent from the census weren't Ir...


In [11]:
dataset.insert(cleaned_ds.to_dict('records'))
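
`to_dict('records')` turns the DataFrame into a list with one dictionary per row, keyed by column name, and that list is what gets passed to `dataset.insert` above. The sketch below rebuilds the same shape in plain Python from made-up column data, so the structure is visible without downloading HaluBench.

```python
# Made-up columnar data standing in for the `passage` and `question` columns.
columns = {
    "passage": ["Passage A", "Passage B"],
    "question": ["Question about A?", "Question about B?"],
}

# One dict per row, keyed by column name: the same layout as DataFrame.to_dict('records').
records = [dict(zip(columns.keys(), row)) for row in zip(*columns.values())]
print(records)
```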

In [12]:
# Read the dataset back from Opik to confirm the insert
dataset.to_pandas().head()

Unnamed: 0,passage,question,id
0,"Trying to snap a two-game skid, the Bills flew...",How many games had the Bills won before this g...,01943f12-f4f5-75c3-95c9-38d24ac6a136
1,1564: The city of Ryazan posad was burned.:47 ...,What was burned first: city of Ryazan or subur...,01943f12-f4f4-70ed-8866-444aed413048
2,"As of the census of 2000, there were 218,590 p...",How many percent were not Italian?,01943f12-f4f3-7c1c-b670-478928e58613
3,"As of the census of 2000, there were 218,590 p...",Which group from the census is smaller: German...,01943f12-f4f2-7d5c-8b4b-7f456fac118b
4,"In week 6, the Lions hosted the NFC West Divis...",How many field goals between 20 and 30 yards w...,01943f12-f4f1-71d6-8e4e-4a3ddedce83f


# Simple client class for using different LLM APIs (OpenAI or LiteLLM).
# Optional: the rest of this notebook calls LiteLLM directly, so this cell can be skipped.
class LLMClient:
  def __init__(self, client_type: str ="openai", model: str ="gpt-4"):
    self.client_type = client_type
    self.model = model

    if self.client_type == "openai":
      self.client = track_openai(openai.OpenAI())

    else:
      self.client = None

  # LiteLLM query function
  def _get_litellm_response(self, query: str, system: str = "You are a helpful assistant."):
    messages = [
        {"role": "system", "content": system },
        { "role": "user", "content": query }
    ]

    response = litellm.completion(
        model=self.model,
        messages=messages
    )

    return response.choices[0].message.content

  # OpenAI query function - use **kwargs to pass arguments like temperature
  def _get_openai_response(self, query: str, system: str = "You are a helpful assistant.", **kwargs):
    messages = [
        {"role": "system", "content": system },
        { "role": "user", "content": query }
    ]

    response = self.client.chat.completions.create(
        model=self.model,
        messages=messages,
        **kwargs
    )

    return response.choices[0].message.content


  def query(self, query: str, system: str = "You are a helpful assistant.", **kwargs):
    if self.client_type == 'openai':
      return self._get_openai_response(query, system, **kwargs)

    else:
      return self._get_litellm_response(query, system)



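# Use the OpenAI client only for OpenAI models; route everything else through LiteLLM
llm_client = LLMClient(
    client_type="openai" if MODEL.startswith("gpt-") else "litellm",
    model=MODEL,
)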


@track
def chatbot_application(question: str, context: str) -> str:
    response = llm_client.query(prompt_template.format(context=context, question=question))
    return response


# LLM Application





In [13]:
# Call the Llama 3.2 model through LiteLLM
@track
def chatbot_application(question: str, context: str) -> str:
    response = litellm.completion(
        model=MODEL,
        messages=[
            {"role":"system", "content":"You are a helpful assistant."},
            {"role":"user", "content":prompt_template.format(context=context, question=question)}
        ]
    )
    return response.choices[0].message.content
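
The `messages` argument in the cell above is a list of role/content dictionaries: one system message followed by one user message holding the rendered prompt. The sketch below isolates that step so it can be inspected without calling a model. `build_messages` is a hypothetical helper name, and the passage text is made up.

```python
from typing import Dict, List

# Same template as the notebook's `prompt_template`, repeated so the block runs on its own.
PROMPT_TEMPLATE = """Use the following context to answer my question:

### CONTEXT:
{context}

### QUESTION:
{question}
"""


def build_messages(
    question: str, context: str, system: str = "You are a helpful assistant."
) -> List[Dict[str, str]]:
    """Build the chat message list: one system message, then the rendered user prompt."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": PROMPT_TEMPLATE.format(context=context, question=question)},
    ]


messages = build_messages("How long was the last touchdown?", "Made-up passage text.")
for message in messages:
    print(message["role"], "->", message["content"][:40])
```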

# Evaluation

In [14]:
# Define the evaluation task
def evaluation_task(x):
    return {
        "input": x['question'],
        "output": chatbot_application(x['question'], x['passage']),
        "context": x['passage']
    }
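
Each dataset item reaches the task as a dictionary, and the task returns the `input`, `output`, and `context` values that the metrics score. The sketch below runs the same mapping against a stand-in application so the returned structure can be checked offline. `fake_chatbot_application` is a hypothetical stub, not the tracked function above, and the sample item is made up.

```python
from typing import Dict


def fake_chatbot_application(question: str, context: str) -> str:
    """Hypothetical stand-in for the tracked LLM call; returns a canned string."""
    return f"(stub answer to: {question})"


def evaluation_task(x: Dict[str, str]) -> Dict[str, str]:
    # Same mapping as the notebook's task, but calling the stub instead of a model.
    return {
        "input": x["question"],
        "output": fake_chatbot_application(x["question"], x["passage"]),
        "context": x["passage"],
    }


item = {"passage": "Made-up passage text.", "question": "What is this passage about?"}
result = evaluation_task(item)
print(result)
```

Checking the task this way before a full run avoids spending model calls on a task that returns the wrong keys.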


In [16]:
# Get the Opik client
client = opik.Opik()

In [15]:
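# Define the metrics. Both are LLM-as-a-judge metrics; as the output below shows,
# they call an OpenAI model here, so scoring fails unless OPENAI_API_KEY is set.
metrics = [Hallucination(), AnswerRelevance()]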

# experiment_name
experiment_name = MODEL + "_" + dataset.name + "_" + datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# run evaluation
evaluation = evaluate(
    experiment_name=experiment_name,
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=metrics,
    experiment_config={
        "model": MODEL
    }
)

Streaming output truncated to the last 5000 lines.
OPIK: Failed to compute metric answer_relevance_metric. Score result will be marked as failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/litellm/llms/openai/openai.py", line 691, in completion
    raise e
  File "/usr/local/lib/python3.10/dist-packages/litellm/llms/openai/openai.py", line 595, in completion
    openai_client: OpenAI = self._get_openai_client(  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/litellm/llms/openai/openai.py", line 361, in _get_openai_client
    _new_client = OpenAI(
  File "/usr/local/lib/python3.10/dist-packages/openai/_client.py", line 101, in __init__
    raise OpenAIError(
openai.OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "