<img src="https://raw.githubusercontent.com/comet-ml/opik/main/apps/opik-documentation/documentation/static/img/opik-logo.svg" width="250"/>

# LLM Unit Tests with MedQuAD



In this exercise, you'll be implementing an LLM unit test similar to the one you just saw in the lesson. To make the exercise a little more interesting, you'll be using the popular MedQuAD dataset, which is a question-answer dataset with context for each question-answer pair.

You'll first download the dataset from HuggingFace and convert it into an Opik Dataset. Then, you'll reuse your Factuality metric from a previous lesson to build an LLM unit test. In the real world, you might run this unit test as part of your CI/CD pipeline to ensure that changes to your underlying model, prompt, or parameters don't cause a regression.

For this exercise, you can use OpenAI or open source models via LiteLLM.

# Imports & Configuration

In [1]:
! pip install opik openai litellm  --quiet


In [2]:
import pytest
import opik
from opik import track, llm_unit
from opik import Opik
from opik.integrations.openai import track_openai
import openai
import json
import os
from getpass import getpass
import pandas as pd
import litellm
from litellm.integrations.opik.opik import OpikLogger
from opik.opik_context import get_current_span_data

opik_logger = OpikLogger()
# In order to log LiteLLM traces to Opik, you will need to set the Opik callback
litellm.callbacks = [opik_logger]

# Define project name to enable tracing
os.environ["OPIK_PROJECT_NAME"] = "unit-test-MedQuAD-bench"



In [3]:
# Opik configuration
if "OPIK_API_KEY" not in os.environ:
  os.environ["OPIK_API_KEY"] = getpass("Enter your Opik API key: ")

opik.configure()

Enter your Opik API key: ··········
Do you want to use "bluemusk" workspace? (Y/n)y


OPIK: Configuration saved to file: /root/.opik.config


In [4]:
# Create an Opik client, used below to create and retrieve datasets
client = opik.Opik()


# Dataset

In [5]:
# Create dataset
dataset = client.get_or_create_dataset(
    name="MedQuAD", description="MedQuAD dataset"
)

OPIK: Created a "MedQuAD" dataset at https://www.comet.com/opik/bluemusk/redirect/datasets?name=MedQuAD.


In [6]:
# Insert items into dataset
df = pd.read_parquet(
    "hf://datasets/AnonymousSub/MedQuAD_Context_Question_Answer_Triples_TWO/data/train-00000-of-00001-c38b6c63d6178c71.parquet"
)
df = df.sample(n=50, random_state=42)




In [7]:
df.head()

| | Contexts | Questions | Answers |
|---|---|---|---|
| 8850 | After cancer treatment, you may have a lot of ... | What is (are) Your cancer survivorship care pl... | |
| 5504 | You were in the hospital to have gastric bypas... | Do you have information about Gastric bypass s... | |
| 5662 | Tetanus is an infection of the nervous system ... | What are the symptoms of Tetanus ? | |
| 7821 | Hepatic ischemia is a condition in which the l... | What is the outlook for Hepatic ischemia ? | |
| 29292 | | What is (are) Scurvy ? | Scurvy is a condition that develops in people ... |


In [9]:
df.shape

(50, 3)

In [10]:
dataset.insert(df.to_dict('records'))

In [11]:
dataset.to_pandas().head()

| | Answers | Questions | Contexts | id |
|---|---|---|---|---|
| 0 | 1q21.1 microdeletion is a chromosomal change i... | What is (are) 1q21.1 microdeletion ? | 1q21.1 microdeletion is a chromosomal change i... | 01944e4b-98dc-709b-99bf-9dbcf57d34c1 |
| 1 | | What are the treatments for Diverticulitis ? | Diverticula are small, bulging sacs or pouches... | 01944e4b-98db-72f0-862c-0f6111fa1114 |
| 2 | | What are the side effects or risks of Neomycin... | Neomycin, polymyxin, and bacitracin ophthalmic... | 01944e4b-98da-7357-ac0e-948c3be43d76 |
| 3 | | What are the symptoms of Dementia ? | Dementia is a loss of brain function that occu... | 01944e4b-98d9-7980-b39d-d8b8cbcd3fb3 |
| 4 | What are the signs and symptoms of CHOPS syndr... | What are the symptoms of CHOPS syndrome ? | | 01944e4b-98d8-7b2d-ba58-12e9562c053c |


# Prompts & Templates

In [12]:
# prompt template for the Factuality metric
factuality_template = """
###INSTRUCTIONS###

You are a helpful assistant who should evaluate if a medical assistant's response is factual given the provided medical context. Output 1 if the chatbot response is factually answering the user message and 0 if it doesn't.

###EXAMPLE OUTPUT FORMAT###
{{
    "value": 0,
    "reason": "The response is not factually answering the user question."
}}


###CONTEXT:###
{context}

###INPUTS:###
{question}

###RESPONSE:###
{response}
"""


In [13]:
prompt_template = """
### CONTEXT
{context}

### QUESTION
{question}
"""

In [14]:
system = "You are a helpful medical assistant who answers questions using provided medical context"

# LLM Application (if using the OpenAI API)

In [None]:
# Simple little client class for using different LLM APIs (OpenAI or LiteLLM)
#class LLMClient:
#  def __init__(self, client_type: str ="openai", model: str ="gpt-4"):
#    self.client_type = client_type
#    self.model = model

#    if self.client_type == "openai":
#      self.client = track_openai(openai.OpenAI())

#    else:
#      self.client = None

  # LiteLLM query function
#  def _get_litellm_response(self, query: str, system: str = "You are a helpful assistant."):
#    messages = [
#        {"role": "system", "content": system },
#        { "role": "user", "content": query }
#    ]

#    response = litellm.completion(
#        model=self.model,
#        messages=messages
#    )

#    return response.choices[0].message.content

  # OpenAI query function - use **kwargs to pass arguments like temperature
#  def _get_openai_response(self, query: str, system: str = "You are a helpful assistant.", **kwargs):
#    messages = [
#        {"role": "system", "content": system },
#        { "role": "user", "content": query }
#    ]

#    response = self.client.chat.completions.create(
#        model=self.model,
#        messages=messages,
#        **kwargs
#    )

#    return response.choices[0].message.content


#  def query(self, query: str, system: str = "You are a helpful assistant.", **kwargs):
#    if self.client_type == 'openai':
#      return self._get_openai_response(query, system, **kwargs)

#    else:
#      return self._get_litellm_response(query, system)


In [None]:
# Set your model and initialize your LLM client
#MODEL = "gpt-4o-mini"
#llm_client = LLMClient(model=MODEL)

In [None]:
#@track
#def generate_factuality_score(question: str, context: str, response: str):
#    factuality_score = llm_client.query(factuality_template.format(context=context, question=question, response=response))
#    # Parse the JSON verdict with json.loads instead of eval, so model output is never executed
#    return json.loads(factuality_score)

In [None]:
#@track
#def llm_application(question: str, context: str) -> str:
#    # LLM application code here
#    chatbot_response = llm_client.query(prompt_template.format(question=question, context=context))
#    return chatbot_response

# Build the Application with LiteLLM and Llama 3.2

In [15]:
# Hugging Face Configs to access meta-llama-3.2 model
if "HF_TOKEN" not in os.environ:
  os.environ["HF_TOKEN"] = getpass("Enter your Hugging Face Key: ")

Enter your Hugging Face Key: ··········


In [16]:
# meta-llama from HuggingFace
MODEL = "huggingface/meta-llama/Llama-3.2-3B-Instruct"

In [18]:
from opik.evaluation.metrics import base_metric, score_result
from opik.evaluation.models import litellm_chat_model

In [20]:
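# Define the Factuality metric class
class Factuality(base_metric.BaseMetric):
  def __init__(self, name: str, model: str = "huggingface/meta-llama/Llama-3.2-3B-Instruct", **kwargs):
    super().__init__(name=name)
    self.llm_client = litellm_chat_model.LiteLLMChatModel(model_name=model)

  @track
  def generate_factuality_score(self, question: str, context: str, response: str) -> dict:
    # Ask the judge model for a verdict in the JSON format requested by factuality_template
    factuality_score = self.llm_client.generate_string(
        factuality_template.format(context=context, question=question, response=response)
    )
    # Parse with json.loads rather than eval, so model output is never executed as code
    return json.loads(factuality_score)

  def score(self, question: str, context: str, response: str, **kwargs):
    # BaseMetric expects a score() method; wrap the verdict in a ScoreResult
    verdict = self.generate_factuality_score(question, context, response)
    return score_result.ScoreResult(name=self.name, value=verdict["value"], reason=verdict.get("reason"))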
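
Judge models sometimes wrap the JSON verdict in extra prose or a Markdown fence, in which case parsing the reply directly fails. Below is a minimal sketch of a more tolerant parser. `parse_verdict` is a hypothetical helper (not part of Opik or LiteLLM), and it assumes the reply contains exactly one JSON object with a `value` of 0 or 1, as requested by the template.

```python
import json
import re


def parse_verdict(raw: str) -> dict:
    """Extract the JSON verdict from a judge reply (hypothetical helper)."""
    # Greedy match from the first "{" to the last "}" so surrounding prose is ignored
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in judge reply: {raw!r}")
    verdict = json.loads(match.group(0))
    # The template asks for 0 or 1; reject anything else early
    if verdict.get("value") not in (0, 1):
        raise ValueError(f"Unexpected verdict value: {verdict.get('value')!r}")
    return verdict


# Made-up reply with leading prose, as a small model might produce
reply = 'Here is my evaluation:\n{"value": 1, "reason": "Supported by the context."}'
print(parse_verdict(reply))
```

A helper like this could be called in place of the direct parse inside `generate_factuality_score`; whether the extra tolerance is needed depends on how reliably the chosen model follows the output format.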

In [21]:
@track
def llm_application(question: str, context: str) -> str:
    # LLM application code here
    chatbot_response = litellm.completion(
        model=MODEL,
        messages=[{"role":"system", "content":system},
                  {"role":"user", "content":prompt_template.format(question=question, context=context)}]
    )
    return chatbot_response.choices[0].message.content

# LLM Unit Testing

In [57]:
eval_dataset = json.loads(dataset.to_json())
#eval_dataset

* convert the list of dictionaries into a list of tuples
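
`pytest.mark.parametrize` takes a list of tuples, one per test case, so each dataset record has to be flattened into a tuple of the fields the test needs. A small sketch with two invented records follows; the keys mirror the dataset columns, the values are placeholders, and `sample_items` is assumed to be shaped like the parsed output of `dataset.to_json()`.

```python
# Two invented records shaped like the parsed dataset items (values are placeholders)
sample_items = [
    {"id": "a1", "Questions": "question 1", "Contexts": "context 1", "Answers": "answer 1"},
    {"id": "a2", "Questions": "question 2", "Contexts": "context 2", "Answers": "answer 2"},
]

# Fields to keep, in the order the test function expects its arguments
fields = ("Questions", "Contexts", "Answers")

# One tuple per record, containing only the selected fields
sample_cases = [tuple(item[field] for field in fields) for item in sample_items]
print(sample_cases)
```

Which fields to keep depends on the arguments the test function takes; the order in `fields` must match the argument names passed to `parametrize`.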

In [58]:
# Transform the dataset into (question, context, expected answer) tuples
final_dataset = [(item["Questions"], item["Contexts"], item["Answers"]) for item in eval_dataset]

# Print the final dataset for verification
print(final_dataset)



In [53]:
# Alternative transformation from the provided notebook, for datasets with nested input/expected_output fields
#final_dataset = [(item["input"]["question"], item["expected_output"]["response"]) for item in eval_dataset]

In [59]:
# Instantiate the judge metric defined above
factuality_metric = Factuality(name="Factuality")

@llm_unit(expected_output_key="expected_output")
@pytest.mark.parametrize("user_question, context, expected_output", final_dataset)
def test_factuality_test(user_question, context, expected_output):
    # The application needs both the question and its medical context
    response = llm_application(user_question, context)
    factuality_score = factuality_metric.generate_factuality_score(user_question, context, response)

    assert factuality_score["value"] > 0.5
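
To try out the test logic without calling a model, the application and the judge can be swapped for stubs. In the sketch below, `fake_application` and `fake_judge` are hypothetical stand-ins, the cases are made up, and the plain loop plays the role that `pytest.mark.parametrize` plays in the real test.

```python
def fake_application(question: str, context: str) -> str:
    # Stand-in for llm_application: simply returns the context as the answer
    return context


def fake_judge(question: str, context: str, response: str) -> dict:
    # Stand-in for the Factuality judge: 1 if a non-empty response appears in the context
    supported = bool(response) and response in context
    return {"value": 1 if supported else 0, "reason": "stubbed verdict"}


# Made-up (question, context) cases
cases = [
    ("What causes scurvy?", "Scurvy is caused by a lack of vitamin C."),
    ("What is tetanus?", ""),  # empty context, so the stub judge returns 0
]

results = []
for question, context in cases:
    response = fake_application(question, context)
    verdict = fake_judge(question, context, response)
    # Same pass criterion as the unit test: the verdict value must exceed 0.5
    results.append((question, verdict["value"] > 0.5))

print(results)
```

The second case fails the threshold, which is the kind of outcome the real test would surface as a failing parametrized case.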
