<img src="https://github.com/comet-ml/opik/blob/main/apps/opik-documentation/documentation/static/img/opik-logo.svg?raw=true" width="200" height="100" alt="Opik Logo">

# Comet Assistant: RAG with Opik

The example below walks through building a simple RAG application with OpenAI and LangChain, and evaluating that application with Opik.

The concepts covered in this tutorial include:

1. Setting up a simple vector store and RAG pipeline with langchain
2. Defining an assistant application using this RAG pipeline and the OpenAI API
3. Creating a dataset of questions for evaluation in Opik
4. Automating the evaluation of the application on the dataset using Opik metrics

## Creating an account on Comet.com

[Comet](https://www.comet.com/site?from=llm&utm_source=opik&utm_medium=colab&utm_content=langchain&utm_campaign=opik) provides a hosted version of the Opik platform. [Create an account](https://www.comet.com/signup?from=llm&utm_source=opik&utm_medium=colab&utm_content=langchain&utm_campaign=opik) and grab your API key.

> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm&utm_source=opik&utm_medium=colab&utm_content=langchain&utm_campaign=opik) for more information.

In [None]:
%pip install --upgrade --quiet opik openai

In [None]:
# from opik import Opik, track
import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import AnswerRelevance, LevenshteinRatio
import openai

In [None]:
opik.configure(use_local=False)

OPIK: Opik is already configured. You can check the settings by viewing the config file at /root/.opik.config


# Set up the vector store for RAG

In [None]:
%pip install --quiet langsmith langchain-community langchain chromadb tiktoken langchain-openai

In [None]:
import os
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

**Set up Vector Store and Retriever.**

The code below sets up a vector store using [Chroma](https://www.trychroma.com/). Here we load the Zoox FAQ documentation, split it into chunks, and embed each chunk.

In [None]:
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load from text file
file_path = "/content/zoox_faq.txt"
with open(file_path, "r", encoding="utf-8") as f:
    text = f.read()

# Wrap into Document (list of one document)
docs = [Document(page_content=text, metadata={"source": file_path})]

# Split
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
splits = splitter.split_documents(docs)

# Embed
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Index
retriever = vectorstore.as_retriever()
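
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, so its real chunks will differ. The helper name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap.

    Illustrative only: this is NOT the algorithm used by RecursiveCharacterTextSplitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts `step` characters after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "x" * 500
print([len(c) for c in sliding_chunks(sample)])  # [200, 200, 200]
```

With an overlap of 50, the last 50 characters of each chunk are repeated at the start of the next one, which reduces the chance that an answer is cut in half at a chunk boundary.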

# Define RAG Application
The code below defines our LLM application. In this case, we create a Comet bot that 1) retrieves relevant context from our vector store based on the input and 2) passes the retrieved context and the user's question, together with an image, to OpenAI to generate a response.

To ensure that the OpenAI API calls are tracked, we wrap the client with the `track_openai` function from the Opik library. We also use the `track` decorator so that each step of the application is tracked.

In [None]:
import openai

from opik.integrations.openai import track_openai
from opik import track, opik_context, Attachment
from pathlib import Path
import base64
from PIL import Image

PROJECT_NAME = "Zoox User Assistant"

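def encode_image_to_base64(image_path: Path) -> str:
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def detect_mime(image_path: Path) -> str:
    """Infer the MIME type (e.g. image/png) from the image format reported by PIL."""
    with Image.open(image_path) as img:
        return f"image/{img.format.lower()}"

def make_image_url(image_path: Path) -> dict:
    """Build the image_url payload (a base64 data URL) for the OpenAI chat API."""
    image_data = encode_image_to_base64(image_path)
    mime_type = detect_mime(image_path)
    return {"url": f"data:{mime_type};base64,{image_data}"}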



class CometImageBot:
    def __init__(self, retriever, model: str = "gpt-4o"):
        self._retriever = retriever
        self._client = track_openai(openai.Client(), project_name=PROJECT_NAME)
        self._model = model

    @track(project_name=PROJECT_NAME)
    def retrieve_docs(self, question: str):
        return self._retriever.invoke(question)

    @track(project_name=PROJECT_NAME)
    def get_answer(self, question: str, image_url: dict, system: str = "You are an assistant that answers questions about images using both visual and text context."):
        docs_retrieved = self.retrieve_docs(question)


        # Create vision prompt
        messages = [
            {
                "role": "system",
                "content": system + "\nUse the following docs to answer the question:\n\n" + str(docs_retrieved),
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": image_url,
                    },
                ],
            },
        ]

        response = self._client.chat.completions.create(
            model=self._model,
            messages=messages,
        )

        return {
            "response": response.choices[0].message.content,
            "context": [str(doc) for doc in docs_retrieved],
        }
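
The two steps that `get_answer` performs, retrieval and prompt assembly, can be exercised without any API calls. The sketch below swaps the Chroma retriever for a toy word-overlap ranking (an assumption made purely so the example is self-contained) and builds the same message structure. `toy_retrieve` and `build_messages` are hypothetical helper names, and the sample chunks are made-up text rather than the real FAQ.

```python
from typing import Dict, List


def toy_retrieve(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Rank chunks by how many question words they share (a stand-in for vector search)."""
    words = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(words & set(c.lower().split())), reverse=True)
    return ranked[:k]


def build_messages(system: str, docs: List[str], question: str, image_url: Dict[str, str]) -> List[dict]:
    """Put the retrieved context in the system message and the question plus image in the user message."""
    context = "\n\n".join(docs)
    return [
        {"role": "system", "content": f"{system}\nUse the following docs to answer the question:\n\n{context}"},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": image_url},
            ],
        },
    ]


# Made-up sample chunks, for illustration only
chunks = [
    "Riders must wear a seatbelt at all times.",
    "The cabin temperature can be adjusted on the screen.",
    "Doors open automatically at the end of the ride.",
]
question = "Do I have to wear a seatbelt?"
docs = toy_retrieve(question, chunks, k=1)
messages = build_messages("You are a helpful assistant.", docs, question, {"url": "data:image/png;base64,AAAA"})
print(docs)
```

Keeping prompt assembly in a pure function like this makes it easy to inspect exactly what the model will see before spending tokens on a real call.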

Testing the Bot and the Retriever with one system prompt

In [None]:
bot = CometImageBot(retriever)
image_url = make_image_url(Path("seatbelt.png"))

bot.get_answer(
    question="how do I buckle my seatbelt",
    image_url=image_url
)

OPIK: Started logging traces to the "Zoox User Assistant" project at https://www.comet.com/opik/api/v1/session/redirect/projects/?trace_id=01988561-5712-7295-add2-2088af71b3c2&path=aHR0cHM6Ly93d3cuY29tZXQuY29tL29waWsvYXBpLw==.


{'response': 'To buckle your seatbelt:\n\n1. Pull the seatbelt across your body.\n2. Insert the metal clip into the buckle until you hear a click.\n3. Adjust the strap so it fits snugly across your lap and shoulder.\n\nEnsure the belt is not twisted and lies flat against your body for safety.',
 'context': ["page_content='Do I have to wear a seat belt?\nYes. The robotaxi won’t move until all riders are wearing their seatbelts. Please stay seated and buckled at all times, even when the robotaxi isn’t moving.' metadata={'source': '/content/zoox_faq.txt'}",
  "page_content='row between the side seat panel and the window. Simultaneously, pull up the handle with your left hand and push the door closest to the handle with your right hand.' metadata={'source': '/content/zoox_faq.txt'}",
  "page_content='If you need to open the doors in an emergency and the above methods do not work, you can use the Emergency Door Handle on the far left of each seat row between the side seat panel and the wind

# Creating a Dataset: Comet Questions

Below we define a standard set of questions that we would like to evaluate the assistant on.

In [None]:
dataset_items = [
  {
    "question": "What are the rules about wearing seatbelt and how do I use it",
    "expected_answer": "You must wear your seatbelt at all times during the ride. To use it, pull the belt across your body and insert the latch into the buckle until it clicks.",
    "image_url": make_image_url(Path("seatbelt.png"))
  },
  {
    "question": "I think my seatbelt is missing",
    "expected_answer": "The seatbelt should be located beside your seat, typically to the left or right. If you can't find it, check if it is retracted into the seat or behind you.",
    "image_url": make_image_url(Path("seatbelt.png"))
  },
  {
    "question": "How do I open the door?",
    "expected_answer": "To open the door, press the illuminated button with the door icon on the side panel. The door will unlock and open automatically.",
    "image_url": make_image_url(Path("open_door.png"))
  },
  {
    "question": "How do I change the temperature?",
    "expected_answer": "Use the on-screen temperature slider or touch controls near the air vents to adjust the cabin temperature. Slide left for cooler and right for warmer.",
    "image_url": make_image_url(Path("temperature.png"))
  },
  {
    "question": "What is the coldest I can make the car?",
    "expected_answer": "You can lower the temperature by sliding the temperature control all the way to the left until the blue or snowflake icon appears, indicating the coldest setting.",
    "image_url": make_image_url(Path("temperature.png"))
  },
  {
    "question": "Am I able to drive the car?",
    "expected_answer": "No, this vehicle is self-driving. All driving functions are handled by the system, and passengers cannot manually control the vehicle.",
    "image_url": make_image_url(Path("temperature.png"))
  },
]

Now that we have our questions, we can create a dataset in Opik and insert them into it.

In [None]:
# Get or create a dataset
client = opik.Opik()

dataset = client.get_or_create_dataset(name="Zoox-FAQ",
                                       description="Zoox-FAQ")

# Inserting will not duplicate entries
dataset.insert(dataset_items)
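
Because each item's `image_url` is passed straight to the chat API during evaluation, it can help to check the item shape before inserting. A minimal sketch, assuming every item should carry non-empty `question` and `expected_answer` strings and an `image_url` dict with a `url` key (the shape returned by `make_image_url`). `find_item_problems` is a hypothetical helper, not part of Opik, and the example items are made up.

```python
from typing import List

REQUIRED_TEXT_KEYS = ("question", "expected_answer")


def find_item_problems(items: List[dict]) -> List[str]:
    """Return a human-readable message for every item that does not match the expected shape."""
    problems = []
    for i, item in enumerate(items):
        for key in REQUIRED_TEXT_KEYS:
            if not isinstance(item.get(key), str) or not item[key].strip():
                problems.append(f"item {i}: '{key}' must be a non-empty string")
        image = item.get("image_url")
        if not (isinstance(image, dict) and isinstance(image.get("url"), str)):
            problems.append(f"item {i}: 'image_url' must be a dict with a 'url' string")
    return problems


# Made-up items: the second one wraps the image payload in a list by mistake
example_items = [
    {"question": "How do I open the door?", "expected_answer": "Press the door button.",
     "image_url": {"url": "data:image/png;base64,AAAA"}},
    {"question": "Where is my seatbelt?", "expected_answer": "Beside your seat.",
     "image_url": [{"url": "data:image/png;base64,AAAA"}]},
]
for problem in find_item_problems(example_items):
    print(problem)  # item 1: 'image_url' must be a dict with a 'url' string
```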

# Evaluating the Assistant

To check that our RAG application is working correctly and to decide which system prompt to use in production, we will test it on our dataset with different system prompts.

For this we will use the `evaluate` function from the `opik` library. We will evaluate the application on three metrics: AnswerRelevance, LevenshteinRatio, and a custom LLM-as-a-judge metric that scores the response against the image.

**Step 1: Fetch the dataset for evaluation**

In [None]:
client = opik.Opik()

dataset = client.get_dataset(name="Zoox-FAQ")

**Step 2: Define the system prompt to test**




In [None]:
system_prompt = opik.Prompt(
    name="Zoox Chat Assistant",
    prompt="""
        You are a chat assistant. Answer questions to best of your ability.
        """.strip()
)

**Step 3: Define Evaluation Task**

The evaluation task maps each input to the retrieved context and LLM output. These values will be used by Opik when calculating the metrics defined in the next step.

In [None]:
def evaluation_task(x):
    full_response = bot.get_answer(x['question'], x['image_url'], system_prompt.format())
    response = full_response["response"]
    context = full_response["context"]
    return {
        "response": response,
        "context": context
    }

**Step 4: Define Metrics**

Defining a Custom Metric Using Image Inputs


In [None]:
from opik.evaluation.metrics import base_metric, score_result
from openai import OpenAI
from typing import Any
import json
import re

class LLMJudgeMetric(base_metric.BaseMetric):
    def __init__(self, name: str = "Image Relevance", model_name: str = "gpt-4o"):
        self.name = name
        self.llm_client = OpenAI()
        self.model_name = model_name
        self.prompt_template = """
        You are an impartial judge evaluating how accurately a response answers a question using a provided image as evidence.

        Return only a JSON object with no explanation or extra text.

        Score the response as:
        - 1 if it is fully accurate based on the image
        - 0.5 if partially accurate or uncertain
        - 0 if inaccurate or not supported

        It is very important that the format of your response should be a json with no backticks that returns:
        {{
            "score": <score between 0 and 1>,
            "reason": "<reason for the score>"
        }}

        Question and Response:
        {output}
        """

    def score(self, output: str, image_url: dict, **ignored_kwargs: Any):
        """
        Score the output of an LLM against the image it refers to.
        Args:
            output: The output of an LLM to score.
            image_url: The image payload (a dict with a "url" key) sent to the judge alongside the prompt.
            **ignored_kwargs: Any additional keyword arguments. This is important so that the metric can be used in the `evaluate` function.
        """
        # Construct the prompt based on the output of the LLM (the template has no image placeholder)
        prompt = self.prompt_template.format(output=output)
        # Generate and parse the response from the LLM
        response = self.llm_client.chat.completions.create(
            model=self.model_name,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": prompt
                        },
                        {
                            "type": "image_url",
                            "image_url": image_url
                        }
                    ]
                }
            ]
        )

        raw_output = response.choices[0].message.content.strip()

        match = re.search(r'{\s*"score"\s*:\s*(0(?:\.5)?|1)(?:,\s*"reason"\s*:\s*".*?")?\s*}', raw_output, re.DOTALL)

        if not match:
            raise ValueError(f"Could not extract JSON from model output:\n{raw_output}")

        # Convert matched JSON string to dict
        response_dict = json.loads(match.group(0))
        response_score = float(response_dict["score"])

        return score_result.ScoreResult(
            name=self.name,
            value=response_score
        )

image_relevance = LLMJudgeMetric()
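
The regular expression in `LLMJudgeMetric.score` only accepts scores of exactly 0, 0.5, or 1, and it can miss a `reason` that contains escaped quotes. One alternative, sketched here under the assumption that the judge returns a single JSON object possibly surrounded by extra text, is to locate the outermost braces and let `json.loads` do the parsing. `parse_judge_output` is a hypothetical helper, and the sample reply is made up.

```python
import json


def parse_judge_output(raw_output: str) -> dict:
    """Extract {"score": ..., "reason": ...} from a judge reply and validate the score range."""
    start = raw_output.find("{")
    end = raw_output.rfind("}")
    if start == -1 or end <= start:
        raise ValueError(f"No JSON object found in model output: {raw_output!r}")
    parsed = json.loads(raw_output[start:end + 1])
    score = float(parsed["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"Score out of range: {score}")
    return {"score": score, "reason": str(parsed.get("reason", ""))}


# Made-up reply with surrounding text and an escaped quote inside the reason
reply = 'Verdict: {"score": 0.5, "reason": "Shows a \\"buckle\\" only."} Done.'
print(parse_judge_output(reply))
```

Returning the `reason` as well makes it possible to surface the judge's explanation next to the score instead of discarding it.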


In addition to the custom metric, we use Opik's built-in [LevenshteinRatio](https://www.comet.com/docs/opik/evaluation/metrics/heuristic_metrics#levenshteinratio) and [AnswerRelevance](https://www.comet.com/docs/opik/evaluation/metrics/answer_relevance) metrics.

In [None]:
# Define the metrics
answerrelevance_metric = AnswerRelevance(name="AnswerRelevance")
levenshteinratio_metric = LevenshteinRatio(name="LevenshteinRatio")
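
For intuition about what a Levenshtein-based score measures, the sketch below computes edit distance with the classic dynamic-programming recurrence and normalizes it as `1 - distance / max(len(a), len(b))`. This is one common normalization, shown only for illustration; Opik's `LevenshteinRatio` may normalize differently, so the numbers are not expected to match it exactly.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Map edit distance to a 0..1 similarity (1.0 means identical strings)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))        # 3
print(round(similarity_ratio("kitten", "sitting"), 3))  # 0.571
```

A free-form LLM answer rarely matches the expected answer character for character, so string-similarity scores tend to be low and are best read as a rough signal alongside the LLM-based metrics.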

**Step 5: Run the evaluation**

Input the dataset, experiment config, evaluation task, and metrics into Opik's `evaluate` to run the evaluation.

In [None]:
experiment_config = {"model": "gpt-4o"}
experiment_name = "zoox-faq-v1"

res = evaluate(
    dataset=dataset,
    experiment_name=experiment_name,
    experiment_config=experiment_config,
    project_name=PROJECT_NAME,
    task=evaluation_task,
    prompt=system_prompt,
    scoring_metrics=[answerrelevance_metric, levenshteinratio_metric, image_relevance],
    scoring_key_mapping={
        "input": "question",
        "output": "response",
        "reference": "expected_answer"
    }
)
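
As a rough mental model of `scoring_key_mapping` (a simplified illustration, not Opik's internal implementation), the dataset item and the task output are combined into one dictionary, and each mapping entry exposes an existing key under the argument name the metrics expect. `build_metric_inputs` is a hypothetical helper, and the item and output values are made up.

```python
from typing import Dict


def build_metric_inputs(dataset_item: dict, task_output: dict, key_mapping: Dict[str, str]) -> dict:
    """Merge item and task output, then alias keys to the names the metrics expect."""
    combined = {**dataset_item, **task_output}  # task output wins on key collisions
    for metric_arg, source_key in key_mapping.items():
        combined[metric_arg] = combined[source_key]
    return combined


item = {
    "question": "How do I open the door?",
    "expected_answer": "Press the door button.",
    "image_url": {"url": "data:image/png;base64,AAAA"},
}
output = {
    "response": "Press the illuminated door button.",
    "context": ["Doors open with the side-panel button."],
}
mapping = {"input": "question", "output": "response", "reference": "expected_answer"}

metric_inputs = build_metric_inputs(item, output, mapping)
print(sorted(metric_inputs))
```

Under this model, a metric whose `score` method takes `output` and `image_url` needs a mapping entry only for `output`, because `image_url` already exists under that name in the dataset item.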

The evaluation results are now uploaded to the Opik platform and can be viewed in the UI.

# Evaluating the Assistant (II)

Prompt engineering is an iterative process. Let's try a different system prompt and compare the results against the first experiment.

In [None]:
system_prompt = opik.Prompt(
    name="Zoox Chat Assistant",
    prompt="""
        You are Zoey, a friendly, knowledgeable virtual assistant designed to help users have a safe and comfortable experience while riding in autonomous vehicles. Your primary role is to answer user questions clearly and accurately, drawing from a curated FAQ knowledge base and analyzing images provided by the user when relevant (e.g., photos of the vehicle interior, displays, or signage).

        Goals:
        Provide fast, reliable answers to questions about policies, safety, accessibility, vehicle features, and how to interact with the autonomous vehicle.

        Offer guidance based on visual context when users submit images (e.g., interpreting screen messages, identifying seatbelt indicators, or showing how to use in-vehicle controls).

        Always prioritize clarity, reassurance, and safety in your responses.
        """.strip()
)

In [None]:
experiment_config = {"model": "gpt-4o"}
experiment_name = "zoox-faq-v2"

res = evaluate(
    dataset=dataset,
    experiment_name=experiment_name,
    experiment_config=experiment_config,
    project_name=PROJECT_NAME,
    task=evaluation_task,
    prompt=system_prompt,
    scoring_metrics=[answerrelevance_metric, levenshteinratio_metric, image_relevance],
    scoring_key_mapping={
        "input": "question",
        "output": "response",
        "reference": "expected_answer"
    }
)