# Opik Optimizer (beta)

Welcome to the Opik Optimizer beta program! In this notebook we'll walk through a basic example of one of the optimizer algorithms.

## Setup

First, let's install the needed Python packages.

We only need one package, `opik-optimizer`:

In [1]:
%%capture
%pip install git+https://github.com/comet-ml/opik#subdirectory=sdks/opik_optimizer --upgrade

Next, we'll configure access to the Opik platform:

[Comet](https://www.comet.com/site?from=llm&utm_source=opik) provides a hosted version of the Opik platform. [Simply create an account](https://www.comet.com/signup?from=llm&utm_source=opik&utm_medium=colab&utm_content=langchain&utm_campaign=opik) and grab your API key.

> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm&utm_source=opik) for more information.

You can use your own workspace.

In [2]:
import opik

# Configure Opik
opik.configure()

OPIK: Your Opik API key is available in your account settings, can be found at https://www.comet.com/api/my/settings/ for Opik cloud


Please enter your Opik API key:··········
Do you want to use "dsblank" workspace? (Y/n)y


OPIK: Configuration saved to file: /root/.opik.config


For this example, we'll use OpenAI models, so we need to set our OpenAI API key:

In [3]:
import os
import getpass
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

Enter your OpenAI API key: ··········


To optimize any prompt, we'll need:

1. A starting prompt
2. A metric
3. A dataset

For this initial test, we'll start with a portion of the HotpotQA dataset.

HotpotQA is a question-answering dataset featuring natural questions paired with correct answers. It was collected by a team of NLP researchers at Carnegie Mellon University, Stanford University, and Université de Montréal.

Let's take a look at the demo dataset "hotpot-300".

First, we get or create it. This will add it to our default Opik project:

In [4]:
from opik_optimizer.demo import get_or_create_dataset
from opik_optimizer.demo.cache import get_litellm_cache


opik_dataset = get_or_create_dataset("hotpot-300")
get_litellm_cache("test")

Let's look at a specific row from the dataset:

In [5]:
rows = opik_dataset.get_items()
rows[0]

{'id': '0195d400-517b-7f17-b746-ff3d084463ac',
 'question': 'Were both drinks, the Smoking Bishop and the Mickey Slim, popular in different countries?',
 'answer': 'yes'}

We see that each item has a "question" and "answer". Some of the answers are short and direct, and some of them are more complicated:

In [6]:
rows[2]

{'id': '0195d400-5179-7259-8104-1f0a55a13ac2',
 'question': 'Woody Wuthrie wrote the song "Do Re Mi" which was about migrants and their experiences when they arrive in California and are greeted with severe storms that came in waves in what years?',
 'answer': '1934, 1936, and 1939–1940'}

As you can see, the answers can be a little messy. We'll need a good metric to be able to determine whether a prompt is good or not. But let's start simple.

We'll use the `Equals` metric from Opik.

It works like this:

In [7]:
from opik.evaluation.metrics import Equals

metric = Equals()

metric.score("Hello", "Hello")

OPIK: Started logging traces to the "Default Project" project at https://www.comet.com/opik/api/v1/session/redirect/projects/?trace_id=0196ab7c-5acc-7781-a3c2-7744b93bc02e&path=aHR0cHM6Ly93d3cuY29tZXQuY29tL29waWsvYXBpLw==.


ScoreResult(name='equals_metric', value=1.0, reason=None, metadata=None, scoring_failed=False)

Here we can see that `value` is 1.0, which means that the two strings are considered equal.

What counts as equal?

In [8]:
metric.score("Hello", "heLLo")

ScoreResult(name='equals_metric', value=1.0, reason=None, metadata=None, scoring_failed=False)

In [9]:
metric.score("hello", "hell")

ScoreResult(name='equals_metric', value=0.0, reason=None, metadata=None, scoring_failed=False)
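
The three results above can be approximated in a few lines of plain Python. This is a minimal sketch, not Opik's implementation: `equals_score` is a hypothetical helper, and the rule it encodes (exact match, ignoring case) is inferred only from the outputs shown.

```python
def equals_score(output: str, reference: str) -> float:
    """Return 1.0 when the two strings match exactly, ignoring case; else 0.0."""
    return 1.0 if output.lower() == reference.lower() else 0.0


print(equals_score("Hello", "Hello"))  # 1.0
print(equals_score("Hello", "heLLo"))  # 1.0
print(equals_score("hello", "hell"))   # 0.0

# A correct answer with extra punctuation still counts as a miss.
print(equals_score("Yes.", "yes"))     # 0.0
```

The last call shows why exact matching is a strict metric for this dataset: a response that is right but phrased differently from the reference answer scores 0.0.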

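As the outputs show, the comparison ignores case but otherwise requires the strings to match in full.

OK, we have a dataset and a metric, so we are ready to construct an optimizer.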

We can use an OpenAI model name, or more generally, a LiteLLM model name. Just make sure you have your model API key set as we did above.

We'll start with the `FewShotBayesianOptimizer`:

In [10]:
from opik_optimizer import (
    FewShotBayesianOptimizer,
    MetricConfig,
    from_llm_response_text,
    from_dataset_field,
    TaskConfig,
)

optimizer = FewShotBayesianOptimizer(
    model="gpt-4o-mini",
    temperature=0.1,
    max_tokens=5000,
)

Now, we need a prompt to optimize. Given the examples above, let's try something like:

In [11]:
initial_prompt = "Provide an answer to the question"

In [12]:
project_name = "optimize-few-shot-hotpot-300"

In [13]:
messages = [
    {"role": "system", "content": initial_prompt},
    {"role": "user", "content": "{{question}}"},
]
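
The `{{question}}` placeholder stands for the `question` field of each dataset item. How the optimizer performs the substitution is internal to the library; the sketch below only illustrates the idea with a hypothetical `render_messages` helper built on `str.replace`.

```python
from typing import Dict, List

Message = Dict[str, str]


def render_messages(messages: List[Message], item: Dict[str, str]) -> List[Message]:
    """Return a copy of `messages` with each {{field}} replaced by item[field]."""
    rendered = []
    for message in messages:
        content = message["content"]
        for field, value in item.items():
            content = content.replace("{{" + field + "}}", str(value))
        rendered.append({"role": message["role"], "content": content})
    return rendered


template = [
    {"role": "system", "content": "Provide an answer to the question"},
    {"role": "user", "content": "{{question}}"},
]

# The first dataset item shown earlier (without its id).
item = {
    "question": "Were both drinks, the Smoking Bishop and the Mickey Slim, "
                "popular in different countries?",
    "answer": "yes",
}

for message in render_messages(template, item):
    print(message)
```

The helper builds new dictionaries rather than editing the template in place, so the same template can be reused for every item in the dataset.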

In [14]:
metric_config = MetricConfig(
    metric=Equals(project_name=project_name),
    inputs={
        "output": from_llm_response_text(),
        "reference": from_dataset_field(name="answer"),
    },
)

task_config = TaskConfig(
    instruction_prompt=initial_prompt,
    input_dataset_fields=["question"],
    output_dataset_field="answer",
    use_chat_prompt=True,
)

In [15]:
score = optimizer.evaluate_prompt(
    dataset=opik_dataset,
    metric_config=metric_config,
    prompt=messages,
    n_samples=100,
)
score

Evaluation:   0%|          | 0/100 [00:00<?, ?it/s]

0.0

This score is pretty low. Let's see if we can optimize it!
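
To make the number concrete: a single score between 0 and 1 over 100 samples is consistent with averaging the per-item metric values, although that aggregation is an assumption here rather than something the notebook states. The sketch below uses a hypothetical `equals_score` stand-in for the metric and made-up model outputs paired with reference answers from the dataset.

```python
from statistics import mean
from typing import List, Tuple


def equals_score(output: str, reference: str) -> float:
    """Stand-in for the metric: exact match, ignoring case."""
    return 1.0 if output.lower() == reference.lower() else 0.0


def average_score(pairs: List[Tuple[str, str]]) -> float:
    """Mean of the per-item scores over (model_output, reference) pairs."""
    return mean([equals_score(output, reference) for output, reference in pairs])


# Model outputs here are invented for illustration; references come from the dataset.
pairs = [
    ("yes", "yes"),
    ("WWOR-TV", "WWOR-TV"),
    ("Tim Burton directed it.", "Tim Burton"),
    ("A national identity card.", "national identity card"),
]

print(average_score(pairs))  # 0.5
```

Two of the four invented outputs are correct in substance but verbose, which is the kind of mismatch that keeps an exact-match score low for an unconstrained prompt.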

The `FewShotBayesianOptimizer` can create better prompts fairly quickly.

Let's try it out. `optimize_prompt()` takes the same dataset and metric configuration as `evaluate_prompt()`, plus the task configuration in place of the chat messages, and will run for a minute or so.

In [16]:
result = optimizer.optimize_prompt(
    opik_dataset,
    metric_config,
    task_config,
    n_trials=3,
    n_samples=100
)

[I 2025-05-07 16:03:27,319] A new study created in memory with name: no-name-422b323c-8895-458c-81d9-daec3166ecbc


Evaluation:   0%|          | 0/100 [00:00<?, ?it/s]

[I 2025-05-07 16:03:40,176] Trial 0 finished with value: 0.19 and parameters: {'n_examples': 5, 'example_0': 4, 'example_1': 11, 'example_2': 167, 'example_3': 5, 'example_4': 32}. Best is trial 0 with value: 0.19.


Evaluation:   0%|          | 0/100 [00:00<?, ?it/s]

[I 2025-05-07 16:03:56,484] Trial 1 finished with value: 0.21 and parameters: {'n_examples': 6, 'example_0': 171, 'example_1': 195, 'example_2': 238, 'example_3': 18, 'example_4': 273, 'example_5': 62}. Best is trial 1 with value: 0.21.


Evaluation:   0%|          | 0/100 [00:00<?, ?it/s]

[I 2025-05-07 16:04:08,596] Trial 2 finished with value: 0.17 and parameters: {'n_examples': 2, 'example_0': 37, 'example_1': 14}. Best is trial 1 with value: 0.21.
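
Each trial line in the log lists the parameters that were tried: `n_examples` is the number of few-shot examples, and `example_0`, `example_1`, and so on appear to be indices of the dataset items used as examples. That reading is an assumption based on the log alone. A small hypothetical helper, `example_indices`, makes a logged trial easier to read:

```python
from typing import Dict, List


def example_indices(params: Dict[str, int]) -> List[int]:
    """Collect example_0 .. example_{n-1} from a trial's parameter dict."""
    return [params[f"example_{i}"] for i in range(params["n_examples"])]


# Parameters copied from the log lines for trials 1 and 2.
trial_1 = {"n_examples": 6, "example_0": 171, "example_1": 195, "example_2": 238,
           "example_3": 18, "example_4": 273, "example_5": 62}
trial_2 = {"n_examples": 2, "example_0": 37, "example_1": 14}

print(example_indices(trial_1))  # [171, 195, 238, 18, 273, 62]
print(example_indices(trial_2))  # [37, 14]
```

Read this way, the search varies both how many examples go into the prompt and which ones, and keeps the combination with the highest score (trial 1, at 0.21, in this run).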


In [17]:
for message in result.prompt:
    print(message)

{'role': 'system', 'content': 'Provide an answer to the question\n\nYou are an intelligent assistant that learns from few-shot examples provided earlier in the conversation. Whenever you respond, carefully follow the structure, tone, and format of previous assistant replies, using them as a guide'}
{'role': 'user', 'content': '\n{\n  "question": "What is the name of the TV station in Hudson County, New Jersey, where Van Hackett worked?"\n}\n'}
{'role': 'assistant', 'content': 'WWOR-TV'}
{'role': 'user', 'content': '\n{\n  "question": "In addition to the Austrian passport, what is needed to gain access to 173 countries and territories?"\n}\n'}
{'role': 'assistant', 'content': 'national identity card'}
{'role': 'user', 'content': '\n{\n  "question": "The stop motion director who directed \\"Coraline\\" also co-directed \\"James and The Giant Peach\\" with what male director?"\n}\n'}
{'role': 'assistant', 'content': 'Tim Burton'}
{'role': 'user', 'content': '\n{\n  "question": "Which movi

Although there is some randomness, you probably got a better prompt. In the run shown above, the best trial scored 0.21.
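
The printed prompt shows the shape of a few-shot chat prompt: a system message, followed by user/assistant pairs built from dataset items, with each question wrapped in a small JSON object. The sketch below reproduces that layout with a hypothetical `build_few_shot_messages` helper. It is not the optimizer's code, and it does not add the extra instruction text that the optimizer appended to the system prompt.

```python
import json
from typing import Dict, List


def build_few_shot_messages(
    system_prompt: str,
    examples: List[Dict[str, str]],
    input_field: str = "question",
    output_field: str = "answer",
) -> List[Dict[str, str]]:
    """Build a chat prompt: system message, then one user/assistant pair per example."""
    messages = [{"role": "system", "content": system_prompt}]
    for example in examples:
        # Mirror the layout in the printed prompt: indented JSON between newlines.
        user_content = "\n" + json.dumps({input_field: example[input_field]}, indent=2) + "\n"
        messages.append({"role": "user", "content": user_content})
        messages.append({"role": "assistant", "content": example[output_field]})
    return messages


# Two of the examples visible in the printed prompt.
examples = [
    {"question": "What is the name of the TV station in Hudson County, New Jersey, "
                 "where Van Hackett worked?",
     "answer": "WWOR-TV"},
    {"question": "In addition to the Austrian passport, what is needed to gain access "
                 "to 173 countries and territories?",
     "answer": "national identity card"},
]

for message in build_few_shot_messages("Provide an answer to the question", examples):
    print(message)
```

Because the assistant turns contain only the short reference answers, the examples nudge the model toward terse responses, which is what an exact-match metric rewards.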

This is just the initial framework for optimizers for Opik!

Please see the [Opik Optimizer repo](https://github.com/comet-ml/opik/tree/main/sdks/opik_optimizer) for additional examples using different algorithms.