# Opik Optimizer (beta)

Welcome to the Opik Optimizer beta program! In this notebook we'll walk through a basic example of one optimizer algorithm.

## Setup

First, let's install the needed Python packages.

We'll of course need `opik`:

In [1]:
%pip install opik --quiet


Next, we'll install the beta version of `opik-optimizer`:

In [4]:
!pip install git+https://github.com/comet-ml/opik#subdirectory=sdks/opik_optimizer --upgrade --quiet > pip-install.log 2>&1

[Comet](https://www.comet.com/site?from=llm&utm_source=opik) provides a hosted version of the Opik platform. [Create an account](https://www.comet.com/signup?from=llm&utm_source=opik&utm_medium=colab&utm_content=langchain&utm_campaign=opik) and grab your API key.

> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm&utm_source=opik) for more information.

When prompted, you can choose your own workspace.

In [5]:
import opik

# Configure Opik
opik.configure()

OPIK: Your Opik API key is available in your account settings, can be found at https://www.comet.com/api/my/settings/ for Opik cloud


Please enter your Opik API key:··········
Do you want to use "dsblank" workspace? (Y/n)y


OPIK: Configuration saved to file: /root/.opik.config


For this example, we'll use OpenAI models, so we need to set our OpenAI API key:

In [6]:
import os
import getpass
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

Enter your OpenAI API key: ··········


To optimize any prompt, we'll need:

1. A starting prompt
2. A metric
3. A dataset

For this initial test, we'll start with a portion of the HotpotQA dataset.

HotpotQA is a question-answering dataset featuring natural questions with correct answers. It was collected by a team of NLP researchers at Carnegie Mellon University, Stanford University, and Université de Montréal.

Let's take a look at the demo dataset "hotpot-300".

First, we get or create it. This will add it to our default Opik project:

In [26]:
from opik_optimizer.demo import get_or_create_dataset
from opik_optimizer.demo.cache import get_litellm_cache


opik_dataset = get_or_create_dataset("hotpot-300")
get_litellm_cache("test")

Let's look at a specific row from the dataset:

In [8]:
rows = opik_dataset.get_items()
rows[0]

{'id': '0195d400-517b-7f17-b746-ff3d084463ac',
 'question': 'Were both drinks, the Smoking Bishop and the Mickey Slim, popular in different countries?',
 'answer': 'yes'}

We see that each item has a "question" and "answer". Some of the answers are short and direct, and some of them are more complicated:

In [9]:
rows[2]

{'id': '0195d400-5179-7259-8104-1f0a55a13ac2',
 'question': 'Woody Wuthrie wrote the song "Do Re Mi" which was about migrants and their experiences when they arrive in California and are greeted with severe storms that came in waves in what years?',
 'answer': '1934, 1936, and 1939–1940'}

As you can see, the answers can be a little messy. We'll need a good metric to be able to determine whether a prompt is good or not. But let's start simple.

We'll use the `LevenshteinRatio` metric from Opik.

It works like this:

In [10]:
from opik.evaluation.metrics import LevenshteinRatio

metric = LevenshteinRatio()

metric.score("Hello", "Hello")

OPIK: Started logging traces to the "Default Project" project at https://www.comet.com/opik/api/v1/session/redirect/projects/?trace_id=01968c66-c5d3-7e85-b7d6-3911e2e60dc8&path=aHR0cHM6Ly93d3cuY29tZXQuY29tL29waWsvYXBpLw==.


ScoreResult(name='levenshtein_ratio_metric', value=1.0, reason=None, metadata=None, scoring_failed=False)

Here we can see that the value is 1.0. That means there is no difference between "Hello" and "Hello". That is as close as you can get!

As the two strings become more different, the value gets smaller, approaching 0.0:

In [11]:
metric.score("Hello", "Hello!")

ScoreResult(name='levenshtein_ratio_metric', value=0.9090909090909091, reason=None, metadata=None, scoring_failed=False)

In [12]:
metric.score("one", "six")

ScoreResult(name='levenshtein_ratio_metric', value=0.0, reason=None, metadata=None, scoring_failed=False)

As you can see, slightly different strings have values close to 1, and strings with no letters in common give a zero.

To find out more about the metric, see [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance).
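To make the three results above concrete, here is a minimal pure-Python sketch of one common way to compute such a ratio. It assumes the score is `2 * LCS / (len(a) + len(b))`, where LCS is the length of the longest common subsequence. That formula is an assumption, not taken from the Opik source, and `similarity_ratio` is a hypothetical helper, but it does reproduce the values shown above. The sketch compares the strings exactly as given, with no normalization.

```python
def similarity_ratio(a: str, b: str) -> float:
    """Return 2 * LCS(a, b) / (len(a) + len(b)), a value in [0.0, 1.0]."""
    if not a and not b:
        return 1.0  # two empty strings are identical

    # Dynamic-programming table for the longest common subsequence,
    # kept as a single rolling row to save memory.
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current

    lcs_length = previous[-1]
    return 2 * lcs_length / (len(a) + len(b))


print(similarity_ratio("Hello", "Hello"))   # 1.0
print(similarity_ratio("Hello", "Hello!"))  # 0.9090909090909091
print(similarity_ratio("one", "six"))       # 0.0
```

Under this formula, a long answer compared against a short reference scores low even when it contains the reference, because the extra characters inflate the denominator. For example, `similarity_ratio("yes, they were", "yes")` is well below 0.5.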

OK, we have a dataset and a metric. Now we are ready to construct an optimizer.

We can use an OpenAI model name, or more generally, a LiteLLM model name. Just make sure you have your model API key set as we did above.

We'll start with the `FewShotBayesianOptimizer`:

In [21]:
from opik_optimizer import (
    FewShotBayesianOptimizer,
    OptimizationConfig,
    MetricConfig,
    from_llm_response_text,
    from_dataset_field,
    PromptTaskConfig,
)

optimizer = FewShotBayesianOptimizer(
    model="gpt-4o-mini",
    temperature=0.1,
    max_tokens=5000,
)

Now, we need a prompt to optimize. Given the examples above, let's try something like:

In [22]:
initial_prompt = "Provide an answer to the question"

In [23]:
initial_prompt_no_examples = [
    {"role": "system", "content": initial_prompt},
    {"role": "user", "content": "{{question}}"},
]
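The `{{question}}` placeholder in the user message is filled from each dataset row at evaluation time. The optimizer handles that substitution internally; the sketch below only illustrates the idea with a hypothetical helper, `render_messages`, which is not part of `opik_optimizer`.

```python
import copy


def render_messages(messages, row):
    """Fill {{field}} placeholders in each message with values from a dataset row."""
    rendered = copy.deepcopy(messages)  # leave the template untouched
    for message in rendered:
        for field, value in row.items():
            placeholder = "{{" + field + "}}"
            message["content"] = message["content"].replace(placeholder, str(value))
    return rendered


template = [
    {"role": "system", "content": "Provide an answer to the question"},
    {"role": "user", "content": "{{question}}"},
]
row = {"question": "Long Lake is located in which town?", "answer": "Harrison"}

filled = render_messages(template, row)
print(filled[1]["content"])
```

The template itself is left unchanged, so the same list can be rendered once per dataset item.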

In [24]:
optimization_config = OptimizationConfig(
    dataset=opik_dataset,
    objective=MetricConfig(
        metric=LevenshteinRatio(),
        inputs={
            "output": from_llm_response_text(),
            "reference": from_dataset_field(name="answer"),
        },
    ),
    task=PromptTaskConfig(
        instruction_prompt=initial_prompt,
        input_dataset_fields=["question"],
        output_dataset_field="answer",
        use_chat_prompt=True,
    ),
)

In [27]:
score = optimizer.evaluate_prompt(
    dataset=opik_dataset,
    metric_config=optimization_config.objective,
    prompt=initial_prompt_no_examples,
)
score



0.14533818431262188

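A score of about 0.15 is pretty low. Let's see if we can improve it!

The `FewShotBayesianOptimizer` can, fairly quickly, create better prompts by choosing examples from the dataset to include in the prompt as few-shot demonstrations.

Let's try it out. `optimize_prompt()` takes the `OptimizationConfig` we built above and the number of trials to run (`n_trials`). In this run, three trials took a little over two minutes.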

In [28]:
result = optimizer.optimize_prompt(optimization_config, n_trials=3)

[I 2025-05-01 15:30:53,011] A new study created in memory with name: no-name-091bce25-dd58-4c57-8ee3-bb3781d5887d


[I 2025-05-01 15:31:37,114] Trial 0 finished with value: 0.5314112618205721 and parameters: {'n_examples': 8, 'example_0': 251, 'example_1': 15, 'example_2': 164, 'example_3': 11, 'example_4': 118, 'example_5': 141, 'example_6': 118, 'example_7': 157}. Best is trial 0 with value: 0.5314112618205721.


[I 2025-05-01 15:32:19,474] Trial 1 finished with value: 0.5372766641460849 and parameters: {'n_examples': 8, 'example_0': 294, 'example_1': 85, 'example_2': 232, 'example_3': 171, 'example_4': 104, 'example_5': 40, 'example_6': 57, 'example_7': 282}. Best is trial 1 with value: 0.5372766641460849.


[I 2025-05-01 15:33:04,198] Trial 2 finished with value: 0.5083562144763062 and parameters: {'n_examples': 8, 'example_0': 82, 'example_1': 180, 'example_2': 102, 'example_3': 155, 'example_4': 251, 'example_5': 236, 'example_6': 198, 'example_7': 122}. Best is trial 1 with value: 0.5372766641460849.


In [30]:
print(result)

Optimizer Results:
    Best prompt: [{'role': 'system', 'content': 'Provide an answer to the question\n\nYou are an intelligent assistant that learns from few-shot examples provided earlier in the conversation. Whenever you respond, carefully follow the structure, tone, and format of previous assistant replies, using them as a guide'}, {'role': 'user', 'content': '\n{\n  "question": "Long Lake is located in which town?"\n}\n'}, {'role': 'assistant', 'content': 'Harrison'}, {'role': 'user', 'content': '\n{\n  "question": "In which region did the settlers have conflict with the Mexican government where it escalated to a rebellion led by John Dunn Hunter?"\n}\n'}, {'role': 'assistant', 'content': 'Texas'}, {'role': 'user', 'content': '\n{\n  "question": "What single by an Australian rock band was featured in an Australian paranormal television program which premiered on 9 July 2015, on ABC1?"\n}\n'}, {'role': 'assistant', 'content': 'Breakaway'}, {'role': 'user', 'content': '\n{\n  "quest

Although there is some randomness, you probably got a better prompt. In this run, the best trial scored about 0.54, roughly 3.7 times the baseline of about 0.15.
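The trial logs show that each trial picks a set of dataset indices (`example_0`, `example_1`, ...), and the best prompt contains the matching question/answer pairs as user/assistant turns. The sketch below shows how such a prompt could be assembled from chosen indices. It is an illustration, not the optimizer's implementation: `build_few_shot_prompt` is a hypothetical helper, plain random sampling stands in for the Bayesian search, and the final `{{question}}` turn is an assumption, since the printed prompt is truncated.

```python
import json
import random


def build_few_shot_prompt(instruction, rows, example_indices):
    """Build a chat prompt with one user/assistant pair per selected row."""
    messages = [{"role": "system", "content": instruction}]
    for index in example_indices:
        row = rows[index]
        # Mirror the JSON-style user turns seen in the printed best prompt.
        question_json = json.dumps({"question": row["question"]}, indent=2)
        messages.append({"role": "user", "content": f"\n{question_json}\n"})
        messages.append({"role": "assistant", "content": row["answer"]})
    # Assumption: the question being answered is appended as the last turn.
    messages.append({"role": "user", "content": "{{question}}"})
    return messages


rows = [
    {"question": "Long Lake is located in which town?", "answer": "Harrison"},
    {
        "question": "Were both drinks, the Smoking Bishop and the Mickey Slim, "
        "popular in different countries?",
        "answer": "yes",
    },
]

# Random sampling here is a stand-in for the optimizer's search over indices.
rng = random.Random(0)
indices = rng.sample(range(len(rows)), k=2)
prompt = build_few_shot_prompt("Provide an answer to the question", rows, indices)
print(len(prompt))  # 1 system + 2 pairs + 1 final user turn = 6
```

Each trial would then be scored with the metric, and the set of examples with the highest score kept.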

This is just the initial optimizer framework for Opik!

Please see the [Opik Optimizer repo](https://github.com/comet-ml/opik/tree/main/sdks/opik_optimizer) for additional examples using different algorithms.