# Introduction to Opik Agent Optimizers

You will need:

1. A Comet account, for seeing Opik visualizations (free!) - [comet.com](https://comet.com)
2. An OpenAI account, for using an LLM - [platform.openai.com/settings/organization/api-keys](https://platform.openai.com/settings/organization/api-keys)


## Setup

This pip-install takes about a minute.

In [1]:
%pip install opik-optimizer --quiet

Note: you may need to restart the kernel to use updated packages.


Let's make sure we have the correct version:

In [2]:
import opik_optimizer

opik_optimizer.__version__



'2.2.2'

The version should be 0.7.3 or greater.
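
If you would rather check the version in code than by eye, a minimal sketch follows. The helper names `parse_version` and `is_recent_enough` are hypothetical (they are not part of `opik_optimizer`), and the parser only handles plain dotted numeric versions such as `2.2.2`; for pre-release tags, `packaging.version.Version` is the more robust choice.

```python
MINIMUM_VERSION = "0.7.3"  # the minimum stated in this notebook


def parse_version(version: str) -> tuple:
    """Turn a plain dotted version such as '2.2.2' into (2, 2, 2)."""
    return tuple(int(part) for part in version.split("."))


def is_recent_enough(installed: str, minimum: str = MINIMUM_VERSION) -> bool:
    """Compare versions numerically, so that '0.10.0' counts as newer than '0.7.3'."""
    return parse_version(installed) >= parse_version(minimum)


# In the notebook you would pass opik_optimizer.__version__ here.
print(is_recent_enough("2.2.2"))
```

Comparing tuples of integers avoids the trap of comparing version strings alphabetically.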

[Comet](https://www.comet.com/site?from=llm&utm_source=opik) provides a hosted version of the Opik platform. [Simply create an account](https://www.comet.com/signup?from=llm&utm_source=opik&utm_medium=colab&utm_content=langchain&utm_campaign=opik) and grab your API Key.

> You can also run the Opik platform locally. See the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm&utm_source=opik) for more information.

Enter your Comet API key, followed by "Y" to use your own workspace:

In [3]:
import opik

# Configure Opik
opik.configure()

OPIK: Opik is already configured. You can check the settings by viewing the config file at /Users/jacquesverre/.opik.config
OPIK: Configuration completed successfully. Traces will be logged to 'Default Project' project. To change the destination project, see: https://www.comet.com/docs/opik/tracing/log_traces#configuring-the-project-name


For this example, we'll use OpenAI models, so we need to set our OpenAI API key:

In [4]:
import os
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

To optimize any prompt, we'll need three basic things:

1. A starting prompt
2. A metric
3. A dataset

## The Dataset

In this experiment, we are going to use the **HotPotQA** dataset, which was designed to be difficult for regular LLMs to handle. It is called a "**multi-hop**" dataset because answering the questions involves multiple reasoning steps and multiple tool calls: the LLM needs to infer relationships, combine information, or draw conclusions based on the combined context.

Example:

> "What are the capitals of the states that border California?"

You'd need to find which states border California, and then look up each state's capital.
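
To make the two "hops" concrete, here is a small sketch that uses two hand-written lookup tables in place of real tool calls. The table names and the function `capitals_of_bordering_states` are made up for illustration; an actual agent would retrieve both facts from a search tool or from Wikipedia.

```python
# Hop 1 data: which states border a given state (hand-written for this example).
BORDERING_STATES = {"California": ["Oregon", "Nevada", "Arizona"]}

# Hop 2 data: the capital of each state.
STATE_CAPITALS = {"Oregon": "Salem", "Nevada": "Carson City", "Arizona": "Phoenix"}


def capitals_of_bordering_states(state: str) -> dict:
    """Answer the question in two steps: find the neighbors, then their capitals."""
    neighbors = BORDERING_STATES[state]  # first hop
    return {name: STATE_CAPITALS[name] for name in neighbors}  # second hop


print(capitals_of_bordering_states("California"))
```

Neither table alone answers the question; the result of the first lookup is the input to the second, which is what makes the question multi-hop.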

The dataset has about 113,000 such crowd-sourced questions that are constructed to require the introductory paragraphs of two Wikipedia articles to answer.

**NOTE:** The name "HotPot" comes from the restaurant where the authors came up with the idea of the dataset.

In [5]:
import opik_optimizer

opik_dataset = opik_optimizer.datasets.hotpot_300()

Let's take a look at some dataset items:

In [6]:
rows = opik_dataset.get_items()
rows[0]

{'id': '0197044c-d782-7735-a8cf-bffd4d19838e',
 'question': 'Are Smyrnium and Nymania both types of plant?',
 'answer': 'yes'}

We see that each item has a "question" and "answer". Some of the answers are short and direct, and some of them are more complicated:

In [7]:
rows[1]

{'id': '0197044c-d781-7eb0-a717-6fd6b1415476',
 'question': 'That Darn Cat! and Never a Dull Moment were both produced by what studio?',
 'answer': 'Walt Disney Productions'}

## Opik Project

All LLM traces in Opik are saved in a "project". We'll save all of ours under the following project name:

In [8]:
project_name = "optimize-workshop-2025"

## The Metric

Choosing a good metric for optimization is tricky. For these examples, we'll pick one that will allow us to show improvement, and also provide a gradient of scores. In general though, this metric isn't the best for optimization runs.

We'll use a similarity ratio based on "Edit Distance", AKA "Levenshtein Distance":

In [9]:
from opik.evaluation.metrics import LevenshteinRatio

metric = LevenshteinRatio(project_name=project_name)

The metric takes two things: the `output` of the LLM and the `reference` (the truth):

In [10]:
metric.score("Hello", "Hello")

OPIK: Started logging traces to the "optimize-workshop-2025" project at https://www.comet.com/opik/api/v1/session/redirect/projects/?trace_id=019a8327-2c10-7bf5-8b70-777868d850a4&path=aHR0cHM6Ly93d3cuY29tZXQuY29tL29waWsvYXBpLw==.


ScoreResult(name='levenshtein_ratio_metric', value=1.0, reason=None, metadata=None, scoring_failed=False)

In [11]:
metric.score("Hello!", "Hello")

ScoreResult(name='levenshtein_ratio_metric', value=0.9090909090909091, reason=None, metadata=None, scoring_failed=False)

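The edit distance between "Hello!" and "Hello" is 1 (one extra character). The ratio divides that distance by the combined length of the two strings (6 + 5 = 11) and subtracts the result from 1, which gives about 0.91:

In [12]:
# 1 edit, normalized by the combined length of both strings
ratio = 1 - 1 / (len("Hello!") + len("Hello"))
ratio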

0.9090909090909091

For more information see: [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance)
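
For intuition, the same number can be reproduced without Opik. The sketch below implements the textbook Levenshtein distance with dynamic programming and then applies the normalization used in the cell above. The function names are hypothetical, and the built-in `LevenshteinRatio` may weight substitutions differently, so treat this as an approximation that agrees on the "Hello!" example rather than as the library's implementation.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def similarity_ratio(output: str, reference: str) -> float:
    """1 minus the distance divided by the combined length of both strings."""
    total_length = len(output) + len(reference)
    if total_length == 0:
        return 1.0  # two empty strings are identical
    return 1 - levenshtein_distance(output, reference) / total_length


print(similarity_ratio("Hello!", "Hello"))
```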

## Configuration

To create the necessary configurations for using an Opik Agent Optimizer, you'll need two things:

1. An initial prompt
2. A metric

We're going to start with a pretty bad prompt... so we can optimize it!

In [13]:
from opik_optimizer import ChatPrompt

initial_prompt = ChatPrompt(
    messages=[
        {"role": "system", "content": "Provide an answer to the question"},
        {"role": "user", "content": "{question}"},
    ]
)

And the metric:

In [14]:
def levenshtein_ratio(dataset_item, llm_output):
    metric = LevenshteinRatio()
    return metric.score(reference=dataset_item["answer"], output=llm_output)

As you can see, our metric function is a wrapper around the built-in `LevenshteinRatio`. Most Opik metrics take two parameters:
1. `output` - the output of the LLM
2. `reference` - the correct answer (provided by the dataset item's "answer" field)
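
To see why a metric with a gradient of scores is convenient here, compare it with a strict alternative. The sketch below is a hypothetical `exact_match` function with the same two-argument shape as `levenshtein_ratio`; it is written in plain Python, is not wired into Opik, and returns a bare float purely for illustration.

```python
def normalize(text: str) -> str:
    """Lowercase, drop trailing punctuation, and collapse whitespace."""
    return " ".join(text.lower().strip().rstrip(".!?").split())


def exact_match(dataset_item: dict, llm_output: str) -> float:
    """Score 1.0 only when the normalized output equals the normalized answer."""
    return 1.0 if normalize(llm_output) == normalize(dataset_item["answer"]) else 0.0


item = {"question": "Are Smyrnium and Nymania both types of plant?", "answer": "yes"}
print(exact_match(item, "Yes."))  # matches after normalization
print(exact_match(item, "Yes, both are plants."))  # no partial credit
```

An all-or-nothing score like this gives the optimizer no signal when an answer is close but wordy, which is one reason the edit-distance ratio is easier to demonstrate with.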



## FewShotBayesianOptimizer

The FewShotBayesianOptimizer name indicates two things:

1. It produces chat prompts that include few-shot examples, as described in the slides.
2. It uses Bayesian optimization to search for the best set of these few-shot examples.

To use this optimizer, we import it and create an instance, passing in the model name and model parameters:

In [16]:
from opik_optimizer import (
    FewShotBayesianOptimizer,
)

optimizer = FewShotBayesianOptimizer(
    model="openai/gpt-4o-mini",
    model_parameters={"temperature": 0.1, "max_tokens": 5000},
)

### Baseline

Before we optimize this prompt ("Provide an answer to the question") let's see what the bare prompt does by itself on the dataset:

In [17]:
score = optimizer.evaluate_prompt(
    prompt=initial_prompt,
    dataset=opik_dataset,
    metric=levenshtein_ratio,
    n_samples=100,
    n_threads=4,
)
score

Evaluation:   0%|          | 0/100 [00:00<?, ?it/s]

OPIK: Started logging traces to the "Optimization" project at https://www.comet.com/opik/api/v1/session/redirect/projects/?trace_id=019a8327-6f19-76fa-9691-9e44ed642ff0&path=aHR0cHM6Ly93d3cuY29tZXQuY29tL29waWsvYXBpLw==.


0.1418003695496669

In my run, it scored about 14% correct. [I say "percent correct", but because we are using edit distance, that isn't quite accurate. Still, we can think of it this way.] Your result may be somewhat different but is probably between 10% and 20% correct.
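
Conceptually, a baseline evaluation samples some dataset items, asks the model each question, scores each answer, and averages the scores. The sketch below shows that loop with a stub in place of the LLM. Everything in it (`evaluate`, `always_yes`, `strict_metric`) is hypothetical and only illustrates the idea; it is not how `evaluate_prompt` is implemented.

```python
import random
from statistics import mean


def evaluate(answer_fn, dataset, metric_fn, n_samples, seed=0):
    """Average metric_fn over a random sample of dataset items."""
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same items
    sample = rng.sample(dataset, min(n_samples, len(dataset)))
    return mean(metric_fn(item, answer_fn(item["question"])) for item in sample)


def always_yes(question: str) -> str:
    """Stub standing in for the LLM call."""
    return "yes"


def strict_metric(dataset_item: dict, llm_output: str) -> float:
    """Toy metric: 1.0 for an exact string match, otherwise 0.0."""
    return 1.0 if llm_output == dataset_item["answer"] else 0.0


toy_dataset = [
    {"question": "Are Smyrnium and Nymania both types of plant?", "answer": "yes"},
    {
        "question": "That Darn Cat! and Never a Dull Moment were both produced by what studio?",
        "answer": "Walt Disney Productions",
    },
]

print(evaluate(always_yes, toy_dataset, strict_metric, n_samples=2))
```

Because only a sample is scored, two runs with different samples (or a non-deterministic model) can report slightly different baselines.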

Ok, let's optimize that prompt!

In [18]:
result = optimizer.optimize_prompt(
    prompt=initial_prompt,
    dataset=opik_dataset,
    metric=levenshtein_ratio,
    n_samples=50,
    n_trials=3,
)

╭────────────────────────────────────────────────────────────────────╮
│ ● Running Opik Evaluation - FewShotBayesianOptimizer               │
│                                                                    │
│ -> View optimization details in your Opik dashboard                │
╰────────────────────────────────────────────────────────────────────╯


> Let's optimize the prompt:

╭─ system ───────────────────────────────────────────────────────────╮
│                                                                    │
│  Provide an answer to the question                                 │
│

Output()

  Baseline score was: 0.1439.

> Let's add a placeholder for few-shot examples in the messages:
│    Created the prompt template:
│
│    ╭─ system ───────────────────────────────────────────────────────────╮
│    │                                                                    │
│    │  Provide an answer to the question                                 │
│    │                                                                    │
│    │  ### Few-Shot Examples                                             │
│    │  FEW_SHOT_EXAMPLE_PLACEHOLDER                                      │
│    │                                                                    │
│    ╰────────────────────────────────────────────────────────────────────╯
│    ╭─ user ────────────────────────────────────────────────────────────

Output()

│    Trial 1 - score was: 0.3598 (150.00%).
│
│ - Starting optimization round 2 of 10
│
│    ╭─ system ───────────────────────────────────────────────────────────╮
│    │                                                                    │
│    │  Provide an answer to the question                                 │
│    │                                                                    │
│    │  ### Few-Shot Examples                                             │
│    │  Question: Which career do Jaoru Kuroki and Ilona Staller share?   │
│    │  Answer: adult video (AV) performer                                │
│    │                                                                    │
│    │  Question: Ramfis Trujillo was a close friend of the racecar       │
│    │  driver and diplomat of what nation

Output()

│    Trial 2 - score was: 0.2835 (96.98%).
│
│ - Starting optimization round 3 of 10
│
│    ╭─ system ───────────────────────────────────────────────────────────╮
│    │                                                                    │
│    │  Provide an answer to the question                                 │
│    │                                                                    │
│    │  ### Few-Shot Examples                                             │
│    │  Question: The Police Integrity Commission is responsible for the  │
│    │  prevention, detection, and investigation of alleged serious       │
│    │  misconduct in what primary law enforcement agency of the state    │
│    │  of New South Wales, Australia that is  Divided into seventy six   │
│    │  local area command

Output()

│    Trial 3 - score was: 0.2850 (98.02%).
│
│ - Starting optimization round 4 of 10
│
│    ╭─ system ───────────────────────────────────────────────────────────╮
│    │                                                                    │
│    │  Provide an answer to the question                                 │
│    │                                                                    │
│    │  ### Few-Shot Examples                                             │
│    │  Question: Woody Wuthrie wrote the song "Do Re Mi" which was       │
│    │  about migrants and their experiences when they arrive in          │
│    │  California and are greeted with severe storms that came in waves  │
│    │  in what years?                                                    │
│    │  Answer: 1934, 1936, and 1939–1940

Output()

│    Trial 4 - score was: 0.3387 (135.32%).
│
│ - Starting optimization round 5 of 10
│
│    ╭─ system ───────────────────────────────────────────────────────────╮
│    │                                                                    │
│    │  Provide an answer to the question                                 │
│    │                                                                    │
│    │  ### Few-Shot Examples                                             │
│    │  Question: Who is the wife of Charlemagne who is a step mother to  │
│    │  Pepin the Hunchback?                                              │
│    │  Answer: Hildegard                                                 │
│    │                                                                    │
│    │  Question: Who attacked American Ai

Output()

│    Trial 5 - score was: 0.3787 (163.12%).
│
│ - Starting optimization round 6 of 10
│
│    ╭─ system ───────────────────────────────────────────────────────────╮
│    │                                                                    │
│    │  Provide an answer to the question                                 │
│    │                                                                    │
│    │  ### Few-Shot Examples                                             │
│    │  Question: Can you find Peter Piper Pizza where Pizza Fusion is    │
│    │  based?                                                            │
│    │  Answer: no                                                        │
│    │                                                                    │
│    │  Question: Josh Brolin was cast as

Output()

│    Trial 6 - score was: 0.1977 (37.38%).
│
│ - Starting optimization round 7 of 10
│
│    ╭─ system ───────────────────────────────────────────────────────────╮
│    │                                                                    │
│    │  Provide an answer to the question                                 │
│    │                                                                    │
│    │  ### Few-Shot Examples                                             │
│    │  Question: Who was the star of "The Perks of Being a Wallflower"   │
│    │  and also starred as the title character in the drama "We Need to  │
│    │  Talk About Kevin" (2011)?                                         │
│    │  Answer: Ezra Matthew Miller                                       │
│    │

Output()

│    Trial 7 - score was: 0.2440 (69.56%).
│
│ - Starting optimization round 8 of 10
│
│    ╭─ system ───────────────────────────────────────────────────────────╮
│    │                                                                    │
│    │  Provide an answer to the question                                 │
│    │                                                                    │
│    │  ### Few-Shot Examples                                             │
│    │  Question: Wing has drawn comparisons to what  American socialite  │
│    │  and amateur soprano who was known and mocked for her flamboyant   │
│    │  performance costumes                                              │
│    │  Answer: Florence Foster Jenkins                                   │
│    │

Output()

│    Trial 8 - score was: 0.3428 (138.21%).
│
│ - Starting optimization round 9 of 10
│
│    ╭─ system ───────────────────────────────────────────────────────────╮
│    │                                                                    │
│    │  Provide an answer to the question                                 │
│    │                                                                    │
│    │  ### Few-Shot Examples                                             │
│    │  Question: What did the first book of Gary Zukav explore ?         │
│    │  Answer: empirical topics in modern physics research               │
│    │                                                                    │
│    │  Question: The principal building in the village of Guthrie dates  │
│    │  back to what century?

Output()

│    Trial 9 - score was: 0.2643 (83.61%).
│
│ - Starting optimization round 10 of 10
│
│    ╭─ system ───────────────────────────────────────────────────────────╮
│    │                                                                    │
│    │  Provide an answer to the question                                 │
│    │                                                                    │
│    │  ### Few-Shot Examples                                             │
│    │  Question: The principal building in the village of Guthrie dates  │
│    │  back to what century?                                             │
│    │  Answer: the 15th century                                          │
│    │                                                                    │
│    │  Question: Peter Scully is on trial

Output()

│    Trial 10 - score was: 0.3267 (126.98%).
│

> Optimization complete

╭─ Optimization results ─────────────────────────────────────────────╮
│                                                                    │
│  Prompt was optimized and improved from 0.1439 to 0.3787           │
│  (16312.14%)                                                       │
│                                                                    │
│  Optimized prompt:                                                 │
│  ╭─ system ─────────────────────────────────────────────────────╮  │
│  │                                                              │  │
│  │  Provide an answer to the question                           │  │

In [20]:
result.display()

╔═════════════════════════════════════════════ Optimization Complete ═════════════════════════════════════════════╗
║                                                                                                                 ║
║ Optimizer:             FewShotBayesianOptimizer                                                                 ║
║ Model Used:            openai/gpt-4o-mini                                                                       ║
║ Metric Evaluated:      levenshtein_ratio                                                                        ║
║ Initial Score:         0.1439                                                                                   ║
║ Final Best Score:      0.3787

Well done optimizer!

The percentage correct went from about 14% to about 38%.
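
The size of the gain is easiest to read as a relative improvement over the baseline. A minimal sketch of that arithmetic, using the rounded scores printed in the log (0.1439 and 0.3787), is below; the helper name `relative_improvement` is my own. The result is roughly 163%, which is close to the 163.12% shown for Trial 5 (the log presumably works from unrounded scores). The 16312.14% in the summary box looks like the same figure scaled by 100 a second time, though that is a guess from the numbers alone.

```python
def relative_improvement(baseline: float, optimized: float) -> float:
    """Percentage change of the optimized score relative to the baseline."""
    return (optimized - baseline) / baseline * 100


gain = relative_improvement(0.1439, 0.3787)
print(f"{gain:.1f}%")
```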

What did we find? The result is a series of messages:

In [21]:
result.prompt

[{'role': 'system',
  'content': "Provide an answer to the question\n\n### Few-Shot Examples\nQuestion: Who is the wife of Charlemagne who is a step mother to Pepin the Hunchback?\nAnswer: Hildegard\n\nQuestion: Who attacked American Airlines Flight 444, in a nationwide bombing campaign that targeted people involved with modern technology?\nAnswer: Theodore John Kaczynski\n\nQuestion: How large is this city in North Rhine-Westphalia as compared to other cities in Germany from which the Krupp family comes?\nAnswer: the ninth-largest\n\nQuestion: What type of person does Prime Minister of Hungary and Viktor Orbán have in common?\nAnswer: leader\n\nQuestion: Which dog originated in the Mountains of Romania, the Ariege Pointer or the Carpathian Shepherd Dog?\nAnswer: Carpathian Mountains of Romania.\n\nQuestion: What is the name of the actor who played Pee-wee Herman on Pee-wee's Playhouse?\nAnswer: Paul Reubens\n\nQuestion: The Knicks–Nuggets was the most penalized on-court fight in the N
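
The structure of that system message can be reproduced by hand, which helps when reading the optimizer's output. The sketch below fills the `FEW_SHOT_EXAMPLE_PLACEHOLDER` marker seen in the optimizer log with `Question:` / `Answer:` pairs in the same layout. The helper names are hypothetical and the optimizer's own templating may differ in detail.

```python
PLACEHOLDER = "FEW_SHOT_EXAMPLE_PLACEHOLDER"
TEMPLATE = "Provide an answer to the question\n\n### Few-Shot Examples\n" + PLACEHOLDER


def render_examples(examples: list) -> str:
    """Format each example as a Question/Answer pair, separated by blank lines."""
    return "\n\n".join(
        f"Question: {example['question']}\nAnswer: {example['answer']}"
        for example in examples
    )


def build_system_prompt(examples: list) -> str:
    """Substitute the rendered examples for the placeholder in the template."""
    return TEMPLATE.replace(PLACEHOLDER, render_examples(examples))


examples = [
    {"question": "Are Smyrnium and Nymania both types of plant?", "answer": "yes"},
    {
        "question": "That Darn Cat! and Never a Dull Moment were both produced by what studio?",
        "answer": "Walt Disney Productions",
    },
]

print(build_system_prompt(examples))
```

Each trial in the log corresponds to a different choice of examples substituted into the same template.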

## Opik Visualization UI

When you create an Optimization Run, you'll see it in the Opik UI (either running locally or hosted).

If you go to your `comet.com/opik/YOURID/optimizations` page, you'll see your run at the top of the list:

<img src="https://raw.githubusercontent.com/comet-ml/opik/refs/heads/main/sdks/opik_optimizer/docs/images/optimizer-ui.png" width="500px">

Along the top of the page you'll see a running history of the metric scores, with the latest dataset selected.

If you click on your run, you'll see the set of trials that ran during this optimization run. Running across the top of this page are the scores for the trials in this run. On the top right you'll see the best so far:

<img src="https://raw.githubusercontent.com/comet-ml/opik/refs/heads/main/sdks/opik_optimizer/docs/images/optimize-trials.png" width="500px">

If you click on a trial, you'll see the prompt for that trial:

<img src="https://raw.githubusercontent.com/comet-ml/opik/refs/heads/main/sdks/opik_optimizer/docs/images/optimizer-prompt.png" width="500px">

From the trial page you can also see the Trial items, and even dig down (mouse over the "Evaluation task" column on a row) to see the traces.

## Using Optimized Prompts

How can we use the optimized results?


Once we have the "chat_messages", we can do the following:

In [22]:
from litellm.integrations.opik.opik import OpikLogger
import litellm

opik_logger = OpikLogger()
litellm.callbacks = [opik_logger]


def query(question, chat_messages):
    messages = chat_messages[:-1]  # Copy everything except the final template message
    # Replace it with the question in the expected format:
    messages.append({"role": "user", "content": '{"question": "%s"}' % question})

    response = litellm.completion(
        model="gpt-4o-mini",
        temperature=0.1,
        max_tokens=5000,
        messages=messages,
    )
    return response.choices[0].message.content
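
Formatting the question with `%s` breaks if the question itself contains a double quote. A slightly more defensive sketch of the message-building step, using `json.dumps` for the escaping, is below. It assumes, as the cell above does, that the last entry of `chat_messages` is the user template and that the model expects a JSON object with a `question` key; `build_messages` is a hypothetical helper, separate from the LLM call so it can be checked offline.

```python
import json


def build_messages(question: str, chat_messages: list) -> list:
    """Return a new message list with the final template replaced by the question."""
    messages = list(chat_messages[:-1])  # copy, so the optimized prompt is not mutated
    messages.append(
        {"role": "user", "content": json.dumps({"question": question})}
    )
    return messages


template = [
    {"role": "system", "content": "Provide an answer to the question"},
    {"role": "user", "content": "{question}"},
]

for message in build_messages('Who sang "Do Re Mi"?', template):
    print(message)
```

The returned list can be passed as `messages=` to the completion call in `query`.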

In [23]:
query("When was David Chalmers born?", result.details["chat_messages"])

'Answer: April 20, 1966'

In [24]:
query("What weighs more: a baby elephant or an SUV?", result.details["chat_messages"])

'Answer: A baby elephant typically weighs more than an SUV.'

If it says "elephant", that is not correct! A baby elephant weighs far less than a typical SUV.

We'll need to use an agent with tools to get a better answer.

# Next Steps

You can try out other optimizers. More details can be found in the [Opik Agent Optimizer documentation](https://www.comet.com/docs/opik/agent_optimization/overview).