# Opik Optimizer (beta)

Welcome to the Opik Optimizer beta program! In this notebook we'll walk through a basic example of an optimizer algorithm.

## Setup

First, let's install the needed Python packages.

We only need one package, `opik-optimizer`:

In [1]:
%%capture
!pip install opik-optimizer

Next, we'll connect to the Opik platform:

[Comet](https://www.comet.com/site?from=llm&utm_source=opik) provides a hosted version of the Opik platform. [Simply create an account](https://www.comet.com/signup?from=llm&utm_source=opik&utm_medium=colab&utm_content=langchain&utm_campaign=opik) and grab your API key.

> You can also run the Opik platform locally, see the [installation guide](https://www.comet.com/docs/opik/self-host/overview/?from=llm&utm_source=opik) for more information.

You can use your own workspace.

In [2]:
! pip show pydantic

Name: pydantic
Version: 2.11.4
Summary: Data validation using Python type hints
Home-page: https://github.com/pydantic/pydantic
Author: 
Author-email: Samuel Colvin <s@muelcolvin.com>, Eric Jolibois <em.jolibois@gmail.com>, Hasan Ramezani <hasan.r67@gmail.com>, Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>, Terrence Dorsey <terry@pydantic.dev>, David Montague <david@pydantic.dev>, Serge Matveenko <lig@countzero.co>, Marcelo Trylesinski <marcelotryle@gmail.com>, Sydney Runkle <sydneymarierunkle@gmail.com>, David Hewitt <mail@davidhewitt.io>, Alex Hall <alex.mojaki@gmail.com>, Victorien Plot <contact@vctrn.dev>
License: 
Location: /usr/local/lib/python3.11/dist-packages
Requires: annotated-types, pydantic-core, typing-extensions, typing-inspection
Required-by: albumentations, confection, dspy, google-cloud-aiplatform, google-genai, google-generativeai, langchain, langchain-core, langsmith, litellm, openai, opik, opik_optimizer, pydantic-settings, spacy, thinc, wandb

In [3]:
import opik

# Configure Opik
opik.configure()

OPIK: Your Opik API key is available in your account settings, can be found at https://www.comet.com/api/my/settings/ for Opik cloud


Please enter your Opik API key:··········
Do you want to use "dsblank" workspace? (Y/n)y


OPIK: Configuration saved to file: /root/.opik.config


For this example, we'll use OpenAI models, so we need to set our OpenAI API key:

In [4]:
import os
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

Enter your OpenAI API key: ··········


To optimize any prompt, we'll need:

1. A starting prompt
2. A metric
3. A dataset

For this initial test, we'll start with a portion of the HotpotQA dataset.

HotpotQA is a question-answering dataset featuring natural questions with correct answers. It was collected by a team of NLP researchers at Carnegie Mellon University, Stanford University, and Université de Montréal.

Let's take a look at the demo dataset "hotpot-300".

First, we get or create it. This will add it to our default Opik project:

In [5]:
from opik_optimizer.demo import get_or_create_dataset
from opik_optimizer.demo.cache import get_litellm_cache


opik_dataset = get_or_create_dataset("hotpot-300")
get_litellm_cache("test")

Let's look at a specific row from the dataset:

In [6]:
rows = opik_dataset.get_items()
rows[0]

{'id': '0195d400-517b-7f17-b746-ff3d084463ac',
 'question': 'Were both drinks, the Smoking Bishop and the Mickey Slim, popular in different countries?',
 'answer': 'yes'}

We see that each item has a "question" and "answer". Some of the answers are short and direct, and some of them are more complicated:

In [7]:
rows[2]

{'id': '0195d400-5179-7259-8104-1f0a55a13ac2',
 'question': 'Woody Wuthrie wrote the song "Do Re Mi" which was about migrants and their experiences when they arrive in California and are greeted with severe storms that came in waves in what years?',
 'answer': '1934, 1936, and 1939–1940'}

As you can see, the answers can be a little messy. We'll need a good metric to be able to determine whether a prompt is good or not. But let's start simple.

We'll use the `Equals` metric from Opik.

It works like this:

In [8]:
from opik.evaluation.metrics import Equals

metric = Equals()

metric.score("Hello", "Hello")

OPIK: Started logging traces to the "Default Project" project at https://www.comet.com/opik/api/v1/session/redirect/projects/?trace_id=0196b133-7fda-70b6-b967-295511b381ca&path=aHR0cHM6Ly93d3cuY29tZXQuY29tL29waWsvYXBpLw==.


ScoreResult(name='equals_metric', value=1.0, reason=None, metadata=None, scoring_failed=False)

Here we can see that the value is 1.0. That means that the two values are the same.

What counts as equal?

In [9]:
metric.score("Hello", "heLLo")

ScoreResult(name='equals_metric', value=1.0, reason=None, metadata=None, scoring_failed=False)

In [10]:
metric.score("hello", "hell")

ScoreResult(name='equals_metric', value=0.0, reason=None, metadata=None, scoring_failed=False)
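
The two results above suggest that the comparison ignores letter case but still requires the full strings to match. Here is a minimal plain-Python sketch of that behavior. The helper `equals_score` and its `case_sensitive` parameter are names assumed for illustration; this is not Opik's implementation, only a model of the outputs shown in this notebook.

```python
def equals_score(output: str, reference: str, case_sensitive: bool = False) -> float:
    """Return 1.0 when the two strings match exactly, else 0.0 (hypothetical helper)."""
    if not case_sensitive:
        # Fold case on both sides so "Hello" and "heLLo" compare equal.
        output, reference = output.lower(), reference.lower()
    return 1.0 if output == reference else 0.0


print(equals_score("Hello", "heLLo"))                       # 1.0
print(equals_score("hello", "hell"))                        # 0.0
print(equals_score("Hello", "heLLo", case_sensitive=True))  # 0.0
```

Because the score is all-or-nothing, a nearly correct answer such as "hell" earns no partial credit, which matters for the messier answers in the dataset.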

OK, we have a dataset and a metric. Now we are ready to construct an optimizer.

We can use an OpenAI model name, or more generally, a LiteLLM model name. Just make sure you have your model API key set as we did above.

We'll start with the `FewShotBayesianOptimizer`:

In [11]:
from opik_optimizer import (
    FewShotBayesianOptimizer,
    MetricConfig,
    from_llm_response_text,
    from_dataset_field,
    TaskConfig,
)

optimizer = FewShotBayesianOptimizer(
    model="gpt-4o-mini",
    temperature=0.1,
    max_tokens=5000,
)

Now, we need a prompt to optimize. Given the examples above, let's try something like:

In [12]:
initial_prompt = "Provide an answer to the question"

In [13]:
project_name = "optimize-few-shot-hotpot-300"

In [14]:
messages = [
    {"role": "system", "content": initial_prompt},
    {"role": "user", "content": "{{question}}"},
]
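
The `{{question}}` placeholder in the user message stands for the `question` field of a dataset item. As a rough sketch, assuming simple string substitution, filling the template for one row could look like this. The helper `render_messages` is hypothetical and is not part of `opik_optimizer`.

```python
def render_messages(messages, row):
    """Return a copy of `messages` with each {{field}} replaced by row[field]."""
    rendered = []
    for message in messages:
        content = message["content"]
        for field, value in row.items():
            # Replace the placeholder for this dataset field, if present.
            content = content.replace("{{" + field + "}}", str(value))
        rendered.append({"role": message["role"], "content": content})
    return rendered


template = [
    {"role": "system", "content": "Provide an answer to the question"},
    {"role": "user", "content": "{{question}}"},
]
row = {
    "question": "Are the bands Flow and Against the Current from the same country?",
    "answer": "no",
}

for message in render_messages(template, row):
    print(message["role"], "->", message["content"])
```

The template itself is left unchanged, so the same list can be reused for every row.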

In [15]:
metric_config = MetricConfig(
    metric=Equals(project_name=project_name),
    inputs={
        "output": from_llm_response_text(),
        "reference": from_dataset_field(name="answer"),
    },
)

task_config = TaskConfig(
    instruction_prompt=initial_prompt,
    input_dataset_fields=["question"],
    output_dataset_field="answer",
    use_chat_prompt=True,
)
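
The metric configuration above maps each metric argument to a source: `output` comes from the LLM response, and `reference` comes from the dataset's `answer` field. The following sketch shows one way such a mapping could drive an evaluation that averages the metric over dataset rows. Everything here is an assumption for illustration: `fake_model` is an offline stand-in for an LLM call, the first row is toy data, and `evaluate` is a hypothetical helper rather than the optimizer's actual code.

```python
from statistics import mean


def exact_match(output, reference):
    """Case-insensitive exact match, returning 1.0 or 0.0."""
    return 1.0 if output.lower() == reference.lower() else 0.0


def fake_model(question):
    """Stand-in for an LLM call; returns canned answers so the sketch runs offline."""
    canned = {"Is the sky blue?": "Yes"}
    return canned.get(question, "I am not sure.")


def evaluate(rows, model, metric_inputs, metric):
    """Score every row and return the mean score."""
    scores = []
    for row in rows:
        response = model(row["question"])
        # Resolve each metric argument from its configured source.
        kwargs = {name: source(response, row) for name, source in metric_inputs.items()}
        scores.append(metric(**kwargs))
    return mean(scores)


metric_inputs = {
    "output": lambda response, row: response,          # from the model response
    "reference": lambda response, row: row["answer"],  # from the dataset field
}
rows = [
    {"question": "Is the sky blue?", "answer": "yes"},  # toy row
    {"question": "Which has more people, Tumen, Jilin or Xiangcheng City?", "answer": "Xiangcheng"},
]

print(evaluate(rows, fake_model, metric_inputs, exact_match))  # 0.5
```

With an exact-match metric, a verbose but correct response still scores 0.0, which is one reason a bare instruction prompt can score poorly.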

In [16]:
score = optimizer.evaluate_prompt(
    dataset=opik_dataset,
    metric_config=metric_config,
    prompt=messages,
    n_samples=100,
)
score

Evaluation:   0%|          | 0/100 [00:00<?, ?it/s]

0.0

This score is pretty low (0.0 in the run shown above; expect some variation between runs). Let's see if we can optimize it!

The FewShotBayesianOptimizer can, fairly quickly, create better prompts.
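
Judging by the `chat_messages` output shown at the end of this notebook, the improved prompt keeps the system instruction and adds question/answer pairs from the dataset as earlier turns of the conversation. The sketch below assembles messages in that shape. The helper `build_few_shot_messages` is hypothetical, and the JSON formatting of the user turns is only an approximation of the output shown in this notebook.

```python
import json


def build_few_shot_messages(system_prompt, examples, question):
    """Build a chat with one user/assistant pair per example, then the real question."""
    messages = [{"role": "system", "content": system_prompt}]
    for example in examples:
        # Each example becomes a user turn followed by the expected assistant reply.
        messages.append(
            {"role": "user", "content": json.dumps({"question": example["question"]}, indent=2)}
        )
        messages.append({"role": "assistant", "content": example["answer"]})
    # The question we actually want answered goes last.
    messages.append({"role": "user", "content": json.dumps({"question": question}, indent=2)})
    return messages


examples = [
    {"question": "Who did Holly Dunn record \"Daddy's Hands\" for?", "answer": "MTM Records"},
    {"question": "Are the bands Flow and Against the Current from the same country?", "answer": "no"},
]
chat = build_few_shot_messages(
    "Provide an answer to the question",
    examples,
    "Which has more people, Tumen, Jilin or Xiangcheng City?",
)
print([message["role"] for message in chat])
```

The short assistant replies show the model the terse answer format that an exact-match metric rewards.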

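Let's try it out. It takes much the same parameters as `evaluate_prompt()`, except that it receives the task configuration instead of the prompt messages, plus the number of trials to run. It will run for a minute or so.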

In [17]:
result = optimizer.optimize_prompt(
    opik_dataset,
    metric_config,
    task_config,
    n_trials=3,
    n_samples=100
)

Evaluation:   0%|          | 0/100 [00:00<?, ?it/s]

Evaluation:   0%|          | 0/100 [00:00<?, ?it/s]

Evaluation:   0%|          | 0/100 [00:00<?, ?it/s]

Evaluation:   0%|          | 0/100 [00:00<?, ?it/s]

In [21]:
result.display()

Although there is some randomness, you probably got a better prompt. My result is about 5 times better.
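
To see where the randomness comes from, here is a deliberately simplified sketch of a trial loop. It uses plain random search as a stand-in for the Bayesian search, so it is not the optimizer's algorithm: each trial samples a subset of candidate examples, scores it, and keeps the best subset seen so far. The names `random_search_few_shot` and `toy_score` are hypothetical, and the toy scoring function simply rewards longer example lists so the sketch runs without an LLM.

```python
import random


def random_search_few_shot(pool, score_fn, n_trials=3, max_examples=2, seed=0):
    """Try `n_trials` random subsets of `pool` and return the best (subset, score)."""
    rng = random.Random(seed)
    best_subset, best_score = [], score_fn([])  # baseline: no few-shot examples
    for _ in range(n_trials):
        k = rng.randint(1, max_examples)  # how many examples to use in this trial
        subset = rng.sample(pool, k)      # which examples to use
        score = score_fn(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


def toy_score(subset):
    """Toy stand-in for an evaluation run: more examples give a higher score."""
    return len(subset) / 4


pool = ["example A", "example B", "example C", "example D"]
best_subset, best_score = random_search_few_shot(pool, toy_score)
print(best_subset, best_score)
```

A different seed can select different examples, which is why two runs of an optimizer like this rarely produce identical prompts.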

This is just the initial framework for optimizers for Opik!

Please see the [Opik Optimizer repo](https://github.com/comet-ml/opik/tree/main/sdks/opik_optimizer) for additional examples using different algorithms.

In [25]:
result.details["chat_messages"]

[{'role': 'system',
  'content': 'Provide an answer to the question\n\nYou are an intelligent assistant that learns from few-shot examples provided earlier in the conversation. Whenever you respond, carefully follow the structure, tone, and format of previous assistant replies, using them as a guide'},
 {'role': 'user',
  'content': '\n{\n  "question": "Who did Holly Dunn record \\"Daddy\'s Hands\\" for?"\n}\n'},
 {'role': 'assistant', 'content': 'MTM Records'},
 {'role': 'user',
  'content': '\n{\n  "question": "What musical instrument does Amaan Ali Khan and  Amjad Ali Khan both play?"\n}\n'},
 {'role': 'assistant', 'content': 'the sarod'},
 {'role': 'user',
  'content': '\n{\n  "question": "Which has more people, Tumen, Jilin or Xiangcheng City?"\n}\n'},
 {'role': 'assistant', 'content': 'Xiangcheng'},
 {'role': 'user',
  'content': '\n{\n  "question": "Are the bands Flow and Against the Current from the same country?"\n}\n'},
 {'role': 'assistant', 'content': 'no'},
 {'role': 'user