# DSPy Workshop

In [0]:
%pip install -q dspy mlflow
%restart_python

### Setup MLflow Tracing

Let's set up MLflow tracing for the entire lab so that we can inspect what each DSPy call does. This lab was prepared with the Databricks-managed MLflow service, so we don't need to manage an MLflow server ourselves. If you are running the notebook on your local machine, you will need to spin up your own MLflow server. To do so, make sure you have `mlflow` installed, then execute the command below in your terminal:

```bash
mlflow ui  --host 127.0.0.1 --port 8080
```

Then add the following lines to the Jupyter notebook:

```python
import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:8080")
```

After that, the rest of the lab should run as written. If you are having trouble setting up the local MLflow server, please refer to the [MLflow Quickstart](https://mlflow.org/docs/latest/getting-started/intro-quickstart/).

In [0]:
import mlflow

mlflow.dspy.autolog()

## DSPy Backbone - Signature and Modules

In this lab, we will demonstrate the fundamentals of DSPy by walking through an end-to-end demo of building a DSPy program. You will learn how to set the LM for the program, use a DSPy signature to define your input/output data and format, and then build a DSPy program with DSPy modules to accomplish your task. You will also learn how to save and load your DSPy program for future use and sharing.

In [0]:
import dspy
import os

os.environ["OPENAI_API_KEY"] = "{your_openai_api_key}"

In [0]:
import litellm
import logging

litellm._logging._disable_debugging()
httpx_logger = logging.getLogger("httpx")
httpx_logger.setLevel(logging.WARNING)

### Configure the LM

Configure the LM behind our DSPy workflow. The identifier has the format `{provider_name}/{lm_name}`, e.g., `databricks/databricks-claude-3-7-sonnet` or `openai/gpt-4o`. To find the complete list of providers and supported models, please refer to the [LiteLLM documentation](https://docs.litellm.ai/docs/providers).

In this lab, we will be using the `openai/gpt-4o-mini` model. If you decide to switch to a different LM in the future, simply call `dspy.settings.configure(lm=dspy.LM({new_lm_identifier}))`, and the rest of the code can stay unchanged!

In [0]:
dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))
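
As a small illustration of the `{provider_name}/{lm_name}` format, the sketch below splits an identifier into its two parts. The helper `split_lm_identifier` is hypothetical and is not part of DSPy or LiteLLM; it only shows how the string is structured.

```python
def split_lm_identifier(identifier: str) -> tuple[str, str]:
    """Split '{provider_name}/{lm_name}' into (provider, model).

    Hypothetical helper for illustration only; not a DSPy or LiteLLM API.
    """
    # Split on the first slash only, so model names may contain slashes.
    provider, sep, model = identifier.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected '{{provider_name}}/{{lm_name}}', got {identifier!r}")
    return provider, model


print(split_lm_identifier("openai/gpt-4o-mini"))
print(split_lm_identifier("databricks/databricks-claude-3-7-sonnet"))
```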

## Use DSPy built-in Module to Build a Sentiment Classifier

Let's start simple by building a sentiment classifier with a single built-in DSPy module. The goal is to classify text sentiment on a scale of 0 to 10, where a higher score means a more positive sentiment.

### Build the Signature to Define Input/Output Data and Format

The next step is creating a signature to define your input/output data and format. We have two different options:

- String-based signature: easy to author, but not suitable for complex use cases.
- Class-based signature: powerful, slightly more complex than string-based signature.

Let's explore both options.

#### String-based Signature

String-based signature separates input and output fields by `->`, and separates each field by `,`. For example, `question, context -> answer` means the input has 2 fields, `question` and `context`, and the output has a single field `answer`. To use a string-based signature, we can use `dspy.make_signature()`, or directly feed it to a DSPy module, e.g., `dspy.Predict("question, context -> answer")`. 

Below we create the string-based signature for the sentiment classifier.

In [0]:
str_signature = dspy.make_signature("text -> sentiment")
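
To make the `->` and `,` conventions concrete, here is a plain-Python sketch of how such a string could be split into field names. The function `parse_string_signature` is a hypothetical helper written for this lab, not DSPy's parser, and it ignores type annotations that a real string signature may carry.

```python
def parse_string_signature(signature: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c' into (input field names, output field names).

    Illustrative only: this sketch handles bare field names and nothing else.
    """
    if signature.count("->") != 1:
        raise ValueError("expected exactly one '->' separator")
    inputs_part, outputs_part = signature.split("->")
    inputs = [name.strip() for name in inputs_part.split(",") if name.strip()]
    outputs = [name.strip() for name in outputs_part.split(",") if name.strip()]
    return inputs, outputs


print(parse_string_signature("question, context -> answer"))
print(parse_string_signature("text -> sentiment"))
```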

#### Class-based Signature

To use a class-based signature, we define a class that subclasses `dspy.Signature` and declare the input and output fields as class attributes, similar to a [Pydantic model](https://docs.pydantic.dev/latest/concepts/models/) if you are familiar with it.

In [0]:
class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of a text."""

    text: str = dspy.InputField(desc="input text to classify sentiment")
    sentiment: int = dspy.OutputField(desc="sentiment, the higher the more positive", ge=0, le=10)

To distinguish between input fields and output fields, simply assign `dspy.InputField` to input fields and `dspy.OutputField` to output fields. Fields support the following custom configs:

- type: specify the type of the field with the format `: {type}`, as in a Python type hint.
- desc: a string explaining the purpose of this field.
- Pydantic constraints: additional constraints on the field, such as `ge`, which means the value must be greater than or equal to the given bound.

Please refer to [signature tutorial](https://dspy.ai/learn/programming/signatures/) for a thorough guide on defining a signature.
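
The sketch below spells out what the `ge=0, le=10` constraints on `sentiment` mean as an explicit check. The function `check_sentiment` is a hand-written stand-in for illustration; it is not how DSPy or Pydantic implement constraints.

```python
def check_sentiment(value: object, ge: int = 0, le: int = 10) -> int:
    """Coerce a raw value to int and enforce ge <= value <= le.

    Hypothetical helper: mirrors the meaning of the `ge`/`le` constraints only.
    """
    number = int(str(value).strip())
    if not ge <= number <= le:
        raise ValueError(f"sentiment {number} is outside [{ge}, {le}]")
    return number


print(check_sentiment("8"))

try:
    check_sentiment("11")
except ValueError as err:
    print(f"rejected: {err}")
```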

### Create a Module to Interact with the LM

Now that we have defined the input/output by creating a signature, the next step is to call the LM to classify the sentiment of the user's text. We can simply pass the signature to the most basic DSPy module, `dspy.Predict`, to get the job done.

In [0]:
predict = dspy.Predict(SentimentClassifier)

You can also use `predict = dspy.Predict(str_signature)` or `predict = dspy.Predict("text -> sentiment")` to replace the code above. 

Now you can interact with the LM by calling `predict` with `text` as the input.

In [0]:
output = predict(text="I am feeling pretty happy!")
print(output)

Prediction(
    sentiment=8
)


Trace(request_id=tr-2944b148f1044a339f4bf2996a098a93)

The output is a `dspy.Prediction` instance with a single field `sentiment` of type int (as requested in the signature). You can directly access the field by dot access, or dict access.

In [0]:
print(f"The sentiment is: {output.sentiment}")
print(f"The sentiment is: {output['sentiment']}")

The sentiment is: 8
The sentiment is: 8


### Wait, Where is My Prompt? 

It may seem counterintuitive that the code above never composes the multi-turn prompts that LLM providers such as OpenAI, Anthropic, and Google expect. Instead, we just call a Python callable. Under the hood, DSPy formats the signature together with the user input into a multi-turn prompt, which we can inspect with the utility function `dspy.inspect_history()`.

In [0]:
dspy.inspect_history(n=1)

[2025-04-22T22:13:39.300194]

System message:

Your input fields are:
1. `text` (str): input text to classify sentiment
Your output fields are:
1. `sentiment` (int): sentiment, the higher the more positive
Constraints: greater than or equal to: 0, less than or equal to: 10
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## text ## ]]
{text}

[[ ## sentiment ## ]]
{sentiment}        # note: the value you produce must be a single int value

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Classify the sentiment of a text.


User message:

[[ ## text ## ]]
I am feeling pretty happy!

Respond with the corresponding output fields, starting with the field `[[ ## sentiment ## ]]` (must be formatted as a valid Python int), and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## sentiment ## ]]
8

[[ ## completed ## ]]

In the console output above, the "System message:" section is the message with role `system`, and the "User message:" section is the message with role `user`. The "Response:" section is the actual response we get from the LLM.
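
To show how the `[[ ## field ## ]]` markers in this history can carry structured data, here is a rough sketch that renders input fields into a user message and parses a marker-delimited response back into a dictionary. Both `render_user_message` and `parse_response` are hypothetical helpers written for this lab; DSPy's adapter does considerably more (type casting, instructions, error handling).

```python
import re

# Matches markers such as `[[ ## sentiment ## ]]` and captures the field name.
FIELD_MARKER = re.compile(r"\[\[ ## (\w+) ## \]\]")


def render_user_message(inputs: dict[str, str]) -> str:
    """Render input fields using the `[[ ## name ## ]]` marker convention."""
    return "\n\n".join(f"[[ ## {name} ## ]]\n{value}" for name, value in inputs.items())


def parse_response(response: str) -> dict[str, str]:
    """Split a marker-delimited response into {field_name: raw_text}."""
    parts = FIELD_MARKER.split(response)
    # parts = [text_before_first_marker, name1, body1, name2, body2, ...]
    fields = {}
    for name, body in zip(parts[1::2], parts[2::2]):
        if name != "completed":  # `completed` only marks the end of the reply
            fields[name] = body.strip()
    return fields


print(render_user_message({"text": "I am feeling pretty happy!"}))

response = "[[ ## sentiment ## ]]\n8\n\n[[ ## completed ## ]]"
parsed = parse_response(response)
print(parsed)
print(int(parsed["sentiment"]))
```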

### Try a Different Built-in Module

We have tried out the very basic module `dspy.Predict` above; now let's try something more powerful. Another common built-in module we recommend is `dspy.ChainOfThought`. It is almost the same as `dspy.Predict`, but in addition to the usual output, it also asks the LM to provide its `reasoning` for the response. Let's check it out.

In [0]:
cot = dspy.ChainOfThought(SentimentClassifier)

output = cot(text="As a Mavericks fan, I am feeling a bit frustrated about Luka gets traded...")
print(output)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Prediction(
    reasoning='The text expresses frustration regarding the potential trade of a beloved player, Luka, which indicates a negative sentiment. The use of the word "frustrated" clearly conveys disappointment and concern, leading to a lower sentiment score.',
    sentiment=3
)


Trace(request_id=tr-a561bf8f251d4e59a23a04a3c52972a4)

You can see above that the output includes a `reasoning` field explaining how the LM arrived at this sentiment score. Now let's check the LM interaction history.

In [0]:
dspy.inspect_history(n=1)

[2025-04-22T22:13:49.858819]

System message:

Your input fields are:
1. `text` (str): input text to classify sentiment
Your output fields are:
1. `reasoning` (str)
2. `sentiment` (int): sentiment, the higher the more positive
Constraints: greater than or equal to: 0, less than or equal to: 10
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## text ## ]]
{text}

[[ ## reasoning ## ]]
{reasoning}

[[ ## sentiment ## ]]
{sentiment}        # note: the value you produce must be a single int value

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Classify the sentiment of a text.


User message:

[[ ## text ## ]]
As a Mavericks fan, I am feeling a bit frustrated about Luka gets traded...

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## sentiment ## ]]` (must be formatted as a valid Python int), and then ending with the

Comparing it to the history of `dspy.Predict`, we can see that with `dspy.ChainOfThought` we are explicitly requesting reasoning from the LM. 

For information on other built-in modules, please refer to the [API Reference](https://dspy.ai/api/modules/BestOfN/).


### [Optional] Use a Different Adapter

Sometimes the default adapter cannot get the LLM to produce the requested output fields. When that happens, you can try a different adapter. To change the adapter, call `dspy.configure` as shown below, where we switch to `dspy.JSONAdapter()`, which leverages the provider's structured output feature.

In [0]:
dspy.configure(adapter=dspy.JSONAdapter())

Let's interact with the LM again and see what changes.

In [0]:
print(cot(text="As a Lakers fan, I feel thrilled of getting Luka Doncic"))
dspy.inspect_history(n=1)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Prediction(
    reasoning="The text expresses excitement and happiness about the acquisition of Luka Doncic, indicating a strong positive sentiment. The use of the word 'thrilled' suggests a high level of enthusiasm and satisfaction, which contributes to a positive sentiment score.",
    sentiment=9
)


[2025-04-22T22:14:04.443384]

System message:

Your input fields are:
1. `text` (str): input text to classify sentiment
Your output fields are:
1. `reasoning` (str)
2. `sentiment` (int): sentiment, the higher the more positive
Constraints: greater than or equal to: 0, less than or equal to: 10
All interactions will be structured in the following way, with the appropriate values filled in.

Inputs will have the following structure:

[[ ## text ## ]]
{text}

Outputs will be a JSON object with the following fields.

[[ ## reasoning ## ]]
{reasoning}

[[ ## sentiment ## ]]
{sentiment}        # note: the value you produce must be a single int value
In adhering to this str

Trace(request_id=tr-64bc28e80a824c6293ef0750b6409399)

Comparing the two histories carefully, you can see that the response now arrives as a JSON object (a dictionary) instead of a string split into marker-delimited sections.
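
As a sketch of why a JSON response is convenient to consume, the snippet below parses a JSON string and casts each expected field to its declared type. The response text is made up for illustration, and `parse_json_response` and `FIELD_TYPES` are hypothetical names, not part of `dspy.JSONAdapter`.

```python
import json

# Assumed target types, mirroring the `reasoning` and `sentiment` output fields.
FIELD_TYPES = {"reasoning": str, "sentiment": int}


def parse_json_response(raw: str) -> dict:
    """Parse a JSON response and cast each expected field to its type."""
    data = json.loads(raw)
    missing = [name for name in FIELD_TYPES if name not in data]
    if missing:
        raise ValueError(f"missing output fields: {missing}")
    return {name: cast(data[name]) for name, cast in FIELD_TYPES.items()}


# Made-up response text in the shape a JSON-based adapter asks for.
raw = '{"reasoning": "The word thrilled signals enthusiasm.", "sentiment": 9}'
print(parse_json_response(raw))
```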

## Build a Program with Custom Module

So far we have explored how to use a built-in DSPy module to build a single-module DSPy program. In production, we usually need to build multi-module programs that also involve custom function calls. Let's get started!

### Define Your Program by Customizing DSPy Module

To define a custom DSPy program, define a class that subclasses `dspy.Module` and write your custom logic in the `forward` method. This is similar to PyTorch and Keras, if you are familiar with those.

For the demo, let's build a celebrity guessing game with DSPy. The game setup is:
- The user thinks of a celebrity name and keeps it in mind.
- The LLM asks a series of yes/no questions. The user must respond with yes or no.
- When the maximum number of questions is reached, the LLM must make its guess, and the user confirms whether it is correct.
- The LLM tells the user about its reasoning process.

In [0]:
class QuestionGenerator(dspy.Signature):
    """Generate a yes or no question in order to guess the celebrity name in users' mind. You can ask in general or directly guess the name if you think the signal is enough. You should never ask the same question in the past_questions."""
    past_questions: list[str] = dspy.InputField(desc="past questions asked")
    past_answers: list[bool] = dspy.InputField(desc="past answers")
    new_question: str = dspy.OutputField(desc="new question that can help narrow down the celebrity name")
    guess_made: bool = dspy.OutputField(desc="If the new_question is the celebrity name guess, set to True, if it is still a general question set to False")


class Reflection(dspy.Signature):
    """Provide reflection on the guessing process"""
    correct_celebrity_name: str = dspy.InputField(desc="the celebrity name in user's mind")
    final_guessor_question: str = dspy.InputField(desc="the final guess or question LM made")
    past_questions: list[str] = dspy.InputField(desc="past questions asked")
    past_answers: list[bool] = dspy.InputField(desc="past answers")

    reflection: str = dspy.OutputField(
        desc="reflection on the guessing process, including what was done well and what can be improved"
    )

def ask(prompt, valid_responses=("y", "n")):
    while True:
        response = input(f"{prompt} ({'/'.join(valid_responses)}): ").strip().lower()
        if response in valid_responses:
            return response
        print(f"Please enter one of: {', '.join(valid_responses)}")

class CelebrityGuess(dspy.Module):
    def __init__(self, max_tries=10):
        super().__init__()

        self.question_generator = dspy.ChainOfThought(QuestionGenerator)
        self.reflection = dspy.ChainOfThought(Reflection)

        self.max_tries = max_tries

    def forward(self):
        celebrity_name = input("Please think of a celebrity name, once you are ready, type the name and press enter...")
        past_questions = []
        past_answers = []

        correct_guess = False

        for i in range(self.max_tries):
            question = self.question_generator(
                past_questions=past_questions,
                past_answers=past_answers,
            )
            answer = ask(f"{question.new_question}").lower() == "y"
            past_questions.append(question.new_question)
            past_answers.append(answer)

            if question.guess_made and answer:
                correct_guess = True
                break

        if correct_guess:
            print("Yay! I got it right!")
        else:
            print("Oops, I couldn't guess it right.")

        reflection = self.reflection(
            correct_celebrity_name=celebrity_name,
            final_guessor_question=question.new_question,
            past_questions=past_questions,
            past_answers=past_answers,
        )
        print(reflection.reflection)


In [0]:
celebrity_guess = CelebrityGuess()

celebrity_guess()

Please think of a celebrity name, once you are ready, type the name and press enter... Lebron James

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Is the celebrity you are thinking of an actor or actress? (y/n):  n

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Is the celebrity you are thinking of a musician? (y/n):  n

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Is the celebrity you are thinking of a sports figure? (y/n):  y

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Is the celebrity you are thinking of known for playing basketball? (y/n):  y

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Is the celebrity you are thinking of currently active in the NBA? (y/n):  y

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Is the celebrity you are thinking of LeBron James? (y/n):  y

Yay! I got it right!


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


The guessing process was well-structured, starting with broad categories and progressively narrowing down to specific traits related to the celebrity. The questions were relevant and guided the guessing effectively. One area for improvement could be to include a question about the celebrity's achievements or specific teams they are associated with, which could further refine the guessing process in future interactions.


Trace(request_id=tr-861d7ce35195430d8f1fa026f3cbed39)

That's it! With DSPy, it's very easy to combine custom logic with multiple LLM calls to build a compound AI system.
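
Because `forward` is ordinary Python, the control flow can be exercised without an LM or keyboard input. The sketch below reproduces the question/answer loop with injected callables; `play`, `question_source`, and `answer_source` are hypothetical names introduced here, and the scripted questions are made up.

```python
def play(question_source, answer_source, max_tries=10):
    """Run the guessing loop with injected callables instead of an LM and input().

    question_source(past_questions, past_answers) -> (question, guess_made)
    answer_source(question) -> bool
    Both callables are hypothetical stand-ins for this sketch.
    """
    past_questions, past_answers = [], []
    for _ in range(max_tries):
        question, guess_made = question_source(past_questions, past_answers)
        answer = answer_source(question)
        past_questions.append(question)
        past_answers.append(answer)
        # Stop as soon as a direct guess is confirmed.
        if guess_made and answer:
            return True, past_questions
    return False, past_questions


# Scripted stand-ins: three canned questions, the last one being a guess.
script = [
    ("Is the celebrity an athlete?", False),
    ("Do they play basketball?", False),
    ("Is it LeBron James?", True),
]

won, asked = play(
    question_source=lambda questions, answers: script[len(questions)],
    answer_source=lambda question: True,
)
print(won, len(asked))
```

Injecting the question and answer sources this way also makes it easy to check edge cases, such as running out of tries before a guess is made.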

## Save and Load

DSPy provides functionality to save and load your program effortlessly. Simply use the `Module.save(path)` method to save your program, which has two modes:
- State-only: save only the internal state into a JSON file.
- Entire program: save the whole program in a pickle file, together with metadata such as the DSPy version in a JSON file.

### State-only Saving

In [0]:
celebrity_guess.save("dspy_program/celebrity.json", save_program=False)

You can load the state back into an instance of the same type. This is useful when the module's internal state has changed since it was saved. In this example nothing has changed, so the load is a no-op.

In [0]:
celebrity_guess.load("dspy_program/celebrity.json")
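
To illustrate the idea behind state-only saving, here is a minimal sketch that writes a small state dictionary to JSON and restores it into a fresh instance of the same class. `TinyModule` and its state (instructions plus demos) are assumptions made for this example; the state DSPy actually stores may contain more than this.

```python
import json
import tempfile
from pathlib import Path


class TinyModule:
    """Hypothetical stand-in for a module whose state is instructions + demos."""

    def __init__(self):
        self.instructions = "Classify the sentiment of a text."
        self.demos = []

    def dump_state(self) -> dict:
        return {"instructions": self.instructions, "demos": self.demos}

    def load_state(self, state: dict) -> None:
        self.instructions = state["instructions"]
        self.demos = state["demos"]


original = TinyModule()
original.demos.append({"text": "I am feeling pretty happy!", "sentiment": 8})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tiny.json"
    path.write_text(json.dumps(original.dump_state()))

    # State-only loading needs an existing instance of the same class.
    restored = TinyModule()
    restored.load_state(json.loads(path.read_text()))

print(restored.demos)
```

The code of the class is not stored in the file, which is why an instance of the same type has to exist before loading.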

### Whole-program Saving

In [0]:
celebrity_guess.save("dspy_program/celebrity/", save_program=True)

You can load it back with `dspy.load()`, which creates a new instance equivalent to `celebrity_guess`.

In [0]:
loaded = dspy.load("dspy_program/celebrity/")

## Debug DSPy with MLflow Tracing

In this section, we will demonstrate how to use MLflow tracing to visualize and debug a DSPy program. For the demo, we will build a tool-calling agent with the DSPy ReAct module. ReAct stands for reasoning and acting: it uses the LLM to determine the next tool to call, executes the tool, and sends the execution result back to the LLM, which decides whether more tool calls are required or whether to return the answer.
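
The reason/act cycle described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation of `dspy.ReAct`: the `decide` callable stands in for the LLM, and the names `run_react`, `scripted_decide`, and the `finish` sentinel are assumptions made for this sketch.

```python
def run_react(decide, tools, request, max_iters=5):
    """Alternate between choosing a tool and executing it until 'finish'.

    decide(request, trajectory) -> (tool_name, kwargs) stands in for the LM.
    """
    trajectory = []
    for _ in range(max_iters):
        tool_name, kwargs = decide(request, trajectory)
        if tool_name == "finish":
            break
        try:
            observation = tools[tool_name](**kwargs)
        except Exception as err:  # feed failures back instead of crashing
            observation = f"error: {err}"
        # The next decision sees every earlier call and its result.
        trajectory.append((tool_name, kwargs, observation))
    return trajectory


def add(a, b):
    return a + b


def scripted_decide(request, trajectory):
    """Toy policy: call `add` once, then finish."""
    if not trajectory:
        return "add", {"a": 2, "b": 3}
    return "finish", {}


steps = run_react(scripted_decide, {"add": add}, "what is 2 + 3?")
print(steps)
```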

### Setup MLflow Tracing

MLflow was installed and tracing was enabled with `mlflow.dspy.autolog()` at the start of this lab, so no additional setup is needed here.

### Build an Airline Customer Service Agent with DSPy

Let's build a simple airline customer service agent that manages flight bookings on behalf of users. It can do the following:

- Book new trips.
- Modify existing trips, including flight changes and cancellations.

To build this agent, we need the following tools:

- `fetch_flight_info`: get flight information for a given date, origin, and destination.
- `fetch_itinerary`: get the information about a booked itinerary.
- `pick_flight`: pick the best flight from a list of candidates.
- `book_itinerary`: book a flight on behalf of the user.
- `cancel_itinerary`: cancel a booked itinerary.
- `get_user_info`: get a user's profile.
- `file_ticket`: file a backlog ticket so that a human can assist.

Before defining the tools, let's define the data models they operate on.

In [0]:
from pydantic import BaseModel

class Date(BaseModel):
    # Somehow LLM is bad at specifying `datetime.datetime`
    year: int
    month: int
    day: int
    hour: int

class UserProfile(BaseModel):
    user_id: str
    name: str
    email: str

class Flight(BaseModel):
    flight_id: str
    date_time: Date
    origin: str
    destination: str
    duration: float
    price: float

class Itinerary(BaseModel):
    confirmation_number: str
    user_profile: UserProfile
    flight: Flight

class Ticket(BaseModel):
    user_request: str
    user_profile: UserProfile


Let's create some fake data for the demo.

In [0]:
user_database = {
    "Adam": UserProfile(user_id="1", name="Adam", email="adam@gmail.com"),
    "Bob": UserProfile(user_id="2", name="Bob", email="bob@gmail.com"),
    "Chelsie": UserProfile(user_id="3", name="Chelsie", email="chelsie@gmail.com"),
    "David": UserProfile(user_id="4", name="David", email="david@gmail.com"),
}

flight_database = {
    "DA123": Flight(
        flight_id="DA123",
        origin="SFO",
        destination="JFK",
        date_time=Date(year=2025, month=9, day=1, hour=1),
        duration=3,
        price=200,
    ),
    "DA125": Flight(
        flight_id="DA125",
        origin="SFO",
        destination="JFK",
        date_time=Date(year=2025, month=9, day=1, hour=7),
        duration=9,
        price=500,
    ),
    "DA456": Flight(
        flight_id="DA456",
        origin="SFO",
        destination="SNA",
        date_time=Date(year=2025, month=10, day=1, hour=1),
        duration=2,
        price=100,
    ),
    "DA460": Flight(
        flight_id="DA460",
        origin="SFO",
        destination="SNA",
        date_time=Date(year=2025, month=10, day=1, hour=9),
        duration=2,
        price=120,
    ),
}

itinery_database = {}
ticket_database = {}

Now let's define the tools.

In [0]:
import random
import string


def fetch_flight_info(date: Date, origin: str, destination: str):
    """Fetch flight information from origin to destination on the given date"""
    flights = []

    for flight_id, flight in flight_database.items():
        if (
            flight.date_time.year == date.year
            and flight.date_time.month == date.month
            and flight.date_time.day == date.day
            and flight.origin == origin
            and flight.destination == destination
        ):
            flights.append(flight)
    return flights


def fetch_itinerary(confirmation_number: str):
    """Fetch a booked itinerary information from database"""
    return itinery_database.get(confirmation_number)


def pick_flight(flights: list[Flight]):
    """Pick up the best flight that matches users' request."""
    sorted_flights = sorted(
        flights,
        key=lambda x: (
            x.get("duration") if isinstance(x, dict) else x.duration,
            x.get("price") if isinstance(x, dict) else x.price,
        ),
    )
    return sorted_flights[0]


def generate_id(length=8):
    chars = string.ascii_lowercase + string.digits
    return "".join(random.choices(chars, k=length))


def book_itinerary(flight: Flight, user_profile: UserProfile):
    """Book a flight on behalf of the user."""
    confirmation_number = generate_id()
    while confirmation_number in itinery_database:
        confirmation_number = generate_id()
    itinery_database[confirmation_number] = Itinerary(
        confirmation_number=confirmation_number,
        user_profile=user_profile,
        flight=flight,
    )
    return confirmation_number, itinery_database[confirmation_number]


def cancel_itinerary(confirmation_number: str, user_profile: UserProfile):
    """Cancel an itinerary on behalf of the user."""
    if confirmation_number in itinery_database:
        del itinery_database[confirmation_number]
        return
    raise ValueError("Cannot find the itinerary, please check your confirmation number.")


def get_user_info(name: str):
    """Fetch the user profile from database with given name."""
    return user_database.get(name)


def file_ticket(user_request: str, user_profile: UserProfile):
    """File a customer support ticket if this is something the agent cannot handle."""
    ticket_id = generate_id(length=6)
    ticket_database[ticket_id] = Ticket(
        user_request=user_request,
        user_profile=user_profile,
    )
    return ticket_id
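
The `pick_flight` tool above ranks candidates by duration first and price second. As a self-contained sketch of that ordering, the snippet below uses a trimmed-down dataclass in place of the `Flight` model; `FlightOption` and `pick_shortest_then_cheapest` are hypothetical names, and unlike `pick_flight` this version raises a clear error on an empty list.

```python
from dataclasses import dataclass


@dataclass
class FlightOption:
    """Trimmed-down stand-in for the `Flight` model above."""

    flight_id: str
    duration: float
    price: float


def pick_shortest_then_cheapest(flights):
    """Return the flight with the smallest (duration, price) tuple."""
    if not flights:
        raise ValueError("no flights to choose from")
    # Tuples compare element by element, so price only breaks duration ties.
    return min(flights, key=lambda flight: (flight.duration, flight.price))


options = [
    FlightOption("DA125", duration=9, price=500),
    FlightOption("DA123", duration=3, price=200),
]
print(pick_shortest_then_cheapest(options).flight_id)
```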


The next step is to define the DSPy program, which takes in a user request and returns the process result.

In [0]:
class DSPyAirlineCustomerSerice(dspy.Signature):
    """You are an airline customer service agent. You are given a list of tools to handle user request. You should decide the right tool to use in order to fulfill users' request."""
    user_request: str = dspy.InputField()
    process_result: str = dspy.OutputField(desc="Message that summarizes the process result, and the information users need, e.g., the confirmation_number if it's a flight booking request.")

In [0]:
react = dspy.ReAct(
    DSPyAirlineCustomerSerice,
    tools = [
        fetch_flight_info,
        fetch_itinerary,
        pick_flight,
        book_itinerary,
        cancel_itinerary,
        get_user_info,
        file_ticket,
    ]
)

In [0]:
result = react(user_request="please help me book a flight from SFO to JFK on 09/01/2025, my name is Adam")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Trace(request_id=tr-eeff5fe411a3460b9fc6dc4f2584d9dd)

## DSPy Magic - Automatic Program Optimization

In this section, we will explore how to optimize our DSPy program with a DSPy optimizer.

In [0]:
os.environ["HF_DATASETS_CACHE"] = "."

### Define the Evaluation Metric

Similar to training a machine learning model, we need to define a metric that measures how well our program performs. In this example, we are tackling a math Q&A task, and the metric we use is "equivalence": whether the LM-generated answer is equivalent to the gold answer.
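
Before the full math metric, here is a deliberately tiny sketch of the metric shape: a function that takes a gold example, a prediction, and an optional `trace`, and returns whether they agree. It mirrors the `(gold, pred, trace=None)` signature of the `math_evaluate` function defined in the next cell; the `SimpleNamespace` objects are stand-ins for real examples and predictions.

```python
from types import SimpleNamespace


def exact_match_metric(gold, pred, trace=None):
    """Return True when the answers match after trimming whitespace.

    `trace` is unused here; it is kept only to mirror the metric signature.
    """
    return gold.answer.strip() == pred.answer.strip()


gold = SimpleNamespace(answer="42")
print(exact_match_metric(gold, SimpleNamespace(answer=" 42 ")))
print(exact_match_metric(gold, SimpleNamespace(answer="41")))
```

Exact matching is too strict for math answers, where the same value can be written in several ways, which is why the metric below normalizes LaTeX before comparing.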

In [0]:
def _fix_fracs(string):
    substrs = string.split("\\frac")
    new_str = substrs[0]
    if len(substrs) > 1:
        substrs = substrs[1:]
        for substr in substrs:
            new_str += "\\frac"
            if substr[0] == "{":
                new_str += substr
            else:
                try:
                    assert len(substr) >= 2
                except:
                    return string
                a = substr[0]
                b = substr[1]
                if b != "{":
                    if len(substr) > 2:
                        post_substr = substr[2:]
                        new_str += "{" + a + "}{" + b + "}" + post_substr
                    else:
                        new_str += "{" + a + "}{" + b + "}"
                else:
                    if len(substr) > 2:
                        post_substr = substr[2:]
                        new_str += "{" + a + "}" + b + post_substr
                    else:
                        new_str += "{" + a + "}" + b
    string = new_str
    return string


def _fix_a_slash_b(string):
    if len(string.split("/")) != 2:
        return string
    a = string.split("/")[0]
    b = string.split("/")[1]
    try:
        a = int(a)
        b = int(b)
        assert string == "{}/{}".format(a, b)
        new_string = "\\frac{" + str(a) + "}{" + str(b) + "}"
        return new_string
    except:
        return string


def _remove_right_units(string):
    # "\\text{ " only ever occurs (at least in the val set) when describing units
    if "\\text{ " in string:
        splits = string.split("\\text{ ")
        assert len(splits) == 2
        return splits[0]
    else:
        return string


def _fix_sqrt(string):
    if "\\sqrt" not in string:
        return string
    splits = string.split("\\sqrt")
    new_string = splits[0]
    for split in splits[1:]:
        if split[0] != "{":
            a = split[0]
            new_substr = "\\sqrt{" + a + "}" + split[1:]
        else:
            new_substr = "\\sqrt" + split
        new_string += new_substr
    return new_string


def _strip_string(string):
    # linebreaks
    string = string.replace("\n", "")
    # print(string)

    # remove inverse spaces
    string = string.replace("\\!", "")
    # print(string)

    # replace \\ with \
    string = string.replace("\\\\", "\\")
    # print(string)

    # replace tfrac and dfrac with frac
    string = string.replace("tfrac", "frac")
    string = string.replace("dfrac", "frac")
    # print(string)

    # remove \left and \right
    string = string.replace("\\left", "")
    string = string.replace("\\right", "")
    # print(string)

    # Remove circ (degrees)
    string = string.replace("^{\\circ}", "")
    string = string.replace("^\\circ", "")

    # remove dollar signs
    string = string.replace("\\$", "")

    # remove units (on the right)
    string = _remove_right_units(string)

    # remove percentage
    string = string.replace("\\%", "")

    # " 0." equivalent to " ." and "{0." equivalent to "{." Alternatively, add "0" if "." is the start of the string
    string = string.replace(" .", " 0.")
    string = string.replace("{.", "{0.")
    # if empty, return empty string
    if len(string) == 0:
        return string
    if string[0] == ".":
        string = "0" + string

    # to consider: get rid of e.g. "k = " or "q = " at beginning
    if len(string.split("=")) == 2:
        if len(string.split("=")[0]) <= 2:
            string = string.split("=")[1]

    # fix sqrt3 --> sqrt{3}
    string = _fix_sqrt(string)

    # remove spaces
    string = string.replace(" ", "")

    # \frac1b or \frac12 --> \frac{1}{b} and \frac{1}{2}, etc. Even works with \frac1{72} (but not \frac{72}1). Also does a/b --> \\frac{a}{b}
    string = _fix_fracs(string)

    # manually change 0.5 --> \frac{1}{2}
    if string == "0.5":
        string = "\\frac{1}{2}"

    # NOTE: X/Y changed to \frac{X}{Y} in dataset, but in simple cases fix in case the model output is X/Y
    string = _fix_a_slash_b(string)

    return string


def is_equiv(str1, str2, verbose=False):
    if str1 is None and str2 is None:
        print("WARNING: Both None")
        return True
    if str1 is None or str2 is None:
        return False

    try:
        ss1 = _strip_string(str1)
        ss2 = _strip_string(str2)
        if verbose:
            print(ss1, ss2)
        return ss1 == ss2
    except:
        return str1 == str2


def last_boxed_only_string(string):
    idx = string.rfind("\\boxed")
    if idx < 0:
        idx = string.rfind("\\fbox")
        if idx < 0:
            return None

    i = idx
    right_brace_idx = None
    num_left_braces_open = 0
    while i < len(string):
        if string[i] == "{":
            num_left_braces_open += 1
        if string[i] == "}":
            num_left_braces_open -= 1
            if num_left_braces_open == 0:
                right_brace_idx = i
                break
        i += 1

    if right_brace_idx is None:
        retval = None
    else:
        retval = string[idx : right_brace_idx + 1]

    return retval


def remove_boxed(s):
    left = "\\boxed{"
    try:
        assert s[: len(left)] == left
        assert s[-1] == "}"
        return s[len(left) : -1]
    except:
        return None


def remove_format(s):
    # remove \(\)
    s = s.replace("\\(", "")
    s = s.replace("\\)", "")
    return s


def math_evaluate(gold, pred, trace=None):
    return is_equiv(
        remove_boxed(last_boxed_only_string(gold.solution)),
        remove_format(pred.answer),
        verbose=False,
    )

### Convert the Dataset into a List of dspy.Example

In order for the optimizer to work smoothly, it's recommended to convert each data record into a `dspy.Example`. 

In [0]:
from datasets import load_dataset

dataset_math = load_dataset("DigitalLearningGmbH/MATH-lighteval", "default")
math_train = [
    dspy.Example(**x).with_inputs("problem") for x in dataset_math["train"]
]
math_val = [
    dspy.Example(**x).with_inputs("problem") for x in dataset_math["test"]
]
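
Conceptually, marking `problem` as the input tells the evaluator and optimizer which fields are passed to the program and which are kept aside as labels for the metric. The sketch below shows that split on a plain dictionary; `split_example` is a hypothetical helper, not how `dspy.Example` is implemented, and the record is made up using the dataset's column names.

```python
def split_example(record: dict, input_keys: set[str]) -> tuple[dict, dict]:
    """Separate a record into (inputs for the program, labels for the metric)."""
    inputs = {key: value for key, value in record.items() if key in input_keys}
    labels = {key: value for key, value in record.items() if key not in input_keys}
    return inputs, labels


# Made-up record with the columns `problem`, `level`, `solution`, and `type`.
record = {
    "problem": "What is 1 + 1?",
    "level": "Level 1",
    "solution": "The sum is \\boxed{2}.",
    "type": "Algebra",
}
inputs, labels = split_example(record, {"problem"})
print(inputs)
print(sorted(labels))
```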



In [0]:
math_program = dspy.ChainOfThought("problem->answer")

### Track Optimization with MLflow

We can track the DSPy optimization process with MLflow autologging. To enable it, we need to set a few flags on `mlflow.dspy.autolog`.

In [0]:
import mlflow

mlflow.dspy.autolog(log_evals=True, log_compiles=True, log_traces_from_compile=True)

In [0]:
from dspy.evaluate import Evaluate

math_evaluator = Evaluate(devset=math_val[:100], num_threads=10, display_progress=True, display_table=20)

In [0]:
math_eval_score_baseline = math_evaluator(math_program, metric=math_evaluate)

2025/04/22 22:22:48 INFO dspy.evaluate.evaluate: Average Metric: 80 / 100 (80.0%)

|   | problem | level | solution | type | reasoning | answer | math_evaluate |
|---|---|---|---|---|---|---|---|
| 0 | How many vertical asymptotes does the graph of $y=\frac{2}{x^2+x-6... | Level 3 | The denominator of the rational function factors into $x^2+x-6=(x-... | Algebra | To find the vertical asymptotes of the function \( y = \frac{2}{x^... | 2 | ✔️ [True] |
| 1 | What is the positive difference between $120\%$ of 30 and $130\%$ ... | Level 1 | One hundred twenty percent of 30 is $120\cdot30\cdot\frac{1}{100}=... | Algebra | To find the positive difference between $120\%$ of 30 and $130\%$ ... | 10 | ✔️ [True] |
| 2 | Find $x$ such that $\lceil x \rceil + x = \dfrac{23}{7}$. Express ... | Level 4 | First, we note that $x$ must be positive, since otherwise $\lceil ... | Algebra | Let \( x = n + f \) where \( n = \lfloor x \rfloor \) (the greates... | \frac{2}{7} |  |
| 3 | Evaluate $i^5+i^{-25}+i^{45}$. | Level 5 | We have $i^5 = i^4\cdot i = 1\cdot (i) = i$. We also have $i^{-25}... | Algebra | To evaluate the expression $i^5 + i^{-25} + i^{45}$, we first need... | i | ✔️ [True] |
| 4 | If $2^8=4^x$, what is the value of $x$? | Level 1 | Rewrite $4$ as $2^2$ to find $4^x=2^{2x}$. Since $2^8=2^{2x}$, we... | Algebra | To solve the equation \(2^8 = 4^x\), we can express \(4\) in terms... | 4 | ✔️ [True] |
| 5 | What is the 100th term of the arithmetic sequence 6, 10, 14, 18, ...? | Level 2 | The common difference is $10 - 6 = 4$, so the 100th term is $6+99\... | Algebra | In an arithmetic sequence, the nth term can be calculated using th... | 402 | ✔️ [True] |
| 6 | For what values of $x$ is it true that $x^2 - 5x - 4 \le 10$? Expr... | Level 4 | Re-arranging, $x^2 - 5x - 14 \le 0$. The left-hand quadratic facto... | Algebra | To solve the inequality \( x^2 - 5x - 4 \le 10 \), we first rearra... | [-2, 7] |  |
| 7 | Mr. Madoff invests 1000 dollars in a fund that compounds annually ... | Level 4 | Let $r$ be the annual interest rate. Then after three years, Mr. M... | Algebra | To find the annual interest rate, we can use the formula for compo... | 8 |  |
| 8 | Four distinct integers $a$, $b$, $c$ and $d$ have the property tha... | Level 4 | WLOG, let $a<b<c<d$. The smallest sum is $a+b=10$. The second-smal... | Algebra | Let the four distinct integers be denoted as a, b, c, and d. The s... | 4, 6, 14, 15 | ✔️ [True] |
| 9 | What is the smallest value of $x$ such that $\|5x - 1\| = \|3x + 2\|$?... | Level 5 | There are two cases, when $5x-1=3x+2$ and when $5x-1=-(3x+2).$ The... | Algebra | To solve the equation \|5x - 1\| = \|3x + 2\|, we need to consider the... | -\frac{1}{8} | ✔️ [True] |



Let's run the program on one training example to inspect a sample input and output.

In [0]:
math_program(**math_train[0].inputs())

Prediction(
    reasoning='To ensure that the piecewise function \\( f(x) \\) is continuous, we need to check the continuity at the points where the definition of the function changes, which are at \\( x = -2 \\) and \\( x = 2 \\).\\n\\n1. **Continuity at \\( x = 2 \\)**:\\n   - From the left (using the second piece): \\( f(2) = 2 - 5 = -3 \\)\\n   - From the right (using the first piece): \\( f(2) = a(2) + 3 = 2a + 3 \\)\\n   - For continuity at \\( x = 2 \\): \\( 2a + 3 = -3 \\)\\n   - Solving for \\( a \\): \\( 2a = -6 \\) \\( \\Rightarrow a = -3 \\)\\n\\n2. **Continuity at \\( x = -2 \\)**:\\n   - From the left (using the third piece): \\( f(-2) = 2(-2) - b = -4 - b \\)\\n   - From the right (using the second piece): \\( f(-2) = -2 - 5 = -7 \\)\\n   - For continuity at \\( x = -2 \\): \\( -4 - b = -7 \\)\\n   - Solving for \\( b \\): \\( -b = -3 \\) \\( \\Rightarrow b = 3 \\)\\n\\nNow we have \\( a = -3 \\) and \\( b = 3 \\). Therefore, \\( a + b = -3 + 3 = 0 \\).',
    answer='0'
)



### Configure the Optimizer and Kick Off Optimization

The last step is to configure the optimizer and kick off the optimization process. Here we use the `MIPROv2` optimizer, which tunes both the `instruction` (system prompt) and the `demos` (few-shot examples).
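
The `compile` call that follows requests `num_trials=8` with `minibatch_full_eval_steps=4`, yet the optimizer log numbers its trials out of 11. The sketch below shows one accounting that matches the log of this run: a full evaluation of the default program first, then a full evaluation after every fourth minibatch trial. The function name `trial_schedule` is hypothetical, and the layout mirrors the observed log rather than MIPROv2's internal code.

```python
def trial_schedule(num_trials: int, full_eval_every: int) -> list[str]:
    """List the kind of evaluation run at each logged trial (assumed layout)."""
    schedule = ["full (default program)"]  # the unoptimized program is scored first
    for step in range(1, num_trials + 1):
        schedule.append("minibatch")
        if step % full_eval_every == 0:
            schedule.append("full")  # periodic evaluation on the whole valset
    return schedule


schedule = trial_schedule(num_trials=8, full_eval_every=4)
for number, kind in enumerate(schedule, start=1):
    print(f"Trial {number:>2} / {len(schedule)}: {kind}")
```

With these settings the sketch yields 11 entries, with full evaluations at positions 1, 6, and 11; the log labels trials 6 and 11 as full evaluations.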

In [0]:
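mipro = dspy.MIPROv2(
    metric=math_evaluate,
    max_bootstrapped_demos=4,  # max demos per few-shot set generated by running the program
    max_labeled_demos=4,  # max demos per few-shot set taken directly from the trainset
    num_candidates=4,  # number of instruction and few-shot-set candidates to propose
    auto=None,  # turn off the automatic presets so the manual settings here apply
)

optimized_math_program = mipro.compile(
    math_program,
    trainset=math_train[:100],
    valset=math_train[100:200],  # separate slice used to score candidate programs
    num_trials=8,  # number of optimization trials
    minibatch_size=4,  # examples scored per minibatch trial
    minibatch_full_eval_steps=4,  # run a full valset evaluation every 4 minibatch trials
)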

2025/04/22 22:23:41 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'f3e835cef9f8448c81be5c8966ee62d5', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current dspy workflow


Projected Language Model (LM) Calls

Based on the parameters you have set, the maximum number of LM calls is projected as follows:

- Prompt Generation: 10 data ... y

2025/04/22 22:23:44 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/04/22 22:23:44 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/04/22 22:23:44 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=4 sets of demonstrations...


Bootstrapping set 1/4
Bootstrapping set 2/4
Bootstrapping set 3/4




Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


Bootstrapping set 4/4




Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.


2025/04/22 22:24:21 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/04/22 22:24:21 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/04/22 22:25:00 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing instructions...

2025/04/22 22:25:26 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/04/22 22:25:26 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `problem`, produce the fields `answer`.

2025/04/22 22:25:26 INFO dspy.teleprompt.mipro_optimizer_v2: 1: You are a mathematician tasked with solving critical algebraic problems that will determine the fate of a prestigious mathematics competition. Given the problem statement, provide a thorough step-by-step reasoning

2025/04/22 22:27:24 INFO dspy.evaluate.evaluate: Average Metric: 86 / 100 (86.0%)

2025/04/22 22:27:24 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 86.0

2025/04/22 22:27:24 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 11 - Minibatch ==


2025/04/22 22:27:41 INFO dspy.evaluate.evaluate: Average Metric: 4 / 4 (100.0%)

2025/04/22 22:27:42 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 on minibatch of size 4 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2'].
2025/04/22 22:27:42 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [100.0]
2025/04/22 22:27:42 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [86.0]
2025/04/22 22:27:42 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 86.0


2025/04/22 22:27:42 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 11 - Minibatch ==


2025/04/22 22:27:53 INFO dspy.evaluate.evaluate: Average Metric: 4 / 4 (100.0%)

2025/04/22 22:27:53 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 on minibatch of size 4 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 0'].
2025/04/22 22:27:53 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [100.0, 100.0]
2025/04/22 22:27:53 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [86.0]
2025/04/22 22:27:53 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 86.0


2025/04/22 22:27:53 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 11 - Minibatch ==


2025/04/22 22:28:02 INFO dspy.evaluate.evaluate: Average Metric: 3 / 4 (75.0%)

2025/04/22 22:28:02 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 75.0 on minibatch of size 4 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 2'].
2025/04/22 22:28:02 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [100.0, 100.0, 75.0]
2025/04/22 22:28:02 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [86.0]
2025/04/22 22:28:02 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 86.0


2025/04/22 22:28:02 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 11 - Minibatch ==


2025/04/22 22:28:21 INFO dspy.evaluate.evaluate: Average Metric: 4 / 4 (100.0%)

2025/04/22 22:28:22 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 on minibatch of size 4 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1'].
2025/04/22 22:28:22 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [100.0, 100.0, 75.0, 100.0]
2025/04/22 22:28:22 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [86.0]
2025/04/22 22:28:22 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 86.0


2025/04/22 22:28:22 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 6 / 11 - Full Evaluation =====
2025/04/22 22:28:22 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 100.0) from minibatch trials...


2025/04/22 22:29:56 ERROR dspy.utils.parallelizer: Error for Example({'problem': 'Find $x$ such that $\\lceil x \\rceil \\cdot x = 135$. Express $x$ as a decimal.', 'level': 'Level 4', 'solution': 'First, we note that $x$ must be positive, since otherwise $\\lceil x \\rceil \\cdot x$ is nonpositive. Now, knowing that $\\lceil x \\rceil - 1 < x \\leq \\lceil x \\rceil,$ we see that $\\lceil x \\rceil$ must be $12,$ since $11 \\cdot 11 < 135 \\leq 12 \\cdot 12.$\n\nNow we see that $\\lceil x \\rceil \\cdot x = 12x = 135,$ so $x = \\frac{135}{12} = \\boxed{11.25}.$', 'type': 'Algebra'}) (input_keys={'problem'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: 'list' object has no attribute 'items'. Set `provide_traceback=True` for traceback.


2025/04/22 22:30:55 ERROR dspy.utils.parallelizer: Error for Example({'problem': 'The first term of a given sequence is 1, and each successive term is the sum of all the previous terms of the sequence. What is the value of the first term which exceeds 5000?', 'level': 'Level 4', 'solution': "We calculate the first several terms directly and find the sequence starts\n\\[ 1, 1, 2, 4, 8, 16, \\ldots \\] It appears the $n$th term is $2^{n-2}$ for $n\\geq 2$.  Since $2^{12}=4096$, the first power of 2 that exceeds 5000 is $2^{13}=\\boxed{8192}$.\n\nLet's prove by induction that the $n$th term of the sequence is $2^{n-2}$ for all integers $n\\geq 2$.  The base case $n=2$ holds since the second term of the sequence is the sum of all the terms before it, which is just 1.  For the induction step, let $n>2$ and suppose that the $(n-1)$st term is $2^{n-1-2}=2^{n-3}$.  Then the sum of the first $n-2$ terms of the sequence is $2^{n-3}$, since the $(n-1)$st term is equal to the sum of the first $n-2

2025/04/22 22:30:55 INFO dspy.evaluate.evaluate: Average Metric: 88.0 / 100 (88.0%)

2025/04/22 22:30:55 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 88.0
2025/04/22 22:30:55 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [86.0, 88.0]
2025/04/22 22:30:55 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.0
2025/04/22 22:30:55 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/04/22 22:30:55 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 7 / 11 - Minibatch ==


2025/04/22 22:30:56 INFO dspy.evaluate.evaluate: Average Metric: 2 / 4 (50.0%)

2025/04/22 22:30:56 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 50.0 on minibatch of size 4 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/04/22 22:30:56 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [100.0, 100.0, 75.0, 100.0, 50.0]
2025/04/22 22:30:56 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [86.0, 88.0]
2025/04/22 22:30:56 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.0


2025/04/22 22:30:56 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 11 - Minibatch ==


2025/04/22 22:30:58 INFO dspy.evaluate.evaluate: Average Metric: 2 / 4 (50.0%)

2025/04/22 22:30:58 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 50.0 on minibatch of size 4 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/04/22 22:30:58 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [100.0, 100.0, 75.0, 100.0, 50.0, 50.0]
2025/04/22 22:30:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [86.0, 88.0]
2025/04/22 22:30:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.0


2025/04/22 22:30:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 11 - Minibatch ==


2025/04/22 22:30:59 INFO dspy.evaluate.evaluate: Average Metric: 4 / 4 (100.0%)

2025/04/22 22:30:59 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 on minibatch of size 4 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/04/22 22:30:59 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [100.0, 100.0, 75.0, 100.0, 50.0, 50.0, 100.0]
2025/04/22 22:30:59 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [86.0, 88.0]
2025/04/22 22:30:59 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.0


2025/04/22 22:30:59 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 11 - Minibatch ==


  0%|          | 0/4 [00:00<?, ?it/s]Average Metric: 1.00 / 1 (100.0%):   0%|          | 0/4 [00:00<?, ?it/s]Average Metric: 1.00 / 1 (100.0%):  25%|██▌       | 1/4 [00:00<00:01,  2.47it/s]Average Metric: 2.00 / 2 (100.0%):  25%|██▌       | 1/4 [00:08<00:01,  2.47it/s]Average Metric: 2.00 / 2 (100.0%):  50%|█████     | 2/4 [00:08<00:09,  4.76s/it]Average Metric: 3.00 / 3 (100.0%):  50%|█████     | 2/4 [00:09<00:09,  4.76s/it]Average Metric: 3.00 / 3 (100.0%):  75%|███████▌  | 3/4 [00:09<00:03,  3.39s/it]Average Metric: 4.00 / 4 (100.0%):  75%|███████▌  | 3/4 [00:18<00:03,  3.39s/it]Average Metric: 4.00 / 4 (100.0%): 100%|██████████| 4/4 [00:18<00:00,  5.59s/it]Average Metric: 4.00 / 4 (100.0%): 100%|██████████| 4/4 [00:18<00:00,  4.74s/it]

2025/04/22 22:31:19 INFO dspy.evaluate.evaluate: Average Metric: 4 / 4 (100.0%)
2025/04/22 22:31:19 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 on minibatch of size 4 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 2'].
2025/04/22 22:31:19 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [100.0, 100.0, 75.0, 100.0, 50.0, 50.0, 100.0, 100.0]
2025/04/22 22:31:19 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [86.0, 88.0]
2025/04/22 22:31:19 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.0


2025/04/22 22:31:19 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 11 / 11 - Full Evaluation =====
2025/04/22 22:31:19 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 100.0) from minibatch trials...



2025/04/22 22:33:26 INFO dspy.evaluate.evaluate: Average Metric: 85 / 100 (85.0%)
2025/04/22 22:33:27 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [86.0, 88.0, 85.0]
2025/04/22 22:33:27 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.0
2025/04/22 22:33:27 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/04/22 22:33:27 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 88.0!



Evaluate the optimized program with the same evaluator and metric to check its performance.

In [0]:
math_eval_score_optimized = math_evaluator(optimized_math_program, metric=math_evaluate)


2025/04/22 23:34:26 ERROR dspy.utils.parallelizer: Error for Example({'problem': 'Suppose the roots of the polynomial $x^2 - mx + n$ are positive prime integers (not necessarily distinct). Given that $m < 20,$ how many possible values of $n$ are there?', 'level': 'Level 5', 'solution': 'Let $p$ and $q$ be the prime roots. Then, we know that $m = p+q$ and $n = pq$. Since $m < 20$, the primes $p$ and $q$ must both be less than $20$.\n\nThe primes less than $20$ are $2,$ $3,$ $5,$ $7,$ $11,$ $13,$ $17,$ $19.$ Now we list all possible pairs $(p, q)$ such that $p + q < 20$, remembering to also include the cases in which $p=q$: \\[\\begin{aligned} & (2,2),(2,3),(2,5),(2,7),(2,11),(2,13),(2,17) \\\\\n&(3,3),(3,5),(3,7),(3,11),(3,13) \\\\\n&(5,5),(5,7),(5,11),(5,13) \\\\\n&(7,7),(7,11) \\end{aligned}\\]There are $7 + 5 + 4 + 2 = 18$ pairs in total. Each pair produces a value for $n$, and furthermore, these values are all distinct, because every positive integer has a unique prime factorization

2025/04/22 23:35:04 INFO dspy.evaluate.evaluate: Average Metric: 80.0 / 100 (80.0%)

Unnamed: 0,problem,level,solution,type,reasoning,answer,math_evaluate
0,How many vertical asymptotes does the graph of $y=\frac{2}{x^2+x-6...,Level 3,The denominator of the rational function factors into $x^2+x-6=(x-...,Algebra,To find the vertical asymptotes of the function \( y = \frac{2}{x^...,2,✔️ [True]
1,What is the positive difference between $120\%$ of 30 and $130\%$ ...,Level 1,One hundred twenty percent of 30 is $120\cdot30\cdot\frac{1}{100}=...,Algebra,"To find the positive difference between 120% of 30 and 130% of 20,...",10,✔️ [True]
2,Find $x$ such that $\lceil x \rceil + x = \dfrac{23}{7}$. Express ...,Level 4,"First, we note that $x$ must be positive, since otherwise $\lceil ...",Algebra,"Let \( x = n + f \), where \( n = \lceil x \rceil \) is an integer...",\frac{16}{7},
3,Evaluate $i^5+i^{-25}+i^{45}$.,Level 5,We have $i^5 = i^4\cdot i = 1\cdot (i) = i$. We also have $i^{-25}...,Algebra,"To evaluate the expression \( i^5 + i^{-25} + i^{45} \), we first ...",i,✔️ [True]
4,"If $2^8=4^x$, what is the value of $x$?",Level 1,"Rewrite $4$ as $2^2$ to find $4^x=2^{2x}$. Since $2^8=2^{2x}$, we...",Algebra,"To solve the equation \( 2^8 = 4^x \), we first express \( 4 \) in...",4,✔️ [True]
5,"What is the 100th term of the arithmetic sequence 6, 10, 14, 18, ...?",Level 2,"The common difference is $10 - 6 = 4$, so the 100th term is $6+99\...",Algebra,"In an arithmetic sequence, the nth term can be calculated using th...",402,✔️ [True]
6,For what values of $x$ is it true that $x^2 - 5x - 4 \le 10$? Expr...,Level 4,"Re-arranging, $x^2 - 5x - 14 \le 0$. The left-hand quadratic facto...",Algebra,"To solve the inequality \( x^2 - 5x - 4 \le 10 \), we first rearra...","[-2, 7]",
7,Mr. Madoff invests 1000 dollars in a fund that compounds annually ...,Level 4,"Let $r$ be the annual interest rate. Then after three years, Mr. M...",Algebra,"To find the annual interest rate, we can use the formula for compo...",8,
8,"Four distinct integers $a$, $b$, $c$ and $d$ have the property tha...",Level 4,"WLOG, let $a<b<c<d$. The smallest sum is $a+b=10$. The second-smal...",Algebra,"Let the four distinct integers be a, b, c, and d. The sums of the ...","4, 6, 14, 15",✔️ [True]
9,What is the smallest value of $x$ such that $|5x - 1| = |3x + 2|$?...,Level 5,"There are two cases, when $5x-1=3x+2$ and when $5x-1=-(3x+2).$ The...",Algebra,"To solve the equation \(|5x - 1| = |3x + 2|\), we need to consider...",-1/8,


In [0]:
optimized_math_program.predict.signature.instructions

'You are a mathematician tasked with solving critical algebraic problems that will determine the fate of a prestigious mathematics competition. Given the problem statement, provide a thorough step-by-step reasoning process leading to the final answer. Make sure to connect theoretical concepts with practical applications, ensuring clarity and rigor in your solution. Use the fields `problem` to guide your response and produce the fields `reasoning` and `answer` accordingly.'

In [0]:
optimized_math_program.predict.demos

[Example({'augmented': True, 'problem': 'Let \\[f(x) = \\left\\{\n\\begin{array}{cl} ax+3, &\\text{ if }x>2, \\\\\nx-5 &\\text{ if } -2 \\le x \\le 2, \\\\\n2x-b &\\text{ if } x <-2.\n\\end{array}\n\\right.\\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper).', 'reasoning': 'To ensure that the piecewise function \\( f(x) \\) is continuous, we need to check the continuity at the points where the definition of the function changes, which are at \\( x = -2 \\) and \\( x = 2 \\).\\n\\n1. **Continuity at \\( x = 2 \\)**:\\n   - From the left (using the second piece): \\( f(2) = 2 - 5 = -3 \\)\\n   - From the right (using the first piece): \\( f(2) = a(2) + 3 = 2a + 3 \\)\\n   - For continuity at \\( x = 2 \\): \\( 2a + 3 = -3 \\)\\n   - Solving for \\( a \\): \\( 2a = -6 \\) \\( \\Rightarrow a = -3 \\)\\n\\n2. **Continuity at \\( x = -2 \\)**:\\n   - From the left (using the third piece): \\( f(-2) = 2(-2)

In [0]:
# Define a QA module with chain of thought
qa = dspy.ChainOfThought("question -> answer")


# Define a reward function that favors short answers (fewer than 5 words)
def one_word_answer(args, pred):
    score = 1.0 if len(pred.answer.split()) < 5 else 0.0
    print(f"Score: {score}")
    return score


# Create a refined module that tries up to 3 times
best_of_3 = dspy.Refine(module=qa, N=3, reward_fn=one_word_answer, threshold=1.0)

# Use the refined module
result = best_of_3(question="What's a good reason for working?").answer

print(result)
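
To see how the reward function in the cell above scores an answer, it can be called directly without any LM. The sketch below is only an illustration: `short_pred` and `long_pred` are hypothetical stand-ins built with `SimpleNamespace` rather than real `dspy.Prediction` objects, and the `print` call from the original function is dropped for brevity.

```python
from types import SimpleNamespace


def one_word_answer(args, pred):
    # Same rule as the cell above: 1.0 if the answer has fewer than 5 words, else 0.0.
    return 1.0 if len(pred.answer.split()) < 5 else 0.0


# Hypothetical stand-ins for predictions; the reward function only reads `.answer`.
short_pred = SimpleNamespace(answer="To earn money")
long_pred = SimpleNamespace(
    answer="Working gives you income, purpose, and a sense of community"
)

short_score = one_word_answer({}, short_pred)
long_score = one_word_answer({}, long_pred)

print(f"short answer score: {short_score}")
print(f"long answer score: {long_score}")
```

With `threshold=1.0`, only an answer like the short one would meet the threshold that `dspy.Refine` is configured with.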
