# Automatic Agent Optimization with MIPRO

MIPRO (Multiprompt Instruction PRoposal Optimizer) is an optimizer that searches for the prompts (instructions and few-shot examples) that perform best on your task. You can read the original paper [here](https://arxiv.org/abs/2406.11695).

This optimizer was popularized by the [DSPy library](https://dspy.ai/), which provided the first and most commonly used implementation.

In the notebook below, we will re-implement the algorithm from scratch to get a better understanding of how it works.

## Optimization strategy

MIPRO was developed to optimize multi-step LLM applications when you don't have labels for each step, only a global label for the entire task. This makes it a great fit for automatically optimizing agents!

## Define the agent

To test the optimization algorithm, we are going to start by creating a very simple agent.

In [1]:
%pip install --quiet openai

import os
import getpass
from openai import OpenAI

# Ask for the API key only if it is not already set in the environment.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY")


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [2]:
sentiment_prompt = "Analyze the sentiment of this text and classify it as `positive sentiment` or `negative sentiment`. Make sure to use this specific term in the response."


class SentimentAnalysisTool:
    def __init__(self, name: str = "sentiment", prompt: str = sentiment_prompt):
        self.name = name
        self.openai_client = OpenAI()
        self.sentiment_prompt = prompt

    def __call__(self, text: str) -> str:
        completion = self.openai_client.chat.completions.create(
            messages=[
                {"role": "system", "content": self.sentiment_prompt},
                {"role": "user", "content": text},
            ],
            model="gpt-4o",
        )

        return completion.choices[0].message.content

In [3]:
summarize_prompt = "Summarize the text provided below."


class SummarizeTool:
    def __init__(self, name: str = "summarization", prompt: str = summarize_prompt):
        self.name = name
        self.openai_client = OpenAI()
        self.summarization_prompt = prompt

    def __call__(self, text: str) -> str:
        completion = self.openai_client.chat.completions.create(
            messages=[
                {"role": "system", "content": self.summarization_prompt},
                {"role": "user", "content": text},
            ],
            model="gpt-4o",
        )

        return completion.choices[0].message.content

In [4]:
class LMProgram:
    def __init__(
        self,
        sumarization_tool: SummarizeTool,
        sentiment_analysis_tool: SentimentAnalysisTool,
    ):
        self.name = "Sentiment Analysis"
        self.sumarization_tool = sumarization_tool
        self.sentiment_analysis_tool = sentiment_analysis_tool

        self.history = []

    def run(self, text: str):
        self.history = []

        summary = self.sumarization_tool(text)
        self.history.append(
            {"name": self.sumarization_tool.name, "input": text, "output": summary}
        )

        sentiment = self.sentiment_analysis_tool(summary)
        self.history.append(
            {
                "name": self.sentiment_analysis_tool.name,
                "input": summary,
                "output": sentiment,
            }
        )

        return summary

In [5]:
program = LMProgram(SummarizeTool(), SentimentAnalysisTool())

The MIPRO algorithm relies on three inputs:
1. A dataset: a set of labelled examples, here with both positive and negative samples
2. A metric: a function that scores the program's output against the expected output
3. A task: the agent to optimize

For this notebook, we define them as follows:

In [6]:
dataset = [
    {
        "input": """Summarize the following customer reviews and analyze their overall sentiment:
        I just love waiting in long lines at the DMV... said no one ever.""",
        "expected_output": "negative sentiment",
    },
    {
        "input": """Summarize the following customer reviews and analyze their overall sentiment:
        The food was absolutely terrible, but at least the waiter was friendly.""",
        "expected_output": "negative sentiment",
    },
    {
        "input": """Summarize the following customer reviews and analyze their overall sentiment:
        This movie was so bad that it's actually hilarious. I had a great time watching it!""",
        "expected_output": "positive sentiment",
    },
    {
        "input": """Summarize the following customer reviews and analyze their overall sentiment:
        Wow, I can't believe how amazing this service is! It only took them two hours to get my order wrong.""",
        "expected_output": "negative sentiment",
    },
    {
        "input": """Summarize the following customer reviews and analyze their overall sentiment:
        The software crashes every time I open it. But hey, at least the icon looks nice!""",
        "expected_output": "negative sentiment",
    },
]


def equals_metric(output, expected_output):
    if expected_output.lower() in output.lower():
        return 1
    else:
        return 0

In [7]:
for i in dataset:
    res = program.run(i["input"])
    print(res)
    print(equals_metric(res, i["expected_output"]))
    print("---")

The review sarcastically expresses dissatisfaction with the long wait times at the DMV, implying a negative sentiment towards the experience of waiting in lines there.
1
---
The customer reviews indicate a negative experience with the food, described as "absolutely terrible." However, they acknowledge a positive aspect of their visit, noting that the waiter was friendly. Overall, the sentiment is more negative due to the dissatisfaction with the food quality.
0
---
The customer review expresses that the movie was poorly made, yet the reviewer found it enjoyable because of its comedic badness. The overall sentiment is positive because the viewer enjoyed the experience despite the movie's low quality.
0
---
The review showcases a sarcastic tone with a negative sentiment. The customer expresses dissatisfaction with the service, highlighting that their order was incorrect despite taking two hours. Overall, the sentiment is negative due to the delay and the error in fulfilling the order.
1


## The MIPRO algorithm

The MIPRO algorithm has 3 phases:
1. Initialize

### Initialization

For the initialization step, we will start by bootstrapping a set of N few-shot examples for each module. In the agent defined above, we have three different modules:

1. Orchestrator
2. Sentiment analysis tool
3. Summarization tool

To do this, we are going to use two techniques: bootstrapped demonstrations and grounding.

#### Bootstrap demonstrations

The concept of bootstrap demonstrations is remarkably simple. The idea is that if the agent as a whole returns the correct output, then we can expect that the input / output pair for each module is correct. This allows us to create a dataset of "labelled" data for each module without needing to specify a dataset for each module.

In [8]:
def bootstrap_demonstrations(program, dataset, metric):
    positive_samples = []
    for item in dataset:
        output = program.run(item["input"])
        print(output)

        score = metric(output=output, expected_output=item["expected_output"])

        if score >= 1:
            positive_samples += program.history

    return positive_samples


positive_samples = bootstrap_demonstrations(program, dataset, equals_metric)

The customer review expresses a negative sentiment toward the experience of waiting in long lines at the DMV, using a sarcastic tone to convey dissatisfaction.
The customer reviews indicate a negative overall sentiment towards the food, describing it as "absolutely terrible." However, there is a positive mention of the service, as the waiter is described as "friendly."
The customer reviews suggest that although the movie was perceived as bad, it was so poorly made that it became unintentionally funny, leading to an enjoyable viewing experience. The overall sentiment appears to be positive, as the reviewer had a great time watching the movie despite its flaws.
The customer review contains a sarcastic tone and highlights dissatisfaction with the service. Despite expressing amazement, the comment about getting an order wrong in two hours indicates a negative sentiment towards the experience. Overall, the sentiment of this review is critical and negative.
The customer reviews are negative 
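
The samples collected above form a flat list of history entries from every successful run. One way to turn them into per-module few-shot sets is to group them by their `name` field. The helper below is only a minimal sketch: `group_demos_by_module` and `max_per_module` are hypothetical names that are not part of the original notebook, and the toy entries are made up for illustration.

```python
from collections import defaultdict


def group_demos_by_module(samples, max_per_module=3):
    """Group bootstrapped history entries by module name.

    Hypothetical helper: assumes each entry is a dict with the
    "name", "input" and "output" keys used by LMProgram.history.
    """
    demos = defaultdict(list)
    for sample in samples:
        module_demos = demos[sample["name"]]
        # Keep at most `max_per_module` demonstrations per module.
        if len(module_demos) < max_per_module:
            module_demos.append(
                {"input": sample["input"], "output": sample["output"]}
            )
    return dict(demos)


# Toy entries shaped like LMProgram.history (made-up values, for illustration only).
example_samples = [
    {"name": "summarization", "input": "review A", "output": "summary A"},
    {"name": "sentiment", "input": "summary A", "output": "negative sentiment"},
    {"name": "summarization", "input": "review B", "output": "summary B"},
    {"name": "sentiment", "input": "summary B", "output": "positive sentiment"},
]

demos_by_module = group_demos_by_module(example_samples, max_per_module=1)
print(demos_by_module)
```

In the notebook, the same call could be applied to the real samples with `group_demos_by_module(positive_samples)`.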

#### Grounding

To improve each module, we are going to generate instructions for writing a better prompt. These instructions rely on:

1. Dataset summary
2. Program task
3. Prompt tips

By relying on this information, we are "grounding" the instructions for each module.
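
As a rough illustration of how the three ingredients listed above could be combined, the sketch below assembles a single request asking an LLM to propose a new instruction for one module. It makes no API call. The function `build_proposal_prompt`, the `PROMPT_TIPS` list and the example summaries are all assumptions made for this example, not the wording used by the paper or by DSPy.

```python
import random

# Illustrative tips only; these are assumptions, not the list used by the paper or DSPy.
PROMPT_TIPS = [
    "Be concise and state the expected output format explicitly.",
    "Describe the role the model should play.",
    "Mention the edge cases seen in the dataset.",
]


def build_proposal_prompt(module_name, current_prompt, dataset_summary, program_summary, tip):
    """Assemble a grounded request asking an LLM to propose a new instruction."""
    return "\n\n".join(
        [
            f"You are improving the instruction for the module '{module_name}'.",
            f"Dataset summary:\n{dataset_summary}",
            f"Program summary:\n{program_summary}",
            f"Current instruction:\n{current_prompt}",
            f"Tip: {tip}",
            "Propose one improved instruction for this module.",
        ]
    )


rng = random.Random(0)  # seeded so the chosen tip is reproducible
proposal_prompt = build_proposal_prompt(
    module_name="sentiment",
    current_prompt="Analyze the sentiment of this text.",
    dataset_summary="Short customer reviews, often sarcastic.",
    program_summary="Summarize the review, then classify its sentiment.",
    tip=rng.choice(PROMPT_TIPS),
)
print(proposal_prompt)
```

Sampling a different tip for each proposal is one simple way to get varied candidate instructions from the same grounding information.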

In [12]:
import json


class InstructionGeneration:
    def __init__(self):
        self.openai_client = OpenAI()

    def _get_code(self, name):
        # `In` is IPython's list of executed cell sources; search the most recent cells first.
        for cell in reversed(In):  # noqa: F821
            if f"class {name}" in cell:
                return cell

        raise ValueError(f"Couldn't find code for class {name}")

    def summarize_program(self, program):
        code = self._get_code("LMProgram")
        prompt = f"""
I need to understand the architecture of a language model program called {program.name}.
This program code is the following:

{code}

Please analyze this program structure and provide a concise summary of:
1. The overall workflow and how modules connect to each other
2. The specific role and responsibility of each module
3. How information flows through the program
4. Critical dependencies between modules

Provide this as a brief, factual summary that would help someone design effective instructions for each module.
        """

        completion = self.openai_client.chat.completions.create(
            messages=[
                {"role": "user", "content": prompt},
            ],
            model="gpt-4o",
        )

        return completion.choices[0].message.content

    def summarize_dataset(self, dataset):
        # Note: `program` is the notebook-level variable defined in an earlier cell.
        prompt = f"""
I need to understand the dataset that provides the input and output of a language model program called {program.name}.
The dataset is the following:

{json.dumps(dataset, indent=2)}

Please analyze this dataset and provide a concise summary of:
1. The kind of inputs it contains and how they are phrased
2. The expected outputs and the labels they use
3. Recurring patterns that make the examples hard, such as sarcasm or mixed opinions
4. Any imbalance between the labels

Provide this as a brief, factual summary that would help someone design effective instructions for each module.
        """

        completion = self.openai_client.chat.completions.create(
            messages=[
                {"role": "user", "content": prompt},
            ],
            model="gpt-4o",
        )

        return completion.choices[0].message.content


InstructionGeneration().summarize_program(program)

'1. **Overall Workflow and Module Connections:**\n   - The `LMProgram` class performs sentiment analysis on a given text by first applying a summarization process.\n   - The program uses two main components: a summarization tool and a sentiment analysis tool.\n   - The summarization tool is applied first, and its output is then processed by the sentiment analysis tool.\n\n2. **Role and Responsibility of Each Module:**\n   - **LMProgram:** Acts as the main orchestrator class. It initializes with two tools and processes input text through these tools while maintaining a history of operations.\n   - **SummarizeTool (summarization_tool):** Presumably tasked with condensing the input text to a shorter form suitable for further analysis. Its exact implementation is abstracted away.\n   - **SentimentAnalysisTool (sentiment_analysis_tool):** Evaluates the summarized text to determine its sentiment. Details on how it determines sentiment are also abstracted.\n\n3. **Information Flow:**\n   - Th