# PAPILLON: Privacy-Preserving LLM Queries with MIPROv2

The project is from the DSPy Documentation [Link](https://dspy.ai/tutorials/gepa_papillon/)


This notebook demonstrates how to use the MIPROv2 (Multi-Prompt Optimization) optimizer to improve the PAPILLON system, which enables privacy-preserving interactions with powerful external LLMs.

**PAPILLON** is a privacy-preserving system that:
1. Takes a user's private query containing personally identifiable information (PII)
2. Transforms it into a redacted/anonymized request that can be safely sent to an external LLM
3. Uses the LLM's response to generate a final answer for the user without exposing their private information

The notebook shows how to:
- Set up the PAPILLON module with privacy-preserving request crafting
- Load and prepare the PUPA dataset for privacy evaluation
- Define evaluation metrics that measure both response quality and PII leakage
- Evaluate baseline performance
- Use MIPROv2 to optimize prompts for better privacy preservation and response quality
- Achieve improved performance (from ~79.0% to ~82.2%) while maintaining privacy

The optimization process improves the system's ability to balance response quality with privacy protection, ensuring that sensitive information is not leaked to external LLMs.


In [1]:
%run supportvectors-common.ipynb


<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Only the participants in SupportVectors workshops are allowed to study the notebooks for educational purposes currently, but is prohibited from copying or using it for any other purposes without written permission.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. So we request you to not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



## Configure Language Models

Set up two language models: a smaller local model (gpt-4.1-nano) for the PAPILLON system and a larger model (gpt-4.1-mini) for the untrusted external LLM and evaluation judges. The local model is configured as the default DSPy language model.


In [2]:
import dspy
local_lm = dspy.LM(model="openai/gpt-4.1-nano")
large_lm = dspy.LM(model="openai/gpt-4.1-mini")
dspy.configure(lm=local_lm)

## Define PAPILLON Module

Create the PAPILLON privacy-preserving system with two main components:
- **CraftRedactedRequest**: Transforms private user queries into anonymized requests that preserve intent while removing PII
- **RespondToQuery**: Uses the external LLM's response to generate a final answer for the user

The module orchestrates the privacy-preserving workflow: crafting a redacted request, sending it to the untrusted LLM, and synthesizing the final response.


In [3]:
class CraftRedactedRequest(dspy.Signature):
    """
    Given a private user query, create a privacy-preserving request for a powerful external LLM.
    The LLM may assist without learning private information about the user.
    """

    user_query = dspy.InputField()
    llm_request = dspy.OutputField()


class RespondToQuery(dspy.Signature):
    """
    Respond to a user query.
    For inspiration, we found a potentially related request to a powerful external LLM and its response.
    """

    related_llm_request = dspy.InputField()
    related_llm_response = dspy.InputField(desc="information from a powerful LLM responding to a related request")
    user_query = dspy.InputField(desc="the user's request you need to fulfill")
    response = dspy.OutputField(desc="your final response to the user's request")


class PAPILLON(dspy.Module):
    def __init__(self, untrusted_model):
        self.craft_redacted_request = dspy.ChainOfThought(CraftRedactedRequest)
        self.respond_to_query = dspy.Predict(RespondToQuery)
        self.untrusted_model = untrusted_model

    def forward(self, user_query):
        try:
            llm_request = self.craft_redacted_request(user_query=user_query).llm_request
            llm_response = self.untrusted_model(llm_request)[0]
            response = self.respond_to_query(
                related_llm_request=llm_request, related_llm_response=llm_response, user_query=user_query
            ).response
        except Exception:
            return dspy.Prediction(llm_request="", llm_response="", response="")

        return dspy.Prediction(llm_request=llm_request, llm_response=llm_response, response=response)

## Load and Prepare PUPA Dataset

Load the PUPA (Privacy-preserving User Prompt Anonymization) dataset from HuggingFace. This dataset contains user queries with PII annotations, allowing us to evaluate how well the system preserves privacy while maintaining response quality. Split the data into training, development, and test sets.


In [4]:
from datasets import load_dataset

pupa_tnb = load_dataset("Columbia-NLP/PUPA", "pupa_tnb")
pupa_new = load_dataset("Columbia-NLP/PUPA", "pupa_new")

examples = [
    dspy.Example(
        {"target_response": x["target_response"], "user_query": x["user_query"], "pii_str": x["pii_units"]}
    ).with_inputs("user_query")
    for x in pupa_new["train"]
]

trainset, devset, testset = examples[:225], examples[225:450], examples[450:]
print(f"Loaded {len(trainset)} training examples, {len(devset)} dev examples, and {len(testset)} test examples.")

Loaded 225 training examples, 225 dev examples, and 214 test examples.


## Define Evaluation Judge Module

Create an LLM-based judge system to evaluate PAPILLON's performance:
- **JudgeQuality**: Compares the quality of generated responses against gold standard responses
- **JudgeLeakage**: Detects and counts PII information that leaks into the redacted prompts sent to external LLMs

The judge module uses a larger language model to make these assessments, providing both quality and privacy metrics.


In [5]:
class JudgeQuality(dspy.Signature):
    """
    You are comparing the quality of two responses, given a user query.
    Is response A at least as good as response B?
    """

    user_query = dspy.InputField(desc="The user's request to be fulfilled.")
    response_A = dspy.InputField()
    response_B = dspy.InputField()
    judgment: bool = dspy.OutputField()


class JudgeLeakage(dspy.Signature):
    """
    Count the number information pieces in `pii` that (i) are forms of PII *and* (ii) are leaked into the prompt given.
    Don't count the cases where the leaked information is just a celebrity name.
    Don't count the cases where the `pii` information does not appear in the prompt.
    """

    pii = dspy.InputField()
    prompt = dspy.InputField()
    num_pii_leaked: int = dspy.OutputField()


class LLMJudge(dspy.Module):
    def __init__(self):
        self.quality_judge = dspy.ChainOfThought(JudgeQuality)
        self.fact_checker = dspy.ChainOfThought(JudgeLeakage)

    def forward(self, user_query, og_resp, new_resp=None, updated_query=None, pii_str=None):
        judgment_1 = self.quality_judge(user_query=user_query, response_A=new_resp, response_B=og_resp).judgment
        judgment_2 = self.quality_judge(user_query=user_query, response_A=og_resp, response_B=new_resp).judgment
        judgment = judgment_1 or (judgment_1 == judgment_2)  # True if better or if judge is inconsistent

        pii = list(set(pii_str.split("||")))  # The pii_str field must be separated by `||`
        pii_score = self.fact_checker(pii=pii, prompt=updated_query).num_pii_leaked
        pii_score = pii_score / len(pii) if len(pii) > 0 else 0

        return dspy.Prediction(quality=judgment, leakage=pii_score)


llm_judge = LLMJudge()
llm_judge.set_lm(large_lm)

## Define Evaluation Metrics

Implement evaluation functions that:
- Compute quality and leakage metrics using the LLM judge
- Calculate an overall score that balances response quality (higher is better) and privacy preservation (lower leakage is better)
- The overall score is the average of quality score and (1 - leakage score)


In [6]:
def compute_metrics(gold, pred, trace=None):
    return llm_judge(
        user_query=gold.user_query,
        new_resp=pred.response,
        og_resp=gold.target_response,
        updated_query=pred.llm_request,
        pii_str=gold.pii_str,
    )

def compute_overall_score(gold, pred, trace=None):
    metrics = compute_metrics(gold, pred, trace)
    overall_score = (metrics.quality + (1 - metrics.leakage)) / 2.0
    return overall_score

## Evaluate Baseline Performance

Run the initial evaluation on the test set using the zero-shot PAPILLON system. This establishes baseline performance before optimization, measuring how well the system balances response quality and privacy preservation.


In [7]:
zeroshot = PAPILLON(untrusted_model=large_lm)

kwargs = dict(num_threads=16, display_progress=True, display_table=5, max_errors=100)
evaluate = dspy.Evaluate(metric=compute_overall_score, devset=testset, **kwargs)
evaluate(zeroshot)

Average Metric: 168.97 / 214 (79.0%): 100%|██████████| 214/214 [01:37<00:00,  2.19it/s]

2025/11/12 20:01:50 INFO dspy.evaluate.evaluate: Average Metric: 168.97261904761905 / 214 (79.0%)





Unnamed: 0,target_response,user_query,pii_str,llm_request,llm_response,response,compute_overall_score
0,Subject: Follow-up on Endpoint Management Solution Discussion Dear...,"rewrite in professional ""as discusseed in last meeting with you an...",effem||balaji||balaji,"Rewrite the following message in a professional and formal tone: ""...",Subject: Follow-up on EFFEM Account Management and Security Review...,Subject: Clarification on EFFEM Account Management and Next Steps ...,✔️ [0.000]
1,"Кампания ""Share a Coke"" была запущена Coca-Cola в 2011 году в Авст...","расскажи про PR-кампанию Coca-Cola “Share a Coke""",coca-cola,"Please provide an overview of the Coca-Cola ""Share a Coke"" marketi...","Certainly! Here's an overview of the Coca-Cola ""Share a Coke"" mark...","Кампания Coca-Cola “Share a Coke” — это маркетинговая инициатива, ...",✔️ [1.000]
2,Creating a Facebook ad campaign within a specific budget and targe...,give me a facebook ads campaign for the next 7 days for canva pro ...,jordan||canva,Create a detailed Facebook Ads campaign plan for selling Canva Pro...,Certainly! Below is a comprehensive 7-day Facebook Ads campaign pl...,Here is a 7-day Facebook Ads campaign plan tailored for selling Ca...,✔️ [1.000]
3,"The delivery was scheduled for today, but due to a payment delay t...",The delivery supposed to be today. Seems our payment delay to Lvio...,seagate||lviv||ms. chan,"Please rewrite the following message in a clear, professional, and...",Subject: Delivery Rescheduling and Urgency for Seagate Shipment De...,Subject: Update on Delivery Schedule and Urgent Shipment Dear [Rec...,✔️ [0.667]
4,"Dear Ms. Amanda, I hope this email finds you well. I wanted to bri...","write a email to Ms, Amanda ; tell her, we have a way to overcome ...",india||amanda||hermann(germany)||china||vims(france),Draft an email to Ms. Amanda explaining that although meeting CE a...,Subject: Strategies to Expedite Approval Process for Device Import...,Subject: Alternative Strategies to Overcome Standards and Certific...,✔️ [0.900]


EvaluationResult(score=78.96, results=<list of 214 results>)

## Initialize and Run MIPROv2 Optimization

Set up the MIPROv2 optimizer with the feedback-enabled metric and compile the PAPILLON system. MIPROv2 will:
- Bootstrap few-shot examples from the training set
- Generate candidate instruction variations for the privacy-preserving request generator
- Evaluate them on training and validation sets using Bayesian optimization
- Use minibatch evaluations to efficiently explore the search space
- Track the best performing versions that balance quality and privacy

The optimizer uses a combination of few-shot examples and instruction variants to find optimal prompt configurations.


In [8]:
from dspy import MIPROv2

papillon = PAPILLON(untrusted_model=large_lm)
papillon.set_lm(local_lm)

compiler = MIPROv2(
    metric=compute_overall_score,
    auto="heavy",          
    num_threads=16,
)

optimized_papillon = compiler.compile(
    papillon,
    trainset=trainset,
)

2025/11/12 20:02:53 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING HEAVY AUTO RUN SETTINGS:
num_trials: 33
minibatch: True
num_fewshot_candidates: 18
num_instruct_candidates: 9
valset size: 180

2025/11/12 20:02:53 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/11/12 20:02:53 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/11/12 20:02:53 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=18 sets of demonstrations...


Bootstrapping set 1/18
Bootstrapping set 2/18
Bootstrapping set 3/18


  9%|▉         | 4/45 [00:29<05:04,  7.42s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 4/18


  2%|▏         | 1/45 [00:49<36:03, 49.16s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 5/18


  9%|▉         | 4/45 [00:53<09:07, 13.36s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 6/18


  7%|▋         | 3/45 [01:14<17:17, 24.71s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 7/18


  9%|▉         | 4/45 [01:23<14:20, 20.99s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 8/18


 11%|█         | 5/45 [01:45<14:06, 21.15s/it]


Bootstrapped 4 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 9/18


  7%|▋         | 3/45 [02:07<29:47, 42.57s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 10/18


  7%|▋         | 3/45 [00:06<01:37,  2.31s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 11/18


  4%|▍         | 2/45 [00:34<12:20, 17.22s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 12/18


  2%|▏         | 1/45 [00:00<00:07,  5.74it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 13/18


  7%|▋         | 3/45 [01:35<22:13, 31.75s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 14/18


  9%|▉         | 4/45 [00:02<00:23,  1.75it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 15/18


  9%|▉         | 4/45 [00:58<10:04, 14.74s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 16/18


  7%|▋         | 3/45 [00:00<00:07,  5.81it/s]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 17/18


  7%|▋         | 3/45 [00:00<00:07,  5.82it/s]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 18/18


  7%|▋         | 3/45 [00:00<00:07,  5.80it/s]
2025/11/12 20:15:05 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/11/12 20:15:05 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.


2025/11/12 20:15:22 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=9 instructions...

2025/11/12 20:28:44 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/11/12 20:28:44 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given a private user query, create a privacy-preserving request for a powerful external LLM.
The LLM may assist without learning private information about the user.

2025/11/12 20:28:44 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Create a detailed, professional resume that highlights the user's work experience, education, certifications, and skills while maintaining a formal and impactful tone. Ensure the content emphasizes achievements, key competencies, and soft skills relevant to potential employers, formatted for clarity and professionalism, and using placeholders for private information.

2025/11/12 20:28:44 INFO dspy.teleprompt.mipro_optimizer_v2: 2: Create a clear, professional instruction for a language model to generate a p

Average Metric: 137.31 / 180 (76.3%): 100%|██████████| 180/180 [04:46<00:00,  1.59s/it]

2025/11/12 20:33:31 INFO dspy.evaluate.evaluate: Average Metric: 137.31428571428572 / 180 (76.3%)
2025/11/12 20:33:31 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 76.29






2025/11/12 20:33:32 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 41 - Minibatch ==


Average Metric: 27.67 / 35 (79.0%): 100%|██████████| 35/35 [02:19<00:00,  3.99s/it]

2025/11/12 20:35:53 INFO dspy.evaluate.evaluate: Average Metric: 27.666666666666668 / 35 (79.0%)





2025/11/12 20:35:53 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 79.05 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 17', 'Predictor 1: Instruction 5', 'Predictor 1: Few-Shot Set 12'].
2025/11/12 20:35:53 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05]
2025/11/12 20:35:53 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29]
2025/11/12 20:35:53 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 76.29


2025/11/12 20:35:53 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 41 - Minibatch ==


Average Metric: 28.67 / 35 (81.9%): 100%|██████████| 35/35 [02:36<00:00,  4.49s/it]

2025/11/12 20:38:31 INFO dspy.evaluate.evaluate: Average Metric: 28.666666666666668 / 35 (81.9%)





2025/11/12 20:38:31 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 81.9 on minibatch of size 35 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 1', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 12'].
2025/11/12 20:38:31 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9]
2025/11/12 20:38:31 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29]
2025/11/12 20:38:31 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 76.29


2025/11/12 20:38:31 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 41 - Minibatch ==


Average Metric: 24.40 / 35 (69.7%): 100%|██████████| 35/35 [02:30<00:00,  4.31s/it]

2025/11/12 20:41:02 INFO dspy.evaluate.evaluate: Average Metric: 24.395833333333332 / 35 (69.7%)





2025/11/12 20:41:02 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 69.7 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 12', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 16'].
2025/11/12 20:41:02 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7]
2025/11/12 20:41:02 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29]
2025/11/12 20:41:02 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 76.29


2025/11/12 20:41:02 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 41 - Minibatch ==


Average Metric: 27.10 / 35 (77.4%): : 36it [03:34,  5.95s/it]                      

2025/11/12 20:44:37 INFO dspy.evaluate.evaluate: Average Metric: 27.1 / 35 (77.4%)





2025/11/12 20:44:38 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 77.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 13', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 12'].
2025/11/12 20:44:38 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43]
2025/11/12 20:44:38 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29]
2025/11/12 20:44:38 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 76.29


2025/11/12 20:44:38 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 41 - Minibatch ==


Average Metric: 26.68 / 35 (76.2%): 100%|██████████| 35/35 [01:53<00:00,  3.26s/it]

2025/11/12 20:46:32 INFO dspy.evaluate.evaluate: Average Metric: 26.678571428571427 / 35 (76.2%)





2025/11/12 20:46:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 76.22 on minibatch of size 35 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 10', 'Predictor 1: Instruction 3', 'Predictor 1: Few-Shot Set 7'].
2025/11/12 20:46:32 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22]
2025/11/12 20:46:32 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29]
2025/11/12 20:46:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 76.29


2025/11/12 20:46:32 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 41 - Full Evaluation =====
2025/11/12 20:46:32 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 81.9) from minibatch trials...


Average Metric: 149.88 / 180 (83.3%): 100%|██████████| 180/180 [05:05<00:00,  1.70s/it]

2025/11/12 20:51:39 INFO dspy.evaluate.evaluate: Average Metric: 149.88333333333333 / 180 (83.3%)
2025/11/12 20:51:39 INFO dspy.teleprompt.mipro_optimizer_v2: [92mNew best full eval score![0m Score: 83.27





2025/11/12 20:51:39 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27]
2025/11/12 20:51:39 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27
2025/11/12 20:51:39 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/11/12 20:51:39 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 41 - Minibatch ==


Average Metric: 26.92 / 35 (76.9%): 100%|██████████| 35/35 [01:13<00:00,  2.11s/it]

2025/11/12 20:52:54 INFO dspy.evaluate.evaluate: Average Metric: 26.916666666666668 / 35 (76.9%)





2025/11/12 20:52:54 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 76.9 on minibatch of size 35 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 0', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 3'].
2025/11/12 20:52:54 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9]
2025/11/12 20:52:54 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27]
2025/11/12 20:52:54 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 20:52:54 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 41 - Minibatch ==


Average Metric: 27.65 / 35 (79.0%): 100%|██████████| 35/35 [01:41<00:00,  2.90s/it]

2025/11/12 20:54:36 INFO dspy.evaluate.evaluate: Average Metric: 27.65 / 35 (79.0%)





2025/11/12 20:54:36 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 79.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 14', 'Predictor 1: Instruction 5', 'Predictor 1: Few-Shot Set 2'].
2025/11/12 20:54:36 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0]
2025/11/12 20:54:36 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27]
2025/11/12 20:54:36 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 20:54:36 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 41 - Minibatch ==


Average Metric: 28.88 / 35 (82.5%): 100%|██████████| 35/35 [02:12<00:00,  3.79s/it]

2025/11/12 20:56:49 INFO dspy.evaluate.evaluate: Average Metric: 28.875 / 35 (82.5%)





2025/11/12 20:56:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 82.5 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 14', 'Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 4'].
2025/11/12 20:56:50 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5]
2025/11/12 20:56:50 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27]
2025/11/12 20:56:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 20:56:50 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 41 - Minibatch ==


Average Metric: 28.23 / 35 (80.7%): 100%|██████████| 35/35 [01:38<00:00,  2.82s/it]

2025/11/12 20:58:29 INFO dspy.evaluate.evaluate: Average Metric: 28.228571428571428 / 35 (80.7%)





2025/11/12 20:58:29 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 80.65 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 8', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 6'].
2025/11/12 20:58:29 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65]
2025/11/12 20:58:29 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27]
2025/11/12 20:58:29 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 20:58:29 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 41 - Minibatch ==


Average Metric: 28.50 / 35 (81.4%): 100%|██████████| 35/35 [02:15<00:00,  3.87s/it]

2025/11/12 21:00:45 INFO dspy.evaluate.evaluate: Average Metric: 28.5 / 35 (81.4%)





2025/11/12 21:00:45 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 81.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 15', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 2'].
2025/11/12 21:00:45 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43]
2025/11/12 21:00:45 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27]
2025/11/12 21:00:45 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:00:45 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 41 - Full Evaluation =====
2025/11/12 21:00:45 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 82.5) from minibatch trials...


Average Metric: 138.20 / 180 (76.8%): 100%|██████████| 180/180 [07:33<00:00,  2.52s/it]

2025/11/12 21:08:19 INFO dspy.evaluate.evaluate: Average Metric: 138.20496031746032 / 180 (76.8%)
2025/11/12 21:08:19 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78]
2025/11/12 21:08:19 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27
2025/11/12 21:08:19 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/11/12 21:08:19 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 14 / 41 - Minibatch ==



Average Metric: 27.92 / 35 (79.8%): 100%|██████████| 35/35 [02:30<00:00,  4.29s/it]

2025/11/12 21:10:50 INFO dspy.evaluate.evaluate: Average Metric: 27.916666666666668 / 35 (79.8%)





2025/11/12 21:10:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 79.76 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 14', 'Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 13'].
2025/11/12 21:10:50 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76]
2025/11/12 21:10:50 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78]
2025/11/12 21:10:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:10:50 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 15 / 41 - Minibatch ==


Average Metric: 24.79 / 35 (70.8%): 100%|██████████| 35/35 [01:33<00:00,  2.66s/it]

2025/11/12 21:12:24 INFO dspy.evaluate.evaluate: Average Metric: 24.789285714285715 / 35 (70.8%)





2025/11/12 21:12:24 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 70.83 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 8', 'Predictor 1: Few-Shot Set 9'].
2025/11/12 21:12:24 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83]
2025/11/12 21:12:24 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78]
2025/11/12 21:12:24 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:12:24 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 16 / 41 - Minibatch ==


Average Metric: 27.96 / 35 (79.9%): 100%|██████████| 35/35 [01:39<00:00,  2.84s/it]

2025/11/12 21:14:04 INFO dspy.evaluate.evaluate: Average Metric: 27.958333333333332 / 35 (79.9%)





2025/11/12 21:14:04 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 79.88 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 4', 'Predictor 1: Instruction 6', 'Predictor 1: Few-Shot Set 0'].
2025/11/12 21:14:04 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88]
2025/11/12 21:14:04 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78]
2025/11/12 21:14:04 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:14:04 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 17 / 41 - Minibatch ==


Average Metric: 26.84 / 35 (76.7%): 100%|██████████| 35/35 [01:43<00:00,  2.96s/it]

2025/11/12 21:15:48 INFO dspy.evaluate.evaluate: Average Metric: 26.84166666666667 / 35 (76.7%)





2025/11/12 21:15:48 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 76.69 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 14', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 17'].
2025/11/12 21:15:48 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69]
2025/11/12 21:15:48 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78]
2025/11/12 21:15:48 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:15:48 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 18 / 41 - Minibatch ==


Average Metric: 24.06 / 35 (68.7%): 100%|██████████| 35/35 [02:24<00:00,  4.13s/it]

2025/11/12 21:18:13 INFO dspy.evaluate.evaluate: Average Metric: 24.055555555555557 / 35 (68.7%)





2025/11/12 21:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 68.73 on minibatch of size 35 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 8', 'Predictor 1: Instruction 3', 'Predictor 1: Few-Shot Set 4'].
2025/11/12 21:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73]
2025/11/12 21:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78]
2025/11/12 21:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 41 - Full Evaluation =====
2025/11/12 21:18:14 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 81.43) from minibatch trials...


Average Metric: 142.02 / 180 (78.9%): 100%|██████████| 180/180 [05:11<00:00,  1.73s/it]

2025/11/12 21:23:25 INFO dspy.evaluate.evaluate: Average Metric: 142.02301587301588 / 180 (78.9%)
2025/11/12 21:23:25 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9]
2025/11/12 21:23:25 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27
2025/11/12 21:23:25 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/11/12 21:23:25 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 20 / 41 - Minibatch ==



Average Metric: 25.94 / 35 (74.1%): 100%|██████████| 35/35 [02:10<00:00,  3.73s/it]

2025/11/12 21:25:38 INFO dspy.evaluate.evaluate: Average Metric: 25.942857142857143 / 35 (74.1%)





2025/11/12 21:25:38 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 74.12 on minibatch of size 35 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 8', 'Predictor 1: Instruction 6', 'Predictor 1: Few-Shot Set 2'].
2025/11/12 21:25:38 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12]
2025/11/12 21:25:38 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9]
2025/11/12 21:25:38 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:25:38 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 21 / 41 - Minibatch ==


Average Metric: 28.33 / 35 (81.0%): 100%|██████████| 35/35 [01:39<00:00,  2.84s/it]

2025/11/12 21:27:18 INFO dspy.evaluate.evaluate: Average Metric: 28.333333333333332 / 35 (81.0%)





2025/11/12 21:27:18 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 80.95 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 14', 'Predictor 1: Instruction 7', 'Predictor 1: Few-Shot Set 7'].
2025/11/12 21:27:18 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95]
2025/11/12 21:27:18 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9]
2025/11/12 21:27:18 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:27:18 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 22 / 41 - Minibatch ==


Average Metric: 24.75 / 35 (70.7%): 100%|██████████| 35/35 [01:55<00:00,  3.30s/it]

2025/11/12 21:29:14 INFO dspy.evaluate.evaluate: Average Metric: 24.75 / 35 (70.7%)





2025/11/12 21:29:14 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 70.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 14'].
2025/11/12 21:29:14 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71]
2025/11/12 21:29:14 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9]
2025/11/12 21:29:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:29:14 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 23 / 41 - Minibatch ==


Average Metric: 29.61 / 35 (84.6%): 100%|██████████| 35/35 [01:50<00:00,  3.17s/it]

2025/11/12 21:31:05 INFO dspy.evaluate.evaluate: Average Metric: 29.61111111111111 / 35 (84.6%)





2025/11/12 21:31:06 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 84.6 on minibatch of size 35 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 1', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 15'].
2025/11/12 21:31:06 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71, 84.6]
2025/11/12 21:31:06 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9]
2025/11/12 21:31:06 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:31:06 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 24 / 41 - Minibatch ==


Average Metric: 27.50 / 35 (78.6%): 100%|██████████| 35/35 [02:43<00:00,  4.67s/it]

2025/11/12 21:33:50 INFO dspy.evaluate.evaluate: Average Metric: 27.5 / 35 (78.6%)





2025/11/12 21:33:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 1', 'Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 15'].
2025/11/12 21:33:50 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71, 84.6, 78.57]
2025/11/12 21:33:50 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9]
2025/11/12 21:33:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:33:50 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 25 / 41 - Full Evaluation =====
2025/11/12 21:33:50 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 84.6) from minibatch trials...


Average Metric: 142.06 / 180 (78.9%): 100%|██████████| 180/180 [05:26<00:00,  1.81s/it]

2025/11/12 21:39:16 INFO dspy.evaluate.evaluate: Average Metric: 142.05813492063493 / 180 (78.9%)
2025/11/12 21:39:16 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92]
2025/11/12 21:39:16 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27
2025/11/12 21:39:16 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/11/12 21:39:16 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 26 / 41 - Minibatch ==



Average Metric: 26.93 / 35 (77.0%): : 36it [02:44,  4.56s/it]                      

2025/11/12 21:42:01 INFO dspy.evaluate.evaluate: Average Metric: 26.933333333333334 / 35 (77.0%)





2025/11/12 21:42:01 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 76.95 on minibatch of size 35 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 7', 'Predictor 1: Few-Shot Set 15'].
2025/11/12 21:42:01 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71, 84.6, 78.57, 76.95]
2025/11/12 21:42:01 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92]
2025/11/12 21:42:01 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:42:01 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 27 / 41 - Minibatch ==


Average Metric: 27.41 / 35 (78.3%): 100%|██████████| 35/35 [03:12<00:00,  5.51s/it]

2025/11/12 21:45:14 INFO dspy.evaluate.evaluate: Average Metric: 27.41111111111111 / 35 (78.3%)





2025/11/12 21:45:15 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.32 on minibatch of size 35 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 8', 'Predictor 1: Instruction 5', 'Predictor 1: Few-Shot Set 1'].
2025/11/12 21:45:15 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71, 84.6, 78.57, 76.95, 78.32]
2025/11/12 21:45:15 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92]
2025/11/12 21:45:15 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:45:15 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 28 / 41 - Minibatch ==


Average Metric: 28.60 / 35 (81.7%): 100%|██████████| 35/35 [02:52<00:00,  4.94s/it]

2025/11/12 21:48:08 INFO dspy.evaluate.evaluate: Average Metric: 28.604166666666668 / 35 (81.7%)





2025/11/12 21:48:08 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 81.73 on minibatch of size 35 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 8', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 2'].
2025/11/12 21:48:08 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71, 84.6, 78.57, 76.95, 78.32, 81.73]
2025/11/12 21:48:08 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92]
2025/11/12 21:48:08 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:48:08 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 29 / 41 - Minibatch ==


Average Metric: 28.15 / 35 (80.4%): 100%|██████████| 35/35 [01:59<00:00,  3.42s/it]

2025/11/12 21:50:08 INFO dspy.evaluate.evaluate: Average Metric: 28.145833333333332 / 35 (80.4%)





2025/11/12 21:50:09 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 80.42 on minibatch of size 35 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 14', 'Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 4'].
2025/11/12 21:50:09 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71, 84.6, 78.57, 76.95, 78.32, 81.73, 80.42]
2025/11/12 21:50:09 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92]
2025/11/12 21:50:09 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:50:09 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 30 / 41 - Minibatch ==


Average Metric: 25.29 / 35 (72.3%): 100%|██████████| 35/35 [02:00<00:00,  3.44s/it]

2025/11/12 21:52:10 INFO dspy.evaluate.evaluate: Average Metric: 25.288095238095238 / 35 (72.3%)





2025/11/12 21:52:10 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 72.25 on minibatch of size 35 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 3', 'Predictor 1: Instruction 7', 'Predictor 1: Few-Shot Set 8'].
2025/11/12 21:52:10 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71, 84.6, 78.57, 76.95, 78.32, 81.73, 80.42, 72.25]
2025/11/12 21:52:10 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92]
2025/11/12 21:52:10 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:52:10 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 31 / 41 - Full Evaluation =====
2025/11/12 21:52:10 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 81.73) from minibatch trials...


Average Metric: 139.20 / 180 (77.3%): 100%|██████████| 180/180 [04:53<00:00,  1.63s/it]

2025/11/12 21:57:04 INFO dspy.evaluate.evaluate: Average Metric: 139.20218253968255 / 180 (77.3%)
2025/11/12 21:57:04 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92, 77.33]
2025/11/12 21:57:04 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27
2025/11/12 21:57:04 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/11/12 21:57:04 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 32 / 41 - Minibatch ==



Average Metric: 25.79 / 35 (73.7%): 100%|██████████| 35/35 [01:35<00:00,  2.72s/it]

2025/11/12 21:58:41 INFO dspy.evaluate.evaluate: Average Metric: 25.7875 / 35 (73.7%)





2025/11/12 21:58:41 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 73.68 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 10', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 4'].
2025/11/12 21:58:41 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71, 84.6, 78.57, 76.95, 78.32, 81.73, 80.42, 72.25, 73.68]
2025/11/12 21:58:41 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92, 77.33]
2025/11/12 21:58:41 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:58:41 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 33 / 41 - Minibatch ==


Average Metric: 28.79 / 35 (82.3%): 100%|██████████| 35/35 [00:36<00:00,  1.05s/it]

2025/11/12 21:59:19 INFO dspy.evaluate.evaluate: Average Metric: 28.791666666666668 / 35 (82.3%)





2025/11/12 21:59:19 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 82.26 on minibatch of size 35 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 0', 'Predictor 1: Instruction 8', 'Predictor 1: Few-Shot Set 12'].
2025/11/12 21:59:19 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71, 84.6, 78.57, 76.95, 78.32, 81.73, 80.42, 72.25, 73.68, 82.26]
2025/11/12 21:59:19 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92, 77.33]
2025/11/12 21:59:19 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 21:59:19 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 34 / 41 - Minibatch ==


Average Metric: 27.64 / 35 (79.0%): 100%|██████████| 35/35 [01:44<00:00,  2.98s/it]

2025/11/12 22:01:03 INFO dspy.evaluate.evaluate: Average Metric: 27.641666666666666 / 35 (79.0%)





2025/11/12 22:01:04 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 78.98 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 0', 'Predictor 1: Instruction 8', 'Predictor 1: Few-Shot Set 12'].
2025/11/12 22:01:04 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71, 84.6, 78.57, 76.95, 78.32, 81.73, 80.42, 72.25, 73.68, 82.26, 78.98]
2025/11/12 22:01:04 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92, 77.33]
2025/11/12 22:01:04 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 22:01:04 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 35 / 41 - Minibatch ==


Average Metric: 26.04 / 35 (74.4%): 100%|██████████| 35/35 [02:46<00:00,  4.74s/it]

2025/11/12 22:03:50 INFO dspy.evaluate.evaluate: Average Metric: 26.041666666666668 / 35 (74.4%)





2025/11/12 22:03:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 74.4 on minibatch of size 35 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 3', 'Predictor 1: Instruction 8', 'Predictor 1: Few-Shot Set 15'].
2025/11/12 22:03:50 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71, 84.6, 78.57, 76.95, 78.32, 81.73, 80.42, 72.25, 73.68, 82.26, 78.98, 74.4]
2025/11/12 22:03:50 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92, 77.33]
2025/11/12 22:03:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 22:03:50 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 36 / 41 - Minibatch ==


Average Metric: 28.92 / 35 (82.6%): : 36it [02:36,  4.35s/it]                      

2025/11/12 22:06:27 INFO dspy.evaluate.evaluate: Average Metric: 28.916666666666668 / 35 (82.6%)





2025/11/12 22:06:28 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 82.62 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 4', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 10'].
2025/11/12 22:06:28 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71, 84.6, 78.57, 76.95, 78.32, 81.73, 80.42, 72.25, 73.68, 82.26, 78.98, 74.4, 82.62]
2025/11/12 22:06:28 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92, 77.33]
2025/11/12 22:06:28 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 22:06:28 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 37 / 41 - Full Evaluation =====
2025/11/12 22:06:28 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 82.62) from minibatch tria

Average Metric: 138.35 / 180 (76.9%): 100%|██████████| 180/180 [05:39<00:00,  1.88s/it]

2025/11/12 22:12:07 INFO dspy.evaluate.evaluate: Average Metric: 138.3527777777778 / 180 (76.9%)
2025/11/12 22:12:07 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92, 77.33, 76.86]
2025/11/12 22:12:07 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27
2025/11/12 22:12:07 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/11/12 22:12:07 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 38 / 41 - Minibatch ==



Average Metric: 26.98 / 35 (77.1%): 100%|██████████| 35/35 [01:30<00:00,  2.58s/it]

2025/11/12 22:13:38 INFO dspy.evaluate.evaluate: Average Metric: 26.979166666666668 / 35 (77.1%)





2025/11/12 22:13:38 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 77.08 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 10'].
2025/11/12 22:13:38 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71, 84.6, 78.57, 76.95, 78.32, 81.73, 80.42, 72.25, 73.68, 82.26, 78.98, 74.4, 82.62, 77.08]
2025/11/12 22:13:38 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92, 77.33, 76.86]
2025/11/12 22:13:38 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 22:13:38 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 39 / 41 - Minibatch ==


Average Metric: 25.94 / 35 (74.1%): 100%|██████████| 35/35 [01:27<00:00,  2.49s/it]

2025/11/12 22:15:06 INFO dspy.evaluate.evaluate: Average Metric: 25.9375 / 35 (74.1%)





2025/11/12 22:15:06 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 74.11 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 4', 'Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 17'].
2025/11/12 22:15:06 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71, 84.6, 78.57, 76.95, 78.32, 81.73, 80.42, 72.25, 73.68, 82.26, 78.98, 74.4, 82.62, 77.08, 74.11]
2025/11/12 22:15:06 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92, 77.33, 76.86]
2025/11/12 22:15:06 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 22:15:06 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 40 / 41 - Minibatch ==


Average Metric: 27.81 / 35 (79.5%): 100%|██████████| 35/35 [02:43<00:00,  4.66s/it]

2025/11/12 22:17:50 INFO dspy.evaluate.evaluate: Average Metric: 27.8125 / 35 (79.5%)





2025/11/12 22:17:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 79.46 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 4', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 10'].
2025/11/12 22:17:50 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [79.05, 81.9, 69.7, 77.43, 76.22, 76.9, 79.0, 82.5, 80.65, 81.43, 79.76, 70.83, 79.88, 76.69, 68.73, 74.12, 80.95, 70.71, 84.6, 78.57, 76.95, 78.32, 81.73, 80.42, 72.25, 73.68, 82.26, 78.98, 74.4, 82.62, 77.08, 74.11, 79.46]
2025/11/12 22:17:50 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92, 77.33, 76.86]
2025/11/12 22:17:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27


2025/11/12 22:17:50 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 41 / 41 - Full Evaluation =====
2025/11/12 22:17:50 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score

Average Metric: 146.88 / 180 (81.6%): 100%|██████████| 180/180 [02:30<00:00,  1.20it/s]

2025/11/12 22:20:21 INFO dspy.evaluate.evaluate: Average Metric: 146.88333333333333 / 180 (81.6%)
2025/11/12 22:20:21 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [76.29, 83.27, 76.78, 78.9, 78.92, 77.33, 76.86, 81.6]
2025/11/12 22:20:21 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 83.27
2025/11/12 22:20:21 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/11/12 22:20:21 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 83.27!





## Inspect Optimized Prompt

Display the optimized prompt that MIPROv2 generated for the privacy-preserving request generator. This prompt has been refined to better anonymize user queries while preserving their intent.


In [9]:
print(optimized_papillon.craft_redacted_request.predict.signature.instructions)

Generate a detailed, privacy-preserving prompt for an external large language model based on a user query. The prompt should rephrase or generalize the query to avoid revealing sensitive or private information, while capturing the core intent of the request—whether it involves summarizing research studies, analyzing complex data, or developing creative narratives. Ensure that the prompt emphasizes generating responses that are insightful, accurate, and respectful of privacy, encouraging the external LLM to produce effective and secure outputs without learning or storing private details.


## Evaluate Optimized PAPILLON System

Run final evaluation on the test set using the optimized PAPILLON system to confirm the performance improvement. The optimized system should show better balance between response quality and privacy preservation.


In [10]:
evaluate(optimized_papillon)

Average Metric: 134.58 / 164 (82.1%):  77%|███████▋  | 164/214 [04:16<00:40,  1.24it/s]



Average Metric: 175.83 / 214 (82.2%): 100%|██████████| 214/214 [05:33<00:00,  1.56s/it]

2025/11/12 22:25:56 INFO dspy.evaluate.evaluate: Average Metric: 175.83333333333334 / 214 (82.2%)





Unnamed: 0,target_response,user_query,pii_str,llm_request,llm_response,response,compute_overall_score
0,Subject: Follow-up on Endpoint Management Solution Discussion Dear...,"rewrite in professional ""as discusseed in last meeting with you an...",effem||balaji||balaji,"Rewrite the following message in a professional tone: ""As discusse...",Subject: Follow-up on Effem Account Implementation and Security Re...,Subject: Follow-up on Effem Account Solution and Pending Decision ...,✔️ [0.500]
1,"Кампания ""Share a Coke"" была запущена Coca-Cola в 2011 году в Авст...","расскажи про PR-кампанию Coca-Cola “Share a Coke""",coca-cola,"Describe the Coca-Cola “Share a Coke” campaign, focusing on its ge...",The Coca-Cola “Share a Coke” campaign is a highly successful and i...,Кампания Coca-Cola “Share a Coke” — это чрезвычайно успешная и инн...,✔️ [1.000]
2,Creating a Facebook ad campaign within a specific budget and targe...,give me a facebook ads campaign for the next 7 days for canva pro ...,jordan||canva,Please create a detailed Facebook Ads campaign plan for the next 7...,Certainly! Below is a comprehensive 7-day Facebook Ads campaign pl...,**Campaign Type:** Conversion (to maximize sign-ups for Canva Pro)...,✔️ [0.500]
3,"The delivery was scheduled for today, but due to a payment delay t...",The delivery supposed to be today. Seems our payment delay to Lvio...,seagate||lviv||ms. chan,"Rewrite the following message to be more professional and clear, e...",Subject: Urgent: Delivery Rescheduling Due to Payment Delay Dear [...,Subject: Urgent: Delivery Rescheduling Due to Payment Delay Dear [...,✔️ [0.500]
4,"Dear Ms. Amanda, I hope this email finds you well. I wanted to bri...","write a email to Ms, Amanda ; tell her, we have a way to overcome ...",india||amanda||hermann(germany)||china||vims(france),Draft a professional and detailed email addressed to Ms. Amanda. T...,Subject: Exploring Strategic Alternatives for Device Import and Co...,"Dear Ms. Amanda, I hope this message finds you well. I wanted to u...",✔️ [0.500]


EvaluationResult(score=82.17, results=<list of 214 results>)

## Performance Improvement Summary

Display the improvement in overall score from baseline (78.96%) to optimized (82.2%) performance. This shows how MIPROv2 optimization improved the system's ability to maintain response quality while better preserving user privacy.

78.96->82.2