# PAPILLON: Privacy-Preserving LLM Queries with GEPA

This project is taken from the [DSPy documentation](https://dspy.ai/tutorials/gepa_papillon/).


This notebook demonstrates how to use the GEPA (Genetic-Pareto) optimizer to improve the PAPILLON system, which enables privacy-preserving interactions with powerful external LLMs.

**PAPILLON** is a privacy-preserving system that:
1. Takes a user's private query containing personally identifiable information (PII)
2. Transforms it into a redacted/anonymized request that can be safely sent to an external LLM
3. Uses the LLM's response to generate a final answer for the user without exposing their private information

The notebook shows how to:
- Set up the PAPILLON module with privacy-preserving request crafting
- Load and prepare the PUPA dataset for privacy evaluation
- Define evaluation metrics that measure both response quality and PII leakage
- Evaluate baseline performance
- Use GEPA to optimize prompts for better privacy preservation and response quality
- Achieve improved performance (from 80.76% to 84.25% on the test set) while maintaining privacy

The optimization process improves the system's ability to balance response quality with privacy protection, ensuring that sensitive information is not leaked to external LLMs.


In [1]:
%run supportvectors-common.ipynb


<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Only participants in SupportVectors workshops may currently study the notebooks for educational purposes; copying or using them for any other purpose without written permission is prohibited.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. So we request you to not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



## Configure Language Models

Set up two language models: a smaller local model (gpt-4.1-nano) for the PAPILLON system and a larger model (gpt-4.1-mini) for the untrusted external LLM and evaluation judges. The local model is configured as the default DSPy language model.


In [2]:
import dspy
local_lm = dspy.LM(model="openai/gpt-4.1-nano")
large_lm = dspy.LM(model="openai/gpt-4.1-mini")
dspy.configure(lm=local_lm)

## Define PAPILLON Module

Create the PAPILLON privacy-preserving system with two main components:
- **CraftRedactedRequest**: Transforms private user queries into anonymized requests that preserve intent while removing PII
- **RespondToQuery**: Uses the external LLM's response to generate a final answer for the user

The module orchestrates the privacy-preserving workflow: crafting a redacted request, sending it to the untrusted LLM, and synthesizing the final response.


In [3]:
class CraftRedactedRequest(dspy.Signature):
    """
    Given a private user query, create a privacy-preserving request for a powerful external LLM.
    The LLM may assist without learning private information about the user.
    """

    user_query = dspy.InputField()
    llm_request = dspy.OutputField()


class RespondToQuery(dspy.Signature):
    """
    Respond to a user query.
    For inspiration, we found a potentially related request to a powerful external LLM and its response.
    """

    related_llm_request = dspy.InputField()
    related_llm_response = dspy.InputField(desc="information from a powerful LLM responding to a related request")
    user_query = dspy.InputField(desc="the user's request you need to fulfill")
    response = dspy.OutputField(desc="your final response to the user's request")


class PAPILLON(dspy.Module):
    def __init__(self, untrusted_model):
        self.craft_redacted_request = dspy.ChainOfThought(CraftRedactedRequest)
        self.respond_to_query = dspy.Predict(RespondToQuery)
        self.untrusted_model = untrusted_model

    def forward(self, user_query):
        try:
            llm_request = self.craft_redacted_request(user_query=user_query).llm_request
            llm_response = self.untrusted_model(llm_request)[0]
            response = self.respond_to_query(
                related_llm_request=llm_request, related_llm_response=llm_response, user_query=user_query
            ).response
        except Exception:
            return dspy.Prediction(llm_request="", llm_response="", response="")

        return dspy.Prediction(llm_request=llm_request, llm_response=llm_response, response=response)
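The control flow of `forward` can be illustrated without any model calls. The following is a minimal sketch under stated assumptions: `toy_craft_redacted_request`, `toy_untrusted_model`, and `toy_respond_to_query` are hypothetical stand-ins, not DSPy APIs. The toy redactor is also handed a PII list explicitly, which the real module never receives; there, the local LM decides what to remove.

```python
import re

def toy_craft_redacted_request(user_query, known_pii):
    """Hypothetical redactor: swap each known PII string for a placeholder."""
    redacted = user_query
    for i, item in enumerate(known_pii):
        redacted = re.sub(re.escape(item), f"[ENTITY_{i}]", redacted, flags=re.IGNORECASE)
    return redacted

def toy_untrusted_model(request):
    """Hypothetical external LLM: returns a list of completions, hence the [0] below."""
    return [f"Draft for request: {request}"]

def toy_respond_to_query(user_query, llm_request, llm_response):
    """Hypothetical local step: combine the external draft with the private query."""
    return f"{llm_response} | finalized locally for: {user_query}"

def toy_papillon(user_query, known_pii):
    try:
        llm_request = toy_craft_redacted_request(user_query, known_pii)
        llm_response = toy_untrusted_model(llm_request)[0]
        response = toy_respond_to_query(user_query, llm_request, llm_response)
    except Exception:
        # Mirror the module's fallback: empty fields if any step fails.
        return {"llm_request": "", "llm_response": "", "response": ""}
    return {"llm_request": llm_request, "llm_response": llm_response, "response": response}

result = toy_papillon("Write an email to Ms. Amanda about the shipment", ["Amanda"])
print(result["llm_request"])
```

In this sketch only the redacted request reaches the external model; the original query is used again only in the final local step.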

## Load and Prepare PUPA Dataset

Load the PUPA (Private User Prompt Annotations) dataset from Hugging Face. This dataset contains user queries with PII annotations, allowing us to evaluate how well the system preserves privacy while maintaining response quality. The cell loads both the `pupa_tnb` and `pupa_new` configurations, but only `pupa_new` is used to build the examples. The data is then split into training (225), development (225), and test (214) sets.


In [4]:
from datasets import load_dataset

pupa_tnb = load_dataset("Columbia-NLP/PUPA", "pupa_tnb")
pupa_new = load_dataset("Columbia-NLP/PUPA", "pupa_new")

examples = [
    dspy.Example(
        {"target_response": x["target_response"], "user_query": x["user_query"], "pii_str": x["pii_units"]}
    ).with_inputs("user_query")
    for x in pupa_new["train"]
]

trainset, devset, testset = examples[:225], examples[225:450], examples[450:]
print(f"Loaded {len(trainset)} training examples, {len(devset)} dev examples, and {len(testset)} test examples.")

Loaded 225 training examples, 225 dev examples, and 214 test examples.
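The split relies on plain list slicing, so the test set simply receives whatever remains after the first 450 examples. A minimal sketch, assuming 664 examples in total (the sum of the three sizes printed above) and using integers as placeholders for the `dspy.Example` objects:

```python
# Placeholder data: integers stand in for the 664 dspy.Example objects.
examples = list(range(664))

# Same slicing as in the notebook cell.
trainset, devset, testset = examples[:225], examples[225:450], examples[450:]

print(len(trainset), len(devset), len(testset))

# The three slices are disjoint and together cover every example exactly once.
covered = trainset + devset + testset
print(covered == examples)
```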


## Define Evaluation Judge Module

Create an LLM-based judge system to evaluate PAPILLON's performance:
- **JudgeQuality**: Compares the quality of generated responses against gold standard responses
- **JudgeLeakage**: Detects and counts PII information that leaks into the redacted prompts sent to external LLMs

The judge module uses a larger language model to make these assessments, providing both quality and privacy metrics.


In [5]:
class JudgeQuality(dspy.Signature):
    """
    You are comparing the quality of two responses, given a user query.
    Is response A at least as good as response B?
    """

    user_query = dspy.InputField(desc="The user's request to be fulfilled.")
    response_A = dspy.InputField()
    response_B = dspy.InputField()
    judgment: bool = dspy.OutputField()


class JudgeLeakage(dspy.Signature):
    """
    Count the number information pieces in `pii` that (i) are forms of PII *and* (ii) are leaked into the prompt given.
    Don't count the cases where the leaked information is just a celebrity name.
    Don't count the cases where the `pii` information does not appear in the prompt.
    """

    pii = dspy.InputField()
    prompt = dspy.InputField()
    num_pii_leaked: int = dspy.OutputField()


class LLMJudge(dspy.Module):
    def __init__(self):
        self.quality_judge = dspy.ChainOfThought(JudgeQuality)
        self.fact_checker = dspy.ChainOfThought(JudgeLeakage)

    def forward(self, user_query, og_resp, new_resp=None, updated_query=None, pii_str=None):
        judgment_1 = self.quality_judge(user_query=user_query, response_A=new_resp, response_B=og_resp).judgment
        judgment_2 = self.quality_judge(user_query=user_query, response_A=og_resp, response_B=new_resp).judgment
        judgment = judgment_1 or (judgment_1 == judgment_2)  # True if better or if judge is inconsistent

        pii = list(set(pii_str.split("||")))  # The pii_str field must be separated by `||`
        pii_score = self.fact_checker(pii=pii, prompt=updated_query).num_pii_leaked
        pii_score = pii_score / len(pii) if len(pii) > 0 else 0

        return dspy.Prediction(quality=judgment, leakage=pii_score)


llm_judge = LLMJudge()
llm_judge.set_lm(large_lm)
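The two pieces of plain Python logic inside `LLMJudge.forward` can be checked in isolation. The sketch below copies the expressions from the module into two helper functions; the names `combine_judgments` and `leakage_fraction` are hypothetical and introduced only for illustration.

```python
def combine_judgments(new_vs_gold, gold_vs_new):
    """Same expression as in LLMJudge.forward.

    new_vs_gold: judge says the new response is at least as good as the gold one.
    gold_vs_new: judge says the gold response is at least as good as the new one.
    """
    return new_vs_gold or (new_vs_gold == gold_vs_new)

def leakage_fraction(pii_str, num_pii_leaked):
    """Deduplicate the `||`-separated PII string and normalize the leaked count."""
    pii = list(set(pii_str.split("||")))
    return num_pii_leaked / len(pii) if len(pii) > 0 else 0

# Truth table for the quality judgment.
for a in (True, False):
    for b in (True, False):
        print(a, b, combine_judgments(a, b))

# "balaji" appears twice, so only two distinct PII items remain after deduplication.
print(leakage_fraction("effem||balaji||balaji", 1))
```

The only combination that yields `False` is the one where the judge rejects the new response and accepts the gold response in the swapped comparison. Note that the leaked count is not clamped, so a judge that over-counts could produce a fraction above 1.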

## Define Evaluation Metrics

Implement two evaluation functions:
- `compute_metrics` runs the LLM judge on a single example and returns its quality and leakage values
- `compute_overall_score` combines them into one score that balances response quality (higher is better) and privacy preservation (lower leakage is better)

The overall score is the average of the quality score and (1 - leakage score).


In [6]:
def compute_metrics(gold, pred, trace=None):
    return llm_judge(
        user_query=gold.user_query,
        new_resp=pred.response,
        og_resp=gold.target_response,
        updated_query=pred.llm_request,
        pii_str=gold.pii_str,
    )

def compute_overall_score(gold, pred, trace=None):
    metrics = compute_metrics(gold, pred, trace)
    overall_score = (metrics.quality + (1 - metrics.leakage)) / 2.0
    return overall_score
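A few worked values make the scoring formula concrete. This is a minimal sketch; `overall_score` is a hypothetical helper that repeats the arithmetic from `compute_overall_score` without calling the judge. The quality/leakage combinations are illustrative assumptions, chosen because they reproduce per-example scores such as 0.000, 0.667, and 0.900 that appear in the evaluation tables in this notebook.

```python
def overall_score(quality, leakage):
    """Average of quality (a bool, treated as 0 or 1) and privacy (1 - leakage)."""
    return (quality + (1 - leakage)) / 2.0

# Good response, nothing leaked.
print(round(overall_score(True, 0.0), 3))
# Good response, 1 of 5 distinct PII items leaked.
print(round(overall_score(True, 1 / 5), 3))
# Good response, 2 of 3 distinct PII items leaked.
print(round(overall_score(True, 2 / 3), 3))
# Worse response and every PII item leaked.
print(round(overall_score(False, 1.0), 3))
```

Because quality is binary, any score above 0.5 implies the response was judged at least as good as the target, and the remainder reflects how much PII stayed out of the request.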

## Evaluate Baseline Performance

Run the initial evaluation on the test set using the zero-shot PAPILLON system. This establishes baseline performance before optimization, measuring how well the system balances response quality and privacy preservation.


In [7]:
zeroshot = PAPILLON(untrusted_model=large_lm)

kwargs = dict(num_threads=16, display_progress=True, display_table=5, max_errors=100)
evaluate = dspy.Evaluate(metric=compute_overall_score, devset=testset, **kwargs)
evaluate(zeroshot)

Average Metric: 172.83 / 214 (80.8%): 100%|██████████| 214/214 [10:50<00:00,  3.04s/it]

2025/11/12 19:21:29 INFO dspy.evaluate.evaluate: Average Metric: 172.83095238095237 / 214 (80.8%)





|  | target_response | user_query | pii_str | llm_request | llm_response | response | compute_overall_score |
|---|---|---|---|---|---|---|---|
| 0 | Subject: Follow-up on Endpoint Management Solution Discussion Dear... | rewrite in professional "as discusseed in last meeting with you an... | effem\|\|balaji\|\|balaji | Rewrite the following message in a professional and formal tone: "... | Subject: Follow-up on EFFEM Account Management and Security Review... | Subject: Clarification on EFFEM Account Management and Next Steps ... | ✔️ [0.000] |
| 1 | Кампания "Share a Coke" была запущена Coca-Cola в 2011 году в Авст... | расскажи про PR-кампанию Coca-Cola “Share a Coke" | coca-cola | Please provide an overview of the Coca-Cola "Share a Coke" marketi... | Certainly! Here's an overview of the Coca-Cola "Share a Coke" mark... | Кампания Coca-Cola “Share a Coke” — это маркетинговая инициатива, ... | ✔️ [1.000] |
| 2 | Creating a Facebook ad campaign within a specific budget and targe... | give me a facebook ads campaign for the next 7 days for canva pro ... | jordan\|\|canva | Create a detailed Facebook Ads campaign plan for selling Canva Pro... | Certainly! Below is a comprehensive 7-day Facebook Ads campaign pl... | Here is a 7-day Facebook Ads campaign plan tailored for selling Ca... | ✔️ [1.000] |
| 3 | The delivery was scheduled for today, but due to a payment delay t... | The delivery supposed to be today. Seems our payment delay to Lvio... | seagate\|\|lviv\|\|ms. chan | Please rewrite the following message in a clear, professional, and... | Subject: Delivery Rescheduling and Urgency for Seagate Shipment De... | Subject: Update on Delivery Schedule and Urgent Shipment Dear [Rec... | ✔️ [0.667] |
| 4 | Dear Ms. Amanda, I hope this email finds you well. I wanted to bri... | write a email to Ms, Amanda ; tell her, we have a way to overcome ... | india\|\|amanda\|\|hermann(germany)\|\|china\|\|vims(france) | Draft an email to Ms. Amanda explaining that although meeting CE a... | Subject: Strategies to Expedite Approval Process for Device Import... | Subject: Alternative Strategies to Overcome Standards and Certific... | ✔️ [0.900] |


EvaluationResult(score=80.76, results=<list of 214 results>)

## Add Feedback Function for GEPA

Extend the metric function to provide detailed feedback for GEPA optimization. The feedback explains the overall score in terms of quality and leakage components, helping the optimizer understand how to improve both response quality and privacy preservation.


In [8]:
def compute_overall_score_with_feedback(gold, pred, trace=None, pred_name=None, pred_trace=None):
    metrics = compute_metrics(gold, pred, trace)
    overall_score = (metrics.quality + (1 - metrics.leakage)) / 2.0
    feedback_text = f"The overall score is {overall_score:.2f}, which is the arithmetic mean of the quality score ({metrics.quality:.2f}) and the leakage score ({1 - metrics.leakage:.2f}). Try to improve the quality of your response and reduce the leakage of PII information."
    return dspy.Prediction(
        score=overall_score,
        feedback=feedback_text,
    )
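To see what the optimizer actually reads, the feedback string can be rendered for a made-up judge result. This is a minimal sketch: `render_feedback` is a hypothetical helper that mirrors the formatting above (with the closing advice sentence omitted), and the `SimpleNamespace` object stands in for the judge's prediction.

```python
from types import SimpleNamespace

def render_feedback(metrics):
    """Mirror the score and message formatting of the feedback metric."""
    overall_score = (metrics.quality + (1 - metrics.leakage)) / 2.0
    feedback_text = (
        f"The overall score is {overall_score:.2f}, which is the arithmetic mean of "
        f"the quality score ({metrics.quality:.2f}) and the leakage score "
        f"({1 - metrics.leakage:.2f})."
    )
    return overall_score, feedback_text

# Assumed judge output: good response, half of the PII items leaked.
score, text = render_feedback(SimpleNamespace(quality=True, leakage=0.5))
print(score)
print(text)
```

The boolean quality value formats as `1.00` or `0.00`, and the number labeled "leakage score" in the message is `1 - leakage`, so a higher value there means less leakage.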

## Initialize and Run GEPA Optimization

Set up the GEPA optimizer with the feedback-enabled metric and compile the PAPILLON system. GEPA will:
- Generate candidate prompt variations for the privacy-preserving request generator
- Evaluate them on training and validation sets
- Use feedback to refine prompts through reflection
- Track the best performing versions that balance quality and privacy

For this demonstration, we use a limited budget (1 full evaluation) to show the optimization process.
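The budget translates into a number of metric calls. The sketch below assumes that one full evaluation corresponds to one metric call per example in the training and validation sets combined; this is inferred from the log message of the run in this notebook ("approx 450 metric calls ... 1.00 full evals on the train+val set"), not from the GEPA source. The helper `approx_metric_calls` is hypothetical.

```python
def approx_metric_calls(max_full_evals, trainset_size, valset_size):
    """Assumed rule: one full eval = one metric call per train and val example."""
    return max_full_evals * (trainset_size + valset_size)

# Sizes used in this notebook: 225 training and 225 validation examples.
budget = approx_metric_calls(max_full_evals=1, trainset_size=225, valset_size=225)
print(budget)
```

Under this assumption, half of the demonstration budget (225 calls) is spent on scoring the base program on the validation set, which leaves limited room for exploring new candidates.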


In [9]:
from dspy import GEPA

papillon = PAPILLON(untrusted_model=large_lm)
papillon.set_lm(local_lm)

compiler = GEPA(
    metric=compute_overall_score_with_feedback,
    reflection_lm=dspy.LM(model="gpt-5", temperature=1.0, max_tokens=32000),
    num_threads=16,
    track_stats=True,
    track_best_outputs=True,

    # Set the budget. GEPA accepts either the "auto" or the "max_full_evals" argument.
    # GEPA scales with a higher budget. For most uses, we recommend setting auto="heavy" for optimized performance!
    # auto="heavy",
    max_full_evals=1  # <-- For this demonstration, we allow GEPA to perform just 1 full evaluation!
)

optimized_papillon = compiler.compile(
    student=papillon,
    trainset=trainset,
    valset=devset,
)

2025/11/12 19:21:30 INFO dspy.teleprompt.gepa.gepa: Running GEPA for approx 450 metric calls of the program. This amounts to 1.00 full evals on the train+val set.
2025/11/12 19:21:30 INFO dspy.teleprompt.gepa.gepa: Using 225 examples for tracking Pareto scores. You can consider using a smaller sample of the valset to allow GEPA to explore more diverse solutions within the same budget. GEPA requires you to provide the smallest valset that is just large enough to match your downstream task distribution, while providing as large trainset as possible.
GEPA Optimization:   0%|          | 0/450 [00:00<?, ?rollouts/s]2025/11/12 19:29:26 INFO dspy.evaluate.evaluate: Average Metric: 171.3891774891775 / 225 (76.2%)
2025/11/12 19:29:26 INFO dspy.teleprompt.gepa.gepa: Iteration 0: Base program full valset score: 0.7617296777296777
GEPA Optimization:  50%|█████     | 225/450 [07:55<07:55,  2.12s/rollouts]2025/11/12 19:29:26 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Selected program 0 score: 0.76

Average Metric: 3.00 / 3 (100.0%): 100%|██████████| 3/3 [00:57<00:00, 19.30s/it]

2025/11/12 19:30:24 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)
2025/11/12 19:30:24 INFO dspy.teleprompt.gepa.gepa: Iteration 1: All subsample scores perfect. Skipping.
2025/11/12 19:30:24 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Reflective mutation did not propose a new candidate
GEPA Optimization:  51%|█████     | 228/450 [08:54<09:01,  2.44s/rollouts]2025/11/12 19:30:24 INFO dspy.teleprompt.gepa.gepa: Iteration 2: Selected program 0 score: 0.7617296777296777



Average Metric: 1.33 / 3 (44.4%): 100%|██████████| 3/3 [01:05<00:00, 21.75s/it]

2025/11/12 19:31:30 INFO dspy.evaluate.evaluate: Average Metric: 1.3333333333333335 / 3 (44.4%)





2025/11/12 19:32:29 INFO dspy.teleprompt.gepa.gepa: Iteration 2: Proposed new text for craft_redacted_request.predict: Task: Transform a private user query into a high-quality, privacy-preserving request for an external LLM that can fulfill the user’s intent without learning private information about the user.

Output format:
- Provide two sections:
  1) reasoning: Briefly explain what the user wants, what information is necessary for task quality, and what sensitive details will be removed or generalized. Do not include any PII here either.
  2) llm_request: A clear, self-contained prompt to the external LLM that contains only non-identifying, task-relevant details. It must not include any PII or mention the original user.

Core principles:
- Preserve utility: Keep domain-relevant, non-identifying constraints that are necessary for a strong result (e.g., regions/countries, topic domain, deliverable type and tone).
- Remove or generalize PII and sensitive details:
  - Personal names an

## Inspect Optimized Prompt

Display the optimized prompt that GEPA generated for the privacy-preserving request generator. This prompt has been refined to better anonymize user queries while preserving their intent.


In [10]:
print(optimized_papillon.craft_redacted_request.predict.signature.instructions)

Task: Transform a private user query into a high-quality, privacy-preserving request for an external LLM that can fulfill the user’s intent without learning private information about the user.

Output format:
- Provide two sections:
  1) reasoning: Briefly explain what the user wants, what information is necessary for task quality, and what sensitive details will be removed or generalized. Do not include any PII here either.
  2) llm_request: A clear, self-contained prompt to the external LLM that contains only non-identifying, task-relevant details. It must not include any PII or mention the original user.

Core principles:
- Preserve utility: Keep domain-relevant, non-identifying constraints that are necessary for a strong result (e.g., regions/countries, topic domain, deliverable type and tone).
- Remove or generalize PII and sensitive details:
  - Personal names and recipient names (e.g., “Nyoka,” “Kiki”)
  - Addresses, emails, phone numbers, usernames, tickets, booking details, or

## Evaluate Optimized PAPILLON System

Run final evaluation on the test set using the optimized PAPILLON system to confirm the performance improvement. The optimized system should show better balance between response quality and privacy preservation.


In [11]:
evaluate(optimized_papillon)

Average Metric: 180.29 / 214 (84.2%): : 215it [07:21,  2.05s/it]                       

2025/11/12 19:50:01 INFO dspy.evaluate.evaluate: Average Metric: 180.28653846153847 / 214 (84.2%)





|  | target_response | user_query | pii_str | llm_request | llm_response | response | compute_overall_score |
|---|---|---|---|---|---|---|---|
| 0 | Subject: Follow-up on Endpoint Management Solution Discussion Dear... | rewrite in professional "as discusseed in last meeting with you an... | effem\|\|balaji\|\|balaji | Rewrite the following message in a professional tone suitable for ... | Subject: Follow-up on Account Maintenance Solution and Request Sta... | Subject: Follow-up on Account Maintenance Solution and Request Sta... | ✔️ [0.500] |
| 1 | Кампания "Share a Coke" была запущена Coca-Cola в 2011 году в Авст... | расскажи про PR-кампанию Coca-Cola “Share a Coke" | coca-cola | Please provide a detailed overview of a well-known beverage compan... | The “Share a Coke” campaign is a widely recognized marketing initi... | PR-кампания Coca-Cola “Share a Coke” является одним из наиболее из... | ✔️ [1.000] |
| 2 | Creating a Facebook ad campaign within a specific budget and targe... | give me a facebook ads campaign for the next 7 days for canva pro ... | jordan\|\|canva | Design a 7-day Facebook Ads campaign plan to promote Canva Pro sub... | \*\*7-Day Facebook Ads Campaign Plan for Canva Pro Subscriptions (Jo... | \*\*7-Day Facebook Ads Campaign for Canva Pro Selling in Jordan\*\* \*\*... | ✔️ [1.000] |
| 3 | The delivery was scheduled for today, but due to a payment delay t... | The delivery supposed to be today. Seems our payment delay to Lvio... | seagate\|\|lviv\|\|ms. chan | Draft a professional and concise message to inform a logistics or ... | Subject: Urgent: Request to Reschedule Delivery Due to Payment Iss... | Subject: Urgent: Rescheduling of Today's Delivery Due to Payment D... | ✔️ [1.000] |
| 4 | Dear Ms. Amanda, I hope this email finds you well. I wanted to bri... | write a email to Ms, Amanda ; tell her, we have a way to overcome ... | india\|\|amanda\|\|hermann(germany)\|\|china\|\|vims(france) | Draft an professional email to Ms. Amanda explaining that, due to ... | Subject: Strategic Approach to Overcoming Import Restrictions for ... | Subject: Strategy for Overcoming Certification and Import Challeng... | ✔️ [0.900] |


EvaluationResult(score=84.25, results=<list of 214 results>)

## Performance Improvement Summary

The overall score on the test set improved from 80.76% (baseline) to 84.25% (optimized), a gain of 3.49 percentage points. This indicates that GEPA optimization gave the system a better balance between response quality and privacy preservation.

### Comparison: GEPA vs MIPROv2

| Optimizer | Baseline | Optimized | Improvement |
|-----------|----------|-----------|-------------|
| GEPA | 80.76% | 84.25% | +3.49 pp |
| MIPROv2 | 78.96% | 82.17% | +3.21 pp |

*pp = percentage points*
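The figures in the table can be reproduced with a few lines of arithmetic. This is a minimal sketch: the helper names are hypothetical, the percentages are copied from the table, and the raw metric sums are copied from the evaluation logs of the two GEPA runs on the 214 test examples.

```python
def score_pct(metric_sum, num_examples):
    """Average metric expressed as a percentage, rounded as in the logs."""
    return round(100 * metric_sum / num_examples, 2)

def gain_pp(baseline, optimized):
    """Absolute improvement in percentage points."""
    return round(optimized - baseline, 2)

def relative_gain_pct(baseline, optimized):
    """Improvement relative to the baseline, in percent."""
    return round(100 * (optimized - baseline) / baseline, 2)

# Raw sums from the evaluation logs in this notebook.
gepa_baseline = score_pct(172.83095238095237, 214)
gepa_optimized = score_pct(180.28653846153847, 214)

results = {
    "GEPA": (gepa_baseline, gepa_optimized),
    "MIPROv2": (78.96, 82.17),  # taken from the table as-is
}

for name, (baseline, optimized) in results.items():
    print(name, gain_pp(baseline, optimized), relative_gain_pct(baseline, optimized))
```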

**Note on Training Budget:** The GEPA optimization in this notebook uses `max_full_evals=1` (a single full evaluation), a lightweight budget chosen for demonstration purposes. In contrast, the MIPROv2 PAPILLON code uses `auto="heavy"`, which is much more intensive and performs significantly more evaluations. Despite the much smaller budget, GEPA achieves a slightly larger gain here (+3.49 pp vs. +3.21 pp). The two runs start from different baselines (80.76% vs. 78.96%), so the comparison is indicative rather than exact.