# Tutorial 1.8: Prompt Optimization with GEPA

![](images/9_Prompt-Optimization-with-GEPA.png)

## Automatically Improve Prompts Using MLflow's GEPA Integration

In Tutorial 1.5, we manually iterated on prompts — writing better versions by hand and versioning them in the Prompt Registry. But what if an algorithm could do this automatically?

This notebook demonstrates **GEPA (Genetic-Pareto)**, an automatic prompt optimization algorithm integrated into MLflow via `mlflow.genai.optimize_prompts()`.

### What You'll Learn

- How GEPA automatically improves prompts
- Using `mlflow.genai.optimize_prompts()` with the Prompt Registry
- Evaluating prompt quality with the `Correctness` scorer
- Comparing original vs. optimized prompts

### Prerequisites
- Completed Notebook 1.5 (Prompt Management) and 1.7 (Evaluating Agents)
- Understanding of the Prompt Registry and evaluation scorers

### Estimated Time: 10-15 minutes

---
## Step 1: How GEPA Works

**GEPA (Genetic-Pareto)** optimizes prompts through an iterative cycle:

```
1. EVALUATE  →  Run the prompt on training examples, score with a judge
2. REFLECT   →  Use an LLM to analyze failures and propose improvements
3. MUTATE    →  Generate improved prompt variations
4. SELECT    →  Keep the best-performing candidates (Pareto-optimal)
5. REPEAT    →  Continue until budget exhausted or convergence
```
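
The cycle above can be sketched in plain Python. This is a toy illustration only, not the `gepa` package's implementation: the scoring rule, the canned `HINTS`, and every function name below are hypothetical stand-ins for what an LLM judge and an LLM reflection step would do.

```python
# Toy evaluate -> reflect -> mutate -> select loop (all names hypothetical).
EXAMPLES = ["needs_numbered_list", "needs_exact_length", "needs_definition"]

# Canned "improvements" standing in for an LLM reflection step.
HINTS = {
    "needs_numbered_list": "Number each item.",
    "needs_exact_length": "Respect the requested sentence count.",
    "needs_definition": "Start with a one-line definition.",
}


def evaluate(prompt: str) -> tuple[float, ...]:
    """Score the prompt per example: 1.0 if its hint is present, else 0.0."""
    return tuple(1.0 if HINTS[ex] in prompt else 0.0 for ex in EXAMPLES)


def reflect_and_mutate(prompt: str, scores: tuple[float, ...]) -> list[str]:
    """Propose one new prompt per failing example by appending its hint."""
    return [f"{prompt} {HINTS[ex]}" for ex, s in zip(EXAMPLES, scores) if s == 0.0]


def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(pool: dict[str, tuple[float, ...]]) -> dict[str, tuple[float, ...]]:
    """Keep only candidates that no other candidate dominates."""
    return {
        p: s for p, s in pool.items()
        if not any(dominates(other, s) for other in pool.values())
    }


def optimize(seed: str, budget: int) -> str:
    pool = {seed: evaluate(seed)}
    calls = len(EXAMPLES)
    while calls < budget:
        # Mutate the candidate with the highest total score so far.
        parent = max(pool, key=lambda p: sum(pool[p]))
        children = reflect_and_mutate(parent, pool[parent])
        if not children:
            break  # converged: no failing examples left
        for child in children:
            pool[child] = evaluate(child)
            calls += len(EXAMPLES)
        pool = pareto_front(pool)
    return max(pool, key=lambda p: sum(pool[p]))


best = optimize("Answer this question: {{ question }}", budget=100)
print(best)
```

The Pareto step is what distinguishes this from simply keeping the top scorer: a candidate that is best on even one example survives until another candidate matches or beats it everywhere.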

### Manual vs. Automatic Optimization

| Approach | Method | Effort | Consistency |
|----------|--------|--------|-------------|
| **Manual** (Notebook 1.5) | Human writes better prompts | High | Variable |
| **GEPA** (This notebook) | Algorithm evolves prompts | Low | Systematic |

### Integration with Prompt Registry

GEPA works directly with MLflow's Prompt Registry:
- **Reads** your registered prompt as the starting point
- **Optimizes** it through the evaluate-reflect-mutate cycle
- **Registers** the improved version automatically as a new version

> **Note:** GEPA requires the `gepa` package. Install it with: `pip install gepa`

---
## Step 2: Environment Setup

In [None]:
import mlflow
from dotenv import load_dotenv
from utils.clnt_utils import (
    get_ai_gateway_model_names,
    get_databricks_ai_gateway_client,
    get_openai_client,
    is_databricks_ai_gateway_client,
)

# Load environment
load_dotenv()

# Configure MLflow
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("09-prompt-optimization")

# Configure client and model based on provider
use_databricks_provider = is_databricks_ai_gateway_client()
if use_databricks_provider:
    client = get_databricks_ai_gateway_client()
    model_name = get_ai_gateway_model_names()[0]
    optimizer_model = f"databricks:/{model_name}"
else:
    client = get_openai_client()
    model_name = "gpt-5-mini"
    optimizer_model = f"openai:/{model_name}"

# Enable autologging
mlflow.openai.autolog()

print("\u2705 Environment configured")
print(f"   Provider: {'Databricks AI Gateway' if use_databricks_provider else 'OpenAI'}")
print(f"   Model: {model_name}")
print(f"   Optimizer model: {optimizer_model}")
print(f"   Tracking URI: {mlflow.get_tracking_uri()}")

---
## Step 3: Register a Baseline Prompt

We'll use the same basic Q&A prompt from Notebook 1.5's Prompt Library (`qa_simple`). We register it fresh here so this notebook is self-contained.

This minimal prompt is an ideal optimization target: it contains no instructions about format, length, or style, so GEPA has plenty of room to improve it.

In [None]:
# Register the baseline prompt (same template as qa_simple from Notebook 1.5)
baseline_prompt = mlflow.genai.register_prompt(
    name="gepa-qa-simple",
    template="Answer this question: {{ question }}",
    commit_message="Baseline prompt for GEPA optimization",
    tags={"author": "jules", "use_case": "Simple Q&A", "status": "baseline"}
)

print("\u2705 Baseline prompt registered")
print(f"   Name: {baseline_prompt.name}")
print(f"   Version: {baseline_prompt.version}")
print(f"   URI: {baseline_prompt.uri}")
print(f"   Template: '{baseline_prompt.template}'")
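
To make the `{{ question }}` placeholder concrete, here is a minimal stand-in for what filling a template does. The `fill_template` helper is hypothetical and written only for illustration; in this notebook the actual filling is done by the prompt object's `format()` method.

```python
import re


def fill_template(template: str, **variables: str) -> str:
    """Replace each {{ name }} placeholder with the matching keyword argument."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"No value supplied for placeholder '{name}'")
        return variables[name]

    # Allow optional whitespace inside the double braces.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


filled = fill_template(
    "Answer this question: {{ question }}",
    question="What is MLflow?",
)
print(filled)  # Answer this question: What is MLflow?
```

Whatever instructions an optimizer adds to the template, the placeholder itself has to be preserved, otherwise the question never reaches the model.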

---
## Step 4: Prepare Training Data and Predict Function

GEPA needs two things:
1. **Training data** — example input/output pairs so it can evaluate prompt quality
2. **Predict function** — a callable that loads the prompt, fills it, and calls the LLM

### Why Training Data Design Matters

GEPA improves prompts by finding gaps between actual outputs and expected responses.
If the baseline prompt already produces near-perfect answers (common with powerful LLMs),
GEPA has **no signal to improve** and may register the original template unchanged.

To give GEPA room to work, our training examples require **specific structure and detail**
that the bare-bones prompt `"Answer this question: ..."` won't naturally produce.
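
Because every record must follow the same shape, it can help to check the data before spending any optimization budget. The sketch below is a hypothetical helper (`validate_train_data` is not part of MLflow) and assumes the `inputs` / `expectations.expected_response` layout used in this notebook.

```python
def validate_train_data(records: list[dict], input_key: str = "question") -> list[str]:
    """Return human-readable problems; an empty list means the data looks OK."""
    problems = []
    for i, record in enumerate(records):
        inputs = record.get("inputs", {})
        expectations = record.get("expectations", {})
        if not str(inputs.get(input_key, "")).strip():
            problems.append(f"record {i}: missing inputs['{input_key}']")
        if not str(expectations.get("expected_response", "")).strip():
            problems.append(f"record {i}: missing expectations['expected_response']")
    return problems


sample = [
    {
        "inputs": {"question": "Compare RAG and fine-tuning in exactly 2 sentences."},
        "expectations": {"expected_response": "RAG retrieves documents. Fine-tuning changes weights."},
    },
    # Deliberately incomplete record to show what a failure looks like.
    {"inputs": {"question": "What is a vector database?"}, "expectations": {}},
]
print(validate_train_data(sample))
```

A check like this only catches structural mistakes. Whether the expected responses are demanding enough to give the optimizer a signal is still a judgment call.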

In [None]:
from mlflow.genai import optimize_prompts
from mlflow.genai.optimize.optimizers import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness

# Training data: questions that require STRUCTURED, DETAILED responses.
# The bare-bones baseline prompt won't guide the LLM to produce these formats,
# giving GEPA a clear signal to add structure/instructions to the prompt.
train_data = [
    {
        "inputs": {"question": "Explain MLflow Tracking to a beginner in exactly 3 sentences."},
        "expectations": {
            "expected_response": (
                "MLflow Tracking is a component that automatically logs your machine learning experiments. "
                "It records parameters, metrics, and model artifacts so you can compare different runs side by side. "
                "You can view all your experiments through the MLflow UI dashboard."
            )
        },
    },
    {
        "inputs": {"question": "List the top 3 benefits of using vector embeddings. Number each benefit."},
        "expectations": {
            "expected_response": (
                "1. Semantic similarity — embeddings capture meaning, so related concepts like 'happy' and 'joyful' are close in vector space.\n"
                "2. Efficiency — they compress high-dimensional sparse data into compact dense vectors, making computation faster.\n"
                "3. Transfer learning — pre-trained embeddings can be fine-tuned for specific downstream tasks without training from scratch."
            )
        },
    },
    {
        "inputs": {"question": "Compare RAG and fine-tuning in exactly 2 sentences."},
        "expectations": {
            "expected_response": (
                "RAG retrieves relevant documents at inference time to augment the LLM's context, "
                "while fine-tuning modifies the model's weights on domain-specific training data. "
                "RAG allows dynamic knowledge updates without retraining, while fine-tuning creates a "
                "specialized model that may lose some general capabilities."
            )
        },
    },
    {
        "inputs": {"question": "What is prompt engineering? Answer with a definition followed by 2 best practices."},
        "expectations": {
            "expected_response": (
                "Prompt engineering is the practice of designing and refining instructions given to LLMs "
                "to produce desired outputs reliably.\n\n"
                "Best practices:\n"
                "1. Be specific and explicit — clearly state the format, length, and style you expect.\n"
                "2. Provide examples — include one or two input/output examples to guide the model's behavior."
            )
        },
    },
    {
        "inputs": {"question": "Describe the MLflow Model Registry in one paragraph of 3-4 sentences."},
        "expectations": {
            "expected_response": (
                "The MLflow Model Registry is a centralized store for managing the full lifecycle of ML models. "
                "It provides model versioning, so you can track how models evolve over time. "
                "Teams can transition models through stages like Staging and Production using aliases. "
                "It also supports annotations and approval workflows for collaborative model governance."
            )
        },
    },
    {
        "inputs": {"question": "What is a vector database? Structure your answer as: Definition, then Use Cases (2 bullet points)."},
        "expectations": {
            "expected_response": (
                "A vector database is a specialized database designed to store, index, and efficiently query "
                "high-dimensional vector embeddings.\n\n"
                "Use cases:\n"
                "• Semantic search — finding documents or products by meaning rather than keyword matching.\n"
                "• RAG pipelines — retrieving relevant context for LLM generation to improve answer accuracy."
            )
        },
    },
]


# Predict function: GEPA calls this repeatedly during optimization.
# During optimization, GEPA patches PromptVersion.template so that
# load_prompt() returns the MUTATED template instead of the original.
def predict_qa(question: str) -> str:
    """Load the prompt from the registry, fill it, and call the LLM."""
    prompt = mlflow.genai.load_prompt(baseline_prompt.uri)
    filled = prompt.format(question=question)

    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": filled}],
    )
    return response.choices[0].message.content


print(f"\u2705 Training data prepared: {len(train_data)} examples")
print("   (Questions require specific structure/format the baseline prompt can't guide)")
print("\u2705 Predict function defined")
print(f"   Loads prompt from: {baseline_prompt.uri}")
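
Since the predict function is called many times during optimization, it is worth dry-running its call shape offline first. The sketch below is an assumption-laden stand-in: `StubClient` and `predict_with` are hypothetical, the stub only mimics the `client.chat.completions.create(...)` call used above, and the template is filled with `str.replace` rather than a registry prompt.

```python
from types import SimpleNamespace


class StubClient:
    """Hypothetical offline stand-in mimicking client.chat.completions.create()."""

    def __init__(self):
        self.calls = []
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    def _create(self, model, messages):
        self.calls.append({"model": model, "messages": messages})
        # Echo the prompt back so the caller can inspect what was sent.
        reply = SimpleNamespace(content=f"[stub reply to] {messages[0]['content']}")
        return SimpleNamespace(choices=[SimpleNamespace(message=reply)])


def predict_with(client, template: str, question: str, model: str = "stub-model") -> str:
    """Same shape as predict_qa, but with the client and template passed in."""
    filled = template.replace("{{ question }}", question)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": filled}],
    )
    return response.choices[0].message.content


stub = StubClient()
answer = predict_with(stub, "Answer this question: {{ question }}", "What is GEPA?")
print(answer)
```

Passing the client in as an argument is a design choice made for testability here; the notebook's `predict_qa` instead uses the module-level `client`.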

---
## Step 5: Run GEPA Optimization

Now we run the optimization. GEPA will:
1. Evaluate the baseline prompt using the `Correctness` scorer
2. Reflect on failures and generate improved variations
3. Select the best candidates and repeat
4. Register the optimized prompt as a new version in the Prompt Registry

> **Note:** This may take 3-5 minutes. The training examples require structured responses
> that the bare-bones prompt can't produce well, giving GEPA a clear optimization signal.
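
For a rough feel of what the budget buys, the arithmetic below assumes that one metric call scores one candidate prompt on one training example. That assumption is not confirmed by this notebook, so treat the result as an upper bound on full passes over the data rather than a description of GEPA's actual schedule.

```python
# Assumption: one "metric call" = one candidate prompt scored on one example.
def full_passes(max_metric_calls: int, n_examples: int) -> tuple[int, int]:
    """Return (complete passes over the training data, leftover calls)."""
    return divmod(max_metric_calls, n_examples)


passes, leftover = full_passes(max_metric_calls=100, n_examples=6)
print(f"{passes} full passes, {leftover} calls left over")
```

Under that assumption, adding training examples without raising the budget reduces how many candidate prompts can be fully evaluated.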

In [None]:
import logging
from ipykernel.iostream import OutStream

# === Fix for GEPA Unicode surrogate characters ===
# GEPA's internal output can contain Unicode surrogates that crash Jupyter's
# ZMQ/tornado JSON encoder. We work around this at three levels:

def _sanitize_surrogates(obj):
    """Recursively replace Unicode surrogates in an object tree."""
    if isinstance(obj, str):
        return obj.encode("utf-8", errors="replace").decode("utf-8")
    elif isinstance(obj, bytes):
        return obj
    elif isinstance(obj, dict):
        return {_sanitize_surrogates(k): _sanitize_surrogates(v) for k, v in obj.items()}
    elif isinstance(obj, (list, tuple)):
        return type(obj)(_sanitize_surrogates(item) for item in obj)
    return obj

# Level 1: Patch OutStream.write at the CLASS level to sanitize all output.
# This ensures surrogates are replaced with "?" before they reach ipykernel's
# buffer, regardless of how write() is called (instance, class, or thread).
_orig_outstream_write = OutStream.write

def _safe_outstream_write(self, string):
    if isinstance(string, str):
        string = string.encode("utf-8", errors="replace").decode("utf-8")
    return _orig_outstream_write(self, string)

OutStream.write = _safe_outstream_write

# Level 2: Patch the kernel session's pack function to handle surrogates
# in ZMQ/IOPub messages. This is where the serialization error actually
# occurs — session.pack calls orjson_packer/json_packer which choke on
# surrogate characters. We catch the error and sanitize on retry.
try:
    _kernel = get_ipython().kernel
    _orig_pack = _kernel.session.pack

    def _safe_pack(obj):
        try:
            return _orig_pack(obj)
        except (UnicodeEncodeError, TypeError):
            return _orig_pack(_sanitize_surrogates(obj))

    _kernel.session.pack = _safe_pack
except Exception:
    pass  # Not in a Jupyter kernel context

# Level 3: Suppress async/tornado error log messages for any surrogates
# that slip through to non-stdout kernel messages (non-fatal noise).
for _logger_name in ("tornado.general", "tornado.application", "asyncio"):
    logging.getLogger(_logger_name).setLevel(logging.CRITICAL)

# Run GEPA prompt optimization
print("\U0001f504 Running GEPA prompt optimization...\n")
print("   This will iterate through evaluate \u2192 reflect \u2192 mutate \u2192 select cycles.")
print("   Budget: 100 metric calls (may take 3-5 minutes)\n")

result = optimize_prompts(
    predict_fn=predict_qa,
    train_data=train_data,
    prompt_uris=[baseline_prompt.uri],
    optimizer=GepaPromptOptimizer(
        reflection_model=optimizer_model,
        max_metric_calls=100,
        display_progress_bar=False,
    ),
    scorers=[Correctness(model=optimizer_model)],
)

print("\n\u2705 GEPA optimization complete!")
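
The surrogate problem handled above can be reproduced in isolation. The snippet below is a standalone illustration using only the standard library. It shows why a surrogate pair written as two `\u` escapes fails to encode, what the `errors="replace"` approach used in this notebook produces, and one alternative that recombines a well-formed pair.

```python
# Two \u escapes give two lone surrogate code points, not one emoji.
broken = "\ud83d\udcca"
proper = "\U0001f4ca"  # the same character written as a single code point

try:
    broken.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# Option 1 (used in this notebook): replace each surrogate with "?".
replaced = broken.encode("utf-8", errors="replace").decode("utf-8")

# Option 2: recombine a well-formed pair into the real character.
# This raises UnicodeDecodeError if a surrogate is unpaired.
recombined = broken.encode("utf-16", "surrogatepass").decode("utf-16")

print(len(broken), len(proper), encodable)
print(replaced, recombined == proper)
```

Option 1 is lossy but never fails, which is why it suits a logging path. Option 2 preserves the character but needs its own error handling for unpaired surrogates.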

---
## Step 6: Compare Original vs. Optimized Prompt

In [None]:
def _safe(s):
    """Replace Unicode surrogates with '?' for safe display in Jupyter."""
    if isinstance(s, str):
        return s.encode("utf-8", errors="replace").decode("utf-8")
    return str(s)

# Display before/after comparison
print("=" * 70)
print("\U0001f4ca GEPA Optimization Results")
print("=" * 70)

print("\n\U0001f4c8 Score Improvement:")
if result.initial_eval_score is not None:
    print(f"   Initial score: {result.initial_eval_score:.3f}")
else:
    print("   Initial score: N/A")
if result.final_eval_score is not None:
    print(f"   Final score:   {result.final_eval_score:.3f}")
else:
    print("   Final score:   N/A")
if result.initial_eval_score is not None and result.final_eval_score is not None:
    improvement = result.final_eval_score - result.initial_eval_score
    print(f"   Improvement:   {improvement:+.3f}")

# Load the optimized prompt directly from the registry to ensure we
# see the actual registered version (not just the in-memory object)
optimized = result.optimized_prompts[0]
registry_prompt = mlflow.genai.load_prompt(f"prompts:/{optimized.name}/{optimized.version}")

print(f"\n\U0001f4dd Original Prompt (version {baseline_prompt.version}):")
print(f"   '{baseline_prompt.template}'")

print(f"\n\U0001f680 Optimized Prompt (version {optimized.version}):")
print(f"   '{_safe(registry_prompt.template)}'")

if baseline_prompt.template.strip() == _safe(registry_prompt.template).strip():
    print("\n\u26a0\ufe0f  Note: The optimized template is identical to the baseline.")
    print("   This can happen when the baseline already scores well on the")
    print("   training data. Try adding harder examples or increasing the budget.")

print("\n\U0001f517 The optimized prompt has been automatically registered")
print(f"   as version {optimized.version} in the Prompt Registry!")
print(f"   View it in MLflow UI \u2192 Prompt Registry \u2192 {_safe(optimized.name)}")

print("\n" + "=" * 70)
print("\n\U0001f4a1 Key Takeaway:")
print("   GEPA can automatically add the structure, instructions,")
print("   and constraints that we would normally write by hand.")
print("   Combined with the Prompt Registry, optimized prompts are")
print("   versioned and ready for deployment via aliases.")
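
When the optimized template is long, a line-level diff is easier to read than two printed strings. The sketch below uses the standard library's `difflib`. The `optimized_template` value is invented purely for illustration and is not real GEPA output; in practice it would come from the registered prompt's template.

```python
import difflib

original = "Answer this question: {{ question }}"
# Hypothetical optimized template, made up for this example.
optimized_template = (
    "Answer this question: {{ question }}\n"
    "Follow any format or length constraints in the question exactly."
)

diff = list(
    difflib.unified_diff(
        original.splitlines(),
        optimized_template.splitlines(),
        fromfile="baseline",
        tofile="optimized",
        lineterm="",
    )
)
print("\n".join(diff))

# Collect only the lines the optimizer added (skip the "+++" file header).
added = [
    line[1:] for line in diff
    if line.startswith("+") and not line.startswith("+++")
]
```

An empty `diff` list means the two templates are identical line for line, which corresponds to the "identical to the baseline" warning printed above.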

---
## Summary

In this notebook, you learned:

1. How **GEPA** automatically optimizes prompts through evaluate-reflect-mutate cycles
2. Using `mlflow.genai.optimize_prompts()` with the **Prompt Registry**
3. Preparing **training data** and a **predict function** for optimization
4. Comparing **before/after** prompt quality with the `Correctness` scorer
5. How optimized prompts are **automatically versioned** in the registry

### Key Takeaways

- **Automate what you can**: GEPA systematically improves prompts that would take many manual iterations
- **Data-driven optimization**: Training examples define what "good" looks like for the algorithm
- **Registry integration**: Optimized prompts flow directly into the Prompt Registry for versioning and deployment
- **Combine approaches**: Use GEPA for initial optimization, then fine-tune manually if needed

### What's Next?

**📓 Notebook 1.9: Complete RAG Application**

Learn how to:
- Build a full RAG pipeline with end-to-end tracing
- Evaluate RAG quality with RAGAS metrics
- Track performance, cost, and retrieval quality