# Prompt Evolver - Interactive Notebook

**Automated prompt optimization and testing for LLM workflows**

This notebook guides you through optimizing prompts using the Prompt Evolver pipeline. You'll:
1. Configure your environment and LLM backend
2. Run the optimization pipeline on example data
3. Analyze results and improvements

**Prerequisites:** Python 3.10+, virtual environment activated

**Time to complete:** ~5 minutes

## 1. Environment Setup
---

**Option A: Automated Setup (Recommended)**
```bash
./setup.sh  # Unix/macOS/Linux
setup.bat   # Windows
```

**Option B: Manual Setup**
```bash
python -m venv .venv
source .venv/bin/activate  # Unix/macOS
# .venv\Scripts\activate    # Windows
pip install -r requirements.txt
```

**Important:** After setup, select the `.venv` Python interpreter as your notebook kernel.

## 2. Configuration
---

Configure paths, model names, and pipeline parameters below.

In [1]:
from pathlib import Path
import sys

# === 2.1 Auto-detect project root ===
def find_project_root(start: Path) -> Path:
    """Locate project root by searching for src/prompt_evolver."""
    for parent in [start, *start.parents]:
        if (parent / "src" / "prompt_evolver").exists():
            return parent
    raise RuntimeError("Could not locate project root containing src/prompt_evolver")

project_root = find_project_root(Path.cwd())
src_path = project_root / "src"
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

print(f"Project root: {project_root}")

# === 2.2 Data paths ===
DATA_DIR = project_root / "data"
CONFIG_DIR = project_root / "configs"

# Example datasets (pre-configured for testing)
EXAMPLE_PROMPTS = DATA_DIR / "example.prompts.csv"
EXAMPLE_TEXTS = DATA_DIR / "example.texts.csv"
EXAMPLE_TASKS = DATA_DIR / "example.tasks.csv"
EXAMPLE_RESULTS = DATA_DIR / "example.results.csv"

# Your custom datasets (optional)
CUSTOM_PROMPTS = DATA_DIR / "prompts.csv"
CUSTOM_TEXTS = DATA_DIR / "texts.csv"
CUSTOM_TASKS = DATA_DIR / "tasks.csv"
CUSTOM_RESULTS = DATA_DIR / "results.csv"

# Prompt templates
EXECUTION_TEMPLATE = CONFIG_DIR / "prompts" / "prompt.execute.md"
IMPROVEMENT_TEMPLATE = CONFIG_DIR / "prompts" / "prompt.improve.md"
EVALUATION_TEMPLATE = CONFIG_DIR / "prompts" / "prompt.evaluation.md"

# Configuration file
CONFIG_FILE = CONFIG_DIR / "config.yaml"

# === 2.3 Model configuration ===
# Update these based on your LLM backend:
# - Ollama/LM Studio: "mistralai/ministral-3-3b", "llama3.2", etc.
# - OpenAI API: "gpt-4", "gpt-3.5-turbo", etc.

EXECUTION_MODEL = "mistralai/ministral-3-3b"
IMPROVEMENT_MODEL = "mistralai/ministral-3-3b"
EVALUATION_MODEL = "mistralai/ministral-3-3b"

# === 2.4 Pipeline parameters ===
MAX_GENERATIONS = 3  # Maximum improvement iterations per task
USE_EXAMPLE_DATA = True  # Toggle between example and custom data

print(f"\nConfiguration loaded:")
print(f"  Data source: {'Example' if USE_EXAMPLE_DATA else 'Custom'}")
print(f"  Max generations: {MAX_GENERATIONS}")
print(f"  Models: {EXECUTION_MODEL}")

Project root: /Users/bru1t/Documents/Development/projects/prompt-evolver

Configuration loaded:
  Data source: Example
  Max generations: 3
  Models: mistralai/ministral-3-3b


## 3. Run Pipeline
---

Execute the optimization pipeline. This will:
- Load prompts, texts, and tasks from CSV files
- Run each task and evaluate outputs
- Iteratively improve prompts based on feedback
- Generate a results CSV with before/after comparison

**Expected runtime:** ~2-3 minutes (depends on LLM speed)

**Note:** If you modified code in `src/`, restart the kernel before running this cell.

In [4]:
from prompt_evolver.config import load_config
from prompt_evolver.pipeline import run_pipeline

# Select data source
if USE_EXAMPLE_DATA:
    prompts_path, texts_path = EXAMPLE_PROMPTS, EXAMPLE_TEXTS
    tasks_path, results_path = EXAMPLE_TASKS, EXAMPLE_RESULTS
else:
    prompts_path, texts_path = CUSTOM_PROMPTS, CUSTOM_TEXTS
    tasks_path, results_path = CUSTOM_TASKS, CUSTOM_RESULTS

print(f"Running pipeline with {prompts_path.name}...\n")

# Load configuration and templates
config = load_config(CONFIG_FILE)
execution_template = EXECUTION_TEMPLATE.read_text(encoding="utf-8")
improvement_template = IMPROVEMENT_TEMPLATE.read_text(encoding="utf-8")
evaluation_template = EVALUATION_TEMPLATE.read_text(encoding="utf-8")

# Run pipeline
results_df = run_pipeline(
    prompts_path,
    texts_path,
    tasks_path,
    results_path,
    config=config,
    execution_prompt_template=execution_template,
    improvement_prompt_template=improvement_template,
    evaluation_prompt_template=evaluation_template,
    max_generations=MAX_GENERATIONS,
    execution_model=EXECUTION_MODEL,
    improvement_model=IMPROVEMENT_MODEL,
    evaluation_model=EVALUATION_MODEL,
)

print(f"\nPipeline complete! Results saved to: {results_path}")
results_df.head()

2026-01-23 15:53:15,578 [INFO] Start pipeline prompts=/Users/bru1t/Documents/Development/projects/prompt-evolver/data/example.prompts.csv texts=/Users/bru1t/Documents/Development/projects/prompt-evolver/data/example.texts.csv tasks=/Users/bru1t/Documents/Development/projects/prompt-evolver/data/example.tasks.csv output=/Users/bru1t/Documents/Development/projects/prompt-evolver/data/example.results.csv
2026-01-23 15:53:15,617 [INFO] Processing 16 tasks...
2026-01-23 15:53:15,617 [INFO] Task 1/16 (6%) - Processing task_id=task_001
2026-01-23 15:53:15,618 [INFO] Start task id_task=task_001 id_prompt=prompt_001 id_text=text_001 type=Writing


Running pipeline with example.prompts.csv...



2026-01-23 15:53:22,249 [INFO] Iteration start task=task_001 iter=1
2026-01-23 15:53:25,439 [INFO] Accept prompt task=task_001 iter=1 score=0.95 tokens_delta=6
2026-01-23 15:53:25,439 [INFO] End task id_task=task_001 status=pass iterations_used=1
2026-01-23 15:53:25,439 [INFO] Task 1/16 (6%) - Completed: iterations=1, token_delta=+6, leakage=NO
2026-01-23 15:53:25,440 [INFO] Task 2/16 (12%) - Processing task_id=task_002
2026-01-23 15:53:25,440 [INFO] Start task id_task=task_002 id_prompt=prompt_002 id_text=text_002 type=Extraction
2026-01-23 15:53:27,138 [INFO] Iteration start task=task_002 iter=1
2026-01-23 15:53:28,390 [INFO] Accept prompt task=task_002 iter=1 score=1.00 tokens_delta=27
2026-01-23 15:53:28,391 [INFO] End task id_task=task_002 status=pass iterations_used=1
2026-01-23 15:53:28,392 [INFO] Task 2/16 (12%) - Completed: iterations=1, token_delta=+27, leakage=NO
2026-01-23 15:53:28,392 [INFO] Task 3/16 (18%) - Processing task_id=task_003
2026-01-23 15:53:28,393 [INFO] Start


Pipeline complete! Results saved to: /Users/bru1t/Documents/Development/projects/prompt-evolver/data/example.results.csv


Unnamed: 0,id_task,id_text,id_prompt,prompt_original,prompt_improved,tokens_original,tokens_improved,tokens_delta,iterations_used,output_original,...,output_tokens_original,output_tokens_improved,evaluation_original,evaluation_improved,model_task,model_improve,model_eval,leakage_flag,sanity_check_details,failure_reason
0,task_001,text_001,prompt_001,Summarize the text in 3 bullet points.,Summarize **{text}** in 3 clear bullet points.,8,14,6,1,- **Iterative prompt refinement**: Prompt Evol...,...,91,87,"{""pass"": false, ""score"": 0.0, ""issues"": [""inva...","{""pass"": true, ""score"": 0.95, ""issues"": [""Mino...",,,,no,,
1,task_002,text_002,prompt_002,Extract the user's name and status in JSON.,Extract `{user_name}` and `status` from `{text...,11,38,27,1,"```json\n{\n ""name"": ""Alice Nguyen"",\n ""stat...",...,39,25,"{""pass"": false, ""score"": 0.0, ""issues"": [""Miss...","{""pass"": true, ""score"": 1.0, ""issues"": [], ""su...",,,,no,,
2,task_003,text_003,prompt_003,Rewrite the text to be clearer and shorter.,Rewrite the text to be clearer and shorter.,9,9,0,2,The product delivers significantly faster perf...,...,61,61,"{""pass"": false, ""score"": 0.65, ""issues"": [""The...","{""pass"": false, ""score"": 0.65, ""issues"": [""The...",,,,no,,no_improvement
3,task_004,text_004,prompt_004,Compare the two products in the text and list ...,Compare the two products in the text and list ...,12,12,0,2,1. **Warranty Duration**: Product A has a **2-...,...,105,105,"{""pass"": false, ""score"": 0.6, ""issues"": [""Miss...","{""pass"": false, ""score"": 0.6, ""issues"": [""Miss...",,,,no,too_large_increase,no_improvement
4,task_005,text_005,prompt_005,Evaluate the text for policy compliance and an...,Evaluate the text for policy compliance and an...,12,12,0,1,pass,...,1,1,"{""pass"": true, ""score"": 1.0, ""issues"": [], ""su...","{""pass"": true, ""score"": 1.0, ""issues"": [], ""su...",,,,no,,


## 4. Analyze Results
---

Review the optimization results and compare before/after performance.

In [2]:
import pandas as pd
import json
import logging

# === Configure logging (same format as pipeline) ===
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
)
logger = logging.getLogger("prompt_evolver")

# === Load Results ===
logger.info("Start analysis results_path=%s", results_path if 'results_path' in dir() else 'not_set')

results_path = EXAMPLE_RESULTS if USE_EXAMPLE_DATA else CUSTOM_RESULTS
logger.info("Loading results file=%s", results_path)

try:
    results = pd.read_csv(results_path)
    logger.info("Loaded results rows=%d columns=%d", len(results), len(results.columns))
except FileNotFoundError:
    logger.error("Results file not found path=%s", results_path)
    raise
except Exception as e:
    logger.error("Failed to load results error=%s", e)
    raise

# === Summary Statistics ===
logger.info("Computing summary statistics...")

print("=" * 70)
print("SUMMARY STATISTICS")
print("=" * 70)

total_tasks = len(results)
logger.info("Total tasks count=%d", total_tasks)

successful = (results['failure_reason'] == '').sum()
logger.info("Successful optimizations count=%d rate=%.1f%%", successful, successful/total_tasks*100)

with_leakage = (results['leakage_flag'] == 'yes').sum()
logger.info("Leakage detected count=%d", with_leakage)

avg_iterations = results['iterations_used'].mean()
logger.info("Average iterations avg=%.2f", avg_iterations)

total_token_delta = results['tokens_delta'].sum()
tokens_original_sum = results['tokens_original'].sum()
logger.info("Token delta total=%+d original_sum=%d", total_token_delta, tokens_original_sum)

print(f"Total tasks processed: {total_tasks}")
print(f"Successful optimizations: {successful} ({successful/total_tasks*100:.1f}%)")
print(f"Tasks flagged for leakage: {with_leakage}")
print(f"Average iterations per task: {avg_iterations:.1f}")
print(f"Total token change: {total_token_delta:+d} tokens")

if total_token_delta < 0:
    savings_pct = abs(total_token_delta) / tokens_original_sum * 100
    print(f"Token savings: {abs(total_token_delta)} tokens ({savings_pct:.1f}% reduction)")
    logger.info("Token savings tokens=%d reduction=%.1f%%", abs(total_token_delta), savings_pct)
elif total_token_delta > 0:
    increase_pct = total_token_delta / tokens_original_sum * 100
    print(f"Token increase: {total_token_delta} tokens (+{increase_pct:.1f}%)")
    logger.info("Token increase tokens=%d increase=%.1f%%", total_token_delta, increase_pct)
else:
    logger.info("Token delta unchanged total=0")

logger.info("Summary statistics complete")

# === Before/After Comparison (First Task) ===
logger.info("Generating before/after comparison...")

print("\n" + "=" * 70)
print("EXAMPLE: BEFORE/AFTER COMPARISON (Task 1)")
print("=" * 70)

if len(results) > 0:
    task = results.iloc[0]
    task_id = task.get('id_task', 'unknown')
    logger.info("Processing comparison task_id=%s", task_id)
    
    print("\nORIGINAL PROMPT:")
    print("-" * 70)
    prompt_original_len = len(task['prompt_original'])
    prompt_preview = task['prompt_original'][:150] + "..." if prompt_original_len > 150 else task['prompt_original']
    print(prompt_preview)
    logger.info("Original prompt chars=%d", prompt_original_len)
    
    print("\nIMPROVED PROMPT:")
    print("-" * 70)
    prompt_improved_len = len(task['prompt_improved'])
    improved_preview = task['prompt_improved'][:150] + "..." if prompt_improved_len > 150 else task['prompt_improved']
    print(improved_preview)
    logger.info("Improved prompt chars=%d", prompt_improved_len)
    
    print("\nMETRICS:")
    print("-" * 70)
    print(f"  Tokens: {task['tokens_original']} → {task['tokens_improved']} (Δ{task['tokens_delta']:+d})")
    print(f"  Iterations: {task['iterations_used']}")
    print(f"  Leakage detected: {task['leakage_flag']}")
    logger.info(
        "Task metrics task_id=%s tokens_original=%d tokens_improved=%d delta=%+d iterations=%d leakage=%s",
        task_id, task['tokens_original'], task['tokens_improved'], 
        task['tokens_delta'], task['iterations_used'], task['leakage_flag']
    )
    
    # Parse evaluation scores
    logger.info("Parsing evaluation scores task_id=%s", task_id)
    try:
        eval_orig = json.loads(task['evaluation_original'])
        eval_improved = json.loads(task['evaluation_improved'])
        print(f"  Original score: {eval_orig['score']:.2f}")
        print(f"  Improved score: {eval_improved['score']:.2f}")
        score_delta = eval_improved['score'] - eval_orig['score']
        logger.info(
            "Evaluation scores task_id=%s original=%.2f improved=%.2f delta=%+.2f",
            task_id, eval_orig['score'], eval_improved['score'], score_delta
        )
    except json.JSONDecodeError as e:
        logger.warning("JSON parse failed task_id=%s error=%s", task_id, e)
    except KeyError as e:
        logger.warning("Missing evaluation key task_id=%s key=%s", task_id, e)
    except Exception as e:
        logger.warning("Evaluation parse error task_id=%s error=%s", task_id, e)
    
    logger.info("Comparison complete task_id=%s", task_id)
else:
    logger.warning("No results available for comparison rows=0")

# === Detailed Results Table ===
logger.info("Preparing detailed results table...")

print("\n" + "=" * 70)
print("DETAILED RESULTS")
print("=" * 70)
print()

columns_to_display = ['id_task', 'tokens_original', 'tokens_improved', 
                      'tokens_delta', 'iterations_used', 'leakage_flag']
rows_to_show = min(20, len(results))
logger.info("Displaying results columns=%d rows=%d/%d", len(columns_to_display), rows_to_show, len(results))

display(results[columns_to_display].head(20))

logger.info("Analysis complete total_tasks=%d", total_tasks)

2026-01-23 16:06:11,896 [INFO] Start analysis results_path=not_set
2026-01-23 16:06:11,897 [INFO] Loading results file=/Users/bru1t/Documents/Development/projects/prompt-evolver/data/example.results.csv
2026-01-23 16:06:11,900 [INFO] Loaded results rows=16 columns=21
2026-01-23 16:06:11,900 [INFO] Computing summary statistics...
2026-01-23 16:06:11,900 [INFO] Total tasks count=16
2026-01-23 16:06:11,901 [INFO] Successful optimizations count=0 rate=0.0%
2026-01-23 16:06:11,901 [INFO] Leakage detected count=0
2026-01-23 16:06:11,913 [INFO] Average iterations avg=1.81
2026-01-23 16:06:11,914 [INFO] Token delta total=+70 original_sum=143
2026-01-23 16:06:11,914 [INFO] Token increase tokens=70 increase=49.0%
2026-01-23 16:06:11,915 [INFO] Summary statistics complete
2026-01-23 16:06:11,915 [INFO] Generating before/after comparison...
2026-01-23 16:06:11,916 [INFO] Processing comparison task_id=task_001
2026-01-23 16:06:11,916 [INFO] Original prompt chars=38
2026-01-23 16:06:11,916 [INFO] Im

SUMMARY STATISTICS
Total tasks processed: 16
Successful optimizations: 0 (0.0%)
Tasks flagged for leakage: 0
Average iterations per task: 1.8
Total token change: +70 tokens
Token increase: 70 tokens (+49.0%)

EXAMPLE: BEFORE/AFTER COMPARISON (Task 1)

ORIGINAL PROMPT:
----------------------------------------------------------------------
Summarize the text in 3 bullet points.

IMPROVED PROMPT:
----------------------------------------------------------------------
Summarize **{text}** in 3 clear bullet points.

METRICS:
----------------------------------------------------------------------
  Tokens: 8 → 14 (Δ+6)
  Iterations: 1
  Leakage detected: no
  Original score: 0.00
  Improved score: 0.95

DETAILED RESULTS



Unnamed: 0,id_task,tokens_original,tokens_improved,tokens_delta,iterations_used,leakage_flag
0,task_001,8,14,6,1,no
1,task_002,11,38,27,1,no
2,task_003,9,9,0,2,no
3,task_004,12,12,0,2,no
4,task_005,12,12,0,1,no
5,task_006,13,50,37,2,no
6,task_007,11,11,0,2,no
7,task_008,9,9,0,2,no
8,task_009,7,7,0,2,no
9,task_010,7,7,0,2,no


2026-01-23 16:06:11,923 [INFO] Analysis complete total_tasks=16


## 5. Next Steps
---

### Use Your Own Data

To optimize your custom prompts:
1. Create CSV files: `data/prompts.csv`, `data/texts.csv`, `data/tasks.csv`
2. Set `USE_EXAMPLE_DATA = False` in Section 2
3. Re-run the pipeline (Section 3)

See [docs/data-model.md](../docs/data-model.md) for CSV format details.

### Run Tests

Validate the codebase with unit tests:
```bash
pytest
```

### Documentation

- [Overview](../docs/overview.md) - Core concepts and workflow
- [Pipeline](../docs/pipeline.md) - Optimization loop details  
- [Config](../docs/config.md) - Configuration reference
- [Prompts](../docs/prompts.md) - Prompt template guide