# GEPA Skill Optimization Demo

This notebook demonstrates how the skill-test framework uses [GEPA](https://github.com/gepa-ai/gepa) to automatically optimize Databricks SKILL.md files for **quality** and **token efficiency**.

SKILL.md files teach AI agents (like Claude Code) Databricks patterns. Every token in a skill consumes part of the agent's context window, so a skill should be as concise as it can be without losing quality.

**What GEPA does:**
1. Scores the current SKILL.md against deterministic scorers (syntax, patterns, APIs, facts)
2. Reflects on failures and proposes mutations to improve the skill
3. Selects the best candidate via Pareto frontier optimization
4. Repeats until quality converges or budget is exhausted
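To make step 3 concrete, here is a minimal sketch of Pareto-frontier selection. It assumes two objectives, quality (higher is better) and token count (lower is better). That choice of axes, the `Candidate` type, and the sample numbers are illustrative assumptions; GEPA's own frontier may be defined differently, for example over per-task scores.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    quality: float  # higher is better
    tokens: int     # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.quality >= b.quality and a.tokens <= b.tokens
    strictly_better = a.quality > b.quality or a.tokens < b.tokens
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Made-up candidates, for illustration only
candidates = [
    Candidate("seed", quality=0.70, tokens=4000),
    Candidate("shorter", quality=0.70, tokens=3000),
    Candidate("better", quality=0.85, tokens=3500),
    Candidate("bloated", quality=0.80, tokens=5000),
]
front = pareto_front(candidates)
print([c.name for c in front])  # ['shorter', 'better']
```

Here `seed` is dropped because `shorter` matches its quality with fewer tokens, and `bloated` is dropped because `better` beats it on both axes. The two survivors represent different trade-offs, which is why a frontier is kept rather than a single winner.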

## Setup

In [None]:
import sys
from pathlib import Path

# Add skill-test to path
repo_root = Path(".").resolve()
while not (repo_root / ".test" / "src").exists() and repo_root != repo_root.parent:
    repo_root = repo_root.parent
sys.path.insert(0, str(repo_root / ".test" / "src"))

print(f"Repo root: {repo_root}")

In [None]:
import os

# Configure the reflection model -- pick ONE:

# Option A: Databricks Model Serving (default, recommended)
# IMPORTANT: DATABRICKS_API_BASE must end with /serving-endpoints
# os.environ["DATABRICKS_API_KEY"] = "dapi..."  
# os.environ["DATABRICKS_API_BASE"] = "https://<workspace>.cloud.databricks.com/serving-endpoints"
# os.environ["GEPA_REFLECTION_LM"] = "databricks/databricks-gpt-5-2"

# Option B: OpenAI
# os.environ["OPENAI_API_KEY"] = "sk-..."
# os.environ["GEPA_REFLECTION_LM"] = "openai/gpt-4o"

print(f"Reflection LM: {os.environ.get('GEPA_REFLECTION_LM', 'databricks/databricks-gpt-5-2 (default)')}")
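The `/serving-endpoints` requirement noted in Option A is easy to get wrong, so a small check before running can help. The helper below is a hypothetical convenience, not part of skill-test, and the example host is a placeholder. Stripping a trailing slash is also an assumption about what the client accepts.

```python
def check_api_base(api_base: str) -> str:
    """Return api_base without a trailing slash, or raise if the suffix is wrong."""
    normalized = api_base.rstrip("/")
    if not normalized.endswith("/serving-endpoints"):
        raise ValueError(
            f"DATABRICKS_API_BASE must end with /serving-endpoints, got: {api_base!r}"
        )
    return normalized


# Placeholder host, for illustration only
print(check_api_base("https://example.cloud.databricks.com/serving-endpoints/"))
```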

## Step 1: Inspect the Skill

Let's look at the `databricks-model-serving` skill -- its current size and test cases. The baseline score follows in Step 2.

In [None]:
SKILL_NAME = "databricks-model-serving"

from skill_test.optimize.evaluator import _find_skill_md, count_tokens
from skill_test.optimize.splitter import create_gepa_datasets

# Load skill
skill_path = _find_skill_md(SKILL_NAME)
original_content = skill_path.read_text()
original_tokens = count_tokens(original_content)

# Load test cases
train, val = create_gepa_datasets(SKILL_NAME)

print(f"Skill:        {SKILL_NAME}")
print(f"Path:         {skill_path}")
print(f"Lines:        {len(original_content.splitlines())}")
print(f"Tokens:       {original_tokens:,}")
print(f"Train cases:  {len(train)}")
print(f"Val cases:    {len(val) if val else 'None'}")

In [None]:
# Show first few test cases
for t in train[:3]:
    print(f"\n--- {t['id']} ---")
    print(f"Prompt: {t['input'][:100]}...")
    if t.get('answer'):
        print(f"Answer: {t['answer'][:100]}...")

## Step 2: Evaluate Current Quality (Baseline)

Before optimizing, measure the current skill quality using the scorer pipeline.

In [None]:
from skill_test.optimize.evaluator import create_skill_evaluator, SKILL_KEY
from skill_test.optimize.splitter import to_gepa_instances

evaluator = create_skill_evaluator(SKILL_NAME)
seed_candidate = {SKILL_KEY: original_content}

# Evaluate on all train tasks
gepa_instances = to_gepa_instances(train)

print(f"{'Task ID':<35} {'Score':>7}  Status")
print("-" * 50)
scores = []
for task, inst in zip(train, gepa_instances):
    # The evaluator returns (score, side_info); only the score is needed here.
    score, _side_info = evaluator(seed_candidate, inst)
    scores.append(score)
    status = 'PASS' if score >= 0.5 else 'FAIL'
    print(f"{task['id']:<35} {score:>7.3f}  {status}")

# Baseline is the mean of the scores collected above (no second evaluation pass)
baseline_score = sum(scores) / len(scores) if scores else 0.0
print(f"\nBaseline Score: {baseline_score:.3f}")
print(f"Token Count:    {original_tokens:,}")

## Step 3: Run GEPA Optimization

Now run the optimization. GEPA will:
- Use the current SKILL.md as the seed candidate
- Run scorers against each test case
- Reflect on failures to propose mutations
- Select the best candidate via Pareto frontier
- Penalize token bloat (80% quality, 20% efficiency weighting)
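The 80/20 weighting in the last bullet can be pictured as a weighted sum. The sketch below is one plausible form, not the framework's actual formula: the efficiency term, the function name, and its parameters are all assumptions made for illustration.

```python
def combined_score(quality: float, tokens: int, baseline_tokens: int,
                   quality_weight: float = 0.8) -> float:
    """Blend quality with a token-efficiency term (hypothetical formula)."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    # Efficiency is 1.0 at or below the baseline size and decays as the skill grows.
    efficiency = min(1.0, baseline_tokens / max(tokens, 1))
    return quality_weight * quality + (1.0 - quality_weight) * efficiency


# Same quality, but the second candidate is twice the baseline size
print(round(combined_score(0.9, tokens=4000, baseline_tokens=4000), 3))  # 0.92
print(round(combined_score(0.9, tokens=8000, baseline_tokens=4000), 3))  # 0.82
```

Under this assumed formula, doubling the token count at equal quality costs 0.1 of combined score, so a mutation that adds text has to buy a real quality gain to be worth keeping.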

In [None]:
from skill_test.optimize.runner import optimize_skill

result = optimize_skill(
    skill_name=SKILL_NAME,
    mode="static",
    preset="quick",  # 15 iterations -- increase to "standard" (50) or "thorough" (150) for better results
)

print("Optimization complete!")
print(f"GEPA metric calls: {result.gepa_result.total_metric_calls}")
print(f"Candidates explored: {result.gepa_result.num_candidates}")

## Step 4: Results Comparison

Compare the original vs. optimized skill across quality and token efficiency.

In [None]:
print("=" * 60)
print(f"  OPTIMIZATION RESULTS: {SKILL_NAME}")
print("=" * 60)
print()

# Quality comparison
quality_delta = result.improvement
quality_pct = (quality_delta / result.original_score * 100) if result.original_score > 0 else 0
print("  Quality Score")
print(f"    Before:      {result.original_score:.3f}")
print(f"    After:       {result.optimized_score:.3f}")
print(f"    Delta:       {quality_delta:+.3f} ({quality_pct:+.1f}%)")
print()

# Token comparison
token_delta = result.original_token_count - result.optimized_token_count
print("  Token Count")
print(f"    Before:      {result.original_token_count:,}")
print(f"    After:       {result.optimized_token_count:,}")
print(f"    Saved:       {token_delta:,} tokens ({result.token_reduction_pct:.1f}% reduction)")
print()

# Line count comparison
orig_lines = len(result.original_content.splitlines())
opt_lines = len(result.optimized_content.splitlines())
print("  Lines")
print(f"    Before:      {orig_lines}")
print(f"    After:       {opt_lines}")
print(f"    Saved:       {orig_lines - opt_lines} lines")
print()

# Validation scores
if result.val_scores:
    avg_val = sum(result.val_scores.values()) / len(result.val_scores)
    print("  Validation (held-out test cases)")
    for tid, score in result.val_scores.items():
        print(f"    {tid}: {score:.3f}")
    print(f"    Average: {avg_val:.3f}")

print()
print("=" * 60)

In [None]:
# Visual comparison bar chart
try:
    import matplotlib.pyplot as plt
    import matplotlib
    matplotlib.rcParams['font.family'] = 'monospace'

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Quality scores
    ax = axes[0]
    bars = ax.bar(
        ['Before', 'After'],
        [result.original_score, result.optimized_score],
        color=['#d4534b', '#4a9c5d'],
        width=0.5
    )
    ax.set_ylim(0, 1.1)
    ax.set_ylabel('Quality Score')
    ax.set_title(f'Quality: {result.original_score:.3f} → {result.optimized_score:.3f}')
    for bar, val in zip(bars, [result.original_score, result.optimized_score]):
        ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.02,
                f'{val:.3f}', ha='center', fontweight='bold')

    # Token counts
    ax = axes[1]
    bars = ax.bar(
        ['Before', 'After'],
        [result.original_token_count, result.optimized_token_count],
        color=['#d4534b', '#4a9c5d'],
        width=0.5
    )
    ax.set_ylabel('Token Count')
    ax.set_title(f'Tokens: {result.original_token_count:,} → {result.optimized_token_count:,} ({result.token_reduction_pct:.0f}% reduction)')
    for bar, val in zip(bars, [result.original_token_count, result.optimized_token_count]):
        ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 50,
                f'{val:,}', ha='center', fontweight='bold')

    fig.suptitle(f'GEPA Optimization: {SKILL_NAME}', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
except ImportError:
    print("(matplotlib not installed -- skipping chart)")

## Step 5: Review the Diff

Inspect what GEPA changed in the SKILL.md.
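Independently of `review_optimization`, whose output is not shown in this notebook, a plain unified diff can be produced with the standard library's `difflib`. The helper name `skill_diff` and the sample strings are illustrative assumptions.

```python
import difflib


def skill_diff(original: str, optimized: str, name: str = "SKILL.md") -> str:
    """Return a unified diff between two versions of a skill file."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        optimized.splitlines(keepends=True),
        fromfile=f"{name} (original)",
        tofile=f"{name} (optimized)",
    )
    return "".join(diff)


# Made-up content, for illustration only
before = "# Serving\nUse the SDK.\nAlways set a timeout.\n"
after = "# Serving\nUse the SDK with a timeout.\n"
print(skill_diff(before, after))
```

With the objects from this notebook, the call would be `skill_diff(result.original_content, result.optimized_content)`.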

In [None]:
from skill_test.optimize.review import review_optimization

review_optimization(result)

## Step 6: Apply (Optional)

If the results look good, apply the optimized SKILL.md. Uncomment the cell below to write it.

In [None]:
# Uncomment to apply:
# from skill_test.optimize.review import apply_optimization
# apply_optimization(result)

## Multi-Component Optimization: Skills + Tools

GEPA supports optimizing multiple text components simultaneously. You can optimize SKILL.md files **alongside** MCP tool descriptions in a single run.

GEPA's `RoundRobinReflectionComponentSelector` cycles through components one at a time, so each gets dedicated reflection and mutation.
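The round-robin idea can be sketched with `itertools.cycle`. This is an illustration of the scheduling pattern only, not GEPA's implementation of `RoundRobinReflectionComponentSelector`, and the component names are hypothetical.

```python
from itertools import cycle, islice
from typing import List


def round_robin(component_names: List[str], iterations: int) -> List[str]:
    """Return which component would be selected at each iteration."""
    return list(islice(cycle(component_names), iterations))


# Hypothetical component names: one skill plus two tool modules
schedule = round_robin(["skill_md", "tools_serving", "tools_sql"], 5)
print(schedule)
# ['skill_md', 'tools_serving', 'tools_sql', 'skill_md', 'tools_serving']
```

One consequence of this pattern is that each component gets roughly `iterations / n` turns, so adding components to a run spreads the same budget more thinly.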

In [None]:
# Inspect available MCP tools
from skill_test.optimize.tools import get_tool_stats, extract_tool_descriptions, tools_to_gepa_components

stats = get_tool_stats()
print(f"MCP Tool Modules: {stats['modules']}")
print(f"Total Tools:      {stats['total_tools']}")
print(f"Total Chars:      {stats['total_description_chars']:,}")
print()
for mod, info in stats["per_module"].items():
    print(f"  {mod:<20} {info['tools']:>2} tools  {info['chars']:>6,} chars")

# Show what GEPA components look like for selected modules
tool_map = extract_tool_descriptions(modules=["serving", "sql"])
components = tools_to_gepa_components(tool_map, per_module=True)
print(f"\nGEPA components for serving + sql: {list(components.keys())}")
from skill_test.optimize.evaluator import count_tokens  # imported once, outside the loop

for name, text in components.items():
    print(f"  {name}: {count_tokens(text):,} tokens")

## Changing the Reflection Model

By default, GEPA uses `databricks/databricks-gpt-5-2` via Databricks Model Serving.
Override per-call or via environment variable:

```python
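import os

from skill_test.optimize.runner import optimize_skill

# Per-call
result = optimize_skill(skill_name="my-skill", reflection_lm="openai/gpt-4o")

# Environment variable (persists for the rest of the session)
os.environ["GEPA_REFLECTION_LM"] = "databricks/databricks-gpt-5-2"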
```

See README.md for full model configuration options.
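One way to picture how the two overrides and the default fit together is the resolution function below. The precedence order (per-call argument, then environment variable, then default) is an assumption, and `resolve_reflection_lm` is a hypothetical helper rather than a function from skill-test.

```python
import os
from typing import Mapping, Optional

DEFAULT_REFLECTION_LM = "databricks/databricks-gpt-5-2"


def resolve_reflection_lm(reflection_lm: Optional[str] = None,
                          env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the model name: explicit argument, then env var, then default."""
    env = os.environ if env is None else env
    return reflection_lm or env.get("GEPA_REFLECTION_LM") or DEFAULT_REFLECTION_LM


# An explicit argument wins over the environment in this sketch
print(resolve_reflection_lm("openai/gpt-4o", env={"GEPA_REFLECTION_LM": "other/model"}))
# With nothing set, the default is used
print(resolve_reflection_lm(env={}))
```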

## Summary

The `result` object returned by `optimize_skill` exposes the before/after metrics:

| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Quality Score | `result.original_score` | `result.optimized_score` | `result.improvement` |
| Token Count | `result.original_token_count` | `result.optimized_token_count` | `result.token_reduction_pct`% |

Key points:
- **Quality gate**: Existing scorers (syntax, patterns, APIs, facts) are reused as-is
- **Token efficiency**: 80/20 quality/efficiency weighting penalizes bloated skills
- **Validation split**: Held-out test cases detect overfitting
- **Reflection LM**: Configurable via the `reflection_lm` argument, the `--reflection-lm` flag, or the `GEPA_REFLECTION_LM` env var
- **Default model**: `databricks/databricks-gpt-5-2` via Databricks Model Serving
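As a rough illustration of the validation-split point, the gap between mean train and mean validation scores can be computed directly. The helper name and the sample scores are assumptions, and how large a gap counts as overfitting is a judgment call.

```python
from typing import List


def generalization_gap(train_scores: List[float], val_scores: List[float]) -> float:
    """Mean train score minus mean validation score."""
    if not train_scores or not val_scores:
        raise ValueError("both score lists must be non-empty")
    train_mean = sum(train_scores) / len(train_scores)
    val_mean = sum(val_scores) / len(val_scores)
    return train_mean - val_mean


# Made-up scores: strong on train, noticeably weaker on held-out cases
gap = generalization_gap([0.9, 0.95, 0.85], [0.6, 0.7])
print(f"gap = {gap:.2f}")  # gap = 0.25
```

With the notebook's objects, the validation side would come from `list(result.val_scores.values())`. A large positive gap suggests the optimized skill was tuned to the training cases rather than improved in general.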