# CJE Planning: Optimize Your Evaluation Budget

Before running a large evaluation, answer these questions:

- **How many samples** do I need?
- **How many oracle labels** are worth the cost?
- **What effect size** can I reliably detect?

CJE's planning tools help you split a fixed budget optimally between cheap surrogate (judge) scores and expensive oracle labels.

**This notebook requires no data to get started.** You can plan your evaluation with just an estimate of your judge's quality.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cimo-labs/cje/blob/main/examples/cje_planning.ipynb)

In [None]:
# Install CJE
!pip install -q --upgrade cje-eval

## 1. Quick Planning (No Data Required)

To plan your evaluation, you only need:

1. **Judge quality estimate** — How well does your cheap judge predict oracle labels?
2. **Cost model** — What does each judge/oracle call cost?

### Estimating Judge Quality

Judge quality is measured as **R²** — the fraction of oracle variance explained by your judge.

| R² | Quality | Example |
|----|---------|----------|
| 0.9+ | Excellent | GPT-4 predicting human preference on clear tasks |
| 0.7-0.9 | Good | Strong LLM judge on well-defined criteria |
| 0.5-0.7 | Moderate | LLM judge on subjective tasks |
| <0.5 | Weak | Misaligned judge or very noisy oracle |

**Don't know R²?** If you know the correlation between judge and oracle, convert it:

In [None]:
from cje import correlation_to_r2

# If judge-oracle correlation is 0.8
r2 = correlation_to_r2(0.8)
print(f"Correlation 0.8 → R² = {r2:.2f}")

# For monotone (nonlinear) relationships, R² is higher
r2_monotone = correlation_to_r2(0.8, "monotone")
print(f"Correlation 0.8 (monotone) → R² = {r2_monotone:.2f}")
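
If you have a handful of responses scored by both the judge and the oracle, you can estimate the correlation yourself before converting it. The sketch below uses only the standard library; the paired scores are made up for illustration, and squaring the correlation is the textbook linear-fit relationship, not a confirmed description of what `correlation_to_r2` does internally.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical paired scores for the same six responses
judge_scores = [0.2, 0.4, 0.5, 0.7, 0.8, 0.9]
oracle_labels = [0.1, 0.5, 0.4, 0.6, 0.9, 0.8]

r = pearson_r(judge_scores, oracle_labels)
r2_linear = r ** 2  # for a linear fit, R² equals the squared correlation
print(f"correlation = {r:.2f}, linear R² = {r2_linear:.2f}")
```

With only a few pairs the estimate is noisy, so treat it as a rough planning input rather than a measurement.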

### Specify Your Costs

Use actual dollar costs per call (or per label, for human oracles):

| Surrogate (Judge) | Oracle | Cost Ratio |
|-------------------|--------|------------|
| GPT-4o-mini ($0.01) | GPT-4o ($0.16) | 16× |
| Claude Haiku ($0.008) | Claude Sonnet ($0.09) | 11× |
| Llama-70B ($0.002) | Human ($2.00) | 1000× |

In [None]:
from cje import CostModel

# Example: GPT-4o-mini as judge, GPT-4o as oracle
cost = CostModel(
    surrogate_cost=0.01,  # $/call for judge
    oracle_cost=0.16      # $/call for oracle
)

print(f"Surrogate: ${cost.surrogate_cost}/call")
print(f"Oracle: ${cost.oracle_cost}/call")
print(f"Cost ratio: {cost.oracle_cost/cost.surrogate_cost:.0f}×")
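
To see how these per-call costs add up, here is a minimal sketch of the budget arithmetic. It assumes every sample gets a judge score and a subset also gets an oracle label, which matches the definitions of n and m in the appendix; `total_cost` is a hypothetical helper, not part of the CJE API.

```python
def total_cost(n_samples, m_oracle, surrogate_cost, oracle_cost):
    """Dollar cost when all n samples get a judge score and m of them also get an oracle label."""
    if m_oracle > n_samples:
        raise ValueError("oracle-labeled samples must be a subset of all samples")
    return n_samples * surrogate_cost + m_oracle * oracle_cost

# 10,000 judged samples, 500 of them oracle-labeled, at the prices above
spend = total_cost(n_samples=10_000, m_oracle=500, surrogate_cost=0.01, oracle_cost=0.16)
print(f"Total: ${spend:,.2f}")
```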

### Run the Simulation

Now we can plan the evaluation. This simulates the variance structure for your judge quality and finds the optimal allocation.

**Note:** The simulation takes ~30-60 seconds to run.

In [None]:
from cje import simulate_planning

# Plan with R²=0.7 (good judge) and $5,000 budget
result = simulate_planning(
    r2=0.7,
    budget=5000,
    cost_model=cost,
    verbose=True
)

In [None]:
# Educational explanation of results
print(result.explain())

In [None]:
# Detailed plan
print(result.summary())

## 2. Understanding the Tradeoffs

### How Judge Quality Affects Planning

Better judges (higher R²) mean:
- Less calibration uncertainty → fewer oracle labels needed
- Lower MDE for the same budget

Let's see how different judge qualities affect the recommendation:

In [None]:
from cje import simulate_variance_model, plan_evaluation

print("Judge Quality vs Planning (at $5,000 budget)")
print("=" * 55)
print(f"{'R²':<6} {'Quality':<12} {'MDE':<8} {'Oracle %':<10} {'Samples'}")
print("-" * 55)

for r2, quality in [(0.5, "Moderate"), (0.7, "Good"), (0.9, "Excellent")]:
    model = simulate_variance_model(r2=r2, verbose=False)
    plan = plan_evaluation(budget=5000, variance_model=model, cost_model=cost)
    print(f"{r2:<6} {quality:<12} {plan.mde:<8.1%} {plan.oracle_fraction:<10.0%} {plan.n_samples:,}")

### Variance Decomposition

CJE's variance has two components:

$$\text{Var}(\hat{\theta}) = \frac{\sigma^2_{\text{eval}}}{n} + \frac{\sigma^2_{\text{cal}}}{m}$$

- **σ²_eval** — Evaluation variance (noise in policy performance)
- **σ²_cal** — Calibration variance (uncertainty in judge→oracle mapping)

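Here n is the total number of samples scored by the judge and m is the number of those samples that also receive an oracle label. Better judges have lower σ²_cal, which shifts the optimal allocation toward more samples and fewer oracle labels.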
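
The formula can be evaluated by hand to see which term dominates. This is a sketch with hypothetical variance components and sample sizes; `variance_shares` is an illustrative helper, not a CJE function.

```python
def variance_shares(sigma2_eval, sigma2_cal, n, m):
    """Return total variance and the fraction contributed by each term."""
    eval_term = sigma2_eval / n  # shrinks as more samples are judged
    cal_term = sigma2_cal / m    # shrinks as more oracle labels are collected
    total = eval_term + cal_term
    return total, eval_term / total, cal_term / total

# Hypothetical values: 5,000 judged samples, 250 oracle labels
total_var, eval_share, cal_share = variance_shares(
    sigma2_eval=0.04, sigma2_cal=0.02, n=5000, m=250
)
print(f"Var = {total_var:.2e}, eval share = {eval_share:.0%}, cal share = {cal_share:.0%}")
```

In this made-up case calibration accounts for most of the variance, so extra oracle labels would help more than extra judged samples.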

In [None]:
# Show variance decomposition from our simulation
print(f"Variance decomposition (R² = {result.r2})")
print(f"  Evaluation variance:  {result.eval_variance_fraction:.0%}")
print(f"  Calibration variance: {result.cal_variance_fraction:.0%}")

if result.cal_variance_fraction > 0.5:
    print("\n→ Calibration dominates: investing in oracle labels has high impact")
else:
    print("\n→ Evaluation dominates: more samples matter more than more labels")

## 3. Budget vs MDE Planning

Two ways to plan:

1. **Budget-constrained**: "I have $X — what MDE can I achieve?"
2. **MDE-constrained**: "I need to detect X% — what will it cost?"
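
The two modes are inverses of each other. As a rough sketch, plugging the square-root allocation from the appendix into the variance formula gives Var = S² / B with S = √(c_S·σ²_eval) + √(c_Y·σ²_cal), which can be solved in either direction. The helper names and variance values below are hypothetical, the sketch ignores the m ≤ n constraint, and CJE's own planner may differ in detail.

```python
import math
from statistics import NormalDist

# z multiplier for alpha = 0.05 (two-sided) and 80% power
Z = NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.80)

def _s(c_s, c_y, sigma2_eval, sigma2_cal):
    """S = sqrt(c_S * sigma2_eval) + sqrt(c_Y * sigma2_cal)."""
    return math.sqrt(c_s * sigma2_eval) + math.sqrt(c_y * sigma2_cal)

def mde_for_budget(budget, c_s, c_y, sigma2_eval, sigma2_cal):
    """Budget-constrained: best achievable MDE for a given spend."""
    var = _s(c_s, c_y, sigma2_eval, sigma2_cal) ** 2 / budget
    return Z * math.sqrt(2 * var)

def budget_for_mde(target_mde, c_s, c_y, sigma2_eval, sigma2_cal):
    """MDE-constrained: minimum spend needed to reach a target MDE."""
    var = (target_mde / Z) ** 2 / 2
    return _s(c_s, c_y, sigma2_eval, sigma2_cal) ** 2 / var

# Hypothetical variance components with the costs used earlier
params = dict(c_s=0.01, c_y=0.16, sigma2_eval=0.04, sigma2_cal=0.01)
mde_10k = mde_for_budget(10_000, **params)
print(f"MDE at $10,000: {mde_10k:.2%}")
print(f"Budget to reach that MDE: ${budget_for_mde(mde_10k, **params):,.0f}")
```

Because variance scales as 1/B, halving the target MDE roughly quadruples the required budget under these assumptions.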

In [None]:
from cje import plan_for_mde

# Use the variance model from our simulation
model = result.variance_model

# Budget-constrained: What can I detect with $10K?
plan_10k = plan_evaluation(budget=10000, variance_model=model, cost_model=cost)
print(f"With $10,000 budget: MDE = {plan_10k.mde:.1%}")

# MDE-constrained: What does it cost to detect 2%?
plan_2pct = plan_for_mde(target_mde=0.02, variance_model=model, cost_model=cost)
print(f"To detect 2% differences: ${plan_2pct.total_cost:,.0f}")

In [None]:
# Compare different MDE targets
print("MDE vs Budget Tradeoff")
print("=" * 45)
print(f"{'MDE':<10} {'Budget':<15} {'Samples':<10} {'Oracle'}")
print("-" * 45)

for target in [0.05, 0.03, 0.02, 0.015, 0.01]:
    p = plan_for_mde(target_mde=target, variance_model=model, cost_model=cost)
    print(f"{target:.1%}       ${p.total_cost:>10,.0f}    {p.n_samples:>7,}   {p.m_oracle:>5}")

### The Planning Loop

Planning is iterative. Start with one constraint and check if the other is acceptable:

In [None]:
print("Example planning session:")
print("=" * 50)

# Start with budget constraint
initial = plan_evaluation(budget=10000, variance_model=model, cost_model=cost)
print(f"\n1. With $10K budget: MDE = {initial.mde:.1%}")
print("   → Too high, I need to detect 1.5% differences")

# Check what it costs to hit a tighter MDE
target = plan_for_mde(target_mde=0.015, variance_model=model, cost_model=cost)
print(f"\n2. To hit 1.5% MDE: need ${target.total_cost:,.0f}")
print("   → Too expensive")

# Find a compromise
compromise = plan_for_mde(target_mde=0.02, variance_model=model, cost_model=cost)
print(f"\n3. Compromise at 2% MDE: ${compromise.total_cost:,.0f}")
print("   → Acceptable!")

### Visualize the Tradeoffs

In [None]:
from cje.visualization import plot_planning_dashboard
import matplotlib.pyplot as plt

fig = plot_planning_dashboard(model, cost)
plt.show()

## 4. Refining with Pilot Data (Optional)

Simulation-based planning is great for feasibility checks and initial budgeting. For production planning, you can refine your estimates with real pilot data.

### When to Collect a Pilot

Collect pilot data when:
- Simulation suggests the evaluation is **feasible** (budget/MDE look acceptable)
- You want **more precise** variance estimates
- You're **uncertain** about your judge quality estimate

### Pilot Requirements

| Size | Prompts | Oracle Labels | Oracle % |
|------|---------|---------------|----------|
| Minimum | 200 | 80 | 40% |
| Recommended | 400 | 120-160 | 30-40% |

**Key:** Oracle labels must be a random subset — don't cherry-pick.
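
One simple way to avoid cherry-picking is to draw the oracle subset with a seeded random sampler before looking at any judge scores. This is a minimal sketch; `choose_oracle_subset` and the prompt IDs are hypothetical, not part of CJE.

```python
import random

def choose_oracle_subset(prompt_ids, oracle_fraction, seed=0):
    """Pick a uniformly random subset of prompts to send for oracle labeling."""
    rng = random.Random(seed)  # fixed seed makes the selection reproducible
    k = round(len(prompt_ids) * oracle_fraction)
    return set(rng.sample(list(prompt_ids), k))

# Minimum pilot from the table above: 200 prompts, 40% oracle-labeled
pilot_ids = [f"prompt_{i}" for i in range(200)]
to_label = choose_oracle_subset(pilot_ids, oracle_fraction=0.40)
print(len(to_label))  # 80
```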

In [None]:
# Download the Arena sample data used here as a stand-in for pilot data
import urllib.request
from pathlib import Path

DATA_DIR = Path("arena_sample")
FRESH_DIR = DATA_DIR / "fresh_draws"
BASE_URL = "https://raw.githubusercontent.com/cimo-labs/cje/main/examples/arena_sample"
FILES = ["base_responses.jsonl", "clone_responses.jsonl",
         "parallel_universe_prompt_responses.jsonl", "unhelpful_responses.jsonl"]

# Fetch only files that are missing, so a partial download is completed on rerun
missing = [f for f in FILES if not (FRESH_DIR / f).exists()]
if missing:
    print("Downloading sample data...")
    FRESH_DIR.mkdir(parents=True, exist_ok=True)
    for f in missing:
        urllib.request.urlretrieve(f"{BASE_URL}/fresh_draws/{f}", FRESH_DIR / f)
    print("Done!")
else:
    print(f"Data exists at {DATA_DIR.absolute()}")

In [None]:
from cje import fit_variance_model
from cje.data.fresh_draws import load_fresh_draws_auto

# Load pilot data
pilot_data = load_fresh_draws_auto("arena_sample/fresh_draws", "base")

print(f"Loaded: {len(pilot_data.samples)} samples")
print(f"Oracle labels: {sum(1 for s in pilot_data.samples if s.oracle_label is not None)}")

In [None]:
# Fit variance model from pilot data (~30 seconds)
pilot_model = fit_variance_model(pilot_data, verbose=True)

In [None]:
# Compare simulation vs pilot estimates
sim_plan = plan_evaluation(budget=5000, variance_model=result.variance_model, cost_model=cost)
pilot_plan = plan_evaluation(budget=5000, variance_model=pilot_model, cost_model=cost)

print("Comparison: Simulation vs Pilot")
print("=" * 45)
print(f"{'Metric':<20} {'Simulation':<12} {'Pilot'}")
print("-" * 45)
print(f"{'σ²_eval':<20} {result.variance_model.sigma2_eval:<12.4f} {pilot_model.sigma2_eval:.4f}")
print(f"{'σ²_cal':<20} {result.variance_model.sigma2_cal:<12.4f} {pilot_model.sigma2_cal:.4f}")
print(f"{'MDE':<20} {sim_plan.mde:<12.1%} {pilot_plan.mde:.1%}")
print(f"{'Oracle fraction':<20} {sim_plan.oracle_fraction:<12.0%} {pilot_plan.oracle_fraction:.0%}")

## 5. Summary

### Quick Reference

**Simulation-based planning (no data required):**
```python
from cje import simulate_planning, CostModel, correlation_to_r2

# Estimate judge quality
r2 = correlation_to_r2(0.8)  # or use R² directly

# Specify costs
cost = CostModel(surrogate_cost=0.01, oracle_cost=0.16)

# Plan
result = simulate_planning(r2=r2, budget=5000, cost_model=cost)
print(result.explain())
```

**Pilot-based planning (with data):**
```python
from cje import fit_variance_model, plan_evaluation, CostModel
from cje.data.fresh_draws import load_fresh_draws_auto

# Load and fit
pilot = load_fresh_draws_auto("pilot_dir", "base")
model = fit_variance_model(pilot)

# Plan
cost = CostModel(surrogate_cost=0.01, oracle_cost=0.16)
plan = plan_evaluation(budget=5000, variance_model=model, cost_model=cost)
```

### Decision Flowchart

```
Start
  │
  ▼
Have pilot data? ──No──→ simulate_planning(r2, budget, cost)
  │                              │
  Yes                            ▼
  │                      Is evaluation feasible?
  ▼                              │
fit_variance_model()      No ────┴──── Yes
  │                       │            │
  ▼                       ▼            ▼
plan_evaluation()    Reconsider    Collect pilot
                     (improve        (optional)
                      judge or 
                      increase 
                      budget)
```

### Next Steps

Once you have your plan, run the evaluation:

[![Core Tutorial](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cimo-labs/cje/blob/main/examples/cje_core_demo.ipynb)

---

## Appendix: The Math

<details>
<summary>Click to expand</summary>

### Variance Decomposition

CJE's variance decomposes into two independent components:

$$\text{Var}(\hat{\theta}) = \frac{\sigma^2_{\text{eval}}}{n} + \frac{\sigma^2_{\text{cal}}}{m}$$

Where:
- **n** = total samples (evaluated by cheap judge)
- **m** = oracle-labeled samples (subset of n)
- **σ²_eval** = evaluation variance (inherent noise in policy performance)
- **σ²_cal** = calibration variance (uncertainty in learning judge→oracle mapping)

### Square Root Allocation Law

Given per-call costs c_S (surrogate) and c_Y (oracle), minimizing the variance above subject to a budget constraint c_S·n + c_Y·m = B gives the optimal allocation:

$$\frac{m^*}{n^*} = \sqrt{\frac{c_S}{c_Y}} \cdot \sqrt{\frac{\sigma^2_{\text{cal}}}{\sigma^2_{\text{eval}}}}$$

This balances the marginal variance reduction per dollar spent on each component.
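
As a rough illustration, the allocation can be computed in closed form from the ratio above and a budget constraint c_S·n + c_Y·m = B. The function name and variance values are hypothetical; the sketch caps the ratio at 1 because m cannot exceed n, and it returns real numbers that you would round in practice.

```python
import math

def sqrt_allocation(budget, c_s, c_y, sigma2_eval, sigma2_cal):
    """Return (n, m, ratio) that spends the whole budget at the square-root ratio."""
    ratio = math.sqrt(c_s / c_y) * math.sqrt(sigma2_cal / sigma2_eval)
    ratio = min(ratio, 1.0)  # oracle labels are a subset of all samples
    n = budget / (c_s + c_y * ratio)  # from c_s*n + c_y*(ratio*n) = budget
    m = ratio * n
    return n, m, ratio

# Hypothetical variance components with the costs used earlier
n_opt, m_opt, ratio = sqrt_allocation(
    budget=5000, c_s=0.01, c_y=0.16, sigma2_eval=0.04, sigma2_cal=0.01
)
print(f"n ≈ {n_opt:,.0f}, m ≈ {m_opt:,.0f}, oracle fraction ≈ {ratio:.1%}")
```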

### MDE Formula

The minimum detectable effect at power (1-β) and significance α is:

$$\text{MDE} = (z_{1-\alpha/2} + z_{1-\beta}) \cdot \sqrt{2 \cdot \text{Var}(\hat{\theta})}$$

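For 80% power and α = 0.05, the multiplier is z_{0.975} + z_{0.80} ≈ 1.96 + 0.84 = 2.80, so MDE ≈ 2.8 × SE, where SE = √(2·Var(θ̂)) is the standard error of the difference between two estimates of equal variance.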
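
The multiplier can be checked with the standard library's normal distribution. This sketch implements the formula above directly; the `mde` helper and the example variance are illustrative and not taken from CJE.

```python
import math
from statistics import NormalDist

def mde(variance, alpha=0.05, power=0.80):
    """Minimum detectable effect for a two-sided test, given Var(theta_hat)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * math.sqrt(2 * variance)

multiplier = NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.80)
print(f"z multiplier: {multiplier:.2f}")  # 2.80

# Hypothetical variance of a single policy estimate
print(f"MDE at Var = 1e-4: {mde(1e-4):.2%}")
```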

</details>

**Documentation:**
- [CJE GitHub](https://github.com/cimo-labs/cje)
- [CJE Paper (arXiv)](https://arxiv.org/abs/2512.11150) — See Appendix F for full derivation