# CJE: Calibrate Your LLM Judge

Your LLM judge scores are lying. CJE calibrates them to what actually matters.

**The problem**: Cheap judge scores (S) don't match expensive oracle outcomes (Y). CJE learns the S→Y mapping so you can trust your metrics.
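
To make the S→Y idea concrete, here is a minimal sketch of one common way to learn a monotone mapping from judge scores to oracle labels: isotonic regression via pool-adjacent-violators. It assumes a higher judge score should never map to a lower expected outcome. This is an illustration only, not necessarily what CJE does internally; the helpers `fit_monotone_map` and `calibrate` and the toy data are hypothetical, and scores are assumed distinct for simplicity.

```python
from bisect import bisect_left

def fit_monotone_map(scores, labels):
    """Pool-adjacent-violators: fit a non-decreasing step function S -> Y."""
    blocks = []  # each block: [sum of labels, count, largest score covered]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([y, 1, s])
        # Merge backwards while a block's mean exceeds the mean of the next one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, right_edge = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = right_edge
    edges = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return edges, values

def calibrate(model, score):
    """Map a raw judge score to the fitted oracle-scale value."""
    edges, values = model
    # Pick the first block whose right edge is >= score; clip at the top.
    idx = min(bisect_left(edges, score), len(values) - 1)
    return values[idx]

# Toy data (made up): the judge is too generous at the high end.
judge = [0.2, 0.4, 0.6, 0.8, 0.9]
oracle = [0.1, 0.3, 0.2, 0.5, 0.4]

model = fit_monotone_map(judge, oracle)
for s in (0.3, 0.5, 0.85):
    print(f"judge={s:.2f} -> calibrated={calibrate(model, s):.2f}")
```

The fitted values are pooled means of oracle labels, so a judge score of 0.85 maps to a much lower calibrated value than the raw score suggests.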

[Read the full explanation →](https://cimolabs.com/blog/metrics-lying)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cimo-labs/cje/blob/main/examples/cje_core_demo.ipynb)

In [None]:
# Install CJE (force upgrade to get latest features)
!pip install -q --upgrade cje-eval

In [None]:
# Download sample data (1000 Chatbot Arena prompts, 4 prompt variants)
import urllib.request
from pathlib import Path

DATA_DIR = Path("arena_sample")
if not (DATA_DIR / "fresh_draws" / "base_responses.jsonl").exists():
    print("Downloading sample data...")
    DATA_DIR.mkdir(exist_ok=True)
    (DATA_DIR / "fresh_draws").mkdir(exist_ok=True)
    (DATA_DIR / "probe_slice").mkdir(exist_ok=True)
    
    BASE_URL = "https://raw.githubusercontent.com/cimo-labs/cje/main/examples/arena_sample"
    for f in ["base_responses.jsonl", "clone_responses.jsonl", 
              "parallel_universe_prompt_responses.jsonl", "unhelpful_responses.jsonl"]:
        urllib.request.urlretrieve(f"{BASE_URL}/fresh_draws/{f}", DATA_DIR / "fresh_draws" / f)
    for f in ["clone_probe.jsonl", "parallel_universe_prompt_probe.jsonl", "unhelpful_probe.jsonl"]:
        urllib.request.urlretrieve(f"{BASE_URL}/probe_slice/{f}", DATA_DIR / "probe_slice" / f)
    print("Done!")
else:
    print(f"Data exists at {DATA_DIR.absolute()}")

## Step 0: Plan Your Evaluation

Before collecting expensive data, figure out:
- **How many samples** do I need?
- **How many oracle labels** are worth the cost?
- **What effect size** can I reliably detect?

This prevents the #1 mistake: collecting underpowered data that can't answer your question.
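
For intuition about how sample size and detectable effect trade off, the sketch below uses the textbook two-sample power approximation. It assumes two independent policies with the same per-sample standard deviation and ignores oracle labeling entirely, so it is not a substitute for CJE's planner, which works from a variance model fitted on pilot data. The helper `approx_mde` and the value `sigma=0.25` are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def approx_mde(n_per_policy, sigma, alpha=0.05, power=0.80):
    """Smallest true difference detectable with the given power (two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = z.inv_cdf(power)          # extra margin needed to reach the power target
    se_diff = sigma * sqrt(2 / n_per_policy)  # SE of a difference of two means
    return (z_alpha + z_power) * se_diff

for n in (250, 1000, 4000):
    print(f"n={n:>5}: MDE ~ {approx_mde(n, sigma=0.25):.3f}")
```

Under this approximation the MDE shrinks with the square root of the sample size, so halving it takes roughly four times as many samples.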

In [None]:
from cje import fit_variance_model, CostModel, plan_evaluation, plan_for_mde
from cje.data.fresh_draws import load_fresh_draws_auto, discover_policies_from_fresh_draws

# Use the arena sample as our "pilot" to learn variance structure
policies = discover_policies_from_fresh_draws("arena_sample/fresh_draws")
fresh_draws_dict = {p: load_fresh_draws_auto("arena_sample/fresh_draws", p) for p in policies}

# Fit the variance model (this takes ~30 seconds)
print("Fitting variance model from pilot data...")
model = fit_variance_model(fresh_draws_dict, verbose=False)
print(f"\n✓ Variance model fitted (R² = {model.r_squared:.2f})")

In [None]:
# Specify your cost model - THIS IS CRITICAL
# Use actual costs per call so budget is in real dollars
# Example: GPT-4o-mini surrogate ($0.01/call) vs GPT-4o oracle ($0.16/call)

cost_model = CostModel(surrogate_cost=0.01, oracle_cost=0.16)

print(f"Costs: surrogate=${cost_model.surrogate_cost}/call, oracle=${cost_model.oracle_cost}/call")
print(f"Budget will be in real dollars (e.g., $5,000 = $5,000)")

In [None]:
# "I have $5,000 - what effect size can I detect?"
plan = plan_evaluation(budget=5000, variance_model=model, cost_model=cost_model)

print(plan.summary())
print(f"\n📊 What this means:")
print(f"   Collect {plan.n_samples:,} responses scored by surrogate judge")
print(f"   Randomly label {plan.m_oracle} with oracle ({plan.m_oracle/plan.n_samples:.1%} of samples)")
print(f"   Can detect {plan.mde:.1%} difference between policies (80% power)")

In [None]:
# "I need to detect 2% differences - what's the cost?"
plan_2pct = plan_for_mde(target_mde=0.02, variance_model=model, cost_model=cost_model)

print(f"To detect 2% difference with 80% power:")
print(f"  Budget needed: ${plan_2pct.total_cost:,.0f}")
print(f"  Samples: {plan_2pct.n_samples:,} prompts, {plan_2pct.m_oracle} oracle labels")

# Compare different MDE targets
print(f"\n📊 MDE vs Budget tradeoff:")
for target in [0.05, 0.03, 0.02, 0.01]:
    p = plan_for_mde(target_mde=target, variance_model=model, cost_model=cost_model)
    print(f"   {target:.0%} MDE → ${p.total_cost:,.0f}")

In [None]:
# Visualize: MDE vs Budget tradeoff
from cje.visualization import plot_planning_dashboard
import matplotlib.pyplot as plt

fig = plot_planning_dashboard(model, cost_model)
plt.show()

**The planning loop:**

1. Start with budget constraint OR target MDE
2. Check if the other is acceptable
3. Iterate until both work

```python
# "I have $10K but need 1.5% MDE"
plan = plan_evaluation(budget=10000, ...)  # → MDE = 2.1% (too high!)
plan = plan_for_mde(target_mde=0.015, ...) # → $18K (too expensive!)
plan = plan_for_mde(target_mde=0.018, ...) # → $12K (compromise)
```

**Rule of thumb:** Target MDE should be 2-3× smaller than differences you care about.
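
One way to read this rule of thumb, under a standard normal approximation: the MDE is the effect you detect with 80% power, so a true effect that sits exactly at the MDE is still missed about one time in five. The sketch below computes the power you get when the true effect is some multiple of the MDE. The helper `power_at` is hypothetical and the calculation is a generic power formula, not something taken from CJE.

```python
from statistics import NormalDist

def power_at(effect_over_mde, alpha=0.05, design_power=0.80):
    """Power of a two-sided z-test when the true effect is a multiple of the MDE."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(design_power)
    # By definition of the MDE, effect / SE = multiple * (z_alpha + z_power).
    # The rejection probability in the opposite tail is negligible and ignored.
    return z.cdf(effect_over_mde * (z_alpha + z_power) - z_alpha)

for multiple in (0.5, 1.0, 2.0, 3.0):
    print(f"true effect = {multiple:.1f} x MDE -> power ~ {power_at(multiple):.3f}")
```

Effects smaller than the MDE are detected far less often than 80% of the time, which is why the target MDE needs headroom below the difference you care about.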

## 1. Compare Prompt Variants

One line to analyze all your prompt variants with calibrated estimates:

In [None]:
from cje import analyze_dataset

results = analyze_dataset(fresh_draws_dir="arena_sample/fresh_draws/", verbose=False)

# Show summary
policies = results.metadata['target_policies']
print(f"Analyzed {len(policies)} policies:\n")
for i, p in enumerate(policies):
    est = results.estimates[i]
    se = results.standard_errors[i]
    print(f"  {p}: {est:.3f} ± {1.96*se:.3f}")

In [None]:
# Visualize with confidence intervals
results.plot_estimates(
    policy_labels={
        "base": "Standard prompt",
        "clone": "Same prompt (different seed)",
        "parallel_universe_prompt": "Modified system prompt",
        "unhelpful": "Adversarial prompt",
    }
);

In [None]:
# Statistical comparison
policies = results.metadata['target_policies']
best = policies[results.best_policy()]
print(f"Best: {best}\n")

# Compare all to base
base_idx = policies.index('base')
for i, p in enumerate(policies):
    if i != base_idx:
        comp = results.compare_policies(i, base_idx)
        sig = "*" if comp['significant'] else ""
        print(f"{p}: {comp['difference']:+.3f} (p={comp['p_value']:.3f}) {sig}")

## 2. Check If Calibration Transfers

Calibration is learned on one distribution. Does it still work on new data?
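
A minimal sketch of what such a check can look like, assuming the question is "on the new data, is the mean residual (oracle minus calibrated prediction) indistinguishable from zero?". This illustrates the idea only and is not the implementation of `audit_transportability`; the helper `mean_residual_check` and the toy numbers are hypothetical, and the normal-approximation interval is rough at sample sizes this small.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_residual_check(oracle, calibrated, alpha=0.05):
    """PASS if the confidence interval for the mean residual contains zero."""
    residuals = [y - f for y, f in zip(oracle, calibrated)]
    m = mean(residuals)
    se = stdev(residuals) / sqrt(len(residuals))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    low, high = m - z * se, m + z * se
    status = "PASS" if low <= 0.0 <= high else "FAIL"
    return status, m, (low, high)

# Toy data (made up). Same distribution: residuals scatter around zero.
same_oracle = [0.60, 0.40, 0.70, 0.50, 0.80, 0.30]
same_calibrated = [0.55, 0.45, 0.65, 0.55, 0.75, 0.35]

# Shifted distribution: calibrated predictions are systematically too high.
shifted_oracle = [0.20, 0.10, 0.30, 0.20, 0.10, 0.30]
shifted_calibrated = [0.60, 0.50, 0.70, 0.55, 0.60, 0.65]

for name, y, f in [("same", same_oracle, same_calibrated),
                   ("shifted", shifted_oracle, shifted_calibrated)]:
    status, m, (low, high) = mean_residual_check(y, f)
    print(f"{name}: {status} (mean residual {m:+.3f}, 95% CI [{low:+.3f}, {high:+.3f}])")
```

A negative mean residual with an interval that excludes zero means the calibrated scores overstate quality on the new data.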

In [None]:
import json
from cje.diagnostics import audit_transportability, plot_transport_comparison

# Test on probe slices (held-out data with oracle labels)
probe_files = {
    "clone": "arena_sample/probe_slice/clone_probe.jsonl",
    "modified_prompt": "arena_sample/probe_slice/parallel_universe_prompt_probe.jsonl",
    "adversarial": "arena_sample/probe_slice/unhelpful_probe.jsonl",
}

audits = {}
for name, path in probe_files.items():
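    # Read the JSONL probe file, closing the handle and skipping blank lines
    with open(path) as f:
        data = [json.loads(line) for line in f if line.strip()]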
    audits[name] = audit_transportability(results.calibrator, data)
    print(audits[name].summary())

In [None]:
# Visualize: which variants break calibration?
plot_transport_comparison(audits, title="Does Calibration Transfer?");

In [None]:
# Detailed view of failing variant - residuals by score decile
audits['adversarial'].plot();

## 3. Inspect What's Fooling the Judge

When calibration fails, look at the actual samples. What patterns fool the judge but not the oracle?

In [None]:
from cje.diagnostics import compute_residuals

# Compute residuals for each sample (sorted by worst overestimate first)
with open("arena_sample/probe_slice/unhelpful_probe.jsonl") as f:
    adversarial_data = [json.loads(line) for line in f if line.strip()]
samples = compute_residuals(results.calibrator, adversarial_data)

print(f"Samples: {len(samples)}")
print(f"Mean residual: {sum(s['residual'] for s in samples) / len(samples):.3f}")
print(f"\nResidual = Oracle - Calibrated")
print(f"  Negative = judge overestimated (fooled)")
print(f"  Positive = judge underestimated")

In [None]:
# Look at the worst overestimates - where the judge was most fooled
print("WORST OVERESTIMATES: Judge gave high scores, oracle gave low scores")
print("-" * 70)

for i, s in enumerate(samples[:3]):
    print(f"\nSample {i+1} | Residual: {s['residual']:.2f}")
    print(f"  Judge: {s['judge_score']:.2f} → Calibrated: {s['calibrated']:.2f} | Oracle: {s['oracle_label']:.2f}")
    print(f"  Prompt: {s['prompt']}")
    print(f"  Response: {s['response']}")

**What we see**: The adversarial prompt produces responses that *sound* helpful (confident, structured, detailed) but are actually wrong or misleading. The cheap judge is fooled by surface features; the oracle catches the substance.

**The fix**: This is exactly why you need calibration. Raw judge scores would rank the adversarial prompt too high. CJE's transportability check flags this before you ship bad decisions.

## 4. Monitor Calibration Over Time

Calibration drifts. Periodically check it with fresh oracle labels. Here we simulate gradual drift:

In [None]:
# Simulate weekly monitoring with gradual drift
import random

# Load base data with oracle labels for monitoring simulation
with open("arena_sample/fresh_draws/base_responses.jsonl") as f:
    base_data = [json.loads(line) for line in f if line.strip()]
base_oracle = [r for r in base_data if r.get("oracle_label") is not None]
with open("arena_sample/probe_slice/unhelpful_probe.jsonl") as f:
    unhelpful_data = [json.loads(line) for line in f if line.strip()]

random.seed(42)
random.shuffle(base_oracle)

# Week 1: stable (pure base data)
week1 = base_oracle[:48]

# Week 2: starting to drift (65% base, 35% adversarial)
week2_base = base_oracle[48:79]
week2_adv = unhelpful_data[:17]
week2 = week2_base + week2_adv
random.shuffle(week2)

# Week 3: drifted (40% base, 60% adversarial) 
week3_base = base_oracle[79:99]
week3_adv = unhelpful_data[17:47]
week3 = week3_base + week3_adv
random.shuffle(week3)

weekly_audits = {
    "Week 1": audit_transportability(results.calibrator, week1),
    "Week 2": audit_transportability(results.calibrator, week2),
    "Week 3": audit_transportability(results.calibrator, week3),
}

for name, audit in weekly_audits.items():
    print(audit.summary())

In [None]:
plot_transport_comparison(weekly_audits, title="Weekly Calibration Check");

## Summary

```python
# Compare prompt variants
results = analyze_dataset(fresh_draws_dir="data/responses/")
results.plot_estimates()

# Check calibration transfers
audit = audit_transportability(results.calibrator, new_data)
print(audit.summary())  # PASS or FAIL

# Monitor over time
plot_transport_comparison({"Week 1": audit1, "Week 2": audit2, ...})
```

**PASS** = calibration valid, trust the estimates  
**FAIL** = something changed, investigate or recalibrate

### Learn More
- [Why your metrics lie](https://cimolabs.com/blog/metrics-lying): The full explanation
- [Arena experiment](https://www.cimolabs.com/research/arena-experiment): Benchmarks on 5,000 prompts
- [GitHub](https://github.com/cimo-labs/cje): Documentation and source