# CJE: Calibrate Your LLM Judge

Your LLM judge scores are lying. CJE calibrates them to what actually matters.

**The problem**: Cheap judge scores (S) don't match expensive oracle outcomes (Y). CJE learns the S→Y mapping so you can trust your metrics.
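To make "learning the S→Y mapping" concrete, here is a minimal sketch of one common way to fit a monotone mapping from scores to outcomes: isotonic regression via pool-adjacent-violators. This illustrates the general idea only and is not necessarily what CJE does internally. The function names `fit_isotonic` and `predict` are hypothetical and not part of the `cje` package.

```python
from typing import List, Tuple


def fit_isotonic(scores: List[float], labels: List[float]) -> List[Tuple[float, float]]:
    """Fit a non-decreasing step function from judge scores (S) to oracle labels (Y).

    Returns (upper_score_bound, fitted_value) pairs in increasing score order.
    """
    pairs = sorted(zip(scores, labels))
    # Each block stores [sum_of_labels, count, max_score_in_block].
    blocks: List[List[float]] = []
    for s, y in pairs:
        blocks.append([y, 1.0, s])
        # Pool adjacent blocks while the earlier block's mean exceeds the later one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, max_s = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = max_s
    return [(b[2], b[0] / b[1]) for b in blocks]


def predict(steps: List[Tuple[float, float]], score: float) -> float:
    """Return the fitted value of the first block whose upper bound covers `score`."""
    for upper, value in steps:
        if score <= upper:
            return value
    # Scores above the training range get the last block's value.
    return steps[-1][1]


# Toy data: the judge ranks the third item above the second, the oracle disagrees.
toy_scores = [0.2, 0.4, 0.6, 0.8, 0.9]
toy_labels = [0.1, 0.3, 0.2, 0.6, 0.7]
toy_steps = fit_isotonic(toy_scores, toy_labels)
print(toy_steps)
```

The second and third toy points get pooled into one block, so the fitted mapping never decreases as the judge score increases.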

[Read the full explanation →](https://cimolabs.com/blog/metrics-lying)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cimo-labs/cje/blob/main/examples/cje_core_demo.ipynb)

In [None]:
# Install CJE
!pip install -q cje-eval

In [None]:
# Download sample data (1000 Chatbot Arena prompts, 4 prompt variants)
import urllib.request
from pathlib import Path

DATA_DIR = Path("arena_sample")
if not (DATA_DIR / "fresh_draws" / "base_responses.jsonl").exists():
    print("Downloading sample data...")
    DATA_DIR.mkdir(exist_ok=True)
    (DATA_DIR / "fresh_draws").mkdir(exist_ok=True)
    (DATA_DIR / "probe_slice").mkdir(exist_ok=True)
    
    BASE_URL = "https://raw.githubusercontent.com/cimo-labs/cje/main/examples/arena_sample"
    for f in ["base_responses.jsonl", "clone_responses.jsonl", 
              "parallel_universe_prompt_responses.jsonl", "unhelpful_responses.jsonl"]:
        urllib.request.urlretrieve(f"{BASE_URL}/fresh_draws/{f}", DATA_DIR / "fresh_draws" / f)
    for f in ["clone_probe.jsonl", "parallel_universe_prompt_probe.jsonl", "unhelpful_probe.jsonl"]:
        urllib.request.urlretrieve(f"{BASE_URL}/probe_slice/{f}", DATA_DIR / "probe_slice" / f)
    print("Done!")
else:
    print(f"Data exists at {DATA_DIR.absolute()}")
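Before running the analysis, it can help to look at what a record contains. The sketch below reads the first record from one of the downloaded files and prints its field names. `peek_jsonl` is a hypothetical helper, not part of `cje`, and the code makes no assumption about which fields are present.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def peek_jsonl(path: Path, n: int = 1) -> List[Dict[str, Any]]:
    """Return the first n non-empty JSON records of a JSON Lines file."""
    records: List[Dict[str, Any]] = []
    if n <= 0:
        return records
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            records.append(json.loads(line))
            if len(records) >= n:
                break
    return records


sample_path = Path("arena_sample") / "fresh_draws" / "base_responses.jsonl"
if sample_path.exists():
    for record in peek_jsonl(sample_path):
        print(sorted(record.keys()))
```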

## 1. Compare Prompt Variants

One line to analyze all your prompt variants with calibrated estimates:

In [None]:
from cje import analyze_dataset

results = analyze_dataset(fresh_draws_dir="arena_sample/fresh_draws/")
print(results)

In [None]:
# Visualize with confidence intervals
results.plot_estimates(
    policy_labels={
        "base": "Standard prompt",
        "clone": "Same prompt (different seed)",
        "parallel_universe_prompt": "Modified system prompt",
        "unhelpful": "Adversarial prompt",
    }
)

In [None]:
# Statistical comparison
policies = results.metadata['target_policies']
best = policies[results.best_policy()]
print(f"Best: {best}\n")

# Compare all to base
base_idx = policies.index('base')
for i, p in enumerate(policies):
    if i != base_idx:
        comp = results.compare_policies(i, base_idx)
        sig = "*" if comp['significant'] else ""
        print(f"{p}: {comp['difference']:+.3f} (p={comp['p_value']:.3f}) {sig}")
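For intuition about the `p_value` printed above, the sketch below shows how a two-sided p-value follows from an estimated difference and its standard error under a normal approximation. This is a generic textbook calculation, not a description of how `compare_policies` computes its result. The function name `two_sided_p_value` and the example numbers are hypothetical.

```python
import math


def two_sided_p_value(difference: float, std_error: float) -> float:
    """Two-sided p-value for H0: difference == 0 under a normal approximation."""
    if std_error <= 0:
        raise ValueError("std_error must be positive")
    z = abs(difference) / std_error
    # For a standard normal Z, P(|Z| >= z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


# Hypothetical numbers: a difference of 0.04 with a standard error of 0.015.
print(f"p = {two_sided_p_value(0.04, 0.015):.4f}")
```

A difference roughly 1.96 standard errors from zero corresponds to p ≈ 0.05, the usual threshold behind a "significant" flag.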

## 2. Check If Calibration Transfers

Calibration is learned on one distribution. Does it still work on new data?

In [None]:
import json
from cje.diagnostics import audit_transportability, plot_transport_comparison

# Test on probe slices (held-out data with oracle labels)
probe_files = {
    "clone": "arena_sample/probe_slice/clone_probe.jsonl",
    "modified_prompt": "arena_sample/probe_slice/parallel_universe_prompt_probe.jsonl",
    "adversarial": "arena_sample/probe_slice/unhelpful_probe.jsonl",
}

audits = {}
for name, path in probe_files.items():
    # Use a context manager so each file handle is closed after reading
    with open(path) as f:
        data = [json.loads(line) for line in f]
    audits[name] = audit_transportability(results.calibrator, data)
    print(audits[name].summary())

In [None]:
# Visualize: which variants break calibration?
plot_transport_comparison(audits, title="Does Calibration Transfer?")

In [None]:
# Detailed view of failing variant - residuals by score decile
audits['adversarial'].plot()
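One way to think about a transport check: on the new data, the residuals (oracle minus calibrated score) should average out to zero. The sketch below computes the mean residual with a normal-approximation 95% confidence interval and passes when the interval contains zero. It is an illustration under stated assumptions, not the logic inside `audit_transportability`. `mean_residual_check` and the `calibrate` callable are hypothetical, and each record is assumed to carry `judge_score` and `oracle_label` fields.

```python
import math
from typing import Any, Callable, Dict, List


def mean_residual_check(
    records: List[Dict[str, Any]],
    calibrate: Callable[[float], float],
    z: float = 1.96,
) -> Dict[str, Any]:
    """Test whether calibrated scores are unbiased on `records`.

    Residual = oracle_label - calibrate(judge_score), matching the sign
    convention used in this notebook.
    """
    residuals = [
        r["oracle_label"] - calibrate(r["judge_score"])
        for r in records
        if r.get("oracle_label") is not None
    ]
    n = len(residuals)
    if n < 2:
        raise ValueError("need at least two labeled records")
    mean = sum(residuals) / n
    variance = sum((x - mean) ** 2 for x in residuals) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    ci_low, ci_high = mean - half_width, mean + half_width
    return {
        "n": n,
        "mean_residual": mean,
        "ci_low": ci_low,
        "ci_high": ci_high,
        "passes": ci_low <= 0.0 <= ci_high,  # zero bias is plausible
    }


# Toy example with an identity calibrator and a judge that overestimates by ~0.3.
toy = [
    {"judge_score": 0.8, "oracle_label": 0.8 - 0.3 + d}
    for d in (-0.02, 0.02, -0.01, 0.01)
]
print(mean_residual_check(toy, calibrate=lambda s: s))
```

A negative mean residual with an interval that excludes zero means the calibrated scores systematically overestimate the oracle on that slice.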

## 3. Inspect What's Fooling the Judge

When calibration fails, look at the actual samples. What patterns fool the judge but not the oracle?

In [None]:
from cje.diagnostics import compute_residuals

# Load adversarial samples and compute residuals (sorted by worst overestimate first)
with open("arena_sample/probe_slice/unhelpful_probe.jsonl") as f:
    adversarial_data = [json.loads(line) for line in f]
samples = compute_residuals(results.calibrator, adversarial_data)

print(f"Samples: {len(samples)}")
print(f"Mean residual: {sum(s['residual'] for s in samples) / len(samples):.3f}")
print("\nResidual = Oracle - Calibrated")
print("  Negative = judge overestimated (fooled)")
print("  Positive = judge underestimated")

In [None]:
# Look at the worst overestimates - where the judge was most fooled
print("=" * 80)
print("WORST OVERESTIMATES: Judge gave high scores, oracle gave low scores")
print("=" * 80)

for i, s in enumerate(samples[:3]):
    print(f"\n--- Sample {i+1} | Residual: {s['residual']:.2f}")
    print(f"    Judge: {s['judge_score']:.2f} → Calibrated: {s['calibrated']:.2f} | Oracle: {s['oracle_label']:.2f}")
    print(f"\n    Prompt: {s['prompt'][:150]}...")
    print(f"\n    Response: {s['response'][:300]}...")

**What we see**: The adversarial prompt produces responses that *sound* helpful (confident, structured, detailed) but are actually wrong or misleading. The cheap judge is fooled by surface features; the oracle catches the substance.

**The fix**: This is exactly why you need calibration plus a transportability check. Raw judge scores would rank the adversarial prompt too high, and a calibrator fit on other data can miss the gap. CJE's transportability check flags this before you ship a bad decision.

## 4. Monitor Calibration Over Time

Calibration drifts. Periodically check it with fresh oracle labels. Here we simulate gradual drift:

In [None]:
# Simulate weekly monitoring with gradual drift
import random

def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, closing the file afterwards."""
    with open(path) as f:
        return [json.loads(line) for line in f]

base_data = load_jsonl("arena_sample/fresh_draws/base_responses.jsonl")
base_oracle = [r for r in base_data if r.get("oracle_label") is not None]
unhelpful_data = load_jsonl("arena_sample/probe_slice/unhelpful_probe.jsonl")

random.seed(42)
random.shuffle(base_oracle)

# Week 1: stable (pure base data)
week1 = base_oracle[:48]

# Week 2: starting to drift (65% base, 35% adversarial)
week2_base = base_oracle[48:79]
week2_adv = unhelpful_data[:17]
week2 = week2_base + week2_adv
random.shuffle(week2)

# Week 3: drifted (40% base, 60% adversarial) 
week3_base = base_oracle[79:99]
week3_adv = unhelpful_data[17:47]
week3 = week3_base + week3_adv
random.shuffle(week3)

weekly_audits = {
    "Week 1": audit_transportability(results.calibrator, week1),
    "Week 2": audit_transportability(results.calibrator, week2),
    "Week 3": audit_transportability(results.calibrator, week3),
}

for name, audit in weekly_audits.items():
    # Label each summary so the weeks are distinguishable in the output
    print(f"--- {name} ---")
    print(audit.summary())

In [None]:
plot_transport_comparison(weekly_audits, title="Weekly Calibration Check")
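The weekly batches above are built by hand from fixed slices. A small helper can generalize that recipe to any batch size and drift level. This is a sketch only: `mixed_batch` is a hypothetical function, not part of `cje`, and unlike the slices above it samples at random, so batches drawn for different weeks may share records.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mixed_batch(
    base: Sequence[T],
    shifted: Sequence[T],
    size: int,
    shifted_fraction: float,
    rng: random.Random,
) -> List[T]:
    """Draw `size` records, of which about `shifted_fraction` come from `shifted`."""
    if not 0.0 <= shifted_fraction <= 1.0:
        raise ValueError("shifted_fraction must be between 0 and 1")
    n_shifted = round(size * shifted_fraction)
    n_base = size - n_shifted
    if n_base > len(base) or n_shifted > len(shifted):
        raise ValueError("not enough records to build the requested batch")
    # Sample without replacement from each pool, then interleave.
    batch = rng.sample(list(base), n_base) + rng.sample(list(shifted), n_shifted)
    rng.shuffle(batch)
    return batch


# Toy pools standing in for base and adversarial records.
toy_base = [("base", i) for i in range(100)]
toy_shifted = [("adversarial", i) for i in range(48)]
week = mixed_batch(toy_base, toy_shifted, size=50, shifted_fraction=0.6, rng=random.Random(42))
print(sum(1 for kind, _ in week if kind == "adversarial"), "of", len(week), "are adversarial")
```

Passing an explicit `random.Random` instance keeps the simulation reproducible without touching the global seed.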

## Summary

```python
# Compare prompt variants
results = analyze_dataset(fresh_draws_dir="data/responses/")
results.plot_estimates()

# Check calibration transfers
audit = audit_transportability(results.calibrator, new_data)
print(audit.summary())  # PASS or FAIL

# Monitor over time
plot_transport_comparison({"Week 1": audit1, "Week 2": audit2, ...})
```

**PASS** = calibration valid, trust the estimates  
**FAIL** = something changed, investigate or recalibrate

### Learn More
- [Why your metrics lie](https://cimolabs.com/blog/metrics-lying) — The full explanation
- [Arena experiment](https://www.cimolabs.com/research/arena-experiment) — Benchmarks on 5,000 prompts
- [GitHub](https://github.com/cimo-labs/cje) — Documentation and source