# CJE Quick Start: Direct Mode

**Simple on-policy evaluation with judge-based scoring**

This notebook shows you:
1. What the data looks like
2. How to run Direct Mode (simplest!)
3. How to compare policies with confidence intervals

**What is Direct Mode?**

Direct Mode answers: *"Which policy performs best on this evaluation set?"*

- No logprobs needed
- No importance weighting
- Just: Generate responses → Judge them → Calibrate → Average

Perfect for:
- Quick policy comparisons
- Benchmarking on specific prompt sets
- When you don't need counterfactual estimates

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cimo-labs/cje/blob/main/examples/cje_direct_mode_intro.ipynb)

## Step 1: Install CJE

**Note for Colab users:** Colab ships with NumPy 2.3+, which breaks SciPy, so we install a compatible NumPy version first.

In [None]:
# Colab fix: Install compatible numpy first
!pip install -q 'numpy>=2.0,<2.1' --force-reinstall

# Install CJE from PyPI
!pip install -q cje-eval

# Verify installation
import cje
import numpy as np
print(f"✓ CJE version {cje.__version__} installed")
print(f"✓ NumPy version {np.__version__}")

## Step 2: Download Sample Data

**About the Arena Dataset**

We use a sample from the [LMSYS Chatbot Arena](https://huggingface.co/datasets/agie-ai/lmsys-chatbot_arena_conversations) dataset - real user conversations with human preference judgments. This dataset was used to validate CJE in our ablation studies.

**The Three Policies:**

1. **`clone`** (Baseline)
   - The original Arena responses
   - Our logging/baseline policy
   - Mean reward: ~0.76

2. **`parallel_universe_prompt`** (Experimental)
   - Modified system prompt: "You are a helpful assistant from a parallel universe"
   - Tests whether a quirky prompt helps or hurts
   - Mean reward: ~0.76 (similar to baseline)

3. **`unhelpful`** (Adversarial Control)
   - System prompt: "You are a very unhelpful assistant"
   - Intentionally bad responses for testing
   - Mean reward: ~0.14 (much worse, as expected)

**Why include an adversarial policy?**
- Validates that CJE correctly identifies bad policies
- Tests robustness when target policy differs significantly from baseline
- Demonstrates ESS diagnostics in the off-policy modes (unhelpful has poor overlap → low ESS); Direct Mode itself uses no importance weights

**Data generation:**
- We took Arena prompts and generated fresh responses from each policy
- Each response was scored by GPT-4 as judge (0-1 score)
- A subset has oracle labels (human preferences) for calibration

In [None]:
import urllib.request
from pathlib import Path

# Create data directory
DATA_DIR = Path("arena_sample")
FRESH_DRAWS_DIR = DATA_DIR / "fresh_draws"
FRESH_DRAWS_DIR.mkdir(parents=True, exist_ok=True)

# Base URL for sample data
BASE_URL = "https://raw.githubusercontent.com/cimo-labs/cje/main/examples/arena_sample"

# Download fresh draws for each policy
policies = {
    "clone": "clone_responses.jsonl",
    "parallel_universe_prompt": "parallel_universe_prompt_responses.jsonl",
    "unhelpful": "unhelpful_responses.jsonl"
}

for policy, filename in policies.items():
    url = f"{BASE_URL}/fresh_draws/{filename}"
    path = FRESH_DRAWS_DIR / filename
    print(f"Downloading {filename}...")
    urllib.request.urlretrieve(url, path)
    print(f"✓ Downloaded {filename}")

print(f"\n✓ All data downloaded to: {FRESH_DRAWS_DIR.absolute()}")

## Step 3: Understand the Data Structure

Let's look at what Direct Mode data looks like.

In [None]:
import json

# Load one policy's responses to inspect
with open(FRESH_DRAWS_DIR / "clone_responses.jsonl") as f:
    samples = [json.loads(line) for line in f]

print(f"Total samples: {len(samples)}")
print(f"\nExample sample (first one):")
print(json.dumps(samples[0], indent=2))

### What's in each sample?

**Required fields:**
- `prompt_id`: Unique identifier for the prompt
- `prompt`: The input text
- `response`: The model's generated response
- `judge_score`: LLM judge's evaluation score (0-1)

**Optional but valuable:**
- `oracle_label`: Ground truth label (0-1) for calibration
  - Having 5-10% of samples with oracle labels is usually enough!
  - CJE uses these to calibrate judge scores (AutoCal-R)

**Not needed for Direct Mode:**
- `base_policy_logprob`: Only needed for IPS/DR modes
- `target_policy_logprobs`: Only needed for IPS/DR modes
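
Before handing your own files to CJE, it can help to check each record against the field list above. The following is a minimal sketch, not part of the CJE API: `REQUIRED_FIELDS` and `validate_record` are hypothetical names, and the range check assumes scores lie in [0, 1] as described above.

```python
import json

# Required fields as listed above; oracle_label is optional.
REQUIRED_FIELDS = ("prompt_id", "prompt", "response", "judge_score")

def validate_record(record):
    """Return a list of problems found in one fresh-draw record (empty list = OK)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    # Both score fields are expected to be numbers in [0, 1] when present.
    for name in ("judge_score", "oracle_label"):
        value = record.get(name)
        if value is None:
            continue  # absent or null: already reported above if it was required
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name} is not numeric: {value!r}")
        elif not 0.0 <= value <= 1.0:
            problems.append(f"{name} outside [0, 1]: {value}")
    return problems

good = json.loads('{"prompt_id": "prompt_001", "prompt": "What is 2+2?", '
                  '"response": "2+2 equals 4.", "judge_score": 0.95}')
bad = {"prompt_id": "prompt_002", "prompt": "Hi", "judge_score": 1.7}

print(validate_record(good))  # no problems
print(validate_record(bad))   # missing response, score out of range
```

Running a check like this over every line of a file before analysis surfaces malformed records early, rather than partway through calibration.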

In [None]:
# Check oracle coverage
n_with_oracle = sum(1 for s in samples if s.get('oracle_label') is not None)
coverage = n_with_oracle / len(samples) if samples else 0

print(f"Oracle label coverage: {n_with_oracle}/{len(samples)} ({coverage:.1%})")
print(f"\nThis is {'plenty' if coverage >= 0.05 else 'a bit low'} for calibration!")
print(f"Tip: Even 5-10% oracle coverage often gives good calibration.")

# Show judge score distribution
judge_scores = [s['judge_score'] for s in samples if 'judge_score' in s]
print(f"\nJudge scores (clone policy):")
print(f"  Mean: {np.mean(judge_scores):.3f}")
print(f"  Std:  {np.std(judge_scores):.3f}")
print(f"  Range: [{min(judge_scores):.3f}, {max(judge_scores):.3f}]")

## Step 4: Run Direct Mode

Now let's run CJE in Direct Mode to compare our three policies.

**What CJE does:**
1. Loads fresh draws for all policies
2. Calibrates judge scores using oracle labels (AutoCal-R)
3. Computes mean calibrated reward per policy
4. Reports confidence intervals accounting for all uncertainty

In [None]:
from cje import analyze_dataset

# Run Direct Mode - just point to fresh draws directory!
results = analyze_dataset(
    fresh_draws_dir=str(FRESH_DRAWS_DIR),
    estimator="auto",  # Auto-detects Direct mode
    verbose=True,
)

print("\n" + "="*70)
print("Direct Mode Results")
print("="*70)

### Interpreting the Output

Let's break down what we see above:

**Calibration Info:**
- CJE found the samples that carry oracle labels
- It fit an isotonic regression mapping `judge_score` → `oracle_label`
- RMSE measures how closely the calibrated judge scores match the oracle labels (lower is better)
- The calibration mode (monotone/linear/none) is auto-selected based on the data

**Why calibration matters:**
- Raw judge scores may be biased (too high or too low)
- Calibration maps judges to oracle scale
- This makes estimates comparable to production metrics
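
To make the calibration idea concrete, here is a minimal sketch of isotonic calibration in plain NumPy, using the textbook pool-adjacent-violators algorithm. It is an illustration only, not CJE's AutoCal-R implementation: `fit_isotonic` and `calibrate` are hypothetical helper names, and the scores are made up.

```python
import numpy as np

def fit_isotonic(x, y):
    """Pool-adjacent-violators: least-squares non-decreasing fit of y on x."""
    order = np.argsort(x, kind="stable")
    xs = np.asarray(x, dtype=float)[order]
    ys = np.asarray(y, dtype=float)[order]
    # Each block is [sum of labels, count]; merge neighbours whose means decrease.
    blocks = []
    for value in ys:
        blocks.append([value, 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = np.concatenate([np.full(count, total / count) for total, count in blocks])
    return xs, fitted

def calibrate(scores, xs, fitted):
    """Map judge scores onto the oracle scale by linear interpolation of the fit.

    Assumes the judge scores in xs are distinct (np.interp expects increasing xp).
    """
    return np.interp(scores, xs, fitted)

# Toy labeled subset: judge scores with their oracle labels (made-up numbers).
judge_labeled = [0.2, 0.4, 0.5, 0.7, 0.9]
oracle_labeled = [0.0, 0.5, 0.3, 0.6, 1.0]
xs, fitted = fit_isotonic(judge_labeled, oracle_labeled)

# Apply the learned map to all judge scores of one policy, then average.
policy_scores = np.array([0.3, 0.5, 0.8])
calibrated = calibrate(policy_scores, xs, fitted)
print("raw mean:       ", policy_scores.mean())
print("calibrated mean:", calibrated.mean())
```

In this toy data the judge ranks the second and third labeled samples in the wrong order, so the fit pools them to a shared value. The calibrated mean is what ends up on the oracle scale.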

## Step 5: View and Compare Results

Now let's look at the policy estimates and compare them.

In [None]:
# Extract results
policies = results.metadata['target_policies']
estimates = results.estimates
std_errors = results.standard_errors

# Display results table
print(f"{'Policy':<35} {'Estimate':<12} {'Std Error':<12} {'95% CI':<20}")
print("-" * 79)
for i, policy in enumerate(policies):
    est = estimates[i]
    se = std_errors[i]
    ci_low = est - 1.96 * se
    ci_high = est + 1.96 * se
    print(f"{policy:<35} {est:>6.3f}       {se:>6.3f}       [{ci_low:.3f}, {ci_high:.3f}]")

print("\n💡 Interpretation:")
print("   • Estimate = calibrated mean reward on eval set")
print("   • Std Error accounts for sampling + calibration uncertainty")
print("   • 95% CI = Estimate ± 1.96 × Std Error (normal approximation)")

### Find the Best Policy

In [None]:
# Find best policy
best_idx = results.best_policy()
best_policy = policies[best_idx]
best_est = estimates[best_idx]
best_se = std_errors[best_idx]

print(f"🏆 Best policy: {best_policy}")
print(f"   Estimate: {best_est:.3f} ± {best_se:.3f}")
print(f"   95% CI: [{best_est - 1.96*best_se:.3f}, {best_est + 1.96*best_se:.3f}]")

### Compare Against Baseline

Let's compare each policy to the baseline using proper statistical inference.

In [None]:
# Set baseline
baseline_policy = 'clone'
baseline_idx = policies.index(baseline_policy)

print(f"📊 Comparison to baseline ({baseline_policy}):")
print(f"   Baseline: {estimates[baseline_idx]:.3f} ± {std_errors[baseline_idx]:.3f}")
print()
print(f"{'Policy':<35} {'Δ vs baseline':<15} {'p-value':<12} {'Significant?'}")
print("-" * 72)

for i, policy in enumerate(policies):
    if i == baseline_idx:
        print(f"{policy:<35} {'(baseline)':<15} {'':<12} {''}")
        continue
    
    # Use built-in comparison (properly accounts for paired data)
    comp = results.compare_policies(i, baseline_idx)
    
    sig_text = "✓ Yes (p<0.05)" if comp['significant'] else "No"
    
    print(f"{policy:<35} {comp['difference']:+.4f}          "
          f"{comp['p_value']:.4f}      {sig_text}")

print("\n💡 Interpretation:")
print("   • Δ > 0: Policy beats baseline")
print("   • Δ < 0: Policy worse than baseline")
print("   • Significant (p<0.05): Difference is statistically reliable")
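
Why pairing matters: when every policy answers the same prompts, per-prompt differences cancel out shared prompt difficulty. The sketch below shows the textbook paired z-test on synthetic data. It is an assumption-laden illustration, not the procedure inside `compare_policies`; `paired_z_test` is a hypothetical helper and the rewards are simulated.

```python
import math
import numpy as np

def paired_z_test(rewards_a, rewards_b):
    """Two-sided z-test on per-prompt differences (a minus b)."""
    diffs = np.asarray(rewards_a, dtype=float) - np.asarray(rewards_b, dtype=float)
    mean_diff = diffs.mean()
    se = diffs.std(ddof=1) / math.sqrt(len(diffs))
    z = mean_diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return mean_diff, se, p_value

# Synthetic rewards: both policies share a per-prompt difficulty component.
rng = np.random.default_rng(0)
n = 200
difficulty = rng.normal(0.0, 0.2, size=n)
rewards_a = 0.60 + difficulty + rng.normal(0.0, 0.05, size=n)
rewards_b = 0.55 + difficulty + rng.normal(0.0, 0.05, size=n)

mean_diff, se_paired, p_value = paired_z_test(rewards_a, rewards_b)

# Unpaired standard error ignores the shared difficulty and is therefore wider.
se_unpaired = math.sqrt(rewards_a.var(ddof=1) / n + rewards_b.var(ddof=1) / n)

print(f"difference:  {mean_diff:+.4f}")
print(f"paired SE:   {se_paired:.4f}")
print(f"unpaired SE: {se_unpaired:.4f}")
print(f"p-value:     {p_value:.4g}")
```

The paired standard error is smaller because the shared difficulty term drops out of the differences, so the same data can detect smaller gaps between policies.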

## Step 6: Check Diagnostics

Always check diagnostics to validate your results!

In [None]:
# Display key diagnostics
diag = results.diagnostics

print("Diagnostics Summary")
print("="*70)
print(f"Mode: {results.metadata.get('mode', 'N/A')}")
print(f"Estimator: {results.metadata.get('estimator', 'N/A')}")
print(f"\nSample sizes:")
print(f"  Total samples: {diag.n_samples_total}")
print(f"  Valid samples: {diag.n_samples_valid}")
print(f"  Samples per policy: {diag.n_samples_used}")

print(f"\nCalibration:")
if diag.calibration_rmse is not None:
    print(f"  RMSE: {diag.calibration_rmse:.4f}")
    if diag.calibration_r2 is not None:
        print(f"  R²: {diag.calibration_r2:.4f}")
    print(f"  Oracle labels used: {diag.n_oracle_labels}")
    if diag.calibration_coverage is not None:
        print(f"  Coverage: {diag.calibration_coverage:.1%}")
else:
    print("  No calibration performed (no oracle labels)")

print("\n✓ Diagnostics look good!" if diag.weight_status.name == 'GOOD' else "\n⚠ Check diagnostics carefully")

## Step 7: Visualize Results

CJE provides built-in visualization to help communicate your findings.

In [None]:
# Import matplotlib and the visualization function from cje
import matplotlib.pyplot as plt
from cje import plot_policy_estimates

# Extract estimates and standard errors as dictionaries
estimates_dict = {policy: float(estimates[i]) for i, policy in enumerate(policies)}
ses_dict = {policy: float(std_errors[i]) for i, policy in enumerate(policies)}

# Create forest plot with confidence intervals
fig = plot_policy_estimates(
    estimates=estimates_dict,
    standard_errors=ses_dict,
    figsize=(10, 4)
)

# Display
plt.tight_layout()
plt.show()

print("\n✓ Forest plot shows point estimates with 95% confidence intervals")
print("  Green dot = best policy")

### Quick Plotting with Convenience Method

You can also use the convenience method on `EstimationResult` for quick plotting:

In [None]:
# Even simpler: use the convenience method
fig = results.plot_estimates()
plt.show()

print("\n💡 The .plot_estimates() method automatically extracts data from the result object")

### Jupyter Auto-Display

In Jupyter notebooks (including Colab), results automatically display as formatted HTML tables when you evaluate them:

In [None]:
# Just evaluate the result to see a nice HTML table
results

## Summary: Direct Mode Quick Reference

### When to Use Direct Mode

✅ **Use Direct Mode when:**
- You want quick on-policy comparisons
- You can generate fresh responses from each policy
- You don't need counterfactual deployment estimates
- You want the simplest CJE workflow

❌ **Don't use Direct Mode when:**
- You want to estimate counterfactual deployment value
- You can't generate fresh responses (use IPS instead)
- You need to reuse logged data (use IPS/DR instead)

### Required Data

```
fresh_draws/
  policy_a_responses.jsonl  # One file per policy
  policy_b_responses.jsonl
  policy_c_responses.jsonl
```

Each JSONL line is one JSON object (pretty-printed here; in the file it sits on a single line). `oracle_label` is optional but recommended (5-10% coverage is fine):
```json
{
  "prompt_id": "prompt_001",
  "prompt": "What is 2+2?",
  "response": "2+2 equals 4.",
  "judge_score": 0.95,
  "oracle_label": 1.0
}
```

### Code Template

```python
from cje import analyze_dataset

# Run Direct Mode
results = analyze_dataset(
    fresh_draws_dir="path/to/fresh_draws",
    estimator="auto",  # Auto-detects Direct mode
    verbose=True,
)

# View results
for i, policy in enumerate(results.metadata['target_policies']):
    print(f"{policy}: {results.estimates[i]:.3f} ± {results.standard_errors[i]:.3f}")

# Find best policy
best_idx = results.best_policy()
print(f"Best: {results.metadata['target_policies'][best_idx]}")

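# Compare to baseline (indices refer to positions in target_policies)
policy_names = results.metadata['target_policies']
baseline_idx = policy_names.index("policy_a")  # replace with your baseline policy
policy_idx = policy_names.index("policy_b")    # replace with the policy to compare
comparison = results.compare_policies(policy_idx, baseline_idx)
print(f"Δ: {comparison['difference']:.3f}, p={comparison['p_value']:.3f}")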
```

### Key Diagnostics to Check

1. **Oracle coverage**: 5-10% is usually enough for good calibration
2. **Calibration RMSE**: Lower is better (judges match oracle well)
3. **Sample sizes**: Make sure n ≥ 100 per policy for reliable CIs
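
The first and third checks can be run on the raw files before any analysis. Below is a minimal sketch that assumes the `*_responses.jsonl` layout shown above; `summarize_fresh_draws` is a hypothetical helper (not part of CJE), and its default thresholds simply mirror the rules of thumb in this list.

```python
import json
import tempfile
from pathlib import Path

def summarize_fresh_draws(fresh_draws_dir, min_samples=100, min_coverage=0.05):
    """Per-policy sample count and oracle coverage for *_responses.jsonl files."""
    summary = {}
    for path in sorted(Path(fresh_draws_dir).glob("*_responses.jsonl")):
        with open(path) as f:
            records = [json.loads(line) for line in f if line.strip()]
        n_oracle = sum(1 for r in records if r.get("oracle_label") is not None)
        coverage = n_oracle / len(records) if records else 0.0
        policy = path.name[: -len("_responses.jsonl")]
        summary[policy] = {
            "n": len(records),
            "oracle_coverage": coverage,
            "enough_samples": len(records) >= min_samples,
            "enough_oracle": coverage >= min_coverage,
        }
    return summary

# Demo on a throwaway directory with 120 synthetic records, 1 in 10 labeled.
with tempfile.TemporaryDirectory() as tmp:
    rows = [{"prompt_id": f"p{i}", "prompt": "q", "response": "a",
             "judge_score": 0.5, "oracle_label": 1.0 if i % 10 == 0 else None}
            for i in range(120)]
    (Path(tmp) / "clone_responses.jsonl").write_text(
        "\n".join(json.dumps(r) for r in rows) + "\n")
    demo_summary = summarize_fresh_draws(tmp)

print(demo_summary)
```

Pointing the helper at your own `fresh_draws/` directory gives a quick pre-flight table; calibration RMSE still has to come from the CJE diagnostics after the run.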

---

## Next Steps: Off-Policy Evaluation

Ready for more advanced counterfactual estimation?

**➡️ Continue to the [OPE Methods Notebook](https://colab.research.google.com/github/cimo-labs/cje/blob/main/examples/cje_ope_methods.ipynb)**

Learn about:
- **IPS Mode**: Reuse logged data for counterfactual estimates
- **DR Mode**: Combine logged + fresh data for best accuracy
- **ESS Diagnostics**: When to trust IPS vs switch to DR
- **Orthogonality**: Validating doubly robust estimates

---

## Resources

- **GitHub**: [github.com/cimo-labs/cje](https://github.com/cimo-labs/cje)
- **Documentation**: See README and examples/
- **Paper**: Coming soon with full technical details
- **Questions?**: Open an issue on GitHub

Happy evaluating! 🎉