# CJE Advanced: Off-Policy Evaluation

**IPS and DR modes for counterfactual policy evaluation**

This notebook covers advanced off-policy evaluation methods:

1. **IPS Mode**: Counterfactual estimates from logged data (importance sampling)
2. **DR Mode**: Doubly robust estimation (IPS plus an outcome model; usually the most accurate)
3. **Diagnostics**: ESS, overlap, weight analysis
4. **Comparison**: When to use IPS vs DR vs Direct

---

## Prerequisites

**New to CJE?** Start with [`cje_tutorial.ipynb`](cje_tutorial.ipynb) for Direct mode basics.

## Off-Policy Evaluation

**Direct mode** answers: "Which policy is best on this eval set?"

**IPS and DR** answer the counterfactual: "What would our KPI be if we deployed policy π' instead of π₀?"

**Key difference**: Off-policy methods estimate deployment value, not just eval set performance.

**When you need off-policy evaluation**:
- Reusing logged data from production/experiments
- Estimating deployment performance without actual deployment
- A/B test analysis with shared logging policy
- Cost-effective evaluation (reuse existing logs)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cimo-labs/cje/blob/main/examples/cje_advanced.ipynb)

## Step 1: Setup

Install CJE and download the Arena sample data.

In [None]:
# Install CJE
!pip install -q 'numpy>=2.0,<2.1' --force-reinstall
!pip install --no-cache-dir --upgrade cje-eval

import cje
print(f"✓ CJE version {cje.__version__} installed")

In [None]:
# Download Arena sample data
import urllib.request
from pathlib import Path

DATA_DIR = Path("arena_sample")
DATA_DIR.mkdir(exist_ok=True)
(DATA_DIR / "fresh_draws").mkdir(exist_ok=True)

BASE_URL = "https://raw.githubusercontent.com/cimo-labs/cje/main/examples/arena_sample"

# Download logged data (for IPS/DR modes)
print("Downloading logged_data.jsonl...")
urllib.request.urlretrieve(
    f"{BASE_URL}/logged_data.jsonl",
    DATA_DIR / "logged_data.jsonl"
)
print("✓ Downloaded logged_data.jsonl")

# Download fresh draws (for DR mode)
fresh_draw_files = {
    "base": "base_responses.jsonl",
    "clone": "clone_responses.jsonl",
    "parallel_universe_prompt": "parallel_universe_prompt_responses.jsonl",
    "unhelpful": "unhelpful_responses.jsonl"
}

for policy, filename in fresh_draw_files.items():
    print(f"Downloading fresh_draws/{filename}...")
    urllib.request.urlretrieve(
        f"{BASE_URL}/fresh_draws/{filename}",
        DATA_DIR / "fresh_draws" / filename
    )

print("\n✓ All data downloaded!")

## Step 2: Inspect Logged Data

For off-policy evaluation, we need logged data with:
- Responses from a **base/logging policy** π₀
- **Logprobs** from both base and target policies
- **Judge scores** for all responses
- **Oracle labels** (subset) for calibration
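
As a rough illustration of these requirements, one line of `logged_data.jsonl` might look like the record below. The field names match those read by the inspection cell in this step; every value is invented for illustration and is not taken from the Arena sample.

```python
import json

# Illustrative record only: field names follow this notebook's inspection code,
# but every value here is invented, not taken from the Arena sample.
example_record = {
    "prompt": "Explain what importance sampling is.",
    "response": "Importance sampling reweights samples drawn from one distribution...",
    "judge_score": 0.82,            # judge score, present on every row
    "oracle_label": None,           # oracle label, present on a subset only
    "base_policy_logprob": -42.7,   # log π₀(response | prompt)
    "target_policy_logprobs": {     # log π'(response | prompt), one entry per target policy
        "clone": -42.9,
        "unhelpful": -61.3,
    },
}

line = json.dumps(example_record)   # one JSONL line
parsed = json.loads(line)           # what the loading loop yields per line
print(sorted(parsed.keys()))
```

Rows without an oracle label carry `None` (JSON `null`) or omit the key, which is why the coverage check uses `s.get('oracle_label') is not None`.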

In [None]:
import json

# Load logged data
with open(DATA_DIR / "logged_data.jsonl") as f:
    logged_samples = [json.loads(line) for line in f]

print(f"Logged data: {len(logged_samples)} samples")
print(f"\nExample sample:")
sample = logged_samples[0]
print(f"  Prompt: {sample['prompt'][:80]}...")
print(f"  Response: {sample['response'][:100]}...")
print(f"  Judge score: {sample['judge_score']}")
print(f"  Oracle label: {sample.get('oracle_label', 'N/A')}")
print(f"  Base logprob: {sample['base_policy_logprob']}")
print(f"  Target policies: {list(sample['target_policy_logprobs'].keys())}")

# Check coverage
n_with_oracle = sum(1 for s in logged_samples if s.get('oracle_label') is not None)
print(f"\nOracle coverage: {n_with_oracle}/{len(logged_samples)} ({n_with_oracle/len(logged_samples):.1%})")
print(f"\n💡 We'll use these {n_with_oracle} oracle labels for AutoCal-R")

## Step 3: IPS Mode - Importance Sampling

**Inverse Propensity Score (IPS)** reweights logged data to estimate target policy performance.

### How IPS Works

1. **Compute importance weights**: `W = π'(a|x) / π₀(a|x)` using logprobs
2. **Calibrate rewards**: Judge scores → oracle scale (AutoCal-R)
3. **Stabilize weights**: Monotone projection (SIMCal-W)
4. **Estimate**: `V̂(π') = (1/n) Σ W_i · R_i`
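
Steps 1 and 4 can be sketched in a few lines of plain Python. This is a minimal illustration, not CJE's implementation: it assumes the rewards are already calibrated and skips weight stabilization (steps 2 and 3) entirely. The function name `ips_estimate` and all numbers are hypothetical.

```python
import math

def ips_estimate(base_logprobs, target_logprobs, rewards):
    """Plain IPS: mean of W_i * R_i, with W_i = exp(log π'(a|x) - log π₀(a|x))."""
    # Working in log space first avoids dividing two tiny probabilities.
    weights = [math.exp(t - b) for b, t in zip(base_logprobs, target_logprobs)]
    n = len(rewards)
    estimate = sum(w * r for w, r in zip(weights, rewards)) / n
    return estimate, weights

# Toy inputs (invented): four logged samples.
base_lp = [-10.0, -12.0, -9.0, -11.0]     # log π₀ of each logged response
target_lp = [-10.0, -12.5, -8.5, -11.0]   # log π' of the same responses
rewards = [0.8, 0.4, 0.9, 0.6]            # assumed already on the oracle scale

estimate, weights = ips_estimate(base_lp, target_lp, rewards)
print(f"weights  = {[round(w, 3) for w in weights]}")
print(f"IPS est. = {estimate:.3f}")
```

When the target and base logprobs are identical, every weight is 1 and the estimate reduces to the plain mean of the rewards.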

### When to Use IPS

**Good for**:
- Logged data from production/experiments
- Target policies similar to base policy (good overlap)
- Cost-effective evaluation (no fresh generations needed)

**Watch out for**:
- Poor overlap (ESS < 10%): Switch to DR or regenerate
- Target policies very different from base: High variance

In [None]:
from cje import analyze_dataset

# IPS mode: Logged data only
results_ips = analyze_dataset(
    logged_data_path=str(DATA_DIR / "logged_data.jsonl"),
    estimator="auto",  # Auto-selects calibrated-ips
    verbose=True,
)

print("\n" + "="*70)
print("IPS Results")
print("="*70)
print(f"Mode: {results_ips.metadata['mode']}")
print(f"Estimator: {results_ips.metadata['estimator']}")
print(f"Calibration: {results_ips.metadata.get('calibration', 'none')}")
print()

# Show estimates
policies_ips = results_ips.metadata['target_policies']
print(f"{'Policy':<30} {'Estimate':<12} {'Std Error':<12} {'95% CI':<20}")
print("-" * 74)
for i, policy in enumerate(policies_ips):
    est = results_ips.estimates[i]
    se = results_ips.standard_errors[i]
    ci_low = est - 1.96 * se
    ci_high = est + 1.96 * se
    print(f"{policy:<30} {est:>6.3f}       {se:>6.3f}       [{ci_low:.3f}, {ci_high:.3f}]")

### Check IPS Diagnostics

**ESS (Effective Sample Size)** is critical for IPS reliability:

- **ESS ≥ 50%**: Excellent overlap ✓
- **ESS ∈ [10%, 50%)**: Moderate overlap (consider DR)
- **ESS < 10%**: Poor overlap (use DR or regenerate)

ESS measures how many "effective" samples contribute to the estimate after reweighting.
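
One common definition, and the one assumed in this sketch, is Kish's effective sample size, `(Σ W_i)² / Σ W_i²`, reported as a fraction of `n`. Whether CJE computes its `ess_per_policy` diagnostic in exactly this way is an assumption here; the helper name `ess_fraction` is hypothetical.

```python
def ess_fraction(weights):
    """Kish effective sample size as a fraction of n: (sum w)^2 / (n * sum w^2)."""
    n = len(weights)
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / (n * total_sq)

uniform = [1.0] * 10           # every sample contributes equally
skewed = [9.1] + [0.1] * 9     # one sample carries almost all of the weight

print(f"uniform: {ess_fraction(uniform):.1%}")
print(f"skewed:  {ess_fraction(skewed):.1%}")
```

Uniform weights give an ESS of 100%, while the skewed example falls to roughly 12%: ten logged samples behave like little more than one.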

In [None]:
# Check ESS diagnostics
print("IPS Diagnostics")
print("="*70)

for policy in policies_ips:
    ess = results_ips.diagnostics.ess_per_policy.get(policy, 0.0)
    
    # Traffic light assessment
    if ess >= 0.5:
        status = "✓ EXCELLENT"
    elif ess >= 0.1:
        status = "⚠ MODERATE (consider DR)"
    else:
        status = "✗ POOR (use DR or regenerate)"
    
    print(f"\n{policy}:")
    print(f"  ESS: {ess:.1%} {status}")
    
    # Show weight statistics
    max_weight = results_ips.diagnostics.max_weight_per_policy.get(policy)
    if max_weight is not None:
        print(f"  Max weight: {max_weight:.2f}")

print("\n💡 Low ESS means estimates dominated by a few samples with high weights.")
print("   DR mode can dramatically improve reliability in this case.")

## Step 4: DR Mode - Doubly Robust

**Doubly Robust (DR)** combines IPS with outcome modeling for improved accuracy.

### How DR Works

1. **Everything from IPS**: Importance weights + calibrated rewards
2. **Outcome model**: Train ĝ(S) to predict rewards on fresh draws
3. **Combine**: `V̂_DR(π') = (1/n) Σ [W_i · (R_i - ĝ(S_i)) + ĝ(S_i)]`
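
The combination step can be sketched as follows. This is an illustration under stated assumptions, not CJE's `stacked-dr` estimator: the function name `dr_estimate` is hypothetical, the numbers are invented, and the sketch separates the two places the outcome model is used, following the standard DR form in which the direct term is evaluated on fresh draws from the target policy and the correction term on the logged samples. Passing the same list for `g_logged` and `g_fresh` reproduces the formula exactly as written above.

```python
def dr_estimate(weights, rewards, g_logged, g_fresh):
    """Doubly robust estimate: outcome-model term plus a weighted residual correction.

    weights  : importance weights W_i on the logged samples
    rewards  : calibrated rewards R_i on the logged samples
    g_logged : outcome-model predictions for the logged samples
    g_fresh  : outcome-model predictions for fresh draws from the target policy
    """
    n = len(rewards)
    direct = sum(g_fresh) / len(g_fresh)
    correction = sum(w * (r - g) for w, r, g in zip(weights, rewards, g_logged)) / n
    return direct + correction

# Toy inputs (invented).
weights = [1.0, 0.6, 1.6, 0.8]
rewards = [0.8, 0.4, 0.9, 0.6]
g_logged = [0.7, 0.5, 0.8, 0.6]
g_fresh = [0.75, 0.65, 0.8, 0.7]

print(f"DR est. = {dr_estimate(weights, rewards, g_logged, g_fresh):.3f}")
```

Two limiting cases are worth noting: with an all-zero outcome model the function returns the plain IPS estimate, and with a perfect outcome model (`g_logged == rewards`) the correction term is zero regardless of the weights.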

### Double Robustness Property

DR is **consistent** if *either*:
- Importance weights are correct (overlap holds), OR
- Outcome model is correct

You don't need both! This makes DR more robust than IPS alone.
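
A short derivation sketch shows why either condition suffices. The notation is an assumption layered on the DR formula: $\hat g_{\mathrm{fresh},i}$ is the outcome-model prediction on a fresh draw from π', $\mathbb{E}_0$ is the expectation under the logging policy π₀, and $V(\pi')$ is the true target value. It argues in expectation only and ignores finite-sample and estimation effects.

```latex
\hat V_{\mathrm{DR}}(\pi')
  = \underbrace{\frac{1}{n}\sum_{i=1}^{n} \hat g_{\mathrm{fresh},i}}_{\text{direct term}}
  + \underbrace{\frac{1}{n}\sum_{i=1}^{n} W_i\,\bigl(R_i - \hat g(S_i)\bigr)}_{\text{correction term}}

% Case 1: outcome model correct (residuals have mean zero given what W depends on).
% The correction vanishes in expectation for any weights.
\mathbb{E}_0\!\left[\,W\,\bigl(R - \hat g(S)\bigr)\right] = 0
\quad\Longrightarrow\quad
\mathbb{E}\bigl[\hat V_{\mathrm{DR}}\bigr] = \mathbb{E}_{\pi'}\bigl[\hat g\bigr] = V(\pi')

% Case 2: weights correct. The reweighted outcome model cancels the direct term,
% leaving the IPS estimand.
\mathbb{E}_0\!\left[\,W\,\hat g(S)\right] = \mathbb{E}_{\pi'}\bigl[\hat g(S)\bigr]
\quad\Longrightarrow\quad
\mathbb{E}\bigl[\hat V_{\mathrm{DR}}\bigr] = \mathbb{E}_0\bigl[\,W R\,\bigr] = V(\pi')
```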

### When to Use DR

**Perfect for**:
- Low ESS in IPS mode (< 50%)
- Can generate fresh responses
- Want maximum accuracy
- Production deployment decisions

**Benefits**:
- Lower variance than IPS (tighter CIs)
- More robust to model misspecification
- Best statistical properties

In [None]:
# DR mode: Logged data + fresh draws
results_dr = analyze_dataset(
    logged_data_path=str(DATA_DIR / "logged_data.jsonl"),
    fresh_draws_dir=str(DATA_DIR / "fresh_draws"),
    estimator="auto",  # Auto-selects stacked-dr
    estimator_config={"parallel": False},  # Disable parallel for Colab
    verbose=True,
)

print("\n" + "="*70)
print("DR Results")
print("="*70)
print(f"Mode: {results_dr.metadata['mode']}")
print(f"Estimator: {results_dr.metadata['estimator']}")
print()

# Show estimates
policies_dr = results_dr.metadata['target_policies']
print(f"{'Policy':<30} {'Estimate':<12} {'Std Error':<12} {'95% CI':<20}")
print("-" * 74)
for i, policy in enumerate(policies_dr):
    est = results_dr.estimates[i]
    se = results_dr.standard_errors[i]
    ci_low = est - 1.96 * se
    ci_high = est + 1.96 * se
    print(f"{policy:<30} {est:>6.3f}       {se:>6.3f}       [{ci_low:.3f}, {ci_high:.3f}]")

## Step 5: Compare IPS vs DR

Let's see how much DR improves over IPS.

In [None]:
# Compare standard errors (smaller is better)
print("Standard Error Comparison: IPS vs DR")
print("="*70)
print(f"{'Policy':<30} {'IPS SE':<12} {'DR SE':<12} {'Improvement':<15}")
print("-" * 69)

for i, policy in enumerate(policies_dr):
    # Find policy in IPS results
    if policy in policies_ips:
        ips_idx = policies_ips.index(policy)
        ips_se = results_ips.standard_errors[ips_idx]
        dr_se = results_dr.standard_errors[i]
        
        if dr_se < ips_se:
            improvement = f"↓ {(1 - dr_se/ips_se)*100:.0f}% SE"
        else:
            improvement = "(no reduction)"
        
        print(f"{policy:<30} {ips_se:>6.3f}       {dr_se:>6.3f}       {improvement:<15}")

print("\n💡 DR typically reduces standard errors by 20-60% compared to IPS.")
print("   Larger improvements when ESS is low or outcome model fits well.")

## Step 6: Mode Comparison Summary

Let's compare all three modes side-by-side.

**Note**: For Direct mode, we'll run it quickly here (see `cje_tutorial.ipynb` for details).

In [None]:
# Run Direct mode for comparison
results_direct = analyze_dataset(
    fresh_draws_dir=str(DATA_DIR / "fresh_draws"),
    estimator="auto",
    verbose=False,
)

policies_direct = results_direct.metadata['target_policies']

# Compare all three modes
print("Mode Comparison: Direct vs IPS vs DR")
print("="*70)
print(f"{'Policy':<30} {'Direct':<10} {'IPS':<10} {'DR':<10}")
print("-" * 60)

for policy in policies_dr:
    # Get estimates from each mode
    direct_est = "N/A"
    if policy in policies_direct:
        direct_idx = policies_direct.index(policy)
        direct_est = f"{results_direct.estimates[direct_idx]:.3f}"
    
    ips_est = "N/A"
    if policy in policies_ips:
        ips_idx = policies_ips.index(policy)
        ips_est = f"{results_ips.estimates[ips_idx]:.3f}"
    
    dr_idx = policies_dr.index(policy)
    dr_est = f"{results_dr.estimates[dr_idx]:.3f}"
    
    print(f"{policy:<30} {direct_est:>7}    {ips_est:>7}    {dr_est:>7}")

print("\n" + "="*70)
print("Interpretation:")
print("="*70)
print("- Direct mode: Performance on *this* eval set")
print("- IPS/DR modes: Estimated *deployment* performance")
print("- Differences are expected! They answer different questions.")
print("\n💡 Use Direct for eval set rankings, IPS/DR for deployment decisions.")

## Step 7: Visualization

Visualize DR results with confidence intervals.

In [None]:
import matplotlib.pyplot as plt

# Forest plot
fig = results_dr.plot_estimates()
plt.title("DR Mode: Policy Estimates with 95% CIs")
plt.tight_layout()
plt.show()

## Summary: Choosing the Right Mode

| Mode | Data Required | Estimand | Best For |
|------|--------------|----------|----------|
| **Direct** | Fresh draws | "Best on eval set" | Quick comparisons, leaderboards |
| **IPS** | Logged data + logprobs | "Deployment value" | Reusing logs, good overlap (ESS ≥ 50%) |
| **DR** | Logged + fresh draws | "Deployment value" | Low ESS, max accuracy, production decisions |

### Decision Tree

1. **Do you need deployment estimates?**
   - No → Use **Direct mode**
   - Yes → Continue

2. **Can you generate fresh responses?**
   - Yes → Use **DR mode** (most accurate)
   - No → Use **IPS mode** (check ESS!)

3. **If using IPS, check ESS:**
   - ESS ≥ 50%: Great, IPS is reliable
   - 10% ≤ ESS < 50%: Consider generating fresh draws for DR
   - ESS < 10%: Definitely use DR or regenerate
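
The decision tree above can be written as a small helper. `choose_mode` is a hypothetical function for illustration and is not part of the `cje` package; the thresholds are the ESS cutoffs used throughout this notebook.

```python
def choose_mode(needs_deployment_estimate, can_generate_fresh, ess=None):
    """Hypothetical helper mirroring the decision tree; not part of the cje package.

    ess is the effective sample size as a fraction in [0, 1], if already known.
    """
    if not needs_deployment_estimate:
        return "direct"
    if can_generate_fresh:
        return "dr"
    # Only logged data is available, so IPS is the only off-policy option.
    if ess is None:
        return "ips (check ESS)"
    if ess >= 0.5:
        return "ips"
    if ess >= 0.1:
        return "ips (consider fresh draws for DR)"
    return "ips unreliable (use DR or regenerate)"

print(choose_mode(False, True))
print(choose_mode(True, True))
print(choose_mode(True, False, ess=0.62))
print(choose_mode(True, False, ess=0.04))
```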

### Key Takeaways

✓ **IPS**: Reweights logged data using importance weights (requires logprobs)

✓ **DR**: Adds outcome modeling to IPS (needs fresh draws, most robust)

✓ **ESS**: Critical diagnostic for IPS reliability (aim for ≥ 50%)

✓ **Double robustness**: DR works if *either* weights or outcome model is right

✓ **Variance reduction**: DR typically reduces SE by 20-60% vs IPS

### Next Steps

**Documentation**:
- [CJE README](https://github.com/cimo-labs/cje)
- [Blog Post: Arena Experiment](https://www.cimolabs.com/blog/arena-experiment)
- [API Reference](https://github.com/cimo-labs/cje/tree/main/cje/interface)

**Want more?**
- Explore calibration plots: `plot_calibration_comparison()`
- Weight diagnostics: `plot_weight_dashboard_summary()`
- DR diagnostics: `plot_dr_dashboard()`

Questions? Open an issue on [GitHub](https://github.com/cimo-labs/cje/issues)!