# CJE Calibration Demo

This notebook demonstrates how to use CJE (Causal Judge Evaluation) to calibrate AI survey responses against human ground truth.

**Key Benefits:**
- Get valid estimates using only 5-10% human labels
- Proper uncertainty quantification with confidence intervals
- Statistical comparison between models/policies

## Installation

```bash
pip install "edsl[cje]"
```

In [None]:
# For development (run from repo root): pip install -e ".[cje]"
# For production: pip install "edsl[cje]"

# If running from docs/notebooks/, install the local package:
%pip install -e "../..[cje]" --quiet

In [None]:
# Verify edsl is installed from the local repo
import edsl
print(f"EDSL version: {edsl.__version__}")
print(f"Location: {edsl.__file__}")

## Step 1: Run a Survey with Multiple Models

First, we create a simple sentiment survey and run it with two different models.

In [4]:
from edsl import Survey, QuestionLinearScale, Model, Agent

# Create a simple sentiment question
q = QuestionLinearScale(
    question_name="sentiment",
    question_text="Rate the sentiment of this text: '{{ text }}'",
    question_options=[1, 2, 3, 4, 5],
    option_labels={1: "Very Negative", 5: "Very Positive"}
)

survey = Survey([q])

In [5]:
from edsl import ScenarioList

# Sample texts to analyze
texts = [
    "I love this product! It's amazing.",
    "This is okay, nothing special.",
    "Terrible experience, would not recommend.",
    "Pretty good overall, minor issues.",
    "Absolutely fantastic, exceeded expectations!",
    "Meh, it's fine I guess.",
    "Worst purchase ever.",
    "Great quality, fast shipping.",
    "Not worth the money.",
    "Highly recommend to everyone!",
]

scenarios = ScenarioList.from_list("text", texts)

In [6]:
# Run with two models
models = [Model("gpt-4o"), Model("claude-sonnet-4-20250514")]

results = survey.by(scenarios).by(models).run()

| Service | Model | Input Tokens | Input Cost | Output Tokens | Output Cost | Total Cost | Total Credits |
|---|---|---|---|---|---|---|---|
| openai | gpt-4o | 936 | $0.0024 | 285 | $0.0029 | $0.0053 | 0.0 |
| anthropic | claude-sonnet-4-20250514 | 1110 | $0.0034 | 435 | $0.0066 | $0.0100 | 0.0 |
| Totals | Totals | 2046 | $0.0058 | 720 | $0.0095 | $0.0153 | 0.0 |


In [7]:
# View results
results.select("model", "text", "sentiment")

| | model.model | scenario.text | answer.sentiment |
|---|---|---|---|
| 0 | gpt-4o | I love this product! It's amazing. | 5 |
| 1 | claude-sonnet-4-20250514 | I love this product! It's amazing. | 5 |
| 2 | gpt-4o | This is okay, nothing special. | 3 |
| 3 | claude-sonnet-4-20250514 | This is okay, nothing special. | 3 |
| 4 | gpt-4o | Terrible experience, would not recommend. | 1 |
| 5 | claude-sonnet-4-20250514 | Terrible experience, would not recommend. | 1 |
| 6 | gpt-4o | Pretty good overall, minor issues. | 4 |
| 7 | claude-sonnet-4-20250514 | Pretty good overall, minor issues. | 4 |
| 8 | gpt-4o | Absolutely fantastic, exceeded expectations! | 5 |
| 9 | claude-sonnet-4-20250514 | Absolutely fantastic, exceeded expectations! | 5 |


## Step 2: Add Human Labels (Oracle)

In practice, you would collect human ratings for a sample of responses.
Here we simulate this with "ground truth" ratings for ~50% of samples.

In [8]:
import random

# Simulate human labels for some samples
# In practice, you would collect these from actual human raters
human_labels = []

for result in results:
    # Add human label for ~50% of samples
    if random.random() < 0.5:
        # Simulate a human rating (1-5 scale)
        # Lowercase the text so keywords also match "Terrible", "Worst", "Great", "Meh"
        text = result["scenario"]["text"].lower()
        if "love" in text or "amazing" in text or "fantastic" in text:
            human_labels.append(5)
        elif "terrible" in text or "worst" in text:
            human_labels.append(1)
        elif "good" in text or "great" in text:
            human_labels.append(4)
        elif "okay" in text or "fine" in text or "meh" in text:
            human_labels.append(3)
        else:
            human_labels.append(3)
    else:
        human_labels.append(None)  # No human label for this sample

print(f"Human labels collected: {sum(1 for x in human_labels if x is not None)}/{len(human_labels)}")

Human labels collected: 11/20
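
Real ratings are usually collected per item rather than in the same order as `results`. The sketch below shows one way to build an aligned label list, with `None` for unlabeled rows, assuming the ratings arrive as a dict keyed by row position. The helper name `build_oracle_labels` and the example ratings are hypothetical and not part of EDSL or CJE.

```python
from typing import Optional


def build_oracle_labels(
    n_results: int, ratings_by_index: dict[int, int]
) -> list[Optional[int]]:
    """Return a list aligned with `results`, with None where no rating exists."""
    labels: list[Optional[int]] = [None] * n_results
    for index, rating in ratings_by_index.items():
        if not 0 <= index < n_results:
            raise IndexError(f"rating index {index} is outside 0..{n_results - 1}")
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the 1-5 scale")
        labels[index] = rating
    return labels


# Hypothetical ratings for rows 0, 3 and 4 of a 6-row result set
example_labels = build_oracle_labels(6, {0: 5, 3: 4, 4: 1})
labeled_fraction = sum(label is not None for label in example_labels) / len(example_labels)
print(example_labels, f"{labeled_fraction:.0%} labeled")
```

Validating the index and the rating range up front catches misaligned or off-scale labels before they reach calibration.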


## Step 3: Calibrate with CJE

Now we use CJE to calibrate the AI responses against human labels.
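
The calibration call takes a `score_transform` that rescales the 1-5 ratings to the unit interval. As a minimal sketch of that linear mapping and its inverse (the function names here are illustrative, not part of EDSL or CJE):

```python
def to_unit_interval(rating: float, low: float = 1, high: float = 5) -> float:
    """Linearly map a rating on [low, high] to [0, 1]."""
    return (rating - low) / (high - low)


def from_unit_interval(value: float, low: float = 1, high: float = 5) -> float:
    """Map a value on [0, 1] back to the original rating scale."""
    return low + value * (high - low)


transformed = [to_unit_interval(r) for r in [1, 2, 3, 4, 5]]
print(transformed)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

The inverse is handy for reporting a calibrated estimate on the original 1-5 scale, for example `from_unit_interval(0.75)` gives `4.0`.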

In [9]:
# Transform 1-5 scale to 0-1 for CJE
scale_transform = lambda x: (x - 1) / 4

# Calibrate
cal_result = results.calibrate(
    question_name="sentiment",
    oracle_labels=human_labels,
    policy_column="model",
    score_transform=scale_transform,
    verbose=True,
)

AttributeError: 'Results' object has no attribute 'calibrate'

In [None]:
# View calibrated estimates
print("Calibrated Estimates:")
print(cal_result)

## Step 4: Compare Models

Use statistical testing to compare model performance.
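
To make the idea of a statistical comparison concrete, here is a rough standard-library sketch of a paired test on per-text scores from two models. It uses a normal approximation and is only an illustration; it is not necessarily how CJE computes its comparison, and the function name `paired_z_test` and the example scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_z_test(scores_a: list[float], scores_b: list[float]) -> tuple[float, float]:
    """Return (mean difference, two-sided p-value) for paired scores.

    Uses a normal approximation, which is rough for small samples.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    diff_mean = mean(diffs)
    std_err = stdev(diffs) / sqrt(len(diffs))
    if std_err == 0:
        # All differences are identical, so there is no variation to test against
        return diff_mean, (1.0 if diff_mean == 0 else 0.0)
    z = diff_mean / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff_mean, p_value


# Hypothetical 0-1 scores for the same five texts under two models
difference, p_value = paired_z_test(
    [1.0, 0.5, 0.0, 0.75, 1.0],
    [1.0, 0.5, 0.25, 0.75, 0.75],
)
print(f"difference={difference:.3f}, p-value={p_value:.3f}")
```

Pairing by text removes variation that comes from the texts themselves, which matters when both models rate the same items.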

In [None]:
# Get the best performing model
print(f"Best model: {cal_result.best_policy()}")
print(f"Ranking: {cal_result.ranking()}")

In [None]:
# Statistical comparison between models
policies = list(cal_result.estimates.keys())
if len(policies) >= 2:
    comparison = cal_result.compare(policies[0], policies[1])
    print(f"\nComparison: {policies[0]} vs {policies[1]}")
    print(f"Difference: {comparison.difference:.3f}")
    print(f"p-value: {comparison.p_value:.3f}")
    print(f"Significant at α=0.05: {comparison.significant}")

## Summary

CJE calibration provides:

1. **Efficient use of human labels** - Only need 5-10% oracle labels to calibrate
2. **Valid point estimates** - Corrects for systematic biases in AI judge scores
3. **Uncertainty quantification** - Confidence intervals for each policy
4. **Statistical comparisons** - Proper hypothesis testing between policies
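
As a rough illustration of the uncertainty quantification in item 3, the sketch below computes a normal-approximation confidence interval for a plain sample mean using only the standard library. CJE's own intervals may be computed differently; the function name `mean_confidence_interval` and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(
    values: list[float], level: float = 0.95
) -> tuple[float, float, float]:
    """Return (estimate, lower, upper) using a normal approximation."""
    if len(values) < 2:
        raise ValueError("need at least 2 values")
    estimate = mean(values)
    std_err = stdev(values) / sqrt(len(values))
    # Two-sided critical value, about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate, estimate - z * std_err, estimate + z * std_err


# Hypothetical scores already rescaled to the 0-1 range
estimate, lower, upper = mean_confidence_interval([0.0, 0.25, 0.5, 0.75, 1.0])
print(f"estimate={estimate:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

With only a handful of values the interval is wide, which is one reason a small labeled sample cannot separate models whose scores are close.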

### Next Steps

- Collect real human labels for your survey responses
- Use `CalibrationResult.compare()` to make data-driven model selection decisions
- Monitor for drift by re-calibrating when models are updated