# Deterministic Confidence Scoring

This tutorial demonstrates the deterministic evaluation path: a lightweight
scorer included for debugging, testing, and illustration. It assigns a
reproducible confidence score to a causal estimate, drawn from a range set by
the measurement methodology, without calling an LLM.

## Workflow overview

1. Create a mock job directory with `manifest.json` and `impact_results.json`
2. Score directly with `score_confidence()`
3. Score via the `Evaluate` adapter
4. Verify reproducibility across calls
5. Compare confidence ranges across methods

## Setup

In [None]:
import json
import tempfile
from pathlib import Path

from notebook_support import print_result_summary

## Step 1: Create a mock job directory

An upstream producer writes a job directory containing `manifest.json` (metadata
and file references) and `impact_results.json` (the producer's output). Here
we create one manually to illustrate the convention.

In [None]:
tmp = tempfile.mkdtemp(prefix="job-impact-engine-")
job_dir = Path(tmp)

manifest = {
    "schema_version": "2.0",
    "model_type": "experiment",
    "evaluate_strategy": "score",
    "created_at": "2025-06-01T12:00:00+00:00",
    "files": {
        "impact_results": {"path": "impact_results.json", "format": "json"},
    },
}

impact_results = {
    "ci_upper": 15.0,
    "effect_estimate": 10.0,
    "ci_lower": 5.0,
    "cost_to_scale": 100.0,
    "sample_size": 500,
}

(job_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
(job_dir / "impact_results.json").write_text(json.dumps(impact_results, indent=2))

print(f"Job directory: {job_dir}")
print(f"Files: {[p.name for p in job_dir.iterdir()]}")
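
The convention above (a manifest whose `files` entries point at sibling files)
can be read back with a few lines of standard-library code. The sketch below is
illustrative only: `load_job_files` is a hypothetical helper, not part of
`impact_engine_evaluate`, and it assumes every entry carries a `path` and a
`format` key as in the manifest written above.

```python
import json
import tempfile
from pathlib import Path


def load_job_files(job_dir: Path) -> dict:
    """Load each JSON file referenced under manifest['files'] (hypothetical helper)."""
    manifest = json.loads((job_dir / "manifest.json").read_text())
    loaded = {}
    for name, ref in manifest.get("files", {}).items():
        if ref.get("format") != "json":
            continue  # this sketch only handles JSON entries
        # Paths in the manifest are treated as relative to the job directory.
        loaded[name] = json.loads((job_dir / ref["path"]).read_text())
    return loaded


# Build a throwaway job directory so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_dir = Path(tmp_dir)
    demo_manifest = {
        "files": {"impact_results": {"path": "impact_results.json", "format": "json"}},
    }
    (demo_dir / "manifest.json").write_text(json.dumps(demo_manifest))
    (demo_dir / "impact_results.json").write_text(json.dumps({"effect_estimate": 10.0}))
    loaded = load_job_files(demo_dir)

print(loaded)
```

Resolving paths relative to the job directory keeps the directory relocatable,
which is presumably why the manifest stores bare file names.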

## Step 2: Score directly with `score_confidence()`

`score_confidence()` is a pure function useful for debugging and testing. It
takes an `initiative_id` string and a confidence range, hashes the ID to seed
an RNG, and draws a reproducible confidence value. The confidence range comes
from the method reviewer (an experiment uses `(0.85, 1.0)` because RCTs
produce the strongest evidence).

In [None]:
from impact_engine_evaluate.score import score_confidence

result = score_confidence("initiative-demo-001", confidence_range=(0.85, 1.0))
print(f"Initiative:  {result.initiative_id}")
print(f"Confidence:  {result.confidence:.4f}  (range {result.confidence_range[0]:.2f}–{result.confidence_range[1]:.2f})")

The confidence score falls within `(0.85, 1.0)` because we specified the
experiment confidence range. The score is deterministic: calling
`score_confidence()` with the same `initiative_id` and range always returns the
same value.
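
To make the mechanism concrete, here is a minimal sketch of a hash-seeded
scorer. It is not the library's implementation: the hash function (SHA-256),
the uniform draw, and the names `SketchScore` and `sketch_score_confidence` are
all assumptions chosen for illustration.

```python
import hashlib
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchScore:
    """Hypothetical result container; not the library's actual return type."""

    initiative_id: str
    confidence: float
    confidence_range: tuple


def sketch_score_confidence(initiative_id: str, confidence_range: tuple) -> SketchScore:
    """Draw a reproducible value in confidence_range, seeded by the ID's hash."""
    low, high = confidence_range
    if low > high:
        raise ValueError("confidence_range must be (low, high) with low <= high")
    # Assumption: a cryptographic digest is used so the seed is stable
    # across processes and machines.
    digest = hashlib.sha256(initiative_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)  # private RNG; global random state is untouched
    return SketchScore(initiative_id, rng.uniform(low, high), (low, high))


first = sketch_score_confidence("initiative-demo-001", (0.85, 1.0))
second = sketch_score_confidence("initiative-demo-001", (0.85, 1.0))
print(first.confidence == second.confidence)
```

Using a private `random.Random` instance rather than `random.seed()` means the
draw neither depends on nor disturbs any other random state in the process.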

## Step 3: Score via the `Evaluate` adapter

In the full pipeline, the orchestrator calls `Evaluate.execute()` instead of
calling `score_confidence()` directly. The adapter reads the manifest, looks up
the registered method reviewer for `model_type`, and dispatches on
`evaluate_strategy`.

In [None]:
from impact_engine_evaluate import Evaluate

evaluator = Evaluate()
result = evaluator.execute({"job_dir": str(job_dir)})

print_result_summary(result)

The adapter returns a 5-key output dict. It read `manifest.json`, found
`evaluate_strategy: "score"`, and used the experiment reviewer's confidence
range `(0.85, 1.0)`.
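
The lookup-then-dispatch flow can be sketched as follows. This is a simplified
stand-in, not the adapter's real code: `sketch_resolve` and
`SKETCH_CONFIDENCE_RANGES` are hypothetical names, and the table holds only the
experiment range because that is the only one stated in this tutorial.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical lookup table standing in for the method reviewer registry.
SKETCH_CONFIDENCE_RANGES = {"experiment": (0.85, 1.0)}


def sketch_resolve(job_dir: Path) -> dict:
    """Read the manifest and decide how the job would be evaluated."""
    manifest = json.loads((job_dir / "manifest.json").read_text())
    model_type = manifest["model_type"]
    strategy = manifest.get("evaluate_strategy")

    if model_type not in SKETCH_CONFIDENCE_RANGES:
        raise KeyError(f"no reviewer registered for model_type {model_type!r}")
    if strategy == "score":
        # Deterministic path: only the method's confidence range is needed.
        return {
            "strategy": "score",
            "confidence_range": SKETCH_CONFIDENCE_RANGES[model_type],
        }
    if strategy == "review":
        # The review path calls an LLM and is out of scope for this sketch.
        raise NotImplementedError("review strategy is not sketched here")
    raise ValueError(f"unknown evaluate_strategy: {strategy!r}")


# Build a throwaway job directory so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_dir = Path(tmp_dir)
    demo_manifest = {"model_type": "experiment", "evaluate_strategy": "score"}
    (demo_dir / "manifest.json").write_text(json.dumps(demo_manifest))
    resolved = sketch_resolve(demo_dir)

print(resolved)
```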

## Step 4: Verify reproducibility

The deterministic scorer hashes `initiative_id` to seed a random number
generator. The same ID always produces the same confidence score, regardless
of when or where the code runs.

In [None]:
scores = [score_confidence("initiative-demo-001", confidence_range=(0.85, 1.0)).confidence for _ in range(5)]

print(f"Scores across 5 calls: {scores}")
assert len(set(scores)) == 1, "Scores should be identical"
print("All scores are identical — deterministic scoring is reproducible.")
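
The loop above checks repeatability within a single process. Reproducibility
across processes additionally requires a hash that does not vary per
interpreter run: Python's built-in `hash()` on strings is salted per process
unless `PYTHONHASHSEED` is fixed, so it would not be suitable as a seed source.
The sketch below assumes a SHA-256-based seed (an assumption about the scorer,
not a confirmed detail) and runs the same draw in fresh interpreters with
different `PYTHONHASHSEED` values.

```python
import os
import subprocess
import sys

# One-line program executed in each child interpreter: seed a private RNG
# from a SHA-256 digest of the ID and print one draw from (0.85, 1.0).
SNIPPET = (
    "import hashlib, random;"
    "seed = int.from_bytes(hashlib.sha256(b'initiative-demo-001').digest()[:8], 'big');"
    "print(repr(random.Random(seed).uniform(0.85, 1.0)))"
)


def run_in_fresh_interpreter(hash_seed: str) -> str:
    """Run SNIPPET in a new Python process with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    completed = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


# Different hash salts, same printed value if the seed is salt-independent.
outputs = {run_in_fresh_interpreter(s) for s in ("0", "1", "12345")}
print(f"Distinct outputs across processes: {len(outputs)}")
```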

## Step 5: Compare confidence ranges across methods

Different measurement methodologies get different confidence ranges. An RCT
deserves higher confidence than a weaker design. The `MethodReviewerRegistry`
exposes the confidence map for all registered methods.

In [None]:
from impact_engine_evaluate.review.methods import MethodReviewerRegistry

print("Registered methods and confidence ranges:")
for name, bounds in sorted(MethodReviewerRegistry.confidence_map().items()):
    print(f"  {name}: ({bounds[0]:.2f}, {bounds[1]:.2f})")

## Summary

The deterministic scorer is a lightweight tool for debugging, testing, and
illustrating the pipeline without an LLM dependency.

- `score_confidence()` is a pure function that assigns deterministic
  confidence scores from an `initiative_id` and a confidence range.
- Confidence ranges are tied to measurement methodology, not to individual
  results.
- The `Evaluate` adapter wraps this behind a job-directory-based interface
  for the orchestrator pipeline.
- Scoring is fully reproducible: the same `initiative_id` and confidence range
  always yield the same confidence.

For production evaluation, use the **review** strategy, which sends
measurement artifacts to an LLM for structured, per-dimension review.

In [None]:
import shutil

# Clean up
shutil.rmtree(job_dir, ignore_errors=True)