# Automated Evidence Review

In the previous lectures we developed the diagnostic framework for evaluating causal evidence and examined the design patterns that power the evaluation tool. This lecture puts both together: we use the `impact-engine-evaluate` package end-to-end, running the full MEASURE → EVALUATE → ALLOCATE pipeline to demonstrate how evidence quality translates into investment decisions.

---

## Part I: Theory

The decision pipeline flows through three stages:

$$\text{MEASURE} \;\longrightarrow\; \text{EVALUATE} \;\longrightarrow\; \text{ALLOCATE}$$

- **MEASURE** produces causal estimates: effect sizes, confidence intervals, and diagnostic statistics
- **EVALUATE** assesses how trustworthy those estimates are and assigns a confidence score
- **ALLOCATE** uses confidence-weighted estimates to decide where to invest resources

The EVALUATE stage implements two strategies that correspond to different levels of evidence scrutiny:

| Strategy | Basis | When to Use |
|----------|-------|-------------|
| `score` | Methodology-based prior (hierarchy of evidence from Lecture 1) | Early screening, large portfolios, time-constrained decisions |
| `review` | LLM diagnostic review (applying the framework from Lecture 1 to actual artifacts) | High-stakes decisions, detailed audit trail, before major resource commitments |

Both strategies return the same 8-key output, making them interchangeable from the perspective of the downstream ALLOCATE stage.

---

## Part II: Application

In [None]:
import inspect
import json

from impact_engine_evaluate import Evaluate, score_confidence
from impact_engine_evaluate.review.methods.base import MethodReviewerRegistry
from impact_engine_evaluate.review.models import ReviewDimension, ReviewResult
from IPython.display import Code

from support import (
    create_mock_job_directory,
    plot_confidence_ranges,
    plot_review_dimensions,
    print_evaluate_result,
    print_review_result,
)

## 1. Measurement Artifacts

The EVALUATE stage reads a **job directory** produced by MEASURE. The directory contains two files:

- `manifest.json`: describes the initiative, causal method, and evaluation strategy
- `impact_results.json`: the measurement output — effect estimate, confidence interval, sample size, cost

We use a helper function to create a mock job directory that mimics MEASURE output, so we can demonstrate EVALUATE without running the full pipeline:

In [None]:
Code(inspect.getsource(create_mock_job_directory), language="python")

In [None]:
# Create mock MEASURE output
job_dir = create_mock_job_directory()

# Inspect the manifest
manifest = json.loads((job_dir / "manifest.json").read_text())
print("manifest.json:")
print(json.dumps(manifest, indent=2))

In [None]:
# Inspect the impact results
impact_results = json.loads((job_dir / "impact_results.json").read_text())
print("impact_results.json:")
print(json.dumps(impact_results, indent=2))
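The two inspection cells above can be folded into a small loader that fails early when one of the files is missing. This is a minimal sketch: the file names come from the description of the job directory, while `load_job_artifacts` and `REQUIRED_FILES` are hypothetical names and not part of `impact-engine-evaluate`.

```python
import json
from pathlib import Path

# File names are taken from the text; the helper itself is hypothetical.
REQUIRED_FILES = ("manifest.json", "impact_results.json")


def load_job_artifacts(job_dir):
    """Return (manifest, impact_results) parsed from a MEASURE job directory."""
    job_dir = Path(job_dir)
    # Report every missing file at once instead of failing on the first one.
    missing = [name for name in REQUIRED_FILES if not (job_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"{job_dir} is missing: {', '.join(missing)}")
    manifest, impact_results = (
        json.loads((job_dir / name).read_text()) for name in REQUIRED_FILES
    )
    return manifest, impact_results
```

With the mock directory from this section, `manifest, impact_results = load_job_artifacts(job_dir)` would return the same two dictionaries printed above.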

## 2. Deterministic Scoring

The simplest evaluation strategy assigns a confidence score based on the **methodology used**, without examining the specific results. This reflects the hierarchy of evidence from Lecture 1: an experiment, by design, provides stronger evidence than an observational study.

### Registered Methods and Confidence Ranges

Each registered method reviewer defines a confidence range reflecting the methodology's inherent strength:

In [None]:
confidence_map = MethodReviewerRegistry.confidence_map()

for method, (lo, hi) in confidence_map.items():
    print(f"  {method}: [{lo:.2f}, {hi:.2f}]")

In [None]:
plot_confidence_ranges(confidence_map)

The confidence range for experiments (0.85–1.00) sits above what an observational method would receive, reflecting the stronger identification strategy that randomization provides. Within each range, the exact score is derived deterministically from the initiative ID, so repeated runs on the same initiative yield the same score.
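One way to obtain a reproducible score inside a range is to hash the initiative ID and scale the digest into the interval. The sketch below assumes SHA-256 and linear scaling; the lecture does not show the package's actual scheme, so `deterministic_score` is a hypothetical stand-in and not the implementation behind `score_confidence`.

```python
import hashlib


def deterministic_score(initiative_id, confidence_range):
    """Map an initiative ID to a reproducible score inside [lo, hi]."""
    lo, hi = confidence_range
    if not 0.0 <= lo <= hi <= 1.0:
        raise ValueError("confidence_range must satisfy 0 <= lo <= hi <= 1")
    digest = hashlib.sha256(initiative_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and scale to [0, 1].
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    # Clamp to hi so floating-point rounding can never leave the range.
    return min(hi, lo + fraction * (hi - lo))
```

Because the hash depends only on the ID, two calls with the same ID and range return the same value, and no random seed has to be managed.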

### Running the EVALUATE Stage

We run the full EVALUATE pipeline by passing the job directory to the `Evaluate` adapter. It reads the manifest, dispatches to the appropriate reviewer, and returns the standardized 8-key output:

In [None]:
evaluator = Evaluate()
result = evaluator.execute({"job_dir": str(job_dir)})

print_evaluate_result(result)

The `score_confidence` function can also be called directly with an initiative ID and confidence range — useful when you want to score without reading a job directory:

In [None]:
score_result = score_confidence("initiative_product_content_experiment", (0.85, 1.0))
print(f"Confidence: {score_result.confidence:.3f}")
print(f"Range:      {score_result.confidence_range}")
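The dispatch step that `Evaluate` performs (read the method from the manifest, look up the matching reviewer, expose its confidence range) follows a registry pattern. The following is a minimal sketch, assuming a plain dictionary registry, a method name of `"experiment"`, and a hypothetical `model_type` manifest key. All class and function names are illustrative and are not the package's API.

```python
# Hypothetical registry: maps a method name to its reviewer class.
_REVIEWERS = {}


def register(method, confidence_range):
    """Class decorator that files a reviewer under a method name."""
    def decorator(cls):
        cls.method = method
        cls.confidence_range = confidence_range
        _REVIEWERS[method] = cls
        return cls
    return decorator


@register("experiment", (0.85, 1.0))
class SketchExperimentReviewer:
    """Stand-in reviewer for randomized experiments."""


def confidence_map():
    """Return {method name: (lo, hi)} for every registered reviewer."""
    return {name: cls.confidence_range for name, cls in _REVIEWERS.items()}


def dispatch(manifest):
    """Instantiate the reviewer registered for the manifest's method."""
    method = manifest["model_type"]  # assumed key name
    try:
        return _REVIEWERS[method]()
    except KeyError:
        raise ValueError(f"no reviewer registered for method {method!r}") from None
```

Registering through a decorator keeps the dispatcher unchanged when a new method is added: a new reviewer class only needs its own `@register(...)` line.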

## 3. Agentic Review

The deterministic strategy assigns confidence based on methodology alone. The **agentic review** strategy goes further: it sends the actual measurement artifacts to an LLM, which evaluates them against the diagnostic framework from Lecture 1.

### Review Dimensions

The `ExperimentReviewer` defines five review dimensions, each corresponding to a diagnostic category from Lecture 1:

| Review Dimension | Lecture 1 Diagnostic | What the LLM Assesses |
|------------------|----------------------|-----------------------|
| Randomization integrity | RCT diagnostics — randomization integrity | Covariate balance, randomization procedure, baseline equivalence |
| Specification adequacy | RCT diagnostics — specification | OLS formula, covariate selection, functional form |
| Statistical inference | Shared diagnostics — statistical significance | Confidence intervals, p-values, standard errors |
| Threats to validity | RCT diagnostics — attrition, non-compliance, spillover | Whether common threats are present or addressed |
| Effect size plausibility | RCT diagnostics — effect plausibility | Whether the magnitude is realistic for the intervention |

```{note}
The agentic review calls an LLM API, which requires an API key and incurs cost. Since this notebook runs during the documentation build (where no API key is available), we construct a representative `ReviewResult` from pre-computed values. In practice, set `evaluate_strategy: review` in the manifest and call `Evaluate.execute()` to produce this output automatically.
```

In [None]:
# Pre-computed representative review output
review_result = ReviewResult(
    initiative_id="initiative_product_content_experiment",
    prompt_name="experiment_review",
    prompt_version="1.0",
    backend_name="anthropic",
    model="claude-sonnet-4-5-20250929",
    dimensions=[
        ReviewDimension(
            name="randomization_integrity",
            score=0.92,
            justification=(
                "Covariate balance is strong (max SMD = 0.04). Random assignment "
                "appears properly implemented with no systematic baseline differences."
            ),
        ),
        ReviewDimension(
            name="specification_adequacy",
            score=0.85,
            justification=(
                "Standard OLS specification with treatment indicator. Could benefit "
                "from covariate adjustment to improve precision, but the core specification is sound."
            ),
        ),
        ReviewDimension(
            name="statistical_inference",
            score=0.88,
            justification=(
                "Confidence interval [80, 220] excludes zero. p-value of 0.003 indicates "
                "statistical significance. Sample size of 500 provides adequate power."
            ),
        ),
        ReviewDimension(
            name="threats_to_validity",
            score=0.80,
            justification=(
                "Attrition rate of 5% is acceptable. Compliance rate of 92% is high. "
                "No evidence of spillover, though SUTVA cannot be fully verified from artifacts alone."
            ),
        ),
        ReviewDimension(
            name="effect_size_plausibility",
            score=0.83,
            justification=(
                "Effect estimate of $150 (roughly 30% of baseline) is plausible for a "
                "content optimization intervention, though on the higher end of typical effects."
            ),
        ),
    ],
    overall_score=0.856,
    raw_response="(pre-computed for documentation build)",
    timestamp="2026-01-15T10:30:00Z",
)

print_review_result(review_result)

In [None]:
plot_review_dimensions(review_result)
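The pre-computed `overall_score` of 0.856 equals the unweighted mean of the five dimension scores. The sketch below reproduces that arithmetic with plain floats. Whether the package aggregates dimensions with an unweighted mean in general is an assumption, since only this one pre-computed result is shown here.

```python
from statistics import mean

# Dimension scores copied from the pre-computed ReviewResult above.
dimension_scores = {
    "randomization_integrity": 0.92,
    "specification_adequacy": 0.85,
    "statistical_inference": 0.88,
    "threats_to_validity": 0.80,
    "effect_size_plausibility": 0.83,
}

# Unweighted mean, rounded to three decimals like the stored overall_score.
overall = round(mean(list(dimension_scores.values())), 3)
print(overall)
```

An unweighted mean treats every dimension as equally important. A weighted variant would let a failure in, say, randomization integrity pull the overall score down more sharply.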

## 4. From Confidence to Allocation

Both evaluation strategies produce the same 8-key output dictionary. This standardized interface is what the downstream ALLOCATE stage consumes:

| Key | Type | Description |
|-----|------|-------------|
| `initiative_id` | str | Unique identifier for the initiative |
| `confidence` | float | Trustworthiness score (0–1) |
| `cost` | float | Cost to scale the initiative |
| `return_best` | float | Upper bound of expected return (CI upper) |
| `return_median` | float | Point estimate of return |
| `return_worst` | float | Lower bound of expected return (CI lower) |
| `model_type` | str | Causal method used |
| `sample_size` | int | Number of observations |
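The table above can be written down as a typed contract that a downstream consumer checks before using a result. This is a minimal sketch: the key names and types come from the table, while `EvaluateOutput` and `check_output` are hypothetical names, and the ordering check on the three return values is an assumption based on their descriptions as CI bounds and point estimate.

```python
from typing import TypedDict


class EvaluateOutput(TypedDict):
    """The 8-key dictionary produced by either evaluation strategy."""

    initiative_id: str
    confidence: float
    cost: float
    return_best: float
    return_median: float
    return_worst: float
    model_type: str
    sample_size: int


def check_output(result):
    """Raise ValueError if the dict does not match the 8-key contract."""
    expected = set(EvaluateOutput.__annotations__)
    if set(result) != expected:
        raise ValueError(f"unexpected keys: {sorted(set(result) ^ expected)}")
    if not 0.0 <= result["confidence"] <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # Assumed ordering: CI lower <= point estimate <= CI upper.
    if not result["return_worst"] <= result["return_median"] <= result["return_best"]:
        raise ValueError("returns must satisfy worst <= median <= best")
    return result
```

Because both strategies emit the same keys, a check like this does not need to know whether the confidence came from `score` or `review`.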

### How Confidence Discounts Returns

The ALLOCATE stage uses the confidence score to discount expected returns. An initiative with high measured impact but low confidence can therefore rank below one with moderate impact and high confidence:

$$\text{Adjusted Return} = \text{confidence} \times \text{return\_median}$$

This captures the key insight from Lecture 1: the *quality* of evidence matters as much as the *magnitude* of the estimate. A large but unreliable effect is worth less than a moderate but well-established one.
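The discounting rule implies a simple break-even condition. Writing $c$ for confidence and $r$ for the median return, initiative $A$ outranks initiative $B$ on adjusted return exactly when

$$c_A \, r_A > c_B \, r_B \quad\Longleftrightarrow\quad r_A > \frac{c_B}{c_A}\, r_B$$

As an illustration with hypothetical numbers, take an observational study with $c_B = 0.55$ and $r_B = 200$, and an experiment scored at the bottom of its range, $c_A = 0.85$:

$$r_A > \frac{0.55}{0.85} \times 200 = \frac{110}{0.85} \approx 129.41$$

So even at its lowest possible confidence, the experiment needs a raw return of only about 129 to outrank an observational estimate of 200.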

In [None]:
confidence = result["confidence"]
return_median = result["return_median"]
adjusted_return = confidence * return_median

print(f"Raw return estimate:    ${return_median:,.0f}")
print(f"Confidence score:       {confidence:.3f}")
print(f"Adjusted return:        ${adjusted_return:,.0f}")
print()
print("Compare with a hypothetical observational study:")
obs_confidence = 0.55
obs_return = 200.0
print(f"  Raw return estimate:  ${obs_return:,.0f}")
print(f"  Confidence score:     {obs_confidence:.3f}")
print(f"  Adjusted return:      ${obs_confidence * obs_return:,.0f}")
print()
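# Only state the conclusion when the numbers actually support it
if return_median < obs_return and adjusted_return > obs_confidence * obs_return:
    print("The experiment's adjusted return is higher despite a lower raw estimate,")
    print("because the stronger methodology commands greater confidence.")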

## Additional Resources

- **Young, A. (2022)**. Consistency without inference: Instrumental variables in practical application. *European Economic Review*, 147, 104112.

- **Angrist, J. D. & Pischke, J.-S. (2010)**. The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. *Journal of Economic Perspectives*, 24(2), 3–30.

- [impact-engine-evaluate documentation](https://eisenhauerio.github.io/tools-impact-engine-evaluate/) — Usage, configuration, and system design