# Automated Evidence Review

In Lecture 1 we developed the diagnostic framework for evaluating causal evidence; in Lecture 2 we examined the design patterns that power the evaluation tool. This lecture puts both together: we use the `impact-engine-evaluate` package end-to-end, running the full MEASURE → EVALUATE → ALLOCATE pipeline to demonstrate how evidence quality translates into investment decisions.

---

## Part I: Theory

The decision pipeline flows through three stages:

$$\text{MEASURE} \;\longrightarrow\; \text{EVALUATE} \;\longrightarrow\; \text{ALLOCATE}$$

- **MEASURE** produces causal estimates: effect sizes, confidence intervals, and diagnostic statistics
- **EVALUATE** assesses how trustworthy those estimates are and assigns a confidence score
- **ALLOCATE** uses confidence-weighted estimates to decide where to invest resources

The EVALUATE stage implements two strategies that correspond to different levels of evidence scrutiny:

| Strategy | Basis | When to Use |
|----------|-------|-------------|
| `deterministic` | Methodology-based prior (hierarchy of evidence from Lecture 1) | Early screening, large portfolios, time-constrained decisions |
| `agentic` | LLM diagnostic review (applying the framework from Lecture 1 to actual artifacts) | High-stakes decisions, detailed audit trail, before major resource commitments |

Both strategies return the same 8-key output, making them interchangeable from the perspective of the downstream ALLOCATE stage.

---

## Part II: Application

In [None]:
# Standard Library
import inspect
import json

# Third-party
from impact_engine_evaluate import Evaluate, score_initiative
from impact_engine_evaluate.review.methods.base import MethodReviewerRegistry
from impact_engine_evaluate.review.models import ReviewDimension, ReviewResult
from IPython.display import Code

# Local
from support import (
    create_mock_job_directory,
    plot_confidence_ranges,
    plot_review_dimensions,
    plot_score_comparison,
    print_evaluate_result,
    print_review_result,
)

## 1. Measurement Artifacts

The EVALUATE stage reads a **job directory** produced by MEASURE. The directory contains two files:

- `manifest.json`: describes the initiative, causal method, and evaluation strategy
- `impact_results.json`: the measurement output — effect estimate, confidence interval, sample size, cost

We use a helper function to create a mock job directory that mimics MEASURE output, so we can demonstrate EVALUATE without running the full pipeline:

In [None]:
Code(inspect.getsource(create_mock_job_directory), language="python")

In [None]:
# Create mock MEASURE output
job_dir = create_mock_job_directory()

In [None]:
# Inspect the manifest
manifest = json.loads((job_dir / "manifest.json").read_text())
print("manifest.json:")
print(json.dumps(manifest, indent=2))

In [None]:
# Inspect the impact results
impact_results = json.loads((job_dir / "impact_results.json").read_text())
print("impact_results.json:")
print(json.dumps(impact_results, indent=2))

## 2. Deterministic Scoring

The simplest evaluation strategy assigns a confidence score based on the **methodology used**, without examining the specific results. This reflects the hierarchy of evidence from Lecture 1: an experiment, by design, provides stronger evidence than an observational study.

### Registered Methods and Confidence Ranges

Each registered method reviewer defines a confidence range reflecting the methodology's inherent strength:

In [None]:
confidence_map = MethodReviewerRegistry.confidence_map()

for method, (lo, hi) in confidence_map.items():
    print(f"  {method}: [{lo:.2f}, {hi:.2f}]")

In [None]:
Code(inspect.getsource(plot_confidence_ranges), language="python")

In [None]:
plot_confidence_ranges(confidence_map)

The confidence range for experiments (0.85–1.00) is higher than it would be for observational methods, reflecting the stronger identification strategy. Within each range, the exact score is drawn deterministically from the initiative ID, ensuring reproducibility across runs.
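To make the idea of a reproducible, ID-derived score concrete, here is a minimal sketch. It assumes a hash-based mapping; the function name `deterministic_score` and the use of SHA-256 are illustrative choices, not the package's documented implementation.

```python
import hashlib


def deterministic_score(initiative_id: str, confidence_range: tuple) -> float:
    """Map an initiative ID to a reproducible score inside [lo, hi].

    Hypothetical helper for illustration; the package may derive its
    score differently.
    """
    lo, hi = confidence_range
    # SHA-256 is stable across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(initiative_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return lo + fraction * (hi - lo)


score_a = deterministic_score("initiative_product_content_experiment", (0.85, 1.0))
score_b = deterministic_score("initiative_product_content_experiment", (0.85, 1.0))
print(f"score: {score_a:.3f}, identical across calls: {score_a == score_b}")
```

The same ID always lands on the same point in the range, which is the property that matters for reproducibility; different IDs spread across the range without any random state to manage.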

### Running the EVALUATE Stage

We run the full EVALUATE pipeline by passing the job directory to the `Evaluate` adapter. It reads the manifest, dispatches to the appropriate reviewer, and returns the standardized 8-key output:

In [None]:
Code(inspect.getsource(print_evaluate_result), language="python")

In [None]:
evaluator = Evaluate()
result = evaluator.execute({"job_dir": str(job_dir)})

print_evaluate_result(result)

`score_initiative` can also be called directly with an event dict and a confidence range — useful when you want to score a single initiative without reading a full job directory:

In [None]:
direct_result = score_initiative(
    {
        "initiative_id": "initiative_product_content_experiment",
        "model_type": "experiment",
        "effect_estimate": 150.0,
        "ci_upper": 220.0,
        "ci_lower": 80.0,
        "cost_to_scale": 50000.0,
        "sample_size": 500,
    },
    confidence_range=(0.85, 1.0),
)
print(f"Confidence: {direct_result['confidence']:.3f}")
print("Range:      (0.85, 1.00)")

## 3. Agentic Review

The deterministic strategy assigns confidence based on methodology alone. The **agentic review** strategy goes further: it sends the actual measurement artifacts to an LLM, which evaluates them against the diagnostic framework from Lecture 1.

### Review Dimensions

The `ExperimentReviewer` defines five review dimensions, each corresponding to a diagnostic category from Lecture 1:

| Review Dimension | Lecture 1 Diagnostic | What the LLM Assesses |
|------------------|----------------------|-----------------------|
| Randomization integrity | RCT diagnostics — randomization integrity | Covariate balance, randomization procedure, baseline equivalence |
| Specification adequacy | RCT diagnostics — specification | OLS formula, covariate selection, functional form |
| Statistical inference | Shared diagnostics — statistical significance | Confidence intervals, p-values, standard errors |
| Threats to validity | RCT diagnostics — attrition, non-compliance, spillover | Whether common threats are present or addressed |
| Effect size plausibility | RCT diagnostics — effect plausibility | Whether the magnitude is realistic for the intervention |

```{note}
The agentic review calls an LLM API, which requires an API key and incurs cost. Since this notebook runs during the documentation build (where no API key is available), we construct a representative `ReviewResult` from pre-computed values. In practice, set `evaluate_strategy: agentic` in the manifest and call `Evaluate.execute()` to produce this output automatically.
```

### Reading the Review Dimension by Dimension

The review result for our product content experiment can be read as a structured narrative, where each dimension score traces directly to a specific artifact value in the job directory. This is the traceability pillar from Lecture 1 in practice: every number can be followed back to its source.

**Randomization integrity — 0.92.** The covariate balance diagnostic shows a maximum SMD of 0.04 across all baseline covariates, well within the 0.1 threshold. Random assignment appears properly implemented with no systematic baseline differences. The score is high but not perfect: with 500 observations, balance was verified but the randomization procedure itself was not independently audited.
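The balance check behind this score can be sketched in a few lines. The example below computes a standardized mean difference (SMD) for one covariate using the common pooled-standard-deviation definition and compares it to the 0.1 threshold. The covariate values are hypothetical and are not read from the job directory.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


# Hypothetical baseline covariate (e.g. customer age) for each arm.
treated_age = [34, 41, 29, 38, 45, 31, 36, 40]
control_age = [35, 40, 30, 37, 44, 32, 36, 41]

smd = standardized_mean_difference(treated_age, control_age)
balanced = abs(smd) < 0.1  # threshold quoted in the review above
print(f"SMD = {smd:.3f}, balanced: {balanced}")
```

In a full balance table the same computation runs once per covariate, and the maximum absolute SMD is the figure the reviewer cites.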

**Specification adequacy — 0.85.** The OLS specification uses a treatment indicator without covariate adjustment. Given the strong balance, this is sound — but including pre-treatment covariates as controls would tighten the standard error without biasing the estimate. The reviewer flags this as a missed precision opportunity rather than a validity concern.

**Statistical inference — 0.88.** The confidence interval [80, 220] excludes zero, the p-value of 0.003 indicates robust statistical significance, and the sample size of 500 is adequate for detection. The interval is moderately wide (a factor of 2.75 from lower to upper bound), reflecting real uncertainty in the effect magnitude.

**Threats to validity — 0.80.** A 5% attrition rate is acceptable; differential attrition would require separate investigation. The 92% compliance rate is high, meaning the intent-to-treat estimate closely approximates the treatment effect on the treated. SUTVA cannot be fully verified from the artifact alone — the reviewer notes this as an unresolved uncertainty rather than a confirmed violation.

**Effect size plausibility — 0.83.** The $150 effect represents roughly 30% of baseline revenue. This is plausible for a content optimization intervention but sits on the higher end of typical effects. The reviewer flags it as a yellow rather than red — not implausible, but worth triangulating with prior evidence before treating as established.

**Overall: 0.856.** This score reflects genuinely strong evidence — a properly randomized experiment with good balance, significant results, and manageable threats. The deductions are real: the wide CI, the unmeasured SUTVA risk, and the larger-than-typical effect size each reduce confidence below the top of the experiment range (1.00). Every deduction is traceable to a specific artifact value.
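As a quick arithmetic check, the overall score above coincides with the unweighted mean of the five dimension scores. Whether the package always aggregates this way is an assumption; the sketch only verifies that the reported numbers are consistent with that rule.

```python
dimension_scores = {
    "randomization_integrity": 0.92,
    "specification_adequacy": 0.85,
    "statistical_inference": 0.88,
    "threats_to_validity": 0.80,
    "effect_size_plausibility": 0.83,
}

# Unweighted mean (assumed aggregation rule, see the note above).
overall = sum(dimension_scores.values()) / len(dimension_scores)
weakest = min(dimension_scores, key=dimension_scores.get)

print(f"overall = {overall:.3f}")
print(f"weakest dimension = {weakest}")
```

Under an equal-weight rule, a single weak dimension moves the overall score by one fifth of its deficit, which is why the 0.80 on threats to validity lowers the total only modestly.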

In [None]:
Code(inspect.getsource(print_review_result), language="python")

In [None]:
# Pre-computed representative review output
review_result = ReviewResult(
    initiative_id="initiative_product_content_experiment",
    prompt_name="experiment_review",
    prompt_version="1.0",
    backend_name="anthropic",
    model="claude-sonnet-4-6",
    dimensions=[
        ReviewDimension(
            name="randomization_integrity",
            score=0.92,
            justification=(
                "Covariate balance is strong (max SMD = 0.04). Random assignment "
                "appears properly implemented with no systematic baseline differences."
            ),
        ),
        ReviewDimension(
            name="specification_adequacy",
            score=0.85,
            justification=(
                "Standard OLS specification with treatment indicator. Could benefit "
                "from covariate adjustment to improve precision, but the core specification is sound."
            ),
        ),
        ReviewDimension(
            name="statistical_inference",
            score=0.88,
            justification=(
                "Confidence interval [80, 220] excludes zero. p-value of 0.003 indicates "
                "statistical significance. Sample size of 500 provides adequate power."
            ),
        ),
        ReviewDimension(
            name="threats_to_validity",
            score=0.80,
            justification=(
                "Attrition rate of 5% is acceptable. Compliance rate of 92% is high. "
                "No evidence of spillover, though SUTVA cannot be fully verified from artifacts alone."
            ),
        ),
        ReviewDimension(
            name="effect_size_plausibility",
            score=0.83,
            justification=(
                "Effect estimate of $150 (roughly 30% of baseline) is plausible for a "
                "content optimization intervention, though on the higher end of typical effects."
            ),
        ),
    ],
    overall_score=0.856,
    raw_response="(pre-computed for documentation build)",
    timestamp="2026-01-15T10:30:00Z",
)

print_review_result(review_result)

In [None]:
Code(inspect.getsource(plot_review_dimensions), language="python")

In [None]:
plot_review_dimensions(review_result)

## 4. From Confidence to Allocation

Both evaluation strategies produce the same 8-key output dictionary. This standardized interface is what the downstream ALLOCATE stage consumes:

| Key | Type | Description |
|-----|------|-------------|
| `initiative_id` | str | Unique identifier for the initiative |
| `confidence` | float | Trustworthiness score (0–1) |
| `cost` | float | Cost to scale the initiative |
| `return_best` | float | Upper bound of expected return (CI upper) |
| `return_median` | float | Point estimate of return |
| `return_worst` | float | Lower bound of expected return (CI lower) |
| `model_type` | str | Causal method used |
| `sample_size` | int | Number of observations |

### How Confidence Discounts Returns

The ALLOCATE stage uses the confidence score to discount expected returns. An initiative with high measured impact but low confidence can receive a smaller allocation than one with moderate impact and high confidence:

$$\text{Adjusted Return} = \text{confidence} \times \text{return\_median}$$

This captures the key insight from Lecture 1: the *quality* of evidence matters as much as the *magnitude* of the estimate. A large but unreliable effect is worth less than a moderate but well-established one.
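A short worked example shows how strong the discount can be. The sketch below asks how large a raw estimate a low-confidence study would need to match an experiment's adjusted return. It uses the $150 effect and the bottom of the experiment range (0.85) from earlier sections; the 0.55 confidence for the competing study is a hypothetical value.

```python
def adjusted_return(confidence: float, return_median: float) -> float:
    """Confidence-weighted return, as in the formula above."""
    return confidence * return_median


def break_even_return(target_adjusted: float, confidence: float) -> float:
    """Raw return needed to reach a target adjusted return."""
    return target_adjusted / confidence


# Experiment scored at the bottom of its confidence range.
experiment_adjusted = adjusted_return(0.85, 150.0)

# Hypothetical lower-confidence study competing for the same budget.
needed = break_even_return(experiment_adjusted, 0.55)

print(f"experiment adjusted return: ${experiment_adjusted:,.1f}")
print(f"raw return needed at confidence 0.55: ${needed:,.1f}")
```

Under these assumptions the weaker study needs a raw estimate roughly 55% larger than the experiment's just to tie, even when the experiment is scored at the floor of its range.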

### The Organizational Incentive

Confidence-weighted allocation creates a direct incentive for teams to invest in better measurement designs. Teams that run proper experiments, collect adequate samples, and perform thorough robustness checks receive higher confidence scores, and higher scores mean their return estimates are discounted less in the allocator. Teams relying on weak quasi-experimental designs or sparse data see their estimates discounted more heavily, even when the raw point estimates are large.

This makes evidence quality a competitive advantage within the portfolio, not merely a methodological nicety. The allocation mechanism rewards measurement rigor in the same currency as it rewards impact magnitude.

In [None]:
confidence = result["confidence"]
return_median = result["return_median"]
adjusted_return = confidence * return_median

print(f"Raw return estimate:    ${return_median:,.0f}")
print(f"Confidence score:       {confidence:.3f}")
print(f"Adjusted return:        ${adjusted_return:,.0f}")
print()
print("Compare with a hypothetical observational study:")
# Typical observational-study confidence and a slightly larger raw estimate
# illustrate how methodology discounting reverses the ranking.
hypothetical_obs_confidence = 0.55
hypothetical_obs_return = 200.0
print(f"  Raw return estimate:  ${hypothetical_obs_return:,.0f}")
print(f"  Confidence score:     {hypothetical_obs_confidence:.3f}")
print(f"  Adjusted return:      ${hypothetical_obs_confidence * hypothetical_obs_return:,.0f}")
print()
print("The experiment's adjusted return is higher despite a lower raw estimate,")
print("because the stronger methodology commands greater confidence.")

## 5. Evaluating the Evaluator

The four pillars from Lecture 1 (Groundedness, Traceability, Reproducibility, and Correctness) establish what a trustworthy evaluation system should look like. But the Correctness pillar is different from the others: you cannot guarantee it by design. You must test for it.

Groundedness is enforced architecturally — the LLM only sees artifacts the measurement engine produced. Traceability is enforced by the per-dimension output schema. Reproducibility is enforced by fixed prompts, zero temperature, and version-pinned backends. But Correctness — whether the LLM accurately reads the evidence it is given — is an empirical property that must be verified through evaluation.

The Assess mode applies the internal/external validity distinction from Lecture 1 to the automated system itself:

**Internal validity** tests coherence without requiring ground truth. A well-specified system should produce stable scores regardless of which backend processes the artifact or how the prompt is phrased. If scores shift substantially across semantically equivalent prompts, the rubric is under-specified.

**External validity** tests correctness against synthetic artifacts where the right answer is known by construction. A known-clean artifact — a well-powered RCT with excellent diagnostics — should score highly. A known-flaw artifact — the same method with deliberately poor diagnostics — should score lower and flag the specific problems. If the system cannot discriminate between them, the rubric fails its core purpose.
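The internal validity check described above can be sketched as a spread test over repeated reviews of one artifact. The variant names, scores, and tolerance below are all hypothetical; in practice the scores would come from running the same artifact through reworded prompts or different backends.

```python
def score_spread(scores_by_variant: dict) -> float:
    """Largest minus smallest overall score across variants."""
    values = list(scores_by_variant.values())
    return max(values) - min(values)


# Hypothetical overall scores for ONE artifact under equivalent prompts.
variants = {
    "prompt_v1": 0.86,
    "prompt_v1_reworded": 0.84,
    "prompt_v1_reordered": 0.87,
}

TOLERANCE = 0.05  # hypothetical acceptance threshold

spread = score_spread(variants)
stable = spread <= TOLERANCE
print(f"spread = {spread:.2f}, stable: {stable}")
```

No ground truth is needed here: the test only asks whether the system agrees with itself, which is why it can run on any artifact.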

We demonstrate the external validity check below using pre-computed `ReviewResult` objects representing the two extremes.

In [None]:
# Known-clean: well-powered RCT with excellent diagnostics — should score near the top of the range
review_clean = ReviewResult(
    initiative_id="eval_test_clean_rct",
    prompt_name="experiment_review",
    prompt_version="1.0",
    backend_name="anthropic",
    model="claude-sonnet-4-6",
    dimensions=[
        ReviewDimension(
            name="randomization_integrity",
            score=0.95,
            justification=(
                "Perfect balance across all covariates (max SMD = 0.02). Randomization "
                "procedure documented and independently verified."
            ),
        ),
        ReviewDimension(
            name="specification_adequacy",
            score=0.90,
            justification=(
                "OLS with pre-specified covariates. No post-hoc specification changes. "
                "Analysis plan registered before data collection."
            ),
        ),
        ReviewDimension(
            name="statistical_inference",
            score=0.93,
            justification=(
                "CI [120, 180] tightly excludes zero. n = 10,000 provides high power. "
                "p-value of 0.0001 indicates robust significance."
            ),
        ),
        ReviewDimension(
            name="threats_to_validity",
            score=0.92,
            justification=(
                "Attrition 2% and symmetric across arms. Compliance 97%. "
                "No evidence of spillover in available diagnostics."
            ),
        ),
        ReviewDimension(
            name="effect_size_plausibility",
            score=0.88,
            justification=(
                "Effect of $50 (~10% of baseline) is consistent with prior literature "
                "on content optimization interventions."
            ),
        ),
    ],
    overall_score=0.916,
    raw_response="(pre-computed for documentation build)",
    timestamp="2026-01-15T11:00:00Z",
)

print_review_result(review_clean)

In [None]:
# Known-flaw: same method, deliberately poor diagnostics — should score lower and flag specific issues
review_flawed = ReviewResult(
    initiative_id="eval_test_flawed_rct",
    prompt_name="experiment_review",
    prompt_version="1.0",
    backend_name="anthropic",
    model="claude-sonnet-4-6",
    dimensions=[
        ReviewDimension(
            name="randomization_integrity",
            score=0.42,
            justification=(
                "Three covariates exceed SMD 0.1 (max SMD = 0.35). Baseline balance is "
                "compromised — randomization may have failed or was not properly implemented."
            ),
        ),
        ReviewDimension(
            name="specification_adequacy",
            score=0.65,
            justification=(
                "Imbalanced covariates not included as controls, leaving known confounders "
                "unaddressed. Specification does not adequately account for baseline differences."
            ),
        ),
        ReviewDimension(
            name="statistical_inference",
            score=0.55,
            justification=(
                "CI [-5, 305] includes zero, so the effect is not significant at conventional thresholds. "
                "Small sample (n = 80) suggests underpowered design; result is fragile."
            ),
        ),
        ReviewDimension(
            name="threats_to_validity",
            score=0.38,
            justification=(
                "Attrition 25% with differential rates across arms (treated: 18%, control: 32%). "
                "Compliance 68%. Spillover mechanisms present but unaddressed."
            ),
        ),
        ReviewDimension(
            name="effect_size_plausibility",
            score=0.60,
            justification=(
                "Effect of $300 (~60% of baseline) is implausibly large for a content "
                "optimization intervention. Possible measurement error or specification problem."
            ),
        ),
    ],
    overall_score=0.52,
    raw_response="(pre-computed for documentation build)",
    timestamp="2026-01-15T11:01:00Z",
)

print_review_result(review_flawed)

In [None]:
Code(inspect.getsource(plot_score_comparison), language="python")

In [None]:
plot_score_comparison(review_clean, review_flawed, labels=["Known-clean RCT", "Known-flaw RCT"])

### What This Tells Us

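The score gap (0.916 vs. 0.52) is what passing the **external validity** check looks like: the reviewer discriminates sharply between strong and weak evidence. Scores do not cluster in a narrow band. Dimension scores span 0.38 to 0.95, and the directional ordering is correct. Because both results are pre-computed, this illustrates the check rather than proving the property for a live backend.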

The dimension-level breakdown matters as much as the overall score. Threats to validity (0.38) is the most penalized dimension in the flawed case, driven by the combination of high differential attrition and low compliance, which are exactly the diagnostics that should dominate a validity assessment when randomization is compromised. Specification adequacy (0.65) is penalized least because its failure (not controlling for imbalanced covariates) is a consequence of the randomization problem, not an independent error.

Every low score is linked to a specific artifact value: max SMD = 0.35, attrition 25%, compliance 68%. This is the traceability pillar holding in the flawed case — not just in the clean one.
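The discrimination check can be made mechanical by comparing the two reviews dimension by dimension. The sketch below uses the scores from the two pre-computed results above as plain dictionaries, so it runs without the package installed.

```python
clean = {
    "randomization_integrity": 0.95,
    "specification_adequacy": 0.90,
    "statistical_inference": 0.93,
    "threats_to_validity": 0.92,
    "effect_size_plausibility": 0.88,
}
flawed = {
    "randomization_integrity": 0.42,
    "specification_adequacy": 0.65,
    "statistical_inference": 0.55,
    "threats_to_validity": 0.38,
    "effect_size_plausibility": 0.60,
}

# Per-dimension gap between the known-clean and known-flaw artifacts.
gaps = {name: round(clean[name] - flawed[name], 2) for name in clean}

# The flawed artifact should score lower on every dimension.
ordering_correct = all(flawed[name] < clean[name] for name in clean)
largest_gap = max(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(f"{name:<26} gap = {gap:.2f}")
print(f"ordering correct: {ordering_correct}, largest gap: {largest_gap}")
```

The largest gaps fall on threats to validity and randomization integrity, which are the two dimensions where the flawed artifact was deliberately degraded the most.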

### What This Implies for Improve Mode

If the system had *failed* to penalize the known-flaw case — say, the threats to validity score came back as 0.75 instead of 0.38 — that would be a clear **external validity failure**. The Improve mode response would be to inspect the threats-to-validity prompt template and add explicit thresholds: "An attrition rate above 15%, or differential attrition across arms, should substantially reduce this score." The fix would then be validated on held-out flawed artifacts — not the one that revealed the problem — before being promoted to production.
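The threshold logic quoted above can also be written as a deterministic check, which is useful for confirming that a synthetic artifact really does contain the flaw the reviewer is expected to catch. This is a sketch, not the prompt change itself: the function name is hypothetical, the 15% limit comes from the example rule, and the 5-percentage-point limit for differential attrition is an assumed value.

```python
def attrition_flags(
    treated_attrition: float,
    control_attrition: float,
    overall_attrition: float,
    overall_limit: float = 0.15,       # from the example rule above
    differential_limit: float = 0.05,  # assumed value for illustration
) -> list:
    """Return the attrition problems that should lower the score."""
    flags = []
    if overall_attrition > overall_limit:
        flags.append("overall attrition above limit")
    if abs(treated_attrition - control_attrition) > differential_limit:
        flags.append("differential attrition across arms")
    return flags


# Values from the known-flaw artifact: 25% overall, 18% treated, 32% control.
flawed_flags = attrition_flags(0.18, 0.32, 0.25)
# Values from the known-clean artifact: 2%, symmetric across arms.
clean_flags = attrition_flags(0.02, 0.02, 0.02)

print("flawed:", flawed_flags)
print("clean: ", clean_flags)
```

If the reviewer returns a high threats-to-validity score for an artifact that trips both flags, that mismatch is the signal to revise the prompt template.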

## Additional resources

- **Young, A. (2022)**. Consistency without inference: Instrumental variables in practical application. *European Economic Review*, 147, 104112.

- **Angrist, J. D. & Pischke, J.‑S. (2010)**. The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. *Journal of Economic Perspectives*, 24(2), 3–30.

- [impact-engine-evaluate documentation](https://eisenhauerio.github.io/tools-impact-engine-evaluate/) — Usage, configuration, and system design