# Evaluating Evidence Quality

This lecture develops a framework for assessing whether causal estimates are trustworthy enough to act on. We then apply these concepts using the `impact-engine-evaluate` package, which automates evidence evaluation as the bridge between measuring impact and allocating resources.

---

## Part I: Theory

Measuring causal effects is only the first step. Before using an estimate to guide a business decision, we need to ask: **How much should we trust this number?** This section develops the conceptual tools for answering that question.

## 1. Evaluating Evidence in General

### Internal vs. External Validity

Every causal study faces two distinct questions:

| Validity Type | Question | Threats |
|---------------|----------|----------|
| **Internal validity** | Is the causal estimate correct *for this study population*? | Selection bias, confounding, measurement error |
| **External validity** | Does the estimate generalize *to other populations or settings*? | Sample selection, context dependence, time effects |

Internal validity is a prerequisite — an estimate that is wrong for the study population tells us nothing about other settings. But a perfectly valid estimate from a narrow population may not apply elsewhere.

### Statistical vs. Practical Significance

A result can be statistically significant without being practically meaningful, and vice versa:

- **Statistical significance** asks: Could this result have arisen by chance? (Measured by p-values, confidence intervals)
- **Practical significance** asks: Is the effect large enough to matter for the decision? (Measured by effect sizes, cost-benefit analysis)

With large samples, even trivially small effects become statistically significant. Conversely, small samples may fail to detect genuinely important effects. Decision-makers need both: evidence that the effect is real *and* that it is large enough to justify action.
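
The contrast can be sketched with a two-sample z-test that assumes a known, common standard deviation. The effect sizes, standard deviation, and sample sizes are hypothetical values chosen for illustration:

```python
import math


def two_sample_z_pvalue(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means (equal group sizes, known common SD)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical case 1: a tiny lift (0.02 SD) measured on one million units per group.
tiny_effect_large_n = two_sample_z_pvalue(mean_diff=0.02, sd=1.0, n_per_group=1_000_000)

# Hypothetical case 2: a large lift (0.5 SD) measured on only 20 units per group.
large_effect_small_n = two_sample_z_pvalue(mean_diff=0.5, sd=1.0, n_per_group=20)

print(f"Tiny effect, large sample:  p = {tiny_effect_large_n:.2e}")
print(f"Large effect, small sample: p = {large_effect_small_n:.3f}")
```

The first case clears any conventional significance threshold even though a 0.02 SD lift may not justify action; the second does not, even though a 0.5 SD lift would matter if it were real.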

### Hierarchy of Evidence

Not all research designs provide equally credible evidence. The strength of a causal claim depends on how effectively the design rules out alternative explanations:

| Tier | Design | Strength |
|------|--------|----------|
| 1 | Randomized experiments (RCTs) | Random assignment eliminates selection bias in expectation, by construction |
| 2 | Quasi-experimental methods (DiD, RDD, IV) | Exploit natural variation but rely on untestable assumptions |
| 3 | Observational methods (matching, regression) | Require strong conditional independence assumptions |

Higher tiers are not always feasible. The goal is to use the strongest design available and to be transparent about the assumptions required by weaker designs.

### Replication and Robustness

A single estimate, no matter how well-designed the study, provides limited evidence. Confidence increases when:

- **Replication**: Different studies, using different data and methods, reach similar conclusions
- **Robustness**: The estimate remains stable across reasonable changes in specification, sample, or assumptions

Sensitivity to minor analytical choices — such as which covariates to include or how to define the outcome — signals fragility in the evidence.

## 2. Diagnostics Shared Across Causal Methods

Regardless of the specific causal method, several diagnostic tools apply broadly. These checks assess whether the core assumptions of the design are plausible.

### Covariate Balance

Causal methods that condition on observables (matching, subclassification, regression) rely on treated and control groups being comparable. **Covariate balance** measures how similar the groups are on pre-treatment characteristics.

The standard metric is the **standardized mean difference (SMD)**:

$$\text{SMD}_k = \frac{\bar{X}_{k,\text{treated}} - \bar{X}_{k,\text{control}}}{\sqrt{(s_{k,\text{treated}}^2 + s_{k,\text{control}}^2) / 2}}$$

A common threshold is $|\text{SMD}| < 0.1$, though this is a guideline, not a rule. **Love plots** display SMDs for all covariates before and after adjustment, making it easy to assess whether a method has improved balance.
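
A minimal sketch of the SMD formula for a single covariate, using only the standard library. The function name and the age values are hypothetical, chosen for illustration:

```python
import math
from statistics import fmean, variance


def standardized_mean_difference(treated, control):
    """SMD for one covariate: mean gap divided by the root of the average group variance."""
    mean_gap = fmean(treated) - fmean(control)
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        raise ValueError("covariate is constant in both groups; SMD is undefined")
    return mean_gap / pooled_sd


# Hypothetical customer ages in each group.
treated_age = [34.0, 36.0, 38.0, 40.0, 42.0]
control_age = [30.0, 33.0, 36.0, 39.0, 42.0]

smd = standardized_mean_difference(treated_age, control_age)
print(f"SMD = {smd:.3f}; within the 0.1 guideline: {abs(smd) < 0.1}")
```

Because the SMD is scaled by the pooled standard deviation rather than the standard error, it does not shrink mechanically as the sample grows, which is why it is preferred over t-tests for balance checks.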

```{note}
We used Love plots in the matching and subclassification lecture to assess balance improvement. The evaluate tool checks covariate balance as part of its automated review.
```

### Placebo Tests

**Placebo tests** apply the causal method in settings where we know the true effect is zero. If the method detects a "significant" effect where none exists, something is wrong.

Common placebo tests include:

| Type | Description | Example |
|------|-------------|----------|
| **Placebo outcomes** | Apply the method to outcomes that should not be affected by treatment | Treatment is a marketing campaign; placebo outcome is warehouse temperature |
| **Placebo treatments** | Assign treatment at a time or threshold where no real treatment occurred | In a DiD design, test for a "treatment effect" two periods before the actual intervention |
| **Placebo units** | Apply the method to units that were not actually treated | In synthetic control, run the method on a control unit and check if a gap appears |
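
The placebo-treatment row can be illustrated with a simple two-group, two-period difference-in-differences on group means. The data and the helper `did_estimate` are hypothetical; the real intervention is assumed to start in period 4, and the placebo pretends it started two periods earlier while using only pre-treatment data:

```python
from statistics import fmean


def did_estimate(treated, control, cutoff):
    """2x2 DiD on period-ordered outcomes; periods with index >= cutoff count as 'post'."""

    def change(series):
        return fmean(series[cutoff:]) - fmean(series[:cutoff])

    return change(treated) - change(control)


# Hypothetical outcomes for periods 0..5; the real intervention starts in period 4.
treated = [10.0, 11.0, 12.0, 13.0, 19.0, 20.0]
control = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0]

real_effect = did_estimate(treated, control, cutoff=4)

# Placebo: keep only pre-treatment periods 0..3 and pretend treatment began in period 2.
placebo_effect = did_estimate(treated[:4], control[:4], cutoff=2)

print(f"Estimated effect at the real date:    {real_effect:.1f}")
print(f"Estimated effect at the placebo date: {placebo_effect:.1f}")
```

A placebo estimate near zero is consistent with the design's assumptions; a placebo estimate of similar size to the real one would suggest the method is picking up something other than the intervention.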

### Sensitivity Analysis

Every causal method relies on assumptions that cannot be fully tested with the data. **Sensitivity analysis** asks: How much would the assumptions need to be violated to overturn the conclusion?

- **Rosenbaum bounds**: For matching estimators, quantify how strongly an unobserved confounder would need to shift the odds of treatment to explain away the effect
- **Coefficient stability**: If adding additional covariates substantially changes the estimate, the conditional independence assumption may be fragile
- **Omitted variable bias bounds**: Formal frameworks (e.g., Oster 2019) bound the bias from unobserved confounders based on the explanatory power of observed ones
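
To see why coefficient stability is informative, consider the textbook omitted variable bias formula for a single unobserved confounder $U$. The linear model and the symbols $\alpha$, $\beta$, $\gamma$, and $U$ are illustrative assumptions, with $\varepsilon$ taken to be uncorrelated with both $D$ and $U$:

$$Y = \alpha + \beta D + \gamma U + \varepsilon \quad\Longrightarrow\quad \hat{\beta}_{\text{short}} \xrightarrow{p} \beta + \gamma \, \frac{\operatorname{Cov}(D, U)}{\operatorname{Var}(D)}$$

Here $\hat{\beta}_{\text{short}}$ is the coefficient from regressing $Y$ on $D$ alone. The bias term vanishes only if the confounder does not affect the outcome ($\gamma = 0$) or is uncorrelated with treatment. Sensitivity analysis asks how large $\gamma$ and that covariance would have to be for the bias to account for the entire estimate.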

### Common Support / Overlap

Methods that compare treated and control units require that both groups exist across the relevant range of covariates. **Common support** (or **overlap**) means that for any covariate value, there is a positive probability of being in either group:

$$0 < P(D=1 \mid X=x) < 1 \quad \text{for all } x$$

Violations of common support mean the method is extrapolating rather than comparing like with like. Diagnostics include propensity score histograms and trimming rules.
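
One simple trimming rule keeps only units whose propensity score lies in the range covered by both groups (a min/max rule). The sketch below assumes propensity scores have already been estimated; the scores and function names are hypothetical:

```python
def overlap_bounds(treated_scores, control_scores):
    """Range of propensity scores that is covered by both groups."""
    lower = max(min(treated_scores), min(control_scores))
    upper = min(max(treated_scores), max(control_scores))
    return lower, upper


def trim_to_common_support(scores, bounds):
    """Keep only the scores that fall inside the common-support range."""
    lower, upper = bounds
    return [s for s in scores if lower <= s <= upper]


# Hypothetical estimated propensity scores.
treated_scores = [0.35, 0.50, 0.65, 0.80, 0.95]
control_scores = [0.05, 0.20, 0.40, 0.55, 0.70]

bounds = overlap_bounds(treated_scores, control_scores)
treated_kept = trim_to_common_support(treated_scores, bounds)
control_kept = trim_to_common_support(control_scores, bounds)

print(f"Common support: [{bounds[0]:.2f}, {bounds[1]:.2f}]")
print(f"Treated kept: {len(treated_kept)} of {len(treated_scores)}")
print(f"Control kept: {len(control_kept)} of {len(control_scores)}")
```

Trimming changes the population the estimate refers to, so the share of units dropped is itself worth reporting alongside the effect.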

### Pre-treatment Outcome Trends

For methods that use time-series variation (difference-in-differences, synthetic control), **parallel pre-treatment trends** between treated and control units support the identifying assumption. If outcomes diverge before treatment, the estimated effect may reflect pre-existing differences rather than the intervention.
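
A rough numerical check of pre-trends fits a line to the treated-minus-control gap over the pre-treatment periods; a slope near zero is consistent with parallel trends. This is a sketch with hypothetical data and helper names, and it is not a substitute for an event-study plot with standard errors:

```python
def ols_slope(values):
    """Least-squares slope of `values` on the period index 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    numerator = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    return numerator / denominator


def pre_trend_gap_slope(treated_pre, control_pre):
    """Slope of the treated-minus-control gap across pre-treatment periods."""
    gap = [t - c for t, c in zip(treated_pre, control_pre)]
    return ols_slope(gap)


# Hypothetical pre-treatment outcomes.
control_pre = [8.0, 9.0, 10.0, 11.0]
parallel_treated = [10.0, 11.0, 12.0, 13.0]   # gap stays constant at 2
diverging_treated = [10.0, 12.0, 14.0, 16.0]  # gap grows by 1 each period

parallel_slope = pre_trend_gap_slope(parallel_treated, control_pre)
diverging_slope = pre_trend_gap_slope(diverging_treated, control_pre)

print(f"Gap slope, parallel case:  {parallel_slope:.2f}")
print(f"Gap slope, diverging case: {diverging_slope:.2f}")
```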

## 3. Method-Specific Diagnostics

Beyond shared diagnostics, each causal method has its own set of checks tailored to its specific assumptions.

### Experiments (RCTs)

| Diagnostic | What It Checks | Red Flag |
|------------|----------------|----------|
| **Randomization integrity** | Covariate balance between treatment and control | Systematic differences in baseline characteristics |
| **Attrition** | Whether dropout rates differ by treatment status | Differential attrition undermines the comparability that random assignment created |
| **Non-compliance** | Whether all units received their assigned treatment | High non-compliance dilutes the estimated effect |
| **Spillover / SUTVA** | Whether treatment of one unit affects others | Interference between units biases the estimate |
| **Effect plausibility** | Whether the magnitude of the effect is realistic | Implausibly large effects suggest measurement or specification errors |
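
The attrition check can be sketched as a two-proportion z-test on dropout rates. The counts and the function name are hypothetical, and the normal approximation assumes reasonably large groups:

```python
import math


def differential_attrition_pvalue(dropped_t, n_t, dropped_c, n_c):
    """Two-sided p-value for equal dropout rates in treatment and control (pooled z-test)."""
    rate_t = dropped_t / n_t
    rate_c = dropped_c / n_c
    pooled = (dropped_t + dropped_c) / (n_t + n_c)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = (rate_t - rate_c) / standard_error
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical experiment with 500 units assigned to each arm.
p_differential = differential_attrition_pvalue(dropped_t=60, n_t=500, dropped_c=25, n_c=500)
p_similar = differential_attrition_pvalue(dropped_t=26, n_t=500, dropped_c=25, n_c=500)

print(f"12% vs 5% dropout:   p = {p_differential:.5f}")
print(f"5.2% vs 5% dropout:  p = {p_similar:.3f}")
```

A small p-value here flags that dropout depends on treatment status; it does not say how much bias results, which depends on who dropped out.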

### Matching & Subclassification

| Diagnostic | What It Checks | Red Flag |
|------------|----------------|----------|
| **Balance improvement** | Whether matching reduced covariate imbalance | SMDs remain large after matching |
| **Common support** | Whether treated and control overlap in covariate space | Many treated units have no comparable controls |
| **Hidden bias sensitivity** | How strong an unobserved confounder would need to be to explain away the effect | Effect disappears at low values of Rosenbaum's $\Gamma$ |

### Synthetic Control

| Diagnostic | What It Checks | Red Flag |
|------------|----------------|----------|
| **Pre-treatment fit (RMSPE)** | How well the synthetic control tracks the treated unit before treatment | Large pre-treatment gaps undermine post-treatment comparisons |
| **Placebo gaps** | Whether placebo units show similar post-treatment gaps | Many placebos have gaps as large as the treated unit |
| **RMSPE ratios** | Post/pre RMSPE ratio relative to placebo distribution | Treated unit's ratio is not extreme relative to placebos |
| **Donor composition** | Whether the synthetic control relies on sensible weights | Weights concentrated on dissimilar units |
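
The RMSPE ratio and its placebo-based p-value can be sketched as follows. The gaps (treated unit minus synthetic control) and the placebo ratios are hypothetical, and the p-value uses the common convention of counting the treated unit among the ranked units:

```python
import math


def rmspe(gaps):
    """Root mean squared prediction error of a series of gaps."""
    return math.sqrt(sum(g * g for g in gaps) / len(gaps))


def rmspe_ratio(pre_gaps, post_gaps):
    """Post-treatment RMSPE divided by pre-treatment RMSPE."""
    return rmspe(post_gaps) / rmspe(pre_gaps)


def placebo_p_value(treated_ratio, placebo_ratios):
    """Share of units (treated included) whose ratio is at least as large as the treated unit's."""
    all_ratios = [treated_ratio, *placebo_ratios]
    return sum(r >= treated_ratio for r in all_ratios) / len(all_ratios)


# Hypothetical gaps for the treated unit.
pre_gaps = [1.0, -1.0, 1.0, -1.0]
post_gaps = [5.0, 5.0, 5.0, 5.0]
treated_ratio = rmspe_ratio(pre_gaps, post_gaps)

# Hypothetical ratios from running the same method on four control units.
placebo_ratios = [1.2, 0.8, 2.0, 1.5]
p_value = placebo_p_value(treated_ratio, placebo_ratios)

print(f"Treated RMSPE ratio: {treated_ratio:.1f}")
print(f"Placebo p-value:     {p_value:.2f}")
```

With only four placebo units the smallest attainable p-value is 0.2, which illustrates why a small donor pool limits how strong this kind of evidence can be.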

### Other Methods (Preview)

Methods covered later in the course have their own diagnostics:

| Method | Key Diagnostics |
|--------|------------------|
| **Difference-in-Differences** | Parallel pre-trends, event study plots, staggered treatment tests |
| **Instrumental Variables** | First-stage F-statistic, exclusion restriction arguments, overidentification tests |
| **Regression Discontinuity** | Continuity of baseline covariates at cutoff, McCrary density test, bandwidth sensitivity |

---

## Part II: Application

In Part I we developed a framework for evaluating causal evidence — from general principles of validity and significance, through shared diagnostics like covariate balance and placebo tests, to method-specific checks for experiments, matching, and synthetic control.

In this application we use the `impact-engine-evaluate` package, which automates this assessment as the bridge between the MEASURE and ALLOCATE stages of the decision pipeline. The tool implements two evaluation strategies: deterministic confidence scoring based on methodology, and agentic review where an LLM applies the diagnostic framework from Part I to actual measurement artifacts.

In [None]:
import inspect

from impact_engine_evaluate import Evaluate, score_confidence
from impact_engine_evaluate.review.methods.base import MethodReviewerRegistry
from impact_engine_evaluate.review.methods.experiment.reviewer import ExperimentReviewer
from impact_engine_evaluate.review.models import ReviewDimension, ReviewResult
from IPython.display import Code

from support import (
    create_mock_job_directory,
    plot_confidence_ranges,
    plot_review_dimensions,
    print_evaluate_result,
    print_review_result,
)

## 1. Pipeline Context

The decision pipeline flows through three stages:

$$\text{MEASURE} \;\longrightarrow\; \text{EVALUATE} \;\longrightarrow\; \text{ALLOCATE}$$

- **MEASURE** produces causal estimates (effect sizes, confidence intervals, diagnostics)
- **EVALUATE** assesses how trustworthy those estimates are and assigns a confidence score
- **ALLOCATE** uses the confidence-weighted estimates to decide where to invest resources

The `Evaluate` adapter orchestrates the evaluation stage. It reads a job directory produced by MEASURE, dispatches on the configured strategy, and outputs a standardized 8-key dictionary that ALLOCATE consumes.

In [None]:
Code(inspect.getsource(Evaluate), language="python")

### Interface-to-Theory Mapping

The `Evaluate` adapter connects Part I concepts to concrete implementation:

| Adapter Field / Method | Part I Concept |
|------------------------|----------------|
| `evaluate_strategy` | Choice between methodology-based priors (Section 1: hierarchy of evidence) and detailed diagnostic review (Sections 2–3) |
| `confidence` output | Overall trustworthiness — combines statistical and practical significance |
| `MethodReviewerRegistry` | Method-specific diagnostics (Section 3) — each reviewer encodes the relevant checks for its methodology |
| `confidence_range` per method | Hierarchy of evidence (Section 1) — experiments get higher baseline confidence than observational methods |
| Review dimensions | Map directly to diagnostic categories: randomization integrity, specification adequacy, statistical inference, threats to validity, effect plausibility |

## 2. Deterministic Scoring

The simplest evaluation strategy assigns a confidence score based on the **methodology used**, without examining the specific results. This reflects the hierarchy of evidence from Part I: an experiment, by design, provides stronger evidence than an observational study.

### Registered Methods and Confidence Ranges

Each registered method reviewer defines a confidence range reflecting the methodology's inherent strength:

In [None]:
# Show registered methods and their confidence ranges
confidence_map = MethodReviewerRegistry.confidence_map()

for method, (lower, upper) in confidence_map.items():
    print(f"  {method}: [{lower:.2f}, {upper:.2f}]")

In [None]:
plot_confidence_ranges(confidence_map)

The confidence range for experiments (0.85–1.00) is higher than it would be for observational methods, reflecting the stronger identification strategy. Within each range, the exact score is drawn deterministically from the initiative ID, ensuring reproducibility.
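
One way a score can be derived deterministically from an ID is to hash the ID and map the hash into the range. The sketch below illustrates that idea only; it is an assumption about how such a scheme could work, not the package's actual implementation, and `hash_based_score` is a hypothetical name:

```python
import hashlib


def hash_based_score(initiative_id: str, confidence_range: tuple[float, float]) -> float:
    """Map an ID to a reproducible value inside confidence_range (illustrative only)."""
    lower, upper = confidence_range
    digest = hashlib.sha256(initiative_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to a fraction in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return lower + fraction * (upper - lower)


first = hash_based_score("initiative_product_content_experiment", (0.85, 1.0))
second = hash_based_score("initiative_product_content_experiment", (0.85, 1.0))

print(f"First call:  {first:.3f}")
print(f"Second call: {second:.3f}")
```

Unlike a random draw, this needs no seed management: the same ID and range always give the same score, on any machine.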

### Scoring an Initiative

To demonstrate the scoring strategy, we create a mock job directory that mimics the output of the MEASURE stage:

In [None]:
Code(inspect.getsource(create_mock_job_directory), language="python")

In [None]:
import json

# Create mock MEASURE output
job_dir = create_mock_job_directory()

# Show the manifest
manifest = json.loads((job_dir / "manifest.json").read_text())
print(json.dumps(manifest, indent=2))

Now we run the EVALUATE stage. The `Evaluate` adapter reads the manifest, identifies the method reviewer, and dispatches to the score strategy:

In [None]:
# Run the full EVALUATE pipeline
evaluator = Evaluate()
result = evaluator.execute({"job_dir": str(job_dir)})

print_evaluate_result(result)

The `score_confidence` function can also be called directly with an initiative ID and confidence range:

In [None]:
# Direct scoring — same deterministic confidence for the same initiative ID
score_result = score_confidence("initiative_product_content_experiment", (0.85, 1.0))
print(f"Confidence: {score_result.confidence:.3f}")
print(f"Range:      {score_result.confidence_range}")

## 3. Agentic Review

The deterministic strategy assigns confidence based on methodology alone. The **agentic review** strategy goes further: it sends the actual measurement artifacts to an LLM, which evaluates them against the diagnostic framework from Part I.

### Review Dimensions

The `ExperimentReviewer` defines five review dimensions, each corresponding to a concept from Part I:

| Review Dimension | Part I Diagnostic | What the LLM Assesses |
|------------------|-------------------|-----------------------|
| Randomization integrity | Randomization integrity | Covariate balance, randomization procedure, baseline equivalence |
| Specification adequacy | (Statistical inference) | OLS formula, covariate selection, functional form |
| Statistical inference | Statistical significance | Confidence intervals, p-values, standard errors |
| Threats to validity | Attrition, non-compliance, spillover, SUTVA | Whether common threats are present or addressed |
| Effect size plausibility | Effect plausibility | Whether the magnitude is realistic for the intervention |

In [None]:
Code(inspect.getsource(ExperimentReviewer), language="python")

### Knowledge Base

Each method reviewer maintains a knowledge directory with domain expertise files that ground the LLM's assessment. For experiments, this includes design principles, common pitfalls, and diagnostic standards:

In [None]:
# Show the knowledge files available to the experiment reviewer
reviewer = ExperimentReviewer()
knowledge_dir = reviewer.knowledge_content_dir()

for path in sorted(knowledge_dir.iterdir()):
    if path.suffix in (".md", ".txt"):
        line_count = len(path.read_text().splitlines())
        print(f"  {path.name} ({line_count} lines)")

### Pre-computed Review Output

```{note}
The agentic review calls an LLM API, which requires an API key and incurs cost. Since this notebook runs during the documentation build (where no API key is available), we construct a representative `ReviewResult` from pre-computed values. In practice, calling `Evaluate.execute()` with `evaluate_strategy: "review"` in the manifest produces this output automatically.
```

The following shows what a typical agentic review produces for a well-designed experiment:

In [None]:
# Pre-computed representative review output
review_result = ReviewResult(
    initiative_id="initiative_product_content_experiment",
    prompt_name="experiment_review",
    prompt_version="1.0",
    backend_name="anthropic",
    model="claude-sonnet-4-5-20250929",
    dimensions=[
        ReviewDimension(
            name="randomization_integrity",
            score=0.92,
            justification=(
                "Covariate balance is strong (max SMD = 0.04). Random assignment "
                "appears properly implemented with no systematic baseline differences."
            ),
        ),
        ReviewDimension(
            name="specification_adequacy",
            score=0.85,
            justification=(
                "Standard OLS specification with treatment indicator. Could benefit "
                "from covariate adjustment to improve precision, but the core specification is sound."
            ),
        ),
        ReviewDimension(
            name="statistical_inference",
            score=0.88,
            justification=(
                "Confidence interval [80, 220] excludes zero. p-value of 0.003 indicates "
                "statistical significance. Sample size of 500 provides adequate power."
            ),
        ),
        ReviewDimension(
            name="threats_to_validity",
            score=0.80,
            justification=(
                "Attrition rate of 5% is acceptable. Compliance rate of 92% is high. "
                "No evidence of spillover, though SUTVA cannot be fully verified from artifacts alone."
            ),
        ),
        ReviewDimension(
            name="effect_size_plausibility",
            score=0.83,
            justification=(
                "Effect estimate of $150 (roughly 30% of baseline) is plausible for a "
                "content optimization intervention, though on the higher end of typical effects."
            ),
        ),
    ],
    overall_score=0.856,
    raw_response="(pre-computed for documentation build)",
    timestamp="2026-01-15T10:30:00Z",
)

print_review_result(review_result)

In [None]:
plot_review_dimensions(review_result)
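
In this pre-computed example, `overall_score` equals the unweighted mean of the five dimension scores. Whether the package aggregates dimensions this way in general is an assumption; the sketch only reproduces the arithmetic for the values used in the `ReviewResult` above:

```python
from statistics import fmean

# Dimension scores copied from the pre-computed ReviewResult.
dimension_scores = {
    "randomization_integrity": 0.92,
    "specification_adequacy": 0.85,
    "statistical_inference": 0.88,
    "threats_to_validity": 0.80,
    "effect_size_plausibility": 0.83,
}

# Unweighted mean (an assumed aggregation rule, not confirmed package behavior).
mean_score = fmean(dimension_scores.values())

# The lowest-scoring dimension shows where the evidence is weakest.
weakest_dimension = min(dimension_scores, key=dimension_scores.get)

print(f"Mean of dimension scores: {mean_score:.3f}")
print(f"Weakest dimension:        {weakest_dimension}")
```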

## 4. From Confidence to Allocation

Both evaluation strategies — deterministic scoring and agentic review — produce the same 8-key output dictionary. This standardized interface is what the downstream ALLOCATE stage consumes:

| Key | Type | Description |
|-----|------|-------------|
| `initiative_id` | str | Unique identifier for the initiative |
| `confidence` | float | Trustworthiness score (0–1) |
| `cost` | float | Cost to scale the initiative |
| `return_best` | float | Upper bound of expected return (CI upper) |
| `return_median` | float | Point estimate of return |
| `return_worst` | float | Lower bound of expected return (CI lower) |
| `model_type` | str | Causal method used |
| `sample_size` | int | Number of observations |
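
A consumer of this dictionary could guard against malformed input with a small check like the one below. The validator, its name, and the example values for `confidence`, `cost`, and `model_type` are hypothetical; the ordering check assumes the worst, median, and best returns correspond to the lower bound, point estimate, and upper bound described in the table:

```python
EXPECTED_KEYS = {
    "initiative_id": str,
    "confidence": float,
    "cost": float,
    "return_best": float,
    "return_median": float,
    "return_worst": float,
    "model_type": str,
    "sample_size": int,
}


def validate_evaluate_output(output: dict) -> list[str]:
    """Return a list of problems found in an 8-key output dictionary (empty if none)."""
    problems = []
    for key, expected_type in EXPECTED_KEYS.items():
        if key not in output:
            problems.append(f"missing key: {key}")
        elif not isinstance(output[key], expected_type):
            problems.append(f"wrong type for {key}: expected {expected_type.__name__}")
    # Range and ordering checks only make sense once keys and types are in order.
    if not problems:
        if not 0.0 <= output["confidence"] <= 1.0:
            problems.append("confidence outside [0, 1]")
        if not output["return_worst"] <= output["return_median"] <= output["return_best"]:
            problems.append("returns are not ordered worst <= median <= best")
    return problems


# Hypothetical output for illustration.
example_output = {
    "initiative_id": "initiative_product_content_experiment",
    "confidence": 0.9,
    "cost": 1000.0,
    "return_best": 220.0,
    "return_median": 150.0,
    "return_worst": 80.0,
    "model_type": "experiment",
    "sample_size": 500,
}

print(validate_evaluate_output(example_output))
```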

### How Confidence Discounts Returns

The ALLOCATE stage uses the confidence score to discount expected returns. An initiative with high measured impact but low confidence can receive a smaller allocation than one with moderate impact and high confidence:

$$\text{Adjusted Return} = \text{confidence} \times \text{return\_median}$$

This captures the key insight from Part I: the *quality* of evidence matters as much as the *magnitude* of the estimate. A large but unreliable effect can be worth less than a moderate but well-established one.

For example, with our experiment's output:

In [None]:
# Demonstrate confidence discounting
confidence = result["confidence"]
return_median = result["return_median"]
adjusted_return = confidence * return_median

print(f"Raw return estimate:    ${return_median:,.0f}")
print(f"Confidence score:       {confidence:.3f}")
print(f"Adjusted return:        ${adjusted_return:,.0f}")
print()
print("Compare with a hypothetical observational study:")
obs_confidence = 0.55
obs_return = 200.0
obs_adjusted_return = obs_confidence * obs_return
print(f"  Raw return estimate:  ${obs_return:,.0f}")
print(f"  Confidence score:     {obs_confidence:.3f}")
print(f"  Adjusted return:      ${obs_adjusted_return:,.0f}")
print()
# State the conclusion only when the computed values support it
if adjusted_return > obs_adjusted_return and return_median < obs_return:
    print("The experiment's adjusted return is higher despite a lower raw estimate,")
    print("because the stronger methodology commands greater confidence.")
else:
    print("With these values, discounting does not reverse the ranking of the two studies.")

## Additional resources

- **Angrist, J. D. & Pischke, J.‑S. (2010)**. The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. *Journal of Economic Perspectives*, 24(2), 3–30.

- **Athey, S. & Imbens, G. W. (2017)**. [The econometrics of randomized experiments](https://doi.org/10.1016/bs.hefe.2016.10.003). In *Handbook of Economic Field Experiments* (Vol. 1, pp. 73–140). Elsevier.

- **Imbens, G. W. (2020)**. [Potential outcome and directed acyclic graph approaches to causality: Relevance for empirical practice in economics](https://www.aeaweb.org/articles?id=10.1257/jel.20191597). *Journal of Economic Literature*, 58(4), 1129–1179.

- **Oster, E. (2019)**. Unobservable selection and coefficient stability: Theory and evidence. *Journal of Business & Economic Statistics*, 37(2), 187–204.

- **Rosenbaum, P. R. (2002)**. *Observational Studies* (2nd ed.). Springer.

- **Young, A. (2022)**. Consistency without inference: Instrumental variables in practical application. *European Economic Review*, 147, 104112.