# Evaluating Causal Evidence

Measuring causal effects is only the first step. Before using an estimate to guide a business decision, we need to ask: **How much should we trust this number?** This lecture develops the conceptual tools for answering that question — from general principles of evidence quality, through diagnostic checks shared across all causal methods, to the method-specific tests that assess each design's identifying assumptions.

## 1. Evaluating Evidence in General

### Internal vs. External Validity

Every causal study faces two distinct questions:

| Validity Type | Question | Threats |
|---------------|----------|----------|
| **Internal validity** | Is the causal estimate correct *for this study population*? | Selection bias, confounding, measurement error |
| **External validity** | Does the estimate generalize *to other populations or settings*? | Sample selection, context dependence, time effects |

Internal validity is a prerequisite — an estimate that is wrong for the study population tells us nothing about other settings. But a perfectly valid estimate from a narrow population may not apply elsewhere.

### Statistical vs. Practical Significance

A result can be statistically significant without being practically meaningful, and vice versa:

- **Statistical significance** asks: Could this result have arisen by chance? (Measured by p-values, confidence intervals)
- **Practical significance** asks: Is the effect large enough to matter for the decision? (Measured by effect sizes, cost-benefit analysis)

With large enough samples, even trivially small effects can become statistically significant. Conversely, small samples may fail to detect genuinely important effects. Decision-makers need both: evidence that the effect is real *and* that it is large enough to justify action.
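
The contrast can be made concrete with a small sketch. The numbers below are hypothetical (an average order value with a standard deviation of 20 in both arms), and the test is a simple two-sided z-test that assumes a known, common standard deviation and equal arm sizes.

```python
import math
from statistics import NormalDist


def two_sample_z(mean_t, mean_c, sd, n_per_arm):
    """Two-sided z-test for a difference in means (equal n, common known sd)."""
    diff = mean_t - mean_c
    se = sd * math.sqrt(2.0 / n_per_arm)
    z = diff / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return diff, se, p_value


# Hypothetical: a 0.05 lift measured on 5 million units per arm.
tiny_effect_big_n = two_sample_z(100.05, 100.00, sd=20.0, n_per_arm=5_000_000)

# Hypothetical: a 5.00 lift measured on only 50 units per arm.
large_effect_small_n = two_sample_z(105.00, 100.00, sd=20.0, n_per_arm=50)

for label, (diff, se, p) in [("tiny effect, big n", tiny_effect_big_n),
                             ("large effect, small n", large_effect_small_n)]:
    print(f"{label}: diff={diff:.2f}, se={se:.4f}, p={p:.4f}")
```

Under these assumed numbers the 0.05 lift clears the conventional 5% threshold while the 5.00 lift does not, which is why the p-value alone cannot settle whether an effect matters for the decision.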

### Hierarchy of Evidence

Not all research designs provide equally credible evidence. The strength of a causal claim depends on how effectively the design rules out alternative explanations:

| Tier | Design | Strength |
|------|--------|----------|
| 1 | Randomized experiments (RCTs) | Random assignment removes selection bias in expectation, by construction |
| 2 | Quasi-experimental methods (difference-in-differences, regression discontinuity, instrumental variables) | Exploit naturally occurring variation but rely on identifying assumptions that can be probed, not fully tested |
| 3 | Observational methods (matching, regression) | Require strong conditional independence assumptions |

Higher tiers are not always feasible. The goal is to use the strongest design available and to be transparent about the assumptions required by weaker designs.

### Replication and Robustness

A single estimate, no matter how well-designed the study, provides limited evidence. Confidence increases when:

- **Replication**: Different studies, using different data and methods, reach similar conclusions
- **Robustness**: The estimate remains stable across reasonable changes in specification, sample, or assumptions

Sensitivity to minor analytical choices — such as which covariates to include or how to define the outcome — signals fragility in the evidence.

## 2. Principles for Trustworthy Automated Assessment

The principles above assume a human analyst reviewing a study. When the review is automated, with an LLM assessing measurement artifacts and assigning a confidence score, four additional failure modes emerge that must be addressed by design.

### Four Failure Modes of Automated Confidence Scoring

| Failure Mode | What Goes Wrong | Example |
|---|---|---|
| **Ungroundedness** | The score is not traceable to any observable artifact | The system produces "confidence: 0.73" without citing any diagnostic |
| **Incorrectness** | Artifacts are present but misread | An unbalanced covariate table is narrated as supportive |
| **Opacity** | No audit trail — the score cannot be challenged or inspected | "Why 0.73 and not 0.85?" yields a plausible but unfalsifiable answer |
| **Instability** | Same evidence described slightly differently yields a different score | Confidence becomes a function of prompt phrasing, not measurement quality |

A confidence score that cannot be defended is worse than no confidence score at all. In enterprise settings where decisions are audited and revisited, these failure modes are fatal.

### Four Pillars of Defensible Confidence

Each pillar directly addresses one failure mode:

| Pillar | Principle | Failure Mode Addressed | Mechanism |
|---|---|---|---|
| **Groundedness** | Every confidence claim traces to an observable statistical artifact | Ungroundedness | Common support checks, assumption tests, robustness diagnostics |
| **Correctness** | The system interprets those artifacts accurately | Incorrectness | Evals on synthetic artifacts with known ground truth |
| **Traceability** | Full audit trail from score to source data | Opacity | Per-dimension scores linked to specific diagnostics |
| **Reproducibility** | Same pipeline + same data = same assessment | Instability | Fixed prompts, structured schemas, version-pinned backends |

The pillars have a natural dependency order. Groundedness is the precondition: without observable artifacts, there is nothing to be correct about. Correctness builds on groundedness: the system must read the evidence accurately. Traceability makes correctness inspectable: when interpretations are wrong, the audit trail reveals where. Reproducibility ensures all three hold across runs, not just on a single evaluation.
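
One way to picture groundedness and traceability in software is a structured record in which every per-dimension score must name the diagnostics that justify it. The sketch below is purely illustrative: the class names, field names, diagnostic identifiers, and the choice to aggregate by taking the minimum are all assumptions, not the design of any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScore:
    dimension: str      # e.g. "balance", "overlap" (hypothetical labels)
    score: float        # assumed to lie in [0, 1]
    diagnostics: tuple  # identifiers of the artifacts that justify the score


@dataclass(frozen=True)
class Assessment:
    pipeline_version: str  # pinned so the same inputs can be re-run
    dimensions: tuple

    def ungrounded(self):
        """Dimensions whose score cites no diagnostic artifact."""
        return [d.dimension for d in self.dimensions if not d.diagnostics]

    def overall(self):
        """Refuse to score unless every dimension is grounded.

        Aggregating with min() is an illustrative assumption.
        """
        missing = self.ungrounded()
        if missing:
            raise ValueError(f"ungrounded dimensions: {missing}")
        return min(d.score for d in self.dimensions)


grounded = Assessment(
    pipeline_version="v1",
    dimensions=(
        DimensionScore("balance", 0.8, ("love_plot_smd",)),
        DimensionScore("overlap", 0.6, ("propensity_histogram",)),
    ),
)
ungrounded = Assessment(
    pipeline_version="v1",
    dimensions=(DimensionScore("overlap", 0.6, ()),),
)
print(grounded.overall(), ungrounded.ungrounded())
```

Because each score carries its own evidence pointers, a question such as "why 0.6 on overlap?" resolves to a named artifact rather than to a free-text justification.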

### The LLM as Narrator, Not Oracle

The key architectural implication of these pillars is a precise division of labor. The LLM's role is to *contextualize* diagnostics that the measurement engine already produces — it does not generate confidence from its own internal probabilities or invent evidence. It is a narrator reading a score sheet, not an oracle pronouncing judgment.

This bounds the LLM's contribution to interpretation within a constrained evidence set, which is what makes the output auditable. The correctness of that interpretation is then the empirical question that the evaluation framework in Lecture 3 assesses. Lecture 2 examines how these four pillars are instantiated in the design patterns of a production agentic system — the registry, prompt engineering, escalation, and Assess vs. Improve discipline that together enforce the pillars in software.

## 3. Diagnostics Shared Across Causal Methods

Regardless of the specific causal method, several diagnostic tools apply broadly. These checks assess whether the core assumptions of the design are plausible.

### Covariate Balance

Causal methods that condition on observables (matching, subclassification, regression) rely on treated and control groups being comparable. **Covariate balance** measures how similar the groups are on pre-treatment characteristics.

The standard metric is the **standardized mean difference (SMD)**:

$$\text{SMD}_k = \frac{\bar{X}_{k,\text{treated}} - \bar{X}_{k,\text{control}}}{\sqrt{(s_{k,\text{treated}}^2 + s_{k,\text{control}}^2) / 2}}$$

A common threshold is $|\text{SMD}| < 0.1$, though this is a guideline, not a rule. **Love plots** display SMDs for all covariates before and after adjustment, making it easy to assess whether a method has improved balance.

```{note}
We used Love plots in the matching and subclassification lecture to assess balance improvement. The evaluate tool checks covariate balance as part of its automated review.
```
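
The SMD formula above translates directly into a few lines of code. This is a minimal sketch using only the standard library; the covariate values are made up for illustration, and the 0.1 cutoff is the guideline mentioned above rather than a fixed rule.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """SMD: difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2.0)
    if pooled_sd == 0:
        raise ValueError("covariate has zero variance in both groups")
    return (mean(treated) - mean(control)) / pooled_sd


def balance_table(covariates, threshold=0.1):
    """Map each covariate name to (SMD, exceeds-threshold flag)."""
    table = {}
    for name, (treated, control) in covariates.items():
        smd = standardized_mean_difference(treated, control)
        table[name] = (smd, abs(smd) > threshold)
    return table


# Hypothetical pre-treatment covariate: age in years, by group.
covariates = {
    "age": ([34, 41, 29, 50, 38], [30, 36, 27, 45, 33]),
}
balance = balance_table(covariates)
for name, (smd, flagged) in balance.items():
    print(f"{name}: SMD={smd:.3f}, imbalanced={flagged}")
```

Running the same function on the raw and the adjusted samples gives the two columns of numbers that a Love plot displays.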

### Placebo Tests

**Placebo tests** apply the causal method in settings where we know the true effect is zero. If the method detects a "significant" effect where none exists, something is wrong.

Common placebo tests include:

| Type | Description | Example |
|------|-------------|---------|
| **Placebo outcomes** | Apply the method to outcomes that should not be affected by treatment | Treatment is a marketing campaign; placebo outcome is warehouse temperature |
| **Placebo treatments** | Assign treatment at a time or threshold where no real treatment occurred | In a DiD design, test for a "treatment effect" two periods before the actual intervention |
| **Placebo units** | Apply the method to units that were not actually treated | In synthetic control, run the method on a control unit and check if a gap appears |
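
As an illustration of a placebo treatment, the sketch below computes a simple 2x2 difference-in-differences on group-mean series, first at the actual treatment date and then at a fake date using only pre-treatment periods. The data and the function name `did_estimate` are hypothetical, and the sketch reports point estimates only; a real analysis would also need standard errors.

```python
from statistics import mean


def did_estimate(periods, treated, control, start, end=None):
    """2x2 DiD on group means: 'pre' is t < start, 'post' is start <= t < end."""
    pre = [i for i, t in enumerate(periods) if t < start]
    post = [i for i, t in enumerate(periods)
            if t >= start and (end is None or t < end)]
    change_treated = (mean([treated[i] for i in post])
                      - mean([treated[i] for i in pre]))
    change_control = (mean([control[i] for i in post])
                      - mean([control[i] for i in pre]))
    return change_treated - change_control


# Hypothetical group-mean outcomes; the real intervention starts in period 5.
periods = [1, 2, 3, 4, 5, 6]
treated = [10.0, 11.0, 12.0, 13.0, 17.0, 18.0]
control = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0]

actual_effect = did_estimate(periods, treated, control, start=5)

# Placebo: pretend treatment began in period 3, two periods early,
# and drop the truly treated periods so they cannot leak into the estimate.
placebo_effect = did_estimate(periods, treated, control, start=3, end=5)

print(f"actual DiD: {actual_effect:.2f}, placebo DiD: {placebo_effect:.2f}")
```

A placebo estimate near zero is what the design predicts; a placebo estimate of similar size to the actual one would suggest the method is picking up something other than the intervention.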

### Sensitivity Analysis

Every causal method relies on assumptions that cannot be fully tested with the data. **Sensitivity analysis** asks: How much would the assumptions need to be violated to overturn the conclusion?

- **Rosenbaum bounds**: For matching estimators, quantify how strongly an unobserved confounder would need to shift the odds of treatment between matched units to explain away the effect
- **Coefficient stability**: If adding additional covariates substantially changes the estimate, the conditional independence assumption may be fragile
- **Omitted variable bias bounds**: Formal frameworks (e.g., Oster 2019) bound the bias from unobserved confounders based on the explanatory power of observed ones
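
To make the Rosenbaum-bounds idea concrete, here is a minimal sketch for the simplest case: a sign test on matched pairs. Under hidden bias of at most $\Gamma$, the chance that the treated unit in a pair has the higher outcome is at most $\Gamma / (1 + \Gamma)$, which yields an upper bound on the one-sided p-value. The pair counts are hypothetical and the sketch assumes no tied pairs.

```python
import math


def sign_test_upper_p(n_pairs, n_treated_higher, gamma):
    """Upper bound on the one-sided sign-test p-value for hidden bias gamma >= 1.

    With gamma = 1 this is the ordinary sign test (no hidden bias).
    """
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    p_plus = gamma / (1.0 + gamma)
    return sum(
        math.comb(n_pairs, k) * p_plus**k * (1.0 - p_plus) ** (n_pairs - k)
        for k in range(n_treated_higher, n_pairs + 1)
    )


# Hypothetical study: 50 matched pairs, treated unit has the higher outcome in 36.
bounds = {gamma: sign_test_upper_p(50, 36, gamma) for gamma in (1.0, 1.5, 2.0)}
for gamma, p_upper in bounds.items():
    print(f"Gamma={gamma}: worst-case p-value <= {p_upper:.4f}")
```

The smallest $\Gamma$ at which the bound crosses the chosen significance level is the summary number: the lower it is, the less hidden bias it would take to overturn the conclusion.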

### Common Support / Overlap

Methods that compare treated and control units require that both groups exist across the relevant range of covariates. **Common support** (or **overlap**) means that for any covariate value, there is a positive probability of being in either group:

$$0 < P(D=1 \mid X=x) < 1 \quad \text{for all } x$$

Violations of common support mean the method is extrapolating rather than comparing like with like. Diagnostics include propensity score histograms and trimming rules.
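
One simple trimming rule keeps only units whose propensity score lies inside the range covered by both groups. The sketch below assumes the propensity scores have already been estimated elsewhere; the scores are made up, and the min-max rule is only one of several trimming rules in use.

```python
def common_support_bounds(ps_treated, ps_control):
    """Min-max rule: the score range that both groups cover."""
    lower = max(min(ps_treated), min(ps_control))
    upper = min(max(ps_treated), max(ps_control))
    return lower, upper


def share_outside(scores, lower, upper):
    """Fraction of units whose score falls outside [lower, upper]."""
    outside = [s for s in scores if s < lower or s > upper]
    return len(outside) / len(scores)


# Hypothetical estimated propensity scores.
ps_treated = [0.35, 0.50, 0.62, 0.78, 0.91, 0.97]
ps_control = [0.05, 0.12, 0.30, 0.44, 0.58, 0.80]

lower, upper = common_support_bounds(ps_treated, ps_control)
treated_dropped = share_outside(ps_treated, lower, upper)
control_dropped = share_outside(ps_control, lower, upper)

print(f"common support: [{lower}, {upper}]")
print(f"share trimmed: treated={treated_dropped:.2f}, control={control_dropped:.2f}")
```

A large trimmed share is itself a diagnostic: the estimate then describes only the overlapping subpopulation, which narrows the population the result speaks to.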

### Pre-treatment Outcome Trends

For methods that use time-series variation (difference-in-differences, synthetic control), **parallel pre-treatment trends** between treated and control units support, but cannot prove, the identifying assumption that the groups would have continued to move together without treatment. If outcomes diverge before treatment, the estimated effect may reflect pre-existing differences rather than the intervention.
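
A quick numerical check is to fit a straight line through the pre-treatment gap between the two groups: a slope near zero is consistent with parallel trends, while a clearly non-zero slope signals divergence before treatment. The sketch below uses hypothetical group-mean series and ordinary least squares computed by hand; it returns a point estimate only, without a standard error.

```python
from statistics import mean


def pre_trend_slope(periods, treated, control, treatment_start):
    """OLS slope of the treated-minus-control gap over pre-treatment periods."""
    pre = [(t, y1 - y0)
           for t, y1, y0 in zip(periods, treated, control)
           if t < treatment_start]
    t_bar = mean([t for t, _ in pre])
    gap_bar = mean([gap for _, gap in pre])
    sxx = sum((t - t_bar) ** 2 for t, _ in pre)
    sxy = sum((t - t_bar) * (gap - gap_bar) for t, gap in pre)
    return sxy / sxx


periods = [1, 2, 3, 4, 5, 6]          # treatment starts in period 5
control = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0]

# Hypothetical treated series: one parallel to control, one already diverging.
treated_parallel = [10.0, 11.0, 12.0, 13.0, 17.0, 18.0]
treated_diverging = [10.0, 11.5, 13.0, 14.5, 17.0, 18.0]

slope_parallel = pre_trend_slope(periods, treated_parallel, control, 5)
slope_diverging = pre_trend_slope(periods, treated_diverging, control, 5)

print(f"parallel: {slope_parallel:.2f} per period, "
      f"diverging: {slope_diverging:.2f} per period")
```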

## 4. Method-Specific Diagnostics

Beyond shared diagnostics, each causal method has its own set of checks tailored to its specific assumptions.

### Experiments (RCTs)

| Diagnostic | What It Checks | Red Flag |
|------------|----------------|----------|
| **Randomization integrity** | Covariate balance between treatment and control | Systematic differences in baseline characteristics |
| **Attrition** | Whether dropout rates differ by treatment status | Differential attrition breaks the comparability that random assignment created |
| **Non-compliance** | Whether all units received their assigned treatment | High non-compliance dilutes the intention-to-treat estimate relative to the effect of treatment actually received |
| **Spillover / SUTVA** | Whether treatment of one unit affects others | Interference between units biases the estimate |
| **Effect plausibility** | Whether the magnitude of the effect is realistic | Implausibly large effects suggest measurement or specification errors |
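
The attrition check in the table can be sketched as a two-proportion z-test on dropout rates. The counts below are hypothetical, and the pooled-variance normal approximation assumes reasonably large arms.

```python
import math
from statistics import NormalDist


def differential_attrition(n_treated, dropped_treated, n_control, dropped_control):
    """Difference in dropout rates and a two-sided p-value (pooled z-test)."""
    rate_t = dropped_treated / n_treated
    rate_c = dropped_control / n_control
    pooled = (dropped_treated + dropped_control) / (n_treated + n_control)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_treated + 1.0 / n_control))
    z = (rate_t - rate_c) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return rate_t - rate_c, p_value


# Hypothetical experiment: 1,000 units per arm, 180 vs. 120 dropouts.
rate_gap, attrition_p = differential_attrition(1000, 180, 1000, 120)
print(f"dropout gap: {rate_gap:.3f}, p-value: {attrition_p:.4f}")
```

A small p-value here does not say which direction the bias runs; it only says the remaining treated and control samples may no longer be comparable.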

### Matching & Subclassification

| Diagnostic | What It Checks | Red Flag |
|------------|----------------|----------|
| **Balance improvement** | Whether matching reduced covariate imbalance | SMDs remain large after matching |
| **Common support** | Whether treated and control overlap in covariate space | Many treated units have no comparable controls |
| **Hidden bias sensitivity** | How large an unobserved confounder would need to be | Effect disappears at low values of Rosenbaum's $\Gamma$ |

### Synthetic Control

| Diagnostic | What It Checks | Red Flag |
|------------|----------------|----------|
| **Pre-treatment fit (RMSPE)** | How well the synthetic control tracks the treated unit before treatment | Large pre-treatment gaps undermine post-treatment comparisons |
| **Placebo gaps** | Whether placebo units show similar post-treatment gaps | Many placebos have gaps as large as the treated unit |
| **RMSPE ratios** | Post/pre RMSPE ratio relative to placebo distribution | Treated unit's ratio is not extreme relative to placebos |
| **Donor composition** | Whether the synthetic control relies on sensible weights | Weights concentrated on dissimilar units |
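
The RMSPE-based diagnostics can be sketched in a few lines. The gaps (treated outcome minus synthetic control) and the placebo ratios below are made up; the rank-based p-value treats the treated unit as one draw among the placebo runs, which is an assumption of this permutation-style comparison.

```python
import math


def rmspe(gaps):
    """Root mean squared prediction error of a gap series."""
    return math.sqrt(sum(g * g for g in gaps) / len(gaps))


def rmspe_ratio(pre_gaps, post_gaps):
    """Post-treatment RMSPE divided by pre-treatment RMSPE."""
    return rmspe(post_gaps) / rmspe(pre_gaps)


def placebo_p_value(treated_ratio, placebo_ratios):
    """Share of units (treated included) with a ratio at least as large."""
    all_ratios = [treated_ratio] + list(placebo_ratios)
    return sum(r >= treated_ratio for r in all_ratios) / len(all_ratios)


# Hypothetical gaps for the treated unit.
pre_gaps = [0.5, -0.5, 0.5, -0.5]
post_gaps = [4.0, 4.0, 4.0]
treated_ratio = rmspe_ratio(pre_gaps, post_gaps)

# Hypothetical ratios from re-running the method on nine donor units.
placebo_ratios = [1.2, 0.8, 2.5, 1.1, 3.0, 0.9, 1.6, 1.4, 2.0]
p_value = placebo_p_value(treated_ratio, placebo_ratios)

print(f"treated RMSPE ratio: {treated_ratio:.1f}, placebo p-value: {p_value:.2f}")
```

With only nine placebo units the smallest attainable p-value is 0.1, so the size of the donor pool limits how strong this evidence can be.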

### Other Methods (Preview)

Methods covered later in the course have their own diagnostics:

| Method | Key Diagnostics |
|--------|------------------|
| **Difference-in-Differences** | Parallel pre-trends, event study plots, staggered treatment tests |
| **Instrumental Variables** | First-stage F-statistic, exclusion restriction arguments, overidentification tests |
| **Regression Discontinuity** | Continuity of baseline covariates at cutoff, McCrary density test, bandwidth sensitivity |

## Additional Resources

- **Angrist, J. D. & Pischke, J.‑S. (2010)**. The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. *Journal of Economic Perspectives*, 24(2), 3–30.

- **Athey, S. & Imbens, G. W. (2017)**. [The econometrics of randomized experiments](https://doi.org/10.1016/bs.hefe.2016.10.003). In *Handbook of Economic Field Experiments* (Vol. 1, pp. 73–140). Elsevier.

- **Imbens, G. W. (2020)**. [Potential outcome and directed acyclic graph approaches to causality: Relevance for empirical practice in economics](https://www.aeaweb.org/articles?id=10.1257/jel.20191597). *Journal of Economic Literature*, 58(4), 1129–1179.

- **Oster, E. (2019)**. Unobservable selection and coefficient stability: Theory and evidence. *Journal of Business & Economic Statistics*, 37(2), 187–204.

- **Rosenbaum, P. R. (2002)**. *Observational Studies* (2nd ed.). Springer.