# Evaluating Causal Evidence

Measuring causal effects is only the first step. Before using an estimate to guide a business decision, we need to ask: **How much should we trust this number?** Two questions structure this lecture. First, what general principles separate trustworthy evidence from unreliable evidence? Second, what specific diagnostic checks test each method's identifying assumptions? Section 1 develops the conceptual toolkit — validity, the hierarchy of evidence designs, the stress tests that probe whether a single estimate deserves confidence, and the distinction between statistical and practical significance. Section 2 turns to the method-specific diagnostics for experiments, matching, and synthetic control.

## 1. Evaluating Evidence

### Internal and External Validity

Every causal study faces two distinct questions about the quality of its evidence. Internal validity asks whether the causal estimate is correct for the study population. Threats to internal validity include selection bias, confounding, and measurement error; when any of these is present, the number itself may be wrong. External validity asks whether the estimate generalizes to other populations or settings. An internally valid estimate from a narrow population may still fail to predict outcomes in a different market, time period, or customer segment.

| Validity Type | Question | Threats |
|---------------|----------|----------|
| **Internal validity** | Is the causal estimate correct *for this study population*? | Selection bias, confounding, measurement error |
| **External validity** | Does the estimate generalize *to other populations or settings*? | Sample selection, context dependence, time effects |

Internal validity is a prerequisite. An estimate that is wrong for the study population tells us nothing about any other setting. But even an internally valid estimate may not travel well, so both questions matter for decision-making.

### Hierarchy of Evidence

Not all research designs provide equally credible evidence. The strength of a causal claim depends on how effectively the design rules out alternative explanations for the observed association.

<img src="../../_static/hierarchy-of-evidence.svg" alt="Hierarchy of evidence — experiments, observational causal studies, time series analysis" style="display: block; margin: 1em auto; max-width: 600px;">

At the top sit experiments, where random assignment of units to treatment and control removes selection bias in expectation, so the groups differ only by chance. The middle tier contains observational causal studies (methods such as matching, difference-in-differences, instrumental variables, regression discontinuity, and synthetic control) that attempt to mimic random assignment by exploiting natural variation or conditioning on observables. These designs produce credible estimates when their identifying assumptions hold, but those assumptions cannot be verified from the data alone. The bottom tier contains time-series approaches that track a metric before and after an intervention without constructing an explicit counterfactual. These designs are the easiest to implement but the most vulnerable to confounding from concurrent events.

Higher tiers are not always feasible. The goal is to use the strongest design available and to be transparent about the assumptions required by weaker designs.

### Stress-Testing a Single Estimate

A single number from a single study — even a well-designed one — is still just one number. Three complementary strategies probe whether that number deserves confidence, each attacking the estimate from a different angle.

<img src="../../_static/stress-tests.svg" alt="Three stress tests converging on a causal estimate — robustness, sensitivity, placebo" style="display: block; margin: 1em auto; max-width: 600px;">

#### Robustness Checks

Robustness checks vary the analytical choices that the researcher controls and ask whether the estimate holds. Every study requires decisions about model specification, sample boundaries, and which covariates to include; reasonable analysts could make different choices at each step. If the central result survives a range of these alternatives, you have a cloud of estimates pointing in the same direction rather than a lone point. If the result is sensitive to seemingly innocuous choices, that fragility is itself informative.
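
As a minimal sketch of this idea, the snippet below re-estimates a difference in means under several alternative sample boundaries (here, how much of the upper tail to drop). The data are simulated and the helper names (`diff_in_means`, `trim_top`, `robustness_table`) are hypothetical; a real analysis would also vary model specification and covariate sets.

```python
import random
import statistics

def diff_in_means(treated, control):
    """Difference in mean outcomes: treated minus control."""
    return statistics.fmean(treated) - statistics.fmean(control)

def trim_top(values, share):
    """Drop the largest `share` of observations (share=0 keeps everything)."""
    keep = len(values) - int(len(values) * share)
    return sorted(values)[:keep]

def robustness_table(treated, control, trim_shares=(0.0, 0.01, 0.05)):
    """Re-estimate the effect under each alternative sample boundary."""
    return {
        share: diff_in_means(trim_top(treated, share), trim_top(control, share))
        for share in trim_shares
    }

# Simulated data (an assumption for illustration): true effect of 2.0.
rng = random.Random(0)
control = [rng.gauss(10.0, 3.0) for _ in range(2000)]
treated = [rng.gauss(12.0, 3.0) for _ in range(2000)]

estimates = robustness_table(treated, control)
for share, est in estimates.items():
    print(f"trim top {share:.0%}: estimate = {est:.2f}")
```

If the estimates cluster together, the result does not hinge on the trimming rule; a sign flip or a large swing across rows would be the fragility described above.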

#### Sensitivity Analysis

Sensitivity analysis asks how severely the method's key assumptions would need to be violated to overturn the result. Rather than testing whether the assumptions hold, which is typically impossible with observational data, it quantifies the margin of safety. An estimate that survives large hypothetical violations is more credible than one that collapses under modest departures from the identifying assumptions. Each causal method has its own sensitivity tools: Rosenbaum bounds for matching, Oster bounds for regression, and pre-treatment fit sensitivity for synthetic control. Section 2 returns to the matching and synthetic control tools as part of each method's diagnostics.
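
To make the "margin of safety" concrete, here is a deliberately simple sketch based on the textbook omitted-variable-bias formula for a linear model. It is not Rosenbaum or Oster bounds. It assumes a single unobserved confounder whose bias equals its effect on the outcome times its mean difference between treated and control units. The point estimate and grid values are hypothetical.

```python
def adjusted_estimate(estimate, confounder_effect, confounder_imbalance):
    """Estimate after removing the bias from a hypothetical unobserved confounder.

    In a linear model, the omitted-variable bias is the confounder's effect on
    the outcome times its mean difference between treated and control units.
    """
    return estimate - confounder_effect * confounder_imbalance

def breakeven_imbalance(estimate, confounder_effect):
    """Imbalance at which the adjusted estimate reaches exactly zero."""
    return estimate / confounder_effect

estimate = 2.0  # hypothetical point estimate
for effect in (1.0, 2.0, 4.0):
    for imbalance in (0.25, 0.5, 1.0):
        adj = adjusted_estimate(estimate, effect, imbalance)
        print(f"effect={effect}, imbalance={imbalance}: adjusted={adj:+.2f}")
```

Reading the grid, the question becomes whether a confounder of the size needed to push the adjusted estimate to zero is plausible in the application at hand.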

#### Placebo Tests

Placebo tests apply the causal method in settings where the true effect is known to be zero. If the method detects a significant effect where none should exist, something is wrong with the design, the data, or the implementation. Three variants target different dimensions of the analysis: outcomes that should be unaffected, time periods before the intervention occurred, and units that were never treated.

| Type | Approach | Example |
|------|----------|---------|
| **Placebo outcomes** | Apply the method to outcomes that should not be affected by treatment | Treatment is a marketing campaign; placebo outcome is warehouse temperature |
| **Placebo treatments** | Assign treatment at a time or threshold where no real treatment occurred | In a DiD design, test for a "treatment effect" two periods before the actual intervention |
| **Placebo units** | Apply the method to units that were not actually treated | In synthetic control, run the method on a control unit and check if a gap appears |
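
The placebo-treatment row can be sketched with a simple 2x2 difference-in-differences on group means. The period means below are made up for illustration, and the function name `did` and its `start`/`end` parameters are hypothetical. The placebo run keeps only pre-intervention periods and pretends treatment began two periods early.

```python
import statistics

def did(treated, control, start, end=None):
    """2x2 difference-in-differences on per-period group means.

    Periods at index `start` or later count as "post". `end` truncates the
    series so a placebo run can exclude the real post-treatment periods.
    """
    t, c = treated[:end], control[:end]
    change_t = statistics.fmean(t[start:]) - statistics.fmean(t[:start])
    change_c = statistics.fmean(c[start:]) - statistics.fmean(c[:start])
    return change_t - change_c

# Hypothetical period means; the real intervention starts at index 6.
control = [10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5]
treated = [12.0, 12.5, 13.0, 13.5, 14.0, 14.5, 17.0, 17.5]

actual = did(treated, control, start=6)
placebo = did(treated, control, start=4, end=6)  # fake start, two periods early
print(f"actual effect:  {actual:.2f}")
print(f"placebo effect: {placebo:.2f}")
```

A placebo estimate near zero is consistent with the design; a placebo estimate comparable to the actual one would suggest the groups were already diverging before treatment.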

### From Estimate to Evidence

#### Statistical and Practical Significance

A result can be statistically significant without being practically meaningful, and the reverse is equally true. Statistical significance asks whether an effect at least as large as the one observed could plausibly have arisen by chance if the true effect were zero, typically assessed through p-values and confidence intervals. Practical significance asks whether the effect is large enough to matter for the decision at hand, assessed through effect sizes and cost-benefit analysis.

Large samples can make even trivially small effects statistically significant, while small samples may fail to detect genuinely important effects. Decision-makers need evidence on both fronts: that the effect is real and that it is large enough to justify action.
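
The two scenarios in this paragraph can be illustrated with a short sketch. It uses a normal approximation for a difference in means with equal arm sizes and a common standard deviation. The effect sizes, sample sizes, and the `min_worthwhile_effect` threshold are hypothetical inputs that would come from the business context.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_p_value(effect, sd, n_per_arm):
    """Two-sided p-value for a difference in means (normal approximation)."""
    standard_error = sd * sqrt(2.0 / n_per_arm)
    z = abs(effect) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(z))

def assess(effect, sd, n_per_arm, min_worthwhile_effect, alpha=0.05):
    """Report statistical and practical significance separately."""
    p = two_sample_p_value(effect, sd, n_per_arm)
    return {
        "p_value": p,
        "statistically_significant": p < alpha,
        "practically_significant": abs(effect) >= min_worthwhile_effect,
    }

# Tiny effect, huge sample versus large effect, small sample.
tiny_but_precise = assess(0.02, sd=1.0, n_per_arm=1_000_000, min_worthwhile_effect=0.10)
large_but_noisy = assess(0.50, sd=1.0, n_per_arm=20, min_worthwhile_effect=0.10)
print(tiny_but_precise)
print(large_but_noisy)
```

Keeping the two flags separate in the output mirrors the decision logic: neither one alone is enough to justify action.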

#### Replication

Stress tests probe a single study from within. Replication tests the conclusion from outside: different researchers, working with different data and different methods, ask the same causal question and check whether they reach similar conclusions. When multiple independent studies converge on the same answer, the evidence is far more credible than any single estimate — however thoroughly stress-tested — can be on its own.

## 2. Method-Specific Diagnostics

Each causal method rests on its own set of identifying assumptions, and each assumption has diagnostic checks designed to detect violations. The diagnostics below cover the three methods developed in this course — experiments, matching, and synthetic control. Other designs (difference-in-differences, instrumental variables, regression discontinuity) have analogous checks.

### Experiments (RCTs)

| Diagnostic | What It Checks | Red Flag |
|------------|----------------|----------|
| **Randomization integrity** | Covariate balance between treatment and control | Systematic differences in baseline characteristics |
| **Attrition** | Whether dropout rates differ by treatment status | Dropout rates differ between arms, so the remaining groups are no longer comparable |
| **Non-compliance** | Whether all units received their assigned treatment | Many units did not receive their assigned treatment, diluting the estimated effect |
| **Spillover / SUTVA** | Whether treatment of one unit affects others | Control units are exposed to the treatment through treated units, biasing the estimate |
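
As one example of these checks, the attrition row can be tested with a two-proportion z-test on dropout counts. This is a minimal sketch using a normal approximation. The function name `differential_attrition` and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def differential_attrition(dropped_t, n_t, dropped_c, n_c):
    """Two-proportion z-test comparing dropout rates across arms.

    Returns the gap in dropout rates (treated minus control) and a
    two-sided p-value under the null of equal dropout rates.
    """
    rate_t, rate_c = dropped_t / n_t, dropped_c / n_c
    pooled = (dropped_t + dropped_c) / (n_t + n_c)
    se = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = (rate_t - rate_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_t - rate_c, p_value

# Hypothetical counts: 150 of 1000 treated units dropped out vs 100 of 1000 controls.
gap, p_value = differential_attrition(150, 1000, 100, 1000)
print(f"dropout gap = {gap:.3f}, p-value = {p_value:.4f}")
```

A small p-value only says the dropout rates differ; whether the units that remain are still comparable needs a separate balance check on baseline covariates.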

### Matching and Subclassification

Matching methods condition on observed covariates to make treated and control groups comparable. Their diagnostics focus on verifying that comparability, detecting gaps in overlap, and bounding the influence of unobserved confounders.

**Covariate balance** measures how similar the groups are on pre-treatment characteristics after adjustment. The standard metric is the standardized mean difference, which expresses the gap between treated and control group means in units of pooled standard deviations. A common threshold is an absolute standardized mean difference below 0.1, though this is a guideline rather than a rule. Love plots display standardized mean differences for all covariates before and after adjustment, making it straightforward to assess whether matching has improved balance.
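
A minimal sketch of the standardized mean difference for one covariate follows. It uses one common convention for the pooled standard deviation (the square root of the average of the two group variances). The ages are made-up values, not course data.

```python
from math import sqrt
from statistics import fmean, variance

def standardized_mean_difference(treated, control):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2.0)
    return (fmean(treated) - fmean(control)) / pooled_sd

# Hypothetical covariate (age) before and after matching.
age_treated = [30, 40, 50, 60]
age_control_all = [20, 25, 30, 35, 40, 45]
age_control_matched = [32, 41, 49, 60]

smd_before = standardized_mean_difference(age_treated, age_control_all)
smd_after = standardized_mean_difference(age_treated, age_control_matched)
print(f"SMD before matching: {smd_before:+.2f}")
print(f"SMD after matching:  {smd_after:+.2f}")
```

Computing this pair of numbers for every covariate gives exactly the data a Love plot displays.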

**Common support** (or overlap) requires that for any combination of covariate values, there is a positive probability of being in either the treatment or the control group. Violations mean the method is extrapolating rather than comparing like with like. Propensity score histograms are the standard diagnostic, and trimming rules that drop units outside the region of overlap are the standard remedy.
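
One simple trimming rule, sketched below, keeps only units whose estimated propensity score falls inside the range covered by both groups (the min/max rule). The scores are hypothetical and assumed to have been estimated already; other trimming rules, such as fixed cutoffs, are also used in practice.

```python
def common_support_bounds(ps_treated, ps_control):
    """Range of propensity scores observed in both groups (min/max rule)."""
    lower = max(min(ps_treated), min(ps_control))
    upper = min(max(ps_treated), max(ps_control))
    return lower, upper

def trim_to_common_support(scores, lower, upper):
    """Keep only scores inside the common-support interval."""
    return [p for p in scores if lower <= p <= upper]

# Hypothetical estimated propensity scores.
ps_treated = [0.35, 0.50, 0.65, 0.80, 0.95]
ps_control = [0.05, 0.15, 0.30, 0.45, 0.70]

lower, upper = common_support_bounds(ps_treated, ps_control)
kept_treated = trim_to_common_support(ps_treated, lower, upper)
kept_control = trim_to_common_support(ps_control, lower, upper)
print(f"common support: [{lower}, {upper}]")
print(f"treated kept: {len(kept_treated)} of {len(ps_treated)}")
print(f"control kept: {len(kept_control)} of {len(ps_control)}")
```

Trimming changes the population the estimate describes, so the number of units dropped should be reported alongside the result.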

| Diagnostic | What It Checks | Red Flag |
|------------|----------------|----------|
| **Balance improvement** | Whether matching reduced covariate imbalance (Love plots, SMD) | SMDs remain large after matching |
| **Common support** | Whether treated and control overlap in covariate space | Many treated units have no comparable controls |
| **Hidden bias sensitivity** | How strong an unobserved confounder would need to be to overturn the result (Rosenbaum $\Gamma$) | Effect loses significance at values of $\Gamma$ close to 1 |
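
For matched pairs, the simplest version of Rosenbaum's sensitivity analysis uses the sign test. The sketch below assumes pairs with tied outcomes have been dropped, and the counts (100 pairs, 70 in which the treated unit has the higher outcome) are hypothetical. Under hidden bias of at most $\Gamma$, the null probability that the treated unit has the higher outcome in a pair is at most $\Gamma / (1 + \Gamma)$, which gives a worst-case p-value.

```python
from math import comb

def sign_test_upper_p(n_pairs, n_treated_higher, gamma):
    """Worst-case one-sided sign-test p-value under hidden bias of at most gamma.

    gamma = 1 reproduces the ordinary sign test (no hidden bias).
    """
    p_plus = gamma / (1.0 + gamma)
    return sum(
        comb(n_pairs, k) * p_plus**k * (1.0 - p_plus) ** (n_pairs - k)
        for k in range(n_treated_higher, n_pairs + 1)
    )

# Hypothetical matched sample: 100 untied pairs, treated higher in 70.
for gamma in (1.0, 1.25, 1.5, 2.0):
    p = sign_test_upper_p(100, 70, gamma)
    print(f"Gamma = {gamma:<4}: worst-case p-value = {p:.4f}")
```

The smallest $\Gamma$ at which the worst-case p-value crosses the chosen significance level is the number to report: the larger it is, the stronger a hidden confounder would need to be to explain away the result.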

### Synthetic Control

Synthetic control methods construct a counterfactual by weighting untreated units to match the treated unit's pre-treatment trajectory. Their diagnostics test the quality of that match and the statistical significance of the post-treatment gap.

**Pre-treatment fit** is the foundation: if the synthetic control does not track the treated unit closely before the intervention, the post-treatment gap is not credible. Pre-treatment root mean squared prediction error (RMSPE) quantifies the fit. When pre-treatment trends between the synthetic and treated unit diverge, the estimated effect may reflect pre-existing differences rather than the intervention.
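
For reference, one standard way to write the pre-treatment RMSPE is shown below. The notation is an assumption, not taken from earlier in the course: unit 1 is the treated unit, units $2, \dots, J+1$ are the donors with weights $w_j$, $Y_{jt}$ is the outcome of unit $j$ in period $t$, and $T_0$ is the number of pre-treatment periods.

$$
\text{RMSPE}_{\text{pre}} = \sqrt{\frac{1}{T_0} \sum_{t=1}^{T_0} \Bigl( Y_{1t} - \sum_{j=2}^{J+1} w_j Y_{jt} \Bigr)^{2}}
$$

The post-treatment RMSPE uses the same expression over the periods after $T_0$.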

| Diagnostic | What It Checks | Red Flag |
|------------|----------------|----------|
| **Pre-treatment fit (RMSPE)** | How well the synthetic control tracks the treated unit before treatment | Large pre-treatment gaps undermine post-treatment comparisons |
| **Placebo gaps** | Whether placebo units show similar post-treatment gaps (see Section 1) | Many placebos have gaps as large as the treated unit's |
| **RMSPE ratios** | Ratio of post-treatment to pre-treatment RMSPE, compared with the placebo distribution | Treated unit's ratio is not extreme relative to placebos |
| **Donor composition** | Whether the synthetic control relies on sensible weights | Weights concentrated on dissimilar units |
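
The RMSPE-ratio and placebo rows can be combined into a permutation-style p-value. The sketch below assumes the gaps (actual minus synthetic outcome) have already been computed for the treated unit and for each placebo unit. The gap values and function names are hypothetical.

```python
from math import sqrt

def rmspe(gaps):
    """Root mean squared prediction error of a list of gaps."""
    return sqrt(sum(g * g for g in gaps) / len(gaps))

def rmspe_ratio(gaps, n_pre):
    """Post-treatment RMSPE divided by pre-treatment RMSPE."""
    return rmspe(gaps[n_pre:]) / rmspe(gaps[:n_pre])

def placebo_p_value(treated_ratio, placebo_ratios):
    """Share of all units (treated included) with a ratio at least as large."""
    all_ratios = [treated_ratio] + list(placebo_ratios)
    return sum(r >= treated_ratio for r in all_ratios) / len(all_ratios)

# Hypothetical gaps: 4 pre-treatment periods, then 2 post-treatment periods.
n_pre = 4
treated_gaps = [0.1, -0.1, 0.1, -0.1, 2.0, 2.0]
placebo_gaps = [
    [0.2, -0.2, 0.2, -0.2, 0.3, -0.3],
    [0.5, -0.5, 0.5, -0.5, 0.5, 0.5],
    [0.1, 0.1, -0.1, -0.1, 0.4, 0.4],
]

treated_ratio = rmspe_ratio(treated_gaps, n_pre)
placebo_ratios = [rmspe_ratio(g, n_pre) for g in placebo_gaps]
p_value = placebo_p_value(treated_ratio, placebo_ratios)
print(f"treated ratio = {treated_ratio:.1f}, placebo p-value = {p_value:.2f}")
```

With only three placebo units the p-value cannot fall below 1/4, which is why the size of the donor pool limits how strong the inference from this test can be.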

## Additional resources

- **Angrist, J. D. & Pischke, J.‑S. (2010)**. The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. *Journal of Economic Perspectives*, 24(2), 3–30.

- **Athey, S. & Imbens, G. W. (2017)**. [The econometrics of randomized experiments](https://doi.org/10.1016/bs.hefe.2016.10.003). In *Handbook of Economic Field Experiments* (Vol. 1, pp. 73–140). Elsevier.

- **Imbens, G. W. (2020)**. [Potential outcome and directed acyclic graph approaches to causality: Relevance for empirical practice in economics](https://www.aeaweb.org/articles?id=10.1257/jel.20191597). *Journal of Economic Literature*, 58(4), 1129–1179.

- **Oster, E. (2019)**. Unobservable selection and coefficient stability: Theory and evidence. *Journal of Business & Economic Statistics*, 37(2), 187–204.

- **Rosenbaum, P. R. (2002)**. *Observational Studies* (2nd ed.). Springer.