# Matching and Subclassification

> **Reference:** *Causal Inference: The Mixtape*, Chapter 5: Matching and Subclassification (pp. 175-206)

This lecture introduces methods for causal inference when selection into treatment depends on observed covariates.

---

## Part I: Theory

This section covers the theoretical foundations of matching and subclassification as presented in Cunningham's *Causal Inference: The Mixtape*, Chapter 5.

## 1. Selection on Observables

In the previous lecture on directed acyclic graphs, we learned that the **backdoor criterion** tells us which variables to condition on to identify causal effects. When a set of observed covariates $X$ satisfies the backdoor criterion, we have a situation called **selection on observables**. This means that treatment assignment, while not random, depends only on variables we can measure and control for.

### The Conditional Independence Assumption

The key identifying assumption for all methods in this lecture is the **conditional independence assumption (CIA)**, also known as **unconfoundedness** or **ignorability**:

$$(Y^1, Y^0) \perp D \mid X$$

where $\perp$ denotes statistical independence. This assumption states that, conditional on the observed covariates $X$, the potential outcomes are independent of treatment assignment. In plain language: once we account for the variables in $X$, treatment is "as good as random."

What does this mean in practice? Consider two individuals with identical values of $X$. Under CIA, the fact that one received treatment and the other did not is essentially random—there are no other factors systematically driving both their treatment status and their outcomes.

### Implications of CIA

When CIA holds, the expected potential outcomes are equal across treatment groups for each value of $X$:

$$E[Y^1 \mid D=1, X] = E[Y^1 \mid D=0, X] = E[Y^1 \mid X]$$

$$E[Y^0 \mid D=1, X] = E[Y^0 \mid D=0, X] = E[Y^0 \mid X]$$

This is powerful because it allows us to use observed outcomes from the control group to estimate the counterfactual for the treatment group (and vice versa) within each stratum defined by $X$.

### Common Support

CIA alone is not sufficient for identification. We also need the **common support** (or **overlap**) assumption:

$$0 < \Pr(D=1 \mid X) < 1$$

This requires that for every value of $X$, there is a positive probability of being in both the treatment and control groups. Without common support, we cannot compare treated and untreated units with similar characteristics—some regions of the covariate space would have only treated units or only control units.

| Assumption | Formal Statement | Intuition |
|------------|------------------|------------|
| **CIA** | $(Y^1, Y^0) \perp D \mid X$ | Treatment is "as good as random" given $X$ |
| **Common Support** | $0 < \Pr(D=1 \mid X) < 1$ | Both treated and control units exist for all $X$ |

Together, these assumptions allow us to identify causal effects by comparing outcomes across treatment groups within strata defined by $X$.
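With a discrete covariate, common support can be inspected directly by computing the share of treated units in each cell of $X$. The sketch below is a minimal illustration on made-up data; the function names `propensity_by_cell` and `off_support` are hypothetical, not from the text. A sample share of exactly 0 or 1 is only evidence of a support problem in the data at hand, not proof that the population probability is 0 or 1.

```python
from collections import defaultdict

def propensity_by_cell(x, d):
    """Return {x_value: share of treated units} for a discrete covariate x."""
    counts = defaultdict(lambda: [0, 0])  # x_value -> [n_control, n_treated]
    for xi, di in zip(x, d):
        counts[xi][di] += 1
    return {xi: n1 / (n0 + n1) for xi, (n0, n1) in counts.items()}

def off_support(x, d):
    """Cells whose sample share of treated units is exactly 0 or 1."""
    shares = propensity_by_cell(x, d)
    return sorted(xi for xi, p in shares.items() if p in (0.0, 1.0))

# Hypothetical toy data: an age-group covariate and a treatment indicator
x = ["young", "young", "young", "middle", "middle", "old", "old"]
d = [1, 0, 1, 1, 0, 0, 0]

print(propensity_by_cell(x, d))
print(off_support(x, d))  # cells with no treated-control comparison available
```

In this toy sample the "old" cell contains only controls, so no within-cell comparison is possible there.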

### From Assumptions to Identification

Under CIA and common support, the average treatment effect (ATE) is identified as:

$$\delta_{ATE} = E[Y^1 - Y^0] = \int \left( E[Y \mid X, D=1] - E[Y \mid X, D=0] \right) dF(X)$$

The conditional expectations $E[Y \mid X, D=1]$ and $E[Y \mid X, D=0]$ are directly estimable from data. The integral averages these conditional effects over the distribution of $X$ in the population.
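The step from the assumptions to this expression can be spelled out for a single value of $X$. The first equality uses CIA (the two implications stated earlier). The second uses the fact that the observed outcome equals $Y^1$ for treated units and $Y^0$ for control units. Common support ensures that both conditional means are defined at every $X$:

$$E[Y^1 - Y^0 \mid X] = E[Y^1 \mid X, D=1] - E[Y^0 \mid X, D=0] = E[Y \mid X, D=1] - E[Y \mid X, D=0]$$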

The challenge is how to implement this in practice. With continuous covariates or many discrete covariates, we cannot simply compute separate means for each unique value of $X$. The methods we discuss below—subclassification and matching—provide practical approaches to this problem.

## 2. Subclassification

**Subclassification** (also called **stratification**) is the most direct approach to satisfying the backdoor criterion. The idea is simple: divide the data into strata based on the confounding variable(s), compute treatment effects within each stratum, and then average across strata.

### Historical Context: Smoking and Lung Cancer

The subclassification method has deep historical roots. One of its most famous applications was in the debate over whether smoking causes lung cancer. In the mid-20th century, statisticians like William Cochran used subclassification to address critiques from skeptics like Ronald Fisher.

Fisher argued that the correlation between smoking and lung cancer could be spurious—perhaps a genetic factor caused both smoking behavior and susceptibility to lung cancer. If such a confounder existed, comparing smokers to non-smokers would not reveal the causal effect of smoking.

Cochran's insight was to compare smokers and non-smokers *within age groups*. If age was a confounder (older people both smoke different types of tobacco and have higher mortality), then age-adjusted comparisons would remove this source of bias.

### How Subclassification Works

The subclassification procedure consists of three steps:

1. **Stratify**: Divide the sample into $K$ mutually exclusive strata based on the covariate(s) $X$
2. **Compare**: Within each stratum $k$, compute the difference in mean outcomes between treated and control units
3. **Aggregate**: Take a weighted average of the within-stratum effects

For the average treatment effect on the treated (ATT), the estimator is:

$$\hat{\delta}_{ATT} = \sum_{k=1}^{K} \left( \bar{Y}^{1k} - \bar{Y}^{0k} \right) \times \frac{N_T^k}{N_T}$$

where:
- $\bar{Y}^{1k}$ is the mean outcome for treated units in stratum $k$
- $\bar{Y}^{0k}$ is the mean outcome for control units in stratum $k$
- $N_T^k$ is the number of treated units in stratum $k$
- $N_T$ is the total number of treated units

The weights $\frac{N_T^k}{N_T}$ ensure that the strata contribute to the overall estimate in proportion to how many treated units they contain.
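The estimator above can be sketched in a few lines of Python. This is a minimal illustration on made-up numbers, not code from the text; the function name `subclassification_att` and the choice to raise an error when a stratum has treated units but no controls are assumptions made here.

```python
from collections import defaultdict
from statistics import mean

def subclassification_att(y, d, stratum):
    """Weighted average of within-stratum mean differences, weights N_T^k / N_T."""
    groups = defaultdict(lambda: {0: [], 1: []})  # stratum -> outcomes by arm
    for yi, di, ki in zip(y, d, stratum):
        groups[ki][di].append(yi)
    n_treated = sum(len(g[1]) for g in groups.values())
    att = 0.0
    for k, g in groups.items():
        if not g[1]:
            continue  # no treated units, so this stratum has zero ATT weight
        if not g[0]:
            raise ValueError(f"stratum {k!r} has treated units but no controls")
        att += (mean(g[1]) - mean(g[0])) * len(g[1]) / n_treated
    return att

# Hypothetical data: two strata, three treated units in total
y = [10, 12, 7, 8, 20, 15, 16, 17]
d = [1, 1, 0, 0, 1, 0, 0, 0]
stratum = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(subclassification_att(y, d, stratum))
```

Stratum "a" holds two of the three treated units, so its within-stratum difference receives weight 2/3 and stratum "b" receives weight 1/3.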

### Worked Example: Age-Adjusted Mortality Rates

Consider Cochran's analysis of smoking and mortality, using age as the stratifying variable. The data (simplified from Cochran 1968) shows death rates per 1,000 person-years:

| Age Group | Death Rate per 1,000 Person-Years (Cigarette Smokers) | Number of Cigarette Smokers | Number of Pipe/Cigar Smokers |
|-----------|-------------------------------|----------------------------|-----------------------------|
| 20-40 | 20 | 65 | 10 |
| 41-70 | 40 | 25 | 25 |
| 71+ | 60 | 10 | 65 |
| **Total** | | **100** | **100** |

**Without subclassification**, the mortality rate for cigarette smokers is simply the average of the age-specific rates, weighted by the cigarette smokers' own age distribution:

$$\text{Crude rate} = 20 \times \frac{65}{100} + 40 \times \frac{25}{100} + 60 \times \frac{10}{100} = 29 \text{ per 1,000}$$

**With subclassification**, we adjust for the age distribution. If we want to compare cigarette smokers to pipe/cigar smokers using the pipe/cigar smokers' age distribution as the standard:

$$\text{Age-adjusted rate} = 20 \times \frac{10}{100} + 40 \times \frac{25}{100} + 60 \times \frac{65}{100} = 51 \text{ per 1,000}$$

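The age-adjusted rate (51) is nearly twice the crude rate (29). The crude rate makes cigarette smokers' mortality look low partly because they were younger on average. The adjusted rate answers a different question: what would cigarette smokers' mortality be if they had the same age distribution as pipe/cigar smokers? Once we compare like with like (within age groups), cigarette smoking looks considerably more dangerous than the crude rate suggests.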
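The two calculations differ only in which age distribution supplies the weights. A short sketch using the numbers from the table makes this explicit; the helper name `weighted_rate` is hypothetical.

```python
# Age-specific death rates for cigarette smokers and group sizes, from the table
death_rate = {"20-40": 20, "41-70": 40, "71+": 60}
n_cigarette = {"20-40": 65, "41-70": 25, "71+": 10}
n_pipe_cigar = {"20-40": 10, "41-70": 25, "71+": 65}

def weighted_rate(rates, counts):
    """Average the age-specific rates using the age shares implied by counts."""
    total = sum(counts.values())
    return sum(rates[g] * counts[g] / total for g in rates)

crude = weighted_rate(death_rate, n_cigarette)      # own age distribution
adjusted = weighted_rate(death_rate, n_pipe_cigar)  # pipe/cigar age distribution
print(crude, adjusted)
```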

### Covariate Balance

The goal of subclassification is to achieve **covariate balance**—making the distribution of confounders similar across treatment and control groups. When the treated and control groups within each stratum have similar covariate values, we say the groups are **exchangeable** with respect to those covariates.

Balance is the observable counterpart of the conditional independence assumption. If CIA holds and the covariates are balanced, then any remaining difference in outcomes between treated and control units can be attributed to the treatment. Balance on observed covariates cannot by itself confirm CIA, which also rules out unobserved confounders and is not testable.

We can check balance by comparing the means (or distributions) of covariates across treatment groups within each stratum. If the means are similar, we have achieved balance on that covariate.
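A within-stratum balance check of this kind might look like the following sketch. The data are made up, and the function name `balance_table` is hypothetical; it reports the covariate mean in each arm and their difference for every stratum that contains both arms.

```python
from collections import defaultdict
from statistics import mean

def balance_table(x, d, stratum):
    """Return {stratum: (treated mean of x, control mean of x, difference)}."""
    cells = defaultdict(lambda: {0: [], 1: []})  # stratum -> covariate values by arm
    for xi, di, ki in zip(x, d, stratum):
        cells[ki][di].append(xi)
    table = {}
    for k, g in cells.items():
        if g[0] and g[1]:  # skip strata that lack one of the arms
            m1, m0 = mean(g[1]), mean(g[0])
            table[k] = (m1, m0, m1 - m0)
    return table

# Hypothetical ages within two coarse age strata
age = [25, 35, 28, 32, 45, 50, 60, 65]
d = [1, 1, 0, 0, 1, 1, 0, 0]
stratum = ["20-40"] * 4 + ["41-70"] * 4
for k, row in balance_table(age, d, stratum).items():
    print(k, row)
```

In this toy example the first stratum is balanced on age, while the second is not: its treated units are much younger than its controls, which suggests the stratum is too coarse.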

### The Curse of Dimensionality

Subclassification works well when we have one or two discrete covariates. But what happens when we have many covariates, or when covariates are continuous?

This is the **curse of dimensionality**. As the number of covariates grows, the number of strata grows exponentially. With just two binary covariates, we have $2^2 = 4$ strata. With ten binary covariates, we have $2^{10} = 1{,}024$ strata.

The problem is that our sample size doesn't grow with the number of strata. As strata multiply:

- Many strata become **empty** (no observations)
- Many strata have observations of only one type (all treated or all control)
- Even strata with both types may have very few observations, leading to imprecise estimates

When a stratum contains only treated units or only control units, we have a **common support violation**—we cannot estimate the treatment effect in that stratum because we lack a comparison group.
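A small simulation illustrates how quickly strata empty out. The sketch below is an illustration under stated assumptions, not an analysis from the text: it draws a fixed number of units with independent binary covariates and a treatment assigned by coin flip, then counts strata that are empty, contain only one arm, or contain both. The function name `stratum_summary` and the sample size of 500 are arbitrary choices.

```python
import random
from collections import defaultdict

def stratum_summary(n_units, n_covariates, seed=0):
    """Count empty, one-arm, and usable strata for binary covariates."""
    rng = random.Random(seed)
    cells = defaultdict(set)  # covariate profile -> set of arms observed there
    for _ in range(n_units):
        x = tuple(rng.randint(0, 1) for _ in range(n_covariates))
        d = rng.randint(0, 1)  # assumption: treatment is a fair coin flip
        cells[x].add(d)
    total = 2 ** n_covariates
    usable = sum(1 for arms in cells.values() if arms == {0, 1})
    return {
        "total": total,
        "empty": total - len(cells),
        "one_arm": len(cells) - usable,
        "usable": usable,
    }

for k in (2, 5, 10, 15):
    print(k, stratum_summary(n_units=500, n_covariates=k))
```

Because each usable stratum needs at least one treated and one control unit, 500 observations can never fill more than 250 strata, however many strata the covariates define.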

This limitation motivates the method we discuss next, matching, which provides a more flexible approach to achieving covariate balance.

## 3. Exact Matching

**Matching** takes a different approach to the identification problem. Instead of stratifying the data and computing within-stratum means, matching directly imputes the missing counterfactual for each treated unit by finding a similar control unit.

### The Matching Idea

Recall the fundamental problem of causal inference: for each treated unit $i$, we observe $Y_i^1$ but not $Y_i^0$. The treatment effect for unit $i$ is $\delta_i = Y_i^1 - Y_i^0$, but we can only observe half of this difference.

Matching addresses this by finding a control unit $j$ with covariate values identical (or very similar) to unit $i$. Under CIA, this control unit's outcome $Y_j$ serves as a valid estimate of unit $i$'s counterfactual outcome $Y_i^0$.

For **exact matching**, we require that the matched control has exactly the same covariate values:

$$X_{j(i)} = X_i$$

where $j(i)$ denotes the control unit matched to treated unit $i$.

### The Matching Estimator

Once we have found matches, the ATT estimator is simply the average difference between each treated unit's outcome and its match's outcome:

$$\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \left( Y_i - Y_{j(i)} \right)$$

This estimator has an intuitive interpretation: for each treated unit, we estimate its individual treatment effect by comparing its outcome to what a similar untreated unit achieved. We then average these individual effects.

Note that this is an estimator of the ATT, not the ATE, because we are averaging over the treated units only. Each treated unit gets equal weight, and the implicit weighting reflects the distribution of $X$ among the treated.

### Example: Job Training and Earnings

Consider a job training program where we want to estimate the effect on earnings. We have data on trainees (treatment group) and non-trainees (control group), with age as the only covariate.

| Trainee | Age | Earnings | Non-Trainee | Age | Earnings |
|---------|-----|----------|-------------|-----|----------|
| 1 | 18 | \$9,500 | 1 | 20 | \$8,500 |
| 2 | 29 | \$12,250 | 2 | 27 | \$10,075 |
| 3 | 24 | \$11,000 | 3 | 21 | \$8,725 |
| 4 | 27 | \$11,750 | 4 | 39 | \$12,775 |
| 5 | 33 | \$13,250 | 5 | 38 | \$12,550 |
| ... | ... | ... | ... | ... | ... |
| **Mean** | **24.3** | **\$11,075** | | **31.95** | **\$11,101** |

The naive comparison suggests almost no effect: trainees earn slightly *less* on average (\$11,075 vs. \$11,101). But notice that trainees are younger on average (24.3 vs. 31.95 years). Since earnings typically rise with age, this comparison is confounded.

### Creating the Matched Sample

For exact matching on age, we find each trainee's match in the control group:

- Trainee 1 (age 18) → Match: Non-trainee 14 (age 18), earnings \$8,050
- Trainee 2 (age 29) → Match: Non-trainee 6 (age 29), earnings \$10,525
- Trainee 3 (age 24) → Match: Non-trainee 9 (age 24), earnings \$9,400
- And so on...

After matching, the matched control group has the same age distribution as the trainees. The groups are now **balanced** on age, and therefore **exchangeable** with respect to this covariate.

Mean earnings among the matched controls are \$9,380, compared to \$11,075 for trainees. The matching estimate of the ATT is:

$$\hat{\delta}_{ATT} = \$11{,}075 - \$9{,}380 = \$1{,}695$$

Once we compare trainees to non-trainees of the same age, the program appears to increase earnings by nearly \$1,700.
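The matching step can be sketched in code using only the rows the example shows: the first three trainees, their listed matches (non-trainees 14, 6, and 9), and the five non-trainees from the table. Because this is a subset of the sample, the result will not equal the full-sample estimate reported above. The function name `exact_match_att` is hypothetical, and averaging over tied controls of the same age is an assumption made here, not something the text specifies.

```python
def exact_match_att(treated, controls):
    """Exact matching on age; treated and controls are lists of (age, earnings)."""
    by_age = {}
    for age, earnings in controls:
        by_age.setdefault(age, []).append(earnings)
    diffs = []
    for age, earnings in treated:
        if age not in by_age:
            continue  # no exact match: this treated unit is dropped
        matches = by_age[age]
        # Assumption: if several controls share the age, average their outcomes
        diffs.append(earnings - sum(matches) / len(matches))
    return sum(diffs) / len(diffs), len(diffs)

# Trainees 1-3 from the table
trainees = [(18, 9500), (29, 12250), (24, 11000)]
# Non-trainees 14, 6, 9 (the listed matches), then non-trainees 1-5 from the table
non_trainees = [
    (18, 8050), (29, 10525), (24, 9400),
    (20, 8500), (27, 10075), (21, 8725), (39, 12775), (38, 12550),
]

att, n_matched = exact_match_att(trainees, non_trainees)
print(round(att, 2), n_matched)
```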

### Limitations of Exact Matching

Exact matching inherits the curse of dimensionality from subclassification. With multiple covariates, finding exact matches becomes increasingly difficult:

- With continuous covariates (like age measured in days), exact matches may be impossible
- With many discrete covariates, the probability of finding an exact match drops rapidly
- Some treated units may have no exact match in the control group

When exact matches cannot be found, we must either:
1. Drop unmatched treated units (reducing sample size and changing the population the estimate describes, since the result is then the ATT only for treated units that have a match)
2. Use **approximate matching**, which we discuss next

## 4. Approximate Matching

In practice, exact matches are often unavailable. **Approximate matching** relaxes the requirement of identical covariate values, instead matching on units that are "close" in some sense.

### The Need for Distance Metrics

When we cannot find an exact match, we seek the "nearest" control unit. But what does "nearest" mean when comparing multiple covariates?

With a single covariate, distance is straightforward: the distance between ages 25 and 27 is simply 2 years. But with multiple covariates, say age and income, we need a way to combine distances across dimensions. Is someone who differs by 2 years in age and \$1,000 in income "closer" or "farther" than someone who differs by 5 years in age and \$200 in income?

This is where **distance metrics** come in. A distance metric provides a principled way to measure the overall similarity between two units based on their covariate values.

### Distance Metrics

The most common distance metrics are:

**Euclidean Distance**

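The standard "straight line" distance in covariate space, where $X_{ni}$ is the value of covariate $n$ for unit $i$ and $K$ now denotes the number of covariates (not the number of strata, as in the subclassification section):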

$$||X_i - X_j|| = \sqrt{(X_i - X_j)'(X_i - X_j)} = \sqrt{\sum_{n=1}^{K} (X_{ni} - X_{nj})^2}$$

The problem with Euclidean distance is that it treats all covariates equally. If age is measured in years (range 18-65) and income in dollars (range \$0 to \$500,000), income will dominate the distance calculation simply because of its larger scale.

**Normalized Euclidean Distance**

To address the scale problem, we can standardize each covariate by its variance:

$$||X_i - X_j|| = \sqrt{\sum_{n=1}^{K} \frac{(X_{ni} - X_{nj})^2}{\hat{\sigma}_n^2}}$$

where $\hat{\sigma}_n^2$ is the sample variance of covariate $n$. Now a one-standard-deviation difference in any covariate contributes equally to the distance.

**Mahalanobis Distance**

The most sophisticated option accounts for correlations between covariates:

$$||X_i - X_j|| = \sqrt{(X_i - X_j)'\hat{\Sigma}_X^{-1}(X_i - X_j)}$$

where $\hat{\Sigma}_X$ is the sample covariance matrix of the covariates. Mahalanobis distance not only standardizes by variance but also accounts for the correlation structure among covariates.

### Comparison of Distance Metrics

| Metric | Formula | Advantages | Disadvantages |
|--------|---------|------------|---------------|
| **Euclidean** | $\sqrt{\sum (X_{ni} - X_{nj})^2}$ | Simple, intuitive | Scale-dependent |
| **Normalized Euclidean** | $\sqrt{\sum \frac{(X_{ni} - X_{nj})^2}{\hat{\sigma}_n^2}}$ | Scale-invariant | Ignores correlations |
| **Mahalanobis** | $\sqrt{(X_i-X_j)'\hat{\Sigma}_X^{-1}(X_i-X_j)}$ | Scale-invariant, accounts for correlations | Requires invertible covariance matrix |
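The three metrics can be written compactly with NumPy. The sketch below is a minimal illustration, assuming the covariates are stored as rows of an array `X` and that sample variances and covariances use the $n-1$ divisor; the function names and the toy data are hypothetical.

```python
import numpy as np

def euclidean(xi, xj):
    """Plain straight-line distance between two covariate vectors."""
    diff = np.asarray(xi, float) - np.asarray(xj, float)
    return float(np.sqrt(diff @ diff))

def normalized_euclidean(xi, xj, X):
    """Euclidean distance with each squared difference divided by the sample variance."""
    diff = np.asarray(xi, float) - np.asarray(xj, float)
    var = np.var(np.asarray(X, float), axis=0, ddof=1)
    return float(np.sqrt(np.sum(diff ** 2 / var)))

def mahalanobis(xi, xj, X):
    """Distance scaled by the inverse sample covariance matrix of X."""
    diff = np.asarray(xi, float) - np.asarray(xj, float)
    cov = np.cov(np.asarray(X, float), rowvar=False)
    # Solve cov @ z = diff instead of forming the inverse explicitly
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Hypothetical sample: columns are age (years) and income (dollars)
X = np.array([[25, 30000], [27, 31000], [30, 29800], [45, 60000], [50, 52000]])
print(euclidean(X[0], X[3]))
print(normalized_euclidean(X[0], X[3], X))
print(mahalanobis(X[0], X[3], X))
```

The plain Euclidean distance here is driven almost entirely by the income gap, which is the scale problem described above. When the covariates are uncorrelated in the sample, the covariance matrix is diagonal and the Mahalanobis distance coincides with the normalized Euclidean distance.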

### Nearest Neighbor Matching

Once we have chosen a distance metric, **nearest neighbor matching** proceeds as follows:

1. For each treated unit $i$, compute the distance to all control units
2. Select the control unit $j(i)$ with the smallest distance to unit $i$
3. Use $Y_{j(i)}$ as the imputed counterfactual for unit $i$

The ATT estimator remains:

$$\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \left( Y_i - Y_{j(i)} \right)$$

**Matching with multiple neighbors**: Sometimes we may want to use more than one control unit as a match. If we find $M$ nearest neighbors for each treated unit, we average their outcomes:

$$\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \left( Y_i - \frac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} \right)$$

Using multiple matches reduces variance (more data points) but may increase bias (matches are farther away on average).
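Nearest neighbor matching with $M$ neighbors can be sketched as follows. This is an illustration on made-up data rather than a definitive implementation. It assumes matching with replacement (a control may serve as a match for several treated units), uses plain Euclidean distance by default, and breaks ties by list order; the function name `nn_match_att` is hypothetical.

```python
def nn_match_att(treated, controls, m=1, distance=None):
    """ATT from M-nearest-neighbor matching with replacement.

    treated and controls are lists of (x, y), where x is a tuple of covariates.
    """
    if distance is None:
        # Default: Euclidean distance (assumes comparably scaled covariates)
        def distance(a, b):
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    effects = []
    for x_i, y_i in treated:
        # Sort controls by distance to unit i and keep the m closest
        nearest = sorted(controls, key=lambda c: distance(x_i, c[0]))[:m]
        y0_hat = sum(y for _, y in nearest) / len(nearest)  # imputed counterfactual
        effects.append(y_i - y0_hat)
    return sum(effects) / len(effects)

# Hypothetical data: a single covariate (age) and an outcome
treated = [((25,), 100), ((40,), 150)]
controls = [((24,), 90), ((27,), 96), ((38,), 130), ((45,), 140)]

print(nn_match_att(treated, controls, m=1))
print(nn_match_att(treated, controls, m=2))
```

Passing a different `distance` function, for example one based on the Mahalanobis metric, changes which controls count as nearest without changing the rest of the procedure.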

### Matching Discrepancies and Bias

With approximate matching, the matched control unit typically does not have exactly the same covariate values as the treated unit: $X_{j(i)} \neq X_i$. This difference is called the **matching discrepancy**.

Matching discrepancies introduce bias because the control unit's outcome reflects not just what the treated unit would have experienced under control, but also any systematic differences associated with having different covariate values.

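Matching discrepancies tend to shrink as the sample size grows: with more potential controls, we can typically find closer matches. In the limit of an infinite control pool, exact matching becomes feasible and the bias disappears. The shrinkage is slower when there are many continuous covariates (Abadie & Imbens 2006), so in high dimensions the bias can remain material even in fairly large samples.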
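The effect of the control pool size on match quality can be seen in a small simulation. The sketch below is an illustration under stated assumptions: a single covariate (age) drawn uniformly between 18 and 65 for both groups, with nested control pools so that each larger pool contains the smaller ones. The function name `mean_discrepancy` and all numbers are hypothetical.

```python
import random

def mean_discrepancy(treated_x, control_x):
    """Average absolute gap between each treated unit and its nearest control."""
    gaps = [min(abs(xi - xj) for xj in control_x) for xi in treated_x]
    return sum(gaps) / len(gaps)

rng = random.Random(0)
treated_x = [rng.uniform(18, 65) for _ in range(50)]
pool = [rng.uniform(18, 65) for _ in range(5000)]

# Nested pools: the first n controls of the same draw
for n in (10, 100, 1000, 5000):
    print(n, round(mean_discrepancy(treated_x, pool[:n]), 3))
```

Because the pools are nested, adding controls can only keep each treated unit's nearest match the same or bring it closer, so the average discrepancy never increases with the pool size.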

In finite samples, researchers should:
- Examine the quality of matches by comparing covariate distributions before and after matching
- Use covariate balance diagnostics to assess whether matching has successfully balanced the groups
- Consider whether remaining imbalances could materially affect the estimated treatment effect

The practical takeaway is that approximate matching is a valuable tool, but researchers should be transparent about match quality and the potential for residual bias.

## Additional resources

- **Abadie, A. & Imbens, G. (2006)**. Large sample properties of matching estimators for average treatment effects. *Econometrica*, 74(1), 235-267.

- **Abadie, A. & Imbens, G. (2011)**. Bias-corrected matching estimators for average treatment effects. *Journal of Business & Economic Statistics*, 29(1), 1-11.

- **Cochran, W. G. (1968)**. The effectiveness of adjustment by subclassification in removing bias in observational studies. *Biometrics*, 24(2), 295-313.

- **Imbens, G. W. (2004)**. Nonparametric estimation of average treatment effects under exogeneity: A review. *Review of Economics and Statistics*, 86(1), 4-29.

- **Rosenbaum, P. R. (2002)**. *Observational Studies* (2nd ed.). Springer.

- **Rosenbaum, P. R. & Rubin, D. B. (1983)**. The central role of the propensity score in observational studies for causal effects. *Biometrika*, 70(1), 41-55.

- **Stuart, E. A. (2010)**. Matching methods for causal inference: A review and a look forward. *Statistical Science*, 25(1), 1-21.