# Matching



Matching methods are designed to generate estimates of the average treatment effect (ATE) or, much more commonly, the average treatment effect on the treated (ATT) by balancing, pruning, and weighting your data. Matching methods are data-greedy; they work best when you have lots of observations to choose from. Typically, treated observations (those participating in the policy) are matched with untreated observations. The untreated observations thus provide the counterfactual: what would have happened to our treated observations had they not been treated. Or that, at least, is the idea.
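
For reference, the two estimands can be written in potential-outcomes notation. The notation is an assumption introduced here: $Y_i(1)$ and $Y_i(0)$ are unit $i$'s outcomes with and without treatment, and $D_i$ is treatment status.

$$
\text{ATE} = E[Y_i(1) - Y_i(0)], \qquad \text{ATT} = E[Y_i(1) - Y_i(0) \mid D_i = 1]
$$

Matching treated units to untreated ones targets the ATT because the matched controls stand in for $E[Y_i(0) \mid D_i = 1]$, the one piece that is never observed.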

Let's install necessary packages (if needed) and load them: 

In [None]:
install.packages(c(
   'tidyverse',    # for data wrangling
   'faux',         # for creating some fake data
   'modelsummary', # for regression tables
   'MatchIt'       # for matching
))

library(tidyverse)
library(modelsummary)
library(faux)
library(MatchIt)

## Setup

Let's create some new data using the `faux` package, which lets us draw random variables that are correlated with each other. The `rnorm_multi()` call below draws two variables with different scales from a multivariate normal distribution, positively correlated at $r=.40$. 
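
For intuition about what such a draw involves, here is a base-R sketch that builds a covariance matrix from the standard deviations and the correlation and then transforms independent normals with its Cholesky factor. This is not how `faux` is implemented; the seed and the object names (`Sigma`, `sim`) are assumptions made for the illustration.

```r
# Base-R sketch of drawing two correlated normal variables.
# Illustration only; not the implementation used by faux::rnorm_multi().
set.seed(42)                      # hypothetical seed, for reproducibility only

n  <- 1000
mu <- c(7, 51)                    # means of the two variables
s  <- c(3, 20)                    # standard deviations
r  <- 0.40                        # target correlation

# Build the covariance matrix from the SDs and the correlation
R     <- matrix(c(1, r, r, 1), nrow = 2)
Sigma <- diag(s) %*% R %*% diag(s)

# Multiply independent standard normals by the Cholesky factor of Sigma,
# then shift each column by its mean
z <- matrix(rnorm(n * 2), ncol = 2)
x <- sweep(z %*% chol(Sigma), 2, mu, "+")
colnames(x) <- c("xvar1", "xvar2")

sim <- as.data.frame(x)
round(colMeans(sim), 1)               # sample means, close to 7 and 51
round(cor(sim$xvar1, sim$xvar2), 2)   # sample correlation, close to 0.40
```

Because these are random draws, the sample means and correlation will be near, not exactly at, the targets.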

In [None]:
df1 <- rnorm_multi( n = 1000,
                    mu = c(7, 51),  # the means of the two vars
                    sd = c(3, 20),  # the standard deviations
                    r = .40,        # the correlation between the vars
                    varnames = c("xvar1", 
                                 "xvar2"))

Now we can create a treatment variable. Don't worry about the details here, but we are drawing a treatment variable, `treat`, from the binomial distribution, where the probability that an observation is treated (meaning: equals 1 rather than 0) is a function of our two conditioning variables, `xvar1` and `xvar2`, plus some random noise.

In [None]:
df1$treat <- rbinom(n = 1000, 
                    size = 1, 
                    prob = plogis(-16 + 1.2*df1$xvar1 + .08*df1$xvar2 + 
                                    rnorm(1000, 0, 3)))
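
To get a feel for the treatment equation, it can help to evaluate the inverse logit at a couple of covariate values. The sketch below ignores the extra `rnorm()` noise term; the helper `lin_pred()` and the chosen covariate values (the variable means, and one standard deviation above them) are assumptions made for illustration.

```r
# plogis() is the inverse logit: plogis(x) = 1 / (1 + exp(-x))
lin_pred <- function(xvar1, xvar2) -16 + 1.2 * xvar1 + 0.08 * xvar2

p_mean <- plogis(lin_pred(7, 51))    # at the means of both covariates
p_high <- plogis(lin_pred(10, 71))   # one SD above the mean on both

round(c(at_means = p_mean, one_sd_above = p_high), 3)
```

Without the noise term, the treatment probability is about 0.03 at the covariate means and about 0.84 one standard deviation above them, so treatment is concentrated among observations with high `xvar1` and `xvar2`.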

Now we can create our outcome variable `yvar` as a linear function of our two conditioning variables and our treatment.

In [None]:
df1$yvar <- -4 - 6.1*df1$treat + 
                4*df1$xvar1 + 
                .35*df1$xvar2 +
                rnorm(1000, 0, 5)

We clearly have selection bias: our *X*s affect both program participation (treatment) and the outcome. What happens if we simply examine the difference in outcomes based on observed treatment status?

In [None]:
df1 |>  group_by(treat) |> 
        summarize("mean of y" = mean(yvar)) |>
        pivot_wider(names_from = treat,
                   values_from = `mean of y`) |>
        mutate("diff in means" = `1` - `0`)

Yikes. That isn't good at all. Recall that above, the ATE for `treat` is **-6.1**. We're *way* off in our estimate and the estimate is in the wrong direction. Why?
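
One way to see why, sketched in potential-outcomes notation (an assumption introduced here: $Y_i(1)$ and $Y_i(0)$ are the treated and untreated potential outcomes and $D_i$ is `treat`), is that the raw difference in means equals the ATT plus a selection-bias term:

$$
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = \underbrace{E[Y_i(1) - Y_i(0) \mid D_i = 1]}_{\text{ATT}} + \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{selection bias}}
$$

In these data, treated units tend to have higher `xvar1` and `xvar2`, and both raise `yvar`, so the selection-bias term is positive and large enough to swamp the true effect of $-6.1$.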

One thing we could do is use regression to condition on our confounding variables. Models `m1` through `m3` below all suffer from omitted variable bias; only `m4` conditions on both confounders. 

In [None]:
m1 <- lm(yvar ~ treat, data = df1)
m2 <- lm(yvar ~ treat + xvar1, data = df1)
m3 <- lm(yvar ~ treat + xvar2, data = df1)
m4 <- lm(yvar ~ treat + xvar1 + xvar2, data = df1)

modelsummary(list(m1, m2, m3, m4), output = "jupyter")

## MatchIt


The `MatchIt` package offers a number of matching methods. Using these methods is (at least) a two-step process. First, use `matchit()` to generate a matchit object. This is the function in which you describe the covariates that drive treatment status, the matching method to use, the number of matches, and the distance measure to use with approximate matching methods. 
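
As a rough picture of what one such configuration does (1:1 nearest-neighbor matching on a propensity score from a logistic regression), here is a hand-rolled base-R sketch. It is a simplification, not MatchIt's algorithm, and the data frame `toy`, its columns, and the seed are hypothetical stand-ins.

```r
# Simplified 1:1 nearest-neighbor matching on an estimated propensity score.
# Illustration only: MatchIt's own algorithm has more options and safeguards.
set.seed(1)                                   # hypothetical seed

# Small stand-in data set (hypothetical; not the df1 used in this notebook)
n   <- 200
toy <- data.frame(x = rnorm(n))
toy$treat <- rbinom(n, 1, plogis(-1 + toy$x))

# Step 1: estimate the propensity score with a logistic regression
toy$ps <- fitted(glm(treat ~ x, data = toy, family = binomial))

# Step 2: for each treated unit, take the closest control not yet used
treated  <- which(toy$treat == 1)
controls <- which(toy$treat == 0)
pairs    <- data.frame(treated = integer(0), control = integer(0))

for (i in treated) {
  if (length(controls) == 0) break            # ran out of controls
  j <- controls[which.min(abs(toy$ps[controls] - toy$ps[i]))]
  pairs    <- rbind(pairs, data.frame(treated = i, control = j))
  controls <- setdiff(controls, j)            # match without replacement
}

# Step 3: keep only matched units (the "pruned" sample)
matched <- toy[c(pairs$treated, pairs$control), ]
table(matched$treat)
```

Unmatched controls are dropped, which is why the matched sample has equal numbers of treated and control units.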

The `summary()` function will display balance statistics.

In [None]:
m.obj <- matchit(treat ~ xvar1 + xvar2, 
                    data = df1, 
                    method = "nearest", # try "cem" with the cutpoints argument 
                    ratio = 1,          # try 2
                    distance = "glm")   # try "mahalanobis"
summary(m.obj)
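
One of the balance statistics in that output is the standardized mean difference. Below is a minimal sketch of the quantity, assuming the treated-group standard deviation is used as the scaling factor. The helper `smd()` and the toy vectors are hypothetical, and MatchIt's exact computation may differ in details.

```r
# Standardized mean difference for one covariate:
# (mean among treated - mean among controls) / SD among treated
smd <- function(x, treat) {
  (mean(x[treat == 1]) - mean(x[treat == 0])) / sd(x[treat == 1])
}

# Toy data (hypothetical): treated units have larger x on average
x     <- c(2, 4, 6, 8, 1, 2, 3, 4)
treat <- c(1, 1, 1, 1, 0, 0, 0, 0)

smd(x, treat)
```

Values near zero after matching indicate that the treated and control groups look similar on that covariate.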

The `match.data()` function lets you write the matched observations (with weights and pair/group information) to a new data frame for analysis.

Let's see if we get a more accurate estimate of the effect of the treatment on the outcome using a simple difference in means with the "pruned" matched sample. 

In [None]:
dfmatched <- match.data(m.obj)

dfmatched |>    group_by(treat) |> 
                summarize("mean of y" = mean(yvar)) |>
                pivot_wider(    names_from = treat,
                                values_from = `mean of y`) |>
                mutate("diff in means" = `1` - `0`)

We can combine matching with regression methods to condition on other potential determinants of $Y$ using our matched sample. 

In [None]:
matched1 <- lm(yvar ~ treat, data = dfmatched)
matched2 <- lm(yvar ~ treat + xvar1, data = dfmatched)
matched3 <- lm(yvar ~ treat + xvar2, data = dfmatched)
matched4 <- lm(yvar ~ treat + xvar1 + xvar2, data = dfmatched)

modelsummary(list(m1, m2, m3, m4, 
                matched1, matched2, matched3, matched4),
                output = "jupyter")