## Instrumental Variables (IV)

Our final lab illustrates how to conduct instrumental variables (IV) regression. Recall that IV is useful when you have an instrument: a variable that affects your predictor and is connected to the outcome only through that predictor. IV uses the instrument to predict the predictor/treatment variable (first stage) and then uses those predicted values to model the outcome (second stage). This helps because we model only the variation in the predictor that is attributable to the instrument, and the instrument is assumed to be independent of any unmeasured confounders. Let's illustrate how IV regression works using some simulated data.
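
In equation form, the two stages can be sketched as follows. The symbols $Z$ (instrument), $W$ (measured control), and the coefficient names are notation introduced here for illustration, not taken from the lab:

$$X_i = \pi_0 + \pi_1 Z_i + \pi_2 W_i + \nu_i \quad \text{(first stage)}$$

$$Y_i = \beta_0 + \beta_1 \hat{X}_i + \beta_2 W_i + \varepsilon_i \quad \text{(second stage)}$$

Here $\hat{X}_i$ is the fitted value from the first stage, and $\beta_1$ is the effect we want to recover.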

In [None]:
# install packages as necessary and load
# package names must be passed as a single character vector
install.packages(c('tidyverse', 'modelsummary', 'faux', 'ivreg'))

library(tidyverse)
library(modelsummary)
library(ivreg)
library(faux)

Now we can create our simulated data using **{faux}**. Remember, we need an exogenous instrument. We also create two other predictors of $Y$: an unobserved confounder (`x.unob`) and a known, measured confounder (`confounder`).

In [None]:
# draw instrument from uniform distribution from 3 to 10.
inst <- runif(500, min = 3, max = 10)

# create known confounder from normal distribution with 
# mean of 31 and a sd of 7.
confounder <- rnorm(500, 31, 7)

# draw correlated variables xvar1 and x.unob
newvars <- rnorm_multi(n = 500,
                       mu = c(7, 51),
                       sd = c(3, 20),
                       r = .7,
                       varnames = c("xvar1", "x.unob"))

# combine them all into our tibble dataframe
df <- tibble(inst, confounder, newvars)

Next, our predictor needs to draw part of its variation from the instrument and the confounder, so we modify `xvar1` accordingly. Let's also create our final model for $Y$ as a function of our $X$s:

In [4]:
df <- df |> 
        mutate(xvar1 = xvar1 + 
                       3*inst + 
                       .1*confounder + 
                        rnorm(500, 0, .5),
               yvar = -4 + .8*confounder +
                    .56*xvar1 +.35*x.unob + 
                    rnorm(500, 0, 4))

So our "true" effect of `xvar1` should be .56. What happens if we run a model where we correctly identify `confounder` but can't measure `x.unob`? 

In [None]:
m1 <- lm(yvar ~ xvar1  + confounder, data = df)

summary(m1)

Ok, our estimate of the effect is too large. In fact:

In [None]:
pct <- (m1$coefficients["xvar1"] - .56) / .56 * 100 

cat("OLS estimate is", round(pct, 1), "% too big.")

Why do you think our model is overestimating the effect of `xvar1`?
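
One way to probe this question in a simulation is to do what real data never allows: add the unobserved confounder to the model and see whether the bias shrinks. The sketch below is self-contained and rests on assumptions that are not part of the lab: it rebuilds a similar data set in base R (constructing the correlated pair by hand rather than with `rnorm_multi()`), stores it in a hypothetical data frame called `sim`, and uses an arbitrary seed.

```r
# Hypothetical re-creation of the lab's data in base R; `sim`, the seed,
# and the hand-built correlation are illustrative assumptions.
set.seed(2024)
n <- 500
inst       <- runif(n, min = 3, max = 10)
confounder <- rnorm(n, 31, 7)

# mix two standard normal draws so the pair correlates at about .7
z1 <- rnorm(n)
z2 <- rnorm(n)
x_base <- 7 + 3 * z1
x.unob <- 51 + 20 * (0.7 * z1 + sqrt(1 - 0.7^2) * z2)

xvar1 <- x_base + 3 * inst + 0.1 * confounder + rnorm(n, 0, 0.5)
yvar  <- -4 + 0.8 * confounder + 0.56 * xvar1 + 0.35 * x.unob + rnorm(n, 0, 4)
sim   <- data.frame(inst, confounder, x.unob, xvar1, yvar)

# short model omits x.unob; long model includes it
short <- lm(yvar ~ xvar1 + confounder, data = sim)
long  <- lm(yvar ~ xvar1 + confounder + x.unob, data = sim)

# compare the two estimates of the xvar1 effect
round(c(short = unname(coef(short)["xvar1"]),
        long  = unname(coef(long)["xvar1"])), 3)
```

Because `x.unob` is positively correlated with `xvar1` and also raises `yvar`, the short model should credit part of its effect to `xvar1`, while the long model's coefficient should land much closer to .56.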

Ok, now let's conduct our IV regression. We'll do so in two ways. First, we will run our two-stage models by hand to illustrate what is happening. Note that these models will have incorrect standard errors in the second stage. Second, we'll run the models using 2SLS through the **{ivreg}** package.

In [None]:
# run first stage model, where xvar1 is 
# regressed on instrument and confounder
m2_first <- lm(xvar1 ~ inst + confounder, data = df)

# check the reduced form. Is inst related to yvar?
m2_reduced <- lm(yvar ~ inst + confounder, data = df)
summary(m2_reduced)

# generate predicted values of xvar1 based on inst and confounder
df$xvar1_pred <- m2_first$fitted.values

# regress y on predicted values and confounder
m2_second <- lm(yvar ~ xvar1_pred + confounder, data = df)
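
With one instrument and one endogenous predictor, the by-hand steps above tie together in a simple way: the second-stage coefficient equals the reduced-form coefficient on the instrument divided by the first-stage coefficient on the instrument. The sketch below checks this on a stand-alone data set. The data frame `sim`, the seed, and the base R construction of the correlated variables are assumptions made for illustration; they mimic the lab's setup rather than reuse `df`.

```r
# Hypothetical stand-alone data mimicking the lab's simulation (base R only).
set.seed(2024)
n <- 500
inst       <- runif(n, min = 3, max = 10)
confounder <- rnorm(n, 31, 7)
z1 <- rnorm(n)
z2 <- rnorm(n)
x.unob <- 51 + 20 * (0.7 * z1 + sqrt(1 - 0.7^2) * z2)
xvar1  <- 7 + 3 * z1 + 3 * inst + 0.1 * confounder + rnorm(n, 0, 0.5)
yvar   <- -4 + 0.8 * confounder + 0.56 * xvar1 + 0.35 * x.unob + rnorm(n, 0, 4)
sim    <- data.frame(inst, confounder, xvar1, yvar)

# first stage, reduced form, and by-hand second stage
first   <- lm(xvar1 ~ inst + confounder, data = sim)
reduced <- lm(yvar ~ inst + confounder, data = sim)
sim$xvar1_pred <- fitted(first)
second  <- lm(yvar ~ xvar1_pred + confounder, data = sim)

# reduced-form effect of inst divided by first-stage effect of inst
ratio     <- unname(coef(reduced)["inst"] / coef(first)["inst"])
two_stage <- unname(coef(second)["xvar1_pred"])

c(ratio = ratio, two_stage = two_stage)
```

The two numbers should agree up to floating-point error, which is one way to see that the IV estimate rescales the instrument's effect on the outcome by its effect on the predictor.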

Now, `ivreg` does these steps in one function call:

In [12]:
# in the ivreg formula, the second-stage (outcome) model comes first
# and the instruments appear after the |
# the endogenous predictor (xvar1) appears only before the |,
# the instrument (inst) appears only after it, and the
# control variables must appear on both sides.
# note that we are still omitting `x.unob` because
# we do not know to include it or perhaps can't
# measure it.
m3 <- ivreg(yvar ~ xvar1 + confounder | inst + confounder, data = df)

Now let's see how we did:

In [None]:
modelsummary(list(m1, m2_second, m3), 
            coef_rename = c("xvar1_pred" = "Predictor", 
            "xvar1" = "Predictor", "confounder" = "Confounder"),
            stars = TRUE,                   
            estimate = "{estimate}{stars}", 
            statistic = "({std.error})",
            gof_map = c("nobs", "r.squared", "rmse"))

How are we doing now? Did our instrumental variables design get us closer to the actual effect of .56 in Models 2 (IV done by hand) and 3 (2SLS)?
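
Models 2 and 3 share the same point estimate but not the same standard errors. The by-hand second stage computes its residuals from the predicted `xvar1`, whereas 2SLS computes them from the observed `xvar1`. The sketch below applies that correction manually. It is a minimal illustration under stated assumptions: the data frame `sim`, the seed, and the base R data construction are hypothetical stand-ins for the lab's `df`, and the degrees-of-freedom choice (n minus the number of coefficients) is one common convention.

```r
# Hypothetical stand-alone data mimicking the lab's simulation (base R only).
set.seed(2024)
n <- 500
inst       <- runif(n, min = 3, max = 10)
confounder <- rnorm(n, 31, 7)
z1 <- rnorm(n)
z2 <- rnorm(n)
x.unob <- 51 + 20 * (0.7 * z1 + sqrt(1 - 0.7^2) * z2)
xvar1  <- 7 + 3 * z1 + 3 * inst + 0.1 * confounder + rnorm(n, 0, 0.5)
yvar   <- -4 + 0.8 * confounder + 0.56 * xvar1 + 0.35 * x.unob + rnorm(n, 0, 4)
sim    <- data.frame(inst, confounder, xvar1, yvar)

# two stages by hand
first  <- lm(xvar1 ~ inst + confounder, data = sim)
sim$xvar1_pred <- fitted(first)
second <- lm(yvar ~ xvar1_pred + confounder, data = sim)

b     <- coef(second)
X_hat <- model.matrix(second)                  # design matrix with predicted xvar1
X_act <- cbind(1, sim$xvar1, sim$confounder)   # same columns with observed xvar1

# residuals from the second-stage coefficients applied to the observed predictor
res_iv <- sim$yvar - drop(X_act %*% b)
sigma2 <- sum(res_iv^2) / (n - length(b))

se_naive     <- sqrt(diag(vcov(second)))
se_corrected <- sqrt(diag(sigma2 * solve(crossprod(X_hat))))

round(rbind(se_naive, se_corrected), 4)
```

Both rows use the same design matrix and differ only in the residual variance, so the corrected standard errors are a constant multiple of the naive ones.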