# Instrumental variables estimation

Bioinformatics reading group

Yi Liu, 07 February 2020

Outline

- Estimation of linear models under omitted variable bias: IV, 2SLS
- Randomised experiments: LATE and treatment heterogeneity
- Tests

Note: notation roughly follows the conventions in Wooldridge. Linear models only; GMM is not covered.

# OLS, Omitted Variable Bias, and IV

## OLS and BLUE

OLS estimator

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_k x_k + u$$

$$\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{u}$$

$$\mathbf{\hat{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
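A minimal NumPy sketch of the matrix formula, on simulated data. The coefficients, sample size and variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Simulated data (assumed for illustration): y = 1 + 2*x1 - 0.5*x2 + u
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
u = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + u

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x1, x2])

# beta_hat = (X'X)^{-1} X'y, solved as a linear system rather than
# by forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_hat)
print(beta_lstsq)
```

Both routes should give the same coefficients up to floating-point error; `lstsq` is usually the numerically safer choice when columns of $\mathbf{X}$ are close to collinear.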

OLS assumptions

1. Linearity in params: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + u$
2. Random sampling
3. No perfect collinearity
4. Zero conditional mean: $E(u|x_1, x_2, ..., x_k) = 0$
5. Homoskedasticity: $Var(u|x_1, ..., x_k) = \sigma^2$

Under the Gauss-Markov assumptions above, the OLS estimator is the best linear unbiased estimator (BLUE)

- Best: having smallest variance
- Unbiased: $E(\hat{\beta_j}) = \beta_j$

**Always BLUE**?

![](https://i.redd.it/yuachr4ncj341.gif)

## Omitted variable bias

![](assets/omitted-variable-bias.png)

$$log(wage) = \beta_0 + \beta_1 educ + \beta_2 abil + e$$

$$log(wage) = \beta_0 + \beta_1 educ + u$$

$$Cov(educ, u) \neq 0$$

OLS $\hat{\beta_1}$ is biased and inconsistent.
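The size of the bias can be illustrated with a small simulation. This is a sketch under assumed values: the coefficients `beta1`, `beta2` and the way `educ` depends on `abil` are made up for illustration, not taken from real wage data. It also checks the textbook decomposition, short-regression slope = long-regression slope + $\hat{\beta}_2 \cdot Cov(educ, abil) / Var(educ)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta1, beta2 = 0.08, 0.10          # assumed "true" returns to educ and abil

abil = rng.normal(size=n)
educ = 12.0 + 2.0 * abil + rng.normal(size=n)   # educ is correlated with abil
log_wage = 1.0 + beta1 * educ + beta2 * abil + rng.normal(scale=0.3, size=n)


def ols(y, *regressors):
    """OLS coefficients (intercept first) of y on the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]


b_long = ols(log_wage, educ, abil)   # abil included
b_short = ols(log_wage, educ)        # abil omitted, so it ends up in u

# Slope from regressing the omitted variable on the included one
delta = np.cov(educ, abil)[0, 1] / np.var(educ, ddof=1)
predicted_short = b_long[1] + b_long[2] * delta

print("long  slope on educ:", b_long[1])
print("short slope on educ:", b_short[1])
print("long slope + beta2_hat * delta:", predicted_short)
```

With these assumed values the population bias is $0.10 \times 2/5 = 0.04$, so the short regression should report a return to education near 0.12 rather than 0.08.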

## IV

![](assets/instrumental-variables.png)

- Instrument exogeneity: $Cov(z, u) = 0$
- Instrument relevance: $Cov(z, x) \neq 0$

$$\beta_1 = \frac{Cov(z, y)}{Cov(z, x)}$$
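A short derivation of this ratio, sketched for the simple model $y = \beta_0 + \beta_1 x + u$ (the single-regressor form is assumed here for illustration). Take the covariance of both sides with $z$:

$$Cov(z, y) = Cov(z, \beta_0 + \beta_1 x + u) = \beta_1 Cov(z, x) + Cov(z, u) = \beta_1 Cov(z, x)$$

Exogeneity removes the $Cov(z, u)$ term, and relevance makes it legitimate to divide by $Cov(z, x)$.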

$$\hat{\beta_1} = \frac{\Sigma(z_i - \bar{z})(y_i - \bar{y})}{\Sigma(z_i - \bar{z})(x_i - \bar{x})}$$

Asymptotic (large-sample) consistency: $plim(\hat{\beta_1}) = \beta_1$. In small samples, IV and 2SLS estimators are biased.
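A simulated comparison of OLS and the simple IV estimator, as a hedged sketch. The data-generating process (an unobserved confounder `c` that enters both `x` and `u`) and all coefficients are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
beta1 = 2.0                                   # assumed true effect of x on y

z = rng.normal(size=n)                        # instrument
c = rng.normal(size=n)                        # unobserved confounder
x = 0.5 * z + 0.8 * c + rng.normal(size=n)    # relevance: Cov(z, x) != 0
u = c + rng.normal(size=n)                    # exogeneity: Cov(z, u) = 0, but Cov(x, u) != 0
y = 1.0 + beta1 * x + u


def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]


beta_ols = cov(x, y) / cov(x, x)   # inconsistent because x is correlated with u
beta_iv = cov(z, y) / cov(z, x)    # sample analogue of Cov(z, y) / Cov(z, x)

print("OLS:", beta_ols)
print("IV: ", beta_iv)
```

In this setup OLS should land well above 2 (the confounder pushes `x` and `y` in the same direction), while the IV estimate should be close to 2 at this sample size.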

## Two stage least squares (2SLS)

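When multiple instruments are available (here $z_2$ and $z_3$) and the 2SLS assumptions (Wooldridge) hold, the best IV is the linear combination of the exogenous variables that is most highly correlated with the endogenous regressor, i.e. the one giving the smallest asymptotic variance. Adding valid instruments improves the asymptotic efficiency of the 2SLS estimator, although many or weak instruments can worsen its finite-sample bias.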

$$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + u_1$$

$$y^{*}_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \pi_3 z_3$$

Stage 1 OLS: $$\hat{y}_2 = \hat{\pi}_0 + \hat{\pi}_1 z_1 + \hat{\pi}_2 z_2 + \hat{\pi}_3 z_3$$

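Stage 2 OLS[^]: $$y_1 = \beta_0 + \beta_1 y^{*}_2 + \beta_2 z_1 + u_1 + \beta_1 v_2$$

where $y_2 = y^{*}_2 + v_2$ and $E(u_1 + \beta_1 v_2 | y^{*}_2, z_1) = 0$. In practice $y^{*}_2$ is replaced by the stage 1 fitted values $\hat{y}_2$.

[^] "Two stages" is just a metaphor. You should not run the two regressions by hand in a naive way. *Why?*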

### Issues with 2SLS manually

1. The standard errors reported by a manual stage 2 OLS are not correct.  
They are based on $Var(u_1 + \beta_1 v_2)$, whereas the correct residual variance is $Var(u_1)$.

2. It is easy to forget to include the same covariates in both stages.

3. "Forbidden regressions" involving non-continuous / non-linear covariates (check Angrist and Pischke).
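The first issue can be made concrete with a small simulation of the model above ($y_1$ on $y_2$ and $z_1$, with $z_2$, $z_3$ as instruments). This is a sketch, not a replacement for a tested IV routine: the coefficients and the confounder `c` are assumed for illustration, and the standard errors are the classical homoskedastic ones.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

z1, z2, z3 = rng.normal(size=(3, n))
c = rng.normal(size=n)                                   # unobserved confounder
y2 = 0.5 * z1 + 0.4 * z2 + 0.4 * z3 + c + rng.normal(size=n)
u1 = c + rng.normal(size=n)
y1 = 1.0 + 2.0 * y2 + 0.5 * z1 + u1                      # assumed beta1 = 2

ones = np.ones(n)
X = np.column_stack([ones, y2, z1])        # structural regressors
Z = np.column_stack([ones, z1, z2, z3])    # all exogenous variables

# Stage 1: project every column of X on Z (the constant and z1 are reproduced exactly)
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]

# Stage 2: OLS of y1 on the fitted regressors gives the 2SLS coefficients
beta = np.linalg.lstsq(X_hat, y1, rcond=None)[0]

bread = np.linalg.inv(X_hat.T @ X_hat)
k = X.shape[1]

# Correct residuals use the ORIGINAL y2; manual stage 2 OLS uses the fitted y2
resid_correct = y1 - X @ beta
resid_naive = y1 - X_hat @ beta

se_correct = np.sqrt(resid_correct @ resid_correct / (n - k) * np.diag(bread))
se_naive = np.sqrt(resid_naive @ resid_naive / (n - k) * np.diag(bread))

print("2SLS beta1:", beta[1])
print("SE (correct):", se_correct[1])
print("SE (manual stage 2):", se_naive[1])
```

The point estimates are identical either way; only the residual variance, and therefore the standard errors, differ. In this particular simulation the manual version overstates the standard error, but the direction is not guaranteed in general.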

# LATE

![](https://pics.me.me/setup-the-wizard-wil-now-install-your-software-next-cancel-7994288.png)

# Local average treatment effect (LATE)

Treatment effect for the subgroup of compliers.

- $A_i$: (binary) treatment status for $i$
  - $A_i(1)$: potential treatment status if assigned $Z_i = 1$
  - $A_i(0)$: potential treatment status if assigned $Z_i = 0$
- $Z_i$: (binary) IV for $A_i$, randomised treatment assignment
- $Y_i(a, z)$: potential outcome for $i$ when $A_i = a$, $Z_i = z$

Assumptions:

- Independence, randomised IV: $[\{Y_i(a, z), \forall a, z\}, A_i (1), A_i (0)] \perp Z_i$
- Exclusion restriction: $Y_i(a, 1) = Y_i(a, 0) \equiv Y_i(a)$ for $a \in \{0, 1\}$
- First stage: $E[A_i (1) - A_i (0)] \neq 0$
- **Monotonicity**, i.e. no *defiers*: $A_i(1) - A_i(0) \geq 0$

Compliance types (monotonicity rules out the defiers row)

| Name | $A_i$($Z_i$=1) | $A_i$($Z_i$=0) |
| --- | --- | --- |
| Always takers | 1 | 1 |
| Never takers | 0 | 0 |
| Compliers | 1 | 0 |
| Defiers | 0 | 1 |


Under these assumptions, the IV (Wald) estimand is *precisely* the LATE

$$\frac{E[Y_i|Z_i = 1] - E[Y_i|Z_i = 0]}{E[A_i|Z_i = 1] - E[A_i|Z_i = 0]} = E[Y_i(1) - Y_i(0)|A_i(1) > A_i(0)]$$
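A simulation sketch of this identity with heterogeneous treatment effects. The shares of each compliance type and their effect sizes are assumptions chosen so that the LATE (1.0) differs visibly from the average treatment effect (1.25); there are no defiers by construction.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Compliance types: 0 = never taker, 1 = always taker, 2 = complier
types = rng.choice(3, size=n, p=[0.3, 0.2, 0.5])
Z = rng.integers(0, 2, size=n)            # randomised assignment

A0 = (types == 1).astype(int)             # A_i(0): only always takers are treated
A1 = (types >= 1).astype(int)             # A_i(1): always takers and compliers
A = np.where(Z == 1, A1, A0)              # observed treatment status

# Heterogeneous treatment effects by type (assumed values)
effect = np.select([types == 0, types == 1, types == 2], [0.5, 3.0, 1.0])
Y0 = rng.normal(size=n) + 0.5 * (types == 1)   # baseline outcome differs by type
Y = Y0 + effect * A                            # Z affects Y only through A

wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (A[Z == 1].mean() - A[Z == 0].mean())
late_true = effect[types == 2].mean()     # effect among compliers
ate_true = effect.mean()                  # effect averaged over everyone

print("Wald / IV estimate:", wald)
print("LATE (compliers):  ", late_true)
print("ATE (population):  ", ate_true)
```

The Wald ratio should track the complier effect rather than the population average, which is the sense in which IV identifies a *local* effect under treatment heterogeneity.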

LATE under 2SLS (Angrist & Pischke)

With multiple instruments, each instrument yields its own LATE. The 2SLS estimate is a weighted average of these instrument-specific LATEs, with weights determined by the relative strength of each instrument in the first stage.

# Tests

## Test for weak IVs

Intuition

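- $F$-test of the joint significance of the excluded instruments in the first stage. A small $F$ statistic signals weak instruments.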
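A minimal sketch of the first-stage $F$ statistic for the excluded instruments, computed from restricted and unrestricted sums of squared residuals. The helper names (`ssr`, `first_stage_f`) and the simulated coefficients are hypothetical; the threshold of 10 is the commonly cited rule of thumb, not a formal critical value.

```python
import numpy as np


def ssr(y, X):
    """Sum of squared OLS residuals of y on X."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return resid @ resid


def first_stage_f(y2, exog, instruments):
    """F statistic for H0: all excluded instruments have zero first-stage coefficients."""
    n = len(y2)
    X_r = np.column_stack([np.ones(n), exog])    # restricted: included exogenous only
    X_u = np.column_stack([X_r, instruments])    # unrestricted: plus excluded instruments
    q = X_u.shape[1] - X_r.shape[1]              # number of excluded instruments
    ssr_r, ssr_u = ssr(y2, X_r), ssr(y2, X_u)
    return ((ssr_r - ssr_u) / q) / (ssr_u / (n - X_u.shape[1]))


rng = np.random.default_rng(5)
n = 2_000
z1, z2, z3 = rng.normal(size=(3, n))
noise = rng.normal(size=n)
excluded = np.column_stack([z2, z3])

y2_strong = 0.5 * z1 + 0.4 * z2 + 0.4 * z3 + noise
y2_weak = 0.5 * z1 + 0.01 * z2 + 0.01 * z3 + noise

f_strong = first_stage_f(y2_strong, z1, excluded)
f_weak = first_stage_f(y2_weak, z1, excluded)

print("F (strong instruments):", f_strong)
print("F (weak instruments):  ", f_weak)
```

Note that the test is on $z_2$ and $z_3$ only: $z_1$ also appears in the structural equation, so its first-stage coefficient says nothing about instrument strength.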

## Test for endogeneity of explanatory variables

Intuition

- When the explanatory variable is exogenous, it serves as its own instrument, and 2SLS is equivalent to OLS.
- If the coefficient estimates from OLS and 2SLS differ statistically significantly, this is evidence that the explanatory variable of interest is endogenous.
- Hausman test
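One common way to run this comparison is the regression-based version of the test: save the first-stage residuals $\hat{v}_2$, add them to the structural equation, and look at the $t$ statistic on $\hat{v}_2$. The sketch below assumes a single instrument and classical standard errors; the function names and simulated values are hypothetical.

```python
import numpy as np


def ols(y, X):
    """Return OLS coefficients, classical standard errors, and residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se, resid


def endogeneity_t_stat(y1, y2, z):
    """t statistic on the first-stage residual added to the structural equation."""
    ones = np.ones(len(y1))
    _, _, v_hat = ols(y2, np.column_stack([ones, z]))          # first stage
    beta, se, _ = ols(y1, np.column_stack([ones, y2, v_hat]))  # augmented regression
    return beta[2] / se[2]


rng = np.random.default_rng(6)
n = 5_000
z = rng.normal(size=n)
c = rng.normal(size=n)                  # unobserved confounder
e1 = rng.normal(size=n)
e2 = rng.normal(size=n)

y2 = 0.7 * z + c + e2
y1_endog = 1.0 + 2.0 * y2 + c + e1      # error contains c, so y2 is endogenous
y1_exog = 1.0 + 2.0 * y2 + e1           # error is independent of y2

t_endog = endogeneity_t_stat(y1_endog, y2, z)
t_exog = endogeneity_t_stat(y1_exog, y2, z)

print("t on v_hat, endogenous case:", t_endog)
print("t on v_hat, exogenous case: ", t_exog)
```

A large $|t|$ rejects exogeneity of $y_2$; in the exogenous case the statistic should behave roughly like a standard normal draw.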

## Test for exclusion restrictions

When there is only one instrument, the exclusion restriction cannot be tested; it has to be justified by domain knowledge.

We need multiple instruments and the *overidentification* of the model (more relevant IVs than endogenous variables) to be able to test the exclusion restrictions.

- Outcome: $y_1$
- Endogenous explanatory variable: $y_2$
- Exogenous explanatory variables: $z_1$, $z_2$
- IVs for $y_2$: $z_3$, $z_4$

Intuition

- $\check{\beta_1}$: IV estimate of $\beta_1$ using $z_3$.
- $\tilde{\beta_1}$: IV estimate of $\beta_1$ using $z_4$.
- If $\check{\beta_1}$ and $\tilde{\beta_1}$ are statistically significantly different, then either $z_3$, or $z_4$, or *both* fail the exclusion restriction.
- In practice this is done using the 2SLS residuals (Sargan test).
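A sketch of the Sargan statistic for the setup listed above ($z_1$, $z_2$ exogenous regressors; $z_3$, $z_4$ instruments for $y_2$): regress the 2SLS residuals on all exogenous variables and compute $n R^2$, which is compared with a $\chi^2$ distribution whose degrees of freedom equal the number of overidentifying restrictions (here 1). The helper names and the simulated violation (a direct effect of $z_4$ on $y_1$) are assumptions for illustration, and the statistic assumes homoskedastic errors.

```python
import numpy as np


def two_sls_resid(y1, X, Z):
    """2SLS residuals, computed with the original (not fitted) regressors."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    beta = np.linalg.lstsq(X_hat, y1, rcond=None)[0]
    return y1 - X @ beta


def sargan_stat(y1, X, Z):
    """n * R^2 from regressing the 2SLS residuals on all exogenous variables."""
    u_hat = two_sls_resid(y1, X, Z)
    fitted = Z @ np.linalg.lstsq(Z, u_hat, rcond=None)[0]
    r2 = 1.0 - np.sum((u_hat - fitted) ** 2) / np.sum((u_hat - u_hat.mean()) ** 2)
    return len(y1) * r2


rng = np.random.default_rng(7)
n = 5_000
z1, z2, z3, z4 = rng.normal(size=(4, n))
c = rng.normal(size=n)                                   # unobserved confounder
y2 = 0.3 * z1 + 0.3 * z2 + 0.5 * z3 + 0.5 * z4 + c + rng.normal(size=n)
u1 = c + rng.normal(size=n)

ones = np.ones(n)
X = np.column_stack([ones, y2, z1, z2])          # structural regressors
Z = np.column_stack([ones, z1, z2, z3, z4])      # all exogenous variables

y1_valid = 1.0 + 2.0 * y2 + 0.5 * z1 - 0.5 * z2 + u1
y1_invalid = y1_valid + 0.5 * z4                 # z4 now violates the exclusion restriction

s_valid = sargan_stat(y1_valid, X, Z)
s_invalid = sargan_stat(y1_invalid, X, Z)

print("Sargan statistic, valid instruments:  ", s_valid)
print("Sargan statistic, z4 enters y1 directly:", s_invalid)
```

The 5% critical value of $\chi^2_1$ is about 3.84. As in the intuition above, a rejection says that at least one instrument is invalid but not which one, and a non-rejection does not prove validity.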

# References

- Introductory Econometrics, Jeffrey Wooldridge
- Mostly Harmless Econometrics, Joshua Angrist and Jörn-Steffen Pischke
- Various slides by Christopher Baum
  - https://fmwww.bc.edu/GStat/docs/StataIV.pdf
  - http://www.ncer.edu.au/resources/documents/IVandGMM.pdf
- Applied Econometrics with R, Christian Kleiber and Achim Zeileis
  - https://cran.r-project.org/web/packages/AER/