# Lab 5
## Randomized Promotion
2/13/2018

### Announcements:
Reminder that the next problem set is due at the start of lecture next week (February 21).  Solutions for the last problem set are posted!  Grades should be up in bcourses within 1-2 weeks.  A reminder on problem set submission:
* Jupyter notebook or R script only
* Only assigned exercises
* Submit by 9:30am for full credit

### Plan for Today's Lab
1. Quick discussion of instrumental variables applications (5-10 minutes)
2. Checking compliance in R with a crosstab (5 minutes)
3. IV Regressions in R (5-10 minutes)

## 1-Discussion
Instrumental variables are extremely useful when trying to test products or programs in the real world, especially in the following situations:
### Imperfect Compliance 
This was discussed in lecture: not everyone does what you tell them to!  It's important to understand that if we compare those who actually received the treatment to those who did not (instead of comparing those who were randomly assigned to treatment to those who were not), treatment status is no longer randomly assigned!  Our estimates may be biased upwards or downwards.  Example: if only very poor families choose to take up PROGRESA, then those who are assigned to the program but choose not to comply are wealthier or otherwise better off.
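To make the direction of the bias concrete, here is a minimal sketch on simulated data. Every number in it (the take-up rule, the income levels, the true effect of 100) is an assumption chosen for illustration and does not come from PROGRESA.

```r
# Simulated data only: none of these numbers come from the PROGRESA files.
set.seed(42)
n <- 10000
assigned <- rbinom(n, 1, 0.5)   # random assignment to the program
poor     <- rbinom(n, 1, 0.5)   # household poverty status
# Assumption: only poor households that were assigned actually take up
took_up  <- assigned * poor
true_effect <- 100
# Assumption: poor households start with lower income
income <- 1000 - 400 * poor + true_effect * took_up + rnorm(n, sd = 50)

# Naive "as treated" comparison: mixes the program effect with poverty
naive <- mean(income[took_up == 1]) - mean(income[took_up == 0])
# Comparison by random assignment (intent to treat): poverty is balanced
itt <- mean(income[assigned == 1]) - mean(income[assigned == 0])

print(c(naive = naive, itt = itt))
```

In this setup the naive difference comes out negative even though the true effect is positive, because the households that took up are all poor. The comparison by assignment is not confounded by poverty, but it is diluted by the households that never took up.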
### Imperfect Compliance By Design
Sometimes, you may design an experiment in such a way that you know not everyone in your randomized group will receive the treatment.  For example, you may not be able to limit the sample of randomization only to people who are eligible. Example: Oregon Health Study 
### Encouragement or "nudge" designs
It's important to understand the distinction here between the interpretation of (a) the first stage, (b) the intent to treat effect, and (c) the local average treatment effect/treatment on the treated. With nudges or encouragement programs, we might actually care more about the intent to treat effect, because this is the true effect of the program (especially for cost-effectiveness purposes).  In that case IV isn't really necessary.

While the LATE/TOT can be interpreted causally, we must remember that it is *local*.  This means that it estimates the effect *for the group that actually took up the treatment because they were assigned or encouraged* (the compliers).  So, we can't extrapolate what the effect would be for the group that was encouraged but didn't take up the treatment if we could somehow force them to take it up. (example: nudge to improve enrollment vs autoenrollment)
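The three quantities can be computed by hand in a simulated encouragement design. This is only a sketch: the 60% share of compliers and the effect sizes (200 for compliers, 50 for people who never take up) are made-up assumptions.

```r
# Simulated encouragement design; all parameters are illustrative assumptions.
set.seed(1)
n <- 20000
encouraged <- rbinom(n, 1, 0.5)    # randomly assigned nudge
complier   <- rbinom(n, 1, 0.6)    # would take up if (and only if) encouraged
took_up    <- encouraged * complier
# Assumption: treatment would help compliers more than never-takers
effect  <- ifelse(complier == 1, 200, 50)
outcome <- 1000 + effect * took_up + rnorm(n, sd = 100)

# (a) First stage: effect of encouragement on take-up
first_stage <- mean(took_up[encouraged == 1]) - mean(took_up[encouraged == 0])
# (b) Intent to treat: effect of encouragement on the outcome
itt <- mean(outcome[encouraged == 1]) - mean(outcome[encouraged == 0])
# (c) LATE: the ITT rescaled by the share who actually responded
late <- itt / first_stage

print(c(first_stage = first_stage, itt = itt, late = late))
```

The LATE here lands near the compliers' effect of 200. It says nothing about the 50 that never-takers would gain, which is why it cannot tell us what a mandatory version of the program (such as autoenrollment) would do.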

### Natural experiments
We can also use IV to analyze the effect of programs/products when we're not able to run an experiment.  We just have to find something that randomly induces variation in exposure to the program/product/thing whose effect we're trying to understand, and that affects the outcome only through that exposure. Example: use years with a bad flu season to estimate the effect of school absences on test scores.


In [2]:
# Clear Environment 
rm(list = ls())

# Load required packages
library(dplyr)
library(ggplot2)
install.packages("gmodels")
library(gmodels)
library(AER)

# Set working directory to the location of your data files
setwd("../Data")

# read the file
PanelPROGRESA_97_99year <- read.csv("PanelPROGRESA_97_99year.csv")
str(PanelPROGRESA_97_99year)

Installing package into ‘/home/aoyh/R/x86_64-pc-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
Loading required package: car

Attaching package: ‘car’

The following object is masked from ‘package:dplyr’:

    recode

Loading required package: lmtest
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric

Loading required package: sandwich
Loading required package: survival


'data.frame':	71411 obs. of  12 variables:
 $ year        : int  1998 1998 1999 1998 1999 1998 1999 1998 1999 1998 ...
 $ villid      : int  13030105 13030024 13030024 13030024 13030024 13030105 13030105 13030105 13030105 13030006 ...
 $ geopolid    : int  13 13 13 13 13 13 13 13 13 13 ...
 $ hogid       : int  1 2 2 3 3 4 4 5 5 6 ...
 $ pov_HH      : Factor w/ 2 levels "Non poor","poor": 2 1 1 1 1 2 2 2 2 1 ...
 $ D           : Factor w/ 2 levels "Control","Treated": 2 1 1 1 1 2 2 2 2 2 ...
 $ D_HH        : int  NA 0 0 0 0 NA NA NA NA 0 ...
 $ IncomeLab_HH: num  NA 9000 1200 1200 900 ...
 $ famsize     : int  2 3 3 3 3 5 5 2 2 4 ...
 $ eduhead     : int  NA NA NA NA NA NA NA NA NA NA ...
 $ agehead     : int  70 27 29 55 55 26 27 NA 47 60 ...
 $ sexhead     : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...


## 2-Checking Compliance
Here we'll just create a table to check for compliance.  Quick discussion: do we still care about compliance if we can use IV to correct for it?

In [3]:
# Creating a subset for the year 1999
PanelPROGRESA_99 <- subset(PanelPROGRESA_97_99year, year == 1999)
# Check out the help page for the CrossTable command
?CrossTable
# Set several defaults to FALSE, because we only need the row proportions
CrossTable(PanelPROGRESA_99$D, PanelPROGRESA_99$D_HH, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)




 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|-------------------------|

 
Total Observations in Table:  22124 

 
                   | PanelPROGRESA_99$D_HH 
PanelPROGRESA_99$D |         0 |         1 | Row Total | 
-------------------|-----------|-----------|-----------|
           Control |      9329 |         0 |      9329 | 
                   |     1.000 |     0.000 |     0.422 | 
-------------------|-----------|-----------|-----------|
           Treated |      4912 |      7883 |     12795 | 
                   |     0.384 |     0.616 |     0.578 | 
-------------------|-----------|-----------|-----------|
      Column Total |     14241 |      7883 |     22124 | 
-------------------|-----------|-----------|-----------|
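The compliance rate can be read straight off this table. As a small sketch, the counts below are copied by hand from the output above rather than recomputed from the data file.

```r
# Counts copied from the CrossTable output above (1999 observations)
counts <- matrix(
  c(9329,    0,
    4912, 7883),
  nrow = 2, byrow = TRUE,
  dimnames = list(assigned = c("Control", "Treated"), took_up = c("0", "1"))
)

# Row proportions: share of each assignment group that received the program
row_share <- prop.table(counts, margin = 1)

# Compliance rate (first stage): take-up among assigned minus take-up among control
compliance_rate <- row_share["Treated", "1"] - row_share["Control", "1"]
print(round(compliance_rate, 3))  # 0.616, matching the Treated row of the table
```

No control observation has `D_HH` equal to 1, so noncompliance is one-sided: the only noncompliers are those assigned to treatment who did not receive it.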

 


## 3-IV Regressions
If we don't use any control variables, we can just use LATE = ITT / compliance rate.  However, we'll often want to include control variables to increase precision, or because our treatment is only random conditional on that variable. Remember that controlling for a variable (including it on the right-hand side of a regression) allows us to estimate the effect of a treatment *holding constant* the control.  In this case, we'll need to use two-stage least squares, which just means that we're technically running two regressions (though R does this automatically with one command):
### Regression 1
$C_i=\alpha+\beta A_i+\theta X_i+\epsilon_i$

$\alpha$ is the constant

$C_i$ is an indicator for complying with (taking up) the treatment

$\beta$ is the change in the probability of taking up treatment caused by being assigned to it

$A_i$ is an indicator for being assigned to treatment

$X_i$ is the control variables, with coefficient $\theta$

$\epsilon_i$ is the error term

### Regression 2
$Y_i=\gamma+\delta \hat C_i+\lambda X_i+\tau_i$

$Y_i$ is the outcome

$\gamma$ is the constant

$\delta$ is the effect of treatment

$\hat C_i$ is predicted compliance from Regression 1

$X_i$ is the control variables, with coefficient $\lambda$

$\tau_i$ is the error term
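To see what the two regressions are doing, the sketch below runs them by hand with `lm()` on simulated data. The variable names (`a`, `c_took`, `x`, `u`) and every parameter are assumptions for illustration. The unobserved term `u` raises both take-up and the outcome, which is what makes plain OLS misleading.

```r
# Simulated data; the true treatment effect is set to 150 by assumption.
set.seed(2018)
n <- 5000
x <- rnorm(n)             # control variable (X_i)
a <- rbinom(n, 1, 0.5)    # assigned to treatment (A_i)
u <- rnorm(n)             # unobserved motivation, never seen by the analyst
# Take-up (C_i): only possible if assigned, more likely when x and u are high
c_took <- as.integer((a == 1) & ((0.5 * x + u) > -0.3))
y <- 500 + 150 * c_took + 80 * x + 100 * u + rnorm(n, sd = 50)

# Regression 1: predict take-up from assignment and the control
stage1 <- lm(c_took ~ a + x)
c_hat  <- fitted(stage1)

# Regression 2: regress the outcome on predicted take-up and the control
stage2 <- lm(y ~ c_hat + x)

# For comparison: OLS on actual take-up, which is confounded by u
ols <- lm(y ~ c_took + x)

print(c(two_stage = unname(coef(stage2)["c_hat"]),
        ols       = unname(coef(ols)["c_took"])))
# With the AER package, ivreg(y ~ c_took + x | a + x) gives the same point estimate.
```

The hand-rolled version is for intuition only. The standard errors that `summary(stage2)` reports are not valid, because they ignore that `c_hat` was itself estimated. A dedicated 2SLS routine such as `ivreg` corrects for this.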

In [5]:
# 2SLS (two-stage least squares), i.e. instrumental variable regression
# Formula syntax: outcome ~ regressors | instruments
# Model 1: D and eduhead are both instruments for D_HH. This assumes eduhead predicts
# take-up and, because it is left out of the outcome equation, that it affects income
# only through take-up.
iv_model <- ivreg(IncomeLab_HH ~ D_HH | D + eduhead, data = PanelPROGRESA_99)
summary(iv_model)
# Model 2: eduhead is also a control in the second regression, so the coefficient on
# D_HH is the effect of treatment holding eduhead constant.
# Any control in the second regression must also be listed after the | so that it is
# included in the first regression.
iv_model2 <- ivreg(IncomeLab_HH ~ D_HH + eduhead | D + eduhead, data = PanelPROGRESA_99)
summary(iv_model2)


Call:
ivreg(formula = IncomeLab_HH ~ D_HH | D + eduhead, data = PanelPROGRESA_99)

Residuals:
       Min         1Q     Median         3Q        Max 
  -2153.24   -1004.04    -669.90      45.96 1800830.10 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1754.0      178.2   9.843   <2e-16 ***
D_HH           415.9      355.9   1.168    0.243    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14700 on 16981 degrees of freedom
Multiple R-Squared: -6.532e-05,	Adjusted R-squared: -0.0001242 
Wald test: 1.365 on 1 and 16981 DF,  p-value: 0.2426 



Call:
ivreg(formula = IncomeLab_HH ~ D_HH + eduhead | D + eduhead, 
    data = PanelPROGRESA_99)

Residuals:
       Min         1Q     Median         3Q        Max 
  -2247.35   -1072.35    -662.86      26.67 1800727.65 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1862.86     213.63   8.720   <2e-16 ***
D_HH          409.50     355.98   1.150    0.250    
eduhead       -37.22      40.30  -0.924    0.356    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14700 on 16980 degrees of freedom
Multiple R-Squared: -1.119e-05,	Adjusted R-squared: -0.000129 
Wald test: 1.109 on 2 and 16980 DF,  p-value: 0.3299 


In [6]:
?ivreg