In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment.

**Q1)** Mention the two requirements for using instrumental variables (IV) in linear models:

* The instrument *must* be correlated with the *endogenous explanatory variables*, conditional on the other covariates.
* The instrument *cannot* be correlated with the *error term* in the explanatory equation (conditional on the other covariates), that is, the instrument cannot suffer from the same problem as the original predicting variable.
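
The two requirements can be made concrete with a small simulation. The sketch below is not based on the assignment data: the variables `u` (an unobserved confounder), `z` (an instrument), `x` and `y`, and every coefficient, are assumptions chosen for illustration. It shows an instrument that is correlated with the endogenous regressor but not with the error term, and compares the plain regression slope with the simple IV ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process (illustrative names and coefficients).
u = rng.normal(size=n)                       # unobserved confounder
z = rng.normal(size=n)                       # instrument, independent of u
x = 0.5 * z + u + rng.normal(size=n)         # endogenous regressor
y = 2.0 * x + 3.0 * u + rng.normal(size=n)   # outcome; the true effect of x is 2.0

# Requirement 1 (relevance): the instrument is correlated with x.
corr_zx = np.corrcoef(z, x)[0, 1]
# Requirement 2 (exogeneity): the instrument is uncorrelated with the confounder.
corr_zu = np.corrcoef(z, u)[0, 1]

# The OLS slope is contaminated by u; the IV ratio uses only variation driven by z.
ols_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
iv_slope = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"corr(z, x) = {corr_zx:.3f}, corr(z, u) = {corr_zu:.3f}")
print(f"OLS slope = {ols_slope:.3f}, IV slope = {iv_slope:.3f}")
```

In this setup the OLS slope overstates the effect because `x` and the error share the confounder, while the IV ratio stays close to the value used to generate the data.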



# Use case: Predicting Civic Engagement

Can we use college attainment (COLLEGE) to predict the probability of civic engagement (REGISTER)? College attainment is not randomized, and the arrow of causality may move in the opposite direction, so all we can do with standard regression is to establish a correlation.

Import:

In [1]:
# Standard library
import codecs
import json
import math
from collections import Counter

# Third-party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from dateutil import *

**Q2)** Read the iv.csv file and report basic summary statistics (mean, quartiles, standard deviation) for 'register', 'college' and 'distance'.

In [13]:
from sas7bdat import SAS7BDAT

# Open the SAS dataset (this cell reads dee.sas7bdat rather than iv.csv).
foo = SAS7BDAT('dee.sas7bdat')
# Convert the file contents to a pandas DataFrame.
ds = foo.to_data_frame()

In [18]:
ds.describe()

| | schoolid | hispanic | college | black | otherrace | female | register | distance |
|---|---|---|---|---|---|---|---|---|
| count | 9227.0 | 9227.0 | 9227.0 | 9227.0 | 9227.0 | 9227.0 | 9227.0 | 9227.0 |
| mean | 5406.414869 | 0.198873 | 0.54709 | 0.125393 | 0.049312 | 0.517178 | 0.670857 | 9.735992 |
| std | 2608.65778 | 0.399174 | 0.497805 | 0.331182 | 0.21653 | 0.499732 | 0.469927 | 8.702286 |
| min | 1032.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 3141.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 50% | 5185.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 7.0 |
| 75% | 7688.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 15.000001 |
| max | 9978.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 35.0 |


**Q3)** Use the function .corr to get a correlation matrix of the variables analyzed in Q2. Report the results and interpret the correlations.

In [21]:
ds1 = ds[['register','college','distance']]

In [23]:
corr = ds1.corr().values
corr

array([[ 1.        ,  0.1874256 , -0.03346988],
       [ 0.1874256 ,  1.        , -0.11137305],
       [-0.03346988, -0.11137305,  1.        ]])

**Q4)** Run a linear regression of REGISTER on COLLEGE. Report the summary results of the regression. Is the coefficient of college statistically significant? Interpret this coefficient.

In [26]:
# Regress voter registration on college attendance (no instrument yet).
result = smf.ols(formula="register ~ college", data=ds).fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:               register   R-squared:                       0.035
Model:                            OLS   Adj. R-squared:                  0.035
Method:                 Least Squares   F-statistic:                     335.9
Date:                Fri, 08 Feb 2019   Prob (F-statistic):           1.03e-73
Time:                        04:58:44   Log-Likelihood:                -5959.0
No. Observations:                9227   AIC:                         1.192e+04
Df Residuals:                    9225   BIC:                         1.194e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.5741      0.007     80.391      0.0
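
Because this regression has a single predictor, its slope and intercept follow directly from the summary statistics and the correlation matrix reported earlier. The sketch below only redoes that arithmetic with figures copied from the `ds.describe()` and `ds1.corr()` outputs; the variable names are illustrative, and the result is a consistency check rather than a substitute for the regression table.

```python
# Figures copied from the describe() and corr() outputs shown earlier.
mean_register, sd_register = 0.670857, 0.469927
mean_college, sd_college = 0.54709, 0.497805
corr_register_college = 0.1874256

# Simple-regression identities: slope = r * sd(y) / sd(x); the fitted line
# passes through the point of means, which pins down the intercept.
slope = corr_register_college * sd_register / sd_college
intercept = mean_register - slope * mean_college

print(f"implied slope     = {slope:.4f}")
print(f"implied intercept = {intercept:.4f}")
```

The implied intercept agrees with the `Intercept` value of 0.5741 in the table, and the implied slope is about 0.177. Since `college` is a 0/1 indicator, that slope is the difference in registration rates between the two groups, which describes an association and not necessarily a causal effect.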

### Two-Stage Least Squares Regression of REGISTER ~ COLLEGE where IV=DISTANCE

In two-stage least squares (2SLS) regression, we first regress COLLEGE on DISTANCE and then use the fitted values from that model as the predictor of REGISTER. The estimate is therefore obtained in two stages.
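
With one instrument and no other covariates, the two stages collapse into a single ratio: cov(REGISTER, DISTANCE) / cov(COLLEGE, DISTANCE). The sketch below evaluates that ratio from the correlations and standard deviations reported in the earlier outputs. The variable names are illustrative, and the calculation assumes the reported figures are precise enough for this purpose.

```python
# Figures copied from the describe() and corr() outputs shown earlier.
sd_register = 0.469927
sd_college = 0.497805
corr_register_distance = -0.03346988
corr_college_distance = -0.11137305

# cov(a, b) = corr(a, b) * sd(a) * sd(b); sd(distance) appears in both the
# numerator and the denominator, so it cancels out of the ratio.
iv_estimate = (corr_register_distance * sd_register) / (
    corr_college_distance * sd_college
)

print(f"IV (ratio) estimate of the college effect: {iv_estimate:.3f}")
```

The ratio comes out at roughly 0.28. The slope on the first-stage fitted values in a manual second stage should reproduce the same number, because in this single-instrument case the two calculations are algebraically identical.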

**Q5)** STAGE 1: Run an OLS regression of college on distance. Save the fitted values in the main dataframe as 'college_fitted'. TIP: result.predict() returns the fitted values.

In [29]:
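# Stage 1: regress the endogenous variable (college) on the instrument (distance).
result = smf.ols(formula="college ~ distance", data=ds).fit()
print(result.summary())
# Store the first-stage fitted values for use in stage 2.
ds['college_fitted'] = result.predict()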

                            OLS Regression Results                            
Dep. Variable:                college   R-squared:                       0.012
Model:                            OLS   Adj. R-squared:                  0.012
Method:                 Least Squares   F-statistic:                     115.9
Date:                Fri, 08 Feb 2019   Prob (F-statistic):           7.35e-27
Time:                        05:04:53   Log-Likelihood:                -6598.2
No. Observations:                9227   AIC:                         1.320e+04
Df Residuals:                    9225   BIC:                         1.321e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.6091      0.008     78.812      0.0
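
The first-stage F-statistic is the usual quick check on the first IV requirement (instrument relevance). With a single instrument it can be recovered from the COLLEGE/DISTANCE correlation alone. The sketch below redoes that calculation from figures reported earlier; the variable names are illustrative, and the threshold of 10 is a widely quoted rule of thumb, not a formal test.

```python
# Figures copied from the corr() output and the regression header shown earlier.
corr_college_distance = -0.11137305
n_obs = 9227

# For a regression with one predictor, R^2 is the squared correlation and
# F = R^2 / (1 - R^2) * (n - 2).
r_squared = corr_college_distance ** 2
f_stat = r_squared / (1.0 - r_squared) * (n_obs - 2)

print(f"first-stage R^2 = {r_squared:.3f}")
print(f"first-stage F   = {f_stat:.1f}")
print("passes the F > 10 rule of thumb:", f_stat > 10)
```

The result matches the R-squared of 0.012 and the F-statistic of 115.9 in the table. DISTANCE therefore explains little of the variation in COLLEGE, but the relationship is estimated precisely enough that the instrument would not usually be called weak.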

**Q6)** STAGE 2: Run an OLS regression of register on college_fitted. Report the summary table. Compare the R-squared and p-value with those from Q4, and interpret the coefficient on college_fitted.

In [30]:
# Stage 2: regress the outcome on the first-stage fitted values.
result = smf.ols(formula="register ~ college_fitted", data=ds).fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:               register   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     10.35
Date:                Fri, 08 Feb 2019   Prob (F-statistic):            0.00130
Time:                        05:06:47   Log-Likelihood:                -6118.9
No. Observations:                9227   AIC:                         1.224e+04
Df Residuals:                    9225   BIC:                         1.226e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          0.5157      0.049     10.
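
One caveat when the second stage is run by hand: the standard errors in that OLS table are computed from residuals that use `college_fitted`, whereas the usual 2SLS standard errors use residuals built from the actual regressor. The sketch below illustrates the difference on simulated data for the case of one regressor and one instrument under homoskedastic errors. The function `two_sls`, the variable names, and the data-generating process are all hypothetical.

```python
import numpy as np

def two_sls(y, x, z):
    """2SLS with one endogenous regressor x, one instrument z, and an intercept.

    Returns the coefficients, the 2SLS standard errors, and the naive standard
    errors that a hand-run second-stage OLS would report.
    """
    n = len(y)
    ones = np.ones(n)
    Z = np.column_stack([ones, z])   # instrument matrix
    X = np.column_stack([ones, x])   # regressors with the actual x

    # Stage 1: fitted values of x given the instrument.
    pi = np.linalg.lstsq(Z, x, rcond=None)[0]
    X_hat = np.column_stack([ones, Z @ pi])

    # Stage 2: regress y on the fitted values.
    beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]

    xtx_inv = np.linalg.inv(X_hat.T @ X_hat)
    dof = n - X.shape[1]
    resid_2sls = y - X @ beta        # residuals built from the actual x
    resid_naive = y - X_hat @ beta   # residuals a manual second stage uses
    se_2sls = np.sqrt(np.diag(xtx_inv) * (resid_2sls @ resid_2sls) / dof)
    se_naive = np.sqrt(np.diag(xtx_inv) * (resid_naive @ resid_naive) / dof)
    return beta, se_2sls, se_naive

# Simulated example; the true slope is 2.0.
rng = np.random.default_rng(2)
n = 20_000
u = rng.normal(size=n)
z = rng.normal(size=n)
x = 0.5 * z + u + rng.normal(size=n)
y = 2.0 * x + 3.0 * u + rng.normal(size=n)

beta, se_2sls, se_naive = two_sls(y, x, z)
print(f"slope = {beta[1]:.3f}, 2SLS SE = {se_2sls[1]:.3f}, naive SE = {se_naive[1]:.3f}")
```

The point estimates are the same either way; only the reported uncertainty changes. For that reason the t-statistics and p-values from a hand-run second stage should be read with care.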

### Adding covariates that do not satisfy the requirements of instrumental variables
In the case of race/ethnicity, we expect that there might be a relationship between race/ethnicity and distance to a community college, as well as a relationship between race/ethnicity and voter registration.

While race/ethnicity fails the requirements for an instrumental variable, it can still be included as a covariate in a multiple regression model. In such cases, it is essential to include the same covariates at both stages of the two-stage regression.
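
To see why the covariates belong in both stages, the following sketch uses simulated data in which a group indicator `w` is related to both the instrument and the outcome. Everything here is an assumption made for illustration: the column names `y`, `x`, `z`, `w`, the helper `manual_2sls`, and the coefficients. It compares a manual two-stage estimate that includes `w` in both stages with one that omits it.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical data: w shifts the instrument z, the regressor x, and the outcome y.
w = rng.binomial(1, 0.3, size=n)
u = rng.normal(size=n)                                 # unobserved confounder
z = 1.0 * w + rng.normal(size=n)                       # instrument, related to w
x = 0.5 * z + 0.5 * w + u + rng.normal(size=n)         # endogenous regressor
y = 2.0 * x + 1.5 * w + 3.0 * u + rng.normal(size=n)   # true effect of x is 2.0
sim = pd.DataFrame({"y": y, "x": x, "z": z, "w": w})

def manual_2sls(data, first_stage, second_stage):
    """Run both stages with OLS and return the coefficient on x_fitted."""
    data = data.copy()
    data["x_fitted"] = smf.ols(first_stage, data=data).fit().fittedvalues
    return smf.ols(second_stage, data=data).fit().params["x_fitted"]

with_cov = manual_2sls(sim, "x ~ z + w", "y ~ x_fitted + w")
without_cov = manual_2sls(sim, "x ~ z", "y ~ x_fitted")

print(f"covariate in both stages: {with_cov:.3f}")
print(f"covariate omitted:        {without_cov:.3f}")
```

When `w` is left out, part of its direct effect on the outcome is attributed to the regressor through the instrument, so the estimate drifts away from the value used to generate the data.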

**Q7)** Repeat Q5 and Q6, i.e. the two-stage least squares regression, but now use this model specification: REGISTER ~ COLLEGE + BLACK + HISPANIC + OTHERRACE where IV=DISTANCE

**Q8)** Compare the two models, choose one of them and justify your selection.

In [33]:
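# Stage 1: regress college on the instrument plus the race/ethnicity covariates.
result = smf.ols(formula="college ~ distance + black + hispanic + otherrace", data=ds).fit()
print(result.summary())
# Overwrite the earlier fitted values with those from the covariate-adjusted first stage.
ds['college_fitted'] = result.predict()

# Stage 2: regress register on the fitted values and the same covariates.
result = smf.ols(formula="register ~ college_fitted + black + hispanic + otherrace", data=ds).fit()
print(result.summary())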

                            OLS Regression Results                            
Dep. Variable:                college   R-squared:                       0.022
Model:                            OLS   Adj. R-squared:                  0.021
Method:                 Least Squares   F-statistic:                     51.24
Date:                Fri, 08 Feb 2019   Prob (F-statistic):           9.75e-43
Time:                        05:18:04   Log-Likelihood:                -6554.4
No. Observations:                9227   AIC:                         1.312e+04
Df Residuals:                    9222   BIC:                         1.315e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.6431      0.009     71.039      0.0