# Running an Instrumental Variables (2SLS) Regression Using the statsmodels Library in Python

This example is taken from http://www.ats.ucla.edu/stat/stata/examples/methods_matter/chapter10/default.htm, which shows how to run IV 2SLS in Stata.

Instrumental variables (IV) estimation is used to estimate causal relationships when the treatment variable of interest is not randomly assigned, but an instrumental variable is (or is as good as randomly assigned).

In general, there are two main requirements for using an IV:
1. The instrument must be correlated with the endogenous explanatory variable, conditional on the other covariates. This is testable, for example with the first-stage F-statistic.

2. The instrument cannot be correlated with the error term in the explanatory equation (conditional on the other covariates). That is, the instrument cannot suffer from the same problem as the original predicting variable. This cannot be tested directly, but one can provide evidence and argue for it using economic and intuitive logic.
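
To make the two requirements concrete, here is a minimal simulation sketch. It is not part of the original example: the data are synthetic, and the variable names (`z`, `x`, `y`, `confounder`) and the effect size of 0.5 are assumptions chosen for illustration. An unobserved confounder biases the OLS slope, while the single-instrument IV estimate, the ratio of two covariances, stays close to the assumed effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 0.5                      # assumed causal effect of x on y

z = rng.normal(size=n)                 # instrument: affects y only through x
confounder = rng.normal(size=n)        # unobserved variable driving both x and y
x = 0.8 * z + confounder + rng.normal(size=n)
y = true_effect * x + confounder + rng.normal(size=n)


def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]


beta_ols = cov(x, y) / cov(x, x)       # slope from regressing y on x
beta_iv = cov(z, y) / cov(z, x)        # IV estimate with a single instrument
relevance = np.corrcoef(z, x)[0, 1]    # requirement 1: z is correlated with x

print(f"OLS: {beta_ols:.3f}  IV: {beta_iv:.3f}  corr(z, x): {relevance:.3f}")
```

Requirement 2 holds here only by construction, because `z` is drawn independently of `confounder`. With real data that independence has to be argued rather than computed.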

## Example: Effect of College Education on the Probability That a Person Registers to Vote

In this example, we want to estimate the effect of having a college education on the probability that a person registers to vote. However, a simple comparison may suffer from omitted variable bias: unobserved factors can affect both whether people get a college education and how likely they are to register to vote. Instead, we use distance to college as an instrument for whether they get a college education.
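
In equation form, the setup described above can be sketched as follows. The symbols ($\beta$, $\pi$, $u$, $v$) are notation introduced here for illustration and do not appear in the original example. The first line is the equation of interest, the second is the first stage, and the third is the IV estimate of $\beta_1$ when there is a single instrument and no other covariates.

```latex
\begin{align}
\text{register}_i &= \beta_0 + \beta_1\,\text{college}_i + u_i \\
\text{college}_i  &= \pi_0 + \pi_1\,\text{distance}_i + v_i \\
\hat{\beta}_1^{\,IV} &=
  \frac{\widehat{\operatorname{Cov}}(\text{distance}, \text{register})}
       {\widehat{\operatorname{Cov}}(\text{distance}, \text{college})}
\end{align}
```

The two requirements listed earlier correspond to $\pi_1 \neq 0$ (relevance) and $\operatorname{Cov}(\text{distance}_i, u_i) = 0$ (exogeneity).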

### Acquire Data

Data source: http://www.ats.ucla.edu/stat/stata/examples/methods_matter/chapter10/dee.dta

The code below assumes `dee.dta` has been saved to the working directory.

In [3]:
import pandas as pd
dee_df = pd.read_stata("dee.dta")

### Some Summary Stats

In [4]:
dee_df[['register','college', 'distance']].describe()

| | register | college | distance |
|---|---|---|---|
| count | 9227.0 | 9227.0 | 9227.0 |
| mean | 0.670857 | 0.54709 | 9.735992 |
| std | 0.469927 | 0.497805 | 8.702286 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 3.0 |
| 50% | 1.0 | 1.0 | 7.0 |
| 75% | 1.0 | 1.0 | 15.000001 |
| max | 1.0 | 1.0 | 35.0 |


### Linear Regression Example

In [5]:
import statsmodels.formula.api as smf
result = smf.ols(formula = "register ~ college", data = dee_df).fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:               register   R-squared:                       0.035
Model:                            OLS   Adj. R-squared:                  0.035
Method:                 Least Squares   F-statistic:                     335.9
Date:                Sun, 06 Nov 2016   Prob (F-statistic):           1.03e-73
Time:                        23:10:49   Log-Likelihood:                -5959.0
No. Observations:                9227   AIC:                         1.192e+04
Df Residuals:                    9225   BIC:                         1.194e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.5741      0.007     80.391      0.0

### IV 2SLS ("By Hand")

In [6]:
print("==============================================================================")
print("                                  FIRST STAGE                                 ")
print("==============================================================================")
result = smf.ols(formula = "college ~ distance", data = dee_df).fit()
print(result.summary())
dee_df['college_fitted'] = result.predict()

print("==============================================================================")
print("                                  SECOND STAGE                                ")
print("==============================================================================")

result = smf.ols(formula = "register ~ college_fitted", data=dee_df).fit()
print(result.summary())

                                  FIRST STAGE                                 
                            OLS Regression Results                            
Dep. Variable:                college   R-squared:                       0.012
Model:                            OLS   Adj. R-squared:                  0.012
Method:                 Least Squares   F-statistic:                     115.9
Date:                Sun, 06 Nov 2016   Prob (F-statistic):           7.35e-27
Time:                        23:10:49   Log-Likelihood:                -6598.2
No. Observations:                9227   AIC:                         1.320e+04
Df Residuals:                    9225   BIC:                         1.321e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------

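The coefficient estimate of `college_fitted` is the correct 2SLS estimate. However, the standard errors reported by the second-stage OLS are not valid, because they are computed from residuals that use `college_fitted` rather than the observed `college`. In this case they are biased downward.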
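
The following sketch illustrates where the by-hand standard errors go wrong. It is not the original author's code: it uses numpy on synthetic data, the function name `two_stage_least_squares` and all data-generating values are assumptions for illustration, and it uses the textbook homoskedastic 2SLS variance formula with an `n - k` degrees-of-freedom correction. The only difference between the two sets of standard errors is which regressors are used to form the residuals.

```python
import numpy as np


def two_stage_least_squares(y, X, Z):
    """Return 2SLS coefficients with naive and corrected standard errors.

    y: (n,) outcome; X: (n, k) regressors including a constant;
    Z: (n, m) instruments including a constant, with m >= k.
    """
    # First stage: project every column of X onto the instruments.
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage: regress y on the fitted regressors.
    beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]

    n, k = X.shape
    xtx_inv_diag = np.diag(np.linalg.inv(X_hat.T @ X_hat))
    naive_resid = y - X_hat @ beta     # what a by-hand second stage uses
    resid = y - X @ beta               # 2SLS residuals use the observed X
    se_naive = np.sqrt(xtx_inv_diag * (naive_resid @ naive_resid) / (n - k))
    se_2sls = np.sqrt(xtx_inv_diag * (resid @ resid) / (n - k))
    return beta, se_naive, se_2sls


# Synthetic data (illustrative only): u is an unobserved confounder.
rng = np.random.default_rng(1)
n = 20_000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.5 * z + u + rng.normal(size=n)
y = 1.0 + 0.3 * x - u + rng.normal(size=n)

const = np.ones(n)
X = np.column_stack([x, const])
Z = np.column_stack([z, const])

beta, se_naive, se_2sls = two_stage_least_squares(y, X, Z)
print("coefficients:", beta)
print("naive SE:    ", se_naive)
print("2SLS SE:     ", se_2sls)
```

In this synthetic setup the naive standard errors come out smaller than the corrected ones, but the direction of the discrepancy depends on the data, so the by-hand numbers should not be relied on either way.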

### IV 2SLS Using GMM.IV2SLS

This requires importing the `gmm` module from `statsmodels.sandbox.regression`.

In [7]:
from statsmodels.sandbox.regression import gmm

In [8]:
# Need to add a column for the constant term
dee_df['_const'] = 1

# Y Variable
endog_df = dee_df[['register']]

# X Variable Plus Constant
exog_df = dee_df[['college', '_const']]

# Instrument plus constant (the constant serves as its own instrument)
instrument_df = dee_df[['distance', '_const']]

In [9]:
# Estimate 2SLS model
mod = gmm.IV2SLS(endog=endog_df, exog=exog_df, instrument=instrument_df).fit()

In [10]:
# Print Results
print(mod.summary())

                          IV2SLS Regression Results                           
Dep. Variable:               register   R-squared:                       0.022
Model:                         IV2SLS   Adj. R-squared:                  0.022
Method:                     Two Stage   F-statistic:                     10.57
                        Least Squares   Prob (F-statistic):            0.00115
Date:                Sun, 06 Nov 2016                                         
Time:                        23:10:49                                         
No. Observations:                9227                                         
Df Residuals:                    9225                                         
Df Model:                           1                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
college        0.2837      0.087      3.251      0.0