In [1]:
# Loading the libraries we will use and setting global options

# Data manipulation and math/stats functions
import numpy as np
np.set_printoptions(suppress=True)
import pandas as pd
import statsmodels.api as sm
#!pip install linearmodels
from linearmodels.iv import IV2SLS 

# Plotting preferences
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import seaborn as sns

# Suppressing warnings
import warnings
warnings.simplefilter(action = "ignore")

# Import self-made functions
from p3functions import *



In [2]:
# Load in cleaned data and results from our naive OLS fit
student_perf = pd.read_pickle('data/student_por_v3.pkl')
naive_ols = sm.load('results/Naive_OLS.pickle')

# Model Fitting

### Addressing Endogeneity

Now that we have identified our problem, we need to find a way to account for it. Specifically, we need to find one or more instruments for hours of studying. A good instrument, $z$, would have the following characteristics:
* $\mathrm{Cov}[z, studying] \ne 0 \ \quad$ (hours of studying covaries with $z$)
* $\mathrm{Cov}[z, \varepsilon] = 0 \ \ \ \quad\quad\quad$ ($z$ is exogenous)
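
To make the two conditions concrete, here is a minimal sketch on synthetic data (not the student dataset). The variables `ability`, `z`, `study`, and `grade`, and every coefficient below, are illustrative assumptions. When the instrument is relevant and exogenous, the simple IV ratio $\mathrm{Cov}[z, y] / \mathrm{Cov}[z, x]$ should land near the true effect, while OLS picks up the omitted ability.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process (assumed for illustration only)
ability = rng.normal(size=n)                   # unobserved confounder
z = rng.normal(size=n)                         # instrument, drawn independently of ability
study = z + ability + rng.normal(size=n)       # relevance: z moves study
eps = ability + rng.normal(size=n)             # error term contains ability
grade = 0.5 * study + eps                      # true marginal effect of study is 0.5


def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]


beta_ols = cov(study, grade) / np.var(study, ddof=1)   # biased: study covaries with eps
beta_iv = cov(z, grade) / cov(z, study)                # simple IV (Wald) ratio

print("Cov[z, study]:", round(cov(z, study), 3))       # should be clearly non-zero
print("Cov[z, eps]:  ", round(cov(z, eps), 3))         # should be close to zero
print("OLS estimate: ", round(beta_ols, 3))
print("IV estimate:  ", round(beta_iv, 3))
```

In real data the second condition cannot be checked this way, because $\varepsilon$ is unobserved; it has to be argued, which is what the following paragraphs attempt.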

Finding valid instruments is challenging, because we can often imagine situations in which a potential instrument is not valid. For example, we might be tempted to use `school_sup` (extra educational support) as an instrument, because a student enrolled in extra support will likely study more. But on further thought, enrollment in tutoring is probably not independent of ability (perhaps high or low ability students are more likely to get tutored), which means that `school_sup` may covary with the unobserved ability$^1$.

After some thought I concluded that `school_GP` (which school the student is enrolled in), `goout` (how often the student goes out with friends), and `male` (indicator for male), are valid instruments for hours of studying. School enrollment can be a valid instrument because studying habits are often modelled in networks$^2$; when one student with high centrality increases their studying, connected students (friends or classmates) also increase their studying. It is easy to imagine that the two schools have friendship networks that have fewer heterophilic links than homophilic links, which implies that hours of studying covaries with enrollment through random perturbations in central students' studying habits. Frequency of going out is a valid instrument because it seems likely that most students go out with friends, and that students who go out more have less time to study. We can assume that going out and ability are independent because there is no reason to think that going out would covary with ability. Lastly, gender is a valid instrument for a similar reason to that of enrollment; male to male links (friendships) will be more common than male to female links, so if central males study less than central females, then we would expect males to study less than females on average (and vice versa).
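
The relevance half of this argument can be checked empirically. A common approach is a first-stage F-test on the instruments, with F above roughly 10 often used as a rule of thumb for instruments that are not weak. The sketch below is an assumption-laden illustration on synthetic data: `first_stage_f` is a hypothetical helper, and it uses the classical (homoskedastic) F-statistic rather than a robust one.

```python
import numpy as np


def rss(y, X):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def first_stage_f(endog, controls, instruments):
    """Classical F-statistic for the joint null that all instrument
    coefficients are zero in the first-stage regression."""
    n = len(endog)
    X_restricted = np.hstack([np.ones((n, 1)), controls])
    X_unrestricted = np.hstack([X_restricted, instruments])
    q = instruments.shape[1]                      # number of excluded instruments
    rss_r = rss(endog, X_restricted)
    rss_u = rss(endog, X_unrestricted)
    return ((rss_r - rss_u) / q) / (rss_u / (n - X_unrestricted.shape[1]))


# Synthetic example (all names and coefficients are assumptions)
rng = np.random.default_rng(1)
n = 2_000
w = rng.normal(size=(n, 1))                       # one exogenous control
Z = rng.normal(size=(n, 2))                       # two candidate instruments
x_strong = w[:, 0] + 0.5 * Z[:, 0] + 0.3 * Z[:, 1] + rng.normal(size=n)
x_weak = w[:, 0] + rng.normal(size=n)             # instruments are irrelevant here

f_strong = first_stage_f(x_strong, w, Z)
f_weak = first_stage_f(x_weak, w, Z)
print("First-stage F, relevant instruments:  ", round(f_strong, 1))
print("First-stage F, irrelevant instruments:", round(f_weak, 1))
```

On the notebook's data, the same helper could presumably be called with `studytime` as `endog`, the exogenous columns as `controls`, and the instrument columns as `instruments`, after converting each to a NumPy array.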

Additionally, we include a new artificial instrument derived from the two-way frequency table we created in the data exploration notebook. We noticed that 149 students chose their school based on its proximity to their home, and that 49 of these students had a commute of more than 15 minutes. We create an indicator for whether an observation is one of these 49 students. We argue that this is a valid instrument because these students chose a school for its proximity, yet their commute is fairly significant, taking time out of their day that could have been spent studying. We believe that the decrease in available study time arising from the longer commute is a decent instrument in our model.
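
As a small illustration of how such an indicator can be built, here is a sketch on a four-row toy frame (the rows are made up). It assumes `traveltime` is coded ordinally with 1 meaning a commute under 15 minutes, so that `traveltime > 1` picks out commutes longer than 15 minutes.

```python
import pandas as pd

# Toy data, invented for illustration; not rows from the student dataset
toy = pd.DataFrame({
    'reason': ['home', 'home', 'course', 'reputation'],
    'traveltime': [1, 2, 3, 1],
})

# 1 if the student chose the school for proximity AND still has a long commute
toy['artificial_instrument'] = (
    (toy['reason'] == 'home') & (toy['traveltime'] > 1)
).astype(int)

# Two-way frequency table of the kind the indicator is derived from
print(pd.crosstab(toy['reason'], toy['traveltime']))
print(toy)
```

Only the second row satisfies both conditions, so it is the only one flagged.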

Now that we have some plausible instruments for our endogenous variable, we can run 2SLS to estimate the model. Note that we exclude `no_parent` from the exogenous variables to address some of the multicollinearity that arises from splitting categorical variables into multiple indicators. We are again using a heteroskedasticity-robust variance-covariance matrix estimator, because robustness can never hurt.
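
For intuition about what 2SLS does, the two stages can be written out by hand. This is a sketch on synthetic data with assumed names and coefficients, not the estimator used below. The point estimates from the manual second stage match 2SLS, but its naive standard errors are not valid, which is one reason to rely on a dedicated implementation such as `IV2SLS`.

```python
import numpy as np


def ols(y, X):
    """OLS coefficients of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Hypothetical data-generating process (assumed for illustration only)
rng = np.random.default_rng(2)
n = 50_000
ability = rng.normal(size=n)                  # unobserved, enters both equations
w = rng.normal(size=n)                        # exogenous control
z1 = rng.normal(size=n)                       # instrument 1
z2 = rng.normal(size=n)                       # instrument 2
x = 1.0 + 0.5 * w + 1.0 * z1 + 0.5 * z2 + ability + rng.normal(size=n)
y = 2.0 + 0.3 * w + 0.5 * x + ability + rng.normal(size=n)   # true effect of x is 0.5

exog = np.column_stack([np.ones(n), w])

# Stage 1: regress the endogenous variable on controls and instruments
first_stage_X = np.column_stack([exog, z1, z2])
x_hat = first_stage_X @ ols(x, first_stage_X)

# Stage 2: replace the endogenous variable with its first-stage fitted values
beta_2sls = ols(y, np.column_stack([exog, x_hat]))
beta_ols = ols(y, np.column_stack([exog, x]))

print("OLS coefficient on x: ", round(beta_ols[-1], 3))
print("2SLS coefficient on x:", round(beta_2sls[-1], 3))
```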

-----
1. Earlier we had assumed that all variables other than study time were exogenous, meaning that any of those variables that was correlated with study time would be a valid instrument. But we made that assumption only loosely: in reality we do not really care whether variables other than study time are endogenous, because we are not interested in the coefficients on these variables. So when we are constructing instruments for study time, we still need to critically consider the exogeneity of each instrument.
2. This specific example is from Bryan Graham's course Econ 142 during a discussion of networks and centrality. 

In [5]:
# Running the 2SLS model

# Dependent variable
Y = student_perf.G3_perc

# Exogenous variables
X_exog = student_perf[['age', 'urban', 'fam_small', 'fam_split', 'school_GP', 'Medu', 'Fedu', 'mother', 'father', 'traveltime', 'freetime', 'failures', 'school_sup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'Dalc', 'Walc', 'health', 'absences']]
X_exog = sm.add_constant(X_exog)

# Endogenous variable
X_endog = student_perf.studytime

# Instruments
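# Indicator: chose the school for proximity to home but still commutes more than 15 minutes
student_perf['artificial_instrument'] = ((student_perf.reason == 'home') & (student_perf.traveltime > 1)).astype(int)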
Z = student_perf[['goout', 'male', 'artificial_instrument']]

model = IV2SLS(Y, X_exog, X_endog, Z)
results = model.fit()
print(results.summary)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                G3_perc   R-squared:                      0.0431
Estimator:                    IV-2SLS   Adj. R-squared:                 0.0031
No. Observations:                 649   F-statistic:                    232.05
Date:                Wed, Dec 06 2017   P-value (F-stat)                0.0000
Time:                        17:09:25   Distribution:                 chi2(26)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          0.4534     0.2234     2.0293     0.0424      0.0155      0.8914
age            0.0005     0.0119     0.0443     0.96

These results are very encouraging. Usually we expect the significance of the coefficients to decrease for almost all the variables in a 2SLS model compared to the analogous OLS model. In our 2SLS fit we find that most of the coefficients are indeed less significant, but that the coefficient on study time is actually more significant by an order of magnitude (p-value of 0.0003 vs 0.002)! Not only that, but the coefficient itself is an order of magnitude larger (0.36 vs 0.03)! The increase in significance of the coefficient on our endogenous variable strongly suggests that we have good instruments and better identification of the marginal effect of studying on grades.

One item to note is that in the 2SLS fit, our value for $R^2$ is negative. It turns out that in 2SLS (and 3SLS), a negative $R^2$ is not usually a cause for concern. In instrumental variable models we are more interested in the standard error and significance level on the coefficient of our endogenous variable, and in our fit the coefficient is highly significant. For a more in-depth analysis of the negative $R^2$ phenomenon, the Stata documentation provides a good explanation [here](http://www.stata.com/support/faqs/statistics/two-stage-least-squares/ "2SLS R^2").
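
The mechanics behind a negative $R^2$ can be seen in a small synthetic example (all names and coefficients are assumptions). The 2SLS residuals are computed with the actual endogenous regressor, not the first-stage fitted values, so the residual sum of squares is not constrained to be smaller than the total sum of squares the way it is under OLS.

```python
import numpy as np


def r_squared(y, fitted):
    """1 - RSS/TSS, which can fall below zero when RSS exceeds TSS."""
    rss = np.sum((y - fitted) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss


# Hypothetical setup in which the regressor is negatively related to the error
rng = np.random.default_rng(3)
n = 20_000
ability = rng.normal(size=n)
z = rng.normal(size=n)                        # instrument
x = z - ability                               # endogenous regressor
u = ability + 0.5 * rng.normal(size=n)        # structural error
y = 0.5 * x + u                               # true effect of x is 0.5

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)   # just-identified IV: (Z'X)^{-1} Z'y

r2_ols = r_squared(y, X @ beta_ols)
r2_iv = r_squared(y, X @ beta_iv)             # residuals use the actual x
print("OLS slope:", round(beta_ols[1], 3), " R^2:", round(r2_ols, 3))
print("IV slope: ", round(beta_iv[1], 3), " R^2:", round(r2_iv, 3))
```

In this setup the IV slope sits near the true effect even though its $R^2$ is negative, while OLS attains a higher $R^2$ with a badly biased slope.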

In [24]:
#Saving results of our 2SLS model
results.save('results/2SLS.pickle')

AttributeError: 'IVResults' object has no attribute 'save'

In [28]:
# IVResults has no save() or to_pickle() method, so serialize the results with pickle
import pickle
with open('results/2SLS.pkl', 'wb') as f:
    pickle.dump(results, f)

In [27]:
results.to_pickle('results/2SLS.pickle')

AttributeError: 'IVResults' object has no attribute 'to_pickle'

In [31]:
with open('results/2SLS.pkl','rb') as f:
    model_results = pickle.load(f)

In [32]:
model_results.params

const         0.504802
age           0.000110
urban         0.039935
fam_small     0.019446
fam_split    -0.012531
Medu          0.010701
Fedu          0.020835
mother       -0.025518
father       -0.015352
traveltime   -0.000272
freetime     -0.010403
failures     -0.096921
school_sup   -0.115723
famsup       -0.030566
paid         -0.043381
activities    0.001406
nursery      -0.023750
higher        0.094381
internet      0.036485
romantic     -0.049226
famrel        0.014016
Dalc         -0.027947
Walc          0.007068
health       -0.009991
absences      0.001596
studytime     0.213904
Name: parameter, dtype: float64