# Model Fitting II

### Addressing Endogeneity

To address the issue of endogeneity we need to find instruments for hours of studying that have the following characteristics:

* $\mathrm{Cov}[z, \mathit{studying}] \ne 0 \quad$ (hours of studying covaries with $z$, i.e. $z$ is relevant)
* $\mathrm{Cov}[z, \varepsilon] = 0 \quad$ ($z$ is exogenous)
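
As a rough illustration of these two conditions, the sketch below simulates a hypothetical data-generating process (it is not the student dataset, and the names `u` and `eps` and all coefficients are assumptions made for the example). Because the data are simulated, the error term is observable here, so both covariances can be computed directly. With real data $\varepsilon$ is unobserved, so the exogeneity condition has to be argued rather than measured.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical simulated data (assumed for illustration only)
z = rng.normal(size=n)                        # candidate instrument
u = rng.normal(size=n)                        # unobserved confounder
studying = 0.8 * z + u + rng.normal(size=n)   # endogenous regressor, driven by z and u
eps = u + rng.normal(size=n)                  # structural error, shares u with studying

# Sample covariances corresponding to the two instrument conditions
cov_z_studying = np.cov(z, studying)[0, 1]    # relevance: should be clearly nonzero
cov_z_eps = np.cov(z, eps)[0, 1]              # exogeneity: should be close to zero

# The source of the endogeneity problem: studying covaries with the error
cov_studying_eps = np.cov(studying, eps)[0, 1]

print(f"Cov[z, studying]   = {cov_z_studying:.3f}")
print(f"Cov[z, eps]        = {cov_z_eps:.3f}")
print(f"Cov[studying, eps] = {cov_studying_eps:.3f}")
```

In this simulation `z` satisfies both conditions by construction, while `studying` itself does not satisfy the exogeneity condition, which is why an instrument is needed.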

Looking at the variables in our dataset, `school_GP`, `goout`, and `male` are valid instruments we can use for our analysis. In addition, our two-way frequency table showed that 49 students reported choosing their school because of its proximity to home while also having a commute time greater than 15 minutes. We use membership in this subset as an additional, constructed instrument. The subset excludes students who may have picked the school for reasons that directly affect their test scores (e.g., picking a school because of "Courses" might correlate with test scores), so it should affect test scores only indirectly, through study time.

Below we use these instruments in a two-stage least squares (2SLS) regression.
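
To make the two stages concrete, here is a minimal sketch of 2SLS computed by hand with NumPy on simulated data. Everything in it is an assumption for illustration: the variables `z`, `u`, `x`, `y`, the helper `ols`, and the true effect of 2.0 are hypothetical and unrelated to the student dataset. Stage 1 regresses the endogenous regressor on the instrument, and stage 2 regresses the outcome on the stage 1 fitted values.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Hypothetical simulated data (assumed for illustration only)
z = rng.normal(size=n)                              # instrument
u = rng.normal(size=n)                              # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)                # endogenous regressor
y = 1.0 + 2.0 * x + 3.0 * u + rng.normal(size=n)    # outcome; true effect of x is 2.0


def ols(X, target):
    """Return least-squares coefficients for target regressed on the columns of X."""
    return np.linalg.lstsq(X, target, rcond=None)[0]


const = np.ones(n)

# Naive OLS: biased because x is correlated with the omitted confounder u
beta_ols = ols(np.column_stack([const, x]), y)

# Stage 1: regress the endogenous regressor on the instrument, keep fitted values
Z = np.column_stack([const, z])
x_hat = Z @ ols(Z, x)

# Stage 2: regress the outcome on the fitted values from stage 1
beta_2sls = ols(np.column_stack([const, x_hat]), y)

print("Naive OLS slope:", round(beta_ols[1], 3))
print("2SLS slope:     ", round(beta_2sls[1], 3))
```

The point estimates from this manual procedure match what a 2SLS routine computes, but the standard errors from a plain second-stage OLS would be incorrect, which is one reason to use a dedicated estimator such as `IV2SLS` for inference.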


In [2]:
# Loading the libraries we will use and setting global options

# Data manipulation and math/stats functions
import numpy as np
np.set_printoptions(suppress=True)
import pandas as pd
import statsmodels.api as sm
#!pip install linearmodels
from linearmodels.iv import IV2SLS 

# Plotting preferences
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import seaborn as sns

# Suppressing warnings
import warnings
warnings.filterwarnings(action = "ignore")

# Import self-made functions
from p3functions import *

In [9]:
# Load the cleaned data and the results from our naive OLS fit
student_perf = pd.read_pickle('data/student_por_v3.pkl')
naive_ols = sm.load('results/Naive_OLS.pickle')

In [8]:
# Running the 2SLS model

# Dependent variable
Y = student_perf.G3_perc

# Exogenous variables
X_exog = student_perf[['age', 'urban', 'fam_small', 'fam_split', 'school_GP', 'Medu', 'Fedu', 'mother', 'father', 'traveltime', 'freetime', 'failures', 'school_sup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'Dalc', 'Walc', 'health', 'absences']]
X_exog = sm.add_constant(X_exog)

# Endogenous variable
X_endog = student_perf.studytime

# Instruments
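# 1 if the student chose the school for its proximity to home AND has traveltime > 1
# (a commute longer than 15 minutes), else 0
student_perf['artificial_instrument'] = ((student_perf.reason == 'home') & (student_perf.traveltime > 1)).astype(int)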
Z = student_perf[['goout', 'male', 'artificial_instrument']]

model = IV2SLS(Y, X_exog, X_endog, Z)
results = model.fit()
print(results.summary)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                G3_perc   R-squared:                      0.0431
Estimator:                    IV-2SLS   Adj. R-squared:                 0.0031
No. Observations:                 649   F-statistic:                    232.05
Date:                Wed, Dec 06 2017   P-value (F-stat)                0.0000
Time:                        23:11:02   Distribution:                 chi2(26)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          0.4534     0.2234     2.0293     0.0424      0.0155      0.8914
age            0.0005     0.0119     0.0443     0.96

In [16]:
np.mean(naive_ols.pvalues)

0.22334223263062752

In [17]:
np.mean(results.pvalues)

0.40060336984890871

In [28]:
# Saving the 2SLS model results to our results directory
import pickle
with open('results/2SLS.pkl','wb') as f:
    pickle.dump(results,f)