# Model Fitting

We want to estimate the causal effect of studying on grades, so we will use econometric techniques aimed at causal inference. We will first run a naive OLS fit, and then demonstrate why it is inappropriate in this context.

### Naive OLS Fit

The naive approach would be to fit a "kitchen sink" OLS regression to these data. So let's see what this regression would yield, and then address the plausibility of the results.

*Note: We are using MacKinnon and White's (1985) HC3 heteroskedasticity robust covariance estimator*
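To make the note above concrete, here is a minimal NumPy sketch of what an HC3 covariance computes, compared with the classical OLS covariance. The data are simulated and every name in it (`x`, `e`, `se_hc3`, and so on) is hypothetical; it is only meant to illustrate the estimator that `cov_type='HC3'` requests, not to reproduce the regression in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
# Simulated heteroskedastic errors: the error spread grows with |x|
e = rng.normal(size=n) * np.abs(x)
y = 1.0 + 0.5 * x + e

# OLS by the normal equations
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Classical covariance: s^2 (X'X)^-1
s2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(s2 * XtX_inv))

# HC3 covariance: (X'X)^-1 X' diag(e_i^2 / (1 - h_ii)^2) X (X'X)^-1
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # leverages h_ii
omega = resid**2 / (1.0 - h)**2
meat = X.T @ (X * omega[:, None])
se_hc3 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("classical SE:", se_classical)
print("HC3 SE:      ", se_hc3)
```

In this simulated setup the error variance rises with the regressor, so the HC3 standard error on the slope should come out larger than the classical one. With homoskedastic errors the two would be close.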

In [36]:
# Loading the libraries we will use and setting global options

# Data manipulation and math/stats functions
import numpy as np
np.set_printoptions(suppress=True)
import pandas as pd
import statsmodels.api as sm
#!pip install linearmodels
from linearmodels.iv import IV2SLS 

# Plotting preferences
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import seaborn as sns

# Suppressing warnings
import warnings
warnings.simplefilter(action = "ignore")

# Import self-made functions
from p3functions import *

In [37]:
# Loading the data
student_perf = pd.read_pickle('data/student_por_v2.pkl')

In [38]:
# Data formatting - converting strings to indicators
indicator_names = {
    'school_GP': ('school', 'GP'),
    'male': ('sex', 'M'),
    'urban': ('address', 'U'),
    'fam_small': ('famsize', 'LE3'),
    'fam_split': ('Pstatus', 'A'),
    'no_parent': ('guardian', 'other'),
    'father': ('guardian', 'father'),
    'mother': ('guardian', 'mother'),
    'school_sup': ('schoolsup', 'yes'),
    'famsup': ('famsup', 'yes'),
    'paid': ('paid', 'yes'),
    'activities': ('activities', 'yes'),
    'nursery': ('nursery', 'yes'),
    'higher': ('higher', 'yes'),
    'internet': ('internet', 'yes'),
    'romantic': ('romantic', 'yes')
}
make_indicators(student_perf, indicator_names)

# Converting G3 to percent
student_perf['G3_perc'] = student_perf.G3 / 12

In [42]:
# Running the OLS model
Y = student_perf.G3_perc
X = student_perf[[
    'studytime', 'school_GP', 'male', 'age', 'urban', 'fam_small',
    'fam_split', 'Medu', 'Fedu', 'no_parent', 'mother', 'father',
    'traveltime', 'freetime', 'failures', 'school_sup', 'famsup', 'paid',
    'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel',
    'goout', 'Dalc', 'Walc', 'health', 'absences'
]]
X = sm.add_constant(X)
model = sm.OLS(Y, X)
results = model.fit(cov_type='HC3')
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                G3_perc   R-squared:                       0.347
Model:                            OLS   Adj. R-squared:                  0.317
Method:                 Least Squares   F-statistic:                     648.2
Date:                Wed, 06 Dec 2017   Prob (F-statistic):               0.00
Time:                        14:54:49   Log-Likelihood:                 69.417
No. Observations:                 649   AIC:                            -80.83
Df Residuals:                     620   BIC:                             48.95
Df Model:                          28                                         
Covariance Type:                  HC3                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.4336      0.121      3.572      0.0

These results suggest that, within this population, an additional hour of studying per week would "result" in an increase of 3 percentage points in the final grade (with high statistical significance). But this conclusion is not necessarily correct.

Many of the coefficients are not statistically significant, and our X matrix is close to singular (which, as the warning in the `statsmodels` summary explains, suggests a problem with multicollinearity). While this is not necessarily a cause for concern, it might cast some suspicion on the results.

This result is unsurprising: the regression includes proxies for unobserved ability alongside our variable of interest, hours of studying. This setup implicitly assumes that hours of studying and ability are independent. Unfortunately, the validity of this assumption is questionable. It is certainly plausible that students with higher ability enjoy studying more (perhaps because they find it less difficult), so that studying carries a less negative coefficient in their utility maximization and they study for longer. The independence assumption is therefore quite strong, and not necessarily valid. If it fails, the dependence between ability and studying is a source of serious multicollinearity.

On the other hand, our low $\mathrm{R}^2$ suggests that multicollinearity might not be an issue here, and that we probably just need more data. We could try to address potential multicollinearity by dropping some variables, but it is generally better to report the full model and point out the potential issue, because dropping variables could cause a more egregious problem through confounding. The ideal solution would be to collect more data, but that is not possible here. So while multicollinearity might be present, we can't do much about it.
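One mechanical source of a near-singular design matrix is worth checking alongside the argument above. The regression includes `no_parent`, `mother`, and `father` together with a constant. If `guardian` takes only the values `mother`, `father`, and `other` (an assumption about the data, which are not shown here), those three indicators sum to the constant column. The summary reports 28 model degrees of freedom for 29 regressors, which is consistent with one redundant column. The sketch below uses simulated, hypothetical data to show how a rank and condition-number check would flag this.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 649

# Hypothetical guardian categories standing in for the real column
guardian = rng.choice(["mother", "father", "other"], size=n, p=[0.7, 0.2, 0.1])
mother = (guardian == "mother").astype(float)
father = (guardian == "father").astype(float)
no_parent = (guardian == "other").astype(float)
age = rng.integers(15, 23, size=n).astype(float)  # simulated ages 15..22
const = np.ones(n)

# All three indicators plus a constant: mother + father + no_parent == const
X_full = np.column_stack([const, mother, father, no_parent, age])
# Same design with one indicator left out as the reference category
X_drop = np.column_stack([const, mother, father, age])

rank_full = np.linalg.matrix_rank(X_full)
rank_drop = np.linalg.matrix_rank(X_drop)
cond_full = np.linalg.cond(X_full)
cond_drop = np.linalg.cond(X_drop)

print("full design: columns", X_full.shape[1], "rank", rank_full, "cond", cond_full)
print("drop one:    columns", X_drop.shape[1], "rank", rank_drop, "cond", cond_drop)
```

If the real data behave like this sketch, leaving out one guardian indicator as a reference category would remove an exact redundancy without discarding any information, which is a different matter from dropping substantive controls.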

There is another, more serious issue with the naive approach of running an OLS fit on the data: endogeneity. In our model we have attempted to use proxies to identify ability, but what if our proxies do not fully identify it? Then the covariance between our $x_k$'s and the error term is non-zero: we have endogenous variables. This endogeneity means that OLS is not a consistent estimator for our vector of $\beta$'s! We have the following:
$$
\mathrm{b}_k \stackrel{p}{\longrightarrow} \beta_k + \gamma \frac{\mathrm{Cov}[q, x_k]}{\mathrm{Var}[x_k]}
$$
where $q$ is the unidentified portion of ability, $\gamma$ is the true coefficient on $q$ (if we could identify it), $\mathrm{b}_k$ is the OLS estimated coefficient on $x_k$, and $\beta_k$ is the true coefficient on $x_k$. Notice that $q$ is determined by how well our proxies identify ability. The better our proxies identify ability, the smaller the bias term $\gamma \frac{\mathrm{Cov}[q, x_k]}{\mathrm{Var}[x_k]}$ (the bias arising from our endogeneity problem) will be.
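The probability limit above can be checked with a small simulation. This is a sketch under stated assumptions: a single regressor `x`, an omitted variable `q` correlated with it, and hypothetical values for `beta` and `gamma`. None of these numbers come from the student data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

beta, gamma = 0.5, 1.0            # hypothetical true coefficients
q = rng.normal(size=n)            # omitted part of "ability"
x = 0.6 * q + rng.normal(size=n)  # regressor correlated with q
y = beta * x + gamma * q + rng.normal(size=n)

# OLS slope from regressing y on x alone (q is left in the error term)
b = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Limit implied by the formula: beta + gamma * Cov[q, x] / Var[x]
predicted = beta + gamma * np.cov(q, x)[0, 1] / np.var(x, ddof=1)

print("OLS slope with q omitted:", b)
print("beta + bias term:        ", predicted)
print("true beta:               ", beta)
```

In this setup the population bias is $\gamma \cdot 0.6 / 1.36 \approx 0.44$, so the estimated slope lands well above the true $\beta = 0.5$ and stays there no matter how large the sample grows.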

In many settings, researchers assume that $\mathrm{Cov}[q, x_k] = 0$ for every regressor except the variable of interest (in our case, hours of studying). This assumption is plausible in our setting too, all the more so because we have already argued that study time and ability might be dependent (and that part of ability is in the error term). This reduces our problem to identifying the true coefficient on the endogenous variable, hours of studying. We should note that since we do not know which part of "ability" is being omitted, we cannot determine the sign of $\mathrm{Cov}[q, \text{studying}]$, and so we cannot guess in which direction our OLS estimated coefficient is biased.

In [43]:
# Saving our cleaned dataset and the results of our naive OLS fit
student_perf.to_pickle('data/student_por_v3.pkl')
results.save('results/Naive_OLS.pickle')