In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np

from scipy.stats import f, chi2

%matplotlib inline

# MOOC Econometrics
## Test Exercise 4

A challenging and very relevant economic problem is the measurement of the returns to schooling. In this question
we will use the following variables on 3010 US men:

- `logw`: log wage
- `educ`: number of years of schooling
- `age`: age of the individual in years
- `exper`: working experience in years
- `smsa`: dummy indicating whether the individual lived in a metropolitan area
- `south`: dummy indicating whether the individual lived in the south
- `nearc`: dummy indicating whether the individual lived near a 4-year college
- `dadeduc`: education of the individual's father (in years; column `daded` in the data file)
- `momeduc`: education of the individual's mother (in years; column `momed` in the data file)

This data is a selection of the data used by D. Card (1995) (“Using Geographic Variation in College Proximity to Estimate the Return to Schooling”).

In [2]:
wage = pd.read_csv('TestExer4_Wage-round1.txt')
wage.head()

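| | logw | educ | age | exper | smsa | south | nearc | daded | momed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.306275 | 7 | 29 | 16 | 1 | 0 | 0 | 9.94 | 10.25 |
| 1 | 6.175867 | 12 | 27 | 9 | 1 | 0 | 0 | 8.0 | 8.0 |
| 2 | 6.580639 | 12 | 34 | 16 | 1 | 0 | 0 | 14.0 | 12.0 |
| 3 | 5.521461 | 11 | 27 | 10 | 1 | 0 | 1 | 11.0 | 12.0 |
| 4 | 6.591674 | 12 | 34 | 16 | 1 | 0 | 1 | 8.0 | 7.0 |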


**(a)** Use OLS to estimate the parameters of the model

$$\text{logw} = β_1 + β_2 \, \text{educ} + β_3 \, \text{exper} + β_4 \, \text{exper}^2 + β_5 \, \text{smsa} + β_6 \, \text{south} + ε.$$

Give an interpretation to the estimated $β_2$ coefficient.

In [3]:
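def small_summary(model, decimals=2):
    """Return a compact table with estimate, SE, t-value and p-value of a fitted model."""
    # `decimals` avoids shadowing the built-in round()
    df = pd.DataFrame({'$\\beta$': model.params.round(decimals),
                       'SE': model.bse.round(decimals),
                       '$t$': model.tvalues.round(decimals),
                       '$p$-val': model.pvalues.round(decimals)},
                      columns=['$\\beta$', 'SE', '$t$', '$p$-val'])
    return df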

In [4]:
wage['exper2'] = wage.exper ** 2

# Dependent variable and regressors (with an intercept column)
logw = wage['logw']
X = sm.add_constant(wage[['educ', 'exper', 'exper2', 'smsa', 'south']])

model_ols = sm.OLS(logw, X).fit()
small_summary(model_ols)

| | $\beta$ | SE | $t$ | $p$-val |
|---|---|---|---|---|
| const | 4.61 | 0.07 | 67.91 | 0 |
| educ | 0.08 | 0.0 | 23.31 | 0 |
| exper | 0.08 | 0.01 | 12.38 | 0 |
| exper2 | -0.0 | 0.0 | -6.8 | 0 |
| smsa | 0.15 | 0.02 | 9.52 | 0 |
| south | -0.18 | 0.01 | -11.96 | 0 |


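$\beta_2$ can be interpreted as follows: holding experience, `smsa` and `south` constant, one additional year of education is associated with a wage that is approximately 8% higher (the coefficient on `educ` is 0.08 and the dependent variable is the log wage).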
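
The "8%" reading relies on the approximation $e^{\beta} - 1 \approx \beta$ for small $\beta$. As a quick check, the sketch below compares the approximate and the exact percentage effect, using the unrounded OLS coefficient on `educ` (0.08158) reported later in this notebook; the variable names are only illustrative.

```python
import math

beta_educ = 0.08158  # OLS coefficient on educ, as reported in this notebook

# In a log-linear model, one extra unit of x multiplies the outcome by exp(beta)
approx_pct = beta_educ * 100
exact_pct = (math.exp(beta_educ) - 1) * 100

print('approximate effect: %.2f%%' % approx_pct)
print('exact effect:       %.2f%%' % exact_pct)
```

The exact effect is slightly larger than the approximation, and the gap grows with the size of the coefficient.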

**(b)** OLS may be inconsistent in this case as `educ` and `exper` may be endogenous. 

- Give a reason why this may be the case. 
- Also indicate whether the estimate in part **(a)** is still useful.

Predictors `educ` and `exper` may be endogenous because of omitted factors: variables we do not observe that influence both the regressors and the wage, and therefore end up in the error term $ε$. For example:
- `parent_income` - education in the US is not free, so whether the parents can afford tuition influences `educ`, and family background may also affect the wage directly
- `area` - the field in which the individual works; in some fields (e.g. medicine) people study longer but earn a higher salary afterwards, even with fewer years of experience

In the rows shown above, `exper` equals `age - educ - 6`, so if `educ` is endogenous, `exper` and `exper2` are as well.

As a result, the OLS estimators of $\beta_2$, $\beta_3$ and $\beta_4$ may be inconsistent and need not recover the causal effects of schooling and experience.

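The estimates from part **(a)** therefore should not be read as causal effects. They still describe the association between schooling and wages in the sample, so they remain usable for prediction and as a benchmark for the 2SLS estimates.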
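
To illustrate how an omitted factor distorts OLS, here is a minimal simulation on synthetic data. Everything in it is an assumption made for illustration: the unobserved variable `ability`, the coefficients of the data-generating process, and the helper `ols` are hypothetical and are not taken from the wage data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Hypothetical data-generating process (all numbers are illustrative)
ability = rng.normal(size=n)                     # unobserved factor
educ = 12 + 2 * ability + rng.normal(size=n)     # schooling depends on ability
logw = 4.5 + 0.08 * educ + 0.10 * ability + rng.normal(scale=0.3, size=n)

def ols(y, *columns):
    """Return OLS coefficients of y on a constant and the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(logw, educ)            # ability omitted
b_long = ols(logw, educ, ability)    # ability included

print('educ coefficient, ability omitted:  %.3f' % b_short[1])
print('educ coefficient, ability included: %.3f' % b_long[1])
```

With `ability` left out, the coefficient on `educ` absorbs part of its effect and overstates the true value of 0.08 used in the simulation.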

**(c)** Give a motivation why $\text{age}$ and $\text{age}^2$ can be used as instruments for $\text{exper}$ and $\text{exper}^2$.

Age is relevant: the older a person is, the more years of experience they tend to have. Age is also plausibly exogenous: it is not chosen by the individual, and it should affect the wage only through experience, not directly.

**(d)** Run the first-stage regression for `educ` for the two-stage least squares estimation of the parameters in the
model above when `age`, `age2`, `nearc`, `dadeduc`, and `momeduc` are used as additional instruments.

What do you conclude about the suitability of these instruments for schooling?

In [5]:
wage['age2'] = wage.age ** 2


In [6]:
educ = wage['educ']
# Additional instruments only; the exogenous regressors smsa and south are not included here
Z = sm.add_constant(wage[['age', 'age2', 'nearc', 'daded', 'momed']])

model_1stage = sm.OLS(educ, Z).fit()
small_summary(model_1stage)

| | $\beta$ | SE | $t$ | $p$-val |
|---|---|---|---|---|
| const | -5.92 | 4.01 | -1.48 | 0.14 |
| age | 0.99 | 0.28 | 3.53 | 0.0 |
| age2 | -0.02 | 0.0 | -3.5 | 0.0 |
| nearc | 0.53 | 0.09 | 5.7 | 0.0 |
| daded | 0.2 | 0.02 | 12.9 | 0.0 |
| momed | 0.25 | 0.02 | 14.58 | 0.0 |


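All five instruments are individually significant in the first-stage regression (the smallest $|t|$ is 3.5, for `age2`), so they are relevant for predicting `educ`. The parental education variables `daded` and `momed` are the strongest, followed by `nearc`. Significance only establishes relevance; whether the instruments are also exogenous is a separate question, addressed in part **(f)**.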
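
Individual $t$-tests are one check of relevance; a common complement is a joint $F$-test of all instruments in the first stage, with $F > 10$ as a widely used rule of thumb. The sketch below shows the computation on synthetic data, comparing a restricted regression (constant only) with the full first stage. The variables `z1`, `z2`, `x` and the helper `rss` are hypothetical stand-ins, not the wage data.

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(1)
n = 3000

# Synthetic stand-ins for two instruments and one endogenous regressor
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
x = 1.0 + 0.5 * z1 + 0.3 * z2 + rng.normal(size=n)

def rss(y, X):
    """Residual sum of squares from regressing y on X."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    return resid @ resid

const = np.ones((n, 1))
X_restricted = const                           # first stage without instruments
X_full = np.column_stack([const, z1, z2])      # first stage with instruments

q = X_full.shape[1] - X_restricted.shape[1]    # number of tested restrictions
df_resid = n - X_full.shape[1]

F = ((rss(x, X_restricted) - rss(x, X_full)) / q) / (rss(x, X_full) / df_resid)
p_value = f.sf(F, q, df_resid)

print('F = %.1f, p-value = %.3g' % (F, p_value))
```

When the model contains exogenous regressors, they would appear in both the restricted and the full regression, so that only the additional instruments are tested.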

**(e)** Estimate the parameters of the model for log wage using two-stage least squares. Compare your result to the
estimate in part **(a)**.

Let's also fit `exper`:

In [15]:
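exper = wage['exper']
model_1stage_exper = sm.OLS(exper, Z).fit()

# Fitted values from the first-stage regressions
wage['exper_fit'] = model_1stage_exper.fittedvalues
# Note: this is the square of the fitted exper, not the fitted value of exper2 from its own first stage
wage['exper_fit2'] = model_1stage_exper.fittedvalues ** 2
wage['educ_fit'] = model_1stage.fittedvalues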

In [16]:
logw = wage['logw']
X_2sls = sm.add_constant(wage[['educ_fit', 'exper_fit', 'exper_fit2', 'smsa', 'south']])

model_2stage = sm.OLS(logw, X_2sls).fit()
pd.DataFrame({'$\\beta_{2sls}$': model_2stage.params,
              '$\\beta_{ols}$': model_ols.params})

| | $\beta_{2sls}$ | $\beta_{ols}$ |
|---|---|---|
| const | 4.396964 | 4.611014 |
| educ | | 0.08158 |
| educ_fit | 0.096303 | |
| exper | | 0.083836 |
| exper2 | | -0.002202 |
| exper_fit | 0.08534 | |
| exper_fit2 | -0.002324 | |
| smsa | 0.163446 | 0.150801 |
| south | -0.186798 | -0.175176 |


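With 2SLS the estimated coefficient on education is 0.096, against 0.082 with OLS, so the estimated return to schooling rises from about 8% to about 10% per year. The experience coefficients barely change (0.085 vs 0.084 for `exper`, -0.0023 vs -0.0022 for `exper2`).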
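
For reference, textbook 2SLS projects every regressor, including squared terms, on the full instrument set, which also contains the exogenous regressors. The following is a minimal sketch of that procedure on synthetic data, not a re-estimation of the wage model: the variables `z1`, `z2`, `w`, `x`, `u` and all coefficients are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# Synthetic data: x is endogenous because it shares the unobserved
# component u with the error term; z1 and z2 are valid instruments.
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
u = rng.normal(size=n)
w = rng.normal(size=n)                       # exogenous regressor
x = 0.8 * z1 + 0.5 * z2 + u + rng.normal(size=n)
y = 1.0 + 0.5 * x + 0.3 * w + u + rng.normal(size=n)

const = np.ones(n)
X = np.column_stack([const, x, w])           # regressors (x is endogenous)
Z = np.column_stack([const, w, z1, z2])      # instruments, incl. exogenous regressors

# First stage: project every column of X on Z
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]

# Second stage: regress y on the projected regressors
b_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# 2SLS residuals are computed with the original X, not with X_hat
resid_2sls = y - X @ b_2sls

print('coefficient on x, OLS:  %.3f' % b_ols[1])
print('coefficient on x, 2SLS: %.3f' % b_2sls[1])
```

In this simulation the true coefficient on `x` is 0.5; OLS overstates it because of the shared component `u`, while 2SLS lands close to the true value.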

**(f)** Perform the Sargan test for validity of the instruments. What is your conclusion?

In [23]:
logw = wage['logw']
X = sm.add_constant(wage[['educ', 'exper', 'exper2', 'smsa', 'south']])

# 2SLS residuals: combine the original regressors with the 2SLS coefficients
logw_predicted = X.values.dot(model_2stage.params.values)

e = logw - logw_predicted

In [24]:
Z = sm.add_constant(wage[['age', 'age2', 'nearc', 'daded', 'momed']])
# Auxiliary regression of the 2SLS residuals on the instruments
sargan_model = sm.OLS(e, Z).fit()

print('R2 = %0.4f' % sargan_model.rsquared)
small_summary(sargan_model)

R2 = 0.0013


| | $\beta$ | SE | $t$ | $p$-val |
|---|---|---|---|---|
| const | 0.61 | 0.66 | 0.94 | 0.35 |
| age | -0.04 | 0.05 | -0.93 | 0.35 |
| age2 | 0.0 | 0.0 | 0.96 | 0.34 |
| nearc | -0.0 | 0.02 | -0.2 | 0.84 |
| daded | -0.0 | 0.0 | -1.6 | 0.11 |
| momed | 0.0 | 0.0 | 1.32 | 0.19 |


None of the variables seems significant and $R^2$ is very small

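Under the null hypothesis that the instruments are valid, $n \cdot R^2$ is asymptotically distributed as $\chi^2(m - k)$, where $m$ is the number of instruments and $k$ is the number of explanatory variables.

- $m = 8$: const, smsa, south (exogenous regressors, which serve as their own instruments), age, age2, nearc, daded, momed
- $k = 6$: const, educ, exper, exper2, smsa, south

Since $m - k = 2 > 0$, the model is over-identified and the Sargan test can be performed. With $n = 3010$ and $R^2 = 0.0013$ we get $n \cdot R^2 \approx 3.9$, which is below the 5% critical value of $\chi^2(2)$ (5.99), so the validity of the instruments is not rejected.

Note that the auxiliary regression above leaves `smsa` and `south` out of `Z`, so the reported $R^2$ only approximates the one the full instrument set would give. The large $p$-values of the individual coefficients and the small $R^2$ point in the same direction: the instruments explain almost none of the variation in the 2SLS residuals.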
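
Counting the exogenous regressors `smsa` and `south` as instruments for themselves, which is the usual convention, gives $m = 8$ and $m - k = 2$ degrees of freedom. Under that assumption, the test statistic and its $p$-value can be evaluated from the numbers reported in this section. This is a sketch based on the rounded $R^2$, so the result is approximate.

```python
from scipy.stats import chi2

n = 3010            # sample size stated in the exercise
r_squared = 0.0013  # R^2 of the auxiliary regression reported above (rounded)
m = 8               # assumption: smsa and south are counted as instruments too
k = 6               # explanatory variables, including the constant

sargan_stat = n * r_squared
dof = m - k
p_value = chi2.sf(sargan_stat, dof)       # upper-tail probability
critical_5pct = chi2.ppf(0.95, dof)       # 5% critical value

print('n*R2 = %.2f, df = %d, p-value = %.3f, 5%% critical value = %.2f'
      % (sargan_stat, dof, p_value, critical_5pct))
```

A $p$-value above 0.05 means the null hypothesis of valid instruments is not rejected at the 5% level.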