# 7. Angrist-Krueger (1991) Replication

To do:

1. Section 2: add year dummies
2. Section 3: add state dummies
3. Section 4: define functions and run

### 1. Identifying assumption

The study **assumes** that the instruments (season of birth interacted with birth year) are valid. This means that the variation in education induced by season of birth interacted with birth year can be used to estimate the effect of education, which is endogenous in the wage equation, on wages. Validity requires that the instruments affect wages only through education, so they are uncorrelated with the error term of the wage equation. In other terms, $$E[z'u] = 0$$
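Written out as a two-equation system, the assumption looks as follows. The notation is assumed for illustration and is not taken from the paper: $w_i$ is the weekly wage, $s_i$ is years of education, $x_i$ collects the included controls, and $z_i$ collects the excluded instruments.

$$\log w_i = x_i'\delta + \rho\, s_i + u_i$$

$$s_i = x_i'\theta + z_i'\phi + v_i$$

Education is endogenous when $u_i$ and $v_i$ are correlated. TSLS identifies $\rho$ if the instruments are exogenous, $E[z_i u_i] = 0$, and relevant, $\phi \neq 0$.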


### 2. TSLS - Table 5

In [5]:
import pandas as pd

import numpy as np
from scipy import stats
from matplotlib import style
from matplotlib import pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from linearmodels.iv import IV2SLS

%matplotlib inline

In [8]:
df = pd.read_stata('/Users/rachelhaswell/Documents/Spring 2023/Metrics ARE 212/Problem sets/2nd half/PS2/angrist-krueger91.dta')

**Column 2**

The excluded instruments from the wage equation in the
TSLS estimates are three quarter-of-birth dummies interacted
with nine year-of-birth dummies.
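To make the role of excluded instruments concrete, here is a minimal sketch of TSLS computed by hand on simulated data. Everything in it is an assumption for illustration: the function name `tsls`, the three synthetic instruments, the unobserved `ability` confounder, and the true return of 0.1. It does not use the census extract.

```python
import numpy as np


def tsls(y, X_exog, x_endog, Z_excl):
    """TSLS by hand: first-stage fitted values, then OLS of y on them."""
    W = np.column_stack([X_exog, Z_excl])   # all instruments: controls + excluded
    X = np.column_stack([X_exog, x_endog])  # regressors in the outcome equation
    # First stage: project every regressor on the full instrument set
    X_hat = W @ np.linalg.lstsq(W, X, rcond=None)[0]
    # Second stage: regress y on the fitted regressors
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]


rng = np.random.default_rng(1)
n = 20_000
z = rng.normal(size=(n, 3))          # hypothetical excluded instruments
ability = rng.normal(size=n)         # unobserved confounder
edu = z @ np.array([0.5, 0.3, 0.2]) + ability + rng.normal(size=n)
logwage = 1.0 + 0.1 * edu + ability + rng.normal(size=n)
const = np.ones((n, 1))

beta_tsls = tsls(logwage, const, edu, z)
beta_ols = np.linalg.lstsq(np.column_stack([const, edu]), logwage, rcond=None)[0]
print("TSLS slope:", beta_tsls[1], "OLS slope:", beta_ols[1])
```

In this simulation OLS overstates the return because `ability` raises both education and wages, while TSLS uses only the variation in education driven by `z`.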

In [9]:
# create quarter and year dummies 
qob_dummies = pd.get_dummies(df['qob'], prefix='qob')
df = pd.concat([df, qob_dummies], axis=1)
yob_dummies = pd.get_dummies(df['yob'], prefix='yob')
df = pd.concat([df, yob_dummies], axis=1)
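Dummies like these can be turned into explicit interaction columns before a formula is written. In Patsy-style formula languages, `a*b` expands to `a + b + a:b` and `-` removes a term, so neither works as a range over columns. The sketch below runs on a small synthetic frame and uses hypothetical names throughout (`toy`, `inter`, the `_x_` naming scheme). Two choices in it are assumptions: quarter 4 is the omitted quarter, and the year dummies enter as included controls with one base year dropped. Adjust `years` to interact with fewer years.

```python
import pandas as pd

# Synthetic stand-in for the data: every quarter/year pair appears five times
toy = pd.DataFrame(
    [(q, y) for q in range(1, 5) for y in range(1930, 1940)] * 5,
    columns=["qob", "yob"],
)
qob_d = pd.get_dummies(toy["qob"], prefix="qob", dtype=float)
yob_d = pd.get_dummies(toy["yob"], prefix="yob", dtype=float)

quarters = ["qob_1", "qob_2", "qob_3"]  # quarter 4 is the omitted category
years = list(yob_d.columns)             # years to interact with (assumed: all)

# One column per (quarter, year) pair
inter = pd.DataFrame(
    {f"{q}_x_{y}": qob_d[q] * yob_d[y] for q in quarters for y in years}
)

# Assemble a formula string with explicit term lists
instr_terms = " + ".join(inter.columns)
control_terms = " + ".join(years[:-1])  # drop one year as the base category
formula = f"logwage ~ 1 + {control_terms} + [edu ~ {instr_terms}]"
print(len(inter.columns), "interaction columns")
```

The resulting string has not been run against the real data here, so treat it as a starting point rather than a verified specification.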

In [44]:
formula_2 = 'logwage ~ 1 + [edu ~ (qob_1*yob_1930 - qob_1*yob_1939) + (qob_2*yob_1930 - qob_2*yob_1939) + (qob_3*yob_1930 - qob_3*yob_1939) + (yob_1930 - yob_1938)]'
mod_2 = IV2SLS.from_formula(formula_2, df).fit(cov_type='unadjusted')

In [45]:
mod_2.summary

| | | | |
|---|---|---|---|
| Dep. Variable: | logwage | R-squared: | 0.0054 |
| Estimator: | IV-2SLS | Adj. R-squared: | 0.0054 |
| No. Observations: | 329509 | F-statistic: | 0.0248 |
| Date: | Sun, Apr 16 2023 | P-value (F-stat) | 0.8748 |
| Time: | 23:28:37 | Distribution: | chi2(1) |
| Cov. Estimator: | unadjusted | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| Intercept | 5.8790 | 0.1332 | 44.141 | 0.0000 | 5.6179 | 6.1400 |
| edu | 0.0016 | 0.0104 | 0.1575 | 0.8748 | -0.0188 | 0.0221 |


**Column 4**

In [10]:
df['age2'] = df['ageq']**2

In [61]:
formula_4 = 'logwage ~ 1 + ageq + age2 + [edu ~ (qob_1*yob_1930 - qob_1*yob_1939) + (qob_2*yob_1930 - qob_2*yob_1939) + (qob_3*yob_1930 - qob_3*yob_1939) + (yob_1930 - yob_1938)]'
mod_4 = IV2SLS.from_formula(formula_4, df).fit(cov_type='unadjusted')

In [62]:
mod_4.summary

| | | | |
|---|---|---|---|
| Dep. Variable: | logwage | R-squared: | 0.1127 |
| Estimator: | IV-2SLS | Adj. R-squared: | 0.1127 |
| No. Observations: | 329509 | F-statistic: | 10.653 |
| Date: | Sun, Apr 16 2023 | P-value (F-stat) | 0.0138 |
| Time: | 23:36:10 | Distribution: | chi2(3) |
| Cov. Estimator: | unadjusted | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| Intercept | 4.9287 | 0.5163 | 9.5459 | 0.0000 | 3.9168 | 5.9407 |
| age2 | 0.0002 | 0.0001 | 1.2576 | 0.2086 | -0.0001 | 0.0005 |
| ageq | -0.0112 | 0.0134 | -0.8415 | 0.4001 | -0.0374 | 0.0150 |
| edu | 0.0858 | 0.0296 | 2.8977 | 0.0038 | 0.0278 | 0.1438 |


**Column 6** 

In [63]:
formula_6 = 'logwage ~ 1 + black + smsa + married + [edu ~ (qob_1*yob_1930 - qob_1*yob_1939) + (qob_2*yob_1930 - qob_2*yob_1939) + (qob_3*yob_1930 - qob_3*yob_1939) + (yob_1930 - yob_1938)]'
mod_6 = IV2SLS.from_formula(formula_6, df).fit(cov_type='unadjusted')

In [64]:
mod_6.summary

| | | | |
|---|---|---|---|
| Dep. Variable: | logwage | R-squared: | 0.1134 |
| Estimator: | IV-2SLS | Adj. R-squared: | 0.1134 |
| No. Observations: | 329509 | F-statistic: | 2.393e+04 |
| Date: | Sun, Apr 16 2023 | P-value (F-stat) | 0.0000 |
| Time: | 23:42:46 | Distribution: | chi2(4) |
| Cov. Estimator: | unadjusted | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| Intercept | 5.2538 | 0.1098 | 47.828 | 0.0000 | 5.0385 | 5.4691 |
| black | -0.3446 | 0.0163 | -21.111 | 0.0000 | -0.3766 | -0.3126 |
| married | 0.2496 | 0.0035 | 70.786 | 0.0000 | 0.2427 | 0.2565 |
| smsa | 0.2470 | 0.0112 | 22.125 | 0.0000 | 0.2251 | 0.2689 |
| edu | 0.0202 | 0.0093 | 2.1789 | 0.0293 | 0.0020 | 0.0384 |


**Column 8** 

In [59]:
formula_8 = 'logwage ~ 1 + black + smsa + married + ageq + age2 + [edu ~ (qob_1*yob_1930 - qob_1*yob_1939) + (qob_2*yob_1930 - qob_2*yob_1939) + (qob_3*yob_1930 - qob_3*yob_1939) + (yob_1930 - yob_1938)]'
mod_8 = IV2SLS.from_formula(formula_8, df).fit(cov_type='unadjusted')

In [60]:
mod_8.summary

| | | | |
|---|---|---|---|
| Dep. Variable: | logwage | R-squared: | 0.1551 |
| Estimator: | IV-2SLS | Adj. R-squared: | 0.1550 |
| No. Observations: | 329509 | F-statistic: | 2.512e+04 |
| Date: | Sun, Apr 16 2023 | P-value (F-stat) | 0.0000 |
| Time: | 23:36:02 | Distribution: | chi2(6) |
| Cov. Estimator: | unadjusted | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| Intercept | 4.6649 | 0.4732 | 9.8580 | 0.0000 | 3.7374 | 5.5924 |
| age2 | 0.0001 | 0.0001 | 0.6844 | 0.4937 | -0.0002 | 0.0004 |
| ageq | -0.0054 | 0.0131 | -0.4122 | 0.6802 | -0.0311 | 0.0203 |
| black | -0.2542 | 0.0522 | -4.8681 | 0.0000 | -0.3566 | -0.1519 |
| married | 0.2412 | 0.0058 | 41.905 | 0.0000 | 0.2299 | 0.2525 |
| smsa | 0.1852 | 0.0357 | 5.1865 | 0.0000 | 0.1152 | 0.2552 |
| edu | 0.0730 | 0.0305 | 2.3975 | 0.0165 | 0.0133 | 0.1327 |


In [None]:
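# Variable groups used in the specifications above
W = df['logwage']                     # dependent variable
X = df[['black', 'smsa', 'married']]  # demographic controls
Y = yob_dummies                       # year-of-birth dummies (open question: keep only 9 and drop a base year?)
E = df['edu']                         # endogenous regressor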

### 3. TSLS - Table 7

**Column 2** 

In [11]:
# create state dummies 
state_dummies = pd.get_dummies(df['state'], prefix='state')
df = pd.concat([df, state_dummies], axis=1)

In [69]:
form_col2 = 'logwage ~ 1 + [edu ~ (qob_1*yob_1930 - qob_1*yob_1939) + (qob_2*yob_1930 - qob_2*yob_1939) + (qob_3*yob_1930 - qob_3*yob_1939) + (yob_1930 - yob_1938) + (state_1 - state_51) ]'
mod_col2 = IV2SLS.from_formula(form_col2, df).fit(cov_type='unadjusted')

In [70]:
mod_col2.summary

| | | | |
|---|---|---|---|
| Dep. Variable: | logwage | R-squared: | 0.0595 |
| Estimator: | IV-2SLS | Adj. R-squared: | 0.0595 |
| No. Observations: | 329509 | F-statistic: | 475.45 |
| Date: | Sun, Apr 16 2023 | P-value (F-stat) | 0.0000 |
| Time: | 23:47:08 | Distribution: | chi2(1) |
| Cov. Estimator: | unadjusted | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| Intercept | 4.3602 | 0.0706 | 61.739 | 0.0000 | 4.2218 | 4.4986 |
| edu | 0.1206 | 0.0055 | 21.805 | 0.0000 | 0.1097 | 0.1314 |


### 4. Chernozhukov and C. Hansen estimator 

**Table 5** 

(1) Again suppose that the true $B = 1$. Write a function which
takes as arguments $(y, x, Z, B_0)$ and which returns the p-value
associated with the hypothesis that every element of $\hat{\gamma}$ is zero (an F-test would be appropriate). Note that this same p-value
characterizes the hypothesis test that $B = B_0$.
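The equivalence between the two hypotheses follows from a short substitution. The model written here, $y = xB + u$ with first stage $x = Z\pi + v$, is assumed from the earlier problem and is not restated in this section.

$$y - B_0 x = Z\gamma + w$$

$$y - B_0 x = (B - B_0)\,x + u = Z\,\pi (B - B_0) + \big[(B - B_0)\,v + u\big]$$

Matching terms gives $\gamma = \pi (B - B_0)$. With $\pi \neq 0$, $\gamma = 0$ holds exactly when $B = B_0$, so the F-test of $\gamma = 0$ is also a test of $B = B_0$.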

In [2]:
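# define functions from # 5
def pval_fun(y, x, Z, B0):
    """
    inputs:
        y: dependent var
        x: endogenous var
        Z: instrument(s)
        B0: hypothesized value of B

    y - B0*x = Z*gamma + w
    return p-value associated with hypothesis that gamma = 0
    """
    # Move B0*x to the left-hand side (patsy formulas have no offset() term)
    lhs = np.asarray(y, dtype=float).ravel() - B0 * np.asarray(x, dtype=float).ravel()
    # Regress y - B0*x on a constant and the instruments
    lm = sm.OLS(lhs, sm.add_constant(np.asarray(Z, dtype=float))).fit()

    # F-test p-value for the hypothesis that all instrument coefficients are zero
    pval = lm.f_pvalue
    return pval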

Using your function and taking $\pi = 1$, estimate $B$ by finding
the value of $B_0$ which delivers maximal p-values. Describe the
bias and precision of this estimator.
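One way to examine bias and precision is a small Monte Carlo experiment. The sketch below is self-contained and rests on assumptions that are not in the problem statement: a single standard-normal instrument, errors with correlation `rho = 0.8`, samples of 500 observations, 100 replications, and a search grid on [0, 2]. The helper names `gamma_pvalue`, `simulate`, and `estimate_B` are hypothetical. The F-test is computed directly with NumPy and SciPy so the block runs on its own.

```python
import numpy as np
from scipy import stats


def gamma_pvalue(y, x, Z, b0):
    """p-value of the F-test that gamma = 0 in y - b0*x = c + Z*gamma + w."""
    n = len(y)
    Z = np.asarray(Z, dtype=float).reshape(n, -1)
    k = Z.shape[1]
    e = y - b0 * x                          # impose the null B = b0
    X1 = np.column_stack([np.ones(n), Z])
    coef = np.linalg.lstsq(X1, e, rcond=None)[0]
    ssr_u = np.sum((e - X1 @ coef) ** 2)    # unrestricted: constant + Z
    ssr_r = np.sum((e - e.mean()) ** 2)     # restricted: constant only
    F = ((ssr_r - ssr_u) / k) / (ssr_u / (n - k - 1))
    return stats.f.sf(F, k, n - k - 1)


def simulate(n, rng, B=1.0, pi=1.0, rho=0.8):
    """One sample from y = B*x + u, x = pi*z + v, with corr(u, v) = rho."""
    z = rng.normal(size=n)
    u, v = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
    x = pi * z + v
    y = B * x + u
    return y, x, z


def estimate_B(y, x, Z, grid):
    """Grid value of b0 with the largest p-value."""
    pvals = np.array([gamma_pvalue(y, x, Z, b0) for b0 in grid])
    return grid[np.argmax(pvals)]


rng = np.random.default_rng(0)
grid = np.linspace(0.0, 2.0, 101)
draws = np.array([estimate_B(*simulate(500, rng), grid) for _ in range(100)])

bias = draws.mean() - 1.0       # average estimate minus the true B
spread = draws.std(ddof=1)      # Monte Carlo standard deviation
print("bias:", bias, "std. dev.:", spread)
```

The grid step limits how finely the estimate can be resolved, so a finer grid or a scalar optimizer may be preferable when precision is the quantity of interest.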

In [None]:
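def B_fun(y, x, Z, grid=None):
    """
    inputs:
        y: dependent var
        x: endogenous var
        Z: instrument
        grid: candidate values of B0 (assumed default: 1001 points on [-5, 5])

    return B0 that maximizes p-value
    """
    if grid is None:
        grid = np.linspace(-5, 5, 1001)
    # Evaluate the p-value at each candidate B0 and keep the maximizer
    pvals = [pval_fun(y, x, Z, B0) for B0 in grid]
    return grid[int(np.argmax(pvals))]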

**Table 7** 

### 5. Confidence intervals

**2SLS**