---
# <font color = blue> Feb 15, 2020 Stata IV, 2SLS, GMM with ivreg pkg
---
* Name: Jikhan Jeong
* Part 1: IV (2SLS, just-identified and over-identified cases)
* Ref: http://www3.grips.ac.jp/~yamanota/yamanoCourses.htm (lecture, code, data source): **ivreg** runs IV/2SLS and reports corrected standard errors
* Part 2: 2SLS and GMM
* Ref: https://www.soderbom.net/metrix2.htm (lecture, code source): **ivreg2** for 2SLS and GMM
* Ref: https://kylebarron.dev/stata_kernel/ (stata kernel)
* Data: CARD.dta | Wooldridge problems 5.4-6.1
---            
    

# Data Preparation
---

In [1]:
use "card.dta", clear

In [4]:
%head 1

Unnamed: 0,id,nearc2,nearc4,educ,age,fatheduc,motheduc,weight,momdad14,sinmom14,step14,reg661,reg662,reg663,reg664,reg665,reg666,reg667,reg668,reg669,south66,black,smsa,south,smsa66,wage,enroll,KWW,IQ,married,libcrd14,exper,lwage,expersq
1,2,0,0,7,29,.,.,158413,1,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,1,548,0,15,.,1,0,16,6.3062754,256


In [5]:
sum


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          id |      3,010    2581.749    1500.539          2       5225
      nearc2 |      3,010    .4408638    .4965731          0          1
      nearc4 |      3,010    .6820598    .4657535          0          1
        educ |      3,010    13.26346    2.676913          1         18
         age |      3,010     28.1196    3.137004         24         34
-------------+---------------------------------------------------------
    fatheduc |      2,320    10.00345    3.720737          0         18
    motheduc |      2,657    10.34814    3.179671          0         18
      weight |      3,010    321185.3    170645.8      75607    1752340
    momdad14 |      3,010    .7893688    .4078247          0          1
    sinmom14 |      3,010    .1006645    .3009339          0          1
-------------+-------------------------------------------------

---
# <font color=blue> Part 1: IV and 2SLS
* Ref: https://github.com/Jikhan-Jeong/Causal-Inference-in-Econ/blob/master/Nov%2018%2C%202019%20IV%20and%202SLS.do (My Github)
* Ref: http://www3.grips.ac.jp/~yamanota/Lecture%20Note%208%20to%2010%202SLS%20&%20others.pdf (Original Lecture Note by Prof. Takashi Yamano)
---

### <font color=blue> IV
* **educ** is the endogenous variable
* An endogenous regressor **x** is correlated with the error term e: E(xe) ≠ 0
---
#### <font color=blue> Simple OLS 

In [6]:
reg lwage educ


      Source |       SS           df       MS      Number of obs   =     3,010
-------------+----------------------------------   F(1, 3008)      =    329.54
       Model |  58.5153704         1  58.5153704   Prob > F        =    0.0000
    Residual |  534.126274     3,008  .177568575   R-squared       =    0.0987
-------------+----------------------------------   Adj R-squared   =    0.0984
       Total |  592.641645     3,009  .196956346   Root MSE        =    .42139

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0520942   .0028697    18.15   0.000     .0464674     .057721
       _cons |   5.570882   .0388295   143.47   0.000     5.494747    5.647017
------------------------------------------------------------------------------


---
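#### <font color=blue> IV Validity Test (just-identified case)
* Check the IV's relevance: z must be correlated with x, which is checked by regressing x on z (the exogeneity of z cannot be tested with a single instrument)
* x ~ z, with z = nearc4 (the IV); z should be significant
* In educ ~ nearc4, the coefficient on nearc4 is statistically significant at the 0.01 level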
---

In [12]:
corr educ nearc4

(obs=3,010)

             |     educ   nearc4
-------------+------------------
        educ |   1.0000
      nearc4 |   0.1442   1.0000



In [13]:
reg educ nearc4 


      Source |       SS           df       MS      Number of obs   =     3,010
-------------+----------------------------------   F(1, 3008)      =     63.91
       Model |  448.604204         1  448.604204   Prob > F        =    0.0000
    Residual |  21113.4759     3,008  7.01910767   R-squared       =    0.0208
-------------+----------------------------------   Adj R-squared   =    0.0205
       Total |  21562.0801     3,009  7.16586243   Root MSE        =    2.6494

------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      nearc4 |    .829019   .1036988     7.99   0.000     .6256912    1.032347
       _cons |   12.69801   .0856416   148.27   0.000     12.53009    12.86594
------------------------------------------------------------------------------


---
#### <font color=blue> IV Estimation: just-identified case (2SLS)
* Number of endogenous variables = number of IVs (endogenous variable = educ, IV = nearc4)
* Running the two stages by hand gives the right coefficient but the wrong standard errors
* The **ivreg** command runs 2SLS and reports the corrected standard errors
* The IV coefficient on educ (.1880626) is larger than the OLS coefficient (.0520942), and its standard error is much larger: .0262913 (IV) > .0028697 (OLS)
---

In [17]:
ivreg lwage (educ=nearc4)


Instrumental variables (2SLS) regression

      Source |       SS           df       MS      Number of obs   =     3,010
-------------+----------------------------------   F(1, 3008)      =     51.17
       Model |  -340.11155         1  -340.11155   Prob > F        =    0.0000
    Residual |  932.753194     3,008  .310090823   R-squared       =         .
-------------+----------------------------------   Adj R-squared   =         .
       Total |  592.641645     3,009  .196956346   Root MSE        =    .55686

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1880626   .0262913     7.15   0.000     .1365118    .2396135
       _cons |   3.767472   .3488617    10.80   0.000      3.08344    4.451503
------------------------------------------------------------------------------
Instrume
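
The point about standard errors can be checked by running the two stages by hand. The sketch below assumes `card.dta` is still in memory; `educ_hat` is a hypothetical variable name introduced only for this illustration. The second-stage coefficient should match the `ivreg` coefficient on educ, while its standard error should not, because the hand-run second stage computes residuals from the fitted values rather than from the observed educ.

```stata
* Sketch: 2SLS by hand versus ivreg (assumes card.dta is in memory)

* Stage 1: regress the endogenous variable on the instrument
quietly regress educ nearc4
* educ_hat is a hypothetical name for the first-stage fitted values
predict double educ_hat, xb

* Stage 2: regress the outcome on the fitted values
* The coefficient matches 2SLS; the reported standard error is not valid
regress lwage educ_hat

* One-step 2SLS with corrected standard errors, for comparison
ivreg lwage (educ = nearc4)
```

Comparing the two coefficient tables side by side shows why the one-step command is preferred for inference.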

---
### <font color=blue> 2SLS : Over-identification
---
#### <font color=blue> OLS (educ = endo variable) 
* educ coefficient: .074009; std. err.: .0035054; adjusted R-squared: 0.2891
---

In [18]:
reg lwage educ exper expersq black smsa south 


      Source |       SS           df       MS      Number of obs   =     3,010
-------------+----------------------------------   F(6, 3003)      =    204.93
       Model |  172.165628         6  28.6942714   Prob > F        =    0.0000
    Residual |  420.476016     3,003  .140018653   R-squared       =    0.2905
-------------+----------------------------------   Adj R-squared   =    0.2891
       Total |  592.641645     3,009  .196956346   Root MSE        =    .37419

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |    .074009   .0035054    21.11   0.000     .0671357    .0808823
       exper |   .0835958   .0066478    12.57   0.000     .0705612    .0966305
     expersq |  -.0022409   .0003178    -7.05   0.000    -.0028641   -.0016177
       black |  -.1896315   .0176266   -10.76   0.

---
#### <font color=blue> 2 IVs :         
* Instrumented:  educ
* Instruments:   nearc2 nearc4
* educ coefficient: .1608487; std. err.: .0486291; adjusted R-squared: 0.1438
---

In [19]:
ivreg lwage (educ= nearc2 nearc4) exper expersq black smsa south


Instrumental variables (2SLS) regression

      Source |       SS           df       MS      Number of obs   =     3,010
-------------+----------------------------------   F(6, 3003)      =    110.30
       Model |  86.2367703         6  14.3727951   Prob > F        =    0.0000
    Residual |  506.404874     3,003  .168632992   R-squared       =    0.1455
-------------+----------------------------------   Adj R-squared   =    0.1438
       Total |  592.641645     3,009  .196956346   Root MSE        =    .41065

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1608487   .0486291     3.31   0.001      .065499    .2561984
       exper |   .1192112   .0211779     5.63   0.000     .0776866    .1607358
     expersq |  -.0023052   .0003507    -6.57   0.000    -.0029928   -.0016177
       b

---
#### <font color=blue> 4 IVs:                      
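* educ coefficient: .1000713; std. err.: .01263 (the lowest of the IV specifications); adjusted R-squared: 0.2507
* Adjusted R-squared rises from 0.1438 to 0.2507, but the sample shrinks from 3,010 to 2,220 observations because fatheduc and motheduc have missing values, so the two fits are not directly comparable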
---

In [20]:
ivreg lwage (educ= nearc2 nearc4 fatheduc motheduc) exper expersq black smsa south


Instrumental variables (2SLS) regression

      Source |       SS           df       MS      Number of obs   =     2,220
-------------+----------------------------------   F(6, 2213)      =     83.66
       Model |  108.419494         6  18.0699157   Prob > F        =    0.0000
    Residual |  320.580009     2,213  .144862182   R-squared       =    0.2527
-------------+----------------------------------   Adj R-squared   =    0.2507
       Total |  428.999503     2,219  .193330105   Root MSE        =    .38061

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1000713     .01263     7.92   0.000     .0753034    .1248392
       exper |   .0989441    .009482    10.43   0.000     .0803496    .1175385
     expersq |   -.002449   .0004013    -6.10   0.000    -.0032359   -.0016621
       b
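
Because the 4-IV run uses only 2,220 observations, a fairer comparison re-estimates every specification on the same rows. The following is a minimal sketch under that assumption; `common` and the stored estimate names (`ols`, `iv2`, `iv4`) are hypothetical names chosen for this illustration.

```stata
* Sketch: compare OLS and 2SLS on a common sample (card.dta in memory)
local controls exper expersq black smsa south

* common is a hypothetical flag: 1 when both parental education variables are observed
generate byte common = !missing(fatheduc, motheduc)

quietly regress lwage educ `controls' if common
estimates store ols

quietly ivreg lwage (educ = nearc2 nearc4) `controls' if common
estimates store iv2

quietly ivreg lwage (educ = nearc2 nearc4 fatheduc motheduc) `controls' if common
estimates store iv4

* Coefficient and standard error on educ, with the sample size of each model
estimates table ols iv2 iv4, keep(educ) b(%9.7f) se(%9.7f) stats(N)
```

With all three models on the same observations, differences in the educ coefficient and its standard error reflect the instruments rather than the sample.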

---
# <font color = blue> 2. 2SLS and GMM with ivreg2
* Ref: https://www.soderbom.net/metrix2.htm (original source by Måns Söderbom)
* Ref : https://github.com/Jikhan-Jeong/Causal-Inference-in-Econ/blob/master/Nov%2022%2C%202019%202SLS%20and%20GMM.do (my github, do file)
---

#### <font color=blue> 1. Download the user-written package: ivreg2_____________________________________

In [22]:
ssc install ivreg2
ssc install ranktest


checking ivreg2 consistency and verifying not already installed...
installing into /home/jikhan.jeong/ado/plus/...
installation complete.

checking ranktest consistency and verifying not already installed...
installing into /home/jikhan.jeong/ado/plus/...
installation complete.


---
####  <font color=blue> 2. summary statistics_________________________________________________________
* lwage is a dependent var
* educ is the endogenous variable, because it may be correlated with the unobserved error (= ability)
* nearc2 is the IV
---

In [23]:
tabstat lwage educ nearc2 , s(mean N min max p50)


   stats |     lwage      educ    nearc2
---------+------------------------------
    mean |  6.261832  13.26346  .4408638
       N |      3010      3010      3010
     min |   4.60517         1         0
     max |  7.784889        18         1
     p50 |  6.286928        13         0
----------------------------------------


---
#### <font color=blue> 3. OLS________________________________________________________________________
* educ | coef .0520942 | std. err. .0028697
---

In [25]:
reg lwage educ 


      Source |       SS           df       MS      Number of obs   =     3,010
-------------+----------------------------------   F(1, 3008)      =    329.54
       Model |  58.5153704         1  58.5153704   Prob > F        =    0.0000
    Residual |  534.126274     3,008  .177568575   R-squared       =    0.0987
-------------+----------------------------------   Adj R-squared   =    0.0984
       Total |  592.641645     3,009  .196956346   Root MSE        =    .42139

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0520942   .0028697    18.15   0.000     .0464674     .057721
       _cons |   5.570882   .0388295   143.47   0.000     5.494747    5.647017
------------------------------------------------------------------------------


---
#### <font color=blue> 4. 1st Stage and IV reg_______________________________________________________
* (1st result) First-stage regression: the IV must be significant in endo var ~ IV + other Xs
* (2nd result) IV (2SLS) estimation
--- 
#### result
*  (1st stage) nearc2 | coef .2552584 | std. err. .0981804 | t 2.60 | p-value 0.009 (dependent variable = educ, the endogenous variable)
*  (IV)        educ | coef .3432739 | std. err. .1274114 | z 2.69 | p-value 0.007
*  Coefficient on educ: .0520942 (OLS) << .3432739 (IV)
*  Std. err. on educ: .0028697 (OLS) << .1274114 (IV)
---

In [29]:
ivreg2 lwage ( educ = nearc2 ), first


First-stage regressions
-----------------------


First-stage regression of educ:

Statistics consistent for homoskedasticity only
Number of obs =                   3010
------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      nearc2 |   .2552584   .0981804     2.60   0.009     .0627509    .4477659
       _cons |   13.15092   .0651894   201.73   0.000      13.0231    13.27874
------------------------------------------------------------------------------
F test of excluded instruments:
  F(  1,  3008) =     6.76
  Prob > F      =   0.0094
Sanderson-Windmeijer multivariate F test of excluded instruments:
  F(  1,  3008) =     6.76
  Prob > F      =   0.0094



Summary results for first-stage regressions
-------------------------------------------

                                           (Underid)     

In [30]:
tabstat lwage educ, s(mean) by(nearc2)


Summary statistics: mean
  by categories of: nearc2 (=1 if near 2 yr college, 1966)

  nearc2 |     lwage      educ
---------+--------------------
       0 |  6.223202  13.15092
       1 |  6.310825  13.40618
---------+--------------------
   Total |  6.261832  13.26346
------------------------------


---
* IV coefficient via the Wald estimator: applies when there is one endogenous variable and one IV, and the IV is a dummy
* [E(y|IV=1) - E(y|IV=0)] / [E(x|IV=1) - E(x|IV=0)] = (6.310825 - 6.223202) / (13.40618 - 13.15092) = 0.34326960745 ≈ (IV) educ coefficient 0.3432739
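
The same ratio can be computed from the data instead of from the rounded `tabstat` output. This is a small sketch assuming `card.dta` is in memory; the scalar names (`m_y1`, `m_y0`, `m_x1`, `m_x0`, `wald`) are hypothetical and exist only for this illustration.

```stata
* Sketch: Wald estimator with the binary instrument nearc2 (card.dta in memory)

* Mean of the outcome by instrument status
quietly summarize lwage if nearc2 == 1
scalar m_y1 = r(mean)
quietly summarize lwage if nearc2 == 0
scalar m_y0 = r(mean)

* Mean of the endogenous variable by instrument status
quietly summarize educ if nearc2 == 1
scalar m_x1 = r(mean)
quietly summarize educ if nearc2 == 0
scalar m_x0 = r(mean)

* Reduced-form difference divided by first-stage difference
scalar wald = (m_y1 - m_y0) / (m_x1 - m_x0)
display "Wald estimate of the coefficient on educ: " wald
```

Since the scalars keep full precision, the displayed value should agree with the `ivreg2` coefficient on educ more closely than the hand calculation from rounded means.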

---
#### <font color=blue> 5. 2SLS with and without the robust option (endogenous regressor: educ)
* Without the robust option, 2 IVs
* For reference, the 4-IV run without robust gives educ | coef .1033048 | std. err. .0126463 | z 8.17 | p 0.000
* Endogeneity test of endogenous regressors (reported when the endog(educ) option is given): under H0 the specified endogenous regressors can actually be treated as exogenous
* Endogeneity test result: Chi-sq(1) P-val = 0.0864, so H0 is not rejected at the 5% level; this differs from the lecture note, where H0 is rejected
---

In [37]:
ivreg2 lwage (educ=nearc2 nearc4) exper expersq black south smsa reg661 reg662 reg663 reg664 reg665 reg666 reg667 reg668 south66

Vars dropped:       south66

IV (2SLS) estimation
--------------------

Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                      Number of obs =     3010
                                                      F( 14,  2995) =    48.20
                                                      Prob > F      =   0.0000
Total (centered) SS     =  592.6416447                Centered R2   =   0.1320
Total (uncentered) SS   =  118616.3653                Uncentered R2 =   0.9957
Residual SS             =  514.4136847                Root MSE      =    .4134

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |    .168382   .0508209     3.31   0.001     .0687747    .2679892
       exper |   .1234909   .0221606     5
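
For a cross-check that does not depend on the user-written package, the same 2-IV specification can be run with the official `ivregress` command and its postestimation tests. This is a sketch only: the local macro name `controls` is a hypothetical convenience, south66 is left out because `ivreg2` dropped it above, and the test statistics need not coincide exactly with the `endog()` output of `ivreg2`.

```stata
* Sketch: endogeneity and related tests with official Stata commands (card.dta in memory)
local controls exper expersq black south smsa reg661 reg662 reg663 reg664 reg665 reg666 reg667 reg668

* 2SLS with educ instrumented by nearc2 and nearc4
ivregress 2sls lwage `controls' (educ = nearc2 nearc4)

* Durbin and Wu-Hausman tests; H0: educ can be treated as exogenous
estat endogenous

* Sargan and Basmann tests of the overidentifying restriction (2 IVs, 1 endogenous variable)
estat overid

* First-stage summary statistics for instrument strength
estat firststage
```

Reading the three tests together helps interpret a borderline endogeneity p-value: weak instruments or a failed overidentification test would make the endogeneity result less reliable.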

---
#### <font color=blue> Without robust option 4 IVs
* Endogeneity test of endogenous regressors:  Chi-sq(1) P-val =    0.0266
---

In [33]:
ivreg2 lwage (educ=nearc2 nearc4 motheduc fatheduc ) exper expersq black south smsa reg661 reg662 reg663 reg664 reg665 reg666 reg667 reg668 south66,endog(educ)

Vars dropped:       south66

IV (2SLS) estimation
--------------------

Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                      Number of obs =     2220
                                                      F( 14,  2205) =    37.33
                                                      Prob > F      =   0.0000
Total (centered) SS     =  428.9995035                Centered R2   =   0.2580
Total (uncentered) SS   =  88133.52217                Uncentered R2 =   0.9964
Residual SS             =  318.3363362                Root MSE      =    .3787

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1033048   .0126463     8.17   0.000     .0785185    .1280912
       exper |   .1011632   .0094696    10

---
#### <font color=blue> With robust option 4 IVs
* educ | coef .1033048 | robust std. err. .0131855 (larger than the non-robust .0126463) | z 7.83 | p 0.000
---

In [35]:
ivreg2 lwage (educ=nearc2 nearc4 motheduc fatheduc ) exper expersq black south smsa reg661 reg662 reg663 reg664 reg665 reg666 reg667 reg668 south66, robust endog(educ)

Vars dropped:       south66

IV (2SLS) estimation
--------------------

Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity

                                                      Number of obs =     2220
                                                      F( 14,  2205) =    41.01
                                                      Prob > F      =   0.0000
Total (centered) SS     =  428.9995035                Centered R2   =   0.2580
Total (uncentered) SS   =  88133.52217                Uncentered R2 =   0.9964
Residual SS             =  318.3363362                Root MSE      =    .3787

------------------------------------------------------------------------------
             |               Robust
       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1033048   .0131855     7.83   0.000     .0774616     .129148
       exper |

---
#### 6. <font color = blue> Two-step GMM
* educ | coef .1019251 (smaller than robust 2SLS, .1033048) | robust std. err. .0131539 (smaller than robust 2SLS, .0131855) | z 7.75 | p 0.000
---

In [36]:
ivreg2 lwage (educ=nearc2 nearc4 motheduc fatheduc ) exper expersq black south smsa reg661 reg662 reg663 reg664 reg665 reg666 reg667 reg668 south66, gmm2s robust endog(educ)

Vars dropped:       south66

2-Step GMM estimation
---------------------

Estimates efficient for arbitrary heteroskedasticity
Statistics robust to heteroskedasticity

                                                      Number of obs =     2220
                                                      F( 14,  2205) =    41.36
                                                      Prob > F      =   0.0000
Total (centered) SS     =  428.9995035                Centered R2   =   0.2593
Total (uncentered) SS   =  88133.52217                Uncentered R2 =   0.9964
Residual SS             =  317.7482854                Root MSE      =    .3783

------------------------------------------------------------------------------
             |               Robust
       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1019251   .0131539     7.75   0.000     .0761439    .1277063
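
The 2SLS versus two-step GMM comparison can also be sketched with the official `ivregress` command. The block below is an illustration under stated assumptions: `card.dta` is in memory, the macro names `controls` and `ivs` and the stored estimate names `tsls_robust` and `gmm2s` are hypothetical, and south66 is omitted because `ivreg2` dropped it. With more instruments than endogenous variables, GMM reweights the moment conditions using a heteroskedasticity-robust weight matrix, so its estimates can differ slightly from 2SLS.

```stata
* Sketch: robust 2SLS versus two-step GMM with ivregress (card.dta in memory)
local controls exper expersq black south smsa reg661 reg662 reg663 reg664 reg665 reg666 reg667 reg668
local ivs nearc2 nearc4 motheduc fatheduc

* 2SLS with heteroskedasticity-robust standard errors
quietly ivregress 2sls lwage `controls' (educ = `ivs'), vce(robust)
estimates store tsls_robust

* Two-step GMM with a heteroskedasticity-robust weight matrix
quietly ivregress gmm lwage `controls' (educ = `ivs'), wmatrix(robust)
estimates store gmm2s

* Hansen's J test of the overidentifying restrictions (4 IVs, 1 endogenous variable)
estat overid

* Coefficient and standard error on educ for both estimators
estimates table tsls_robust gmm2s, keep(educ) b(%9.7f) se(%9.7f) stats(N)
```

The table puts the two educ estimates next to each other, which makes the small efficiency gain from GMM easier to see than scrolling between separate outputs.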
     