<img src="images/econ140R_logo.png" width="200" />

In the following cell, please type your name and SID:

In the cell below, please write out the [Honor Code](https://teaching.berkeley.edu/berkeley-honor-code) to reaffirm you are abiding by it.

Did you work with other students? List them below. Please write your answers in your own words, not in theirs.

<h1>ECON 140R - Problem Set 3 Part 2</h1>

<font color="red"><b>Please complete Problem Set 3 Part 1 also</b></font>

<h2>INSTRUCTIONS</h2>

Please step through this problem set, copying and pasting code as needed, and run the code to produce output. Answer the questions asked, which appear in <font color="blue">blue font</font>. You will earn 100% of the credit on this problem set for <b>completing</b> it with working code and coherent answers. Answers do not need to be correct for full credit.

It turns out that `ivreg()` is memory intensive.

<font color = "green">If you are encountering <b>kernel crashes</b>, it is probably because of memory violations; that is, exceeding the 1 GB maximum. If this happens, try:</font>

1. Halting other notebooks you may have open on datahub. Go to "File: Close and Halt"

2. Clean away data in your notebook's workspace:
* `ls()` to find the data objects present
* `rm()` to remove them
* `gc()` to empty the garbage can
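
The cleanup steps above can be tried on a throwaway object first. This is only a sketch: `scratch_matrix` is a hypothetical name invented for the illustration and is not part of the problem set data.

```r
# Create a throwaway object (hypothetical name) so there is something to clean up
scratch_matrix <- matrix(0, nrow = 1000, ncol = 1000)

# Report roughly how much memory the object occupies
print(format(object.size(scratch_matrix), units = "MB"))

# List the objects currently in the workspace
print(ls())

# Remove the object, then ask R to release the memory it was using
rm(scratch_matrix)
invisible(gc())

# Confirm the object is gone
print(exists("scratch_matrix"))
```

The same pattern applies to fitted model objects, which are usually the largest items in this notebook's workspace.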

In [1]:
#library(tidyverse)  # Don't need it. This or ggplot2 appears to overload memory
library(haven)
#library(ggplot2)    # Don't need it. 
install.packages("ivreg", dependencies = TRUE)
library("ivreg")

Installing package into ‘/opt/r’
(as ‘lib’ is unspecified)



It turns out that this handy command stops __R__ from defaulting to scientific notation. 

In [2]:
options(scipen=999)

<hr>

Please see the data description in Problem Set 3 Part 1.

In order to replicate Angrist and Krueger's (1994) <b>instrumental variables</b> analysis without crashing the __R__ kernel, we need to start fresh with the same 50% subsample of the 5% public-use microsample of the 1980 Census.

<hr>

In [3]:
data_c80_regsample = read_dta("data_c80_regsample_3.dta")

As before, we are going to model log pre-tax wage and salary income as a function of WWII veteran status and controls:

$$
\ln Y_i = \alpha + \beta^{w} \ wwii_i + B \ controls_i + \epsilon_i
$$

The regressor of interest is a 0/1 indicator of WWII service. The controls are year of birth; being white (Black, Hispanic, and other men are the baseline omitted category); being married in 1980; a 0/1 indicator of living and working in a standard metropolitan statistical area (SMSA); years of education; a 0/1 indicator of a disability that limits or prevents work; and 49 indicators covering the 48 contiguous states (AK and HI are dropped) plus DC.

We'll run this regression and examine what we find for $\beta^w$. Let's follow what [Angrist and Krueger (1994)](https://www-jstor-org.libproxy.berkeley.edu/stable/2535121) do in the right side of Table 4, marked "2SLS," which looks a lot like the left-hand side of Table 2.2 in <i>Mastering Metrics</i> Chapter 2. In both, the authors start with a simple model and then add covariates whose omission might have (and did) introduce omitted variable bias. Here's what we'll do:

1. $\ln Y_i = \alpha + \beta^{w} \ wwii_i + \sum \beta^{by} \ birthyear_i  + \epsilon_i$

2. $\ln Y_i = \alpha + \beta^{w} \ wwii_i + \sum \beta^{by} \ birthyear_i  + \beta^{wnh} \ white_i + \beta^m \ married_i + \sum \beta^{s}\ state_i + \beta^u \ SMSA_i + \epsilon_i$

3. $\ln Y_i = \alpha + \beta^{w} \ wwii_i + \sum \beta^{by} \ birthyear_i  + \beta^{wnh} \ white_i + \beta^m \ married_i + \sum \beta^{s}\ state_i + \beta^u \ SMSA_i + \beta^e \ educ_i + \beta^d \ disability_i + \epsilon_i$

This time we estimate each equation by two-stage least squares (2SLS), a common form of instrumental variables (IV) estimation.

The motivation for IV is that we suspect WWII service was not randomly assigned, even though there were draft lotteries. Rather, the healthiest men were selected to serve. An instrumental variables approach based on year and quarter of birth can help reduce the selection bias plaguing $\beta^w$ because men who were born too late had no chance of serving in WWII, even though they were healthy and could have been randomly selected if they had been born earlier.

Let us follow in the footsteps of Angrist and Krueger and estimate the three equations above by 2SLS using `ivreg()` in __R__. The syntax for `ivreg()` is very similar to that of `lm()`, except that you need a "pipe" symbol: "|". The pipe appears after your original equation, and the variable list after the pipe needs to include ALL the exogenous variables plus any instrumental variables. That list must be at least as long as the list between the tilde "~" and the pipe "|", and it cannot include the endogenous regressor, which is $wwii_i$ here.
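
To see what the pipe syntax is doing, here is a minimal sketch on simulated data, not the Census sample. All names (`sim`, `z`, `x`, `u`, `d`, `y`) and the true coefficient of $-0.1$ are assumptions made up for the illustration. It runs the two stages by hand with `lm()` and, if the `ivreg` package is installed, fits the same model with the pipe formula.

```r
set.seed(140)
n <- 20000

# Simulated data (all names and parameter values are hypothetical)
z <- rbinom(n, 1, 0.5)          # instrument: shifts treatment, excluded from the outcome equation
x <- rnorm(n)                   # exogenous control
u <- rnorm(n)                   # unobserved confounder (think "health")
d <- 0.8 * z + u + rnorm(n)     # endogenous regressor
y <- 1 - 0.1 * d + 0.3 * x + u + rnorm(n)   # true effect of d is -0.1
sim <- data.frame(y, d, x, z)

# OLS is biased upward here because u raises both d and y
beta_ols <- coef(lm(y ~ d + x, data = sim))["d"]

# Stage 1: regress the endogenous regressor on the instrument and the control
sim$d_hat <- fitted(lm(d ~ z + x, data = sim))

# Stage 2: replace d with its fitted value
# (point estimate only; the standard errors from this lm() call are not valid 2SLS standard errors)
beta_2sls <- coef(lm(y ~ d_hat + x, data = sim))["d_hat"]

print(c(ols = unname(beta_ols), manual_2sls = unname(beta_2sls)))

# Same model with the pipe syntax: exogenous control x appears on both sides,
# the instrument z only after the pipe, and the endogenous d only before it
if (requireNamespace("ivreg", quietly = TRUE)) {
  fit_iv <- ivreg::ivreg(y ~ d + x | x + z, data = sim)
  print(coef(fit_iv)["d"])
}
```

The hand-rolled second stage reproduces the 2SLS point estimate but not its standard errors, which is one reason to use `ivreg()` rather than two `lm()` calls in practice.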

The <b>instrumental variables</b> are year-of-birth interacted with (i.e., times) quarter-of-birth, or in other words, indicator variables for being born in a particular year and quarter. Because year of birth is also in the regression, we need to omit one quarter within each birth year, or __R__ will do it for us because of collinearity.
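
For the first specification, the two stages can be written out as follows. This is a sketch in the notation used above; the first-stage symbols $\pi_0$, $\pi^{yq}$, $\pi^{by}$, and $\nu_i$, and the shorthand $b_{yq,i}$ for the indicator of being born in year $19y$ and quarter $q$, are introduced here for illustration and do not appear elsewhere in the problem set.

First stage:

$$
wwii_i = \pi_0 + \sum_{y=25}^{28} \sum_{q=1}^{3} \pi^{yq} \ b_{yq,i} + \sum \pi^{by} \ birthyear_i + \nu_i
$$

Second stage:

$$
\ln Y_i = \alpha + \beta^{w} \ \widehat{wwii}_i + \sum \beta^{by} \ birthyear_i + \epsilon_i
$$

Here $\widehat{wwii}_i$ is the fitted value from the first stage. The twelve indicators $b_{yq,i}$ appear only in the first stage, while the birth-year indicators appear in both, which mirrors the variable lists on either side of the pipe.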

<h4>
<font color="blue">Complete the code below and run it.</font>
    </h4>

In [4]:
c80_ivreg1 <- ivreg(logincwage ~ wwii + factor(birthyr)| 
                    factor(birthyr) +
                    b25q1 + b25q2 + b25q3 +
                    b26q1 + b26q2 + b26q3 +
                    b27q1 + b27q2 + b27q3 + 
                    b28q1 + b28q2 + b28q3, 
                    data = data_c80_regsample)
summary(c80_ivreg1)


Call:
ivreg(formula = logincwage ~ wwii + factor(birthyr) | factor(birthyr) + 
    b25q1 + b25q2 + b25q3 + b26q1 + b26q2 + b26q3 + b27q1 + b27q2 + 
    b27q3 + b28q1 + b28q2 + b28q3, data = data_c80_regsample)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8155 -0.3034  0.1149  0.4329  1.5722 

Coefficients:
                     Estimate Std. Error t value             Pr(>|t|)    
(Intercept)          9.791947   0.038809 252.311 < 0.0000000000000002 ***
wwii                -0.123413   0.051160  -2.412               0.0159 *  
factor(birthyr)1926  0.030861   0.007457   4.139             0.000035 ***
factor(birthyr)1927  0.014972   0.008736   1.714               0.0866 .  
factor(birthyr)1928 -0.015453   0.023805  -0.649               0.5162    

Diagnostic tests:
                   df1   df2 statistic              p-value    
Weak instruments    12 81495     88.69 < 0.0000000000000002 ***
Wu-Hausman           1 81505     53.13    0.000000000000316 ***
Sargan              11   

<font color="blue">
    <h3>
    Question 10</h3>
Look at the IV regression output above and describe what you see. How much more do male veterans of WWII earn compared to male nonveterans? Is the effect statistically significant? State what you see in descriptive sentences.</font>

In this instrumental variables regression, we see a startling outcome: WWII veterans earn 12% LESS than nonveterans in the same year of birth. The effect is statistically significant at the 5% level; the $t$-statistic is $-2.4$.
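
The numbers in this answer can be checked directly from the printed output. The sketch below hard-codes the `wwii` estimate and standard error from the summary above; the variable names and the normal-approximation confidence interval are choices made for the illustration.

```r
# Estimate and standard error for wwii, copied from the summary output above
beta_w <- -0.123413
se_w   <- 0.051160

# Exact percent change implied by a log-point coefficient: 100 * (exp(beta) - 1)
pct_effect <- 100 * (exp(beta_w) - 1)

# t-statistic and an approximate 95% confidence interval (normal approximation)
t_stat <- beta_w / se_w
ci_95  <- beta_w + c(-1, 1) * qnorm(0.975) * se_w

print(round(c(pct_effect = pct_effect, t_stat = t_stat,
              ci_low = ci_95[1], ci_high = ci_95[2]), 3))
```

Reading the coefficient as "about 12% less" is the usual log-point approximation; the exact conversion gives a slightly smaller percentage.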

In [5]:
rm(c80_ivreg1)
ls()
gc()

Unnamed: 0,used,(Mb),gc trigger,(Mb).1,max used,(Mb).2
Ncells,773047,41.3,1349511,72.1,1349511,72.1
Vcells,6879619,52.5,22817020,174.1,22056332,168.3


In [9]:
c80_ivreg2 <- ivreg(logincwage ~ wwii + factor(birthyr) + white + married +
                    factor(statefip) + smsa | 
                    factor(birthyr) + white + married +
                    factor(statefip) + smsa +
                    b25q1 + b25q2 + b25q3 +
                    b26q1 + b26q2 + b26q3 +
                    b27q1 + b27q2 + b27q3 + 
                    b28q1 + b28q2 + b28q3, 
                    data = data_c80_regsample)
summary(c80_ivreg2)


Call:
ivreg(formula = logincwage ~ wwii + factor(birthyr) + white + 
    married + factor(statefip) + smsa | factor(birthyr) + white + 
    married + factor(statefip) + smsa + b25q1 + b25q2 + b25q3 + 
    b26q1 + b26q2 + b26q3 + b27q1 + b27q2 + b27q3 + b28q1 + b28q2 + 
    b28q3, data = data_c80_regsample)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.9078 -0.2817  0.0775  0.4065  2.2503 

Coefficients:
                      Estimate Std. Error t value             Pr(>|t|)    
(Intercept)          8.7444088  0.0277643 314.952 < 0.0000000000000002 ***
wwii                -0.1406921  0.0486355  -2.893             0.003819 ** 
factor(birthyr)1926  0.0281032  0.0070572   3.982    0.000068330334312 ***
factor(birthyr)1927  0.0185772  0.0082476   2.252             0.024297 *  
factor(birthyr)1928 -0.0213597  0.0225946  -0.945             0.344486    
white                0.4276366  0.0141470  30.228 < 0.0000000000000002 ***
married              0.3934452  0.0085340  46.103 < 0.000

<font color="blue">
    <h3>
    Question 11</h3>
Look at the IV regression output above and describe what you see. How much more do male veterans of WWII earn compared to male nonveterans? Is the effect statistically significant? State what you see in descriptive sentences. Describe how results have changed (or not) with more controls. Discuss omitted variable bias if it seems useful, and perhaps compare results here to what you saw using OLS earlier.</font>

In this second IV regression, we continue to see a negative coefficient that is statistically significant, and here it is slightly larger in magnitude than it was in Question 10: a reduction of 14%. It is interesting how adding controls actually made the negative effect larger in size. But that seems consistent with the OLS results earlier: when we controlled for being married and being white there, both of which were positively correlated with earnings and with WWII veteran status, we also made $\beta^w$ less positive (here, more negative) because of positive omitted variable bias.

<h4>
<font color="blue">Complete the code below and run it.</font>
    </h4>

In [10]:
rm(c80_ivreg2)
ls()
gc()

Unnamed: 0,used,(Mb),gc trigger,(Mb).1,max used,(Mb).2
Ncells,778095,41.6,1349511,72.1,1349511,72.1
Vcells,6890892,52.6,80033212,610.7,83037398,633.6


In [11]:
c80_ivreg3 <- ivreg(logincwage ~ wwii + factor(birthyr) + white + married +
                    factor(statefip) + smsa + 
                    edyrs + disability | 
                    factor(birthyr) + white + married +
                    factor(statefip) + smsa +
                    edyrs + disability +
                    b25q1 + b25q2 + b25q3 +
                    b26q1 + b26q2 + b26q3 +
                    b27q1 + b27q2 + b27q3 + 
                    b28q1 + b28q2 + b28q3, 
                    data = data_c80_regsample)
summary(c80_ivreg3)


Call:
ivreg(formula = logincwage ~ wwii + factor(birthyr) + white + 
    married + factor(statefip) + smsa + edyrs + disability | 
    factor(birthyr) + white + married + factor(statefip) + smsa + 
        edyrs + disability + b25q1 + b25q2 + b25q3 + b26q1 + 
        b26q2 + b26q3 + b27q1 + b27q2 + b27q3 + b28q1 + b28q2 + 
        b28q3, data = data_c80_regsample)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8617 -0.2478  0.0811  0.3754  2.6780 

Coefficients:
                      Estimate Std. Error t value             Pr(>|t|)    
(Intercept)          8.2611158  0.0227649 362.888 < 0.0000000000000002 ***
wwii                -0.0663295  0.0440599  -1.505             0.132215    
factor(birthyr)1926  0.0205538  0.0065369   3.144             0.001666 ** 
factor(birthyr)1927  0.0108970  0.0076331   1.428             0.153414    
factor(birthyr)1928 -0.0104960  0.0206596  -0.508             0.611422    
white                0.2757476  0.0116975  23.573 < 0.0000000000000002 **

<font color="blue">
    <h3>
    Question 12</h3>
Look at the IV regression output above and describe what you see. How much more do male veterans of WWII earn compared to male nonveterans? Is the effect statistically significant? State what you see in descriptive sentences. Describe how results have changed (or not) with more controls.</font>

In this last IV regression, we see basically no effect of WWII service on earnings. The coefficient is $-0.066$, but it is not statistically significantly different from zero; the $t$-statistic is $-1.5$. We've added more controls, and this last set has driven the result toward zero.
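
To compare the three specifications side by side, the `wwii` rows from the three summaries above can be collected into one small table. The estimates and standard errors are copied from the printed output; the data frame name, column names, and the normal-approximation intervals are assumptions made for this sketch.

```r
# wwii estimates and standard errors copied from the three ivreg summaries above
iv_results <- data.frame(
  spec      = c("birth year only",
                "+ white, married, state, SMSA",
                "+ education, disability"),
  estimate  = c(-0.123413, -0.1406921, -0.0663295),
  std_error = c(0.051160, 0.0486355, 0.0440599)
)

# t-statistics and approximate 95% confidence intervals
iv_results$t_value <- iv_results$estimate / iv_results$std_error
iv_results$ci_low  <- iv_results$estimate - qnorm(0.975) * iv_results$std_error
iv_results$ci_high <- iv_results$estimate + qnorm(0.975) * iv_results$std_error

# Flag specifications whose interval excludes zero
iv_results$ci_excludes_zero <- iv_results$ci_low > 0 | iv_results$ci_high < 0

print(iv_results, digits = 3)
```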

If you can keep the __R__ kernel from dying, you could explore this a little further by checking whether it is the inclusion of education or of disability that does this, knocking $\beta^w$ to zero. If you check, it turns out that it is education that does it; disability is significant but it is negatively correlated with WWII veteran status.
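
One way to set up that check without retyping the long formulas is sketched below. The helper `make_iv_formula()` and the object names are hypothetical; the variable names are the ones used in the cells above. The models are fit only if the data and the `ivreg` package are available, and each fit is removed right away to limit memory use.

```r
# The twelve year-by-quarter instruments used above: b25q1 ... b28q3
instruments   <- paste0("b", rep(25:28, each = 3), "q", 1:3)
base_controls <- c("factor(birthyr)", "white", "married",
                   "factor(statefip)", "smsa")

# Hypothetical helper: build an ivreg formula with optional extra controls.
# Exogenous controls go on both sides of the pipe; instruments only after it.
make_iv_formula <- function(extra_controls = character(0)) {
  exog      <- c(base_controls, extra_controls)
  rhs_model <- paste(c("wwii", exog), collapse = " + ")
  rhs_instr <- paste(c(exog, instruments), collapse = " + ")
  as.formula(paste("logincwage ~", rhs_model, "|", rhs_instr))
}

f_educ  <- make_iv_formula("edyrs")        # add education only
f_disab <- make_iv_formula("disability")   # add disability only

# Fit one model at a time and clean up in between to limit memory use
if (exists("data_c80_regsample") && requireNamespace("ivreg", quietly = TRUE)) {
  for (f in list(f_educ, f_disab)) {
    fit <- ivreg::ivreg(f, data = data_c80_regsample)
    print(coef(summary(fit))["wwii", ])
    rm(fit)
    invisible(gc())
  }
}
```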

What Angrist and Krueger (1994) say about this is that it might be connected to the effects of year and quarter of birth. They point out that men born in later years in this sample have more education, and also that men born earlier in the year tend to have less education, presumably because of compulsory schooling age laws, and also are more likely to have served in WWII.

Complicated!

<hr>

<font color="blue">
    <h3>
    Question 13</h3>
Take a step back and assess what we have found. Do you believe the OLS results? Or are the IV results more convincing? What does each set of results <i>mean</i>, for things that we care about like inequality and policy? Did WWII veterans benefit from their service in terms of earnings? Or do some of these results imply the republic may have literally owed them something for their service?</font>

Ordinary least squares estimates of the effect of WWII service on earnings are likely to be misleading because men who served were a selected group. As Chapters 1 and 2 in <i>Mastering Metrics</i> showed us, selection bias is a real problem for observational studies.

The IV results are more convincing because of what we know about eligibility for service in WWII and what we see in the data on service rates by year and quarter of birth.

The results mean that WWII veterans were not naturally compensated by the labor market in 1980 for their military service. The causal effect of military service on earnings may have been zero, if we hold education constant, or it might have been negative, if we do not. This implies that the republic literally owed veterans for their service.

<hr>

<i>As warfare and killing rage again in Europe in 2022, let's also take a moment to recognize the great human costs and sacrifices associated with armed conflict and open warfare, and the tragedy of nuclear war.</i>

<hr>

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>