<img src="images/econ140R_logo.png" width="200" />

In the following cell, please type your name and SID:

In the cell below, please write out the [Honor Code](https://teaching.berkeley.edu/berkeley-honor-code) to reaffirm you are abiding by it.

Did you work with other students? List them below. Please write your answers in your own words, not in theirs.

<h1>ECON 140R - Problem Set 3 Part 1</h1>

<font color="red"><b>Please complete Problem Set 3 Part 2 also</b></font>

<h2>INSTRUCTIONS</h2>

Please step through this problem set, copying and pasting code as needed, and run the code to produce output. Answer the questions asked, which appear in <font color="blue">blue font</font>. You will earn 100% of the credit on this problem set for <b>completing</b> it with working code and coherent answers. Answers do not need to be correct for full credit.

Throughout, <b>you may consult outside sources like the paper below or other commentary</b>, but your analysis should be in your own words. We will not run TurnItIn software on this Problem Set, but as usual, you should not borrow phrases without attribution or commit "mosaic plagiarism."

In this problem set we're examining a special extract of the 1980 Census to replicate [Angrist and Krueger (1994)](https://www-jstor-org.libproxy.berkeley.edu/stable/2535121), who research "Why Do World War II Veterans Earn More than Nonveterans?" We will use ordinary least squares and two-stage least squares (2SLS), a common implementation of instrumental variables, to replicate their primary results shown in their Table 4, which is reprinted below:

<img src="images/angrist-krueger-1994-ta4.png" width="500" />

In [None]:
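library(tidyverse)  # attaches ggplot2, dplyr, and friends
library(haven)      # read_dta() for Stata files
library(ggplot2)    # already attached by tidyverse; harmless to repeat
# Install ivreg only if it is missing, so re-running this cell stays fast
if (!requireNamespace("ivreg", quietly = TRUE)) {
  install.packages("ivreg", dependencies = TRUE)
}
library(ivreg)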

It turns out that this handy command stops __R__ from defaulting to scientific notation. 

In [None]:
options(scipen=999)
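
To see what that option changes, here is a minimal sketch using `format()` on a made-up number (the value `1e8` is only an example, not something from the data). It saves the current options first and restores them at the end.

```r
# Save the current options, then switch to R's default penalty of 0
old <- options(scipen = 0)
default_style <- format(1e8)   # scientific notation is shorter, so R picks it

# A large penalty makes R prefer fixed notation instead
options(scipen = 999)
fixed_style <- format(1e8)

print(c(default = default_style, fixed = fixed_style))

# Put the options back the way they were
options(old)
```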

<hr>

Below, we load a 50% random subsample of an IPUMS extract of the 1980 Census public-use microsample, which is itself a 5% flat cut of the entire Census. (So we are looking at a 2.5% flat cut.) This particular extract contains only men born in 1925, 1926, 1927, or 1928. The data include their quarter of birth `birthqtr`, whether they served in WWII `wwii`, their wage and salary income `incwage`, and several other characteristics.

The 1980 Census was unique in asking about month of birth on the [short form](https://www.census.gov/history/pdf/1980_short_questionnaire.pdf), which everyone answered. The public-use microdata sample condensed this into quarter of birth, but `birthqtr` is still unusual among Census products. Some other datasets measure month of birth, and restricted-use datasets may even supply day of birth.

You can download these data yourself. But be advised that the full dataset contains over 11 million records, or 5% of the nation's roughly 227 million people in 1980. Even with just 36 variables selected, the extract is over a gigabyte in size and is too large for datahub.
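
The sampling arithmetic in the last few paragraphs can be checked in a couple of lines. This is only a back-of-the-envelope sketch using the round numbers quoted above; the variable names are made up for illustration.

```r
us_pop_1980     <- 227e6  # approximate 1980 population, as quoted above
pums_share      <- 0.05   # the public-use microsample is a 5% cut
subsample_share <- 0.50   # we keep a random half of the extract

full_pums_records <- us_pop_1980 * pums_share      # records in the full 5% file
our_cut           <- pums_share * subsample_share  # our sampling fraction

full_pums_records
our_cut
```

Keep in mind that our file is far smaller than 2.5% of 227 million rows, because the extract keeps only men born in 1925 through 1928.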

<hr>

Let's load up and look at the data.

In [None]:
data_c80_regsample = read_dta("data_c80_regsample_3.dta")
head(data_c80_regsample)

Here is a baseline regression of a useful $Y$ variable, log pre-tax wage and salary income, which is [described at IPUMS here](https://usa.ipums.org/usa-action/variables/INCWAGE#description_section):

$$
\ln Y_i = \alpha + \beta^{w} \ wwii_i + B \ controls_i + \epsilon_i
$$

where we are controlling for 0/1 WWII service; year of birth; being white (Black, Hispanic, and other men are the baseline omitted category); being married in 1980; a 0/1 indicator of living and working in a standard metropolitan statistical area (SMSA); years of education; a 0/1 indicator of a disability that limits or prevents work; and 49 indicators for 48 lower states (AK and HI are dropped) plus DC.

We'll run this regression and examine what we find for $\beta^w$. Let's follow what [Angrist and Krueger (1994)](https://www-jstor-org.libproxy.berkeley.edu/stable/2535121) do in the left side of Table 4, marked "OLS," which looks a lot like the left-hand side of Table 2.2 in <i>Mastering Metrics</i> Chapter 2. In both, the authors start with a simple model and then add covariates whose omission might have injected (and did inject) omitted variable bias. Here's what we'll do:

1. $\ln Y_i = \alpha + \beta^{w} \ wwii_i + \sum \beta^{by} \ birthyear_i  + \epsilon_i$

2. $\ln Y_i = \alpha + \beta^{w} \ wwii_i + \sum \beta^{by} \ birthyear_i  + \beta^{wnh} \ white_i + \beta^m \ married_i + \sum \beta^{s}\ state_i + \beta^u \ SMSA_i + \epsilon_i$

3. $\ln Y_i = \alpha + \beta^{w} \ wwii_i + \sum \beta^{by} \ birthyear_i  + \beta^{wnh} \ white_i + \beta^m \ married_i + \sum \beta^{s}\ state_i + \beta^u \ SMSA_i + \beta^e \ educ_i + \beta^d \ disability_i + \epsilon_i$
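
To see why adding covariates can move $\beta^w$, here is a minimal simulation sketch of omitted variable bias. Everything in it is invented for illustration: `health` is a hypothetical confounder, the true effect of service is set to zero, and none of the numbers come from the Census data.

```r
set.seed(140)
n <- 5000

health <- rnorm(n)                         # hypothetical confounder
vet    <- as.numeric(health + rnorm(n) > 0) # healthier men are more likely to serve

# True effect of service on log earnings is zero in this made-up world
lny <- 9 + 0 * vet + 0.3 * health + rnorm(n, sd = 0.5)

short_fit <- lm(lny ~ vet)           # leaves the confounder out
long_fit  <- lm(lny ~ vet + health)  # controls for it

b_short <- unname(coef(short_fit)["vet"])
b_long  <- unname(coef(long_fit)["vet"])
round(c(short = b_short, long = b_long), 3)
```

In this sketch the short regression reports a sizable "veteran premium" even though none exists, and the long regression brings the estimate back toward zero.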

<h4>
<font color="blue">Complete the code below and run it.</font>
    </h4>

In [None]:
c80_reg1 <- lm(logincwage ~ wwii + factor(birthyr), 
               data = data_c80_regsample)
summary(c80_reg1)

<font color="blue">
    <h3>
    Question 1</h3>
    Look at the regression output above and describe what you see. How much more do male veterans of WWII earn compared to male nonveterans? Is the effect statistically significant? Describe using the reported numbers in the regression output, and also state what you see in descriptive sentences where you refer to the percentages revealed by the reported numbers. 
</font>
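
Because the outcome is in logs, a coefficient is in log points, and turning it into a percentage takes one extra step. The sketch below shows the usual conversion; `log_points_to_percent` is a helper invented here, and the coefficient value is hypothetical, not taken from the output above.

```r
# Convert a coefficient from a log-earnings regression into a percent difference
log_points_to_percent <- function(b) 100 * (exp(b) - 1)

b_example <- 0.10   # hypothetical coefficient, NOT from the regression above
pct_example <- log_points_to_percent(b_example)
pct_example         # close to 100 * b when b is small
```

For small coefficients, simply multiplying by 100 is a fine approximation; the gap grows as the coefficient gets larger.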

<h4>
<font color="blue">Complete the code below and run it.</font>
    </h4>

In [None]:
c80_reg2 <- lm(logincwage ~ wwii + factor(birthyr) + white + married 
               + factor(statefip) + smsa, 
               data = data_c80_regsample)
summary(c80_reg2)

<font color="blue">
    <h3>
    Question 2</h3>
    Look at the regression output above and describe what you see. How much more do male veterans of WWII earn compared to male nonveterans? Is the effect statistically significant? Discuss what is different about this regression compared to the first one above, in terms of the controls. Are these bad controls? California is <i>statefip</i> == 6; how does California residence affect earnings? State what you see in descriptive sentences.
</font>
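
With 49 state indicators, the California row is easy to lose in the output. Here is a small sketch of how `lm()` names coefficients for `factor()` terms, using a made-up data frame (`toy`, three state codes, simulated earnings) and assuming the state code is stored as a plain number. With `haven`-labelled columns the names may look slightly different.

```r
set.seed(6)
toy <- data.frame(statefip = rep(c(1, 6, 36), each = 50))  # made-up sample
toy$lny <- 9 + 0.1 * (toy$statefip == 6) + rnorm(nrow(toy), sd = 0.5)

toy_fit <- lm(lny ~ factor(statefip), data = toy)

# The lowest code present is the omitted baseline; the others get their own row
names(coef(toy_fit))

# Pull out one state's coefficient by name
ca_coef <- coef(toy_fit)["factor(statefip)6"]
ca_coef
```

The coefficient is the difference in mean log earnings relative to the omitted state, holding the other regressors fixed.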

<h4>
<font color="blue">Complete the code below and run it.</font>
    </h4>

In [None]:
c80_reg3 <- lm(logincwage ~ wwii + factor(birthyr) + white + married 
               + factor(statefip) + smsa 
               + edyrs + disability, 
               data = data_c80_regsample)
summary(c80_reg3)

<font color="blue">
    <h3>
    Question 3</h3>
    Look at the regression output above and describe what you see. How much more do male veterans of WWII earn compared to male nonveterans? Is the effect statistically significant? Discuss what is different about this regression compared to the second one above, in terms of the controls. Are these bad controls? California is <i>statefip</i> == 6; how does California residence affect earnings? State what you see in descriptive sentences.
</font>

<font color="blue">
    <h3>
    Question 4</h3>
    Now discuss the three regressions we have run thus far. If we are concerned about the impact of WWII service on wages, do you think those extra right-hand-side variables are more like omitted variables, or more like bad controls? Or does it depend on your point of view? You can summarize your earlier remarks.</font>

<hr>

Unfortunately `ggplot2` and `haven` apparently don't always play well together. When I first tried running the boxplot code further below, I got weird error messages and nothing. Tips for when (not if) __R__ squawks and hoses you:

* Deep breath. Sigh. Laugh. Post angrily to Twitter?
* Copy the error message and paste into Google
* Look for a Stack Exchange [stackoverflow.com](https://stackoverflow.com) post
* Profit


The code below apparently cleans up these indicator variables for use in `ggplot2`. Don't ask me why, because I don't know! The variable `notwwii` is obvious and it was useful because it switches the order. The variable `srgrp` is useful because of data labels for its values that I inserted in Stata. It measures 3 subgroups shown in a JASA paper by [Small and Rosenbaum (2008)](https://www-tandfonline-com.libproxy.berkeley.edu/doi/abs/10.1198/016214507000001247):
1. Born in 1924 Q3 or Q4
2. Born in 1926 Q3 or Q4
3. Born in 1928 Q3 or Q4

Note that our extract includes men born in 1925 through 1928, in concordance with the sample drawn by [Angrist and Krueger (1994)](https://www-jstor-org.libproxy.berkeley.edu/stable/2535121). So the men born in 1924 aren't present and won't show up in the box plots. (Which is OK; they look a lot like the men born in 1926.)

In [None]:
data_c80_regsample$notwwii <- haven::as_factor(data_c80_regsample$notwwii)
data_c80_regsample$srgrp   <- haven::as_factor(data_c80_regsample$srgrp)
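
For the curious, here is a rough sketch of what `haven::as_factor()` does. Stata variables with value labels arrive in R as a special "labelled" type, and `as_factor()` turns the labels into ordinary factor levels. The vector and its labels below are made up; they are not the labels stored in our dataset.

```r
library(haven)

# Hypothetical labelled vector, imitating what read_dta() returns for a
# Stata variable with value labels (these labels are invented)
toy_notwwii <- labelled(c(0, 1, 1, 0),
                        labels = c("WWII veteran" = 0, "Not a veteran" = 1))
class(toy_notwwii)     # a haven-specific class, not a factor

# as_factor() swaps the numeric codes for their labels
toy_factor <- as_factor(toy_notwwii)
levels(toy_factor)
```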

Now that we have that glitch figured out, here is a replication of Figure 1 in [Small and Rosenbaum (2008)](https://www-tandfonline-com.libproxy.berkeley.edu/doi/abs/10.1198/016214507000001247), a very nice review of what Angrist and Krueger had done. <b>Box plots</b> are more common in other disciplines than in economics, and they depict visually what economists and sociologists might instead place in a table: means, centiles, outliers, and so on. In its box plots, __R__'s `ggplot2` draws the box from the 25th to the 75th percentile with a line at the 50th (the median), plus two "whiskers." By default, each whisker extends to the most extreme data point that lies within 1.5 times the interquartile range of the box edge, where the interquartile range is the distance between the 25th and 75th percentiles. Points beyond the whiskers are plotted individually. [Details here](https://www.rdocumentation.org/packages/ggplot2/versions/3.3.6/topics/geom_boxplot).

In [None]:
ggplot(subset(data_c80_regsample, birthyr == 1926 & (birthqtr == 3 | birthqtr == 4)), 
              aes(notwwii, incwage)) + geom_boxplot()
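
If the pieces of a box plot are unfamiliar, they can be computed by hand. The sketch below uses a tiny made-up vector (not the Census data) and base R's `quantile()` to find the box edges, the 1.5 × IQR fences, the whisker ends, and the points that would be drawn as outliers.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # made-up values with one large outlier

q   <- quantile(x, c(0.25, 0.50, 0.75))  # box edges and the median line
iqr <- unname(q[3] - q[1])               # interquartile range

# Fences sit 1.5 * IQR beyond the box; each whisker stops at the
# most extreme data point that is still inside its fence
upper_fence   <- unname(q[3]) + 1.5 * iqr
lower_fence   <- unname(q[1]) - 1.5 * iqr
upper_whisker <- max(x[x <= upper_fence])
lower_whisker <- min(x[x >= lower_fence])

# Anything beyond the fences is drawn as an individual point
outliers <- x[x > upper_fence | x < lower_fence]

c(lower_whisker = lower_whisker, upper_whisker = upper_whisker)
outliers
```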

<font color="blue">
    <h3>
    Question 5</h3>
    Describe what you see. Who earns more among these men born in Q3 or Q4 of 1926, WWII veterans or nonveterans?</font>

<hr>

As discussed in Chapter 6 of <i>Mastering Metrics</i>, <b>quarter of birth</b> is a pretty interesting variable in applied microeconometrics. See section 6.3, and in particular the passage starting on page 229.

Why might it be interesting here? Let us run a handy regression with indicator variables that I have created in the dataset already: `b25q2` for example is an indicator for having been born in the second quarter (April through June) of 1925. Let us toss in all such indicators except `b25q1`, so that men born in quarter 1 of 1925 are the omitted category, and let us run this informative regression (which will end up being like the "first stage" in the instrumental variables chain):

$$
wwii_i = \alpha + \sum_k \theta_k \ birthqtr_i \times birthyr_i + \epsilon_i
$$

When we estimate this, $\alpha$ is the rate of WWII veteran status among men born in 1925:Q1, and the $\theta$'s tell us the difference in the rate of WWII veteran status for men born in different years and quarters. For example, if $\alpha = 0.75$, then 75% of men born in 1925:Q1 are WWII veterans; and if $\theta_{25q2} = 0.005$, then among the cohort born in 1925:Q2 the rate is $\alpha + \theta_{25q2} = 0.75 + 0.005 = 0.755$, so 75.5% are WWII veterans.
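
Here is a small simulated sketch of that interpretation. The three cohorts and their service rates are hypothetical, chosen only to echo the example numbers above. It checks that the intercept equals the mean of the omitted group and that intercept plus coefficient equals the mean of the other group.

```r
set.seed(1925)
cohorts <- c("b25q1", "b25q2", "b28q4")
p_serve <- c(b25q1 = 0.75, b25q2 = 0.755, b28q4 = 0.25)  # hypothetical rates

sim <- data.frame(cohort = factor(rep(cohorts, each = 2000), levels = cohorts))
sim$wwii <- rbinom(nrow(sim), 1, p_serve[as.character(sim$cohort)])

fit <- lm(wwii ~ cohort, data = sim)   # b25q1 is the omitted category
cell_means <- tapply(sim$wwii, sim$cohort, mean)

alpha_hat <- unname(coef(fit)["(Intercept)"])
theta_hat <- unname(coef(fit)["cohortb25q2"])

c(alpha = alpha_hat, alpha_plus_theta = alpha_hat + theta_hat)
cell_means[c("b25q1", "b25q2")]   # the same two numbers, computed directly
```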

<h4>
<font color="blue">Complete the code below and run it.</font>
    </h4>

In [None]:
wwii_reg <- lm(wwii ~         b25q2 + b25q3 + b25q4 +
                      b26q1 + b26q2 + b26q3 + b26q4 +
                      b27q1 + b27q2 + b27q3 + b27q4 +
                      b28q1 + b28q2 + b28q3 + b28q4,
              data = data_c80_regsample)
summary(wwii_reg)

<font color="blue">
    <h3>
    Question 6</h3>
    Describe what you see. Which group has the highest rate of WWII veteran status? Which group has the lowest rate? Can you see a "cliff" here, where WWII service really falls off? (No need to look for an extremely precise cliff; cliffs can be comically abrupt or a little more gradual.)</font>

<hr>

Now consider this comparison. Let's look at two groups separated by birth year:

1. Men born in Q3 or Q4 of 1926
2. Men born in Q3 or Q4 of 1928

I've coded this using the categorical variable `srgrp`, because it helpfully places the value labels along the bottom of the plot. The labels show the percentages of these two groups that are veterans of WWII. For the 1926 group and many birth years before, the WWII veteran share is around 75%. For the 1928 group, it is about 25%.

In [None]:
ggplot(subset(data_c80_regsample, srgrp != "NA"), 
       aes(srgrp, incwage)) + geom_boxplot()

<font color="blue">
    <h3>
    Question 7</h3>
Describe what you see. Despite the large difference in the WWII-veteran shares of these birth cohorts, are there large differences in earnings? Is this what you would expect to see if the cohort born later had only about a third as many WWII veterans, and WWII veterans earned a lot more?</font>

<font color="blue">
    <h3>
    Question 8</h3>
Discuss how the OLS results above, in Questions 1-4, appear incongruous with this second box plot. Feel free to consult the hyperlinked papers above if you are not sure, but please write answers in your own words.</font>

<hr>

Quarter of birth certainly seems important for WWII service in this sample, for obvious reasons. Men could be born too late to serve in WWII, which ended in August 1945. With some exceptions, which we have seen, that meant that men born in Q3 or Q4 of 1928 were generally too young ever to serve in the war. 
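
A little date arithmetic makes the point concrete. This sketch assumes 15 August 1945 as the end date and uses the first day of each quarter as a stand-in birthday; both choices are illustrative assumptions, and the code says nothing about enlistment rules.

```r
war_end <- as.Date("1945-08-15")   # assumed end date, for illustration only

# Stand-in birthdays: the first day of each quarter
birth_dates <- c(q3_1926 = "1926-07-01",
                 q3_1928 = "1928-07-01",
                 q4_1928 = "1928-10-01")
born <- as.Date(birth_dates)

# Age in years when the war ended
age_at_end <- as.numeric(difftime(war_end, born, units = "days")) / 365.25
names(age_at_end) <- names(birth_dates)
round(age_at_end, 1)
```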

We have already run something that looks like a <i>first stage</i> regression. It was:

$$
wwii_i = \alpha + \sum_k \theta_k \ birthqtr_i \times birthyr_i + \epsilon_i
$$

What if we looked at the effect of quarter of birth on earnings in a type of <i>reduced form</i> regression of log earnings on quarter of birth variables? That equation looks like this:

$$
\ln Y_i = \alpha^Y + \sum_k \psi_k \ birthqtr_i \times birthyr_i + \nu_i
$$
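
To preview how the first stage and the reduced form fit together, here is a simulated sketch with a single 0/1 instrument. With one instrument, the IV estimate is the reduced-form coefficient divided by the first-stage coefficient (the Wald estimator). Everything here is invented: `late` is a hypothetical "born too late" indicator, `health` is an unobserved confounder, and the true effect of service is set to zero.

```r
set.seed(1994)
n <- 20000
late   <- rbinom(n, 1, 0.5)   # hypothetical instrument: 1 = born too late to serve
health <- rnorm(n)            # unobserved trait that raises both service and pay

# Roughly 75% of the early group and 25% of the late group serve
vet <- as.numeric(0.95 - 1.9 * late + health + rnorm(n) > 0)

# The true effect of service on log earnings is zero in this made-up world
lny <- 9 + 0 * vet + 0.3 * health + rnorm(n, sd = 0.5)

ols_est     <- unname(coef(lm(lny ~ vet))["vet"])    # contaminated by health
first_stage <- unname(coef(lm(vet ~ late))["late"])  # change in service rate
reduced     <- unname(coef(lm(lny ~ late))["late"])  # change in log earnings
wald_est    <- reduced / first_stage                 # reduced form / first stage

round(c(ols = ols_est, first_stage = first_stage,
        reduced_form = reduced, wald = wald_est), 3)
```

In this sketch OLS finds a premium that is not there, while the ratio lands near the true value of zero because `late` is unrelated to `health`.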

<h4>
<font color="blue">Complete the code below and run it.</font>
    </h4>

In [None]:
c80_rf <- lm(logincwage ~ b25q1 + b25q2 + b25q3 +
                          b26q1 + b26q2 + b26q3 + 
                          b27q1 + b27q2 + b27q3 + 
                          b28q1 + b28q2 + b28q3 + 
                          factor(birthyr),
             data = data_c80_regsample)
summary(c80_rf)

<font color="blue">
    <h3>
    Question 9</h3>
This is not very easy to read, because there are so many instrumental variables on the right-hand side here. But what do you see? Are some of these instrumental variables statistically significant? Or are none of them?</font> 

<font color="blue">(Sophisticated users will note two things: [1] the <i>F</i>-stat is 4.867, which is sort of OK, and [2] this is the reduced form only for the model in column (4) in Angrist and Krueger's Table 4.)</font>
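
One way to ask whether the quarter-of-birth terms matter as a group, over and above birth year, is a joint <i>F</i>-test comparing two nested models with `anova()`. The sketch below runs that comparison on simulated data in which quarter of birth has no effect by construction; the data frame and its numbers are made up, so the resulting statistic says nothing about the Census sample.

```r
set.seed(4)
n <- 4000
sim <- data.frame(
  birthyr  = sample(1925:1928, n, replace = TRUE),
  birthqtr = sample(1:4, n, replace = TRUE)
)
# Earnings depend on birth year only; quarter of birth plays no role here
sim$lny <- 9 + 0.02 * (sim$birthyr - 1925) + rnorm(n, sd = 0.6)

restricted <- lm(lny ~ factor(birthyr), data = sim)                     # year dummies only
full       <- lm(lny ~ factor(birthyr) * factor(birthqtr), data = sim)  # all 16 year-quarter cells

joint   <- anova(restricted, full)   # F-test on the extra terms
extra_k <- joint$Df[2]               # number of restrictions tested
f_stat  <- joint$F[2]
c(restrictions = extra_k, F = round(f_stat, 3))
```

The 12 restrictions correspond to the 12 quarter-of-birth indicators in the regression above.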

<hr>

<font color="red"><b>Please complete Problem Set 3 Part 2 also.</b> There are two parts because starting from scratch lets us run the IV estimation routine without overloading memory and crashing the R kernel.</font>

<hr>

<i>As warfare and killing rage again in Europe in 2022, let's also take a moment to recognize the great human costs and sacrifices associated with armed conflict and open warfare, and the tragedy of nuclear war.</i>

<hr>

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>