<h1>ECON 140R Class 10.1</h1>

The Class 10 hands-on comes in two analyses:
* <b>10.1: A second look at earnings and number of siblings in the 1998 HRS</b>
* 10.2: Bad controls in the January 2018 Current Population Survey (CPS)

Learning objectives:
1. Repetition
2. Visualization and common pitfalls

<h2>HRS and siblings redux</h2>

Recall our analysis of log earnings in the 1998 (4th) wave of the U.S. Health and Retirement Study (HRS) from class 09. The top of this hands-on replicates that, before running a further variant of the analysis.

We'd like to understand how "family size," as discussed by Angrist and Pischke, might be an omitted variable in the Dale and Krueger (2002) study. The HRS measures the number of living siblings in wave 4, `r4livsib`, for this sample aged 50-59 in 1998. They are roughly 10-20 years older than the Dale-Krueger study participants, who entered college in 1976 and so were probably born around 1958.

Our objective is to run a version of this "long regression" from page 73 of <i>Mastering Metrics</i>:

$$
\ln Y_i = \alpha^l 
+ \beta^l \ P_i + 
\sum_j \gamma_j^l GROUP_{ji} 
+ \delta_1^l SAT_i
+ \delta_2^l \ln PI_i
+ \lambda FS_i
+ e^l_i
$$

where we can measure $FS_i$ in the HRS data, as `r4livsib`. We don't have many of the other right-hand-side variables shown here, but that shouldn't matter for this exercise. The coefficient on family size in a log earnings regression is not likely to depend much on the other controls shown here.

In [None]:
library(tidyverse)
library(haven)

This is an extract I prepared specially for this purpose. The entire RAND version of the longitudinal file is big, over 1 GB in size. Berkeley's datahub is not configured to allow more than a gigabyte of memory per user, so this would be problematic. If you want to use these data yourself:
* Navigate to [https://hrs.isr.umich.edu/](https://hrs.isr.umich.edu/) and register as a user
* Start with the RAND file; I think it's the easiest
* Download the data to your local machine and use RStudio

In [None]:
hrs_w4_earn_sibs = read_dta("hrs_w4_earn_sibs.dta")
head(hrs_w4_earn_sibs)

The RAND file uses a very helpful variable naming convention: `rKvarname`, where K is the wave. Here, let's look at summary statistics for the variable `r4livsib`, which is the number of living siblings. For the people we'll look at, this is going to be very close to siblings ever born.

In [None]:
summary(hrs_w4_earn_sibs$r4livsib)
hist(hrs_w4_earn_sibs$r4livsib)

It's also helpful to look at years of education `raedyrs`, because that appears to be pretty important for understanding the effects of number of siblings on earnings:

In [None]:
summary(hrs_w4_earn_sibs$raedyrs)
hist(hrs_w4_earn_sibs$raedyrs)

Many folks are at that huge spike at a high school degree, 12 years. The Dale-Krueger dataset includes only those people and those with more education, and none of the left tail.

It might be interesting to see these two variables in a scatterplot, wouldn't it? Unfortunately, variables like this that take on integer values can create extremely unfortunate visualizations:

In [None]:
plot(hrs_w4_earn_sibs$r4livsib, hrs_w4_earn_sibs$raedyrs)

A tried and true solution to this problem is to MESS WITH THE DATA. You may not have known it, but chances are that in STAT 20 or DATA 8, you saw more than your fair share of scatterplots with deliberately "cooked" data in the way we're about to cook it.

If we add a small random number to both variables, we are monkeying with the data but basically preserving it. It's good to seed the random number generator (RNG) so that we can reproduce outcomes if we want to. 

In [None]:
set.seed(20220927)

In [None]:
# Let's create variables ending in -r that have a normally distributed random variable added, with mean 0 and small SD
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, raedyrsr = raedyrs + rnorm(n(),0,1))
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, r4livsibr = r4livsib + rnorm(n(),0,0.5))
head(hrs_w4_earn_sibs)

In [None]:
plot(hrs_w4_earn_sibs$r4livsibr, hrs_w4_earn_sibs$raedyrsr)
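Base R also has a built-in `jitter()` function that does the same kind of cooking with uniform rather than normal noise. Here is a minimal sketch on made-up integer data (not the HRS extract); the variable names and the `amount` value are illustrative assumptions, not anything from the analysis above.

```r
# Made-up integer-valued data standing in for siblings and years of education
set.seed(140)
sibs_toy  <- rpois(500, lambda = 3)
edyrs_toy <- pmin(17, pmax(0, round(rnorm(500, mean = 12, sd = 3))))

# jitter() with amount = a adds uniform noise in [-a, a] to each value
sibs_jit  <- jitter(sibs_toy,  amount = 0.3)
edyrs_jit <- jitter(edyrs_toy, amount = 0.3)

# The jittered copies are for plotting only; keep the originals for regressions
plot(sibs_jit, edyrs_jit, pch = 16, col = rgb(0, 0, 0, 0.3),
     xlab = "Living siblings (jittered)",
     ylab = "Years of education (jittered)")
```

Semi-transparent points (the fourth argument to `rgb()`) help too, because stacked observations show up as darker blobs.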

Another thing we could do is just run a regression. Here it is:

$$
raedyrs_i = \alpha^e + \beta^e \ livingsiblings_i + \epsilon^e_i
$$

In [None]:
edyrs_sib_reg <- lm(raedyrs ~ r4livsib, data = hrs_w4_earn_sibs)
summary(edyrs_sib_reg)

<hr>

Let's call `mutate()` to add some categoricals, for female gender identity and for the race/ethnicity categories that are useful to summarize folks:

In [None]:
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, rafemale = ragender - 1)

In [None]:
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, rablacknh  = ifelse(raraceth == 2, 1, 0))
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, rahispanic = ifelse(raraceth == 3, 1, 0))
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, raothernh  = ifelse(raraceth == 4, 1, 0))
head(hrs_w4_earn_sibs)

Behind the scenes, I have created some standard "labor economics variables." One thing you can do in a log-wage regression is control for age and age-squared. You could also control for age group, with indicators for set ranges of age, maybe in 5-year age groups. You could also calculate what labor economists like as a baseline, which is a rough measure of years of "experience," calculated as age minus years of education:

$$
r4exper_i = r4age_i - raedyrs_i
$$

I also created a variable `r4expersq` by squaring this experience variable. Over a broad age range, we typically see earnings rise and then plateau with age, so a quadratic in experience captures the typical pattern fairly well. The expectation is that the coefficient on the linear term should be positive, and the coefficient on the squared term should be negative, so that the parabola opens downward. This isn't always true, especially if we limit our analysis to a particular age range rather than all working ages 20-64.
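To see what "opens downward" means in practice, here is a small sketch on simulated data with a known concave profile. Every number in it is made up for illustration and none of it comes from the HRS. With a positive linear coefficient $b_1$ and a negative squared coefficient $b_2$, the fitted profile peaks at $-b_1 / (2 b_2)$ years of experience.

```r
# Simulated data with a known concave profile (all numbers are made up)
set.seed(140)
n     <- 2000
exper <- runif(n, min = 0, max = 40)
lny   <- 1 + 0.08 * exper - 0.002 * exper^2 + rnorm(n, sd = 0.1)

# Quadratic in experience; I() protects the square inside the formula
fit <- lm(lny ~ exper + I(exper^2))
b   <- coef(fit)

# With b1 > 0 and b2 < 0 the parabola peaks at -b1 / (2 * b2)
peak_exper <- unname(-b["exper"] / (2 * b["I(exper^2)"]))
peak_exper  # the true peak in this simulation is 0.08 / (2 * 0.002) = 20
```

If the sample only covers a narrow slice of the age range, the data may not pin down the curvature, and the signs can come out differently.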

Let's run this regression:
$$
\ln earnings_i = \alpha + \beta \ livingsiblings_i + \gamma \ raedyrs_i + B \cdot controls_i + e_i
$$

In [None]:
hrs_reg1 <- lm(logr4iearn ~ r4livsib + raedyrs + r4exper + r4expersq 
               + rafemale + rablacknh + rahispanic + raothernh, data = hrs_w4_earn_sibs)
summary(hrs_reg1)

Once we have controlled for age or experience, years of education, gender identity, and race/ethnicity, it doesn't appear that number of living siblings tells us anything about earnings.

By contrast, number of living siblings in 1998 definitely does appear to be correlated with years of education, controlling for gender and race/ethnicity:

In [None]:
hrs_reg2 <- lm(raedyrs ~ r4livsib + rafemale + rablacknh + rahispanic + raothernh, data = hrs_w4_earn_sibs)
summary(hrs_reg2)
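Why do we care about both pieces? The omitted variables bias (OVB) formula from <i>Mastering Metrics</i> says that the short-regression coefficient equals the long-regression coefficient plus the effect of the omitted variable times the coefficient from regressing the omitted variable on the included one. Here is a sketch of that identity on simulated data. The variables `p` (private school), `fs` (family size), and `y` (log earnings), and all the parameter values, are made up for illustration.

```r
# Simulated data: family size lowers the chance of private school and earnings
set.seed(140)
n  <- 1000
fs <- rpois(n, lambda = 3)                        # omitted variable
p  <- rbinom(n, size = 1, prob = plogis(0.5 - 0.2 * fs))
y  <- 1 + 0.10 * p - 0.05 * fs + rnorm(n)

long_fit  <- lm(y ~ p + fs)   # long regression, includes family size
short_fit <- lm(y ~ p)        # short regression, omits family size
aux_fit   <- lm(fs ~ p)       # omitted variable regressed on included one

beta_long  <- unname(coef(long_fit)["p"])
lambda     <- unname(coef(long_fit)["fs"])
pi_1       <- unname(coef(aux_fit)["p"])
beta_short <- unname(coef(short_fit)["p"])

# OVB identity: short = long + (effect of omitted) * (omitted on included)
c(short = beta_short, long_plus_ovb = beta_long + lambda * pi_1)
```

The two numbers agree exactly in any sample. If the effect of the omitted variable (`lambda` here) is close to zero, the bias term is close to zero no matter how strongly the omitted variable is correlated with the included one.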

<hr>

Another reasonable approach here would be to drop respondents who have less education than people in the Dale-Krueger (2002) study. They describe their sample in Table II on p. 1506, and they report that 85% graduated from college, and 56% obtained an advanced degree.

In the HRS data, if we look at people with `raedyrs` of 15 and more, that gets us roughly this break.

In [None]:
table(unlist(hrs_w4_earn_sibs$raedyrs))

In [None]:
244/(244+620+706)

The code above shows that the 244 people with `raedyrs == 15` are about 15.5% of everyone at or above that level, which roughly matches the 85% college graduation rate in the Dale and Krueger sample. The code below restricts the sample to people with 15+ years of education, which gives a subsample similar to the Dale and Krueger data, and drops education from the right-hand side.

In [None]:
hrs_reg3 <- lm(logr4iearn ~ r4livsib + r4exper + r4expersq # + raedyrs 
               + rafemale + rablacknh + rahispanic + raothernh, 
               data = subset(hrs_w4_earn_sibs, raedyrs >= 15))
summary(hrs_reg3)

There's not much evidence here that number of living siblings matters for earnings. We can also examine a pretty extreme model, where we drop all other covariates and look at the bivariate relationship between log earnings and living siblings:

In [None]:
hrs_reg4 <- lm(logr4iearn ~ r4livsib , 
               data = subset(hrs_w4_earn_sibs, raedyrs >= 15))
summary(hrs_reg4)
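Scrolling back and forth between `summary()` printouts gets old. Here is a sketch of one way to line up a single coefficient across several models. The helper `coef_row()` is hypothetical (it is not part of this notebook or of any package), and the toy data are simulated so the block runs on its own.

```r
# Hypothetical helper: pull one coefficient, its standard error, and a 95%
# confidence interval out of a fitted lm object
coef_row <- function(fit, term) {
  est <- coef(summary(fit))[term, c("Estimate", "Std. Error")]
  ci  <- confint(fit, term, level = 0.95)
  c(estimate = unname(est["Estimate"]),
    se       = unname(est["Std. Error"]),
    ci_low   = unname(ci[1, 1]),
    ci_high  = unname(ci[1, 2]))
}

# Simulated toy data (made up): x has no true effect on y
set.seed(140)
toy   <- data.frame(x = rpois(500, lambda = 3), z = rnorm(500))
toy$y <- 10 + 0.5 * toy$z + rnorm(500)

toy_fits <- list(bivariate = lm(y ~ x, data = toy),
                 with_z    = lm(y ~ x + z, data = toy))

# One column per model, one row per statistic
coef_table <- sapply(toy_fits, coef_row, term = "x")
round(coef_table, 3)
```

In this notebook the same call would look like `sapply(list(reg1 = hrs_reg1, reg3 = hrs_reg3, reg4 = hrs_reg4), coef_row, term = "r4livsib")`.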

<hr>

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>