<h1>ECON 140R Class 09</h1>

More practice with regression analysis is always better. Let's look at Angrist and Pischke's supposition that family size, meaning the number of siblings, independently impacts earnings. The dataset we'll use is the U.S. Health and Retirement Study (HRS), a panel survey of Americans aged 50 and older that started in 1992 and has been refreshed periodically.

The fourth wave took place in 1998, and we'll examine data from it. It isn't a perfect match to the cohort that Dale and Krueger (2002) examined, college entrants in 1972 who were reinterviewed in 1995, but it's close enough to offer some insights.

In [3]:
library(tidyverse)
library(haven)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.6      ✔ purrr   0.3.4
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1
✔ readr   2.1.2      ✔ forcats 0.5.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



This is an extract I prepared specially for this purpose. The entire RAND version of the longitudinal file is big, over 1 GB in size. Berkeley's datahub is not configured to allow more than a gigabyte of memory per user, so loading the full file there would be problematic. If you want to use these data yourself:
* Navigate to [https://hrs.isr.umich.edu/](https://hrs.isr.umich.edu/) and register as a user
* Start with the RAND file; I think it's the easiest
* Download the data to your local machine and use RStudio

In [4]:
hrs_w4_earn_sibs = read_dta("hrs_w4_earn_sibs.dta")
head(hrs_w4_earn_sibs)

hhidpn,ragender,raedyrs,r4agey_m,r4livsib,raraceth,logr4iearn,r4exper,r4expersq
<dbl>,<dbl+lbl>,<dbl+lbl>,<dbl>,<dbl>,<dbl+lbl>,<dbl>,<dbl>,<dbl>
3020,2,16,59,1,1,8.006368,43,1849
10001010,1,12,58,1,1,,46,2116
10004010,1,16,58,1,1,,42,1764
10004040,2,12,52,2,1,11.461632,40,1600
10013040,2,13,50,2,1,11.225244,37,1369
10038040,2,16,55,1,1,10.819778,39,1521


The RAND file uses a very helpful variable naming convention: `rKvarname`, where K is the wave. Here, let's look at summary statistics for the variable `r4livsib`, which is the number of living siblings. For the people we'll look at, this is going to be very close to the number of siblings ever born.

In [5]:
summary(hrs_w4_earn_sibs$r4livsib)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   1.000   2.000   3.047   4.000  23.000      26 
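
The `rKvarname` convention makes it easy to pick out wave-specific variables by pattern. Here is a minimal sketch using a toy data frame built from the first three rows of the extract; `hrs_toy` and the helper `wave_var()` are hypothetical names introduced only for illustration, and the reading that `ra`-prefixed variables are not tied to a single wave is my assumption about this extract.

```r
# Toy data frame: the first three rows of the extract (a subset of columns).
hrs_toy <- data.frame(
  hhidpn   = c(3020, 10001010, 10004010),
  ragender = c(2, 1, 1),
  raedyrs  = c(16, 12, 16),
  r4agey_m = c(59, 58, 58),
  r4livsib = c(1, 1, 1)
)

# Wave-4 variables start with "r4"; variables starting with "ra"
# (assumed here to be wave-invariant) do not match this pattern.
wave4_vars <- grep("^r4", names(hrs_toy), value = TRUE)
print(wave4_vars)

# Hypothetical helper: build the name of a variable for wave k.
wave_var <- function(k, stem) paste0("r", k, stem)
print(wave_var(4, "livsib"))
```

With the full RAND file, the same pattern could be used to pull the corresponding variable from another wave, provided that variable exists in that wave.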

Let's call `mutate()` to add some indicator (dummy) variables: one for female gender identity and one for each of the race/ethnicity categories that are useful for summarizing the sample:

In [6]:
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, rafemale = ragender - 1)

In [9]:
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, rablacknh  = ifelse(raraceth == 2, 1, 0))
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, rahispanic = ifelse(raraceth == 3, 1, 0))
hrs_w4_earn_sibs <- mutate(hrs_w4_earn_sibs, raothernh  = ifelse(raraceth == 4, 1, 0))
head(hrs_w4_earn_sibs)

hhidpn,ragender,raedyrs,r4agey_m,r4livsib,raraceth,logr4iearn,r4exper,r4expersq,rafemale,rablacknh,rahispanic,raothernh
<dbl>,<dbl+lbl>,<dbl+lbl>,<dbl>,<dbl>,<dbl+lbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
3020,2,16,59,1,1,8.006368,43,1849,1,0,0,0
10001010,1,12,58,1,1,,46,2116,0,0,0,0
10004010,1,16,58,1,1,,42,1764,0,0,0,0
10004040,2,12,52,2,1,11.461632,40,1600,1,0,0,0
10013040,2,13,50,2,1,11.225244,37,1369,1,0,0,0
10038040,2,16,55,1,1,10.819778,39,1521,1,0,0,0
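
The indicator variables above can also be built in a single `mutate()` call. This is a sketch of an equivalent approach, not a change to the analysis: it uses a toy tibble holding the `ragender` and `raraceth` values from the six rows shown, and it assumes the codings implied above (`ragender == 2` for female; `raraceth` codes 2, 3, and 4 for the three categories, with code 1 as the omitted group).

```r
library(dplyr)

# Toy tibble: ragender and raraceth for the six rows printed by head().
hrs_toy <- tibble(
  ragender = c(2, 1, 1, 2, 2, 2),
  raraceth = c(1, 1, 1, 1, 1, 1)
)

# A logical comparison converted to numeric gives 1/0, and a missing
# input stays NA rather than silently becoming 0.
hrs_toy <- hrs_toy %>%
  mutate(
    rafemale   = as.numeric(ragender == 2),
    rablacknh  = as.numeric(raraceth == 2),
    rahispanic = as.numeric(raraceth == 3),
    raothernh  = as.numeric(raraceth == 4)
  )

print(hrs_toy)
```

Writing `as.numeric(ragender == 2)` rather than `ragender - 1` makes the intended coding explicit, which helps if a variable ever has more than two codes.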


Behind the scenes, I have created some standard "labor economics variables." One thing you can do in a log-wage regression is control for age and age-squared. You could also control for age group, with indicators for set ranges of age, maybe in 5-year age groups. You could also calculate what labor economists like as a baseline, which is a rough measure of years of "experience," calculated as age minus years of education:

$$
r4exper_i = r4age_i - raedyrs_i
$$
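
As a quick check on the definition, here is a sketch that recomputes experience and its square from the age and education values in the six rows shown earlier, and also builds the 5-year age groups mentioned as an alternative. The names `hrs_toy` and `agegrp` are hypothetical, and the choice of break points is only one reasonable option.

```r
# Age and years of education for the six rows printed by head().
hrs_toy <- data.frame(
  r4agey_m = c(59, 58, 58, 52, 50, 55),
  raedyrs  = c(16, 12, 16, 12, 13, 16)
)

# Rough experience measure: age minus years of education, and its square.
hrs_toy$r4exper   <- hrs_toy$r4agey_m - hrs_toy$raedyrs
hrs_toy$r4expersq <- hrs_toy$r4exper^2

# Alternative control: 5-year age groups [50,55), [55,60), [60,65).
# A factor like this enters lm() as a set of indicators automatically.
hrs_toy$agegrp <- cut(hrs_toy$r4agey_m, breaks = seq(50, 65, by = 5), right = FALSE)

print(hrs_toy)
```

The recomputed `r4exper` and `r4expersq` values match the columns in the extract for these rows.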

I also created a variable `r4expersq` by squaring this experience variable. Over a broad age range, what we typically see is that earnings rise and then plateau with age, so a quadratic in experience captures the typical earnings profile fairly well. The expectation is that the coefficient on the linear term should be positive and the coefficient on the squared term should be negative, so that the parabola opens downward. This isn't always true, especially if we limit our analysis to a particular age range rather than all working ages 20-64.
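
To see why the signs determine the shape, it may help to write out the slope of the quadratic. The symbols $\beta_1$ and $\beta_2$ are my notation for the coefficients on `r4exper` and `r4expersq`; they do not appear elsewhere in these notes.

$$
\frac{\partial \ln earnings_i}{\partial r4exper_i} = \beta_1 + 2 \beta_2 \ r4exper_i
\qquad \Rightarrow \qquad
r4exper^{*} = -\frac{\beta_1}{2 \beta_2}
$$

With $\beta_1 > 0$ and $\beta_2 < 0$, the slope starts positive and declines, so predicted log earnings peak at $r4exper^{*}$. If the signs are reversed, $r4exper^{*}$ is instead the bottom of a parabola that opens upward.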

Let's run this regression:
$$
\ln earnings_i = \alpha + \beta \ livingsiblings_i + B \cdot controls_i + e_i
$$

In [10]:
hrs_reg1 <- lm(logr4iearn ~ r4livsib + r4exper + r4expersq + raedyrs 
               + rafemale + rablacknh + rahispanic + raothernh, data = hrs_w4_earn_sibs)
summary(hrs_reg1)


Call:
lm(formula = logr4iearn ~ r4livsib + r4exper + r4expersq + raedyrs + 
    rafemale + rablacknh + rahispanic + raothernh, data = hrs_w4_earn_sibs)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.7206 -0.3432  0.1682  0.5447  4.2218 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 14.1498869  1.0477470  13.505  < 2e-16 ***
r4livsib    -0.0054531  0.0062996  -0.866 0.386746    
r4exper     -0.2162412  0.0504218  -4.289 1.84e-05 ***
r4expersq    0.0023144  0.0006044   3.829 0.000131 ***
raedyrs      0.0988495  0.0083438  11.847  < 2e-16 ***
rafemale    -0.6635883  0.0297004 -22.343  < 2e-16 ***
rablacknh    0.0285341  0.0435713   0.655 0.512581    
rahispanic  -0.1127790  0.0596236  -1.892 0.058628 .  
raothernh   -0.0536690  0.0953275  -0.563 0.573470    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9243 on 3985 degrees of freedom
  (2097 observations deleted due to missingness)
Multiple R-s

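Once we have controlled for experience (which stands in for age), years of education, gender identity, and race/ethnicity, the number of living siblings tells us essentially nothing about earnings: the coefficient on `r4livsib` is tiny (about -0.005) and statistically indistinguishable from zero (p = 0.39).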
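
To put the estimates on a more interpretable scale, here is a small sketch that works directly from the numbers printed in the `summary(hrs_reg1)` output. It builds an approximate 95% confidence interval for the sibling coefficient (using the normal critical value, an approximation that is harmless with this many degrees of freedom), converts two log-point coefficients into percent changes, and finds the turning point of the experience quadratic as minus the linear coefficient over twice the squared coefficient. All object names are hypothetical.

```r
# Estimates and standard errors copied from the regression output.
b_sib     <- -0.0054531; se_sib <- 0.0062996
b_female  <- -0.6635883
b_educ    <-  0.0988495
b_exper   <- -0.2162412
b_expersq <-  0.0023144

# Approximate 95% confidence interval for the sibling coefficient.
ci_sib <- b_sib + c(-1, 1) * qnorm(0.975) * se_sib
print(ci_sib)

# A coefficient b in a log-earnings regression implies a
# 100 * (exp(b) - 1) percent difference in earnings.
pct_female <- 100 * (exp(b_female) - 1)
pct_educ   <- 100 * (exp(b_educ) - 1)
print(c(female = pct_female, educ_per_year = pct_educ))

# Turning point of the quadratic in experience.
exper_turn <- -b_exper / (2 * b_expersq)
print(exper_turn)
```

The interval for `r4livsib` straddles zero, consistent with the large p-value. Note also that the experience coefficients here have the opposite signs from the usual expectation (negative linear term, positive squared term), so the turning point is a minimum rather than a peak, which is the kind of thing that can happen in a sample restricted to older ages.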

By contrast, the number of living siblings in 1998 definitely does appear to be correlated with years of education, controlling for gender and race/ethnicity. Each additional living sibling is associated with about a quarter of a year less schooling:

In [11]:
hrs_reg2 <- lm(raedyrs ~ r4livsib + rafemale + rablacknh + rahispanic + raothernh, data = hrs_w4_earn_sibs)
summary(hrs_reg2)


Call:
lm(formula = raedyrs ~ r4livsib + rafemale + rablacknh + rahispanic + 
    raothernh, data = hrs_w4_earn_sibs)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.8424  -1.4761  -0.2341   2.2911  10.3433 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.92918    0.06873 202.664  < 2e-16 ***
r4livsib    -0.25573    0.01488 -17.182  < 2e-16 ***
rafemale    -0.19734    0.07189  -2.745  0.00607 ** 
rablacknh   -0.73060    0.10315  -7.083 1.57e-12 ***
rahispanic  -3.39074    0.13002 -26.080  < 2e-16 ***
raothernh    0.16899    0.24001   0.704  0.48140    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.757 on 6056 degrees of freedom
  (29 observations deleted due to missingness)
Multiple R-squared:  0.181,	Adjusted R-squared:  0.1803 
F-statistic: 267.7 on 5 and 6056 DF,  p-value: < 2.2e-16
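
One way to connect the two regressions is a back-of-the-envelope version of the omitted variables bias formula: the sibling coefficient in a "short" earnings regression that leaves out education should be roughly the "long" coefficient plus the sibling effect on education times the return to education. The sketch below uses only the printed estimates. It is a rough approximation, not an exact result, because the two regressions use different samples (3,994 versus 6,062 observations) and `hrs_reg2` omits the experience controls. Object names are hypothetical.

```r
# Estimates copied from the two regression outputs.
b_sib_long  <- -0.0054531  # r4livsib in the earnings regression (hrs_reg1)
b_educ_long <-  0.0988495  # raedyrs in the earnings regression (hrs_reg1)
b_sib_educ  <- -0.25573    # r4livsib in the education regression (hrs_reg2)

# Indirect association running through education.
indirect <- b_sib_educ * b_educ_long

# Rough implied sibling coefficient if education were left out.
implied_short <- b_sib_long + indirect

print(c(indirect = indirect, implied_short = implied_short))
```

Under these assumptions, most of any raw association between siblings and earnings would run through education rather than through an independent effect of family size. Running the short regression directly, without `raedyrs`, would be the way to check this.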


Discuss! Are there other regressions you'd like to run?

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>