<h1>ECON 140R Class 03</h1>

There are a lot of outcomes ($y$) that we would like to compare across groups defined in various ways. In <i>Mastering Metrics</i>, we started by comparing people across health insurance status in Chapter 1. Then later in that chapter, we see just how different the story becomes when we examine the RAND Health Insurance Experiment, in which participants were randomly assigned to different health insurance plans.

As we will see, when you can't run a randomized controlled trial (RCT) like that, you have to figure out what to do next. For now, let's explore some characteristics of respondents to the 14th wave of the U.S. [Health and Retirement Study (HRS)](https://hrs.isr.umich.edu/welcome-health-and-retirement-study), conducted mostly in 2018, before the COVID-19 pandemic. The HRS is a panel survey, with biennial waves, of Americans aged 50 and over.

Let's load up <b>haven</b> and <b>tidyverse</b>:

In [None]:
library(haven)
library(tidyverse)

I have prepared an extract from the RAND version of the HRS file, and I have created some additional helpful variables. Here is the extract:

In [None]:
hrs_heart <- read_dta("hrs_heart.dta")

In [None]:
head(hrs_heart)

There is a lot here. RAND names the variables `RwVARNAME`, where "w" is either the number of the wave or the letter "a" if the measure should be the same across all waves. For example, years of education rarely change after age 50, so that variable is named `RAEDYRS`. (Note that here, all variable names are in lowercase.)
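As a small sketch of how this naming convention can be put to work, the code below picks out variables by their prefix. The data frame `toy` and all of its values are made up for illustration; only the column names are taken from the extract.

```r
# Toy data frame with made-up values; column names follow the RAND convention.
toy <- data.frame(
  r14hearte   = c(0, 1, 0),
  r14agey_m   = c(62, 75, 58),
  raraceth    = c(1, 2, 1),
  rablacknonh = c(0, 1, 0)
)

# Variables measured in wave 14 start with "r14".
wave14_vars <- grep("^r14", names(toy), value = TRUE)

# Variables that should be the same across all waves start with "ra".
allwave_vars <- grep("^ra", names(toy), value = TRUE)

print(wave14_vars)
print(allwave_vars)
```

With <b>tidyverse</b> loaded, `select(hrs_heart, starts_with("r14"))` should do the same kind of selection on the real extract.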

Let us examine the measure in wave 14 of the binary response to the question, "Has a doctor (ever) told you that you had a heart attack, coronary heart disease, angina, congestive heart failure, or other heart problems?" This variable is `r14hearte`.

I have created a quadrichotomous (is that a word?) variable called `raraceth` that combines race and ethnicity information into four broad, useful, and common categories: 

(1) White non-Hispanic people

(2) Black non-Hispanic people

(3) Hispanic people

(4) Non-Hispanic people of other race

Especially in relatively small samples, this breakdown is helpful. It certainly doesn't cover all variation in lived experiences, but it's a good balance between parsimony and breadth.

Let us create separate data frames for white non-Hispanic (WNH) people, for Black non-Hispanic (BNH) people, and for those two groups together.

In [None]:
hrs_heart_wnh <- filter(hrs_heart, raraceth == 1)
hrs_heart_bnh <- filter(hrs_heart, raraceth == 2)
hrs_heart_wbnh <- filter(hrs_heart, raraceth == 1 | raraceth == 2)

There are many ways to proceed. Let's examine what `summary()` can show us. Let's start with white non-Hispanic people:

In [None]:
summary(hrs_heart_wnh)

Before we whittle this down a little, let's also look now at Black non-Hispanic people:

In [None]:
summary(hrs_heart_bnh)

Look at the "Mean" entries for WNH people and the variables `r14hearte` and `r14agey_m`, the latter of which is the respondent's age in years. (The code below might help make this clearer.) What do you see?

In [None]:
summary(hrs_heart_wnh$r14hearte)
summary(hrs_heart_wnh$r14agey_m)

Let's compute the same for BNH people:

In [None]:
summary(hrs_heart_bnh$r14hearte)
summary(hrs_heart_bnh$r14agey_m)

What do you see? Compare the means of these two variables across WNH and BNH people in a few words or sentences in the markdown field below:

Hold on to your hats.

It turns out that a linear regression of $y$ on an $x$ that is dichotomous (equal to 0 or 1) will tell us two things. The constant term $\alpha$ is the average $y$ for the group whose $x = 0$, and the slope $\beta$ is the difference between that group's average $y$ and the average for the other group (indicated by $x = 1$).

In other words, when we run
$$y = \alpha + \beta x + \epsilon$$

where $x$ is a dichotomous "indicator variable" (sometimes called a "dummy variable"), $\alpha$ is the average $y$ for the $x = 0$ group, and $\alpha + \beta$ is the average $y$ for the $x = 1$ group.
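To see this claim in action before turning to the HRS, here is a minimal sketch on simulated data. Everything in it (the sample size, the probabilities, the seed, and the names `x`, `y`, and `fit`) is made up for illustration and has nothing to do with the actual extract.

```r
set.seed(140)  # arbitrary seed so the made-up data are reproducible

# Simulated data: x is a 0/1 indicator, y is a binary outcome.
n <- 1000
x <- rbinom(n, size = 1, prob = 0.3)
y <- rbinom(n, size = 1, prob = ifelse(x == 1, 0.22, 0.28))

# Group averages computed directly.
mean_x0 <- mean(y[x == 0])
mean_x1 <- mean(y[x == 1])

# Regression of y on the indicator.
fit <- lm(y ~ x)
alpha_hat <- unname(coef(fit)["(Intercept)"])
beta_hat  <- unname(coef(fit)["x"])

# The intercept should match the x = 0 average,
# and the slope should match the difference in averages.
print(c(mean_x0 = mean_x0, alpha_hat = alpha_hat))
print(c(difference = mean_x1 - mean_x0, beta_hat = beta_hat))
```

The two numbers in each printed pair should agree up to rounding error, whatever seed you choose.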

Observe. Recall that `hrs_heart_wbnh` is the data frame that includes both WNH and BNH people. Let's run this regression:
$$heart = \alpha + \beta \cdot BNH + \epsilon$$

where `heart` is shorthand for `r14hearte` and `BNH` is the indicator variable for identifying as a Black non-Hispanic person, `rablacknonh`.

In [None]:
heartreg <- lm(r14hearte ~ rablacknonh, data = hrs_heart_wbnh)
summary(heartreg)

Interesting. Consider this arithmetic from the averages of `r14hearte` among BNH people and WNH people shown above:

In [None]:
0.2226 - 0.2847

Do you see similar numbers in the arithmetic and in the regression output? Where? Write some thoughts in the markdown field below.

Should we be worried that WNH people are on average about 5 years older than BNH people in these data? If we were, what could we do about it? Maybe try this:
$$heart = \alpha + \beta \cdot BNH + \delta \cdot age + \epsilon$$

In [None]:
heartreg_age <- lm(r14hearte ~ rablacknonh + r14agey_m, data = hrs_heart_wbnh)
summary(heartreg_age)

There is certainly a lot to see here. What happened to the coefficient on `BNH`, shown here in the "Estimate" column and on the `rablacknonh` row? Look at the far right. Do you see any asterisks in that row now?
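To build some intuition for why adding a control can move a coefficient, here is a sketch on simulated data. It is a made-up world, not the HRS: by assumption, the outcome depends on age only, and the two hypothetical groups differ in average age by about 5 years. It says nothing about what the true relationships are in the real data.

```r
set.seed(2018)  # arbitrary seed for reproducibility

n <- 5000
group <- rbinom(n, size = 1, prob = 0.3)        # hypothetical 0/1 indicator
age <- rnorm(n, mean = 70 - 5 * group, sd = 8)  # group 1 is about 5 years younger

# By construction, the made-up outcome depends on age only, not on group.
y <- 0.01 * age + rnorm(n, sd = 0.1)

# Regression without and with the age control.
short_fit <- lm(y ~ group)
long_fit  <- lm(y ~ group + age)

b_short <- unname(coef(short_fit)["group"])
b_long  <- unname(coef(long_fit)["group"])
print(c(without_age = b_short, with_age = b_long))
```

In this simulated world, the coefficient on `group` without the control picks up the age gap between the groups, and it shrinks toward zero once `age` is included. Whether something similar is going on in the HRS regressions is for you to judge from the output.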

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>