<h1>ECON 140R Class 05</h1>

Data from the 1974-1982 RAND Health Insurance Experiment (HIE) were unearthed by Aviva Aron-Dine, Liran Einav, and Amy Finkelstein (J. Econ. Perspect., 2013). Josh Angrist and J&#246;rn-Steffen Pischke provide an extract online at [Mastering Metrics](https://www.masteringmetrics.com/resources/).

Let's examine the data behind Table 1.3. The leftmost column shows baseline characteristics for the "control group," people with catastrophic health insurance only. Each subsequent column shows the average difference in that row's characteristic between one of the three "treatment arms" that Angrist and Pischke argue are useful to consider (deductible, coinsurance, free) and the control group.

The objectives here are to get more experience with real data, and to notice that ordinary least squares regression with `lm()` is a very handy way to cut to the chase and test average differences across subgroups. A "small print" detail is that Angrist and Pischke are doing what's called <i>clustering standard errors at the family level</i>. This last point will definitely not be on any exams.

The main objective is to recognize that, with an outcome variable $y_i$ and group indicator variables $D^d_i$, $D^c_i$, and $D^f_i$, this regression:

$$
y_i = \alpha + \beta^d \cdot D^d_i + \beta^c \cdot D^c_i + \beta^f \cdot D^f_i + \epsilon_i
$$

provides a very convenient way of testing the average differences:
* between the control group and group $d$: $\beta^d$
* between the control group and group $c$: $\beta^c$
* between the control group and group $f$: $\beta^f$
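
Here is a minimal sketch of that claim on simulated data, not the HIE extract. The group labels, sample size, and effect sizes are made up for illustration. It checks that the `lm()` intercept is the control-group mean and that each slope is the difference between a treatment-arm mean and the control-group mean.

```r
# Simulated data: four groups, with "catas" playing the role of the control group.
set.seed(5)
n <- 400
group <- sample(c("catas", "deduc", "coins", "free"), n, replace = TRUE)

# 0/1 indicators for the three treatment arms (hypothetical names).
D_d <- as.numeric(group == "deduc")
D_c <- as.numeric(group == "coins")
D_f <- as.numeric(group == "free")

# Outcome with made-up differences across arms plus noise.
y <- 10 + 1 * D_d - 2 * D_c + 0.5 * D_f + rnorm(n)

fit <- lm(y ~ D_d + D_c + D_f)

# Raw group means, and each arm's difference from the control mean.
means <- sapply(split(y, group), mean)
diffs <- means[c("deduc", "coins", "free")] - means[["catas"]]

print(coef(fit))  # intercept, then the three slopes
print(diffs)      # should match the three slopes
```

The regression adds something the raw means do not: a standard error and $t$-statistic for each difference, all in one table.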

Here's a clean PNG of Table 1.3:

<img src="MMtbl13.png" width="800" />

Let's load up <b>haven</b> and <b>tidyverse</b>:

In [None]:
library(haven)
library(tidyverse)

I have prepared an extract of the RAND HIE data underneath Table 1.3 in <i>Mastering Metrics</i>. These data include health care utilization outcomes across the four groups that Angrist and Pischke argue are usefully distinguishable, ordered here from least generous to most generous:

* Catastrophic plan
* Deductible plan
* Coinsurance plan
* Free plan

We have the baseline background characteristics, and we also have baseline and end-of-study health status. Variables named with an extra "x" at the end are the end-of-study measures. Note that `ghindx`, which already ends in "x", is baseline general health, so `ghindxx` is end-of-study general health.

In [None]:
table1_3 <- read_dta("table1_3.dta")

In [None]:
head(table1_3)

Let's create new data frames for each of the four groups using `filter()`. The shortened group names are:

* `plan_catas` = Catastrophic plan 
* `plan_deduc` = Deductible plan   
* `plan_coins` = Coinsurance plan  
* `plan_free`  = Free plan   

Copy and paste this code below and run it:

`table1_3_catas <- filter(table1_3, plan_catas == 1)`

`table1_3_deduc <- filter(table1_3, plan_deduc == 1)`

`table1_3_coins <- filter(table1_3, plan_coins == 1)`

`table1_3_free  <- filter(table1_3, plan_free  == 1)`

What we now have are 4 separate data frames for the 4 groups assigned to different insurance plans.
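
A quick sanity check is worth doing before leaning on those four data frames. The sketch below uses a tiny made-up data frame (`toy`, hypothetical) with the same four indicator names. It assumes the indicators are meant to be mutually exclusive and exhaustive, which the regression setup implies but which you should confirm on the real extract.

```r
# Toy stand-in for table1_3 with the four plan indicators (made-up rows).
toy <- data.frame(
  plan_catas = c(1, 0, 0, 0, 1, 0),
  plan_deduc = c(0, 1, 0, 0, 0, 0),
  plan_coins = c(0, 0, 1, 0, 0, 0),
  plan_free  = c(0, 0, 0, 1, 0, 1)
)
plan_cols <- c("plan_catas", "plan_deduc", "plan_coins", "plan_free")

# Each row should belong to exactly one plan.
n_plans <- rowSums(toy[, plan_cols])
all_one <- all(n_plans == 1)

# Group sizes; these should add up to the total number of rows.
group_sizes <- colSums(toy[, plan_cols])

print(all_one)
print(group_sizes)
print(sum(group_sizes) == nrow(toy))
```

To try it on the real data, replace `toy` with `table1_3`; missing values in the indicators, if there are any, would need handling first.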

In STAT 20, you might have used `t.test()` to run a comparison between two groups. Let's run `t.test()` on the share identifying as `female` in the deductible group versus the catastrophic group. This should get us something like the two numbers at the upper left of the table.

`t.test(table1_3_deduc$female, table1_3_catas$female)`

Not exactly clear, is it? The $t$-statistic is $-0.94$, which in words means that this difference is a bit less than one times its standard error in absolute value. That's not big enough for us to reject the null hypothesis that the true difference is zero. 

There's probably an option to `t.test()` that will show us this, but we can also just type it into __R__. Here is the difference between those last two numbers in the output:

In [None]:
0.5368899 - 0.5599473

This is indeed the point estimate ($-0.023$) of the average difference that appears at the upper left of Table 1.3A.
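
Rather than retyping numbers from the printout, the same pieces can be pulled from the object that `t.test()` returns. This is a sketch on simulated 0/1 data, not the HIE extract. The `stderr` component is an assumption about your R version: it should be available in R 3.6.0 and later.

```r
# Simulated 0/1 outcomes for two groups (made-up sizes and shares).
set.seed(20)
x <- rbinom(300, 1, 0.54)
y <- rbinom(500, 1, 0.56)

tt <- t.test(x, y)

# tt$estimate holds the two group means, in the order (x, y).
diff_means <- unname(tt$estimate[1] - tt$estimate[2])

# The t-statistic is the difference divided by its standard error.
t_stat <- unname(tt$statistic)

# The standard error that t.test() used (R >= 3.6.0).
se_hat <- tt$stderr

print(diff_means)
print(t_stat)
print(se_hat)
print(diff_means / t_stat)  # same as se_hat
```

The by-hand division that follows in the text is the same calculation as the last line here.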

And then this, the difference divided by the $t$-stat, has to be the estimated standard error:

In [None]:
(0.5368899 - 0.5599473)/-0.93539

Unfortunately this (roughly .025) is not the standard error (.016) that appears under the $-0.023$ at the upper left of Table 1.3A. What's going on? Let's load a new library, <b>estimatr</b>, which will let us run a special version of `lm()` that will help reveal what's going on.

In [None]:
library(estimatr)

First, let's run `lm_robust()` with options set to the baseline. The syntax is the same as it is for `lm()`, and we should recover the same results, as long as we set the standard errors to "classical" type.

`reg_toprow <- lm(female ~ plan_deduc + plan_coins + plan_free, data = table1_3)`

`summary(reg_toprow)`

`reg_toprowrob <- lm_robust(female ~ plan_deduc + plan_coins + plan_free, 
                           data = table1_3, se_type = "classical")`

`summary(reg_toprowrob)`

Now let's explore what <i>clustering our standard errors at the family level</i> does to our estimates of the standard errors. Because there are families in these data, indexed by the `famid` variable, we might expect that the $\epsilon$'s that shock a person one way or another within a family might shock the rest of the family as well. Imagine a family car that breaks down, so nobody keeps their checkup appointments.

`reg_toprowcluster <- lm_robust(female ~ plan_deduc + plan_coins + plan_free, 
                                data = table1_3, clusters = famid)`
                              
`summary(reg_toprowcluster)`

Compare these results to the top row in Table 1.3A. What do you see?

Compare these results to the results without clustering standard errors at the family level. Which ones are larger?
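
To see what clustering does mechanically, here is a sketch that builds a cluster-robust standard error by hand on simulated data, using only base R. Everything in it is made up for illustration: the family sizes, the family-level treatment `D`, and the shared family shock. It computes the plain "CR0" sandwich formula. `lm_robust()` documents its clustered default as `CR2`, a small-sample-adjusted variant, so its numbers would differ slightly from this unadjusted version.

```r
# Simulated families: 200 families of 4, treatment assigned by family.
set.seed(140)
n_fam <- 200
fam_size <- 4
famid <- rep(seq_len(n_fam), each = fam_size)

D <- rbinom(n_fam, 1, 0.5)[famid]   # same treatment for everyone in a family
u <- rnorm(n_fam)[famid]            # shock shared by the whole family
y <- 1 + 0.5 * D + u + rnorm(n_fam * fam_size)

fit <- lm(y ~ D)
X <- model.matrix(fit)
e <- resid(fit)

# Classical standard errors, as reported by summary(fit).
se_classical <- sqrt(diag(vcov(fit)))

# CR0 sandwich: (X'X)^{-1} [ sum_g (X_g' e_g)(X_g' e_g)' ] (X'X)^{-1}
bread <- solve(crossprod(X))
scores <- rowsum(X * e, group = famid)  # one row of summed x_i * e_i per family
meat <- crossprod(scores)
se_cluster <- sqrt(diag(bread %*% meat %*% bread))

print(rbind(classical = se_classical, clustered = se_cluster))
```

In this simulation the family shock is positive by construction, so residuals within a family move together and the clustered standard error on `D` comes out larger than the classical one. Clustering does not always push in that direction: it depends on how the residuals are correlated within clusters.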

<h6>A pause to reflect</h6>
Notice that we ran this model:
$$
female_i = \alpha + \beta^d \cdot D^d_i + \beta^c \cdot D^c_i + \beta^f \cdot D^f_i + \epsilon_i
$$

where the $D$'s are the 0/1 indicator variables showing membership in the treatment arms (deductible, coinsurance, free).

Does it surprise or concern you at all that the $y$-variable, `female`, is <i>also</i> a 0/1 indicator variable?

<h2>BONUS ROUND</h2>

There are many other interesting things to look at here. We can also examine how baseline (before randomization) or end-of-study outcomes vary with individual characteristics. That is a break from the RCT's advantages, because those characteristics were not randomly assigned, but it gives us practice and potentially some interesting thought experiments.

Here's a reproduction of the top line in panel B, where we are looking at the "General health index" `ghindx` across control and treatment arms:

`reg_ghindxcluster <- lm_robust(ghindx ~ plan_deduc + plan_coins + plan_free,                                  
                               data = table1_3, clusters = famid)`

`summary(reg_ghindxcluster)`

It looks like we've got the right variable. Maybe we can try this model of health:

$$
ghindx_{i0} = \alpha + \beta^a \cdot age_i + \beta^f \cdot female_i + \beta^{bh} \cdot blackhisp_i + \beta^e \cdot educ_i +
\beta^i \cdot income_{i0} + \epsilon_i
$$

where I'm using $i0$ to refer to the measure of a variable for person $i$ at time $0$, meaning before the RCT. 

`reg_ghindx_yc <- lm_robust(ghindx ~ age + female + blackhisp + educper + income1cpi,                                  
                               data = table1_3, clusters = famid)`

`summary(reg_ghindx_yc)`

Write about what you see here!

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>