<h1>ECON 140R Class 04</h1>

Data from the 1974-1982 RAND Health Insurance Experiment (HIE) were unearthed by Aviva Aron-Dine, Liran Einav, and Amy Finkelstein (J. Econ. Perspect., 2013). Josh Angrist and Jörn-Steffen Pischke provide an extract online at [Mastering Metrics](https://www.masteringmetrics.com/resources/).

Let's examine the data behind Panel A in Table 1.4, which reveals average levels of health care utilization across five types of care (the rows) for the "control group," people with catastrophic health insurance only (the leftmost column). Each subsequent column shows, for the utilization measure in that row, the average difference between one of the three "treatment arms" the authors argue are useful to consider (deductible, coinsurance, free) and the control group.

The objectives here are to get more experience with real data, and to notice that ordinary least squares regression with `lm()` is a very handy way to cut to the chase and test average differences across subgroups. A "small print" detail is that Angrist and Pischke are doing what's called <i>clustering standard errors at the family level</i>. This last point will definitely not be on any exams.

The main objective is to recognize that, given an outcome variable $y_i$ and group indicator variables $D^d_i$, $D^c_i$, and $D^f_i$ (each equal to 1 if person $i$ is in the deductible, coinsurance, or free group, respectively, and 0 otherwise), this regression:

$$
y_i = \alpha + \beta^d \cdot D^d_i + \beta^c \cdot D^c_i + \beta^f \cdot D^f_i + \epsilon_i
$$

provides a very convenient way of testing the average differences:
* between the control group and group $d$: $\beta^d$
* between the control group and group $c$: $\beta^c$
* between the control group and group $f$: $\beta^f$
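To see why this works, here is a minimal sketch on made-up toy data (not the RAND HIE extract; the names `toy`, `D_d`, `D_c`, and `D_f` are placeholders). With the control group as the omitted category, the intercept $\alpha$ is just the control-group mean, and each $\beta$ is that group's mean minus the control-group mean.

```r
# Toy data, made up for illustration (not the RAND HIE extract)
toy <- data.frame(
  y     = c(2, 4, 3,  5, 7,  1, 3,  6, 8, 10),
  group = c("catas", "catas", "catas",
            "deduc", "deduc",
            "coins", "coins",
            "free", "free", "free")
)

# Indicator (dummy) variables; the catastrophic group is the omitted category
toy$D_d <- as.numeric(toy$group == "deduc")
toy$D_c <- as.numeric(toy$group == "coins")
toy$D_f <- as.numeric(toy$group == "free")

fit <- lm(y ~ D_d + D_c + D_f, data = toy)

# Group means computed directly, for comparison
group_means <- tapply(toy$y, toy$group, mean)

coef(fit)                            # intercept = 3 (control mean), D_d = 3, D_c = -1, D_f = 5
group_means - group_means["catas"]   # the same differences, group by group
```

The regression also reports a standard error and $t$-statistic for each difference, which is what makes it a convenient testing tool.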

Here's a clean PNG of Table 1.4:

<img src="MMtbl14.png" width="800" />

Let's load up <b>haven</b> and <b>tidyverse</b>:

In [1]:
library(haven)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.1
✔ readr   2.1.2     ✔ forcats 0.5.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



I have prepared an extract of the RAND HIE data underneath Table 1.4 Panel A in <i>Mastering Metrics</i>. These data include health care utilization outcomes across the four groups that Angrist and Pischke argue are usefully distinguishable, ordered here from least generous to most generous:

* Catastrophic plan
* Deductible plan
* Coinsurance plan
* Free plan

We have the five utilization measures shown in Table 1.4A here: `ftf` is face-to-face visits; `out_inf` is outpatient expenses; `totadm` is total hospital admissions; `inpdol_inf` is inpatient expenses; and `tot_inf` is total expenses.

In [11]:
table1_4a <- read_dta("table1_4.dta")

In [10]:
head(table1_4a)

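| person | year | ftf | totadm | plantype | out_inf | inpdol_inf | tot_inf | famid | plan_free | plan_deduc | plan_coins | plan_catas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| MA250247 | 1 | 0 | 0 | 4 | 36.3055 | 0 | 36.3055 | 100082 | 0 | 0 | 0 | 1 |
| MA250247 | 2 | 4 | 0 | 4 | 275.2085 | 0 | 275.2085 | 100082 | 0 | 0 | 0 | 1 |
| MA250247 | 3 | 0 | 0 | 4 | 0.0 | 0 | 0.0 | 100082 | 0 | 0 | 0 | 1 |
| MA250247 | 4 | 0 | 0 | 4 | 0.0 | 0 | 0.0 | 100082 | 0 | 0 | 0 | 1 |
| MA250247 | 5 | 0 | 0 | 4 | 0.0 | 0 | 0.0 | 100082 | 0 | 0 | 0 | 1 |
| MA250255 | 1 | 0 | 0 | 4 | 0.0 | 0 | 0.0 | 100082 | 0 | 0 | 0 | 1 |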


Let's create new data frames for each of the four groups using `filter()`. The shortened group names are:

* `plan_catas` = Catastrophic plan 
* `plan_deduc` = Deductible plan   
* `plan_coins` = Coinsurance plan  
* `plan_free`  = Free plan   

Copy and paste this code below and run it:

`table1_4a_catas <- filter(table1_4a, plan_catas == 1)`

`table1_4a_deduc <- filter(table1_4a, plan_deduc == 1)`

`table1_4a_coins <- filter(table1_4a, plan_coins == 1)`

`table1_4a_free  <- filter(table1_4a, plan_free  == 1)`

What we now have are 4 separate data frames for the 4 groups assigned to different insurance plans.
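Before comparing groups, it can be worth checking that the four indicators split the sample cleanly. Here is a minimal sketch on a made-up stand-in for `table1_4a` (only the column names match the real extract; the values are invented), written in base R so that it runs without extra packages.

```r
# Toy stand-in for table1_4a (values made up; only the column names match)
toy <- data.frame(
  ftf        = c(0, 4, 2, 1, 3, 5),
  plan_catas = c(1, 1, 0, 0, 0, 0),
  plan_deduc = c(0, 0, 1, 0, 0, 0),
  plan_coins = c(0, 0, 0, 1, 0, 0),
  plan_free  = c(0, 0, 0, 0, 1, 1)
)

plan_cols <- c("plan_catas", "plan_deduc", "plan_coins", "plan_free")

# Every row should belong to exactly one plan group
all(rowSums(toy[, plan_cols]) == 1)

# Group sizes; these should add up to the number of rows
n_by_plan <- colSums(toy[, plan_cols])
n_by_plan
sum(n_by_plan) == nrow(toy)
```

Running the same two checks on the real `table1_4a` would confirm that the four filtered data frames together account for every row exactly once.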

In STAT 20, you might have used `t.test()` to run a comparison between two groups. Let's run `t.test()` on the face-to-face visits `ftf` in the deductible group versus the catastrophic group. This should get us something like the two numbers in the table at upper left.

`t.test(table1_4a_deduc$ftf, table1_4a_catas$ftf)`

Not exactly clear, is it? The $t$-statistic is 1.53, which in words means that this difference is about 1.5 times its standard error. That's not big enough for us to reject the null hypothesis that the true difference is zero at conventional significance levels (at the 5 percent level we would need a $t$-statistic of roughly 2).

There's probably an option to `t.test()` that will show us the difference in means directly, but we can also just type it into __R__. Here is the difference between the two group means reported at the bottom of the output:

In [15]:
2.976766 - 2.784103

This is indeed the point estimate (0.19) of the average difference that appears at the upper left of Table 1.4A.

And because the $t$-statistic is the difference divided by its standard error, the difference divided by the $t$-statistic has to be the estimated standard error:

In [16]:
(2.976766 - 2.784103)/1.5318
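As an aside, the object that `t.test()` returns stores these pieces, so they can be pulled out rather than retyped. Here is a minimal sketch on made-up vectors (not the RAND HIE data); it assumes R 3.6.0 or later, where the result includes a `stderr` component.

```r
# Made-up outcomes for two groups (not the RAND HIE data)
x <- c(0, 2, 4, 6, 3)      # stand-in for one group's outcome
y <- c(1, 1, 2, 4, 2, 2)   # stand-in for the comparison group

tt <- t.test(x, y)   # Welch two-sample t-test, the default

diff_means <- unname(tt$estimate[1] - tt$estimate[2])  # mean of x minus mean of y
se_diff    <- tt$stderr                                 # standard error of the difference
t_stat     <- unname(tt$statistic)                      # the t-statistic

diff_means
se_diff
diff_means / t_stat   # the same number as se_diff
```

On the real data, the same three lines applied to `t.test(table1_4a_deduc$ftf, table1_4a_catas$ftf)` should reproduce the calculation done by hand here.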

Unfortunately, the standard error we backed out of the `t.test()` output on the real data (about 0.13) is not the standard error (.25) that appears under the .19 at the upper left of Table 1.4A. What's going on? Stay tuned. Let's load in a new library, which will let us run a special version of `lm()` that will help reveal the answer.

In [17]:
library(estimatr)

First, let's run plain `lm()` and then `lm_robust()` with options set to the baseline. The syntax of `lm_robust()` is the same as it is for `lm()`, and we should recover the same results as long as we set the standard errors to the "classical" type.

`reg_toprow <- lm(ftf ~ plan_deduc + plan_coins + plan_free, data = table1_4a)`

`summary(reg_toprow)`

`reg_toprowrob <- lm_robust(ftf ~ plan_deduc + plan_coins + plan_free, 
                           data = table1_4a, se_type = "classical")`

`summary(reg_toprowrob)`

Now let's explore what <i>clustering our standard errors at the family level</i> does to our estimates of the standard errors. Because there are families in these data, indexed by the `famid` variable, we might expect that the $\epsilon$'s that shock a person one way or another within a family might shock the rest of the family as well. Imagine a family car that breaks down, so nobody keeps their checkup appointments.

`reg_toprowcluster <- lm_robust(ftf ~ plan_deduc + plan_coins + plan_free, 
                                data = table1_4a, clusters = famid)`
                              
`summary(reg_toprowcluster)`

Compare these results to the top row in Table 1.4A. What do you see?

Compare these results to the results without clustering standard errors at the family level. Which ones are larger?
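For intuition about where a gap like that can come from, here is a minimal simulation sketch in base R. Everything in it is made up (the number of families, the family size, the shocks), and the hand-built formula is the plain cluster "sandwich" with no small-sample correction. As far as I know, `lm_robust()` applies a small-sample adjustment by default when `clusters` is set (its documentation calls it CR2), so its numbers would differ a bit from this formula.

```r
set.seed(140)  # simulated data, not the RAND HIE extract

n_fam    <- 200   # hypothetical number of families
fam_size <- 4     # hypothetical members per family
famid    <- rep(seq_len(n_fam), each = fam_size)

# Treatment is assigned family by family, so everyone in a family shares it
treat_fam <- rep(c(0, 1), length.out = n_fam)
treat     <- treat_fam[famid]

# Outcome = family-wide shock + individual shock (true treatment effect is zero)
fam_shock <- rnorm(n_fam)[famid]
y <- 3 + fam_shock + rnorm(n_fam * fam_size)

fit <- lm(y ~ treat)
X   <- model.matrix(fit)
e   <- residuals(fit)

# Classical standard errors, as reported by summary(fit)
se_classical <- sqrt(diag(vcov(fit)))

# Cluster-robust "sandwich" without small-sample correction:
# (X'X)^{-1} [ sum over families of (X_g' e_g)(X_g' e_g)' ] (X'X)^{-1}
bread  <- solve(crossprod(X))
scores <- rowsum(X * e, group = famid)   # one row per family: X_g' e_g
meat   <- crossprod(scores)
se_cluster <- sqrt(diag(bread %*% meat %*% bread))

rbind(classical = se_classical, clustered = se_cluster)
```

The classical formula treats all 800 simulated observations as independent pieces of information. Because the family shock is shared, there is less independent information than that, and the clustered formula accounts for it.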

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>