<h1>ECON 140R Class 04</h1>

Data from the 1974-1982 RAND Health Insurance Experiment (HIE) were unearthed by Aviva Aron-Dine, Liran Einav, and Amy Finkelstein (J. Econ. Perspect., 2013). Josh Angrist and J&#246;rn-Steffen Pischke provide an extract online at [Mastering Metrics](https://www.masteringmetrics.com/resources/).

Let's examine the data behind Panel A in Table 1.4, which reveals average levels of health care utilization across 5 types of care (the rows) for the "control group," people with catastrophic health insurance only (the leftmost column). In subsequent columns, the authors show us the average difference in the utilization measure in that row between one of the three "treatment arms" they argue are useful to consider (deductible, coinsurance, free), and the control group.

The objectives here are to get more experience with real data, and to notice that ordinary least squares regression with `lm()` is a very handy way to cut to the chase and test average differences across subgroups. A "small print" detail is that Angrist and Pischke are doing what's called <i>clustering standard errors at the family level</i>. This last point will definitely not be on any exams.

The main objective is to recognize that, given an outcome variable $y_i$ and group indicator variables $D^d_i$, $D^c_i$, and $D^f_i$ (one for each treatment arm, with the control group as the omitted category), this regression:

$$
y_i = \alpha + \beta^d \cdot D^d_i + \beta^c \cdot D^c_i + \beta^f \cdot D^f_i + \epsilon_i
$$

provides a very convenient way of testing the average differences:
* between the control group and group $d$: $\beta^d$
* between the control group and group $c$: $\beta^c$
* between the control group and group $f$: $\beta^f$
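
To see why each coefficient is a difference in means, take conditional expectations of the regression above. This is a short sketch that assumes the three indicators are mutually exclusive (each person is in at most one treatment arm) and that the control group is the case where all three equal zero:

$$
\begin{aligned}
E[y_i \mid \text{control}] &= \alpha \\
E[y_i \mid D^d_i = 1] &= \alpha + \beta^d \\
\Rightarrow \quad \beta^d &= E[y_i \mid D^d_i = 1] - E[y_i \mid \text{control}]
\end{aligned}
$$

The same argument gives $\beta^c$ and $\beta^f$, and the intercept $\alpha$ is the control group's average.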

Here's a clean PNG of Table 1.4:

<img src="MMtbl14.png" width="800" />

Let's load up <b>haven</b> and <b>tidyverse</b>:

In [3]:
library(haven)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



I have prepared an extract of the RAND HIE data underneath Table 1.4 Panel A in <i>Mastering Metrics</i>. These data include health care utilization outcomes across the four groups that Angrist and Pischke argue are usefully distinguishable, ordered here from least generous to most generous:

* Catastrophic plan
* Deductible plan
* Coinsurance plan
* Free plan

We have the five utilization measures shown in Table 1.4A here: `ftf` is face-to-face visits; `out_inf` is outpatient expenses; `totadm` is total hospital admissions; `inpdol_inf` is inpatient expenses; and `tot_inf` is total expenses.

In [4]:
table1_4a <- read_dta("table1_4.dta")

In [5]:
head(table1_4a, n = 100)

person,year,ftf,totadm,plantype,out_inf,inpdol_inf,tot_inf,famid,plan_free,plan_deduc,plan_coins,plan_catas
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
MA250247,1,0,0,4,36.3055,0.000,36.3055,100082,0,0,0,1
MA250247,2,4,0,4,275.2085,0.000,275.2085,100082,0,0,0,1
MA250247,3,0,0,4,0.0000,0.000,0.0000,100082,0,0,0,1
MA250247,4,0,0,4,0.0000,0.000,0.0000,100082,0,0,0,1
MA250247,5,0,0,4,0.0000,0.000,0.0000,100082,0,0,0,1
MA250255,1,0,0,4,0.0000,0.000,0.0000,100082,0,0,0,1
MA250255,2,0,0,4,0.0000,0.000,0.0000,100082,0,0,0,1
MA250255,3,1,0,4,33.6000,0.000,33.6000,100082,0,0,0,1
MA250255,4,0,0,4,0.0000,0.000,0.0000,100082,0,0,0,1
MA250255,5,0,0,4,0.0000,0.000,0.0000,100082,0,0,0,1


Let's create new data frames for each of the four groups using `filter()`. The shortened group names are:

* `plan_catas` = Catastrophic plan 
* `plan_deduc` = Deductible plan   
* `plan_coins` = Coinsurance plan  
* `plan_free`  = Free plan   

Copy and paste this code below and run it:

`table1_4a_catas <- filter(table1_4a, plan_catas == 1)`

`table1_4a_deduc <- filter(table1_4a, plan_deduc == 1)`

`table1_4a_coins <- filter(table1_4a, plan_coins == 1)`

`table1_4a_free  <- filter(table1_4a, plan_free  == 1)`

In [6]:
table1_4a_catas <- filter(table1_4a, plan_catas == 1)

table1_4a_deduc <- filter(table1_4a, plan_deduc == 1)

table1_4a_coins <- filter(table1_4a, plan_coins == 1)

table1_4a_free  <- filter(table1_4a, plan_free  == 1)

What we now have are 4 separate data frames for the 4 groups assigned to different insurance plans.

In STAT 20, you might have used `t.test()` to run a comparison between two groups. Let's run `t.test()` on the face-to-face visits `ftf` in the deductible group versus the catastrophic group. This should let us recover the two numbers at the upper left of Table 1.4A: the catastrophic-group mean and the deductible-minus-catastrophic difference.

`t.test(table1_4a_deduc$ftf, table1_4a_catas$ftf)`

In [7]:
t.test(table1_4a_deduc$ftf, table1_4a_catas$ftf)


	Welch Two Sample t-test

data:  table1_4a_deduc$ftf and table1_4a_catas$ftf
t = 1.5318, df = 7839.1, p-value = 0.1256
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.05388687  0.43921358
sample estimates:
mean of x mean of y 
 2.976766  2.784103 


Not exactly clear, is it? The $t$-statistic is 1.53, which in words means that the difference in means is about 1.5 times its standard error. With a $p$-value of 0.126, that's not big enough for us to reject the null hypothesis that the true difference is zero at the 5 percent level.

The printed output gives the two group means rather than their difference, but we can just type the subtraction into __R__. Here is the difference between those last two numbers in the output:

In [8]:
2.976766 - 2.784103

This is indeed the point estimate (0.19) of the average difference that appears at the upper left of Table 1.4A.

And then this, the difference divided by the $t$-stat, has to be the estimated standard error:

In [9]:
(2.976766 - 2.784103)/1.5318
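
The division above can be checked against the definition of the Welch standard error, $\sqrt{s_x^2/n_x + s_y^2/n_y}$. The sketch below does this on simulated counts rather than the HIE data, so that it runs on its own; `welch_se()` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Welch standard error of a difference in means, computed from its definition.
# welch_se() is a helper defined here for illustration only.
welch_se <- function(x, y) {
  sqrt(var(x) / length(x) + var(y) / length(y))
}

# Simulated stand-ins for two groups' visit counts (NOT the HIE data).
set.seed(140)
x <- rpois(500, lambda = 3.0)
y <- rpois(400, lambda = 2.8)

tt <- t.test(x, y)  # Welch's test is the default (var.equal = FALSE)
diff_means <- unname(tt$estimate[1] - tt$estimate[2])
se_from_t  <- diff_means / unname(tt$statistic)  # difference divided by the t-stat

c(difference = diff_means, se_by_hand = welch_se(x, y), se_from_t = se_from_t)

# In the notebook, the same helper could be applied to the real columns:
# welch_se(table1_4a_deduc$ftf, table1_4a_catas$ftf)
```

The last two entries agree, which confirms that dividing the difference by the $t$-statistic recovers the standard error that `t.test()` used.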

Unfortunately this value, about 0.126, is not the standard error (.25) that appears under the .19 at the upper left of Table 1.4A. What's going on? Stay tuned. Let's load in a new library, which will let us run a special version of `lm()` that will help reveal what's going on.

In [10]:
library(estimatr)

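First, let's run `lm_robust()` with options set to the baseline. The syntax is the same as it is for `lm()`, and we should recover the same results, as long as we set the standard errors to "classical" type. (One caution when reading the output: the `lm()` call in the executed cell below includes only `plan_deduc`, so its slope compares the deductible group with the other three groups pooled, and its intercept, 3.66, is the pooled mean of those three groups. With all three indicators, as in the code shown here, `lm()` reproduces the `lm_robust()` estimates and classical standard errors.)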

`reg_toprow <- lm(ftf ~ plan_deduc + plan_coins + plan_free, data = table1_4a)`

`summary(reg_toprow)`

`reg_toprowrob <- lm_robust(ftf ~ plan_deduc + plan_coins + plan_free, 
                           data = table1_4a, se_type = "classical")`

`summary(reg_toprowrob)`

In [11]:
reg_toprow <- lm(ftf ~ plan_deduc, data = table1_4a)

summary(reg_toprow)

reg_toprowrob <- lm_robust(ftf ~ plan_deduc + plan_coins + plan_free,                             
                           data = table1_4a, se_type = "classical")

summary(reg_toprowrob)


Call:
lm(formula = ftf ~ plan_deduc, data = table1_4a)

Residuals:
    Min      1Q  Median      3Q     Max 
 -3.658  -2.977  -1.658   0.342 140.342 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.65810    0.04996  73.218  < 2e-16 ***
plan_deduc  -0.68133    0.10990  -6.199 5.78e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.325 on 20201 degrees of freedom
Multiple R-squared:  0.001899,	Adjusted R-squared:  0.001849 
F-statistic: 38.43 on 1 and 20201 DF,  p-value: 5.781e-10



Call:
lm_robust(formula = ftf ~ plan_deduc + plan_coins + plan_free, 
    data = table1_4a, se_type = "classical")

Standard error type:  classical 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper    DF
(Intercept)   2.7841     0.1031  26.992 1.130e-157  2.58193   2.9863 20199
plan_deduc    0.1927     0.1419   1.358  1.745e-01 -0.08542   0.4707 20199
plan_coins    0.4811     0.1338   3.597  3.228e-04  0.21892   0.7433 20199
plan_free     1.6637     0.1282  12.979  2.288e-38  1.41245   1.9150 20199

Multiple R-squared:  0.01172 ,	Adjusted R-squared:  0.01157 
F-statistic: 79.86 on 3 and 20199 DF,  p-value: < 2.2e-16
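
The claim that a regression on group indicators simply reports differences in group means can be checked directly. The sketch below uses simulated data, not the HIE extract, so it runs without `table1_4.dta`; the data frame `sim` and its group means are made up for illustration, and only the indicator names mirror those in `table1_4a`.

```r
set.seed(140)

# Simulated stand-in data (NOT the HIE extract): four plan groups
# whose true mean visit counts are chosen arbitrarily.
plans <- c("catas", "deduc", "coins", "free")
true_means <- c(catas = 2.8, deduc = 3.0, coins = 3.3, free = 4.4)
sim <- data.frame(plan = sample(plans, 2000, replace = TRUE),
                  stringsAsFactors = FALSE)
sim$ftf <- rpois(nrow(sim), lambda = true_means[sim$plan])

# 0/1 indicators, mirroring plan_deduc, plan_coins, and plan_free.
sim$plan_deduc <- as.numeric(sim$plan == "deduc")
sim$plan_coins <- as.numeric(sim$plan == "coins")
sim$plan_free  <- as.numeric(sim$plan == "free")

fit <- lm(ftf ~ plan_deduc + plan_coins + plan_free, data = sim)

# Group means computed directly, as a named numeric vector.
group_means <- sapply(split(sim$ftf, sim$plan), mean)

# Intercept = control-group mean; each slope = group mean minus control mean.
by_hand <- c(group_means["catas"],
             group_means[c("deduc", "coins", "free")] - group_means["catas"])

rbind(regression = coef(fit), by_hand = by_hand)
```

The two rows match: the intercept is the control mean, and each coefficient is that group's mean minus the control mean. What the regression adds over doing this by hand is a standard error for each difference.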

Now let's explore what <i>clustering our standard errors at the family level</i> does to our estimates of the standard errors. Because there are families in these data, indexed by the `famid` variable, we might expect that the $\epsilon$'s that shock a person one way or another within a family might shock the rest of the family as well. Imagine a family car that breaks down, so nobody keeps their checkup appointments.

`reg_toprowcluster <- lm_robust(ftf ~ plan_deduc + plan_coins + plan_free, 
                                data = table1_4a, clusters = famid)`
                              
`summary(reg_toprowcluster)`

In [12]:
reg_toprowcluster <- lm_robust(ftf ~ plan_deduc + plan_coins + plan_free, 
                               data = table1_4a, clusters = famid)

summary(reg_toprowcluster)


Call:
lm_robust(formula = ftf ~ plan_deduc + plan_coins + plan_free, 
    data = table1_4a, clusters = famid)

Standard error type:  CR2 

Coefficients:
            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper    DF
(Intercept)   2.7841     0.1782 15.6238 4.389e-39  2.43318   3.1350 255.6
plan_deduc    0.1927     0.2468  0.7805 4.354e-01 -0.29217   0.6775 556.0
plan_coins    0.4811     0.2395  2.0091 4.502e-02  0.01071   0.9515 543.3
plan_free     1.6637     0.2482  6.7022 5.352e-11  1.17604   2.1514 520.4

Multiple R-squared:  0.01172 ,	Adjusted R-squared:  0.01157 
F-statistic: 18.45 on 3 and 2021 DF,  p-value: 8.271e-12

The next few cells go one step beyond the top row and look at total expenses, first in logs. Because `log(0)` is undefined, person-years with zero total expenses are set to `NA`, and the regressions drop them.

In [17]:
table1_4a <- mutate(table1_4a, logtot_inf = ifelse(tot_inf == 0, NA, log(tot_inf)))
head(table1_4a)

person,year,ftf,totadm,plantype,out_inf,inpdol_inf,tot_inf,famid,plan_free,plan_deduc,plan_coins,plan_catas,logtot_inf
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
MA250247,1,0,0,4,36.3055,0,36.3055,100082,0,0,0,1,3.591969
MA250247,2,4,0,4,275.2085,0,275.2085,100082,0,0,0,1,5.617529
MA250247,3,0,0,4,0.0,0,0.0,100082,0,0,0,1,
MA250247,4,0,0,4,0.0,0,0.0,100082,0,0,0,1,
MA250247,5,0,0,4,0.0,0,0.0,100082,0,0,0,1,
MA250255,1,0,0,4,0.0,0,0.0,100082,0,0,0,1,


In [21]:
reg_toprowcluster_logtot_inf <- lm(logtot_inf ~ plan_deduc + plan_coins + plan_free, 
                                   data = table1_4a)
summary(reg_toprowcluster_logtot_inf)


Call:
lm(formula = logtot_inf ~ plan_deduc + plan_coins + plan_free, 
    data = table1_4a)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1223 -0.9698 -0.0931  0.8249  6.5056 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.44875    0.02945 185.029  < 2e-16 ***
plan_deduc   0.16462    0.03989   4.127 3.69e-05 ***
plan_coins   0.11077    0.03711   2.985  0.00284 ** 
plan_free    0.35848    0.03516  10.197  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.478 on 15733 degrees of freedom
  (4466 observations deleted due to missingness)
Multiple R-squared:  0.008288,	Adjusted R-squared:  0.008098 
F-statistic: 43.83 on 3 and 15733 DF,  p-value: < 2.2e-16


In [23]:
reg_toprowcluster_logtot_inf <- lm_robust(logtot_inf ~ plan_deduc + plan_coins + plan_free, 
                              data = table1_4a, clusters = famid)
summary(reg_toprowcluster_logtot_inf)


Call:
lm_robust(formula = logtot_inf ~ plan_deduc + plan_coins + plan_free, 
    data = table1_4a, clusters = famid)

Standard error type:  CR2 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper    DF
(Intercept)   5.4488    0.04698 115.979 1.233e-205  5.35618   5.5413 229.4
plan_deduc    0.1646    0.06325   2.603  9.523e-03  0.04035   0.2889 504.0
plan_coins    0.1108    0.06112   1.812  7.057e-02 -0.00933   0.2309 469.3
plan_free     0.3585    0.05673   6.319  6.659e-10  0.24697   0.4700 425.1

Multiple R-squared:  0.008288 ,	Adjusted R-squared:  0.008098 
F-statistic: 16.27 on 3 and 1973 DF,  p-value: 1.902e-10

In [1]:
# lm_robust() lives in estimatr; reload it if the kernel was restarted
# (table1_4a must also be re-read with read_dta() in that case).
library(estimatr)

reg_toprowcluster_tot_inf <- lm_robust(tot_inf ~ plan_deduc + plan_coins + plan_free, 
                               data = table1_4a, clusters = famid)

summary(reg_toprowcluster_tot_inf)


Compare the clustered results for `ftf` above to the top row in Table 1.4A. What do you see?

<font color="red">
    This is exactly what we see in the top row of Table 1.4A, with the exception of the bracketed standard deviation [5.50] below the far left-hand number, 2.78, which is the intercept term here. We could get that SD by applying `sd()` to `ftf` within the control group alone (`plan_catas == 1`). The rest of the numbers in the row are the estimates shown here and their standard errors.
    </font>

Compare these results to the results without clustering standard errors at the family level. Which ones are larger?

<span style="color: red;">These standard errors, obtained when clustering errors at the family level, are larger than what we saw when we didn't cluster. This is all you need to be able to do for ECON 140: answer a question like this about something that you're observing.
It's OK to speculate a little, too. Clustering like this is similar to, but of course not exactly the same as, reducing the sample size. Here it raises the standard errors, as a smaller sample would.
    </span> 
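
The "smaller sample" intuition can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of the HIE results: the data are simulated, every family has the same size, and the cluster-robust variance is the basic CR0 sandwich computed by hand in base R, whereas the `lm_robust()` output above reports the CR2 variant, which applies a further small-sample adjustment.

```r
set.seed(140)

# Simulated families (NOT the HIE data): treatment is assigned by family,
# and each family shares a common shock, so outcomes within a family
# are correlated.
n_fam <- 400
m     <- 5                                   # members per family
famid <- rep(seq_len(n_fam), each = m)
treat <- rep(rbinom(n_fam, 1, 0.5), each = m)
y     <- 3 + 0.5 * treat + rep(rnorm(n_fam), each = m) + rnorm(n_fam * m)

fit   <- lm(y ~ treat)
X     <- model.matrix(fit)
e     <- residuals(fit)
bread <- solve(crossprod(X))                 # (X'X)^{-1}

# Classical SE: treats all n_fam * m observations as independent.
se_classical <- unname(sqrt(diag(vcov(fit)))["treat"])

# Cluster-robust (CR0) sandwich: add up the scores x_i * e_i within each
# family first, then form the outer product across families.
scores     <- rowsum(X * e, group = famid)   # one row per family
meat       <- crossprod(scores)
se_cluster <- unname(sqrt(diag(bread %*% meat %*% bread))["treat"])

# Rough benchmark: with equal family size m and within-family correlation
# rho, the variance is inflated by about 1 + (m - 1) * rho. Here rho = 0.5.
c(classical = se_classical,
  clustered = se_cluster,
  ratio     = se_cluster / se_classical,
  benchmark = sqrt(1 + (m - 1) * 0.5))
```

With five correlated members per family, the clustered standard error comes out well above the classical one, in the same direction as the jump from roughly .14 to .25 on `plan_deduc` above. The data behave as if they contained fewer independent observations than the row count suggests.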

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>