<h1>ECON 140R Class 06</h1>

Let's spend some time cementing a few key takeaways from Chapter 1 of <i>Mastering Metrics</i>. In the book, Angrist and Pischke show us a simple example with 2 individuals. Here, let's examine a simple example with 20 individuals, 10 each in the control and treatment groups.

<h2>Learning Objectives</h2>

* Run an ordinary least squares (OLS) regression $y_i = \alpha + \beta \ D_i + \epsilon_i$ using `lm()`
* See that when $y$ is an outcome and the indicator variable $D_i = 1$ marks assignment to the treatment group in an RCT (with $D_i = 0$ for control), OLS reveals:
    * $\alpha$ = average $y$ for the control group
    * $\beta$ = average difference in outcomes between treatment and control (treatment minus control)
* See a brief example of a "recode" in __R__ using `ifelse()`
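
To see why the second objective holds, here is a short sketch of the algebra. It assumes that the error term has mean zero in both groups, $E[\epsilon_i \mid D_i] = 0$, which is not stated explicitly above but is what random assignment is meant to deliver:

$$
\begin{aligned}
E[y_i \mid D_i = 0] &= \alpha \\
E[y_i \mid D_i = 1] &= \alpha + \beta \\
E[y_i \mid D_i = 1] - E[y_i \mid D_i = 0] &= \beta
\end{aligned}
$$

In a sample, OLS replaces each expectation with the corresponding group average.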

In [3]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.1
✔ readr   2.1.2     ✔ forcats 0.5.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



I've learned some __R__ to create a fictional dataset containing study participants in a randomized controlled trial (RCT). Here is code that ultimately creates a data frame for the 10-person control group that shows their first names; an (old-school) binary gender identity$^{\dagger}$; RCT group membership; and a <u>bad health outcome</u>, first coded numerically and then again as a string.

Zeros and ones are common codings for bad health outcomes in medicine and in health economics. You could think of `outcome == 1` as meaning that the participant catches COVID-19, for example. Another, more extreme example is that the bad health outcome could be death. Here, I've coded "poor health" as `outcome == 1` (equivalently, `outcomestr == "poor"`), with "good health" being the other state. (This is a common way of collapsing what is usually a 5-point scale for self-reported health: "excellent," "very good," "good," "fair," and "poor," with the first three categories usually mapped to "good" and the latter two categories mapped to "poor.")

Because of this coding, note that we are looking for a treatment that has a <b>negative</b> or protective effect: $\beta < 0$. A positive effect in this context would mean that the treatment is actually worsening health.

In [4]:
names   <- c("Alison", "Bradley", "Catherine", "David", "Esme", 
             "Frank", "Georgina", "Henry", "Inez", "James")

gender  <- c("female", "male", "female", "male", "female",
             "male", "female", "male", "female", "male")

group   <- c("control", "control", "control", "control", "control",
             "control", "control", "control", "control", "control")

outcome <- c(0,1,0,0,1,
             1,1,0,1,0)

outcomestr <- c("good", "poor", "good", "good", "poor",
                "poor", "poor", "good", "poor", "good")

# data.frame() constructs the data frame and labels the columns with the variable names
# Parentheses around the command also ask R to show it to us

(control_df <- data.frame(names, gender, group, outcome, outcomestr))


names,gender,group,outcome,outcomestr
<chr>,<chr>,<chr>,<dbl>,<chr>
Alison,female,control,0,good
Bradley,male,control,1,poor
Catherine,female,control,0,good
David,male,control,0,good
Esme,female,control,1,poor
Frank,male,control,1,poor
Georgina,female,control,1,poor
Henry,male,control,0,good
Inez,female,control,1,poor
James,male,control,0,good


Can you eyeball the average of `outcome` here for the control group? There are 10 people, and 5 of them have `outcome == 1`, so ...

In [None]:
outcome_avg_control =

In [9]:
names   <- c("Kate", "Larry", "Mallory", "Niles", "Olivia", 
             "Peter", "Quincy", "Rutger", "Stephanie", "Troy")

gender  <- c("female", "male", "female", "male", "female",
             "male", "female", "male", "female", "male")

group   <- c("treatment", "treatment", "treatment", "treatment", "treatment",
             "treatment", "treatment", "treatment", "treatment", "treatment")

outcome <- c(0,0,0,1,0,
             1,1,0,0,0)

outcomestr <- c("good", "good", "good", "poor", "good",
                "poor", "poor", "good", "good", "good")

(treatment_df <- data.frame(names, gender, group, outcome, outcomestr))


names,gender,group,outcome,outcomestr
<chr>,<chr>,<chr>,<dbl>,<chr>
Kate,female,treatment,0,good
Larry,male,treatment,0,good
Mallory,female,treatment,0,good
Niles,male,treatment,1,poor
Olivia,female,treatment,0,good
Peter,male,treatment,1,poor
Quincy,female,treatment,1,poor
Rutger,male,treatment,0,good
Stephanie,female,treatment,0,good
Troy,male,treatment,0,good


Can you eyeball the average of `outcome` here for the treatment group? There are 10 people, and 3 of them have `outcome == 1`, so ...



In [None]:
outcome_avg_treatment =

Randomization and the placebo might be rocket science, but nothing else here is. All we are really looking for is the average difference in outcomes between the two groups (treatment minus control), which you can eyeball in this simple example. Remember that if the treatment is protective against bad health, we expect to find a <i>negative</i> treatment effect here:

In [None]:
treatment_effect = outcome_avg_treatment - outcome_avg_control
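
If you want to check your eyeballed answers, here is a minimal sketch of the arithmetic. The two vectors are copied from the data frames above; the names `control_outcome`, `treatment_outcome`, and the `share_*` variables are my own and are not used elsewhere in this notebook.

```r
control_outcome   <- c(0, 1, 0, 0, 1, 1, 1, 0, 1, 0)
treatment_outcome <- c(0, 0, 0, 1, 0, 1, 1, 0, 0, 0)

# With a 0/1 outcome, the average is the count of ones divided by the group size
share_control   <- sum(control_outcome) / length(control_outcome)
share_treatment <- sum(treatment_outcome) / length(treatment_outcome)

# Treatment minus control: a negative value means fewer bad outcomes under treatment
share_difference <- share_treatment - share_control

print(c(control = share_control,
        treatment = share_treatment,
        difference = share_difference))
```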

Now we have two separate data frames for treatment and control. In order to run OLS using `lm()`, with a new indicator variable `treatment` for $D_i$, we need to append or add the datasets to one another. In your mind's eye, what we want to do is create a new matrix from these two existing matrices by stacking them vertically. Here's a way to do that with data frames in __R__:

In [6]:
fake_rct_df <- rbind(control_df, treatment_df)
fake_rct_df

names,gender,group,outcome,outcomestr
<chr>,<chr>,<chr>,<dbl>,<chr>
Alison,female,control,0,good
Bradley,male,control,1,poor
Catherine,female,control,0,good
David,male,control,0,good
Esme,female,control,1,poor
Frank,male,control,1,poor
Georgina,female,control,1,poor
Henry,male,control,0,good
Inez,female,control,1,poor
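James,male,control,0,good
Kate,female,treatment,0,good
Larry,male,treatment,0,good
Mallory,female,treatment,0,good
Niles,male,treatment,1,poor
Olivia,female,treatment,0,good
Peter,male,treatment,1,poor
Quincy,female,treatment,1,poor
Rutger,male,treatment,0,good
Stephanie,female,treatment,0,good
Troy,male,treatment,0,good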
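
`rbind()` stacks data frames only when their column names match. Here is a small sketch with two made-up, two-row data frames (`top_df` and `bottom_df` are hypothetical names, not part of the RCT data) that shows the stacking and a quick row-count check:

```r
top_df    <- data.frame(id = c("a", "b"), group = "control",   outcome = c(0, 1))
bottom_df <- data.frame(id = c("c", "d"), group = "treatment", outcome = c(1, 0))

# Stack bottom_df underneath top_df; columns are matched by name
stacked_df <- rbind(top_df, bottom_df)

# The stacked data frame should have as many rows as the two pieces combined
nrow(stacked_df) == nrow(top_df) + nrow(bottom_df)

# Count how many rows came from each group
table(stacked_df$group)
```

The same row-count check is a cheap way to confirm that `fake_rct_df` holds all 20 participants.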


Now let's create that indicator variable `treatment` that will serve as the right-hand side variable $D_i$ in the regression equation shown at the top of this notebook. Here is one way to do that by using `mutate()` to add a column for the variable `treatment`, which we create with a call to `ifelse()`. Here, `ifelse()` is told to return a 1 if `group == "treatment"` and a 0 otherwise.

In [7]:
fake_rct_df <- mutate(fake_rct_df, treatment = ifelse(group == "treatment", 1, 0))
fake_rct_df

names,gender,group,outcome,outcomestr,treatment
<chr>,<chr>,<chr>,<dbl>,<chr>,<dbl>
Alison,female,control,0,good,0
Bradley,male,control,1,poor,0
Catherine,female,control,0,good,0
David,male,control,0,good,0
Esme,female,control,1,poor,0
Frank,male,control,1,poor,0
Georgina,female,control,1,poor,0
Henry,male,control,0,good,0
Inez,female,control,1,poor,0
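James,male,control,0,good,0
Kate,female,treatment,0,good,1
Larry,male,treatment,0,good,1
Mallory,female,treatment,0,good,1
Niles,male,treatment,1,poor,1
Olivia,female,treatment,0,good,1
Peter,male,treatment,1,poor,1
Quincy,female,treatment,1,poor,1
Rutger,male,treatment,0,good,1
Stephanie,female,treatment,0,good,1
Troy,male,treatment,0,good,1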
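
To see the recode in isolation, here is a small sketch on a made-up four-element vector (`group_demo` and the other names below are hypothetical). It also shows an equivalent recode that converts the logical comparison directly to a number:

```r
group_demo <- c("control", "treatment", "treatment", "control")

# ifelse(test, yes, no) checks each element and returns 1 where the test is TRUE, 0 otherwise
treatment_demo <- ifelse(group_demo == "treatment", 1, 0)
print(treatment_demo)

# Equivalent recode: TRUE/FALSE converts to 1/0
treatment_alt <- as.numeric(group_demo == "treatment")
print(identical(treatment_demo, treatment_alt))
```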


Now let's run the OLS regression from above. I'll write it in its generic form first, then with variable names, and then the code cell will show its equivalent in __R__ using `lm()`:

$$
\begin{aligned}
y_i &= \alpha + \beta \ D_i + \epsilon_i \\
\text{outcome}_i &= \alpha + \beta \ \text{treatment}_i + \epsilon_i
\end{aligned}
$$

In [8]:
fake_rct_reg <- lm(outcome ~ treatment, data = fake_rct_df)
summary(fake_rct_reg)


Call:
lm(formula = outcome ~ treatment, data = fake_rct_df)

Residuals:
   Min     1Q Median     3Q    Max 
 -0.50  -0.35  -0.30   0.50   0.70 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   0.5000     0.1599   3.128  0.00582 **
treatment    -0.2000     0.2261  -0.885  0.38801   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5055 on 18 degrees of freedom
Multiple R-squared:  0.04167,	Adjusted R-squared:  -0.01157 
F-statistic: 0.7826 on 1 and 18 DF,  p-value: 0.388


Examine these results and compare them with the averages you eyeballed earlier. Below is another way of extracting the same information from the data, without using OLS:

In [7]:
mean(treatment_df$outcome)
mean(control_df$outcome)

mean(treatment_df$outcome) - mean(control_df$outcome)
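
The two approaches can also be compared directly in code. The sketch below rebuilds the 20 outcomes as plain vectors so that it runs on its own (`y`, `d`, `fit`, `control_mean`, and `mean_diff` are names I've made up for this check), then tests whether the OLS intercept matches the control-group mean and the slope matches the difference in means:

```r
y <- c(0, 1, 0, 0, 1, 1, 1, 0, 1, 0,   # control group
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0)   # treatment group
d <- rep(c(0, 1), each = 10)           # 0 = control, 1 = treatment

fit <- lm(y ~ d)

control_mean <- mean(y[d == 0])
mean_diff    <- mean(y[d == 1]) - control_mean

# all.equal() compares with a small numerical tolerance rather than exact equality
print(all.equal(unname(coef(fit)), c(control_mean, mean_diff)))
```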


<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>

<hr>

<i>ABANDON ALL HOPE, ye who enter here.</i>

In [17]:
#install.packages("mfx")

In [18]:
#library(mfx)

In [16]:
#(fake_rct_logit <- logitmfx(outcome ~ treatment, data = fake_rct_df))

<br>

<hr>

${\dagger}$ To learn more about 21st-century methods of measuring gender identity and related concepts, see the National Academies of Sciences, Engineering, and Medicine. 2022. <i>Measuring Sex, Gender Identity, and Sexual Orientation.</i> Washington, DC: The National Academies Press. [https://doi.org/10.17226/26424](https://doi.org/10.17226/26424).