# POLSCI 3

## Week 4, Lecture Notebook 2: Revisiting Omitted Variable/Selection Bias

We'll continue using the wellness data from earlier this week, although this time we'll be using the real data. This means we won't be able to see both potential outcomes for everyone -- only the outcome that actually happened given the treatment they actually got.

### Revisiting the Data

Let's read in the wellness dataset and recap what each of the variables represents.

In [None]:
# Do not edit this cell, just run it.
library(testthat)
wellness <- read.csv('ps3_wellness_real.csv')
head(wellness)

This dataset contains real data on participants in a wellness program. Each row represents a unique respondent, and the data measures medical expenditure before and after the wellness treatment. Here is more information about the variables:

`id`: Respondent ID (anonymized identifier for each respondent)

`treat`: Whether the wellness program (the treatment) was offered to the participant (`1` = was offered the wellness program, `0` = wasn't offered the wellness program). This is randomly assigned.

`participate`: Whether the respondent **actually participated** in the wellness program (`1` = actually participated, `0` = didn't participate)

`baseline`: Monthly average medical costs at baseline; that is, before the program started.

`outcome_post`: Monthly cost of medical care for this person after the workplace wellness program started (regardless of whether they participated or not).

### Groups in the dataset

The researchers first randomized people to have the opportunity to participate in the wellness program or not. Then, among the people who had the chance to participate, some chose to do so and some chose not to do so. This means there are three groups of people:

1. `treat == 0`, people assigned to the control group and who therefore didn't have the chance to participate (so `participate == 0` for all of them, too)
2. `treat == 1` and `participate == 0`, people who had the opportunity to participate and chose **not** to
3. `treat == 1` and `participate == 1`, people who had the opportunity to participate and **did** choose to
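
One way to see these three groups is to cross-tabulate assignment against participation. The sketch below uses a small made-up data frame (the name `toy` and all of its values are assumptions for illustration, not the wellness data); the same `table()` call should work on any data frame with `treat` and `participate` columns.

```r
# Toy data for illustration only; these values are made up
# and are NOT taken from the wellness dataset.
toy <- data.frame(
  treat       = c(0, 0, 0, 1, 1, 1, 1, 1),
  participate = c(0, 0, 0, 0, 0, 1, 1, 1)
)

# Cross-tabulate assignment (rows) against participation (columns).
group.counts <- table(treat = toy$treat, participate = toy$participate)
print(group.counts)

# The cell for treat == 0 and participate == 1 is empty by design:
# nobody in the control group was offered the program.
group.counts["0", "1"]
```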

---------

**Question 1.** Let's think about what groups to compare to understand the effect of this wellness program. One idea someone might have is to compare groups 2 and 3. So that we can do this, create two new subsets in the dataset:

- a subset called `chose.not.to.participate` that includes group 2 above: the people who 1) were assigned to the treatment group, and so had a chance to participate, but 2) chose not to
- a subset called `chose.to.participate` that includes group 3 above: the people who 1) were assigned to the treatment group and 2) *did* choose to participate
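
As a reminder of the general pattern (not the answer itself), `subset()` can filter on two conditions at once by joining them with `&`. The sketch below uses a made-up data frame; the names `toy`, `invited`, and `attended` are hypothetical and do not appear in the wellness data.

```r
# Hypothetical toy data; the column names `invited` and `attended`
# are made up for this sketch.
toy <- data.frame(
  id       = 1:6,
  invited  = c(0, 0, 1, 1, 1, 1),
  attended = c(0, 0, 0, 1, 1, 0)
)

# Keep only rows where BOTH conditions hold; `&` means "and".
invited.and.attended <- subset(toy, invited == 1 & attended == 1)

# Number of rows that satisfy both conditions.
nrow(invited.and.attended)
```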

In the cell below, make those subsets.


In [None]:
chose.not.to.participate <- NULL # YOUR CODE HERE
chose.to.participate <- NULL # YOUR CODE HERE

In [None]:
. = ottr::check("tests/q1.R")

------

**Question 2.** Next, let's compare outcomes (medical costs after the wellness program started) between those who chose to participate in the program and those who chose not to.

- On the first line below, take the mean of `outcome_post` in `chose.not.to.participate`.
- On the second line below, take the mean of `outcome_post` in `chose.to.participate`.
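
As a quick refresher, the mean of one column is computed by pulling the column out with `$` and passing it to `mean()`. This is a minimal sketch on made-up data; the names `toy`, `group`, and `cost` are hypothetical.

```r
# Made-up data for illustration; not the wellness dataset.
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  cost  = c(100, 300, 50, 150)
)

# First restrict to one group, then average a column within it.
group.a <- subset(toy, group == "a")
mean.cost.a <- mean(group.a$cost)
mean.cost.a
```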


In [None]:
mean.spending.non.participants <- NULL # YOUR CODE HERE
mean.spending.participants <- NULL # YOUR CODE HERE

# The next line will print a summary of what you did; you do not need to change it.
paste0("Wellness program participants spent ", round(mean.spending.participants), " dollars on medical care after the program started. ",
       "People who did not participate spent ", round(mean.spending.non.participants), " dollars on medical care after the program started. ",
      "On average, wellness program participants therefore spent ", round(mean.spending.non.participants - mean.spending.participants),
      " dollars less on medical care than did non-participants in the months after the program.")

In [None]:
. = ottr::check("tests/q2.R")

<!-- BEGIN QUESTION -->

-------

**Question 3.** If we want to know the effect of the wellness program on medical costs, is it a good idea to compare the medical costs of people who choose to participate in the program with those of people who choose not to? Why or why not?

*Please limit your answer to 2-3 sentences.*


_Type your answer here, replacing this text._

<!-- END QUESTION -->

------

**Question 4.** Remember that the dataset also contains the `baseline` variable, which measures how much people spent on medical care **before** the wellness program began. Let's compare how much the people who *later* chose to participate, and the people who later chose not to, spent on medical care *before* the wellness program even started.

- On the first line below, take the mean of `baseline` in `chose.not.to.participate`.
- On the second line below, take the mean of `baseline` in `chose.to.participate`.


In [None]:
mean.baseline.spending.non.participants <- NULL # YOUR CODE HERE
mean.baseline.spending.participants <- NULL # YOUR CODE HERE

# The next line will print a summary of what you did; you do not need to change it.
paste0("On average, wellness program participants spent ", round(mean.baseline.spending.participants),
       " dollars on medical care before the program started. ",
       "On average, people who did not participate spent ", round(mean.baseline.spending.non.participants),
       " dollars on medical care before the program started. ",
      "On average, wellness program participants therefore spent ",
       round(mean.baseline.spending.non.participants - mean.baseline.spending.participants),
      " dollars per month less on medical care before the wellness program started.")

In [None]:
. = ottr::check("tests/q4.R")

<!-- BEGIN QUESTION -->

------

**Question 5a.** Tell a story about why the results in Question 4 might look like they do. Why is there such a big difference between these groups to start with, before the program even started?

*Please limit your answer to 1-3 sentences.*


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

-----

**Question 5b.** What do your answers to Questions 4 and 5a indicate about whether it is a good idea to measure the effect of the wellness program by comparing people who choose to participate with people who choose not to?

*Please limit your answer to 1-3 sentences.*


_Type your answer here, replacing this text._

<!-- END QUESTION -->

### P.S. What if we analyze the data the right way?

A teaser: Next week, we'll start to learn how to analyze experiments the *right* way. Below is code that analyzes the data from <a href="https://academic.oup.com/qje/article/134/4/1747/5550759?login=true" target="_blank">the study</a> that way: by comparing the randomly assigned treatment and control groups.

In [None]:
# This is medical spending among those randomly assigned to the wellness program.
mean(subset(wellness, treat == 1)$outcome_post)

In [None]:
# This is medical spending among those randomly assigned to the control group.
mean(subset(wellness, treat == 0)$outcome_post)

It looks like we can reach the same conclusions as <a href="https://academic.oup.com/qje/article/134/4/1747/5550759?login=true" target="_blank">the study</a>:

> We find strong patterns of selection: during the year prior to the intervention, program participants had lower medical expenditures and healthier behaviors than nonparticipants. ... we do not find significant causal effects of treatment on total medical expenditures.
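
The comparison of randomly assigned groups can be summarized as a single number, the difference in means. The sketch below shows the arithmetic on made-up values (the data frame `toy` and its numbers are assumptions, not results from the study); whether a difference like this is distinguishable from chance takes additional tools that this notebook does not cover.

```r
# Made-up values for illustration only; NOT results from the study.
toy <- data.frame(
  treat        = c(1, 1, 1, 0, 0, 0),
  outcome_post = c(400, 600, 530, 450, 550, 500)
)

# Average outcome in each randomly assigned group.
mean.treated <- mean(subset(toy, treat == 1)$outcome_post)
mean.control <- mean(subset(toy, treat == 0)$outcome_post)

# Difference in means: treatment group minus control group.
diff.in.means <- mean.treated - mean.control
diff.in.means
```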

------

# Submitting Your Notebook (please read carefully!)

To submit your notebook...

### 1. Click `File` $\rightarrow$ `Save and Checkpoint`.

### 2. Wait 5 seconds.

### 3. Select the cell below and hit run.

In [None]:
ottr::export("Week4_Activity2group.ipynb", pdf = TRUE)

### 4. Submit the .zip file you just downloaded <a href="https://www.gradescope.com/" target="_blank">on Gradescope here</a>.

Notes:

- **This does not seem to work on Chrome for iPad or iPhone.** If you're using an iPad or iPhone, you need to download the file using **Safari**.
- If your web browser automatically unzips the .zip file (so you see a folder instead of a .zip file), you can just upload the .ipynb file that is inside the folder.
- If this method is not working for you, try this: hit `File`, then `Download as`, then `Notebook (.ipynb)` and submit that.

### 5. In Gradescope, add your group members' names.