# POLSCI 3

## Week 8, Lecture Notebook 1: Heterogeneous Treatment Effects

### Heterogeneous Treatment Effects

We're now done with the most difficult content in the class, standard errors and $p$-values. We now have the basic tools we need to understand essentially everything else we'll cover this semester.

Today's lecture will focus on one of these ideas: *heterogeneous treatment effects*. Heterogeneous is just a fancy word for different. Heterogeneous treatment effects refers to the idea of **different treatment effects among different _subgroups_ of observations**.

We've already seen a bit of this; we just haven't talked about it directly yet. For example, in Problem Set 1, the effect of the "DeShawn" letter treatment was different for different _subgroups_ of legislators: white Democratic legislators, non-white Democratic legislators, and Republican legislators.

That was a heterogeneous treatment effect: the effect of the treatment was different for these different _subgroups_ of observations (in that case, each observation was a legislator).

### Two things to know about heterogeneous treatment effects

#### We compute heterogeneous treatment effects by looking at effects within subsets

In this class, we'll follow a basic recipe for examining heterogeneous treatment effects. You already know how to do this!

1. Save a new subset with just the subgroup of observations you want to look at effects among.
2. Use `difference_in_means()`, but passing `data = name.of.the.subset`.

For example:
```
subset.name <- subset(data, subset.var == 1)

difference_in_means(outcome ~ treat, data = subset.name, condition1 = 'control', condition2 = 'treat')
```
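To see what this recipe computes, here is a minimal sketch on made-up data that does not use `estimatr`: it takes a subset and then subtracts the control-group mean from the treatment-group mean by hand. The data frame `toy` and the helper `simple_diff_in_means()` are hypothetical, invented for illustration only. In this class we use `difference_in_means()`, which also reports standard errors and $p$-values.

```r
# Made-up data for illustration only (not from any real experiment)
toy <- data.frame(
  treat      = c("control", "treat", "control", "treat",
                 "control", "treat", "control", "treat"),
  outcome    = c(0, 1, 1, 1, 0, 0, 1, 0),
  subset.var = c(1, 1, 1, 1, 0, 0, 0, 0)
)

# Hypothetical helper: treatment-group mean minus control-group mean
simple_diff_in_means <- function(d) {
  mean(d$outcome[d$treat == "treat"]) - mean(d$outcome[d$treat == "control"])
}

# Step 1: save a subset for each subgroup
toy.in  <- subset(toy, subset.var == 1)
toy.out <- subset(toy, subset.var == 0)

# Step 2: estimate the effect within each subset
effect.in  <- simple_diff_in_means(toy.in)   # 1.0 - 0.5 = 0.5
effect.out <- simple_diff_in_means(toy.out)  # 0.0 - 0.5 = -0.5
print(c(subgroup.1 = effect.in, subgroup.0 = effect.out))
```

The two estimates differ (0.5 versus -0.5), which is what a heterogeneous treatment effect looks like in this toy data.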

#### Differences in treatment effects are observational/descriptive, not experimental/causal comparisons

Remember how we defined a causal effect: the difference between potential outcomes.

Before, we had a table like this:

<table>
<thead>
  <tr>
    <th>Legislator</th>
    <th>Respond to<br>Jake Alias</th>
    <th>Respond to<br>DeShawn Alias</th>
    <th>Treatment Effect</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>A</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
  </tr>
  <tr>
    <td>B</td>
    <td>1</td>
    <td>0</td>
    <td>-1</td>
  </tr>
  <tr>
    <td>C</td>
    <td>1</td>
    <td>1</td>
    <td>0</td>
  </tr>
  <tr>
    <td>D</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
  </tr>
</tbody>
</table>

With heterogeneous treatment effects, we want to know the *average* treatment effect among different subgroups.

For example, imagine these are **California legislators**:

<table>
<thead>
  <tr>
    <th>Legislator</th>
    <th>Respond to<br>Jake Alias</th>
    <th>Respond to<br>DeShawn Alias</th>
    <th>Treatment Effect</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>A</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
  </tr>
  <tr>
    <td>B</td>
    <td>1</td>
    <td>0</td>
    <td>-1</td>
  </tr>
</tbody>
</table>

Now imagine these are **Virginia legislators**:

<table>
<thead>
  <tr>
    <th>Legislator</th>
    <th>Respond to<br>Jake Alias</th>
    <th>Respond to<br>DeShawn Alias</th>
    <th>Treatment Effect</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>C</td>
    <td>1</td>
    <td>1</td>
    <td>0</td>
  </tr>
  <tr>
    <td>D</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
  </tr>
</tbody>
</table>

In these examples, the average treatment effect among California legislators is -0.5 and the average treatment effect among Virginia legislators is 0. This is a heterogeneous treatment effect.
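The same averages can be reproduced in a few lines of R. This is only a sketch: the data frame `po` and its column names are made up to mirror the tables above, and in a real experiment we never observe both potential outcomes for the same legislator.

```r
# Potential outcomes copied from the tables above (hypothetical legislators A-D)
po <- data.frame(
  legislator = c("A", "B", "C", "D"),
  state      = c("California", "California", "Virginia", "Virginia"),
  y.jake     = c(0, 1, 1, 0),  # would respond to the Jake alias
  y.deshawn  = c(0, 0, 1, 0)   # would respond to the DeShawn alias
)

# Individual treatment effect: difference between the two potential outcomes
po$effect <- po$y.deshawn - po$y.jake

# Average treatment effect overall and within each state
ate.overall  <- mean(po$effect)
ate.by.state <- tapply(po$effect, po$state, mean)

print(ate.overall)   # -0.25
print(ate.by.state)  # California -0.5, Virginia 0
```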

The key point: **we are simply _describing treatment effects_ within different subgroups**. These are causal effects **within** subgroups, but **the comparisons between them are not causal**. For example, we are **not** saying that _being a California legislator causes legislators to discriminate:_

<table>
<thead>
  <tr>
    <th>Legislator</th>
    <th>Treatment Effect<br>if in California<br>(What we saw)</th>
    <th>Treatment Effect<br>if in Virginia<br>(Counterfactual)</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>A</td>
    <td>0</td>
    <td>0</td>
  </tr>
  <tr>
    <td>B</td>
    <td>-1</td>
    <td>0</td>
  </tr>
</tbody>
</table>

"Being a California legislator causes legislators to discriminate" makes no sense. We are **not** saying that, counterfactually, the people who serve as legislators in California would discriminate less if those same people moved to Virginia and served in that legislature.

Rather, we **are** saying that, descriptively speaking, the legislators in California discriminate more than the ones in Virginia.

But this doesn't say anything about "the effect of being a California legislator" because it doesn't make sense to imagine the counterfactual that the California legislators instead served in Virginia.

For example, in Problem Set 1, we weren't saying that the non-White legislators would discriminate more if we made them White. It's not clear what this even would mean. There is **no counterfactual**, so we are **not examining the _causal effect of being in the subgroup_ on treatment effects**. Instead, we simply described behavior in different subgroups and noted differences in how each responded to the treatment.

### Data example: Utah Precinct Chair Experiment

For the first half of Week 8, we'll be using the same dataset as we did in Week 7, the data from the Utah Republican Party's experiment to increase the number of women who serve as precinct chairs.

But, there's a new variable in the dataset:

In [None]:
# Load the estimatr package, which provides difference_in_means()
library(estimatr)

data <- read.csv("ps3_week8_electing_women.csv")
head(data)

Here is a quick reminder of what each column means:

- `unique_id`: Precinct ID
- `treat`: treatment variable
    - `'control'`: control group
    - `'supply'`: supply group; party chair instructed to recruit 2-3 women
    - `'demand'`: demand group; party chair reads letter at precinct convention
    - `'both'`: a fourth group getting both the supply and demand treatments; party chair instructed to read letter *and* to recruit 2-3 women
- `prop_sd_fem2014`: **Outcome**: Proportion of 2014 elected state delegates from that precinct who were women
- `sd_onefem2014`: **Outcome**: 1 if at least one woman was selected; 0 otherwise
- `county` : County name in Utah
- `pc_male`: 1 if precinct chair is male; 0 otherwise (precinct chair is person who runs precinct meeting, would read letter if assigned to do so, etc.)
- `mormon`: 1 if precinct chair filled out a survey and told the party they were a Mormon; 0 otherwise (either because not Mormon or did not fill out survey) **<span style="color:red">New variable!</span>**

Here's our new variable:

In [None]:
table(data$mormon)

### Example of calculating heterogeneous treatment effects

In class, your activity will be to look at effects among the subset where `mormon == 1` and among the subset where `mormon == 0` and then interpret them.

Instead of giving you the right answer, I'll give you an example for another variable in this dataset, `county`.

Let's compute the effect of the "both" treatment (relative to control) inside and outside of Salt Lake County.

In [None]:
table(data$county)

Here is the treatment effect for the entire sample:

In [None]:
difference_in_means(prop_sd_fem2014 ~ treat, data = data,
                    condition1 = 'control', condition2 = 'both')

#### Calculating the treatment effect within both subgroups

First, let's look at the effect inside Salt Lake County.

In [None]:
# First, make a subset
data.salt.lake <- subset(data, county == 'Salt Lake')

# Now, look at effect in that subset
difference_in_means(prop_sd_fem2014 ~ treat, data = data.salt.lake,
                    condition1 = 'control', condition2 = 'both')

Now, let's look at the effect outside of Salt Lake County:

In [None]:
# First, make a subset
data.not.salt.lake <- subset(data, county != 'Salt Lake')

# Now, look at effect in that subset
difference_in_means(prop_sd_fem2014 ~ treat, data = data.not.salt.lake,
                    condition1 = 'control', condition2 = 'both')

It looks like the effects are bigger outside of Salt Lake County than inside Salt Lake County.

#### Interpretation

The final important point is interpretation. What does what we found mean?

The interpretation is **not** causal: we are not saying that there is an effect of being in Salt Lake County on responses to this treatment. As soon as we say "effect", we are talking about a counterfactual, and that counterfactual makes no sense. For example, we **cannot** conclude that, if you took the people outside Salt Lake County, bused them into Salt Lake County, and then did the same experiment, they'd respond less to the "both" treatment.

What we can conclude is that the effects look larger outside Salt Lake County than inside Salt Lake County.

This might still be really important and useful! For example, if the party has limited resources for sending out these letters, they might focus on sending them outside Salt Lake County since the effects seem bigger there.

Why is this distinction so important? The problem is omitted variable bias: there are a *lot* of reasons that precinct chairs inside and outside Salt Lake County might be different, so we can't be sure that differences in how they respond are due to where they live rather than some other factor. Although we can be confident that the causal effects we estimated in the experiment accurately describe the effects within these two groups, we can't be sure that location is what causes the difference in these causal effects.

### Caution: $p$-hacking

There is a danger with heterogeneous treatment effects, often called $p$-hacking.

The idea of $p$-hacking is that, if you test lots and lots of hypotheses that are not true (i.e., where the null hypothesis is true), you're likely to get at least one statistically significant result just by chance. In fact, using a threshold of $p < 0.05$ means that, when the null hypothesis is true, there is still a 1 in 20 chance of getting a statistically significant result by chance. So it makes sense that if you test 20 different hypotheses, you'll likely get one statistically significant result even if none of the hypotheses you're testing are true!
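The arithmetic behind this can be checked directly. As a sketch, assuming the 20 tests are independent of each other and every null hypothesis is true, each test has a 0.95 chance of *not* coming out significant, so the chance that at least one comes out significant is $1 - 0.95^{20}$. The variable names below are just for illustration.

```r
alpha   <- 0.05  # significance threshold
n.tests <- 20    # number of independent tests, all with a true null hypothesis

# Expected number of false positives across all the tests
expected.false.positives <- n.tests * alpha

# Probability of getting at least one false positive
p.at.least.one <- 1 - (1 - alpha)^n.tests

print(expected.false.positives)  # 1
print(round(p.at.least.one, 2))  # 0.64
```

So with 20 tests we expect one false positive on average, and there is roughly a 64% chance of seeing at least one.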

This xkcd cartoon I've shown previously makes this point:

<img src="significant.png">

**$p$-hacking is when you test so many hypotheses that you become _very_ likely to get a statistically significant $p$-value for a null hypothesis that is true** --- i.e., to reach a $p$-value that incorrectly leads you to reject a true null hypothesis (e.g., thinking a treatment had an effect that actually has no effect).

I bring this up during our discussion of heterogeneous treatment effects because $p$-hacking is a _particularly_ big risk when examining heterogeneous treatment effects: in many datasets there are a ton of different subgroups you can slice and dice your dataset into. This can result in a situation where researchers think, or convince others, that even if a treatment doesn't work overall, it does work if you look at a specific group. Whenever you hear a claim like this, you should worry about $p$-hacking.
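To make the subgroup version of this risk concrete, here is a small simulation sketch. Everything in it is made up: the data are random noise, the treatment has no effect on anyone by construction, and the 20 subgroups are arbitrary labels. It uses base R's `t.test()` rather than `difference_in_means()` so that it runs without any extra packages.

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible

n.subgroups <- 20
n.per.group <- 100

# Simulated experiment where the treatment has NO effect on anyone
sim <- data.frame(
  subgroup = rep(1:n.subgroups, each = n.per.group),
  # Randomly assign half of each subgroup to treatment, half to control
  treat    = as.vector(replicate(
    n.subgroups,
    sample(rep(c("control", "treat"), each = n.per.group / 2))
  )),
  outcome  = rnorm(n.subgroups * n.per.group)  # pure noise, unrelated to treat
)

# Test for a treatment effect separately within every subgroup
p.values <- sapply(1:n.subgroups, function(g) {
  d <- subset(sim, subgroup == g)
  t.test(outcome ~ treat, data = d)$p.value
})

# Any subgroup with p < 0.05 here is a false positive by construction
print(round(p.values, 3))
print(sum(p.values < 0.05))
```

Any "significant" subgroup this turns up is a false positive, because we built the data so that the treatment does nothing. Changing the seed changes which subgroups, if any, look significant.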

The _New York Times_ did a nice story called "<a href="https://www.nytimes.com/2017/11/28/magazine/a-failure-to-heal.html?smid=pl-share" target="_blank">A Failure to Heal</a>" about how this happens in medicine, especially drug development. When a drug in testing fails to have an effect...

> a second instinct takes over: Why not try to find the people for whom the drug did work? ...
>
> This kind of search-and-rescue mission is called “post hoc” analysis. It’s exhilarating — and dangerous. On one hand, it promises the possibility of resuscitating the medicine: Find the right group of responsive patients within the trial group — men above 60, say, or postmenopausal women — and you can, perhaps, pull the drug out of the rubble of the failed study.
>
> But it’s also a treacherous seduction. The reasoning is fatally circular — a just-so story. You go hunting for groups of patients that happened to respond — and then you turn around and claim that the drug “worked” on, um, those very patients that you found. (It’s quite different if the subgroups are defined before the trial. There’s still the statistical danger of overparsing the groups, but the reasoning is fundamentally less circular.) It would be as if Sacks, having found that the three long-term responders to L-dopa happened to be 80-year-old women from one nursing home, then published a study claiming that the drug “worked” on Brooklyn octogenarians.
>
> Perhaps the most stinging reminder of these pitfalls comes from a timeless paper published by the statistician Richard Peto. In 1988, Peto and colleagues had finished an enormous randomized trial on 17,000 patients that proved the benefit of aspirin after a heart attack. The Lancet agreed to publish the data, but with a catch: The editors wanted to determine which patients had benefited the most. Older or younger subjects? Men or women?
>
> Peto, a statistical rigorist, refused — such analyses would inevitably lead to artifactual conclusions — but the editors persisted, declining to advance the paper otherwise. Peto sent the paper back, but with a prank buried inside. The clinical subgroups were there, as requested — but he had inserted an additional one: “The patients were subdivided into 12 ... groups according to their medieval astrological birth signs.” When the tongue-in-cheek zodiac subgroups were analyzed, Geminis and Libras were found to have no benefit from aspirin, but the drug “produced halving of risk if you were born under Capricorn.” Peto now insisted that the “astrological subgroups” also be included in the paper — in part to serve as a moral lesson for posterity. I’ve often thought of Peto’s paper as required reading for every medical student.

Beware of this instinct! It can be a good idea to check if a treatment has different effects for different groups, but $p < 0.05$ is a lot less meaningful when you've tested many hypotheses.