# POLSCI 3

## Week 8, Lecture Notebook 2: Generalizability

### Why generalize?

When we conduct experiments, we learn about the effect of the *particular treatment* studied, *among the people* who took part, *at the time/in the context* the experiment was conducted. But we're usually interested in *generalizing* from experiments across all these dimensions:

- *Treatment*: Problem Set 2 looked at the effect of a "gentle" social pressure mailer. What if we made it even "gentler" by not telling people how often their neighbors had voted?
- *People*: Would the effects of the "both" condition among Utah Republican precinct chairs be similar to the effects for Utah Democratic precinct chairs? How about precinct chairs in other states?
- *Time/Context*: The social pressure experiment we analyzed in Week 5 was conducted in an off-year primary election (e.g., imagine a 2021 primary election for Berkeley mayor). Would the effects be different in a Presidential general election (e.g., Biden versus Trump in 2020)?

We learned earlier this week when we studied *heterogeneous treatment effects* that treatment effects can differ across individuals. When we try to generalize an experiment conducted on one particular treatment, among one particular population, at one particular time/in one particular context, the effects might differ between the original study and the new setting we are trying to generalize to.

### Why might effects not generalize perfectly? An example

Imagine an experiment done in the 2020 Presidential election in two states: Wyoming and Wisconsin. By the 2020 election, social pressure mail had become a mainstay of political campaigns, and so voters in swing states were receiving a lot of social pressure mail. Wisconsin was an important swing state in 2020, so voters there likely received a lot of social pressure mail. Wyoming was not a swing state, so voters there likely received little or none.

In that context, an experiment sending one social pressure mailer would create the following situation:

<table>
<thead>
  <tr>
    <th></th>
    <th colspan="2">Average Number of Social<br>Pressure Mailers Received</th>
  </tr>

  <tr>
    <td></td>
      <td><b>Control</b></td>
      <td><b>Treatment</b></td>
  </tr>
    </thead>
<tbody>
  <tr>
    <td>Wyoming</td>
    <td>0</td>
    <td>1</td>
  </tr>
  <tr>
    <td>Wisconsin</td>
    <td>17</td>
    <td>18</td>
  </tr>
</tbody>
</table>

In both places, the experiments are completely valid. In both places, they would answer the same question: what is the effect of sending voters a social pressure mailer on voter turnout?

But they are answering that question in different contexts. In Wyoming, the effect of sending voters a social pressure mailer might be much bigger because voters otherwise wouldn't get any social pressure mailers. In Wisconsin, the effect might be smaller because people there probably already received a lot of social pressure mailers.

At least, that is my guess of what would happen. But **there is no magic way to know whether the effect of a treatment would be the same if it were done differently, among other people, or in another time/context. We just have to do our best to make an educated guess -- or gather more data to understand how a treatment's effects vary.**
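
To make that guess concrete, here is a purely hypothetical sketch of "diminishing returns" to mailers. The `turnout()` function, its shape, and every number in it are invented for illustration; nothing here is estimated from real data. It only shows how the same one-mailer treatment could have different effects depending on how many mailers the control group already receives.

```r
# Hypothetical turnout curve: all numbers below are invented for illustration.
# Turnout rises with the number of mailers received, with diminishing returns.
turnout <- function(n.mailers, baseline = 0.30, max.gain = 0.10, rate = 0.5) {
  baseline + max.gain * (1 - exp(-rate * n.mailers))
}

# Effect of one extra mailer = treatment turnout minus control turnout
effect.wyoming   <- turnout(1)  - turnout(0)   # control gets 0, treatment gets 1
effect.wisconsin <- turnout(18) - turnout(17)  # control gets 17, treatment gets 18

effect.wyoming
effect.wisconsin
```

Under this made-up curve, the extra mailer moves turnout far more in Wyoming than in Wisconsin. A different assumed curve could give a different answer, which is exactly why we need data rather than guesses.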

### A common approach for understanding generalizability: meta-analysis

Is generalizing hopeless? Not necessarily. Instead of throwing up our hands and saying we don't know what one experiment tells us about the effects of another treatment, in another context, or among a different population, a common approach is to compare the results of *many* experiments and see how they vary.

This is called *meta-analysis* because it's an analysis of analyses.

For some questions, we can't do this, because there's only been one experiment -- all we can do is make an educated guess.

But for other questions that are studied more often, we can compile evidence from across many contexts and understand if and how effects vary.

Today, we'll take a look at a dataset from a recent paper I wrote that collects all of the public experiments that had been done on persuasion in American political campaigns.

### Campaign persuasion dataset

In [None]:
library(estimatr)
library(ggplot2)

# Load the campaign persuasion dataset; each row is one experiment
data <- read.csv("ps3_KB_campaign.csv")
head(data)

In this dataset, every row is an experiment. The authors (me and Josh Kalla) gathered information from previous experiments and compiled it in this dataset for meta-analysis.

Here is a quick rundown of what each column means:

- `Experiment`: Paper reference of the experiment
- `effect`: estimate of the effect size
- `SE`: standard error of the effect estimate
- `precision`: how precise the experiment is, defined as $\frac{1}{\text{Standard Error}^2}$ (the weights to be used)
- `General`: binary variable indicating whether the experiment was done in a general election (1 = yes, 0 = no)
- `Primary`: binary variable indicating whether the experiment was done in a primary election (1 = yes, 0 = no). These elections are typically quieter
- `Treatment.within.2months.of.election`: binary variable indicating whether the experiment took place within 2 months of election day (1 = yes, was within 2 months; 0 = no, was more than 2 months before election day)

### Meta-analysis using `weighted.mean()`

One way we could analyze this data is by just looking at the average effect.

In [None]:
mean(data$effect)

But some of these experiments are much more precise than others:

In [None]:
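# Histogram of standard errors (qplot() is deprecated in recent ggplot2 versions)
ggplot(data, aes(x = SE)) + geom_histogram()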

As a result, when we conduct meta-analyses, we have to *weight* our estimates by how precise they are.

Why? Remember this example from last week:

<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky">Experiment</th>
    <th class="tg-7btt">Estimate</th>
    <th class="tg-7btt">Standard Error</th>
    <th class="tg-amwm">95% Confidence<br>Interval</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-fymr">Experiment 1: Letter A</td>
    <td class="tg-0pky">8%</td>
    <td class="tg-0pky">30%</td>
    <td class="tg-0lax">[-50%, 66%]</td>
  </tr>
  <tr>
    <td class="tg-fymr">Experiment 2: Letter B</td>
    <td class="tg-0pky">5%</td>
    <td class="tg-0pky">1%</td>
    <td class="tg-0lax">[3%, 7%]</td>
  </tr>
</tbody>
</table>

If we wanted to combine the results of these two experiments to ask "what is the average effect of letters?", we wouldn't want to just average 8% and 5%. We have a lot more information about the effects of letters from Experiment 2, so we'd want to put a lot more weight on Experiment 2.

It turns out that the amount of information in an experiment -- how *precise* it is -- is defined as $\frac{1}{\text{Standard Error}^2}$. The basic message: the smaller the standard error, the more precise the experiment is.

In [None]:
example.data <- data.frame(experiment = c(1,2), estimate = c(8, 5), SE = c(30, 1))
example.data

Adding in a precision column:

In [None]:
example.data$precision <- 1/example.data$SE^2
example.data

Unweighted, the results would look like this:

In [None]:
mean(example.data$estimate)

To weight our average by precision, we can use the `weighted.mean()` function, like this:

In [None]:
weighted.mean(example.data$estimate, example.data$precision)
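
To see what `weighted.mean()` is doing, here is a minimal sketch that computes the same quantity by hand: the sum of each estimate times its weight, divided by the sum of the weights. The names `weight.share` and `by.hand` are just illustrative; the data are the two-experiment example from above, rebuilt so the cell runs on its own.

```r
# Same two experiments as in the example above
example.data <- data.frame(experiment = c(1, 2), estimate = c(8, 5), SE = c(30, 1))
example.data$precision <- 1 / example.data$SE^2

# Share of the total weight each experiment receives
weight.share <- example.data$precision / sum(example.data$precision)
weight.share

# Weighted mean by hand: sum of (weight x estimate) divided by sum of weights
by.hand <- sum(example.data$precision * example.data$estimate) /
  sum(example.data$precision)
by.hand

# Check that this matches the built-in function
all.equal(by.hand, weighted.mean(example.data$estimate, example.data$precision))
```

Experiment 2 receives almost all of the weight (900 out of every 901 parts), so the weighted mean lands just above 5 rather than at the unweighted 6.5.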

Let's see this in the main dataset:

In [None]:
weighted.mean(data$effect, data$precision)
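
The weighted mean is itself an estimate, so it has a standard error too. Here is a minimal sketch, under assumptions these notes don't state: the experiments are independent and all estimate one common effect (often called a "fixed-effect" meta-analysis). Under those assumptions, the standard error of the precision-weighted mean is $\frac{1}{\sqrt{\sum \text{precision}}}$. The helper `pooled.se()` is a hypothetical name, not a built-in function.

```r
# Hypothetical helper: SE of a precision-weighted mean, ASSUMING
# independent experiments that share one common true effect
pooled.se <- function(precision) {
  1 / sqrt(sum(precision))
}

# Two-experiment example from earlier in the notebook
example.data <- data.frame(experiment = c(1, 2), estimate = c(8, 5), SE = c(30, 1))
example.data$precision <- 1 / example.data$SE^2

pooled.estimate    <- weighted.mean(example.data$estimate, example.data$precision)
pooled.estimate.se <- pooled.se(example.data$precision)

# Approximate 95% confidence interval for the pooled estimate
c(lower = pooled.estimate - 1.96 * pooled.estimate.se,
  upper = pooled.estimate + 1.96 * pooled.estimate.se)
```

For the main dataset, the analogous call would be `pooled.se(data$precision)`. If effects truly differ across experiments, as the rest of this notebook suggests they might, this standard error is likely too small.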

#### Using `weighted.mean()` to examine how effects vary

To examine how effects vary and generalize, we can run `weighted.mean()` in different subsets.

For example, let's look at effects in primary elections and general elections.

In primary elections:

In [None]:
primaries <- subset(data, Primary == 1)
weighted.mean(primaries$effect, primaries$precision)

Now, in general elections:

In [None]:
general.elections <- subset(data, General == 1)
weighted.mean(general.elections$effect, general.elections$precision)

Interesting! Looks like the effects of campaign persuasion are much larger in primaries than in general elections.

This is a good example of a lesson: effects in one context don't necessarily generalize to others!

In the activity, you'll use `subset()` and `weighted.mean()` to explore other patterns in the size of campaigns' persuasive effects.
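
As a cross-check on the subset approach, the gap between two precision-weighted means can also be read off a single weighted regression on a 0/1 indicator. The sketch below uses a small toy data frame with invented numbers (not the campaign dataset) so it runs on its own; the column names mirror the real dataset.

```r
# Toy data, invented for illustration only (NOT from the campaign dataset)
toy <- data.frame(
  effect  = c(4.0, 6.0, 1.0, 0.5, 0.0),
  SE      = c(2.0, 1.0, 0.5, 1.0, 0.25),
  Primary = c(1, 1, 0, 0, 0)
)
toy$precision <- 1 / toy$SE^2

# Subset approach, as in the notebook
primaries <- subset(toy, Primary == 1)
others    <- subset(toy, Primary == 0)
gap.by.subset <- weighted.mean(primaries$effect, primaries$precision) -
  weighted.mean(others$effect, others$precision)

# Regression approach: the coefficient on Primary is the same gap
fit <- lm(effect ~ Primary, data = toy, weights = precision)
gap.by.regression <- unname(coef(fit)["Primary"])

all.equal(gap.by.subset, gap.by.regression)
```

The two approaches agree on the point estimate. The `lm_robust()` function from `estimatr` also accepts a `weights` argument, so the same idea should carry over to it.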

#### Last point: Understanding _why_ effects vary

It's not necessarily straightforward to understand _why_ effects vary across people, contexts, etc. from meta-analyses.

For example, why are the effects of campaign persuasion larger in primaries than in general elections?

This could be for a lot of reasons: in general elections, voters might remain loyal to their parties; but in primaries, candidates' parties don't appear on the ballot, so more voters might be "up for grabs." Alternatively, general elections tend to have more campaign spending, so the effect of additional spending might be lower than in primaries, which tend to be quieter.

One of the things meta-analyses do is prompt further theories and hypotheses to understand what creates the patterns we find in them.

For example, based on the evidence from the meta-analyses above, we might theorize that, as I said above, when candidates' parties don't appear on a ballot, it's easier to persuade voters. To test this hypothesis, one could imagine an experiment showing voters a sample ballot and randomly assigning both a) whether or not candidate party labels are present and b) information about the candidates. We could then look to see if the effect of the information is smaller when candidate party labels are shown than when they are not.

<table>
<thead>
  <tr>
    <th></th>
      <th>Party Labels</th>
    <th>No Party Labels</th>
  </tr>
</thead>
<tbody>
  <tr>
      <td><b>No Persuasive Information</b></td>
    <td>Group 1</td>
    <td>Group 2</td>
  </tr>
  <tr>
      <td><b>Persuasive Information</b></td>
    <td>Group 3</td>
    <td>Group 4</td>
  </tr>
</tbody>
</table>
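
The comparison this design allows can be sketched as a "difference in effects": the effect of information without party labels (Group 4 minus Group 2) minus the effect of information with party labels (Group 3 minus Group 1). The group means below are invented placeholders, not results from any real experiment.

```r
# Hypothetical share supporting the candidate in each group (invented numbers)
group.means <- c(group1 = 0.50,  # party labels,    no persuasive information
                 group2 = 0.50,  # no party labels, no persuasive information
                 group3 = 0.52,  # party labels,    persuasive information
                 group4 = 0.58)  # no party labels, persuasive information

# Effect of the information within each column of the table
effect.with.labels    <- group.means[["group3"]] - group.means[["group1"]]
effect.without.labels <- group.means[["group4"]] - group.means[["group2"]]

# Difference in effects: a positive value is consistent with the hypothesis
# that persuasion is easier when party labels are absent
difference.in.effects <- effect.without.labels - effect.with.labels

c(with.labels    = effect.with.labels,
  without.labels = effect.without.labels,
  difference     = difference.in.effects)
```

In a real experiment we would also need a standard error for this difference before drawing conclusions.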