# POLSCI 3

# Week 6, Lecture Notebook 1: Quantifying Uncertainty

This week, we'll learn about **quantifying uncertainty** in experiments.

So far, we've just looked at differences between groups in experiments and assumed that the differences reflect a true causal effect.

For example, in Week 5, Lecture 2, we found that voters assigned to the Hawthorne condition voted at a rate only 0.3 percentage points higher than those assigned to the Civic Duty condition. We therefore concluded that the Hawthorne condition had a tiny effect on turnout over and above the Civic Duty condition; i.e., that there was a small but *real* effect of telling people they were being studied on voter turnout.

But what if the Hawthorne condition didn't actually cause *anyone* to vote who wouldn't have voted had they received the Civic Duty mailer? Is it possible that, *even if* the Civic Duty mailer and Hawthorne mailer had the exact same effect, we might see those assigned to the Hawthorne condition voting at slightly higher rates anyway? Yes.

Put another way, if the true effect of a treatment relative to a control group is 0, would we always see exactly 0 difference between the average outcomes in the treatment and control groups in an experiment? *No.* It's possible that we might, just by chance, randomize people with slightly lower/higher values of the outcome to one group in an experiment.

For example, it's possible that we might, just by chance, randomize people that were a little more likely to vote anyway into the Hawthorne group---making it look like the Hawthorne mailing has a slightly higher effect than the Civic Duty mailing, even if the effects are the same.

The good news is that we can figure out how likely this is.

## Bias vs. Noise

Wait, doesn't this mean experiments are no different than observational studies, if the answers they give us aren't guaranteed to be exactly right?

No. To understand why not, we need to understand the difference between **bias** and **noise**.

The Potential Outcomes framework showed us that, in any experiment, there is a right answer: the **True Average Treatment Effect**. For example, if we could see all the potential outcomes in the social pressure experiment from last week, we would know for sure what exactly the effect of the Hawthorne and Civic Duty mail was on *everyone*, and so we could tell what the true effect is. Real studies never give us the exact right answer, for two reasons:

### Bias

The first reason is bias. A study is biased when *we expect* its answer to be *systematically different* from the truth *in a consistent direction*.

For example, in previous weeks we talked about *omitted variable bias* when we looked at the dataset on happiness across countries. Looking at the relationship between happiness and democracy might give us a *biased* answer about the causal effect of democracy on happiness because income causes both. If you and your friend each collected data on 10 countries, we'd expect to see this same bias in both sets of countries.

In other words, **we know we're looking at bias when repeating the same study would keep giving us results that differ from the truth in the same way**.
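To make this concrete, here is a minimal simulation sketch in base R. It does not use the happiness dataset: the variables `income`, `democracy`, and `happiness` and all of the numbers are made up for illustration, and the true effect of `democracy` on `happiness` is zero by construction. The helper `run.biased.study()` is a hypothetical name, and it uses `lm()` only to get the slope of happiness on democracy.

```r
set.seed(3)

# One hypothetical "study": 10 made-up countries where income drives both
# democracy and happiness, and democracy itself has NO effect on happiness.
run.biased.study <- function(n.countries = 10) {
    income    <- rnorm(n.countries)
    democracy <- income + rnorm(n.countries)
    happiness <- income + rnorm(n.countries)   # democracy does not appear here
    # Slope from relating happiness to democracy while ignoring income
    as.numeric(coef(lm(happiness ~ democracy))["democracy"])
}

# Repeat the same study 2,000 times with fresh countries each time
biased.estimates <- replicate(2000, run.biased.study())

mean(biased.estimates)       # centers well above the true effect of 0
mean(biased.estimates > 0)   # share of studies that are wrong in the same direction
```

Under these assumptions, repeating the study does not help: the estimates keep landing on the same side of the truth.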

For example, an archer who has a bias towards the upper left part of a target would land arrows like this:

<img src="highbias_lownoise.png" width="20%" height="auto">

In this metaphor, each of these dots is a study. If a bias affects a study design, its results are going to be *consistently wrong in one direction*.

Experiments do not have bias.

### Noise

Noise is different. Experiments do have noise.

Intuitively, if we randomize two groups, we'll see a difference between them, even if there is no effect of whatever we give one group. In fact, I could just randomize a dataset on my computer into two pieces, and I'd expect to see some differences between them:

In [None]:
library(estimatr); set.seed(1027)

example.data <- data.frame(outcome = seq(from=0,to=1,length.out=10000),
                           treat = sample(c(0,1), 10000, replace = TRUE))
head(example.data)

In [None]:
difference_in_means(outcome ~ treat, example.data, condition1 = 0, condition2 = 1)

In [None]:
example.data <- data.frame(outcome = seq(from=0,to=1,length.out=10000),
                           treat = sample(c(0,1), 10000, replace = TRUE))
difference_in_means(outcome ~ treat, example.data, condition1 = 0, condition2 = 1)

In [None]:
example.data <- data.frame(outcome = seq(from=0,to=1,length.out=10000),
                           treat = sample(c(0,1), 10000, replace = TRUE))
difference_in_means(outcome ~ treat, example.data, condition1 = 0, condition2 = 1)

This shows us how noise is different than bias.

**Noise** is when a study's answer differs from the truth by random chance, even though the study's answer centers around the correct answer *on average*. That is, if we did the study many times, the answers would center around the truth.
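We can check the "many times" part with a short sketch in base R. It repeats the kind of random split shown in the cells above 2,000 times, using plain `mean()` instead of `difference_in_means()` so that it runs without any packages. The helper name `one.random.split()` is an assumption made for this illustration.

```r
set.seed(1027)

# Outcome values that do not depend on the treatment at all (true effect = 0)
outcome <- seq(from = 0, to = 1, length.out = 10000)

# Randomly split the data into two groups and return the difference in group means
one.random.split <- function() {
    treat <- sample(c(0, 1), length(outcome), replace = TRUE)
    mean(outcome[treat == 1]) - mean(outcome[treat == 0])
}

noise.estimates <- replicate(2000, one.random.split())

mean(noise.estimates)    # very close to 0: right on average (no bias)
sd(noise.estimates)      # but individual estimates bounce around (noise)
range(noise.estimates)   # smallest and largest estimates we happened to see
```

No single split gives exactly zero, but the estimates are scattered around zero rather than pushed to one side.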

#### High Noise and No Bias

For example, an archer with *high noise* but *no bias* would look like this:

<img src="lowbias_highnoise.png" width="20%" height="auto">

Metaphorically, this is like an experiment with a small sample size (a concept we'll get to in a moment). The answer is right *on average*, but bounces around.

#### Low Noise and No Bias

Experiments that are *more precise* have answers that are closer to the right answer on average:

<img src="lowbias_lownoise.png" width="20%" height="auto">

This is obviously even better!

### How much noise does my experiment have?

But how do we know whether my experiment looks like this <img src="lowbias_highnoise.png" width="15%" height="auto"> or like
this <img src="lowbias_lownoise.png" width="15%" height="auto" />? That is, how do we know how much noise it has?

More importantly, when I analyze an experiment, how can I know if the answer I got is just due to noise, or if there really is an effect?

To understand this, we need to learn the concept of a **standard error**, which is a measure of how far off from the truth an experiment's answer usually is. In the archery metaphor, it is roughly the average distance between the center of the target (the truth) and where the archer's arrows actually hit (the estimates we get in particular experiments).

## Data

For today's example, we'll use data from a study by <a href="https://www.journals.uchicago.edu/doi/pdf/10.1086/714923" target="_blank">Grose et al.</a> on social lobbying.

In this study, the authors worked with a lobbying firm that was lobbying a state legislature on behalf of an education organization to increase education spending in the state. The authors did the experiment to test their theory that what they call *social lobbying*---taking a legislator to a social location such as a bar or restaurant---is more effective than lobbying legislators in their offices. To test this, they randomly assigned legislators to three groups:
- a control group that wasn't lobbied
- a social lobbying group that was asked to go to a restaurant or bar and talk about the bill
- an office lobbying group that was asked to talk about the bill in their offices

In [None]:
data <- read.csv('ps3_lobbying.csv')
head(data)

Here is a quick rundown of what each column means:

- `caseid`: Number that identifies each legislator/district
- `supportgroup`: This is the *outcome*. It is a measure of whether the legislator agreed to list their name publicly as a "sponsor" of the bill.
- `treat`: This is the *treatment*. It has several possible values:
    - `"control"`: the office received no contact from the lobbyist
    - `"officelobby"`: the legislator was asked to meet to discuss the bill in their office
    - `"sociallobby"`: the legislator was asked to meet to discuss the bill at a social location (a restaurant or bar)
- `ally`: The authors thought that social lobbying might be especially effective among legislators who had supported the group's priorities in the past. To measure this, they asked the lobbyist: "In your opinion, how well does the phrase ‘ally of the interest group’ describe the legislator?" This is therefore the lobbyists' rating of whether the legislator is an ally of the interest group (values 0, 1/3, 2/3, and 1).
- `female`: Legislator gender (1 = legislator is female; 0 = not)

Again, here is what `treat` looks like:

In [None]:
table(data$treat)

By now, we know how to look at the results. We can use the `difference_in_means()` function we learned last time to see how big of a difference the researchers actually saw between the control and social lobbying conditions in terms of how often the legislators supported the education organization's bill in the legislature:

In [None]:
difference_in_means(supportgroup ~ treat, data, condition1 = 'control', condition2 = 'sociallobby')

## How big of a difference might we see by chance?

This is a pretty small experiment, though. Might we see a difference this big by chance?

Let's for the moment assume that the lobbying actually had no effect, so all the potential outcomes are the same: it doesn't matter whether legislators were assigned to the control group, office lobbying group, or social lobbying group. That means we've already seen all the potential outcomes! And it means that the *true average treatment effect*, we are assuming, is zero.

In this world, we can examine how the *estimate* bounces around due to simple randomness---the luck of which legislators happen to get put into the treatment group; i.e., noise.

**You don't need to be able to write the code below on your own, but I do want you to understand what it does.**

In [None]:
# Function to re-randomize, or "shuffle", the treatment variable
re.randomize <- function(input.data) {
    input.data$treat <- sample(input.data$treat)
    return(input.data)
}

head(re.randomize(data))

In [None]:
head(re.randomize(data))

In [None]:
head(re.randomize(data))

In [None]:
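# Function to compute the effect of social lobbying (control vs. sociallobby).
# Note: despite its name, this function compares 'control' to 'sociallobby', not 'officelobby'.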
compute.office.lobby.effect <- function(input.data) {
    estimate <- difference_in_means(supportgroup ~ treat, input.data,
                                    condition1 = 'control', condition2 = 'sociallobby')$coefficients
    return(as.numeric(estimate))
}

actual.estimate <- compute.office.lobby.effect(data)
actual.estimate

In [None]:
compute.office.lobby.effect(re.randomize(data))

In [None]:
compute.office.lobby.effect(re.randomize(data))
compute.office.lobby.effect(re.randomize(data))
compute.office.lobby.effect(re.randomize(data))

The answer bounces around every time we run this function. This shows you why we need to worry about noise. If we just took this data, randomly shuffled it into groups, and then looked at the outcomes, having done nothing, we would still see some positive estimates even though putting legislators in the "treatment" group obviously did nothing in these simulations.

Is this what happened in our actual study? Is it possible that our treatment has no effect on the outcomes, and we just happened to randomize legislators who were likely to sponsor the bill anyway into the treatment group?

Let's simulate 10,000 example experiments and see how big these estimates might get...

In [None]:
simulations <- replicate(10000, compute.office.lobby.effect(re.randomize(data)))
head(simulations)

In [None]:
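library(ggplot2)
# Histogram of the simulated estimates; the red line marks the actual estimate
ggplot(data.frame(estimate = simulations), aes(x = estimate)) + geom_histogram(binwidth = .025) +
    geom_vline(xintercept = actual.estimate, color = 'red')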

The spread, or width, of this distribution shows how large the estimates we would see by chance alone can be.

We measure the spread of this distribution using the standard error, which here is simply the standard deviation of the simulated estimates:

In [None]:
sd(simulations)
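If you want to experiment with this idea without the lobbying file, here is a self-contained sketch in base R. The data frame `toy` is entirely made up (60 hypothetical legislators with a 0/1 outcome), and `diff.in.means()` is a hypothetical helper that stands in for `difference_in_means()`. It computes the same two things: the spread of the shuffled estimates, and how often shuffling alone produces an estimate at least as far from zero as the one in the toy data.

```r
set.seed(6)

# Hypothetical stand-in for the lobbying data: 60 made-up legislators,
# a 0/1 outcome, and a two-arm treatment. None of these values are real.
toy <- data.frame(
    supportgroup = rbinom(60, size = 1, prob = 0.2),
    treat = rep(c("control", "sociallobby"), each = 30)
)

# Difference in means (social lobbying minus control), in base R
diff.in.means <- function(d) {
    mean(d$supportgroup[d$treat == "sociallobby"]) -
        mean(d$supportgroup[d$treat == "control"])
}

toy.estimate <- diff.in.means(toy)

# Shuffle the treatment labels 5,000 times and re-compute the estimate each time
toy.simulations <- replicate(5000, {
    shuffled <- toy
    shuffled$treat <- sample(shuffled$treat)
    diff.in.means(shuffled)
})

sd(toy.simulations)                               # spread of the chance estimates
mean(abs(toy.simulations) >= abs(toy.estimate))   # how often chance alone matches ours
```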

## Good News: `difference_in_means()` tells you the standard error

You don't need to run this simulation yourself! The `difference_in_means()` function tells us the standard error:

In [None]:
social.dim <- difference_in_means(supportgroup ~ treat, data, condition1 = 'control', condition2 = 'sociallobby')
social.dim

The standard error is printed under the `Std. Error` heading. In this case, it's 0.076756.

In [None]:
# How big is the estimate relative to the standard error?
0.11617 / 0.076756 # divide estimate / standard.error

This is called a _t_-statistic. It's the ratio of the estimate to the standard error:

$t = \frac{\text{Estimate}}{\text{Standard Error}}$

And `difference_in_means()` tells us this, too! (See 1.513 above under `t value`.)

**The _t_-statistic is a way to measure how likely it is that an estimate of the size we saw would arise by chance even if the treatment had no effect.**

By convention, **we call an estimate _statistically significant_ if the absolute value of the _t_-statistic is larger than 1.96 (that is, the _t_-statistic is either lower than -1.96 or larger than 1.96).** If it is smaller than that (closer to zero), we generally conclude that our estimate could have arisen by chance. We'll say more about this in a future lecture.
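As a small sketch of that rule, the cell below plugs in the estimate and standard error reported above and applies the 1.96 cutoff. The variable names are just illustrative.

```r
estimate       <- 0.11617    # from the difference_in_means() output above
standard.error <- 0.076756

t.statistic <- estimate / standard.error
t.statistic                  # about 1.51

# Conventional check: is the estimate more than 1.96 standard errors from zero?
abs(t.statistic) > 1.96      # FALSE
```

So, by this convention, the social lobbying estimate in this example would not be called statistically significant.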

## Formula for the Standard Error

You won't need to memorize this, but it turns out there is a formula that tells us what a standard error will be in an experiment.

Suppose we are comparing groups 1 and 2. Let:

- $s_1$ be the standard deviation of group 1
- $n_1$ be the sample size of group 1
- $s_2$ be the standard deviation of group 2
- $n_2$ be the sample size of group 2

Then this is the formula for the standard error in an experiment:

$ \text{Standard Error} = \sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}} $

When the standard deviations are the same (so $s = s_1 = s_2$), this simplifies to:

$ \text{Standard Error} = s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} $

And, when the two groups are *also* the same size (so $n_1 = n_2 = N/2$, where $N = n_1 + n_2$ is the total sample size), this simplifies further to:

$ \text{Standard Error} = 2 s \sqrt{\frac{1}{N}} $
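Here is a minimal sketch of these formulas in base R. The function name `standard.error()` and the two groups of numbers are assumptions made for illustration (they are not the lobbying data). The second half checks that the general formula and the simplified formula agree when the groups have the same standard deviation and the same size.

```r
set.seed(42)

# Standard error of a difference in means, from the general formula
standard.error <- function(group1, group2) {
    sqrt(var(group1) / length(group1) + var(group2) / length(group2))
}

# Two made-up groups of outcomes (hypothetical numbers)
group1 <- rnorm(50, mean = 0,   sd = 1)
group2 <- rnorm(80, mean = 0.3, sd = 1)
standard.error(group1, group2)

# When s1 = s2 = s and n1 = n2 = N / 2, the general formula and the
# simplified formula 2 * s * sqrt(1 / N) give the same number
s <- 0.5
N <- 200
general    <- sqrt(s^2 / (N / 2) + s^2 / (N / 2))
simplified <- 2 * s * sqrt(1 / N)
c(general = general, simplified = simplified)
```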

What does this mean? For example:

- If the standard deviation of the outcome variable doubles, the standard error doubles.
- If the sample size of a study is cut in half, the standard error increases by a factor of $\sqrt{2} \approx 1.41$. (E.g., if the standard error was $X$ before the sample size was cut in half, it would be $X \sqrt{2}$ afterwards.)
- If the sample size of the study doubles, the standard error decreases by a factor of $\sqrt{\frac{1}{2}} \approx .707$. (E.g., if the standard error was $X$ before the sample size was doubled, it would be $\frac{X}{\sqrt{2}}$ afterwards.)
- In order to make the standard error half the size, the sample size must increase by a factor of 4 (i.e., quadruple).
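These scaling rules can be checked directly with the simplified formula. The sketch below assumes equal-sized groups with a common standard deviation, and the function name `se.equal.groups()` and the starting values ($s = 0.5$, $N = 400$) are arbitrary choices for illustration.

```r
# Simplified formula for equal-sized groups with a common standard deviation s
se.equal.groups <- function(s, N) 2 * s * sqrt(1 / N)

baseline <- se.equal.groups(s = 0.5, N = 400)

se.equal.groups(s = 1.0, N = 400)  / baseline   # doubling s: ratio of 2
se.equal.groups(s = 0.5, N = 200)  / baseline   # halving N: ratio of sqrt(2), about 1.41
se.equal.groups(s = 0.5, N = 800)  / baseline   # doubling N: ratio of about 0.707
se.equal.groups(s = 0.5, N = 1600) / baseline   # quadrupling N: ratio of 0.5
```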

What I want you to remember:

- The standard error goes up with the standard deviation or spread of the outcome variable.
- The standard error goes down as the sample size goes up.
- Although an experiment is more precise (i.e., the standard error is smaller) when the sample size is larger, it follows a square root, so we have to quadruple the sample size to cut the standard error in half (since $\sqrt{\frac{1}{4}} = \frac{1}{2}$). Likewise, in order to make the standard error of the treatment effect $\frac{1}{x}$ the size, the study needs to be $x^2$ times larger.
- In an experiment with a fixed *overall* sample size, the more similar in size the two groups are, the more precise the experiment will be. The most precise experiment at a fixed sample size will assign units to the treatment and control groups with equal probabilities (i.e., 50% each).
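The last point is easy to see with a quick sketch. It assumes both groups have the same standard deviation, and the function name `se.by.split()`, the total sample size of 1,000, and $s = 0.5$ are made-up values for illustration.

```r
# Standard error for a total sample of N split unevenly between two groups,
# assuming both groups have the same standard deviation s
se.by.split <- function(share.treated, N = 1000, s = 0.5) {
    n1 <- N * share.treated   # size of the treatment group
    n2 <- N - n1              # size of the control group
    s * sqrt(1 / n1 + 1 / n2)
}

shares <- c(0.1, 0.25, 0.5, 0.75, 0.9)
data.frame(share.treated = shares, standard.error = se.by.split(shares))
```

Under these assumptions, the standard error is smallest at the 50/50 split and grows as the split becomes more lopsided in either direction.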

## Reviewing New Terms/Concepts

- **True Average Treatment Effect**: If we could see all the potential outcomes, the actual truth about what the causal effect of a treatment is.
- **Estimate**: From a particular study run a particular time, our best guess of what the true average treatment effect is.
- **Bias**: When a study's estimates are systematically wrong in a particular direction; e.g., because of omitted variable bias. Experiments have no bias. If a study design is biased, it would be wrong even if its sample size were infinitely large.
- **Noise**: Because of random chance, a study's estimate differs from the truth, even though it is on average correct. If a study had an infinitely large sample size, it would have no noise.
- **Standard Error**: A way of measuring *how much* a study's estimate will differ from the truth (and between different runs of the same experiment) because of random chance. I.e., a measure of how much noise there is in an experiment.
- **t-statistic**: Defined as the estimate divided by the standard error. Gives an indication of how likely a study's result is to have arisen by chance. (More soon on how to use this.)