# POLSCI 3

## Week 9, Lecture Notebook 2: Descriptive hypothesis testing

## Descriptive hypotheses

Earlier this week, we learned how to look at *levels* of variables. For example, what percent of people support repealing the ACA?

Today, we will examine *descriptive hypotheses*, which are hypotheses about differences between groups. For example, you might ask: did people in California or people in Texas support repealing the ACA at higher rates? These questions are *not* causal. We wouldn't be testing whether taking a person and moving them from California to Texas would *cause* them to be more likely to support ACA repeal. We're just interested in whether the people who currently live in California and Texas have different opinions on average.

Good news: It turns out that we've already learned everything we need to do this. We just have to think about what we're doing a bit differently than we did when we were running experiments.

- We can use `difference_in_means()` to estimate a difference between two groups.
    - The basic syntax is nearly identical: `difference_in_means(outcome ~ group.variable.name, data = dataset.name, weights = weight.variable.name)`. In this example, `outcome`, `group.variable.name`, and `weight.variable.name` are columns in a dataset called `dataset.name`.
    - We don't need to supply `condition1` or `condition2` when the group variable is 0/1. `difference_in_means()` assumes we want to look at how much higher/lower the group where `group.variable.name == 1` is relative to the group where `group.variable.name == 0`.
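
To see what the estimate represents, here is a minimal sketch on a tiny made-up dataset. The data frame `toy` and its columns `outcome`, `group`, and `wt` are hypothetical and exist only for illustration. The sketch assumes that the weighted estimate is the weighted mean of the outcome where `group == 1` minus the weighted mean where `group == 0`, which is how we understand `difference_in_means()` to behave when weights are supplied.

```r
# Hypothetical toy data (not the CES data): 4 people in each group
toy <- data.frame(
  outcome = c(1, 0, 1, 1, 0, 0, 1, 0),
  group   = c(1, 1, 1, 1, 0, 0, 0, 0),
  wt      = c(1, 2, 1, 2, 1, 1, 2, 2)
)

# Weighted mean of the outcome within each group, computed by hand
mean_group1 <- with(toy[toy$group == 1, ], weighted.mean(outcome, wt))
mean_group0 <- with(toy[toy$group == 0, ], weighted.mean(outcome, wt))

# Difference between the groups: group 1 minus group 0
diff_by_hand <- mean_group1 - mean_group0
diff_by_hand

# If estimatr is installed, the same comparison with difference_in_means()
if (requireNamespace("estimatr", quietly = TRUE)) {
  fit <- estimatr::difference_in_means(outcome ~ group, data = toy, weights = wt)
  print(fit)
}
```

The by-hand number only reproduces the estimate. The standard error, $p$-value, and confidence interval still come from `difference_in_means()`.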

Let's look at an example.

In today's class, we'll look at data from the 2020 CES _post-election_ survey, which was conducted right after the 2020 election.

```r
# Import libraries
library(estimatr)

# Load the survey data and preview the first few rows
data <- read.csv("ps3_cces2020_post.csv")
head(data)
```

Here's what the variables mean:

- `vvweight_post`: Survey weight
- `person_of_color`: `1` if person identifies as a person of color; `0` if person identifies as non-Hispanic white
- `college`: `1` if the person graduated from college; `0` if the person did not
- `female`: `1` if the person identifies as female; `0` otherwise
- `medicare_expand`: `1` if the person favors expanding Medicare; `0` otherwise
- `vote_wait`: How long someone said they had to wait to vote, rounded to 0, 5, 15, 45, or 90 minutes. `NA` means the value is missing (the question was not asked); `difference_in_means()` will ignore these cases.
- `votereg_problem`: Encountered a problem when they tried to vote (e.g., ID was rejected, didn't appear on voter registration list); `0` = no; `1` = yes
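
Before comparing groups, it can help to check how a variable is coded and how many values are missing. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the real CES file) to show what ignoring the `NA` cases in `vote_wait` looks like when done by hand.

```r
# Hypothetical toy data (not the real CES file)
toy <- data.frame(
  person_of_color = c(0, 1, 0, 1, 0, 1, 1, 0),
  vote_wait       = c(0, 5, NA, 15, 45, NA, 90, 5)
)

# Count respondents in each category, including the missing ones
table(toy$vote_wait, useNA = "ifany")

# Keep only the rows where the wait time was recorded
answered <- toy[!is.na(toy$vote_wait), ]
nrow(answered)

# Average reported wait among people who answered the question
avg_wait <- mean(toy$vote_wait, na.rm = TRUE)
avg_wait
```

Running `table()` with `useNA = "ifany"` on the real data is a quick way to see how many respondents would be left out of a comparison.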

Now let's give `difference_in_means()` a try!

Are people who graduated from college more supportive of expanding Medicare?

```r
# Difference in Medicare expansion support: college graduates (1) minus non-graduates (0)
difference_in_means(medicare_expand ~ college, data, weights = vvweight_post)
```

Are women more supportive of expanding Medicare?

```r
# Difference in Medicare expansion support: women (1) minus everyone else (0)
difference_in_means(medicare_expand ~ female, data, weights = vvweight_post)
```

That's it!

There are still the other numbers that `difference_in_means()` gives us, which we previously associated with uncertainty in experiments. These numbers have essentially the same meaning as before, with a small twist.

With experiments, noise was generated by random assignment to treatment or control groups.

When looking at descriptive statistics, noise comes from the random sampling of observations from the population into our sample. Even if we had a sampling approach that didn't produce bias, by chance we might sample a group that looks a bit different than the population -- random sampling has some randomness! The larger the sample size, the lower this noise will be.
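
Here is a small simulation sketch of that idea. Everything in it is made up for illustration: a hypothetical population in which 60% of people support some policy, and two hypothetical sample sizes. It draws many random samples of each size and measures how much the sample averages bounce around from one sample to the next.

```r
set.seed(3)

# Hypothetical population of 50,000 people; about 60% support the policy
population <- rbinom(50000, size = 1, prob = 0.6)

# Draw many random samples of size n and record each sample's average
sample_means <- function(n, n_draws = 1000) {
  replicate(n_draws, mean(sample(population, n)))
}

# Spread of the sample averages = how noisy a single sample is
noise_small <- sd(sample_means(100))    # small samples
noise_large <- sd(sample_means(2500))   # large samples

noise_small
noise_large
```

No sample here is biased, yet each one lands a little above or below the population value. The averages from the larger samples should cluster much more tightly around it.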

With this interpretation of noise, here's how to read the other numbers that `difference_in_means()` reports:

- The **estimate** gives us the difference between the groups in this sample, and is our best guess about what the differences between the groups are in the population.
- **Standard errors** measure *how noisy* our estimate is. In this case, the noise comes from the fact that our random sample may, by chance, be slightly unrepresentative of the population.
- **$p$-values** measure the probability that we would see a difference between the two groups at least as large as the one we observed if, in the population, the groups were exactly the same.
- In 95% of random samples, the **confidence intervals** will contain the true difference between the groups in the population.
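
The last two interpretations can be checked with a simulation sketch. It uses base R's `t.test()` as a stand-in for `difference_in_means()`, without survey weights, and every number in it (group sizes, support rates, number of simulated samples) is a made-up assumption for illustration.

```r
set.seed(3)

n_per_group <- 500   # hypothetical sample size in each group
n_samples   <- 1000  # number of simulated random samples

# Draw one random sample from each group and compare them.
# p0 and p1 are the true (population) rates of support in each group.
one_sample <- function(p0, p1) {
  group0 <- rbinom(n_per_group, size = 1, prob = p0)
  group1 <- rbinom(n_per_group, size = 1, prob = p1)
  test <- t.test(group1, group0)  # compares mean(group1) with mean(group0)
  c(p_value = test$p.value,
    ci_low  = test$conf.int[1],
    ci_high = test$conf.int[2])
}

# Case 1: the groups truly differ by 10 percentage points.
# How often does the 95% confidence interval contain the true difference?
true_diff <- 0.10
sims_diff <- replicate(n_samples, one_sample(p0 = 0.50, p1 = 0.50 + true_diff))
coverage <- mean(sims_diff["ci_low", ] <= true_diff &
                 true_diff <= sims_diff["ci_high", ])

# Case 2: the groups are exactly the same in the population.
# How often do we get p < 0.05 anyway, purely by chance?
sims_null <- replicate(n_samples, one_sample(p0 = 0.50, p1 = 0.50))
false_alarm_rate <- mean(sims_null["p_value", ] < 0.05)

coverage          # should land near 0.95
false_alarm_rate  # should land near 0.05
```

In a real study we only get one sample, so we never see this full distribution. The simulation shows what the 95% and the $p$-value refer to across many hypothetical samples.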

**WARNING:** These are just descriptive differences! Don't interpret them as causal -- omitted variable bias could be a problem!