# POLSCI 3

## Week 3, Lecture 2: Correlation is not causation

In this notebook, we will dive deeper into tables and relationships between different variables!

As always, we can start by loading in our data. We are going to continue to work with the happiness dataset from the previous lecture.

In [None]:
# Load the happiness and Polity data, then preview the first few rows
happiness_data <- read.csv('happiness_polity_2018.csv')
head(happiness_data)

Again, here's the codebook:

- <code>polity2</code>: The "Polity Score" of the country, which measures its political system on a 21-point scale ranging from -10 (hereditary monarchy) to +10 (consolidated democracy).
- <code>polity2_cat</code>: The political category the country falls into. "Autocracies" are at the bottom of the spectrum, "anocracies" (semi-democracies) are in the middle, and "democracies" are at the top.
- <code>gdpcapita</code>: GDP per Capita (economic output per person)
- <code>gdpcapita_cat</code>: GDP/income category that the country falls into (based on GDP per capita)
- <code>happiness</code>: The country's happiness index, measured through surveys that require participants to rank their level of happiness based on an assortment of quality-of-life factors
- <code>happiness_cat</code>: Happiness category that the country falls into (based on happiness indicator)
- <code>life_expectancy</code>: Average life expectancy in years
- <code>life_expectancy_cat</code>: Life Expectancy category that the country falls into

### Testing a theory

Suppose a psychologist had a _theory_ that happiness causes a longer life, _hypothesized_ that happier countries would therefore have people who lived longer, and decided to _empirically test_ this hypothesis in the dataset we're looking at this week.

To do so, we can look at this two-way table:

In [None]:
# Rows: life expectancy category; columns: happiness category
table(happiness_data$life_expectancy_cat, happiness_data$happiness_cat)
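
Raw counts can be hard to compare when the columns have different totals. Here is a minimal sketch of turning a two-way table into column proportions with `prop.table()`. It uses a tiny made-up data frame (`toy`) rather than the real data, and the category labels ("low", "high", "short", "long") are assumptions for illustration, not the labels in `happiness_data`.

```r
# Toy stand-in for happiness_data; the labels and counts are made up.
toy <- data.frame(
  happiness_cat       = c("low", "low", "low", "low",
                          "high", "high", "high", "high"),
  life_expectancy_cat = c("short", "short", "short", "long",
                          "short", "long", "long", "long")
)

# Counts: rows are life expectancy, columns are happiness.
counts <- table(toy$life_expectancy_cat, toy$happiness_cat)

# margin = 2 divides each cell by its column total,
# so every column of the result sums to 1.
col_props <- prop.table(counts, margin = 2)
col_props
```

Reading down a column then tells you, among countries in that happiness category, what share fall into each life expectancy category. The same call should work on the real table, but the proportions still only describe a relationship; they say nothing about what causes what.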

One might look at this and be *tempted* to conclude that being happy causes people to live longer. Is the theory supported?

But wait a minute... *there might be other explanations for this relationship*.

When trying to establish a causal relationship in which A causes B, people are arguing that:

$$A \rightarrow B$$

### Reverse Causation

Is it that living longer causes one to be happier? This would be called **reverse causation**; if we think A causes B because there is a relationship between A and B, **reverse causation** is the possibility that B actually causes A.

$$B \rightarrow A$$
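
To see why the data alone can't tell these two stories apart, here is a small simulation sketch. Everything in it is made up for illustration (the sample size, the 0.7 effect, and the variable names); it is not drawn from the happiness dataset. In one simulated world A causes B, and in the other B causes A.

```r
set.seed(3)  # arbitrary seed so the simulation is reproducible
n <- 1000    # number of simulated observations (made up)

# World 1: A causes B.
a1 <- rnorm(n)
b1 <- 0.7 * a1 + rnorm(n)

# World 2: B causes A.
b2 <- rnorm(n)
a2 <- 0.7 * b2 + rnorm(n)

# Both worlds produce a positive correlation between A and B.
cor_world1 <- cor(a1, b1)
cor_world2 <- cor(a2, b2)
cor_world1
cor_world2

# Correlation is symmetric: swapping the arguments changes nothing.
all.equal(cor(a1, b1), cor(b1, a1))
```

Both worlds should show a similar positive correlation, so someone who only sees the relationship between A and B has no way of knowing which world generated it.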

Or, wait... maybe there is something else explaining this. The next table we will create is also looking at life expectancy, but this time pairing it with political system categories. 

In [None]:
# Rows: life expectancy category; columns: political system category
table(happiness_data$life_expectancy_cat, happiness_data$polity2_cat)

### Omitted Variable Bias

Maybe what's really going on is that democracies are causing people to be happier and to live longer. This would be an example of **omitted variable bias**: there is a third factor, C, which causes both A and B. This would lead A to be related to B, even if A doesn't cause B.

$$C \rightarrow A \text{ and } B$$
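
A simulation sketch can make this concrete. The numbers below are made up, and `c_var` is a hypothetical stand-in for the third factor C (think of 1 as "democracy" and 0 as "not a democracy"). A and B each depend on C, but A never appears in the recipe for B.

```r
set.seed(3)  # arbitrary seed so the simulation is reproducible
n <- 2000    # number of simulated observations (made up)

# C: the omitted variable, coded 0 or 1.
c_var <- rbinom(n, size = 1, prob = 0.5)

# A and B are both pushed up by C; neither one affects the other.
a <- 2 * c_var + rnorm(n)
b <- 2 * c_var + rnorm(n)

# Ignoring C, A and B look clearly related.
overall <- cor(a, b)

# Comparing only observations with the same value of C,
# the relationship should be close to zero.
within0 <- cor(a[c_var == 0], b[c_var == 0])
within1 <- cor(a[c_var == 1], b[c_var == 1])

c(overall = overall, within0 = within0, within1 = within1)
```

The overall correlation comes entirely from C. Once we look within groups that share the same value of C, there is little relationship left between A and B, even though the raw data made it look like one might cause the other.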

Democracy isn't the only possible omitted variable, though.

One of the most important lessons in this class is that *correlation is not causation*. Life expectancy, happiness, and political regime type are related, but just from the fact that they are related, we don't know what causes what.

One of the most important skills you can learn in this class is how to tell stories about why a correlation isn't necessarily causal. To do so, think about possibilities for reverse causation and omitted variable bias:

- **Reverse causation**: When A is related to B and someone argues A causes B, but, in fact, B might cause A
- **Omitted variable bias**: When A is related to B and someone argues A causes B, but, in fact, C might cause both A and B


### Back to science in the abstract

There are a few difficult parts of the scientific method, including:
- coming up with theories in the first place
- coming up with hypotheses that can help us understand whether those theories have merit
- coming up with empirical tests that show whether the hypothesis holds or not

All these steps can be difficult, but the last one is much trickier than it looks. Often, evidence that could be consistent with a hypothesis could also be interpreted as _inconsistent_ with it due to _alternative explanations_. One of our most important jobs as critical consumers of data is to think of potential alternative explanations for an empirical pattern, since this helps us evaluate whether the empirical test is actually consistent with the hypothesis, or might not actually say much about it at all.

The data we've looked at this week, I would say, hasn't said much about these hypotheses at all, because it's just too likely that reverse causation and omitted variable bias could be responsible for these empirical patterns instead.