# Quantifying Association

We'll start our journey into causal inference by looking at several ways of quantifying association.

Throughout, we'll assume we have two random variables, $x$ and $y$. Although most of the methods we'll describe can be used regardless of how we interpret them, it will be helpful when we move to causality to think of $x$ as either a treatment or covariate, and to think of $y$ as an outcome.

## Numerical Data: Correlation and Regression

### Correlation coefficient

There are a few different ways to measure correlation. The most common is the **Pearson correlation** (also called the correlation coefficient), usually denoted by $r$ or $\rho$ (the Greek letter *rho*):

$$
    \rho_{xy} = \frac{\text{cov}(x, y)}{\sqrt{\text{var}(x)\text{var}(y)}}
$$

This is a good measure of the linear association between $x$ and $y$. For a refresher on Pearson correlation, see the [Data 8 textbook](https://www.inferentialthinking.com/chapters/15/1/Correlation.html) and the [Prob 140 textbook](http://prob140.org/textbook/content/Chapter_13/01_Covariance.html#correlation).
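As a minimal sketch of the formula above, the snippet below computes the Pearson correlation directly from the covariance and variances and compares it with NumPy's built-in `np.corrcoef`. The data are simulated purely for illustration; the seed, sample size, and slope are arbitrary assumptions.

```python
import numpy as np

# Simulated data, made up purely for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)

# Pearson correlation from the definition: cov(x, y) / sqrt(var(x) * var(y)).
# bias=True makes np.cov divide by n, matching np.var's default.
cov_xy = np.cov(x, y, bias=True)[0, 1]
rho_manual = cov_xy / np.sqrt(np.var(x) * np.var(y))

# NumPy's built-in correlation matrix; the off-diagonal entry is rho_xy.
rho_numpy = np.corrcoef(x, y)[0, 1]

print(rho_manual, rho_numpy)
```

The two values should agree up to floating-point error, since the normalization factors cancel in the ratio.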

### Linear regression

If we were to fit a linear model to predict $y$ from $x$, it would look something like:

$$y = \alpha + \beta x + \varepsilon.$$

As usual, we assume that $\varepsilon$ is zero-mean noise, with the additional property that cov$(x, \varepsilon) = 0$. We've talked a lot about how to interpret this as a predictive model, but now we'll look at it slightly differently. 

We'll think of this equation as simply a descriptive explanation of a relationship between $x$ and $y$, where the most important part of the relationship is $\beta$. We can use all the same computational machinery we've already developed to fit the model and compute $\beta$, and the interpretation is subject to the limitations we've already learned about (e.g., it doesn't capture nonlinear association, it can be impacted by outliers, etc.). While it's common to describe $\beta$ as quantifying the "effect" of $x$ on $y$, it's important to understand the limitations of the word "effect" here: linear regression can only tell us the *predictive* effect, rather than the causal effect.

So, we'll use the coefficient $\beta$ as a means to quantify the relationship between $x$ and $y$. Starting from our assumption that cov$(x, \varepsilon) = 0$ and using properties of covariance, we can show that $\beta = \frac{\text{cov}(x, y)}{\text{var}(x)}$. From here, we can also show that $\beta = \rho_{xy} \sqrt{\frac{\text{var}(y)}{\text{var}(x)}}$ (as you may have seen empirically in Data 8).
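One way to fill in the steps, assuming $\text{var}(x) > 0$: take the covariance of both sides of the model with $x$, and use the facts that covariance is linear and that the covariance of $x$ with a constant is zero.

$$
\begin{align}
\text{cov}(x, y)
    &= \text{cov}(x, \alpha + \beta x + \varepsilon) \\
    &= \beta\,\text{var}(x) + \text{cov}(x, \varepsilon) \\
    &= \beta\,\text{var}(x)
\end{align}
$$

Dividing by $\text{var}(x)$ gives the first expression for $\beta$. Substituting $\text{cov}(x, y) = \rho_{xy}\sqrt{\text{var}(x)\text{var}(y)}$ from the definition of the correlation coefficient then gives the second:

$$
\beta = \frac{\rho_{xy}\sqrt{\text{var}(x)\text{var}(y)}}{\text{var}(x)} = \rho_{xy}\sqrt{\frac{\text{var}(y)}{\text{var}(x)}}
$$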

For example, suppose we're interested in quantifying the relationship between the number of years of schooling an individual has received ($x$) and their income ($y$). If we were to compute the coefficient $\beta$, it would provide a way of quantifying the association between these two variables.
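Here is a rough sketch of that computation. The data are simulated stand-ins for schooling and income, with invented numbers in arbitrary units; they are not drawn from any real dataset. The snippet computes $\beta$ three ways (from the covariance, from the correlation, and from a least-squares line fit) to check that they agree.

```python
import numpy as np

# Simulated stand-in data: every number here is invented for illustration.
rng = np.random.default_rng(1)
x = rng.uniform(8, 20, size=500)                 # hypothetical years of schooling
y = 10 + 3 * x + rng.normal(scale=5, size=500)   # hypothetical income, arbitrary units

# Slope from the covariance formula: beta = cov(x, y) / var(x).
beta_cov = np.cov(x, y, bias=True)[0, 1] / np.var(x)

# Slope from the correlation formula: beta = rho * sqrt(var(y) / var(x)).
rho = np.corrcoef(x, y)[0, 1]
beta_rho = rho * np.sqrt(np.var(y) / np.var(x))

# Slope from an ordinary least-squares line fit.
# For deg=1, np.polyfit returns [slope, intercept].
beta_ols, alpha_ols = np.polyfit(x, y, deg=1)

print(beta_cov, beta_rho, beta_ols)
```

All three slopes should match up to floating-point error. Note that none of them, on its own, says anything about whether schooling causes higher income.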

### Multiple linear regression

*This section is in progress.*

Suppose we are now interested in quantifying the relationship between two variables $x$ and $y$, but we also want to account for or "control for" the effect of a third variable, $w$. Assuming a linear relationship between them, we can extend our earlier relationship:

$$y = \alpha + \beta x + \gamma w + \varepsilon.$$

In this case, we can interpret $\beta$ as a measure of association between $x$ and $y$ while controlling for the effect of a third variable $w$.
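The following sketch illustrates the difference between the two coefficients on simulated data. The data-generating process (including the coefficient values 2 and 3 and the dependence of $x$ on $w$) is an assumption made up for this example. It fits the multiple regression with `np.linalg.lstsq` and compares the resulting $\beta$ with the slope from regressing $y$ on $x$ alone.

```python
import numpy as np

# Simulated data: the generating process below is invented for illustration.
rng = np.random.default_rng(2)
n = 2000
w = rng.normal(size=n)
x = 0.8 * w + rng.normal(size=n)                      # x is correlated with w
y = 1.0 + 2.0 * x + 3.0 * w + rng.normal(size=n)

# Simple regression of y on x alone: slope = cov(x, y) / var(x).
beta_simple = np.cov(x, y, bias=True)[0, 1] / np.var(x)

# Multiple regression: design matrix with columns [1, x, w].
X = np.column_stack([np.ones(n), x, w])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat, beta_hat, gamma_hat = coef

print("slope ignoring w:       ", beta_simple)
print("slope controlling for w:", beta_hat)
print("coefficient on w:       ", gamma_hat)
```

In this simulation, the slope that ignores $w$ should come out noticeably larger than 2, because it absorbs part of the association between $w$ and $y$, while the slope from the multiple regression should land near 2.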

Here are some refresher resources on linear regression:
* [Prob 140 textbook Chapter 24](http://prob140.org/textbook/content/Chapter_24/00_Simple_Linear_Regression.html)
* Data 100 textbook: [Chapter 14](https://www.textbook.ds100.org/ch/14/linear_models.html) and [Chapter 18](https://www.textbook.ds100.org/ch/18/mult_model.html)

## Categorical Data

Correlation and regression coefficients are a fine way to measure association between continuous numerical variables, but what if our data are categorical? We'll restrict ourselves to binary data in this section for simplicity. We'll look at three commonly used metrics for these cases: risk difference, risk ratio, and odds ratio.

When dealing with categorical data, we'll often start by visualizing the data in a contingency table.  We've already seen examples of these in the previous section, when we looked at <todo simpson example>.
    

|      | $y=0$ | $y=1$ |
| --- | --- | --- |
|$x=0$ | $n_{00}$ | $n_{01}$ |
|$x=1$ | $n_{10}$ | $n_{11}$ |

Note that these are different from the 2x2 tables that we used at the beginning of the course: there, the rows and columns represented reality and our decisions, respectively. Here, they represent two different observed variables in our data. Just as with those tables, there's no standard convention about what to put in the rows vs the columns.

For example, suppose we are interested in examining the relationship between receiving a vaccine ($x$) for a particular virus and being infected with that virus ($y$). We'll look at a study conducted on that vaccine. We'll use $x=1$ to indicate getting the vaccine, and $y=1$ to indicate being infected with the virus. In this case, for example, $n_{10}$ would represent the number of people in the study who received the vaccine and did not get infected.

Most of the metrics we'll discuss are based on the **risk**, which represents the probability of $y=1$ given a particular value of $x$. For example, the **risk difference (RD)** is defined as follows:

$$
\begin{align}
RD 
    &= \underbrace{P(y=1 \mid x=1)}_{\text{risk for }x=1}
       - \underbrace{P(y=1 \mid x=0)}_{\text{risk for }x=0} \\
    &= \quad\,\overbrace{\frac{n_{11}}{n_{10} + n_{11}}}^{} \quad\,\,-\quad\, \overbrace{\frac{n_{01}}{n_{00} + n_{01}}}^{}
\end{align}
$$

In our vaccination example, the term *risk* has an intuitive interpretation: it represents your risk of being infected given whether or not you were vaccinated. If the vaccine works as intended (i.e., there's a strong negative association between being vaccinated and being infected), the risk of infection should be lower for vaccinated people than for unvaccinated people, and the risk difference should be a negative number far from 0. On the other hand, if there's little to no relationship between vaccination and infection, then the two terms should be very similar, and the risk difference should be close to 0. We can see the same fact mathematically. If $x$ and $y$ are independent, then $P(y=1 \mid x=1) = P(y=1 \mid x=0) = P(y=1)$, so the two terms are equal. This means that they cancel and that the risk difference is 0.
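As a small sketch, the risk difference can be computed directly from the four cells of the contingency table. The counts below are hypothetical, chosen only to make the arithmetic easy to follow; they do not come from any real study.

```python
# Hypothetical counts, invented for illustration (not from any real study).
n00, n01 = 700, 300   # x=0 (unvaccinated): not infected, infected
n10, n11 = 900, 100   # x=1 (vaccinated):   not infected, infected

# Risk = P(y=1 | x), estimated as the fraction of each row with y=1.
risk_x1 = n11 / (n10 + n11)
risk_x0 = n01 / (n00 + n01)

risk_difference = risk_x1 - risk_x0
print(risk_x1, risk_x0, risk_difference)
```

With these made-up counts, the risk is 0.1 for the vaccinated group and 0.3 for the unvaccinated group, so the risk difference is about $-0.2$.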

The **risk ratio (RR)** is defined similarly as the ratio (instead of the difference) between the two quantities above:

$$
RR = \frac{P(y=1 \mid x=1)}{P(y=1 \mid x=0)}
$$

We can use similar reasoning as above to conclude that this ratio should be 1 when $x$ and $y$ are independent.
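A short sketch of the risk ratio, using a hypothetical helper function `risk_ratio` and made-up counts (neither comes from a real study or library). The second call uses counts where the risk is the same in both rows, to illustrate the independence case.

```python
def risk_ratio(n00, n01, n10, n11):
    """Risk ratio P(y=1 | x=1) / P(y=1 | x=0) from 2x2 table counts."""
    risk_x1 = n11 / (n10 + n11)
    risk_x0 = n01 / (n00 + n01)
    return risk_x1 / risk_x0

# Hypothetical counts where the x=1 group has lower risk (0.1 vs. 0.3).
rr_associated = risk_ratio(n00=700, n01=300, n10=900, n11=100)

# Hypothetical counts where both groups have the same risk (0.2 and 0.2).
rr_independent = risk_ratio(n00=800, n01=200, n10=400, n11=100)

print(rr_associated, rr_independent)
```

The first ratio is about one third, and the second is 1 (up to floating-point error), matching the reasoning above.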

The third commonly used measure is the **odds ratio (OR)**. It's the ratio of two odds, where each odds is itself a ratio:

$$
OR = \frac{%
        \overbrace{P(y=1|x=1)/P(y=0|x=1)}^{\text{odds of }y\text{ in the presence of }x}}{%
        \underbrace{P(y=1|x=0)/P(y=0|x=0)}_{\text{odds of }y\text{ in the absence of }x}}
$$

While this looks more complicated, we can show that it simplifies to:

$$
OR = \frac{n_{00}}{n_{10}} \cdot \frac{n_{11}}{n_{01}}
$$