In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from matplotlib import rcParams

%matplotlib inline

In [2]:
sns.set(style='whitegrid', palette='muted', font_scale=1.5)

rcParams['figure.figsize'] = 14, 8

plt.xkcd()

RANDOM_SEED = 42

# Correlation

## Analogy

## Diagram

## Example

## Plain English

The correlation measures the strength and direction of the linear relationship between two variables.

## Technical Definition

### Covariance

Consider two random variables $X$ and $Y$. The *covariance* between $X$ and $Y$ is written as $Cov(X, Y)$. The covariance gives us information about how $X$ and $Y$ are statistically related. Here is a definition:

$$\text{Cov}(X,Y)=\mathbb{E}\big[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])\big]=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y].$$

Proof:

Let $a = \mathbb{E}[X]$ and $b = \mathbb{E}[Y]$. Then, by linearity of expectation, we have

$$
\begin{align}
\mathbb{E}\big[(X - a)(Y - b)\big] &=\mathbb{E}(XY - bX - aY + ab)\\
&=\mathbb{E}(XY) - b\mathbb{E}(X) - a\mathbb{E}(Y) + ab\\
&=\mathbb{E}(XY) - \mathbb{E}(X)\mathbb{E}(Y) - \mathbb{E}(Y)\mathbb{E}(X) + \mathbb{E}(X)\mathbb{E}(Y)\\
&=\mathbb{E}(XY) - \mathbb{E}(X)\mathbb{E}(Y).
\end{align}$$
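The identity can be checked numerically. The following is a minimal sketch on simulated data; the sample size, the seed, and the variable names are illustrative assumptions, and expectations are replaced by sample means.

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed
x = rng.normal(size=1_000)
y = 2.0 * x + rng.normal(size=1_000)  # y depends on x, so the covariance is nonzero

# Definition: mean of the product of the deviations from the means
cov_definition = np.mean((x - x.mean()) * (y - y.mean()))

# Shortcut: E[XY] - E[X]E[Y], with sample means in place of expectations
cov_shortcut = np.mean(x * y) - x.mean() * y.mean()

# np.cov divides by n - 1 by default; bias=True divides by n instead
cov_numpy = np.cov(x, y, bias=True)[0, 1]

print(cov_definition, cov_shortcut, cov_numpy)
```

All three numbers should agree up to floating-point error.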

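When $Cov(X, Y) = 0$, we say that $X$ and $Y$ are *uncorrelated*. The sign of the covariance indicates the direction of the relationship: it is positive when $X$ and $Y$ tend to move together and negative when they tend to move in opposite directions. Its magnitude depends on the units of $X$ and $Y$, so on its own it is not a measure of strength.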

#### Some properties of covariance

For any random variables $X$, $Y$, and $Z$ and any scalars $a$, $b$, and $c$:

1. $Cov(X, X) = Var(X)$
2. if $X$ and $Y$ are independent, then $Cov(X, Y) = 0$
3. $Cov(X, Y) = Cov(Y, X)$
4. $Cov(aX, Y) = aCov(X, Y)$
5. $Cov(X + c, Y) = Cov(X, Y)$
6. $Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z)$
7. $Cov(X, aY + b) = aCov(X, Y)$
8. More generally,

$$Cov\left(\sum_{i=1}^{m}a_iX_i, \sum_{j=1}^{n}b_jY_j\right)=\sum_{i=1}^{m} \sum_{j=1}^{n} a_ib_j Cov(X_i,Y_j).$$
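Several of the properties listed above hold exactly for sample covariances too, so they are easy to sanity-check. This is a rough sketch; the helper `cov`, the constants, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def cov(u, v):
    """Population-style covariance (divide by n) of two equally long samples."""
    return np.mean((u - u.mean()) * (v - v.mean()))

rng = np.random.default_rng(42)  # illustrative seed
x, y, z = rng.normal(size=(3, 500))
a, b, c = 3.0, -2.0, 5.0

checks = {
    "Cov(X, X) = Var(X)": np.isclose(cov(x, x), x.var()),
    "symmetry": np.isclose(cov(x, y), cov(y, x)),
    "scaling": np.isclose(cov(a * x, y), a * cov(x, y)),
    "shift invariance": np.isclose(cov(x + c, y), cov(x, y)),
    "additivity": np.isclose(cov(x + y, z), cov(x, z) + cov(y, z)),
    "affine second argument": np.isclose(cov(x, a * y + b), a * cov(x, y)),
}

for name, ok in checks.items():
    print(f"{name}: {ok}")
```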

Note that if $X$ and $Y$ are independent, we have $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$, which implies that $Cov(X, Y) = 0.$ Thus, if $X$ and $Y$ are independent, they are uncorrelated. Generally, the converse is not true.
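A standard counterexample for the converse: take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. The sketch below computes the exact moments from the probability table; the variable names are illustrative.

```python
import numpy as np

values = np.array([-1.0, 0.0, 1.0])  # support of X
probs = np.full(3, 1 / 3)            # X is uniform on the support

e_x = np.sum(probs * values)         # E[X]
e_y = np.sum(probs * values**2)      # E[Y], with Y = X^2
e_xy = np.sum(probs * values**3)     # E[XY] = E[X^3]
covariance = e_xy - e_x * e_y

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
# Y = 0 exactly when X = 0, so the joint probability is 1/3.
p_joint = 1 / 3
p_product = (1 / 3) * (1 / 3)

print("Cov(X, Y) =", covariance)
print("P(X=0, Y=0) =", p_joint, "but P(X=0) P(Y=0) =", p_product)
```

The covariance is zero even though $Y$ is a deterministic function of $X$.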

### Variance of a sum

One common application of covariance is finding the variance of a sum of random variables. Let $Z = X + Y$. Then

$$
\begin{align}
Var(Z)&=Cov(Z,Z)\\
&=Cov(X+Y,X+Y)\\
&=Cov(X,X)+Cov(X,Y)+ Cov(Y,X)+Cov(Y,Y)\\
&=Var(X)+Var(Y)+2 Cov(X,Y).
\end{align}$$
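The formula also holds exactly for sample variances and covariances when both are computed with the same normalization. A small sketch on simulated, deliberately correlated data (the data and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed
x = rng.normal(size=1_000)
y = 0.5 * x + rng.normal(size=1_000)  # correlated with x on purpose

var_of_sum = np.var(x + y)
cov_xy = np.cov(x, y, bias=True)[0, 1]  # bias=True matches np.var's default (divide by n)
var_formula = np.var(x) + np.var(y) + 2 * cov_xy

print(var_of_sum, var_formula)
# Dropping the covariance term would understate the variance here
print(np.var(x) + np.var(y))
```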


### Correlation coefficient

The **correlation coefficient** $\rho(X, Y)$ of two random variables $X$ and $Y$ that have nonzero variances is defined as

$$\rho(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X)Var(Y)}}$$

It can be viewed as a normalized version of the covariance $Cov(X, Y)$, and it can be shown that $\rho$ ranges from -1 to 1.

If $\rho > 0$ (or $\rho < 0$), then the values of $X - \mathbb{E}[X]$ and $Y - \mathbb{E}[Y]$ "tend" to have the same (or opposite, respectively) sign. The size of $|\rho|$ provides a normalized measure of the extent to which this is true. Assuming that $X$ and $Y$ have positive variances, $\rho = 1$ (or $\rho = -1$) if and only if there exists a positive (or negative, respectively) constant $c$ such that:

$$Y - \mathbb{E}[Y] = c(X - \mathbb{E}[X])$$
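As a rough illustration, the coefficient can be computed directly from its definition and compared with `np.corrcoef`. The helper name `correlation` and the simulated data are assumptions made for this sketch.

```python
import numpy as np

def correlation(x, y):
    """Correlation coefficient: covariance divided by the product of standard deviations."""
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    return cov_xy / np.sqrt(x.var() * y.var())

rng = np.random.default_rng(42)  # illustrative seed
x = rng.normal(size=1_000)
noisy = 2.0 * x + rng.normal(size=1_000)  # linear trend plus noise
exact = -3.0 * x + 1.0                    # exact linear relation with negative slope

rho_noisy = correlation(x, noisy)
rho_exact = correlation(x, exact)

print(rho_noisy, np.corrcoef(x, noisy)[0, 1])
print(rho_exact)
```

The noisy pair gives a value strictly between 0 and 1, while the exact linear relation with a negative slope gives -1 up to floating-point error.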

#### Some properties of the correlation coefficient

For any random variables $X$ and $Y$ and any scalars $a$, $b$, $c$, and $d$:

1. $-1 \leq \rho(X, Y) \leq 1$
2. if $\rho(X, Y) = 1$, then $Y = aX + b$, where $a > 0$
3. if $\rho(X, Y) = -1$, then $Y = aX + b$, where $a < 0$
4. $\rho(aX + b, cY + d) = \rho(X,Y)$ for $a, c > 0$
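The invariance under positive rescaling and shifting can be checked on a sample. In this sketch the constants and the data are arbitrary illustrative choices; it also shows that a negative scale factor flips the sign of $\rho$.

```python
import numpy as np

def rho(u, v):
    """Sample correlation coefficient via np.corrcoef."""
    return np.corrcoef(u, v)[0, 1]

rng = np.random.default_rng(42)  # illustrative seed
x = rng.normal(size=500)
y = x + rng.normal(size=500)

base = rho(x, y)
rescaled = rho(2.0 * x + 1.0, 0.5 * y - 3.0)  # a, c > 0: unchanged
flipped = rho(-2.0 * x + 1.0, 0.5 * y - 3.0)  # a < 0: sign flips

print(base, rescaled, flipped)
```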

Note that if $X$ and $Y$ are uncorrelated, we can conclude that 

$$Var(X + Y) = Var(X) + Var(Y).$$

In practice, we often talk about the following degrees of correlation:

- weak - $0 \leq |\rho| < 0.3$
- moderate - $0.3 \leq |\rho| < 0.5$
- significant - $0.5 \leq |\rho| < 0.7$
- strong - $0.7 \leq |\rho| < 0.9$
- very strong - $0.9 \leq |\rho| \leq 1$
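These thresholds translate into a small lookup. The function name `describe_correlation` is hypothetical, and the sketch assumes the labels apply to the absolute value of the coefficient so that negative correlations are classified by their strength as well.

```python
def describe_correlation(rho):
    """Map a correlation coefficient to the verbal scale listed above."""
    strength = abs(rho)  # assumption: the scale is applied to |rho|
    if strength > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if strength < 0.3:
        return "weak"
    if strength < 0.5:
        return "moderate"
    if strength < 0.7:
        return "significant"
    if strength < 0.9:
        return "strong"
    return "very strong"

for value in (0.1, -0.45, 0.65, -0.8, 0.95):
    print(value, describe_correlation(value))
```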

For two discrete random variables, the degree of dependence can also be described using Pearson's coefficient $\phi^2$ (the mean square contingency), which is defined as:

$$\phi^2 = \sum_{i, j}\frac{(p_{ij} - p_{i\cdot}p_{\cdot j})^2}{p_{i\cdot}p_{\cdot j}},$$

where $p_{ij} = P(X = x_i, Y = y_j)$ is the joint probability and $p_{i\cdot} = \sum_j p_{ij}$ and $p_{\cdot j} = \sum_i p_{ij}$ are the marginal probabilities. If the random variables $X$ and $Y$ are independent, then $\phi^2 = 0$. The converse is also true.