# Covariance and correlation

### Covariance
Covariance is a measure of the joint variability of two variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, i.e., they tend to show similar behavior, the covariance is positive. Formally, given two random variables $X$ and $Y$ with associated vectors of samples $\mathbf{x}$ and $\mathbf{y} \in \mathbb{R}^m$, respectively, the **sample covariance** is defined as:
\begin{align*}
\text{cov}(\mathbf{x}, \mathbf{y}) & = \frac{(\mathbf{x}'\cdot \mathbf{y}')}{m-1}\\
                                &  = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \overline{\mathbf{x}})(y_i - \overline{\mathbf{y}}),
\end{align*}
where $\overline{\mathbf{x}}$ and $\overline{\mathbf{y}}$ are the sample means of $\mathbf{x}$ and $\mathbf{y}$, respectively, and $\mathbf{x}'$ and $\mathbf{y}'$ are the de-meaned versions of $\mathbf{x}$ and $\mathbf{y}$, respectively. 
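
As a quick numerical illustration of the definition above, here is a minimal sketch in NumPy. The height and weight values are made up for illustration, and the variable names (`x_dm`, `cov_xy`, etc.) are not part of these notes. It computes the sample covariance as a scaled dot product of de-meaned vectors and compares the result with `np.cov`, which also divides by $m-1$ by default.

```python
import numpy as np

# Hypothetical sample: heights (in meters) and weights (in kg) of m = 5 people.
x = np.array([1.60, 1.70, 1.75, 1.80, 1.90])
y = np.array([55.0, 65.0, 70.0, 72.0, 88.0])
m = len(x)

# De-meaned versions x' and y'.
x_dm = x - x.mean()
y_dm = y - y.mean()

# Sample covariance: dot product of the de-meaned vectors, divided by m - 1.
cov_xy = (x_dm @ y_dm) / (m - 1)

# np.cov returns the 2x2 covariance matrix of (x, y); entry [0, 1] is cov(x, y).
cov_np = np.cov(x, y)[0, 1]

print(cov_xy, cov_np)
```

Both numbers should agree up to floating-point error, and the value is positive here because taller people in this made-up sample also tend to be heavier.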

The sample covariance $\textup{cov}(\mathbf{x},\mathbf{y})$ is a measure of how the two variables simultaneously vary about their respective means (*for the particular sample under consideration*):
- If the covariance is positive, then on average, when one feature varies above its mean, so does the other.
- If the covariance is negative, then on average, when one feature varies above its mean, the other varies below its mean.
- If the covariance is zero (or close to zero), then the two features are **uncorrelated**, which means that there is no discernible linear relationship between how they vary about their respective means. That is, when one feature varies above its mean, the other does not have a particular tendency to vary above or below its mean.

### De-unitizing features
This is all well and good, yet the covariance is not a very interpretable measure of the relationship between two features. This is because the covariance is not standardized: it depends on the scales of the features. For example, if we were to measure the height of a person in meters, and their weight in kilograms, then the covariance between height and weight would be different than if we measured height in centimeters and weight in grams.
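
To make the scale dependence concrete, here is a small sketch with the same kind of made-up height and weight data; the helper `sample_cov` is a hypothetical name introduced only for this example. Converting meters to centimeters and kilograms to grams multiplies the covariance by $100 \times 1000$, even though the underlying relationship between the two features is unchanged.

```python
import numpy as np

def sample_cov(x, y):
    """Sample covariance: dot product of de-meaned vectors over m - 1."""
    return ((x - x.mean()) @ (y - y.mean())) / (len(x) - 1)

# Hypothetical data: heights in meters, weights in kilograms.
height_m = np.array([1.60, 1.70, 1.75, 1.80, 1.90])
weight_kg = np.array([55.0, 65.0, 70.0, 72.0, 88.0])

# The same measurements expressed in centimeters and grams.
height_cm = 100.0 * height_m
weight_g = 1000.0 * weight_kg

cov_m_kg = sample_cov(height_m, weight_kg)
cov_cm_g = sample_cov(height_cm, weight_g)

# The ratio equals the product of the two conversion factors.
print(cov_m_kg, cov_cm_g, cov_cm_g / cov_m_kg)
```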

To fix this issue, we follow a principle from physics: *If you divide two quantities measured in the same units, then the result is a unit-less (or scale-less) quantity; that is, it is a pure ratio.*

So, to get rid of the scale issue, we want to *normalize* our vectors $\mathbf{x}$ and $\mathbf{y}$ by some scalars $s_{\mathbf{x}}$ and $s_{\mathbf{y}}$ having the same units as the vectors. A natural choice is to take these scalars to be the sample standard deviations; recall that the sample standard deviation of $\mathbf{x} \in \mathbb{R}^m$ is given by:
\begin{equation*}
    s_{\mathbf{x}} = \frac{|| \mathbf{x}' ||}{\sqrt{m-1}} = \sqrt{\frac{1}{m-1} \sum_{i=1}^{m} (x_i - \overline{\mathbf{x}})^2}.
\end{equation*}
Here is a made-up name for a natural and very commonly used concept, for which I could not find a standard name: we define the **de-unitized** feature vector $\mathbf{x}^* \in \mathbb{R}^m$ corresponding to $\mathbf{x} \in \mathbb{R}^m$ as:
\begin{align*}
    \mathbf{x}^* & = \frac{\mathbf{x}}{s_{\mathbf{x}}}\\
                & = \sqrt{m-1} \frac{ \mathbf{x} }{ || \mathbf{x}' || }.
\end{align*}
The de-unitized vectors satisfy two important properties, which you will verify in HW 2:

1. They are unit-less, as desired (hence the name).
2. Their sample variance (and hence, also standard deviation) equals $1$, i.e. $s_{\mathbf{x}^*} = 1$.
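
The two properties can be sanity-checked numerically (this is not a substitute for the proof). The sketch below uses made-up height data and assumes NumPy's `np.std` with `ddof=1`, which divides by $m-1$ and so matches the sample standard deviation defined above.

```python
import numpy as np

# Hypothetical feature: heights in meters.
x = np.array([1.60, 1.70, 1.75, 1.80, 1.90])
m = len(x)

# Sample standard deviation via the norm of the de-meaned vector.
x_dm = x - x.mean()
s_x = np.linalg.norm(x_dm) / np.sqrt(m - 1)

# De-unitized version x* = x / s_x.
x_star = x / s_x

# Property 2: the sample standard deviation of x* is 1.
s_x_star = np.std(x_star, ddof=1)

# Property 1: changing units (meters -> centimeters) does not change x*.
x_cm = 100.0 * x
x_star_cm = x_cm / np.std(x_cm, ddof=1)

print(s_x, s_x_star)
print(np.allclose(x_star, x_star_cm))
```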

### Correlation
The **sample correlation** between two features $\mathbf{x}$ and $\mathbf{y}$ is defined as the sample covariance of their de-unitized versions:
\begin{align*}
    \rho (\mathbf{x}, \mathbf{y}) & = \text{cov}(\mathbf{x}^*, \mathbf{y}^*).
\end{align*}
You will verify in HW 2 that this can be re-written in two different ways:
\begin{align*}
    \rho (\mathbf{x}, \mathbf{y}) & = \frac{ \text{cov}(\mathbf{x}, \mathbf{y}) }{ s_{\mathbf{x}} s_{\mathbf{y}} } = \frac{ \mathbf{x}'\cdot\mathbf{y}' }{ ||\mathbf{x}'|| ||\mathbf{y}'|| }.
\end{align*}
The right-most expression in particular implies a very natural geometric interpretation of the correlation: it is the cosine of the angle between the de-meaned vectors $\mathbf{x}'$ and $\mathbf{y}'$. Indeed, if the angle between these vectors is $\theta$, then we have
\begin{equation*}
    \mathbf{x}'\cdot\mathbf{y}' = ||\mathbf{x}'|| ||\mathbf{y}'|| \cos(\theta),
\end{equation*}
which implies that 
\begin{equation*}
    \rho (\mathbf{x}, \mathbf{y}) = \cos(\theta).
\end{equation*}
Thus, we see that the correlation is a standardized measure of the relationship between two features, which is independent of the scales of the features. In particular, we have
\begin{equation*}
    -1 \leqslant \rho (\mathbf{x}, \mathbf{y}) \leqslant 1,
\end{equation*}
with the following interpretation of the values (very similar to the covariance, but magically scale-less):

- If $\rho (\mathbf{x}, \mathbf{y}) > 0$, then the two features are positively correlated, which means that if one of these varies above its mean, then the other tends to vary above its mean as well. The greater the value of $\rho (\mathbf{x}, \mathbf{y})$, the stronger the positive correlation, with the largest possible value of $1$ indicating that the two features are perfectly positively correlated.
- If $\rho (\mathbf{x}, \mathbf{y}) < 0$, then the two features are negatively correlated, which means that if one of these varies above its mean, then the other tends to vary below its mean. The smaller the value of $\rho (\mathbf{x}, \mathbf{y})$, the stronger the negative correlation, with the smallest possible value of $-1$ indicating that the two features are perfectly negatively correlated.
- If $\rho (\mathbf{x}, \mathbf{y}) = 0$, then the two features are uncorrelated, which means that there is no discernible linear relationship between how they vary about their respective means. In this case, when one feature varies above its mean, the other does not have a particular tendency to vary above or below its mean.
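
The three expressions for the correlation can be compared numerically. The following is a minimal sketch with made-up data; `np.corrcoef` is NumPy's built-in correlation routine, and all variable names are chosen for this example only.

```python
import numpy as np

# Hypothetical sample: heights (m) and weights (kg).
x = np.array([1.60, 1.70, 1.75, 1.80, 1.90])
y = np.array([55.0, 65.0, 70.0, 72.0, 88.0])

x_dm = x - x.mean()
y_dm = y - y.mean()

# Geometric form: cosine of the angle between the de-meaned vectors.
cos_theta = (x_dm @ y_dm) / (np.linalg.norm(x_dm) * np.linalg.norm(y_dm))

# Statistical form: covariance divided by the product of standard deviations.
s_x = np.std(x, ddof=1)
s_y = np.std(y, ddof=1)
rho_cov = np.cov(x, y)[0, 1] / (s_x * s_y)

# NumPy's built-in correlation coefficient.
rho_np = np.corrcoef(x, y)[0, 1]

# The angle itself, in degrees (clipped to guard against rounding past +/-1).
theta_deg = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

print(cos_theta, rho_cov, rho_np, theta_deg)
```

A correlation close to $1$ corresponds to a small angle between $\mathbf{x}'$ and $\mathbf{y}'$, a correlation close to $0$ to nearly perpendicular vectors, and a correlation close to $-1$ to vectors pointing in nearly opposite directions.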

### Standardized features
The **standardized** version of a feature vector $\mathbf{x} \in \mathbb{R}^m$ is defined as:
\begin{align*}
    \mathbf{x}^{\dagger} & = \frac{\mathbf{x}'}{s_{\mathbf{x}}} = \sqrt{m-1} \frac{ \mathbf{x}' }{ || \mathbf{x}' || }.
\end{align*}
Till now, for a given $\mathbf{x}$, we've defined three modifications $\mathbf{x}'$, $\mathbf{x}^*$, and $\mathbf{x}^{\dagger}$:

1. **The de-meaned** version $\mathbf{x}'$: this has mean $0$. If we visualize the sample distribution of $\mathbf{x}$, then the de-meaned version is the same distribution, but shifted by $\overline{\mathbf{x}}$ so that its mean is $0$.
2. **The de-unitized** version $\mathbf{x}^*$: this has standard deviation $1$, and is unit-less. If we visualize the sample distribution of $\mathbf{x}$, then the de-unitized version is the same distribution, but stretched/scaled by $1/s_{\mathbf{x}}$ so that its standard deviation is $1$.
3. **The standardized** version $\mathbf{x}^{\dagger}$: this has mean $0$ and standard deviation $1$, and is unit-less. Visually, the distribution of $\mathbf{x}^{\dagger}$ is the same as that of $\mathbf{x}$, but shifted to have mean $0$ and stretched/scaled to have standard deviation $1$.

It is a simple exercise to see that standardizing a vector is the same as both de-meaning and de-unitizing it (in either order):
\begin{equation*}
    \mathbf{x}^{\dagger} = (\mathbf{x}^*)' = (\mathbf{x}')^*.
\end{equation*}

Moreover, you can also check that the correlation between $\mathbf{x}$ and $\mathbf{y}$ is simply a dot product of the standardized versions, scaled by $1/(m-1)$:
\begin{equation*}
    \rho (\mathbf{x}, \mathbf{y}) = \frac{(\mathbf{x}^{\dagger} \cdot \mathbf{y}^{\dagger})}{m-1},
\end{equation*}
which is exactly analogous to the definition of the covariance:
\begin{equation*}
    \text{cov}(\mathbf{x}, \mathbf{y}) = \frac{(\mathbf{x}' \cdot \mathbf{y}')}{m-1}.
\end{equation*}
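
Both claims in this section (standardizing equals de-meaning and de-unitizing in either order, and the correlation is a scaled dot product of standardized vectors) can be checked numerically. This is a sketch with made-up data; the helpers `demean` and `deunitize` are hypothetical names that mirror the notation $\mathbf{x}'$ and $\mathbf{x}^*$.

```python
import numpy as np

def demean(v):
    """Return v' = v minus its sample mean."""
    return v - v.mean()

def deunitize(v):
    """Return v* = v divided by its sample standard deviation (m - 1 in the denominator)."""
    return v / np.std(v, ddof=1)

# Hypothetical sample: heights (m) and weights (kg).
x = np.array([1.60, 1.70, 1.75, 1.80, 1.90])
y = np.array([55.0, 65.0, 70.0, 72.0, 88.0])
m = len(x)

# Standardize x in both orders: (x*)' and (x')*.
x_dag_a = demean(deunitize(x))
x_dag_b = deunitize(demean(x))
y_dag = deunitize(demean(y))

# Correlation as a scaled dot product of the standardized vectors.
rho_dot = (x_dag_b @ y_dag) / (m - 1)

print(np.allclose(x_dag_a, x_dag_b))
print(x_dag_b.mean(), np.std(x_dag_b, ddof=1))
print(rho_dot, np.corrcoef(x, y)[0, 1])
```

The two orders agree because de-meaning does not change the standard deviation, so the scaling factor $1/s_{\mathbf{x}}$ is the same either way.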

### Covariance matrix

Given a list of features $X_1,\dotsc,X_n$, each of which corresponds to a vector of samples $\mathbf{x}_i \in \mathbb{R}^m$, we can package all the sample covariances into a single matrix. To do this, consider the design matrix
\begin{equation*}
    \mathbf{X} = \begin{bmatrix} \; \mathbf{x}_1 & \dotsb & \mathbf{x}_n \; \end{bmatrix} \; \in \mathbb{R}^{m\times n}.
\end{equation*}
Following our previous conventions, we can de-mean each feature to get the matrix of de-meaned features:
\begin{equation*}
    \mathbf{X}' = \begin{bmatrix} \; \mathbf{x}_1' & \dotsb & \mathbf{x}_n' \; \end{bmatrix} \; \in \mathbb{R}^{m\times n}.
\end{equation*}

The **covariance matrix** is defined as (duh) the matrix of all covariances between the features:
\begin{align*}
\Sigma_{\mathbf{X}} & = \frac{1}{m-1}(\mathbf{X}')^T \mathbf{X}'\\
                        & = \frac{1}{m-1}\begin{bmatrix} (\mathbf{x}_1')^T \\ (\mathbf{x}_2')^T \\ \vdots \\ (\mathbf{x}_n')^T \end{bmatrix} \begin{bmatrix} \mathbf{x}_1' & \mathbf{x}_2' & \cdots & \mathbf{x}_n' \end{bmatrix}\\
                        & = \frac{1}{m-1}\begin{bmatrix}
                                (\mathbf{x}_1')^T \mathbf{x}_1'  & \cdots & (\mathbf{x}_1')^T \mathbf{x}_n'\\
                                \vdots  & \ddots & \vdots\\
                                (\mathbf{x}_n')^T \mathbf{x}_1'  & \cdots & (\mathbf{x}_n')^T \mathbf{x}_n'
                            \end{bmatrix}\\
                        & = \frac{1}{m-1}\begin{bmatrix}
                                \mathbf{x}_1'\cdot \mathbf{x}_1'  & \cdots & \mathbf{x}_1'\cdot \mathbf{x}_n'\\
                                \vdots & \ddots & \vdots\\
                                \mathbf{x}_n'\cdot \mathbf{x}_1'  & \cdots & \mathbf{x}_n'\cdot \mathbf{x}_n'                       
                            \end{bmatrix}\\               
                        & = \begin{bmatrix}
                                \text{cov}(\mathbf{x}_1, \mathbf{x}_1) & \cdots & \text{cov}(\mathbf{x}_1, \mathbf{x}_n)\\
                                \vdots & \ddots & \vdots\\
                                \text{cov}(\mathbf{x}_n, \mathbf{x}_1) & \cdots & \text{cov}(\mathbf{x}_n, \mathbf{x}_n)
                            \end{bmatrix}\quad \in \mathbb{R}^{n\times n}.
\end{align*}
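
In code, the whole computation above is a single matrix product. The sketch below uses a made-up design matrix with $m = 6$ samples and $n = 3$ features (height, weight, age). Note that `np.cov` treats rows as variables by default, so `rowvar=False` is needed when features are stored as columns, as they are here.

```python
import numpy as np

# Hypothetical design matrix: rows are samples, columns are features
# (height in m, weight in kg, age in years).
X = np.array([
    [1.60, 55.0, 23.0],
    [1.70, 65.0, 35.0],
    [1.75, 70.0, 41.0],
    [1.80, 72.0, 29.0],
    [1.90, 88.0, 52.0],
    [1.65, 60.0, 47.0],
])
m, n = X.shape

# De-mean each column to get X'.
X_dm = X - X.mean(axis=0)

# Covariance matrix: (X')^T X' / (m - 1), an n x n matrix.
Sigma = (X_dm.T @ X_dm) / (m - 1)

# NumPy's built-in version, with features as columns.
Sigma_np = np.cov(X, rowvar=False)

print(Sigma)
print(np.allclose(Sigma, Sigma_np))
```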

If we want a unit-less covariance matrix, then we can use the de-unitized features to get a matrix of correlations. That is, we start with the de-unitized features and put them into a matrix
\begin{equation*}
    \mathbf{X}^* = \begin{bmatrix} \; \mathbf{x}_1^* & \dotsb & \mathbf{x}_n^* \; \end{bmatrix} \; \in \mathbb{R}^{m\times n}.
\end{equation*}
De-meaning the columns of the matrix results in the matrix of standardized features:
\begin{equation*}
    \mathbf{X}^{\dagger} = \begin{bmatrix} \; \mathbf{x}_1^{\dagger} & \dotsb & \mathbf{x}_n^{\dagger} \; \end{bmatrix} \; \in \mathbb{R}^{m\times n}.
\end{equation*}
Then, the **correlation matrix** is defined as:
\begin{align*}
    \mathbf{K}_{\mathbf{X}} & = \frac{1}{m-1}(\mathbf{X}^{\dagger})^T \mathbf{X}^{\dagger}\\
                            & = \begin{bmatrix}
                                \textup{cov}(\mathbf{x}^*_1,\mathbf{x}^*_1) & \cdots & \textup{cov}(\mathbf{x}^*_1,\mathbf{x}^*_n)\\
                                \vdots & \ddots & \vdots\\
                                \textup{cov}(\mathbf{x}^*_n,\mathbf{x}^*_1) & \cdots & \textup{cov}(\mathbf{x}^*_n,\mathbf{x}^*_n)
                            \end{bmatrix}\\
                        & = \begin{bmatrix}
                                \rho(\mathbf{x}_1, \mathbf{x}_1) & \cdots & \rho(\mathbf{x}_1, \mathbf{x}_n)\\
                                \vdots & \ddots & \vdots\\
                                \rho(\mathbf{x}_n, \mathbf{x}_1) & \cdots & \rho(\mathbf{x}_n, \mathbf{x}_n)
                            \end{bmatrix}\quad \in \mathbb{R}^{n\times n}.
\end{align*}
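
The correlation matrix can be computed the same way, starting from the standardized columns. This is a sketch with the same kind of made-up design matrix; `np.corrcoef` with `rowvar=False` is used only as a cross-check.

```python
import numpy as np

# Hypothetical design matrix: rows are samples, columns are features.
X = np.array([
    [1.60, 55.0, 23.0],
    [1.70, 65.0, 35.0],
    [1.75, 70.0, 41.0],
    [1.80, 72.0, 29.0],
    [1.90, 88.0, 52.0],
    [1.65, 60.0, 47.0],
])
m, n = X.shape

# Standardize each column: subtract the mean, divide by the sample std.
X_dag = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Correlation matrix: (X^dagger)^T X^dagger / (m - 1).
K = (X_dag.T @ X_dag) / (m - 1)

# NumPy's built-in version, with features as columns.
K_np = np.corrcoef(X, rowvar=False)

print(K)
print(np.allclose(K, K_np))
```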

Here are some basic properties of the covariance matrix $\Sigma = \Sigma_{\mathbf{X}}$:

- The covariance matrix is symmetric, i.e., $\Sigma = \Sigma^T$.
- The diagonal entries of the covariance matrix are the variances of the features, i.e., $\text{cov}(\mathbf{x}_i, \mathbf{x}_i) = \text{var}(\mathbf{x}_i) = \frac{||\mathbf{x}_i'||^2}{m-1}$.
- Since the covariance matrix is the Gram matrix of the de-meaned vectors $\mathbf{x}_1',\dotsc,\mathbf{x}_n'$, it satisfies two properties:
    - It is invertible if and only if the de-meaned features are linearly independent.
    - It is *positive semi-definite*, which means that for any vector $\mathbf{v} = \begin{bmatrix} \; v_1 & \dotsb & v_n \; \end{bmatrix}^T \in \mathbb{R}^n$, we have:
    \begin{align*}
        \mathbf{v}^T\Sigma\mathbf{v} & = \frac{1}{m-1}\mathbf{v}^T(\mathbf{X}')^T\mathbf{X}'\mathbf{v}\\
                                                    & = \frac{1}{m-1} (\mathbf{X}'\mathbf{v})^T(\mathbf{X}'\mathbf{v})\\
                                                    & = \frac{1}{m-1}||\mathbf{X}'\mathbf{v}||^2 \\
                                                    & = \frac{1}{m-1} \left|\left| \sum_{i=1}^n v_i \mathbf{x}_i' \right|\right|^2 \\
                                                    & \geq 0.
    \end{align*}

The correlation matrix $\mathbf{K} = \mathbf{K}_{\mathbf{X}}$ has more or less the same properties, but with some simplifications:

- The correlation matrix is symmetric, i.e., $\mathbf{K} = \mathbf{K}^T$.
- The diagonal entries of the correlation matrix are $1$, i.e., $\rho(\mathbf{x}_i, \mathbf{x}_i) = \textup{var}(\mathbf{x}_i^*) = 1$.
- The correlation matrix is *positive semi-definite*. This follows from a calculation similar to the one above, i.e. for any $\mathbf{v} \in \mathbb{R}^n$, we have
    \begin{align*}
        \mathbf{v}^T\mathbf{K}\mathbf{v} & = \frac{1}{m-1}\mathbf{v}^T(\mathbf{X}^{\dagger})^T\mathbf{X}^{\dagger}\mathbf{v}\\
                                                    & = \frac{1}{m-1} (\mathbf{X}^{\dagger}\mathbf{v})^T(\mathbf{X}^{\dagger}\mathbf{v})\\
                                                    & = \frac{1}{m-1}||\mathbf{X}^{\dagger}\mathbf{v}||^2 \\
                                                    & = \frac{1}{m-1} \left|\left| \sum_{i=1}^n v_i \mathbf{x}_i^{\dagger} \right|\right|^2 \\
                                                    & \geq 0.
    \end{align*}
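
The properties listed above can be spot-checked numerically. The sketch below is not a proof: it evaluates the quadratic form on 1000 random test vectors (an arbitrary choice) for a made-up design matrix, and confirms that it matches $\frac{1}{m-1}||\mathbf{X}'\mathbf{v}||^2$ and is never negative.

```python
import numpy as np

# Hypothetical design matrix: rows are samples, columns are features.
X = np.array([
    [1.60, 55.0, 23.0],
    [1.70, 65.0, 35.0],
    [1.75, 70.0, 41.0],
    [1.80, 72.0, 29.0],
    [1.90, 88.0, 52.0],
    [1.65, 60.0, 47.0],
])
m, n = X.shape
X_dm = X - X.mean(axis=0)
Sigma = (X_dm.T @ X_dm) / (m - 1)
K = np.corrcoef(X, rowvar=False)

# Random test vectors; each row of V is one vector v in R^n.
rng = np.random.default_rng(0)
V = rng.normal(size=(1000, n))

# v^T Sigma v and v^T K v for every row v of V.
quad_sigma = np.sum((V @ Sigma) * V, axis=1)
quad_k = np.sum((V @ K) * V, axis=1)

# The same quantity for Sigma, computed as ||X' v||^2 / (m - 1).
quad_norm = np.sum((X_dm @ V.T) ** 2, axis=0) / (m - 1)

# Eigenvalues of a symmetric PSD matrix are non-negative.
eig_sigma = np.linalg.eigvalsh(Sigma)
eig_k = np.linalg.eigvalsh(K)

# Invertibility: the de-meaned columns are independent iff rank(X') = n.
rank = np.linalg.matrix_rank(X_dm)

print(quad_sigma.min(), quad_k.min())
print(eig_sigma, eig_k, rank)
```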

We discuss some more interesting properties of the covariance matrix as preparation for our future discussion on PCA.

### Principal components
Since the covariance matrix is symmetric, the classical **Spectral Theorem** implies that it can be diagonalized, and moreover, in a very nice way. Let me try to summarize the key points:

1. **Orthonormal eigenvectors.** 

    There exist vectors $\mathbf{u}_1,\dotsc,\mathbf{u}_n \in \mathbb{R}^n$ that are orthonormal (i.e. $\mathbf{u}_i\cdot \mathbf{u}_j = 0$ if $i \neq j$ and $1$ if $i = j$), which are **eigenvectors** of $\Sigma$. This means that there exist scalars $\lambda_1,\dotsc,\lambda_n \in \mathbb{R}$ such that

    \begin{equation*}
        \Sigma \mathbf{u}_i = \lambda_i \mathbf{u}_i, \quad \text{for } i = 1,\dotsc,n.
    \end{equation*}
    Orthonormality here means that the collection $\mathbf{u}_1,\dotsc,\mathbf{u}_n$ looks and smells like the standard basis vectors of $\mathbb{R}^n$ (which you would have seen as $\hat{i},\hat{j}, \hat{k}$ in 3D space in vector calc).
2. **The orthogonal matrix $\mathbf{U}$.** 
We can take the orthonormal list of eigenvectors of $\Sigma$ and package them into a matrix $$\mathbf{U} = \begin{bmatrix} \; \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_n \; \end{bmatrix}.$$  This is an orthogonal matrix (by HW 1), which means that $\mathbf{U}^T = \mathbf{U}^{-1}$. 
3. **Representation with respect to $\mathbf{u}_i$'s.** 
    It is a fact that $\mathbf{u}_1,\dotsc,\mathbf{u}_n$ form an orthonormal basis for $\mathbb{R}^n$. This means that any vector $\mathbf{v} \in \mathbb{R}^n$ can be *uniquely* written as a linear combination of the $\mathbf{u}_i$'s:

    \begin{equation*}
        \mathbf{v} = \sum_{i=1}^n (\mathbf{v}\cdot \mathbf{u}_i) \mathbf{u}_i.
    \end{equation*}
    NOTE: the $\mathbf{u}_i$'s form a system of orthogonal directions, i.e. a coordinate system. The above identity is saying that, if we were to use these directions as the axes for a new coordinate system on $\mathbb{R}^n$, then the coordinates of $\mathbf{v}$ would be $(\mathbf{v}\cdot \mathbf{u}_1,\dotsc,\mathbf{v}\cdot \mathbf{u}_n)$! 
    The above equation can be derived as follows (for any $\mathbf{v} \in \mathbb{R}^n$):

    \begin{align*}
        \mathbf{v} & = I\cdot \mathbf{v}\\
                    & = \mathbf{U} \mathbf{U}^T \mathbf{v}\\
                    & = \begin{bmatrix} \; \mathbf{u}_1 & \dotsb & \mathbf{u}_n \; \end{bmatrix} \cdot \begin{bmatrix} \mathbf{u}_1^T \\ \vdots \\ \mathbf{u}_n^T \end{bmatrix} \cdot \mathbf{v}\\
                    & = \begin{bmatrix} \; \mathbf{u}_1 & \dotsb & \mathbf{u}_n \; \end{bmatrix} \cdot \begin{bmatrix} \mathbf{u}_1\cdot \mathbf{v} \\ \vdots \\ \mathbf{u}_n\cdot \mathbf{v} \end{bmatrix}\\
                    & = \sum_{i=1}^n (\mathbf{v}\cdot \mathbf{u}_i) \mathbf{u}_i.
    \end{align*}
4. **Diagonalization.** The covariance matrix can be diagonalized via the orthogonal matrix $\mathbf{U}$ as follows:
\begin{equation}\tag{1}
    \Sigma = \mathbf{U} \cdot \text{diag}(\lambda_1,\dotsc,\lambda_n) \cdot \mathbf{U}^T,
\end{equation}
where $\text{diag}(\lambda_1,\dotsc,\lambda_n)$ is the diagonal matrix with the eigenvalues of $\Sigma$ on the diagonal. Note: the eigenvalues are non-negative because of positive semi-definiteness, and if the vectors $\mathbf{x}_1',\dotsc,\mathbf{x}_n'$ are linearly independent (which is typically the case in practice when there are more samples than features, i.e. $m > n$), then the eigenvalues are all strictly positive.

    When we apply the matrix in the RHS of (1) to a vector we are doing the following: 

    - Multiplying by $\mathbf{U}^T$ sends the eigenvectors of $\Sigma$ (normalized to be unit vectors) to the standard basis vectors. Moreover, the orthogonality of $\mathbf{U}$ means that $\mathbf{U}$ acts like a **rotation** matrix (i.e. it preserves lengths and angles, though it may also reverse orientation, in which case it includes a reflection).
    - Multiplying by $\text{diag}(\lambda_1,\dotsc,\lambda_n)$ scales the standard basis vectors by the eigenvalues.
    - Multiplying by $\mathbf{U}$ then rotates the scaled standard basis vectors back to the directions of the eigenvectors of $\Sigma$.
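
The key points above can be illustrated numerically. The sketch below assumes NumPy's `np.linalg.eigh`, which is designed for symmetric matrices and returns eigenvalues in ascending order together with orthonormal eigenvectors as columns. The design matrix and the test vector `v` are made up for this example.

```python
import numpy as np

# Hypothetical design matrix: rows are samples, columns are features.
X = np.array([
    [1.60, 55.0, 23.0],
    [1.70, 65.0, 35.0],
    [1.75, 70.0, 41.0],
    [1.80, 72.0, 29.0],
    [1.90, 88.0, 52.0],
    [1.65, 60.0, 47.0],
])
m, n = X.shape
X_dm = X - X.mean(axis=0)
Sigma = (X_dm.T @ X_dm) / (m - 1)

# Eigen-decomposition of the symmetric matrix Sigma.
# Column i of U is the eigenvector u_i with eigenvalue eigvals[i].
eigvals, U = np.linalg.eigh(Sigma)

# Points 1 and 2: the columns of U are orthonormal, so U^T U = I.
gram_U = U.T @ U

# Point 4: Sigma = U diag(lambda) U^T.
Sigma_rebuilt = U @ np.diag(eigvals) @ U.T

# Point 3: coordinates of v in the eigenbasis are the dot products v . u_i.
v = np.array([1.0, -2.0, 0.5])
coords = U.T @ v
v_rebuilt = U @ coords  # sum_i (v . u_i) u_i

# Applying Sigma to v in three steps: rotate, scale, rotate back.
step1 = U.T @ v
step2 = eigvals * step1
step3 = U @ step2

print(eigvals)
print(np.allclose(Sigma_rebuilt, Sigma), np.allclose(step3, Sigma @ v))
```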

The eigenvectors $\mathbf{u}_1,\dotsc, \mathbf{u}_n$ of the covariance matrix (i.e. the columns of $\mathbf{U}$) are the so-called **principal components** of the de-meaned design matrix $\mathbf{X}'$. If we order them so that $\lambda_1 \geq \dotsb \geq \lambda_n$, then $\mathbf{u}_1$ is the direction in which the data varies the most, $\mathbf{u}_2$ is the direction of greatest variance among those orthogonal to $\mathbf{u}_1$, and so on. The eigenvalues $\lambda_1,\dotsc,\lambda_n$ are the variances of the data in the corresponding principal component directions. **Principal component analysis** (PCA), typically applied when there are many features ($n$ is large), is a technique for reducing the dimensionality of the data by projecting it onto the first few principal components (i.e. the eigenvectors corresponding to the largest eigenvalues). We will touch on this at a later date.
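
As a preview, here is a minimal sketch of the claim that the eigenvalues are the variances along the principal directions, followed by a projection onto the first $k = 2$ of them. The data and the choice of $k$ are arbitrary, and this is only an illustration under those assumptions rather than a full PCA routine. It works with the raw (unstandardized) features, so the feature with the largest scale dominates the first direction.

```python
import numpy as np

# Hypothetical design matrix: rows are samples, columns are features.
X = np.array([
    [1.60, 55.0, 23.0],
    [1.70, 65.0, 35.0],
    [1.75, 70.0, 41.0],
    [1.80, 72.0, 29.0],
    [1.90, 88.0, 52.0],
    [1.65, 60.0, 47.0],
])
m, n = X.shape
X_dm = X - X.mean(axis=0)
Sigma = (X_dm.T @ X_dm) / (m - 1)

# eigh returns eigenvalues in ascending order; reorder to descending.
eigvals, U = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

# Coordinates of each de-meaned sample along the principal directions.
scores = X_dm @ U

# The sample variance along direction u_i equals lambda_i.
score_vars = scores.var(axis=0, ddof=1)

# Keep only the first k directions to reduce the dimension from n to k.
k = 2
X_reduced = X_dm @ U[:, :k]

# Fraction of the total variance retained by the first k directions.
explained = eigvals[:k].sum() / eigvals.sum()

print(eigvals, score_vars)
print(X_reduced.shape, explained)
```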