# Feature target relationships
In this notebook:
- We will go over some of the ways to analyze the relationships between features, and between features and targets, mainly focussing on covariance and correlation.
- We will also look at some of the ways to visualize these relationships, such as scatterplots, kdeplots, and heatmaps.

### (Sample) covariance
Covariance is a measure of the joint variability of two variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, i.e., they tend to show similar behavior, the covariance is positive. Formally, given two random variables $X$ and $Y$ with associated vectors of samples $\mathbf{x}$ and $\mathbf{y} \in \mathbb{R}^m$, respectively, the **sample covariance** is defined as:
\begin{align*}
\text{cov}(\mathbf{x}, \mathbf{y}) & = \frac{1}{m-1}(\mathbf{x}'\cdot \mathbf{y}')\\
                                &  = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \overline{\mathbf{x}})(y_i - \overline{\mathbf{y}}),
\end{align*}
where $\overline{\mathbf{x}}$ and $\overline{\mathbf{y}}$ are the sample means of $\mathbf{x}$ and $\mathbf{y}$, respectively, and $\mathbf{x}'$ and $\mathbf{y}'$ are the de-meaned versions of $\mathbf{x}$ and $\mathbf{y}$, respectively.
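
As a quick sanity check, the definition can be computed directly and compared against NumPy. This is only a minimal sketch: the function name `sample_cov` and the sample values are made up for illustration, and `np.cov` serves purely as a reference.

```python
import numpy as np

def sample_cov(x, y):
    """Sample covariance of two equal-length 1-D arrays (divides by m - 1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = x.shape[0]
    x_c = x - x.mean()  # de-meaned x, i.e. x'
    y_c = y - y.mean()  # de-meaned y, i.e. y'
    return np.dot(x_c, y_c) / (m - 1)

# Illustrative data (not from any real dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_xy = sample_cov(x, y)
# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y)
cov_ref = np.cov(x, y)[0, 1]
print(cov_xy, cov_ref)
```

Here $\mathbf{x}' = (-2,-1,0,1,2)$ and $\mathbf{y}' = (-2,0,1,0,1)$, so $\mathbf{x}'\cdot\mathbf{y}' = 6$ and both computations give $6/4 = 1.5$.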

Given a list of features $X_1,\dotsc,X_n$, each of which corresponds to a vector of samples $\mathbf{x}_i \in \mathbb{R}^m$, we can package all the sample covariances into a single matrix. To do this, consider the design matrix
\begin{equation*}
    \mathbf{X} = \begin{bmatrix} \; \mathbf{x}_1 & \mathbf{x}_2 & \dotsb & \mathbf{x}_n \; \end{bmatrix} \; \in \mathbb{R}^{m\times n}.
\end{equation*}
Following our previous conventions, we can de-mean each feature to get the matrix of de-meaned features:
\begin{equation*}
    \mathbf{X}' = \begin{bmatrix} \; \mathbf{x}_1' & \mathbf{x}_2' & \dotsb & \mathbf{x}_n' \; \end{bmatrix} \; \in \mathbb{R}^{m\times n}.
\end{equation*}

The **covariance matrix** is defined as (duh) the matrix of all covariances between the features:
\begin{align*}
\Sigma_{\mathbf{X}} & = \frac{1}{m-1}(\mathbf{X}')^T \mathbf{X}'\\
                        & = \frac{1}{m-1}\begin{bmatrix} (\mathbf{x}_1')^T \\ (\mathbf{x}_2')^T \\ \vdots \\ (\mathbf{x}_n')^T \end{bmatrix} \begin{bmatrix} \mathbf{x}_1' & \mathbf{x}_2' & \cdots & \mathbf{x}_n' \end{bmatrix}\\
                        & = \frac{1}{m-1}\begin{bmatrix}
                                (\mathbf{x}_1')^T \mathbf{x}_1' & (\mathbf{x}_1')^T \mathbf{x}_2' & \cdots & (\mathbf{x}_1')^T \mathbf{x}_n'\\
                                (\mathbf{x}_2')^T \mathbf{x}_1' & (\mathbf{x}_2')^T \mathbf{x}_2' & \cdots & (\mathbf{x}_2')^T \mathbf{x}_n'\\
                                \vdots & \vdots & \ddots & \vdots\\
                                (\mathbf{x}_n')^T \mathbf{x}_1' & (\mathbf{x}_n')^T \mathbf{x}_2' & \cdots & (\mathbf{x}_n')^T \mathbf{x}_n'
                            \end{bmatrix}\\
                        & = \begin{bmatrix}
                                \frac{\mathbf{x}_1'\cdot \mathbf{x}_1'}{m-1} & \frac{\mathbf{x}_1'\cdot \mathbf{x}_2'}{m-1} & \cdots & \frac{\mathbf{x}_1'\cdot \mathbf{x}_n'}{m-1}\\
                                \frac{\mathbf{x}_2'\cdot \mathbf{x}_1'}{m-1} & \frac{\mathbf{x}_2'\cdot \mathbf{x}_2'}{m-1} & \cdots & \frac{\mathbf{x}_2'\cdot \mathbf{x}_n'}{m-1}\\
                                \vdots & \vdots & \ddots & \vdots\\
                                \frac{\mathbf{x}_n'\cdot \mathbf{x}_1'}{m-1} & \frac{\mathbf{x}_n'\cdot \mathbf{x}_2'}{m-1} & \cdots & \frac{\mathbf{x}_n'\cdot \mathbf{x}_n'}{m-1}
                            \end{bmatrix}\\               
                        & = \begin{bmatrix}
                                \text{cov}(\mathbf{x}_1, \mathbf{x}_1) & \text{cov}(\mathbf{x}_1, \mathbf{x}_2) & \cdots & \text{cov}(\mathbf{x}_1, \mathbf{x}_n)\\
                                \text{cov}(\mathbf{x}_2, \mathbf{x}_1) & \text{cov}(\mathbf{x}_2, \mathbf{x}_2) & \cdots & \text{cov}(\mathbf{x}_2, \mathbf{x}_n)\\
                                \vdots & \vdots & \ddots & \vdots\\
                                \text{cov}(\mathbf{x}_n, \mathbf{x}_1) & \text{cov}(\mathbf{x}_n, \mathbf{x}_2) & \cdots & \text{cov}(\mathbf{x}_n, \mathbf{x}_n)
                            \end{bmatrix}\quad \in \mathbb{R}^{n\times n}.
\end{align*}
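
The matrix formula $\frac{1}{m-1}(\mathbf{X}')^T\mathbf{X}'$ translates almost verbatim into NumPy. The sketch below uses randomly generated data (an assumption made purely for illustration) and checks the result against `np.cov`, which expects `rowvar=False` when the features are stored as columns.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))      # design matrix: rows are samples, columns are features

X_c = X - X.mean(axis=0)         # de-mean each column to get X'
Sigma = X_c.T @ X_c / (m - 1)    # (X')^T X' / (m - 1)

# np.cov treats rows as variables by default, so pass rowvar=False
Sigma_ref = np.cov(X, rowvar=False)
print(np.allclose(Sigma, Sigma_ref))
```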

Here are some basic properties of the covariance matrix $\Sigma = \Sigma_{\mathbf{X}}$:

- The covariance matrix is symmetric, i.e., $\Sigma = \Sigma^T$.
- The diagonal entries of the covariance matrix are the variances of the features, i.e., $\text{cov}(\mathbf{x}_i, \mathbf{x}_i) = \text{var}(\mathbf{x}_i) = \frac{||\mathbf{x}_i'||^2}{m-1}$.
- Since the covariance matrix is the Gram matrix of the de-meaned vectors $\mathbf{x}_1',\dotsc,\mathbf{x}_n'$, it satisfies two properties:
    - It is invertible if and only if the de-meaned features are linearly independent.
    - It is *positive semi-definite*, which means that for any vector $\mathbf{v} = \begin{bmatrix} \; v_1 & \dotsb & v_n \; \end{bmatrix}^T \in \mathbb{R}^n$, we have:
    \begin{align*}
        \mathbf{v}^T\Sigma_{\mathbf{X}}\mathbf{v} & = \frac{1}{m-1}\mathbf{v}^T(\mathbf{X}')^T\mathbf{X}'\mathbf{v}\\
                                                    & = \frac{1}{m-1} (\mathbf{X}'\mathbf{v})^T(\mathbf{X}'\mathbf{v})\\
                                                    & = \frac{1}{m-1}||\mathbf{X}'\mathbf{v}||^2 \\
                                                    & = \frac{1}{m-1} \left|\left| \sum_{i=1}^n v_i \mathbf{x}_i' \right|\right|^2 \\
                                                    & \geq 0.
    \end{align*}
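
These properties are easy to verify numerically. The following is a sketch on synthetic data (the random features and the deliberately redundant extra column are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 40, 4
X = rng.normal(size=(m, n))
X_c = X - X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

# Symmetry: Sigma equals its transpose
is_symmetric = np.allclose(Sigma, Sigma.T)

# Diagonal entries equal the sample variances (ddof=1 gives the m - 1 denominator)
diag_is_var = np.allclose(np.diag(Sigma), X.var(axis=0, ddof=1))

# Positive semi-definiteness: v^T Sigma v equals ||X'v||^2 / (m - 1), which is >= 0
v = rng.normal(size=n)
quad_form = v @ Sigma @ v
norm_form = np.linalg.norm(X_c @ v) ** 2 / (m - 1)

# Adding a feature that is a linear combination of others makes Sigma singular:
# its smallest eigenvalue drops to (numerically) zero
X_dep = np.column_stack([X, X[:, 0] + X[:, 1]])
Sigma_dep = np.cov(X_dep, rowvar=False)
smallest_eig = np.linalg.eigvalsh(Sigma)[0]
smallest_eig_dep = np.linalg.eigvalsh(Sigma_dep)[0]
print(is_symmetric, diag_is_var, smallest_eig, smallest_eig_dep)
```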

We discuss some more interesting properties of the covariance matrix as preparation for our future discussion on PCA.

### Principal components
Since the covariance matrix is symmetric, the classical **Spectral Theorem** implies that it can be diagonalized, and moreover, in a very nice way. Let me try to summarize the key points:

1. **Orthonormal eigenvectors.** 

    There exist vectors $\mathbf{u}_1,\dotsc,\mathbf{u}_n \in \mathbb{R}^n$ that are orthonormal (i.e. $\mathbf{u}_i\cdot \mathbf{u}_j = 0$ if $i \neq j$ and $1$ if $i = j$), which are **eigenvectors** of $\Sigma$. This means that there exist scalars $\lambda_1,\dotsc,\lambda_n \in \mathbb{R}$ such that

    \begin{equation*}
        \Sigma \mathbf{u}_i = \lambda_i \mathbf{u}_i, \quad \text{for } i = 1,\dotsc,n.
    \end{equation*}
    Orthonormality here means that the collection $\mathbf{u}_1,\dotsc,\mathbf{u}_n$ looks and smells like the standard basis vectors of $\mathbb{R}^n$ (which you would have seen as $\hat{i},\hat{j}, \hat{k}$ in 3D space in vector calc).
2. **The orthogonal matrix $\mathbf{U}$.** 
    We can take the orthonormal list of eigenvectors of $\Sigma$ and package them into a matrix $$\mathbf{U} = \begin{bmatrix} \; \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_n \; \end{bmatrix}.$$  This is an orthogonal matrix (by HW 1), which means that $\mathbf{U}^T = \mathbf{U}^{-1}$.
3. **Representation with respect to $\mathbf{u}_i$'s.** 
    It is a fact that $\mathbf{u}_1,\dotsc,\mathbf{u}_n$ form an orthonormal basis for $\mathbb{R}^n$. This means that any vector $\mathbf{v} \in \mathbb{R}^n$ can be *uniquely* written as a linear combination of the $\mathbf{u}_i$'s:

    \begin{equation*}
        \mathbf{v} = \sum_{i=1}^n (\mathbf{v}\cdot \mathbf{u}_i) \mathbf{u}_i.
    \end{equation*}
    NOTE: the $\mathbf{u}_i$'s form a system of orthogonal directions, i.e. a coordinate system. The above identity is saying that, if we were to use these directions as the axes for a new coordinate system on $\mathbb{R}^n$, then the coordinates of $\mathbf{v}$ would be $(\mathbf{v}\cdot \mathbf{u}_1,\dotsc,\mathbf{v}\cdot \mathbf{u}_n)$! 
    The above equation can be derived as follows (for any $\mathbf{v} \in \mathbb{R}^n$):

    \begin{align*}
        \mathbf{v} & = I\cdot \mathbf{v}\\
                    & = \mathbf{U} \mathbf{U}^T \mathbf{v}\\
                    & = \begin{bmatrix} \; \mathbf{u}_1 & \dotsb & \mathbf{u}_n \; \end{bmatrix} \cdot \begin{bmatrix} \mathbf{u}_1^T \\ \vdots \\ \mathbf{u}_n^T \end{bmatrix} \cdot \mathbf{v}\\
                    & = \begin{bmatrix} \; \mathbf{u}_1 & \dotsb & \mathbf{u}_n \; \end{bmatrix} \cdot \begin{bmatrix} \mathbf{u}_1\cdot \mathbf{v} \\ \vdots \\ \mathbf{u}_n\cdot \mathbf{v} \end{bmatrix}\\
                    & = \sum_{i=1}^n (\mathbf{v}\cdot \mathbf{u}_i) \mathbf{u}_i.
    \end{align*}
4. **Diagonalization.** The covariance matrix can be diagonalized via the orthogonal matrix $\mathbf{U}$ as follows:
\begin{equation}\tag{1}
    \Sigma = \mathbf{U} \cdot \text{diag}(\lambda_1,\dotsc,\lambda_n) \cdot \mathbf{U}^T,
\end{equation}
where $\text{diag}(\lambda_1,\dotsc,\lambda_n)$ is the diagonal matrix with the eigenvalues of $\Sigma$ on the diagonal. Note: the eigenvalues are non-negative because of positive semi-definiteness, and if the vectors $\mathbf{x}_1',\dotsc,\mathbf{x}_n'$ are linearly independent (which will virtually always be the case in practice), then the eigenvalues are all strictly positive.

    When we apply the matrix in the RHS of (1) to a vector we are doing the following: 

    - Multiplying by $\mathbf{U}^T$ sends the eigenvectors of $\Sigma$ (normalized to be unit vectors) to the standard basis vectors. Moreover, because $\mathbf{U}$ is orthogonal, it preserves lengths and angles, so it acts as a **rotation**, possibly combined with a reflection (which reverses orientation).
    - Multiplying by $\text{diag}(\lambda_1,\dotsc,\lambda_n)$ scales the standard basis vectors by the eigenvalues.
    - Multiplying by $\mathbf{U}$ then rotates the scaled standard basis vectors back to the eigenvectors of $\Sigma$.
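
The key points above can be checked numerically. The sketch below uses `np.linalg.eigh`, which is designed for symmetric matrices and returns the eigenvalues in ascending order together with orthonormal eigenvectors as columns. The data and the mixing matrix are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 3
# Synthetic correlated features: random data times an arbitrary mixing matrix
X = rng.normal(size=(m, n)) @ np.array([[2.0, 0.5, 0.0],
                                         [0.0, 1.0, 0.3],
                                         [0.0, 0.0, 0.5]])
Sigma = np.cov(X, rowvar=False)

eigvals, U = np.linalg.eigh(Sigma)

# Columns of U are orthonormal: U^T U = I
orthonormal = np.allclose(U.T @ U, np.eye(n))

# Each column satisfies Sigma u_i = lambda_i u_i
# (U * eigvals scales column i of U by eigvals[i])
eig_eq = np.allclose(Sigma @ U, U * eigvals)

# Diagonalization (1): Sigma = U diag(lambda) U^T
reconstructed = U @ np.diag(eigvals) @ U.T

# Expansion of an arbitrary v in the eigenbasis: v = sum_i (v . u_i) u_i
v = rng.normal(size=n)
coords = U.T @ v       # the coordinates (v . u_1, ..., v . u_n)
v_rebuilt = U @ coords
print(orthonormal, eig_eq, np.allclose(reconstructed, Sigma), np.allclose(v_rebuilt, v))
```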

The eigenvectors $\mathbf{u}_1,\dotsc, \mathbf{u}_n$ of the covariance matrix (i.e. the columns of $\mathbf{U}$) are the so-called **principal components** of the de-meaned design matrix $\mathbf{X}'$: they are the directions in which the data varies the most. The eigenvalues $\lambda_1,\dotsc,\lambda_n$ are the variances of the data in the corresponding principal component directions. **Principal component analysis** (PCA), applied when there are many features ($n$ is large), is a technique used to reduce the dimensionality of the data by projecting it onto the first few principal components (i.e. the eigenvectors corresponding to the largest eigenvalues). We will touch on this at a later date.
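
To make the claim about variances concrete, here is a small sketch on synthetic data (the feature scales and the choice $k = 2$ are assumptions for illustration). It is a bare-bones eigendecomposition, not a full PCA implementation: it checks that the sample variance of the data along each $\mathbf{u}_i$ equals $\lambda_i$, and then projects onto the first $k$ directions.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 200, 4
X = rng.normal(size=(m, n)) * np.array([3.0, 2.0, 1.0, 0.5])  # features with different spreads
X_c = X - X.mean(axis=0)
Sigma = X_c.T @ X_c / (m - 1)

eigvals, U = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]   # sort from largest to smallest eigenvalue
eigvals, U = eigvals[order], U[:, order]

# Coordinates of each sample with respect to u_1, ..., u_n
scores = X_c @ U
# The sample variance along direction u_i matches lambda_i
score_vars = scores.var(axis=0, ddof=1)

# Keep only the first k directions (dimensionality reduction)
k = 2
X_reduced = X_c @ U[:, :k]          # shape (m, k)
explained = eigvals[:k].sum() / eigvals.sum()
print(score_vars, eigvals, explained)
```

In the new coordinates the features are also uncorrelated: the covariance matrix of `scores` is $\text{diag}(\lambda_1,\dotsc,\lambda_n)$.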

### Standardizing features
Given two vectors of features $\mathbf{x}$ and $\mathbf{y}$, the sample covariance $\text{cov}(\mathbf{x},\mathbf{y})$ is a measure of how the two features simultaneously vary about their respective means (*for the particular sample under consideration*):
- If the covariance is positive, then on average, when one feature varies above its mean, so does the other.
- If the covariance is negative, then on average, when one feature varies above its mean, the other varies below its mean.
- If the covariance is zero (or close to zero), then the two features are **uncorrelated**, which means that there is no discernible *linear* relationship between how they vary about their respective means. That is, when one feature varies above its mean, the other does not have a particular tendency to vary above or below its mean.

This is all well and good, yet the covariance is not a very interpretable measure of the relationship between two features. This is because the covariance is not standardized: it depends on the scales of the features. For example, if we were to measure the height of a person in meters, and their weight in kilograms, then the covariance between height and weight would be different than if we measured height in centimeters and weight in grams.
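
The unit dependence is easy to see in a small experiment. The heights and weights below are synthetic numbers generated only for illustration; the point is that changing units from (m, kg) to (cm, g) multiplies the covariance by $100 \times 1000$ even though the underlying relationship is unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic data: heights in meters, weights in kilograms
height_m = rng.normal(1.70, 0.10, size=100)
weight_kg = 60 + 50 * (height_m - 1.70) + rng.normal(0, 5, size=100)

cov_m_kg = np.cov(height_m, weight_kg)[0, 1]

# Same data, different units: centimeters and grams
height_cm = 100 * height_m
weight_g = 1000 * weight_kg
cov_cm_g = np.cov(height_cm, weight_g)[0, 1]

ratio = cov_cm_g / cov_m_kg
print(cov_m_kg, cov_cm_g, ratio)
```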

To fix this issue, we follow the principle from physics: *If you divide two quantities measured with the same scale, then the resulting quantity is a unit-less (or scale-less) quantity; that is, it is a pure and simple ratio.*

So, to get rid of the scale issue, we want to *normalize* our vectors $\mathbf{x}$ and $\mathbf{y}$ by some scalars $S_{\mathbf{x}}$ and $S_{\mathbf{y}}$ having the same units as the vectors. A natural choice is to take these to be the sample standard deviations. Thus, we define the *standardized* feature vectors, which have mean zero and unit sample standard deviation: