# SVD and PCA

# The Singular Value Decomposition (SVD)
TODO

# Principal Component Analysis (PCA)
PCA is closely related to SVD, and can be thought of as providing a statistical interpretation of SVD. To begin, consider the $N \times D$ design matrix $\mathbf{X}$. Note that this follows the standard statistical convention of arranging observations $\mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^D$ in the rows of the matrix, rather than columns as is the case with the "snapshots" arrangement common in the engineering sciences. 

The general problem that PCA answers is to find the best affine approximation to $\mathbf{X}$, where "best" is defined in the squared error sense. In other words, we seek an $R$-dimensional affine subspace (a shifted linear subspace) such that the projection of the observed data onto this subspace has minimal $L_2$ error with respect to the original data. Of course, for this to be useful we choose $R < D$. In particular, we seek an orthonormal basis for this optimal subspace. Concretely, we seek orthonormal basis vectors $\mathbf{b}_1, \dots, \mathbf{b}_R \in \mathbb{R}^D$, associated weights $w_1, \dots, w_R \in \mathbb{R}$, and an intercept $\mathbf{w}_0 \in \mathbb{R}^D$ such that 

$$ \mathbf{x}_n \approx \mathbf{w}_0 + \sum_{r = 1}^{R} w_r \mathbf{b}_r$$

for all $n = 1, \dots, N$. Collecting the weights in a vector $\mathbf{w} \in \mathbb{R}^R$ and the basis vectors as the columns of a $D \times R$ matrix $\mathbf{B}$, we can write the approximation in matrix form as 

$$ \mathbf{x}_n \approx \mathbf{w}_0 + \mathbf{B}\mathbf{w}$$

To be clear, we seek a fixed basis $\mathbf{B}$ and intercept $\mathbf{w}_0$ that are optimized for the specific dataset $\mathbf{X}$, but the weights $w_1, \dots, w_R$ will of course be different for each $\mathbf{x}_n$; we write $\mathbf{w}_n \in \mathbb{R}^R$ for the weight vector of the $n^{\text{th}}$ observation. In its full generality, the optimization problem of interest is

$$
\min_{\mathbf{B}, \mathbf{w}_0, \{\mathbf{w}_n\}_{n = 1}^{N}} \sum_{n = 1}^{N} ||\mathbf{x}_n - (\mathbf{w}_0 + \mathbf{B}\mathbf{w}_n)||_2^2
$$

where the matrix $\mathbf{B}$ is constrained to have $R$ orthonormal columns (meaning that $\mathbf{B}^T \mathbf{B} = \mathbf{I}_R$). There are really two different questions here: 1.) finding the optimal basis, and 2.) given the optimal basis, finding the optimal way to represent the data with respect to that basis. The answer to the second question may seem obvious to those familiar with the properties of orthogonal projections. Indeed, given a fixed basis $\mathbf{B}$, it is not difficult to show that the optimal values of the other parameters are 

$$
\begin{align*}
&\hat{\mathbf{w}}_0 = \overline{\mathbf{x}} &&\hat{\mathbf{w}}_n = \mathbf{B}^T (\mathbf{x}_n - \overline{\mathbf{x}})
\end{align*}
$$

We see that these equations essentially tell us that the data should first be centered by the sample mean

$$ \overline{\mathbf{x}} := \frac{1}{N} \sum_{n = 1}^{N} \mathbf{x}_n $$
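As a minimal sketch of this centering step, the NumPy snippet below computes the sample mean and subtracts it from every observation. The data are synthetic and the names `X`, `x_bar`, and `X_centered` are illustrative assumptions, not part of the derivation. Since observations sit in the rows of $\mathbf{X}$, the mean is taken over `axis=0`.

```python
import numpy as np

# Synthetic data (an assumption for illustration): N observations in D dimensions,
# shifted so that the sample mean is clearly nonzero.
rng = np.random.default_rng(0)
N, D = 200, 5
X = rng.normal(size=(N, D)) + np.array([1.0, -2.0, 0.5, 3.0, 0.0])

# Sample mean over observations (rows), giving a vector of shape (D,).
x_bar = X.mean(axis=0)

# Broadcasting subtracts x_bar from every row of X.
X_centered = X - x_bar

# After centering, every feature (column) has mean zero up to rounding error.
column_means = X_centered.mean(axis=0)
```

The check on `column_means` is just a sanity test that the mean was removed along the correct axis.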

Thus, it makes things much easier to subtract $\overline{\mathbf{x}}$ from each row of $\mathbf{X}$ (that is, from each observation) before proceeding, resulting in a centered data matrix. I henceforth assume that $\mathbf{X}$ has been centered in this manner. Given this simplification, the above optima tell us that the weights used to approximate the $n^{\text{th}}$ data point are given by 

$$ \hat{\mathbf{w}}_n = \mathbf{B}^T \mathbf{x}_n = \begin{pmatrix} \langle \mathbf{b}_1, \mathbf{x}_n \rangle \\ \vdots \\ \langle \mathbf{b}_R, \mathbf{x}_n \rangle  \end{pmatrix}
$$

and therefore the approximation to this data point is

$$
\hat{\mathbf{x}}_n = \mathbf{B}\hat{\mathbf{w}}_n = \mathbf{B}\mathbf{B}^T \mathbf{x}_n = \sum_{r = 1}^{R} \langle \mathbf{x}_n, \mathbf{b}_r \rangle \mathbf{b}_r
$$
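The formulas for the weights and the approximation can be sketched in NumPy as follows. This is an illustration under stated assumptions: the data are synthetic, and `B` is an arbitrary orthonormal basis obtained from a QR decomposition, not the optimal PCA basis. Because observations are stored as rows, the per-point formula $\mathbf{B}^T \mathbf{x}_n$ becomes `X @ B` for the whole dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, R = 100, 6, 2

# Synthetic, centered data with observations in the rows (illustrative assumption).
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)

# An arbitrary D x R matrix with orthonormal columns, taken from a reduced QR
# decomposition. This is NOT the optimal PCA basis, just some fixed basis.
B, _ = np.linalg.qr(rng.normal(size=(D, R)))

# Row n of W is the weight vector B^T x_n (written as a row).
W = X @ B

# Row n of X_hat is the approximation B B^T x_n (written as a row).
X_hat = W @ B.T

# The residuals are orthogonal to the subspace spanned by the columns of B.
residual = X - X_hat

# For this fixed B, any other choice of weights gives a larger squared error.
err_opt = np.sum(residual ** 2)
W_perturbed = W + 0.1 * rng.normal(size=W.shape)
err_perturbed = np.sum((X - W_perturbed @ B.T) ** 2)
```

Comparing `err_opt` with `err_perturbed` is a numerical spot check, not a proof, that the inner-product weights minimize the squared error for the given basis.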

This is why I mentioned that the solution to the optimization problem with $\mathbf{B}$ fixed is unsurprising; all it says is that the data points are projected onto the subspace spanned by the basis vectors! The associated weights on the basis vectors are simply the inner products as above, due to the fact that the basis vectors are orthonormal. Another way to see this is that $\mathbf{P} := \mathbf{B}\mathbf{B}^T$ is an orthogonal projection matrix. Indeed, as a product of the form $\mathbf{B}\mathbf{B}^T$ it is symmetric and positive semidefinite (it cannot be positive definite, since its rank is $R < D$). It is also idempotent, which follows from the orthonormality of the basis vectors: 

$$ \mathbf{P}^2 = \mathbf{P}\mathbf{P} = \mathbf{B}\mathbf{B}^T\mathbf{B}\mathbf{B}^T = \mathbf{B}\mathbf{I}_R\mathbf{B}^T = \mathbf{B}\mathbf{B}^T = \mathbf{P}$$
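The properties of $\mathbf{P}$ can be checked numerically with a short sketch. As an assumption for illustration, `B` is again an arbitrary matrix with orthonormal columns built via QR. The eigenvalues of an orthogonal projection are all either 0 or 1, and its trace equals the dimension of the subspace it projects onto.

```python
import numpy as np

rng = np.random.default_rng(2)
D, R = 6, 2

# Arbitrary D x R matrix with orthonormal columns (illustrative assumption).
B, _ = np.linalg.qr(rng.normal(size=(D, R)))

# Projection onto the column space of B.
P = B @ B.T

is_symmetric = np.allclose(P, P.T)
is_idempotent = np.allclose(P @ P, P)

# eigvalsh assumes a symmetric matrix and returns eigenvalues in ascending order:
# D - R of them are (numerically) zero and R of them are one.
eigenvalues = np.linalg.eigvalsh(P)

# The trace of a projection matrix equals the dimension of its range.
trace = np.trace(P)
```

The zero eigenvalues are what prevent $\mathbf{P}$ from being invertible whenever $R < D$.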