Principal component analysis (PCA) is a standard way to reduce the dimension $p$ (which can be quite large) of an $n\times p$ data matrix **X** to something more manageable.

PCA tries to find “components” that capture the maximal variance within the data. For three-dimensional data, this is the basic image you may have come across:

<img src=img/pca_classic.png width="40%" height="40%">

**Classic view of PCA**. Each blue point corresponds to an observation (a row of **X**). There are $n=20$ observations, each with $p=3$ features. In this schematic, PCA reduces the dimensionality from three to $r=2$. In particular, it finds a pair of orthogonal vectors (red arrows) that define a lower-dimensional space (grey plane) which captures as much variance as possible from the original dataset.

Now let’s express the above picture mathematically. Assume that each column of **X** has been mean-subtracted so that the datapoints are centered around the origin. Then finding the direction of maximal variance (i.e. the first principal component) corresponds to solving the following optimization problem:

$$
\max_c \; c^T\textbf{X}^T\textbf{X}c \quad \text{subject to} \quad c^Tc = 1
$$

Let $w = \textbf{X}c$, which is the projection of each datapoint onto the top principal component (this is a plain projection because we imposed $c^Tc = 1$). Because the data have been mean-subtracted, the variance of the projected data is proportional to $w^Tw$ (the constant factor depends only on $n$), and $w^Tw$ equals the objective function $c^T\textbf{X}^T\textbf{X}c$. The vector $c$ is the top principal component, and the vector $w$ contains the “loadings” for each observation along this axis.

There are a few ways to solve this optimization problem to determine $c$ and $w$. The classic approach is to compute the eigendecomposition of $\textbf{X}^T\textbf{X}$ (the covariance matrix, up to a scaling factor, with dimensions $p\times p$) and set $c$ to the eigenvector associated with the largest eigenvalue. In practice, this eigendecomposition is usually obtained from the singular value decomposition (SVD) of **X**.
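As a minimal sketch of the two routes just described, the snippet below computes the top principal component both from the eigendecomposition of $\textbf{X}^T\textbf{X}$ and from the SVD of **X**. The data are synthetic and the variable names (`c_eig`, `c_svd`, `objective`) are illustrative assumptions, not part of the original text.

```python
import numpy as np

# Synthetic data (an assumption for illustration): n=20 observations, p=3 features,
# with the columns scaled so that one direction clearly dominates.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3)) @ np.diag([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)  # mean-subtract each column

# Route 1: eigendecomposition of the p x p matrix X^T X.
# np.linalg.eigh returns eigenvalues in ascending order, so the last column is the top one.
evals, evecs = np.linalg.eigh(X.T @ X)
c_eig = evecs[:, -1]

# Route 2: SVD of X. The rows of Vt are the right singular vectors,
# ordered by decreasing singular value.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
c_svd = Vt[0]

# Projection of each observation onto the top component.
w = X @ c_svd

# The objective c^T X^T X c evaluated at the solution.
objective = c_svd @ X.T @ X @ c_svd

print("agreement up to sign:", abs(c_eig @ c_svd))
print("w^T w:", w @ w, "objective:", objective)
```

The two vectors agree only up to sign, since both $c$ and $-c$ solve the optimization problem. The largest eigenvalue of $\textbf{X}^T\textbf{X}$ equals the square of the largest singular value of **X**, which is why the two routes give the same answer.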

It turns out that this approach does not work for tensors, matrices with incomplete data, or many other interesting cases.



Let’s assume that we solve the optimization problem by some method. Then our best approximation of the data is the outer product of c and w:

$$
\textbf{X} \approx wc^T
$$

This is called a rank-one reconstruction of the data because $wc^T$ produces a matrix of rank one. Visually, our reconstruction looks something like this:

<img src=img/rank_one.png width="60%" height="60%">

**Example reconstruction of data with 1 principal component.** An example data matrix (left) with $n=12$ observations and $p=8$ features is approximated by the outer product $wc^T$ (middle), which produces a rank-one matrix (right). Note that $w$ is labeled as loadings and $c^T$ is labeled as component.
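The rank-one reconstruction can be sketched in a few lines of NumPy. This is an illustration on random data with the same shape as the figure ($n=12$, $p=8$); the data and the names `X_hat` and `rel_err` are assumptions made for the example.

```python
import numpy as np

# Random data with the same shape as the figure (an assumption for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 8))
X = X - X.mean(axis=0)  # mean-subtract each column

# Top principal component c from the SVD, and the projections w = X c.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
c = Vt[0]
w = X @ c

# Rank-one reconstruction: the outer product of w and c.
X_hat = np.outer(w, c)

rank = np.linalg.matrix_rank(X_hat)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)

print("shape:", X_hat.shape, "rank:", rank)
print("relative reconstruction error:", rel_err)
```

On unstructured random data like this the relative error stays large, because no single direction dominates. Data with real low-dimensional structure would give a much smaller error.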

Most data can’t be well-described by a single principal component. Typically, we compute multiple principal components by computing all eigenvectors of $\textbf{X}^T\textbf{X}$ and ranking them by their eigenvalues. This can be visualized by a scree plot, which plots the variance explained by each successive principal component. People may have told you to look for the “knee” or inflection point in the scree plot to determine the number of components to keep (the remaining components are assumed to be mostly noise).

<img src=img/scree.png width="60%" height="60%">

**Scree plot**. Principal components are ranked by the amount of variance they capture in the original dataset; a scree plot can provide some sense of how many components are needed.
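The quantities behind a scree plot can be computed directly from the singular values of **X**, since each squared singular value is an eigenvalue of $\textbf{X}^T\textbf{X}$. The sketch below uses synthetic data built from three latent directions plus a small amount of noise; the data-generating choices (`r_true`, the noise level of 0.1) are assumptions for illustration only.

```python
import numpy as np

# Synthetic data (an assumption): three strong latent directions plus small noise.
rng = np.random.default_rng(2)
n, p, r_true = 100, 10, 3
X = rng.normal(size=(n, r_true)) @ rng.normal(size=(r_true, p))
X = X + 0.1 * rng.normal(size=(n, p))
X = X - X.mean(axis=0)  # mean-subtract each column

# Singular values of X, returned in decreasing order.
s = np.linalg.svd(X, compute_uv=False)

# Fraction of total variance captured by each component, and the running total.
explained = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained)

for k, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"component {k}: explained={e:.4f}, cumulative={c:.4f}")
```

Plotting `explained` against the component index gives the scree plot. For data constructed this way, the values should drop sharply after the third component, which is the “knee” one would look for.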

We can organize the top $r$ principal components into a matrix $C = [c_1,c_2,...,c_r]$ and the loading weights into $W = [w_1,w_2,...,w_r]$. Our reconstruction of the data is now a sum of $r$ outer products:

$$
\textbf{X} \approx \sum_{k=1}^r w_kc_k^T \quad \text{or} \quad \textbf{X}\approx WC^T
$$

<img src=img/pca_3.png>

**Example reconstruction of data with 3 principal components.** A data matrix (left) is approximated by the product of an $n\times r$ matrix and an $r\times p$ matrix (i.e. $WC^T$). This product is at most a rank-$r$ matrix (in this example, $r=3$). Each paired column of $W$ and row of $C^T$ form an outer product, so the full reconstruction can also be thought of as a sum of $r$ rank-one matrices.
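To make the equivalence between the matrix product $WC^T$ and the sum of rank-one matrices concrete, here is a small sketch on random data. The data and the names `X_hat` and `X_sum` are illustrative assumptions.

```python
import numpy as np

# Random data (an assumption for illustration): n=12 observations, p=8 features.
rng = np.random.default_rng(3)
X = rng.normal(size=(12, 8))
X = X - X.mean(axis=0)  # mean-subtract each column

r = 3  # number of principal components to keep

# Top r principal components as the columns of C (p x r), taken from the SVD of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
C = Vt[:r].T

# Loadings for each observation along each component (n x r).
W = X @ C

# Reconstruction as a single matrix product ...
X_hat = W @ C.T

# ... and as a sum of r rank-one outer products.
X_sum = sum(np.outer(W[:, k], C[:, k]) for k in range(r))

print("same reconstruction:", np.allclose(X_hat, X_sum))
print("rank of reconstruction:", np.linalg.matrix_rank(X_hat))
```

The columns of `C` are orthonormal, so `W = X @ C` is simply the projection of the data onto each component in turn.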