## 4.1 Principal component analysis

In the previous discussion we mentioned that, for certain datasets, it can be useful either to (i) discard low-variance, uninformative features of the data to simplify the problem and the model, or to (ii) find the directions of highest variance to determine the most descriptive features of the data. These two goals point to two corresponding approaches to dimensionality reduction, which turn out to be equivalent. The first amounts to a rotation of the basis followed by discarding some dimensions, chosen such that the sum-of-squares error with respect to the original data is minimised $-$ this is the formulation of reconstruction error minimisation. In the second approach, we discard the directions along which the data variance is low, retaining only directions of high variance $-$ this is the formulation of variance maximisation. When the reconstruction error is a sum of squares, the minimum error and maximum variance approaches are equivalent and are really the same method, called **principal component analysis** or PCA.

Starting with error minimisation, suppose $\{\mathbf{u}_d\}^D_{d = 1}$ is an *orthonormal* basis, with which we can express any datapoint as a linear combination

\begin{align}
\mathbf{x}_n &= \sum^D_{d = 1} a_{dn} \mathbf{u}_d\\
\end{align}

Separating this into the first \\(M\\) and remaining \\((D - M)\\) components

\begin{align}
\mathbf{x}_n =  \sum^M_{d = 1} a_{dn} \mathbf{u}_d + \sum^D_{d = M + 1} a_{dn} \mathbf{u}_d\\
\end{align}

and discarding the second sum, which involves the remaining \\((D - M)\\) components, we obtain an approximate representation \\(\mathbf{x}_n^\star\\) of \\(\mathbf{x}_n\\)

\begin{align}
\mathbf{x}_n^\star =  \sum^M_{d = 1} a_{dn} \mathbf{u}_d\\
\end{align}
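
As a minimal sketch of the expansion and its truncation, the snippet below builds an arbitrary orthonormal basis and reconstructs a toy dataset from its first $M$ coefficients. The data, the sizes `N`, `D`, `M` and the use of a QR factorisation to obtain a basis are all assumptions made for illustration; they are not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N points in D dimensions (hypothetical, for illustration only).
N, D, M = 200, 5, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))

# Any orthonormal basis will do at this stage; the QR factorisation of a
# random matrix provides one. Column d of U plays the role of u_d.
U, _ = np.linalg.qr(rng.normal(size=(D, D)))

# Coefficients a_dn = u_d^T x_n, stored as A[n, d].
A = X @ U

# Using all D components recovers every datapoint exactly.
X_full = A @ U.T

# Keeping only the first M components gives the approximation x_n^*.
X_star = A[:, :M] @ U[:, :M].T

print("max error with all D components:", np.abs(X - X_full).max())
print("max error with first M components:", np.abs(X - X_star).max())
```

The discarded part `X - X_star` lies entirely in the span of the last $(D - M)$ basis vectors, so it has no component along the retained ones.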

The mean squared reconstruction error is then defined as:

\begin{align}
E_{rms} &= \frac{1}{N}\sum^N_{n = 1} \big|\mathbf{x}_n - \mathbf{x}_n^\star \big|^2\\
\end{align}

and after some manipulation we can write $-$ you can try this as an exercise:

\begin{align}
E_{rms} &= \sum^D_{d = M + 1} \mathbf{u}_d^\top \mathbf{S} \mathbf{u}_d\\
~\\
\text{ where } \mathbf{S} &= \frac{1}{N}\sum^N_{n = 1}(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top\\\\
\end{align}
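
The identity above can be checked numerically. The sketch below assumes the data have been centred, so that the coefficients are $a_{dn} = \mathbf{u}_d^\top (\mathbf{x}_n - \bar{\mathbf{x}})$ as in the detailed derivation; the toy data and the random orthonormal basis are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (hypothetical), centred so that the sample mean is zero.
N, D, M = 500, 4, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc = X - X.mean(axis=0)

# An arbitrary orthonormal basis; column d of U plays the role of u_d.
U, _ = np.linalg.qr(rng.normal(size=(D, D)))

# Direct evaluation: mean squared distance to the truncated reconstruction.
A = Xc @ U
X_star = A[:, :M] @ U[:, :M].T
E_direct = np.mean(np.sum((Xc - X_star) ** 2, axis=1))

# Quadratic-form evaluation: sum of u_d^T S u_d over discarded directions.
S = Xc.T @ Xc / N
E_quadratic = sum(U[:, d] @ S @ U[:, d] for d in range(M, D))

print(E_direct, E_quadratic)
```

The two numbers should agree up to floating-point error for any orthonormal basis, not only the optimal one.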

<details>
<summary>Reconstruction error in detail</summary>
<div>
    
\begin{align}
E_{rms} &= \frac{1}{N}\sum^N_{n = 1} \big|\mathbf{x}_n - \mathbf{x}_n^\star \big|^2\\
~\\
&= \frac{1}{N}\sum^N_{n = 1} \bigg[\sum^D_{d = M + 1} a_{dn} \mathbf{u}_d\bigg]^2\\
~\\
&=  \frac{1}{N}\sum^N_{n = 1}\bigg[\sum^D_{d = M + 1} a_{dn} \mathbf{u}_d\bigg]^\top \bigg[\sum^D_{d = M + 1} a_{dn} \mathbf{u}_d\bigg]\\
~\\
&=  \frac{1}{N}\sum^N_{n = 1}\sum^D_{d = M + 1} a_{dn}^2, \text{ (using the basis orthonormality)}\\
~\\
&=  \frac{1}{N}\sum^N_{n = 1}\sum^D_{d = M + 1} (\mathbf{u}_d^\top (\mathbf{x}_n - \bar{\mathbf{x}}))((\mathbf{x}_n - \bar{\mathbf{x}})^\top\mathbf{u}_d), \text{ (using } a_{dn} = \mathbf{u}_d^\top (\mathbf{x}_n - \bar{\mathbf{x}}))\\
~\\
&=  \sum^D_{d = M + 1} \mathbf{u}_d^\top \mathbf{S} \mathbf{u}_d, \text{ (using the definition of } \mathbf{S})\\
~\\
\end{align}


</div>
</details>

Now we seek to minimise $E_{rms}$ with respect to $\mathbf{u}_d$. However, doing so directly would give the vacuous solution $\mathbf{u}_d = 0$, because we haven't constrained the magnitude of the basis vectors, which are free to collapse. Requiring $||\mathbf{u}_d|| = 1$ and using a Lagrange multiplier $\lambda_d$, we minimise

\begin{align}
E = E_{rms} - \lambda_d(\mathbf{u}_d^\top \mathbf{u}_d - 1) &= \bigg[\sum^D_{d = M + 1} \mathbf{u}_d^\top \mathbf{S} \mathbf{u}_d \bigg] - \lambda_d(\mathbf{u}_d^\top \mathbf{u}_d - 1)\\
\end{align}

with respect to $\mathbf{u}_d$ to obtain the result $-$ again you can try this as an exercise:

\begin{align}
\boxed{\mathbf{S} \mathbf{u}_d = \lambda_d\mathbf{u}_d}\\
\end{align}
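
The boxed condition can be explored numerically. The sketch below uses `np.linalg.eigh`, which handles symmetric matrices and returns eigenvalues in ascending order with eigenvectors as columns. It then compares the error $\sum_{d > M} \mathbf{u}_d^\top \mathbf{S} \mathbf{u}_d$ for the eigenvector basis against an arbitrary orthonormal basis. The toy data and the random comparison basis are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data (hypothetical) and its covariance matrix S.
N, D, M = 500, 4, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / N

# Solve the eigenproblem S u = lambda u for the symmetric matrix S.
eigvals, eigvecs = np.linalg.eigh(S)

# Reorder so that the largest eigenvalue comes first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Residual of S u_d - lambda_d u_d, one column per eigenvector.
residual = S @ eigvecs - eigvecs * eigvals

# Error from discarding the last (D - M) eigenvectors.
E_pca = sum(eigvecs[:, d] @ S @ eigvecs[:, d] for d in range(M, D))

# Same quantity for an arbitrary orthonormal basis, for comparison.
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
E_random = sum(Q[:, d] @ S @ Q[:, d] for d in range(M, D))

print("max residual:", np.abs(residual).max())
print("error, eigenvector basis:", E_pca)
print("error, random basis:", E_random)
```

The eigenvector basis should never give a larger error than the random one when the discarded directions are those with the smallest eigenvalues.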

<details>
<summary>Extremisation in detail</summary>
<div>
Here is a detailed derivation for the result \\(\mathbf{S} \mathbf{u}_d = \lambda_d\mathbf{u}_d\\), with explicit summations.
    
\begin{align}
\bigg(\frac{\partial E}{\partial \mathbf{u}_d}\bigg)_i &= \frac{\partial }{\partial \mathbf{u}_{d, i}} \Bigg[ \bigg[\sum^D_{n = M + 1} \mathbf{u}_n^\top \mathbf{S} \mathbf{u}_n \bigg] - \lambda_d(\mathbf{u}_d^\top \mathbf{u}_d - 1)\Bigg]\\
~\\\
    &= \frac{\partial }{\partial \mathbf{u}_{d, i}} \Bigg[\sum^D_{n = M + 1} \sum^D_{j = 1}\sum^D_{k = 1} \mathbf{u}_{n, j} \mathbf{S}_{j, k} \mathbf{u}_{n, k} - \lambda_d\bigg[ \sum^D_{j = 1}\mathbf{u}_{d, j} \mathbf{u}_{d, j} - 1\bigg] \Bigg]\\
~\\\
    &= 2 \Bigg[\sum^D_{j = 1}\sum^D_{k = 1} \frac{\partial \mathbf{u}_{d, j}} {\partial \mathbf{u}_{d, i}}\mathbf{S}_{j, k} \mathbf{u}_{d, k} - \lambda_d\sum^D_{j = 1} \frac{\partial \mathbf{u}_{d, j}}{\partial \mathbf{u}_{d, i}}\mathbf{u}_{d, j}\Bigg]\\
~\\\
    &= 2 \Bigg[\sum^D_{j = 1}\sum^D_{k = 1} \delta_{ij} \mathbf{S}_{j, k} \mathbf{u}_{d, k} - \lambda_d\sum^D_{j = 1} \mathbf{u}_{d, j} \delta_{ij} \Bigg]\\
~\\\
    &= 2 \Bigg[\sum^D_{k = 1} \mathbf{S}_{i, k}\mathbf{u}_{d, k} - \lambda_d \mathbf{u}_{d, i}\Bigg]\\
\end{align}

Setting the derivative to $0$:

\\[
\bigg(\frac{\partial E}{\partial \mathbf{u}_d}\bigg)_i = 0\\
~\\
\sum^D_{k = 1} \mathbf{S}_{i, k}\mathbf{u}_{d, k} - \lambda_d \mathbf{u}_{d, i} = 0\\
~\\
~\\
\mathbf{S}\mathbf{u}_{d} - \lambda_d\mathbf{u}_{d} = 0\\
\\]

Arriving at the result:
\\[
\boxed{\mathbf{S}\mathbf{u}_{d} = \lambda_d\mathbf{u}_{d}}\\
\\]

</div>
</details>

Determining $\mathbf{u}_d$ has therefore turned into an eigenproblem. The $\mathbf{u}_d$'s which minimise the reconstruction loss are eigenvectors of $\mathbf{S}$. In addition, each of the corresponding eigenvalues $\lambda_d$ is equal to the reconstruction loss due to discarding $\mathbf{u}_d$:

\begin{align}
\lambda_d &= \mathbf{u}_d^\top\mathbf{S}\mathbf{u}_d =  \frac{1}{N}\sum^N_{n = 1} \mathbf{u}_d^\top(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top \mathbf{u}_d\\
~\\
\implies \sum_{d = M +1}^D \lambda_d &= E_{rms} =  \frac{1}{N}\sum_{d = M +1}^D\sum^N_{n = 1} \mathbf{u}_d^\top(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top \mathbf{u}_d\\
\end{align}

which is a pleasing result. We can implement PCA straightforwardly by solving the eigenproblem $\mathbf{S} \mathbf{u}_d = \lambda_d\mathbf{u}_d$, and retaining the directions $\mathbf{u}_d$ with the highest eigenvalues $-$ discarding the directions with low eigenvalues incurs a low reconstruction loss. Before that, however, we will show the equivalence between reconstruction loss minimisation and variance maximisation. The latter amounts to selecting $M$ orthogonal directions such that the variance of the dataset in these directions is maximal:

\begin{align}
\text{Var}_{1:M}(\{\mathbf{x}\}) &=  \frac{1}{N}\sum^N_{n = 1}\bigg[\sum^M_{d = 1} a_{dn} \mathbf{u}_d \bigg]^2\\
~\\
&=  \frac{1}{N} \sum^N_{n = 1}\bigg[\sum^M_{d = 1} a_{dn} \mathbf{u}_d \bigg]^\top \bigg[\sum^M_{d = 1} a_{dn} \mathbf{u}_d \bigg]\\
\end{align}

The total variance of the dataset, $\text{Var}_{1:D}(\{\mathbf{x}\})$, can be expressed as

\begin{align}
\text{Var}_{1:D}(\{\mathbf{x}\}) &=  \frac{1}{N}\sum^N_{n = 1}\bigg[\sum^D_{d = 1} a_{dn} \mathbf{u}_d \bigg]^2\\
~\\
&=  \frac{1}{N}\sum^N_{n = 1}\Bigg[\bigg[\sum^M_{d = 1} a_{dn} \mathbf{u}_d \bigg]^2 + \bigg[\sum^D_{d = M + 1} a_{dn} \mathbf{u}_d \bigg]^2\Bigg]\\
~\\
&= \text{Var}_{1:M}(\{\mathbf{x}\}) + \text{Var}_{M:D}(\{\mathbf{x}\})\\
\end{align}

where we have used the orthogonality of the basis vectors $\mathbf{u}_d$. We can read off that the second term $\text{Var}_{M:D}(\{\mathbf{x}\})$ is equal to the mean squared reconstruction loss $E_{rms}$ found earlier:

\\[
\text{Var}_{M:D}(\{\mathbf{x}\}) = E_{rms}
\\]

Considering that $\text{Var}_{1:D}(\{\mathbf{x}\})$ is constant and independent of the choice of basis, we see that maximising the variance $\text{Var}_{1:M}(\{\mathbf{x}\})$ is equivalent to minimising the reconstruction loss $\text{Var}_{M:D}(\{\mathbf{x}\})$:

\begin{align}
\boxed{\text{Reconstruction loss minimisation}\Longleftrightarrow
\text{Variance maximisation}}
\end{align}
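
The variance decomposition can be illustrated with a short sketch. It assumes centred toy data and a hypothetical helper `split_variance`, neither of which comes from the text. The retained and discarded variances should sum to the same total, $\text{tr}(\mathbf{S})$, for any orthonormal basis, while the eigenvector basis should retain the most.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data (hypothetical) and its covariance matrix S.
N, D, M = 500, 4, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / N


def split_variance(U, M):
    """Return (retained, discarded) variance for the orthonormal columns of U."""
    # per_direction[d] = u_d^T S u_d, the data variance along u_d.
    per_direction = np.einsum("id,ij,jd->d", U, S, U)
    return per_direction[:M].sum(), per_direction[M:].sum()


# Eigenvector basis, ordered by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)
U_pca = eigvecs[:, np.argsort(eigvals)[::-1]]

# An arbitrary orthonormal basis for comparison.
U_rand, _ = np.linalg.qr(rng.normal(size=(D, D)))

kept_pca, lost_pca = split_variance(U_pca, M)
kept_rand, lost_rand = split_variance(U_rand, M)
total = np.trace(S)

print("total variance:", total)
print("eigenvector basis: kept", kept_pca, "lost", lost_pca)
print("random basis:      kept", kept_rand, "lost", lost_rand)
```

Because the total is fixed, whichever basis retains more variance necessarily loses less to reconstruction error.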

```python
%config InlineBackend.figure_format = 'svg'
import numpy as np
import matplotlib.pyplot as plt
from helper_functions import *
set_notebook_preferences()
```