## Chapter 8 - Dimensionality Reduction

### Unsupervised Learning

In supervised learning, we have access to $p$ features measured on $n$ observations, together with a response $y$ for each observation. The goal is then to predict $y$ using the $p$ features.

In unsupervised learning, we only have a set of features $X_1, \cdots, X_p$ measured on $n$ observations. We are not interested in prediction because we do not have an associated response variable $y$. Rather, the goal is to discover interesting things about the measurements $X_1, \cdots, X_p$. Can we visualise the data? Can we discover subgroups among the variables or the observations?

Unsupervised learning is much more challenging. The analysis tends to be more subjective, and there is no simple goal such as predicting a response. Unsupervised learning is often part of <u>exploratory data analysis</u>. Furthermore, in unsupervised learning it is hard to check our work: because there is no response to compare against, we cannot rely on tools like cross-validation to measure the performance of our technique.

### The Curse of Dimensionality

Many ML problems involve training on a large number of features for each training instance, i.e. $p$ can be very large. This makes training slow and can make it much harder to find a good solution. This problem is called the curse of dimensionality.

Consider the MNIST example. The pixels on the image borders are almost always white (feature has low variation) so they can be removed. Neighbouring pixels usually have the same colour so they can be averaged to form one feature (features have high correlation). Such steps do not result in much information loss.

In theory, one solution to overcome the curse of dimensionality is to increase the size of the training set. However, in reality, the number of training instances required to reach a given density ($\frac np$) grows exponentially with the number of dimensions.
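The exponential growth can be illustrated with a small sketch. It assumes a hypothetical target of 10 sample points along each axis, and it also estimates how far apart two random points in the unit hypercube are on average. The helper name `mean_pair_distance` and the chosen dimensions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pair_distance(p, n_pairs=10_000):
    """Average Euclidean distance between random point pairs in the unit hypercube [0, 1]^p."""
    a = rng.random((n_pairs, p))
    b = rng.random((n_pairs, p))
    return np.linalg.norm(a - b, axis=1).mean()

# Points needed to keep 10 samples along every axis: 10 ** p
grid_points = {p: 10 ** p for p in (1, 2, 3, 10)}

# Random points drift further apart as the number of dimensions grows
distances = {p: mean_pair_distance(p) for p in (2, 10, 100, 784)}

print(grid_points)
print(distances)
```

With 784 dimensions (one per MNIST pixel), a grid of this kind is far beyond any realistic training set size, which is why reducing $p$ is usually the more practical option.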

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt

def load(fname):
    import pickle
    mnist = None
    try:
        with open(fname, 'rb') as f:
            mnist = pickle.load(f)
            return mnist
    except FileNotFoundError:
        from sklearn.datasets import fetch_openml
        mnist = fetch_openml('mnist_784', version=1, cache=True)
        with open(fname, 'wb') as f:
            pickle.dump(mnist, f)  # dump() returns None, so do not reassign mnist
        return mnist

### Principal Component Analysis (PCA)

Consider an ML problem that has a large set of correlated variables (e.g. the neighbouring pixels example in the MNIST dataset). We can summarize this large set of correlated variables with a smaller number of representative variables using principal components.

Say we want to visualise $n$ observations with $p$ features $X_1, \cdots, X_p$. We could examine two-dimensional scatterplots of the data, but there are $\binom p2 = p(p-1)/2$ of them. If $p$ is large then we cannot possibly look at all of them. Also, most of them will likely be uninformative as each contains only a small fraction of the total information / variance in the dataset. A better method is needed to visualise the $n$ observations when $p$ is large. In particular, we want to find a low-dimensional representation of the data that captures as much of the information as possible.

PCA allows us to do so. The approach is to pick the hyperplane that preserves the largest amount of variance, as projecting onto it will likely lose less information than other projections. Each axis of this hyperplane is a direction defined by a <u>linear combination</u> of the $p$ features.

The <u>first principal component</u> of a set of features $X_1, \cdots, X_p$ is the normalised linear combination of the features:

$$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p$$

that has the largest variance. The elements $\phi_{j1} \forall j \in \{1\cdots p\}$ are the <u>loadings</u> of the first principal component and together, they make the principal component loading vector $\phi_1$. Mathematically, the first principal component loading vector has the loadings:

$$\phi_1 = \begin{pmatrix}\phi_{11}&\phi_{21}&\cdots&\phi_{p1}\end{pmatrix}^T$$

Normalised means that the sum of the squared loadings is one: $\sum_{j=1}^p\phi_{j1}^2=1$. This constraint is needed because otherwise setting these elements to be arbitrarily large would result in an arbitrarily large variance.

To find the first principal component of an $n\times p$ training set $\mathbf X$, we first center each feature to have mean zero. Then, we find the linear combination of the feature values:

$$z_{i1} = \phi_{11}x_{i1} +  \phi_{21}x_{i2} + \cdots +  \phi_{p1}x_{ip}\,\, \forall i \in \{1,\cdots,n\}$$

that has the largest sample variance subject to the constraint $\sum_{j=1}^p\phi_{j1}^2=1$. In other words, the first principal component loading vector solves the optimisation problem:

$$\underset{\phi_{11}, \cdots, \phi_{p1}}{\text{Maximise }} \left\{\frac 1n \sum_{i=1}^n\begin{pmatrix}\sum_{j=1}^p\phi_{j1}x_{ij}\end{pmatrix}^2\right\} \text{ s. t. }$$
$$\sum_{j=1}^p \phi_{j1}^2=1$$

Since $z_{i1} = \phi_{11}x_{i1} +  \phi_{21}x_{i2} + \cdots +  \phi_{p1}x_{ip}$ we can simplify the optimisation problem to:

$$\underset{\phi_{11}, \cdots, \phi_{p1}}{\text{Maximise }} \left\{\frac 1n \sum_{i=1}^nz_{i1}^2\right\} \text{ s. t. }$$ 
$$\sum_{j=1}^p \phi_{j1}^2=1$$

Furthermore, since the data have been centered, i.e. $\frac 1n \sum_{i=1}^nx_{ij}=0$ for every $j$, the mean of $z_{11}, \cdots, z_{n1}$ is zero as well. Hence, the objective we are maximising is just the sample variance of the $n$ values of $z_{i1}$. We refer to $z_{11}, \cdots, z_{n1}$ as the scores of the first principal component.
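As a rough numerical check of this optimisation problem, the sketch below uses hypothetical 2D toy data (not MNIST). It relies on the standard result that the maximiser is the eigenvector of the sample covariance matrix with the largest eigenvalue, and compares the variance along that direction against 1000 random unit directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 500 observations of 2 correlated features
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500)
X = X - X.mean(axis=0)                  # center each feature

cov = X.T @ X / len(X)                  # sample covariance (1/n convention, as in the text)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
phi1 = eigvecs[:, -1]                   # loading vector with the largest eigenvalue
z1 = X @ phi1                           # scores of the first principal component
var_z1 = np.mean(z1 ** 2)               # objective value (1/n) * sum of z_i1^2

# Variance of the projections onto 1000 random unit directions
angles = rng.uniform(0, np.pi, size=1000)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
var_random = np.mean((X @ dirs.T) ** 2, axis=0)

print(np.sum(phi1 ** 2))                # unit norm constraint
print(var_z1, var_random.max())         # no random direction exceeds var_z1
```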

Solving the optimisation problem involves eigenvalue decomposition. In particular, there is a standard matrix factorization technique called Singular Value Decomposition (SVD) that decomposes the training set matrix $\mathbf X$ into the matrix product of three matrices:
$$\mathbf X = \mathbf U \cdot \Sigma \cdot \mathbf V^T$$
where the columns of $\mathbf V$ (equivalently, the rows of $\mathbf V^T$) are the principal component loading vectors that we are looking for.
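A minimal sketch of this factorisation with `np.linalg.svd`, on hypothetical random data rather than MNIST. It checks that the three factors rebuild $\mathbf X$, that the rows of `Vt` are orthonormal, and that the variance of the first component's scores equals $s_1^2 / n$, where $s_1$ is the largest singular value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: 200 observations, 5 correlated features
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)                  # SVD-based PCA assumes centered data

# Thin SVD: U is (200, 5), s holds 5 singular values, Vt is (5, 5)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

X_rebuilt = U @ np.diag(s) @ Vt         # X = U . Sigma . V^T
phi1 = Vt[0]                            # first row of V^T = first loading vector
z1 = X @ phi1                           # scores of the first principal component

print(np.allclose(X, X_rebuilt))
print(np.mean(z1 ** 2), s[0] ** 2 / len(X))
```

Passing `full_matrices=False` avoids building the full $n \times n$ matrix $\mathbf U$, which is not needed to obtain the loading vectors.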

<b>Interpretation</b>: The loading vector of the first principal component, $\phi_1$, defines the direction in feature space along which the data vary the most. If we project the $n$ training samples onto this direction, the projected values are the principal component scores $z_{11}, \cdots, z_{n1}$ themselves, and this projection loses less information than a projection onto any other single direction. PCA identifies the axis that accounts for the largest amount of variance in the training set.

In this example, the observations are in 2D. The first principal component loading vector is the green line. $\phi_1 = (\phi_{11}, \phi_{21}) = (0.839, 0.544)$
<img src="0801.png" width="600" />

In [2]:
# Ingest
mnist = load('mnist.data.pkl')
X, y = mnist['data'], mnist['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=0)

In [3]:
# For testing
print(pd.Series(y_test).value_counts())

1    404
7    380
8    357
2    350
0    348
3    347
9    346
6    340
4    335
5    293
dtype: int64


In [4]:
# Center the data
X_test_centered = X_test - X_test.mean(axis=0)

In [5]:
# For testing
print(X_test.shape)
print(X_test_centered.shape)
X_test_centered[:2]

(3500, 784)
(3500, 784)


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [6]:
# Get the principal components using the SVD algorithm
u, s, Vt = np.linalg.svd(X_test_centered)

In [7]:
# For testing
# c1 = Vt.T[:,0] # First PC
# c2 = Vt.T[:,1] # Second PC
# print(c1)
# print(c2)

In [8]:
# Obtain the training set in lower dimensions
W2 = Vt.T[:,:196]
X2D = X_test_centered.dot(W2)
print(X2D.shape)

(3500, 196)


After the first principal component $Z_1$ of the features is determined, we can find the second principal component $Z_2$. The second principal component is the linear combination of $X_1, \cdots, X_p$ that has maximal variance out of all linear combinations that are uncorrelated with $Z_1$. The second principal component scores $z_{12}, \cdots, z_{n2}$ take the form:

$$z_{i2} = \phi_{12}x_{i1} +  \phi_{22}x_{i2} + \cdots +  \phi_{p2}x_{ip}$$

where $\phi_{2}$ is the second principal component loading vector, with elements $\phi_{12}, \phi_{22}, \cdots, \phi_{p2}$. Note that this loading vector is constrained such that the direction must be orthogonal (perpendicular) to the direction of $\phi_1$. 

In 2D space, once we have found $\phi_1$, there is only one possibility for $\phi_2$, which is the blue dashed line.

<img src="0801.png" width="350" />
<img src="0803.png" width="600" />

But in a larger dataset with $p>2$ variables, there are multiple distinct principal components, and they are defined in a similar manner. To find $\phi_2$, we solve the same maximisation problem, but with the additional constraint that $\phi_2$ is orthogonal to $\phi_1$.
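The two properties of the second principal component (loading vector orthogonal to $\phi_1$, scores uncorrelated with those of $Z_1$) can be checked numerically. The sketch below uses hypothetical random data with $p = 4$ and takes the loading vectors from the SVD.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy data: 300 observations, 4 correlated features
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
phi1, phi2 = Vt[0], Vt[1]               # first two loading vectors
z1, z2 = X @ phi1, X @ phi2             # corresponding scores

dot = phi1 @ phi2                       # ~0: the directions are orthogonal
corr = np.corrcoef(z1, z2)[0, 1]        # ~0: the scores are uncorrelated

print(dot, corr)
print(np.var(z1), np.var(z2))           # variance decreases from Z1 to Z2
```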

Once all the principal components are identified, you can reduce the dimensionality of the dataset by projecting it onto the hyperplane defined by the first $d$ principal components. Selecting this hyperplane ensures that the projection will preserve as much variance as possible. To do so, simply compute the matrix product of the training set matrix $\mathbf X$ and the matrix $\mathbf W_d$, whose columns are the first $d$ principal component loading vectors (the first $d$ columns of $\mathbf V$).

$$\mathbf X_{d\text{-proj}} = \mathbf X \cdot \mathbf W_d$$

The following is the `sklearn` implementation.

In [9]:
pca = PCA(n_components=14**2)
X2D_2 = pca.fit_transform(X_test_centered)

In [10]:
# For testing
print(X2D[:,4])
print()
print(X2D_2[:,4]*-1.0)
print(np.allclose(X2D[:,:], (X2D_2[:,]*-1.0))) # Compare every component after one global sign flip. Signs are arbitrary per component, so this can be False.

[ -55.48560463 -208.27434056  223.45570475 ...  159.80138012  541.66155115
  600.07788564]

[ -55.48560461 -208.27434053  223.45570474 ...  159.80138011  541.66155118
  600.07788563]
False
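
One likely reason for a `False` result here is that the sign of each principal component is arbitrary, so two implementations can agree on one column after a global flip and still disagree on others. The sketch below illustrates this on hypothetical toy data. The second set of scores is a stand-in for another implementation's output, created by flipping the signs of components 0 and 2, and is not actual `sklearn` output.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical toy data: 200 observations, 6 correlated features
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
scores_a = X @ Vt[:3].T                 # scores on the first 3 components

# Stand-in for a second implementation: same components, some signs flipped
flips = np.array([-1.0, 1.0, -1.0])
scores_b = scores_a * flips

# A single global sign flip fixes columns 0 and 2 but breaks column 1
global_flip_matches = np.allclose(scores_a, -scores_b)

# Align the sign of each column separately before comparing
signs = np.sign(np.sum(scores_a * scores_b, axis=0))
aligned_matches = np.allclose(scores_a, scores_b * signs)

print(global_flip_matches, aligned_matches)
```

Comparing absolute values or aligning signs column by column is a more robust equality check than multiplying every column by the same factor.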


### Proportion of Variance Explained

After a projection is complete, we ask how much of the information in a given dataset is lost by projecting the observations onto the first few principal components. In other words, how much of the variance in the data is not contained in the first few principal components?

Generally, we want to find the proportion of variance explained (PVE) by each principal component. The total variance in a dataset is defined as:

$$\sum_{j=1}^p \text{Var}(X_j) = \sum_{j=1}^p\frac 1n \sum_{i=1}^n x^2_{ij}$$

and the variance explained by the $m$th principal component is:

$$\frac 1n \sum_{i=1}^n z^2_{im} = \frac 1n \sum_{i=1}^n\begin{pmatrix}\sum_{j=1}^p \phi_{jm}x_{ij}\end{pmatrix}^2$$

and hence the PVE of the $m$th principal component is:

$$\frac{\sum_{i=1}^n\begin{pmatrix}\sum_{j=1}^p \phi_{jm}x_{ij}\end{pmatrix}^2}{\sum_{j=1}^p \sum_{i=1}^n x^2_{ij}}$$

The PVE of each principal component is a positive quantity. To compute the cumulative PVE of the first $M$ principal components, we simply sum the PVEs of the first $M$ components.
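The formulas above translate directly into NumPy. This is a sketch on hypothetical random data, assuming the data are centered and using the $\frac 1n$ convention from the text. It also checks the equivalent expression in terms of the singular values, $s_m^2 / \sum_k s_k^2$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical toy data: 250 observations, 5 correlated features
X = rng.normal(size=(250, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)
n = len(X)

_, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                                    # scores for all principal components

total_variance = np.sum(X ** 2) / n             # sum over j of Var(X_j)
component_variance = np.sum(Z ** 2, axis=0) / n # variance explained by each component
pve = component_variance / total_variance       # proportion of variance explained
cumulative_pve = np.cumsum(pve)                 # cumulative PVE of the first M components

print(pve)
print(cumulative_pve)
```

When all $p$ components are kept, the PVEs sum to one, so the last entry of `cumulative_pve` is 1 up to rounding.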

The PVE of each principal component as well as the cumulative PVE can be shown in a scree plot.

<img src="0804.png" width="600" />

Of course, we aim to use the smallest number of principal components that still describes the data well. However, there is no single accepted way to choose this number.

Generally, we use a scree plot to help us. Specifically, we look for a point where the proportion of variance explained by each subsequent principal component drops off. It is referred to as the elbow in the scree plot. In the above example, the elbow occurs after the second principal component. The third principal component captures 10% of the variance and the fourth principal component explains less than 5% and is essentially worthless.

Another option is to choose the smallest number of dimensions whose cumulative PVE adds up to a sufficiently large portion of the variance, e.g. 95%.
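A minimal sketch of this rule: the helper `components_for_variance` and the example PVE values are hypothetical and chosen only for illustration. The function returns the smallest $d$ whose cumulative PVE reaches the threshold, assuming the threshold is actually reached.

```python
import numpy as np

def components_for_variance(pve, threshold=0.95):
    """Smallest number of leading components whose cumulative PVE reaches the threshold."""
    cumulative = np.cumsum(pve)
    # argmax returns the index of the first True entry
    return int(np.argmax(cumulative >= threshold) + 1)

# Hypothetical PVE values for five components (cumulative: 0.60, 0.85, 0.93, 0.97, 1.00)
example_pve = np.array([0.60, 0.25, 0.08, 0.04, 0.03])

d_95 = components_for_variance(example_pve, threshold=0.95)
d_90 = components_for_variance(example_pve, threshold=0.90)
print(d_95, d_90)
```

With a fitted `sklearn` `PCA` object, the same function could be applied to `pca.explained_variance_ratio_`. `PCA` also accepts a float between 0 and 1 for `n_components` (for example `PCA(n_components=0.95)`), which selects the number of components by this criterion.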

In [11]:
print(pca.explained_variance_)
print(pca.explained_variance_ratio_)

[333546.90709433 245901.93949278 208021.62030925 181785.38709485
 174320.99459052 147759.68427128 111460.35237981 100206.02474743
  96227.01303702  84701.88249465  71803.81834356  71369.59851144
  60006.8579139   59285.67358456  54027.28339148  52252.30102123
  45526.34206775  43809.53252434  40531.50698102  39038.84182736
  36614.20697174  35906.63659951  33696.27789464  31729.46761082
  29861.70635076  28572.48275191  28181.19662231  26462.09367191
  24373.09757148  23662.80238437  22698.8408301   21927.18722011
  20804.276075    20103.20944223  19103.77179376  18481.79710125
  17863.06071146  17091.44316387  16576.04938153  16074.73288768
  15716.4105393   15129.20431514  14265.62620464  13415.91008828
  12985.48045643  12673.74288751  12243.44826715  12151.38190587
  11624.01388672  11381.53572301  10827.3353768   10698.1021714
  10220.64256376   9987.8319534    9746.14321808   9578.48277644
   9264.19197474   9036.62671419   8961.46492002   8422.30584433
   8320.40753362   8131.57