## Chapter 8 - Dimensionality Reduction

### Unsupervised Learning

In supervised learning, we have access to $p$ features measured on $n$ observations, together with a response $y$ for each observation. The goal is then to predict $y$ using the $p$ features.

In unsupervised learning, we only have a set of features $X_1, \cdots, X_p$ measured on $n$ observations. We are not interested in prediction because we do not have an associated response variable $y$. Rather, the goal is to discover interesting things about the measurements $X_1, \cdots, X_p$. Can we visualise the data? Can we discover subgroups among the variables or the observations?

Unsupervised learning is often more challenging. The analysis tends to be more subjective, and there is no single well-defined goal such as predicting a response. Unsupervised learning is often performed as part of <u>exploratory data analysis</u>. Furthermore, it is hard to check our work: because there is no known response, we cannot use tools like cross-validation to measure the performance of our technique.

### The Curse of Dimensionality

Many ML problems involve a large number of features for each training instance, i.e. $p$ can be very large. This makes training slow and makes it harder to find a good solution. This is called the curse of dimensionality.

Consider the MNIST example. The pixels on the image borders are almost always white (feature has low variation) so they can be removed. Neighbouring pixels usually have the same colour so they can be averaged to form one feature (features have high correlation). Such steps do not result in much information loss.

In theory, one solution to overcome the curse of dimensionality is to increase the size of the training set. However, in reality, the number of training instances required to reach a given density grows exponentially with the number of dimensions.
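To get a feel for this exponential growth, here is a rough sketch. It assumes, purely for illustration, that "a given density" means keeping a fixed number of evenly spaced sample points along each axis of a unit hypercube. The names `points_per_axis` and `instances_needed` are made up for this example.

```python
# Illustrative assumption: we want `points_per_axis` evenly spaced
# sample points along every axis of a unit hypercube.
points_per_axis = 10


def instances_needed(n_dims, per_axis=points_per_axis):
    """Number of grid points needed to cover n_dims dimensions."""
    return per_axis ** n_dims


for p in (1, 2, 3, 10, 100):
    print(f"p = {p:>3}: {instances_needed(p):.3e} instances")
```

Under this assumption, each extra dimension multiplies the required number of instances by `points_per_axis`, which quickly becomes infeasible.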

In [1]:
import numpy as np

from sklearn.decomposition import PCA

### Principal Component Analysis (PCA)

Consider an ML problem that has a large set of correlated variables (e.g. the neighbouring pixels example in the MNIST dataset). We can summarize this large set of correlated variables with a smaller number of representative variables using principal components.

Say we want to visualise $n$ observations with $p$ features $X_1, \cdots, X_p$. We could examine two-dimensional scatterplots of the data, of which there are $\binom{p}{2} = p(p-1)/2$. If $p$ is large then we cannot possibly look at all of them. Also, most of them will likely be uninformative, as each contains only a small fraction of the total information / variance in the dataset. A better method is needed to visualise the $n$ observations when $p$ is large. In particular, we want to find a low-dimensional representation of the data that captures as much of the information as possible.

PCA allows us to do so. The approach is to pick the axis (direction) along which the data preserve the most variance, as projecting onto it will likely lose less information than other projections. Each such direction is defined by a <u>linear combination</u> of the $p$ features.

The <u>first principal component</u> of a set of features $X_1, \cdots, X_p$ is the normalised linear combination of the features:

$$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p$$

that has the largest variance. Normalised means that $\sum_{j=1}^p\phi_{j1}^2=1$. The elements $\phi_{11}, \cdots, \phi_{p1}$ are the <u>loadings</u> of the first principal component; together, they make up the principal component loading vector $\phi_1 = (\phi_{11}, \phi_{21}, \cdots, \phi_{p1})^T$.

To find the first principal component of an $n\times p$ training set $\mathbf X$, we first center each feature to have mean zero. Then, we look for the linear combination of the feature values:

$$z_{i1} = \phi_{11}x_{i1} +  \phi_{21}x_{i2} + \cdots +  \phi_{p1}x_{ip}$$

that has the largest sample variance subject to the constraint $\sum_{j=1}^p\phi_{j1}^2=1$. In other words, the first principal component loading vector solves the optimisation problem:

$$\underset{\phi_{11}, \cdots, \phi_{p1}}{\text{Maximise }} \left\{\frac 1n \sum_{i=1}^n\left(\sum_{j=1}^p\phi_{j1}x_{ij}\right)^2\right\} \text{ s. t. } \sum_{j=1}^p \phi_{j1}^2=1$$

Since $z_{i1} = \phi_{11}x_{i1} +  \phi_{21}x_{i2} + \cdots +  \phi_{p1}x_{ip}$, we can rewrite the objective as

$$\underset{\phi_{11}, \cdots, \phi_{p1}}{\text{Maximise }} \left\{\frac 1n \sum_{i=1}^nz_{i1}^2\right\} \text{ s. t. } \sum_{j=1}^p \phi_{j1}^2=1$$

Furthermore, since the features have been centered, i.e. $\frac 1n \sum_{i=1}^nx_{ij}=0$ for every $j$, the mean of $z_{11}, \cdots, z_{n1}$ is zero as well. Hence, the objective we are maximising is just the sample variance of the $n$ values of $z_{i1}$. We refer to $z_{11}, \cdots, z_{n1}$ as the scores of the first principal component.
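The optimisation problem above can be checked numerically. The sketch below uses hypothetical toy data (not the dataset used later in this chapter) and takes the first right singular vector of the centered data as the candidate $\phi_1$. It then compares the variance of its scores against that of many random unit-norm loading vectors. The helper `score_variance` is a name introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Hypothetical toy data: 200 observations of 3 correlated features
n, p = 200, 3
latent = rng.normal(size=(n, 1))
X_toy = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.3 * rng.normal(size=(n, p))
X_toy_centered = X_toy - X_toy.mean(axis=0)


def score_variance(phi, X):
    """Sample variance (1/n) * sum_i z_i^2 of the scores z = X @ phi."""
    z = X @ phi
    return np.mean(z ** 2)


# Candidate solution: first right singular vector of the centered data
_, _, Vt_toy = np.linalg.svd(X_toy_centered, full_matrices=False)
phi_1 = Vt_toy[0]

# Compare against many random unit-norm loading vectors
random_phis = rng.normal(size=(1000, p))
random_phis /= np.linalg.norm(random_phis, axis=1, keepdims=True)
best_random = max(score_variance(phi, X_toy_centered) for phi in random_phis)

print("variance along phi_1:       ", score_variance(phi_1, X_toy_centered))
print("best of 1000 random vectors:", best_random)
```

No random direction should beat `phi_1`, although some may come close if they happen to point almost the same way.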

Solving the optimisation problem amounts to an eigendecomposition of the covariance matrix of $\mathbf X$. In practice, there is a standard matrix factorization technique called Singular Value Decomposition (SVD) that decomposes the centered training set matrix $\mathbf X$ into the matrix product of three matrices:
$$\mathbf X = \mathbf U \cdot \Sigma \cdot \mathbf V^T$$
where the rows of $\mathbf V^T$ (equivalently, the columns of $\mathbf V$) are the loading vectors of all the principal components that we are looking for.
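As a rough check of the link between the two routes, the sketch below compares an eigendecomposition of the sample covariance matrix with the SVD of the centered data. The data are hypothetical, and the covariance uses the $n-1$ divisor. The choice of divisor only rescales the eigenvalues and does not change the eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data, centered so that each column has mean zero
A = rng.normal(size=(50, 3)) @ np.array([[3.0, 0.0, 0.0],
                                         [1.0, 1.0, 0.0],
                                         [0.5, 0.2, 0.1]])
A_centered = A - A.mean(axis=0)
n_obs = A_centered.shape[0]

# Route 1: SVD of the centered data matrix
_, sing_vals, Vt_svd = np.linalg.svd(A_centered, full_matrices=False)

# Route 2: eigendecomposition of the sample covariance matrix
cov = A_centered.T @ A_centered / (n_obs - 1)
eig_vals, eig_vecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eig_vals)[::-1]            # re-sort by decreasing variance
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Eigenvalues should equal squared singular values divided by (n - 1)
print(np.allclose(eig_vals, sing_vals ** 2 / (n_obs - 1)))

# Eigenvectors should match the rows of V^T up to an arbitrary sign flip
print(np.allclose(np.abs(eig_vecs.T), np.abs(Vt_svd)))
```

The absolute values are compared because each loading vector is only determined up to its sign.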

The loading vector of the first principal component defines the direction in feature space along which the data vary the most. If we project the training data onto this direction, the projected values are the principal component scores $z_{11}, \cdots, z_{n1}$, and this projection loses the least amount of information compared to projections onto other directions. In other words, PCA identifies the axis that accounts for the largest amount of variance in the training set.

In [10]:
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1

# Generate data: noisy points lying close to a 2D plane in 3D space
angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles) / 2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m) / 2

# Centering the data
X_centered = X - X.mean(axis=0)
print(X_centered.shape)

(60, 3)


In [7]:
# For testing
# X_centered[:5]

In [11]:
# SVD of the centered data; the rows of Vt are the principal component loading vectors
u, s, Vt = np.linalg.svd(X_centered)
print(Vt)

[[-0.95864883 -0.22920631 -0.1686917 ]
 [ 0.26533402 -0.93417774 -0.23855777]
 [-0.10290908 -0.27345277  0.95636463]]


After the first principal component $Z_1$ of the features has been determined, we can find the second principal component $Z_2$. The second principal component is the linear combination of $X_1, \cdots, X_p$ that has maximal variance out of all linear combinations that are uncorrelated with $Z_1$. The second principal component scores $z_{12}, \cdots, z_{n2}$ take the form:

$$z_{i2} = \phi_{12}x_{i1} +  \phi_{22}x_{i2} + \cdots +  \phi_{p2}x_{ip}$$

where $\phi_2$ is the second principal component loading vector, with elements $\phi_{12}, \phi_{22}, \cdots, \phi_{p2}$. Constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the direction of $\phi_2$ to be orthogonal (perpendicular) to the direction of $\phi_1$.
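Both properties can be checked numerically. The sketch below uses hypothetical toy data and takes the first two rows of $\mathbf V^T$ from an SVD as $\phi_1$ and $\phi_2$. It then checks that the loading vectors are orthogonal and that the corresponding scores are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy data with correlated columns, then centered
B = rng.normal(size=(100, 4))
B[:, 1] += 0.8 * B[:, 0]
B[:, 3] -= 0.5 * B[:, 2]
B_centered = B - B.mean(axis=0)

_, _, Vt_b = np.linalg.svd(B_centered, full_matrices=False)
phi_1, phi_2 = Vt_b[0], Vt_b[1]

# Loading vectors: the dot product should be zero up to rounding error
print("phi_1 . phi_2 =", phi_1 @ phi_2)

# Scores: projections of the centered data onto each loading vector
z1 = B_centered @ phi_1
z2 = B_centered @ phi_2
print("corr(Z1, Z2)  =", np.corrcoef(z1, z2)[0, 1])
```

Both printed values should be zero up to floating-point rounding.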

Once all the principal components are identified, you can reduce the dimensionality of the dataset down to $d$ dimensions by projecting it onto the hyperplane defined by the first $d$ principal components. Selecting this hyperplane ensures that the projection will preserve as much variance as possible. To do so, compute the matrix product of the centered training set matrix $\mathbf X$ and the matrix $\mathbf W_d$, whose columns are the first $d$ columns of $\mathbf V$.

$$\mathbf X_{d\text{-proj}} = \mathbf X \cdot \mathbf W_d$$

In [12]:
c1 = Vt.T[:,0] # First PC
c2 = Vt.T[:,1] # Second PC

In [13]:
# For testing
print(c1)
print(c2)

[-0.95864883 -0.22920631 -0.1686917 ]
[ 0.26533402 -0.93417774 -0.23855777]


In [14]:
# Obtain the training set in lower dimensions
W2 = Vt.T[:,:2]
X2D = X_centered.dot(W2)
print(X2D.shape)

(60, 2)


The following is the `sklearn` implementation. Note that `PCA` centers the data itself, so we pass `X` rather than `X_centered`.

In [15]:
pca = PCA(n_components=2)
X2D_2 = pca.fit_transform(X)

In [18]:
# For testing
print(X2D[:2])
print()
print(X2D_2[:2])
print(np.allclose(X2D, X2D_2)) # Validate that both are equal (in general, the signs of some components may be flipped).

[[-0.98022795  0.21001806]
 [-0.54767601  0.63946824]]

[[-0.98022795  0.21001806]
 [-0.54767601  0.63946824]]
True


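Another useful piece of information is the explained variance ratio of each PC. It indicates the proportion of the dataset's variance that lies along the axis of each PC. In the output below, about 87.7% of the variance lies along the first PC and about 12.0% along the second, leaving roughly 0.3% for the third.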

In [21]:
print(pca.explained_variance_)
print(pca.explained_variance_ratio_)

[0.83362472 0.11366809]
[0.87730052 0.11962347]
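These quantities can also be recovered from the singular values. The sketch below uses hypothetical toy data rather than the dataset above. It computes the variance along each PC as $s_k^2 / (n - 1)$, which is the divisor `sklearn` uses, and compares the result with the attributes of a fitted `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Hypothetical toy data (not the dataset used above)
C = rng.normal(size=(80, 3)) * np.array([3.0, 1.0, 0.2])
C_centered = C - C.mean(axis=0)

_, s_c, _ = np.linalg.svd(C_centered, full_matrices=False)

# Variance along each PC, using the (n - 1) divisor that sklearn uses
explained_variance = s_c ** 2 / (C.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

pca_c = PCA(n_components=3).fit(C)
print(np.allclose(explained_variance, pca_c.explained_variance_))
print(np.allclose(explained_variance_ratio, pca_c.explained_variance_ratio_))
```

The ratios sum to one here only because all three components are kept. With fewer components, as in the cell above, the remainder is the variance lost by the projection.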


Instead of arbitrarily choosing the number of dimensions to reduce down to, it is generally preferable to choose the smallest number of dimensions whose explained variance ratios add up to a sufficiently large portion of the variance, e.g. 95%.
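A minimal sketch of this selection rule follows, on hypothetical toy data with ten features. The 95% threshold and the `scales` array are illustrative choices. The first option picks $d$ from the cumulative explained variance ratio; the second passes the target ratio to `PCA` as a float `n_components`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Hypothetical toy data: 10 features with rapidly decaying scale
scales = np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.3, 0.2, 0.1, 0.1, 0.1])
D = rng.normal(size=(300, 10)) * scales

# Option 1: fit a full PCA, then pick the smallest d reaching the threshold
pca_full = PCA().fit(D)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
d = int(np.argmax(cumulative >= 0.95)) + 1
print("components needed for 95% variance:", d)

# Option 2: pass the target ratio directly as n_components
pca_95 = PCA(n_components=0.95).fit(D)
print("components kept by PCA(n_components=0.95):", pca_95.n_components_)
```

Both options should report the same number of components. The second is more convenient when only the reduced dataset is needed.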