## Chapter 8 - Dimensionality Reduction

Dimensionality reduction projects a high-dimensional training set onto a much lower-dimensional space while trying to lose as little information as possible.

### Curse of Dimensionality

In most problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated with each other. As a result, all training instances lie within (or close to) a much lower-dimensional subspace of the high-dimensional space.

Manifold learning relies on the manifold assumption: most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. For example, a Swiss roll is a 2D manifold: locally it resembles a 2D plane, but it is bent and rolled up in 3D space. We also assume that the ML task (regression or classification) will be simpler if expressed in the lower-dimensional space of the manifold, although this does not always hold.
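
As a rough illustration of the manifold assumption, the sketch below generates a noise-free Swiss roll with Scikit-Learn's `make_swiss_roll` and checks that every 3D point is fully determined by two numbers: the position along the roll `t` and the height. The sample size, the seed, and the names `X_swiss`, `unrolled`, and `rebuilt` are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

# Noise-free Swiss roll: 1000 points in 3D, plus the position t along the roll
X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.0, random_state=42)

# The "unrolled" 2D coordinates: position along the roll and height
unrolled = np.column_stack([t, X_swiss[:, 1]])

# Rebuild the 3D points from the two manifold coordinates only
rebuilt = np.column_stack([
    unrolled[:, 0] * np.cos(unrolled[:, 0]),  # x = t * cos(t)
    unrolled[:, 1],                           # y = height
    unrolled[:, 0] * np.sin(unrolled[:, 0]),  # z = t * sin(t)
])

print(X_swiss.shape, unrolled.shape)
print(np.allclose(rebuilt, X_swiss))
```

Two coordinates are enough to recover all three, which is what "a 2D manifold in 3D space" means here.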

In [1]:
import numpy as np

from sklearn.decomposition import PCA

In [2]:
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1

# Random angles, then a noisy 3D dataset lying close to a 2D plane
angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles) / 2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m) / 2

X_centered = X - X.mean(axis=0)

In [3]:
# For testing
X_centered[:5]

array([[ 0.64719512, -0.50180842, -0.11727884],
       [-0.07422696,  0.24952118,  0.06608283],
       [-0.44530359,  0.21637711, -0.01268627],
       [ 1.1094908 ,  0.14934078,  0.12444768],
       [ 0.55256103,  0.44261305,  0.24590183]])

### Principal Component Analysis

PCA is a very popular dimensionality reduction algorithm. It first identifies the hyperplane that lies closest to the data, then projects the data onto it.

The idea is to pick the hyperplane that preserves the maximum amount of variance, since that projection will most likely lose less information than the others. PCA identifies the axis that accounts for the largest amount of variance in the training set. It then finds a second axis, orthogonal to the first, that accounts for the largest amount of the remaining variance, and so on for as many axes as the dataset has dimensions.

The unit vector that defines the $i$th axis is called the $i$th principal component (PC). 

To find the principal components of a training set, there is a standard matrix factorization technique called Singular Value Decomposition (SVD), which decomposes the training set matrix $\mathbf X$ into the matrix product of three matrices $$\mathbf U \cdot \Sigma \cdot \mathbf V^T$$ where the rows of $\mathbf V^T$ (equivalently, the columns of $\mathbf V$) are the principal components that we are looking for. PCA assumes that the dataset is centered around the origin, so the mean is subtracted from $\mathbf X$ before calling SVD.

In [4]:
u, s, Vt = np.linalg.svd(X_centered)
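
A quick sanity check of the factorization, as a minimal sketch on hypothetical random data (`X_demo` and the seed are assumptions, not the notebook's dataset). Note that `np.linalg.svd` returns the singular values as a 1D array, so $\Sigma$ has to be rebuilt as an $m \times n$ matrix before multiplying.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))              # hypothetical 60 x 3 dataset
X_demo_centered = X_demo - X_demo.mean(axis=0)

U, s, Vt = np.linalg.svd(X_demo_centered)      # U: (60, 60), s: (3,), Vt: (3, 3)

# Rebuild Sigma as a (60, 3) matrix with the singular values on its diagonal
Sigma = np.zeros(X_demo_centered.shape)
Sigma[:3, :3] = np.diag(s)

# U . Sigma . V^T should reproduce the centered data
print(np.allclose(X_demo_centered, U @ Sigma @ Vt))

# The principal components (rows of Vt) are orthonormal unit vectors
print(np.allclose(Vt @ Vt.T, np.eye(3)))
```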

Once all the principal components are identified, you can reduce the dimensionality of the dataset down to $d$ dimensions by projecting it onto the hyperplane defined by the first $d$ principal components. Selecting this hyperplane ensures that the projection will preserve as much variance as possible. To do so, compute the matrix product of the training set matrix $\mathbf X$ and the matrix $\mathbf W_d$, defined as the matrix containing the first $d$ principal components (i.e., the first $d$ columns of $\mathbf V$).

$$\mathbf X_{d\text{-proj}} = \mathbf X \cdot \mathbf W_d$$

Another useful piece of information is the explained variance ratio of each PC. It indicates the proportion of the dataset's variance that lies along the axis of each PC.

Instead of arbitrarily choosing the number of dimensions to reduce down to, it is generally preferable to choose the smallest number of dimensions whose explained variance ratios add up to a sufficiently large portion of the variance, e.g. 95%.
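
The sketch below shows one way to pick the number of dimensions from the explained variance ratios. It uses hypothetical data (`X_demo`, five features with deliberately different scales) and Scikit-Learn's `explained_variance_ratio_` attribute; passing a float between 0 and 1 as `n_components` asks `PCA` to keep enough components to reach that fraction of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 200 instances, 5 features with very different variances
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.2, 0.1])

# Fit PCA with all components and accumulate the explained variance ratios
pca_full = PCA()
pca_full.fit(X_demo)
cumsum = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest d whose cumulative ratio reaches 95%
d = int(np.argmax(cumsum >= 0.95)) + 1

# The same ratios follow from the singular values of the centered data
s = np.linalg.svd(X_demo - X_demo.mean(axis=0), compute_uv=False)
ratio_from_svd = s**2 / np.sum(s**2)
print(np.allclose(ratio_from_svd, pca_full.explained_variance_ratio_))

# Shortcut: let PCA choose the number of components for 95% of the variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_demo)
print(d, X_reduced.shape[1])
```

The two printed numbers should match, since both approaches select the smallest number of components that reaches the 95% threshold.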

In [5]:
c1 = Vt.T[:,0]
c2 = Vt.T[:,1]

In [6]:
# For testing
print(c1)
print(c2)

[-0.95865753 -0.23337302 -0.16282746]
[-0.27013944  0.92621578  0.26296201]


In [7]:
W2 = Vt.T[:,:2]
X2D = X_centered.dot(W2)

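Using Scikit-Learn (its `PCA` class centers the data automatically, so it is fitted on `X` rather than `X_centered`):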

In [8]:
pca = PCA(n_components=2)
X2D_2 = pca.fit_transform(X)

In [11]:
# For testing
print(X2D[:5])
print()
print(X2D_2[:5])
print(np.allclose(X2D, X2D_2))

[[-0.48423371 -0.67045568]
 [ 0.00216662  0.26853936]
 [ 0.37846273  0.31736995]
 [-1.11873732 -0.12867043]
 [-0.67305031  0.32534951]]

[[-0.48423371  0.67045568]
 [ 0.00216662 -0.26853936]
 [ 0.37846273 -0.31736995]
 [-1.11873732  0.12867043]
 [-0.67305031 -0.32534951]]
False
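
The `False` above does not mean the two projections disagree: the second column simply has the opposite sign. A principal component is only defined up to its direction, so different solvers may return $-\mathbf c$ instead of $\mathbf c$. A minimal sketch of a sign-insensitive comparison, on hypothetical data (`X_demo` and the seed are assumptions, not the notebook's dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Hypothetical data with clearly separated variances along each axis
X_demo = rng.normal(size=(60, 3)) * np.array([3.0, 1.0, 0.3])
X_demo_centered = X_demo - X_demo.mean(axis=0)

# Projection via SVD
_, _, Vt = np.linalg.svd(X_demo_centered)
X2D_svd = X_demo_centered @ Vt.T[:, :2]

# Projection via Scikit-Learn
X2D_pca = PCA(n_components=2).fit_transform(X_demo)

# Compare magnitudes only, ignoring the sign of each column
same_up_to_sign = np.allclose(np.abs(X2D_svd), np.abs(X2D_pca))
print(same_up_to_sign)

# Alternatively, flip the SVD columns that point the opposite way
signs = np.sign(np.sum(X2D_svd * X2D_pca, axis=0))
print(np.allclose(X2D_svd * signs, X2D_pca))
```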
