Many Machine Learning problems involve thousands or even millions of features for each training instance. Not only do all these features make training extremely slow, but they can also make it much harder to find a good solution. This problem is often referred to as the ***curse of dimensionality***.

Fortunately, in real-world problems, it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one. Be careful, though: reducing dimensionality does cause some information loss, so even though it will speed up training, it may make your system perform slightly worse. It also makes the pipelines a bit more complex and thus harder to maintain. So you should first try to train the system with the original data, and consider dimensionality reduction only if training is too slow. In some cases, reducing the dimensionality of the training data may filter out some noise and unnecessary details and thus result in higher performance, but in general it won't; it will just speed up training.

Apart from speeding up training, dimensionality reduction is also extremely useful for data visualization. Reducing the number of dimensions down to two or three makes it possible to plot a condensed view of a high-dimensional training set on a graph and often to gain important insights by visually detecting patterns, such as clusters. It also helps to communicate conclusions to people who are not data scientists, in particular decision makers.

## The curse of dimensionality

High-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other. This also means that a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations. In short, the more dimensions the training set has, the greater the risk of overfitting it.

In theory, one solution to the curse of dimensionality could be to increase the size of the training set to reach a sufficient density of training instances. Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions.
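
The sparsity effect can be checked numerically. The sketch below is only an illustration under assumed settings (points drawn uniformly in a unit hypercube, an arbitrary seed, a hypothetical helper named `mean_pairwise_distance`): it estimates the average distance between two random points as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only


def mean_pairwise_distance(n_dims, n_pairs=2_000):
    """Estimate the mean Euclidean distance between two random points
    drawn uniformly in the unit hypercube of dimension n_dims."""
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()


for n_dims in (2, 10, 100, 1000):
    dist = mean_pairwise_distance(n_dims)
    # A grid with 10 cells per axis has 10**n_dims cells to fill with instances
    print(f"{n_dims:>4} dims: mean distance = {dist:.2f}, grid cells = 10^{n_dims}")
```

The mean distance keeps growing with the number of dimensions, while the number of grid cells that would need at least one instance to keep the same density grows as a power of the dimension count.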

## Main Approaches for Dimensionality Reduction

There are two main approaches to reducing dimensionality: projection and Manifold Learning.

### Projection

In most real-world problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated. As a result, all training instances lie within, or close to, a much lower-dimensional subspace of the high-dimensional space.

However, projection is not always the best approach to dimensionality reduction. In many cases the subspace may twist and turn, as in the famous *Swiss roll* dataset. Simply projecting onto a plane would squash different layers of the Swiss roll together.

### Manifold Learning

The Swiss roll is an example of a 2D *manifold*. A d-dimensional manifold is a part of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane. In the case of the Swiss roll, d = 2 and n = 3; it locally resembles a 2D plane, but it is rolled in the third dimension.
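
To make the squashing effect concrete, here is a minimal sketch using Scikit-Learn's `make_swiss_roll`. The sample size, the seed, and the helper `neighbor_gap_along_roll` are assumptions made for this illustration. It compares a naive projection (dropping the third axis) with the "unrolled" coordinates, by measuring how far apart along the roll each point's nearest 2D neighbor really is.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import NearestNeighbors

# t is each point's position along the rolled-up direction of the manifold
X_roll, t = make_swiss_roll(n_samples=1000, noise=0.0, random_state=42)

X_projected = X_roll[:, :2]                      # naive projection: drop the third axis
X_unrolled = np.column_stack([t, X_roll[:, 1]])  # coordinates on the 2D manifold itself


def neighbor_gap_along_roll(points_2d):
    """Mean difference in t between each point and its nearest 2D neighbor."""
    nn = NearestNeighbors(n_neighbors=2).fit(points_2d)
    _, idx = nn.kneighbors(points_2d)
    nearest = idx[:, 1]  # column 0 is the point itself
    return np.abs(t - t[nearest]).mean()


gap_projected = neighbor_gap_along_roll(X_projected)
gap_unrolled = neighbor_gap_along_roll(X_unrolled)
print(f"projected: {gap_projected:.2f}, unrolled: {gap_unrolled:.2f}")
```

A larger gap for the projected data means that points from different layers of the roll end up next to each other, which is exactly what unrolling the manifold avoids.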

Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie: this is called *Manifold Learning*. It relies on the *manifold assumption*, also called the *manifold hypothesis*, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. This assumption is very often empirically observed.

The manifold assumption is often accompanied by another implicit assumption: that the task at hand (classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold. However, this implicit assumption does not always hold. In short, reducing the dimensionality of the training set before training a model will usually speed up training, but it may not always lead to a better or simpler solution. It all depends on the dataset.

## PCA

*Principal Component Analysis* (PCA) is by far the most popular dimensionality reduction algorithm.

### Preserving the variance

Before you can project the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane. It seems reasonable to select the one that preserves the maximum amount of variance, since it will most likely lose less information than the other projections.

One way to justify this choice of hyperplane is that the chosen axis is also the one that minimizes the mean squared distance between the original dataset and its projection onto that axis. This is the main idea behind PCA.
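
The equivalence between "maximum variance" and "minimum mean squared distance" can be verified on a toy example. The sketch below uses hypothetical 2D data (an elongated Gaussian cloud rotated by 30 degrees) and a brute-force scan over candidate axes; it is not how PCA is computed in practice. For centered data, the squared norm of each point splits into its squared coordinate along the axis plus its squared distance to the axis, so the two criteria select the same axis.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Hypothetical data: an elongated cloud whose long axis is rotated by 30 degrees
angle = np.radians(30)
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
X_toy = (rng.normal(size=(500, 2)) * [3.0, 0.5]) @ R.T
X_toy = X_toy - X_toy.mean(axis=0)  # center the data

# Try candidate axes from 0 to 179 degrees, one degree apart
angles = np.linspace(0, np.pi, 180, endpoint=False)
variances, mean_sq_dists = [], []
for theta in angles:
    u = np.array([np.cos(theta), np.sin(theta)])  # unit vector of the axis
    z = X_toy @ u                                 # coordinates along the axis
    X_on_axis = np.outer(z, u)                    # projected points in 2D
    variances.append(z.var())
    mean_sq_dists.append(((X_toy - X_on_axis) ** 2).sum(axis=1).mean())

best_by_variance = np.degrees(angles[np.argmax(variances)])
best_by_distance = np.degrees(angles[np.argmin(mean_sq_dists)])
print(best_by_variance, best_by_distance)
```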

### Principal Components

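PCA identifies the axis that accounts for the largest amount of variance in the training set. It then finds a second axis, orthogonal to the first one, that accounts for the largest amount of remaining variance, and so on, for as many axes as the dataset has dimensions. The i^th axis is called the i^th *principal component* (PC) of the data.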

To find the principal components of a training set, we can use a standard matrix factorization technique called *Singular Value Decomposition* (SVD). It decomposes the training set matrix X into the matrix multiplication of three matrices, U Σ Vᵀ, where V contains the unit vectors that define all the principal components we are looking for.

/!\ PCA assumes that the dataset is centered around the origin. Scikit-Learn's PCA classes take care of centering the data for us, but if we implement PCA ourselves (as in the NumPy code below) or use other libraries, we must center the data first.
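
To see why centering matters, consider the following sketch on hypothetical data (a cloud that varies mostly along its second feature but sits far from the origin along the first one). Without centering, the first right-singular vector mostly points toward the mean of the data rather than along the direction of maximum variance.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Hypothetical data: almost all the variance is along the second axis,
# but the cloud is located far from the origin along the first axis.
X_offset = rng.normal(loc=[100.0, 0.0], scale=[0.1, 1.0], size=(500, 2))

# SVD on the raw data versus SVD on the centered data
_, _, Vt_raw = np.linalg.svd(X_offset, full_matrices=False)
_, _, Vt_centered = np.linalg.svd(X_offset - X_offset.mean(axis=0),
                                  full_matrices=False)

print("first axis without centering:", Vt_raw[0])
print("first axis with centering:   ", Vt_centered[0])
```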

In [34]:
#Import dataset
from sklearn.datasets import make_moons

#Import train test split
from sklearn.model_selection import train_test_split

#Numpy 
import numpy as np

In [35]:
X,y = make_moons(n_samples=100000,noise=0.4)

In [43]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [38]:
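# Center the whole dataset (slicing X[0:1] would keep only the first row)
X_centered = X - X.mean(axis=0)
# full_matrices=False avoids building a huge (n_samples x n_samples) U matrix
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
c1 = Vt.T[:, 0]  # first principal component
c2 = Vt.T[:, 1]  # second principal component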

### Projecting Down to d Dimensions

Once all the principal components have been identified, the dimensionality of the dataset can be reduced down to d dimensions by projecting it onto the hyperplane defined by the first *d* principal components.

Selecting this hyperplane ensures that the projection will preserve as much variance as possible. To project the training set onto the hyperplane and obtain a reduced dataset Xd-proj of dimensionality *d*, we compute the matrix multiplication of the training set matrix X by the matrix Wd, defined as the matrix containing the first *d* columns of V: Xd-proj = X Wd.

In [39]:
# In Python with NumPy

W2 = Vt.T[:, :2]           # first two principal components, as columns
X2D = X_centered.dot(W2)   # project the centered data onto them

### Using Scikit-Learn

In [40]:
from sklearn.decomposition import PCA

In [41]:
pca = PCA(n_components = 2)
X2D = pca.fit_transform(X) #Scikit-learn takes care of centering the data for us

After fitting, the components_ attribute holds the transpose of Wd (e.g., the unit vector that defines the first principal component is equal to pca.components_.T[:, 0]).
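
As a sanity check, the manual SVD approach and Scikit-Learn's PCA should agree. The sketch below runs both on a smaller, seeded moons dataset (the sample size, the seed, and the `_demo` variable names are assumptions chosen for this illustration). Because each principal axis is only defined up to its sign, the comparison is done on absolute values.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA

X_demo, _ = make_moons(n_samples=1000, noise=0.4, random_state=42)

# Manual PCA: center the data, run SVD, project onto the first two PCs
X_demo_centered = X_demo - X_demo.mean(axis=0)
_, _, Vt_demo = np.linalg.svd(X_demo_centered, full_matrices=False)
X2D_manual = X_demo_centered @ Vt_demo.T[:, :2]

# Scikit-Learn PCA (centers the data internally)
pca_demo = PCA(n_components=2)
X2D_sklearn = pca_demo.fit_transform(X_demo)

# Compare absolute values, since the sign of each axis is arbitrary
same_axes = np.allclose(np.abs(pca_demo.components_), np.abs(Vt_demo))
same_projection = np.allclose(np.abs(X2D_sklearn), np.abs(X2D_manual))
print(same_axes, same_projection)
```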

### Explained Variance Ratio

Another useful piece of information is the *explained variance ratio* of each principal component. The ratio indicates the proportion of the dataset's variance that lies along each principal component.

In [42]:
pca.explained_variance_ratio_

array([0.74116403, 0.25883597])

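This output tells us that about 74.1% of the dataset's variance lies along the first PC and about 25.9% along the second. The two ratios add up to 100% because the dataset has only two features and both components are kept.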
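
The explained variance ratio can also be derived from the singular values of the centered data: the variance along each PC is proportional to the square of the corresponding singular value. The following sketch checks this against Scikit-Learn on a seeded moons dataset (sample size, seed, and variable names are assumptions for this illustration, so the numbers will differ from the output above).

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA

X_demo, _ = make_moons(n_samples=1000, noise=0.4, random_state=42)
X_demo_centered = X_demo - X_demo.mean(axis=0)

# Squared singular values are proportional to the variance along each PC
_, s_demo, _ = np.linalg.svd(X_demo_centered, full_matrices=False)
ratio_manual = s_demo**2 / np.sum(s_demo**2)

ratio_sklearn = PCA(n_components=2).fit(X_demo).explained_variance_ratio_
print(ratio_manual, ratio_sklearn)
```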

### Choosing the Right Number of Dimensions

Instead of arbitrarily choosing the number of dimensions to reduce down to, it is generally preferable to choose the number of dimensions that add up to a sufficiently large proportion of the variance (e.g., 95%), unless the reduction is done for data visualization purposes, in which case you will generally want to reduce down to 2 or 3 dimensions.

In [45]:
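pca = PCA()  # no n_components: keep all principal components
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
# Index of the first cumulative ratio >= 0.95, plus one to turn it into a count
d = np.argmax(cumsum >= 0.95) + 1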

In [48]:
print(f"Number of dimensions required to preserve 95% of the training set's variance: {d}")

Number of dimensions required to preserve 95% of the training set's variance: 2


In [49]:
# A float between 0.0 and 1.0 is interpreted as the ratio of variance to preserve
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)

Another option is to plot the cumulative explained variance as a function of the number of dimensions. There will usually be an elbow in the curve, where the explained variance stops growing fast.
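
A possible way to draw such a curve is sketched below. Since the moons dataset has only two features, this illustration assumes a different dataset, Scikit-Learn's bundled digits (64 features), and uses Matplotlib, which is not imported elsewhere in these notes.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumed example dataset: 8x8 digit images flattened to 64 features
X_digits, _ = load_digits(return_X_y=True)

pca_digits = PCA().fit(X_digits)
cumsum_digits = np.cumsum(pca_digits.explained_variance_ratio_)
d_digits = int(np.argmax(cumsum_digits >= 0.95)) + 1  # dimensions needed for 95%

fig, ax = plt.subplots()
ax.plot(np.arange(1, len(cumsum_digits) + 1), cumsum_digits)
ax.axhline(0.95, linestyle="--", color="gray")   # 95% variance threshold
ax.axvline(d_digits, linestyle=":", color="gray")  # matching number of dimensions
ax.set_xlabel("Number of dimensions")
ax.set_ylabel("Cumulative explained variance")
# In a script (outside a notebook), call plt.show() to display the figure
```

The dashed lines mark the 95% threshold and the corresponding number of dimensions, which makes it easier to judge whether the elbow lies before or after that point.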

### PCA for Compression