# Chapter 8: Dimensionality Reduction

**Reference:** Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Aurélien Géron)

---

## 1. Chapter Introduction

Many Machine Learning problems involve thousands or even millions of features for each training instance. Not only do all these features make training extremely slow, but they can also make it much harder to find a good solution, as we will see. This problem is often referred to as the **curse of dimensionality**.

Fortunately, in real-world problems, it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one. For example, consider the MNIST images (introduced in Chapter 3): the pixels on the image borders are almost always white, so you could completely drop these pixels from the training set without losing much information. Additionally, two neighboring pixels are often highly correlated: if you merge them into a single pixel (e.g., by taking the mean of the two pixel intensities), you will not lose much information.

Reducing dimensionality does cause some information loss (just like compressing an image to JPEG can degrade its quality), so even though it will speed up training, it may make your system perform slightly worse. It also makes your pipelines a bit more complex and thus harder to maintain. So you should first try to train your system with the original data, and consider dimensionality reduction only if training is too slow. In some cases, however, reducing the dimensionality of the training data may filter out some noise and unnecessary details and thus result in higher performance (but in general it won’t; it will only speed up training).

Apart from speeding up training, dimensionality reduction is also extremely useful for data visualization (or *DataViz*). Reducing the number of dimensions down to two (or three) makes it possible to plot a condensed view of a high-dimensional training set on a graph and often gain some important insights by visually detecting patterns, such as clusters.

In this chapter we will discuss the curse of dimensionality and get a sense of what goes on in high-dimensional space. Then, we will look at the two main approaches to dimensionality reduction (projection and manifold learning), and we will go through three of the most popular dimensionality reduction techniques: PCA, Kernel PCA, and LLE.

## 2. The Curse of Dimensionality

We are so used to living in three dimensions that our intuition fails us when we try to imagine a high-dimensional space. Even a basic 4D hypercube is incredibly hard to visualize in our mind, let alone a 200-dimensional ellipsoid bent in a 1,000-dimensional space.

It turns out that many things behave very differently in high-dimensional space. For example, if you pick a random point in a unit square (a $1 \times 1$ square), it will have only about a 0.4% chance of being located less than 0.001 from a border (in other words, it is very unlikely that a random point will be "extreme" along any dimension). But in a 10,000-dimensional unit hypercube, this probability is greater than 99.999999%. Most points in a high-dimensional hypercube are very close to the border.
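
Both probabilities follow from a simple product: a point stays more than 0.001 away from every border only if each of its coordinates falls in the interval [0.001, 0.999], which has length 0.998. A minimal sketch of that arithmetic, assuming coordinates are drawn independently and uniformly (the helper name `prob_near_border` is introduced here purely for illustration):

```python
def prob_near_border(n_dims, margin=0.001):
    """Probability that a uniform random point in the unit hypercube
    lies within `margin` of at least one border."""
    # Each coordinate independently avoids both borders with probability 1 - 2 * margin.
    return 1.0 - (1.0 - 2.0 * margin) ** n_dims

p_square = prob_near_border(2)           # the unit square
p_hypercube = prob_near_border(10_000)   # the 10,000-dimensional unit hypercube
print(f"2D:       {p_square:.4%}")
print(f"10,000D:  {p_hypercube:.10%}")
```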

Here is a more troublesome difference: if you pick two points randomly in a unit square, the distance between these two points will be, on average, roughly 0.52. If you pick two random points in a unit 3D cube, the average distance will be roughly 0.66. But what about two points picked randomly in a 1,000,000-dimensional hypercube? The average distance, believe it or not, will be about 408.25 (roughly $\sqrt{1{,}000{,}000/6}$)! This is quite counterintuitive: how can two points be so far apart when they both lie within the same unit hypercube? This fact implies that high-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other. This also means that a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations. In short, the more dimensions the training set has, the greater the risk of overfitting it.
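
These averages are easy to check numerically. The sketch below is a Monte Carlo estimate, not an exact computation: it samples pairs of uniform random points and averages their distances. It uses 1,000 dimensions rather than a million to keep memory use small, and compares the result with $\sqrt{n/6}$, which follows from the fact that each squared coordinate difference has an expected value of 1/6. The function name and sample sizes are arbitrary choices made for illustration.

```python
import numpy as np

def mean_pair_distance(n_dims, n_pairs, rng):
    """Monte Carlo estimate of the mean Euclidean distance between two
    points drawn uniformly from the unit hypercube in `n_dims` dimensions."""
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

rng = np.random.default_rng(42)
d2 = mean_pair_distance(2, 20_000, rng)
d3 = mean_pair_distance(3, 20_000, rng)
d1000 = mean_pair_distance(1_000, 2_000, rng)

print(f"2D: {d2:.3f}, 3D: {d3:.3f}")
print(f"1,000D: {d1000:.2f} vs sqrt(1000/6) = {np.sqrt(1000 / 6):.2f}")
```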

In theory, one solution to the curse of dimensionality could be to increase the size of the training set to reach a sufficient density of training instances. Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions. With just 100 features (significantly fewer than in the MNIST problem), you would need more training instances than the number of atoms in the observable universe in order for them to be within 0.1 of each other on average, assuming they were spread out uniformly across all dimensions.
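
One rough way to see the exponential growth is to count the points of a regular grid with a spacing of 0.1 along every axis. This is only a back-of-the-envelope heuristic, not the exact calculation behind the claim above, and the figure of $10^{80}$ atoms is a commonly quoted order-of-magnitude estimate that is assumed here:

```python
ATOMS_IN_OBSERVABLE_UNIVERSE = 10 ** 80  # assumed order-of-magnitude estimate

def grid_points_needed(n_dims, spacing=0.1):
    """Number of points on a regular grid with the given spacing
    covering the unit hypercube in `n_dims` dimensions."""
    points_per_axis = round(1 / spacing)
    return points_per_axis ** n_dims

for n_dims in (1, 2, 3, 10, 100):
    print(n_dims, grid_points_needed(n_dims))

exceeds_atoms = grid_points_needed(100) > ATOMS_IN_OBSERVABLE_UNIVERSE
print("More grid points than atoms for 100 features:", exceeds_atoms)
```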

## 3. Main Approaches for Dimensionality Reduction

Before we dive into specific dimensionality reduction algorithms, let’s look at the two main approaches to reducing dimensionality: **projection** and **manifold learning**.

### Projection

In most real-world problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated (as discussed earlier for MNIST). As a result, all training instances lie within (or close to) a much lower-dimensional *subspace* of the high-dimensional space.

For example, imagine a dataset of 3D points that are roughly located on a 2D plane (like a sheet of paper floating in 3D space). If we project every training instance perpendicularly onto this subspace (the plane), we get a new 2D dataset. We have just reduced the dimensionality from 3D to 2D. The axes of this new subspace correspond to new features $z_1$ and $z_2$ (the coordinates of the projections on the plane).

However, projection is not always the best approach. In many cases the subspace may twist and turn, such as in the famous **Swiss Roll** toy dataset. If you project the Swiss Roll onto a plane (e.g., by dropping the $x_3$ coordinate), you squash the different layers of the roll together, which mixes up the data points completely. What you really want is to unroll the Swiss Roll to obtain a 2D dataset.
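
The squashing effect can be made concrete with Scikit-Learn's `make_swiss_roll`, which also returns each point's position `t` along the roll. The sketch below drops the third column and counts pairs of points that end up close together in the projection even though they are far apart along the roll. The thresholds (0.5 and 3.0) are arbitrary illustrative values, and it is assumed that the third column plays the role of $x_3$.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

X_roll, t_roll = make_swiss_roll(n_samples=1000, noise=0.0, random_state=42)

# Naive projection: keep the first two columns, drop the third.
X_flat = X_roll[:, :2]

# Pairwise distances in the projection, and pairwise gaps along the roll.
proj_dist = np.linalg.norm(X_flat[:, None, :] - X_flat[None, :, :], axis=2)
t_gap = np.abs(t_roll[:, None] - t_roll[None, :])

# Pairs that look like close neighbors after projection
# but are far apart along the roll.
squashed = (proj_dist < 0.5) & (t_gap > 3.0)
n_squashed_pairs = int(np.triu(squashed, k=1).sum())
print("Pairs merged by the projection:", n_squashed_pairs)
```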

### Manifold Learning

The Swiss Roll is an example of a 2D **manifold**. Put simply, a 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space. More generally, a $d$-dimensional manifold is a part of an $n$-dimensional space (where $d < n$) that locally resembles a $d$-dimensional hyperplane. In the case of the Swiss Roll, $d=2$ and $n=3$: it locally resembles a 2D plane, but it is rolled in the third dimension.

Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called **Manifold Learning**. It relies on the *manifold assumption*, also called the *manifold hypothesis*, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. This assumption is very often empirically observed.

The manifold assumption is often accompanied by another implicit assumption: that the task at hand (e.g., classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold. For example, in the Swiss Roll, the decision boundary between two classes might be very complex in 3D space (a spiral shell), but a simple straight line in the unrolled 2D manifold space.

However, this implicit assumption does not always hold. For example, if the decision boundary is $x_1 = 5$ in 3D space (a vertical plane cutting through the roll), it will become very complex (a collection of disjoint segments) in the unrolled 2D space. So, reducing the dimensionality of the training set usually speeds up training, but it may not always lead to a better or simpler solution; it all depends on the dataset.

## 4. PCA (Principal Component Analysis)

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. First it identifies the hyperplane that lies closest to the data, and then it projects the data onto it.

### Preserving the Variance

Before you can project the training set onto a lower-dimensional hyperplane, you first need to select the right hyperplane. For example, consider a simple 2D dataset. You could project it onto the x-axis, the y-axis, or any other line. Which one is best?

It seems reasonable to select the axis that preserves the maximum amount of variance, as it will most likely lose less information than the other projections. Another way to justify this choice is that it is the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis. This is the rather simple idea behind PCA.
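
The two justifications are equivalent because, for centered data, the variance preserved along an axis plus the mean squared distance to that axis always equals the total variance. The following sketch checks this on a synthetic 2D dataset by sweeping candidate axes; the dataset and the 1-degree angular grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2D toy data, centered on the origin.
X_2d = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
X_2d = X_2d - X_2d.mean(axis=0)

total_variance = (X_2d ** 2).sum(axis=1).mean()

angles = np.linspace(0.0, np.pi, 180, endpoint=False)
preserved = np.empty_like(angles)
lost = np.empty_like(angles)
for idx, angle in enumerate(angles):
    u = np.array([np.cos(angle), np.sin(angle)])    # unit vector defining the axis
    coords = X_2d @ u                                # 1D coordinates along the axis
    residual = X_2d - np.outer(coords, u)            # part discarded by the projection
    preserved[idx] = (coords ** 2).mean()            # variance kept by the projection
    lost[idx] = (residual ** 2).sum(axis=1).mean()   # mean squared distance to the axis

best_by_variance = int(np.argmax(preserved))
best_by_distance = int(np.argmin(lost))
print("Best angle (max variance):", np.degrees(angles[best_by_variance]))
print("Best angle (min distance):", np.degrees(angles[best_by_distance]))
```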

### Principal Components

PCA identifies the axis that accounts for the largest amount of variance in the training set. It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of remaining variance. In higher dimensions, PCA would find a third axis, orthogonal to both previous axes, and so on. The unit vector that defines the $i^\text{th}$ axis is called the $i^\text{th}$ **principal component** (PC).

So how can you find the principal components of a training set? There is a standard matrix factorization technique called **Singular Value Decomposition (SVD)** that can decompose the training set matrix $\mathbf{X}$ into the matrix multiplication of three matrices $\mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T$, where $\mathbf{V}$ contains the unit vectors that define all the principal components that we are looking for. Note that PCA assumes the dataset is centered on the origin, so the mean of each feature must be subtracted from $\mathbf{X}$ before computing the SVD (Scikit-Learn's `PCA` class does this for you).

**Equation 8-1: Matrix V containing principal components**
$$ \mathbf{V} = \begin{pmatrix} | & | & & | \\ \mathbf{c}_1 & \mathbf{c}_2 & \dots & \mathbf{c}_n \\ | & | & & | \end{pmatrix} $$
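
As a minimal sketch, the principal components can be obtained with NumPy's `svd()` function. The toy dataset below is an assumption made for illustration; note that NumPy returns $\mathbf{V}^T$ rather than $\mathbf{V}$, so the principal components are the rows of `Vt`.

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy 3D dataset whose third feature is nearly a linear mix of the first two.
X_toy = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0, 0.3], [0.0, 1.0, 0.4]])
X_toy += 0.05 * rng.normal(size=X_toy.shape)

# PCA assumes the data is centered on the origin, so subtract the mean first.
X_centered = X_toy - X_toy.mean(axis=0)

# Each row of Vt (i.e., each column of V) is one principal component.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
c1 = Vt[0]   # first principal component
c2 = Vt[1]   # second principal component

print("c1 =", c1)
print("c2 =", c2)
print("Singular values:", s)
```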

### Projecting Down to d Dimensions

Once you have identified all the principal components, you can reduce the dimensionality of the dataset down to $d$ dimensions by projecting it onto the hyperplane defined by the first $d$ principal components. Selecting this hyperplane ensures that the projection will preserve as much variance as possible.

To project the training set onto the hyperplane and obtain a reduced dataset $\mathbf{X}_{d\text{-proj}}$ of dimensionality $d$, you simply compute the matrix multiplication of the training set matrix $\mathbf{X}$ by the matrix $\mathbf{W}_d$, defined as the matrix containing the first $d$ columns of $\mathbf{V}$.

**Equation 8-2: Projecting the training set down to d dimensions**
$$ \mathbf{X}_{d\text{-proj}} = \mathbf{X} \mathbf{W}_d $$
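
Equation 8-2 translates almost directly into NumPy. The sketch below builds $\mathbf{W}_d$ from the SVD of a centered toy dataset (an assumption made for illustration) and checks that the result agrees with Scikit-Learn's `PCA` up to the sign of each component, since the direction of a principal component is not unique.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(100, 3)) * np.array([5.0, 2.0, 0.1])   # shape (m, n), n = 3
X_demo_centered = X_demo - X_demo.mean(axis=0)

_, _, Vt = np.linalg.svd(X_demo_centered, full_matrices=False)
d = 2
W_d = Vt[:d].T                       # first d columns of V, shape (n, d)
X_d_proj = X_demo_centered @ W_d     # Equation 8-2, shape (m, d)

# Scikit-Learn centers the data itself; component signs may be flipped.
X_sklearn = PCA(n_components=d).fit_transform(X_demo)
same_up_to_sign = np.allclose(np.abs(X_d_proj), np.abs(X_sklearn))
print(X_d_proj.shape, same_up_to_sign)
```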

```python
import numpy as np
from sklearn.decomposition import PCA

# 1. Generate a 3D dataset
np.random.seed(42)
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1

angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

# 2. Apply PCA to reduce to 2 dimensions
# Scikit-Learn implements PCA using SVD and centers the data automatically.
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

print("Original shape:", X.shape)
print("Reduced shape:", X2D.shape)
# Each row of components_ is one principal component (a row of V^T).
print("Principal Components (V^T):\n", pca.components_)
```

### Explained Variance Ratio

Another very useful piece of information is the **explained variance ratio** of each principal component, available via the `explained_variance_ratio_` attribute. It indicates the proportion of the dataset’s variance that lies along each principal component.

```python
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
```

This tells you that roughly 84.2% of the dataset’s variance lies along the first PC, and 14.6% lies along the second PC. This leaves less than 1.2% for the third PC, so it is reasonable to assume that the third PC carries little information.
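
The explained variance ratio can also be recovered by hand from the singular values of the centered data: the variance along the $i^\text{th}$ PC is proportional to $s_i^2$, so each ratio is $s_i^2 / \sum_j s_j^2$. The following sketch checks this against Scikit-Learn on a toy dataset (the dataset is an illustrative assumption):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_var = rng.normal(size=(300, 3)) * np.array([4.0, 1.5, 0.3])
X_var_centered = X_var - X_var.mean(axis=0)

# Singular values only; the squared values are proportional to the variances.
s = np.linalg.svd(X_var_centered, compute_uv=False)
ratio_manual = s ** 2 / np.sum(s ** 2)

ratio_sklearn = PCA().fit(X_var).explained_variance_ratio_
print("Manual:      ", ratio_manual)
print("Scikit-Learn:", ratio_sklearn)
```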

### Choosing the Right Number of Dimensions

Instead of arbitrarily choosing the number of dimensions to reduce down to, it is generally preferable to choose the smallest number of dimensions whose explained variance ratios add up to a sufficiently large portion of the variance (e.g., 95%). The exception is data visualization, in which case you will generally want to reduce the dimensionality down to 2 or 3.

The following code computes PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 95% of the training set’s variance:

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Load MNIST Data
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(mnist.data, mnist.target, test_size=0.2, random_state=42)

pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
print("Dimensions required for 95% variance:", d)
```

You can then set `n_components=d` and run PCA again. But there is a much better option: instead of specifying the number of principal components you want to preserve, you can set `n_components` to be a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve:

```python
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
print("Reduced features:", X_reduced.shape[1])
```
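
To see that the two approaches agree without downloading MNIST, the sketch below repeats both on the much smaller digits dataset bundled with Scikit-Learn (`load_digits`, used here only as a lightweight stand-in). The fitted `n_components_` attribute reports how many components were actually kept.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data    # 1,797 images, 64 features each

# Approach 1: full PCA, then find where the cumulative ratio first reaches 95%.
cumsum_digits = np.cumsum(PCA().fit(X_digits).explained_variance_ratio_)
d_manual = int(np.argmax(cumsum_digits >= 0.95)) + 1

# Approach 2: let Scikit-Learn choose the number of components.
pca_95 = PCA(n_components=0.95).fit(X_digits)
d_auto = pca_95.n_components_

print("Manual:", d_manual, "Automatic:", d_auto)
print("Variance preserved:", cumsum_digits[d_manual - 1])
```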

### Randomized PCA

If you set the `svd_solver` hyperparameter to `"randomized"`, Scikit-Learn uses a stochastic algorithm called *Randomized PCA* that quickly finds an approximation of the first $d$ principal components. Its computational complexity is $O(m \times d^2) + O(d^3)$, instead of $O(m \times n^2) + O(n^3)$ for the full SVD approach, so it is dramatically faster when $d$ is much smaller than $n$.

```python
rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_train)
```
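
How good the approximation is depends on the data. As an illustrative sketch, the code below builds a synthetic dataset with only five strong directions (an assumption chosen to make the comparison clear-cut) and compares the explained variance ratios found by the full and randomized solvers. Setting `random_state` makes the stochastic solver reproducible.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic data: 2,000 instances, 100 features, but only 5 strong directions.
latent = rng.normal(size=(2000, 5)) * np.array([10.0, 8.0, 6.0, 4.0, 2.0])
mixing = np.linalg.qr(rng.normal(size=(100, 5)))[0].T    # 5 orthonormal rows
X_synth = latent @ mixing + 0.01 * rng.normal(size=(2000, 100))

full_pca = PCA(n_components=5, svd_solver="full").fit(X_synth)
rnd_pca_demo = PCA(n_components=5, svd_solver="randomized",
                   random_state=42).fit(X_synth)

print("Full:      ", full_pca.explained_variance_ratio_)
print("Randomized:", rnd_pca_demo.explained_variance_ratio_)
```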

### Incremental PCA

One problem with the preceding implementations of PCA is that they require the whole training set to fit in memory in order for the algorithm to run. Fortunately, *Incremental PCA* (IPCA) algorithms have been developed. They allow you to split the training set into mini-batches and feed an IPCA algorithm one mini-batch at a time. This is useful for large training sets and for applying PCA online (i.e., on the fly, as new instances arrive).

```python
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)

X_reduced = inc_pca.transform(X_train)
```
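
Incremental PCA only approximates the result of regular PCA. The sketch below gives a feel for how close the two can be on a small synthetic dataset with four dominant directions. The dataset and the batch count are illustrative assumptions, and the gap will generally be larger on real data.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(3000, 4)) * np.array([9.0, 6.0, 3.0, 1.5])
mixing = np.linalg.qr(rng.normal(size=(30, 4)))[0].T     # 4 orthonormal rows
X_stream = latent @ mixing + 0.01 * rng.normal(size=(3000, 30))

batch_pca = PCA(n_components=4).fit(X_stream)

inc_pca_demo = IncrementalPCA(n_components=4)
for X_batch in np.array_split(X_stream, 15):     # 15 mini-batches of 200 instances
    inc_pca_demo.partial_fit(X_batch)

gap = np.abs(batch_pca.explained_variance_ratio_
             - inc_pca_demo.explained_variance_ratio_).max()
print("Largest gap in explained variance ratio:", gap)
```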

## 5. Kernel PCA

In Chapter 5 we discussed the kernel trick, a mathematical technique that implicitly maps instances into a very high-dimensional space (called the *feature space*), enabling nonlinear classification and regression with Support Vector Machines. Recall that a linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space.

It turns out that the same trick can be applied to PCA, making it possible to perform complex nonlinear projections for dimensionality reduction. This is called **Kernel PCA** (kPCA). It is often good at preserving clusters of instances after projection, or sometimes even unrolling datasets that lie close to a twisted manifold.

The following code uses Scikit-Learn’s `KernelPCA` class to perform kPCA with an RBF kernel:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

# Generate Swiss Roll
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

# Apply Kernel PCA with RBF kernel
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)

print("Kernel PCA Reduction Complete.")
```
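
To make the kernel trick less abstract, here is a from-scratch sketch of what kPCA with an RBF kernel computes: it never builds the feature space explicitly, and instead eigendecomposes the centered kernel matrix. This is a simplified illustration on a small sample, not a description of Scikit-Learn's internals, though its output is compared with `KernelPCA` up to the sign of each component.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

X_small, _ = make_swiss_roll(n_samples=150, noise=0.2, random_state=42)
gamma = 0.04

# RBF kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2).
sq_dists = ((X_small[:, None, :] - X_small[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-gamma * sq_dists)

# Center the kernel matrix, which amounts to centering the data in feature space.
n = K.shape[0]
ones = np.full((n, n), 1.0 / n)
K_centered = K - ones @ K - K @ ones + ones @ K @ ones

# The top eigenvectors, scaled by the square roots of their eigenvalues,
# give the coordinates of the training instances after projection.
eigvals, eigvecs = np.linalg.eigh(K_centered)    # eigenvalues in ascending order
top = np.argsort(eigvals)[::-1][:2]
X_kpca_manual = eigvecs[:, top] * np.sqrt(eigvals[top])

X_kpca_sklearn = KernelPCA(n_components=2, kernel="rbf",
                           gamma=gamma).fit_transform(X_small)
matches = np.allclose(np.abs(X_kpca_manual), np.abs(X_kpca_sklearn), atol=1e-6)
print("Matches KernelPCA up to sign:", matches)
```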

### Selecting a Kernel and Tuning Hyperparameters

As kPCA is an unsupervised learning algorithm, there is no obvious performance measure to help you select the best kernel and hyperparameter values. However, dimensionality reduction is often a preparation step for a supervised learning task (e.g., classification), so you can use grid search to select the kernel and hyperparameters that lead to the best performance on that task.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Binary classification target for the demo. t (the position along the roll)
# ranges from about 4.7 to 14.1, so a threshold of 6.9 yields both classes.
y = t > 6.9

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression())
])

param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"]
}]

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print("Best params:", grid_search.best_params_)
```

## 6. LLE (Locally Linear Embedding)

**Locally Linear Embedding** (LLE) is another powerful **nonlinear dimensionality reduction** (NLDR) technique. It is a Manifold Learning technique that does not rely on projections, unlike PCA. In a nutshell, LLE works by first measuring how each training instance linearly relates to its closest neighbors, and then looking for a low-dimensional representation of the training set where these local relationships are best preserved (more details shortly). This makes it particularly good at unrolling twisted manifolds, especially when there is not too much noise.

**Step 1: Linear Modeling**
For each training instance $\mathbf{x}^{(i)}$, the algorithm identifies its $k$ closest neighbors. It then tries to reconstruct $\mathbf{x}^{(i)}$ as a linear function of these neighbors. Specifically, it finds the weights $w_{i,j}$ such that the squared distance between $\mathbf{x}^{(i)}$ and $\sum_{j=1}^{m} w_{i,j} \mathbf{x}^{(j)}$ is minimized, subject to two constraints: $w_{i,j}=0$ if $\mathbf{x}^{(j)}$ is not one of the $k$ closest neighbors of $\mathbf{x}^{(i)}$, and the weights of each instance are normalized so that $\sum_{j=1}^{m} w_{i,j} = 1$.
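
For a single instance, this step can be sketched in a few lines of NumPy. The sketch assumes the usual normalization in which the weights of each instance sum to 1, in which case the minimization reduces to solving a small linear system built from the neighbors' offsets. The toy data (points lying on a plane in 3D) and the small ridge term added for numerical stability are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy data lying exactly on a plane in 3D, so local linear fits can be very accurate.
uv = rng.uniform(-1.0, 1.0, size=(50, 2))
X_plane = uv @ np.array([[1.0, 0.0, 0.5], [0.0, 1.0, -0.3]])

i, k = 0, 5
dists = np.linalg.norm(X_plane - X_plane[i], axis=1)
neighbors = np.argsort(dists)[1:k + 1]       # skip position 0: the instance itself

# Local Gram matrix of the neighbors, expressed relative to x_i.
diffs = X_plane[neighbors] - X_plane[i]      # shape (k, n)
G = diffs @ diffs.T
G += 1e-3 * np.trace(G) * np.eye(k)          # ridge term: G is singular when k > n

# With weights constrained to sum to 1, minimizing the squared
# reconstruction error reduces to solving G w = 1 and rescaling.
w = np.linalg.solve(G, np.ones(k))
w /= w.sum()

reconstruction = w @ X_plane[neighbors]
error = np.linalg.norm(X_plane[i] - reconstruction)
print("weights:", w)
print("sum of weights:", w.sum(), "reconstruction error:", error)
```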

**Step 2: Dimensionality Reduction**
After the first step, the weight matrix $\mathbf{W}$ (containing the weights $w_{i,j}$) encodes the local linear relationships between the training instances. Now the second step is to map the training instances into a $d$-dimensional space (where $d < n$) while preserving these local relationships as much as possible. If $\mathbf{z}^{(i)}$ is the image of $\mathbf{x}^{(i)}$ in this $d$-dimensional space, then we want the squared distance between $\mathbf{z}^{(i)}$ and $\sum_{j=1}^{m} w_{i,j} \mathbf{z}^{(j)}$ to be as small as possible.
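
Both steps can be put together in a compact, unoptimized sketch. It assumes the standard formulation in which the weights of each instance sum to 1 and the embedding is read off the bottom eigenvectors of $(\mathbf{I} - \mathbf{W})^T (\mathbf{I} - \mathbf{W})$, skipping the constant one. The helper names, the toy arc dataset, and the choice of $k = 2$ are illustrative assumptions, and this is not how Scikit-Learn implements LLE internally.

```python
import numpy as np

def lle_weights(X, k, reg=1e-3):
    """Step 1: reconstruction weights, one row per instance (each row sums to 1)."""
    m = X.shape[0]
    W = np.zeros((m, m))
    for i in range(m):
        dists = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dists)[1:k + 1]          # k closest neighbors, excluding x_i
        diffs = X[nbrs] - X[i]
        G = diffs @ diffs.T
        G += reg * np.trace(G) * np.eye(k)         # small ridge term for stability
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()
    return W

def lle_embed(W, d):
    """Step 2: coordinates Z minimizing sum_i ||z_i - sum_j w_ij z_j||^2."""
    m = W.shape[0]
    I_minus_W = np.eye(m) - W
    M = I_minus_W.T @ I_minus_W
    eigvals, eigvecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    # The first eigenvector is constant (eigenvalue close to 0), so skip it.
    return eigvecs[:, 1:d + 1]

# Toy manifold: a 1D arc embedded in 2D, parameterized by the angle theta.
theta = np.linspace(0.0, 1.5 * np.pi, 120)
X_arc = np.column_stack([np.cos(theta), np.sin(theta)])

W = lle_weights(X_arc, k=2)
Z = lle_embed(W, d=1)

cost = np.sum((Z - W @ Z) ** 2)
corr = np.corrcoef(Z[:, 0], theta)[0, 1]
print("Embedding cost:", cost)
print("Correlation between embedding and position along the arc:", corr)
```

On this toy arc, the 1D embedding should track the position along the curve (up to sign and scale), which is what "unrolling" means here.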

Here is how to use Scikit-Learn’s `LocallyLinearEmbedding` class to unroll the Swiss roll:

```python
from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_reduced = lle.fit_transform(X)

print("LLE Reduction Complete.")
```