# Dimensionality Reduction

Reasons to reduce dimensionality:
- to speed up training by removing unimportant features
- to allow data visualisation (DataViz)

Caveat: it does not always lead to the best or simplest solution, depending on the underlying data.

Problems with high dimensions:
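- The chance that a point is extreme (close to a border) along at least one dimension grows with the number of dimensions.
- The distance between randomly selected points increases, so training instances tend to be far apart.
- New instances are likely to be far from every training instance, so predictions rely on much larger extrapolations.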
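
The distance effect can be checked numerically. The following is a minimal sketch, assuming points drawn uniformly from a unit hypercube; the helper `mean_pairwise_distance` and the chosen dimensions are illustrative, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pairwise_distance(n_dims, n_pairs=2000):
    """Estimate the mean Euclidean distance between two random points
    drawn uniformly from the unit hypercube in n_dims dimensions."""
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

# Hypothetical choice of dimensions, picked only to show the trend
distances = {d: mean_pairwise_distance(d) for d in (2, 3, 100, 1000)}
for d, dist in distances.items():
    print(f"{d:>5} dimensions: mean distance ~ {dist:.2f}")
```

The mean distance grows roughly with the square root of the number of dimensions, which is why high-dimensional training sets tend to be very sparse.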

# Approaches for Dimensionality Reduction

- Projection - for data that can be projected perpendicularly to a smaller subspace
- Manifold learning - for data that is "twisted" in a manifold

## Projection

![](images/projection1.png)
![](images/projection2.png)

## Manifold Learning

![](images/manifold1.png)
![](images/manifold2.png)
![](images/manifold3.png)

# PCA 

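Principal Component Analysis (PCA) is the most popular dimensionality reduction algorithm. It identifies the axes along which the data has the largest variance and projects the data onto them.

It is important to choose the right hyperplane for the projection. The right hyperplane:
- preserves the maximum variance
- minimises the mean squared distance between the original data and the projected data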
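
The two criteria are equivalent: for any projection direction, the preserved variance plus the mean squared distance to the projection equals the total variance, so maximising one minimises the other. A minimal sketch on hypothetical 2D toy data (the names `data` and `split_variance` are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 2D toy data, stretched along one direction
data = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
data_centered = data - data.mean(axis=0)
total_variance = (data_centered ** 2).sum(axis=1).mean()

def split_variance(angle):
    """Project onto the unit vector at `angle`; return (preserved, lost) variance."""
    u = np.array([np.cos(angle), np.sin(angle)])
    projections = data_centered @ u                      # coordinate along the axis
    residuals = data_centered - np.outer(projections, u)  # what the projection discards
    preserved = (projections ** 2).mean()
    lost = (residuals ** 2).sum(axis=1).mean()            # mean squared distance
    return preserved, lost

angles = np.linspace(0, np.pi, 5, endpoint=False)
for angle in angles:
    preserved, lost = split_variance(angle)
    print(f"angle={angle:.2f}  preserved={preserved:.3f}  lost={lost:.3f}  "
          f"sum={preserved + lost:.3f}")
print(f"total variance = {total_variance:.3f}")
```

The `sum` column is the same for every angle, so the axis with the largest preserved variance is also the one with the smallest mean squared distance.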

In [19]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

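# Scikit-Learn ≥0.20 is required
import sklearn
from packaging import version  # comparing raw strings would misorder e.g. "0.9" and "0.20"
assert version.parse(sklearn.__version__) >= version.parse("0.20")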

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "dim_reduction"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

__A 3D Dataset__

In [20]:
np.random.seed(4)
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1

angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

# Principal Components

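`np.linalg.svd()` performs Singular Value Decomposition (SVD), which can be used to find the principal components "manually".

PCA assumes the data is centred on the origin, so the mean is subtracted first. The following then extracts the two unit vectors that define the first two PCs.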

In [21]:
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt.T[:, 0]
c2 = Vt.T[:, 1]
m, n = X.shape

In [22]:
S = np.zeros(X_centered.shape)
S[:n, :n] = np.diag(s)

In [23]:
np.allclose(X_centered, U.dot(S).dot(Vt))

True
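
As an aside, the zero-padded `S` matrix is only needed because `np.linalg.svd()` returns a full `(m, m)` matrix `U` by default. A minimal sketch of the economy-size alternative using `full_matrices=False`, run on a hypothetical stand-in dataset (`X_demo` is not the dataset above):

```python
import numpy as np

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(60, 3))              # hypothetical stand-in for a 3D dataset
X_demo_centered = X_demo - X_demo.mean(axis=0)

# Economy-size SVD: U has shape (m, n) instead of (m, m)
U_thin, s_thin, Vt_thin = np.linalg.svd(X_demo_centered, full_matrices=False)
print(U_thin.shape, s_thin.shape, Vt_thin.shape)

# Scaling the columns of U by s is equivalent to multiplying by diag(s)
X_rebuilt = (U_thin * s_thin) @ Vt_thin
print(np.allclose(X_demo_centered, X_rebuilt))
```

The principal components in `Vt_thin` are the same as with the full decomposition; only the shape of `U` changes.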

Projecting down to the plane defined by the first two principal components:

In [31]:
W2 = Vt.T[:, :2]
X2D = X_centered.dot(W2)
X2D_using_svd = X2D

With Scikit-Learn

In [32]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

The `components_` attribute holds the transpose of $W_d$, so the unit vector with the first principal component is `pca.components_.T[:, 0]`.
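
The direction of each principal component is only defined up to sign, so `components_` may match the rows of `Vt` or their negatives depending on the solver and library version. A minimal sketch on hypothetical data (`X_demo`, `pca_demo` and `Vt_demo` are illustrative names):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Hypothetical 3D data with clearly different variance along each axis
X_demo = rng.normal(size=(60, 3)) @ np.diag([3.0, 1.0, 0.2])
X_demo_centered = X_demo - X_demo.mean(axis=0)

_, _, Vt_demo = np.linalg.svd(X_demo_centered)
pca_demo = PCA(n_components=2).fit(X_demo)

for i in range(2):
    # Both vectors are unit length, so |dot product| == 1 means "same axis"
    alignment = abs(pca_demo.components_[i] @ Vt_demo[i])
    print(f"PC{i + 1}: |cos(angle)| between sklearn and SVD = {alignment:.6f}")
```

When comparing projections from two implementations, a column that differs only by sign therefore describes the same axis.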

In [33]:
X2D[:5]

array([[-1.26203346, -0.42067648],
       [ 0.08001485,  0.35272239],
       [-1.17545763, -0.36085729],
       [-0.89305601,  0.30862856],
       [-0.73016287,  0.25404049]])

In [34]:
X2D_using_svd[:5]

array([[-1.26203346, -0.42067648],
       [ 0.08001485,  0.35272239],
       [-1.17545763, -0.36085729],
       [-0.89305601,  0.30862856],
       [-0.73016287,  0.25404049]])

In [38]:
np.allclose(X2D, X2D_using_svd)

True

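Recover the 3D points by applying the inverse transformation. The projection discards the variance along the third axis, so the recovered points lie on the projection plane and are close to, but not identical to, the original points: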

In [39]:
X3D_inv = pca.inverse_transform(X2D)

In [40]:
np.allclose(X3D_inv, X)

False
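
How much was lost can be quantified with the reconstruction error, the mean squared distance between the original and recovered points. The sketch below uses hypothetical data and also shows that, with the default `whiten=False`, `inverse_transform` amounts to multiplying by `components_` and adding back `mean_`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical 3D data; the third axis carries only a little variance
X_demo = rng.normal(size=(60, 3)) @ np.diag([3.0, 1.0, 0.2])

pca_demo = PCA(n_components=2)
X_demo_2d = pca_demo.fit_transform(X_demo)
X_demo_recovered = pca_demo.inverse_transform(X_demo_2d)

# Manual inverse transform: map back to 3D, then undo the centring
X_demo_manual = X_demo_2d @ pca_demo.components_ + pca_demo.mean_
print(np.allclose(X_demo_recovered, X_demo_manual))

# Reconstruction error: mean squared distance to the original points
reconstruction_error = np.mean(np.sum((X_demo_recovered - X_demo) ** 2, axis=1))
print(reconstruction_error)
```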

## Explained Variance Ratio

The proportion of the dataset's variance that lies along each principal component.

In [10]:
pca.explained_variance_ratio_

array([0.84248607, 0.14631839])
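
These ratios can also be derived from the singular values: the variance along each principal component is proportional to the square of its singular value. A minimal sketch on hypothetical data (`X_demo` and `ratio_from_svd` are illustrative names):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 3)) @ np.diag([3.0, 1.0, 0.2])  # hypothetical 3D data
X_demo_centered = X_demo - X_demo.mean(axis=0)

# Singular values only; squared and normalised they give the variance ratios
s_demo = np.linalg.svd(X_demo_centered, compute_uv=False)
ratio_from_svd = s_demo ** 2 / np.sum(s_demo ** 2)

pca_demo = PCA(n_components=2).fit(X_demo)
print(ratio_from_svd[:2])
print(pca_demo.explained_variance_ratio_)
```

The ratios for all components sum to 1, so whatever the kept components do not explain is the share of variance lost in the projection.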

In [12]:
pca.noise_variance_

0.010342716399506999

## Automatically choosing number of dimensions

Compute the minimum number of dimensions required to preserve 95% of the training set's variance:

In [13]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Download MNIST (70,000 images, 784 features each) as NumPy arrays
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.target = mnist.target.astype(np.uint8)  # labels arrive as strings

X = mnist["data"]
y = mnist["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [14]:
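pca = PCA()  # no n_components: keep every component
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
# argmax returns the first index where the condition is True; +1 turns it into a count
d = np.argmax(cumsum >= 0.95) + 1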

In [15]:
d

154
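
Scikit-Learn can also pick the number of dimensions itself: a float between 0 and 1 passed as `n_components` is interpreted as the ratio of variance to preserve. A minimal sketch, assuming the small bundled digits dataset as a stand-in for MNIST so that nothing has to be downloaded:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data   # 1,797 samples, 64 features; stand-in for MNIST

# Two-step approach: fit a full PCA, then look up the cumulative variance
pca_full = PCA().fit(X_digits)
cumsum_digits = np.cumsum(pca_full.explained_variance_ratio_)
d_digits = np.argmax(cumsum_digits >= 0.95) + 1

# One-step approach: a float in (0, 1) is read as the variance ratio to preserve
pca_95 = PCA(n_components=0.95)
X_digits_reduced = pca_95.fit_transform(X_digits)

print(d_digits, pca_95.n_components_, X_digits_reduced.shape)
```

Both approaches should select the same number of components; the one-step version also returns the reduced dataset directly.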