# The Curse of Dimensionality


High-dimensional datasets tend to be very sparse: most training instances lie far away from one another. A new instance will therefore likely be far from any training instance, which makes predictions much less reliable and increases the risk of overfitting.
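To make the sparsity concrete, here is a minimal sketch that measures the average distance between random points in a unit hypercube as the number of dimensions grows. The helper name `mean_pairwise_distance`, the point count, and the seed are assumptions chosen for illustration. For uniformly distributed points the average distance should be about 0.52 in 2D and grow roughly like sqrt(d / 6) in d dimensions.

```python
import numpy as np

def mean_pairwise_distance(n_dims, n_points=200, seed=0):
    """Average Euclidean distance between random points in the unit hypercube."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    # Squared distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq_norms = (points ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * points @ points.T
    sq_dists = np.maximum(sq_dists, 0)  # clip tiny negatives caused by round-off
    upper = np.triu_indices(n_points, k=1)  # count each pair once
    return np.sqrt(sq_dists[upper]).mean()

for d in (2, 10, 100, 1000):
    print(d, round(mean_pairwise_distance(d), 2))
```

The number of points stays fixed while the distances keep growing, which is one way to see why a new instance tends to land far from every training instance.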


# Main Approaches

## Projection


## Manifold

A 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space.
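As an illustration, the sketch below rolls a flat 2D sheet into 3D, in the style of a Swiss roll. The variable names (`t`, `h`, `X_roll`) and the numeric ranges are assumptions picked for the example. Every 3D point is fully described by just two intrinsic coordinates, which is what makes the shape a 2D manifold.

```python
import numpy as np

rng = np.random.default_rng(42)
n_points = 1000

# Intrinsic 2D coordinates: position along the roll (t) and across it (h)
t = 1.5 * np.pi * (1 + 2 * rng.random(n_points))
h = 21 * rng.random(n_points)

# Embed the 2D sheet in 3D by rolling it up around the y axis
X_roll = np.column_stack((t * np.cos(t), h, t * np.sin(t)))

# The radius in the x-z plane recovers t, so (t, h) can be read back from 3D
t_recovered = np.hypot(X_roll[:, 0], X_roll[:, 2])
print(X_roll.shape, np.allclose(t_recovered, t))
```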


# PCA

## Preserving the variance

PCA selects the axis that preserves the maximum amount of variance, since projecting onto it will most likely lose less information than any other projection.
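The idea can be checked with a brute-force sketch: project an elongated 2D point cloud onto many candidate directions and keep the one with the largest variance. The synthetic cloud, the 30-degree rotation, and the helper `projected_variance` are assumptions made up for this example. PCA finds the same direction analytically rather than by scanning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Elongated 2D cloud: wide along one direction, narrow along the other
X_demo = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
theta = np.pi / 6  # rotate the long axis to 30 degrees
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
X_demo = X_demo @ rotation.T
X_demo -= X_demo.mean(axis=0)

def projected_variance(data, angle):
    """Variance of the data after projecting onto the unit vector at `angle`."""
    direction = np.array([np.cos(angle), np.sin(angle)])
    return (data @ direction).var()

candidate_angles = np.linspace(0, np.pi, 181)  # 1-degree steps
variances = [projected_variance(X_demo, a) for a in candidate_angles]
best_angle = candidate_angles[int(np.argmax(variances))]
print(np.degrees(best_angle))
```

The best angle should land close to the 30 degrees used to build the cloud, and the variance along it should be far larger than along the perpendicular direction.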

## Principal Components


In [4]:
import sklearn
import numpy as np
import os

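# Build a noisy 3D dataset that lies close to a 2D plane
np.random.seed(4)
m = 60               # number of instances
w1, w2 = 0.1, 0.3    # weights that tilt the plane
noise = 0.1

angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
# The third feature is a linear combination of the first two, plus noise
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)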

## PCA using SVD decomposition

Eigenvalue and Eigenvector: https://www.youtube.com/watch?v=PFDu9oVAE-g&ab_channel=3Blue1Brown
SVD lecture: https://www.youtube.com/watch?v=mBcLRGuAFUk&t=8s&ab_channel=MITOpenCourseWare



In [6]:
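# PCA assumes the data is centered around the origin
X_centered = X - X.mean(axis=0)
# The rows of Vt (columns of Vt.T) are the principal components
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt.T[:, 0]  # first principal component
c2 = Vt.T[:, 1]  # second principal component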


In [7]:
m, n = X.shape
# Place the singular values on the diagonal of an m x n matrix
S = np.zeros(X_centered.shape)
S[:n, :n] = np.diag(s)

In [8]:
np.allclose(X_centered, U.dot(S).dot(Vt))


True

In [14]:
# Project onto the plane spanned by the first two principal components
W2 = Vt.T[:, :2]
X2D = X_centered.dot(W2)
X2D_using_svd = X2D

## PCA using Scikit-Learn

In [11]:
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

# PCA centers the data itself, so X is passed in uncentered
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

In [12]:
X2D[:5]

array([[ 1.26203346,  0.42067648],
       [-0.08001485, -0.35272239],
       [ 1.17545763,  0.36085729],
       [ 0.89305601, -0.30862856],
       [ 0.73016287, -0.25404049]])

In [15]:
X2D_using_svd[:5]

array([[-1.26203346, -0.42067648],
       [ 0.08001485,  0.35272239],
       [-1.17545763, -0.36085729],
       [-0.89305601,  0.30862856],
       [-0.73016287,  0.25404049]])

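The two projections are identical except that both axes are flipped. The sign of each principal component is arbitrary, so the comparison has to negate one of them.

In [17]:
np.allclose(X2D, -X2D_using_svd)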

True

In [18]:
X3D_inv = pca.inverse_transform(X2D)
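`inverse_transform` maps the 2D projection back into the original 3D space. As a hedged sketch of what that amounts to, the block below does the equivalent reconstruction by hand from the SVD projection (undo the projection, then undo the centering) and measures how much information was lost. The names `X3D_inv_using_svd` and `reconstruction_error` are assumptions for this example. The dataset is rebuilt so the block runs on its own.

```python
import numpy as np

# Rebuild the 3D dataset exactly as in the cells above
np.random.seed(4)
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1
angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles) / 2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

# Project to 2D with the SVD approach
X_mean = X.mean(axis=0)
X_centered = X - X_mean
U, s, Vt = np.linalg.svd(X_centered)
W2 = Vt.T[:, :2]
X2D_using_svd = X_centered @ W2

# Map back to 3D: undo the projection, then undo the centering
X3D_inv_using_svd = X2D_using_svd @ W2.T + X_mean

# Mean squared distance between the original and reconstructed points
reconstruction_error = np.mean(np.sum(np.square(X3D_inv_using_svd - X), axis=1))
print(reconstruction_error)

# The lost information is the variance along the dropped third component
print(np.isclose(reconstruction_error, s[2] ** 2 / m))
```

The reconstruction does not depend on the sign of the components, because each flipped axis is applied twice (once when projecting, once when mapping back).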

In [19]:
pca.components_

array([[-0.93636116, -0.29854881, -0.18465208],
       [ 0.34027485, -0.90119108, -0.2684542 ]])
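Each row of `components_` is one principal component expressed in the original feature space. A small sketch, assuming the first two rows of `Vt` from the SVD approach play the same role up to sign, checks that these rows are unit vectors and orthogonal to each other. The names `components` and `gram` are assumptions for this example, and the dataset is rebuilt so the block runs on its own.

```python
import numpy as np

# Rebuild the 3D dataset exactly as in the cells above
np.random.seed(4)
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1
angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles) / 2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)
components = Vt[:2]  # SVD analogue of pca.components_, up to sign

# Gram matrix of the rows: identity means unit length and mutual orthogonality
gram = components @ components.T
print(np.allclose(gram, np.eye(2)))
```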