# Dimensionality reduction

The number of features in a real-life project can reach thousands or even millions, which can make training extremely slow. To mitigate that problem
we can reduce the number of features, which is called _dimensionality reduction_ (for example, in our MNIST image dataset each pixel of the 28x28 image is a
feature, and we have observed that the white pixels are not important, so we can eliminate them).  
__Important Note__: We have to keep in mind that reducing the number of features leads to a loss of information in most cases.  
Reducing the dimensions of a dataset is also helpful for data visualization, which is essential for communicating with non-developers.
In this chapter we will present the most common ways to reduce the dimensions of a dataset.

## Curse of Dimensionality

As the number of features in a dataset increases, the amount of data required to maintain the same level of statistical significance grows exponentially.
This phenomenon is known as the __curse of dimensionality__ and it can lead to overfitting, poor model performance, and increased computational complexity.
Many methods have been developed to mitigate this problem; we will discuss them in the next sections.
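
One way to build intuition for this sparsity is to measure how far apart random points sit as dimensions are added. The snippet below is only an illustrative sketch: the helper `mean_pairwise_distance` and the sample sizes are assumptions made for this example, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_pairwise_distance(n_dims, n_pairs=2_000):
    """Average Euclidean distance between random point pairs in the unit hypercube."""
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

# Compare a few dimensionalities: the same number of points gets sparser and sparser
for n_dims in (2, 3, 100, 1_000):
    print(n_dims, round(mean_pairwise_distance(n_dims), 2))
```

The average distance keeps growing with the number of dimensions, so a fixed number of instances covers the space more and more thinly.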

## Projection

Instances in a dataset are often not spread out uniformly across all dimensions; most of them lie within, or close to, a lower-dimensional subspace of
the feature space. In simpler words, most instances of a 3-dimensional dataset may lie close to a plane instead of spreading across all 3 dimensions.
The idea is simply to take those instances and express their coordinates in the lower-dimensional subspace that they fit.
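
As a minimal sketch of this idea, the snippet below builds a synthetic 3D dataset that lies close to a plane and projects it onto that plane. The plane (spanned by `b1` and `b2`), the noise level and the variable names are all assumptions chosen for illustration; in practice the subspace is not known in advance and has to be found, for example with PCA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two orthonormal vectors spanning a plane in 3D (chosen arbitrarily for this sketch)
b1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
b2 = np.array([0.0, 0.0, 1.0])
B = np.column_stack([b1, b2])  # shape (3, 2)

# 500 points on the plane, plus a little noise in all 3 directions
coords = rng.normal(size=(500, 2))
X3D = coords @ B.T + 0.05 * rng.normal(size=(500, 3))

# Projection: express each point in the plane's own 2D coordinate system
X2D = X3D @ B  # shape (500, 2)

# Map back to 3D to measure how much was lost by dropping the third direction
X3D_back = X2D @ B.T
mean_error = np.linalg.norm(X3D - X3D_back, axis=1).mean()
print(X2D.shape, mean_error)
```

Because the points were generated close to the plane, the reconstruction error stays small compared with the spread of the data.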

## Manifold

In many cases it is not possible to simply project the instances of a dataset onto a lower-dimensional subspace, because the dataset rolls in on itself, as in the
_swiss roll_ toy dataset. Projecting the instances onto a 2D plane would squash different layers of the roll together, resulting in a great loss of information. The solution is
to unroll the instances onto a plane instead. The main motivation for manifold learning is that many high-dimensional datasets have an intrinsic 
low-dimensional structure that is not readily apparent in the original high-dimensional representation. By identifying and preserving this low-dimensional 
manifold structure, manifold learning techniques can provide a more meaningful and efficient representation of the data. Now that we know the two most commonly
used approaches for dealing with high-dimensional datasets, we are going to define and implement the algorithms built on them.

### PCA (Principal Component Analysis)

The approach of this algorithm is simple. It first centers the data by subtracting the mean of each feature. Then PCA finds new axes (called __principal components__) that line up with the directions along which the data varies the most. The principal components are ordered
from the one that captures the most variance to the one that captures the least. The original high-dimensional data is then projected onto the top few 
principal components, effectively reducing the number of dimensions. The key benefits of PCA are that it removes redundant or less important features, 
making the data easier to work with; it preserves most of the variance of the original high-dimensional data; the principal components are 
uncorrelated, which can be useful for further analysis; and it enables data visualization by reducing the dimensions to 2 or 3. To find the principal
components we can use a matrix factorization technique called __SVD__ (Singular Value Decomposition): we decompose the centered data matrix $X$ into the product
of 3 matrices, $X = \mathsf{U} \Sigma \mathsf{V}^\top$, where the columns of $\mathsf{V}$ are the unit vectors that define the principal components. Note that
NumPy provides this decomposition as _np.linalg.svd()_. We can now project our dataset onto the hyperplane defined by the first $d$ principal components. To do that,
we multiply the data matrix by $W_d$, the matrix containing the first $d$ columns of $\mathsf{V}$:
$$X_{d\text{-proj}} = XW_d$$
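
The SVD route can be sketched directly with NumPy. This is a minimal illustration on a synthetic dataset; the data, the mixing matrix and the variable names are assumptions made for the example. Note that _np.linalg.svd()_ returns $\mathsf{V}^\top$ rather than $\mathsf{V}$, so the principal components are its rows.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dataset: 200 instances, 3 correlated features (synthetic, for illustration only)
mixing = np.array([[2.0, 0.5, 0.1],
                   [0.0, 1.0, 0.2],
                   [0.0, 0.0, 0.1]])
X = rng.normal(size=(200, 3)) @ mixing

# PCA assumes centered data, so subtract the mean of each feature first
X_centered = X - X.mean(axis=0)

# s holds the singular values (in decreasing order), Vt is the transpose of V
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

d = 2
W_d = Vt[:d].T                 # first d principal components as columns, shape (3, 2)
X_d_proj = X_centered @ W_d    # projected dataset, shape (200, 2)

# Share of the total variance carried by each principal component
explained_variance_ratio = s**2 / np.sum(s**2)
print(X_d_proj.shape, explained_variance_ratio)
```
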
Here is a code implementation using scikit-learn's PCA class:


In [None]:
from sklearn.decomposition import PCA
import numpy as np

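# 200 instances with 3 features: PCA cannot extract more components than there are features
X = np.random.rand(200, 3) - 0.5

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)  # shape (200, 2)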

An important problem is choosing the right number of dimensions, so that our models retain as much accuracy as possible.

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
import numpy as np


mnist = fetch_openml('mnist_784', as_frame=False)  # the OpenML dataset name uses an underscore
X_train, y_train = mnist.data[:60_000], mnist.target[:60_000]
X_test, y_test = mnist.data[60_000:], mnist.target[60_000:]

pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
# Number of dimensions needed to preserve 95% of the variance of the training set
d = np.argmax(cumsum >= 0.95) + 1
# Another option is to set n_components to a float between 0 and 1: the ratio of variance we want to preserve
pca1 = PCA(n_components=0.95)
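
To see the variance-ratio option in action without downloading MNIST, here is a small sketch on scikit-learn's bundled digits dataset (8x8 images, 64 features). Using this dataset instead of MNIST is an assumption made to keep the example quick to run; the variable names are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 1,797 images of 8x8 pixels, i.e. 64 features per instance
X_digits, _ = load_digits(return_X_y=True)

# A float between 0 and 1 asks PCA to keep just enough components for that variance ratio
pca_95 = PCA(n_components=0.95)
X_digits_reduced = pca_95.fit_transform(X_digits)

# n_components_ is the number of components actually kept after fitting
print(pca_95.n_components_)
print(pca_95.explained_variance_ratio_.sum())
```

After fitting, `n_components_` reports how many dimensions were kept, and the sum of `explained_variance_ratio_` confirms that the requested share of variance is preserved.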

One problem with this implementation is that the whole dataset must fit in memory. To solve that, scikit-learn offers the _IncrementalPCA_
class, which lets us split the data into mini-batches and feed them in one at a time; this also makes online learning possible.

In [None]:
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)  # fit on the current mini-batch, not on the full training set

X_reduced = inc_pca.transform(X_train)

For very high-dimensional datasets PCA might be too slow; in this case we can resort to random projection.

## Random Projection

The concept is to project the data into a lower-dimensional space using a random linear projection. To know how many dimensions we need to keep, we
can use an equation (from the Johnson-Lindenstrauss lemma) that gives the minimum number of dimensions needed so that, with high probability, the squared distances between instances do not change by more than a given tolerance, here 10%. This equation is implemented in 
scikit-learn by the _johnson\_lindenstrauss\_min\_dim()_ function:

In [None]:
import numpy as np
from sklearn.random_projection import johnson_lindenstrauss_min_dim

m, e = 5_000, 0.1  # number of instances, tolerance

# The first parameter is named n_samples, so we pass m positionally
d = johnson_lindenstrauss_min_dim(m, eps=e)

# Now we generate a random matrix of shape [d, n] and use it to project the dataset from n dimensions to d
n = 20_000
np.random.seed(42)
P = np.random.randn(d, n) / np.sqrt(d)

X = np.random.randn(m, n)  # Here we are just generating a random dataset

X_reduced = X @ P.T
# Alternatively we can use the GaussianRandomProjection class from scikit-learn
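
The _GaussianRandomProjection_ alternative can be sketched as follows. The dataset sizes and the looser tolerance of 0.5 are assumptions, chosen only so the example runs quickly and uses little memory. With the default `n_components='auto'`, the class derives the target dimension from `eps` and the number of instances.

```python
import numpy as np
from sklearn.random_projection import (
    GaussianRandomProjection,
    johnson_lindenstrauss_min_dim,
)

rng = np.random.default_rng(42)

# Hypothetical sizes, kept small on purpose: 100 instances, 5,000 features
m_small, n_small, eps_small = 100, 5_000, 0.5
X_small = rng.normal(size=(m_small, n_small))

# The target dimension is computed internally from eps and the number of instances
gaussian_rnd_proj = GaussianRandomProjection(eps=eps_small, random_state=42)
X_small_reduced = gaussian_rnd_proj.fit_transform(X_small)

# Same minimum dimension, computed explicitly for comparison
d_small = johnson_lindenstrauss_min_dim(m_small, eps=eps_small)
print(X_small_reduced.shape, d_small)
```

The number of output columns matches the value returned by _johnson\_lindenstrauss\_min\_dim()_ for the same number of instances and tolerance.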

## Locally Linear Embedding (LLE)

LLE is a nonlinear dimensionality reduction technique and a manifold learning method that doesn't rely on projection, unlike PCA. It measures how each instance linearly relates to
its nearest neighbors, then searches for a lower-dimensional representation that best preserves these local relationships. This method is very good at unrolling
manifold datasets such as the __swiss roll__, which we are going to use as an example: 

In [None]:
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# t holds the position of each instance along the rolled axis of the swiss roll;
# it can be used as a target for regression tasks.
X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X_swiss)
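
One rough way to inspect the result without plotting is to check how strongly each unrolled axis correlates with `t`, the position along the roll. This is only a heuristic sketch: using a correlation as a quality check is an assumption of this example, not a standard LLE diagnostic.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X_swiss)

# Absolute linear correlation between each unrolled axis and the position t along the roll
corrs = [abs(np.corrcoef(X_unrolled[:, i], t)[0, 1]) for i in range(2)]
print(corrs)
```

A value close to 1 for one of the axes suggests that this axis follows the roll; a scatter plot of `X_unrolled` colored by `t` remains the more reliable check.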
