# Dimensionality reduction

*Fun fact: anyone you know is probably an extremist in at least one dimension (e.g., how much sugar they put in their coffee), if you consider enough dimensions.* 🤗

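- The more dimensions you have, the more likely you are to overfit: in a high-dimensional space the training points tend to lie very far from each other, so the data is sparse and any prediction relies on much larger extrapolations.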

- If you reduce the dimensionality of a dataset before training, it will train quicker, but you don't necessarily get better results (some information is lost in the reduction). It always depends on the dataset.

- Main approaches for dimensionality reduction: PROJECTION & MANIFOLD LEARNING
Projection is the process of mapping data from a higher-dimensional space onto a lower-dimensional subspace.
A manifold is the underlying low-dimensional geometric structure of high-dimensional data, often nonlinear and complex. Manifold learning tries to model that structure.

- PCA (Principal Component Analysis) is a projection algorithm that tries to preserve as much of the variance in the data as possible: it picks the hyperplane that minimizes the mean squared distance between the original data and its projection, then projects the data onto that lower-dimensional hyperplane.
Here *pca.explained_variance_ratio_* tells you the proportion of the dataset's variance that lies along each principal component. Imagine the result is *array([0.84, 0.14])*. This means about 98% of the variance lies along these two components, so you're not missing much information by dropping the other axes. So instead of choosing an arbitrary number of dimensions, choose the number that preserves a large enough share of the variance (e.g., at least 95%). You can do this by setting n_components to a float between 0 and 1 (e.g., n_components=0.95).
**If your data doesn't fit in memory, or for online learning, you need to use the IncrementalPCA class, which is fed the data one mini-batch at a time.**
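
To make the PCA notes above concrete, here is a minimal sketch on synthetic data. The dataset (10 features driven by 3 hidden factors), the variable names and the batch count are assumptions for illustration only. It shows how to read the cumulative explained variance, the `n_components=0.95` shortcut, and the same reduction done in mini-batches with `IncrementalPCA`.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

# Synthetic stand-in data (an assumption): 10 features driven by 3 hidden factors plus a little noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# 1) Keep all components and look at how the variance is spread across them
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
d = int(np.argmax(cumulative >= 0.95)) + 1  # smallest number of components reaching 95%

# 2) Shortcut: a float n_components lets PCA pick that number for you
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X)
print("components kept:", pca_95.n_components_)
print("variance preserved:", pca_95.explained_variance_ratio_.sum())

# 3) Same idea when the data comes in chunks: feed IncrementalPCA one mini-batch at a time
inc_pca = IncrementalPCA(n_components=d)  # an integer number of components
for X_batch in np.array_split(X, 5):  # 5 batches of 100 rows each
    inc_pca.partial_fit(X_batch)
X_reduced_inc = inc_pca.transform(X)
print("IncrementalPCA output shape:", X_reduced_inc.shape)
```

In a real out-of-core setting the batches would be read from disk instead of being split from an in-memory array; the `partial_fit` loop stays the same.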

## Projection vs. Manifold: A Simplified Explanation
Projection and manifold learning are the two main ideas behind most dimensionality reduction techniques.

### Projection
**Simple Definition:** Imagine shining a light on a 3D object and casting its shadow onto a 2D surface. The shadow is a projection of the 3D object onto a lower-dimensional space.

In Machine Learning:
- Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) project high-dimensional data onto a lower-dimensional space while preserving most of the variance.
- Feature Extraction: Techniques like Linear Discriminant Analysis (LDA) project data onto a lower-dimensional space that is more discriminative for classification tasks.
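
The "shadow" idea can be sketched directly in NumPy. This is a minimal illustration, assuming synthetic 3D points that lie close to a tilted plane. It centers the data, uses SVD to find the two directions of largest variance, and projects onto them, which is essentially what PCA does under the hood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3D points lying close to a tilted 2D plane (an assumption for illustration)
plane_basis = np.array([[1.0, 0.0, 0.5],
                        [0.0, 1.0, 0.5]])
X = rng.normal(size=(200, 2)) @ plane_basis + 0.01 * rng.normal(size=(200, 3))

# Center the data, then find the principal directions with SVD
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# The first two rows of Vt span the plane that preserves the most variance
W2 = Vt[:2].T            # shape (3, 2)
X_2d = X_centered @ W2   # the 2D "shadow" of the 3D points

# Share of the variance that survives the projection
variance_kept = (s[:2] ** 2).sum() / (s ** 2).sum()
print(X_2d.shape, variance_kept)
```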

### Manifold
**Simple Definition:** A manifold is a geometric object that locally resembles Euclidean space. Think of a curved surface like the surface of a sphere or a torus.

In Machine Learning:
- High-Dimensional Data: High-dimensional data can often be visualized as points lying on a low-dimensional manifold embedded in a high-dimensional space.
- Manifold Learning: Techniques like t-SNE and Isomap aim to uncover the underlying low-dimensional manifold structure of high-dimensional data.
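
A common toy example of a manifold is the Swiss roll: a 2D sheet rolled up in 3D. The sketch below assumes scikit-learn's `make_swiss_roll` generator and uses `Isomap` to "unroll" it into 2D. The sample count and `n_neighbors` value are arbitrary choices, not tuned settings.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# X holds 3D points on a rolled-up 2D sheet; t is each point's position along the roll
X, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

# Isomap builds a nearest-neighbor graph and preserves distances measured along it
isomap = Isomap(n_neighbors=10, n_components=2)
X_unrolled = isomap.fit_transform(X)

print(X.shape, "->", X_unrolled.shape)
```

Plotting `X_unrolled` colored by `t` is a quick way to check whether the roll was actually flattened.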

### Relationship Between Projection and Manifold

- Manifold Learning as Projection: Manifold learning techniques can be seen as a type of projection, but onto a nonlinear, curved manifold rather than a simple linear subspace.
- Projection as a Special Case of Manifold Learning: Linear projection techniques like PCA can be considered a special case of manifold learning, where the manifold is a linear subspace.


In [None]:
# PCA in scikit-learn
# PCA assumes the data is centered around the origin, but scikit-learn's PCA does the centering for us

import numpy as np
from sklearn.decomposition import PCA

# placeholder data so the cell runs on its own (an assumption); swap in your own X_train
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 5))

pca = PCA(n_components=2)  # n_components is the number of dimensions the data will be projected onto
# if n_components is a float between 0 and 1, PCA keeps the smallest number of components
# whose cumulative explained variance reaches that ratio,
# so n_components=0.95 gives you enough dimensions to preserve 95% of the variance
X_reduced = pca.fit_transform(X_train)

# you can also invert the transformation, albeit with a little loss of information
# (the variance along the dropped components is not recovered)
X_recovered = pca.inverse_transform(X_reduced)
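
To see how much is lost in the round trip, you can compare the original data with the recovered data. This is a minimal sketch on placeholder random data; `X_train` and the other variable names are assumptions for illustration. The fraction of squared error matches the share of variance that PCA did not keep.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data (an assumption); replace with your own training set
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 5))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_train)
X_recovered = pca.inverse_transform(X_reduced)

# Squared reconstruction error relative to the total squared deviation from the mean
squared_error = np.sum((X_train - X_recovered) ** 2)
total_squared_deviation = np.sum((X_train - X_train.mean(axis=0)) ** 2)
lost_fraction = squared_error / total_squared_deviation

print("fraction of variance lost:", lost_fraction)
print("1 - explained variance:  ", 1 - pca.explained_variance_ratio_.sum())
```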