# Dimensionality reduction

*Fun fact: anyone you know is probably an extremist in at least one dimension (e.g., how much sugar they put in their coffee), if you consider enough dimensions.* 🤗
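
The fun fact can be checked with a quick simulation. This is only a sketch under a strong assumption: every trait is independent and uniformly distributed between 0 and 1, and "extremist" means falling in the lowest or highest 1% of a trait. The function name `fraction_extreme` and the 1% margin are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def fraction_extreme(n_dims, n_points=2000, margin=0.01):
    """Share of random points with at least one coordinate in the bottom or top 1%."""
    points = rng.random((n_points, n_dims))
    is_extreme = (points < margin) | (points > 1 - margin)
    return is_extreme.any(axis=1).mean()

for n_dims in (1, 10, 100, 1000):
    share = fraction_extreme(n_dims)
    print(f"{n_dims:>4} dims: {share:.1%} of points are extreme in at least one dimension")
```

Under these assumptions the expected share is `1 - 0.98 ** n_dims`, so it approaches 100% as dimensions are added.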

- With many dimensions you are more likely to overfit, because the training points tend to be very far from each other.

- Reducing the dimensionality of a dataset before training usually makes training faster, but it does not necessarily give better results. It depends on the dataset.

- Main approaches for dimensionality reduction: PROJECTION & MANIFOLD LEARNING
Projection maps data from a higher-dimensional space onto a lower-dimensional subspace.
Manifold learning models the underlying geometric structure (the manifold) of high-dimensional data, which is often nonlinear and complex.

- PCA (Principal Component Analysis) is a projection algorithm that tries to preserve as much of the variance in the data as possible. It picks the lower-dimensional hyperplane that minimizes the mean squared distance between the original points and their projections.
Here *pca.explained_variance_ratio_* tells you how much of the variance each component captures. Imagine the result is *array([0.84, 0.14])*. This means about 98% of the dataset's variance lies along these two components, so you are not losing much information by dropping the other axes. So instead of choosing an arbitrary number of dimensions, pick the number that preserves a large enough share of the variance (e.g., 95%). You can do this by setting n_components to a float between 0 and 1.
**If your data doesn't fit in memory, or for online learning, use the IncrementalPCA class.**

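- KernelPCA applies the kernel "trick" (as in kernel SVMs) to PCA, so it can perform nonlinear projections. It has no inverse_transform() unless you set fit_inverse_transform=True. You can choose a kernel (e.g., linear, which is equivalent to regular PCA, rbf, or sigmoid) and a gamma. KernelPCA is unsupervised, so there is no obvious performance measure for picking these values; but since it usually comes before a supervised task, you can use a grid search to select the kernel and hyperparameters that give the best performance on that task.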

- LLE (Locally Linear Embedding) is a **nonlinear dimensionality reduction** technique. It works well when there is not much noise. It scales poorly to large datasets.
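
The IncrementalPCA note above can be sketched as follows. The random array is just a stand-in for data that would normally be read from disk or arrive as a stream, and the batch count and `n_components=10` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))  # stand-in for a dataset too large for memory

inc_pca = IncrementalPCA(n_components=10)
# Feed the data one mini-batch at a time (here 10 batches of 100 rows).
for batch in np.array_split(X, 10):
    inc_pca.partial_fit(batch)

X_reduced = inc_pca.transform(X)
print(X_reduced.shape)
```

Each batch passed to `partial_fit` needs at least as many rows as `n_components`.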

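The KernelPCA note above suggests tuning the kernel through the downstream supervised task. A minimal sketch, assuming a toy two-moons dataset and a logistic regression classifier (both chosen only for illustration, as are the grid values):

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_moons(n_samples=300, noise=0.1, random_state=42)

# KernelPCA feeds a classifier, so the classifier's accuracy scores each kernel.
pipeline = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("clf", LogisticRegression()),
])
param_grid = {
    "kpca__kernel": ["linear", "rbf", "sigmoid"],
    "kpca__gamma": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```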

## Comparing algorithms for dimensionality reduction

**Remember:**

- Feature Selection: Techniques like feature importance analysis (e.g., using Random Forest or XGBoost) can be used to reduce dimensionality by selecting the most relevant features.

- Regularization: Regularization techniques like L1 and L2 regularization can implicitly reduce the effective number of features in linear models.

- One way to decide which algorithm to use is to take a sample of your training data, plot each algorithm's 2D output colored by label, and compare the plots: which one separates the classes best, which one overlaps the least, etc.
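
The regularization point above can be illustrated with a small sketch. It assumes a synthetic regression problem in which only 5 of 20 features are informative, and an arbitrary penalty strength `alpha=1.0`; L1 regularization (Lasso) should then drive many of the remaining coefficients to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
# Features whose coefficient is exactly zero are effectively dropped by the model.
n_kept = int(np.sum(lasso.coef_ != 0))
print(f"{n_kept} of {X.shape[1]} coefficients are non-zero")
```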


| Algorithm | Use Case | Key Parameters | Import | Explanation |
|---|---|---|---|---|
| Principal Component Analysis (PCA) | Reducing dimensionality while preserving most of the variance. | `n_components` | `from sklearn.decomposition import PCA` | Projects data onto a lower-dimensional space defined by the principal components, which are linear combinations of the original features. |
| Linear Discriminant Analysis (LDA) | Reducing dimensionality while maximizing class separability. **Great to run before a classification algorithm such as SVM** | `n_components` (at most `n_classes - 1`) | `from sklearn.discriminant_analysis import LinearDiscriminantAnalysis` | Supervised: uses the labels to project data onto a lower-dimensional space that maximizes the separation between classes. |
| t-SNE (t-Distributed Stochastic Neighbor Embedding) | Visualizing high-dimensional data in 2D or 3D space. | `n_components`, `perplexity`, `learning_rate` | `from sklearn.manifold import TSNE` | Non-linear dimensionality reduction technique that preserves local structure. |
| UMAP (Uniform Manifold Approximation and Projection) | Visualizing high-dimensional data in 2D or 3D space. | `n_neighbors`, `min_dist`, `metric` | `from umap import UMAP` | Non-linear dimensionality reduction technique that often outperforms t-SNE in terms of visualization quality and computational efficiency. |
| MDS (Multidimensional Scaling) | Visualizing similarities between data points in a low-dimensional space. | `n_components`, `metric` | `from sklearn.manifold import MDS` | Projects data points into a lower-dimensional space while preserving pairwise distances. |
| Isomap (Isometric Mapping) | Non-linear dimensionality reduction technique that preserves geodesic distances between data points. | `n_neighbors`, `metric` | `from sklearn.manifold import Isomap` | Preserves the intrinsic geometry of the data manifold. |
| LLE (Locally Linear Embedding) | Non-linear dimensionality reduction technique that preserves local linear relationships between data points. | `n_neighbors`, `n_components` | `from sklearn.manifold import LocallyLinearEmbedding` | Embeds high-dimensional data into a lower-dimensional space while preserving local neighborhood relationships. |
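
To compare several rows of the table on the same data, something like the following could work. It is a sketch that assumes the scikit-learn digits dataset and a 300-sample subset to keep the slower methods quick; UMAP is left out because it lives in a separate package.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE, Isomap

X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]  # small sample keeps the slower methods fast

reducers = {
    "PCA": PCA(n_components=2),
    "LDA": LinearDiscriminantAnalysis(n_components=2),  # supervised: needs y
    "Isomap": Isomap(n_components=2, n_neighbors=10),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=42),
}

embeddings = {}
for name, reducer in reducers.items():
    # The unsupervised reducers accept y and simply ignore it.
    embeddings[name] = reducer.fit_transform(X, y)
    print(name, embeddings[name].shape)
```

Each entry of `embeddings` is an array with two columns that can be scatter-plotted and colored by `y` to see which method separates the digit classes best.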

## Projection vs. Manifold: A Simplified Explanation
Projection and manifold are two important concepts in machine learning, particularly in dimensionality reduction and manifold learning techniques.

### Projection
**Simple Definition:** Imagine shining a light on a 3D object and casting its shadow onto a 2D surface. The shadow is a projection of the 3D object onto a lower-dimensional space.

In Machine Learning:

- Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) project high-dimensional data onto a lower-dimensional space while preserving most of the variance.
- Feature Extraction: Feature extraction techniques like Linear Discriminant Analysis (LDA) project data onto a lower-dimensional space that is more discriminative for classification tasks.
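
The shadow analogy can be made concrete with a few lines of NumPy. This sketch assumes toy 3D data lying close to a 2D plane; it projects the data by hand using the SVD and compares the result with scikit-learn's PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy 3D data that lies close to a 2D plane.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 3))
X += 0.05 * rng.normal(size=(200, 3))

X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
W2 = Vt[:2].T            # first two principal directions, one per column
X_2d = X_centered @ W2   # the "shadow" of the data on that plane

X_2d_sklearn = PCA(n_components=2).fit_transform(X)
# The sign of each component is arbitrary, so compare absolute values.
same = np.allclose(np.abs(X_2d), np.abs(X_2d_sklearn), atol=1e-6)
print("matches scikit-learn up to sign:", same)
```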

### Manifold
**Simple Definition:** A manifold is a geometric object that locally resembles Euclidean space. Think of a curved surface like the surface of a sphere or a torus.

In Machine Learning:

- High-Dimensional Data: High-dimensional data can often be visualized as points lying on a low-dimensional manifold embedded in a high-dimensional space.
- Manifold Learning: Techniques like t-SNE and Isomap aim to uncover the underlying low-dimensional manifold structure of high-dimensional data.
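
A classic illustration is the Swiss roll: a 2D sheet rolled up in 3D. The sketch below assumes scikit-learn's synthetic `make_swiss_roll` data and uses LLE with `n_neighbors=10` (an arbitrary choice) to try to unroll it into two dimensions.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# t is each point's position along the roll; handy for coloring a plot.
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)
print(X.shape, "->", X_unrolled.shape)
```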

### Relationship Between Projection and Manifold

- Manifold Learning as Projection: Manifold learning techniques can be seen as a type of projection, but onto a nonlinear, curved manifold rather than a simple linear subspace.
- Projection as a Special Case of Manifold Learning: Linear projection techniques like PCA can be considered a special case of manifold learning, where the manifold is a linear subspace.


```python
# PCA in scikit-learn
# PCA assumes the data is centered around the origin; scikit-learn does that for us.
# X_train is assumed to be a 2D array of shape (n_samples, n_features).

from sklearn.decomposition import PCA

# n_components is the number of dimensions the data will be projected onto.
# If n_components is a float between 0 and 1, PCA keeps the smallest number of
# dimensions whose cumulative explained variance reaches that ratio,
# e.g. n_components=0.95 preserves at least 95% of the variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_train)

# You can also invert the transformation, albeit with some loss of information.
X_recovered = pca.inverse_transform(X_reduced)
```
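
As a runnable follow-up, here is a sketch that assumes the scikit-learn digits dataset in place of `X_train`. It lets PCA pick the number of components needed for 95% of the variance and then measures how much is lost when inverting the transformation.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

# A float n_components asks for enough components to reach that variance ratio.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()
print(f"kept {pca.n_components_} of {X.shape[1]} dimensions, "
      f"explaining {explained:.1%} of the variance")

# Reconstruction error: mean squared difference after the round trip.
X_recovered = pca.inverse_transform(X_reduced)
mse = np.mean((X - X_recovered) ** 2)
print(f"reconstruction MSE: {mse:.3f}")
```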

# NOTES

* **The main motivations for dimensionality reduction are:**
    * To speed up a subsequent training algorithm (in some cases it may even remove noise and redundant features, making the training algorithm perform better)
    * To visualize the data and gain insights on the most important features
    * To save space (compression)

* **The main drawbacks are:**
    * Some information is lost, possibly degrading the performance of subsequent training algorithms.
    * It can be computationally intensive.
    * It adds some complexity to your Machine Learning pipelines.
    * Transformed features are often hard to interpret.

* **The curse of dimensionality** refers to the fact that many problems that do not exist in low-dimensional space arise in high-dimensional space. In Machine Learning, one common manifestation is the fact that randomly sampled high-dimensional vectors are generally far from one another, increasing the risk of overfitting and making it very difficult to identify patterns without having plenty of training data.

* **Once a dataset's dimensionality has been reduced it is almost always impossible to perfectly reverse the operation, because some information gets lost during dimensionality reduction.** Moreover, while some algorithms (such as PCA) have a simple reverse transformation procedure that can reconstruct a dataset relatively similar to the original, other algorithms (such as t-SNE) do not.

* PCA can be used to significantly reduce the dimensionality of most datasets, even if they are highly nonlinear, because it can at least get rid of useless dimensions. However, if there are no useless dimensions—as in the Swiss roll dataset—then reducing dimensionality with PCA will lose too much information. You want to unroll the Swiss roll, not squash it.

* **Choosing a PCA variant:**
    * **Regular PCA is the default**, but it works only if the dataset fits in memory.
    * Incremental PCA is useful for large datasets that don't fit in memory, but it is slower than regular PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA is also useful for online tasks, when you need to apply PCA on the fly, every time a new instance arrives.
    * Randomized PCA is useful when you want to considerably reduce dimensionality and the dataset fits in memory; in this case, it is much faster than regular PCA.
    * Random Projection is great for very high-dimensional datasets.

* A dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the dataset without losing too much information. One way to measure this is to apply the reverse transformation and measure the reconstruction error. However, not all dimensionality reduction algorithms provide a reverse transformation. Alternatively, if you are using dimensionality reduction as a preprocessing step before another Machine Learning algorithm (e.g., a Random Forest classifier), then you can simply measure the performance of that second algorithm; if dimensionality reduction did not lose too much information, then the algorithm should perform just as well as when using the original dataset.

* **It can absolutely make sense to chain two different dimensionality reduction algorithms.** A common example is using PCA or Random Projection to quickly get rid of a large number of useless dimensions, then applying another much slower dimensionality reduction algorithm, such as LLE. This two-step approach will likely yield roughly the same performance as using LLE only, but in a fraction of the time.
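
The two-step approach from the last note can be sketched with a scikit-learn pipeline. The digits subset and the parameter values below are assumptions made for illustration only.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.pipeline import make_pipeline

X, _ = load_digits(return_X_y=True)
X = X[:500]

# Step 1: PCA quickly drops low-variance dimensions.
# Step 2: the slower LLE works on the smaller representation.
pca_lle = make_pipeline(
    PCA(n_components=0.95),
    LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42),
)
X_2d = pca_lle.fit_transform(X)
print(X.shape, "->", X_2d.shape)
```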