Dimensionality Reduction (DR)

Dimensionality reduction maps each point from a high-dimensional space into a low-dimensional (often 2D) space. It helps you gain insights from high-dimensional data, and there are several ways of doing it.

Linear and Non-Linear ways

Linear methods do this mapping through projections. This is loosely similar to how a globe is projected onto a flat map. Different projection functions can be used for the mapping.

Example: PCA
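To make "mapping through projections" concrete, here is a minimal Python sketch of a linear map from 3D to 2D. The matrix W and the helper project are hypothetical names chosen for illustration; PCA would instead learn W from the data, so treat this as a sketch of the idea rather than of PCA itself.

```python
# A linear DR map is a matrix multiplication: y = W x.
# W is a hypothetical 2x3 projection matrix picked by hand for illustration.

def project(W, x):
    """Apply the linear map W (a list of rows) to the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Keep the first two coordinates and drop the third.
W = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

low_a = project(W, a)
low_b = project(W, b)

# Linearity: projecting a sum gives the sum of the projections.
sum_ab = [ai + bi for ai, bi in zip(a, b)]
low_sum = project(W, sum_ab)
sum_of_lows = [p + q for p, q in zip(low_a, low_b)]

print(low_a, low_b, low_sum, sum_of_lows)
```

The last two printed vectors match, which is the property that makes the method "linear".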

Non-linear methods do this mapping in a, well, non-linear way: the mapping is not restricted to a linear projection of the input features.

Example: kernel PCA.

Parametric and Non-Parametric ways

In non-parametric methods, we do not have an explicit function that does the mapping. Instead, we position the points directly in the low-dimensional space while optimising some objective, such as preserving nearest neighbours (t-SNE) or preserving pairwise distances (MDS).

Example: t-SNE, MDS
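As a rough illustration of the non-parametric idea, the sketch below runs a tiny MDS-style optimisation in plain Python: the low-dimensional coordinates themselves are what gets adjusted, and no mapping function is ever learned. The function names, learning rate, iteration count, and toy data are all assumptions made for this example, not a reference implementation.

```python
import math
import random

def pairwise_distances(points):
    """Full matrix of Euclidean distances between all points."""
    return [[math.dist(p, q) for q in points] for p in points]

def stress(D_high, Y):
    """Sum of squared differences between high- and low-dimensional distances."""
    D_low = pairwise_distances(Y)
    n = len(Y)
    return sum((D_low[i][j] - D_high[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def mds(D_high, Y_init, n_iter=500, lr=0.01):
    """Move the low-dimensional points downhill on the stress (gradient descent)."""
    Y = [row[:] for row in Y_init]  # copy so the starting layout is left untouched
    n, dim = len(Y), len(Y[0])
    for _ in range(n_iter):
        grads = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(Y[i], Y[j]) or 1e-12  # avoid division by zero
                coeff = 2.0 * (d - D_high[i][j]) / d
                for k in range(dim):
                    grads[i][k] += coeff * (Y[i][k] - Y[j][k])
        for i in range(n):
            for k in range(dim):
                Y[i][k] -= lr * grads[i][k]
    return Y

# Toy data (assumed): five points in 3D, a square base plus an apex.
X = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (0.5, 0.5, 1)]
D_high = pairwise_distances(X)

# Random starting positions in 2D.
rng = random.Random(0)
Y_start = [[rng.uniform(-1, 1) for _ in range(2)] for _ in X]
Y_final = mds(D_high, Y_start)

print("stress before:", round(stress(D_high, Y_start), 3))
print("stress after: ", round(stress(D_high, Y_final), 3))
```

In this sketch there is no way to place a new, unseen point other than re-running the optimisation, which is the practical consequence of not having an explicit mapping function.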

Parametric methods have an explicit function that does the mapping.

Example: PCA. The loadings are nothing but parameters: they map the high-dimensional space to a low-dimensional space of principal components.
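To show what "the loadings are parameters" means in practice, here is a small sketch that fits the first principal component on toy 2D data and then reuses the learned loadings on a point it has never seen. It uses power iteration as a simple stand-in for a full eigendecomposition; the function names and the data are assumptions for illustration only.

```python
import math

def fit_first_pc(X, n_iter=200):
    """Return (mean, loadings) of the first principal component via power iteration."""
    n, d = len(X), len(X[0])
    mean = [sum(row[k] for row in X) / n for k in range(d)]
    Xc = [[row[k] - mean[k] for k in range(d)] for row in X]
    # Sample covariance matrix of the centred data.
    cov = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # starting vector (assumed not orthogonal to the first PC)
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def transform(x, mean, loadings):
    """Explicit mapping: centre the point, then take its dot product with the loadings."""
    return sum((xi - mi) * li for xi, mi, li in zip(x, mean, loadings))

# Toy data (assumed): points scattered around the line y = x.
X = [[0.0, 0.1], [1.0, 0.9], [2.0, 2.1], [3.0, 2.9]]
mean, loadings = fit_first_pc(X)

# The same parameters map a new point without refitting anything.
new_point = [4.0, 4.0]
score_new = transform(new_point, mean, loadings)
score_last = transform(X[-1], mean, loadings)

print("loadings:", [round(l, 3) for l in loadings])
print("score of new point:", round(score_new, 3))
```

Once mean and loadings are stored, any future point can be mapped with transform alone, which is what makes the method parametric.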

Why do DR?

It enables data visualisation. Without DR, good luck visualising a dataset with 43 columns.
With this data visualisation, you can generate useful hypotheses.

Global Structures vs Local Structures

Local structure preservation means that neighbors in the high-dimensional space should still be neighbors in the low-dimensional space. This is best explained by the concept of homophily.

Homophily: members of each class should be close to other members of the same class and far from members of other classes. (False clusters would still pass the homophily check.)

When homophily fails, different classes mix together. This is what happens when we say "local structure" is not preserved. We would see fewer clusters than actually exist.
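One way to put a number on homophily, assuming class labels are available, is the average fraction of each point's k nearest neighbours in the embedding that share its label. The sketch below is one such score; the function name, the choice of k, and both toy embeddings are assumptions for illustration. As noted above, a high score does not rule out false clusters.

```python
import math

def homophily(points, labels, k=2):
    """Mean fraction of each point's k nearest neighbours that share its label."""
    n = len(points)
    total = 0.0
    for i in range(n):
        # Indices of all other points, nearest first.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        same = sum(labels[j] == labels[i] for j in others[:k])
        total += same / k
    return total / n

labels = ["A", "A", "A", "B", "B", "B"]

# Embedding 1 (assumed): the two classes sit far apart.
separated = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Embedding 2 (assumed): the classes alternate along a line, so they mix.
mixed_labels = ["A", "B", "A", "B", "A", "B"]
mixed = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]

score_separated = homophily(separated, labels)
score_mixed = homophily(mixed, mixed_labels)

print("separated:", score_separated)
print("mixed:    ", round(score_mixed, 3))
```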

PCA could fail to preserve local structure

Neither Local structure nor Global Structure is preserved.

Local structure terrifically preserved. Global also decent.

See, if you use more PCs you may preserve local structure. But that defeats the objective of dimensionality reduction.

Global structure preservation means that relative positions between clusters are preserved, as well as larger-scale manifold structures. PCA can preserve it (so you could answer "How many clusters are there?")

If a DR model fails to preserve global structure =>

False clusters (see UMAP in the figure below: the brown cluster got split into two different clusters)
Crowding problem: distances between clusters are not preserved. So even though homophily is maintained (likes stay with likes), the distance between clusters is so small that you cannot tell the clusters apart. This often happens with t-SNE.
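A simple, admittedly crude, way to check whether relative positions between clusters survive is to compare cluster centroids before and after DR: for every cluster a and every pair of other clusters (b, c), ask whether "a is closer to b than to c" holds in both spaces. The sketch below implements that check; the function names and the toy data are assumptions, and this is an illustration in the spirit of triplet-based global measures rather than a metric taken from the references.

```python
import math
from itertools import combinations

def centroids(points, labels):
    """Mean position of each labelled cluster."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return {lab: [sum(c) / len(ps) for c in zip(*ps)]
            for lab, ps in groups.items()}

def centroid_triplet_agreement(X_high, X_low, labels):
    """Fraction of (a, b, c) cluster triplets whose distance ordering is kept."""
    ch, cl = centroids(X_high, labels), centroids(X_low, labels)
    names = sorted(ch)
    agree = total = 0
    for a in names:
        others = [name for name in names if name != a]
        for b, c in combinations(others, 2):
            high = math.dist(ch[a], ch[b]) < math.dist(ch[a], ch[c])
            low = math.dist(cl[a], cl[b]) < math.dist(cl[a], cl[c])
            agree += (high == low)
            total += 1
    return agree / total

labels = ["A", "A", "B", "B", "C", "C"]

# Toy high-dimensional data (assumed): A and B are close, C is far away.
X_high = [(0, 0, 0), (0, 1, 0), (2, 0, 0), (2, 1, 0), (20, 0, 0), (20, 1, 0)]

# Embedding that keeps the layout (simply drops the last coordinate).
X_good = [(x, y) for x, y, _ in X_high]

# Embedding with tight clusters but C wrongly placed next to A.
X_bad = [(0, 0), (0, 1), (5, 0), (5, 1), (1, 0), (1, 1)]

good_score = centroid_triplet_agreement(X_high, X_good, labels)
bad_score = centroid_triplet_agreement(X_high, X_bad, labels)

print("good embedding:", good_score)
print("bad embedding: ", bad_score)
```

Note that the second embedding still keeps every class tightly together, so a homophily check alone would not flag it.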

PaCMAP can preserve both global and local structure [3]. It is a new DR model.

What can a bad DR model do?

It can show you false clusters: there are no actual differences between the data points, but the DR model shows them as different clusters. Then you'll bang your head trying to explore a hypothesis that explains this differentiation.

DR model outputs cannot be trusted out of the box. They should be evaluated.

How to evaluate Dimensionality Reduction methods?

See, whatever DR model you run, you'll get a reduced scatter plot. How do you know if you're doing it right? The following 5 criteria will help [2]:

Preservation of Local Structure
Preservation of Global Structure
Sensitivity to parameter choices
Sensitivity to pre-processing choices
Computational efficiency: algorithms that calculate pairwise distances (MDS) or pairwise similarities (t-SNE) are hard to scale.
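To get a feel for why pairwise computations are hard to scale, the arithmetic can be sketched as follows: a naive all-pairs computation over n points touches n(n-1)/2 pairs, so the work grows quadratically. The helper name n_pairs is an assumption for illustration.

```python
def n_pairs(n):
    """Number of unordered pairs among n points: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# Ten times more points means roughly a hundred times more pairs.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} points -> {n_pairs(n):>13,} pairs")
```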

Ideally, a DR method would preserve local structure and global structure, be somewhat insensitive to parameter choices and pre-processing, and be computationally efficient [2].
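The first criterion, preservation of local structure, can also be checked without class labels. A minimal sketch, assuming Euclidean distances and a small k, is to measure how many of each point's k nearest neighbours in the original space are still among its k nearest neighbours in the embedding. The function names and the toy data below are hypothetical.

```python
import math

def knn_indices(points, k):
    """For each point, the set of indices of its k nearest neighbours."""
    n = len(points)
    result = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        result.append(set(order[:k]))
    return result

def knn_preservation(X_high, X_low, k=2):
    """Mean fraction of high-dimensional neighbours kept in the embedding."""
    high = knn_indices(X_high, k)
    low = knn_indices(X_low, k)
    return sum(len(h & l) / k for h, l in zip(high, low)) / len(X_high)

# Toy data (assumed): six points along a line in 3D.
X_high = [(float(i), 0.0, 0.0) for i in range(6)]

# Embedding that keeps the order of the points.
X_good = [(float(i), 0.0) for i in range(6)]

# Embedding that scrambles the order, so neighbourhoods change.
scrambled_positions = [0, 3, 1, 4, 2, 5]
X_scrambled = [(float(p), 0.0) for p in scrambled_positions]

good_score = knn_preservation(X_high, X_good)
scrambled_score = knn_preservation(X_high, X_scrambled)

print("order kept:     ", good_score)
print("order scrambled:", round(scrambled_score, 3))
```

Running the same check for several parameter settings or pre-processing choices is one simple way to probe the sensitivity criteria as well.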

Applications

Clustering, DataViz, Hypothesis generation in:

Digital Humanities: Samples are books | Features are words (each word a variable!)
BioInformatics: Samples are cells | Features are genes.

References

[1] Tubingen ML Lecture
[2] Comprehensive evaluation of Dimensionality Reduction - Nature
[3] DR - biology - Supplementary Notes
