# Embeddings to test

* UMAP
* t-SNE
* Parametric UMAP (part of UMAP)
* DensMAP (part of UMAP)
* [PaCMAP](https://github.com/YingfanWang/PaCMAP)
* [TriMap](https://github.com/eamid/trimap)
* PCA
* Laplacian eigenmaps
* MDS
* Isomap
* [MDE](https://github.com/cvxgrp/pymde)
* [PHATE](https://github.com/KrishnaswamyLab/PHATE)
* ForceAtlas2
* dbMAP


# Experiments

* distance/distance-rank preservation with varying `n_neighbors`, `n_components` and `min_dist`, measured with Pearson's correlation
* hierarchical embedding: original -> 1000d -> 100d -> 2d
* negative test: does it magically create clusters? Test using a high dimensional Gaussian
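
A minimal sketch of the negative test, under some assumptions not in the notes above: PCA is only a stand-in for the embedder under test, and "creates clusters" is scored with the silhouette of a k-means partition. A blob dataset serves as a positive control so that the Gaussian score has something to be compared against.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def cluster_score(embedding, n_clusters=3, seed=0):
    """Silhouette score of a k-means partition; higher means more cluster-like."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(embedding)
    return silhouette_score(embedding, labels)


def embed(X):
    # Stand-in embedder (assumption): replace with any object exposing fit_transform.
    return PCA(n_components=2).fit_transform(X)


rng = np.random.default_rng(0)
gaussian = rng.normal(size=(500, 50))  # structureless negative control
blobs, _ = make_blobs(n_samples=500, n_features=50, centers=3, random_state=0)  # positive control

score_gaussian = cluster_score(embed(gaussian))
score_blobs = cluster_score(embed(blobs))
print(f'gaussian: {score_gaussian:.2f}, blobs: {score_blobs:.2f}')
```

An embedder that pushes the Gaussian score towards the blob score is inventing structure that is not in the data.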


Metrics:
* Spearman rank correlation between samples
* Pearson correlation of distances
* Distance correlation of distances
* Average Jaccard distance
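
The metrics above could be computed along these lines. This is a sketch under assumptions: the Spearman correlation is taken over pairwise distances, and the Jaccard distance is taken between the `k`-nearest-neighbour sets of each sample before and after embedding. The helper names `knn_sets` and `preservation_metrics` are illustrative. Distance correlation is left out to keep the sketch on NumPy/SciPy only.

```python
import numpy as np
from scipy.spatial import distance
from scipy.stats import pearsonr, spearmanr


def knn_sets(X, k):
    """Index sets of the k nearest neighbours of every row of X (self excluded)."""
    d = distance.squareform(distance.pdist(X))
    np.fill_diagonal(d, np.inf)  # a point is never its own neighbour
    return [set(row) for row in np.argsort(d, axis=1)[:, :k]]


def preservation_metrics(X_orig, X_emb, k=10):
    """Compare pairwise distances and neighbourhoods before and after embedding."""
    d_or, d_emb = distance.pdist(X_orig), distance.pdist(X_emb)
    nn_or, nn_emb = knn_sets(X_orig, k), knn_sets(X_emb, k)
    jaccard = [1 - len(a & b) / len(a | b) for a, b in zip(nn_or, nn_emb)]
    return {
        'pearson': pearsonr(d_or, d_emb)[0],
        'spearman': spearmanr(d_or, d_emb)[0],
        'avg_jaccard_distance': float(np.mean(jaccard)),
    }


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10))
# Keeping only the first two coordinates acts as a crude stand-in embedding.
print(preservation_metrics(X_demo, X_demo[:, :2]))
```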


In [1]:
%load_ext autoreload
%autoreload 1
%aimport omic_helpers
%matplotlib inline

from omic_helpers import graph_clustering


from sklearn import datasets
from sklearn.preprocessing import StandardScaler, QuantileTransformer, RobustScaler, MinMaxScaler, FunctionTransformer
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import chisquare, chi2_contingency, pearsonr
from scipy.stats import kendalltau,spearmanr, weightedtau, theilslopes, wilcoxon, ttest_rel
from scipy.spatial import distance
import dcor

import umap
import pacmap
import trimap
import pymde
import dbmap

import pandas as pd

from sklearn.pipeline import Pipeline

from sklearn.decomposition import PCA, KernelPCA, NMF, FactorAnalysis
from sklearn.manifold import Isomap, MDS, SpectralEmbedding
from sklearn.manifold import LocallyLinearEmbedding as LLE, TSNE, smacof, trustworthiness

from tqdm import tqdm

from scipy.sparse import csr_matrix

import gc



In [2]:
def get_intra_sample_distances(X, how='euclidean'):
    # condensed vector of pairwise distances; `how` is any metric name accepted by scipy's pdist
    return distance.pdist(X, metric=how)

In [None]:
# [x] Sammon mapping: https://arxiv.org/pdf/2009.08136.pdf
# [x] landmark maximum variance unfolding 
# [x] Landmark MDS
# [x] GSOM: https://github.com/CDAC-lab/pygsom/tree/master/gsom -> never mind this is a clustering method..
# [x] SMACOF

# IVIS: https://github.com/beringresearch/ivis, https://www.nature.com/articles/s41598-019-45301-0
# RankVisu
# diffeomorphic dimensionality reduction Diffeomap
# FastMap MDS: https://github.com/shawn-davis/FastMapy
# FactorizedEmbeddings: https://github.com/TrofimovAssya/FactorizedEmbeddings, https://academic.oup.com/bioinformatics/article/36/Supplement_1/i417/5870511
# MetricMap
# SparseMap: https://github.com/vene/sparsemap
# growing curvilinear component analysis
# curvilinear distance analysis
# autoencoder NeuroScale
# PHATE
# GPLVM
# FA
# Nonlinear PCA
# SDNE 
# GCN
# Graph Factorisation
# HOPE
# opt-SNE: https://github.com/omiq-ai/Multicore-opt-SNE
#  Poincare embedding : https://github.com/facebookresearch/poincare-embeddings
# NN-graph/Parametric UMAP -> GraphSage/Node2Vec/etc.. see NetworkX and karateclub!
# https://github.com/benedekrozemberczki/karateclub
# https://github.com/palash1992/GEM-Benchmark, https://github.com/palash1992/GEM

# https://www.sciencedirect.com/science/article/pii/S0950705118301540

# Multi-dimensional datasets

In [None]:
dimensionality = 80
num_blobs = 5
test_data_multidim = []

rnd_perturbation = np.random.normal(0, 1, (1000,dimensionality))
test_data_multidim.append(('blobs1', datasets.make_blobs(n_samples=1000, 
                                                         n_features=dimensionality, 
                                                         centers=num_blobs)[0]+rnd_perturbation))
test_data_multidim.append(('blobs2', datasets.make_blobs(n_samples=1000, 
                                                         n_features=dimensionality, 
                                                         centers=2*num_blobs)[0]+rnd_perturbation))
test_data_multidim.append(('Class1', datasets.make_classification(n_samples=1000, 
                                                                  n_features=dimensionality, 
                                                                  n_informative=20, 
                                                                  n_redundant=0)[0]+rnd_perturbation))
test_data_multidim.append(('Class2', datasets.make_classification(n_samples=1000, 
                                                                  n_features=dimensionality, 
                                                                  n_informative=5, 
                                                                  n_redundant=0)[0]+rnd_perturbation))

In [None]:
num_samples = 1000
sample_size = 250
# sample without replacement so that no sample is paired with itself (zero distance)
sample_selection = np.random.choice(num_samples, sample_size, replace=False)

In [None]:
embedder_type = 'Sammon'
n_n = 77
reduce_dim = 11
scaler = StandardScaler()  # alternative: QuantileTransformer(output_distribution='normal')

In [None]:

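n_landmarks = 100  # placeholder value: n_landmarks is used below but never defined in this notebook
embedder = {}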
embedder['umap'] = umap.UMAP(n_components=reduce_dim, densmap=True, metric='euclidean',
                             n_neighbors=n_n, min_dist=0.25, disconnection_distance=15.)
embedder['trimap'] = trimap.TRIMAP(n_dims=reduce_dim, n_iters=2500)
embedder['pacmap'] = pacmap.PaCMAP(n_dims=reduce_dim, n_neighbors=n_n)
embedder['SpectralEmbedding'] = SpectralEmbedding(n_components=reduce_dim, n_neighbors=n_n)
embedder['Isomap'] = Isomap(n_components=reduce_dim)
embedder['MDS'] = MDS(n_components=reduce_dim)  # defaults: metric MDS on Euclidean dissimilarities
embedder['KernelPCA'] = KernelPCA(n_components=reduce_dim, kernel='sigmoid')
embedder['PCA'] = PCA(n_components=reduce_dim)
embedder['FA'] = FactorAnalysis(n_components=reduce_dim, max_iter=1000)
embedder['dbmap'] = dbmap.diffusion.Diffusor(n_components=120, ann_dist='euclidean')
embedder['LLE'] = LLE(n_components=reduce_dim, n_neighbors=n_n, method='ltsa')
embedder['NMF'] = NMF(n_components=reduce_dim, max_iter=10000)
embedder['TSNE'] = TSNE(n_components=3, perplexity=50)
embedder['Sammon'] = graph_clustering.Sammon(n_components=reduce_dim, n_neighbors=n_n,
                                            max_iterations=250, learning_rate=0.05, init_type='PCA')
embedder['MVU'] = graph_clustering.MaximumVarianceUnfolding(n_components=2, n_neighbors=n_n)
embedder['LMVU'] = graph_clustering.LandmarkMaximumVarianceUnfolding(n_components=reduce_dim, 
                                                                     n_neighbors=n_n, 
                                                                     n_landmarks=n_landmarks)
embedder['LMDS'] = graph_clustering.LandmarkMultiDimensionalScaling(n_components=reduce_dim,
                                                                     n_landmarks=n_landmarks)

In [None]:
#embedder['MVU']
#embedder['GSOM']
#embedder['MetricMap']
#embedder['SparseMap']

test_sets_embedded = []
if embedder_type == 'dbmap':
    pipe = Pipeline([('scaler', scaler), 
                     ('prepmap', embedder['dbmap']), 
                     ('reducer', embedder['umap'])])
    for _, ts in tqdm(test_data_multidim):
        # the pipeline already runs dbmap before UMAP, so fit it on the raw data only once
        test_sets_embedded.append(np.array(pipe.fit_transform(ts)))
elif embedder_type == 'NMF':    
    for _, ts in tqdm(test_data_multidim):
        nonnegger = lambda x: x + 2*np.abs(np.min(x, axis=0))
        nonnegger_F = FunctionTransformer(func=nonnegger)

        pipe = Pipeline([('scaler', scaler), 
                         ('nngr', nonnegger_F), 
                         ('reducer', embedder['NMF'])])
        test_sets_embedded.append(pipe.fit_transform(ts)) 
else:
    pipe = Pipeline([('scaler', scaler), 
                     ('reducer', embedder[embedder_type])])
    for _, ts in tqdm(test_data_multidim):
        test_sets_embedded.append(pipe.fit_transform(ts))

In [None]:
#fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(22,25))
#for k, ds in enumerate(test_sets_embedded):
#    j=k%2 
#    i=int(k/2)
#    ax[i,j].scatter(x=ds[:,0], y=ds[:,1], color='black')
#    ax[i,j].set_title(f'Image:{k}')

In [None]:
dist_preservation_overall = []
dists = []
for num in tqdm(range(0,4)):
    dist_or = get_intra_sample_distances(test_data_multidim[num][1][sample_selection,:])
    dist_emb = get_intra_sample_distances(test_sets_embedded[num][sample_selection,:])

    dists.append({'d_or': dist_or, 'd_emb': dist_emb})
    dist_preservation_overall.append({'dataset': test_data_multidim[num][0], 
                              'corr':dcor.distance_correlation(dist_or, dist_emb)})

In [None]:
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(22,25))
for k, ds in enumerate(dists):
    i, j = divmod(k, 2)  # row, column in the 2x2 grid
    ax[i,j].scatter(x=ds['d_or'], y=ds['d_emb'], color='black', alpha=0.01)
    mx,my = max(ds['d_or']), max(ds['d_emb'])
    ax[i,j].plot([0,mx], [0, my], ls='--', c='blue')
    ax[i,j].set_title(f'Image:{k}')

In [None]:
dist_preservation_overall

So far, roughly: *better distance preservation correlates with worse cluster separation*

# Semi-supervised UMAP

The main idea is to add labels for the clusters we know we want to see. These labels can come from clustering a sample of the original data (perhaps using only a selection of features).

# Parametric UMAP

* Create the nearest-neighbor graph with fuzzy simplicial sets
* Apply graph embedder

# Anchored embedding

# Distance preserving embedding

* Siamese twins networks
* distance as outcome
* pairs as input

The method IVIS seems to use this idea.
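
A toy sketch of the three ingredients listed above (shared encoder, pairs as input, distance as outcome). Everything here is an assumption for illustration: the encoder is a single linear map trained with a hand-written gradient in NumPy, and `train_linear_siamese` is a hypothetical helper. It is not how ivis is implemented, and it assumes roughly standardized input.

```python
import numpy as np


def train_linear_siamese(X, n_components=2, n_pairs=2000, lr=0.01, n_steps=300, seed=0):
    """Fit a shared linear encoder W so that embedded pair distances match the originals."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    i = rng.integers(0, n, n_pairs)
    j = rng.integers(0, n, n_pairs)
    keep = i != j  # drop degenerate pairs of a sample with itself
    i, j = i[keep], j[keep]

    # For a linear encoder the two branches collapse: f(x_i) - f(x_j) = (x_i - x_j) @ W
    diff = X[i] - X[j]
    target = np.linalg.norm(diff, axis=1)  # original-space distance is the outcome

    W = 0.1 * rng.normal(size=(p, n_components))
    losses = []
    for _ in range(n_steps):
        z = diff @ W
        r = np.linalg.norm(z, axis=1) + 1e-12  # embedded distance per pair
        err = r - target
        losses.append(float(np.mean(err ** 2)))
        # gradient of the mean squared distance error with respect to W
        grad = 2.0 / len(err) * diff.T @ (z * (err / r)[:, None])
        W -= lr * grad
    return W, losses


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
W, losses = train_linear_siamese(X_demo)
embedding = X_demo @ W
print(f'loss: {losses[0]:.2f} -> {losses[-1]:.2f}')
```

A nonlinear encoder would replace `diff @ W` with two forward passes through the same network, one per element of the pair.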

# Ranking based embedder

# Multi-patch UMAP

The core assumption of UMAP is that all points lie on the same manifold. What if we split our data into dense patches prior to the creation of the fuzzy simplicial sets?

To make this tractable, the split should be computationally inexpensive. One way to go about it is to treat overlapping regions with a sufficient number of samples as patches. The embeddings associated with these patches can later be combined.



# Multi-sample UMAP


* $N$ sampled UMAP embedders with/without minimal perturbations
* aligned using Procrustes
* uniform scaling
* consensus distance determination
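
The alignment and consensus steps could look roughly like the sketch below. Assumptions: the $N$ embeddings are simulated here as rotated, rescaled and shifted noisy copies of one base layout rather than real UMAP runs, the first run is used as the reference, and the consensus is the median pairwise distance across runs. The helper names are illustrative.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes
from scipy.spatial import distance


def normalise(emb):
    """Centre an embedding and scale it uniformly to unit Frobenius norm."""
    centred = emb - emb.mean(axis=0)
    return centred / np.linalg.norm(centred)


def align_embeddings(embeddings):
    """Align every embedding to the first one with an orthogonal Procrustes fit."""
    reference = normalise(embeddings[0])
    aligned = [reference]
    for emb in embeddings[1:]:
        emb = normalise(emb)
        rotation, _ = orthogonal_procrustes(emb, reference)  # minimises ||emb @ R - reference||
        aligned.append(emb @ rotation)
    return np.stack(aligned)


def consensus_distances(aligned):
    """Median distance of every point pair across the aligned embeddings."""
    return np.median([distance.pdist(emb) for emb in aligned], axis=0)


# Simulated runs (assumption): same layout, different rotation, scale, offset and noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
runs = []
for _ in range(5):
    theta = rng.uniform(0, 2 * np.pi)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
    noise = rng.normal(scale=0.01, size=base.shape)
    runs.append(rng.uniform(0.5, 2.0) * base @ rot + noise + rng.normal(size=2))

aligned = align_embeddings(runs)
consensus = consensus_distances(aligned)
```

Pairwise distances do not change under rotation, so the Procrustes step matters mainly when coordinates are averaged or plotted together; the uniform scaling is what makes distances comparable across runs.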

# Landmark-based embeddings coupled to sparse exemplar finders

Instead of random landmarks we can use exemplars, for example:
* the points closest to cluster centroids
* exemplars found by e.g. affinity propagation
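
A sketch of the first option, assuming k-means supplies the centroids; `centroid_exemplars` is a hypothetical helper, not part of `omic_helpers`. For the second option, scikit-learn's `AffinityPropagation` exposes the exemplar indices directly as `cluster_centers_indices_` after fitting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin


def centroid_exemplars(X, n_landmarks, seed=0):
    """Indices of the samples closest to each k-means centroid."""
    km = KMeans(n_clusters=n_landmarks, n_init=10, random_state=seed).fit(X)
    # for every centroid, the index of the nearest actual sample
    closest = pairwise_distances_argmin(km.cluster_centers_, X)
    return np.unique(closest)  # two centroids may share a nearest sample


X_demo, _ = make_blobs(n_samples=500, n_features=20, centers=5, random_state=0)
landmark_idx = centroid_exemplars(X_demo, n_landmarks=25)
landmarks = X_demo[landmark_idx]
print(landmarks.shape)
```

Unlike the centroids themselves, the returned landmarks are real samples, so their original-space distances to all other points are well defined.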