(Visit the
[documentation](https://datafold-dev.gitlab.io/datafold/tutorial_index.html) page
to view the executed notebook.)

# RObust and Scalable Embedding via LANdmark Diffusion (Roseland)

For a detailed introduction, see the paper:

Shen, Chao, and Hau-Tieng Wu. "Scalability and robustness of spectral embedding: landmark diffusion is all you need." arXiv preprint arXiv:2001.00801 (2020). Available at: https://arxiv.org/abs/2001.00801

The Roseland algorithm is a dimensionality reduction technique based on the manifold assumption (cf. "manifold learning", "diffusion maps"). It is motivated by the challenge of developing a scalable and robust algorithm capable of handling large datasets. The Roseland algorithm can be viewed as a generalization of the Diffusion Maps (DM) algorithm, with which it shares various properties. Its main advantage is that it avoids the costly eigendecomposition of a large matrix containing the affinities between every pair of points in the dataset.

Instead, the Roseland algorithm utilizes a "landmark set" in the computation of the affinity matrix. For large-scale applications, this landmark set can be much smaller than the full dataset. The result is a rectangular matrix of size (number of samples) x (number of landmarks), which is decomposed with a singular value decomposition (SVD), rather than the large square matrix over the full dataset used in DM. It is also apparent that when the landmark set coincides with the full dataset, Roseland matches DM.
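The landmark idea can be illustrated with a minimal NumPy sketch. This is not the datafold implementation: the function name `roseland_sketch`, the kernel scaling `exp(-d^2 / epsilon)`, and the normalization (following our reading of the paper) are assumptions made for illustration only.

```python
import numpy as np


def roseland_sketch(X, landmarks, epsilon, n_components):
    """Illustrative landmark-based spectral embedding (not datafold's code)."""
    # Rectangular affinity matrix between all samples and the landmarks, shape (n, m)
    sq_dists = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / epsilon)  # assumed kernel scaling

    # Row sums of W @ W.T, computed without ever forming the (n, n) matrix
    d = W @ (W.T @ np.ones(X.shape[0]))

    # SVD of the normalized rectangular matrix instead of an (n, n) eigenproblem
    A = W / np.sqrt(d)[:, None]
    U_bar, s, _ = np.linalg.svd(A, full_matrices=False)
    U = U_bar / np.sqrt(d)[:, None]
    return U[:, :n_components], s[:n_components]


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
# 10% of the samples act as landmarks
landmarks_demo = X_demo[rng.choice(300, size=30, replace=False)]
vecs_demo, vals_demo = roseland_sketch(
    X_demo, landmarks_demo, epsilon=1.0, n_components=4
)
```

In this sketch the largest singular value equals one and the corresponding left vector is constant, which mirrors the trivial first eigenpair known from Diffusion Maps. The non-trivial vectors that follow are the candidates for embedding coordinates.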

**In this tutorial** we reimplement two of the earlier Diffusion Maps tutorials, [Embedding of an S-curved manifold](https://datafold-dev.gitlab.io/datafold/tutorial_03_basic_dmap_scurve.html) and [Manifold learning on handwritten digits](https://datafold-dev.gitlab.io/datafold/tutorial_04_basic_dmap_digitclustering.html).

Alternative manifold learning methods include, for example, Isomap, Locally Linear Embedding, and Hessian eigenmaps. For a quick comparison (without Diffusion Maps) see the [scikit-learn page](http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py).

In [None]:
import copy
import time

import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d.axes3d as p3
import numpy as np
import sklearn.manifold as manifold
from matplotlib import offsetbox
from sklearn.datasets import load_digits, make_s_curve, make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

import datafold.dynfold as dfold
import datafold.pcfold as pfold
from datafold.dynfold import LocalRegressionSelection
from datafold.utils.plot import plot_pairwise_eigenvector

## Embedding of an S-curved manifold

We use a `PCManifold` to select suitable parameters (`epsilon` and `cut_off`), fit a `Roseland` model to learn the S-curved manifold, and finally show how to find the "right" parsimonious representation automatically with `LocalRegressionSelection`.

### Generate S-curved point cloud  

We use the generator `make_s_curve` from scikit-learn. The points have a three-dimensional representation (ambient space) and lie on a (non-linear) S-shaped manifold with intrinsic dimension two (i.e. on the folded plane). The scikit-learn function also provides pseudo-colouring, which makes it easier to visualize and compare different embeddings.

In [None]:
nr_samples = 15000

# reduce number of points for plotting
nr_samples_plot = 1000
idx_plot = np.random.permutation(nr_samples)[0:nr_samples_plot]

# generate point cloud
X, X_color = make_s_curve(nr_samples, random_state=3, noise=0)

# plot
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(
    X[idx_plot, 0],
    X[idx_plot, 1],
    X[idx_plot, 2],
    c=X_color[idx_plot],
    cmap=plt.cm.Spectral,
)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
ax.set_title("point cloud on S-shaped manifold")
ax.view_init(10, 70)

### Optimize kernel parameters 

We now use a `PCManifold` to estimate parameters. The attached kernel in `PCManifold` defaults to a `GaussianKernel`.

* `epsilon` - the scale of the Gaussian kernel
* `cut_off` - distance beyond which kernel values are set to zero; this promotes kernel sparsity and allows the method to scale to larger numbers of samples
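To make the role of the two parameters concrete, here is a small NumPy sketch of a Gaussian kernel with a distance cut-off. The function `truncated_gaussian_kernel` is hypothetical, and the scaling `exp(-d^2 / (2 * epsilon))` is one common convention; the exact formula used by `GaussianKernel` in datafold may differ.

```python
import numpy as np


def truncated_gaussian_kernel(X, epsilon, cut_off):
    """Dense illustration of a Gaussian kernel with a distance cut-off."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / (2.0 * epsilon))  # assumed scaling convention
    # Pairs further apart than cut_off contribute nothing to the kernel
    K[sq_dists > cut_off**2] = 0.0
    return K


rng = np.random.default_rng(1)
points = rng.uniform(size=(200, 2))

K_small = truncated_gaussian_kernel(points, epsilon=0.01, cut_off=0.2)
K_large = truncated_gaussian_kernel(points, epsilon=0.01, cut_off=0.6)

# Fraction of exactly-zero entries: a smaller cut-off gives a sparser kernel
sparsity_small = np.mean(K_small == 0.0)
sparsity_large = np.mean(K_large == 0.0)
print(f"zero entries: cut_off=0.2 -> {sparsity_small:.2f}, cut_off=0.6 -> {sparsity_large:.2f}")
```

The sketch stores the kernel densely for simplicity. In practice the benefit of a cut-off comes from storing only the non-zero entries in a sparse matrix.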

In [None]:
X_pcm = pfold.PCManifold(X)
X_pcm.optimize_parameters()

print(f"epsilon={X_pcm.kernel.epsilon}, cut-off={X_pcm.cut_off}")

### Fit Roseland model 

We first fit a `Roseland` model with the optimized parameters and then compare potential two-dimensional embeddings. For this we fix the first non-trivial svdvector ($\psi_1$) and compare it to the other computed svdvectors ($\psi_i$ with $i \neq 1$).

**Comparison with Diffusion Maps:**

* Roseland uses similar metaparameters to Diffusion Maps. The major difference is the parameter `landmarks`, which governs the size of the landmark set. Changing `landmarks` allows for a compromise between accuracy and runtime.
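The following sketch shows how a fractional `landmarks` value translates into the size of the matrix that has to be decomposed. It assumes that a fraction selects a random subset of the samples; the helper `pick_landmarks` and its rounding rule are hypothetical and not taken from datafold.

```python
import numpy as np


def pick_landmarks(n_samples, fraction, seed=0):
    """Return indices of a random landmark subset (illustrative helper)."""
    rng = np.random.default_rng(seed)
    n_landmarks = max(1, round(fraction * n_samples))  # assumed rounding rule
    return rng.choice(n_samples, size=n_landmarks, replace=False)


n_samples = 15000
full_entries = n_samples**2  # square affinity matrix of Diffusion Maps

matrix_entries = {}
for fraction in (0.1, 0.25, 0.7):
    idx = pick_landmarks(n_samples, fraction)
    # Rectangular matrix: one row per sample, one column per landmark
    matrix_entries[fraction] = n_samples * idx.size
    print(
        f"landmarks={fraction}: {idx.size} landmarks, "
        f"{matrix_entries[fraction] / full_entries:.0%} of the entries of the square matrix"
    )
```

For the 15000 samples used here, the rectangular matrix has 10%, 25% and 70% of the entries of the full square matrix, which is where the runtime savings come from.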

In [None]:
for lm in (0.1, 0.25, 0.7):
    rose = dfold.Roseland(
        kernel=pfold.GaussianKernel(
            epsilon=X_pcm.kernel.epsilon, distance=dict(cut_off=X_pcm.cut_off)
        ),
        n_svdtriplet=9,
        landmarks=lm,
    )
    t0 = time.time()
    rose = rose.fit(X_pcm)
    dt = time.time() - t0
    svdvecs, svdvals = rose.svdvec_left_, rose.svdvalues_

    plot_pairwise_eigenvector(
        eigenvectors=svdvecs[idx_plot, :],
        n=1,
        fig_params=dict(figsize=[15, 15]),
        scatter_params=dict(cmap=plt.cm.Spectral, c=X_color[idx_plot]),
    )

    plt.suptitle(rf"Roseland embeddings for $landmarks={lm}$ in {dt:.2f} s.", y=0.9)
    plt.show()

### Automatic embedding selection

In the visual comparison, we can identify good choices if the dimension is low (two-dimensional for plotting). For larger (intrinsic) dimensions of the manifold, this becomes much harder or impossible. Furthermore, we also wish to automate this process.

In a (linear) Principal Component Analysis (PCA), the eigenvectors are sorted by relevance: each eigenvector points in the direction of the next largest remaining variance. This holds by construction, and because of this intrinsic order the selection process is simpler, as we only have to look at the magnitude of the corresponding eigenvalues or at a gap in the eigenvalues.
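The intrinsic ordering in PCA can be checked with a few lines of NumPy. The example below computes PCA by hand through an SVD of the centered data; the synthetic data and the 90% variance threshold are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data with very different spread along the three axes
data = rng.normal(size=(500, 3)) * np.array([5.0, 1.0, 0.1])

# PCA via SVD of the centered data matrix
centered = data - data.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Eigenvalues of the covariance matrix, already sorted in descending order
explained_variance = singular_values**2 / (data.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Selection reduces to a threshold on the cumulative explained variance
n_keep = int(np.searchsorted(np.cumsum(explained_ratio), 0.9) + 1)
print(f"explained variance ratio: {explained_ratio}, components kept: {n_keep}")
```

Because the components are ordered by explained variance, keeping the leading ones is always a sensible choice. This is exactly the property that the non-linear embeddings discussed next do not have.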

For manifold learning models like `DiffusionMaps` or `Roseland` we trade off lower-dimensional embeddings (obtained by accounting for non-linearity) against a harder selection process. Eigenvectors with large eigenvalues may not add new information compared to the previous eigenvectors (as in the plots above). For an automatic selection of suitable eigenvector coordinates, we need a model that quantifies what a "good" embedding is and optimizes for this quantity.
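One way to quantify "no new information" is to test whether a candidate eigenvector can be predicted from the previous ones with a local linear regression. The sketch below is a simplified illustration of that idea and not the datafold implementation; the function `local_linear_residual`, the fixed bandwidth, and the synthetic coordinates are assumptions.

```python
import numpy as np


def local_linear_residual(previous, candidate, bandwidth):
    """Leave-one-out residual of a local linear fit of `candidate` on `previous`."""
    n_points = previous.shape[0]
    design = np.column_stack([np.ones(n_points), previous])
    predictions = np.empty(n_points)
    for i in range(n_points):
        sq_dists = np.sum((previous - previous[i]) ** 2, axis=1)
        weights = np.exp(-sq_dists / bandwidth**2)
        weights[i] = 0.0  # leave the point itself out of its own fit
        root_w = np.sqrt(weights)
        coef, *_ = np.linalg.lstsq(
            design * root_w[:, None], candidate * root_w, rcond=None
        )
        predictions[i] = design[i] @ coef
    # Normalized residual: close to 0 means predictable, close to 1 means new
    return np.sqrt(np.sum((candidate - predictions) ** 2) / np.sum(candidate**2))


rng = np.random.default_rng(3)
x, y = rng.uniform(size=(2, 300))

phi_1 = np.cos(np.pi * x)             # first coordinate along x
phi_harmonic = np.cos(2 * np.pi * x)  # a function of phi_1, so it is redundant
phi_new = np.cos(np.pi * y)           # varies along the second direction y

res_harmonic = local_linear_residual(phi_1[:, None], phi_harmonic, bandwidth=0.1)
res_new = local_linear_residual(phi_1[:, None], phi_new, bandwidth=0.1)
print(f"residual harmonic: {res_harmonic:.3f}, residual new direction: {res_new:.3f}")
```

A small residual flags the harmonic as redundant, while the coordinate along the second direction keeps a residual near one and would be selected for the embedding.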

An automatic selection model is provided in `datafold.dynfold.LocalRegressionSelection`. It provides two strategies ("fixed dimension" or "threshold"). Because we know *a priori* that the intrinsic dimension is two, we choose the "fixed dimension" strategy (`strategy="dim"`).

In [None]:
# svdvecs holds the result of the last loop iteration above (landmarks=0.7)
selection = LocalRegressionSelection(
    intrinsic_dim=2, n_subsample=500, strategy="dim"
).fit(svdvecs)
print(f"Found parsimonious eigenvectors (indices): {selection.evec_indices_}")

We see that the selection model was able to find the same eigenvector pair for the embedding that we identified as the best choice before. Finally, we plot the unfolded point cloud:

In [None]:
target_mapping = selection.transform(svdvecs)

f, ax = plt.subplots(figsize=(15, 9))
ax.scatter(
    target_mapping[idx_plot, 0],
    target_mapping[idx_plot, 1],
    c=X_color[idx_plot],
    cmap=plt.cm.Spectral,
);

## Manifold learning on handwritten digits

Disclaimer: Code parts are taken from [scikit-learn: Manifold learning on handwritten digits](https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html). 

Based on the scikit-learn comparison of manifold learning models, we add the `Roseland` algorithm. We will also show the additional functionality of embedding unseen points (out-of-sample extension). Also compare with the Diffusion Maps tutorial `04_basic_dmap_digitclustering.ipynb`.

In [None]:
# Source code taken and adapted from https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html


def plot_embedding(X, y, digits, title=None):
    """Scale and visualize the embedding vectors"""
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure(figsize=[10, 10])
    ax = plt.subplot(111)

    for i in range(X.shape[0]):
        plt.text(
            X[i, 0],
            X[i, 1],
            str(y[i]),
            color=plt.cm.Set1(y[i] / 10.0),
            fontdict={"weight": "bold", "size": 9},
        )

    if hasattr(offsetbox, "AnnotationBbox"):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1.0, 1.0]])  # just something big
        for i in range(X.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                # don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits[i], cmap=plt.cm.gray_r), X[i]
            )
            ax.add_artist(imagebox)
    plt.xticks([])
    plt.yticks([])

    if title is not None:
        plt.title(title)

### Roseland embedding on the entire dataset

We first carry out the same steps as in the scikit-learn [comparison](https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html) to load the handwritten digits from the scikit-learn dataset and embed the entire available dataset. We optimize the parameters `epsilon` and `cut_off` using `PCManifold` (which uses the default `GaussianKernel`). These steps are optional; the data `X` could also have been fitted directly with a user-chosen `epsilon`. We use the default landmark set, which consists of a randomly selected 25% of the original dataset.

We also illustrate the usage of Roseland as a transformer in an `sklearn.pipeline`. The dimensionality of the dataset is first reduced using `PCA` before the further manifold embedding by Roseland. Finally, to choose the svdvector coordinates for the embedding, we employ `LocalRegressionSelection` as above.

In [None]:
digits = load_digits(n_class=6)
X = digits.data
y = digits.target
images = digits.images

In [None]:
X_pcm = pfold.PCManifold(X)
X_pcm.optimize_parameters(result_scaling=2)

In [None]:
transformer = make_pipeline(
    PCA(n_components=8),
    dfold.Roseland(
        n_svdtriplet=6,
        kernel=pfold.GaussianKernel(
            epsilon=X_pcm.kernel.epsilon, distance=dict(cut_off=X_pcm.cut_off)
        ),
        random_state=42,
    ),
    LocalRegressionSelection(intrinsic_dim=2, n_subsample=500, strategy="dim"),
)
projection = transformer.fit_transform(X, y)

In [None]:
plot_embedding(
    projection,
    y,
    images,
    title="Roseland embedding of the digits",
)