(Visit the
[documentation](https://datafold-dev.gitlab.io/datafold/tutorial_index.html) page
to view the executed notebook.)

# Manifold learning on handwritten digits

Disclaimer: Parts of the code are taken from [scikit-learn: Manifold learning on handwritten digits](https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html).

This tutorial extends the scikit-learn comparison of manifold learning models with the `DiffusionMaps` algorithm. We also show how to embed unseen points (out-of-sample extension).

In [None]:
import time

import matplotlib.pyplot as plt
import numpy as np
from matplotlib import offsetbox
from sklearn import datasets
from sklearn.model_selection import train_test_split

import datafold.pcfold as pfold
from datafold.dynfold import DiffusionMaps
from datafold.utils.plot import plot_pairwise_eigenvector

In [None]:
# Source code taken and adapted from https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html


def plot_embedding(X, y, digits, title=None):
    """Scale and visualize the embedding vectors"""
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure(figsize=[10, 10])
    ax = plt.subplot(111)

    for i in range(X.shape[0]):
        plt.text(
            X[i, 0],
            X[i, 1],
            str(y[i]),
            color=plt.cm.Set1(y[i] / 10.0),
            fontdict={"weight": "bold", "size": 9},
        )

    if hasattr(offsetbox, "AnnotationBbox"):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1.0, 1.0]])  # just something big
        for i in range(X.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                # don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits[i], cmap=plt.cm.gray_r), X[i]
            )
            ax.add_artist(imagebox)
    plt.xticks([])
    plt.yticks([])

    if title is not None:
        plt.title(title)

## Generate point cloud of handwritten digits

First, we load the handwritten digits dataset from scikit-learn (only the digits 0-5, as in the comparison). For the separate analysis of out-of-sample embeddings, we also split the dataset into a training set and a test set.

In [None]:
digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
images = digits.images

X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(
    X, y, images, train_size=2 / 3, test_size=1 / 3
)

## Diffusion map embedding on the entire dataset

We first carry out the same steps as in the scikit-learn [comparison](https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html) and embed the entire available dataset. We optimize the parameters `epsilon` and `cut_off` using `PCManifold` (which uses the default `GaussianKernel`). These steps are optional; the data `X` could also have been fitted directly with a user-chosen `epsilon`.
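
To make the roles of the two parameters concrete, here is a minimal NumPy sketch that is independent of *datafold* and does not reproduce what `optimize_parameters` computes. It assumes the Gaussian kernel convention $k(x, y) = \exp(-\|x - y\|^2 / (2\varepsilon))$ and a hypothetical tolerance `tol` below which kernel values are treated as zero; the helper names and the value of `cut_off` are made up for illustration.

```python
import numpy as np


def epsilon_from_cut_off(cut_off, tol=1e-8):
    """Return epsilon such that exp(-cut_off**2 / (2 * epsilon)) equals tol."""
    return cut_off**2 / (-2.0 * np.log(tol))


def truncated_gaussian_kernel(X, epsilon, cut_off):
    """Dense Gaussian kernel matrix; entries for distances > cut_off are zero."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    kernel = np.exp(-sq_dist / (2.0 * epsilon))
    kernel[sq_dist > cut_off**2] = 0.0
    return kernel


rng = np.random.default_rng(0)
points = rng.normal(size=(200, 3))

cut_off = 1.5  # hypothetical value, chosen by hand for this sketch
epsilon = epsilon_from_cut_off(cut_off)
kernel = truncated_gaussian_kernel(points, epsilon, cut_off)

zero_fraction = np.mean(kernel == 0.0)
print(f"epsilon={epsilon:.4f}, cut-off={cut_off}, zero entries={zero_fraction:.1%}")
```

In this sketch a larger `cut_off` keeps more neighbours per point (a denser kernel matrix), while `epsilon` sets how quickly the kernel decays with distance. Coupling the two through a tolerance is one possible design; other rules are equally valid.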

We choose the eigenvector coordinates $\psi_1$ and $\psi_2$, which are the first two non-trivial eigenvectors (the first eigenvector $\psi_0$ is constant with eigenvalue $\lambda=1$). Note that the timings are not directly comparable to the timings stated on the comparison webpage because the code is executed on different hardware.
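
Why the first eigenvector is trivial can be checked with a few lines of NumPy. The sketch below is independent of *datafold* and assumes a plain row-normalized Gaussian kernel (no cut-off, no density normalization, arbitrary `epsilon`). Every row of the resulting Markov matrix sums to one, so the constant vector is an eigenvector with eigenvalue 1.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(100, 2))

# Dense Gaussian kernel; epsilon=0.5 is an arbitrary choice for this sketch
sq_dist = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
kernel = np.exp(-sq_dist / (2.0 * 0.5))

# Row normalization turns the kernel into a Markov (row-stochastic) matrix
markov = kernel / kernel.sum(axis=1, keepdims=True)

# The constant vector is mapped onto itself: markov @ 1 = 1
constant = np.ones(markov.shape[0])
residual = np.linalg.norm(markov @ constant - constant)

# The Markov matrix shares its eigenvalues with this symmetric matrix
row_sums = kernel.sum(axis=1)
symmetric = kernel / np.sqrt(np.outer(row_sums, row_sums))
eigenvalues = np.linalg.eigvalsh(symmetric)[::-1]  # sorted descending

print(f"residual={residual:.2e}, leading eigenvalues={eigenvalues[:3]}")
```

Because the constant eigenvector assigns the same value to every point, it carries no geometric information, which is why the embedding starts at $\psi_1$.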

In [None]:
X_pcm = pfold.PCManifold(X)
X_pcm.optimize_parameters(result_scaling=2)

print(f"epsilon={X_pcm.kernel.epsilon}, cut-off={X_pcm.cut_off}")

t0 = time.time()
dmap = DiffusionMaps(
    kernel=pfold.GaussianKernel(epsilon=X_pcm.kernel.epsilon),
    n_eigenpairs=6,
    dist_kwargs=dict(cut_off=X_pcm.cut_off),
)

dmap = dmap.fit(X_pcm)
dmap = dmap.set_target_coords([1, 2])
X_dmap = dmap.transform(X_pcm)

# Mapping of diffusion maps
plot_embedding(
    X_dmap,
    y,
    images,
    title="Diffusion map embedding of the digits (time %.2fs)" % (time.time() - t0),
)

### Compare different embeddings

It may not always be the best choice to use the first two non-trivial eigenvectors (cf. functional dependence between eigenvectors). We can compare different embeddings by plotting $\psi_1$ against each of $\psi_2, \ldots, \psi_5$ (using `plot_pairwise_eigenvector` from `datafold.utils.plot`).
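
A standard illustration of such a functional dependence, independent of *datafold*: for points on a one-dimensional interval, the eigenvectors behave approximately like cosines of increasing frequency, so $\psi_2$ is close to a quadratic function of $\psi_1$ and adds no new coordinate. The NumPy sketch below assumes a plain row-normalized Gaussian kernel with a hand-picked `epsilon` and measures how well a quadratic polynomial in $\psi_1$ explains $\psi_2$.

```python
import numpy as np

# One-dimensional manifold: evenly spaced points on the unit interval
x = np.linspace(0.0, 1.0, 200)
sq_dist = (x[:, None] - x[None, :]) ** 2
kernel = np.exp(-sq_dist / (2.0 * 0.005))  # epsilon=0.005 chosen by hand

# Eigenpairs of the row-normalized kernel via its symmetric conjugate matrix
row_sums = kernel.sum(axis=1)
symmetric = kernel / np.sqrt(np.outer(row_sums, row_sums))
eigenvalues, vectors = np.linalg.eigh(symmetric)
order = np.argsort(eigenvalues)[::-1]
psi = vectors[:, order] / np.sqrt(row_sums)[:, None]

psi_1, psi_2 = psi[:, 1], psi[:, 2]

# Least-squares fit of psi_2 as a quadratic polynomial in psi_1
coefficients = np.polyfit(psi_1, psi_2, deg=2)
residual = psi_2 - np.polyval(coefficients, psi_1)
r_squared = 1.0 - residual.var() / psi_2.var()

print(f"R^2 of the quadratic fit psi_2 ~ f(psi_1): {r_squared:.4f}")
```

An $R^2$ close to 1 indicates that $\psi_2$ is (nearly) redundant given $\psi_1$. In a pairwise scatter plot the same situation shows up as points lying on a curve rather than filling an area.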

In [None]:
dmap = DiffusionMaps(
    kernel=pfold.GaussianKernel(epsilon=X_pcm.kernel.epsilon),
    n_eigenpairs=6,
    dist_kwargs=dict(cut_off=X_pcm.cut_off),
)
dmap = dmap.fit(X_pcm)
plot_pairwise_eigenvector(
    eigenvectors=dmap.eigenvectors_[:, 1:],
    n=0,
    idx_start=1,
    fig_params=dict(figsize=(10, 10)),
    scatter_params=dict(c=y),
)

## Out-of-sample embedding 

We add another analysis to highlight the out-of-sample functionality of `DiffusionMaps`. This time we use only the training set to fit the model. Afterwards, we embed both the training and the test set and visually compare whether the out-of-sample points are mapped to the same regions as the training points.
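
One common way to embed unseen points is the Nyström extension, which evaluates the kernel between a new point and the training points and reuses the training eigenpairs: $\psi_k(x_\text{new}) = \frac{1}{\lambda_k} \sum_j p(x_\text{new}, x_j)\, \psi_k(x_j)$, where $p$ is the row-normalized kernel. The NumPy sketch below illustrates this idea under simplifying assumptions (dense Gaussian kernel, no cut-off, no density normalization, hand-picked `epsilon`). It is not meant to reproduce the internals of `DiffusionMaps.transform`, and all helper names are hypothetical.

```python
import numpy as np


def gaussian_kernel(A, B, epsilon):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * epsilon))


def fit_eigenpairs(X_train, epsilon, n_eigenpairs):
    """Leading eigenpairs of the row-normalized kernel (Markov) matrix."""
    kernel = gaussian_kernel(X_train, X_train, epsilon)
    row_sums = kernel.sum(axis=1)
    symmetric = kernel / np.sqrt(np.outer(row_sums, row_sums))
    eigenvalues, vectors = np.linalg.eigh(symmetric)
    order = np.argsort(eigenvalues)[::-1][:n_eigenpairs]
    return eigenvalues[order], vectors[:, order] / np.sqrt(row_sums)[:, None]


def nystroem_extension(X_new, X_train, eigenvalues, eigenvectors, epsilon):
    """Evaluate the training eigenvectors at new points."""
    kernel = gaussian_kernel(X_new, X_train, epsilon)
    markov = kernel / kernel.sum(axis=1, keepdims=True)
    return markov @ eigenvectors / eigenvalues


rng = np.random.default_rng(2)
train = rng.normal(size=(150, 2))
# "Unseen" points: slightly perturbed copies of the first 20 training points
unseen = train[:20] + 1e-4 * rng.normal(size=(20, 2))

eigenvalues, eigenvectors = fit_eigenpairs(train, epsilon=0.5, n_eigenpairs=4)
train_mapped = nystroem_extension(train, train, eigenvalues, eigenvectors, 0.5)
unseen_mapped = nystroem_extension(unseen, train, eigenvalues, eigenvectors, 0.5)

print("max deviation on training points:", np.abs(train_mapped - eigenvectors).max())
```

Applied to the training points themselves, the extension reproduces the training eigenvectors up to floating-point error, and points close to a training point land close to its embedding. This is the property that the visual comparison in this section relies on.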

**Note:**
Because this is in the context of unsupervised learning, we cannot easily measure an error. There are strategies such as interpreting the embedding as a classification task, but this is out of scope for this tutorial. 

In [None]:
X_pcm_train = pfold.PCManifold(X_train)
X_pcm_train.optimize_parameters(result_scaling=2)
print(f"epsilon={X_pcm_train.kernel.epsilon}, cut-off={X_pcm_train.cut_off}")

dmap = DiffusionMaps(
    kernel=pfold.GaussianKernel(epsilon=X_pcm_train.kernel.epsilon),
    n_eigenpairs=6,
    dist_kwargs=dict(cut_off=X_pcm_train.cut_off),
)
dmap.fit(X_pcm_train)
dmap = dmap.set_target_coords([1, 2])

X_dmap_train = dmap.transform(X_pcm_train)
X_dmap_test = dmap.transform(X_test)

### Visually compare original mapping versus out-of-sample mapping

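The upper plot shows the embedding of the training data (the data used in `fit`) and the lower plot the embedding of the out-of-sample points. The coloured regions of the digits match in both plots, which indicates that the out-of-sample mapping is consistent with the training embedding.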

In [None]:
plot_embedding(X_dmap_train, y_train, images_train, title="training data")
plot_embedding(X_dmap_test, y_test, images_test, title="out-of-sample data")