# Geometric Harmonics: interpolate function values on data manifold

Geometric harmonics are eigenvectors (corresponding to point evaluations of eigenfunctions) of a kernel matrix computed on data. For example, the kernel matrix from a Gaussian kernel

$$W_{i,j} = \exp{(-d^2_{i,j}/\varepsilon)}, i,j = 1,\ldots,M$$

has geometric harmonics $\phi_i$ and corresponding eigenvalues $\lambda_i$ given by $(\{\phi_i\}_{i=1}^{M}, \{\lambda_i\}_{i=1}^{M}) = \operatorname{eig}(W)$. Interpolation with geometric harmonics builds on the idea of the Nyström extension: instead of extending only the geometric harmonics (i.e. the eigenvectors) themselves to a neighborhood region, the method interpolates arbitrary function values defined on a manifold.

The method is described in detail in

*Coifman, Ronald R., and Stéphane Lafon. “Geometric Harmonics: A Novel Tool for Multiscale out-of-Sample Extension of Empirical Functions.” Applied and Computational Harmonic Analysis 21, no. 1 (July 2006): 31–52. https://doi.org/10.1016/j.acha.2005.07.005.*
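
The idea can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions (dense kernel matrix, a single kernel scale, no multiscale scheme as in the paper), not the `datafold` implementation; the helper names `gaussian_kernel`, `fit_harmonics` and `extend` are hypothetical. Each harmonic is extended to a new point $x$ with the Nyström formula $\phi_j(x) = \lambda_j^{-1} \sum_i k(x, x_i)\, \phi_j(x_i)$, and the function is reassembled from its coefficients $\langle \phi_j, f \rangle$.

```python
import numpy as np
from scipy.spatial.distance import cdist


def gaussian_kernel(A, B, epsilon):
    """Gaussian kernel matrix exp(-d^2 / epsilon) between the rows of A and B."""
    return np.exp(-cdist(A, B, metric="sqeuclidean") / epsilon)


def fit_harmonics(X, f, epsilon, n_eigenpairs):
    """Return the leading kernel eigenvectors and the scaled coefficients of f."""
    W = gaussian_kernel(X, X, epsilon)
    evals, evecs = np.linalg.eigh(W)  # eigh returns eigenvalues in ascending order
    evals = evals[::-1][:n_eigenpairs]
    evecs = evecs[:, ::-1][:, :n_eigenpairs]
    # coefficients <phi_j, f> divided by lambda_j (this is where 1/lambda enters)
    coeffs = (evecs.T @ f) / evals
    return evecs, coeffs


def extend(X_train, X_new, epsilon, evecs, coeffs):
    """Evaluate the interpolant at X_new via the Nystroem extension of each harmonic."""
    return gaussian_kernel(X_new, X_train, epsilon) @ (evecs @ coeffs)


# toy example (assumed data): interpolate sin(x) sampled on a one-dimensional grid
X_demo = np.linspace(0, 2 * np.pi, 60)[:, np.newaxis]
f_demo = np.sin(X_demo[:, 0])
X_new = 0.5 * (X_demo[:-1] + X_demo[1:])  # midpoints between the samples

evecs_demo, coeffs_demo = fit_harmonics(X_demo, f_demo, epsilon=1.0, n_eigenpairs=10)
f_new = extend(X_demo, X_new, 1.0, evecs_demo, coeffs_demo)
mse_demo = np.mean((f_new - np.sin(X_new[:, 0])) ** 2)
print(f"mean squared error at the new points: {mse_demo:.2e}")
```

On the training points the interpolant reduces to the orthogonal projection of $f$ onto the selected harmonics, which is why the number of eigenpairs controls how much of $f$ can be represented.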

**Use case: an out-of-sample method for a Diffusion Maps model**

We set up a Diffusion Maps model that parametrizes an intrinsic low-dimensional manifold $\Psi$ in a high-dimensional dataset $X$. After embedding the available data, we want to learn the mapping

$$f: X \rightarrow \Psi$$

also for new samples $x_{new} \notin X$ (image) or $\psi_{new} \notin \Psi$ (pre-image). This is referred to as an "out-of-sample extension". For image mappings, an out-of-sample method must be able to handle the high-dimensional ambient space of $X$.

The `DiffusionMaps` model maps the three-dimensional swiss-roll data $X$ to the two-dimensional manifold embedding $\Psi$. We then learn an out-of-sample mapping $f$ and pre-image $f^{-1}$ between the two spaces with a geometric harmonics interpolation.

In [None]:
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d.axes3d as p3  # noqa: F401
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
from sklearn.datasets import make_swiss_roll
from sklearn.model_selection import train_test_split

import datafold.dynfold as dfold
import datafold.pcfold as pfold
from datafold import GeometricHarmonicsInterpolator as GHI
from datafold import LaplacianPyramidsInterpolator, LocalRegressionSelection

## Manifold embedding of swiss-roll data with Diffusion Maps

We first generate the swiss-roll dataset with scikit-learn and then fit a `DiffusionMaps` model to compute the manifold embedding on the generated data. This tutorial does not discuss the eigenvector selection in detail: `LocalRegressionSelection` picks two eigenvectors automatically, and the rest of the tutorial assumes that the resulting embedding is $\Psi = \{\psi_1, \psi_5\}$.

In [None]:
# obtain dataset with a lot of points to get accurate eigenfunctions
rng = np.random.default_rng(1)
X_all, color = make_swiss_roll(n_samples=15000, noise=0, random_state=1)

num_eigenpairs = 6

pcm = pfold.PCManifold(X_all)
pcm.optimize_parameters(random_state=0)

dmap = dfold.DiffusionMaps(
    pfold.GaussianKernel(
        epsilon=pcm.kernel.epsilon, distance=dict(cut_off=pcm.cut_off)
    ),
    n_eigenpairs=num_eigenpairs,
)
dmap = dmap.fit(pcm)
evecs, evals = dmap.eigenvectors_, dmap.eigenvalues_

# find the best eigenvectors automatically
selection = LocalRegressionSelection(
    intrinsic_dim=2, n_subsample=500, strategy="dim"
).fit(dmap.eigenvectors_)

psi_all = evecs[:, selection.evec_indices_]

# select a subset of the data to proceed, so that the next computations are less expensive
ind_subset = np.sort(rng.permutation(np.arange(X_all.shape[0]))[0:2500])
X_all = X_all[ind_subset, :]
psi_all = psi_all[ind_subset, :]
color = color[ind_subset]

# Plotting
fig = plt.figure(figsize=[12, 5])

fig.suptitle(f"total #samples={pcm.shape[0]}")

ax = fig.add_subplot(1, 2, 1, projection="3d")
ax.set_title("all data, original coordinates")
ax.scatter(*X_all.T, s=5, c=color, cmap=plt.cm.Spectral)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")

ax = fig.add_subplot(1, 2, 2)
ax.set_title("all data, embedding coordinates")
ax.scatter(*psi_all.T, s=5, c=color, cmap=plt.cm.Spectral)
ax.set_xlabel(r"$\psi_" + str(selection.evec_indices_[0]) + r"$")
ax.set_ylabel(r"$\psi_" + str(selection.evec_indices_[1]) + r"$");

## Image mapping

We now have both the original dataset $X$ (=`X_all`) and the embedding $\Psi$ (=`psi_all`). We proceed to train the `GeometricHarmonicsInterpolator` model, which performs the out-of-sample mapping. For the image mapping, we want to learn the function $f: X \rightarrow \Psi$. Here, the ambient space of $X$ has only three dimensions, mainly for plotting purposes. If the same data manifold were embedded in a much higher-dimensional space, say 100 dimensions, by a linear transformation that preserves pairwise distances, we would obtain the same results, because the Gaussian kernel depends only on those distances.
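
The remark about a higher-dimensional ambient space can be checked numerically. The sketch below assumes that the linear map has orthonormal columns (so that it preserves Euclidean distances); the random data and the names `X_low`, `Q` and `X_high` are made up for illustration. Since a Gaussian kernel only sees pairwise distances, identical distances imply an identical kernel matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# assumed toy data: 200 points in a three-dimensional ambient space
X_low = rng.normal(size=(200, 3))

# Q has shape (100, 3) with orthonormal columns, i.e. Q.T @ Q is the identity
Q, _ = np.linalg.qr(rng.normal(size=(100, 3)))

# the same points, linearly embedded in a 100-dimensional ambient space
X_high = X_low @ Q.T

# pairwise Euclidean distances before and after the embedding
max_dev = np.max(np.abs(pdist(X_low) - pdist(X_high)))
print(f"largest change of a pairwise distance: {max_dev:.2e}")
```

A general linear map (without orthonormal columns) would distort the distances, and the kernel scale `epsilon` would then have to be adapted.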

We split the datasets $X$ and $\Psi$ into a training and a test set. We do not use the test set for parameter optimization, but only later to compare the out-of-sample predictions for "new" points against the ground truth.

In [None]:
# the psi_test values are used for the "ground truth" values to measure an error
# the color values are used for plotting
X_train, X_test, psi_train, psi_test, color_train, color_test = train_test_split(
    X_all, psi_all, color, train_size=2 / 3
)

The `GeometricHarmonicsInterpolator` has two important parameters, the number of geometric harmonics (`n_eigenpairs`) and the kernel scale (`epsilon`). We manually select the two values and refer to the end of this tutorial where we also use Bayesian optimization to find suitable parameters in the pre-image problem. 

Note that the `DiffusionMaps` model also comes with a native image mapping, based on the Nyström extension. A direct comparison, however, is difficult, because here we use the `DiffusionMaps` embedding as ground truth and cannot map truly new samples.

**Note**: The eigenvalues corresponding to the geometric harmonics (kernel eigenvectors) decay, $\lambda_i \rightarrow 0 \text{ for } i \rightarrow \infty$. The geometric harmonics interpolation (like the Nyström extension) involves the reciprocal eigenvalues $1/\lambda_i$. This means that too many eigenpairs can lead to numerical instabilities.
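
The following sketch illustrates how quickly the reciprocal eigenvalues grow. It uses assumed toy data (300 random points in the unit square) and an arbitrarily chosen kernel scale of `0.1`; the exact numbers depend on both and are not meant to match the swiss-roll example.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(300, 2))  # assumed toy data

# dense Gaussian kernel matrix exp(-d^2 / epsilon)
W = np.exp(-cdist(X_demo, X_demo, metric="sqeuclidean") / 0.1)

# eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse to descending
evals = np.linalg.eigvalsh(W)[::-1]

# ratio lambda_1 / |lambda_n|: how strongly the n-th reciprocal eigenvalue
# amplifies errors relative to the leading one
growth = {n: evals[0] / np.abs(evals[n - 1]) for n in (5, 20, 100)}
for n, ratio in growth.items():
    print(f"n_eigenpairs={n:3d}: lambda_1 / |lambda_n| = {ratio:.2e}")
```

Small eigenvalues are dominated by rounding errors (they can even turn slightly negative), so dividing by them amplifies noise rather than adding information. This is one reason to keep `n_eigenpairs` moderate.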

In [None]:
# compute the geometric harmonics from X to \Psi
n_eigenpairs = 100
epsilon = 30

# construct the GeometricHarmonicsInterpolator and fit it to the data.
gh_interpolant = GHI(
    pfold.GaussianKernel(epsilon=epsilon),
    n_eigenpairs=n_eigenpairs,
)

gh_interpolant.fit(X_train, psi_train);

### Evaluate interpolation function for image

Because `GeometricHarmonicsInterpolator` follows the scikit-learn API, we can use the `.score` method to evaluate the residual on the training points and the error on the test points. The default metric is based on the mean squared error.

**Note** that the model scores are negative mean squared errors, so a score closer to zero is better. This follows the definition of scikit-learn's scoring parameter, which [states](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) "higher return values are better than lower return values".
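
The sign convention can be reproduced with plain scikit-learn. The arrays below are made-up values, and computing one score per target column with `multioutput="raw_values"` is an assumption chosen to mirror the two columns of the embedding; it is not taken from the `datafold` source.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# made-up ground truth with two target columns
y_true = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
y_good = y_true + 0.1  # small error in every entry
y_bad = y_true + 1.0  # large error in every entry

# negate the mean squared error so that "higher is better" holds
score_good = -mean_squared_error(y_true, y_good, multioutput="raw_values")
score_bad = -mean_squared_error(y_true, y_bad, multioutput="raw_values")

print("score (good prediction):", score_good)
print("score (bad prediction): ", score_bad)
```

The better prediction receives the higher (less negative) score in both columns, so a parameter search can simply maximize the score.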

In [None]:
residual_train = gh_interpolant.score(X_train, psi_train)
error_test = gh_interpolant.score(X_test, psi_test)

# use pandas for table
df = pd.DataFrame(
    np.vstack([residual_train, error_test]),  # np.row_stack is deprecated in NumPy 2
    index=["residual", "error"],
    columns=["psi1", "psi5"],
)

print(f"mean: \n{df.mean()}")

df

On the left side, we plot the ground truth test set embedding (from the `DiffusionMaps` model). On the right side, we predict the embedding values with geometric harmonics interpolation.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=[12, 5], sharey=True)

ax[0].set_title(r"ground truth test set ${\psi}_{1,5}$")
ax[0].scatter(psi_test[:, 0], psi_test[:, 1], s=10, c=color_test, cmap=plt.cm.Spectral)
ax[0].set_xlabel(r"$\psi_1$")
ax[0].set_ylabel(r"$\psi_5$")

ax[1].scatter(
    *gh_interpolant.predict(X_test).T, s=10, c=color_test, cmap=plt.cm.Spectral
)
ax[1].set_title(r"interpolated values $\hat{\psi}_{1,5}$")
ax[1].set_xlabel(r"$\hat{\psi}_1$")
ax[1].set_ylabel(r"$\hat{\psi}_5$");

In [None]:
psi_test_interp = gh_interpolant.predict(X_test)

error_color = (psi_test[:, 0] - psi_test_interp[:, 0]) ** 2 + (
    psi_test[:, 1] - psi_test_interp[:, 1]
) ** 2

fig, ax = plt.subplots(1, 2, figsize=[12, 5], sharey=True)

sc = ax[0].scatter(psi_test[:, 0], psi_test[:, 1], c=error_color, cmap="Reds")
ax[0].set_title("test data with error (absolute)")
plt.colorbar(sc, ax=ax[0])
ax[0].set_xlabel(r"${\psi}_1$")
ax[0].set_ylabel(r"${\psi}_5$")

# pdist expects a 2D array, so each coordinate column is reshaped to (n_samples, 1):
norm_factor = np.max(
    [
        np.max(pdist(psi_test[:, 0][:, np.newaxis])),
        np.max(pdist(psi_test[:, 1][:, np.newaxis])),
    ]
)  # take max. distance in test dataset as the norming factor

sc = ax[1].scatter(
    psi_test[:, 0],
    psi_test[:, 1],
    vmin=0,
    vmax=0.1,
    c=error_color / norm_factor,
    cmap="Reds",
)
plt.colorbar(sc, ax=ax[1])
ax[1].set_title("test data with error (relative)")
ax[1].set_xlabel(r"${\psi}_1$")
ax[1].set_ylabel(r"${\psi}_5$");

## Pre-Image mapping

We now want to learn the inverse mapping and interpolate the $X$ values (as function values) on $\Psi$.

$$f^{-1}: \Psi \rightarrow X$$
 
In the literature, this is often referred to as the pre-image problem of manifold learning.

We repeat the above steps with the roles of $X$ and $\Psi$ swapped, reusing the embedding and the training and test sets from above.

In the following, we will train a `LaplacianPyramidsInterpolator` model, for details see

Rabin, Neta, and Ronald R. Coifman. "Heterogeneous datasets representation and learning using diffusion maps and Laplacian pyramids." Proceedings of the 2012 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2012. https://doi.org/10.1137/1.9781611972825.17 

The model is generalized to handle multiple target values (here three). 
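
To give an intuition for the method, here is a minimal NumPy sketch of a Laplacian pyramid. It is an illustration under assumptions, not the `datafold` implementation: the functions `fit_pyramid` and `predict_pyramid` are hypothetical, the kernel scale is simply halved at every level, and the stopping rule (root mean squared residual below `residual_tol`) is only assumed to resemble the parameter of the same name. Each level smooths the current residual with a row-normalized Gaussian kernel, and the prediction is the sum of all smoothed levels.

```python
import numpy as np
from scipy.spatial.distance import cdist


def smoother(A, B, epsilon):
    """Row-normalized Gaussian kernel between the rows of A and B."""
    K = np.exp(-cdist(A, B, metric="sqeuclidean") / epsilon)
    row_sums = K.sum(axis=1, keepdims=True)
    # guard against rows that underflow to zero for very small epsilon
    return K / np.maximum(row_sums, np.finfo(float).tiny)


def fit_pyramid(X, y, epsilon0=1.0, residual_tol=1e-3, max_levels=20):
    """Store (epsilon, residual) per level until the residual is small enough."""
    levels, residual, epsilon = [], y.copy(), epsilon0
    for _ in range(max_levels):
        if np.sqrt(np.mean(residual**2)) < residual_tol:
            break
        levels.append((epsilon, residual.copy()))
        # what the current level cannot explain is passed on to the next level
        residual = residual - smoother(X, X, epsilon) @ residual
        epsilon /= 2  # finer kernel scale at the next level
    return levels


def predict_pyramid(X_train, levels, X_new):
    """Sum the smoothed residuals of all levels at the new points."""
    return sum(smoother(X_new, X_train, eps) @ res for eps, res in levels)


# toy example (assumed data) with two target columns
X_demo = np.linspace(0, 2 * np.pi, 100)[:, np.newaxis]
y_demo = np.column_stack([np.sin(X_demo[:, 0]), np.cos(X_demo[:, 0])])
X_new = 0.5 * (X_demo[:-1] + X_demo[1:])
y_new = np.column_stack([np.sin(X_new[:, 0]), np.cos(X_new[:, 0])])

levels = fit_pyramid(X_demo, y_demo)
y_pred = predict_pyramid(X_demo, levels, X_new)
rmse_train = np.sqrt(np.mean((predict_pyramid(X_demo, levels, X_demo) - y_demo) ** 2))
mse_new = np.mean((y_pred - y_new) ** 2)
print(f"levels: {len(levels)}, training RMSE: {rmse_train:.2e}, test MSE: {mse_new:.2e}")
```

Because every target column is smoothed with the same kernel, multiple targets are handled by a single matrix product per level.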

In [None]:
lpi = LaplacianPyramidsInterpolator(residual_tol=0.001)
lpi.fit(psi_train, X_train)
lpi.score(psi_test, X_test)

### Compare ground truth points and points reconstructed with the pre-image map

In [None]:
# plot ground truth and interpolated values
fig = plt.figure(figsize=[12, 5])

ax = fig.add_subplot(121, projection="3d")
ax.scatter(X_test[:, 0], X_test[:, 1], X_test[:, 2], c=color_test, cmap=plt.cm.Spectral)
ax.set_title("ground truth values")
ax.set_xlabel(r"$x_1$")
ax.set_ylabel(r"$x_2$")
ax.set_zlabel(r"$x_3$")

ax = fig.add_subplot(122, projection="3d")
ax.scatter(*(lpi.predict(psi_test)).T, c=color_test, cmap=plt.cm.Spectral)
ax.set_title("interpolated values")
ax.set_xlabel(r"$x_1$")
ax.set_ylabel(r"$x_2$")
ax.set_zlabel(r"$x_3$")

### Compare error of reconstruction in error plot 

In [None]:
# compute and plot error
error_color = np.linalg.norm(X_test - lpi.predict(psi_test), axis=1)

fig = plt.figure(figsize=[12, 5])

ax = fig.add_subplot(121, projection="3d")
sc = ax.scatter(X_test[:, 0], X_test[:, 1], X_test[:, 2], c=error_color, cmap="Reds")
plt.title("test data with error (absolute)")
plt.colorbar(sc)
ax.set_xlabel(r"$x_1$")
ax.set_ylabel(r"$x_2$")
ax.set_zlabel(r"$x_3$")

ax = fig.add_subplot(122, projection="3d")
# pdist expects a 2D array, so each coordinate column is reshaped to (n_samples, 1):
norm_factor = np.max(
    [
        np.max(pdist(X_test[:, 0][:, np.newaxis])),
        np.max(pdist(X_test[:, 1][:, np.newaxis])),
        np.max(pdist(X_test[:, 2][:, np.newaxis])),
    ]
)  # take max. distance in test dataset as the norming factor

sc = ax.scatter(
    X_test[:, 0],
    X_test[:, 1],
    X_test[:, 2],
    vmin=0,
    vmax=0.1,
    c=error_color / norm_factor,
    cmap="Reds",
)
plt.colorbar(sc)
plt.title("test data with error (relative)")
ax.set_xlabel(r"$x_1$")
ax.set_ylabel(r"$x_2$")
ax.set_zlabel(r"$x_3$");

The reconstruction error is largest at the edges of the manifold.