# Embeddings invariant to the observables with Mahalanobis kernel

This short notebook demonstrates how to use the Mahalanobis distance, via the predefined MahalanobisKernel in datafold.pcfold.kernels, to obtain an embedding that is invariant to the observation function. The idea and the example are taken from the following paper:

**Reference:**

Singer, A. & Coifman, R. R.: Non-linear independent component analysis with diffusion maps.
Applied and Computational Harmonic Analysis, Elsevier BV, 2008, 25, 226-239. Available at: https://www.doi.org/10.1016/j.acha.2007.11.001

In [None]:
import matplotlib.pyplot as plt
import numpy as np

import datafold.dynfold as dfold
import datafold.pcfold as pfold
from datafold.pcfold.kernels import GaussianKernel, MahalanobisKernel

# Example data

The standard example for this idea maps a square to a "mushroom" through a transformation function that, in real applications, we assume is unknown. We only need access to the neighbors of each point in the original space, pushed forward through the transformation. Actually, even less is needed: just the covariance matrix of these pushed-forward neighbors.
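To see why the covariance matrix carries the relevant information, the following minimal sketch compares the empirical covariance of a pushed-forward neighborhood with its linearization $\varepsilon^2 J J^T$, where $J$ is the Jacobian of the transformation. The helper `jacobian`, the point `x0`, and the sample size are assumptions made only for this illustration; in a real application the Jacobian is unknown and only the empirical covariance is available.

```python
import numpy as np


def transformation(x):
    # same map as in this notebook: square -> mushroom
    return np.column_stack([x[:, 0] + x[:, 1] ** 3, x[:, 1] - x[:, 0] ** 3])


def jacobian(x):
    # analytic Jacobian at a single point (hypothetical helper, illustration only)
    return np.array([[1.0, 3.0 * x[1] ** 2], [-3.0 * x[0] ** 2, 1.0]])


rng = np.random.default_rng(0)
eps = 1e-2
x0 = np.array([0.3, 0.7])  # arbitrary point in the square

# empirical covariance of the neighborhood after the transformation
neighbors_x = rng.normal(loc=x0, scale=eps, size=(20000, 2))
cov_empirical = np.cov(transformation(neighbors_x).T)

# first-order prediction: isotropic noise eps^2 * I pushed through J
J = jacobian(x0)
cov_linearized = eps**2 * J @ J.T

rel_err = np.linalg.norm(cov_empirical - cov_linearized) / np.linalg.norm(
    cov_linearized
)
print(f"relative difference: {rel_err:.3f}")
```

The relative difference shrinks as the number of neighbors grows and as the neighborhood radius decreases.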

In [None]:
random_state = 1
n_pts = 2500
n_neighbors = 100
eps_covariance = 1e-2
pinv_tol = np.exp(-11)

rng = np.random.default_rng(random_state)


def transformation(x):
    return np.column_stack([x[:, 0] + x[:, 1] ** 3, x[:, 1] - x[:, 0] ** 3])


# sample original data (not used until we compare at the end)
data_x = rng.uniform(low=0, high=1, size=(n_pts, 2))
# sample transformed data
data_y_original = rng.uniform(low=0, high=1, size=(n_pts, 2))
data_y = transformation(data_y_original)

# sample covariance data from neighborhoods
covariances = np.zeros((n_pts, 2, 2))
for k in range(n_pts):
    neighbors_x = rng.normal(
        loc=data_y_original[k, :], scale=eps_covariance, size=(n_neighbors, 2)
    )
    neighbors_y = transformation(neighbors_x)
    covariances[k, :, :] = np.cov(neighbors_y.T)

fig, ax = plt.subplots(1, 2, figsize=(10, 6))
ax[0].scatter(*data_x.T, s=5, c=data_x[:, 0])
ax[0].scatter(*neighbors_x.T, s=5, c="red")
ax[0].set_xticks([0, 0.5, 1])
ax[0].set_yticks([0, 0.5, 1])
ax[0].set_xlabel(r"$x_1$")
ax[0].set_ylabel(r"$x_2$")

ax[1].scatter(*data_y.T, s=5, c=data_y_original[:, 0])
ax[1].scatter(*neighbors_y.T, s=5, c="red")
ax[1].set_xlabel(r"$y_1=x_1+x_2^3$")
ax[1].set_ylabel(r"$y_2=x_2-x_1^3$")

fig.tight_layout()

In [None]:
# compute the pseudo-inverses of the covariances; they are passed to the kernel later
covariances_inv = np.zeros_like(covariances)

some_evals1 = []
some_ranks1 = []

from time import time

t0 = time()
print("Computing %g inverse matrices..." % (n_pts), end="")
for k in range(n_pts):
    covariances_inv[k, :, :] = np.linalg.pinv(covariances[k, :, :], rcond=pinv_tol)
    if k < 1000:
        evals = np.linalg.eigvals(covariances[k, :, :])
        some_evals1.append(evals)
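        # note: this counts eigenvalues above an absolute threshold, whereas
        # np.linalg.pinv applies rcond relative to the largest singular value
        some_ranks1.append(np.sum(evals > pinv_tol))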
print(f"done in {time()-t0} seconds.")

# np.row_stack is deprecated in NumPy 2.0; np.vstack is the equivalent call
some_evals1 = np.vstack(some_evals1)
some_ranks1 = np.vstack(some_ranks1)

Here we can check whether most covariance matrices have the correct rank (two). The tolerance of the pseudo-inverse should be chosen so that most matrices have the correct rank.
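When choosing the tolerance, it may help to keep in mind that `np.linalg.pinv` interprets `rcond` relative to the largest singular value of each matrix, while a direct comparison of eigenvalues against the tolerance is an absolute criterion. The sketch below contrasts the two on a hypothetical covariance matrix (the eigenvalues `1e-4` and `1e-6` are made up for illustration).

```python
import numpy as np

pinv_tol = np.exp(-11)  # same value as in this notebook, roughly 1.7e-5

# hypothetical covariance matrix with one small eigenvalue
cov = np.diag([1e-4, 1e-6])
evals = np.linalg.eigvalsh(cov)

# absolute criterion: eigenvalues larger than the tolerance itself
rank_absolute = int(np.sum(evals > pinv_tol))

# relative criterion: eigenvalues larger than tolerance * largest eigenvalue
rank_relative = int(np.sum(evals > pinv_tol * evals.max()))

# rank of the pseudo-inverse that np.linalg.pinv actually returns
rank_pinv = int(np.linalg.matrix_rank(np.linalg.pinv(cov, rcond=pinv_tol)))

print(rank_absolute, rank_relative, rank_pinv)
```

For this matrix the absolute criterion reports rank one, while the pseudo-inverse keeps both directions, so the two views can disagree for matrices with small but non-negligible eigenvalues.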

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].hist(np.log(np.abs(some_evals1.ravel())), 150)
ax[0].plot([np.log(pinv_tol), np.log(pinv_tol)], [0, 300], "r-", label="cutoff")
ax[0].set_title("covariance eigenvalue distribution")
ax[0].set_xlabel(r"log $\lambda$")
ax[0].legend()

rank_bins = np.arange(0, 10) - 0.5
ax[1].hist(some_ranks1, rank_bins - 0.05, alpha=0.5)
ax[1].set_xlim([0, 5])
ax[1].set_title("covariance rank distribution")
ax[1].set_xlabel("rank")

# Invariant embedding

Once the covariance matrices are obtained, we can use them to create an embedding of the mushroom data that is invariant to the observation function (here, the function from x to y). In essence, this creates an embedding of the original square points x, even though we use the mushroom data y as input.

The MahalanobisKernel from datafold used in Diffusion maps internally constructs a sparse distance matrix and also automatically determines a good kernel bandwidth.
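The invariance rests on the Mahalanobis-type distance from the cited paper, which approximates the distance in the original space using only observed quantities:

$$\|x_i - x_j\|^2 \approx \tfrac{\varepsilon^2}{2}\,(y_i - y_j)^T \left(C_i^{\dagger} + C_j^{\dagger}\right)(y_i - y_j),$$

where $C_i$ is the local covariance at $y_i$ and $\varepsilon$ is the neighborhood scale. The following sketch checks this for two nearby points. It uses the analytic Jacobian to build idealized covariances $C = \varepsilon^2 J J^T$, which is an assumption for illustration only; it does not claim to reproduce how MahalanobisKernel computes distances internally.

```python
import numpy as np


def transformation(x):
    # same map as in this notebook: square -> mushroom
    return np.column_stack([x[:, 0] + x[:, 1] ** 3, x[:, 1] - x[:, 0] ** 3])


def jacobian(x):
    # analytic Jacobian at a single point (hypothetical helper, illustration only)
    return np.array([[1.0, 3.0 * x[1] ** 2], [-3.0 * x[0] ** 2, 1.0]])


def inv_cov(x, eps):
    # inverse of the idealized local covariance eps^2 * J J^T
    J = jacobian(x)
    return np.linalg.inv(eps**2 * J @ J.T)


eps = 1e-2
xi = np.array([0.30, 0.70])  # two arbitrary nearby points in the square
xj = np.array([0.32, 0.69])

yi = transformation(xi[None, :])[0]
yj = transformation(xj[None, :])[0]
dy = yi - yj

# symmetrized Mahalanobis distance in the observed space
d2_mahalanobis = 0.5 * dy @ (inv_cov(xi, eps) + inv_cov(xj, eps)) @ dy
d2_recovered = eps**2 * d2_mahalanobis

d2_x = np.sum((xi - xj) ** 2)  # squared distance in the original space
d2_y = np.sum(dy**2)  # plain squared Euclidean distance in the observed space

print(f"recovered: {d2_recovered:.3e}, original: {d2_x:.3e}, observed: {d2_y:.3e}")
```

The rescaled Mahalanobis distance is close to the squared distance in the original space, whereas the plain Euclidean distance between the observations is clearly different.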

In [None]:
# compute diffusion maps with the given Mahalanobis metric
n_evecs = 10

t0 = time()
pcm = pfold.PCManifold(data_y)
pcm.optimize_parameters(random_state=1, k=30, result_scaling=1)

dmap = dfold.DiffusionMaps(
    kernel=MahalanobisKernel(epsilon=8, distance={"cut_off": pcm.cut_off, "k": 30}),
    n_eigenpairs=n_evecs,
)
dmap.fit(data_y, kernel_kwargs=dict(cov_matrices=covariances_inv))
evecs1, evals1 = dmap.eigenvectors_, dmap.eigenvalues_

print(f"Diffusion map done in {time()-t0} seconds.")

In [None]:
rng = np.random.default_rng(random_state)

# Automatically select the embedding eigenfunctions
lrs = dfold.LocalRegressionSelection(intrinsic_dim=2, n_subsample=500, strategy="dim")
selection1 = lrs.fit(evecs1)

# plot the results
idx_ev = rng.permutation(evecs1.shape[0])[0:2000]

fig, ax = plt.subplots(1, 2, figsize=(10, 6))
ax[0].plot(selection1.residuals_, ".-")
# mark the selected eigenvectors at their actual indices
ax[0].plot(
    selection1.evec_indices_[:2],
    selection1.residuals_[selection1.evec_indices_[:2]],
    "r.",
    label="selected",
)
ax[0].legend()

ax[1].scatter(
    evecs1[idx_ev, selection1.evec_indices_[0]],
    evecs1[idx_ev, selection1.evec_indices_[1]],
    s=2,
    c=data_y_original[idx_ev, 0],
)
ax[1].set_title("invariant embedding")
ax[1].set_xlabel(r"$\phi_1$")
ax[1].set_ylabel(r"$\phi_2$")
ax[1].set_aspect(1)

fig.tight_layout()

# Demonstrating similarity to original

We can now demonstrate that the invariant embedding of the mushroom data y is the same embedding (up to isometry, here: rotation) as the one we would have obtained by directly embedding the square.
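One way to quantify "the same up to isometry" is an orthogonal Procrustes fit, which finds the rotation or reflection that best aligns two point sets. The sketch below is a generic NumPy illustration on synthetic data, where `emb_a` and `emb_b` are hypothetical stand-ins for two embeddings of the same points. Note that in this notebook the x and y data are independent random samples, so a pointwise alignment would first require embedding the same underlying points, and a difference in scale between the eigenvectors would have to be handled separately.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in for a two-dimensional embedding
emb_a = rng.uniform(size=(500, 2))

# second "embedding": the same points, centered and rotated by a known angle
theta = 0.7
rotation = np.array(
    [[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]]
)
emb_b = (emb_a - emb_a.mean(axis=0)) @ rotation.T


def procrustes_residual(a, b):
    """Relative misfit after the best orthogonal alignment of a onto b."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    # the orthogonal q minimizing ||a @ q - b||_F is u @ vt, with a.T @ b = u s vt
    u, _, vt = np.linalg.svd(a.T @ b)
    q = u @ vt
    return np.linalg.norm(a @ q - b) / np.linalg.norm(b)


residual = procrustes_residual(emb_a, emb_b)
print(f"relative residual after alignment: {residual:.2e}")
```

A residual close to zero indicates that the two point sets differ only by a rotation or reflection.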

In [None]:
t0 = time()
pcm = pfold.PCManifold(data_x)
pcm.optimize_parameters(random_state=1, k=30, result_scaling=1)
dmap = dfold.DiffusionMaps(
    kernel=GaussianKernel(epsilon=8, distance={"cut_off": pcm.cut_off, "k": 30}),
    n_eigenpairs=n_evecs,
)
dmap.fit(pcm)
evecs2, evals2 = dmap.eigenvectors_, dmap.eigenvalues_
print(f"Diffusion map on x data done in {time()-t0} seconds.")

In [None]:
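# use a separate selector: if fit() returns the estimator itself (the
# scikit-learn convention), refitting lrs would overwrite selection1
selection2 = dfold.LocalRegressionSelection(
    intrinsic_dim=2, n_subsample=500, strategy="dim"
).fit(evecs2)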

fig, ax = plt.subplots(1, 2, figsize=(10, 6), sharey=True)

ax[0].scatter(
    evecs1[idx_ev, selection1.evec_indices_[0]],
    evecs1[idx_ev, selection1.evec_indices_[1]],
    s=2,
    c=data_y_original[idx_ev, 0],
)
ax[0].set_title("embedding of y")
ax[0].set_xlabel(r"$\phi_1$")
ax[0].set_ylabel(r"$\phi_2$")
ax[0].set_aspect(1)

ax[1].scatter(
    evecs2[idx_ev, selection2.evec_indices_[0]],
    evecs2[idx_ev, selection2.evec_indices_[1]],
    s=2,
    c=data_x[idx_ev, 0],
)
ax[1].set_title("embedding of x")
ax[1].set_xlabel(r"$\psi_1$")
ax[1].set_ylabel(r"$\psi_2$")
ax[1].set_aspect(1)

fig.tight_layout()