(Visit the
[documentation](https://datafold-dev.gitlab.io/datafold/tutorial_index.html) page
to view the executed notebook.)

# Uniform subsampling of point cloud manifold

In this tutorial, we use the `PCManifold` class to subsample a massive dataset. We highlight a *datafold* method that subsamples the dataset so that the remaining points are distributed near-uniformly over the manifold.

In contrast to randomly selecting samples from the dataset, we aim to subsample the data such that it covers the manifold uniformly. That is, we want similar sampling densities over the manifold (including its boundaries) and therefore keep proportionally fewer samples in high-density regions and more samples in low-density regions.
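
The idea can be illustrated with a minimal NumPy sketch. It is not *datafold*'s implementation: the function name `greedy_min_distance_subsample` and the toy data are assumptions made for illustration only. A point is kept only if it is at least `min_distance` away from every point kept so far, which thins a dense cluster much more strongly than a sparse background.

```python
import numpy as np


def greedy_min_distance_subsample(points, min_distance):
    """Return indices of a subsample whose points are pairwise >= min_distance apart.

    Illustrative brute-force version (hypothetical helper, not part of datafold).
    """
    kept = [0]  # always keep the first point
    for i in range(1, len(points)):
        dists = np.linalg.norm(points[kept] - points[i], axis=1)
        if np.all(dists >= min_distance):
            kept.append(i)
    return np.asarray(kept)


rng = np.random.default_rng(0)
# toy data: a dense cluster plus a sparse uniform background in the unit square
dense = rng.normal(loc=0.25, scale=0.03, size=(1500, 2))
sparse = rng.random((500, 2))
points = rng.permutation(np.vstack([dense, sparse]))  # shuffle the rows

idx = greedy_min_distance_subsample(points, min_distance=0.05)
subsample = points[idx]
print(f"kept {len(subsample)} of {len(points)} points")
```

The brute-force loop compares every candidate with all kept points, so it only suits small examples like this one.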

In [None]:
import time

import matplotlib.pyplot as plt
import numpy as np

import datafold.pcfold as pfold
from datafold.utils.general import random_subsample

## Create the dataset

We create a dataset of 10 million two-dimensional points; the low dimension is chosen for visualization purposes. The generated dataset has regions of higher sampling density, which can be a property of a hidden system but could just as well be an artefact of data collection or measurement.


**NOTE:**

The default value of 10 million samples may be too large for the available RAM (we used a laptop with 16 GiB RAM). In that case, reduce `n_pts` in the cell below.
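
To judge whether a given number of points fits into memory, it can help to compute the footprint of the raw array first. The sketch below assumes the `float64` arrays that `np.random.rand` returns; the helper name `array_mib` is hypothetical. The result is only a lower bound, because intermediate arrays created during the computations need additional memory.

```python
import numpy as np


def array_mib(n_pts, n_dim=2, dtype=np.float64):
    """Size in MiB of a dense (n_pts, n_dim) array of the given dtype."""
    return n_pts * n_dim * np.dtype(dtype).itemsize / 2**20


for n in (int(1e5), int(1e6), int(1e7)):
    print(f"n_pts={n:>9d}: {array_mib(n):8.2f} MiB")

# cross-check the formula against an actual (small) array
small = np.random.rand(1000, 2)
print(small.nbytes / 2**20 == array_mib(1000))
```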


In [None]:
# create large data set with non-uniform sampling
np.random.seed(5)
n_pts = int(1e7)  # default 1e7

data = np.random.rand(n_pts, 2)
data[:, 0] = np.sin(4 * data[:, 0]) ** 2 / 5 + data[:, 0]
data[:, 1] = np.cos(2 * data[:, 0]) ** 2 / 5 + data[:, 1]

# plot
plot_idx = np.random.permutation(n_pts)[0:5000]

fig = plt.figure(figsize=(6, 6))
plt.scatter(*data[plot_idx, :].T, s=1)
plt.title("Full dataset, only showing %g points" % (len(plot_idx)));

## `PCManifold`: estimate parameters and subsample with `pcm_subsample`

A uniform sampling density is a useful property for manifold learning algorithms that cannot "correct" the sampling density.  

We first estimate a cut-off with `PCManifold.optimize_parameters`. By default, `pcm_subsample` uses this cut-off internally to set its parameter `min_distance = cut_off * 2`. For the first subsample, we instead set `min_distance = cut_off * 10` because the number of points in the full dataset is so large.

In the following, we subsample the large dataset three times consecutively (each subsample is drawn from the previous subsample).
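
The effect of consecutive subsampling can be sketched with plain NumPy. This is a toy illustration under stated assumptions, not the *datafold* code path: the helper `thin` is hypothetical, and the hand-picked distances stand in for the values that the tutorial derives from the estimated cut-off. Each pass works on the result of the previous pass with a larger minimum distance, so the point cloud shrinks step by step.

```python
import numpy as np


def thin(points, min_distance):
    """Greedy thinning: indices of points that are pairwise >= min_distance apart.

    Hypothetical brute-force helper for illustration only.
    """
    kept = [0]
    for i in range(1, len(points)):
        dists = np.linalg.norm(points[kept] - points[i], axis=1)
        if np.all(dists >= min_distance):
            kept.append(i)
    return np.asarray(kept)


rng = np.random.default_rng(1)
# squaring uniform samples concentrates points near the origin (non-uniform density)
points = rng.random((3000, 2)) ** 2

current = points
sizes = []
for min_distance in (0.01, 0.03, 0.09):  # assumed values, increasing per pass
    current = current[thin(current, min_distance)]
    sizes.append(len(current))
    print(f"min_distance={min_distance}: {len(current)} points left")
```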

In [None]:
t0 = time.time()
pcm_original = pfold.PCManifold(data)
# only use 10 samples to estimate scales, otherwise the memory requirements are too high
pcm_original.optimize_parameters(n_subsample=10)
min_distance = pcm_original.cut_off * 10
print(
    f"optimize took {time.time()-t0:.3f}s and is now using min_distance={min_distance:.3f}"
)

t0 = time.time()
pcm_subsample, indices = pfold.pcm_subsample(pcm_original, min_distance=min_distance)
print(
    f"first subsample took {time.time()-t0:.3f}s and has "
    f"n_samples={pcm_subsample.shape[0]} using min_distance={min_distance:.3f}"
)

# subsample on first subsample
t0 = time.time()
pcm_subsample.optimize_parameters(n_subsample=1000)
pcm_sub2sample, indices = pfold.pcm_subsample(pcm_subsample)
print(
    f"second subsample took {time.time()-t0:.3f}s and has "
    f"n_samples={pcm_sub2sample.shape[0]} with min_distance={pcm_subsample.cut_off * 2:.3f}"
)

# subsample on second subsample
t0 = time.time()
pcm_sub2sample.optimize_parameters(n_subsample=1000)
pcm_sub3sample, indices = pfold.pcm_subsample(pcm_sub2sample)
print(
    f"third subsample took {time.time()-t0:.3f}s and has "
    f"n_samples={pcm_sub3sample.shape[0]} with min_distance={pcm_sub2sample.cut_off * 2:.3f}"
)

### Plot subsampled point clouds

In [None]:
fig, ax = plt.subplots(1, 3, sharey=True, figsize=(16, 5))
ax[0].scatter(*pcm_subsample.T, s=2)
ax[0].set_title("#pts: %g" % pcm_subsample.shape[0])

ax[1].scatter(*pcm_sub2sample.T, s=2)
ax[1].set_title("#pts: %g" % pcm_sub2sample.shape[0])

ax[2].scatter(*pcm_sub3sample.T, s=2)
ax[2].set_title("#pts: %g" % pcm_sub3sample.shape[0]);

## Compare uniform to random subsampling

We visually compare the manifold subsampling to naive random subsampling. For this, we set the parameter `min_distance` directly, which also controls the number of points in the subsample.

From the plots, we can see that random subsampling removes every sample with the same probability and therefore does not account for the sampling density in the dataset. Regions that are already sparse in the original dataset become even sparser in the subsample. In contrast, the uniform subsampling keeps proportionally more samples in sparse regions than in dense regions.
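
The visual impression can also be quantified. The following sketch is an assumption-laden toy example in plain NumPy, not part of *datafold*: the helpers `thin` and `cell_count_cv` are hypothetical. It bins a subsample into a regular grid and computes the coefficient of variation of the cell counts, where a lower value indicates a more uniform coverage.

```python
import numpy as np


def thin(points, min_distance):
    """Greedy thinning (hypothetical helper): keep points pairwise >= min_distance apart."""
    kept = [0]
    for i in range(1, len(points)):
        dists = np.linalg.norm(points[kept] - points[i], axis=1)
        if np.all(dists >= min_distance):
            kept.append(i)
    return np.asarray(kept)


def cell_count_cv(points, bins=10):
    """Coefficient of variation of the point counts on a bins x bins grid over [0, 1]^2."""
    counts, _, _ = np.histogram2d(
        points[:, 0], points[:, 1], bins=bins, range=[[0, 1], [0, 1]]
    )
    return counts.std() / counts.mean()


rng = np.random.default_rng(0)
# toy data: dense cluster plus sparse uniform background
dense = rng.normal(loc=0.25, scale=0.03, size=(1500, 2))
points = rng.permutation(np.vstack([dense, rng.random((500, 2))]))

idx_thinned = thin(points, min_distance=0.05)
# random subsample of the same size for a fair comparison
idx_random = rng.choice(len(points), size=len(idx_thinned), replace=False)

cv_thinned = cell_count_cv(points[idx_thinned])
cv_random = cell_count_cv(points[idx_random])
print(f"cell-count variation: thinned={cv_thinned:.2f}, random={cv_random:.2f}")
```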

There is currently no direct control over how many points the manifold subsampling returns. Another disadvantage is that the manifold subsampling is computationally costly compared to random subsampling (skipping the `cut_off` optimization and setting `min_distance` directly greatly improves the computation speed).
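
One possible workaround for the missing size control is to search for a suitable `min_distance` by bisection. The sketch below is a toy version in plain NumPy; `thin` and `min_distance_for_target` are hypothetical helpers and not part of *datafold*. It relies on the assumption that the subsample size shrinks, at least approximately, as `min_distance` grows, and each bisection step requires a full subsampling run, so the search is expensive on large datasets.

```python
import numpy as np


def thin(points, min_distance):
    """Greedy thinning (hypothetical helper): keep points pairwise >= min_distance apart."""
    kept = [0]
    for i in range(1, len(points)):
        dists = np.linalg.norm(points[kept] - points[i], axis=1)
        if np.all(dists >= min_distance):
            kept.append(i)
    return np.asarray(kept)


def min_distance_for_target(points, n_target, lo=1e-3, hi=1.0, n_iter=15):
    """Bisect on min_distance; return a distance whose subsample has <= n_target points."""
    n_hi = len(thin(points, hi))
    if len(thin(points, lo)) <= n_target or n_hi > n_target:
        raise ValueError("n_target is not bracketed by the interval [lo, hi]")
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        n_mid = len(thin(points, mid))
        if n_mid > n_target:
            lo = mid  # too many points left: increase the distance
        else:
            hi, n_hi = mid, n_mid  # at or below target: try a smaller distance
    return hi, n_hi


rng = np.random.default_rng(2)
points = rng.random((1000, 2))

d_found, n_found = min_distance_for_target(points, n_target=100)
print(f"min_distance={d_found:.4f} yields {n_found} points (target: at most 100)")
```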

In [None]:
# Set "min_distance" directly to steer the number of points

min_distance1 = 0.02
min_distance2 = 0.01

print(f"min_distance={min_distance1}")
print("----------------------------")
t0 = time.time()
pcm_dist1, indices1 = pfold.pcm_subsample(pcm_original, min_distance=min_distance1)
print(f"manifold subsampling took {time.time() - t0} s")

t0 = time.time()
pcm_random1, indices_random1 = random_subsample(pcm_original, n_samples=len(indices1))
print(f"random subsampling took {time.time() - t0} s")

# Decrease "min_distance" to obtain a larger subsample set
print("")
print(f"min_distance={min_distance2}")
print("----------------------------")

t0 = time.time()
pcm_dist2, indices2 = pfold.pcm_subsample(pcm_original, min_distance=min_distance2)
print(
    f"manifold subsampling with min_distance={min_distance2} took {time.time() - t0} s"
)

t0 = time.time()
pcm_random2, indices_random2 = random_subsample(pcm_original, n_samples=len(indices2))
print(f"random subsampling took {time.time() - t0} s")

fig, ax = plt.subplots(2, 3, figsize=(16, 10))

# first plot row
ax[0][0].scatter(*data[plot_idx, :].T, s=1)
ax[0][0].set_title(
    f"original dataset (#pts: {data.shape[0]}) (showing {len(plot_idx)} pts)"
)

ax[0][1].scatter(*pcm_random1.T, s=1)
ax[0][1].set_title(f"#pts: {pcm_random1.shape[0]}, randomized subsample")

ax[0][2].scatter(*pcm_dist1.T, s=1)
ax[0][2].set_title(
    f"#pts: {pcm_dist1.shape[0]}, iterative manifold subsample \n (min_distance={min_distance1})"
)

# second plot row
ax[1][0].scatter(*data[plot_idx, :].T, s=1)
ax[1][0].set_title(
    f"original dataset (#pts: {data.shape[0]}) (showing {len(plot_idx)} pts)"
)

ax[1][1].scatter(*pcm_random2.T, s=1)
ax[1][1].set_title(f"#pts: {pcm_random2.shape[0]}, randomized subsample")

ax[1][2].scatter(*pcm_dist2.T, s=1)
ax[1][2].set_title(
    f"#pts: {pcm_dist2.shape[0]}, iterative manifold subsample \n (min_distance={min_distance2})"
);