# Local Homology

In this tutorial we use tools from Topological Data Analysis (TDA) to study the local geometry of point clouds.

If you are looking at a static version of this notebook and would like to run its contents, head over to [GitHub](https://github.com/giotto-ai/giotto-tda/blob/master/examples/local_homology.ipynb) and download the source.

## See also

- [Topological feature extraction using VietorisRipsPersistence and PersistenceEntropy](https://giotto-ai.github.io/gtda-docs/latest/notebooks/vietoris_rips_quickstart.html) for a quick introduction to general topological feature extraction in ``giotto-tda``.
- [Local Homology NLP Use Cases: Unsupervised text disambiguation](https://giotto-ai.github.io/gtda-docs/latest/notebooks/local_hom_NLP_disambiguation.html), in which the ideas presented here are applied to natural language data!

**License: AGPLv3**

In [None]:
import numpy as np

from gtda.plotting import plot_point_cloud
from gtda.local_homology import KNeighborsLocalVietorisRips, RadiusLocalVietorisRips

# Dimensionality

## What is "dimensionality", and how can we formalize it? 

In this tutorial we will learn how to algorithmically distinguish between possible *local geometries* around data points. Consider, for instance, the local geometry around the red dot in the three shapes below:

![locally_different](images/locally_different.png)

## How?

We propose an answer to this question based on the notion of **local homology**. We estimate the "local dimension" around each point by studying the topological properties of the data points' neighbourhoods. We start by studying three shapes which differ in dimension, namely the line, the plane, and the cube.

In [None]:
n = 10
line_1d = np.linspace(0, 1, num=n, endpoint=False)

line = np.c_[np.zeros((n, 2)), line_1d]
plane = np.array([[0, y, z] for y in line_1d for z in line_1d])
cube = np.array([[x, y, z] for x in line_1d for y in line_1d for z in line_1d])

line_plane_cube = np.vstack((line, plane + [0.5, -0.5, 0], cube + [1, -0.5, 0]))
plot_point_cloud(line_plane_cube)

## The method

As explained in [Topological feature extraction using VietorisRipsPersistence and PersistenceEntropy](https://giotto-ai.github.io/gtda-docs/latest/notebooks/vietoris_rips_quickstart.html), "persistence diagrams" are a common and useful way to store information about the topology and geometry of data. Their content can be summarized and made interpretable even further by means of a variety of featurization methods.

In this notebook, we show how to compute and featurize persistence diagrams for *local neighbourhoods* of all vectors in a dataset (e.g. in a point cloud), in such a way that each point's features capture the dataset's dimensionality around that point. We do this in steps:
1. create "local point clouds" around our data points and compute their persistence diagrams in a few chosen dimensions (more on this in "How does it work?" below);
2. featurize the "local persistence diagrams" using a modified version of the *persistence entropy* computed by the ``PersistenceEntropy`` transformer;
3. plot the original point cloud with its points coloured according to the value of the dimension-$i$ feature.
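
To build intuition for step 2, here is a minimal sketch of the "modified persistence entropy" for a single homology dimension, in plain Python. It assumes that the entropy in question is the base-2 Shannon entropy of the normalized bar lifetimes (death minus birth). The helper name `exp_persistence_entropy` is hypothetical and not part of ``giotto-tda``, and the sketch does not model how the library handles empty diagrams.

```python
import math

def exp_persistence_entropy(bars):
    """Return 2 ** H, where H is the base-2 Shannon entropy of the normalized lifetimes.

    `bars` is a list of (birth, death) pairs for one homology dimension.
    Hypothetical helper, for illustration only.
    """
    lifetimes = [death - birth for birth, death in bars if death > birth]
    total = sum(lifetimes)
    if total == 0:
        # No bar with positive lifetime: the entropy is undefined in this sketch.
        return None
    probs = [lifetime / total for lifetime in lifetimes]
    entropy = -sum(p * math.log2(p) for p in probs)
    return 2 ** entropy

# One bar, three bars of (nearly) equal lifetime, and one long bar plus one very short bar
one_bar = exp_persistence_entropy([(0.1, 0.3)])
three_bars = exp_persistence_entropy([(0.1, 0.3), (0.2, 0.4), (0.0, 0.2)])
uneven = exp_persistence_entropy([(0.0, 1.0), (0.0, 0.01)])
```

Under these assumptions, $n$ bars of equal lifetime give a value of $n$, while a very short bar next to a long one barely moves the value away from 1. This is why the feature can be read as an "effective number" of topological features in a given dimension.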

Let us see this sequence of steps in action in the case of the point cloud representing a line. The colours will correspond to the value of the dimension-1 local topological feature. The local persistence diagrams are computed here using the ``KNeighborsLocalVietorisRips`` transformer.

In [None]:
from gtda.diagrams import PersistenceEntropy
from sklearn.preprocessing import FunctionTransformer
from gtda.pipeline import make_pipeline

## 1 - Initialize a local homology transformer. 
# The notion of neighborhood here is determined by the number of closest neighbors to consider
n_neighbors = (1, 7)
homology_dimensions = (1, 2)
kn_lh = KNeighborsLocalVietorisRips(n_neighbors=n_neighbors,
                                    homology_dimensions=homology_dimensions)

## 2 - Define a featurization method for persistence diagrams.
mod_pe = make_pipeline(PersistenceEntropy(),
                       FunctionTransformer(func=lambda X: 2 ** X))

# Summarize the steps above in a single end-to-end scikit-learn pipeline.
# Then, apply the pipeline
pipe = make_pipeline(kn_lh,
                     mod_pe)
loc_dim_features = pipe.fit_transform(line)

## 3 - Plot the point cloud, colouring according to the value of the dimension-1 feature
# A simple helper function
def plot_coloured_cloud(point_cloud, colors):
    """Plot the point cloud, colouring each point according to the
    corresponding entry of `colors`."""
    plotly_params = {"trace" : {"marker": {"color": colors,
                                           "size": 6,
                                           "showscale": True},
                                "text": [f"color: {v}" for v in colors]}}
    return plot_point_cloud(point_cloud, plotly_params=plotly_params)

# Find the column index for homology dimension 1 in `loc_dim_features`
dimension = 1
dim_index = homology_dimensions.index(dimension)
colors = loc_dim_features[:, dim_index]

# Plot the point cloud using the results in local homology dimension 1
plot_coloured_cloud(line, colors)

For each non-boundary point, the algorithm only detects one local feature, in dimension 1!

Now let's try with the plane point cloud, colouring according to the value of the dimension-2 local topological feature. The local persistence diagrams are computed here using the ``RadiusLocalVietorisRips`` transformer.

In [None]:
# Plane coloured with two-dimensional features.
# The notion of neighborhood here is determined by radii.
radii = (0.001, 0.3)
homology_dimensions = (1, 2, 3)
r_lh = RadiusLocalVietorisRips(radii=radii,
                               homology_dimensions=homology_dimensions)

# Construct and apply the pipeline
pipe = make_pipeline(r_lh,
                     mod_pe)
loc_dim_features = pipe.fit_transform(plane)

# Find the column index for homology dimension 2 in `loc_dim_features`
dimension = 2
dim_index = homology_dimensions.index(dimension)
colors = loc_dim_features[:, dim_index]

# Plot the point cloud using the results in local homology dimension 2
plot_coloured_cloud(plane, colors)

As with the one-dimensional local features of the line, the *two-dimensional* local features of the plane are equal to 1 everywhere except at the boundary.

## Non-homogeneous example

In the following example, we consider a point cloud with varying dimensionality, obtained by placing the three shapes next to each other. We encourage you to play around with the ``dimension`` parameter to compare the local structure of points in the line, plane and cube!

One sees that one-dimensional features are observed in all three shapes, but two-dimensional features only exist in the plane and three-dimensional features only exist in the cube.

In [None]:
# Here we set the `collapse_edges` parameter to True, which often improves performance in higher homology dimensions.
# We also set the `n_jobs` parameter to -1, to allow the algorithm to distribute across CPU cores!
radii = (0.01, 0.25)
homology_dimensions = (1, 2, 3)
collapse_edges = True
n_jobs = -1
r_lh = RadiusLocalVietorisRips(radii=radii,
                               homology_dimensions=homology_dimensions,
                               collapse_edges=collapse_edges,
                               n_jobs=n_jobs)

# Construct and apply the pipeline
pipe = make_pipeline(r_lh,
                     mod_pe)
loc_dim_features = pipe.fit_transform(line_plane_cube)

# Find the column index for the chosen homology dimension in `loc_dim_features`.
# We encourage you to change the dimension parameter
dimension = 1
dim_index = homology_dimensions.index(dimension)
colors = loc_dim_features[:, dim_index]

# Plot the point cloud using the results in the chosen local homology dimension
plot_coloured_cloud(line_plane_cube, colors)

## "Non-manifold" points

Our last example is inherently one-dimensional, but the point cloud still has some interesting local structures.

In [None]:
# First generate the point cloud
line1 = [[x, 0, 0] for x in np.arange(-1, 1, 1/10)]
line2 = [[0, y, 0] for y in np.arange(-1, 1, 1/10)]
line3 = [[x, 1, 0] for x in np.arange(-1, 1, 1/10)]
lines = np.array(line1 + line2 + line3)

# Plot the point cloud
plot_point_cloud(lines)

Here, there are four different types of points:
- 'edge points', with no local structure;
- 'line points', which locally look like one-dimensional lines;
- 'intersection points', which locally look like an X;
- 'intersection points', which locally look like a T.

The next block looks at the local homology features!

In [None]:
# Initialize the radius local homology transformer.
radii = (0.21, 0.31)
homology_dimensions = (0, 1)
n_jobs = -1
r_lh = RadiusLocalVietorisRips(radii=radii,
                               homology_dimensions=homology_dimensions,
                               n_jobs=n_jobs)

# Construct and apply the pipeline
pipe = make_pipeline(r_lh,
                     mod_pe)
loc_dim_features = pipe.fit_transform(lines)

# Find the column index for homology dimension 1 in `loc_dim_features`.
dimension = 1
dim_index = homology_dimensions.index(dimension)
colors = loc_dim_features[:, dim_index]

# Plot the point cloud using the results in the chosen local homology dimension
plot_coloured_cloud(lines, colors)

Notice that the one-dimensional feature reflects the number of branches meeting at a given point!

We encourage you to change the values of the ``radii`` parameter. If you are curious about what this parameter means, please scroll down to the 'How does it work?' section.

# How does it work?

As the mathematics behind this algorithm is quite involved, we only explain it in detail for one-dimensional examples, and illustrate the process with a picture in a 2-dimensional example.

A central theme in topology is the study of properties that are invariant under deformation. A popular one-dimensional invariant is the number of 'loops' in the space. For a graph, the number of 'loops' is the number of independent cycles of that graph. In the case of connected graphs this is equal to $1 - \chi(G)$, where $\chi(G) = |V| - |E|$ is the *Euler characteristic* of the graph, and $|V|$ and $|E|$ are the numbers of vertices and edges respectively.
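
As a quick sanity check of the formula above, the following sketch computes $\chi(G)$ and the number of loops $1 - \chi(G)$ for a few small connected graphs. The function names are illustrative and not part of any library.

```python
def euler_characteristic(vertices, edges):
    """Euler characteristic of a graph: number of vertices minus number of edges."""
    return len(vertices) - len(edges)

def count_loops(vertices, edges):
    """Number of independent cycles of a *connected* graph, equal to 1 - chi(G)."""
    return 1 - euler_characteristic(vertices, edges)

# A triangle: one loop.
triangle = (["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")])
# A path on three vertices (a tree): no loops.
path = (["a", "b", "c"], [("a", "b"), ("b", "c")])
# A "figure eight": two triangles sharing the vertex "a".
figure_eight = (
    ["a", "b", "c", "d", "e"],
    [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d"), ("d", "e"), ("e", "a")],
)

triangle_loops = count_loops(*triangle)          # 1 - (3 - 3) = 1
path_loops = count_loops(*path)                  # 1 - (3 - 2) = 0
figure_eight_loops = count_loops(*figure_eight)  # 1 - (5 - 6) = 2
```

Note that the formula is only valid for connected graphs; for a graph with several connected components one would have to apply it to each component separately.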

Let us consider *trees*, which are a particular kind of graph. Although all trees have Euler characteristic equal to $1$, they can have very different local structure. By computing this local structure, we can discriminate between topologically distinct trees (or graphs, or higher-dimensional objects). In order to compute the local structure around each data point with standard algorithms for point clouds or dissimilarity matrices, such as the highly-optimised routines used by ``VietorisRipsPersistence``, we use a few tricks!

The first step is to discard points that are far away from the point under consideration, since they do not affect the local geometry around it. We thus only keep points that are "not far" from our point, according to a certain criterion. For the ``RadiusLocalVietorisRips`` transformer, the criterion is that the distance from the chosen point should be less than or equal to the second entry in the ``radii`` pair. For the ``KNeighborsLocalVietorisRips`` transformer, we only include the $k$ nearest neighbors of our chosen point, where $k$ is the second entry in the ``n_neighbors`` pair.
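
The two criteria can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code used by the transformers: the function names are hypothetical, and whether the chosen point itself counts towards $k$ is a choice made here for simplicity.

```python
import math

def radius_neighbourhood(points, center, radius):
    """Keep the points whose distance from `center` is at most `radius`."""
    return [p for p in points if math.dist(p, center) <= radius]

def k_nearest_neighbourhood(points, center, k):
    """Keep the `k` points closest to `center` (the centre itself included, if present)."""
    return sorted(points, key=lambda p: math.dist(p, center))[:k]

# Eleven points on a line segment, spaced 0.1 apart
points = [(i / 10, 0.0) for i in range(-5, 6)]
center = (0.0, 0.0)

within_radius = radius_neighbourhood(points, center, radius=0.25)
nearest = k_nearest_neighbourhood(points, center, k=3)
```

Here `within_radius` contains the five points with first coordinate between -0.2 and 0.2, while `nearest` contains the centre and its two immediate neighbours on the line.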

The second step is the one that makes loops reflect dimensionality. From a line, how can we obtain a loop? We stick the two ends of the line together! This trick can be thought of as "coning off" the endpoints: adding an extra point and linking some points (the furthest points) to this extra point. In the case of the ``RadiusLocalVietorisRips`` transformer, our algorithm cones off the points that lie in between the first and second entries in ``radii``, as depicted below. For the ``KNeighborsLocalVietorisRips`` transformer, we cone off in an analogous way according to the ``n_neighbors`` pair.
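
The following sketch illustrates the coning-off step on points sampled from a line. It is a simplification under stated assumptions: the function names and the fixed `edge_length` threshold are hypothetical, the treatment of points lying exactly on a radius is a choice made here, and the actual transformers compute persistent homology across all scales rather than a graph at a single scale.

```python
import math

def split_neighbourhood(points, center, radii):
    """Split the points within radii[1] of `center` into inner and annulus points."""
    r_inner, r_outer = radii
    inner, annulus = [], []
    for p in points:
        d = math.dist(p, center)
        if d <= r_inner:
            inner.append(p)
        elif d <= r_outer:
            annulus.append(p)  # these are the points to be coned off
    return inner, annulus

def cone_off_graph(inner, annulus, edge_length):
    """Graph on the local points, plus an apex joined to every annulus point."""
    local = inner + annulus
    vertices = local + ["apex"]
    # Join local points that are close to each other
    edges = [(p, q) for i, p in enumerate(local) for q in local[i + 1:]
             if math.dist(p, q) <= edge_length]
    # Cone off: join each annulus point to the extra apex vertex
    edges += [(p, "apex") for p in annulus]
    return vertices, edges

points = [(i / 10, 0.0) for i in range(-5, 6)]
inner, annulus = split_neighbourhood(points, (0.0, 0.0), radii=(0.25, 0.35))
vertices, edges = cone_off_graph(inner, annulus, edge_length=0.15)

# The resulting graph is connected, so its number of loops is 1 - chi
n_loops = 1 - (len(vertices) - len(edges))
```

With these values the annulus contains one point on each side of the centre, and joining both to the apex closes the segment into a single loop.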

So what is the Euler characteristic of the result of applying this procedure to an "X shape" graph, which could be the local neighborhood around a point in a larger graph? Up to deformation, the result after "coning off" has two vertices (the original intersection point and the cone point) and 4 edges, so the Euler characteristic is $2 - 4 = -2$! This reflects the complexity at the point that we are considering. (Notice also that a loop has Euler characteristic $0$, and a triple intersection will give Euler characteristic $-1$.)
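
The same count can be written out for any number of branches. The sketch below models the coned-off neighbourhood, up to deformation, as two vertices (the centre and the cone point, here called the apex) joined by one edge per branch. The function names are illustrative only.

```python
def coned_star_euler_characteristic(n_branches):
    """Euler characteristic of a star with `n_branches` arms after coning off its tips.

    Up to deformation, the result has two vertices (the centre and the apex)
    joined by one edge per branch.
    """
    vertices = ["centre", "apex"]
    edges = [("centre", "apex", branch) for branch in range(n_branches)]
    return len(vertices) - len(edges)

def coned_star_loops(n_branches):
    """Number of independent loops, 1 - chi, of the (connected) coned-off star."""
    return 1 - coned_star_euler_characteristic(n_branches)

for name, n_branches in [("line", 2), ("T", 3), ("X", 4)]:
    chi = coned_star_euler_characteristic(n_branches)
    loops = coned_star_loops(n_branches)
    print(f"{name}: chi = {chi}, loops = {loops}")
```

In this model, the number of loops is always the number of branches minus one: one loop for a line point, two for a T and three for an X.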

As a 2-dimensional example of this procedure, consider a disk (again, you can think of this as the local neighbourhood around a chosen point). By coning off the outer part of this disk, we end up with a sphere. And a sphere's only non-trivial topological features are 2-dimensional.
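
We can check this claim with a simple count, using the Euler characteristic $V - E + F$ of a triangulated surface. The sketch below assumes a particular triangulation of the disk (a "fan" of triangles around a central vertex), and the function name is illustrative only.

```python
def euler_characteristic_2d(n_vertices, n_edges, n_triangles):
    """Euler characteristic of a 2-dimensional simplicial complex: V - E + F."""
    return n_vertices - n_edges + n_triangles

n = 6  # number of boundary vertices of the triangulated disk

# Disk as a "fan": one centre and n boundary vertices,
# n spokes plus n boundary edges, and n triangles.
disk = (1 + n, 2 * n, n)

# Coning off the boundary adds one apex, n edges from the apex to the
# boundary vertices, and n triangles (one per boundary edge).
coned_disk = (disk[0] + 1, disk[1] + n, disk[2] + n)

chi_disk = euler_characteristic_2d(*disk)              # (1 + n) - 2n + n = 1
chi_coned_disk = euler_characteristic_2d(*coned_disk)  # (2 + n) - 3n + 2n = 2
```

The coned-off disk has Euler characteristic 2, which is the Euler characteristic of a sphere, and the result does not depend on the choice of `n`.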

![coning_off](images/coning_off.png)
<div style="text-align: left">
   <p style="text-align: center;">  Coning off a cross and a disk.</p>
</div>