# Tutorial: Persistent homology with `giotto-ph`

This tutorial shows the basic functionalities and API of `giotto-ph`.

## Introduction

`giotto-ph` is a reworked and **parallelised** version of state-of-the-art algorithms for computing persistent homology of Vietoris–Rips filtrations. For a complete history and full references, see [our paper](https://arxiv.org/abs/2107.05412). Here we mention that its core API and some of its features are based upon the [ripser.py](https://joss.theoj.org/papers/10.21105/joss.00925) project.

A Vietoris–Rips filtration is obtained from a point cloud or even just a distance-like matrix by progressively enlarging a parameter $\varepsilon$ to build a nesting sequence of simplicial complexes. The $d$-dimensional simplex with vertices $(v_0, v_1, \ldots, v_d)$ belongs to the complex at value $\varepsilon$ if all "pairwise distances" between pairs of these vertices are less than or equal to $\varepsilon$. Thus, even in the case of a geometric point cloud, only pairwise information is needed to construct this type of filtration.
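The membership rule above can be sketched in a few lines of plain Python. This is only an illustration of the definition, not how `giotto-ph` builds the filtration internally; the helper name `in_rips_complex` and the toy distance matrix are hypothetical, and the sketch assumes a zero diagonal so that single vertices are always present.

```python
from itertools import combinations


def in_rips_complex(dist, simplex, eps):
    """Return True if all pairwise distances between the vertices of
    `simplex` are less than or equal to `eps`."""
    return all(dist[i][j] <= eps for i, j in combinations(simplex, 2))


# Toy distance matrix on three points (hypothetical values)
dist = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.5],
    [2.0, 1.5, 0.0],
]

edge_01_at_1 = in_rips_complex(dist, (0, 1), 1.0)         # edge {0, 1} at eps = 1.0
triangle_at_1_5 = in_rips_complex(dist, (0, 1, 2), 1.5)   # triangle at eps = 1.5
triangle_at_2 = in_rips_complex(dist, (0, 1, 2), 2.0)     # triangle at eps = 2.0
```

The triangle only enters once $\varepsilon$ reaches the largest of its three pairwise distances, which is why pairwise information suffices.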

We say "distance-like" and put quotes around "pairwise distances" because, although Vietoris–Rips filtrations are easiest to think about and visualize on point clouds, with distances being genuine Euclidean distances, the procedure can deal with much more general input. Vietoris–Rips persistent homology (or "flag" persistent homology in this more general case) just needs as input a square matrix $M$ such that

  - for all $i$ and $j$, $M_{ij} \geq \max \{M_{ii}, M_{jj}\}$.
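Reading the diagonal entries $M_{ii}$ as the filtration values at which vertices appear, the condition says that no edge may enter before either of its endpoints. A minimal check could look as follows; the function name `is_valid_flag_input` and the example matrices are made up for illustration and are not part of the `giotto-ph` API.

```python
def is_valid_flag_input(M):
    """Check that M is square and that M[i][j] >= max(M[i][i], M[j][j])
    for all i and j."""
    n = len(M)
    if any(len(row) != n for row in M):
        return False
    return all(
        M[i][j] >= max(M[i][i], M[j][j])
        for i in range(n)
        for j in range(n)
    )


# A genuine distance matrix: zero diagonal, non-negative entries
distance_like = [[0.0, 1.0], [1.0, 0.0]]

# Vertices appear at 0.5 and 0.2, the edge at 1.0: still valid
weighted_vertices = [[0.5, 1.0], [1.0, 0.2]]

# The edge would appear at 1.0, before vertex 0 appears at 2.0: invalid
edge_too_early = [[2.0, 1.0], [1.0, 0.0]]

checks = [
    is_valid_flag_input(distance_like),
    is_valid_flag_input(weighted_vertices),
    is_valid_flag_input(edge_too_early),
]
```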

In this picture we see a sequence of complexes based on a point cloud in Euclidean space, forming as $\varepsilon$ increases over time. The bubbles are the disks of radius $\varepsilon$.

In [None]:
from IPython.display import Image  # to display images
Image("images/ph.gif")

In [None]:
# Install missing dependencies
import sys
!{sys.executable} -m pip install giotto-tda

In [None]:
# here comes our protagonist!
from gph import ripser_parallel

# Import utils
import numpy as np
from gtda.homology._utils import _postprocess_diagrams

# To generate dataset
from sklearn import datasets

# Plotting
from plotly import graph_objects as go
from gtda.plotting import plot_diagram, plot_point_cloud

Let us produce a noisy circle as a point cloud for this tutorial:

In [None]:
data = datasets.make_circles(n_samples=100, noise=0.1, factor=0.7, random_state=42)[0]

# Plot the point cloud
plot_point_cloud(data)

## Default arguments (serial computation)

Parallelism is *not* exploited when calling `ripser_parallel` with default arguments: all computations are performed by a single thread in that case.

One of the most effective ways of presenting persistent topological features is to plot them as points $(b, d)$ in a two-dimensional plane. The $x$- and $y$-coordinates of one such point are the birth and death filtration values for the corresponding feature, i.e. the values at which the topological feature is created (e.g. the circle is formed) and at which it is destroyed (e.g. the circle is filled up). The result is a *persistence diagram*, and the farther a point is from the diagonal $b = d$, the more *persistent* the feature it represents.
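To make the notion of persistence concrete, here is a small sketch on made-up birth-death pairs (the values are hypothetical and not computed from the dataset above). The persistence of a feature is $d - b$, and its distance from the diagonal is $(d - b)/\sqrt{2}$, so ranking by one is the same as ranking by the other.

```python
import math

# Hypothetical (birth, death) pairs
pairs = [(0.0, 0.1), (0.0, 0.3), (0.4, 1.2)]

# Persistence (lifetime) of each feature
persistence = [death - birth for birth, death in pairs]

# Euclidean distance of each point from the diagonal b = d
distance_to_diagonal = [p / math.sqrt(2) for p in persistence]

# The most persistent feature is the one farthest from the diagonal
most_persistent = max(pairs, key=lambda bd: bd[1] - bd[0])
```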

In [None]:
# Compute the persistence diagram
dgm = ripser_parallel(data)

# Convert to gtda format
dgm_gtda = _postprocess_diagrams([dgm["dgms"]], "ripser", (0, 1), np.inf, True)[0]

# Plot the persistence diagram
plot_diagram(dgm_gtda, homology_dimensions=(0, 1))

## Higher homology groups

We can compute any degree of homology but, by default, we only compute $H_0$ and $H_1$. 

You can specify that higher-degree homological features should be computed by increasing the argument `maxdim`. In practice, anything above $H_1$ will benefit more substantially from parallelisation.
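A rough way to see why higher degrees are costlier: computing $H_k$ involves simplices up to dimension $k + 1$, and a complex on $n$ vertices can contain at most $\binom{n}{k+2}$ of those. The sketch below only evaluates this upper bound for a point cloud of 100 points; the actual number of simplices handled by `giotto-ph` depends on the data and on the options used.

```python
from math import comb

n_points = 100  # same size as the point cloud in this tutorial

# Upper bound on the number of (k + 1)-simplices, for k = 0, 1, 2
max_simplices = {k: comb(n_points, k + 2) for k in range(3)}
```

The bound grows from a few thousand edges for $H_0$ to several million tetrahedra for $H_2$.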

In [None]:
# Compute the persistence diagram
dgm = ripser_parallel(data, maxdim=2)

# Convert to gtda format
dgm_gtda = _postprocess_diagrams([dgm["dgms"]], "ripser", (0, 1, 2), np.inf, True)[0]

# plot
plot_diagram(dgm_gtda, homology_dimensions=(0, 1, 2))

## Parallelised computations

The parameter `n_threads` in `ripser_parallel` can be set to an integer greater than `1` to activate parallel computations. In addition, `-1` will use all available cores on your machine. Skip to the code cell below to see this in action, or keep reading here for some more background on how parallelism is achieved.

### How it works

The parallelism implemented in `giotto-ph` builds on different sources already present in the literature.

The basic idea is to parallelise the reduction of the filtration (co)boundary matrix $\mathbf{D}$. In order to compute homology, one needs to know the *image* and the *kernel* of $\mathbf{D}$ (see e.g. [Wikipedia](https://en.wikipedia.org/wiki/Simplicial_homology) for more details). These can be easily read off from $\mathbf{D}$ if we reduce this matrix to *column echelon form*: the kernel will then be generated by the linear combinations leading to zero columns, while the image will be given by the non-zero columns themselves.

When computing *persistent homology*, we are only allowed to perform this column reduction by adding columns from left to right. However, we can do this in any order we like and even in parallel, as long as we respect the basic invariants of the reduction algorithm. The idea of reducing in parallel in this way was first presented in [Towards Lockfree Persistent Homology](https://www.mrzv.org/publications/lockfree-persistence/spaa/) by Morozov and Nigmetov, and implemented by those authors on top of Ulrich Bauer's [Ripser](https://arxiv.org/abs/1908.02518).
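For intuition, here is a minimal serial sketch of the textbook left-to-right column reduction over $\mathbb{F}_2$, applied to the boundary matrix of a single filled triangle. It is not the implementation used in `giotto-ph` (which works with the coboundary matrix, several optimisations and multiple threads); the function name and the toy filtration order are assumptions made for this example.

```python
def reduce_boundary_matrix(columns):
    """Standard persistence reduction over F_2.

    Each column is given as the set of row indices holding a 1.
    Columns are only ever modified by adding columns to their left.
    """
    reduced = [set(col) for col in columns]
    pivot_owner = {}  # lowest non-zero row index -> column with that pivot
    pairs = []        # (birth index, death index) of paired simplices
    for j, col in enumerate(reduced):
        # Add earlier columns until the pivot is new or the column is zero
        while col and max(col) in pivot_owner:
            col ^= reduced[pivot_owner[max(col)]]  # addition over F_2
        if col:
            pivot_owner[max(col)] = j
            pairs.append((max(col), j))
    return reduced, pairs


# Filtration order: 0, 1, 2 are the vertices v0, v1, v2;
# 3 = edge v0v1, 4 = edge v1v2, 5 = edge v0v2, 6 = the filled triangle.
boundary = [set(), set(), set(), {0, 1}, {1, 2}, {0, 2}, {3, 4, 5}]

reduced, pairs = reduce_boundary_matrix(boundary)
```

In this toy case column 5 reduces to zero, meaning the third edge closes a loop, and that loop is paired with the triangle in column 6 that fills it. Vertex 0 stays unpaired and represents the connected component that never dies.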

More potential for parallelism is unlocked in `giotto-ph` by adapting ideas from [GPU-Accelerated Computation of Vietoris-Rips Persistence Barcodes](https://arxiv.org/abs/2003.07989) by Zhang, Xiao and Wang, but we do not discuss these here.

In [None]:
Image("images/multithread.png")

In [None]:
# Compute the persistence diagram up to degree 5 in homology
dgm = ripser_parallel(data, n_threads=-1, maxdim=5)

# Convert to gtda format
dgm_gtda = _postprocess_diagrams([dgm["dgms"]], "ripser", (0, 1, 2, 3, 4, 5), np.inf, True)[0]

# Plot
plot_diagram(dgm_gtda, homology_dimensions=(0, 1, 2, 3, 4, 5))

This particular dataset does not have any features in homology dimensions 3 to 5. Too bad!

## Specify a maximum filtration parameter

We can decide to stop the construction of the filtration at a specific value of the filtration parameter by setting the argument `thresh`. In the example point cloud above, this corresponds to specifying a maximum size for the bubbles. Since fewer simplices are added in total in this case, the computations will be faster if you specify such thresholds.

### Warning

Reducing the threshold implies that topological features that only appear at a large radius (a.k.a. filtration value) may not be present at all in your persistence diagram or that some features may not die (as in the example below).
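The effect described in this warning can be mimicked on a list of made-up bars. The sketch assumes that thresholding simply stops adding simplices beyond `thresh`, so that features born later never appear and features that would die later are reported with infinite death. The helper `truncate_bars` is hypothetical and not part of `giotto-ph`.

```python
import math


def truncate_bars(bars, thresh):
    """Return the bars one would expect to see if the filtration
    stopped at `thresh`."""
    truncated = []
    for birth, death in bars:
        if birth > thresh:
            continue  # the feature never appears
        # Features still alive at the threshold never get to die
        truncated.append((birth, death if death <= thresh else math.inf))
    return truncated


# Hypothetical bars of an untruncated filtration
bars = [(0.0, 0.2), (0.3, 1.4), (1.6, 1.9)]

result = truncate_bars(bars, thresh=1.0)
```

Here the first bar is unaffected, the second becomes infinite and the third disappears.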

In [None]:
# compute the persistence diagram
dgm = ripser_parallel(data, thresh=2.5)

# convert to gtda format
dgm_gtda = _postprocess_diagrams([dgm["dgms"]], "ripser", (0, 1), np.inf, True)[0]

# plot
plot_diagram(dgm_gtda, homology_dimensions=(0, 1))

## Edge Collapser integration

By setting the optional parameter `collapse_edges` to `True`, the [Edge Collapse](https://hal.inria.fr/hal-02873740/document) algorithm is used before performing any matrix reduction. This algorithm flags some of the edges as *dominated* and removes them completely from the filtration. This can lead to a greatly sparsified filtration and therefore to immense speed-ups, especially when high homology dimensions are required.

**Persistence barcodes computed with or without edge collapses are exactly the same**.

In [None]:
# compute the persistence diagram
dgm = ripser_parallel(data, collapse_edges=True)

# convert to gtda format
dgm_gtda = _postprocess_diagrams([dgm["dgms"]], "ripser", (0, 1), np.inf, True)[0]

# plot
plot_diagram(dgm_gtda, homology_dimensions=(0, 1))

## Changing the basis field

It is possible to change the field of coefficients used to define homology to any finite field of prime order.

The default is $\mathbb{F}_2$, but the `coeff` argument accepts any prime $p$, giving coefficients in $\mathbb{F}_p$.
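The reason $p$ has to be prime is that matrix reduction needs to divide by non-zero entries, which is only always possible when the integers modulo $p$ form a field. The following sketch uses Python's built-in modular inverse (Python 3.8 or later) to illustrate this; the helper name `inverse_mod` is made up for the example.

```python
def inverse_mod(a, p):
    """Multiplicative inverse of a modulo p.
    Raises ValueError if a has no inverse."""
    return pow(a, -1, p)


# In F_7, every non-zero element has an inverse
inverse_of_3_mod_7 = inverse_mod(3, 7)
all_invertible_mod_7 = all(
    (a * inverse_mod(a, 7)) % 7 == 1 for a in range(1, 7)
)

# Modulo 6 (not prime), 2 has no inverse, so Z/6 is not a field
try:
    inverse_mod(2, 6)
    two_invertible_mod_6 = True
except ValueError:
    two_invertible_mod_6 = False
```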

In [None]:
# Compute the persistence diagram
dgm = ripser_parallel(data, coeff=7)

# Convert to gtda format
dgm_gtda = _postprocess_diagrams([dgm["dgms"]], "ripser", (0, 1), np.inf, True)[0]

# plot
plot_diagram(dgm_gtda, homology_dimensions=(0, 1))