# An elementary introduction to TDA via synthetic data

In this notebook, we will introduce persistent homology on several toy data sets. Try to guess what space each data set is sampled from without looking at the scatterplots at the bottom (Occam's razor helps). All of these data sets are located in `data/`, relative to this notebook.

For this notebook, we'll use the `giotto-tda` package. This can be installed via `pip install giotto-tda`.

In [None]:
import numpy as np
from gtda.homology import VietorisRipsPersistence
from gtda.plotting import plot_diagram

### Data sets 1 & 2

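Let's load the data and compute its Vietoris-Rips persistent homology in degrees 0, 1, and 2. Note that `fit_transform` expects a collection of point clouds, i.e. an array of shape `(n_samples, n_points, n_dimensions)`, so we add a leading axis with `data1[None, :, :]`.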

In [None]:
data1 = np.loadtxt('data/synth_data1.txt', delimiter=' ')

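# Compute persistent homology in degrees 0, 1, and 2
VR = VietorisRipsPersistence(homology_dimensions=[0, 1, 2])
# Add a leading axis: the transformer expects a batch of point clouds
data1_dgms = VR.fit_transform(data1[None, :, :])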

In [None]:
plot_diagram(data1_dgms[0])

We find a single homology class in degree 1. (There is always a class in homological degree 0 that lives forever; it is omitted from the persistence diagram.) There is no class in homological degree 2, which rules out any space that encloses a void, such as a standard sphere or a torus.

This data was produced by sampling a nice space exactly, with no noise. But real data always has noise! So let's sample from the same space with noise to get a better feel for what persistence diagrams look like in practice.

In [None]:
data2 = np.loadtxt('data/synth_data2.txt', delimiter=' ')
data2_dgms = VR.fit_transform(data2[None, :, :])
plot_diagram(data2_dgms[0])

Now we see much more 'noise' in the diagram, namely the additional $H_1$ classes appearing near the diagonal. There is a slogan in TDA that homology classes lying near the diagonal are 'noise', but this interpretation is highly dependent on the data set and the problem you are trying to solve, so be careful when applying this type of logic. It is easy to come up with examples where classes far from the diagonal are noise and those near the diagonal are essential.
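One common way to make "near the diagonal" concrete is to compare each class's lifetime (death minus birth) against a cutoff. The sketch below assumes each diagram row has the form `(birth, death, homology dimension)`, which is how `giotto-tda` diagrams such as `data2_dgms[0]` are laid out as far as we know. The helper `count_persistent_classes`, the toy rows, and the cutoff value are all hypothetical and only illustrate the idea; as noted above, no cutoff is right for every problem.

```python
def count_persistent_classes(diagram, dim, min_persistence):
    """Count classes of homological degree `dim` whose lifetime exceeds a cutoff."""
    count = 0
    for birth, death, d in diagram:
        if int(d) == dim and (death - birth) > min_persistence:
            count += 1
    return count


# Hypothetical diagram rows: (birth, death, homology dimension).
toy_diagram = [
    (0.0, 0.2, 0),
    (0.0, 0.3, 0),
    (0.4, 1.9, 1),   # long-lived loop, far from the diagonal
    (0.5, 0.55, 1),  # short-lived loops, close to the diagonal
    (0.6, 0.62, 1),
]

# Number of H_1 classes that live longer than the (arbitrary) cutoff 0.5
print(count_persistent_classes(toy_diagram, dim=1, min_persistence=0.5))
```

The same function should accept `data2_dgms[0]` directly, since iterating over a 2D array yields its rows.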

### Data sets 3 & 4

Increasing the complexity, let's again try to guess which space the following data set was sampled from, based only on its persistent homology.

In [None]:
data3 = np.loadtxt('data/synth_data3.txt', delimiter=' ')
dgms3 = VR.fit_transform(data3[None, :, :])
plot_diagram(dgms3[0])

This is the first time we've seen 2-dimensional homology away from the diagonal! So this space has one 0-dimensional homology class (the space is connected), one 1-dimensional homology class (one noncontractible loop), and one 2-dimensional homology class (one void). (This one is difficult to guess without having taken a topology course.)

Finally, we can look at data set 4.

In [None]:
data4 = np.loadtxt('data/synth_data4.txt', delimiter=',')
data4.shape

This data set consists of 400 points in $\mathbb{R}^4$, so direct visualization of the data set is impossible. One could try using dimensionality reduction techniques like PCA to analyze the data...

In [None]:
from sklearn.decomposition import PCA

pca2 = PCA(n_components=2)
data4_2d = pca2.fit_transform(data4)

print(f'2-dimensional PCA has an explained variance ratio of {pca2.explained_variance_ratio_}, for a total of '
      f'{pca2.explained_variance_ratio_.sum()}.')

...but the result is not fantastic: two principal components leave a good share of the variance unexplained. Let's see if we can guess using persistent homology again.

In [None]:
dgms4 = VR.fit_transform(data4[None, :, :])
plot_diagram(dgms4[0])

Now we have two 1-dimensional homology classes (close together in the persistence diagram). So `data4` has one connected component, two 1-dimensional homology classes (two distinct loops), and one 2-dimensional homology class (one void).
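If you want to check a guess, it can help to compare the counts read off a diagram with the Betti numbers of a few textbook spaces. The sketch below is a small, hypothetical lookup table (`KNOWN_SPACES` and `candidate_spaces` are our own names, not part of `giotto-tda`). The values are the standard Betti numbers with $\mathbb{Z}/2$ coefficients; we assume this matches the coefficient field used above, which we believe is the default for `VietorisRipsPersistence`.

```python
# Betti numbers (b0, b1, b2) with Z/2 coefficients for a few textbook spaces.
KNOWN_SPACES = {
    "circle": (1, 1, 0),
    "figure eight": (1, 2, 0),
    "2-sphere": (1, 0, 1),
    "torus": (1, 2, 1),
    "Klein bottle": (1, 2, 1),
    "real projective plane": (1, 1, 1),
}


def candidate_spaces(betti):
    """Return the names of all listed spaces whose Betti numbers match `betti`."""
    return sorted(name for name, b in KNOWN_SPACES.items() if b == tuple(betti))


# One component, two loops, one void:
print(candidate_spaces((1, 2, 1)))
```

Note that more than one space can match. Betti numbers alone do not determine a space, which is why the guessing game above leans on Occam's razor.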

### Bonus

Here is the last synthetic data set we'll consider. Its persistence diagram both reflects and hides periodicity in the data. See if you can find it by zooming in on the diagram.

In [None]:
data5 = np.loadtxt('data/synth_data5.csv', delimiter=',')
dgms5 = VR.fit_transform(data5[None, :, :])
plot_diagram(dgms5[0])

In particular, zoom in on the $H_1$ classes that sit a bit off the diagonal. How many of them are clumped together? How many $H_1$ classes are far off the diagonal? (This data set is pictured in the slides.)
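As an alternative to zooming by eye, one can sort the $H_1$ lifetimes and look for the largest gap between consecutive values. The sketch below is a simple heuristic, not a standard `giotto-tda` routine: `split_by_largest_gap` is a hypothetical helper, the toy rows are made up, and we again assume rows of the form `(birth, death, homology dimension)`.

```python
def split_by_largest_gap(diagram, dim=1):
    """Split lifetimes of degree-`dim` classes at the largest gap.

    Returns (prominent, rest), both sorted from longest to shortest lifetime.
    """
    lifetimes = sorted(
        (death - birth for birth, death, d in diagram if int(d) == dim),
        reverse=True,
    )
    if len(lifetimes) < 2:
        return lifetimes, []
    # Differences between consecutive lifetimes in the sorted list
    gaps = [lifetimes[i] - lifetimes[i + 1] for i in range(len(lifetimes) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return lifetimes[:cut], lifetimes[cut:]


# Hypothetical diagram rows: (birth, death, homology dimension).
toy_diagram = [
    (0.0, 0.3, 0),
    (0.1, 1.6, 1),
    (0.2, 1.6, 1),
    (0.5, 0.8, 1),
    (0.5, 0.78, 1),
    (0.5, 0.75, 1),
    (0.6, 0.62, 1),
]

prominent, rest = split_by_largest_gap(toy_diagram, dim=1)
print(len(prominent), len(rest))
```

Applying the function a second time to the remaining classes would separate a clump that sits a bit off the diagonal from the classes right next to it.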

### Answers

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.scatter(data1[:, 0], data1[:, 1])

In [None]:
plt.scatter(data2[:, 0], data2[:, 1])

In [None]:
plt.scatter(data3[:, 0], data3[:, 1])

In [None]:
# data4 lives in R^4; this plot shows only coordinates 1 and 2
plt.scatter(data4[:, 1], data4[:, 2])