# giotto-tda

A high-performance topological machine learning toolbox in Python

giotto-tda is a high-performance topological machine learning toolbox in Python. It is built on top of scikit-learn and distributed under the GNU AGPLv3 license. It is part of the Giotto family of open-source projects.

To read more about it, please refer to [this article](https://analyticsindiamag.com/guide-to-giotto-tda-a-high-performance-topological-machine-learning-toolbox/).

# Code Implementation

## Classifying 3D Shapes

Let’s work through an example to get a better understanding of the process. We use giotto-tda, a high-performance topological machine learning toolkit in Python. It integrates well with scikit-learn and is intuitive to use.

## Setup

In [None]:
!python -m pip install pip --upgrade --user -q --no-warn-script-location
!python -m pip install numpy pandas seaborn matplotlib scipy scikit-learn statsmodels tensorflow keras --user -q --no-warn-script-location

In [None]:
!python -m pip install -U giotto-tda --user -q --no-warn-script-location
!python -m pip install openml --user -q --no-warn-script-location
!python -m pip install delayed --user -q --no-warn-script-location



In [None]:
# Restart the kernel so that the freshly installed packages are picked up
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt

In [None]:
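# make_circles returns (points, labels); only the points are used here
points, _ = make_circles(100)
# Label a point 1 if it lies in the first or third quadrant, at least 0.1 away from both axes
y = [1 if (p[0] > 0.1 and p[1] > 0.1) or (p[0] < -0.1 and p[1] < -0.1) else 0 for p in points]
plt.scatter(points[:, 0], points[:, 1], c=y)
plt.axis('off')
plt.show()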

In [None]:
import numpy as np
np.array(y)

## Data

We use the same data as the giotto-tda tutorials. It originates from Princeton’s Computer Vision course and is loaded here from OpenML.

In [None]:
from openml.datasets.functions import get_dataset
df = get_dataset('shapes').get_data(dataset_format='dataframe')[0]
df.head()

In [None]:
df['target'].map(lambda x:x[:-1]).value_counts()

In [None]:
from gtda.plotting import plot_point_cloud,plot_diagram
plot=plot_point_cloud(df.query('target == "biplane0"')[["x", "y", "z"]].values)
plot

In [None]:
from gtda.plotting import plot_point_cloud,plot_diagram
plot=plot_point_cloud(df.query('target == "human_arms_out0"')[["x", "y", "z"]].values)
plot

In [None]:
type(plot)

In [None]:
# plot.write_html('plot.html')

The data contains 4 classes of 3D objects, with 10 samples per class. Each object is represented by 400 points in 3D space.

To work with the library, we have to transform the data into an array of point clouds, one per object.

In [None]:
import numpy as np

point_clouds = np.asarray(
    [
        df.query("target == @shape")[["x", "y", "z"]].values
        for shape in df["target"].unique()
    ]
)
point_clouds.shape
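
The stacking above produces a regular 3D array only because every object has the same number of points. The following is a minimal sketch of the expected layout, using random stand-in data rather than the real shapes (the names `clouds` and `stacked` are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real data: 40 objects, 400 points each, 3 coordinates
n_objects, n_points = 40, 400
clouds = [rng.normal(size=(n_points, 3)) for _ in range(n_objects)]

# Equal-sized clouds stack into one array of shape (n_samples, n_points, n_dimensions)
stacked = np.asarray(clouds)
print(stacked.shape)  # (40, 400, 3)
```

If the clouds had different sizes, a plain Python list of 2D arrays would be the safer container than a single stacked array.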

## Calculating Persistence Diagrams

In [None]:
from gtda.homology import VietorisRipsPersistence

# Track connected components, loops, and voids
homology_dimensions = [0, 1, 2]

persistence = VietorisRipsPersistence(
    metric="euclidean",
    homology_dimensions=homology_dimensions,
    n_jobs=6,
    collapse_edges=True,
)
persistence_diagrams = persistence.fit_transform(point_clouds)

# Example persistence diagram (sample with index 10)
plot_diagram(persistence_diagrams[10])
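
To read such a diagram numerically, it helps to know that each point is a (birth, death, homology dimension) triple. The sketch below uses a small hand-made diagram, not one computed from the shapes data, to show how lifetimes can be extracted per dimension:

```python
import numpy as np

# Hand-made diagram for illustration only; each row is (birth, death, homology dimension)
diagram = np.array([
    [0.0, 0.5, 0],
    [0.0, 1.5, 0],
    [0.4, 2.0, 1],
    [0.9, 1.0, 1],
    [1.2, 1.3, 2],
])

# Lifetime (persistence) of each feature: death minus birth
lifetimes = diagram[:, 1] - diagram[:, 0]

# Summarize each homology dimension separately
for dim in (0, 1, 2):
    mask = diagram[:, 2] == dim
    print(f"H{dim}: {mask.sum()} points, longest lifetime {lifetimes[mask].max():.2f}")
```

Points with long lifetimes lie far from the diagonal of the plot and are usually read as the more robust topological features.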

## Persistence Entropy and Other Features

We can compute the persistence entropy of each homology dimension using `PersistenceEntropy`:

In [None]:
from gtda.diagrams import PersistenceEntropy
persistence_entropy = PersistenceEntropy(normalize=True)
# Calculate topological feature matrix
X = persistence_entropy.fit_transform(persistence_diagrams)
X.shape
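
To make the quantity concrete, here is a minimal sketch of persistence entropy for a single homology dimension. It assumes the base-2 Shannon entropy of the lifetimes after normalizing them to sum to 1, and it leaves out the extra rescaling applied by `normalize=True`. The function name and inputs are illustrative; this is not the library's implementation.

```python
import numpy as np

def persistence_entropy(lifetimes):
    """Base-2 Shannon entropy of lifetimes normalized to sum to 1 (illustrative helper)."""
    lifetimes = np.asarray(lifetimes, dtype=float)
    lifetimes = lifetimes[lifetimes > 0]   # zero-length features carry no weight
    p = lifetimes / lifetimes.sum()        # turn lifetimes into a probability vector
    return float(-(p * np.log2(p)).sum())

# Four equally persistent features: maximal entropy for four points
equal = persistence_entropy([1.0, 1.0, 1.0, 1.0])
# One dominant feature plus noise: entropy close to zero
dominated = persistence_entropy([10.0, 0.1, 0.1])
print(equal, dominated)
```

A diagram dominated by one long-lived feature gives a low entropy, while many features of similar lifetime give a high one.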

Since we tracked only three homology dimensions, we get only three numbers for each point cloud. To increase the number of features, we can compute other types of features. Some examples follow.

In [None]:
from gtda.diagrams import NumberOfPoints,Amplitude
from sklearn.pipeline import make_union

# Select a variety of metrics to calculate amplitudes
metrics = [
    {"metric": metric}
    for metric in ["bottleneck", "wasserstein", "landscape", "persistence_image"]
]

# Concatenate to generate 3 + 3 + (4 x 3) = 18 topological features
feature_union = make_union(
    PersistenceEntropy(normalize=True),
    NumberOfPoints(n_jobs=-1),
    *[Amplitude(**metric, n_jobs=-1) for metric in metrics]
)
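
The feature count comes from `make_union` concatenating the outputs of its transformers column-wise. A small sketch with toy data and two stand-in transformers (neither is part of giotto-tda) illustrates the behavior:

```python
import numpy as np
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer

# Toy input: 5 samples, 3 columns (unrelated to the shapes data)
X_toy = np.arange(15, dtype=float).reshape(5, 3)

# Two stand-in feature extractors, each returning 3 columns per sample
doubled = FunctionTransformer(lambda X: 2 * X)
squared = FunctionTransformer(lambda X: X ** 2)

# The union applies both to the same input and stacks the results side by side
union = make_union(doubled, squared)
features = union.fit_transform(X_toy)
print(features.shape)  # (5, 6)
```

By the same logic, six transformers that each return one value per homology dimension give 6 x 3 = 18 columns for the union defined above.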

## Classification Pipeline

Finally, we can put all these things together and build a classification model.

In [None]:
from gtda.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

steps = [
    ("persistence", VietorisRipsPersistence(metric="euclidean", homology_dimensions=homology_dimensions, n_jobs=6)),
    ("features", feature_union),
    ("model", RandomForestClassifier(oob_score=True)),
]

pipeline = Pipeline(steps)

In [None]:
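# Integer class labels, assuming the 40 point clouds are ordered as 4 blocks of 10 samples per class
labels = np.zeros(40)
labels[10:20] = 1
labels[20:30] = 2
labels[30:] = 3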
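
The hard-coded slices rely on the point clouds being ordered in consecutive blocks of 10 per class. As a hedged alternative, labels can be derived from the target names themselves. The sketch below uses a hypothetical list of names in the style seen earlier (a class name followed by a one-digit sample index) and shows only two classes:

```python
import numpy as np

# Hypothetical target names such as "biplane0" ... "human_arms_out9"
names = [f"{cls}{i}" for cls in ("biplane", "human_arms_out") for i in range(10)]

# Strip the trailing sample index to recover the class name
class_names = [name[:-1] for name in names]

# Map each class name to an integer; classes are numbered in alphabetical order
classes, derived_labels = np.unique(class_names, return_inverse=True)
print(classes)
print(derived_labels)
```

The integer assigned to each class may differ from the hard-coded numbering, which does not matter for training as long as the same mapping is used throughout.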

In [None]:
pipeline.fit(point_clouds,labels)

In [None]:
pipeline['model'].oob_score_