# Getting Started with Mapper

In this notebook we explore a few of the core features included in giotto-learn's implementation of the [Mapper algorithm](https://research.math.osu.edu/tgda/mapperPBG.pdf). 

### Useful references

* [An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists](https://arxiv.org/abs/1710.04019)
* [An Introduction to Topological Data Analysis for Physicists: From LGM to FRBs](https://arxiv.org/abs/1904.11044)

## Import libraries

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# data wrangling
import numpy as np

# data viz
import plotly.graph_objects as go

# tda magic
from giotto.mapper.cover import CubicalCover
from giotto.mapper.pipeline import make_mapper_pipeline
from giotto.mapper.filter import Projection
from giotto.mapper.visualization import (
    create_network_2d,
    create_network_3d,
    create_interactive_network,
    set_node_sizeref,
)

# ml tools
from sklearn import datasets
from sklearn.cluster import DBSCAN

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

## Generate and visualise data
As a simple example, let's generate a two-dimensional point cloud of two concentric circles. The goal will be to examine how Mapper can be used to generate a topological graph that captures the salient features of the data.

In [None]:
data, _ = datasets.make_circles(n_samples=5000, noise=0.05, factor=0.3, random_state=42)

fig = go.Figure(
    data=go.Scatter(x=data[:, 0], y=data[:, 1], mode="markers"),
    layout={"autosize": False},
)
fig.show()

In [None]:
import pandas as pd

# wrap the array in a DataFrame so that columns can be referenced by name
data = pd.DataFrame(data, columns=['a', 'b'])

## Configure the Mapper pipeline
Given a dataset ${\cal D}$ of points $x \in \mathbb{R}^n$, the basic steps behind Mapper are as follows:

1. Map ${\cal D}$ to a lower-dimensional space using a _**filter function**_ $ f: \mathbb{R}^n \to \mathbb{R}^m $. Common choices for the filter function include projection onto one or more axes via PCA or density-based methods. In giotto-learn, you can import the desired filter function as follows:

```python
from giotto.mapper.filter import FilterFunctionName
```

2. Construct a cover of the filter values ${\cal U} = (U_i)_{i\in I}$, typically a set of overlapping intervals of equal length. As with the filter, a choice of cover can be imported as follows:

```python
from giotto.mapper.cover import CoverName
```

3. For each interval $U_i \in {\cal U}$, cluster the points in the preimage $f^{-1}(U_i)$ into sets $C_{i,1}, \ldots , C_{i,k_i}$. The clustering algorithm can be any of scikit-learn's [clustering methods](https://scikit-learn.org/stable/modules/clustering.html) or giotto-learn's implementation of agglomerative clustering:

```python
# scikit-learn method
from sklearn.cluster import ClusteringAlgorithm
# giotto-learn method
from giotto.mapper.cluster import FirstSimpleGap
```

4. Construct the topological graph whose vertices are the cluster sets $(C_{i,j})_{i\in I, j \in \{1,\ldots,k_i\}}$ and an edge exists between two nodes if they share points in common: $C_{i,j} \cap C_{k,l} \neq \emptyset$. This step is handled automatically by giotto-learn.
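
To make the four steps concrete, here is a minimal sketch in plain Python that does not use giotto-learn. Everything in it is an assumption chosen for illustration: the toy data (12 points on the unit circle), the filter (the $x$-coordinate), the three hand-picked intervals, and the simple distance-threshold clustering helper `cluster`, which stands in for a real clustering algorithm such as DBSCAN.

```python
import math
from itertools import combinations

# Toy data: 12 points evenly spaced on the unit circle.
points = [
    (math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12))
    for k in range(12)
]

# Step 1: filter function f(x, y) = x.
filter_values = [x for x, _ in points]

# Step 2: a hand-picked cover of the filter values by overlapping intervals.
cover = [(-1.1, -0.3), (-0.6, 0.6), (0.3, 1.1)]


def cluster(indices, eps=0.6):
    """Split indices into connected components, linking points closer than eps."""
    remaining, clusters = set(indices), []
    while remaining:
        stack = [remaining.pop()]
        component = set(stack)
        while stack:
            i = stack.pop()
            near = {
                j for j in remaining
                if math.hypot(points[i][0] - points[j][0],
                              points[i][1] - points[j][1]) < eps
            }
            remaining -= near
            component |= near
            stack.extend(near)
        clusters.append(frozenset(component))
    return clusters


# Step 3: cluster the preimage of each interval; every cluster becomes a node.
nodes = []
for low, high in cover:
    preimage = [i for i, v in enumerate(filter_values) if low <= v <= high]
    nodes.extend(cluster(preimage))

# Step 4: connect two nodes whenever their clusters share at least one point.
edges = [
    (a, b) for a, b in combinations(range(len(nodes)), 2)
    if nodes[a] & nodes[b]
]

print(len(nodes), "nodes,", len(edges), "edges")  # prints: 4 nodes, 4 edges
```

The middle interval splits into two clusters (the top and bottom arcs of the circle), so the resulting graph is a single loop of four nodes, which mirrors the one hole of the circle.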

These four steps are implemented in a `MapperPipeline` object that mimics the `Pipeline` class from scikit-learn. The convenience function `make_mapper_pipeline()` lets you pass the choice of filter function, cover, and clustering algorithm as arguments. For example, to project our data onto the $x$- and $y$-axes, we could set up the pipeline as follows:

In [None]:
# define filter function - can be any scikit-learn Transformer
# filter_func = Projection(columns=[0,1])
filter_func = Projection(columns=['a', 'b'])
# define cover
cover = CubicalCover(n_intervals=10, overlap_frac=0.3)
# choose clustering algorithm - default is DBSCAN
clusterer = DBSCAN()

# configure parallelism
parallel_clustering_n_jobs = 1

# initialise pipeline
pipe = make_mapper_pipeline(
    filter_func=filter_func,
    cover=cover,
    clusterer=clusterer,
    verbose=True,
    parallel_clustering_n_jobs=parallel_clustering_n_jobs,
)
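
To build some intuition for `n_intervals` and `overlap_frac`, the sketch below computes a one-dimensional cover by equal-length intervals. The convention used here is an assumption made for illustration: consecutive intervals share a fraction `overlap_frac` of their length and together span exactly the range of the filter values. The helper `uniform_cover` is hypothetical and not part of giotto-learn, and `CubicalCover` may define its intervals differently.

```python
def uniform_cover(min_val, max_val, n_intervals, overlap_frac):
    """Return n_intervals equal-length intervals spanning [min_val, max_val].

    Assumed convention: consecutive intervals overlap by a fraction
    overlap_frac of their common length.
    """
    total = max_val - min_val
    # Solve n * length - (n - 1) * overlap_frac * length = total for length.
    length = total / (n_intervals * (1 - overlap_frac) + overlap_frac)
    # Distance between the left endpoints of consecutive intervals.
    step = length * (1 - overlap_frac)
    return [
        (min_val + i * step, min_val + i * step + length)
        for i in range(n_intervals)
    ]


intervals = uniform_cover(0.0, 1.0, n_intervals=10, overlap_frac=0.3)
for low, high in intervals:
    print(f"[{low:.3f}, {high:.3f}]")
```

For a two-dimensional filter such as the projection above, a cubical cover would presumably apply a construction of this kind along each axis and take products of the resulting intervals.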

## Run the Mapper pipeline
Now that our pipeline is configured, we can apply it to our data to obtain the topological graph object:

In [None]:
graph = pipe.fit_transform(data)

The resulting graph is a [python-igraph](https://igraph.org/python/) object whose metadata is stored in dictionaries. We can access this data as follows:

In [None]:
graph['node_metadata'].keys()

Here `node_id` is an identifier used to construct the graph, while `interval_id` and `cluster_id` refer to the interval and cluster sets described above. The `node_elements` entry holds, for each node, the indices of the original data points that belong to it. For example, to find which points belong to the first node of the graph, we can access the desired data as follows:

In [None]:
node_id, node_elements = graph['node_metadata']['node_id'], graph['node_metadata']['node_elements']

In [None]:
# print('Node Id: {}, \nNode elements: {}, \nData points: {}'.format(node_id[0], node_elements[0], data[node_elements[0]]))

print('Node Id: {}, \nNode elements: {}, \nData points: {}'.format(node_id[0], node_elements[0], data.iloc[node_elements[0]]))

In general, the node metadata is useful for deciding how to color the nodes in the Mapper graph.
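
As a small illustration of that idea, the sketch below derives one color value and one size per node from a list of member indices. The lists `values` and `node_elements` are hypothetical stand-ins (they are not taken from the graph above): `values` holds one number per data point, and `node_elements` holds the indices of the points in each node.

```python
from statistics import mean

# Hypothetical stand-ins: one value per data point, and for each node the
# indices of the data points it contains.
values = [0.1, 0.4, 0.5, 0.9, 1.0, 0.2]
node_elements = [[0, 1, 5], [1, 2], [2, 3, 4]]

# Color each node by the mean value of its members.
node_colors = [mean(values[i] for i in members) for members in node_elements]
# Size each node by the number of points it contains.
node_sizes = [len(members) for members in node_elements]

print(node_colors)
print(node_sizes)
```

Note that a data point can appear in more than one node (index 1 and index 2 above), which is exactly what produces edges in the Mapper graph.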

## Visualise the 2D Mapper graph

With the Mapper graph at hand, it is now a simple matter to visualise it. To warm up, let's examine the graph in two dimensions using the default arguments of giotto-learn's plotting function:

In [None]:
fig = create_network_2d(pipe, data, color_by_columns_dropdown=False)
# display figure
fig.show(config={"scrollZoom": True})

From the figure we can see that we have captured the salient topological features of our underlying data, namely two holes!
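
One way to check the number of holes without relying on the picture is to count the independent cycles of the graph, given by $E - V + C$ for a graph with $E$ edges, $V$ vertices and $C$ connected components. The sketch below is a minimal, hedged example on a hypothetical toy graph made of two disjoint loops; `cycle_rank` is a helper written for this illustration. For a python-igraph object, the inputs could presumably be obtained with `graph.vcount()` and `graph.get_edgelist()`.

```python
def cycle_rank(n_nodes, edges):
    """Return the number of independent cycles, E - V + C, of a graph."""
    parent = list(range(n_nodes))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    n_components = n_nodes
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            n_components -= 1
    return len(edges) - n_nodes + n_components


# Hypothetical toy graph: a square (nodes 0-3) and a triangle (nodes 4-6).
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (4, 5), (5, 6), (6, 4)]
print(cycle_rank(7, toy_edges))  # prints: 2
```

A real Mapper graph may contain additional small cycles caused by noise or by the choice of cover, so the count can exceed the number of holes in the underlying data.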

## Visualise the 3D Mapper graph
In some cases, it can be useful to explore the Mapper graph in three dimensions. Below we also show how to configure the size of the nodes using the utility function `set_node_sizeref()`, and how to choose which column of `data` the nodes are colored by:

In [None]:
# configure choice of layout
layout = graph.layout("kk", dim=3)
# get cluster member indices
node_elements = graph["node_metadata"]["node_elements"]
# define node coloring by mean value; data is a DataFrame, so index rows with .iloc
node_colors = [np.mean(data.iloc[el].to_numpy()) for el in node_elements]
# alternatively define (column_name, column_index) pairs to colour by
columns_to_color = {"x-axis": 0, "y-axis": 1}
# configure scale for node sizes
plotly_kwargs = {
    "node_trace_marker_sizeref": set_node_sizeref(node_elements, node_scale=30)
}
# initialise figure
fig = create_network_3d(
    graph,
    data,
    layout,
    node_colors,
    columns_to_color=columns_to_color,
    plotly_kwargs=plotly_kwargs,
)
# display figure
fig.show()

By toggling the _Color by_ dropdown menu, we can see how the $x$ and $y$ values of the circle are distributed over the Mapper graph.
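
The `node_scale` argument used above controls how large the biggest node is drawn. As a rough sketch of how such a scale can be turned into Plotly's marker `sizeref`, the snippet below applies the rule of thumb from Plotly's documentation for markers sized by area. Whether `set_node_sizeref()` uses exactly this formula is an assumption, the helper `sizeref_for` is hypothetical, and the node sizes are made-up counts of points per node.

```python
def sizeref_for(sizes, max_marker_px):
    """Plotly rule of thumb for area-scaled markers.

    sizeref = 2 * max(sizes) / (desired maximum marker size in pixels) ** 2
    """
    return 2.0 * max(sizes) / max_marker_px ** 2


# Hypothetical node sizes: number of data points in each node.
node_sizes = [10, 50, 200]
sizeref = sizeref_for(node_sizes, max_marker_px=30)
print(round(sizeref, 4))  # prints: 0.4444
```

With this choice the largest node is drawn at about `max_marker_px` pixels, and the other nodes scale down in proportion to their area.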

## Visualise the 2D Mapper graph interactively
In general, building useful Mapper graphs requires some iteration over the parameters of the cover and the clustering algorithm. To simplify that process, giotto-learn provides an interactive figure that can be reconfigured in real time. If invalid parameters are selected, the _Show logs_ checkbox can be used to see what went wrong.

In [None]:
# define (column_name, column_index) pairs to colour by
columns_to_color = {"x-axis": 0, "y-axis": 1}
# customise plot as desired
plotly_kwargs = {"node_trace_marker_reversescale": True}
# generate interactive plot
create_interactive_network(
    pipe, data, columns_to_color=columns_to_color, plotly_kwargs=plotly_kwargs
)
