# Case study: MNIST hand-written digits dataset

##### License: Apache 2.0


This notebook shows how to use *giotto-tda* to generate features for classifying digits. We first show how to build a few topological features step by step, and then present a pipeline that extracts a large number of features for classification.

The MNIST database of handwritten digits contains 70,000 greyscale images of 28×28 pixels, i.e. 784 features each. The raw data is available at http://yann.lecun.com/exdb/mnist/. It can be split into a training set made of the first 60,000 examples and a test set made of the remaining 10,000 examples.

## Import libraries
The first step is to import the relevant *gtda* components and other useful libraries or modules.

In [None]:
from gtda.images import (Binarizer, Inverter, ImageToPointCloud,
                         HeightFiltration, DilationFiltration,
                         RadialFiltration, ErosionFiltration,
                         SignedDistanceFiltration)
from gtda.homology import CubicalPersistence
from gtda.diagrams import (ForgetDimension, Amplitude, Scaler,
                           PersistenceEntropy, BettiCurve,
                           PersistenceLandscape, HeatKernel, Silhouette)
from gtda.plotting import plot_heatmap, plot_betti_curves, plot_diagram
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion, make_union
import numpy as np

## Loading the MNIST dataset

In [None]:
from sklearn.datasets import fetch_openml

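# as_frame=False returns NumPy arrays; recent scikit-learn versions would
# otherwise return a pandas DataFrame, which has no reshape method
X, y = fetch_openml(data_id=554, return_X_y=True, as_frame=False)
# Each row holds 784 pixel values: reshape it into a 28x28 image
X = X.reshape((-1, 28, 28))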

In [None]:
# For a full-blown example, you can set 
# n_train, n_test = 60000, 10000
n_train, n_test = 20, 10

X_train = X[:n_train]
y_train = y[:n_train]
X_test = X[n_train:n_train+n_test]
y_test = y[n_train:n_train+n_test]

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

## Some examples of the input data
We plot the first two samples of the training set.

In [None]:
plot_heatmap(X_train[0])

In [None]:
plot_heatmap(X_train[1])

## Binarization of the images

In [None]:
binarizer = Binarizer(threshold=0.4)
X_train_binarized = binarizer.fit_transform(X_train)

In [None]:
# plot_heatmap does not support boolean arrays, so cast to float first
plot_heatmap(X_train_binarized[1]*1.)
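
To make the thresholding concrete, here is a minimal NumPy sketch of the idea on a toy 3×3 image. It assumes that a pixel is switched on when its value exceeds `threshold` times the maximum pixel value; the toy image is made up for illustration, and the maximum is taken over this single image rather than over a whole collection.

```python
import numpy as np

# Toy 3x3 greyscale "image" with values in [0, 255], as in MNIST (made-up data)
image = np.array([[0, 50, 200],
                  [10, 255, 120],
                  [0, 90, 30]], dtype=float)

threshold = 0.4  # same fraction as used in the notebook

# Assumption: a pixel is "on" when it exceeds threshold * (maximum pixel value)
binarized = image > threshold * image.max()

print(binarized.astype(int))
```

With a maximum of 255, the cut-off is 102, so only the three brightest pixels remain active.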

## Inverting the boolean images

In [None]:
inverter = Inverter(n_jobs=4)
X_train_inverted = inverter.fit_transform(X_train_binarized)

In [None]:
plot_heatmap(X_train_inverted[1]*1.)

## Applying a boolean image filtration

In [None]:
n_iterations = 28

erosion_filtration = ErosionFiltration(n_iterations=n_iterations, n_jobs=4)
X_train_filtered = erosion_filtration.fit_transform(X_train_inverted)

In [None]:
plot_heatmap(X_train_filtered[1])
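
The idea behind an erosion filtration can be sketched in plain NumPy: each activated pixel receives its Manhattan distance to the closest deactivated pixel, and deactivated pixels receive 0. The helper `erosion_filtration_sketch` below is hypothetical and brute-force; it ignores the cap imposed by `n_iterations` and assumes the image has at least one deactivated pixel, so it illustrates the concept rather than the library's implementation.

```python
import numpy as np

def erosion_filtration_sketch(binary_image):
    """Give each True pixel its Manhattan distance to the nearest False pixel.

    Illustrative helper (not part of giotto-tda); requires at least one
    False pixel in the image.
    """
    on = np.argwhere(binary_image)    # coordinates of activated pixels
    off = np.argwhere(~binary_image)  # coordinates of deactivated pixels
    out = np.zeros(binary_image.shape, dtype=int)
    for r, c in on:
        # Manhattan distance from (r, c) to every deactivated pixel
        out[r, c] = np.abs(off - [r, c]).sum(axis=1).min()
    return out

# A 3x3 activated square surrounded by a deactivated border
image = np.zeros((5, 5), dtype=bool)
image[1:4, 1:4] = True

filtered = erosion_filtration_sketch(image)
print(filtered)
```

Pixels deep inside a shape get larger values, so thresholding the result at increasing levels peels the shape away layer by layer.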

## Getting persistence diagrams out of images

In [None]:
cubical_complex = CubicalPersistence(n_jobs=1)
X_train_cubical = cubical_complex.fit_transform(X_train_filtered)

In [None]:
plot_diagram(X_train_cubical[1])

## Computing the Betti curves

In [None]:
betti = BettiCurve(n_bins=36, n_jobs=1)
X_train_betti = betti.fit_transform(X_train_cubical)

In [None]:
betti.plot(X_train_betti, sample=1)
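
A Betti curve counts, for each filtration value, how many features of a given homology dimension are alive. The sketch below computes it by hand for a made-up diagram whose rows are (birth, death, homology dimension) triples. The helper `betti_curve_sketch`, the toy diagram, and the half-open convention birth <= t < death are assumptions for illustration; the binning used by `BettiCurve` may differ.

```python
import numpy as np

# Hypothetical diagram for one image: rows are (birth, death, homology dimension)
diagram = np.array([[0.0, 3.0, 0],
                    [1.0, 2.0, 0],
                    [1.0, 4.0, 1]])

def betti_curve_sketch(diagram, dim, grid):
    """Count the features of dimension `dim` alive at each value in `grid`."""
    points = diagram[diagram[:, 2] == dim]
    births, deaths = points[:, 0], points[:, 1]
    # A feature is alive at t when birth <= t < death (assumed convention)
    alive = (births[:, None] <= grid) & (grid < deaths[:, None])
    return alive.sum(axis=0)

grid = np.linspace(0.0, 4.0, 5)  # t = 0, 1, 2, 3, 4
betti_0 = betti_curve_sketch(diagram, 0, grid)
betti_1 = betti_curve_sketch(diagram, 1, grid)
print(betti_0, betti_1)
```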

## Computing the heat kernel of stacked diagrams

In [None]:
diagram_stacker = ForgetDimension()
X_train_stacked = diagram_stacker.fit_transform(X_train_cubical)

In [None]:
heat = HeatKernel(sigma=3., n_bins=36, n_jobs=1)
X_train_heat = heat.fit_transform(X_train_stacked)

In [None]:
plot_heatmap(X_train_heat[1, 0])

## Rescaling the diagrams

In [None]:
metric = {'metric': 'bottleneck', 'metric_params': {}}

diagram_scaler = Scaler(**metric)
diagram_scaler.fit(X_train_cubical)
X_train_scaled = diagram_scaler.transform(X_train_cubical)

In [None]:
diagram_scaler.plot(X_train_scaled, sample=1)

## Building a pipeline to extract features

In [None]:
steps = [
    ('binarizer', Binarizer(threshold=0.4)),
    ('filtration', SignedDistanceFiltration(n_iterations=28)),
    ('persistence', CubicalPersistence(n_jobs=1)),
    ('amplitude', Amplitude(metric='wasserstein', metric_params={'p': 2}, n_jobs=1))
    ]

pipeline_signed_distance = Pipeline(steps)

In [None]:
X_train_pipeline_distance = pipeline_signed_distance.fit_transform(X_train)

## Applying several pipelines based on different filtrations

In [None]:
# One height filtration per axis-aligned direction
direction_list = [np.array([0, 1]), np.array([0, -1]),
                  np.array([1, 0]), np.array([-1, 0])]

filtration_list = [HeightFiltration(direction=direction)
                   for direction in direction_list]

# Same steps for every filtration: binarize, filter, compute persistence,
# then summarize each diagram by its amplitude
steps_list = [
    [('binarizer', Binarizer(threshold=0.4)),
     ('filtration', filtration),
     ('persistence', CubicalPersistence()),
     ('amplitude', Amplitude(metric='heat', metric_params={'p': 2}))]
    for filtration in filtration_list
]

# Name each pipeline after its direction vector
pipeline_list = [(str(direction), Pipeline(steps))
                 for direction, steps in zip(direction_list, steps_list)]
feature_union_filtrations = FeatureUnion(pipeline_list, n_jobs=-1)
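
To give an intuition for what a height filtration encodes, the sketch below assigns each activated pixel its position along a direction vector, measured from the first edge of the image met in that direction, and gives deactivated pixels a value above every height. The helper `height_filtration_sketch` is hypothetical, and the (row, column) ordering of the direction components is an assumption; it is not the library's implementation.

```python
import numpy as np

def height_filtration_sketch(binary_image, direction):
    """Illustrative height filtration of a 2D boolean image (not giotto-tda)."""
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    rows, cols = np.indices(binary_image.shape)
    # Position of every pixel along the direction, assuming (row, col) order
    heights = rows * direction[0] + cols * direction[1]
    heights = heights - heights.min()  # first edge met has height 0
    # Deactivated pixels receive a value larger than every height
    return np.where(binary_image, heights, heights.max() + 1)

# Toy image with an activated diagonal
image = np.eye(3, dtype=bool)

down = height_filtration_sketch(image, [1, 0])
up = height_filtration_sketch(image, [-1, 0])
print(down)
print(up)
```

Opposite directions sweep the same shape in reverse order, which is why using several directions yields complementary features.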

In [None]:
feature_union_filtrations.fit(X_train[:20])
X_train_filtrations = feature_union_filtrations.transform(X_train)
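
`FeatureUnion` concatenates the outputs of its transformers column-wise. The NumPy sketch below mimics that step with random stand-in arrays, assuming each of the four pipelines returns one amplitude per homology dimension (two columns per sample); the actual number of columns depends on the `Amplitude` settings and the library version.

```python
import numpy as np

n_samples = 20
rng = np.random.default_rng(0)

# Hypothetical per-direction outputs, each of shape (n_samples, 2):
# one amplitude per homology dimension (assumed H0 and H1)
per_direction = [rng.random((n_samples, 2)) for _ in range(4)]

# Column-wise concatenation, as FeatureUnion does with its transformers' outputs
features = np.hstack(per_direction)
print(features.shape)  # (20, 8)
```

The resulting matrix has one row per image and can be passed directly to any scikit-learn classifier.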