# Case study: MNIST hand-written digits dataset

##### License: Apache 2.0


This notebook shows how to use *giotto-tda* to generate topological features for classifying handwritten digits. We first build a few topological features step by step, then present a pipeline that extracts a large number of features for classification.

The MNIST database of handwritten digits contains 70,000 images of 28 × 28 pixels, i.e. 784 features per image. The raw data is available at http://yann.lecun.com/exdb/mnist/. It can be split into a training set made of the first 60,000 examples and a test set made of the remaining 10,000 examples.

## Import libraries
The first step is to import the relevant *gtda* components and other useful libraries or modules.

In [None]:
from gtda.images import Binarizer, Inverter, ImageToPointCloud, HeightFiltration, DilationFiltration, RadialFiltration, ErosionFiltration, SignedDistanceFiltration
from gtda.homology import CubicalPersistence
from gtda.diagrams import ForgetDimension, Amplitude, Scaler, PersistenceEntropy, BettiCurve, PersistenceLandscape, HeatKernel, Silhouette
from gtda.plotting import plot_heatmap, plot_betti_curves, plot_diagram
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion, make_union
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
import numpy as np

# Displaying pipelines as diagrams would be quite nice, but it requires sklearn >= 0.23
# from sklearn import set_config
# set_config(display='diagram')

## Loading the MNIST dataset

In [None]:
from sklearn.datasets import fetch_openml

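# as_frame=False makes fetch_openml return NumPy arrays rather than a pandas
# DataFrame, which is what the reshape below requires
X, y = fetch_openml(data_id=554, return_X_y=True, as_frame=False)
X = X.reshape((-1, 28, 28))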
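
As a quick sanity check on the reshape above, the following sketch uses a synthetic array (not MNIST itself) to show how a flat vector of 784 values maps onto a 28 × 28 grid in NumPy's default row-major order. The array `flat` and the index arithmetic are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Synthetic stand-in for 5 flattened images of 784 pixels each (not real MNIST data)
flat = np.arange(5 * 784, dtype=float).reshape(5, 784)

# Same reshape as the one applied to X: one 28 x 28 grid per sample
images = flat.reshape((-1, 28, 28))

# In row-major order, flat pixel k lands at row k // 28, column k % 28
k = 30
row, col = divmod(k, 28)
print(images.shape, (row, col), images[0, row, col] == flat[0, k])
```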

In [None]:
# For a full-blown example, you can set 
# n_train, n_test = 60000, 10000
n_train, n_test = 60, 10

X_train = X[:n_train]
y_train = y[:n_train]
X_test = X[n_train:n_train+n_test]
y_test = y[n_train:n_train+n_test]

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

## Some examples of the input data
We visualize the first two samples of the training set.

In [None]:
plot_heatmap(X_train[0])

In [None]:
plot_heatmap(X_train[1])

## Binarization of the images

In [None]:
binarizer = Binarizer(threshold=0.4)
X_train_binarized = binarizer.fit_transform(X_train)

In [None]:
# plot_heatmap does not support boolean arrays, so we cast the image to float
plot_heatmap(X_train_binarized[1]*1.)
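
To make the role of `threshold` concrete, here is a minimal NumPy sketch of binarization. It assumes, as the giotto-tda documentation describes, that the threshold is interpreted as a fraction of the maximum pixel value over all images; `binarize` is a hypothetical helper written for this illustration, not the library's implementation.

```python
import numpy as np

def binarize(images, threshold=0.4):
    """Mark pixels brighter than `threshold` times the global maximum as True."""
    max_value = images.max()  # assumption: the maximum is taken over all images
    return images / max_value > threshold

# One tiny 2 x 3 "image" with grey levels between 0 and 255
images = np.array([[[0., 50., 255.],
                    [200., 100., 0.]]])
binary = binarize(images)
print(binary)
```

With a threshold of 0.4 and a maximum of 255, only pixels above 102 are activated, so 100 stays off while 200 and 255 turn on.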

## Inverting the boolean images

In [None]:
inverter = Inverter(n_jobs=4)
X_train_inverted = inverter.fit_transform(X_train_binarized)

In [None]:
plot_heatmap(X_train_inverted[1]*1.)

## Applying a boolean image filtration

In [None]:
n_iterations = 28

erosion_filtration = ErosionFiltration(n_iterations=n_iterations, n_jobs=4)
X_train_filtered = erosion_filtration.fit_transform(X_train_inverted)

In [None]:
plot_heatmap(X_train_filtered[1])
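
To build intuition for the values plotted above, here is a brute-force NumPy sketch of the idea behind an erosion-type filtration: each activated pixel receives its Manhattan distance to the nearest deactivated pixel. `manhattan_depth` is a hypothetical helper. It ignores the `n_iterations` cap and the exact value conventions of `ErosionFiltration`, so its output is not guaranteed to match the library's.

```python
import numpy as np

def manhattan_depth(binary):
    """Manhattan distance from each True pixel to the nearest False pixel.

    Assumes the image contains at least one False pixel.
    """
    off = np.argwhere(~binary)  # coordinates of deactivated pixels
    out = np.zeros(binary.shape, dtype=int)
    for i, j in np.argwhere(binary):
        out[i, j] = np.abs(off - np.array([i, j])).sum(axis=1).min()
    return out

# A 3 x 3 activated square surrounded by a deactivated border
binary = np.zeros((5, 5), dtype=bool)
binary[1:4, 1:4] = True
depth = manhattan_depth(binary)
print(depth)
```

Pixels deep inside an activated region get larger values, so they enter the filtration later than pixels close to the boundary.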

## Getting persistence diagrams out of images

In [None]:
cubical_complex = CubicalPersistence(n_jobs=1)
X_train_cubical = cubical_complex.fit_transform(X_train_filtered)

In [None]:
plot_diagram(X_train_cubical[1])
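
Each point of a persistence diagram records the filtration value at which a topological feature is born and the value at which it dies. Cubical persistence on images is too involved to reproduce here, but the 0-dimensional case on a 1D signal captures the main idea. The sketch below is a simplified illustration under that assumption: `sublevel_persistence_1d` is a hypothetical helper, not part of giotto-tda, and it tracks connected components of sublevel sets with a union-find structure.

```python
def sublevel_persistence_1d(values):
    """(birth, death) pairs of connected components of a 1D signal's sublevel sets."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    parent, birth = {}, {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = []
    for i in order:
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if j not in parent:
                continue  # neighbour not yet in the sublevel set
            ri, rj = find(i), find(j)
            if ri == rj:
                continue
            # Elder rule: the younger component dies when two components merge
            young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
            if values[i] > birth[young]:
                pairs.append((birth[young], values[i]))
            parent[young] = old
    return sorted(pairs)

signal = [2, 5, 1, 4, 3]
pairs = sublevel_persistence_1d(signal)
print(pairs)
```

The component born at the global minimum never dies, so it produces no finite pair in this sketch.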

## Computing the Betti curves

In [None]:
betti = BettiCurve(n_bins=36, n_jobs=1)
X_train_betti = betti.fit_transform(X_train_cubical)

In [None]:
betti.plot(X_train_betti, sample=1)
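
A Betti curve counts, for each filtration value, how many features of a given homology dimension are alive at that value. The following sketch illustrates this on a hand-written diagram of (birth, death, dimension) triples. `betti_curve` is a hypothetical helper, and its endpoint convention (birth ≤ t < death) and sampling grid are assumptions that may differ from those of `BettiCurve`.

```python
import numpy as np

def betti_curve(diagram, dim, values):
    """Number of features of dimension `dim` alive at each filtration value."""
    points = diagram[diagram[:, 2] == dim]
    births, deaths = points[:, 0], points[:, 1]
    return np.array([np.sum((births <= t) & (t < deaths)) for t in values])

# Hand-written diagram: three features in dimension 0, one in dimension 1
diagram = np.array([[0., 3., 0.],
                    [1., 2., 0.],
                    [0., 5., 0.],
                    [2., 4., 1.]])
values = np.arange(5)
curve_dim0 = betti_curve(diagram, 0, values)
curve_dim1 = betti_curve(diagram, 1, values)
print(curve_dim0, curve_dim1)
```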

## Computing the heat kernel of stacked diagrams

In [None]:
diagram_stacker = ForgetDimension()
X_train_stacked = diagram_stacker.fit_transform(X_train_cubical)

In [None]:
heat = HeatKernel(sigma=3., n_bins=36, n_jobs=1)
X_train_heat = heat.fit_transform(X_train_stacked)

In [None]:
plot_heatmap(X_train_heat[1, 0])

In [None]:
print(X_train_heat.shape)

## Rescaling the diagrams

In [None]:
metric = {'metric': 'bottleneck', 'metric_params': {}}

diagram_scaler = Scaler(**metric)
diagram_scaler.fit(X_train_cubical)
X_train_scaled = diagram_scaler.transform(X_train_cubical)

In [None]:
diagram_scaler.plot(X_train_scaled, sample=1)

## Building a pipeline to extract features

In [None]:
steps = [
    ('binarizer', Binarizer(threshold=0.4)),
    ('filtration', SignedDistanceFiltration(n_iterations=28)),
    ('diagram', CubicalPersistence(n_jobs=1)),
    ('amplitude', Amplitude(metric='wasserstein', metric_params={'p': 2}, n_jobs=1))
    ]

pipeline_signed_distance = Pipeline(steps)

In [None]:
X_train_pipeline_distance = pipeline_signed_distance.fit_transform(X_train)

## Obtaining features from different filtrations

In [None]:
direction_list = [ np.array([0, 1]), np.array([0, -1]), np.array([1, 0]), np.array([-1, 0]) ]

filtration_list = [HeightFiltration(direction=direction) 
                    for direction in direction_list]

steps_list = [ [
    ('binarizer', Binarizer(threshold=0.4)),
    ('filtration', filtration),
    ('diagram', CubicalPersistence()),
    ('amplitude', Amplitude(metric='heat', metric_params={'p': 2}))]
    for filtration in filtration_list ]

pipeline_list = [(str(direction), Pipeline(steps))
                 for direction, steps in zip(direction_list, steps_list)]
feature_union_filtrations = FeatureUnion(pipeline_list, n_jobs=-1)

In [None]:
feature_union_filtrations.fit(X_train[:20])
X_train_filtrations = feature_union_filtrations.transform(X_train)
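
To see why the direction matters, here is a rough NumPy sketch of what a height filtration computes: activated pixels are valued by how far along the chosen direction they lie, while deactivated pixels get a value larger than any activated one. `height_values` is a hypothetical helper, and its axis ordering and offset conventions are assumptions of this sketch, not necessarily those of `HeightFiltration`.

```python
import numpy as np

def height_values(binary, direction):
    """Value each True pixel by its height along `direction`; False pixels get max + 1."""
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    rows, cols = np.indices(binary.shape)
    heights = rows * direction[0] + cols * direction[1]
    heights = heights - heights.min()  # the first pixel seen has height 0
    return np.where(binary, heights, heights.max() + 1)

# A diagonal stroke swept along the first axis, then in the opposite direction
binary = np.eye(3, dtype=bool)
filtered = height_values(binary, [1, 0])
filtered_reversed = height_values(binary, [-1, 0])
print(filtered)
print(filtered_reversed)
```

Sweeping the same image from opposite sides orders the pixels differently, which is why each direction can yield a different persistence diagram and therefore different features.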

## Deriving a full-scale TDA feature extraction pipeline

We can go full-scale and extract a large number of features. Be careful: some of them will be highly correlated, so it is a good idea to run a feature selection algorithm to reduce their number before passing them to a classifier.

In [None]:
direction_list = [ [1, 0], [1, 1], [0, 1], [-1, 1], [-1, 0], [-1, -1], [0, -1], [1, -1] ] 
center_list = [ [13, 6], [6, 13], [13, 13], [20, 13], [13, 20], [6, 6], [6, 20], [20, 6], [20, 20] ]
n_iterations_erosion_list = [6, 10]
n_iterations_dilation_list = [6, 10]
n_iterations_signed_list = [6, 10]
n_neighbors_list = [2, 4]

# Creating a list of all the filtration transformers we will be applying
filtration_list =  [HeightFiltration(direction=np.array(direction)) 
                    for direction in direction_list] \
                 + [RadialFiltration(center=np.array(center)) 
                    for center in center_list] \
                 + [ErosionFiltration(n_iterations=n_iterations) 
                    for n_iterations in n_iterations_erosion_list] \
                 + [DilationFiltration(n_iterations=n_iterations) 
                    for n_iterations in n_iterations_dilation_list] \
                 + [SignedDistanceFiltration(n_iterations=n_iterations) 
                    for n_iterations in n_iterations_signed_list] \
                 + ['passthrough']

# Creating the diagram generation pipeline
diagram_steps = [[Binarizer(threshold=0.4), 
                  filtration, 
                  CubicalPersistence(homology_dimensions=[0, 1]), 
                  Scaler(metric='bottleneck')] 
                  for filtration in filtration_list]

# Listing all metrics we want to use to extract diagram amplitudes
metric_list = [ 
   {'metric': 'bottleneck', 'metric_params': {}},
   {'metric': 'wasserstein', 'metric_params': {'p': 1}},
   {'metric': 'wasserstein', 'metric_params': {'p': 2}},
   {'metric': 'landscape', 'metric_params': {'p': 1, 'n_layers': 1, 'n_bins': 100}},
   {'metric': 'landscape', 'metric_params': {'p': 1, 'n_layers': 2, 'n_bins': 100}},
   {'metric': 'landscape', 'metric_params': {'p': 2, 'n_layers': 1, 'n_bins': 100}},
   {'metric': 'landscape', 'metric_params': {'p': 2, 'n_layers': 2, 'n_bins': 100}},
   {'metric': 'betti', 'metric_params': {'p': 1, 'n_bins': 100}},
   {'metric': 'betti', 'metric_params': {'p': 2, 'n_bins': 100}},
   {'metric': 'heat', 'metric_params': {'p': 1, 'sigma': 1.6, 'n_bins': 100}},
   {'metric': 'heat', 'metric_params': {'p': 1, 'sigma': 3.2, 'n_bins': 100}},
   {'metric': 'heat', 'metric_params': {'p': 2, 'sigma': 1.6, 'n_bins': 100}},
   {'metric': 'heat', 'metric_params': {'p': 2, 'sigma': 3.2, 'n_bins': 100}}
]

# Combining persistence entropy with the amplitudes for all the metrics above
feature_union = make_union(*[PersistenceEntropy()] + [Amplitude(**metric, order=None) 
                                                      for metric in metric_list])

tda_union = make_union(*[make_pipeline(*diagram_step, feature_union)
                         for diagram_step in diagram_steps], n_jobs=-1)


In [None]:
X_train_tda = tda_union.fit_transform(X_train)
X_train_tda.shape

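We have generated 672 topological features per image: 24 filtrations, each yielding 28 features (persistence entropy plus 13 amplitudes, in each of the 2 homology dimensions). These features were not carefully selected, however, and some of them are highly correlated.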
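
One simple way to reduce the redundancy is to drop any feature that is strongly correlated with a feature already kept. The sketch below illustrates this on a tiny synthetic matrix. `drop_correlated` and the 0.95 cut-off are assumptions made for this example, not a method used in the notebook.

```python
import numpy as np

def drop_correlated(features, threshold=0.95):
    """Indices of columns kept after greedily removing highly correlated ones."""
    corr = np.abs(np.corrcoef(features, rowvar=False))
    keep = []
    for j in range(features.shape[1]):
        # Keep column j only if it is not too correlated with any kept column
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

# Synthetic features: the second column is an affine copy of the first
a = np.array([1., 2., 3., 4., 5., 6.])
b = np.array([1., -1., 1., -1., 1., -1.])
features = np.column_stack([a, 2 * a + 1, b])
kept = drop_correlated(features)
print(kept)
```

On real data, the same function could be applied to `X_train_tda` before training a classifier. Constant columns yield undefined correlations and would need to be removed beforehand.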

In [None]:
#plot_features()

We can run a hyperparameter search to find the most useful features. Let's do it for a simple pipeline that uses HeightFiltration, and find the best direction for a classification problem using a RandomForestClassifier:

In [None]:
height_pipeline = Pipeline([
    ('binarizer', Binarizer(threshold=0.4)),
    ('filtration', HeightFiltration()),
    ('diagram', CubicalPersistence()),
    ('feature', PersistenceEntropy()),
    ('classifier', RandomForestClassifier())
])
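
The feature used in this pipeline, persistence entropy, summarizes a diagram by the Shannon entropy of its normalized lifetimes (death minus birth). Here is a minimal sketch on a hand-written diagram. `persistence_entropy` is a hypothetical helper, and the base-2 logarithm is an assumption that may differ from the convention in `PersistenceEntropy`.

```python
import numpy as np

def persistence_entropy(diagram, dim):
    """Shannon entropy (base 2) of the normalized lifetimes in dimension `dim`."""
    points = diagram[diagram[:, 2] == dim]
    lifetimes = points[:, 1] - points[:, 0]
    lifetimes = lifetimes[lifetimes > 0]  # ignore points on the diagonal
    if lifetimes.size == 0:
        return 0.0
    p = lifetimes / lifetimes.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

# Lifetimes 1, 1 and 2 in dimension 0; a single feature in dimension 1
diagram = np.array([[0., 1., 0.],
                    [0., 1., 0.],
                    [1., 3., 0.],
                    [2., 4., 1.]])
entropy_dim0 = persistence_entropy(diagram, 0)
entropy_dim1 = persistence_entropy(diagram, 1)
print(entropy_dim0, entropy_dim1)
```

A diagram dominated by one long-lived feature has low entropy, while many features with similar lifetimes give high entropy.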

We can tune the feature hyperparameters and the classifier hyperparameters together in a single grid search.

In [None]:
direction_list = [ [1, 0], [1, 1], [0, 1], [-1, 1], [-1, 0], [-1, -1], [0, -1], [1, -1] ] 
homology_dimensions_list = [ [0], [1] ]
n_estimators_list = [ 500, 1000, 2000 ]

param_grid = {
    'filtration__direction': [np.array(direction) for direction in direction_list],
    'diagram__homology_dimensions': homology_dimensions_list,
    'classifier__n_estimators': n_estimators_list
}

grid_search = GridSearchCV(estimator=height_pipeline, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)
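
Each key in the grid follows scikit-learn's `step__parameter` naming, so `filtration__direction` sets the `direction` parameter of the pipeline step named `filtration`. To get a feel for the cost of the search, this small sketch enumerates the candidates with the standard library only. It mirrors the grid above with plain lists in place of NumPy arrays.

```python
from itertools import product

param_grid = {
    'filtration__direction': [[1, 0], [1, 1], [0, 1], [-1, 1],
                              [-1, 0], [-1, -1], [0, -1], [1, -1]],
    'diagram__homology_dimensions': [[0], [1]],
    'classifier__n_estimators': [500, 1000, 2000],
}

# One candidate per combination of values, as in an exhaustive grid search
candidates = [dict(zip(param_grid, combo))
              for combo in product(*param_grid.values())]

n_folds = 3  # matches cv=3 above
n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)
```

With 8 directions, 2 choices of homology dimensions and 3 forest sizes, the search evaluates 48 candidates, i.e. 144 pipeline fits with 3-fold cross-validation, plus one final refit of the best candidate since `refit` defaults to `True`.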

In [None]:
print("Best parameters set found on validation set:")
print()
print(grid_search.best_params_)
print()
print("Grid scores on validation set:")
print()
means = grid_search.cv_results_['mean_test_score']
stds = grid_search.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid_search.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))
print()

print("Detailed classification report:")
print()
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print()
y_true, y_pred = y_test, grid_search.predict(X_test)
print(classification_report(y_true, y_pred))
print()

We have a full report on the grid search results! Even on this very small training set, HeightFiltration with direction [1, 0] in dimension 0 (connected components) provides the most promising feature. Can you interpret why?