# Applicability of Training an ML Model with TDA: Delaunay-Rips vs. Rips vs. Alpha Complexes
Author: Amish Mishra  
Date: November 1, 2022

## Notes
* We abbreviate "Delaunay-Rips" as DR.
* We will refer to the pipeline that uses DR, Rips, or Alpha for training/validating the corresponding ML model as the "DR method", "Rips method", or "Alpha method", respectively.
* TODO: Prefix folder names with 1, 2, 3, ... to show the order in which they are used.

## Import the necessary libraries

In [1]:
%pip install git+https://github.com/amish-mishra/cechmate_DR.git

Collecting git+https://github.com/amish-mishra/cechmate_DR.git
  Cloning https://github.com/amish-mishra/cechmate_DR.git to /tmp/pip-req-build-g4mn1vcp
Building wheels for collected packages: cechmate
  Building wheel for cechmate (setup.py) ... done
  Created wheel for cechmate: filename=cechmate-0.1.0-py3-none-any.whl size=25699 sha256=d1b65a114783857f31ed30b8ef61734723a886c9ad97039bed5624c273f5b437
  Stored in directory: /tmp/pip-ephem-wheel-cache-gpkhmimr/wheels/95/7a/b2/9a6ca1ac9896cacb6226d881ec1e4ec562da85b744d83c210c
Successfully built cechmate
Note: you may need to restart the kernel to use updated packages.


In [3]:
import os
import time
import pandas
import pickle
import numpy as np
import cechmate as cm
import matplotlib.pyplot as plt
from ripser import ripser
from sklearn.svm import SVC
from scipy import stats
from sklearn import metrics
from scipy.stats import median_test
from persistence_stats import generate_training_validation_pers_stats
from train_ml_classifiers import train_ml_classifiers
from validate_ml_classifiers import validate_ml_classifiers

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 80 from PyObject

## 1. Generate Persistence Statistics from Persistence Diagrams using DR, Rips, and Alpha

In [None]:
types = ['Training', 'Validation']
methods = ['rips', 'alpha', 'del_rips']
for t in types:
    for m in methods:
        generate_training_validation_pers_stats(type_of_data=t, method=m, verbose=False)
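
The internals of `generate_training_validation_pers_stats` are not shown in this notebook. As a rough illustration of what "persistence statistics" can mean, the sketch below summarizes one persistence diagram by the mean, standard deviation, median, and IQR of its finite lifetimes. The function name `persistence_stats` and the choice of these four statistics are assumptions made for illustration, not the project's actual feature set.

```python
import numpy as np

def persistence_stats(diagram):
    """Summarize a persistence diagram given as (birth, death) pairs.

    Hypothetical helper: returns [mean, std, median, IQR] of the lifetimes.
    """
    diagram = np.asarray(diagram, dtype=float).reshape(-1, 2)
    # Drop features that never die (death = inf), such as the essential H0 class.
    finite = diagram[np.isfinite(diagram[:, 1])]
    if len(finite) == 0:
        return np.zeros(4)
    lifetimes = finite[:, 1] - finite[:, 0]
    q25, q75 = np.percentile(lifetimes, [25, 75])
    return np.array([lifetimes.mean(), lifetimes.std(), np.median(lifetimes), q75 - q25])

# Toy diagram with three finite features and one infinite feature.
toy_diagram = [[0.0, 1.0], [0.0, 3.0], [1.0, 2.0], [0.0, np.inf]]
toy_stats = persistence_stats(toy_diagram)
print(toy_stats)
```

A fixed-length vector like this is what lets diagrams with different numbers of points be fed to a standard classifier such as an SVM.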

## 2. Train ML models (SVM) based on Persistence Statistics

In [None]:
func_arr = ['rips', 'alpha', 'del_rips']
for func in func_arr:
    train_ml_classifiers(func)

## 3. Validate ML models

In [None]:
func_arr = ['rips', 'alpha', 'del_rips']
for func in func_arr:
    validate_ml_classifiers(func)
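
The code of `validate_ml_classifiers` is not included here. The performance tables loaded in the next section start with four confusion-matrix rows, which suggests a binary classification task. Under that assumption, the following minimal sketch shows how confusion-matrix counts and a few derived metrics could be computed from true and predicted labels. The helper names and the particular derived metrics are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0 and 1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def basic_metrics(y_true, y_pred):
    """Accuracy, precision, and recall derived from the confusion counts."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up labels, for illustration only.
example_true = [1, 1, 0, 0, 1, 0]
example_pred = [1, 0, 0, 1, 1, 0]
example_counts = confusion_counts(example_true, example_pred)
example_metrics = basic_metrics(example_true, example_pred)
print(example_counts, example_metrics)
```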

## 4. Generate performance metrics

### Calculate the median and IQR for each method's performance metrics table

In [None]:
func_arr = ['rips', 'alpha', 'del_rips']
all_perf_stats_by_func = {'rips':0, 'alpha':0, 'del_rips':0}

for func in func_arr:
    print(f'========== {func} performance ==========')
    perf_metrics = pandas.read_pickle(
        f'performance_metrics_tables/perf_metrics_{func}_svm_classifier.pkl')
    summary_metrics = pandas.DataFrame({'median':[], 'iqr':[]})
    # print(perf_metrics)
    summary_metrics['median'] = perf_metrics.median(axis=1)
    quantile_75 = perf_metrics.quantile(0.75, axis=1)
    quantile_25 = perf_metrics.quantile(0.25, axis=1)
    summary_metrics['iqr'] = quantile_75 - quantile_25
    # Skip the first four rows: the median and IQR of the confusion matrix elements are not relevant
    relevant_summary_metrics = summary_metrics.iloc[4:]
    all_perf_stats_by_func[func] = perf_metrics.iloc[4:]
    print(relevant_summary_metrics)
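
To make the `axis=1` aggregation concrete, here is a small self-contained example on a made-up table. It assumes the same layout as the pickled tables (one row per metric, one column per run). The metric names and values are invented for illustration.

```python
import pandas

# Hypothetical table: rows are metrics, columns are repeated runs.
toy_perf = pandas.DataFrame(
    {
        "run_0": [0.90, 0.80],
        "run_1": [0.92, 0.85],
        "run_2": [0.94, 0.75],
        "run_3": [0.96, 0.90],
    },
    index=["accuracy", "f1"],
)

# axis=1 aggregates across the columns, giving one value per metric row.
toy_summary = pandas.DataFrame(
    {
        "median": toy_perf.median(axis=1),
        "iqr": toy_perf.quantile(0.75, axis=1) - toy_perf.quantile(0.25, axis=1),
    }
)
print(toy_summary)
```

The median and IQR are used here rather than the mean and standard deviation, which keeps the summary robust to a few unusually good or bad runs.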

### Perform a row-by-row median test pairwise between DR method, Rips method, and Alpha method

In [None]:
p_val_df = pandas.DataFrame({'p-value for rips vs alpha':[], 'p-value for rips vs del-rips':[],'p-value for alpha vs del-rips':[]})
for idx, row in all_perf_stats_by_func['rips'].iterrows():
    rips_row = all_perf_stats_by_func['rips'].loc[[idx]].values[0]
    alpha_row = all_perf_stats_by_func['alpha'].loc[[idx]].values[0]
    del_rips_row = all_perf_stats_by_func['del_rips'].loc[[idx]].values[0]
    _, p_r_a, _, _ = median_test(rips_row, alpha_row)
    _, p_r_d, _, _ = median_test(rips_row, del_rips_row)
    _, p_a_d, _, _ = median_test(alpha_row, del_rips_row)
    p_val_df.loc[len(p_val_df.index)] = [p_r_a, p_r_d, p_a_d]
p_val_df.index = all_perf_stats_by_func['rips'].index

In [None]:
print(p_val_df[['p-value for rips vs del-rips', 'p-value for alpha vs del-rips']])
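
For readers unfamiliar with the median test, the sketch below spells out the two-sample case in plain Python. It pools the samples, counts how many values in each sample fall above the grand median versus at or below it, and runs a chi-square test with one degree of freedom on the resulting 2x2 table. It is meant to mirror what I understand to be the default behaviour of `scipy.stats.median_test` (ties counted as "below", continuity correction applied). The function `mood_median_test` is a hypothetical stand-in, not the library's implementation.

```python
import math
import statistics

def mood_median_test(x, y):
    """Two-sample Mood's median test with Yates' continuity correction."""
    grand_median = statistics.median(list(x) + list(y))
    # 2x2 contingency table: row 0 counts values above the grand median,
    # row 1 counts values at or below it (ties are counted as "below").
    table = [
        [sum(v > grand_median for v in x), sum(v > grand_median for v in y)],
        [sum(v <= grand_median for v in x), sum(v <= grand_median for v in y)],
    ]
    row_totals = [sum(row) for row in table]
    col_totals = [len(x), len(y)]
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("degenerate table: the test is undefined")
    n = len(x) + len(y)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            # Yates' correction: shrink |observed - expected| by 0.5, not below 0.
            diff = max(abs(table[i][j] - expected) - 0.5, 0.0)
            chi2 += diff ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value, grand_median, table

# Two clearly separated made-up samples give a small p-value.
stat_sep, p_sep, med_sep, table_sep = mood_median_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
# Two identical samples give no evidence of a difference in medians.
stat_same, p_same, _, _ = mood_median_test([1, 2, 3, 4], [1, 2, 3, 4])
print(p_sep, p_same)
```

Because the test only uses counts above and below the grand median, it makes no assumption about the shape of the metric distributions, at the cost of lower power with few runs.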

## Conclusion
We treat a p-value smaller than 0.2 as indicating a significant difference between the medians of a given performance metric for the classification models built on the two filtration functions being compared. The p-value for the aps metric for the rips vs del-rips classifiers is the only one well below 0.2, so we have sufficient evidence to suggest that the median aps of the rips-based classifier differs significantly from that of the del-rips-based classifier. For the remaining metrics, the p-values in the table above are well above 0.2, so we cannot conclude that the medians differ between the Delaunay-Rips method and either of the other methods on our classification task. **Hence, for every performance metric of interest except aps, a classifier trained on statistics generated with the Delaunay-Rips complex performs comparably to one that uses either Rips or Alpha as the underlying method for persistent homology.**

Based on this project, we make the following suggestion.

A data analyst would benefit from using the Delaunay-Rips complex in a data analysis application when:
1. Topological features of the dataset are of high interest
2. Computation time is a limited resource
3. The dimension of the input data is neither too high nor too low (between 3 and 8)
4. The Average Precision Score (APS) of the ML model is not crucially important