# Comparing algorithms with ParetoBench

In this tutorial, we compare the performance of two algorithms, ChronoClust and FlowSOM, on two mass cytometry datasets, Levine13 and Levine32, using four metrics: F1-score, ARI, accuracy, and V-measure.

## Importing

If you have not yet installed ParetoBench, download it from our GitHub page (https://github.com/ghar1821/ParetoBench).
Unzip it to a folder and run the setup.py file (python3 setup.py install).

Then run the next cell to import ParetoBench.

In [1]:
import ParetoBench

## Specifying details for the comparison study

The main function that compares the algorithms' performance is called "compare".
For it to work, you first need to specify a few things:

* Metrics used for comparison
* Dataset names
* Algorithm names
* Location where the quality scores of the clustering solutions are stored
* The column denoting the unique identifier of the clustering result

The first three are self-explanatory.
The fourth is the directory containing all the CSV files that store the quality scores of each algorithm's clusterings.
The last is the name of the column, in each quality result file, that holds the unique identifier of each clustering solution.
Let's explore this in the next section.

### Data folder

Here, the data is stored under ~/Documents/phd/code/ParetoBench/examples/data.
Let's inspect its contents.

In [3]:
import os
# os.listdir does not expand "~", so resolve the home directory explicitly
data_dir = os.path.expanduser('~/Documents/phd/code/ParetoBench/examples/data')
os.listdir(data_dir)

As you can see, we have 2 folders, each named after an algorithm we want to compare.
Let's look at what's inside the flowsom folder.

In [None]:
os.listdir('{}/flowsom'.format(data_dir))

We have 2 datasets, so we need 2 files in each of the flowsom ***and chronoclust*** folders.
Each file must contain the quality scores of that algorithm's clustering solutions for one dataset.

In [None]:
import pandas as pd
dat = pd.read_csv('{}/flowsom/Levine32_scores.csv'.format(data_dir))
dat.head()

This is what a score file looks like.
Each row represents a clustering solution produced by flowsom.
Each metric column (f1, accuracy, ari, v_measure) holds the score of the clustering solution for that metric.
The param column denotes the parameter id of the clustering solution, and seed is the seed used to generate it.
The seed column is specific to flowsom.
If we look at chronoclust's scores, we won't find it.

In [None]:
dat = pd.read_csv('{}/chronoclust/Levine32_scores.csv'.format(data_dir))
dat.head()

Instead, we only have the 4 metrics and the parameter column.
This is the bare minimum required for ParetoBench to function.
If you want to keep extra information for each solution, such as parameter values, by all means add more columns (as with the seed column for flowsom above).
ParetoBench will simply ignore them.
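
As a quick sanity check before running a comparison, the sketch below tests whether a score file has the minimum columns described above. The helper missing_columns and the score values are made up for illustration and are not part of ParetoBench.

```python
import csv
import io

# Hypothetical helper for illustration; it is not part of ParetoBench.
def missing_columns(csv_text, metrics, param_id_col):
    """Return the required column names that are absent from a score file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = set(reader.fieldnames or [])
    required = list(metrics) + [param_id_col]
    return [col for col in required if col not in present]

# Made-up scores, shaped like the files described above.
chronoclust_like = "param,f1,accuracy,ari,v_measure\n0,0.61,0.70,0.55,0.68\n"
flowsom_like = "param,seed,f1,accuracy,ari,v_measure\n0,42,0.72,0.81,0.66,0.74\n"
incomplete = "param,f1\n0,0.5\n"

metrics = ["accuracy", "ari", "f1", "v_measure"]
print(missing_columns(chronoclust_like, metrics, "param"))  # nothing missing
print(missing_columns(flowsom_like, metrics, "param"))      # extra seed column is harmless
print(missing_columns(incomplete, metrics, "param"))        # three metrics missing
```

To check a real file, read its text with open(path).read() and pass it in.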

Hopefully all of that makes sense.
Now let's define the parameters for ParetoBench.

### ParetoBench parameter definitions

In [None]:
metrics = ['accuracy', 'ari', 'f1', 'v_measure']
datasets = ['Levine13', 'Levine32']
algorithms = ['chronoclust', 'flowsom']
datadir = data_dir
savedir = '/Users/givanna/Documents/phd/code/ParetoBench/examples/pareto_comparison'
param_id_col = 'param'

***Very important!***

The metric names must match the column names in the score files.

The dataset names must match the score filenames (the part before _scores.csv).

The algorithm names must match the names of the folders storing the score files.

savedir is the location where ParetoBench will write its results.

param_id_col is the name of the column that uniquely identifies each result. It lets you identify, after the run, which result is in which front.
In this example, I'm just using sequential numbering.
You can reuse the same numbering/values across different result files, but not within the same result file.
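
The naming rules above can be checked before running anything. The sketch below is a minimal illustration: expected_score_files and duplicated_ids are hypothetical helpers, not ParetoBench functions, and "data" is a placeholder directory.

```python
import os

# Hypothetical helpers for illustration; they are not part of ParetoBench.
def expected_score_files(datadir, algorithms, datasets):
    """List the score file paths implied by the naming rules."""
    return [
        os.path.join(datadir, algorithm, "{}_scores.csv".format(dataset))
        for algorithm in algorithms
        for dataset in datasets
    ]

def duplicated_ids(ids):
    """Return identifiers that occur more than once within one score file."""
    seen, dupes = set(), []
    for value in ids:
        if value in seen and value not in dupes:
            dupes.append(value)
        seen.add(value)
    return dupes

# "data" is a placeholder; substitute your own datadir.
paths = expected_score_files("data", ["chronoclust", "flowsom"], ["Levine13", "Levine32"])
for path in paths:
    print(path, "found" if os.path.isfile(path) else "missing")

# Identifiers must be unique within a file; here the value 2 is repeated.
print(duplicated_ids([0, 1, 2, 2, 3]))
```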

Now let's run it.

## Run ParetoBench

In [None]:
ParetoBench.compare(
    metrics = metrics,
    datasets = datasets,
    algorithms = algorithms,
    datadir = datadir,
    savedir = savedir,
    param_id_col = param_id_col
)

You won't see anything returned, which is normal: ParetoBench stores the results as CSV files in savedir.
Let's look at the results now.

## ParetoBench results

In [None]:
os.listdir(savedir)

We have 4 different types of csv files here:

1. front_positions_XXX.csv: the front position (raw and normalised) of each clustering solution. If XXX is all_datasets, the file is the concatenation of the front positions for each dataset. Note that this file matters if you are looking at the normalised front positions, as they are normalised based on ***all datasets***, ***not per dataset!***
2. ks_XXX.csv: comparison of the distributions of the normalised front positions using the KS test.
3. proportion_solutions_per_front_XXX.csv: the number of solutions (and thus the proportion) contributed by each algorithm to each front position.
4. summary_XXX.csv: handy single-value summaries showing the proportion of each algorithm's solutions residing on the Pareto front and in the top 10% and 33% of the fronts.

Let's look at each result in turn.

### front_positions_XXX.csv

In [None]:
dat = pd.read_csv("{}/front_positions_Levine13.csv".format(savedir))
dat.head()

Here you can see which front position each solution resides in, its parameter index, and its metric scores (along with the negated scores used to compute the fronts).
You can also see the normalised front position.
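
To give an intuition for front positions, here is a minimal sketch of non-dominated sorting on made-up scores. It is not ParetoBench's implementation: it treats higher scores as better directly, whereas the file above also stores negated scores, and numbering the Pareto front as position 1 is an assumption made for this example.

```python
def dominates(a, b):
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def front_positions(scores):
    """Assign 1 to the Pareto front, 2 to the next front, and so on."""
    remaining = set(range(len(scores)))
    positions = [None] * len(scores)
    front = 1
    while remaining:
        # Solutions not dominated by any other remaining solution form the current front.
        current = {
            i for i in remaining
            if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)
        }
        for i in current:
            positions[i] = front
        remaining -= current
        front += 1
    return positions

# Made-up (f1, ari) pairs for five clustering solutions.
scores = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.7), (0.6, 0.6), (0.9, 0.5)]
print(front_positions(scores))  # [1, 1, 2, 3, 2]
```

The first two solutions trade off against each other, so neither dominates the other and both sit on the Pareto front.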

### ks_XXX.csv

In [None]:
dat = pd.read_csv("{}/ks_Levine13.csv".format(savedir))
dat

This compares the distributions of the algorithms' normalised front positions.
You can see that, for this dataset, the difference is statistically significant.
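
For readers unfamiliar with the KS test, the sketch below computes the two-sample KS statistic, which is the largest gap between the two empirical cumulative distributions. It is a plain-Python illustration on made-up values, not the code ParetoBench runs, and it does not compute a p-value; a library routine such as scipy.stats.ks_2samp returns both.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of values less than or equal to x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    # The gap can only change at observed values, so checking those is enough.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

# Made-up normalised front positions for two algorithms.
algo_a = [0.0, 0.25, 0.5, 0.75]
algo_b = [0.25, 0.5, 0.75, 1.0]
print(ks_statistic(algo_a, algo_b))  # 0.25
```

A statistic near 0 means the two distributions look alike, while a value near 1 means they barely overlap.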

### proportion_solutions_per_front_XXX.csv

In [None]:
dat = pd.read_csv("{}/proportion_solutions_per_front_Levine13.csv".format(savedir))
dat.head()

This file gives the number of solutions in each front for each algorithm.
The proportion column expresses that count as a proportion of the total number of solutions contributed by the algorithm for that dataset.
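
The count and proportion described above can be reproduced by hand. The sketch below uses made-up (algorithm, front position) pairs and only illustrates the arithmetic; it is not how ParetoBench builds the file.

```python
from collections import Counter

# Made-up (algorithm, front position) pairs for one dataset.
solutions = [
    ("flowsom", 1), ("flowsom", 1), ("flowsom", 2), ("flowsom", 2),
    ("chronoclust", 1), ("chronoclust", 2), ("chronoclust", 2), ("chronoclust", 3),
]

counts = Counter(solutions)                                 # solutions per (algorithm, front)
totals = Counter(algorithm for algorithm, _ in solutions)   # solutions per algorithm

# Each row: algorithm, front, count, proportion of that algorithm's solutions.
rows = [
    (algorithm, front, count, count / totals[algorithm])
    for (algorithm, front), count in sorted(counts.items())
]
for row in rows:
    print(row)
```

Within each algorithm the proportions sum to 1, so algorithms that produced different numbers of solutions can still be compared.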

### summary_XXX.csv

In [None]:
dat = pd.read_csv("{}/summary_Levine13.csv".format(savedir))
dat

This file contains single summary values showing the proportion of solutions residing on the Pareto front and in the top 10% and 33% of the fronts.
These can be used to quickly infer which algorithm is superior.
The algorithm that contributes the most solutions to the Pareto front performed better, while the one with the higher proportion of solutions in the top 10% and/or 33% is less vulnerable to parameter variations.

The top 10% and 33% limits can be changed by passing the values you want (as an array of fractions, x/100) as the fronts_limits argument of the compare function.
For example, if you want 20% and 50% instead, pass [0.2, 0.5] as the fronts_limits argument.
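
The sketch below illustrates what such limits mean, under two assumptions that the text above does not confirm: that normalised front positions lie between 0 and 1 with smaller values closer to the Pareto front, and that "top 20%" means a normalised position of at most 0.2. The helper proportion_within and the positions are made up for illustration.

```python
# Hypothetical helper; the thresholding rule is an assumption, not ParetoBench's code.
def proportion_within(normalised_positions, limit):
    """Fraction of solutions whose normalised front position is at most `limit`."""
    within = sum(1 for position in normalised_positions if position <= limit)
    return within / len(normalised_positions)

# Made-up normalised front positions for one algorithm on one dataset.
positions = [0.0, 0.0, 0.1, 0.25, 0.4, 0.5, 0.75, 1.0]

fronts_limits = [0.2, 0.5]
for limit in fronts_limits:
    print(limit, proportion_within(positions, limit))
```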

## Interpretation

In the manuscript, in addition to extracting the number of solutions on the Pareto front for each dataset and algorithm, we draw CDF plots and swarm plots to interpret the results.

The plots in the manuscript are drawn using ggplot in R, simply because they look prettier.
You can use the same scripts if you want.
They're stored under the plots folder of the ParetoBench repository on GitHub.

You can of course draw them using Seaborn, but I don't like how they look.

Refer to the separate notebook on how to create those visualisations.