# Example usage for analysis pipeline

This notebook shows how to run several classification and resampling methods on different datasets, using both Python code and the command line interface (CLI). Comparing the results lets us determine the best approach for a specific use case. We first combine selected resampling methods, classifiers and metrics into a single pipeline, and then present statistical analyses of the results for selected combinations.

## 1. Python code

### Imports

In [15]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.metrics import geometric_mean_score
from pathlib import Path
from sklearn.metrics import accuracy_score
from tempfile import NamedTemporaryFile
from scikit_posthocs import posthoc_wilcoxon, posthoc_mannwhitney
import pandas as pd

from multi_imbalance.datasets.analysis import AnalysisPipeline, Config, Result
from multi_imbalance.datasets import load_datasets
from multi_imbalance.resampling.soup import SOUP
from multi_imbalance.resampling.spider import SPIDER3
from multi_imbalance.resampling.static_smote import StaticSMOTE
from multi_imbalance.resampling.global_cs import GlobalCS
from multi_imbalance.resampling.mdo import MDO

Load the datasets from the `data.tar.gz` archive and save them as CSV files

In [16]:
_ = load_datasets(save_to_csv=True)

### Prepare configuration for analysis pipeline

In this example we work with three datasets: cmc, new_ecoli, and cleveland. We apply two classifiers (Decision Tree, K-Nearest Neighbors) to these datasets, each with two different configurations. We use two resampling methods (GlobalCS, MDO) with default configurations, plus a special MDO configuration for the cmc dataset. These combinations are evaluated with two metrics: geometric mean score and accuracy score. Each combination is repeated 20 times, and the train_test_split method divides the data into train and test sets.

In [75]:
cwd = Path.cwd()

config = {
    "datasets": [
        cwd.parents[1] / "data" / "csv" / "cmc.csv",
        cwd.parents[1] / "data" / "csv" / "new_ecoli.csv",
        cwd.parents[1] / "data" / "csv" / "cleveland.csv",
    ],
    "classifiers": {
        DecisionTreeClassifier: [{"max_depth": 100}, {}],
        KNeighborsClassifier: [{"n_neighbors": 7}, {}],
    },
    "resampling_methods": {
        GlobalCS: {"default": {"shuffle": True}},
        MDO: {"default": {"k1_frac": 0.3, "maj_int_min": {"maj": [0, 1], "min": [2, 3, 4, 5]}}, "cmc": {"k1_frac": 0.5}},
    },
    "metrics": {geometric_mean_score: {"correction": 0.001}, accuracy_score: {}},
    "n_repeats": 20,
    "split_method": ["train_test", {"test_size": 0.3}],
}
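
To get a feel for the size of the experiment, the number of individual evaluations implied by this configuration can be counted. The sketch below is not part of `multi_imbalance`; it only mirrors the shape of the `config` dictionary with plain strings, and it assumes that every dataset is combined with every classifier configuration, every resampling method and every metric, once per repeat.

```python
from itertools import product

# Plain-string mirror of the configuration above (illustrative only).
datasets = ["cmc", "new_ecoli", "cleveland"]
classifier_configs = [
    ("DecisionTreeClassifier", {"max_depth": 100}),
    ("DecisionTreeClassifier", {}),
    ("KNeighborsClassifier", {"n_neighbors": 7}),
    ("KNeighborsClassifier", {}),
]
resampling_methods = ["GlobalCS", "MDO"]
metrics = ["geometric_mean_score", "accuracy_score"]
n_repeats = 20

# One entry per (dataset, classifier configuration, resampling method, metric).
combinations = list(product(datasets, classifier_configs, resampling_methods, metrics))
n_evaluations = len(combinations) * n_repeats
print(len(combinations), n_evaluations)
```

If the pipeline is also asked to train without resampling, each classifier configuration presumably gains one more variant, so the totals grow accordingly.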

Prepare temporary file for result

In [77]:
result_file = NamedTemporaryFile(suffix=".csv")
result_file.close()
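
The two lines above reserve a unique file name rather than an open file: with the default `delete=True`, closing a `NamedTemporaryFile` removes the file from disk, while its `name` attribute remains available as a path to write to later. A minimal sketch of that behaviour, using only the standard library:

```python
import os
from tempfile import NamedTemporaryFile

tmp = NamedTemporaryFile(suffix=".csv")
reserved_path = tmp.name
exists_while_open = os.path.exists(reserved_path)

tmp.close()  # with the default delete=True, closing removes the file
exists_after_close = os.path.exists(reserved_path)

# The name is free again; a file written there later is not managed by
# tempfile any more and has to be removed manually.
with open(reserved_path, "w") as f:
    f.write("metric_name,metric_value\n")
os.remove(reserved_path)
```

Because of this, the result file written by the pipeline is not cleaned up automatically when the notebook finishes.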

Create an AnalysisPipeline object and run the analysis, including a comparison of training without resampling to demonstrate the effectiveness of the resampling methods.

In [78]:
c = Config.from_dict(config)
pipeline = AnalysisPipeline(c)
pipeline.run_analysis(result_file.name, train_without_resampling=True)

Generate a summary for selected classifiers, metrics and datasets. If you are not sure which names to use, the corresponding pipeline property lists them, e.g. `pipeline.dataset_names`

In [79]:
pipeline.dataset_names, pipeline.clf_names

(['new_ecoli', 'cmc', 'cleveland'],
 ['decisiontreeclassifier', 'kneighborsclassifier'])

In [80]:
query_dict = {
    "classifier": ["decisiontreeclassifier", "kneighborsclassifier"],
    "metric_name": ["geometric_mean_score"],
    "dataset_name": ["cmc", "new_ecoli", "cleveland"],
}

summary_results = pipeline.generate_summary(
    query_dict, save_to_csv=False, csv_path=result_file.name, aggregate_func=[min], concat_results=False
)

There are two classifiers, one metric and three datasets, so the number of results is $2\cdot1\cdot3=6$

In [81]:
len(summary_results)

6
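
The count can be double-checked directly from the query. This sketch assumes, as stated above, that one result is produced per (classifier, metric, dataset) combination; it uses only the standard library.

```python
from itertools import product
from math import prod

query_dict = {
    "classifier": ["decisiontreeclassifier", "kneighborsclassifier"],
    "metric_name": ["geometric_mean_score"],
    "dataset_name": ["cmc", "new_ecoli", "cleveland"],
}

# Number of combinations = product of the lengths of the query lists.
expected_count = prod(len(values) for values in query_dict.values())

# The combinations themselves, in the key order of query_dict.
combinations = list(product(*query_dict.values()))
print(expected_count, combinations[0])
```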

#### First result

In [82]:
summary_results[0]

metric_name,classifier,dataset_name,resampling_method,clf_params,metric_value (mean),metric_value (std),metric_value (min)
geometric_mean_score,decisiontreeclassifier,cmc,Not defined,{'max_depth': 100},0.45151,0.021547,0.40201
geometric_mean_score,decisiontreeclassifier,cmc,Not defined,{},0.441469,0.029395,0.385481
geometric_mean_score,decisiontreeclassifier,cmc,globalcs,{'max_depth': 100},0.452274,0.020277,0.4174
geometric_mean_score,decisiontreeclassifier,cmc,globalcs,{},0.456366,0.018096,0.412834
geometric_mean_score,decisiontreeclassifier,cmc,mdo,{'max_depth': 100},0.443627,0.025639,0.395393
geometric_mean_score,decisiontreeclassifier,cmc,mdo,{},0.444472,0.020072,0.410953


Generate a posthoc analysis with the Wilcoxon and Mann-Whitney tests. The query dict must define the classifier names, dataset names and metric names.

In [83]:
query_dict = {
    "metric_name": ["geometric_mean_score"],
    "classifier": ["decisiontreeclassifier", "kneighborsclassifier"],
    "dataset_name": ["cmc", "new_ecoli"],
}

analysis_results, param_comb = AnalysisPipeline.generate_posthoc_analysis(
    query_dict, save_to_csv=False, csv_path=result_file.name, posthoc_func_list=[[posthoc_wilcoxon, {}], [posthoc_mannwhitney, {}]]
)

Look up the analysis for a specific test (Wilcoxon), metric, classifier and dataset by building the name of the analysis

In [84]:
func_name = posthoc_wilcoxon.__name__
metric_name = "geometric_mean_score"  # you can find it using pipeline.metric_names
clf_name = "kneighborsclassifier"  # you can find it using pipeline.clf_names
dataset_name = "cmc"  # you can find it using pipeline.dataset_names

analysis_name = "_".join([func_name, metric_name, clf_name, dataset_name])

In [85]:
analysis_result = pd.DataFrame(analysis_results[analysis_name])
analysis_result

Unnamed: 0,Unnamed: 1,p-value
globalcs_0,mdo_0,0.430433
globalcs_0,Not defined_0,0.009436
globalcs_0,globalcs_1,0.001209
globalcs_0,mdo_1,0.000134
globalcs_0,Not defined_1,4e-06
mdo_0,Not defined_0,0.388376
mdo_0,globalcs_1,0.004221
mdo_0,mdo_1,0.00486
mdo_0,Not defined_1,0.00021
Not defined_0,globalcs_1,0.044054


In [86]:
alpha = 0.05
analysis_result[analysis_result["p-value"] < alpha]

Unnamed: 0,Unnamed: 1,p-value
globalcs_0,Not defined_0,0.009436
globalcs_0,globalcs_1,0.001209
globalcs_0,mdo_1,0.000134
globalcs_0,Not defined_1,4e-06
mdo_0,globalcs_1,0.004221
mdo_0,mdo_1,0.00486
mdo_0,Not defined_1,0.00021
Not defined_0,globalcs_1,0.044054
Not defined_0,mdo_1,0.021484
Not defined_0,Not defined_1,0.000134


In [87]:
concat_summary_df = pipeline.generate_summary(query_dict, save_to_csv=False, csv_path=result_file.name, concat_results=True)

concat_summary_df.loc[metric_name, clf_name, dataset_name]

resampling_method,clf_params,metric_value (mean),metric_value (std)
Not defined,{'n_neighbors': 7},0.494294,0.017129
Not defined,{},0.465963,0.021218
globalcs,{'n_neighbors': 7},0.508187,0.016241
globalcs,{},0.476802,0.023789
mdo,{'n_neighbors': 7},0.500966,0.021062
mdo,{},0.481425,0.017707


## 2. CLI

You can also use the CLI to run the pipeline for analysis. To do this, you need to prepare JSON files that will contain configurations for the given functions.

### Imports

In [88]:
import json
from tempfile import NamedTemporaryFile, TemporaryDirectory
import pandas as pd
import os
from pathlib import Path

from multi_imbalance.datasets.helpers import read_summary_from_csv

First, we specify the path to the file that contains the definition of AnalysisPipeline

In [89]:
cwd = Path.cwd()

In [90]:
path_to_analysis_file = str(cwd.parents[1] / "multi_imbalance" / "datasets" / "analysis.py")

Now we print help with descriptions of options

In [91]:
!python $path_to_analysis_file --help

Usage: analysis.py [OPTIONS] OUTPUT_PATH

  This function helps to use pipeline analysis, summary and posthoc tests by
  CLI. Output path is path to result csv file from analysis pipeline.

Options:
  --run-analysis              Option specifying whether it should be run
                              analysis pipeline
  --summary                   Option specifying whether it should be run
                              summary
  --posthoc-analysis          Option specifying whether it should be run
                              posthoc analysis
  --config-json TEXT          Path to json file which contain config for
                              pipeline analysis
  --query-json TEXT           Path to json file which contain query dict for
                              generating summary
  --posthoc-query-json TEXT   Path to json file which contain query dict for
                              posthoc analysis
  --aggregate-json TEXT       Optional, path to json file which contain paths


Before running the file, you need to prepare the configuration files yourself.
The files shown below are similar to the dictionaries used directly in Python but differ in some details: dataset paths, classifiers, resampling methods and metrics are given as strings.

In [92]:
config = {
    "datasets": [
        str(cwd.parents[1] / "data" / "csv" / "cmc.csv"),
        str(cwd.parents[1] / "data" / "csv" / "new_ecoli.csv"),
        str(cwd.parents[1] / "data" / "csv" / "cleveland.csv"),
    ],
    "classifiers": {
        "sklearn.tree.DecisionTreeClassifier": [{"max_depth": 100}, {}],
        "sklearn.neighbors.KNeighborsClassifier": [{"n_neighbors": 7}, {}],
    },
    "resampling_methods": {
        "multi_imbalance.resampling.global_cs.GlobalCS": {"default": {"shuffle": True}},
        "multi_imbalance.resampling.mdo.MDO": {
            "default": {"k1_frac": 0.3, "maj_int_min": {"maj": [0, 1], "min": [2, 3, 4, 5]}},
            "cmc": {"k1_frac": 0.5},
        },
    },
    "metrics": {"imblearn.metrics.geometric_mean_score": {"correction": 0.001}, "sklearn.metrics.accuracy_score": {}},
    "n_repeats": 20,
    "split_method": ["train_test", {"test_size": 0.2}],
}

config_json = NamedTemporaryFile(suffix=".json")
config_json.close()
with open(config_json.name, "w") as f:
    json.dump(config, f)
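
Because JSON can only store strings, numbers, lists and dictionaries, classes and functions are referenced in the CLI configuration by their dotted import path. How `analysis.py` resolves these strings is not shown in this notebook; the helper below, `resolve_dotted_path`, is a hypothetical illustration of the usual approach with `importlib`, demonstrated on standard-library objects.

```python
from importlib import import_module


def resolve_dotted_path(path):
    """Return the object named by a dotted path such as 'package.module.Name'.

    Hypothetical helper for illustration; not part of multi_imbalance.
    """
    module_name, _, attr_name = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attribute', got {path!r}")
    return getattr(import_module(module_name), attr_name)


# Standard-library stand-ins for entries like "sklearn.tree.DecisionTreeClassifier".
ordered_dict_cls = resolve_dotted_path("collections.OrderedDict")
join_func = resolve_dotted_path("os.path.join")
print(ordered_dict_cls.__name__, callable(join_func))
```

The same idea would apply to the metric and aggregate-function paths, since a string such as `"numpy.min"` also splits into a module and an attribute name.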

Now we prepare path for result file

In [93]:
result_file = NamedTemporaryFile(suffix=".csv")
result_file.close()

Now we can run analysis pipeline

In [94]:
!python $path_to_analysis_file $result_file.name --run-analysis --config-json $config_json.name --train-without-resampling

Start
Run analysis pipeline
Done


  _warn_prf(average, modifier, msg_start, len(result))
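
Outside a notebook, where the `!python` shell escape is not available, the same call can be issued with `subprocess`. The sketch below only assembles the argument list from the flags used in the cell above; `build_run_command` is a hypothetical helper, and the file names passed to it are placeholders.

```python
import sys


def build_run_command(analysis_file, output_csv, config_json, train_without_resampling=True):
    """Assemble the argument list for the --run-analysis call shown above."""
    command = [
        sys.executable,  # the interpreter running this script
        analysis_file,
        output_csv,
        "--run-analysis",
        "--config-json",
        config_json,
    ]
    if train_without_resampling:
        command.append("--train-without-resampling")
    return command


command = build_run_command("analysis.py", "result.csv", "config.json")
print(command[1:])
# To execute it: import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.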


We can check the contents of the file that was created

In [95]:
pd.read_csv(result_file.name).head(6)

Unnamed: 0,metric_name,classifier,dataset_name,resampling_method,metric_value,no_repeat,clf_params
0,geometric_mean_score,decisiontreeclassifier,cmc,globalcs,0.439404,0,{'max_depth': 100}
1,accuracy_score,decisiontreeclassifier,cmc,globalcs,0.467797,0,{'max_depth': 100}
2,geometric_mean_score,decisiontreeclassifier,cmc,mdo,0.468248,0,{'max_depth': 100}
3,accuracy_score,decisiontreeclassifier,cmc,mdo,0.515254,0,{'max_depth': 100}
4,geometric_mean_score,decisiontreeclassifier,cmc,Not defined,0.422863,0,{'max_depth': 100}
5,accuracy_score,decisiontreeclassifier,cmc,Not defined,0.461017,0,{'max_depth': 100}


As before, once the results are available we can generate a summary for them. Again, we create a JSON file containing a query dictionary with the specific combinations of classifiers, datasets, etc.

In [96]:
query_dict = {
    "metric_name": ["geometric_mean_score"],
    "classifier": ["decisiontreeclassifier", "kneighborsclassifier"],
    "dataset_name": ["cmc", "new_ecoli", "cleveland"],
}

query_json = NamedTemporaryFile(suffix=".json")
query_json.close()
with open(query_json.name, "w") as f:
    json.dump(query_dict, f)

Now we will create an optional JSON file, which will contain a list of paths to aggregate functions

In [97]:
aggr_func_list = ["numpy.min"]

aggr_func_json = NamedTemporaryFile(suffix=".json")
aggr_func_json.close()
with open(aggr_func_json.name, "w") as f:
    json.dump(aggr_func_list, f)

We will also specify the destination path where the files generated from the summary will be located

In [98]:
temp_dir = TemporaryDirectory()

Now we can run summary

In [99]:
!python $path_to_analysis_file $result_file.name --summary --query-json $query_json.name --save-path $temp_dir.name --aggregate-json $aggr_func_json.name

Start
Run generate summary
Done


We can check to see if as many files have been generated as expected (i.e. 6)

In [100]:
csv_dir_files = os.listdir(temp_dir.name)

csv_dir_files, len(csv_dir_files)

(['geometric_mean_score_decisiontreeclassifier_cleveland.csv',
  'geometric_mean_score_decisiontreeclassifier_cmc.csv',
  'geometric_mean_score_decisiontreeclassifier_new_ecoli.csv',
  'geometric_mean_score_kneighborsclassifier_cleveland.csv',
  'geometric_mean_score_kneighborsclassifier_cmc.csv',
  'geometric_mean_score_kneighborsclassifier_new_ecoli.csv'],
 6)

We will open the same file that was shown previously as the first result

In [101]:
df = read_summary_from_csv(os.path.join(temp_dir.name, "geometric_mean_score_decisiontreeclassifier_cmc.csv"))
df

metric_name,clf_name,dataset_name,resampling_method,clf_params,metric_value (mean),metric_value (std),metric_value (amin)
geometric_mean_score,decisiontreeclassifier,cmc,Not defined,{'max_depth': 100},0.444781,0.0326,0.375016
geometric_mean_score,decisiontreeclassifier,cmc,Not defined,{},0.446082,0.032315,0.358658
geometric_mean_score,decisiontreeclassifier,cmc,globalcs,{'max_depth': 100},0.457594,0.030509,0.407711
geometric_mean_score,decisiontreeclassifier,cmc,globalcs,{},0.461998,0.035399,0.379794
geometric_mean_score,decisiontreeclassifier,cmc,mdo,{'max_depth': 100},0.443874,0.031524,0.380103
geometric_mean_score,decisiontreeclassifier,cmc,mdo,{},0.446591,0.028359,0.37773


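The resulting dataframe has essentially the same structure as in the Python example. The values (for example, the means) differ slightly, which is consistent with the randomly drawn train/test splits and with this run using a `test_size` of 0.2 instead of 0.3.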

The last thing to do is to perform statistical tests. As before, you will need a JSON file containing the query and a second JSON file containing a dictionary with the paths to the functions and their possible parameters.

In [102]:
query_dict = {
    "metric_name": ["geometric_mean_score"],
    "classifier": ["decisiontreeclassifier", "kneighborsclassifier"],
    "dataset_name": ["cmc", "new_ecoli"],
}


query_json = NamedTemporaryFile(suffix=".json")
query_json.close()
with open(query_json.name, "w") as f:
    json.dump(query_dict, f)

In [103]:
posthoc_func_dict = {"scikit_posthocs.posthoc_wilcoxon": {}, "scikit_posthocs.posthoc_mannwhitney": {}}

posthoc_func_json = NamedTemporaryFile(suffix=".json")
posthoc_func_json.close()
with open(posthoc_func_json.name, "w") as f:
    json.dump(posthoc_func_dict, f)

In [104]:
temp_dir = TemporaryDirectory()

In [105]:
!python $path_to_analysis_file $result_file.name --posthoc-analysis --posthoc-query-json $query_json.name --save-path $temp_dir.name --posthoc-func-json $posthoc_func_json.name

Start
Run generate posthoc analysis
Done


Again, let's check whether as many files as expected have been obtained (this time 8: two tests, one metric, two classifiers and two datasets)

In [106]:
csv_dir_files = os.listdir(temp_dir.name)

csv_dir_files, len(csv_dir_files)

(['posthoc_mannwhitney_geometric_mean_score_decisiontreeclassifier_cmc.csv',
  'posthoc_mannwhitney_geometric_mean_score_decisiontreeclassifier_new_ecoli.csv',
  'posthoc_mannwhitney_geometric_mean_score_kneighborsclassifier_cmc.csv',
  'posthoc_mannwhitney_geometric_mean_score_kneighborsclassifier_new_ecoli.csv',
  'posthoc_wilcoxon_geometric_mean_score_decisiontreeclassifier_cmc.csv',
  'posthoc_wilcoxon_geometric_mean_score_decisiontreeclassifier_new_ecoli.csv',
  'posthoc_wilcoxon_geometric_mean_score_kneighborsclassifier_cmc.csv',
  'posthoc_wilcoxon_geometric_mean_score_kneighborsclassifier_new_ecoli.csv'],
 8)
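
The file names follow a regular pattern: test function name, metric, classifier and dataset joined with underscores. Assuming that pattern (it is inferred from the listing above, not taken from the library's documentation), the expected names can be generated from the two JSON inputs and compared with the directory listing.

```python
from itertools import product

posthoc_func_paths = ["scikit_posthocs.posthoc_wilcoxon", "scikit_posthocs.posthoc_mannwhitney"]
query_dict = {
    "metric_name": ["geometric_mean_score"],
    "classifier": ["decisiontreeclassifier", "kneighborsclassifier"],
    "dataset_name": ["cmc", "new_ecoli"],
}

# Keep only the function name, e.g. "posthoc_wilcoxon".
func_names = [path.rsplit(".", 1)[-1] for path in posthoc_func_paths]

expected_files = sorted(
    "_".join(parts) + ".csv"
    for parts in product(
        func_names,
        query_dict["metric_name"],
        query_dict["classifier"],
        query_dict["dataset_name"],
    )
)
print(len(expected_files))
```

In the notebook, `set(expected_files) == set(os.listdir(temp_dir.name))` would then confirm that nothing is missing or unexpected.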

Now open the same file as before

In [113]:
func_name = posthoc_wilcoxon.__name__
metric_name = "geometric_mean_score"
clf_name = "kneighborsclassifier"
dataset_name = "cmc"

analysis_name_file = "_".join([func_name, metric_name, clf_name, dataset_name]) + ".csv"

In [114]:
alpha = 0.05
analysis_result = pd.read_csv(os.path.join(temp_dir.name, analysis_name_file), index_col=[0, 1])

analysis_result

Unnamed: 0,Unnamed: 1,p-value
globalcs_0,mdo_0,0.261099
globalcs_0,Not defined_0,0.089695
globalcs_0,globalcs_1,0.000851
globalcs_0,mdo_1,4.8e-05
globalcs_0,Not defined_1,0.000168
mdo_0,Not defined_0,0.388376
mdo_0,globalcs_1,0.000586
mdo_0,mdo_1,0.00486
mdo_0,Not defined_1,0.000851
Not defined_0,globalcs_1,0.000708


In [115]:
analysis_result[analysis_result["p-value"] < alpha]

Unnamed: 0,Unnamed: 1,p-value
globalcs_0,globalcs_1,0.000851
globalcs_0,mdo_1,4.8e-05
globalcs_0,Not defined_1,0.000168
mdo_0,globalcs_1,0.000586
mdo_0,mdo_1,0.00486
mdo_0,Not defined_1,0.000851
Not defined_0,globalcs_1,0.000708
Not defined_0,mdo_1,0.004221
Not defined_0,Not defined_1,0.000105


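In both examples (Python and CLI), for the cmc dataset and the KNN classifier, some of the compared variants (resampling methods, or no resampling) differ significantly at `alpha = 0.05`, e.g. `globalcs_0` and `mdo_1` (p < 0.001 in both runs). Not every pair does, however: `globalcs_0` and `mdo_0` do not differ significantly in either run (p = 0.43 and p = 0.26).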
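
The agreement between the two runs can be checked by comparing which pairs fall below `alpha` in each. The sketch below copies the p-values of the `globalcs_0` rows from the two Wilcoxon tables above (these five rows are fully visible in both outputs) and uses plain Python sets; it is an illustration, not part of the pipeline.

```python
alpha = 0.05

# p-values of globalcs_0 against each other variant, copied from the tables above.
python_run = {
    "mdo_0": 0.430433,
    "Not defined_0": 0.009436,
    "globalcs_1": 0.001209,
    "mdo_1": 0.000134,
    "Not defined_1": 4e-06,
}
cli_run = {
    "mdo_0": 0.261099,
    "Not defined_0": 0.089695,
    "globalcs_1": 0.000851,
    "mdo_1": 4.8e-05,
    "Not defined_1": 0.000168,
}

significant_python = {name for name, p in python_run.items() if p < alpha}
significant_cli = {name for name, p in cli_run.items() if p < alpha}

significant_in_both = sorted(significant_python & significant_cli)
significant_in_one_run = sorted(significant_python ^ significant_cli)
significant_in_neither = sorted(set(python_run) - significant_python - significant_cli)
print(significant_in_both, significant_in_one_run, significant_in_neither)
```

For these rows, three comparisons are significant in both runs, one only in the Python run, and one in neither.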