# Example usage for analysis pipeline

In this notebook, we explore the functionality for running various classification and resampling methods on different datasets, using both Python code and the command line interface (CLI). This lets us compare the results and efficiency of these methods and determine the best approach for a specific use case.

## 1. Python code

### Imports

In [17]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.metrics import geometric_mean_score
from pathlib import Path
from sklearn.metrics import accuracy_score
from tempfile import NamedTemporaryFile
from scikit_posthocs import posthoc_wilcoxon

from multi_imbalance.datasets.analysis import AnalysisPipeline, Config, Result
from multi_imbalance.datasets import load_datasets
from multi_imbalance.resampling.soup import SOUP
from multi_imbalance.resampling.spider import SPIDER3
from multi_imbalance.resampling.static_smote import StaticSMOTE
from multi_imbalance.resampling.global_cs import GlobalCS
from multi_imbalance.resampling.mdo import MDO

Load the datasets from the `data.tar.gz` file and save them as CSV files

In [3]:
_ = load_datasets(save_to_csv=True)

### Prepare configuration for analysis pipeline

In this example we work with three datasets: glass, new_ecoli, and dermatology. We apply two classifiers (Decision Tree, K-Nearest Neighbors), each with two configurations, to these datasets. We use three resampling methods (GlobalCS, MDO, SOUP) with default configurations, plus a special MDO configuration for the glass dataset. The combinations are evaluated with two metrics: geometric mean score and accuracy score. Each combination is repeated ten times (`n_repeats`), and the data are divided into train and test sets with train_test_split (the `"train_test"` split method).
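
To get a feel for the size of this experiment, the grid described above can be enumerated by hand. The following is a minimal sketch, assuming that every dataset, classifier configuration, resampling option, and metric is combined with every other and evaluated once per repeat. The plain-string names are illustrative stand-ins for the classes used in the configuration, and the extra `"no resampling"` entry models the `train_without_resampling=True` option used when the analysis is run in this notebook.

```python
from itertools import product

# Illustrative stand-ins for the entries of the configuration dictionary.
datasets = ["glass", "new_ecoli", "dermatology"]
classifier_configs = [
    ("DecisionTreeClassifier", {"max_depth": 100}),
    ("DecisionTreeClassifier", {}),
    ("KNeighborsClassifier", {"n_neighbors": 7}),
    ("KNeighborsClassifier", {}),
]
# Assumption: the baseline without resampling counts as a fourth option.
resampling_options = ["GlobalCS", "MDO", "SOUP", "no resampling"]
metrics = ["geometric_mean_score", "accuracy_score"]
n_repeats = 10  # value of "n_repeats" in the configuration

grid = list(product(datasets, classifier_configs, resampling_options, metrics))
n_combinations = len(grid)                  # 3 * 4 * 4 * 2
n_evaluations = n_combinations * n_repeats  # one metric value per repeat

print(n_combinations, n_evaluations)
```

How the pipeline actually stores these values is not shown here; the sketch only illustrates how quickly the number of evaluations grows with each added option.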

In [18]:
cwd = Path.cwd()
csv_dir = cwd.parents[1] / "data" / "csv"  # directory holding the dataset CSV files

config = {
    "datasets": [csv_dir / "glass.csv", csv_dir / "new_ecoli.csv", csv_dir / "dermatology.csv"],
    "classifiers": {
        DecisionTreeClassifier: [{"max_depth": 100}, {}],
        KNeighborsClassifier: [{"n_neighbors": 7}, {}],
    },
    "resampling_methods": {
        GlobalCS: {"default": {"shuffle": True}},
        MDO: {
            "default": {"k1_frac": 0.3, "maj_int_min": {"maj": [0, 1], "min": [2, 3, 4, 5]}},
            "glass": {"k1_frac": 0.5},  # dataset-specific override
        },
        SOUP: {"default": {"shuffle": True}},
    },
    "metrics": {geometric_mean_score: {"correction": 0.001}, accuracy_score: {}},
    "n_repeats": 10,
    "split_method": ["train_test", {}],
}
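
The `correction` value passed to `geometric_mean_score` matters on multi-class data, where a single class with zero recall would otherwise force the score to zero. The following is a minimal sketch of that idea, assuming the multi-class G-mean is the n-th root of the product of per-class recalls and that `correction` replaces a zero recall. The helper `g_mean` and the recall values are hypothetical and written only for illustration; this is not the imbalanced-learn implementation.

```python
from math import prod


def g_mean(recalls, correction=0.0):
    """Geometric mean of per-class recalls (hypothetical helper).

    Assumption: a recall of zero is replaced by `correction` before
    the product is taken.
    """
    adjusted = [r if r > 0 else correction for r in recalls]
    return prod(adjusted) ** (1 / len(adjusted))


# Invented per-class recalls: the third class is never recognised.
recalls = [0.9, 0.8, 0.0]

plain = g_mean(recalls)                        # collapses to zero
corrected = g_mean(recalls, correction=0.001)  # small but non-zero

print(plain, corrected)
```

Under these assumptions, a small correction keeps the score informative about the remaining classes instead of reporting zero for every run that misses one class.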

Prepare a temporary file path for the results

In [19]:
# Only the unique file name is needed here; it is passed to the pipeline below.
result_file = NamedTemporaryFile(suffix=".csv")
result_file.close()
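
The cell above uses `NamedTemporaryFile` to obtain a unique path rather than to keep a file open. With the default `delete=True`, the file is removed as soon as it is closed, so only its name survives. A small standard-library sketch of that behaviour (the variable names are illustrative):

```python
from pathlib import Path
from tempfile import NamedTemporaryFile

tmp = NamedTemporaryFile(suffix=".csv")  # delete=True by default
existed_while_open = Path(tmp.name).exists()
tmp.close()  # closing removes the file from disk
exists_after_close = Path(tmp.name).exists()

# The name is still available and can be reused as an output path.
print(tmp.name.endswith(".csv"), existed_while_open, exists_after_close)
```

Because the file no longer exists after `close()`, whatever writes to that path later creates it afresh and is responsible for cleaning it up.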

Create an AnalysisPipeline object and run the analysis. Setting `train_without_resampling=True` also trains each classifier on the data without resampling, which gives a baseline for judging the effectiveness of the resampling methods.

In [20]:
c = Config.from_dict(config)
pipeline = AnalysisPipeline(c)
pipeline.run_analysis(result_file.name, train_without_resampling=True)

  _warn_prf(average, modifier, msg_start, len(result))


Generate a summary for the selected classifiers, metrics, and datasets. If you are unsure which names to use, the pipeline exposes them as properties, e.g. `dataset_names` and `clf_names`.

In [21]:
pipeline.dataset_names, pipeline.clf_names

(['dermatology', 'glass', 'new_ecoli'],
 ['kneighborsclassifier', 'decisiontreeclassifier'])

In [29]:
query_dict = {"classifier": ["decisiontreeclassifier", "kneighborsclassifier"], "metric_name": ["geometric_mean_score"], "dataset_name": ["glass", "new_ecoli"]}

results = pipeline.generate_summary(query_dict, save_to_csv=False, csv_path=result_file.name, aggregate_func=[min])

There are two classifiers, one metric, and two datasets, so the length of results will be $2\cdot1\cdot2=4$.

In [30]:
len(results)

4
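
The count can be checked by enumerating the Cartesian product of the query values. This sketch is independent of `AnalysisPipeline`. The order it prints matches the four results shown in this notebook, but that the pipeline always orders its output this way is an assumption.

```python
from itertools import product

query_dict = {
    "classifier": ["decisiontreeclassifier", "kneighborsclassifier"],
    "metric_name": ["geometric_mean_score"],
    "dataset_name": ["glass", "new_ecoli"],
}

# One summary is expected per (classifier, metric, dataset) combination.
combinations = list(
    product(
        query_dict["classifier"],
        query_dict["metric_name"],
        query_dict["dataset_name"],
    )
)

for index, combination in enumerate(combinations):
    print(index, combination)
```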

#### First result

In [31]:
results[0]

metric_name,classifier,dataset_name,resampling_method,clf_params,metric_value_mean,metric_value_std,metric_value_min
geometric_mean_score,decisiontreeclassifier,glass,Not defined,{'max_depth': 100},0.356512,0.222512,0.177231
geometric_mean_score,decisiontreeclassifier,glass,Not defined,{},0.566907,0.185909,0.225857
geometric_mean_score,decisiontreeclassifier,glass,globalcs,{'max_depth': 100},0.477872,0.217623,0.179135
geometric_mean_score,decisiontreeclassifier,glass,globalcs,{},0.447095,0.237363,0.057638
geometric_mean_score,decisiontreeclassifier,glass,mdo,{'max_depth': 100},0.589195,0.170955,0.205291
geometric_mean_score,decisiontreeclassifier,glass,mdo,{},0.505958,0.23995,0.068824
geometric_mean_score,decisiontreeclassifier,glass,soup,{'max_depth': 100},0.613124,0.218905,0.204827
geometric_mean_score,decisiontreeclassifier,glass,soup,{},0.582646,0.198582,0.20906
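
Each row of the summary aggregates the metric values of the repeated runs for one combination: the `mean` and `std` columns appear alongside the requested `min`. Below is a rough sketch of that aggregation for a single row, using made-up scores (not values from this run) and only the standard library. Whether the pipeline uses the sample standard deviation, as pandas does by default, is an assumption.

```python
from statistics import mean, stdev

# Hypothetical metric values from five repeats of one combination
# (illustrative only, not taken from the results of this notebook).
scores = [0.60, 0.55, 0.70, 0.65, 0.50]

summary_row = {
    "mean": mean(scores),
    "std": stdev(scores),  # sample standard deviation (ddof=1), an assumption
    "min": min(scores),    # the extra aggregate passed via aggregate_func
}

print(summary_row)
```

Reading `min` next to `mean` and `std` helps spot combinations whose average looks acceptable but whose worst run is poor.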


#### Second result

In [27]:
results[1]

metric_name,classifier,dataset_name,resampling_method,clf_params,metric_value_mean,metric_value_std,metric_value_min
geometric_mean_score,decisiontreeclassifier,new_ecoli,Not defined,{'max_depth': 100},0.749784,0.066426,0.602696
geometric_mean_score,decisiontreeclassifier,new_ecoli,Not defined,{},0.698934,0.052891,0.620988
geometric_mean_score,decisiontreeclassifier,new_ecoli,globalcs,{'max_depth': 100},0.71799,0.034846,0.6702
geometric_mean_score,decisiontreeclassifier,new_ecoli,globalcs,{},0.700889,0.080018,0.565038
geometric_mean_score,decisiontreeclassifier,new_ecoli,mdo,{'max_depth': 100},0.767789,0.045542,0.713386
geometric_mean_score,decisiontreeclassifier,new_ecoli,mdo,{},0.771318,0.048929,0.703493
geometric_mean_score,decisiontreeclassifier,new_ecoli,soup,{'max_depth': 100},0.738233,0.053773,0.649839
geometric_mean_score,decisiontreeclassifier,new_ecoli,soup,{},0.7146,0.063548,0.583691


#### Third result

In [33]:
results[2]

metric_name,classifier,dataset_name,resampling_method,clf_params,metric_value_mean,metric_value_std,metric_value_min
geometric_mean_score,kneighborsclassifier,glass,Not defined,{'n_neighbors': 7},0.204697,0.191663,0.068485
geometric_mean_score,kneighborsclassifier,glass,Not defined,{},0.171839,0.126395,0.075119
geometric_mean_score,kneighborsclassifier,glass,globalcs,{'n_neighbors': 7},0.543043,0.177286,0.202031
geometric_mean_score,kneighborsclassifier,glass,globalcs,{},0.672387,0.08829,0.536591
geometric_mean_score,kneighborsclassifier,glass,mdo,{'n_neighbors': 7},0.387075,0.20691,0.07894
geometric_mean_score,kneighborsclassifier,glass,mdo,{},0.14561,0.061634,0.074212
geometric_mean_score,kneighborsclassifier,glass,soup,{'n_neighbors': 7},0.585421,0.154868,0.206894
geometric_mean_score,kneighborsclassifier,glass,soup,{},0.354566,0.2069,0.153585


#### Fourth result

In [34]:
results[3]

metric_name,classifier,dataset_name,resampling_method,clf_params,metric_value_mean,metric_value_std,metric_value_min
geometric_mean_score,kneighborsclassifier,new_ecoli,Not defined,{'n_neighbors': 7},0.848028,0.039082,0.805196
geometric_mean_score,kneighborsclassifier,new_ecoli,Not defined,{},0.829293,0.025965,0.777511
geometric_mean_score,kneighborsclassifier,new_ecoli,globalcs,{'n_neighbors': 7},0.779214,0.03268,0.736093
geometric_mean_score,kneighborsclassifier,new_ecoli,globalcs,{},0.789073,0.063314,0.72463
geometric_mean_score,kneighborsclassifier,new_ecoli,mdo,{'n_neighbors': 7},0.784751,0.040992,0.722135
geometric_mean_score,kneighborsclassifier,new_ecoli,mdo,{},0.824005,0.041023,0.755261
geometric_mean_score,kneighborsclassifier,new_ecoli,soup,{'n_neighbors': 7},0.839763,0.034917,0.789633
geometric_mean_score,kneighborsclassifier,new_ecoli,soup,{},0.814969,0.059139,0.710178


Generate a post-hoc analysis using the Wilcoxon test. The query dict must specify the classifier names, dataset names, and metric names.

In [36]:
query_dict = {"classifier": ["decisiontreeclassifier", "kneighborsclassifier"], "metric_name": ["geometric_mean_score"], "dataset_name": ["glass", "new_ecoli"]}

results = AnalysisPipeline.generate_posthoc_analysis(query_dict, save_to_csv=False, csv_path=result_file.name, posthoc_func_list=[[posthoc_wilcoxon, {}]])

In [37]:
results[0]

posthoc_wilcoxon_decisiontreeclassifier_geometric_mean_score_glass,globalcs,mdo,soup,Not defined
globalcs,1.0,0.245487,0.058258,0.898317
mdo,0.245487,1.0,0.621513,0.329983
soup,0.058258,0.621513,1.0,0.044054
Not defined,0.898317,0.329983,0.044054,1.0
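
The matrix above holds pairwise p-values, so each off-diagonal entry compares two resampling options. One way to read it is to list the pairs below a significance threshold. The sketch below copies the values from the table above; the threshold `alpha = 0.05` is a conventional choice assumed here, not something set by the pipeline, and no correction for multiple comparisons is applied.

```python
labels = ["globalcs", "mdo", "soup", "Not defined"]

# Pairwise p-values copied from the table above (symmetric matrix).
p_values = [
    [1.0, 0.245487, 0.058258, 0.898317],
    [0.245487, 1.0, 0.621513, 0.329983],
    [0.058258, 0.621513, 1.0, 0.044054],
    [0.898317, 0.329983, 0.044054, 1.0],
]

alpha = 0.05  # assumed significance threshold

# Visit only the upper triangle so that each pair is reported once.
significant_pairs = [
    (labels[i], labels[j], p_values[i][j])
    for i in range(len(labels))
    for j in range(i + 1, len(labels))
    if p_values[i][j] < alpha
]

print(significant_pairs)
```

With this threshold, only the comparison between `soup` and the run without resampling (`Not defined`) falls below 0.05 for the decision tree on the glass dataset.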
