# 1 Introduction

This notebook follows the experimental procedure described in the paper [Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning](http://www.sciencedirect.com/science/article/pii/S0957417417302324).

# 2 Imports

In [1]:
from os.path import expanduser
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imbtools.evaluation import BinaryExperiment

# 3 Experiment

#### 3.1 Configure experiment

In [2]:
classifiers = [LogisticRegression(), GradientBoostingClassifier()]
oversampling_methods = [None, RandomOverSampler(), SMOTE(), SMOTE(kind='borderline1'), ADASYN()]
param_grids = [None, {'max_depth':[2, 3, 5, 8], 'n_estimators':[10, 50, 80, 100]}]

In [3]:
experiment = BinaryExperiment(expanduser('~/Main/data/somo/'), classifiers, oversampling_methods)

#### 3.2 Run experiment

In [4]:
experiment.run(logging_results=False)


#### 3.3 Datasets summary

In [5]:
experiment.datasets_summary_

| | Dataset name | # of features | # of instances | # of minority instances | # of majority instances | Imbalanced Ratio |
|---|---|---|---|---|---|---|
| 0 | breast_tissue | 9 | 106 | 36 | 70 | 1.94 |
| 1 | ecoli | 7 | 336 | 52 | 284 | 5.46 |
| 2 | eucalyptus | 8 | 642 | 98 | 544 | 5.55 |
| 3 | glass | 9 | 214 | 70 | 144 | 2.06 |
| 4 | haberman | 3 | 306 | 81 | 225 | 2.78 |
| 5 | heart | 13 | 270 | 120 | 150 | 1.25 |
| 6 | iris | 4 | 150 | 50 | 100 | 2.0 |
| 7 | libra | 90 | 360 | 72 | 288 | 4.0 |
| 8 | liver_disorders | 6 | 345 | 145 | 200 | 1.38 |
| 9 | pima | 8 | 768 | 268 | 500 | 1.87 |
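
The last column follows directly from the two class counts. The snippet below is a minimal sketch, assuming the imbalance ratio is the majority count divided by the minority count, which agrees with the rows above. The function name `imbalance_ratio` is hypothetical and is not part of `imbtools`.

```python
def imbalance_ratio(n_minority, n_majority):
    """Majority-to-minority ratio; 1.0 means perfectly balanced classes."""
    if n_minority <= 0:
        raise ValueError("n_minority must be positive")
    return n_majority / n_minority


# (minority, majority) counts copied from three rows of the summary table.
counts = {"breast_tissue": (36, 70), "ecoli": (52, 284), "heart": (120, 150)}

ratios = {
    name: round(imbalance_ratio(minority, majority), 2)
    for name, (minority, majority) in counts.items()
}
print(ratios)  # {'breast_tissue': 1.94, 'ecoli': 5.46, 'heart': 1.25}
```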


#### 3.4 Mean CV results

In [6]:
experiment.mean_cv_results_

| | Dataset | Classifier | Oversampling method | Metric | Mean CV score |
|---|---|---|---|---|---|
| 0 | breast_tissue | GradientBoostingClassifier | ADASYN | f1 score | 0.694370 |
| 1 | breast_tissue | GradientBoostingClassifier | ADASYN | geometric mean score | 0.755950 |
| 2 | breast_tissue | GradientBoostingClassifier | ADASYN | roc auc score | 0.881149 |
| 3 | breast_tissue | GradientBoostingClassifier | None | f1 score | 0.671965 |
| 4 | breast_tissue | GradientBoostingClassifier | None | geometric mean score | 0.731175 |
| 5 | breast_tissue | GradientBoostingClassifier | None | roc auc score | 0.861061 |
| 6 | breast_tissue | GradientBoostingClassifier | RandomOverSampler | f1 score | 0.660953 |
| 7 | breast_tissue | GradientBoostingClassifier | RandomOverSampler | geometric mean score | 0.739070 |
| 8 | breast_tissue | GradientBoostingClassifier | RandomOverSampler | roc auc score | 0.865972 |
| 9 | breast_tissue | GradientBoostingClassifier | SMOTE | f1 score | 0.686341 |
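
Two of the metric names in this table can be made concrete with a small calculation. The sketch below assumes the usual binary definitions with the minority class treated as positive: the geometric mean score as the square root of sensitivity times specificity, and the F1 score as the harmonic mean of precision and recall. The confusion-matrix counts are made up for illustration, and the helper names `g_mean` and `f1` are hypothetical. The library counterparts are `imblearn.metrics.geometric_mean_score` and `sklearn.metrics.f1_score`.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


def f1(tp, fn, fp):
    """Harmonic mean of precision and recall for the positive class."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical counts: 40 minority (positive) and 100 majority (negative) samples.
tp, fn, tn, fp = 30, 10, 80, 20

g_mean_value = g_mean(tp, fn, tn, fp)  # sqrt(0.75 * 0.80)
f1_value = f1(tp, fn, fp)              # 60 / 90
print(round(g_mean_value, 4), round(f1_value, 4))
```

Unlike plain accuracy, both scores drop sharply when the minority class is poorly recognized, which is why they are common choices for imbalanced problems.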


#### 3.5 Standard deviation CV results

In [7]:
experiment.std_cv_results_

| | Dataset | Classifier | Oversampling method | Metric | Std CV score |
|---|---|---|---|---|---|
| 0 | breast_tissue | GradientBoostingClassifier | ADASYN | f1 score | 0.035976 |
| 1 | breast_tissue | GradientBoostingClassifier | ADASYN | geometric mean score | 0.039353 |
| 2 | breast_tissue | GradientBoostingClassifier | ADASYN | roc auc score | 0.028755 |
| 3 | breast_tissue | GradientBoostingClassifier | None | f1 score | 0.022996 |
| 4 | breast_tissue | GradientBoostingClassifier | None | geometric mean score | 0.044824 |
| 5 | breast_tissue | GradientBoostingClassifier | None | roc auc score | 0.017276 |
| 6 | breast_tissue | GradientBoostingClassifier | RandomOverSampler | f1 score | 0.028659 |
| 7 | breast_tissue | GradientBoostingClassifier | RandomOverSampler | geometric mean score | 0.029337 |
| 8 | breast_tissue | GradientBoostingClassifier | RandomOverSampler | roc auc score | 0.010915 |
| 9 | breast_tissue | GradientBoostingClassifier | SMOTE | f1 score | 0.030207 |


#### 3.6 Oversampling methods mean ranking

In [8]:
experiment.mean_ranking_results_

| Classifier | Metric | ADASYN | None | RandomOverSampler | SMOTE | SMOTE2 |
|---|---|---|---|---|---|---|
| GradientBoostingClassifier | f1 score | 2.69 | 3.69 | 3.31 | 2.38 | 2.92 |
| GradientBoostingClassifier | geometric mean score | 3.0 | 4.15 | 2.85 | 2.46 | 2.54 |
| GradientBoostingClassifier | roc auc score | 3.08 | 2.62 | 2.85 | 3.38 | 3.08 |
| LogisticRegression | f1 score | 3.46 | 4.15 | 2.77 | 2.31 | 2.31 |
| LogisticRegression | geometric mean score | 3.46 | 4.69 | 2.62 | 1.92 | 2.31 |
| LogisticRegression | roc auc score | 3.38 | 2.77 | 2.77 | 2.23 | 3.85 |
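
A mean ranking of this kind can be sketched as follows. This is an illustration under stated assumptions, not the `imbtools` implementation: within each dataset the method with the highest score is assumed to receive rank 1, tied methods share the average of their ranks, and the ranks are then averaged over datasets. The scores and the helper `rank_descending` are hypothetical.

```python
from statistics import mean


def rank_descending(scores):
    """Rank scores so the highest gets rank 1; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of scores tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = average_rank
        pos = end + 1
    return ranks


methods = ["None", "RandomOverSampler", "SMOTE"]
# Hypothetical mean CV scores: one row per dataset, one column per method.
score_rows = [
    [0.60, 0.70, 0.80],
    [0.75, 0.70, 0.75],  # tie between the first and third method
    [0.50, 0.65, 0.60],
]

rank_rows = [rank_descending(row) for row in score_rows]
mean_ranks = {m: mean(col) for m, col in zip(methods, zip(*rank_rows))}
print(mean_ranks)  # {'None': 2.5, 'RandomOverSampler': 2.0, 'SMOTE': 1.5}
```

Under this convention a lower mean rank indicates a method that scored better across datasets.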


#### 3.7 Friedman test

In [9]:
experiment.friedman_test_results_

| Classifier | Metric | p-value |
|---|---|---|
| GradientBoostingClassifier | f1 score | 0.241765 |
| GradientBoostingClassifier | geometric mean score | 0.046532 |
| GradientBoostingClassifier | roc auc score | 0.786522 |
| LogisticRegression | f1 score | 0.009932 |
| LogisticRegression | geometric mean score | 4.4e-05 |
| LogisticRegression | roc auc score | 0.087172 |
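
The Friedman test asks whether the oversampling methods differ in their per-dataset ranks. Its null hypothesis is that all methods perform equally, so a small p-value suggests that at least one method ranks differently. The sketch below shows the textbook statistic on hypothetical scores. It assumes no ties (no tie correction is applied) and uses a closed-form chi-square survival function that is valid only for an even number of degrees of freedom. How `imbtools` computes its p-values is not shown in this notebook; `scipy.stats.friedmanchisquare` is a maintained implementation.

```python
import math


def friedman_statistic(score_rows):
    """Friedman chi-square statistic; rows are datasets, columns are methods.

    Assumes no tied scores within a row (no tie correction is applied).
    """
    n = len(score_rows)     # number of datasets
    k = len(score_rows[0])  # number of methods
    rank_sums = [0.0] * k
    for row in score_rows:
        order = sorted(range(k), key=lambda j: -row[j])  # best score first
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    sum_of_squares = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_of_squares - 3.0 * n * (k + 1)


def chi2_sf_even_df(x, df):
    """Chi-square survival function, closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Hypothetical scores for 4 datasets and 3 methods.
score_rows = [
    [0.60, 0.70, 0.80],
    [0.55, 0.65, 0.75],
    [0.50, 0.72, 0.68],
    [0.62, 0.71, 0.79],
]

statistic = friedman_statistic(score_rows)
p_value = chi2_sf_even_df(statistic, df=len(score_rows[0]) - 1)
print(statistic, round(p_value, 4))  # 6.5 0.0388
```

With so few datasets the chi-square approximation is rough, so the toy p-value is only illustrative.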
