We need to reproduce, in part, the study by Jain et al. "Dynamic selection of normalization techniques using data complexity measures" in order to compare its performance with the Meta-scaler. Their code is not publicly available and the authors did not respond to our requests via e-mail. Our only guide will be the information in the paper.

What we know:
- Their set of meta-features:
F1, F2, F3, N2, N3, T1, T2, D1 (density), N4, L1, L2, L3.

These are all available in the PyMFE library.

- Their ST selection:
MinMaxScaler and StandardScaler

- Their method for constructing the meta-dataset:
1. Scale all datasets with both STs (resulting in two versions of each DS).
2. Evaluate the performance of a Gaussian Kernel ELM on both versions of the DS (with 5-fold CV) and define the label of each instance (dataset) as the name of the ST with maximum performance.
Notice that there is no 'NS' (non-scaled) label, hence the trained system will always recommend scaling the data.
3. Merge the label for each DS with the 12 extracted meta-features for the same DS.

- The classification algorithms they use to train their meta-models:
ELM, SVM, MLP, Complement Naive Bayes, Naive Bayes Multinomial, Naive Bayes Multinomial Updateable, FT, OneR, Bayesian Logistic Regression, Random Forest, RIDOR, J48, RBF Network, IBk.
Since their best result was obtained with the ELM, we are going to use only this algorithm and create a single meta-model to compare with our approach. Note: the paper specifies that they used the Gaussian-kernel ELM (GKELM).
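
For reference, the idea behind a Gaussian-kernel ELM can be sketched in a few lines of NumPy. The block below is only an illustration of the usual closed-form kernel ELM solution, beta = (K + I/C)^-1 y, and is not the implementation in the `elm` package used later in this notebook. The class name `TinyKernelELM`, the toy data, and the values of `C` and `gamma` are all hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


class TinyKernelELM:
    """Illustrative kernel ELM regressor (hypothetical helper, not the elm package)."""

    def __init__(self, C=1.0, gamma=1.0):
        self.C = C          # regularization strength (assumed value)
        self.gamma = gamma  # kernel width (assumed value)

    def fit(self, X, y):
        self.X_ = np.asarray(X, dtype=float)
        K = rbf_kernel(self.X_, self.X_, self.gamma)
        # Closed-form output weights: (K + I/C) beta = y
        self.beta_ = np.linalg.solve(K + np.eye(len(self.X_)) / self.C,
                                     np.asarray(y, dtype=float))
        return self

    def predict(self, X):
        # Real-valued outputs; they still need thresholding to become class labels.
        return rbf_kernel(np.asarray(X, dtype=float), self.X_, self.gamma) @ self.beta_


# Toy usage with two well-separated clusters labelled 0 and 1.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1], dtype=float)
toy_model = TinyKernelELM(C=100.0, gamma=0.5).fit(X_toy, y_toy)
toy_scores = toy_model.predict(np.array([[0.5, 0.5], [5.5, 5.5]]))
toy_labels = (toy_scores >= 0.5).astype(int)
```

As in the notebook cells further down, the model outputs a regressed value per sample, so the class label is obtained by checking which of 0 or 1 the value is closer to.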



In [128]:
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
from scipy.io.arff import loadarff
from datetime import datetime
import missingno as msno
from sklearn import preprocessing
import copy
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
# from sklearn.metrics import recall_score
# from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from imblearn.metrics import geometric_mean_score
from sklearn.model_selection import LeaveOneOut
from sklearn.impute import KNNImputer
import time
from sklearn.preprocessing import LabelEncoder

# Let's import our slightly modified version of the elm package available at https://github.com/acba/elm
# We commented out unused parts to reduce dependencies. 
from elm import elmk

# Reading meta-features for the 300 datasets

In [2]:
# Now, we could extract here the 12 meta-features that they used for our 300 datasets, but we already did 
# this for our paper. We just have to select only the 12 mfs they used.

all_mfs = pd.read_csv('../Meta_features_extraction/pymfe_meta_features.csv')
mfs = all_mfs[['f1.mean', 'f2.mean', 'f3.mean', 'n2.mean', 'n3.mean', 't1.mean',
               't2',  'density', 'n4.mean', 'l1.mean', 'l2.mean', 'l3.mean']]

# Measuring classification performances

In [3]:
# I will create a dict structure such that I can access train fold 1 from 
# dataset D1 as datasets[1]['train'][0]
print('Loading data ', end='')
data_dir = '../../data/5-fold'
datasets = {}
for i in range(1,301):
    datasets[i] = {}
    datasets[i]['train'] = []
    datasets[i]['test'] = []
    for f in range(1,6): #for each fold
        csv_filename = f'{data_dir}/D{i}-fold{f}-train.csv'
        df_train = pd.read_csv(csv_filename, encoding='utf8', engine='python', sep=',', 
                     header=0, on_bad_lines='skip')
        csv_filename = f'{data_dir}/D{i}-fold{f}-test.csv'
        df_test = pd.read_csv(csv_filename, encoding='utf8', engine='python', sep=',', 
                     header=0, on_bad_lines='skip')
        datasets[i]['train'].append(df_train)
        datasets[i]['test'].append(df_test)
    print('.', end='')

Loading data ............................................................................................................................................................................................................................................................................................................

In [4]:
print('Scaling ', end='')
# Creating copies of the datasets:
datasets_ss = copy.deepcopy(datasets)
datasets_mms = copy.deepcopy(datasets)

Scaling 

In [5]:
# import warnings
# Ignoring warnings from QuantileTransformer when number of samples is lower than 1000:
# warnings.filterwarnings(action = "ignore", category=UserWarning) 

ss = StandardScaler()
mms = MinMaxScaler() 

for i in range(1,301):
    for fold in range(5):
        #print(f'Dataset: {name}, fold {fold}.', end = '')
        datasets_ss[i]['train'][fold].iloc[:,:-1] = ss.fit_transform(datasets_ss[i]['train'][fold].iloc[:,:-1])
        datasets_ss[i]['test'][fold].iloc[:,:-1] = ss.transform(datasets_ss[i]['test'][fold].iloc[:,:-1])
        datasets_mms[i]['train'][fold].iloc[:,:-1] = mms.fit_transform(datasets_mms[i]['train'][fold].iloc[:,:-1])
        datasets_mms[i]['test'][fold].iloc[:,:-1] = mms.transform(datasets_mms[i]['test'][fold].iloc[:,:-1])        
    print('.', end='') 
# Re-establishing warnings:
# warnings.filterwarnings(action = "default", category=UserWarning)

............................................................................................................................................................................................................................................................................................................

In [51]:
def run_model(model, model_name, results_df): 
#This function was modified to deal with the different API for the GKELM
    superset = {'SS': datasets_ss, 'MMS': datasets_mms}
    
    print('Starting '+ model_name +', time: ', datetime.now())
    for name in range(1,301): #name is actually a number
    #for name in [1]: #testing 
        print(f'\nCurrent dataset: {name}', end = '')
        for k in superset:
            print(' '+k+' ', end = '')
            acc_folds = []
            recall_folds = []
            precision_folds = []
            f1_folds = []
            #roc_auc_folds = []
            gmean_folds = []
            
            ds = superset[k]
            target_att = ds[name]['train'][0].columns.tolist()[-1]
            for fold in range(5):
                print('.', end = '')
                #Gather training data:
                ds_train = ds[name]['train'][fold]
                X_train = ds_train.drop(labels=target_att, axis = 1)
                y_train = ds_train[target_att]
            
                # Gather test data:
                ds_test = ds[name]['test'][fold]
                X_test = ds_test.drop(labels=target_att, axis = 1)
                y_test = ds_test[target_att]
                
                
                #Fit the model:
                # For elmk, target variable must be the first (!):
                tr_data = pd.concat([y_train, X_train], axis=1).to_numpy()
                
                # search for best parameter for this dataset
                #model.search_param(tr_data, cv="kfold", of="accuracy", eval=10)
                                #model.fit(X_train, y_train)

                model.train(tr_data)
                
                # Test model:
                # y_pred = model.predict(X_test)
                tst_data = pd.concat([y_test, X_test], axis=1).to_numpy()
                result = model.test(tst_data).predicted_targets.reshape(1, -1)
                y_pred = []
                for x in result[0]:
                    if abs(x-0) >= abs(x-1): y_pred.append(1) # If the regressed value is closer to 1.
                    else: y_pred.append(0) # If the regressed value is closer to 0.
                # print('y_pred = ', y_pred)
                acc = accuracy_score(y_test, y_pred)
                # recall = recall_score(y_test, y_pred, pos_label=1)
                # precision = precision_score(y_test, y_pred, pos_label=1, zero_division=0)
                f1 = f1_score(y_test, y_pred, pos_label=1, zero_division=0)
                gmean = geometric_mean_score(y_test, y_pred, pos_label=1)
                #roc_auc = roc_auc_score(y_test, y_score)

                # Store metrics for this fold
                acc_folds.append(acc)
                # recall_folds.append(recall)
                # precision_folds.append(precision)
                f1_folds.append(f1)
                # roc_auc_folds.append(roc_auc)
                gmean_folds.append(gmean)
            
            new_row = {'Dataset' : name, 'Scaling technique' : k, 'Model' : model_name,
                       'acc_fold1' : acc_folds[0], 'acc_fold2' : acc_folds[1], 'acc_fold3' : acc_folds[2], 
                       'acc_fold4' : acc_folds[3], 'acc_fold5' : acc_folds[4], 
                       'acc_mean': np.mean(acc_folds), 'acc_stddev': np.std(acc_folds),
                       # 'recall_fold1' : recall_folds[0], 'recall_fold2' : recall_folds[1], 'recall_fold3' : recall_folds[2],
                       # 'recall_fold4' : recall_folds[3], 'recall_fold5' : recall_folds[4], 
                       # 'recall_mean': np.mean(recall_folds), 'recall_stddev':np.std(recall_folds),
                       # 'precision_fold1' : precision_folds[0], 'precision_fold2' : precision_folds[1] , 'precision_fold3' : precision_folds[2],
                       # 'precision_fold4' : precision_folds[3], 'precision_fold5' : precision_folds[4],
                       # 'precision_mean': np.mean(precision_folds), 'precision_stddev': np.std(precision_folds),
                       'f1_fold1' : f1_folds[0], 'f1_fold2' : f1_folds[1], 'f1_fold3' : f1_folds[2], 
                       'f1_fold4' : f1_folds[3], 'f1_fold5' : f1_folds[4], 
                       'f1_mean': np.mean(f1_folds), 'f1_stddev': np.std(f1_folds),
#                        'roc_auc_fold1' : roc_auc_folds[0], 'roc_auc_fold2' : roc_auc_folds[1], 'roc_auc_fold3' : roc_auc_folds[2], 
#                        'roc_auc_fold4' : roc_auc_folds[3], 'roc_auc_fold5' : roc_auc_folds[4], 
#                        'roc_auc_mean': np.mean(roc_auc_folds), 'roc_auc_stddev': np.std(roc_auc_folds),
                       'gmean_fold1' : gmean_folds[0], 'gmean_fold2' : gmean_folds[1], 'gmean_fold3' : gmean_folds[2], 
                       'gmean_fold4' : gmean_folds[3], 'gmean_fold5' : gmean_folds[4], 
                       'gmean_mean': np.mean(gmean_folds), 'gmean_stddev' : np.std(gmean_folds),
                      }

            #results_df = results_df.append(new_row, ignore_index=True) #Deprecated
            results_df = pd.concat([results_df, pd.DataFrame.from_records([new_row])],ignore_index=True)

    print('Finishing '+ model_name +', time: ', datetime.now())   
    return results_df
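
The element-wise loop that converts the regressed GKELM outputs into 0/1 labels can also be written as a single vectorized comparison. The sketch below uses hypothetical scores standing in for `result[0]`; for these values, the distance rule used in `run_model` and a plain threshold at 0.5 give the same labels.

```python
import numpy as np

# Hypothetical regressed outputs, standing in for `result[0]` in run_model.
scores = np.array([-0.2, 0.1, 0.5, 0.7, 1.7])

# Rule used in the notebook: label 1 when the score is at least as close to 1 as to 0.
by_distance = np.where(np.abs(scores - 0) >= np.abs(scores - 1), 1, 0)

# Same decision expressed as a comparison against the midpoint 0.5.
by_threshold = (scores >= 0.5).astype(int)
```

Note that a score of exactly 0.5 is assigned to class 1 under both formulations.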

In [52]:
# Creating a dataframe to store results:
results_df = pd.DataFrame({'Dataset' : [], 'Scaling technique' : [], 'Model' : [],
                           'acc_fold1' : [], 'acc_fold2' : [], 'acc_fold3' : [], 'acc_fold4' : [], 'acc_fold5' : [], 
                           'acc_mean':[], 'acc_stddev':[],
                           # 'recall_fold1' : [], 'recall_fold2' : [], 'recall_fold3' : [], 'recall_fold4' : [], 'recall_fold5' : [], 
                           # 'recall_mean':[], 'recall_stddev':[],
                           # 'precision_fold1' : [], 'precision_fold2' : [], 'precision_fold3' : [], 'precision_fold4' : [], 
                           # 'precision_fold5' : [], 'precision_mean':[], 'precision_stddev': [],
                           'f1_fold1' : [], 'f1_fold2' : [], 'f1_fold3' : [], 'f1_fold4' : [], 'f1_fold5' : [], 
                           'f1_mean': [], 'f1_stddev': [],
                           'gmean_fold1' : [], 'gmean_fold2' : [], 'gmean_fold3' : [], 'gmean_fold4' : [], 'gmean_fold5' : [], 
                           'gmean_mean':[], 'gmean_stddev' : []
                           })

In [53]:
## Instantiating model:
models = {'GKELM': elmk.ELMKernel()}

In [54]:
# Running models:
for name,model in models.items():
        results_df = run_model(model, name, results_df)
results_df.to_csv('results_ST_perfs_GKELM.csv', index=False)

Starting GKELM, time:  2023-10-20 15:06:27.510881

Current dataset: 1 SS ..... MMS .....
Current dataset: 2 SS ..... MMS .....
Current dataset: 3 SS ..... MMS .....
Current dataset: 4 SS ..... MMS .....
Current dataset: 5 SS ..... MMS .....
Current dataset: 6 SS ..... MMS .....
Current dataset: 7 SS ..... MMS .....
Current dataset: 8 SS ..... MMS .....
Current dataset: 9 SS ..... MMS .....
Current dataset: 10 SS ..... MMS .....
Current dataset: 11 SS ..... MMS .....
Current dataset: 12 SS ..... MMS .....
Current dataset: 13 SS ..... MMS .....
Current dataset: 14 SS ..... MMS .....
Current dataset: 15 SS ..... MMS .....
Current dataset: 16 SS ..... MMS .....
Current dataset: 17 SS ..... MMS .....
Current dataset: 18 SS ..... MMS .....
Current dataset: 19 SS ..... MMS .....
Current dataset: 20 SS ..... MMS .....
Current dataset: 21 SS ..... MMS .....
Current dataset: 22 SS ..... MMS .....
Current dataset: 23 SS ..... MMS .....
Current dataset: 24 SS ..... MMS .....
Current dataset: 25 SS

In [55]:
#results_df = pd.read_csv('results_ST_perfs_GKELM.csv')
results_df

Unnamed: 0,Dataset,Scaling technique,Model,acc_fold1,acc_fold2,acc_fold3,acc_fold4,acc_fold5,acc_mean,acc_stddev,...,f1_fold5,f1_mean,f1_stddev,gmean_fold1,gmean_fold2,gmean_fold3,gmean_fold4,gmean_fold5,gmean_mean,gmean_stddev
0,1.0,SS,GKELM,0.475410,0.500000,0.500000,0.433333,0.466667,0.475082,0.024721,...,0.500000,0.456400,0.045659,0.468353,0.490170,0.498888,0.428174,0.461880,0.469493,0.024731
1,1.0,MMS,GKELM,0.459016,0.516667,0.450000,0.466667,0.450000,0.468470,0.024891,...,0.476190,0.417401,0.036774,0.439941,0.483314,0.449691,0.447214,0.447214,0.453475,0.015273
2,2.0,SS,GKELM,0.617021,0.695652,0.586957,0.608696,0.695652,0.640796,0.045854,...,0.681818,0.603944,0.061955,0.617914,0.685913,0.577350,0.583874,0.694949,0.632000,0.049740
3,2.0,MMS,GKELM,0.574468,0.695652,0.608696,0.586957,0.673913,0.627937,0.048182,...,0.545455,0.499010,0.063403,0.542720,0.627922,0.583874,0.488504,0.612372,0.571079,0.050444
4,3.0,SS,GKELM,0.406250,0.375000,0.281250,0.484375,0.317460,0.372867,0.070734,...,0.295082,0.382179,0.094986,0.405244,0.369936,0.279508,0.478033,0.315909,0.369726,0.069322
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,298.0,MMS,GKELM,0.590164,0.650000,0.600000,0.466667,0.583333,0.578033,0.060398,...,0.675325,0.639569,0.053380,0.583874,0.633866,0.577672,0.436862,0.511039,0.548663,0.068200
596,299.0,SS,GKELM,0.300000,0.516667,0.450000,0.416667,0.550000,0.446667,0.087178,...,0.526316,0.432965,0.077429,0.300000,0.514242,0.447214,0.416333,0.547723,0.445102,0.086253
597,299.0,MMS,GKELM,0.300000,0.483333,0.450000,0.450000,0.483333,0.433333,0.068313,...,0.474576,0.418514,0.090090,0.292499,0.483046,0.442217,0.447214,0.483046,0.429604,0.070681
598,300.0,SS,GKELM,0.508197,0.590164,0.450000,0.500000,0.516667,0.513005,0.045008,...,0.355556,0.300480,0.058149,0.449664,0.412710,0.322603,0.426763,0.456229,0.413594,0.048109


# Creating Meta-dataset

## Creating target attribute (Best_st)

In [67]:
perfs = results_df[['Dataset', 'Scaling technique', 'f1_mean']].copy() # .copy() avoids pandas' SettingWithCopyWarning
perfs['ds_name'] = perfs['Dataset'].apply(lambda x: f'D{int(x)}')
perfs = perfs[['ds_name', 'Scaling technique', 'f1_mean']]
perfs


Unnamed: 0,ds_name,Scaling technique,f1_mean
0,D1,SS,0.456400
1,D1,MMS,0.417401
2,D2,SS,0.603944
3,D2,MMS,0.499010
4,D3,SS,0.382179
...,...,...,...
595,D298,MMS,0.639569
596,D299,SS,0.432965
597,D299,MMS,0.418514
598,D300,SS,0.300480


In [86]:
ds_names = list(perfs.ds_name.unique())
meta_dataset_dict = {'ds_name':[], 'SS':[], 'MMS':[], 'best_st':[] #, 'best_sts':[]
                    }
for ds_name in ds_names:
    perfs_filtered_by_ds = perfs[perfs['ds_name']==ds_name]
    max_perf = perfs_filtered_by_ds.f1_mean.max()
    best_sts = perfs_filtered_by_ds[perfs_filtered_by_ds['f1_mean'] == max_perf]['Scaling technique'].values
    best_st = best_sts[0] #Sometimes, both STs attain max perf, we will just pick the first, as we did in our paper.
    meta_dataset_dict['ds_name'].append(ds_name)
    meta_dataset_dict['best_st'].append(best_st)
    for st in ['SS', 'MMS']:
        row_for_this_st = perfs_filtered_by_ds['Scaling technique'] == st
        perf_for_this_st = perfs_filtered_by_ds[row_for_this_st].f1_mean.values[0]
        meta_dataset_dict[st].append(perf_for_this_st)
    #meta_dataset_dict['best_sts'].append(best_sts)
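
The same labelling step can be expressed without an explicit loop. The following is a sketch on hypothetical F1 values in the same long format as `perfs`; the names `toy_perfs` and `toy_meta` are made up for the illustration. It relies on `idxmax` returning the first column when there is a tie, which matches the "pick the first" choice made above as long as 'SS' is listed before 'MMS'.

```python
import pandas as pd

# Hypothetical F1 means in the same long format as `perfs` (D3 is a tie).
toy_perfs = pd.DataFrame({
    'ds_name': ['D1', 'D1', 'D2', 'D2', 'D3', 'D3'],
    'Scaling technique': ['SS', 'MMS', 'SS', 'MMS', 'SS', 'MMS'],
    'f1_mean': [0.46, 0.42, 0.38, 0.46, 0.50, 0.50],
})

# pivot() sorts ds_name as strings (e.g. 'D10' before 'D2'), so restore the
# original dataset order explicitly; this matters for any later positional concat.
original_order = toy_perfs['ds_name'].unique()
wide = (toy_perfs
        .pivot(index='ds_name', columns='Scaling technique', values='f1_mean')
        .reindex(index=original_order, columns=['SS', 'MMS']))

# Column name of the row-wise maximum; ties resolve to the first column ('SS').
best = wide.idxmax(axis=1)
toy_meta = wide.assign(best_st=best).rename_axis(index='ds_name').reset_index()
```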


In [87]:
meta_dataset = pd.DataFrame(meta_dataset_dict)
meta_dataset

Unnamed: 0,ds_name,SS,MMS,best_st
0,D1,0.456400,0.417401,SS
1,D2,0.603944,0.499010,SS
2,D3,0.382179,0.459705,MMS
3,D4,0.490649,0.547206,MMS
4,D5,0.510446,0.599733,MMS
...,...,...,...,...
295,D296,0.547996,0.452201,SS
296,D297,0.683706,0.650371,SS
297,D298,0.607885,0.639569,MMS
298,D299,0.432965,0.418514,SS


## Adding the meta-features

In [92]:
meta_dataset = pd.concat([meta_dataset[meta_dataset.columns[:1]], mfs, meta_dataset[meta_dataset.columns[1:]]], axis = 1)

In [93]:
meta_dataset

Unnamed: 0,ds_name,f1.mean,f2.mean,f3.mean,n2.mean,n3.mean,t1.mean,t2,density,n4.mean,l1.mean,l2.mean,l3.mean,SS,MMS,best_st
0,D1,0.997618,0.216065,0.976744,0.503851,0.574751,0.006667,0.066445,1.000000,0.182724,0.177843,0.455150,0.421927,0.456400,0.417401,SS
1,D2,0.966098,0.193836,0.965368,0.506958,0.528139,0.006536,0.034632,0.989046,0.307359,0.175905,0.311688,0.251082,0.603944,0.499010,SS
2,D3,0.998351,0.140620,0.984326,0.506327,0.567398,0.007143,0.062696,1.000000,0.172414,0.186916,0.445141,0.388715,0.382179,0.459705,MMS
3,D4,0.997004,0.052404,0.966777,0.502377,0.554817,0.007576,0.066445,1.000000,0.205980,0.182535,0.388704,0.358804,0.490649,0.547206,MMS
4,D5,0.997819,0.090321,0.970000,0.505203,0.580000,0.007752,0.066667,1.000000,0.203333,0.188478,0.390000,0.463333,0.510446,0.599733,MMS
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,D296,0.968520,0.069950,0.931624,0.502993,0.508547,0.018868,0.034188,0.989032,0.380342,0.163246,0.354701,0.256410,0.547996,0.452201,SS
296,D297,0.968030,0.056096,0.900433,0.498197,0.528139,0.008929,0.034632,0.992208,0.311688,0.158659,0.303030,0.242424,0.683706,0.650371,SS
297,D298,0.992193,0.071053,0.870432,0.512317,0.707641,0.005236,0.066445,1.000000,0.176080,0.172496,0.355482,0.285714,0.607885,0.639569,MMS
298,D299,0.998167,0.011849,0.660000,0.502802,0.536667,0.006849,0.066667,1.000000,0.170000,0.206342,0.390000,0.360000,0.432965,0.418514,SS


# Train the meta-model

Here we need to train and test a GKELM meta-model. We will do it with Leave-One-Out Cross-Validation (LOOCV), as we did in our Meta-scaler paper. We are interested in the base-level classification performance that the same 12 base classifiers used in our paper attain with the recommended ST, so that we can compare it with the results of our paper (Meta-scaler). Notice that, while the GKELM will be trained and tested on the meta-dataset constructed in the previous section (the one with just two STs), the ground truth will come from the meta-dataset used in our paper, where we used 12 base models and six STs.
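
As a compact illustration of the LOOCV protocol described above, the sketch below chains imputation, scaling and a classifier in a scikit-learn pipeline, so that each step is refitted on the training portion of every fold. The toy data are hypothetical, and `KNeighborsClassifier` is only a stand-in: the GKELM from the `elm` package does not follow the scikit-learn estimator API, which is why the cell below uses an explicit loop instead.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical meta-features (two columns, one missing value) and best-ST labels.
X_toy = np.array([[0.0, 0.1], [0.2, 0.0], [0.5, np.nan], [0.1, 0.3],
                  [5.0, 5.2], [5.3, 5.0], [5.1, 5.4], [5.4, 5.1]])
y_toy = np.array(['SS', 'SS', 'SS', 'SS', 'MMS', 'MMS', 'MMS', 'MMS'])

# Imputer and scaler are fitted inside each fold, on the training samples only.
pipeline = make_pipeline(KNNImputer(n_neighbors=2),
                         MinMaxScaler(),
                         KNeighborsClassifier(n_neighbors=1))  # stand-in for the GKELM

# One prediction per sample, each made by a model that never saw that sample.
y_loo = cross_val_predict(pipeline, X_toy, y_toy, cv=LeaveOneOut())
```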

In [136]:
%%time

training_time = 0
testing_time = 0

print(f'\nTraining metamodel with GKELM:')
df = meta_dataset.copy() # Using the meta-dataset that we built in this notebook.
df.reset_index(inplace = True, drop = True)

# Separating X and y.
X = df.iloc[:, 1:-3] # Just the metafeatures.
y = df.iloc[:,-1] # Just the best ST.


# Splitting the dataset into the Training set and Test set folds, 
# according to Leave One Out Cross Validation:
loo = LeaveOneOut()
y_pred = list()
for train_index, test_index in loo.split(X):
    current_ds = df['ds_name'].iloc[test_index].values[0]
    #print(f'Test index is {test_index}, corresponding to DS {current_ds}.')
    print('.', end='')
    # Separating training and test sets:
    X_train, X_test = X.iloc[train_index,:], X.iloc[test_index,:]
    y_train, y_test = y[train_index], y[test_index]
    # Filling missing values with a KNN imputer.  Each sample’s missing values are imputed
    # using the mean value from n_neighbors nearest neighbors found in the training set. 
    imputer = KNNImputer(n_neighbors=2)
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)

    # The above method returns a nd.array, so here we rebuild the DataFrame:
    X_train = pd.DataFrame(X_train, columns=X.columns) 
    X_test = pd.DataFrame(X_test, columns=X.columns)
    
    # Feature Scaling (Not needed if using a tree based meta-model)
    sc = MinMaxScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)

    # The above method returns a nd.array, so here we rebuild the DataFrame:
    X_train = pd.DataFrame(X_train, columns=X.columns) 
    X_test = pd.DataFrame(X_test, columns=X.columns)
    
    # Encoding class labels as 0s and 1s:
    le = LabelEncoder()
    le.fit(y_train)
    y_train = le.transform(y_train)
    y_train = pd.Series(y_train, name='best_st')
    y_test = le.transform(y_test)
    y_test = pd.Series(y_test, name='best_st')
    
    meta_model = elmk.ELMKernel()
    #Fit the meta-model:
    # For elmk, target variable must be the first (!):
    
    tr_data = pd.concat([y_train, X_train], axis=1).to_numpy()
    tic = time.perf_counter()
    meta_model.train(tr_data)
    
    toc = time.perf_counter()
    training_time += toc-tic      
    
    # Test meta_model:
    tst_data = pd.concat([y_test, X_test], axis=1).to_numpy()
    tic = time.perf_counter()
    
    result = meta_model.test(tst_data).predicted_targets.reshape(1, -1)
    
    toc = time.perf_counter()
    testing_time += toc-tic
    
    # Save prediction:
    x = result[0][0]
    if abs(x-0) >= abs(x-1): y_pred.append(1) # If the regressed value is closer to 1.
    else: y_pred.append(0) # If the regressed value is closer to 0.


print(f'\nTraining time: {training_time} seconds.')   
print(f'Testing time: {testing_time} seconds.')       
computing_times = {'Testing': testing_time, 'Training': training_time, 
                   'Total': testing_time+training_time}


Training metamodel with GKELM:
............................................................................................................................................................................................................................................................................................................
Training time: 1.6102496450330364 seconds.
Testing time: 0.024362957999983337 seconds.
CPU times: user 24.7 s, sys: 3.87 s, total: 28.6 s
Wall time: 3.79 s


In [141]:
y_pred = le.inverse_transform(y_pred) 

In [150]:
# If we wanted to know the meta-model performance on this meta-dataset, we would do:
acc = accuracy_score(y, y_pred)
f1 = f1_score(y, y_pred, average="macro")
print(f'Meta-model performance on GKELM meta-dataset: F1 = {f1}, acc = {acc}')

Meta-model performance on GKELM meta-dataset: F1 = 0.6347513219821554, acc = 0.7766666666666666


# Assess base level performance

We want to know how Jain's approach performs (in terms of base-model performance, measured with F1) for each of the 12 base models used in our paper.
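
The lookup this requires, namely reading for every dataset the F1 score that a base model attains under the recommended ST, can be sketched on toy data as follows. All values and names (`toy_perfs_for_model`, `toy_recommended`) are hypothetical; the sketch assumes, as the notebook does, that the rows of the performance table and the list of recommendations follow the same dataset order.

```python
import numpy as np
import pandas as pd

# Hypothetical F1 scores of one base model under each ST (one row per dataset).
toy_perfs_for_model = pd.DataFrame({'SS': [0.47, 0.60, 0.38],
                                    'MMS': [0.39, 0.50, 0.46]})

# Hypothetical meta-model recommendations for the same three datasets.
toy_recommended = np.array(['MMS', 'SS', 'MMS'])

# Translate each recommended ST into its column position, then pick one cell per row.
col_idx = toy_perfs_for_model.columns.get_indexer(toy_recommended)
row_idx = np.arange(len(toy_perfs_for_model))
attained = toy_perfs_for_model.to_numpy()[row_idx, col_idx]
```

Here `attained` holds, for each dataset, the score under the recommended ST, which is the quantity collected per base model in `jains_performances`.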

In [112]:
# Importing the meta-dataset from our paper:
our_meta_dataset = pd.read_csv('../metafeat_pymfe+imbcol_and_ST_perform_for_pairs_of_dataset_and_model.csv')

In [151]:
our_meta_dataset

Unnamed: 0,Model,Dataset,attr_conc.mean,attr_ent.mean,attr_to_inst,best_node.mean,best_node.mean.relative,c1,c2,can_cor.mean,...,linearity.class.L3_partial.1,NS,SS,MMS,MAS,RS,QT,Max_F1_perf,Best_STs,Best_ST
0,Bagging,D1,0.019398,2.584913,0.066445,0.494839,5.0,0.999928,0.000199,0.202900,...,0.448505,0.440079,0.466025,0.394689,0.303649,0.449671,0.451235,0.466025,['SS'],SS
1,GLVQ,D1,0.019398,2.584913,0.066445,0.494839,5.0,0.999928,0.000199,0.202900,...,0.448505,0.434862,0.462094,0.418003,0.445430,0.469747,0.487364,0.487364,['QT'],QT
2,GP,D1,0.019398,2.584913,0.066445,0.494839,5.0,0.999928,0.000199,0.202900,...,0.448505,0.362818,0.474876,0.384721,0.000000,0.484660,0.441929,0.484660,['RS'],RS
3,KNORAE,D1,0.019398,2.584913,0.066445,0.494839,5.0,0.999928,0.000199,0.202900,...,0.448505,0.435348,0.447054,0.492395,0.450039,0.432771,0.413654,0.492395,['MMS'],MMS
4,KNORAU,D1,0.019398,2.584913,0.066445,0.494839,5.0,0.999928,0.000199,0.202900,...,0.448505,0.443713,0.468347,0.487077,0.432025,0.478051,0.505159,0.505159,['QT'],QT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3595,MLP,D300,0.019184,2.584899,0.066225,0.569462,6.5,0.986003,0.037949,0.250186,...,0.572848,0.601871,0.431757,0.601871,0.601871,0.438325,0.415073,0.601871,['NS' 'MMS' 'MAS'],NS
3596,OLA,D300,0.019184,2.584899,0.066225,0.569462,6.5,0.986003,0.037949,0.250186,...,0.572848,0.414099,0.445926,0.386136,0.446185,0.376316,0.497744,0.497744,['QT'],QT
3597,Percep,D300,0.019184,2.584899,0.066225,0.569462,6.5,0.986003,0.037949,0.250186,...,0.572848,0.265806,0.434663,0.237405,0.267137,0.377495,0.421451,0.434663,['SS'],SS
3598,SVM_RBF,D300,0.019184,2.584899,0.066225,0.569462,6.5,0.986003,0.037949,0.250186,...,0.572848,0.186589,0.190331,0.201437,0.179149,0.217621,0.202262,0.217621,['RS'],RS


In [179]:
models_names = our_meta_dataset['Model'].unique()
jains_performances = {}
for model in models_names:
    jains_performances[model] = []
    #Here we fetch only the performances attained by the current model when using SS and MMS:
    perfs_for_this_model = our_meta_dataset[our_meta_dataset['Model'] == model].iloc[:,-8:-6].reset_index(drop=True)
    #Now, we have to check what was Jain's predicted ST for each dataset and check what would be
    #the performance of the current model with that ST on the dataset.
    for ds in range(0,300):
        jains_performances[model].append(perfs_for_this_model[y_pred[ds]].iloc[ds])
    

In [186]:
pd.DataFrame(jains_performances)

Unnamed: 0,Bagging,GLVQ,GP,KNORAE,KNORAU,LCA,MCB,MLP,OLA,Percep,SVM_RBF,SVM_lin
0,0.394689,0.418003,0.384721,0.492395,0.487077,0.220839,0.492665,0.662205,0.535772,0.383329,0.453572,0.457443
1,0.627831,0.619160,0.625233,0.586660,0.642697,0.455531,0.596047,0.000000,0.591616,0.557115,0.587516,0.609299
2,0.366334,0.512624,0.485306,0.444194,0.450647,0.399639,0.470822,0.673568,0.451370,0.434061,0.451279,0.412555
3,0.464659,0.513954,0.275781,0.424394,0.458579,0.420745,0.443409,0.665201,0.442450,0.558772,0.448221,0.492700
4,0.459579,0.624894,0.611336,0.460374,0.523039,0.408728,0.505028,0.692785,0.508054,0.305671,0.535946,0.514049
...,...,...,...,...,...,...,...,...,...,...,...,...
295,0.544634,0.542386,0.375519,0.428957,0.538095,0.504026,0.538888,0.000000,0.563854,0.527724,0.534791,0.533769
296,0.686574,0.654669,0.676788,0.598052,0.681330,0.556507,0.635977,0.113043,0.616204,0.595677,0.684449,0.655115
297,0.612490,0.666195,0.635883,0.456978,0.613727,0.591554,0.527027,0.659512,0.551723,0.459202,0.605506,0.604518
298,0.448158,0.149134,0.110849,0.424710,0.385779,0.504779,0.394609,0.666667,0.392255,0.272819,0.622763,0.374748


In [185]:
pd.DataFrame(jains_performances).describe()

Unnamed: 0,Bagging,GLVQ,GP,KNORAE,KNORAU,LCA,MCB,MLP,OLA,Percep,SVM_RBF,SVM_lin
count,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0
mean,0.71287,0.624984,0.705526,0.722417,0.718334,0.676628,0.718868,0.687647,0.718981,0.678886,0.726086,0.70226
std,0.276603,0.361435,0.318522,0.249137,0.275972,0.276476,0.242325,0.323022,0.241523,0.257378,0.287759,0.301822
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.601652,0.408757,0.655037,0.585104,0.613499,0.522499,0.590553,0.622907,0.593792,0.525688,0.665153,0.588404
50%,0.792042,0.783934,0.825006,0.807816,0.802952,0.75167,0.787938,0.804493,0.786939,0.742064,0.810698,0.7988
75%,0.922935,0.915278,0.929133,0.929048,0.92435,0.904144,0.914107,0.924678,0.914329,0.886834,0.929554,0.926884
max,0.991837,0.989058,0.990639,0.989733,0.991837,0.993939,0.991878,0.989058,0.991878,0.991256,0.989058,0.991878


In [187]:
pd.DataFrame(jains_performances).to_csv('jain_et_al_classification_performances.csv')