# Speed test

In this notebook, we compare the runtime performance of the ```smote_variants``` package with that of the ```imblearn``` package using the three oversamplers that both packages implement (SMOTE, Borderline-SMOTE, and ADASYN). Note that the implementations use different logic to determine the number of samples to generate. Generally, the ```imblearn``` implementations are more flexible, while the ```smote_variants``` implementations are simpler to use.
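
To illustrate how the sample-count logic can differ, the sketch below contrasts the two conventions as we understand them from the packages' documentation: the `proportion` parameter of `smote_variants` scales the gap between the majority and minority class sizes, whereas a float `sampling_strategy` in `imblearn` sets the target minority-to-majority ratio after resampling. The helper names `n_new_smote_variants` and `n_new_imblearn` are hypothetical, the class sizes are made up, and the formulas are assumptions mirroring the documented semantics rather than the packages' actual code.

```python
def n_new_smote_variants(n_maj, n_min, proportion=1.0):
    """Assumed rule: generate `proportion` times the class-size gap."""
    return int((n_maj - n_min) * proportion)


def n_new_imblearn(n_maj, n_min, sampling_strategy=1.0):
    """Assumed rule: reach n_min_after / n_maj == sampling_strategy."""
    return int(n_maj * sampling_strategy) - n_min


# hypothetical class sizes: 144 majority and 4 minority samples
n_maj, n_min = 144, 4

for value in (1.0, 0.5):
    print(value,
          n_new_smote_variants(n_maj, n_min, value),
          n_new_imblearn(n_maj, n_min, value))
```

Under these assumptions, a value of 1.0 requests 140 new minority samples with both rules, while 0.5 requests 70 and 68, respectively, so identical numeric settings do not necessarily mean identical workloads.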

In [1]:
import smote_variants as sv
import mldb.binary_classification as bin_clas

from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE, SVMSMOTE

import matplotlib.pyplot as plt
import time
import numpy as np
import pandas as pd

import logging

logger = logging.getLogger('smote_variants')
logger.setLevel(logging.CRITICAL)

In [3]:
datasets = bin_clas.get_filtered_data_loaders(n_bounds=(1, 1000), 
                                                n_attr_encoded_bounds=(1, 50))

In [4]:
def measure(sv_samplers, imb_samplers, datasets):
    """
    Measure the runtimes of oversamplers on a set of datasets.

    Args:
        sv_samplers (list): oversampler objects from smote_variants
        imb_samplers (list): oversampler objects from imblearn; imb_samplers[i]
                             is the implementation corresponding to sv_samplers[i]
        datasets (list(function)): dataset loading functions
    Returns:
        pd.DataFrame: mean oversampling runtimes (in seconds) for the various
                      oversamplers over all datasets
    """

    results = {}
    # iterate through all datasets
    for loader in datasets:
        data = loader()
        print('processing: %s' % data['name'])

        X = data['data']
        y = data['target']
        for sv_sampler, imb_sampler in zip(sv_samplers, imb_samplers):
            name = sv_sampler.__class__.__name__
            # imblearn seems to fail on some edge cases; such
            # dataset/oversampler pairs are skipped for both packages
            try:
                # measure the oversampling runtime of smote_variants
                t0 = time.perf_counter()
                sv_sampler.sample(X, y)
                res_sv = time.perf_counter() - t0

                # measure the oversampling runtime of imblearn
                t0 = time.perf_counter()
                imb_sampler.fit_resample(X, y)
                res_imb = time.perf_counter() - t0
            except Exception:
                continue

            # record the pair of runtimes
            sv_times, imb_times = results.setdefault(name, ([], []))
            sv_times.append(res_sv)
            imb_times.append(res_imb)

    # prepare the final dataframe of mean runtimes
    for name in results:
        results[name] = [np.mean(results[name][0]), np.mean(results[name][1])]

    results = pd.DataFrame(results).T
    results.columns = ['smote_variants', 'imblearn']

    return results
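
A single wall-clock measurement per oversampler and dataset can be noisy when the runtimes are in the range of milliseconds. As a minimal sketch of one possible refinement (not part of the original benchmark), the hypothetical helper `time_call` below repeats a call several times and reports the median runtime; the `sorted` workload is only a stand-in for a call such as `sample` or `fit_resample`.

```python
import statistics
import time


def time_call(func, *args, repeats=5):
    """Return the median wall-clock runtime of func(*args) in seconds."""
    runtimes = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(*args)
        runtimes.append(time.perf_counter() - t0)
    return statistics.median(runtimes)


# stand-in workload: sorting a reversed list of 10000 integers
elapsed = time_call(sorted, list(range(10000, 0, -1)), repeats=3)
print('median runtime: %.6f s' % elapsed)
```

The median is less sensitive than the mean to an occasional slow run, at the cost of multiplying the total benchmark time by `repeats`.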


In [5]:
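# Run the evaluation for the techniques implemented by both smote_variants and imblearn, with comparable
# parameters, on every dataset returned by bin_clas.get_data_loaders(). Note that imblearn's BorderlineSMOTE
# defaults to kind='borderline-1', so it is not an exact counterpart of sv.Borderline_SMOTE2.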

sv_techniques= [sv.SMOTE(), sv.Borderline_SMOTE2(k_neighbors=10), sv.ADASYN()]
imb_techniques= [SMOTE(), BorderlineSMOTE(), ADASYN()]

results= measure(sv_techniques,
                 imb_techniques,
                 bin_clas.get_data_loaders())

processing: ADA
processing: CM1
processing: german
processing: hepatitis
processing: HIVA
processing: hypothyroid
processing: KC1
processing: PC1
processing: SATIMAGE
processing: SPECT_F
processing: abalone_17_vs_7_8_9_10
processing: abalone-19_vs_10-11-12-13
processing: abalone-20_vs_8-9-10
processing: abalone-21_vs_8
processing: abalone-3_vs_11
processing: abalone19
processing: abalone9-18
processing: car_good
processing: car-vgood
processing: cleveland-0_vs_4
processing: dermatology-6
processing: ecoli-0-1-3-7_vs_2-6
processing: ecoli-0-1-4-6_vs_5
processing: ecoli-0-1-4-7_vs_2-3-5-6
processing: ecoli-0-1-4-7_vs_5-6
processing: ecoli-0-1_vs_2-3-5
processing: ecoli-0-1_vs_5
processing: ecoli-0-2-3-4_vs_5
processing: ecoli-0-2-6-7_vs_3-5
processing: ecoli-0-3-4-6_vs_5
processing: ecoli-0-3-4-7_vs_5-6
processing: ecoli-0-3-4_vs_5
processing: ecoli-0-4-6_vs_5
processing: ecoli-0-6-7_vs_3-5
processing: ecoli-0-6-7_vs_5
processing: ecoli4
processing: flare-F
processing: glass-0-1-4-6_vs_2
processing: glass-0-1-5_vs_2
processing: glass-0-1-6_vs_2
processing: glass-0-1-6_vs_5
processing: glass-0-4_vs_5
processing: glass-0-6_vs_5
processing: glass2
processing: glass4
processing: glass5
processing: kddcup-buffer_overflow_vs_back
processing: kddcup-guess_passwd_vs_satan
processing: kddcup-land_vs_portsweep
processing: kddcup-land_vs_satan
processing: kddcup-rootkit-imap_vs_back
processing: kr-vs-k-one_vs_fifteen
processing: kr-vs-k-three_vs_eleven
processing: kr-vs-k-zero-one_vs_draw
processing: kr-vs-k-zero_vs_eight
processing: kr-vs-k-zero_vs_fifteen
processing: led7digit-0-2-4-6-7-8-9_vs_1
processing: lymphography-normal-fibrosis
processing: page-blocks-1-3_vs_4
processing: poker-8-9_vs_5
processing: poker-8-9_vs_6
processing: poker-8_vs_6
processing: poker-9_vs_7
processing: shuttle-2_vs_5
processing: shuttle-6_vs_2-3
processing: shuttle-c0-vs-c4
processing: shuttle-c2-vs-c4
processing: vowel0
processing: winequality-red-3_vs_5
processing: winequality-red-4
processing: winequality-red-8_vs_6
processing: winequality-red-8_vs_6-7
processing: winequality-white-3-9_vs_5
processing: winequality-white-3_vs_7
processing: winequality-white-9_vs_4
processing: yeast-0-2-5-6_vs_3-7-8-9
processing: yeast-0-2-5-7-9_vs_3-6-8
processing: yeast-0-3-5-9_vs_7-8
processing: yeast-0-5-6-7-9_vs_4
processing: yeast-1-2-8-9_vs_7
processing: yeast-1-4-5-8_vs_7
processing: yeast-1_vs_7
processing: yeast-2_vs_4
processing: yeast-2_vs_8
processing: yeast4
processing: yeast5
processing: yeast6
processing: zoo-3
processing: ecoli-0_vs_1
processing: ecoli1
processing: ecoli2
processing: ecoli3
processing: glass-0-1-2-3_vs_4-5-6
processing: glass0
processing: glass1
processing: glass6
processing: habarman
processing: iris0
processing: new_thyroid1
processing: page_blocks0
processing: pima
processing: segment0
processing: vehicle0
processing: vehicle1
processing: vehicle2
processing: vehicle3
processing: wisconsin
processing: yeast1
processing: yeast3
processing: mammographic
processing: bupa
processing: monk-2
processing: appendicitis
processing: saheart
processing: australian
processing: crx
(array([0, 1], dtype=object), array([144,   4]))
processing: lymphography
processing: wdbc
processing: ionosphere
processing: spectfheart


In [6]:
# Print the mean runtime of each oversampler; the unit is seconds

print(results)

                   smote_variants  imblearn
SMOTE                    0.012090  0.007889
Borderline_SMOTE2        0.023553  0.014605
ADASYN                   0.016606  0.014517
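
To make the comparison easier to read, the mean runtimes can be turned into ratios. The short sketch below does this with the values copied from the table above; the variable names `runtimes` and `ratios` are ours, not part of the notebook.

```python
# mean runtimes in seconds, copied from the table above:
# (smote_variants, imblearn)
runtimes = {
    'SMOTE': (0.012090, 0.007889),
    'Borderline_SMOTE2': (0.023553, 0.014605),
    'ADASYN': (0.016606, 0.014517),
}

# a ratio above 1 means smote_variants took longer than imblearn on average
ratios = {name: sv_t / imb_t for name, (sv_t, imb_t) in runtimes.items()}

for name, ratio in ratios.items():
    print('%-18s %.2f' % (name, ratio))
```

In this run the ratios come out at roughly 1.53, 1.61, and 1.14, so `imblearn` was faster on average for all three oversamplers, although the absolute differences amount to only a few milliseconds per dataset.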
