# Speed test against imblearn

In this notebook, we compare the runtime of the ```smote_variants``` package with that of the ```imblearn``` package using the three oversamplers that both packages implement (SMOTE, Borderline-SMOTE and ADASYN). Note that the two implementations use different logic to determine the number of samples to generate. Generally, the ```imblearn``` implementations are more flexible, while the ```smote_variants``` implementations are simpler to use.
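
To make the difference in sample-count logic concrete, the sketch below contrasts the two conventions as we understand them from the packages' documentation: a `proportion` in ```smote_variants``` scales the gap between the class sizes, whereas a float `sampling_strategy` in ```imblearn``` sets the target minority-to-majority ratio after resampling. The helper names are hypothetical and the formulas are a simplified illustration under these assumptions, not the libraries' actual code.

```python
def n_new_proportion(n_maj, n_min, proportion=1.0):
    """Assumed smote_variants-style convention: sample a fraction of the class-size gap."""
    return max(0, int((n_maj - n_min) * proportion))


def n_new_ratio(n_maj, n_min, sampling_strategy=1.0):
    """Assumed imblearn-style float convention: reach a target minority/majority ratio.

    Simplification: a target below the current ratio is clipped to zero new samples here.
    """
    return max(0, int(n_maj * sampling_strategy - n_min))


# hypothetical class sizes, chosen only for illustration
n_maj, n_min = 900, 100
for value in (1.0, 0.5):
    print(value, n_new_proportion(n_maj, n_min, value), n_new_ratio(n_maj, n_min, value))
```

With 900 majority and 100 minority samples, both conventions request 800 new samples at 1.0, but at 0.5 they diverge (400 versus 350), so equal parameter values do not necessarily imply equal workloads.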

In [1]:
import smote_variants as sv
import common_datasets.binary_classification as bin_clas

from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE, SVMSMOTE

import matplotlib.pyplot as plt
import time
import numpy as np
import pandas as pd

import logging

logger = logging.getLogger('smote_variants')
logger.setLevel(logging.CRITICAL)



In [3]:
datasets = bin_clas.get_filtered_data_loaders(n_bounds=(1, 1000), 
                                                n_col_bounds=(1, 50))

In [4]:
def measure(sv_samplers, imb_samplers, datasets):
    """
    Measures the runtimes of oversamplers on a set of datasets.

    Args:
        sv_samplers (list(smote_variants.Oversampling)): the list of oversampling objects from smote_variants
        imb_samplers (list(imblearn.Oversampling)): the list of oversampling objects from imblearn,
                                            imb_samplers[i] is the implementation corresponding
                                            to sv_samplers[i]
        datasets (list(function)): dataset loading functions
    Returns:
        pd.DataFrame: mean oversampling runtimes (in seconds) for the various oversamplers over all datasets
    """

    results = {}
    # iterating through all datasets
    for loader in datasets:
        data = loader()
        print('processing: %s' % data['name'])

        X = data['data']
        y = data['target']
        for sv_sampler, imb_sampler in zip(sv_samplers, imb_samplers):
            name = sv_sampler.__class__.__name__
            # imblearn seems to fail on some edge cases; if either call fails, the
            # dataset is skipped for both packages so that the means stay comparable
            try:
                # measuring oversampling runtime using smote_variants
                t0 = time.perf_counter()
                sv_sampler.sample(X, y)
                res_sv = time.perf_counter() - t0

                # measuring oversampling runtime using imblearn
                t0 = time.perf_counter()
                imb_sampler.fit_resample(X, y)
                res_imb = time.perf_counter() - t0
            except Exception:
                continue

            # appending the results
            results.setdefault(name, ([], []))
            results[name][0].append(res_sv)
            results[name][1].append(res_imb)

    # preparing the final dataframe of mean runtimes
    means = {name: [np.mean(t_sv), np.mean(t_imb)]
             for name, (t_sv, t_imb) in results.items()}

    results = pd.DataFrame(means).T
    results.columns = ['smote_variants', 'imblearn']

    return results
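
A single wall-clock measurement of a call that takes only a few milliseconds can be noisy. As a possible refinement, the following minimal sketch repeats a call several times and reports the median runtime. The helper name `time_call` and the `repeats` parameter are hypothetical and not part of either package, and the workload is a stand-in rather than an oversampler.

```python
import statistics
import time


def time_call(func, *args, repeats=5, **kwargs):
    """Return the median wall-clock time (in seconds) of `repeats` calls to `func`."""
    timings = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - t0)
    return statistics.median(timings)


# stand-in workload: sorting a reversed list
elapsed = time_call(sorted, list(range(10000, 0, -1)), repeats=5)
print('median runtime: %.6f s' % elapsed)
```

In the measurement loop, a call such as `time_call(sampler.sample, X, y)`, where `sampler` is any oversampler object, could take the place of the manual `t0` bookkeeping, at the cost of a proportionally longer total runtime.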


In [5]:
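# Executing the evaluation for the techniques implemented by both smote_variants and imblearn, involving
# 104 datasets. Note that imblearn's BorderlineSMOTE defaults to kind='borderline-1', so it is not an exact
# counterpart of Borderline_SMOTE2 with these settings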

sv_techniques= [sv.SMOTE(), sv.Borderline_SMOTE2(k_neighbors=10), sv.ADASYN()]
imb_techniques= [SMOTE(), BorderlineSMOTE(), ADASYN()]

results= measure(sv_techniques,
                 imb_techniques,
                 bin_clas.get_data_loaders())

processing: abalone19
processing: abalone9_18
processing: abalone-17_vs_7-8-9-10
processing: abalone-19_vs_10-11-12-13
processing: abalone-20_vs_8_9_10
processing: abalone-22_vs_8
processing: abalone-3_vs_11
processing: ADA
processing: appendicitis
processing: australian
processing: bupa
processing: car_good
processing: car-vgood
processing: cleveland-0_vs_4
processing: CM1
processing: crx


In [6]:
# Printing the mean runtime of each oversampler, in seconds

print(results)

                   smote_variants  imblearn
SMOTE                    0.012090  0.007889
Borderline_SMOTE2        0.023553  0.014605
ADASYN                   0.016606  0.014517
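
The absolute numbers are easier to compare as ratios. The short sketch below recomputes the relative slowdown from the mean runtimes printed above; the values are copied by hand from that table, and the variable names are only illustrative.

```python
# mean runtimes in seconds, copied from the table above: (smote_variants, imblearn)
runtimes = {
    'SMOTE': (0.012090, 0.007889),
    'Borderline_SMOTE2': (0.023553, 0.014605),
    'ADASYN': (0.016606, 0.014517),
}

# a ratio above 1 means smote_variants was slower than imblearn in this run
ratios = {name: t_sv / t_imb for name, (t_sv, t_imb) in runtimes.items()}

for name, ratio in ratios.items():
    print('%-18s %.2f' % (name, ratio))
```

In this run the ratios come out at roughly 1.53 for SMOTE, 1.61 for Borderline_SMOTE2 and 1.14 for ADASYN, so ```smote_variants``` was slower by about 14% to 61% on average, with all means in the range of tens of milliseconds.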
