Comparison of some SMOTE Variants without considering the entire dataset #23

@gykovacs Great work! I want to compare some of the variants of SMOTE. I followed your code in smote_variants/examples/003_evaluation_one_dataset.ipynb and also looked at some examples from your paper, but it oversamples the entire dataset, which should not be the case, since validation must be carried out on the test data only. Could you please provide code showing how I can compare the AUC obtained with SMOTE, BorderlineSMOTE, AHC, CLuster SMOTE, and ADASYN? Thanks in advance.

Comments
Hi, thank you! I think there are multiple things to consider here. First, you cannot evaluate an oversampler alone; you always need a classifier trained on the oversampled dataset. The evaluation should be done in a cross-validation manner, that is, you repeatedly split the dataset into a training and a test set, oversample the training set, fit a classifier to the oversampled training set, and predict the test set. This can be achieved in a couple of lines of code:

import numpy as np
import smote_variants as sv
import imblearn.datasets as imb_datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
# load an imbalanced dataset shipped with imbalanced-learn
libras = imb_datasets.fetch_datasets()['libras_move']
X = libras['data']
y = libras['target']

classifier = DecisionTreeClassifier(max_depth=3, random_state=5)

aucs = []
# cross-validation
for train, test in StratifiedKFold(n_splits=5, shuffle=True, random_state=5).split(X, y):
    # splitting
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    # oversampling the training set only
    X_train_samp, y_train_samp = sv.SMOTE(n_neighbors=3, random_state=5).sample(X_train, y_train)
    classifier.fit(X_train_samp, y_train_samp)
    # prediction on the untouched test set
    y_pred = classifier.predict_proba(X_test)
    # evaluation
    aucs.append(roc_auc_score(y_test, y_pred[:, 1]))
print('AUC', np.mean(aucs))

You can add any number of further classifiers or oversamplers to this evaluation loop. One must be careful that all the oversamplers and classifiers are evaluated on the very same folds of the dataset for comparability. On the other hand, one needs to consider that many oversampling techniques have various parameters that can be tuned. So, it is usually not enough to evaluate SMOTE with one single parameter setting; it needs to be evaluated with many different parameter settings, again in a cross-validation manner, before one could say that one oversampler works better than another. Classifiers also have a bunch of parameters to tune, thus, in order to carry out a proper comparison, one needs to evaluate oversamplers with many parameter combinations and the subsequently applied classifiers with many parameter combinations. This is the only way to draw fair and valid conclusions.

Now, if you think this process through, it amounts to a decent number of oversampling and classification jobs to be executed, and each of them needs to be done with proper cross-validation. You basically have two options. Option #1 is to extend the sample code above to evaluate many oversamplers with many parameter combinations, followed by classifiers with many parameter combinations, in each step of the loop, and then unify the results. Alternatively, option #2, you can use the built-in evaluation functions of the package. Just to emphasize: it is not the case that the oversamplers themselves carry out any train/test splitting.

So, as a summary: just like with most of the machine learning code on GitHub, the oversamplers implemented in smote_variants process and sample all the data you feed them. If you want to do cross-validation by hand, you need to split the dataset yourself, just like in the sample code above. Alternatively, you can use the built-in evaluation functions and carry out all of this work in one single line of code.
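For illustration, here is a minimal sketch of option #1, extending the loop above so that several oversamplers are compared on exactly the same folds. The class names below (sv.Borderline_SMOTE1, sv.ADASYN, sv.AHC, sv.CLuster_SMOTE) are assumed from the package's naming scheme and should be checked against the smote_variants documentation, and it is assumed that each class accepts a random_state parameter the way sv.SMOTE does; all other parameters are left at their defaults and would need tuning as discussed above.

import numpy as np
import smote_variants as sv
import imblearn.datasets as imb_datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

libras = imb_datasets.fetch_datasets()['libras_move']
X, y = libras['data'], libras['target']

# oversampler classes to compare (names assumed, verify against the documentation)
oversamplers = [sv.SMOTE, sv.Borderline_SMOTE1, sv.ADASYN, sv.AHC, sv.CLuster_SMOTE]
aucs = {o.__name__: [] for o in oversamplers}

for train, test in StratifiedKFold(n_splits=5, shuffle=True, random_state=5).split(X, y):
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    for o in oversamplers:
        # every oversampler sees exactly the same training fold
        X_samp, y_samp = o(random_state=5).sample(X_train, y_train)
        classifier = DecisionTreeClassifier(max_depth=3, random_state=5)
        classifier.fit(X_samp, y_samp)
        y_proba = classifier.predict_proba(X_test)[:, 1]
        aucs[o.__name__].append(roc_auc_score(y_test, y_proba))

for name, values in aucs.items():
    print(name, np.mean(values))

With this structure, adding further classifiers or parameter settings only means extending the inner loop, while comparability across the same folds is preserved.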
Thanks a ton! It is really helpful. I will work through the necessary combinations of hyperparameter tuning.
No problem. No, it is not correct.
OK, thanks. It would be really helpful if you could provide a link to an implementation of SMOTE from scratch, not relying on any package.
Well, the point of having packages is exactly to avoid implementing things from scratch. Also, as this is an open-source package, you can find the implementation of all the oversampling techniques in it, from scratch. In particular, the SMOTE algorithm is implemented here: smote_variants/smote_variants/_smote_variants.py, lines 1349 to 1368 (commit 888109f).
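For orientation, here is a minimal, self-contained sketch of the classic SMOTE idea (interpolating between a minority sample and one of its minority-class nearest neighbours). It is only an illustration of the technique, not the smote_variants implementation referenced above, and the function name generate_smote_samples is hypothetical.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def generate_smote_samples(X_min, n_to_sample, n_neighbors=5, random_state=5):
    # X_min: minority-class samples, shape (n_min, n_features)
    rng = np.random.RandomState(random_state)
    # nearest neighbours within the minority class (the closest one is the point itself)
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X_min)
    _, indices = nn.kneighbors(X_min)
    samples = []
    for _ in range(n_to_sample):
        i = rng.randint(len(X_min))                      # a random minority sample
        j = indices[i][rng.randint(1, n_neighbors + 1)]  # one of its neighbours
        step = rng.random_sample()                       # interpolation factor in [0, 1)
        samples.append(X_min[i] + step * (X_min[j] - X_min[i]))
    return np.vstack(samples)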
Hi @shwetashrmgithub, can we close this issue?
Yeah, sure! @gykovacs, but one last thing: I tried to plug other metrics like the F1 score and precision into the first code you provided, to compare various oversampling algorithms, but it throws an error. Could you please look into that? Thanks.
@shwetashrmgithub, well, evaluation functions like f1_score work on predicted class labels rather than predicted probabilities, so the prediction and evaluation steps need to change accordingly:

from sklearn.metrics import f1_score
scores = []
# ... inside the cross-validation loop, replace the prediction and evaluation steps with:
# prediction of crisp class labels
y_pred = classifier.predict(X_test)
# evaluation
scores.append(f1_score(y_test, y_pred))
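If several metrics are needed at once, a minimal sketch (assuming the imports, X, y, and the SMOTE settings from the first code snippet above) could collect them in the same loop:

from sklearn.metrics import roc_auc_score, f1_score, precision_score

aucs, f1s, precisions = [], [], []
for train, test in StratifiedKFold(n_splits=5, shuffle=True, random_state=5).split(X, y):
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    X_samp, y_samp = sv.SMOTE(n_neighbors=3, random_state=5).sample(X_train, y_train)
    classifier = DecisionTreeClassifier(max_depth=3, random_state=5)
    classifier.fit(X_samp, y_samp)
    y_proba = classifier.predict_proba(X_test)[:, 1]  # probabilities for ranking metrics
    y_pred = classifier.predict(X_test)               # crisp labels for F1 and precision
    aucs.append(roc_auc_score(y_test, y_proba))
    f1s.append(f1_score(y_test, y_pred))
    precisions.append(precision_score(y_test, y_pred))

print('AUC', np.mean(aucs), 'F1', np.mean(f1s), 'precision', np.mean(precisions))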