[WIP][RFC] SMOTE oversampling #3269
Conversation
I don't think this is a good idea. First, I don't know how this widget would be used. I imagine that the only use of such data would be for learning. However, oversampled data must not be used in cross-validation or similar schemes, because it would result in duplicates of training data appearing in the test data. Oversampling would have to happen within cross-validation, on the training data only. But for this, the method would have to be implemented as a preprocessor, not within this widget.

Second, I checked the paper and I don't understand how they managed to publish it. If I understood it correctly, they published ROC curves in which individual points correspond to different oversampling settings. In other words, their "ROC curves" (which are not really ROC curves in the usual meaning) show how oversampling affects cross-validation -- whatever this is supposed to show. Unless I'm mistaken about this, I really wonder how this could get past reviewers.
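To make the last point concrete, here is a minimal sketch (not part of this PR; it assumes scikit-learn and the external imbalanced-learn package) of oversampling that touches only the training folds, by placing the sampler inside a pipeline that is refitted within each cross-validation split:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy imbalanced data (roughly 95 % majority class).
X, y = make_classification(n_samples=500, weights=[0.95], random_state=0)

# The sampler runs only when the pipeline is fitted, i.e. only on the
# training folds; the test folds are scored on the original, untouched data.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
```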
We discussed this PR at today's meeting.
It would be, if it worked. My observation is that it only works in papers about oversampling. Balancing imbalanced data results in models that are incorrectly biased towards the minority class. With a 95 % majority, it makes perfect sense to always predict the majority class ... unless your goal is detecting rare cases (which it actually probably is). But then you should not use algorithms that were designed for classification and that optimize classification accuracy (balancing will decrease it), predicted probabilities (balancing will obviously change them) or a similar score. Balancing data won't turn a problem of detecting anomalies into a classification problem. Thus oversampling (even done in such a way that only the training data is oversampled) would lead to bad practice, while I don't see any scenarios in which it would actually be useful.

One could object that Orange is just a tool and it's the user's responsibility not to misuse it. I can think of two other issues in which we had similar discussions. But think about programming languages: modern languages are all about null safety, though one could say it's the user's responsibility not to dereference null. Kotlin's `when` requires listing all cases (or explicitly giving a default branch); wouldn't it be the programmer's responsibility to cover everything that can happen? Language designers are hesitant to add features that are unreadable, confusing or ambiguous. I believe that some reasonable guidance is a good thing.
Hi @janezd - actually, the problem of imbalanced classes is very relevant and important in the field of geoscience data analysis. Addressing imbalance is fundamentally important, e.g., for problems that relate to the prediction of rock types from geophysical data (as one example, https://www.sciencedirect.com/science/article/pii/S2590197419300011; note the use of SMOTE in this paper). **I will freely admit** that SMOTE isn't a perfect solution, but those of us in the geoscience/spatial data fields need (at least) a way to randomly undersample majority classes, and optimally a robust way to balance the classes in a sensible way.
Hi. Emergency physician, pandemic advisor and AI practitioner here: we really need SMOTE for health care research and modeling of the pandemic. As above, not perfect, but the vast majority of our outcome classes are undersampled, and we need to figure out all sorts of synthetic data methods (SMOTE, digital twins, etc.) to improve our models and save lives.
Wow - that's a much more important use case than my rock-related one!
Is it possible to re-open this request, @janezd?
I am following the Orange team's suggestion of using SMOTE (and potentially random undersampling as well) inside a Python Script widget for 511 data points from a stroke dataset. Here is the Python code:

```python
# Copy data to output
new_data = in_data.copy()

# Transform output dataset
oversample = SMOTE(sampling_strategy='auto')
# undersample = RandomUnderSampler(sampling_strategy='auto')

# Borrow some information from the in data to build a new table for output
domain = Domain(in_data.domain.attributes, in_data.domain.class_vars)
```

and this code results in visible differences in an associated Scatter Plot (screenshots omitted). The problem is that there is something wrong with the associated output_data variable, even though the Python Script widget is not signaling anything. When viewed in the Table widget, only the original data is shown, and when put into modelling, the Test and Score widget complains with the following text: `IndexError: index 526 is out of bounds for axis 0 with size 511`. So somehow the output data from the Python Script widget knows it has extra data, and somehow it is not included in the data structure. I guess my problem is that I have not coded it correctly above, but I could not find any examples in the documentation on how to extend the amount of data in the Python Script widget. Any kind of help would be beneficial; perhaps a full working example could be added to the Orange documentation.
@slimebob1975, your code was almost there. Orange's
Be careful with validation: on the
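For reference, a minimal sketch of this kind of fix (assuming the imbalanced-learn package is installed and all features are numeric) is to build a fresh Orange Table from the resampled arrays instead of appending rows to a copy of `in_data`:

```python
from imblearn.over_sampling import SMOTE
from Orange.data import Domain, Table

# in_data is supplied by the Python Script widget.
domain = Domain(in_data.domain.attributes, in_data.domain.class_vars)

# Oversample the minority class on the feature matrix and class column.
X_res, y_res = SMOTE(sampling_strategy='auto').fit_resample(in_data.X, in_data.Y)

# Construct a brand-new table from the resampled arrays and send it downstream.
out_data = Table.from_numpy(domain, X_res, y_res)
```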
Many thanks, now it works as expected. |
It'd be nice if you could add your script here: https://github.com/biolab/orange-scripts
Hi @ajdapretnar, just did so. It seems to work pretty well. On the very unbalanced stroke dataset (1:20) it increased the accuracy on the minority class from 0 % to 67 % while maintaining the accuracy on the majority class at 95 %, using a neural network with standard parameter settings.
Can you show us the details of your workflow? You balanced only the training data, right?
Why would this make you happy? :) So the improved accuracy is what you read from Predictions at the bottom right, not in any Test and Score widget? In that case, it seems OK. I'm sorry for disappointing you. :)
Making mistakes is the only way to learn :-) The following pictures show the results in the confusion matrices, from top to bottom: first the original training data (80 % randomly selected), then the results from the balanced data, and at the bottom the test data (the remaining 20 %).
I've no further comments. :) Looks OK.
@slimebob1975 thanks for posting that workflow! Very relevant for an upcoming short course my colleagues and I will run, pertaining to earth science data.
Issue
Implements SMOTE (or rather the SMOTE-NC variation) - relevant issues: #3198, #2436.
Description of changes
This pull request adds the SMOTE oversampling method and an interface to use it via the Data Sampler widget.
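For readers unfamiliar with the method, here is a rough sketch of the core SMOTE idea (an illustration only, not the code in this PR, which also implements the SMOTE-NC handling of categorical features): synthetic minority samples are created by interpolating between a minority sample and one of its nearest minority-class neighbours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_minority, n_new, k=5, seed=None):
    """Generate n_new synthetic rows by interpolating between minority samples."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)     # idx[:, 0] is each point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))  # pick a random minority sample
        j = rng.choice(idx[i, 1:])         # pick one of its k nearest neighbours
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.vstack(synthetic)
```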
It requires more testing and likely some other changes, but I wanted to get some feedback and ask a question.
The main thing that I'm not sure about is the way output should be provided. With other methods, there's a pretty clear distinction between what belongs in the sample and what in the remaining data (the two outputs of Data Sampler).
With oversampling, I see two ways to divide up the data:
(1) putting the existing data, merged with the newly generated data, into the 'sample' output slot and None into the 'remaining' output slot (the current way), or
(2) putting only the newly generated data into the 'sample' output slot and the existing data into the 'remaining' output slot.
I would like to get a second opinion / new suggestion on how this should be done.
General comments are also much appreciated.
Things I am aware I need to fix:
- `__call__()`