
[WIP][RFC] SMOTE oversampling #3269

Closed · wants to merge 2 commits

Conversation

@matejklemen (Contributor) commented Sep 27, 2018

Issue

Implements SMOTE (or rather the SMOTE-NC variation) - relevant issues: #3198, #2436.
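For context, the core of plain SMOTE can be sketched in a few lines (a hypothetical illustration for continuous features only; the SMOTE-NC variant implemented in this PR additionally handles categorical features): a synthetic point is placed at a random position on the segment between a minority-class sample and one of its k nearest minority-class neighbours.

```python
import numpy as np

def smote_sample(X_min, k=5, rng=None):
    """Generate one synthetic sample from minority-class rows X_min."""
    rng = np.random.default_rng(rng)
    i = rng.integers(len(X_min))
    # Distances from the chosen sample to every minority sample
    d = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
    j = rng.choice(neighbours)
    gap = rng.random()                    # interpolation factor in [0, 1)
    return X_min[i] + gap * (X_min[j] - X_min[i])

# Toy minority class: the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_point = smote_sample(X_min, k=2, rng=0)
```

Since the new point is interpolated between two existing minority samples, it always lies inside their convex hull.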

Description of changes

This pull request adds the SMOTE oversampling method and an interface to use it via Data sampler.
It requires more testing and likely some other changes, but I wanted to get some feedback and ask a question.

The main thing that I'm not sure about is how the output should be provided. With other methods, there's a pretty clear distinction between what belongs in the sample and what in the remaining data (the 2 outputs of Data sampler).
With oversampling, I see two ways to divide up the data:
(1) putting the existing data, merged with the newly generated data, into the 'sample' output slot and None into the 'remaining' output slot (the current way), or
(2) putting the newly generated data into the 'sample' output slot and the existing data into the 'remaining' output slot.
I would like to get a second opinion / new suggestion on how this should be done.

General comments are also much appreciated.

Things I am aware I need to fix:
  • more testing (particularly the case where input is a SQLTable, which I've not given any attention yet),
  • move the SMOTE logic into a separate function (i.e. outside of __call__()),
  • include meta attributes in distance calculation (same way as categorical ones),
  • figure out how to move generation of new data into a new thread (GUI currently freezes on bigger data sets).
Includes
  • Code changes
  • Tests
  • Documentation

@janezd (Contributor) commented Nov 24, 2018

I don't think this is a good idea.

First, I don't know how this widget would be used. I imagine that the only use of such data would be for learning. However, oversampled data must not be used in cross validation or similar schemes, because it would result in having duplicates of training data in the test data. Oversampling would have to happen within cross validation, on training data only. But for this, the method would have to be implemented as a preprocessor, not within this widget.

Second, I checked the paper and I don't understand how they managed to publish it. If I understood it correctly, they published ROC curves in which individual points correspond to different oversampling settings. In other words, their "ROC curves" (which are not really ROC curves in the usual meaning) show how oversampling affects cross validation -- whatever this is supposed to show. Unless I'm mistaken about this, I really wonder how this could get past reviewers.

@janezd (Contributor) commented Nov 30, 2018

We discussed this PR at today's meeting.

  1. Oversampling does not belong to this widget. If anywhere, it should be in preprocessing.
  2. "If anywhere" refers to the observation that oversampling does not work (except perhaps in papers about oversampling).
  3. If we still decide we want to have it in Orange, we should select an appropriate algorithm. Based on my reading of the paper, SMOTE's approach doesn't look OK, and it was not properly tested, either.

@janezd closed this Nov 30, 2018
@matejklemen deleted the enh_smote branch Feb 2, 2019
@trivedi-group commented Mar 13, 2019

> We discussed this PR at today's meeting.
>
>   1. Oversampling does not belong to this widget. If anywhere, it should be in preprocessing.
>   2. "If anywhere" refers to the observation that oversampling does not work (except perhaps in papers about oversampling).
>   3. If we still decide we want to have it in Orange, we should select an appropriate algorithm. Based on my reading of the paper, SMOTE's approach doesn't look OK, and it was not properly tested, either.

If SMOTE oversampling isn't the right way, then addressing sample imbalance becomes even trickier. Not all studies have the luxury of a balanced sample size or the option to go and collect more data. Artificial oversampling should be implemented as a widget at some point. Machine learning on imbalanced data is bound to produce error-laden models. Being able to train on oversampled data and then test on real, unsampled data is the easier way forward, no?

@janezd (Contributor) commented Mar 13, 2019

It would be if it worked. My observation is that it works only in papers about oversampling.

Balancing imbalanced data results in models that are incorrectly biased towards the minority class. With a 95 % majority, it makes perfect sense to always predict the majority class ... unless your goal is detecting rare cases (which it probably is). But then you should not use algorithms that were designed for classification and that optimize classification accuracy (balancing will decrease it), predicted probabilities (balancing will obviously change them), or a similar score. Balancing the data won't turn an anomaly detection problem into a classification problem.

Thus oversampling (even done in such a way that only the training data is oversampled) would encourage bad practice, while I don't see any scenario in which it would actually be useful.

One could object that Orange is just a tool and it's the user's responsibility not to misuse it. I can think of two other issues in which we had similar discussions. But think about programming languages: modern languages are all about null safety, though one could say it's the user's responsibility not to dereference null. Kotlin's `when` expression requires listing all cases (or explicitly giving a default); wouldn't it be the programmer's responsibility to cover everything that can happen? Language designers are hesitant to add features that are unreadable, confusing or ambiguous. I believe that some reasonable guidance is a good thing.

@AuSpotter commented May 13, 2020

> It would be if it worked. My observation is that it works only in papers about oversampling.
>
> Balancing imbalanced data results in models that are incorrectly biased towards the minority class. With a 95 % majority, it makes perfect sense to always predict the majority class ... unless your goal is detecting rare cases (which it probably is). But then you should not use algorithms that were designed for classification and that optimize classification accuracy (balancing will decrease it), predicted probabilities (balancing will obviously change them), or a similar score. Balancing the data won't turn an anomaly detection problem into a classification problem.
>
> Thus oversampling (even done in such a way that only the training data is oversampled) would encourage bad practice, while I don't see any scenario in which it would actually be useful.
>
> One could object that Orange is just a tool and it's the user's responsibility not to misuse it. I can think of two other issues in which we had similar discussions. But think about programming languages: modern languages are all about null safety, though one could say it's the user's responsibility not to dereference null. Kotlin's `when` expression requires listing all cases (or explicitly giving a default); wouldn't it be the programmer's responsibility to cover everything that can happen? Language designers are hesitant to add features that are unreadable, confusing or ambiguous. I believe that some reasonable guidance is a good thing.

Hi @janezd - actually, the problem of imbalanced classes is a very relevant and important one in the field of geoscience data analysis. Addressing imbalance is fundamentally important, e.g., in problems that relate to the prediction of rock types from geophysical data (as one example, https://www.sciencedirect.com/science/article/pii/S2590197419300011; note the use of SMOTE in this paper).

**I will freely admit** that SMOTE isn't a perfect solution, but those of us in the geoscience/spatial data fields need (at least) a way to randomly undersample majority classes, and optimally a robust way to balance the classes in a sensible way.

@bellini1 commented Aug 6, 2021

Hi. I'm an emergency physician, pandemic advisor and AI practitioner: we really need SMOTE for health-care research and modelling of the pandemic. As above, it's not perfect, but the vast majority of our outcome classes are undersampled, and we need to figure out all sorts of synthetic data methods (SMOTE, digital twins, etc.) to improve our models and save lives.
Please think about adding it, Orange folks!
Thanks

@AuSpotter commented Aug 6, 2021

Wow - that's a much more important use-case than my rock related one!

@AuSpotter commented Aug 6, 2021

Is it possible to re-open this request, @janezd ?

@slimebob1975 commented Oct 22, 2021

I am following the Orange team's suggestion of using SMOTE (and potentially random undersampling as well) inside a Python Script widget, for 511 data points from a stroke dataset.

Here is the Python code:

```python
import Orange
import sklearn
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from Orange.data import Domain, Table

# Copy data to output
new_data = in_data.copy()

# Transform output dataset
oversample = SMOTE(sampling_strategy='auto')
new_data.X, new_data.Y = oversample.fit_resample(new_data.X, new_data.Y)

# undersample = RandomUnderSampler(sampling_strategy='auto')
# new_data.X, new_data.Y = undersample.fit_resample(new_data.X, new_data.Y)

# Borrow some information from the in data to build a new table for output
domain = Domain(in_data.domain.attributes, in_data.domain.class_vars)
out_data = Table.from_table(domain, new_data)
```

and this code results in the following differences in an associated Scatter Plot:

[Screenshots: scatter plot before (SMOTE_example_before) and after (SMOTE_example_after) oversampling]

The problem is that there is something wrong with the associated out_data variable, even though the Python Script widget is not signaling anything. When viewed in the Table widget, only the original data is shown, and when used for modelling, the Test and Score widget complains with the following text:

 IndexError: index 526 is out of bounds for axis 0 with size 511.

So somehow the output of the Python Script widget knows it has extra data, and yet the extra rows are not included in the data structure. I guess my problem is that I have not coded this correctly above, but I could not find any examples in the documentation on how to extend the amount of data in the Python Script widget.

Any kind of help would be beneficial; perhaps a full working example could be added to the Orange documentation.

@markotoplak (Member) commented Oct 22, 2021

@slimebob1975, your code was almost there. Orange's Table contains more arrays than X and Y, which have to match, so do not replace them directly. Instead, use code like this:

```python
oversample = SMOTE(sampling_strategy='auto')
X, Y = oversample.fit_resample(in_data.X, in_data.Y)

domain = Domain(in_data.domain.attributes, in_data.domain.class_vars)
out_data = Table.from_numpy(domain, X, Y)
```

Be careful with validation: you absolutely must not use cross-validation on out_data - that would completely invalidate your results. At a minimum, you will have to split the data into train and test sets first and then over/undersample only the training part. This is approximately what @janezd meant with his comment "Oversampling does not belong to this widget (Data Sampler). If anywhere, it should be in preprocessing."

@slimebob1975 commented Oct 22, 2021

Many thanks, now it works as expected.

@ajdapretnar (Contributor) commented Oct 22, 2021

It'd be nice if you could add your script here: https://github.com/biolab/orange-scripts

@slimebob1975 commented Oct 22, 2021

Hi @ajdapretnar, just did so. It seems to work pretty well. On the very unbalanced stroke dataset (1:20), it increased the accuracy on the minority class from 0 % to 67 % while maintaining the accuracy on the majority class at 95 %, using a neural network with standard parameter settings.

@janezd (Contributor) commented Oct 22, 2021

Can you show us the details of your workflow? You balanced only the training data, right?

@slimebob1975 commented Oct 22, 2021

Yes, I balanced only the training set. I might be mistaken, though; I put this together yesterday and today. I would be happy if you could see any flaws in my workflow.
[Screenshot: workflow (SMOT_test_example)]

@janezd (Contributor) commented Oct 22, 2021

> I would be happy if you could see any flaws in my workflow.

Why would this make you happy? :)

So the improved accuracy is what you read from the Predictions widget at the bottom right, not from any Test and Score widget? In that case, it seems OK. I'm sorry for disappointing you. :)

@slimebob1975 commented Oct 22, 2021

Making mistakes is the only way to learn :-)

The following picture shows the results in the Confusion Matrix widgets, from top to bottom: first the original training data (80 % randomly selected), then the results on the balanced data, and at the bottom the test data (the remaining 20 %).

[Screenshot: confusion matrices (SMOT_test_example_results)]

@janezd (Contributor) commented Oct 22, 2021

I've no further comments. :) Looks OK.

@ThirstyGeo commented Oct 22, 2021

@slimebob1975 thanks for posting that workflow! Very relevant for an upcoming short course my colleagues and I will run, pertaining to earth science data.

9 participants