
Class Imbalance Data splitter samplers #1775

Merged: 124 commits merged on Mar 11, 2021

Conversation

@bchen1116 (Contributor) commented Feb 1, 2021

Fixes #1786

This PR adds a few class-imbalance data splitters to determine which sampling strategies will be beneficial for EvalML.

Current issues/questions:

  • SMOTE cannot handle categorical data, so our current implementations can only handle numeric data. SMOTE also treats all features as continuous when generating synthetic samples, so this is something we should keep in mind. For mixed categorical and numeric data, we can incorporate SMOTENC, which I am addressing in a separate PR (new issue here).
  • The default implementation for the samplers cannot handle small samples of severely imbalanced data well (for instance, 200 rows with a [0.02, 0.98] class balance). In this scenario, especially for KMeansSMOTE, we would want the user to pass in the data_splitter object initialized with the proper thresholds. However, how should we tell users to do this? Is this a message we should try to catch/throw in AutoMLSearch (might be hard to catch), or can we just put it in the docs?
  • These names might be confusing to users. Any suggestions for simplifying the names, or for how users pass these splitters to EvalML?
  • Is it confusing to create and use the sampler arg? I figured it would be nice to avoid an import for the user, but should we stick with just having them pass in the class instance for consistency?
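The failure mode in the second bullet is easy to quantify. The following is a sketch of the arithmetic, assuming imbalanced-learn's documented SMOTE default of k_neighbors=5:

```python
# Why a [0.02, 0.98] class balance over 200 rows breaks SMOTE's defaults.
n_rows = 200
minority_fraction = 0.02
n_minority = int(n_rows * minority_fraction)  # only 4 minority rows

# SMOTE interpolates each synthetic point between a minority sample and one
# of its k nearest minority neighbors, so it needs at least k_neighbors + 1
# minority samples to run at all. The default k_neighbors is 5.
k_neighbors = 5
enough_for_smote = n_minority > k_neighbors

print(n_minority, enough_for_smote)
```

With only 4 minority rows against a required 6, the splitter must either error out or be constructed with a smaller k_neighbors, which is why the thresholds question above matters.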

Docs for API reference here and for usage here
Clean Performance doc here.

@bchen1116 bchen1116 self-assigned this Feb 1, 2021
@codecov bot commented Feb 2, 2021

Codecov Report

Merging #1775 (54caa45) into main (91775ff) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1775     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         268      273      +5     
  Lines       22076    22339    +263     
=========================================
+ Hits        22070    22333    +263     
  Misses          6        6             
Impacted Files Coverage Δ
evalml/automl/automl_search.py 100.0% <ø> (ø)
evalml/tests/utils_tests/test_gen_utils.py 100.0% <ø> (ø)
evalml/utils/__init__.py 100.0% <ø> (ø)
evalml/preprocessing/__init__.py 100.0% <100.0%> (ø)
evalml/preprocessing/data_splitters/__init__.py 100.0% <100.0%> (ø)
...lml/preprocessing/data_splitters/base_splitters.py 100.0% <100.0%> (ø)
evalml/preprocessing/data_splitters/rus_split.py 100.0% <100.0%> (ø)
evalml/preprocessing/data_splitters/smote_split.py 100.0% <100.0%> (ø)
...lml/preprocessing/data_splitters/smote_tl_split.py 100.0% <100.0%> (ø)
...alml/preprocessing/data_splitters/smotenc_split.py 100.0% <100.0%> (ø)
... and 9 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

        X_train, X_valid = self.X_train.iloc[train], self.X_train.iloc[valid]
        y_train, y_valid = self.y_train.iloc[train], self.y_train.iloc[valid]
    else:
        X_train, y_train = train[0], train[1]
@bchen1116 (Contributor, Author) commented Feb 3, 2021:
The sampler splitter returns the data as [(X_train, y_train), (X_test, y_test)] tuples, not indices. We can't return indices with SMOTE since it generates synthetic data.
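A toy illustration of why index-based splits can't work here. This is a sketch with a stand-in duplicate-and-jitter oversampler (toy_oversample is a hypothetical name, not EvalML's or imblearn's SMOTE): the synthetic rows simply do not exist in the original X, so no index into X can refer to them.

```python
import random

def toy_oversample(X, y, target_class=1):
    # Stand-in for SMOTE: create new minority rows by copying an existing
    # minority row and adding a small random jitter to each feature.
    minority = [(x, t) for x, t in zip(X, y) if t == target_class]
    need = sum(1 for t in y if t != target_class) - len(minority)
    random.seed(0)
    synthetic = []
    for _ in range(need):
        x, t = random.choice(minority)
        synthetic.append(([v + random.uniform(-0.1, 0.1) for v in x], t))
    # The output must be the data itself: synthetic rows have no index in X.
    X_out = list(X) + [s[0] for s in synthetic]
    y_out = list(y) + [s[1] for s in synthetic]
    return X_out, y_out

X = [[0.0], [1.0], [2.0], [3.0], [10.0]]
y = [0, 0, 0, 0, 1]
X_res, y_res = toy_oversample(X, y)  # 5 original rows + 3 synthetic rows
```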

@@ -9,7 +9,7 @@ def test_has_minimal_deps(has_minimal_dependencies):
     lines = open(reqs_path, 'r').readlines()
     lines = [line for line in lines if '-r ' not in line]
     reqs = requirements.parse(''.join(lines))
-    extra_deps = [req.name for req in reqs]
+    extra_deps = [req.name if req.name != 'imbalanced-learn' else 'imblearn' for req in reqs]
@bchen1116 (Contributor, Author) commented:
The pip install is for imbalanced-learn, but the import itself is imblearn, which is why I added this conditional here.

A contributor replied:

Why is this not an issue with scikit-learn? Same pattern: scikit-learn is the pip install but the import is sklearn 🤔

@bchen1116 (Contributor, Author) replied:

I think it's because this test looks only at the extra requirements in the requirements.txt file, not including the modules in core_requirements.txt, which is where scikit-learn is.

The contributor replied:
👍

Once we add full support for oversampling, I hope we can either move imbalanced-learn to core-requirements.txt or remove it in favor of our own impls. At that point, this won't be an issue, right?
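The package-name vs. import-name mismatch discussed above can be sketched as a simple lookup. This is a sketch with hypothetical names (PIP_TO_IMPORT, import_name); the actual test derives the names from requirements.txt via requirements-parser:

```python
# Map pip distribution names to their importable module names where the
# two differ; fall back to the pip name when they agree.
PIP_TO_IMPORT = {
    'imbalanced-learn': 'imblearn',
    'scikit-learn': 'sklearn',
}

def import_name(pip_name):
    return PIP_TO_IMPORT.get(pip_name, pip_name)
```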

@dsherry (Contributor) left a review:

@bchen1116 this is looking pretty good to me!

I had two main comments. Neither blocks the merge, but we should address them either in this PR or a follow-on PR:

  1. Let's clean up the inheritance structure we're using. Before this PR we had TrainingValidationSplit, which inherits from sklearn's BaseCrossValidator. Now we'd like to add a few splitters which perform undersampling or oversampling. I also see there's duplicated code between BaseTVSplit and BaseCVSplit. My suggestion:
     • Define one class BaseSamplingSplitter which inherits from BaseCrossValidator. That's where you can put all the changes that are currently in your two base classes.
     • Have all the oversamplers inherit from BaseSamplingSplitter.
     • The TV methods can just ignore the constructor value for n_splits and hard-code it to 1, and the CV methods can keep n_splits exposed via the constructor.
  2. I left a comment about _convert_numeric_dataset_for_data_sampler. I think we should split that into two methods, one for each direction we need.
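The suggested hierarchy can be sketched as follows. BaseSamplingSplitter is the name proposed in the review; the concrete subclass names are hypothetical, and a plain class stands in for sklearn's BaseCrossValidator:

```python
class BaseSamplingSplitter:
    """Shared base for sampling splitters (stand-in for BaseCrossValidator)."""
    def __init__(self, sampler=None, n_splits=3):
        self.sampler = sampler
        self.n_splits = n_splits

    def get_n_splits(self):
        return self.n_splits

class TVSamplingSplitter(BaseSamplingSplitter):
    """Train/validation splitter: ignores any n_splits and hard-codes 1."""
    def __init__(self, sampler=None, n_splits=None):
        super().__init__(sampler=sampler, n_splits=1)

class CVSamplingSplitter(BaseSamplingSplitter):
    """Cross-validation splitter: keeps n_splits exposed via the constructor."""
    pass
```

This keeps the shared sampling logic in one place, and the TV/CV distinction collapses to how n_splits is handled.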

@bchen1116 bchen1116 merged commit 80267d7 into main Mar 11, 2021
@dsherry dsherry mentioned this pull request Mar 24, 2021
@freddyaboulton freddyaboulton deleted the bc_973_imbalance branch May 13, 2022 15:17
Successfully merging this pull request may close these issues.

Implement SMOTENC as a data splitting strategy for handling categorical data
6 participants