Class Imbalance Data splitter samplers #1775
Codecov Report
```diff
@@            Coverage Diff            @@
##             main    #1775    +/-   ##
=========================================
+ Coverage   100.0%   100.0%    +0.1%
=========================================
  Files         268      273       +5
  Lines       22076    22339     +263
=========================================
+ Hits        22070    22333     +263
  Misses          6        6
=========================================
```
Continue to review full report at Codecov.
evalml/automl/automl_search.py
```python
X_train, X_valid = self.X_train.iloc[train], self.X_train.iloc[valid]
y_train, y_valid = self.y_train.iloc[train], self.y_train.iloc[valid]
else:
    X_train, y_train = train[0], train[1]
```
The sampler splitter passes out the `[(X_train, y_train), (X_test, y_test)]` data as tuples, not indices. We can't pass out indices with SMOTE since it generates synthetic data.
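A minimal sketch of how a consuming loop might branch on the two split shapes (data tuples from a sampling splitter vs. index arrays from a classic CV splitter). The function name and the use of plain lists are illustrative, not the actual evalml implementation:

```python
# Hypothetical sketch: a sampling splitter (e.g. one applying SMOTE) must
# yield the resampled data itself, because synthetic rows have no index in
# the original training set. Index-based splitters keep yielding indices.

def iter_folds(splits, X, y):
    """Yield (X_train, y_train, X_valid, y_valid) for either split style."""
    for train, valid in splits:
        if isinstance(train, tuple):
            # Sampling splitter: each side is already an (X, y) data tuple.
            X_train, y_train = train
            X_valid, y_valid = valid
        else:
            # Classic CV splitter: each side is a sequence of row indices.
            X_train = [X[i] for i in train]
            y_train = [y[i] for i in train]
            X_valid = [X[i] for i in valid]
            y_valid = [y[i] for i in valid]
        yield X_train, y_train, X_valid, y_valid
```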
```diff
@@ -9,7 +9,7 @@ def test_has_minimal_deps(has_minimal_dependencies):
     lines = open(reqs_path, 'r').readlines()
     lines = [line for line in lines if '-r ' not in line]
     reqs = requirements.parse(''.join(lines))
-    extra_deps = [req.name for req in reqs]
+    extra_deps = [req.name if req.name != 'imbalanced-learn' else 'imblearn' for req in reqs]
```
The pip install is for `imbalanced-learn`, but the import itself is `imblearn`, which is why I added this conditional here.
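The distribution-name vs. import-name mismatch can be sketched in a self-contained way without the `requirements` parsing library. The `PIP_TO_IMPORT` mapping and helper below are hypothetical, just to illustrate the conditional being discussed:

```python
# Some packages install under one name (pip) but import under another.
# "imbalanced-learn" -> "imblearn" is the case discussed in this PR;
# this mapping and function are an illustrative sketch, not evalml code.
PIP_TO_IMPORT = {"imbalanced-learn": "imblearn"}

def import_names(requirement_lines):
    """Map requirement lines to the module names you'd actually import."""
    names = []
    for line in requirement_lines:
        line = line.strip()
        # Skip blanks, comments, and '-r other.txt' includes.
        if not line or line.startswith("#") or "-r " in line:
            continue
        # Split off any version specifier, e.g. "imbalanced-learn==0.7.0".
        name = line.split("==")[0].split(">=")[0].split("<")[0].strip()
        names.append(PIP_TO_IMPORT.get(name, name))
    return names
```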
Why is this not an issue with scikit-learn? Same pattern: scikit-learn is the pip install, but the import is sklearn 🤔
I think it's because this test looks only at the extra requirements in the `requirements.txt` file, not the modules in `core_requirements.txt`, which is where scikit-learn is.
👍 Once we add full support for oversampling, I hope we can either move `imbalanced-learn` to `core-requirements.txt` or remove it in favor of our own impls. At that point, this won't be an issue, right?
@bchen1116 this is looking pretty good to me!

I had two main comments. Neither blocks the merge, but we should address them either in this PR or a follow-on PR:

- Let's clean up the inheritance structure we're using. Before this PR we had `TrainingValidationSplit`, which inherits from sklearn's `BaseCrossValidator`. Now we'd like to add a few splitters which perform undersampling or oversampling. I also see there's duplicated code between `BaseTVSplit` and `BaseCVSplit`. My suggestion:
  - Define one class `BaseSamplingSplitter` which inherits from `BaseCrossValidator`. That's where you can put all the changes that are currently in your two base classes.
  - Have all the oversamplers inherit from `BaseSamplingSplitter`.
  - The TV methods can just ignore the constructor value for `n_splits` and hard-code it to 1, and the CV methods can keep `n_splits` exposed via the constructor.
- I left a comment about `_convert_numeric_dataset_for_data_sampler`. I think we should split that into two methods, one for each direction we need.
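The suggested hierarchy might look roughly like the sketch below. `BaseCrossValidator` here is a minimal stand-in for sklearn's `BaseCrossValidator` so the example is self-contained, and all class names are illustrative, not evalml's actual API:

```python
# Stand-in for sklearn.model_selection.BaseCrossValidator (assumption made
# so this sketch runs without sklearn installed).
class BaseCrossValidator:
    def get_n_splits(self):
        raise NotImplementedError


class BaseSamplingSplitter(BaseCrossValidator):
    """Shared home for the logic currently duplicated in the two base classes."""

    def __init__(self, sampler=None, n_splits=3):
        self.sampler = sampler
        self.n_splits = n_splits

    def get_n_splits(self):
        return self.n_splits


class TrainingValidationSamplingSplit(BaseSamplingSplitter):
    """TV variant: ignores the requested n_splits and hard-codes it to 1."""

    def __init__(self, sampler=None, n_splits=3):
        super().__init__(sampler=sampler, n_splits=1)


class CVSamplingSplit(BaseSamplingSplitter):
    """CV variant: keeps n_splits exposed via the constructor."""
```

This keeps one base class for the sampling behavior while letting the TV and CV variants differ only in how they treat `n_splits`.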
fix #1786
Creating a few class imbalance data splitters to determine which strategies will be beneficial for EvalML.

Current issues/questions:

- With `KMeansSMOTE`, we would want the user to pass in the data_splitter object initialized with the proper thresholds. However, how should we tell users to do this? Is this a message we should try to catch/throw in AutoMLSearch (might be hard to catch), or can we just put it in the docs?
- On the `sampler` arg: I figured it would be nice to avoid an import for the user, but should we stick with just having them pass in the class instance for consistency?

Docs for API reference here and for usage here.
Clean Performance doc here.