Consolidate oversamplers to select based on true input data #2695
Conversation
Codecov Report
```
@@            Coverage Diff            @@
##             main    #2695     +/-   ##
=========================================
- Coverage    99.9%    99.9%    -0.0%
=========================================
  Files         300      300
  Lines       27459    27426      -33
=========================================
- Hits        27415    27382      -33
  Misses         44       44
```
Continue to review the full report at Codecov.
("binary", {0: 1, 1: 0.5}, 900), | ||
("binary", {0: 1, 1: 0.8}, 1080), | ||
("binary", {0: 1, 1: 0.5}, 1000), | ||
("binary", {0: 1, 1: 0.8}, 1200), |
These changes, and a few others in this file, were necessary to make the tests pass, but I have no idea why. In the debugging I did to try and figure out what was wrong, I actually couldn't work out how these tests passed in the first place, as the issue seemed to arise in code I didn't change.
@bchen1116 any thoughts? It doesn't make sense that these values should change
Is this related to changing the ratio of the target below to `y = pd.Series([0] * 1000 + [1] * 200)`, or did you change that to get the tests to pass as well?
RE the logic behind these tests originally (also writing this out for my own sanity / logic-checking):
For the binary case, we have a target of `y = pd.Series([0] * 900 + [1] * 300)`; AutoML does CV on this (though it's unclear what the exact split of targets might be? 😬). Breakpointing tells me that the oversampler component sees a training target of 800 values: 600 of class 0 and 200 of class 1. Hence, oversampling with ratio `{0: 1, 1: 0.5}` results in a new target of `600 * 1 + max(600 * 0.5, 200)` => total length of 900. The second case is `600 * 1 + max(600 * 0.8, 200)` => total length of 1080.
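As a sanity check on that arithmetic, here's a plain-Python sketch using the counts observed at the breakpoint (the max-with-existing-count behavior is my reading of the comment above, not code from this PR):

```python
# Training target the oversampler sees: 600 of class 0, 200 of class 1.
majority, minority = 600, 200

def oversampled_length(ratio_for_class_1):
    # Oversampling raises class 1 toward ratio * majority_count, but the
    # class never shrinks below its existing count.
    new_minority = max(int(majority * ratio_for_class_1), minority)
    return majority + new_minority

assert oversampled_length(0.5) == 900   # 600 + max(300, 200)
assert oversampled_length(0.8) == 1080  # 600 + max(480, 200)
```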
I also changed the target to `y = pd.Series([0] * 1000 + [1] * 200)` to get the test to pass.
With the original settings, these lines:
```python
counts = y.value_counts()
minority_class = min(counts)
class_ratios = minority_class / counts
if all(class_ratios >= sampler_balanced_ratio):
    return None
```
were setting the sampler to `None` before the Oversampler could be selected.
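To make the effect concrete, here's a sketch of that check applied to both targets (the 0.25 default for `sampler_balanced_ratio` is an assumption):

```python
import pandas as pd

def sampler_is_skipped(y, sampler_balanced_ratio=0.25):  # default value assumed
    counts = y.value_counts()
    minority_class = min(counts)
    class_ratios = minority_class / counts
    # If every class is already at least this balanced, no sampler is chosen.
    return all(class_ratios >= sampler_balanced_ratio)

# Original target: ratios are 300/900 ≈ 0.33 and 300/300 = 1.0 -> sampler skipped.
assert sampler_is_skipped(pd.Series([0] * 900 + [1] * 300))
# Updated target: ratios are 200/1000 = 0.2 and 200/200 = 1.0 -> sampler selected.
assert not sampler_is_skipped(pd.Series([0] * 1000 + [1] * 200))
```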
But before this change was made to consolidate all of the oversamplers, this test hardcoded automl to set the sampler, so we should never have run this code in `get_best_sampler_for_data`! Now that we're consolidating all of the oversamplers, `self.sampler_method in ["auto", "Oversampler"]` evaluates to `True`. If it's setting the sampler to `None` here, we might need to change the logic? If we're just interested in testing the SMOTE oversampler specifically, we can also mock `get_best_sampler_for_data` to return one of the SMOTE oversamplers!
```python
if self.sampler_method in ["auto", "Oversampler"]:
    self._sampler_name = get_best_sampler_for_data(
        self.X_train,
        self.y_train,
        self.sampler_method,
        self.sampler_balanced_ratio,
    )
```
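A rough sketch of that mocking suggestion (the patch target, sampler name, and `X_y_binary` fixture are assumptions about evalml's test layout, not code from this PR):

```python
from unittest.mock import patch

from evalml.automl import AutoMLSearch

# The patch target must be the name looked up inside AutoMLSearch.__init__;
# this module path is an assumption.
@patch(
    "evalml.automl.automl_search.get_best_sampler_for_data",
    return_value="SMOTENC Oversampler",  # assumed sampler name
)
def test_sampler_is_pinned_by_mock(mock_best_sampler, X_y_binary):
    X, y = X_y_binary  # assumed binary-classification fixture
    automl = AutoMLSearch(
        X_train=X, y_train=y, problem_type="binary", sampler_method="auto"
    )
    assert automl._sampler_name == "SMOTENC Oversampler"
```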
Ahh ok, I finally understand how this came up as an issue! Thanks so much - I like your idea of mocking `get_best_sampler_for_data`; I'll implement that. I'm also going to update the sampler selection logic in `AutoMLSearch.__init__`.
This test looks good to me! The current numbers check out, but I believe the previous numbers could have been left the same and the test would still pass?
Becca, great work, as usual. I'd definitely like to get @bchen1116's opinion on this, as I think he set up most of this to begin with and would thus have a good perspective on it. The changes I see that we might need are the reintroduction of explicit testing of the oversampler selection logic, both to verify that it's working as intended and to demonstrate to devs what's happening, as well as perhaps combining the selection logic as mentioned.
(Resolved, outdated review thread on evalml/pipelines/components/transformers/samplers/oversampler.py)
("binary", {0: 1, 1: 0.5}, 900), | ||
("binary", {0: 1, 1: 0.8}, 1080), | ||
("binary", {0: 1, 1: 0.5}, 1000), | ||
("binary", {0: 1, 1: 0.8}, 1200), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bchen1116 any thoughts? It doesn't make sense that these values should change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks for doing this!
Left a few comments/questions, but this looks good to me! Can we add a test where we make sure the sample code in the issue passes with this new implementation? I think it would be useful to keep it as a test to ensure that future changes don't break our pipelines.
("binary", {0: 1, 1: 0.5}, 900), | ||
("binary", {0: 1, 1: 0.8}, 1080), | ||
("binary", {0: 1, 1: 0.5}, 1000), | ||
("binary", {0: 1, 1: 0.8}, 1200), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test looks good to me! The current numbers check out, but I believe the previous numbers could have been left the same and this test still passes?
Closes #2605 by replacing `SMOTEOversampler`, `SMOTENOversampler`, and `SMOTENCOversampler` with a single `Oversampler` class, and moving the SMOTE/N/C selection logic into `Oversampler.fit()`.
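For readers landing here from the issue, a minimal sketch of what "select SMOTE/N/C based on the true input data" can look like; the `imblearn` class names are real, but this helper and its column-type check are illustrative assumptions, not the PR's actual code:

```python
from imblearn.over_sampling import SMOTE, SMOTEN, SMOTENC

def _choose_smote_variant(X, categorical_indices, **sampler_kwargs):
    # All-numeric features: plain SMOTE; all-categorical: SMOTEN;
    # a mix of both: SMOTENC, which needs the categorical column indices.
    if not categorical_indices:
        return SMOTE(**sampler_kwargs)
    if len(categorical_indices) == X.shape[1]:
        return SMOTEN(**sampler_kwargs)
    return SMOTENC(categorical_features=categorical_indices, **sampler_kwargs)
```

Running this choice inside `Oversampler.fit()` means the variant is picked from the data the component actually receives during search, rather than from what the pipeline guessed up front.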