Add Custom Downsampler #1857
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main    #1857      +/-   ##
==========================================
+ Coverage   100.0%   100.0%     +0.1%
==========================================
  Files         260      263        +3
  Lines       20893    21215      +322
==========================================
+ Hits        20887    21209      +322
  Misses          6        6
Continue to review full report at Codecov.
class BalancedClassificationSampler():
Missing base class
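For illustration, a minimal sketch of what the suggested fix might look like; SamplerBase is a hypothetical name, since the actual base class isn't shown in this thread:

# "SamplerBase" is an assumed stand-in for whatever shared sampler
# interface evalml defines; the inheritance is the only point here.
class SamplerBase:
    """Assumed common interface for samplers."""

class BalancedClassificationSampler(SamplerBase):
    def __init__(self, balanced_ratio, min_samples, min_percentage):
        self.balanced_ratio = balanced_ratio
        self.min_samples = min_samples
        self.min_percentage = min_percentage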
@bchen1116 I think this is great! The tests are very solid.
Nothing blocking from me.
# if any classes have fewer than min_samples counts and are less than min_percentage of the total data,
# then it's severely imbalanced
if any(counts < self.min_samples) and any(normalized_counts < self.min_percentage):
    return {}
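For a concrete sense of when this early return fires, a minimal sketch with invented numbers (thresholds assumed to be min_samples=100 and min_percentage=0.01):

import pandas as pd

counts = pd.Series({"majority": 1500, "minority": 5})
normalized_counts = counts / counts.sum()
min_samples, min_percentage = 100, 0.01
# "minority" has 5 samples (< 100) and is ~0.3% of the data (< 1%),
# so the check fires and no downsampling happens at all
print(any(counts < min_samples) and any(normalized_counts < min_percentage))  # True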
Playing devil's advocate here, should severely imbalanced and balanced data produce the same behavior?
The user's data can fall into four cases:
- Balanced
- Severely Imbalanced
- Imbalanced and can downsample
- Imbalanced but can't downsample because of min_samples
Right now users can't tell whether they are in cases 1, 2, or 4 and I think there is value in communicating that. Maybe it belongs in our class imbalance data check though.
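To make the four cases concrete, here are invented example distributions (assuming min_samples=100, min_percentage=0.01, and a target ratio of 1:1):

# case 1, balanced: nothing to do
balanced = {0: 1000, 1: 1000}
# case 2, severely imbalanced: 50 < min_samples and 50/10050 < min_percentage,
# so the sampler silently returns {}
severely_imbalanced = {0: 10000, 1: 50}
# case 3, imbalanced and can downsample: class 0 is cut from 100000 to 40000
can_downsample = {0: 100000, 1: 40000}
# case 4, imbalanced but can't downsample: cutting class 0 toward 1:1 would
# put it below min_samples, so it is left untouched
cannot_downsample = {0: 90, 1: 40}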
@freddyaboulton Yeah, in discussion with @dsherry we realized we need to figure out how to get the class imbalance data check to work with this. I definitely agree that the behavior could be confusing, but as long as the data check can help us catch the severely-imbalanced case and warn the user, that should make the behavior clear.
@freddyaboulton just filed a new issue to track updating the class imbalance data check
Very nice job - super thorough testing. Love it. I had a comment, which I don't think is merge-blocking, that we should probably resolve (or it may already be resolved) concerning what to do when a class cannot be downsampled to the desired ratio.
# find goal size, round it down if it's a float
goal_value = max(int((self.balanced_ratio * counts.values[-1]) // 1), self.min_samples)
# we don't want to drop less than 0 rows
drop_values = {k: max(0, counts[k] - goal_value) for k in undersample_classes}
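To make the arithmetic concrete, a small worked sketch with invented numbers (balanced_ratio=2, min_samples=100; counts is sorted descending so counts.values[-1] is the minority class, and for simplicity it iterates all classes where the PR iterates undersample_classes):

import pandas as pd

balanced_ratio, min_samples = 2, 100
counts = pd.Series({0: 1000, 1: 300})  # majority first, minority last
goal_value = max(int((balanced_ratio * counts.values[-1]) // 1), min_samples)
# goal_value == max(600, 100) == 600
drop_values = {k: max(0, counts[k] - goal_value) for k in counts.index}
print(drop_values)  # {0: 400, 1: 0}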
For some reason my original incomplete review was deleted... but here: if we are trying to hit a goal ratio but don't end up subtracting down to the goal_value, then the count remains unchanged and so does the ratio. Is that the behavior we want? If the goal ratio cannot be achieved, do nothing and don't alert the user? I'm sure you and Dylan discussed this - I'm just playing some catch-up.
y = pd.Series([0] * 90 + [1] * 100 + [2] * 120 + [3] * 40 + [4] * 70)
# will downsample the [2] target
# will try to downsample [0] and [4], but max(0, x) will prevent that
bcs = BalancedClassificationSampler(balanced_ratio=1, min_percentage=0.01)
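Working through the test's numbers by hand (assuming the default min_samples is 100, which the inline comment about max(0, x) implies):

# value counts, descending: {2: 120, 1: 100, 0: 90, 4: 70, 3: 40}
# goal_value = max(int(1 * 40), 100) = 100
# class 2: max(0, 120 - 100) = 20 rows dropped
# classes 1, 0, 4, 3: already at or below 100, so max(0, ...) = 0 and they
# are left untouched -- the silent no-op questioned below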
I think this is the test my comment above was interested in. I would expect that, when using the sampler, the populations end up at the balanced ratios. Do we want the lack of downsampling for 0 and 4 to be silent?
Mmm, yeah that's fair. My thought was that we cannot sample if we would end up with fewer than min_samples values of a class. For instance, if we have 40 instances of class 1 and 100 instances of class 2, we might not want to sample the 100 instances down even if we wanted a 1:1 ratio, whereas with 40,000 vs 100,000 samples it would be fine to sample the 100,000 down to 40,000 since we have enough data. As for warning or telling the user, I wasn't sure whether this was something I should warn the user about or do silently. Thoughts @chukarsten @dsherry?
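Putting numbers on that rationale (assuming min_samples=100 and balanced_ratio=1 for the example):

# small data: counts {2: 100, 1: 40} -> goal_value = max(1 * 40, 100) = 100,
# so class 2 keeps all 100 rows; cutting it to 40 would leave too little data
# large data: counts {2: 100000, 1: 40000} -> goal_value = max(40000, 100) = 40000,
# so class 2 is cut from 100000 to 40000 and the 1:1 ratio is achieved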
I think the understanding should be that if the number of instances falls below the threshold for certain classes, as it does for "0" and "4" here, that the method will simply not downsample those classes. No error in that case.
Part 1 of 2 for #973.
This PR adds the custom sampler to EvalML; we will turn it into a data splitter once merged with the class imbalance PR. It does not include the actual data splitter itself.
Functionality of this sampler (and the future data splitter) is described here.
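For orientation, a hypothetical usage sketch. The import path is inferred from the file paths discussed above, and the fit_resample method name and return convention are assumptions, not confirmed API:

import pandas as pd
# import path inferred from the files touched by this PR; may differ
from evalml.preprocessing.data_splitters.balanced_classification_splitter import BalancedClassificationSampler

X = pd.DataFrame({"feature": range(420)})
y = pd.Series([0] * 90 + [1] * 100 + [2] * 120 + [3] * 40 + [4] * 70)

sampler = BalancedClassificationSampler(balanced_ratio=1, min_percentage=0.01)
# assumed method name; presumably returns the downsampled X and y
X_sampled, y_sampled = sampler.fit_resample(X, y)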