
Add Custom Downsampler #1857

Merged
merged 22 commits into from Feb 23, 2021

Conversation

@bchen1116 bchen1116 commented Feb 18, 2021

Part 1 of 2 for #973.

This PR adds the custom sampler to EvalML; once merged with the class imbalance PR, we will turn it into a data splitter. This PR does not include the actual data splitter itself.

Functionality of this sampler (and future data splitter) here

@bchen1116 bchen1116 self-assigned this Feb 18, 2021

codecov bot commented Feb 18, 2021

Codecov Report

Merging #1857 (d021425) into main (31f9226) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1857     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         260      263      +3     
  Lines       20893    21215    +322     
=========================================
+ Hits        20887    21209    +322     
  Misses          6        6             
Impacted Files Coverage Δ
evalml/preprocessing/data_splitters/__init__.py 100.0% <100.0%> (ø)
...data_splitters/balanced_classification_splitter.py 100.0% <100.0%> (ø)
...valml/preprocessing/data_splitters/sampler_base.py 100.0% <100.0%> (ø)
...ing_tests/test_balanced_classification_splitter.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@bchen1116 bchen1116 changed the title from "Add Custom Downsampling Datasplitter" to "Add Custom Downsampler" Feb 19, 2021
@bchen1116 bchen1116 marked this pull request as ready for review February 19, 2021 15:42
)


class BalancedClassificationSampler():
Contributor

Missing base class


@freddyaboulton freddyaboulton left a comment


@bchen1116 I think this is great! The tests are very solid.

Nothing blocking from me.

# if any class has fewer than min_samples counts and makes up less than min_percentage of the total data,
# then it's severely imbalanced
if any(counts < self.min_samples) and any(normalized_counts < self.min_percentage):
return {}
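The quoted check can be traced with a standalone sketch (not the EvalML implementation; the min_samples and min_percentage values here are illustrative, not EvalML's defaults):

```python
import pandas as pd

# Illustrative thresholds mirroring the sampler's min_samples / min_percentage
# parameters (hypothetical values, not EvalML's defaults).
min_samples = 100
min_percentage = 0.01

y = pd.Series([0] * 1000 + [1] * 5)         # class 1 is tiny
counts = y.value_counts()                    # 0 -> 1000, 1 -> 5
normalized_counts = counts / counts.sum()    # 0 -> ~0.995, 1 -> ~0.005

# severely imbalanced: some class is both below min_samples in absolute count
# and below min_percentage of the total data
severe = any(counts < min_samples) and any(normalized_counts < min_percentage)
print(severe)  # True: class 1 has 5 < 100 samples and is ~0.5% < 1% of the data
```

The builtin `any()` iterates the boolean Series produced by the comparisons, matching how the quoted condition is written.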
Contributor

Playing devil's advocate here, should severely imbalanced and balanced data produce the same behavior?

The user's data can fall into four cases:

  1. Balanced
  2. Severely Imbalanced
  3. Imbalanced and can downsample
  4. Imbalanced but can't downsample because of min_samples

Right now users can't tell whether they are in cases 1, 2, or 4 and I think there is value in communicating that. Maybe it belongs in our class imbalance data check though.

Contributor Author

@freddyaboulton Yeah, in discussion with @dsherry we needed to figure out how to get the class imbalance data check to work with this. I definitely agree the behavior could be confusing, but as long as the data check catches the severely-imbalanced case and warns the user, that should clarify things.

Contributor Author

@freddyaboulton I just filed a new issue to track updating the class imbalance data check.


@chukarsten chukarsten left a comment


Very nice job, super thorough testing. Love it. I had one comment, which I don't think is merge-blocking, that we should probably resolve (or it may already be resolved): what to do when a class cannot be downsampled to the desired ratio.

# find goal size, round it down if it's a float
goal_value = max(int((self.balanced_ratio * counts.values[-1]) // 1), self.min_samples)
# we don't want to drop less than 0 rows
drop_values = {k: max(0, counts[k] - goal_value) for k in undersample_classes}
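The quoted computation can be traced on a small example outside the sampler class (balanced_ratio and min_samples are hypothetical values chosen for illustration):

```python
import pandas as pd

# Hypothetical parameter values for illustration.
balanced_ratio = 1
min_samples = 100

y = pd.Series([0] * 90 + [1] * 100 + [2] * 120 + [3] * 40 + [4] * 70)
counts = y.value_counts()  # descending: 2 -> 120, 1 -> 100, 0 -> 90, 4 -> 70, 3 -> 40

# goal size: ratio times the smallest class count, floored, but never below min_samples
goal_value = max(int((balanced_ratio * counts.values[-1]) // 1), min_samples)

# we don't want to drop fewer than 0 rows, so classes at or below the goal are untouched
# (int() casts are just for clean printing)
drop_values = {int(k): int(max(0, counts[k] - goal_value)) for k in counts.index}

print(goal_value)   # 100: ratio * 40 = 40, but min_samples = 100 wins
print(drop_values)  # {2: 20, 1: 0, 0: 0, 4: 0, 3: 0} -- only class 2 loses rows
```

This shows the behavior the reviewers discuss below: classes 0 and 4 sit below the goal, so `max(0, ...)` clamps their drop counts to zero and they are left as-is.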
Contributor

For some reason my original incomplete review was deleted... but here, if we are trying to hit a goal ratio and don't end up subtracting the goal_value, the count stays the same and so does the ratio. Is that the behavior we want? If the goal ratio cannot be achieved, do nothing and don't alert the user? I'm sure you and Dylan discussed this; I'm just playing some catch-up.

y = pd.Series([0] * 90 + [1] * 100 + [2] * 120 + [3] * 40 + [4] * 70)
# will downsample the [2] target
# will try to downsample [0] and [4], but max(0, x) will prevent that
bcs = BalancedClassificationSampler(balanced_ratio=1, min_percentage=0.01)
Contributor

I think this is the test my comment above was interested in. I would expect that, after using the sampler, the class populations achieve the balanced ratio. Do we want the lack of downsampling for 0 and 4 to be silent?

Contributor Author

Mmm, yeah, that's fair. My thought was that we cannot downsample if doing so would leave a class with fewer than min_samples instances. For instance, with 40 instances of class 1 and 100 instances of class 2, we might not want to downsample the 100 even if we wanted a 1:1 ratio, whereas with 40,000 vs 100,000 samples it would be fine to downsample the 100,000 to 40,000 since we have enough data. As for warning the user, I wasn't sure whether this was something I should warn about or do silently. Thoughts @chukarsten @dsherry?
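The contrast described in that comment can be sketched with a tiny hypothetical helper (not EvalML code; min_samples = 100 and balanced_ratio = 1 are assumed values for the illustration):

```python
# Hypothetical sketch of the min_samples guard discussed above (not EvalML code).
balanced_ratio = 1
min_samples = 100  # assumed threshold for this illustration

def goal_size(minority_count):
    # target size for majority classes: ratio * minority count, but never below min_samples
    return max(int(balanced_ratio * minority_count), min_samples)

# 40 vs 100 instances: goal is max(40, 100) = 100, so the majority class is left untouched
assert goal_size(40) == 100
assert max(0, 100 - goal_size(40)) == 0   # drop 0 rows

# 40,000 vs 100,000 instances: plenty of data, so the majority class is cut to 40,000
assert goal_size(40_000) == 40_000
assert max(0, 100_000 - goal_size(40_000)) == 60_000  # drop 60,000 rows
```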

Contributor

I think the understanding should be that if the number of instances falls below the threshold for certain classes, as it does for "0" and "4" here, that the method will simply not downsample those classes. No error in that case.

@bchen1116 bchen1116 merged commit 73eeafc into main Feb 23, 2021
@chukarsten chukarsten mentioned this pull request Feb 23, 2021
@dsherry dsherry mentioned this pull request Mar 10, 2021
@freddyaboulton freddyaboulton deleted the bc_973_downsampler branch May 13, 2022 15:01