
Consolidate oversamplers to select based on true input data #2695

Merged: 15 commits into main from 2605_oversampler_logic on Aug 30, 2021

Conversation

@eccabay (Contributor) commented Aug 25, 2021

Closes #2605 by replacing SMOTEOversampler, SMOTENOversampler, and SMOTENCOversampler with a single Oversampler class, and moving the SMOTE/SMOTEN/SMOTENC selection logic into Oversampler.fit().
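
For context, a minimal sketch of what that consolidated selection logic might look like. This is a hypothetical helper, not the actual evalml implementation; in particular, how categorical columns are detected and how the sampling ratios are translated for imblearn are assumptions here:

    from imblearn.over_sampling import SMOTE, SMOTEN, SMOTENC

    def _select_smote_variant(X, categorical_indices, sampling_strategy, random_seed):
        # imblearn accepts sampling_strategy as a class -> target-count dict
        # (or a float/str); it is passed through untouched here.
        if not categorical_indices:
            # Purely numeric features: plain SMOTE.
            return SMOTE(sampling_strategy=sampling_strategy, random_state=random_seed)
        if len(categorical_indices) == X.shape[1]:
            # All features categorical: SMOTEN handles nominal-only data.
            return SMOTEN(sampling_strategy=sampling_strategy, random_state=random_seed)
        # Mixed numeric and categorical features: SMOTENC.
        return SMOTENC(
            categorical_features=categorical_indices,
            sampling_strategy=sampling_strategy,
            random_state=random_seed,
        )

Calling a helper like this from fit() lets the choice reflect the data the component actually receives after any upstream pipeline transformations, which is the point of #2605.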

@codecov (bot) commented Aug 25, 2021

Codecov Report

Merging #2695 (4930e6a) into main (aa57a39) will decrease coverage by 0.1%.
The diff coverage is 97.3%.


@@           Coverage Diff           @@
##            main   #2695     +/-   ##
=======================================
- Coverage   99.9%   99.9%   -0.0%     
=======================================
  Files        300     300             
  Lines      27459   27426     -33     
=======================================
- Hits       27415   27382     -33     
  Misses        44      44             
Impacted Files Coverage Δ
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
...alml/pipelines/components/transformers/__init__.py 100.0% <ø> (ø)
...s/components/transformers/samplers/base_sampler.py 100.0% <ø> (ø)
evalml/pipelines/utils.py 99.2% <ø> (ø)
evalml/tests/automl_tests/test_automl.py 99.7% <ø> (ø)
...lml/tests/automl_tests/test_iterative_algorithm.py 100.0% <ø> (ø)
...s/prediction_explanations_tests/test_explainers.py 100.0% <ø> (ø)
...understanding_tests/test_permutation_importance.py 100.0% <ø> (ø)
evalml/tests/pipeline_tests/test_pipeline_utils.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_utils.py 95.3% <20.0%> (ø)
... and 11 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update aa57a39...4930e6a.

@eccabay eccabay self-assigned this Aug 25, 2021
Comment on lines 1344 to 1362
("binary", {0: 1, 1: 0.5}, 900),
("binary", {0: 1, 1: 0.8}, 1080),
("binary", {0: 1, 1: 0.5}, 1000),
("binary", {0: 1, 1: 0.8}, 1200),
@eccabay (Contributor, Author) commented:
These changes, along with a few others in this file, were necessary to make the tests pass, but I have no idea why. In the debugging I did to try to figure out what was wrong, I actually couldn't tell how these tests passed in the first place, as the issue seemed to arise in code I didn't change.

@chukarsten (Contributor) commented:
@bchen1116 any thoughts? It doesn't make sense that these values should change

A contributor commented:
Is this related to changing the ratio of the target below to y = pd.Series([0] * 1000 + [1] * 200) or did you change that to get the tests to pass as well?

RE the logic behind these tests originally (also writing out for my own sanity / logic-checking):

For the binary case, we have a target of y = pd.Series([0] * 900 + [1] * 300); AutoML does CV on this (though it's unclear what the exact split of targets might be 😬). Breakpointing tells me that the oversampler component sees a training target of 800 values: 600 of class 0 and 200 of class 1. Hence, oversampling with ratio {0: 1, 1: 0.5} will result in a new target of 600 * 1 + max(600 * 0.5, 300), a total length of 900. The second case is 600 * 1 + max(600 * 0.8, 300), a total length of 1080.
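
As a sanity check, that arithmetic can be written out directly. A minimal sketch; the 600/200 training split is the value observed at the breakpoint above, not derived here:

    # Expected post-oversampling length for a binary target.
    # ratios maps class label -> desired size as a fraction of the majority.
    def expected_total(ratios, majority=600, minority=200):
        # Oversampling only adds rows, so each class ends at
        # max(requested count, current count).
        new_majority = max(majority * ratios[0], majority)
        new_minority = max(majority * ratios[1], minority)
        return int(new_majority + new_minority)

    assert expected_total({0: 1, 1: 0.5}) == 900   # 600 + 300
    assert expected_total({0: 1, 1: 0.8}) == 1080  # 600 + 480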

@eccabay (Contributor, Author) commented Aug 26, 2021:

I also changed the target to y = pd.Series([0] * 1000 + [1] * 200) to get the test to pass.

With the original settings, these lines:

    counts = y.value_counts()
    minority_class = min(counts)
    class_ratios = minority_class / counts
    if all(class_ratios >= sampler_balanced_ratio):
        return None

were setting the sampler to None before the Oversampler could be selected.
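
Concretely, assuming the default sampler_balanced_ratio is 0.25 (an assumption to verify against the evalml defaults), a quick check of both targets:

    import pandas as pd

    def already_balanced(y, sampler_balanced_ratio=0.25):
        # Mirrors the check above: True means no sampler gets selected.
        counts = y.value_counts()
        class_ratios = min(counts) / counts
        return all(class_ratios >= sampler_balanced_ratio)

    # Original target: 300/900 = 0.33 >= 0.25, so the sampler was set to None.
    print(already_balanced(pd.Series([0] * 900 + [1] * 300)))   # True
    # Updated target: 200/1000 = 0.2 < 0.25, so the Oversampler is kept.
    print(already_balanced(pd.Series([0] * 1000 + [1] * 200)))  # False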

A contributor replied:
But before this change consolidated all of the oversamplers, this test hardcoded the sampler in automl, so we should never have run this code in get_best_sampler_for_data at all! Now that we're consolidating the oversamplers, self.sampler_method in ["auto", "Oversampler"] evaluates to True. If it's setting the sampler to None here, we might need to change the logic here? If we're just interested in testing the SMOTE oversampler specifically, we can also mock get_best_sampler_for_data to return one of the SMOTE oversamplers specifically (see the sketch after the snippet below)!

            if self.sampler_method in ["auto", "Oversampler"]:
                self._sampler_name = get_best_sampler_for_data(
                    self.X_train,
                    self.y_train,
                    self.sampler_method,
                    self.sampler_balanced_ratio,
                )
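
A minimal sketch of that mocking approach. The patch target path, the fixture name X_y_binary, and the returned sampler name are all assumptions for illustration; patch wherever AutoMLSearch actually looks up get_best_sampler_for_data:

    from unittest.mock import patch

    from evalml.automl import AutoMLSearch

    @patch(
        "evalml.automl.automl_search.get_best_sampler_for_data",  # assumed path
        return_value="SMOTENC Oversampler",  # illustrative sampler name
    )
    def test_sampler_name_taken_from_mock(mock_best_sampler, X_y_binary):
        X, y = X_y_binary  # assumed standard evalml test fixture
        automl = AutoMLSearch(X_train=X, y_train=y, problem_type="binary")
        # Per the snippet above, _sampler_name now comes from the mock,
        # regardless of the training data's class balance.
        assert automl._sampler_name == "SMOTENC Oversampler"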

@eccabay (Contributor, Author) commented Aug 26, 2021:

Ahh ok, I finally understand how this came up as an issue! Thanks so much. I like your idea of mocking get_best_sampler_for_data, so I'll implement that. I'm also going to update the sampler selection logic in AutoMLSearch.__init__.

@bchen1116 (Contributor) commented:
This test looks good to me! The current numbers check out, but I believe the previous numbers could have been left the same and this test would still pass?

@chukarsten (Contributor) left a comment:

Becca, great work, as usual. I'd definitely like to get @bchen1116's opinion on this, since I think he set up most of this to begin with and would thus have a good perspective on it. The changes I think we might need are reintroducing explicit tests of the oversampler selection logic, both to verify that it works as intended and to show devs what's happening, and perhaps combining the selection logic as mentioned.


@chukarsten (Contributor) left a comment:

Nice! Thanks for doing this!

@bchen1116 (Contributor) left a comment:

Left a few comments/questions, but this looks good to me! Can we add a test where we make sure the sample code in the issue passes with this new implementation? I think it would be useful to keep it as a test to ensure that future changes don't break our pipelines.

Two outdated review threads on evalml/tests/component_tests/test_oversampler.py were resolved.

@eccabay eccabay merged commit af5f29f into main Aug 30, 2021
@eccabay eccabay deleted the 2605_oversampler_logic branch August 30, 2021 14:30
@chukarsten chukarsten mentioned this pull request Sep 1, 2021
Linked issue closed by this PR (#2605):
Logic for selecting oversampler class does not take into account data transformations that happen in pipeline