
Consolidate oversamplers to select based on true input data #2695

Merged: 15 commits into main from 2605_oversampler_logic on Aug 30, 2021

Conversation

@eccabay (Contributor) commented Aug 25, 2021

Closes #2605 by replacing SMOTEOversampler, SMOTENOversampler, and SMOTENCOversampler with a single Oversampler class, and moving the SMOTE/SMOTEN/SMOTENC selection logic into Oversampler.fit().
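
For context, a minimal sketch of what that consolidated selection logic might look like. This is a hypothetical helper, not the actual evalml implementation; in particular, how categorical columns are detected and how the sampling ratios are translated for imblearn are assumptions here:

    from imblearn.over_sampling import SMOTE, SMOTEN, SMOTENC

    def _select_smote_variant(X, categorical_indices, sampling_strategy, random_seed):
        # imblearn accepts sampling_strategy as a class -> target-count dict
        # (or a float/str); it is passed through untouched here.
        if not categorical_indices:
            # Purely numeric features: plain SMOTE.
            return SMOTE(sampling_strategy=sampling_strategy, random_state=random_seed)
        if len(categorical_indices) == X.shape[1]:
            # All features categorical: SMOTEN handles nominal-only data.
            return SMOTEN(sampling_strategy=sampling_strategy, random_state=random_seed)
        # Mixed numeric and categorical features: SMOTENC.
        return SMOTENC(
            categorical_features=categorical_indices,
            sampling_strategy=sampling_strategy,
            random_state=random_seed,
        )

Calling a helper like this from fit() lets the choice reflect the data the component actually receives after any upstream pipeline transformations, which is the point of #2605.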

@codecov (bot) commented Aug 25, 2021

Codecov Report

Merging #2695 (4930e6a) into main (aa57a39) will decrease coverage by 0.1%.
The diff coverage is 97.3%.


@@           Coverage Diff           @@
##            main   #2695     +/-   ##
=======================================
- Coverage   99.9%   99.9%   -0.0%     
=======================================
  Files        300     300             
  Lines      27459   27426     -33     
=======================================
- Hits       27415   27382     -33     
  Misses        44      44             
Impacted Files Coverage Δ
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
...alml/pipelines/components/transformers/__init__.py 100.0% <ø> (ø)
...s/components/transformers/samplers/base_sampler.py 100.0% <ø> (ø)
evalml/pipelines/utils.py 99.2% <ø> (ø)
evalml/tests/automl_tests/test_automl.py 99.7% <ø> (ø)
...lml/tests/automl_tests/test_iterative_algorithm.py 100.0% <ø> (ø)
...s/prediction_explanations_tests/test_explainers.py 100.0% <ø> (ø)
...understanding_tests/test_permutation_importance.py 100.0% <ø> (ø)
evalml/tests/pipeline_tests/test_pipeline_utils.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_utils.py 95.3% <20.0%> (ø)
... and 11 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update aa57a39...4930e6a.

@eccabay eccabay self-assigned this Aug 25, 2021
Comment on lines 1344 to 1362
("binary", {0: 1, 1: 0.5}, 900),
("binary", {0: 1, 1: 0.8}, 1080),
("binary", {0: 1, 1: 0.5}, 1000),
("binary", {0: 1, 1: 0.8}, 1200),
@eccabay (Contributor, Author) commented:
These changes, along with a few others in this file, were necessary to make the tests pass, but I have no idea why. In the debugging I did to try to figure out what was wrong, I actually couldn't tell how these tests passed in the first place, as the issue seemed to arise in code I didn't change.

@chukarsten (Contributor) commented:
@bchen1116 any thoughts? It doesn't make sense that these values should change

A contributor commented:
Is this related to changing the ratio of the target below to y = pd.Series([0] * 1000 + [1] * 200) or did you change that to get the tests to pass as well?

RE the logic behind these tests originally (also writing out for my own sanity / logic-checking):

For the binary case, we have a target of y = pd.Series([0] * 900 + [1] * 300); AutoML does CV on this (though it's unclear what the exact split of targets might be 😬). Breakpointing tells me that the oversampler component sees a training target of 800 values: 600 of class 0 and 200 of class 1. Hence, oversampling with ratio {0: 1, 1: 0.5} will result in a new target of 600 * 1 + max(600 * 0.5, 300), a total length of 900. The second case is 600 * 1 + max(600 * 0.8, 300), a total length of 1080.
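
As a sanity check, that arithmetic can be written out directly. A minimal sketch; the 600/200 training split is the value observed at the breakpoint above, not derived here:

    # Expected post-oversampling length for a binary target.
    # ratios maps class label -> desired size as a fraction of the majority.
    def expected_total(ratios, majority=600, minority=200):
        # Oversampling only adds rows, so each class ends at
        # max(requested count, current count).
        new_majority = max(majority * ratios[0], majority)
        new_minority = max(majority * ratios[1], minority)
        return int(new_majority + new_minority)

    assert expected_total({0: 1, 1: 0.5}) == 900   # 600 + 300
    assert expected_total({0: 1, 1: 0.8}) == 1080  # 600 + 480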

@eccabay (Contributor, Author) commented Aug 26, 2021:

I also changed the target to y = pd.Series([0] * 1000 + [1] * 200) to get the test to pass.

With the original settings, these lines:

    counts = y.value_counts()
    minority_class = min(counts)
    class_ratios = minority_class / counts
    if all(class_ratios >= sampler_balanced_ratio):
        return None

were setting the sampler to None before the Oversampler could be selected.
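
Concretely, assuming the default sampler_balanced_ratio is 0.25 (an assumption to verify against the evalml defaults), a quick check of both targets:

    import pandas as pd

    def already_balanced(y, sampler_balanced_ratio=0.25):
        # Mirrors the check above: True means no sampler gets selected.
        counts = y.value_counts()
        class_ratios = min(counts) / counts
        return all(class_ratios >= sampler_balanced_ratio)

    # Original target: 300/900 = 0.33 >= 0.25, so the sampler was set to None.
    print(already_balanced(pd.Series([0] * 900 + [1] * 300)))   # True
    # Updated target: 200/1000 = 0.2 < 0.25, so the Oversampler is kept.
    print(already_balanced(pd.Series([0] * 1000 + [1] * 200)))  # False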

A contributor replied:
But before this change consolidated all of the oversamplers, this test hardcoded the sampler in automl, so we should never have run this code in get_best_sampler_for_data at all! Now that we're consolidating the oversamplers, self.sampler_method in ["auto", "Oversampler"] evaluates to True. If it's setting the sampler to None here, we might need to change the logic here? If we're just interested in testing the SMOTE oversampler specifically, we can also mock get_best_sampler_for_data to return one of the SMOTE oversamplers specifically (see the sketch after the snippet below)!

            if self.sampler_method in ["auto", "Oversampler"]:
                self._sampler_name = get_best_sampler_for_data(
                    self.X_train,
                    self.y_train,
                    self.sampler_method,
                    self.sampler_balanced_ratio,
                )
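
A minimal sketch of that mocking approach. The patch target path, the fixture name X_y_binary, and the returned sampler name are all assumptions for illustration; patch wherever AutoMLSearch actually looks up get_best_sampler_for_data:

    from unittest.mock import patch

    from evalml.automl import AutoMLSearch

    @patch(
        "evalml.automl.automl_search.get_best_sampler_for_data",  # assumed path
        return_value="SMOTENC Oversampler",  # illustrative sampler name
    )
    def test_sampler_name_taken_from_mock(mock_best_sampler, X_y_binary):
        X, y = X_y_binary  # assumed standard evalml test fixture
        automl = AutoMLSearch(X_train=X, y_train=y, problem_type="binary")
        # Per the snippet above, _sampler_name now comes from the mock,
        # regardless of the training data's class balance.
        assert automl._sampler_name == "SMOTENC Oversampler"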

@eccabay (Contributor, Author) commented Aug 26, 2021:

Ahh ok, I finally understand how this came up as an issue! Thanks so much. I like your idea of mocking get_best_sampler_for_data, so I'll implement that. I'm also going to update the sampler selection logic in AutoMLSearch.__init__.

@bchen1116 (Contributor) commented:
This test looks good to me! The current numbers check out, but I believe the previous numbers could have been left the same and this test would still pass?

@chukarsten (Contributor) left a comment:

Becca, great work, as usual. I'd definitely like to get @bchen1116's opinion on this, since I think he set up most of this to begin with and would thus have a good perspective on it. The changes I think we might need are reintroducing explicit tests of the oversampler selection logic, both to verify that it works as intended and to show devs what's happening, and perhaps combining the selection logic as mentioned.


@chukarsten (Contributor) left a comment:

Nice! Thanks for doing this!

@bchen1116 (Contributor) left a comment:

Left a few comments/questions, but this looks good to me! Can we add a test where we make sure the sample code in the issue passes with this new implementation? I think it would be useful to keep it as a test to ensure that future changes don't break our pipelines.

Two outdated review threads on evalml/tests/component_tests/test_oversampler.py were resolved.

@eccabay eccabay merged commit af5f29f into main Aug 30, 2021
@eccabay eccabay deleted the 2605_oversampler_logic branch August 30, 2021 14:30
@chukarsten chukarsten mentioned this pull request Sep 1, 2021
Linked issue closed by this PR (#2605):
Logic for selecting oversampler class does not take into account data transformations that happen in pipeline