Add dictionary support for oversamplers #2288

Merged
merged 29 commits into main from bc_2142_oversampler_dic on May 28, 2021

Conversation

bchen1116 (Contributor)

fix #2142

bchen1116 self-assigned this on May 18, 2021
codecov bot commented on May 18, 2021

Codecov Report

Merging #2288 (f6ca020) into main (3faffc8) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2288     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        280     280             
  Lines      24360   24490    +130     
=======================================
+ Hits       24333   24463    +130     
  Misses        27      27             
Impacted Files                                          Coverage Δ
evalml/automl/automl_search.py                          100.0% <100.0%> (ø)
...s/components/transformers/samplers/base_sampler.py   100.0% <100.0%> (ø)
.../automl_tests/test_automl_search_classification.py   100.0% <100.0%> (ø)
evalml/tests/component_tests/test_components.py         100.0% <100.0%> (ø)
evalml/tests/component_tests/test_oversamplers.py       100.0% <100.0%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3faffc8...f6ca020.

parameters[self._sampler_name] = {"sampling_ratio": self.sampler_balanced_ratio}
self._frozen_pipeline_parameters[self._sampler_name] = {"sampling_ratio": self.sampler_balanced_ratio}
elif self._sampler_name in parameters:
parameters[self._sampler_name].update({"sampling_ratio": self.sampler_balanced_ratio})
bchen1116 (Contributor, Author)

Decided to default to the sampler_balanced_ratio value provided to AutoMLSearch when setting the sampling_ratio, which means it overrides any sampling_ratio a user might pass in through the pipeline parameters via

pipeline_parameters = {sampler: {"sampling_ratio": 0.5}}

Updated the docstring above to mention this.
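
A rough sketch of the precedence described above (not code from this PR; the component key "Oversampler" and the exact AutoMLSearch keyword names are assumptions and may differ by evalml version):

import numpy as np
import pandas as pd
from evalml.automl import AutoMLSearch

# Small imbalanced toy dataset (180 vs. 20 rows), purely for illustration.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=[f"f{i}" for i in range(4)])
y = pd.Series([0] * 180 + [1] * 20)

# The user asks for sampling_ratio=0.5 through pipeline_parameters...
pipeline_parameters = {"Oversampler": {"sampling_ratio": 0.5}}

automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    sampler_method="Oversampler",
    sampler_balanced_ratio=0.25,              # ...but this value takes precedence,
    pipeline_parameters=pipeline_parameters,  # so the requested 0.5 is overridden to 0.25
)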

Contributor

Very cool! I don't think you need an elif here; it can just be an else, since I don't think we'd have another logic path. Total nitpick, though.

Also, I like the way you set up the logic for the elif block. You could do the same for the if block and move the assignment of self._frozen_pipeline_parameters afterwards:

if self._sampler_name not in parameters:
    parameters[self._sampler_name] = {"sampling_ratio": self.sampler_balanced_ratio}
else:
    parameters[self._sampler_name].update({"sampling_ratio": self.sampler_balanced_ratio})
self._frozen_pipeline_parameters[self._sampler_name] = parameters[self._sampler_name]

chukarsten (Contributor) left a comment:

Nice Bryan, this is good work. I think you and I discussed a bunch of things on Slack in parallel to this, and I like the addition of docs and the test for the string keys. I still personally find it a little confusing that, in the binary case, the dictionary's key/value pair for the majority class (in the oversampler case) or the minority class (in the undersampler case) doesn't do anything. Personally, I just struggle with the dictionary representation, but I think that's a personal problem X-D.
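
For illustration, a hypothetical sketch of the dictionary input being discussed (the parameter name sampling_ratio_dict and the class-label keys are assumptions, not taken from the diff shown here):

# Keys are class labels; values are the desired ratio of that class to the
# majority class after sampling. With a 0/1 binary target where 0 is the
# majority class, the entry for class 0 is effectively a no-op for an oversampler.
sampling_ratio_dict = {
    0: 1.0,  # majority class: already at ratio 1.0, so an oversampler leaves it alone
    1: 0.5,  # minority class: oversample until it reaches half the majority class size
}

# The dict would then travel through the pipeline parameters much like sampling_ratio, e.g.
# parameters = {"Oversampler": {"sampling_ratio_dict": sampling_ratio_dict}}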

@@ -748,3 +748,87 @@ def test_automl_search_sampler_method(sampler_method, categorical_features, prob
sampler_method = 'Undersampler'
assert 'Could not import imblearn.over_sampling' in caplog.text
assert all(any(sampler_method in comp.name for comp in pipeline.component_graph) for pipeline in pipelines)


@pytest.mark.parametrize("sampling_ratio", [0.1, 0.2, 0.5, 1])
Contributor

Cool, I like this test. Very readable.
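
For context, a hedged sketch of what a parametrized test along these lines might look like (the fixture name, component key, and AutoMLSearch arguments are assumptions, not the PR's actual test):

import pytest
from evalml.automl import AutoMLSearch


@pytest.mark.parametrize("sampling_ratio", [0.1, 0.2, 0.5, 1])
def test_sampler_balanced_ratio_reaches_pipelines(sampling_ratio, X_y_binary):
    # `X_y_binary` stands in for whatever imbalanced-data fixture the suite provides.
    X, y = X_y_binary
    automl = AutoMLSearch(
        X_train=X,
        y_train=y,
        problem_type="binary",
        sampler_method="Oversampler",
        sampler_balanced_ratio=sampling_ratio,
    )
    # Every sampler component in every candidate pipeline should pick up the ratio.
    for pipeline in automl.allowed_pipelines:
        for name, params in pipeline.parameters.items():
            if "Oversampler" in name:
                assert params["sampling_ratio"] == sampling_ratio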

ParthivNaresh (Contributor) left a comment:

Great job with this; excellent test coverage.


bchen1116 merged commit f2a3cc1 into main on May 28, 2021
chukarsten mentioned this pull request on Jun 2, 2021
freddyaboulton deleted the bc_2142_oversampler_dic branch on May 13, 2022 15:02
Development

Successfully merging this pull request may close this issue: Design and create dictionary input for Oversampler (#2142)

3 participants