Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change undersampler to use minority:majority ratio #2077

Merged
merged 20 commits into from Apr 8, 2021

Conversation

bchen1116
Copy link
Contributor

fix #2076

@bchen1116 bchen1116 self-assigned this Apr 2, 2021
@codecov
Copy link

codecov bot commented Apr 2, 2021

Codecov Report

Merging #2077 (b306370) into main (581a7fb) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #2077     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         288      288             
  Lines       23419    23428      +9     
=========================================
+ Hits        23409    23418      +9     
  Misses         10       10             
Impacted Files Coverage Δ
evalml/tests/automl_tests/test_automl.py 100.0% <ø> (ø)
...s/components/transformers/samplers/undersampler.py 100.0% <100.0%> (ø)
.../data_splitters/balanced_classification_sampler.py 100.0% <100.0%> (ø)
...data_splitters/balanced_classification_splitter.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_components.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_undersampler.py 100.0% <100.0%> (ø)
...ests/test_balanced_classification_data_splitter.py 100.0% <100.0%> (ø)
...sing_tests/test_balanced_classification_sampler.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 581a7fb...b306370. Read the comment docs.

@bchen1116 bchen1116 marked this pull request as ready for review April 2, 2021 17:22
Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

evalml/tests/automl_tests/test_automl.py Outdated Show resolved Hide resolved
Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking close! I think I found one minor bug, and I had a couple test suggestions.

docs/source/release_notes.rst Outdated Show resolved Hide resolved
@@ -82,15 +85,15 @@ def test_classification_imbalanced_balanced_ratio(num_classes, balanced_ratio):
indices = bcs.fit_resample(X, y)
X2 = X.loc[indices]
y2 = y.loc[indices]
if balanced_ratio >= 3:
if balanced_ratio <= .3:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because the default balanced_ratio is 0.25? Could you say if balanced_ratio <= 0.25 for clarity?

(Its probably more trouble than its worth but you could also use introspection to retrieve the default balanced_ratio value from the method signature 😆 )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsherry We choose 0.3 since the targets have a class ratio of 1/3 for the minority/majority balance. In scenarios where we consider anything less than 1/3 as balanced (for the sampler), we expect the data to not be undersampled. For instance, if we consider class ratios down to 1/5 as balanced, then this dataset shouldn't be sampled.

@@ -167,13 +170,13 @@ def test_classification_imbalanced_normal_imbalance_binary(min_samples, balanced
indices = bcs.fit_resample(X, y)
X2 = X.loc[indices]
y2 = y.loc[indices]
if balanced_ratio >= 6:
if balanced_ratio < .2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as before, please change to 0.25 for clarity, and I see one more similar usage below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar reasoning as before!

@bchen1116 bchen1116 requested a review from dsherry April 6, 2021 15:01
Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢 🙏 😁

@bchen1116 bchen1116 merged commit 23fb534 into main Apr 8, 2021
chukarsten added a commit that referenced this pull request Apr 20, 2021
…skEngine`` #1975.

- Added optional ``engine`` argument to ``AutoMLSearch`` #1975
- Added a warning about how time series support is still in beta when a user passes in a time series problem to ``AutoMLSearch`` #2118
- Added ``NaturalLanguageNaNDataCheck`` data check #2122
- Added ValueError to ``partial_dependence`` to prevent users from computing partial dependence on columns with all NaNs #2120
- Added standard deviation of cv scores to rankings table #2154
- Fixed ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, and ``BalancedClassificationSampler`` to use ``minority:majority`` ratio instead of ``majority:minority`` #2077
- Fixed bug where two-way partial dependence plots with categorical variables were not working correctly #2117
- Fixed bug where ``hyperparameters`` were not displaying properly for pipelines with a list ``component_graph`` and duplicate components #2133
- Fixed bug where ``pipeline_parameters`` argument in ``AutoMLSearch`` was not applied to pipelines passed in as ``allowed_pipelines`` #2133
- Fixed bug where ``AutoMLSearch`` was not applying custom hyperparameters to pipelines with a list ``component_graph`` and duplicate components #2133
- Removed ``hyperparameter_ranges`` from Undersampler and renamed ``balanced_ratio`` to ``sampling_ratio`` for samplers #2113
- Renamed ``TARGET_BINARY_NOT_TWO_EXAMPLES_PER_CLASS`` data check message code to ``TARGET_MULTICLASS_NOT_TWO_EXAMPLES_PER_CLASS`` #2126
- Modified one-way partial dependence plots of categorical features to display data with a bar plot #2117
- Renamed ``score`` column for ``automl.rankings`` as ``mean_cv_score`` #2135
- Fixed ``conf.py`` file #2112
- Added a sentence to the automl user guide stating that our support for time series problems is still in beta. #2118
- Fixed documentation demos #2139
- Update test badge in README to use GitHub Actions #2150
- Fixed ``test_describe_pipeline`` for ``pandas`` ``v1.2.4`` #2129
- Added a GitHub Action for building the conda package #1870 #2148
.. warning::
- Renamed ``balanced_ratio`` to ``sampling_ratio`` for the ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, ``BalancedClassficationSampler``, and Undersampler #2113
- Deleted the "errors" key from automl results #1975
- Deleted the ``raise_and_save_error_callback`` and the ``log_and_save_error_callback`` #1975
- Fixed ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, and ``BalancedClassificationSampler`` to use minority:majority ratio instead of majority:minority #2077
@chukarsten chukarsten mentioned this pull request Apr 20, 2021
@freddyaboulton freddyaboulton deleted the bc_rename_balanced_ratio branch May 13, 2022 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change undersampler to use minority:majority ratio
3 participants