
Add ClassImbalanceDataCheck to DefaultDataChecks #1333

Merged: 12 commits merged into main on Oct 26, 2020
Conversation

bchen1116 (Contributor) commented Oct 21, 2020

fix #1276

  • Throw an error when the number of instances of any label is less than 2 * num_cv_folds
    (We default to num_cv_folds == 3 since that is what our data split techniques use, although we leave it as an input parameter for the user)
  • Add ClassImbalanceDataCheck to DefaultDataChecks

Note: this data check is only included in DefaultDataChecks when the problem type is classification; it is skipped for regression.
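The thresholding described above can be sketched as a minimal standalone check. This is a hypothetical illustration, not the actual evalml implementation; the helper name `check_class_imbalance` is assumed for this sketch:

```python
# Hypothetical sketch of the described behavior (not evalml's code):
# flag any label whose count is below 2 * num_cv_folds, with
# num_cv_folds defaulting to 3.
import pandas as pd

def check_class_imbalance(y, num_cv_folds=3):
    """Return a warning message listing labels with too few instances."""
    counts = pd.Series(y).value_counts()
    threshold = 2 * num_cv_folds
    below = counts[counts < threshold]
    if below.empty:
        return []
    return ["The number of instances of these targets is less than "
            f"{threshold} instances: {below.index.tolist()}"]

# label 1 appears only 3 times, below the default threshold of 6
print(check_class_imbalance([0] * 10 + [1] * 3))
```

With a balanced target the function returns an empty list, so downstream code can treat any non-empty result as a warning.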

@bchen1116 bchen1116 self-assigned this Oct 21, 2020
codecov bot commented Oct 22, 2020

Codecov Report

Merging #1333 into main will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main    #1333      +/-   ##
==========================================
+ Coverage   99.95%   99.95%   +0.01%     
==========================================
  Files         213      213              
  Lines       13575    13606      +31     
==========================================
+ Hits        13568    13599      +31     
  Misses          7        7              
Impacted Files Coverage Δ
evalml/data_checks/class_imbalance_data_check.py 100.00% <100.00%> (ø)
evalml/data_checks/default_data_checks.py 100.00% <100.00%> (ø)
evalml/tests/automl_tests/test_automl.py 100.00% <100.00%> (ø)
...ta_checks_tests/test_class_imbalance_data_check.py 100.00% <100.00%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.00% <100.00%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4c9e5a9...9d38286.

@bchen1116 bchen1116 marked this pull request as ready for review October 22, 2020 14:35
bchen1116 (Contributor, Author):

Using this data check on the Abalone dataset (with y = abalone['Rings']), we get the following data check error messages:
[Screenshot: data check error messages on the Abalone dataset]

We see the data check works as expected on this imbalanced dataset.

freddyaboulton (Contributor) left a comment:

@bchen1116 Looks good!

evalml/data_checks/class_imbalance_data_check.py (outdated review thread):
# search for targets that occur less than twice the number of cv folds first
below_threshold_folds = fold_counts.where(fold_counts < self.cv_folds).dropna()
if len(below_threshold_folds):
warning_msg = "The number of instances of these targets is less than 2 * the number of cross folds = {} instances: {}"
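The quoted line relies on pandas' `Series.where`/`dropna` idiom to isolate the offending labels. A standalone illustration of just that idiom (the label names and counts here are made up):

```python
import pandas as pd

fold_counts = pd.Series({"A": 20, "B": 4, "C": 15})  # label -> instance count
cv_threshold = 6  # 2 * num_cv_folds with the default of 3 folds

# Series.where keeps values satisfying the condition and replaces the
# rest with NaN; dropna then leaves only the labels below the threshold.
below_threshold_folds = fold_counts.where(fold_counts < cv_threshold).dropna()
print(below_threshold_folds.index.tolist())
```

Only label "B" (4 instances) survives the filter, so it is the one that would be reported in the warning message.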
freddyaboulton (Contributor):

Helpful error message! But it might be confusing if the user passes in a custom 4-fold CV and sees "2 * the number of cross folds = 6" in the message. Is there a plan to modify AutoMLSearch to init this class with the number of folds in the cv? Or maybe we make this more generic, like "The number of instances of these targets is too small for cross validation"? Just thinking out loud - this is fine to merge and we can refine if we get user feedback 😅

bchen1116 (Contributor, Author):

@freddyaboulton Mmmm, that's fair; I didn't think about that. I think it would make sense to modify this so we can pass the num_folds arg from AutoMLSearch to the ClassImbalanceDataCheck itself. For instance, if the user passed in a custom data splitter with > 6 folds, this data check currently does nothing, so the default data checks could pass while failing to catch the error. I think that can be filed as a separate issue, unless it's better to implement it in this PR?

bchen1116 (Contributor, Author):

@freddyaboulton filed that as an issue here. I'll merge this PR and we can address the num_folds arg in AutoMLSearch afterwards

Contributor:

Good point @freddyaboulton and thank you for filing @bchen1116 ! Yep sounds good. I agree this message can be improved.

@bchen1116 bchen1116 merged commit fe73ad5 into main Oct 26, 2020
@dsherry dsherry mentioned this pull request Oct 29, 2020
@freddyaboulton freddyaboulton deleted the bc_1276_data_check branch May 13, 2022 15:35
Successfully merging this pull request may close these issues.

Add class imbalance data check to automl