-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ClassImbalanceDataCheck to DefaultDataChecks #1333
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1333 +/- ##
==========================================
+ Coverage 99.95% 99.95% +0.01%
==========================================
Files 213 213
Lines 13575 13606 +31
==========================================
+ Hits 13568 13599 +31
Misses 7 7
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bchen1116 Looks good!
# search for targets that occur less than twice the number of cv folds first | ||
below_threshold_folds = fold_counts.where(fold_counts < self.cv_folds).dropna() | ||
if len(below_threshold_folds): | ||
warning_msg = "The number of instances of these targets is less than 2 * the number of cross folds = {} instances: {}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Helpful error message! But it might be confusing if the user passes in a custom 4 fold CV and see 2 * number of cross folds = 6
in the message. Is there a plan to modify AutoMLSearch to init this class with the number of folds in the cv? Or maybe we make this more generic like "The number of instances of these targets is too small for cross validation? Just thinking out loud - this is fine to merge and we can refine if we get user feedback 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton Mmmm that's fair, I didn't think about that. I think it would make sense to modify such that we can pass num_folds
arg from AutoMLSearch to the ClassImbalanceDataCheck itself. For instance, if the user passed in a custom data-splitter with > 6 folds, this data check now does nothing, and if they use the default data checks, it could pass that while failing to catch the error. I think that can be filed as a separate issue, unless it's better to implement it into this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton filed that as an issue here. I'll merge this PR and we can address the num_folds
arg in AutoMLSearch afterwards
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point @freddyaboulton and thank you for filing @bchen1116 ! Yep sounds good. I agree this message can be improved.
fix #1276
2 * num_cv_folds
(We default to
num_cv_folds == 3
since that is what our data split techniques use, although we leave it as an input parameter for the user)** This datacheck is only used in DefaultDataChecks if the problem type is classification, not regression