Class Imbalance Data Check for Severe imbalance #1905
LGTM
@bchen1116 This looks good to me!
if len(below_threshold) and len(sample_counts):
    sample_count_values = sample_counts.index.tolist()
    severe_imbalance = [v for v in sample_count_values if v in below_threshold]
    warning_msg = "The following labels have severe class imbalance because they fall under {:.0f}% of the target and have less than {} samples: {}"
Nit-pick: "The following labels in the target have severe class imbalance because" etc.
part 1 of #1864
Added check to support our severe class imbalance scenario for the new datasplitter.
Detecting class imbalance in multiclass targets still needs to be addressed, but I'll leave that for a future PR since I think we need to discuss the best way to identify it.
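
For anyone skimming this PR, here is a rough standalone sketch of the idea behind the check: a label is flagged as severely imbalanced only when it is both under a proportion threshold and under a minimum sample count. This is not the code in the diff. It uses plain Python instead of pandas, and the function name `find_severe_imbalance` and the parameters `threshold` and `min_samples` are hypothetical names chosen for illustration.

```python
from collections import Counter


def find_severe_imbalance(y, threshold, min_samples):
    """Return labels that fall under `threshold` (a fraction of the target)
    AND have fewer than `min_samples` occurrences.

    Hypothetical helper for illustration only; not the project's API.
    """
    counts = Counter(y)
    total = sum(counts.values())
    if total == 0:
        return []
    # Labels whose share of the target is under the threshold.
    below_threshold = {label for label, n in counts.items() if n / total < threshold}
    # Labels whose absolute count is under the minimum.
    too_few_samples = {label for label, n in counts.items() if n < min_samples}
    # Severe imbalance requires both conditions; keep first-seen order.
    return [label for label in counts if label in below_threshold and label in too_few_samples]


def build_warning(labels, threshold, min_samples):
    """Format a warning using the message template quoted in the review."""
    template = (
        "The following labels have severe class imbalance because they fall "
        "under {:.0f}% of the target and have less than {} samples: {}"
    )
    return template.format(threshold * 100, min_samples, labels)


y = [0] * 95 + [1] * 5
severe = find_severe_imbalance(y, threshold=0.1, min_samples=100)
if severe:
    print(build_warning(severe, threshold=0.1, min_samples=100))
```

With this toy target, label `1` makes up 5% of the data and has only 5 samples, so it meets both conditions. Label `0` has fewer than 100 samples but is well above the proportion threshold, so it is not flagged.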