-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
creating class imbalance data checker #1135
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1135 +/- ##
=======================================
Coverage 99.91% 99.91%
=======================================
Files 195 197 +2
Lines 11596 11650 +54
=======================================
+ Hits 11586 11640 +54
Misses 10 10
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good however we should think about adding this to automl in this PR or in a future PR. I think it could be good to add it in.
@jeremyliweishih do you mean add this data checker into automl |
@bchen1116 I was talking about |
Filed issue #1139 to track taking |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Left some comments about docs, and I think it could be useful to add two more tests:
- What happens when y is empty?
- What happens when there are np.nan? I see that you call
.dropna()
so we should validate that it's correctly accounting for np.nan's in the input.
from .data_check_message import DataCheckWarning | ||
|
||
|
||
class ClassImbalanceDataCheck(DataCheck): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is useful for classification problems, so could be good to mention that. I believe we had another issue tracking passing in the problem type to the data checks and updating the API. @jeremyliweishih Until we have that in place, I don't think it's ideal to add it to automl quite yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep agreed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be in a good position to add this soon though.
evalml/tests/data_checks_tests/test_class_imbalance_data_check.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, added comment about more testing + a question that we should confirm the answer to before merging!
class_imbalance_check = ClassImbalanceDataCheck() | ||
|
||
assert class_imbalance_check.validate(X, y=pd.Series([])) == [] | ||
assert class_imbalance_check.validate(X, y=pd.Series([np.nan, np.nan, np.nan, np.nan, 1, 1, 1, 1, 2]), threshold=0.5) == [DataCheckWarning("Label 2 makes up 20.000% of the target data, which is below the recommended threshold of 50%", "ClassImbalanceDataCheck")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. This behavior could be worth discussing more. Do we want to ignore nans? I could see in the case where we have [np.nan, np.nan, np.nan, np.nan, 1, 2], we would pass this threshold of 0.5, but we'd still want to alert to the user that something is wrong. Maybe since we have the highly null and no variance data checks this isn't that big of a concern, but more opinions wouldn't hurt. @freddyaboulton @dsherry, what do you guys think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One solution could be to expose a count_nan
parameter and toggle this, so user has more flexibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angela97lin I think a count_nan
parameter would be useful for this. If I were to add this, should the default behavior count NaN
values or exclude them? @freddyaboulton @dsherry what are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InvalidTargetDataCheck
will raise a warning if there are any nulls in the target so I vote we keep things simple and implement ClassImbalanceDataCheck
assuming we have a target variable that doesn't have nulls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bchen1116 I think this is great! I vote we don't worry about handling nans in this data check because that is what InvalidTargetDataCheck
is handling. Besides, our estimators can't handle nan values anyways so I don't think it makes sense to ask if the classes are balanced in the presence of nans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 LGTM! I left a couple suggestions
from .data_check_message import DataCheckWarning | ||
|
||
|
||
class ClassImbalanceDataCheck(DataCheck): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep agreed!
from .data_check_message import DataCheckWarning | ||
|
||
|
||
class ClassImbalanceDataCheck(DataCheck): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be in a good position to add this soon though.
evalml/tests/data_checks_tests/test_class_imbalance_data_check.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing everything! LGTM 👍
fix #971
Creating a class imbalance data checker to determine and raise a warning when a target class falls below a given threshold.