
creating class imbalance data checker #1135

Merged
merged 20 commits into main
Sep 11, 2020

Conversation

bchen1116
Contributor

fix #971

Creates a class imbalance data check that detects when a target class falls below a given threshold of the target data and raises a warning.
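
For context, here is a minimal standalone sketch of the behavior being proposed, assuming a plain threshold-on-class-frequency rule. The function name is illustrative, and the warning message format is copied from this PR's tests; the actual ClassImbalanceDataCheck implementation may differ.

import pandas as pd

def check_class_imbalance(y, threshold=0.10):
    """Illustrative sketch: return warning messages for classes below `threshold` of the target."""
    ratios = pd.Series(y).dropna().value_counts(normalize=True)  # NaN labels dropped before counting
    messages = []
    for label, ratio in ratios.items():
        if ratio < threshold:
            messages.append("Label {} makes up {:.3f}% of the target data, which is below "
                            "the recommended threshold of {:.0f}%".format(label, ratio * 100, threshold * 100))
    return messages

check_class_imbalance([1, 1, 1, 1, 1, 1, 1, 1, 1, 2], threshold=0.2)
# ['Label 2 makes up 10.000% of the target data, which is below the recommended threshold of 20%']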

bchen1116 self-assigned this Sep 3, 2020
codecov bot commented Sep 3, 2020

Codecov Report

Merging #1135 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1135   +/-   ##
=======================================
  Coverage   99.91%   99.91%           
=======================================
  Files         195      197    +2     
  Lines       11596    11650   +54     
=======================================
+ Hits        11586    11640   +54     
  Misses         10       10           
Impacted Files Coverage Δ
evalml/data_checks/class_imbalance_data_check.py 100.00% <100.00%> (ø)
...ta_checks_tests/test_class_imbalance_data_check.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 706b9f0...41889d1.

bchen1116 marked this pull request as ready for review September 3, 2020 15:52
Collaborator

jeremyliweishih left a comment

Overall looks good; however, we should think about adding this to automl, either in this PR or in a future PR. I think it could be good to add it in.

evalml/data_checks/class_imbalance_data_check.py (outdated review thread; resolved)
bchen1116
Contributor Author

@jeremyliweishih do you mean adding this data check to the automl _validate_data_checks method?

jeremyliweishih
Collaborator

@bchen1116 I was talking about DefaultDataChecks, which is used in automl.

bchen1116
Contributor Author

bchen1116 commented Sep 3, 2020

Filed issue #1139 to track taking num_classes into account for default thresholding, as well as adding ClassImbalanceDataCheck to DefaultDataChecks.

Collaborator

jeremyliweishih left a comment

LGTM!

Contributor

angela97lin left a comment

Nice! Left some comments about docs, and I think it could be useful to add two more tests:

  • What happens when y is empty?
  • What happens when there are np.nan values? I see that you call .dropna(), so we should validate that it correctly accounts for np.nan values in the input.

evalml/data_checks/class_imbalance_data_check.py (two outdated review threads; resolved)
from .data_check_message import DataCheckWarning


class ClassImbalanceDataCheck(DataCheck):
Contributor

Looks like this is useful for classification problems, so it could be good to mention that. I believe we had another issue tracking passing the problem type to the data checks and updating the API. @jeremyliweishih Until we have that in place, I don't think it's ideal to add it to automl quite yet.

Contributor

Yep agreed!

Contributor

We should be in a good position to add this soon though.

Contributor

angela97lin left a comment

Looking good! Added a comment about more testing, plus a question that we should confirm the answer to before merging.

class_imbalance_check = ClassImbalanceDataCheck()

assert class_imbalance_check.validate(X, y=pd.Series([])) == []
assert class_imbalance_check.validate(X, y=pd.Series([np.nan, np.nan, np.nan, np.nan, 1, 1, 1, 1, 2]), threshold=0.5) == [DataCheckWarning("Label 2 makes up 20.000% of the target data, which is below the recommended threshold of 50%", "ClassImbalanceDataCheck")]
Contributor

Hmmm. This behavior could be worth discussing more. Do we want to ignore nans? In a case like [np.nan, np.nan, np.nan, np.nan, 1, 2], we would pass a threshold of 0.5, but we'd still want to alert the user that something is wrong. Maybe since we have the highly-null and no-variance data checks this isn't that big of a concern, but more opinions wouldn't hurt. @freddyaboulton @dsherry, what do you think?
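
A quick illustration of the concern with plain pandas (not part of the PR): once NaN labels are dropped, each remaining class looks far more common than it is in the raw target.

import numpy as np
import pandas as pd

y = pd.Series([np.nan, np.nan, np.nan, np.nan, 1, 2])
y.dropna().value_counts(normalize=True)        # 1.0 -> 0.5, 2.0 -> 0.5, so a 0.5 threshold is not triggered
y.value_counts(normalize=True, dropna=False)   # NaN -> ~0.67; each real class is only ~0.17 of the raw target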

Contributor

One solution could be to expose a count_nan parameter to toggle this, so the user has more flexibility?
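
A hypothetical sketch of what that toggle could look like; the count_nan name and behavior are illustrative only, and per the responses below the option was not adopted.

import pandas as pd

def class_ratios(y, count_nan=False):
    """Illustrative only: class frequencies for the imbalance check, with a hypothetical count_nan flag."""
    y = pd.Series(y)
    if count_nan:
        # Keep NaN labels in the denominator, so real classes appear rarer.
        return y.value_counts(normalize=True, dropna=False)
    return y.dropna().value_counts(normalize=True)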

Contributor Author

@angela97lin I think a count_nan parameter would be useful for this. If I were to add this, should the default behavior count NaN values or exclude them? @freddyaboulton @dsherry what are your thoughts?

Contributor

InvalidTargetDataCheck will raise a warning if there are any nulls in the target, so I vote we keep things simple and implement ClassImbalanceDataCheck assuming the target variable doesn't have nulls.

Contributor

freddyaboulton left a comment

@bchen1116 I think this is great! I vote we don't worry about handling nans in this data check because that is what InvalidTargetDataCheck is handling. Besides, our estimators can't handle nan values anyway, so I don't think it makes sense to ask whether the classes are balanced in the presence of nans.

Contributor

dsherry left a comment

👍 LGTM! I left a couple suggestions

evalml/data_checks/class_imbalance_data_check.py (outdated review thread; resolved)
Contributor

angela97lin left a comment

Thanks for addressing everything! LGTM 👍

Development

Successfully merging this pull request may close these issues.

Add data check for severe class imbalance
7 participants