
creating class imbalance data checker #1135

Merged
merged 20 commits into main
Sep 11, 2020

Conversation

bchen1116
Contributor

fix #971

Creates a class imbalance data check that detects when a target class falls below a given threshold of the target data and raises a warning.
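
For context, here is a minimal standalone sketch of the behavior being proposed, assuming a plain threshold-on-class-frequency rule. The function name is illustrative, and the warning message format is copied from this PR's tests; the actual ClassImbalanceDataCheck implementation may differ.

import pandas as pd

def check_class_imbalance(y, threshold=0.10):
    """Illustrative sketch: return warning messages for classes below `threshold` of the target."""
    ratios = pd.Series(y).dropna().value_counts(normalize=True)  # NaN labels dropped before counting
    messages = []
    for label, ratio in ratios.items():
        if ratio < threshold:
            messages.append("Label {} makes up {:.3f}% of the target data, which is below "
                            "the recommended threshold of {:.0f}%".format(label, ratio * 100, threshold * 100))
    return messages

check_class_imbalance([1, 1, 1, 1, 1, 1, 1, 1, 1, 2], threshold=0.2)
# ['Label 2 makes up 10.000% of the target data, which is below the recommended threshold of 20%']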

bchen1116 self-assigned this Sep 3, 2020
codecov bot commented Sep 3, 2020

Codecov Report

Merging #1135 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1135   +/-   ##
=======================================
  Coverage   99.91%   99.91%           
=======================================
  Files         195      197    +2     
  Lines       11596    11650   +54     
=======================================
+ Hits        11586    11640   +54     
  Misses         10       10           
Impacted Files Coverage Δ
evalml/data_checks/class_imbalance_data_check.py 100.00% <100.00%> (ø)
...ta_checks_tests/test_class_imbalance_data_check.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 706b9f0...41889d1.

bchen1116 marked this pull request as ready for review September 3, 2020 15:52
Collaborator

jeremyliweishih left a comment

Overall looks good; however, we should think about adding this to automl, either in this PR or in a future PR. I think it could be good to add it in.

evalml/data_checks/class_imbalance_data_check.py (outdated review thread; resolved)
bchen1116
Contributor Author

@jeremyliweishih do you mean adding this data check to the automl _validate_data_checks method?

jeremyliweishih
Collaborator

@bchen1116 I was talking about DefaultDataChecks, which is used in automl.

bchen1116
Contributor Author

bchen1116 commented Sep 3, 2020

Filed issue #1139 to track taking num_classes into account for default thresholding, as well as adding ClassImbalanceDataCheck to DefaultDataChecks.

Collaborator

jeremyliweishih left a comment

LGTM!

Contributor

angela97lin left a comment

Nice! Left some comments about docs, and I think it could be useful to add two more tests:

  • What happens when y is empty?
  • What happens when there are np.nan values? I see that you call .dropna(), so we should validate that it correctly accounts for np.nan values in the input.

evalml/data_checks/class_imbalance_data_check.py (two outdated review threads; resolved)
from .data_check_message import DataCheckWarning


class ClassImbalanceDataCheck(DataCheck):
Contributor

Looks like this is useful for classification problems, so it could be good to mention that. I believe we had another issue tracking passing the problem type to the data checks and updating the API. @jeremyliweishih Until we have that in place, I don't think it's ideal to add it to automl quite yet.

Contributor

Yep agreed!

Contributor

We should be in a good position to add this soon though.

Contributor

angela97lin left a comment

Looking good! Added a comment about more testing, plus a question that we should confirm the answer to before merging.

class_imbalance_check = ClassImbalanceDataCheck()

assert class_imbalance_check.validate(X, y=pd.Series([])) == []
assert class_imbalance_check.validate(X, y=pd.Series([np.nan, np.nan, np.nan, np.nan, 1, 1, 1, 1, 2]), threshold=0.5) == [DataCheckWarning("Label 2 makes up 20.000% of the target data, which is below the recommended threshold of 50%", "ClassImbalanceDataCheck")]
Contributor

Hmmm. This behavior could be worth discussing more. Do we want to ignore nans? In a case like [np.nan, np.nan, np.nan, np.nan, 1, 2], we would pass a threshold of 0.5, but we'd still want to alert the user that something is wrong. Maybe since we have the highly-null and no-variance data checks this isn't that big of a concern, but more opinions wouldn't hurt. @freddyaboulton @dsherry, what do you think?
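
A quick illustration of the concern with plain pandas (not part of the PR): once NaN labels are dropped, each remaining class looks far more common than it is in the raw target.

import numpy as np
import pandas as pd

y = pd.Series([np.nan, np.nan, np.nan, np.nan, 1, 2])
y.dropna().value_counts(normalize=True)        # 1.0 -> 0.5, 2.0 -> 0.5, so a 0.5 threshold is not triggered
y.value_counts(normalize=True, dropna=False)   # NaN -> ~0.67; each real class is only ~0.17 of the raw target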

Contributor

One solution could be to expose a count_nan parameter to toggle this, so the user has more flexibility?
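
A hypothetical sketch of what that toggle could look like; the count_nan name and behavior are illustrative only, and per the responses below the option was not adopted.

import pandas as pd

def class_ratios(y, count_nan=False):
    """Illustrative only: class frequencies for the imbalance check, with a hypothetical count_nan flag."""
    y = pd.Series(y)
    if count_nan:
        # Keep NaN labels in the denominator, so real classes appear rarer.
        return y.value_counts(normalize=True, dropna=False)
    return y.dropna().value_counts(normalize=True)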

Contributor Author

@angela97lin I think a count_nan parameter would be useful for this. If I were to add this, should the default behavior count NaN values or exclude them? @freddyaboulton @dsherry what are your thoughts?

Contributor

InvalidTargetDataCheck will raise a warning if there are any nulls in the target, so I vote we keep things simple and implement ClassImbalanceDataCheck assuming the target variable doesn't have nulls.

Contributor

freddyaboulton left a comment

@bchen1116 I think this is great! I vote we don't worry about handling nans in this data check because that is what InvalidTargetDataCheck is handling. Besides, our estimators can't handle nan values anyway, so I don't think it makes sense to ask whether the classes are balanced in the presence of nans.

Contributor

dsherry left a comment

👍 LGTM! I left a couple suggestions

evalml/data_checks/class_imbalance_data_check.py (outdated review thread; resolved)
Contributor

angela97lin left a comment

Thanks for addressing everything! LGTM 👍

Development

Successfully merging this pull request may close these issues.

Add data check for severe class imbalance
7 participants