Support for Multiclass Classification in Class Imbalance data checks #1986

Merged
bchen1116 merged 10 commits into main from bc_1864_multiclass on Mar 18, 2021

@bchen1116 bchen1116 commented Mar 16, 2021

fix #1864

Summary:

  • Compute the number of samples in each class of the training data
  • Identify the majority class, i.e. the class with the largest number of samples
  • For each class, compute the ratio of the majority class's sample count to that class's sample count
  • If that ratio exceeds a threshold (default 9:1), raise a class imbalance warning
  • If that ratio exceeds the threshold AND that class has fewer than 100 samples, raise a severe class imbalance warning
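
The steps in this summary can be sketched in plain Python as follows. This is only an illustration of the logic described here, not the code in `class_imbalance_data_check.py`: the function name `find_imbalanced_classes`, its parameters `max_ratio` and `min_samples`, and the returned labels are all hypothetical, and treating "exceeds" as a strict `>` comparison is an assumption.

```python
from collections import Counter


def find_imbalanced_classes(y, max_ratio=9.0, min_samples=100):
    """Hypothetical helper (not the evalml API) mirroring the summary steps.

    Returns a dict mapping each flagged class label to "imbalanced" or "severe".
    """
    # Step 1: number of samples in each class.
    counts = Counter(y)
    if not counts:
        return {}

    # Step 2: the majority class is the one with the most samples.
    majority_count = max(counts.values())

    findings = {}
    for label, count in counts.items():
        # Step 3: ratio of majority class samples to this class's samples.
        ratio = majority_count / count
        # Step 4: flag the class when the ratio exceeds the threshold
        # (strict ">" is an assumption about what "exceeds" means).
        if ratio > max_ratio:
            # Step 5: escalate to "severe" when the class is also small.
            findings[label] = "severe" if count < min_samples else "imbalanced"
    return findings


# Four classes: "c" is just past 9:1 but has 105 samples, "d" is 20:1 with 50 samples.
y = ["a"] * 1000 + ["b"] * 500 + ["c"] * 105 + ["d"] * 50
print(find_imbalanced_classes(y))
```

With this toy target, class "c" (ratio about 9.5, 105 samples) would get the regular warning and class "d" (ratio 20, 50 samples) the severe one, while "a" and "b" are not flagged. Note that the threshold stays fixed no matter how many classes are present.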

I did not incorporate the number of classes into the check, in order to keep this data check a little simpler and keep its logic in line with what users would expect (i.e., with a larger number of classes we don't automatically lower the threshold, even though the data might still be too imbalanced for the model to learn well).

@bchen1116 bchen1116 self-assigned this Mar 16, 2021

codecov bot commented Mar 16, 2021

Codecov Report

Merging #1986 (73a7934) into main (2afdc84) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1986     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         274      274             
  Lines       22286    22299     +13     
=========================================
+ Hits        22280    22293     +13     
  Misses          6        6             
Impacted Files                                          Coverage Δ
evalml/data_checks/class_imbalance_data_check.py        100.0% <100.0%> (ø)
...ta_checks_tests/test_class_imbalance_data_check.py   100.0% <100.0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@bchen1116 bchen1116 marked this pull request as ready for review March 16, 2021 21:33

@dsherry dsherry left a comment

Looks great! I left a couple of implementation comments.

@angela97lin angela97lin left a comment

Neat, I'm a big fan of this improvement! LGTM 🥳

@ParthivNaresh ParthivNaresh left a comment

Very cleanly done, I love it

@bchen1116 bchen1116 merged commit 372f753 into main Mar 18, 2021
@dsherry dsherry mentioned this pull request Mar 24, 2021
@freddyaboulton freddyaboulton deleted the bc_1864_multiclass branch May 13, 2022 15:01

Successfully merging this pull request may close these issues.

Update Class Imbalance Data Check
4 participants