Add support for missing classes #511
Please edit directly. I don't plan to make further edits.
4010478 to c82a3d3
Codecov Report
@@ Coverage Diff @@
## master #511 +/- ##
==========================================
- Coverage 97.58% 97.30% -0.29%
==========================================
Files 24 24
Lines 1906 1930 +24
Branches 381 385 +4
==========================================
+ Hits 1860 1878 +18
- Misses 17 21 +4
- Partials 29 31 +2
and multi-labeled labels. If multi_label is set to None (default)
this method will infer if multi_label is True or False based on
the format of labels.
This allows for a more general form of multiclass labels that looks
Should we extend this to all multilabel formats?
Just a question; I can do this once it's merged.
Only if you need it for multilabel find-label-issues to work when some of the classes never appear in the data.
Our main priority is just to get multilabel find-label-issues to work when some of the classes never appear in the data.
Extends cleanlab to support missing classes
This PR adds support when pred_probs.shape[1] > len(set(labels)) for most of the main methods in the count and filter modules. See examples below.
Completed
count and filter modules. I tested compute_confident_joint, find_label_issues, and get_confident_thresholds.
If you're curious, the most complicated part of this PR was getting everything to work smoothly with multi_label, which we did not plan for in advance (including parallelization). It should work now, but more testing is needed.
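The condition pred_probs.shape[1] > len(set(labels)) can be checked without any library. The snippet below is only an illustrative sketch: find_missing_classes is a hypothetical helper (not part of cleanlab), and the labels and probabilities are made-up data.

```python
def find_missing_classes(labels, num_classes):
    """Return the class indices in range(num_classes) that never appear in labels."""
    present = set(labels)
    return [k for k in range(num_classes) if k not in present]


# Hypothetical data: three probability columns, but class 1 is never used as a label.
labels = [0, 0, 2, 2, 2]
pred_probs = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.10, 0.80],
    [0.20, 0.20, 0.60],
    [0.10, 0.30, 0.60],
]

num_classes = len(pred_probs[0])  # plays the role of pred_probs.shape[1]
has_missing = num_classes > len(set(labels))
missing = find_missing_classes(labels, num_classes)

print(has_missing, missing)  # True [1]
```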
TODO
CleanLearning works with the new functionality. Currently only the filter and count modules are supported.
Tests TODO (strongly recommend adding/checking these tests)
All methods in count and filter work with missing classes, not just the three main ones I checked.
CleanLearning works with missing classes.
Useful (and illustrative) sanity checks
1. removing a class by changing the fifth label from 1 --> 2
LOOKS GOOD! Confident joint updates correctly. Label errors update correctly.
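A sanity check of this kind can be reproduced in spirit with a simplified, pure-Python version of the confident joint. This is a sketch, not cleanlab's implementation: the data is made up, the function names are hypothetical, and it assumes a class with no labels gets an infinite threshold so that no example is ever counted as confidently belonging to it.

```python
import math


def confident_thresholds(labels, pred_probs):
    """Per-class threshold: mean predicted probability of class k over examples labeled k.

    Assumption for this sketch: a class with no labeled examples gets threshold inf.
    """
    num_classes = len(pred_probs[0])
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else math.inf)
    return thresholds


def confident_joint(labels, pred_probs):
    """joint[i][j] counts examples labeled i that are confidently predicted as class j."""
    num_classes = len(pred_probs[0])
    thresholds = confident_thresholds(labels, pred_probs)
    joint = [[0] * num_classes for _ in range(num_classes)]
    for y, p in zip(labels, pred_probs):
        confident = [j for j in range(num_classes) if p[j] >= thresholds[j]]
        if confident:
            # Among the confident classes, pick the one with the highest probability.
            j_star = max(confident, key=lambda j: p[j])
            joint[y][j_star] += 1
    return joint


# Hypothetical data. The fifth label is the only example of class 1.
pred_probs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
]
labels_before = [0, 0, 2, 2, 1, 2]
labels_after = [0, 0, 2, 2, 2, 2]  # fifth label changed from 1 to 2, so class 1 is missing

joint_before = confident_joint(labels_before, pred_probs)
joint_after = confident_joint(labels_after, pred_probs)

print(joint_before)  # [[1, 0, 0], [0, 1, 0], [1, 0, 2]]
print(joint_after)   # [[1, 0, 0], [0, 0, 0], [1, 0, 2]]
```

In the second joint, the row and column for class 1 are all zeros, and the fifth example is no longer counted anywhere even though its probability mass sits on class 1.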
2. Does multi-label work with missing classes?
OUTPUT (looks good!)
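For the multi-label case, the first step is working out which classes appear at all when each label is a list of classes. The sketch below shows one way this could look; infer_multi_label and unique_classes are hypothetical helpers written for illustration, and the inference rule (every label is a list, tuple, or set) is an assumption rather than cleanlab's actual rule.

```python
def infer_multi_label(labels):
    """Guess the label format: True if every entry is a collection of classes."""
    return all(isinstance(y, (list, tuple, set)) for y in labels)


def unique_classes(labels, multi_label=None):
    """Set of classes that appear in labels; multi_label=None means infer the format."""
    if multi_label is None:
        multi_label = infer_multi_label(labels)
    if multi_label:
        return {k for y in labels for k in y}
    return set(labels)


# Hypothetical multi-label data with 4 possible classes; classes 1 and 3 never appear.
multi_labels = [[0], [0, 2], [2], []]
num_classes = 4

present = unique_classes(multi_labels)
missing = [k for k in range(num_classes) if k not in present]

print(sorted(present), missing)  # [0, 2] [1, 3]
```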
Two logical issues to be aware of (this PR does NOT renormalize predicted probabilities)
These are not errors, but consider alternatives. Currently, if the predicted probabilities are high on a missing class, a label error will not be detected, since there are no labels in that class and thus no data for cleanlab to work with for that class. That probability mass is effectively lost. I left this as is because I recommend merging this PR first. This PR handles the software mechanics of supporting missing classes. It's complicated enough as is. It will be easy to decide whether you want to renormalize once this PR is merged.
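For reference, the renormalization alternative that this PR does not implement could look roughly like the sketch below. drop_and_renormalize is a hypothetical helper, not a cleanlab function. Note that dropping columns changes the column indices of the remaining classes, which is one reason the decision is left for later.

```python
def drop_and_renormalize(pred_probs, missing):
    """Remove the columns of missing classes and rescale each row to sum to 1."""
    missing = set(missing)
    result = []
    for row in pred_probs:
        kept = [p for k, p in enumerate(row) if k not in missing]
        total = sum(kept)
        if total == 0:
            # All mass was on missing classes; fall back to a uniform row.
            result.append([1.0 / len(kept)] * len(kept))
        else:
            result.append([p / total for p in kept])
    return result


# Hypothetical row with most of its mass on missing class 1.
renormalized = drop_and_renormalize([[0.1, 0.8, 0.1]], missing=[1])
print(renormalized)  # [[0.5, 0.5]]
```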
1. Example of label error not counted because prob mass on missing class
2. Example of label error no longer being detected as probability mass shifts to a missing class
See the fourth True/False value
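The effect can be isolated on a single example. The sketch below uses a simplified flagging rule (an example is flagged when its most probable confident class differs from its given label) and made-up thresholds, with an infinite threshold for the missing class as an assumption. is_flagged is a hypothetical helper, not cleanlab's find_label_issues.

```python
import math


def is_flagged(given_label, probs, thresholds):
    """Flag an example if its most probable confident class differs from its label."""
    confident = [j for j, p in enumerate(probs) if p >= thresholds[j]]
    if not confident:
        return False  # no class is confident, so there is nothing to compare against
    best = max(confident, key=lambda j: probs[j])
    return best != given_label


# Hypothetical thresholds: class 1 is missing, so it can never be confident.
thresholds = [0.6, math.inf, 0.6]

# Same given label (0); the probability mass moves from class 2 to missing class 1.
mass_on_present_class = [0.1, 0.1, 0.8]
mass_on_missing_class = [0.1, 0.8, 0.1]

flags = [
    is_flagged(0, mass_on_present_class, thresholds),
    is_flagged(0, mass_on_missing_class, thresholds),
]
print(flags)  # [True, False]
```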