Major API change. Introducing Cleanlab 2.0 #128
Codecov Report

```
@@            Coverage Diff            @@
##           master     #128     +/-  ##
=========================================
+ Coverage   86.50%   86.89%   +0.38%
=========================================
  Files          12       11       -1
  Lines         956      908      -48
  Branches      163      166       +3
=========================================
- Hits          827      789      -38
+ Misses        115      103      -12
- Partials       14       16       +2
```
From `cleanlab/filter.py`:

```
Method to order label error indices (instead of a bool mask), either:
'normalized_margin' := normalized margin (p(s = k) - max(p(s != k)))
'prob_given_label' := [psx[i][labels[i]] for i in label_errors_idx]

psx : np.array (shape (N, K))
```
I prefer `pred_probs`. Most intuitive name for me.
@jwmueller @calebchiam @JohnsonKuan - Note that sometimes
In some functions, we are switching parameter order in this release. Related to that: can we switch to using keyword-only arguments for a lot of functionality where we don't expect people to use positional arguments? E.g. the signature of `find_label_issues` can be `def find_label_issues(labels, pred_probs, *, confident_joint=None, ...)`. This way, we'll have the freedom to move things around / insert parameters as we please, and we won't break any client code.
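A minimal sketch of the keyword-only pattern being proposed (the stub body and the `filter_by` default value here are placeholders for illustration, not cleanlab's real implementation):

```python
# Parameters after the bare `*` are keyword-only: callers must name them, so
# maintainers can later reorder or insert keyword-only parameters without
# breaking existing call sites. (Stub body for illustration only.)
def find_label_issues(labels, pred_probs, *, confident_joint=None, filter_by="prune_by_noise_rate"):
    return {"confident_joint": confident_joint, "filter_by": filter_by}

# Keyword call: fine, and robust to future reordering of keyword-only params.
result = find_label_issues([0, 1], [[0.9, 0.1], [0.2, 0.8]], filter_by="confident_learning")

# Positional call to a keyword-only parameter: TypeError at call time.
try:
    find_label_issues([0, 1], [[0.9, 0.1], [0.2, 0.8]], None)
except TypeError:
    print("positional use of a keyword-only argument is rejected")
```

The trade-off is slightly more verbose call sites, in exchange for a stable public API across releases.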
There are a bunch of weird issues in the diff, probably from a search-and-replace or from an imprecise refactoring tool. I highlighted a couple of them. As we get closer to finalizing this PR, we should take a close line-by-line look at the diff to make sure we haven't missed any.
```diff
-# Compute psx (n x m matrix of predicted probabilities) on your own, with any classifier.
 # Here is an example that shows in detail how to compute psx on CIFAR-10:
+# Compute pred_probs (n x m matrix of predicted probabilities) on your own, with any classifier.
```
Should we be careful to always mention "out-of-sample predicted probabilities" every time we say `pred_probs`? Or, alternatively, come up with a different name for `pred_probs` that makes it more obvious that these should be out-of-sample? We've seen a couple of examples of people doing the wrong thing in the wild: training on the dataset and then just evaluating the trained model on that same dataset to compute `psx` to feed into cleanlab.
@anishathalye `oos_pred_probs`? Or `pred_probs_cv`? Or `heldout_pred_probs`?
I might want to skip this change. The comment just below that line says:

```python
# Be sure you compute probs in a holdout/out-of-sample manner (e.g. via cross-validation)
ordered_label_issues = find_label_issues(
    labels=numpy_array_of_noisy_labels,
    pred_probs=numpy_array_of_predicted_probabilities,
    return_indices_ranked_by='normalized_margin',  # Orders label issues
)
```
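To make the out-of-sample requirement concrete, here is a minimal sketch of computing `pred_probs` via manual K-fold cross-validation. The frequency-based "classifier" is a toy placeholder; in practice you would plug in any real model (e.g. scikit-learn's `cross_val_predict` with `method="predict_proba"` does this for you):

```python
import numpy as np

def out_of_sample_pred_probs(X, labels, fit_predict_proba, n_folds=5, seed=0):
    """Each row of the result comes from a model that never saw that example in training."""
    n, num_classes = len(labels), int(max(labels)) + 1
    order = np.random.default_rng(seed).permutation(n)
    pred_probs = np.empty((n, num_classes))
    for fold in range(n_folds):
        holdout = order[fold::n_folds]        # examples held out in this fold
        train = np.setdiff1d(order, holdout)  # the model is fit only on the rest
        pred_probs[holdout] = fit_predict_proba(X[train], labels[train], X[holdout])
    return pred_probs

def toy_fit_predict_proba(X_train, y_train, X_test):
    # Placeholder "classifier": ignores the features and predicts the training
    # fold's class frequencies for every held-out example.
    freqs = np.bincount(y_train, minlength=int(max(y_train)) + 1) / len(y_train)
    return np.tile(freqs, (len(X_test), 1))

X = np.zeros((100, 3))
labels = np.arange(100) % 4
pred_probs = out_of_sample_pred_probs(X, labels, toy_fit_predict_proba)
```

The key invariant is that `pred_probs[i]` is produced by a model whose training fold excluded example `i`; evaluating a model on its own training data inflates the probability of the given (possibly wrong) label and hides label issues.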
Minor point, but I prefer the imperative `rank_by=`.
In our previous discussion, we agreed that we should mention that this changes the format of the return value from a bool mask to indices, and hence it was a good idea to use `return_indices_ranked_by` rather than just `rank_by`, which doesn't clearly indicate that the return type is changing (and will be a different length as well).
Note that all the methods in `rank.py` use `rank_by`. Only this method, which changes the return type, uses a different parameter name, to make that clear.
CHANGELOG Cleanlab 1.0.1 --> 2.0:

At a high level, this update redesigns `cleanlab` to be scalable and extensible for growth (e.g., adding new ways to rank data and labels, adding new methods for computing data and label quality, adding new tasks like regression and object detection, etc.). This update also simplifies most of the naming conventions, redesigning `cleanlab` to be more developer-friendly and less academic.

Module name changes:
- `pruning.py` --> `filter.py`
- `latent_estimation.py` --> `count.py`
- `models/` --> `example_models/`

New module created:
- `rank.py` (ranking and ordering functions moved from `pruning.py`/`filter.py` to here)

Method name changes:
- `pruning.get_noise_indices()` --> `filter.find_label_issues()`
- `count.num_label_errors()` --> `count.num_label_issues()`

Methods added:
- `rank.py` adds `get_self_confidence_for_each_label()` and `get_normalized_margin_for_each_label()`
- `filter.py` adds `filter.find_label_issues()` (select a method using the `filter_by` parameter; new options include `confident_learning`, which has been shown to work very well and may become the default in the future, and `predicted_neq_given`, which is useful for benchmarking a simple baseline approach but underperforms relative to the other `filter_by` methods)
- `classification.py` adds `LearningWithNoisyLabels.get_label_issues()`, e.g. `LearningWithNoisyLabels().fit(X, y).get_label_issues()`
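Based on the definitions quoted earlier in this thread (`'normalized_margin' := p(s = k) - max(p(s != k))` and `'prob_given_label' := pred_probs[i][labels[i]]`), the two new ranking scores in `rank.py` can be sketched as follows; this is an illustrative NumPy version, not cleanlab's exact implementation:

```python
import numpy as np

def get_self_confidence_for_each_label(labels, pred_probs):
    # Self-confidence: the predicted probability of each example's given label,
    # i.e. pred_probs[i, labels[i]].
    return pred_probs[np.arange(len(labels)), labels]

def get_normalized_margin_for_each_label(labels, pred_probs):
    # Normalized margin: p(s = k) - max(p(s != k)) for given label k.
    # Low (or negative) values suggest likely label issues.
    self_confidence = pred_probs[np.arange(len(labels)), labels]
    others = pred_probs.copy()
    others[np.arange(len(labels)), labels] = -np.inf  # mask out the given label
    return self_confidence - others.max(axis=1)

labels = np.array([0, 1])
pred_probs = np.array([[0.9, 0.1],
                       [0.4, 0.6]])
self_conf = get_self_confidence_for_each_label(labels, pred_probs)  # ≈ [0.9, 0.6]
margins = get_normalized_margin_for_each_label(labels, pred_probs)  # ≈ [0.8, 0.2]
```

Either score can be used to order the indices returned by `find_label_issues(..., return_indices_ranked_by=...)`, from most to least likely to be a label issue.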
Naming conventions changed in method names, comments, parameters, etc.:
- `s` --> `labels`
- `psx` --> `pred_probs`
- `label_errors` --> `label_issues`
- `noise_mask` --> `label_issues_mask`
- `label_errors_bool` --> `label_issues_mask`
- `prune_method` --> `filter_by`
- `prob_given_label` --> `self_confidence`
- `pruning` --> `filtering`
Parameter re-ordering:
- Re-ordered the (`labels`, `pred_probs`) parameters to be consistent (in that order) in all methods.
- Re-ordered parameters (e.g. `frac_noise`) in `filter.find_label_issues()`.

Parameter changes:
- In `order_label_issues()`: `sorted_index_method` --> `rank_by`
- In `find_label_issues()`: `sorted_index_method` --> `return_indices_ranked_by`, and `prune_method` --> `filter_by`

Global variables changed:
- In `filter.py`: `MIN_NUM_PER_CLASS = 5` --> `MIN_NUM_PER_CLASS = 1`