
Major API change. Introducing Cleanlab 2.0 #128

Merged: 38 commits, Mar 16, 2022

Conversation

cgnorthcutt
Member

@cgnorthcutt cgnorthcutt commented Mar 10, 2022

CHANGELOG Cleanlab 1.0.1 --> 2.0:

At a high level, this update redesigns cleanlab to be scalable and extensible for growth (e.g., adding new ways to rank data and labels, new methods for computing data and label quality, and new tasks like regression and object detection). This update also simplifies most of the naming conventions, redesigning cleanlab to be more developer-friendly and less academic.

Module name changes:

  • pruning.py --> filter.py
  • latent_estimation.py --> count.py
  • parent module/folder models/ --> example_models/

New module created:

  • rank.py
    • moved all ranking and ordering functions from pruning.py/filter.py to here

Method name changes:

  • pruning.get_noise_indices() --> filter.find_label_issues()
  • count.num_label_errors() --> count.num_label_issues()

Methods added:

  • rank.py adds
    • two ranking functions that rank every example in a dataset by label quality (not just the examples with label issues)
    • get_self_confidence_for_each_label()
    • get_normalized_margin_for_each_label()
  • filter.py adds
    • two more filtering methods for filter.find_label_issues() (select via the filter_by parameter):
      • confident_learning, which has been shown to work very well and may become the default in the future, and
      • predicted_neq_given, which is useful for benchmarking a simple baseline approach but underperforms the other filter_by methods
  • classification.py adds
    • LearningWithNoisyLabels.get_label_issues()
      • canonical one-line usage: LearningWithNoisyLabels().fit(X, y).get_label_issues()
      • no need to compute predicted probabilities in advance
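The two new ranking scores are simple to state. Below is a minimal NumPy sketch (illustrative, not cleanlab's actual implementation) of what they compute, following the self-confidence and normalized-margin definitions quoted later in this thread:

```python
import numpy as np

def get_self_confidence_for_each_label(labels, pred_probs):
    """Quality score: the model's predicted probability of the given label."""
    return pred_probs[np.arange(len(labels)), labels]

def get_normalized_margin_for_each_label(labels, pred_probs):
    """Quality score: p(label = k) - max(p(label != k)); low margin = likely issue."""
    self_confidence = pred_probs[np.arange(len(labels)), labels]
    masked = pred_probs.copy()
    masked[np.arange(len(labels)), labels] = -np.inf  # exclude the given label
    return self_confidence - masked.max(axis=1)

# Toy example: 3 examples, 3 classes
labels = np.array([0, 1, 2])
pred_probs = np.array([
    [0.9, 0.05, 0.05],  # confident, correct-looking label
    [0.2, 0.3, 0.5],    # given label 1 is not the argmax -> negative margin
    [0.1, 0.1, 0.8],
])
print(get_self_confidence_for_each_label(labels, pred_probs))   # [0.9 0.3 0.8]
print(get_normalized_margin_for_each_label(labels, pred_probs)) # [0.85 -0.2 0.7]
```

Sorting a dataset by either score (ascending) orders it from most to least likely to contain a label issue.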

Naming conventions changed in method names, comments, parameters, etc.

  • s --> labels
  • psx -> pred_probs
  • label_errors --> label_issues
  • noise_mask --> label_issues_mask
  • label_errors_bool --> label_issues_mask
  • prune_method --> filter_by
  • prob_given_label --> self_confidence
  • pruning --> filtering

Parameter re-ordering:

  • re-ordered (labels, pred_probs) parameters to be consistent (in that order) in all methods.
  • re-ordered parameters (e.g. frac_noise) in filter.find_label_issues()

Parameter changes:

  • in order_label_issues()
    • param: sorted_index_method --> rank_by
  • in find_label_issues()
    • param: sorted_index_method --> return_indices_ranked_by
    • param: prune_method --> filter_by

Global variables changed:

  • filter.py
    • Only require 1 example to be left in each class
    • MIN_NUM_PER_CLASS = 5 --> MIN_NUM_PER_CLASS = 1
    • enables cleanlab to work for toy-sized datasets

@codecov-commenter

codecov-commenter commented Mar 10, 2022

Codecov Report

Merging #128 (71a79d7) into master (b1ea583) will increase coverage by 0.38%.
The diff coverage is 94.80%.

@@            Coverage Diff             @@
##           master     #128      +/-   ##
==========================================
+ Coverage   86.50%   86.89%   +0.38%     
==========================================
  Files          12       11       -1     
  Lines         956      908      -48     
  Branches      163      166       +3     
==========================================
- Hits          827      789      -38     
+ Misses        115      103      -12     
- Partials       14       16       +2     
| Impacted Files | Coverage | Δ |
| --- | --- | --- |
| cleanlab/coteaching.py | 0.00% <ø> | (ø) |
| cleanlab/utils/util.py | 100.00% <ø> | (ø) |
| cleanlab/version.py | 100.00% <ø> | (ø) |
| cleanlab/filter.py | 91.77% <91.77%> | (ø) |
| cleanlab/count.py | 93.96% <96.22%> | (ø) |
| cleanlab/__init__.py | 100.00% <100.00%> | (ø) |
| cleanlab/classification.py | 100.00% <100.00%> | (+2.24%) ⬆️ |
| cleanlab/example_models/mnist_pytorch.py | 97.61% <100.00%> | (ø) |
| cleanlab/noise_generation.py | 97.18% <100.00%> | (-2.12%) ⬇️ |
| cleanlab/rank.py | 100.00% <100.00%> | (ø) |
| ... and 6 more | | |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b1ea583...71a79d7.

Review threads (outdated, resolved): cleanlab/rank.py, cleanlab/filter.py
Method to order label error indices (instead of a bool mask), either:
'normalized_margin' := normalized margin (p(s = k) - max(p(s != k)))
'prob_given_label' := [psx[i][labels[i]] for i in label_errors_idx]
psx : np.array (shape (N, K))
Contributor


I prefer pred_probs. Most intuitive name for me.

cleanlab/rank.py Outdated Show resolved Hide resolved
@cgnorthcutt
Member Author

@jwmueller @calebchiam @JohnsonKuan - Note that sometimes prob_given_label is used and other times self_confidence is used throughout the repo. Changing to self_confidence everywhere.

Review threads (outdated, resolved): cleanlab/util.py, README.md
Member

@anishathalye anishathalye left a comment


In some functions, we are switching parameter order in this release. Related to that: can we switch to using kw-only arguments for a lot of functionality where we don't expect people to use positional arguments? E.g. the signature of find_label_issues can be def find_label_issues(labels, pred_probs, *, confident_joint=None, ...). This way, we'll have freedom to move things around / insert parameters as we please, and we won't break any client code.
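The keyword-only idea is easy to demonstrate; here is a minimal sketch of such a signature (illustrative parameter set and defaults, not the final API):

```python
# Parameters after the bare `*` must be passed by name, so later releases can
# reorder or insert parameters without breaking any client code.
def find_label_issues(labels, pred_probs, *, confident_joint=None, filter_by="prune_by_noise_rate"):
    return {"confident_joint": confident_joint, "filter_by": filter_by}

# Positional use of the two primary arguments still works:
find_label_issues([0, 1], [[0.9, 0.1], [0.2, 0.8]])

# But passing confident_joint positionally raises a TypeError:
try:
    find_label_issues([0, 1], [[0.9, 0.1], [0.2, 0.8]], None)
except TypeError as e:
    print("rejected:", e)
```

This makes the (labels, pred_probs) re-ordering in this release the last positional breakage clients would ever see for these functions.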

There are a number of weird issues in the diff, probably from a search-and-replace or from an imprecise refactoring tool. I highlighted a couple of them. As we get closer to finalizing this PR, we should take a close line-by-line look at the diff to make sure we haven't missed any.


```python
# Compute psx (n x m matrix of predicted probabilities) on your own, with any classifier.
# Here is an example that shows in detail how to compute psx on CIFAR-10:
# Compute pred_probs (n x m matrix of predicted probabilities) on your own, with any classifier.
```
Member


Should we be careful to always mention "out-of-sample predicted probabilities" every time we say pred_probs? Or alternatively, come up with a different name for pred_probs that makes it more obvious that these should be out-of-sample? We've seen a couple examples of people doing the wrong thing in the wild, training on the dataset and then just evaluating the trained model on that same dataset to compute psx to feed into cleanlab.
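Whatever the parameter ends up being called, the out-of-sample requirement is straightforward to satisfy with cross-validation. A hedged sketch using scikit-learn (not part of this PR; dataset and model are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, labels = make_classification(n_samples=200, n_classes=3, n_informative=4, random_state=0)

# cross_val_predict guarantees each row of pred_probs comes from a model that
# never saw that example during training (held-out / out-of-sample), which is
# exactly what cleanlab expects -- NOT predictions from a model fit on all of X.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
)
print(pred_probs.shape)  # one probability per class, per example
```

Training on the full dataset and scoring it with the same model inflates self-confidence on mislabeled points, which is the in-the-wild mistake described above.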

Member Author


@anishathalye oos_pred_probs? or pred_probs_cv? or heldout_pred_probs

Member Author


I might want to skip this change. The comment just below that line already says:
# Be sure you compute probs in a holdout/out-of-sample manner (e.g. via cross-validation)

ordered_label_issues = find_label_issues(
    labels=numpy_array_of_noisy_labels,
    pred_probs=numpy_array_of_predicted_probabilities,
    return_indices_ranked_by='normalized_margin',  # Orders label issues
)
Member


Minor point, but I prefer the imperative rank_by=.

Member Author


In our previous discussion, we agreed we should signal that this changes the format of the return value from a bool mask to indices. Hence return_indices_ranked_by was a good choice over just rank_by, which doesn't clearly indicate that the return type is changing (and will be a different length as well).

Member Author


Note that all the methods in rank.py use rank_by. Only this method which changes the return type uses a different parameter name, to make that clear.
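The distinction being drawn here can be shown in a few lines of NumPy (illustrative sketch with a made-up threshold, not cleanlab's code): a bool mask only flags issues, while return-indices-ranked-by also orders the flagged examples by a quality score, so both the type and the length of the result change.

```python
import numpy as np

labels = np.array([0, 1, 2, 1])
pred_probs = np.array([
    [0.9, 0.05, 0.05],
    [0.6, 0.3, 0.1],    # label 1 looks wrong (self-confidence 0.3)
    [0.1, 0.2, 0.7],
    [0.45, 0.5, 0.05],  # label 1 plausible but low-confidence
])

# A bool mask flags *which* examples are issues (same length as the dataset).
self_confidence = pred_probs[np.arange(len(labels)), labels]
label_issues_mask = self_confidence < 0.6
print(label_issues_mask)  # [False  True False  True]

# Ranked indices additionally order the issues from worst to best
# (different type, different length than the mask).
issue_indices = np.flatnonzero(label_issues_mask)
ranked = issue_indices[np.argsort(self_confidence[issue_indices])]
print(ranked)  # [1 3] -- example 1 (0.3) ranks before example 3 (0.5)
```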

Review threads (outdated, resolved): cleanlab/coteaching.py, cleanlab/example_models/README.md, cleanlab/filter.py
@cgnorthcutt cgnorthcutt reopened this Mar 16, 2022
@cgnorthcutt cgnorthcutt merged commit 8f9f3f5 into cleanlab:master Mar 16, 2022
6 participants