cleanlab is a machine learning Python package for learning with noisy labels and finding label errors in datasets. cleanlab CLEANs LABels.
cleanlab finds and cleans label errors in any dataset using state-of-the-art algorithms for learning with noisy labels and estimating the complete joint distribution of label noise. cleanlab is fast: it's built on optimized algorithms and parallelized across all CPU threads automatically. cleanlab implements the family of theory and algorithms called confident learning, with provable guarantees of exact noise estimation and label error finding (when model output probabilities are exact).
cleanlab is:
- fast - runs in parallel on all CPU threads automatically (e.g. < 1 second to find label errors in ImageNet).
- robust - provable generalization and risk minimization guarantees, even with imperfect probability estimation.
- general - works with any probabilistic classifier: PyTorch, TensorFlow, MXNet, Caffe2, scikit-learn, etc.
- unique - the only package for multiclass learning with noisy labels or finding label errors for any dataset / classifier.
```python
from cleanlab.classification import LearningWithNoisyLabels
from sklearn.linear_model import LogisticRegression

# Learning with noisy labels in 3 lines of code!

# Wrap around any classifier. Yup, you can use sklearn/PyTorch/TensorFlow/FastText/etc.
lnl = LearningWithNoisyLabels(clf=LogisticRegression())
lnl.fit(X=X_train_data, s=train_noisy_labels)

# Estimate the predictions you would have gotten by training with *no* label errors.
predicted_test_labels = lnl.predict(X_test)
```
Check out these examples and tests (includes how to use PyTorch, FastText, etc.).
Python 2.7, 3.4, 3.5, and 3.6 are supported.
Stable release:
$ pip install cleanlab
Developer (unstable) release:
$ pip install git+https://github.com/cgnorthcutt/cleanlab.git
To install the codebase (enabling you to make modifications):
$ conda update pip  # if you use conda
$ git clone https://github.com/cgnorthcutt/cleanlab.git
$ cd cleanlab
$ pip install -e .
Although this package goes far beyond our 2017 publication, if you find this repository helpful, please cite our paper http://auai.org/uai2017/proceedings/papers/35.pdf. New papers will be posted here when they are published.
```
@inproceedings{northcutt2017rankpruning,
 author={Northcutt, Curtis G. and Wu, Tailin and Chuang, Isaac L.},
 title={Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels},
 booktitle={Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence},
 series={UAI'17},
 year={2017},
 location={Sydney, Australia},
 numpages={10},
 url={http://auai.org/uai2017/proceedings/papers/35.pdf},
 publisher={AUAI Press},
}
```
Most of the algorithms, theory, and results of cleanlab remain unpublished. If you'd like to work together, please reach out.
We use cleanlab to automatically identify ~50 label errors in the MNIST dataset.
Label errors of the original MNIST train dataset identified algorithmically using the rankpruning algorithm. Depicts the 24 least confident labels, ordered left-right, top-down by increasing self-confidence (probability of belonging to the given label), denoted conf in teal. The label with the largest predicted probability is in green. Overt errors are in red.
We use cleanlab to automatically learn with noisy labels regardless of dataset distribution or classifier.
Each figure depicts the decision boundary learned using cleanlab.classification.LearningWithNoisyLabels in the presence of extreme (~35%) label errors. Label errors are circled in green. Label noise is class-conditional (not simply uniformly random). Columns are organized by the classifier used, except the left-most column, which depicts the ground-truth dataset distribution. Rows are organized by dataset used. A matrix characterizing the label noise for the first row is shown below.
Each figure depicts accuracy scores on a test set as decimal values:
- LEFT (in black): The classifier test accuracy trained with perfect labels (no label errors).
- MIDDLE (in blue): The classifier test accuracy trained with noisy labels using cleanlab.
- RIGHT (in white): The baseline classifier test accuracy trained with noisy labels.
As an example, this is the noise matrix (noisy channel) P(s | y) characterizing the label noise for the first dataset row in the figure. s represents the observed noisy labels and y represents the latent, true labels. The trace of this matrix is 2.6. A trace of 4 implies no label noise. A cell in this matrix is read like, "A random 38% of '3' labels were flipped to '2' labels."
| p(s\|y) | y=0 | y=1 | y=2 | y=3 |
|---------|------|------|------|------|
| s=0 | 0.55 | 0.01 | 0.07 | 0.06 |
| s=1 | 0.22 | 0.87 | 0.24 | 0.02 |
| s=2 | 0.12 | 0.04 | 0.64 | 0.38 |
| s=3 | 0.11 | 0.08 | 0.05 | 0.54 |
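For a quick sanity check of this matrix with NumPy (the values are copied from the table above):

```python
import numpy as np

# P(s|y): rows are observed labels s, columns are latent true labels y.
noise_matrix = np.array([
    [0.55, 0.01, 0.07, 0.06],
    [0.22, 0.87, 0.24, 0.02],
    [0.12, 0.04, 0.64, 0.38],
    [0.11, 0.08, 0.05, 0.54],
])

print(np.trace(noise_matrix))    # 2.6 (a trace of 4 would mean no label noise)
print(noise_matrix.sum(axis=0))  # each column sums to 1, since P(s|y) is a conditional distribution
print(noise_matrix[2, 3])        # 0.38: the fraction of '3' labels flipped to '2'
```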
The code to reproduce this figure is available here.
New to cleanlab? Start with:
- Visualizing confident learning
- A simple example of learning with noisy labels on the multiclass Iris dataset.
These examples show how easy it is to characterize label noise in datasets, learn with noisy labels, identify label errors, estimate latent priors and noisy channels, and more.
All of the features of the cleanlab package work with any model. Yes, any model. Feel free to use PyTorch, TensorFlow, Caffe2, scikit-learn, MXNet, etc. If you use a scikit-learn classifier, all cleanlab methods work out-of-the-box. It's also easy to use your favorite model from a non-scikit-learn package: just wrap your model in a Python class that inherits from sklearn.base.BaseEstimator:
```python
from sklearn.base import BaseEstimator

class YourFavoriteModel(BaseEstimator):  # Inherits from the sklearn base classifier.
    def __init__(self):
        pass

    def fit(self, X, y, sample_weight=None):
        pass

    def predict(self, X):
        pass

    def predict_proba(self, X):
        pass

    def score(self, X, y, sample_weight=None):
        pass

# Now you can use your model with `cleanlab`. Here's one example:
from cleanlab.classification import LearningWithNoisyLabels
lnl = LearningWithNoisyLabels(clf=YourFavoriteModel())
lnl.fit(train_data, train_labels_with_errors)
```
Want to see a working example? Here's a compliant PyTorch MNIST CNN class.
As you can see here, technically you don't actually need to inherit from sklearn.base.BaseEstimator; you can just create a class that defines .fit(), .predict(), and .predict_proba(). However, inheriting makes downstream scikit-learn applications like hyper-parameter optimization work seamlessly. For example, the LearningWithNoisyLabels() model is fully compliant.
Note, some libraries exist to do this for you. For PyTorch, check out the skorch Python library, which will wrap your PyTorch model into a scikit-learn compliant model, as sketched below.
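For example, here is a rough sketch (not from the cleanlab docs; `TinyNet`, the layer sizes, and the hyperparameters are made up) of wrapping a PyTorch module with skorch and handing it to cleanlab. Check your skorch version's docs for what the module's forward pass should return so that predict_proba yields class probabilities:

```python
import torch.nn as nn
from skorch import NeuralNetClassifier
from cleanlab.classification import LearningWithNoisyLabels

class TinyNet(nn.Module):  # hypothetical toy network, just for illustration
    def __init__(self, num_features=20, num_classes=2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(num_features, 32),
            nn.ReLU(),
            nn.Linear(32, num_classes),
            nn.Softmax(dim=-1),  # skorch's default NLLLoss setup expects probabilities
        )

    def forward(self, X):
        return self.layers(X)

# skorch turns the torch module into a scikit-learn compatible classifier
# (fit / predict / predict_proba), which is all cleanlab needs.
model = NeuralNetClassifier(TinyNet, max_epochs=10, lr=0.01)

lnl = LearningWithNoisyLabels(clf=model)
# skorch expects float32 features and int64 labels.
lnl.fit(X_train.astype('float32'), train_labels_with_errors.astype('int64'))
```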
- cleanlab/classification.py - The LearningWithNoisyLabels() class for learning with noisy labels.
- cleanlab/latent_algebra.py - Equalities when noise information is known.
- cleanlab/latent_estimation.py - Estimates and fully characterizes all variants of label noise.
- cleanlab/noise_generation.py - Generate mathematically valid synthetic noise matrices.
- cleanlab/polyplex.py - Characterizes joint distribution of label noise EXACTLY from noisy channel.
- cleanlab/pruning.py - Finds the indices of the examples with label errors in a dataset.
Many of these methods have default parameters that won’t be covered here. Check out the method docstrings for full documentation.
rankpruning is a fast, general, robust algorithm for multiclass learning with noisy labels. It adds minimal overhead, needing only O(nm²) time for n training examples and m classes, works with any classifier, and is easy to use. Here is the example from above, with added comments for clarity.
```python
# LearningWithNoisyLabels implements a faster,
# cross-platform and more-compatible version of the RankPruning
# algorithm for learning with noisy labels. Unlike the original
# algorithm, which only worked for binary classification,
# LearningWithNoisyLabels generalizes the theory and algorithms
# of RankPruning to any number of classes.
from cleanlab.classification import LearningWithNoisyLabels

# LearningWithNoisyLabels uses logreg by default, so this is unnecessary.
# We include it here for clarity, but this step is omitted below.
from sklearn.linear_model import LogisticRegression as logreg

# 1. Wrap around any classifier. Yup, neural networks work, too.
lnl = LearningWithNoisyLabels(clf=logreg())

# 2. X_train is a numpy matrix of training examples (integers for large data).
# train_labels_with_errors is a numpy array of labels of length n (# of examples), usually denoted 's'.
lnl.fit(X_train, train_labels_with_errors)

# 3. Estimate the predictions you would have gotten by training with *no* label errors.
predicted_test_labels = lnl.predict(X_test)
```
Estimate the confident joint, the latent noisy channel matrix P(s | y) and its inverse P(y | s), the latent prior of the unobserved, actual true labels p(y), and the predicted probabilities:
where s denotes a random variable representing the observed, noisy label and y denotes a random variable representing the hidden, actual label. Both s and y take any of the m classes as values. The cleanlab package supports different levels of granularity for computation depending on the needs of the user. Because of this, we support multiple alternatives, all no more than a few lines, to estimate these latent distribution arrays, enabling the user to reduce computation time by only computing what they need to compute, as seen in the examples below.
Throughout these examples, you'll see a variable called confident_joint. The confident joint is an m x m matrix (m is the number of classes) that counts, for every observed, noisy class, the number of examples that confidently belong to every latent, hidden class. It counts the number of examples that we are confident are labeled correctly or incorrectly for every pair of observed and unobserved classes. The confident joint is an unnormalized estimate of the complete-information latent joint distribution, P(s,y). Most of the methods in the cleanlab package start by first estimating the confident_joint.
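As a rough illustration (the counts below are made up; in practice cleanlab estimates and calibrates this for you), normalizing the confident joint's counts yields an estimate of the joint distribution P(s,y):

```python
import numpy as np

# Made-up confident joint for m = 3 classes: entry [i, j] counts examples with
# observed label s = i that we are confident actually belong to class y = j.
confident_joint = np.array([
    [120,   5,   2],
    [  8, 200,   4],
    [  3,   9,  90],
])

# Dividing by the total count gives a rough (uncalibrated) estimate of P(s, y).
joint_estimate = confident_joint / confident_joint.sum()
print(joint_estimate.sum())  # 1.0
```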
```python
from cleanlab.latent_estimation import estimate_latent
from cleanlab.latent_estimation import estimate_confident_joint_and_cv_pred_proba

# Compute the confident joint and the n x m predicted probabilities matrix (psx),
# for n examples, m classes. Stop here if all you need is the confident joint.
confident_joint, psx = estimate_confident_joint_and_cv_pred_proba(
    X=X_train,
    s=train_labels_with_errors,
    clf=logreg(),  # default, you can use any classifier
)

# Estimate latent distributions: p(y) as est_py, P(s|y) as est_nm, and P(y|s) as est_inv.
est_py, est_nm, est_inv = estimate_latent(confident_joint, s=train_labels_with_errors)
```
```python
# Alternatively, a single call estimates everything at once.
from cleanlab.latent_estimation import estimate_py_noise_matrices_and_cv_pred_proba

est_py, est_nm, est_inv, confident_joint, psx = estimate_py_noise_matrices_and_cv_pred_proba(
    X=X_train,
    s=train_labels_with_errors,
)
```
```python
# Already have psx? (n x m matrix of predicted probabilities)
# For example, you might get them from a pre-trained model (like ResNet on ImageNet).
# With the cleanlab package, you estimate directly with psx.
from cleanlab.latent_estimation import estimate_py_and_noise_matrices_from_probabilities

est_py, est_nm, est_inv, confident_joint = estimate_py_and_noise_matrices_from_probabilities(
    s=train_labels_with_errors,
    psx=psx,
)
```
With the cleanlab package, we can instantly fetch the indices of all estimated label errors, with nothing provided by the user except a classifier, examples, and their noisy labels. Like the previous example, there are various levels of granularity.
```python
from cleanlab.pruning import get_noise_indices

# We computed psx, est_inv, confident_joint in the previous example.
label_errors = get_noise_indices(
    s=train_labels_with_errors,  # required
    psx=psx,  # required
    inverse_noise_matrix=est_inv,  # not required, include to avoid recomputing
    confident_joint=confident_joint,  # not required, include to avoid recomputing
)
```
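Once the likely label errors are identified, a common next step is to inspect them or prune them before retraining. A short sketch, assuming `label_errors` is a boolean mask over the training examples (see the get_noise_indices docstring for the exact return format):

```python
import numpy as np

# Indices of the examples flagged as likely label errors.
flagged_indices = np.arange(len(train_labels_with_errors))[label_errors]
print(flagged_indices[:10])  # inspect a few by hand

# Or drop the flagged examples and keep the rest.
X_pruned = X_train[~label_errors]
s_pruned = train_labels_with_errors[~label_errors]
```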
To compute P(s,y), the complete-information joint distribution matrix (which, when multiplied by the total number of examples, i.e. n * P(s,y), captures the number of pairwise label flip errors), use the method below. Via cleanlab.latent_estimation.calibrate_confident_joint, it guarantees that the rows of P(s,y) correctly sum to p(s) and that the entire matrix sums to 1.
This method is used when the hyperparameter prune_count_method='inverse_nm_dot_s' is set in LearningWithNoisyLabels.fit() and get_noise_indices().
```python
from cleanlab.latent_estimation import estimate_confident_joint_from_probabilities

joint = estimate_confident_joint_from_probabilities(s=noisy_labels, psx=probabilities)
```
If you already have the confident joint, then you can quickly estimate the complete joint distribution of label noise by:
```python
from cleanlab.latent_estimation import estimate_joint

joint = estimate_joint(confident_joint=cj, s=noisy_labels, psx=probabilities)
```
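As a quick sanity check on the calibration properties described above (a sketch; `noisy_labels` and `joint` come from the snippets above):

```python
import numpy as np

# Empirical prior of the observed noisy labels, p(s).
p_s = np.bincount(noisy_labels) / float(len(noisy_labels))

# A calibrated joint sums to 1 overall, and its rows (indexed by s) sum to p(s).
assert np.isclose(joint.sum(), 1.0)
assert np.allclose(joint.sum(axis=1), p_s)
```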
```python
# Generate a valid (necessary conditions for learnability are met) noise matrix for any trace > 1.
from cleanlab.noise_generation import generate_noise_matrix_from_trace
noise_matrix = generate_noise_matrix_from_trace(
    K=number_of_classes,
    trace=float_value_greater_than_1_and_leq_K,
    py=prior_of_y_actual_labels_which_is_just_an_array_of_length_K,
    frac_zero_noise_rates=float_from_0_to_1_controlling_sparsity,
)

# Check if a noise matrix is valid (necessary conditions for learnability are met).
from cleanlab.noise_generation import noise_matrix_is_valid
is_valid = noise_matrix_is_valid(noise_matrix, prior_of_y_which_is_just_an_array_of_length_K)

# Generate noisy labels using the noise_matrix. Guarantees the exact amount of noise in the labels.
from cleanlab.noise_generation import generate_noisy_labels
s_noisy_labels = generate_noisy_labels(y_hidden_actual_labels, noise_matrix)

# This package is full of other useful methods for learning with noisy labels.
# The tutorial stops here, but you don't have to. Inspect method docstrings for full docs.
```
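As a concrete, purely illustrative sketch (the class count, trace, and prior below are made up to mirror the 4-class example earlier in this README), here is how these pieces fit together:

```python
import numpy as np
from cleanlab.noise_generation import generate_noise_matrix_from_trace, generate_noisy_labels

K = 4  # number of classes
py = np.array([0.25, 0.25, 0.25, 0.25])  # assumed uniform prior over the true labels

# A valid noise matrix P(s|y) with the same overall noise level (trace 2.6) as the example above.
noise_matrix = generate_noise_matrix_from_trace(
    K=K,
    trace=2.6,
    py=py,
    frac_zero_noise_rates=0.0,
)

# Corrupt some hypothetical ground-truth labels with exactly this noise.
y_true = np.random.randint(0, K, size=1000)
s_noisy = generate_noisy_labels(y_true, noise_matrix)
print(np.mean(s_noisy != y_true))  # roughly 1 - trace / K = 0.35 of labels are flipped
```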
The key to learning in the presence of label errors is estimating the joint distribution between the actual, hidden labels 'y' and the observed, noisy labels 's'. Using cleanlab and the theory of confident learning, we can completely characterize the trace of the latent joint distribution, trace(P(s,y)), given p(y), for any fraction of label errors, i.e. for any trace of the noisy channel, trace(P(s|y)).
You can check out how to do this yourself here: 1. Drawing Polyplices 2. Computing Polyplices
Copyright (c) 2017-2019 Curtis Northcutt. Released under the MIT License. See LICENSE for details.