cleanlab documentation

cleanlab automatically finds and fixes label issues in your ML datasets.

This reduces manual work needed to fix data errors and helps train reliable ML models on noisy real-world datasets. cleanlab has already found thousands of label errors in ImageNet, MNIST, and other popular ML benchmarking datasets, so let's get started with yours!

Quickstart

1. Install `cleanlab`

pip

pip install cleanlab

conda

conda install -c cleanlab cleanlab

source

pip install git+https://github.com/cleanlab/cleanlab.git

2. Find label errors in your data

cleanlab finds issues in any dataset that a classifier can be trained on. The cleanlab package works with any model by using model outputs (predicted probabilities) as input -- it doesn't depend on which model created those outputs.

If you're using a scikit-learn-compatible model (option 1), you don't need to train it yourself -- pass the model, data, and labels into :py:meth:`CleanLearning.find_label_issues <cleanlab.classification.CleanLearning.find_label_issues>` and cleanlab will handle model training for you. If your model is not sklearn-compatible (option 2), pass the trained model's out-of-sample predicted probabilities into :py:func:`find_label_issues <cleanlab.filter.find_label_issues>`. Examples for both options are below.

from cleanlab.classification import CleanLearning
from cleanlab.filter import find_label_issues

# Option 1 - works with sklearn-compatible models - just input the data and labels ツ
label_issues_info = CleanLearning(clf=sklearn_compatible_model).find_label_issues(data, labels)

# Option 2 - works with ANY ML model - just input the model's predicted probabilities
ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,  # predicted probabilities from any model (ideally out-of-sample predictions)
    return_indices_ranked_by='self_confidence',
)

:py:class:`CleanLearning <cleanlab.classification.CleanLearning>` (option 1) also works with models from most standard ML frameworks once the model is wrapped for scikit-learn compliance, e.g. huggingface/tensorflow/keras (using our KerasWrapperModel), pytorch (using the skorch package), etc.

By default, :py:func:`find_label_issues <cleanlab.filter.find_label_issues>` returns a boolean mask that marks which examples have label issues. To get the indices of the potentially mislabeled examples instead, set return_indices_ranked_by. The indices are then ordered by likelihood of a label error, most likely first (estimated via :py:func:`rank.get_label_quality_scores <cleanlab.rank.get_label_quality_scores>`).
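To make the 'self_confidence' ordering concrete, here is a minimal NumPy sketch, not cleanlab's implementation. It assumes the self-confidence score of an example is the predicted probability of its given label. The toy arrays and the `issue_mask` values are made up for illustration, and the mask stands in for what the default call would return.

```python
import numpy as np

# Toy inputs (hypothetical): 5 examples, 3 classes.
labels = np.array([0, 1, 1, 0, 2])
pred_probs = np.array([
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],  # given label is 1, but the model favors class 0
    [0.6, 0.3, 0.1],
    [0.3, 0.3, 0.4],
])

# Self-confidence: probability the model assigns to each example's given label.
self_confidence = pred_probs[np.arange(len(labels)), labels]

# Hypothetical boolean mask of flagged examples, standing in for the default output.
issue_mask = np.array([False, False, True, False, True])

# Convert the mask to indices, then order them from least to most confident.
issue_indices = np.flatnonzero(issue_mask)
ordered_issues = issue_indices[np.argsort(self_confidence[issue_indices])]
print(ordered_issues)  # [2 4]
```

A lower self-confidence means the model disagrees more strongly with the given label, so example 2 is listed before example 4.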

Important

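cleanlab performs better if the pred_probs from your model are out-of-sample, i.e. each example's probabilities come from a model that did not see that example during training. Details on how to compute out-of-sample predicted probabilities for your entire dataset are :ref:`here <pred_probs_cross_val>`.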
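One common way to obtain out-of-sample predicted probabilities for every example is K-fold cross-validation. The sketch below uses scikit-learn's `cross_val_predict`. The synthetic dataset, the choice of `LogisticRegression`, and the 5 folds are assumptions for illustration only; substitute your own data and model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic 3-class dataset standing in for your own data and labels.
data, labels = make_classification(
    n_samples=200,
    n_features=6,
    n_informative=4,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# Each example is scored by a model fit on the other folds only,
# so no example's probabilities come from a model that trained on it.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000),
    data,
    labels,
    cv=5,
    method="predict_proba",
)

print(pred_probs.shape)  # (200, 3): one row per example, one column per class
```

The resulting array has one row per example and one column per class, which is the shape expected for the pred_probs argument used throughout this page.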

3. Train robust models with noisy labels

cleanlab's :py:class:`CleanLearning <cleanlab.classification.CleanLearning>` class adapts any existing (scikit-learn compatible) classification model, clf, to a more reliable one by allowing it to train directly on partially mislabeled datasets.

When the :py:meth:`.fit() <cleanlab.classification.CleanLearning.fit>` method is called, it automatically removes any examples identified as "noisy" in the provided dataset and returns a model trained only on the clean data.

from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

cl = CleanLearning(clf=LogisticRegression())  # any sklearn-compatible classifier
cl.fit(train_data, labels)

# Estimate the predictions you would have gotten if you trained without mislabeled data.
predictions = cl.predict(test_data)
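The prune-then-retrain idea described above can be sketched with scikit-learn alone. This is a deliberately simplified stand-in, not what CleanLearning actually runs: the fixed 0.3 cutoff, the synthetic data, and the 5% label flips are all assumptions made for illustration, and cleanlab uses its own method to decide which examples are noisy.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic binary dataset with 20 of 400 labels flipped to simulate noise.
X, true_labels = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)
rng = np.random.default_rng(0)
flipped = rng.choice(len(true_labels), size=20, replace=False)
noisy_labels = true_labels.copy()
noisy_labels[flipped] = 1 - noisy_labels[flipped]

clf = LogisticRegression(max_iter=1000)

# Step 1: out-of-sample predicted probabilities for every training example.
pred_probs = cross_val_predict(
    clone(clf), X, noisy_labels, cv=5, method="predict_proba"
)

# Step 2: flag examples whose given label receives low probability.
# The 0.3 cutoff is an arbitrary illustrative choice, not cleanlab's rule.
self_confidence = pred_probs[np.arange(len(noisy_labels)), noisy_labels]
keep = self_confidence >= 0.3

# Step 3: retrain the same kind of model on the examples that were kept.
clean_model = clone(clf).fit(X[keep], noisy_labels[keep])
predictions = clean_model.predict(X)
```

In practice, prefer `CleanLearning.fit`, which chooses the examples to remove for you and does not rely on a hand-picked cutoff.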

4. Dataset curation: fix dataset-level issues

cleanlab's :py:mod:`dataset <cleanlab.dataset>` module helps you deal with dataset-level issues by :py:func:`finding overlapping classes <cleanlab.dataset.find_overlapping_classes>` (classes to merge), :py:func:`ranking class-level label quality <cleanlab.dataset.rank_classes_by_label_quality>` (classes to keep/delete), and :py:func:`measuring overall dataset health <cleanlab.dataset.overall_label_health_score>` (to track dataset quality as you make adjustments).

The example below shows how to view all dataset-level issues in one line of code with :py:func:`dataset.health_summary() <cleanlab.dataset.health_summary>`. Check out the dataset tutorial for more examples.

from cleanlab.dataset import health_summary

health_summary(labels, pred_probs, class_names=class_names)

Contributing

As cleanlab is an open-source project, we welcome contributions from the community.

Please see our contributing guidelines for more information.

Quickstart <self>

Workflows of Data-Centric AI <tutorials/indepth_overview> Image Classification (pytorch) <tutorials/image> Text Classification (tensorflow) <tutorials/text> Tabular Classification (sklearn) <tutorials/tabular> Audio Classification (speechbrain) <tutorials/audio> Find Dataset-level Issues <tutorials/dataset_health> Identifying Outliers (pytorch) <tutorials/outliers> Improving Consensus Labels for Multiannotator Data <tutorials/multiannotator> Multi-Label Classification <tutorials/multilabel_classification> Token Classification (text) <tutorials/token_classification> Predicted Probabilities via Cross Validation <tutorials/pred_probs_cross_val> FAQ <tutorials/faq>

cleanlab/classification cleanlab/filter cleanlab/rank cleanlab/count cleanlab/dataset cleanlab/outlier cleanlab/multiannotator cleanlab/multilabel_classification cleanlab/token_classification/index cleanlab/benchmarking/index cleanlab/experimental/index cleanlab/internal/index

How to contribute <https://github.com/cleanlab/cleanlab/blob/master/CONTRIBUTING.md> Migrating to v2.x <migrating/migrate_v2>

Website <https://cleanlab.ai> GitHub <https://github.com/cleanlab/cleanlab> PyPI <https://pypi.org/project/cleanlab/> Conda <https://anaconda.org/Cleanlab/cleanlab> Cleanlab Studio <https://cleanlab.ai/studio/>
