cleanlab automatically finds and fixes label issues in your ML datasets.
pip
pip install cleanlab
conda
conda install -c cleanlab cleanlab
source
pip install git+https://github.com/cleanlab/cleanlab.git
cleanlab finds issues in any dataset that a classifier can be trained on. The cleanlab package works with any model by using model outputs (predicted probabilities) as input -- it doesn't depend on which model created those outputs.
If you're using a scikit-learn-compatible model (option 1), you don't need to train a model -- you can pass the model, data, and labels into CleanLearning.find_label_issues <cleanlab.classification.CleanLearning.find_label_issues> and cleanlab will handle model training for you. If you want to use any non-sklearn-compatible model (option 2), you can input the trained model's out-of-sample predicted probabilities into find_label_issues <cleanlab.filter.find_label_issues>. Examples for both options are below.
from cleanlab.classification import CleanLearning
from cleanlab.filter import find_label_issues
# Option 1 - works with sklearn-compatible models - just input the data and labels ツ
label_issues_info = CleanLearning(clf=sklearn_compatible_model).find_label_issues(data, labels)
# Option 2 - works with ANY ML model - just input the model's predicted probabilities
ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,  # predicted probabilities from any model (ideally out-of-sample predictions)
    return_indices_ranked_by='self_confidence',
)
CleanLearning <cleanlab.classification.CleanLearning> (option 1) also works with models from most standard ML frameworks by wrapping the model for scikit-learn compliance, e.g. Hugging Face/TensorFlow/Keras (using our KerasWrapperModel), PyTorch (using the skorch package), etc.
By default, find_label_issues <cleanlab.filter.find_label_issues> returns a boolean mask of label issues. You can instead return the indices of potentially mislabeled examples by setting return_indices_ranked_by. The indices are ordered by likelihood of a label error (estimated via rank.get_label_quality_scores <cleanlab.rank.get_label_quality_scores>).
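To build intuition for the 'self_confidence' ranking, the sketch below scores each example by the probability its model assigns to the given label and sorts from least to most confident. It is a simplified illustration in plain NumPy with made-up toy values, and it shows only the scoring and ordering step, not the filtering cleanlab applies to decide which examples count as label issues.

```python
import numpy as np

# Toy inputs (made up for illustration): 5 examples, 3 classes.
labels = np.array([0, 1, 2, 1, 0])
pred_probs = np.array([
    [0.90, 0.05, 0.05],  # given label 0, model agrees
    [0.10, 0.80, 0.10],  # given label 1, model agrees
    [0.70, 0.20, 0.10],  # given label 2, model strongly prefers class 0
    [0.30, 0.40, 0.30],  # given label 1, model is unsure
    [0.20, 0.75, 0.05],  # given label 0, model prefers class 1
])

# Self-confidence: the probability the model assigns to each example's given label.
self_confidence = pred_probs[np.arange(len(labels)), labels]

# A lower score means the given label looks less plausible, so sort ascending.
ranked_indices = np.argsort(self_confidence)
print(ranked_indices)  # [2 4 3 1 0]
```

Example 2 comes first because the model gives its label only 0.10 probability, while example 0 comes last with 0.90.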
Important
Cleanlab performs better if the pred_probs from your model are out-of-sample. Details on how to compute out-of-sample predicted probabilities for your entire dataset are here <pred_probs_cross_val>.
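One common way to get out-of-sample predicted probabilities for every example is K-fold cross-validation. The sketch below uses scikit-learn's cross_val_predict with method="predict_proba". The synthetic dataset, the LogisticRegression model, and the choice of 5 folds are assumptions made for illustration, so substitute your own data, labels, and classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for your own data and labels.
data, labels = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Each example's probabilities come from a model fit on the other folds,
# so no example is scored by a model that saw its label during training.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000),
    data,
    labels,
    cv=5,
    method="predict_proba",
)

print(pred_probs.shape)  # (200, 3)
```

The resulting array has one row per example and one column per class, which is the shape expected for pred_probs.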
cleanlab's CleanLearning <cleanlab.classification.CleanLearning> class adapts any existing (scikit-learn compatible) classification model, clf, to a more reliable one by allowing it to train directly on partially mislabeled datasets.
When the .fit() <cleanlab.classification.CleanLearning.fit> method is called, it automatically removes any examples identified as "noisy" in the provided dataset and returns a model trained only on the clean data.
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning
cl = CleanLearning(clf=LogisticRegression()) # any sklearn-compatible classifier
cl.fit(train_data, labels)
# Estimate the predictions you would have gotten if you trained without mislabeled data.
predictions = cl.predict(test_data)
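The remove-then-retrain idea can be sketched by hand with scikit-learn alone. This is a rough illustration, not how CleanLearning is implemented: the flagging rule here (self-confidence below a fixed, hypothetical threshold of 0.3) is a deliberately simple stand-in for cleanlab's own label-issue detection, and the synthetic data and 10% label corruption are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic binary dataset with 10% of labels flipped to simulate label noise.
data, true_labels = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=2, random_state=0
)
rng = np.random.default_rng(0)
flipped = rng.choice(len(true_labels), size=30, replace=False)
labels = true_labels.copy()
labels[flipped] = 1 - labels[flipped]

# Step 1: out-of-sample predicted probabilities on the noisy labels.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), data, labels, cv=5, method="predict_proba"
)

# Step 2: flag examples whose given label gets low probability.
# The 0.3 cutoff is a hypothetical choice, not a cleanlab default.
self_confidence = pred_probs[np.arange(len(labels)), labels]
keep = self_confidence >= 0.3

# Step 3: retrain on the examples that were not flagged.
clf = LogisticRegression(max_iter=1000).fit(data[keep], labels[keep])
predictions = clf.predict(data)
print(f"kept {keep.sum()} of {len(labels)} examples")
```

In practice CleanLearning wraps these steps behind a single fit() call, so a hand-rolled version like this is mainly useful for understanding what happens under the hood.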
cleanlab's dataset <cleanlab.dataset> module helps you deal with dataset-level issues by finding overlapping classes <cleanlab.dataset.find_overlapping_classes> (classes to merge), ranking class-level label quality <cleanlab.dataset.rank_classes_by_label_quality> (classes to keep/delete), and measuring overall dataset health <cleanlab.dataset.overall_label_health_score> (to track dataset quality as you make adjustments).
The example below shows how to view all dataset-level issues in one line of code with dataset.health_summary() <cleanlab.dataset.health_summary>. Check out the dataset tutorial for more examples.
from cleanlab.dataset import health_summary
health_summary(labels, pred_probs, class_names=class_names)
As cleanlab is an open-source project, we welcome contributions from the community.
Please see our contributing guidelines for more information.
Quickstart <self>

Workflows of Data-Centric AI <tutorials/indepth_overview>
Image Classification (pytorch) <tutorials/image>
Text Classification (tensorflow) <tutorials/text>
Tabular Classification (sklearn) <tutorials/tabular>
Audio Classification (speechbrain) <tutorials/audio>
Find Dataset-level Issues <tutorials/dataset_health>
Identifying Outliers (pytorch) <tutorials/outliers>
Improving Consensus Labels for Multiannotator Data <tutorials/multiannotator>
Multi-Label Classification <tutorials/multilabel_classification>
Token Classification (text) <tutorials/token_classification>
Predicted Probabilities via Cross Validation <tutorials/pred_probs_cross_val>
FAQ <tutorials/faq>

cleanlab/classification
cleanlab/filter
cleanlab/rank
cleanlab/count
cleanlab/dataset
cleanlab/outlier
cleanlab/multiannotator
cleanlab/multilabel_classification
cleanlab/token_classification/index
cleanlab/benchmarking/index
cleanlab/experimental/index
cleanlab/internal/index

How to contribute <https://github.com/cleanlab/cleanlab/blob/master/CONTRIBUTING.md>
Migrating to v2.x <migrating/migrate_v2>

Website <https://cleanlab.ai>
GitHub <https://github.com/cleanlab/cleanlab>
PyPI <https://pypi.org/project/cleanlab/>
Conda <https://anaconda.org/Cleanlab/cleanlab>
Cleanlab Studio <https://cleanlab.ai/studio/>