Release v2.5.0 -- All major ML tasks now supported · cleanlab/cleanlab

This release is non-breaking when upgrading from v2.4.0 (except for certain methods in cleanlab.experimental that have been moved, especially utility methods related to Datalab).

New ML tasks supported

Cleanlab now supports all of the most common ML tasks! This newest release adds dedicated support for the following types of datasets:

regression (finding errors in numeric data): see cleanlab.regression and the "noisy labels in regression" quickstart tutorial.
object detection: see cleanlab.object_detection and the "Object Detection" quickstart tutorial.
image segmentation: see cleanlab.segmentation and the "Semantic Segmentation" tutorial.
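
To give a feel for what finding errors in numeric labels involves, here is a minimal sketch in plain Python. It is not the algorithm used by cleanlab.regression. It assumes you already have out-of-sample predictions from some regression model, and it scores each label by exp(-|prediction - label|), so labels far from their prediction get scores near 0. The function name and the toy numbers are made up for this illustration.

```python
import math


def residual_label_scores(labels, predictions):
    """Score each label in (0, 1]; a lower score means more likely mislabeled.

    Illustrative only: score = exp(-|prediction - label|).
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    return [math.exp(-abs(p - y)) for y, p in zip(labels, predictions)]


# Toy data: the third label (9.0) disagrees strongly with its prediction (3.0).
labels = [1.0, 2.1, 9.0, 4.2]
predictions = [1.1, 2.0, 3.0, 4.0]

scores = residual_label_scores(labels, predictions)

# Index of the example whose label looks most suspicious (lowest score).
most_suspicious = min(range(len(scores)), key=scores.__getitem__)
print(most_suspicious)
```

Raw residuals depend on the scale of the target variable and on how good the model is, so a sketch like this is only useful for ranking examples within one dataset.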

Cleanlab already supported: multi-class classification, multi-label classification (image/document tagging), token classification (entity recognition, sequence prediction).

If there is another ML task you'd like to see this package support, please let us know (or even better open a Pull Request)!

Supporting these ML tasks properly required significant research and novel algorithms developed by our scientists. We have published papers on these for transparency and scientific rigor. Check out the list in the README or learn more at:
https://cleanlab.ai/research/
https://cleanlab.ai/blog/

Improvements to Datalab

Datalab is a general platform for detecting all sorts of common issues in real-world data, and the best place to get started for running this library on your datasets.

This release introduces major improvements and new functionalities in Datalab that include the ability to:

Detect low-quality images in computer vision data (blurry, over/under-exposed, low-information, ...) via the integration of CleanVision.
Detect label issues even without pred_probs from an ML model (you can instead just provide features).
Flag rare classes in imbalanced classification datasets.
Audit unlabeled datasets.
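
The idea behind flagging rare classes can be sketched without any dependencies. The snippet below is an illustration, not Datalab's implementation: it flags a class when its share of the labels falls below a fraction of the share it would have in a perfectly balanced dataset. The function name and the 0.1 threshold are assumptions chosen for the example.

```python
from collections import Counter


def find_rare_classes(labels, threshold=0.1):
    """Return classes whose frequency is below threshold * (1 / num_classes).

    Hypothetical helper for illustration; `threshold` is an assumed default.
    """
    if not labels:
        return []
    counts = Counter(labels)
    num_examples = len(labels)
    num_classes = len(counts)
    # A perfectly balanced dataset gives each class a share of 1 / num_classes.
    cutoff = threshold / num_classes
    return [cls for cls, n in counts.items() if n / num_examples < cutoff]


# Toy dataset: "fox" makes up 1% of 100 labels across 3 classes.
labels = ["cat"] * 60 + ["dog"] * 39 + ["fox"] * 1
rare = find_rare_classes(labels)
print(rare)
```

A relative cutoff like this adapts to the number of classes, whereas a fixed minimum count would need retuning for every dataset.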

Other major improvements

50x speedup in the cleanlab.multiannotator code for analyzing data labeled by multiple annotators.
Out-of-Distribution detection based on pred_probs via the GEN algorithm, which is particularly effective for datasets with a large number of classes.
Many of the methods across the package to find label issues now support a low_memory option. When enabled, these methods use an approximate mini-batching algorithm that returns results much faster and requires much less RAM.
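
For intuition about scoring out-of-distribution examples from pred_probs alone, here is a minimal sketch of a generalized-entropy score in the spirit of GEN. It is not cleanlab's code: the function name, the defaults gamma=0.1 and top_m=100, and the rescaling to the range [0, 1] are assumptions made for this illustration. The score sums p^gamma * (1 - p)^gamma over the top_m largest class probabilities. A confident, peaked prediction gives a small sum and a diffuse prediction gives a large one.

```python
def gen_style_score(probs, gamma=0.1, top_m=100):
    """Return a score in [0, 1]; a lower score suggests out-of-distribution.

    Illustrative sketch: 1 - mean of (p * (1 - p)) ** gamma over the
    top_m largest predicted class probabilities.
    """
    if not probs:
        raise ValueError("probs must contain at least one class probability")
    # Keep only the top_m largest probabilities, clamped to [0, 1] so that
    # small floating-point overshoots cannot produce a negative base.
    top = [min(max(p, 0.0), 1.0) for p in sorted(probs, reverse=True)[:top_m]]
    generalized_entropy = sum((p * (1.0 - p)) ** gamma for p in top)
    return 1.0 - generalized_entropy / len(top)


# A peaked prediction (likely in-distribution) versus a diffuse one.
confident = [0.97, 0.01, 0.01, 0.01]
diffuse = [0.25, 0.25, 0.25, 0.25]

in_dist_score = gen_style_score(confident)
ood_score = gen_style_score(diffuse)
print(in_dist_score > ood_score)
```

Restricting the sum to the largest probabilities keeps the long tail of near-zero classes from dominating the score, which matters most when the number of classes is large.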

New Contributors

Transforming cleanlab into the first universal data-centric AI platform is a major effort and we need your help! Many easy ways to contribute are listed on our GitHub, or you can jump into the discussions on Slack. We immensely appreciate all of the contributors who've helped build this package into what it is today, especially:

@gordon-lim made their first contribution in #746
@tataganesh made their first contribution in #751
@vdlad made their first contribution in #677
@axl1313 made their first contribution in #798
@coding-famer made their first contribution in #800

Change Log

New feature: Label error detection in regression datasets by @krmayankb in #572; by @huiwengoh in #830
New feature: ObjectLab for detecting mislabeled images in object detection datasets by @ulya-tkch in #676, #739, #745, #770, #779, #807, #833; by @aditya1503 in #750, #804
New feature: Label error detection in segmentation datasets by @vdlad in #677; by @ulya-tkch in #754, #756, #759, #772; by @elisno in #775
New feature: CleanVision to detect low-quality images by @sanjanag in #679, #797
New image quickstart tutorial that uses Datalab by @sanjanag in #795
Datalab code refactoring by @elisno in #803, #783, #793, #729
Make labels optional in Datalab by @elisno in #730
Update near-duplicate sets in Datalab by @elisno in #781
Include non-IID detection in set of default Datalab issue types by @elisno in #723
Extend Datalab to be able to detect label issues based on features by @Steven-Yiran in #760
Add imbalance issue type to Datalab by @tataganesh in #758, #828
Catch specific exception for knn in Datalab issue managers by @tataganesh in #825
Make plots smaller for datalab tutorials by @tataganesh in #751
50x speedup and other improvements in multiannotator module by @huiwengoh in #821, #784; by @ulya-tkch in #827
ENH: make clipping unnecessary for entropy by @DerWeh in #703
Extend default CleanLearning classifier to work for more datasets by @Steven-Yiran in #749
CleanLearning code improvements by @huiwengoh in #724; by @jwmueller in #744
Change CleanLearning inspect.getfullargspec to signature for sklearn v1.3 compatibility by @huiwengoh in #761
Expose low memory option for finding label issues by @tataganesh in #791, #822
Add GEN OOD-detection algorithm by @coding-famer in #800
Unify softmax implementations throughout package by @elisno in #826
Better warning handling for off_calibrated_custom in confident joint by @gordon-lim in #746
Clearer explanations in documentation/tutorials/readme by @cgnorthcutt in #725; by @jwmueller in #726, #734, #741, #743, #766, #832, #799, #752, #841, #816, #755, #731, #753, #845, #835, #847
CI and documentation system updates by @anishathalye in #742, #768, #769; by @jwmueller in #837; by @huiwengoh in #788, #757, #738, #794; by @sanjanag in #843; by @ulya-tkch in #777; by @elisno in #802; by @axl1313 in #798
Improved tests by @huiwengoh in #778, #763

Full Changelog: v2.4.0...v2.5.0
