v2.5.0 -- All major ML tasks now supported
This release is non-breaking when upgrading from v2.4.0 (except for certain methods in cleanlab.experimental
that have been moved, especially utility methods related to Datalab).
New ML tasks supported
Cleanlab now supports all of the most common ML tasks! This newest release adds dedicated support for the following types of datasets:
- regression (finding errors in numeric data): see cleanlab.regression and the "noisy labels in regression" quickstart tutorial.
- object detection: see cleanlab.object_detection and the "Object Detection" quickstart tutorial.
- image segmentation: see cleanlab.segmentation and the "Semantic Segmentation" tutorial.
Cleanlab already supported: multi-class classification, multi-label classification (image/document tagging), token classification (entity recognition, sequence prediction).
If there is another ML task you'd like to see this package support, please let us know (or, even better, open a Pull Request)!
Supporting these ML tasks properly required significant research and novel algorithms developed by our scientists. We have published papers on these for transparency and scientific rigor. Check out the list in the README or learn more at:
https://cleanlab.ai/research/
https://cleanlab.ai/blog/
Improvements to Datalab
Datalab is a general platform for detecting all sorts of common issues in real-world data, and the best place to get started for running this library on your datasets.
This release introduces major improvements and new functionality in Datalab, including the ability to:
- Detect low-quality images in computer vision data (blurry, over/under-exposed, low-information, ...) via the integration of CleanVision.
- Detect label issues even without pred_probs from an ML model (you can instead just provide features).
- Flag rare classes in imbalanced classification datasets.
- Audit unlabeled datasets.
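To give a feel for how label issues can be found from features alone, here is a minimal standard-library sketch of a generic nearest-neighbor check: each example is scored by the fraction of its k nearest neighbors (in feature space) that share its given label. This is an illustration only and not Datalab's implementation; the function name `knn_label_agreement` and the parameter `k` are hypothetical.

```python
import math


def knn_label_agreement(features, labels, k=3):
    """Score each example by how many of its k nearest neighbors share its label.

    Hypothetical helper for illustration; not part of cleanlab.
    Scores near 0 suggest the given label disagrees with nearby examples.
    """
    scores = []
    for i, x in enumerate(features):
        # Distances from example i to every other example.
        dists = sorted(
            (math.dist(x, y), j) for j, y in enumerate(features) if j != i
        )
        neighbor_labels = [labels[j] for _, j in dists[:k]]
        agree = sum(1 for lab in neighbor_labels if lab == labels[i])
        scores.append(agree / len(neighbor_labels))
    return scores


# Two well-separated clusters; the example at index 3 sits in the first
# cluster but carries the second cluster's label.
features = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
scores = knn_label_agreement(features, labels, k=3)
suspect = scores.index(min(scores))
print(suspect, scores)
```

The lowest-scoring examples are the ones worth reviewing first. In practice the features would typically be embeddings from a trained model rather than raw coordinates.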
Other major improvements
- 50x speedup in the cleanlab.multiannotator code for analyzing data labeled by multiple annotators.
- Out-of-Distribution detection based on pred_probs via the GEN algorithm, which is particularly effective for datasets with tons of classes.
- Many of the methods across the package to find label issues now support a low_memory option. When specified, it uses an approximate mini-batching algorithm that returns results much faster and requires much less RAM.
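As a rough illustration of scoring out-of-distribution examples from pred_probs alone, the sketch below computes a generalized-entropy style score over the largest predicted probabilities: a flat, uncertain prediction gets a lower score than a confident one. This is a simplified sketch and not cleanlab's implementation; the function name `gen_score` and the defaults `gamma=0.1` and `top_m=100` are assumptions made for this example.

```python
def gen_score(pred_probs, gamma=0.1, top_m=100):
    """Negative generalized entropy of one example's predicted probabilities.

    Hypothetical helper for illustration; not part of cleanlab.
    Only the top_m largest probabilities are used, which keeps the score
    focused on the plausible classes when there are very many classes.
    Higher (closer to 0) means the prediction looks more in-distribution.
    """
    top = sorted(pred_probs, reverse=True)[:top_m]
    return -sum((p ** gamma) * ((1.0 - p) ** gamma) for p in top)


confident = [0.97, 0.01, 0.01, 0.01]  # model is sure about one class
uncertain = [0.25, 0.25, 0.25, 0.25]  # model has no idea

# The uncertain prediction receives the lower score,
# so it would rank as more likely to be out-of-distribution.
print(gen_score(confident), gen_score(uncertain))
```

Because the score depends only on pred_probs, it can be computed for any classifier's outputs without access to the model's features or internals.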
New Contributors
Transforming cleanlab into the first universal data-centric AI platform is a major effort and we need your help! Many easy ways to contribute are listed on our GitHub, or you can jump into the discussions on Slack. We immensely appreciate all of the contributors who've helped build this package into what it is today, especially:
- @gordon-lim made their first contribution in #746
- @tataganesh made their first contribution in #751
- @vdlad made their first contribution in #677
- @axl1313 made their first contribution in #798
- @coding-famer made their first contribution in #800
Change Log
- New feature: Label error detection in regression datasets by @krmayankb in #572; by @huiwengoh in #830
- New feature: ObjectLab for detecting mislabeled images in object detection datasets by @ulya-tkch in #676, #739, #745, #770, #779, #807, #833; by @aditya1503 in #750, #804
- New feature: Label error detection in segmentation datasets by @vdlad in #677; by @ulya-tkch in #754, #756, #759, #772; by @elisno in #775
- New feature: CleanVision to detect low-quality images by @sanjanag in #679, #797
- New image quickstart tutorial that uses Datalab by @sanjanag in #795
- Datalab code refactoring by @elisno in #803, #783, #793, #729
- Include non-IID detection in set of default Datalab issue types by @elisno in #723
- Extend Datalab to be able to detect label issues based on features by @Steven-Yiran in #760
- Add imbalance issue type to Datalab by @tataganesh in #758, #828
- Catch specific exception for knn in Datalab issue managers by @tataganesh in #825
- Make plots smaller for datalab tutorials by @tataganesh in #751
- 50x speedup and other improvements in multiannotator module by @huiwengoh in #821, #784; by @ulya-tkch in #827
- ENH: make clipping unnecessary for entropy by @DerWeh in #703
- Extend default CleanLearning classifier to work for more datasets by @Steven-Yiran in #749
- CleanLearning code improvements by @huiwengoh in #724; by @jwmueller in #744
- Change CleanLearning inspect.getfullargspec to signature for sklearn v1.3 compatibility by @huiwengoh in #761
- Expose low memory option for finding label issues by @tataganesh in #791, #822
- Add GEN OOD-detection algorithm by @coding-famer in #800
- Unify softmax implementations throughout package by @elisno in #826
- Better warning handling for off_calibrated_custom in confident joint by @gordon-lim in #746
- Clearer explanations in documentation/tutorials/readme by @cgnorthcutt in #725; by @jwmueller in #726, #734, #741, #743, #766, #832, #799, #752, #841, #816, #755, #731, #753, #845, #835, #847
- CI and documentation system updates by @anishathalye in #742, #768, #769; by @jwmueller in #837; by @huiwengoh in #788, #757, #738, #794; by @sanjanag in #843; by @ulya-tkch in #777; by @elisno in #802; by @axl1313 in #798
- Improved tests by @huiwengoh in #778, #763
Full Changelog: v2.4.0...v2.5.0