v2.5.0 -- All major ML tasks now supported

@jwmueller jwmueller released this 11 Sep 14:44
d45537e

This release is non-breaking when upgrading from v2.4.0 (except for certain methods in cleanlab.experimental that have been moved, especially utility methods related to Datalab).

New ML tasks supported

Cleanlab now supports all of the most common ML tasks! This newest release adds dedicated support for the following types of datasets:

  • regression (finding errors in numeric data): see cleanlab.regression and the "noisy labels in regression" quickstart tutorial.
  • object detection: see cleanlab.object_detection and the "Object Detection" quickstart tutorial.
  • image segmentation: see cleanlab.segmentation and the "Semantic Segmentation" tutorial.
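
To make "finding errors in numeric data" concrete, here is a minimal sketch of a residual-based label quality score. It is a toy illustration only, not the algorithm implemented in cleanlab.regression: the function name `residual_quality_scores`, the scoring formula, and the example data are all assumptions made for this sketch. The idea is simply that a label far from a model's out-of-sample prediction deserves a closer look.

```python
import math

def residual_quality_scores(labels, predictions):
    """Hypothetical helper: map each residual to a score in (0, 1].

    A lower score means the given label disagrees more with the
    model's prediction and is therefore more suspicious.
    """
    return [math.exp(-abs(y - y_hat)) for y, y_hat in zip(labels, predictions)]

# Made-up data: the fourth label looks corrupted (40.0 instead of about 4.0).
labels = [1.0, 2.1, 2.9, 40.0, 5.2]
predictions = [1.1, 2.0, 3.0, 4.0, 5.0]  # assumed out-of-sample predictions

scores = residual_quality_scores(labels, predictions)

# Indices sorted from most to least suspicious label.
ranked = sorted(range(len(scores)), key=scores.__getitem__)
print(ranked[0])
```

In practice the predictions should come from a model that never saw the corresponding example during training (for instance via cross-validation); otherwise an overfit model can hide label errors.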

Cleanlab already supported: multi-class classification, multi-label classification (image/document tagging), and token classification (entity recognition, sequence prediction).

If there is another ML task you'd like to see this package support, please let us know (or, even better, open a Pull Request)!

Supporting these ML tasks properly required significant research and novel algorithms developed by our scientists. We have published papers on these for transparency and scientific rigor. Check out the list in the README or learn more at:
https://cleanlab.ai/research/
https://cleanlab.ai/blog/

Improvements to Datalab

Datalab is a general platform for detecting all sorts of common issues in real-world data, and the best place to get started for running this library on your datasets.

This release introduces major improvements and new functionalities in Datalab that include the ability to:

  • Detect low-quality images in computer vision data (blurry, over/under-exposed, low-information, ...) via the integration of CleanVision.
  • Detect label issues even without pred_probs from an ML model (you can instead just provide features).
  • Flag rare classes in imbalanced classification datasets.
  • Audit unlabeled datasets.
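
As a rough illustration of what flagging rare classes can mean, the sketch below marks every example whose class makes up less than a chosen fraction of the dataset. This is not Datalab's implementation: the helper `flag_rare_classes`, the `threshold` value of 0.1, and the example labels are assumptions made for this sketch.

```python
from collections import Counter

def flag_rare_classes(labels, threshold=0.1):
    """Hypothetical helper: return one boolean per example.

    An example is flagged when its class accounts for less than
    `threshold` (an assumed cutoff) of all examples.
    """
    counts = Counter(labels)
    total = len(labels)
    rare = {cls for cls, count in counts.items() if count / total < threshold}
    return [label in rare for label in labels]

# Made-up imbalanced dataset: "emu" is 1 of 20 examples (5%).
labels = ["cat"] * 12 + ["dog"] * 7 + ["emu"]
flags = flag_rare_classes(labels)

flagged_classes = sorted({lab for lab, flag in zip(labels, flags) if flag})
print(flagged_classes)
```

A fixed fraction is the simplest possible rule; a cutoff that scales with the number of classes may be more appropriate for datasets with many labels.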

Other major improvements

  • 50x speedup in the cleanlab.multiannotator code for analyzing data labeled by multiple annotators.
  • Out-of-Distribution detection based on pred_probs via the GEN algorithm, which is particularly effective for datasets with many classes.
  • Many of the methods across the package to find label issues now support a low_memory option. When it is specified, these methods use an approximate mini-batching algorithm that returns results much faster and requires much less RAM.
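
For intuition about the GEN-based out-of-distribution score mentioned above, here is a minimal sketch of a generalized-entropy score computed from one row of pred_probs. It follows the general form of the score (a sum of p^gamma * (1 - p)^gamma over the largest predicted probabilities, negated), but the function name `gen_score`, the parameter names, and the values `gamma=0.1` and `top_m=100` are assumptions made here, and cleanlab's own implementation may scale or normalize the score differently.

```python
def gen_score(pred_prob, gamma=0.1, top_m=100):
    """Hypothetical helper: negative generalized entropy of one prediction.

    Only the `top_m` largest probabilities are used. A higher (less
    negative) score suggests a more confident, in-distribution example.
    """
    top = sorted(pred_prob, reverse=True)[:top_m]
    return -sum((p ** gamma) * ((1.0 - p) ** gamma) for p in top)

# Made-up predictions over four classes.
confident = [0.97, 0.01, 0.01, 0.01]  # model is sure: likely in-distribution
uncertain = [0.25, 0.25, 0.25, 0.25]  # model is unsure: more likely out-of-distribution

print(gen_score(confident) > gen_score(uncertain))
```

Because the score needs only predicted probabilities, it can be computed for any trained classifier without access to its features or embeddings.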

New Contributors

Transforming cleanlab into the first universal data-centric AI platform is a major effort and we need your help! Many easy ways to contribute are listed on our GitHub, or you can jump into the discussions on Slack. We immensely appreciate all of the contributors who've helped build this package into what it is today, especially:

Change Log

Full Changelog: v2.4.0...v2.5.0