Introducing the new Dataset Module for cleanlab 2.0 #182

cgnorthcutt · 2022-04-07T02:35:16Z

Dataset module

This module introduces a new API for working with entire datasets. It suggests overlapping classes that you may want to merge and low-quality classes that you may want to remove, computes an overall dataset health score, and produces a full health summary report (generated for your dataset in a single line of code).

Code should probably be tested a bit more thoroughly before merge.
Merge or rebase this commit. Do not squash and merge.

need to add tests!

please add tests as you review.

Updates made to module during this PR:

Merged the two label quality and label noise methods.
Report conditional matrix scores (instead of joint) for class methods.
All methods that previously returned multiple objects now work with and return pandas DataFrames.
Added support for string classes.
Added a health_summary function that prints everything about your dataset health in one line of code.
Completely rewrote the docstrings.
All methods take in all input types.
All methods have been renamed.
Added testing for four real-world datasets.
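
For readers wondering what the switch from joint to conditional scores changes, here is a minimal sketch. It assumes "conditional" means row-normalizing a matrix of counts so that each entry reads as P(true label = j | given label = i); the PR text does not spell out the direction of conditioning, and the counts below are made up. It is not the code in cleanlab/dataset.py.

```python
# Made-up counts for a 2-class dataset (illustrative only).
# counts[i][j] = number of examples given label i whose estimated true label is j.
counts = [
    [40, 10],   # class 0: small class, 10 of its 50 examples look mislabeled
    [50, 900],  # class 1: large class, 50 of its 950 examples look mislabeled
]

total = sum(sum(row) for row in counts)

# Joint: every cell divided by the size of the whole dataset.
joint = [[c / total for c in row] for row in counts]

# Conditional (assumed direction): every cell divided by its own row total,
# i.e. P(true label = j | given label = i).
conditional = [[c / sum(row) for c in row] for row in counts]

for i in range(len(counts)):
    off_joint = sum(joint[i][j] for j in range(len(counts)) if j != i)
    off_cond = sum(conditional[i][j] for j in range(len(counts)) if j != i)
    print(f"class {i}: joint off-diagonal = {off_joint:.3f}, "
          f"conditional off-diagonal = {off_cond:.3f}")
```

In this toy example the joint makes class 1 look worse because it is a large class, while the conditional shows that a larger fraction of class 0 is affected. That is the kind of per-class comparison a conditional score makes easier.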

Future support

In the future we should support running all of these functions directly from raw data plus a model (like CleanLearning), but that's for a later release.

codecov · 2022-04-07T02:41:32Z

Codecov Report

Merging #182 (c1d27cf) into master (0dc384a) will decrease coverage by 0.44%.
The diff coverage is 89.85%.

@@            Coverage Diff             @@
##           master     #182      +/-   ##
==========================================
- Coverage   95.41%   94.97%   -0.45%     
==========================================
  Files          11       12       +1     
  Lines         786      855      +69     
  Branches      167      185      +18     
==========================================
+ Hits          750      812      +62     
- Misses         13       15       +2     
- Partials       23       28       +5

Impacted Files	Coverage Δ
cleanlab/dataset.py	`89.85% <89.85%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0dc384a...c1d27cf. Read the comment docs.

jwmueller · 2022-04-07T05:03:46Z

cleanlab/dataset.py

+# along with cleanlab.  If not, see <https://www.gnu.org/licenses/>.
+
+
+"""Dataset Module


Consider demonstrating these functions at the end of one of the quickstart tutorials, maybe text.ipynb.

this is left up for grabs!

cgnorthcutt · 2022-04-10T06:06:46Z

IMPORTANT - this merge should include the commit history (do not squash, either merge directly or rebase and merge).

jwmueller · 2022-04-10T23:08:29Z

cleanlab/dataset.py

+    This method provides two scores in the pandas Data Frame that is returned:
+    * "Num Overlapping Examples" - The number of examples where the two classes overlap
+    * "Joint Probability" - "Num Overlapping Examples" / total number of examples in the dataset.
+    interchangeable and returns a dataframe with the classe and the joint probability score


typos in text and formatting issue

jwmueller · 2022-04-10T23:15:36Z

cleanlab/dataset.py

+
+    This method provides two scores in the pandas Data Frame that is returned:
+    * "Num Overlapping Examples" - The number of examples where the two classes overlap
+    * "Joint Probability" - "Num Overlapping Examples" / total number of examples in the dataset.


I like the "Joint Probability" score because it's very interpretable. However, one issue is that it violates our package's convention that lower scores => more serious problem.

If we want to be consistent with this convention (which I advise, given we plan to have tons of data-quality scores in the package), then an alternative is to call this "Overlap Score" and define it in the docstring as 1 - joint probability, i.e. 1 - (number of overlapping examples) / (total number of examples in the dataset).
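
To make the proposal concrete, here is a minimal sketch of the two definitions side by side. The function names `joint_probability` and `overlap_score` and the example counts are hypothetical, chosen only for illustration; they are not part of the module under review.

```python
def joint_probability(num_overlapping: int, num_examples: int) -> float:
    """'Num Overlapping Examples' / total number of examples in the dataset."""
    if num_examples <= 0:
        raise ValueError("num_examples must be positive")
    if not 0 <= num_overlapping <= num_examples:
        raise ValueError("num_overlapping must lie in [0, num_examples]")
    return num_overlapping / num_examples


def overlap_score(num_overlapping: int, num_examples: int) -> float:
    """Hypothetical 'Overlap Score' = 1 - joint probability (lower = more overlap)."""
    return 1.0 - joint_probability(num_overlapping, num_examples)


# Made-up overlap counts for two class pairs in a dataset of 1000 examples.
num_examples = 1000
overlaps = {("cake", "birthday cake"): 50, ("cat", "dog"): 5}

scores = {pair: overlap_score(n, num_examples) for pair, n in overlaps.items()}

# Under the "lower score => more serious problem" convention,
# the pair to look at first is simply the minimum.
worst_pair = min(scores, key=scores.get)
print(worst_pair, scores[worst_pair])
```

The trade-off is the one raised above: the complement keeps the ordering consistent with the rest of the package, while the raw joint probability reads directly as a fraction of the dataset.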

jwmueller · 2022-04-10T23:25:58Z

cleanlab/dataset.py

+    confident_joint=None,
+    multi_label=False,
+):
+    """Prints a healthy summary of your datasets including results for powerful statistics like:


"healthy summary of your datasets including results for powerful statistics like"
->
"summary of your dataset's overall label health, including:"

[Nit] There are a couple occurrences of "datasets" in this file. Would replace them with "dataset"

jwmueller · 2022-04-10T23:26:53Z

cleanlab/dataset.py

+    """Prints a healthy summary of your datasets including results for powerful statistics like:
+    * the classes with the most and least label issues
+    * classes that overlap and could potentially be merged
+    * overall data label quality health score statistics for your dataset


[Nit] could delete final bullet since it's implied by "overall health summary"

jwmueller · 2022-04-10T23:29:32Z

cleanlab/dataset.py

+
+    Parameters
+    ----------
+    For parameter info, see the docstring of `dataset.find_overlapping_classes`


I'd imagine health_summary() is the method we envision most users invoking from this module (and just getting results from find_overlapping_classes from the internal call inside health_summary).

So consider moving the parameter docstrings to this method, and then having the docstring for find_overlapping_classes say "See the docstring of health_summary" instead.
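
A rough sketch of that layout, with stub bodies. The function and parameter names come from this PR (`labels`, `pred_probs`, `confident_joint`, `multi_label`), but the exact signatures and the parameter descriptions below are placeholders written for illustration, not the module's actual docstrings.

```python
def health_summary(labels=None, pred_probs=None, *, confident_joint=None, multi_label=False):
    """Prints a summary of your dataset's overall label health.

    Parameters
    ----------
    labels :
        Placeholder description: the given (possibly noisy) labels.
    pred_probs :
        Placeholder description: model-predicted class probabilities.
    confident_joint :
        Placeholder description: a precomputed confident joint, if available.
    multi_label :
        Placeholder description: whether examples may have several labels.
    """
    raise NotImplementedError("illustrative stub only")


def find_overlapping_classes(labels=None, pred_probs=None, *, confident_joint=None, multi_label=False):
    """Returns the pairs of classes that overlap the most.

    Parameters
    ----------
    For parameter info, see the docstring of `health_summary`.
    """
    raise NotImplementedError("illustrative stub only")


# The full parameter documentation lives in exactly one place.
print(find_overlapping_classes.__doc__)
```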

jwmueller

New output format looks great! I mostly suggest some minor API refactors, and major changes in naming conventions to improve usability/clarity.

JohnsonKuan · 2022-04-11T18:08:40Z

Example output of all methods for Google Quickdraw dataset (below):

Example output of health summary in one line of code for CIFAR-100:

Output looks awesome.

It may be helpful to explain in the docstring why the estimated joint prob of (A, B) is not the same as (B, A) like in your quickdraw example where joint prob of (birthday cake, cake) differed from (cake, birthday cake). I know why they differ (after initial read of the confident learning paper) but I'm guessing most new users will probably not know.
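
For new users, a small worked example may help. The sketch below follows the confident learning setup, in which the joint is estimated from a matrix of counts whose rows index the given label and whose columns index the estimated true label. The counts are made up and are not the Quickdraw numbers.

```python
classes = ["birthday cake", "cake"]

# Made-up counts (illustrative only).
# counts[i][j] = examples labeled classes[i] whose estimated true class is classes[j].
counts = [
    [80, 15],   # 15 examples labeled "birthday cake" look like plain "cake"
    [5, 100],   # only 5 examples labeled "cake" look like "birthday cake"
]

total = sum(sum(row) for row in counts)
joint = [[c / total for c in row] for row in counts]

# The two off-diagonal cells describe different events, so they need not match:
# (given = "birthday cake", true = "cake") vs. (given = "cake", true = "birthday cake").
p_birthday_cake_is_cake = joint[0][1]
p_cake_is_birthday_cake = joint[1][0]

print(f"({classes[0]}, {classes[1]}): {p_birthday_cake_is_cake:.3f}")
print(f"({classes[1]}, {classes[0]}): {p_cake_is_birthday_cake:.3f}")
```

The order of the pair matters because annotators can confuse the classes in one direction more often than in the other.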

cgnorthcutt · 2022-04-13T04:54:57Z

Addressed all comments and added thorough testing on real datasets with 2, 10, and more classes. Verified that methods return correct values in some cases.

* df return type, need tests still
* Add pandas as a dependency

  We already decided that pandas will be a dependency of cleanlab (also used in the dataset module, see #182).
* Tweak documentation
* addressed comments
* remove lazy import
* address 2nd round comments
* unit tests
* improve codecov
* Fix typo
* methods to save more space
* nocover statements for prints
* extra nocover
* nocover warnings
* test docstring formatting
* test docstring formatting2
* test docstring formatting2
* move compress to helper, find-label docs params
* readded stuff lost in merge conflict
* addressed remaining PR review comments
* docs formatting
* docs formatting2
* docs formatting3
* docs formatting4
* docs formatting5
* docs formatting5
* docs formatting6
* docs formatting7
* docs formatting8
* docs formatting9
* docs formatting19
* docs formatting20
* docs formatting20
* docs formatting21
* code formatting
* fix a bug where confident joint isnt computed

  The confident joint wasn't getting computed if noise_matrix was passed in and pred_probs was not passed in. But that's bad because it stops workflows like:

  ```python
  cl = CleanLearning()
  cl.fit(data, labels, noise_matrix=noise_matrix)
  cleanlab.dataset.health_summary(labels, confident_joint=cl.confident_joint)
  ```
* fixed bug from last commit. code in wrong place.
* print overwrite bugfix

Co-authored-by: Anish Athalye <me@anishathalye.com>
Co-authored-by: Curtis G. Northcutt <curtis.northcutt@gmail.com>

Skeleton commit for first version of new dataset module. UNTESTED

06b990b

cgnorthcutt requested review from anishathalye, JohnsonKuan and jwmueller April 7, 2022 02:35

cgnorthcutt self-assigned this Apr 7, 2022

jwmueller marked this pull request as draft April 7, 2022 03:26

jwmueller reviewed Apr 7, 2022

cleanlab/dataset.py

cgnorthcutt added 3 commits April 10, 2022 14:37

Add pandas as a dependency to cleanlab.

b65d0fc

Totally rewritten. class methods merged. now uses pandas. health summary.

0fb2439

Add initial test for dataset module.

68d9c75

jwmueller reviewed Apr 10, 2022

cleanlab/dataset.py

jwmueller requested changes Apr 10, 2022

anishathalye added 2 commits April 11, 2022 12:45

Merge branch 'master' into dataset_module

eae2924

Make a pass over docs

dc6fa95

jwmueller reviewed Apr 11, 2022

tests/test_dataset.py

Checkpoint addressing Jonas's feedback

8470c72

Co-Authored-By: Curtis G. Northcutt <curtis.northcutt@gmail.com>

cgnorthcutt changed the title ~~Skeleton commit for first version of new dataset module. UNTESTED~~ Introducing the new Dataset Module for cleanlab 2.0 Apr 12, 2022

anishathalye added a commit to jwmueller/cleanlab that referenced this pull request Apr 12, 2022

Add pandas as a dependency

13518fe

We already decided that pandas will be a dependency of cleanlab (also used in the dataset module, see cleanlab#182).

jwmueller reviewed Apr 12, 2022

cleanlab/dataset.py

anishathalye and others added 4 commits April 12, 2022 09:21

Fix typo

374f6e0

Make key match function name

2597e3e

This makes it consistent with the other two, "classes_by_label_quality" and "overlapping_classes".

Clarify joint asymmetry in docs. reset indices in dfs

9cdf505

Added thorough testing. ready for release.

c1d27cf

cgnorthcutt merged commit 56a4ec8 into cleanlab:master Apr 13, 2022

cgnorthcutt deleted the dataset_module branch April 13, 2022 04:55
