Introducing the new Dataset Module for cleanlab 2.0 #182

Merged
merged 12 commits into from
Apr 13, 2022

Conversation

@cgnorthcutt cgnorthcutt commented Apr 7, 2022

Dataset module

This module introduces a new API for working with entire datasets. It suggests overlapping classes that you may want to merge and low-quality classes that you may want to remove, computes an overall dataset health score, and produces a full health summary report (generated for your dataset in a single line of code).

  • Code should probably be tested a bit more thoroughly before merge.
  • Merge or rebase this commit. Do not squash and merge.

need to add tests!

  • please add tests as you review.

Updates made to module during this PR:

  • Merged the two label quality and label noise methods.
  • report conditional matrix scores (instead of joint) for class methods
  • All methods that previously returned multiple objects now work with and return pandas dataframes.
  • Added support for string classes
  • Added a health_summary function that prints everything about your dataset health in one line of code.
  • completely rewrote the docstrings.
  • all methods take in all input types.
  • all methods have been renamed.
  • Added testing for four real-world datasets.
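
To illustrate the switch from joint to conditional scores mentioned in the list above, here is a minimal sketch in plain Python. The 3x3 `joint` matrix and the helper name `conditional_noise_by_class` are hypothetical and are not taken from cleanlab/dataset.py; rows are assumed to index the given label and columns the estimated true label.

```python
# Hypothetical joint distribution over (given label, true label); entries sum to 1.
joint = [
    [0.30, 0.05, 0.00],
    [0.02, 0.28, 0.05],
    [0.00, 0.10, 0.20],
]


def conditional_noise_by_class(joint):
    """Return P(true label != k | given label == k) for each class k."""
    scores = []
    for k, row in enumerate(joint):
        row_total = sum(row)  # P(given label == k)
        scores.append(1.0 - row[k] / row_total)
    return scores


noise = conditional_noise_by_class(joint)
for k, value in enumerate(noise):
    print(f"class {k}: conditional label noise = {value:.3f}")
```

A joint score for class k would instead be the sum of the off-diagonal entries in row k, which grows with how frequent the class is. Dividing by the row total removes that dependence, so rare and common classes can be compared on the same scale.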

Future support

  • in the future we should add support for all the functions from the raw data directly + a model (like CleanLearning), but that's for a later release.

codecov bot commented Apr 7, 2022

Codecov Report

Merging #182 (c1d27cf) into master (0dc384a) will decrease coverage by 0.44%.
The diff coverage is 89.85%.

@@            Coverage Diff             @@
##           master     #182      +/-   ##
==========================================
- Coverage   95.41%   94.97%   -0.45%     
==========================================
  Files          11       12       +1     
  Lines         786      855      +69     
  Branches      167      185      +18     
==========================================
+ Hits          750      812      +62     
- Misses         13       15       +2     
- Partials       23       28       +5     
Impacted Files Coverage Δ
cleanlab/dataset.py 89.85% <89.85%> (ø)


@jwmueller jwmueller marked this pull request as draft April 7, 2022 03:26
cleanlab/dataset.py
# along with cleanlab. If not, see <https://www.gnu.org/licenses/>.


"""Dataset Module
Member

consider demonstrating these functions at end of one of the quickstart tutorials, maybe text.ipynb

Member Author

this is left up for grabs!

@cgnorthcutt
Member Author

IMPORTANT - this merge should include the commit history (do not squash, either merge directly or rebase and merge).

This method provides two scores in the pandas Data Frame that is returned:
* "Num Overlapping Examples" - The number of examples where the two classes overlap
* "Joint Probability" - "Num Overlapping Examples" / total number of examples in the dataset.
interchangeable and returns a dataframe with the classe and the joint probability score
Member

typos in text and formatting issue


This method provides two scores in the pandas Data Frame that is returned:
* "Num Overlapping Examples" - The number of examples where the two classes overlap
* "Joint Probability" - "Num Overlapping Examples" / total number of examples in the dataset.
Member

I like the score "joint probability" because it's very interpretable. However, one issue is that it violates our package's convention that lower scores => more serious problem.

If we want to be consistent with this convention (which I advise, given that we plan to have many data-quality scores in the package), an alternative is to call this "Overlap Score" and define it in the docstring as 1 - joint probability, i.e. 1 - (number of overlapping examples) / (total number of examples in the dataset).
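
The proposed score could be sketched as follows. This is only an illustration with made-up counts; the function name `overlap_score` is hypothetical and is not part of the module under review.

```python
def overlap_score(num_overlapping_examples, num_examples):
    """Return 1 - joint probability, so lower values flag more serious overlap."""
    if num_examples <= 0:
        raise ValueError("num_examples must be positive")
    joint_probability = num_overlapping_examples / num_examples
    return 1.0 - joint_probability


# Made-up numbers: 150 overlapping examples in a dataset of 10,000 examples.
score = overlap_score(150, 10_000)
print(f"overlap score = {score:.3f}")
```

With this definition, a pair of classes with no overlapping examples gets a score of 1.0, and the score moves toward 0 as the overlap grows, which matches the lower-is-worse convention described above.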

confident_joint=None,
multi_label=False,
):
"""Prints a healthy summary of your datasets including results for powerful statistics like:
Member

"healthy summary of your datasets including results for powerful statistics like"
->
"summary of your dataset's overall label health, including:"

Member

[Nit] There are a couple occurrences of "datasets" in this file. Would replace them with "dataset"

"""Prints a healthy summary of your datasets including results for powerful statistics like:
* the classes with the most and least label issues
* classes that overlap and could potentially be merged
* overall data label quality health score statistics for your dataset
Member

[Nit] could delete final bullet since it's implied by "overall health summary"


Parameters
----------
For parameter info, see the docstring of `dataset.find_overlapping_classes`
Member

@jwmueller jwmueller Apr 10, 2022

I'd imagine health_summary() is the method we envision most users invoking from this module (and just getting results from find_overlapping_classes from the internal call inside health_summary).

So consider moving the docstrings for parameters to this method, and then having the Docstring for find_overlapping_classes say: See docstring of health_summary instead.

Member

@jwmueller jwmueller left a comment

New output format looks great! I mostly suggest some minor API refactors, and major changes in naming conventions to improve usability/clarity.

@JohnsonKuan
Contributor

Example output of all methods for Google Quickdraw dataset (below): [screenshot]

Example output of health summary in one line of code for CIFAR-100: [image]

Output looks awesome.

It may be helpful to explain in the docstring why the estimated joint prob of (A, B) is not the same as (B, A) like in your quickdraw example where joint prob of (birthday cake, cake) differed from (cake, birthday cake). I know why they differ (after initial read of the confident learning paper) but I'm guessing most new users will probably not know.
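
One way a docstring could show the asymmetry is with a tiny count matrix. The sketch below uses made-up counts and hypothetical names; it assumes rows index the given label and columns the estimated true label, and it is not the module's actual implementation.

```python
classes = ["birthday cake", "cake"]

# counts[i][j]: number of examples given label classes[i]
# whose estimated true label is classes[j]. Values are made up.
counts = [
    [900, 80],
    [20, 1000],
]
total = sum(sum(row) for row in counts)


def joint_probability(given, true):
    """Estimated P(given label == given, true label == true)."""
    i, j = classes.index(given), classes.index(true)
    return counts[i][j] / total


p_ab = joint_probability("birthday cake", "cake")
p_ba = joint_probability("cake", "birthday cake")

# Summing both directions is one possible way to get an order-independent score.
symmetric_overlap = p_ab + p_ba
print(p_ab, p_ba, symmetric_overlap)
```

In this toy example, more examples labeled "birthday cake" look like plain "cake" than the other way around, so the two ordered pairs get different joint probabilities.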

Co-Authored-By: Curtis G. Northcutt <curtis.northcutt@gmail.com>
@cgnorthcutt cgnorthcutt changed the title from "Skeleton commit for first version of new dataset module. UNTESTED" to "Introducing the new Dataset Module for cleanlab 2.0" Apr 12, 2022
anishathalye added a commit to jwmueller/cleanlab that referenced this pull request Apr 12, 2022
We already decided that pandas will be a dependency of cleanlab (also
used in the dataset module, see
cleanlab#182).
anishathalye and others added 4 commits April 12, 2022 09:21
@cgnorthcutt
Member Author

Addressed all comments and added thorough testing on real datasets with 2, 10, and more classes. Verified that methods return correct values in some cases.

@cgnorthcutt cgnorthcutt merged commit 56a4ec8 into cleanlab:master Apr 13, 2022
@cgnorthcutt cgnorthcutt deleted the dataset_module branch April 13, 2022 04:55
cgnorthcutt added a commit that referenced this pull request Apr 13, 2022
* df return type, need tests still

* Add pandas as a dependency

We already decided that pandas will be a dependency of cleanlab (also
used in the dataset module, see
#182).

* Tweak documentation

* addressed comments

* remove lazy import

* address 2nd round comments

* unit tests

* improve codecov

* Fix typo

* methods to save more space

* nocover statements for prints

* extra nocover

* nocover warnings

* test docstring formatting

* test docstring formatting2

* test docstring formatting2

* move compress to helper, find-label docs params

* readded stuff lost in merge conflict

* addressed remaining PR review comments

* docs formatting

* docs formatting2

* docs formatting3

* docs formatting4

* docs formatting5

* docs formatting5

* docs formatting6

* docs formatting7

* docs formatting8

* docs formatting9

* docs formatting19

* docs formatting20

* docs formatting20

* docs formatting21

* code formatting

* fix a bug where confident joint isnt computed

The confident joint wasn't getting computed if noise_matrix was passed in and pred_probs was not passed in. But that's bad because it stops workflows like:

```python
cl = CleanLearning()
cl.fit(data, labels, noise_matrix=noise_matrix)
cleanlab.dataset.health_summary(labels, confident_joint=cl.confident_joint)
```

* fixed bug from last commit. code in wrong place.

* print overwrite bugfix

Co-authored-by: Anish Athalye <me@anishathalye.com>
Co-authored-by: Curtis G. Northcutt <curtis.northcutt@gmail.com>