Introducing the new Dataset Module for cleanlab 2.0 #182
Codecov Report
@@ Coverage Diff @@
## master #182 +/- ##
==========================================
- Coverage 95.41% 94.97% -0.45%
==========================================
Files 11 12 +1
Lines 786 855 +69
Branches 167 185 +18
==========================================
+ Hits 750 812 +62
- Misses 13 15 +2
- Partials 23 28 +5
cleanlab/dataset.py
Outdated
# along with cleanlab. If not, see <https://www.gnu.org/licenses/>.

"""Dataset Module
consider demonstrating these functions at end of one of the quickstart tutorials, maybe text.ipynb
this is left up for grabs!
IMPORTANT - this merge should include the commit history (do not squash, either merge directly or rebase and merge).
cleanlab/dataset.py
Outdated
This method provides two scores in the pandas Data Frame that is returned:
* "Num Overlapping Examples" - The number of examples where the two classes overlap
* "Joint Probability" - "Num Overlapping Examples" / total number of examples in the dataset.
interchangeable and returns a dataframe with the classe and the joint probability score
typos in text and formatting issue
cleanlab/dataset.py
Outdated
This method provides two scores in the pandas Data Frame that is returned:
* "Num Overlapping Examples" - The number of examples where the two classes overlap
* "Joint Probability" - "Num Overlapping Examples" / total number of examples in the dataset.
I like the score "joint probability" because it's very interpretable. However, one issue is that it violates our package's convention that lower scores => more serious problem.
If we want to be consistent with this convention (which I advise, given we plan to have tons of data-quality scores in the package), then an alternative is to call this "Overlap Score" and define it in the docstring as 1 - joint probability, i.e. = 1 - (Number of overlapping examples) / (Total number of examples in the dataset)
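To make the proposed definition concrete, here is a minimal sketch in plain Python. It is not the module's implementation: the `overlap_scores` helper, the column names other than those quoted above, and the assumption that overlap is counted from a matrix where `counts[i][j]` is the number of examples given label `i` that appear to belong to class `j` are all hypothetical.

```python
from itertools import combinations


def overlap_scores(counts):
    """Score every unordered pair of classes.

    counts[i][j] is assumed (for this sketch only) to hold the number of
    examples given label i that appear to belong to class j.
    """
    total = sum(sum(row) for row in counts)
    rows = []
    for i, j in combinations(range(len(counts)), 2):
        # Overlap is symmetric: count confusion in both directions.
        num_overlapping = counts[i][j] + counts[j][i]
        joint_probability = num_overlapping / total
        rows.append(
            {
                "Class Index A": i,
                "Class Index B": j,
                "Num Overlapping Examples": num_overlapping,
                "Joint Probability": joint_probability,
                # Proposed convention: lower score => more serious problem.
                "Overlap Score": 1 - joint_probability,
            }
        )
    # Most serious overlaps first, i.e. ascending Overlap Score.
    rows.sort(key=lambda row: row["Overlap Score"])
    return rows


# Hypothetical 3-class count matrix covering 100 examples.
counts = [
    [40, 8, 2],
    [6, 30, 0],
    [1, 0, 13],
]
scores = overlap_scores(counts)
```

With this ordering, the pair of classes most worth inspecting (or merging) lands in the first row, the same way it would for the other lower-is-worse scores in the package.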
cleanlab/dataset.py
Outdated
confident_joint=None,
multi_label=False,
):
"""Prints a healthy summary of your datasets including results for powerful statistics like:
"healthy summary of your datasets including results for powerful statistics like"
->
"summary of your dataset's overall label health, including:"
[Nit] There are a couple occurrences of "datasets" in this file. Would replace them with "dataset"
cleanlab/dataset.py
Outdated
"""Prints a healthy summary of your datasets including results for powerful statistics like:
* the classes with the most and least label issues
* classes that overlap and could potentially be merged
* overall data label quality health score statistics for your dataset
[Nit] could delete final bullet since it's implied by "overall health summary"
cleanlab/dataset.py
Outdated
Parameters
----------
For parameter info, see the docstring of `dataset.find_overlapping_classes`
I'd imagine health_summary() is the method we envision most users invoking from this module (and just getting results from find_overlapping_classes from the internal call inside health_summary).
So consider moving the docstrings for parameters to this method, and then having the docstring for find_overlapping_classes say: See docstring of health_summary instead.
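A rough sketch of what that docstring arrangement could look like. The function names and the `confident_joint` / `multi_label` parameters come from the diff under review. The `labels` parameter, the parameter descriptions, and the placeholder bodies are assumptions made for illustration only.

```python
def find_overlapping_classes(labels, confident_joint=None, multi_label=False):
    """Return pairs of classes that overlap.

    Parameters
    ----------
    For parameter info, see the docstring of `health_summary`.
    """
    # Placeholder body: the real computation is out of scope for this sketch.
    return []


def health_summary(labels, confident_joint=None, multi_label=False):
    """Print a summary of your dataset's overall label health.

    Parameters
    ----------
    labels : list
        Given class label for each example (description assumed).
    confident_joint : list of lists, optional
        Precomputed counts to reuse (description assumed).
    multi_label : bool, optional
        Whether an example may carry several labels (description assumed).
    """
    # The user-facing entry point calls the lower-level helper internally,
    # so the full parameter documentation only needs to live here.
    overlapping = find_overlapping_classes(
        labels, confident_joint=confident_joint, multi_label=multi_label
    )
    return {"overlapping_classes": overlapping}
```

This keeps a single source of truth for the parameter docs on the method most users will call, and the helper's docstring stays short.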
New output format looks great! I mostly suggest some minor API refactors, and major changes in naming conventions to improve usability/clarity.
Co-Authored-By: Curtis G. Northcutt <curtis.northcutt@gmail.com>
We already decided that pandas will be a dependency of cleanlab (also used in the dataset module, see cleanlab#182).
This makes it consistent with the other two, "classes_by_label_quality" and "overlapping_classes".
Addressed all comments and added thorough testing on real datasets with 2, 10, and more classes. Verified that methods return correct values in some cases.
* df return type, need tests still
* Add pandas as a dependency

  We already decided that pandas will be a dependency of cleanlab (also used in the dataset module, see #182).
* Tweak documentation
* addressed comments
* remove lazy import
* address 2nd round comments
* unit tests
* improve codecov
* Fix typo
* methods to save more space
* nocover statements for prints
* extra nocover
* nocover warnings
* test docstring formatting
* test docstring formatting2
* test docstring formatting2
* move compress to helper, find-label docs params
* readded stuff lost in merge conflict
* addressed remaining PR review comments
* docs formatting
* docs formatting2
* docs formatting3
* docs formatting4
* docs formatting5
* docs formatting5
* docs formatting6
* docs formatting7
* docs formatting8
* docs formatting9
* docs formatting19
* docs formatting20
* docs formatting20
* docs formatting21
* code formatting
* fix a bug where confident joint isnt computed

  The confident joint wasn't getting computed if noise_matrix was passed in and pred_probs was not passed in. But that's bad because it stops workflows like:

  ```python
  cl = CleanLearning()
  cl.fit(data, labels, noise_matrix=noise_matrix)
  cleanlab.dataset.health_summary(labels, confident_joint=cl.confident_joint)
  ```
* fixed bug from last commit. code in wrong place.
* print overwrite bugfix

Co-authored-by: Anish Athalye <me@anishathalye.com>
Co-authored-by: Curtis G. Northcutt <curtis.northcutt@gmail.com>
Dataset module
This module introduces the new API for working with entire datasets, with suggestions for overlapping classes that you may want to merge, low-quality classes you may want to remove, an overall dataset health score, and a full health summary report (generated for your dataset in a single line of code).
need to add tests!
Updates made to module during this PR:
Future support
CleanLearning
), but that's for a later release.