Returns DataFrame type from CleanLearning functions #199
Conversation
Note: the sample_weight column corresponds to what CleanLearning.fit() was returning before (except now padded with 0s for pruned examples). It contains floats and is not a binary vector; it just happens to look like one in this particular example.
We already decided that pandas will be a dependency of cleanlab (also used in the dataset module, see cleanlab#182).
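To make the note above concrete, here is a hedged sketch of what such a DataFrame could look like; the column names and values are illustrative, not cleanlab's actual output:

```python
import pandas as pd

# Hypothetical illustration: rows flagged as label issues get
# sample_weight 0; the other rows keep their float weights, which
# merely *look* binary in small examples like this one.
label_issues_df = pd.DataFrame({
    "is_label_issue": [False, True, False, False, True],
    "given_label":    [0, 1, 1, 0, 0],
    "sample_weight":  [1.25, 0.0, 0.8, 1.25, 0.0],
})
print(label_issues_df)
```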
New API LGTM. Left a couple comments and made some smaller tweaks directly.
Great idea to return the dataframe for `CleanLearning.find_label_issues`.
Main feedback:
- Maybe just return `self.clf` in `fit()`. That's standard for sklearn-compatible classifiers.
- Move the logic you added for building up `self.label_issues_df` into the accessor `get_label_errors()`; it doesn't belong in `fit()`.
- I would keep `self.sample_weight`, `self.label_errors_mask`, and related instance variables around. DataFrames can use pointers (so storing both duplicates no space), and these are independent of `label_issues_df`, so they should be accessible on their own (easier to find them, too).
- I'd add the given labels as a column to the dataframe. That does add space, but I think it makes the dataframe more useful when juxtaposed with the prediction column.
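For reference, the sklearn convention alluded to in the first bullet is that `fit()` returns the fitted estimator so that calls can be chained. A minimal sketch (class and method names here are illustrative, not cleanlab's actual code):

```python
class SklearnStyleEstimator:
    """Illustrative skeleton showing the fit()-returns-estimator convention."""

    def fit(self, X, y):
        # ... fit the underlying model, record label-issue info, etc. ...
        self.fitted_ = True
        # Returning the estimator enables chaining:
        #   SklearnStyleEstimator().fit(X, y).predict(X)
        return self

    def predict(self, X):
        return [0 for _ in X]  # placeholder prediction
```

With this convention, `est = SklearnStyleEstimator().fit(X, y)` leaves `est` bound to the fitted estimator itself.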
Addressed comments except lazy import. See new workflow/outputs above.
cleanlab/classification.py (Outdated)
```
label_issues_df : pd.DataFrame
    DataFrame with same format as the one returned by :py:meth:`CleanLearning.fit()
    <cleanlab.classification.CleanLearning.fit>`.
    See there for documentation regarding column definitions.
```
I think you should move the return docstring for `label_issues_df` here, since this is where the DataFrame actually gets returned, and just refer to it from `fit()`.
But `fit()` may add additional columns (e.g. sample weights). Then we would have to define the columns of this DataFrame in two places, which is not nice. I prefer to define all the columns in one place, and it seems like that has to be `fit()` for all the possible column definitions to make sense.
Eventually we want to add other info, because `CleanLearning.fit()` could choose to auto-fix some labels and do other things beyond just pruning all issues.
Again, I really think you should provide a spec for the DataFrame that's returned here, since this is the method that creates and returns that DataFrame, and `fit()` doesn't even return it.
Updated again to address the second round of comments; did not move the docstring with the `label_issues_df` column descriptions from `fit()`, for the reasons stated above.
Codecov Report
```
@@            Coverage Diff             @@
##           master     #199      +/-  ##
=========================================
- Coverage   95.41%   95.26%   -0.16%
=========================================
  Files          11       12       +1
  Lines         786      908     +122
  Branches      167      180      +13
=========================================
+ Hits          750      865     +115
+ Misses         13       12       -1
- Partials       23       31       +8
```
Continue to review full report at Codecov.
Added tests.
Note: all lines that Codecov reports as uncovered are defensive.
Two other small changes:
Fixed docstring formatting.
The confident joint wasn't getting computed if `noise_matrix` was passed in and `pred_probs` was not. That's bad because it stops workflows like:

```python
cl = CleanLearning()
cl.fit(data, labels, noise_matrix=noise_matrix)
cleanlab.dataset.health_summary(labels, confident_joint=cl.confident_joint)
```
Potentially a bug, or potentially an old print statement that is out of date; see comment.
LGTM
`CleanLearning.fit()` can take in intermediate computation from either:
- `CleanLearning.find_label_issues()` -- now a DataFrame, or
- `filter.find_label_issues()` -- a 1D `np.ndarray`, which is a Boolean mask, or integer indices if `return_indices_ranked_by` was specified.
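A hedged sketch of how those two array forms from `filter.find_label_issues()` could be normalized into one representation; `to_boolean_mask` is a hypothetical helper, not cleanlab's actual code:

```python
import numpy as np

def to_boolean_mask(label_issues, n_examples):
    """Normalize either output form of filter.find_label_issues():
    a Boolean mask is returned as-is; an array of integer indices
    (the return_indices_ranked_by case) is scattered into a mask.
    Illustrative helper, not cleanlab's implementation."""
    label_issues = np.asarray(label_issues)
    if label_issues.dtype == bool:
        return label_issues
    mask = np.zeros(n_examples, dtype=bool)
    mask[label_issues] = True
    return mask
```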
This PR also fixes a bug in `entropy()` and in `get_confidence_weighted_entropy_for_each_label()` related to potential 0s inside logarithms.
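One standard guard against 0s inside logarithms is to clip probabilities away from zero before taking the log. This sketch shows the idea only; the function name and eps value are illustrative, not necessarily the exact fix in the PR:

```python
import numpy as np

def entropy_rows(pred_probs, eps=1e-6):
    """Per-row entropy (in bits) that avoids log(0) by clipping
    probabilities into [eps, 1]. Illustrative sketch."""
    clipped = np.clip(pred_probs, eps, 1.0)
    return -np.sum(clipped * np.log2(clipped), axis=1)
```

Without the clip, a row containing an exact 0 probability would produce `0 * log(0) = nan` under NumPy's semantics instead of contributing 0 to the entropy.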
Note: I have not updated unit tests (so currently tests will fail). I'll update the unit tests after getting feedback on the APIs/code.
Many possible workflows (updated):