Returns DataFrame type from CleanLearning functions #199

Merged 39 commits on Apr 13, 2022

@jwmueller jwmueller commented Apr 12, 2022

CleanLearning.fit() can take in intermediate computation from either:
CleanLearning.find_label_issues(), which now returns a DataFrame,
or filter.find_label_issues(), which returns a 1D np.array that is a Boolean mask, or integer indices if return_indices_ranked_by was specified.
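
To illustrate the three accepted input forms, here is a minimal sketch of how they could be reduced to a single Boolean mask. The helper name `to_issue_mask` and its `n_examples` argument are hypothetical and are not part of cleanlab; the only detail taken from this PR is the `is_label_issue` column name.

```python
import numpy as np
import pandas as pd


def to_issue_mask(label_issues, n_examples):
    """Convert a supported `label_issues` input to a Boolean mask (hypothetical helper)."""
    if isinstance(label_issues, pd.DataFrame):
        # DataFrame form: read the Boolean column directly.
        return label_issues["is_label_issue"].to_numpy(dtype=bool)
    arr = np.asarray(label_issues)
    if arr.dtype == bool:
        # Boolean-mask form: must have one entry per example.
        if len(arr) != n_examples:
            raise ValueError("Boolean mask must have one entry per example.")
        return arr
    # Otherwise treat the array as integer indices of flagged examples.
    mask = np.zeros(n_examples, dtype=bool)
    mask[arr.astype(int)] = True
    return mask


mask_from_indices = to_issue_mask(np.array([0, 2]), n_examples=4)
```

Dispatching on type and dtype keeps a single downstream code path, whichever function produced the intermediate result.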

This PR also fixes a bug in entropy() and in get_confidence_weighted_entropy_for_each_label() related to potential 0s in logarithms.
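
For context on the zero-in-logarithm problem, the sketch below shows one common way to guard an entropy computation. Clipping with a small `eps` is an assumption made for illustration; it is not necessarily the fix applied in this PR, and `safe_entropy` is a hypothetical name.

```python
import numpy as np


def safe_entropy(pred_probs, eps=1e-12):
    """Shannon entropy of each row, clipping probabilities away from 0 before the log."""
    p = np.asarray(pred_probs, dtype=float)
    # np.log(0) is -inf and 0 * -inf is nan, so clip only the argument of the log.
    clipped = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(clipped), axis=1)


probs = np.array([[1.0, 0.0], [0.5, 0.5]])
ent = safe_entropy(probs)
```

Because the unclipped probability still multiplies the log, a zero-probability class contributes exactly zero to the sum, matching the usual convention that 0 * log(0) = 0.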

Note: I have not updated unit tests (so currently tests will fail). I'll update the unit tests after getting feedback on the APIs/code.

Many possible workflows (updated):

from cleanlab.classification import CleanLearning
from cleanlab.filter import find_label_issues
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
pred_probs = get_cross_validated_pred_probs(data, clf)

cl = CleanLearning(clf=RandomForestClassifier())

df = cl.find_label_issues(data, labels)
>>> df
    is_label_issue  label_quality  given_label  predicted_label
0             True       0.393853            2                1
1             True       0.322861            4                3
2             True       0.345246            4                3
3            False       0.850860            0                0
4            False       0.855554            0                0
..             ...            ...          ...              ...
95           False       0.500262            4                4
96           False       0.787859            0                0
97           False       0.706194            4                4
98           False       0.825338            4                4
99           False       0.970152            4                4
_ = cl.fit(data, labels)  # returns self 
>>> cl.get_label_issues()
    is_label_issue  label_quality  given_label  predicted_label  sample_weight
0             True       0.287438            2                1            0.0
1             True       0.468316            4                3            0.0
2             True       0.374178            4                3            0.0
3            False       0.949294            0                0            1.0
4            False       0.808242            0                0            1.0
..             ...            ...          ...              ...            ...
95            True       0.424141            4                3            0.0
96           False       0.869043            0                0            1.0
97           False       0.699087            4                4            1.0
98           False       0.895582            4                4            1.0
99           False       0.970152            4                4            1.0
issue_mask = find_label_issues(labels, pred_probs)
_ = cl.fit(data, labels, label_issues=issue_mask)
>>> cl.get_label_issues()
    is_label_issue  sample_weight
0             True            0.0
1             True            0.0
2             True            0.0
3            False            1.0
4            False            1.0
..             ...            ...
95           False            1.0
96           False            1.0
97           False            1.0
98           False            1.0
99           False            1.0
issue_inds = find_label_issues(labels, pred_probs, return_indices_ranked_by='normalized_margin')
_ = cl.fit(data, labels, label_issues=issue_inds)
>>> cl.get_label_issues()
    is_label_issue  sample_weight
0             True            0.0
1             True            0.0
2             True            0.0
3            False            1.0
4            False            1.0
..             ...            ...
95           False            1.0
96           False            1.0
97           False            1.0
98           False            1.0
99           False            1.0
>>> cl.save_space()
Deleted non-sklearn attributes such as label_issues_df to save space.

>>> cl.get_label_issues()
UserWarning: The label issues have not yet been computed. Run `self.find_label_issues()` or `self.fit()` first.
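
The workflow above calls `get_cross_validated_pred_probs`, which is not defined in this thread. As a sketch only, out-of-sample predicted probabilities can be obtained with scikit-learn's `cross_val_predict`; the synthetic dataset, the fold count, and the forest size below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for `data` and `labels` (assumption, not the PR's data).
data, labels = make_classification(
    n_samples=100, n_features=8, n_informative=4, n_classes=3, random_state=0
)

clf = RandomForestClassifier(n_estimators=20, random_state=0)

# Each row is predicted by a model that never saw that row during training.
pred_probs = cross_val_predict(clf, data, labels, cv=5, method="predict_proba")
```

The resulting `pred_probs` has one row per example and one column per class, which is the shape that `find_label_issues(labels, pred_probs)` is called with in the workflow.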

@jwmueller jwmueller marked this pull request as draft April 12, 2022 00:51
@jwmueller
Member Author

Note: the sample_weight column corresponds to what was previously returned from CleanLearning.fit() (except that it is now padded with 0s for pruned examples). It contains floats and is not a binary vector; it just happens to look like one in this particular example.
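
A minimal sketch of the padding described in this note, assuming the weights are computed only for the examples that were kept. The helper `pad_sample_weight` and the example weights are hypothetical.

```python
import numpy as np


def pad_sample_weight(kept_weights, issue_mask):
    """Expand weights for kept examples to full length, with 0.0 for pruned examples."""
    issue_mask = np.asarray(issue_mask, dtype=bool)
    full = np.zeros(len(issue_mask), dtype=float)
    # Kept examples are those NOT flagged as label issues.
    full[~issue_mask] = kept_weights
    return full


issue_mask = np.array([True, False, False, True])
kept_weights = np.array([1.3, 0.8])  # floats, not necessarily 0 or 1
full = pad_sample_weight(kept_weights, issue_mask)
```

With non-trivial weights such as these, the padded column is visibly not binary, which is the point the note makes.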

We already decided that pandas will be a dependency of cleanlab (also used in the dataset module, see cleanlab#182).
Member

@anishathalye anishathalye left a comment

New API LGTM. Left a couple comments and made some smaller tweaks directly.

Member

@cgnorthcutt cgnorthcutt left a comment

Great idea to return the dataframe for CleanLearning.find_label_issues

Main feedback:

  • Maybe just return self.clf in fit(). That's standard for sklearn-compatible classifiers.
  • Move the logic you added for building up self.label_issues_df into the accessor get_label_errors (it doesn't belong in .fit()).
  • I would keep self.sample_weight and self.label_errors_mask and related instance variables around. DataFrames can use pointers (so no space duplication by storing both), and these are independent of the label_issues_df, so they should be accessible on their own (easier to find them too).
  • I'd add given labels as a column to the dataframe, which does add space, but I think it makes it more useful when juxtaposed with the prediction column.
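
The second and third suggestions can be sketched roughly as follows: keep the raw arrays as instance attributes and assemble the DataFrame only inside the accessor. The class `IssueStore` and its `record` method are hypothetical stand-ins, not cleanlab's implementation; only the column names, the accessor name `get_label_issues`, and the `save_space` behavior are taken from the PR description.

```python
import warnings

import numpy as np
import pandas as pd


class IssueStore:
    """Hypothetical stand-in for the relevant part of a CleanLearning-like class."""

    def __init__(self):
        self.label_issues_mask = None
        self.sample_weight = None

    def record(self, label_issues_mask, sample_weight):
        # What fit() would do: store plain arrays, build no DataFrame here.
        self.label_issues_mask = np.asarray(label_issues_mask, dtype=bool)
        self.sample_weight = np.asarray(sample_weight, dtype=float)

    def get_label_issues(self):
        # The accessor assembles the DataFrame on demand from the stored arrays.
        if self.label_issues_mask is None:
            warnings.warn("Label issues have not been computed yet.", UserWarning)
            return None
        return pd.DataFrame(
            {"is_label_issue": self.label_issues_mask, "sample_weight": self.sample_weight}
        )

    def save_space(self):
        # Drop the stored arrays; the accessor then warns until they are recomputed.
        self.label_issues_mask = None
        self.sample_weight = None


store = IssueStore()
store.record([True, False], [0.0, 1.0])
issues_df = store.get_label_issues()
```

Under this layout the arrays stay individually accessible, and the DataFrame is a view-like convenience rather than the primary store.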

@jwmueller
Member Author

Addressed comments except lazy import. See new workflow/outputs above.

label_issues_df : pd.DataFrame
DataFrame with same format as the one returned by :py:meth:`CleanLearning.fit()
<cleanlab.classification.CleanLearning.fit>`.
See there for documentation regarding column definitions.
Member

I think you should move the return docstring for label_issues_df here, since THIS is where the df actually gets returned, and just refer to it here in fit()?

Member Author

But fit() may add additional columns (e.g. sample weights).
Then we would have to define the columns of this DF in two places, which is not nice. So I prefer to define all the columns in one place, and it seems like it has to be fit() for all the possible column definitions to make sense.

Eventually want to add other info because CleanLearning.fit() could choose to auto-fix some labels and do other stuff beyond just pruning all issues.

Member

Again, I really think you should provide a spec for the df that's returned, since this is the method that creates that df and returns it, and fit doesn't even return it.

@jwmueller
Member Author

Updated again to address the second round of comments. I did not move the docstring with the label_issues_df column descriptions from fit(), for the reasons stated above.

@codecov

codecov bot commented Apr 12, 2022

Codecov Report

Merging #199 (2f20374) into master (0dc384a) will decrease coverage by 0.15%.
The diff coverage is 94.79%.

@@            Coverage Diff             @@
##           master     #199      +/-   ##
==========================================
- Coverage   95.41%   95.26%   -0.16%     
==========================================
  Files          11       12       +1     
  Lines         786      908     +122     
  Branches      167      180      +13     
==========================================
+ Hits          750      865     +115     
+ Misses         13       12       -1     
- Partials       23       31       +8     
| Impacted Files | Coverage Δ |
|---|---|
| cleanlab/internal/util.py | 99.01% <85.71%> (-0.99%) ⬇️ |
| cleanlab/classification.py | 94.58% <95.23%> (+1.67%) ⬆️ |
| cleanlab/internal/label_quality_utils.py | 100.00% <100.00%> (ø) |
| cleanlab/rank.py | 96.82% <100.00%> (+2.97%) ⬆️ |
| cleanlab/benchmarking/noise_generation.py | 95.53% <0.00%> (-0.20%) ⬇️ |
| cleanlab/filter.py | 93.58% <0.00%> (-0.17%) ⬇️ |
| cleanlab/count.py | 94.96% <0.00%> (-0.04%) ⬇️ |
| cleanlab/internal/latent_algebra.py | 100.00% <0.00%> (ø) |
| cleanlab/dataset.py | 89.70% <0.00%> (ø) |

... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jwmueller jwmueller marked this pull request as ready for review April 12, 2022 08:36
@jwmueller
Member Author

added tests

@jwmueller
Member Author

jwmueller commented Apr 12, 2022

Note: all lines that do not pass codecov are defensive raise ValueError statements that don't warrant testing. I added a raise ValueError line to .coveragerc under exclude_lines, which should stop codecov from complaining about such lines, but I don't think it takes effect until this PR is merged.

Member

@cgnorthcutt cgnorthcutt left a comment

two other small changes

@jwmueller
Member Author

fixed docstrings formatting

jwmueller and others added 3 commits April 13, 2022 04:53
The confident joint wasn't getting computed if noise_matrix was passed in and pred_probs was not passed in. But that's bad because it stops workflows like:

```python
cl = CleanLearning()
cl.fit(data, labels, noise_matrix=noise_matrix)
cleanlab.dataset.health_summary(labels, confident_joint=cl.confident_joint)
```

Member

@cgnorthcutt cgnorthcutt left a comment

Potentially a bug, or potentially an old print statement that is out of date. See comment.

Member

@cgnorthcutt cgnorthcutt left a comment

LGTM

@cgnorthcutt cgnorthcutt merged commit d1a4bc8 into cleanlab:master Apr 13, 2022
@cgnorthcutt cgnorthcutt deleted the cldataframe branch April 13, 2022 20:00