Returns DataFrame type from CleanLearning functions #199

jwmueller · 2022-04-12T00:50:38Z

CleanLearning.fit() can take in intermediate computation from either:
CleanLearning.find_label_issues(), which now returns a DataFrame, or
filter.find_label_issues(), which returns a 1D np.array: a Boolean mask by default, or integer indices if return_indices_ranked_by was specified.
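
To make the three accepted input formats concrete, here is a minimal sketch of how they could be normalized into a single Boolean mask. The helper name `to_issue_mask` and its `num_examples` argument are hypothetical and are not taken from this PR; only the `is_label_issue` column name comes from the outputs shown below.

```python
import numpy as np
import pandas as pd


def to_issue_mask(label_issues, num_examples):
    """Convert a DataFrame, Boolean mask, or index array into a Boolean mask.

    Hypothetical helper for illustration; not the code added in this PR.
    """
    if isinstance(label_issues, pd.DataFrame):
        # DataFrame format: read the Boolean column directly.
        return label_issues["is_label_issue"].to_numpy(dtype=bool)
    arr = np.asarray(label_issues)
    if arr.dtype == bool:
        # Already a Boolean mask with one entry per example.
        if arr.shape != (num_examples,):
            raise ValueError("Boolean mask must have one entry per example")
        return arr
    # Otherwise treat the values as integer indices of the flagged examples.
    mask = np.zeros(num_examples, dtype=bool)
    mask[arr.astype(int)] = True
    return mask


mask_from_bool = to_issue_mask(np.array([True, False, True, False]), 4)
mask_from_inds = to_issue_mask(np.array([2, 0]), 4)
mask_from_df = to_issue_mask(
    pd.DataFrame({"is_label_issue": [True, False, True, False]}), 4
)
```

All three calls describe the same two flagged examples, so they should produce identical masks. Note that converting ranked indices to a mask discards the ranking order.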

This PR also fixes a bug in entropy() and in get_confidence_weighted_entropy_for_each_label() related to potential 0s inside logarithms.
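
For readers unfamiliar with this class of bug: `np.log(0)` evaluates to `-inf`, and `0 * -inf` is `nan`, so a single zero probability can poison an entropy computation. The sketch below shows one common guard, clipping the argument of the logarithm. The function name `safe_entropy` and the constant `EPS` are assumptions for illustration; the actual fix in this PR may differ.

```python
import numpy as np

EPS = 1e-8  # hypothetical floor; the constant used in the PR is not shown here


def safe_entropy(pred_probs):
    """Row-wise Shannon entropy (in nats) that tolerates zero probabilities."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    clipped = np.clip(pred_probs, EPS, None)  # keep log() away from exactly 0
    return -np.sum(pred_probs * np.log(clipped), axis=1)


probs = np.array([[1.0, 0.0], [0.5, 0.5]])
entropies = safe_entropy(probs)
```

With the clip in place, a one-hot row yields an entropy of zero instead of `nan`, and a uniform two-class row yields `log(2)`.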

Note: I have not updated unit tests (so currently tests will fail). I'll update the unit tests after getting feedback on the APIs/code.

Many possible workflows (updated):

from cleanlab.classification import CleanLearning
from cleanlab.filter import find_label_issues
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
pred_probs = get_cross_validated_pred_probs(data, clf)

cl = CleanLearning(clf=RandomForestClassifier())

df = cl.find_label_issues(data, labels)

>>> df
    is_label_issue  label_quality  given_label  predicted_label
0             True       0.393853            2                1
1             True       0.322861            4                3
2             True       0.345246            4                3
3            False       0.850860            0                0
4            False       0.855554            0                0
..             ...            ...          ...              ...
95           False       0.500262            4                4
96           False       0.787859            0                0
97           False       0.706194            4                4
98           False       0.825338            4                4
99           False       0.970152            4                4

_ = cl.fit(data, labels)  # returns self

>>> cl.get_label_issues()
    is_label_issue  label_quality  given_label  predicted_label  sample_weight
0             True       0.287438            2                1            0.0
1             True       0.468316            4                3            0.0
2             True       0.374178            4                3            0.0
3            False       0.949294            0                0            1.0
4            False       0.808242            0                0            1.0
..             ...            ...          ...              ...            ...
95            True       0.424141            4                3            0.0
96           False       0.869043            0                0            1.0
97           False       0.699087            4                4            1.0
98           False       0.895582            4                4            1.0
99           False       0.970152            4                4            1.0

issue_mask = find_label_issues(labels, pred_probs)
_ = cl.fit(data, labels, label_issues=issue_mask)

>>> cl.get_label_issues()
    is_label_issue  sample_weight
0             True            0.0
1             True            0.0
2             True            0.0
3            False            1.0
4            False            1.0
..             ...            ...
95           False            1.0
96           False            1.0
97           False            1.0
98           False            1.0
99           False            1.0

issue_inds = find_label_issues(labels, pred_probs, return_indices_ranked_by='normalized_margin')
_ = cl.fit(data, labels, label_issues=issue_inds)

>>> cl.get_label_issues()
    is_label_issue  sample_weight
0             True            0.0
1             True            0.0
2             True            0.0
3            False            1.0
4            False            1.0
..             ...            ...
95           False            1.0
96           False            1.0
97           False            1.0
98           False            1.0
99           False            1.0

>>> cl.save_space()
Deleted non-sklearn attributes such as label_issues_df to save space.

>>> cl.get_label_issues()
UserWarning: The label issues have not yet been computed. Run `self.find_label_issues()` or `self.fit()` first.

jwmueller · 2022-04-12T01:01:04Z

Note: the sample_weight column corresponds to what CleanLearning.fit() returned before, except that it is now padded with 0s for pruned examples. It contains floats and is not a binary vector; it just happens to look like one in this particular example.
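
A small sketch of the padding described above, assuming the weights are computed only for the examples that survive pruning. The helper `pad_sample_weight` and the weight values are made up for illustration and are not taken from the PR.

```python
import numpy as np


def pad_sample_weight(kept_weights, issue_mask):
    """Expand weights of kept examples to full length, using 0 for pruned ones."""
    issue_mask = np.asarray(issue_mask, dtype=bool)
    full = np.zeros(len(issue_mask), dtype=float)
    full[~issue_mask] = kept_weights  # kept examples retain their float weights
    return full


issue_mask = np.array([True, False, False, True, False])
kept_weights = np.array([1.2, 0.8, 1.0])  # made-up weights for the 3 kept examples
sample_weight = pad_sample_weight(kept_weights, issue_mask)
```

Here the kept examples carry non-binary weights, which shows why the column is a float vector rather than a 0/1 indicator.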

We already decided that pandas will be a dependency of cleanlab (also used in the dataset module, see cleanlab#182).

anishathalye

New API LGTM. Left a couple comments and made some smaller tweaks directly.

cleanlab/classification.py

cgnorthcutt

Great idea to return the dataframe for CleanLearning.find_label_issues

Main feedback:

Maybe just return self.clf in fit(). That's standard for sklearn-compatible classifiers.
Move the logic you added for building up self.label_issues_df into the accessor get_label_errors (it doesn't belong in .fit()).
I would keep self.sample_weight, self.label_errors_mask, and related instance variables around. DataFrames can use pointers (so no space duplication by storing both), and these are independent of label_issues_df, so they should be accessible on their own (easier to find them too).
I'd add the given labels as a column to the DataFrame, which does add space, but I think it makes it more useful when juxtaposed with the prediction column.
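
The second and third suggestions above amount to keeping the raw arrays as attributes and assembling the DataFrame only when the accessor is called. A minimal sketch of that pattern follows; the class `IssueHolder` and its attribute names are hypothetical stand-ins, not the code that was merged.

```python
import numpy as np
import pandas as pd


class IssueHolder:
    """Hypothetical stand-in for the relevant state of a fitted wrapper."""

    def __init__(self, labels, label_issues_mask, sample_weight):
        # Raw arrays stay accessible on their own, independent of the DataFrame.
        self.labels = np.asarray(labels)
        self.label_issues_mask = np.asarray(label_issues_mask, dtype=bool)
        self.sample_weight = np.asarray(sample_weight, dtype=float)
        self._label_issues_df = None  # built lazily on first access

    def get_label_issues(self):
        """Build the DataFrame on first call and reuse it afterwards."""
        if self._label_issues_df is None:
            self._label_issues_df = pd.DataFrame(
                {
                    "is_label_issue": self.label_issues_mask,
                    "given_label": self.labels,
                    "sample_weight": self.sample_weight,
                }
            )
        return self._label_issues_df


holder = IssueHolder(
    labels=[2, 4, 0],
    label_issues_mask=[True, True, False],
    sample_weight=[0.0, 0.0, 1.0],
)
df = holder.get_label_issues()
```

Whether pandas shares memory with the underlying arrays depends on the pandas version and dtypes, so this sketch makes no claim about avoiding duplication.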

cleanlab/internal/util.py

cleanlab/classification.py

jwmueller · 2022-04-12T05:37:04Z

Addressed comments except lazy import. See new workflow/outputs above.

cleanlab/classification.py

setup.py

cleanlab/classification.py

cgnorthcutt · 2022-04-12T06:28:46Z

cleanlab/classification.py

+        label_issues_df : pd.DataFrame
+          DataFrame with same format as the one returned by :py:meth:`CleanLearning.fit()
+        <cleanlab.classification.CleanLearning.fit>`.
+          See there for documentation regarding column definitions.


I think you should move the return docstring for label_issues_df here, since THIS is where the df actually gets returned, and just refer to it from fit()?

But fit() may add additional columns (e.g. sample weights).
We would then have to define the columns of this DataFrame in two places, which is not nice. I'd prefer to define all the columns in one place, and it seems that place has to be fit() for all the possible column definitions to make sense.

Eventually we want to add other info, because CleanLearning.fit() could choose to auto-fix some labels and do other things beyond just pruning all issues.

Again, I really think you should provide a spec for the df that's returned, since this is the method that creates and returns that df, and fit() doesn't even return it.

cleanlab/classification.py

jwmueller · 2022-04-12T07:03:43Z

Updated again to address the second round of comments. I did not move the docstring with the label_issues_df column descriptions out of fit(), for the reasons stated above.

codecov · 2022-04-12T08:34:56Z

Codecov Report

Merging #199 (2f20374) into master (0dc384a) will decrease coverage by 0.15%.
The diff coverage is 94.79%.

@@            Coverage Diff             @@
##           master     #199      +/-   ##
==========================================
- Coverage   95.41%   95.26%   -0.16%     
==========================================
  Files          11       12       +1     
  Lines         786      908     +122     
  Branches      167      180      +13     
==========================================
+ Hits          750      865     +115     
+ Misses         13       12       -1     
- Partials       23       31       +8

| Impacted Files | Coverage Δ | |
|---|---|---|
| cleanlab/internal/util.py | `99.01% <85.71%> (-0.99%)` | ⬇️ |
| cleanlab/classification.py | `94.58% <95.23%> (+1.67%)` | ⬆️ |
| cleanlab/internal/label_quality_utils.py | `100.00% <100.00%> (ø)` | |
| cleanlab/rank.py | `96.82% <100.00%> (+2.97%)` | ⬆️ |
| cleanlab/benchmarking/noise_generation.py | `95.53% <0.00%> (-0.20%)` | ⬇️ |
| cleanlab/filter.py | `93.58% <0.00%> (-0.17%)` | ⬇️ |
| cleanlab/count.py | `94.96% <0.00%> (-0.04%)` | ⬇️ |
| cleanlab/internal/latent_algebra.py | `100.00% <0.00%> (ø)` | |
| cleanlab/dataset.py | `89.70% <0.00%> (ø)` | |
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jwmueller · 2022-04-12T08:36:26Z

added tests

jwmueller · 2022-04-12T09:25:50Z

Note: all lines that do not pass codecov are defensive raise ValueError statements that don't warrant testing. I added a raise ValueError line to .coveragerc under exclude_lines, which should stop codecov from complaining about such lines, but I don't think it takes effect until this PR is merged.

cgnorthcutt

two other small changes

cgnorthcutt · 2022-04-13T03:21:42Z

cleanlab/classification.py

+        label_issues_df : pd.DataFrame
+          DataFrame with same format as the one returned by :py:meth:`CleanLearning.fit()
+        <cleanlab.classification.CleanLearning.fit>`.
+          See there for documentation regarding column definitions.


Again, I really think you should provide a spec for the df that's returned, since this is the method that creates and returns that df, and fit() doesn't even return it.

cleanlab/classification.py

jwmueller · 2022-04-13T11:47:48Z

fixed docstrings formatting

The confident joint wasn't getting computed if noise_matrix was passed in and pred_probs was not passed in. But that's bad because it stops workflows like:

```python
cl = CleanLearning()
cl.fit(data, labels, noise_matrix=noise_matrix)
cleanlab.dataset.health_summary(labels, confident_joint=cl.confident_joint)
```

cgnorthcutt

Potentially a bug, or potentially an old print statement that is out of date. See comment.

cleanlab/classification.py

cgnorthcutt

LGTM

df return type, need tests still

d250b75

jwmueller requested review from cgnorthcutt, anishathalye and JohnsonKuan April 12, 2022 00:50

jwmueller marked this pull request as draft April 12, 2022 00:51

anishathalye added 2 commits April 11, 2022 21:20

Add pandas as a dependency

13518fe

We already decided that pandas will be a dependency of cleanlab (also used in the dataset module, see cleanlab#182).

Tweak documentation

35a9d03

anishathalye approved these changes Apr 12, 2022

cgnorthcutt requested changes Apr 12, 2022

jwmueller added 2 commits April 11, 2022 22:31

addressed comments

e742071

merge conflict lazy import

ea154fe

remove lazy import

fce9f83

cgnorthcutt requested changes Apr 12, 2022

address 2nd round comments

1422347

unit tests

37896c5

jwmueller requested a review from cgnorthcutt April 12, 2022 08:35

jwmueller marked this pull request as ready for review April 12, 2022 08:36

improve codecov

08f8dce

anishathalye and others added 6 commits April 12, 2022 12:49

Fix typo

7164130

methods to save more space

8e8ef79

Merge branch 'cldataframe' of https://github.com/jwmueller/cleanlab into cldataframe

d5b9572

nocover statements for prints

71acef7

extra nocover

eac30ac

nocover warnings

20e490a

cgnorthcutt requested changes Apr 13, 2022

jwmueller added 20 commits April 13, 2022 01:01

test docstring formatting

9762d0a

test docstring formatting2

937df62

test docstring formatting2

f141f3a

move compress to helper, find-label docs params

8930ff5

readded stuff lost in merge conflict

4434a91

addressed remaining PR review comments

71e486d

docs formatting

962815f

docs formatting2

313a7b3

docs formatting3

2ad6c41

docs formatting4

7770e94

docs formatting5

0c24417

docs formatting5

7ab4294

docs formatting6

6a507b6

docs formatting7

f6e4490

docs formatting8

e0bfdd5

docs formatting9

bc7ee91

docs formatting19

6f077d7

docs formatting20

445d282

docs formatting20

13c13f1

docs formatting21

590089e

jwmueller and others added 3 commits April 13, 2022 04:53

code formatting

dee8b40

fixed bug from last commit. code in wrong place.

d20a6eb

cgnorthcutt reviewed Apr 13, 2022

print overwrite bugfix

2f20374

cgnorthcutt approved these changes Apr 13, 2022

cgnorthcutt merged commit d1a4bc8 into cleanlab:master Apr 13, 2022

cgnorthcutt deleted the cldataframe branch April 13, 2022 20:00
