Add overview for computing out-of-sample predicted probabilities with cross-validation to doc site #166

weijinglok · 2022-04-05T08:19:23Z

This PR adds a doc-site page giving an overview of how to compute out-of-sample predicted probabilities with cross-validation. A mockup is available here.

codecov · 2022-04-05T08:21:13Z

Codecov Report

Merging #166 (fa20733) into master (5b6d297) will decrease coverage by 0.68%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #166      +/-   ##
==========================================
- Coverage   87.64%   86.96%   -0.69%     
==========================================
  Files          12       12              
  Lines        1028     1028              
  Branches      194      194              
==========================================
- Hits          901      894       -7     
- Misses        104      108       +4     
- Partials       23       26       +3

Impacted Files	Coverage Δ
cleanlab/filter.py	`87.42% <0.00%> (-4.41%)`	⬇️
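
For readers checking the numbers, the percentages in the report can be reproduced from the hit and line counts in the diff table. This is a small sketch, not Codecov's own code: the variable names are made up for illustration, and how Codecov rounds or truncates the displayed values is not stated in the report.

```python
# Counts copied from the coverage diff table above.
lines = 1028
hits_before, misses_before, partials_before = 901, 104, 23
hits_after, misses_after, partials_after = 894, 108, 26

# Every line is a hit, a miss, or a partial, so the three counts sum to the line total.
assert hits_before + misses_before + partials_before == lines
assert hits_after + misses_after + partials_after == lines

# Coverage is the fraction of lines that were hit, expressed as a percentage.
coverage_before = 100 * hits_before / lines
coverage_after = 100 * hits_after / lines
coverage_drop = coverage_before - coverage_after

print(f"before: {coverage_before:.4f}%")
print(f"after:  {coverage_after:.4f}%")
print(f"drop:   {coverage_drop:.4f} percentage points")
```

The drop corresponds to the 7 lines that moved from "hit" to "miss" or "partial", all of them in `cleanlab/filter.py` according to the impacted-files list.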

jwmueller

I can add links to this page from the docstrings, where appropriate, in a future PR.
You can also help by pointing out anywhere you think it would be good to link to this page from. Could you add the first link to this page in this PR, though?

I suggest adding a link from Quickstart ("obtained via cross-validation")

cgnorthcutt · 2022-04-05T08:30:15Z

I would delete the figure entirely and just show cross validation simply like this:

# Step 0
# Separate your data into three equal sized chunks (this is called 3-fold cross validation)
# Data = A B C

# Step 1 -- get oos pred probs for A
model = Model()
model.fit(data=B+C)
out_of_sample_pred_probs_for_A = model.pred_proba(data=A)

# Step 2 -- get oos pred probs for B
model = Model()
model.fit(data=A+C)
out_of_sample_pred_probs_for_B = model.pred_proba(data=B)

# Step 3 -- get oos pred probs for C
model = Model()
model.fit(data=A+B)
out_of_sample_pred_probs_for_C = model.pred_proba(data=C)

# Final step -- combine to get oos pred probs for entire dataset.
out_of_sample_pred_probs = concatenate([
  out_of_sample_pred_probs_for_A,
  out_of_sample_pred_probs_for_B,
  out_of_sample_pred_probs_for_C,
])
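
The steps above can be sketched as runnable Python. This is a minimal sketch, assuming scikit-learn and NumPy are installed; the toy dataset, the choice of `LogisticRegression`, and all variable names are illustrative assumptions rather than anything from the PR. Note that scikit-learn names the method `predict_proba`, whereas the pseudocode writes `pred_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Toy 3-class dataset (an assumption for illustration; substitute your own X and labels).
X, labels = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
num_classes = len(np.unique(labels))

# One row of predicted probabilities per example, filled in fold by fold.
pred_probs = np.zeros((len(labels), num_classes))

# Step 0: split the data into three chunks (3-fold cross-validation).
# Stratified splitting keeps every class present in each training set.
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

for train_idx, holdout_idx in splitter.split(X, labels):
    # Steps 1-3: train a fresh model on two chunks, predict on the held-out chunk.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], labels[train_idx])
    pred_probs[holdout_idx] = model.predict_proba(X[holdout_idx])

# Final step: pred_probs now holds out-of-sample predictions for the entire dataset.
print(pred_probs.shape)
```

Writing each fold's predictions back by index, instead of concatenating them, keeps the rows aligned with the original example order even when the folds are shuffled.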

jwmueller · 2022-04-05T08:40:53Z

@cgnorthcutt What if we add the pseudocode at the bottom of the page and keep the figure at the bottom too?

cgnorthcutt · 2022-04-05T08:45:09Z

I recommend removing the figure entirely because I don't find it to be clear.

A clearer media component would be an animated GIF that shows three sets, circles two of them, predicts on the third, repeats this three times, and then combines the outputs.

The figure looks a bit unprofessional and needs work imo.

anishathalye · 2022-04-05T09:51:43Z

+1 on adding a code sample, this is a tutorial after all. I think the figure is pretty good, and in my opinion, could be improved and then kept in the docs.

Many users of cleanlab will be familiar with the idea of k-fold cross validation, but the traditional use of cross val is to find hyperparameters for a model. Let's link to this or a similar resource, briefly explain that this is the traditional use of cross val that the reader is likely familiar with, and that we're using cross validation for a different purpose, namely computing out-of-sample predicted probabilities for the entire dataset.
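
To make the distinction concrete, here is a minimal sketch contrasting the two uses with scikit-learn's `cross_val_score` and `cross_val_predict`. The iris dataset and `LogisticRegression` are stand-ins chosen for illustration, not what the tutorial page necessarily uses.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

X, labels = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Traditional use: one held-out score per fold, e.g. to compare hyperparameter settings.
fold_scores = cross_val_score(model, X, labels, cv=3)

# Use discussed here: held-out predicted probabilities for every example in the dataset.
pred_probs = cross_val_predict(model, X, labels, cv=3, method="predict_proba")

print(fold_scores.shape)  # one score per fold
print(pred_probs.shape)   # one row per example, one column per class
```

The first call summarizes each fold as a single number, while the second returns one probability vector per example, each produced by a model that never saw that example during training.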

weijinglok · 2022-04-05T10:12:42Z

@jwmueller, I have added links in the Quickstart page.

@cgnorthcutt, I have removed the figure and added the pseudocode at the bottom of the page.

@anishathalye, I have added a hyperlink to the sklearn cross-val page.

jwmueller

Feedback has been addressed. The plan is to revisit the figure in a future PR and see if there's an easy way to improve it.

add pred probs cross val tutorial

ccb3f31

weijinglok added this to the Cleanlab 2.0 milestone Apr 5, 2022

weijinglok requested review from jwmueller and anishathalye April 5, 2022 08:19

jwmueller approved these changes Apr 5, 2022

weijinglok added 3 commits April 5, 2022 16:49

add ref to cross val guide

e1c28a0

replace cross-val image with pseudocodes

a8a57e5

change TOC path

ed3e983

add sklearn cross validation link

900b3e9

weijinglok requested review from jwmueller and cgnorthcutt April 5, 2022 10:12

weijinglok and others added 3 commits April 5, 2022 18:19

Merge branch 'master' into cross-val

ca8368c

formatting of K

0411b8e

shorten text

fa20733

jwmueller approved these changes Apr 5, 2022

jwmueller merged commit 080c7a8 into cleanlab:master Apr 5, 2022

weijinglok deleted the cross-val branch April 5, 2022 19:19
