Add overview for computing out-of-sample predicted probabilities with cross-validation to doc site #166

Merged 8 commits into cleanlab:master on Apr 5, 2022

Conversation

weijinglok
Contributor

This PR adds a page to the doc site that gives an overview of computing out-of-sample predicted probabilities with cross-validation. A mockup is available here.

@weijinglok weijinglok added this to the Cleanlab 2.0 milestone Apr 5, 2022
@codecov

codecov bot commented Apr 5, 2022

Codecov Report

Merging #166 (fa20733) into master (5b6d297) will decrease coverage by 0.68%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #166      +/-   ##
==========================================
- Coverage   87.64%   86.96%   -0.69%     
==========================================
  Files          12       12              
  Lines        1028     1028              
  Branches      194      194              
==========================================
- Hits          901      894       -7     
- Misses        104      108       +4     
- Partials       23       26       +3     
Impacted Files Coverage Δ
cleanlab/filter.py 87.42% <0.00%> (-4.41%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@jwmueller jwmueller left a comment

I can add links to this page from the appropriate docstrings in a future PR.
You can also point out any other places where it would be good to link to this page. Could you add the first link to this page in this PR, though?

I suggest adding a link from Quickstart ("obtained via cross-validation")

@cgnorthcutt
Member

cgnorthcutt commented Apr 5, 2022

I would delete the figure entirely and just show cross validation simply like this:

# Step 0
# Separate your data into three equal-sized chunks (this is called 3-fold cross-validation)
# Data = A B C

# Step 1 -- get oos pred probs for A
model = Model()
model.fit(data=B+C)
out_of_sample_pred_probs_for_A = model.predict_proba(data=A)

# Step 2 -- get oos pred probs for B
model = Model()
model.fit(data=A+C)
out_of_sample_pred_probs_for_B = model.predict_proba(data=B)

# Step 3 -- get oos pred probs for C
model = Model()
model.fit(data=A+B)
out_of_sample_pred_probs_for_C = model.predict_proba(data=C)

# Final step -- combine (in the original A, B, C order) to get oos pred probs for entire dataset.
out_of_sample_pred_probs = concatenate([
  out_of_sample_pred_probs_for_A,
  out_of_sample_pred_probs_for_B,
  out_of_sample_pred_probs_for_C,
])
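
For reference, the same three steps can be written as a runnable loop. This is only a minimal sketch under stated assumptions: it uses scikit-learn, a synthetic dataset from `make_classification`, and `LogisticRegression` as a stand-in for `Model`. None of these choices come from this PR. Instead of concatenating per-chunk outputs, it writes each held-out prediction back to its original row, so the result stays aligned with the dataset order.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption; any feature matrix and integer labels work).
X, labels = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
num_classes = len(np.unique(labels))

# Each row is filled exactly once, by the model that did not train on that row.
pred_probs = np.zeros((len(labels), num_classes))

# Stratified folds keep every class present in each training split.
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, holdout_idx in kfold.split(X, labels):
    model = LogisticRegression(max_iter=1000)  # fresh model for every fold
    model.fit(X[train_idx], labels[train_idx])
    pred_probs[holdout_idx] = model.predict_proba(X[holdout_idx])
```

Indexing by `holdout_idx` matters when the folds are shuffled: plain concatenation would return the rows in fold order rather than dataset order.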

@jwmueller
Member

@cgnorthcutt What if we add the pseudocode at the bottom of the page and keep the figure at the bottom too?

@cgnorthcutt
Member

cgnorthcutt commented Apr 5, 2022

I recommend removing the figure entirely because I don't find it clear.

A clearer visual would be an animated GIF that shows three sets, circles two of them, predicts on the third, repeats this three times, and then combines the outputs.

The figure looks a bit unprofessional and needs work, in my opinion.

@anishathalye
Member

+1 on adding a code sample, this is a tutorial after all. I think the figure is pretty good, and in my opinion, could be improved and then kept in the docs.

Many users of cleanlab will be familiar with the idea of k-fold cross validation, but the traditional use of cross val is to find hyperparameters for a model. Let's link to this or a similar resource, briefly explain that this is the traditional use of cross val that the reader is likely familiar with, and that we're using cross validation for a different purpose, namely computing out-of-sample predicted probabilities for the entire dataset.
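
To make the distinction concrete, here is a small sketch of the two uses side by side. It assumes scikit-learn, a synthetic dataset, and `LogisticRegression` as an arbitrary classifier; these are illustrative choices, not what the docs page prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic binary classification data (assumption, for illustration only).
X, labels = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
model = LogisticRegression(max_iter=1000)

# Traditional use: one accuracy score per fold, typically compared across
# candidate models or hyperparameter settings.
fold_scores = cross_val_score(model, X, labels, cv=5)

# Use discussed here: one row of predicted probabilities per example, each
# produced by a model that never saw that example during training.
pred_probs = cross_val_predict(model, X, labels, cv=5, method="predict_proba")
```

The first call summarizes how well a model generalizes; the second returns a per-example output for the whole dataset, which is the quantity this page is about.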

@weijinglok
Contributor Author

I can add links to this page from the appropriate docstrings in a future PR. You can also point out any other places where it would be good to link to this page. Could you add the first link to this page in this PR, though?

I suggest adding a link from Quickstart ("obtained via cross-validation")

@jwmueller, I have added links on the Quickstart page.

I would delete the figure entirely and just show cross validation simply like this:

# Step 0
# Separate your data into three equal-sized chunks (this is called 3-fold cross-validation)
# Data = A B C

# Step 1 -- get oos pred probs for A
model = Model()
model.fit(data=B+C)
out_of_sample_pred_probs_for_A = model.predict_proba(data=A)

# Step 2 -- get oos pred probs for B
model = Model()
model.fit(data=A+C)
out_of_sample_pred_probs_for_B = model.predict_proba(data=B)

# Step 3 -- get oos pred probs for C
model = Model()
model.fit(data=A+B)
out_of_sample_pred_probs_for_C = model.predict_proba(data=C)

# Final step -- combine (in the original A, B, C order) to get oos pred probs for entire dataset.
out_of_sample_pred_probs = concatenate([
  out_of_sample_pred_probs_for_A,
  out_of_sample_pred_probs_for_B,
  out_of_sample_pred_probs_for_C,
])

@cgnorthcutt, I have removed the figure and added the pseudocode at the bottom of the page.

+1 on adding a code sample, this is a tutorial after all. I think the figure is pretty good, and in my opinion, could be improved and then kept in the docs.

Many users of cleanlab will be familiar with the idea of k-fold cross validation, but the traditional use of cross val is to find hyperparameters for a model. Let's link to this or a similar resource, briefly explain that this is the traditional use of cross val that the reader is likely familiar with, and that we're using cross validation for a different purpose, namely computing out-of-sample predicted probabilities for the entire dataset.

@anishathalye, I have added a hyperlink to the sklearn cross-validation page.

Member

@jwmueller jwmueller left a comment

Feedback has been addressed. The plan is to revisit the figure in a future PR and see if there's an easy way to improve it.

@jwmueller jwmueller merged commit 080c7a8 into cleanlab:master Apr 5, 2022
@weijinglok weijinglok deleted the cross-val branch April 5, 2022 19:19