Add overview for computing out-of-sample predicted probabilities with cross-validation to doc site #166
Conversation
Codecov Report
@@ Coverage Diff @@
## master #166 +/- ##
==========================================
- Coverage 87.64% 86.96% -0.69%
==========================================
Files 12 12
Lines 1028 1028
Branches 194 194
==========================================
- Hits 901 894 -7
- Misses 104 108 +4
- Partials 23 26 +3
Continue to review full report at Codecov.
I can add links to this from the docstrings where they are appropriate in a future PR.
But you can also help by pointing out anywhere else you think it would be good to link to this page from. Could you add the first link to this page in this PR, though?
I suggest adding a link from the Quickstart ("obtained via cross-validation").
I would delete the figure entirely and just show cross-validation simply, like this:

# Step 0
# Separate your data into three equal-sized chunks (this is called 3-fold cross-validation).
# Data = A B C

# Step 1 -- get oos pred probs for A
model = Model()
model.fit(data=B+C)
out_of_sample_pred_probs_for_A = model.pred_proba(data=A)

# Step 2 -- get oos pred probs for B
model = Model()
model.fit(data=A+C)
out_of_sample_pred_probs_for_B = model.pred_proba(data=B)

# Step 3 -- get oos pred probs for C
model = Model()
model.fit(data=A+B)
out_of_sample_pred_probs_for_C = model.pred_proba(data=C)

# Final step -- combine to get oos pred probs for the entire dataset.
out_of_sample_pred_probs = concatenate([
    out_of_sample_pred_probs_for_A,
    out_of_sample_pred_probs_for_B,
    out_of_sample_pred_probs_for_C,
])
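For reference, the pseudocode above can be made runnable with scikit-learn's `KFold`. This is only a sketch of the same 3-fold procedure; the choice of classifier and the synthetic dataset are illustrative assumptions, not part of the thread:

```python
# Runnable sketch of the 3-fold out-of-sample prediction loop above.
# LogisticRegression and make_classification are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=90, n_features=5, random_state=0)

K = 3  # number of folds (the A/B/C chunks in the pseudocode)
pred_probs = np.zeros((len(X), 2))
for train_idx, holdout_idx in KFold(n_splits=K).split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])  # fit on the other two chunks
    # predicted probabilities for the held-out chunk are out-of-sample
    pred_probs[holdout_idx] = model.predict_proba(X[holdout_idx])

# Every row of pred_probs now comes from a model that never saw that example.
```

Note that scikit-learn's method is `predict_proba`, not `pred_proba` as in the pseudocode.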
@cgnorthcutt What if we add the pseudocode at the bottom of the page and keep the figure at the bottom too?
I recommend removing the figure entirely because I don't find it clear. A clearer media component would be an animated GIF that shows three sets, circles two of them, predicts on the third, repeats this three times, and then combines the outputs. The figure looks a bit unprofessional and needs work, imo.
+1 on adding a code sample; this is a tutorial, after all. I think the figure is pretty good and, in my opinion, could be improved and then kept in the docs. Many users of cleanlab will be familiar with the idea of k-fold cross-validation, but the traditional use of cross-validation is to find hyperparameters for a model. Let's link to this or a similar resource, briefly explain that this is the traditional use of cross-validation the reader is likely familiar with, and note that we're using cross-validation for a different purpose: computing out-of-sample predicted probabilities for the entire dataset.
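To make the contrast concrete, here is a hedged sketch of the two uses of cross-validation side by side in scikit-learn. The model, grid, and synthetic data are illustrative assumptions; the point is only that `GridSearchCV` uses cross-validation to score hyperparameters, while `cross_val_predict` uses it to produce out-of-sample predictions for every example:

```python
# Two uses of cross-validation (illustrative model/data choices).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict

X, y = make_classification(n_samples=60, random_state=0)

# Traditional use: pick hyperparameters by cross-validated score.
search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0]}, cv=3)
search.fit(X, y)

# Use in this tutorial: out-of-sample predicted probabilities
# for the entire dataset, one held-out fold at a time.
pred_probs = cross_val_predict(
    LogisticRegression(), X, y, cv=3, method="predict_proba"
)
```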
@jwmueller, I have added links on the Quickstart page.
@cgnorthcutt, I have removed the figure and added the pseudocode at the bottom of the page.
@anishathalye, I have added a hyperlink to the sklearn cross-validation page.
Feedback's been addressed. The plan is to revisit the figure in a future PR and see if there's an easy way to improve it.
This PR adds a page to the doc site that gives an overview of computing out-of-sample predicted probabilities with cross-validation. A mockup is available here.