[SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils #16129
Test build #69618 has finished for PR 16129 at commit
Test build #3466 has finished for PR 16129 at commit
Test build #3467 has finished for PR 16129 at commit
@felixcheung maybe you can advise me on this. I think this is a correct fix, but ends up changing the results of decision forests a little bit. The SparkR test you wrote fails:
Of course I can just paste in the new values, as I expect a small change in the result, but I wanted to sense-check it. The new answers are closer to the answers in the nearly identical case above with 1 tree, which seems a little positive.
This seems like the reasonable approach. Your intuition and explanation make sense to me.
Test build #69709 has finished for PR 16129 at commit
This LGTM. Now that I'm looking at it, the test suite never actually tests for correctness, just basic input/output sizes. We really should have better tests, but it's ok with me if it's done in a separate JIRA. Also, I'd be in favor of changing the title since, while it does affect RandomForest/ML, it's really an error in SamplingUtils, and this method is used in at least one other place (RangePartitioner).
I changed the title. The PR does add a correctness test, at least one that addresses the case being fixed here.
Merged to master/2.1
## What changes were proposed in this pull request?

Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high: k/(l-1) after l elements instead of k/l. This matters for small k.

## How was this patch tested?

Existing test plus new test case.

Author: Sean Owen <sowen@cloudera.com>

Closes #16129 from srowen/SPARK-18678.

(cherry picked from commit 79f5f28)
Signed-off-by: Sean Owen <sowen@cloudera.com>
What changes were proposed in this pull request?
Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high: k/(l-1) after l elements instead of k/l. This matters for small k.
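To make the size of the bias concrete, here is a minimal Python sketch (not the Spark code itself) that computes exact inclusion probabilities for a textbook reservoir sampler. It assumes the first k elements fill the reservoir and that an accepted element overwrites a uniformly chosen slot. The function name `inclusion_probabilities` and the `off_by_one` flag are hypothetical, introduced only for illustration.

```python
from fractions import Fraction


def inclusion_probabilities(k, n, off_by_one=False):
    """Exact probability that each of n stream items ends up in a size-k reservoir.

    Assumed model (not taken from the Spark source): items 1..k fill the
    reservoir, and the l-th item (l > k) is accepted with probability k/l,
    or k/(l-1) when off_by_one=True, replacing a uniformly chosen slot.
    """
    probs = []
    for j in range(1, n + 1):
        if j <= k:
            # The first k items always enter the reservoir.
            p = Fraction(1)
        else:
            p = Fraction(k, j - 1 if off_by_one else j)
        # Item j must then survive every later item l: it is evicted only if
        # l is accepted AND lands on j's slot (probability 1/k of the slots).
        for l in range(max(j, k) + 1, n + 1):
            accept = Fraction(k, l - 1 if off_by_one else l)
            p *= 1 - accept / k
        probs.append(p)
    return probs
```

Under these assumptions, with k = 1 and four elements the correct rule gives each element probability 1/4, while the off-by-one rule gives 0, 1/3, 1/3, 1/3: the first element is always replaced by the second. For larger k the early elements are still under-represented, but the gap shrinks.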
How was this patch tested?
Existing test plus new test case.
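The test case added by the PR is not reproduced on this page. As a rough illustration of what a correctness check for small k can look like, the Python sketch below draws many size-k samples with a seeded generator and counts how often each element is selected. The names `reservoir_sample` and `selection_counts`, the seed, and the trial count are all assumptions made for this example, not Spark's API or test.

```python
import random
from collections import Counter


def reservoir_sample(items, k, rng):
    """Textbook reservoir sampling: the l-th item is accepted with probability k/l."""
    reservoir = []
    for seen, item in enumerate(items, start=1):
        if len(reservoir) < k:
            # Fill phase: the first k items go straight in.
            reservoir.append(item)
        else:
            # 'seen' already counts the current item, so idx < k with probability k/seen.
            idx = int(rng.random() * seen)
            if idx < k:
                reservoir[idx] = item
    return reservoir


def selection_counts(n, k, trials, seed):
    """Count how often each of range(n) is selected across repeated samples."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        counts.update(reservoir_sample(range(n), k, rng))
    return counts
```

A check along these lines would assert that every count is close to `trials * k / n`. With k = 1 an off-by-one sampler fails it clearly, because the first element is selected far less often than the others.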