Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils #16129

Closed
wants to merge 2 commits into from

Conversation

srowen
Copy link
Member

@srowen srowen commented Dec 3, 2016

What changes were proposed in this pull request?

Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after l element instead of k/l, which matters for small k.

How was this patch tested?

Existing test plus new test case.

@SparkQA
Copy link

SparkQA commented Dec 3, 2016

Test build #69618 has finished for PR 16129 at commit 8ac5dee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 3, 2016

Test build #3466 has finished for PR 16129 at commit 8ac5dee.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 3, 2016

Test build #3467 has finished for PR 16129 at commit 8ac5dee.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member Author

srowen commented Dec 3, 2016

@felixcheung maybe you can advise me on this. I think this is a correct fix, but ends up changing the results of decision forests a little bit. The SparkR test you wrote fails:

Failed -------------------------------------------------------------------------
1. Failure: spark.randomForest (@test_mllib.R#937) -----------------------------
predictions$prediction not equal to c(...).
16/16 mismatches (average diff: 0.108)
[1] 60.3 - 60.4 == -0.0508
[2] 61.2 - 61.1 ==  0.1272
[3] 60.7 - 60.6 ==  0.0543
[4] 62.1 - 62.3 == -0.1473
[5] 63.5 - 63.7 == -0.2044
[6] 64.1 - 64.3 == -0.2413
[7] 65.1 - 64.9 ==  0.2591
[8] 64.3 - 64.3 ==  0.0045
[9] 66.7 - 66.7 ==  0.0001
...

Of course I can just paste in the new values, as I expect a small change in the result, but wanted to sense-check it. The new answers are closer to the answers in the nearly-identical case above with 1 tree, which seems a little positive.

@felixcheung
Copy link
Member

just paste in the new values

this seems like the reasonable approach. your intuition and explanation make sense to me.
thanks @srowen

@SparkQA
Copy link

SparkQA commented Dec 6, 2016

Test build #69709 has finished for PR 16129 at commit b4a197a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Copy link
Contributor

sethah commented Dec 7, 2016

This LGTM. Now that I'm looking at it, the test suite never actually tests for correctness, just basic input/output sizes. We really should have better tests, but it's ok with me if it's done in a separate JIRA.

Also, I'd be in favor of changing the title since, while it does affect RandomForest/ML, it's really an error in the SamplingUtils, and this method is used in at least one other place (RangePartitioner).

@srowen srowen changed the title [SPARK-18678][ML] Skewed feature subsampling in Random forest [SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils Dec 7, 2016
@srowen
Copy link
Member Author

srowen commented Dec 7, 2016

I changed the title. The PR does add a correctness test, at least one that addresses the case being fixed here.

@srowen
Copy link
Member Author

srowen commented Dec 7, 2016

Merged to master/2.1

asfgit pushed a commit that referenced this pull request Dec 7, 2016
## What changes were proposed in this pull request?

Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after l element instead of k/l, which matters for small k.

## How was this patch tested?

Existing test plus new test case.

Author: Sean Owen <sowen@cloudera.com>

Closes #16129 from srowen/SPARK-18678.

(cherry picked from commit 79f5f28)
Signed-off-by: Sean Owen <sowen@cloudera.com>
@asfgit asfgit closed this in 79f5f28 Dec 7, 2016
@srowen srowen deleted the SPARK-18678 branch December 10, 2016 20:29
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
## What changes were proposed in this pull request?

Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after l element instead of k/l, which matters for small k.

## How was this patch tested?

Existing test plus new test case.

Author: Sean Owen <sowen@cloudera.com>

Closes apache#16129 from srowen/SPARK-18678.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after l element instead of k/l, which matters for small k.

## How was this patch tested?

Existing test plus new test case.

Author: Sean Owen <sowen@cloudera.com>

Closes apache#16129 from srowen/SPARK-18678.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants