[SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils #16129

srowen · 2016-12-03T09:41:10Z

What changes were proposed in this pull request?

Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high: k/(l-1) after l elements instead of k/l. The difference matters most for small k.
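To make the off-by-one concrete, here is a minimal Python sketch of reservoir sampling (Algorithm R) with a switch that reproduces the bias. It is an illustration only, not the Scala code in SamplingUtils; the names `reservoir_sample`, `off_by_one`, and `share_keeping_first` are hypothetical.

```python
import random
from itertools import islice


def reservoir_sample(items, k, rng, off_by_one=False):
    """Return up to k items drawn from `items` using Algorithm R.

    With off_by_one=True the element counter is advanced only after the
    replacement draw, which reproduces the k/(l-1) bias described above.
    """
    it = iter(items)
    reservoir = list(islice(it, k))   # fill the reservoir with the first k items
    seen = len(reservoir)             # number of items consumed so far
    for item in it:
        if not off_by_one:
            seen += 1                 # correct: count the current item first
        j = int(rng.random() * seen)  # uniform index in [0, seen)
        if j < k:
            reservoir[j] = item       # replace with probability k / seen
        if off_by_one:
            seen += 1                 # biased: the counter lags by one
    return reservoir


def share_keeping_first(off_by_one, trials=10_000, seed=42):
    """Fraction of trials in which item 0 survives when sampling k=1 from [0, 1]."""
    rng = random.Random(seed)
    kept = sum(
        reservoir_sample([0, 1], 1, rng, off_by_one)[0] == 0
        for _ in range(trials)
    )
    return kept / trials
```

With k=1 and a two-element input, the lagging counter makes the second element replace the first every time, so `share_keeping_first(off_by_one=True)` is 0.0, while the corrected counter should give a value close to 0.5.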

How was this patch tested?

Existing test plus new test case.

SparkQA · 2016-12-03T12:00:49Z

Test build #69618 has finished for PR 16129 at commit 8ac5dee.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-03T15:25:52Z

Test build #3466 has finished for PR 16129 at commit 8ac5dee.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-03T19:26:51Z

Test build #3467 has finished for PR 16129 at commit 8ac5dee.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-12-03T19:36:18Z

@felixcheung maybe you can advise me on this. I think this is a correct fix, but it ends up changing the results of decision forests a little bit. The SparkR test you wrote fails:

Failed -------------------------------------------------------------------------
1. Failure: spark.randomForest (@test_mllib.R#937) -----------------------------
predictions$prediction not equal to c(...).
16/16 mismatches (average diff: 0.108)
[1] 60.3 - 60.4 == -0.0508
[2] 61.2 - 61.1 ==  0.1272
[3] 60.7 - 60.6 ==  0.0543
[4] 62.1 - 62.3 == -0.1473
[5] 63.5 - 63.7 == -0.2044
[6] 64.1 - 64.3 == -0.2413
[7] 65.1 - 64.9 ==  0.2591
[8] 64.3 - 64.3 ==  0.0045
[9] 66.7 - 66.7 ==  0.0001
...

Of course I can just paste in the new values, as I expect a small change in the result, but wanted to sense-check it. The new answers are closer to the answers in the nearly-identical case above with 1 tree, which seems a little positive.

felixcheung · 2016-12-04T19:59:56Z

> just paste in the new values

This seems like the reasonable approach. Your intuition and explanation make sense to me.
Thanks @srowen

SparkQA · 2016-12-06T05:58:53Z

Test build #69709 has finished for PR 16129 at commit b4a197a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-12-07T04:52:57Z

This LGTM. Now that I'm looking at it, the test suite never actually tests for correctness, just basic input/output sizes. We really should have better tests, but it's ok with me if it's done in a separate JIRA.

Also, I'd be in favor of changing the title since, while it does affect RandomForest/ML, it's really an error in the SamplingUtils, and this method is used in at least one other place (RangePartitioner).

srowen · 2016-12-07T08:49:06Z

I changed the title. The PR does add a correctness test, at least one that addresses the case being fixed here.
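As a rough illustration of what a statistical correctness check for a sampler can look like (this is a sketch, not the test added in this PR), the Python snippet below estimates how often each input element ends up in the sample and compares that against the expected k/n. The helper names `inclusion_frequencies` and `looks_uniform`, the trial counts, and the tolerances are all assumptions chosen for the example.

```python
import random
from itertools import islice


def reservoir_sample(items, k, rng):
    """Plain Algorithm R: each of the n inputs is kept with probability k/n."""
    it = iter(items)
    reservoir = list(islice(it, k))
    seen = len(reservoir)
    for item in it:
        seen += 1                 # count the current item before drawing
        j = rng.randrange(seen)   # uniform integer in [0, seen)
        if j < k:
            reservoir[j] = item
    return reservoir


def inclusion_frequencies(sampler, n, k, trials, seed=0):
    """Empirical probability that each of 0..n-1 appears in a size-k sample."""
    rng = random.Random(seed)
    counts = [0] * n
    for _ in range(trials):
        for x in sampler(range(n), k, rng):
            counts[x] += 1
    return [c / trials for c in counts]


def looks_uniform(freqs, expected, tol):
    """True if every empirical frequency is within tol of the expected value."""
    return all(abs(f - expected) <= tol for f in freqs)
```

A check such as `looks_uniform(inclusion_frequencies(reservoir_sample, n=2, k=1, trials=10_000), 0.5, 0.05)` targets the tiny-input case where the bias is largest; the tolerance should be wide enough that a correct sampler fails only with negligible probability.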

srowen · 2016-12-07T09:35:06Z

Merged to master/2.1

## What changes were proposed in this pull request?

Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high: k/(l-1) after l elements instead of k/l. The difference matters most for small k.

## How was this patch tested?

Existing test plus new test case.

Author: Sean Owen <sowen@cloudera.com>

Closes #16129 from srowen/SPARK-18678.

(cherry picked from commit 79f5f28)
Signed-off-by: Sean Owen <sowen@cloudera.com>

Commits:

8ac5dee  Fix reservoir sampling bias for small k
b4a197a  Update sparkr random forest test to reflect slightly different sampling

srowen changed the title ~~[SPARK-18678][ML] Skewed feature subsampling in Random forest~~ [SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils Dec 7, 2016

sethah approved these changes Dec 7, 2016

asfgit closed this in 79f5f28 Dec 7, 2016

srowen deleted the SPARK-18678 branch December 10, 2016 20:29
