Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample #3010

Closed
wants to merge 2 commits into from

Conversation

mengxr
Copy link
Contributor

@mengxr mengxr commented Oct 30, 2014

The current way of seed distribution makes the random sequences from partition i and i+1 offset by 1.

In [14]: import random

In [15]: r1 = random.Random(10)

In [16]: r1.randint(0, 1)
Out[16]: 1

In [17]: r1.random()
Out[17]: 0.4288890546751146

In [18]: r1.random()
Out[18]: 0.5780913011344704

In [19]: r2 = random.Random(10)

In [20]: r2.randint(0, 1)
Out[20]: 1

In [21]: r2.randint(0, 1)
Out[21]: 0

In [22]: r2.random()
Out[22]: 0.5780913011344704

Note: The new tests are not for this bug fix.

@SparkQA
Copy link

SparkQA commented Oct 30, 2014

Test build #22513 has started for PR 3010 at commit c1bacd9.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 30, 2014

Test build #22513 has finished for PR 3010 at commit c1bacd9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22513/
Test PASSed.

>>> wr_s11 = rdd.sample(True, 0.4, 11).collect()
>>> wr_s21 = rdd.sample(True, 0.4, 21).collect()
>>> set(wr_s11) != set(wr_s21)
True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests should be moved into tests.py, then the docstring will be much clearer.

@SparkQA
Copy link

SparkQA commented Oct 31, 2014

Test build #22637 has started for PR 3010 at commit 869ae4b.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 31, 2014

Test build #22637 has finished for PR 3010 at commit 869ae4b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22637/
Test FAILed.

@JoshRosen
Copy link
Contributor

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Oct 31, 2014

Test build #22648 has started for PR 3010 at commit 869ae4b.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 31, 2014

Test build #22648 has finished for PR 3010 at commit 869ae4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22648/
Test PASSed.

@mengxr
Copy link
Contributor Author

mengxr commented Nov 3, 2014

@JoshRosen Does it look good to you? We should fix it for 1.2 and backport it to 1.0.x and 1.1.x.

@JoshRosen
Copy link
Contributor

This seems fine to me.

@davies
Copy link
Contributor

davies commented Nov 3, 2014

LGTM。

asfgit pushed a commit that referenced this pull request Nov 3, 2014
…d.sample

The current way of seed distribution makes the random sequences from partition i and i+1 offset by 1.

~~~
In [14]: import random

In [15]: r1 = random.Random(10)

In [16]: r1.randint(0, 1)
Out[16]: 1

In [17]: r1.random()
Out[17]: 0.4288890546751146

In [18]: r1.random()
Out[18]: 0.5780913011344704

In [19]: r2 = random.Random(10)

In [20]: r2.randint(0, 1)
Out[20]: 1

In [21]: r2.randint(0, 1)
Out[21]: 0

In [22]: r2.random()
Out[22]: 0.5780913011344704
~~~

Note: The new tests are not for this bug fix.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3010 from mengxr/SPARK-4148 and squashes the following commits:

869ae4b [Xiangrui Meng] move tests tests.py
c1bacd9 [Xiangrui Meng] fix seed distribution and add some tests for rdd.sample

(cherry picked from commit 3cca196)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@asfgit asfgit closed this in 3cca196 Nov 3, 2014
@mengxr
Copy link
Contributor Author

mengxr commented Nov 3, 2014

Thanks! I've merged this into both master and branch-1.2. I need to resolve conflicts for branch-1.1 and branch-1.0. I will send separate PRs later.

asfgit pushed a commit that referenced this pull request Nov 5, 2014
…tests for rdd.sample

Port #3010 to branch-1.1.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3104 from mengxr/SPARK-4148-1.1 and squashes the following commits:

684c002 [Xiangrui Meng] apply SPARK-4148 to branch-1.1
asfgit pushed a commit that referenced this pull request Dec 19, 2014
…tests for rdd.sample

Port #3010 to branch-1.0.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3106 from mengxr/SPARK-4148-1.0 and squashes the following commits:

c834cee [Xiangrui Meng] apply SPARK-4148 to branch-1.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants