
[SPARK-4477] [PySpark] remove numpy from RDDSampler #3351

Closed
wants to merge 16 commits into apache:master from davies:numpy

Conversation

davies (Contributor) commented Nov 19, 2014

RDDSampler tries to use numpy to get better performance for poisson(), but the number of calls to random() is only (1 + fraction) * N in the pure Python implementation of poisson(), so there is not much performance gain from numpy.

numpy is not a dependency of pyspark, so it may introduce problems, such as numpy being installed on the master but not on the slaves, as reported in SPARK-927.

It also complicates the code a lot, so we should remove numpy from RDDSampler.
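For context, a Poisson(mean = fraction) draw can be generated with Knuth's algorithm using only Python's random module, and it needs on average fraction + 1 calls to random(), which is where the (1 + fraction) * N figure comes from. A minimal sketch, illustrative only and not the exact PySpark code:

```
import math
import random

def poisson_sample(mean, rng=random):
    # Knuth's algorithm: multiply uniform draws until the product drops
    # below exp(-mean); the number of multiplications is Poisson(mean).
    # Expected number of rng.random() calls per draw is mean + 1.
    threshold = math.exp(-mean)
    k = 0
    p = rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

# sample(True, 0.9): each element is emitted poisson_sample(0.9) times,
# costing about 1.9 random() calls per element on average.
```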

I also did some benchmarking to verify this:

>>> from pyspark.mllib.random import RandomRDDs
>>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
>>> rdd.count()  # cache it
>>> rdd.sample(True, 0.9).count()    # measure this line

the results:

| withReplacement | random | numpy.random |
| --------------- | ------ | ------------ |
| True            | 1.5 s  | 1.4 s        |
| False           | 0.6 s  | 0.8 s        |

closes #2313

Note: this patch includes some commits that are not yet mirrored to GitHub; it will be OK after the mirror catches up.

SparkQA commented Nov 19, 2014

Test build #23574 has started for PR 3351 at commit 13f7b05.

  • This patch merges cleanly.

SparkQA commented Nov 19, 2014

Test build #23574 has finished for PR 3351 at commit 13f7b05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LinearBinaryClassificationModel(LinearModel):
    • class LogisticRegressionModel(LinearBinaryClassificationModel):
    • class LogisticRegressionWithLBFGS(object):
    • class SVMModel(LinearBinaryClassificationModel):
    • class Rating(namedtuple("Rating", ["user", "product", "rating"])):
    • class RDDRangeSampler(RDDSamplerBase):
    • class SizeLimitedStream(object):
    • class CompressedStream(object):
    • class LargeObjectSerializer(Serializer):
    • class CompressedSerializer(Serializer):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23574/
Test PASSed.

mengxr (Contributor) commented Nov 19, 2014

This issue was discussed in #2313 and #3193. I support this change because it simplifies the implementation and eliminates the concerns raised in #2313 and the bug fixed in #2889. Though I haven't tested Python's random, I'm not worried about its quality: both Python and numpy implement MT19937. The only downside I can see is a performance regression during sampling, but sampling is usually not the bottleneck of a job.

@mattf @freeman-lab @JoshRosen

            for _ in range(0, count):
                yield key, val
        else:
            for key, val in iterator:
                if self.getUniformSample(split) <= self._fractions[key]:
davies (Contributor, Author) commented:

Using equality for floats does not make sense.

SparkQA commented Nov 19, 2014

Test build #23616 has started for PR 3351 at commit ee17d78.

  • This patch merges cleanly.

SparkQA commented Nov 19, 2014

Test build #23616 has finished for PR 3351 at commit ee17d78.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23616/
Test FAILed.

mengxr (Contributor) commented Nov 20, 2014

test this please

SparkQA commented Nov 20, 2014

Test build #23680 has started for PR 3351 at commit ee17d78.

  • This patch merges cleanly.

mengxr (Contributor) commented Nov 20, 2014

@davies I sent you a PR with a faster version of the poisson generator: davies#1. Could you test the performance and update the results? Thanks!
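One way a pure-Python Poisson generator can be made faster is to compute the expensive exp(-mean) once per partition instead of once per element, since the sampling fraction is fixed. A hypothetical sketch of that idea (the actual patch in davies#1 may differ):

```
import math
import random

class CachedPoissonSampler(object):
    # Hypothetical helper: exp(-mean) is computed once in the constructor
    # and reused for every element of the partition.
    def __init__(self, mean, seed=42):
        self._threshold = math.exp(-mean)
        self._random = random.Random(seed)

    def sample(self):
        # Knuth's algorithm, reusing the precomputed threshold.
        k = 0
        p = self._random.random()
        while p > self._threshold:
            k += 1
            p *= self._random.random()
        return k

# sampler = CachedPoissonSampler(0.1)
# counts = [sampler.sample() for _ in range(1000)]
```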

SparkQA commented Nov 20, 2014

Test build #23680 has finished for PR 3351 at commit ee17d78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23680/
Test PASSed.

make poisson sampling slightly faster
SparkQA commented Nov 20, 2014

Test build #23685 has started for PR 3351 at commit c5b9252.

  • This patch merges cleanly.

davies (Contributor, Author) commented Nov 20, 2014

micro benchmark for poisson (with fraction=0.1):

old one:

$ python -m timeit -s "from pyspark.rddsampler import RDDSamplerBase; b = RDDSamplerBase(True, 42); b.initRandomGenerator(0)" "b.getPoissonSample(0.1)"
1000000 loops, best of 3: 0.952 usec per loop

new one:

$ python -m timeit -s "from pyspark.rddsampler import RDDSamplerBase; b = RDDSamplerBase(True, 42); b.initRandomGenerator(0)" "b.getPoissonSample(0.1)"
1000000 loops, best of 3: 0.56 usec per loop

So the new one is roughly 1.7x faster (0.952 usec vs 0.56 usec per call).

SparkQA commented Nov 20, 2014

Test build #23686 has started for PR 3351 at commit 5c438d7.

  • This patch merges cleanly.

davies (Contributor, Author) commented Nov 20, 2014

@mengxr I have updated the test results; it is now about as fast as before (only a small difference).

mengxr (Contributor) commented Nov 20, 2014

LGTM. Waiting for Jenkins ...

SparkQA commented Nov 20, 2014

Test build #23685 has finished for PR 3351 at commit c5b9252.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23685/
Test PASSed.

SparkQA commented Nov 20, 2014

Test build #23686 has finished for PR 3351 at commit 5c438d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23686/
Test PASSed.

mengxr (Contributor) commented Nov 21, 2014

Merged into master and branch-1.2. Thanks!

andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Nov 21, 2014

Author: Davies Liu <davies@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes apache#3351 from davies/numpy and squashes the following commits:

5c438d7 [Davies Liu] fix comment
c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477
98eb31b [Xiangrui Meng] make poisson sampling slightly faster
ee17d78 [Davies Liu] remove = for float
13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy
f583023 [Davies Liu] fix tests
51649f5 [Davies Liu] remove numpy in RDDSampler
78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()

(cherry picked from commit d39f2e9)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
davies closed this Nov 21, 2014