
[SPARK-4477] [PySpark] remove numpy from RDDSampler #3351

Closed
wants to merge 16 commits into apache:master from davies:numpy

Conversation

davies (Contributor) commented Nov 19, 2014

RDDSampler tries to use numpy to get better performance for poisson(), but the number of calls to random() is only (1 + fraction) * N in the pure Python implementation of poisson(), so there is not much performance gain from numpy.

numpy is not a dependency of pyspark, so it may introduce problems, such as numpy being installed on the master but not on the slaves, as reported in SPARK-927.

It also complicates the code a lot, so we should remove numpy from RDDSampler.
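For context, a Poisson(mean = fraction) draw can be generated with Knuth's algorithm using only Python's random module, and it needs on average fraction + 1 calls to random(), which is where the (1 + fraction) * N figure comes from. A minimal sketch, illustrative only and not the exact PySpark code:

```
import math
import random

def poisson_sample(mean, rng=random):
    # Knuth's algorithm: multiply uniform draws until the product drops
    # below exp(-mean); the number of multiplications is Poisson(mean).
    # Expected number of rng.random() calls per draw is mean + 1.
    threshold = math.exp(-mean)
    k = 0
    p = rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

# sample(True, 0.9): each element is emitted poisson_sample(0.9) times,
# costing about 1.9 random() calls per element on average.
```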

I also did some benchmarking to verify this:

>>> from pyspark.mllib.random import RandomRDDs
>>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
>>> rdd.count()  # cache it
>>> rdd.sample(True, 0.9).count()    # measure this line

the results:

| withReplacement | random | numpy.random |
| --------------- | ------ | ------------ |
| True            | 1.5 s  | 1.4 s        |
| False           | 0.6 s  | 0.8 s        |

closes #2313

Note: this patch includes some commits that are not yet mirrored to GitHub; it will be OK after the mirror catches up.

SparkQA commented Nov 19, 2014

Test build #23574 has started for PR 3351 at commit 13f7b05.

  • This patch merges cleanly.

SparkQA commented Nov 19, 2014

Test build #23574 has finished for PR 3351 at commit 13f7b05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LinearBinaryClassificationModel(LinearModel):
    • class LogisticRegressionModel(LinearBinaryClassificationModel):
    • class LogisticRegressionWithLBFGS(object):
    • class SVMModel(LinearBinaryClassificationModel):
    • class Rating(namedtuple("Rating", ["user", "product", "rating"])):
    • class RDDRangeSampler(RDDSamplerBase):
    • class SizeLimitedStream(object):
    • class CompressedStream(object):
    • class LargeObjectSerializer(Serializer):
    • class CompressedSerializer(Serializer):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23574/
Test PASSed.

mengxr (Contributor) commented Nov 19, 2014

This issue was discussed in #2313 and #3193. I support this change because it simplifies the implementation and eliminates the concerns raised in #2313 and the bug fixed in #2889. Though I haven't tested Python's random, I'm not worried about its quality: both Python and numpy implement MT19937. The only downside I can see is a performance regression during sampling, but sampling is usually not the bottleneck of a job.

@mattf @freeman-lab @JoshRosen

            for _ in range(0, count):
                yield key, val
        else:
            for key, val in iterator:
                if self.getUniformSample(split) <= self._fractions[key]:
davies (Contributor, Author) commented:

Using equality for floats does not make sense.

SparkQA commented Nov 19, 2014

Test build #23616 has started for PR 3351 at commit ee17d78.

  • This patch merges cleanly.

SparkQA commented Nov 19, 2014

Test build #23616 has finished for PR 3351 at commit ee17d78.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23616/
Test FAILed.

mengxr (Contributor) commented Nov 20, 2014

test this please

SparkQA commented Nov 20, 2014

Test build #23680 has started for PR 3351 at commit ee17d78.

  • This patch merges cleanly.

mengxr (Contributor) commented Nov 20, 2014

@davies I sent you a PR with a faster version of the poisson generator: davies#1. Could you test the performance and update the results? Thanks!
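One way a pure-Python Poisson generator can be made faster is to compute the expensive exp(-mean) once per partition instead of once per element, since the sampling fraction is fixed. A hypothetical sketch of that idea (the actual patch in davies#1 may differ):

```
import math
import random

class CachedPoissonSampler(object):
    # Hypothetical helper: exp(-mean) is computed once in the constructor
    # and reused for every element of the partition.
    def __init__(self, mean, seed=42):
        self._threshold = math.exp(-mean)
        self._random = random.Random(seed)

    def sample(self):
        # Knuth's algorithm, reusing the precomputed threshold.
        k = 0
        p = self._random.random()
        while p > self._threshold:
            k += 1
            p *= self._random.random()
        return k

# sampler = CachedPoissonSampler(0.1)
# counts = [sampler.sample() for _ in range(1000)]
```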

SparkQA commented Nov 20, 2014

Test build #23680 has finished for PR 3351 at commit ee17d78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23680/
Test PASSed.

make poisson sampling slightly faster
SparkQA commented Nov 20, 2014

Test build #23685 has started for PR 3351 at commit c5b9252.

  • This patch merges cleanly.

davies (Contributor, Author) commented Nov 20, 2014

micro benchmark for poisson (with fraction=0.1):

old one:

$ python -m timeit -s "from pyspark.rddsampler import RDDSamplerBase; b = RDDSamplerBase(True, 42); b.initRandomGenerator(0)" "b.getPoissonSample(0.1)"
1000000 loops, best of 3: 0.952 usec per loop

new one:

$ python -m timeit -s "from pyspark.rddsampler import RDDSamplerBase; b = RDDSamplerBase(True, 42); b.initRandomGenerator(0)" "b.getPoissonSample(0.1)"
1000000 loops, best of 3: 0.56 usec per loop

So the new one is roughly 1.7x faster (0.952 usec vs 0.56 usec per call).

SparkQA commented Nov 20, 2014

Test build #23686 has started for PR 3351 at commit 5c438d7.

  • This patch merges cleanly.

davies (Contributor, Author) commented Nov 20, 2014

@mengxr I have updated the test results; it is now about as fast as before (only a small difference).

mengxr (Contributor) commented Nov 20, 2014

LGTM. Waiting for Jenkins ...

SparkQA commented Nov 20, 2014

Test build #23685 has finished for PR 3351 at commit c5b9252.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23685/
Test PASSed.

SparkQA commented Nov 20, 2014

Test build #23686 has finished for PR 3351 at commit 5c438d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23686/
Test PASSed.

mengxr (Contributor) commented Nov 21, 2014

Merged into master and branch-1.2. Thanks!

andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Nov 21, 2014

Author: Davies Liu <davies@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes apache#3351 from davies/numpy and squashes the following commits:

5c438d7 [Davies Liu] fix comment
c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477
98eb31b [Xiangrui Meng] make poisson sampling slightly faster
ee17d78 [Davies Liu] remove = for float
13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy
f583023 [Davies Liu] fix tests
51649f5 [Davies Liu] remove numpy in RDDSampler
78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()

(cherry picked from commit d39f2e9)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
davies closed this Nov 21, 2014