[SPARK-5224] [PySpark] improve performance of parallelize list/ndarray #4024

davies · 2015-01-13T19:00:52Z

The default batchSize was changed to 0 (batching based on the size of the objects), but parallelize() still uses BatchedSerializer with batchSize=1. This PR makes parallelize() use batchSize=1024 by default.

Also, BatchedSerializer did not work well with list and numpy.ndarray; this PR improves BatchedSerializer by using __len__ and __getslice__ to slice batches directly out of such inputs.

Here is the benchmark for parallelizing 1 million ints from a list or an ndarray:

| | before | after | improvement |
| --- | --- | --- | --- |
| list | 11.7 s | 0.8 s | 14x |
| numpy.ndarray | 32 s | 0.7 s | 40x |
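
For readers who want to see the idea in isolation, here is a minimal sketch of slice-based batching. It is not the code from this PR: `batched` and `dump_batches` are hypothetical names, and the check for sized, sliceable input is a simplification (a dict would pass the check but cannot be sliced). Python 3 has no `__getslice__`, so slicing here goes through `__getitem__`.

```python
import itertools
import pickle


def batched(data, batch_size):
    """Yield batches of up to batch_size items from data (hypothetical helper)."""
    if hasattr(data, "__len__") and hasattr(data, "__getitem__"):
        # Sized, sliceable input such as a list or numpy.ndarray:
        # take each batch with a single slice instead of item-by-item iteration.
        # Assumption: the input supports slicing (a dict would not).
        for start in range(0, len(data), batch_size):
            yield data[start:start + batch_size]
    else:
        # Generic iterable: fall back to pulling batch_size items at a time.
        it = iter(data)
        while True:
            chunk = list(itertools.islice(it, batch_size))
            if not chunk:
                return
            yield chunk


def dump_batches(data, batch_size):
    """Pickle each batch separately; one pickle call per batch (hypothetical helper)."""
    return [pickle.dumps(batch, protocol=2) for batch in batched(data, batch_size)]
```

With a batch size of 1, one million ints would mean one million separate pickle calls; with 1024 the same input yields 977 batches, which is the kind of overhead reduction the batchSize change is after.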

SparkQA · 2015-01-13T19:02:45Z

Test build #25479 has started for PR 4024 at commit 7618c7c.

This patch merges cleanly.

SparkQA · 2015-01-13T19:49:41Z

Test build #25479 has finished for PR 4024 at commit 7618c7c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-13T19:49:45Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25479/

SparkQA · 2015-01-13T19:56:23Z

Test build #564 has started for PR 4024 at commit 7618c7c.

This patch merges cleanly.

SparkQA · 2015-01-13T21:05:24Z

Test build #564 has finished for PR 4024 at commit 7618c7c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- SparkSubmit.printErrorAndExit(s"Cannot load main class from JAR $primaryResource")
- class BinaryClassificationMetrics(

davies · 2015-01-15T00:30:39Z

@JoshRosen ping!

JoshRosen · 2015-01-15T19:40:19Z

LGTM, so I'm going to merge this into master (1.3.0) and branch-1.2 (1.2.1). Thanks!

The default batchSize was changed to 0 (batching based on the size of the objects), but parallelize() still uses BatchedSerializer with batchSize=1. This PR makes parallelize() use batchSize=1024 by default.

Also, BatchedSerializer did not work well with list and numpy.ndarray; this PR improves BatchedSerializer by using __len__ and __getslice__.

Here is the benchmark for parallelizing 1 million ints from a list or an ndarray:

| | before | after | improvement |
| --- | --- | --- | --- |
| list | 11.7 s | 0.8 s | 14x |
| numpy.ndarray | 32 s | 0.7 s | 40x |

Author: Davies Liu <davies@databricks.com>

Closes #4024 from davies/opt_numpy and squashes the following commits:

7618c7c [Davies Liu] improve performance of parallelize list/ndarray

(cherry picked from commit 3c8650c)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

improve performance of parallelize list/ndarray

7618c7c

davies changed the title ~~[SPARK-5224] improve performance of parallelize list/ndarray~~ [SPARK-5224] [PySpark] improve performance of parallelize list/ndarray Jan 13, 2015

asfgit closed this in 3c8650c Jan 15, 2015
