[SPARK-5224] [PySpark] improve performance of parallelize list/ndarray #4024

Closed

davies commented Jan 13, 2015

After the default batchSize was changed to 0 (batches sized according to the size of the objects), parallelize() still used BatchedSerializer with batchSize=1. This PR makes parallelize() use batchSize=1024 by default.
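
To illustrate why a batch size of 1 is costly, here is a minimal sketch that uses plain pickle rather than PySpark's own serializers. The helper name `serialized_chunks` and the 10,000-element input are assumptions made for this example and are not part of the patch.

```python
import pickle


def serialized_chunks(items, batch_size):
    """Pickle `items` in groups of `batch_size`; return one bytes object per group.

    Hypothetical helper for illustration only, not PySpark's BatchedSerializer.
    """
    return [
        pickle.dumps(items[i:i + batch_size], protocol=2)
        for i in range(0, len(items), batch_size)
    ]


data = list(range(10000))
one_by_one = serialized_chunks(data, 1)    # analogous to batchSize=1
batched = serialized_chunks(data, 1024)    # analogous to batchSize=1024

# Number of separate pickle payloads: 10000 versus 10.
print(len(one_by_one), len(batched))
# Total serialized bytes; the per-payload overhead is paid far less often when batching.
print(sum(map(len, one_by_one)), sum(map(len, batched)))
```

Each payload carries its own fixed pickle overhead and requires its own serializer call, so fewer and larger batches reduce both.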

Also, BatchedSerializer did not work well with list and numpy.ndarray; this PR improves it by using __len__ and __getslice__ to slice batches out of the input directly.
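
The slicing idea can be sketched roughly as follows. This is not the code from the patch: the function name `batch_slices` is hypothetical, and because __getslice__ exists only in Python 2, the sketch checks for __getitem__, which is what slicing uses in Python 3.

```python
import itertools


def batch_slices(items, batch_size):
    """Yield chunks of at most `batch_size` items.

    Hypothetical sketch: sized, sliceable inputs (list, numpy.ndarray) are cut
    with slices; anything else is consumed item by item through an iterator.
    Mappings such as dict are assumed not to be passed in.
    """
    if hasattr(items, "__len__") and hasattr(items, "__getitem__"):
        # Fast path: one slice per batch, no per-item Python loop.
        for start in range(0, len(items), batch_size):
            yield items[start:start + batch_size]
    else:
        # Fallback for generators and other plain iterators.
        it = iter(items)
        while True:
            chunk = list(itertools.islice(it, batch_size))
            if not chunk:
                return
            yield chunk


print(list(batch_slices([0, 1, 2, 3, 4], 2)))        # list input, sliced
print(list(batch_slices(iter([0, 1, 2, 3, 4]), 2)))  # iterator input, fallback
```

Both calls produce the same batches; the difference is that the sliced path avoids touching each element individually, which matters most for large lists and arrays.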

Here is the benchmark for parallelizing 1 million ints from a list or an ndarray:

| | before | after | improvement |
| --- | --- | --- | --- |
| list | 11.7 s | 0.8 s | 14x |
| numpy.ndarray | 32 s | 0.7 s | 40x |

SparkQA commented Jan 13, 2015

Test build #25479 has started for PR 4024 at commit 7618c7c.

  • This patch merges cleanly.

@davies davies changed the title [SPARK-5224] improve performance of parallelize list/ndarray [SPARK-5224] [PySpark] improve performance of parallelize list/ndarray Jan 13, 2015

SparkQA commented Jan 13, 2015

Test build #25479 has finished for PR 4024 at commit 7618c7c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25479/
Test FAILed.

SparkQA commented Jan 13, 2015

Test build #564 has started for PR 4024 at commit 7618c7c.

  • This patch merges cleanly.

SparkQA commented Jan 13, 2015

Test build #564 has finished for PR 4024 at commit 7618c7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • SparkSubmit.printErrorAndExit(s"Cannot load main class from JAR $primaryResource")
    • class BinaryClassificationMetrics(

davies commented Jan 15, 2015

@JoshRosen ping!

@JoshRosen

LGTM, so I'm going to merge this into master (1.3.0) and branch-1.2 (1.2.1). Thanks!

asfgit pushed a commit that referenced this pull request Jan 15, 2015
After the default batchSize was changed to 0 (batches sized according to the size of the objects), parallelize() still used BatchedSerializer with batchSize=1. This PR makes parallelize() use batchSize=1024 by default.

Also, BatchedSerializer did not work well with list and numpy.ndarray; this PR improves it by using __len__ and __getslice__.

Here is the benchmark for parallelizing 1 million ints from a list or an ndarray:

| | before | after | improvement |
| --- | --- | --- | --- |
| list | 11.7 s | 0.8 s | 14x |
| numpy.ndarray | 32 s | 0.7 s | 40x |

Author: Davies Liu <davies@databricks.com>

Closes #4024 from davies/opt_numpy and squashes the following commits:

7618c7c [Davies Liu] improve performance of parallelize list/ndarray

(cherry picked from commit 3c8650c)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
@asfgit asfgit closed this in 3c8650c Jan 15, 2015