[SPARK-9821][PYSPARK] pyspark-reduceByKey-should-take-a-custom-partitioner#8569
Closed
holdenk wants to merge 3 commits into apache:master from
Conversation
Test build #41922 has finished for PR 8569 at commit
Test build #41943 has finished for PR 8569 at commit
Contributor
@holdenk Almost all the APIs in PairRDDFunctions take an optional Partitioner; should we add this for all of them in Python, or should we just add it to the most advanced one? The current approach is also reasonable.
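For context, here is a minimal sketch of how an optional partition function might be threaded through PySpark's reduceByKey. The parameter name partitionFunc, the portable_hash default, and the delegation to combineByKey are assumptions based on this discussion, not quotes from the patch:

```python
from pyspark.rdd import portable_hash  # PySpark's default hash for partitioning

# Sketch only: reduceByKey grows an optional partition function, defaulting to
# portable_hash, and passes it through to combineByKey (which would forward it
# to partitionBy) so callers can control how keys map to partitions.
def reduceByKey(self, func, numPartitions=None, partitionFunc=portable_hash):
    return self.combineByKey(lambda x: x, func, func, numPartitions,
                             partitionFunc)
```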
Contributor
Author
@davies That sounds like a good plan. I'll expand the JIRA & this PR over the weekend and ping you when it's done :)
Test build #42389 has finished for PR 8569 at commit
Contributor
Author
@davies this should now work in the other places.
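A hedged usage sketch of what "the other places" could look like once they accept a custom partition function; the exact method coverage and the partitionFunc keyword are assumptions based on this thread:

```python
# Assumes a running SparkContext `sc` and that the *ByKey operations accept a
# `partitionFunc` keyword, per the discussion above.
def first_letter_partitioner(key):
    # Toy partition function: route keys by their first character.
    return ord(key[0])

pairs = sc.parallelize([("apple", 1), ("avocado", 2), ("banana", 3)])

counts = pairs.reduceByKey(lambda a, b: a + b, numPartitions=2,
                           partitionFunc=first_letter_partitioner)
totals = pairs.foldByKey(0, lambda a, b: a + b, numPartitions=2,
                         partitionFunc=first_letter_partitioner)
groups = pairs.groupByKey(numPartitions=2,
                          partitionFunc=first_letter_partitioner)
```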
Contributor
LGTM, will merge into master, thanks!
From the issue:
In Scala, I can supply a custom partitioner to reduceByKey (and other aggregation/repartitioning methods like aggregateByKey and combineByKey), but as far as I can tell from the PySpark API, there's no way to do the same in Python.
Here's an example of my code in Scala:
weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(), _ + _)
But I can't figure out how to do the same in Python. The closest I can get is to call repartition before reduceByKey like so:
weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3, hash_filetype).reduceByKey(lambda v1, v2: v1 + v2).collect()
But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better.
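With the change in this PR, the custom partition function can be passed directly to reduceByKey, so the data is shuffled only once. A sketch using the reporter's own helpers getFileType and hash_filetype (both hypothetical, defined elsewhere in their code):

```python
# Single shuffle: reduceByKey partitions with hash_filetype directly, instead
# of a partitionBy shuffle followed by a second shuffle inside reduceByKey.
(weblogs
    .map(lambda s: (getFileType(s), 1))
    .reduceByKey(lambda v1, v2: v1 + v2, numPartitions=3,
                 partitionFunc=hash_filetype)
    .collect())
```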