
[SPARK-9821][PYSPARK] pyspark-reduceByKey-should-take-a-custom-partitioner #8569

Closed

holdenk wants to merge 3 commits into apache:master from holdenk:SPARK-9821-pyspark-reduceByKey-should-take-a-custom-partitioner

Conversation

@holdenk (Contributor) commented Sep 2, 2015

from the issue:

In Scala, I can supply a custom partitioner to reduceByKey (and other aggregation/repartitioning methods like aggregateByKey and combineByKey), but as far as I can tell from the PySpark API, there's no way to do the same in Python.
Here's an example of my code in Scala:
weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(), _ + _)
But I can't figure out how to do the same in Python. The closest I can get is to call partitionBy before reduceByKey, like so:
weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3, hash_filetype).reduceByKey(lambda v1, v2: v1 + v2).collect()
But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better.
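For context, a minimal sketch of the single-shuffle call this patch enables: reduceByKey gains an optional partition-function argument, so the separate partitionBy step above is no longer needed. The getFileType and hash_filetype bodies and the input path are stand-ins, not from the original report:

```python
from pyspark import SparkContext

sc = SparkContext(appName="reduceByKey-custom-partitioner")

def getFileType(line):
    # stand-in for the reporter's helper: treat the text after the
    # last dot as the file type
    return line.rsplit(".", 1)[-1]

def hash_filetype(key):
    # stand-in custom partition function: PySpark calls this on each
    # key and takes the result modulo the partition count
    return hash(key)

weblogs = sc.textFile("weblogs.txt")  # hypothetical input path
counts = (weblogs
          .map(lambda s: (getFileType(s), 1))
          # one shuffle: the partition function goes straight to reduceByKey
          .reduceByKey(lambda v1, v2: v1 + v2, 3, hash_filetype)
          .collect())
```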

@SparkQA commented Sep 2, 2015

Test build #41922 has finished for PR 8569 at commit 8d272b3.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk holdenk changed the title [SPARK-9821][PYSPARK] pyspark-reduceByKey-should-take-a-custom-partitioner [SPARK-9821][PYSPARK][WIP] pyspark-reduceByKey-should-take-a-custom-partitioner Sep 2, 2015
@SparkQA commented Sep 2, 2015

Test build #41943 has finished for PR 8569 at commit a155498.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk holdenk changed the title [SPARK-9821][PYSPARK][WIP] pyspark-reduceByKey-should-take-a-custom-partitioner [SPARK-9821][PYSPARK] pyspark-reduceByKey-should-take-a-custom-partitioner Sep 2, 2015
@davies (Contributor) commented Sep 10, 2015

@holdenk Almost all of the APIs in PairRDDFunctions take an optional Partitioner. Should we add this for all of them in Python, or just add it to the most advanced one, combineByKey? A sketch of the combineByKey variant follows below.

The current approach is also reasonable.
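For reference, a sketch of how the same keyword would look on combineByKey, assuming the argument is named partitionFunc as in reduceByKey above; identity_part is a hypothetical partition function and sc is an existing SparkContext:

```python
def identity_part(key):
    # hypothetical partition function: keys here are already small ints
    return key

pairs = sc.parallelize([(0, 1), (1, 1), (0, 2)])
sums = pairs.combineByKey(
    lambda v: v,             # createCombiner: wrap the first value
    lambda c, v: c + v,      # mergeValue: fold a value into the combiner
    lambda c1, c2: c1 + c2,  # mergeCombiners: merge across partitions
    numPartitions=2,
    partitionFunc=identity_part,
).collect()
```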

@holdenk (Contributor, Author) commented Sep 10, 2015

@davies That sounds like a good plan. I'll expand the JIRA & this PR over the weekend and ping you when it's done :)

@SparkQA commented Sep 14, 2015

Test build #42389 has finished for PR 8569 at commit fe3ea4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented Sep 22, 2015

@davies this should now work in the other places; see the sketch below.
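A sketch of the expanded surface, assuming the other *ByKey methods gained the same partitionFunc keyword; by_first_char is a hypothetical partition function and sc is an existing SparkContext:

```python
def by_first_char(key):
    # hypothetical partition function: group keys by their first letter
    return ord(key[0])

words = sc.parallelize([("apple", 1), ("avocado", 2), ("banana", 3)])

words.foldByKey(0, lambda a, b: a + b, partitionFunc=by_first_char).collect()
words.aggregateByKey(0, lambda a, v: a + v, lambda a, b: a + b,
                     partitionFunc=by_first_char).collect()
words.groupByKey(partitionFunc=by_first_char).mapValues(list).collect()
```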

@davies (Contributor) commented Sep 22, 2015

LGTM, will merge into master, thanks!

@asfgit closed this in 1cd6741 on Sep 22, 2015