[SPARK-9821][PYSPARK] pyspark-reduceByKey-should-take-a-custom-partitioner#8569
Closed
holdenk wants to merge 3 commits into apache:master from
Conversation
Test build #41922 has finished for PR 8569 at commit
Test build #41943 has finished for PR 8569 at commit
Contributor
@holdenk Almost all the APIs in PairRDDFunctions take an optional Partitioner; should we add this for all of them in Python, or should we just add it to the most advanced one? The current approach is also reasonable.
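For context, here is a minimal sketch of how an optional partition function might be threaded through PySpark's reduceByKey. The parameter name partitionFunc, the portable_hash default, and the delegation to combineByKey are assumptions based on this discussion, not quotes from the patch:

```python
from pyspark.rdd import portable_hash  # PySpark's default hash for partitioning

# Sketch only: reduceByKey grows an optional partition function, defaulting to
# portable_hash, and passes it through to combineByKey (which would forward it
# to partitionBy) so callers can control how keys map to partitions.
def reduceByKey(self, func, numPartitions=None, partitionFunc=portable_hash):
    return self.combineByKey(lambda x: x, func, func, numPartitions,
                             partitionFunc)
```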
Contributor
Author
@davies That sounds like a good plan. I'll expand the JIRA & this PR over the weekend and ping you when it's done :)
Test build #42389 has finished for PR 8569 at commit
Contributor
Author
@davies this should now work in the other places.
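A hedged usage sketch of what "the other places" could look like once they accept a custom partition function; the exact method coverage and the partitionFunc keyword are assumptions based on this thread:

```python
# Assumes a running SparkContext `sc` and that the *ByKey operations accept a
# `partitionFunc` keyword, per the discussion above.
def first_letter_partitioner(key):
    # Toy partition function: route keys by their first character.
    return ord(key[0])

pairs = sc.parallelize([("apple", 1), ("avocado", 2), ("banana", 3)])

counts = pairs.reduceByKey(lambda a, b: a + b, numPartitions=2,
                           partitionFunc=first_letter_partitioner)
totals = pairs.foldByKey(0, lambda a, b: a + b, numPartitions=2,
                         partitionFunc=first_letter_partitioner)
groups = pairs.groupByKey(numPartitions=2,
                          partitionFunc=first_letter_partitioner)
```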
Contributor
LGTM, will merge into master, thanks!
From the issue:
In Scala, I can supply a custom partitioner to reduceByKey (and other aggregation/repartitioning methods like aggregateByKey and combineByKey), but as far as I can tell from the PySpark API, there's no way to do the same in Python.
Here's an example of my code in Scala:
weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(), _ + _)
But I can't figure out how to do the same in Python. The closest I can get is to call repartition before reduceByKey like so:
weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3, hash_filetype).reduceByKey(lambda v1, v2: v1 + v2).collect()
But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better.
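With the change in this PR, the custom partition function can be passed directly to reduceByKey, so the data is shuffled only once. A sketch using the reporter's own helpers getFileType and hash_filetype (both hypothetical, defined elsewhere in their code):

```python
# Single shuffle: reduceByKey partitions with hash_filetype directly, instead
# of a partitionBy shuffle followed by a second shuffle inside reduceByKey.
(weblogs
    .map(lambda s: (getFileType(s), 1))
    .reduceByKey(lambda v1, v2: v1 + v2, numPartitions=3,
                 partitionFunc=hash_filetype)
    .collect())
```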