[SPARK-10731] [PYSPARK] [SQL] use 1 partition in take() of Python DataFrame #8852
davies wants to merge 1 commit into apache:master from
cc @yhuai
Test build #42777 has finished for PR 8852 at commit
Can you update the title to put the correct JIRA ticket? SPARK-10731
Can this be made more of a wrapper around the Scala stuff rather than doing its own coalescing?
Is it possible to just call the Scala DataFrame.take and get the result? You lose the socket-based transfer of results if the number of rows is enormous, but that doesn't seem like a big deal for take.
Note that the current fix makes take behave differently in Python than in Scala. In Scala, you can still get parallelism (e.g. for a highly selective filter), whereas in Python the data is coalesced into a single partition, so the degree of parallelism is now just 1.
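To make the parallelism point concrete, here is a rough pure-Python simulation rather than Spark code. It assumes the Scala-side take scans partitions in growing batches (the `scale_up` factor of 4 and the function names `take_incremental` and `take_coalesced` are made up for illustration), and it models coalescing as a single task walking every row in order.

```python
from typing import Callable, List, Tuple

Partition = List[int]


def take_incremental(
    partitions: List[Partition],
    pred: Callable[[int], bool],
    n: int,
    scale_up: int = 4,  # hypothetical growth factor, chosen for illustration
) -> Tuple[List[int], List[int]]:
    """Scan partitions in growing batches until n matching rows are found.

    Returns the rows taken and the number of partitions scanned in each
    round. Partitions within one round are independent, so a real engine
    could run them as parallel tasks.
    """
    taken: List[int] = []
    rounds: List[int] = []
    scanned = 0
    batch = 1
    while len(taken) < n and scanned < len(partitions):
        chunk = partitions[scanned:scanned + batch]
        rounds.append(len(chunk))
        for part in chunk:  # each partition stands in for one task
            taken.extend(row for row in part if pred(row))
        scanned += len(chunk)
        batch *= scale_up
    return taken[:n], rounds


def take_coalesced(
    partitions: List[Partition],
    pred: Callable[[int], bool],
    n: int,
) -> Tuple[List[int], int]:
    """Model a single-partition take: one task walks all rows in order.

    Returns the rows taken and how many rows that one task had to scan.
    """
    taken: List[int] = []
    rows_scanned = 0
    for part in partitions:
        for row in part:
            rows_scanned += 1
            if pred(row):
                taken.append(row)
                if len(taken) == n:
                    return taken, rows_scanned
    return taken, rows_scanned


# 16 partitions of 100 rows each; the filter matches only the value 999.
parts = [list(range(i * 100, (i + 1) * 100)) for i in range(16)]
selective = lambda x: x == 999

print(take_incremental(parts, selective, 1))  # ([999], [1, 4, 11])
print(take_coalesced(parts, selective, 1))    # ([999], 1000)
```

With a selective filter the incremental version finishes in three rounds, the last of which spreads 11 partitions across tasks, while the coalesced version has one task scan 1000 rows by itself.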
Also optimize this special case (zero or one partition) in Limit by removing the unnecessary shuffle.
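A toy sketch of that special case, again in plain Python rather than Spark internals: it assumes Limit works as a per-partition limit followed by a merge into one partition and a final limit, and the function name `limit_rows` and the boolean "shuffled" flag are hypothetical, used only to show when the merge step can be skipped.

```python
from typing import List, Tuple


def limit_rows(partitions: List[List[int]], n: int) -> Tuple[List[int], bool]:
    """Return the first n rows and whether a shuffle-like merge was needed."""
    # Local limit: no partition ever needs to contribute more than n rows.
    local = [part[:n] for part in partitions]

    # Special case: with zero or one partition there is nothing to merge,
    # so the local limit is already the final answer.
    if len(local) <= 1:
        return (local[0] if local else []), False

    # General case: gather the locally limited rows into one place
    # (this list concatenation stands in for the shuffle), then limit again.
    merged = [row for part in local for row in part]
    return merged[:n], True


print(limit_rows([], 3))                # ([], False)
print(limit_rows([[1, 2, 3, 4]], 3))    # ([1, 2, 3], False)
print(limit_rows([[1, 2], [3, 4]], 3))  # ([1, 2, 3], True)
```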