[SPARK-10731] [PYSPARK] [SQL] use 1 partition in take() of Python DataFrame by davies · Pull Request #8852 · apache/spark

davies · 2015-09-21T21:32:32Z

Also optimize the special case (zero or one partition) in Limit by removing the unnecessary shuffle.
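To illustrate the special case described above, here is a toy model in plain Python, not Spark's actual Limit operator. Partitions are modelled as lists, and the hypothetical helper `limit_rows` returns a flag showing whether the merge step (standing in for the shuffle) was needed.

```python
from typing import List, Tuple

Row = int
Partition = List[Row]


def limit_rows(partitions: List[Partition], n: int) -> Tuple[List[Row], bool]:
    """Toy limit: returns (rows, shuffled).

    `shuffled` is True only when rows from several partitions had to be
    merged into one place. This is an illustrative model, not Spark code.
    """
    # Per-partition limit: no partition needs to contribute more than n rows.
    local = [p[:n] for p in partitions]

    # Special case: with zero or one partition there is nothing to merge,
    # so the merge (the "shuffle" in this model) can be skipped.
    if len(local) <= 1:
        return (local[0] if local else []), False

    # General case: bring the per-partition results together, then apply
    # the limit once more across the merged rows.
    merged = [row for part in local for row in part]
    return merged[:n], True
```

With a single input partition the result is just that partition's first `n` rows and the flag is `False`; with two or more partitions the merge step runs and the flag is `True`.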

davies · 2015-09-21T21:32:39Z

cc @yhuai

SparkQA · 2015-09-22T00:01:02Z

Test build #42777 has finished for PR 8852 at commit 0c1d21f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2015-09-22T05:32:00Z

Can you update the title to put the correct JIRA ticket? SPARK-10731

rxin · 2015-09-22T05:33:17Z

python/pyspark/sql/dataframe.py

Can this be made more of a wrapper around the Scala stuff rather than doing its own coalescing?

Is it possible to just call Scala's DataFrame.take and get the result? You lose the socket thing if the number of rows is enormous, but that doesn't seem like a big deal for take.

Note that the current fix changes the behavior of take in Python vs Scala. In Scala, you can still get parallelism (e.g. for a highly selective filter), whereas in Python you coalesced it into a single partition, and as a result the degree of parallelism is now just 1.
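The parallelism concern above can be sketched with a toy model in plain Python. This is not Spark's implementation: the function names, the batch growth factor of 4, and the list-of-lists partition layout are all assumptions made for illustration. `take_incremental` models a take that scans partitions in growing batches, where each batch could run as parallel tasks; `take_coalesced` models a single task walking every partition in order.

```python
from typing import Callable, List, Tuple

Row = int
Partition = List[Row]


def take_incremental(
    partitions: List[Partition], pred: Callable[[Row], bool], n: int
) -> Tuple[List[Row], List[int]]:
    """Scan partitions in growing batches until n matching rows are found.

    Returns the rows plus the size of each batch. In this model the
    partitions inside one batch could be scanned in parallel.
    """
    rows: List[Row] = []
    batch_sizes: List[int] = []
    scanned = 0
    batch = 1
    while len(rows) < n and scanned < len(partitions):
        chunk = partitions[scanned:scanned + batch]
        batch_sizes.append(len(chunk))
        for part in chunk:
            rows.extend(r for r in part if pred(r))
        scanned += len(chunk)
        batch *= 4  # assumed growth factor, for illustration only
    return rows[:n], batch_sizes


def take_coalesced(
    partitions: List[Partition], pred: Callable[[Row], bool], n: int
) -> Tuple[List[Row], int]:
    """One task walks the partitions sequentially and stops at n matches.

    Returns the rows plus how many partitions that single task visited.
    """
    rows: List[Row] = []
    visited = 0
    for part in partitions:
        visited += 1
        for r in part:
            if pred(r):
                rows.append(r)
                if len(rows) == n:
                    return rows, visited
    return rows, visited


# 16 partitions of 10 rows each; the filter keeps one row in every 50.
parts = [list(range(i * 10, i * 10 + 10)) for i in range(16)]
selective = lambda r: r % 50 == 49

print(take_incremental(parts, selective, 2))
print(take_coalesced(parts, selective, 2))
```

Both functions return the same rows. The difference is in the second value: the incremental model spreads its scan across batches of partitions, while the coalesced model has one task reading partition after partition, which is the degree-of-parallelism-1 behaviour described above.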

use 1 partition to do take()

0c1d21f

rxin reviewed Sep 22, 2015

davies changed the title ~~[SPARK-107321] [PYSPARK] [SQL] use 1 partition in take() of Python DataFrame~~ [SPARK-10731] [PYSPARK] [SQL] use 1 partition in take() of Python DataFrame Sep 22, 2015

davies closed this Sep 23, 2015

davies wants to merge 1 commit into apache:master from davies:py_head