[SPARK-10731] [PYSPARK] [SQL] use 1 partition in take() of Python DataFrame#8852

Closed
davies wants to merge 1 commit into apache:master from davies:py_head

@davies
Contributor

davies commented Sep 21, 2015

This also optimizes the special case of zero or one partition in Limit by removing the unnecessary shuffle.

@davies
Contributor Author

davies commented Sep 21, 2015

cc @yhuai

@SparkQA

SparkQA commented Sep 22, 2015

Test build #42777 has finished for PR 8852 at commit 0c1d21f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Sep 22, 2015

Can you update the title to put the correct JIRA ticket? SPARK-10731

Contributor

Can this be made more of a wrapper around the Scala stuff rather than doing its own coalescing?

Contributor Author

How?

Contributor

Is it possible to just call Scala's DataFrame.take and get the result? You lose the socket-based transfer of results if the number of rows is enormous, but that doesn't seem like a big deal for take.

Note that the current fix makes take behave differently in Python than in Scala. In Scala, you can still get parallelism (e.g. for a highly selective filter), whereas in Python the data is coalesced into a single partition, so the degree of parallelism is just 1.
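
To make the parallelism concern concrete, here is a toy model in plain Python. It is not Spark code and does not reproduce Spark's actual planning; `take_coalesced` and `take_parallel` are hypothetical names, and each list stands in for one partition. The sketch only compares how many rows the busiest task has to scan when a highly selective filter runs before a take of `n` rows.

```python
from typing import Callable, List, Tuple

Rows = List[int]


def take_coalesced(partitions: List[Rows], pred: Callable[[int], bool], n: int) -> Tuple[Rows, int]:
    """Toy model of coalescing to one partition: a single task scans rows in order."""
    scanned = 0
    out: Rows = []
    for part in partitions:
        for row in part:
            scanned += 1
            if pred(row):
                out.append(row)
                if len(out) == n:
                    return out, scanned
    return out, scanned


def take_parallel(partitions: List[Rows], pred: Callable[[int], bool], n: int) -> Tuple[Rows, int]:
    """Toy model of a per-partition limit followed by a global limit.

    Each loop iteration stands for an independent task; the second return
    value is the row count scanned by the busiest task.
    """
    partials: List[Rows] = []
    per_task_scanned: List[int] = []
    for part in partitions:
        scanned = 0
        local: Rows = []
        for row in part:
            scanned += 1
            if pred(row):
                local.append(row)
                if len(local) == n:
                    break
        partials.append(local)
        per_task_scanned.append(scanned)
    merged = [row for local in partials for row in local][:n]
    return merged, max(per_task_scanned, default=0)


# Four partitions of 1000 rows each and a filter that keeps about one row per partition.
partitions = [list(range(i * 1000, (i + 1) * 1000)) for i in range(4)]
selective = lambda x: x > 0 and x % 997 == 0

coalesced_rows, coalesced_scanned = take_coalesced(partitions, selective, 3)
parallel_rows, parallel_scanned = take_parallel(partitions, selective, 3)
```

In this model both strategies return the same rows, but the single coalesced task scans 2992 rows while the busiest parallel task scans 1000. The gap grows with the number of partitions and the selectivity of the filter.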

davies changed the title from "[SPARK-107321] [PYSPARK] [SQL] use 1 partition in take() of Python DataFrame" to "[SPARK-10731] [PYSPARK] [SQL] use 1 partition in take() of Python DataFrame" on Sep 22, 2015
davies closed this Sep 23, 2015