[SPARK-37649][PYTHON] Switch default index to distributed-sequence by default in pandas API on Spark #34902

Closed
HyukjinKwon wants to merge 2 commits into apache:master from HyukjinKwon:SPARK-37649

HyukjinKwon (Member) commented Dec 15, 2021

What changes were proposed in this pull request?

This PR proposes to switch the default index type to distributed-sequence.

Why are the changes needed?

The sequence type relies on sending all data to a single executor, which easily causes OOM and makes the computation slow.
We should switch to the distributed-sequence type, which truly distributes the data.
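
To make the difference concrete, here is a minimal pure-Python sketch of the two strategies. It is not Spark's implementation: partitions are modeled as plain lists, and the function names `sequence_index` and `distributed_sequence_index` are hypothetical, chosen only for illustration. The distributed variant uses a count-then-offset scheme, which is one common way to assign consecutive numbers without moving rows, and is assumed here rather than taken from the patch.

```python
from itertools import accumulate
from typing import Any, List, Tuple

Partition = List[Any]


def sequence_index(partitions: List[Partition]) -> List[Tuple[int, Any]]:
    """Illustrative 'sequence' strategy: gather every row in one place, then number them."""
    # All rows end up in a single list, which mirrors sending all data to one executor.
    rows = [row for part in partitions for row in part]
    return list(enumerate(rows))


def distributed_sequence_index(
    partitions: List[Partition],
) -> List[List[Tuple[int, Any]]]:
    """Illustrative 'distributed-sequence' strategy: count, compute offsets, number locally."""
    # Step 1: each partition reports only its row count (small metadata, not the rows).
    counts = [len(part) for part in partitions]
    # Step 2: an exclusive prefix sum gives the starting index of each partition.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Step 3: each partition numbers its own rows, so the data never leaves its partition.
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


if __name__ == "__main__":
    parts = [["a", "b"], ["c"], [], ["d", "e"]]
    print(sequence_index(parts))
    print(distributed_sequence_index(parts))
```

Both functions produce the same consecutive numbering in this toy model; the difference is that the second one only exchanges per-partition counts, so no single worker has to hold all rows.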

In my own internal benchmark, it was around 2 to 3 times faster on average.

Does this PR introduce any user-facing change?

Ideally, no. The order might be affected, but it was not guaranteed to begin with.

How was this patch tested?

Existing CI should test it out.

HyukjinKwon (Member, Author) commented

cc @ueshin @xinrong-databricks @itholic

ueshin (Member) commented Dec 15, 2021

There are some places mentioning the default type in the docs, like:

This default index is ``sequence`` which requires the computation on single partition which is discouraged. If you plan

This index type should be avoided when the data is large. This is default. See the example below:

SparkQA commented Dec 15, 2021

Test build #146202 has finished for PR 34902 at commit 3704d4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

itholic (Contributor) left a comment

LGTM

ueshin (Member) left a comment

LGTM, pending tests.

SparkQA commented Dec 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50676/

SparkQA commented Dec 15, 2021

Test build #146204 has finished for PR 34902 at commit 969911a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50678/

HyukjinKwon (Member, Author) commented

All tests passed at https://github.com/HyukjinKwon/spark/actions/runs/1580641441.

Merged to master.

SparkQA commented Dec 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50676/

SparkQA commented Dec 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50678/

HyukjinKwon deleted the SPARK-37649 branch on January 4, 2022 00:51