[SPARK-37649][PYTHON] Switch default index to distributed-sequence by default in pandas API on Spark by HyukjinKwon · Pull Request #34902 · apache/spark

HyukjinKwon · 2021-12-15T00:33:18Z

What changes were proposed in this pull request?

This PR proposes switching the default index type in pandas API on Spark from sequence to distributed-sequence.

Why are the changes needed?

The sequence type relies on sending all data to one executor, which easily causes OOM and makes the computation slow.
We should switch to the distributed-sequence type, which truly distributes the data.

In my own internal benchmark, it is around 2 to 3 times faster on average.
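To make the difference concrete, here is a minimal pure-Python sketch of one common way to build a sequential index without gathering all rows in one place. It is not the Spark implementation: partitions are modeled as plain lists, and the helper name `attach_distributed_sequence` is hypothetical. The idea is a two-pass scheme, similar in spirit to what `RDD.zipWithIndex` does: first collect only the row count of each partition, then let every partition number its own rows starting from its offset.

```python
from itertools import accumulate
from typing import Any, List, Tuple

Partition = List[Any]


def attach_distributed_sequence(
    partitions: List[Partition],
) -> List[List[Tuple[int, Any]]]:
    """Number rows 0..n-1 across partitions without merging them.

    Hypothetical helper for illustration only; partitions stand in for
    data that would live on different executors.
    """
    # Pass 1: each partition reports only its row count (small to collect).
    counts = [len(part) for part in partitions]
    # Starting offset of a partition = number of rows in earlier partitions.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Pass 2: each partition numbers its own rows independently.
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
indexed = attach_distributed_sequence(partitions)
print(indexed)
# [[(0, 'a'), (1, 'b')], [(2, 'c')], [(3, 'd'), (4, 'e'), (5, 'f')]]
```

By contrast, a sequence-style index in this model would concatenate every partition into a single list before numbering, which is the step that concentrates memory use on one worker.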

Does this PR introduce any user-facing change?

Ideally no. Order might be affected, but it is not guaranteed in the first place.

How was this patch tested?

Existing CI should test it out.

HyukjinKwon · 2021-12-15T00:34:29Z

cc @ueshin @xinrong-databricks @itholic

ueshin · 2021-12-15T00:46:42Z

There are some places mentioning the default type in the docs, like:

spark/python/docs/source/user_guide/pandas_on_spark/best_practices.rst, line 238 in 3704d4d:

    This default index is ``sequence`` which requires the computation on single partition which is discouraged. If you plan

spark/python/docs/source/user_guide/pandas_on_spark/options.rst, line 163 in 3704d4d:

    This index type should be avoided when the data is large. This is default. See the example below:

SparkQA · 2021-12-15T01:05:24Z

Test build #146202 has finished for PR 34902 at commit 3704d4d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

itholic

LGTM

ueshin

LGTM, pending tests.

SparkQA · 2021-12-15T01:22:35Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50676/

SparkQA · 2021-12-15T01:47:43Z

Test build #146204 has finished for PR 34902 at commit 969911a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-12-15T02:06:55Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50678/

HyukjinKwon · 2021-12-15T02:15:06Z

All tests passed at https://github.com/HyukjinKwon/spark/actions/runs/1580641441.

Merged to master.

SparkQA · 2021-12-15T02:17:39Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50676/

SparkQA · 2021-12-15T02:53:22Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50678/

Switch default index to distributed-sequence by default in pandas API…

3704d4d

… on Spark

github-actions bot added CORE PYTHON labels Dec 15, 2021

Fix other docs

969911a

itholic approved these changes Dec 15, 2021

ueshin approved these changes Dec 15, 2021

HyukjinKwon closed this in c1d80bf Dec 15, 2021

HyukjinKwon deleted the SPARK-37649 branch January 4, 2022 00:51

HyukjinKwon wants to merge 2 commits into apache:master from HyukjinKwon:SPARK-37649
