[SPARK-37649][PYTHON] Switch default index to distributed-sequence by default in pandas API on Spark #34902
HyukjinKwon wants to merge 2 commits into apache:master
There are some places mentioning the default type in the docs, like:
Test build #146202 has finished for PR 34902 at commit
Kubernetes integration test starting
Test build #146204 has finished for PR 34902 at commit
Kubernetes integration test starting
All tests passed at https://github.com/HyukjinKwon/spark/actions/runs/1580641441. Merged to master.
Kubernetes integration test status failure
Kubernetes integration test status failure
What changes were proposed in this pull request?
This PR proposes to switch the default index to `distributed-sequence`.
Why are the changes needed?
The `sequence` type relies on sending all data to one executor, which easily causes OOM and makes the computation slow.
We should switch to the `distributed-sequence` type, which truly distributes the data.
In my own internal benchmark, it is around 2 to 3 times faster on average.
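To illustrate the difference described above, here is a toy, pure-Python sketch. It is not Spark's implementation; the function names and the list-of-lists model of partitions are assumptions made for illustration only. The first function gathers every row in one place before numbering it, while the second gathers only per-partition row counts and lets each partition number its own rows from an offset.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def sequence_index(partitions: Sequence[Sequence[str]]) -> List[Tuple[int, str]]:
    """Toy 'sequence' strategy: move all rows to one place, then number them."""
    # Every row is collected into a single list: this is the single-node bottleneck.
    all_rows = [row for part in partitions for row in part]
    return list(enumerate(all_rows))


def distributed_sequence_index(
    partitions: Sequence[Sequence[str]],
) -> List[List[Tuple[int, str]]]:
    """Toy 'distributed-sequence' strategy: only row counts are gathered centrally."""
    # Step 1: each partition reports how many rows it holds (small metadata only).
    counts = [len(part) for part in partitions]
    # Step 2: a running sum of the counts gives each partition its starting offset.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Step 3: each partition numbers its own rows locally, starting at its offset.
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


if __name__ == "__main__":
    parts = [["a", "b"], ["c"], ["d", "e", "f"]]
    print(sequence_index(parts))
    print(distributed_sequence_index(parts))
```

Both functions assign the same consecutive indices; the second one simply never needs all rows on a single node. In pandas API on Spark itself, the index type is selected through the `compute.default_index_type` option.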
Does this PR introduce any user-facing change?
Ideally no. The order might be affected, but it was not guaranteed in the first place.
How was this patch tested?
Existing CI should cover it.