[SPARK-39911][SQL] Optimize global Sort to RepartitionByExpression #37330
cc @sigmod
Thanks @ulysses-you !
thanks, merging to master!
@ulysses-you can you open a backport PR for 3.3? I think this is a necessary followup of #37250 to avoid perf regression.
### What changes were proposed in this pull request?

Optimize a global Sort to RepartitionByExpression, for example:

```
Sort local             Sort local
  Sort global    =>      RepartitionByExpression
```

### Why are the changes needed?

If a global sort sits below a local sort, the only meaningful thing about it is its distribution. So this PR optimizes that global sort to RepartitionByExpression to save a local sort.

### Does this PR introduce _any_ user-facing change?

No, it only improves performance.

### How was this patch tested?

Added a test.

Closes apache#37330 from ulysses-you/optimize-sort.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This is a backport of #37330 into branch-3.3.

### What changes were proposed in this pull request?

Optimize a global Sort to RepartitionByExpression, for example:

```
Sort local             Sort local
  Sort global    =>      RepartitionByExpression
```

### Why are the changes needed?

If a global sort sits below a local sort, the only meaningful thing about it is its distribution. So this PR optimizes that global sort to RepartitionByExpression to save a local sort.

### Does this PR introduce _any_ user-facing change?

#37250 fixed a bug and was backported into branch-3.3. However, that fix may introduce a performance regression. This PR by itself only improves performance, but it is backported as well in order to avoid that regression. See the details in #37330 (comment).

### How was this patch tested?

Added a test.

Closes #37330 from ulysses-you/optimize-sort.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Closes #37373 from ulysses-you/SPARK-39911-3.3.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@ulysses-you @cloud-fan
both of the above stages would run with 500 tasks
RepartitionByExpression will run with 500 tasks and create a single partition
hi @maytasm , |
@ulysses-you
In the case that the above condition is True, then this Consider the following case:
without the change from this PR, the
This ran fast as we have high parallelism in all stages (many partitions -> many tasks running)
The problem here is that |
Are you sure? It looks wrong to do so
@cloud-fan I believe this PR prevents the optimization added in #11840 from working as intended, since it converts the no-op sort into a RepartitionByExpression, which can no longer be removed
Oh now I get it. This is a sort by a constant. Can you try the latest master branch? I think https://github.com/apache/spark/pull/44429/files#diff-11264d807efa58054cca2d220aae8fba644ee0f0f2a4722c46d52828394846efR214 has solved it.
Ah! I haven't tried running yet but took a look at PR #44429 and I think it does solve this issue. So basically the logic for converting |
@cloud-fan Confirmed that PR #44429 fixed the issue I was having with the no-op sort!
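
To illustrate the thread above, here is a rough sketch of why repartitioning by a constant key can end up with a single non-empty partition. It is a toy model, not Spark's partitioner: `partition_of` and `partition_sizes` are hypothetical helpers, plain `hash(key) % n` stands in for whatever partitioning Spark actually applies, and the 500 partitions simply mirror the task count mentioned in the comments.

```python
from collections import Counter


def partition_of(key, num_partitions):
    # Hypothetical stand-in for a partitioner: equal keys always map to
    # the same partition, whatever the real scheme is.
    return hash(key) % num_partitions


def partition_sizes(keys, num_partitions):
    # Count how many rows land in each partition id.
    return Counter(partition_of(k, num_partitions) for k in keys)


NUM_PARTITIONS = 500   # mirrors the "500 tasks" from the discussion
NUM_ROWS = 10_000      # arbitrary row count, chosen for illustration

# Partitioning by a column with many distinct values spreads the rows out.
by_column = partition_sizes(range(NUM_ROWS), NUM_PARTITIONS)

# Partitioning by a constant gives every row the same key, so every row
# goes to the same partition and the others stay empty.
by_constant = partition_sizes((1 for _ in range(NUM_ROWS)), NUM_PARTITIONS)

print(len(by_column), "non-empty partitions when keyed by a column")
print(len(by_constant), "non-empty partition(s) when keyed by a constant")
```

Under these assumptions the shuffle is still scheduled across all 500 partitions, but only one of them receives data, so the stage that reads it loses its parallelism.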
What changes were proposed in this pull request?
Optimize a global Sort to RepartitionByExpression when it sits below a local Sort: for example, `Sort local` over `Sort global` becomes `Sort local` over `RepartitionByExpression`.
Why are the changes needed?
If a global sort sits below a local sort, the only meaningful thing about it is its distribution. So this PR optimizes that global sort to RepartitionByExpression to save a local sort.
Does this PR introduce any user-facing change?
No, it only improves performance.
How was this patch tested?
Added a test.
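
The rewrite described above can be sketched as a small plan transformation. This is a minimal toy model under stated assumptions, not Spark's optimizer code: the `Scan`, `Sort`, and `RepartitionByExpression` dataclasses and the `optimize` function are hypothetical stand-ins that borrow the plan node names from the description, and the sketch only handles a local sort sitting directly on top of a global sort.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scan:
    # Hypothetical leaf node standing in for any data source.
    table: str


@dataclass(frozen=True)
class Sort:
    order: Tuple[str, ...]   # sort keys, as plain column names
    is_global: bool          # True = global sort, False = local sort
    child: object


@dataclass(frozen=True)
class RepartitionByExpression:
    exprs: Tuple[str, ...]   # expressions that determine the distribution
    child: object


def optimize(plan):
    """Rewrite bottom-up: a global Sort directly below a local Sort
    is replaced by a RepartitionByExpression on the same keys."""
    if isinstance(plan, Scan):
        return plan
    if isinstance(plan, RepartitionByExpression):
        return RepartitionByExpression(plan.exprs, optimize(plan.child))
    child = optimize(plan.child)
    if not plan.is_global and isinstance(child, Sort) and child.is_global:
        # The parent re-sorts each partition anyway, so only the child's
        # distribution matters; its ordering work can be dropped.
        child = RepartitionByExpression(child.order, child.child)
    return Sort(plan.order, plan.is_global, child)


before = Sort(("b",), False, Sort(("a",), True, Scan("t")))
after = optimize(before)
print(after)
```

In this sketch the distribution of the global sort is preserved by reusing its sort keys as the repartitioning expressions, while the ordering work is left to the local sort above it. A global sort that is not below a local sort is left untouched.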