[SPARK-39911][SQL] Optimize global Sort to RepartitionByExpression #37330

ulysses-you · 2022-07-28T14:07:57Z

What changes were proposed in this pull request?

Optimize Global sort to RepartitionByExpression, for example:

Sort local             Sort local 
  Sort global    =>      RepartitionByExpression

Why are the changes needed?

If a global sort below a local sort, the only meaningful thing is it's distribution. So this pr optimizes that global sort to RepartitionByExpression to save a local sort.

Does this PR introduce any user-facing change?

no, only improve performance

How was this patch tested?

add test

cloud-fan · 2022-07-29T14:43:23Z

cc @sigmod

sigmod

Thanks @ulysses-you !

cloud-fan · 2022-08-01T05:58:17Z

thanks, merging to master!

cloud-fan · 2022-08-02T03:41:10Z

@ulysses-you can you open a backport PR for 3.3? I think this is a necessary followup of #37250 to avoid perf regression.

### What changes were proposed in this pull request? Optimize Global sort to RepartitionByExpression, for example: ``` Sort local Sort local Sort global => RepartitionByExpression ``` ### Why are the changes needed? If a global sort below a local sort, the only meaningful thing is it's distribution. So this pr optimizes that global sort to RepartitionByExpression to save a local sort. ### Does this PR introduce _any_ user-facing change? no, only improve performance ### How was this patch tested? add test Closes apache#37330 from ulysses-you/optimize-sort. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

this is for backport #37330 into branch-3.3 ### What changes were proposed in this pull request? Optimize Global sort to RepartitionByExpression, for example: ``` Sort local Sort local Sort global => RepartitionByExpression ``` ### Why are the changes needed? If a global sort below a local sort, the only meaningful thing is it's distribution. So this pr optimizes that global sort to RepartitionByExpression to save a local sort. ### Does this PR introduce _any_ user-facing change? we fix a bug in #37250 and that pr backport into branch-3.3. However, that fix may introduce performance regression. This pr itself is only to improve performance but in order to avoid the regression, we also backport this pr. see the details #37330 (comment) ### How was this patch tested? add test Closes #37330 from ulysses-you/optimize-sort. Authored-by: ulysses-you <ulyssesyou18gmail.com> Signed-off-by: Wenchen Fan <wenchendatabricks.com> Closes #37373 from ulysses-you/SPARK-39911-3.3. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

maytasm · 2023-12-01T20:52:48Z

@ulysses-you @cloud-fan
Not sure if I am missing something but can this causes performance degradation if my sort order is on a key with few/single values?
For example, if I have 500 shuffle partitions...
Without this change:

Sort local    
  Sort global

both of the above stages would run with 500 tasks
With this change: and say the RepartitionByExpression is on the column date and there is only a single value for this in my dataset

Sort local 
   RepartitionByExpression

RepartitionByExpression will run with 500 tasks and create a single partition
then the Sort would run with a single task (as there is only a single partition)

ulysses-you · 2023-12-04T01:21:58Z

hi @maytasm , Sort global is semantics equal to Sort local + RepartitionByExpression. For your case, if there is only single value on the sort column, then Sort global would also introduce a range partitioning shuffle and make a single partition task at the end. Can you show a real case if you find any behavior change ?

maytasm · 2023-12-22T07:36:15Z

@ulysses-you
Sorry for the late reply. That is true unless the following condition is True for the Sort global:

   case s @ Sort(orders, _, child) if orders.isEmpty || orders.exists(_.child.foldable) =>
      val newOrders = orders.filterNot(_.child.foldable)
      if (newOrders.isEmpty) {
        applyLocally.lift(child).getOrElse(child)
      }

In the case that the above condition is True, then this Sort global will be removed. However, with this change (in this PR), the Sort global will not be removed as it would have been turn into a RepartitionByExpression

Consider the following case:

+- Sort [date#0 ASC NULLS FIRST], false
   +- Project [...]
      +- Sort [20231221 ASC NULLS FIRST], true
         +- Aggregate [...]
               +- RelationV2[...] some_table

without the change from this PR, the Sort [20231221 ASC NULLS FIRST], true would be remove. We would end up with

+- Sort [date#0 ASC NULLS FIRST], false
   +- Project [...]
         +- Aggregate [...]
               +- RelationV2[...] some_table

This ran fast as we have high parallelism in all stages (many partitions -> many tasks running)
However, with the change in this PR:

+- Sort [date#0 ASC NULLS FIRST], false
   +- Project [...]
      +- Exchange rangepartition (20231221 ASC NULL FIRST)
         +- Aggregate [...]
               +- RelationV2[...] some_table

The problem here is that Exchange rangepartition reduces the partition to 1. Then the stages after that like the Sort on date#0 ran with a single task.

cloud-fan · 2023-12-22T07:44:33Z

without the change from this PR, the Sort [20231221 ASC NULLS FIRST], true would be remove.

Are you sure? It looks wrong to do so

maytasm · 2023-12-22T08:04:43Z

@cloud-fan
Can you expand on why you think the Sort [20231221 ASC NULLS FIRST], true looks wrong to remove? It is a sort that has no reference and should be fine to remove.
Sort [20231221 ASC NULLS FIRST], true would look something like:

Since Literal is foldable, newOrders would be empty. As newOrders is empty, the sort is then removed.
This is explained in #11840

I believe this PR prevents the optimization added in #11840 to work as intended since it convert the no-op sort into RepartitionByExpression, which can no longer be removed

cloud-fan · 2023-12-22T08:51:52Z

Oh now I get it. This is sort by a constant. Can you try the latest master branch? I think https://github.com/apache/spark/pull/44429/files#diff-11264d807efa58054cca2d220aae8fba644ee0f0f2a4722c46d52828394846efR214 has solved it.

maytasm · 2023-12-22T10:23:21Z

Ah! I haven't tried running yet but took a look at PR #44429 and I think it does solve this issue. So basically the logic for converting global sort to Exchange rangepartition is moved into another rule, RemoveRedundantSorts, and EliminateSorts is run before RemoveRedundantSorts. Meaning the no-op sort like Sort [20231221 ASC NULLS FIRST], true would be removed by EliminateSorts already by the time RemoveRedundantSorts is run.

maytasm · 2023-12-23T06:18:42Z

@cloud-fan Confirmed that PR #44429 fixed the issue I was having with the no-op sort!

Optimize global Sort to RepartitionByExpression

0a63973

github-actions bot added the SQL label Jul 28, 2022

cloud-fan approved these changes Jul 29, 2022

View reviewed changes

sigmod approved these changes Jul 29, 2022

View reviewed changes

cloud-fan closed this in 83bd4f3 Aug 1, 2022

ulysses-you deleted the optimize-sort branch August 1, 2022 12:01

ulysses-you mentioned this pull request Aug 2, 2022

[SPARK-39911][SQL][3.3] Optimize global Sort to RepartitionByExpression #37373

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-39911][SQL] Optimize global Sort to RepartitionByExpression #37330

[SPARK-39911][SQL] Optimize global Sort to RepartitionByExpression #37330

ulysses-you commented Jul 28, 2022

cloud-fan commented Jul 29, 2022

sigmod left a comment

cloud-fan commented Aug 1, 2022

cloud-fan commented Aug 2, 2022

maytasm commented Dec 1, 2023

ulysses-you commented Dec 4, 2023

maytasm commented Dec 22, 2023 •

edited

Loading

cloud-fan commented Dec 22, 2023

maytasm commented Dec 22, 2023

cloud-fan commented Dec 22, 2023

maytasm commented Dec 22, 2023 •

edited

Loading

maytasm commented Dec 23, 2023

[SPARK-39911][SQL] Optimize global Sort to RepartitionByExpression #37330

[SPARK-39911][SQL] Optimize global Sort to RepartitionByExpression #37330

Conversation

ulysses-you commented Jul 28, 2022

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

cloud-fan commented Jul 29, 2022

sigmod left a comment

Choose a reason for hiding this comment

cloud-fan commented Aug 1, 2022

cloud-fan commented Aug 2, 2022

maytasm commented Dec 1, 2023

ulysses-you commented Dec 4, 2023

maytasm commented Dec 22, 2023 • edited Loading

cloud-fan commented Dec 22, 2023

maytasm commented Dec 22, 2023

cloud-fan commented Dec 22, 2023

maytasm commented Dec 22, 2023 • edited Loading

maytasm commented Dec 23, 2023

maytasm commented Dec 22, 2023 •

edited

Loading

maytasm commented Dec 22, 2023 •

edited

Loading