Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-39911][SQL] Optimize global Sort to RepartitionByExpression #37330

Closed
wants to merge 1 commit into from

Conversation

ulysses-you
Copy link
Contributor

What changes were proposed in this pull request?

Optimize Global sort to RepartitionByExpression, for example:

Sort local             Sort local 
  Sort global    =>      RepartitionByExpression

Why are the changes needed?

If a global sort below a local sort, the only meaningful thing is it's distribution. So this pr optimizes that global sort to RepartitionByExpression to save a local sort.

Does this PR introduce any user-facing change?

no, only improve performance

How was this patch tested?

add test

@github-actions github-actions bot added the SQL label Jul 28, 2022
@cloud-fan
Copy link
Contributor

cc @sigmod

Copy link
Contributor

@sigmod sigmod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ulysses-you !

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 83bd4f3 Aug 1, 2022
@ulysses-you ulysses-you deleted the optimize-sort branch August 1, 2022 12:01
@cloud-fan
Copy link
Contributor

@ulysses-you can you open a backport PR for 3.3? I think this is a necessary followup of #37250 to avoid perf regression.

ulysses-you added a commit to ulysses-you/spark that referenced this pull request Aug 2, 2022
### What changes were proposed in this pull request?

Optimize Global sort to RepartitionByExpression, for example:
```
Sort local             Sort local
  Sort global    =>      RepartitionByExpression
```

### Why are the changes needed?

If a global sort below a local sort, the only meaningful thing is it's distribution. So this pr optimizes that global sort to RepartitionByExpression to save a local sort.

### Does this PR introduce _any_ user-facing change?

no, only improve performance

### How was this patch tested?

add test

Closes apache#37330 from ulysses-you/optimize-sort.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
ulysses-you added a commit to ulysses-you/spark that referenced this pull request Aug 2, 2022
### What changes were proposed in this pull request?

Optimize Global sort to RepartitionByExpression, for example:
```
Sort local             Sort local
  Sort global    =>      RepartitionByExpression
```

### Why are the changes needed?

If a global sort below a local sort, the only meaningful thing is it's distribution. So this pr optimizes that global sort to RepartitionByExpression to save a local sort.

### Does this PR introduce _any_ user-facing change?

no, only improve performance

### How was this patch tested?

add test

Closes apache#37330 from ulysses-you/optimize-sort.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
ulysses-you added a commit to ulysses-you/spark that referenced this pull request Aug 2, 2022
### What changes were proposed in this pull request?

Optimize Global sort to RepartitionByExpression, for example:
```
Sort local             Sort local
  Sort global    =>      RepartitionByExpression
```

### Why are the changes needed?

If a global sort below a local sort, the only meaningful thing is it's distribution. So this pr optimizes that global sort to RepartitionByExpression to save a local sort.

### Does this PR introduce _any_ user-facing change?

no, only improve performance

### How was this patch tested?

add test

Closes apache#37330 from ulysses-you/optimize-sort.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Aug 3, 2022
this is for backport #37330 into branch-3.3
### What changes were proposed in this pull request?

Optimize Global sort to RepartitionByExpression, for example:
```
Sort local             Sort local
  Sort global    =>      RepartitionByExpression
```

### Why are the changes needed?

If a global sort below a local sort, the only meaningful thing is it's distribution. So this pr optimizes that global sort to RepartitionByExpression to save a local sort.

### Does this PR introduce _any_ user-facing change?

we fix a bug in #37250 and that pr backport into branch-3.3. However, that fix may introduce performance regression. This pr itself is only to improve performance but in order to avoid the regression, we also backport this pr. see the details #37330 (comment)

### How was this patch tested?

add test

Closes #37330 from ulysses-you/optimize-sort.

Authored-by: ulysses-you <ulyssesyou18gmail.com>
Signed-off-by: Wenchen Fan <wenchendatabricks.com>

Closes #37373 from ulysses-you/SPARK-39911-3.3.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@maytasm
Copy link

maytasm commented Dec 1, 2023

@ulysses-you @cloud-fan
Not sure if I am missing something but can this causes performance degradation if my sort order is on a key with few/single values?
For example, if I have 500 shuffle partitions...
Without this change:

Sort local    
  Sort global   

both of the above stages would run with 500 tasks
With this change: and say the RepartitionByExpression is on the column date and there is only a single value for this in my dataset

Sort local 
   RepartitionByExpression

RepartitionByExpression will run with 500 tasks and create a single partition
then the Sort would run with a single task (as there is only a single partition)

@ulysses-you
Copy link
Contributor Author

hi @maytasm , Sort global is semantics equal to Sort local + RepartitionByExpression. For your case, if there is only single value on the sort column, then Sort global would also introduce a range partitioning shuffle and make a single partition task at the end. Can you show a real case if you find any behavior change ?

@maytasm
Copy link

maytasm commented Dec 22, 2023

@ulysses-you
Sorry for the late reply. That is true unless the following condition is True for the Sort global:

   case s @ Sort(orders, _, child) if orders.isEmpty || orders.exists(_.child.foldable) =>
      val newOrders = orders.filterNot(_.child.foldable)
      if (newOrders.isEmpty) {
        applyLocally.lift(child).getOrElse(child)
      }

In the case that the above condition is True, then this Sort global will be removed. However, with this change (in this PR), the Sort global will not be removed as it would have been turn into a RepartitionByExpression

Consider the following case:

+- Sort [date#0 ASC NULLS FIRST], false
   +- Project [...]
      +- Sort [20231221 ASC NULLS FIRST], true
         +- Aggregate [...]
               +- RelationV2[...] some_table

without the change from this PR, the Sort [20231221 ASC NULLS FIRST], true would be remove. We would end up with

+- Sort [date#0 ASC NULLS FIRST], false
   +- Project [...]
         +- Aggregate [...]
               +- RelationV2[...] some_table

This ran fast as we have high parallelism in all stages (many partitions -> many tasks running)
However, with the change in this PR:

+- Sort [date#0 ASC NULLS FIRST], false
   +- Project [...]
      +- Exchange rangepartition (20231221 ASC NULL FIRST)
         +- Aggregate [...]
               +- RelationV2[...] some_table

The problem here is that Exchange rangepartition reduces the partition to 1. Then the stages after that like the Sort on date#0 ran with a single task.

@cloud-fan
Copy link
Contributor

without the change from this PR, the Sort [20231221 ASC NULLS FIRST], true would be remove.

Are you sure? It looks wrong to do so

@maytasm
Copy link

maytasm commented Dec 22, 2023

@cloud-fan
Can you expand on why you think the Sort [20231221 ASC NULLS FIRST], true looks wrong to remove? It is a sort that has no reference and should be fine to remove.
Sort [20231221 ASC NULLS FIRST], true would look something like:
image
Since Literal is foldable, newOrders would be empty. As newOrders is empty, the sort is then removed.
This is explained in #11840

I believe this PR prevents the optimization added in #11840 to work as intended since it convert the no-op sort into RepartitionByExpression, which can no longer be removed

@cloud-fan
Copy link
Contributor

Oh now I get it. This is sort by a constant. Can you try the latest master branch? I think https://github.com/apache/spark/pull/44429/files#diff-11264d807efa58054cca2d220aae8fba644ee0f0f2a4722c46d52828394846efR214 has solved it.

@maytasm
Copy link

maytasm commented Dec 22, 2023

Ah! I haven't tried running yet but took a look at PR #44429 and I think it does solve this issue. So basically the logic for converting global sort to Exchange rangepartition is moved into another rule, RemoveRedundantSorts, and EliminateSorts is run before RemoveRedundantSorts. Meaning the no-op sort like Sort [20231221 ASC NULLS FIRST], true would be removed by EliminateSorts already by the time RemoveRedundantSorts is run.

@maytasm
Copy link

maytasm commented Dec 23, 2023

@cloud-fan Confirmed that PR #44429 fixed the issue I was having with the no-op sort!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants