Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-33183][SQL][3.0] Fix Optimizer rule EliminateSorts and add a physical rule to remove redundant sorts #30195

Closed

Conversation

allisonwang-db
Copy link
Contributor

Backport #30093 for branch-3.0. I've updated the configuration version to 2.4.8.

What changes were proposed in this pull request?

This PR aims to fix a correctness bug in the optimizer rule EliminateSorts. It also adds a new physical rule to remove redundant sorts that cannot be eliminated in the Optimizer rule after the bugfix.

Why are the changes needed?

A global sort should not be eliminated even if its child is ordered since we don't know if its child ordering is global or local. For example, in the following scenario, the first sort shouldn't be removed because it has a stronger guarantee than the second sort even if the sort orders are the same for both sorts.

Sort(orders, global = True, ...)
  Sort(orders, global = False, ...)

Since there is no straightforward way to identify whether a node's output ordering is local or global, we should not remove a global sort even if its child is already ordered.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Unit tests

…al rule to remove redundant sorts

This PR aims to fix a correctness bug in the optimizer rule `EliminateSorts`. It also adds a new physical rule to remove redundant sorts that cannot be eliminated in the Optimizer rule after the bugfix.

A global sort should not be eliminated even if its child is ordered since we don't know if its child ordering is global or local. For example, in the following scenario, the first sort shouldn't be removed because it has a stronger guarantee than the second sort even if the sort orders are the same for both sorts.

```
Sort(orders, global = True, ...)
  Sort(orders, global = False, ...)
```

Since there is no straightforward way to identify whether a node's output ordering is local or global, we should not remove a global sort even if its child is already ordered.

Yes

Unit tests

Closes apache#30093 from allisonwang-db/fix-sort.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 9fb4536)
Signed-off-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
@SparkQA
Copy link

SparkQA commented Oct 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35038/

@SparkQA
Copy link

SparkQA commented Oct 30, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35038/

@SparkQA
Copy link

SparkQA commented Oct 30, 2020

Test build #130433 has finished for PR 30195 at commit 9526dee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

GA passed, merging to 3.0!

cloud-fan pushed a commit that referenced this pull request Oct 30, 2020
…hysical rule to remove redundant sorts

Backport #30093 for branch-3.0. I've updated the configuration version to 2.4.8.

### What changes were proposed in this pull request?
This PR aims to fix a correctness bug in the optimizer rule EliminateSorts. It also adds a new physical rule to remove redundant sorts that cannot be eliminated in the Optimizer rule after the bugfix.

### Why are the changes needed?
A global sort should not be eliminated even if its child is ordered since we don't know if its child ordering is global or local. For example, in the following scenario, the first sort shouldn't be removed because it has a stronger guarantee than the second sort even if the sort orders are the same for both sorts.
```
Sort(orders, global = True, ...)
  Sort(orders, global = False, ...)
```
Since there is no straightforward way to identify whether a node's output ordering is local or global, we should not remove a global sort even if its child is already ordered.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Unit tests

Closes #30195 from allisonwang-db/SPARK-33183-branch-3.0.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan cloud-fan closed this Oct 30, 2020
@maropu
Copy link
Member

maropu commented Oct 30, 2020

late LGTM. Thanks for the backporting.

@allisonwang-db allisonwang-db deleted the SPARK-33183-branch-3.0 branch January 19, 2024 01:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants