[SPARK-46485][SQL] V1Write should not add Sort when not needed #44458
- val outputOrdering = query.outputOrdering
- val orderingMatched = isOrderingMatched(requiredOrdering, outputOrdering)
+ val outputOrdering = empty2NullPlan.outputOrdering
+ val orderingMatched = isOrderingMatched(requiredOrdering.map(_.child), outputOrdering)
What `def isOrderingMatched` does is `outputOrder.satisfies(outputOrder.copy(child = requiredOrder))`, so it's completely wrong to pass `requiredOrdering` as a `Seq[SortOrder]`: each required `SortOrder` ends up nested as the child of another `SortOrder`.
Good catch!
Yea, so it was never matched before...
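To make the point of this thread concrete, here is a minimal sketch of why the old call could never match. It uses toy stand-ins (`Expr`, `Attr`, a simplified `SortOrder`, and `OrderingMatchSketch` are all hypothetical names, not Spark's Catalyst classes), and the `satisfies` logic and the length check are simplifying assumptions. Only the `outputOrder.satisfies(outputOrder.copy(child = requiredOrder))` shape is taken from the review comment.

```scala
// Toy model, NOT Spark's real classes: just enough structure to show the bug.
sealed trait Expr
final case class Attr(name: String) extends Expr
final case class SortOrder(child: Expr, ascending: Boolean = true) extends Expr {
  // Simplified stand-in for SortOrder.satisfies: same child and same direction.
  def satisfies(required: SortOrder): Boolean =
    child == required.child && ascending == required.ascending
}

object OrderingMatchSketch {
  // Shape described in the review: each required expression is swapped in
  // as the child of the corresponding output SortOrder before comparing.
  // The length check is an assumption of this sketch.
  def isOrderingMatched(required: Seq[Expr], output: Seq[SortOrder]): Boolean =
    required.length <= output.length &&
      required.zip(output).forall { case (req, out) =>
        out.satisfies(out.copy(child = req))
      }

  val output: Seq[SortOrder] = Seq(SortOrder(Attr("p")))
  val required: Seq[SortOrder] = Seq(SortOrder(Attr("p")))

  // Old call: a SortOrder is nested inside a SortOrder, so the children differ.
  val before: Boolean = isOrderingMatched(required, output) // false
  // Fixed call: unwrap each required SortOrder to its underlying expression.
  val after: Boolean = isOrderingMatched(required.map(_.child), output) // true
}
```

In this model the old call compares `Attr("p")` against `SortOrder(Attr("p"))`, which is never equal, matching the observation that the ordering "was never matched before".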
  }.asInstanceOf[SortOrder])
- val outputOrdering = query.outputOrdering
- val orderingMatched = isOrderingMatched(requiredOrdering, outputOrdering)
+ val outputOrdering = empty2NullPlan.outputOrdering
I roughly remember that we previously did not support preserving ordering through `empty2null`, so we used `query.outputOrdering`. I think using `empty2NullPlan.outputOrdering` is the expected behavior.
Yea Project is an OrderPreservingUnaryNode so it should be fine.
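The attribute-id problem behind this thread can be sketched with a toy model. Everything here is hypothetical (`AttrRef`, `Alias`, `rewriteOrdering`, `Empty2NullSketch`) and much simpler than Spark's plans: it assumes the projection re-aliases the partition column under a fresh id and that an order-preserving projection reports its ordering in terms of its own output attributes. The `empty2null` function itself is left out.

```scala
// Toy model with hypothetical types; Spark's real attributes and plans are richer.
final case class AttrRef(name: String, exprId: Long)
// One projection entry: `in` is the child attribute, `out` the re-aliased
// attribute that carries a fresh id after the projection.
final case class Alias(in: AttrRef, out: AttrRef)

object Empty2NullSketch {
  val p: AttrRef = AttrRef("p", exprId = 1L)             // attribute of the user query
  val pAfterProject: AttrRef = AttrRef("p", exprId = 2L) // same column, new id

  val queryOrdering: Seq[AttrRef] = Seq(p)
  val projectList: Seq[Alias] = Seq(Alias(p, pAfterProject))

  // Assumed behavior of an order-preserving projection: express the child's
  // ordering using the projection's own output attributes.
  def rewriteOrdering(ordering: Seq[AttrRef], aliases: Seq[Alias]): Seq[AttrRef] =
    ordering.map(a => aliases.find(_.in == a).map(_.out).getOrElse(a))

  val projectOrdering: Seq[AttrRef] = rewriteOrdering(queryOrdering, projectList)

  // The writer's required ordering refers to the post-projection attribute.
  val requiredOrdering: Seq[AttrRef] = Seq(pAfterProject)

  val matchedAgainstQuery: Boolean = requiredOrdering == queryOrdering     // false
  val matchedAgainstProject: Boolean = requiredOrdering == projectOrdering // true
}
```

Under these assumptions, comparing against the pre-projection ordering fails purely because of the id change, while comparing against the projected plan's ordering succeeds.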
Thanks for the review, merging to master!

What do you think about making user-desired order of partitions explicit by opening  Instead of users can explicitly sort the partitions: Then that desire is explicitly available to the writer and does not need to be derived from the plan.

Ideally, users shouldn't care about optimal ordering during data writing. The data source should be smart enough to auto-optimize its data layout. This API goes against the eventual goal.

This is not about optimal ordering (I presume you refer to partitions being ordered by partition columns, which is optimal to have only one file writer open at a time), but about additional ordering (to have some additional order that is not required by the writer task). Having sorted partitions is very useful when your downstream systems that consume the written data can expect some order beyond partition keys. So users care about the in-partition order. I am happy as long as

Looks good to me.

@EnricoMi Like your idea, so for now if you want to sort per nested partition output you will need to use

I found this PR when trying to backport SPARK-53738 (#52584) to branch-3.5. From my understanding, this PR fixes a hidden bug (until exposed by #44429) that has existed since 3.4. If so, I'd like to backport this to branch-3.5, as it is a prerequisite for backporting SPARK-53738. @cloud-fan @viirya @yaooqinn @allisonwang-db @ulysses-you @peter-toth @EnricoMi WDYT?

+1 to backport

### What changes were proposed in this pull request?

In `V1Writes`, we try to avoid adding Sort if the output ordering always satisfies. However, the code is completely broken with two issues:
- we put `SortOrder` as the child of another `SortOrder` and compare, which always returns false.
- once we add a project to do `empty2null`, we change the query output attribute id and the sort order never matches.

It's not a big issue as we still have QO rules to eliminate useless sorts, but apache#44429 exposes this problem because the way we optimize sort is a bit different. For `V1Writes`, we should always avoid adding sort even if the number of ordering key is less, to not change the user query.

### Why are the changes needed?

fix code mistakes.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#44458 from cloud-fan/sort.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
          
SPARK-35738 -> SPARK-53738, but I agree, let's backport this and that PR.
    
@peter-toth yes, it's a typo, should be SPARK-53738, sorry for the confusion.

Backport #44458 to branch-3.5. Justification: it fixes a hidden bug (until exposed by #44429) that has existed since 3.4.

### What changes were proposed in this pull request?

In `V1Writes`, we try to avoid adding Sort if the output ordering always satisfies. However, the code is completely broken with two issues:
- we put `SortOrder` as the child of another `SortOrder` and compare, which always returns false.
- once we add a project to do `empty2null`, we change the query output attribute id and the sort order never matches.

It's not a big issue as we still have QO rules to eliminate useless sorts, but #44429 exposes this problem because the way we optimize sort is a bit different. For `V1Writes`, we should always avoid adding sort even if the number of ordering key is less, to not change the user query.

### Why are the changes needed?

fix code mistakes.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #52692 from pan3793/SPARK-46485-3.5.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
### What changes were proposed in this pull request?

In `V1Writes`, we try to avoid adding Sort if the output ordering always satisfies. However, the code is completely broken with two issues:
- we put `SortOrder` as the child of another `SortOrder` and compare, which always returns false.
- once we add a project to do `empty2null`, we change the query output attribute id and the sort order never matches.

It's not a big issue as we still have QO rules to eliminate useless sorts, but #44429 exposes this problem because the way we optimize sort is a bit different. For `V1Writes`, we should always avoid adding sort even if the number of ordering key is less, to not change the user query.

### Why are the changes needed?

fix code mistakes.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated test

### Was this patch authored or co-authored using generative AI tooling?

no