Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

In `V1Writes`, we try to avoid adding a Sort if the query's output ordering already satisfies the required ordering. However, the code is completely broken, with two issues:

  • we put a `SortOrder` as the child of another `SortOrder` and compare them, which always returns false.
  • once we add a project to do `empty2null`, we change the query's output attribute ids, so the sort order never matches.

It's not a big issue, as we still have QO rules to eliminate useless sorts, but #44429 exposes this problem because it optimizes sorts in a slightly different way. For `V1Writes`, we should avoid adding a Sort even when the query's output ordering covers fewer keys than required, so that we do not change the user query.
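The first bug can be sketched with a minimal, self-contained model. The classes below are illustrative stand-ins, not Spark's real Catalyst `SortOrder`/`satisfies` (which compare via semantic equality over much richer expression trees), but they reproduce the shape of the bug: wrapping the required `SortOrder` as the child of another `SortOrder` means the children can never match.

```scala
// Hypothetical, simplified model of the broken check in V1Writes.
sealed trait Expression
case class AttributeRef(name: String) extends Expression
case class SortOrder(child: Expression) extends Expression {
  // Simplified "satisfies": the two orderings must sort by equal child expressions.
  def satisfies(required: SortOrder): Boolean = child == required.child
}

object Demo {
  val col = AttributeRef("id")
  val outputOrder = SortOrder(col)
  // Buggy call: the required ordering (itself a SortOrder) is wrapped as the
  // child of another SortOrder, so the comparison is col == SortOrder(col),
  // which is always false.
  val buggy = outputOrder.satisfies(outputOrder.copy(child = SortOrder(col)))
  // Fixed call: pass the underlying expression, i.e. requiredOrdering.map(_.child).
  val fixed = outputOrder.satisfies(outputOrder.copy(child = col))
  def main(args: Array[String]): Unit =
    println(s"buggy=$buggy fixed=$fixed")
}
```

This mirrors why the patch changes the call site to `isOrderingMatched(requiredOrdering.map(_.child), outputOrdering)`.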

Why are the changes needed?

fix code mistakes.

Does this PR introduce any user-facing change?

no

How was this patch tested?

updated test

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Dec 22, 2023
```diff
-    val outputOrdering = query.outputOrdering
-    val orderingMatched = isOrderingMatched(requiredOrdering, outputOrdering)
+    val outputOrdering = empty2NullPlan.outputOrdering
+    val orderingMatched = isOrderingMatched(requiredOrdering.map(_.child), outputOrdering)
```
Contributor Author

What `def isOrderingMatched` does is `outputOrder.satisfies(outputOrder.copy(child = requiredOrder))`, so it's completely wrong to pass `requiredOrdering` as a `Seq[SortOrder]`.

Contributor

Good catch!

Member

Yeah, so it never matched before...

@cloud-fan
Contributor Author

cloud-fan commented Dec 22, 2023

```diff
-    val outputOrdering = query.outputOrdering
-    val orderingMatched = isOrderingMatched(requiredOrdering, outputOrdering)
+    val outputOrdering = empty2NullPlan.outputOrdering
+    val orderingMatched = isOrderingMatched(requiredOrdering.map(_.child), outputOrdering)
```
Contributor

Good catch!

```diff
     }.asInstanceOf[SortOrder])
-    val outputOrdering = query.outputOrdering
-    val orderingMatched = isOrderingMatched(requiredOrdering, outputOrdering)
+    val outputOrdering = empty2NullPlan.outputOrdering
```
Contributor

I roughly remember that we previously did not support preserving ordering through `empty2null`, which is why `query.outputOrdering` was used. I think using `empty2NullPlan.outputOrdering` is the expected behavior.

Contributor

Yeah, `Project` is an `OrderPreservingUnaryNode`, so it should be fine.
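The order-preserving point can be made concrete with a hedged, simplified sketch. The classes below are hypothetical, not Spark's real `OrderPreservingUnaryNode` API: the idea is only that a projection does not reorder rows, so its child's output ordering survives as long as the ordering columns are still produced.

```scala
// Hypothetical model: a plan node exposes its output ordering as a column-name prefix.
sealed trait Plan { def outputOrdering: Seq[String] }
case class Relation(sortedBy: Seq[String]) extends Plan {
  def outputOrdering: Seq[String] = sortedBy
}
// A projection keeps row order; the usable ordering is the longest prefix of the
// child's ordering whose columns the projection still outputs.
case class Project(columns: Set[String], child: Plan) extends Plan {
  def outputOrdering: Seq[String] = child.outputOrdering.takeWhile(columns.contains)
}

object OrderDemo {
  val scanned = Relation(Seq("id", "time"))
  // An empty2null-style projection still outputs both ordering columns,
  // so the child's ordering survives intact.
  val kept = Project(Set("id", "time", "extra"), scanned)
  // Pruning the leading ordering column loses the usable ordering prefix.
  val pruned = Project(Set("time"), scanned)
  def main(args: Array[String]): Unit = {
    println(kept.outputOrdering.mkString(","))
    println(pruned.outputOrdering.mkString(","))
  }
}
```

In this toy model, `kept.outputOrdering` is still `Seq("id", "time")`, which is why reading the ordering from `empty2NullPlan` (a `Project` over the query) is safe.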

@cloud-fan
Contributor Author

Thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in cb2f47b Dec 22, 2023
@EnricoMi
Contributor

What do you think about making the user-desired order of partitions explicit by opening up `.write.sortBy` to `.write.partitionBy`? Right now, `.write.sortBy` is exclusively used by bucketing (`.write.bucketBy`).

Instead of

```scala
df.sortWithinPartitions("id", "time").write.partitionBy("id")
```

users can explicitly sort the partitions:

```scala
df.write.partitionBy("id").sortBy("id", "time")
```

Then that desire is explicitly available to the writer and does not need to be derived from the plan.

@cloud-fan
Contributor Author

Ideally, users shouldn't care about optimal ordering during data writing. The data source should be smart enough to auto-optimize its data layout. This API goes against the eventual goal.

@EnricoMi
Contributor

This is not about optimal ordering (I presume you refer to partitions being ordered by partition columns, which is optimal because only one file writer needs to be open at a time), but about additional ordering (some order beyond what the writer task requires). Having sorted partitions is very useful when downstream systems that consume the written data can rely on an order beyond the partition keys. So users do care about the in-partition order.

I am happy as long as `df.repartition("id").sortWithinPartitions("id", "time").write.partitionBy("id")` remains supported.

@viirya
Member

viirya commented Dec 22, 2023

Looks good to me.

@sweetpythoncode

@EnricoMi I like your idea. So for now, if you want to sort per nested-partition output, you would need to use

`df.repartition("id", "nested_id").sortWithinPartitions("id", "nested_id", "time").write.partitionBy("id", "nested_id")`?

@pan3793
Member

pan3793 commented Oct 21, 2025

I found this PR when trying to backport SPARK-53738 (#52584) to branch-3.5.

From my understanding, this PR fixes a hidden bug (unnoticed until exposed by #44429) that has existed since 3.4. If so, I'd like to backport it to branch-3.5; it's a pre-step for backporting SPARK-53738.

@cloud-fan @viirya @yaooqinn @allisonwang-db @ulysses-you @peter-toth @EnricoMi WDYT?

@cloud-fan
Contributor Author

+1 to backport

pan3793 pushed a commit to pan3793/spark that referenced this pull request Oct 22, 2025

Closes apache#44458 from cloud-fan/sort.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@peter-toth
Contributor

> I found this PR when trying to backport SPARK-35738 (#52584) to branch-3.5.

SPARK-35738 -> SPARK-53738, but I agree, let's backport this and that PR.

@pan3793
Member

pan3793 commented Oct 22, 2025

@peter-toth yes, it's a typo; it should be SPARK-53738. Sorry for the confusion.

peter-toth pushed a commit that referenced this pull request Oct 22, 2025
Backport #44458 to branch-3.5.

Justification: it fixes a hidden bug (until exposed by #44429) that has existed since 3.4.

Closes #52692 from pan3793/SPARK-46485-3.5.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>