[SPARK-54383][SQL] Add precomputed schema variant for InternalRowComparableWrapper util #53097

chirag-s-db · 2025-11-17T17:02:39Z

What changes were proposed in this pull request?

The InternalRowComparableWrapper util is often used in a very hot path for physical planning (most often, to compare partition values for key-grouped partitioned scans). The current implementation performs a schema cache lookup every time a new instance of this object is created, and this cache lookup can itself become a bottleneck for planning when there are large numbers of partitions. This PR adds a new factory for InternalRowComparableWrapper that holds a precomputed schema and ordering shared across multiple objects, removing the cache lookup from the hot path for physical planning.
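To make the cost difference concrete, here is a minimal, self-contained sketch (not Spark code). `SimpleType`, `OrderingCache`, `LookupWrapper`, `WrapperFactory`, and `SharedWrapper` are hypothetical stand-ins invented for illustration; the real util works on `InternalRow` and Spark data types. The sketch only counts how many cache lookups each approach performs.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for Spark's data types; not the real classes.
sealed trait SimpleType
case object IntT extends SimpleType
case object StrT extends SimpleType

// A shared cache keyed by the field types, with a counter so lookups are visible.
object OrderingCache {
  var lookups: Int = 0
  private val cache = mutable.Map.empty[Seq[SimpleType], Ordering[Seq[Any]]]

  def get(types: Seq[SimpleType]): Ordering[Seq[Any]] = synchronized {
    lookups += 1
    cache.getOrElseUpdate(types, build(types))
  }

  // Field-by-field ascending comparison; the first non-zero result wins.
  private def build(types: Seq[SimpleType]): Ordering[Seq[Any]] =
    new Ordering[Seq[Any]] {
      def compare(a: Seq[Any], b: Seq[Any]): Int =
        types.indices.iterator.map { i =>
          types(i) match {
            case IntT => Integer.compare(a(i).asInstanceOf[Int], b(i).asInstanceOf[Int])
            case StrT => a(i).asInstanceOf[String].compareTo(b(i).asInstanceOf[String])
          }
        }.find(_ != 0).getOrElse(0)
    }
}

// Per-instance approach: every wrapper hits the cache in its constructor.
final class LookupWrapper(val row: Seq[Any], types: Seq[SimpleType]) {
  private val ordering = OrderingCache.get(types)
  override def equals(other: Any): Boolean = other match {
    case that: LookupWrapper => ordering.equiv(row, that.row)
    case _ => false
  }
  override def hashCode: Int = row.hashCode
}

// Precomputed approach: the ordering is resolved once and shared by all wrappers.
final class SharedWrapper(val row: Seq[Any], ordering: Ordering[Seq[Any]]) {
  override def equals(other: Any): Boolean = other match {
    case that: SharedWrapper => ordering.equiv(row, that.row)
    case _ => false
  }
  override def hashCode: Int = row.hashCode
}

final class WrapperFactory(types: Seq[SimpleType]) {
  private val ordering = OrderingCache.get(types) // the only lookup
  def wrap(row: Seq[Any]): SharedWrapper = new SharedWrapper(row, ordering)
}

object PrecomputedDemo {
  val types: Seq[SimpleType] = Seq(IntT, StrT)
  val rows: Seq[Seq[Any]] = Seq(Seq(1, "a"), Seq(2, "b"), Seq(1, "a"))

  // Runs `body` and reports how many cache lookups it triggered.
  def lookupsFor(body: => Any): Int = {
    val before = OrderingCache.lookups
    body
    OrderingCache.lookups - before
  }

  val perInstanceLookups: Int =
    lookupsFor(rows.map(r => new LookupWrapper(r, types)).toSet)
  val factoryLookups: Int =
    lookupsFor { val f = new WrapperFactory(types); rows.map(f.wrap).toSet }
  val distinctRows: Int = {
    val f = new WrapperFactory(types)
    rows.map(f.wrap).toSet.size
  }
}
```

In this toy model the per-instance path performs one lookup per row, while the factory path performs one lookup in total, which is the effect the PR is after when the number of partitions is large.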

Why are the changes needed?

Removes a physical planning bottleneck.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This change should not change any behavior (existing tests should suffice).

This PR also includes changes to the InternalRowComparableWrapperBenchmark to use these new utils. Results before change:

[info] internal row comparable wrapper:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] toSet                                                74             76           2          2.7         367.5       1.0X
[info] mergePartitions                                     136            143          11          1.5         680.0       0.5X
[success] Total time: 11 s, completed Nov 17, 2025, 2:29:22 PM

Results after change:

[info] internal row comparable wrapper:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] toSet                                                13             13           1         15.9          62.9       1.0X
[info] mergePartitions                                      17             17           1         11.8          84.7       0.7X
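
Note that the `Relative` column compares cases within a single run, so it does not show the before/after improvement. As a rough aid, the sketch below derives the cross-run speedup from the `Best Time(ms)` values in the two tables; the object and value names are made up for this illustration.

```scala
object BenchmarkSpeedup {
  // Best Time (ms) copied from the "before" and "after" benchmark tables.
  val before: Map[String, Double] = Map("toSet" -> 74.0, "mergePartitions" -> 136.0)
  val after: Map[String, Double] = Map("toSet" -> 13.0, "mergePartitions" -> 17.0)

  // Speedup = time before / time after, per benchmark case.
  val speedup: Map[String, Double] =
    before.map { case (name, ms) => name -> ms / after(name) }

  def report(): Unit =
    speedup.foreach { case (name, x) => println(f"$name%-16s $x%.2fx") }
}
```

By this measure `toSet` is roughly 5.7x faster and `mergePartitions` is 8x faster on this single run.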

Was this patch authored or co-authored using generative AI tooling?

No.

chirag-s-db · 2025-11-17T17:03:47Z

@cloud-fan @szehon-ho Could you take a look at this PR when you get the chance?

szehon-ho · 2025-11-17T21:52:12Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/util/InternalRowComparableWrapper.scala

+    StructType(dataTypes.map(t => StructField("f", t))) ->
+      RowOrdering.createNaturalAscendingOrdering(dataTypes)
+
+  def mergePartitions(


It looks like the only caller of this method in the original class (InternalRowComparableWrapper) is the InternalRowComparableWrapper benchmark. Should we just migrate that over and deprecate this method?

Also, if you have time, it would be interesting to see the numbers after running the benchmark against the new class.

Attached some benchmarks in PR description.

szehon-ho · 2025-11-17T22:11:15Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/util/InternalRowComparableWrapper.scala

+ * Effectively the same as [[InternalRowComparableWrapper]], but using a precomputed `ordering`
+ * and `structType` to avoid the cache lookup for each row.
+ */
+class BoundInternalRowComparableWrapper(


Since nothing now checks that structType/ordering are derived from dataTypes, it seems a bit dangerous not to include them in hash/equals. Should we do that?

Alternatively, we could keep the binding by making a factory and keeping this class constructor private, i.e.

    object BoundInternalRowComparableFactory(dataTypes) {
      val structType = getStructType(dataTypes)
      val ordering = getOrdering(dataTypes)
      def newBoundInternalRowComparableWrapper(row) =>
        BoundInternalRowComparableWrapper(row, structType, ordering, dataTypes)
    }

Good suggestion. I actually removed the new BoundInternalRowComparableWrapper and applied this pattern to the original InternalRowComparableWrapper (since it works equally well there). I kept the original constructor for binary compatibility, but marked it deprecated for the reasons explained in the comment.
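
For readers following the thread, the pattern discussed here can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code merged in this PR: `RowWrapper`, `RowWrapper.Factory`, and `naturalOrdering` are hypothetical names, and rows are plain `Seq[Int]` rather than `InternalRow`. The primary constructor is private so that only the factory can pair a row with an ordering, while a deprecated public constructor remains for existing callers.

```scala
// Hypothetical simplified model of the factory pattern; names are illustrative only.
final class RowWrapper private (val row: Seq[Int], ordering: Ordering[Seq[Int]]) {

  // Kept so that existing callers still compile; it rebuilds the ordering per instance.
  @deprecated("Prefer RowWrapper.Factory, which computes the ordering once", "sketch")
  def this(row: Seq[Int]) = this(row, RowWrapper.naturalOrdering(row.length))

  override def equals(other: Any): Boolean = other match {
    case that: RowWrapper => ordering.equiv(row, that.row)
    case _ => false
  }
  override def hashCode: Int = row.hashCode
}

object RowWrapper {
  // Ascending field-by-field comparison over the first `arity` fields.
  def naturalOrdering(arity: Int): Ordering[Seq[Int]] =
    new Ordering[Seq[Int]] {
      def compare(a: Seq[Int], b: Seq[Int]): Int =
        (0 until arity).iterator
          .map(i => Integer.compare(a(i), b(i)))
          .find(_ != 0)
          .getOrElse(0)
    }

  // The factory owns the precomputed ordering; wrappers it creates cannot be
  // paired with a mismatched ordering because the primary constructor is private.
  final class Factory(arity: Int) {
    private val ordering = naturalOrdering(arity)
    def apply(row: Seq[Int]): RowWrapper = {
      require(row.length == arity, s"expected $arity fields, got ${row.length}")
      new RowWrapper(row, ordering)
    }
  }
}

object FactoryDemo {
  val factory = new RowWrapper.Factory(arity = 2)
  // Duplicate partition values collapse once wrapped and placed in a Set.
  val merged: Set[RowWrapper] =
    Seq(Seq(1, 2), Seq(3, 4), Seq(1, 2)).map(factory(_)).toSet
}
```

Because the binding between row shape and ordering lives in the factory, `equals` and `hashCode` do not need to include the schema or ordering, which addresses the concern raised above in this simplified setting.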

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/util/InternalRowComparableWrapper.scala

szehon-ho

Thanks! FYI @sunchao as well

sunchao

Thanks @chirag-s-db @szehon-ho, this LGTM as well. Pending CI.

cloud-fan · 2025-11-18T10:12:46Z

Seems like a real test failure in KeyGroupedPartitioningSuite.

chirag-s-db · 2025-11-18T16:30:08Z

@cloud-fan Fixed here: c5ae5b7 (was missing one parameter in a migrated method)

sunchao · 2025-11-18T21:36:12Z

Thanks! Merged to master.

…arableWrapper util

Closes apache#53097 from chirag-s-db/birc.
Authored-by: Chirag Singh <chirag.singh@databricks.com> Signed-off-by: Chao Sun <chao@openai.com>

fix

5c13418

github-actions bot added the SQL label Nov 17, 2025

chirag-s-db changed the title ~~[SPARK-54383][SQL] Create BoundInternalRowComparableWrapper util to avoid schema cache lookups~~ [SPARK-54383][SQL] Add precomputed schema variant for InternalRowComparableWrapper util Nov 17, 2025

szehon-ho reviewed Nov 17, 2025


chirag-s-db added 2 commits November 17, 2025 14:50

fix

aaef084

fixes

e234fdc

chirag-s-db requested a review from szehon-ho November 17, 2025 23:02

szehon-ho approved these changes Nov 18, 2025


sunchao approved these changes Nov 18, 2025


cloud-fan approved these changes Nov 18, 2025


fix

c5ae5b7

sunchao closed this in 13fea4f Nov 18, 2025
