Skip to content

Conversation

@chirag-s-db
Copy link
Contributor

@chirag-s-db chirag-s-db commented Nov 17, 2025

What changes were proposed in this pull request?

The InternalRowComparableWrapper util is often used in a very hot-path for physical planning (most often, to compare partition values for key-grouped partitioned scans). While the current implementation does schema lookup that each instance uses to create a new instance of this object, this cache lookup itself can become a bottleneck for planning when there are large numbers of partitions. This PR adds a new InternalRowComparableWrapper factory for this util that has a precomputed schema and ordering that can be shared across multiple objects, removing this schema or cache lookup from the hot-path for physical planning.

Why are the changes needed?

Removes a physical planning bottleneck.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This change should not change any behavior (existing tests should suffice).

This PR also includes changes to the InternalRowComparableWrapperBenchmark to use these new utils. Results before change:

[info] internal row comparable wrapper:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] toSet                                                74             76           2          2.7         367.5       1.0X
[info] mergePartitions                                     136            143          11          1.5         680.0       0.5X
[success] Total time: 11 s, completed Nov 17, 2025, 2:29:22 PM

Results after change:

[info] internal row comparable wrapper:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] toSet                                                13             13           1         15.9          62.9       1.0X
[info] mergePartitions                                      17             17           1         11.8          84.7       0.7X

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Nov 17, 2025
@chirag-s-db
Copy link
Contributor Author

@cloud-fan @szehon-ho Could you take a look at this PR when you get the chance?

@chirag-s-db chirag-s-db changed the title [SPARK-54383][SQL] Create BoundInternalRowComparableWrapper util to avoid schema cache lookups [SPARK-54383][SQL] Add precomputed schema variant for InternalRowComparableWrapper util Nov 17, 2025
StructType(dataTypes.map(t => StructField("f", t))) ->
RowOrdering.createNaturalAscendingOrdering(dataTypes)

def mergePartitions(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like the only caller for this method in the original class (InternalRowComparableWrapper) is the InternalRowComparableWrapper benchmark. Should we just migrate that over and deprecate this method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, if you have time, be interesting to see the numbers after running the benchmark against the new class,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attached some benchmarks in PR description.

* Effectively the same as [[InternalRowComparableWrapper]], but using a precomputed `ordering`
* and `structType` to avoid the cache lookup for each row.
*/
class BoundInternalRowComparableWrapper(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As there's no checks now that derive structType/ordering to dataType, it seems a bit dangerous to not include them in hash/equals. Should we do that?

Alternatively we could also keep the binding by making a factory and keep this class constructor pviate, ie

object BoundInternalRowComparableFactory(dataTypes) {
    val structType = getStructType(dataTypes);
    val ordering = getOrdering(dataTypes)
    def newBoundInternalRowComparableWrapper(row) => BoundInternalRowComparableWrapper(row, structType, ordering, dataTypes)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion - I actually removed the new BoundInternalRowComparableWrapper and used this pattern on the original InternalRowComparableWrapper (since it would work equally well there as well). I kept the original constructor for binary compatibility, but marked it deprecated for the reasons explained in the comment.

Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! FYI @sunchao as well

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chirag-s-db @szehon-ho , this LGTM as well. Pending CI.

@cloud-fan
Copy link
Contributor

Seems a real test failure in KeyGroupedPartitioningSuite

@chirag-s-db
Copy link
Contributor Author

@cloud-fan Fixed here: c5ae5b7 (was missing one parameter in a migrated method)

@sunchao sunchao closed this in 13fea4f Nov 18, 2025
@sunchao
Copy link
Member

sunchao commented Nov 18, 2025

Thanks! Merged to master.

huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
…arableWrapper util

### What changes were proposed in this pull request?
The InternalRowComparableWrapper util is often used in a very hot-path for physical planning (most often, to compare partition values for key-grouped partitioned scans). While the current implementation does schema lookup that each instance uses to create a new instance of this object, this cache lookup itself can become a bottleneck for planning when there are large numbers of partitions. This PR adds a new InternalRowComparableWrapper factory for this util that has a precomputed schema and ordering that can be shared across multiple objects, removing this schema or cache lookup from the hot-path for physical planning.

### Why are the changes needed?
Removes a physical planning bottleneck.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This change should not change any behavior (existing tests should suffice).

This PR also includes changes to the `InternalRowComparableWrapperBenchmark` to use these new utils. Results before change:
```
[info] internal row comparable wrapper:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] toSet                                                74             76           2          2.7         367.5       1.0X
[info] mergePartitions                                     136            143          11          1.5         680.0       0.5X
[success] Total time: 11 s, completed Nov 17, 2025, 2:29:22 PM
```

Results after change:
```
[info] internal row comparable wrapper:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] toSet                                                13             13           1         15.9          62.9       1.0X
[info] mergePartitions                                      17             17           1         11.8          84.7       0.7X

```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#53097 from chirag-s-db/birc.

Authored-by: Chirag Singh <chirag.singh@databricks.com>
Signed-off-by: Chao Sun <chao@openai.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants