feat: add MapSort expression support for Spark 4.0#4076

Open
andygrove wants to merge 3 commits into apache:main from andygrove:feat/map-sort-spark4

andygrove commented Apr 24, 2026

Which issue does this PR close?

Closes #1941
Closes #3171

Rationale for this change

Spark 4.0 introduces MapSort, used for normalizing map values when they appear in shuffle hash partitioning keys, in try_element_at, and in other contexts where map ordering must be deterministic. Without native support, queries that touch maps in any of these positions fall back to Spark, which forces the entire enclosing operator off Comet (e.g. an entire shuffle exchange).
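To illustrate why map ordering matters for hash partitioning, here is a minimal sketch using only the Rust standard library. The functions `hash_entries` and `hash_normalized` are hypothetical and are not part of Comet or Spark; the sketch assumes only that a hash is computed over entries in storage order, which is the situation MapSort is meant to normalize.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash map entries in the order they are stored (illustrative only,
/// not the hash function Spark or Comet actually uses).
fn hash_entries(entries: &[(i32, &str)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    entries.hash(&mut hasher);
    hasher.finish()
}

/// Sort a copy of the entries by key before hashing, so that two maps
/// with the same entries hash identically regardless of storage order.
fn hash_normalized(entries: &[(i32, &str)]) -> u64 {
    let mut sorted = entries.to_vec();
    sorted.sort_by_key(|(k, _)| *k);
    hash_entries(&sorted)
}
```

With `a = [(1, "x"), (2, "y")]` and `b = [(2, "y"), (1, "x")]`, `hash_normalized(&a)` equals `hash_normalized(&b)`, whereas `hash_entries` sees two different sequences and would route logically equal maps to different partitions.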

What changes are included in this PR?

  • New native scalar function map_sort in native/spark-expr/src/map_funcs/map_sort.rs that sorts map entries by key in ascending order, registered via comet_scalar_funcs.rs.
  • Wire MapSort into the Spark 4.0 CometExprShim so the expression is converted to the new scalar function during serde.
  • The "columnar shuffle on map array element" test in CometColumnarShuffleSuite now expects shuffle fallback on Spark 4.0+: the new shuffle-key normalization wraps mapsort inside transform(arr, x -> mapsort(x)), and Comet does not currently support ArrayTransform with a lambda body. Answer correctness is still verified via checkSparkAnswer.
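The sorting behavior described in the first bullet can be sketched row-wise as follows. This is not the code in map_sort.rs, which operates on Arrow arrays; `MapEntries` and `map_sort_rows` are hypothetical names, and the key type is fixed to `i64` for brevity.

```rust
/// One map value: (key, value) pairs in storage order. Values may be null.
type MapEntries = Vec<(i64, Option<String>)>;

/// Sort each non-null map by key in ascending order.
/// A null map (None) passes through unchanged, and an empty map stays empty.
fn map_sort_rows(rows: Vec<Option<MapEntries>>) -> Vec<Option<MapEntries>> {
    rows.into_iter()
        .map(|row| {
            row.map(|mut entries| {
                // Spark map keys are never null, so a plain key comparison suffices.
                entries.sort_by(|a, b| a.0.cmp(&b.0));
                entries
            })
        })
        .collect()
}
```

Calling `map_sort_rows` on a column holding one unsorted map, one null, and one empty map returns the first map with keys in ascending order and leaves the other two rows as they were.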

How are these changes tested?

  • New unit tests in native/spark-expr/src/map_funcs/map_sort.rs cover sorting on each supported key type, null handling, and empty maps.
  • Existing CometColumnarShuffleSuite tests for map shuffle keys all pass under the Spark 4.0 profile (41/41).

andygrove and others added 2 commits April 24, 2026 17:44
Add native map_sort scalar function that sorts map entries by key in
ascending order, and wire it up via the Spark 4.0 CometExprShim so that
MapSort expressions are accelerated instead of falling back to Spark.
Re-enable all CometColumnarShuffleSuite map tests that were skipped for
Spark 4.0.

Closes apache#1941

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Spark 4.0 normalizes shuffle keys containing array<map> via
transform(arr, x -> mapsort(x)), which Comet does not yet support
because ArrayTransform with a lambda body has no serde. Mark the
columnar shuffle on map array element test as expecting the fallback
on Spark 4.0+ while still verifying answer correctness.

The MapSort serde for Spark 4.0 called scalarFunctionExprToProto without a
return type. The Rust planner then looked up "map_sort" in the session
UDF registry to infer the type, but map_sort is only handled via the
create_comet_physical_fun match dispatch, not registered as a UDF, causing
"There is no UDF named 'map_sort' in the registry" at execution time
(e.g., group-by on a map column in CollationSuite).

Pass ms.dataType explicitly via scalarFunctionExprToProtoWithReturnType,
matching the pattern used by ceil, floor, and other scalar functions.
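
The failure mode described in this commit can be sketched as below. This is a simplified model, not the Comet planner: `DataType`, `resolve_return_type`, and the registry shape are all hypothetical. It only shows why an explicit return type avoids the registry lookup that fails for functions handled by match dispatch.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a real data type enum.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Map,
}

/// Hypothetical planner step: use the return type sent by the serde if
/// present, otherwise fall back to looking the function up in a UDF registry.
fn resolve_return_type(
    name: &str,
    explicit: Option<DataType>,
    udf_registry: &HashMap<String, DataType>,
) -> Result<DataType, String> {
    match explicit {
        // The serde supplied the type, so no registry lookup is needed.
        Some(dt) => Ok(dt),
        // No type supplied: functions that are only reachable through
        // match dispatch are absent from the registry and fail here.
        None => udf_registry
            .get(name)
            .cloned()
            .ok_or_else(|| format!("There is no UDF named '{}' in the registry", name)),
    }
}
```

With an empty registry, `resolve_return_type("map_sort", None, &registry)` returns the error, while passing `Some(DataType::Map)` succeeds without touching the registry.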

Development

Successfully merging this pull request may close these issues.

  • [Feature] Support Spark expression: map_sort
  • Add support for MapSort expression in Spark 4.0.0