[SPARK-50593][SQL] SPJ: Support truncate transform via generalized ReducibleFunction API by metanil · Pull Request #55885 · apache/spark

metanil · 2026-05-14T18:42:59Z

What changes were proposed in this pull request?

This PR adds Storage Partitioned Join (SPJ) support for the truncate partition transform. The approach generalizes the ReducibleFunction API to accept arbitrary parameters via a new ReducibleParameters container, so SPJ can reason about any parameterized transform (bucket, truncate, future ones) through one code path.

Key changes:

New public API ReducibleParameters in org.apache.spark.sql.connector.catalog.functions — a typed parameter container.
Generalized reducer ReducibleFunction.reducer(ReducibleParameters, ReducibleFunction, ReducibleParameters). The old reducer(int, ..., int) is marked @Deprecated but preserved as the default fallback, so existing connector implementations (e.g., Iceberg 1.10.0) continue to work unchanged.
TransformExpression refactor: literal parameters (e.g., bucket numBuckets, truncate width) now live inside children rather than a bespoke numBucketsOpt: Option[Int] field. collectLeaves() is overridden to filter literal parameters and return only column references.
Generic path in TransformExpression that extracts ReducibleParameters from literal children and delegates to the new API; compatibility checks (isCompatible, reducers) work uniformly for bucket, truncate, etc.

Why are the changes needed?

Today a join on tables partitioned by truncate(col, N) always shuffles, even when both sides share identical partitioning. The write-side was fixed by SPARK-40295 (Allow v2 functions with literal args in write distribution and ordering), but the read/join side was never enabled.

Previous work in #49211 (@szehon-ho) explored direct support for transforms with literal arguments by adjusting the SPJ paths to recognize them. This PR generalizes the reducer API so the compatibility check is function-agnostic, with a default method that delegates to the deprecated single-int signature for backward compatibility.

Does this PR introduce any user-facing change?

Yes, for connector/catalog authors:

New public class ReducibleParameters.
New overload ReducibleFunction.reducer(ReducibleParameters, ...) with a default that delegates to the deprecated single-int signature for backward compatibility.
No action required for existing connectors; they keep working via the default fallback. Iceberg 1.10.0 (which implements only the old API) is verified via a dedicated LegacyBucketFunction test fixture.

For end users, queries joining tables partitioned by compatible truncate transforms (identical widths, or reducible pairs like truncate(3) and truncate(5)) now avoid shuffle via SPJ.
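To make the "reducible pair" case concrete, here is a minimal sketch; it is not Spark's or any connector's implementation. It assumes string truncate simply keeps the first N Java chars, and `TruncateSketch`, `truncate`, and `reduceTo` are hypothetical names introduced only for illustration. The point is that a key already truncated to width 5 can be truncated again to width 3 and lands on the same value as truncating the original, so both sides can be grouped on the coarser key.

```java
final class TruncateSketch {
    // Hypothetical stand-in for a string truncate transform: keep the first `width` chars.
    static String truncate(String value, int width) {
        return value.length() <= width ? value : value.substring(0, width);
    }

    // Reduce a key produced with a wider width down to a narrower target width.
    static String reduceTo(String partitionKey, int targetWidth) {
        return truncate(partitionKey, targetWidth);
    }

    public static void main(String[] args) {
        String raw = "storage";
        String wideKey = truncate(raw, 5);   // key on the truncate(col, 5) side
        String narrowKey = truncate(raw, 3); // key on the truncate(col, 3) side
        // Reducing the wide key gives the narrow key, so the two sides line up.
        System.out.println(reduceTo(wideKey, 3).equals(narrowKey));
    }
}
```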

How was this patch tested?

Six new tests in KeyGroupedPartitioningSuite.

cc @szehon-ho @aokolnychyi @sunchao @peter-toth

Was this patch authored or co-authored using generative AI tooling?

Yes — used only for test cases and Javadoc/Scaladoc comments.

Generated-by: Claude Code (Opus 4.7)

sunchao · 2026-05-15T04:23:56Z

-      transform.children.size == 1 && isReference(transform.children.head)
+      // TransformExpression.collectLeaves() only returns column references, not literals.
+      // We need exactly one column reference per transform.
+      transform.collectLeaves().size == 1


[P1] This widens support from transforms with one direct reference child to any transform whose collectLeaves() returns one leaf. Because Transform arguments can themselves be nested Transforms, V2ExpressionUtils can materialize shapes like outer(years(k)) and outer(days(k)). The new gate accepts both, keyPositions later maps them only by leaf k, and TransformExpression.isSameFunction compares only the outer function name plus literals. That can make storage partitionings with different nested child semantics look compatible and let SPJ skip a required shuffle, which can drop join matches. Keep the old direct-reference constraint, or compare the full non-literal child semantics before admitting these transforms.

[ 🤖 posted by Codex on behalf of sunchao 🤖 ]

Thanks @sunchao for the review. Good point, I hadn't considered nested transforms. This does widen the set of accepted transforms, since collectLeaves() walks all the children of nested expressions.
As you mentioned, adding a direct-reference constraint here should be enough. I'll make that change by taking the non-literal children and applying the same existing check (size and reference). With that, we also no longer have to worry about TransformExpression.isSameFunction comparing only the outer function.

As for supporting nested transforms in SPJ, it's a very interesting idea, but we can take that up in a separate PR.

sunchao · 2026-05-15T04:23:56Z

+  private def extractParameters(expr: TransformExpression): ReducibleParameters = {
+    import scala.jdk.CollectionConverters._
+    val values = expr.literalChildren.map {
+      case Literal(value, _) => value.asInstanceOf[AnyRef]


[P1] extractParameters forwards raw Catalyst Literal.value objects into ReducibleParameters. For string literals, Spark stores UTF8String, while the new public API documents string parameters and exposes getString() as a java.lang.String cast. A connector implementing the documented string-parameter case will get ClassCastException from getString(0) or be forced to depend on Spark internals. Convert literal values to connector-facing external values by dataType before constructing ReducibleParameters.

[ 🤖 posted by Codex on behalf of sunchao 🤖 ]

I agree: because of UTF8String, we'd get a ClassCastException. I think the same applies to the Decimal type.
I'll make the change to handle these, possibly in a more generic way for future-proofing.
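The failure mode and the suggested fix can be sketched without Spark on the classpath. Everything below is a hypothetical stand-in: `InternalStr` plays the role of an engine-internal string type such as UTF8String, `Params` is a simplified container whose `getString` is a plain cast like the one described in the review comment, and `toExternal` is an illustrative conversion step, not the PR's code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ExternalParamsSketch {
    // Stand-in for an engine-internal string representation (assumption, not a Spark class).
    static final class InternalStr {
        final byte[] bytes;
        InternalStr(String s) { this.bytes = s.getBytes(StandardCharsets.UTF_8); }
        @Override public String toString() { return new String(bytes, StandardCharsets.UTF_8); }
    }

    // Simplified typed container: getString is a plain cast, so it fails on internal values.
    static final class Params {
        private final List<Object> values;
        Params(List<Object> values) { this.values = values; }
        String getString(int index) { return (String) values.get(index); }
    }

    // Convert an internal value to its connector-facing form before it enters the container.
    static Object toExternal(Object internal) {
        if (internal instanceof InternalStr) {
            return internal.toString();
        }
        return internal; // boxed ints, longs, etc. need no conversion in this sketch
    }

    static Params fromInternal(List<Object> internalValues) {
        List<Object> external = new ArrayList<>();
        for (Object v : internalValues) {
            external.add(toExternal(v));
        }
        return new Params(external);
    }
}
```

Wrapping the raw list directly in `Params` and calling `getString(0)` throws ClassCastException, while going through `fromInternal` first returns the plain string. A real implementation would pick the conversion from the literal's declared data type rather than from the runtime class, as the review suggests.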

sunchao · 2026-05-15T04:23:56Z

+    val thisParams = extractParameters(thisExpr)
+    val otherParams = extractParameters(otherExpr)
+
+    val res = if (!thisParams.isEmpty && !otherParams.isEmpty) {


[P2] The new generalized reducer API accepts ReducibleParameters on both sides, and this file even models zero-literal transforms as ReducibleParameters([]). But this dispatch only invokes the generalized overload when both sides are non-empty. Mixed cases such as parameterized-vs-zero-parameter transforms instead fall back to reducer(otherFunction), so connectors that correctly implement the new generalized overload for those cases are never invoked; the default legacy path can even throw UnsupportedOperationException. If mixed arity is meant to be unsupported, the new API/docs should say that explicitly. Otherwise this should dispatch through the generalized overload whenever either side wants that path.

[ 🤖 posted by Codex on behalf of sunchao 🤖 ]

@sunchao Good point. I was trying to fall back to the existing behavior, but hadn't considered the parameterized-vs-zero-parameter case, which should follow the new generalized path. As for making mixed arity "unsupported": I can't currently think of a valid example, but I also can't think of a reason not to support it. So I suppose we can let the reducer implementor decide whether to support it.

I will make the change so that this dispatches through the generalized reducer() overload.

szehon-ho · 2026-05-18T22:19:23Z

+          ReducibleFunction<?, ?> otherFunction,
+          ReducibleParameters otherParams) {
+    // Default: try old Int-based API for backward compatibility
+    if (thisParams.count() == 1 && otherParams.count() == 1) {


I think it makes more sense to have it the other way around: the old one should call the new (more generic) one. We can have the temporary code on the Spark side:

    val thisParams = extractParameters(thisExpr)
    val otherParams = extractParameters(otherExpr)
    if (singleIntParam(thisParams) && singleIntParam(otherParams)) {
      Option(thisFunction.reducer(thisParams.getInt(0), otherFunction, otherParams.getInt(0)))
        .orElse(Option(thisFunction.reducer(thisParams, otherFunction, otherParams)))
    } else if (!thisParams.isEmpty && !otherParams.isEmpty) {
      Option(thisFunction.reducer(thisParams, otherFunction, otherParams))
    } else {
      Option(thisFunction.reducer(otherFunction))
    }

IMO, it's not much uglier than this code :)

In practice, Iceberg is the only implementation of this, so we can hopefully remove this ugly wrapper from the Spark side.

@szehon-ho Yes, this makes sense. My initial idea was to avoid calling the deprecated method from Spark, so that when the time came to remove it, only a single file (ReducibleFunction) would change. But that approach is not the natural one (the norm is for deprecated code to call the newer method).
When it's time to remove the method, we can remove it from both places; that's not a big deal and is hard to miss.
I'll make this change along with @sunchao's mixed-arity case.

szehon-ho · 2026-05-18T22:46:44Z

+ *   <li>custom_transform(col, "param") → ReducibleParameters(["param"])</li>
+ * </ul>
+ *
+ * @since 4.0.0


version is not right

@szehon-ho oh good eye! I worked on this last year :) Will update to reflect current version.

metanil · 2026-05-22T16:45:20Z

@sunchao @szehon-ho Addressed all comments, PTAL

Additionally, I'm now handling every case where reducer() returns null or throws UnsupportedOperationException (UOE), and logging a warning.

cc @peter-toth

peter-toth

Summary

Adds SPJ support for the truncate partition transform by generalizing the reducer API: literal parameters now live inline in TransformExpression.children, are surfaced to connectors through a new ReducibleParameters container, and a generalized ReducibleFunction.reducer(ReducibleParameters, ReducibleFunction, ReducibleParameters) overload replaces the bucket-only reducer(int, …, int) signature.

Prior state and problem. Bucket's numBuckets was carried as a sidecar numBucketsOpt: Option[Int] field on TransformExpression, outside the expression tree. The SPJ reducer dispatch in TransformExpression.reducer only knew how to fan out two cases — (Some, Some) → reducer(int, …, int) or anything else → reducer(otherFunction) — so any parameterized transform other than bucket (e.g. truncate(col, N)) was structurally invisible to the reducer dispatch and always shuffled. The write side was fixed in SPARK-40295; the read/join side never was. This PR supersedes the earlier exploration in #49211 by generalizing the API rather than special-casing transforms one at a time.

Design approach. Drop the numBucketsOpt sidecar and let literals participate in children. Then expose those literals to connectors as a typed parameter container, route the SPJ reducer dispatch through a single generalized overload, and keep the deprecated int-API alive (with default throw UOE) so existing connectors like Iceberg 1.10.0 continue to work. The dispatch tries the deprecated signature first for the single-int case (Iceberg shape), falls back to the new overload on UnsupportedOperationException, and goes straight to the new overload for everything else.
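The dispatch order described above can be sketched in plain Java. This is an illustration under assumptions, not the code in TransformExpression: `Fn` is a hypothetical stand-in for ReducibleFunction whose two overloads both default to throwing UnsupportedOperationException, parameters are modeled as a `List<Object>`, and a "reducer" is modeled as a `String` label so the chosen path is visible.

```java
import java.util.Arrays;
import java.util.List;

final class ReducerDispatchSketch {
    // Hypothetical stand-in for ReducibleFunction; a "reducer" is just a label here.
    interface Fn {
        // Legacy single-int shape (the one older connectors implement).
        default String reducer(int thisParam, Fn other, int otherParam) {
            throw new UnsupportedOperationException("legacy overload not implemented");
        }
        // Generalized shape taking arbitrary parameter lists.
        default String reducer(List<Object> thisParams, Fn other, List<Object> otherParams) {
            throw new UnsupportedOperationException("generalized overload not implemented");
        }
    }

    static boolean singleInt(List<Object> params) {
        return params.size() == 1 && params.get(0) instanceof Integer;
    }

    // Try the legacy overload first for the single-int case, then fall back to the new one.
    // Returns null when neither overload yields a reducer.
    static String dispatch(Fn fn, List<Object> thisParams, Fn other, List<Object> otherParams) {
        if (singleInt(thisParams) && singleInt(otherParams)) {
            int a = (Integer) thisParams.get(0);
            int b = (Integer) otherParams.get(0);
            try {
                String legacy = fn.reducer(a, other, b);
                if (legacy != null) {
                    return legacy;
                }
            } catch (UnsupportedOperationException e) {
                // Connector only implements the generalized overload; fall through.
            }
        }
        try {
            return fn.reducer(thisParams, other, otherParams);
        } catch (UnsupportedOperationException e) {
            return null; // treated as "not reducible" in this sketch
        }
    }

    public static void main(String[] args) {
        Fn legacyOnly = new Fn() {
            @Override public String reducer(int a, Fn other, int b) { return "legacy"; }
        };
        Fn newOnly = new Fn() {
            @Override public String reducer(List<Object> a, Fn other, List<Object> b) {
                return "generalized";
            }
        };
        List<Object> four = Arrays.<Object>asList(4);
        List<Object> two = Arrays.<Object>asList(2);
        System.out.println(dispatch(legacyOnly, four, legacyOnly, two));
        System.out.println(dispatch(newOnly, four, newOnly, two));
    }
}
```

In this sketch a connector that overrides only the legacy overload is still reached for the single-int shape, while mixed-arity inputs (for example one parameter against none) go straight to the generalized overload.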

Key design decisions.

Literals are now inline Expression children, not a typed field. TransformExpression.collectLeaves() is overridden to strip them so the existing partitioning.scala/EnsureRequirements consumers still see exactly one leaf attribute per transform.
ReducibleParameters is a brand-new public API rather than reusing V2Literal/Literal; it exposes typed getters (getInt, getLong, getString, getDouble, getFloat, get) but deliberately hides the source DataType.
Spark's own BucketFunction was switched to override only the new API. Backward compat for connectors that override only the deprecated method is covered by a dedicated LegacyBucketFunction test fixture.
extractParameters converts UTF8String → java.lang.String and Decimal → java.math.BigDecimal, but otherwise passes Catalyst-internal values straight through.

Implementation sketch. New public API in sql/catalyst/.../functions/ (ReducibleParameters.java, generalized ReducibleFunction.reducer overload). Catalyst plumbing in TransformExpression.scala (new literalChildren, extractParameters, tryReduce, dispatch logic; overrides for collectLeaves and canonicalized). V2-to-Catalyst resolution simplified in V2ExpressionUtils (the BucketTransform case is dropped; bucket now resolves through the generic NamedTransform path). Physical layer in partitioning.scala updates KeyedPartitioning.supportsExpressions to count non-literal children and KeyedShuffleSpec.createPartitioning to preserve literals when rewriting children. DistributionAndOrderingUtils.resolveTransformExpression simplified. Test fixtures in transformFunctions.scala add TruncateFunction.reducer via the new API, plus IntegerTruncateFunction and LegacyBucketFunction. New tests in KeyGroupedPartitioningSuite.

Behavioral changes worth calling out.

truncate(col, N) now plans KeyedPartitioning where it previously planned UnknownPartitioning. The existing test non-clustered distribution: V2 function with multiple args was renamed to clustered distribution: … to reflect this.
EnsureRequirementsSuite exprA–exprD flipped from Literal(1..4) to AttributeReferences because the new code distinguishes literals from refs in transform children. Tests using bucket(N, exprA) now exercise a [Literal(N), AttributeReference] shape rather than [Literal(N), Literal(1)]. Worth a sanity check that no existing assertion relied on the prior Literal shape.

General

@sunchao (Codex, May 15) and @szehon-ho (May 18) already raised the substantive issues with the earlier commit. The author has addressed: the collectLeaves() widening on KeyedPartitioning.supportsExpressions (now uses children.filterNot(_.isInstanceOf[Literal]) directly), the mixed-arity dispatch (now routes through the generalized overload), the dispatch ordering preference (deprecated-first, new fallback per szehon's suggestion), and the @since version on ReducibleParameters. Nice turnaround on all of those.
The fix for @sunchao's extractParameters P1 is only partial. The current code converts UTF8String → String and Decimal → BigDecimal, but the original critique was structural: convert by dataType rather than special-casing each type. As written, BinaryType (byte[]), CalendarIntervalType, ArrayType/MapType Catalyst values, etc. still flow through the catch-all value.asInstanceOf[AnyRef] and a connector implementing the documented "any type" claim on ReducibleParameters will hit Catalyst-internal classes. The principled fix is to drive the conversion off the literal's DataType (e.g., via existing CatalystTypeConverters.convertToScala) — or, taken further, to surface V2 literals directly through the API; see inline #6.
The PR description claim "New overload ReducibleFunction.reducer(ReducibleParameters, ...) with a default that delegates to the deprecated single-int signature for backward compatibility" is out of date with the code. The new default throws UnsupportedOperationException; it's TransformExpression.reducer's Spark-side dispatch that tries the deprecated signature first. Worth updating the description so reviewers don't get a wrong mental model of the fallback mechanism.
The collectLeaves override on TransformExpression (inline #3 below) is the kind of contract divergence that's better fixed at the abstraction layer it crosses. I've put up #56088 as a follow-up that migrates the SPJ call sites in partitioning.scala and EnsureRequirements.scala off _.collectLeaves() — to Expression.references where set semantics suffices, and a new Expression.collectAttributes(): Seq[Attribute] where positional Seq is required. Pure refactor against today's master. Once #56088 lands, this PR can rebase on top and drop the override here entirely; the migrated call sites already filter literals correctly.

Suggested improvements

1. ReducibleParameters is missing getBigDecimal. extractParameters converts Decimal → java.math.BigDecimal on the wire, but the container exposes no decimal getter, so connectors with a decimal parameter must use the untyped get(index) and cast manually, which defeats the type safety the wrapper was added to provide. (Inline #6 below proposes a structural alternative that obviates this gap entirely.) [sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ReducibleParameters.java:114]
2. canonicalized override is dead code and the doc justification doesn't hold. SPJ uses isSameFunction, not canonicalized/semanticEquals, so the stated motivation ("bucket(4, tableA.id) and bucket(4, tableB.id) semantically equal") is moot; moreover, AttributeReference.canonicalized keeps exprId, so the override wouldn't deliver that equality even if SPJ did read it. The override otherwise duplicates the default path. Remove. [sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TransformExpression.scala:93]
3. collectLeaves override breaks the universal TreeNode contract; drop on rebase after #56088. The follow-up migrates the SPJ call sites off _.collectLeaves() so the override here can simply be removed; see the General section. [sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TransformExpression.scala:111]
4. UnboundTruncateFunction.bind() claims BinaryType and LongType support, but neither branch actually works. BinaryType → TruncateFunction mismatches (StringType, IntegerType) inputTypes() and a getUTF8String read; LongType → IntegerTruncateFunction mismatches (IntegerType, IntegerType) inputTypes() and a getInt read. Plus the IntegerTruncateFunction doc claim about "incompatible partition structures" is wrong. Drop the unused branches, or split per type with coverage. [sql/core/src/test/scala/org/apache/spark/sql/connector/catalog/functions/transformFunctions.scala:293]
5. isSameFunction compares literals by value but ignores positional structure. The check tolerates literals at different positions in children and ignores nested-transform children entirely (compares only the outer function name + literal values), so bucket(4, days(col)) and bucket(4, hours(col)) would be reported as the same function. A structural per-position compare closes both gaps. [sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TransformExpression.scala:75]
6. Structural alternative: reuse V2Literal instead of introducing ReducibleParameters. V2Literal is already public and already used everywhere else in the connector API; passing it through the reducer signature preserves dataType() (no Catalyst-internal-types leak), makes inline #1 moot (no typed getters to maintain), and shrinks extractParameters to a Catalyst-Literal → V2-Literal conversion. Worth weighing before locking the new public class in. [sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ReducibleParameters.java:40]

peter-toth · 2026-05-24T12:00:22Z

+     */
+    public float getFloat(int index) {
+        return (Float) values.get(index);
+    }


extractParameters in TransformExpression.scala:171 converts Decimal → java.math.BigDecimal on the way in, but this class exposes no getBigDecimal accessor. A connector with a decimal parameter therefore has to fall back to the untyped Object get(int index) and cast by hand, losing exactly the type safety the wrapper was introduced to provide. Either add a getBigDecimal getter alongside the others, or remove the Decimal special-case in extractParameters and document the supported types explicitly. Concretely:

    /**
     * Get parameter at index as BigDecimal.
     * @throws ClassCastException if parameter is not a BigDecimal
     * @throws IndexOutOfBoundsException if index is invalid
     */
    public java.math.BigDecimal getBigDecimal(int index) {
        return (java.math.BigDecimal) values.get(index);
    }

The same gap exists for any other type the Spark side might convert (binary, interval, etc.); see the General note about driving extractParameters off DataType. Inline #6 proposes a structural alternative that would make this whole class of issue moot.

peter-toth · 2026-05-24T12:00:22Z

+   * This is crucial for Storage Partitioned Joins - we need bucket(4, tableA.id) and bucket(4,
+   * tableB.id) to be semantically equal so SPJ can be triggered.
+   */
+  override lazy val canonicalized: Expression = {


The doc claim that motivates this override doesn't hold, and the override doesn't add anything over the inherited default. Two pieces:

The motivation doesn't hold. The doc says "we need bucket(4, tableA.id) and bucket(4, tableB.id) to be semantically equal so SPJ can be triggered". SPJ doesn't use canonicalized/semanticEquals on TransformExpression though — it dispatches through isSameFunction/isCompatible (see KeyedShuffleSpec.isExpressionCompatible at partitioning.scala:1082). And even if it did, this override wouldn't deliver the claim: AttributeReference.canonicalized (namedExpressions.scala:324) normalizes the name to "none" but keeps the original exprId, so two different attributes with the same name still canonicalize to different forms. bucket(4, tableA.id) and bucket(4, tableB.id) aren't made equal by either the override or the default.

The body is equivalent to the default. Literal is a LeafExpression, so Literal.canonicalized returns the literal unchanged — the per-child case l: Literal => l branch produces the same result as l.canonicalized. The override therefore matches withCanonicalizedChildren exactly, except it skips the Canonicalize.execute post-pass that the default routes through (a no-op for TransformExpression today, but a future change to Canonicalize would silently fail to apply here).

Recommended: remove the override entirely.

Good catch, yes, it's not needed. I meant to remove it.

peter-toth · 2026-05-24T12:00:22Z

+   *
+   */
+  override def collectLeaves(): Seq[Expression] = {
+    children.flatMap {


This override breaks the universal TreeNode.collectLeaves contract — the base method (TreeNode.scala:315) returns "every node in the tree where children.isEmpty", but this returns column references only, hiding the literal leaves. Functionally correct for current SPJ call sites (partitioning.scala:492/498/982/1037, EnsureRequirements.scala:89/412/417/780), but a quiet semantic divergence — any future caller of collectLeaves on a TransformExpression will silently get wrong answers.

The cleaner shape is to migrate the SPJ call sites off _.collectLeaves() and use the right helper at each:

Expression.references where set semantics suffice (existence/size checks, single-attribute lookup).

A new Expression.collectAttributes(): Seq[Attribute] where a positional Seq is required (RowOrdering.create, reorder, attributes.zip(clustering)) — AttributeSet deduplicates and its ordering is LinkedHashSet-implementation-detail rather than contract.

I've sent #56088 as the follow-up doing exactly that. Pure refactor against today's master (current TransformExpression has no literal children, so the migrated call sites produce identical results). Once it's merged, this override can be dropped on rebase, and 55885's representation change (literals as inline children) lands cleanly through the already-migrated call sites.
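The contract difference is easy to see on a toy tree. The sketch below is not Catalyst: `Node`, `Attr`, `Lit`, and `Transform` are hypothetical stand-ins, `collectLeaves` follows the base contract quoted above (every node with no children), and `attributes` is an illustrative helper that keeps only column references in encounter order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class LeafSketch {
    static abstract class Node {
        List<Node> children() { return Collections.emptyList(); }

        // Base contract: every node in the tree that has no children, literals included.
        final List<Node> collectLeaves() {
            List<Node> out = new ArrayList<>();
            collect(this, out);
            return out;
        }

        private static void collect(Node n, List<Node> out) {
            if (n.children().isEmpty()) {
                out.add(n);
            } else {
                for (Node c : n.children()) {
                    collect(c, out);
                }
            }
        }
    }

    static final class Attr extends Node {
        final String name;
        Attr(String name) { this.name = name; }
    }

    static final class Lit extends Node {
        final Object value;
        Lit(Object value) { this.value = value; }
    }

    static final class Transform extends Node {
        final String function;
        final List<Node> args;
        Transform(String function, Node... args) {
            this.function = function;
            this.args = Arrays.asList(args);
        }
        @Override List<Node> children() { return args; }
    }

    // What join planning wants in this sketch: column references only, literals skipped.
    static List<String> attributes(Node root) {
        List<String> names = new ArrayList<>();
        for (Node leaf : root.collectLeaves()) {
            if (leaf instanceof Attr) {
                names.add(((Attr) leaf).name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Node bucket = new Transform("bucket", new Lit(4), new Attr("id"));
        // The literal and the attribute are both leaves under the base contract.
        System.out.println(bucket.collectLeaves().size());
        System.out.println(attributes(bucket));
    }
}
```

Keeping the filtering in a separate helper, rather than changing what `collectLeaves` returns for one node type, is the shape the follow-up argues for: callers that want attributes ask for attributes, and the general tree method keeps its meaning.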

Yes, exactly this. I was getting away with it in the context of TransformExpression, since it was always meant for Attributes, but with respect to the TreeNode.collectLeaves contract it was indeed a break.
I also tried filtering out Literals at the usage sites, but that meant modifying all the callers.

collectAttributes() is exactly what I needed, and I will use it once #56088 lands on the main branch.

Thanks for the PR @peter-toth

Actually, I now think we don't even need a new Expression.collectAttributes(); we can use Expression.references() in SPJ planning. Adjusted the PR.

peter-toth · 2026-05-24T12:00:22Z

+  override def bind(inputType: StructType): BoundFunction = {
+    if (inputType.size == 2) {
+      inputType.head.dataType match {
+        case StringType | BinaryType => TruncateFunction


This dispatch claims BinaryType and LongType support, but neither branch actually works:

StringType | BinaryType => TruncateFunction — TruncateFunction.inputTypes() declares (StringType, IntegerType) and produceResult calls input.getUTF8String(0). A BinaryType row (raw byte[] underneath) mismatches inputTypes() at bind-time, or — if the framework lets it through — fails at runtime on the getUTF8String read.

IntegerType | LongType => IntegerTruncateFunction — IntegerTruncateFunction.inputTypes() declares (IntegerType, IntegerType) and produceResult calls input.getInt(0). Same kind of mismatch on a LongType row, plus its resultType() is IntegerType so even if you cast input down, the column type is wrong.

None of the phantom branches are exercised by tests in this PR — every truncate test uses StringType and dispatches to TruncateFunction correctly. So the broken branches add API surface that connector authors might trust without coverage to defend it.

Adjacent issue: the doc on IntegerTruncateFunction ("different integer truncate widths produce incompatible partition structures") is wrong: truncate(v, W1) is reducible to truncate(v, W2) whenever W2 % W1 == 0 (snap to a coarser width), and when neither divides the other both reduce to truncate(_, lcm(W1, W2)). Same structural argument as bucket's GCD case.
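The divisibility argument can be checked numerically. This is a minimal sketch that assumes integer truncate is defined as v - floorMod(v, W), that is, snapping down to a multiple of W; `truncate`, `gcd`, and `commonWidth` are hypothetical helper names and not part of the PR.

```java
final class IntTruncateSketch {
    // Assumed definition: snap v down to the nearest multiple of width (also for negatives).
    static long truncate(long v, long width) {
        return v - Math.floorMod(v, width);
    }

    static long gcd(long a, long b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    // Least common multiple: the finest grid that both widths can be reduced to.
    static long commonWidth(long w1, long w2) {
        return w1 / gcd(w1, w2) * w2;
    }

    public static void main(String[] args) {
        long w1 = 4;
        long w2 = 6;
        long common = commonWidth(w1, w2);
        boolean agree = true;
        for (long v = -50; v <= 50; v++) {
            long direct = truncate(v, common);
            // Re-truncating an already truncated key gives the same coarse key.
            agree &= truncate(truncate(v, w1), common) == direct;
            agree &= truncate(truncate(v, w2), common) == direct;
        }
        System.out.println(agree);
    }
}
```

The check works because the common width is a multiple of each input width, so flooring twice (first to W, then to the common width) lands on the same multiple as flooring once.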

Three options, in order of cleanliness:

Drop the unused branches and IntegerTruncateFunction from this PR — only StringType truncate is part of the SPJ story being delivered.

Split per type (IntegerTruncate / LongTruncate / BinaryTruncate) with the right inputTypes()/resultType()/produceResult and add at least one test per branch.

Make each existing object polymorphic on the input type (declare wider inputTypes(), dispatch in produceResult) and add tests.

peter-toth · 2026-05-24T12:00:22Z

+      // Compare literal arguments to ensure transforms with different parameters
+      // (e.g., bucket(32, col) vs bucket(16, col), truncate(col, 2) vs truncate(col, 4))
+      // are not considered the same
+      val otherLiterals = other.literalChildren


This compares literals by value but ignores positional structure of children. Two consequences:

Position-agnostic. Two transforms with the same function and the same literal values but different interleavings of literals/non-literals are reported as the same. In practice the V2 function catalog pins down argument order per function name, so this is unlikely to manifest, but the doc claims "same semantics as other" — the check is weaker than that without the catalog assumption.

Nested-transform children. V2ExpressionUtils.toCatalyst recurses through toCatalystTransformOpt, so children can include nested TransformExpressions (sunchao's earlier P1 raised this for supportsExpressions). bucket(4, days(col)) and bucket(4, hours(col)) both have outer function bucket and literalChildren = [Literal(4)], so this returns true even though the inner transforms differ. KeyedPartitioning.supportsExpressions rejects nested-transform shapes from SPJ planning, so this doesn't bite SPJ today, but isSameFunction is used elsewhere (isCompatible, suite assertions) where the false positive would leak.

A structural per-position walk closes both gaps and lets literalChildren shrink to just the extractParameters helper:

    def isSameFunction(other: TransformExpression): Boolean = {
      function.canonicalName() == other.function.canonicalName() &&
        children.length == other.children.length &&
        children.zip(other.children).forall {
          case (a: Literal, b: Literal) => a == b
          case (a: TransformExpression, b: TransformExpression) => a.isSameFunction(b)
          case (_: Literal, _) | (_, _: Literal) => false
          case (_: TransformExpression, _) | (_, _: TransformExpression) => false
          case _ => true // both refs/attrs; column identity intentionally ignored
        }
    }

@sunchao raised this issue too, and I stuck with SPJ's current limitation of a single reference per transform.
But I like the idea of generalizing it to support nested cases as well.

I'll try to follow the suggestion and make the change.

Thanks @peter-toth for the suggestion.

peter-toth · 2026-05-24T12:00:22Z

+ * @since 5.0.0
+ */
+@Evolving
+public class ReducibleParameters {


Structural alternative — reuse V2Literal instead of introducing ReducibleParameters. Surfacing this even though it's late in the cycle, because it's the kind of design choice worth weighing before locking in a new public class.

The proposal: drop ReducibleParameters entirely and have the generalized reducer take org.apache.spark.sql.connector.expressions.Literal (the V2 literal type already used everywhere else in the connector API):

    default Reducer<I, O> reducer(
        org.apache.spark.sql.connector.expressions.Literal<?>[] thisParams,
        ReducibleFunction<?, ?> otherFunction,
        org.apache.spark.sql.connector.expressions.Literal<?>[] otherParams) {
      throw new UnsupportedOperationException();
    }

Connector use:

    @Override
    public Reducer<UTF8String, UTF8String> reducer(
        Literal<?>[] thisParams,
        ReducibleFunction<?, ?> otherFunc,
        Literal<?>[] otherParams) {
      if (otherFunc != TruncateFunction) return null;
      int thisWidth = (Integer) thisParams[0].value();
      int otherWidth = (Integer) otherParams[0].value();
      ...
    }

What this fixes simultaneously:

Inline #1's gap evaporates. No typed getters to maintain: connectors call .value() and dispatch on .dataType(). No more "we forgot getBigDecimal" / "we'll need getCalendarInterval" / etc.

The extractParameters partial-conversion concern (General-section bullet #2) goes away. V2Literal carries dataType() alongside the value, so the connector interprets it correctly regardless of whether it's String, BigDecimal, byte[], CalendarInterval, etc. No Catalyst-internal types leaking, no type-by-type special-casing: extractParameters shrinks to a Catalyst-Literal → V2-Literal conversion (essentially the inverse of what V2ExpressionUtils.toCatalyst already does for V2 literals).

One fewer public class to learn / maintain / stabilize. V2Literal is already public, already @Evolving → stable, already used by every connector that authors V2 transforms (Expressions.literal(N), Expressions.bucket(N, col)). Receiving them back through the reducer API is symmetric round-trip with how transforms are constructed — V2 connectors never have to cross the Catalyst boundary in this public surface.

Trade-offs worth being honest about:

Slightly more verbose at the connector use site ((Integer) params[0].value() vs params.getInt(0)). Real but small.

Java generics + arrays awkwardness; Literal<?>[] produces unchecked-warning ceremony. List<Literal<?>> or Literal<?>... varargs are cleaner alternatives.

The deprecated int-API → new-API dispatch shim still needed (single int → Literal<Integer>).

cc @sunchao @szehon-ho — would value your read on this trade-off, given your existing reviews of ReducibleParameters. Should the API rebase on V2Literal, or stay with the new class as drafted?

### What changes were proposed in this pull request?

- Add `Expression.collectAttributes(): Seq[Attribute]` for cases that need a positional `Seq` (preserving order and duplicates).
- Migrate SPJ-related uses of `_.collectLeaves()` in `partitioning.scala` and `EnsureRequirements.scala` to:
  - `Expression.references` / `AttributeSet.fromAttributeSets(...)` where set semantics suffice (existence/size checks, single-attribute lookup).
  - The new `Expression.collectAttributes()` where positional `Seq[Attribute]` is required (`RowOrdering.create`, `reorder`, `attributes.zip(clustering)`).
- Drop the now-redundant `.map(_.asInstanceOf[Attribute])` cast at the `EnsureRequirements` site that previously typed the result of `collectLeaves`.

### Why are the changes needed?

`TreeNode.collectLeaves()` returns every node in the tree where `children.isEmpty`, including `Literal`s. SPJ planning has always wanted attributes only, but with the existing partition expression layout (`TransformExpression.children = [col]`, parameters carried in a sidecar `numBucketsOpt: Option[Int]` field), the difference didn't surface in practice.

Follow-up work that puts literal parameters directly into `TransformExpression.children` (e.g. `bucket(Literal(numBuckets), col)`, `truncate(col, Literal(width))`; see SPARK-50593 / apache#55885) would otherwise force `TransformExpression` to override `collectLeaves` to filter literals, breaking the universal `TreeNode.collectLeaves` contract for one expression type. Fixing the abstraction layer by using the right helper at each call site is cleaner than overriding `collectLeaves` inside `TransformExpression`.

### Does this PR introduce _any_ user-facing change?

No. Pure refactor: for current partition expression shapes (no literal children), `_.collectLeaves()`, `_.references`, and `_.collectAttributes()` produce identical results.

### How was this patch tested?

Covered by existing test suites that exercise the migrated call sites: `KeyGroupedPartitioningSuite`, `EnsureRequirementsSuite`, `ProjectedOrderingAndPartitioningSuite`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7
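To make the difference between a leaf traversal and an attribute-only traversal concrete, here is a minimal sketch. It uses toy classes (`Expr`, `Attr`, `Lit`, `Transform`, `LeafDemo`) that are hypothetical stand-ins, not Spark's Catalyst types, and it only mirrors the behavior described above: `collectLeaves` returns every childless node, while an attribute-only helper skips literals and keeps positional order.

```scala
// Toy stand-ins for Catalyst expression nodes; these are NOT Spark's real classes.
sealed trait Expr { def children: Seq[Expr] }
final case class Attr(name: String) extends Expr { val children: Seq[Expr] = Nil }
final case class Lit(value: Int) extends Expr { val children: Seq[Expr] = Nil }
final case class Transform(name: String, children: Seq[Expr]) extends Expr

object LeafDemo {
  // Mirrors the described collectLeaves contract: every node with no children,
  // which includes literals once they are stored as children.
  def collectLeaves(e: Expr): Seq[Expr] =
    if (e.children.isEmpty) Seq(e) else e.children.flatMap(collectLeaves)

  // Attribute-only traversal that preserves order and duplicates
  // (the role the description assigns to collectAttributes).
  def collectAttributes(e: Expr): Seq[Attr] = e match {
    case a: Attr => Seq(a)
    case other   => other.children.flatMap(collectAttributes)
  }

  // truncate(col, Literal(width)) with the literal parameter held as a child.
  val truncate: Expr = Transform("truncate", Seq(Attr("name"), Lit(4)))

  // bucket(Literal(numBuckets), col) with the literal parameter held as a child.
  val bucket: Expr = Transform("bucket", Seq(Lit(8), Attr("id")))
}
```

In this toy model, `LeafDemo.collectLeaves(LeafDemo.truncate)` contains both the column and the width literal, whereas `LeafDemo.collectAttributes(LeafDemo.truncate)` contains only the column. That is the mismatch the call-site migration is meant to avoid.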

### What changes were proposed in this pull request?

- Document `AttributeSet`'s iteration-order contract on the class scaladoc: iteration via `iterator` / `foreach` / `flatMap` returns elements in insertion order (driven by the underlying `LinkedHashSet`). `toSeq` is called out as the explicit exception: it sorts by `(name, exprId.id)` for codegen stability (SPARK-18394).
- Migrate seven SPJ-related uses of `_.collectLeaves()` in `partitioning.scala` and `EnsureRequirements.scala` to `_.references` / `AttributeSet.fromAttributeSets(...)`. Drops the now-redundant `.map(_.asInstanceOf[Attribute])` cast at the `EnsureRequirements:89` site.

### Why are the changes needed?

`TreeNode.collectLeaves()` returns every node in the tree where `children.isEmpty`, including `Literal`s. SPJ planning has always wanted attributes only, but with the existing partition expression layout (`TransformExpression.children = [col]`, parameters carried in a sidecar `numBucketsOpt: Option[Int]` field), the difference didn't surface.

Follow-up work (e.g. SPARK-50593 / apache#55885) that puts literal parameters directly into `TransformExpression.children` (`bucket(Literal(numBuckets), col)`, `truncate(col, Literal(width))`) would otherwise force `TransformExpression` to override `collectLeaves` to filter literals, breaking the universal `TreeNode.collectLeaves` contract for one expression type.

`Expression.references` already returns attributes only (filtering literals and other non-attribute leaves), and its insertion-ordered iteration is exactly what positional binding (`RowOrdering.create`, `reorder`, `attributes.zip(clustering)`) requires. The per-partition-expression single-column rule (enforced by `KeyedPartitioning.supportsExpressions`) ensures within-expression dedup never matters here. Documenting the iteration-order contract lets these call sites rely on the order without implicit dependency on implementation detail.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Covered by existing test suites that exercise the migrated call sites: `KeyGroupedPartitioningSuite`, `EnsureRequirementsSuite`, `ProjectedOrderingAndPartitioningSuite`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7
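The iteration-order distinction described above can be sketched with the Scala standard library alone. The `ToyAttr` class and `OrderDemo` object are hypothetical illustrations and do not use Spark's `AttributeSet`. They only model the two behaviors the description names: insertion-ordered iteration from a `LinkedHashSet`, and a separate sort by `(name, id)`.

```scala
import scala.collection.mutable.LinkedHashSet

// Hypothetical stand-in for an attribute: a name plus a unique id.
final case class ToyAttr(name: String, id: Long)

object OrderDemo {
  // A LinkedHashSet de-duplicates but remembers the order of first insertion.
  val refs: LinkedHashSet[ToyAttr] =
    LinkedHashSet(ToyAttr("b", 2L), ToyAttr("a", 1L), ToyAttr("b", 2L))

  // Insertion-ordered view: what positional binding would rely on.
  val insertionOrder: List[ToyAttr] = refs.toList

  // Sorted view, modelling the described toSeq exception (sort by name, then id).
  val sorted: List[ToyAttr] = refs.toList.sortBy(a => (a.name, a.id))
}
```

Here `OrderDemo.insertionOrder` keeps `b` before `a`, while `OrderDemo.sorted` puts `a` first. Code that zips attributes positionally against another sequence therefore has to pick the insertion-ordered view deliberately.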

metanil added 3 commits May 13, 2026 17:21

- [SPARK-50593][SQL] Generalize ReducibleFunction reducer API with ReducibleParameters container (2fdd9d2)
- [SPARK-50593][SQL] Support truncate transform for Storage Partitioned Joins by generalizing parameter handling (81d63dd)
- [SPARK-50593][SQL] Strengthen test coverage for truncate SPJ and ReducibleParameters backward compatibility (c51a582)

sunchao reviewed May 15, 2026

szehon-ho reviewed May 18, 2026

- [SPARK-50593][SQL] Harden reducer fallback and tighten SPJ supportsExpressions (f65ac19)

peter-toth mentioned this pull request May 24, 2026: [SPARK-57038][SQL] Use Expression.references in SPJ planning #56088 (Open)

peter-toth reviewed May 24, 2026

metanil commented May 22, 2026

4 participants
