[SPARK-50593][SQL] SPJ: Support truncate transform via generalized ReducibleFunction API#55885
[SPARK-50593][SQL] SPJ: Support truncate transform via generalized ReducibleFunction API#55885metanil wants to merge 4 commits into
Conversation
…cibleParameters container
… Joins by generalizing parameter handling
…cibleParameters backward compatibility
| transform.children.size == 1 && isReference(transform.children.head) | ||
| // TransformExpression.collectLeaves() only returns column references, not literals. | ||
| // We need exactly one column reference per transform. | ||
| transform.collectLeaves().size == 1 |
There was a problem hiding this comment.
[P1] This widens support from transforms with one direct reference child to any transform whose collectLeaves() returns one leaf. Because Transform arguments can themselves be nested Transforms, V2ExpressionUtils can materialize shapes like outer(years(k)) and outer(days(k)). The new gate accepts both, keyPositions later maps them only by leaf k, and TransformExpression.isSameFunction compares only the outer function name plus literals. That can make storage partitionings with different nested child semantics look compatible and let SPJ skip a required shuffle, which can drop join matches. Keep the old direct-reference constraint, or compare the full non-literal child semantics before admitting these transforms.
[ 🤖 posted by Codex on behalf of sunchao 🤖 ]
There was a problem hiding this comment.
Thanks @sunchao for the review. Good point. I didn't think about nested transform. This is indeed widening the transforms, as collectLeaves() go through all the children in nested expressions.
Like you mentioned, adding direct-reference constraints here would be enough. I'll make that change by getting non literal children, and use same existing check (size and reference). With that we also don't have to then worry about TransformExpression.isSameFunction comparing only outer function.
With regards to adding support to nested transform for SPJ, very interesting idea, but we can take that as separate PR.
| private def extractParameters(expr: TransformExpression): ReducibleParameters = { | ||
| import scala.jdk.CollectionConverters._ | ||
| val values = expr.literalChildren.map { | ||
| case Literal(value, _) => value.asInstanceOf[AnyRef] |
There was a problem hiding this comment.
[P1] extractParameters forwards raw Catalyst Literal.value objects into ReducibleParameters. For string literals, Spark stores UTF8String, while the new public API documents string parameters and exposes getString() as a java.lang.String cast. A connector implementing the documented string-parameter case will get ClassCastException from getString(0) or be forced to depend on Spark internals. Convert literal values to connector-facing external values by dataType before constructing ReducibleParameters.
[ 🤖 posted by Codex on behalf of sunchao 🤖 ]
There was a problem hiding this comment.
I agree, due UTF8String, we'll get ClassCastException. I think same case for Decimal type.
I'll make the change to handle these, and possibly a little more generic way for future proofing.
| val thisParams = extractParameters(thisExpr) | ||
| val otherParams = extractParameters(otherExpr) | ||
|
|
||
| val res = if (!thisParams.isEmpty && !otherParams.isEmpty) { |
There was a problem hiding this comment.
[P2] The new generalized reducer API accepts ReducibleParameters on both sides, and this file even models zero-literal transforms as ReducibleParameters([]). But this dispatch only invokes the generalized overload when both sides are non-empty. Mixed cases such as parameterized-vs-zero-parameter transforms instead fall back to reducer(otherFunction), so connectors that correctly implement the new generalized overload for those cases are never invoked; the default legacy path can even throw UnsupportedOperationException. If mixed arity is meant to be unsupported, the new API/docs should say that explicitly. Otherwise this should dispatch through the generalized overload whenever either side wants that path.
[ 🤖 posted by Codex on behalf of sunchao 🤖 ]
There was a problem hiding this comment.
@sunchao Good point. I was trying to fallback to existing behavior, but hadn't considered parameterized-vs-zero-parameter case, which it should follow the new generalized path. With regards making mixed arity "unsupported", currently I can't think of an valid example, but I also can't think of not to support it either. So, I suppose, we can let the implementor reducer decide to have it or not.
I will make the change that will dispatch through the generalized overloaded reducer().
| ReducibleFunction<?, ?> otherFunction, | ||
| ReducibleParameters otherParams) { | ||
| // Default: try old Int-based API for backward compatibility | ||
| if (thisParams.count() == 1 && otherParams.count() == 1) { |
There was a problem hiding this comment.
i think it makes more sense to have it the other way. the old one should call the new (more generic one). we can have the temporary code on spark side:
val thisParams = extractParameters(thisExpr)
val otherParams = extractParameters(otherExpr)
if (singleIntParam(thisParams) && singleIntParam(otherParams)) {
Option(thisFunction.reducer(thisParams.getInt(0), otherFunction, otherParams.getInt(0)))
.orElse(Option(thisFunction.reducer(thisParams, otherFunction, otherParams)))
} else if (!thisParams.isEmpty && !otherParams.isEmpty) {
Option(thisFunction.reducer(thisParams, otherFunction, otherParams))
} else {
Option(thisFunction.reducer(otherFunction))
}
imo, its not much uglier than this code :)
In practice, Iceberg is the only implementation of this so we can hopefully remove this ugly wrapper from spark side.
There was a problem hiding this comment.
@szehon-ho Yes, this make sense. My initial idea was to not call "deprecated" method, and when time to remove it would be only from 1 single file (ReducibleFunction), but the approach is not natural (or norm), where deprecated code calls the newer method.
When there's time to remove the method, we can remove from both, would not be a big deal, and can't be missed too.
I'll make this change along with @sunchao mixed arity case.
| * <li>custom_transform(col, "param") → ReducibleParameters(["param"])</li> | ||
| * </ul> | ||
| * | ||
| * @since 4.0.0 |
There was a problem hiding this comment.
@szehon-ho oh good eye! I worked on this last year :) Will update to reflect current version.
|
@sunchao @szehon-ho Addressed all comments, PTAL Additionally, I'm now handling all "null" or unimplemented exception (UOE) from cc @peter-toth |
peter-toth
left a comment
There was a problem hiding this comment.
Summary
Adds SPJ support for the truncate partition transform by generalizing the reducer API: literal parameters now live inline in TransformExpression.children, are surfaced to connectors through a new ReducibleParameters container, and a generalized ReducibleFunction.reducer(ReducibleParameters, ReducibleFunction, ReducibleParameters) overload replaces the bucket-only reducer(int, …, int) signature.
Prior state and problem. Bucket's numBuckets was carried as a sidecar numBucketsOpt: Option[Int] field on TransformExpression, outside the expression tree. The SPJ reducer dispatch in TransformExpression.reducer only knew how to fan out two cases — (Some, Some) → reducer(int, …, int) or anything else → reducer(otherFunction) — so any parameterized transform other than bucket (e.g. truncate(col, N)) was structurally invisible to the reducer dispatch and always shuffled. The write side was fixed in SPARK-40295; the read/join side never was. This PR supersedes the earlier exploration in #49211 by generalizing the API rather than special-casing transforms one at a time.
Design approach. Drop the numBucketsOpt sidecar and let literals participate in children. Then expose those literals to connectors as a typed parameter container, route the SPJ reducer dispatch through a single generalized overload, and keep the deprecated int-API alive (with default throw UOE) so existing connectors like Iceberg 1.10.0 continue to work. The dispatch tries the deprecated signature first for the single-int case (Iceberg shape), falls back to the new overload on UnsupportedOperationException, and goes straight to the new overload for everything else.
Key design decisions.
- Literals are now inline
Expressionchildren, not a typed field.TransformExpression.collectLeaves()is overridden to strip them so the existingpartitioning.scala/EnsureRequirementsconsumers still see exactly one leaf attribute per transform. ReducibleParametersis a brand-new public API rather than reusingV2Literal/Literal; it exposes typed getters (getInt,getLong,getString,getDouble,getFloat,get) but deliberately hides the sourceDataType.- Spark's own
BucketFunctionwas switched to override only the new API. Backward compat for connectors that override only the deprecated method is covered by a dedicatedLegacyBucketFunctiontest fixture. extractParametersconvertsUTF8String → java.lang.StringandDecimal → java.math.BigDecimal, but otherwise passes Catalyst-internal values straight through.
Implementation sketch. New public API in sql/catalyst/.../functions/ (ReducibleParameters.java, generalized ReducibleFunction.reducer overload). Catalyst plumbing in TransformExpression.scala (new literalChildren, extractParameters, tryReduce, dispatch logic; overrides for collectLeaves and canonicalized). V2-to-Catalyst resolution simplified in V2ExpressionUtils (the BucketTransform case is dropped; bucket now resolves through the generic NamedTransform path). Physical layer in partitioning.scala updates KeyedPartitioning.supportsExpressions to count non-literal children and KeyedShuffleSpec.createPartitioning to preserve literals when rewriting children. DistributionAndOrderingUtils.resolveTransformExpression simplified. Test fixtures in transformFunctions.scala add TruncateFunction.reducer via the new API, plus IntegerTruncateFunction and LegacyBucketFunction. New tests in KeyGroupedPartitioningSuite.
Behavioral changes worth calling out.
truncate(col, N)now plansKeyedPartitioningwhere it previously plannedUnknownPartitioning. The existing testnon-clustered distribution: V2 function with multiple argswas renamed toclustered distribution: …to reflect this.EnsureRequirementsSuiteexprA–exprDflipped fromLiteral(1..4)toAttributeReferences because the new code distinguishes literals from refs in transform children. Tests usingbucket(N, exprA)now exercise a[Literal(N), AttributeReference]shape rather than[Literal(N), Literal(1)]. Worth a sanity check that no existing assertion relied on the prior Literal shape.
General
- @sunchao (Codex, May 15) and @szehon-ho (May 18) already raised the substantive issues with the earlier commit. The author has addressed: the
collectLeaves()widening onKeyedPartitioning.supportsExpressions(now useschildren.filterNot(_.isInstanceOf[Literal])directly), the mixed-arity dispatch (now routes through the generalized overload), the dispatch ordering preference (deprecated-first, new fallback per szehon's suggestion), and the@sinceversion onReducibleParameters. Nice turnaround on all of those. - The fix for @sunchao's
extractParametersP1 is only partial. The current code convertsUTF8String → StringandDecimal → BigDecimal, but the original critique was structural: convert bydataTyperather than special-casing each type. As written,BinaryType(byte[]),CalendarIntervalType,ArrayType/MapTypeCatalyst values, etc. still flow through the catch-allvalue.asInstanceOf[AnyRef]and a connector implementing the documented "any type" claim onReducibleParameterswill hit Catalyst-internal classes. The principled fix is to drive the conversion off the literal'sDataType(e.g., via existingCatalystTypeConverters.convertToScala) — or, taken further, to surface V2 literals directly through the API; see inline #6. - The PR description claim "New overload
ReducibleFunction.reducer(ReducibleParameters, ...)with a default that delegates to the deprecated single-int signature for backward compatibility" is out of date with the code. The new default throwsUnsupportedOperationException; it'sTransformExpression.reducer's Spark-side dispatch that tries the deprecated signature first. Worth updating the description so reviewers don't get a wrong mental model of the fallback mechanism. - The
collectLeavesoverride onTransformExpression(inline #3 below) is the kind of contract divergence that's better fixed at the abstraction layer it crosses. I've put up #56088 as a follow-up that migrates the SPJ call sites inpartitioning.scalaandEnsureRequirements.scalaoff_.collectLeaves()— toExpression.referenceswhere set semantics suffices, and a newExpression.collectAttributes(): Seq[Attribute]where positionalSeqis required. Pure refactor against today'smaster. Once #56088 lands, this PR can rebase on top and drop the override here entirely; the migrated call sites already filter literals correctly.
Suggested improvements
ReducibleParametersis missinggetBigDecimal.extractParametersconvertsDecimal → java.math.BigDecimalon the wire, but the container exposes no decimal getter, so connectors with a decimal parameter must use the untypedget(index)and cast manually — exactly the type-safety the wrapper was added to prevent. (Inline #6 below proposes a structural alternative that obviates this gap entirely.) [sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ReducibleParameters.java:114]canonicalizedoverride is dead code and the doc justification doesn't hold. SPJ usesisSameFunction, notcanonicalized/semanticEquals, so the stated motivation ("bucket(4, tableA.id) and bucket(4, tableB.id) semantically equal") is moot — andAttributeReference.canonicalizedkeepsexprId, so the override wouldn't deliver that equality even if SPJ did read it. The override otherwise duplicates the default path. Remove. [sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TransformExpression.scala:93]collectLeavesoverride breaks the universalTreeNodecontract — drop on rebase after #56088. The follow-up migrates the SPJ call sites off_.collectLeaves()so the override here can simply be removed; see the General section. [sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TransformExpression.scala:111]UnboundTruncateFunction.bind()claimsBinaryTypeandLongTypesupport, but neither branch actually works.BinaryType → TruncateFunctionmismatches(StringType, IntegerType)inputTypes()and agetUTF8Stringread;LongType → IntegerTruncateFunctionmismatches(IntegerType, IntegerType)inputTypes()and agetIntread. Plus theIntegerTruncateFunctiondoc claim about "incompatible partition structures" is wrong. Drop the unused branches, or split per type with coverage. [sql/core/src/test/scala/org/apache/spark/sql/connector/catalog/functions/transformFunctions.scala:293]isSameFunctioncompares literals by value but ignores positional structure. The check tolerates literals at different positions inchildrenand ignores nested-transform children entirely (compares only the outer function name + literal values), sobucket(4, days(col))andbucket(4, hours(col))would be reported as the same function. A structural per-position compare closes both gaps. [sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TransformExpression.scala:75]- Structural alternative — reuse
V2Literalinstead of introducingReducibleParameters.V2Literalis already public and already used everywhere else in the connector API; passing it through the reducer signature preservesdataType()(no Catalyst-internal-types leak), makes inline #1 moot (no typed getters to maintain), and shrinksextractParametersto a Catalyst-Literal → V2-Literal conversion. Worth weighing before locking the new public class in. [sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ReducibleParameters.java:40]
| */ | ||
| public float getFloat(int index) { | ||
| return (Float) values.get(index); | ||
| } |
There was a problem hiding this comment.
extractParameters in TransformExpression.scala:171 converts Decimal → java.math.BigDecimal on the way in, but this class exposes no getBigDecimal accessor. A connector with a decimal parameter therefore has to fall back to the untyped Object get(int index) and cast by hand — exactly the type-safety the wrapper was introduced to provide. Either add a getBigDecimal getter alongside the others, or remove the Decimal special-case in extractParameters and document the supported types explicitly. Concretely:
/**
* Get parameter at index as BigDecimal.
* @throws ClassCastException if parameter is not a BigDecimal
* @throws IndexOutOfBoundsException if index is invalid
*/
public java.math.BigDecimal getBigDecimal(int index) {
return (java.math.BigDecimal) values.get(index);
}Same gap exists for any other type the Spark side might convert (binary, interval, etc.) — see the ## General note about driving extractParameters off DataType. Inline #6 proposes a structural alternative that would make this whole class of issue moot.
| * This is crucial for Storage Partitioned Joins - we need bucket(4, tableA.id) and bucket(4, | ||
| * tableB.id) to be semantically equal so SPJ can be triggered. | ||
| */ | ||
| override lazy val canonicalized: Expression = { |
There was a problem hiding this comment.
The doc claim that motivates this override doesn't hold, and the override doesn't add anything over the inherited default. Two pieces:
The motivation doesn't hold. The doc says "we need bucket(4, tableA.id) and bucket(4, tableB.id) to be semantically equal so SPJ can be triggered". SPJ doesn't use canonicalized/semanticEquals on TransformExpression though — it dispatches through isSameFunction/isCompatible (see KeyedShuffleSpec.isExpressionCompatible at partitioning.scala:1082). And even if it did, this override wouldn't deliver the claim: AttributeReference.canonicalized (namedExpressions.scala:324) normalizes the name to "none" but keeps the original exprId, so two different attributes with the same name still canonicalize to different forms. bucket(4, tableA.id) and bucket(4, tableB.id) aren't made equal by either the override or the default.
The body is equivalent to the default. Literal is a LeafExpression, so Literal.canonicalized returns the literal unchanged — the per-child case l: Literal => l branch produces the same result as l.canonicalized. The override therefore matches withCanonicalizedChildren exactly, except it skips the Canonicalize.execute post-pass that the default routes through (a no-op for TransformExpression today, but a future change to Canonicalize would silently fail to apply here).
Recommended: remove the override entirely.
There was a problem hiding this comment.
Good catch, yes, not needed. Was meant to remove it.
| * | ||
| */ | ||
| override def collectLeaves(): Seq[Expression] = { | ||
| children.flatMap { |
There was a problem hiding this comment.
This override breaks the universal TreeNode.collectLeaves contract — the base method (TreeNode.scala:315) returns "every node in the tree where children.isEmpty", but this returns column references only, hiding the literal leaves. Functionally correct for current SPJ call sites (partitioning.scala:492/498/982/1037, EnsureRequirements.scala:89/412/417/780), but a quiet semantic divergence — any future caller of collectLeaves on a TransformExpression will silently get wrong answers.
The cleaner shape is to migrate the SPJ call sites off _.collectLeaves() and use the right helper at each:
Expression.referenceswhere set semantics suffice (existence/size checks, single-attribute lookup).- A new
Expression.collectAttributes(): Seq[Attribute]where a positionalSeqis required (RowOrdering.create,reorder,attributes.zip(clustering)) —AttributeSetdeduplicates and its ordering isLinkedHashSet-implementation-detail rather than contract.
I've sent #56088 as the follow-up doing exactly that. Pure refactor against today's master (current TransformExpression has no literal children, so the migrated call sites produce identical results). Once it's merged, this override can be dropped on rebase, and 55885's representation change (literals as inline children) lands cleanly through the already-migrated call sites.
There was a problem hiding this comment.
Yes, exactly this. I was getting away with the context of TransformExpression as it always meant for Attributes, but with respect to TreeNode.collectLeaves contract, it was indeed breaking.
I was also trying handle by filter Literals during usages, but meant modifying all the callers.
collectAttributes() is exactly what I needed, and will use it once #56088 land on main branch.
Thanks for the PR @peter-toth
There was a problem hiding this comment.
Actually, now I think we don't even need a new Expression.collectAttributes(), but we can use Expression.references() in SPJ planning. Adjusted the PR.
| override def bind(inputType: StructType): BoundFunction = { | ||
| if (inputType.size == 2) { | ||
| inputType.head.dataType match { | ||
| case StringType | BinaryType => TruncateFunction |
There was a problem hiding this comment.
This dispatch claims BinaryType and LongType support, but neither branch actually works:
StringType | BinaryType => TruncateFunction—TruncateFunction.inputTypes()declares(StringType, IntegerType)andproduceResultcallsinput.getUTF8String(0). ABinaryTyperow (rawbyte[]underneath) mismatchesinputTypes()at bind-time, or — if the framework lets it through — fails at runtime on thegetUTF8Stringread.IntegerType | LongType => IntegerTruncateFunction—IntegerTruncateFunction.inputTypes()declares(IntegerType, IntegerType)andproduceResultcallsinput.getInt(0). Same kind of mismatch on aLongTyperow, plus itsresultType()isIntegerTypeso even if you cast input down, the column type is wrong.
None of the phantom branches are exercised by tests in this PR — every truncate test uses StringType and dispatches to TruncateFunction correctly. So the broken branches add API surface that connector authors might trust without coverage to defend it.
Adjacent issue: the doc on IntegerTruncateFunction ("different integer truncate widths produce incompatible partition structures") is wrong: truncate(v, W1) is reducible to truncate(v, W2) whenever W2 % W1 == 0 (snap to a coarser width), and when neither divides the other both reduce to truncate(_, lcm(W1, W2)). Same structural argument as bucket's GCD case.
Three options, in order of cleanliness:
- Drop the unused branches and
IntegerTruncateFunctionfrom this PR — onlyStringTypetruncate is part of the SPJ story being delivered. - Split per type (
IntegerTruncate/LongTruncate/BinaryTruncate) with the rightinputTypes()/resultType()/produceResultand add at least one test per branch. - Make each existing object polymorphic on the input type (declare wider
inputTypes(), dispatch inproduceResult) and add tests.
| // Compare literal arguments to ensure transforms with different parameters | ||
| // (e.g., bucket(32, col) vs bucket(16, col), truncate(col, 2) vs truncate(col, 4)) | ||
| // are not considered the same | ||
| val otherLiterals = other.literalChildren |
There was a problem hiding this comment.
This compares literals by value but ignores positional structure of children. Two consequences:
-
Position-agnostic. Two transforms with the same function and the same literal values but different interleavings of literals/non-literals are reported as the same. In practice the V2 function catalog pins down argument order per function name, so this is unlikely to manifest, but the doc claims "same semantics as
other" — the check is weaker than that without the catalog assumption. -
Nested-transform children.
V2ExpressionUtils.toCatalystrecurses throughtoCatalystTransformOpt, so children can include nestedTransformExpressions (sunchao's earlier P1 raised this forsupportsExpressions).bucket(4, days(col))andbucket(4, hours(col))both have outer functionbucketandliteralChildren = [Literal(4)], so this returnstrueeven though the inner transforms differ.KeyedPartitioning.supportsExpressionsrejects nested-transform shapes from SPJ planning, so this doesn't bite SPJ today, butisSameFunctionis used elsewhere (isCompatible, suite assertions) where the false positive would leak.
A structural per-position walk closes both gaps and lets literalChildren shrink to just the extractParameters helper:
def isSameFunction(other: TransformExpression): Boolean = {
function.canonicalName() == other.function.canonicalName() &&
children.length == other.children.length &&
children.zip(other.children).forall {
case (a: Literal, b: Literal) => a == b
case (a: TransformExpression, b: TransformExpression) => a.isSameFunction(b)
case (_: Literal, _) | (_, _: Literal) => false
case (_: TransformExpression, _) | (_, _: TransformExpression) => false
case _ => true // both refs/attrs — column identity intentionally ignored
}
}There was a problem hiding this comment.
@sunchao raised this issue too, and I stuck with the current limitation of SPJ with only 1 Reference.
But I like the idea of generalizing it to support nested cases as well.
I'll try to follow the suggestion and make the change.
Thanks @peter-toth for suggestion.
| * @since 5.0.0 | ||
| */ | ||
| @Evolving | ||
| public class ReducibleParameters { |
There was a problem hiding this comment.
Structural alternative — reuse V2Literal instead of introducing ReducibleParameters. Surfacing this even though it's late in the cycle, because it's the kind of design choice worth weighing before locking in a new public class.
The proposal: drop ReducibleParameters entirely and have the generalized reducer take org.apache.spark.sql.connector.expressions.Literal (the V2 literal type already used everywhere else in the connector API):
default Reducer<I, O> reducer(
org.apache.spark.sql.connector.expressions.Literal<?>[] thisParams,
ReducibleFunction<?, ?> otherFunction,
org.apache.spark.sql.connector.expressions.Literal<?>[] otherParams) {
throw new UnsupportedOperationException();
}Connector use:
@Override
public Reducer<UTF8String, UTF8String> reducer(
Literal<?>[] thisParams,
ReducibleFunction<?, ?> otherFunc,
Literal<?>[] otherParams) {
if (otherFunc != TruncateFunction) return null;
int thisWidth = (Integer) thisParams[0].value();
int otherWidth = (Integer) otherParams[0].value();
...
}What this fixes simultaneously:
- Inline Removed reference to incubation in README.md. #1's gap evaporates. No typed getters to maintain — connectors call
.value()and dispatch on.dataType(). No more "we forgotgetBigDecimal" / "we'll needgetCalendarInterval" / etc. - The
extractParameterspartial-conversion concern (General-section bullet Removed reference to incubation in Spark user docs. #2) goes away.V2LiteralcarriesdataType()alongside the value, so the connector interprets it correctly regardless of whether it'sString,BigDecimal,byte[],CalendarInterval, etc. No Catalyst-internal types leaking, no type-by-type special-casing —extractParametersshrinks to a Catalyst-Literal → V2-Literal conversion (essentially the inverse of whatV2ExpressionUtils.toCatalystalready does for V2 literals). - One fewer public class to learn / maintain / stabilize.
V2Literalis already public, already@Evolving→ stable, already used by every connector that authors V2 transforms (Expressions.literal(N),Expressions.bucket(N, col)). Receiving them back through the reducer API is symmetric round-trip with how transforms are constructed — V2 connectors never have to cross the Catalyst boundary in this public surface.
Trade-offs being honest about:
- Slightly more verbose at the connector use site (
(Integer) params[0].value()vsparams.getInt(0)). Real but small. - Java generics + arrays awkwardness;
Literal<?>[]produces unchecked-warning ceremony.List<Literal<?>>orLiteral<?>...varargs are cleaner alternatives. - The deprecated int-API → new-API dispatch shim still needed (single int →
Literal<Integer>).
cc @sunchao @szehon-ho — would value your read on this trade-off, given your existing reviews of ReducibleParameters. Should the API rebase on V2Literal, or stay with the new class as drafted?
### What changes were proposed in this pull request? - Add `Expression.collectAttributes(): Seq[Attribute]` for cases that need a positional `Seq` (preserving order and duplicates). - Migrate SPJ-related uses of `_.collectLeaves()` in `partitioning.scala` and `EnsureRequirements.scala` to: - `Expression.references` / `AttributeSet.fromAttributeSets(...)` where set semantics suffice (existence/size checks, single-attribute lookup). - The new `Expression.collectAttributes()` where positional `Seq[Attribute]` is required (`RowOrdering.create`, `reorder`, `attributes.zip(clustering)`). - Drop the now-redundant `.map(_.asInstanceOf[Attribute])` cast at the `EnsureRequirements` site that previously typed the result of `collectLeaves`. ### Why are the changes needed? `TreeNode.collectLeaves()` returns every node in the tree where `children.isEmpty` — including `Literal`s. SPJ planning has always wanted attributes only, but with the existing partition expression layout (`TransformExpression.children = [col]`, parameters carried in a sidecar `numBucketsOpt: Option[Int]` field), the difference didn't surface in practice. Follow-up work that puts literal parameters directly into `TransformExpression.children` (e.g. `bucket(Literal(numBuckets), col)`, `truncate(col, Literal(width))`) — see SPARK-50593 / apache#55885 — would otherwise force `TransformExpression` to override `collectLeaves` to filter literals, breaking the universal `TreeNode.collectLeaves` contract for one expression type. Fixing the abstraction layer by using the right helper at each call site is cleaner than overriding `collectLeaves` inside `TransformExpression`. ### Does this PR introduce _any_ user-facing change? No. Pure refactor — for current partition expression shapes (no literal children) `_.collectLeaves()`, `_.references`, and `_.collectAttributes()` produce identical results. ### How was this patch tested? Covered by existing test suites that exercise the migrated call sites: `KeyGroupedPartitioningSuite`, `EnsureRequirementsSuite`, `ProjectedOrderingAndPartitioningSuite`. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.7
### What changes were proposed in this pull request? - Document `AttributeSet`'s iteration-order contract on the class scaladoc: iteration via `iterator` / `foreach` / `flatMap` returns elements in insertion order (driven by the underlying `LinkedHashSet`). `toSeq` is called out as the explicit exception — it sorts by `(name, exprId.id)` for codegen stability (SPARK-18394). - Migrate seven SPJ-related uses of `_.collectLeaves()` in `partitioning.scala` and `EnsureRequirements.scala` to `_.references` / `AttributeSet.fromAttributeSets(...)`. Drops the now-redundant `.map(_.asInstanceOf[Attribute])` cast at the `EnsureRequirements:89` site. ### Why are the changes needed? `TreeNode.collectLeaves()` returns every node in the tree where `children.isEmpty`, including `Literal`s. SPJ planning has always wanted attributes only, but with the existing partition expression layout (`TransformExpression.children = [col]`, parameters carried in a sidecar `numBucketsOpt: Option[Int]` field), the difference didn't surface. Follow-up work (e.g. SPARK-50593 / apache#55885) that puts literal parameters directly into `TransformExpression.children` (`bucket(Literal(numBuckets), col)`, `truncate(col, Literal(width))`) would otherwise force `TransformExpression` to override `collectLeaves` to filter literals, breaking the universal `TreeNode.collectLeaves` contract for one expression type. `Expression.references` already returns attributes only (filtering literals and other non-attribute leaves), and its insertion-ordered iteration is exactly what positional binding (`RowOrdering.create`, `reorder`, `attributes.zip(clustering)`) requires. The per-partition-expression single-column rule (enforced by `KeyedPartitioning.supportsExpressions`) ensures within-expression dedup never matters here. Documenting the iteration-order contract lets these call sites rely on the order without implicit dependency on implementation detail. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Covered by existing test suites that exercise the migrated call sites: `KeyGroupedPartitioningSuite`, `EnsureRequirementsSuite`, `ProjectedOrderingAndPartitioningSuite`. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.7
### What changes were proposed in this pull request? - Document `AttributeSet`'s iteration-order contract on the class scaladoc: iteration via `iterator` / `foreach` / `flatMap` returns elements in insertion order (driven by the underlying `LinkedHashSet`). `toSeq` is called out as the explicit exception — it sorts by `(name, exprId.id)` for codegen stability (SPARK-18394). - Migrate seven SPJ-related uses of `_.collectLeaves()` in `partitioning.scala` and `EnsureRequirements.scala` to `_.references` / `AttributeSet.fromAttributeSets(...)`. Drops the now-redundant `.map(_.asInstanceOf[Attribute])` cast at the `EnsureRequirements:89` site. ### Why are the changes needed? `TreeNode.collectLeaves()` returns every node in the tree where `children.isEmpty`, including `Literal`s. SPJ planning has always wanted attributes only, but with the existing partition expression layout (`TransformExpression.children = [col]`, parameters carried in a sidecar `numBucketsOpt: Option[Int]` field), the difference didn't surface. Follow-up work (e.g. SPARK-50593 / apache#55885) that puts literal parameters directly into `TransformExpression.children` (`bucket(Literal(numBuckets), col)`, `truncate(col, Literal(width))`) would otherwise force `TransformExpression` to override `collectLeaves` to filter literals, breaking the universal `TreeNode.collectLeaves` contract for one expression type. `Expression.references` already returns attributes only (filtering literals and other non-attribute leaves), and its insertion-ordered iteration is exactly what positional binding (`RowOrdering.create`, `reorder`, `attributes.zip(clustering)`) requires. The per-partition-expression single-column rule (enforced by `KeyedPartitioning.supportsExpressions`) ensures within-expression dedup never matters here. Documenting the iteration-order contract lets these call sites rely on the order without implicit dependency on implementation detail. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Covered by existing test suites that exercise the migrated call sites: `KeyGroupedPartitioningSuite`, `EnsureRequirementsSuite`, `ProjectedOrderingAndPartitioningSuite`. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.7
What changes were proposed in this pull request?
This PR adds Storage Partitioned Join (SPJ) support for the
truncatepartition transform. The approach generalizes theReducibleFunctionAPI to accept arbitrary parameters via a newReducibleParameterscontainer, so SPJ can reason about any parameterized transform (bucket, truncate, future ones) through one code path.Key changes:
ReducibleParametersinorg.apache.spark.sql.connector.catalog.functions— a typed parameter container.ReducibleFunction.reducer(ReducibleParameters, ReducibleFunction, ReducibleParameters). The oldreducer(int, ..., int)is marked@Deprecatedbut preserved as the default fallback, so existing connector implementations (e.g., Iceberg 1.10.0) continue to work unchanged.TransformExpressionrefactor: literal parameters (e.g., bucketnumBuckets, truncatewidth) now live insidechildrenrather than a bespokenumBucketsOpt: Option[Int]field.collectLeaves()is overridden to filter literal parameters and return only column references.TransformExpressionthat extractsReducibleParametersfrom literal children and delegates to the new API; compatibility checks (isCompatible,reducers) work uniformly for bucket, truncate, etc.Why are the changes needed?
Today a join on tables partitioned by
truncate(col, N)always shuffles, even when both sides share identical partitioning. The write-side was fixed by SPARK-40295 (Allow v2 functions with literal args in write distribution and ordering), but the read/join side was never enabled.Previous work in #49211 (@szehon-ho) explored direct support for transforms with literal arguments by adjusting the SPJ paths to recognize them. This PR generalizes the reducer API so the compatibility check is function-agnostic, with a default method that delegates to the deprecated single-int signature for backward compatibility.
Does this PR introduce any user-facing change?
Yes, for connector/catalog authors:
ReducibleParameters.ReducibleFunction.reducer(ReducibleParameters, ...)with a default that delegates to the deprecated single-int signature for backward compatibility.LegacyBucketFunctiontest fixture.For end users, queries joining tables partitioned by compatible
truncatetransforms (identical widths, or reducible pairs liketruncate(3)andtruncate(5)) now avoid shuffle via SPJ.How was this patch tested?
6 New tests in
KeyGroupedPartitioningSuitecc @szehon-ho @aokolnychyi @sunchao @peter-toth
Was this patch authored or co-authored using generative AI tooling?
Yes — used only for test cases and Javadoc/Scaladoc comments.
Generated-by: Claude Code (Opus 4.7)