Skip to content

[SPARK-46367][SQL][FOLLOWUP] Assert same arity in projectKeyedPartitionings#55876

Closed
cloud-fan wants to merge 2 commits into
apache:masterfrom
cloud-fan:SPARK-46367-followup
Closed

[SPARK-46367][SQL][FOLLOWUP] Assert same arity in projectKeyedPartitionings#55876
cloud-fan wants to merge 2 commits into
apache:masterfrom
cloud-fan:SPARK-46367-followup

Conversation

@cloud-fan
Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan commented May 14, 2026

What changes were proposed in this pull request?

Follow-up to #55519.

PartitioningPreservingUnaryExecNode.projectKeyedPartitionings assumes all input KPs share the same partitionKeys, which implies the same expression arity. This invariant is asserted by GroupPartitionsExec and is established by every upstream constructor of PartitioningCollection that feeds this method (a join's PartitioningCollection(left.outputPartitioning, right.outputPartitioning) combines KPs that EnsureRequirements has aligned to the same join keys).

If the invariant is ever violated upstream, indexing kp.expressions(i) for i >= kp.expressions.length throws an opaque IndexOutOfBoundsException that points at this method rather than at the producer.

This PR adds an assert that surfaces the real cause with a clear message. The invariant is unchanged; this just turns silent misuse into a debuggable failure, so a producer-side bug can be fixed at its source.

Why are the changes needed?

Improve error message when an upstream node violates the same-arity invariant. No behavior change on valid plans.

Does this PR introduce any user-facing change?

No (assertion-only; planner internal).

How was this patch tested?

New unit test SPARK-46367: mixed-arity KeyedPartitionings in input fail with a clear assertion in ProjectedOrderingAndPartitioningSuite that constructs a mixed-arity PartitioningCollection child and verifies the assert fires with the expected message instead of throwing IndexOutOfBoundsException.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

@cloud-fan cloud-fan force-pushed the SPARK-46367-followup branch from a828e56 to 642ae0e Compare May 14, 2026 09:09
@cloud-fan cloud-fan changed the title [SPARK-46367][SQL][FOLLOWUP] Filter mixed-arity KPs in projectKeyedPartitionings [SPARK-46367][SQL][FOLLOWUP] Group mixed-arity KPs by arity in projectKeyedPartitionings May 14, 2026
@cloud-fan
Copy link
Copy Markdown
Contributor Author

cc @peter-toth

@peter-toth
Copy link
Copy Markdown
Contributor

peter-toth commented May 14, 2026

@cloud-fan , I don't think this is valid.

There simply can't be mixed-arity KeyedPartitionings in a PartitioningCollection. This is an invariant required by GroupPartitions as well:

// There can be multiple `KeyedPartitioning` in an output partitioning of a join, but they
// can only differ in `expressions`. `partitionKeys` must match so we can calculate it only
// once via `groupedPartitions`.

and actualy, it doesn't make sense to have/keep lower arity KPs in a collection if there are higher arity ones in it because the expressions of the lower must be the subset of the expressions of the higher.
If we know more details of the keys of partitions, like KP([a, b], [(1, 1), (2, 2)]) so we know that a = 1 and b = 1 in the first; and a = 2 and b = 2 in the second partition, why would we have/keep the less granular KP([a], [(1), (2)]) or KP([b], [(1), (2)]) in the collection?
Maybe the wording of the comment is not the best but the "partitionKeys must match" means that arity of KPs in a collection must be the same.

projectKeyedPartitionings() assumes that its input holds that invariant and because it produces only the maximum achievable granularity (arity) KPs it also maintains that invariant.

…onings

### What changes were proposed in this pull request?
Follow-up to apache#55519.

`PartitioningPreservingUnaryExecNode.projectKeyedPartitionings` assumes all
input KPs share the same `partitionKeys`, which implies the same expression
arity. This invariant is asserted by `GroupPartitionsExec` and is established
by every upstream constructor of `PartitioningCollection` that feeds this
method (a join's `PartitioningCollection(left.outputPartitioning, right.outputPartitioning)`
combines KPs that `EnsureRequirements` has aligned to the same join keys).

If the invariant is ever violated upstream, indexing `kp.expressions(i)` for
`i >= kp.expressions.length` throws an opaque `IndexOutOfBoundsException`
that points at this method rather than at the producer.

Add an `assert` that surfaces the real cause with a clear message. The
invariant is unchanged; this just turns silent misuse into a debuggable
failure, so a producer-side bug can be fixed at its source.

### Why are the changes needed?
Improve error message when an upstream node violates the same-arity
invariant. No behavior change on valid plans.

### Does this PR introduce _any_ user-facing change?
No (assertion-only; planner internal).

### How was this patch tested?
New unit test `SPARK-46367: mixed-arity KeyedPartitionings in input fail with
a clear assertion` in `ProjectedOrderingAndPartitioningSuite` that constructs
a mixed-arity `PartitioningCollection` child and verifies the assert fires
with the expected message instead of throwing `IndexOutOfBoundsException`.

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7
@cloud-fan cloud-fan force-pushed the SPARK-46367-followup branch from 642ae0e to b5fd4be Compare May 14, 2026 14:35
@cloud-fan cloud-fan changed the title [SPARK-46367][SQL][FOLLOWUP] Group mixed-arity KPs by arity in projectKeyedPartitionings [SPARK-46367][SQL][FOLLOWUP] Assert same arity in projectKeyedPartitionings May 14, 2026
@cloud-fan
Copy link
Copy Markdown
Contributor Author

@peter-toth you're right — the invariant should stay strict. I've reworked this PR to just add an assert so an upstream violation surfaces with a clear message instead of an opaque IndexOutOfBoundsException. The producer-side bug I was working around lives in our internal fork; I'll fix it there. PTAL.

// that have been aligned by [[EnsureRequirements]] to the same join keys). If the invariant
// is ever violated upstream, fail early with a clear message instead of throwing an opaque
// `IndexOutOfBoundsException` from `kp.expressions(i)` below.
assert(kps.forall(_.expressions.length == numPositions),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you could use kps.tail.forall(...).

Please note that in GroupPartitionsExec we use .partitionKeys comparison, which is even more stricter, but more costly.

Copy link
Copy Markdown
Contributor

@peter-toth peter-toth May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should move these invariant checks into PartitioningCollection.
Maybe expressions.length = + partitionKeys eq checks on KPs for maximum efficiency.

Copy link
Copy Markdown
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @cloud-fan and @peter-toth .

Use `kps.tail.forall(...)` since `numPositions` is taken from `kps.head`.
@cloud-fan
Copy link
Copy Markdown
Contributor Author

cloud-fan commented May 15, 2026

The parquet test failure is unrelated, thanks for review, merging to master/4.x!

@cloud-fan cloud-fan closed this in 2530561 May 15, 2026
cloud-fan added a commit that referenced this pull request May 15, 2026
…onings

### What changes were proposed in this pull request?
Follow-up to #55519.

`PartitioningPreservingUnaryExecNode.projectKeyedPartitionings` assumes all input KPs share the same `partitionKeys`, which implies the same expression arity. This invariant is asserted by `GroupPartitionsExec` and is established by every upstream constructor of `PartitioningCollection` that feeds this method (a join's `PartitioningCollection(left.outputPartitioning, right.outputPartitioning)` combines KPs that `EnsureRequirements` has aligned to the same join keys).

If the invariant is ever violated upstream, indexing `kp.expressions(i)` for `i >= kp.expressions.length` throws an opaque `IndexOutOfBoundsException` that points at this method rather than at the producer.

This PR adds an `assert` that surfaces the real cause with a clear message. The invariant is unchanged; this just turns silent misuse into a debuggable failure, so a producer-side bug can be fixed at its source.

### Why are the changes needed?
Improve error message when an upstream node violates the same-arity invariant. No behavior change on valid plans.

### Does this PR introduce _any_ user-facing change?
No (assertion-only; planner internal).

### How was this patch tested?
New unit test `SPARK-46367: mixed-arity KeyedPartitionings in input fail with a clear assertion` in `ProjectedOrderingAndPartitioningSuite` that constructs a mixed-arity `PartitioningCollection` child and verifies the assert fires with the expected message instead of throwing `IndexOutOfBoundsException`.

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7

Closes #55876 from cloud-fan/SPARK-46367-followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2530561)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants