Skip to content

Conversation

@chirag-s-db
Copy link
Contributor

What changes were proposed in this pull request?

When a KeyGroupedShuffleSpec is used to shuffle another child of a JOIN, we must be able to push down JOIN keys or partition values to be able to ensure that both children have matching partitioning. If one child reports a KeyGroupedPartitioning but we can't push down these values (for example, if the child was a key-grouped scan that was checkpointed), then this information cannot be pushed down to the child scan and we should avoid using this shuffle spec to shuffle other children.

Why are the changes needed?

Prevents potential correctness issue when key-grouped partitioning is used on a checkpointed RDD.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

See test changes.

Was this patch authored or co-authored using generative AI tooling?

No.

…positions can be fully pushed down

### What changes were proposed in this pull request?
When a KeyGroupedShuffleSpec is used to shuffle another child of a JOIN, we must be able to push down JOIN keys or partition values to be able to ensure that both children have matching partitioning. If one child reports a KeyGroupedPartitioning but we can't push down these values (for example, if the child was a key-grouped scan that was checkpointed), then this information cannot be pushed down to the child scan and we should avoid using this shuffle spec to shuffle other children.

### Why are the changes needed?
Prevents potential correctness issue when key-grouped partitioning is used on a checkpointed RDD.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
See test changes.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#53098 from chirag-s-db/checkpoint-pushdown.

Lead-authored-by: Chirag Singh <chirag.singh@databricks.com>
Co-authored-by: Chirag Singh <137233133+chirag-s-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@chirag-s-db
Copy link
Contributor Author

FYI @szehon-ho @cloud-fan @sunchao

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-53322][SQL][Backport 4.1] Select a KeyGroupedShuffleSpec only when join key positions can be fully pushed down [SPARK-53322][SQL][4.1] Select a KeyGroupedShuffleSpec only when join key positions can be fully pushed down Nov 19, 2025
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

cc @peter-toth

dongjoon-hyun pushed a commit that referenced this pull request Nov 19, 2025
… key positions can be fully pushed down

### What changes were proposed in this pull request?
When a KeyGroupedShuffleSpec is used to shuffle another child of a JOIN, we must be able to push down JOIN keys or partition values to be able to ensure that both children have matching partitioning. If one child reports a KeyGroupedPartitioning but we can't push down these values (for example, if the child was a key-grouped scan that was checkpointed), then this information cannot be pushed down to the child scan and we should avoid using this shuffle spec to shuffle other children.

### Why are the changes needed?
Prevents potential correctness issue when key-grouped partitioning is used on a checkpointed RDD.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
See test changes.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #53132 from chirag-s-db/checkpoint-pushdown-4.1.

Authored-by: Chirag Singh <chirag.singh@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Copy link
Member

Merged to master for Apache Spark 4.1.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants