[SPARK-40946][SQL] Add a new DataSource V2 interface SupportsPushDownClusterKeys by huaxingao · Pull Request #38434 · apache/spark

huaxingao · 2022-10-29T05:44:11Z

What changes were proposed in this pull request?

/**
 * A mix-in interface for {@link ScanBuilder}. Data sources can implement this interface to
 * have all the join or aggregate keys pushed down to them. A return value of true indicates
 * that the data source will return input partitions (via {@code planInputPartitions})
 * clustered by those keys. A return value of false indicates that the data source makes no
 * such guarantee, even though it may still report a partitioning that may or may not
 * be compatible with the given clustering keys; in that case it is Spark's responsibility
 * to group the input partitions where that is applicable.
 *
 * @since 3.4.0
 */
@Evolving
public interface SupportsPushDownClusterKeys extends ScanBuilder {

Why are the changes needed?

Pass the join keys down to v2 data sources so that the data sources can decide how to combine the input splits according to the join keys.
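To make the intent concrete, here is a minimal sketch of how a data source might react to pushed-down keys. It does not use Spark's real DataSource V2 types: `ClusterKeySketch`, `Split`, `PushDownClusterKeys`, `pushClusterKeys`, and `SketchScanBuilder` are all hypothetical names invented for illustration, and the actual method signature in this PR may differ. The sketch assumes that each split carries the partition-column values of its data and that split paths are unique.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; none of these are Spark's real DataSource V2 types.
final class ClusterKeySketch {

    /** An input split tagged with the partition-column values of the data it holds. */
    static final class Split {
        final String path;
        final Map<String, String> partitionValues;

        Split(String path, Map<String, String> partitionValues) {
            this.path = path;
            this.partitionValues = partitionValues;
        }
    }

    /** Assumed shape of the pushdown hook; the method name is invented for this sketch. */
    interface PushDownClusterKeys {
        boolean pushClusterKeys(List<String> keys);
    }

    static final class SketchScanBuilder implements PushDownClusterKeys {
        private final List<String> partitionColumns;
        private final List<Split> splits;
        private List<String> clusterKeys = new ArrayList<>();

        SketchScanBuilder(List<String> partitionColumns, List<Split> splits) {
            this.partitionColumns = partitionColumns;
            this.splits = splits;
        }

        @Override
        public boolean pushClusterKeys(List<String> keys) {
            // Promise clustering only when every key is a partition column.
            if (keys.isEmpty() || !partitionColumns.containsAll(keys)) {
                return false;
            }
            clusterKeys = new ArrayList<>(keys);
            return true;
        }

        /** Returns groups of splits; with accepted keys, one group per distinct key value. */
        List<List<Split>> planInputPartitions() {
            Map<List<String>, List<Split>> groups = new LinkedHashMap<>();
            for (Split split : splits) {
                List<String> groupKey = new ArrayList<>();
                if (clusterKeys.isEmpty()) {
                    // No keys accepted: keep each split in its own group (paths assumed unique).
                    groupKey.add(split.path);
                } else {
                    for (String key : clusterKeys) {
                        groupKey.add(split.partitionValues.get(key));
                    }
                }
                groups.computeIfAbsent(groupKey, ignored -> new ArrayList<>()).add(split);
            }
            return new ArrayList<>(groups.values());
        }
    }
}
```

In this sketch, returning false leaves the splits ungrouped, which corresponds to the case where the source makes no guarantee and any grouping is left to Spark.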

Does this PR introduce any user-facing change?

Yes, a new interface, SupportsPushDownClusterKeys.

How was this patch tested?

new tests


huaxingao · 2022-11-08T06:39:42Z

@cloud-fan Could you please take a look when you have some time? Thanks!

cloud-fan · 2022-11-09T13:01:05Z

I think this needs a bit more design. Partitioning is a physical property, and it's very weird to "push it down" at the logical phase. I think what we really need is to track the requirement when doing top-down planning. For example, when we are planning a sort merge join, we should track the requirement (partitioned and ordered by join keys) when planning the join children. This idea also comes from the Volcano optimizer and is a widely adopted technique.
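As a rough illustration of the top-down approach described in this comment (not Spark's planner code), the sketch below passes a required physical property from a join to its children while planning. Every name here (`RequiredPropertySketch`, `Requirement`, `Scan`, `Join`, `plan`) is hypothetical, and the "physical plan" is just a string so the flow of the requirement is easy to see.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy planner; it only shows how a requirement flows from parent to child.
final class RequiredPropertySketch {

    /** What a parent needs from a child's output: clustering keys and sort keys. */
    static final class Requirement {
        static final Requirement NONE = new Requirement(new ArrayList<>(), new ArrayList<>());
        final List<String> clusterKeys;
        final List<String> sortKeys;

        Requirement(List<String> clusterKeys, List<String> sortKeys) {
            this.clusterKeys = clusterKeys;
            this.sortKeys = sortKeys;
        }
    }

    interface LogicalNode {}

    static final class Scan implements LogicalNode {
        final String table;

        Scan(String table) {
            this.table = table;
        }
    }

    static final class Join implements LogicalNode {
        final LogicalNode left;
        final LogicalNode right;
        final List<String> leftKeys;
        final List<String> rightKeys;

        Join(LogicalNode left, LogicalNode right, List<String> leftKeys, List<String> rightKeys) {
            this.left = left;
            this.right = right;
            this.leftKeys = leftKeys;
            this.rightKeys = rightKeys;
        }
    }

    /** Plans a node given the requirement its parent placed on it. */
    static String plan(LogicalNode node, Requirement required) {
        if (node instanceof Join) {
            Join join = (Join) node;
            // A sort merge join needs each child partitioned and ordered by its join keys.
            // This sketch does not check the join's own output against `required`.
            String left = plan(join.left, new Requirement(join.leftKeys, join.leftKeys));
            String right = plan(join.right, new Requirement(join.rightKeys, join.rightKeys));
            return "SortMergeJoin(" + left + ", " + right + ")";
        }
        Scan scan = (Scan) node;
        if (required.clusterKeys.isEmpty()) {
            return "Scan(" + scan.table + ")";
        }
        // The scan sees the requirement at physical planning time and can try to satisfy it.
        return "Scan(" + scan.table + ", clusteredBy=" + required.clusterKeys
            + ", sortedBy=" + required.sortKeys + ")";
    }
}
```

The point of the sketch is that the scan learns the keys at physical planning time, as a requirement from its parent, rather than through a logical-phase pushdown.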

github-actions · 2023-02-18T00:20:41Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

[SPARK-40946][SQL] Add a new DataSource V2 interface SupportsPushDown…

5ae3319

…ClusterKeys

github-actions bot added the SQL label Oct 29, 2022

fix build failure

1d85b8b

github-actions bot added the STRUCTURED STREAMING label Oct 29, 2022

remove unused import

6d42f20

sunchao mentioned this pull request Oct 31, 2022

Core: Add a util method to combine tasks by partition apache/iceberg#2276

Merged

huaxingao added 2 commits November 2, 2022 20:13

fix test failure

516b525

fix test failure

b54658e

github-actions bot added the Stale label Feb 18, 2023

github-actions bot closed this Feb 19, 2023

huaxingao wants to merge 5 commits into apache:master from huaxingao:supportsPushDownClusterKeys
