[SPARK-40946][SQL] Add a new DataSource V2 interface SupportsPushDownClusterKeys #38434

Closed
huaxingao wants to merge 5 commits into apache:master from huaxingao:supportsPushDownClusterKeys

@huaxingao
Contributor

What changes were proposed in this pull request?

/**
 * A mix-in interface for {@link ScanBuilder}. Data sources can implement this interface to
 * receive all the join or aggregate keys from Spark. A return value of true indicates
 * that the data source will return input partitions (via {@code planInputPartitions}) clustered
 * by the given keys. A return value of false indicates that the data source makes no
 * such guarantee, even though it may still report a partitioning that may or may not
 * be compatible with the given clustering keys. In that case it is Spark's responsibility to
 * group the input partitions where that is applicable.
 *
 * @since 3.4.0
 */
@Evolving
public interface SupportsPushDownClusterKeys extends ScanBuilder {

Why are the changes needed?

Pass the join key information down to v2 data sources so that they can decide how to combine the input splits according to the join keys.
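To illustrate the intent, here is a minimal, self-contained sketch of how a source might answer such a push-down. It does not depend on Spark: `ScanBuilder` is a local stand-in, and the method name `pushClusterKeys`, its `List<String>` parameter, and `BucketedScanBuilder` are all assumptions made for illustration, not the signature proposed in this PR.

```java
import java.util.List;

public class ClusterKeysSketch {
    // Local stand-in for Spark's ScanBuilder so the sketch compiles on its own.
    interface ScanBuilder {}

    // Hypothetical shape of the mix-in; method name and parameter type are assumptions.
    interface SupportsPushDownClusterKeys extends ScanBuilder {
        boolean pushClusterKeys(List<String> clusterKeys);
    }

    // Hypothetical source whose splits are laid out by a fixed set of columns.
    static final class BucketedScanBuilder implements SupportsPushDownClusterKeys {
        private final List<String> layoutColumns;

        BucketedScanBuilder(List<String> layoutColumns) {
            this.layoutColumns = layoutColumns;
        }

        @Override
        public boolean pushClusterKeys(List<String> clusterKeys) {
            // Rows that agree on all cluster keys also agree on the layout columns
            // when the layout columns are a subset of the keys, so they share a split.
            return !layoutColumns.isEmpty() && clusterKeys.containsAll(layoutColumns);
        }
    }

    public static void main(String[] args) {
        BucketedScanBuilder builder = new BucketedScanBuilder(List.of("id"));
        System.out.println(builder.pushClusterKeys(List.of("id", "date")));
        System.out.println(builder.pushClusterKeys(List.of("date")));
    }
}
```

When the source returns false, the description above implies Spark falls back to grouping the input partitions itself.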

Does this PR introduce any user-facing change?

Yes. This PR adds a new interface, SupportsPushDownClusterKeys.

How was this patch tested?

new tests

@github-actions github-actions bot added the SQL label Oct 29, 2022
@huaxingao
Contributor Author

@cloud-fan Could you please take a look when you have some time? Thanks!

@cloud-fan
Contributor

I think this needs a bit more design. Partitioning is a physical property, so it's very weird to "push it down" during the logical phase. I think what we really need is to track the requirement during top-down planning. For example, when we plan a sort merge join, we should track the requirement (partitioned and ordered by the join keys) while planning the join's children. This idea also comes from the Volcano optimizer and is a widely adopted technique.
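A rough sketch of what tracking the requirement top-down could look like, as a toy planner that is not Spark code: the join derives a requirement from its keys and hands it to each child, which adds a shuffle only when its own partitioning does not satisfy it. All names here (`Requirement`, `Scan`, `planChild`, `planSortMergeJoin`) are hypothetical.

```java
import java.util.List;

public class TopDownRequirementSketch {
    // Hypothetical physical requirement that a parent operator places on a child.
    static final class Requirement {
        final List<String> partitionKeys;
        final List<String> sortKeys;

        Requirement(List<String> partitionKeys, List<String> sortKeys) {
            this.partitionKeys = partitionKeys;
            this.sortKeys = sortKeys;
        }
    }

    // Hypothetical leaf scan that reports how its output is partitioned.
    static final class Scan {
        final String table;
        final List<String> partitionedBy;

        Scan(String table, List<String> partitionedBy) {
            this.table = table;
            this.partitionedBy = partitionedBy;
        }
    }

    // Plan a child under the requirement handed down by its parent.
    static String planChild(Scan scan, Requirement required) {
        boolean satisfied = !scan.partitionedBy.isEmpty()
            && required.partitionKeys.containsAll(scan.partitionedBy);
        String plan = "Scan(" + scan.table + ")";
        if (!satisfied) {
            // The scan's partitioning is not compatible, so a shuffle is added.
            plan = "Shuffle(" + required.partitionKeys + ", " + plan + ")";
        }
        // This toy scan reports no ordering, so a sort is always added.
        return "Sort(" + required.sortKeys + ", " + plan + ")";
    }

    // The join creates the requirement from its keys and passes it to both children.
    static String planSortMergeJoin(Scan left, Scan right, List<String> joinKeys) {
        Requirement required = new Requirement(joinKeys, joinKeys);
        return "SortMergeJoin(" + planChild(left, required) + ", "
            + planChild(right, required) + ")";
    }

    public static void main(String[] args) {
        Scan a = new Scan("a", List.of("id"));
        Scan b = new Scan("b", List.of());
        System.out.println(planSortMergeJoin(a, b, List.of("id")));
    }
}
```

In this shape the scan learns the requirement at physical planning time, which is where a partitioning guarantee would naturally be negotiated.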

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 18, 2023
@github-actions github-actions bot closed this Feb 19, 2023