[SPARK-38625][SQL] DataSource V2: Add APIs for group-based row-level operations #35940
Conversation
cc @viirya @huaxingao @dongjoon-hyun @sunchao @cloud-fan @rdblue @HyukjinKwon Could you take a look at this PR? It is a subset of the changes in #35395.
Looks good to me. +1 for adding the API first.
How does this one look, @cloud-fan?
/**
 * Returns a {@link RowLevelOperation} that controls how Spark rewrites data
 * for DELETE, UPDATE, MERGE commands.
 */
RowLevelOperation build();
I'm not against this API design but just want to put out my idea: shall we have different build methods for UPDATE, DELETE, MERGE, like buildDelete? Then we don't need RowLevelOperationInfo.
public interface SupportsRowLevelOperations extends Table {
RowLevelOperationBuilder newRowLevelOperationBuilder(CaseInsensitiveStringMap options);
}
public interface RowLevelOperationBuilder {
RowLevelOperation buildDelete();
RowLevelOperation buildUpdate();
RowLevelOperation buildMerge();
}
or we can keep the enum
public interface RowLevelOperationBuilder {
RowLevelOperation build(Command command);
}
Either way would work for me too.
I tried to match LogicalWriteInfo. One benefit of having an extra class is that we can extend it to pass more information in the future. Some of those future requirements are hard to predict when an API is designed. For instance, I plan to add a few new methods to LogicalWriteInfo for delta-based operations (e.g. a row ID schema). However, there is no guarantee that will ever be required for RowLevelOperationInfo.
Any preference, @viirya @huaxingao @rdblue?
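The "extra info class" argument above can be sketched in a few lines. The names (RowLevelOperationInfo, command, options) mirror the PR discussion, but the exact shapes below are illustrative assumptions, not the final Spark interfaces:

```java
import java.util.Collections;
import java.util.Map;

public class InfoCarrierSketch {
    enum Command { DELETE, UPDATE, MERGE }

    // Bundling all builder inputs behind one interface leaves room to add
    // new accessors later (e.g. a row ID schema for delta-based operations)
    // without changing the builder's method signature.
    interface RowLevelOperationInfo {
        Command command();
        Map<String, String> options();
    }

    public static void main(String[] args) {
        RowLevelOperationInfo info = new RowLevelOperationInfo() {
            @Override public Command command() { return Command.DELETE; }
            @Override public Map<String, String> options() { return Collections.emptyMap(); }
        };
        if (info.command() != Command.DELETE || !info.options().isEmpty()) {
            throw new AssertionError("unexpected info contents");
        }
        System.out.println("command=" + info.command());
    }
}
```

With separate buildDelete/buildUpdate/buildMerge methods, adding a new input later would instead require new overloads or a breaking signature change.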
Hmm, I think one benefit of RowLevelOperationInfo, as @aokolnychyi said, is that it gives us some room for more information in the future. I'm afraid that if we take the buildUpdate, etc. approach, we may need to extend it when we need more information.
Shall we stick to the current API just in case then?
Let's wait for @cloud-fan's feedback?
Sorry, my question was directed to @cloud-fan :)
😄
I know predicting the future is very hard, but I'm afraid we may over-design this. We would have both RowLevelOperationInfo and the builder pattern to pass required information in the future; shall we keep only one of them?
On second thought, the read API doesn't have a ScanInformation either. We pass options to create the ScanBuilder, then use different mix-in traits to pass scan information, and finally build the scan. The rationale is that we should pass information that is always required to the builder, and the builder can use mix-in traits to take optional information.
That said, the current API design does follow the existing API style. +1 from me.
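The mix-in style described in that comment can be sketched as follows. ScanBuilder and pruneColumns echo the DSv2 read API, but the bodies here are illustrative placeholders, not Spark source:

```java
public class MixInSketch {
    interface Scan { }

    // Base builder: only what every source must support.
    interface ScanBuilder {
        Scan build();
    }

    // Optional capability, mixed in only by sources that can prune columns.
    interface SupportsPushDownRequiredColumns extends ScanBuilder {
        void pruneColumns(String[] requiredColumns);
    }

    static class MyScanBuilder implements SupportsPushDownRequiredColumns {
        String[] columns = new String[0];
        @Override public void pruneColumns(String[] requiredColumns) { this.columns = requiredColumns; }
        @Override public Scan build() { return new Scan() { }; }
    }

    public static void main(String[] args) {
        ScanBuilder builder = new MyScanBuilder();
        // Engine-side code probes for optional capabilities at runtime.
        if (builder instanceof SupportsPushDownRequiredColumns) {
            ((SupportsPushDownRequiredColumns) builder).pruneColumns(new String[] {"id"});
        }
        if (((MyScanBuilder) builder).columns.length != 1) {
            throw new AssertionError("column pruning was not applied");
        }
        System.out.println("pruned to: " + ((MyScanBuilder) builder).columns[0]);
    }
}
```

Required inputs travel through the factory method; everything optional is negotiated through capability interfaces, so the base builder never grows.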
+1, LGTM. Thank you, @aokolnychyi and all.
Regarding @cloud-fan's comment, I also support the as-is design.
thanks, merging to master/3.3!
Closes #35940 from aokolnychyi/spark-38625.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6743aaa)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Thanks @cloud-fan, @aokolnychyi and all!
Thanks for reviewing, @cloud-fan @viirya @dongjoon-hyun @huaxingao @rdblue!
What changes were proposed in this pull request?
This PR contains row-level operation APIs for V2 data sources that can replace groups of data (e.g. files, partitions). It is a subset of the changes already reviewed in #35395.
Why are the changes needed?
These changes are needed to support row-level operations in Spark per SPIP SPARK-35801.
Does this PR introduce any user-facing change?
Yes, this PR adds new Data Source V2 APIs.
How was this patch tested?
Not applicable.
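To make the shape of the merged API concrete, here is a hedged end-to-end sketch of how a connector might plug in. The interface and method names follow the PR discussion (SupportsRowLevelOperations, RowLevelOperationBuilder, RowLevelOperationInfo); the table and operation bodies are placeholders, not real Spark code:

```java
public class ConnectorSketch {
    enum Command { DELETE, UPDATE, MERGE }

    interface RowLevelOperation { Command command(); }

    interface RowLevelOperationBuilder { RowLevelOperation build(); }

    interface RowLevelOperationInfo { Command command(); }

    // A table advertises row-level support by mixing in this trait.
    interface SupportsRowLevelOperations {
        RowLevelOperationBuilder newRowLevelOperationBuilder(RowLevelOperationInfo info);
    }

    static class MyTable implements SupportsRowLevelOperations {
        @Override
        public RowLevelOperationBuilder newRowLevelOperationBuilder(RowLevelOperationInfo info) {
            return new RowLevelOperationBuilder() {
                @Override public RowLevelOperation build() {
                    // Trivial operation that just echoes the requested command.
                    return new RowLevelOperation() {
                        @Override public Command command() { return info.command(); }
                    };
                }
            };
        }
    }

    public static void main(String[] args) {
        RowLevelOperationInfo info = () -> Command.DELETE;
        RowLevelOperation op = new MyTable().newRowLevelOperationBuilder(info).build();
        if (op.command() != Command.DELETE) {
            throw new AssertionError("operation should carry the DELETE command");
        }
        System.out.println("built operation for " + op.command());
    }
}
```

Spark resolves the command against the table, builds the operation through the info carrier, and the source then decides which groups of data (files, partitions) to rewrite.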