
[SPARK-38625][SQL] DataSource V2: Add APIs for group-based row-level operations #35940

Closed · wants to merge 1 commit

Conversation

@aokolnychyi (Contributor):

What changes were proposed in this pull request?

This PR contains row-level operation APIs for V2 data sources that can replace groups of data (e.g. files, partitions). It is a subset of the changes already reviewed in #35395.
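For context, "replacing groups of data" means copy-on-write semantics: a source that cannot mutate individual rows instead rewrites every group (e.g. a file) that contains a matching row. A minimal sketch with hypothetical stand-in types (not the actual API in this PR):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical illustration of a group-based (copy-on-write) DELETE:
// untouched groups are kept as-is, affected groups are rewritten
// with only the surviving rows.
public class GroupBasedDeleteSketch {
  record Row(int id, String value) {}
  record DataFile(String name, List<Row> rows) {}

  static List<DataFile> delete(List<DataFile> files, Predicate<Row> condition) {
    List<DataFile> result = new ArrayList<>();
    for (DataFile file : files) {
      if (file.rows().stream().noneMatch(condition)) {
        result.add(file); // no matching rows: keep the group untouched
      } else {
        // affected group: rewrite it with the surviving rows only
        List<Row> remaining = file.rows().stream()
            .filter(condition.negate())
            .collect(Collectors.toList());
        result.add(new DataFile(file.name() + "-rewritten", remaining));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<DataFile> files = List.of(
        new DataFile("f1", List.of(new Row(1, "a"), new Row(2, "b"))),
        new DataFile("f2", List.of(new Row(3, "c"))));
    List<DataFile> after = delete(files, r -> r.id() == 2);
    System.out.println(after.get(0).name());        // f1-rewritten
    System.out.println(after.get(0).rows().size()); // 1
    System.out.println(after.get(1).name());        // f2
  }
}
```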

Why are the changes needed?

These changes are needed to support row-level operations in Spark per SPIP SPARK-35801.

Does this PR introduce any user-facing change?

Yes, this PR adds new Data Source V2 APIs.

How was this patch tested?

Not applicable.

@github-actions github-actions bot added the SQL label Mar 22, 2022
@aokolnychyi (Contributor, Author):

cc @viirya @huaxingao @dongjoon-hyun @sunchao @cloud-fan @rdblue @HyukjinKwon

Could you take a look at this PR? It is a subset of changes in #35395.

@rdblue (Contributor) left a comment:

Looks good to me. +1 for adding the API first.

@aokolnychyi (Contributor, Author):

How does this one look, @cloud-fan?

/**
 * Returns a {@link RowLevelOperation} that controls how Spark rewrites data
 * for DELETE, UPDATE, MERGE commands.
 */
RowLevelOperation build();
Contributor:

I'm not against this API design, but I want to put forward an idea: shall we have different build methods for UPDATE, DELETE, and MERGE, like buildDelete? Then we wouldn't need RowLevelOperationInfo:

public interface SupportsRowLevelOperations extends Table {
  RowLevelOperationBuilder newRowLevelOperationBuilder(CaseInsensitiveStringMap options);
}

public interface RowLevelOperationBuilder {
  RowLevelOperation buildDelete();
  RowLevelOperation buildUpdate();
  RowLevelOperation buildMerge();
}

Contributor:

Or we can keep the enum:

public interface RowLevelOperationBuilder {
  RowLevelOperation build(Command command);
}

Contributor (Author):

Either way would work for me too.

I tried to match LogicalWriteInfo. One benefit of having an extra class is that we can extend it to pass more information in the future. Some of those future requirements are hard to predict when an API is designed. For instance, I plan to add a few new methods to LogicalWriteInfo for delta-based operations (e.g. row ID schema). However, there is no guarantee that will ever be required for RowLevelOperationInfo.
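For illustration, the trade-off can be sketched with simplified stand-in types (not the actual Spark connector interfaces): with an info object, future requirements land on RowLevelOperationInfo without changing any builder signature.

```java
import java.util.Map;

// Simplified stand-ins for the interfaces under discussion; the real types
// live in Spark's connector API and carry more detail.
public class InfoObjectSketch {
  enum Command { DELETE, UPDATE, MERGE }

  interface RowLevelOperationInfo {
    Command command();
    Map<String, String> options();
    // Future requirements (e.g. a row ID schema for delta-based operations)
    // could be added here without touching any build method signature.
  }

  interface RowLevelOperation {
    Command command();
  }

  interface RowLevelOperationBuilder {
    RowLevelOperation build();
  }

  // A toy builder that simply echoes the command from the info object.
  static RowLevelOperationBuilder newRowLevelOperationBuilder(RowLevelOperationInfo info) {
    return () -> {
      RowLevelOperation op = info::command;
      return op;
    };
  }

  public static void main(String[] args) {
    RowLevelOperationInfo info = new RowLevelOperationInfo() {
      @Override public Command command() { return Command.UPDATE; }
      @Override public Map<String, String> options() { return Map.of(); }
    };
    System.out.println(newRowLevelOperationBuilder(info).build().command()); // UPDATE
  }
}
```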

Any preference, @viirya @huaxingao @rdblue?

Member:

Hmm, I think one benefit of RowLevelOperationInfo, as @aokolnychyi said, is that it gives us some room for more information in the future. I'm afraid that if we take the buildUpdate, etc. approach, we would need to extend the interface whenever more information is required.

Contributor (Author):

Shall we stick to the current API just in case then?

Member:

Let's wait for @cloud-fan's feedback?

Contributor (Author):

Sorry, my question was directed to @cloud-fan :)

Member:

😄

Contributor:

I know predicting the future is very hard, but I'm afraid we may over-design this. We have both RowLevelOperationInfo and the builder pattern to pass required information in the future; shall we keep only one of them?

Contributor:

On second thought, the read API doesn't have a ScanInformation class either. We pass options to create the ScanBuilder, then use different mix-in traits to pass scan information, and finally build the scan. The rationale is: information that is always required goes to the builder, and the builder can use mix-in traits to take optional information.

That said, the current API design does follow the existing API style. +1 from me.
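The read-side pattern described above can be sketched with simplified stand-in traits (the real Spark mix-ins, e.g. SupportsPushDownRequiredColumns, are richer): the engine probes for optional capabilities with instanceof instead of widening the builder's required surface.

```java
// Simplified stand-ins for the read-API style under discussion:
// required info goes into the builder's creation, optional info
// arrives through mix-in interfaces the builder may implement.
public class MixInSketch {
  interface Scan {}

  interface ScanBuilder {
    Scan build();
  }

  // Optional capability; the engine checks for it with instanceof.
  interface SupportsPruneColumns extends ScanBuilder {
    void pruneColumns(String[] requiredColumns);
  }

  static class MyScanBuilder implements SupportsPruneColumns {
    String[] columns = new String[0];
    @Override public void pruneColumns(String[] requiredColumns) { this.columns = requiredColumns; }
    @Override public Scan build() { return new Scan() {}; }
  }

  public static void main(String[] args) {
    ScanBuilder builder = new MyScanBuilder();
    if (builder instanceof SupportsPruneColumns p) {
      p.pruneColumns(new String[] {"id"}); // optional info flows in via the mix-in
    }
    Scan scan = builder.build();
    System.out.println(((MyScanBuilder) builder).columns.length); // 1
  }
}
```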

@dongjoon-hyun (Member) left a comment:

+1, LGTM. Thank you, @aokolnychyi and all.
Regarding @cloud-fan's comment, I also support the as-is design.

@cloud-fan (Contributor):

thanks, merging to master/3.3!

@cloud-fan cloud-fan closed this in 6743aaa Mar 24, 2022
cloud-fan pushed a commit that referenced this pull request Mar 24, 2022
[SPARK-38625][SQL] DataSource V2: Add APIs for group-based row-level operations

(Commit message matches the PR description above.)

Closes #35940 from aokolnychyi/spark-38625.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6743aaa)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@viirya (Member) commented on Mar 24, 2022:

Thanks @cloud-fan, @aokolnychyi and all!

@aokolnychyi (Contributor, Author):

Thanks for reviewing, @cloud-fan @viirya @dongjoon-hyun @huaxingao @rdblue!
