[SPARK-38625][SQL] DataSource V2: Add APIs for group-based row-level operations #35940
Conversation
cc @viirya @huaxingao @dongjoon-hyun @sunchao @cloud-fan @rdblue @HyukjinKwon Could you take a look at this PR? It is a subset of the changes in #35395.
Looks good to me. +1 for adding the API first.
How does this one look, @cloud-fan?
/**
 * Returns a {@link RowLevelOperation} that controls how Spark rewrites data
 * for DELETE, UPDATE, MERGE commands.
 */
RowLevelOperation build();
I'm not against this API design but just want to put out my idea: shall we have different build methods for UPDATE, DELETE, MERGE, like buildDelete? Then we don't need RowLevelOperationInfo.
public interface SupportsRowLevelOperations extends Table {
RowLevelOperationBuilder newRowLevelOperationBuilder(CaseInsensitiveStringMap options);
}
public interface RowLevelOperationBuilder {
RowLevelOperation buildDelete();
RowLevelOperation buildUpdate();
RowLevelOperation buildMerge();
}
or we can keep the enum
public interface RowLevelOperationBuilder {
RowLevelOperation build(Command command);
}
Either way would work for me too.
I tried to match LogicalWriteInfo. One benefit of having an extra class is that we can extend it to pass more information in the future. Some of those future requirements are hard to predict when an API is designed. For instance, I plan to add a few new methods to LogicalWriteInfo for delta-based operations (e.g. a row ID schema). However, there is no guarantee that will ever be required for RowLevelOperationInfo.
Any preference, @viirya @huaxingao @rdblue?
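The "extra info class" argument above can be sketched in a few lines. The names (RowLevelOperationInfo, command, options) mirror the PR discussion, but the exact shapes below are illustrative assumptions, not the final Spark interfaces:

```java
import java.util.Collections;
import java.util.Map;

public class InfoCarrierSketch {
    enum Command { DELETE, UPDATE, MERGE }

    // Bundling all builder inputs behind one interface leaves room to add
    // new accessors later (e.g. a row ID schema for delta-based operations)
    // without changing the builder's method signature.
    interface RowLevelOperationInfo {
        Command command();
        Map<String, String> options();
    }

    public static void main(String[] args) {
        RowLevelOperationInfo info = new RowLevelOperationInfo() {
            @Override public Command command() { return Command.DELETE; }
            @Override public Map<String, String> options() { return Collections.emptyMap(); }
        };
        if (info.command() != Command.DELETE || !info.options().isEmpty()) {
            throw new AssertionError("unexpected info contents");
        }
        System.out.println("command=" + info.command());
    }
}
```

With separate buildDelete/buildUpdate/buildMerge methods, adding a new input later would instead require new overloads or a breaking signature change.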
Hmm, I think one benefit of RowLevelOperationInfo, as @aokolnychyi said, is that it gives us some room for more information in the future. I'm afraid that if we take the buildUpdate, etc. approach, we may need to extend it when we need more information.
Shall we stick to the current API just in case then?
Let's wait for @cloud-fan's feedback?
Sorry, my question was directed to @cloud-fan :)
😄
I know predicting the future is very hard, but I'm afraid we may over-design this. We would have both RowLevelOperationInfo and the builder pattern to pass required information in the future; shall we keep only one of them?
On second thought, the read API doesn't have a ScanInformation either. We pass options to create the ScanBuilder, then use different mix-in traits to pass scan information, and finally build the scan. The rationale is that we should pass information that is always required to the builder, and the builder can use mix-in traits to take optional information.
That said, the current API design does follow the existing API style. +1 from me.
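The mix-in style described in that comment can be sketched as follows. ScanBuilder and pruneColumns echo the DSv2 read API, but the bodies here are illustrative placeholders, not Spark source:

```java
public class MixInSketch {
    interface Scan { }

    // Base builder: only what every source must support.
    interface ScanBuilder {
        Scan build();
    }

    // Optional capability, mixed in only by sources that can prune columns.
    interface SupportsPushDownRequiredColumns extends ScanBuilder {
        void pruneColumns(String[] requiredColumns);
    }

    static class MyScanBuilder implements SupportsPushDownRequiredColumns {
        String[] columns = new String[0];
        @Override public void pruneColumns(String[] requiredColumns) { this.columns = requiredColumns; }
        @Override public Scan build() { return new Scan() { }; }
    }

    public static void main(String[] args) {
        ScanBuilder builder = new MyScanBuilder();
        // Engine-side code probes for optional capabilities at runtime.
        if (builder instanceof SupportsPushDownRequiredColumns) {
            ((SupportsPushDownRequiredColumns) builder).pruneColumns(new String[] {"id"});
        }
        if (((MyScanBuilder) builder).columns.length != 1) {
            throw new AssertionError("column pruning was not applied");
        }
        System.out.println("pruned to: " + ((MyScanBuilder) builder).columns[0]);
    }
}
```

Required inputs travel through the factory method; everything optional is negotiated through capability interfaces, so the base builder never grows.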
+1, LGTM. Thank you, @aokolnychyi and all.
Regarding @cloud-fan's comment, I also support the as-is design.
thanks, merging to master/3.3!
Closes #35940 from aokolnychyi/spark-38625.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6743aaa)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Thanks @cloud-fan, @aokolnychyi and all!
Thanks for reviewing, @cloud-fan @viirya @dongjoon-hyun @huaxingao @rdblue!
What changes were proposed in this pull request?
This PR contains row-level operation APIs for V2 data sources that can replace groups of data (e.g. files, partitions). It is a subset of the changes already reviewed in #35395.
Why are the changes needed?
These changes are needed to support row-level operations in Spark per SPIP SPARK-35801.
Does this PR introduce any user-facing change?
Yes, this PR adds new Data Source V2 APIs.
How was this patch tested?
Not applicable.
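To make the shape of the merged API concrete, here is a hedged end-to-end sketch of how a connector might plug in. The interface and method names follow the PR discussion (SupportsRowLevelOperations, RowLevelOperationBuilder, RowLevelOperationInfo); the table and operation bodies are placeholders, not real Spark code:

```java
public class ConnectorSketch {
    enum Command { DELETE, UPDATE, MERGE }

    interface RowLevelOperation { Command command(); }

    interface RowLevelOperationBuilder { RowLevelOperation build(); }

    interface RowLevelOperationInfo { Command command(); }

    // A table advertises row-level support by mixing in this trait.
    interface SupportsRowLevelOperations {
        RowLevelOperationBuilder newRowLevelOperationBuilder(RowLevelOperationInfo info);
    }

    static class MyTable implements SupportsRowLevelOperations {
        @Override
        public RowLevelOperationBuilder newRowLevelOperationBuilder(RowLevelOperationInfo info) {
            return new RowLevelOperationBuilder() {
                @Override public RowLevelOperation build() {
                    // Trivial operation that just echoes the requested command.
                    return new RowLevelOperation() {
                        @Override public Command command() { return info.command(); }
                    };
                }
            };
        }
    }

    public static void main(String[] args) {
        RowLevelOperationInfo info = () -> Command.DELETE;
        RowLevelOperation op = new MyTable().newRowLevelOperationBuilder(info).build();
        if (op.command() != Command.DELETE) {
            throw new AssertionError("operation should carry the DELETE command");
        }
        System.out.println("built operation for " + op.command());
    }
}
```

Spark resolves the command against the table, builds the operation through the info carrier, and the source then decides which groups of data (files, partitions) to rewrite.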