
[SPARK-43775][SQL] DataSource V2: Allow representing updates as deletes and inserts #41300


Conversation

@aokolnychyi (Contributor)

What changes were proposed in this pull request?

This PR adds a way for data sources to request Spark to represent updates as deletes and inserts.

Why are the changes needed?

For delta-based implementations, it may be beneficial for data sources to represent updates as deletes and inserts. Specifically, it may help to properly distribute and order records before writing.

Delete records set only row ID and metadata attributes. Update records set data, row ID, and metadata attributes. Insert records set only data attributes.

For instance, a data source may rely on a synthetic, internally generated metadata column `_row_id` to identify rows and may be partitioned by `bucket(product_id)`. Splitting updates into deletes and inserts would allow the data source to cluster all update and insert records in MERGE for the same partition into a single task. Otherwise, the clustering key for updates and inserts will differ (inserts always have `_row_id` set to null, since it is a metadata column). The new functionality is critical for reducing the number of generated files. It also makes sense in UPDATE operations when the original and new partition of a record do not match.
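As an illustration (all values here are hypothetical), an update to the row identified by `_row_id = 7` could be emitted as the following pair of delta records:

| record | `_row_id` | `product_id` | `price` |
|--------|-----------|--------------|---------|
| DELETE | 7 | null | null |
| INSERT | null | 10 | 19.99 |

Because the insert half carries only data attributes, it shares its clustering key (`bucket(product_id)`) with ordinary inserts instead of being grouped separately due to a null `_row_id`.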

Does this PR introduce any user-facing change?

This PR adds a new method to `SupportsDelta`, but the change is backward compatible.
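A minimal sketch of how a delta-based connector could opt in, assuming the hook is the boolean `representUpdateAsDeleteAndInsert` method this PR adds to `SupportsDelta` (returning `false` by default, which is what keeps the change backward compatible). The surrounding class, command wiring, and the `_row_id` column name are illustrative:

```scala
import org.apache.spark.sql.connector.expressions.{Expressions, NamedReference}
import org.apache.spark.sql.connector.read.ScanBuilder
import org.apache.spark.sql.connector.write.{DeltaWriteBuilder, LogicalWriteInfo, SupportsDelta}
import org.apache.spark.sql.connector.write.RowLevelOperation.Command
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical connector-side row-level operation; everything except the
// SupportsDelta API itself is illustrative.
class ExampleDeltaOperation(cmd: Command) extends SupportsDelta {

  override def command(): Command = cmd

  // Synthetic row ID column used to match delta records against existing rows.
  override def rowId(): Array[NamedReference] = Array(Expressions.column("_row_id"))

  // Ask Spark to rewrite UPDATE delta records as DELETE + INSERT pairs before
  // distributing and ordering the delta write.
  override def representUpdateAsDeleteAndInsert(): Boolean = true

  // Scan and write plumbing omitted from this sketch.
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = ???
  override def newWriteBuilder(info: LogicalWriteInfo): DeltaWriteBuilder = ???
}
```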

How was this patch tested?

This PR comes with tests.

@github-actions github-actions bot added the SQL label May 24, 2023
@dongjoon-hyun (Member) left a comment

```scala
@@ -91,6 +91,33 @@ trait RewriteRowLevelCommand extends Rule[LogicalPlan] {
    rowIdAttrs
  }

  protected def deltaDeleteOutput(
```
@aokolnychyi (Contributor, Author)

Added to the parent class to reuse in MERGE later.
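As a rough sketch (not the merged code), such a helper might build the DELETE-record projection along these lines; the leading operation literal, parameter names, and null handling are assumptions for illustration:

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, Literal}

// Rough sketch only: builds the output expressions for a delta DELETE record.
// Row ID and metadata attributes are kept; data attributes are emitted as nulls.
def deltaDeleteOutput(
    dataAttrs: Seq[Attribute],
    rowIdAttrs: Seq[Attribute],
    metadataAttrs: Seq[Attribute]): Seq[Expression] = {
  val operation = Literal(1) // e.g. a DELETE marker in the row operation column
  val nulledDataAttrs = dataAttrs.map(attr => Literal(null, attr.dataType))
  Seq(operation) ++ nulledDataAttrs ++ rowIdAttrs ++ metadataAttrs
}
```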

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, Expression}
import org.apache.spark.sql.catalyst.util.truncatedString

case class SplitUpdateAsDeleteAndInsert(
```
@aokolnychyi (Contributor, Author)

Maybe I can use Expand/ExpandExec. Let me explore this to see if there are any performance implications.

@aokolnychyi (Contributor, Author)

Okay, I do think this node is a bit more efficient than Expand, as it acts almost like Project:

- No need to call copy() on each row.
- Subexpression elimination, like in ProjectExec.
- Deferred evaluation of input attributes unless they are needed, like in ProjectExec.

What are your thoughts? Do you think it is worth having such a node? Technically, Expand would work too, but it would be less efficient and would probably require more memory. This use case does not require a generic approach like Expand's.

cc @dongjoon-hyun @viirya @cloud-fan @sunchao @huaxingao

@aokolnychyi (Contributor, Author)

Maybe we can implement some of those optimizations for Expand instead of adding another node. I am inclined towards using Expand, but let me know what everyone thinks.

@cloud-fan (Contributor)

I'm in favor of reusing Expand, so that we get its codegen and/or vectorization (if people install some Spark plugins) for free.

@aokolnychyi (Contributor, Author)

Switched to Expand, will follow up with improvements later.
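For context, a rough sketch of how the split can be expressed with `Expand`: two projections over the same child plan, one emitting the DELETE record (row ID and metadata only) and one emitting the INSERT record (data only). The operation literals, parameter names, and output handling are illustrative, not the exact code in this PR:

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{Expand, LogicalPlan}

// Rough sketch: produce two output rows per incoming UPDATE row via Expand.
// `output` must contain one attribute per projection element
// (operation column + data attrs + row ID attrs + metadata attrs).
def splitUpdateAsDeleteAndInsert(
    child: LogicalPlan,
    output: Seq[Attribute],
    dataAttrs: Seq[Attribute],
    rowIdAttrs: Seq[Attribute],
    metadataAttrs: Seq[Attribute]): Expand = {

  // DELETE record: keep row ID and metadata values, null out data attributes.
  val deleteProjection: Seq[Expression] =
    Seq(Literal(1)) ++
      dataAttrs.map(a => Literal(null, a.dataType)) ++
      rowIdAttrs ++ metadataAttrs

  // INSERT record: keep data values, null out row ID and metadata attributes.
  val insertProjection: Seq[Expression] =
    Seq(Literal(3)) ++
      dataAttrs ++
      (rowIdAttrs ++ metadataAttrs).map(a => Literal(null, a.dataType))

  Expand(Seq(deleteProjection, insertProjection), output, child)
}
```

Compared with a dedicated node, Expand copies each input row for every projection, which is the overhead discussed above, but it comes with existing codegen support.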

@dongjoon-hyun (Member) left a comment

Could you re-trigger the failed CI pipeline, @aokolnychyi?

@aokolnychyi aokolnychyi reopened this May 27, 2023
@dongjoon-hyun (Member)

It seems the re-trigger failed, @aokolnychyi. Do you have a GitHub Actions link on your commit?

@aokolnychyi (Contributor, Author)

@dongjoon-hyun, looks like something weird happened. I switched to Expand and rebased to pick up recent changes. Let's see if tests will work this time. The PR should be ready for a detailed review.

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @aokolnychyi. The Expand change looks good to me, too.

Since it was @cloud-fan's comment, could you review this once more, @cloud-fan?

@dongjoon-hyun (Member)

Thank you again, @aokolnychyi, @cloud-fan, @viirya.
Merged to master for Apache Spark 3.5.0.

@aokolnychyi (Contributor, Author)

Thanks, @dongjoon-hyun @cloud-fan @viirya!

czxm pushed a commit to czxm/spark that referenced this pull request Jun 12, 2023

[SPARK-43775][SQL] DataSource V2: Allow representing updates as deletes and inserts

Closes apache#41300 from aokolnychyi/spark-43775.

Authored-by: aokolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023

[SPARK-43775][SQL] DataSource V2: Allow representing updates as deletes and inserts

Closes apache#41300 from aokolnychyi/spark-43775.

Authored-by: aokolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit fead25a)