[SPARK-46207][SQL] Support MergeInto in DataFrameWriterV2 #44119
Conversation
cc @aokolnychyi @cloud-fan @dongjoon-hyun @viirya
"message" : [ | ||
"df.mergeInto needs to be followed by at least one of whenMatched/whenNotMatched/whenNotMatchedBySource." | ||
], | ||
"sqlState" : "23K02" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"sqlState" : "23K02" | |
"sqlState" : "42K0E" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Thanks
Review comment on MergeIntoWriter.scala:

    qe.assertCommandExecuted()
    }

    def withNewMatchedUpdateAction(condition: Option[Expression]): MergeIntoWriter[T] = {

this seems too many helper functions...

    def withNewMatchedAction(action: MergeAction): MergeIntoWriter[T] = {
      this.matchedActions = this.matchedActions :+ action
      this
    }

I think 3 helper functions should be good enough for 3 different action types.

Fixed. Thanks
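As a rough illustration of the consolidated approach the reviewer suggests — one helper per action type rather than one per (action, condition) pair — the three helpers might look like the sketch below. The field names `notMatchedActions` and `notMatchedBySourceActions` are assumed by analogy with the `matchedActions` field quoted above and may not match the final code:

```scala
// Sketch: one generic helper per action type. Each appends the given
// MergeAction to the corresponding list and returns the writer for chaining.
private def withNewMatchedAction(action: MergeAction): MergeIntoWriter[T] = {
  this.matchedActions = this.matchedActions :+ action
  this
}

private def withNewNotMatchedAction(action: MergeAction): MergeIntoWriter[T] = {
  this.notMatchedActions = this.notMatchedActions :+ action
  this
}

private def withNewNotMatchedBySourceAction(action: MergeAction): MergeIntoWriter[T] = {
  this.notMatchedBySourceActions = this.notMatchedBySourceActions :+ action
  this
}
```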
Review comment on MergeIntoDataFrameSuite.scala:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions._

    class MergeIntoDataFrameSuite extends RowLevelOperationSuiteBase {

+1. We only need to make sure the new Scala API works. We don't need to test the underlying v2 sources extensively; that should already be covered by other tests.
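In that spirit, a minimal API-surface test might look like the following sketch. It is only an illustration: the table setup, the `checkAnswer` helper, and the exact builder signatures are assumptions, not code from this PR:

```scala
test("mergeInto: update matched rows and insert unmatched rows") {
  // Assumed setup: a target table `target` with columns (pk, value)
  // already containing rows (1, "old") and (2, "unchanged").
  val source = Seq((1, "updated"), (3, "new")).toDF("pk", "value")

  // Exercise the new builder API end to end.
  source.mergeInto("target", $"target.pk" === $"source.pk")
    .whenMatched()
    .updateAll()
    .whenNotMatched()
    .insertAll()
    .merge()

  checkAnswer(
    spark.table("target").orderBy("pk"),
    Seq(Row(1, "updated"), Row(2, "unchanged"), Row(3, "new")))
}
```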
Review comment on the `mergeInto` doc:

    *
    * @since 4.0.0

Similar to `def write`, add `@group basic`?

Added. Thanks
Review comment on the `whenNotMatched` doc:

    * Initialize a `WhenNotMatched` action without any condition.
    *
    * This `WhenNotMatched` can be followed by one of the following merge actions:
    * - `insertAll`: Insert all the target table with source dataset records.

"Insert all the columns of the target table with ..."?

How about "Insert all rows from the source that are not already in the target table."? Please refer to https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html#when-not-matched-[by-target]

I have fixed this and a few other places.

Thanks.
Review comment on the `insert(Map)` doc:

    * - `insert(Map)`: Insert all the target table records while changing only
    *                  a subset of fields based on the provided assignment.

"Insert the specified columns ..."?
Review comment on the `whenNotMatchedBySource` doc:

    * Initialize a `WhenNotMatchedBySource` action without any condition.
    *
    * This `WhenNotMatchedBySource` can be followed by one of the following merge actions:
    * - `updateAll`: Update all the target table fields with source dataset fields.

"Update all the columns of the target table ..."?
Review comment on the `update(Map)` doc:

    *
    * This `WhenNotMatchedBySource` can be followed by one of the following merge actions:
    * - `updateAll`: Update all the target table fields with source dataset fields.
    * - `update(Map)`: Update all the target table records while changing only

"Update the specified columns of the target table ..."?
Review comment on the `delete` doc:

    * - `updateAll`: Update all the target table fields with source dataset fields.
    * - `update(Map)`: Update all the target table records while changing only
    *                  a subset of fields based on the provided assignment.
    * - `delete`: Delete all the target table records.

"Delete the target table row."?
Review comment on the `delete` doc (`whenMatched`):

    * - `updateAll`: Update all the target table fields with source dataset fields.
    * - `update(Map)`: Update all the target table records while changing only
    *                  a subset of fields based on the provided assignment.
    * - `delete`: Delete all the target table records.

"Delete the matching target table row"?

How about "Delete all target rows that have a match in the source table."? Please refer to https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html#when-matched

Yea, I was referring to the doc too. I've tried to combine @huaxingao's original sentence and the doc. If @huaxingao wants to use the descriptions from the doc, that's good too.
Review comment on Dataset.scala:

    @@ -4129,6 +4129,36 @@ class Dataset[T] private[sql](
        new DataFrameWriterV2[T](table, this)
      }

      /**
       * Create a [[MergeIntoWriter]] for MergeInto action.

This is user-facing API doc. Not sure if it is proper to put `MergeIntoWriter` there. For example, we don't put `DataFrameWriter` in the `write` API doc.

Maybe just describe what the function is used for, e.g., "Merges a set of updates, insertions, and deletions based on a source table into a target table" (https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html).

The key question is whether `MergeIntoWriter` is a public API or a developer API. cc @cloud-fan

Fixed. Thanks
Review comment on the `whenMatched` doc:

    * - `updateAll`: Update all the target table fields with source dataset fields.
    * - `update(Map)`: Update all the target table records while changing only
    *                  a subset of fields based on the provided assignment.
    * - `delete`: Delete all the target table records.

How about "Delete all target rows that have a match in the source table."? Please refer to https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html#when-matched
Review comment on the `whenNotMatched` doc:

    * Initialize a `WhenNotMatched` action without any condition.
    *
    * This `WhenNotMatched` can be followed by one of the following merge actions:
    * - `insertAll`: Insert all the target table with source dataset records.

"Insert all rows from the source that are not already in the target table." Please refer to https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html#when-not-matched-[by-target]
The test failure doesn't seem to be related to my changes.

Merged to master.

Thank you all very much for reviewing the PR!
What changes were proposed in this pull request?
Add MergeInto support in DataFrameWriterV2.
Why are the changes needed?
Spark currently supports the MERGE INTO SQL statement. This change gives the DataFrame API the same capability.
Does this PR introduce any user-facing change?
Yes. This PR introduces new API like the following:
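The original code sample did not survive the page capture. As a rough sketch of the new API, assembled from the method names discussed in this PR (`mergeInto`, `whenMatched`, `whenNotMatched`, `updateAll`, `insertAll`, `merge`) — the exact signatures and column references here are an approximation, not code copied from the PR:

```scala
import org.apache.spark.sql.functions.expr

// Merge `sourceDf` into the table `target`: update rows that match on the
// key, insert rows that exist only in the source.
sourceDf.mergeInto("target", expr("target.pk = source.pk"))
  .whenMatched()
  .updateAll()
  .whenNotMatched()
  .insertAll()
  .merge()
```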
How was this patch tested?
new tests
Was this patch authored or co-authored using generative AI tooling?
No