[SPARK-48794][CONNECT] df.mergeInto support for Spark Connect (Scala and Python) #46960

xupefei · 2024-06-12T14:43:55Z

What changes were proposed in this pull request?

This PR introduces df.mergeInto support for Spark Connect Scala and Python clients.

This work contains four components:

1. New Protobuf messages: command MergeIntoTableCommand and expression MergeAction.
2. Spark Connect planner change: translate the proto messages into real MergeIntoCommands.
3. Connect Scala client: MergeIntoWriter, which allows users to build merges.
4. Connect Python client: MergeIntoWriter, which allows users to build merges.

Components 3 and 4 are independent of each other. They both depend on Component 1.
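To illustrate how the components fit together, here is a minimal pure-Python sketch of the builder-to-message flow. It is not the code in this PR: `ToyMergeIntoWriter`, its method names, the `summarize` helper, and every field name other than `with_schema_evolution` are hypothetical stand-ins chosen for illustration, and plain dataclasses replace the generated Protobuf classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MergeAction:
    """Toy stand-in for the MergeAction expression message (fields assumed)."""
    action_type: str
    condition: Optional[str] = None


@dataclass
class MergeIntoTableCommand:
    """Toy stand-in for the MergeIntoTableCommand message (fields assumed)."""
    target_table_name: str
    merge_condition: str
    match_actions: List[MergeAction] = field(default_factory=list)
    not_matched_actions: List[MergeAction] = field(default_factory=list)
    not_matched_by_source_actions: List[MergeAction] = field(default_factory=list)
    with_schema_evolution: bool = False


class ToyMergeIntoWriter:
    """Hypothetical client-side builder: collects actions into one command."""

    def __init__(self, table: str, condition: str) -> None:
        self._cmd = MergeIntoTableCommand(table, condition)

    def when_matched_update_all(self, condition: Optional[str] = None):
        self._cmd.match_actions.append(MergeAction("UPDATE_ALL", condition))
        return self

    def when_not_matched_insert_all(self, condition: Optional[str] = None):
        self._cmd.not_matched_actions.append(MergeAction("INSERT_ALL", condition))
        return self

    def when_not_matched_by_source_delete(self, condition: Optional[str] = None):
        self._cmd.not_matched_by_source_actions.append(MergeAction("DELETE", condition))
        return self

    def with_schema_evolution(self):
        self._cmd.with_schema_evolution = True
        return self

    def build(self) -> MergeIntoTableCommand:
        # A real client would send the message to the server instead.
        return self._cmd


def summarize(cmd: MergeIntoTableCommand) -> List[str]:
    """Stand-in for the planner step: walk the message, clause by clause."""
    lines = [f"MERGE INTO {cmd.target_table_name} ON {cmd.merge_condition}"]
    clauses = (
        ("WHEN MATCHED", cmd.match_actions),
        ("WHEN NOT MATCHED", cmd.not_matched_actions),
        ("WHEN NOT MATCHED BY SOURCE", cmd.not_matched_by_source_actions),
    )
    for clause, actions in clauses:
        for action in actions:
            cond = f" AND {action.condition}" if action.condition else ""
            lines.append(f"{clause}{cond} THEN {action.action_type}")
    return lines


cmd = (
    ToyMergeIntoWriter("target", "target.id = source.id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .when_not_matched_by_source_delete("target.stale")
    .with_schema_evolution()
    .build()
)
summary = summarize(cmd)
```

The point of the sketch is the split of responsibilities: the client only accumulates a serializable description of the merge, and the server-side planner is the only place that turns it into an executable command.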

Why are the changes needed?

We need to increase the functionality of Spark Connect to be on par with Classic.

Does this PR introduce any user-facing change?

Yes, new DataFrame APIs are introduced.

How was this patch tested?

Added new tests.

Was this patch authored or co-authored using generative AI tooling?

No.

### What changes were proposed in this pull request?

Spark 4.0 added a new `df.mergeInto` API, but it is missing from PySpark. This PR fixes that. The support for this API in Spark Connect Python API will be added later by #46960.

### Why are the changes needed?

Because PySpark does not support `df.mergeInto`.

### Does this PR introduce _any_ user-facing change?

Yes, the user would be able to use the `df.mergeInto` API.

### How was this patch tested?

New unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47086 from xupefei/pyspark-mergeinto.

Authored-by: Paddy Xu <xupaddy@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

# Conflicts: # python/pyspark/sql/connect/proto/expressions_pb2.py

xupefei · 2024-07-03T14:06:08Z

@HyukjinKwon and @grundprinzip Could you review this PR? Thanks!

python/pyspark/sql/connect/merge.py

python/pyspark/sql/dataframe.py

python/pyspark/sql/connect/merge.py

HyukjinKwon

Seems fine otherwise

HyukjinKwon · 2024-07-03T23:16:01Z

connector/connect/common/src/main/protobuf/spark/connect/commands.proto

+
+  // (Required) Whether to enable schema evolution.
+  bool with_schema_evolution = 7;
+}


This would need some reviews from @hvanhovell and/or @grundprinzip

HyukjinKwon · 2024-07-03T23:16:26Z

cc @zhengruifeng and @ueshin too

HyukjinKwon · 2024-07-12T00:14:26Z

@xupefei mind resolving conflicts? I will just merge

xupefei · 2024-07-12T06:21:39Z

> @xupefei mind resolving conflicts? I will just merge

Done!

HyukjinKwon · 2024-07-12T08:52:37Z

Merged to master.

wip

40b76db

github-actions bot added SQL CONNECT labels Jun 12, 2024

add tests

4c06497

xupefei marked this pull request as ready for review June 19, 2024 14:45

xupefei added 2 commits June 20, 2024 10:20

Merge branch 'master' of github.com:apache/spark into merge-builder

ecb5d66

fix proto

f29552e

github-actions bot added the PYTHON label Jun 20, 2024

xupefei added 2 commits June 20, 2024 13:27

fix scala 2.12

c19c50d

Merge branch 'master' into merge-builder

d0f4491

xupefei mentioned this pull request Jun 25, 2024

[SPARK-48714][PYTHON] Implement DataFrame.mergeInto in PySpark #47086

Closed

xupefei added 4 commits July 3, 2024 12:35

Merge branch 'master' of github.com:apache/spark into merge-builder

7f0c1ce

# Conflicts: # python/pyspark/sql/connect/proto/expressions_pb2.py

fmt

5d806da

py

583c264

comment

0219a7f

xupefei changed the title ~~[Connect][WIP] Dataset.mergeInto~~ [Connect][SPARK-48794] df.mergeInto support for Spark Connect (Scala and Python) Jul 3, 2024

fmt

ef7084b

xupefei changed the title ~~[Connect][SPARK-48794] df.mergeInto support for Spark Connect (Scala and Python)~~ [SPARK-48794][Connect] df.mergeInto support for Spark Connect (Scala and Python) Jul 3, 2024

HyukjinKwon changed the title ~~[SPARK-48794][Connect] df.mergeInto support for Spark Connect (Scala and Python)~~ [SPARK-48794][CONNECT] df.mergeInto support for Spark Connect (Scala and Python) Jul 3, 2024