[SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API #40277

beliefer · 2023-03-04T06:54:10Z

What changes were proposed in this pull request?

#40252 added support for the jdbc APIs that reuse the proto msg DataSource. The DataFrameReader also has another kind of jdbc API that is unrelated to loading a data source.

Why are the changes needed?

This PR adds the new proto msg PartitionedJDBC to support the remaining jdbc API.
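
For context, the remaining API is the `DataFrameReader.jdbc(url, table, predicates, connectionProperties)` overload, in which each entry of `predicates` is the where-clause condition for one partition. Below is a minimal sketch of that idea in plain Scala, without Spark. `PartitionedJdbcSketch` and `partitionQueries` are hypothetical names used only for illustration, and the generated SQL is an assumption about the general shape of a per-partition query, not the text Spark's JDBC source actually emits.

```scala
object PartitionedJdbcSketch {
  // Hypothetical helper, not part of Spark: builds one query per predicate,
  // so the number of predicates equals the number of partitions.
  def partitionQueries(table: String, predicates: Seq[String]): Seq[String] =
    predicates.map(predicate => s"SELECT * FROM $table WHERE $predicate")

  def main(args: Array[String]): Unit = {
    // Three predicates, hence three partitions in this model.
    val predicates = Seq("id < 100", "id >= 100 AND id < 200", "id >= 200")
    partitionQueries("people", predicates).foreach(println)
  }
}
```

In this model the predicates should not overlap and should together cover the table, otherwise rows would be read twice or skipped.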

Does this PR introduce any user-facing change?

'No'.
New feature.

How was this patch tested?

New test cases.

zhengruifeng · 2023-03-04T06:57:17Z

connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala

+   *   JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least
+   *   a "user" and "password" property should be included. "fetchsize" can be used to control the
+   *   number of rows per fetch.
+   * @since 1.4.0


Suggested change

-   * @since 1.4.0
+   * @since 3.4.0

beliefer · 2023-03-06T01:05:46Z

ping @hvanhovell @HyukjinKwon @dongjoon-hyun cc @LuciferYang

hvanhovell · 2023-03-06T03:07:54Z

connector/connect/common/src/main/protobuf/spark/connect/relations.proto

+    string table = 2;
+
+    // (Optional) Condition in the where clause for each partition.
+    repeated string predicates = 3;


Can we just put the predicates into the DataSource message?

But the transform path is very different from DataSource.

I think it's simple to add an if-else in transformReadRel if we can reuse the existing DataSource message (with a new predicates field).

OK. Let's put the predicates into the DataSource message.

hvanhovell · 2023-03-06T12:40:03Z

connector/connect/common/src/main/protobuf/spark/connect/relations.proto

@@ -140,6 +140,9 @@ message Read {

    // (Optional) A list of path for file-system backed data sources.
    repeated string paths = 4;
+
+    // (Optional) Condition in the where clause for each partition.


Please add the comment that this currently only works for jdbc.

hvanhovell · 2023-03-06T12:44:07Z

connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala

+      table: String,
+      predicates: Array[String],
+      connectionProperties: Properties): DataFrame = {
+    sparkSession.newDataFrame { builder =>


Can you please set the format to JDBC? We are now relying on the presence of predicates to figure out that something is a JDBC table. That relies far too heavily on the client doing the right thing. For example, what would happen if you set format = parquet and still define predicates?

Yeah, we can't rely on the client.

hvanhovell · 2023-03-06T12:46:53Z

...connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala

-            case s: StructType => reader.schema(s)
-            case other => throw InvalidPlanInput(s"Invalid schema $other")
+
+        if (rel.getDataSource.getPredicatesCount == 0) {


Please make the logic a bit like this:

    if (format == "jdbc" && rel.getDataSource.getPredicatesCount > 0) {
      // Plan JDBC with predicates
    } else if (rel.getDataSource.getPredicatesCount == 0) {
      // Plan datasource
    } else {
      throw InvalidPlanInput(s"Predicates are not supported for $format data sources.")
    }
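
The branching requested in this comment can be sketched as a small, self-contained Scala model. This is only an illustration under stated assumptions: `ReadPlanSketch`, `DataSourceRead`, `ReadPlan` and its two cases are hypothetical stand-ins for the proto message and the planner paths, the local `InvalidPlanInput` merely mimics the exception seen in the diff, and the format check is a plain case-sensitive comparison.

```scala
object ReadPlanSketch {
  // Hypothetical stand-in for the Read.DataSource proto message (format and predicates only).
  final case class DataSourceRead(format: String, predicates: Seq[String])

  // Hypothetical markers for the two planning paths.
  sealed trait ReadPlan
  case object JdbcWithPredicates extends ReadPlan
  case object GenericDataSource extends ReadPlan

  // Local stand-in for the planner's InvalidPlanInput exception.
  final case class InvalidPlanInput(message: String) extends Exception(message)

  // The server decides from the format first, and rejects predicates
  // on any non-JDBC format instead of trusting the client.
  def chooseReadPlan(ds: DataSourceRead): ReadPlan =
    if (ds.format == "jdbc" && ds.predicates.nonEmpty) {
      JdbcWithPredicates
    } else if (ds.predicates.isEmpty) {
      GenericDataSource
    } else {
      throw InvalidPlanInput(s"Predicates are not supported for ${ds.format} data sources.")
    }
}
```

With this shape, a message that sets format = parquet and still carries predicates fails fast on the server rather than being silently planned as a JDBC read.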

hvanhovell · 2023-03-07T16:28:17Z

connector/connect/common/src/main/protobuf/spark/connect/relations.proto

+
+    // (Optional) Condition in the where clause for each partition.
+    //
+    // Only work for JDBC data source.


This is only supported by the JDBC data source.

beliefer · 2023-03-08T11:28:53Z

@hvanhovell Do you have any other advice? cc @HyukjinKwon @zhengruifeng @dongjoon-hyun

hvanhovell

LGTM

… remaining jdbc API

### What changes were proposed in this pull request?
#40252 supported some jdbc API that reuse the proto msg `DataSource`. The `DataFrameReader` also have another kind jdbc API that is unrelated to load data source.

### Why are the changes needed?
This PR adds the new proto msg `PartitionedJDBC` to support the remaining jdbc API.

### Does this PR introduce _any_ user-facing change?
'No'. New feature.

### How was this patch tested?
New test cases.

Closes #40277 from beliefer/SPARK-42555_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 39a5512)
Signed-off-by: Herman van Hovell <herman@databricks.com>

beliefer · 2023-03-09T01:02:30Z

@hvanhovell @zhengruifeng Thank you.

github-actions bot added CONNECT SQL labels Mar 4, 2023

zhengruifeng reviewed Mar 4, 2023

github-actions bot added CORE PYTHON labels Mar 5, 2023

hvanhovell reviewed Mar 6, 2023

beliefer added 4 commits March 6, 2023 11:28

[SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the…

f2ee71a

… remaining jdbc API

Update code

8083cf9

Update code

c07cf47

Fix conflicts

4d48895

beliefer force-pushed the SPARK-42555_followup branch from 7627a0a to 4d48895 Compare March 6, 2023 03:33

Update code

e722c50

hvanhovell reviewed Mar 6, 2023

beliefer added 2 commits March 7, 2023 11:47

Update code

25abf55

Update code

f0abb56

hvanhovell reviewed Mar 7, 2023

Update code

95ccd7c

hvanhovell approved these changes Mar 8, 2023

hvanhovell closed this in 39a5512 Mar 8, 2023
