[SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API #40277

Closed
wants to merge 8 commits into from

beliefer (Contributor) commented Mar 4, 2023

What changes were proposed in this pull request?

#40252 added support for the jdbc APIs that reuse the proto msg DataSource. The DataFrameReader also has another kind of jdbc API that is unrelated to loading a data source.

Why are the changes needed?

This PR adds the new proto msg PartitionedJDBC to support the remaining jdbc API.

Does this PR introduce any user-facing change?

'No'.
New feature.

How was this patch tested?

New test cases.

* JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least
* a "user" and "password" property should be included. "fetchsize" can be used to control the
* number of rows per fetch.
* @since 1.4.0
Contributor

Suggested change
- * @since 1.4.0
+ * @since 3.4.0

beliefer (Contributor Author) commented Mar 6, 2023

string table = 2;

// (Optional) Condition in the where clause for each partition.
repeated string predicates = 3;
Contributor

Can we just put the predicates into the DataSource message?

Contributor Author

But the transform path is very different from DataSource.

Contributor

I think it's simple to add an if-else in transformReadRel if we can reuse the existing DataSource message (with a new predicates field).

Contributor Author

OK. Let's put the predicates into the DataSource message.
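
For readers unfamiliar with this API: each entry in predicates is a WHERE-clause condition that defines one partition of the read. The idea can be sketched in plain Scala without any Spark dependency. The object and method names below are hypothetical, and the SQL that Spark actually generates may differ; this only illustrates the one-predicate-per-partition mapping.

```scala
// Illustrative only: models how a list of predicates maps to one query per partition.
// `PredicatePartitions` and `partitionQueries` are hypothetical names, not Spark APIs.
object PredicatePartitions {
  def partitionQueries(table: String, predicates: Seq[String]): Seq[String] =
    if (predicates.isEmpty) {
      // No predicates: a single, unpartitioned scan of the table.
      Seq(s"SELECT * FROM $table")
    } else {
      // One query (and hence one partition) per predicate.
      predicates.map(p => s"SELECT * FROM $table WHERE $p")
    }
}
```

For example, `PredicatePartitions.partitionQueries("orders", Seq("year < 2020", "year >= 2020"))` yields two queries, one per predicate.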

@@ -140,6 +140,9 @@ message Read {

// (Optional) A list of path for file-system backed data sources.
repeated string paths = 4;

// (Optional) Condition in the where clause for each partition.
Contributor

Please add a comment that this currently only works for JDBC.

Contributor Author

OK

table: String,
predicates: Array[String],
connectionProperties: Properties): DataFrame = {
sparkSession.newDataFrame { builder =>
Contributor

Can you please set the format to JDBC? We are now relying on the presence of predicates to figure out that something is a JDBC table. That relies far too heavily on the client doing the right thing. For example, what would happen if you set format = parquet and still defined predicates?

Contributor Author

Yeah, we can't rely on the client.

case s: StructType => reader.schema(s)
case other => throw InvalidPlanInput(s"Invalid schema $other")

if (rel.getDataSource.getPredicatesCount == 0) {
Contributor

Please make the logic a bit like this:

if (format == "jdbc" && rel.getDataSource.getPredicatesCount > 0) {
  // Plan JDBC with predicates
} else if (rel.getDataSource.getPredicatesCount == 0) {
  // Plan datasource
} else {
  throw InvalidPlanInput(s"Predicates are not supported for $format data sources.")
}
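
The three-way branch suggested above can be tried out in isolation. The following is a minimal, self-contained sketch, not the planner code from this PR: `DataSourceRel`, `ReadPlan`, `JdbcWithPredicates`, `GenericDataSource`, and `ReadPlanner` are hypothetical stand-ins for the proto message and planner, and `InvalidPlanInput` is redefined locally to mirror the error name seen in the diff context.

```scala
// Hypothetical stand-ins for the Connect proto message and the planner error type.
final case class DataSourceRel(format: String, predicates: Seq[String])
final case class InvalidPlanInput(message: String) extends Exception(message)

// Hypothetical result types that mark which planning path was taken.
sealed trait ReadPlan
final case class JdbcWithPredicates(predicates: Seq[String]) extends ReadPlan
final case class GenericDataSource(format: String) extends ReadPlan

object ReadPlanner {
  def plan(rel: DataSourceRel): ReadPlan =
    if (rel.format == "jdbc" && rel.predicates.nonEmpty) {
      // JDBC read partitioned by the supplied predicates.
      JdbcWithPredicates(rel.predicates)
    } else if (rel.predicates.isEmpty) {
      // Regular data source path; no predicates involved.
      GenericDataSource(rel.format)
    } else {
      // Predicates were set on a non-JDBC format: reject rather than trust the client.
      throw InvalidPlanInput(
        s"Predicates are not supported for ${rel.format} data sources.")
    }
}
```

With this shape, a request that sets format = parquet together with predicates fails with an explicit error instead of being silently planned as JDBC, which is the concern raised earlier in the review.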


// (Optional) Condition in the where clause for each partition.
//
// Only work for JDBC data source.
Contributor

This is only supported by the JDBC data source.

beliefer (Contributor Author) commented Mar 8, 2023

@hvanhovell Do you have any other advice? cc @HyukjinKwon @zhengruifeng @dongjoon-hyun

hvanhovell (Contributor) left a comment

LGTM

hvanhovell pushed a commit that referenced this pull request Mar 8, 2023
… remaining jdbc API

### What changes were proposed in this pull request?
#40252 added support for the jdbc APIs that reuse the proto msg `DataSource`. The `DataFrameReader` also has another kind of jdbc API that is unrelated to loading a data source.

### Why are the changes needed?
This PR adds the new proto msg `PartitionedJDBC` to support the remaining jdbc API.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New test cases.

Closes #40277 from beliefer/SPARK-42555_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 39a5512)
Signed-off-by: Herman van Hovell <herman@databricks.com>
@hvanhovell hvanhovell closed this in 39a5512 Mar 8, 2023
beliefer (Contributor Author) commented Mar 9, 2023

@hvanhovell @zhengruifeng Thank you.

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023