[SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API #40277
Conversation
* JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least
* a "user" and "password" property should be included. "fetchsize" can be used to control the
* number of rows per fetch.
* @since 1.4.0
Suggested change:
* @since 1.4.0
* @since 3.4.0
string table = 2;

// (Optional) Condition in the where clause for each partition.
repeated string predicates = 3;
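For context, each entry in `predicates` is meant to become the WHERE clause of one partition's read. A minimal Python sketch of that mapping (illustrative only; the helper name and query shape are assumptions, not Spark code):

```python
def partition_queries(table, predicates):
    """Build one SELECT per partition, each filtered by its predicate.

    Mirrors how a JDBC reader can turn a list of predicates into
    per-partition queries; names here are illustrative, not Spark's.
    """
    if not predicates:
        # No predicates: a single unpartitioned scan.
        return [f"SELECT * FROM {table}"]
    return [f"SELECT * FROM {table} WHERE {p}" for p in predicates]

queries = partition_queries("people", ["id < 100", "id >= 100"])
```

Two predicates yield two partition queries; an empty list falls back to one full-table scan.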
Can we just put the predicates into the DataSource message?
But the transform path is very different from DataSource.
I think it's simple to add an if-else in `transformReadRel`, if we can reuse the existing `DataSource` message (with a new field `predicates`).
OK. Let's put the predicates into the DataSource message.
7627a0a to 4d48895
@@ -140,6 +140,9 @@ message Read {

// (Optional) A list of path for file-system backed data sources.
repeated string paths = 4;

// (Optional) Condition in the where clause for each partition.
Please add the comment that this currently only works for jdbc.
OK
table: String,
predicates: Array[String],
connectionProperties: Properties): DataFrame = {
sparkSession.newDataFrame { builder =>
Can you please set the format to JDBC? We are now relying on the presence of predicates to figure out that something is a JDBC table. That relies far too heavily on the client doing the right thing; for example, what would happen if you set format = parquet and still defined predicates?
Yeah, we can't rely on the client.
case s: StructType => reader.schema(s)
case other => throw InvalidPlanInput(s"Invalid schema $other")

if (rel.getDataSource.getPredicatesCount == 0) {
Please make the logic a bit like this:
if (format == "jdbc" && rel.getDataSource.getPredicatesCount > 0) {
  // Plan JDBC with predicates
} else if (rel.getDataSource.getPredicatesCount == 0) {
  // Plan data source
} else {
  throw InvalidPlanInput(s"Predicates are not supported for $format data sources.")
}
// (Optional) Condition in the where clause for each partition.
//
// Only work for JDBC data source.
This is only supported by the JDBC data source.
@hvanhovell Do you have any other advice? cc @HyukjinKwon @zhengruifeng @dongjoon-hyun
LGTM
… remaining jdbc API

### What changes were proposed in this pull request?
#40252 supported some jdbc APIs that reuse the proto msg `DataSource`. The `DataFrameReader` also has another kind of jdbc API that is unrelated to loading a data source.

### Why are the changes needed?
This PR adds the new proto msg `PartitionedJDBC` to support the remaining jdbc API.

### Does this PR introduce _any_ user-facing change?
'No'. New feature.

### How was this patch tested?
New test cases.

Closes #40277 from beliefer/SPARK-42555_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 39a5512)
Signed-off-by: Herman van Hovell <herman@databricks.com>
@hvanhovell @zhengruifeng Thank you.
What changes were proposed in this pull request?
#40252 supported some jdbc APIs that reuse the proto msg `DataSource`. The `DataFrameReader` also has another kind of jdbc API that is unrelated to loading a data source.

Why are the changes needed?
This PR adds the new proto msg `PartitionedJDBC` to support the remaining jdbc API.

Does this PR introduce any user-facing change?
'No'. New feature.

How was this patch tested?
New test cases.
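To make the final shape concrete, here is a small Python sketch of how a client might assemble the read message with predicates folded into the data source and the format set explicitly, as requested in review (the class and field names are illustrative stand-ins, not the actual Spark Connect protobuf API):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DataSource:
    # Illustrative stand-in for the proto DataSource message.
    format: str = ""
    options: Dict[str, str] = field(default_factory=dict)
    predicates: List[str] = field(default_factory=list)

def jdbc_read(url: str, table: str, predicates: List[str],
              properties: Dict[str, str]) -> DataSource:
    """Build a JDBC read message; format is set explicitly so the
    server never has to infer JDBC from the presence of predicates."""
    options = dict(properties)
    options["url"] = url
    options["dbtable"] = table
    return DataSource(format="jdbc", options=options, predicates=predicates)

msg = jdbc_read("jdbc:h2:mem:testdb", "people",
                ["id < 100", "id >= 100"], {"user": "sa"})
```

Because `format` is always "jdbc" here, the server-side check on predicates stays a pure validation step rather than a format-detection heuristic.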