
[SPARK-36680][SQL] Supports Dynamic Table Options for Spark SQL #41683

Closed
wants to merge 5 commits

Conversation

@szehon-ho (Contributor) commented Jun 21, 2023

What changes were proposed in this pull request?

This PR adds a SQL TVF (table-valued function):
with_options('table', map('foo', 'bar'))

Details:

  1. Add a RelationWithOptions function object.
  2. Add a ResolveRelationsWithOptions rule to resolve it to a relation with the given options.
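
For illustration, a read through the proposed TVF might look like the following sketch (the table name and the fetchSize option are invented for the example and are not from this PR):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: hypothetical usage of the proposed with_options TVF.
// "jdbc_table" and fetchSize are illustrative names, not from the PR.
val spark = SparkSession.builder().appName("with-options-demo").getOrCreate()
val df = spark.sql(
  "SELECT * FROM with_options('jdbc_table', map('fetchSize', '100'))")
df.show()
```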

Why are the changes needed?

Currently there is no way to dynamically configure individual DataSource relations in a Spark SQL query.

This is a continuation of the effort in #34072 and, based on the comment there (#34072 (comment)), uses a TVF instead of query hints.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test in DataSourceV2SQLSuite

@szehon-ho (Contributor, Author) commented Jun 21, 2023

As there are a lot of rules in the analyzer, please let me know if there's a better place/rule for this.

Note: the parser does not seem to support TVFs in write relations, but it works in read relations. Hopefully we can find a way to support write options for relations as well.

@huaxingao (Contributor)

cc @cloud-fan

@@ -1118,6 +1122,38 @@ case class Range(
}
}

@ExpressionDescription(
usage = "_FUNC_(identifier: String, options: Map) - " +
"Returns the data source relation with the given configuration.",

s/configuration/options

@szehon-ho (Contributor, Author) commented Jun 23, 2023

Hm, I just changed the UDF description in the last update; the build failure doesn't seem related:

python/pyspark/mllib/clustering.py:781: error: Decorated property not supported  [misc]
python/pyspark/mllib/clustering.py:949: error: Decorated property not supported  [misc]
Found 199 errors in 24 files (checked 668 source files)
Error: Process completed with exit code 1.

@szehon-ho (Contributor, Author)

Rebased to try to fix the test.

@huaxingao (Contributor)

@cloud-fan When you have a moment, could you please take a look at this PR to see if the approach is OK with you? Thanks a lot!

@dongjoon-hyun self-assigned this Aug 8, 2023
@dongjoon-hyun (Member) left a comment

Could you rebase this PR, @szehon-ho?

@szehon-ho (Contributor, Author)

Rebased, thanks for taking a look @dongjoon-hyun

@cloud-fan (Contributor)

Note: TVF parameters can now be tables, so we can do with_options(TABLE t, map(...)).

BTW, how can we differentiate spark.read.table and spark.readStream.table with SQL API?

@szehon-ho (Contributor, Author)

Done, the latest modification makes the first argument accept a table parameter, i.e., TABLE(t1).

In this version, it seems the relation is already resolved, so all I can do is patch the resolved relations if they have options (DSv2) (versus before, when they were all unresolved and could be loaded with options).

BTW, how can we differentiate spark.read.table and spark.readStream.table with SQL API?

Sorry, not sure I understood this.

@cloud-fan (Contributor)

I understand that this new syntax adds a SQL API equivalent to spark.read.options(...).table(...); are we going to cover spark.readStream.options(...).table(...) as well?
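
For reference, the two DataFrame-side APIs being compared would look roughly like this (a sketch; the table name and the option are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Batch read with per-scan options (what the proposed SQL syntax targets).
val batchDf = spark.read
  .options(Map("fetchSize" -> "100")) // illustrative option
  .table("t")

// Streaming read with the same options; DataStreamReader also offers .options and .table.
val streamDf = spark.readStream
  .options(Map("fetchSize" -> "100"))
  .table("t")
```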

@github-actions bot added the DOCS label Aug 11, 2023
@szehon-ho (Contributor, Author) commented Aug 11, 2023

Hm, the original implementation (1b63105) did not differentiate; I made an UnresolvedRelation with options, which is the same as the DataStreamReader API.

Actually, after changing the first argument to use TABLE(), because ResolveFunctions for TVFs expects the children to all be resolved [1], with_options now gets a resolved relation as its argument. DataSourceV2Relation has options, but StreamingDataSourceV2Relation does not, so now I can only apply options to the former.

I am contemplating reverting to the original implementation, though, and not taking TABLE() as an argument: it would be nice to pass options into the relation-resolution process itself, which would cover both cases. Any thoughts on that?

[1]: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2085
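
A minimal sketch of what patching the resolved relations could look like (not the PR's actual code; it only illustrates that DataSourceV2Relation carries an options field while its streaming counterpart does not):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Sketch only: copy user-supplied options onto already-resolved DSv2 relations.
// Streaming DSv2 relations have no options field, so they pass through unchanged.
def patchOptions(plan: LogicalPlan, options: CaseInsensitiveStringMap): LogicalPlan =
  plan.transform {
    case r: DataSourceV2Relation => r.copy(options = options)
  }
```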

@dongjoon-hyun (Member)

Hi, @cloud-fan. Is your comment addressed? Any other concerns?

@dongjoon-hyun (Member)

To @szehon-ho: could you rebase this PR to master? The linter failure does not look related to this PR.

Using `mvn` from path: /__w/spark/spark/build/apache-maven-3.8.8/bin/mvn
Checkstyle checks failed at following occurrences:
Error:  src/test/java/org/apache/spark/launcher/SparkSubmitCommandBuilderSuite.java:[75] (regexp) RegexpSinglelineJava: Please use the `assertThrows` method to test for exceptions.
Error: Process completed with exit code 1.

@dongjoon-hyun removed their assignment Aug 30, 2023
@cloud-fan (Contributor) commented Sep 5, 2023

Let's spend more time on the API design first, as different people may have different opinions and we should collect as much feedback as possible.

Taking a step back, I think what we need is a SQL API to specify per-scan options, like spark.read.options(...). The SQL API should be general, as it's very likely that people will ask for something similar for df.write.options and spark.readStream.options.

TVF can only be used in the FROM clause, so a new SQL syntax may be better here. Inspired by the pgsql syntax, we can add a WITH clause to Spark SQL:

... FROM tbl_name WITH (optionA = v1, optionB = v2, ...)
INSERT INTO tbl_name WITH (optionA = v1, optionB = v2, ...) SELECT ...

Streaming is orthogonal to this, and this new WITH clause won't conflict with it. E.g. we can probably do ... FROM STREAM tbl_name WITH (...). It's out of the scope of this PR though, as streaming SQL is a big topic.

UPDATE:
In case we want to add built-in scan/write options, let's put data source options in a sub-clause of the WITH clause, e.g. FROM t WITH (OPTIONS (optionA = v1, optionB = v2, ...))
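
For concreteness, a query under this proposal might be issued like the following sketch (the grammar was still under discussion at this point; the table name and option are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Sketch of the proposed WITH (OPTIONS (...)) sub-clause; the syntax is not final.
spark.sql("SELECT * FROM t WITH (OPTIONS (fetchSize = '100'))").show()
```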

@srielau (Contributor) commented Sep 5, 2023

+1 on using a WITH clause.
Regarding the UPDATE:

WITH (OPTIONS (

Why the nesting?


We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@szehon-ho (Contributor, Author)

Thank you @cloud-fan for the nice suggestion, and sorry for the long delay; I was on leave and busy with other things. The WITH syntax does seem to greatly simplify the logic. I made a new PR, #46707, and we can discuss there.

cloud-fan pushed a commit that referenced this pull request Jul 10, 2024
### What changes were proposed in this pull request?
In Spark SQL, add a 'WITH OPTIONS' syntax to support dynamic relation options.

This is a continuation of #41683 based on cloud-fan's nice suggestion.
That was itself a continuation of #34072.

### Why are the changes needed?

This will allow Spark SQL to have parity with the DataFrameReader API. For example, it is possible today to specify options to data sources as follows via the API:

```
 spark.read.format("jdbc").option("fetchSize", 0).load()
```

This PR allows an equivalent Spark SQL syntax to specify options:
```
SELECT * FROM jdbcTable WITH OPTIONS(`fetchSize` = 0)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test in DataSourceV2SQLSuite

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46707 from szehon-ho/spark-36680.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
biruktesf-db pushed a commit to biruktesf-db/spark that referenced this pull request Jul 11, 2024
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024