
[SPARK-36680][SQL] Supports Dynamic Table Options for Spark SQL #34072

Closed
wants to merge 8 commits

Conversation

wang-zhun
Contributor

@wang-zhun wang-zhun commented Sep 22, 2021

What changes were proposed in this pull request?

Added a new hint, OPTIONS, to support SQL table-level options.

Why are the changes needed?

Currently, a DataFrame API user can pass dynamic options through the DataFrameReader$option method, but Spark SQL users cannot. The options are consumed by:

org.apache.spark.sql.connector.catalog.SupportsRead$newScanBuilder
org.apache.spark.sql.connector.catalog.SupportsWrite$newWriteBuilder

Table options are persisted to the catalog, and modifying them requires separate DDL such as "ALTER TABLE ...". But there are cases where a user wants to modify the table options dynamically, just for the current query:

  • Take JDBCTable as an example of table-level options
SELECT *
FROM jdbc_catalog.db.table1 /*+ OPTIONS('lowerBound'='1', 'upperBound'='10', 'numPartitions'='5') */ t1
	JOIN jdbc_catalog.db.table2 /*+ OPTIONS('lowerBound'='100', 'upperBound'='1000', 'numPartitions'='10') */ t2
	ON t1.col1 = t2.col1

  • Take IcebergTable as an example of time travel support

Setting these parameters is common and ad hoc; allowing them to be set flexibly would improve the Spark SQL user experience, especially now that catalog extension is supported.

Does this PR introduce any user-facing change?

OPTIONS Hints
SELECT * FROM jdbc_catalog.db.table /*+ OPTIONS('lowerBound'='1', 'upperBound'='10', 'numPartitions'='5') */

How was this patch tested?

Added unit tests.

@github-actions github-actions bot added the SQL label Sep 22, 2021
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon HyukjinKwon changed the title [SPARK-36680][CATALYST] Supports Dynamic Table Options for Spark SQL [SPARK-36680][SQL] Supports Dynamic Table Options for Spark SQL Sep 23, 2021
@coolderli

This feature is very useful. We can dynamically adjust table options when we submit a query. For example, we can implement time travel on Iceberg through Spark SQL.
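
Concretely, using Iceberg's read option names ('snapshot-id' and 'as-of-timestamp') together with the hint proposed in this PR, a time-travel read might look like the following sketch (catalog, table, and values are illustrative):

```sql
-- Read a specific Iceberg snapshot via the proposed OPTIONS hint
SELECT *
FROM iceberg_catalog.db.events /*+ OPTIONS('snapshot-id'='10963874102873523486') */;

-- Or read the table state as of a point in time (milliseconds since epoch)
SELECT *
FROM iceberg_catalog.db.events /*+ OPTIONS('as-of-timestamp'='1632300000000') */;
```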

@coolderli

@RussellSpitzer @aokolnychyi Could you please take a look? Thanks!

@RussellSpitzer
Member

This seems like a reasonable change to me, would we want to do similar things for write? Or would this also work for write if the relation is being written to rather than being read from?
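
If the hint were also honored on the target relation of a write, it might hypothetically look like the sketch below; this write-side syntax is not part of the PR as proposed, and 'batchsize' is a Spark JDBC write option used only for illustration:

```sql
-- Hypothetical: pass options to the write path of the insert target
INSERT INTO jdbc_catalog.db.table1 /*+ OPTIONS('batchsize'='10000') */
SELECT * FROM jdbc_catalog.db.staging;
```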

@HyukjinKwon
Member

yeah, I agree that the idea makes sense too. one thing is if the syntax makes sense ...

@wang-zhun
Contributor Author

Thanks @HyukjinKwon @RussellSpitzer for your careful review.

@wang-zhun
Contributor Author

@HyukjinKwon Please help review when you have time, thank you.

@jiasheng55

Hi, @rdblue , this dynamic table option feature is really helpful when processing Iceberg tables, could you help to take a look at this PR, thanks!

@rdblue
Contributor

rdblue commented Nov 9, 2021

Overall, this change makes sense to me. This isn't a great way to time travel because we want to load the table at a specific version/snapshot or time. But @huaxingao is working on that in a separate issue. For everything else, this is a great improvement. Thank you, @wang-zhun!

@xkrogen
Contributor

xkrogen commented Nov 9, 2021

Haven't had a chance to look through the implementation yet, but big +1 on the feature from my side. We maintain an internal DSv2 source that requires options to leverage more advanced functionality, and currently it's not possible to use those features via the pure SQL API.

cc @wmoustafa

@wmoustafa

+1 on the feature as well.

@HyukjinKwon
Member

HyukjinKwon commented Nov 11, 2021

Shall we initiate a discussion on Spark dev mailing list? I would like to have this feature too but not sure if the syntax makes sense.

@rxin
Contributor

rxin commented Nov 15, 2021

It seems weird to have it as a hint in SQL select statement (not clear that it's part of a table scan). Maybe better as a TVF argument?

@wang-zhun
Contributor Author

It seems weird to have it as a hint in SQL select statement (not clear that it's part of a table scan). Maybe better as a TVF argument?

@rxin Thank you for your suggestion. It is very useful for us; we had not considered TVFs before.
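
For illustration, a table-valued-function form of the same idea might look like the sketch below; the function name and signature are hypothetical, not an existing Spark API:

```sql
-- Hypothetical TVF that attaches per-scan options to a table read
SELECT *
FROM read_with_options(
       'jdbc_catalog.db.table1',
       map('lowerBound', '1', 'upperBound', '10', 'numPartitions', '5')
     ) t1;
```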

@wmoustafa

It seems that there are two types of options that can be supported through this API: options that do not change the query semantics/results and only affect physical plan choices, and options that do affect the query semantics. Examples of the former include setting the split size; examples of the latter include setting the snapshot ID or the timestamp of a time-travel query. The latter mostly applies to temporal/versioning queries. Using the hints API sounds acceptable for the former. A hints + options API is used in Flink, but it still seems to control physical plan options. If most of the semantic options (i.e., beyond physical) revolve around snapshots, versioning, and time travel, we can consider the SQL:2011 standard, which added support for temporal databases.
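
For reference, SQL:2011 expresses a temporal query over a system-versioned table with FOR SYSTEM_TIME; a sketch (table name illustrative):

```sql
-- SQL:2011 temporal query: read the table state as of a past point in time
SELECT *
FROM orders FOR SYSTEM_TIME AS OF TIMESTAMP '2021-09-22 00:00:00';
```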

@huaxingao
Contributor

Just an FYI: I am working on time travel in this PR #34497
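
The dedicated time-travel syntax explored there is along these lines (a sketch; see #34497 for the actual grammar):

```sql
-- Time travel with first-class syntax rather than scan options
SELECT * FROM t VERSION AS OF 1;
SELECT * FROM t TIMESTAMP AS OF '2021-11-01 00:00:00';
```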

@HyukjinKwon
Member

BTW, what's the benefit over having:

-- option `key` being passed with `value`
SET spark.datasource.jdbc.key=value;
SELECT * FROM jdbc_catalog.db.table;
RESET spark.datasource.jdbc.key;

besides that it's just shorter?

@wmoustafa

BTW, what's the benefit over having:

-- option `key` being passed with `value`
SET spark.datasource.jdbc.key=value;
SELECT * FROM jdbc_catalog.db.table;
RESET spark.datasource.jdbc.key;

besides that it's just shorter?

Those are table-level options, so one may set different key/value pairs per table within the same query. Also, the set-and-reset approach is not friendly to concurrent queries.

@HyukjinKwon
Member

HyukjinKwon commented Nov 16, 2021

Can we improve the PR description on that point? There already seems to be a way to resolve the issue the PR description explains.

@wmoustafa

There is also a catch to allowing the hint framework to control physical plan properties as opposed to query semantics: it does not seem straightforward to enforce that distinction (i.e., to prevent time travel from being communicated through hints). Does it make sense to make the OPTIONS API a top-level SQL keyword outside the hint framework? This also aligns with the style of the counterpart Scala API (which does not use hints). Further, we would not have to worry about whether an option defines physical vs. semantic behavior (which is also the case in the Scala version).

@wang-zhun
Contributor Author

Closing this PR for now; looking forward to a better proposal and implementation.

@szehon-ho
Contributor

Hi, I still wanted to see this issue solved, so I made PR #41683 to propose an implementation of the suggestion (a TVF).

The only caveat seems to be that the parser does not support TVFs for write relations, so we may need to find an alternate solution there, but I hope it will work at least for read relations.
