
[SPARK-36680][SQL] Supports Dynamic Table Options for Spark SQL #34072

Closed
wants to merge 8 commits

Conversation

wang-zhun
Contributor

@wang-zhun wang-zhun commented Sep 22, 2021

What changes were proposed in this pull request?

Added a new hint, OPTIONS, to support SQL table-level options.

Why are the changes needed?

Currently, a DataFrame API user can pass dynamic options through the DataFrameReader$option method, but Spark SQL users cannot. The options are consumed by:

org.apache.spark.sql.connector.catalog.SupportsRead$newScanBuilder
org.apache.spark.sql.connector.catalog.SupportsWrite$newWriteBuilder

Table options are persisted to the catalog, and modifying them requires separate DDL such as "ALTER TABLE ...". But there are cases where a user wants to modify the table options dynamically, just for the current query:

  • Take JDBCTable as an example of table-level options
SELECT *
FROM jdbc_catalog.db.table1 /*+ OPTIONS('lowerBound'='1', 'upperBound'='10', 'numPartitions'='5') */ t1
	JOIN jdbc_catalog.db.table2 /*+ OPTIONS('lowerBound'='100', 'upperBound'='1000', 'numPartitions'='10') */ t2
	ON t1.col1 = t2.col1

  • Take IcebergTable as an example of time travel support

Setting these parameters is common and ad hoc; allowing them to be set flexibly would improve the Spark SQL user experience, especially now that catalog extension is supported.

Does this PR introduce any user-facing change?

OPTIONS Hints
SELECT * FROM jdbc_catalog.db.table /*+ OPTIONS('lowerBound'='1', 'upperBound'='10', 'numPartitions'='5') */

How was this patch tested?

Added unit tests.

@github-actions github-actions bot added the SQL label Sep 22, 2021
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon HyukjinKwon changed the title [SPARK-36680][CATALYST] Supports Dynamic Table Options for Spark SQL [SPARK-36680][SQL] Supports Dynamic Table Options for Spark SQL Sep 23, 2021
@coolderli

This feature is very useful. We can dynamically adjust table options when we submit a query. For example, we can implement time travel on Iceberg through Spark SQL.
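
Concretely, using Iceberg's read option names ('snapshot-id' and 'as-of-timestamp') together with the hint proposed in this PR, a time-travel read might look like the following sketch (catalog, table, and values are illustrative):

```sql
-- Read a specific Iceberg snapshot via the proposed OPTIONS hint
SELECT *
FROM iceberg_catalog.db.events /*+ OPTIONS('snapshot-id'='10963874102873523486') */;

-- Or read the table state as of a point in time (milliseconds since epoch)
SELECT *
FROM iceberg_catalog.db.events /*+ OPTIONS('as-of-timestamp'='1632300000000') */;
```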

@coolderli

@RussellSpitzer @aokolnychyi Could you please take a look? Thanks!

@RussellSpitzer
Member

This seems like a reasonable change to me, would we want to do similar things for write? Or would this also work for write if the relation is being written to rather than being read from?
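
If the hint were also honored on the target relation of a write, it might hypothetically look like the sketch below; this write-side syntax is not part of the PR as proposed, and 'batchsize' is a Spark JDBC write option used only for illustration:

```sql
-- Hypothetical: pass options to the write path of the insert target
INSERT INTO jdbc_catalog.db.table1 /*+ OPTIONS('batchsize'='10000') */
SELECT * FROM jdbc_catalog.db.staging;
```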

@HyukjinKwon
Member

yeah, I agree that the idea makes sense too. one thing is if the syntax makes sense ...

@wang-zhun
Contributor Author

Thanks @HyukjinKwon @RussellSpitzer for your careful review.

@wang-zhun
Contributor Author

@HyukjinKwon Please help review when you have time, thank you.

@jiasheng55

Hi, @rdblue , this dynamic table option feature is really helpful when processing Iceberg tables, could you help to take a look at this PR, thanks!

@rdblue
Contributor

rdblue commented Nov 9, 2021

Overall, this change makes sense to me. This isn't a great way to time travel because we want to load the table at a specific version/snapshot or time. But @huaxingao is working on that in a separate issue. For everything else, this is a great improvement. Thank you, @wang-zhun!

@xkrogen
Contributor

xkrogen commented Nov 9, 2021

Haven't had a chance to look through the implementation yet, but big +1 on the feature from my side. We maintain an internal DSv2 source that requires options to leverage more advanced functionality, and currently it's not possible to use those features via the pure SQL API.

cc @wmoustafa

@wmoustafa

+1 on the feature as well.

@HyukjinKwon
Member

HyukjinKwon commented Nov 11, 2021

Shall we initiate a discussion on Spark dev mailing list? I would like to have this feature too but not sure if the syntax makes sense.

@rxin
Contributor

rxin commented Nov 15, 2021

It seems weird to have it as a hint in SQL select statement (not clear that it's part of a table scan). Maybe better as a TVF argument?

@wang-zhun
Contributor Author

It seems weird to have it as a hint in SQL select statement (not clear that it's part of a table scan). Maybe better as a TVF argument?

@rxin Thank you for your suggestion. It is very useful for us; we had not considered TVFs before.
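
For illustration, a table-valued-function form of the same idea might look like the sketch below; the function name and signature are hypothetical, not an existing Spark API:

```sql
-- Hypothetical TVF that attaches per-scan options to a table read
SELECT *
FROM read_with_options(
       'jdbc_catalog.db.table1',
       map('lowerBound', '1', 'upperBound', '10', 'numPartitions', '5')
     ) t1;
```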

@wmoustafa

It seems that there are two types of options that can be supported through this API: options that do not change the query semantics/results and only affect physical plan choices, and options that do affect the query semantics. Examples of the former include setting the split size; examples of the latter include setting the snapshot ID or the timestamp of a time-travel query. The latter mostly applies to temporal/versioning queries. Using the hints API sounds acceptable for the former. A hints + options API is used in Flink, but it still seems to control physical plan options. If most of the semantic options (i.e., beyond physical) revolve around snapshots, versioning, and time travel, we can consider the SQL:2011 standard, which added support for temporal databases.
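
For reference, SQL:2011 expresses a temporal query over a system-versioned table with FOR SYSTEM_TIME; a sketch (table name illustrative):

```sql
-- SQL:2011 temporal query: read the table state as of a past point in time
SELECT *
FROM orders FOR SYSTEM_TIME AS OF TIMESTAMP '2021-09-22 00:00:00';
```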

@huaxingao
Contributor

Just an FYI: I am working on time travel in this PR #34497
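
The dedicated time-travel syntax explored there is along these lines (a sketch; see #34497 for the actual grammar):

```sql
-- Time travel with first-class syntax rather than scan options
SELECT * FROM t VERSION AS OF 1;
SELECT * FROM t TIMESTAMP AS OF '2021-11-01 00:00:00';
```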

@HyukjinKwon
Member

BTW, what's the benefit over having:

-- option `key` being passed with `value`
SET spark.datasource.jdbc.key=value;
SELECT * FROM jdbc_catalog.db.table;
RESET spark.datasource.jdbc.key;

besides that it's just shorter?

@wmoustafa

BTW, what's the benefit over having:

-- option `key` being passed with `value`
SET spark.datasource.jdbc.key=value;
SELECT * FROM jdbc_catalog.db.table;
RESET spark.datasource.jdbc.key;

besides that it's just shorter?

Those are table-level options, so one may set different key/value pairs per table within the same query. Also, the set-and-reset approach is not friendly to concurrent queries.

@HyukjinKwon
Member

HyukjinKwon commented Nov 16, 2021

Can we improve the PR description on that point? There already seems to be a way to resolve the issue the PR description explains.

@wmoustafa

There is also a catch to allowing the hint framework to control physical plan properties as opposed to query semantics: it does not seem straightforward to enforce that distinction (i.e., to prevent time travel from being communicated through hints). Does it make sense to make the OPTIONS API a top-level SQL keyword outside the hint framework? This also aligns with the style of the counterpart Scala API (which does not use hints). Further, we would not have to worry about whether an option defines physical vs. semantic behavior (which is also the case in the Scala version).

@wang-zhun
Contributor Author

Closing this PR for now; looking forward to a better proposal and implementation.

@szehon-ho
Contributor

Hi, I still wanted to see this issue solved, so I made PR #41683 to propose an implementation of the suggestion (a TVF).

The only caveat seems to be that the parser does not support TVFs for write relations, so we may need to find an alternate solution there, but I hope it will work at least for read relations.
