
[SPARK-36680][SQL] Supports Dynamic Table Options for Spark SQL #41683

Closed
wants to merge 5 commits

Conversation

@szehon-ho (Contributor) commented Jun 21, 2023

What changes were proposed in this pull request?

This PR adds a SQL TVF (table-valued function):
with_options('table', map('foo', 'bar'))

Details:

  1. Add a RelationWithOptions function object.
  2. Add a ResolveRelationsWithOptions rule to resolve it to a relation with the given options.
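
For illustration, a read through the proposed TVF might look like the following sketch (the table name and the fetchSize option are invented for the example and are not from this PR):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: hypothetical usage of the proposed with_options TVF.
// "jdbc_table" and fetchSize are illustrative names, not from the PR.
val spark = SparkSession.builder().appName("with-options-demo").getOrCreate()
val df = spark.sql(
  "SELECT * FROM with_options('jdbc_table', map('fetchSize', '100'))")
df.show()
```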

Why are the changes needed?

Currently there is no way to dynamically configure individual DataSource relations in a Spark SQL query.

This is a continuation of the effort in #34072 and, based on the comment there (#34072 (comment)), uses a TVF instead of query hints.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test in DataSourceV2SQLSuite

@szehon-ho (Contributor, Author) commented Jun 21, 2023

As there are a lot of rules in the analyzer, please let me know if there's a better place/rule for this.

Note: the parser does not seem to support TVFs in write relations, but it works in read relations. Hopefully we can find a way to support write options for relations as well.

@huaxingao (Contributor)

cc @cloud-fan

@@ -1118,6 +1122,38 @@ case class Range(
}
}

@ExpressionDescription(
usage = "_FUNC_(identifier: String, options: Map) - " +
"Returns the data source relation with the given configuration.",

s/configuration/options

@szehon-ho (Contributor, Author) commented Jun 23, 2023

Hm, I just changed the UDF description in the last update; the build failure doesn't seem related:

python/pyspark/mllib/clustering.py:781: error: Decorated property not supported  [misc]
python/pyspark/mllib/clustering.py:949: error: Decorated property not supported  [misc]
Found 199 errors in 24 files (checked 668 source files)
Error: Process completed with exit code 1.

@szehon-ho (Contributor, Author)

Rebased to try to fix the test.

@huaxingao (Contributor)

@cloud-fan When you have a moment, could you please take a look at this PR to see if the approach is OK with you? Thanks a lot!

@dongjoon-hyun self-assigned this Aug 8, 2023
@dongjoon-hyun (Member) left a comment

Could you rebase this PR, @szehon-ho?

@szehon-ho (Contributor, Author)

Rebased, thanks for taking a look @dongjoon-hyun

@cloud-fan (Contributor)

Note: TVF parameters can now be tables, so we can do with_options(TABLE t, map(...)).

BTW, how can we differentiate spark.read.table and spark.readStream.table with SQL API?

@szehon-ho (Contributor, Author)

Done, the latest modification makes the first argument accept a table parameter, i.e., TABLE(t1).

In this version, it seems the relation is already resolved, so all I can do is patch the resolved relations if they have options (DSv2) (versus before, when they were all unresolved and could be loaded with options).

BTW, how can we differentiate spark.read.table and spark.readStream.table with SQL API?

Sorry, not sure I understood this.

@cloud-fan (Contributor)

I understand that this new syntax adds a SQL API equivalent to spark.read.options(...).table(...); are we going to cover spark.readStream.options(...).table(...) as well?
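
For reference, the two DataFrame-side APIs being compared would look roughly like this (a sketch; the table name and the option are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Batch read with per-scan options (what the proposed SQL syntax targets).
val batchDf = spark.read
  .options(Map("fetchSize" -> "100")) // illustrative option
  .table("t")

// Streaming read with the same options; DataStreamReader also offers .options and .table.
val streamDf = spark.readStream
  .options(Map("fetchSize" -> "100"))
  .table("t")
```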

@github-actions bot added the DOCS label Aug 11, 2023
@szehon-ho (Contributor, Author) commented Aug 11, 2023

Hm, the original implementation (1b63105) did not differentiate; I made an UnresolvedRelation with options, which is the same as the DataStreamReader API.

Actually, after changing the first argument to use TABLE(), because ResolveFunctions for TVFs expects the children to all be resolved [1], with_options now gets a resolved relation as its argument. DataSourceV2Relation has options, but StreamingDataSourceV2Relation does not, so now I can only apply options to the former.

I am contemplating reverting to the original implementation, though, and not taking TABLE() as an argument: it would be nice to pass options into the relation-resolution process itself, which would cover both cases. Any thoughts on that?

[1]: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2085
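
A minimal sketch of what patching the resolved relations could look like (not the PR's actual code; it only illustrates that DataSourceV2Relation carries an options field while its streaming counterpart does not):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Sketch only: copy user-supplied options onto already-resolved DSv2 relations.
// Streaming DSv2 relations have no options field, so they pass through unchanged.
def patchOptions(plan: LogicalPlan, options: CaseInsensitiveStringMap): LogicalPlan =
  plan.transform {
    case r: DataSourceV2Relation => r.copy(options = options)
  }
```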

@dongjoon-hyun (Member)

Hi, @cloud-fan. Is your comment addressed? Any other concerns?

@dongjoon-hyun (Member)

To @szehon-ho: could you rebase this PR to master? The linter failure does not look related to this PR.

Using `mvn` from path: /__w/spark/spark/build/apache-maven-3.8.8/bin/mvn
Checkstyle checks failed at following occurrences:
Error:  src/test/java/org/apache/spark/launcher/SparkSubmitCommandBuilderSuite.java:[75] (regexp) RegexpSinglelineJava: Please use the `assertThrows` method to test for exceptions.
Error: Process completed with exit code 1.

@dongjoon-hyun removed their assignment Aug 30, 2023
@cloud-fan (Contributor) commented Sep 5, 2023

Let's spend more time on the API design first, as different people may have different opinions and we should collect as much feedback as possible.

Taking a step back, I think what we need is a SQL API to specify per-scan options, like spark.read.options(...). The SQL API should be general, as it's very likely that people will ask for something similar for df.write.options and spark.readStream.options.

TVF can only be used in the FROM clause, so a new SQL syntax may be better here. Inspired by the pgsql syntax, we can add a WITH clause to Spark SQL:

... FROM tbl_name WITH (optionA = v1, optionB = v2, ...)
INSERT INTO tbl_name WITH (optionA = v1, optionB = v2, ...) SELECT ...

Streaming is orthogonal to this, and this new WITH clause won't conflict with it. E.g. we can probably do ... FROM STREAM tbl_name WITH (...). It's out of the scope of this PR though, as streaming SQL is a big topic.

UPDATE:
In case we want to add built-in scan/write options, let's put data source options in a sub-clause of the WITH clause, e.g. FROM t WITH (OPTIONS (optionA = v1, optionB = v2, ...))
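
For concreteness, a query under this proposal might be issued like the following sketch (the grammar was still under discussion at this point; the table name and option are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Sketch of the proposed WITH (OPTIONS (...)) sub-clause; the syntax is not final.
spark.sql("SELECT * FROM t WITH (OPTIONS (fetchSize = '100'))").show()
```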

@srielau (Contributor) commented Sep 5, 2023

+1 on using a WITH clause.
Regarding the UPDATE:

WITH (OPTIONS (

Why the nesting?


We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@szehon-ho (Contributor, Author)

Thank you @cloud-fan for the nice suggestion, and sorry for the long delay; I was on leave and busy with other things. The WITH syntax does seem to greatly simplify the logic. I made a new PR, #46707, and we can discuss there.

cloud-fan pushed a commit that referenced this pull request Jul 10, 2024
### What changes were proposed in this pull request?
In Spark SQL, add a 'WITH OPTIONS' syntax to support dynamic relation options.

This is a continuation of #41683 based on cloud-fan's nice suggestion.
That was itself a continuation of #34072.

### Why are the changes needed?

This will allow Spark SQL to have parity with the DataFrameReader API. For example, it is possible today to specify options to data sources as follows via the API:

```
 spark.read.format("jdbc").option("fetchSize", 0).load()
```

This PR allows an equivalent Spark SQL syntax to specify options:
```
SELECT * FROM jdbcTable WITH OPTIONS(`fetchSize` = 0)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test in DataSourceV2SQLSuite

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46707 from szehon-ho/spark-36680.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
biruktesf-db pushed a commit to biruktesf-db/spark that referenced this pull request Jul 11, 2024
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024