[SPARK-37286][SQL] Move compileAggregates from JDBCRDD to JdbcDialect #34554
Conversation
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #145105 has finished for PR 34554 at commit
Test build #145109 has finished for PR 34554 at commit
Kubernetes integration test starting
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test status failure
Test build #145115 has finished for PR 34554 at commit
Test build #145116 has finished for PR 34554 at commit
ping @cloud-fan
The [SQL] tag is missing in the title
* Returns None for an unhandled filter.
*/
@Since("3.3.0")
def compileFilter(f: Filter): Option[String] = {
The problem is: is there any database that has a different filter/aggregate syntax? The main point of having it in JdbcDialect is to allow certain databases to override it.
Yes. I think we should let any database override it.
Can you give an example? Which database has a different filter/aggregate syntax? This is SQL standard.
This PR adds two new developer APIs in JdbcDialect, and we need to justify them.
There are two situations:
First, databases A and B may implement different subsets of the aggregate functions defined by the SQL standard.
Second, some databases implement aggregate functions that do not follow the SQL standard.
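To make those two situations concrete, here is a minimal, self-contained sketch of the override pattern being discussed. The types and dialect names below (`AggregateFunc`, `BaseDialect`, `LegacyDialect`) are illustrative stand-ins, not Spark's real API:

```scala
// Illustrative stand-ins only; these are NOT Spark's real classes.
sealed trait AggregateFunc
case class Min(column: String) extends AggregateFunc
case class Max(column: String) extends AggregateFunc
case class VarPop(column: String) extends AggregateFunc

// Default dialect: compiles the SQL-standard spellings.
class BaseDialect {
  def compileAggregate(agg: AggregateFunc): Option[String] = agg match {
    case Min(col)    => Some(s"MIN($col)")
    case Max(col)    => Some(s"MAX($col)")
    case VarPop(col) => Some(s"VAR_POP($col)")
  }
}

// Hypothetical database whose population-variance function has a
// non-standard spelling; it overrides only the case it spells differently.
object LegacyDialect extends BaseDialect {
  override def compileAggregate(agg: AggregateFunc): Option[String] = agg match {
    case VarPop(col) => Some(s"VARIANCE_POP($col)") // dialect-specific spelling
    case other       => super.compileAggregate(other)
  }
}
```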
I think Spark will/should only push down SQL-standard predicates/aggregate functions. Do you have plans to change that?
In my current work, if I need to support a new predicate or aggregate function, I will submit a PR for it first.
Note that Spark's current pushdown is all-or-nothing: as long as one function can't be pushed down, nothing is pushed down at all.
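For reference, the all-or-nothing behaviour described above can be sketched as follows, reusing the illustrative `BaseDialect` and `AggregateFunc` stand-ins from the earlier sketch (again, not Spark's actual code):

```scala
// Sketch: compile every aggregate, and push down only if all of them compile.
def compileAggregates(dialect: BaseDialect,
                      aggregates: Seq[AggregateFunc]): Option[Seq[String]] = {
  val compiled = aggregates.flatMap(dialect.compileAggregate)
  // One uncompilable aggregate means no aggregate push-down at all.
  if (compiled.length == aggregates.length) Some(compiled) else None
}
```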
Based on the offline discussion, reopening this PR.
Test build #145851 has finished for PR 34554 at commit
Kubernetes integration test unable to build dist. exiting with code: 1
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #145887 has finished for PR 34554 at commit
thanks, merging to master!
@cloud-fan Thank you.
### What changes were proposed in this pull request?
This PR fixes an issue where Javadoc generation fails due to the wrong parameter name of a method added in SPARK-37286 (#34554). https://github.com/apache/spark/runs/4409267346?check_suite_focus=true#step:9:5081
### Why are the changes needed?
To keep the build clean.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA itself.
Closes #34801 from sarutak/followup-SPARK-37286.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
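The kind of mismatch this follow-up fixes can be illustrated with a hypothetical example (the names below are invented, not the actual code touched by the follow-up): strict Javadoc/Scaladoc generation rejects an `@param` tag whose name does not match any parameter of the method.

```scala
trait Filter // stand-in, not Spark's org.apache.spark.sql.sources.Filter

object Broken {
  /**
   * Converts a filter into a SQL string the dialect understands.
   *
   * @param filter the filter to convert  (wrong: there is no parameter named "filter")
   */
  def compileFilter(f: Filter): Option[String] = None
}

object Fixed {
  /**
   * Converts a filter into a SQL string the dialect understands.
   *
   * @param f the filter to convert
   */
  def compileFilter(f: Filter): Option[String] = None
}
```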
Build didn't pass - the Javadoc build failed: https://github.com/beliefer/spark/actions/runs/1534093499
### What changes were proposed in this pull request?
Currently, the method `compileAggregates` is a member of `JDBCRDD`. That is not the right place for it, because each JDBC source knows best how to compile aggregate expressions into its own dialect.
### Why are the changes needed?
The JDBC source knows how to compile aggregate expressions into its own dialect. After this PR, we can extend pushdown (e.g. aggregate pushdown) based on the dialects of different JDBC databases. There are two situations: first, databases A and B may implement different subsets of the SQL-standard aggregate functions; second, some databases implement aggregate functions that do not follow the SQL standard.
### Does this PR introduce _any_ user-facing change?
No. This only changes the internal implementation.
### How was this patch tested?
Jenkins tests.
Closes apache#34554 from beliefer/SPARK-37286.
Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
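As a rough, simplified sketch of the shape of the change (again using the stand-in types from the earlier sketches rather than the real `JDBCRDD`/`JdbcDialect` code), the scan side now only asks the dialect for compiled aggregates when it builds the SELECT list:

```scala
// Sketch: the scan delegates aggregate compilation to the dialect and falls
// back to a plain column projection when complete compilation is not possible.
def buildSelectClause(dialect: BaseDialect,
                      columns: Seq[String],
                      aggregates: Seq[AggregateFunc]): String = {
  val compiled = aggregates.map(dialect.compileAggregate)
  val selectList =
    if (aggregates.nonEmpty && compiled.forall(_.isDefined)) compiled.flatten
    else columns
  s"SELECT ${selectList.mkString(", ")}"
}
```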
* [SPARK-36556][SQL] Add DSV2 filters Co-Authored-By: DB Tsai d_tsaiapple.com Co-Authored-By: Huaxin Gao huaxin_gaoapple.com ### What changes were proposed in this pull request? Add DSV2 Filters and use these in V2 codepath. ### Why are the changes needed? The motivation of adding DSV2 filters: 1. The values in V1 filters are Scala types. When translating catalyst `Expression` to V1 filers, we have to call `convertToScala` to convert from Catalyst types used internally in rows to standard Scala types, and later convert Scala types back to Catalyst types. This is very inefficient. In V2 filters, we use `Expression` for filter values, so the conversion from Catalyst types to Scala types and Scala types back to Catalyst types are avoided. 2. Improve nested column filter support. 3. Make the filters work better with the rest of the DSV2 APIs. ### Does this PR introduce _any_ user-facing change? Yes. The new V2 filters ### How was this patch tested? new test Closes #33803 from huaxingao/filter. Lead-authored-by: Huaxin Gao <huaxin_gao@apple.com> Co-authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> * [SPARK-36760][SQL] Add interface SupportsPushDownV2Filters Co-Authored-By: DB Tsai d_tsaiapple.com Co-Authored-By: Huaxin Gao huaxin_gaoapple.com ### What changes were proposed in this pull request? This is the 2nd PR for V2 Filter support. This PR does the following: - Add interface SupportsPushDownV2Filters Future work: - refactor `OrcFilters`, `ParquetFilters`, `JacksonParser`, `UnivocityParser` so both V1 file source and V2 file source can use them - For V2 file source: implement v2 filter -> parquet/orc filter. csv and Json don't have real filters, but also need to change the current code to have v2 filter -> `JacksonParser`/`UnivocityParser` - For V1 file source, keep what we currently have: v1 filter -> parquet/orc filter - We don't need v1filter.toV2 and v2filter.toV1 since we have two separate paths The reasons that we have reached the above conclusion: - The major motivation to implement V2Filter is to eliminate the unnecessary conversion between Catalyst types and Scala types when using Filters. - We provide this `SupportsPushDownV2Filters` in this PR so V2 data source (e.g. iceberg) can implement it and use V2 Filters - There are lots of work to implement v2 filters in the V2 file sources because of the following reasons: possible approaches for implementing V2Filter: 1. keep what we have for file source v1: v1 filter -> parquet/orc filter file source v2 we will implement v2 filter -> parquet/orc filter We don't need v1->v2 and v2->v1 problem with this approach: there are lots of code duplication 2. We will implement v2 filter -> parquet/orc filter file source v1: v1 filter -> v2 filter -> parquet/orc filter We will need V1 -> V2 This is the approach I am using in https://github.com/apache/spark/pull/33973 In that PR, I have v2 orc: v2 filter -> orc filter V1 orc: v1 -> v2 -> orc filter v2 csv: v2->v1, new UnivocityParser v1 csv: new UnivocityParser v2 Json: v2->v1, new JacksonParser v1 Json: new JacksonParser csv and Json don't have real filters, they just use filter references, should be OK to use either v1 and v2. Easier to use v1 because no need to change. I haven't finished parquet yet. The PR doesn't have the parquet V2Filter implementation, but I plan to have v2 parquet: v2 filter -> parquet filter v1 parquet: v1 -> v2 -> parquet filter Problem with this approach: 1. 
It's not easy to implement V1->V2 because V2 filter have `LiteralValue` and needs type info. We already lost the type information when we convert Expression filer to v1 filter. 2. parquet is OK Use Timestamp as example, parquet filter takes long for timestamp v2 parquet: v2 filter -> parquet filter timestamp Expression (Long) -> v2 filter (LiteralValue Long)-> parquet filter (Long) V1 parquet: v1 -> v2 -> parquet filter timestamp Expression (Long) -> v1 filter (timestamp) -> v2 filter (LiteralValue Long)-> parquet filter (Long) but we have problem for orc because orc filter takes java Timestamp v2 orc: v2 filter -> orc filter timestamp Expression (Long) -> v2 filter (LiteralValue Long)-> parquet filter (Timestamp) V1 orc: v1 -> v2 -> orc filter Expression (Long) -> v1 filter (timestamp) -> v2 filter (LiteralValue Long)-> parquet filter (Timestamp) This defeats the purpose of implementing v2 filters. 3. keep what we have for file source v1: v1 filter -> parquet/orc filter file source v2: v2 filter -> v1 filter -> parquet/orc filter We will need V2 -> V1 we have similar problem as approach 2. So the conclusion is: approach 1 (keep what we have for file source v1: v1 filter -> parquet/orc filter file source v2 we will implement v2 filter -> parquet/orc filter) is better, but there are lots of code duplication. We will need to refactor `OrcFilters`, `ParquetFilters`, `JacksonParser`, `UnivocityParser` so both V1 file source and V2 file source can use them. ### Why are the changes needed? Use V2Filters to eliminate the unnecessary conversion between Catalyst types and Scala types. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Added new UT Closes #34001 from huaxingao/v2filter. Lead-authored-by: Huaxin Gao <huaxin_gao@apple.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37020][SQL] DS V2 LIMIT push down ### What changes were proposed in this pull request? Push down limit to data source for better performance ### Why are the changes needed? For LIMIT, e.g. `SELECT * FROM table LIMIT 10`, Spark retrieves all the data from table and then returns 10 rows. If we can push LIMIT to data source side, the data transferred to Spark will be dramatically reduced. ### Does this PR introduce _any_ user-facing change? Yes. new interface `SupportsPushDownLimit` ### How was this patch tested? new test Closes #34291 from huaxingao/pushdownLimit. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Huaxin Gao <huaxin_gao@apple.com> * [SPARK-37038][SQL] DSV2 Sample Push Down ### What changes were proposed in this pull request? Push down Sample to data source for better performance. If Sample is pushed down, it will be removed from logical plan so it will not be applied at Spark any more. 
Current Plan without Sample push down: ``` == Parsed Logical Plan == 'Project [*] +- 'Sample 0.0, 0.8, false, 157 +- 'UnresolvedRelation [postgresql, new_table], [], false == Analyzed Logical Plan == col1: int, col2: int Project [col1#163, col2#164] +- Sample 0.0, 0.8, false, 157 +- SubqueryAlias postgresql.new_table +- RelationV2[col1#163, col2#164] new_table == Optimized Logical Plan == Sample 0.0, 0.8, false, 157 +- RelationV2[col1#163, col2#164] new_table == Physical Plan == *(1) Sample 0.0, 0.8, false, 157 +- *(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$16dde4769 [col1#163,col2#164] PushedAggregates: [], PushedFilters: [], PushedGroupby: [], ReadSchema: struct<col1:int,col2:int> ``` after Sample push down: ``` == Parsed Logical Plan == 'Project [*] +- 'Sample 0.0, 0.8, false, 187 +- 'UnresolvedRelation [postgresql, new_table], [], false == Analyzed Logical Plan == col1: int, col2: int Project [col1#163, col2#164] +- Sample 0.0, 0.8, false, 187 +- SubqueryAlias postgresql.new_table +- RelationV2[col1#163, col2#164] new_table == Optimized Logical Plan == RelationV2[col1#163, col2#164] new_table == Physical Plan == *(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$165b57543 [col1#163,col2#164] PushedAggregates: [], PushedFilters: [], PushedGroupby: [], PushedSample: TABLESAMPLE 0.0 0.8 false 187, ReadSchema: struct<col1:int,col2:int> ``` The new interface is implemented using JDBC for POC and end to end test. TABLESAMPLE is not supported by all the databases. It is implemented using postgresql in this PR. ### Why are the changes needed? Reduce IO and improve performance. For SAMPLE, e.g. `SELECT * FROM t TABLESAMPLE (1 PERCENT)`, Spark retrieves all the data from table and then return 1% rows. It will dramatically reduce the transferred data size and improve performance if we can push Sample to data source side. ### Does this PR introduce any user-facing change? Yes. new interface `SupportsPushDownTableSample` ### How was this patch tested? New test Closes #34451 from huaxingao/sample. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37286][SQL] Move compileAggregates from JDBCRDD to JdbcDialect ### What changes were proposed in this pull request? Currently, the method `compileAggregates` is a member of `JDBCRDD`. But it is not reasonable, because the JDBC source knowns how to compile aggregate expressions to itself's dialect well. ### Why are the changes needed? JDBC source knowns how to compile aggregate expressions to itself's dialect well. After this PR, we can extend the pushdown(e.g. aggregate) based on different dialect between different JDBC database. There are two situations: First, database A and B implement a different number of aggregate functions that meet the SQL standard. ### Does this PR introduce _any_ user-facing change? 'No'. Just change the inner implementation. ### How was this patch tested? Jenkins tests. Closes #34554 from beliefer/SPARK-37286. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37286][DOCS][FOLLOWUP] Fix the wrong parameter name for Javadoc ### What changes were proposed in this pull request? This PR fixes an issue that the Javadoc generation fails due to the wrong parameter name of a method added in SPARK-37286 (#34554). https://github.com/apache/spark/runs/4409267346?check_suite_focus=true#step:9:5081 ### Why are the changes needed? To keep the build clean. 
### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA itself. Closes #34801 from sarutak/followup-SPARK-37286. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Sean Owen <srowen@gmail.com> * [SPARK-37262][SQL] Don't log empty aggregate and group by in JDBCScan ### What changes were proposed in this pull request? Currently, the empty pushed aggregate and pushed group by are logged in Explain for JDBCScan ``` Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$172e75786 [NAME#1,SALARY#2] PushedAggregates: [], PushedFilters: [IsNotNull(SALARY), GreaterThan(SALARY,100.00)], PushedGroupby: [], ReadSchema: struct<NAME:string,SALARY:decimal(20,2)> ``` After the fix, the JDBCSScan will be ``` Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$172e75786 [NAME#1,SALARY#2] PushedFilters: [IsNotNull(SALARY), GreaterThan(SALARY,100.00)], ReadSchema: struct<NAME:string,SALARY:decimal(20,2)> ``` ### Why are the changes needed? address this comment https://github.com/apache/spark/pull/34451#discussion_r740220800 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #34540 from huaxingao/aggExplain. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37483][SQL] Support push down top N to JDBC data source V2 ### What changes were proposed in this pull request? Currently, Spark supports push down limit to data source. However, in the user's scenario, limit must have the premise of order by. Because limit and order by are more valuable together. On the other hand, push down top N(same as order by ... limit N) outputs the data with basic order to Spark sort, the the sort of Spark may have some performance improvement. ### Why are the changes needed? 1. push down top N is very useful for users scenario. 2. push down top N could improves the performance of sort. ### Does this PR introduce _any_ user-facing change? 'No'. Just change the physical execute. ### How was this patch tested? New tests. Closes #34918 from beliefer/SPARK-37483. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37644][SQL] Support datasource v2 complete aggregate pushdown ### What changes were proposed in this pull request? Currently , Spark supports push down aggregate with partial-agg and final-agg . For some data source (e.g. JDBC ) , we can avoid partial-agg and final-agg by running completely on database. ### Why are the changes needed? Improve performance for aggregate pushdown. ### Does this PR introduce _any_ user-facing change? 'No'. Just change the inner implement. ### How was this patch tested? New tests. Closes #34904 from beliefer/SPARK-37644. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37627][SQL] Add sorted column in BucketTransform ### What changes were proposed in this pull request? In V1, we can create table with sorted bucket like the following: ``` sql("CREATE TABLE tbl(a INT, b INT) USING parquet " + "CLUSTERED BY (a) SORTED BY (b) INTO 5 BUCKETS") ``` However, creating table with sorted bucket in V2 failed with Exception `org.apache.spark.sql.AnalysisException: Cannot convert bucketing with sort columns to a transform.` ### Why are the changes needed? This PR adds sorted column in BucketTransform so we can create table in V2 with sorted bucket ### Does this PR introduce _any_ user-facing change? 
No ### How was this patch tested? new UT Closes #34879 from huaxingao/sortedBucket. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37789][SQL] Add a class to represent general aggregate functions in DS V2 ### What changes were proposed in this pull request? There are a lot of aggregate functions in SQL and it's a lot of work to add them one by one in the DS v2 API. This PR proposes to add a new `GeneralAggregateFunc` class to represent all the general SQL aggregate functions. Since it's general, Spark doesn't know its aggregation buffer and can only push down the aggregation to the source completely. As an example, this PR also translates `AVG` to `GeneralAggregateFunc` and pushes it to JDBC V2. ### Why are the changes needed? To add aggregate functions in DS v2 easier. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? JDBC v2 test Closes #35070 from cloud-fan/agg. Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37644][SQL][FOLLOWUP] When partition column is same as group by key, pushing down aggregate completely ### What changes were proposed in this pull request? When JDBC option specifying the "partitionColumn" and it's the same as group by key, the aggregate push-down should be completely. ### Why are the changes needed? Improve the datasource v2 complete aggregate pushdown. ### Does this PR introduce _any_ user-facing change? 'No'. Just change the inner implement. ### How was this patch tested? New tests. Closes #35052 from beliefer/SPARK-37644-followup. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37527][SQL] Translate more standard aggregate functions for pushdown ### What changes were proposed in this pull request? Currently, Spark aggregate pushdown will translate some standard aggregate functions, so that compile these functions to adapt specify database. After this job, users could override `JdbcDialect.compileAggregate` to implement some standard aggregate functions supported by some database. This PR just translate the ANSI standard aggregate functions. 
The mainstream database supports these functions show below: | Name | ClickHouse | Presto | Teradata | Snowflake | Oracle | Postgresql | Vertica | MySQL | RedShift | ElasticSearch | Impala | Druid | SyBase | DB2 | H2 | Exasol | Mariadb | Phoenix | Yellowbrick | Singlestore | Influxdata | Dolphindb | Intersystems | |-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------| | `VAR_POP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes | | `VAR_SAMP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes | | `STDDEV_POP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | | `STDDEV_SAMP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | | `COVAR_POP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | No | Yes | Yes | No | No | No | No | Yes | Yes | No | | `COVAR_SAMP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | No | Yes | Yes | No | No | No | No | No | No | No | | `CORR` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | No | Yes | Yes | No | No | No | No | No | Yes | No | Because some aggregate functions will be converted by Optimizer show below, this PR no need to match them. |Input|Parsed|Optimized| |------|--------------------|----------| |`Every`| `aggregate.BoolAnd` |`Min`| |`Any`| `aggregate.BoolOr` |`Max`| |`Some`| `aggregate.BoolOr` |`Max`| ### Why are the changes needed? Make the implement of `*Dialect` could extends the aggregate functions by override `JdbcDialect.compileAggregate`. ### Does this PR introduce _any_ user-facing change? Yes. Users could pushdown more aggregate functions. ### How was this patch tested? Exists tests. Closes #35101 from beliefer/SPARK-37527-new2. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Huaxin Gao <huaxin_gao@apple.com> * [SPARK-37734][SQL][TESTS] Upgrade h2 from 1.4.195 to 2.0.204 ### What changes were proposed in this pull request? This PR aims to upgrade `com.h2database` from 1.4.195 to 2.0.202 ### Why are the changes needed? Fix one vulnerability, ref: https://www.tenable.com/cve/CVE-2021-23463 ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Jenkins test. Closes #35013 from beliefer/SPARK-37734. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37527][SQL] Compile `COVAR_POP`, `COVAR_SAMP` and `CORR` in `H2Dialet` ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/35101 translate `COVAR_POP`, `COVAR_SAMP` and `CORR`, but the H2 lower version cannot support them. After https://github.com/apache/spark/pull/35013, we can compile the three aggregate functions in `H2Dialet` now. ### Why are the changes needed? Supplement the implement of `H2Dialet`. ### Does this PR introduce _any_ user-facing change? 'Yes'. Spark could complete push-down `COVAR_POP`, `COVAR_SAMP` and `CORR` into H2. ### How was this patch tested? Test updated. Closes #35145 from beliefer/SPARK-37527_followup. 
Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37839][SQL] DS V2 supports partial aggregate push-down `AVG` ### What changes were proposed in this pull request? `max`,`min`,`count`,`sum`,`avg` are the most commonly used aggregation functions. Currently, DS V2 supports complete aggregate push-down of `avg`. But, supports partial aggregate push-down of `avg` is very useful. The aggregate push-down algorithm is: 1. Spark translates group expressions of `Aggregate` to DS V2 `Aggregation`. 2. Spark calls `supportCompletePushDown` to check if it can completely push down aggregate. 3. If `supportCompletePushDown` returns true, we preserves the aggregate expressions as final aggregate expressions. Otherwise, we split `AVG` into 2 functions: `SUM` and `COUNT`. 4. Spark translates final aggregate expressions and group expressions of `Aggregate` to DS V2 `Aggregation` again, and pushes the `Aggregation` to JDBC source. 5. Spark constructs the final aggregate. ### Why are the changes needed? DS V2 supports partial aggregate push-down `AVG` ### Does this PR introduce _any_ user-facing change? 'Yes'. DS V2 could partial aggregate push-down `AVG` ### How was this patch tested? New tests. Closes #35130 from beliefer/SPARK-37839. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-36526][SQL] DSV2 Index Support: Add supportsIndex interface ### What changes were proposed in this pull request? Indexes are database objects created on one or more columns of a table. Indexes are used to improve query performance. A detailed explanation of database index is here https://en.wikipedia.org/wiki/Database_index This PR adds `supportsIndex` interface that provides APIs to work with indexes. ### Why are the changes needed? Many data sources support index to improvement query performance. In order to take advantage of the index support in data source, this `supportsIndex` interface is added to let user to create/drop an index, list indexes, etc. ### Does this PR introduce _any_ user-facing change? yes, the following new APIs are added: - createIndex - dropIndex - indexExists - listIndexes New SQL syntax: ``` CREATE [index_type] INDEX [index_name] ON [TABLE] table_name (column_index_property_list)[OPTIONS indexPropertyList] column_index_property_list: column_name [OPTIONS(indexPropertyList)] [ , . . . ] indexPropertyList: index_property_name = index_property_value [ , . . . ] DROP INDEX index_name ``` ### How was this patch tested? only interface is added for now. Tests will be added when doing the implementation Closes #33754 from huaxingao/index_interface. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-36913][SQL] Implement createIndex and IndexExists in DS V2 JDBC (MySQL dialect) ### What changes were proposed in this pull request? Implementing `createIndex`/`IndexExists` in DS V2 JDBC ### Why are the changes needed? This is a subtask of the V2 Index support. I am implementing index support for DS V2 JDBC so we can have a POC and an end to end testing. This PR implements `createIndex` and `IndexExists`. Next PR will implement `listIndexes` and `dropIndex`. I intentionally make the PR small so it's easier to review. Index is not supported by h2 database and create/drop index are not standard SQL syntax. This PR only implements `createIndex` and `IndexExists` in `MySQL` dialect. ### Does this PR introduce _any_ user-facing change? 
Yes, `createIndex`/`IndexExist` in DS V2 JDBC ### How was this patch tested? new test Closes #34164 from huaxingao/createIndexJDBC. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> * [SPARK-36914][SQL] Implement dropIndex and listIndexes in JDBC (MySQL dialect) ### What changes were proposed in this pull request? This PR implements `dropIndex` and `listIndexes` in MySQL dialect ### Why are the changes needed? As a subtask of the V2 Index support, this PR completes the implementation for JDBC V2 index support. ### Does this PR introduce _any_ user-facing change? Yes, `dropIndex/listIndexes` in DS V2 JDBC ### How was this patch tested? new tests Closes #34236 from huaxingao/listIndexJDBC. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37343][SQL] Implement createIndex, IndexExists and dropIndex in JDBC (Postgres dialect) ### What changes were proposed in this pull request? Implementing `createIndex`/`IndexExists`/`dropIndex` in DS V2 JDBC for Postgres dialect. ### Why are the changes needed? This is a subtask of the V2 Index support. This PR implements `createIndex`, `IndexExists` and `dropIndex`. After review for some changes in this PR, I will create new PR for `listIndexs`, or add it in this PR. This PR only implements `createIndex`, `IndexExists` and `dropIndex` in Postgres dialect. ### Does this PR introduce _any_ user-facing change? Yes, `createIndex`/`IndexExists`/`dropIndex` in DS V2 JDBC ### How was this patch tested? New test. Closes #34673 from dchvn/Dsv2_index_postgres. Authored-by: dch nguyen <dgd_contributor@viettel.com.vn> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37867][SQL] Compile aggregate functions of build-in JDBC dialect ### What changes were proposed in this pull request? DS V2 translate a lot of standard aggregate functions. Currently, only H2Dialect compile these standard aggregate functions. This PR compile these standard aggregate functions for other build-in JDBC dialect. ### Why are the changes needed? Make build-in JDBC dialect support complete aggregate push-down. ### Does this PR introduce _any_ user-facing change? 'Yes'. Users could use complete aggregate push-down with build-in JDBC dialect. ### How was this patch tested? New tests. Closes #35166 from beliefer/SPARK-37867. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37929][SQL][FOLLOWUP] Support cascade mode for JDBC V2 ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/35246 support `cascade` mode for dropNamespace API. This PR followup https://github.com/apache/spark/pull/35246 to make JDBC V2 respect `cascade`. ### Why are the changes needed? Let JDBC V2 respect `cascade`. ### Does this PR introduce _any_ user-facing change? Yes. Users could manipulate `drop namespace` with `cascade` on JDBC V2. ### How was this patch tested? New tests. Closes #35271 from beliefer/SPARK-37929-followup. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-38035][SQL] Add docker tests for build-in JDBC dialect ### What changes were proposed in this pull request? Currently, Spark only have `PostgresNamespaceSuite` to test DS V2 namespace in docker environment. But missing tests for other build-in JDBC dialect (e.g. Oracle, MySQL). This PR also found some compatible issue. For example, the JDBC api `conn.getMetaData.getSchemas` works bad for MySQL. 
### Why are the changes needed? We need add tests for other build-in JDBC dialect. ### Does this PR introduce _any_ user-facing change? 'No'. Just add tests which face developers. ### How was this patch tested? New tests. Closes #35333 from beliefer/SPARK-38035. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-38054][SQL] Supports list namespaces in JDBC v2 MySQL dialect ### What changes were proposed in this pull request? Currently, `JDBCTableCatalog.scala` query namespaces show below. ``` val schemaBuilder = ArrayBuilder.make[Array[String]] val rs = conn.getMetaData.getSchemas() while (rs.next()) { schemaBuilder += Array(rs.getString(1)) } schemaBuilder.result ``` But the code cannot get any information when using MySQL JDBC driver. This PR uses `SHOW SCHEMAS` to query namespaces of MySQL. This PR also fix other issues below: - Release the docker tests in `MySQLNamespaceSuite.scala`. - Because MySQL doesn't support create comment of schema, let's throws `SQLFeatureNotSupportedException`. - Because MySQL doesn't support `DROP SCHEMA` in `RESTRICT` mode, let's throws `SQLFeatureNotSupportedException`. - Reactor `JdbcUtils.executeQuery` to avoid `java.sql.SQLException: Operation not allowed after ResultSet closed`. ### Why are the changes needed? MySQL dialect supports query namespaces. ### Does this PR introduce _any_ user-facing change? 'Yes'. Some API changed. ### How was this patch tested? New tests. Closes #35355 from beliefer/SPARK-38054. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-36351][SQL] Refactor filter push down in file source v2 ### What changes were proposed in this pull request? Currently in `V2ScanRelationPushDown`, we push the filters (partition filters + data filters) to file source, and then pass all the filters (partition filters + data filters) as post scan filters to v2 Scan, and later in `PruneFileSourcePartitions`, we separate partition filters and data filters, set them in the format of `Expression` to file source. Changes in this PR: When we push filters to file sources in `V2ScanRelationPushDown`, since we already have the information about partition column , we want to separate partition filter and data filter there. The benefit of doing this: - we can handle all the filter related work for v2 file source at one place instead of two (`V2ScanRelationPushDown` and `PruneFileSourcePartitions`), so the code will be cleaner and easier to maintain. - we actually have to separate partition filters and data filters at `V2ScanRelationPushDown`, otherwise, there is no way to find out which filters are partition filters, and we can't push down aggregate for parquet even if we only have partition filter. - By separating the filters early at `V2ScanRelationPushDown`, we only needs to check data filters to find out which one needs to be converted to data source filters (e.g. Parquet predicates, ORC predicates) and pushed down to file source, right now we are checking all the filters (both partition filters and data filters) - Similarly, we can only pass data filters as post scan filters to v2 Scan, because partition filters are used for partition pruning only, no need to pass them as post scan filters. In order to do this, we will have the following changes - add `pushFilters` in file source v2. In this method: - push both Expression partition filter and Expression data filter to file source. 
Have to use Expression filters because we need these for partition pruning. - data filters are used for filter push down. If file source needs to push down data filters, it translates the data filters from `Expression` to `Sources.Filer`, and then decides which filters to push down. - partition filters are used for partition pruning. - file source v2 no need to implement `SupportsPushdownFilters` any more, because when we separating the two types of filters, we have already set them on file data sources. It's redundant to use `SupportsPushdownFilters` to set the filters again on file data sources. ### Why are the changes needed? see section one ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #33650 from huaxingao/partition_filter. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> * [SPARK-36645][SQL] Aggregate (Min/Max/Count) push down for Parquet ### What changes were proposed in this pull request? Push down Min/Max/Count to Parquet with the following restrictions: - nested types such as Array, Map or Struct will not be pushed down - Timestamp not pushed down because INT96 sort order is undefined, Parquet doesn't return statistics for INT96 - If the aggregate column is on partition column, only Count will be pushed, Min or Max will not be pushed down because Parquet doesn't return max/min for partition column. - If somehow the file doesn't have stats for the aggregate columns, Spark will throw Exception. - Currently, if filter/GROUP BY is involved, Min/Max/Count will not be pushed down, but the restriction will be lifted if the filter or GROUP BY is on partition column (https://issues.apache.org/jira/browse/SPARK-36646 and https://issues.apache.org/jira/browse/SPARK-36647) ### Why are the changes needed? Since parquet has the statistics information for min, max and count, we want to take advantage of this info and push down Min/Max/Count to parquet layer for better performance. ### Does this PR introduce _any_ user-facing change? Yes, `SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED` was added. If sets to true, we will push down Min/Max/Count to Parquet. ### How was this patch tested? new test suites Closes #33639 from huaxingao/parquet_agg. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> * [SPARK-34960][SQL] Aggregate push down for ORC ### What changes were proposed in this pull request? This PR is to add aggregate push down feature for ORC data source v2 reader. At a high level, the PR does: * The supported aggregate expression is MIN/MAX/COUNT same as [Parquet aggregate push down](https://github.com/apache/spark/pull/33639). * BooleanType, ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DateType are allowed in MIN/MAXX aggregate push down. All other columns types are not allowed in MIN/MAX aggregate push down. * All columns types are supported in COUNT aggregate push down. * Nested column's sub-fields are disallowed in aggregate push down. * If the file does not have valid statistics, Spark will throw exception and fail query. * If aggregate has filter or group-by column, aggregate will not be pushed down. At code level, the PR does: * `OrcScanBuilder`: `pushAggregation()` checks whether the aggregation can be pushed down. The most checking logic is shared between Parquet and ORC, extracted into `AggregatePushDownUtils.getSchemaForPushedAggregation()`. 
`OrcScanBuilder` will create a `OrcScan` with aggregation and aggregation data schema. * `OrcScan`: `createReaderFactory` creates a ORC reader factory with aggregation and schema. Similar change with `ParquetScan`. * `OrcPartitionReaderFactory`: `buildReaderWithAggregates` creates a ORC reader with aggregate push down (i.e. read ORC file footer to process columns statistics, instead of reading actual data in the file). `buildColumnarReaderWithAggregates` creates a columnar ORC reader similarly. Both delegate the real work to read footer in `OrcUtils.createAggInternalRowFromFooter`. * `OrcUtils.createAggInternalRowFromFooter`: reads ORC file footer to process columns statistics (real heavy lift happens here). Similar to `ParquetUtils.createAggInternalRowFromFooter`. Leverage utility method such as `OrcFooterReader.readStatistics`. * `OrcFooterReader`: `readStatistics` reads the ORC `ColumnStatistics[]` into Spark `OrcColumnStatistics`. The transformation is needed here, because ORC `ColumnStatistics[]` stores all columns statistics in a flatten array style, and hard to process. Spark `OrcColumnStatistics` stores the statistics in nested tree structure (e.g. like `StructType`). This is used by `OrcUtils.createAggInternalRowFromFooter` * `OrcColumnStatistics`: the easy-to-manipulate structure for ORC `ColumnStatistics`. This is used by `OrcFooterReader.readStatistics`. ### Why are the changes needed? To improve the performance of query with aggregate. ### Does this PR introduce _any_ user-facing change? Yes. A user-facing config `spark.sql.orc.aggregatePushdown` is added to control enabling/disabling the aggregate push down for ORC. By default the feature is disabled. ### How was this patch tested? Added unit test in `FileSourceAggregatePushDownSuite.scala`. Refactored all unit tests in https://github.com/apache/spark/pull/33639, and it now works for both Parquet and ORC. Closes #34298 from c21/orc-agg. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> * [SPARK-37960][SQL] A new framework to represent catalyst expressions in DS v2 APIs ### What changes were proposed in this pull request? This PR provides a new framework to represent catalyst expressions in DS v2 APIs. `GeneralSQLExpression` is a general SQL expression to represent catalyst expression in DS v2 API. `ExpressionSQLBuilder` is a builder to generate `GeneralSQLExpression` from catalyst expressions. `CASE ... WHEN ... ELSE ... END` is just the first use case. This PR also supports aggregate push down with `CASE ... WHEN ... ELSE ... END`. ### Why are the changes needed? Support aggregate push down with `CASE ... WHEN ... ELSE ... END`. ### Does this PR introduce _any_ user-facing change? Yes. Users could use `CASE ... WHEN ... ELSE ... END` with aggregate push down. ### How was this patch tested? New tests. Closes #35248 from beliefer/SPARK-37960. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37867][SQL][FOLLOWUP] Compile aggregate functions for build-in DB2 dialect ### What changes were proposed in this pull request? This PR follows up https://github.com/apache/spark/pull/35166. The previously referenced DB2 documentation is incorrect, resulting in the lack of compile that supports some aggregate functions. The correct documentation is https://www.ibm.com/docs/en/db2/11.5?topic=af-regression-functions-regr-avgx-regr-avgy-regr-count ### Why are the changes needed? 
Make build-in DB2 dialect support complete aggregate push-down more aggregate functions. ### Does this PR introduce _any_ user-facing change? 'Yes'. Users could use complete aggregate push-down with build-in DB2 dialect. ### How was this patch tested? New tests. Closes #35520 from beliefer/SPARK-37867_followup. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-36568][SQL] Better FileScan statistics estimation ### What changes were proposed in this pull request? This PR modifies `FileScan.estimateStatistics()` to take the read schema into account. ### Why are the changes needed? `V2ScanRelationPushDown` can column prune `DataSourceV2ScanRelation`s and change read schema of `Scan` operations. The better statistics returned by `FileScan.estimateStatistics()` can mean better query plans. For example, with this change the broadcast issue in SPARK-36568 can be avoided. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new UT. Closes #33825 from peter-toth/SPARK-36568-scan-statistics-estimation. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37929][SQL] Support cascade mode for `dropNamespace` API ### What changes were proposed in this pull request? This PR adds a new API `dropNamespace(String[] ns, boolean cascade)` to replace the existing one: Add a boolean parameter `cascade` that supports deleting all the Namespaces and Tables under the namespace. Also include changing the implementations and tests that are relevant to this API. ### Why are the changes needed? According to [#cmt](https://github.com/apache/spark/pull/35202#discussion_r784463563), the current `dropNamespace` API doesn't support cascade mode. So this PR replaces that to support cascading. If cascade is set True, delete all namespaces and tables under the namespace. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test. Closes #35246 from dchvn/change_dropnamespace_api. Authored-by: dch nguyen <dchvn.dgd@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * code format * [SPARK-38196][SQL] Refactor framework so as JDBC dialect could compile expression by self way ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/35248 provides a new framework to represent catalyst expressions in DS V2 APIs. Because the framework translate all catalyst expressions to a unified SQL string and cannot keep compatibility between different JDBC database, the framework works not good. This PR reactor the framework so as JDBC dialect could compile expression by self way. First, The framework translate catalyst expressions to DS V2 expression. Second, The JDBC dialect could compile DS V2 expression to different SQL syntax. The java doc looks show below: ![image](https://user-images.githubusercontent.com/8486025/156579584-f56cafb5-641f-4c5b-a06e-38f4369051c3.png) ### Why are the changes needed? Make the framework be more common use. ### Does this PR introduce _any_ user-facing change? 'No'. The feature is not released. ### How was this patch tested? Exists tests. Closes #35494 from beliefer/SPARK-37960_followup. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-38361][SQL] Add factory method `getConnection` into `JDBCDialect` ### What changes were proposed in this pull request? 
At present, the parameter list of the factory method for obtaining a JDBC connection is empty, because the JDBC URL of some databases is fixed and unique. However, for databases such as ClickHouse, the connection depends on the shard node, so the parameter form `getConnection: Partition => Connection` is more general. This PR adds the factory method `getConnection` to `JDBCDialect` according to https://github.com/apache/spark/pull/35696#issuecomment-1058060107.

### Why are the changes needed?
Make the factory method `getConnection` more general.

### Does this PR introduce _any_ user-facing change?
'No'. Just an internal change.

### How was this patch tested?
Existing tests.

Closes #35727 from beliefer/SPARK-38361_new.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* code format

* [SPARK-38560][SQL] If `Sum`, `Count`, `Any` accompany with distinct, cannot do partial agg push down

### What changes were proposed in this pull request?
Spark could partially push down sum(distinct col) and count(distinct col) when the data source has multiple partitions, and Spark would then sum the per-partition values again, so the result may be incorrect. (A small illustrative sketch of why this matters appears after this commit list.)

### Why are the changes needed?
Fix the bug where pushing down sum(distinct col) and count(distinct col) to the data source returns incorrect results.

### Does this PR introduce _any_ user-facing change?
'Yes'. Users will see the correct behavior.

### How was this patch tested?
New tests.

Closes #35873 from beliefer/SPARK-38560.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-36718][SQL] Only collapse projects if we don't duplicate expensive expressions

### What changes were proposed in this pull request?
The `CollapseProject` rule can combine adjacent projects and merge the project lists. The key idea behind this rule is that the evaluation of a project is relatively expensive, while expression evaluation is cheap, so the expression duplication caused by this rule is not a problem. This last assumption is, unfortunately, not always true:
- A user can invoke some expensive UDF, which now gets invoked more often than originally intended.
- A projection is very cheap in whole-stage code generation. The duplication caused by `CollapseProject` does more harm than good here.

This PR addresses this problem by only collapsing projects when doing so does not duplicate expensive expressions. In practice this means an input reference may only be consumed once, or its evaluation must not incur significant overhead (currently attributes, nested column access, aliases & literals fall in this category).

### Why are the changes needed?
We have seen multiple complaints about `CollapseProject` in the past, because it may duplicate expensive expressions. The most recent one is https://github.com/apache/spark/pull/33903 .

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
A new UT and existing tests.

Closes #33958 from cloud-fan/collapse.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38432][SQL] Refactor framework so as JDBC dialect could compile filter by self way

### What changes were proposed in this pull request?
Currently, Spark DS V2 can push down filters into a JDBC source. However, only the most basic form of filter is supported, and a JDBC source cannot compile the filters in its own way. This PR refactors the framework so that a JDBC dialect can compile expressions its own way (a minimal sketch of this idea appears after this commit list): first, the framework translates Catalyst expressions to DS V2 filters; second, the JDBC dialect can compile DS V2 filters into different SQL syntax.

### Why are the changes needed?
Make the framework more general.

### Does this PR introduce _any_ user-facing change?
'No'. The feature is not released yet.

### How was this patch tested?
Existing tests.

Closes #35768 from beliefer/SPARK-38432_new.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38432][SQL][FOLLOWUP] Supplement test case for overflow and add comments

### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/35768 and improves the code:
1. Supplement a test case for overflow.
2. Do not throw IllegalArgumentException.
3. Improve V2ExpressionSQLBuilder.
4. Add comments in V2ExpressionBuilder.

### Why are the changes needed?
Supplement a test case for overflow and add comments.

### Does this PR introduce _any_ user-facing change?
'No'. V2 aggregate pushdown is not released yet.

### How was this patch tested?
New tests.

Closes #35933 from beliefer/SPARK-38432_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38533][SQL] DS V2 aggregate push-down supports project with alias

### What changes were proposed in this pull request?
Currently, Spark DS V2 aggregate push-down doesn't support project with alias. Refer to https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala#L96. This PR makes it work with aliases.

**The first example:** the original plan is shown below:
```
Aggregate [DEPT#0], [DEPT#0, sum(mySalary#8) AS total#14]
+- Project [DEPT#0, SALARY#2 AS mySalary#8]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession77978658,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions5f8da82)
```
If we can completely push down the aggregate, the plan will be:
```
Project [DEPT#0, SUM(SALARY)#18 AS sum(SALARY#2)#13 AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```
If we can partially push down the aggregate, the plan will be:
```
Aggregate [DEPT#0], [DEPT#0, sum(cast(SUM(SALARY)#18 as decimal(20,2))) AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```

**The second example:** the original plan is shown below:
```
Aggregate [myDept#33], [myDept#33, sum(mySalary#34) AS total#40]
+- Project [DEPT#25 AS myDept#33, SALARY#27 AS mySalary#34]
   +- ScanBuilderHolder [DEPT#25, NAME#26, SALARY#27, BONUS#28], RelationV2[DEPT#25, NAME#26, SALARY#27, BONUS#28] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession25c4f621,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions345d641e)
```
If we can completely push down the aggregate, the plan will be:
```
Project [DEPT#25 AS myDept#33, SUM(SALARY)#44 AS sum(SALARY#27)#39 AS total#40]
+- RelationV2[DEPT#25, SUM(SALARY)#44] test.employee
```
If we can partially push down the aggregate, the plan will be:
```
Aggregate [myDept#33], [DEPT#25 AS myDept#33, sum(cast(SUM(SALARY)#56 as decimal(20,2))) AS total#52]
+- RelationV2[DEPT#25, SUM(SALARY)#56] test.employee
```

### Why are the changes needed?
Aliases are commonly used.

### Does this PR introduce _any_ user-facing change?
'Yes'. Users can see that DS V2 aggregate push-down supports project with alias.

### How was this patch tested?
New tests.

Closes #35932 from beliefer/SPARK-38533_new.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* code format

* [SPARK-37483][SQL][FOLLOWUP] Rename `pushedTopN` to `PushedTopN` and improve JDBCV2Suite

### What changes were proposed in this pull request?
This PR fixes three issues. **First**, create methods `checkPushedInfo` and `checkSortRemoved` to reuse code. **Second**, remove the method `checkPushedLimit`, because `checkPushedInfo` can cover it. **Third**, rename `pushedTopN` to `PushedTopN`, to be consistent with other pushed information.

### Why are the changes needed?
Reuse code and make the pushed information more correct.

### Does this PR introduce _any_ user-facing change?
'No'. New feature and test improvements.

### How was this patch tested?
Adjusted existing tests.

Closes #35921 from beliefer/SPARK-37483_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38644][SQL] DS V2 topN push-down supports project with alias

### What changes were proposed in this pull request?
Currently, Spark DS V2 topN push-down doesn't support project with alias. This PR makes it work with aliases.

**Example**: the original plan is shown below:
```
Sort [mySalary#10 ASC NULLS FIRST], true
+- Project [NAME#1, SALARY#2 AS mySalary#10]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession7fd4b9ec,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true),StructField(IS_MANAGER,BooleanType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions3c8e4a82)
```
The `pushedLimit` and `sortOrders` of `JDBCScanBuilder` are empty. If we can push down the top n, the plan will be:
```
Project [NAME#1, SALARY#2 AS mySalary#10]
+- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession7fd4b9ec,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true),StructField(IS_MANAGER,BooleanType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions3c8e4a82)
```
The `pushedLimit` of `JDBCScanBuilder` will be `1` and the `sortOrders` of `JDBCScanBuilder` will be `SALARY ASC NULLS FIRST`.

### Why are the changes needed?
Aliases are commonly used.

### Does this PR introduce _any_ user-facing change?
'Yes'. Users can see that DS V2 topN push-down supports project with alias.

### How was this patch tested?
New tests.

Closes #35961 from beliefer/SPARK-38644.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38391][SQL] Datasource v2 supports partial topN push-down

### What changes were proposed in this pull request?
Currently, Spark only pushes down topN when it can be pushed down completely. But for some data sources (e.g. JDBC) that have multiple partitions, we should preserve partial topN push-down.

### Why are the changes needed?
Make the behavior of sort pushdown correct.

### Does this PR introduce _any_ user-facing change?
'No'. Just changes the internal implementation.

### How was this patch tested?
New tests.

Closes #35710 from beliefer/SPARK-38391.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38633][SQL] Support push down Cast to JDBC data source V2

### What changes were proposed in this pull request?
Cast is very useful, and Spark always uses Cast to convert data types automatically.

### Why are the changes needed?
Let more aggregates and filters be pushed down.

### Does this PR introduce _any_ user-facing change?
'Yes'. This PR lands after the 3.3.0 cut-off.

### How was this patch tested?
New tests.

Closes #35947 from beliefer/SPARK-38633.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38432][SQL][FOLLOWUP] Add test case for push down filter with alias

### What changes were proposed in this pull request?
DS V2 pushdown of predicates to the data source supports columns with aliases, but Spark was missing a test case for pushing down a filter with an alias.

### Why are the changes needed?
Add a test case for pushing down a filter with an alias.

### Does this PR introduce _any_ user-facing change?
'No'. Just adds a test case.

### How was this patch tested?
New tests.

Closes #35988 from beliefer/SPARK-38432_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38633][SQL][FOLLOWUP] JDBCSQLBuilder should build cast to type of databases

### What changes were proposed in this pull request?
DS V2 supports pushing down CAST to the database. The current implementation only uses the typeName of the DataType. For example, `Cast(column, StringType)` is built as `CAST(column AS String)`, but it should be `CAST(column AS TEXT)` for Postgres or `CAST(column AS VARCHAR2(255))` for Oracle.

### Why are the changes needed?
Improve the implementation of CAST push-down.

### Does this PR introduce _any_ user-facing change?
'No'. Just a new feature.

### How was this patch tested?
Existing tests.

Closes #35999 from beliefer/SPARK-38633_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37839][SQL][FOLLOWUP] Check overflow when DS V2 partial aggregate push-down `AVG`

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/35130 supports partial aggregate push-down of `AVG` for DS V2. The behavior is not consistent with `Average` when an overflow occurs in ANSI mode. This PR closely follows the implementation of `Average` to respect overflow in ANSI mode.

### Why are the changes needed?
Make the behavior consistent with `Average` when an overflow occurs in ANSI mode.

### Does this PR introduce _any_ user-facing change?
'Yes'. Users will see the overflow exception thrown in ANSI mode.

### How was this patch tested?
New tests.

Closes #35320 from beliefer/SPARK-37839_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37960][SQL][FOLLOWUP] Make the testing CASE WHEN query more reasonable

### What changes were proposed in this pull request?
Some testing CASE WHEN queries are not carefully written and do not make sense. In the future, the optimizer may get smarter and get rid of the CASE WHEN completely, and then we lose test coverage. This PR updates some CASE WHEN queries to make them more reasonable.

### Why are the changes needed?
Future-proof test coverage.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
N/A

Closes #36032 from beliefer/SPARK-37960_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38761][SQL] DS V2 supports push down misc non-aggregate functions

### What changes were proposed in this pull request?
Currently, Spark has some misc non-aggregate functions of the ANSI standard; please refer to https://github.com/apache/spark/blob/2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L362. These functions are: `abs`, `coalesce`, `nullif`, `CASE WHEN`. DS V2 should support pushing down these misc non-aggregate functions. Because DS V2 already supports pushing down `CASE WHEN`, this PR does not need to do that work again; and because `nullif` extends `RuntimeReplaceable`, this PR does not need to handle it either.

### Why are the changes needed?
DS V2 supports pushing down misc non-aggregate functions.

### Does this PR introduce _any_ user-facing change?
'No'. New feature.

### How was this patch tested?
New tests.

Closes #36039 from beliefer/SPARK-38761.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38865][SQL][DOCS] Update document of JDBC options for `pushDownAggregate` and `pushDownLimit`

### What changes were proposed in this pull request?
Because the DS V2 pushdown framework was refactored, we need to add more documentation in `sql-data-sources-jdbc.md` to reflect the new changes.

### Why are the changes needed?
Add docs for the new changes to `pushDownAggregate` and `pushDownLimit`.

### Does this PR introduce _any_ user-facing change?
'No'. Updated for a new feature.

### How was this patch tested?
N/A

Closes #36152 from beliefer/SPARK-38865.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: huaxingao <huaxin_gao@apple.com>

* [SPARK-38855][SQL] DS V2 supports push down math functions

### What changes were proposed in this pull request?
Currently, Spark has some math functions of the ANSI standard; please refer to https://github.com/apache/spark/blob/2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L388. These functions are: `LN`, `EXP`, `POWER`, `SQRT`, `FLOOR`, `CEIL`, `WIDTH_BUCKET`. The mainstream databases' support for these functions is shown below.

| Function | PostgreSQL | ClickHouse | H2 | MySQL | Oracle | Redshift | Presto | Teradata | Snowflake | DB2 | Vertica | Exasol | SqlServer | Yellowbrick | Impala | Mariadb | Druid | Pig | SQLite | Influxdata | Singlestore | ElasticSearch |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| `LN` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `EXP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `POWER` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes |
| `SQRT` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `FLOOR` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `CEIL` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `WIDTH_BUCKET` | Yes | No | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes | No | No | No | Yes | No | No | No | No | No | No | No |

DS V2 should support pushing down these math functions.

### Why are the changes needed?
DS V2 supports pushing down math functions.

### Does this PR introduce _any_ user-facing change?
'No'. New feature.

### How was this patch tested?
New tests.

Closes #36140 from beliefer/SPARK-38855.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* update spark version to r61

Co-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: dch nguyen <dgd_contributor@viettel.com.vn>
Co-authored-by: Cheng Su <chengsu@fb.com>
Co-authored-by: Peter Toth <peter.toth@gmail.com>
Co-authored-by: dch nguyen <dchvn.dgd@gmail.com>
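To make the "dialects compile expressions their own way" idea from the SPARK-38432 and SPARK-38633 entries above concrete, here is a minimal, self-contained sketch. The expression types and builder classes are hypothetical stand-ins, not Spark's actual `V2ExpressionSQLBuilder` API: a base builder emits SQL-standard-ish text, and a dialect overrides only the pieces whose syntax differs.

```scala
// Hypothetical mini-model of dialect-specific expression compilation;
// these types are illustrative, not Spark's V2ExpressionSQLBuilder API.
sealed trait Expr
case class Col(name: String) extends Expr
case class Cast(child: Expr, dataType: String) extends Expr

class SqlBuilder {
  def build(e: Expr): String = e match {
    case Col(n)     => n
    case Cast(c, t) => s"CAST(${build(c)} AS ${dialectType(t)})"
  }
  // Default: use the logical type name directly.
  protected def dialectType(t: String): String = t.toUpperCase
}

// A dialect overrides only what it spells differently, e.g. mapping a
// string type to TEXT, as the SPARK-38633 follow-up describes for Postgres.
class PostgresLikeBuilder extends SqlBuilder {
  override protected def dialectType(t: String): String =
    if (t.equalsIgnoreCase("string")) "TEXT" else super.dialectType(t)
}

object DialectCompileSketch extends App {
  val expr = Cast(Col("name"), "string")
  println(new SqlBuilder().build(expr))          // CAST(name AS STRING)
  println(new PostgresLikeBuilder().build(expr)) // CAST(name AS TEXT)
}
```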
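The SPARK-38560 and SPARK-37839 entries above both come down to how partially pushed-down aggregates from multiple partitions are recombined. The sketch below is purely illustrative (plain Scala collections standing in for JDBC partitions, not Spark code): per-partition SUM/COUNT results can safely be merged again, while summing per-partition COUNT(DISTINCT) results over-counts values that appear in more than one partition.

```scala
// Illustrative sketch only: why partial aggregate push-down is safe for
// SUM/COUNT/AVG but not for COUNT(DISTINCT ...).
object PartialAggSketch {
  // Two "partitions" of the same logical table, as a JDBC source with
  // multiple partitions would return them.
  val partition1 = Seq(10.0, 20.0, 20.0)
  val partition2 = Seq(20.0, 30.0)

  def main(args: Array[String]): Unit = {
    // SUM: per-partition sums can simply be summed again on the Spark side.
    val pushedSums = Seq(partition1.sum, partition2.sum)
    assert(pushedSums.sum == (partition1 ++ partition2).sum)

    // AVG: pushed down as SUM and COUNT, then recombined.
    val total = pushedSums.sum
    val count = partition1.size + partition2.size
    assert(total / count == (partition1 ++ partition2).sum / 5)

    // COUNT(DISTINCT): summing per-partition distinct counts over-counts
    // the value 20.0, which appears in both partitions.
    val naive = partition1.distinct.size + partition2.distinct.size  // 2 + 2 = 4
    val correct = (partition1 ++ partition2).distinct.size           // {10, 20, 30} = 3
    println(s"naive=$naive correct=$correct")                        // naive=4 correct=3
  }
}
```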
What changes were proposed in this pull request?
Currently, the method `compileAggregates` is a member of `JDBCRDD`. But this is not reasonable, because the JDBC source knows best how to compile aggregate expressions into its own dialect.

Why are the changes needed?
The JDBC source knows how to compile aggregate expressions into its own dialect, so the dialect is the natural place for this logic.
After this PR, we can extend pushdown (e.g. of aggregates) per dialect, handling the differences between JDBC databases.
There are two situations:
First, databases A and B may implement different subsets of the SQL-standard aggregate functions.
Second, some databases implement aggregate functions or syntax beyond the SQL standard.
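As a rough illustration of what moving aggregate compilation into the dialect enables, here is a self-contained sketch. The trait and method below are simplified stand-ins, not the exact `JdbcDialect` signature: a default compilation handles the SQL-standard aggregates, and a specific dialect can override it for functions it spells differently.

```scala
// Simplified stand-in for the idea in this PR; not the exact JdbcDialect API.
case class Agg(funcName: String, column: String, isDistinct: Boolean = false)

trait DialectSketch {
  // Returns None for an aggregate the dialect cannot push down.
  def compileAggregate(agg: Agg): Option[String] = agg.funcName match {
    case "MIN" | "MAX" | "SUM" | "COUNT" | "AVG" =>
      val distinct = if (agg.isDistinct) "DISTINCT " else ""
      Some(s"${agg.funcName}($distinct${agg.column})")
    case _ => None
  }
}

object StandardDialect extends DialectSketch

// A dialect that also pushes down a non-standard aggregate, e.g. a
// hypothetical database exposing sample variance as VARIANCE(...).
object CustomDialect extends DialectSketch {
  override def compileAggregate(agg: Agg): Option[String] = agg.funcName match {
    case "VAR_SAMP" => Some(s"VARIANCE(${agg.column})")
    case _          => super.compileAggregate(agg)
  }
}

object CompileAggregatesSketch extends App {
  println(StandardDialect.compileAggregate(Agg("SUM", "SALARY")))      // Some(SUM(SALARY))
  println(StandardDialect.compileAggregate(Agg("VAR_SAMP", "SALARY"))) // None
  println(CustomDialect.compileAggregate(Agg("VAR_SAMP", "SALARY")))   // Some(VARIANCE(SALARY))
}
```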
Does this PR introduce any user-facing change?
'No'. Just changes the internal implementation.
How was this patch tested?
Jenkins tests.