
[SPARK-36351][SQL] Refactor filter push down in file source v2 #33650

Closed
wants to merge 22 commits into from

Conversation

huaxingao
Contributor

@huaxingao huaxingao commented Aug 5, 2021

What changes were proposed in this pull request?

Currently, V2ScanRelationPushDown pushes all the filters (partition filters + data filters) to the file source and then passes all of them (partition filters + data filters) as post-scan filters to the v2 Scan. Later, PruneFileSourcePartitions separates the partition filters from the data filters and sets them, as Expressions, on the file source.

Changes in this PR:
When we push filters to file sources in V2ScanRelationPushDown, we already know which columns are partition columns, so we separate the partition filters from the data filters right there.

The benefit of doing this:

  • we can handle all the filter-related work for v2 file sources in one place instead of two (V2ScanRelationPushDown and PruneFileSourcePartitions), so the code is cleaner and easier to maintain.
  • we actually have to separate partition filters and data filters in V2ScanRelationPushDown; otherwise there is no way to tell which filters are partition filters, and we can't push down aggregates for Parquet even when we only have partition filters.
  • by separating the filters early in V2ScanRelationPushDown, we only need to check the data filters to decide which ones should be converted to data source filters (e.g. Parquet predicates, ORC predicates) and pushed down to the file source; right now we check all the filters (both partition filters and data filters).
  • similarly, we only need to pass the data filters as post-scan filters to the v2 Scan; partition filters are used for partition pruning only, so there is no need to pass them as post-scan filters.

In order to do this, this PR makes the following changes:

  • add pushFilters in file source v2. In this method:
    • push both the Expression partition filters and the Expression data filters to the file source. We have to use Expression filters because partition pruning needs them.
    • data filters are used for filter push-down. If the file source needs to push down data filters, it translates them from Expression to sources.Filter and then decides which filters to push down.
    • partition filters are used for partition pruning.
  • file source v2 no longer needs to implement SupportsPushDownFilters, because once we separate the two types of filters we have already set them on the file data sources; it would be redundant to use SupportsPushDownFilters to set the filters again. A sketch of this separation is shown below.
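
A minimal sketch of that separation, assuming a hypothetical builder class (this is not the exact code added by the PR, just an illustration of the partitioning logic):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, SubqueryExpression}
import org.apache.spark.sql.types.StructType

// Hypothetical simplified builder; the real FileScanBuilder has more wiring.
class ExampleFileScanBuilder(partitionSchema: StructType) {
  // Kept as catalyst Expressions so they can drive partition pruning later.
  private var partitionFilters: Seq[Expression] = Seq.empty
  // Later translated to sources.Filter and offered to the file format for push-down.
  private var dataFilters: Seq[Expression] = Seq.empty

  // Returns the post-scan filters: only data filters still need Spark's evaluation.
  def pushFilters(filters: Seq[Expression]): Seq[Expression] = {
    val partitionColNames = partitionSchema.fieldNames.toSet
    val (partFilters, restFilters) = filters.partition { f =>
      // A partition filter references only partition columns and has no subquery.
      f.references.nonEmpty &&
        f.references.forall(r => partitionColNames.contains(r.name)) &&
        !SubqueryExpression.hasSubquery(f)
    }
    partitionFilters = partFilters
    dataFilters = restFilters
    dataFilters
  }
}
```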

Why are the changes needed?

see section one

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests

@github-actions github-actions bot added the SQL label Aug 5, 2021

SparkQA commented Aug 5, 2021

Test build #142077 has finished for PR 33650 at commit 4876db0.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 5, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46588/


SparkQA commented Aug 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46616/


SparkQA commented Aug 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46616/


SparkQA commented Aug 6, 2021

Test build #142104 has finished for PR 33650 at commit df5d9cc.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao huaxingao changed the title [SPARK-36351][SQL] Separate partition filters and data filters in PushDownUtils [SPARK-36351][SQL] Separate partition filters and data filters in pushFilters Aug 7, 2021
@github-actions github-actions bot added the AVRO label Aug 7, 2021

SparkQA commented Aug 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46700/


SparkQA commented Aug 7, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46700/


SparkQA commented Aug 7, 2021

Test build #142188 has finished for PR 33650 at commit 4d2c4e1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao huaxingao changed the title [SPARK-36351][SQL] Separate partition filters and data filters in pushFilters [SPARK-36351][SQL] Separate partition filters and data filters in PushDownUtils Aug 7, 2021

SparkQA commented Aug 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46702/


SparkQA commented Aug 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46702/


SparkQA commented Aug 7, 2021

Test build #142190 has finished for PR 33650 at commit 1d3a3ef.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46751/


SparkQA commented Aug 9, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46751/


SparkQA commented Aug 10, 2021

Test build #142244 has finished for PR 33650 at commit 9ca01b0.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

// A map from translated data source leaf node filters to original catalyst filter
// expressions. For an `And`/`Or` predicate, it is possible that the predicate is partially
// pushed down. This map can be used to construct a catalyst filter expression from the
// input filter, or a superset (partial push-down filter) of the input filter.
Contributor Author

This method returns the pushed-down sources.Filters and the post-scan filter Expressions. In the returned post-scan filter Expressions, we want the partition filters to have already been removed, so we don't need a second rule (PruneFileSourcePartitions) to prune them off.
We separate the two types of filters for FileScanBuilder and only pass the data filters to ScanBuilder.pushFilters. The separated partition filters are set on FileScanBuilder as Expressions and are used for partition pruning in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala#L138
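
For context, a rough sketch (simplified, not the exact FileScan code) of how the two sets of filters end up being used when the scan lists its input partitions; FileIndex.listFiles takes them separately:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.datasources.{FileIndex, PartitionDirectory}

// Partition filters prune whole partitions via the file index; data filters are
// handled separately (translated to source filters and/or re-evaluated by Spark).
def selectedPartitions(
    fileIndex: FileIndex,
    partitionFilters: Seq[Expression],
    dataFilters: Seq[Expression]): Seq[PartitionDirectory] =
  fileIndex.listFiles(partitionFilters, dataFilters)
```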


How can we get the post-scan filters in DS V2 classes (implementations of ScanBuilder)?
I tried the traits SupportsPushDownFilters and SupportsPushDownV2Filters, but neither of them lets us push the post-scan filters down. Is this by design?
If users could get all the filters, we could do more optimization for our own business logic; it would then be up to us to choose how to use these conditions. Wouldn't that be better?

Contributor Author

The post-scan filters are those that can't be pushed down to the data source. Basically, we put the un-pushable filters in the post-scan filters so they will be applied by Spark.


Thanks for your reply. But how can I get all the filters, and how is "un-pushable" defined? I ran some cases and found it weird that some conditions were classified as post-scan filters, whether I extend SupportsPushDownFilters or SupportsPushDownV2Filters; only some isNotNull(#v) filters show up in the pushFilters function. Here is our case:
SELECT * FROM iteblog_tab1 t1 LEFT SEMI JOIN iteblog_tab2 t2 ON t1.k= t2.k AND t2._c0 < 2
In this case both iteblog_tab1 and iteblog_tab2 are partitioned by k. Below is the result of calling explain. Why is t2._c0 < 2 not applied in the scan but in the post-scan? The code is up to date with the latest master branch.

== Parsed Logical Plan ==
'Project [*]
+- 'Join LeftSemi, (('t1.k = 't2.k) AND ('t2._c0 < 2))
:- 'SubqueryAlias t1
: +- 'UnresolvedRelation [iteblog_tab1], [], false
+- 'SubqueryAlias t2
+- 'UnresolvedRelation [iteblog_tab2], [], false

== Analyzed Logical Plan ==
_c0: string, k: int
Project [_c0#38, k#39]
+- Join LeftSemi, ((k#39 = k#59) AND (cast(_c0#58 as int) < 2))
:- SubqueryAlias t1
: +- SubqueryAlias iteblog_tab1
: +- View (iteblog_tab1, [_c0#38,k#39])
: +- RelationV2[_c0#38, k#39] /home/user/spark/spark-warehouse/iteblog_tab1
+- SubqueryAlias t2
+- SubqueryAlias iteblog_tab2
+- View (iteblog_tab2, [_c0#58,k#59])
+- RelationV2[_c0#58, k#59] /home/user/spark/spark-warehouse/iteblog_tab2

== Optimized Logical Plan ==
Join LeftSemi, (k#39 = k#59)
:- Filter dynamicpruning#72 [k#39]
: : +- Project [k#59]
: : +- Filter (cast(_c0#58 as int) < 2)
: : +- RelationV2[_c0#58, k#59] /home/user/spark/spark-warehouse/iteblog_tab2
: +- RelationV2[_c0#38, k#39] /home/user/spark/spark-warehouse/iteblog_tab1
+- Project [k#59]
+- Filter (cast(_c0#58 as int) < 2)
+- RelationV2[_c0#58, k#59] /home/user/spark/spark-warehouse/iteblog_tab2

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [k#39], [k#59], LeftSemi, BuildRight, false
:- Project [_c0#38, k#39]
: +- BatchScan[_c0#38, k#39] CSVScan DataFilters: [], Format: csv, Location: InMemoryFileIndex(1 paths)[file:/home/user/spark/spark-warehouse/iteblog_tab1], PartitionFilters: [], ReadSchema: struct<_c0:string>, PushedFilters---------: [] RuntimeFilters: [dynamicpruningexpression(k#39 IN dynamicpruning#72)]
: +- SubqueryAdaptiveBroadcast dynamicpruning#72, 0, false, Project [k#59], [k#59]
: +- AdaptiveSparkPlan isFinalPlan=false
: +- Project [k#59]
: +- Filter (cast(_c0#58 as int) < 2)
: +- BatchScan[_c0#58, k#59] CSVScan DataFilters: [], Format: csv, Location: InMemoryFileIndex(1 paths)[file:/home/user/spark/spark-warehouse/iteblog_tab2], PartitionFilters: [], ReadSchema: struct<_c0:string>, PushedFilters---------: [] RuntimeFilters: []
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#105]
+- Project [k#59]
+- Filter (cast(_c0#58 as int) < 2)
+- BatchScan[_c0#58, k#59] CSVScan DataFilters: [], Format: csv, Location: InMemoryFileIndex(1 paths)[file:/home/i-bozhitao/spark/spark-warehouse/iteblog_tab2], PartitionFilters: [], ReadSchema: struct<_c0:string>, PushedFilters---------: [] RuntimeFilters: []


But it works for the subclasses of FileScanBuilder, such as CSVScanBuilder, which extend SupportsPushDownCatalystFilters. Is this by design?
BTW:
SupportsPushDownFilters
SupportsPushDownV2Filters
SupportsPushDownCatalystFilters
We really need some documentation or debugging aids to distinguish the differences between them. Hopefully they can be merged or clarified.

Contributor

Post-scan filters mean Spark should evaluate them. Pushed filters mean the data source should handle them. These two can overlap. For example, data filters are evaluated by both the Parquet reader and Spark (because Parquet does row-group filtering, not per-row filtering).
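
To illustrate the overlap, here is a hedged sketch of a SupportsPushDownFilters implementation for a source that can only do coarse, statistics-based skipping (the class is hypothetical):

```scala
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters}
import org.apache.spark.sql.sources.Filter

class StatsBasedScanBuilder extends ScanBuilder with SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    // Accept every filter for coarse skipping (similar to Parquet row groups)...
    pushed = filters
    // ...but also return them all as post-scan filters, because the source cannot
    // guarantee per-row filtering, so Spark must evaluate them again.
    filters
  }

  override def pushedFilters(): Array[Filter] = pushed

  override def build(): Scan = ??? // scan construction omitted in this sketch
}
```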

@@ -465,4 +465,22 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel
}
checkAnswer(df2, Seq(Row(53000.00)))
}

test("scan with aggregate push-down: aggregate with partially pushed down filters" +
"will NOT push down") {
Contributor Author

This is not relevant to the partition filters change. I just want to add a test to cover the partial filter push-down case for JDBC.

@cloud-fan
Contributor

I'd like to discuss the high-level idea first. Ideally, the process of DS v2 filter pushdown should be:

  1. the rule V2ScanRelationPushDown pushes down filters to the v2 source
  2. the v2 source looks at the pushed filters and returns a list of filters that Spark should evaluate. These "post-scan" filters can either be filters that are not supported by this v2 source, or filters that the v2 source can't fully apply, so Spark needs to run the filter again.
  3. for the file source, we need Spark to evaluate the data filters, as statistics-based filtering can't filter out 100% of the unneeded data. We don't need Spark to evaluate the partition filters again, as the file source can filter out partitions exactly.
  4. then Spark continues compiling the query, applies CBO, and finally submits Spark jobs.

I think the key problem we should fix is: file source v2 should not return partition filters as the "post-scan" filters in its implementation of SupportsPushDownFilters.pushFilters.

@huaxingao
Contributor Author

@cloud-fan

I think the key problem we should fix is: file source v2 should not return partition filters as the "post-scan" filters in its implementation of SupportsPushDownFilters.pushFilters.

Agree.

Seems to me there are two ways to fix this:

  1. in SupportsPushDownFilters.pushFilters, separate the partition filters and data filters. Something like this:
def pushFilters(filters: Array[Filter]): Array[Filter] = {
   // separate the partition filters and data filters
   // set both on the ScanBuilder so they can be passed to the v2 Scan at construction time
   // return the data filters as post-scan filters
}

The problem with this approach is that not all data sources implement SupportsPushDownFilters. For a data source that doesn't have pushFilters, we would need a different path to separate the partition filters and data filters.

  2. separate the partition filters and data filters before calling SupportsPushDownFilters.pushFilters, and set these two types of filters on the ScanBuilder. The filters passed to SupportsPushDownFilters.pushFilters then contain only data filters.

I tried the first approach, and then changed to the second one.

BTW, I had a PR to push down only data filter to ORCScan #33680

@cloud-fan
Contributor

The problem with this approach is that not all data sources implement SupportsPushDownFilters. For a data source that doesn't have pushFilters, we would need a different path to separate the partition filters and data filters.

Which file source doesn't implement SupportsPushDownFilters?

@cloud-fan
Contributor

I think all v2 file sources need to implement SupportsPushDownFilters, as they need to do partition pruning.

@huaxingao
Contributor Author

TextScanBuilder doesn't implement SupportsPushDownFilters

@cloud-fan
Contributor

By design, all filter pushdown should happen in SupportsPushDownFilters.pushFilters, including partition pruning. We may have special-cased the file source before, but that's not right. All v2 file sources need to implement SupportsPushDownFilters, to support partition pruning at least.

@huaxingao
Contributor Author

The reason partition pruning works OK for TextScan is that we currently prune the partitions in PruneFileSourcePartitions.
Should we make FileScanBuilder implement SupportsPushDownFilters too and do partition pruning in pushFilters?

@cloud-fan
Contributor

Yes please, that's what external v2 sources should do as well.


SparkQA commented Aug 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46958/


SparkQA commented Aug 31, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47371/


SparkQA commented Aug 31, 2021

Test build #142868 has finished for PR 33650 at commit 68ace26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SupportsPushDownCatalystFilters

* Expression. These catalyst Expression filters are used for partition pruning. The dataFilters
* are also translated into data source filters and used for selecting records.
*/
def pushCatalystFilters(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Unit
Contributor

Now that we have an interface, I think it's better to make the API more general:

def pushCatalystFilters(filters: Seq[Expression]): Seq[Expression]

The file source should split partition and data filters.
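
For reference, a sketch of what such a general interface could look like, matching the signature suggested above (the scaladoc wording here is illustrative, not the merged text):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

trait SupportsPushDownCatalystFilters {
  /**
   * Pushes down catalyst Expression filters and returns the post-scan filters that
   * Spark should still evaluate. A file source implementation is expected to split
   * partition filters from data filters internally: partition filters are kept for
   * partition pruning, while data filters are translated to source filters for push-down.
   */
  def pushCatalystFilters(filters: Seq[Expression]): Seq[Expression]
}
```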

Contributor Author

Done :)

@@ -39,9 +37,9 @@ object PushDownUtils extends PredicateHelper {
* @return pushed filter and post-scan filters.
*/
def pushFilters(
scanBuilder: ScanBuilder,
scanBuilderHolder: ScanBuilderHolder,
Contributor

Nit: this change is not needed now

@@ -57,6 +68,34 @@ abstract class FileScanBuilder(
StructType(fields)
}

override def pushFilters(filters: Seq[Expression]): Seq[Expression] = {
val partitionColNames =
partitionSchema.fields.map(PartitioningUtils.getColName(_, isCaseSensitive)).toSet
Contributor

the input filters are already normalized. We can simply compare the names instead of considering the case sensitivity.

@@ -242,4 +243,16 @@ object DataSourceUtils {
options
}
}

def getPartitionFiltersAndDataFilters(
partitionColumns: Seq[Attribute],
Contributor

It seems we can simplify the caller side if this method takes partitionSchema: StructType

def pushFilters(filters: Seq[Expression]): Seq[Expression]

/**
* Returns the filters that are pushed to the data source via {@link #pushFilters(Filter[])}.
Member

pushFilters(Filter[])? pushFilters(Expression[])?

Member

BTW, this only returns pushed data filters, right? Could you mention it too?

Contributor Author

Fixed.

Member

@viirya viirya left a comment

Looks okay. A few minor comments.

@viirya
Member

viirya commented Sep 2, 2021

Pending CI

@viirya
Member

viirya commented Sep 3, 2021

Thanks. Merging to master!

@viirya viirya closed this in 38b6fbd Sep 3, 2021
@huaxingao
Contributor Author

Thank you all!

@huaxingao huaxingao deleted the partition_filter branch September 3, 2021 03:13
viirya pushed a commit that referenced this pull request Oct 27, 2021
…quet if filter is on partition col

### What changes were proposed in this pull request?
I just realized that with the changes in #33650, the restriction against pushing down Min/Max/Count when the filter is a partition filter was already removed. This PR just adds a test to make sure Min/Max/Count in Parquet are pushed down if the filter is on a partition column.

### Why are the changes needed?
To complete the work for Aggregate (Min/Max/Count) push down for Parquet

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
new test

Closes #34248 from huaxingao/partitionFilter.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
chenzhx pushed a commit to chenzhx/spark that referenced this pull request Feb 22, 2022
chenzhx pushed a commit to chenzhx/spark that referenced this pull request Mar 30, 2022
chenzhx pushed a commit to Kyligence/spark that referenced this pull request Apr 18, 2022
chenzhx added a commit to Kyligence/spark that referenced this pull request May 5, 2022
* [SPARK-36556][SQL] Add DSV2 filters

Co-Authored-By: DB Tsai <d_tsai@apple.com>
Co-Authored-By: Huaxin Gao <huaxin_gao@apple.com>

### What changes were proposed in this pull request?
Add DSV2 Filters and use these in V2 codepath.

### Why are the changes needed?
The motivation of adding DSV2 filters:
1. The values in V1 filters are Scala types. When translating catalyst `Expression` to V1 filters, we have to call `convertToScala` to convert from Catalyst types used internally in rows to standard Scala types, and later convert the Scala types back to Catalyst types. This is very inefficient. In V2 filters, we use `Expression` for filter values, so the conversions from Catalyst types to Scala types and back are avoided.
2. Improve nested column filter support.
3. Make the filters work better with the rest of the DSV2 APIs.

### Does this PR introduce _any_ user-facing change?
Yes. The new V2 filters

### How was this patch tested?
new test

Closes #33803 from huaxingao/filter.

Lead-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

* [SPARK-36760][SQL] Add interface SupportsPushDownV2Filters

Co-Authored-By: DB Tsai <d_tsai@apple.com>
Co-Authored-By: Huaxin Gao <huaxin_gao@apple.com>
### What changes were proposed in this pull request?
This is the 2nd PR for V2 Filter support. This PR does the following:

- Add interface SupportsPushDownV2Filters

Future work:
- refactor `OrcFilters`, `ParquetFilters`, `JacksonParser`, `UnivocityParser` so both V1 file source and V2 file source can use them
- For V2 file source: implement  v2 filter -> parquet/orc filter. csv and Json don't have real filters, but also need to change the current code to have v2 filter -> `JacksonParser`/`UnivocityParser`
- For V1 file source, keep what we currently have: v1 filter -> parquet/orc filter
- We don't need v1filter.toV2 and v2filter.toV1 since we have two separate paths

The reasons that we have reached the above conclusion:
- The major motivation to implement V2Filter is to eliminate the unnecessary conversion between Catalyst types and Scala types when using Filters.
- We provide this `SupportsPushDownV2Filters` in this PR so V2 data source (e.g. iceberg) can implement it and use V2 Filters
- There are lots of work to implement v2 filters in the V2 file sources because of the following reasons:

possible approaches for implementing V2Filter:
1. keep what we have for file source v1: v1 filter -> parquet/orc filter
    file source v2 we will implement v2 filter -> parquet/orc filter
    We don't need v1->v2 and v2->v1
    problem with this approach: there are lots of code duplication

2.  We will implement v2 filter -> parquet/orc filter
     file source v1: v1 filter -> v2 filter -> parquet/orc filter
     We will need V1 -> V2
     This is the approach I am using in https://github.com/apache/spark/pull/33973
     In that PR, I have
     v2 orc: v2 filter -> orc filter
     V1 orc: v1 -> v2 -> orc filter

     v2 csv: v2->v1, new UnivocityParser
     v1 csv: new UnivocityParser

    v2 Json: v2->v1, new JacksonParser
    v1 Json: new JacksonParser

    csv and Json don't have real filters, they just use filter references, should be OK to use either v1 and v2. Easier to use
    v1 because no need to change.

    I haven't finished parquet yet. The PR doesn't have the parquet V2Filter implementation, but I plan to have
    v2 parquet: v2 filter -> parquet filter
    v1 parquet: v1 -> v2 -> parquet filter

    Problem with this approach:
    1. It's not easy to implement V1->V2 because V2 filters have `LiteralValue` and need type info. We already lost the type information when we converted the Expression filter to a v1 filter.
    2. parquet is OK
        Use Timestamp as example, parquet filter takes long for timestamp
        v2 parquet: v2 filter -> parquet filter
       timestamp
       Expression (Long) -> v2 filter (LiteralValue  Long)-> parquet filter (Long)

       V1 parquet: v1 -> v2 -> parquet filter
       timestamp
       Expression (Long) -> v1 filter (timestamp) -> v2 filter (LiteralValue  Long)-> parquet filter (Long)

       but we have a problem for orc because the orc filter takes java Timestamp
       v2 orc: v2 filter -> orc filter
       timestamp
       Expression (Long) -> v2 filter (LiteralValue Long) -> orc filter (Timestamp)

       V1 orc: v1 -> v2 -> orc filter
       Expression (Long) -> v1 filter (timestamp) -> v2 filter (LiteralValue Long) -> orc filter (Timestamp)
      This defeats the purpose of implementing v2 filters.
3.  keep what we have for file source v1: v1 filter -> parquet/orc filter
     file source v2: v2 filter -> v1 filter -> parquet/orc filter
     We will need V2 -> V1
     we have similar problem as approach 2.

So the conclusion is: approach 1 (keep what we have for file source v1: v1 filter -> parquet/orc filter
    file source v2 we will implement v2 filter -> parquet/orc filter) is better, but there are lots of code duplication. We will need to refactor `OrcFilters`, `ParquetFilters`, `JacksonParser`, `UnivocityParser` so both V1 file source and V2 file source can use them.

### Why are the changes needed?
Use V2Filters to eliminate the unnecessary conversion between Catalyst types and Scala types.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Added new UT

Closes #34001 from huaxingao/v2filter.

Lead-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37020][SQL] DS V2 LIMIT push down

### What changes were proposed in this pull request?
Push down limit to data source for better performance

### Why are the changes needed?
For LIMIT, e.g. `SELECT * FROM table LIMIT 10`, Spark retrieves all the data from table and then returns 10 rows. If we can push LIMIT to data source side, the data transferred to Spark will be dramatically reduced.

### Does this PR introduce _any_ user-facing change?
Yes. new interface `SupportsPushDownLimit`
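
A minimal sketch of how a JDBC-like source could implement the new interface (the builder class is hypothetical; SQL generation and error handling are omitted):

```scala
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownLimit}

class LimitedScanBuilder extends ScanBuilder with SupportsPushDownLimit {
  private var pushedLimit: Option[Int] = None

  // Returning true tells Spark the source fully handles the limit.
  override def pushLimit(limit: Int): Boolean = {
    pushedLimit = Some(limit) // e.g. appended as "LIMIT <n>" to the generated query
    true
  }

  override def build(): Scan = ??? // scan construction omitted in this sketch
}
```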

### How was this patch tested?
new test

Closes #34291 from huaxingao/pushdownLimit.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>

* [SPARK-37038][SQL] DSV2 Sample Push Down

### What changes were proposed in this pull request?

Push down Sample to data source for better performance. If Sample is pushed down, it will be removed from logical plan so it will not be applied at Spark any more.

Current Plan without Sample push down:
```
== Parsed Logical Plan ==
'Project [*]
+- 'Sample 0.0, 0.8, false, 157
   +- 'UnresolvedRelation [postgresql, new_table], [], false

== Analyzed Logical Plan ==
col1: int, col2: int
Project [col1#163, col2#164]
+- Sample 0.0, 0.8, false, 157
   +- SubqueryAlias postgresql.new_table
      +- RelationV2[col1#163, col2#164] new_table

== Optimized Logical Plan ==
Sample 0.0, 0.8, false, 157
+- RelationV2[col1#163, col2#164] new_table

== Physical Plan ==
*(1) Sample 0.0, 0.8, false, 157
+- *(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$16dde4769 [col1#163,col2#164] PushedAggregates: [], PushedFilters: [], PushedGroupby: [],  ReadSchema: struct<col1:int,col2:int>
```
after Sample push down:
```
== Parsed Logical Plan ==
'Project [*]
+- 'Sample 0.0, 0.8, false, 187
   +- 'UnresolvedRelation [postgresql, new_table], [], false

== Analyzed Logical Plan ==
col1: int, col2: int
Project [col1#163, col2#164]
+- Sample 0.0, 0.8, false, 187
   +- SubqueryAlias postgresql.new_table
      +- RelationV2[col1#163, col2#164] new_table

== Optimized Logical Plan ==
RelationV2[col1#163, col2#164] new_table

== Physical Plan ==
*(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$165b57543 [col1#163,col2#164] PushedAggregates: [], PushedFilters: [], PushedGroupby: [], PushedSample: TABLESAMPLE  0.0 0.8 false 187, ReadSchema: struct<col1:int,col2:int>
```
The new interface is implemented using JDBC for POC and end to end test. TABLESAMPLE is not supported by all the databases. It is implemented using postgresql in this PR.

### Why are the changes needed?
Reduce IO and improve performance. For SAMPLE, e.g. `SELECT * FROM t TABLESAMPLE (1 PERCENT)`, Spark retrieves all the data from table and then return 1% rows. It will dramatically reduce the transferred data size and improve performance if we can push Sample to data source side.

### Does this PR introduce any user-facing change?
Yes. new interface `SupportsPushDownTableSample`
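
A hedged sketch of implementing the new interface (hypothetical builder; assumes a pushTableSample(lowerBound, upperBound, withReplacement, seed) method that returns whether the source handles the sample):

```scala
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownTableSample}

class SampledScanBuilder extends ScanBuilder with SupportsPushDownTableSample {
  private var sample: Option[(Double, Double, Boolean, Long)] = None

  // Returning true lets Spark drop its own Sample node, as in the plans above.
  override def pushTableSample(
      lowerBound: Double,
      upperBound: Double,
      withReplacement: Boolean,
      seed: Long): Boolean = {
    sample = Some((lowerBound, upperBound, withReplacement, seed))
    true // e.g. rendered as a TABLESAMPLE clause for dialects that support it
  }

  override def build(): Scan = ??? // scan construction omitted in this sketch
}
```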

### How was this patch tested?
New test

Closes #34451 from huaxingao/sample.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37286][SQL] Move compileAggregates from JDBCRDD to JdbcDialect

### What changes were proposed in this pull request?
Currently, the method `compileAggregates` is a member of `JDBCRDD`. But that is not reasonable, because the JDBC source knows best how to compile aggregate expressions for its own dialect.

### Why are the changes needed?
The JDBC source knows best how to compile aggregate expressions for its own dialect.
After this PR, we can extend the pushdown (e.g. aggregate) per dialect for different JDBC databases.

There are two situations:
First, database A and B implement a different number of aggregate functions that meet the SQL standard.

### Does this PR introduce _any_ user-facing change?
'No'. Just change the inner implementation.

### How was this patch tested?
Jenkins tests.

Closes #34554 from beliefer/SPARK-37286.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37286][DOCS][FOLLOWUP] Fix the wrong parameter name for Javadoc

### What changes were proposed in this pull request?

This PR fixes an issue that the Javadoc generation fails due to the wrong parameter name of a method added in SPARK-37286 (#34554).
https://github.com/apache/spark/runs/4409267346?check_suite_focus=true#step:9:5081

### Why are the changes needed?

To keep the build clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA itself.

Closes #34801 from sarutak/followup-SPARK-37286.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-37262][SQL] Don't log empty aggregate and group by in JDBCScan

### What changes were proposed in this pull request?
Currently, the empty pushed aggregate and pushed group by are logged in Explain for JDBCScan
```
Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$172e75786 [NAME#1,SALARY#2] PushedAggregates: [], PushedFilters: [IsNotNull(SALARY), GreaterThan(SALARY,100.00)], PushedGroupby: [], ReadSchema: struct<NAME:string,SALARY:decimal(20,2)>
```

After the fix, the JDBCSScan will be
```
Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$172e75786 [NAME#1,SALARY#2] PushedFilters: [IsNotNull(SALARY), GreaterThan(SALARY,100.00)], ReadSchema: struct<NAME:string,SALARY:decimal(20,2)>
```

### Why are the changes needed?
address this comment https://github.com/apache/spark/pull/34451#discussion_r740220800

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
existing tests

Closes #34540 from huaxingao/aggExplain.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37483][SQL] Support push down top N to JDBC data source V2

### What changes were proposed in this pull request?
Currently, Spark supports pushing down LIMIT to the data source.
However, in real user scenarios a LIMIT usually comes with an ORDER BY, because limit and order by are more valuable together.

On the other hand, pushing down top N (the same as ORDER BY ... LIMIT N) hands already-ordered data to Spark's sort, so Spark's sort may see some performance improvement.
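
A sketch under the assumption that the change exposes an interface along the lines of SupportsPushDownTopN with a pushTopN(orders, limit) method; the description above does not spell out the exact API, so treat the names and signature here as assumptions:

```scala
import org.apache.spark.sql.connector.expressions.SortOrder
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownTopN}

class TopNScanBuilder extends ScanBuilder with SupportsPushDownTopN {
  private var topN: Option[(Seq[SortOrder], Int)] = None

  // Returning true lets the source run "ORDER BY ... LIMIT n" itself; Spark's sort
  // then only needs to deal with already-ordered partition output.
  override def pushTopN(orders: Array[SortOrder], limit: Int): Boolean = {
    topN = Some((orders.toSeq, limit))
    true
  }

  override def build(): Scan = ??? // scan construction omitted in this sketch
}
```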

### Why are the changes needed?
  1. pushing down top N is very useful for user scenarios.
  2. pushing down top N can improve the performance of the sort.

### Does this PR introduce _any_ user-facing change?
'No'. Just changes the physical execution.

### How was this patch tested?
New tests.

Closes #34918 from beliefer/SPARK-37483.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37644][SQL] Support datasource v2 complete aggregate pushdown

### What changes were proposed in this pull request?
Currently, Spark supports aggregate push-down with a partial-agg and a final-agg. For some data sources (e.g. JDBC), we can avoid the partial-agg and final-agg by running the aggregate completely on the database.

### Why are the changes needed?
Improve performance for aggregate pushdown.

### Does this PR introduce _any_ user-facing change?
'No'. Just changes the inner implementation.

### How was this patch tested?
New tests.

Closes #34904 from beliefer/SPARK-37644.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37627][SQL] Add sorted column in BucketTransform

### What changes were proposed in this pull request?
In V1, we can create table with sorted bucket like the following:
```
      sql("CREATE TABLE tbl(a INT, b INT) USING parquet " +
        "CLUSTERED BY (a) SORTED BY (b) INTO 5 BUCKETS")
```
However, creating table with sorted bucket in V2 failed with Exception
`org.apache.spark.sql.AnalysisException: Cannot convert bucketing with sort columns to a transform.`

### Why are the changes needed?
This PR adds sorted column in BucketTransform so we can create table in V2 with sorted bucket

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
new UT

Closes #34879 from huaxingao/sortedBucket.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37789][SQL] Add a class to represent general aggregate functions in DS V2

### What changes were proposed in this pull request?

There are a lot of aggregate functions in SQL and it's a lot of work to add them one by one in the DS v2 API. This PR proposes to add a new `GeneralAggregateFunc` class to represent all the general SQL aggregate functions. Since it's general, Spark doesn't know its aggregation buffer and can only push down the aggregation to the source completely.

As an example, this PR also translates `AVG` to `GeneralAggregateFunc` and pushes it to JDBC V2.

### Why are the changes needed?

To add aggregate functions in DS v2 easier.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

JDBC v2 test

Closes #35070 from cloud-fan/agg.

Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37644][SQL][FOLLOWUP] When partition column is same as group by key, pushing down aggregate completely

### What changes were proposed in this pull request?
When the JDBC option "partitionColumn" is specified and it is the same as the group-by key, the aggregate push-down should be complete.

### Why are the changes needed?
Improve the datasource v2 complete aggregate pushdown.

### Does this PR introduce _any_ user-facing change?
'No'. Just changes the inner implementation.

### How was this patch tested?
New tests.

Closes #35052 from beliefer/SPARK-37644-followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37527][SQL] Translate more standard aggregate functions for pushdown

### What changes were proposed in this pull request?
Currently, Spark aggregate pushdown translates some standard aggregate functions so that they can be compiled for a specific database.
After this work, users can override `JdbcDialect.compileAggregate` to implement standard aggregate functions supported by a given database.
This PR just translates the ANSI standard aggregate functions. Mainstream database support for these functions is shown below:
| Name | ClickHouse | Presto | Teradata | Snowflake | Oracle | Postgresql | Vertica | MySQL | RedShift | ElasticSearch | Impala | Druid | SyBase | DB2 | H2 | Exasol | Mariadb | Phoenix | Yellowbrick | Singlestore | Influxdata | Dolphindb | Intersystems |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| `VAR_POP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes |
| `VAR_SAMP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No |  Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes |
| `STDDEV_POP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `STDDEV_SAMP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No |  Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| `COVAR_POP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | No |  Yes | Yes | No | No | No | No | Yes | Yes | No |
| `COVAR_SAMP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | No |  Yes | Yes | No | No | No | No | No | No | No |
| `CORR` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | No |  Yes | Yes | No | No | No | No | No | Yes | No |

Because some aggregate functions are converted by the Optimizer as shown below, this PR does not need to match them.

|Input|Parsed|Optimized|
|------|--------------------|----------|
|`Every`| `aggregate.BoolAnd` |`Min`|
|`Any`| `aggregate.BoolOr` |`Max`|
|`Some`| `aggregate.BoolOr` |`Max`|

### Why are the changes needed?
Allow `*Dialect` implementations to extend the supported aggregate functions by overriding `JdbcDialect.compileAggregate`.

### Does this PR introduce _any_ user-facing change?
Yes. Users could pushdown more aggregate functions.

### How was this patch tested?
Exists tests.

Closes #35101 from beliefer/SPARK-37527-new2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>

* [SPARK-37734][SQL][TESTS] Upgrade h2 from 1.4.195 to 2.0.204

### What changes were proposed in this pull request?
This PR aims to upgrade `com.h2database` from 1.4.195 to 2.0.202

### Why are the changes needed?
Fix one vulnerability, ref: https://www.tenable.com/cve/CVE-2021-23463

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #35013 from beliefer/SPARK-37734.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37527][SQL] Compile `COVAR_POP`, `COVAR_SAMP` and `CORR` in `H2Dialet`

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/35101 translate `COVAR_POP`, `COVAR_SAMP` and `CORR`, but the H2 lower version cannot support them.

After https://github.com/apache/spark/pull/35013, we can compile the three aggregate functions in `H2Dialet` now.

### Why are the changes needed?
Supplement the implement of `H2Dialet`.

### Does this PR introduce _any_ user-facing change?
'Yes'. Spark could complete push-down `COVAR_POP`, `COVAR_SAMP` and `CORR` into H2.

### How was this patch tested?
Test updated.

Closes #35145 from beliefer/SPARK-37527_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37839][SQL] DS V2 supports partial aggregate push-down `AVG`

### What changes were proposed in this pull request?
`max`,`min`,`count`,`sum`,`avg` are the most commonly used aggregation functions.
Currently, DS V2 supports complete aggregate push-down of `avg`, but supporting partial aggregate push-down of `avg` is very useful as well.

The aggregate push-down algorithm is:

1. Spark translates group expressions of `Aggregate` to DS V2 `Aggregation`.
2. Spark calls `supportCompletePushDown` to check if it can completely push down aggregate.
3. If `supportCompletePushDown` returns true, we preserve the aggregate expressions as the final aggregate expressions. Otherwise, we split `AVG` into 2 functions: `SUM` and `COUNT`.
4. Spark translates final aggregate expressions and group expressions of `Aggregate` to DS V2 `Aggregation` again, and pushes the `Aggregation` to JDBC source.
5. Spark constructs the final aggregate.
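
A purely illustrative sketch of step 3, using simplified standalone types rather than Spark's internal classes: when the aggregate cannot be completely pushed down, AVG is rewritten into SUM and COUNT, which are pushed, and Spark computes the final average from them.

```scala
sealed trait AggFunc
case class Avg(col: String) extends AggFunc
case class Sum(col: String) extends AggFunc
case class Count(col: String) extends AggFunc

// Final aggregate on the Spark side: AVG(col) = SUM(pushed sums) / SUM(pushed counts).
def splitForPartialPushDown(agg: AggFunc): Seq[AggFunc] = agg match {
  case Avg(c) => Seq(Sum(c), Count(c))
  case other  => Seq(other)
}
```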

### Why are the changes needed?
DS V2 supports partial aggregate push-down `AVG`

### Does this PR introduce _any_ user-facing change?
'Yes'. DS V2 could partial aggregate push-down `AVG`

### How was this patch tested?
New tests.

Closes #35130 from beliefer/SPARK-37839.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-36526][SQL] DSV2 Index Support: Add supportsIndex interface

### What changes were proposed in this pull request?
Indexes are database objects created on one or more columns of a table. Indexes are used to improve query performance. A detailed explanation of database index is here https://en.wikipedia.org/wiki/Database_index

 This PR adds `supportsIndex` interface that provides APIs to work with indexes.

### Why are the changes needed?
Many data sources support indexes to improve query performance. In order to take advantage of the index support in a data source, this `supportsIndex` interface is added to let users create/drop an index, list indexes, etc.

### Does this PR introduce _any_ user-facing change?
yes, the following new APIs are added:

- createIndex
- dropIndex
- indexExists
- listIndexes

New SQL syntax:
```

CREATE [index_type] INDEX [index_name] ON [TABLE] table_name (column_index_property_list)[OPTIONS indexPropertyList]

    column_index_property_list: column_name [OPTIONS(indexPropertyList)]  [ ,  . . . ]
    indexPropertyList: index_property_name = index_property_value [ ,  . . . ]

DROP INDEX index_name

```
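
Illustrative usage only, following the grammar shown above; the table, column, and index names are hypothetical, and this would only succeed against a catalog whose tables implement the new interface (e.g. the DS V2 JDBC catalog used in the follow-up work):

```scala
// Assuming `spark` is an active SparkSession bound to such a catalog.
spark.sql("CREATE INDEX people_age_idx ON TABLE people (age)")
```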
### How was this patch tested?
only interface is added for now. Tests will be added when doing the implementation

Closes #33754 from huaxingao/index_interface.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-36913][SQL] Implement createIndex and IndexExists in DS V2 JDBC (MySQL dialect)

### What changes were proposed in this pull request?
Implementing `createIndex`/`IndexExists` in DS V2 JDBC

### Why are the changes needed?
This is a subtask of the V2 Index support. I am implementing index support for DS V2 JDBC so we can have a POC and an end to end testing. This PR implements `createIndex` and `IndexExists`. Next PR will implement `listIndexes` and `dropIndex`. I intentionally make the PR small so it's easier to review.

Index is not supported by h2 database and create/drop index are not standard SQL syntax. This PR only implements `createIndex` and `IndexExists` in `MySQL` dialect.

### Does this PR introduce _any_ user-facing change?
Yes, `createIndex`/`IndexExist` in DS V2 JDBC

### How was this patch tested?
new test

Closes #34164 from huaxingao/createIndexJDBC.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

* [SPARK-36914][SQL] Implement dropIndex and listIndexes in JDBC (MySQL dialect)

### What changes were proposed in this pull request?
This PR implements `dropIndex` and `listIndexes` in MySQL dialect

### Why are the changes needed?
As a subtask of the V2 Index support, this PR completes the implementation for JDBC V2 index support.

### Does this PR introduce _any_ user-facing change?
Yes, `dropIndex/listIndexes` in DS V2 JDBC

### How was this patch tested?
new tests

Closes #34236 from huaxingao/listIndexJDBC.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37343][SQL] Implement createIndex, IndexExists and dropIndex in JDBC (Postgres dialect)

### What changes were proposed in this pull request?
Implementing `createIndex`/`IndexExists`/`dropIndex` in DS V2 JDBC for Postgres dialect.

### Why are the changes needed?
This is a subtask of the V2 Index support. This PR implements `createIndex`, `IndexExists` and `dropIndex`. After review for some changes in this PR, I will create new PR for `listIndexs`, or add it in this PR.

This PR only implements `createIndex`, `IndexExists` and `dropIndex` in Postgres dialect.

### Does this PR introduce _any_ user-facing change?
Yes, `createIndex`/`IndexExists`/`dropIndex` in DS V2 JDBC

### How was this patch tested?
New test.

Closes #34673 from dchvn/Dsv2_index_postgres.

Authored-by: dch nguyen <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37867][SQL] Compile aggregate functions of build-in JDBC dialect

### What changes were proposed in this pull request?
DS V2 translates a lot of standard aggregate functions.
Currently, only H2Dialect compiles these standard aggregate functions. This PR compiles them for the other built-in JDBC dialects.

### Why are the changes needed?
Make build-in JDBC dialect support complete aggregate push-down.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users could use complete aggregate push-down with build-in JDBC dialect.

### How was this patch tested?
New tests.

Closes #35166 from beliefer/SPARK-37867.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37929][SQL][FOLLOWUP] Support cascade mode for JDBC V2

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/35246 supports `cascade` mode for the dropNamespace API.
This PR follows up https://github.com/apache/spark/pull/35246 to make JDBC V2 respect `cascade`.

### Why are the changes needed?
Let JDBC V2 respect `cascade`.

### Does this PR introduce _any_ user-facing change?
Yes.
Users could manipulate `drop namespace` with `cascade` on JDBC V2.

### How was this patch tested?
New tests.

Closes #35271 from beliefer/SPARK-37929-followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38035][SQL] Add docker tests for build-in JDBC dialect

### What changes were proposed in this pull request?
Currently, Spark only has `PostgresNamespaceSuite` to test DS V2 namespaces in a docker environment.
Tests for the other built-in JDBC dialects (e.g. Oracle, MySQL) are missing.

This PR also found some compatibility issues. For example, the JDBC API `conn.getMetaData.getSchemas` works badly for MySQL.

### Why are the changes needed?
We need to add tests for the other built-in JDBC dialects.

### Does this PR introduce _any_ user-facing change?
'No'. Just add tests which face developers.

### How was this patch tested?
New tests.

Closes #35333 from beliefer/SPARK-38035.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38054][SQL] Supports list namespaces in JDBC v2 MySQL dialect

### What changes were proposed in this pull request?
Currently, `JDBCTableCatalog.scala` queries namespaces as shown below.
```
      val schemaBuilder = ArrayBuilder.make[Array[String]]
      val rs = conn.getMetaData.getSchemas()
      while (rs.next()) {
        schemaBuilder += Array(rs.getString(1))
      }
      schemaBuilder.result
```

But this code cannot get any information when using the MySQL JDBC driver.
This PR uses `SHOW SCHEMAS` to query namespaces for MySQL.
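A minimal sketch of the MySQL-specific approach in plain JDBC (the URL and credentials are placeholders; the real code goes through the JDBC options and dialect plumbing):
```
import java.sql.DriverManager
import scala.collection.mutable.ArrayBuffer

val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/", "user", "password")
try {
  // DatabaseMetaData.getSchemas returns nothing useful for MySQL,
  // so list the schemas with MySQL's own SHOW SCHEMAS statement instead.
  val namespaces = ArrayBuffer.empty[Array[String]]
  val rs = conn.createStatement().executeQuery("SHOW SCHEMAS")
  while (rs.next()) {
    namespaces += Array(rs.getString(1)) // each schema is a one-level namespace
  }
  rs.close()
  namespaces.foreach(ns => println(ns.mkString(".")))
} finally {
  conn.close()
}
```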
This PR also fixes other issues listed below:

- Release the docker tests in `MySQLNamespaceSuite.scala`.
- Because MySQL doesn't support creating a comment on a schema, throw `SQLFeatureNotSupportedException`.
- Because MySQL doesn't support `DROP SCHEMA` in `RESTRICT` mode, throw `SQLFeatureNotSupportedException`.
- Refactor `JdbcUtils.executeQuery` to avoid `java.sql.SQLException: Operation not allowed after ResultSet closed`.

### Why are the changes needed?
Make the MySQL dialect support listing namespaces.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Some API changed.

### How was this patch tested?
New tests.

Closes #35355 from beliefer/SPARK-38054.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-36351][SQL] Refactor filter push down in file source v2

### What changes were proposed in this pull request?

Currently in `V2ScanRelationPushDown`, we push the filters (partition filters + data filters) to file source, and then pass all the filters (partition filters + data filters) as post scan filters to v2 Scan, and later in `PruneFileSourcePartitions`, we separate partition filters and data filters, set them in the format of `Expression` to file source.

Changes in this PR:
When we push filters to file sources in `V2ScanRelationPushDown`, since we already have the information about partition column , we want to separate partition filter and data filter there.

The benefit of doing this:
- we can handle all the filter related work for v2 file source at one place instead of two (`V2ScanRelationPushDown` and `PruneFileSourcePartitions`), so the code will be cleaner and easier to maintain.
- we actually have to separate partition filters and data filters at `V2ScanRelationPushDown`, otherwise, there is no way to find out which filters are partition filters, and we can't push down aggregate for parquet even if we only have partition filter.
- By separating the filters early at `V2ScanRelationPushDown`, we only needs to check data filters to find out which one needs to be converted to data source filters (e.g. Parquet predicates, ORC predicates) and pushed down to file source, right now we are checking all the filters (both partition filters and data filters)
- Similarly, we can only pass data filters as post scan filters to v2 Scan, because partition filters are used for partition pruning only, no need to pass them as post scan filters.

In order to do this, we will have the following changes

-  add `pushFilters` in file source v2. In this method:
    - push both Expression partition filter and Expression data filter to file source. Have to use Expression filters because we need these for partition pruning.
    - data filters are used for filter push down. If file source needs to push down data filters, it translates the data filters from `Expression` to `Sources.Filer`, and then decides which filters to push down.
    - partition filters are used for partition pruning.
- file source v2 no need to implement `SupportsPushdownFilters` any more, because when we separating the two types of filters, we have already set them on file data sources. It's redundant to use `SupportsPushdownFilters` to set the filters again on file data sources.

### Why are the changes needed?

see section one

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #33650 from huaxingao/partition_filter.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

* [SPARK-36645][SQL] Aggregate (Min/Max/Count) push down for Parquet

### What changes were proposed in this pull request?
Push down Min/Max/Count to Parquet with the following restrictions:

- nested types such as Array, Map or Struct will not be pushed down
- Timestamp not pushed down because INT96 sort order is undefined, Parquet doesn't return statistics for INT96
- If the aggregate column is a partition column, only Count will be pushed down; Min or Max will not be pushed down because Parquet doesn't return max/min for partition columns.
- If somehow the file doesn't have stats for the aggregate columns, Spark will throw an exception.
- Currently, if a filter/GROUP BY is involved, Min/Max/Count will not be pushed down, but the restriction will be lifted if the filter or GROUP BY is on a partition column (https://issues.apache.org/jira/browse/SPARK-36646 and https://issues.apache.org/jira/browse/SPARK-36647)

### Why are the changes needed?
Since Parquet has statistics for min, max and count, we want to take advantage of this info and push Min/Max/Count down to the Parquet layer for better performance.

### Does this PR introduce _any_ user-facing change?
Yes, `SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED` was added. If set to true, Min/Max/Count will be pushed down to Parquet.
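A minimal usage sketch (spark-shell). It assumes the config key behind `SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED` is `spark.sql.parquet.aggregatePushdown` and that Parquet is read through the v2 file source (i.e. removed from `spark.sql.sources.useV1SourceList`); the path is illustrative:
```
// Assumptions: config keys as named below; v2 Parquet reader in use.
spark.conf.set("spark.sql.sources.useV1SourceList", "")
spark.conf.set("spark.sql.parquet.aggregatePushdown", "true")

spark.range(0, 1000).toDF("c").write.mode("overwrite").parquet("/tmp/parquet_agg_demo")
spark.read.parquet("/tmp/parquet_agg_demo").createOrReplaceTempView("t")

// MIN/MAX/COUNT with no filter or GROUP BY can be answered from footer statistics;
// when the push-down applies, the pushed aggregation is listed in the scan node.
spark.sql("SELECT MIN(c), MAX(c), COUNT(*) FROM t").explain()
```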

### How was this patch tested?
new test suites

Closes #33639 from huaxingao/parquet_agg.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

* [SPARK-34960][SQL] Aggregate push down for ORC

### What changes were proposed in this pull request?

This PR is to add aggregate push down feature for ORC data source v2 reader.

At a high level, the PR does:

* The supported aggregate expression is MIN/MAX/COUNT same as [Parquet aggregate push down](https://github.com/apache/spark/pull/33639).
* BooleanType, ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType and DateType are allowed in MIN/MAX aggregate push down. All other column types are not allowed in MIN/MAX aggregate push down.
* All column types are supported in COUNT aggregate push down.
* Nested column's sub-fields are disallowed in aggregate push down.
* If the file does not have valid statistics, Spark will throw an exception and fail the query.
* If the aggregate has a filter or group-by column, it will not be pushed down.

At code level, the PR does:
* `OrcScanBuilder`: `pushAggregation()` checks whether the aggregation can be pushed down. Most of the checking logic is shared between Parquet and ORC and extracted into `AggregatePushDownUtils.getSchemaForPushedAggregation()`. `OrcScanBuilder` will create an `OrcScan` with the aggregation and the aggregation data schema.
* `OrcScan`: `createReaderFactory` creates an ORC reader factory with the aggregation and schema. Similar change to `ParquetScan`.
* `OrcPartitionReaderFactory`: `buildReaderWithAggregates` creates an ORC reader with aggregate push down (i.e. it reads the ORC file footer to process column statistics, instead of reading the actual data in the file). `buildColumnarReaderWithAggregates` creates a columnar ORC reader similarly. Both delegate the real work of reading the footer to `OrcUtils.createAggInternalRowFromFooter`.
* `OrcUtils.createAggInternalRowFromFooter`: reads the ORC file footer to process column statistics (the real heavy lifting happens here). Similar to `ParquetUtils.createAggInternalRowFromFooter`. Leverages utility methods such as `OrcFooterReader.readStatistics`.
* `OrcFooterReader`: `readStatistics` reads the ORC `ColumnStatistics[]` into Spark's `OrcColumnStatistics`. The transformation is needed here because ORC `ColumnStatistics[]` stores all column statistics in a flattened array, which is hard to process, while Spark's `OrcColumnStatistics` stores the statistics in a nested tree structure (e.g. like `StructType`). This is used by `OrcUtils.createAggInternalRowFromFooter`.
* `OrcColumnStatistics`: the easy-to-manipulate structure for ORC `ColumnStatistics`. This is used by `OrcFooterReader.readStatistics`.

### Why are the changes needed?

To improve the performance of query with aggregate.

### Does this PR introduce _any_ user-facing change?

Yes. A user-facing config `spark.sql.orc.aggregatePushdown` is added to control enabling/disabling the aggregate push down for ORC. By default the feature is disabled.
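A minimal sketch (spark-shell) mirroring the Parquet example above, again assuming the v2 ORC reader is used (ORC removed from `spark.sql.sources.useV1SourceList`); the path is illustrative:
```
spark.conf.set("spark.sql.sources.useV1SourceList", "")   // use the v2 file sources
spark.conf.set("spark.sql.orc.aggregatePushdown", "true") // the feature is off by default

spark.range(0, 1000).toDF("c").write.mode("overwrite").orc("/tmp/orc_agg_demo")
spark.read.orc("/tmp/orc_agg_demo").createOrReplaceTempView("t_orc")

// With no filter or group-by, MIN/MAX/COUNT can be answered from ORC file statistics.
spark.sql("SELECT MIN(c), MAX(c), COUNT(*) FROM t_orc").explain()
```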

### How was this patch tested?

Added unit test in `FileSourceAggregatePushDownSuite.scala`. Refactored all unit tests in https://github.com/apache/spark/pull/33639, and it now works for both Parquet and ORC.

Closes #34298 from c21/orc-agg.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

* [SPARK-37960][SQL] A new framework to represent catalyst expressions in DS v2 APIs

### What changes were proposed in this pull request?
This PR provides a new framework to represent catalyst expressions in DS v2 APIs.
`GeneralSQLExpression` is a general SQL expression to represent catalyst expression in DS v2 API.
`ExpressionSQLBuilder` is a builder to generate `GeneralSQLExpression` from catalyst expressions.
`CASE ... WHEN ... ELSE ... END` is just the first use case.

This PR also supports aggregate push down with `CASE ... WHEN ... ELSE ... END`.
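A hedged example (spark-shell): an aggregate whose argument is a `CASE ... WHEN ... ELSE ... END` expression that the new framework can represent and push to a JDBC source. The `h2` catalog and `test.employee` table are illustrative assumptions, not part of this PR:
```
// When the push-down applies, the pushed aggregate appears in the scan node of the plan.
spark.sql(
  """SELECT DEPT, SUM(CASE WHEN SALARY > 10000 THEN SALARY ELSE 0 END)
    |FROM h2.test.employee
    |GROUP BY DEPT""".stripMargin).explain()
```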

### Why are the changes needed?
Support aggregate push down with `CASE ... WHEN ... ELSE ... END`.

### Does this PR introduce _any_ user-facing change?
Yes. Users could use `CASE ... WHEN ... ELSE ... END` with aggregate push down.

### How was this patch tested?
New tests.

Closes #35248 from beliefer/SPARK-37960.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37867][SQL][FOLLOWUP] Compile aggregate functions for build-in DB2 dialect

### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/35166.
The previously referenced DB2 documentation is incorrect, which resulted in some supported aggregate functions not being compiled.

The correct documentation is https://www.ibm.com/docs/en/db2/11.5?topic=af-regression-functions-regr-avgx-regr-avgy-regr-count

### Why are the changes needed?
Make the built-in DB2 dialect support complete aggregate push-down for more aggregate functions.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users can use complete aggregate push-down with the built-in DB2 dialect.

### How was this patch tested?
New tests.

Closes #35520 from beliefer/SPARK-37867_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-36568][SQL] Better FileScan statistics estimation

### What changes were proposed in this pull request?
This PR modifies `FileScan.estimateStatistics()` to take the read schema into account.

### Why are the changes needed?
`V2ScanRelationPushDown` can column prune `DataSourceV2ScanRelation`s and change read schema of `Scan` operations. The better statistics returned by `FileScan.estimateStatistics()` can mean better query plans. For example, with this change the broadcast issue in SPARK-36568 can be avoided.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new UT.

Closes #33825 from peter-toth/SPARK-36568-scan-statistics-estimation.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37929][SQL] Support cascade mode for `dropNamespace` API

### What changes were proposed in this pull request?
This PR adds a new API `dropNamespace(String[] ns, boolean cascade)` to replace the existing one: it adds a boolean parameter `cascade` that supports deleting all the namespaces and tables under the namespace.

Also include changing the implementations and tests that are relevant to this API.

### Why are the changes needed?
According to [#cmt](https://github.com/apache/spark/pull/35202#discussion_r784463563), the current `dropNamespace` API doesn't support cascade mode, so this PR replaces it to support cascading.
If cascade is set to true, all namespaces and tables under the namespace are deleted.
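A short sketch (spark-shell) of the user-visible behavior through SQL, with an illustrative v2 catalog named `h2`:
```
// Without CASCADE (i.e. RESTRICT, the default), dropping a non-empty namespace fails;
// with CASCADE, the namespace and everything under it are dropped.
spark.sql("CREATE NAMESPACE h2.ns1")
spark.sql("CREATE TABLE h2.ns1.tbl (id INT)")
spark.sql("DROP NAMESPACE h2.ns1 CASCADE")
```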

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test.

Closes #35246 from dchvn/change_dropnamespace_api.

Authored-by: dch nguyen <dchvn.dgd@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* code format

* [SPARK-38196][SQL] Refactor framework so as JDBC dialect could compile expression by self way

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/35248 provides a new framework to represent catalyst expressions in DS V2 APIs.
Because that framework translates all catalyst expressions to a unified SQL string and cannot keep compatibility between different JDBC databases, it does not work well.

This PR refactors the framework so that a JDBC dialect can compile expressions in its own way.
First, the framework translates catalyst expressions to DS V2 expressions.
Second, the JDBC dialect compiles the DS V2 expressions to its own SQL syntax.

The Javadoc is shown below:
![image](https://user-images.githubusercontent.com/8486025/156579584-f56cafb5-641f-4c5b-a06e-38f4369051c3.png)

### Why are the changes needed?
Make the framework more generally usable.

### Does this PR introduce _any_ user-facing change?
'No'.
The feature is not released.

### How was this patch tested?
Existing tests.

Closes #35494 from beliefer/SPARK-37960_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38361][SQL] Add factory method `getConnection` into `JDBCDialect`

### What changes were proposed in this pull request?
At present, the factory method for obtaining a JDBC connection takes no parameters, because the JDBC URL of most databases is fixed and unique.
However, for databases such as ClickHouse, the connection is related to the shard node.
So the parameter form `getConnection: Partition => Connection` is more general.

This PR adds factory method `getConnection` into `JDBCDialect` according to https://github.com/apache/spark/pull/35696#issuecomment-1058060107.

### Why are the changes needed?
Make factory method `getConnection` more general.

### Does this PR introduce _any_ user-facing change?
'No'.
Just an internal change.

### How was this patch tested?
Existing tests.

Closes #35727 from beliefer/SPARK-38361_new.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* code format

* [SPARK-38560][SQL] If `Sum`, `Count`, `Any` accompany with distinct, cannot do partial agg push down

### What changes were proposed in this pull request?
Spark may partially push down sum(distinct col) or count(distinct col) when the data source has multiple partitions, and Spark will then sum the partial values again, so the result may be incorrect. For example, if the same value appears in two partitions, each partition reports count(distinct col) = 1 and Spark sums them to 2, while the correct answer is 1.
This PR therefore disables partial aggregate push down when `Sum`, `Count` or `Any` is used with DISTINCT.

### Why are the changes needed?
Fix the bug where pushing down sum(distinct col) or count(distinct col) to the data source returns an incorrect result.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users will see the correct behavior.

### How was this patch tested?
New tests.

Closes #35873 from beliefer/SPARK-38560.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-36718][SQL] Only collapse projects if we don't duplicate expensive expressions

### What changes were proposed in this pull request?

The `CollapseProject` rule can combine adjacent projects and merge the project lists. The key idea behind this rule is that the evaluation of a project is relatively expensive, that expression evaluation is cheap, and that the expression duplication caused by this rule is not a problem. This last assumption is, unfortunately, not always true:
- A user can invoke an expensive UDF, which now gets invoked more often than originally intended.
- A projection is very cheap in whole stage code generation. The duplication caused by `CollapseProject` does more harm than good here.

This PR addresses this problem, by only collapsing projects when it does not duplicate expensive expressions. In practice this means an input reference may only be consumed once, or when its evaluation does not incur significant overhead (currently attributes, nested column access, aliases & literals fall in this category).
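A small sketch (spark-shell) of the duplication problem, with an illustrative slow UDF standing in for real-world heavy logic:
```
import org.apache.spark.sql.functions._

// An artificially expensive UDF (illustrative).
val expensive = udf((s: String) => { Thread.sleep(10); s.reverse })

val df = spark.range(5).toDF("id")
  .withColumn("u", expensive(col("id").cast("string")))                         // first project computes u once
  .select(substring(col("u"), 1, 1).as("a"), substring(col("u"), 2, 1).as("b")) // second project uses u twice

// Collapsing the two projects would inline `expensive(...)` into both output columns,
// doubling the UDF invocations; with this change the projects stay separate.
df.explain(true)
```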

### Why are the changes needed?

We have seen multiple complains about `CollapseProject` in the past, due to it may duplicate expensive expressions. The most recent one is https://github.com/apache/spark/pull/33903 .

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

a new UT and existing test

Closes #33958 from cloud-fan/collapse.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38432][SQL] Refactor framework so as JDBC dialect could compile filter by self way

### What changes were proposed in this pull request?
Currently, Spark DS V2 can push down filters into a JDBC source. However, only the most basic form of filter is supported.
On the other hand, a JDBC source cannot compile the filters in its own way.

This PR refactors the framework so that a JDBC dialect can compile filters in its own way.
First, the framework translates catalyst expressions to DS V2 filters.
Second, the JDBC dialect compiles the DS V2 filters to its own SQL syntax.

### Why are the changes needed?
Make the framework more generally usable.

### Does this PR introduce _any_ user-facing change?
'No'.
The feature is not released.

### How was this patch tested?
Existing tests.

Closes #35768 from beliefer/SPARK-38432_new.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38432][SQL][FOLLOWUP] Supplement test case for overflow and add comments

### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/35768 and improves the code.

1. Supplement test cases for overflow
2. Do not throw IllegalArgumentException
3. Improve V2ExpressionSQLBuilder
4. Add comments in V2ExpressionBuilder

### Why are the changes needed?
Supplement test case for overflow and add comments.

### Does this PR introduce _any_ user-facing change?
'No'.
V2 aggregate pushdown not released yet.

### How was this patch tested?
New tests.

Closes #35933 from beliefer/SPARK-38432_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38533][SQL] DS V2 aggregate push-down supports project with alias

### What changes were proposed in this pull request?
Currently, Spark DS V2 aggregate push-down doesn't support project with alias.

Refer https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala#L96

This PR makes it work well with aliases.

**The first example:**
the origin plan show below:
```
Aggregate [DEPT#0], [DEPT#0, sum(mySalary#8) AS total#14]
+- Project [DEPT#0, SALARY#2 AS mySalary#8]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession77978658,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions5f8da82)
```
If we can completely push down the aggregate, then the plan will be:
```
Project [DEPT#0, SUM(SALARY)#18 AS sum(SALARY#2)#13 AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```
If we can partially push down the aggregate, then the plan will be:
```
Aggregate [DEPT#0], [DEPT#0, sum(cast(SUM(SALARY)#18 as decimal(20,2))) AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```

**The second example:**
the original plan is shown below:
```
Aggregate [myDept#33], [myDept#33, sum(mySalary#34) AS total#40]
+- Project [DEPT#25 AS myDept#33, SALARY#27 AS mySalary#34]
   +- ScanBuilderHolder [DEPT#25, NAME#26, SALARY#27, BONUS#28], RelationV2[DEPT#25, NAME#26, SALARY#27, BONUS#28] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession25c4f621,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions345d641e)
```
If we can completely push down the aggregate, then the plan will be:
```
Project [DEPT#25 AS myDept#33, SUM(SALARY)#44 AS sum(SALARY#27)#39 AS total#40]
+- RelationV2[DEPT#25, SUM(SALARY)#44] test.employee
```
If we can partially push down the aggregate, then the plan will be:
```
Aggregate [myDept#33], [DEPT#25 AS myDept#33, sum(cast(SUM(SALARY)#56 as decimal(20,2))) AS total#52]
+- RelationV2[DEPT#25, SUM(SALARY)#56] test.employee
```

### Why are the changes needed?
Supporting aliases makes the push-down more useful.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users could see DS V2 aggregate push-down supports project with alias.

### How was this patch tested?
New tests.

Closes #35932 from beliefer/SPARK-38533_new.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* code format

* [SPARK-37483][SQL][FOLLOWUP] Rename `pushedTopN` to `PushedTopN` and improve JDBCV2Suite

### What changes were proposed in this pull request?
This PR fixes three issues.
**First**, create method `checkPushedInfo` and `checkSortRemoved` to reuse code.
**Second**, remove method `checkPushedLimit`, because `checkPushedInfo` can cover it.
**Third**, rename `pushedTopN` to `PushedTopN`, so that it is consistent with other pushed information.

### Why are the changes needed?
Reuse code and make the pushed information more consistent.

### Does this PR introduce _any_ user-facing change?
'No'. This is a new feature and only improves the tests.

### How was this patch tested?
Adjust existing tests.

Closes #35921 from beliefer/SPARK-37483_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38644][SQL] DS V2 topN push-down supports project with alias

### What changes were proposed in this pull request?
Currently, Spark DS V2 topN push-down doesn't support project with alias.

This PR makes it work well with aliases.

**Example**:
the original plan is shown below:
```
Sort [mySalary#10 ASC NULLS FIRST], true
+- Project [NAME#1, SALARY#2 AS mySalary#10]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession7fd4b9ec,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true),StructField(IS_MANAGER,BooleanType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions3c8e4a82)
```
The `pushedLimit` and `sortOrders` of `JDBCScanBuilder` are empty.

If we can push down the top n, then the plan will be:
```
Project [NAME#1, SALARY#2 AS mySalary#10]
+- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession7fd4b9ec,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true),StructField(IS_MANAGER,BooleanType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions3c8e4a82)
```
The `pushedLimit` of `JDBCScanBuilder` will be `1` and `sortOrders` of `JDBCScanBuilder` will be `SALARY ASC NULLS FIRST`.

### Why are the changes needed?
Supporting aliases makes the push-down more useful.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users could see DS V2 topN push-down supports project with alias.

### How was this patch tested?
New tests.

Closes #35961 from beliefer/SPARK-38644.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38391][SQL] Datasource v2 supports partial topN push-down

### What changes were proposed in this pull request?
Currently, Spark pushes down topN completely. But for a data source (e.g. JDBC) that has multiple partitions, only a partial topN push down is safe: each partition can return its own top N, and Spark still needs to apply the final sort and limit.

### Why are the changes needed?
Make the behavior of sort pushdown correct.

### Does this PR introduce _any_ user-facing change?
'No'. Just change the inner implement.

### How was this patch tested?
New tests.

Closes #35710 from beliefer/SPARK-38391.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38633][SQL] Support push down Cast to JDBC data source V2

### What changes were proposed in this pull request?
Cast is very useful, and Spark often inserts Cast automatically to convert data types. This PR supports pushing Cast down to the JDBC data source V2.

### Why are the changes needed?
Allow more aggregates and filters to be pushed down.

### Does this PR introduce _any_ user-facing change?
'Yes'.
This PR lands after the 3.3.0 cut-off.

### How was this patch tested?
New tests.

Closes #35947 from beliefer/SPARK-38633.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38432][SQL][FOLLOWUP] Add test case for push down filter with alias

### What changes were proposed in this pull request?
DS V2 predicate push down to the data source supports columns with aliases.
But Spark is missing a test case for pushing down a filter with an alias.

### Why are the changes needed?
Add a test case for pushing down a filter with an alias.

### Does this PR introduce _any_ user-facing change?
'No'.
Just add a test case.

### How was this patch tested?
New tests.

Closes #35988 from beliefer/SPARK-38432_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38633][SQL][FOLLOWUP] JDBCSQLBuilder should build cast to type of databases

### What changes were proposed in this pull request?
DS V2 supports pushing CAST down to the database.
The current implementation only uses the typeName of the DataType.
For example, `Cast(column, StringType)` will be built as `CAST(column AS String)`.
But it should be `CAST(column AS TEXT)` for Postgres or `CAST(column AS VARCHAR2(255))` for Oracle.
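A hedged sketch of where the database-specific type name can come from: `JdbcDialect.getJDBCType` already maps Spark types to database types, and the generated CAST should use that mapping (the URL is illustrative):
```
import org.apache.spark.sql.jdbc.JdbcDialects
import org.apache.spark.sql.types.StringType

// The Postgres dialect maps StringType to TEXT, so the built SQL should be
// CAST(column AS TEXT) rather than CAST(column AS String).
val dialect = JdbcDialects.get("jdbc:postgresql://localhost:5432/db")
val dbType = dialect.getJDBCType(StringType).map(_.databaseTypeDefinition).getOrElse("STRING")
println(s"CAST(column AS $dbType)")
```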

### Why are the changes needed?
Improve the implementation of CAST push down.

### Does this PR introduce _any_ user-facing change?
'No'.
Just new feature.

### How was this patch tested?
Existing tests

Closes #35999 from beliefer/SPARK-38633_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37839][SQL][FOLLOWUP] Check overflow when DS V2 partial aggregate push-down `AVG`

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/35130 supports partial aggregate push-down of `AVG` for DS V2.
The behavior isn't consistent with `Average` when overflow occurs in ANSI mode.
This PR closely follows the implementation of `Average` to respect overflow in ANSI mode.

### Why are the changes needed?
Make the behavior consistent with `Average` when overflow occurs in ANSI mode.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users will see the overflow exception thrown in ANSI mode.

### How was this patch tested?
New tests.

Closes #35320 from beliefer/SPARK-37839_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37960][SQL][FOLLOWUP] Make the testing CASE WHEN query more reasonable

### What changes were proposed in this pull request?
Some testing CASE WHEN queries are not carefully written and do not make sense. In the future, the optimizer may get smarter and get rid of the CASE WHEN completely, and then we lose test coverage.

This PR updates some CASE WHEN queries to make them more reasonable.

### Why are the changes needed?
future-proof test coverage.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
N/A

Closes #36032 from beliefer/SPARK-37960_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38761][SQL] DS V2 supports push down misc non-aggregate functions

### What changes were proposed in this pull request?
Currently, Spark has some misc non-aggregate functions from the ANSI standard. Please refer to https://github.com/apache/spark/blob/2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L362.
These functions are shown below:
`abs`,
`coalesce`,
`nullif`,
`CASE WHEN`
DS V2 should support pushing down these misc non-aggregate functions.
Because DS V2 already supports pushing down `CASE WHEN`, this PR doesn't need to handle it again.
Because `nullif` extends `RuntimeReplaceable`, this PR doesn't need to handle it either.
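A hedged example (spark-shell) of a filter using `abs` and `coalesce` that can now be translated and pushed to a JDBC source; the catalog and table names are illustrative:
```
// When the push-down applies, the translated predicates show up in the scan node
// of the physical plan (e.g. as pushed filters).
spark.sql(
  """SELECT * FROM h2.test.employee
    |WHERE ABS(SALARY) > 10000 AND COALESCE(BONUS, 0) > 0""".stripMargin).explain()
```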

### Why are the changes needed?
DS V2 supports pushing down misc non-aggregate functions.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes #36039 from beliefer/SPARK-38761.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38865][SQL][DOCS] Update document of JDBC options for `pushDownAggregate` and `pushDownLimit`

### What changes were proposed in this pull request?
Because the DS v2 pushdown framework was refactored, we need to add more documentation in `sql-data-sources-jdbc.md` to reflect the new changes.

### Why are the changes needed?
Add documentation for the new `pushDownAggregate` and `pushDownLimit` behavior.

### Does this PR introduce _any_ user-facing change?
'No'. Updated for new feature.

### How was this patch tested?
N/A

Closes #36152 from beliefer/SPARK-38865.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: huaxingao <huaxin_gao@apple.com>

* [SPARK-38855][SQL] DS V2 supports push down math functions

### What changes were proposed in this pull request?
Currently, Spark has some math functions from the ANSI standard. Please refer to https://github.com/apache/spark/blob/2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L388
These functions are shown below:
`LN`,
`EXP`,
`POWER`,
`SQRT`,
`FLOOR`,
`CEIL`,
`WIDTH_BUCKET`

The mainstream databases' support for these functions is shown below.

|  Function   | PostgreSQL  | ClickHouse  | H2  | MySQL  | Oracle  | Redshift  | Presto  | Teradata  | Snowflake  | DB2  | Vertica  | Exasol  | SqlServer  | Yellowbrick  | Impala  | Mariadb | Druid | Pig | SQLite | Influxdata | Singlestore | ElasticSearch |
|  ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  |
| `LN` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `EXP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `POWER` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes |
| `SQRT` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `FLOOR` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `CEIL` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `WIDTH_BUCKET` | Yes | No | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes | No | No | No | Yes | No | No | No | No | No | No | No |

DS V2 should support pushing down these math functions.
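A hedged example (spark-shell) of a filter using some of these math functions; the catalog and table names are illustrative:
```
// SQRT/FLOOR/CEIL calls inside the predicate can be translated to DS V2 expressions
// and compiled by the JDBC dialect when the push-down applies.
spark.sql(
  """SELECT * FROM h2.test.employee
    |WHERE SQRT(SALARY) > 100 AND FLOOR(BONUS) = CEIL(BONUS)""".stripMargin).explain()
```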

### Why are the changes needed?
DS V2 supports pushing down math functions.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes #36140 from beliefer/SPARK-38855.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* update spark version to r61

Co-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: dch nguyen <dgd_contributor@viettel.com.vn>
Co-authored-by: Cheng Su <chengsu@fb.com>
Co-authored-by: Peter Toth <peter.toth@gmail.com>
Co-authored-by: dch nguyen <dchvn.dgd@gmail.com>