
[SPARK-36574][SQL] pushDownPredicate=false should prevent push down filters to JDBC data source #33822

Closed
wants to merge 7 commits into from

Conversation

beliefer
Contributor

What changes were proposed in this pull request?

Spark SQL includes a data source that can read data from other databases using JDBC.
Spark also supports the case-insensitive option pushDownPredicate.
According to http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html, if pushDownPredicate is set to false, no filter is pushed down to the JDBC data source, and all filters are handled by Spark.
But I found that filters are still pushed down to the JDBC data source.
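A minimal reproduction sketch of the reported behavior (the URL, table name, and column are placeholders; this assumes a running Spark session and a reachable database, so it is illustrative rather than runnable as-is):

```scala
// Hypothetical reproduction; url and dbtable are placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "people")
  .option("pushDownPredicate", "false") // all filters should stay in Spark
  .load()
  .filter("name = 'fred'")
// Before this fix, the physical plan still listed the filter under
// PushedFilters even though pushDownPredicate was false.
df.explain(true)
```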

Why are the changes needed?

Fix the bug that pushDownPredicate=false fails to prevent filters from being pushed down to the JDBC data source.

Does this PR introduce any user-facing change?

No.
The output of queries will not change.

How was this patch tested?

Jenkins test.

@github-actions github-actions bot added the SQL label Aug 24, 2021
@SparkQA

SparkQA commented Aug 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47225/

@SparkQA

SparkQA commented Aug 24, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47225/

@SparkQA

SparkQA commented Aug 24, 2021

Test build #142725 has finished for PR 33822 at commit 0fecee8.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47248/

@SparkQA

SparkQA commented Aug 25, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47248/

@SparkQA

SparkQA commented Aug 25, 2021

Test build #142748 has finished for PR 33822 at commit d34cdfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

ping @cloud-fan @gengliangwang

@cloud-fan
Contributor

is it a long-standing bug? or a regression which was introduced recently? and does JDBC v2 have the same problem?

@beliefer
Contributor Author

is it a long-standing bug? or a regression which was introduced recently? and does JDBC v2 have the same problem?

I suspect it's a long-standing bug. I checked, and JDBC v2 doesn't have this problem.

@gengliangwang
Member

@beliefer can we add a new test case for the fix?

filters
} else {
Array.empty[Filter]
}
Contributor

If pushDownPredicate is false, the unhandledFilters are set to all the filters here https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala#L276. Seems to me that the unhandledFilters shouldn't be pushed down to JDBC at all.

Contributor

Yea, it's a bug since day 1: #21875

I think we should update the test added at that time and check the real pushed filters in the JDBC source.

@huaxingao
Contributor

I took a look at v2 path. Seems to me that filter push down logic in v2 is different from v1.

V2
val (pushed, unSupported) = filters.partition(JDBCRDD.compileFilter(_, dialect).isDefined)
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala#L51
We only push down the supported filters; the unSupported ones are returned as the post-scan filters.

V1:
val (unhandledPredicates, pushedFilters, handledFilters) = selectFilters(relation.relation, candidatePredicates)
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L388
We push down the pushedFilters, not the handledFilters. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L436
Seems to me that we should push down handledFilters, not the translated filters (pushedFilters).
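The v1/v2 difference above can be illustrated with a self-contained toy model (the Filter hierarchy and compileFilter below are simplified stand-ins for Spark's org.apache.spark.sql.sources.Filter and JDBCRDD.compileFilter, not the real APIs):

```scala
// Toy stand-ins: compileFilter returns Some(sql) when the "dialect"
// can translate a filter, None otherwise.
sealed trait Filter
case class EqualTo(attr: String, value: String) extends Filter
case class Untranslatable(desc: String) extends Filter

def compileFilter(f: Filter): Option[String] = f match {
  case EqualTo(a, v) => Some(s"$a = '$v'")
  case _             => None
}

val filters: Seq[Filter] = Seq(EqualTo("name", "fred"), Untranslatable("udf(x)"))

// v2-style split: push only what compiles; the rest stays with Spark.
val (pushed, unSupported) = filters.partition(compileFilter(_).isDefined)
```

In the real v2 path, the unSupported half is handed back to Spark as post-scan filters.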

@beliefer
Contributor Author

beliefer commented Aug 26, 2021

Seems to me that we should push down handledFilters, not the translated filters (pushedFilters)
@huaxingao I tried that approach, but some test cases fail.
Please see


According to the test case, it seems that the pushed filters should be passed as the parameters of def buildScan.
cc @cloud-fan @gengliangwang

In fact, I committed the update 0fecee8 earlier, along the lines of your suggestion.

@huaxingao
Contributor

@beliefer I see.
But at this line assert(relation.unhandledFilters(FiltersPushed.list.toArray).toSet === expectedUnhandledFilters) https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/sources/FilteredScanSuite.scala#L331, why are we getting unhandledFilters from FiltersPushed.list? Should we get unhandledFilters from the original filters?
I still think we shouldn't push down unhandledFilters to the data source.

@beliefer
Contributor Author

why are we getting unhandledFilters from FiltersPushed.list?
@huaxingao I don't know the reason. If we shouldn't push down unhandledFilters to the data source, we should adjust these test cases too.

@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47291/

@@ -409,7 +409,7 @@ object DataSourceStrategy
       pushedFilters.toSet,
       handledFilters,
       None,
-      scanBuilder(requestedColumns, candidatePredicates, pushedFilters),
+      scanBuilder(requestedColumns, candidatePredicates, handledFilters.toSeq),
Contributor

I think this change is correct but is pretty risky. DS v1 has been there for many years and it's possible that some v1 sources forget to override unhandledFilters.

Before this PR, it's not a big deal if v1 sources forget to override unhandledFilters. Filter pushdown still works, only the EXPLAIN result will be inaccurate.

I don't think we can make this change now.
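A miniature sketch of the concern (defaultUnhandledFilters mirrors the documented BaseRelation default, which reports every filter as unhandled; all names here are illustrative, not the real Spark code):

```scala
// Stand-in for org.apache.spark.sql.sources.Filter.
type Filter = String

// Spark's default: a source that never overrides unhandledFilters
// reports every filter as unhandled.
def defaultUnhandledFilters(filters: Array[Filter]): Array[Filter] = filters

val pushedFilters = Array[Filter]("name = 'fred'", "age > 21")

// handledFilters = pushed minus unhandled: empty for such a source.
val handledFilters =
  pushedFilters.toSet -- defaultUnhandledFilters(pushedFilters)
```

So a v1 source that never overrides unhandledFilters would end up with an empty handledFilters set, and passing that to buildScan would silently disable its filter pushdown.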

Contributor Author

OK. Let me revert it.

@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47291/

assert(relation.isInstanceOf[org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation])
val jdbcRelation =
relation.asInstanceOf[org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation]
if (jdbcRelation.jdbcOptions.pushDownPredicate == false) {
Contributor

The method name is checkNotPushdown; do we need this if at all?

@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47296/

@beliefer beliefer changed the title [SPARK-36574][SQL] pushDownPredicate=false failed to prevent push down filters to JDBC data source [SPARK-36574][SQL] pushDownPredicate=false should prevent push down filters to JDBC data source Aug 26, 2021
@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47296/

@SparkQA

SparkQA commented Aug 26, 2021

Test build #142790 has finished for PR 33822 at commit 91c2284.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2021

Test build #142809 has finished for PR 33822 at commit 9c01364.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47307/

@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47307/

@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47310/

@beliefer
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47314/

rawPlan.execute().count()
}

assert(getRowCount(df1) == getRowCount(df2))
Contributor

nit: assert(getRowCount(df1) == df2.count)

Contributor

let's also test that with pushDownPredicate enabled, the row count is smaller.
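The suggested check could look roughly like this (getRowCount, urlWithUserAndPass, and the TEST.PEOPLE H2 table are assumed fixtures of the surrounding suite; this is a sketch, not runnable on its own):

```scala
// Sketch only: relies on the suite's H2 fixture and helpers.
val pushedDf = spark.read
  .option("pushDownPredicate", "true")
  .jdbc(urlWithUserAndPass, "TEST.PEOPLE", new Properties())
  .filter("theid = 1")
// With pushdown enabled, the raw JDBC scan returns only matching rows,
// so its raw row count should be strictly smaller than df1's, which
// was read with pushDownPredicate=false and filtered in Spark.
assert(getRowCount(pushedDf) < getRowCount(df1))
```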

@SparkQA

SparkQA commented Aug 26, 2021

Test build #142796 has finished for PR 33822 at commit ecb3a47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2021

Test build #142806 has finished for PR 33822 at commit 24d3c94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2021

Test build #142812 has finished for PR 33822 at commit 9c01364.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47332/

@SparkQA

SparkQA commented Aug 27, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47332/

@SparkQA

SparkQA commented Aug 27, 2021

Test build #142830 has finished for PR 33822 at commit 4b4f78a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

beliefer commented Aug 30, 2021

ping @cloud-fan cc @huaxingao @gengliangwang

@cloud-fan
Contributor

thanks, merging to master/3.2!

@cloud-fan cloud-fan closed this in fcc91cf Aug 30, 2021
cloud-fan pushed a commit that referenced this pull request Aug 30, 2021
…ilters to JDBC data source

### What changes were proposed in this pull request?
Spark SQL includes a data source that can read data from other databases using JDBC.
Spark also supports the case-insensitive option `pushDownPredicate`.
According to http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html, if `pushDownPredicate` is set to false, no filter is pushed down to the JDBC data source, and all filters are handled by Spark.
But I found that filters are still pushed down to the JDBC data source.

### Why are the changes needed?
Fix the bug that `pushDownPredicate=false` fails to prevent filters from being pushed down to the JDBC data source.

### Does this PR introduce _any_ user-facing change?
No.
The output of queries will not change.

### How was this patch tested?
Jenkins test.

Closes #33822 from beliefer/SPARK-36574.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit fcc91cf)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@beliefer
Contributor Author

@cloud-fan Thanks a lot! @gengliangwang @huaxingao Thank you for your review.
