
[SPARK-36574][SQL] pushDownPredicate=false should prevent push down filters to JDBC data source #33822

Closed
wants to merge 7 commits into from

Conversation

beliefer
Contributor

What changes were proposed in this pull request?

Spark SQL includes a data source that can read data from other databases using JDBC.
Spark also supports the case-insensitive option pushDownPredicate.
According to http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html, if pushDownPredicate is set to false, no filter is pushed down to the JDBC data source, and all filters are handled by Spark.
But I found that filters are still pushed down to the JDBC data source.
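A minimal reproduction sketch of the reported behavior (the URL, table name, and column are placeholders; this assumes a running Spark session and a reachable database, so it is illustrative rather than runnable as-is):

```scala
// Hypothetical reproduction; url and dbtable are placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "people")
  .option("pushDownPredicate", "false") // all filters should stay in Spark
  .load()
  .filter("name = 'fred'")
// Before this fix, the physical plan still listed the filter under
// PushedFilters even though pushDownPredicate was false.
df.explain(true)
```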

Why are the changes needed?

Fix the bug that pushDownPredicate=false fails to prevent filters from being pushed down to the JDBC data source.

Does this PR introduce any user-facing change?

No.
The output of queries will not change.

How was this patch tested?

Jenkins test.

@github-actions github-actions bot added the SQL label Aug 24, 2021
@SparkQA

SparkQA commented Aug 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47225/

@SparkQA

SparkQA commented Aug 24, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47225/

@SparkQA

SparkQA commented Aug 24, 2021

Test build #142725 has finished for PR 33822 at commit 0fecee8.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47248/

@SparkQA

SparkQA commented Aug 25, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47248/

@SparkQA

SparkQA commented Aug 25, 2021

Test build #142748 has finished for PR 33822 at commit d34cdfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

ping @cloud-fan @gengliangwang

@cloud-fan
Contributor

is it a long-standing bug? or a regression which was introduced recently? and does JDBC v2 have the same problem?

@beliefer
Contributor Author

is it a long-standing bug? or a regression which was introduced recently? and does JDBC v2 have the same problem?

I suspect it's a long-standing bug. I checked, and JDBC v2 doesn't have this problem.

@gengliangwang
Member

@beliefer can we add a new test case for the fix?

filters
} else {
Array.empty[Filter]
}
Contributor

If pushDownPredicate is false, the unhandledFilters are set to all the filters here https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala#L276. Seems to me that the unhandledFilters shouldn't be pushed down to JDBC at all.

Contributor

Yea, it's a bug since day 1: #21875

I think we should update the test added at that time and check the real pushed filters in the JDBC source.

@huaxingao
Contributor

I took a look at v2 path. Seems to me that filter push down logic in v2 is different from v1.

V2
val (pushed, unSupported) = filters.partition(JDBCRDD.compileFilter(_, dialect).isDefined)
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala#L51
We only push down the supported filters; the unSupported ones are returned as the post-scan filters.

V1:
val (unhandledPredicates, pushedFilters, handledFilters) = selectFilters(relation.relation, candidatePredicates)
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L388
We push down the pushedFilters, not the handledFilters. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L436
Seems to me that we should push down handledFilters, not the translated filters (pushedFilters).
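The v1/v2 difference above can be illustrated with a self-contained toy model (the Filter hierarchy and compileFilter below are simplified stand-ins for Spark's org.apache.spark.sql.sources.Filter and JDBCRDD.compileFilter, not the real APIs):

```scala
// Toy stand-ins: compileFilter returns Some(sql) when the "dialect"
// can translate a filter, None otherwise.
sealed trait Filter
case class EqualTo(attr: String, value: String) extends Filter
case class Untranslatable(desc: String) extends Filter

def compileFilter(f: Filter): Option[String] = f match {
  case EqualTo(a, v) => Some(s"$a = '$v'")
  case _             => None
}

val filters: Seq[Filter] = Seq(EqualTo("name", "fred"), Untranslatable("udf(x)"))

// v2-style split: push only what compiles; the rest stays with Spark.
val (pushed, unSupported) = filters.partition(compileFilter(_).isDefined)
```

In the real v2 path, the unSupported half is handed back to Spark as post-scan filters.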

@beliefer
Contributor Author

beliefer commented Aug 26, 2021

Seems to me that we should push down handledFilters, not the translated filters (pushedFilters)
@huaxingao I tried that approach, but some test cases fail.
Please see


According to the test case, it seems that the pushed filters should be passed as the parameters of def buildScan.
cc @cloud-fan @gengliangwang

In fact, I committed the update 0fecee8 earlier, along the lines of your suggestion.

@huaxingao
Contributor

@beliefer I see.
But at this line assert(relation.unhandledFilters(FiltersPushed.list.toArray).toSet === expectedUnhandledFilters) https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/sources/FilteredScanSuite.scala#L331, why are we getting unhandledFilters from FiltersPushed.list? Should we get unhandledFilters from the original filters?
I still think we shouldn't push down unhandledFilters to the data source.

@beliefer
Contributor Author

why are we getting unhandledFilters from FiltersPushed.list?
@huaxingao I don't know the reason. If we shouldn't push down unhandledFilters to the data source, we should adjust these test cases too.

@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47291/

@@ -409,7 +409,7 @@ object DataSourceStrategy
       pushedFilters.toSet,
       handledFilters,
       None,
-      scanBuilder(requestedColumns, candidatePredicates, pushedFilters),
+      scanBuilder(requestedColumns, candidatePredicates, handledFilters.toSeq),
Contributor

I think this change is correct but is pretty risky. DS v1 has been there for many years and it's possible that some v1 sources forget to override unhandledFilters.

Before this PR, it's not a big deal if v1 sources forget to override unhandledFilters. Filter pushdown still works, only the EXPLAIN result will be inaccurate.

I don't think we can make this change now.
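A miniature sketch of the concern (defaultUnhandledFilters mirrors the documented BaseRelation default, which reports every filter as unhandled; all names here are illustrative, not the real Spark code):

```scala
// Stand-in for org.apache.spark.sql.sources.Filter.
type Filter = String

// Spark's default: a source that never overrides unhandledFilters
// reports every filter as unhandled.
def defaultUnhandledFilters(filters: Array[Filter]): Array[Filter] = filters

val pushedFilters = Array[Filter]("name = 'fred'", "age > 21")

// handledFilters = pushed minus unhandled: empty for such a source.
val handledFilters =
  pushedFilters.toSet -- defaultUnhandledFilters(pushedFilters)
```

So a v1 source that never overrides unhandledFilters would end up with an empty handledFilters set, and passing that to buildScan would silently disable its filter pushdown.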

Contributor Author

OK. Let me revert it.

@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47291/

assert(relation.isInstanceOf[org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation])
val jdbcRelation =
relation.asInstanceOf[org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation]
if (jdbcRelation.jdbcOptions.pushDownPredicate == false) {
Contributor

The method name is checkNotPushdown; do we need this if at all?

@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47296/

@beliefer beliefer changed the title [SPARK-36574][SQL] pushDownPredicate=false failed to prevent push down filters to JDBC data source [SPARK-36574][SQL] pushDownPredicate=false should prevent push down filters to JDBC data source Aug 26, 2021
@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47296/

@SparkQA

SparkQA commented Aug 26, 2021

Test build #142790 has finished for PR 33822 at commit 91c2284.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2021

Test build #142809 has finished for PR 33822 at commit 9c01364.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47307/

@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47307/

@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47310/

@beliefer
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 26, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47314/

rawPlan.execute().count()
}

assert(getRowCount(df1) == getRowCount(df2))
Contributor

nit: assert(getRowCount(df1) == df2.count)

Contributor

let's also test that with pushDownPredicate enabled, the row count is smaller.
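The suggested check could look roughly like this (getRowCount, urlWithUserAndPass, and the TEST.PEOPLE H2 table are assumed fixtures of the surrounding suite; this is a sketch, not runnable on its own):

```scala
// Sketch only: relies on the suite's H2 fixture and helpers.
val pushedDf = spark.read
  .option("pushDownPredicate", "true")
  .jdbc(urlWithUserAndPass, "TEST.PEOPLE", new Properties())
  .filter("theid = 1")
// With pushdown enabled, the raw JDBC scan returns only matching rows,
// so its raw row count should be strictly smaller than df1's, which
// was read with pushDownPredicate=false and filtered in Spark.
assert(getRowCount(pushedDf) < getRowCount(df1))
```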

@SparkQA

SparkQA commented Aug 26, 2021

Test build #142796 has finished for PR 33822 at commit ecb3a47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2021

Test build #142806 has finished for PR 33822 at commit 24d3c94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 26, 2021

Test build #142812 has finished for PR 33822 at commit 9c01364.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47332/

@SparkQA

SparkQA commented Aug 27, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47332/

@SparkQA

SparkQA commented Aug 27, 2021

Test build #142830 has finished for PR 33822 at commit 4b4f78a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

beliefer commented Aug 30, 2021

ping @cloud-fan cc @huaxingao @gengliangwang

@cloud-fan
Contributor

thanks, merging to master/3.2!

@cloud-fan cloud-fan closed this in fcc91cf Aug 30, 2021
cloud-fan pushed a commit that referenced this pull request Aug 30, 2021
…ilters to JDBC data source

### What changes were proposed in this pull request?
Spark SQL includes a data source that can read data from other databases using JDBC.
Spark also supports the case-insensitive option `pushDownPredicate`.
According to http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html, if `pushDownPredicate` is set to false, no filter is pushed down to the JDBC data source, and all filters are handled by Spark.
But I found that filters are still pushed down to the JDBC data source.

### Why are the changes needed?
Fix the bug that `pushDownPredicate=false` fails to prevent filters from being pushed down to the JDBC data source.

### Does this PR introduce _any_ user-facing change?
No.
The output of queries will not change.

### How was this patch tested?
Jenkins test.

Closes #33822 from beliefer/SPARK-36574.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit fcc91cf)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@beliefer
Contributor Author

@cloud-fan Thanks a lot! @gengliangwang @huaxingao Thank you for your review.
