[SPARK-36574][SQL] pushDownPredicate=false should prevent push down filters to JDBC data source #33822
Conversation
Kubernetes integration test starting
Kubernetes integration test status success
Test build #142725 has finished for PR 33822 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #142748 has finished for PR 33822 at commit
ping @cloud-fan @gengliangwang
Is it a long-standing bug, or a regression that was introduced recently? And does JDBC v2 have the same problem?
I doubt it's a long-standing bug. I checked, and JDBC v2 doesn't have this problem.
@beliefer can we add a new test case for the fix?
  filters
} else {
  Array.empty[Filter]
}
If `pushDownPredicate` is false, the unhandledFilters are set to all the filters here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala#L276. Seems to me that the unhandledFilters shouldn't be pushed down to JDBC at all.
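For reference, the behavior at the linked line can be paraphrased as follows — a simplified sketch of `JDBCRelation.unhandledFilters`, not verbatim Spark source (`jdbcOptions`, `JDBCRDD.compileFilter`, and `JdbcDialects.get` are the names used in the Spark codebase):

```scala
// Sketch: when pushDownPredicate is disabled, every filter is reported back
// to Spark as unhandled, so Spark must re-evaluate all of them itself.
override def unhandledFilters(filters: Array[Filter]): Array[Filter] = {
  if (jdbcOptions.pushDownPredicate) {
    // Only the filters the dialect cannot compile to SQL are left for Spark.
    filters.filter(JDBCRDD.compileFilter(_, JdbcDialects.get(jdbcOptions.url)).isEmpty)
  } else {
    // Predicate pushdown disabled: Spark handles everything.
    filters
  }
}
```

The bug under discussion is that even though all filters are declared unhandled here, they were still being passed to the JDBC scan builder.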
Yea, it's a bug since day 1: #21875
I think we should update the test added at that time and check the real pushed filters in the JDBC source.
I took a look at the v2 path. Seems to me that the filter pushdown logic in v2 is different from v1.
In fact, I committed the update 0fecee8 previously, along the lines of your suggestion.
@beliefer I see.
Kubernetes integration test starting
@@ -409,7 +409,7 @@ object DataSourceStrategy
        pushedFilters.toSet,
        handledFilters,
        None,
-       scanBuilder(requestedColumns, candidatePredicates, pushedFilters),
+       scanBuilder(requestedColumns, candidatePredicates, handledFilters.toSeq),
I think this change is correct but is pretty risky. DS v1 has been there for many years and it's possible that some v1 sources forget to override `unhandledFilters`.
Before this PR, it's not a big deal if v1 sources forget to override `unhandledFilters`: filter pushdown still works, only the EXPLAIN result will be inaccurate.
I don't think we can make this change now.
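For context, the v1 contract being relied on here — paraphrased from `org.apache.spark.sql.sources.BaseRelation`, abbreviated rather than verbatim:

```scala
// By default a v1 relation declares every filter "unhandled", so Spark
// re-evaluates all of them on top of the scan. The actual pushdown happens
// inside buildScan, so forgetting to override unhandledFilters was harmless:
// Spark just filtered again over the already-filtered rows.
abstract class BaseRelation {
  def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
}

// The reverted change passed only handledFilters
// (filters minus unhandledFilters(filters)) to scanBuilder. For any source
// still using the default above, handledFilters is empty -- which would
// silently disable filter pushdown for that source.
```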
OK. Let me revert it.
Kubernetes integration test status failure
assert(relation.isInstanceOf[org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation])
val jdbcRelation =
  relation.asInstanceOf[org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation]
if (jdbcRelation.jdbcOptions.pushDownPredicate == false) {
The method name is `checkNotPushdown`; do we need this `if`?
Kubernetes integration test starting
Kubernetes integration test status success
Test build #142790 has finished for PR 33822 at commit
Test build #142809 has finished for PR 33822 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test unable to build dist. exiting with code: 1
retest this please
Kubernetes integration test unable to build dist. exiting with code: 1
  rawPlan.execute().count()
}

assert(getRowCount(df1) == getRowCount(df2))
nit: assert(getRowCount(df1) == df2.count)
let's also test that if we enable `pushDownPredicate`, the row count is smaller.
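A sketch of what that extra case could look like, reusing the `getRowCount` helper from the snippet above (the table name, column, and filter value are placeholders for illustration, not from the actual test):

```scala
// Hypothetical test sketch: with pushDownPredicate=false the raw JDBC scan
// returns every row and Spark filters afterwards; with it enabled (the
// default), the filter reaches the database and the scan returns fewer rows.
val dfNoPushdown = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "test.people")          // placeholder table
  .option("pushDownPredicate", "false")
  .load()
  .filter($"name" === "fred")                // placeholder predicate

val dfPushdown = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "test.people")
  .load()                                    // pushDownPredicate defaults to true
  .filter($"name" === "fred")

// getRowCount counts rows produced by the physical scan (see snippet above),
// so the unfiltered scan should be strictly larger.
assert(getRowCount(dfNoPushdown) > getRowCount(dfPushdown))
```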
Test build #142796 has finished for PR 33822 at commit
Test build #142806 has finished for PR 33822 at commit
Test build #142812 has finished for PR 33822 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #142830 has finished for PR 33822 at commit
ping @cloud-fan cc @huaxingao @gengliangwang
thanks, merging to master/3.2!
[SPARK-36574][SQL] pushDownPredicate=false should prevent push down filters to JDBC data source

### What changes were proposed in this pull request?
Spark SQL includes a data source that can read data from other databases using JDBC. Spark also supports the case-insensitive option `pushDownPredicate`. According to http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html, if `pushDownPredicate` is set to false, no filter will be pushed down to the JDBC data source and thus all filters will be handled by Spark. But I found that filters are still pushed down to the JDBC data source.

### Why are the changes needed?
Fix the bug that `pushDownPredicate=false` failed to prevent pushing down filters to the JDBC data source.

### Does this PR introduce _any_ user-facing change?
No. The output of queries will not change.

### How was this patch tested?
Jenkins test.

Closes #33822 from beliefer/SPARK-36574.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit fcc91cf)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan Thanks a lot! @gengliangwang @huaxingao Thank you for your review.
What changes were proposed in this pull request?
Spark SQL includes a data source that can read data from other databases using JDBC. Spark also supports the case-insensitive option `pushDownPredicate`. According to http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html, if `pushDownPredicate` is set to false, no filter will be pushed down to the JDBC data source and thus all filters will be handled by Spark. But I found that filters are still pushed down to the JDBC data source.
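Concretely, the option is set on the JDBC reader; a minimal sketch (the connection URL, table, and credentials are placeholders):

```scala
// With pushDownPredicate=false, the predicate below should NOT be compiled
// into the SQL sent to the database: Spark fetches the table (or partition)
// and applies the filter itself. The bug was that the filter was still
// pushed down to the JDBC source despite this setting.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // placeholder URL
  .option("dbtable", "public.orders")                   // placeholder table
  .option("user", "spark")
  .option("password", "secret")
  .option("pushDownPredicate", "false")
  .load()
  .where($"status" === "OPEN")  // meant to be evaluated by Spark, not the database
```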
Why are the changes needed?
Fix the bug that `pushDownPredicate=false` failed to prevent pushing down filters to the JDBC data source.

Does this PR introduce any user-facing change?
No. The output of queries will not change.

How was this patch tested?
Jenkins test.