
[SPARK-40045][SQL] Optimize the order of filtering predicates #37479

Closed

Conversation

caican00
Contributor

@caican00 caican00 commented Aug 11, 2022

Why are the changes needed?

select id, data FROM testcat.ns1.ns2.table
where id = 2
and md5(data) = '8cde774d6f7333752ed72cacddb05126'
and trim(data) = 'a' 

Based on the SQL, we currently get the filters in the following order:

// `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)` comes before `(id#22L = 2)`
== Physical Plan ==
 *(1) Project [id#22L, data#23]
 +- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)) AND (id#22L = 2))
    +- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan

With this predicate order, every row has to be evaluated against the expensive md5 and trim predicates, even rows that the cheap `(id#22L = 2)` predicate would already reject, and this can make Spark tasks run slowly.

So I think the predicates that cannot be pushed down should automatically be placed to the far right, so that rows which fail the cheaper predicates are never evaluated against the expensive ones.

As shown below:

// `(id#22L = 2)` comes before `(md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126) AND (trim(data#23, None) = a)`
== Physical Plan == 
*(1) Project [id#22L, data#23]
 +- *(1) Filter ((((isnotnull(data#23) AND isnotnull(id#22L)) AND (id#22L = 2) AND (md5(cast(data#23 as binary)) = 8cde774d6f7333752ed72cacddb05126)) AND (trim(data#23, None) = a)))
    +- BatchScan[id#22L, data#23] class org.apache.spark.sql.connector.InMemoryTable$InMemoryBatchScan
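For reference, a minimal sketch (hypothetical table name `t`, assuming a running SparkSession and a table with columns id: bigint and data: string) that prints the physical plan so the conjunct order in the Filter node can be inspected:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("filter-order").getOrCreate()
// Hypothetical table `t`; the original report used testcat.ns1.ns2.table.
spark.sql(
  """SELECT id, data FROM t
    |WHERE id = 2
    |  AND md5(data) = '8cde774d6f7333752ed72cacddb05126'
    |  AND trim(data) = 'a'""".stripMargin)
  .explain()  // the Filter node in the physical plan shows the conjunct order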

How was this patch tested?

  1. Added a new test.
  2. Manual test: the stage execution time for reading data dropped from 6min+ to 24s.

[screenshots of the stage execution times before and after the change omitted]

@caican00
Contributor Author

Gently ping @rdblue @cloud-fan.
Could you help review this patch?

@AmplabJenkins

Can one of the admins verify this patch?

@@ -65,7 +65,7 @@ object PushDownUtils extends PredicateHelper {
val postScanFilters = r.pushFilters(translatedFilters.toArray).map { filter =>
DataSourceStrategy.rebuildExpressionFromFilter(filter, translatedFilterToExpr)
}
- (Left(r.pushedFilters()), (untranslatableExprs ++ postScanFilters).toSeq)
+ (Left(r.pushedFilters()), (postScanFilters ++ untranslatableExprs).toSeq)
Contributor

This order switching makes sense to me. I think the translated filters (postScanFilters) are simple filters that can be evaluated faster, while the untranslated filters are normally complicated filters that take more time to evaluate, so we want to evaluate the postScanFilters filters first.

Contributor Author

> This order switching makes sense to me. I think the translated filters (postScanFilters) are simple filters that can be evaluated faster, while the untranslated filters are normally complicated filters that take more time to evaluate, so we want to evaluate the postScanFilters filters first.

Thank you for your reply. That's exactly what I was thinking.

Contributor

I think this simple heuristic should be safe; it can't optimize all cases, but it won't make things worse.

Contributor

Can we add a comment here to state that untranslatableExprs needs to be on the right side and also briefly explain the reason?
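For example, a sketch of the reordered return value with such a comment, reusing the names from the diff above (not necessarily the final wording):

val postScanFilters = r.pushFilters(translatedFilters.toArray).map { filter =>
  DataSourceStrategy.rebuildExpressionFromFilter(filter, translatedFilterToExpr)
}
// Keep untranslatableExprs on the right: translated post-scan filters are typically
// cheap attribute comparisons, while untranslatable expressions (UDFs, md5, trim, ...)
// are usually more expensive, so the cheap ones should be evaluated first.
(Left(r.pushedFilters()), (postScanFilters ++ untranslatableExprs).toSeq)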

@zinking

zinking commented Aug 12, 2022

The point is clear and valid.

What about this case: day(ts) = 1 versus col1 > 13, where ts is a partition column while col1 is not, and some v2 data source implementation is capable of pushing that day function down? What will happen?

I think the fix should be more complicated, though.

@huaxingao
Contributor

@zinking

> What about this case: day(ts) = 1 versus col1 > 13, where ts is a partition column while col1 is not, and some v2 data source implementation is capable of pushing that day function down? What will happen?

Seems to me the current implementation can only push down filters which are in the form of `attribute cmp literal`. Could you copy and paste the plan that pushes down day(ts) = 1?
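For illustration only (not part of this PR): the V1 pushdown Filter API expresses attribute-versus-literal comparisons, so a predicate that wraps the column in a function call, such as day(ts) = 1, has no representation there and remains a post-scan filter.

import org.apache.spark.sql.sources.{EqualTo, GreaterThan}

val f1 = EqualTo("id", 2)         // id = 2    -> translatable, can be pushed to the source
val f2 = GreaterThan("col1", 13)  // col1 > 13 -> translatable
// day(ts) = 1 applies a function to the column, so it cannot be expressed with
// these Filter classes and is evaluated by Spark after the scan.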

@caican00
Contributor Author

> @caican00 Could you change the filters order in case r: SupportsPushDownV2Filters too?

@huaxingao OK, I will optimize this case.

@caican00 caican00 reopened this Aug 18, 2022
@caican00
Contributor Author

> @caican00 Could you change the filters order in case r: SupportsPushDownV2Filters too?

@huaxingao Updated

test("SPARK-40045: Move the post-Scan Filters to the far right") {
val t1 = s"${catalogAndNamespace}table"
withTable(t1) {
spark.udf.register("udfStrLen", (str: String) => str.length)
Contributor

nit: wrap the test with withUserDefinedFunction to unregister the function at the end.

Contributor Author

> nit: wrap the test with withUserDefinedFunction to unregister the function at the end.

@cloud-fan Thanks. Updated
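A rough sketch of the suggested wrapping, assuming the suite mixes in SQLTestUtils (the boolean marks udfStrLen as a temporary function so the helper drops it when the block exits):

withUserDefinedFunction("udfStrLen" -> true) {
  spark.udf.register("udfStrLen", (str: String) => str.length)
  // ... create the table, run the query, and check the Filter condition here ...
}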

val filtersAfter = find(filterAfter.queryExecution.executedPlan)(_.isInstanceOf[FilterExec])
  .head.asInstanceOf[FilterExec]
  .condition.toString
  .split("AND")
Contributor

nit: let's call splitConjunctivePredicates and check toString of the resulting predicates.

Contributor Author

> nit: let's call splitConjunctivePredicates and check toString of the resulting predicates.

@cloud-fan Thanks. Updated
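A sketch of the suggested assertion, assuming the test class mixes in PredicateHelper so splitConjunctivePredicates is available:

val filterExec = find(filterAfter.queryExecution.executedPlan)(_.isInstanceOf[FilterExec])
  .head.asInstanceOf[FilterExec]
val conjuncts = splitConjunctivePredicates(filterExec.condition).map(_.toString)
// e.g. the UDF-based predicate should now be the last conjunct
assert(conjuncts.last.contains("udfStrLen"))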

  .condition.toString
  .split("AND")
assert(filtersAfter.length == 5
  && filtersAfter(3).trim.startsWith("(udfStrLen(data")
Contributor

Wait, shouldn't this udf be on the far right?

Contributor Author

@caican00 caican00 Aug 22, 2022

> Wait, shouldn't this udf be on the far right?

@cloud-fan In the following SQL:

SELECT id, data FROM testcat.ns1.ns2.table
where udfStrLen(data) = 1
and trim(data) = 'a'
and id = 2

The udfStrLen and trim functions are untranslatable, so they end up to the right of id = 2. Before this optimization, id = 2 was on the far right.

== Physical Plan ==
*(1) Project [id#24L, data#25]
+- *(1) Filter (((isnotnull(id#24L) AND (id#24L = 2)) AND (udfStrLen(data#25) = 1)) AND (trim(data#25, None) = a))
   +- BatchScan[id#24L, data#25] class org.apache.spark.sql.connector.catalog.InMemoryTable$InMemoryBatchScan RuntimeFilters: []

s"""
|SELECT id, data FROM $t1
|where udfStrLen(data) = 1
|and trim(data) = 'a'
Contributor

Can we remove trim, to make the test easier to understand?

Contributor Author

> Can we remove trim, to make the test easier to understand?

OK, I have updated it. @cloud-fan
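Presumably the updated test query, with trim removed, looks roughly like this (an assumption, not copied from the final patch):

s"""
  |SELECT id, data FROM $t1
  |where udfStrLen(data) = 1
  |and id = 2
  |""".stripMargin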

@@ -253,7 +254,8 @@ class InMemoryTable(
}

class InMemoryScanBuilder(tableSchema: StructType) extends ScanBuilder
- with SupportsPushDownRequiredColumns {
+ with SupportsPushDownRequiredColumns with SupportsPushDownFilters
+ with SupportsPushDownV2Filters {
Contributor

I think this SupportsPushDownV2Filters should be implemented in InMemoryV2FilterScanBuilder

@@ -94,7 +94,7 @@ object PushDownUtils extends PredicateHelper {
val postScanFilters = r.pushPredicates(translatedFilters.toArray).map { predicate =>
DataSourceV2Strategy.rebuildExpressionFromFilter(predicate, translatedFilterToExpr)
}
- (Right(r.pushedPredicates), (untranslatableExprs ++ postScanFilters).toSeq)
+ (Right(r.pushedPredicates), (postScanFilters ++ untranslatableExprs).toSeq)
Contributor

Same comment as above.

@shardulm94
Contributor

@caican00 Do you think this PR is ready for another round of review? In our organization, we have seen a number of users impacted by this after migration to DSv2, so it would be nice to get this merged.

@mridulm
Contributor

mridulm commented Jan 26, 2023

QQ: Why is this PR targeting 3.3 and not master?

@shardulm94
Contributor

Hey @caican00! Haven't seen an update from your side in the last few months. Are you still interested in contributing this patch to Spark?

@huaxingao
Contributor

@caican00 Do you want to finish this? I think you can just remove implementing SupportsPushDownV2Filters here, then it's ready to be merged.

@huaxingao
Contributor

@caican00 If you don't have time for this any more, is it OK with you if I take this over and finish it up? We have quite a few customers using DS V2, and it would be nice if this fix could be merged. Thanks!

@huaxingao
Contributor

@caican00 I have opened a new PR #39892. I don't have your GitHub account email to add you as a co-author. You can add yourself as a co-author to get the commit credit.

@huaxingao
Contributor

I will close this PR for now @caican00

@huaxingao huaxingao closed this Feb 6, 2023
dongjoon-hyun pushed a commit that referenced this pull request Feb 8, 2023
All the credit of this PR goes to caican00. Here is the original [PR](#37479)

### What changes were proposed in this pull request?
put untranslated filters to the right side of the translated filters.

### Why are the changes needed?
Normally the translated filters (postScanFilters) are simple filters that can be evaluated faster, while the untranslated filters are complicated filters that take more time to evaluate, so we want to evaluate the postScanFilters filters first.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
new UT

Closes #39892 from huaxingao/filter_order.

Authored-by: huaxingao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Feb 8, 2023
(cherry picked from commit fe67269)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023