[SPARK-25482][SQL] Avoid pushdown of subqueries to data source filters #22518
Conversation
Test build #96428 has finished for PR 22518 at commit
Test build #96427 has finished for PR 22518 at commit
```
@@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
     assert(getNumSortsInQuery(query5) == 1)
   }
 }

+  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
+    withTempView("t1", "t2", "t3") {
```
There is no need for "t3".
```
@@ -166,7 +168,7 @@ case class ReuseSubquery(conf: SQLConf) extends Rule[SparkPlan] {
     val sameSchema = subqueries.getOrElseUpdate(sub.plan.schema, ArrayBuffer[SubqueryExec]())
     val sameResult = sameSchema.find(_.sameResult(sub.plan))
     if (sameResult.isDefined) {
-      sub.withNewPlan(sameResult.get)
+      sub.withNewPlan(sameResult.get).withNewExprId()
```
Can we avoid double copy()? Or is it cleaner this way?
I think it is cleaner this way. I don't expect this to happen very often (how many subqueries can you have in a plan?), so I don't think it is an issue. But if there are cleaner options/solutions, I am open to suggestions, thanks.
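The reuse bookkeeping under discussion (`getOrElseUpdate` keyed by schema, then `find(_.sameResult(...))` within the bucket) can be sketched very loosely outside Spark. This is a toy Python model with illustrative names, not Spark's actual code:

```python
# Toy model (not Spark code) of the reuse lookup in ReuseSubquery:
# subqueries are grouped by schema, and within a group the first plan with
# the same result is shared, so each distinct subquery is planned only once.
from collections import defaultdict

subqueries = defaultdict(list)   # schema -> already-seen subquery plans

def reuse(schema, plan):
    bucket = subqueries[schema]
    for existing in bucket:
        if existing == plan:     # stands in for SubqueryExec.sameResult
            return existing      # share the existing plan instead of `plan`
    bucket.append(plan)
    return plan

first  = reuse("struct<max(a):int>", ["max(a)", "t1"])
second = reuse("struct<max(a):int>", ["max(a)", "t1"])  # equal, but a new object
assert second is first           # the duplicate is replaced by the original
```

The schema bucket keeps the linear `sameResult` scan short, which is why the double `copy()` discussed above is unlikely to matter in practice.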
Test build #96472 has finished for PR 22518 at commit
hmm, how can this happen? I don't think a data source can handle a filter with a subquery...
you can check the UT which reproduces the issue. The scalar subquery is pushed down as part of the filter.
@mgaido91 Sorry, but can you add a more detailed explanation to the PR description?
@gengliangwang no, let me cite and explain the PR description. I am not sure how to improve it, but if you have suggestions I am happy to. The main point of the PR is to address an issue which arises when:
Now the point is, can this condition happen? The answer is yes, and one situation in which this happens (as reported in the JIRA) is
So in the plan we have two
So this results in the subquery being executed twice (as the two
@mgaido91 I see, thanks for the explanation!
Test build #98681 has finished for PR 22518 at commit
```
withTempView("t1", "t2") {
  sql("create temporary view t1(a int) using parquet")
  sql("create temporary view t2(b int) using parquet")
  val plan = sql("select * from t2 where b > (select max(a) from t1)")
```
sorry it has been a long time and I don't quite remember the context.
What was the problem we were trying to fix? This test looks unrelated to subquery reuse.
Sure, please can you check the PR description? I think the context is quite well explained there.
Anyway, as a quick summary: in this case `b > (select max(a) from t1)` is pushed down as a datasource filter. So we have 2 instances of `b > (select max(a) from t1)` and the result is not reused. It is not reused because the copied plan satisfies `==`, so even if `ReuseSubquery` replaces it, the change is ignored.
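This failure mode (the replacement being silently discarded because the copy compares equal to the original) can be mimicked without Spark. A toy Python model, with illustrative names rather than Spark's actual classes:

```python
# Toy model (not Spark code) of why the ReuseSubquery replacement was dropped:
# Catalyst's transform keeps the ORIGINAL node when the rewritten node compares
# equal to it, so swapping in an equal-but-shared subquery is silently ignored.
from dataclasses import dataclass, field

@dataclass
class Subquery:
    sql: str                                         # participates in ==
    plan_id: int = field(compare=False, default=0)   # plan identity, ignored by ==

def transform(node, rule):
    new = rule(node)
    # mimics TreeNode.transform: keep the old node when the result is ==
    return node if new == node else new

original = Subquery("select max(a) from t1", plan_id=1)
shared   = Subquery("select max(a) from t1", plan_id=2)  # the plan we want reused

result = transform(original, lambda n: shared)
assert result is original      # replacement ignored: plan_id is still 1,
assert result.plan_id == 1     # so the subquery would be executed twice
```

The only difference between the two nodes is the plan they reference, which does not participate in equality, so the rewrite is a no-op.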
Do we only have a problem when we have a subquery in a data source filter? If that's the case, I would suggest not pushing down subquery filters into the data source.
we have this problem if we `copy` the same subquery. I can't think of any other case than filter push-down, but I may be wrong.
Forbidding the push down of these filters may cause a perf regression; I am not sure it is the right solution.
is there any data source that can support subquery filters? For data source v1/v2, the public `Filter` API does not support subqueries. File sources don't support subquery filters either.
it also means the data source scan must wait until the subquery is finished
The subquery should be executed anyway sooner or later, right? So I don't see the problem here: am I missing something?
Ok, thanks, I'll follow your suggestion and forbid it here and create a new ticket about pushing it down to data sources. Thanks.
The subquery should be executed anyway sooner or later, right?
Yes, but we could execute the scan and the subquery at the same time (2 Spark jobs running together), instead of executing them serially.
we could execute scan and subquery at the same time
is this really possible? My understanding is that subqueries are executed before the plan they belong to (in `SparkPlan.executeQuery`). So when a subquery is running, the rest of the query is not.
ah sorry, I misread the code. Unless the subquery is rewritten into a join, we must wait for all subqueries to finish before executing the plan.
We can rewrite scalar subqueries in data source filters into literals, to make them work with the filter pushdown API.
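The literal-rewrite idea can be sketched roughly as follows. This is a hypothetical illustration with made-up class names, not Spark's actual expression API:

```python
# Hypothetical sketch of the literal-rewrite idea: once a scalar subquery has
# been executed, its occurrence inside a pushed-down filter is replaced by the
# computed value, so the data source only ever sees a plain comparison.
# All class names here are illustrative, not Spark's.
from dataclasses import dataclass

@dataclass(frozen=True)
class ScalarSubquery:
    result: int               # assume the subquery was already executed

@dataclass(frozen=True)
class Literal:
    value: int

@dataclass(frozen=True)
class GreaterThan:
    attr: str
    right: object

def rewrite_for_pushdown(pred):
    # Replace an executed scalar subquery with its literal value.
    if isinstance(pred, GreaterThan) and isinstance(pred.right, ScalarSubquery):
        return GreaterThan(pred.attr, Literal(pred.right.result))
    return pred

pushed = rewrite_for_pushdown(GreaterThan("b", ScalarSubquery(result=42)))
assert pushed == GreaterThan("b", Literal(42))
```

This only works because, as noted above, the subquery is guaranteed to have finished before the enclosing plan executes.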
yes, exactly. That's what I meant. Shall I revert the changes to the previous PR? And in the scope of the new JIRA do the rewriting to literals? Thanks.
Test build #98734 has finished for PR 22518 at commit
```
@@ -155,15 +155,14 @@ object FileSourceStrategy extends Strategy with Logging {
       case a: AttributeReference =>
         a.withName(l.output.find(_.semanticEquals(a)).get.name)
     }
-  }
+  }.filterNot(SubqueryExpression.hasSubquery)
```
shall we do the filter before the `map`?
```
@@ -47,7 +47,8 @@ private[sql] object PruneFileSourcePartitions extends Rule[LogicalPlan] {
       case a: AttributeReference =>
         a.withName(logicalRelation.output.find(_.semanticEquals(a)).get.name)
     }
-  }
+  }.filterNot(SubqueryExpression.hasSubquery)
```
ditto
```
@@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
     assert(getNumSortsInQuery(query5) == 1)
   }
 }

+  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
```
let's update the test
I'd like to merge this simple PR first, to address the performance problem (unnecessary subquery execution). Let's create a new ticket for subquery filter pushdown to data sources, and get more people to join the discussion.
BTW can you include a simple benchmark to show this problem? e.g. just run a query in spark-shell, and post the result before and after this PR.
@cloud-fan this is the benchmark:
the result is:
The difference is very small because all the subqueries run in parallel. The execution time would be much more affected if there were several subqueries (our thread pool has 16 threads, so a query like that but with 9 filters with subqueries would see a big performance gain after this PR).
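The pool-saturation effect described here can be sketched with a plain thread pool. This is illustrative only, not the author's actual benchmark, and the sleep duration is an arbitrary stand-in for subquery cost:

```python
# Illustrative only: with a fixed-size pool, duplicated subqueries overlap
# until they exceed the pool size, after which extra duplicates start costing
# wall-clock time in addition to wasted cluster work.
import time
from concurrent.futures import ThreadPoolExecutor

def fake_subquery(_):
    time.sleep(0.05)          # stand-in for executing one subquery

def wall_time(n_subqueries, pool_size=16):
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        list(pool.map(fake_subquery, range(n_subqueries)))
    return time.monotonic() - start

# 16 subqueries fit in one wave; 32 need two waves, roughly doubling wall time.
assert wall_time(32) > wall_time(16)
```

This matches the observation above: duplicated subqueries barely change wall-clock time while the pool has spare threads, but the waste grows once it saturates.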
```
@@ -22,7 +22,7 @@ import scala.collection.mutable.ArrayBuffer

 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.catalyst.{expressions, InternalRow}
-import org.apache.spark.sql.catalyst.expressions.{Expression, ExprId, InSet, Literal, PlanExpression}
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExprId, InSet, Literal, NamedExpression, PlanExpression}
```
unnecessary change
thanks, sorry, I missed it.
```
@@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
     assert(getNumSortsInQuery(query5) == 1)
   }
 }

+  test("SPARK-25482: Forbid pushdown to dattasources of filters containing subqueries") {
```
typo: `dattasources`
Test build #98775 has finished for PR 22518 at commit
Test build #98777 has finished for PR 22518 at commit
Test build #98778 has finished for PR 22518 at commit
Test build #98779 has finished for PR 22518 at commit
thanks, merging to master!
## What changes were proposed in this pull request?

An expression with a subquery can be pushed down as a data source filter. Even though the pushed filter is not actively used, it still causes a re-execution of the subquery, because the `ReuseSubquery` optimization rule is ineffective in this case. The PR avoids this problem by forbidding the push down of filters containing a subquery.

## How was this patch tested?

added UT

Closes apache#22518 from mgaido91/SPARK-25482.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…with scalar subquery

### What changes were proposed in this pull request?

Scalar subqueries can be pushed down as data filters at runtime, since we always execute subqueries first. Ideally, we can rewrite `ScalarSubquery` to `Literal` before pushing down the filter. The main reason we did not support that before is that `ReuseSubquery` was ineffective, see #22518. It is not an issue now. For example:

```sql
SELECT * FROM t1 WHERE c1 > (SELECT min(c2) FROM t2)
```

### Why are the changes needed?

Improve performance when data filters have a scalar subquery.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

add test

Closes #41088 from ulysses-you/SPARK-43402.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Xiduo You <ulyssesyou@apache.org>