[SPARK-15832][SQL] Embedded IN/EXISTS predicate subquery throws TreeNodeException #13570

ioana-delaney · 2016-06-08T23:00:06Z

What changes were proposed in this pull request?

Queries with embedded existential sub-query predicates throw an exception when building the physical plan.

Example failing query:

scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
scala> sql("select c1 from t1 where (case when c2 in (select c2 from t2) then 2 else 3 end) IN (select c2 from t1)").show()

Binding attribute, tree: c2#239
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: c2#239
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)

  ...
  at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
  at org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
  at org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at org.apache.spark.sql.execution.joins.HashJoin$class.org$apache$spark$sql$execution$joins$HashJoin$$x$8(HashJoin.scala:66)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8$lzycompute(BroadcastHashJoinExec.scala:38)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8(BroadcastHashJoinExec.scala:38)
  at org.apache.spark.sql.execution.joins.HashJoin$class.buildKeys(HashJoin.scala:63)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys$lzycompute(BroadcastHashJoinExec.scala:38)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys(BroadcastHashJoinExec.scala:38)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.requiredChildDistribution(BroadcastHashJoinExec.scala:52)

Problem description:
When the left-hand side expression of an existential sub-query predicate contains another embedded sub-query predicate, the RewritePredicateSubquery optimizer rule does not resolve the embedded sub-query expressions into existential joins. For example, the above query has the following optimized plan, which fails during the physical plan build.

== Optimized Logical Plan ==
Project [_1#224 AS c1#227]
+- Join LeftSemi, (CASE WHEN predicate-subquery#255 [(_2#225 = c2#239)] THEN 2 ELSE 3 END = c2#228#262)
   :  +- SubqueryAlias predicate-subquery#255 [(_2#225 = c2#239)]
   :     +- LocalRelation [c2#239]
   :- LocalRelation [_1#224, _2#225]
   +- LocalRelation [c2#228#262]

== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: c2#239

Solution:
In RewritePredicateSubquery, before rewriting the outermost predicate sub-query, resolve any embedded existential sub-queries. After the changes, the optimized plan for the above query looks like this:

== Optimized Logical Plan ==
Project [_1#224 AS c1#227]
+- Join LeftSemi, (CASE WHEN exists#285 THEN 2 ELSE 3 END = c2#228#284)
   :- Join ExistenceJoin(exists#285), (_2#225 = c2#239)
   :  :- LocalRelation [_1#224, _2#225]
   :  +- LocalRelation [c2#239]
   +- LocalRelation [c2#228#284]

== Physical Plan ==
*Project [_1#224 AS c1#227]
+- *BroadcastHashJoin [CASE WHEN exists#285 THEN 2 ELSE 3 END], [c2#228#284], LeftSemi, BuildRight
   :- *BroadcastHashJoin [_2#225], [c2#239], ExistenceJoin(exists#285), BuildRight
   :  :- LocalTableScan [_1#224, _2#225]
   :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   :     +- LocalTableScan [c2#239]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- LocalTableScan [c2#228#284]
      +- LocalTableScan [c222#36], [[111],[222]]
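
The rewrite described in the solution can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions: the case classes below (`Expr`, `Plan`, `PredicateSubquery`, `ExistenceJoin`, and so on) are simplified stand-ins that borrow names from the plans above and are not Catalyst's real classes, the `exists#0` naming is made up, and this `rewriteExistentialExpr` is a hypothetical simplification that takes a single expression. It shows the idea of replacing an embedded predicate subquery with an `exists` attribute produced by an existence join before the outer left semi join is built.

```scala
object RewriteSketch {
  // Toy expression tree (hypothetical stand-ins, not Catalyst classes).
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Lit(value: Int) extends Expr
  case class EqualTo(left: Expr, right: Expr) extends Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class CaseWhen(cond: Expr, thenValue: Expr, elseValue: Expr) extends Expr
  // An IN/EXISTS subquery used as a predicate inside an expression.
  case class PredicateSubquery(plan: Plan, conditions: Seq[Expr]) extends Expr

  // Toy logical plans.
  sealed trait JoinType
  case object LeftSemi extends JoinType
  case class ExistenceJoin(exists: Attr) extends JoinType

  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Join(left: Plan, right: Plan, joinType: JoinType, condition: Option[Expr])
    extends Plan

  // Replace every embedded predicate subquery in `expr` with a fresh `exists`
  // attribute, stacking one existence join per subquery on top of `plan`.
  // Subqueries nested inside another subquery's conditions are not handled here.
  def rewriteExistentialExpr(expr: Expr, plan: Plan): (Expr, Plan) = {
    var newPlan = plan
    var nextId = 0
    def go(e: Expr): Expr = e match {
      case PredicateSubquery(sub, conditions) =>
        val exists = Attr(s"exists#$nextId")
        nextId += 1
        newPlan = Join(newPlan, sub, ExistenceJoin(exists), conditions.reduceOption(And(_, _)))
        exists
      case EqualTo(l, r) => EqualTo(go(l), go(r))
      case And(l, r) => And(go(l), go(r))
      case CaseWhen(c, t, f) => CaseWhen(go(c), go(t), go(f))
      case leaf => leaf
    }
    val newExpr = go(expr)
    (newExpr, newPlan)
  }

  // Rewrite the outermost predicate subquery as a left semi join, after first
  // resolving any subqueries embedded in its join condition.
  def rewriteOuter(plan: Plan, outer: PredicateSubquery): Plan = {
    val (cond, withExistence): (Option[Expr], Plan) =
      outer.conditions.reduceOption(And(_, _)) match {
        case Some(c) =>
          val (e, p) = rewriteExistentialExpr(c, plan)
          (Some(e), p)
        case None => (None, plan)
      }
    Join(withExistence, outer.plan, LeftSemi, cond)
  }

  // Toy version of: (case when c2 in (select c2 from t2) then 2 else 3 end) IN (select c2 from t1)
  val inner: PredicateSubquery =
    PredicateSubquery(Relation("t2"), Seq(EqualTo(Attr("t1.c2"), Attr("t2.c2"))))
  val outer: PredicateSubquery =
    PredicateSubquery(
      Relation("t1_sub"),
      Seq(EqualTo(CaseWhen(inner, Lit(2), Lit(3)), Attr("t1_sub.c2"))))
  val rewritten: Plan = rewriteOuter(Relation("t1"), outer)

  def main(args: Array[String]): Unit = println(rewritten)
}
```

In this toy model the result is a left semi join whose left child is the existence join over `t1` and `t2`, which mirrors the shape of the optimized logical plan shown above.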

How was this patch tested?

Added new test cases in SubquerySuite.scala


hvanhovell · 2016-06-09T00:13:46Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -1715,31 +1715,68 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
      // Filter the plan by applying left semi and left anti joins.
      withSubquery.foldLeft(newFilter) {
        case (p, PredicateSubquery(sub, conditions, _, _)) =>
-          Join(p, sub, LeftSemi, conditions.reduceOption(And))
+          if (!conditions.exists(PredicateSubquery.hasPredicateSubquery)) {


So the code for these rewrites is (almost) the same for all three cases. Let's move this into a helper method or move it into a separate loop before this one (I prefer the latter).

hvanhovell · 2016-06-09T00:15:02Z

@ioana-delaney great catch! The overall PR seems pretty solid. I left one smallish code organization related comment.

SparkQA · 2016-06-09T01:17:43Z

Test build #3070 has finished for PR 13570 at commit eea703a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ioana-delaney · 2016-06-10T22:38:41Z

@hvanhovell Thank you for reviewing the changes and I apologize for the delay in replying.
I simplified the code. However, I don't think this is what you suggested. What you suggested,
I believe, was to pull out the rewrite of the inner expressions into an outer loop. I made a few
attempts but I could not find a way to decouple the expressions' generation from the new plans'
generation. When rewriting the expression, I am also building the plans bottom-up. Please take
a look at my new changes and advise. Thanks!

hvanhovell · 2016-06-11T00:56:40Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+   * are blocked in the Analyzer.
+   */
+  private def rewriteExistentialExpr(
+      expr: Option[Expression],


Let's just pass a sequence of expressions. Predicate subqueries are guaranteed to have one or more conditions.

This also eliminates the need for a pattern match. Just map over the expressions (rewrite the subqueries) and reduce conditions at the end.
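
The suggestion above, mapping over the conditions and reducing them at the end, might look roughly like the following toy sketch. The types (`Expr`, `EmbeddedSubquery`, and so on) and the helper names `rewriteOne` and `rewriteAll` are hypothetical stand-ins rather than Catalyst's API, and the plan-building side of the rewrite is deliberately left out.

```scala
object ConditionSeqSketch {
  // Toy expression tree (hypothetical stand-ins, not Catalyst classes).
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class EqualTo(left: Expr, right: Expr) extends Expr
  case class And(left: Expr, right: Expr) extends Expr
  // Placeholder for an embedded predicate subquery, identified by a toy id.
  case class EmbeddedSubquery(id: Int) extends Expr

  // Rewrite a single expression: each embedded subquery becomes an `exists` attribute.
  def rewriteOne(e: Expr): Expr = e match {
    case EmbeddedSubquery(id) => Attr(s"exists#$id")
    case EqualTo(l, r) => EqualTo(rewriteOne(l), rewriteOne(r))
    case And(l, r) => And(rewriteOne(l), rewriteOne(r))
    case leaf => leaf
  }

  // Map over all conditions, then combine them into one conjunction at the end.
  def rewriteAll(conditions: Seq[Expr]): Option[Expr] =
    conditions.map(rewriteOne).reduceOption(And(_, _))

  def main(args: Array[String]): Unit = {
    println(rewriteAll(Seq(EqualTo(Attr("a"), Attr("b")), EmbeddedSubquery(1))))
    println(rewriteAll(Seq.empty))
  }
}
```

Because `reduceOption` returns `None` for an empty sequence, a helper shaped like this needs no separate pattern match for the case where there are no conditions.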

hvanhovell · 2016-06-11T01:09:28Z

@ioana-delaney no worries. I think the approach you have taken is the correct one. I have left one smallish comment.

ioana-delaney · 2016-06-11T03:21:52Z

@hvanhovell The EXISTS/NOT EXISTS predicates will have an empty condition. e.g.

select c1 from t1 where EXISTS (select c2 from t2)

== Optimized Logical Plan ==
Project [_1#224 AS c1#227]
+- Join LeftSemi
:- LocalRelation [_1#224, _2#225]
+- LocalRelation [c2#239]

But the other subquery predicates are guaranteed to have at least one condition.

Regarding the rewriteExistentialExpr interface, I think that I need to pass an expression instead of a sequence of conditions, since the last case in the main rewrite rule does not have conditions. It's just an expression, e.g. where (case when c2 IN (select 1 as one) then 1 else 2 end) = c1

Please let me know. Thanks.

hvanhovell · 2016-06-12T21:25:57Z

LGTM - merging to master/2.0 thanks!

Author: Ioana Delaney <ioanamdelaney@gmail.com>

Closes #13570 from ioana-delaney/fixEmbedSubPredV1.

(cherry picked from commit 0ff8a68)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

jkbradley · 2016-06-12T21:41:40Z

Did this break the build? https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-sbt-scala-2.10/1707/console

It looks like the last commit may not have been tested.

gatorsmile · 2016-06-12T23:01:49Z

@hvanhovell @jkbradley Could you add @ioana-delaney to the whitelist? Thanks!

[SPARK-15832] Embedded IN/EXISTS predicate subquery throws TreeNodeException.

eea703a

hvanhovell reviewed Jun 9, 2016

[SPARK-15832] Revised changes based on comments.

d89a622

hvanhovell reviewed Jun 11, 2016

asfgit closed this in 0ff8a68 Jun 12, 2016
