[SPARK-13306] [SQL] uncorrelated scalar subquery #11190

davies · 2016-02-12T22:47:10Z

A scalar subquery is a subquery that generates only a single row and a single column, and can be used as part of an expression. An uncorrelated scalar subquery is one that does not reference any outer table.

All uncorrelated scalar subqueries are executed during prepare() of SparkPlan.

The plans for the query

select 1 + (select 2 + (select 3))

look like this:

== Parsed Logical Plan ==
'Project [unresolvedalias((1 + subquery#1),None)]
:- OneRowRelation$
+- 'Subquery subquery#1
   +- 'Project [unresolvedalias((2 + subquery#0),None)]
      :- OneRowRelation$
      +- 'Subquery subquery#0
         +- 'Project [unresolvedalias(3,None)]
            +- OneRowRelation$

== Analyzed Logical Plan ==
_c0: int
Project [(1 + subquery#1) AS _c0#4]
:- OneRowRelation$
+- Subquery subquery#1
   +- Project [(2 + subquery#0) AS _c0#3]
      :- OneRowRelation$
      +- Subquery subquery#0
         +- Project [3 AS _c0#2]
            +- OneRowRelation$

== Optimized Logical Plan ==
Project [(1 + subquery#1) AS _c0#4]
:- OneRowRelation$
+- Subquery subquery#1
   +- Project [(2 + subquery#0) AS _c0#3]
      :- OneRowRelation$
      +- Subquery subquery#0
         +- Project [3 AS _c0#2]
            +- OneRowRelation$

== Physical Plan ==
WholeStageCodegen
:  +- Project [(1 + subquery#1) AS _c0#4]
:     :- INPUT
:     +- Subquery subquery#1
:        +- WholeStageCodegen
:           :  +- Project [(2 + subquery#0) AS _c0#3]
:           :     :- INPUT
:           :     +- Subquery subquery#0
:           :        +- WholeStageCodegen
:           :           :  +- Project [3 AS _c0#2]
:           :           :     +- INPUT
:           :           +- Scan OneRowRelation[]
:           +- Scan OneRowRelation[]
+- Scan OneRowRelation[]
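
As a rough illustration of the nesting in the plans above, here is a minimal sketch in plain Scala. It is a toy model, not Catalyst: `Expr`, `Lit`, `Add`, `ScalarSub` and `SubqueryEval` are hypothetical names invented for this example. It only shows that an uncorrelated scalar subquery can be evaluated on its own, innermost first, and then used like a constant in the enclosing expression.

```scala
// Toy model only; these are NOT Spark's Catalyst classes.
sealed trait Expr
case class Lit(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
// An uncorrelated scalar subquery references nothing from the outer query,
// so it can be evaluated independently and treated as a single value.
case class ScalarSub(query: Expr) extends Expr

object SubqueryEval {
  // Inner subqueries are evaluated first; their value feeds the outer expression.
  def eval(e: Expr): Int = e match {
    case Lit(v)       => v
    case Add(l, r)    => eval(l) + eval(r)
    case ScalarSub(q) => eval(q)
  }

  // Models: select 1 + (select 2 + (select 3))
  val query: Expr = Add(Lit(1), ScalarSub(Add(Lit(2), ScalarSub(Lit(3)))))
}
```

Evaluating `SubqueryEval.eval(SubqueryEval.query)` resolves subquery#0 to 3, then subquery#1 to 5, then the outer projection to 6.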

davies · 2016-02-12T22:47:21Z

cc @hvanhovell

hvanhovell · 2016-02-12T23:01:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystQl.scala

@@ -667,6 +667,8 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
          UnresolvedAttribute(nameParts :+ cleanIdentifier(attr))
        case other => UnresolvedExtractValue(other, Literal(cleanIdentifier(attr)))
      }
+    case Token("TOK_SUBQUERY_EXPR", Token("TOK_SUBQUERY_OP", Nil) :: subquery :: Nil) =>
+      ScalarSubquery(nodeToPlan(subquery))


This might sound exceedingly dumb, but I cannot find ScalarSubquery or SubqueryExpression. Are they already in the code base? Or did you create the branch on top of another branch?

Never mind, I just found the other PR...

I missed a file, sorry

rxin · 2016-02-13T00:28:12Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

+    assertResult(Array(Row(1))) {
+      sql("with t2 as (select 1 as b, 2 as c) " +
+        "select a from (select 1 as a union all select 2 as a) t " +
+        "where a = (select max(b) from t2) ").collect()


If we support nested subqueries, can we add a test case for them?

SparkQA · 2016-02-13T01:27:33Z

Test build #51220 has finished for PR 11190 at commit 016c36c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-02-13T07:20:03Z

So I think the biggest question is whether we should do all of this planning and execution for subqueries in prepare(), or come up with some other way to run subqueries. While it works right now, I'm not a big fan of doing it there because it mixes planning and execution and breaks the nice abstraction we have.

davies · 2016-02-13T07:26:54Z

@rxin Compared to broadcast join, where we do execution in prepare(), for uncorrelated scalar subqueries we do both optimization and execution in prepare(). I think it's not a big deal. All other subqueries will be rewritten as joins and will not be executed in prepare(); this one is the only exception.
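
To make the prepare()/execute() split under discussion concrete, here is a minimal sketch in plain Scala. It is not Spark's SparkPlan: `ToyPlan`, its fields and the default timeout are assumptions made for illustration. The idea it shows is that prepare() only submits each subquery as a Future, and the blocking wait happens later, right before the operator needs the values.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for a physical operator; not Spark's SparkPlan.
class ToyPlan(subqueries: Seq[() => Int], timeout: Duration = 10.seconds) {
  private implicit val ec: ExecutionContext = ExecutionContext.global
  private var pending: Seq[Future[Int]] = Nil
  private var results: Seq[Int] = Nil
  private var prepared = false

  // prepare() submits the subqueries but does not block on them.
  def prepare(): Unit = synchronized {
    if (!prepared) {
      pending = subqueries.map(q => Future(q()))
      prepared = true
    }
  }

  // Block until every submitted subquery has produced its value.
  private def waitForSubqueries(): Unit = synchronized {
    if (pending.nonEmpty) {
      results = pending.map(f => Await.result(f, timeout))
      pending = Nil
    }
  }

  // The operator's own work starts only after the subquery values are known.
  def execute(): Int = {
    prepare()
    waitForSubqueries()
    results.sum
  }
}
```

For example, `new ToyPlan(Seq(() => 2, () => 3)).execute()` submits both subqueries, waits for them, and returns their sum. Because submission and waiting are separate steps, independent subqueries can run concurrently instead of one blocking the next.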

SparkQA · 2016-02-13T07:49:33Z

Test build #51230 has finished for PR 11190 at commit a4bae33.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-02-16T00:35:17Z

cc @marmbrus

marmbrus · 2016-02-19T20:33:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -120,7 +121,13 @@ class Analyzer(
            withAlias.getOrElse(relation)
          }
          substituted.getOrElse(u)
+        case other =>


quick comment on why this isn't in ResolveSubquery

SparkQA · 2016-02-20T05:28:24Z

Test build #51586 has finished for PR 11190 at commit d0974cf.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class SubqueryExpression extends LeafExpression
- case class ScalarSubquery(
- case class Subquery(name: String, child: SparkPlan) extends UnaryNode
- case class SparkScalarSubquery(

rxin · 2016-02-20T05:29:54Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala

+            sys.error(s"Scalar subquery should return at most one row, but got ${rows.length}: " +
+              s"${e.query.treeString}")
+          }
+          // Analyzer will make sure that it only return on column


"Analyzer should make sure this only returns one column"

and add an assert after this.
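
A minimal sketch of the check being requested, in plain Scala rather than Spark's row types. `ScalarResult.extract` and the `Seq[Seq[Any]]` row representation are hypothetical, and treating an empty result as null is an assumption of this sketch, not something stated in the diff above.

```scala
object ScalarResult {
  // Each row is modelled as a sequence of column values (hypothetical representation).
  def extract(rows: Seq[Seq[Any]]): Any = {
    if (rows.length > 1) {
      sys.error(s"Scalar subquery should return at most one row, but got ${rows.length}")
    }
    if (rows.length == 1) {
      // Analyzer should make sure this only returns one column.
      assert(rows.head.length == 1, s"Expected one column, but got ${rows.head.length}")
      rows.head.head
    } else {
      null // assumption: a subquery with no rows yields NULL
    }
  }
}
```

The error covers a data-dependent condition (too many rows), while the assert guards an invariant that earlier analysis is expected to have established.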

rxin · 2016-02-20T08:47:35Z

sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala

+  override def eval(input: InternalRow): Any = result
+
+  override def genCode(ctx: CodegenContext, ev: ExprCode): String = {
+    val thisTerm = ctx.addReferenceObj("subquery", this)


what's the reason you don't use the same codepath as literal?

We would need to keep track of the parent of the subquery in order to replace it.

Literal will also fall back in some cases; we should fix that in this way.

Let me try it.

OK, I was thinking of just creating a literal expression directly in this function. It'd be great if we had just one place that passes in literals; that would also make the generated code friendlier.

Since the subquery could be used anywhere (as part of an expression or inside a list/seq), it's not easy to replace it.

Maybe it's ok as is. I was thinking about

override def genCode(...) = { Literal(value, dataType).genCode(...) }

rxin · 2016-02-20T08:48:12Z

LGTM.

rxin · 2016-02-20T08:48:24Z

(We should have follow-ups that fix the web UI if it doesn't work)

davies · 2016-02-20T08:49:33Z

Created JIRA https://issues.apache.org/jira/browse/SPARK-13415

SparkQA · 2016-02-20T08:58:43Z

Test build #51593 has finished for PR 11190 at commit 7596173.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-02-20T09:03:20Z

The blocking can still happen, can't it? Just have a branch, and then the left one will block the right one?

SparkQA · 2016-02-20T09:30:31Z

Test build #51589 has finished for PR 11190 at commit 3a8f08d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class SubqueryExpression extends LeafExpression
- case class ScalarSubquery(

SparkQA · 2016-02-20T11:15:22Z

Test build #51594 has finished for PR 11190 at commit 0034172.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-02-20T18:29:29Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala

      doExecute()
    }
  }

+  // All the subquries and their Future of results.


Nit: subqueries

davies · 2016-02-21T03:14:22Z

@marmbrus @rxin @hvanhovell Thanks for all your time reviewing this. If there are no more comments, I'm going to merge this into master once it passes tests.

SparkQA · 2016-02-21T04:45:36Z

Test build #51608 has finished for PR 11190 at commit e082845.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-02-21T05:02:44Z

Merged into master, thanks!

rxin · 2016-02-21T06:10:01Z

@davies I don't think anybody actually had time to look over your latest changes ...

rxin · 2016-02-21T06:30:00Z

sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegen.scala

    this.parent = parent
    ctx.freshNamePrefix = variablePrefix
+    waitForSubqueries()


why is this needed? shouldn't SparkPlan.execute already call waitForSubqueries?

This is needed for whole-stage codegen; those operators will not call execute().

ok got it. this is fairly hacky ...
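
A small sketch of why the wait has to sit on both entry points, in plain Scala. `ToyOperator`, `ToyProject` and the `waits` counter are hypothetical and exist only for illustration; this is not how Spark's classes are actually laid out.

```scala
// Hypothetical operator model; names are illustrative, not Spark's API.
abstract class ToyOperator {
  private var subqueriesReady = false
  var waits = 0 // counts how often we actually waited, for demonstration only

  // Idempotent, so it is safe to call from either entry point.
  protected def waitForSubqueries(): Unit = synchronized {
    if (!subqueriesReady) {
      waits += 1 // a real operator would block on its subquery results here
      subqueriesReady = true
    }
  }

  // Regular path: the operator is executed directly.
  final def execute(): String = { waitForSubqueries(); "rows" }

  // Codegen path: the operator only contributes generated code and
  // execute() is never called on it, so it has to wait here as well.
  final def produce(): String = { waitForSubqueries(); "generated code" }
}

class ToyProject extends ToyOperator
```

With this shape, an operator that is only ever reached through `produce()` still sees its subquery results, and calling both paths does not wait twice.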

## What changes were proposed in this pull request?

This pull request fixes some minor issues (documentation, test flakiness, test organization) with #11190, which was merged earlier tonight.

## How was this patch tested?

unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #11285 from rxin/subquery.

## What changes were proposed in this pull request?

this code come from PR: #11190, but this code has never been used, only since PR: #14548, Let's continue fix it. thanks.

## How was this patch tested?

N / A

Closes #23227 from heary-cao/unuseSparkPlan.

Authored-by: caoxuewen <cao.xuewen@zte.com.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>


uncorrelated scalar subquery

0665a69

hvanhovell reviewed Feb 12, 2016

Davies Liu added 2 commits February 12, 2016 15:48

add missing file

236ac88

use broadcastTimeout

016c36c

rxin reviewed Feb 13, 2016

address comments

a4bae33

marmbrus reviewed Feb 19, 2016

improve explain on subquery

d0974cf

rxin reviewed Feb 20, 2016

move wait subqueries into execute()/produce()

0034172

hvanhovell reviewed Feb 20, 2016

address comments

e082845

asfgit closed this in 7925071 Feb 21, 2016

rxin reviewed Feb 21, 2016

rxin mentioned this pull request Feb 21, 2016

[SPARK-13306][SQL] Addendum to uncorrelated scalar subquery #11285

Closed

heary-cao mentioned this pull request Dec 5, 2018

[SPARK-26271][FOLLOW-UP][SQL] remove unuse object SparkPlan #23227

Closed
