[SPARK-13306] [SQL] uncorrelated scalar subquery #11190

Closed
wants to merge 9 commits into from

Contributor

@davies davies commented Feb 12, 2016

A scalar subquery is a subquery that generates only a single row and a single column, and can be used as part of an expression. An uncorrelated scalar subquery is one that does not reference any external table.

All the uncorrelated scalar subqueries will be executed during prepare() of SparkPlan.

The plans for the query

select 1 + (select 2 + (select 3))

look like this:

== Parsed Logical Plan ==
'Project [unresolvedalias((1 + subquery#1),None)]
:- OneRowRelation$
+- 'Subquery subquery#1
   +- 'Project [unresolvedalias((2 + subquery#0),None)]
      :- OneRowRelation$
      +- 'Subquery subquery#0
         +- 'Project [unresolvedalias(3,None)]
            +- OneRowRelation$

== Analyzed Logical Plan ==
_c0: int
Project [(1 + subquery#1) AS _c0#4]
:- OneRowRelation$
+- Subquery subquery#1
   +- Project [(2 + subquery#0) AS _c0#3]
      :- OneRowRelation$
      +- Subquery subquery#0
         +- Project [3 AS _c0#2]
            +- OneRowRelation$

== Optimized Logical Plan ==
Project [(1 + subquery#1) AS _c0#4]
:- OneRowRelation$
+- Subquery subquery#1
   +- Project [(2 + subquery#0) AS _c0#3]
      :- OneRowRelation$
      +- Subquery subquery#0
         +- Project [3 AS _c0#2]
            +- OneRowRelation$

== Physical Plan ==
WholeStageCodegen
:  +- Project [(1 + subquery#1) AS _c0#4]
:     :- INPUT
:     +- Subquery subquery#1
:        +- WholeStageCodegen
:           :  +- Project [(2 + subquery#0) AS _c0#3]
:           :     :- INPUT
:           :     +- Subquery subquery#0
:           :        +- WholeStageCodegen
:           :           :  +- Project [3 AS _c0#2]
:           :           :     +- INPUT
:           :           +- Scan OneRowRelation[]
:           +- Scan OneRowRelation[]
+- Scan OneRowRelation[]
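
As a rough illustration of why the nested plans above can be evaluated inside-out, here is a minimal toy sketch in plain Scala. The types `Expr`, `Lit`, `Add`, and `ScalarSub` are hypothetical stand-ins invented for this example, not Spark's classes. The only point is that an uncorrelated subquery needs no outer row, so its single value can be computed first and then used like a constant by its parent.

```scala
// Toy model only: these types are hypothetical and are not Spark's classes.
object NestedScalarSubquerySketch {
  sealed trait Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  // An uncorrelated scalar subquery: a self-contained query producing one value.
  final case class ScalarSub(query: Expr) extends Expr

  // A subquery is evaluated on its own (it references nothing from the outer
  // query), and its single value is then used like a constant by the parent.
  def eval(e: Expr): Int = e match {
    case Lit(v)       => v
    case Add(l, r)    => eval(l) + eval(r)
    case ScalarSub(q) => eval(q)
  }

  // select 1 + (select 2 + (select 3))
  val query: Expr = Add(Lit(1), ScalarSub(Add(Lit(2), ScalarSub(Lit(3)))))
}
```

Calling `NestedScalarSubquerySketch.eval(NestedScalarSubquerySketch.query)` evaluates the innermost subquery first and works outward.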

@davies
Contributor Author

davies commented Feb 12, 2016

cc @hvanhovell

@@ -667,6 +667,8 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
UnresolvedAttribute(nameParts :+ cleanIdentifier(attr))
case other => UnresolvedExtractValue(other, Literal(cleanIdentifier(attr)))
}
case Token("TOK_SUBQUERY_EXPR", Token("TOK_SUBQUERY_OP", Nil) :: subquery :: Nil) =>
ScalarSubquery(nodeToPlan(subquery))
Contributor

This might sound exceedingly dumb, but I cannot find ScalarSubquery or SubqueryExpression. Are they already in the code base? Or did you create a branch on top of another branch?

Contributor

Never mind, I just found the other PR...

Contributor Author

I missed a file, sorry

assertResult(Array(Row(1))) {
sql("with t2 as (select 1 as b, 2 as c) " +
"select a from (select 1 as a union all select 2 as a) t " +
"where a = (select max(b) from t2) ").collect()
Contributor

If we support nested subqueries, can we add a test case?

Contributor Author

Added

@SparkQA

SparkQA commented Feb 13, 2016

Test build #51220 has finished for PR 11190 at commit 016c36c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Feb 13, 2016

So I think the biggest question is whether we should do all of this subquery planning and execution in prepare(), or come up with some other way to run subqueries. While it works right now, I'm not a big fan of doing it there because it mixes planning and execution and breaks the nice abstraction we have.

@davies
Contributor Author

davies commented Feb 13, 2016

@rxin Compared to broadcast join, where we do execution in prepare(), for an uncorrelated scalar subquery we do both optimization and execution in prepare(). I think it's not a big deal. All other subqueries will be rewritten as joins and will not be executed in prepare(); the current one is the only exception.

@SparkQA

SparkQA commented Feb 13, 2016

Test build #51230 has finished for PR 11190 at commit a4bae33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Feb 16, 2016

cc @marmbrus

@@ -120,7 +121,13 @@ class Analyzer(
withAlias.getOrElse(relation)
}
substituted.getOrElse(u)
case other =>
Contributor

quick comment on why this isn't in ResolveSubquery

Contributor Author

done

@SparkQA

SparkQA commented Feb 20, 2016

Test build #51586 has finished for PR 11190 at commit d0974cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class SubqueryExpression extends LeafExpression
    • case class ScalarSubquery(
    • case class Subquery(name: String, child: SparkPlan) extends UnaryNode
    • case class SparkScalarSubquery(

sys.error(s"Scalar subquery should return at most one row, but got ${rows.length}: " +
s"${e.query.treeString}")
}
// Analyzer will make sure that it only return on column
Contributor

"Analyzer should make sure this only returns one column"

and add an assert after this.
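
A minimal sketch of the check under discussion, written as plain Scala with no Spark dependency. The name `scalarValue` and the `Row` alias are hypothetical. Returning `null` for an empty result is an assumption based on standard SQL scalar-subquery semantics, not something shown in this diff.

```scala
// Toy sketch; `Row` and `scalarValue` are hypothetical names, not Spark APIs.
object ScalarSubqueryResultSketch {
  type Row = Seq[Any]

  def scalarValue(rows: Seq[Row]): Any = {
    if (rows.length > 1) {
      sys.error(s"Scalar subquery should return at most one row, but got ${rows.length}")
    }
    rows.headOption match {
      case Some(row) =>
        // The analyzer should make sure this only returns one column;
        // the assert documents and enforces that expectation here.
        assert(row.length == 1, s"Expected exactly one column, but got ${row.length}")
        row.head
      case None =>
        // Assumption: an empty result evaluates to NULL.
        null
    }
  }
}
```

Keeping the row-count check as a runtime error and the column-count check as an assert mirrors the split suggested in the review: the first depends on data, the second should already be guaranteed by analysis.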

override def eval(input: InternalRow): Any = result

override def genCode(ctx: CodegenContext, ev: ExprCode): String = {
val thisTerm = ctx.addReferenceObj("subquery", this)
Contributor

what's the reason you don't use the same codepath as literal?

Contributor Author

We need to keep track of the parent of the subquery so that it can be replaced.

Literal will also fall back in some cases; we should fix that in this way, too.

Contributor Author

Let me try it.

Contributor

OK, I was thinking of just creating a literal expression directly in this function. It'd be great if we had just one place that passes in literals, and it would also make the generated code friendlier.

Contributor Author

Since the subquery could be used anywhere (as part of an expression or inside a list/seq), it's not easy to replace it.

Contributor

Maybe it's ok as is. I was thinking about

override def genCode(...) = {
  Literal(value, dataType).genCode(...)
}
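
To make the delegation idea concrete, here is a self-contained toy version in plain Scala. `GeneratedCode`, `Literal`, and `ScalarSubqueryResult` are hypothetical simplifications invented for this sketch and do not match Spark's `ExprCode`, `Literal`, or `genCode` signatures. The sketch only shows the shape of "one place that emits literal code".

```scala
// Toy sketch; these are hypothetical stand-ins, not Spark's classes.
object LiteralCodegenSketch {
  // A fragment of generated Java source plus the variable holding its value.
  final case class GeneratedCode(code: String, value: String)

  // The single place that knows how to emit code for a constant.
  final case class Literal(value: Int) {
    def genCode(varName: String): GeneratedCode =
      GeneratedCode(s"final int $varName = $value;", varName)
  }

  // Once the subquery result is known, code generation can simply delegate
  // to Literal instead of emitting its own lookup code.
  final case class ScalarSubqueryResult(result: Int) {
    def genCode(varName: String): GeneratedCode =
      Literal(result).genCode(varName)
  }
}
```

With this shape, the generated source for a finished subquery is identical to that of an inlined constant, which is the "friendlier generated code" benefit mentioned in the thread.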

@rxin
Contributor

rxin commented Feb 20, 2016

LGTM.

@rxin
Contributor

rxin commented Feb 20, 2016

(We should have follow-ups that fix the web UI if it doesn't work)

@davies
Contributor Author

davies commented Feb 20, 2016

@SparkQA

SparkQA commented Feb 20, 2016

Test build #51593 has finished for PR 11190 at commit 7596173.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Feb 20, 2016

The blocking can still happen, can't it? Just have a branch, and then the left one will block the right one?

@SparkQA

SparkQA commented Feb 20, 2016

Test build #51589 has finished for PR 11190 at commit 3a8f08d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class SubqueryExpression extends LeafExpression
    • case class ScalarSubquery(

@SparkQA

SparkQA commented Feb 20, 2016

Test build #51594 has finished for PR 11190 at commit 0034172.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

doExecute()
}
}

// All the subquries and their Future of results.
Contributor

Nit: subqueries

@davies
Contributor Author

davies commented Feb 21, 2016

@marmbrus @rxin @hvanhovell Thanks to all of you for your time reviewing this. If there are no more comments, I'm going to merge this into master once it passes tests.

@SparkQA

SparkQA commented Feb 21, 2016

Test build #51608 has finished for PR 11190 at commit e082845.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor Author

davies commented Feb 21, 2016

Merged into master, thanks!

@asfgit asfgit closed this in 7925071 Feb 21, 2016
@rxin
Contributor

rxin commented Feb 21, 2016

@davies I don't think anybody actually had time to look over your latest changes ...

this.parent = parent
ctx.freshNamePrefix = variablePrefix
waitForSubqueries()
Contributor

why is this needed? shouldn't SparkPlan.execute already call waitForSubqueries?

Contributor Author

This is needed for whole-stage codegen; those operators will not call execute().

Contributor

ok got it. this is fairly hacky ...
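
The exchange above can be pictured with a small toy model in plain Scala. `ToyOperator`, `produce`, and the `Int`-valued subqueries are hypothetical simplifications, not Spark's `SparkPlan`. The sketch assumes that `prepare()` starts each subquery as a `Future`, that `execute()` waits for them itself, and that a code-generation entry point which bypasses `execute()` therefore has to wait explicitly.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Toy sketch; ToyOperator is a hypothetical stand-in, not Spark's SparkPlan.
object SubqueryFuturesSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  final class ToyOperator(subqueries: Seq[() => Int]) {
    // All the subqueries and their Future of results.
    private var pending: Seq[Future[Int]] = Seq.empty
    private var results: Seq[Int] = Seq.empty
    private var prepared = false

    // Start every subquery asynchronously; safe to call more than once.
    def prepare(): Unit = synchronized {
      if (!prepared) {
        pending = subqueries.map(q => Future(q()))
        prepared = true
      }
    }

    // Block until every started subquery has produced its value.
    def waitForSubqueries(): Unit = synchronized {
      if (pending.nonEmpty) {
        results = pending.map(f => Await.result(f, Duration.Inf))
        pending = Seq.empty
      }
    }

    // Regular path: execute() prepares and waits on its own.
    def execute(): Int = {
      prepare()
      waitForSubqueries()
      results.sum
    }

    // Codegen-style path: execute() is never called on this operator, so the
    // wait must happen here. Assumes the enclosing plan already called prepare().
    def produce(): Int = {
      waitForSubqueries()
      results.sum
    }
  }
}
```

In this model, dropping the `waitForSubqueries()` call from `produce()` would let the operator read results before the futures complete, which is the gap the extra call in the diff is meant to close.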

asfgit pushed a commit that referenced this pull request Feb 21, 2016
## What changes were proposed in this pull request?
This pull request fixes some minor issues (documentation, test flakiness, test organization) with #11190, which was merged earlier tonight.

## How was this patch tested?
unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #11285 from rxin/subquery.
asfgit pushed a commit that referenced this pull request Dec 5, 2018
## What changes were proposed in this pull request?

This code comes from PR #11190,
but it has been unused since PR #14548.
Let's continue and fix it. Thanks.

## How was this patch tested?

N / A

Closes #23227 from heary-cao/unuseSparkPlan.

Authored-by: caoxuewen <cao.xuewen@zte.com.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019