-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-13306] [SQL] uncorrelated scalar subquery #11190
Conversation
cc @hvanhovell |
@@ -667,6 +667,8 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C | |||
UnresolvedAttribute(nameParts :+ cleanIdentifier(attr)) | |||
case other => UnresolvedExtractValue(other, Literal(cleanIdentifier(attr))) | |||
} | |||
case Token("TOK_SUBQUERY_EXPR", Token("TOK_SUBQUERY_OP", Nil) :: subquery :: Nil) => | |||
ScalarSubquery(nodeToPlan(subquery)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might sound excedingly dumb but I cannot find ScalarSubquery
or SubqueryExpression
. Are they already in the code base? Or did you create branch on top of another branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind I just found the other PR...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed a file, sorry
assertResult(Array(Row(1))) { | ||
sql("with t2 as (select 1 as b, 2 as c) " + | ||
"select a from (select 1 as a union all select 2 as a) t " + | ||
"where a = (select max(b) from t2) ").collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we support nested subqueries, can we add a test case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
Test build #51220 has finished for PR 11190 at commit
|
So I think the biggest question is whether we should do all of these planning for subqueries and execution in prepare(), or come up with some other way to run subqueries. While it works right now, I'm not a big fan of doing it there because it mixes planning and execution and breaks the nice abstraction we have. |
@rxin Compare to broadcast join, we do execution in prepare(), for uncorrelated scalar subquery, we do optimize and execution in prepare(), I think it's not a big deal. For all other subqueries, they will be rewritten as join, will not be executed in prepare(), only current one is the except. |
Test build #51230 has finished for PR 11190 at commit
|
cc @marmbrus |
@@ -120,7 +121,13 @@ class Analyzer( | |||
withAlias.getOrElse(relation) | |||
} | |||
substituted.getOrElse(u) | |||
case other => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quick comment on why this isn't in ResolveSubquery
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Test build #51586 has finished for PR 11190 at commit
|
sys.error(s"Scalar subquery should return at most one row, but got ${rows.length}: " + | ||
s"${e.query.treeString}") | ||
} | ||
// Analyzer will make sure that it only return on column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Analyzer should make sure this only returns one column"
and add an assert after this.
override def eval(input: InternalRow): Any = result | ||
|
||
override def genCode(ctx: CodegenContext, ev: ExprCode): String = { | ||
val thisTerm = ctx.addReferenceObj("subquery", this) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the reason you don't use the same codepath as literal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to bookkeeping the parent of subquery for replacing.
Literal also will fallback in some cases, we should fix that in this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me try it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I was thinking just creating a literal expression directly in this function. It'd be great if we just have one place that passes in literals, and also make the generated code friendlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the subquery could be used in any places (part of expression or inside a list/seq), so it's not easy to replace it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's ok as is. I was thinking about
override def genCode(...) = {
Literal(value, dataType).genCode(...)
}
LGTM. |
(We should have follow-ups that fix the web UI if it doesn't work) |
Created JIRA https://issues.apache.org/jira/browse/SPARK-13415 |
Test build #51593 has finished for PR 11190 at commit
|
The blocking can still happen, can't it? Just have a branch, and then the left one will block the right one? |
Test build #51589 has finished for PR 11190 at commit
|
Test build #51594 has finished for PR 11190 at commit
|
doExecute() | ||
} | ||
} | ||
|
||
// All the subquries and their Future of results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: subqueries
@marmbrus @rxin @hvanhovell Thanks to all your time reviewing this, if no more comments, I'm going to merge this into master once it pass tests. |
Test build #51608 has finished for PR 11190 at commit
|
Merged into master, thanks! |
@davies I don't think anybody actually had time to look over your latest changes ... |
this.parent = parent | ||
ctx.freshNamePrefix = variablePrefix | ||
waitForSubqueries() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed? shouldn't SparkPlan.execute already call waitForSubqueries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed for whole stage codegen, those operator will not call execute().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok got it. this is fairly hacky ...
## What changes were proposed in this pull request? This pull request fixes some minor issues (documentation, test flakiness, test organization) with #11190, which was merged earlier tonight. ## How was the this patch tested? unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #11285 from rxin/subquery.
## What changes were proposed in this pull request? this code come from PR: #11190, but this code has never been used, only since PR: #14548, Let's continue fix it. thanks. ## How was this patch tested? N / A Closes #23227 from heary-cao/unuseSparkPlan. Authored-by: caoxuewen <cao.xuewen@zte.com.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request? this code come from PR: apache#11190, but this code has never been used, only since PR: apache#14548, Let's continue fix it. thanks. ## How was this patch tested? N / A Closes apache#23227 from heary-cao/unuseSparkPlan. Authored-by: caoxuewen <cao.xuewen@zte.com.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
A scalar subquery is a subquery that only generate single row and single column, could be used as part of expression. Uncorrelated scalar subquery means it does not has a reference to external table.
All the uncorrelated scalar subqueries will be executed during prepare() of SparkPlan.
The plans for query
looks like this