[SPARK-16958] [SQL] Reuse subqueries within the same query #14548

Closed
davies wants to merge 4 commits

davies commented Aug 8, 2016

What changes were proposed in this pull request?

Multiple subqueries within a query may produce the same result; we can reuse that result instead of running the subquery multiple times.

This PR also cleans up how we run subqueries.

For the SQL query

select id,(select avg(id) from t) from t where id > (select avg(id) from t)

the explain output is

== Physical Plan ==
*Project [id#15L, Subquery subquery29 AS scalarsubquery()#35]
:  +- Subquery subquery29
:     +- *HashAggregate(keys=[], functions=[avg(id#15L)])
:        +- Exchange SinglePartition
:           +- *HashAggregate(keys=[], functions=[partial_avg(id#15L)])
:              +- *Range (0, 1000, splits=4)
+- *Filter (cast(id#15L as double) > Subquery subquery29)
   :  +- Subquery subquery29
   :     +- *HashAggregate(keys=[], functions=[avg(id#15L)])
   :        +- Exchange SinglePartition
   :           +- *HashAggregate(keys=[], functions=[partial_avg(id#15L)])
   :              +- *Range (0, 1000, splits=4)
   +- *Range (0, 1000, splits=4)
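
To illustrate the idea behind the plan above (both occurrences resolve to the same `subquery29`), here is a minimal sketch of deduplicating subqueries by their plan. It is a toy model, not the PR's implementation: `Plan`, `Scan`, `Aggregate`, `Filter`, `Project`, `Subquery`, and `ReuseSketch` are hypothetical types made up for this example, plan equality is plain case-class equality, and subqueries nested inside other subqueries are not handled.

```scala
// Toy model only: these types are hypothetical and are not Spark's classes.
sealed trait Plan
case class Scan(table: String) extends Plan
case class Aggregate(fn: String, child: Plan) extends Plan
case class Filter(condition: String, subqueries: List[Subquery], child: Plan) extends Plan
case class Project(columns: List[String], subqueries: List[Subquery], child: Plan) extends Plan
case class Subquery(name: String, plan: Plan)

object ReuseSketch {
  // Walk the plan top-down. The first subquery seen for a given plan is kept;
  // any later subquery with an equal plan is replaced by that first instance.
  def reuse(root: Plan): Plan = {
    val seen = scala.collection.mutable.Map.empty[Plan, Subquery]
    def dedup(s: Subquery): Subquery = seen.getOrElseUpdate(s.plan, s)
    def go(p: Plan): Plan = p match {
      case Project(cols, subs, child) => Project(cols, subs.map(dedup), go(child))
      case Filter(cond, subs, child)  => Filter(cond, subs.map(dedup), go(child))
      case Aggregate(fn, child)       => Aggregate(fn, go(child))
      case s: Scan                    => s
    }
    go(root)
  }

  // Mirrors the shape of the query in this PR: the same avg(id) subquery
  // appears once in the projection and once in the filter.
  def example(): Plan = {
    val avg = Aggregate("avg(id)", Scan("t"))
    Project(List("id"), List(Subquery("subquery29", avg)),
      Filter("id > subquery", List(Subquery("subquery30", avg)), Scan("t")))
  }
}
```

Calling `ReuseSketch.reuse(ReuseSketch.example())` returns a plan in which the filter's subquery is the very same `Subquery` instance as the projection's, so a result computed once could be shared by both.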

The visualized plan:

[image: reuse-subquery]

How was this patch tested?

Existing tests.

SparkQA commented Aug 9, 2016

Test build #63389 has finished for PR 14548 at commit 1348ba7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait ExecSubqueryExpression extends SubqueryExpression
    • case class InSubquery(
    • case class ReuseSubquery(conf: SQLConf) extends Rule[SparkPlan]

@@ -502,15 +508,64 @@ case class OutputFakerExec(output: Seq[Attribute], child: SparkPlan) extends Spa

/**
 * Physical plan for a subquery.
 *
 * This is used to generate tree string for SparkScalarSubquery.
 */
case class SubqueryExec(name: String, child: SparkPlan) extends UnaryExecNode {

A large part of this class is shared with BroadcastExchangeExec. Should we try to factor out common functionality?

I think it's OK to have some duplicated code here; over-abstracted code is actually harder to read.

@hvanhovell

@davies this looks pretty good. I am very excited about the SparkPlan clean-up!

davies commented Aug 10, 2016

@hvanhovell I have posted a picture; check it out.

SparkQA commented Aug 10, 2016

Test build #63560 has finished for PR 14548 at commit 8444447.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 11, 2016

Test build #63563 has finished for PR 14548 at commit dd1581b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell

Cool picture!

@hvanhovell

LGTM

davies commented Aug 11, 2016

Merging it into master, thanks!

@asfgit asfgit closed this in 0f72e4f Aug 11, 2016
asfgit pushed a commit that referenced this pull request Dec 5, 2018
## What changes were proposed in this pull request?

This code comes from PR #11190,
but it has not been used since PR #14548.
Let's continue to fix it. Thanks.

## How was this patch tested?

N / A

Closes #23227 from heary-cao/unuseSparkPlan.

Authored-by: caoxuewen <cao.xuewen@zte.com.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

JkSelf commented Jan 16, 2019

@davies @hvanhovell @gatorsmile
Subquery reuse may not be working here. In my test, the visualized plan shows the subquery being executed once, as follows:
[image]

But in fact, the stage for the same subquery appears to execute more than once, as follows:
[image]
Maybe I am missing something; can you help verify this? Thanks for your help!

@hvanhovell

@JkSelf can you file a JIRA ticket?

JkSelf commented Jan 17, 2019

@hvanhovell, thanks for your help. I have filed Jira 26639.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019