
[WIP][SPARK-22497][SQL] Project reuse #19727

Closed
wangyum wants to merge 1 commit into apache:master from wangyum:SPARK-22497

wangyum (Member) commented Nov 12, 2017

What changes were proposed in this pull request?

The SQL below scans table1 twice. This PR reuses p1 so that table1 is scanned only once.

WITH p1 AS (SELECT * FROM table1 WHERE key < 100),
s1 AS (SELECT key, count(*) FROM p1 GROUP BY key),
s2 AS (SELECT key, count(*) FROM p1 WHERE key > -100 GROUP BY key)
SELECT s1.* FROM s1 JOIN s2 ON s1.key = s2.key
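
To make the intent concrete, here is a minimal sketch in Python of the reuse idea on a toy plan tree. It is not the code in this PR: `Node`, `reuse_duplicates`, and `count_scans` are hypothetical names, and the plan shape assumes p1 stays as an identical Project subtree under both s1 and s2 (the real optimizer may merge or push down filters so that the two subtrees differ).

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Node:
    """Toy plan node (hypothetical): operator name, arguments, children."""
    op: str
    args: str = ""
    children: Tuple["Node", ...] = ()


def reuse_duplicates(plan: Node, target_op: str, seen: Dict[Node, Node]) -> Node:
    """Replace repeated subtrees rooted at target_op with a childless 'Reused' marker."""
    if plan.op == target_op:
        if plan in seen:
            # Structurally equal subtree already planned once: reference it instead.
            return Node("Reused", plan.op)
        seen[plan] = plan
    new_children = tuple(reuse_duplicates(c, target_op, seen) for c in plan.children)
    return Node(plan.op, plan.args, new_children)


def count_scans(plan: Node) -> int:
    """Count Scan nodes that would actually execute in this plan."""
    own = 1 if plan.op == "Scan" else 0
    return own + sum(count_scans(c) for c in plan.children)


# Assumed plan shape for the query above, with the CTE p1 inlined twice.
p1 = Node("Project", "*", (Node("Filter", "key < 100", (Node("Scan", "table1"),)),))
s1 = Node("Aggregate", "key, count(*)", (p1,))
s2 = Node("Aggregate", "key, count(*)", (Node("Filter", "key > -100", (p1,)),))
query = Node("Join", "s1.key = s2.key", (s1, s2))

reused = reuse_duplicates(query, "Project", {})
print(count_scans(query))   # 2
print(count_scans(reused))  # 1
```

The sketch relies on structural equality of frozen dataclasses to detect the duplicate subtree; a real rule would need a canonicalized comparison of plans rather than plain equality.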

How was this patch tested?

unit tests

SparkQA commented Nov 12, 2017

Test build #83744 has finished for PR 19727 at commit 1c458b8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class ReusedProjectExec(override val output: Seq[Attribute], child: ProjectExec)
      • case class ReuseProject(conf: SQLConf) extends Rule[SparkPlan]

viirya (Member) commented Nov 13, 2017

Simply reusing ProjectExec doesn't really reduce the scan. Duplicated execution of CTEs is a well-known issue. I've addressed it before, but there still seems to be no solution that handles all possible cases.

gatorsmile (Member) commented

CTE reuse can cause performance regressions. It is hard to address without taking costs into account.
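
As a rough illustration of why cost matters here, the sketch below compares two hypothetical cases with a made-up cost model (rows scanned, plus rows written once and re-read per reference for the shared result). The functions `inline_cost` and `reuse_cost` and all row counts are assumptions for illustration, not Spark's cost model or measurements.

```python
from typing import List


def inline_cost(scan_rows_per_ref: List[int]) -> int:
    """Hypothetical cost when each CTE reference runs its own scan."""
    return sum(scan_rows_per_ref)


def reuse_cost(shared_scan_rows: int, shared_output_rows: int, num_refs: int) -> int:
    """Hypothetical cost when the CTE runs once and its output is
    written once and then read by every reference."""
    return shared_scan_rows + shared_output_rows * (1 + num_refs)


# Case A (assumed): both references need the same expensive scan,
# and the shared result is small. Reuse is cheaper.
case_a_inline = inline_cost([1_000_000, 1_000_000])
case_a_reuse = reuse_cost(1_000_000, 1_000, 2)

# Case B (assumed): each reference could push down its own selective
# filter when inlined, but the shared result must keep every row.
# Reuse is more expensive.
case_b_inline = inline_cost([10_000, 10_000])
case_b_reuse = reuse_cost(1_000_000, 1_000_000, 2)

print(case_a_inline, case_a_reuse)  # 2000000 1003000
print(case_b_inline, case_b_reuse)  # 20000 4000000
```

Under these assumptions the same rewrite helps in one case and hurts in the other, so a rule that always reuses cannot be safe without some estimate of both sides.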

wangyum closed this Nov 15, 2017