-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-13763] [SQL] Remove Project when its Child's Output is Nil #11599
Conversation
test("Eliminate the Project with an empty projectList") { | ||
val input = OneRowRelation | ||
val query = | ||
Project(Literal(1).as("1") :: Nil, Project(Literal(1).as("1") :: Nil, input)).analyze |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do you test empty projectList?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When running Optimize.execute(query)
, the second Project's projectList
is pruned to empty at first. Then, the second Project will be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me add another case with an empty List too.
Test build #52721 has finished for PR 11599 at commit
|
Test build #52724 has finished for PR 11599 at commit
|
@cloud-fan Added another two cases. Feel free to let me know if you want me to add more cases. Thanks! |
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] { | |||
p | |||
} | |||
|
|||
// Eliminate the Projects with empty projectList | |||
case p @ Project(projectList, child) if projectList.isEmpty => child |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking of the correctness of this rule. Actually this is not column pruning, but add more columns, as child
may have more one columns.
And why this rule case p @ Project(projectList, child) if sameOutput(child.output, p.output) => child
can't work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because OneRowRelation
has no output. So its output is different to its parent Project
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But a Project
with empty projectList also has no output right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case p @ Project(_, l: LeafNode) => p
There is another case above it. Thus, it will stop here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this?
case p @ Project(_, l: LeafNode) if !l.isInstanceOf[OneRowRelation] => p
Then, we do not need the first line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea. As I posted before. I added a new rule that has side-effect to fix this issue too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @viirya @cloud-fan !
I am not sure which way is better.
case p @ Project(_, l: LeafNode) if !l.isInstanceOf[OneRowRelation] => p
My concern is the above line looks more hacky than the current PR fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me respond the original question by @cloud-fan
We will not see an empty Project
, if the child has more than one columns. The empty Project
only happens after PruningColumns
. I am fine, if we want to add an extra rule for eliminating Project
only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about we just move that case ahead? It seems always safe to apply case p @ Project(projectList, child) if sameOutput(child.output, p.output) => child
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we intentionally did it in this way. I am not 100% sure if we might hit any issue because of it. Let me try it and check if we will hit any test case failure.
Test build #52740 has finished for PR 11599 at commit
|
nit: we need to update the title and description. Technically we can't remove |
Done. The title and PR description are corrected. Thanks! |
Thanks, merging to master. |
#### What changes were proposed in this pull request? As shown in another PR: apache#11596, we are using `SELECT 1` as a dummy table, when the table is used for SQL statements in which a table reference is required, but the contents of the table are not important. For example, ```SQL SELECT value FROM (select 1) dummyTable Lateral View explode(array(1,2,3)) adTable as value ``` Before the PR, the optimized plan contains a useless `Project` after Optimizer executing the `ColumnPruning` rule, as shown below: ``` == Analyzed Logical Plan == value: int Project [value#22] +- Generate explode(array(1, 2, 3)), true, false, Some(adtable), [value#22] +- SubqueryAlias dummyTable +- Project [1 AS 1#21] +- OneRowRelation$ == Optimized Logical Plan == Generate explode([1,2,3]), false, false, Some(adtable), [value#22] +- Project +- OneRowRelation$ ``` After the fix, the optimized plan removed the useless `Project`, as shown below: ``` == Optimized Logical Plan == Generate explode([1,2,3]), false, false, Some(adtable), [value#22] +- OneRowRelation$ ``` This PR is to remove `Project` when its Child's output is Nil #### How was this patch tested? Added a new unit test case into the suite `ColumnPruningSuite.scala` Author: gatorsmile <gatorsmile@gmail.com> Closes apache#11599 from gatorsmile/projectOneRowRelation.
What changes were proposed in this pull request?
As shown in another PR: #11596, we are using
SELECT 1
as a dummy table, when the table is used for SQL statements in which a table reference is required, but the contents of the table are not important. For example,Before the PR, the optimized plan contains a useless
Project
after Optimizer executing theColumnPruning
rule, as shown below:After the fix, the optimized plan removed the useless
Project
, as shown below:This PR is to remove
Project
when its Child's output is NilHow was this patch tested?
Added a new unit test case into the suite
ColumnPruningSuite.scala