[SPARK-13763] [SQL] Remove Project when its Child's Output is Nil #11599

Closed
gatorsmile
Member

What changes were proposed in this pull request?

As shown in another PR (#11596), we use SELECT 1 as a dummy table in SQL statements that require a table reference but do not care about the table's contents. For example:

SELECT value FROM (select 1) dummyTable Lateral View explode(array(1,2,3)) adTable as value

Before this PR, the optimized plan still contains a useless Project after the Optimizer executes the ColumnPruning rule, as shown below:

== Analyzed Logical Plan ==
value: int
Project [value#22]
+- Generate explode(array(1, 2, 3)), true, false, Some(adtable), [value#22]
   +- SubqueryAlias dummyTable
      +- Project [1 AS 1#21]
         +- OneRowRelation$

== Optimized Logical Plan ==
Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
+- Project
   +- OneRowRelation$

After the fix, the optimized plan no longer contains the useless Project, as shown below:

== Optimized Logical Plan ==
Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
+- OneRowRelation$

This PR removes a Project when its child's output is Nil.
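The effect on the plans above can be sketched with a small toy model. This is only an illustration: `Plan`, `Relation`, `Project`, `Generate`, and `RemoveNoopProject` below are simplified stand-ins invented for this sketch, not Spark's Catalyst classes, and the rewrite assumes a Project can be dropped whenever its output equals its child's output.

```scala
// Toy model only: simplified stand-ins, NOT Spark's Catalyst API.
sealed trait Plan { def output: Seq[String] }
case object OneRowRelation extends Plan { val output: Seq[String] = Nil }
case class Relation(output: Seq[String]) extends Plan
case class Project(projectList: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = projectList
}
case class Generate(generatorOutput: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = generatorOutput
}

object RemoveNoopProject {
  // Rewrite bottom-up; drop a Project whose output equals its child's output.
  def apply(plan: Plan): Plan = plan match {
    case Project(list, child) =>
      val newChild = apply(child)
      if (list == newChild.output) newChild else Project(list, newChild)
    case Generate(out, child) => Generate(out, apply(child))
    case leaf => leaf
  }
}
```

In this sketch, `Generate(Seq("value"), Project(Nil, OneRowRelation))` is rewritten to `Generate(Seq("value"), OneRowRelation)`, while an empty Project over a relation that does have columns is left alone.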

How was this patch tested?

Added a new unit test case to ColumnPruningSuite.scala.

@gatorsmile
Member Author

cc @marmbrus @cloud-fan @dilipbiswal

test("Eliminate the Project with an empty projectList") {
  val input = OneRowRelation
  val query =
    Project(Literal(1).as("1") :: Nil, Project(Literal(1).as("1") :: Nil, input)).analyze
Contributor

Where do you test empty projectList?

Member Author

When Optimize.execute(query) runs, the second Project's projectList is first pruned to empty. That Project is then removed.
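The two steps described here can be sketched on a toy model. Everything below is hypothetical and simplified for illustration (`Named`, `Plan`, `Project`, `TwoStep` are not Catalyst classes): step one prunes the inner projectList down to what the outer Project references, step two removes a Project whose output matches its child's.

```scala
// Toy model only: simplified stand-ins, NOT Spark's Catalyst API.
case class Named(name: String, references: Set[String])
sealed trait Plan { def output: Seq[String] }
case object OneRowRelation extends Plan { val output: Seq[String] = Nil }
case class Project(projectList: Seq[Named], child: Plan) extends Plan {
  def output: Seq[String] = projectList.map(_.name)
}

object TwoStep {
  // Step 1: keep only the inner columns that the outer Project references.
  def prune(plan: Plan): Plan = plan match {
    case Project(outer, Project(inner, grandChild)) =>
      val needed = outer.flatMap(_.references).toSet
      Project(outer, Project(inner.filter(n => needed(n.name)), grandChild))
    case other => other
  }
  // Step 2: remove any Project whose output equals its child's output.
  def removeNoop(plan: Plan): Plan = plan match {
    case Project(list, child) =>
      val c = removeNoop(child)
      if (list.map(_.name) == c.output) c else Project(list, c)
    case other => other
  }
}
```

Because the literal `1 AS 1` references no child column, the inner projectList becomes empty in step one, and the now-empty Project over OneRowRelation disappears in step two.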

Member Author

Let me add another case with an empty List too.

@SparkQA

SparkQA commented Mar 9, 2016

Test build #52721 has finished for PR 11599 at commit 0fa21ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 9, 2016

Test build #52724 has finished for PR 11599 at commit a31b1b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@cloud-fan Added another two cases. Feel free to let me know if you want me to add more cases. Thanks!

@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
p
}

// Eliminate the Projects with empty projectList
case p @ Project(projectList, child) if projectList.isEmpty => child
Contributor

I'm wondering about the correctness of this rule. This is actually not column pruning: it adds columns, since the child may have more than one column.

And why can't the rule case p @ Project(projectList, child) if sameOutput(child.output, p.output) => child handle this?
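The concern can be made concrete with a toy sketch. The classes and the two rule functions below are hypothetical, simplified stand-ins (not Catalyst code), and plain sequence equality stands in for sameOutput: dropping every Project with an empty projectList exposes the child's columns, whereas a rule guarded by an output comparison only fires when nothing changes.

```scala
// Toy model only: simplified stand-ins, NOT Spark's Catalyst API.
sealed trait Plan { def output: Seq[String] }
case class Relation(output: Seq[String]) extends Plan
case class Project(projectList: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = projectList
}

object Rules {
  // Naive rule: drop any Project whose projectList is empty.
  def dropEmptyProject(plan: Plan): Plan = plan match {
    case Project(list, child) if list.isEmpty => child
    case other => other
  }
  // Guarded rule: drop a Project only if it does not change the output.
  def dropNoopProject(plan: Plan): Plan = plan match {
    case p @ Project(_, child) if p.output == child.output => child
    case other => other
  }
}
```

With a child that outputs `a` and `b`, the naive rule turns a plan with no output into one with two columns, while the guarded rule leaves it untouched.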

Member

Because OneRowRelation has no output, its output is different from that of its parent Project.

Contributor

But a Project with an empty projectList also has no output, right?

Member Author

    case p @ Project(_, l: LeafNode) => p 

There is another case above it, so matching stops there and never reaches the sameOutput case.

Member Author

How about this?

case p @ Project(_, l: LeafNode) if !l.isInstanceOf[OneRowRelation] => p 

Then, we do not need the first line.

Member

Yeah, as I posted before. I added a new rule that has the side effect of fixing this issue too.

Member Author

Thanks @viirya @cloud-fan !

I am not sure which way is better.

case p @ Project(_, l: LeafNode) if !l.isInstanceOf[OneRowRelation] => p 

My concern is that the line above looks hackier than the fix in the current PR.

Member Author

Let me respond to the original question by @cloud-fan.
We will not see an empty Project if the child has more than one column. An empty Project only appears after ColumnPruning. I am fine with adding an extra rule just for eliminating Project.

Contributor

How about we just move that case ahead? It seems always safe to apply case p @ Project(projectList, child) if sameOutput(child.output, p.output) => child.
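Why the order matters can be sketched as follows. This is a toy model with hypothetical names (`Plan`, `LeafNode`, `Project`, `CaseOrder` are simplified stand-ins, not Catalyst classes), and sequence equality stands in for sameOutput. A Scala match takes the first case that applies, so a leaf case placed first shadows the no-op check.

```scala
// Toy model only: simplified stand-ins, NOT Spark's Catalyst API.
sealed trait Plan { def output: Seq[String] }
sealed trait LeafNode extends Plan
case object OneRowRelation extends LeafNode { val output: Seq[String] = Nil }
case class Project(projectList: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = projectList
}

object CaseOrder {
  // Leaf case first: a Project over any leaf is returned unchanged,
  // so the no-op check below is never reached for it.
  def leafCaseFirst(plan: Plan): Plan = plan match {
    case p @ Project(_, _: LeafNode) => p
    case p @ Project(_, child) if p.output == child.output => child
    case other => other
  }
  // No-op check first: the redundant Project is removed
  // before the leaf case gets a chance to match.
  def noopCaseFirst(plan: Plan): Plan = plan match {
    case p @ Project(_, child) if p.output == child.output => child
    case p @ Project(_, _: LeafNode) => p
    case other => other
  }
}
```

In this sketch, `Project(Nil, OneRowRelation)` survives under `leafCaseFirst` but collapses to `OneRowRelation` under `noopCaseFirst`, and a Project that really adds a column is kept in both orderings.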

Member Author

I thought we did it this way intentionally. I am not 100% sure whether moving it will cause any issue. Let me try it and check whether any test case fails.

@SparkQA

SparkQA commented Mar 9, 2016

Test build #52740 has finished for PR 11599 at commit 68decd1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

nit: we need to update the title and description. Technically, we can't remove a Project just because its projectList is empty, only when the child's output is also Nil.

@gatorsmile changed the title from "[SPARK-13763] [SQL] Remove Project when its projectList is Empty" to "[SPARK-13763] [SQL] Remove Project when its Child's Output is Nil" on Mar 9, 2016
@gatorsmile
Member Author

Done. The title and PR description are corrected. Thanks!

@marmbrus
Contributor

marmbrus commented Mar 9, 2016

Thanks, merging to master.

@asfgit asfgit closed this in 23369c3 Mar 9, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#11599 from gatorsmile/projectOneRowRelation.