[SPARK-26390][SQL] ColumnPruning rule should only do column pruning #23343

Closed
wants to merge 2 commits into from

cloud-fan
Contributor

What changes were proposed in this pull request?

This is a small clean up.

By design, Catalyst rules should be orthogonal: each rule should have its own responsibility. However, the ColumnPruning rule not only prunes columns but also removes no-op Project and Window operators.

This PR updates the RemoveRedundantProject rule to remove no-op Window operators as well, and cleans up the ColumnPruning rule so that it only does column pruning.
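
As a rough illustration of a rule whose only job is to remove no-op operators, here is a minimal, self-contained sketch over a toy plan tree. The `Plan`, `Relation`, `Project`, and `Window` types and the `removeNoops` function are hypothetical stand-ins and not Catalyst's classes; comparing column-name lists only approximates what the real rule checks.

```scala
object NoopSketch {
  // Toy plan tree; hypothetical stand-ins for Catalyst's logical plan nodes.
  sealed trait Plan { def output: Seq[String] }
  case class Relation(output: Seq[String]) extends Plan
  case class Project(columns: Seq[String], child: Plan) extends Plan {
    def output: Seq[String] = columns
  }
  case class Window(windowExprs: Seq[String], child: Plan) extends Plan {
    def output: Seq[String] = child.output ++ windowExprs
  }

  // Bottom-up rewrite that only drops operators which change nothing.
  // It does no column pruning; that stays the job of a separate rule.
  def removeNoops(plan: Plan): Plan = plan match {
    case Project(cols, child) =>
      val c = removeNoops(child)
      // A Project that emits exactly its child's columns is a no-op.
      if (cols == c.output) c else Project(cols, c)
    case Window(exprs, child) =>
      val c = removeNoops(child)
      // A Window without window expressions is a no-op.
      if (exprs.isEmpty) c else Window(exprs, c)
    case r: Relation => r
  }
}
```

With `val rel = NoopSketch.Relation(Seq("a", "b"))`, calling `removeNoops` on `Project(Seq("a", "b"), rel)` or on `Window(Nil, rel)` yields `rel`, while a narrowing `Project(Seq("a"), rel)` is kept.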

How was this patch tested?

existing tests

@cloud-fan
Contributor Author

@SparkQA

SparkQA commented Dec 18, 2018

Test build #100274 has finished for PR 23343 at commit aad89d3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -93,7 +93,7 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
 RewriteCorrelatedScalarSubquery,
 EliminateSerialization,
 RemoveRedundantAliases,
-RemoveRedundantProject,
+RemoveNoopOperators,
Contributor

nit: RemoveUselessOperators?

Member

I think Noop is fine too. It's no-op, right :-)?

Contributor

Yes, it is fine too. I preferred Useless because these operators actually do something, so they introduce useless overhead. Anyway, not a big deal.

 // Can't prune the columns on LeafNode
 case p @ Project(_, _: LeafNode) => p

 // for all other logical plans that inherits the output from it's children
-case p @ Project(_, child) =>
+case p @ Project(_, child) if !child.isInstanceOf[Project] =>
Contributor

In case the child is a Project, shall we still update it with c.output.filter(allReferences.contains)? I mean, can we instead update the prunedChild method to check whether c is a Project and behave accordingly?

Contributor Author

We already handled project over project at L542

Contributor

yes, I see, makes sense, thanks.

Member

This requires a code comment. I believe others will ask the same question when they read this code again.

@cloud-fan
Contributor Author

retest this please

@@ -34,6 +34,7 @@ class ColumnPruningSuite extends PlanTest {
 val batches = Batch("Column pruning", FixedPoint(100),
 PushDownPredicate,
 ColumnPruning,
+RemoveNoopOperators,
Member

Is this added to remove the top Project('b :: Nil ...)? Without that, can this test stay unchanged?

Contributor Author

without this, a lot more tests need to be updated...

Member

ah, I see. :)

ColumnPruning) ::
Batch("Column Pruning", FixedPoint(100),
ColumnPruning,
RemoveNoopOperators) ::
Member

nit: It does not look precise to have RemoveNoopOperators in the Column Pruning batch, but it is fine as this is just a test.

@SparkQA

SparkQA commented Dec 18, 2018

Test build #100284 has finished for PR 23343 at commit aad89d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

In general, we do not encourage code refactoring in optimizer rules. Any change could introduce a perf regression in query planning. For this PR, I think it is safe, although it could slightly increase the number of runs in the big optimizer batch.

LGTM. Please update the comment.
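
For readers unfamiliar with what a "run" of a batch means here, the following is a minimal sketch of a fixed-point batch loop that counts passes. It is not Spark's rule executor: `FixedPointSketch`, `Rule`, and `runBatch` are hypothetical names, and an `Int` stands in for a plan in the usage note. The idea it illustrates is that every rule in a batch is applied once per pass, and passes repeat until nothing changes, so moving cleanup work from one rule into another can change how many passes are needed.

```scala
object FixedPointSketch {
  // A rule is modelled as a plain function on plans (hypothetical simplification).
  type Rule[P] = P => P

  // Apply all rules in order, and repeat until a pass leaves the plan
  // unchanged or maxIterations is reached.
  // Returns the final plan together with the number of passes executed.
  def runBatch[P](plan: P, rules: Seq[Rule[P]], maxIterations: Int): (P, Int) = {
    var current = plan
    var iterations = 0
    var changed = true
    while (changed && iterations < maxIterations) {
      // One pass: thread the plan through every rule in the batch.
      val next = rules.foldLeft(current)((p, rule) => rule(p))
      changed = next != current
      current = next
      iterations += 1
    }
    (current, iterations)
  }
}
```

As a usage example, with `val halve: Int => Int = n => if (n % 2 == 0) n / 2 else n`, calling `FixedPointSketch.runBatch[Int](8, Seq(halve), 100)` takes three passes that change the value and a fourth pass that confirms the fixed point.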

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

case p @ Project(_, child) if child.sameOutput(p) => child

// Eliminate no-op Window
case w: Window if w.windowExpressions.isEmpty => w.child
}
}
Member

@cloud-fan . Is this too small to move out as a separate file during this refactoring?

Contributor Author

It's small and it has been here for a while. We can move many of the rules here to separate files in another PR.

@dongjoon-hyun
Member

cc @dbtsai

@cloud-fan
Contributor Author

retest this please

@dongjoon-hyun
Member

The final commit is only about adding a comment. I'll merge this to master. Thank you all.

@asfgit asfgit closed this in 08f74ad Dec 19, 2018
@dbtsai
Member

dbtsai commented Dec 19, 2018

@dongjoon-hyun thanks for pinging me. I'm working on ColumnPruning to add support for pruning nested fields. Since Optimizer.scala is such a huge file now, I'll go ahead and submit a PR to move ColumnPruning out into a separate file. Thanks!

@gatorsmile
Member

@dbtsai It will make it hard for us to track the change history. Let us avoid moving the existing rules to new files.

@dbtsai
Member

dbtsai commented Dec 20, 2018

@gatorsmile for example, we refactored ReplaceNullWithFalseInPredicate out of expressions.scala because that file had grown too big, #23139.

I'm wondering when we should or should not do such refactoring. I created PR #23359 so we can discuss it over there.

@gatorsmile
Member

ReplaceNullWithFalseInPredicate is a new rule in 3.0

holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
## What changes were proposed in this pull request?

This is a small clean up.

By design, Catalyst rules should be orthogonal: each rule should have its own responsibility. However, the `ColumnPruning` rule not only prunes columns but also removes no-op Project and Window operators.

This PR updates the `RemoveRedundantProject` rule to remove no-op Window operators as well, and cleans up the `ColumnPruning` rule so that it only does column pruning.

## How was this patch tested?

existing tests

Closes apache#23343 from cloud-fan/column-pruning.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
8 participants