[SPARK-6466][SQL] Remove unnecessary attributes when resolving GroupingSets #5134

viirya · 2015-03-23T10:08:55Z

When resolving GroupingSets, we currently list all outputs of the GroupingSets child plan. However, columns that are neither in the groupBy expressions nor used by the aggregation expressions are unnecessary and can be removed.
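
To make the idea concrete, here is a minimal sketch of the pruning rule described above. It is a toy model and not the Catalyst code in this patch: columns are plain strings, and `PruneGroupingSetsOutput`, `requiredColumns`, and the column names are hypothetical names chosen only for illustration.

```scala
// Toy model only: columns are plain strings, not Catalyst attributes.
object PruneGroupingSetsOutput {
  // Keep a child output column only if it is a groupBy column or is
  // referenced by some aggregation expression (hypothetical helper).
  def requiredColumns(
      childOutput: Seq[String],
      groupByColumns: Set[String],
      aggregateReferences: Set[String]): Seq[String] =
    childOutput.filter { c =>
      groupByColumns.contains(c) || aggregateReferences.contains(c)
    }

  def main(args: Array[String]): Unit = {
    // Assumed query shape: SELECT a, b, sum(c) ... GROUP BY a, b GROUPING SETS (...)
    val childOutput = Seq("a", "b", "c", "d")
    val kept = requiredColumns(childOutput, Set("a", "b"), Set("c"))
    println(kept.mkString(", ")) // column d is neither grouped nor aggregated
  }
}
```

In this sketch the unused column `d` is dropped, while the relative order of the remaining child columns is preserved.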

SparkQA · 2015-03-23T10:58:28Z

Test build #28988 has finished for PR 5134 at commit 8e16206.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

chenghao-intel · 2015-03-23T13:33:02Z

It seems more reasonable to me to do this in the Optimizer. What do you think?

viirya · 2015-03-23T13:59:14Z

Good suggestion. I will do that later. Thanks.

SparkQA · 2015-03-24T11:57:38Z

Test build #29080 has finished for PR 5134 at commit a2734b3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2015-03-27T23:09:50Z

/cc @liancheng @marmbrus

marmbrus · 2015-04-12T02:00:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+      val substitution = projections.map { groupExpr =>
+        val newExprs = groupExpr.collect {
+          case x: NamedExpression if a.references.contains(x) => x
+          case l: Literal => l


Why do we need special handling here?

viirya · 2015-04-28

Because there are some constant null values and bitmasks that we need to keep.
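
The following sketch illustrates why the literal case matters. It is a toy model under stated assumptions, not the real Catalyst classes: `Expr`, `Attr`, `Lit`, and `prune` are hypothetical stand-ins, and the bitmask value `1` is purely illustrative.

```scala
// Toy expression model; Attr and Lit are hypothetical stand-ins for
// Catalyst's NamedExpression and Literal.
object KeepLiteralsSketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Any) extends Expr

  // Keep referenced attributes and every literal, mirroring the two
  // cases of the collect in the reviewed snippet.
  def prune(projection: Seq[Expr], referenced: Set[String]): Seq[Expr] =
    projection.collect {
      case a @ Attr(name) if referenced.contains(name) => a
      case l: Lit => l
    }

  def main(args: Array[String]): Unit = {
    // Assumed projection for one grouping set: a is kept, a grouping
    // column is nulled out, d is an unused child column, and the last
    // literal plays the role of the bitmask.
    val projection = Seq(Attr("a"), Lit(null), Attr("d"), Lit(1))
    println(prune(projection, Set("a")))
  }
}
```

Without the literal case, the null placeholder and the bitmask would be filtered out together with `d`, and the pruned projection would no longer describe its grouping set.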

marmbrus · 2015-04-12T02:13:22Z

This change should also be accompanied by test cases. There are several examples in the optimizer section of the tests.

SparkQA · 2015-04-27T18:20:52Z

Test build #31045 has started for PR 5134 at commit a2734b3.

chenghao-intel · 2015-04-28T08:09:03Z

I don't think this is the correct way to do this optimization. Aggregate takes the output of Expand for expression evaluation, not the projections!
GroupingSets probably needs to be refactored a little before this optimization; I will submit the code soon.

Sorry @viirya, I will ping you once the refactoring is finished.

viirya · 2015-04-28T08:19:59Z

@chenghao-intel Expand's output is modified too in this optimization.

chenghao-intel · 2015-04-28T08:32:39Z

I mean that projections.map{ groupExpr.collect is not necessary, as the output will exactly match the projections.
Since we will be pruning the output anyway, why not just remove the associated columns from the projections directly?

Specifying the output for a logical node is actually quite confusing, as Catalyst will also treat it as a referenced attribute; that's why we need to refactor it.

viirya · 2015-04-28T08:44:26Z

@chenghao-intel I originally removed the unnecessary columns from the projections in the Analyzer, as you said. However, you suggested that it is more reasonable to move it to the Optimizer.

chenghao-intel · 2015-04-28T08:49:35Z

Sorry, I didn't make it clear: we definitely should do the column pruning in the Optimizer. Expand also needs to be refactored.

viirya · 2015-04-28T08:51:56Z

ok. I will update the unit test first.

SparkQA · 2015-04-28T10:40:59Z

Test build #31137 has finished for PR 5134 at commit d5dadec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

viirya · 2015-05-21T08:07:52Z

@chenghao-intel Do you think we still need this? If not, I will close it.

chenghao-intel · 2015-05-21T12:27:20Z

@viirya #5780 should fix that as well. Can you jump over there for the code review? Sorry, I will update that PR ASAP.

viirya · 2015-05-21T16:40:15Z

@chenghao-intel OK, then I will close this now.

Only keep necessary attribute output.

8e16206

Move it to Optimizer.

a2734b3

marmbrus reviewed Apr 12, 2015
Merge remote-tracking branch 'upstream/master' into remove_attr_expand

8faa8fa

Add comment and unit test.

d5dadec

viirya closed this May 21, 2015

viirya deleted the remove_attr_expand branch December 27, 2023 18:17
