[SPARK-17142][SQL] Complex query triggers binding error in HashAggregateExec #14917

jiangxb1987 · 2016-09-01T10:57:28Z

What changes were proposed in this pull request?

In the ReorderAssociativeOperator rule, we extract foldable expressions from Add/Multiply arithmetic and replace them with an evaluated literal. For example, (a + 1) + (b + 2) is optimized to (a + b + 3) by this rule.
For the Aggregate operator, output expressions should be derivable from groupingExpressions, but the current implementation of the ReorderAssociativeOperator rule may break this promise. An example:

SELECT
  ((t1.a + 1) + (t2.a + 2)) AS out_col
FROM
  testdata2 AS t1
INNER JOIN
  testdata2 AS t2
ON
  (t1.a = t2.a)
GROUP BY (t1.a + 1), (t2.a + 2)

((t1.a + 1) + (t2.a + 2)) is optimized to (t1.a + t2.a + 3), which can no longer be derived from ExpressionSet((t1.a + 1), (t2.a + 2)).
Maybe we should improve the ReorderAssociativeOperator rule by adding a GroupingExpressionSet that keeps Aggregate.groupingExpressions, and respect these expressions during the optimization stage.
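
The idea can be sketched on a toy expression tree. This is only an illustration under stated assumptions: `Expr`, `Attr`, `Lit`, `Add`, `flatten`, and `reorder` are hypothetical names defined here, not Spark's Catalyst classes, and a plain `Set` with structural equality stands in for `ExpressionSet`. The sketch treats any grouping expression as an opaque leaf while flattening a chain of additions, so its literals are never pulled out and folded.

```scala
object ReorderSketch {
  // Toy expression tree (hypothetical, not Catalyst).
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Lit(value: Int) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // Flatten nested additions into a list of operands.
  // An Add node that is itself a grouping expression is kept whole.
  def flatten(e: Expr, grouping: Set[Expr]): List[Expr] = e match {
    case Add(l, r) if !grouping.contains(e) =>
      flatten(l, grouping) ++ flatten(r, grouping)
    case other => other :: Nil
  }

  // Fold all literal operands into one literal, but only when there is
  // more than one literal to fold; otherwise return the input unchanged.
  def reorder(e: Expr, grouping: Set[Expr]): Expr = {
    val (lits, others) = flatten(e, grouping).partition(_.isInstanceOf[Lit])
    if (lits.size > 1) {
      val folded: Expr = Lit(lits.collect { case Lit(v) => v }.sum)
      if (others.isEmpty) folded
      else Add(others.reduceLeft[Expr]((x, y) => Add(x, y)), folded)
    } else e
  }

  def main(args: Array[String]): Unit = {
    val g1 = Add(Attr("t1.a"), Lit(1))
    val g2 = Add(Attr("t2.a"), Lit(2))
    val out = Add(g1, g2)
    // Without grouping information the two literals are folded together.
    println(reorder(out, Set.empty[Expr]))
    // With the grouping expressions protected, the tree is left as it is.
    println(reorder(out, Set[Expr](g1, g2)))
  }
}
```

With an empty grouping set the example expression becomes `(t1.a + t2.a) + 3`; with both grouping expressions in the set it comes back unchanged, so it can still be bound to the grouping columns.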

How was this patch tested?

Add new test case in ReorderAssociativeOperatorSuite.

SparkQA · 2016-09-01T12:56:53Z

Test build #64774 has finished for PR 14917 at commit dc3b1b2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-09-01T12:58:43Z

Cool! Thanks for picking this up. cc @JoshRosen

Could you make the description a bit more readable by using markdown formatting (mostly adding 1 or 3 backticks here and there)?

hvanhovell · 2016-09-01T13:00:48Z

I will try to get to this one ASAP.

jiangxb1987 · 2016-09-01T14:39:26Z

@hvanhovell I've updated the description following your advice, thank you for your time!

jiangxb1987 · 2016-09-12T15:37:21Z

@hvanhovell Could you review this PR when you have some time please? Thank you!

hvanhovell · 2016-09-13T14:58:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

    case other => other :: Nil
  }

+  private def collectGroupingExpressions(plan: LogicalPlan): ExpressionSet = plan match {


Let's move this into the apply (you already have the relevant comment there).

hvanhovell · 2016-09-13T15:03:25Z

LGTM. Merging to master. This is a good find. Thanks!

hvanhovell · 2016-09-13T15:03:58Z

We should actually make sure that Aggregate is not resolved in these cases. Could you address this in a follow-up?

hvanhovell · 2016-09-13T15:06:04Z

@jiangxb1987 could you also open a backport against 2.0? Thanks!

hvanhovell · 2016-09-13T20:07:05Z

NVM.

…ateExec

In the `ReorderAssociativeOperator` rule, we extract foldable expressions from Add/Multiply arithmetic and replace them with an evaluated literal. For example, `(a + 1) + (b + 2)` is optimized to `(a + b + 3)` by this rule.
For the Aggregate operator, output expressions should be derivable from groupingExpressions, but the current implementation of the `ReorderAssociativeOperator` rule may break this promise. An example:

```
SELECT
  ((t1.a + 1) + (t2.a + 2)) AS out_col
FROM
  testdata2 AS t1
INNER JOIN
  testdata2 AS t2
ON
  (t1.a = t2.a)
GROUP BY (t1.a + 1), (t2.a + 2)
```

`((t1.a + 1) + (t2.a + 2))` is optimized to `(t1.a + t2.a + 3)`, which can no longer be derived from `ExpressionSet((t1.a + 1), (t2.a + 2))`.
Maybe we should improve the `ReorderAssociativeOperator` rule by adding a GroupingExpressionSet that keeps Aggregate.groupingExpressions, and respect these expressions during the optimization stage.

Add new test case in `ReorderAssociativeOperatorSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes apache#14917 from jiangxb1987/rao.

jiangxb1987 · 2016-09-14T07:34:45Z

@hvanhovell Thank you! This bug was introduced in spark-2.1.0, and spark-2.0 does not have the problem. So maybe we don't need to open a backport against 2.0.

jiangxb1987 · 2016-09-14T07:42:00Z

@hvanhovell Do you mean we should check the other optimizer rules to ensure that the Aggregate operator is not resolved in these cases?

hvanhovell · 2016-09-14T09:38:37Z

No, I am saying that it might be a good idea to incorporate this rule into Aggregate.resolved. This might make that method too complex, though.

jiangxb1987 · 2016-09-14T09:43:34Z

Sure! Will do it soon!


jiangxb1987 · 2016-09-21T03:46:26Z

@hvanhovell In CheckAnalysis we already ensure that all aggregateExpressions can be derived from groupingExpressions (see the function checkValidAggregateExpression), but in the optimization stage the ReorderAssociativeOperator rule broke this. I found that Aggregate.resolved is only checked in the analysis stage, so we failed to detect this case. So incorporating this rule into Aggregate.resolved may do little to avoid similar problems. Should we run the analysis check again after all optimizer rules have been applied? Or should we assert this in HashAggregateExec? Thank you!
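
To make the invariant under discussion concrete, here is a minimal sketch of a "derivable from grouping expressions" check on a toy expression tree. It is an assumption-laden simplification, not the logic of checkValidAggregateExpression: `Expr`, `Attr`, `Lit`, `Add`, and `derivable` are hypothetical names, aggregate functions are left out entirely, and a plain `Set` with structural equality stands in for semantic comparison.

```scala
object DerivableSketch {
  // Toy expression tree (hypothetical, not Catalyst).
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Lit(value: Int) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // An output expression is derivable when it is a grouping expression,
  // a literal, or built only from derivable children. A bare attribute
  // that is not itself a grouping expression is not derivable.
  def derivable(e: Expr, grouping: Set[Expr]): Boolean = e match {
    case _ if grouping.contains(e) => true
    case Lit(_)                    => true
    case Attr(_)                   => false
    case Add(l, r)                 => derivable(l, grouping) && derivable(r, grouping)
  }
}
```

Under this sketch, `(t1.a + 1) + (t2.a + 2)` passes against the grouping set from the query above, while the reordered `(t1.a + t2.a) + 3` fails because `t1.a` and `t2.a` appear outside any grouping expression. Running such a check after optimization is one way the broken plan could be caught earlier.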

bugfix

dc3b1b2

hvanhovell reviewed Sep 13, 2016

asfgit closed this in 4ba63b1 Sep 13, 2016

jiangxb1987 mentioned this pull request Sep 14, 2016

[SPARK-17142][SQL] Complex query triggers binding error in HashAggregateExec [BACKPORT 2.0] #15092

Closed

jiangxb1987 deleted the rao branch October 17, 2016 08:23
