Skip to content

Conversation

@jiangxb1987
Copy link
Contributor

@jiangxb1987 jiangxb1987 commented Sep 1, 2016

What changes were proposed in this pull request?

In ReorderAssociativeOperator rule, we extract foldable expressions with Add/Multiply arithmetics, and replace with eval literal. For example, (a + 1) + (b + 2) is optimized to (a + b + 3) by this rule.
For aggregate operator, output expressions should be derived from groupingExpressions, current implemenation of ReorderAssociativeOperator rule may break this promise. A instance could be:

SELECT
  ((t1.a + 1) + (t2.a + 2)) AS out_col
FROM
  testdata2 AS t1
INNER JOIN
  testdata2 AS t2
ON
  (t1.a = t2.a)
GROUP BY (t1.a + 1), (t2.a + 2)

((t1.a + 1) + (t2.a + 2)) is optimized to (t1.a + t2.a + 3), which could not be derived from ExpressionSet((t1.a +1), (t2.a + 2)).
Maybe we should improve the rule of ReorderAssociativeOperator by adding a GroupingExpressionSet to keep Aggregate.groupingExpressions, and respect these expressions during the optimize stage.

How was this patch tested?

Add new test case in ReorderAssociativeOperatorSuite.

@SparkQA
Copy link

SparkQA commented Sep 1, 2016

Test build #64774 has finished for PR 14917 at commit dc3b1b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

Cool! Thanks for picking this up. cc @JoshRosen

Could you make the description a bit more readable by using markdown formatting (mostly adding 1 or 3 backticks here and there)?

@hvanhovell
Copy link
Contributor

I will try to get to this one ASAP.

@jiangxb1987
Copy link
Contributor Author

@hvanhovell I've updated the description following your advice, thank you for your time!

@jiangxb1987
Copy link
Contributor Author

@hvanhovell Could you review this PR when you have some time please? Thank you!

case other => other :: Nil
}

private def collectGroupingExpressions(plan: LogicalPlan): ExpressionSet = plan match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets move this into the apply (you already have the relevant comment there).

@hvanhovell
Copy link
Contributor

LGTM. Merging to master. This is a good find. Thanks!

@hvanhovell
Copy link
Contributor

hvanhovell commented Sep 13, 2016

We should actually make sure that Aggregate is not resolved in these cases. Could you address this in a follow-up?

@hvanhovell
Copy link
Contributor

@jiangxb1987 could also open a backport against 2.0? Thanks!

@asfgit asfgit closed this in 4ba63b1 Sep 13, 2016
@hvanhovell
Copy link
Contributor

NVM.

jiangxb1987 added a commit to jiangxb1987/spark that referenced this pull request Sep 14, 2016
…ateExec

In `ReorderAssociativeOperator` rule, we extract foldable expressions with Add/Multiply arithmetics, and replace with eval literal. For example, `(a + 1) + (b + 2)` is optimized to `(a + b + 3)` by this rule.
For aggregate operator, output expressions should be derived from groupingExpressions, current implemenation of `ReorderAssociativeOperator` rule may break this promise. A instance could be:
```
SELECT
  ((t1.a + 1) + (t2.a + 2)) AS out_col
FROM
  testdata2 AS t1
INNER JOIN
  testdata2 AS t2
ON
  (t1.a = t2.a)
GROUP BY (t1.a + 1), (t2.a + 2)
```
`((t1.a + 1) + (t2.a + 2))` is optimized to `(t1.a + t2.a + 3)`, which could not be derived from `ExpressionSet((t1.a +1), (t2.a + 2))`.
Maybe we should improve the rule of `ReorderAssociativeOperator` by adding a GroupingExpressionSet to keep Aggregate.groupingExpressions, and respect these expressions during the optimize stage.

Add new test case in `ReorderAssociativeOperatorSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes apache#14917 from jiangxb1987/rao.
@jiangxb1987
Copy link
Contributor Author

jiangxb1987 commented Sep 14, 2016

@hvanhovell Thank you! This bug was imported in spark-2.1.0, and in spark-2.0 we don't have the problem. So maybe we don't need to open backport against 2.0.

@jiangxb1987
Copy link
Contributor Author

@hvanhovell Do you mean we should check other optimize rules to ensure that Aggregate operator is not resolved in these cases?

@hvanhovell
Copy link
Contributor

No, I am saying that it might be a good idea to incorporate this rule into Aggregate.resolved. This might make that method to complex.

@jiangxb1987
Copy link
Contributor Author

Sure!Will do it soon!

wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
…ateExec

## What changes were proposed in this pull request?

In `ReorderAssociativeOperator` rule, we extract foldable expressions with Add/Multiply arithmetics, and replace with eval literal. For example, `(a + 1) + (b + 2)` is optimized to `(a + b + 3)` by this rule.
For aggregate operator, output expressions should be derived from groupingExpressions, current implemenation of `ReorderAssociativeOperator` rule may break this promise. A instance could be:
```
SELECT
  ((t1.a + 1) + (t2.a + 2)) AS out_col
FROM
  testdata2 AS t1
INNER JOIN
  testdata2 AS t2
ON
  (t1.a = t2.a)
GROUP BY (t1.a + 1), (t2.a + 2)
```
`((t1.a + 1) + (t2.a + 2))` is optimized to `(t1.a + t2.a + 3)`, which could not be derived from `ExpressionSet((t1.a +1), (t2.a + 2))`.
Maybe we should improve the rule of `ReorderAssociativeOperator` by adding a GroupingExpressionSet to keep Aggregate.groupingExpressions, and respect these expressions during the optimize stage.

## How was this patch tested?

Add new test case in `ReorderAssociativeOperatorSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes apache#14917 from jiangxb1987/rao.
@jiangxb1987
Copy link
Contributor Author

jiangxb1987 commented Sep 21, 2016

@hvanhovell In CheckAnalysis we have ensured all aggregateExpressions can be derived from groupingExpressions (refer to function checkValidAggregateExpression), but in optimize stage the rule ReorderAssociativeOperator broke this. I found the Aggregate.resolved is only checked in analyze stage so we failed to detect this case. So, maybe incorporate this rule into Aggregate.resolved brings little help to avoid similar problem. Should we check analysis again after we have performed all optimize rules? Or should we assert that in HashAggregateExec? Thank you!

@jiangxb1987 jiangxb1987 deleted the rao branch October 17, 2016 08:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants