[SPARK-12727][SQL] support SQL generation for aggregate with multi-distinct #11579

cloud-fan · 2016-03-08T14:12:03Z

What changes were proposed in this pull request?

This PR add SQL generation support for aggregate with multi-distinct, by simply moving the DistinctAggregationRewriter rule to optimizer.

More discussions are needed as this breaks an import contract: analyzed plan should be able to run without optimization. However, the ComputeCurrentTime rule has kind of broken it already, and I think maybe we should add a new phase for this kind of rules, because strictly speaking they don't belong to analysis and is coupled with the physical plan implementation.

How was this patch tested?

existing tests

cloud-fan · 2016-03-08T14:13:26Z

cc @yhuai @liancheng @gatorsmile

hvanhovell · 2016-03-08T14:45:50Z

This seems like the most sensible approach in order to get SQL generation working. In the initial implementation we applied the multi-distinct rewriter during the physical planning. We could move it back there.

LGTM pending Jenkins

SparkQA · 2016-03-08T16:01:44Z

Test build #52668 has finished for PR 11579 at commit 1735be0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-03-08T19:44:55Z

Thanks - merging in master.

gatorsmile · 2016-03-08T21:55:40Z

It is a brilliant idea! : )

…stinct ## What changes were proposed in this pull request? This PR add SQL generation support for aggregate with multi-distinct, by simply moving the `DistinctAggregationRewriter` rule to optimizer. More discussions are needed as this breaks an import contract: analyzed plan should be able to run without optimization. However, the `ComputeCurrentTime` rule has kind of broken it already, and I think maybe we should add a new phase for this kind of rules, because strictly speaking they don't belong to analysis and is coupled with the physical plan implementation. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes apache#11579 from cloud-fan/distinct.

…distinct aggregate function #### Symptom: In the latest **branch 1.6**, when a `DISTINCT` aggregation function is used in the `HAVING` clause, Analyzer throws `AnalysisException` with a message like following: ``` resolved attribute(s) gid#558,id#559 missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], [date#554,id#561,gid#560,if ((gid = 1)) id else null#562]; ``` #### Root cause: The problem is that the distinct aggregate in having condition are resolved by the rule `DistinctAggregationRewriter` twice, which messes up the resulted `EXPAND` operator. In a `ResolveAggregateFunctions` rule, when resolving ```Filter(havingCondition, _: Aggregate)```, the `havingCondition` is resolved as an `Aggregate` in a nested loop of analyzer rule execution (by invoking `RuleExecutor.execute`). At this nested level of analysis, the rule `DistinctAggregationRewriter` rewrites this distinct aggregate clause to an expanded two-layer aggregation, where the `aggregateExpresssions` of the final `Aggregate` contains the resolved `gid` and the aggregate expression attributes (In the above case, they are `gid#558, id#559`). After completion of the nested analyzer rule execution, the resulted `aggregateExpressions` in the `havingCondition` is pushed down into the underlying `Aggregate` operator. The `DistinctAggregationRewriter` rule is executed again. The `projections` field of `EXPAND` operator is populated with the `aggregateExpressions` of the `havingCondition` mentioned above. However, the attributes (In the above case, they are `gid#558, id#559`) in the projection list of `EXPAND` operator can not be found in the underlying relation. #### Solution: This PR retrofits part of [#11579](#11579) that moves the `DistinctAggregationRewriter` to the beginning of Optimizer, so that it guarantees that the rewrite only happens after all the aggregate functions are resolved first. Thus, it avoids resolution failure. #### How is the PR change tested New [test cases ](https://github.com/xwu0226/spark/blob/f73428f94746d6d074baf6702589545bdbd11cad/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala#L927-L988) are added to drive `DistinctAggregationRewriter` rewrites for multi-distinct aggregations , involving having clause. A following up PR will be submitted to add these test cases to master(2.0) branch. Author: xin Wu <xinwu@us.ibm.com> Closes #12974 from xwu0226/SPARK-14495_review.

…distinct aggregate function #### Symptom: In the latest **branch 1.6**, when a `DISTINCT` aggregation function is used in the `HAVING` clause, Analyzer throws `AnalysisException` with a message like following: ``` resolved attribute(s) gid#558,id#559 missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], [date#554,id#561,gid#560,if ((gid = 1)) id else null#562]; ``` #### Root cause: The problem is that the distinct aggregate in having condition are resolved by the rule `DistinctAggregationRewriter` twice, which messes up the resulted `EXPAND` operator. In a `ResolveAggregateFunctions` rule, when resolving ```Filter(havingCondition, _: Aggregate)```, the `havingCondition` is resolved as an `Aggregate` in a nested loop of analyzer rule execution (by invoking `RuleExecutor.execute`). At this nested level of analysis, the rule `DistinctAggregationRewriter` rewrites this distinct aggregate clause to an expanded two-layer aggregation, where the `aggregateExpresssions` of the final `Aggregate` contains the resolved `gid` and the aggregate expression attributes (In the above case, they are `gid#558, id#559`). After completion of the nested analyzer rule execution, the resulted `aggregateExpressions` in the `havingCondition` is pushed down into the underlying `Aggregate` operator. The `DistinctAggregationRewriter` rule is executed again. The `projections` field of `EXPAND` operator is populated with the `aggregateExpressions` of the `havingCondition` mentioned above. However, the attributes (In the above case, they are `gid#558, id#559`) in the projection list of `EXPAND` operator can not be found in the underlying relation. #### Solution: This PR retrofits part of [apache#11579](apache#11579) that moves the `DistinctAggregationRewriter` to the beginning of Optimizer, so that it guarantees that the rewrite only happens after all the aggregate functions are resolved first. Thus, it avoids resolution failure. #### How is the PR change tested New [test cases ](https://github.com/xwu0226/spark/blob/f73428f94746d6d074baf6702589545bdbd11cad/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala#L927-L988) are added to drive `DistinctAggregationRewriter` rewrites for multi-distinct aggregations , involving having clause. A following up PR will be submitted to add these test cases to master(2.0) branch. Author: xin Wu <xinwu@us.ibm.com> Closes apache#12974 from xwu0226/SPARK-14495_review. (cherry picked from commit d165486)

support SQL generation for aggregate with multi-distinct

1735be0

asfgit closed this in 46881b4 Mar 8, 2016

xwu0226 mentioned this pull request May 7, 2016

[SPARK-14495][SQL][1.6] fix resolution failure of having clause with distinct aggregate function #12974

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-12727][SQL] support SQL generation for aggregate with multi-distinct #11579

[SPARK-12727][SQL] support SQL generation for aggregate with multi-distinct #11579

cloud-fan commented Mar 8, 2016

cloud-fan commented Mar 8, 2016

hvanhovell commented Mar 8, 2016

SparkQA commented Mar 8, 2016

rxin commented Mar 8, 2016

gatorsmile commented Mar 8, 2016

[SPARK-12727][SQL] support SQL generation for aggregate with multi-distinct #11579

[SPARK-12727][SQL] support SQL generation for aggregate with multi-distinct #11579

Conversation

cloud-fan commented Mar 8, 2016

What changes were proposed in this pull request?

How was this patch tested?

cloud-fan commented Mar 8, 2016

hvanhovell commented Mar 8, 2016

SparkQA commented Mar 8, 2016

rxin commented Mar 8, 2016

gatorsmile commented Mar 8, 2016