[SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT #27428

beliefer · 2020-02-01T14:11:33Z

What changes were proposed in this pull request?

This PR is related to #26656.
#26656 only supports the FILTER clause on aggregate expressions without DISTINCT.
This PR extends that feature so that one or more DISTINCT aggregate expressions can also use the FILTER clause.
For example:

select sum(distinct id) filter (where sex = 'man') from student;
select class_id, sum(distinct id) filter (where sex = 'man') from student group by class_id;
select count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student;
select class_id, count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id;
select sum(distinct id), sum(distinct id) filter (where sex = 'man') from student;
select class_id, sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
select class_id, count(id), count(id) filter (where class_id = 1), sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
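To make the intended semantics concrete, here is a minimal sketch in plain Python (not Spark code) of what `sum(distinct id) filter (where sex = 'man')` is expected to return, with and without `group by class_id`. The sample rows and the helper names `sum_distinct_filter` and `group_by` are made up for illustration.

```python
# Hypothetical sample rows standing in for the `student` table.
students = [
    {"id": 1, "sex": "man", "class_id": 1},
    {"id": 1, "sex": "man", "class_id": 1},    # duplicate id, counted once by DISTINCT
    {"id": 2, "sex": "woman", "class_id": 1},  # rejected by the FILTER clause
    {"id": 3, "sex": "man", "class_id": 2},
]


def sum_distinct_filter(rows, column, predicate):
    """Emulate sum(DISTINCT column) FILTER (WHERE predicate)."""
    # The filter decides which rows take part; DISTINCT then deduplicates the values.
    kept = {row[column] for row in rows if predicate(row) and row[column] is not None}
    # SQL SUM over an empty input is NULL, modelled here as None.
    return sum(kept) if kept else None


def group_by(rows, key):
    """Bucket rows by the value of `key`, like GROUP BY."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    return groups


def is_man(row):
    return row["sex"] == "man"


# select sum(distinct id) filter (where sex = 'man') from student;
total = sum_distinct_filter(students, "id", is_man)

# select class_id, sum(distinct id) filter (where sex = 'man')
# from student group by class_id;
per_class = {
    class_id: sum_distinct_filter(rows, "id", is_man)
    for class_id, rows in group_by(students, "class_id").items()
}
```

The filter only restricts the rows seen by the aggregate it is attached to; other aggregates in the same select list still see every row.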

Note:
In #26656, we use AggregationIterator to evaluate the filter conditions of aggregate expressions. This works well because the filter can be evaluated locally in the first aggregate.
If we also used AggregationIterator here, the filter conditions of DISTINCT aggregate expressions would be evaluated in the second or third aggregate.
To reduce cost, it is better to evaluate the filter conditions of DISTINCT aggregate expressions locally, in the first aggregate.
So this PR uses Expand to ensure the filters are evaluated locally.
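The idea in the note can be sketched roughly as follows. This is a simplified plain-Python model, not the actual Catalyst rule: the sample rows and the functions `expand`, `first_aggregate`, and `second_aggregate` are hypothetical. It models `select class_id, sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id`, with the filter evaluated per input row inside the expand step, before any deduplication.

```python
# Hypothetical sample rows standing in for the `student` table.
students = [
    {"id": 1, "sex": "man", "class_id": 1},
    {"id": 1, "sex": "man", "class_id": 1},
    {"id": 2, "sex": "woman", "class_id": 1},
    {"id": 3, "sex": "man", "class_id": 2},
]


def expand(rows):
    """Emit one row per distinct group, tagged with a group id (gid)."""
    for r in rows:
        # gid 1: sum(distinct id), no filter.
        yield (r["class_id"], r["id"], None, 1)
        # gid 2: sum(distinct id) filter (where sex = 'man').
        # The filter is applied here, row by row: rejected rows become NULL.
        yield (r["class_id"], None, r["id"] if r["sex"] == "man" else None, 2)


def first_aggregate(expanded):
    """Group by every column, which removes duplicate distinct values."""
    return set(expanded)


def second_aggregate(deduped):
    """Sum each distinct column per key, only over rows of the matching gid."""
    result = {}
    for class_id, distinct_1, distinct_2, gid in deduped:
        acc = result.setdefault(class_id, [None, None])
        if gid == 1 and distinct_1 is not None:
            acc[0] = (acc[0] or 0) + distinct_1
        if gid == 2 and distinct_2 is not None:
            acc[1] = (acc[1] or 0) + distinct_2
    return {key: tuple(values) for key, values in result.items()}


plan_output = second_aggregate(first_aggregate(expand(students)))
```

In this model the later aggregates never see the `sex` column, which is why the filter has to be applied in the expand step.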

Why are the changes needed?

Spark SQL only supports the FILTER clause on aggregate expressions without DISTINCT.
This PR allows the FILTER clause to be used together with DISTINCT.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing and new unit tests.

beliefer · 2020-02-01T14:13:46Z

@maropu @cloud-fan I'm sorry for opening a new PR. I made some mistaken operations.

SparkQA · 2020-02-01T18:17:11Z

Test build #117718 has finished for PR 27428 at commit 6a32d83.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-02T06:56:05Z

Test build #117724 has finished for PR 27428 at commit 5c38bbe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-02T12:38:24Z

Test build #117737 has finished for PR 27428 at commit a6498f9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-02-04T11:55:27Z

cc @cloud-fan

SparkQA · 2020-02-07T08:05:01Z

Test build #118021 has finished for PR 27428 at commit c6caf73.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class AliasIdentifier(name: String, qualifier: Seq[String])
class LegacyDateFormatter(pattern: String, locale: Locale) extends DateFormatter
class LegacyTimestampFormatter(
case class PlanAdaptiveSubqueries(subqueryMap: Map[Long, SubqueryExec]) extends Rule[SparkPlan]

beliefer · 2020-02-07T09:09:18Z

retest this please

SparkQA · 2020-02-07T09:30:11Z

Test build #118027 has finished for PR 27428 at commit c6caf73.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class AliasIdentifier(name: String, qualifier: Seq[String])
class LegacyDateFormatter(pattern: String, locale: Locale) extends DateFormatter
class LegacyTimestampFormatter(
case class PlanAdaptiveSubqueries(subqueryMap: Map[Long, SubqueryExec]) extends Rule[SparkPlan]

SparkQA · 2020-02-07T16:13:05Z

Test build #118033 has finished for PR 27428 at commit cd00f91.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-02-24T09:01:54Z

cc @hvanhovell

beliefer · 2020-04-13T10:34:06Z

retest this please

SparkQA · 2020-04-13T13:16:15Z

Test build #121200 has finished for PR 27428 at commit cd00f91.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-04-13T15:12:19Z

retest this please

SparkQA · 2020-04-13T20:12:59Z

Test build #121216 has finished for PR 27428 at commit cd00f91.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-06-19T05:58:42Z

cc @cloud-fan

cloud-fan · 2020-07-02T15:11:11Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala

-   * Wraps this [[AggregateFunction]] in an [[AggregateExpression]] and sets `isDistinct`
-   * flag of the [[AggregateExpression]] to the given value because
+   * Wraps this [[AggregateFunction]] in an [[AggregateExpression]] with `isDistinct`
+   * flag and `filter` option of the [[AggregateExpression]] to the given value because


`filter` option looks weird. How about "and an optional `filter`"?

cloud-fan · 2020-07-02T15:11:49Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala

   * [[AggregateExpression]] is the container of an [[AggregateFunction]], aggregation mode,
-   * and the flag indicating if this aggregation is distinct aggregation or not.
+   * the flag indicating if this aggregation is distinct aggregation or not and filter option.


ditto, and the optional 'filter'

cloud-fan · 2020-07-02T15:31:32Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

+ *                          ('key, '_gen_distinct_1, null, 1, null),
+ *                          ('key, null, '_gen_distinct_2, 2, null)]
+ *           output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'value])
+ *         Expand(


This doesn't need to be an Expand: you just have one project list, and we can just use Project.

Then we can merge the Project with the above Expand

@cloud-fan Good idea. I learned how to use Alias. Thanks.

SparkQA · 2020-07-09T05:37:49Z

Test build #125433 has finished for PR 27428 at commit 16d8c1d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-07-09T06:15:10Z

retest this please

SparkQA · 2020-07-09T07:05:02Z

Test build #125439 has finished for PR 27428 at commit 16d8c1d.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-09T10:51:09Z

Test build #125462 has started for PR 27428 at commit 762e839.

SparkQA · 2020-07-09T11:09:37Z

Test build #125450 has finished for PR 27428 at commit 3c49156.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-07-10T01:51:38Z

retest this please

SparkQA · 2020-07-10T06:22:45Z

Test build #125539 has finished for PR 27428 at commit 762e839.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-07-10T06:35:27Z

retest this please

SparkQA · 2020-07-10T07:05:02Z

Test build #125564 has finished for PR 27428 at commit 762e839.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-07-10T07:23:40Z

retest this please

SparkQA · 2020-07-10T07:25:13Z

Test build #125576 has started for PR 27428 at commit 762e839.

beliefer · 2020-07-10T15:54:55Z

retest this please

SparkQA · 2020-07-10T23:52:10Z

Test build #125626 has finished for PR 27428 at commit 762e839.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-07-13T09:13:18Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

 *       LocalTableScan [...]
 * }}}
 *
- * The rule does the following things here:
+ * Four example: single distinct aggregate function with filter clauses (in sql):


single -> more than one?

OK. How about "at least two distinct aggregate functions, one of which contains a filter clause (in SQL)"?

cloud-fan · 2020-07-13T09:15:48Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

+ * 1. Project the data. There are three aggregation groups in this query:
+ *    i. the non-distinct group;
+ *    ii. the distinct 'cat1 group;
+ *    iii. the distinct 'cat2 group with filter clause.


This doesn't match the groups. Maybe just make it general: the distinct group without a filter clause and the distinct group with a filter clause.

cloud-fan · 2020-07-13T09:17:57Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

+ *    in this query:
+ *    i. the non-distinct 'cat1 group;
+ *    ii. the distinct 'cat1 group;
+ *    iii. the distinct 'cat1 group with filter clause.


We don't need to repeat these 3 groups.

They are different

cloud-fan · 2020-07-13T09:22:02Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt, 'total])
+ *   LocalTableScan [...]


Do we need to rewrite this query? The planner can handle single distinct agg func AFAIK.

I think we can keep the previous behavior. AggregationIterator already handles this.

SparkQA · 2020-07-13T10:01:02Z

Test build #125756 has finished for PR 27428 at commit d531864.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-13T10:01:46Z

Test build #125761 has finished for PR 27428 at commit 5bbbfd7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-14T04:47:01Z

Test build #125800 has finished for PR 27428 at commit 20ad143.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-07-14T12:41:21Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

+  *                  sum('_gen_attr_2)]
+ *      output = ['key, 'cat1_cnt, 'total])
+ *     Project(
+ *        projectionList = ['key, if ('id > 1) 'cat1 else null, cast('value as bigint)]


Is this necessary? The query can work fine even if we don't add this Project in this rule, right?

This rule should be skipped if there is only one distinct. Having a filter or not shouldn't change it.

If we don't apply this rule, we can't support the case with only one distinct aggregate that has a filter clause.
For consistency, the rule is applied uniformly here.

I mean to unify the implementations of the filter clause that are handled by this rule. This case is not handled by this rule before your PR. Sorry if I didn't make myself clear enough.
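For readers following this thread, the projection under discussion, `if ('id > 1) 'cat1 else null`, relies on DISTINCT counts ignoring NULLs. The sketch below is plain Python over made-up rows with hypothetical helper names, not the rule itself. It checks that `COUNT(DISTINCT cat1) FILTER (WHERE id > 1)` gives the same answer as `COUNT(DISTINCT cat1)` over the projected column.

```python
# Hypothetical input rows with the columns used in the plan comment.
rows = [
    {"key": "a", "id": 1, "cat1": "x", "value": 10},
    {"key": "a", "id": 2, "cat1": "x", "value": 20},
    {"key": "a", "id": 3, "cat1": "y", "value": 30},
    {"key": "b", "id": 1, "cat1": "z", "value": 40},
]


def count_distinct(values):
    """COUNT(DISTINCT ...) skips NULLs, modelled here as None."""
    return len({v for v in values if v is not None})


def filtered_count(rows, key):
    """COUNT(DISTINCT cat1) FILTER (WHERE id > 1) for one group key."""
    return count_distinct(
        r["cat1"] for r in rows if r["key"] == key and r["id"] > 1
    )


def projected_count(rows, key):
    """Same count after projecting: if (id > 1) cat1 else null."""
    projected = [
        {"key": r["key"], "cat1": r["cat1"] if r["id"] > 1 else None}
        for r in rows
    ]
    return count_distinct(r["cat1"] for r in projected if r["key"] == key)


filtered = {k: filtered_count(rows, k) for k in ("a", "b")}
projected = {k: projected_count(rows, k) for k in ("a", "b")}
```

The equivalence shown here only covers aggregates that ignore NULL inputs; it is not a claim about every aggregate function.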

beliefer added 2 commits February 1, 2020 21:29

Support distinct with filter

2dc9db4

Add results of test case

6a32d83

beliefer mentioned this pull request Feb 2, 2020

[SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT #27058

Closed

Optimize code

5c38bbe

Fix incorrect sql

a6498f9

dongjoon-hyun added the SQL label Feb 5, 2020

Resolve conflict

c6caf73

Fix conflict

cd00f91

Reuse completeNextStageWithFetchFailure

4a6f903

Merge remote-tracking branch 'upstream/master'

96456e2

cloud-fan reviewed Jul 2, 2020

beliefer added 2 commits July 3, 2020 11:24

Merge remote-tracking branch 'upstream/master'

4314005

Merge branch 'master' into same_distinct_aggregate_with_filter

bd314cb

Supplement comments.

3c49156

beliefer mentioned this pull request Jul 9, 2020

[WIP][SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT #29051

Closed

Optimize code.

762e839

beliefer added 2 commits July 13, 2020 15:32

Unified implementation of filter in regular aggregates and distinct aggregates.

12e6fbc

Update comments.

d531864

cloud-fan reviewed Jul 13, 2020

Optimize code.

5bbbfd7

Update comments.

20ad143

beliefer mentioned this pull request Jul 14, 2020

[SPARK-30027][SQL] Support codegen for aggregate filters in HashAggregateExec #27019

Closed

cloud-fan reviewed Jul 14, 2020

beliefer closed this Jul 16, 2020
