New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT #27428
[SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT #27428
Conversation
@maropu @cloud-fan I'm sorry for open a new PR. Because I made some mistake operator. |
Test build #117718 has finished for PR 27428 at commit
|
Test build #117724 has finished for PR 27428 at commit
|
Test build #117737 has finished for PR 27428 at commit
|
cc @cloud-fan |
Test build #118021 has finished for PR 27428 at commit
|
retest this please |
Test build #118027 has finished for PR 27428 at commit
|
Test build #118033 has finished for PR 27428 at commit
|
cc @hvanhovell |
retest this please |
Test build #121200 has finished for PR 27428 at commit
|
retest this please |
Test build #121216 has finished for PR 27428 at commit
|
cc @cloud-fan |
* Wraps this [[AggregateFunction]] in an [[AggregateExpression]] and sets `isDistinct` | ||
* flag of the [[AggregateExpression]] to the given value because | ||
* Wraps this [[AggregateFunction]] in an [[AggregateExpression]] with `isDistinct` | ||
* flag and `filter` option of the [[AggregateExpression]] to the given value because |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filter option
looks weird, how about and an optional 'filter'
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
* [[AggregateExpression]] is the container of an [[AggregateFunction]], aggregation mode, | ||
* and the flag indicating if this aggregation is distinct aggregation or not. | ||
* the flag indicating if this aggregation is distinct aggregation or not and filter option. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, and the optional 'filter'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
* ('key, '_gen_distinct_1, null, 1, null), | ||
* ('key, null, '_gen_distinct_2, 2, null)] | ||
* output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'value]) | ||
* Expand( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't need to be an Expand
: you just have one project list, and we can just use Project
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we can merge the Project
with the above Expand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Good idea. I learned how to use Alias
. Thanks.
Test build #125433 has finished for PR 27428 at commit
|
retest this please |
Test build #125439 has finished for PR 27428 at commit
|
Test build #125462 has started for PR 27428 at commit |
Test build #125450 has finished for PR 27428 at commit
|
retest this please |
Test build #125539 has finished for PR 27428 at commit
|
retest this please |
Test build #125564 has finished for PR 27428 at commit
|
retest this please |
Test build #125576 has started for PR 27428 at commit |
retest this please |
Test build #125626 has finished for PR 27428 at commit
|
* LocalTableScan [...] | ||
* }}} | ||
* | ||
* The rule does the following things here: | ||
* Four example: single distinct aggregate function with filter clauses (in sql): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
single
-> more than one
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. How about at least two distinct aggregate function and one of them contains filter clauses (in sql)
?
* 1. Project the data. There are three aggregation groups in this query: | ||
* i. the non-distinct group; | ||
* ii. the distinct 'cat1 group; | ||
* iii. the distinct 'cat2 group with filter clause. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't match the group. Maybe just make it general the distinct group without filter clause
and the distinct group with filter clause
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
* in this query: | ||
* i. the non-distinct 'cat1 group; | ||
* ii. the distinct 'cat1 group; | ||
* iii. the distinct 'cat1 group with filter clause. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to repeat these 3 groups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are different
* functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1), | ||
* sum('value)] | ||
* output = ['key, 'cat1_cnt, 'total]) | ||
* LocalTableScan [...] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to rewrite this query? The planner can handle single distinct agg func AFAIK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can keep the previous behavior. AggregationIterator
already done this.
Test build #125756 has finished for PR 27428 at commit
|
Test build #125761 has finished for PR 27428 at commit
|
Test build #125800 has finished for PR 27428 at commit
|
* sum('_gen_attr_2)] | ||
* output = ['key, 'cat1_cnt, 'total]) | ||
* Project( | ||
* projectionList = ['key, if ('id > 1) 'cat1 else null, cast('value as bigint)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? The query can work fine even if we don't add this Project in this rule, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This rule should be skipped if there is only one distinct. Having a filter or not shouldn't change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not apply this rule, can't support the case that have only one distinct with filter clause.
For unification, the rules are used uniformly here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean to unify the implementations of the filter clause that are handled by this rule. This case is not handled by this rule before your PR. Sorry if I didn't make myself clear enough.
What changes were proposed in this pull request?
This PR is related to #26656.
#26656 only support use FILTER clause on aggregate expression without DISTINCT.
This PR will enhance this feature when one or more DISTINCT aggregate expressions which allows the use of the FILTER clause.
Such as:
Note:
In #26656, we use
AggregationIterator
to treat the filter conditions of aggregate expr. This is good because we can evaluate filter in first aggregate locally.If we use
AggregationIterator
too, the filter conditions of DISTINCT aggregate expr will be treated in second or thrid aggregate.In order to reduce cost, we treat the filter conditions of DISTINCT aggregate expr in first aggregate or local is better.
So, this PR uses
Expand
to ensure the evaluation at local.Why are the changes needed?
Spark SQL only support use FILTER clause on aggregate expression without DISTINCT.
This PR support Filter expression allows simultaneous use of DISTINCT
Does this PR introduce any user-facing change?
No
How was this patch tested?
Exists and new UT