[SPARK-46487][SQL] Push down part of filter through aggregate with nondeterministic field #44460
zml1206 wants to merge 6 commits into apache:master
project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild))

case filter @ Filter(condition, aggregate: Aggregate)
  if aggregate.aggregateExpressions.forall(_.deterministic)

Did you investigate https://issues.apache.org/jira/browse/SPARK-13473?
It seems we can't relax this restriction.

I investigated it earlier. A similar issue is https://issues.apache.org/jira/browse/SPARK-20246.
At that time, the problem was that there was no way to tell whether a filter expression such as $"rand" > 5 was deterministic, so $"rand" > 5 was incorrectly pushed down.
However, once the aliases in the filter expression have been replaced, we can tell whether it is deterministic.

If so, how about keeping
if aggregate.aggregateExpressions.exists(_.deterministic)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
// implies that, for a given input row, the output are determined by the expression's initial
// state and all the input rows processed before. In another word, the order of input rows
// matters for non-deterministic expressions, while pushing down predicates changes the order.
// This also applies to Aggregate.

What's the rationale for not applying this to Aggregate?

When an Aggregate contains both non-deterministic and deterministic expression fields, filters that are deterministic after alias replacement can still be pushed down. If the replaced condition is deterministic, it cannot depend on any non-deterministic expression field.

For Aggregate, only filters that reference deterministic grouping expressions are pushed down, so the pushdown only removes whole groups. I can't think of any non-deterministic expression that would be affected by this pushdown.
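
To make the "it only filters the group" argument concrete, here is a minimal sketch that uses plain Scala collections as a stand-in for a Spark plan. The object name, the sample rows, and the `a > 5` predicate are assumptions made up for illustration; none of this is Catalyst code.

```scala
// Toy rows of (a, b); `a` plays the grouping key and `b` the aggregated column.
object GroupFilterSketch {
  val rows: Seq[(Int, Int)] = Seq((3, 10), (7, 1), (7, 2), (9, 5))

  // Aggregate first, then filter on the grouping key.
  def filterAfter(input: Seq[(Int, Int)]): Map[Int, Int] =
    input
      .groupBy(_._1)
      .map { case (a, rs) => a -> rs.map(_._2).sum }
      .filter { case (a, _) => a > 5 }

  // Filter on the grouping key first, then aggregate the surviving rows.
  def filterBefore(input: Seq[(Int, Int)]): Map[Int, Int] =
    input
      .filter(_._1 > 5)
      .groupBy(_._1)
      .map { case (a, rs) => a -> rs.map(_._2).sum }
}
```

Both functions keep or drop a group as a whole, so the rows that feed each surviving group are identical in the two orderings. The sketch says nothing about non-deterministic aggregate fields, whose values can depend on which rows were processed earlier.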
// Push `Filter` operators through `Aggregate` operators. Parts of the predicates that can
// be beneath must satisfy the following conditions:
// 1. Grouping expressions are not empty.
// 2. Predicate expression is deterministic.

Do you mean the de-aliased predicate?

Yes, I will change the description to make it clearer.
  val (pushDown, stayUp) = splitConjunctivePredicates(condition).partition { cond =>
    val replaced = replaceAlias(cond, aliasMap)
-   cond.deterministic && !cond.throwable &&
+   replaced.deterministic && !cond.throwable &&

Shall we also use replaced.throwable?

I think a throwable filter can be pushed down through an aggregate; it does not seem to affect whether the exception is thrown. What do you think, @cloud-fan?

I will make a PR later.
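
As a rough illustration of the check discussed in this thread, the sketch below models the split on a toy expression tree. Every type and helper here (`Expr`, `Rand`, `replaceAlias`, `split`, and so on) is a hypothetical stand-in written for this example, not the Catalyst API, and the `throwable` check is left out on purpose.

```scala
object PushdownSketch {
  // Hypothetical mini expression tree, not Catalyst classes.
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Lit(value: Int) extends Expr
  case object Rand extends Expr // stands in for a non-deterministic function
  case class Gt(left: Expr, right: Expr) extends Expr
  case class And(left: Expr, right: Expr) extends Expr

  def deterministic(e: Expr): Boolean = e match {
    case Rand      => false
    case Gt(l, r)  => deterministic(l) && deterministic(r)
    case And(l, r) => deterministic(l) && deterministic(r)
    case _         => true
  }

  def references(e: Expr): Set[String] = e match {
    case Attr(n)   => Set(n)
    case Gt(l, r)  => references(l) ++ references(r)
    case And(l, r) => references(l) ++ references(r)
    case _         => Set.empty
  }

  // Substitute each aliased output attribute with the expression it names.
  def replaceAlias(e: Expr, aliases: Map[String, Expr]): Expr = e match {
    case Attr(n)   => aliases.getOrElse(n, e)
    case Gt(l, r)  => Gt(replaceAlias(l, aliases), replaceAlias(r, aliases))
    case And(l, r) => And(replaceAlias(l, aliases), replaceAlias(r, aliases))
    case other     => other
  }

  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  // Returns (pushDown, stayUp). Determinism is judged on the de-aliased predicate.
  def split(
      condition: Expr,
      aliases: Map[String, Expr],
      childOutput: Set[String]): (Seq[Expr], Seq[Expr]) =
    splitConjuncts(condition).partition { cond =>
      val replaced = replaceAlias(cond, aliases)
      deterministic(replaced) &&
        references(cond).nonEmpty &&
        references(replaced).subsetOf(childOutput)
    }
}
```

With `rand` aliased to `Rand` and a child that outputs `a` and `b`, the condition `a > 5 AND rand > 5` splits into `a > 5` (pushed down) and `rand > 5` (kept above the aggregate), because only the de-aliased form of the second conjunct reveals that it is non-deterministic.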

I checked the aggregate execution implementation. Basically, it has an embedded [...] On the other hand, the aggregate operator itself is already non-deterministic, as the output row order differs between hash and sort aggregate. But we shouldn't make it worse.

I understand, thank you. For example: [...] Pushing the filter down would make the result wrong: the correct result is (2, 2), the wrong one is (2, 0).
What changes were proposed in this pull request?
Push down the parts of a filter that are deterministic, and whose references are a subset of the aggregate child's output, through an aggregate with non-deterministic fields.
For example, we can push down $"a" > 5 but not $"rand" > 5. Because $"rand" > 5 is non-deterministic, pushing it down would change the evaluation result in the aggregate.
Why are the changes needed?
Filter earlier to improve performance.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
UT.
Was this patch authored or co-authored using generative AI tooling?
No.