[SPARK-55702][SQL] Support filter predicate in window aggregate functions #54501

cloud-fan wants to merge 1 commit into apache:master
Conversation
(force-pushed: c1874cf → d4b1eaa → e1ebfd8)
```scala
if (filters.length == functions.length) filters
else Array.fill(functions.length)(None)
```
Should we add an assert like `assert(filters.isEmpty || filters.length == functions.length)`?
I don't see why we need the `Array.empty` default value for `filters`, and I don't see how the sizes could differ. So why not just `assert(filters.length == functions.length)`, or change the contract to `functionsAndFilters: Seq[(Expression, Option[Expression])]`?
```scala
filterOpt match {
  case Some(filter) =>
    updateExpressions ++= agg.updateExpressions.zip(agg.aggBufferAttributes).map {
      case (updateExpr, attr) => If(filter, updateExpr, attr)
```
Does it mean filter will be evaluated multiple times? Maybe common expression evaluation helps.
It's pretty much the same as the interpreted version of `HashAggregateExec`: `AggregationIterator`.
```scala
var i = 0
while (i < numImperatives) {
  imperatives(i).update(buffer, input)
  val shouldUpdate = imperativeFilters(i) match {
```
Looks like there is no common expression evaluation here?
```sql
first_value(val) FILTER (WHERE cate = 'a') OVER(ORDER BY val_long
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS first_a,
last_value(val) FILTER (WHERE cate = 'a') OVER(ORDER BY val_long
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS last_a
FROM testData ORDER BY val_long, cate;
```
The tests all use either `UNBOUNDED PRECEDING AND CURRENT ROW` (growing frame) or no-frame `PARTITION BY cate` (full partition). There's no test for a true sliding window like:

```sql
sum(val) FILTER (WHERE val > 1) OVER (ORDER BY val_long ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
```
```sql
SELECT val, cate,
  sum(val) FILTER (WHERE cate = 'a') OVER(PARTITION BY cate) AS total_sum_filtered
FROM testData ORDER BY cate, val;
```
No test for RANGE frame? All new tests use ROWS frames. There's no test for:

```sql
sum(val) FILTER (WHERE cate = 'a') OVER (ORDER BY val_long RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```
Remove the analysis check that rejected `FILTER` in window aggregates and add filter support to `AggregateProcessor` in `WindowExec` so that `AggregateExpression.filter` is honored during window frame evaluation. For `DeclarativeAggregate`, the update expressions are wrapped with `If(filter, updateExpr, bufferAttr)` to skip rows that don't match. For `ImperativeAggregate`, the filter predicate is evaluated before calling `update()`. Made-with: Cursor
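As a rough sketch of the filtered-update behavior described above, in plain Python (a toy model with invented names like `filtered_sum`, not Spark code): when the `FILTER` predicate fails for a row, the aggregation buffer keeps its previous value, mirroring `If(filter, updateExpr, bufferAttr)`.

```python
# Toy model of a filtered declarative aggregate (not the actual Scala code).
def filtered_sum(rows, value_of, filter_pred):
    buffer = 0  # zero-initialized aggregation buffer
    for row in rows:
        if filter_pred(row):
            buffer = buffer + value_of(row)  # update expression fires
        # else: buffer is left unchanged for this row,
        # like If(filter, updateExpr, bufferAttr)
    return buffer

rows = [
    {"val": 1, "cate": "a"},
    {"val": 2, "cate": "b"},
    {"val": 3, "cate": "a"},
]
print(filtered_sum(rows, lambda r: r["val"], lambda r: r["cate"] == "a"))  # 4
```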
(force-pushed: e1ebfd8 → f4fe768)
cc @yaooqinn, too.

thanks for the review, merging to master!
…indow filter test

### What changes were proposed in this pull request?

Follow-up to #54501. Two cleanups:

1. **Remove dead error code**: The `windowAggregateFunctionWithFilterNotSupportedError` method in `QueryCompilationErrors.scala` and its `_LEGACY_ERROR_TEMP_1030` error class in `error-conditions.json` were left behind after #54501 removed their only call site.
2. **Fix flaky `first_value`/`last_value` test**: The window filter test used `ORDER BY val_long` with a ROWS frame, but `val_long` has duplicate values in the test data (e.g., three rows with `val_long=1`), making `first_value`/`last_value` results non-deterministic. Added `val` and `cate` as tiebreaker columns and used `NULLS LAST` so the output is both stable and meaningful (without NULLS LAST, the first matching 'a' row has `val=NULL`, making `first_a` always NULL).

### Why are the changes needed?

1. Dead code should be cleaned up.
2. Non-deterministic tests can cause spurious failures.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Re-ran `SQLQueryTestSuite` for `window.sql`. All 4 tests pass across all config dimensions.

### Was this patch authored or co-authored using generative AI tooling?

Yes. cursor

Closes #54557 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Kent Yao <kentyao@microsoft.com>
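The non-determinism fixed in item 2 can be illustrated with a tiny Python model (hypothetical rows and invented names, not the actual `testData`): sorting by `val_long` alone leaves tied rows in arbitrary order, while adding tiebreakers with NULLs pushed last pins which row `first_value` with the `cate = 'a'` filter sees first.

```python
# Hypothetical rows; None plays the role of SQL NULL.
rows = [
    {"val": None, "cate": "a", "val_long": 1},
    {"val": 1,    "cate": "a", "val_long": 1},
    {"val": 1,    "cate": "b", "val_long": 1},
]

# All three rows tie on val_long, so an ORDER BY val_long ROWS frame
# could present them in any order: first_a would be unstable.
# Tiebreakers (val NULLS LAST, then cate) make the order deterministic.
def sort_key(r):
    # (val_long, is-null flag for NULLS LAST, val, cate)
    return (r["val_long"], r["val"] is None, r["val"] or 0, r["cate"])

ordered = sorted(rows, key=sort_key)
first_a = next(r["val"] for r in ordered if r["cate"] == "a")
print(first_a)  # 1: the NULL 'a' row no longer wins the tie
```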
### What changes were proposed in this pull request?

This PR adds support for the `FILTER (WHERE ...)` clause on aggregate functions used within window expressions. Previously, Spark rejected this with an `AnalysisException` ("Window aggregate function with filter predicate is not supported yet."). The changes are:

- Remove the check in `Analyzer.scala` that blocked `FILTER` in window aggregates, and extract filter expressions alongside aggregate function children.
- Extend `AggregateProcessor` so that `AggregateExpression.filter` is honored during window frame evaluation:
  - `DeclarativeAggregate`: update expressions are wrapped with `If(filter, updateExpr, bufferAttr)` to conditionally skip rows.
  - `ImperativeAggregate`: the filter predicate is evaluated before calling `update()`.
- Pass the filters from `WindowEvaluatorFactoryBase` to `AggregateProcessor`.

### Why are the changes needed?
The SQL standard allows `FILTER` on aggregate functions in window contexts, and other databases (PostgreSQL, etc.) support this. Spark already supports `FILTER` for regular (non-window) aggregates but rejected it in window contexts.

### Does this PR introduce any user-facing change?
Yes. Window aggregate expressions with `FILTER` now execute instead of throwing an `AnalysisException`. For example, `sum(val) FILTER (WHERE cate = 'a') OVER (PARTITION BY cate)` (from the new tests) now runs.

### How was this patch tested?
Added 4 SQL test cases in `window.sql` covering:

- `first_value`/`last_value` with filter (verifying no interference with NULL handling)

The existing test case (`count(val) FILTER (WHERE val > 1) OVER(...)`) now produces correct results instead of an error.

### Was this patch authored or co-authored using generative AI tooling?
Yes.
Made with Cursor