[SPARK-55702][SQL] Support filter predicate in window aggregate functions #54501

cloud-fan wants to merge 1 commit into apache:master
Conversation
(force-pushed: c1874cf → d4b1eaa → e1ebfd8)
```scala
if (filters.length == functions.length) filters
else Array.fill(functions.length)(None)
```
Should we add an assert like `assert(filters.isEmpty || filters.length == functions.length)`?
I don't see why we need the `Array.empty` default value for `filters`, and I don't see how the sizes could differ. So why not just `assert(filters.length == functions.length)`, or change the contract to `functionsAndFilters: Seq[(Expression, Option[Expression])]`?
```scala
filterOpt match {
  case Some(filter) =>
    updateExpressions ++= agg.updateExpressions.zip(agg.aggBufferAttributes).map {
      case (updateExpr, attr) => If(filter, updateExpr, attr)
```
Does it mean filter will be evaluated multiple times? Maybe common expression evaluation helps.
It's pretty much the same as the interpreted version of `HashAggregateExec`: `AggregationIterator`.
```scala
var i = 0
while (i < numImperatives) {
  imperatives(i).update(buffer, input)
  val shouldUpdate = imperativeFilters(i) match {
```
Looks like there is no common expression evaluation here?
```sql
first_value(val) FILTER (WHERE cate = 'a') OVER(ORDER BY val_long
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS first_a,
last_value(val) FILTER (WHERE cate = 'a') OVER(ORDER BY val_long
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS last_a
FROM testData ORDER BY val_long, cate;
```
The tests all use either `UNBOUNDED PRECEDING AND CURRENT ROW` (growing frame) or no-frame `PARTITION BY cate` (full partition). There's no test for a true sliding window like:

```sql
sum(val) FILTER (WHERE val > 1) OVER (ORDER BY val_long ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
```
```sql
SELECT val, cate,
  sum(val) FILTER (WHERE cate = 'a') OVER(PARTITION BY cate) AS total_sum_filtered
FROM testData ORDER BY cate, val;
```
No test for RANGE frame? All new tests use ROWS frames. There's no test for:

```sql
sum(val) FILTER (WHERE cate = 'a') OVER (ORDER BY val_long RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```
Remove the analysis check that rejected `FILTER` in window aggregates and add filter support to `AggregateProcessor` in `WindowExec` so that `AggregateExpression.filter` is honored during window frame evaluation. For `DeclarativeAggregate`, the update expressions are wrapped with `If(filter, updateExpr, bufferAttr)` to skip rows that don't match. For `ImperativeAggregate`, the filter predicate is evaluated before calling `update()`. Made-with: Cursor
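As a rough sketch of the filtered-update behavior described above, in plain Python (a toy model with invented names like `filtered_sum`, not Spark code): when the `FILTER` predicate fails for a row, the aggregation buffer keeps its previous value, mirroring `If(filter, updateExpr, bufferAttr)`.

```python
# Toy model of a filtered declarative aggregate (not the actual Scala code).
def filtered_sum(rows, value_of, filter_pred):
    buffer = 0  # zero-initialized aggregation buffer
    for row in rows:
        if filter_pred(row):
            buffer = buffer + value_of(row)  # update expression fires
        # else: buffer is left unchanged for this row,
        # like If(filter, updateExpr, bufferAttr)
    return buffer

rows = [
    {"val": 1, "cate": "a"},
    {"val": 2, "cate": "b"},
    {"val": 3, "cate": "a"},
]
print(filtered_sum(rows, lambda r: r["val"], lambda r: r["cate"] == "a"))  # 4
```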
(force-pushed: e1ebfd8 → f4fe768)
cc @yaooqinn, too.

thanks for the review, merging to master!
…indow filter test

### What changes were proposed in this pull request?

Follow-up to #54501. Two cleanups:

1. **Remove dead error code**: The `windowAggregateFunctionWithFilterNotSupportedError` method in `QueryCompilationErrors.scala` and its `_LEGACY_ERROR_TEMP_1030` error class in `error-conditions.json` were left behind after #54501 removed their only call site.
2. **Fix flaky `first_value`/`last_value` test**: The window filter test used `ORDER BY val_long` with a ROWS frame, but `val_long` has duplicate values in the test data (e.g., three rows with `val_long=1`), making `first_value`/`last_value` results non-deterministic. Added `val` and `cate` as tiebreaker columns and used `NULLS LAST` so the output is both stable and meaningful (without NULLS LAST, the first matching 'a' row has `val=NULL`, making `first_a` always NULL).

### Why are the changes needed?

1. Dead code should be cleaned up.
2. Non-deterministic tests can cause spurious failures.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Re-ran `SQLQueryTestSuite` for `window.sql`. All 4 tests pass across all config dimensions.

### Was this patch authored or co-authored using generative AI tooling?

Yes. cursor

Closes #54557 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Kent Yao <kentyao@microsoft.com>
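The non-determinism fixed in item 2 can be illustrated with a tiny Python model (hypothetical rows and invented names, not the actual `testData`): sorting by `val_long` alone leaves tied rows in arbitrary order, while adding tiebreakers with NULLs pushed last pins which row `first_value` with the `cate = 'a'` filter sees first.

```python
# Hypothetical rows; None plays the role of SQL NULL.
rows = [
    {"val": None, "cate": "a", "val_long": 1},
    {"val": 1,    "cate": "a", "val_long": 1},
    {"val": 1,    "cate": "b", "val_long": 1},
]

# All three rows tie on val_long, so an ORDER BY val_long ROWS frame
# could present them in any order: first_a would be unstable.
# Tiebreakers (val NULLS LAST, then cate) make the order deterministic.
def sort_key(r):
    # (val_long, is-null flag for NULLS LAST, val, cate)
    return (r["val_long"], r["val"] is None, r["val"] or 0, r["cate"])

ordered = sorted(rows, key=sort_key)
first_a = next(r["val"] for r in ordered if r["cate"] == "a")
print(first_a)  # 1: the NULL 'a' row no longer wins the tie
```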
### What changes were proposed in this pull request?

This PR adds support for the `FILTER (WHERE ...)` clause on aggregate functions used within window expressions. Previously, Spark rejected this with an `AnalysisException` ("Window aggregate function with filter predicate is not supported yet."). The changes are:

- Remove the check in `Analyzer.scala` that blocked `FILTER` in window aggregates, and extract filter expressions alongside aggregate function children.
- Extend `AggregateProcessor` so that `AggregateExpression.filter` is honored during window frame evaluation:
  - `DeclarativeAggregate`: update expressions are wrapped with `If(filter, updateExpr, bufferAttr)` to conditionally skip rows.
  - `ImperativeAggregate`: the filter predicate is evaluated before calling `update()`.
- Pass the filters from `WindowEvaluatorFactoryBase` to `AggregateProcessor`.

### Why are the changes needed?
The SQL standard allows `FILTER` on aggregate functions in window contexts, and other databases (PostgreSQL, etc.) support this. Spark already supports `FILTER` for regular (non-window) aggregates but rejected it in window contexts.

### Does this PR introduce any user-facing change?
Yes. Window aggregate expressions with `FILTER` now execute instead of throwing an `AnalysisException`. For example, `sum(val) FILTER (WHERE cate = 'a') OVER (PARTITION BY cate)` (from the new tests) now runs.

### How was this patch tested?
Added 4 SQL test cases in `window.sql` covering:

- `first_value`/`last_value` with filter (verifying no interference with NULL handling)

The existing test case (`count(val) FILTER (WHERE val > 1) OVER(...)`) now produces correct results instead of an error.

### Was this patch authored or co-authored using generative AI tooling?
Yes.
Made with Cursor