Skip to content

refactor: remove opt_filter in GroupsAccumulator::merge_batch#22816

Open
haohuaijin wants to merge 6 commits into
apache:mainfrom
haohuaijin:remove-opt-filter
Open

refactor: remove opt_filter in GroupsAccumulator::merge_batch#22816
haohuaijin wants to merge 6 commits into
apache:mainfrom
haohuaijin:remove-opt-filter

Conversation

@haohuaijin
Copy link
Copy Markdown
Contributor

@haohuaijin haohuaijin commented Jun 8, 2026

Which issue does this PR close?

Rationale for this change

the opt_filter on GroupsAccumulator::merge_batch is a dead parameter. Aggregate FILTER clauses only apply to raw input rows in the update phase (update_batch). merge_batch combines already pre-aggregated states, so there is no per-row filtering to do — opt_filter is meaningless there.

The code confirms this:

  • The only production caller (row_hash.rs) always passed None.
  • Existing implementations already ignored it — e.g. correlation.rs asserted opt_filter.is_none(), and Spark avg used _opt_filter.

What changes are included in this PR?

  • Removed opt_filter from merge_batch in the trait and all implementations (built-in aggregates, physical-expr-common, functions-aggregate-common, Spark, and FFI).
  • Updated the trait docs to say merge_batch has no opt_filter because filtering happens in the update phase.
  • Changed the group zero-init path in row_hash.rs to always use update_batch with an all-false filter instead of branching to merge_batch. update_batch always takes raw argument types (what aggregate_arguments provides), and since every row is filtered out the data never matters — this is simpler and more correct.
  • Updated all call sites and tests.

Are these changes tested?

Yes. Existing aggregate tests cover this and were updated to the new signature. The first_last tests were adjusted (with comments) to match the merge behavior without a filter, and the FFI and Spark tests were updated too.

Are there any user-facing changes?

Yes — this is a breaking change to the public GroupsAccumulator trait: opt_filter is removed from merge_batch. Custom implementations and direct callers must update their signatures.

@github-actions github-actions Bot added logical-expr Logical plan and expressions core Core DataFusion crate functions Changes to functions implementation ffi Changes to the ffi crate physical-plan Changes to the physical-plan crate spark documentation Improvements or additions to documentation labels Jun 8, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 8, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v53.1.0 (current)
       Built [  98.457s] (current)
     Parsing datafusion v53.1.0 (current)
      Parsed [   0.034s] (current)
    Building datafusion v53.1.0 (baseline)
       Built [  97.262s] (baseline)
     Parsing datafusion v53.1.0 (baseline)
      Parsed [   0.036s] (baseline)
    Checking datafusion v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.620s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 198.921s] datafusion
    Building datafusion-expr-common v53.1.0 (current)
       Built [  19.143s] (current)
     Parsing datafusion-expr-common v53.1.0 (current)
      Parsed [   0.018s] (current)
    Building datafusion-expr-common v53.1.0 (baseline)
       Built [  19.173s] (baseline)
     Parsing datafusion-expr-common v53.1.0 (baseline)
      Parsed [   0.019s] (baseline)
    Checking datafusion-expr-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.200s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure trait_method_parameter_count_changed: pub trait method parameter count changed ---

Description:
A trait method now takes a different number of parameters.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#trait-item-signature
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/trait_method_parameter_count_changed.ron

Failed in:
  GroupsAccumulator::merge_batch now takes 3 instead of 4 parameters, in file /home/runner/work/datafusion/datafusion/datafusion/expr-common/src/groups_accumulator.rs:190

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  39.647s] datafusion-expr-common
    Building datafusion-ffi v53.1.0 (current)
       Built [  60.145s] (current)
     Parsing datafusion-ffi v53.1.0 (current)
      Parsed [   0.060s] (current)
    Building datafusion-ffi v53.1.0 (baseline)
       Built [  59.442s] (baseline)
     Parsing datafusion-ffi v53.1.0 (baseline)
      Parsed [   0.060s] (baseline)
    Checking datafusion-ffi v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.232s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 121.591s] datafusion-ffi
    Building datafusion-functions-aggregate v53.1.0 (current)
       Built [  29.989s] (current)
     Parsing datafusion-functions-aggregate v53.1.0 (current)
      Parsed [   0.046s] (current)
    Building datafusion-functions-aggregate v53.1.0 (baseline)
       Built [  30.855s] (baseline)
     Parsing datafusion-functions-aggregate v53.1.0 (baseline)
      Parsed [   0.048s] (baseline)
    Checking datafusion-functions-aggregate v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.224s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  63.242s] datafusion-functions-aggregate
    Building datafusion-functions-aggregate-common v53.1.0 (current)
       Built [  20.704s] (current)
     Parsing datafusion-functions-aggregate-common v53.1.0 (current)
      Parsed [   0.019s] (current)
    Building datafusion-functions-aggregate-common v53.1.0 (baseline)
       Built [  20.472s] (baseline)
     Parsing datafusion-functions-aggregate-common v53.1.0 (baseline)
      Parsed [   0.020s] (baseline)
    Checking datafusion-functions-aggregate-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.129s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  42.438s] datafusion-functions-aggregate-common
    Building datafusion-physical-plan v53.1.0 (current)
       Built [  36.202s] (current)
     Parsing datafusion-physical-plan v53.1.0 (current)
      Parsed [   0.130s] (current)
    Building datafusion-physical-plan v53.1.0 (baseline)
       Built [  36.031s] (baseline)
     Parsing datafusion-physical-plan v53.1.0 (baseline)
      Parsed [   0.130s] (baseline)
    Checking datafusion-physical-plan v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.571s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  74.645s] datafusion-physical-plan
    Building datafusion-spark v53.1.0 (current)
       Built [  55.928s] (current)
     Parsing datafusion-spark v53.1.0 (current)
      Parsed [   0.059s] (current)
    Building datafusion-spark v53.1.0 (baseline)
       Built [  55.961s] (baseline)
     Parsing datafusion-spark v53.1.0 (baseline)
      Parsed [   0.062s] (baseline)
    Checking datafusion-spark v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.357s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 114.685s] datafusion-spark
    Building datafusion-sqllogictest v53.1.0 (current)
       Built [ 168.528s] (current)
     Parsing datafusion-sqllogictest v53.1.0 (current)
      Parsed [   0.023s] (current)
    Building datafusion-sqllogictest v53.1.0 (baseline)
       Built [ 168.727s] (baseline)
     Parsing datafusion-sqllogictest v53.1.0 (baseline)
      Parsed [   0.029s] (baseline)
    Checking datafusion-sqllogictest v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.095s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 341.434s] datafusion-sqllogictest

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 8, 2026
@haohuaijin haohuaijin force-pushed the remove-opt-filter branch from d569c35 to 2e667ba Compare June 8, 2026 06:15
@haohuaijin haohuaijin marked this pull request as draft June 8, 2026 06:22
@2010YOUY01
Copy link
Copy Markdown
Contributor

Thanks for working on it!

Assuming the assumption is correct, that opt_filter is indeed never used — I think it is better to remove it, even if this causes an API change and downstream implementations need to update. A misleading API makes it easier to introduce bugs.

However, since this is a public API change, I suggest to proceed after getting a few more +1s.

Comment on lines -1233 to -1237
if self.mode.input_mode() == AggregateInputMode::Raw {
acc.update_batch(args, &[0], Some(&false_filter), total_groups)?;
} else {
acc.merge_batch(args, &[0], Some(&false_filter), total_groups)?;
}
Copy link
Copy Markdown
Contributor Author

@haohuaijin haohuaijin Jun 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that aggregate_arguments depends on the aggregate mode. In Raw mode it has raw input arguments, which fit update_batch. In Partial input modes it has state fields, which normally fit merge_batch. If this path is ever reached in a Partial input mode, update_batch could receive the wrong shape of arrays. For example, AVG takes one raw input array in update_batch, but two state arrays in merge_batch.

I could not reproduce this through normal SQL/sqllogictest. The usual SQL plan initializes empty grouping sets in the Partial stage, so the Final stage gets a non-empty state row and does not hit this path. see the test case added in dcc9c27

cc @2010YOUY01 @alamb

this is only place that use the merge_batch with not None opt_filter(added recently).

Copy link
Copy Markdown
Contributor Author

@haohuaijin haohuaijin Jun 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i check more on this part, and found the Final / FinalPartitioned build their group-by via as_final(), which hard-codes has_grouping_set: false. So init_empty_grouping_sets hits its !has_grouping_set() early-return and never call the merge_batch.

i update the docs in bccad62

@github-actions github-actions Bot added the sqllogictest SQL Logic Tests (.slt) label Jun 8, 2026
@haohuaijin haohuaijin marked this pull request as ready for review June 8, 2026 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto detected api change Auto detected API change core Core DataFusion crate documentation Improvements or additions to documentation ffi Changes to the ffi crate functions Changes to functions implementation logical-expr Logical plan and expressions physical-plan Changes to the physical-plan crate spark sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

update merge_batch's docs for opt_filter in GroupsAccumulator

2 participants