Skip to content

Conversation

@GaneshPatil7517
Copy link
Contributor

Which issue does this PR close?
Closes #19511
Related to #18882

Rationale for this change
Currently, AggregateUDFImpl::is_nullable() returns true by default for all UDAFs, regardless of input characteristics. This is not ideal because:

The same nullability information is already encoded in return_field()
Most aggregate functions should only be nullable if their inputs are nullable (e.g., MIN, MAX, SUM)
This pattern doesn't align with scalar UDFs, which already use return_field_from_args() for nullability
What changes are included in this PR?
Core Changes
Deprecated is_nullable() on AggregateUDFImpl trait with migration guidance
Updated udaf_default_return_field() to compute nullability from input fields:
Output is nullable if ANY input field is nullable
Output is non-nullable only if ALL inputs are non-nullable
Tests
Added 4 new tests validating nullability inference:

test_return_field_nullability_from_nullable_input
test_return_field_nullability_from_non_nullable_input
test_return_field_nullability_with_mixed_inputs
test_return_field_preserves_return_type
Documentation
New docs/source/library-user-guide/functions/udf-nullability.md with migration guide and examples
Updated adding-udfs.md with reference to nullability documentation
Are these changes tested?
Yes. All existing tests pass, plus 4 new tests specifically for nullability behavior.

Are there any user-facing changes?
Deprecation warning: Users implementing is_nullable() will see a deprecation warning directing them to use return_field() instead.

Behavioral change: Default nullability now depends on input field nullability rather than always returning true. Functions like COUNT that need to always return non-nullable should override return_field().

This is a potentially breaking change for users who rely on the previous behavior of always-nullable outputs, but the new behavior is more correct and aligns with scalar UDF patterns.

GaneshPatil7517 and others added 15 commits January 6, 2026 09:54
This commit fixes issue #19612 where accumulators that don't implement
retract_batch exhibit buggy behavior in window frame queries.

When aggregate functions are used with window frames like
`ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, DataFusion uses
PlainAggregateWindowExpr which calls evaluate() multiple times on the
same accumulator instance. Accumulators that use std::mem::take() in
their evaluate() method consume their internal state, causing incorrect
results on subsequent calls.

1. **percentile_cont**: Modified evaluate() to use mutable reference
   instead of consuming the Vec. Added retract_batch() support for
   both PercentileContAccumulator and DistinctPercentileContAccumulator.

2. **string_agg**: Changed SimpleStringAggAccumulator::evaluate() to
   clone the accumulated string instead of taking it.

- datafusion/functions-aggregate/src/percentile_cont.rs:
  - Changed calculate_percentile() to take &mut [T::Native] instead of Vec<T::Native>
  - Updated PercentileContAccumulator::evaluate() to pass reference
  - Updated DistinctPercentileContAccumulator::evaluate() to clone values
  - Added retract_batch() implementation using HashMap for efficient removal
  - Updated PercentileContGroupsAccumulator::evaluate() for consistency

- datafusion/functions-aggregate/src/string_agg.rs:
  - Changed evaluate() to use clone() instead of std::mem::take()

- datafusion/sqllogictest/test_files/aggregate.slt:
  - Added test cases for percentile_cont with window frames
  - Added test comparing median() vs percentile_cont(0.5) behavior
  - Added test for string_agg cumulative window frame

- docs/source/library-user-guide/functions/adding-udfs.md:
  - Added documentation about window-compatible accumulators
  - Explained evaluate() state preservation requirements
  - Documented retract_batch() implementation guidance

Closes #19612
…ability inference

This change improves how nullability is computed for aggregate UDF outputs by making
it depend on the nullability of input fields, aligning with the pattern used for
scalar UDFs.

Changes:
- Mark is_nullable() method as deprecated in AggregateUDFImpl trait
- Update udaf_default_return_field() to compute output nullability from input fields:
  * Output is nullable if ANY input field is nullable
  * Output is non-nullable only if ALL input fields are non-nullable
- Add deprecation migration guide in is_nullable() documentation
- Add #[allow(deprecated)] to wrapper method calls in AggregateUDF and AliasedAggregateUDFImpl

Testing:
- Add 4 new tests validating nullability inference from input fields:
  * test_return_field_nullability_from_nullable_input
  * test_return_field_nullability_from_non_nullable_input
  * test_return_field_nullability_with_mixed_inputs
  * test_return_field_preserves_return_type
- All existing tests continue to pass (test_partial_eq, test_partial_ord)
- No regressions in aggregate function execution

Documentation:
- Create new docs/source/library-user-guide/functions/udf-nullability.md
  * Explains the nullability change and rationale
  * Provides migration guide for custom UDAF implementations
  * Includes examples for default and custom nullability behavior
  * References scalar UDF patterns
- Update docs/source/library-user-guide/functions/adding-udfs.md
  * Add section on nullability of aggregate functions
  * Link to new comprehensive nullability documentation

Fixes: #19511 (related to #18882)
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
@github-actions github-actions bot added documentation Improvements or additions to documentation logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Jan 12, 2026
@GaneshPatil7517 GaneshPatil7517 closed this by deleting the head repository Jan 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation functions Changes to functions implementation logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consider changing nullability of UDFs to depend on inputs by default

1 participant