-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Refs/heads/deprecate is nullable return field #19765
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
GaneshPatil7517
wants to merge
15
commits into
apache:main
from
GaneshPatil7517:deprecate-is-nullable-return-field
Closed
Refs/heads/deprecate is nullable return field #19765
GaneshPatil7517
wants to merge
15
commits into
apache:main
from
GaneshPatil7517:deprecate-is-nullable-return-field
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This commit fixes issue #19612 where accumulators that don't implement retract_batch exhibit buggy behavior in window frame queries. When aggregate functions are used with window frames like `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, DataFusion uses PlainAggregateWindowExpr which calls evaluate() multiple times on the same accumulator instance. Accumulators that use std::mem::take() in their evaluate() method consume their internal state, causing incorrect results on subsequent calls. 1. **percentile_cont**: Modified evaluate() to use mutable reference instead of consuming the Vec. Added retract_batch() support for both PercentileContAccumulator and DistinctPercentileContAccumulator. 2. **string_agg**: Changed SimpleStringAggAccumulator::evaluate() to clone the accumulated string instead of taking it. - datafusion/functions-aggregate/src/percentile_cont.rs: - Changed calculate_percentile() to take &mut [T::Native] instead of Vec<T::Native> - Updated PercentileContAccumulator::evaluate() to pass reference - Updated DistinctPercentileContAccumulator::evaluate() to clone values - Added retract_batch() implementation using HashMap for efficient removal - Updated PercentileContGroupsAccumulator::evaluate() for consistency - datafusion/functions-aggregate/src/string_agg.rs: - Changed evaluate() to use clone() instead of std::mem::take() - datafusion/sqllogictest/test_files/aggregate.slt: - Added test cases for percentile_cont with window frames - Added test comparing median() vs percentile_cont(0.5) behavior - Added test for string_agg cumulative window frame - docs/source/library-user-guide/functions/adding-udfs.md: - Added documentation about window-compatible accumulators - Explained evaluate() state preservation requirements - Documented retract_batch() implementation guidance Closes #19612
…ability inference This change improves how nullability is computed for aggregate UDF outputs by making it depend on the nullability of input fields, aligning with the pattern used for scalar UDFs. Changes: - Mark is_nullable() method as deprecated in AggregateUDFImpl trait - Update udaf_default_return_field() to compute output nullability from input fields: * Output is nullable if ANY input field is nullable * Output is non-nullable only if ALL input fields are non-nullable - Add deprecation migration guide in is_nullable() documentation - Add #[allow(deprecated)] to wrapper method calls in AggregateUDF and AliasedAggregateUDFImpl Testing: - Add 4 new tests validating nullability inference from input fields: * test_return_field_nullability_from_nullable_input * test_return_field_nullability_from_non_nullable_input * test_return_field_nullability_with_mixed_inputs * test_return_field_preserves_return_type - All existing tests continue to pass (test_partial_eq, test_partial_ord) - No regressions in aggregate function execution Documentation: - Create new docs/source/library-user-guide/functions/udf-nullability.md * Explains the nullability change and rationale * Provides migration guide for custom UDAF implementations * Includes examples for default and custom nullability behavior * References scalar UDF patterns - Update docs/source/library-user-guide/functions/adding-udfs.md * Add section on nullability of aggregate functions * Link to new comprehensive nullability documentation Fixes: #19511 (related to #18882)
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
documentation
Improvements or additions to documentation
functions
Changes to functions implementation
logical-expr
Logical plan and expressions
sqllogictest
SQL Logic Tests (.slt)
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #19511
Related to #18882
Rationale for this change
Currently, AggregateUDFImpl::is_nullable() returns true by default for all UDAFs, regardless of input characteristics. This is not ideal because:
The same nullability information is already encoded in return_field()
Most aggregate functions should only be nullable if their inputs are nullable (e.g., MIN, MAX, SUM)
This pattern doesn't align with scalar UDFs, which already use return_field_from_args() for nullability
What changes are included in this PR?
Core Changes
Deprecated is_nullable() on AggregateUDFImpl trait with migration guidance
Updated udaf_default_return_field() to compute nullability from input fields:
Output is nullable if ANY input field is nullable
Output is non-nullable only if ALL inputs are non-nullable
Tests
Added 4 new tests validating nullability inference:
test_return_field_nullability_from_nullable_input
test_return_field_nullability_from_non_nullable_input
test_return_field_nullability_with_mixed_inputs
test_return_field_preserves_return_type
Documentation
New docs/source/library-user-guide/functions/udf-nullability.md with migration guide and examples
Updated adding-udfs.md with reference to nullability documentation
Are these changes tested?
Yes. All existing tests pass, plus 4 new tests specifically for nullability behavior.
Are there any user-facing changes?
Deprecation warning: Users implementing is_nullable() will see a deprecation warning directing them to use return_field() instead.
Behavioral change: Default nullability now depends on input field nullability rather than always returning true. Functions like COUNT that need to always return non-nullable should override return_field().
This is a potentially breaking change for users who rely on the previous behavior of always-nullable outputs, but the new behavior is more correct and aligns with scalar UDF patterns.