fix: bit_count function to report nullability correctly #19197

harshitsaini17 · 2025-12-08T06:59:32Z

Which issue does this PR close?

Closes #19147

Part of #19144 (EPIC: fix nullability report for spark expression)

Rationale for this change

The bit_count UDF was using the default return_type implementation which does not preserve nullability information. This causes:

Incorrect schema inference - non-nullable integer inputs are incorrectly marked as producing nullable Int32 outputs
Missed optimization opportunities - the query optimizer cannot apply nullability-based optimizations when metadata is incorrect
Inconsistent behavior - other similar functions preserve nullability, leading to unexpected differences

The bit_count function counts the number of set bits (ones) in the binary representation of a number and returns an Int32. The operation itself doesn't introduce nullability - if the input is non-nullable, the output will always be non-nullable. Therefore, the output nullability should match the input.

What changes are included in this PR?

Implemented return_field_from_args: Creates a field with Int32 type and the same nullability as the input field
Updated return_type: Now returns an error directing users to use return_field_from_args instead (following DataFusion best practices)
Added FieldRef and internal_err imports to support the new implementation
Added nullability test: Verifies that both nullable and non-nullable inputs are handled correctly

Are these changes tested?

Yes, this PR includes a new test test_bit_count_nullability that verifies:

Non-nullable Int32 input produces non-nullable Int32 output
Nullable Int32 input produces nullable Int32 output
Data type is correctly set to Int32 in both cases

Test results:

running 1 test
test function::bitwise::bit_count::tests::test_bit_count_nullability ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured

Additionally, all existing bit_count tests continue to pass, ensuring backward compatibility.

Are there any user-facing changes?

Yes - Schema metadata improvement:

Users will now see correct nullability information in the schema:

Before (Bug):

Non-nullable Int32/Int64/Boolean input → nullable Int32 output (incorrect)

After (Fixed):

Non-nullable Int32/Int64/Boolean input → non-nullable Int32 output (correct)
Nullable Int32/Int64/Boolean input → nullable Int32 output (correct)

This is a bug fix that corrects schema metadata only - it does not change the actual computation or introduce any breaking changes to the API.

Impact:

Query optimizers can now make better decisions based on accurate nullability information
Schema validation will be more accurate
No changes to function behavior or output values

Code Changes Summary

Modified File: `datafusion/spark/src/function/bitwise/bit_count.rs`

1. Added Imports

use arrow::datatypes::{
    DataType, FieldRef, Int16Type, Int32Type, Int64Type, Int8Type, UInt16Type, UInt32Type,
    UInt64Type, UInt8Type,
};
use datafusion_common::{internal_err, plan_err, Result};

2. Updated return_type Method

fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
    internal_err!("return_field_from_args should be used instead")
}

3. Added return_field_from_args Implementation

fn return_field_from_args(
    &self,
    args: datafusion_expr::ReturnFieldArgs,
) -> Result<FieldRef> {
    use arrow::datatypes::Field;
    // bit_count returns Int32 with the same nullability as the input
    Ok(Arc::new(Field::new(
        args.arg_fields[0].name(),
        DataType::Int32,
        args.arg_fields[0].is_nullable(),
    )))
}

4. Added Test

#[test]
fn test_bit_count_nullability() -> Result<()> {
    use datafusion_expr::ReturnFieldArgs;

    let bit_count = SparkBitCount::new();

    // Test with non-nullable Int32 field
    let non_nullable_field = Arc::new(Field::new("num", DataType::Int32, false));

    let result = bit_count.return_field_from_args(ReturnFieldArgs {
        arg_fields: &[Arc::clone(&non_nullable_field)],
        scalar_arguments: &[None],
    })?;

    // The result should not be nullable (same as input)
    assert!(!result.is_nullable());
    assert_eq!(result.data_type(), &DataType::Int32);

    // Test with nullable Int32 field
    let nullable_field = Arc::new(Field::new("num", DataType::Int32, true));

    let result = bit_count.return_field_from_args(ReturnFieldArgs {
        arg_fields: &[Arc::clone(&nullable_field)],
        scalar_arguments: &[None],
    })?;

    // The result should be nullable (same as input)
    assert!(result.is_nullable());
    assert_eq!(result.data_type(), &DataType::Int32);

    Ok(())
}

Verification Steps

Run the new test:

cargo test -p datafusion-spark test_bit_count_nullability --lib

Run all bit_count tests:

cargo test -p datafusion-spark bit_count --lib

Run clippy checks:

cargo clippy -p datafusion-spark --all-targets -- -D warnings

All checks pass successfully!

Related Issues

Closes: spark bit_count need to have custom nullability #19147
Part of EPIC: EPIC: fix nullability report for spark expression #19144 (fix nullability report for spark expression)
Similar fixes:
- shuffle should report nullability correctly #19145 (shuffle function nullability)
- spark bitmap_count need to have custom nullability #19146 (bitmap_count function nullability)

- Replace return_type with return_field_from_args to preserve input nullability - Add test to verify nullability is correctly reported - Addresses issue apache#19147

Copilot

Pull request overview

This PR fixes the bit_count function to correctly preserve nullability information from input to output. Previously, the function used the default return_type implementation which always marked outputs as nullable, resulting in incorrect schema metadata. The fix implements return_field_from_args to return Int32 with the same nullability as the input field.

Key changes:

Implemented return_field_from_args method to preserve input nullability
Updated return_type to return an error directing to use return_field_from_args
Added comprehensive test coverage for nullability behavior

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

harshitsaini17 · 2025-12-08T07:30:25Z

Hi @rluvaton ! I'm seeing a CI formatting error in const_evaluator.rs on my PR branch. However, this file is not part of my changes - I only modified:

bit_count.rs
The formatting error appears to have come from merging the upstream main branch into my feature branch to keep it up-to-date.

Question: Should I:

Fix the formatting issue in const_evaluator.rs as part of this PR (even though it's unrelated to my changes)?
Revert the merge commit and rebase instead to avoid including this issue?
Wait for the formatting issue to be fixed in main first, then rebase my branch?

I want to make sure I follow the correct contribution workflow. Thanks!

comphead · 2025-12-08T17:18:37Z

The formatting is already fixed

rluvaton

I verified that Spark nullability depend on the child:

BitwiseCount:
https://github.com/apache/spark/blob/a189a450e9906362749e3ad3d022f593a2a562e4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitwiseExpressions.scala#L216-L261

UnaryExpression:
https://github.com/apache/spark/blob/a189a450e9906362749e3ad3d022f593a2a562e4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L590

Fix bit_count function to report nullability correctly

9f1ccc7

- Replace return_type with return_field_from_args to preserve input nullability - Add test to verify nullability is correctly reported - Addresses issue apache#19147

Copilot AI review requested due to automatic review settings December 8, 2025 06:59

github-actions bot added the spark label Dec 8, 2025

Copilot started reviewing on behalf of harshitsaini17 December 8, 2025 06:59 View session

Copilot AI reviewed Dec 8, 2025

View reviewed changes

Merge branch 'apache:main' into fix/bit-count-nullability-19147

57bb38a

harshitsaini17 changed the title ~~Fix bit_count function to report nullability correctly~~ fix: bit_count function to report nullability correctly Dec 8, 2025

Merge branch 'main' into fix/bit-count-nullability-19147

fb99651

rluvaton approved these changes Dec 9, 2025

View reviewed changes

rluvaton added this pull request to the merge queue Dec 9, 2025

Merged via the queue into apache:main with commit 1a6df66 Dec 9, 2025
27 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: bit_count function to report nullability correctly #19197

fix: bit_count function to report nullability correctly #19197

harshitsaini17 commented Dec 8, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

harshitsaini17 commented Dec 8, 2025

Uh oh!

comphead commented Dec 8, 2025

Uh oh!

rluvaton left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fix: bit_count function to report nullability correctly #19197

fix: bit_count function to report nullability correctly #19197

Conversation

harshitsaini17 commented Dec 8, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Code Changes Summary

Modified File: datafusion/spark/src/function/bitwise/bit_count.rs

1. Added Imports

2. Updated return_type Method

3. Added return_field_from_args Implementation

4. Added Test

Verification Steps

Related Issues

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

harshitsaini17 commented Dec 8, 2025

Uh oh!

comphead commented Dec 8, 2025

Uh oh!

rluvaton left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Modified File: `datafusion/spark/src/function/bitwise/bit_count.rs`