Skip to content

Conversation

@harshitsaini17
Copy link
Contributor

Which issue does this PR close?

Closes #19147

Part of #19144 (EPIC: fix nullability report for spark expression)

Rationale for this change

The bit_count UDF was using the default return_type implementation which does not preserve nullability information. This causes:

  1. Incorrect schema inference - non-nullable integer inputs are incorrectly marked as producing nullable Int32 outputs
  2. Missed optimization opportunities - the query optimizer cannot apply nullability-based optimizations when metadata is incorrect
  3. Inconsistent behavior - other similar functions preserve nullability, leading to unexpected differences

The bit_count function counts the number of set bits (ones) in the binary representation of a number and returns an Int32. The operation itself doesn't introduce nullability - if the input is non-nullable, the output will always be non-nullable. Therefore, the output nullability should match the input.

What changes are included in this PR?

  1. Implemented return_field_from_args: Creates a field with Int32 type and the same nullability as the input field
  2. Updated return_type: Now returns an error directing users to use return_field_from_args instead (following DataFusion best practices)
  3. Added FieldRef and internal_err imports to support the new implementation
  4. Added nullability test: Verifies that both nullable and non-nullable inputs are handled correctly

Are these changes tested?

Yes, this PR includes a new test test_bit_count_nullability that verifies:

  • Non-nullable Int32 input produces non-nullable Int32 output
  • Nullable Int32 input produces nullable Int32 output
  • Data type is correctly set to Int32 in both cases

Test results:

running 1 test
test function::bitwise::bit_count::tests::test_bit_count_nullability ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured

Additionally, all existing bit_count tests continue to pass, ensuring backward compatibility.

Are there any user-facing changes?

Yes - Schema metadata improvement:

Users will now see correct nullability information in the schema:

Before (Bug):

  • Non-nullable Int32/Int64/Boolean input → nullable Int32 output (incorrect)

After (Fixed):

  • Non-nullable Int32/Int64/Boolean input → non-nullable Int32 output (correct)
  • Nullable Int32/Int64/Boolean input → nullable Int32 output (correct)

This is a bug fix that corrects schema metadata only - it does not change the actual computation or introduce any breaking changes to the API.

Impact:

  • Query optimizers can now make better decisions based on accurate nullability information
  • Schema validation will be more accurate
  • No changes to function behavior or output values

Code Changes Summary

Modified File: datafusion/spark/src/function/bitwise/bit_count.rs

1. Added Imports

use arrow::datatypes::{
    DataType, FieldRef, Int16Type, Int32Type, Int64Type, Int8Type, UInt16Type, UInt32Type,
    UInt64Type, UInt8Type,
};
use datafusion_common::{internal_err, plan_err, Result};

2. Updated return_type Method

fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
    internal_err!("return_field_from_args should be used instead")
}

3. Added return_field_from_args Implementation

fn return_field_from_args(
    &self,
    args: datafusion_expr::ReturnFieldArgs,
) -> Result<FieldRef> {
    use arrow::datatypes::Field;
    // bit_count returns Int32 with the same nullability as the input
    Ok(Arc::new(Field::new(
        args.arg_fields[0].name(),
        DataType::Int32,
        args.arg_fields[0].is_nullable(),
    )))
}

4. Added Test

#[test]
fn test_bit_count_nullability() -> Result<()> {
    use datafusion_expr::ReturnFieldArgs;

    let bit_count = SparkBitCount::new();

    // Test with non-nullable Int32 field
    let non_nullable_field = Arc::new(Field::new("num", DataType::Int32, false));

    let result = bit_count.return_field_from_args(ReturnFieldArgs {
        arg_fields: &[Arc::clone(&non_nullable_field)],
        scalar_arguments: &[None],
    })?;

    // The result should not be nullable (same as input)
    assert!(!result.is_nullable());
    assert_eq!(result.data_type(), &DataType::Int32);

    // Test with nullable Int32 field
    let nullable_field = Arc::new(Field::new("num", DataType::Int32, true));

    let result = bit_count.return_field_from_args(ReturnFieldArgs {
        arg_fields: &[Arc::clone(&nullable_field)],
        scalar_arguments: &[None],
    })?;

    // The result should be nullable (same as input)
    assert!(result.is_nullable());
    assert_eq!(result.data_type(), &DataType::Int32);

    Ok(())
}

Verification Steps

  1. Run the new test:

    cargo test -p datafusion-spark test_bit_count_nullability --lib
  2. Run all bit_count tests:

    cargo test -p datafusion-spark bit_count --lib
  3. Run clippy checks:

    cargo clippy -p datafusion-spark --all-targets -- -D warnings

All checks pass successfully!


Related Issues


- Replace return_type with return_field_from_args to preserve input nullability
- Add test to verify nullability is correctly reported
- Addresses issue apache#19147
Copilot AI review requested due to automatic review settings December 8, 2025 06:59
@github-actions github-actions bot added the spark label Dec 8, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes the bit_count function to correctly preserve nullability information from input to output. Previously, the function used the default return_type implementation which always marked outputs as nullable, resulting in incorrect schema metadata. The fix implements return_field_from_args to return Int32 with the same nullability as the input field.

Key changes:

  • Implemented return_field_from_args method to preserve input nullability
  • Updated return_type to return an error directing to use return_field_from_args
  • Added comprehensive test coverage for nullability behavior

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@harshitsaini17
Copy link
Contributor Author

Hi @rluvaton ! I'm seeing a CI formatting error in const_evaluator.rs on my PR branch. However, this file is not part of my changes - I only modified:

bit_count.rs
The formatting error appears to have come from merging the upstream main branch into my feature branch to keep it up-to-date.

Question: Should I:

Fix the formatting issue in const_evaluator.rs as part of this PR (even though it's unrelated to my changes)?
Revert the merge commit and rebase instead to avoid including this issue?
Wait for the formatting issue to be fixed in main first, then rebase my branch?

I want to make sure I follow the correct contribution workflow. Thanks!

@harshitsaini17 harshitsaini17 changed the title Fix bit_count function to report nullability correctly fix: bit_count function to report nullability correctly Dec 8, 2025
@comphead
Copy link
Contributor

comphead commented Dec 8, 2025

The formatting is already fixed

Copy link
Member

@rluvaton rluvaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rluvaton rluvaton added this pull request to the merge queue Dec 9, 2025
Merged via the queue into apache:main with commit 1a6df66 Dec 9, 2025
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

spark bit_count need to have custom nullability

3 participants