Skip to content

Conversation

@harshitsaini17
Copy link
Contributor

Which issue does this PR close?

Closes #19146

Part of #19144 (EPIC: fix nullability report for spark expression)

Rationale for this change

The bitmap_count UDF was using the default return_type implementation which does not preserve nullability information. This causes:

  1. Incorrect schema inference - non-nullable binary inputs are incorrectly marked as producing nullable Int64 outputs
  2. Missed optimization opportunities - the query optimizer cannot apply nullability-based optimizations when metadata is incorrect
  3. Inconsistent behavior - other similar functions preserve nullability, leading to unexpected differences

The bitmap_count function counts set bits in a binary input and returns an Int64. The operation itself doesn't introduce nullability - if the input is non-nullable, the output will always be non-nullable. Therefore, the output nullability should match the input.

What changes are included in this PR?

  1. Implemented return_field_from_args: Creates a field with Int64 type and the same nullability as the input field
  2. Updated return_type: Now returns an error directing users to use return_field_from_args instead (following DataFusion best practices)
  3. Added FieldRef import to support returning field metadata
  4. Added nullability test: Verifies that both nullable and non-nullable inputs are handled correctly

Are these changes tested?

Yes, this PR includes a new test test_bitmap_count_nullability that verifies:

  • Non-nullable Binary input produces non-nullable Int64 output
  • Nullable Binary input produces nullable Int64 output
  • Data type is correctly set to Int64 in both cases

Test results:

running 1 test
test function::bitmap::bitmap_count::tests::test_bitmap_count_nullability ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured

Additionally, all existing bitmap_count tests continue to pass, ensuring backward compatibility.

Are there any user-facing changes?

Yes - Schema metadata improvement:

Users will now see correct nullability information in the schema:

Before (Bug):

  • Non-nullable binary input → nullable Int64 output (incorrect)

After (Fixed):

  • Non-nullable binary input → non-nullable Int64 output (correct)
  • Nullable binary input → nullable Int64 output (correct)

This is a bug fix that corrects schema metadata only - it does not change the actual computation or introduce any breaking changes to the API.

Impact:

  • Query optimizers can now make better decisions based on accurate nullability information
  • Schema validation will be more accurate
  • No changes to function behavior or output values

Code Changes Summary

Modified File: datafusion/spark/src/function/bitmap/bitmap_count.rs

1. Added Import

use arrow::datatypes::{DataType, FieldRef, Int16Type, Int32Type, Int64Type, Int8Type};

2. Updated return_type Method

fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
    internal_err!("return_field_from_args should be used instead")
}

3. Added return_field_from_args Implementation

fn return_field_from_args(
    &self,
    args: datafusion_expr::ReturnFieldArgs,
) -> Result<FieldRef> {
    use arrow::datatypes::Field;
    // bitmap_count returns Int64 with the same nullability as the input
    Ok(Arc::new(Field::new(
        args.arg_fields[0].name(),
        DataType::Int64,
        args.arg_fields[0].is_nullable(),
    )))
}

4. Added Test

#[test]
fn test_bitmap_count_nullability() -> Result<()> {
    use datafusion_expr::ReturnFieldArgs;

    let bitmap_count = BitmapCount::new();

    // Test with non-nullable binary field
    let non_nullable_field = Arc::new(Field::new("bin", DataType::Binary, false));

    let result = bitmap_count.return_field_from_args(ReturnFieldArgs {
        arg_fields: &[Arc::clone(&non_nullable_field)],
        scalar_arguments: &[None],
    })?;

    // The result should not be nullable (same as input)
    assert!(!result.is_nullable());
    assert_eq!(result.data_type(), &Int64);

    // Test with nullable binary field
    let nullable_field = Arc::new(Field::new("bin", DataType::Binary, true));

    let result = bitmap_count.return_field_from_args(ReturnFieldArgs {
        arg_fields: &[Arc::clone(&nullable_field)],
        scalar_arguments: &[None],
    })?;

    // The result should be nullable (same as input)
    assert!(result.is_nullable());
    assert_eq!(result.data_type(), &Int64);

    Ok(())
}

Verification Steps

  1. Run the new test:

    cargo test -p datafusion-spark test_bitmap_count_nullability --lib
  2. Run all bitmap_count tests:

    cargo test -p datafusion-spark bitmap_count --lib
  3. Run clippy checks:

    cargo clippy -p datafusion-spark --all-targets -- -D warnings

All checks pass successfully!


Related Issues


- Replace return_type with return_field_from_args to preserve input nullability
- Add test to verify nullability is correctly reported
- Addresses issue apache#19146
Copilot AI review requested due to automatic review settings December 8, 2025 06:14
@github-actions github-actions bot added the spark label Dec 8, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes the bitmap_count function to correctly report nullability information by implementing return_field_from_args instead of relying on the default return_type behavior. The function now properly preserves the nullability of its input when determining the output field's nullability.

Key changes:

  • Implemented return_field_from_args to create an Int64 field with nullability matching the input
  • Updated return_type to return an error directing users to use return_field_from_args
  • Added comprehensive test coverage for both nullable and non-nullable input cases

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Member

@rluvaton rluvaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rluvaton rluvaton added this pull request to the merge queue Dec 9, 2025
Merged via the queue into apache:main with commit 4fb36b2 Dec 9, 2025
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

spark bitmap_count need to have custom nullability shuffle should report nullability correctly

2 participants