Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NullArray::is_null() returns false incorrectly #4835

Closed
alamb opened this issue Sep 18, 2023 · 15 comments · Fixed by #4838
Closed

NullArray::is_null() returns false incorrectly #4835

alamb opened this issue Sep 18, 2023 · 15 comments · Fixed by #4838
Labels
arrow Changes to the arrow crate bug

Comments

@alamb
Copy link
Contributor

alamb commented Sep 18, 2023

Describe the bug
Found by @jonahgao in apache/datafusion#7568

To Reproduce

fn main() {
    let arr = NullArray::new(1);
    println!("arr.is_null(0): {}", arr.is_null(0))
}

Prints

arr.is_null(0): false

Expected behavior
NullArray::is_null() should return true for all elements

Additional context

@alamb
Copy link
Contributor Author

alamb commented Sep 18, 2023

I think this is a good one for someone who wants to make a contribution to Arrow -- it is straightforward and well specified

@tustvold
Copy link
Contributor

tustvold commented Sep 18, 2023

This was actually an intentional change in https://github.com/apache/arrow-rs/pull/4691/files#r1292317464

@tustvold tustvold removed the good first issue Good for newcomers label Sep 18, 2023
@alamb
Copy link
Contributor Author

alamb commented Sep 18, 2023

I think it is a bug -- having NullArray::is_null() ever return false seems simply incorrect, as well as being very confusing.

Practically it means we can't use is_null to check if an element is null -- instead we have to also special case DataType::is_null as @jonahgao did in apache/datafusion#7568

Maybe the problem is that NullArray::new(1) doesn't set up the buffers correctly 🤔

@tustvold
Copy link
Contributor

Maybe the problem is that NullArray::new(1) doesn't set up the buffers correctly 🤔

The difficulty is the arrow spec very clearly states that a NullArray has no buffers, including no null buffer. A NullArray technically isn't nullable, instead the only values are null... It is definitely strange...

@alamb
Copy link
Contributor Author

alamb commented Sep 18, 2023

Do you agree, from a user's perspective, that NullArray::is_null() return false is incorrect? I don't have any opinions on how to best fix this internally

@tustvold
Copy link
Contributor

tustvold commented Sep 18, 2023

No, as per the standard a NullArray is not nullable, its values are just null. I agree it is deeply confusing and it would be better if arrow didn't have logical nullability, but I don't really see an obvious way around this

@alamb
Copy link
Contributor Author

alamb commented Sep 18, 2023

I don't understand what "is not nullable, its values are just null" means.

https://arrow.apache.org/docs/format/Columnar.html#null-layout says

"We provide a simplified memory-efficient layout for the Null data type where all values are null. In this case no memory buffers are allocated."

The arrow-rs https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.is_null says
says

Returns whether the element at index is null.

Therefore shouldn't Array::is_null return null for all values in a NullArray because all the values are null 🤔 ?

I didn't find anything contradicting this position in the .fbs format
https://github.com/apache/arrow/blob/440dc92caa73ca67c8ca98cebfb74f33788150bf/format/Schema.fbs#L82-L83

@tustvold
Copy link
Contributor

tustvold commented Sep 18, 2023

Therefore shouldn't Array::is_null return null for all values in a NullArray because all the values are null. 🤔

The problem is doing this would be inconsistent with the values returned by Array::nulls() so the following would not hold

assert_eq!(array.nulls().is_some_and(|n| n.is_null(idx)), array.is_null(idx));

Similarly for the corresponding methods on ArrayData. This fudging around physical nullability has caused a number of issues in the past (#4691).

We have the exact same challenge for DictionaryArray and RunArray which both have logical concepts of nullability that differ from the physical realities.

I agree this is unfortunate and represents a rough edge, but I'm not sure how to coherently handle this

@alamb
Copy link
Contributor Author

alamb commented Sep 18, 2023

The problem is doing this would be inconsistent with the values returned by Array::nulls() so the following would not hold

If this property doesn't hold for some arrays it isn't an invariant, is it?

It is not clear that making NullArray::is_null return such a confusingly and seeming incorrect answer helps people understand the nuance

@tustvold
Copy link
Contributor

If this property doesn't hold for some arrays it isn't an invariant, is it?

It does hold, what you are proposing would break this 😄

It is not clear that making NullArray::is_null return such a confusingly and seeming incorrect answer helps people understand the nuance

I don't disagree, but I also don't agree that making it return an incorrect value to make it fit people's logical models is a good solution either 😄

@alamb
Copy link
Contributor Author

alamb commented Sep 18, 2023

I don't disagree, but I also don't agree that making it return an incorrect value to make it fit people's logical models is a good solution either 😄

I think we disagree on what the correct value is 😆

@tustvold
Copy link
Contributor

tustvold commented Sep 18, 2023

Currently the Array methods consistently return nullability solely as determined by the null buffer, irrespective of the array's logical value, i.e. if a dictionary value is null, this won't be reflected in is_null.

We could change this, but doing so would significantly regress performance, especially for RunArray. The trade-off in #4691 was to not do this, and instead allow amortising this overhead computing logical_nulls as an explicit "kernel".

I agree the current situation is unfortunate, but I do feel strongly that we should be consistent about what these methods mean.

@alamb
Copy link
Contributor Author

alamb commented Sep 19, 2023

Maybe we can solve this with some additional documentation

Or perhaps we can add a function like value_is_null() that has the more intuitive definition

I can't imagine we could do better performance-wise than return true for NullArray::is_null() but I don't fully understand the broader implications across kernels

@tustvold
Copy link
Contributor

tustvold commented Sep 19, 2023

Maybe we can solve this with some additional documentation

I'm open to suggestions, the current docs are pretty clear imo -
https://docs.rs/arrow-array/latest/arrow_array/array/trait.Array.html#method.is_null

https://docs.rs/arrow-array/latest/arrow_array/array/trait.Array.html#method.logical_nulls

I can't imagine we could do better performance-wise than return true for NullArray::is_null()

But what should be do for the other arrays like dictionaries and run arrays, we need to be consistent as to what these methods mean. We can't return logical nullability for some arrays and not others

@alamb
Copy link
Contributor Author

alamb commented Sep 19, 2023

PR is #4838

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants