Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why do we check for null in TypedDictionaryArray value function #2564

Closed
psvri opened this issue Aug 23, 2022 · 11 comments · Fixed by #2589
Closed

Why do we check for null in TypedDictionaryArray value function #2564

psvri opened this issue Aug 23, 2022 · 11 comments · Fixed by #2589
Labels
arrow Changes to the arrow crate question Further information is requested

Comments

@psvri
Copy link
Contributor

psvri commented Aug 23, 2022

Which part is this question about
Code base: https://github.com/apache/arrow-rs/blob/master/arrow/src/array/array_dictionary.rs#L498.

Describe your question
In TypedDictionaryArray value function why do we check for null , we could instead assert if the index is less than keys.len() right?

Additional context
the trait documentation states the below, due to which I felt it was bizzare that we are checking for null.

/// A generic trait for accessing the values of an [`Array`]
pub trait ArrayAccessor: Array {
    type Item: Send + Sync;

    /// Returns the element at index `i`
    /// # Panics
    /// Panics if the value is outside the bounds of the array
    fn value(&self, index: usize) -> Self::Item;

Proposal
We can instead rewrite it as below

fn value(&self, index: usize) -> Self::Item {
        assert!(
            index < self.dictionary.keys().len(),
            "TypedDictionaryArray out of bounds access {}",
            index
        );

        // Dictionary indexes should be valid
        unsafe { self.values.value_unchecked(index) }
    }
@psvri psvri added the question Further information is requested label Aug 23, 2022
@HaoYang670
Copy link
Contributor

I guess you are right @psvri. We always don't check the validation in the value() method, so the checking could be removed here.
cc @tustvold

@tustvold
Copy link
Contributor

tustvold commented Aug 24, 2022

Dictionaries are a special case, as the value of a null index is undefined, we can't then use it as an index in the values array. We therefore need this additional check. The bounds check is preformed by calling value on the keys array.

@HaoYang670
Copy link
Contributor

Dictionaries are a special case, as the value of a null index is undefined, we can't then use it as an index in the values array.

Hmm🤔, Why is Dictionary a special case? The value under a null mask is always undefined for all kinds of arrays.

the processes are somewhat different:
For Dictionary, you will get an undefined key if is_valid == false, and then, you will use this undefined key to get an undefined value.
For other arrays (such as int8), you will get an undefined value if is_valid == false.

However, the results are same: You get an undefined value.

@tustvold
Copy link
Contributor

tustvold commented Aug 24, 2022

There's a subtlety here, you get an unspecified value not an undefined value. It is not UB to read a null value from a ListArray, StringArray, etc... It isn't defined what you will get, but you won't get uninitialised memory or something that would trigger undefined behaviour.

TypedDictionaryArray is a special case because it is actually doing two array lookups. First it does a lookup into the indices array, if this index is null it gets back something unspecified. This is fine, however, this arbitrary value may not be in the bounds of the values, in which case we would access memory out of bounds.

We could also validate the bounds for the value access, instead of checking the NULL mask, however, the NULL assertion I thought more clearly indicated the why.

@HaoYang670
Copy link
Contributor

you get an unspecified value not an undefined value.

Make sense, Thank you @tustvold .

@psvri
Copy link
Contributor Author

psvri commented Aug 24, 2022

Thank you @tustvold for the explanation. I learnt a lot from it.

@tustvold @viirya

Just one last thing, based on your explanation, dosent this mean that we have actually implemented the function cmp_dict_primitive (https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/comparison.rs#L2263 ) wrongly and subsequently all the other kernels which use compare_op with a dictionary as one of its parameters.

As you can see in that function we are using compare_op to compare the elements of dictionary with a another array. And in compares op implementation (https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/comparison.rs#L65) we are using value_unchecked to get the elements of the dictionary without checking for null. So based on the above explanation we shouldnt be doing that right since the unspecified value can lead to out of bounds access ?

@tustvold
Copy link
Contributor

So based on the above explanation we shouldn't be doing that right since the unspecified value can lead to out of bounds access ?

Aah yeah, that's a bug. Fortunately it hasn't been released yet, it was changed in #2533. We should fix it before the next release.

I do wonder if there is a way to avoid needing to check the null index for dictionary arrays, perhaps we could enforce nulls to be 0, and then special case an empty dictionary or something... 🤔

@viirya
Copy link
Member

viirya commented Aug 24, 2022

Thanks @psvri for catching it. As @tustvold said, it is a bug. value_idx may be out of boundary on values. Although in practice, I think that we won't put a out of boundary index value in a key element with null mask.

@tustvold
Copy link
Contributor

Although in practice, I think that we won't put a out of boundary index value in a key element with null mask.

The only somewhat reasonable case would be an empty dictionary, where 0 would be out of bounds

@tustvold
Copy link
Contributor

tustvold commented Aug 25, 2022

#2564 hopefully fixes this confusion, and fixes the OOB behaviour. PTAL

@viirya
Copy link
Member

viirya commented Aug 25, 2022

@tustvold I think you meant #2589.

@alamb alamb added the arrow Changes to the arrow crate label Sep 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants