Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sliced null buffers lead to incorrect result in take kernel (and probably on other places) #502

Closed
ritchie46 opened this issue Jun 26, 2021 · 4 comments · Fixed by #509
Closed
Labels

Comments

@ritchie46
Copy link
Contributor

Describe the bug
Sliced arrays lead to incorrect null buffers. I have found this in code where I copied the null buffer of one array to that of a new one (without offset).

I could pinpoint this bug in the take kernel, but I think it may occur in other parts as well.

In the code snippet below, you see the output of the array having two zero values where that should be null. As these are the values that should not have their validity bits set, it could probably be any other value at runtime.

I could also reproduce this with other StringArray

To Reproduce

use arrow::array::{UInt32Array, Array, LargeStringArray, PrimitiveArray, Int32Array};
use arrow::compute::take;
use arrow::datatypes::UInt32Type;

fn main() {
    let idx0 = UInt32Array::from(vec![Some(0), Some(1), Some(2), Some(3), Some(4), Some(5), Some(6), None, None]);
    let sliced_idx_arr = idx0.slice(5, 4);
    let sliced_idx = sliced_idx_arr.as_any().downcast_ref::<UInt32Array>().unwrap();

    let idx1 = UInt32Array::from(vec![Some(5), Some(6), None, None]);

    println!("indices:\n{:?}\n{:?}\n\n\n", &idx1, &sliced_idx);
    // this is true
    assert_eq!(&sliced_idx.into_iter().collect::<Vec<_>>(), &idx1.into_iter().collect::<Vec<_>>());


    let array = Int32Array::from(vec![0, 1, 2, 3, 4, 5, 6, 7, 8]);
    let taken1_arr= take(&array, &idx1, None).unwrap();
    let sliced_taken_arr = take(&array, &sliced_idx, None).unwrap();

    let taken1 = taken1_arr.as_any().downcast_ref::<Int32Array>().unwrap();
    let sliced_taken = sliced_taken_arr.as_any().downcast_ref::<Int32Array>().unwrap();

    let vec1: Vec<_> = taken1.into_iter().collect();
    let sliced_vec: Vec<_> = sliced_taken.into_iter().collect();

    println!("taken arrays:\n{:?}\n{:?}", &taken1, &sliced_taken);

    assert_eq!(vec1, sliced_vec)
}

Output

indices:
PrimitiveArray<UInt32>
[
  5,
  6,
  null,
  null,
]
PrimitiveArray<UInt32>
[
  5,
  6,
  null,
  null,
]



taken arrays:
PrimitiveArray<Int32>
[
  5,
  6,
  null,
  null,
]
PrimitiveArray<Int32>
[
  5,
  6,
  0,
  0,
]
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `[Some(5), Some(6), None, None]`,
 right: `[Some(5), Some(6), Some(0), Some(0)]`', src/main.rs:29:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Expected behavior
I'd expect the sliced array to have the same result as the array that is not sliced.

Additional context
pola-rs/polars#878

@ritchie46 ritchie46 added the bug label Jun 26, 2021
@jhorstmann
Copy link
Contributor

jhorstmann commented Jun 27, 2021

I think the problem is this clone of the null buffer in take_indices_nulls where it should instead take a slice:

The other combinations of nullable/nonnull indices/values look ok which probably explains why this wasn't found earlier.

@ritchie46
Copy link
Contributor Author

I think the problem is this [clone of the null buffer in take_indices_nulls]]1 where it should instead take a slice:

Good catch, I think you're right.

@ritchie46
Copy link
Contributor Author

@jhorstmann Should there be done some arithmetic to compute the offset of the slice. Eg. bytes to bits. And what if the offset of the indices is not a multiple of 8? Should we then copy the right bits to a new buffer?

@jhorstmann
Copy link
Contributor

@ritchie46 there is a bit_slice method in Buffer for this usecase, if the offset is a multiple of 8 it returns a shallow slice, otherwise a new buffer with the shifted bits is constructed. And I agree this api is really easy to miss or error prone to use and we need to somehow come up with an api that doesn't require us to constantly think about bit offsets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants