Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix take_bytes Null and Overflow Handling (#4576) #4579

Merged
merged 4 commits into from
Jul 29, 2023

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #4576

Rationale for this change

This fixes the overflow and null checking behaviour of take_bytes. This does regress the performance slightly, but I think this is acceptable given the prior logic was incorrect.

take str 512            time:   [3.8135 µs 3.8144 µs 3.8155 µs]
                        change: [+8.1351% +8.5262% +8.8652%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  6 (6.00%) high mild
  2 (2.00%) high severe

take str 1024           time:   [6.6656 µs 6.6665 µs 6.6675 µs]
                        change: [+10.429% +10.776% +10.972%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

take str null indices 512
                        time:   [3.2013 µs 3.2048 µs 3.2081 µs]
                        change: [+6.4883% +6.7476% +7.0323%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

take str null indices 1024
                        time:   [6.3864 µs 6.4001 µs 6.4149 µs]
                        change: [+8.6045% +9.0105% +9.4405%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

take str null values 1024
                        time:   [6.9174 µs 6.9204 µs 6.9236 µs]
                        change: [-3.4004% -3.0300% -2.7115%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

take str null values null indices 1024
                        time:   [5.5812 µs 5.5842 µs 5.5873 µs]
                        change: [+2.1866% +2.5807% +2.9030%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 28, 2023
if array.is_valid(index) {
let s: &[u8] = array.value(index).as_ref();

length_so_far += T::Offset::from_usize(s.len()).unwrap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wasn't performing checked arithmetic and so could silently overflow

if array.is_valid(index) {
let s: &[u8] = array.value(index).as_ref();

length_so_far += T::Offset::from_usize(s.len()).unwrap();
values.extend_from_slice(s.as_ref());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We piggy back on the fact this already does checked arithmetic

@tustvold tustvold changed the title Cleanup take_bytes (#4576) Fix take_bytes Null and Overflow Handling (#4576) Jul 28, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tustvold -- I went through this PR carefully and it looks good to me logic wise. I defer to you and others for performance.

Maybe this is something @jhorstmann or @viirya are interested in reviewing as well

I also verified that without the code changes in this PR the test fails like

---- take::tests::test_take_bytes_null_indices stdout ----
thread 'take::tests::test_take_bytes_null_indices' panicked at 'assertion failed: idx < self.len', /Users/alamb/Software/arrow-rs2/arrow-buffer/src/buffer/boolean.rs:133:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:67:14
   2: core::panicking::panic
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:117:5
   3: arrow_buffer::buffer::boolean::BooleanBuffer::value
             at /Users/alamb/Software/arrow-rs2/arrow-buffer/src/buffer/boolean.rs:133:9
   4: arrow_buffer::buffer::null::NullBuffer::is_valid
             at /Users/alamb/Software/arrow-rs2/arrow-buffer/src/buffer/null.rs:145:9
   5: arrow_buffer::buffer::null::NullBuffer::is_null
             at /Users/alamb/Software/arrow-rs2/arrow-buffer/src/buffer/null.rs:151:10
   6: arrow_array::array::Array::is_null::{{closure}}
             at /Users/alamb/Software/arrow-rs2/arrow-array/src/array/mod.rs:193:30
   7: core::option::Option<T>::map
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/option.rs:1075:29
   8: arrow_array::array::Array::is_null
             at /Users/alamb/Software/arrow-rs2/arrow-array/src/array/mod.rs:193:9
   9: arrow_array::array::Array::is_valid
             at /Users/alamb/Software/arrow-rs2/arrow-array/src/array/mod.rs:210:10
  10: arrow_select::take::take_bytes
             at ./src/take.rs:404:16
  11: arrow_select::take::take_impl
             at ./src/take.rs:126:25
  12: arrow_select::take::take
             at ./src/take.rs:81:5
  13: arrow_select::take::tests::test_take_bytes_null_indices
             at ./src/take.rs:1960:17
  14: arrow_select::take::tests::test_take_bytes_null_indices::{{closure}}
             at ./src/take.rs:1954:39
  15: core::ops::function::FnOnce::call_once
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/ops/function.rs:250:5
  16: core::ops::function::FnOnce::call_once
             at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


#[test]
fn test_take_bytes_null_indices() {
let indices = Int32Array::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let indices = Int32Array::new(
// Build indices with values that out of bounds, but NULL so they
// will not be checked
let indices = Int32Array::new(

if array.is_valid(index) && indices.is_valid(i) {
offsets.extend(indices.values().iter().enumerate().map(|(i, index)| {
let index = index.as_usize();
if indices.is_valid(i) && array.is_valid(index) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the actual fix for the bug, right? that indices.is_valid is checked prior to array.is_valid?

Suggested change
if indices.is_valid(i) && array.is_valid(index) {
// check index is valid before using index. The value in
// NULL index slots may not be within bounds of array
if indices.is_valid(i) && array.is_valid(index) {

nulls = Some(null_buf.into());
} else if array.null_count() == 0 {
for (i, offset) in offsets.iter_mut().skip(1).enumerate() {
offsets.extend(indices.values().iter().enumerate().map(|(i, index)| {
if indices.is_valid(i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if indices.is_valid(i) {
// check index is valid before using index. The value in
// NULL index slots may not be within bounds of array
if indices.is_valid(i) {

@tustvold tustvold merged commit 18385e5 into apache:master Jul 29, 2023
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panic assertion failed: idx < self.len when casting DictionaryArrays with nulls
2 participants