
Improve performance reading ByteViewArray from parquet by removing an implicit copy #6031

Merged
merged 2 commits into apache:master from XiangpengHao:view-buffer
Jul 10, 2024

Conversation

XiangpengHao
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

I made a mistake in a previous PR, which wrote:

let block_id = output.append_block(self.buf.clone().into());

I thought into() would convert the Bytes into arrow_buffer::Buffer without copying the data.

But self.buf is bytes::Bytes, not arrow_buffer::Bytes (confusingly, they share the same name). As a consequence, the code above goes through this conversion: https://github.com/XiangpengHao/arrow-rs/blob/view-buffer/arrow-buffer/src/buffer/immutable.rs#L361-L370, which implicitly copies the data.

(I think we should do something to prevent future similar mistakes, but I don't know how).

What changes are included in this PR?

This PR explicitly converts bytes::Bytes into arrow_buffer::Bytes and then into arrow_buffer::Buffer; I have manually verified that neither conversion copies data.
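To see why the original line copied while the fix does not, here is a toy sketch. The SharedBytes and Buffer types below are hypothetical stand-ins for illustration only, not the real bytes::Bytes or arrow_buffer APIs: a constructor that wraps the shared allocation is free, while a slice-based conversion must allocate and memcpy.

```rust
use std::sync::Arc;

// Toy stand-ins for illustration only; not the real `bytes::Bytes`
// or `arrow_buffer::Buffer` types.
#[derive(Clone)]
pub struct SharedBytes(pub Arc<Vec<u8>>);

pub struct Buffer(pub Arc<Vec<u8>>);

impl Buffer {
    /// Zero-copy: the new buffer shares the existing allocation.
    pub fn from_shared(b: SharedBytes) -> Self {
        Buffer(b.0)
    }

    /// Copying: allocates fresh memory, analogous to the generic
    /// `From<T: AsRef<[u8]>>` impl the original code accidentally hit.
    pub fn from_slice(data: &[u8]) -> Self {
        Buffer(Arc::new(data.to_vec()))
    }
}
```

With these toys, Arc::ptr_eq shows that from_shared reuses the allocation while from_slice produces a distinct one, which is exactly the difference between the old and new code paths.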

This further improves the benchmark performance:

cargo bench --bench arrow_reader --features="arrow test_common experimental" "arrow_array_reader/BinaryViewArray/"
arrow_array_reader/BinaryViewArray/plain encoded, mandatory, no NULLs
                        time:   [129.51 µs 129.57 µs 129.63 µs]
                        change: [-35.172% -35.134% -35.094%] (p = 0.00 < 0.05)
                        Performance has improved.

arrow_array_reader/BinaryViewArray/plain encoded, optional, no NULLs
                        time:   [131.07 µs 131.11 µs 131.15 µs]
                        change: [-59.798% -59.758% -59.723%] (p = 0.00 < 0.05)
                        Performance has improved.

arrow_array_reader/BinaryViewArray/plain encoded, optional, half NULLs
                        time:   [125.08 µs 125.11 µs 125.15 µs]
                        change: [-32.699% -32.660% -32.621%] (p = 0.00 < 0.05)
                        Performance has improved.

Bonus (not related to this PR)

You may wonder how it is possible to be faster than StringArray if the current implementation still copies data. TL;DR: because of memcpy inlining.

In our benchmark, every string is larger than 12 bytes, which means that when making the views we always fall into this branch. What is special about this branch is that it only reads the first 4 bytes of the string, and LLVM is smart enough to inline the load (i.e., avoid calling copy_nonoverlapping).
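Concretely, the long-string branch can be sketched like this (a self-contained illustration of the packed view layout described above, not the crate's actual helper): length, 4-byte prefix, buffer index, and offset are packed into a u128, so only a fixed-size 4-byte load of the string is ever needed.

```rust
/// Sketch of the view layout for strings longer than 12 bytes:
/// a little-endian u128 packing length | prefix | buffer_index | offset.
pub fn make_long_view(data: &[u8], block_id: u32, offset: u32) -> u128 {
    assert!(data.len() > 12, "strings of <= 12 bytes are inlined instead");
    let len = data.len() as u32;
    // Only the first 4 bytes are loaded: a fixed-size read that LLVM
    // can inline instead of emitting a call to memcpy.
    let prefix = u32::from_le_bytes(data[0..4].try_into().unwrap());
    (len as u128)
        | ((prefix as u128) << 32)
        | ((block_id as u128) << 64)
        | ((offset as u128) << 96)
}
```

Because the string length never feeds a variable-length copy here, the compiler emits a plain 4-byte load, which is why this branch is cheap regardless of how long the string is.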

Is that a big deal?

Unfortunately, yes. If you change the benchmark strings to something very short, e.g., "test", you should see (using the same benchmark script above) the performance of loading ViewArray drop significantly, which doesn't make sense: how could building shorter strings be much slower than building longer ones?

How to fix the short string regression?

The root cause is that when a string is short, we need to memcpy its bytes into the view. Since the compiler has no idea how long the string is, it cannot inline the load and has to call memcpy, which is slow.
To convince you further, here's a special variant of make_view: replace that function with the new implementation below and performance returns to normal. The new implementation stamps out 13 copies of make_view_inner, each specialized for a length the compiler knows at compile time (LEN is const), so the compiler can optimize each instantiation as needed. Looking at the assembly, there is indeed no call to memcpy.

fn make_view_inner<const LEN: usize>(data: &[u8]) -> u128 {
    let mut view_buffer = [0; 16];
    view_buffer[0..4].copy_from_slice(&(LEN as u32).to_le_bytes());
    view_buffer[4..4 + LEN].copy_from_slice(&data[..LEN]);
    u128::from_le_bytes(view_buffer)
}

/// Create a view based on the given data, block id and offset
#[inline(always)]
pub fn make_view(data: &[u8], block_id: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    match len {
        0 => make_view_inner::<0>(data),
        1 => make_view_inner::<1>(data),
        2 => make_view_inner::<2>(data),
        3 => make_view_inner::<3>(data),
        4 => make_view_inner::<4>(data),
        5 => make_view_inner::<5>(data),
        6 => make_view_inner::<6>(data),
        7 => make_view_inner::<7>(data),
        8 => make_view_inner::<8>(data),
        9 => make_view_inner::<9>(data),
        10 => make_view_inner::<10>(data),
        11 => make_view_inner::<11>(data),
        12 => make_view_inner::<12>(data),
        _ => {
            let view = ByteView {
                length: len,
                prefix: u32::from_le_bytes(data[0..4].try_into().unwrap()),
                buffer_index: block_id,
                offset,
            };
            view.into()
        }
    }
}

(Special thanks to @aoli-al for triangulating the root cause and prototyping the fix.)

What should we do?
I don't know; I'm not sure we want to merge that special make_view, as it is very unintuitive. I'll keep a local copy of make_view and think about it in the background; maybe I'll have a better idea in a month.

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 9, 2024
@alamb
Contributor

alamb commented Jul 9, 2024

(I think we should do something to prevent future similar mistakes, but I don't know how).

One thing we could do is remove the explicit From impls for Buffer that copy data.

Specifically, remove https://github.com/XiangpengHao/arrow-rs/blob/view-buffer/arrow-buffer/src/buffer/immutable.rs#L361-L370 and make it a function like Buffer::from_slice, with an explanation that it copies data.
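That suggestion could look roughly like this (a hypothetical sketch, not the actual arrow-buffer API): replace the silent From conversion with a named constructor whose doc comment makes the copy explicit at every call site.

```rust
use std::sync::Arc;

// Hypothetical toy Buffer for illustration; not the real arrow_buffer::Buffer.
pub struct Buffer(Arc<Vec<u8>>);

impl Buffer {
    /// Creates a `Buffer` by COPYING `data` into a fresh allocation.
    /// Unlike a blanket `From<T: AsRef<[u8]>>` impl, the name makes the
    /// copy visible at the call site.
    pub fn from_slice(data: impl AsRef<[u8]>) -> Self {
        Buffer(Arc::new(data.as_ref().to_vec()))
    }

    pub fn as_slice(&self) -> &[u8] {
        &self.0
    }
}
```

The point of the design is that a caller can no longer write an innocent-looking .into() and copy megabytes by accident; the copy has a name.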

Contributor

@alamb alamb left a comment


Thanks @XiangpengHao -- this makes a lot of sense to me

I think adding some more comments would help future readers, but I also don't think it is required for merge

@@ -316,7 +315,8 @@ impl ByteViewArrayDecoderPlain {
}

pub fn read(&mut self, output: &mut ViewBuffer, len: usize) -> Result<usize> {
let block_id = output.append_block(self.buf.clone().into());
let buf = arrow_buffer::Buffer::from_bytes(self.buf.clone().into());
Contributor


I think we should at least add a comment explaining the rationale for this non-obvious code.

Maybe it would make sense to pull it into its own function (which could be commented and more easily discovered).

Contributor Author


I'm thinking about creating a from_arrow_bytes(...) and a from_bytes(...), and then removing the impl<T: AsRef<[u8]>> From<T> for Buffer, as it is too easy to misuse.

Contributor


I recommend we split the code into multiple PRs -- this one to improve performance of the parquet reader and one to make it harder to misuse the API (which I suspect will be a breaking API change)

Contributor Author


Indeed, the breakage is much larger than I thought; I will continue on this tomorrow.

Contributor


Filed #6033 to track the idea

@@ -71,7 +71,6 @@ struct ByteViewArrayReader {
}

impl ByteViewArrayReader {
#[allow(unused)]
Contributor


👍

@alamb
Contributor

alamb commented Jul 9, 2024

What should we do?

I don't know, I'm not sure if we want to merge that special make_view, as it is very unintuitive. I'll keep a local copy of this make_view and think about it in the background, maybe I'll have a better idea in a month.

I think we can justify non-intuitive code with sufficient performance gains (justified by benchmarks) and sufficient comments to explain it.

Special thanks to @aoli-al for triangulating the root cause and prototype the fix

Thank you @aoli-al 🙏

@alamb
Contributor

alamb commented Jul 9, 2024

Here is what I think we should do for next steps:

  1. Consider adding some more comments, or extracting a function, in the code in this PR to make it clearer what is going on
  2. File follow-on tickets for 1) the non-obvious copy in Buffer and 2) improving the performance of small views (your analysis is excellent)
  3. Merge this PR

Contributor

@alamb alamb left a comment


❤️

@alamb alamb changed the title Fix implicit copy when building byte view array Improve performance reading ByteViewArray from parquet by removing an implicit copy Jul 10, 2024
@alamb
Contributor

alamb commented Jul 10, 2024

What should we do?

I don't know, I'm not sure if we want to merge that special make_view, as it is very unintuitive. I'll keep a local copy of this make_view and think about it in the background, maybe I'll have a better idea in a month.

I think we can justify non-intuitive code with sufficient performance gains (justified by benchmarks) and sufficient comments to explain it.

Filed #6034 to track this issue

@alamb alamb merged commit cb3babc into apache:master Jul 10, 2024
17 checks passed
@XiangpengHao XiangpengHao deleted the view-buffer branch July 22, 2024 20:31
Labels
parquet Changes to the parquet crate