
Improve performance reading ByteViewArray from parquet by removing an implicit copy #6031

Merged
merged 2 commits into apache:master from XiangpengHao:view-buffer
Jul 10, 2024

Conversation

XiangpengHao
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

I made a mistake in a previous PR, which wrote:

let block_id = output.append_block(self.buf.clone().into());

I thought into() would convert the Bytes into arrow_buffer::Buffer without copying the data.

But self.buf is bytes::Bytes, not arrow_buffer::Bytes (confusingly, they share the same name). As a consequence, the code above goes through this conversion: https://github.com/XiangpengHao/arrow-rs/blob/view-buffer/arrow-buffer/src/buffer/immutable.rs#L361-L370, which implicitly copies the data.

(I think we should do something to prevent future similar mistakes, but I don't know how).

What changes are included in this PR?

This PR explicitly converts bytes::Bytes into arrow_buffer::Bytes and then into arrow_buffer::Buffer; I have manually verified that neither conversion copies data.
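To see why the original line copied while the fix does not, here is a toy sketch. The SharedBytes and Buffer types below are hypothetical stand-ins for illustration only, not the real bytes::Bytes or arrow_buffer APIs: a constructor that wraps the shared allocation is free, while a slice-based conversion must allocate and memcpy.

```rust
use std::sync::Arc;

// Toy stand-ins for illustration only; not the real `bytes::Bytes`
// or `arrow_buffer::Buffer` types.
#[derive(Clone)]
pub struct SharedBytes(pub Arc<Vec<u8>>);

pub struct Buffer(pub Arc<Vec<u8>>);

impl Buffer {
    /// Zero-copy: the new buffer shares the existing allocation.
    pub fn from_shared(b: SharedBytes) -> Self {
        Buffer(b.0)
    }

    /// Copying: allocates fresh memory, analogous to the generic
    /// `From<T: AsRef<[u8]>>` impl the original code accidentally hit.
    pub fn from_slice(data: &[u8]) -> Self {
        Buffer(Arc::new(data.to_vec()))
    }
}
```

With these toys, Arc::ptr_eq shows that from_shared reuses the allocation while from_slice produces a distinct one, which is exactly the difference between the old and new code paths.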

This further improves the benchmark performance:

cargo bench --bench arrow_reader --features="arrow test_common experimental" "arrow_array_reader/BinaryViewArray/"
arrow_array_reader/BinaryViewArray/plain encoded, mandatory, no NULLs
                        time:   [129.51 µs 129.57 µs 129.63 µs]
                        change: [-35.172% -35.134% -35.094%] (p = 0.00 < 0.05)
                        Performance has improved.

arrow_array_reader/BinaryViewArray/plain encoded, optional, no NULLs
                        time:   [131.07 µs 131.11 µs 131.15 µs]
                        change: [-59.798% -59.758% -59.723%] (p = 0.00 < 0.05)
                        Performance has improved.

arrow_array_reader/BinaryViewArray/plain encoded, optional, half NULLs
                        time:   [125.08 µs 125.11 µs 125.15 µs]
                        change: [-32.699% -32.660% -32.621%] (p = 0.00 < 0.05)
                        Performance has improved.

Bonus (not related to this PR)

You may wonder how it is possible to be faster than StringArray if the current implementation still copies data. TL;DR: because of memcpy inlining.

In our benchmark, every string is larger than 12 bytes, which means that when making the views we always fall into this branch. What is special about this branch is that it only reads the first 4 bytes of the string, and LLVM is smart enough to inline the load (i.e., avoid calling copy_nonoverlapping).
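Concretely, the long-string branch can be sketched like this (a self-contained illustration of the packed view layout described above, not the crate's actual helper): length, 4-byte prefix, buffer index, and offset are packed into a u128, so only a fixed-size 4-byte load of the string is ever needed.

```rust
/// Sketch of the view layout for strings longer than 12 bytes:
/// a little-endian u128 packing length | prefix | buffer_index | offset.
pub fn make_long_view(data: &[u8], block_id: u32, offset: u32) -> u128 {
    assert!(data.len() > 12, "strings of <= 12 bytes are inlined instead");
    let len = data.len() as u32;
    // Only the first 4 bytes are loaded: a fixed-size read that LLVM
    // can inline instead of emitting a call to memcpy.
    let prefix = u32::from_le_bytes(data[0..4].try_into().unwrap());
    (len as u128)
        | ((prefix as u128) << 32)
        | ((block_id as u128) << 64)
        | ((offset as u128) << 96)
}
```

Because the string length never feeds a variable-length copy here, the compiler emits a plain 4-byte load, which is why this branch is cheap regardless of how long the string is.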

Is that a big deal?

Unfortunately, yes. If you change the benchmark strings to something very short, e.g., "test", you should see (using the same benchmark script above) the performance of loading ViewArray drop significantly, which doesn't make sense: how could building shorter strings be much slower than building longer ones?

How to fix the short string regression?

The root cause is that when a string is short, we need to memcpy its bytes into the view. Since the compiler has no idea how long the string is, it cannot inline the load and has to call memcpy, which is slow.
To convince you further, here's a special variant of make_view: replace that function with the new implementation below and performance returns to normal. The new implementation stamps out 13 copies of make_view_inner, each specialized for a length the compiler knows at compile time (LEN is const), so the compiler can optimize each instantiation as needed. Looking at the assembly, there is indeed no call to memcpy.

fn make_view_inner<const LEN: usize>(data: &[u8]) -> u128 {
    let mut view_buffer = [0; 16];
    view_buffer[0..4].copy_from_slice(&(LEN as u32).to_le_bytes());
    view_buffer[4..4 + LEN].copy_from_slice(&data[..LEN]);
    u128::from_le_bytes(view_buffer)
}

/// Create a view based on the given data, block id and offset
#[inline(always)]
pub fn make_view(data: &[u8], block_id: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    match len {
        0 => make_view_inner::<0>(data),
        1 => make_view_inner::<1>(data),
        2 => make_view_inner::<2>(data),
        3 => make_view_inner::<3>(data),
        4 => make_view_inner::<4>(data),
        5 => make_view_inner::<5>(data),
        6 => make_view_inner::<6>(data),
        7 => make_view_inner::<7>(data),
        8 => make_view_inner::<8>(data),
        9 => make_view_inner::<9>(data),
        10 => make_view_inner::<10>(data),
        11 => make_view_inner::<11>(data),
        12 => make_view_inner::<12>(data),
        _ => {
            let view = ByteView {
                length: len,
                prefix: u32::from_le_bytes(data[0..4].try_into().unwrap()),
                buffer_index: block_id,
                offset,
            };
            view.into()
        }
    }
}

(Special thanks to @aoli-al for triangulating the root cause and prototyping the fix.)

What should we do?
I don't know; I'm not sure we want to merge that special make_view, as it is very unintuitive. I'll keep a local copy of make_view and think about it in the background; maybe I'll have a better idea in a month.

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 9, 2024
@alamb
Contributor

alamb commented Jul 9, 2024

(I think we should do something to prevent future similar mistakes, but I don't know how).

One thing we could do is remove the explicit From impls for Buffer that copy data.

Specifically, remove https://github.com/XiangpengHao/arrow-rs/blob/view-buffer/arrow-buffer/src/buffer/immutable.rs#L361-L370 and make it a function like Buffer::from_slice, with an explanation that it copies data.
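That suggestion could look roughly like this (a hypothetical sketch, not the actual arrow-buffer API): replace the silent From conversion with a named constructor whose doc comment makes the copy explicit at every call site.

```rust
use std::sync::Arc;

// Hypothetical toy Buffer for illustration; not the real arrow_buffer::Buffer.
pub struct Buffer(Arc<Vec<u8>>);

impl Buffer {
    /// Creates a `Buffer` by COPYING `data` into a fresh allocation.
    /// Unlike a blanket `From<T: AsRef<[u8]>>` impl, the name makes the
    /// copy visible at the call site.
    pub fn from_slice(data: impl AsRef<[u8]>) -> Self {
        Buffer(Arc::new(data.as_ref().to_vec()))
    }

    pub fn as_slice(&self) -> &[u8] {
        &self.0
    }
}
```

The point of the design is that a caller can no longer write an innocent-looking .into() and copy megabytes by accident; the copy has a name.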

Contributor

@alamb alamb left a comment


Thanks @XiangpengHao -- this makes a lot of sense to me

I think adding some more comments would help future readers, but I also don't think it is required for merge

@@ -316,7 +315,8 @@ impl ByteViewArrayDecoderPlain {
}

pub fn read(&mut self, output: &mut ViewBuffer, len: usize) -> Result<usize> {
let block_id = output.append_block(self.buf.clone().into());
let buf = arrow_buffer::Buffer::from_bytes(self.buf.clone().into());
Contributor


I think we should at least add a comment explaining the rationale for this non-obvious code.

Maybe it would make sense to pull it into its own function (which could be commented and more easily discovered).

Contributor Author


I'm thinking about creating a from_arrow_bytes(...) and a from_bytes(...), and then removing the impl<T: AsRef<[u8]>> From<T> for Buffer, as it is too easy to misuse.

Contributor


I recommend we split the code into multiple PRs -- this one to improve performance of the parquet reader and one to make it harder to misuse the API (which I suspect will be a breaking API change)

Contributor Author


Indeed, the breakage is much larger than I thought; I will continue on this tomorrow.

Contributor


Filed #6033 to track the idea

@@ -71,7 +71,6 @@ struct ByteViewArrayReader {
}

impl ByteViewArrayReader {
#[allow(unused)]
Contributor


👍

@alamb
Contributor

alamb commented Jul 9, 2024

What should we do?

I don't know, I'm not sure if we want to merge that special make_view, as it is very unintuitive. I'll keep a local copy of this make_view and think about it in the background, maybe I'll have a better idea in a month.

I think we can justify non-intuitive code with sufficient performance gains (justified by benchmarks) and sufficient comments to explain it.

Special thanks to @aoli-al for triangulating the root cause and prototype the fix

Thank you @aoli-al 🙏

@alamb
Contributor

alamb commented Jul 9, 2024

Here is what I think we should do for next steps:

  1. Consider adding some more comments, or extracting a function, in the code in this PR to make it clearer what is going on
  2. File follow-on tickets for 1) the non-obvious copy in Buffer and 2) improving the performance of small views (your analysis is excellent)
  3. Merge this PR

Contributor

@alamb alamb left a comment


❤️

@alamb alamb changed the title Fix implicit copy when building byte view array Improve performance reading ByteViewArray from parquet by removing an implicit copy Jul 10, 2024
@alamb
Contributor

alamb commented Jul 10, 2024

What should we do?

I don't know, I'm not sure if we want to merge that special make_view, as it is very unintuitive. I'll keep a local copy of this make_view and think about it in the background, maybe I'll have a better idea in a month.

I think we can justify non-intuitive code with sufficient performance gains (justified by benchmarks) and sufficient comments to explain it.

Filed #6034 to track this issue

@alamb alamb merged commit cb3babc into apache:master Jul 10, 2024
17 checks passed
@XiangpengHao XiangpengHao deleted the view-buffer branch July 22, 2024 20:31
Labels
parquet Changes to the parquet crate