Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ByteView::try_new #5735

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
16 changes: 7 additions & 9 deletions arrow-array/src/array/byte_view_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -237,15 +237,13 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
/// Caller is responsible for ensuring that the index is within the bounds of the array
pub unsafe fn value_unchecked(&self, idx: usize) -> &T::Native {
let v = self.views.get_unchecked(idx);
let len = *v as u32;
let b = if len <= 12 {
let ptr = self.views.as_ptr() as *const u8;
std::slice::from_raw_parts(ptr.add(idx * 16 + 4), len as usize)
} else {
let view = ByteView::from(*v);
let data = self.buffers.get_unchecked(view.buffer_index as usize);
let offset = view.offset as usize;
data.get_unchecked(offset..offset + len as usize)
let b = match ByteView::try_new(v) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can see here how try_new encapsulates the logic for interpreting the u128

Ok(view) => {
let data = self.buffers.get_unchecked(view.buffer_index as usize);
let offset = view.offset as usize;
data.get_unchecked(offset..offset + view.length as usize)
}
Err(b) => b,
};
T::Native::from_bytes_unchecked(b)
}
Expand Down
14 changes: 13 additions & 1 deletion arrow-data/src/byte_view.rs
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@
// specific language governing permissions and limitations
// under the License.

use arrow_buffer::Buffer;
use arrow_buffer::{Buffer, ToByteSlice};
use arrow_schema::ArrowError;

#[derive(Debug, Copy, Clone, Default)]
Expand All @@ -32,6 +32,18 @@ pub struct ByteView {
}

impl ByteView {
/// Try to create a [`ByteView`] from the provided `v`
///
/// If `v` instead contains the binary data inline, returns an `Err` containing it
#[inline]
pub fn try_new(v: &u128) -> Result<Self, &[u8]> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea would then be that we would remove the From<u128> implementation, which would be a breaking change but the next release is going to be breaking anyway (and I suspect few people are relying on this API yet)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what most confuses me about ByteView as a struct is that it doesn't represent in Rust anywhere the different layouts of the u128s in ByteViewArrays

For example, if you look at the rust ByteView struct without consulting the arrow spec, you may come to the conclusion that the u128s in a ByteViewArray have this format, which is not the case and thus you need to

  1. Know to check "is length less than 12" and if so handle things specially (this API helps here by encapsulating that check for certain cases)
  2. Know how to construct a view from bytes (aka how much of the prefix to copy and where.

This API seems to improve things (though I think if we went this way we should expand its docstring to explain the difference between the two types of byte views)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

represent in Rust anywhere the different layouts of the u128s in ByteViewArrays

Correct, it represents the non-inlined case where you have a view, and not just a short inlined byte array.

This is consistent with the terminology used in the docs - https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout

And with the terminology for ListView and LargeListView, where a view represents a view into a separate buffer of data

this API helps here by encapsulating that check for certain cases

What cases does it not encapsulate?

Copy link
Contributor

@alamb alamb May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, it represents the non-inlined case where you have a view, and not just a short inlined byte array.

I see -- in my mind the combination of (length, inlined data) is also a "view" but I can see how you have a different interpretation (perhaps if you view the types as (length, inline) or (length, view) 🤔 )

The diference I am thinking about is two layouts shown here (described as "view structures")

Screenshot 2024-05-08 at 9 09 08 AM

However, there is a single ByteView rust struct (that corresponds to "Long strings")

What cases does it not encapsulate?

One case is creating the u128 initially (e.g. if we should copy 4 bytes or up to 12)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One case is creating the u128 initially (e.g. if we should copy 4 bytes or up to 12)

Right, but does this need to be its own type, or could it be a free function? I don't know the answer to this, yet, but I would always take no abstraction over a bad abstraction

let len = *v as u32;
match len <= 12 {
true => Err(&v.to_byte_slice()[4..4 + len as usize]),
false => Ok(Self::from(*v)),
}
}

#[inline(always)]
pub fn as_u128(self) -> u128 {
(self.length as u128)
Expand Down