Add ByteView::try_new #5735

Closed

@tustvold tustvold commented May 8, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

Potentially simpler version of #5619

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 8, 2024
///
/// If `v` instead contains the binary data inline, returns an `Err` containing it
#[inline]
pub fn try_new(v: &u128) -> Result<Self, &[u8]> {
Contributor Author

The idea would then be to remove the From<u128> implementation. That would be a breaking change, but the next release is going to be breaking anyway (and I suspect few people are relying on this API yet).

Contributor

I think what most confuses me about ByteView as a struct is that it doesn't represent in Rust anywhere the different layouts of the u128s in ByteViewArrays

For example, if you look at the rust ByteView struct without consulting the arrow spec, you may come to the conclusion that the u128s in a ByteViewArray have this format, which is not the case and thus you need to

  1. Know to check "is the length at most 12 bytes" and if so handle things specially (this API helps here by encapsulating that check for certain cases)
  2. Know how to construct a view from bytes (i.e. how much of the prefix to copy and where).

This API seems to improve things (though I think if we went this way we should expand its docstring to explain the difference between the two types of byte views)
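
To make the second point concrete, here is a minimal sketch of building the 16-byte view from a byte slice, following the layout in the Arrow columnar format: values of at most 12 bytes are stored inline after the length, while longer values keep a 4-byte prefix followed by a buffer index and an offset. `make_view_sketch` and its parameters are hypothetical names for illustration, not part of arrow-rs.

```rust
/// Hypothetical helper (not the arrow-rs API): encode `data` as a 16-byte view.
/// `buffer_index` and `offset` are only used when `data` is longer than 12 bytes.
pub fn make_view_sketch(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    // Assumption: the caller guarantees the length fits in 32 bits.
    let len = data.len() as u32;
    let mut raw = [0u8; 16];
    // Bytes 0..4 always hold the length, little-endian.
    raw[0..4].copy_from_slice(&len.to_le_bytes());
    if data.len() <= 12 {
        // Short value: the whole payload lives inline, zero-padded to 12 bytes.
        raw[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Long value: copy only a 4-byte prefix, then record where the data lives.
        raw[4..8].copy_from_slice(&data[..4]);
        raw[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        raw[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(raw)
}
```

Whether this belongs on a type or stays a free function is exactly the open question in this thread; the sketch uses a free function only because it is the smallest form.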

Contributor Author

> represent in Rust anywhere the different layouts of the u128s in ByteViewArrays

Correct, it represents the non-inlined case where you have a view, and not just a short inlined byte array.

This is consistent with the terminology used in the docs - https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout

And with the terminology for ListView and LargeListView, where a view represents a view into a separate buffer of data

> this API helps here by encapsulating that check for certain cases

What cases does it not encapsulate?

Contributor

@alamb alamb May 8, 2024

> Correct, it represents the non-inlined case where you have a view, and not just a short inlined byte array.

I see -- in my mind the combination of (length, inlined data) is also a "view" but I can see how you have a different interpretation (perhaps if you view the types as (length, inline) or (length, view) 🤔 )

The difference I am thinking about is the two layouts shown here (described as "view structures")

[Screenshot of the two view structure layouts]

However, there is a single ByteView rust struct (that corresponds to "Long strings")

> What cases does it not encapsulate?

One case is creating the u128 initially (e.g. if we should copy 4 bytes or up to 12)

Contributor Author

> One case is creating the u128 initially (e.g. if we should copy 4 bytes or up to 12)

Right, but does this need to be its own type, or could it be a free function? I don't know the answer to this yet, but I would always take no abstraction over a bad abstraction.

alamb

This comment was marked as off-topic.

Contributor Author

tustvold commented May 8, 2024

> but I probably don't fully understand it

I updated one call site to show how it works. I believe it achieves the same end as the linked PR whilst being significantly simpler.

let data = self.buffers.get_unchecked(view.buffer_index as usize);
let offset = view.offset as usize;
data.get_unchecked(offset..offset + len as usize)
let b = match ByteView::try_new(v) {
Contributor Author

We can see here how try_new encapsulates the logic for interpreting the u128
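
As a rough illustration of what a `try_new`-style check has to do, the sketch below reads the length field and returns either the decoded long view or the inline bytes. It is not the code in this PR: `LongViewSketch` and `try_view_sketch` are hypothetical names, and the function takes the view as little-endian bytes (`&[u8; 16]`) rather than `&u128` so that it can hand back a borrowed slice in safe Rust.

```rust
/// Hypothetical stand-in for the non-inlined view; fields follow the Arrow layout.
#[derive(Debug, PartialEq)]
pub struct LongViewSketch {
    pub length: u32,
    pub prefix: u32,
    pub buffer_index: u32,
    pub offset: u32,
}

/// Returns `Ok` for a view that points into a data buffer, or `Err` with the
/// inline bytes when the value is at most 12 bytes long.
/// Assumption: `raw` is the view's 16 bytes in little-endian order.
pub fn try_view_sketch(raw: &[u8; 16]) -> Result<LongViewSketch, &[u8]> {
    // Read one little-endian u32 field starting at byte `i`.
    let field = |i: usize| u32::from_le_bytes([raw[i], raw[i + 1], raw[i + 2], raw[i + 3]]);
    let length = field(0);
    if length <= 12 {
        // Inline case: the payload is the `length` bytes right after the length field.
        return Err(&raw[4..4 + length as usize]);
    }
    Ok(LongViewSketch {
        length,
        prefix: field(4),
        buffer_index: field(8),
        offset: field(12),
    })
}
```

A caller can then `match` on the result: the `Err` arm already holds the value's bytes, and the `Ok` arm carries the buffer index and offset needed to slice the data buffer.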

Contributor

@alamb alamb left a comment

Thank you for this PR @tustvold

It seems to me the key difference between the existing ByteView approach (that this PR extends) and the approach in #5619 is an explicit Rust API for manipulating / accessing the inline variant of the u128 views.

I think this PR improves the usability of ByteView but I still think #5619 (or another approach that models the two types of views as separate Rust structs) is easier to understand
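
For comparison, one way to model the two kinds of views as distinct Rust values is an enum with one variant per layout. This is only an assumed shape for discussion, not the design in #5619; `ViewKindSketch`, its variants, and its methods are hypothetical.

```rust
/// Hypothetical model of the two view layouts (names are illustrative only).
#[derive(Debug, PartialEq)]
pub enum ViewKindSketch {
    /// Value of at most 12 bytes stored directly in the view.
    Inline { length: u8, data: [u8; 12] },
    /// Longer value: 4-byte prefix plus a pointer into a data buffer.
    Buffered { length: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

impl ViewKindSketch {
    /// Decode a raw view, choosing the variant from the length field.
    pub fn from_u128(v: u128) -> Self {
        let raw = v.to_le_bytes();
        let word = |i: usize| u32::from_le_bytes([raw[i], raw[i + 1], raw[i + 2], raw[i + 3]]);
        let length = word(0);
        if length <= 12 {
            let mut data = [0u8; 12];
            data.copy_from_slice(&raw[4..16]);
            ViewKindSketch::Inline { length: length as u8, data }
        } else {
            let mut prefix = [0u8; 4];
            prefix.copy_from_slice(&raw[4..8]);
            ViewKindSketch::Buffered {
                length,
                prefix,
                buffer_index: word(8),
                offset: word(12),
            }
        }
    }

    /// The stored bytes for an inline view, or `None` for a buffered one.
    pub fn inline_bytes(&self) -> Option<&[u8]> {
        match self {
            ViewKindSketch::Inline { length, data } => Some(&data[..*length as usize]),
            ViewKindSketch::Buffered { .. } => None,
        }
    }
}
```

The trade-off discussed in this thread still applies: the enum makes both layouts visible in the type system, at the cost of an extra abstraction over the raw u128.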


Contributor

alamb commented May 29, 2024

Superseded by #5796

@tustvold tustvold reopened this May 30, 2024
@tustvold tustvold closed this May 30, 2024