Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

respect offset in utf8 and list casts #335

Merged
merged 1 commit into from May 24, 2021

Conversation

ritchie46
Copy link
Contributor

Which issue does this PR close?

This is a fix for #334.

Offsets were not taking into account in casts:

    list <-> large-list
    utf8 <-> large-utf8

@codecov-commenter
Copy link

Codecov Report

Merging #335 (144a7bd) into master (e18b356) will increase coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head 144a7bd differs from pull request most recent head d0e540f. Consider uploading reports for the commit d0e540f to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #335      +/-   ##
==========================================
+ Coverage   82.52%   82.53%   +0.01%     
==========================================
  Files         162      162              
  Lines       44021    44037      +16     
==========================================
+ Hits        36328    36348      +20     
+ Misses       7693     7689       -4     
Impacted Files Coverage Δ
arrow/src/compute/kernels/cast.rs 94.53% <100.00%> (+0.04%) ⬆️
arrow/src/array/equal/list.rs 89.55% <0.00%> (+5.97%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e18b356...d0e540f. Read the comment docs.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't say I am super familiar with this code but the tests look good to me and so did the code I reviewed

@alamb alamb added arrow Changes to the arrow crate bug labels May 21, 2021
@jorgecarleitao
Copy link
Member

Thanks a lot @ritchie46 . I can't fully evaluate this because I am uncertain about the spec here.

I am a bit uncertain about the spec here: do we require the offset buffer to start at 0, or is the requirement that the last value minus the first value must be equal to the length of the values buffer? @nevi-me , do you know what is the spec here?

I am asking because IPC does not support the ArrayData::offset. Do we need to "de-offset" sliced arrays when passing tor IPC, or what is the deal here? (last resort, I will bring this to the mailing list)

@ritchie46
Copy link
Contributor Author

I am a bit uncertain about the spec here: do we require the offset buffer to start at 0, or is the requirement that the last value minus the first value must be equal to the length of the values buffer? @nevi-me , do you know what is the spec here?

Yes, I simply respected the API at this point. We throw an error if offsets do not start at zero, but we may be stricter than the spec is.

@nevi-me
Copy link
Contributor

nevi-me commented May 22, 2021

I am a bit uncertain about the spec here: do we require the offset buffer to start at 0, or is the requirement that the last value minus the first value must be equal to the length of the values buffer? @nevi-me , do you know what is the spec here?

This might be a bit tricky. AFAIK the spec doesn't prescribe what happens when a compute kernel interacts with sliced data.

Generally the first value in the offsets array is 0, and the last slot is the length of the values array. When serializing this layout, we recommend normalizing the offsets to start at 0. [0]

I would interpret "generally" as, 'most implementations will expect to start at 0'. In which case, I would prefer a solution that carries the offset of the array, and starts the offset buffer at 0.

I think carrying the offset of the input into the output in this case, is the most performant and compatible solution. Otherwise we'd have to racalculate the offsets to make sure that they start from 0.

[0] https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM based on my other comment

@alamb alamb merged commit 5ac771a into apache:master May 24, 2021
alamb pushed a commit that referenced this pull request May 25, 2021
alamb added a commit that referenced this pull request May 26, 2021
Co-authored-by: Ritchie Vink <ritchie46@gmail.com>
@alamb alamb added the cherry-picked PR that was backported to active release (will be included in maintenance release) label Jun 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug cherry-picked PR that was backported to active release (will be included in maintenance release)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants