ARROW-15244: [Format] Clarify that offsets are monotonic for binary like arrays #12019

Closed
wants to merge 4 commits into from

@alamb alamb commented Dec 22, 2021

Rationale

The question of "what are the values of the offsets for non-valid entries in arrays" came up in arrow-rs: apache/arrow-rs#1071 and the existing docs seem to be somewhat vague on this issue.

I looked at three implementations of arrow, and they all seem to assume / validate the offsets are monotonic:

Changes

Thus I propose updating the format docs to make the monotonic offsets explicit.
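To make the proposed wording concrete, here is a minimal sketch in plain Python of what "monotonic offsets" means for a variable-size binary array. It is not the validation code of any Arrow implementation; the function name `offsets_are_monotonic` and the list-based buffers are illustrative assumptions.

```python
def offsets_are_monotonic(offsets, values_len):
    """Check that offsets never decrease and stay within the values buffer.

    Hypothetical helper for illustration; assumes at least one offset is present.
    """
    if not offsets:
        return False
    if offsets[0] < 0 or offsets[-1] > values_len:
        return False
    # Each slot i spans values[offsets[i]:offsets[i + 1]], so its length
    # offsets[i + 1] - offsets[i] must be non-negative for every slot,
    # including slots that the validity bitmap marks as null.
    return all(a <= b for a, b in zip(offsets, offsets[1:]))


# The logical array ["ab", null, "cde"]. The null slot here happens to be
# empty, but its offsets are still well defined and in bounds.
values = b"abcde"
offsets = [0, 2, 2, 5]
validity = [True, False, True]

print(offsets_are_monotonic(offsets, len(values)))       # monotonic layout
print(offsets_are_monotonic([0, 2, 1, 5], len(values)))  # offset decreases at the null slot
```

The second call shows the case under discussion: an offset that goes backwards at a null entry, which the clarified docs would rule out.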

Background

I think @jorgecarleitao's description in apache/arrow-rs#1071 (comment) explains why having monotonic offsets is a good idea:

I think that in general the property we seek is: discarding the validity cannot result in UB when accessing the values. This justifies the values buffer of a primitive array always being initialized, and the offsets being valid and in-bounds even in null cases.

The rationale for this is that sometimes it is faster to skip validity accesses and only iterate over the values (and clone the validity). I do not recall the benchmark result, but this may explain why string comparison ignores validity and ANDs the bitmaps instead.
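The pattern described in the quote can be sketched as follows, again in plain Python rather than any Arrow library's API. The names `slot` and `eq_ignoring_validity` and the `(values, offsets, validity)` tuples are hypothetical. In Python a bad offset would only produce a wrong slice; in an implementation that uses unchecked indexing, it is the out-of-bounds read that monotonic, in-bounds offsets rule out.

```python
def slot(values, offsets, i):
    # Safe for every i only because offsets are monotonic and in bounds,
    # even when slot i is null.
    return values[offsets[i]:offsets[i + 1]]


def eq_ignoring_validity(left, right):
    """Compare two binary arrays slot by slot without reading validity.

    left and right are (values, offsets, validity) tuples of equal length.
    """
    lv, lo, lvalid = left
    rv, ro, rvalid = right
    n = len(lo) - 1
    # Step 1: compare every slot, null or not, with no validity branch.
    result = [slot(lv, lo, i) == slot(rv, ro, i) for i in range(n)]
    # Step 2: AND the validity bitmaps; results at null slots are masked out.
    validity = [a and b for a, b in zip(lvalid, rvalid)]
    return result, validity


left = (b"abcde", [0, 2, 2, 5], [True, False, True])    # ["ab", null, "cde"]
right = (b"abxcde", [0, 2, 3, 6], [True, True, True])   # ["ab", "x", "cde"]
result, validity = eq_ignoring_validity(left, right)
print(result, validity)
```

The value computed for the null slot is meaningless, but it is computed safely and then hidden by the combined validity.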

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

alamb commented Dec 22, 2021

I also started a mailing list thread on this topic: https://lists.apache.org/thread/fx8k250nn1d9b86sfo9t2gcl1v11mn4f

Co-authored-by: Jorge Leitao <jorgecarleitao@gmail.com>
Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
@pitrou pitrou changed the title from "(docs) Clarify that offsets are monotonic for binary like arrays" to "ARROW-15244: [Format] Clarify that offsets are monotonic for binary like arrays" Jan 4, 2022
github-actions bot commented Jan 4, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@pitrou pitrou closed this in e7dc8f5 Jan 4, 2022
@alamb alamb deleted the alamb/clarify_offsets branch January 4, 2022 21:44

ursabot commented Jan 5, 2022

Benchmark runs are scheduled for baseline = 31a07be and contender = e7dc8f5. e7dc8f5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.45% ⬆️0.0%] ursa-i9-9960x
[Failed ⬇️0.79% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
