
Add RowParser #3174

Merged · 1 commit · Nov 24, 2022
Conversation

tustvold (Contributor)

Which issue does this PR close?

Closes #.

Rationale for this change

I want to be able to spill rows from the row format to disk; currently this requires converting back to arrow, then to arrow IPC, and back again. Allowing users to parse rows as raw bytes opens up a large number of possibilities in this space.

The only unsafe aspect of decoding concerns UTF-8 validation; all indexing, reads, etc. are checked. Therefore, to make this sound it is sufficient to optionally enable UTF-8 validation. Given this is a case where we have already spilled to disk, I think this is an acceptable trade-off.

FWIW I couldn't work out a way to avoid this: the nature of files is that their contents can be mutated through safe APIs, making it impossible to maintain an invariant over their contents.

What changes are included in this PR?

Are there any user-facing changes?
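The spill-to-disk round trip the rationale describes can be illustrated with a toy sketch. Note this is not the arrow-rs API: the `spill`/`unspill` helpers and the length-prefixed layout are invented for illustration. The point it demonstrates is the one from the rationale: bytes that round-trip through storage are untrusted, so string data must go through a checked UTF-8 validation rather than an unchecked cast.

```rust
use std::io::{Read, Seek, SeekFrom, Write};

// Hypothetical helpers (not arrow-rs): write length-prefixed "rows" of raw
// bytes to any writer, e.g. a spill file.
fn spill<W: Write>(mut w: W, rows: &[&[u8]]) -> std::io::Result<()> {
    for row in rows {
        w.write_all(&(row.len() as u32).to_le_bytes())?;
        w.write_all(row)?;
    }
    Ok(())
}

// Read the rows back. The bytes are untrusted at this point: the file could
// have been modified through entirely safe APIs in the meantime.
fn unspill<R: Read>(mut r: R) -> std::io::Result<Vec<Vec<u8>>> {
    let mut rows = Vec::new();
    let mut len = [0u8; 4];
    loop {
        match r.read_exact(&mut len) {
            Ok(()) => {}
            Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
        let mut row = vec![0u8; u32::from_le_bytes(len) as usize];
        r.read_exact(&mut row)?;
        rows.push(row);
    }
    Ok(rows)
}

fn main() -> std::io::Result<()> {
    let rows: Vec<&[u8]> = vec![b"hello", b"world"];
    // A Cursor stands in for the spill file to keep the sketch std-only.
    let mut buf = std::io::Cursor::new(Vec::new());
    spill(&mut buf, &rows)?;
    buf.seek(SeekFrom::Start(0))?;
    let read_back = unspill(&mut buf)?;
    assert_eq!(read_back, rows);
    // Checked validation on the untrusted bytes; skipping this (e.g. with
    // str::from_utf8_unchecked) would be unsound, which is the PR's point.
    for row in &read_back {
        let s = std::str::from_utf8(row).expect("invalid UTF-8 after spill");
        println!("{s}");
    }
    Ok(())
}
```

The design consequence, as discussed below, is that validation is an opt-in flag rather than an `unsafe` escape hatch, since no invariant can be maintained over a file's contents.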

github-actions bot added the `arrow` label (Changes to the arrow crate) on Nov 23, 2022
alamb (Contributor) left a comment:

Makes sense to me. It might be worth some additional comments explaining when validation must be done to ensure safety, and when the checks can be elided.

"rows were not produced by this RowConverter"
);

validate_utf8 |= row.config.validate_utf8;
alamb (Contributor):

It seems strange that some rows would have validate_utf8 set and some would not if they came from the same row converter. Maybe we could assert they are all the same?

tustvold (Contributor, Author):

Consider the case of an ExternalSort where some batches have been spilled and some haven't. They all share the same row converter, but not all need validation. If you then merge them together, you might need to do validation.
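The merge scenario can be sketched in isolation. This is a reduced stand-in, not the real arrow-rs types: `RowConfig` and `Rows` here carry only the one flag under discussion. It shows the semantics of the `|=` line quoted above: one spilled input is enough to force validation of the whole merged output.

```rust
// Reduced stand-ins for the types under discussion (not the arrow-rs structs).
struct RowConfig {
    validate_utf8: bool,
}

struct Rows {
    config: RowConfig,
}

// Mirrors the `validate_utf8 |= row.config.validate_utf8;` logic: if any
// input may have round-tripped through disk, the merge must be validated,
// even though rows that never left memory are known-good.
fn must_validate(inputs: &[Rows]) -> bool {
    let mut validate_utf8 = false;
    for rows in inputs {
        validate_utf8 |= rows.config.validate_utf8;
    }
    validate_utf8
}

fn main() {
    let in_memory = Rows { config: RowConfig { validate_utf8: false } };
    let spilled = Rows { config: RowConfig { validate_utf8: true } };
    // Merging one spilled batch with an in-memory batch requires validation.
    assert!(must_validate(&[in_memory, spilled]));
}
```

This over-validates the in-memory rows in a mixed merge, which is exactly the trade-off questioned in the next comment.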

alamb (Contributor):

But wouldn't this code effectively validate all rows in this case, even those that we know haven't been spilled? Maybe I misread it.

tustvold (Contributor, Author):

Indeed; that's a potential future optimisation if it shows up in profiles. My hunch is that IO would dominate in that case.

```rust
struct RowConfig {
    /// The schema for these rows
    fields: Arc<[SortField]>,
    /// Whether to run UTF-8 validation when converting to arrow arrays
    validate_utf8: bool,
}
```
alamb (Contributor):

Perhaps it would be wise to note here that UTF-8 validation will be required when reading bytes that may have been modified?

alamb (Contributor) commented Nov 23, 2022:

I also think it would be very nice to have some unsafe way to skip the UTF-8 validation, so that if we use this in DataFusion, say, we could use that API to skip validation when reading in spill files.

tustvold (Contributor, Author) commented Nov 24, 2022:

> could use that API to skip validation when reading in spill files

In my benchmarks the performance hit of validation is fairly marginal (~10%), and so I suspect it will be dominated by IO. We can revisit if it starts to show up in profiles.

I did originally have an unsafe API, but there isn't actually a way to make its usage sound, as you can't maintain an invariant over a file's contents.

tustvold merged commit 1d22fe3 into apache:master on Nov 24, 2022
ursabot commented Nov 24, 2022:

Benchmark runs are scheduled for baseline = cea5146 and contender = 1d22fe3. 1d22fe3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
- [Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
- [Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
- [Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
- [Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q

Supported benchmarks:
- ec2-t3-xlarge-us-east-2: Python, R (runs only benchmarks with cloud = True)
- test-mac-arm: C++, Python, R
- ursa-i9-9960x: Python, R, JavaScript
- ursa-thinkcentre-m75q: C++, Java

v1gnesh commented Nov 25, 2022:

> I also think it would be very nice to have some unsafe way to skip the utf8 validation so that if we use this in datafusion, say, we could use that API to skip validation when reading in spill files

What about using bstr?
BurntSushi/bstr#90 (comment)

Also, where can I read up about the Row format and related functions, etc?

alamb (Contributor) commented Nov 25, 2022:

> Also, where can I read up about the Row format and related functions, etc?

I recommend:
- https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/
- https://docs.rs/arrow/27.0.0/arrow/row/index.html
