
Add RowParser #3174

Merged · 1 commit · Nov 24, 2022
Conversation

tustvold (Contributor)

Which issue does this PR close?

Closes #.

Rationale for this change

I want to be able to spill rows from the row format to disk; currently this requires converting back to arrow, then to arrow IPC, and back again. Allowing users to parse rows as raw bytes opens up a large number of possibilities in this space.

The only unsafe aspect of decoding concerns UTF-8 validation; all indexing, reads, etc. are checked. Therefore, to make this sound it is sufficient to optionally enable UTF-8 validation. Given this is a case where we have already spilled to disk, I think this is an acceptable trade-off.

FWIW I couldn't work out a way to avoid this: the nature of files is that their contents can be mutated through safe APIs, making it impossible to maintain an invariant over their contents.

What changes are included in this PR?

Are there any user-facing changes?
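The spill-to-disk round trip the rationale describes can be illustrated with a toy sketch. Note this is not the arrow-rs API: the `spill`/`unspill` helpers and the length-prefixed layout are invented for illustration. The point it demonstrates is the one from the rationale: bytes that round-trip through storage are untrusted, so string data must go through a checked UTF-8 validation rather than an unchecked cast.

```rust
use std::io::{Read, Seek, SeekFrom, Write};

// Hypothetical helpers (not arrow-rs): write length-prefixed "rows" of raw
// bytes to any writer, e.g. a spill file.
fn spill<W: Write>(mut w: W, rows: &[&[u8]]) -> std::io::Result<()> {
    for row in rows {
        w.write_all(&(row.len() as u32).to_le_bytes())?;
        w.write_all(row)?;
    }
    Ok(())
}

// Read the rows back. The bytes are untrusted at this point: the file could
// have been modified through entirely safe APIs in the meantime.
fn unspill<R: Read>(mut r: R) -> std::io::Result<Vec<Vec<u8>>> {
    let mut rows = Vec::new();
    let mut len = [0u8; 4];
    loop {
        match r.read_exact(&mut len) {
            Ok(()) => {}
            Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
        let mut row = vec![0u8; u32::from_le_bytes(len) as usize];
        r.read_exact(&mut row)?;
        rows.push(row);
    }
    Ok(rows)
}

fn main() -> std::io::Result<()> {
    let rows: Vec<&[u8]> = vec![b"hello", b"world"];
    // A Cursor stands in for the spill file to keep the sketch std-only.
    let mut buf = std::io::Cursor::new(Vec::new());
    spill(&mut buf, &rows)?;
    buf.seek(SeekFrom::Start(0))?;
    let read_back = unspill(&mut buf)?;
    assert_eq!(read_back, rows);
    // Checked validation on the untrusted bytes; skipping this (e.g. with
    // str::from_utf8_unchecked) would be unsound, which is the PR's point.
    for row in &read_back {
        let s = std::str::from_utf8(row).expect("invalid UTF-8 after spill");
        println!("{s}");
    }
    Ok(())
}
```

The design consequence, as discussed below, is that validation is an opt-in flag rather than an `unsafe` escape hatch, since no invariant can be maintained over a file's contents.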

github-actions bot added the `arrow` label (Changes to the arrow crate) on Nov 23, 2022
alamb (Contributor) left a comment:

Makes sense to me. It might be worth some additional comments explaining when validation must be done to ensure safety, and when the checks can be elided.

"rows were not produced by this RowConverter"
);

validate_utf8 |= row.config.validate_utf8;
alamb (Contributor):

It seems strange that some rows would have validate_utf8 set and some would not if they came from the same row converter. Maybe we could assert they are all the same?

tustvold (Contributor, Author):

Consider the case of an ExternalSort where some batches have been spilled and some haven't. They all share the same row converter, but not all need validation. If you then merge them together, you might need to do validation.
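The merge scenario can be sketched in isolation. This is a reduced stand-in, not the real arrow-rs types: `RowConfig` and `Rows` here carry only the one flag under discussion. It shows the semantics of the `|=` line quoted above: one spilled input is enough to force validation of the whole merged output.

```rust
// Reduced stand-ins for the types under discussion (not the arrow-rs structs).
struct RowConfig {
    validate_utf8: bool,
}

struct Rows {
    config: RowConfig,
}

// Mirrors the `validate_utf8 |= row.config.validate_utf8;` logic: if any
// input may have round-tripped through disk, the merge must be validated,
// even though rows that never left memory are known-good.
fn must_validate(inputs: &[Rows]) -> bool {
    let mut validate_utf8 = false;
    for rows in inputs {
        validate_utf8 |= rows.config.validate_utf8;
    }
    validate_utf8
}

fn main() {
    let in_memory = Rows { config: RowConfig { validate_utf8: false } };
    let spilled = Rows { config: RowConfig { validate_utf8: true } };
    // Merging one spilled batch with an in-memory batch requires validation.
    assert!(must_validate(&[in_memory, spilled]));
}
```

This over-validates the in-memory rows in a mixed merge, which is exactly the trade-off questioned in the next comment.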

alamb (Contributor):

But wouldn't this code effectively validate all rows in this case, even those that we know haven't been spilled? Maybe I misread it.

tustvold (Contributor, Author):

Indeed; that's a potential future optimisation if it shows up in profiles. My hunch is that IO would dominate in that case.

```rust
struct RowConfig {
    /// The schema for these rows
    fields: Arc<[SortField]>,
    /// Whether to run UTF-8 validation when converting to arrow arrays
    validate_utf8: bool,
}
```
alamb (Contributor):

Perhaps it would be wise to note here that UTF-8 validation will be required when reading bytes that may have been modified?

alamb (Contributor) commented Nov 23, 2022:

I also think it would be very nice to have some unsafe way to skip the UTF-8 validation, so that if we use this in DataFusion, say, we could use that API to skip validation when reading in spill files.

tustvold (Contributor, Author) commented Nov 24, 2022:

> could use that API to skip validation when reading in spill files

In my benchmarks the performance hit of validation is fairly marginal (~10%), and so I suspect it will be dominated by IO. We can revisit if it starts to show up in profiles.

I did originally have an unsafe API, but there isn't actually a way to make its usage sound, as you can't maintain an invariant over a file's contents.

tustvold merged commit 1d22fe3 into apache:master on Nov 24, 2022
ursabot commented Nov 24, 2022:

Benchmark runs are scheduled for baseline = cea5146 and contender = 1d22fe3. 1d22fe3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
- [Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
- [Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
- [Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
- [Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q

Supported benchmarks:
- ec2-t3-xlarge-us-east-2: Python, R (runs only benchmarks with cloud = True)
- test-mac-arm: C++, Python, R
- ursa-i9-9960x: Python, R, JavaScript
- ursa-thinkcentre-m75q: C++, Java

v1gnesh commented Nov 25, 2022:

> I also think it would be very nice to have some unsafe way to skip the utf8 validation so that if we use this in datafusion, say, we could use that API to skip validation when reading in spill files

What about using bstr?
BurntSushi/bstr#90 (comment)

Also, where can I read up about the Row format and related functions, etc?

alamb (Contributor) commented Nov 25, 2022:

> Also, where can I read up about the Row format and related functions, etc?

I recommend:
- https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/
- https://docs.rs/arrow/27.0.0/arrow/row/index.html
