-
Notifications
You must be signed in to change notification settings - Fork 791
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add RowParser #3174
Add RowParser #3174
Conversation
07c514f
to
26e1d61
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me -- it might be worth some additional comments explaning when validation must be done to ensure safety / when the checks can be elided
"rows were not produced by this RowConverter" | ||
); | ||
|
||
validate_utf8 |= row.config.validate_utf8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems strange that some rows would have validate_utf8
set and some would not if they came from the same row converter. Maybe we could assert they are all the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider the case of an ExternalSort where some batches have been spilled and some haven't. They all have the same row converter, but not all need validation. If you then merge them together you might need to do validation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But wound't this code effectively validate all rows in this case, even those that we know haven't been spilled? Maybe I misread it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, potential future optimisation if it shows up in profiles. My hunch is the IO would dominate for that case
struct RowConfig { | ||
/// The schema for these rows | ||
fields: Arc<[SortField]>, | ||
/// Whether to run UTF-8 validation when converting to arrow arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps it would be wise to note here that utf8 validation will be required when reading bytes that may have been modified?
I also think it would be very nice to have some |
In my benchmarks the performance hit of validation is fairly marginal, ~10%, and so I suspect will be dominated by IO. We can revisit if it starts to show up in profiles. I did originally have an unsafe API, but there isn't actually a way to make its usage sound, as you can't maintain an invariant over a file. |
Benchmark runs are scheduled for baseline = cea5146 and contender = 1d22fe3. 1d22fe3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
What about using Also, where can I read up about the Row format and related functions, etc? |
I recommend |
Which issue does this PR close?
Closes #.
Rationale for this change
I want to be able to spill rows from the row format to disk, currently this requires converting back to arrow, then to arrow IPC and back. Allowing users to parse rows as bytes opens up a large number of possibilities in this space.
The only unsafe aspect of decoding is w.r.t UTF-8 validation, with all indexing, reads, etc... checked. Therefore to make this sound it is sufficient to optionally enable UTF-8 validation. Given this is a case where we have spilled to disk, I think this is an acceptable trade-off.
FWIW I couldn't work out a way to avoid this, as the nature of files is that their contents can be mutated by safe APIs, making it impossible to maintain an invariant over their contents
What changes are included in this PR?
Are there any user-facing changes?