Add streaming JSON and CSV reading, `NewlineDelimitedStream` (#2935) #2936
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #2936      +/-   ##
==========================================
+ Coverage   85.30%   85.33%   +0.02%
==========================================
  Files         273      274       +1
  Lines       49269    49450     +181
==========================================
+ Hits        42029    42198     +169
- Misses       7240     7252      +12
==========================================
Continue to review full report at Codecov.
alamb
left a comment
use std::collections::VecDeque;

/// The ASCII encoding of `"`
const QUOTE: u8 = 34;
I wonder if the `b` byte-literal syntax would make this easier to validate (and the same for the constants below):
-const QUOTE: u8 = 34;
+const QUOTE: u8 = b'"';
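For illustration (not part of the PR), a minimal check that the numeric constant and the byte literal denote the same `u8`, so the suggested change affects readability only:

```rust
// The numeric ASCII code and the byte literal are the same value,
// but the literal is self-evidently correct at a glance.
const QUOTE_NUMERIC: u8 = 34;
const QUOTE_LITERAL: u8 = b'"';

fn main() {
    assert_eq!(QUOTE_NUMERIC, QUOTE_LITERAL);
    // The same holds for the other delimiters used in line splitting:
    assert_eq!(10u8, b'\n');
    assert_eq!(92u8, b'\\');
}
```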
let is_escape = &mut self.is_escape;
let is_quote = &mut self.is_quote;
let mut record_ends = val.iter().enumerate().filter_map(|(idx, v)| {
Documenting the escaping rules that `LineDelimiter` assumes might be good (e.g. `\`-style escapes with `"` quotes).
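To make the assumed rules concrete, here is a standalone sketch (not the actual `LineDelimiter` code) of a record-boundary scan under those conventions: a backslash escapes the next byte, and a newline inside a double-quoted section does not end a record:

```rust
// Illustrative only: find indexes of newlines that terminate records,
// honouring \-escapes and "-quoting as discussed in the review.
fn record_ends(data: &[u8]) -> Vec<usize> {
    let mut is_escape = false;
    let mut is_quote = false;
    data.iter()
        .enumerate()
        .filter_map(|(idx, &v)| {
            if is_escape {
                // This byte was escaped by the previous backslash.
                is_escape = false;
                None
            } else if v == b'\\' {
                is_escape = true;
                None
            } else if v == b'"' {
                is_quote = !is_quote;
                None
            } else if v == b'\n' && !is_quote {
                // An unquoted, unescaped newline ends a record.
                Some(idx)
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    // The newline inside the quoted field is not a record boundary.
    assert_eq!(record_ends(b"a,\"x\ny\"\nb\n"), vec![7, 9]);
    // An escaped quote does not toggle quoting.
    assert_eq!(record_ends(b"a\\\"b\nc\n"), vec![4, 6]);
}
```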
Some(idx) => {
    self.remainder.extend_from_slice(&val[0..idx]);
    self.complete
        .push_back(Bytes::from(std::mem::take(&mut self.remainder)));
If `remainder` was a `Bytes` this would probably be cleaner (though the clause below that handles no records in the chunk would be more complicated). However, I suspect "the next `Bytes` actually has more than one record ending" is the more common case.
}
/// Given a [`Stream`] of [`Bytes`] returns a [`Stream`] where each
/// yielded [`Bytes`] contains a whole number of new line delimited records
👍
use futures::stream::TryStreamExt;

#[test]
fn test_delimiter() {
Is the case where some push doesn't have a delimiter covered? I couldn't find it
assert_eq!(delimiter.next().unwrap(), Bytes::from("\n"));
assert!(delimiter.next().is_none());

delimiter.push("");
I would recommend making a new test here as a way to make the tests more self-documenting.
Unless it is important that data can be pushed into a `LineDelimiter` after `finish()` is called 🤔
 delimiter.push("");

+#[test]
+fn test_delimiter_escaped() {
 delimiter.push("");
I think it is important to test that more data can be added to a `LineDelimiter` after data has been pulled from it.
👍 that would be good to document (the intent of the test).
I think the main suggestion of breaking up the test into smaller self-contained blocks with descriptive names still holds, even if this particular cut-off point would not be ideal.
The total test size will be larger, but I think it will be easier to understand what each test is testing.
Maybe worth a thought.
/// Complete chunks of [`Bytes`]
complete: VecDeque<Bytes>,
/// Remainder bytes that form the next record
remainder: Vec<u8>,
I wonder if you could use something like
-remainder: Vec<u8>,
+remainder: Bytes,
As `Bytes` has a `slice` method (https://docs.rs/bytes/1.1.0/bytes/struct.Bytes.html#method.slice) 🤔 which might reduce some copies 🤷
I think the copy is unavoidable: the nature of the remainder is that you need to take data from two separate `Bytes`. It should only be a single "line" though, and so should be relatively minor from a performance standpoint.
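A minimal sketch of this point (using `Vec<u8>` in place of the crate's `Bytes`, and much simplified from the actual `LineDelimiter`): a record that spans two pushes can only be produced by copying bytes from both into a shared `remainder` buffer:

```rust
use std::collections::VecDeque;

// Illustrative only: complete records are queued as owned chunks, while a
// trailing partial record accumulates in `remainder` until a later push
// supplies its terminating newline.
struct Delimiter {
    complete: VecDeque<Vec<u8>>,
    remainder: Vec<u8>,
}

impl Delimiter {
    fn new() -> Self {
        Self { complete: VecDeque::new(), remainder: Vec::new() }
    }

    fn push(&mut self, data: &[u8]) {
        // The copy into `remainder` is what joins a record whose bytes
        // come from two separate pushes.
        match data.iter().rposition(|&b| b == b'\n') {
            Some(idx) => {
                self.remainder.extend_from_slice(&data[..=idx]);
                self.complete.push_back(std::mem::take(&mut self.remainder));
                self.remainder.extend_from_slice(&data[idx + 1..]);
            }
            None => self.remainder.extend_from_slice(data),
        }
    }
}

fn main() {
    let mut d = Delimiter::new();
    d.push(b"hello\nwor"); // "wor" is held back as the remainder
    d.push(b"ld\n");       // completes "world\n" from two sources
    assert_eq!(d.complete.pop_front().unwrap(), b"hello\nwor".to_vec()[..6].to_vec());
    assert_eq!(d.complete.pop_front().unwrap(), b"world\n".to_vec());
}
```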
makes sense
Marking as draft whilst I work on some tests
        assert_eq!(size, expected);
        remaining -= expected;
    }
}
I recommend also asserting `assert_eq!(remaining, 0)` at the end of the test, to ensure nothing is lost.
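A sketch of the suggested pattern (the sizes here are hypothetical, not from the PR's test): the loop draws down `remaining` as chunks arrive, and the final assertion proves every byte was accounted for:

```rust
fn main() {
    let total = 10usize;
    // Hypothetical chunk sizes yielded by the stream under test.
    let chunk_sizes = [4, 4, 2];

    let mut remaining = total;
    for &size in &chunk_sizes {
        assert!(size <= remaining);
        remaining -= size;
    }
    // The suggested final check: no bytes were silently dropped.
    assert_eq!(remaining, 0);
}
```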
match store.get(&file.location).await? {
    GetResult::File(file, _) => {
-        Ok(futures::stream::iter(config.open(file)).boxed())
+        Ok(futures::stream::iter(config.open(file, true)).boxed())
Is first-chunk a bug fix?
Yup 😄 Tests FTW
🥳 🦜
ctx.runtime_env().register_object_store(
    "file",
    "",
    Arc::new(ChunkedStore::new(
very nice 👌
Benchmark runs are scheduled for baseline = b772c6d and contender = 944ef3d. 944ef3d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #2935
Closes #2930
Rationale for this change
See ticket
What changes are included in this PR?
It aligns the chunks received from object storage to record boundaries, and then feeds these through the decoders
Are there any user-facing changes?
No
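The realignment the description refers to can be sketched end-to-end (illustrative only, using `Vec<u8>` instead of the object store's `Bytes` stream): arbitrary chunk boundaries are shifted so every emitted chunk holds a whole number of newline-delimited records, which a per-chunk decoder can then parse independently:

```rust
// Illustrative only: carry the trailing partial record forward so each
// output chunk ends exactly on a record boundary.
fn realign(chunks: Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    let mut carry: Vec<u8> = Vec::new();
    for chunk in chunks {
        carry.extend_from_slice(&chunk);
        if let Some(idx) = carry.iter().rposition(|&b| b == b'\n') {
            // Split after the last newline: the head is a whole number of
            // records, the tail is carried into the next iteration.
            let rest = carry.split_off(idx + 1);
            out.push(std::mem::replace(&mut carry, rest));
        }
    }
    // A real implementation would also flush any trailing carry here.
    out
}

fn main() {
    // "1,a\n2,b\n" arrives split mid-record across two storage chunks:
    let aligned = realign(vec![b"1,a\n2".to_vec(), b",b\n".to_vec()]);
    assert_eq!(aligned, vec![b"1,a\n".to_vec(), b"2,b\n".to_vec()]);
}
```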