-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Support reading from NdJson formatted data sources #404
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
19fd862 to
5632a90
Compare
Codecov Report
@@ Coverage Diff @@
## master #404 +/- ##
==========================================
- Coverage 74.94% 74.71% -0.23%
==========================================
Files 146 148 +2
Lines 24344 24782 +438
==========================================
+ Hits 18244 18516 +272
- Misses 6100 6266 +166
Continue to review full report at Codecov.
|
|
Thank you for the contribution @heymind -- I will try and review this later today. FYI @houqp @andygrove @nevi-me |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for the contribution @heymind -- I went through the code and I think the design and implementation look good. Nice work
The only thing that I think this PR needs for my approval is some small test additions -- to ensure the values that come out are expected.
Another thing that seems like it would be useful would be to add support for CREATE EXTERNAL TABLE.... for ndjson files, but we can do that as a follow on PR (it doesn't need to be part of this PR). I will file a ticket to do so.
FYI @Dandandan @andygrove
| let file = File::open(filenames.pop().unwrap())?; | ||
| let mut reader = BufReader::new(file); | ||
| let iter = ValueIter::new(&mut reader, None); | ||
| let schema = infer_json_schema_from_iterator(iter.take_while(|_| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @houqp 👍
datafusion/src/physical_plan/json.rs
Outdated
|
|
||
| struct NdJsonStream<R: Read> { | ||
| reader: json::Reader<R>, | ||
| remind: Option<usize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if a more specific name would be remain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a mistake. I confused the meaning of these two words. 💔
| Some(1), | ||
| )?; | ||
|
|
||
| let mut it = exec.execute(0).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the same comment about verifying the output here as above.
datafusion/src/physical_plan/json.rs
Outdated
|
|
||
| struct NdJsonStream<R: Read> { | ||
| reader: json::Reader<R>, | ||
| remind: Option<usize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if a more specific name would be remain?
datafusion/src/physical_plan/json.rs
Outdated
| let mut reader = BufReader::new(file); | ||
| let iter = ValueIter::new(&mut reader, None); | ||
| let schema = infer_json_schema_from_iterator(iter.take_while(|_| { | ||
| let shoud_take = records_to_read > 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| let shoud_take = records_to_read > 0; | |
| let should_take = records_to_read > 0; |
datafusion/src/physical_plan/json.rs
Outdated
| let schema = infer_json_schema_from_iterator(iter.take_while(|_| { | ||
| let shoud_take = records_to_read > 0; | ||
| records_to_read -= 1; | ||
| shoud_take |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| shoud_take | |
| should_take |
datafusion/src/physical_plan/mod.rs
Outdated
| } | ||
|
|
||
| /// Source represents where the data comes from. | ||
| enum Source<R = Box<dyn Read + Send + Sync>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might this be better in a new module, e.g. sources? WDYT @alamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would improve the code, but I also think it would be fine to move the code into its own module as a follow on PR, depending on @heymind 's preference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. A new module named source.rs is more clear.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks @heymind
* Adding documentation to run single tests * Removed empty newline * Fixing README and development.md --------- Co-authored-by: Edmondo Porcu <edmondo.porcu@capitalone.com>
Which issue does this PR close?
Closes #103.
What changes are included in this PR?
This pr creates two pub structs,
NdJsonFileandNdJsonExec.NdJsonFilecan be used as a data source ( or table provider ) to load JSON data from files or a reader.And
NdJsonExecrepresents an execution plan for scanning NdJson data source.Are there any user-facing changes?
No.