Contributor

@heymind heymind commented May 24, 2021

Which issue does this PR close?

Closes #103.

What changes are included in this PR?

This PR adds two public structs, NdJsonFile and NdJsonExec.

NdJsonFile can be used as a data source (or table provider) to load JSON data from files or from a reader.
NdJsonExec represents an execution plan for scanning an NdJson data source.
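
For readers unfamiliar with the format: NdJson (newline-delimited JSON) stores one JSON document per line, so a source can be consumed record by record from a file or from any other reader. The sketch below shows only that idea, using the standard library. `read_ndjson_records` is a hypothetical helper, not part of this PR or of DataFusion, and it returns the raw text of each record instead of parsing it.

```rust
use std::io::{BufRead, BufReader, Read};

/// Collects the raw JSON text of each record from a newline-delimited JSON
/// source. Blank lines are skipped; `limit` caps the number of records.
/// Hypothetical helper for illustration only, not part of DataFusion.
fn read_ndjson_records<R: Read>(
    source: R,
    limit: Option<usize>,
) -> std::io::Result<Vec<String>> {
    let mut records = Vec::new();
    for line in BufReader::new(source).lines() {
        // Stop before reading further once the optional limit is reached.
        if limit.map_or(false, |max| records.len() >= max) {
            break;
        }
        let line = line?;
        let trimmed = line.trim();
        if !trimmed.is_empty() {
            records.push(trimmed.to_string());
        }
    }
    Ok(records)
}
```

Because the helper is generic over `Read`, the same call works for an opened `std::fs::File` and for an in-memory buffer such as `std::io::Cursor`, which mirrors the "files or a reader" distinction in the description.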

Are there any user-facing changes?

No.

@heymind heymind force-pushed the ndjson branch 2 times, most recently from 19fd862 to 5632a90 May 24, 2021 08:03
codecov-commenter commented May 24, 2021

Codecov Report

Merging #404 (97e1fe9) into master (aeed776) will decrease coverage by 0.22%.
The diff coverage is 58.51%.

❗ Current head 97e1fe9 differs from pull request most recent head d07b0b0. Consider uploading reports for the commit d07b0b0 to get more accurate results
@@            Coverage Diff             @@
##           master     #404      +/-   ##
==========================================
- Coverage   74.94%   74.71%   -0.23%     
==========================================
  Files         146      148       +2     
  Lines       24344    24782     +438     
==========================================
+ Hits        18244    18516     +272     
- Misses       6100     6266     +166     

Impacted Files                                         Coverage Δ
...ta/rust/core/src/execution_plans/shuffle_reader.rs  0.00% <0.00%> (ø)
datafusion-cli/src/main.rs                             0.00% <0.00%> (ø)
datafusion/src/datasource/csv.rs                       72.81% <ø> (ø)
datafusion/src/physical_plan/csv.rs                    81.41% <ø> (+4.03%) ⬆️
benchmarks/src/bin/tpch.rs                             30.84% <11.11%> (+0.01%) ⬆️
datafusion/src/physical_optimizer/pruning.rs           89.73% <38.46%> (-0.88%) ⬇️
datafusion/src/physical_plan/mod.rs                    73.45% <42.30%> (-9.31%) ⬇️
datafusion/src/datasource/json.rs                      52.30% <52.30%> (ø)
datafusion-cli/src/print_format.rs                     84.44% <58.82%> (-5.97%) ⬇️
datafusion/src/scalar.rs                               58.66% <63.26%> (+3.18%) ⬆️
... and 5 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

alamb commented May 25, 2021

Thank you for the contribution @heymind -- I will try and review this later today. FYI @houqp @andygrove @nevi-me

@alamb alamb changed the title NdJson support Support reading from NdJson formatted files May 26, 2021
@alamb alamb changed the title Support reading from NdJson formatted files Support reading from NdJson formatted data sources May 26, 2021
Contributor

@alamb alamb left a comment

Thank you very much for the contribution @heymind -- I went through the code and I think the design and implementation look good. Nice work.

The only thing that I think this PR needs for my approval is some small test additions -- to ensure the values that come out are expected.

It would also be useful to add support for CREATE EXTERNAL TABLE ... for ndjson files, but we can do that in a follow-on PR (it doesn't need to be part of this one). I will file a ticket to do so.

FYI @Dandandan @andygrove

let file = File::open(filenames.pop().unwrap())?;
let mut reader = BufReader::new(file);
let iter = ValueIter::new(&mut reader, None);
let schema = infer_json_schema_from_iterator(iter.take_while(|_| {
Contributor

FYI @houqp 👍


struct NdJsonStream<R: Read> {
    reader: json::Reader<R>,
    remind: Option<usize>,
Contributor

I wonder if a more specific name would be remain?

Contributor Author

That's a mistake. I confused the meaning of these two words. 💔
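
To make the intent behind the suggested name concrete: a field called `remain` would naturally hold the number of rows the stream may still emit, with `None` meaning no limit. The sketch below is an assumption about that role, written against plain vectors instead of Arrow record batches. `LimitedBatches` is a hypothetical type and does not reproduce the `NdJsonStream` implementation in this PR.

```rust
/// Hypothetical stand-in for the stream under review: it hands out batches of
/// rows and stops once an optional row budget is used up.
struct LimitedBatches<I: Iterator<Item = Vec<i64>>> {
    batches: I,
    /// Rows that may still be emitted; `None` means no limit.
    remain: Option<usize>,
}

impl<I: Iterator<Item = Vec<i64>>> Iterator for LimitedBatches<I> {
    type Item = Vec<i64>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.remain == Some(0) {
            return None; // budget exhausted, do not pull another batch
        }
        let mut batch = self.batches.next()?;
        if let Some(remain) = self.remain.as_mut() {
            // Keep at most the remaining rows, then charge them to the budget.
            batch.truncate(*remain);
            *remain -= batch.len();
        }
        Some(batch)
    }
}
```

With a budget of four rows and input batches of three rows each, this yields one full batch followed by a one-row batch, then ends.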

Some(1),
)?;

let mut it = exec.execute(0).await?;
Contributor

I have the same comment about verifying the output here as above.


let mut reader = BufReader::new(file);
let iter = ValueIter::new(&mut reader, None);
let schema = infer_json_schema_from_iterator(iter.take_while(|_| {
    let shoud_take = records_to_read > 0;
Contributor

Suggested change
- let shoud_take = records_to_read > 0;
+ let should_take = records_to_read > 0;

let schema = infer_json_schema_from_iterator(iter.take_while(|_| {
    let shoud_take = records_to_read > 0;
    records_to_read -= 1;
    shoud_take
Contributor

Suggested change
- shoud_take
+ should_take
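
For context, the closure in this hunk caps how many records are fed to schema inference. A minimal standalone sketch of that counter-in-closure pattern follows, next to the standard `Iterator::take` adapter, which gives the same items. Both function names are hypothetical. The type of `records_to_read` is not visible in the quoted hunk, so the sketch assumes `usize` and guards the decrement to rule out underflow.

```rust
/// Takes at most `max_records` items using the counter-in-closure pattern
/// from the quoted hunk. The decrement is guarded so an unsigned counter
/// cannot underflow (the counter's real type is an assumption here).
fn first_records_take_while<I: Iterator>(iter: I, max_records: usize) -> Vec<I::Item> {
    let mut records_to_read = max_records;
    iter.take_while(|_| {
        let should_take = records_to_read > 0;
        if should_take {
            records_to_read -= 1;
        }
        should_take
    })
    .collect()
}

/// Same items, using the standard `Iterator::take` adapter.
fn first_records_take<I: Iterator>(iter: I, max_records: usize) -> Vec<I::Item> {
    iter.take(max_records).collect()
}
```

One behavioral difference is worth knowing: `take_while` has to pull one extra item from the underlying iterator to evaluate the predicate that ends it, whereas `take` stops without pulling. That only matters if the underlying reader is used again afterwards.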

}

/// Source represents where the data comes from.
enum Source<R = Box<dyn Read + Send + Sync>> {
Contributor

Might this be better in a new module, e.g. sources? WDYT @alamb

Contributor

I think that would improve the code, but I also think it would be fine to move the code into its own module as a follow-on PR, depending on @heymind's preference.

Contributor Author

OK. A new module named source.rs would be clearer.
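
As a rough picture of what such a module could hold, here is a sketch of a `Source` enum with the default type parameter shown in the hunk. Only the enum's name and that default appear in the quoted code. The variants `Path` and `Reader` and the method `into_reader` are hypothetical and are not taken from the PR.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::PathBuf;

/// Where the data comes from. The variant names here are hypothetical; only
/// the enum's name and default type parameter appear in the quoted hunk.
enum Source<R = Box<dyn Read + Send + Sync>> {
    /// Read from a file on disk.
    Path(PathBuf),
    /// Read from a caller-supplied reader.
    Reader(R),
}

impl<R: Read + Send + Sync + 'static> Source<R> {
    /// Resolves either variant to a boxed reader so callers handle both alike.
    fn into_reader(self) -> io::Result<Box<dyn Read + Send + Sync>> {
        match self {
            Source::Path(path) => Ok(Box::new(File::open(path)?)),
            Source::Reader(reader) => Ok(Box::new(reader)),
        }
    }
}
```

Keeping the enum in its own module would let other formats reuse the same file-or-reader abstraction, which seems to be the motivation behind the suggestion in this thread.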

Contributor

@alamb alamb left a comment

Looks great. Thanks @heymind

@alamb alamb merged commit 321fda4 into apache:master May 29, 2021
@heymind heymind deleted the ndjson branch May 30, 2021 00:13
@houqp houqp added the api change (Changes the API exposed to users of the crate), datafusion, and enhancement (New feature or request) labels Jul 29, 2021
unkloud pushed a commit to unkloud/datafusion that referenced this pull request Mar 23, 2025
* Adding documentation to run single tests

* Removed empty newline

* Fixing README and development.md

---------

Co-authored-by: Edmondo Porcu <edmondo.porcu@capitalone.com>
Labels

api change (Changes the API exposed to users of the crate), enhancement (New feature or request)

Development

Successfully merging this pull request may close these issues.

[Rust] Add support for JSON data sources

5 participants