Add streaming JSON and CSV reading, `NewlineDelimitedStream` (#2935) #2936


Merged

tustvold merged 5 commits into apache:master from tustvold:streaming-json-csv

Jul 18, 2022

Contributor

tustvold commented Jul 17, 2022 •

Which issue does this PR close?

Closes #2935
Closes #2930

Rationale for this change

See ticket

What changes are included in this PR?

It aligns the chunks received from object storage to record boundaries, and then feeds the aligned chunks through the existing decoders.
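As a rough illustration of what "aligning chunks to record boundaries" means, here is a minimal sketch. It is not the code in this PR: `align_to_newlines` is a hypothetical helper, it works on plain slices rather than a `Stream` of `Bytes`, and it ignores quoting and escaping entirely.

```rust
/// Hypothetical helper (not part of this PR): re-chunks arbitrary byte
/// chunks so that every returned chunk ends on a newline, except possibly
/// the last one if the input does not end with a newline.
fn align_to_newlines(chunks: &[&[u8]]) -> Vec<Vec<u8>> {
    let mut aligned = Vec::new();
    // Bytes of a record that started in an earlier chunk
    let mut remainder: Vec<u8> = Vec::new();

    for chunk in chunks {
        match chunk.iter().rposition(|b| *b == b'\n') {
            Some(idx) => {
                // Everything up to and including the last newline is complete
                remainder.extend_from_slice(&chunk[..=idx]);
                aligned.push(std::mem::take(&mut remainder));
                // The tail is the start of the next record
                remainder.extend_from_slice(&chunk[idx + 1..]);
            }
            // No newline in this chunk: the record continues in the next one
            None => remainder.extend_from_slice(chunk),
        }
    }

    // Flush a trailing record that has no terminating newline
    if !remainder.is_empty() {
        aligned.push(remainder);
    }
    aligned
}
```

Each aligned chunk can then be handed to a decoder that expects whole records, regardless of where the object store happened to split the file.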

Are there any user-facing changes?

No


          Add streaming JSON and CSV (apache#2935)

4ac6490

github-actions bot added the core label


          Add license header

ce4f59e

codecov-commenter commented Jul 17, 2022

Codecov Report

Merging #2936 (ce4f59e) into master (b5537e7) will increase coverage by 0.02%.
The diff coverage is 80.18%.

@@            Coverage Diff             @@
##           master    #2936      +/-   ##
==========================================
+ Coverage   85.30%   85.33%   +0.02%     
==========================================
  Files         273      274       +1     
  Lines       49269    49450     +181     
==========================================
+ Hits        42029    42198     +169     
- Misses       7240     7252      +12

Impacted Files	Coverage Δ
datafusion/core/src/datasource/listing/helpers.rs	`94.96% <ø> (ø)`
...tafusion/core/src/physical_plan/file_format/csv.rs	`91.75% <0.00%> (-0.48%)`	⬇️
...afusion/core/src/physical_plan/file_format/json.rs	`89.07% <0.00%> (-2.00%)`	⬇️
...tafusion/core/src/physical_plan/file_format/mod.rs	`97.36% <ø> (ø)`
.../src/physical_plan/file_format/delimited_stream.rs	`90.42% <90.42%> (ø)`
datafusion/common/src/pyarrow.rs	`0.00% <0.00%> (ø)`
datafusion/proto/src/bytes/mod.rs	`82.75% <0.00%> (ø)`
...usion/core/src/avro_to_arrow/arrow_array_reader.rs	`0.00% <0.00%> (ø)`
datafusion/core/tests/sql/aggregates.rs	`99.28% <0.00%> (+0.01%)`	⬆️
datafusion/expr/src/aggregate_function.rs	`92.25% <0.00%> (+0.02%)`	⬆️
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

alamb changed the title ~~Add streaming JSON and CSV (#2935)~~ Add streaming JSON and CSV reading (#2935)

alamb changed the title ~~Add streaming JSON and CSV reading (#2935)~~ Add streaming JSON and CSV reading, `NewlineDelimitedStream` (#2935)

alamb approved these changes


Contributor

alamb left a comment

This is very cool @tustvold -- thank you!

I wonder if we can write an end-to-end test for this (aka a query from a CSV file or something) 🤔

datafusion/core/src/physical_plan/file_format/delimited_stream.rs Outdated

    
    use std::collections::VecDeque;

    /// The ASCII encoding of `"`
    const QUOTE: u8 = 34;

Contributor

alamb Jul 18, 2022

I wonder if the byte-literal (`b'"'`) syntax would make this easier to validate (and the same for the constants below)

Suggested change

      
    - const QUOTE: u8 = 34;
    + const QUOTE: u8 = b'"';
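For readers unfamiliar with the syntax, a byte literal such as `b'"'` has type `u8` and evaluates to the character's ASCII code, so the suggested form is equivalent to the numeric one. A small sketch follows; only `QUOTE` appears in the excerpt above, while `ESCAPE` and `NEWLINE` are assumed names used here for illustration.

```rust
/// The ASCII encoding of `"` (same value as the literal 34)
const QUOTE: u8 = b'"';

/// The ASCII encoding of `\` (assumed name, for illustration only)
const ESCAPE: u8 = b'\\';

/// The ASCII encoding of a line feed (assumed name, for illustration only)
const NEWLINE: u8 = b'\n';
```

The byte-literal form lets a reviewer check the constant against its doc comment without consulting an ASCII table.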

datafusion/core/src/physical_plan/file_format/delimited_stream.rs

    
    let is_escape = &mut self.is_escape;
    let is_quote = &mut self.is_quote;
    let mut record_ends = val.iter().enumerate().filter_map(|(idx, v)| {

Contributor

alamb Jul 18, 2022

Documenting the escaping rules that `LineDelimiter` assumes might be good (like `\` style escapes with `"` quotes)
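To make the rules being discussed concrete, here is a minimal sketch of one plausible reading: a backslash escapes the next byte, a double quote toggles "inside quotes", and a newline only ends a record when it is neither escaped nor quoted. `record_ends` is a hypothetical free function, not the method in this PR, and the exact rules `LineDelimiter` applies may differ.

```rust
/// Hypothetical sketch: returns the index of every newline that terminates
/// a record, assuming `\` escapes the following byte and `"` toggles quoting.
fn record_ends(data: &[u8]) -> Vec<usize> {
    let mut is_escape = false;
    let mut is_quote = false;
    let mut ends = Vec::new();

    for (idx, v) in data.iter().enumerate() {
        if is_escape {
            // The byte after a backslash is taken literally
            is_escape = false;
            continue;
        }
        match *v {
            b'\\' => is_escape = true,
            b'"' => is_quote = !is_quote,
            // A newline inside quotes is part of the record's data
            b'\n' if !is_quote => ends.push(idx),
            _ => {}
        }
    }
    ends
}
```

In the real implementation the two flags live on the struct (as `self.is_escape` and `self.is_quote` in the excerpt above), so the state carries over from one pushed chunk to the next.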

datafusion/core/src/physical_plan/file_format/delimited_stream.rs

    
    Some(idx) => {
        self.remainder.extend_from_slice(&val[0..idx]);
        self.complete
            .push_back(Bytes::from(std::mem::take(&mut self.remainder)));

Contributor

alamb Jul 18, 2022

If `remainder` were a `Bytes` this would probably be cleaner (though the clause below to handle no records in the chunk would be more complicated). However, I suspect the "next `Bytes` actually has more than one record ending" case is the more common one.

datafusion/core/src/physical_plan/file_format/delimited_stream.rs

    
    }

    /// Given a [`Stream`] of [`Bytes`] returns a [`Stream`] where each
    /// yielded [`Bytes`] contains a whole number of new line delimited records

Contributor

alamb Jul 18, 2022

👍

datafusion/core/src/physical_plan/file_format/delimited_stream.rs

    
    use futures::stream::TryStreamExt;

    #[test]
    fn test_delimiter() {

Contributor

alamb Jul 18, 2022

Is the case where a `push` doesn't contain a delimiter covered? I couldn't find it

datafusion/core/src/physical_plan/file_format/delimited_stream.rs

    
    assert_eq!(delimiter.next().unwrap(), Bytes::from("\n"));
    assert!(delimiter.next().is_none());

    delimiter.push("");

Contributor

alamb Jul 18, 2022

I would recommend making a new test here as a way to make the tests more self documenting.

Unless it is important that data can be pushed into a `LineDelimiter` after `finish()` is called 🤔

Suggested change

      
    - delimiter.push("");
    + #[test]
    + fn test_delimiter_escaped() {
    +     delimiter.push("");

Contributor Author

tustvold Jul 18, 2022

I think it is important to test that more data can be added to a `LineDelimiter` after data has been pulled from it

Contributor

alamb Jul 18, 2022

👍 that would be good to document (the intent of the test).

I think the main suggestion of breaking up the test into smaller self-contained blocks with descriptive names still holds, even if this particular cut-off point would not be ideal.

The total test size will be larger, but I think it will be easier to understand what each test is testing.

Maybe worth a thought

datafusion/core/src/physical_plan/file_format/delimited_stream.rs

    
    /// Complete chunks of [`Bytes`]
    complete: VecDeque<Bytes>,
    /// Remainder bytes that form the next record
    remainder: Vec<u8>,

Contributor

alamb Jul 18, 2022

I wonder if you could use something like

Suggested change

      
    - remainder: Vec<u8>,
    + remainder: Bytes,

As `Bytes` has a `slice` method https://docs.rs/bytes/1.1.0/bytes/struct.Bytes.html#method.slice 🤔

Which might reduce some copies 🤷

Contributor Author

tustvold Jul 18, 2022

I think the copy is unavoidable: by its nature, the remainder needs data from two separate `Bytes`. It should only be a single "line" though, and so should be relatively minor from a performance standpoint

Contributor

alamb Jul 18, 2022

makes sense

tustvold marked this pull request as draft

July 18, 2022 15:48

Contributor Author

tustvold commented Jul 18, 2022

Marking as draft whilst I work on some tests

tustvold added 2 commits

July 18, 2022 12:40


          Review feedback

ff06619


          Add license header

b03ab55

tustvold marked this pull request as ready for review

July 18, 2022 16:46

alamb approved these changes


datafusion/core/src/physical_plan/file_format/chunked_store.rs

    
            assert_eq!(size, expected);
            remaining -= expected;
        }
    }

Contributor

alamb Jul 18, 2022

I recommend also adding `assert_eq!(remaining, 0)` at the end of the test to ensure nothing is lost
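A sketch of why the final check matters, under stated assumptions: `split_chunks` is a hypothetical stand-in for the chunked reader under test (it is not `ChunkedStore`), and `check_chunks` mirrors the shape of the loop in the excerpt above. `chunk_size` must be non-zero.

```rust
/// Hypothetical stand-in for a chunked reader: splits `data` into pieces of
/// at most `chunk_size` bytes. Panics if `chunk_size` is zero.
fn split_chunks(data: &[u8], chunk_size: usize) -> Vec<&[u8]> {
    data.chunks(chunk_size).collect()
}

/// Verifies that every chunk has the expected size and returns the number
/// of bytes that were never seen, which should be zero.
fn check_chunks(data: &[u8], chunk_size: usize) -> usize {
    let mut remaining = data.len();
    for chunk in split_chunks(data, chunk_size) {
        let expected = remaining.min(chunk_size);
        assert_eq!(chunk.len(), expected);
        remaining -= expected;
    }
    // Without checking this value, a reader that silently dropped the final
    // chunk would still pass every per-chunk assertion above
    remaining
}
```

A test would then finish with `assert_eq!(check_chunks(data, chunk_size), 0)`, which is the equivalent of the suggested `assert_eq!(remaining, 0)`.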

datafusion/core/src/physical_plan/file_format/csv.rs

    
    match store.get(&file.location).await? {
        GetResult::File(file, _) => {
    -       Ok(futures::stream::iter(config.open(file)).boxed())
    +       Ok(futures::stream::iter(config.open(file, true)).boxed())

Contributor

alamb Jul 18, 2022

Is the first-chunk flag a bug fix?

Contributor Author

tustvold Jul 18, 2022

Yup 😄 Tests FTW

Contributor

alamb Jul 18, 2022

🥳 🦜

datafusion/core/src/physical_plan/file_format/json.rs

    
    ctx.runtime_env().register_object_store(
        "file",
        "",
        Arc::new(ChunkedStore::new(

Contributor

alamb Jul 18, 2022

very nice 👌


          Review feedback

4d6a6bd

tustvold merged commit 944ef3d into apache:master

ursabot commented Jul 18, 2022

Benchmark runs are scheduled for baseline = b772c6d and contender = 944ef3d. 944ef3d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

tustvold mentioned this pull request

CSV inference reads in the whole file to memory, regardless of row limit #3658

Closed


Labels

core