Separate ArrayReader::next_batch with read_records and consume_batch #2237
Conversation
```diff
@@ -80,8 +80,15 @@ where
     /// Reads at most `batch_size` records into array.
     fn next_batch(&mut self, batch_size: usize) -> Result<ArrayRef> {
-        read_records(&mut self.record_reader, self.pages.as_mut(), batch_size)?;
+        let size = self.read_records(batch_size)?;
```
IMO, I think we can keep `next_batch`; it can just call `read_records` and `consume_batch` in order (see the sketch below).
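A minimal sketch of that suggestion, assuming the signatures this PR introduces (`read_records(&mut self, batch_size: usize) -> Result<usize>` and `consume_batch(&mut self) -> Result<ArrayRef>`), not the PR's exact code:

```rust
// Inside an ArrayReader implementation: next_batch kept as a thin
// wrapper that calls the two new methods in order.
fn next_batch(&mut self, batch_size: usize) -> Result<ArrayRef> {
    // Buffer up to `batch_size` records...
    self.read_records(batch_size)?;
    // ...then materialize everything buffered so far into one array.
    self.consume_batch()
}
```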
```diff
@@ -349,30 +398,32 @@ mod tests {
     let mut accu_len: usize = 0;

-    let array = array_reader.next_batch(values_per_page / 2).unwrap();
-    assert_eq!(array.len(), values_per_page / 2);
+    let len = array_reader.read_records(values_per_page / 2).unwrap();
```
Now, after `consume_batch`, we clean up the buffers that have been used. So we should now call `get_def_levels` after `read_records` and before `consume_batch` (see the ordering sketch below).
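To make that ordering concrete, a hypothetical test fragment (not the PR's exact test code; `array_reader` and `values_per_page` as in the surrounding test):

```rust
// Read half a page worth of records into the reader's buffer.
let read = array_reader.read_records(values_per_page / 2).unwrap();

// Inspect the definition levels while they are still buffered...
let def_levels = array_reader.get_def_levels().map(|l| l.to_vec());

// ...because consume_batch cleans up the used buffers afterwards.
let array = array_reader.consume_batch().unwrap();
assert_eq!(array.len(), read);
```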
Again, I think removing `batch_size` from `consume_batch` allows preserving the existing behaviour.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #2237      +/-  ##
==========================================
- Coverage   82.53%   82.31%   -0.22%
==========================================
  Files         239      241        +2
  Lines       62304    62500      +196
==========================================
+ Hits        51422    51447       +25
- Misses      10882    11053      +171
```
@tustvold PTAL 😊
Looking good, mostly minor nits, but I would remove `batch_size` from `consume_batch`. This is for two major reasons:
- It adds a fair amount of complexity
- We want to avoid `split_off` as much as possible, as it is slow and has non-trivial implications for dictionary preservation (see the sketch after this list) - https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/record_reader/mod.rs#L151
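To illustrate the `split_off` concern, a standalone sketch (not the `record_reader` code itself): consuming only a prefix of a buffer forces a copy of the tail, while consuming everything is a cheap swap.

```rust
/// Consuming a prefix: split_off copies the tail into a new allocation
/// (and, in record_reader, also complicates keeping dictionaries intact).
fn consume_prefix(buffer: &mut Vec<u8>, batch_size: usize) -> Vec<u8> {
    let tail = buffer.split_off(batch_size); // copies buffer[batch_size..]
    std::mem::replace(buffer, tail)          // hand back the prefix
}

/// Consuming everything: no copy, the buffer is simply taken and reset.
fn consume_all(buffer: &mut Vec<u8>) -> Vec<u8> {
    std::mem::take(buffer)
}
```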
I think there is still an issue with ComplexObjectArrayReader. I think it needs to keep the levels data around for longer to avoid breaking the parent ArrayReader. Otherwise looking very nice 😃
Edit: I think ComplexObjectArrayReader might just be broken. I'll experiment when I get to a computer, and potentially accelerate plans to just remove it.
```diff
@@ -160,6 +181,10 @@ where
         array = arrow::compute::cast(&array, &self.data_type)?;
     }

+    self.data_buffer = vec![];
+    self.def_levels_buffer = None;
+    self.rep_levels_buffer = None;
```
I think this will break, as the parent ArrayReader assumes the definition levels live until the next call to `consume_batch`?
I'm not sure we actually have test coverage of, say, a nullable StructArray containing a DecimalArray 🤔
You mean a StructArray that contains one child using the `complex_object_array` reader? 🤔 Why does the parent need the child's levels?
If the StructArray reader is itself nullable, it needs to read the definition levels read by its child in order to work out where its NULLs are located. It's a similar story for ListArrayReader. (A minimal sketch of this dependency follows below.)
I'll try to rustle up some tests this evening, so that we can be confident this PR won't break anything. Longer term I want to remove ComplexObjectArrayReader as it is slow, complicated and largely replaced by the newer generics.
Edit: ran out of time today, will look into first thing tomorrow
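To make the dependency concrete, a minimal hypothetical sketch (not the actual StructArrayReader code): a nullable struct slot is valid only where the child's definition level reaches the struct's own definition level.

```rust
/// Derive a parent struct's validity mask from its child's definition
/// levels. `struct_def_level` is the level at which the struct itself
/// is defined (non-null).
fn struct_null_mask(child_def_levels: &[i16], struct_def_level: i16) -> Vec<bool> {
    child_def_levels
        .iter()
        .map(|&def| def >= struct_def_level) // true = struct slot is valid
        .collect()
}
```

If the child clears its level buffers on `consume_batch`, the parent has nothing left to derive this mask from.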
As I suspected, the test added in #2254 is now failing. I think it should be simple enough to fix, but let me know if you get stuck 😄
@tustvold Sorry for the late reply, fixed in the "fix new ut" commit.
I took the liberty of merging in master to get #2254
```diff
@@ -1161,7 +1161,7 @@ mod tests {
         Some(props),
     )
     .expect("Unable to write file");
-    writer.write(&expected_batch).unwrap();
+    writer.write(expected_batch).unwrap();
```
Not related, but I got a clippy error locally.
Some minor nits, looking good
```diff
@@ -206,11 +213,19 @@ where
     }

     fn get_def_levels(&self) -> Option<&[i16]> {
-        self.def_levels_buffer.as_deref()
+        if self.before_consume {
```
I think this can just return `self.def_levels_buffer.as_deref()`; if you look at PrimitiveArrayReader, it will only make the data available after the call to `consume_batch`.
Without this, the following tests fail:

failures:
arrow::array_reader::complex_object_array::tests::test_complex_array_reader_def_and_rep_levels
arrow::array_reader::complex_object_array::tests::test_complex_array_reader_dict_enc_string

This is caused by callers fetching the levels before consuming `complex_object_array` (the common situation, like the other readers), but `complex_object_array` sometimes (when it is itself nullable) needs the levels after consuming, so I think we should keep this check (sketch below).
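A sketch of the check being defended, completing the truncated diff above; `before_consume` comes from the diff, while `in_progress_def_levels` is a hypothetical name standing in for the PR's actual pre-consume buffer field:

```rust
fn get_def_levels(&self) -> Option<&[i16]> {
    if self.before_consume {
        // Levels buffered by read_records but not yet consumed: what the
        // existing complex_object_array tests read.
        self.in_progress_def_levels.as_deref() // hypothetical field name
    } else {
        // Levels captured at the last consume_batch: what a nullable
        // parent reader needs after consuming its child.
        self.def_levels_buffer.as_deref()
    }
}
```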
Benchmark runs are scheduled for baseline = 6b2c757 and contender = 1f9973c. 1f9973c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #2236. Related: #2197.
Rationale for this change
Add separate `read_records` and `consume_batch` methods to `ArrayReader`, so we can call `read_records` multiple times into the buffer and consume once (avoiding small batches when skipping reads). A usage sketch follows at the end of this description.
What changes are included in this PR?
Are there any user-facing changes?
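A usage sketch of the split API under the rationale above (hypothetical helper; the `ArrayReader` trait and its visibility may vary by version):

```rust
fn read_in_steps(
    reader: &mut dyn ArrayReader,
    target_rows: usize,
    step: usize,
) -> Result<ArrayRef> {
    let mut total = 0;
    while total < target_rows {
        // Buffer up to `step` records per call; repeated calls accumulate
        // into the same in-progress batch.
        let read = reader.read_records(step)?;
        if read == 0 {
            break; // column chunk exhausted
        }
        total += read;
    }
    // Materialize everything buffered above as one array, instead of one
    // small array per read (e.g. around skipped records).
    reader.consume_batch()
}
```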