
Extract method to drive PageIterator -> RecordReader #1031

Merged · 3 commits · Dec 14, 2021

Conversation

tustvold
Contributor

Which issue does this PR close?

Relates to #171

Rationale for this change

The logic for driving a RecordReader from a PageIterator is currently duplicated in PrimitiveArrayReader and NullArrayReader. This duplication will only increase with new ArrayReader implementations added as part of #171

What changes are included in this PR?

Extracts a read_records function to handle this logic

Are there any user-facing changes?

This no longer eagerly populates the RecordReader with the first PageReader from the PageIterator. If, for example, the first page of the first row group contained a compression codec that was not supported, this would currently error in the constructor; with this change it will instead error in ArrayReader::next_batch. I personally think this makes more sense. FWIW I think this is the only possible error case, so it is unlikely user code would be impacted.
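To illustrate the shape of the extracted driver loop, here is a minimal, self-contained sketch. The `PageReader` and `RecordReader` types below are simplified stand-ins invented for this example, not the parquet crate's actual API; the point is the loop structure: keep reading until the batch target is met, and only pull the next page reader from the iterator lazily, when the current column chunk is exhausted.

```rust
/// Stand-in for a column chunk's page reader (hypothetical, for illustration).
struct PageReader {
    records: usize, // records available in this chunk
}

/// Stand-in for RecordReader: reads records from the currently
/// attached PageReader (hypothetical, for illustration).
#[derive(Default)]
struct RecordReader {
    available: usize,
}

impl RecordReader {
    fn set_page_reader(&mut self, pages: PageReader) {
        self.available = pages.records;
    }

    /// Reads up to `n` records, returning how many were actually read.
    fn read_records(&mut self, n: usize) -> usize {
        let read = n.min(self.available);
        self.available -= read;
        read
    }
}

/// The extracted driver loop: pull records from the RecordReader,
/// attaching the next PageReader from the iterator only when the
/// current one runs dry, until `batch_size` records are read or
/// the iterator is exhausted.
fn read_records(
    record_reader: &mut RecordReader,
    pages: &mut impl Iterator<Item = PageReader>,
    batch_size: usize,
) -> usize {
    let mut records_read = 0;
    while records_read < batch_size {
        let records_to_read = batch_size - records_read;
        let records_read_once = record_reader.read_records(records_to_read);
        records_read += records_read_once;

        // Fell short of the target: the current column chunk is
        // exhausted, so fetch the next page reader lazily. This is
        // where an unsupported codec would now surface as an error
        // (omitted here), rather than in the constructor.
        if records_read_once < records_to_read {
            match pages.next() {
                Some(page_reader) => record_reader.set_page_reader(page_reader),
                None => break, // no more column chunks
            }
        }
    }
    records_read
}

fn main() {
    let mut reader = RecordReader::default();
    // Two column chunks of 3 and 5 records; nothing is attached until
    // the loop asks for it (the lazy behaviour this PR introduces).
    let mut pages =
        vec![PageReader { records: 3 }, PageReader { records: 5 }].into_iter();
    assert_eq!(read_records(&mut reader, &mut pages, 6), 6); // spans both chunks
    assert_eq!(read_records(&mut reader, &mut pages, 6), 2); // only 2 remain
    println!("ok");
}
```

Because nothing touches the PageIterator until the first shortfall, any per-chunk failure naturally moves from construction time to read time, which is the user-facing change described above.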

@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 11, 2021
@codecov-commenter

Codecov Report

Merging #1031 (75920d0) into master (239cba1) will decrease coverage by 0.00%.
The diff coverage is 93.75%.


@@            Coverage Diff             @@
##           master    #1031      +/-   ##
==========================================
- Coverage   82.31%   82.30%   -0.01%     
==========================================
  Files         168      168              
  Lines       49031    49022       -9     
==========================================
- Hits        40360    40350      -10     
- Misses       8671     8672       +1     
Impacted Files                       Coverage Δ
parquet/src/arrow/array_reader.rs    76.66% <93.75%> (-0.10%) ⬇️
arrow/src/datatypes/datatype.rs      65.95% <0.00%> (-0.43%) ⬇️
arrow/src/datatypes/field.rs         53.37% <0.00%> (-0.31%) ⬇️
parquet_derive/src/parquet_field.rs  65.98% <0.00%> (-0.23%) ⬇️
arrow/src/array/transform/mod.rs     85.24% <0.00%> (+0.13%) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@alamb alamb left a comment


If for example the first page of the first row group contained a compression codec that was not supported, this would currently error in the constructor, whereas with this change it will error in ArrayReader::next_batch. I personally think this makes more sense. FWIW I think this is the only possible error case and so it is unlikely user-code would be impacted...

I agree this seems like a reasonable change

This change looks good to me.

cc @sunchao

Review thread on parquet/src/arrow/array_reader.rs (outdated, resolved)
tustvold and others added 2 commits December 14, 2021 16:14
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Member

@sunchao sunchao left a comment


LGTM (pending CI)

if records_read_once < records_to_read {
if let Some(page_reader) = pages.next() {
// Read from new page reader (i.e. column chunk)
record_reader.set_page_reader(page_reader?)?;
Member

Unrelated, but I'm wondering whether we can reset the record_reader upon a new column chunk, so that we don't have to keep accumulating buffers for def & rep levels and values.

Contributor Author

If we just called reset here, we would lose data. But we certainly could delimit record batches, i.e. not buffer data across column chunk boundaries. This would be a breaking change, though.

Member

Got it, thanks!

@sunchao sunchao merged commit ab48e69 into apache:master Dec 14, 2021
@sunchao
Member

sunchao commented Dec 14, 2021

Merged, thanks!

alamb added a commit that referenced this pull request Dec 17, 2021
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
alamb added a commit that referenced this pull request Dec 20, 2021
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Labels: parquet (Changes to the parquet crate)

4 participants