
Extract method to drive PageIterator -> RecordReader #1031

Merged · 3 commits · Dec 14, 2021

Conversation

tustvold
Contributor

Which issue does this PR close?

Relates to #171

Rationale for this change

The logic for driving a RecordReader from a PageIterator is currently duplicated in PrimitiveArrayReader and NullArrayReader. This duplication will only increase with new ArrayReader implementations added as part of #171

What changes are included in this PR?

Extracts a read_records function to handle this logic

Are there any user-facing changes?

This no longer eagerly populates the RecordReader with the first PageReader from the PageIterator. If, for example, the first page of the first row group contained a compression codec that was not supported, this would currently error in the constructor; with this change it will instead error in ArrayReader::next_batch. I personally think this makes more sense. FWIW I think this is the only possible error case, so it is unlikely user code would be impacted.
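To illustrate the shape of the extracted driver loop, here is a minimal, self-contained sketch. The `PageReader` and `RecordReader` types below are simplified stand-ins invented for this example, not the parquet crate's actual API; the point is the loop structure: keep reading until the batch target is met, and only pull the next page reader from the iterator lazily, when the current column chunk is exhausted.

```rust
/// Stand-in for a column chunk's page reader (hypothetical, for illustration).
struct PageReader {
    records: usize, // records available in this chunk
}

/// Stand-in for RecordReader: reads records from the currently
/// attached PageReader (hypothetical, for illustration).
#[derive(Default)]
struct RecordReader {
    available: usize,
}

impl RecordReader {
    fn set_page_reader(&mut self, pages: PageReader) {
        self.available = pages.records;
    }

    /// Reads up to `n` records, returning how many were actually read.
    fn read_records(&mut self, n: usize) -> usize {
        let read = n.min(self.available);
        self.available -= read;
        read
    }
}

/// The extracted driver loop: pull records from the RecordReader,
/// attaching the next PageReader from the iterator only when the
/// current one runs dry, until `batch_size` records are read or
/// the iterator is exhausted.
fn read_records(
    record_reader: &mut RecordReader,
    pages: &mut impl Iterator<Item = PageReader>,
    batch_size: usize,
) -> usize {
    let mut records_read = 0;
    while records_read < batch_size {
        let records_to_read = batch_size - records_read;
        let records_read_once = record_reader.read_records(records_to_read);
        records_read += records_read_once;

        // Fell short of the target: the current column chunk is
        // exhausted, so fetch the next page reader lazily. This is
        // where an unsupported codec would now surface as an error
        // (omitted here), rather than in the constructor.
        if records_read_once < records_to_read {
            match pages.next() {
                Some(page_reader) => record_reader.set_page_reader(page_reader),
                None => break, // no more column chunks
            }
        }
    }
    records_read
}

fn main() {
    let mut reader = RecordReader::default();
    // Two column chunks of 3 and 5 records; nothing is attached until
    // the loop asks for it (the lazy behaviour this PR introduces).
    let mut pages =
        vec![PageReader { records: 3 }, PageReader { records: 5 }].into_iter();
    assert_eq!(read_records(&mut reader, &mut pages, 6), 6); // spans both chunks
    assert_eq!(read_records(&mut reader, &mut pages, 6), 2); // only 2 remain
    println!("ok");
}
```

Because nothing touches the PageIterator until the first shortfall, any per-chunk failure naturally moves from construction time to read time, which is the user-facing change described above.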

@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 11, 2021
@codecov-commenter

Codecov Report

Merging #1031 (75920d0) into master (239cba1) will decrease coverage by 0.00%.
The diff coverage is 93.75%.


@@            Coverage Diff             @@
##           master    #1031      +/-   ##
==========================================
- Coverage   82.31%   82.30%   -0.01%     
==========================================
  Files         168      168              
  Lines       49031    49022       -9     
==========================================
- Hits        40360    40350      -10     
- Misses       8671     8672       +1     
Impacted Files                       Coverage Δ
parquet/src/arrow/array_reader.rs    76.66% <93.75%> (-0.10%) ⬇️
arrow/src/datatypes/datatype.rs      65.95% <0.00%> (-0.43%) ⬇️
arrow/src/datatypes/field.rs         53.37% <0.00%> (-0.31%) ⬇️
parquet_derive/src/parquet_field.rs  65.98% <0.00%> (-0.23%) ⬇️
arrow/src/array/transform/mod.rs     85.24% <0.00%> (+0.13%) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@alamb alamb left a comment


If for example the first page of the first row group contained a compression codec that was not supported, this would currently error in the constructor, whereas with this change it will error in ArrayReader::next_batch. I personally think this makes more sense. FWIW I think this is the only possible error case and so it is unlikely user-code would be impacted...

I agree this seems like a reasonable change

This change looks good to me.

cc @sunchao

Review thread on parquet/src/arrow/array_reader.rs (outdated, resolved)
tustvold and others added 2 commits December 14, 2021 16:14
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Member

@sunchao sunchao left a comment


LGTM (pending CI)

if records_read_once < records_to_read {
if let Some(page_reader) = pages.next() {
// Read from new page reader (i.e. column chunk)
record_reader.set_page_reader(page_reader?)?;
Member

Unrelated, but I'm wondering whether we can reset the record_reader upon a new column chunk, so that we don't have to keep accumulating buffers for def & rep levels and values.

Contributor Author

If we just called reset here, we would lose data. But we certainly could delimit record batches, i.e. not buffer data across column chunk boundaries. This would be a breaking change, though.

Member

Got it, thanks!

@sunchao sunchao merged commit ab48e69 into apache:master Dec 14, 2021
@sunchao
Member

sunchao commented Dec 14, 2021

Merged, thanks!

alamb added a commit that referenced this pull request Dec 17, 2021
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
alamb added a commit that referenced this pull request Dec 20, 2021
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Labels: parquet (Changes to the parquet crate)

4 participants