Generify ColumnReaderImpl and RecordReader #1040

Closed
tustvold opened this issue Dec 13, 2021 · 2 comments · Fixed by #1041
Labels
parquet Changes to the parquet crate

Comments

@tustvold
Contributor

tustvold commented Dec 13, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently RecordReader and ColumnReaderImpl have a hard-coded assumption that they are decoding to a contiguous array of values, or to i16 levels. This complicates implementing #1037 and #171, as well as potential future decode-related optimisations such as decoding directly to StringArray or evaluating predicates directly.
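
To make the constraint concrete, here is a minimal sketch contrasting the two output shapes. The names `decode_contiguous`, `decode_to_offsets` and this `OffsetBuffer` are hypothetical illustrations, not the parquet crate's API; the only assumption borrowed from Arrow is that a string array is stored as one offsets buffer plus one values buffer.

```rust
/// Today's shape: one owned value per slot in a contiguous array.
/// (Hypothetical helper, for illustration only.)
pub fn decode_contiguous(raw: &[&str]) -> Vec<String> {
    raw.iter().map(|s| s.to_string()).collect()
}

/// Hypothetical buffer using the Arrow-style variable-length layout:
/// all bytes are concatenated in `values`, and `offsets[i]..offsets[i + 1]`
/// delimits element `i`.
pub struct OffsetBuffer {
    pub offsets: Vec<i32>,
    pub values: Vec<u8>,
}

impl OffsetBuffer {
    pub fn new() -> Self {
        // The layout always starts with a zero offset.
        Self { offsets: vec![0], values: Vec::new() }
    }

    /// Append one string without allocating a per-element `String`.
    pub fn push(&mut self, s: &str) {
        self.values.extend_from_slice(s.as_bytes());
        let end = i32::try_from(self.values.len()).expect("offset overflows i32");
        self.offsets.push(end);
    }

    /// Number of elements stored.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Borrow element `i` back out of the shared byte buffer.
    pub fn get(&self, i: usize) -> &str {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        std::str::from_utf8(&self.values[start..end]).expect("pushed as valid UTF-8")
    }
}

/// A decoder that could target `OffsetBuffer` directly would skip the
/// intermediate `Vec<String>` entirely. (Hypothetical helper.)
pub fn decode_to_offsets(raw: &[&str]) -> OffsetBuffer {
    let mut buffer = OffsetBuffer::new();
    for s in raw {
        buffer.push(s);
    }
    buffer
}
```

A reader that can only produce the first shape has to be followed by a copy into the second, which is the kind of extra work the hard-coded assumption forces.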

Describe the solution you'd like

Create new GenericColumnReader and GenericRecordReader types, of which RecordReader and ColumnReaderImpl become type aliases. This preserves API compatibility whilst allowing the introduction of new type parameters. As these type parameters need to be able to influence the buffer types, they aren't object-safe and therefore need to be generics rather than trait objects.
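
A rough sketch of how the alias approach could look. The trait and method signatures below are assumptions made for illustration (the names `ColumnValueDecoder` and `ValuesBuffer` are borrowed from the commit list further down, but their real definitions may differ), and `ContiguousDecoder` is entirely hypothetical.

```rust
/// Assumed buffer abstraction: anything decoded values can be appended to.
/// The `Default` supertrait requires `Sized`, so this trait cannot be used
/// as `dyn ValuesBuffer`.
pub trait ValuesBuffer: Default {
    fn len(&self) -> usize;
}

impl<T> ValuesBuffer for Vec<T> {
    fn len(&self) -> usize {
        Vec::len(self)
    }
}

/// Assumed decoder abstraction. The associated `Buffer` type is what lets a
/// decoder choose its own output representation.
pub trait ColumnValueDecoder {
    type Buffer: ValuesBuffer;

    /// Append up to `max` values to `out`, returning how many were written.
    fn read(&mut self, out: &mut Self::Buffer, max: usize) -> usize;
}

/// Hypothetical decoder reproducing today's behaviour: a contiguous `Vec<T>`.
pub struct ContiguousDecoder<T> {
    pending: std::vec::IntoIter<T>,
}

impl<T> ContiguousDecoder<T> {
    pub fn new(values: Vec<T>) -> Self {
        Self { pending: values.into_iter() }
    }
}

impl<T> ColumnValueDecoder for ContiguousDecoder<T> {
    type Buffer = Vec<T>;

    fn read(&mut self, out: &mut Vec<T>, max: usize) -> usize {
        let before = out.len();
        out.extend(self.pending.by_ref().take(max));
        out.len() - before
    }
}

/// The generic reader: all decode and buffer behaviour comes from `D`.
pub struct GenericColumnReader<D: ColumnValueDecoder> {
    decoder: D,
}

impl<D: ColumnValueDecoder> GenericColumnReader<D> {
    pub fn new(decoder: D) -> Self {
        Self { decoder }
    }

    /// Read up to `max` values into a freshly created buffer of the
    /// decoder's choosing.
    pub fn read_batch(&mut self, max: usize) -> D::Buffer {
        let mut out = D::Buffer::default();
        self.decoder.read(&mut out, max);
        out
    }
}

/// The existing name survives as an alias, so callers keep compiling.
pub type ColumnReaderImpl<T> = GenericColumnReader<ContiguousDecoder<T>>;
```

Existing call sites would keep naming `ColumnReaderImpl<T>`, while new code could instantiate `GenericColumnReader` with a decoder whose `Buffer` is something other than `Vec<T>`.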

All decoding and buffering would be provided by the type parameters of these generic types, allowing them to be swapped out. This would leave ColumnReaderImpl responsible for muxing the parquet file, i.e. extracting pages from the PageReader and feeding them to the decoders. RecordReader would be responsible for delimiting semantic records, as it is today.
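
The split of responsibilities might be sketched as follows. Everything here is a simplified assumption: `Page`, `ValueDecoder`, `VecDecoder`, `ColumnReaderSketch` and `split_records` are hypothetical names, pages are modelled as a plain iterator rather than the crate's PageReader, and the only Parquet fact relied on is that a repetition level of 0 marks the start of a new record.

```rust
/// Hypothetical page: the values of one page, already decompressed.
pub struct Page {
    pub values: Vec<i32>,
}

/// Hypothetical decoder interface: it is handed pages and yields values.
pub trait ValueDecoder {
    type Buffer;

    /// Start decoding a new page, replacing the current one.
    fn set_page(&mut self, page: Page);

    /// Append up to `max` values to `out`; returns 0 once the page is exhausted.
    fn read(&mut self, out: &mut Self::Buffer, max: usize) -> usize;
}

/// Decoder that writes into a contiguous `Vec<i32>`.
pub struct VecDecoder {
    current: std::vec::IntoIter<i32>,
}

impl VecDecoder {
    pub fn new() -> Self {
        Self { current: Vec::new().into_iter() }
    }
}

impl ValueDecoder for VecDecoder {
    type Buffer = Vec<i32>;

    fn set_page(&mut self, page: Page) {
        self.current = page.values.into_iter();
    }

    fn read(&mut self, out: &mut Vec<i32>, max: usize) -> usize {
        let before = out.len();
        out.extend(self.current.by_ref().take(max));
        out.len() - before
    }
}

/// The column reader only muxes: it pulls pages and feeds the decoder.
pub struct ColumnReaderSketch<P, D> {
    pages: P,
    decoder: D,
}

impl<P: Iterator<Item = Page>, D: ValueDecoder> ColumnReaderSketch<P, D> {
    pub fn new(pages: P, decoder: D) -> Self {
        Self { pages, decoder }
    }

    /// Read up to `max` values, crossing page boundaries as needed.
    pub fn read_batch(&mut self, max: usize, out: &mut D::Buffer) -> usize {
        let mut total = 0;
        while total < max {
            let n = self.decoder.read(out, max - total);
            if n > 0 {
                total += n;
                continue;
            }
            // Current page is exhausted: move to the next one, or stop.
            match self.pages.next() {
                Some(page) => self.decoder.set_page(page),
                None => break,
            }
        }
        total
    }
}

/// Record delimiting: returns `(records, levels)`, the number of records
/// taken (at most `max_records`) and how many level values they span.
/// Assumes `rep_levels` ends on a record boundary.
pub fn split_records(rep_levels: &[i16], max_records: usize) -> (usize, usize) {
    let mut records = 0;
    for (idx, &level) in rep_levels.iter().enumerate() {
        if level == 0 {
            if records == max_records {
                return (records, idx);
            }
            records += 1;
        }
    }
    (records, rep_levels.len())
}
```

In this shape the reader never inspects the values it moves, so swapping `VecDecoder` for a decoder with a different `Buffer` would not require touching the page-handling loop.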

Describe alternatives you've considered

We could duplicate the logic in ColumnReaderImpl and RecordReader into different reader implementations, but this seems unfortunate.

Additional context

There is likely non-trivial overlap with #384 and #200, which sought to introduce generics at a different level. Unfortunately that approach is still coupled to the notion of contiguous value arrays, and I couldn't see a way to achieve the particular flexibility desired.

@tustvold tustvold added the enhancement Any new improvement worthy of a entry in the changelog label Dec 13, 2021
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 13, 2021
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 13, 2021
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 13, 2021
@yordan-pavlov
Contributor

yordan-pavlov commented Dec 14, 2021

@tustvold could you provide some examples of how the new API would look and how it could be used?

@tustvold
Contributor Author

#1041 contains my initial experiments, and I am implementing #1037 and #171 to "prove" it out

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 21, 2021
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 21, 2021
alamb pushed a commit that referenced this issue Jan 11, 2022
* Simplify record reader

* Generify ColumnReaderImpl and RecordReader (#1040)

* Tweak count_records predicate

* Pre-allocate bitmask

* fix: TypedBuffer::split update len

* Simplify GenericRecordReader

* Move column decoders into module

* Remove `RecordBuffer::create` method

* Remove `TypedBuffer<i16>::count_records`

* Pass null count to `ColumnValueDecoder::read`

* Pull null padding out of column reader

* Review feedback

* Format

* License headers

* Further doc tweaks

* Further docs

* Restrict ScalarBuffer types
tustvold added a commit to tustvold/arrow-rs that referenced this issue Jan 12, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Jan 14, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Jan 14, 2022
alamb pushed a commit that referenced this issue Jan 18, 2022
* Optimized ByteArrayReader (#1040)

UTF-8 Validation (#786)

* Fix arrow_array_reader benchmark

* Allow running subset of arrow_array_reader benchmarks

* Faster UTF-8 validation

* Tweak null handling

* Add license

* Refine `ValuesBuffer::pad_nulls`

* Tweak error handling

* Use page null count if available

* Doc comments

* Test DELTA_BYTE_ARRAY encoding

* Support legacy Encoding::PLAIN_DICTIONARY

* Add OffsetBuffer unit tests

Review feedback

* More tests

* Fix lint

* Review feedback
@alamb alamb added parquet Changes to the parquet crate and removed enhancement Any new improvement worthy of a entry in the changelog labels Jan 20, 2022
3 participants