Support Repeated fields in Record APIs #2394

zeevm · 2022-08-10T13:09:39Z

A Parquet field with "Repeated" repetition and no "LIST" annotation is read as a primitive instead of as a list.

To reproduce, create a file whose schema has a top-level field like:

REPEATED BYTE_ARRAY vals (UTF8);

and write lists of strings (i.e. with repetition levels of '0' and '1').

This should be read as a list of strings, as specified in https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types

"This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field."

Instead, it is read as a field of single string values, where the strings comprising a logical list are read as distinct rows.
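To illustrate the expected behaviour, here is a minimal sketch in plain Rust (it does not use the parquet crate) of how repetition levels 0 and 1 delimit lists for such a column. The function name `group_by_repetition` is hypothetical, and the sketch assumes every level carries a value, i.e. there are no empty lists.

```rust
/// Group a flat column of values into lists using repetition levels.
/// Hypothetical helper for illustration; not part of the parquet crate.
///
/// Level 0 starts a new record (row); level 1 continues the current list.
fn group_by_repetition(values: &[&str], rep_levels: &[i16]) -> Vec<Vec<String>> {
    let mut rows: Vec<Vec<String>> = Vec::new();
    for (value, &level) in values.iter().zip(rep_levels) {
        // Start a new row on level 0 (or defensively, if no row exists yet).
        if level == 0 || rows.is_empty() {
            rows.push(Vec::new());
        }
        rows.last_mut()
            .expect("a row always exists at this point")
            .push(value.to_string());
    }
    rows
}
```

For the values a, b, c, d, e with repetition levels 0, 1, 0, 1, 1 this yields two rows, [a, b] and [c, d, e], whereas the behaviour reported here corresponds to five rows of one string each.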

It is read correctly by pyarrow.

tustvold · 2022-08-11T23:18:55Z

I'm struggling to reproduce this (see #2422). Could you perhaps provide a code sample showing how you are reading this file?

Are you perhaps using the lower-level https://docs.rs/parquet/latest/parquet/column/reader/struct.GenericColumnReader.html interface? If so, you will need to perform record de-shredding yourself based on the returned repetition and definition level data.

zeevm · 2022-08-12T10:39:48Z

I'm using the Row reader API, which should be able to read this as a list. Of course it can't, because the row reader is designed around building a hierarchy of readers from the hierarchical structure of a "Group" Parquet type in the file schema.

But this case involves a "list" that isn't defined using a "Group" type. It is defined by a primitive type that has a "Repetition" of "Repeated" with appropriate repetition levels. This type of list encoding is valid per the Parquet format but is not implemented by parquet-rs.

tustvold · 2022-08-12T10:42:56Z

I'm afraid I'm not very familiar with this API, as it has effectively been orphaned since 2018, but I wasn't under the impression that it supported lists at all. Is this perhaps a feature request to add support for this?

zeevm · 2022-08-12T10:45:06Z

Is get_row_iter() deprecated? Since when?

tustvold · 2022-08-12T10:50:02Z

Sorry, my comment was unnecessarily inflammatory. It isn't deprecated, because there isn't a drop-in replacement for it. However, I think it is fair to say it is not getting the same degree of care and attention as other parts of the codebase, in particular the arrow interface and the low-level APIs that it builds upon. I would be extremely happy if you or someone else wanted to help maintain this part of the codebase, but I'm just trying to set expectations that at the moment it is effectively orphaned.

zeevm · 2022-08-12T10:51:50Z

I see, so which of the reader APIs are considered active? AFAIK the other reader APIs are:

column reader
page reader
are there any other?

tustvold · 2022-08-12T11:48:06Z

The following is potentially somewhat subjective, so take it with a grain of salt, but I think it is fair.

"column reader"

The low-level column API is still actively developed, insofar as the arrow internals make use of it. However, it is worth noting that they decode to their own buffer implementations instead of using DataType::T, because that is prohibitively expensive, especially for byte arrays. This extension mechanism is not currently exposed outside the crate, as it is relatively unstable. If you use this interface, you will need to perform record reassembly yourself.
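As a rough illustration of what that reassembly involves for a top-level repeated primitive column, here is a minimal sketch in plain Rust. It is not parquet crate code: the function `assemble_records` is hypothetical, and it assumes a maximum repetition level of 1, a maximum definition level of 1, and a values buffer that holds only the values actually present (so an empty list contributes a level pair but no value).

```rust
/// Reassemble records for a top-level `repeated` primitive column.
/// Hypothetical helper for illustration; not part of the parquet crate.
///
/// Assumptions: max repetition level 1 and max definition level 1.
/// - rep level 0 starts a new record, rep level 1 continues it;
/// - def level 1 means a value is present, def level 0 marks an empty list.
/// `values` holds only the present values, in order.
fn assemble_records(
    values: &[String],
    def_levels: &[i16],
    rep_levels: &[i16],
) -> Result<Vec<Vec<String>>, String> {
    if def_levels.len() != rep_levels.len() {
        return Err("definition and repetition levels differ in length".to_string());
    }
    let mut records: Vec<Vec<String>> = Vec::new();
    let mut remaining = values.iter();
    for (&def, &rep) in def_levels.iter().zip(rep_levels) {
        if rep == 0 {
            // A repetition level of 0 always begins a new record.
            records.push(Vec::new());
        }
        if def == 1 {
            // Only fully defined levels consume an entry from `values`.
            let value = remaining.next().ok_or("fewer values than defined levels")?;
            let current = records
                .last_mut()
                .ok_or("first repetition level must be 0")?;
            current.push(value.clone());
        }
    }
    if remaining.next().is_some() {
        return Err("more values than defined levels".to_string());
    }
    Ok(records)
}
```

With values a, b, c, definition levels 1, 1, 0, 1 and repetition levels 0, 1, 0, 0, this produces three records: [a, b], an empty list, and [c]. Nested or optional columns have higher maximum levels and need a more general algorithm than this sketch.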

"page reader"

I presume you're referring to the file APIs here. If so these are still actively developed, as they are used by the arrow API without any major caveats when operating on local files.

"are there any other"

The only high-level interface that I would describe as actively maintained is arrow, and is where most development effort is currently focused, with significant effort expended to make it fast, feature complete, and add advanced functionality such as predicate pushdown, async IO, etc... Whilst arrow may be a somewhat heavy dependency, there are ongoing improvements in this space, and I believe the additional performance, especially for dictionary encoded or variable length types, more than makes up for this.

Perhaps we could add more feature flags to arrow-rs to reduce its size as a dependency. Would it then work for your use case?

zeevm · 2022-08-12T11:53:28Z

I'd think the row-level interface is central to the implementation. Without it, this doesn't feel like a proper Parquet implementation library, but rather a helper library mainly built to serve Arrow.

Column reader and page reader (directly, not through Arrow) are also important.

Arrow is well and fine, but Parquet is consumed by other in-memory columnar representations and other query engines as well.

We completely disable the arrow feature when using the parquet crate.

If the design goals of parquet-rs are specifically to serve Arrow, this should be clearly stated by the core team so that folks taking a dependency on it know what they're buying into.

I'd think it would better serve the community to break parquet off of arrow-rs into a stand-alone project, arrow-rs can take a dependency on it.

Thanks.

tustvold · 2022-08-12T11:58:40Z

"I'd think it would better serve the community to break parquet off of arrow-rs into a stand-alone project"

There are no plans to remove the row-level APIs, in fact significant additional effort has been expended to preserve them, and we would welcome any contributions from the community to continue to improve them. 🙂

alamb · 2022-08-12T12:22:33Z

Thanks for raising this issue @zeevm. While it is true that the current most active contributors to the parquet crate seem to be focused on the arrow use case, I don't think it is accurate to say that the design goals are to serve arrow per se.

I think having someone such as yourself help design and contribute APIs (or docs or examples) that make it easier and clearer how to use parquet-rs with other columnar formats would be a great addition to the community.

The reason the parquet and arrow crates are currently in the same repository and on the same release schedule is to conserve our limited volunteer maintenance budget. If you are interested and willing to help run a separate release process for parquet-rs, I think that would be widely appreciated as well.

zeevm · 2022-08-12T12:27:39Z

Another challenge with maintaining both on the same release schedule is that it isn't always clear when a major version bump (which technically should signal a breaking change) is really a breaking change.

For example, when the major version is increased because of a breaking change in arrow only (and not in parquet), parquet users are left looking through code and docs to figure out what the breaking change was and how they should adjust their code to account for it.

alamb · 2022-08-12T16:14:19Z

"Another challenge with maintaining both on the same release schedule is that it isn't always clear when a major version bump (which technically should signal a breaking change) is really a breaking change."

I agree. This is also an area that could use additional improvement, and we are currently overly conservative with major version increases.

alamb · 2022-09-09T14:00:48Z

FWIW @iajoiner also brought up the release schedule recently on the mailing list: https://lists.apache.org/thread/v26dxfn4wx8q9slkb3f8pkmz0cggm1c3

zeevm added the bug label Aug 10, 2022

tustvold added a commit to tustvold/arrow-rs that referenced this issue Aug 11, 2022

Test non-annotated repeated fields (apache#2394)

2747614

tustvold mentioned this issue Aug 11, 2022

Test non-annotated repeated fields (#2394) #2422

Merged

tustvold changed the title ~~non-annotated Repeated fields are read incorrectly~~ Support Repeated fields in Record APIs Aug 12, 2022

tustvold added the help wanted and enhancement labels and removed the bug label Aug 12, 2022

alamb pushed a commit that referenced this issue Aug 13, 2022

Test non-annotated repeated fields (#2394) (#2422)

98eeb01

alamb mentioned this issue Aug 16, 2022

Push ChunkReader into SerializedPageReader (#2463) #2464

Merged

tustvold added the parquet label Sep 23, 2022

tustvold mentioned this issue Feb 22, 2023

SerializedFileReader panicked on next of RowIter #3745

Closed

tustvold mentioned this issue Nov 11, 2023

RowGroupReader.get_row_iter() fails with Path ColumnPath not found #5064

Closed

mmaitre314 mentioned this issue Nov 17, 2023

Expand parquet crate overview doc #5093

Merged
