Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Repeated fields in Record APIs #2394

Open
zeevm opened this issue Aug 10, 2022 · 13 comments
Open

Support Repeated fields in Record APIs #2394

zeevm opened this issue Aug 10, 2022 · 13 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog help wanted parquet Changes to the parquet crate

Comments

@zeevm
Copy link
Contributor

zeevm commented Aug 10, 2022

A Parquet field with "Repeated" repetition and no "LIST" annotation are read as primitives instead of as list.

To reproduce: create a file with a top level field schema like:

REPEATED BYTE_ARRAY vals (UTF8);

and write lists of strings (i.e. with repetition levels of '0' and '1')

this should be read as a List of strings as specified in https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types

This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field.

Instead it is read as a field of single string values, where string comprising a logical list are instead read as distinct rows.

It is read correctly by pyarrow

@tustvold
Copy link
Contributor

tustvold commented Aug 11, 2022

I'm struggling to reproduce this (see #2422), could you perhaps provide a code-sample of how you are reading this file?

Are you perhaps using the lower level https://docs.rs/parquet/latest/parquet/column/reader/struct.GenericColumnReader.html interface, as if so you will need to perform record de-shredding yourself based on the returned repetition and definition level data.

@zeevm
Copy link
Contributor Author

zeevm commented Aug 12, 2022

I'm using the Row reader api that should be able to read this as a list, but of course it can't because the design of the row reader is based on building a hierarchy of readers based on the hierarchical structure of a "Group" Parquet type in the file schema.

But this case involves a "list" that isn't defined using a "Group" type, it is defined by a primitive type that has a "Repetition" of "Repeated" with appropriate repetition levels, this type of list encoding is valid per the Parquet format but not implemented by parquet-rs.

@tustvold
Copy link
Contributor

tustvold commented Aug 12, 2022

I'm afraid I'm not very familiar with this API, it has effectively been orphaned since 2018, but I wasn't under the impression that it supported lists at all. Is this perhaps a feature request to add support for this?

@zeevm
Copy link
Contributor Author

zeevm commented Aug 12, 2022

Is get_row_iter() deprecated? since when?

@tustvold
Copy link
Contributor

Sorry my comment was unnecessarily inflammatory, it isn't deprecated because there isn't a drop-in replacement for it. However, I think it is fair to say it is not getting the same degree of care and attention as other parts of the codebase, in particular the arrow interface and the low-level APIs that builds upon. I would be extremely happy if you or someone else wanted to help maintain this part of the codebase, but I'm just trying to set expectations that at the moment it is effectively orphaned.

@zeevm
Copy link
Contributor Author

zeevm commented Aug 12, 2022

I see, so which of the reader APIs are considered active? AFAIK the other reader APIs are:

  1. column reader
  2. page reader
    are there any other?

@tustvold
Copy link
Contributor

The following is potentially somewhat subjective, so take with a grain of salt, but is I think fair

column reader

The low-level column API is still actively developed, in so much as the arrow internals make use of it. However, it is worth noting that they decode to their own buffer implementations instead of using [DataType::T], as especially for byte arrays this is prohibitively expensive. This extension mechanism is not currently exposed outside the crate, as it is relatively unstable. If you use this interface you will need to perform record reassembly yourself

page reader

I presume you're referring to the file APIs here. If so these are still actively developed, as they are used by the arrow API without any major caveats when operating on local files.

are there any other

The only high-level interface that I would describe as actively maintained is arrow, and is where most development effort is currently focused, with significant effort expended to make it fast, feature complete, and add advanced functionality such as predicate pushdown, async IO, etc... Whilst arrow may be a somewhat heavy dependency, there are ongoing improvements in this space, and I believe the additional performance, especially for dictionary encoded or variable length types, more than makes up for this.

Perhaps we could add more feature flags to arrow-rs to reduce the size of it as a dependency, would it then work for your use-case?

@zeevm
Copy link
Contributor Author

zeevm commented Aug 12, 2022

I'd think the Row level interface is central to the implementation, without it, it feels like this isn't really a proper parquet implementation library, rather a helper library mainly built to serve Arrow.

Column reader and page reader (directly, not through Arrow) are also important.

Arrow is well and fine, but Parquet is consumed by other in-memory columnar representations and other query engines as well.

We completely disable the arrow feature when using the parquet crate.

if parquet-rs design goals are specifically to serve Arrow, this should be clearly stated by the core team so folks taking dependency on it know what they're buying into.

I'd think it would better serve the community to break parquet off of arrow-rs into a stand-alone project, arrow-rs can take a dependency on it.

Thanks.

@tustvold
Copy link
Contributor

tustvold commented Aug 12, 2022

I'd think it would better serve the community to break parquet off of arrow-rs into a stand-alone project

There are no plans to remove the row-level APIs, in fact significant additional effort has been expended to preserve them, and we would welcome any contributions from the community to continue to improve them. 🙂

@tustvold tustvold changed the title non-annotated Repeated fields are read incorrectly Support Repeated fields in Record APIs Aug 12, 2022
@tustvold tustvold added help wanted enhancement Any new improvement worthy of a entry in the changelog and removed bug labels Aug 12, 2022
@alamb
Copy link
Contributor

alamb commented Aug 12, 2022

Thanks for raising this issue @zeevm -- While it is true that the current most active contributors to the parquet crate seem to be focused on the arrow usecase, I don't think it is accurate to say that the design goals are to serve arrow per se

I think having someone such as yourself help design and contribute APIs (or docs or examples) that make it easier and clearer how to use parquet-rs with other columnar formats would be a great addition to the community.

The reason for the parquet and arrow crates are currently in the same repository and on the same release schedule is to conserve our limited volunteer maintenance budget -- if you are interested and willing to help run a separate release process for parquet-rs I think that would also be widely appreciated as well.

@zeevm
Copy link
Contributor Author

zeevm commented Aug 12, 2022

Another challenge with maintaining both on the same release schedule is that it isn't always clear when a major version bump -that technically should be a breaking change - is really a breaking change.

e.g. when major version is increased because of breaking change in arrow only (and not in parquet), and parquet users are looking through code and docs to figure out what was the breaking change and how they should adjust their code to account for it.

@alamb
Copy link
Contributor

alamb commented Aug 12, 2022

Another challenge with maintaining both on the same release schedule is that it isn't always clear when a major version bump -that technically should be a breaking change - is really a breaking change.

I agree. This is also an area that could use additional improvement and we currently are overly conservative with the major version increases

@alamb
Copy link
Contributor

alamb commented Sep 9, 2022

FWIW @iajoiner also brought up the release recently on the mailing list schedule https://lists.apache.org/thread/v26dxfn4wx8q9slkb3f8pkmz0cggm1c3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog help wanted parquet Changes to the parquet crate
Projects
None yet
Development

No branches or pull requests

3 participants