[C++][Python] Switch default Parquet version to 2.4 #28022

Closed
asfimport opened this issue Apr 5, 2021 · 35 comments

asfimport commented Apr 5, 2021

Currently, Parquet write APIs default to maximum-compatibility Parquet version "1.0", which disables some logical types such as UINT32. We may want to switch the default to "2.0" instead, to allow faithful representation of more types.

Reporter: Antoine Pitrou / @pitrou
Assignee: Raúl Cumplido / @raulcd


Note: This issue was originally created as ARROW-12203. Please see the migration documentation for further details.


Antoine Pitrou / @pitrou:
@wesm @emkornfield Thoughts?


Antoine Pitrou / @pitrou:
I'll note that ParquetVersion::PARQUET_2_0 guards several unrelated things:

  • TIME_MILLIS, TIMESTAMP_MILLIS and UINT32, which were added in Parquet format 2.1 (PARQUET-12, July 2014)

  • TIME_MICROS, TIMESTAMP_MICROS, which were added in Parquet format 2.3 (PARQUET-200, June 2015)

  • the NANOS unit for times and timestamps, which was added in Parquet format 2.5 (PARQUET-1387, August 2018)

  • RLE_DICTIONARY, which was added in Parquet format 1.0 (commit eb2f34ca775476ec9955aa88a8ac5c0583114f72, no associated JIRA, November 2013)

  • the value written in FileMetaData.version (1 or 2), which isn't described anywhere in the format spec (presumably version == 2 starting from Parquet format 2.0.0?)

So it's a mess. Some of those changes are very old, though. It seems we could enable all of them by default, except NANOS?

One possibility would be to enable them for "1.0" and keep NANOS in "2.0".
Another possibility would be to add a new "1.9" pseudo-version, enable them in "1.9", and make "1.9" the default.
Or we bite the bullet and make "2.0" the default, including all of the above.
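Purely for illustration (the labels and feature names below are hypothetical, mirroring the comment rather than any actual Arrow API), the three options differ only in which feature set each version label selects:

```python
# Hypothetical feature-gating table for the three options above.
ALL = {"millis+uint32 (fmt 2.1)", "micros (fmt 2.3)", "nanos (fmt 2.5)"}
NO_NANOS = ALL - {"nanos (fmt 2.5)"}

OPTIONS = {
    # Option 1: enable everything except NANOS already in "1.0".
    "enable-in-1.0": {"1.0": NO_NANOS, "2.0": ALL},
    # Option 2: add a "1.9" pseudo-version and make it the default.
    "pseudo-1.9": {"1.0": set(), "1.9": NO_NANOS, "2.0": ALL},
    # Option 3: bite the bullet and default to "2.0" with everything.
    "default-2.0": {"1.0": set(), "2.0": ALL},
}

def enabled(option: str, version: str) -> set:
    """Features a writer would be allowed to emit under a given option."""
    return OPTIONS[option][version]
```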


Joris Van den Bossche / @jorisvandenbossche:
@pitrou see also the mailing list discussion from December (with title "Should we default to write parquet format version 2.0? (not data page version 2.0)", https://mail-archives.apache.org/mod_mbox/arrow-dev/202012.mbox/%3CCALQtMBYqPPkE6RQiNDxXz7yOtnbqtQGH%2Bk%2B20ryomGtLE9EfVA%40mail.gmail.com%3E)

See also this overview of converted/logical types added in which versions: https://nbviewer.jupyter.org/gist/jorisvandenbossche/3cc9942eaffb53564df65395e5656702 (for types, not for encodings)

My conclusion in that email thread was also that NANOS might be problematic to enable by default already (I don't know what the status of this feature is in other implementations).

Another option could also be to have a version="2.4" which would e.g. enable the logical types for integers but not yet for nanoseconds (then it maps more or less to an actual parquet format version, instead of the pseudo "1.9").

the value written in FileMetaData.version (1 or 2), which isn't described anywhere in the format spec (presumably version == 2 starting from Parquet format 2.0.0?)

There is indeed no spec about this; there was some discussion about it on the "core features" PR: apache/parquet-format#164 (comment)


Antoine Pitrou / @pitrou:
Currently "2.0" will emit NANOS, and we don't want to change that. That's why I was proposing a "1.9" in-between.


Micah Kornfield / @emkornfield:
I think switches like this should follow a policy similar to API deprecation, i.e. something like: we put in the release notes that we intend to switch the default version (maybe for more than one release) and then make the change in some subsequent release. It might be worth an ML discussion on a policy around this.

Thank you for the analysis. Based upon it, I would suggest that instead of 1.9 we try to make this value correspond with format releases (introduce a 2.3 and a 2.5), if we don't think it will make the code too horrendous.


Antoine Pitrou / @pitrou:

Based upon it, I would suggest that instead of 1.9 we try to make this value correspond with format releases (introduce a 2.3 and a 2.5), if we don't think it will make the code too horrendous.

As I said above, we want "2.0" to still enable all features. But the additional version must be lower than "2.0", because it will enable only some of the features.


Joris Van den Bossche / @jorisvandenbossche:
The version="2.0" could be an alias for the "latest 2.x" version with all features. That's of course also confusing (since 2.0 is also an explicit version, and one which is lower than 2.4 or 2.5), but IMO less confusing than having a "1.9" which is 1) a non-existing version and 2) enabling 2.x features.


Antoine Pitrou / @pitrou:

The version="2.0" could be an alias of "latest 2.x" version with all features

That's already the case. What change are you suggesting?


Joris Van den Bossche / @jorisvandenbossche:
I was trying to argue that IMO we don't need to use a pseudo-version like "1.9", but can rather use the actual version like "2.4". Yes, "2.0" is lower than "2.4", but if we clearly say that "2.0" means the "latest 2.x", then I think that is fine.

But anyway, that's only a naming discussion, and both ways to name the version have pros and cons. The main discussion point is whether we need such an additional version number to have more fine-grained control over which features are used. If that makes it easier to make "1.9"/"2.4" the default, then I think that's a good idea.


Antoine Pitrou / @pitrou:
Having "2.0" mean something later than "2.4" sounds absolutely bonkers to me.


Antoine Pitrou / @pitrou:
I am fine with enabling everything current in "2.0", but then we shouldn't introduce a higher version number with less features enabled.

Also, I am not convinced we need more fine-grained feature selection. That's more control than most people want to have. My primary concern here is that people don't get a completely outdated feature set (no UINT32!) by default.


Micah Kornfield / @emkornfield:
Unless there is an agreement to bump the next format release to 3.0, I think this might be inevitable? I'm not actually sure what the cross-version guarantees for the Parquet format are at this point.


Antoine Pitrou / @pitrou:
@emkornfield I'm not sure what you say is inevitable?


Micah Kornfield / @emkornfield:

I am fine with enabling everything current in "2.0", but then we shouldn't introduce a higher version number with less features enabled.

Sorry, I misread this. I thought we wouldn't be able to introduce additional guarding flags for new features.


Antoine Pitrou / @pitrou:
It seems fastparquet doesn't support the RLE_DICTIONARY encoding (even though it's part of the Parquet 1.0 format version). Do we care enough about being compatible with fastparquet by default?


Antoine Pitrou / @pitrou:
Opened dask/fastparquet#583


Joris Van den Bossche / @jorisvandenbossche:
Regarding RLE_DICTIONARY: if enabled (so currently by specifying version="2.0"), is it actually used for many types? (I am not very familiar with the logic that decides which encoding is used while writing; I ask to get an idea of the impact of enabling it on compatibility with other readers like fastparquet.)

On another note, this is still tagged as 4.0, but it might not be the best feature to switch just before the release. It might be safer to switch directly after the 4.0 release, so we have some time to gather feedback (although that depends on how many people use the dev version, of course).


Antoine Pitrou / @pitrou:
RLE_DICTIONARY is used for dictionary indices of dictionary-encoded columns. That probably comes up in a lot of situations :-)


Antoine Pitrou / @pitrou:
The kind of misunderstanding that will keep happening until we finally bump the Parquet version: ARROW-13214 ;)


Jorge Leitão / @jorgecarleitao:
I am of the opinion that it is time to move on; version 2.0 by default (NANOS out for now).

For the data pages, I do not think there are that many differences between v1 and v2, right? It is mostly where the compression is applied and where the byte length of the def and rep levels is declared (in the page data or in the header).

So, in that context keeping data pages v1 by default seems ok.


Micah Kornfield / @emkornfield:
I'm OK moving the default logical types to 2.0 and keeping data page v1 for now. Let's do this after the current release and send an e-mail out to dev@ and user@ to notify people way ahead of time that this is a potentially breaking change.


Joris Van den Bossche / @jorisvandenbossche:
If we enable all available logical types with a default of version="2.0", what do we do if the next Parquet format release adds a new logical type? (e.g. if 2.10.0 adds a new Interval logical type) Maybe something we can defer until it happens, but unless we enable it directly, we would then still need to add a version="2.10" option or so to enable it on demand.

But +1 on Micah's proposal to notify the mailing list (and put it in the release notes for 5.0.0 maybe as well) that we plan to switch in the next release.


Antoine Pitrou / @pitrou:
I've updated the title now that ARROW-13794 has introduced PARQUET_2_4 and PARQUET_2_6 to better select enabled features. We may be able to switch to "2.4" by default as that format version was released in October 2017.


Micah Kornfield / @emkornfield:
Possibly. I need to double-check whether this would break things in my organization in the short term. At the very least, I would ask for an extension to 7.0.0 before changing the default.


Antoine Pitrou / @pitrou:
@emkornfield  Did you have the opportunity to inquire?


Micah Kornfield / @emkornfield:
Unfortunately not yet, I think if we could wait until after the 7.0 release that should finally be sufficient.


Joris Van den Bossche / @jorisvandenbossche:
Now the 7.0 release is out, shall we switch the default early in the 8.0 cycle?


Micah Kornfield / @emkornfield:
8.0 release is targeted in the April time frame?


Antoine Pitrou / @pitrou:
Presumably, yes.


Raúl Cumplido / @raulcd:
@jorisvandenbossche @pitrou is this something we still want to do before the release?


Antoine Pitrou / @pitrou:
@emkornfield would have to answer. Hopefully we can do it.


Micah Kornfield / @emkornfield:
CC [~willb_google]

This will still potentially cause issues with imports into BQ if unsigned types are used, I think. The project has generally been pretty patient, so I understand if there is a strong desire to move forward with it. Will can probably give a better timeline on when BQ would be able to handle the logical types.


Krisztian Szucs / @kszucs:
Postponing to 9.0


Antoine Pitrou / @pitrou:
Marking this as critical for 9.0 so that we finally do it.


Antoine Pitrou / @pitrou:
Issue resolved by pull request #13280
