[C++][Python] Switch default Parquet version to 2.4 #28022

Closed
asfimport opened this issue Apr 5, 2021 · 35 comments

asfimport commented Apr 5, 2021

Currently, Parquet write APIs default to maximum-compatibility Parquet version "1.0", which disables some logical types such as UINT32. We may want to switch the default to "2.0" instead, to allow faithful representation of more types.

Reporter: Antoine Pitrou / @pitrou
Assignee: Raúl Cumplido / @raulcd


Note: This issue was originally created as ARROW-12203. Please see the migration documentation for further details.


Antoine Pitrou / @pitrou:
@wesm @emkornfield Thoughts?


Antoine Pitrou / @pitrou:
I'll note that ParquetVersion::PARQUET_2_0 guards several unrelated things:

  • TIME_MILLIS, TIMESTAMP_MILLIS and UINT32, which were added in Parquet format 2.1 (PARQUET-12, July 2014)

  • TIME_MICROS, TIMESTAMP_MICROS, which were added in Parquet format 2.3 (PARQUET-200, June 2015)

  • the NANOS unit for times and timestamps, which was added in Parquet format 2.5 (PARQUET-1387, August 2018)

  • RLE_DICTIONARY, which was added in Parquet format 1.0 (commit eb2f34ca775476ec9955aa88a8ac5c0583114f72, no associated JIRA, November 2013)

  • the value written in FileMetaData.version (1 or 2), which isn't described anywhere in the format spec (presumably version == 2 starting from Parquet format 2.0.0?)

So it's a mess. Some of those changes are very old, though. It seems we could enable all of them by default, except NANOS?

One possibility would be to enable them for "1.0" and keep NANOS in "2.0".
Another possibility would be to add a new "1.9" pseudo-version, enable them in "1.9", and make "1.9" the default.
Or we bite the bullet and make "2.0" the default, including all of the above.
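Purely for illustration (the labels and feature names below are hypothetical, mirroring the comment rather than any actual Arrow API), the three options differ only in which feature set each version label selects:

```python
# Hypothetical feature-gating table for the three options above.
ALL = {"millis+uint32 (fmt 2.1)", "micros (fmt 2.3)", "nanos (fmt 2.5)"}
NO_NANOS = ALL - {"nanos (fmt 2.5)"}

OPTIONS = {
    # Option 1: enable everything except NANOS already in "1.0".
    "enable-in-1.0": {"1.0": NO_NANOS, "2.0": ALL},
    # Option 2: add a "1.9" pseudo-version and make it the default.
    "pseudo-1.9": {"1.0": set(), "1.9": NO_NANOS, "2.0": ALL},
    # Option 3: bite the bullet and default to "2.0" with everything.
    "default-2.0": {"1.0": set(), "2.0": ALL},
}

def enabled(option: str, version: str) -> set:
    """Features a writer would be allowed to emit under a given option."""
    return OPTIONS[option][version]
```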


Joris Van den Bossche / @jorisvandenbossche:
@pitrou see also the mailing list discussion from December (with title "Should we default to write parquet format version 2.0? (not data page version 2.0)", https://mail-archives.apache.org/mod_mbox/arrow-dev/202012.mbox/%3CCALQtMBYqPPkE6RQiNDxXz7yOtnbqtQGH%2Bk%2B20ryomGtLE9EfVA%40mail.gmail.com%3E)

See also this overview of converted/logical types added in which versions: https://nbviewer.jupyter.org/gist/jorisvandenbossche/3cc9942eaffb53564df65395e5656702 (for types, not for encodings)

My conclusion in that email thread was also that NANOS might be problematic to enable by default already (I don't know what the status of this feature is in other implementations).

Another option could also be to have a version="2.4" which would e.g. enable the logical types for integers but not yet for nanoseconds (then it maps more or less to an actual parquet format version, instead of the pseudo "1.9").

the value written in FileMetaData.version (1 or 2), which isn't described anywhere in the format spec (presumably version == 2 starting from Parquet format 2.0.0?)

There is indeed no spec about this; there was some discussion about it on the "core features" PR: apache/parquet-format#164 (comment)


Antoine Pitrou / @pitrou:
Currently "2.0" will emit NANOS, and we don't want to change that. That's why I was proposing a "1.9" in-between.


Micah Kornfield / @emkornfield:
I think switches like this should follow a policy similar to API deprecation, i.e. something like: we put in the release notes that we intend to switch the default version (maybe for more than one release) and then make the change in some subsequent release. It might be worth an ML discussion on a policy around this.

Thank you for the analysis. Based upon it, I would suggest that instead of 1.9 we try to make this value correspond with format releases (introduce a 2.3 and a 2.5), if we don't think it will make the code too horrendous.


Antoine Pitrou / @pitrou:

Based upon it, I would suggest that instead of 1.9 we try to make this value correspond with format releases (introduce a 2.3 and a 2.5), if we don't think it will make the code too horrendous.

As I said above, we want "2.0" to still enable all features. But the additional version must be lower than "2.0", because it will enable only some of the features.


Joris Van den Bossche / @jorisvandenbossche:
The version="2.0" could be an alias for the "latest 2.x" version with all features. That's of course also confusing (since 2.0 is also an explicit version, and one which is lower than 2.4 or 2.5), but IMO less confusing than having a "1.9" which is 1) a non-existing version and 2) enabling 2.x features.


Antoine Pitrou / @pitrou:

The version="2.0" could be an alias of "latest 2.x" version with all features

That's already the case. What change are you suggesting?


Joris Van den Bossche / @jorisvandenbossche:
I was trying to argue that IMO we don't need to use a pseudo-version like "1.9", but can rather use the actual version like "2.4". Yes, "2.0" is lower than "2.4", but if we clearly say that "2.0" means the "latest 2.x", then I think that is fine.

But anyway, that's only a naming discussion, and both ways to name the version have pros and cons. The main discussion point is whether we need such an additional version number to have more fine-grained control over which features are used. If that makes it easier to make "1.9"/"2.4" the default, then I think that's a good idea.


Antoine Pitrou / @pitrou:
Having "2.0" mean something later than "2.4" sounds absolutely bonkers to me.


Antoine Pitrou / @pitrou:
I am fine with enabling everything current in "2.0", but then we shouldn't introduce a higher version number with less features enabled.

Also, I am not convinced we need more fine-grained feature selection. That's more control than most people want to have. My primary concern here is that people don't get a completely outdated feature set (no UINT32!) by default.


Micah Kornfield / @emkornfield:
Unless there is an agreement to bump the next format release to 3.0, I think this might be inevitable? I'm not actually sure what the cross-version guarantees for the Parquet format are at this point.


Antoine Pitrou / @pitrou:
@emkornfield I'm not sure what you say is inevitable?


Micah Kornfield / @emkornfield:

I am fine with enabling everything current in "2.0", but then we shouldn't introduce a higher version number with less features enabled.

Sorry, I misread this. I thought we wouldn't be able to introduce additional guarding flags for new features.


Antoine Pitrou / @pitrou:
It seems fastparquet doesn't support the RLE_DICTIONARY encoding (even though it's part of the Parquet 1.0 format version). Do we care enough about being compatible with fastparquet by default?


Antoine Pitrou / @pitrou:
Opened dask/fastparquet#583


Joris Van den Bossche / @jorisvandenbossche:
Regarding RLE_DICTIONARY: if enabled (so currently by specifying version="2.0"), is it actually used for many types? (I am not very familiar with the logic that decides which encoding is used while writing; I ask to get an idea of the impact of enabling it on compatibility with other readers like fastparquet.)

On another note, this is still tagged as 4.0, but it might not be the best feature to switch just before the release. It might be safer to switch directly after the 4.0 release, so we have some time to gather feedback (although that depends on how many people use the dev version, of course).


Antoine Pitrou / @pitrou:
RLE_DICTIONARY is used for dictionary indices of dictionary-encoded columns. That probably comes up in a lot of situations :-)


Antoine Pitrou / @pitrou:
The kind of misunderstanding that will keep happening until we finally bump the Parquet version: ARROW-13214 ;)


Jorge Leitão / @jorgecarleitao:
I am of the opinion that it is time to move on; version 2.0 by default (NANOS out for now).

For the data pages, I do not think there are that many differences between v1 and v2, right? It is mostly where the compression is applied and where the byte length of the def and rep levels is declared (in the page data or in the header).

So, in that context keeping data pages v1 by default seems ok.


Micah Kornfield / @emkornfield:
I'm OK moving the default logical types to 2.0 and keeping data page v1 for now. Let's do this after the current release and send an e-mail out to dev@ and user@ to notify people way ahead of time that this is a potentially breaking change.


Joris Van den Bossche / @jorisvandenbossche:
If we enable all available logical types with a default of version="2.0", what do we do if the next Parquet format release adds a new logical type? (e.g. if 2.10.0 adds a new Interval logical type) Maybe something we can defer until it happens, but unless we enable it directly, we would then still need to add a version="2.10" option or so to enable it on demand.

But +1 on Micah's proposal to notify the mailing list (and put it in the release notes for 5.0.0 maybe as well) that we plan to switch in the next release.


Antoine Pitrou / @pitrou:
I've updated the title now that ARROW-13794 has introduced PARQUET_2_4 and PARQUET_2_6 to better select enabled features. We may be able to switch to "2.4" by default as that format version was released in October 2017.


Micah Kornfield / @emkornfield:
Possibly. I need to double-check whether this would break things in my organization in the short term. At the very least, I would ask for an extension to 7.0.0 before changing the default.


Antoine Pitrou / @pitrou:
@emkornfield  Did you have the opportunity to inquire?


Micah Kornfield / @emkornfield:
Unfortunately not yet, I think if we could wait until after the 7.0 release that should finally be sufficient.


Joris Van den Bossche / @jorisvandenbossche:
Now the 7.0 release is out, shall we switch the default early in the 8.0 cycle?


Micah Kornfield / @emkornfield:
8.0 release is targeted in the April time frame?


Antoine Pitrou / @pitrou:
Presumably, yes.


Raúl Cumplido / @raulcd:
@jorisvandenbossche @pitrou is this something we still want to do before the release?


Antoine Pitrou / @pitrou:
@emkornfield would have to answer. Hopefully we can do it.


Micah Kornfield / @emkornfield:
CC [~willb_google]

This will still potentially cause issues with imports into BQ if unsigned types are used, I think. The project has generally been pretty patient, so I understand if there is a strong desire to move forward with it. Will can probably give a better timeline on when BQ would be able to handle the logical types.


Krisztian Szucs / @kszucs:
Postponing to 9.0


Antoine Pitrou / @pitrou:
Marking this as critical for 9.0 so that we finally do it.


Antoine Pitrou / @pitrou:
Issue resolved by pull request #13280
