Add UUID logical type #337

asfimport · 2017-10-07T00:28:37Z

I think we should add a UUID logical type that is stored in a 16-byte fixed-length byte array. The common string representation is 36 bytes instead of the 16 required. UUIDs are commonly used as unique identifiers, so it makes sense to have good support for them. A binary representation will reduce memory use when writing or building bloom filters and will reduce the cycles needed to compare values.
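To make the size difference concrete, here is a minimal Python sketch (not part of the proposal itself) that compares the canonical string form of a UUID with its 16-byte binary form. It uses only the standard `uuid` module, and the example value is arbitrary.

```python
import uuid

# Arbitrary example value, chosen only for illustration.
u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")

text_form = str(u).encode("ascii")  # canonical hyphenated string form
binary_form = u.bytes               # 16-byte big-endian binary form

print(len(text_form))    # 36
print(len(binary_form))  # 16

# The binary form round-trips to the same UUID, so nothing is lost.
assert uuid.UUID(bytes=binary_form) == u
```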

Reporter: Ryan Blue / @rdblue
Assignee: Ryan Blue / @rdblue

Related issues:

Release Parquet format 2.4.0 (blocks)
Error: SYSTEM ERROR: RuntimeException: Unknown logical type <LogicalType UUID:UUIDType()> (relates to)
Drill Parquet UUID logical type (relates to)

PRs and other links:

PR #71

Note: This issue was originally created as PARQUET-1125. Please see the migration documentation for further details.

asfimport · 2017-10-09T15:57:04Z

Jim Apple / @jbapple:
Or maybe a 16-byte type, generally, not just for UUIDs.

asfimport · 2017-10-09T16:06:37Z

Ryan Blue / @rdblue:
I'm not sure I understand why we would want a more general 16-byte type. I think that INT96 was a similar idea, but that ended up being abused and never used to store big integers. Are you thinking about an INT128 type or something else? Hash digests?

Also, we can have more than one 16-byte logical type. I think a UUID type is a good idea so we have better storage for something we see a lot, and so object models can translate between String UUIDs and the storage representation transparently. If we wanted to do something similar for hash digests, then we would probably want a type with different expectations anyway.
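As a rough illustration of the transparent translation mentioned above, the sketch below converts between a string UUID and a 16-byte storage value. The helper names `to_storage` and `from_storage` are hypothetical and not taken from any Parquet object model, and the big-endian byte order is an assumption of this sketch.

```python
import uuid

UUID_WIDTH = 16  # bytes in the fixed-length storage value


def to_storage(text: str) -> bytes:
    """Convert a string UUID into its 16-byte storage form (hypothetical helper)."""
    return uuid.UUID(text).bytes  # big-endian byte order (assumed)


def from_storage(raw: bytes) -> str:
    """Convert a 16-byte storage value back into the canonical string."""
    if len(raw) != UUID_WIDTH:
        raise ValueError(f"expected {UUID_WIDTH} bytes, got {len(raw)}")
    return str(uuid.UUID(bytes=raw))


stored = to_storage("00112233-4455-6677-8899-aabbccddeeff")
print(stored.hex())          # 00112233445566778899aabbccddeeff
print(from_storage(stored))  # 00112233-4455-6677-8899-aabbccddeeff
```

With helpers like these, callers would only ever see string UUIDs while the file holds the compact 16-byte values.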

asfimport · 2017-10-09T16:12:32Z

Jim Apple / @jbapple:
What kind of expectations would a hash digest type have that would differ from those of a UUID type?

asfimport · 2017-10-09T17:11:46Z

Ryan Blue / @rdblue:
The simple answer (and my motivation) is that a UUID column would be represented in memory by java.util.UUID for some object models, which wouldn't be a good choice for a hash value. But the "different expectations" I'm referring to are related to how the values are used. UUIDs are used for row IDs, so engines could generate a new UUID when writing rows (like auto increment fields), add a bloom filter configured for 100% distinct values, or build some other feature based on UUID as a lookup column. Those wouldn't be appropriate for a digest type, where you may reasonably expect a lower percentage of distinct values and the values are used for validation rather than lookups.

asfimport · 2017-10-10T19:53:37Z

Ryan Blue / @rdblue:
Merged #71.

asfimport closed this as completed Oct 10, 2017

asfimport mentioned this issue Jun 23, 2024

Release Parquet format 2.4.0 #274

Closed
