Add UUID logical type #337

Closed
asfimport opened this issue Oct 7, 2017 · 5 comments

Comments


asfimport commented Oct 7, 2017

I think we should add a UUID logical type that is stored in a 16-byte fixed-length byte array. The common string representation takes 36 bytes instead of the 16 required. UUIDs are commonly used as unique identifiers, so it makes sense to have good support for them. A binary representation will reduce memory use when writing or building bloom filters and will reduce the cycles needed to compare values.
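To make the 36-versus-16-byte difference concrete, here is a minimal Python sketch using only the standard `uuid` module. The helper names `uuid_to_fixed16` and `fixed16_to_uuid` are hypothetical, the sample UUID is arbitrary, and the big-endian byte order is an assumption of this sketch rather than something stated in this thread.

```python
import uuid


def uuid_to_fixed16(text: str) -> bytes:
    """Parse the canonical 36-character form into 16 bytes (big-endian)."""
    return uuid.UUID(text).bytes


def fixed16_to_uuid(raw: bytes) -> str:
    """Turn a 16-byte value back into the canonical hyphenated string."""
    if len(raw) != 16:
        raise ValueError("expected exactly 16 bytes, got %d" % len(raw))
    return str(uuid.UUID(bytes=raw))


# Arbitrary sample value, used only for illustration.
example = "123e4567-e89b-12d3-a456-426614174000"
packed = uuid_to_fixed16(example)

# The string form is 36 characters; the binary form is 16 bytes.
print(len(example), len(packed))

# The conversion round-trips without loss.
print(fixed16_to_uuid(packed) == example)
```

Comparing two packed values is a 16-byte comparison, whereas comparing the string forms has to walk up to 36 characters.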

Reporter: Ryan Blue / @rdblue
Assignee: Ryan Blue / @rdblue

Note: This issue was originally created as PARQUET-1125. Please see the migration documentation for further details.


Jim Apple / @jbapple:
Or maybe a 16-byte type, generally, not just for UUIDs.


Ryan Blue / @rdblue:
I'm not sure I understand why we would want a more general 16-byte type. I think that INT96 was a similar idea, but it ended up being misused and was never actually used to store big integers. Are you thinking about an INT128 type or something else? Hash digests?

Also, we can have more than one 16-byte logical type. I think a UUID type is a good idea so we have better storage for something we see a lot, and so object models can translate between String UUIDs and the storage representation transparently. If we wanted to do something similar for hash digests, then we would probably want a type with different expectations anyway.


Jim Apple / @jbapple:
What kind of expectations would a hash digest type have that would differ from those of a UUID type?


Ryan Blue / @rdblue:
The simple answer (and my motivation) is that a UUID column would be represented in memory by java.util.UUID for some object models, which wouldn't be a good choice for a hash value. But the "different expectations" I'm referring to are related to how the values are used. UUIDs are used for row IDs, so engines could generate a new UUID when writing rows (like auto increment fields), add a bloom filter configured for 100% distinct values, or build some other feature based on UUID as a lookup column. Those wouldn't be appropriate for a digest type, where you may reasonably expect a lower percentage of distinct values and the values are used for validation rather than lookups.
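As a rough illustration of why the expected share of distinct values matters for bloom filter configuration, the sketch below sizes a classic Bloom filter with the textbook formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. This is not Parquet's bloom filter implementation; the function name `bloom_parameters`, the row count, the 10% distinct fraction for the digest column, and the 1% false-positive target are all assumptions chosen for the example.

```python
import math


def bloom_parameters(row_count: int, distinct_fraction: float, fpp: float):
    """Return (bits, hash_count) for a classic Bloom filter.

    row_count * distinct_fraction approximates the number of distinct
    values the filter has to hold; fpp is the target false-positive rate.
    """
    n = max(1, math.ceil(row_count * distinct_fraction))
    bits = math.ceil(-n * math.log(fpp) / (math.log(2) ** 2))
    hash_count = max(1, round(bits / n * math.log(2)))
    return bits, hash_count


rows = 1_000_000

# UUID row IDs: every row is assumed to carry a distinct value.
uuid_bits, uuid_hashes = bloom_parameters(rows, 1.0, 0.01)

# Hypothetical digest column where only 10% of the values are distinct.
digest_bits, digest_hashes = bloom_parameters(rows, 0.10, 0.01)

print(uuid_bits, uuid_hashes)
print(digest_bits, digest_hashes)
```

Under these assumptions the UUID column needs roughly ten times as many bits as the digest column, which is the kind of type-specific default an engine could choose once it knows a column holds UUIDs.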


Ryan Blue / @rdblue:
Merged #71.
