RFC: UUID 2.0 data types #63179

Closed
rschu1ze opened this issue Apr 30, 2024 · 4 comments

@rschu1ze
Member

rschu1ze commented Apr 30, 2024

ClickHouse supports UUIDs through a UUID data type and various utility functions to generate and convert UUIDs.

It has been noted that UUIDs in ClickHouse have no intuitive sort order; instead, they are sorted by their right half. This makes UUIDs unsuitable (even dangerous) as sorting keys, primary index keys, or partition keys. The reason for this behavior is historical: they are internally represented as a UInt128 (2 x 64 bit) composite integer (code), with the halves in big endian order (code).
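
To illustrate the ordering problem, here is a minimal Python sketch. It is a simplified model and not ClickHouse's actual comparison code: it assumes that "sorted by their right half" means the last 8 bytes are compared first and the first 8 bytes only break ties. The function names are made up for this example.

```python
import uuid


def bytewise_key(u: uuid.UUID) -> bytes:
    # Intuitive order: compare all 16 bytes from left to right.
    return u.bytes


def right_half_first_key(u: uuid.UUID) -> tuple:
    # Simplified model of the current behavior (an assumption):
    # the right 8 bytes dominate, the left 8 bytes only break ties.
    b = u.bytes
    return (b[8:], b[:8])


ids = [
    uuid.UUID("00000000-0000-0000-0000-000000000003"),
    uuid.UUID("00000001-0000-0000-0000-000000000002"),
    uuid.UUID("00000002-0000-0000-0000-000000000001"),
]

sorted_bytewise = sorted(ids, key=bytewise_key)
sorted_right_half = sorted(ids, key=right_half_first_key)

for u in sorted_right_half:
    print(u)
```

With these three values, the byte-wise order matches the textual order, while the right-half-first order is the exact reverse.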

The current UUID type also has the disadvantage that it treats all UUID versions equally (v1-v5 are standardized, v6-v8 are being standardized). This was okay in the past, when ClickHouse only supported UUID version 4, but it makes things difficult once we support version 7. More specifically,

  1. UUID-version-specific functions like UUIDv7ToDateTime cannot assume that the input is really in version 7 format.
  2. Generic UUID functions like empty (docs) may have different semantics for version 4 and version 7 UUIDs but there is currently no way to support that.
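
As a rough illustration of point 1, the Python sketch below shows the kind of check a version-specific function would need when the type itself does not guarantee the version. It relies on the standard UUID layout (version in the high 4 bits of byte 6, and for version 7 a 48-bit big-endian Unix timestamp in milliseconds in the first 6 bytes). The function names are hypothetical, and this is not how ClickHouse implements UUIDv7ToDateTime.

```python
import uuid
from datetime import datetime, timezone


def uuid_version(u: uuid.UUID) -> int:
    # The version number lives in the high 4 bits of byte 6.
    return u.bytes[6] >> 4


def uuid_v7_to_datetime(u: uuid.UUID) -> datetime:
    # With a single generic UUID type, the version must be checked at runtime.
    if uuid_version(u) != 7:
        raise ValueError(f"expected a version 7 UUID, got version {uuid_version(u)}")
    # The first 48 bits hold milliseconds since the Unix epoch.
    millis = int.from_bytes(u.bytes[:6], "big")
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


def make_example_v7(millis: int) -> uuid.UUID:
    # Test helper (hypothetical): a version 7 UUID with all random bits set to zero.
    raw = millis.to_bytes(6, "big") + bytes([0x70, 0x00, 0x80]) + bytes(7)
    return uuid.UUID(bytes=raw)
```

With a dedicated type per version, or a single type with well-defined semantics, a check like the one in `uuid_v7_to_datetime` could move from every call site into the type system.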

These problems can only be addressed with a new UUID implementation:

  • Starting with UUID4 and UUID7, the new implementation will have separate UUID data types for each UUID version. This removes the existing type ambiguities.
  • To fix the sort order, the internal implementation will be similar to FixedString(16), i.e. a consecutive 16 byte field without a notion of "halves".
  • The existing UUID type will continue to be supported so that existing use cases do not break. A new (server or session?) setting enable_new_uuid_types (or something like that) is introduced, which controls whether generateUUIDv4 and generateUUIDv7 return data in the new or the old type. Similarly, we need to check for every UUID-related function how the new setting affects its behavior.
@UnamedRus
Contributor

There is also a problem with the "awful" hash function for UUID in the uniq aggregate function: #34425 (comment)

@den-crane
Contributor

The only concern is that ClickHouse is a DWH database: users may want to store UUIDs from different sources in the same column, and they may not know which versions their UUIDs are.
Of course, in this case they can use the old UUID.
But I would consider implementing a single new UUID type and naming it nUUID or xUUID or ...

@UnamedRus
Contributor

I think new ClickHouse installations should use the "new" variant.
On older installations, users would need to explicitly enable this setting.

@alexey-milovidov
Member

alexey-milovidov commented May 1, 2024

Original proposal:

Often we store UUIDs from external systems, and they could be both v4 and v7. There is already a problem that sorting is not as expected. I think the only way will be to introduce data types UUID1, UUID2 (or with better names) and when a user writes UUID in the table definition or while casting, it will be persisted as either UUID1 or UUID2, depending on a setting, and then we will enable UUID2 by default.

@rschu1ze for some reason your proposal is entirely different from the original.

Closing this because it does not make sense to have different data types for UUIDv4 and UUIDv7.
It should be a single data type for UUID, but without these abominations.
