Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ULID implementation with FixedString(16)/UInt128 instead of FixedString(26) #64401

Open
chriscomeau79 opened this issue May 25, 2024 · 1 comment
Labels

Comments

@chriscomeau79
Copy link

chriscomeau79 commented May 25, 2024

Use case

Optimization: storing ULIDs in 128 bits instead of the 208 used with FixedString(26).

Describe the solution you'd like

Where generateULID() currently returns a FixedString(26), there could be an equivalent that uses FixedString(16).
https://clickhouse.com/docs/en/sql-reference/functions/ulid-functions

Some functions similar to:

https://clickhouse.com/docs/en/sql-reference/functions/uuid-functions#uuidnumtostring

https://clickhouse.com/docs/en/sql-reference/functions/uuid-functions#uuidtonum

toUInt128 for ULIDs similar to this: #51765

Additional context
ULID spec and note here on potential overflow with 26 base-32 characters (130 bits). We can't hit the 130 bit overflow problem if we're only using 128.
https://github.com/ulid/spec?tab=readme-ov-file#overflow-errors-when-parsing-base32-strings

There should be lots of useful example code in python-ulid

# pip install python-ulid
from ulid import ULID
max_uint128 = 2**128-1
print(max_uint128)
# 340282366920938463463374607431768211455
print(len(str(max_uint128)))
# 39
print(ULID.from_int(max_uint128))
# ULID(7ZZZZZZZZZZZZZZZZZZZZZZZZZ)
print(ULID.from_int(0))
# ULID(00000000000000000000000000)
@chriscomeau79 chriscomeau79 changed the title ULID implementation with FixedString(16) instead of FixedString(26), to/from UInt128 ULID implementation with FixedString(16)/UInt128 instead of FixedString(26) May 25, 2024
@chriscomeau79
Copy link
Author

chriscomeau79 commented May 26, 2024

A few more notes - I think there was a deleted comment but I'll leave the reply:

The FixedString(26) version of the ULID still compresses reasonably well, since the extra bits are all zeroes. This optimization to get it down from 26 bytes to 16 is more about uncompressed memory usage and other benefits like fitting in 128 bits for memory alignment, as well as being able to go to/from UInt128 directly.

Another way to put it, it would be nice to be able to take any UInt128 and get a ULID string from it, which should be possible.

The ULID generation spec, where it's the 48-bit timestamp and 80 bits of randomness, happens to be a good way to generate well-behaved UInt128 equivalents which compress well. That's what I was trying to get at with the example here, showing how python-ulid can accept the max UInt128 and give this result. ClickHouse can do the same thing with something like UInt128ToULIDString, ULIDStringToUInt128.

Round trip examples:

select UInt128ToULIDString(340282366920938463463374607431768211455);
-- '7ZZZZZZZZZZZZZZZZZZZZZZZZZ'

select ULIDStringToUInt128('7ZZZZZZZZZZZZZZZZZZZZZZZZZ');
-- 340282366920938463463374607431768211455

select UInt128ToULIDString(0);
-- '00000000000000000000000000'

select ULIDStringToUInt128('00000000000000000000000000');
-- 0

The process of generating ULIDs would still behave the same as in the spec, so we get those benefits with locality and compression. The internal representation would just be those 128 bits with the behaviour built in to display as a ULID string.

select generateULID();
-- '01HYSRYCB8G8DF3DT60C0K9GX1'

select generateULID();
-- '01HYSRYCB8PNC1T5DA7JRRR66V'

If it's useful, could generalize this to do the same thing with UInt256 and strings that are twice as long but otherwise follow the ULID convention. It's a nice way to deal with such long numbers. Hypothetically the same style of generation would work too, with 48 bits timestamp + 208 bits random, but it's hard to think of a scenario where that would be needed. Could call that a ULID256.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant