This repository has been archived by the owner on May 17, 2024. It is now read-only.
add checksum offset to avoid bigint overflow #746
Merged
We use only a portion of the MD5 checksum, specifically 6 bytes per row, so that more rows can be accumulated before overflow. However, most databases zero-pad hex numbers that are shorter than 8 bytes. For example, 0xFF becomes 0x00000000000000FF. Since the most significant bit is always 0, the number is always positive. Consequently, there is a limit to how many values we can sum before overflowing the bigint type.
To address this limitation, we subtract an offset equal to half of the maximum checksum value (140737488355327 for a 6-byte checksum). This shifts the range of values from "0 to max number" to roughly "-max number/2" to "+max number/2". Because negative values now cancel out some of the positive ones, more rows can be summed before bigint overflow occurs.
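The arithmetic behind this can be sketched in Python. This is only an illustration under stated assumptions: the checksum is taken to be an unsigned 48-bit value, bigint is a signed 64-bit integer, and the helper name worst_case_rows is hypothetical, not part of this PR.

```python
# Illustrative arithmetic only; names here are assumptions, not code from this PR.
BIGINT_MAX = 2**63 - 1
BIGINT_MIN = -(2**63)

MAX_CHECKSUM = 2**48 - 1       # largest 6-byte value: 281474976710655
OFFSET = MAX_CHECKSUM // 2     # 140737488355327, the constant used in the SQL below


def worst_case_rows(lo: int, hi: int) -> int:
    """Largest row count whose sum cannot leave the bigint range,
    even if every row holds the most extreme value in [lo, hi]."""
    limits = []
    if hi > 0:
        limits.append(BIGINT_MAX // hi)
    if lo < 0:
        limits.append(abs(BIGINT_MIN) // abs(lo))
    return min(limits)


# Without the offset, every value lies in [0, MAX_CHECKSUM].
rows_without_offset = worst_case_rows(0, MAX_CHECKSUM)

# With the offset, values lie in [-OFFSET, MAX_CHECKSUM - OFFSET].
rows_with_offset = worst_case_rows(-OFFSET, MAX_CHECKSUM - OFFSET)

print(rows_without_offset, rows_with_offset)
```

The worst-case bound only about doubles. The larger gain comes from the typical case: if the checksums are spread roughly evenly over the range, the shifted values average close to zero, so positive and negative terms mostly cancel instead of accumulating.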
Smoke tests for some databases after update
Compute checksum from a string literal "hello"
reinterpretAsUInt128(reverse(unhex(lowerUTF8(substr(hex(MD5('hello')), 21))))) - 140737488355327
cast(cast(('0x' || substr(TO_HEX(md5("hello")), 21)) as int64) as numeric) - 140737488355327
BITAND(md5_number_lower64('hello'), 281474976710655) - 140737488355327
('x' || substring(md5('hello'), 21))::bit(48)::bigint - 140737488355327
convert(bigint, convert(varbinary, '0x' + RIGHT(CONVERT(NVARCHAR(32), HashBytes('MD5', 'hello'), 2), 12), 1)) - 140737488355327
conv(substring(md5('hello'), 21), 16, 10) - 140737488355327
to_number(substr(standard_hash('hello', 'MD5'), 21), 'xxxxxxxxxxxxxxx') - 140737488355327
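As a cross-check for the expressions above, the same value can be computed outside the database. The sketch below assumes each expression interprets the last 12 hex digits of the MD5 digest as an unsigned big-endian integer and then subtracts the offset; the function name checksum_48 is hypothetical.

```python
import hashlib

OFFSET = 140737488355327  # (2**48 - 1) // 2, as in the SQL expressions above


def checksum_48(text: str) -> int:
    """Last 6 bytes (12 hex digits) of the MD5 digest, shifted by OFFSET."""
    hex_digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    # SQL substr(..., 21) is 1-indexed, so it matches Python's [20:].
    low_48_bits = int(hex_digest[20:], 16)
    return low_48_bits - OFFSET


print(checksum_48("hello"))
```

Each database expression in the smoke tests should return the same number for 'hello' if it follows the assumption stated above.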