This repository has been archived by the owner on May 17, 2024. It is now read-only.

add checksum offset to avoid bigint overflow #746

Merged: 6 commits, Oct 18, 2023

Conversation

Contributor

@vvkh vvkh commented Oct 17, 2023

We use only part of the MD5 checksum, 6 bytes (48 bits) per row, so that more rows can be summed before the total overflows. However, most databases zero-pad a hex number that is shorter than 8 bytes. For example, 0xFF becomes 0x00000000000000FF.

Because the most significant bit of the padded value is always 0, every per-row value is non-negative. The running sum can therefore only grow, which limits how many values we can add before overflowing the bigint type.
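
As a rough illustration of that limit, the arithmetic below estimates how many 48-bit values fit in a signed 64-bit sum. It is a back-of-the-envelope sketch, not code from this PR: the names `INT64_MAX`, `MAX_CHECKSUM`, `worst_case_rows`, and `average_case_rows` are made up for the example, and the "average" figure assumes per-row values are spread roughly uniformly over the 48-bit range.

```python
# Assumption: bigint is a signed 64-bit integer, as in most SQL databases.
INT64_MAX = 2**63 - 1

# Largest value a 6-byte (48-bit) slice of the MD5 digest can take.
MAX_CHECKSUM = 2**48 - 1

# Worst case: every row contributes the maximum value.
worst_case_rows = INT64_MAX // MAX_CHECKSUM

# Rough average case: rows contribute about half the maximum each.
average_case_rows = INT64_MAX // (MAX_CHECKSUM // 2)

print(worst_case_rows)    # 32768
print(average_case_rows)  # 65536
```

Under these assumptions, a segment of only a few tens of thousands of rows can already push a sum of non-negative values past the bigint range.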

To address this limitation, we subtract an offset equal to half of the maximum checksum value (140737488355327 for 6 bytes). This shifts the range of per-row values from "0 to max number" to roughly "-max number/2" to "+max number/2". Negative values now cancel some of the positive ones, so more rows can be summed without bigint overflow.
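
The effect can be sketched in Python by summing the 48-bit MD5 slice for many synthetic rows, with and without the offset. This is only an illustration: `row_checksum` and the generated row values are hypothetical, Python integers never overflow, so the sums are compared against a signed 64-bit limit by hand, and the real computation happens inside each database.

```python
import hashlib

INT64_MAX = 2**63 - 1           # assumed bigint upper bound
MAX_CHECKSUM = 2**48 - 1        # largest 6-byte value
OFFSET = MAX_CHECKSUM // 2      # 140737488355327, the constant used in this PR


def row_checksum(value: str) -> int:
    """Return the last 12 hex digits (6 bytes) of the MD5 digest as an integer."""
    return int(hashlib.md5(value.encode()).hexdigest()[20:], 16)


# Hypothetical row contents: the strings "0", "1", ..., "99999".
rows = [str(i) for i in range(100_000)]

plain_sum = sum(row_checksum(r) for r in rows)
offset_sum = sum(row_checksum(r) - OFFSET for r in rows)

print("plain sum fits in bigint: ", plain_sum <= INT64_MAX)
print("offset sum fits in bigint:", abs(offset_sum) <= INT64_MAX)
```

Without the offset, the sum grows linearly with the row count. With it, the terms are centered near zero, so the total tends to stay far smaller for the same number of rows.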

Smoke tests for some databases after update
Compute checksum from a string literal "hello"

| db | expr | checksum |
| --- | --- | --- |
| clickhouse | `reinterpretAsUInt128(reverse(unhex(lowerUTF8(substr(hex(MD5('hello')), 21))))) - 140737488355327` | 32508877456787 |
| bigquery | `cast(cast(('0x' \|\| substr(TO_HEX(md5("hello")), 21)) as int64) as numeric) - 140737488355327` | 32508877456787 |
| snowflake | `BITAND(md5_number_lower64('hello'), 281474976710655) - 140737488355327` | 32508877456787 |
| postgres | `('x' \|\| substring(md5('hello'), 21))::bit(48)::bigint - 140737488355327` | 32508877456787 |
| mssql | `convert(bigint, convert(varbinary, '0x' + RIGHT(CONVERT(NVARCHAR(32), HashBytes('MD5', 'hello'), 2), 12), 1)) - 140737488355327` | 32508877456787 |
| mysql | `conv(substring(md5('hello'), 21), 16, 10) - 140737488355327` | 32508877456787 |
| oracle | `to_number(substr(standard_hash('hello', 'MD5'), 21), 'xxxxxxxxxxxxxxx') - 140737488355327` | 32508877456787 |
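
The same value can be reproduced outside a database as a cross-check. The snippet below is a minimal Python sketch that mirrors the SQL expressions (it is not part of the PR): it takes the MD5 hex digest from the 21st character onward, parses it as hexadecimal, and subtracts the offset. The variable names are illustrative.

```python
import hashlib

OFFSET = 140737488355327  # half of the maximum 6-byte value, as in the expressions above

digest = hashlib.md5(b"hello").hexdigest()  # 32 hex characters
tail = digest[20:]                          # SQL substr(..., 21) is 1-indexed: last 12 hex digits
checksum = int(tail, 16) - OFFSET

print(checksum)  # 32508877456787
```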

Review threads (all outdated and resolved): data_diff/databases/bigquery.py, data_diff/databases/clickhouse.py, data_diff/databases/duckdb.py, data_diff/databases/oracle.py
@vvkh vvkh merged commit f080ce7 into master Oct 18, 2023
6 checks passed
@vvkh vvkh deleted the fix-checksum-padding branch October 18, 2023 16:04